Depending on who wrote the programs you’re using and how they access the files, there are a few answers.
There are many problems with using the Zillions Of Tiny files (ZOT) approach for storing data: the hassle of keeping them all separate, the insane overhead of opening, reading, then closing a zillion files, the overhead of writing/reading ASCII characters when a binary representation is a lot more compact, the wear and tear and blocking that you subject a shared system to when you run a 3TB ZOTfile analysis, etc.
If you (or someone who can be bribed or coerced) can change the code, I’d suggest using a binary format such as HDF5, which is an extremely compact data structure that is essentially a hierarchical file of files (hence the name). It takes care of all kinds of problems with ZOTfiles and also provides mechanisms for doing all kinds of useful things while still maintaining POSIX compatibility. Google ‘HDF5’ for the full dump. Also notable is that it is compatible with the NCO utilities (by Charlie Zender), which were developed for the netCDF format, but netCDF4 has now essentially merged with HDF5. The NCO utilities allow for extremely sophisticated processing of HDF5 files in streaming mode. R, MATLAB, many bioinformatics apps, and most mature analytical apps now support HDF5 as a native format.
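For instance, once your data is in netCDF4/HDF5 form, the NCO tools let you inspect, slice, and aggregate it from the command line. A minimal sketch, assuming NCO is installed (the file and variable names here are made up for illustration):

  ncks -m results.nc4                         # dump the metadata / structure of the file
  ncks -v temperature results.nc4 temps.nc4   # extract just the 'temperature' variable
  ncra run_*.nc4 mean.nc4                     # average the record dimension across many runs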
Even simply concatenating your files into a few much larger files will help. If the file names are significant, add each file name as a separator line behind a ‘#’ or other comment character. You said that the files have to be selected by folder - if you tar all the files in a dir into a single file, you’ve already compacted things a significant amount (unless you have 1 file per dir). A sketch of both approaches is below.
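A minimal sketch of both, assuming a directory of text files called ‘data_dir’ (the names are placeholders for your own):

  # concatenate, keeping each file name as a '#' comment line:
  for f in data_dir/*.txt; do
      printf '# %s\n' "$f"    # file name as the data separator
      cat "$f"
  done > combined.txt

  # or pack the whole dir into a single tar file:
  tar -cf data_dir.tar data_dir/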
If you must keep your files as individual files, store them as compressed tar files (essentially concatenations) and move them around as such. Then when you need to analyze them, decompress them in a pipe and extract the required data while doing so; if you can then analyze the data in the same pipe, you’ve saved yourself a huge amount of IO - every time data hits a spinning disk, you’ve slowed processing by a factor of roughly 1K-1M for that step. This assumes you know how to use the pipe char ‘|’.
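Something like this, where the grep/awk stage stands in for whatever your real extraction and analysis is (GNU tar’s -O flag sends the extracted file contents to stdout):

  tar -xzOf data.tar.gz | grep 'pattern' | awk '{sum += $2} END {print sum}'

Nothing in that pipeline touches the disk after the initial read of the tarball.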
If the individual files MUST be stored temporarily on a disk-like device, two options are especially useful for ZOTfiles: an in-memory filesystem (/dev/shm, which by default allows up to half of installed RAM to be used as a temp filesystem) or an SSD (preferably an NVMe SSD, which sidesteps the SATA/SAS overhead). ZOTfiles on spinning disks are a bad idea for many, many reasons.
So vomit the tarball contents to /dev/shm/tmp/you or /nvme/tmp/you (or whatever your sysadmin has decreed), process them from there, and then IMMEDIATELY clean up after yourself. This can easily be done in your scheduler script, along these lines:
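A minimal sketch of that scheduler script, assuming /dev/shm and a made-up ‘analyze’ program (substitute your actual paths and tools):

  SCRATCH=/dev/shm/tmp/$USER
  mkdir -p "$SCRATCH"
  trap 'rm -rf "$SCRATCH"' EXIT          # guarantee cleanup even if the job dies
  tar -xzf data.tar.gz -C "$SCRATCH"     # unpack the ZOTfiles into RAM
  ./analyze "$SCRATCH"/data_dir          # hypothetical analysis step
  rm -rf "$SCRATCH"                      # IMMEDIATELY clean up after yourself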
hjm