For a particular type of analysis, I have a large set of *.tar.gz files. While they are relatively modest in size (100MB-1GB), they are full of roughly 1kB files. In testing, the overhead of extracting hundreds of thousands of files to disk is far more expensive than the processing I need to do on the data itself.
Is there a way of directly processing the data inside the tar file, without having to extract it first?
import tarfile
import tempfile
import stat
import os

tar_file = "input.tar.gz"
tar = tarfile.open(tar_file, "r:gz")
members = tar.getmembers()

# Permission bits to OR into each member's mode (owner only)
file_permission = stat.S_IRUSR | stat.S_IWUSR
folder_permission = stat.S_IRUSR | stat.S_IWUSR | stat.S_IXUSR

# Let's pretend we want to edit, and write to new tar
if len(members) > 0:
    fd, tmp_tar = tempfile.mkstemp(prefix=("%s.fixed." % tar_file))
    os.close(fd)
    fixed_tar = tarfile.open(tmp_tar, "w:gz")

    # Then process members
    for member in members:
        # add u+rwx for directories
        if member.isdir() and not member.issym():
            member.mode = folder_permission | member.mode
            extracted = tar.extractfile(member)  # None for directories
            fixed_tar.addfile(member, extracted)
        # add u+rw for plain files
        elif member.isfile() and not member.issym():
            member.mode = file_permission | member.mode
            extracted = tar.extractfile(member)  # file-like object, read in memory
            fixed_tar.addfile(member, extracted)
        else:
            fixed_tar.addfile(member)

    fixed_tar.close()
    tar.close()

    # Rename the fixed tar to be the old name
    os.rename(tmp_tar, tar_file)
else:
    tar.close()
That example is from the original Singularity source code, and there are other examples too:
One approach would be to extract the files to a temporary location in memory. Given the sizes you are working with, the RAM disk that is mounted by default at /dev/shm should have enough space, as long as you clean up each set of files after it is analyzed and before extracting the next set.
/dev/shm has been a standard feature of Linux installs for at least a decade (I don't have an exact introduction date, but I have seen references as early as 2006). Unless it was specifically removed for some reason, I would expect it on any modern system.
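As a rough sketch of the RAM-disk approach, the Python below extracts one archive into a temporary directory under /dev/shm, runs a caller-supplied function on every extracted file, and removes the directory again before returning. The names `process_in_ramdisk`, `handle`, and `scratch_root` are made up for illustration. The fallback to the default temporary directory, and the use of the `"data"` extraction filter when the installed Python offers it, are my own assumptions rather than anything the approach requires.

```python
import os
import tarfile
import tempfile


def process_in_ramdisk(archive_path, handle, scratch_root="/dev/shm"):
    """Extract archive_path into a scratch directory, call handle(path)
    on every extracted file, and return the list of results.

    `handle` and `scratch_root` are illustrative names, not a fixed API.
    """
    # Assumption: fall back to the default temp dir if the RAM disk is unusable.
    if not (os.path.isdir(scratch_root) and os.access(scratch_root, os.W_OK)):
        scratch_root = None

    # Newer Python versions provide extraction filters; use "data" when present.
    extract_kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}

    results = []
    # TemporaryDirectory deletes everything when the block exits, so the
    # RAM disk is freed before the next archive is extracted.
    with tempfile.TemporaryDirectory(dir=scratch_root) as workdir:
        with tarfile.open(archive_path, "r:gz") as tar:
            tar.extractall(workdir, **extract_kwargs)
        for dirpath, _dirnames, filenames in os.walk(workdir):
            for filename in filenames:
                results.append(handle(os.path.join(dirpath, filename)))
    return results
```

Calling `process_in_ramdisk("input.tar.gz", os.path.getsize)` would, for example, return the size of every file in the archive. Any real analysis function that takes a path can be passed in the same way.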
Rather than extracting the files, you could do your analysis in a language that can read tar files directly. For example, Python has a tarfile module with a streaming mode, which lets you walk through the archive and process each file without it ever touching the disk.
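A minimal sketch of what that could look like, assuming gzip-compressed archives and files small enough to read fully into memory (true for roughly 1kB files). The function names `iter_file_contents` and `total_payload_bytes` are hypothetical. The `"r|gz"` mode string is the tarfile module's sequential stream mode.

```python
import tarfile


def iter_file_contents(archive_path):
    """Yield (member_name, data_bytes) for each regular file in the archive,
    reading it sequentially and never writing anything to disk."""
    # "r|gz" reads the gzip stream front to back without seeking.
    with tarfile.open(archive_path, mode="r|gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, symlinks, devices, ...
            # In stream mode each member must be read before moving on.
            fileobj = tar.extractfile(member)
            yield member.name, fileobj.read()


def total_payload_bytes(archive_path):
    """Toy analysis: sum the sizes of all regular files in the archive."""
    return sum(len(data) for _name, data in iter_file_contents(archive_path))
```

Replacing the body of `total_payload_bytes` with the real analysis keeps everything in memory. Because stream mode cannot seek backwards, each file has to be handled in the order it appears in the archive.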