Is there a way to process archived files without fully extracting them to disk?

jkingsley · July 6, 2018, 3:20pm

For a particular type of analysis, I have a large set of *.tar.gz files. While they are relatively modest in size (100MB-1GB), they are full of roughly 1kB files. In testing, the overhead of extracting hundreds of thousands of files to disk is far more expensive than the processing I need to do on the data itself.

Is there a way of directly processing the data inside the tar file, without having to extract it first?

vsoch · November 14, 2018, 1:36am

If you use Python (or your language of choice), there is a core module called tarfile that can do wonders. It lets you:

read a tarfile into memory
edit members in place and write them back to memory (and then update the file), or write them to a new archive.

For example, I just wrote up this little snippet https://gist.github.com/vsoch/5ef7cc7974d3bf94a83c16fbc8cda9a8 to read a .tar.gz into memory, check permissions, and change if necessary. I’ll also include it here:

import tarfile
import tempfile
import stat
import os

tar_file = "input.tar.gz"
tar = tarfile.open(tar_file, "r:gz")
members = tar.getmembers()

file_permission = stat.S_IRUSR | stat.S_IWUSR
folder_permission = stat.S_IRUSR | stat.S_IWUSR | stat.S_IXUSR

# Let's pretend we want to edit, and write to new tar
if len(members) > 0:
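    # Create the temp file next to the original so os.rename stays on one filesystem
    fd, tmp_tar = tempfile.mkstemp(
        prefix="%s.fixed." % os.path.basename(tar_file),
        dir=os.path.dirname(os.path.abspath(tar_file)),
    )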
    os.close(fd)
    fixed_tar = tarfile.open(tmp_tar, "w:gz")

    # Then process members
    for member in members:

        # add u+rwx for directories (the S_I*USR flags are the owner bits)
        if member.isdir() and not member.issym():
            member.mode = folder_permission | member.mode
            extracted = tar.extractfile(member)
            fixed_tar.addfile(member, extracted)

        # add u+rw for plain files
        elif member.isfile() and not member.issym():
            member.mode = file_permission | member.mode
            extracted = tar.extractfile(member)
            fixed_tar.addfile(member, extracted)
        else:
            fixed_tar.addfile(member)
            
    fixed_tar.close()
    tar.close()

    # Rename the fixed tar to be the old name
    os.rename(tmp_tar, tar_file)
else:
    tar.close()

That example is from the original Singularity source code, and there are other examples to:

If you have a specific need or example I’d be happy to help! We can also try outside of Python.

jkingsley · July 6, 2018, 3:21pm

One approach would be to extract the files to a temporary location in memory. Given the sizes you are working with, the RAM disk mounted by default at /dev/shm should have enough space, as long as you clean up these files after they are analyzed and before extracting the next set.
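A minimal sketch of that extract, analyze, and clean up cycle, assuming the archive is trusted and fits in the scratch space. The names process_in_scratch, analyze, and scratch_root are made up for illustration; analyze is whatever per-file function you already have.

```python
import os
import shutil
import tarfile
import tempfile


def process_in_scratch(tar_path, analyze, scratch_root="/dev/shm"):
    """Extract one archive under scratch_root, analyze each file, then clean up."""
    workdir = tempfile.mkdtemp(prefix="tar_scratch_", dir=scratch_root)
    try:
        with tarfile.open(tar_path, "r:gz") as tar:
            # Only do this with archives you trust: member paths are not
            # sanitized in this sketch.
            tar.extractall(workdir)
        results = {}
        for root, _dirs, files in os.walk(workdir):
            for name in files:
                path = os.path.join(root, name)
                # Key results by the path relative to the archive root.
                results[os.path.relpath(path, workdir)] = analyze(path)
        return results
    finally:
        # Free the memory-backed space before the next archive is extracted.
        shutil.rmtree(workdir, ignore_errors=True)
```

Calling process_in_scratch("input.tar.gz", os.path.getsize) would return a dict of file sizes, and the scratch directory is removed even if analyze raises.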

jpessin1 · July 24, 2018, 1:07am

@jkingsley /dev/shm seems like a system-specific detail. Is there a more generic name for this?

jkingsley · July 24, 2018, 6:11pm

/dev/shm has been a standard feature of Linux installs for at least a decade (I don’t have an exact introduction date, but I have seen references as early as 2006). Unless it was specifically removed for some reason, I would expect it on any modern system.

jpessin1 · July 24, 2018, 9:55pm

@jkingsley I think this https://superuser.com/questions/45342/when-should-i-use-dev-shm-and-when-should-i-use-tmp covers some of that … (Linux kernel 2.6) … most common Linux distros do have it on by default, but it is an optional config.
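Since it is optional, a script can check for it rather than assume it. A small sketch, where pick_scratch_root is a made-up helper name; note it only checks that the directory exists and is writable, not that it is actually memory-backed.

```python
import os
import tempfile


def pick_scratch_root(candidate="/dev/shm"):
    """Return candidate if it is a usable directory, else the default temp dir."""
    if os.path.isdir(candidate) and os.access(candidate, os.W_OK | os.X_OK):
        return candidate
    # Fallback: this may be disk-backed, so it can be slower but still works.
    return tempfile.gettempdir()
```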

jpessin1 · May 31, 2019, 8:27pm

There is also the tar pipe: have tar write the extracted contents to standard output and pipe them to a filtering tool like grep or sed.

tar -xzOf tarfile.tar.gz | sed -n '/regex/p'
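If the filtering needs more than grep or sed, the same pipe can be driven from Python. This is only a sketch: grep_tar_stream is a made-up name, it assumes a tar binary on the PATH that accepts -x, -z, -O, and -f, and it assumes the members are text.

```python
import re
import subprocess


def grep_tar_stream(tar_path, pattern):
    """Yield lines from the archive's members that match pattern, via tar -O."""
    regex = re.compile(pattern)
    # -x extract, -z gunzip, -O write member contents to stdout, -f archive path
    proc = subprocess.Popen(["tar", "-xzOf", tar_path], stdout=subprocess.PIPE)
    try:
        for raw in proc.stdout:
            line = raw.decode("utf-8", errors="replace").rstrip("\n")
            if regex.search(line):
                yield line
    finally:
        proc.stdout.close()
        proc.wait()
```

As with the shell version, the contents of all members arrive as one concatenated stream, so you lose track of which file a line came from, and a file without a trailing newline runs into the next one.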

jkingsley · July 6, 2018, 3:23pm

Rather than extracting the files, you could consider doing your analysis in a language that can read tar files directly. For example, Python has a tarfile module with a streaming mode. This lets you go through and process your files without them ever reaching disk.
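A minimal sketch of the streaming approach, assuming each member is small enough to read into memory in one go (true for the roughly 1kB files described above). stream_members and total_payload_bytes are made-up names, and summing sizes is just a stand-in for the real analysis.

```python
import tarfile


def stream_members(tar_path):
    """Yield (name, data) for each regular file, reading the archive in order."""
    # "r|gz" opens the gzip-compressed archive as a forward-only stream,
    # so each member has to be read before moving on to the next one.
    with tarfile.open(tar_path, mode="r|gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            fileobj = tar.extractfile(member)
            yield member.name, fileobj.read()


def total_payload_bytes(tar_path):
    # Stand-in analysis: add up the size of every member's contents.
    return sum(len(data) for _name, data in stream_members(tar_path))
```

Nothing is written to disk here: each member's bytes are handed straight to the caller, so the per-file cost is a read from the decompressed stream rather than a file creation.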