
I have 44 TB worth of gzipped files that are 1.5 GB each when compressed or 2.0 GB when uncompressed. I have a tool that can read only uncompressed files. I would like to avoid the overhead of uncompressing the entire file and writing the result to disk, since I might only need access to a small part of the file. The tool cannot read gzip-compressed data on the fly.

Is there a way to create a (read-only) file-like object that has all the features of a file from the application point of view, but rather than storing any data to disk, calculates data on-the-fly (possibly caching in memory)? I could try a named pipe, but this doesn't allow seeking. Uncompressing to tmpfs is somewhat (10–20%) faster than uncompressing to disk, but still requires uncompressing the entire file. I do not need any write access.

The machine has 2 TB RAM and runs Red Hat Enterprise Linux Server release 6.7. The data are on a panfs filesystem. Other filesystems are a small (20 GB) tmpfs and some scratch space (15 TB, shared with others). I do not have system administrator privileges.
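One way to get a file-like path without writing the uncompressed data to disk is process substitution, which exposes a decompressing pipe as `/dev/fd/N`. This is only a sketch of the idea (the file name is a placeholder, and the resulting "file" is still a pipe, so it cannot be seeked — it only avoids the temporary file):

```shell
# Create a small sample file standing in for one of the gzipped inputs.
printf '%s' 0123456789abcdefghij | gzip > /tmp/sample.gz

# Process substitution: the tool sees a readable path (/dev/fd/N) whose
# contents are produced on the fly by zcat; nothing uncompressed ever
# touches the disk.  Here "head -c 10" stands in for the real tool.
head -c 10 <(zcat /tmp/sample.gz)
# prints "0123456789"
```

This works for tools that read their input sequentially; a tool that calls `lseek()` on the path will fail, which is the same limitation as a named pipe.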

gerrit
    You can try a [fuse gzip filesystem](https://code.google.com/p/fusecompress/) but there is absolutely no getting around the hard fact that gzip files must be read sequentially from start to the location you need. You can transparently *emulate* seeking but it has to be done by internally uncompressing everything up to the seek point. – Celada Jan 20 '16 at 15:10
  • @Celada I see. Still, at least it should be faster when reading only the header, and it can uncompress to memory rather than to disk. The 28630 gzipped files are already there on a panfs filesystem, I am not in a position to move/copy them to a different kind of filesystem, which if I understand correctly would be required to make use of fusecompress. – gerrit Jan 20 '16 at 15:14
  • I think I wrote this tool in another answer [here once](http://unix.stackexchange.com/a/210024/52934). There I ungzipped the input and passed it along a chunk at a time, reliably separated on record boundaries, through another filter before dispatching it elsewhere with explicit `dd` pipe buffering. The concept is similar, at least; it seems it could be adapted, especially with a tmpfs... – mikeserv Jan 20 '16 at 15:21
  • Here's another [similar thing](http://unix.stackexchange.com/a/252414/52934)... – mikeserv Jan 20 '16 at 15:33
  • @gerrit, I see now that fusecompress won't work for you because it expects the compressed backing store to be in its own special format which may use gzip compression but isn't an actual gzip file. It won't understand a backing store that is already pre-filled with gzip files (your panfs filesystem). Sorry about the reference to that, it was just the first hit I got when searching for something like this. I still feel like there's a decent chance someone else might have already written a different fuse filesystem that is suitable. – Celada Jan 20 '16 at 17:05
  • If you need to seek, gzip is the wrong format. You should use a format that compresses by reasonably-sized blocks so that you only need to decompress one or two blocks to get at data in the middle of the file. If that's not an option, I think scripting the decompression will be faster than anything FUSE-based. FUSE is very convenient but performance isn't its strong point. – Gilles 'SO- stop being evil' Jan 20 '16 at 22:43
  • @Gilles The format is out of my control. The 28630 gzipped files are what I am presented with and I am not in a position to put them in any other format. The files are read-only for me. – gerrit Jan 21 '16 at 00:01
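As the comments point out, plain gzip forces sequential decompression up to the offset you need, but that decompression can at least stay inside a pipe instead of producing a 2.0 GB file. A minimal sketch of extracting a byte range this way (the file name and offsets are placeholders; everything before the offset is still decompressed, just discarded in the pipe):

```shell
# Sample stand-in for one of the gzipped files.
printf '%s' 0123456789abcdefghij | gzip > /tmp/sample.gz

# Extract 5 uncompressed bytes starting at byte offset 10 (0-based)
# without writing the uncompressed file to disk: tail -c +N starts at
# the Nth byte (1-based), head -c limits the length.
zcat /tmp/sample.gz | tail -c +11 | head -c 5
# prints "abcde"
```

For repeated random access this is O(offset) per read, which is why the comments suggest a block-compressed format when that is an option.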

0 Answers