
I have a flat database spread across 65536 files, each containing one word per line, where every line starts with two hexadecimal characters.

They look like this:

afword
46word2
Feword3
...

I make tens of thousands of requests a day on this, so I'm looking for a faster way to find the lines starting with a given pair of hexadecimal characters. The files were sorted before being gzipped.

As of now, I do:

LC=ALL zgrep --text '^af' file

Is there any faster way to do this in perl, bash, or any command-line tool?

Jeff Schaller
James
  • ITYM `LC_ALL=C zgrep --text '^af' file` – Stéphane Chazelas Aug 22 '21 at 16:33
  • Looking for lines of sorted files starting with a prefix is what the `look` utility does. But here, do you have one file with many lines, or several files with many lines and you need to look into all of them? If you need to uncompress many small files by running one invocation of `zcat -f` on each file (like the `zgrep` script does), that's what is going to be inefficient. – Stéphane Chazelas Aug 22 '21 at 16:36
  • 2
    It sounds like you may want to put this data into a database. Uncompressing all the data tens of thousands of times each days seems a bit excessive and wasteful. – Kusalananda Aug 22 '21 at 16:38
  • You say your files are sorted but then the sample you showed is not. Or do you mean it's sorted based on the contents of the lines starting with the 3rd character (ignoring the 2 digit hex number)? – Stéphane Chazelas Aug 22 '21 at 16:41
  • Thank you for your answers. I don't have enough space to store this in a regular database, that's why I made my own system. I know which file to look in, no need to look in several files. Would zcat piped to look be faster? I can try. As of now I have a 1.7 TB SSD db; it would be more than 3.5 TB uncompressed, which is hard to find at a good price. The files were sorted before gzip; don't mind the sample I put, sorry. – James Aug 22 '21 at 16:43
  • No, `look` would probably not help here as it needs to mmap the files, so can't work on pipes other than by reverting to a `grep`-like approach instead of doing a binary search. – Stéphane Chazelas Aug 22 '21 at 16:45
  • How large are the files? You say you want to look for *one* line. Is that to say that there's only one line for each pair of digits (so at most 256 lines in the file)? – Stéphane Chazelas Aug 22 '21 at 16:46
  • Note that a different approach could be to have one huge sorted file, compressed with `xz` with reasonably short blocks, and access it as a nbd device using nbdkit's xz module and use `look` on that. [I once did that with the hibp database](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=893033). That may be more space efficient, but maybe not faster though. – Stéphane Chazelas Aug 22 '21 at 16:52
  • There are about 50 billion lines in the files, so about 760k lines in each file. zgrep returns all the corresponding lines, then I have to find the right one using PHP. This was a balance between CPU usage and disk space. I tried xz to maximize space, and xzgrep, but it was too slow. Maybe your idea is the good one, but as you say I'm not sure the reading would be faster. – James Aug 22 '21 at 17:05
  • It is sorted by the prefix (the two digit hex), or the words? Is it static - as in files do not change? Not sure if it would be possible to 1) compress to xz using one block per prefix (total 256 per file). 2) have a index file which tell which block each prefix starts at. You could have the index mapped in RAM. – ibuprofen Aug 22 '21 at 17:30
  • @ibuprofen, or even put each such block into a separate file. If it's not otherwise significant for there to be exactly those 65536 files. That would be 16.7 M files, which isn't an insignificant number, but shouldn't be too much for a 1.7 TB drive. Would definitely need a two-level hierarchy though, so files like `12345/fd` for "file" 12345, prefix id "fd". That would also be trivial to process from a shell script. – ilkkachu Aug 22 '21 at 17:36
  • I tried to create more files. You access the files via subdirs which are named with one hexadecimal character, so the lookup takes advantage of that: for instance, if I look for a hash that starts with abcdef, it will go to a/b/cd and look in that file for every word starting with ef, then process them to find the right one. I tried to create one more subdir, or to have 4096 files in each dir, but every time that failed: the creation process would take months, I don't know why; I tried several things but they didn't work out. Too bad, because that would speed up the process. – James Aug 22 '21 at 19:16
  • Could it be an inode issue? `df -i`. – ibuprofen Aug 22 '21 at 19:40
  • I typed the command but can't seem to figure out what to do with those numbers. Any clue? Thank you :) – James Aug 22 '21 at 20:14
  • `df -i` shows inode stats (or `df -i /dev/some_device`): total, used and free. The number of inodes is typically set to a fixed max at file-system creation. Each file and directory uses an inode; when all are used, one cannot create more directories or files… E.g. `16 × 16 × 256 × 4096 = 268,435,456` – i.e. a 1.5T ext4 partition has about a 100M inode limit with default settings (take that number with a pinch of salt, but you might get the gist of it.) – ibuprofen Aug 22 '21 at 21:43
  • On the other hand `ab/cd/ef` would be `256 × 256 × 256 = 16,777,216` ... or am I wrong now? – ibuprofen Aug 22 '21 at 21:56
  • I tried with 16 × 65536 files; all the files were successfully created, but after about 40-50 GB it started going veeeery slowly, like thousands of words per hour when there are 50 billion to process. I have Debian on an SSD, but the script does have to fopen and fclose each file, and even with a buffer that fills my 64 GB of RAM it's going too slow. I actually did a thread on Stack Overflow about this issue but didn't find how to solve it. – James Aug 23 '21 at 06:18
  • Maybe you should consider indexing the gzip files. Google will give you some tools. This might then help to find your entries much quicker. – Rolf Sep 17 '21 at 07:27
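To make the directory scheme discussed in the comments concrete, here is a minimal sketch of that lookup; the layout (`a/b/cd.gz`) and the `hash` variable are my assumptions based on the comments, not tested against the real data:

```shell
# Sketch of the lookup James describes: for a hash starting "abcdef",
# the word list lives at a/b/cd.gz and candidate lines start with "ef".
hash=abcdef
d1=$(printf '%s' "$hash" | cut -c1)      # first subdir:  "a"
d2=$(printf '%s' "$hash" | cut -c2)      # second subdir: "b"
file=$(printf '%s' "$hash" | cut -c3-4)  # file name:     "cd"
prefix=$(printf '%s' "$hash" | cut -c5-6) # line prefix:  "ef"
gzip -dc "$d1/$d2/$file.gz" | LC_ALL=C grep "^$prefix"
```

The PHP post-filtering step James mentions would then run over the few lines this prints.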

1 Answer


zgrep (the one shipped with gzip) is a shell script which in the end does something like zcat | grep. The one from zutils does the same, except it's written in C++ and supports more compression formats; it still calls gzip and grep in separate processes, connected with a pipe.
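For reference, here is roughly the pipeline zgrep runs, demonstrated on a tiny stand-in file (demo.gz and its contents are illustrative, not the real data):

```shell
# Build a small stand-in for one of the 65536 sorted, gzipped word files.
printf '%s\n' afword 46word2 feword3 | sort | gzip > demo.gz

# Roughly what `zgrep --text '^af' demo.gz` boils down to:
gzip -dc demo.gz | LC_ALL=C grep '^af'
# -> afword
```

Doing the pipe yourself also lets you swap gzip -dc for any other decompressor.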

With such a simple search, grep has a much easier job than zcat, so if you keep the same approach to organising your data, I would suggest focusing on improving the compression side of things.
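If you go that route, a one-off conversion of the existing .gz files could look like the sketch below; the lz4 level and filenames are illustrative, and you'd want to run it on a copy of the data first:

```shell
# Hypothetical one-off recompression of the existing .gz files to lz4.
# Assumes the lz4 tool is installed; -9 is its maximum compression level.
for f in ./*.gz; do
  gzip -dc "$f" | lz4 -9 > "${f%.gz}.lz4" && rm -- "$f"
done
```

The `${f%.gz}.lz4` parameter expansion simply swaps the extension, so `ab.gz` becomes `ab.lz4`.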

Here, working on a file generated with xxd -p -c35 < /dev/urandom | head -n 760000 | sort, I find that, with it being gzip-compressed, using pigz -dc instead of zcat (aka gzip -dc) speeds things up by a factor of 2.

Compressing it with lz4 --best, I get a 30% bigger file, but decompression times are reduced 100-fold:

$ zstat +size a*(m-1)| sort -k2n | column -t
a.xz    26954744
a.lrz   26971363
a.bz2   27412562
a.gz    30353089
a.gz3   30727911
a.lzop  38000050
a.lz4   40261510
a       53960000
$ time lz4cat a.lz4 > /dev/null
lz4cat a.lz4 > /dev/null  0.06s user 0.01s system 98% cpu 0.064 total
$ time pigz -dc  a.gz > /dev/null
pigz -dc a.gz > /dev/null  0.36s user 0.02s system 126% cpu 0.298 total
$ time gzip -dc a.gz > /dev/null
gzip -dc a.gz > /dev/null  0.47s user 0.00s system 99% cpu 0.476 total
$ time lz4cat a.lz4 | LC_ALL=C grep '^af' > /dev/null
lz4cat a.lz4  0.07s user 0.02s system 60% cpu 0.142 total
LC_ALL=C grep '^af' > /dev/null  0.07s user 0.00s system 53% cpu 0.141 total
$ time pigz -dc a.gz | LC_ALL=C grep '^af' > /dev/null
pigz -dc a.gz  0.36s user 0.04s system 130% cpu 0.303 total
LC_ALL=C grep '^af' > /dev/null  0.06s user 0.01s system 23% cpu 0.302 total
$ time gzip -dc a.gz | LC_ALL=C grep '^af' > /dev/null
gzip -dc a.gz  0.51s user 0.00s system 99% cpu 0.513 total
LC_ALL=C grep '^af' > /dev/null  0.08s user 0.01s system 16% cpu 0.512 total

lzop --best is not far behind lz4, and compresses slightly better on my sample.

$ time lzop -dc a.lzop | LC_ALL=C grep '^af' > /dev/null
lzop -dc a.lzop  0.24s user 0.01s system 85% cpu 0.293 total
LC_ALL=C grep '^af' > /dev/null  0.07s user 0.01s system 27% cpu 0.292 total
Stéphane Chazelas
  • Thank you for your time. I tried lz4 but only managed to get 50% more disk space usage, not 30%. I have to say, the words contain all alphanumeric characters, so maybe it doesn't compress as well as hex-only data or so? I don't know. I tried lzop: I get 40% more disk usage and about 35% less searching time. Also my compression is slightly better than plain gzip because I use pigz -11, which takes time, but the files are meant to be static. I also tried gzip -dc but it didn't do much time-wise. – James Aug 22 '21 at 19:13
  • 1
    @James, did you try `pigz` (the parallel `gzip` (de)compressor)? – Stéphane Chazelas Aug 22 '21 at 19:45
  • @StephenKitt. Thanks, I've added the note about zutils. I was asking about pigz because even though it does parallelize the decompression, it's still faster than gzip for me on decompression (it still uses several threads, as indicated in the man page). – Stéphane Chazelas Aug 23 '21 at 04:57
  • @Stéphane right, using `pigz` for compression doesn’t imply that it’s also used for decompression, so it is worth asking. – Stephen Kitt Aug 23 '21 at 04:59
  • 1
    @StephenKitt, sorry typo in my comment above. I meant it does **not** parallelize the decompression (but still is faster for me and uses several threads other than for decompression). – Stéphane Chazelas Aug 23 '21 at 05:07
  • I don't use pigz for decompression as I used zgrep, but I could pipe it to grep and benchmark that. – James Aug 23 '21 at 06:34
  • @James, as I said, `zgrep` just does `gzip -dc | grep`, so it's just a convenience thing. Doing it yourself would remove some (very little) overhead and allow you do use different compressors. – Stéphane Chazelas Aug 23 '21 at 06:52
  • So I could even go with 7z compression, which is very good disk-space-wise; I'll try it performance-wise though. – James Aug 23 '21 at 10:09
  • @James, AFAIK, 7z uses the same compression algorithm as xz but is more a Windows tool. I'd expect you'd get the same kind of ratio and performance as with xz. Seems like you should be looking at compressors faster than gzip rather than ones that compress more (and are generally slower). – Stéphane Chazelas Aug 23 '21 at 10:11
  • @James, google's `brotli -q 11` gives the best compression ratio on my sample from the ones I tested so far (`25926225`) and is a bit faster than `pigz` to decompress it. Very slow to compress though. – Stéphane Chazelas Aug 23 '21 at 11:04