5

I have a 42M line text file. The first nine characters of each line are a numeric key. What is the most efficient way to extract only the lines whose key exists in another list of about 1.5M keys? Both the file and the list of keys are sorted.

Gilles 'SO- stop being evil'
joebolte
    Similar question: [grep huge number of patterns from huge file](http://unix.stackexchange.com/questions/29624/grep-huge-number-of-patterns-from-huge-file) – angus Aug 22 '12 at 18:44
  • Similar question: [Fastest way to find lines of a file from another larger file in Bash](https://stackoverflow.com/questions/42239179/fastest-way-to-find-lines-of-a-file-from-another-larger-file-in-bash) – codeforester Mar 03 '18 at 18:43

3 Answers

5

Using awk should be efficient enough - it provides built-in associative arrays, where the key lookup time is at worst proportional to the logarithm of the number of keys (in your lookup table - which is relatively small in your example).

For your input this would be:

42M * log2(1.5M) -> 42M * 20 key comparisons 

(where M means 10^6)

If your awk implements its arrays as hash tables (as GNU awk does), every key lookup only costs a constant amount of time on average.

An example of an efficient awk based solution (using the default field separator; note that ARGIND is a GNU awk extension - with other awks the usual NR == FNR idiom works instead):

$ awk 'ARGIND == 1 { a[$1] = 1; next } a[$1] { print $0 }' keys.dat largefile.dat

Since both inputs are sorted, you could write a script that merges them in a single pass, with runtime scaling linearly in the combined input size. But it would cost more programming time.
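A minimal sketch of such a single-pass merge, done in awk so no extra tooling is needed. The file names and the tiny sample data are stand-ins for illustration; the real inputs would be the sorted key list and the sorted 42M-line file. Memory use stays constant because neither file is loaded into an array:

```shell
# Stand-in inputs (sorted, 9-character keys at the start of each record):
printf '%s\n' 000000001 000000003 > keys.dat
printf '%s\n' '000000001 foo' '000000002 bar' '000000003 baz' > largefile.dat

# Single-pass merge: walk the sorted key list in lockstep with the
# sorted data file, advancing whichever side is behind.
awk -v keyfile=keys.dat '
  BEGIN { have = ((getline k < keyfile) > 0) }      # read the first key
  {
    rk = substr($0, 1, 9)                           # 9-char key of this record
    while (have && k < rk)                          # skip keys below this record
      have = ((getline k < keyfile) > 0)
    if (have && k == rk) print                      # keys match: keep the record
  }' largefile.dat
# prints:
# 000000001 foo
# 000000003 baz
```

Each line of either file is examined once, so the runtime is linear in the combined input size, as described above.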

Or you could use join, which expects sorted files as input - the restriction is that your keys need to be sorted lexicographically - and perhaps you have to tweak the output format. For example:

$ join -j1 keys.dat largefile.dat

Use -t to configure the field separator and -o to adjust the output format.

This should run in time linear in the input size.
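For illustration, a toy run of the join command above (file names and data are made up). Note this direct form only works when the key is a whitespace-delimited field - not for the fixed-width records in the question, where the key runs into the rest of the line:

```shell
# Stand-in inputs: sorted keys, and sorted records whose first field is the key.
printf '%s\n' 000000001 000000003 > keys.dat
printf '%s\n' '000000001 alpha' '000000002 beta' '000000003 gamma' > largefile.dat

# join emits only the records whose first field appears in both files.
join -j1 keys.dat largefile.dat
# prints:
# 000000001 alpha
# 000000003 gamma
```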

Peter.O
maxschlepzig
  • Sorry about my editing faux pas.. I did a quick Ctrl-A, Ctrl-V (without even looking), to replace everything with a single line, but I was in the *wrong answer* :( .. It's a good thing we have rollback... – Peter.O Aug 22 '12 at 19:33
  • 1
    join looks like exactly what I want, except that my fields are fixed-width, and join appears to only support lines delimited by single characters. – joebolte Aug 22 '12 at 19:45
4

Note that this method relies on the key being fixed-length and beginning at the first byte of each record.

By using \x01 (or any unique single-byte char) as a temporary field separator, records can be more easily manipulated.

$ join -t$'\x01' <(sed -r 's/.{9}/&\x01/' main) <(cut -b -9 keys) | sed -r 's/(.{9})./\1/'
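A toy run of that pipeline (the sample data is made up; this is bash-specific, since it uses `$'...'` quoting and process substitution, and `sed -r` plus the `\x01` escape are GNU sed features). The first sed splices `\x01` in after the 9-char key of each record, `cut` trims the key list to 9 bytes, and the final sed strips the separator back out:

```shell
printf '%s\n' '000000001 foo' '000000002 bar' > main
printf '%s\n' 000000001 000000003 > keys

join -t$'\x01' <(sed -r 's/.{9}/&\x01/' main) <(cut -b -9 keys) | sed -r 's/(.{9})./\1/'
# prints:
# 000000001 foo
```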

maxschlepzig's awk example was faster for 45,000,000 records, but it failed on a bigger file. How much free RAM do you have?

Here are the results:

45,000,000 unique records, 1,500,000 keys
=========================
awk

real    0m31.971s
user    0m28.782s
sys     0m2.972s

join

real    0m53.733s
user    0m54.255s
sys     0m0.708s

(2x45) 90,000,000 records, 1,500,000 keys
=========================
awk
awk: (FILENAME=main2 FNR=54334297) fatal: assoc_lookup: bucket->ahname_str: can't allocate 11 bytes of memory (Cannot allocate memory)

join

real    1m35.306s
user    1m34.754s
sys     0m1.344s

===================
Peter.O
0

Assuming it's a line-based file, grep should be pretty efficient. Use -f keyfile and -F for fixed strings:

grep -F -f keys textfile

Note: heed the warning about false positives by Peter.O in the comments below.
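A toy reproduction of the false-positive problem (data made up), together with a start-of-line-anchored variant. Anchoring fixes the correctness issue, but it forces regex matching instead of fixed-string matching, which the timings in the comments show to be drastically slower on large inputs:

```shell
printf '%s\n' 123456 > keys
printf '%s\n' '633322 data:123456789' > textfile

# -F matches the key anywhere in the line, so this record is reported
# even though its actual key is 633322:
grep -F -f keys textfile
# prints:
# 633322 data:123456789

# Prefixing each key with ^ anchors it to the start of the line,
# eliminating the false positive (no output here):
sed 's/^/^/' keys | grep -f - textfile || true
```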

Thor
  • 1
    `-F` is problematic unless you are matching the entire record: eg: `grep -Ff <(echo 123456) <(echo 633322 data:123456789)` returns a record whose key is actually `633322` – Peter.O Aug 23 '12 at 04:28
  • PS... I've just timed it against the data used in my answer, for which there is no possibility of a clash. It **is** fast; clocking in at *7.5sec* and *9.2sec* for the 45M and 90M files respectively.. It's a pity about the *clash* issue... (I wonder if `-F` can be anchored to the start-of-record?)... I tried a regex (without -F), but it got bogged down. – Peter.O Aug 23 '12 at 05:10
  • Matching only on the key column should solve these issues. I've updated the answer. – Thor Aug 23 '12 at 08:44
  • 1
    It is surprisingly slow.. It is pushing out matched records at a steady rate, and I'm sure it will get there, but I can't wait *625* hours for it to finish the 1.5 million matches in my test data :) .. ie. approximately 1.5sec per record... It is really a surprise. – Peter.O Aug 23 '12 at 11:51
  • Could it be the second `sed` being very slow? Could you try with: `... | sed 's/^/NR==/' | awk -f - textfile` at the end of the pipe. – Thor Aug 23 '12 at 12:08
  • Using awk is slower -- All data and keys are evenly spread; though the 90M file had every second record identical to the preceding one. This showed up as *a long delay* then *two quick(-ish) succession* matches... Here are the results (using 1.5M keys; every key is matchable) --- 45M records after 563 matches: **3.7** sec/match: time real/usr/sys: 44m9.897s 44m16.998s 0m1.612s -- 90M Data after 214 matches **8.6** sec/match: time real/usr/sys: 30m50.324s 30m56.712s 0m3.484s – Peter.O Aug 23 '12 at 20:28
  • Thank you for testing, I'll remove the piped option as it seems impractically inefficient. – Thor Aug 23 '12 at 22:02