19

I have a file that's growing by about 200,000 lines a day, and it consists entirely of blocks of three lines, like this:

1358726575123       # key
    Joseph Muller   # name
    carpenter       # job
9973834728345
    Andres Smith
    student
7836472098652
    Mariah Anthony
    dentist

Now, I have another file from which I extract about 10,000 key patterns, such as 1358726575123. Then I run a for loop over these patterns and check each one against the first file. If the first file doesn't contain the pattern, I save it in a third file for further processing:

for number in $(grep -o '[0-9]\{12\}' file2); do  # finds about 10.000 keys
     if ! grep -q ^$number$ file1; then           # file1 is a huge file
         printf "$number\n" >>file3               # we'll process file3 later
     fi
done

The example code greps the huge file 10,000 times, and I run this loop about once a minute, all day long.

Since the huge file keeps growing, what can I do to make all this faster and save some CPU? I wonder whether sorting the file somehow by its key (if so, how?) or using a db instead of plain text would help...

admirabilis
  • See this post: [Fastest way to find lines of a file from another larger file in Bash](https://stackoverflow.com/questions/42239179/fastest-way-to-find-lines-of-a-file-from-another-larger-file-in-bash) – codeforester Mar 03 '18 at 18:43

11 Answers

17

The problem, of course, is that you run grep on the big file 10,000 times. You should read both files only once. If you want to stay outside scripting languages, you can do it this way:

  1. Extract all numbers from file 1 and sort them
  2. Extract all numbers from file 2 and sort them
  3. Run comm on the sorted lists to get what's only on the second list

Something like this:

$ grep -o '^[0-9]\{12\}$' file1 | sort -u -o file1.sorted
$ grep -o  '[0-9]\{12\}'  file2 | sort -u -o file2.sorted
$ comm -13 file1.sorted file2.sorted > file3

See man comm.

If you could truncate the big file every day (like a log file), you could keep a cache of sorted numbers and wouldn't need to parse the whole thing every time.
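A rough sketch of that caching idea (the `cache.sorted` file name, and the assumption that after the daily truncation file1 only holds new lines, are mine, not part of the answer):

# cache.sorted holds every key seen so far, kept sorted between runs
touch cache.sorted
# merge the keys added since the last truncation into the cache
grep -o '^[0-9]\{12\}$' file1 | sort -u | sort -mu - cache.sorted > cache.sorted.new
mv cache.sorted.new cache.sorted
# compare today's wanted keys against the cache
grep -o '[0-9]\{12\}' file2 | sort -u -o file2.sorted
comm -13 cache.sorted file2.sorted > file3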

angus
  • Neat! 2 seconds (on not particularly fast drives) with 200,000 random three-line entries in mainfile (i.e. 600,000 lines) and 143,000 random keys (that's just how my test data ended up)... tested, and it works (but you knew that :) ... I do wonder about the `{12}`... the OP has used 12, but the example keys are 13 digits long. – Peter.O Jan 21 '12 at 15:20
  • Just a little note: you can do it without dealing with temporary files by using `<(grep...sort)` where the file names are. – Kevin Jan 21 '12 at 17:20
  • Thank you, but grepping and sorting the files takes much longer than my previous loop (+2min.). – admirabilis Jan 21 '12 at 19:36
  • @Teresa e Junior. How big is your main file? ... You have mentioned that it grows at 200,000 lines a day, but not how big it is... To reduce the amount of data you need to process, you can read just the current day's 200,000 lines by taking note of the last line number processed (yesterday) and using `tail -n +$linenum` to output only the latest data. That way you will be processing only approximately 200,000 lines each day... I just tested it with **6 million** lines in main-file and *10 thousand* keys... **time**: real 0m0.016s, user 0m0.008s, sys 0m0.008s – Peter.O Jan 21 '12 at 19:56
  • I'm really quite puzzled/curious about how you can grep your main file **10,000** times and find it faster than this method, which only greps it **once** (and once for the much smaller *file2*)... Even if your sort takes longer than my test, I just can't get my head around the idea that reading a **big** file that many times doesn't outweigh a single sort (timewise). – Peter.O Jan 21 '12 at 20:16
  • @Peter.O Not so big yet. `grep | sort` takes a long time, but you had a good idea to keep sorted files for each day. Unfortunately, my example showed only 12-digit keys, but the file has keys of different lengths (I wrote that code for simplicity). So I don't know why, but `comm` is complaining that the files are not sorted and is outputting crap... (I tried `sort` with and without `-n`). – admirabilis Jan 21 '12 at 20:21
  • If you have no whitespace or comments on lines which hold a key, then you don't need to *grep* your key-file at all; just *sort* it as-is... As for the main-file's *grep* pattern, just change it to `^[0-9]\+$` (if it has no whitespace or comments; otherwise you could use `sed`). – Peter.O Jan 21 '12 at 20:45
  • @TeresaeJunior: `comm` complaining is just because of `sort -n`. The numeric sort is unnecessary, I'll update the answer. [edit: oops, you tested that. Then I don't know. Why don't you post the real specifications?] Regarding speed, on a 2.5 GB file1 I found this method takes ~10% of the time of performing 10,000 `grep -q`s (which is very fast, but still not that fast). – angus Jan 21 '12 at 22:53
  • You might want to compare the speed with `--line-regexp '[0-9]\{12\}'` for the first file. – l0b0 Apr 14 '12 at 10:27
  • Fastest performance will most often result from using the tool(s) you know best. – DocSalvager Jan 26 '14 at 06:43
11

This answer is based on the awk answer posted by potong.
It is twice as fast as the `comm` method (on my system), for the same 6 million lines in main-file and 10 thousand keys... (now updated to use FNR, NR)

Although awk is faster than your current system, and will give you and your computer(s) some breathing space, be aware that when data processing is as intense as you've described, you will get the best overall results by switching to a dedicated database, e.g. SQLite or MySQL.


awk '{ if (/^[^0-9]/) { next }              # Skip lines which do not hold key values
       if (FNR==NR) { main[$0]=1 }          # Process keys from file "mainfile"
       else if (main[$0]==0) { keys[$0]=1 } # Process keys from file "keys"
     } END { for(key in keys) print key }' \
       "mainfile" "keys" >"keys.not-in-main"

# For 6 million lines in "mainfile" and 10 thousand keys in "keys"

# The awk  method
# time:
#   real    0m14.495s
#   user    0m14.457s
#   sys     0m0.044s

# The comm  method
# time:
#   real    0m27.976s
#   user    0m28.046s
#   sys     0m0.104s

Peter.O
  • This is fast, but I don't understand much awk: what should the file names look like? I tried `file1 -> mainfile` and `file2 -> keys` with gawk and mawk, and it outputs wrong keys. – admirabilis Jan 22 '12 at 06:39
  • file1 has keys, names and jobs. – admirabilis Jan 22 '12 at 06:41
  • 'mainfile' is the big file (with keys, names and jobs). I've just called it 'mainfile' because I kept getting mixed up about which file was which (file1 vs file2)... 'keys' contains only the 10 thousand, or however many, keys... For your situation do NOT *redirect* anything... just use **file1 EOF file2**. They are the **names** of your files... "EOF" is a 1-line file created by the script to indicate the end of the first file (main data file) and the start of the second file (keys). `awk` allows you to read in a series of files... In this case that series has *3* files in it. The output goes to `stdout`. – Peter.O Jan 22 '12 at 07:40
  • This script will print any keys which are present in `mainfile`, **AND** it will also print any keys from the `keys` file which are **NOT** in `mainfile`... That's probably what is happening... (I'll look a bit further into it...) – Peter.O Jan 22 '12 at 08:01
  • Thank you, @Peter.O! Since the files are confidential, I'm trying to create sample files with `$RANDOM` for uploading. – admirabilis Jan 22 '12 at 08:38
  • Fixed the issue with printing keys from the `keys` file which are not in `mainfile` ... see comment in script *# THIS LINE CHANGED AS PER COMMENT IN ANSWER* .. The line which follows it is the replacement. – Peter.O Jan 22 '12 at 08:44
  • Here is a link to my $RANDOM test data generator.... [paste.Ubuntu](http://paste.ubuntu.com/812819/) – Peter.O Jan 22 '12 at 09:03
  • I uploaded the files here: http://tinyurl.com/files-tar-bz2 ; keys has 5000 keys that mainfile has not. – admirabilis Jan 22 '12 at 09:21
  • Your test data *works fine here*... It responded exactly as I expected... Here are *[my notes](http://paste.ubuntu.com/812891/)* on what I did and the results I got. Maybe I've misunderstood something... if so we should go to chat, as the Stack Exchange message keeps suggesting... I'm looking more carefully at the **5000** keys you mention... – Peter.O Jan 22 '12 at 10:38
  • Could you please explain what `136703` means? I also get this number. – admirabilis Jan 22 '12 at 10:44
  • `# The first run of the script created a list of 136703 new keys..` but we only have 10000 keys to check xD – admirabilis Jan 22 '12 at 10:51
  • Finally! (I get 5000 keys being listed :) .. It is twice as fast as the `comm` method for the same data on my system... Updating the answer now.. – Peter.O Jan 22 '12 at 11:42
  • Thank you so much! It's 30 times faster than the first loop! – admirabilis Jan 22 '12 at 12:13
  • I wonder if a SQL database would be any faster? This awk script does wonders ;) – admirabilis Jan 22 '12 at 12:20
  • Assuming that you set up your SQL database with appropriate keys/indexing, it would be faster (relatively speaking) as the database becomes larger... With flat files you always need to read every record/line sequentially... The database will use binary-search and hash lookups, and only for the keys you actually ask for... – Peter.O Jan 22 '12 at 12:27
8

Yes, definitely do use a database. They're made exactly for tasks like this.
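To make that concrete, here is a minimal sketch using the sqlite3 command-line shell (the database name `keys.db`, the table names, and the temp-file paths are my own assumptions, not part of this answer):

grep -o '^[0-9]\{12\}$' file1 > /tmp/main.keys
grep -o  '[0-9]\{12\}'  file2 > /tmp/wanted.keys

sqlite3 keys.db > file3 <<'EOF'
CREATE TABLE IF NOT EXISTS seen (key TEXT PRIMARY KEY);
CREATE TEMP TABLE staging (key TEXT);
CREATE TEMP TABLE wanted  (key TEXT);
.import /tmp/main.keys staging
.import /tmp/wanted.keys wanted
INSERT OR IGNORE INTO seen SELECT key FROM staging;
SELECT key FROM wanted WHERE key NOT IN (SELECT key FROM seen);
EOF

Because `seen` has a primary key, lookups stay indexed as it grows, and `INSERT OR IGNORE` makes reloading the same data harmless.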

Mika Fischer
  • Thanks! I don't have much experience with databases. Which database do you recommend? I've got MySQL and the sqlite3 command installed. – admirabilis Jan 21 '12 at 09:47
  • They are both fine for this; sqlite is simpler because it's basically just a file and an SQL API to access it. With MySQL you need to set up a MySQL server in order to use it. While that's not very difficult either, sqlite might be best to start with. – Mika Fischer Jan 21 '12 at 14:07
3

This might work for you:

 awk '/^[0-9]/{a[$0]++}END{for(x in a)if(a[x]==1)print x}' file{1,2} >file3

EDIT:

Amended script to allow for duplicates and unknown keys in both files, still produces keys from the first file not present in the second:

 awk '/^[0-9]/{if(FNR==NR){a[$0]=1;next};if($0 in a){a[$0]=2}}END{for(x in a)if(a[x]==1)print x}' file{1,2} >file3
potong
  • This will miss new keys which occur more than once in the main file (and, for that matter, which occur more than once in the keys file). It seems to require that the array count incrementing for the main file is not allowed to exceed 1, or some equivalent workaround. (+1 because it is pretty close to the mark) – Peter.O Jan 22 '12 at 04:40
  • I tried with gawk and mawk, and it outputs wrong keys... – admirabilis Jan 22 '12 at 06:42
  • @Peter.O I assumed the main file had unique keys and that file 2 was a subset of the main file. – potong Jan 22 '12 at 15:11
  • @potong The second one works well and is very fast! Thank you! – admirabilis Jan 22 '12 at 16:55
  • *@Teresa e Junior* Are you sure it is working correctly yet?.. Using the [test data you provided](http://tinyurl.com/files-tar-bz2), which should output **5000** keys, when I run it, it produces **136703** keys, just as I got until I finally understood what your requirements were... *@potong* Of course! *FNR==NR* (I've never used it before :) – Peter.O Jan 22 '12 at 22:17
  • @Peter.O The first command has that problem; the second works well. – admirabilis Jan 27 '12 at 08:06
2

With that much data, you should really switch to a database. In the meantime, one thing you must do to get anywhere near decent performance is not to search file1 separately for each key. Run a single grep to extract all the non-excluded keys at once. Since that grep also returns lines that don't contain a key, filter those away.

grep -o '[0-9]\{12\}' file2 |
grep -Fxv -f - file1 |
grep -x '[0-9]\{12\}' >file3

(`-Fx` means to search whole lines, literally. `-f -` means to read a list of patterns from standard input.)

Gilles 'SO- stop being evil'
  • Unless I'm mistaken, this doesn't address the problem of storing keys that aren't in the big file, it will store the keys that are in it. – Kevin Jan 21 '12 at 17:15
  • @Kevin exactly, and this has forced me to use the loop. – admirabilis Jan 21 '12 at 18:07
  • @TeresaeJunior: adding `-v` (`-Fxv`) may take care of that. – Dennis Williamson Jan 21 '12 at 18:28
  • @DennisWilliamson That would pick all the lines in the big file that don't match any in the key file, including names, jobs, etc. – Kevin Jan 22 '12 at 03:26
  • @Kevin Thanks, I'd misread the question. I've added a filter for non-key lines, though my preference now goes to [using `comm`](http://unix.stackexchange.com/questions/29624/grep-huge-number-of-patterns-from-huge-file/29641#29641). – Gilles 'SO- stop being evil' Jan 22 '12 at 19:12
2

Permit me to reinforce what others have said, "Get thee to a database!"

There are MySQL binaries freely available for most platforms.

Why not SQLite? If you run it purely in memory (loading a flat file when you start, then writing it out when you're done), then if your computer crashes or the SQLite process goes away, so does all the data.

Your problem looks like just a couple lines of SQL, and will run in milliseconds!
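For instance, the core of it might be nothing more than an anti-join. This is only a sketch: the database, table, and column names are assumptions of mine, the keys are presumed to already be loaded into the two tables, and your MySQL credentials are presumed to be configured:

mysql mydb <<'EOF'
SELECT w.id
FROM   wanted AS w
LEFT JOIN people AS p ON p.id = w.id
WHERE  p.id IS NULL;
EOF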

After installing MySQL (which I recommend over other choices), I'd shell out $40 for O'Reilly's SQL Cookbook by Anthony Molinaro, which has lots of problem patterns, starting with simple SELECT * FROM table queries, and going through aggregates and multiple joins.

Jan Steinman
  • Yes, I'll start migrating my data to SQL in a few days, thank you! The awk scripts have been helping me a lot until I get it all done, though! – admirabilis Jan 27 '12 at 07:59
1

I'm not sure if this is the exact output that you are looking for, but probably the easiest way is:

grep -o '[0-9]\{12\}' file2 | sed 's/.*/^&$/' > /tmp/numpatterns.grep
grep -vf /tmp/numpatterns.grep file1 > file3
rm -f /tmp/numpatterns.grep

You could also use:

sed -ne 's/.*\([0-9]\{12\}\).*/^\1$/p' file2 > /tmp/numpatterns.grep
grep -vf /tmp/numpatterns.grep file1 > file3
rm -f /tmp/numpatterns.grep

Each of these creates a temporary pattern file which is then matched (with `grep -v`) against the large file (file1).

Arcege
1

I fully agree that you should get a database (MySQL is fairly easy to use). Until you have that running, I like angus's `comm` solution, but so many people are trying with `grep` and getting it wrong that I thought I'd show the (or at least one) correct way to do it with `grep`.

grep -o '[0-9]\{12\}' keyfile | grep -v -f <(grep -o '^[0-9]\{12\}' bigfile) 

The first grep gets the keys. The third grep (in the `<(...)`) takes all the keys used in the big file, and the `<(...)` passes it as a file to the `-f` option of the second grep. That causes the second grep to use it as a list of lines to match. It then matches its input (the list of keys) from the pipe (first grep) against that list, and prints any keys extracted from the key file that are not (`-v`) in the big file.

Of course, you can also do this with temporary files, which you then have to keep track of and remember to delete:

grep -o '[0-9]\{12\}'  keyfile >allkeys
grep -o '^[0-9]\{12\}' bigfile >usedkeys
grep -v -f usedkeys allkeys

This prints all lines in allkeys that don't appear in usedkeys.

Kevin
1

The keyfile does not change? Then you should avoid searching the old entries again and again.

With tail -f you can get the output of a growing file.

tail -f growingfile | grep -f keyfile 

grep -f reads the patterns from a file, one pattern per line.

user unknown
1

I wasn't going to post my answer, because I thought that this amount of data shouldn't be processed with a shell script, and the right answer, to use a database, had already been given. But since there are now 7 other approaches...

This reads the first file into memory, then greps the second file for numbers and checks whether each value is stored in memory. It should be faster than multiple greps, provided you have enough memory to load the whole file.

declare -a record
while read -r key
do
    read -r name
    read -r job
    record[$key]="$name:$job"
done < file1

for number in $(grep -o '[0-9]\{12\}' file2)
do
    [[ -n ${record[$number]} ]] || echo "$number" >> file3
done
forcefsck
1

I agree with @jan-steinman that you should use a database for this kind of task. There are lots of ways to hack together a solution with a shell script, as the other answers show, but doing it that way will lead to a lot of misery if you're going to use and maintain the code for anything longer than a one-day throw-away project.

Assuming you're on a Linux box, you most likely have Python installed by default, and it includes the sqlite3 library as of Python v2.5. You can check your Python version with:

% python -V
Python 2.7.2+

I recommend using the sqlite3 library because it is a simple file-based solution that exists for all platforms (including inside your web browser!) and it does not require a server to be installed. It is essentially zero-configuration and zero-maintenance.

Below is a simple Python script that parses the file format you gave as an example, then runs a simple "select all" query and outputs everything it stored in the db.

#!/usr/bin/env python

import sqlite3
import sys

dbname = '/tmp/simple.db'
filename = '/tmp/input.txt'
with sqlite3.connect(dbname) as conn:
    conn.execute('''create table if not exists people (key integer primary key, name text, job text)''')
    with open(filename) as f:
        for key in f:
            key = key.strip()
            name = f.next().strip()
            job = f.next().strip()
            try:
                conn.execute('''insert into people values (?,?,?)''', (key, name, job))
            except sqlite3.IntegrityError:
                sys.stderr.write('record already exists: %s, %s, %s\n' % (key, name, job))
    cur = conn.cursor()

    # get all people
    cur.execute('''select * from people''')
    for row in cur:
        print row

    # get just two specific people
    person_list = [1358726575123, 9973834728345]
    cur.execute('''select * from people where key in (?,?)''', person_list)
    for row in cur:
        print row

    # a more general way to get however many people are in the list
    person_list = [1358726575123, 9973834728345]
    template = ','.join(['?'] * len(person_list))
    cur.execute('''select * from people where key in (%s)''' % (template), person_list)
    for row in cur:
        print row

Yes, this means that you'll need to learn some SQL, but it will be well worth it in the long run. Also, instead of parsing your log files, maybe you could write data directly to your sqlite database.
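For instance, whatever currently appends a three-line record to the text file could instead insert straight into the same database the script above uses. This is only a sketch; the shell variables are hypothetical placeholders, and a real script would need to escape any quotes in the values:

# hypothetical: $key, $name and $job hold one record's values
sqlite3 /tmp/simple.db \
  "INSERT OR IGNORE INTO people VALUES ($key, '$name', '$job');"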

aculich
  • Thank you for the python script! I think `/usr/bin/sqlite3` works the same way for shell scripts (http://packages.debian.org/squeeze/sqlite3), though I've never used it. – admirabilis Jan 27 '12 at 08:04
  • Yes, you can use `/usr/bin/sqlite3` with shell scripts, however I recommend avoiding shell scripts except for simple throw-away programs and instead use a language like python which has better error handling and is easier to maintain and grow. – aculich Jan 27 '12 at 16:46