
I have two files. I want to exclude anything that is in file 2 from file 1.

Example:

File #1 - List of 500 domain names

domain1
domain2
domain3
etc..

File #2 - Alexa's Top 1,000,000 domain names

domain1
domain2
domain3
etc..

I would think this would work.

cat file1 | grep -v -f file2 > results

This always results in "Killed" whenever file2 has more than about 10k lines.

/var/log/messages shows it runs out of memory. The box has 12GB RAM.

Aug 25 02:21:18 V-RHEL-EM kernel: Out of memory: Kill process 13779 (grep) score 860 or sacrifice child
Aug 25 02:21:18 V-RHEL-EM kernel: Killed process 13779 (grep), UID 0, total-vm:9377064kB, anon-rss:7400368kB, file-rss:0kB, shmem-rss:0kB

Is there a better way to do this?

Jeff Schaller
poppopretn
  • The `cat` command is useless here; use `grep ..... file1` directly. – Prvt_Yadav Aug 25 '19 at 06:48
  • Also, you should be using `-F` because the domain names in file2 are all fixed strings, not regular expressions. – cas Aug 25 '19 at 07:13
  • and see also [Removing lines in LARGE text file containing string found in other LARGE text file - FILES SORTED](https://unix.stackexchange.com/q/376036/7696) – cas Aug 25 '19 at 07:14
  • For me it works fine with 4 GB RAM and option `-F` as @cas suggested. – Cyrus Aug 25 '19 at 07:28
  • if order of output doesn't matter, how about this? `awk 'NR==FNR{a[$0]; next} ($0 in a){delete a[$0]} END{for(k in a) print k}' file1 file2` — note that file1 is read first to build an array; any match in file2 removes that entry from the array, and finally all remaining keys of the array are printed – Sundeep Aug 25 '19 at 08:36
  • you can also use `comm -23 file1_sorted file2_sorted`, where you first create sorted versions with `sort` and then run `comm`. I think the `awk` one can more easily handle a larger file2, though, since its memory use depends on the size of file1, not file2 – Sundeep Aug 25 '19 at 08:40
  • 1
    I voted to reopen. This is not an exact dupe as the linked solution does not address the _exclusion_ (`-v`) of lines. Also, it does not `grep` for whole lines (`-x`). Imagine `domain1.com` in file1 and `domain1.co` in file2. – Freddy Aug 25 '19 at 09:21
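The `comm` approach suggested in the comments can be sketched as follows (a sketch only; it assumes both files contain one domain per line, and the `file1_sorted`/`file2_sorted` names are illustrative):

```shell
# comm requires sorted input, so sort both files first.
sort file1 > file1_sorted
sort file2 > file2_sorted

# -2 suppresses lines unique to file2, -3 suppresses lines common to both,
# leaving only the lines of file1 that do not appear in file2.
comm -23 file1_sorted file2_sorted > results
```

Since `sort` spills to temporary files rather than holding everything in RAM, this also avoids the out-of-memory problem, at the cost of losing the original line order.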

1 Answer


Since you are working with fixed strings, add the `-F` flag; to match whole lines, add the `-x` flag. You don't need `cat` here: `grep` can take the file as an argument.

grep -F -x -v -f file2 file1 > results


You could also split file2 into N parts, run `grep` on each part, and use the result as the input file for the next run:

# split file2 into N=4 parts: file2.00 file2.01 file2.02 file2.03
split -n l/4 -d file2 file2.

# use results as input file
cp file1 results

for f2 in file2.??; do
        grep -F -x -v -f "$f2" results > rtemp && mv rtemp results
done

# cleanup
rm file2.??

Adjust N=4 as needed.
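Alternatively, if the order of the results doesn't matter, the `awk` one-liner from the comments sidesteps the memory problem entirely by loading only the small file1 into memory (a sketch based on Sundeep's comment; output order is arbitrary):

```shell
# Load file1 (the small list) into an array, delete every entry that also
# appears in file2, then print whatever is left. Memory use scales with
# file1 (500 lines), not with the million-line file2.
awk 'NR==FNR   { a[$0]; next }     # first file: remember every line
     ($0 in a) { delete a[$0] }    # second file: drop matches
     END       { for (k in a) print k }' file1 file2 > results
```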

Freddy