
I have two files: A, with 6 million lines, and B, with 5 million lines. I'm trying to get the lines that are in A but missing from B with `grep -v -f B A`, but it's very slow. Is there any way to speed it up?

Fluffy
  • is the input data ASCII? you could add `-F` and `-x` options if you are matching whole lines literally (no regex) – Sundeep Apr 13 '18 at 09:11
  • https://unix.stackexchange.com/questions/418429/find-intersection-of-lines-in-two-files might help – Sundeep Apr 13 '18 at 09:13
  • Related: [Linux tools to treat files as sets and perform set operations on them](https://unix.stackexchange.com/q/11343) – Stéphane Chazelas Feb 28 '23 at 14:38
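The `-F` and `-x` suggestion from the comments can be sketched as follows. This is a minimal illustration on made-up sample data, assuming every line of B is meant as a literal, whole-line match; the file contents and the `grep_result` variable are invented for the example.

```shell
# Work in a scratch directory so the sample files do not clobber anything.
tmpdir=$(mktemp -d)
printf '%s\n' cherry apple banana date > "$tmpdir/A.txt"
printf '%s\n' banana date elderberry > "$tmpdir/B.txt"

# -F  treat each pattern in B.txt as a fixed string, not a regex
# -x  a pattern must match the whole line
# -v  print the lines of A.txt that match none of the patterns
# -f  read the patterns from B.txt, one per line
grep_result=$(LC_ALL=C grep -F -x -v -f "$tmpdir/B.txt" "$tmpdir/A.txt")
printf '%s\n' "$grep_result"

rm -rf "$tmpdir"
```

The lines come out in the order they appear in A (here `cherry`, then `apple`), so no sorting is needed.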

2 Answers


If the two files are sorted (in the same locale as the current one), use this command.

comm -23 A.txt B.txt

If they're not sorted and your shell supports ksh-style process substitution:

(export LC_ALL=C; comm -23 <(sort A.txt) <(sort B.txt))

(Setting LC_ALL=C gives a deterministic, and fast, sort order, and ensures that sort and comm use the same collation.)
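If your shell lacks process substitution, the same idea can be sketched with temporary files. The sample data, the `.sorted` file names and the `only_in_A` variable below are invented for illustration; only `sort` and `comm -23` come from the answer.

```shell
# Work in a scratch directory so the sample files do not clobber anything.
tmpdir=$(mktemp -d)
printf '%s\n' cherry apple banana date > "$tmpdir/A.txt"
printf '%s\n' banana date elderberry > "$tmpdir/B.txt"

# Sort both files in the C locale so that sort and comm agree on the order.
LC_ALL=C sort "$tmpdir/A.txt" > "$tmpdir/A.sorted"
LC_ALL=C sort "$tmpdir/B.txt" > "$tmpdir/B.sorted"

# -2 suppresses lines found only in B, -3 suppresses lines common to both,
# leaving only the lines that appear in A but not in B.
only_in_A=$(LC_ALL=C comm -23 "$tmpdir/A.sorted" "$tmpdir/B.sorted")
printf '%s\n' "$only_in_A"

rm -rf "$tmpdir"
```

On this sample the output is `apple` followed by `cherry`, in sorted order rather than in the original order of A.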

See also the combine utility from moreutils that doesn't require files to be sorted:

combine A.txt not B.txt

Beware, though, that it loads the whole files into memory.
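If moreutils is not installed, a comparable in-memory approach can be sketched with awk. This is a different technique, not how combine is implemented, and it keeps every line of B in memory. The sample data and the `awk_result` variable are invented for illustration.

```shell
# Work in a scratch directory so the sample files do not clobber anything.
tmpdir=$(mktemp -d)
printf '%s\n' cherry apple banana date > "$tmpdir/A.txt"
printf '%s\n' banana date elderberry > "$tmpdir/B.txt"

# First pass (NR == FNR holds only while reading the first file, B.txt):
# record every line of B as a key of the array "seen".
# Second pass (A.txt): print the lines that are not keys of "seen".
# Assumption: B.txt is not empty; with an empty first file, NR == FNR would
# also hold for A.txt and nothing would be printed.
awk_result=$(awk 'NR == FNR { seen[$0]; next } !($0 in seen)' \
    "$tmpdir/B.txt" "$tmpdir/A.txt")
printf '%s\n' "$awk_result"

rm -rf "$tmpdir"
```

Neither file needs to be sorted, and the surviving lines keep the order they had in A.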

Stéphane Chazelas
Eranda Peiris

If, like me, your two files don't contain identical lines, but file1 contains the strings you want to grep for in file2, you may be able to sort both files and then use `join`, which pairs lines on a common field.
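One possible reading of this, as a rough sketch: file1 holds one key per line and file2 holds lines whose first field is such a key. That layout, the sample data, and the `matched` and `unmatched` variables are assumptions made for illustration, not something the answer specifies.

```shell
# Work in a scratch directory so the sample files do not clobber anything.
tmpdir=$(mktemp -d)
printf '%s\n' k3 k2 > "$tmpdir/file1.txt"
printf '%s\n' 'k4 delta' 'k1 alpha' 'k3 gamma' 'k2 beta' > "$tmpdir/file2.txt"

# join needs both inputs sorted on the join field (the first field by default).
LC_ALL=C sort "$tmpdir/file1.txt" > "$tmpdir/file1.sorted"
LC_ALL=C sort "$tmpdir/file2.txt" > "$tmpdir/file2.sorted"

# Lines of file2 whose first field appears in file1.
matched=$(LC_ALL=C join "$tmpdir/file2.sorted" "$tmpdir/file1.sorted")

# -v 1 prints the unpairable lines of the first input instead:
# lines of file2 whose first field does not appear in file1.
unmatched=$(LC_ALL=C join -v 1 "$tmpdir/file2.sorted" "$tmpdir/file1.sorted")

printf 'matched:\n%s\nunmatched:\n%s\n' "$matched" "$unmatched"

rm -rf "$tmpdir"
```

Note that join compares whole fields, so unlike grep it will not find a key that occurs only as a substring of a longer field.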

Max Bileschi