I have 2 files, A, with 6 million lines, and B with 5 million lines, I'm trying to get lines that are in A, but are missing from B, with grep -v -f B A, but it's very slow. Is there any way to speed it up?
Fluffy
- is the input data ASCII? You could add the `-F` and `-x` options if you are matching whole lines literally (no regex) – Sundeep Apr 13 '18 at 09:11
- https://unix.stackexchange.com/questions/418429/find-intersection-of-lines-in-two-files might help – Sundeep Apr 13 '18 at 09:13
- Related: [Linux tools to treat files as sets and perform set operations on them](https://unix.stackexchange.com/q/11343) – Stéphane Chazelas Feb 28 '23 at 14:38
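Sundeep's suggestion above can be sketched as follows (the file contents here are illustrative):

```shell
# Illustrative sample data: A has three lines, B has one of them.
printf 'apple\nbanana\ncherry\n' > A.txt
printf 'banana\n' > B.txt

# -F: treat patterns as fixed strings (no regex)
# -x: match whole lines only
# -v: invert the match (keep non-matching lines)
# -f B.txt: read the patterns from B.txt
grep -Fxvf B.txt A.txt
# prints: apple, cherry (each on its own line)
```

`-F` avoids compiling every line of B as a regular expression, and `-x` prevents partial-line matches, which can make a large difference on multi-million-line inputs.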
2 Answers
If the two files are sorted (in the same locale as the current one), use this command.
comm -23 A.txt B.txt
If they're not sorted and your shell supports ksh-style process substitution:
(export LC_ALL=C; comm -23 <(sort A.txt) <(sort B.txt))
(`LC_ALL=C` gives a deterministic and fast sorting order.)
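As a quick illustration with throwaway files (contents made up for the example):

```shell
# Illustrative sample data.
printf 'a\nb\nc\n' > A.txt
printf 'b\nd\n' > B.txt

# comm -23 suppresses column 2 (lines only in B.txt) and column 3
# (lines in both), leaving only the lines unique to A.txt.
(export LC_ALL=C; comm -23 <(sort A.txt) <(sort B.txt))
# prints: a, c (each on its own line)
```

Unlike the `grep -f` approach, this runs in roughly O(n log n) time (the sorts) plus a single linear merge pass, rather than testing every line of A against every pattern from B.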
See also the combine utility from moreutils that doesn't require files to be sorted:
combine A.txt not B.txt
Beware it loads the whole files in memory though.
Stéphane Chazelas
Eranda Peiris
If, like me, your two files don't have identical lines, but file1 contains keys to look up in file2, you may be able to sort both files and then use join.
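A minimal sketch of that idea, assuming the key is the first whitespace-separated field (file names and contents here are made up):

```shell
# file1: keyed data lines; file2: keys that already have a match.
printf 'id2 foo\nid1 bar\nid3 baz\n' > file1
printf 'id2 x\n' > file2

# join compares the first field of each file by default.
# -v 1: print lines from file1 whose key has no match in file2.
# join requires its inputs to be sorted on the join field.
join -v 1 <(sort file1) <(sort file2)
# prints: "id1 bar" and "id3 baz"
```

This differs from `comm` in that only the key field must match, not the whole line.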
Max Bileschi