
I'm still having issues removing lines from a LARGE file when those lines contain strings listed in another LARGE file.

  grep -vwFf file1 file2 - FAILS due to memory exhaustion.

I have used:

  comm -23 file1 file2

[https://stackoverflow.com/questions/4366533/remove-lines-from-file-which-appear-in-another-file][1]

It works on sorted files, even really large ones - but it only removes exact duplicate lines, not lines that merely *contain* one of the listed strings.
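For illustration, a minimal reproduction of that limitation (the file names are made up for the demo):

```shell
# comm -23 prints lines unique to the first file; both inputs must be sorted.
printf 'AAAAA blah\nBBBBB keep\nCCCCC sdf\n' > /tmp/demo_text.txt
printf 'AAAAA\nCCCCC\n' > /tmp/demo_strings.txt

# Nothing is removed: no line of demo_text.txt is *exactly* equal to a line
# of demo_strings.txt, so all three lines come through.
comm -23 /tmp/demo_text.txt /tmp/demo_strings.txt
```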

The two large files I have are sorted so that the strings I am searching for are at the beginning of every line:

text file:

  AAAAA blah blah blah
  AAAAB blas blas blas
  CCCCC sdf sf sdf

string file:

  AAAAA
  CCCCC

Thanks.

speld_rwong
  • Maybe will be wise to add those files in database and run some select (and subselect) – Romeo Ninov Jul 07 '17 at 16:30
  • does https://unix.stackexchange.com/questions/375294/how-to-memory-limited-grep-f-f-file-a-file-b-output-txt/375347#375347 help? – Jeff Schaller Jul 07 '17 at 16:52
  • Similar question: [Fastest way to find lines of a file from another larger file in Bash](https://stackoverflow.com/questions/42239179/fastest-way-to-find-lines-of-a-file-from-another-larger-file-in-bash) – codeforester Mar 03 '18 at 18:44

1 Answer


Why not burst $file1 into many smaller files (in /tmp, or use mktemp), then loop over each chunk, using it as the pattern file for grep? What the ideal size of the pattern file is depends on your system.

Here, each chunk will have 1000 lines.

  count=$(wc -l < file1.txt)
  i=1
  n=1
  while [ $i -le $count ]
  do
      sed -n "${i},$(( i + 999 ))p" file1.txt > /tmp/file${n}.txt
      i=$(( i + 1000 ))
      n=$(( n + 1 ))
  done
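The same bursting can also be done with the standard split(1) utility instead of a sed loop (the 2500-line stand-in file and the /tmp/patterns directory are just for the demo):

```shell
# Stand-in for the real pattern file: 2500 lines.
seq 2500 > /tmp/demo_file1.txt
mkdir -p /tmp/patterns
# -l 1000: at most 1000 pattern lines per chunk; output names are
# /tmp/patterns/chunk.aa, chunk.ab, chunk.ac, ...
split -l 1000 /tmp/demo_file1.txt /tmp/patterns/chunk.
```

With 2500 input lines this produces three chunks: two of 1000 lines and one of 500.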

Now you have a bunch of files in /tmp named file1.txt, file2.txt, and so on. Each grep pass has to filter the output of the previous pass (grepping the original file2 with every chunk would let lines removed by one chunk reappear in the next pass's output):

  cp file2 /tmp/result
  for file1 in /tmp/file*.txt
  do
      grep -vwFf "$file1" /tmp/result > /tmp/result.tmp
      mv /tmp/result.tmp /tmp/result
  done

Safer with mktemp (note the -d flag, which creates a temporary directory; write the chunk files into it instead of directly under /tmp):

  TEMP_DIR=$(mktemp -d)

  cp file2 ${TEMP_DIR}/result
  for file1 in ${TEMP_DIR}/file*.txt
  do
      grep -vwFf "$file1" ${TEMP_DIR}/result > ${TEMP_DIR}/result.tmp
      mv ${TEMP_DIR}/result.tmp ${TEMP_DIR}/result
  done
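Putting it all together, a self-contained sketch using the question's sample data (the demo file names are assumptions; the real files would be far too large to inline like this):

```shell
# Stand-ins for the real large files.
printf 'AAAAA blah blah blah\nAAAAB blas blas blas\nCCCCC sdf sf sdf\n' > /tmp/demo_text.txt
printf 'AAAAA\nCCCCC\n' > /tmp/demo_strings.txt

TEMP_DIR=$(mktemp -d)
split -l 1000 /tmp/demo_strings.txt "$TEMP_DIR/chunk."

# Chain the passes: each grep filters the previous pass's output.
cp /tmp/demo_text.txt "$TEMP_DIR/result"
for f in "$TEMP_DIR"/chunk.*
do
    # || true guards the case where a chunk removes every remaining line
    # (grep exits non-zero when it outputs nothing).
    grep -vwFf "$f" "$TEMP_DIR/result" > "$TEMP_DIR/result.new" || true
    mv "$TEMP_DIR/result.new" "$TEMP_DIR/result"
done
cat "$TEMP_DIR/result"
```

With the sample data only the AAAAB line survives, since AAAAA and CCCCC match as whole words at the start of their lines.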

thecarpy