
I'm still having issues removing lines from a LARGE file when those lines contain strings listed in another LARGE file.

  grep -vwFf file1 file2 - FAILS due to memory exhaustion.

I have used:

  comm -23 file1 file2

[https://stackoverflow.com/questions/4366533/remove-lines-from-file-which-appear-in-another-file][1]

It works on sorted files, even really large ones - but it only removes exact duplicate lines, not lines that merely *contain* one of the listed strings.
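For illustration, a minimal reproduction of that limitation (the file names are made up for the demo):

```shell
# comm -23 prints lines unique to the first file; both inputs must be sorted.
printf 'AAAAA blah\nBBBBB keep\nCCCCC sdf\n' > /tmp/demo_text.txt
printf 'AAAAA\nCCCCC\n' > /tmp/demo_strings.txt

# Nothing is removed: no line of demo_text.txt is *exactly* equal to a line
# of demo_strings.txt, so all three lines come through.
comm -23 /tmp/demo_text.txt /tmp/demo_strings.txt
```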

The two large files I have are sorted so that the strings I am searching for are at the beginning of every line:

text file:

  AAAAA blah blah blah
  AAAAB blas blas blas
  CCCCC sdf sf sdf

string file:

  AAAAA
  CCCCC

Thanks.

speld_rwong
  • Maybe will be wise to add those files in database and run some select (and subselect) – Romeo Ninov Jul 07 '17 at 16:30
  • does https://unix.stackexchange.com/questions/375294/how-to-memory-limited-grep-f-f-file-a-file-b-output-txt/375347#375347 help? – Jeff Schaller Jul 07 '17 at 16:52
  • Similar question: [Fastest way to find lines of a file from another larger file in Bash](https://stackoverflow.com/questions/42239179/fastest-way-to-find-lines-of-a-file-from-another-larger-file-in-bash) – codeforester Mar 03 '18 at 18:44

1 Answer


Why not burst $file1 into many smaller files (in /tmp, or use mktemp), then loop over each chunk, using it as the pattern file for grep? What the ideal size of the pattern file is depends on your system.

Here, each chunk will have 1000 lines.

  count=$(wc -l < file1.txt)
  i=1
  n=1
  while [ $i -le $count ]
  do
      sed -n "${i},$(( i + 999 ))p" file1.txt > /tmp/file${n}.txt
      i=$(( i + 1000 ))
      n=$(( n + 1 ))
  done
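The same bursting can also be done with the standard split(1) utility instead of a sed loop (the 2500-line stand-in file and the /tmp/patterns directory are just for the demo):

```shell
# Stand-in for the real pattern file: 2500 lines.
seq 2500 > /tmp/demo_file1.txt
mkdir -p /tmp/patterns
# -l 1000: at most 1000 pattern lines per chunk; output names are
# /tmp/patterns/chunk.aa, chunk.ab, chunk.ac, ...
split -l 1000 /tmp/demo_file1.txt /tmp/patterns/chunk.
```

With 2500 input lines this produces three chunks: two of 1000 lines and one of 500.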

Now you have a bunch of files in /tmp named file1.txt, file2.txt, and so on. Each grep pass has to filter the output of the previous pass (grepping the original file2 with every chunk would let lines removed by one chunk reappear in the next pass's output):

  cp file2 /tmp/result
  for file1 in /tmp/file*.txt
  do
      grep -vwFf "$file1" /tmp/result > /tmp/result.tmp
      mv /tmp/result.tmp /tmp/result
  done

Safer with mktemp (note the -d flag, which creates a temporary directory; write the chunk files into it instead of directly under /tmp):

  TEMP_DIR=$(mktemp -d)

  cp file2 ${TEMP_DIR}/result
  for file1 in ${TEMP_DIR}/file*.txt
  do
      grep -vwFf "$file1" ${TEMP_DIR}/result > ${TEMP_DIR}/result.tmp
      mv ${TEMP_DIR}/result.tmp ${TEMP_DIR}/result
  done
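Putting it all together, a self-contained sketch using the question's sample data (the demo file names are assumptions; the real files would be far too large to inline like this):

```shell
# Stand-ins for the real large files.
printf 'AAAAA blah blah blah\nAAAAB blas blas blas\nCCCCC sdf sf sdf\n' > /tmp/demo_text.txt
printf 'AAAAA\nCCCCC\n' > /tmp/demo_strings.txt

TEMP_DIR=$(mktemp -d)
split -l 1000 /tmp/demo_strings.txt "$TEMP_DIR/chunk."

# Chain the passes: each grep filters the previous pass's output.
cp /tmp/demo_text.txt "$TEMP_DIR/result"
for f in "$TEMP_DIR"/chunk.*
do
    # || true guards the case where a chunk removes every remaining line
    # (grep exits non-zero when it outputs nothing).
    grep -vwFf "$f" "$TEMP_DIR/result" > "$TEMP_DIR/result.new" || true
    mv "$TEMP_DIR/result.new" "$TEMP_DIR/result"
done
cat "$TEMP_DIR/result"
```

With the sample data only the AAAAB line survives, since AAAAA and CCCCC match as whole words at the start of their lines.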

thecarpy