My input text file contains a one-line header followed by a sorted list of lines of the form x y, where x and y are 64-bit integers separated by a space. The input file is many millions of lines long.
Now I want to remove, from line 2 to the end of the file, every line whose first value is <= $input. The complete line is removed for each match. My inelegant solution so far has been:
    head -1 inputFile > inputFile2                              # preserve header
    lineNum=$( grep -n "$input" inputFile | cut -f1 -d ':' )    # find line number of the match for $input
    tail -n +$(( lineNum + 1 )) inputFile >> inputFile2         # skip down the input until we get to values > $input
    rm inputFile
    mv inputFile2 inputFile
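The steps above (keep the header, find the matching line, keep everything after it) could also be done in a single pass over the file. This is only a minimal sketch, assuming that the value in `$input` really occurs as the first field of some line and that the file is sorted as described. The helper name `drop_through` is made up for illustration. The first field is compared as a string, because awk implementations commonly hold numbers as double-precision floats, which cannot represent every 64-bit integer exactly.

```shell
# drop_through FILE VALUE   (hypothetical helper, not an existing tool)
# Keep the header (line 1), drop every following line up to and including
# the first line whose first field equals VALUE, keep everything after it.
drop_through() {
    local file=$1 value=$2 tmp
    tmp=$(mktemp "${file}.XXXXXX") || return 1
    awk -v t="$value" '
        NR == 1           { print; next }   # the header is always kept
        found             { print; next }   # past the match: keep the line
        ($1 "") == (t "") { found = 1 }     # string compare; the match itself is dropped
    ' "$file" > "$tmp" && mv "$tmp" "$file"
}

# Demo on the sample data from the question
demo=$(mktemp -d)
printf '%s\n' '5066314878607:a1:a2:a3:a4:a5' \
    '1 27' '3 27' '7 27' '11 27' '13 27' '17 27' > "$demo/inputFile"
drop_through "$demo/inputFile" 7
cat "$demo/inputFile"
```

If the value is not present at all, this sketch drops every data line, so a real script would probably want to check for that case first.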
Example inputFile
    5066314878607:a1:a2:a3:a4:a5
    1 27
    3 27
    7 27
    11 27
    13 27
    17 27
    ...
The GNU tool split divides inputFile into inputFile-1 and inputFile-2 (2 cores here, but it may be z cores):
inputFile-1:
    5066314878607:a1:a2:a3:a4:a5
    1 27
    7 27
    13 27
inputFile-2:
    5066314878607:a1:a2:a3:a4:a5
    3 27
    11 27
    17 27
Now inputFile-1 has been processed up to and including the line '7 27'. From the main inputFile I want to remove only the following two lines. (Note that this is <= within the split inputFile-1, BUT it is not a simple <= x removal from the original inputFile, because of the round-robin split.)
    1 27
    7 27
This leaves inputFile with:
    5066314878607:a1:a2:a3:a4:a5
    3 27
    11 27
    13 27
    17 27
Running on current Ubuntu 16.04 although this is likely the same for any modern Linux distro.
Question:
Each inputFile-x is processed separately, and each one is processed sequentially. I just don't know how to remove the processed lines from the main file given the round-robin split, in particular since this runs on many computers of different speeds, so inputFile-1 may have been processed up to line 300 while inputFile-2 has been processed up to line 500.
To generalize to z cores, each processing separately: inputFile is round-robin split into inputFile-1, inputFile-2, inputFile-3, ..., inputFile-z (i.e. split -n r/$z; for 50 cores: split -n r/50 inputFile).
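For reference, the chunks in the example each carry a copy of the header, which a plain round-robin split of the whole file would not produce by itself. A rough sketch of one way to get that layout with GNU split follows. The helper name `split_with_header` and the temporary file names are assumptions for illustration, and the loop assumes no unrelated files already match `inputFile-*`.

```shell
# split_with_header FILE Z   (hypothetical helper, not an existing tool)
# Round-robin split the data lines of FILE into Z chunks named
# FILE-1 ... FILE-Z, then prepend a copy of the header to each chunk.
split_with_header() {
    local file=$1 z=$2 body chunk
    body=$(mktemp "${file}.body.XXXXXX") || return 1
    tail -n +2 "$file" > "$body"            # data lines only, header excluded
    # r/$z hands out lines in turn: data line j goes to chunk ((j-1) mod z) + 1.
    # --numeric-suffixes=1 numbers the chunks from 1; -a sets the suffix width
    # to the number of digits in z (so z=2 gives -1, -2 and z=50 gives -01 ... -50).
    split -n "r/$z" --numeric-suffixes=1 -a "${#z}" "$body" "$file-"
    rm -f "$body"
    for chunk in "$file"-*; do
        { head -n 1 "$file"; cat "$chunk"; } > "$chunk.tmp" && mv "$chunk.tmp" "$chunk"
    done
}

# Demo on the sample data from the question, 2 cores
demo=$(mktemp -d)
printf '%s\n' '5066314878607:a1:a2:a3:a4:a5' \
    '1 27' '3 27' '7 27' '11 27' '13 27' '17 27' > "$demo/inputFile"
split_with_header "$demo/inputFile" 2
head "$demo"/inputFile-*
```

Writing the data lines to a temporary file first costs an extra copy of the input; it is done here only to keep the sketch simple.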
Core 1: take the lines of inputFile-1 (lines 2 to the end) whose values are <= $input1 and store them in a list/array remove1. Then remove from the original inputFile only the lines that match entries in remove1. Continue in the same way for each core.
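The remove-list idea described above could be sketched like this: collect the finished lines of a chunk, then filter the main file against that list by exact whole-line match, so neither the round-robin order nor the differing progress of each core matters. This is only a sketch under assumptions: `$input1` occurs as a first field in inputFile-1, data lines are unique, and no data line is identical to the header. The helper names `processed_lines` and `remove_processed` are invented here. It uses grep's `-F` (fixed strings), `-x` (whole line), `-v` (invert) and `-f` (read patterns from a file).

```shell
# processed_lines CHUNK VALUE   (hypothetical helper)
# Print the data lines of CHUNK (header skipped) up to and including the
# line whose first field equals VALUE.
processed_lines() {
    awk -v t="$2" '
        NR == 1           { next }    # skip the header copy in the chunk
                          { print }   # every data line seen so far is finished
        ($1 "") == (t "") { exit }    # string compare; stop after the match
    ' "$1"
}

# remove_processed MAIN LIST   (hypothetical helper)
# Rewrite MAIN without any line that appears verbatim in LIST.
remove_processed() {
    local main=$1 list=$2 tmp
    tmp=$(mktemp "${main}.XXXXXX") || return 1
    grep -vxFf "$list" "$main" > "$tmp" && mv "$tmp" "$main"
}

# Demo with the two chunks from the question
demo=$(mktemp -d)
hdr='5066314878607:a1:a2:a3:a4:a5'
printf '%s\n' "$hdr" '1 27' '3 27' '7 27' '11 27' '13 27' '17 27' > "$demo/inputFile"
printf '%s\n' "$hdr" '1 27' '7 27' '13 27'  > "$demo/inputFile-1"
printf '%s\n' "$hdr" '3 27' '11 27' '17 27' > "$demo/inputFile-2"

input1=7    # core 1 has finished everything through the line starting with 7
processed_lines "$demo/inputFile-1" "$input1" > "$demo/remove1"
remove_processed "$demo/inputFile" "$demo/remove1"
cat "$demo/inputFile"
```

For z cores the same two calls can be repeated per chunk with its own `$inputN`, or the individual lists can be concatenated and removed in one pass. With many millions of patterns the list has to fit in memory, so it may be worth removing in batches.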