
My input text file contains a one-line header followed by a sorted list of lines of the form `x y`, where x and y are 64-bit integers separated by a single space. The input file is many millions of lines long.

Now I want to remove, from line 2 to the end, every line whose first value is <= $input. Each match removes the complete line. My inelegant solution so far has been:

head -1 inputFile > inputFile2                                  # preserve header
lineNum=$( grep -n -m1 "^$input " inputFile | cut -f1 -d ':' )  # line number of the last line to drop
tail -n +$(( lineNum + 1 )) inputFile >> inputFile2             # copy everything after that line
rm inputFile
mv inputFile2 inputFile

Example inputFile

5066314878607:a1:a2:a3:a4:a5
1 27
3 27
7 27
11 27
13 27
17 27
...

inputFile is round-robin split with GNU split into inputFile-1 and inputFile-2 (2 cores here; in general it may be z cores):

inputFile-1:
5066314878607:a1:a2:a3:a4:a5
1 27
7 27
13 27

inputFile-2:
5066314878607:a1:a2:a3:a4:a5
3 27
11 27
17 27

Now inputFile-1 has been processed up to and including the line '7 27'. From the main inputFile I want to remove only the following two lines. (Note: this corresponds to <= in the split inputFile-1, BUT it is not a simple <= x removal from the original inputFile, because of the round-robin split.)

1 27
7 27

This leaves inputFile with:

5066314878607:a1:a2:a3:a4:a5
3 27
11 27
13 27
17 27

Running on current Ubuntu 16.04, although this is likely the same for any modern Linux distro.

Question:

  • Can my existing code be improved?
  • How do I generalize this to handle many separate remove files?

Each split inputFile-x is processed sequentially on its own core. I just don't know how to remove the processed lines from the main file given the round-robin split. In particular, since this runs on many computers of different speeds, inputFile-1 may be processed up to line 300 while inputFile-2 is processed up to line 500.

To generalize to z cores, each processing separately: inputFile is round-robin split into inputFile-1 inputFile-2 inputFile-3 ... inputFile-z [i.e. split -n r/$z; for 50 cores: split -n r/50 inputFile].
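For reference, the round-robin split described above can be reproduced like this (a minimal sketch for z=2; note that plain `split -n r/2` on the whole file would not duplicate the header into every chunk, so the header is handled separately here, and GNU split's default output names `inputFile-aa`, `inputFile-ab` stand in for inputFile-1, inputFile-2):

```shell
# Build a small sample file: a 1-line header plus sorted data lines.
printf '%s\n' '5066314878607:a1:a2:a3:a4:a5' \
    '1 27' '3 27' '7 27' '11 27' '13 27' '17 27' > inputFile

# Round-robin split of the data lines only; the header is kept aside
# so every worker file can get its own copy of it.
head -n 1 inputFile > header
tail -n +2 inputFile | split -n r/2 - inputFile-

# Prepend the header to each chunk.
for f in inputFile-a*; do
    cat header "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done
```

With this, `inputFile-aa` holds the 1st, 3rd, 5th, ... data lines and `inputFile-ab` the 2nd, 4th, 6th, ..., matching the inputFile-1 / inputFile-2 example in the question.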

Core 1: inputFile-1 with (values on lines 2 to #end#) <= $input1 --> store the list/array remove1. Now remove only the lines matching remove1 from the original inputFile. Continue likewise for each core.
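One way to do this removal without tracking the round-robin arithmetic at all is to collect the finished lines from every worker and delete exactly those lines from the master file with `grep -Fvxf` (fixed strings, whole-line match, inverted). A sketch, where `done-1`, `done-2`, ... are hypothetical files holding each worker's already-processed lines:

```shell
# Sample master file: header plus sorted data lines.
printf '%s\n' '5066314878607:a1:a2:a3:a4:a5' \
    '1 27' '3 27' '7 27' '11 27' '13 27' '17 27' > inputFile

# Hypothetical per-worker "done" lists (the remove1, remove2, ... sets).
printf '%s\n' '1 27' '7 27' > done-1    # worker 1 finished through '7 27'
printf '%s\n' '3 27'        > done-2    # worker 2 finished through '3 27'

cat done-* > processed       # combine every worker's finished lines

# Keep the header (line 1) untouched; drop any later line that appears
# verbatim in the processed list.
{ head -n 1 inputFile
  tail -n +2 inputFile | grep -Fvxf processed
} > inputFile.new
mv inputFile.new inputFile
```

Because the match is on whole lines, it does not matter which chunk a line came from or how far each worker got.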

  • First question: `awk -v cutoff=299851915672 'FNR == 1 || $1+0 > cutoff+0' inputFile`. I'm afraid I can't parse your second question. – Satō Katsura Jul 09 '16 at 17:39
  • @SatoKatsura Thanks, very nice. Many cores: I just take inputFile and split it (with the Linux split command) into say 50 files for 50 cores. On the first split file, inputFile-1, I make a list of values x <= 299851915672, and then remove only those exactly matching lines from the main inputFile. I guess I could make a million-element array and loop. – StackAbstraction Jul 09 '16 at 17:56
  • Please [edit] your question and clarify. Your 2nd question might be easier to understand with a specific example. I have no idea what you're asking at the moment. Also, please add the output you would like to see from the input example you've given. – terdon Jul 09 '16 at 18:01
  • @terdon Thanks, I edited to additionally clarify. Does that make it clearer? I am using the standard linux/unix GNU split tool. https://www.gnu.org/software/coreutils/manual/html_node/split-invocation.html – StackAbstraction Jul 09 '16 at 19:41
  • your edits still don't make much sense to me. for example, i have no idea how `$input` (that you want to use for deciding which lines to delete) is defined. or if there's some reason you're not using `cat` to join all the "remove" files into one and then using `grep -f`. And most of all, I'm still struggling to see why you think that splitting the job for parallel execution will be any faster - unless your process is far more complicated than I think it sounds, that will just slow it down. My best guess is that you're making the job far more complicated than it needs to be. – cas Jul 10 '16 at 13:26
  • btw, the reason why i think parallel execution will be slower is that it seems very likely that splitting the file will take about as long as a single pass through the file with an `awk` script would take....so it would require split time PLUS processing time PLUS multi-file remove time, instead of just processing time. – cas Jul 10 '16 at 13:30
  • @cas thanks, note I am splitting up the file to do computation number theory on it, on separate cores or on separate cloud servers. Perfectly parallel. Then I pull back the log files to see how much of each split file was processed. Then I need to remove only those processed lines from the master list noting the round robin splitting. – StackAbstraction Jul 21 '16 at 23:44

2 Answers


@SatoKatsura already answered your first question in a comment: awk -v cutoff=299851915672 'FNR == 1 || $1+0 > cutoff+0' inputFile

It's very difficult to interpret what you're asking in your second question (can you update your question with an algorithm or pseudo-code?), but it sounds like you want to run many (50?) instances of your process at once (one per CPU core on the system). If so, you've started correctly by splitting the file into 50 smaller files.

The missing piece of the puzzle is that you need to use GNU parallel (or, alternatively, xargs with the -P option) to run the processes in parallel. For example:

find . -type f -name 'inputFile-*' -print0 |
    parallel -0 -n 1 \
    awk -v cutoff=299851915672 \
      \'FNR == 1 \|\| \$1+0 \> cutoff+0 {print \> FILENAME\".out\"}\'

(See notes 1, 2, and 3 below)

parallel will, by default, run one process per core on the system. You can override that by using the -j option to specify the number of simultaneous jobs.

The awk script saves the output from each input file to a file with the same name plus an extra .out extension - e.g. inputFile-1 -> inputFile-1.out. To join them all together again into one big file, you can use cat:

cat inputFile*.out > complete.output.txt
rm -f *.out

NOTE1: you need to escape quotes and other special characters (e.g. |, $, >, &, ; and more) with a backslash on the command line to be executed by parallel. It's easier to save your awk script in a standalone file (with #!/usr/bin/awk -f as the first line), make it executable with chmod, and run that script with parallel.

NOTE2: this probably won't do exactly what you want because I have no idea what it is you're actually asking for. It's meant as a general example of how to process multiple files in parallel. The awk script will almost certainly have to be changed to meet your (incomprehensible) requirements.

NOTE3: You may find that the potential time savings of running multiple processes in parallel are more than offset by the time required to split the input into multiple files, plus the overhead of starting a new instance of your process (e.g. an awk script) for each file. This depends on the nature and size of the files, and on the processing to be performed on each file. Running in parallel doesn't always mean getting the results faster. Or you may have over-complicated what you're doing, so that it's difficult to understand and/or replicate with other data.

cas
  • Thank you, I think I understand the confusion. I am doing computational processing in parallel on many cores; however, all of this text processing is done sequentially. (updated question) – StackAbstraction Jul 10 '16 at 13:08

Can we avoid reading the whole file? Yes: because the file is sorted, we can do a binary search to find the byte offset of the relevant line. See Binary search in a sorted text file and https://gitlab.com/ole.tange/tangetools/blob/master/bsearch/bsearch

Can we avoid processing most lines? Yes, as soon as we have found the relevant line, we can just copy the rest.

With that byte offset you can do a head of your one-line header, and a tail from the found byte.
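Assuming `$byte` holds the 1-based offset such a binary search returns, the copy step might look like this (a sketch; `tail -c +N` starts output at byte N of the file):

```shell
# Sample file: 1-line header, then sorted data lines.
printf '%s\n' 'HDR' '1 27' '3 27' > inputFile

# Hypothetical offset of the first line to keep, as a binary search
# (e.g. bsearch) might return it: 'HDR\n' is 4 bytes and '1 27\n' is
# 5 bytes, so '3 27' starts at 1-based byte 10.
byte=10

{ head -n 1 inputFile          # re-emit the header
  tail -c +"$byte" inputFile   # copy everything from the offset onward
} > inputFile.new
```

No line before the offset is ever parsed, so the cost is dominated by the sequential copy of the tail, not by scanning the whole file.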

Ole Tange