
I am having problems dealing with huge .gz files (greater than 500G). My goal is to split each of these files by the value of the 4th field. Here is a beautiful awk one-liner I have used before to do this:

zcat file.txt.gz | awk 'NR>1{print >  $4}'

But unfortunately this takes ages with huge files, so I am trying to first split them by size and then concatenate the per-field files after each piece has been split by the field. I can split them using:

i=file.txt.gz
dir=$i
mkdir -p $dir
cd $dir
split -b 200M ../$i $i

for file in *; do zcat "$file" | awk 'NR>1{print > $4}'; done

But how do I then concatenate all the correct files by the 4th field? Also, is there really no better way to do this? I am also getting an "unexpected end of file" error when I work with gz files split like this, so I guess my splitting is wrong too. I am not sure I am heading in the right direction anyway; if you have suggestions, they would be very helpful.

Thanks so much for the help! Fra

user971102
    Yeah, you can't usefully split a `.gz` for anything other than putting it back together – you'll need to `gunzip` your file, split the uncompressed file and (optionally) gzip the parts again. – Ulrich Schwarz May 02 '17 at 05:42
  • Thank you Ulrich... What if I can't uncompress the file because they are just too large, is there anyway out of this? – user971102 May 02 '17 at 05:45
  • 500G ... I wish I could see that archive on my machine – RomanPerekhrest May 02 '17 at 05:51
  • 9
    If `awk` is too slow you'd use `perl`. If `perl` is too slow you'd use C. If C is too slow you'd use better hardware. If better hardware is too slow you'd find a better job. – Satō Katsura May 02 '17 at 06:17
  • ha Ok last option is not possible so I guess I'll try it out with perl. thanks – user971102 May 02 '17 at 06:19
  • [Here](http://perltricks.com/article/162/2015/3/27/Gzipping-data-directly-from-Perl/)'s a useful hint for doing it with `perl`. You probably want to keep track of the output files, and perhaps close them all once in a while (otherwise you'd run out of file descriptors). – Satō Katsura May 02 '17 at 06:37
  • @user971102: but the sum of your output files will be as large as the uncompressed input file anyway, won't it, since every line will go somewhere? – Ulrich Schwarz May 02 '17 at 09:15
  • According to my tests, two things are slow in your case: 1) the combination of zcat + pipe + awk — if you could do this fully in awk it would be much faster; 2) dumping the results to the screen — if you redirect the results to a file (>file) it will be much faster. Moreover, you could experiment a little by using something like `$ awk '{ ... }' <(gzip -dc file.gz)`, since it might perform better than zcat. – George Vasiliou May 02 '17 at 09:15
  • @GeorgeVasiliou `awk '...' <(gzip -dc file.gz)` is exactly equivalent to `zcat file.gz | awk '...'`, there shouldn't be any speed differences between the two. – Satō Katsura May 02 '17 at 10:06
  • How about python? Is that any faster/memory efficient? I tried this https://github.com/gstaubli/split_file_by_key, but I get the error "Too many open files:" – user971102 May 06 '17 at 07:54
  • What's the field delimiter? – Nadreck Jun 01 '17 at 07:08
  • Even if we could split the gzipped data, why would the `NR > 1` condition apply to all the pieces? – Kaz Jul 22 '17 at 05:31
  • Are you sure you don't mean `NF > 1` (process only lines that have two or more fields?) – Kaz Jul 22 '17 at 05:32
  • *"But how do I then concatenate all the correct files by the 4th field?"* By using `>>` in the awk code rather than `>` (and ensuring none of those files exist before the job starts). – Kaz Jul 22 '17 at 05:33
  • Question is missing sample data, and some key information, like how many different unique values there are in field `$4`. – Kaz Jul 22 '17 at 05:36
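Pulling together the comment threads on the concatenation half of the question: if each compressed chunk is processed in its own directory, the per-field files can simply be appended together afterwards with `>>`, as Kaz suggests. A minimal sketch — the `chunks/` layout and key names here are hypothetical, purely for illustration:

```shell
# Simulate the per-chunk output: each chunk directory holds one file
# per distinct $4 value produced from that chunk.
mkdir -p chunks/aa chunks/ab combined
printf 'line from chunk aa\n' > chunks/aa/key1
printf 'line from chunk ab\n' > chunks/ab/key1
printf 'only in chunk ab\n'  > chunks/ab/key2

# Append same-named per-field files across chunks; >> accumulates,
# so the combined/ files must not exist before the job starts.
for d in chunks/*/; do
  for f in "$d"*; do
    cat "$f" >> "combined/${f##*/}"
  done
done
```

After this, `combined/key1` holds the lines from both chunks and `combined/key2` holds the single line from chunk ab.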

1 Answer

Satō Katsura's file-descriptor comment is on the right track, assuming there are more than 1021 distinct values of $4 (commonly the per-user FD limit is 1024, minus 3 for stdin/stdout/stderr) and that you are using gawk.

When you print to a file using > or >>, the file remains open until an explicit close(), so your script is accumulating FDs. Since around gawk v3.0, running out of FDs (ulimit -n) is handled transparently: a linked list of open files is traversed and the least recently used one is "temporarily" closed (closed from the OS's point of view to free an FD; gawk keeps track of it internally so it can be transparently reopened later if needed). You can watch this happen (from v3.1) by adding -W lint when invoking gawk.

We can simulate the problem like this (in bash):

printf "%s\n" {0..999}\ 2\ 3\ 0{0..9}{0..9}{0..9} | time gawk -f a.awk

This generates 1,000,000 lines of output with 1000 unique values of $4, and takes ~17s on my laptop. My limit is 1024 FDs.
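The answer never shows `a.awk`. Given that every input line appears in the output, it is presumably just the one-liner from the question without the header skip — my reconstruction, not taken from the answer:

```shell
# Presumed contents of a.awk: route each line to a file named after $4
cat > a.awk <<'EOF'
{ print > $4 }
EOF

# Tiny smoke test: 3 lines, 2 distinct $4 values
printf '%s\n' 'a b c k1' 'a b c k2' 'd e f k1' | awk -f a.awk
```

Within one awk process, `print > file` truncates only on the first write and appends thereafter, so both k1 lines end up in the same file.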

printf "%s\n" {0..499}\ 2\ 3\ {0..1}{0..9}{0..9}{0..9} | time gawk -f a.awk

This also generates 1,000,000 lines of output, but with 2000 unique values of $4 — this takes ~110 seconds to run (more than six times longer, and with 1M extra minor page faults).

The above is the most pessimal input with respect to $4: the output file changes on every single line, which guarantees that the required output file must be (re)opened every time.

There are two ways to help with this: less churn in filename use (i.e. pre-sort by $4), or chunk the input with GNU split.

Presorting:

printf "%s\n" {0..499}\ 2\ 3\ {0..1}{0..9}{0..9}{0..9} | 
  sort -k 4 | time gawk -f a.awk

(you may need to adjust sort options to agree with awk's field numbering)

At ~4.0s, this is even faster than the first case since file handling is minimised. (Note that sorting large files will probably use on-disk temporary files in $TMPDIR or /tmp.)
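On the sort caveat above: `-k 4` keys on everything from field 4 through the end of the line; to key on field 4 alone (matching awk's `$4`), restrict the key to `-k4,4`:

```shell
# -k4,4 restricts the sort key to field 4 only; the n modifier makes
# the comparison numeric, so 2 sorts before 10 (lexically "10" < "2").
printf '%s\n' 'a b c 10 x' 'a b c 2 y' | sort -k4,4n
# outputs: 'a b c 2 y' then 'a b c 10 x'
```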

And with split:

printf "%s\n" {0..499}\ 2\ 3\ {0..1}{0..9}{0..9}{0..9} | 
  time split -l 1000 --filter "gawk -f a.awk"

That takes ~38 seconds (so you can conclude that even the overhead of starting 1000 gawk processes is less than the inefficient internal FD handling). In this case you must use >> instead of > in the awk script, otherwise each new process will clobber the previous output. (The same caveat applies if you rejig your code to call close().)
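A sketch of the close() rejig just mentioned: appending and closing immediately means at most one output file is ever open, so the FD count never grows with the number of distinct $4 values, at the cost of an open/close per line (assuming whitespace-delimited input and no pre-existing output files):

```shell
# >> plus close(): after close(), a later > would truncate on reopen,
# so append mode is required; each file is closed as soon as written.
printf '%s\n' 'a b c f1' 'a b c f2' 'd e f f1' |
  awk '{ print >> $4; close($4) }'
```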

You can of course combine both methods:

printf "%s\n" {0..499}\ 2\ 3\ {0..1}{0..9}{0..9}{0..9} | 
  time split -l 50000 --filter "sort -k 4 | gawk -f a.awk"

That takes about 4s for me; adjusting the chunk size (50000) lets you trade process/file-handling overhead against sort's disk-usage requirements. YMMV.

If you know the number of output files in advance (and it's not too large), you can use root to increase the limit (e.g. run ulimit -n 8192, then su back to yourself), or you may be able to raise the limit system-wide; see How can I increase open files limit for all processes?. The limit will be determined by your OS and its configuration (and possibly by libc if you're unlucky).
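For reference, inspecting and raising the soft limit within the current shell looks like this (raising the soft limit above the hard limit requires root):

```shell
ulimit -n     # current soft FD limit (e.g. 1024)
ulimit -Hn    # hard limit; the soft limit may be raised up to this without root
# Raise the soft limit for this shell and its children, if headroom allows:
# ulimit -n 8192
```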

mr.spuratic