9

I'm loading a pretty gigantic file into a PostgreSQL database. To do this, I first use split on the file to get smaller files (30 GB each), and then I load each smaller file into the database using GNU Parallel and psql's COPY.

The problem is that it takes about 7 hours to split the file, and only then does it start loading one file per core. What I need is a way to tell split to print each file name to standard output as soon as it finishes writing that file, so I can pipe the names to Parallel and it starts loading each file as soon as split finishes writing it. Something like this:

split -l 50000000 2011.psv carga/2011_ | parallel ./carga_postgres.sh {}

I have read the split man page and I can't find anything. Is there a way to do this with split or any other tool?

Jeff Schaller
Topo

4 Answers

13

Use --pipe:

cat 2011.psv | parallel --pipe -l 50000000 ./carga_postgres.sh

It requires ./carga_postgres.sh to read from stdin rather than from a file, and it is slow for GNU Parallel versions < 20130222.

If you do not need exactly 50000000 lines per chunk, --block is faster:

cat 2011.psv | parallel --pipe --block 500M ./carga_postgres.sh

This will pass chunks of around 500 MB, split on \n boundaries so no line is broken across chunks.
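The line-boundary chunking can be seen on a small scale with GNU split's -C (--line-bytes) option, which behaves analogously to --block: chunks are capped at a byte size but never cut mid-line. A sketch with made-up data (the 10-byte cap and file names here are purely illustrative):

```shell
# Each input line is 5 bytes; a 10-byte cap gives two complete lines
# per chunk, never a partial line.
cd "$(mktemp -d)"
printf 'aaaa\nbbbb\ncccc\ndddd\n' > data.txt
split -C 10 data.txt chunk_
wc -l chunk_*   # two lines in chunk_aa, two in chunk_ab
```

Concatenating the chunks back together reproduces the input exactly, which is what makes this kind of splitting safe for line-oriented loaders.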

I do not know what ./carga_postgres.sh contains, but my guess is that it contains psql with a username and password. In that case you might want to use GNU SQL (which is part of GNU Parallel):

cat 2011.psv | parallel --pipe --block 500M sql pg://user:pass@host/db

The major benefit is that you do not need to save temporary files, but can keep all in memory/pipes.

If ./carga_postgres.sh cannot read from stdin, but must read from a file, you can save it to a file:

cat 2011.psv | parallel --pipe --block 500M "cat > {#}; ./carga_postgres.sh {#}"

Large jobs often fail half way through. GNU Parallel can help you by re-running the failed jobs:

cat 2011.psv | parallel --pipe --block 500M --joblog my_log --resume-failed "cat > {#}; ./carga_postgres.sh {#}"

If this fails then you can re-run the above. It will skip the blocks that are already processed successfully.

Ole Tange
    If you have a newer version of GNU Parallel (>20140422), use @RobertB's answer with --pipepart. If that does not work directly, see if --fifo or --cat can help you out. – Ole Tange Jul 07 '16 at 12:14
2

Why not use --pipe AND --pipepart with GNU Parallel? This eliminates the extra cat and reads directly from the file on disk:

parallel --pipe --pipepart -a 2011.psv --block 500M ./carga_postgres.sh
Robert B.
1

I found the answers posted here to be way too complex, so I asked on Stack Overflow and I got this answer:

If you use GNU split, you can do this with the --filter option

‘--filter=command’
With this option, rather than simply writing to each output file, write through a pipe to the specified shell command for each output file. command should use the $FILE environment variable, which is set to a different output file name for each invocation of the command.

You can create a shell script which writes the chunk to a file and then starts carga_postgres.sh on it in the background:

#! /bin/sh

cat >"$FILE"
./carga_postgres.sh "$FILE" &

and use that script as the filter:

split -l 50000000 --filter=./filter.sh 2011.psv
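A small-scale sketch of how --filter behaves (tiny chunk size and toy file names, chosen only for illustration): split pipes each chunk to the command instead of writing it, with $FILE set to the name the chunk would have had.

```shell
# Each 2-line chunk is piped to the filter command; $FILE holds the
# would-be output name (part_aa, part_ab, ...).
cd "$(mktemp -d)"
printf '%s\n' one two three four > data.txt
split -l 2 --filter='cat > "$FILE"; echo "done: $FILE"' data.txt part_
# prints: done: part_aa
#         done: done for each chunk as it completes, i.e. done: part_ab next
cat part_aa   # the chunk was written by the filter itself via cat
```

Note the single quotes around the filter command: $FILE must be expanded by the filter's shell at run time, not by your interactive shell.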
Topo
0

An alternative to making split print the file names is to detect when the files are ready. On Linux, you can use the inotify facility, and specifically the inotifywait utility.

inotifywait -m -q -e close_write --format %f carga | parallel ./carga_postgres.sh &
split -l 50000000 2011.psv carga/2011_

You'll need to kill inotifywait manually. Killing it automatically is a little hard because there's a potential race condition: if you kill it as soon as split finishes, it may have received events that it hasn't reported yet. To make sure that all events are reported, count the matching files.

{
  sh -c 'echo $PPID' >inotifywait.pid
  exec inotifywait -m -q -e close_write --format %f carga
} | tee last.file \
  | parallel ./carga_postgres.sh &
split -l 50000000 2011.psv carga/2011_
(
  set carga/2011_??; eval "last_file=\${$#}"
  while ! grep -qxF "$last_file" last.file; do sleep 1; done
)
kill $(cat inotifywait.pid)
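The `set`/`eval` line above is the portable sh idiom for "last glob match": load the matches as positional parameters, then expand the highest-numbered one. In isolation, with stand-in file names instead of carga/2011_??:

```shell
# Stand-in names for the carga/2011_?? glob matches.
set -- carga_aa carga_ab carga_ac
# $# is the number of positional parameters; ${3} here is the last one.
eval "last_file=\${$#}"
echo "$last_file"   # prints carga_ac
```

This avoids bashisms like ${!#} and so works in any POSIX sh.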
Gilles 'SO- stop being evil'