9

I'm loading a pretty gigantic file into a PostgreSQL database. To do this, I first use split on the file to get smaller files (30 GB each), and then I load each smaller file into the database using GNU Parallel and psql's COPY.

The problem is that it takes about 7 hours to split the file, and only then does it start loading one file per core. What I need is a way to tell split to print each file name to standard output as soon as it finishes writing that file, so I can pipe the names to Parallel and it starts loading each file as soon as split finishes writing it. Something like this:

split -l 50000000 2011.psv carga/2011_ | parallel ./carga_postgres.sh {}

I have read the split man page and I can't find anything. Is there a way to do this with split or any other tool?

Jeff Schaller
Topo

4 Answers

13

Use --pipe:

cat 2011.psv | parallel --pipe -l 50000000 ./carga_postgres.sh

It requires ./carga_postgres.sh to read from stdin rather than from a file, and it is slow for GNU Parallel versions < 20130222.

If you do not need exactly 50000000 lines per chunk, --block is faster:

cat 2011.psv | parallel --pipe --block 500M ./carga_postgres.sh

This will pass chunks of around 500 MB, split on \n boundaries so no line is broken across chunks.
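The line-boundary chunking can be seen on a small scale with GNU split's -C (--line-bytes) option, which behaves analogously to --block: chunks are capped at a byte size but never cut mid-line. A sketch with made-up data (the 10-byte cap and file names here are purely illustrative):

```shell
# Each input line is 5 bytes; a 10-byte cap gives two complete lines
# per chunk, never a partial line.
cd "$(mktemp -d)"
printf 'aaaa\nbbbb\ncccc\ndddd\n' > data.txt
split -C 10 data.txt chunk_
wc -l chunk_*   # two lines in chunk_aa, two in chunk_ab
```

Concatenating the chunks back together reproduces the input exactly, which is what makes this kind of splitting safe for line-oriented loaders.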

I do not know what ./carga_postgres.sh contains, but my guess is that it contains psql with a username and password. In that case you might want to use GNU SQL (which is part of GNU Parallel):

cat 2011.psv | parallel --pipe --block 500M sql pg://user:pass@host/db

The major benefit is that you do not need to save temporary files, but can keep all in memory/pipes.

If ./carga_postgres.sh cannot read from stdin, but must read from a file, you can save it to a file:

cat 2011.psv | parallel --pipe --block 500M "cat > {#}; ./carga_postgres.sh {#}"

Large jobs often fail half way through. GNU Parallel can help you by re-running the failed jobs:

cat 2011.psv | parallel --pipe --block 500M --joblog my_log --resume-failed "cat > {#}; ./carga_postgres.sh {#}"

If this fails then you can re-run the above. It will skip the blocks that are already processed successfully.

Ole Tange
    If you have a newer version of GNU Parallel (>20140422), use @RobertB's answer with --pipepart. If that does not work directly, see if --fifo or --cat can help you out. – Ole Tange Jul 07 '16 at 12:14
2

Why not use --pipe AND --pipepart with GNU Parallel? This eliminates the extra cat and reads directly from the file on disk:

parallel --pipe --pipepart -a 2011.psv --block 500M ./carga_postgres.sh
Robert B.
1

I found the answers posted here to be way too complex, so I asked on Stack Overflow and I got this answer:

If you use GNU split, you can do this with the --filter option

‘--filter=command’
With this option, rather than simply writing to each output file, write through a pipe to the specified shell command for each output file. command should use the $FILE environment variable, which is set to a different output file name for each invocation of the command.

You can create a shell script which writes the chunk to a file and then starts carga_postgres.sh on it in the background:

#! /bin/sh

cat >"$FILE"
./carga_postgres.sh "$FILE" &

and use that script as the filter:

split -l 50000000 --filter=./filter.sh 2011.psv
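A small-scale sketch of how --filter behaves (tiny chunk size and toy file names, chosen only for illustration): split pipes each chunk to the command instead of writing it, with $FILE set to the name the chunk would have had.

```shell
# Each 2-line chunk is piped to the filter command; $FILE holds the
# would-be output name (part_aa, part_ab, ...).
cd "$(mktemp -d)"
printf '%s\n' one two three four > data.txt
split -l 2 --filter='cat > "$FILE"; echo "done: $FILE"' data.txt part_
# prints: done: part_aa
#         done: done for each chunk as it completes, i.e. done: part_ab next
cat part_aa   # the chunk was written by the filter itself via cat
```

Note the single quotes around the filter command: $FILE must be expanded by the filter's shell at run time, not by your interactive shell.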
Topo
0

An alternative to making split print the file names is to detect when the files are ready. On Linux, you can use the inotify facility, and specifically the inotifywait utility.

inotifywait -m -q -e close_write --format %f carga | parallel ./carga_postgres.sh &
split -l 50000000 2011.psv carga/2011_

You'll need to kill inotifywait manually. Killing it automatically is a little hard because there's a potential race condition: if you kill it as soon as split finishes, it may have received events that it hasn't reported yet. To make sure that all events are reported, count the matching files.

{
  sh -c 'echo $PPID' >inotifywait.pid
  exec inotifywait -m -q -e close_write --format %f carga
} | tee last.file \
  | parallel ./carga_postgres.sh &
split -l 50000000 2011.psv carga/2011_
(
  set carga/2011_??; eval "last_file=\${$#}"
  while ! grep -qxF "$last_file" last.file; do sleep 1; done
)
kill $(cat inotifywait.pid)
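The `set`/`eval` line above is the portable sh idiom for "last glob match": load the matches as positional parameters, then expand the highest-numbered one. In isolation, with stand-in file names instead of carga/2011_??:

```shell
# Stand-in names for the carga/2011_?? glob matches.
set -- carga_aa carga_ab carga_ac
# $# is the number of positional parameters; ${3} here is the last one.
eval "last_file=\${$#}"
echo "$last_file"   # prints carga_ac
```

This avoids bashisms like ${!#} and so works in any POSIX sh.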
Gilles 'SO- stop being evil'