How to split a large file into two parts, at a pattern?
Given an example file.txt:
ABC
EFG
XYZ
HIJ
KNL
I want to split this file at XYZ such that file1 contains the lines up to and including XYZ, and file2 contains the rest of the lines.
This is a job for csplit:
csplit -sf file -n 1 large_file /XYZ/
would silently split the file, creating pieces with prefix file and numbered using a single digit, e.g. file0 etc. Note that using /regex/ would split up to, but not including the line that matches regex. To split up to and including the line matching regex add a +1 offset:
csplit -sf file -n 1 large_file /XYZ/+1
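To see where the cut lands, here is a small sketch that runs the +1 variant on the five-line sample from the question. The scratch directory and the input name file.txt are assumptions made only for this demo.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the sample file from the question.
printf '%s\n' ABC EFG XYZ HIJ KNL > file.txt

# Split after the first line matching XYZ; the +1 offset keeps XYZ in the first piece.
csplit -sf file -n 1 file.txt '/XYZ/+1'

# Print each piece under a small header.
for f in file0 file1; do
    printf '== %s ==\n' "$f"
    cat "$f"
done
```

file0 should hold ABC, EFG and XYZ, and file1 the remaining HIJ and KNL. Dropping the +1 would move XYZ to the top of file1 instead.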
This creates two files, file0 and file1. If you absolutely need them to be named file1 and file2 you could always add an empty pattern to the csplit command and remove the first file:
csplit -sf file -n 1 large_file // /XYZ/+1
creates file0, file1 and file2 but file0 is empty so you can safely remove it:
rm -f file0
With awk you can do:
awk '{print >out}; /XYZ/{out="file2"}' out=file1 largefile
Explanation: The first awk argument (out=file1) defines a variable holding the filename that will be used for output while the subsequent argument (largefile) is processed. The awk program prints every line to the file named by the variable out ({print >out}). When the pattern XYZ is found, the variable is redefined to point to the new file ({out="file2"}), which then becomes the target for the subsequent lines. Because the print happens before the reassignment, the XYZ line itself still goes to file1.
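As a quick check of that explanation, the sketch below runs the command on the sample data in a scratch directory. The second awk call is a variation that is not part of the answer above: it switches files before printing and skips the matching line, so XYZ ends up in neither file. The names part1 and part2 are hypothetical.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the sample data under the name used in the answer.
printf '%s\n' ABC EFG XYZ HIJ KNL > largefile

# Original command: XYZ is printed to file1 before out is switched to file2.
awk '{print >out}; /XYZ/{out="file2"}' out=file1 largefile

# Variation (assumption, not from the answer): drop the XYZ line entirely.
awk '/XYZ/{out="part2"; next} {print >out}' out=part1 largefile

# Print each result under a small header.
for f in file1 file2 part1 part2; do
    printf '== %s ==\n' "$f"
    cat "$f"
done
```

Note that if the pattern never matches, the second output file is not created at all, because awk only opens a file on the first print to it.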
{ sed '/XYZ/q' >file1; cat >file2; } <infile
With GNU sed you should use the -u (--unbuffered) switch, so that sed does not read past the matching line and leave nothing for cat. Most other seds should just work though.
To leave XYZ out...
{ sed -n '/XYZ/q;p'; cat >file2; } <infile >file1
With a modern ksh here's a shell variant (i.e. without sed) of one of the sed based answers above:
{ read in <##XYZ ; print "$in" ; cat >file2 ;} <largefile >file1
And another variant in ksh alone (i.e. also omitting the cat):
{ read in <##XYZ ; print "$in" ; { read <##"" ;} >file2 ;} <largefile >file1
(The pure ksh solution seems to be quite fast: on a 2.4 GB test file it needed 19-21 sec, compared to 39-47 sec with the sed/cat based approach.)
Try this with GNU sed:
sed -n -e '1,/XYZ/w file1' -e '/XYZ/,${/XYZ/d;w file2' -e '}' large_file
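A minimal sketch of that command on the sample data, assuming GNU sed and a scratch directory. The first expression writes lines 1 through the XYZ line to file1. The second covers XYZ through the end of the file, but deletes the XYZ line before the w command runs, so file2 starts at the line after the match.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the sample data under the name used in the answer.
printf '%s\n' ABC EFG XYZ HIJ KNL > large_file

# Each -e chunk ends the filename argument of the preceding w command.
sed -n -e '1,/XYZ/w file1' -e '/XYZ/,${/XYZ/d;w file2' -e '}' large_file

# Print each result under a small header.
for f in file1 file2; do
    printf '== %s ==\n' "$f"
    cat "$f"
done
```

One caveat of this form: any later lines that also match XYZ are dropped from file2, since the d command applies to every match inside the range.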
An easy hack is to print either to STDOUT or STDERR, depending on whether the target pattern has been matched. You can then use the shell's redirection operators to redirect the output accordingly. For example, in Perl, assuming the input file is called f and the two output files f1 and f2:
Discarding the line that matches the split pattern:
perl -ne 'if(/XYZ/){$a=1; next} ; $a==1 ? print STDERR : print STDOUT;' f >f1 2>f2
Including the matched line:
perl -ne '$a=1 if /XYZ/; $a==1 ? print STDERR : print STDOUT;' f >f1 2>f2
Alternatively, print to different file handles:
Discarding the line that matches the split pattern:
perl -ne 'BEGIN{open($fh1,">","f1");open($fh2,">","f2");}
if(/XYZ/){$a=1; next}$a==1 ? print $fh2 "$_" : print $fh1 "$_";' f
Including the matched line:
perl -ne 'BEGIN{open($fh1,">","f1"); open($fh2,">","f2");}
$a=1 if /XYZ/; $a==1 ? print $fh2 "$_" : print $fh1 "$_";' f