I have a large file that's a couple hundred lines long. The file is partitioned into many parts by a specific identifier line, let's say 'ABC'. The line 'ABC' appears 6 times, so I want 6 output files. I'm familiar with split and awk but can't seem to put together a command line that does what I've described. Any ideas?

Here's an example:

ABC
line 1
line 2
line 3
ABC
line 1
line 2
ABC
line1

For this example I'd like three files, where ABC is the first line of each new file and each file ends just before the next ABC is encountered.

don_crissti
openingceremony
  • `csplit` is usually good for this kind of thing. However, without knowing exactly what you mean by "partitioned by" and whether you want the ABCs to be part of the output, it's hard to suggest a specific command – steeldriver Feb 17 '16 at 18:42
  • @steeldriver, I've added some clarifications. will look into csplit – openingceremony Feb 17 '16 at 18:56
  • a bit late to the party, but this worked for me https://www.theunixschool.com/2012/06/awk-10-examples-to-split-file-into.html – Darragh Jul 22 '21 at 00:36

2 Answers

Using csplit:

csplit -z somefile /ABC/ '{*}'

The output files will be named xx00, xx01, ... by default, but you can change the prefix and the numbering format if desired; see man csplit.
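
As a rough sketch of changing the naming, assuming GNU csplit (where -z, -f, -b and -s are available). The sample file contents and the part_ prefix are made up for illustration:

```shell
# Work in a scratch directory so nothing else is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample input mirroring the question (hypothetical file).
printf '%s\n' ABC 'line 1' 'line 2' 'line 3' ABC 'line 1' 'line 2' ABC line1 > somefile

# -z drops the empty piece that would precede the first ABC,
# -f sets the file name prefix, -b the printf-style suffix,
# -s suppresses the byte counts csplit normally prints.
csplit -s -z -f part_ -b '%02d.txt' somefile /ABC/ '{*}'

ls part_*
```

Each resulting file should start with its own ABC line, which is what the question asks for.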

Stephen Kitt
steeldriver
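NEEDLE=ABC
HAYSTACK=/path/to/bigfile
# Split before every line matching the marker; the repeat count is (matches - 1).
csplit -f splitfile_ "$HAYSTACK" "/$NEEDLE/" "{$(($(grep -c -- "$NEEDLE" "$HAYSTACK")-1))}"
# Strip the marker text from each resulting piece.
for file in splitfile_*; do
    sed --in-place "s/$NEEDLE//" "$file"
done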

The above will split the file as requested no matter how many instances of the marker line you have, and then remove the marker from the resulting files. The output files will be called e.g. splitfile_00, splitfile_01, and so forth.

Picking apart that bit at the end of the csplit invocation, `"{$(($(grep -c -- "$NEEDLE" "$HAYSTACK")-1))}"`: the command substitution runs grep to count the lines containing your marker, and the arithmetic expansion subtracts one. This tells csplit exactly how many more times to repeat the split pattern.
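
To make that arithmetic visible, here is a small sketch that computes the count in separate steps. The sample file and the variable names matches and repeat are hypothetical, and GNU csplit is assumed for -s:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample input with three marker lines (hypothetical file).
printf '%s\n' ABC 'line 1' 'line 2' 'line 3' ABC 'line 1' 'line 2' ABC line1 > bigfile

NEEDLE=ABC
HAYSTACK=bigfile

# Number of lines that contain the marker.
matches=$(grep -c -- "$NEEDLE" "$HAYSTACK")
# The pattern itself is applied once, so repeat it (matches - 1) more times.
repeat=$((matches - 1))
echo "matches=$matches repeat={$repeat}"

csplit -s -f splitfile_ "$HAYSTACK" "/$NEEDLE/" "{$repeat}"
ls splitfile_*
```

Because this sample begins with the marker and -z is not used, the first piece is expected to be empty.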

Note that as written, things might go pear-shaped if your marker also appears within the data: both grep and csplit match any line containing it, and sed strips it wherever it occurs.
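
One way to reduce that risk, sketched here under the assumption that the marker always sits on a line of its own, is to match whole lines only. The sample data and the count variable are hypothetical, and GNU csplit is assumed for -z and -s:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# The first record contains "ABC" inside an ordinary data line.
printf '%s\n' ABC 'line 1' 'see ABC here' ABC 'line 1' > bigfile

NEEDLE=ABC
HAYSTACK=bigfile

# grep -x counts only lines that consist of the marker and nothing else.
count=$(grep -c -x -- "$NEEDLE" "$HAYSTACK")

# ^ and $ anchor the csplit pattern the same way; -z drops the empty first piece.
csplit -s -z -f splitfile_ "$HAYSTACK" "/^$NEEDLE\$/" "{$((count - 1))}"

ls splitfile_*
```

This sketch leaves the marker lines in place. If they are removed afterwards, the sed expression would need the same anchoring so that data lines are not altered.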

DopeGhoti
  • This results in errors when the pattern to search for starts with a dash, which prevents the answer from being applied to more generic situations. It is then interpreted as a command line argument to `grep`. The solution is to add `--` before `$NEEDLE` in the grep command. – Geert Smelt Jun 25 '23 at 12:56