
This question is not a duplicate: here we also need to ensure that a group of rows sharing a common value stays together in the same split file (and is not scattered across multiple split files).

All, I tried to google a solution but did not find one that suits my requirement.

Question: I have a huge file that needs to be split into multiple files if its size exceeds 2GB. I am planning to do this using the record count. The challenge is that when I split the file, a group's data must not be split; it should remain in a single file.

Example:

A,1,2,6/11/2018,X,Y,Z
A,2,2,6/11/2018,X,Y,B
A,3,2,6/11/2018,X,Y,Z
A,4,2,6/12/2018,X,Y,Z
B,1,2,6/11/2018,X,Y,Z
B,2,2,6/11/2018,X,Y,B
A,5,2,6/15/2018,X,C,Z
A,6,3,6/110/2018,A,Y,Z
C,3,2,6/11/2018,X,Y,Z
C,4,2,6/12/2018,X,Y,Z
C,5,2,6/15/2018,X,C,Z
D,6,3,6/110/2018,A,Y,Z
E,6,3,6/110/2018,A,Y,Z
E,6,3,6/110/2018,A,Y,Z
G,6,3,6/110/2018,A,Y,Z

In the sample above, let's say my group key is the first column. So if I split the file into multiple files (with a cut-off of at most 7 records per file), I don't want the records with value "A" to be split across multiple files. Similarly, the records with "B", "C", etc. should each stay in a single file. All of a group's data should remain in the same split file. A total of 3 files are created from the example above (with each group's data kept in a single split file).

For the above example, below is a sample output:

op_file_1

A,1,2,6/11/2018,X,Y,Z
A,2,2,6/11/2018,X,Y,B
A,3,2,6/11/2018,X,Y,Z
A,4,2,6/12/2018,X,Y,Z
A,5,2,6/15/2018,X,C,Z
A,6,3,6/110/2018,A,Y,Z
G,6,3,6/110/2018,A,Y,Z

op_file_2

B,1,2,6/11/2018,X,Y,Z
B,2,2,6/11/2018,X,Y,B
C,3,2,6/11/2018,X,Y,Z
C,4,2,6/12/2018,X,Y,Z
C,5,2,6/15/2018,X,C,Z
E,6,3,6/110/2018,A,Y,Z
E,6,3,6/110/2018,A,Y,Z

op_file_3

D,6,3,6/110/2018,A,Y,Z
cmaroju
  • A total of three files? Into which file do the `E` and `G` lines go? – DopeGhoti Jun 11 '18 at 17:18
  • What if you have 47 lines wherein the value of the first field is `A`? – DopeGhoti Jun 11 '18 at 18:36
  • It still looks like a duplicate to me, despite your claim that it's not. Why doesn't [Splitting a file into multiple files based on 1st column value](https://unix.stackexchange.com/q/297683/100397) match your requirement exactly? Please [edit your question again](https://unix.stackexchange.com/posts/449151/edit) to clarify _why_ this question/answer doesn't work for you here. – roaima Jun 12 '18 at 15:36
  • @roaima, It's not a duplicate: if you look at my first file "op_file_1", it also has a row whose first column is "G". Please note that the output files need not be exactly the ones I listed; they could be grouped differently. My whole point is that a group of records (based on the first column of a row) should always remain in one single file and should not be spread across multiple files. Refer to my comment to DopeGhoti as well. Hope my question is clear. – cmaroju Jun 14 '18 at 14:29
  • @DopeGhoti, In my actual requirement, the max lines a file can have is 1,000,000, so there is no chance that the first field has so many rows. The example I gave is just a sample. – cmaroju Jun 14 '18 at 14:31
  • Why have you put the `G` record with the `A` records? Why is the `D` record by itself but the `B`, `C`, `E` records are together? Why only three files of output but you have six distinct key field values? – roaima Jun 14 '18 at 20:17
  • @roaima: That is just one combination. The files could have been generated in any other way. All I am trying to achieve is that the grouped rows (based on column 1) should not get scattered across multiple output files. – cmaroju Jun 18 '18 at 15:24
  • Then it's still a duplicate because at least one answer in the linked question satisfies what you're currently saying. If it does not match you need to provide a counter-example in your question. – roaima Jun 18 '18 at 15:39

1 Answer

$ awk -F, '{outfile="output."$1; print $0 > outfile}' input
$ ls
input    output.A output.B output.C output.D
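The one-liner above writes one output file per key, which (as the comments point out) is not quite what was asked: the requirement is a maximum record count per file, with each group kept whole. One way to get that is a two-pass awk: the first pass counts the records in each group, the second assigns whole groups to output files, opening a new file whenever adding the next group would exceed the cut-off. The sketch below is an assumption-laden illustration, not the accepted answer's method — it hard-codes the cut-off of 7 and the `op_file_N` names from the question's example, and recreates the sample input inline:

```shell
# Recreate the sample input from the question (for illustration only).
cat > input <<'EOF'
A,1,2,6/11/2018,X,Y,Z
A,2,2,6/11/2018,X,Y,B
A,3,2,6/11/2018,X,Y,Z
A,4,2,6/12/2018,X,Y,Z
B,1,2,6/11/2018,X,Y,Z
B,2,2,6/11/2018,X,Y,B
A,5,2,6/15/2018,X,C,Z
A,6,3,6/110/2018,A,Y,Z
C,3,2,6/11/2018,X,Y,Z
C,4,2,6/12/2018,X,Y,Z
C,5,2,6/15/2018,X,C,Z
D,6,3,6/110/2018,A,Y,Z
E,6,3,6/110/2018,A,Y,Z
E,6,3,6/110/2018,A,Y,Z
G,6,3,6/110/2018,A,Y,Z
EOF

# Pass 1 (NR == FNR) counts records per group key; pass 2 assigns each
# group, on first sight, to the current op_file_N, starting a new file
# when the group would push the record count past the cut-off.
awk -F, -v max=7 '
NR == FNR { count[$1]++; next }     # pass 1: records per group key
!($1 in outfile) {                  # pass 2: first record of a new group
    if (nfile == 0 || nrec + count[$1] > max) { nfile++; nrec = 0 }
    outfile[$1] = "op_file_" nfile
    nrec += count[$1]
}
{ print $0 > outfile[$1] }
' input input
```

For this sample the sketch produces three files — `op_file_1` with the six `A` rows, `op_file_2` with the `B`, `C` and `D` rows, `op_file_3` with the `E` and `G` rows — each at most 7 records, with no group split. Note that a group larger than the cut-off still lands alone in its own file, since groups cannot be broken, and with very many groups you may need to `close()` output files to stay under the open-file limit of some awk implementations.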
DopeGhoti
  • just to clarify, the above example I gave is a single input file. Now that file should be split into multiple files if the max record count (per file) reaches 7. But at the same time, the grouped data should not get split. – cmaroju Jun 11 '18 at 16:59
  • Ah, I misunderstood your question. This will take the single input file and split it based on the first value (e.g. all records with `A` as the first field go into `output.A`, etc.) – DopeGhoti Jun 11 '18 at 17:01
  • Updated my question a bit to avoid confusion. – cmaroju Jun 11 '18 at 17:05
  • Based on your clarification, this does exactly what you ask: one file per first field value, with all records with that matching first field in each file. – DopeGhoti Jun 11 '18 at 17:12
  • No, it's not one file per first-field value. The number of splits should be based solely on the max record count (in the above example, 7). But while splitting the file, I need all the grouped values to remain in the same file. – cmaroju Jun 11 '18 at 18:18
  • Comments are not the place for this sort of clarification. Please edit your question. – DopeGhoti Jun 11 '18 at 18:33
  • Yup, I have edited my question. – cmaroju Jun 11 '18 at 18:59