I have a huge CSV file with 10 fields separated by commas. Unfortunately, some lines are malformed and do not contain exactly 10 commas (which causes problems when I want to read the file into R). How can I filter out only the lines that contain exactly 10 commas?

sitems

  • Your question and the linked question are *not* the same question. You ask how to handle lines with no more or fewer than a certain number of matches, whereas that question requires only a minimum match count. In reality that question is more easily answered - it doesn't require scanning a line in full, or *(at least, as the `sed` does here)* only as far as one more match than is looked for, though this question does. You should not have closed this. – mikeserv Jan 13 '16 at 18:42
  • Actually, looking closer, the asker there *does* want *no more or fewer than* matches. That question needs a new title. But the `grep` answer there is not an acceptable answer for either question... – mikeserv Jan 13 '16 at 18:52

8 Answers

Another POSIX one:

awk -F , 'NF == 11' <file

If the line has 10 commas, it has 11 fields, so we simply make awk use `,` as the field separator. When the number of fields is 11, the condition `NF == 11` is true, and awk then performs the default action, `print $0`.
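
To see it in action, here is a small sketch (the file name sample.csv and its contents are invented for illustration):

```shell
# Create a small sample: two well-formed lines (10 commas each)
# and one malformed line.
printf '%s\n' \
  'a,b,c,d,e,f,g,h,i,j,k' \
  'too,few,commas' \
  '1,2,3,4,5,6,7,8,9,10,11' > sample.csv

# Keep only the lines that split into exactly 11 comma-separated fields:
awk -F , 'NF == 11' < sample.csv
```

This prints the first and third lines and drops the malformed one.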

cuonglm
  • That's actually the first thing that came to my mind on this question. I thought it was overkill, but looking at the code...it sure is clearer. For the benefit of others: `-F` sets the field separator and `NF` refers to the number of fields in a given line. Since no code block `{statement}` is appended to the condition `NF == 11`, the default action is to print the line. (@cuonglm, feel free to incorporate this explanation if you like.) – Wildcard Jan 13 '16 at 09:16
  • +1: Very elegant and readable solution that is also very general. I can e.g. find all malformed lines with `awk -F , 'NF != 11'`. – sitems Jan 13 '16 at 10:08
  • @gardenhead: It's easy to get, as you can see from the OP's comment. I sometimes answer from my mobile, so it's difficult to add a detailed explanation. – cuonglm Jan 13 '16 at 10:15
  • @mikeserv: No, sorry if I confused you, it's just my bad English. You can't have 11 fields with 1-9 commas. – cuonglm Jan 13 '16 at 10:27
  • oh! of course! right - if they're the splitter they make eleven fields. thanks, man. – mikeserv Jan 13 '16 at 10:38
  • +1 for using the right tool for the job. But I wonder why you use `< file` instead of just `file` to give it as a parameter to awk (which also lets awk know its filename, etc.)? – Olivier Dulac Jan 13 '16 at 13:45
  • @OlivierDulac: It guards you against a file whose name starts with `-` or is just `-`. – cuonglm Jan 13 '16 at 15:42
  • Oh, then `./file` instead of `file` is the portable way ^^ (as it can be reused in many other situations, so I would rather get used to that one, with or without the `<`, even if the latter makes it superfluous). Thanks for the reason. – Olivier Dulac Jan 13 '16 at 16:33
  • @OlivierDulac how is that "the portable way"? – terdon Jan 13 '16 at 17:26
  • @terdon: When you say "any command that takes a filename as input", do you mean "any command (program) that takes *data* as *standard input*"? Because it's not true for any command that takes a filename as a *command-line argument*; for example, `cp` and `rm`. And some commands won't look at stdin unless you give them a `-` argument; e.g., `file` and `stat`. – G-Man Says 'Reinstate Monica' Jan 13 '16 at 18:25
  • I compared this to mine and to the grep like: `yes "$(set 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1; IFS=,; for s in 1 2 3; do shift; echo "$*"; done)" | head -n 5000000 | ...cmd... | wc -l` and this was three times faster than the `sed` which completed half again as fast as the `grep`, which took the longest. – mikeserv Jan 13 '16 at 19:15
  • Umm... another check and I may have been hasty - I guess *which* `sed` is very important. While still trailing `awk` by a full 3 seconds to completion, my statically-linked musl libc heirloom `sed` does in 8 seconds what my GNU `sed` requires 18 to do. The variation between `grep`s was not significant, but the `gawk` was still faster than the `mawk`. – mikeserv Jan 13 '16 at 19:24
  • @G-Man I meant any command that takes a file name and then opens it to process the data it contains. I just wanted to point out that there is nothing non-portable about `< file`. – terdon Jan 13 '16 at 22:28
  • This solution will not work with empty fields – tashuhka May 03 '16 at 14:31
  • @tashuhka: It will, can you show an example? – cuonglm May 03 '16 at 14:33
  • @cuonglm True. Thank you for the correction :) – tashuhka May 03 '16 at 14:56
  • Or if you have lots of little (or big) files and you want to merge them, you could use a variant of ``cat `ls *.csv` | awk -F , 'NF == 11' > merged.log`` then change that file's extension to csv when through. – mpag Mar 23 '18 at 21:54

Using egrep (or grep -E in POSIX):

egrep "^([^,]*,){10}[^,]*$" file.csv

This filters out anything not containing 10 commas: it matches full lines (^ at the start and $ at the end), containing exactly ten repetitions ({10}) of the sequence "any number of characters except ',', followed by a single ','" (([^,]*,)), followed again by any number of characters except ',' ([^,]*).

You can also use the -x parameter to drop the anchors:

grep -xE "([^,]*,){10}[^,]*" file.csv

This is less efficient than cuonglm's awk solution though; the latter is typically six times faster on my system for lines with around 10 commas. Longer lines will cause huge slowdowns.
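
A quick sanity check of the anchored pattern (sample lines invented here):

```shell
# One well-formed line (10 commas) and one short malformed line:
printf '%s\n' \
  'f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11' \
  'x,x,x,x,x' > sample.csv

# Select only the lines with exactly 10 commas:
grep -E '^([^,]*,){10}[^,]*$' sample.csv
```

Only the first line should survive.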

Stephen Kitt

The simplest grep code that will work:

grep -xE '([^,]*,){10}[^,]*'

Explanation:

-x ensures that the pattern must match the entire line, rather than just part of it. This is important so you don't match lines with more than 10 commas.

-E means "extended regex", which makes for less backslash-escaping in your regex.

Parentheses are used for grouping, and the {10} afterwards means there must be exactly ten consecutive matches of the pattern within the parentheses.

[^,] is a character class—for instance, [c-f] would match any single character that is a c, a d, an e or an f, and [^A-Z] would match any single character that is NOT an uppercase letter. So [^,] matches any single character except a comma.

The * after the character class means "zero or more of these."

So the regex part ([^,]*,) means "any character except a comma, any number of times (including zero), followed by a comma," and the {10} specifies exactly 10 of these. The final [^,]* then matches the remaining non-comma characters to the end of the line.
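
A small sketch of why -x (or the ^...$ anchors) matters, using an invented 11-comma line:

```shell
line='a,b,c,d,e,f,g,h,i,j,k,l'   # 11 commas -- should be rejected

# Without -x the pattern can match just part of the line, so it slips through:
printf '%s\n' "$line" | grep -E '([^,]*,){10}[^,]*'

# With -x the whole line must match, so grep finds nothing:
printf '%s\n' "$line" | grep -xE '([^,]*,){10}[^,]*' || echo 'no match'
```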

Wildcard
sed -ne's/,//11;t' -e's/,/&/10p' <in >out

That first branches out of the script for any line with 11 or more commas, then prints, of what remains, only the lines that match 10 commas.
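
A quick sketch with invented sample lines:

```shell
# 10 commas (kept), 3 commas (dropped), and 11 commas (dropped):
printf '%s\n' \
  'a,b,c,d,e,f,g,h,i,j,k' \
  'only,three,commas,here' \
  'x,x,x,x,x,x,x,x,x,x,x,x' |
sed -ne 's/,//11;t' -e 's/,/&/10p'
```

Only the first line should be printed.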

Apparently I answered this before... Here's a me-plagiarism from a question looking for exactly 4 occurrences of some pattern:

You can target the [num]th occurrence of a pattern with a sed s///ubstitution command by just adding the [num] to the command. When you test for a successful substitution and don't specify a target :label, the test branches out of the script. This means all you have to do is test for a successful substitution of the 5th or later comma, then print what remains.

Or, at least, that handles the lines which exceed your maximum of 4. Apparently you also have a minimum requirement. Luckily, that is just as simple:

sed -ne 's|,||5;t' -e 's||,|4p'

...just replace the 4th occurrence of , on a line with itself and tack your print on to the s///ubstitution flags. Because any lines matching , 5 or more times have already been pruned, the lines containing 4 , matches contain only 4.

mikeserv
  • @cuonglm - that is what I had at first, actually, but people are always telling me I should write more readable code. Since I can read the stuff others dispute as unreadable, I'm not sure what to keep and what to drop...? So I put in the second comma. – mikeserv Jan 13 '16 at 10:04
  • @cuonglm - you can mock me - it won't hurt my feelings. I can take a joke. If you were mocking me, it was a little funny. It's ok - I just wasn't sure and wanted to know. In my opinion, people should be able to laugh at themselves. Anyway, I still don't get it! – mikeserv Jan 13 '16 at 10:34
  • Haha, right, that's very positive thinking. Anyway, it's very fun to chat with you, and sometimes you *stress* my brain. – cuonglm Jan 13 '16 at 10:42
  • It's interesting that in [this answer](http://unix.stackexchange.com/a/255233/38906), if I replace `s/hello/world/2` with `s//world/2`, GNU sed works fine. With the two `sed`s from heirloom, `/usr/5bin/posix/sed` segfaults and `/usr/5bin/sed` goes into an infinite loop. – cuonglm Jan 14 '16 at 04:23
  • @mikeserv, in reference to [our earlier discussion about `sed` and `awk`](http://unix.stackexchange.com/a/254317/135943) (in comments)—I like this answer and upvoted it, but notice the translation of the accepted `awk` answer is: "Print lines with 11 fields" and the translation of this `sed` answer is: "Attempt to remove the 11th comma; skip to next line if you fail. Attempt to replace the 10th comma with itself; print line if you succeed." The `awk` answer gives the instructions to the computer just the way you would express them in English. (`awk` is good for field based data.) – Wildcard Jan 14 '16 at 06:24
  • @Wildcard - I never argued that it wasn't, only that `sed` was more flexible. And it is - you can *skip to next line if it fails* because the tests are your own to construct - you have more control. And so on the next line the field delimiter could change entirely if you like - because you define fields with each match - or even in the middle of this one. It is more basic - more fundamental - and, in my hands at least, far more powerful. I don't see how the points you bring up here are supposed to contradict any I offered there. – mikeserv Jan 14 '16 at 06:54
  • Anyway, I don't consider that English would make a very good programming language. – mikeserv Jan 14 '16 at 07:00
  • @cuonglm - it didn't for me - it processed 5 million lines without issue - several times. – mikeserv Jan 14 '16 at 07:15

Throwing in some short Python:

#!/usr/bin/env python2
with open('file.csv') as f:
    print '\n'.join(line for line in f if line.count(',') == 10)

This reads each line and checks whether the number of commas in the line equals 10 (`line.count(',') == 10`); if so, it prints the line.
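
For anyone on Python 3 (where print is a function), a sketch of the same idea; the sample data and file name are invented:

```shell
# Invented two-line sample, one well-formed and one malformed:
printf '%s\n' 'a,b,c,d,e,f,g,h,i,j,k' 'not,enough' > sample.csv

python3 -c '
import sys
# Keep only the lines that contain exactly 10 commas:
for line in open("sample.csv"):
    if line.count(",") == 10:
        sys.stdout.write(line)
'
```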

heemayl

And here's a Perl way:

perl -F, -ane 'print if $#F==10'

The -n causes perl to read its input file line by line and execute the script given by -e on each line. The -a turns on automatic splitting: each input line will be split on the value given by -F (here, a comma) and saved as the array @F.

The $#F (or, more generally, $#array) is the highest index of the array @F. Since arrays start at 0, a line with 11 fields has $#F equal to 10. The script therefore prints the line if it has exactly 11 fields.

terdon
  • You could also do `print if @F==11`, as an array in scalar context returns the number of elements. – Sobrique Jan 14 '16 at 09:34

If fields can contain commas or newlines, your code needs to understand CSV. Example (with three columns):

$ cat filter.csv
a,b,c
d,"e,f",g
1,2,3,4
one,two,"three
...continued"

$ cat filter.csv | python3 -c 'import sys, csv
> csv.writer(sys.stdout).writerows(
> row for row in csv.reader(sys.stdin) if len(row) == 3)
> '
a,b,c
d,"e,f",g
one,two,"three
...continued"

I suppose that most solutions so far would discard the second and fourth row.

Peter Otten

To display those lines NOT containing the exact number of delimiters :


[ $(awk -F, '!(NF == 11)' <myfile.csv | wc -l) -gt 0 ] &&
    printf "\nPlease maintain the comma separator for the following line(s) in the myfile.csv: \n\n" &&
    awk -F, '!(NF == 11)' <myfile.csv &&
    echo &&
    exit 1

Useful for those looking to validate their input data files.

This one-liner is taken from my open-source project Automated_Quiz, which is hosted here: https://sourceforge.net/projects/automated-quiz/
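
The same check can also be done in a single awk pass; a sketch with an invented sample file:

```shell
# Made-up sample; the second line is malformed:
printf '%s\n' 'a,b,c,d,e,f,g,h,i,j,k' 'broken,line' > myfile.csv

# Print every malformed line plus a summary count in one pass:
awk -F, 'NF != 11 { print; n++ } END { printf "malformed lines: %d\n", n+0 }' myfile.csv
```

The n+0 in the END block turns an unset counter into 0 when every line is well-formed.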

Nathan SR