Search in specific column for pattern and output entire line

Question

I'm working in HDFS and am trying to get the entire line where the 4th column starts with the number 5:

100|20151010|K|5001
695|20151010|K|1010
309|20151010|R|5005
410|20151010|K|5001
107|20151010|K|1062
652|20151010|K|5001

The output should therefore be:

100|20151010|K|5001
309|20151010|R|5005
410|20151010|K|5001
652|20151010|K|5001

terdon · Answer 1 · 2015-12-02T23:31:18.807

The simplest approach would probably be awk:

awk -F'|' '$4~/^5/' file

The -F'|' sets the field separator to |. The $4~/^5/ will be true if the 4th field starts with 5. The default action for awk when something evaluates to true is to print the current line, so the script above will print what you want.
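To make the implicit action visible, here is a minimal sketch that spells out the `{ print $0 }` that awk would otherwise supply by default, and feeds it the question's sample data through a here-document instead of a file. The `filter_col4` function name is only illustrative.

```shell
# Hypothetical helper: print lines whose 4th |-separated field starts with 5.
filter_col4() {
    # $4 ~ /^5/ is the pattern; { print $0 } is the action awk would
    # run by default if it were left out.
    awk -F'|' '$4 ~ /^5/ { print $0 }' "$@"
}

# Sample data from the question, supplied on standard input.
filter_col4 <<'EOF'
100|20151010|K|5001
695|20151010|K|1010
309|20151010|R|5005
410|20151010|K|5001
107|20151010|K|1062
652|20151010|K|5001
EOF
```

Called with a file name (`filter_col4 file`), the helper behaves like the one-liner above.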

Other choices are:

Perl
```
perl -F'\|' -ane 'print if $F[3]=~/^5/' file
```
Same idea. The -a switch causes perl to split each input line into fields on the value given by -F and store them in the array @F. We then print if the 4th element (field) of the array (arrays start counting at 0) starts with a 5.
grep
```
grep -E '^([^|]*\|){3}5' file
```
The regex will match a string of non-| followed by a | 3 times, and then a 5.
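The leading `^` matters here: it pins the three "field plus separator" groups to the start of the line, so the 5 has to be the first character of the 4th field. A small sketch of that, using made-up five-field lines that are not part of the question's data (the function name is hypothetical too):

```shell
# Hypothetical helper wrapping the anchored regex from the answer.
col4_starts_with_5() {
    grep -E '^([^|]*\|){3}5' "$@"
}

# Made-up lines with five fields. Only the first has a 4th field that
# starts with 5; the second has a 5 at the start of its 5th field only.
printf '%s\n' 'a|b|c|5001|x' 'a|b|c|1010|5x' | col4_starts_with_5
# prints: a|b|c|5001|x
```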
GNU or BSD sed
```
sed -En '/^([^|]*\|){3}5/p' file
```
The -E turns on extended regular expressions and the -n suppresses normal output. The regex is the same as the grep above and the p at the end makes sed print only lines matching the regex.
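As a variation on the same idea, the -n and p pair can be swapped for the opposite logic: leave sed's automatic printing on and delete every line that does not match. This is only a sketch; the regex is anchored with `^` as in the grep command, and the helper name is made up.

```shell
# Hypothetical helper: same filter, written as "delete what does not match".
keep_col4_5() {
    # No -n here, so sed prints every line it has not deleted.
    sed -E '/^([^|]*\|){3}5/!d' "$@"
}

printf '%s\n' '100|20151010|K|5001' '695|20151010|K|1010' | keep_col4_5
# prints: 100|20151010|K|5001
```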

edited Dec 02 '15 at 23:31

answered Dec 02 '15 at 22:55

terdon

@mikeserv thanks, greater portability is always a good thing but where is that documented? I tried it and it does indeed work on GNU sed but the `-E` isn't mentioned in either `man` or `info`. It does activate ERE, right? – terdon Dec 02 '15 at 23:33
It's not, except in the source. That happens a lot with open source stuff: people submit a patch to do the same thing something already does because they're used to doing it with a different letter, but then don't care to write a new SYNOPSIS. Anyway, it's worked for a long time, and `-E`xtended regexp is slated for the next POSIX version too, so you might as well just get used to it. Plus, `-r` doesn't make any sense. And yeah, it's a synonym: they both do exactly the same thing. *Almost* the same: I think with `-r` you can switch back, `-re ... -Ge ...` or something, but who would? – mikeserv Dec 02 '15 at 23:34
@terdon about `awk -F'|' '$4~/^5/' file` - what does `~` mean? – Manuel Jordan Feb 19 '23 at 14:17
@ManuelJordan that's the match operator and is used to test for a match against a regular expression. – terdon Feb 19 '23 at 14:27
Thank you. So it is the same as `=~` in Perl. – Manuel Jordan Feb 19 '23 at 14:40
@ManuelJordan ah yes, exactly – terdon Feb 19 '23 at 14:44
I asked the question in the first place because I had read some tutorials about `awk` working with regex patterns, but none of them covered `~` – Manuel Jordan Feb 19 '23 at 14:46

score 2 · Answer 2 · edited Dec 02 '15 at 23:00

This will print all lines that match |5 and then no more | until the end of the line:

grep '|5[^|]*$' <in >out
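Note that this pattern keys on the last field rather than the 4th one; the two coincide for the question's data because every line has exactly four fields. A quick sketch with made-up five-field lines (not from the question, and with a hypothetical function name) shows where they would differ:

```shell
# Hypothetical helper wrapping the pattern from this answer.
last_field_starts_with_5() {
    grep '|5[^|]*$' "$@"
}

# The first line comes from the question; the other two are made up.
# The second is selected although its 4th field is 1010, and the third
# is skipped although its 4th field is 5001.
printf '%s\n' '100|20151010|K|5001' 'a|b|c|1010|5x' 'a|b|c|5001|x' |
    last_field_starts_with_5
```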

edited Dec 02 '15 at 23:00

terdon

answered Dec 02 '15 at 22:55

mikeserv

score 0 · Answer 3 · answered Sep 21 '22 at 12:04

If you want answers using tools that are CSV-aware to account for CSV files containing fields with embedded | characters and newlines, then here's how you do it with mlr (Miller):

mlr --csv --fs '|' -N filter -S '${4} =~ "^5"' file

This makes mlr read the original data as a header-less CSV file using | as the field separator (this is what --csv --fs '|' -N does). It applies a filter expression that extracts the records for which the expression is true. While doing so, it avoids inferring data types and treats the data as strings (-S, because regular expressions are generally only applicable to strings).

The expression matches the regular expression ^5 to the fourth field of the record.

Extracted records are reproduced as CSV with the same field separator as the input.

You can do the same sort of thing with the tools from the csvkit package, but since there's no way to tell csvgrep to use a custom field separator for the output, you will have to reformat the result with csvformat if you want to retain your | separators:

csvgrep -d '|' -H -c 4 -r '^5' file | csvformat -K 1 -D '|'

The -K 1 option to csvformat skips the anonymous header line produced by csvgrep.

jubilatious1 · Answer 4 · 2022-09-22T23:10:18.850

Using Raku (formerly known as Perl_6)

~$ raku -ne '.put if .split("|")[3].starts-with("5");' file

Sample Input:

100|20151010|K|5001
695|20151010|K|1010
309|20151010|R|5005
410|20151010|K|5001
107|20151010|K|1062
652|20151010|K|5001

Sample Output:

100|20151010|K|5001
309|20151010|R|5005
410|20151010|K|5001
652|20151010|K|5001

Briefly, Raku is directed to read input linewise using the -ne flags (-n runs the code once per line without autoprinting; -e supplies the code on the command line). Each line is output if, when split on the | vertical bar, its zero-indexed 3rd element (i.e. the fourth column) starts-with("5").

For more complicated CSV files, use Raku's Text::CSV module.

https://unix.stackexchange.com/a/705099/227738
https://raku.org
