Extract string from each line of a file

Question

I have a file in which each line contains a sentence with one word enclosed between the characters > and <. For example:

Martin went shopping at >Wallmart< and lost his wallet
French food >tastes< great

I am looking for a command to run from the shell that will print the word inside ">" and "<" for every line.

Thanks in advance.

Is there only 1 of those words per line? And can there be any occurrence of a `>` or a `<` elsewhere than around that 1 occurrence? – Olivier Dulac, Jul 09 '19 at 12:22
Is Wallmart a better-constructed version of [Walmart](https://en.wikipedia.org/wiki/Walmart), perhaps? ;-) — Toby Speight, Jul 09 '19 at 15:34
@OlivierDulac no, it occurs more than once; my example was oversimplified. I was also wondering what happens if I want the word between, say, "food >" and "< great". – ZakS, Jul 09 '19 at 18:52

schrodingerscatcuriosity · Answer 1 · 2019-07-10T14:50:05.293

13

What about grep?

grep -oP "(?<=\>).*(?=<)"  file

Output:

Wallmart
tastes

EDIT:

Following @Toby Speight's comment, and assuming that there are only word characters between > and <, the command should be the following, to avoid matching > and < in other contexts:

grep -oP "(?<=\>)\w+(?=<)"  file

edited Jul 10 '19 at 14:50

answered Jul 09 '19 at 00:21

schrodingerscatcuriosity

Replace the `.*` with `\w+` if `<` and `>` may occur in other contexts, and we want only the matches where they delimit single words. – Toby Speight Jul 09 '19 at 15:32
Could you provide an explanation for the code? – user1993 Jul 09 '19 at 23:35
@user1993 The `-o` option prints *only* the matched text rather than the whole line (grep's default behaviour). The `-P` option enables Perl-compatible regular expressions. `(?<=\>)content(?=<)` matches *content* only where it is preceded by `>` and followed by `<`; the lookbehind and lookahead are not part of the match, so only *content* (itself another regular expression) is printed. – schrodingerscatcuriosity Jul 10 '19 at 15:04
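
To make the lookaround explanation concrete, here is a small sketch. It assumes a grep with PCRE support (`-P` exists in GNU grep but not in every implementation). The sample line and the function names `words_between` and `word_after_food` are made up for illustration; the second function covers the "food >" and "< great" case raised in the comments on the question.

```shell
#!/bin/sh
# Sketch only: requires a grep that supports -P (PCRE), e.g. GNU grep.

# Print every run of word characters that sits directly between '>' and '<'.
# With -o each match goes on its own line, so several matches per input
# line are all reported.
words_between() {
    grep -oP '(?<=>)\w+(?=<)'
}

# Print the word enclosed by the longer contexts 'food >' and '< great'.
# The lookbehind holds a fixed literal string, which PCRE accepts.
word_after_food() {
    grep -oP '(?<=food >)\w+(?=< great)'
}

# Hypothetical sample input with two delimited words on one line.
printf '%s\n' 'He went to >Wallmart< and then >home< again' | words_between
printf '%s\n' 'French food >tastes< great' | word_after_food
```

With a PCRE-capable grep, this should print `Wallmart`, `home` and `tastes`, one per line.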

Nasir Riley · Accepted Answer · 2019-07-09T17:57:21.620

8

For awk:

awk -F '[><]' '{print $2}' file

That sets the field separator as either > or < and prints the second field which is what is between those two characters.

For sed:

sed 's|.*>\(.*\)<.*|\1|' file

That uses the capture group \(...\) to keep what lies between the > and the <, and replaces the whole line with it; the leading and trailing .* consume everything before the > and everything after the <.

The output

Wallmart
tastes
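
The two commands behave differently on lines that contain no `>word<` pair: awk prints an empty line, while sed prints the line unchanged. The following is a hedged sketch, not part of the original answer, of variants that print nothing for such lines. The function names and the sample input are illustrative only.

```shell
#!/bin/sh
# sed: -n suppresses automatic printing, and the trailing p prints a line
# only when the substitution happened, so unmatched lines are dropped.
extract_sed() {
    sed -n 's|.*>\(.*\)<.*|\1|p'
}

# awk: only print when the line split into at least three fields, i.e. it
# contains at least two separators. This checks the count of separators,
# not their order.
extract_awk() {
    awk -F '[><]' 'NF >= 3 { print $2 }'
}

# Hypothetical input: one line with a delimited word, one without.
printf '%s\n' 'French food >tastes< great' 'no delimiters here' | extract_sed
printf '%s\n' 'French food >tastes< great' 'no delimiters here' | extract_awk
```

Each pipeline prints only `tastes` for this input; the second line is skipped.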

edited Jul 09 '19 at 17:57

answered Jul 09 '19 at 01:06

Nasir Riley

Thanks for the explanation! – ZakS Jul 09 '19 at 06:34
To be completely honest, the awk solution would also match `anotherthing>` ... and if a line contains, say, `>> this is >important<`, it would yield `""` (as it is the empty field between the first 2 `>`). And your sed will match the *longest* span between a `>` and a `<` in a line. You could use a (little bit) better version: `sed -e 's#.*>\([^><]*\)<.*#\1#'` (it replaces the line with an occurrence of `>...<` that contains no other `>` or `<`). – Olivier Dulac Jul 09 '19 at 12:17
@OlivierDulac That's true, but the questioner has indicated that that isn't the case. If it were, then I'd keep the `sed` solution and use Perl instead of `awk`. – Nasir Riley Jul 09 '19 at 13:47
@OlivierDulac in your example, wouldn't it create 3 fields, one empty, one " this is " between the 2nd and 3rd >'s, and then finally a third field "important"? – ZakS Jul 09 '19 at 18:56
Note that those two commands have different behavior on lines that don't have the `>word<`: the `awk` solution will replace them with a blank line, and `sed` will print them unchanged. – Kevin Jul 09 '19 at 19:01
@Kevin I know but it's been confirmed that all of the lines have the same format specified in the sample text. If they didn't then my answer would change. It would be rather difficult to have an answer that would cover every possibility. – Nasir Riley Jul 09 '19 at 19:21
@ZakS the answer's awk would create 5 fields, not 3. Since each occurrence of the separator (which for the awk is set to "exactly 1 `>` or 1 `<`") delimits a field, there would also be an empty first field (before the first `>`) and an empty 5th one after the last `<`. – Olivier Dulac Jul 10 '19 at 08:01
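
The field counts discussed in these comments can be checked directly. A minimal sketch, using the example line from the comments as input; the helper name `show_fields` is made up for illustration.

```shell
#!/bin/sh
# Print the number of fields, then each field in brackets, using the same
# field separator as the answer: any single '>' or '<'.
show_fields() {
    awk -F '[><]' '{
        printf "NF=%d\n", NF
        for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i
    }'
}

printf '%s\n' '>> this is >important<' | show_fields
```

This reports `NF=5`: fields 1, 2 and 5 are empty, field 3 is ` this is ` and field 4 is `important`, so `print $2` yields an empty line for this input.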

score 2 · Answer 3 · answered Jul 09 '19 at 09:00

I tried the command below and it worked fine:

awk -F ">" '{print $2}' filename| sed  "s/<.*//g"

output

Wallmart
tastes

python

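#!/usr/bin/env python3
# Print the text between the first '>' and the next '<' on each line.
# Assumes every line contains a '>'; otherwise the [1] index raises IndexError.
with open('filename') as f:
    for line in f:
        word = line.split('>')[1].split('<')[0].strip()
        print(word)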

output

Wallmart
tastes

score 0 · Answer 4 · edited Feb 18 '21 at 12:28

awk -F ">" '{print $2}' filename| sed  "s/<.*//g"

I have used this one, and it also works with longer strings in place of > and <:

awk -F "string1" '{print $2}' filename| sed  "s/string2.*//g"
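
As a concrete instance of the generic form, and assuming the delimiters from the follow-up comment on the question ("food >" and "< great"), the pipeline might look like the sketch below. Note that awk treats a multi-character `-F` value as a regular expression, and sed does the same with its pattern, so delimiters containing characters such as `.` or `*` would need escaping. The function name is illustrative.

```shell
#!/bin/sh
# Take the text after the first 'food >' and cut it at the first '< great'.
between_strings() {
    awk -F 'food >' '{ print $2 }' | sed 's/< great.*//'
}

printf '%s\n' 'French food >tastes< great' | between_strings
```

For the sample line this prints `tastes`.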

edited Feb 18 '21 at 12:28

GAD3R

answered Feb 18 '21 at 11:33

Thierry Beliere
