How to extract a string between start and end patterns with sed or awk?

Question

I have an HTML file and I want to extract the string between two patterns. The file looks like this:

<span>aghahan.com</span>
<span>pouyamannequin.com</span>

I need the domains inside the span tags: aghahan.com, pouyamannequin.com

I tried this command:

sed -e 's/>!\(.*\)>.com<\/span>/\1/' domain.txt

but I get the wrong result. I would be thankful if anybody could help me.

See http://stackoverflow.com/a/1732454/1081936 and the xmllint and xmlstarlet answers at https://unix.stackexchange.com/q/83385/133219 — Ed Morton, Mar 27 '20 at 04:38

score 1 · Answer 1 · answered Mar 26 '20 at 12:32

As each line begins with <span> and ends with </span>:

sed 's|<span>\(.*\)</span>|\1|' domain.txt

You can also do it with awk by setting the field separator to either < or > and printing the third field:

awk -F '[<>]' '{print $3}' domain.txt

Output:

aghahan.com
pouyamannequin.com

These are the simplest ways to do it, and both also work if the lines have trailing whitespace (the sed command leaves that whitespace in its output, while awk drops it).
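
To see why the awk command prints the third field, here is a minimal sketch in Python that mimics the split on < and >. It only illustrates the field numbering and is not part of the original answer; the sample line is taken from the question.

```python
import re

line = "<span>aghahan.com</span>"

# Split on either "<" or ">", as awk does with -F '[<>]'
fields = re.split(r"[<>]", line)
print(fields)

# awk numbers fields from 1, so $3 corresponds to index 2 here
domain = fields[2]
print(domain)
```

The line starts with a separator, so the first field is empty, the second is the tag name, and the third is the text between the tags.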

score 0 · Answer 2 · answered Mar 26 '20 at 10:50

With sed:

sed 's/\(.*\)>\(.*\)<\(.*\)/\2/g' domain.txt

zorbax


pLumo · Answer 3 · 2020-03-26T11:39:00.370

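With Python and BeautifulSoup:

python -c '
from bs4 import BeautifulSoup
# Parse the file with the parser from the standard library
with open("domain.txt") as f:
    soup = BeautifulSoup(f, "html.parser")
# Print the text of every <span>, stripped of surrounding whitespace
for span in soup.find_all("span"):
    print(span.get_text(strip=True))
'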

This might be overkill for such a simple task, but it is more robust and easier to adapt to harder cases, e.g. if the HTML is laid out differently, like this:

<span>
 aghahan.com
</span>
<span>
 pouyamannequin.com
</span>
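
If installing BeautifulSoup is not an option, the multi-line layout above can probably also be handled with the standard library alone. The following is a rough sketch under that assumption: the class name SpanText and the inline sample string are hypothetical, and nested span elements are not handled.

```python
from html.parser import HTMLParser


class SpanText(HTMLParser):
    """Collect the whitespace-stripped text of every <span> element."""

    def __init__(self):
        super().__init__()
        self.in_span = False
        self.parts = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.in_span = True
            self.parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a <span>
        if self.in_span:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self.in_span:
            self.results.append("".join(self.parts).strip())
            self.in_span = False


# Sample input in the multi-line layout (stands in for the contents of domain.txt)
html = "<span>\n aghahan.com\n</span>\n<span>\n pouyamannequin.com\n</span>\n"

parser = SpanText()
parser.feed(html)
parser.close()
domains = parser.results
print("\n".join(domains))
```

To run it on the real file, the sample string could be replaced with the result of reading domain.txt.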

score 0 · Answer 4 · answered Mar 26 '20 at 20:02

awk -F ">" '{print $2}' filename | sed "s/<.*//g"

Output:

aghahan.com
pouyamannequin.com

Praveen Kumar BS

