I have an HTML file and want to extract the strings between a pattern. The file looks like this:

<span>aghahan.com</span>
<span>pouyamannequin.com</span>

I need the domains inside the span tags: aghahan.com, pouyamannequin.com

I tried this command:

sed -e 's/>!\(.*\)>.com<\/span>/\1/' domain.txt

but I get the wrong result. I would be thankful if anybody could help me.

  • See http://stackoverflow.com/a/1732454/1081936 and the xmllint and xmlstarlet answers at https://unix.stackexchange.com/q/83385/133219 – Ed Morton Mar 27 '20 at 04:38

4 Answers

As each line begins with <span> and ends with </span>:

sed 's|<span>\(.*\)</span>|\1|' domain.txt

You can also do it this way with awk by setting the field separator as either < or > and printing the third column:

awk -F '[<>]' '{print $3}' domain.txt

Output:

aghahan.com
pouyamannequin.com

These are about the simplest ways to do it, and both also work if the lines have trailing whitespace (sed leaves that whitespace in the output; awk drops it).
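To see why `$3` is the domain in the awk command, it can help to print every field awk produces for one line. This is only an illustrative sketch: `show_fields` is a hypothetical helper name, not part of the answer above.

```shell
#!/bin/sh
# Hypothetical helper: show each field awk sees when it splits a line
# on either "<" or ">" (the same separator as in the command above).
show_fields() {
    printf '%s\n' "$1" |
        awk -F '[<>]' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }'
}

show_fields '<span>aghahan.com</span>'
```

The first field is empty because the line starts with a separator, the second is the tag name `span`, and the third is the text between the tags.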

Nasir Riley

With sed:

sed 's/.*>\(.*\)<.*/\1/' domain.txt
zorbax

With Python and BeautifulSoup:

python -c '
from bs4 import BeautifulSoup
with open("domain.txt") as f:
    soup = BeautifulSoup(f, "html.parser")
# get_text(strip=True) also trims whitespace around the text inside each span
for span in soup.find_all("span"):
    print(span.get_text(strip=True))
'

This might be a bit overkill for such a simple task, but it is more robust and will be easier to adapt to harder cases, e.g. if the HTML looks like this instead:

<span>
 aghahan.com
</span>
<span>
 pouyamannequin.com
</span>
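For comparison, a line-oriented tool needs extra work on input like this, because the opening tag, the text, and the closing tag sit on different lines. The following is a rough awk sketch, not a real HTML parser: it assumes plain `<span>` tags with no attributes and no nested tags, and `extract_spans` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the text of every <span>...</span> pair,
# even when the tags and the text are on separate lines.
# Assumes bare <span> tags (no attributes) and no tags nested inside.
extract_spans() {
    awk '
    { text = text " " $0 }              # join all input lines into one string
    END {
        while (match(text, /<span>[^<]*<\/span>/)) {
            # skip the 6-char "<span>" and drop the 7-char "</span>"
            inner = substr(text, RSTART + 6, RLENGTH - 13)
            gsub(/^[ \t]+|[ \t]+$/, "", inner)   # trim surrounding blanks
            print inner
            text = substr(text, RSTART + RLENGTH)
        }
    }' "$@"
}

extract_spans domain.txt
```

Anything beyond this simple shape (attributes, nested markup, comments) is where a parser-based approach is the safer choice.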
pLumo

With awk piped into sed:

awk -F '>' '{print $2}' domain.txt | sed 's/<.*//'

Output:

aghahan.com
pouyamannequin.com
Praveen Kumar BS