Simple way to extract value from HTML

Question

I have a very simple HTML file with a value inside. The value is 57 in this case.

<eta version="1.0"><value uri="/user/var/48/10391/0/0/12528" strValue="57" unit="%" decPlaces="0" scaleFactor="10" advTextOffset="0">572</value></eta>

What is an easy way to extract the value in a bash script and store it in a variable? Is there also a way to skip the intermediate step of having wget save to a file, so that I can work directly with the wget output instead of opening a stored file?

To clarify: I could do a simple wget, save to a file and check the file for the value. Or is there a better way to do the wget somewhere in RAM, without an explicit file being stored?

Thanks a million times, highly appreciated Norbert

HTML is a subset of XML. You need to read up on using an XML Reader in Linux, which is most likely why you were downvoted. — eyoung100, Nov 12 '14 at 21:40
@eyoung100 [HTML5 is not XML](https://stackoverflow.com/a/39560454/4970442) — Pablo A, Dec 31 '19 at 03:48

jimmij · Answer 1 · 2014-11-13T13:03:40.017

12

You can extract the value in your example with grep and assign it to a variable in the following way:

$ x=$(wget -O - 'http://foo/bar.html' | grep -Po '<value.*strValue="\K[[:digit:]]*')
$ echo $x
57

Explanation:

$(): command substitution
grep -P: grep with Perl regexp enabled
grep -o: grep shows only the matched part of the line
\K: do not show in the output anything that was matched up to this point
wget -O -: prints downloaded document to standard output (not to file)
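
The effect of `-o` and `\K` can be tried without any network access. The following is a minimal sketch that feeds the sample line from the question to the same grep through `printf` instead of wget; the variable names `xml` and `with_context` are only illustrative, and it assumes a grep built with `-P` support (GNU grep).

```shell
# Sample line copied from the question; it stands in for the wget output.
xml='<eta version="1.0"><value uri="/user/var/48/10391/0/0/12528" strValue="57" unit="%" decPlaces="0" scaleFactor="10" advTextOffset="0">572</value></eta>'

# Without \K, -o prints the whole match, including the leading context.
with_context=$(printf '%s\n' "$xml" | grep -Po '<value.*strValue="[[:digit:]]*')

# With \K, everything matched before it is dropped from the output.
x=$(printf '%s\n' "$xml" | grep -Po '<value.*strValue="\K[[:digit:]]*')

echo "$with_context"
echo "$x"
```

The first `echo` prints the match starting at `<value`, the second prints only the digits.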

However, as a general approach it is better to use a dedicated parser for HTML code.

edited Nov 13 '14 at 13:03

answered Nov 12 '14 at 21:49

jimmij

What is \K doing? Can you explain? – Hackaholic Nov 12 '14 at 21:56
See updated edit. `\K` works only with `-P` option. – jimmij Nov 12 '14 at 22:01
+ 1 for \K using perl regex – Hackaholic Nov 12 '14 at 22:03
+1 but since you're using `-P`, why not use `\d+` instead of `[[:digit:]]*`? – terdon Nov 12 '14 at 23:29
Nice explanations. No temp file would be nice. – geedoubleya Nov 12 '14 at 23:59
I know it is a stupid question, but can you show what it would look like if I do a wget to a website, so there is no need to store a local file in between? – njordan Nov 13 '14 at 12:47
@njordan See the update, you just need to use `-O -` option with `wget` as in [terdon](http://unix.stackexchange.com/a/167656/80886) answer. The `-` means to use standard output for downloaded document, not a file. – jimmij Nov 13 '14 at 13:08
One more question: it seems that [[:digit:]]* only extracts an integer value. I used the same great line to extract another parameter that is a float (e.g., 15,4) and it cuts at 15. What do I have to do to take the complete string in the "" as a float variable? – njordan Nov 22 '14 at 23:09
Try `grep -Po ' – jimmij Nov 22 '14 at 23:38
Should my last question not also work directly this way: grep -Po ' – njordan Nov 25 '14 at 21:39
@njordan `grep -Po ' – jimmij Nov 25 '14 at 21:59
Sorry, I have another use case now: what if I need to take a STRING, so everything between ""? Thanks – njordan Dec 11 '14 at 21:59
@njordan as I've said in last comment with `awk` that would be `awk -F'"' '{print $6}'`, just change 6 to string position. If you want `grep` then crucial regexp would be `"[^"]*"`, but the whole command would depend on specific case. – jimmij Dec 11 '14 at 22:12
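
The float and string cases raised in the comments above can be sketched as follows. The input line is hypothetical: it reuses the question's format with the `15,4` value mentioned in the comments, and the names `line`, `raw`, `field` and `dotted` are only illustrative. The final `tr` step is an added assumption for the case where the value is later used in arithmetic that expects a decimal point.

```shell
# Hypothetical line with a comma-decimal value, modelled on the question's format.
line='<eta version="1.0"><value uri="/user/var/48/10391/0/0/12528" strValue="15,4" unit="%">154</value></eta>'

# [^"]* takes everything up to the closing quote, so the value is not cut at the comma.
raw=$(printf '%s\n' "$line" | grep -Po 'strValue="\K[^"]*')

# awk alternative from the comment: split on double quotes; here the value is field 6.
field=$(printf '%s\n' "$line" | awk -F'"' '{print $6}')

# Swap the decimal comma for a dot if the value will be used in arithmetic.
dotted=$(printf '%s\n' "$raw" | tr ',' '.')

echo "$raw" "$field" "$dotted"
```

The awk field number depends on how many quoted attributes come before `strValue`, so it has to be adjusted for other documents.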

score 5 · Answer 2 · answered Nov 12 '14 at 23:29

I have no idea what wget you're talking about but I am guessing that you want to download the file. If so, yes, you can download it and parse it with no intermediate temp file:

$ value=$(wget -O - http://example.com/file.html | grep -oP 'strValue="\K[^"]+')
$ echo $value
57
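
For use in a script, the pipeline above could be wrapped in small functions so that a missing value is detected. This is only a sketch: the function names `extract_strvalue` and `fetch_value` are made up for illustration, `-q` is added to silence wget's progress output, and the extraction is exercised here on the sample line from the question rather than on a real download.

```shell
# extract_strvalue: read a document on stdin, print the first strValue attribute.
extract_strvalue() {
    grep -oP 'strValue="\K[^"]+' | head -n 1
}

# fetch_value URL: download to stdout (no temp file) and extract the value.
# -q silences wget's progress output; -O - sends the document to stdout.
fetch_value() {
    wget -q -O - "$1" | extract_strvalue
}

# Offline check with the sample line from the question.
sample='<eta version="1.0"><value uri="/user/var/48/10391/0/0/12528" strValue="57" unit="%" decPlaces="0" scaleFactor="10" advTextOffset="0">572</value></eta>'
value=$(printf '%s\n' "$sample" | extract_strvalue)

if [ -n "$value" ]; then
    echo "value=$value"
else
    echo "no strValue found" >&2
fi
```

With a real server this would be called as `value=$(fetch_value 'http://example.com/file.html')`.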

terdon

Yes, you are right `\d+` is shorter, also `[^"]+` is better because value inside `""` probably(?) doesn't need to be numerical. – jimmij Nov 12 '14 at 23:39

peak · Answer 3 · 2017-06-21T08:15:44.620

Apart from the wget -O - ... technique, you can also use curl -Ss ... to avoid the hassle of a temporary file.
The following illustrates the use of pup (https://github.com/ericchiang/pup), which supports a CSS-based query language.

a) To extract the "text" value of the <value> tag:

pup 'value text{}'  # yields 572

b) To extract the value of the strValue attribute of the <value> tag:

pup 'value attr{strvalue}' # yields 57
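
Combining the two points, curl can stream the page straight into pup. This is a sketch only: the function name `fetch_with_pup` is hypothetical, the URL in the usage comment is a placeholder, and it assumes both curl and pup are installed.

```shell
# fetch_with_pup URL: stream the page from curl straight into pup, no temp file.
# -s hides curl's progress meter, -S still reports errors.
fetch_with_pup() {
    curl -Ss "$1" | pup 'value attr{strvalue}'
}

# Usage (placeholder URL, not run here):
#   value=$(fetch_with_pup 'http://example.com/file.html')
#   echo "$value"
```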

DisplayName · Answer 4 · 2014-11-12T22:01:32.983

-1

cat input | grep -o strValue=".*" | sed 's/strValue=//g' | sed 's/"//g'

edited Nov 12 '14 at 22:01

answered Nov 12 '14 at 21:36

DisplayName

Useless use of cat and doesn't work anyway. If you really want to involve `sed` try `sed 's/.*strValue="\([[:digit:]]*\).*/\1/' file`. – jimmij Nov 12 '14 at 22:22
Yeah, i tried.. – DisplayName Nov 12 '14 at 23:55
I suck at everything. – DisplayName Nov 13 '14 at 00:19
No, you just need some practise and you have very good questions, I like especially this one: http://unix.stackexchange.com/q/159489/80886 for obvious reason. BTW, it was not me who downvoted. – jimmij Nov 13 '14 at 00:24
It doesn't matter who downvoted, I don't really care about internet points that much :). – DisplayName Nov 13 '14 at 00:53
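
The sed approach suggested in the comments above can be sketched with a capture group, which may be useful where grep has no `-P` option. This variant is an adaptation, not the commenter's exact command: it uses `[^"]*` so non-numeric values also match, and `-n` with the `p` flag so nothing is printed for lines without a match. The variable names are illustrative.

```shell
# Sample line copied from the question.
sample='<eta version="1.0"><value uri="/user/var/48/10391/0/0/12528" strValue="57" unit="%" decPlaces="0" scaleFactor="10" advTextOffset="0">572</value></eta>'

# \( ... \) is a basic-regex capture group; \1 refers back to it.
# -n plus the p flag prints only lines where the substitution succeeded.
value=$(printf '%s\n' "$sample" | sed -n 's/.*strValue="\([^"]*\)".*/\1/p')

echo "$value"
```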
