Extract an attribute value from XML

Question

Using Bash, how can I extract the value of the branch attribute from the line where name="andy"?

File:

<?xml version="1.0" encoding="UTF-8"?>
<blah>
    <blah1 path="er" name="andy" remote="origin" branch="master" tag="true" />
    <blah1 path="er/er1" name="Roger" remote="origin" branch="childbranch" tag="true" />
    <blah1 path="er/er2" name="Steven" remote="origin" branch="master" tag="true" />

</blah>

I have tried the following:

grep -i 'name="andy" remote="origin" branch=".*\"' <filename>

But it returns the whole line:

<blah1 path="er" name="andy" remote="origin" branch="master" tag="true" />

I would like to match the line based on the following:

name="andy"

I just want it to return:

master

[I guess I'll leave this here.](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — JoL, Jul 12 '19 at 16:07

score 43 · Accepted Answer · answered Jul 11 '19 at 20:53

Use an XML parser for parsing XML data. With xmlstarlet it just becomes an XPath exercise:

$ branch=$(xmlstarlet sel -t -v '//blah1[@name="andy"]/@branch' file.xml)
$ echo "$branch"
master

glenn jackman

This is the better answer since it will continue to work even after someone decided to change the order of the attributes. – Hermann Jul 11 '19 at 21:05
@Hermann Or changes the whitespace, or adds another element with attributes `name="andy" branch="foo"`, or changes the character encoding, or puts an escaped `&quot;` in the `branch` attribute, or or or... I agree; just use an XML parser! – marcelm Jul 12 '19 at 08:16
`branch=$(xmllint --xpath 'string(//blah1[@name="andy"]/@branch)' file.xml)` is the equivalent command with xmllint. – David Conrad Jul 12 '19 at 17:23
@DavidConrad make that an answer. – RonJohn Jul 13 '19 at 00:05
@RonJohn Done. I also decided to change it to an absolute XPath. – David Conrad Jul 13 '19 at 18:03
done :) thanks a lot – John Jul 22 '19 at 18:28

score 17 · Answer 2 · answered Jul 11 '19 at 20:19

With grep:

grep -Pio 'name="andy".*branch="\K[^"]*' file

-P enable perl regular expressions (PCRE)
-i ignore case
-o print only matched parts

In the regex, \K resets the start of the reported match: everything before the \K still has to match, but it is not included in the output. (It behaves much like a variable-length lookbehind.)
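To see what \K changes, here is a minimal sketch that runs the same pattern with and without it. It assumes GNU grep built with PCRE support (-P); the temporary file and the variable names `sample`, `with_prefix` and `value_only` are made up for illustration.

```shell
# Hypothetical sample file (an assumption for this sketch), modelled on the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<blah>
    <blah1 path="er" name="andy" remote="origin" branch="master" tag="true" />
    <blah1 path="er/er1" name="Roger" remote="origin" branch="childbranch" tag="true" />
</blah>
EOF

# Without \K, -o prints everything the pattern matched.
with_prefix=$(grep -Po 'name="andy".*branch="[^"]*' "$sample")

# With \K, the text before \K must still match but is dropped from the output.
value_only=$(grep -Po 'name="andy".*branch="\K[^"]*' "$sample")

printf '%s\n' "$with_prefix" "$value_only"
rm -f "$sample"
```

The first line printed still carries the `name="andy" ... branch="` prefix; the second is just the attribute value.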

Freddy

Ah, using Grep, I tried to do this way, but I guess my knowledge was very limited and I kept getting frustrated :$ – John Jul 11 '19 at 20:21
Wonderful solution, I learn every day. – Edward Jul 11 '19 at 20:28
Parsing XML using grep is asking for trouble. What if the order of the attributes changes? What if there's some other (non-`blah1`) element that has similar attributes? What if the branch name includes `&quot;`? Also, why `-i`? XML element and attribute names are case-sensitive. Now, all of these things are bugs waiting to surface at some point in the future. I recommend using the proper tool for the job; an XML parser. – marcelm Jul 12 '19 at 08:21
The `-i` is taken from OP and could be handy to handle the _attribute values_ (Roger, Steven). If the branch name had a `"`, then it should have been escaped as `&quot;`. Yes, you're right, XML may change, have line breaks, etc. pp., an XML parser _is_ definitely the better answer, but OP asked for `grep` and it could be that he knows what he is doing. – Freddy Jul 12 '19 at 10:50

score 11 · Answer 3 · answered Jul 13 '19 at 18:00

Use xmllint to extract the value of the attribute using XPath:

xmllint --xpath 'string(/blah/blah1[@name="andy"]/@branch)' file.xml

It's better to use an XML parser to process XML, since the order of the attributes can change and line breaks can be inserted, leaving the name and branch attributes on different lines of the file.
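As a rough illustration of that point, the sketch below rewrites the sample so that branch comes before name and the element spans two lines, which is still valid XML. The file contents and the variable names `reordered` and `grep_result` are assumptions for the demo; the grep call assumes GNU grep with -P, and the xmllint call is guarded because xmllint may not be installed.

```shell
# Hypothetical file: same data as the question, but with branch before name
# and the element split across two lines.
reordered=$(mktemp)
cat > "$reordered" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<blah>
    <blah1 path="er" branch="master"
           name="andy" remote="origin" tag="true" />
</blah>
EOF

# A line-based pattern that expects name before branch on one line finds nothing.
grep_result=$(grep -Po 'name="andy".*branch="\K[^"]*' "$reordered" || true)
echo "grep result: '${grep_result}'"

# The XPath query does not depend on attribute order or line breaks.
if command -v xmllint >/dev/null 2>&1; then
    xmllint --xpath 'string(/blah/blah1[@name="andy"]/@branch)' "$reordered"
fi
rm -f "$reordered"
```

The grep result is empty here, while the XPath query is unaffected by the reformatting.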

score 3 · Answer 4 · answered Jul 11 '19 at 20:12

Using awk:

awk '/name="andy"/{ for (i=1;i<=NF;i++) { if ($i ~ /^branch=/) { sub(/^branch=/, "", $i); gsub(/"/, "", $i); print $i } } }' input

This finds each line containing name="andy" and then loops through the whitespace-separated fields in that line. If a field starts with branch=, we remove branch= and all double quotes from that field and print what remains.

sub(/^branch=/, "", $i) looks for a match of branch= at the start of the field and replaces it with "" (nothing).

gsub is similar, except it replaces globally (all occurrences instead of just the first occurrence).
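The difference between sub and gsub is easiest to see on a single string. This is only a sketch: the input `field` and the variable names `first_only` and `all_quotes` are invented for illustration.

```shell
# Hypothetical input string, chosen because it contains two double quotes.
field='branch="master"'

# sub replaces only the first match in the record.
first_only=$(printf '%s\n' "$field" | awk '{ sub(/"/, ""); print }')

# gsub replaces every match in the record.
all_quotes=$(printf '%s\n' "$field" | awk '{ gsub(/"/, ""); print }')

printf '%s\n' "$first_only" "$all_quotes"
```

With sub, the closing quote is left behind; with gsub, both quotes are removed.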

edited Jul 11 '19 at 20:18

answered Jul 11 '19 at 20:12

jesse_b

Thank you so much, I will google to understand sub and gsub – John Jul 11 '19 at 20:16
I wish I could rate this up but another answer is better as you mentioned. – John Jul 11 '19 at 20:21
This is good but only works if branch is on the same line with name. – David Conrad Jul 12 '19 at 16:28
@DavidConrad: Yes that is the requirement. If you notice, branch is on every line but OP only wants to return the value of branch that **is on the same line** as the name. – jesse_b Jul 12 '19 at 16:39
That isn't exactly the *requirement*, though, that's just the way this file happens to look. XML allows whitespace, so if you break the lines on spaces it will still work with the highest-upvoted answer but it will break with awk. It's a caveat people using this solution should be aware of. That said, this is a good quick-and-dirty solution, and I upvoted you. – David Conrad Jul 12 '19 at 17:15

score 1 · Answer 5 · answered Jul 11 '19 at 20:07

I think this works:

$ grep -i 'name="andy" remote="origin" branch=".*\"' <filename> | awk -F' ' '{print $5}' | sed -E 's/branch=\"(.*)\"/\1/'
master

The awk part makes sure only branch="master" is returned (the fifth whitespace-separated field); the sed part gives back what's between the double quotes using a backreference (the \1 refers to the part captured by the parentheses).
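The two steps can be tried on a single line to see what each one contributes. This is a sketch only; the variables `line`, `fifth` and `value` are hypothetical names, and the line is copied from the sample file in the question.

```shell
# One line from the sample file, including its leading indentation.
line='    <blah1 path="er" name="andy" remote="origin" branch="master" tag="true" />'

# awk splits on runs of whitespace by default, so field 5 is the branch attribute.
fifth=$(printf '%s\n' "$line" | awk '{print $5}')

# sed captures the text between the quotes and replaces the whole match with it.
value=$(printf '%s\n' "$fifth" | sed -E 's/branch="(.*)"/\1/')

printf '%s\n' "$fifth" "$value"
```

Note that the field number is tied to the attribute order: if an attribute is added or moved, $5 is no longer the branch.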

Now I know there are a lot of people out here with far more knowledge on the art that is awk and sed, so I'm prepared for some criticism :-)

edited Jul 11 '19 at 20:12

answered Jul 11 '19 at 20:07

Edward

But I am passing in the file though :$ Thanks a lot for the answer, I didn't think of using awk. I don't want to read each line, I kinda want to read the whole file and do this? Not possible? – John Jul 11 '19 at 20:10
Editing my answer to show you how to pipe it through. – Edward Jul 11 '19 at 20:11
This works, but like any solution that doesn't treat the XML as XML, it will stop working if the order of attributes changes or line breaks are inserted. – David Conrad Jul 13 '19 at 18:05

score 0 · Answer 6 · answered Nov 20 '19 at 12:54

If you don't have access to xmllint or xmlstarlet on your machine, make sure to transform your XML to one line before using grep, like this:

cat <filename> | tr -d '\n'

Now you can be sure that tags are not broken up across separate lines.

| grep -Eo  "<blah1[>\ ][^<]+name=\"andy\"[^>]+."

will cut out (like in xpath /blah1[@name="andy"])

<blah1 path="er" name="andy" remote="origin" branch="master" tag="true" />

now

| grep  -oP "(?<=branch\=\")[^\"]*"

will return (like in xpath /@branch)

master

all together

cat <filename> | tr -d '\n'| grep -Eo  "<blah1[>\ ][^<]+name=\"andy\"[^>]+." | grep  -oP "(?<=branch\=\")[^\"]*"
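To check that joining the lines really helps, here is a sketch that runs the same three stages on a file where the element is split across two lines. The file contents and the variable names `multiline` and `value` are assumptions for the demo, and the second grep assumes GNU grep with -P. The patterns are the ones from the answer, written in single quotes so the inner double quotes need no escaping.

```shell
# Hypothetical file in which the andy element spans two lines.
multiline=$(mktemp)
cat > "$multiline" <<'EOF'
<blah>
    <blah1 path="er" name="andy"
           remote="origin" branch="master" tag="true" />
    <blah1 path="er/er1" name="Roger" remote="origin" branch="childbranch" tag="true" />
</blah>
EOF

# 1. Join all lines, 2. cut out the blah1 tag with name="andy",
# 3. print only what follows branch=" up to the next quote.
value=$(tr -d '\n' < "$multiline" \
    | grep -Eo '<blah1[> ][^<]+name="andy"[^>]+.' \
    | grep -oP '(?<=branch=")[^"]*')

echo "$value"
rm -f "$multiline"
```

This still depends on name appearing inside the tag text as written, so it handles line breaks but not every valid XML variation.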
