17

Using Bash,

File:

<?xml version="1.0" encoding="UTF-8"?>
<blah>
    <blah1 path="er" name="andy" remote="origin" branch="master" tag="true" />
    <blah1 path="er/er1" name="Roger" remote="origin" branch="childbranch" tag="true" />
    <blah1 path="er/er2" name="Steven" remote="origin" branch="master" tag="true" />

</blah>

I have tried the following:

grep -i 'name="andy" remote="origin" branch=".*\"' <filename>

But it returns the whole line:

<blah1 path="er" name="andy" remote="origin" branch="master" tag="true" />

I would like to match the line based on the following:

name="andy"

I just want it to return:

master
glenn jackman
John
  • 3
    [I guess I'll leave this here.](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – JoL Jul 12 '19 at 16:07

6 Answers

43

Use an XML parser for parsing XML data. With xmlstarlet it just becomes an XPath exercise:

$ branch=$(xmlstarlet sel -t -v '//blah1[@name="andy"]/@branch' file.xml)
$ echo "$branch"
master
glenn jackman
17

With grep:

grep -Pio 'name="andy".*branch="\K[^"]*' file
  • -P enable perl regular expressions (PCRE)
  • -i ignore case
  • -o print only matched parts

In the regex, \K resets the start of the reported match: the part before the \K must still match, but it is not included in the output. The effect is like a zero-width lookbehind that may have variable length.
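
As a hedged illustration of what \K does to the reported match, the sketch below recreates the question's sample file in a temporary file and runs the pattern with and without \K. The variable names `xml_file`, `branch` and `with_prefix` are made up for this example, and it assumes GNU grep built with PCRE support.

```shell
#!/usr/bin/env bash
# Recreate the sample data from the question in a temporary file (illustrative only).
xml_file=$(mktemp)
cat > "$xml_file" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<blah>
    <blah1 path="er" name="andy" remote="origin" branch="master" tag="true" />
    <blah1 path="er/er1" name="Roger" remote="origin" branch="childbranch" tag="true" />
    <blah1 path="er/er2" name="Steven" remote="origin" branch="master" tag="true" />

</blah>
EOF

# Everything to the left of \K has to match, but -o does not print it.
branch=$(grep -Po 'name="andy".*branch="\K[^"]*' "$xml_file")
echo "$branch"

# The same pattern without \K: the matched prefix is printed as well.
with_prefix=$(grep -Po 'name="andy".*branch="[^"]*' "$xml_file")
echo "$with_prefix"

rm -f "$xml_file"
```

The first `echo` prints only the branch value, while the second also shows the attributes that were needed to anchor the match.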

Freddy
  • Ah, using Grep, I tried to do this way, but I guess my knowledge was very limited and I kept getting frustrated :$ – John Jul 11 '19 at 20:21
  • Wonderful solution, I learn every day. – Edward Jul 11 '19 at 20:28
  • 4
    Parsing XML using grep is asking for trouble. What if the order of the attributes changes? What if there's some other (non-`blah1`) element that has similar attributes? What if the branch name includes `\"`? Also, why `-i`? XML element and attribute names are case-sensitive. Now, all of these things are bugs waiting to surface at some point in the future. I recommend using the proper tool for the job; an XML parser. – marcelm Jul 12 '19 at 08:21
  • The `-i` is taken from OP and could be handy to handle the _attribute values_ (Roger, Steven). If the branch name had a `"`, then it should have been escaped as `&quot;`. Yes, you're right, XML may change, have line breaks, etc. pp., an XML parser _is_ definitely the better answer, but OP asked for `grep` and it could be that he knows what he is doing. – Freddy Jul 12 '19 at 10:50
11

Use xmllint to extract the value of the attribute using XPath:

xmllint --xpath 'string(/blah/blah1[@name="andy"]/@branch)' file.xml

It's better to use an XML parser to process XML, since the order of the attributes can change and line breaks can be inserted, which would leave the name and branch attributes on different lines of the file.
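
To make the line-break caveat concrete, here is a small sketch under stated assumptions: the sample element is re-wrapped so that branch sits on its own line, and a line-oriented grep pattern (GNU grep with PCRE support assumed) is run against it. The variable names `xml_file` and `line_based` are hypothetical, and the xmllint call only runs if xmllint happens to be installed.

```shell
#!/usr/bin/env bash
# Same data as the question, but with the attributes wrapped onto two lines.
xml_file=$(mktemp)
cat > "$xml_file" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<blah>
    <blah1 path="er" name="andy" remote="origin"
           branch="master" tag="true" />
</blah>
EOF

# Line-oriented matching: name and branch are on different lines, so nothing matches.
line_based=$(grep -Po 'name="andy".*branch="\K[^"]*' "$xml_file")
echo "grep found: '${line_based}'"

# The XPath query is indifferent to line breaks; skipped when xmllint is missing.
if command -v xmllint >/dev/null 2>&1; then
    xmllint --xpath 'string(/blah/blah1[@name="andy"]/@branch)' "$xml_file"
fi

rm -f "$xml_file"
```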

David Conrad
3

Using awk:

awk '/name="andy"/{ for (i=1;i<=NF;i++) { if ($i ~ /^branch=/) { sub(/branch=/, "", $i); gsub(/"/, "", $i); print $i } } }' input

This finds each line containing name="andy" and then loops through every whitespace-separated field on that line. If a field starts with branch=, we strip branch= and all double quotes from that field and print what remains.

sub(/branch=/, "", $i) looks for a match of branch= in the field and replaces it with "" (nothing).

gsub is similar, except that it replaces globally (all occurrences instead of just the first occurrence).
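
A hedged variant of the same loop, assuming the name to look up arrives in a shell variable: it passes the search text into awk with -v and works on a copy of the field instead of editing the record in place. The names `wanted`, `want`, `value`, `xml_file` and `branch` are made up for this sketch, and it still relies on name and branch being on the same line.

```shell
#!/usr/bin/env bash
# Recreate the sample data from the question in a temporary file (illustrative only).
xml_file=$(mktemp)
cat > "$xml_file" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<blah>
    <blah1 path="er" name="andy" remote="origin" branch="master" tag="true" />
    <blah1 path="er/er1" name="Roger" remote="origin" branch="childbranch" tag="true" />
    <blah1 path="er/er2" name="Steven" remote="origin" branch="master" tag="true" />

</blah>
EOF

wanted=andy   # hypothetical input: the value of the name attribute to look for

branch=$(awk -v want="name=\"$wanted\"" '
    index($0, want) {                      # plain substring test, no regex involved
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^branch=/) {
                value = $i                 # work on a copy, leave the record alone
                sub(/^branch=/, "", value) # drop the attribute name
                gsub(/"/, "", value)       # drop both double quotes
                print value
            }
        }
    }' "$xml_file")
echo "$branch"

rm -f "$xml_file"
```

Using index() rather than a regex match means characters in the name that are special in regular expressions are compared literally.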

jesse_b
  • Thank you so much, I will google to understand sub and gsub – John Jul 11 '19 at 20:16
  • I wish I could rate this up but another answer is better as you mentioned. – John Jul 11 '19 at 20:21
  • This is good but only works if branch is on the same line with name. – David Conrad Jul 12 '19 at 16:28
  • @DavidConrad: Yes that is the requirement. If you notice, branch is on every line but OP only wants to return the value of branch that **is on the same line** as the name. – jesse_b Jul 12 '19 at 16:39
  • That isn't exactly the *requirement*, though, that's just the way this file happens to look. XML allows whitespace, so if you break the lines on spaces it will still work with the highest-upvoted answer but it will break with awk. It's a caveat people using this solution should be aware of. That said, this is a good quick-and-dirty solution, and I upvoted you. – David Conrad Jul 12 '19 at 17:15
1

I think this works:

$ grep -i 'name="andy" remote="origin" branch=".*\"' <filename> | awk -F' ' '{print $5}' | sed -E 's/branch=\"(.*)\"/\1/'
master

The awk part makes sure only branch="master" is returned, the sed part gives back what's between the double quotes with a reference (the \1 matches the part between the parentheses).
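
For comparison, the same backreference idea can be sketched with sed alone, under the same assumption that name and branch share a line and that name comes first. This is only an illustration; `xml_file` and `branch` are hypothetical variable names.

```shell
#!/usr/bin/env bash
# Recreate the sample data from the question in a temporary file (illustrative only).
xml_file=$(mktemp)
cat > "$xml_file" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<blah>
    <blah1 path="er" name="andy" remote="origin" branch="master" tag="true" />
    <blah1 path="er/er1" name="Roger" remote="origin" branch="childbranch" tag="true" />
    <blah1 path="er/er2" name="Steven" remote="origin" branch="master" tag="true" />

</blah>
EOF

# -n suppresses the default output; the trailing p prints only lines where the
# substitution succeeded. \1 is whatever the \( ... \) group captured.
branch=$(sed -n 's/.*name="andy".*branch="\([^"]*\)".*/\1/p' "$xml_file")
echo "$branch"

rm -f "$xml_file"
```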

Now I know there are a lot of people out here with far more knowledge on the art that is awk and sed, so I'm prepared for some criticism :-)

Edward
  • But I am passing in the file though :$ Thanks a lot for the answer, I didn't think of using awk. I don't want to read each line, I kinda want to read the whole file and do this? Not possible? – John Jul 11 '19 at 20:10
  • Editing my answer to show you how to pipe it through. – Edward Jul 11 '19 at 20:11
  • This works, but like any solution that doesn't treat the XML as XML, it will stop working if the order of attributes changes or line breaks are inserted. – David Conrad Jul 13 '19 at 18:05
0

If you don't have access to xmllint or xmlstarlet on your machine, make sure to transform your XML into a single line before using grep, like this:

cat <filename> | tr -d '\n'

Now you can be sure that tags are not split across separate lines.

| grep -Eo  "<blah1[>\ ][^<]+name=\"andy\"[^>]+."

This cuts out the matching element (like the XPath /blah1[@name="andy"]):

<blah1 path="er" name="andy" remote="origin" branch="master" tag="true" />

now

| grep  -oP "(?<=branch\=\")[^\"]*"

This returns the attribute value (like the XPath /@branch):

master

All together:

cat <filename> | tr -d '\n'| grep -Eo  "<blah1[>\ ][^<]+name=\"andy\"[^>]+." | grep  -oP "(?<=branch\=\")[^\"]*"
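
The sketch below tries the flattening idea on a file whose attributes are wrapped across lines. It is an illustration under stated assumptions: GNU grep with PCRE support is available, `xml_file` and `branch` are hypothetical names, and newlines are turned into spaces with `tr '\n' ' '` (a small deviation from `tr -d '\n'`) so that attributes separated only by a line break do not get glued together.

```shell
#!/usr/bin/env bash
# Sample data with one element wrapped across two lines (illustrative only).
xml_file=$(mktemp)
cat > "$xml_file" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<blah>
    <blah1 path="er" name="andy" remote="origin"
           branch="master" tag="true" />
    <blah1 path="er/er1" name="Roger" remote="origin" branch="childbranch" tag="true" />
</blah>
EOF

# 1. Flatten the file to one line, replacing each newline with a space.
# 2. Cut out the blah1 element that carries name="andy".
# 3. Print only what follows branch=" inside that element.
branch=$(tr '\n' ' ' < "$xml_file" \
    | grep -Eo '<blah1[> ][^<]+name="andy"[^>]+.' \
    | grep -oP '(?<=branch=")[^"]*')
echo "$branch"

rm -f "$xml_file"
```

This still assumes the attribute values contain no `<` or `>` characters, which a real XML parser would not need to assume.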
AnJo