XML parsing using xmllint and customizing the output

Question

I have an XML file (say input.xml) with the following structure:

<?xml version="1.0"?>
  <TagA>
    <TagB>
      <File Folder="FOLDER1M\1" File="R1.txt" />
    </TagB>
    <TagB>
      <File Folder="FOLDER1M\2" File="R2.txt" />
    </TagB>
    <TagB>
      <File Folder="FOLDER2M\1" File="R3.txt" />
    </TagB>
  </TagA>

I need to parse this file and write the output to another file. The required output should be of the following form:

www.xyz.com\FOLDER1M\1\R1.txt
www.xyz.com\FOLDER1M\2\R2.txt
www.xyz.com\FOLDER2M\1\R3.txt

What I have got so far is:

echo 'cat /TagA/TagB/File/@*[name()="Folder" or name()="File"]' | xmllint --shell input.xml | grep '=' > xml_parsed

This gives me output of the form:

/ > cat /TagA/TagB/File/@*[name()="Folder" or name()="File"]
Folder="FOLDER1M\1"
File="R1.txt"
Folder="FOLDER1M\2"
File="R2.txt"
Folder="FOLDER2M\3"
File="R3.txt"

How can I get the required output instead of the current output?

What programming languages are you familiar with? Are you trying to solve this using only bash or some other shell? — slm, Apr 16 '13 at 21:48
I'm trying to get it done in bash. It's part of an overall automation I'm trying to achieve using bash scripting — NGambit, Apr 16 '13 at 21:49

Kusalananda · Answer 1 · 2021-07-10T16:59:57.623

It would be difficult to use xmllint to perform this action.

Using xmlstarlet:

xmlstarlet sel -t \
    -m '//TagB/File' \
    -v 'concat("www.xyz.com", "\", @Folder, "\", @File)' \
    -nl file.xml

Or, to pass the website address in safely from a shell variable on the command line:

thesite=www.xyz.com
xmlstarlet sel -t --var site="'$thesite'" \
    -m '//TagB/File' \
    -v 'concat($site, "\", @Folder, "\", @File)' \
    -nl file.xml

This first selects all TagB/File nodes in the document and then, for each of them, concatenates the string www.xyz.com with the value of the Folder attribute and the value of the File attribute, using \ as the delimiter. The -nl option emits a newline after each concatenated value.

The output, given the XML document in the question:

www.xyz.com\FOLDER1M\1\R1.txt
www.xyz.com\FOLDER1M\2\R2.txt
www.xyz.com\FOLDER2M\1\R3.txt

slm · Accepted Answer · 2013-04-17T01:55:25.687

Here's one way to do it. I put your output into a file called sample.txt to make it easier to test. In practice you can just append my commands to the end of your echo command.

sample.txt

Folder="FOLDER1M\1"
File="R1.txt"
Folder="FOLDER1M\2"
File="R2.txt"
Folder="FOLDER2M\3"
File="R3.txt"

command

% cat sample.txt | sed 'h;s/.*//;G;N;s/\n//g' | sed 's/Folder=\|"//g' | sed 's/File=/\\/' | sed 's/^/www.xyz.com\\/'

Breakdown of the command

Join every 2 lines together

# sed 'h;s/.*//;G;N;s/\n//g'
Folder="FOLDER1M\1"File="R1.txt"
Folder="FOLDER1M\2"File="R2.txt"
Folder="FOLDER2M\3"File="R3.txt"

Strip out Folder= and "

# sed 's/Folder=\|"//g'
FOLDER1M\1File=R1.txt
FOLDER1M\2File=R2.txt
FOLDER2M\3File=R3.txt

Replace File= with a '\'

# sed 's/File=/\\/'
FOLDER1M\1\R1.txt
FOLDER1M\2\R2.txt
FOLDER2M\3\R3.txt

Insert www.xyz.com

# sed 's/^/www.xyz.com\\/'
www.xyz.com\FOLDER1M\1\R1.txt
www.xyz.com\FOLDER1M\2\R2.txt
www.xyz.com\FOLDER2M\3\R3.txt
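
The four sed stages above could probably be collapsed into a single sed process. What follows is only a sketch: the function name to_paths is made up for illustration, and it assumes the input consists strictly of alternating Folder=/File= lines, as in sample.txt. It uses N to pull in the second line of each pair rather than the hold-space sequence, which should give the same joined line for this input.

```shell
# to_paths: hypothetical helper, not part of the original answer.
# Reads alternating Folder="..." / File="..." lines on stdin and
# prints one www.xyz.com\<folder>\<file> line per pair.
#
# Steps, in order:
#   N                     append the next line to the pattern space
#   s/\n//                drop the newline between the two lines
#   s/Folder=//           remove the Folder= label
#   s/"//g                remove all double quotes
#   s/File=/\\/           turn the File= label into a backslash
#   s/^/www.xyz.com\\/    prepend the site and a backslash
to_paths() {
  sed -e 'N' \
      -e 's/\n//' \
      -e 's/Folder=//' \
      -e 's/"//g' \
      -e 's/File=/\\/' \
      -e 's/^/www.xyz.com\\/'
}
```

With that defined, `to_paths < sample.txt` would stand in for the whole pipeline. Note that it avoids the `\|` alternation, which is a GNU sed extension.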

EDIT #1

The OP updated his question asking how to modify my answer to delete the first line of output, for example:

/ > cat /TagA/TagB/File/@*[name()="Folder" or name()="File"]
...
...

I mentioned to him that you can use grep -v ... to filter out lines that aren't relevant like so:

% cat sample.txt | grep -v "/ >" | sed 'h;s/.*//;G;N;s/\n//g' | sed 's/Folder=\|"//g' | sed 's/File=/\\/' | sed 's/^/www.xyz.com\\/'

Additionally, to write the whole result out to a file, redirect the output like so:

% cat sample.txt | grep -v "/ >" | sed 'h;s/.*//;G;N;s/\n//g' | sed 's/Folder=\|"//g' | sed 's/File=/\\/' | sed 's/^/www.xyz.com\\/' > /path/to/some/file.txt
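
As an alternative to the grep and sed chain, the filtering and joining might also be done in one awk call. This is a rough sketch under a few assumptions: format_paths is a hypothetical name, the input looks like the xmllint --shell output shown in the question (each attribute on its own line, starting with Folder= or File=), and the site name contains no backslashes (awk interprets escape sequences in -v values).

```shell
# format_paths: hypothetical helper, not part of the original answer.
# Reads xmllint --shell style output on stdin and prints
# <site>\<folder>\<file> for each Folder=/File= pair.
# The optional first argument is the site; it defaults to www.xyz.com.
format_paths() {
  site=${1:-www.xyz.com}
  # Split each line on double quotes, so $2 is the attribute value.
  # Lines that start with neither Folder= nor File= (for example the
  # "/ > cat ..." prompt line) match no rule and are silently skipped.
  awk -F'"' -v site="$site" '
    /^Folder=/ { folder = $2; next }               # remember the folder value
    /^File=/   { print site "\\" folder "\\" $2 }  # emit one joined path
  '
}
```

It could then be appended to the command from the question, for example: echo 'cat /TagA/TagB/File/@*[name()="Folder" or name()="File"]' | xmllint --shell input.xml | format_paths > xml_parsed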
