
I'm trying to create a CSV file from an XML file, using only some of the information in the XML.

This is my XML:

<?xml version="1.0" encoding="UTF-8"?>
<hashlist version = "2.0" xmlns = "urn:ASC:MHL:v2.0">
    <creatorinfo>
        <creationdate>2022-11-06T01:22:14+00:00</creationdate>
        <hostname>MacBook-Pro-de-Baptiste.local</hostname>
        <tool>ARRI HDET job</tool>
    </creatorinfo>
    <processinfo>
        <process>in-place</process>
    </processinfo>
    <hashes>
        <hash>
            <path size="3435540600" lastmodificationdate="2022-11-06T01:21:00+00:00">A_0900C001_220927_102036_a1BZ0_hde.mxf</path>
            <xxh64 action="original" hashdate="2022-11-06T01:21:00+00:00">3f93f215ec277fc7</xxh64>
        </hash>
        <hash>
            <path size="3280802936" lastmodificationdate="2022-11-06T01:21:14+00:00">A_0900C002_220927_102120_a1BZ0_hde.mxf</path>
            <xxh64 action="original" hashdate="2022-11-06T01:21:14+00:00">6a3c2be7577f31bd</xxh64>
        </hash>
        <hash>
            <path size="2657895544" lastmodificationdate="2022-11-06T01:21:26+00:00">A_0900C003_220927_102240_a1BZ0_hde.mxf</path>
            <xxh64 action="original" hashdate="2022-11-06T01:21:26+00:00">6606cf4d3b1ebc17</xxh64>
        </hash>
        <hash>
            <path size="4988562588" lastmodificationdate="2022-11-06T01:21:49+00:00">A_0900C004_220927_102334_a1BZ0_hde.mxf</path>
            <xxh64 action="original" hashdate="2022-11-06T01:21:49+00:00">cd0a2dca6f8f6c21</xxh64>
        </hash>
        <hash>
            <path size="633346644" lastmodificationdate="2022-11-06T01:21:52+00:00">A_0900C005_220927_102506_a1BZ0_hde.mxf</path>
            <xxh64 action="original" hashdate="2022-11-06T01:21:52+00:00">e617e05dae72e5a6</xxh64>
        </hash>
        <hash>
            <path size="3889553016" lastmodificationdate="2022-11-06T01:22:13+00:00">A_0900C006_220927_102615_a1BZ0_hde.mxf</path>
            <xxh64 action="original" hashdate="2022-11-06T01:22:13+00:00">d6e487264d1246b0</xxh64>
        </hash>
        <hash>
            <path size="273064020" lastmodificationdate="2022-11-06T01:22:14+00:00">A_0900C007_220927_102720_a1BZ0_hde.mxf</path>
            <xxh64 action="original" hashdate="2022-11-06T01:22:14+00:00">80f5f5683e1f326d</xxh64>
        </hash>
    </hashes>
</hashlist>

And I want something like this:

A_0900C001_220927_102036_a1BZ0_hde.mxf;3f93f215ec277fc7
A_0900C002_220927_102120_a1BZ0_hde.mxf;6a3c2be7577f31bd

etc...

I've tried

xmllint --xpath '/hashlist/hashes/hash/path/text()' file.xml

but it returns "XPath set is empty".

Marcus Müller
MrBotus

3 Answers


My xmllint-foo is a bit rusty, especially with regard to the proper use of namespaces, so I would probably use xmlstarlet instead:

xmlstarlet sel -N ns='urn:ASC:MHL:v2.0' --template \
    --match '/ns:hashlist/ns:hashes/ns:hash' \
    --value-of 'concat(ns:path, ";", ns:xxh64)' --nl \
    file.xml

This matches each hash node by its absolute path and then outputs the concatenation of the values of its path and xxh64 child nodes, with a ; in between them (followed by a newline character).

Since the document declares a default namespace (the xmlns attribute on the root element, with no prefix), every element in it belongs to that namespace. We therefore need to bind a prefix of our own to the namespace URI from the root element and then use that prefix on each node name in our XPath expressions.

However, as was pointed out in a comment (now deleted), xmlstarlet also lets you use the prefix _, which it binds automatically to the default namespace declared on the document's root element:

xmlstarlet sel --template \
    --match '/_:hashlist/_:hashes/_:hash' \
    --value-of 'concat(_:path, ";", _:xxh64)' --nl \
    file.xml

Given the XML in the question, either of the above commands would produce

A_0900C001_220927_102036_a1BZ0_hde.mxf;3f93f215ec277fc7
A_0900C002_220927_102120_a1BZ0_hde.mxf;6a3c2be7577f31bd
A_0900C003_220927_102240_a1BZ0_hde.mxf;6606cf4d3b1ebc17
A_0900C004_220927_102334_a1BZ0_hde.mxf;cd0a2dca6f8f6c21
A_0900C005_220927_102506_a1BZ0_hde.mxf;e617e05dae72e5a6
A_0900C006_220927_102615_a1BZ0_hde.mxf;d6e487264d1246b0
A_0900C007_220927_102720_a1BZ0_hde.mxf;80f5f5683e1f326d
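
If xmlstarlet is not available, the same prefix-to-URI binding can be sketched with Python's standard xml.etree.ElementTree module. This is only an illustration of the namespace idea above, not part of the original answer; the function name mhl_pairs, the prefix ns, and the shortened SAMPLE document are all made up for the example.

```python
import xml.etree.ElementTree as ET

# Bind our own prefix to the URI from the root element's xmlns attribute.
# The prefix "ns" is arbitrary; only the URI has to match the document.
NS = {"ns": "urn:ASC:MHL:v2.0"}

# Shortened copy of the XML from the question (assumption: two entries suffice).
SAMPLE = """<hashlist version="2.0" xmlns="urn:ASC:MHL:v2.0">
  <hashes>
    <hash>
      <path size="3435540600">A_0900C001_220927_102036_a1BZ0_hde.mxf</path>
      <xxh64 action="original">3f93f215ec277fc7</xxh64>
    </hash>
    <hash>
      <path size="3280802936">A_0900C002_220927_102120_a1BZ0_hde.mxf</path>
      <xxh64 action="original">6a3c2be7577f31bd</xxh64>
    </hash>
  </hashes>
</hashlist>"""


def mhl_pairs(xml_text):
    """Return one 'path;hash' string per <hash> element in the namespace."""
    root = ET.fromstring(xml_text)  # root is the <hashlist> element
    lines = []
    for h in root.findall("ns:hashes/ns:hash", NS):
        path = h.findtext("ns:path", default="", namespaces=NS)
        digest = h.findtext("ns:xxh64", default="", namespaces=NS)
        lines.append(f"{path};{digest}")
    return lines


for line in mhl_pairs(SAMPLE):
    print(line)
```

As with the unprefixed xmllint query in the question, dropping the ns: prefix from the paths here would select nothing, because the unprefixed names would refer to elements in no namespace.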

Using xq (the XML wrapper around jq that ships with Andrey Kislyuk's yq package), you may get a properly quoted CSV document using the following:

xq -r '.hashlist.hashes.hash | map([.path."#text",.xxh64."#text"] | @csv)[]' file.xml

or,

xq -r '.hashlist.hashes.hash[] | [.path."#text",.xxh64."#text"] | @csv' file.xml

If you want unquoted fields with ; as the delimiter, you may replace @csv with join(";") in the above commands.
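
To see why quoting matters when ; is the delimiter, here is a small sketch using Python's standard csv module. It is an illustration only: rows_to_csv is a hypothetical helper, and the second row is a made-up filename that happens to contain a ;. Note that csv.QUOTE_MINIMAL quotes a field only when it needs it, whereas jq's @csv quotes every string.

```python
import csv
import io


def rows_to_csv(rows, delimiter=";"):
    """Serialise rows as delimiter-separated text, quoting only where needed."""
    buf = io.StringIO()
    writer = csv.writer(
        buf,
        delimiter=delimiter,
        quoting=csv.QUOTE_MINIMAL,  # quote fields containing the delimiter or quotes
        lineterminator="\n",
    )
    writer.writerows(rows)
    return buf.getvalue()


rows = [
    ("A_0900C001_220927_102036_a1BZ0_hde.mxf", "3f93f215ec277fc7"),
    ("odd;name.mxf", "0000000000000000"),  # made-up row to show quoting
]
print(rows_to_csv(rows), end="")
```

A plain join(";") would write the second row as three fields; the quoted form keeps it at two.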

Kusalananda

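The problem is that xmllint's --xpath option is not namespace-friendly. Every element in your file belongs to the default namespace urn:ASC:MHL:v2.0, so an unprefixed name such as hashlist matches nothing.

To query a file that uses a namespace, match on each element's local name instead: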

xmllint --xpath "/*[local-name()='hashlist']/*[local-name()='hashes']/*[local-name()='hash']/*[local-name()='path']/text()" file.xml

Or just remove the namespace declaration from the original file beforehand.
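
The xmllint command above prints only the paths. As a rough sketch of the same "ignore the namespace" idea that also picks up the hashes, Python's xml.etree.ElementTree accepts a {*} wildcard in front of a tag name, which plays the role of local-name(). This assumes Python 3.8 or later; the function name pairs_any_ns and the shortened sample documents are invented for the example.

```python
import xml.etree.ElementTree as ET

# One-entry excerpt of the question's XML, with its default namespace.
WITH_NS = (
    '<hashlist version="2.0" xmlns="urn:ASC:MHL:v2.0"><hashes><hash>'
    '<path size="3435540600">A_0900C001_220927_102036_a1BZ0_hde.mxf</path>'
    '<xxh64 action="original">3f93f215ec277fc7</xxh64>'
    '</hash></hashes></hashlist>'
)
# The same excerpt with the namespace declaration removed.
WITHOUT_NS = WITH_NS.replace(' xmlns="urn:ASC:MHL:v2.0"', "")


def pairs_any_ns(xml_text):
    """Return (path, hash) tuples, matching tags in any namespace or none."""
    root = ET.fromstring(xml_text)
    pairs = []
    # "{*}hash" matches elements whose local name is "hash" (Python 3.8+).
    for h in root.iterfind(".//{*}hash"):
        path = h.findtext("{*}path", default="")
        digest = h.findtext("{*}xxh64", default="")
        pairs.append((path, digest))
    return pairs


for path, digest in pairs_any_ns(WITH_NS):
    print(f"{path};{digest}")
```

Because the wildcard matches with or without a namespace, the same function also works on a copy of the file from which the xmlns declaration has been stripped.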

White Owl

You can use xidel and jq:

xidel -s -e "[//path, //xxh64]" < test.xml | jq -r '. | transpose| .[] | @tsv'

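(This assumes your XML data is in test.xml. Note that @tsv produces tab-separated fields; replace it with join(";") if you want ; as the delimiter.)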

knb