
I have a large .gz file, about 5 GB. It contains a few XML files, and I want to search them for certain values and extract those values if they are present.

For example, I want to extract the tags that contain the name NOOSS, together with their sub-content such as <pmJobId>, <requestedJobState>, <reportingPeriod> and <jobPriority>, from the .gz file:

<Pm xmlns="urnCmwPm">
    <pmId>1</pmId>
    <PmJob>
        <pmJobId>NOOSSCONTROLExample</pmJobId>
        <requestedJobState>ACTIVE</requestedJobState>
        <reportingPeriod>FIVE_MIN</reportingPeriod>
        <jobType>MEASUREMENTJOB</jobType>
        <jobPriority>HIGH</jobPriority>
        <granularityPeriod>FIVE_MIN</granularityPeriod>
        <jobGroup>Sla</jobGroup>
        <reportContentGeneration>CHANGED_ONLY</reportContentGeneration>
        <MeasurementReader>
            <measurementReaderId>mr_2</measurementReaderId>
            <measurementSpecification struct="MeasurementSpecification">
                <measurementTypeRef>Anything</measurementTypeRef>
            </measurementSpecification>
            <thresholdRateOfVariation>PER_SECOND</thresholdRateOfVariation>
        </MeasurementReader>
        <MeasurementReader>
            <measurementReaderId>mr_1</measurementReaderId>
            <measurementSpecification struct="MeasurementSpecification">
                <measurementTypeRef>ManagedElement=1,SystemFunctions=1,Pm=1,PmGroup=OSProcessingLogicalUnit,MeasurementType=CPULoad.Total</measurementTypeRef>
            </measurementSpecification>
            <thresholdRateOfVariation>PER_SECOND</thresholdRateOfVariation>
        </MeasurementReader>
    </PmJob>
</Pm>

I was using cat *gz | zgrep -a "PmJobId", but the output only shows the <pmJobId> value and not the rest of the information or tags.

Please help; I'm new to this.

I'm using CentOS (Red Hat Linux).

Thanks

schrodingerscatcuriosity
  • A single compressed file can't contain multiple other files unless it's an archive of some sort. Do I understand you correctly that you want the `PmJob` node and all its sub-nodes if the corresponding `pmJobId` node's value contains the substring `NOOSS`? And you want the actual XML? It would help if you could show the expected result, given the example document you include in the question. – Kusalananda Sep 29 '21 at 16:37
  • The namespace declaration, i.e. `xmlns="urnCmwPm"`, contains an invalid URI. – fpmurphy Sep 29 '21 at 20:27

2 Answers

Assuming that the XML document in file.xml is well-formed and correct in all ways (the example in the question has a faulty namespace declaration), you can use the command-line XML parser xmlstarlet to extract the PmJob nodes whose pmJobId value contains the substring NOOSS:

xmlstarlet sel -t -c '//PmJob[contains(pmJobId,"NOOSS")]' -nl file.xml

This command selects all PmJob nodes with a child node, pmJobId, whose value contains the substring NOOSS. The utility will return a copy of the selected PmJob node and all its child nodes.
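
For a file as large as the one in the question, the same selection can be sketched as a streaming pass over the compressed data, so the 5 GB file never has to be decompressed to disk or loaded whole. This is only a minimal sketch in Python's standard library, not part of xmlstarlet. It assumes the .gz decompresses to a single well-formed XML document, and the names `local_name` and `matching_pm_jobs` are made up for illustration. It compares local element names, so it matches whether or not the document declares a default namespace.

```python
import gzip
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a leading '{namespace}' part from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def matching_pm_jobs(path, needle="NOOSS"):
    """Yield each PmJob element, serialized as XML text, whose direct
    pmJobId child contains `needle`.

    Assumption: `path` is a gzip file holding one well-formed XML document.
    """
    with gzip.open(path, "rb") as stream:
        # "end" events fire once an element and all its children are parsed.
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if local_name(elem.tag) != "PmJob":
                continue
            job_id = next(
                (c.text or "" for c in elem if local_name(c.tag) == "pmJobId"),
                "",
            )
            if needle in job_id:
                yield ET.tostring(elem, encoding="unicode").strip()
            # Drop the children of the finished PmJob to keep memory use low.
            elem.clear()
```

Looping over `matching_pm_jobs("file.xml.gz")` (a placeholder file name) and printing each item gives one XML fragment per matching job. If the document declares a namespace, ElementTree writes the fragment with a generated prefix such as `ns0:`.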

Kusalananda

Assuming that the XML document is well-formed and valid, you can use the xmllint utility to output the requisite nodes.

$ xmllint --xpath '//PmJob[contains(pmJobId,"NOOSS")]' file.xml
<PmJob>
        <pmJobId>NOOSSCONTROLExample</pmJobId>
        <requestedJobState>ACTIVE</requestedJobState>
        <reportingPeriod>FIVE_MIN</reportingPeriod>
        <jobType>MEASUREMENTJOB</jobType>
        <jobPriority>HIGH</jobPriority>
        <granularityPeriod>FIVE_MIN</granularityPeriod>
        <jobGroup>Sla</jobGroup>
        <reportContentGeneration>CHANGED_ONLY</reportContentGeneration>
        <MeasurementReader>
            <measurementReaderId>mr_2</measurementReaderId>
            <measurementSpecification struct="MeasurementSpecification">
                <measurementTypeRef>Anything</measurementTypeRef>
            </measurementSpecification>
            <thresholdRateOfVariation>PER_SECOND</thresholdRateOfVariation>
        </MeasurementReader>
        <MeasurementReader>
            <measurementReaderId>mr_1</measurementReaderId>
            <measurementSpecification struct="MeasurementSpecification">
                <measurementTypeRef>ManagedElement=1,SystemFunctions=1,Pm=1,PmGroup=OSProcessingLogicalUnit,MeasurementType=CPULoad.Total</measurementTypeRef>
            </measurementSpecification>
            <thresholdRateOfVariation>PER_SECOND</thresholdRateOfVariation>
        </MeasurementReader>
    </PmJob>
$

This utility is installed by default on many Linux distributions.
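
The question asks for only a few of the child elements (pmJobId, requestedJobState, reportingPeriod and jobPriority) rather than the whole node. Narrowing the result down could look like the following sketch, which uses Python's standard library rather than xmllint. It assumes the XML text is small enough to hold in memory, and `WANTED` and `summarize_jobs` are illustrative names, not an existing API.

```python
import xml.etree.ElementTree as ET

# Child elements the question asks for.
WANTED = ("pmJobId", "requestedJobState", "reportingPeriod", "jobPriority")


def summarize_jobs(xml_text, needle="NOOSS", wanted=WANTED):
    """Return one dict per PmJob whose pmJobId contains `needle`,
    keeping only the child elements named in `wanted`."""
    root = ET.fromstring(xml_text)
    rows = []
    for job in root.iter():
        # Compare local names so a default namespace does not matter.
        if job.tag.rsplit("}", 1)[-1] != "PmJob":
            continue
        fields = {c.tag.rsplit("}", 1)[-1]: (c.text or "") for c in job}
        if needle in fields.get("pmJobId", ""):
            rows.append({name: fields.get(name, "") for name in wanted})
    return rows
```

Each returned dict holds one job's four values, which can then be printed in whatever layout is needed.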

fpmurphy