I hope that there is some character — or at least some string —
that never appears in your file.
I will assume that this is true for |.
To be safer, I’ll use ||.
Run this command:
sed -n -e H -e '/^ *\/> *$/ { s/.*//; x; s/.*NS1:name="\([^"]*\)/\1&/; s/\n/||/gp }' your_file |
sort |
nl -ba |
sed -e 's/ *\([0-9]*\)[^|]*||\(.* RDF:about="rdf:#$CHROME\)[0-9]*/\2\1/' -e 's/||/\n/g'
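Note: This (probably) requires that you have GNU sed;
in particular, the \n in the replacement part of the final s command
is not guaranteed to mean a newline in other versions of sed.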
Overview
- Use sed to transform the file into a format suitable for sorting
(details below).
- Sort the output from sed.
- Apply (prepend) line numbers.
Use any command that will generate suitable numbers.
I like nl -ba, but cat -n would work just as well,
and there are probably other options.
- Use sed to strip the line number from the beginning of the line
and insert it after CHROME.
Unmangle the data back into the original format.
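As a quick check of the numbering step, the sketch below runs both numbering commands on two made-up lines (the sample text is an assumption, not data from your file). Each should emit the number, padded with leading spaces, followed by a tab, which is the shape the final sed command expects.

```shell
# Two made-up, already-sorted one-line records stand in for the sort output.
sample=$(printf 'Alpha||record\nZeta||record\n')

# Number every line; -ba tells nl to number all lines, even empty ones.
numbered_nl=$(printf '%s\n' "$sample" | nl -ba)

# cat -n is an alternative that also numbers every line.
numbered_cat=$(printf '%s\n' "$sample" | cat -n)

printf '%s\n' "$numbered_nl"
```

Since the flattened records are never empty lines, plain nl would probably do as well here; -ba just removes the doubt.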
Details — First sed command
The sort command treats each line as a record.
Therefore, we take each (delimited) record from your input file
and concatenate all the lines, forming one long line.
We also copy the name value to the beginning of the line,
to avoid having to specify a sort key.
Use the -n option to suppress automatic printing.
Lines will be printed only when we say p.
Execute H on every line.
This appends the current line to the hold space.
Logically, it might make more sense
to copy the < line to the hold space (with the h command)
and then append all subsequent lines.
I arbitrarily chose this approach.
Note that, because we append the < line to an empty hold space,
the aggregated record has an extra newline at the beginning.
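The extra leading newline is easy to see on a toy input. This sketch uses made-up lines a and b, and joins the hold space with | at the last line: first with H alone, then with the h-then-H alternative described above.

```shell
# H on every line: the hold space starts empty, so the collected text
# begins with a newline, which shows up here as a leading "|".
with_H=$(printf 'a\nb\n' | sed -n -e 'H' -e '${x; s/\n/|/g; p;}')

# h on the first line, H on the rest: no leading newline.
with_h=$(printf 'a\nb\n' | sed -n -e '1h' -e '1!H' -e '${x; s/\n/|/g; p;}')

printf '%s\n%s\n' "$with_H" "$with_h"
```

In the real command the leading newline is harmless: it turns into the || that separates the copied name from the start of the record.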
Look for a line containing />,
optionally preceded and/or followed by spaces.
When we find it, we know that we have a complete record in the hold space.
Do the following commands only on those lines.
s/.*// clears the pattern space (i.e., it wipes out the /> line).
This isn’t really throwing away any information;
the /> line was already appended to the hold space
(because every line is appended to the hold space).
x exchanges the pattern space and the hold space.
This retrieves the aggregated (appended / concatenated) record
from the hold space into the pattern space.
Because of the previous (s/.*//) command, this clears the hold space.
s/.*NS1:name="\([^"]*\)/\1&/ looks for the name field
and copies its value to the beginning of the record.
This will fail if you can have a name with quote characters in it.
s/\n/||/gp replaces every newline in the pattern space with ||.
(This is the step that converts the record into one line.)
Because of the p, this prints the record.
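To watch these steps work in isolation, here is a sketch that runs the first sed command on a made-up two-record sample. The names Zeta and Alpha and the trimmed-down records are assumptions for illustration; GNU sed is assumed as well.

```shell
# Made-up two-record sample in the same layout as the real file.
sample=$(cat <<'EOF'
<RDF:Description RDF:about="rdf:#$CHROME1"
 NS1:name="Zeta"
 />
<RDF:Description RDF:about="rdf:#$CHROME2"
 NS1:name="Alpha"
 />
EOF
)

# Flatten each record to one line, with the name copied to the front.
flat=$(printf '%s\n' "$sample" |
  sed -n -e H -e '/^ *\/> *$/ { s/.*//; x; s/.*NS1:name="\([^"]*\)/\1&/; s/\n/||/gp }')

printf '%s\n' "$flat"
```

Each record should come out as a single line that starts with its name, so a plain sort can order the records without a sort key.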
The output of the first sed command, when run on your sample file, is
AAA Carolinas||<RDF:Description RDF:about="rdf:#$CHROME1"|| NS1:name="AAA Carolinas"|| NS1:urlToUse=""|| NS1:whereLeetLB="off"|| NS1:leetLevelLB="1"|| NS1:hashAlgorithmLB="md5"|| NS1:passwordLength="16"|| NS1:usernameTB="user"|| NS1:counter=""|| NS1:charset="a9b0c8d1e7f2g6h3i5j4klmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV123456789"|| NS1:prefix="6%Fl"|| NS1:suffix="I$5g"|| NS1:protocolCB="false"|| NS1:subdomainCB="true"|| NS1:domainCB="true"|| NS1:pathCB="false"|| />
Adobe Forums||<RDF:Description RDF:about="rdf:#$CHROME2"|| NS1:name="Adobe Forums"|| NS1:urlToUse="adobeforums.com"|| NS1:whereLeetLB="off"|| NS1:leetLevelLB="1"|| NS1:hashAlgorithmLB="md5"|| NS1:passwordLength="12"|| NS1:usernameTB="username"|| NS1:counter=""|| NS1:charset="a9b0c8d1e7f2g6h3i5j4klmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV"|| NS1:prefix=""|| NS1:suffix=""|| NS1:protocolCB="false"|| NS1:subdomainCB="true"|| NS1:domainCB="true"|| NS1:pathCB="false"|| NS1:pattern0="*adobeforums.com*"|| NS1:patternenabled0="true"|| NS1:patterndesc0=""|| NS1:patterntype0="wildcard"|| />
Adorama||<RDF:Description RDF:about="rdf:#$CHROME3"|| NS1:name="Adorama"|| NS1:urlToUse="adorama.com"|| NS1:whereLeetLB="off"|| NS1:leetLevelLB="1"|| NS1:hashAlgorithmLB="md5"|| NS1:passwordLength="8"|| NS1:usernameTB="username"|| NS1:counter=""|| NS1:charset="a9b0c8d1e7f2g6h3i5j4klmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV"|| NS1:prefix=""|| NS1:suffix=""|| NS1:protocolCB="false"|| NS1:subdomainCB="false"|| NS1:domainCB="true"|| NS1:pathCB="false"|| NS1:pattern0="*adorama.com*"|| NS1:patternenabled0="true"|| NS1:patterndesc0=""|| NS1:patterntype0="wildcard"|| NS1:pattern1="www.adoramapix.com*"|| NS1:patternenabled1="true"|| NS1:patterndesc1=""|| NS1:patterntype1="wildcard"|| />
Details — Second sed command
s/ *\([0-9]*\)[^|]*||\(.* RDF:about="rdf:#$CHROME\)[0-9]*/\2\1/
breaks the line into pieces:
- Zero or more spaces.
- The line number (zero or more digits).
This becomes the \1 group.
- The tab after the line number, the name value, and the || after it.
- The record up through RDF:about="rdf:#$CHROME.
This becomes the \2 group.
- The old record number (zero or more digits).
- Implicitly, the rest of the record.
It then replaces the first five pieces with the \2 group
(the record up through RDF:about="rdf:#$CHROME)
followed by the \1 group (the line number, which becomes the new record number).
Since the rest of the record was not matched,
it is not affected by the command.
s/||/\n/g replaces each || with a newline,
restoring (recreating) the original multi-line structure of the file.
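The second sed command can be tried on its own with a single hand-built line. The sketch below assumes GNU sed and a made-up record named Alpha whose old number is 2. The line is shaped the way nl -ba would deliver it: padding, the number 1, a tab, then the flattened record.

```shell
# One made-up line as it would arrive from nl -ba.
numbered=$(printf '     1\t%s\n' \
  'Alpha||<RDF:Description RDF:about="rdf:#$CHROME2"|| NS1:name="Alpha"|| />')

# Move the line number after CHROME, drop the copied name, restore newlines.
restored=$(printf '%s\n' "$numbered" |
  sed -e 's/ *\([0-9]*\)[^|]*||\(.* RDF:about="rdf:#$CHROME\)[0-9]*/\2\1/' -e 's/||/\n/g')

printf '%s\n' "$restored"
```

The record should come back as three lines, with CHROME1 in place of CHROME2 and no trace of the leading number or the copied name.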
Obviously, …
… to send the output to a file,
add > your_output_file at the very end
of the last line of the command (i.e., at the end of the second sed).
You can then move (mv) your_output_file to your original file.
It makes no sense whatsoever to specify the --output= (or -o) option
to the sort command;
the output from sort must go into the command
that applies the line numbers.
If you want to capture an intermediate file, say so.
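For completeness, here is a sketch of the whole pipeline writing to an output file and then replacing the original. The throwaway directory, the file names sample.rdf and sample.rdf.new, and the two trimmed-down records are all assumptions for illustration; GNU sed is assumed too.

```shell
# Work in a throwaway directory; sample.rdf is a made-up stand-in for your file.
workdir=$(mktemp -d)
cat > "$workdir/sample.rdf" <<'EOF'
<RDF:Description RDF:about="rdf:#$CHROME1"
 NS1:name="Zeta"
 />
<RDF:Description RDF:about="rdf:#$CHROME2"
 NS1:name="Alpha"
 />
EOF

# Same pipeline as above, with the output redirected at the very end.
sed -n -e H -e '/^ *\/> *$/ { s/.*//; x; s/.*NS1:name="\([^"]*\)/\1&/; s/\n/||/gp }' "$workdir/sample.rdf" |
sort |
nl -ba |
sed -e 's/ *\([0-9]*\)[^|]*||\(.* RDF:about="rdf:#$CHROME\)[0-9]*/\2\1/' -e 's/||/\n/g' > "$workdir/sample.rdf.new"

# Replace the original only after the pipeline has finished writing.
mv "$workdir/sample.rdf.new" "$workdir/sample.rdf"

# Keep the result for inspection, then clean up.
result=$(cat "$workdir/sample.rdf")
rm -rf "$workdir"
printf '%s\n' "$result"
```

Writing to a separate file first matters: redirecting straight onto the input file would truncate it before the first sed had a chance to read it.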