How can I sort blocks of data of varying length on a field in each block

Question

I have an RDF file with blocks of data of varying number of lines delineated by < and />. Within each block, there is a field identified by name="some name". I need to sort the blocks on the value of name without changing the order of any of the lines within each block. Additionally, there is a field in each block with a number. I need to renumber these fields from 1 to n based on the sorted position of each block.

Here is an example of 3 blocks:

<RDF:Description RDF:about="rdf:#$CHROME1"
 NS1:name="AAA Carolinas"
  NS1:urlToUse=""
  NS1:whereLeetLB="off"
  NS1:leetLevelLB="1"
  NS1:hashAlgorithmLB="md5"
  NS1:passwordLength="16"
  NS1:usernameTB="user"
  NS1:counter=""
  NS1:charset="a9b0c8d1e7f2g6h3i5j4klmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV123456789"
  NS1:prefix="6%Fl"
  NS1:suffix="I$5g"
  NS1:protocolCB="false"
  NS1:subdomainCB="true"
  NS1:domainCB="true"
  NS1:pathCB="false"
  />
<RDF:Description RDF:about="rdf:#$CHROME2"
 NS1:name="Adobe Forums"
  NS1:urlToUse="adobeforums.com"
  NS1:whereLeetLB="off"
  NS1:leetLevelLB="1"
  NS1:hashAlgorithmLB="md5"
  NS1:passwordLength="12"
  NS1:usernameTB="username"
  NS1:counter=""
  NS1:charset="a9b0c8d1e7f2g6h3i5j4klmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV"
  NS1:prefix=""
  NS1:suffix=""
  NS1:protocolCB="false"
  NS1:subdomainCB="true"
  NS1:domainCB="true"
  NS1:pathCB="false"
  NS1:pattern0="*adobeforums.com*"
  NS1:patternenabled0="true"
  NS1:patterndesc0=""
  NS1:patterntype0="wildcard"
  />
<RDF:Description RDF:about="rdf:#$CHROME3"
 NS1:name="Adorama"
  NS1:urlToUse="adorama.com"
  NS1:whereLeetLB="off"
  NS1:leetLevelLB="1"
  NS1:hashAlgorithmLB="md5"
  NS1:passwordLength="8"
  NS1:usernameTB="username"
  NS1:counter=""
  NS1:charset="a9b0c8d1e7f2g6h3i5j4klmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV"
  NS1:prefix=""
  NS1:suffix=""
  NS1:protocolCB="false"
  NS1:subdomainCB="false"
  NS1:domainCB="true"
  NS1:pathCB="false"
  NS1:pattern0="*adorama.com*"
  NS1:patternenabled0="true"
  NS1:patterndesc0=""
  NS1:patterntype0="wildcard"
  NS1:pattern1="www.adoramapix.com*"
  NS1:patternenabled1="true"
  NS1:patterndesc1=""
  NS1:patterntype1="wildcard"
  />

The number I alluded to is the number following $CHROME in the above example. I'm an old Assembler, COBOL, Fortran, Basic programmer, but I am not up to snuff on scripting or newer languages. I could probably do this in a Basic program, but I would like a Linux solution if possible.

one block won't be enough. Post a testable input structure with 2-3 blocks (ready for sorting) — RomanPerekhrest, Jul 17 '18 at 17:47
P.S. I answered your follow-up question; see my revised answer. — G-Man Says 'Reinstate Monica', Sep 22 '18 at 21:19

G-Man Says 'Reinstate Monica' · Answer 1 · 2021-01-03T02:48:42.877

I hope that there is some character — or at least some string — that never appears in your file. I will assume that this is true for |. To be safer, I’ll use ||.

Run this command:

sed -n -e H -e '/^ *\/> *$/ { s/.*//; x; s/.*NS1:name="\([^"]*\)/\1&/; s/\n/||/gp }' your_file |
        sort |
        nl -ba |
        sed -e 's/ *\([0-9]*\)[^|]*||\(.* RDF:about="rdf:#$CHROME\)[0-9]*/\2\1/' -e 's/||/\n/g'

Note: This (probably) requires that you have GNU sed.

Overview

- Use sed to transform the file into a format suitable for sorting (details below).
- Sort the output from sed.
- Apply (prepend) line numbers. Use any command that will generate suitable numbers. I like nl -ba, but cat -n would work just as well, and there are probably other options.
- Use sed to strip the line number from the beginning of the line and insert it after CHROME. Unmangle the data back into the original format.
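To see the four stages working together, here is a minimal sketch that runs the same pipeline on a made-up two-block file whose blocks start out in the wrong order. The file contents, the `sample` variable, and the `sort_blocks` function name are assumptions for illustration only; like the command above, it assumes GNU sed.

```shell
# Build a tiny two-block sample whose blocks are deliberately out of order.
# (The names and values below are invented for this illustration.)
sample=$(mktemp)
cat > "$sample" <<'EOF'
<RDF:Description RDF:about="rdf:#$CHROME1"
 NS1:name="Zebra"
  NS1:urlToUse="zebra.example"
  />
<RDF:Description RDF:about="rdf:#$CHROME2"
 NS1:name="Apple"
  NS1:urlToUse=""
  NS1:counter=""
  />
EOF

# Hypothetical wrapper around the pipeline from the answer; $1 is the input file.
sort_blocks() {
    # Stage 1: join each block into one line, with the name value in front.
    sed -n -e H -e '/^ *\/> *$/ { s/.*//; x; s/.*NS1:name="\([^"]*\)/\1&/; s/\n/||/gp }' "$1" |
        # Stage 2: sort the one-line records (the name is at the start of each line).
        sort |
        # Stage 3: prepend a line number to every record.
        nl -ba |
        # Stage 4: move the number after CHROME and restore the newlines.
        sed -e 's/ *\([0-9]*\)[^|]*||\(.* RDF:about="rdf:#$CHROME\)[0-9]*/\2\1/' -e 's/||/\n/g'
}

sort_blocks "$sample"
```

The Apple block should come out first, renumbered as CHROME1, followed by the Zebra block as CHROME2, with the lines inside each block left in their original order.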

Details — First `sed` command

The sort command treats each line as a record. Therefore, we take each (delimited) record from your input file and concatenate all the lines, forming one long line. We also copy the name value to the beginning of the line, to avoid having to specify a sort key.

Use the -n option to suppress automatic printing. Lines will be printed only when we say p.
Execute H on every line. This appends the current line to the hold space. Logically, it might make more sense to copy the < line to the hold space (with the h command) and then append all subsequent lines. I arbitrarily chose to use H on every line instead.

Note that, because we append the < line to an empty hold space, the aggregated record has an extra newline at the beginning.
Look for a line containing />, optionally preceded and/or followed by spaces. When we find it, we know that we have a complete record in the hold space. Do the following commands only on those lines.
- s/.*// clears the pattern space (i.e., it wipes out the /> line). This isn’t really throwing away any information; the /> line was already appended to the hold space (because every line is appended to the hold space).
- x exchanges the pattern space and the hold space. This retrieves the aggregated (appended / concatenated) record from the hold space into the pattern space. Because of the previous (s/.*//) command, this clears the hold space.
- s/.*NS1:name="\([^"]*\)/\1&/ looks for the name field and copies its value to the beginning of the record. This will fail if you can have a name with quote characters in it.
- s/\n/||/gp replaces every newline in the pattern space with ||. (This is the step that converts the record into one line.) Because of the p, this prints the record.
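The hold-space accumulation described above can be tried in isolation. The following is a simplified sketch on invented input, with the name-copying step left out; `join_blocks` is a hypothetical helper name, and GNU sed is assumed as before.

```shell
# Hypothetical helper: join every block that ends in a "/>" line into one line.
join_blocks() {
    # H appends each line to the hold space; on the closing "/>" line we
    # clear the pattern space, swap in the accumulated block, and print it
    # with every newline replaced by "||".
    sed -n -e H -e '/^ *\/> *$/ { s/.*//; x; s/\n/||/gp }'
}

# Two made-up blocks of different lengths.
printf '%s\n' '<a' ' x="1"' ' />' '<b' ' />' | join_blocks
```

Each output line should begin with ||, which is the extra leading newline mentioned above; in the full command, the copied name value lands in front of it.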

The output of the first sed command, when run on your sample file, is

AAA Carolinas||<RDF:Description RDF:about="rdf:#$CHROME1"|| NS1:name="AAA Carolinas"||  NS1:urlToUse=""||  NS1:whereLeetLB="off"||  NS1:leetLevelLB="1"||  NS1:hashAlgorithmLB="md5"||  NS1:passwordLength="16"||  NS1:usernameTB="user"||  NS1:counter=""||  NS1:charset="a9b0c8d1e7f2g6h3i5j4klmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV123456789"||  NS1:prefix="6%Fl"||  NS1:suffix="I$5g"||  NS1:protocolCB="false"||  NS1:subdomainCB="true"||  NS1:domainCB="true"||  NS1:pathCB="false"||  />
Adobe Forums||<RDF:Description RDF:about="rdf:#$CHROME2"|| NS1:name="Adobe Forums"||  NS1:urlToUse="adobeforums.com"||  NS1:whereLeetLB="off"||  NS1:leetLevelLB="1"||  NS1:hashAlgorithmLB="md5"||  NS1:passwordLength="12"||  NS1:usernameTB="username"||  NS1:counter=""||  NS1:charset="a9b0c8d1e7f2g6h3i5j4klmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV"||  NS1:prefix=""||  NS1:suffix=""||  NS1:protocolCB="false"||  NS1:subdomainCB="true"||  NS1:domainCB="true"||  NS1:pathCB="false"||  NS1:pattern0="*adobeforums.com*"||  NS1:patternenabled0="true"||  NS1:patterndesc0=""||  NS1:patterntype0="wildcard"||  />
Adorama||<RDF:Description RDF:about="rdf:#$CHROME3"|| NS1:name="Adorama"||  NS1:urlToUse="adorama.com"||  NS1:whereLeetLB="off"||  NS1:leetLevelLB="1"||  NS1:hashAlgorithmLB="md5"||  NS1:passwordLength="8"||  NS1:usernameTB="username"||  NS1:counter=""||  NS1:charset="a9b0c8d1e7f2g6h3i5j4klmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUV"||  NS1:prefix=""||  NS1:suffix=""||  NS1:protocolCB="false"||  NS1:subdomainCB="false"||  NS1:domainCB="true"||  NS1:pathCB="false"||  NS1:pattern0="*adorama.com*"||  NS1:patternenabled0="true"||  NS1:patterndesc0=""||  NS1:patterntype0="wildcard"||  NS1:pattern1="www.adoramapix.com*"||  NS1:patternenabled1="true"||  NS1:patterndesc1=""||  NS1:patterntype1="wildcard"||  />

Details — Second `sed` command

s/ *\([0-9]*\)[^|]*||\(.* RDF:about="rdf:#$CHROME\)[0-9]*/\2\1/ breaks the line into pieces:
- Zero or more spaces.
- The line number (zero or more digits). This becomes the \1 group.
- The tab after the line number, the name value, and the || after it.
- The record up through RDF:about="rdf:#$CHROME. This becomes the \2 group.
- The old record number (zero or more digits).
- Implicitly, the rest of the record.
It then replaces the first five pieces with the fourth piece (the record up through RDF:about="rdf:#$CHROME) followed by the line number, which becomes the new record number. Since the rest of the record was not matched, it is not affected by the command.
s/||/\n/g replaces each || with a newline, restoring (recreating) the original multi-line structure of the file.
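To check the second sed command on its own, one can feed it a single hand-written line shaped like the output of nl -ba. This is only a sketch: the record below is shortened and the line number 7 is invented, and `renumber` is a hypothetical helper name. GNU sed is assumed, since \n is used in the replacement.

```shell
# Hypothetical helper wrapping the second sed command from the answer.
renumber() {
    sed -e 's/ *\([0-9]*\)[^|]*||\(.* RDF:about="rdf:#$CHROME\)[0-9]*/\2\1/' -e 's/||/\n/g'
}

# One numbered, sorted record: padding, line number, tab, name, ||, then the block.
printf '     7\t%s\n' 'Adorama||<RDF:Description RDF:about="rdf:#$CHROME3"|| NS1:name="Adorama"||  />' |
    renumber
```

The old record number 3 should be replaced by 7, and the line should be split back into three lines.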

Obviously, …

… to send the output to a file, add > your_output_file at the very end of the last line of the command (i.e., at the end of the second sed). You can then move (mv) your_output_file to your original file. It makes no sense whatsoever to specify the --output= (or -o) option to the sort command; the output from sort must go into the command that applies the line numbers. If you want to capture an intermediate file, say so.
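As a rough sketch of that last step, the redirect and the mv can be chained so that the original file is replaced only if the pipeline's final command succeeds. The function name `sort_rdf_in_place` and the `.sorted` suffix are assumptions made for this example, not part of the original command.

```shell
# Hypothetical helper: sort the blocks of the file named in $1, then replace it.
sort_rdf_in_place() {
    sed -n -e H -e '/^ *\/> *$/ { s/.*//; x; s/.*NS1:name="\([^"]*\)/\1&/; s/\n/||/gp }' "$1" |
        sort |
        nl -ba |
        sed -e 's/ *\([0-9]*\)[^|]*||\(.* RDF:about="rdf:#$CHROME\)[0-9]*/\2\1/' -e 's/||/\n/g' \
            > "$1.sorted" &&
        # Move the result over the original only if the last sed succeeded.
        mv "$1.sorted" "$1"
}

# Usage (your_file is a placeholder, as in the text above):
# sort_rdf_in_place your_file
```

Keeping a backup copy of the original file before running this is a sensible precaution, since the mv overwrites it.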
