Parsing multidimensional data in paragraphs

Question

I'm trying to parse data from a PDF report and filter out certain interesting elements. Using pdftotext -layout I get data in this format as my starting point:

Record   Info           Interesting  
123      apple          yep         
         orange         nope         
         lemon          yep          
----------------------------------------------- 
456      dragonfruit    yep
         cucumber       nope         
-----------------------------------------------
789      kumquat        nope         
         lychee         yep          
         passionfruit   yep          
         yam            nope         
-----------------------------------------------
987      grapefruit     nope

My intended output is this - every 'Interesting' fruit and its record number except when the fruit is the first fruit in its record:

Record   Info
123      lemon
789      lychee
789      passionfruit

Currently, inspired by this question, I'm replacing the ------ record delimiters with \n\n and stripping out the record headers using sed. Then I can find paragraphs with matching records with awk:

awk -v RS='' '/\n   .....................yep/'

(Figuring out how to write {3}.{21} or similar with one of the awks is definitely a battle for another day :/ )
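On the repetition aside: POSIX specifies interval expressions such as ` {3}.{21}` for awk, but support varies between awk implementations and versions. A minimal sketch that avoids depending on them is to build the same pattern as a string and match it dynamically. The `rep` helper and the variable names here are made up for illustration, and the sample paragraphs are the data from above with the delimiters already replaced by blank lines.

```shell
# Cleaned-up paragraphs: record delimiters already replaced by blank lines.
paragraphs='123      apple          yep
         orange         nope
         lemon          yep

456      dragonfruit    yep
         cucumber       nope

789      kumquat        nope
         lychee         yep
         passionfruit   yep
         yam            nope

987      grapefruit     nope'

matched=$(printf '%s\n' "$paragraphs" | awk -v RS='' '
# rep() is a hypothetical helper: it returns the string s repeated n times.
function rep(s, n,    out) { while (n-- > 0) out = out s; return out }
# Newline, 3 spaces, 21 arbitrary characters, then yep: the hand-typed pattern.
BEGIN { pat = "\n" rep(" ", 3) rep(".", 21) "yep" }
$0 ~ pat')
printf '%s\n' "$matched"
```

This should select the same paragraphs as the hand-typed pattern: records 123 and 789, but not 456 or 987.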

This produces the cleaned-up paragraphs like so:

123      apple          yep         
         orange         nope         
         lemon          yep          

789      kumquat        nope         
         lychee         yep          
         passionfruit   yep          
         yam            nope

From here I could get the desired output by:

adding a second record number column, populated from the first record number column or the previous row's second record number column
delete rows which have a record number in the first column
delete rows which aren't interesting
cut out the final columns
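The four steps above can also be collapsed into one pass over the cleaned-up paragraphs. This is only a sketch, assuming the paragraph layout shown earlier (first line carries the record number, later lines do not) and fruit names without embedded spaces; the variable and array names are illustrative.

```shell
# Cleaned-up paragraphs, as produced by the sed and awk steps described above.
paragraphs='123      apple          yep
         orange         nope
         lemon          yep

789      kumquat        nope
         lychee         yep
         passionfruit   yep
         yam            nope'

# RS="" makes each paragraph one record; FS="\n" makes each line one field.
result=$(printf '%s\n' "$paragraphs" | awk -v RS='' -v FS='\n' '
{
    split($1, first, " ")            # first line: record number, fruit, flag
    for (i = 2; i <= NF; i++) {      # start at 2 to skip the first fruit
        split($i, cols, " ")         # later lines: fruit, flag
        if (cols[2] == "yep")
            print first[1], cols[1]  # record number and interesting fruit
    }
}')
printf '%s\n' "$result"
```

Carrying the record number in `first[1]` stands in for the "second record number column" of step one, and starting the loop at field 2 covers the "delete rows which have a record number" step.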

Am I going broadly in the right direction here, or is there a more straightforward way to parse multidimensional data? Perhaps by grepping for an interesting row (one that has yep and no record number), then searching backwards from there to the nearest row with a nonblank record number?

`pdftotext` is a good, useful tool but be wary of its output. Depending on exactly how the PDF file was created, the output lines may not be in the order you expect (or the order they appear to be in when you view the PDF). This is particularly common when the PDF has been created or manually edited by a GUI PDF editor rather than generated by a program (e.g. command-line tools or export to pdf from a word-processor or spreadsheet etc). If you're lucky enough to have access to text or the original/source data files, you're always better off using them. — cas, Jul 16 '17 at 04:39

DopeGhoti · Accepted Answer · 2017-07-14T22:41:29.137

You might be overcomplicating things:

$ cat input
Record   Info           Interesting
123      apple          yep
         orange         nope
         lemon          yep
-----------------------------------------------
456      dragonfruit    yep
         cucumber       nope
-----------------------------------------------
789      kumquat        nope
         lychee         yep
         passionfruit   yep
         yam            nope
-----------------------------------------------
987      grapefruit     nope
$ awk 'BEGIN {OFS="\t"; print "Record","Info"} NF==3 && NR!=1 { number=$1 } NF!=3 && $2 ~ /yep/ {print number,$1}' input
Record  Info
123     lemon
789     lychee
789     passionfruit

To make the awk script a little more vertical, in order to explain how it works:

BEGIN {                    # This block executes before any data
   OFS="\t";               # are parsed, and simply prints a header.
   print "Record","Info"
}
NF==3 && NR!=1 {           # This block will run on any row (line)
   number=$1               # with three fields other than the first,
}                          # and remembers the record number.
NF!=3 && $2 ~ /yep/ {      # On rows without three fields where the second
   print number,$1         # field matches the regex /yep/, print the number
}                          # grabbed before, and the fruit.
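One caveat with counting fields: a continuation row whose Info value contains a space (a hypothetical "passion fruit yep") would have three fields and be mistaken for the start of a record. Since `pdftotext -layout` keeps the columns aligned, a variant can slice by column position instead. The following is a sketch under that assumption, not a drop-in replacement; `trim`, `info_col` and `flag_col` are names invented here, and the column starts are read from the header row.

```shell
input='Record   Info           Interesting
123      apple          yep
         orange         nope
         lemon          yep
-----------------------------------------------
456      dragonfruit    yep
         cucumber       nope
-----------------------------------------------
789      kumquat        nope
         lychee         yep
         passionfruit   yep
         yam            nope
-----------------------------------------------
987      grapefruit     nope'

result=$(printf '%s\n' "$input" | awk '
# trim() is a hypothetical helper: strip leading and trailing spaces.
function trim(s) { gsub(/^ +| +$/, "", s); return s }
BEGIN { OFS = "\t"; print "Record", "Info" }
NR == 1 {                          # header row: find where each column starts
    info_col = index($0, "Info")
    flag_col = index($0, "Interesting")
    next
}
/^-+ *$/ { next }                  # skip the record delimiters
{
    rec  = trim(substr($0, 1, info_col - 1))
    info = trim(substr($0, info_col, flag_col - info_col))
    flag = trim(substr($0, flag_col))
    if (rec != "")                 # first row of a record: remember its number
        number = rec
    else if (flag == "yep")        # later row flagged as interesting
        print number, info
}')
printf '%s\n' "$result"
```

On the sample input this should print the same tab-separated table as the one-liner above.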
