I'm trying to parse data from a PDF report and filter out certain interesting elements. Using pdftotext -layout I get data in this format as my starting point:
Record Info Interesting
123 apple yep
orange nope
lemon yep
-----------------------------------------------
456 dragonfruit yep
cucumber nope
-----------------------------------------------
789 kumquat nope
lychee yep
passionfruit yep
yam nope
-----------------------------------------------
987 grapefruit nope
My intended output is this - every 'Interesting' fruit and its record number except when the fruit is the first fruit in its record:
Record Info
123 lemon
789 lychee
789 passionfruit
Currently, inspired by this question, I'm replacing the ------ record delimiters with \n\n and stripping out the record headers using sed. Then I can find paragraphs with matching records with awk:
awk -v RS='' '/\n .....................yep/'
(Figuring out how to write {3}.{21} or similar with one of the awks is definitely a battle for another day :/ )
This produces the cleaned-up paragraphs like so:
123 apple yep
orange nope
lemon yep
789 kumquat nope
lychee yep
passionfruit yep
yam nope
From here I could get the desired output by:
- adding a second record number column, populated from the first record number column or the previous row's second record number column
- deleting rows which have a record number in the first column
- deleting rows which aren't interesting
- cutting out the final columns
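As far as I can tell, those steps could be folded into a single awk pass, roughly like the sketch below. It is not tested against the real report and makes several assumptions: fields are separated by whitespace, a record number is purely numeric, and fruit names never contain spaces, so a row with a record number has three fields and a continuation row has two. The function name interesting_rows is just a placeholder.

```shell
# Sketch only: assumes whitespace-separated fields, a purely numeric record
# number, and fruit names without spaces (all assumptions, not checked
# against the real report).
interesting_rows() {   # hypothetical helper name
  awk '
    # Row with a record number: remember it, then drop the row
    # (the first fruit of a record is never wanted).
    NF == 3 && $1 ~ /^[0-9]+$/ { rec = $1; next }
    # Continuation row: print it with the remembered number if interesting.
    NF == 2 && $2 == "yep"     { print rec, $1 }
  '
}

interesting_rows <<'EOF'
123 apple yep
orange nope
lemon yep

789 kumquat nope
lychee yep
passionfruit yep
yam nope
EOF
```

Instead of adding a second column, this keeps the most recent record number in a variable. Under the same assumptions it should also skip the header and the dashed delimiter lines, since neither has a leading number or exactly two fields, so the sed step might not be needed at all.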
Am I going broadly in the right direction here, or is there a more straightforward way to parse multidimensional data? Perhaps by grepping for an interesting row (one that has yep and no record number), then searching backwards from there to the nearest row with a nonblank record number?