
I have a script called get_numbers.sh, which I want to use to extract data from .pdf files labelled sequentially by date, using pdfgrep.

Let me simplify my problem to what I believe are its essentials:

The .pdfs are named file-07-01.pdf, file-07-02.pdf, ..., file-07-31.pdf, where the numbers correspond to the month and day of the data in the file.

I can enter a shell command like

    pdfgrep -i "Total Number:" file-07-{01..12}.pdf

and I get exactly what I want: the matching text from each file for the dates 07-01 to 07-12.

I want to make a script for this, where all I have to do is enter the start and end days as well as the month. This was my first go:

    #!/usr/bin/bash

    if [ "$#" -eq 3 ]
    then
        START_DAY=$1 
        END_DAY=$2
        MONTH=$3
    else
        echo "INCORRECT USAGE:"
        exit 1
    fi   


    pdfgrep -i "Total Number:" file-$MONTH-{$START_DAY..$END_DAY}.pdf

But running `bash get_numbers.sh 01 12 07` gives me the error message

    pdfgrep: Could not open file-07-{01..12}.pdf

Looking around, I've realized that you have to be careful when doing this, because the script treats `file-$MONTH-{$START_DAY..$END_DAY}.pdf` as a literal string rather than a glob. I've tried storing the expression in a variable and putting double quotes around it, but neither changes the result. What am I missing?

nonreligious
  • It's the passing of arguments into the *brace expansion* that's the issue here - see for example [How can I use $variable in a shell brace expansion of a sequence?](https://unix.stackexchange.com/questions/7738/how-can-i-use-variable-in-a-shell-brace-expansion-of-a-sequence). Personally I'd use `seq` rather than the accepted answer based on `eval`. – steeldriver Aug 09 '21 at 16:41
  • @steeldriver Ah, that did the trick - thank you. Switched to a for loop with `seq` like one of the later answers/examples, because `eval` was doing something funny with the search pattern. You learn something every day! – nonreligious Aug 09 '21 at 17:27
  • fwiw not sure if you realize that you can still pass all the filenames to a single pdfgrep if you assemble them into an array: `for m in $(seq -w "$1" "$2"); do files+=("file-$3-$m.pdf"); done` then `pdfgrep -i "Total Number:" "${files[@]}"` – steeldriver Aug 09 '21 at 17:33
  • I see, yes I was doing the `pdfgrep` inside the loop - this does seem to be noticeably faster. Thanks again! – nonreligious Aug 09 '21 at 18:04
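
The array approach from the last two comments can be sketched as a complete bash script. This is one possible version under some assumptions, not the asker's final script: the function names `build_file_list` and `main` are made up for illustration, and a bash arithmetic loop with `printf` stands in for `seq -w` so that the zero padding does not depend on the `seq` implementation.

```shell
#!/usr/bin/env bash
# Sketch only: build_file_list and main are illustrative names.

# build_file_list START_DAY END_DAY MONTH
# Fills the global array `files` with names like file-07-01.pdf.
build_file_list() {
    local start_day=$1 end_day=$2 month=$3 day name
    files=()
    # The 10# prefix forces base 10, so 08 and 09 are not rejected as octal.
    for ((day = 10#$start_day; day <= 10#$end_day; day++)); do
        # %02d restores the leading zero used in the file names.
        printf -v name 'file-%s-%02d.pdf' "$month" "$day"
        files+=("$name")
    done
}

main() {
    if [ "$#" -ne 3 ]; then
        echo "Usage: get_numbers.sh START_DAY END_DAY MONTH" >&2
        return 1
    fi
    build_file_list "$1" "$2" "$3"
    # One pdfgrep call for all files, instead of one call per file.
    pdfgrep -i "Total Number:" "${files[@]}"
}

# Run main only when arguments were given, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
    main "$@"
fi
```

Called as `bash get_numbers.sh 01 12 07`, this would run a single `pdfgrep` over `file-07-01.pdf` through `file-07-12.pdf`.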

1 Answer


Switch from bash to zsh, which does have a decimal range glob operator. (Unlike bash, zsh can also use variables in `{$start..$end}`, but that is brace expansion, not a glob operator.)

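    #!/usr/bin/zsh -
    # Usage: get_numbers.sh START_DAY END_DAY MONTH
    if
      (( $# == 3 )) &&
        start_day=$1 end_day=$2 month=$3 &&
        [[ $start_day = <1-31> && $end_day = <1-31> && $month = <1-12> ]] &&
        (( end_day >= start_day ))
    then
      # <x-y> matches any decimal integer from x to y, with or without leading zeros
      pattern="file-<$month-$month>-<$start_day-$end_day>.pdf"
      # $~pattern treats the value as a glob; (n) sorts the matches numerically
      exec pdfgrep -i "Total Number:" $~pattern(n)
    else
      print -u2 Incorrect usage
      exit 1
    fi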

`<1-5>` will match on any sequence of decimal digits that represents an integer from 1 to 5, including `2`, `02`, `002`... That's why we also use `<$month-$month>`, so that `<2-2>` can match on both `2` and `02`.

The `(n)` glob qualifier causes the glob expansion to sort numerically, so that `file-12-2.pdf` comes before `file-12-13.pdf` and `file-12-03.pdf` for instance.
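
For comparison, the bash behaviour that the question ran into can be reproduced in a few lines. Bash performs brace expansion before parameter expansion, so `{$start..$end}` is not yet a valid range at the point where braces are processed. A minimal sketch, with illustrative variable names that are not taken from the original script:

```shell
#!/usr/bin/env bash
# Illustrative variable names; they are not taken from the original script.
start=01
end=03

# Brace expansion runs before parameter expansion, so bash sees no valid
# numeric range here and leaves the braces in place.
with_variables=$(echo file-07-{$start..$end}.pdf)

# With literal numbers the range is expanded into three words.
with_literals=$(echo file-07-{01..03}.pdf)

echo "$with_variables"
echo "$with_literals"
```

In bash 4 or later, the first line printed is the literal `file-07-{01..03}.pdf`, the same kind of unexpanded name that appears in the `pdfgrep` error message, and the second line is the three expanded file names.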

Stéphane Chazelas