Can we search in a pdf file for pages containing several words in no particular order?

Question

I would like to search in a pdf file for all the pages, each containing several given words in no particular order. For example, I want to find all the pages which contain both "hello" and "world" in no particular order.

I am not sure if pdfgrep can do it.

I am trying to do something similar to how we can search for several words in a book shown in Google Books.

Thanks.

score 3 · Accepted Answer · 2019-04-20T05:33:20.640

Yes, you can do it with zero-width lookahead assertions if you use the -P option (which makes pdfgrep use the PCRE engine and Perl-like regexps).

$ pdfgrep -Pn '(?=.*process)(?=.*preparation)' ~/Str-Cmp.pdf
8:•     If a preparation process is used, the method used shall be declared.
10:Standard, preparation may be an important part of the ordering process. See Annex C for some examples of
38:padding. The preparation processing could move the original numerals (in order of occurrence) to the very
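
The lookahead idea can also be tried without a PDF at hand. The following is only a sketch: it assumes GNU grep built with -P (PCRE) support, and the sample lines and the function names sample and both_words are made up for illustration.

```shell
#! /bin/sh
# Hypothetical sample text; each line stands in for one line of a PDF page.
sample() {
    printf '%s\n' \
        'a preparation process is used' \
        'the process comes first, preparation later' \
        'only the process is mentioned' \
        'only the preparation is mentioned'
}

# Each (?=...) is tested at the same position and consumes no text,
# so a line matches only if it contains both words, in either order.
both_words() {
    grep -P '(?=.*process)(?=.*preparation)'
}

sample | both_words
```

Only the first two sample lines are printed, since the other two contain just one of the words.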

The pdfgrep command above only works if the two words are on the same line; if the words can occur on separate lines of the same page, the following will do:

$ pdfgrep -Pn '^(?s:(?=.*process)(?=.*preparation))' ~/Str-Cmp.pdf
8:ISO/IEC 14651:2007(E)
9:                                                                                                  ISO/IEC 14651:2007(E)
10:ISO/IEC 14651:2007(E)
12:ISO/IEC 14651:2007(E)
...

The s flag in (?s: means that . will also match a newline. Note that this prints only the first line of the page; you can adjust that with the -A option:

$ pdfgrep -A4 -Pn '^(?s:(?=.*process)(?=.*preparation))' ~/Str-Cmp.pdf
8:ISO/IEC 14651:2007(E)
8-•     Any specific internal format for intermediate keys used when comparing, nor for the table used. The use of
8-      numeric keys is not mandated either.
8-•     A context-dependent ordering.
8-•     Any particular preparation of character strings prior to comparison.
--
9:                                                                                                  ISO/IEC 14651:2007(E)
...
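
The effect of the s flag can be checked on plain text as well. This is a sketch under assumptions: GNU grep with -P support, where -z is used only to hand the whole input to PCRE as a single record (roughly standing in for a page); the two-line sample and the names page, same_line and same_page are hypothetical.

```shell
#! /bin/sh
# Hypothetical two-line "page": the two words sit on different lines.
page() {
    printf '%s\n' 'the process starts here' 'the preparation comes later'
}

# Without the s flag, . stops at a newline, so both words must share a line.
same_line() { grep -Pzq '(?=.*process)(?=.*preparation)'; }

# With (?s:...), . also matches a newline, so the words may be lines apart.
same_page() { grep -Pzq '(?s:(?=.*process)(?=.*preparation))'; }

if page | same_line; then echo 'same line: match'; else echo 'same line: no match'; fi
if page | same_page; then echo 'same page: match'; else echo 'same page: no match'; fi
```

The first check reports no match and the second reports a match, which mirrors the difference between the two pdfgrep patterns.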

A crude wrapper script that will print the lines matching any of the patterns from the pages matching all of the patterns in any order:

usage: pdfgrepa [options] files ... -- patterns ...

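#! /bin/sh
# r1 collects one lookahead per pattern (the page must contain all of them);
# r2 collects an alternation (a line is shown if it contains any of them).
r1= r2=
for a; do
        if [ "$r2" ]; then
                # After "--": every argument is a pattern.
                r1="$r1(?=.*$a)"; r2="$r2|$a"
        else
                case $a in
                # "--" ends the options and files; also keep pdfgrep's "--" separator lines.
                --)     r2='(?=^--$)';;
                # Before "--": rotate the option or file name to the end of "$@".
                *)      set -- "$@" "$a";;
                esac
        fi
        shift
done
# "$@" now holds only the options and files; print whole matching pages, then filter lines.
pdfgrep -A10000 -Pn "(?s:$r1)" "$@" | grep -P --color "$r2"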

$ pdfgrepa ~/Str-Cmp.pdf -i -- obtains process preparation
37- the strings after preparation are identical, and the end result (as the user would normally see it) could be
37- collation process applying the same rules. This kind of indeterminacy is undesirable.
37-one obtains after this preparation the following strings:
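
To see what the wrapper hands to pdfgrep and grep, the argument loop can be pulled out into a function that only prints the two regexes and the leftover arguments. This is a sketch: build_regexes is a hypothetical name that is not part of the script, and file.pdf is a placeholder.

```shell
#! /bin/sh
# build_regexes is a hypothetical helper: it runs the same argument loop as
# the wrapper, but prints the results instead of calling pdfgrep.
build_regexes() {
    r1= r2=
    for a; do    # "for a" iterates over the positional parameters
        if [ "$r2" ]; then
            r1="$r1(?=.*$a)"; r2="$r2|$a"
        else
            case $a in
            --) r2='(?=^--$)';;
            *)  set -- "$@" "$a";;
            esac
        fi
        shift
    done
    printf 'page regex: (?s:%s)\n' "$r1"
    printf 'line regex: %s\n' "$r2"
    printf 'pdfgrep args: %s\n' "$*"
}

build_regexes file.pdf -i -- obtains process preparation
```

Run as written, it prints the page regex `(?s:(?=.*obtains)(?=.*process)(?=.*preparation))`, the line regex `(?=^--$)|obtains|process|preparation`, and the remaining arguments `file.pdf -i`.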

Assuming the PDF has the text in it, and not bitmaps. Not all PDFs have text content :-( — Stephen Harris, Apr 20 '19 at 02:44
@Tim No that's not true. It shows the pages containing **all** the given words. To show the pages containing any of the given words, `pdfgrep -P 'process|preparation'` would've sufficed. — , Apr 20 '19 at 03:06
Thanks. Why does it show the first line? Is it default behaviour or by some option? — Tim, Apr 20 '19 at 03:18
Because a zero-width assertion --by definition-- has no width, and the matched part of the page will be the empty string before the 1st char from the page (and consequently, the 1st char from the 1st line from the page). If you want the matched words colored, the simplest thing I can think of is to pipe the output to another `| egrep --color 'process|preparation'`. You can make the whole thing into a function. — , Apr 20 '19 at 03:23
Thanks. `-A4` only outputs the first five lines in a matching page, not necessarily the lines containing the query words. How can you show the lines containing the query words? — Tim, Apr 20 '19 at 04:20
Print the whole page (eg. with `-A10000`) and then pipe the output to `egrep --color 'word1|word2'`. One limitation of `pdfgrep` seems to be that it isn't able to find words that were broken over 2 lines (eg. `redi- rection`), but maybe there's some hidden option for that ;-) — , Apr 20 '19 at 04:26
Thanks. where do you get the initial value of `a` in the script? — Tim, Apr 20 '19 at 11:27
`for a; do ... done` sets it; but this is already off-topic. If you don't trust my demo script, use `set -x` or escape the `|` from the last line and prepend a `printf` to see what it's actually doing: `printf '{%s} ' pdfgrep ... \| ...`. — , Apr 20 '19 at 16:00

score 1 · Answer 2 · answered Apr 20 '19 at 02:45

pdfgrep -nP 'hello.{1,99}world|world.{1,99}hello' a.pdf

https://pdfgrep.org/doc.html
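
In this pattern, `.{1,99}` matches between 1 and 99 characters, so the two words must be on the same line, separated by at least one character and at most 99, in either order. A small sketch of that behaviour on plain text, assuming GNU grep with -P support; the sample lines and the function names sample2 and near_each_other are made up:

```shell
#! /bin/sh
# Hypothetical sample lines for the alternation pattern.
sample2() {
    printf '%s\n' \
        'hello big world' \
        'world says hello' \
        'helloworld' \
        'hello only'
}

# One alternative per word order; .{1,99} requires 1 to 99 characters between.
near_each_other() {
    grep -P 'hello.{1,99}world|world.{1,99}hello'
}

sample2 | near_each_other
```

The first two lines are printed. 'helloworld' is not, because at least one character is required between the words, and words more than 99 characters apart would be missed too.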

user1133275

Could you explain why `{1,99}`? – Tim Apr 20 '19 at 02:49
