Find PDFs that don't have text

Question

I have many folders containing lots of PDFs, and I want to run optical character recognition (OCR) on those that do not have a text layer. So first, I need to find them. I thought a pipeline with pdfgrep might do the job, but I'm lost.

How can I find PDFs that do not have text?

Stéphane Chazelas · Accepted Answer · 2021-01-15T09:45:41.960


Yes, using pdfgrep sounds like a good idea. Something like:

find . -name '*.[Pp][Dd][Ff]' -type f \
  ! -exec pdfgrep -q '\w' {} ';' -print

That reports the PDF files in which pdfgrep can't find any word character (alphanumerics or underscore).

(With some find implementations, you can use -iname '*.pdf' instead of -name '*.[Pp][Dd][Ff]' above. Beware that it assumes file names are valid text in the current locale.)
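As a rough sketch, the same command can be wrapped in a shell function so that the resulting list can be saved for a later OCR pass. The function name list_textless_pdfs is made up for illustration; only find and pdfgrep come from the command above.

```shell
# list_textless_pdfs is a hypothetical helper name, not part of pdfgrep.
# Usage: list_textless_pdfs [DIR]
# Prints the PDFs under DIR (default: current directory) in which
# pdfgrep finds no word character.
list_textless_pdfs() {
  # pdfgrep -q exits 0 when it finds a match. "!" inverts that result,
  # so -print only runs for files without a match (or files pdfgrep
  # could not read at all).
  find "${1:-.}" -name '*.[Pp][Dd][Ff]' -type f \
    ! -exec pdfgrep -q '\w' {} ';' -print
}
```

For instance, `list_textless_pdfs ~/scans > needs-ocr.txt` would write one path per line to a file (both names here are placeholders). A line-based list like that assumes no file name contains a newline.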

To look for files with fewer than 1000 word characters:

find . -name '*.[Pp][Dd][Ff]' -type f -exec sh -c '
  for file do
    [ "$(pdfgrep -c "\w" "$file")" -lt 1000 ] &&
      printf "%s\n" "$file"
  done' sh {} +
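A possible variation, offered as a sketch rather than a definitive version, makes the threshold a parameter. The function name list_sparse_pdfs and its arguments are hypothetical. It also makes one assumption that differs from the command above: if pdfgrep fails and prints no count, the file is treated as having 0 word characters and is therefore listed.

```shell
# list_sparse_pdfs is a hypothetical helper name, not part of pdfgrep.
# Usage: list_sparse_pdfs [DIR] [MIN_CHARS]
# Prints the PDFs under DIR (default: .) that contain fewer than
# MIN_CHARS (default: 1000) word characters.
list_sparse_pdfs() {
  find "${1:-.}" -name '*.[Pp][Dd][Ff]' -type f -exec sh -c '
    min=$1; shift   # first argument is the threshold, the rest are files
    for file do
      # pdfgrep -c prints the number of matches, here word characters
      count=$(pdfgrep -c "\w" "$file" 2>/dev/null)
      # Assumption: an empty count (pdfgrep failed) is treated as 0
      if [ "${count:-0}" -lt "$min" ]; then
        printf "%s\n" "$file"
      fi
    done' sh "${2:-1000}" {} +
}
```

Calling `list_sparse_pdfs . 200` would then list files with fewer than 200 word characters. The right threshold depends on how much watermark or notice text the scanned files carry, so it is worth trying a few values on a sample.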

edited Jan 15 '21 at 09:45

answered Jan 15 '21 at 07:16


That will get a lot of false positives, since many scanned pdfs include notices / watermarks as text. – Jan 15 '21 at 09:33
@user414777, ITYM false negatives, as it would fail to report those files. I've added a variant that counts the number of word characters (and which could have false positives in addition to false negatives). – Stéphane Chazelas Jan 15 '21 at 09:49
