5

I have many folders with lots of PDFs, and I want to run OCR (optical character recognition) on those that do not have a text layer. So first, I need to find them. I thought that maybe a pipe with pdfgrep would do the job, but I'm lost.

How can I find PDFs that do not have text?

ctrl-alt-delor
fich

1 Answer

7

Yes, using pdfgrep sounds like a good idea. Something like:

find . -name '*.[Pp][Dd][Ff]' -type f \
  ! -exec pdfgrep -q '\w' {} ';' -print

This reports the PDF files in which pdfgrep can't find any word character (an alphanumeric character or an underscore).

(With some find implementations, you can use -iname '*.pdf' instead of -name '*.[Pp][Dd][Ff]' above. Beware that it assumes file names are valid text in the current locale.)
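Note that the find command above treats any non-zero exit status from pdfgrep as "no text", so a file that pdfgrep cannot read at all would be listed alongside genuine scans. Here is a minimal sketch of telling the two cases apart, assuming pdfgrep follows grep's exit status convention (0 for a match, 1 for no match, anything else for an error). The helper name classify_pdf is hypothetical, not part of pdfgrep:

```shell
# classify_pdf FILE
# Prints "text", "no-text" or "error", then a tab, then the file name.
# Assumption: pdfgrep exits with 0 on a match, 1 on no match, and some
# other status on error (unreadable, encrypted or damaged file).
# FILE should not start with "-" (paths produced by "find ." never do).
classify_pdf() {
  status=0
  # -q: no output, only the exit status matters.
  pdfgrep -q '\w' "$1" 2>/dev/null || status=$?
  case $status in
    0) printf 'text\t%s\n' "$1" ;;
    1) printf 'no-text\t%s\n' "$1" ;;
    *) printf 'error\t%s\n' "$1" ;;
  esac
}
```

For instance, for f in ./*.pdf; do classify_pdf "$f"; done prints one line per file, and the lines starting with no-text are the OCR candidates.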

To look for files with fewer than 1000 word characters:

find . -name '*.[Pp][Dd][Ff]' -type f -exec sh -c '
  for file do
    [ "$(pdfgrep -c "\w" "$file")" -lt 1000 ] &&
      printf "%s\n" "$file"
  done' sh {} +
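The same threshold test can also be wrapped in a small function so that the limit is adjustable. This is only a sketch: needs_ocr is a hypothetical name, 1000 is kept as the default, and treating empty or non-numeric pdfgrep output as a count of 0 is an assumption made here (it means unreadable files are reported as OCR candidates too):

```shell
# needs_ocr FILE [THRESHOLD]
# Succeeds (exit status 0) when FILE has fewer than THRESHOLD word
# characters according to pdfgrep -c. THRESHOLD defaults to 1000.
# The body runs in a subshell, ( ... ), so its variables do not leak.
needs_ocr() (
  threshold=${2:-1000}
  # pdfgrep exits non-zero when there is no match; "|| :" ignores that
  # status because only the printed count is of interest.
  count=$(pdfgrep -c '\w' "$1" 2>/dev/null) || :
  # Assumption: empty or non-numeric output (e.g. after an error) counts as 0.
  case $count in
    '' | *[!0-9]*) count=0 ;;
  esac
  [ "$count" -lt "$threshold" ]
)
```

For example, needs_ocr ./scan.pdf 200 && printf '%s\n' ./scan.pdf prints the name only if fewer than 200 word characters were found.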
Stéphane Chazelas
  • That will get a lot of false positives, since many scanned pdfs include notices / watermarks as text. –  Jan 15 '21 at 09:33
  • @user414777, ITYM false negatives, as it would fail to report those files. I've added a variant that counts the number of word characters (and which could have false positives in addition to false negatives). – Stéphane Chazelas Jan 15 '21 at 09:49