6

I have a lot of images, and I need to find which of them contain any English text (so that I can delete them). Is it possible to do this automatically?

2 Answers

4

I had the same problem, sharing my solution:

find . -type f \( -name "*.jpg" -o -name "*.png" \) -exec sh -c 'for x; do printf "%s: " "$x"; tesseract "$x" temp; if grep -f blacklist temp.txt; then rm -- "$x"; fi; rm -f temp.txt; done' _ {} +

This scans all subdirectories, runs OCR on every .jpg and .png file, and deletes each image whose recognized text matches a pattern in a file named "blacklist". Note that "$x" has to be quoted: unquoted, a file name containing a space is split into words and tesseract is run on the first word only.

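Edit: be careful not to leave any blank lines in the blacklist file. An empty pattern matches every line, so images could be deleted regardless of what text they contain.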
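
One way to guard against blank lines, as a minimal sketch: filter them out of the pattern file before handing it to grep. The function name clean_patterns and the file name blacklist.clean are hypothetical and not part of the command above.

```shell
# Hypothetical helper: copy a grep pattern file, dropping empty and
# whitespace-only lines so that no empty pattern reaches grep -f.
clean_patterns() {
    # $1 = original pattern file, $2 = cleaned copy to write.
    # grep -v exits with status 1 when every line was blank; "|| true"
    # keeps that case from being reported as a failure.
    grep -v '^[[:space:]]*$' "$1" > "$2" || true
}
```

It could be called once as clean_patterns blacklist blacklist.clean, with grep -f blacklist.clean used in place of grep -f blacklist. A pattern file that ends up empty matches nothing, so an all-blank blacklist deletes no images.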

hydrix
2

You could use an open-source OCR engine, such as Tesseract, to determine whether an image contains English text.
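
As a rough sketch of how this could give a yes/no answer per image: tesseract writes the recognized text to standard output when the output base is given as stdout, and that text can then be tested for word-like runs of letters. The function names has_words and image_has_text are hypothetical, and the three-letter threshold is an assumption. It detects Latin letters rather than English specifically, and OCR noise may cause false positives.

```shell
# Hypothetical helper: succeed if the text on stdin contains at least one
# run of three or more ASCII letters (a crude stand-in for "has a word").
has_words() {
    grep -Eq '[A-Za-z]{3,}'
}

# Hypothetical wrapper: print "yes" or "no" for one image file.
# Assumes the tesseract CLI and its English language data are installed.
# The output base "stdout" sends the recognized text to standard output.
image_has_text() {
    if tesseract "$1" stdout -l eng 2>/dev/null | has_words; then
        echo yes
    else
        echo no
    fi
}
```

For example, image_has_text photo.jpg would print yes or no for a file named photo.jpg (a placeholder name).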

akond
  • Thanks, I understand the general idea that I need to use OCR, the question is - how to do it? Can tesseract return yes/no for each image? Or do I need to put text into file, and then analyze the created files? – Andrey Chetverikov Oct 18 '12 at 19:40
  • Image -> OCR -> text file. – akond Oct 19 '12 at 06:19