How to find all images containing any text?

Question

I have a lot of images, and I need to find which of them contain any English text (so that I can delete them). Is it possible to do this automatically?

"To delete them" could be a massive task: the program would need to recognize the text and fill in the missing background color. – daisy, Oct 18 '12 at 06:58

hydrix · Answer 1 · 2015-10-20T06:42:04.270

I had the same problem, sharing my solution:

find . -type f \( -name "*.jpg" -o -name "*.png" \) -exec sh -c 'for x; do printf "%s: " "$x"; tesseract "$x" temp; if grep -f blacklist temp.txt; then rm "$x"; fi; rm -f temp.txt; done' _ {} +

This scans all subdirectories, runs OCR on every .jpg and .png file, and deletes each image whose recognized text matches a pattern in a file named "blacklist" (one grep pattern per line). The quotes around "$x" matter: without them, a file name containing a space is split into words and tesseract tries to run on the first word only. The temporary temp.txt is removed after every image, whether or not it matched.

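edit: be careful not to leave any blank lines in the blacklist file. grep treats a blank line as an empty pattern, which matches every line, so every image with any OCR output would be deleted.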
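One way to guard against the blank-line problem is to filter the blacklist before handing it to grep. The following is only a minimal sketch; the helper name clean_blacklist is hypothetical and not part of the original answer.

```sh
# Hypothetical helper (not from the original answer): print the pattern
# file given as $1 with empty and whitespace-only lines removed, so that
# grep -f never receives an empty pattern that would match everything.
clean_blacklist() {
    grep -v '^[[:space:]]*$' "$1"
}
```

It could be used as clean_blacklist blacklist > blacklist.clean, with the one-liner then pointed at blacklist.clean instead of blacklist.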

score 2 · Accepted Answer · answered Oct 17 '12 at 10:11


You could use an open-source OCR engine, such as Tesseract, to determine whether an image contains English text.

akond

Thanks, I understand the general idea that I need to use OCR; the question is how to do it. Can tesseract return yes/no for each image, or do I need to write the text to a file and then analyze the created files? – Andrey Chetverikov Oct 18 '12 at 19:40
Image -> OCR -> text file. – akond Oct 19 '12 at 06:19
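Tesseract itself does not give a yes/no answer, but the Image -> OCR -> text pipeline can be wrapped so that the exit status acts as one. The sketch below is an assumption-laden illustration, not the answerer's method: the function name has_text is hypothetical, and treating "three or more consecutive ASCII letters" as "contains text" is an arbitrary threshold chosen to ignore OCR noise. It relies on tesseract writing to standard output when the output base is given as stdout, and on the English language data (-l eng) being installed.

```sh
# Hypothetical helper: exit status 0 if OCR finds something that looks
# like text in the image "$1", non-zero otherwise.
# Assumption: a run of 3+ ASCII letters counts as text; tune as needed.
has_text() {
    # "stdout" as the output base makes tesseract print the recognized
    # text instead of writing a .txt file; diagnostics go to stderr.
    tesseract "$1" stdout -l eng 2>/dev/null | grep -Eq '[A-Za-z]{3,}'
}
```

For example, if has_text "photo 1.png"; then echo "contains text"; fi. Note that if tesseract fails or is not installed, the pipeline sees no output and the image is reported as containing no text, so it is worth checking a few known images by hand before deleting anything.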
