I have a lot of images, and I need to find which of them contain any English text (so that I can delete those images). Is it possible to do this automatically?
Viewed 2,602 times
6
-
"To delete them": that could be a massive task. The program would need to recognize the text and then fill in the missing background color. – daisy Oct 18 '12 at 06:58
-
"to delete them" - to delete images, not the text – Andrey Chetverikov Oct 18 '12 at 19:37
2 Answers
4
I had the same problem; here is my solution:
find . -type f \( -name "*.jpg" -o -name "*.png" \) -exec sh -c 'for x; do printf "%s: " "$x"; tesseract "$x" temp; if grep -f blacklist temp.txt; then rm "$x"; fi; rm -f temp.txt; done' _ {} +
This scans all subdirectories, runs OCR on each image, and deletes every image whose recognized text matches a pattern in a file named "blacklist". Note that "$x" must stay quoted: without the quotes, a file name containing a space is split, and the command tries to run on the first word of the name instead.
Edit: be careful not to leave any blank lines in the blacklist file. An empty pattern matches every line, so any image that produces OCR output would be deleted.
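Since the command above deletes files immediately, a dry run may be safer. Below is a minimal sketch, not part of the original answer, that only lists the images that would match and ignores blank lines in the blacklist. The function names `matches_blacklist` and `list_matches` are hypothetical, and the sketch assumes a Tesseract build that accepts `stdout` as the output base name.

```shell
#!/bin/sh
# Sketch only: list images whose OCR text matches the blacklist, without deleting.

# Succeeds if the text on stdin matches any non-blank pattern in file $1.
# Blank lines are dropped first, because an empty pattern matches every line.
matches_blacklist() {
    patterns=$(grep -v '^[[:space:]]*$' "$1") || return 1   # no usable patterns
    grep -q -e "$patterns"
}

# Prints each image given as an argument whose OCR text matches ./blacklist.
# Assumption: "tesseract IMAGE stdout" writes the recognized text to stdout.
list_matches() {
    for img do
        if tesseract "$img" stdout 2>/dev/null | matches_blacklist blacklist; then
            printf '%s\n' "$img"
        fi
    done
}
```

Calling `list_matches ./*.jpg ./*.png` prints the candidates; once the list looks right, the same files can be removed by hand or by the original command.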
hydrix
2
You could use an open-source OCR engine, such as Tesseract, to determine whether an image contains English text.
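Tesseract itself returns recognized text rather than a yes/no answer, so the decision has to be made on its output. The following is a rough sketch under stated assumptions: the helper names `looks_like_text` and `image_has_text` are hypothetical, the threshold of three words is an arbitrary guess that will need tuning, and the Tesseract build is assumed to accept `stdout` as the output base name and to have the `eng` language data installed.

```shell
#!/bin/sh
# Sketch only: turn OCR output into a yes/no answer per image.

# Reads text on stdin; succeeds if it contains at least $1 words (default 3)
# made of three or more ASCII letters. OCR noise on text-free images tends to
# produce stray symbols, so short fragments are ignored.
looks_like_text() {
    min_words=${1:-3}
    count=$(grep -oE '[A-Za-z]{3,}' | wc -l | tr -d ' ')
    [ "$count" -ge "$min_words" ]
}

# Prints "yes" or "no" for the image file given as $1.
image_has_text() {
    if tesseract "$1" stdout -l eng 2>/dev/null | looks_like_text; then
        echo yes
    else
        echo no
    fi
}
```

For example, `image_has_text photo.jpg` prints `yes` or `no`. No intermediate text files are needed, because the OCR output is piped straight into the check.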
akond
-
Thanks. I understand the general idea that I need to use OCR; the question is how to do it. Can Tesseract return yes/no for each image, or do I need to write the text to a file and then analyze the created files? – Andrey Chetverikov Oct 18 '12 at 19:40