48

I have several thousand pages of scanned book pages. Each page is saved individually as a JPG. The writing is clear, but fonts vary, and the pages do include pictures and illustrations.

I need to create a list of all of the words appearing in each JPG file. Is there a command line tool for scanning an image listing the words that appear? It does not need to have perfect scanning, just an estimate.

Village
  • https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage - that's the engine behind ocrmypdf, and, in general, will allow you to have fewer steps in your setup. – oakad Jul 10 '17 at 05:08
  • 4
    Possible duplicate of [OCR on Linux systems](https://unix.stackexchange.com/questions/548/ocr-on-linux-systems) – curiousdannii Jul 10 '17 at 06:32
  • 2
    duplicate is a bit old, newer stuff might exist. I'll vote to leave open. – Archemar Jul 10 '17 at 07:38

4 Answers

43

tesseract is probably the most-used solution here. It's available in most package repositories, e.g.,

sudo apt install tesseract-ocr

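and can be used with

tesseract input.png out

which writes the recognized text to out.txt (tesseract appends the .txt extension to the output base name itself).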
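
To get one word list per JPG, as the question asks, the tesseract call can be wrapped in a loop. This is a minimal sketch, assuming a POSIX shell and pages named *.jpg. The function names unique_words and ocr_pages and the .words.txt suffix are made up for illustration.

```shell
#!/bin/sh
# Sketch only: assumes tesseract is installed and on PATH.

# Turn text on stdin into a sorted list of unique lowercase words.
unique_words() {
    tr -cs '[:alpha:]' '\n' | tr '[:upper:]' '[:lower:]' | sed '/^$/d' | sort -u
}

# OCR every JPG in a directory (default: current one).
# page.jpg -> page.words.txt (hypothetical naming scheme)
ocr_pages() {
    dir=${1:-.}
    for img in "$dir"/*.jpg; do
        [ -e "$img" ] || continue   # no matches: skip the literal pattern
        # "stdout" as the output base makes tesseract print the text
        # instead of writing a file; its status messages go to stderr.
        tesseract "$img" stdout 2>/dev/null | unique_words > "${img%.jpg}.words.txt"
    done
}
```

Calling `ocr_pages /path/to/scans` would then leave one .words.txt file next to each image. Since only an estimate is needed, the word splitting simply treats every run of letters as a word.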
Nico Schlömer
24

Install imagemagick, pdftotext (shipped in the poppler-utils package by some package managers) and ocrmypdf. The latter is open-source, frequently updated OCR software; OCR is CPU-intensive, and ocrmypdf is configured to use all of your cores, so it runs fast. This approach is possibly overkill, since it tries to recognize the actual text of every word rather than just detecting that a word is present, but I've had a lot of trouble finding good, easy-to-use open-source OCR software in general. Then, in the directory where you have saved all your JPGs:

$ convert *.jpg pictures.pdf
$ ocrmypdf pictures.pdf scanned.pdf
$ pdftotext scanned.pdf scanned.txt
$ wc -w scanned.txt
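
The commands above produce a single text file for the whole book. If the text is needed per page, pdftotext can extract one page at a time with its -f (first page) and -l (last page) options. This is a rough sketch, assuming pdfinfo from the same poppler package is available. The function names and the page-0001.txt naming are invented here, and page N should correspond to the Nth JPG in the shell's glob order.

```shell
#!/bin/sh
# Sketch only: assumes pdfinfo and pdftotext (poppler-utils) are installed.

# Extract the page count from `pdfinfo` output on stdin.
page_count() {
    awk '/^Pages:/ {print $2}'
}

# Write page-0001.txt, page-0002.txt, ... one file per page of the OCRed PDF.
split_text_by_page() {
    pdf=$1
    n=$(pdfinfo "$pdf" | page_count)
    i=1
    while [ "$i" -le "$n" ]; do
        pdftotext -f "$i" -l "$i" "$pdf" "$(printf 'page-%04d.txt' "$i")"
        i=$((i + 1))
    done
}
```

For example, `split_text_by_page scanned.pdf` would write the per-page files into the current directory.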
rien333
16

Upscale the image file.png by 480%, convert it to greyscale, backfill with white, sharpen, and then extract the text with tesseract. This works well for me most of the time, except with very large fonts and white-on-black text. If the fonts are very large, upscale by only 200% or 300%.

 convert file.png -colorspace gray -fill white -resize 480% -sharpen 0x1 file.jpg
 tesseract file.jpg file

The result is in file.txt.
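
Because the best scale factor depends on the font size, it can help to make it a parameter. This is a small sketch under the same assumptions (ImageMagick's convert and tesseract installed). The names base_name and ocr_scaled and the .ocr.png suffix are hypothetical, and the intermediate file is written as PNG here only to avoid a second round of JPEG compression.

```shell
#!/bin/sh
# Sketch only: assumes ImageMagick (convert) and tesseract are installed.

# Strip the extension: scans/p1.png -> scans/p1
# (assumes the file name itself contains a dot)
base_name() {
    printf '%s\n' "${1%.*}"
}

# Preprocess one image and OCR it; scale is a percentage, default 480.
ocr_scaled() {
    img=$1
    scale=${2:-480}
    base=$(base_name "$img")
    convert "$img" -colorspace gray -fill white -resize "${scale}%" -sharpen 0x1 "$base.ocr.png" &&
        tesseract "$base.ocr.png" "$base"   # tesseract writes $base.txt
}
```

With this, `ocr_scaled file.png` uses 480%, while `ocr_scaled file.png 300` is the variant for pages with very large fonts.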

Eamonn Kenny
  • 1
    This is what worked for me with a very small piece of non-English text with a tiny font size. Amazing. – Avio Dec 13 '18 at 14:47
-1

For Linux users, nothing works as well as using Calibre to convert PDF to DOCX. https://calibre-ebook.com/download_linux