How to use OCR from the command line in Linux?

Question

I have several thousand pages of scanned book pages. Each page is saved individually as a JPG. The writing is clear, but fonts vary, and the pages do include pictures and illustrations.

I need to create a list of all of the words appearing in each JPG file. Is there a command line tool that scans an image and lists the words that appear in it? The recognition does not need to be perfect; an estimate is enough.

https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage - that's the engine behind ocrmypdf, and, in general, will allow you to have fewer steps in your setup. — oakad, Jul 10 '17 at 05:08
Possible duplicate of [OCR on Linux systems](https://unix.stackexchange.com/questions/548/ocr-on-linux-systems) — curiousdannii, Jul 10 '17 at 06:32
The duplicate is a bit old; newer stuff might exist. I'll vote to leave open. — Archemar, Jul 10 '17 at 07:38

Nico Schlömer · Answer 1 · 2019-10-07T07:59:37.717

43

tesseract is probably the most-used solution here. It's available in most package repositories, e.g.,

sudo apt install tesseract-ocr

and can be used with

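tesseract input.png out

The recognized text ends up in out.txt; tesseract appends the .txt extension to the output name itself.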
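
For the per-file word lists asked about in the question, a loop along these lines might work. It is only a sketch: the function names word_list and ocr_word_lists and the .words.txt suffix are made up for illustration, a "word" is assumed to be a run of plain letters, and it relies on tesseract accepting stdout as the output name.

```shell
# Read text on stdin and print each distinct word once, lowercased and sorted.
# Assumption: a "word" is a run of letters; digits and punctuation separate words.
word_list() {
    tr '[:upper:]' '[:lower:]' \
        | tr -cs '[:alpha:]' '\n' \
        | sed '/^$/d' \
        | sort -u
}

# OCR every JPG in a directory (default: the current one) and write
# NAME.words.txt next to each NAME.jpg.
ocr_word_lists() {
    local dir=${1:-.} img
    for img in "$dir"/*.jpg; do
        [ -e "$img" ] || continue    # nothing matched: the pattern stays literal
        tesseract "$img" stdout 2>/dev/null | word_list > "${img%.jpg}.words.txt"
    done
}
```

Calling ocr_word_lists in the directory with the scans should leave one .words.txt file per page. With several thousand pages this runs one image at a time, so expect it to take a while.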

edited Oct 07 '19 at 07:59

answered Aug 09 '17 at 15:16

rien333 · Answer 2 · 2017-07-10T11:39:03.333

24

Install imagemagick, pdftotext (shipped in a package named poppler-utils by some package managers) and ocrmypdf. The latter is fast (OCR is CPU-intensive, and ocrmypdf is configured to use all your cores), open source, and frequently updated. This approach is possibly overkill, since it actually tries to assign a string to each word instead of just labeling it as a word, but I've had a lot of trouble finding good, easy-to-use open-source OCR software in general. Then, in the directory where you have saved all your JPGs:

$ convert *.jpg pictures.pdf
$ ocrmypdf pictures.pdf scanned.pdf
$ pdftotext scanned.pdf scanned.txt
$ wc -w scanned.txt
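
If the per-page breakdown matters, the text from pdftotext can be split back into pages. A rough sketch, assuming pdftotext's default behaviour of ending each page with a form feed, that PDF page N corresponds to the Nth JPG passed to convert, and that only ASCII letters count as parts of words; page_words is a made-up name:

```shell
# Print "PAGE<TAB>word" once for every distinct word on every page of a
# pdftotext output file. Pages are assumed to be separated by form feeds.
page_words() {
    awk 'BEGIN { RS = "\f" }    # one awk record per page
         {
             gsub(/[^A-Za-z]+/, " ")    # anything that is not an ASCII letter becomes a space
             for (i = 1; i <= NF; i++)
                 printf "%05d\t%s\n", NR, tolower($i)
         }' "$1" | LC_ALL=C sort -u
}
```

Running page_words scanned.txt > words-by-page.txt would then give one line per page and word, with the page number zero-padded so that a plain sort keeps the pages in order.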

edited Jul 10 '17 at 11:39

answered Jul 09 '17 at 21:45

fwiw, this uses the below mentioned Tesseract. – exic Mar 13 '18 at 07:06
`ocrmypdf` made my day – oh really Jan 27 '19 at 08:22
the idea of having to convert to pdf first is just goofy. why can't i just input a jpg file and get some raw text out? – Michael Mar 24 '19 at 17:03
You can use a bash file to do all the command lines for you. – projetmbc Jul 13 '19 at 09:20
Convert doesn't work any more. They turned off that option in its policy file and that's buried in a container - at least in Ubuntu. img2pdf will do it now. – Joe Jan 30 '22 at 14:01

score 16 · Answer 3 · answered Feb 07 '18 at 12:47

Upscale the image file.png to 480%, convert it to greyscale, backfill with white, sharpen it, and then extract the text with tesseract. This works well for me most of the time, except with very large fonts and white-on-black text. If the fonts are very large, upscale to only 200% or 300%.

 convert file.png -colorspace gray -fill white -resize 480% -sharpen 0x1 file.jpg
 tesseract file.jpg file

The result is in file.txt.
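
Since the right scale depends on the font size, it may be handy to wrap the two commands in a function that takes the percentage as an argument. This is just a sketch: ocr_upscaled, the DRY_RUN switch and the .ocr.jpg intermediate name are invented here (the extra suffix keeps a JPG input from being overwritten), and on ImageMagick 7 the convert command may need to be replaced by magick.

```shell
# Upscale, clean up, and OCR one image.
# Usage: ocr_upscaled IMAGE [PERCENT]    (PERCENT defaults to 480)
# Set DRY_RUN=1 to print the commands instead of running them.
ocr_upscaled() {
    local img=$1 pct=${2:-480}
    local base=${img%.*}            # strip the extension: scan.png -> scan
    local run=${DRY_RUN:+echo}      # "echo" when DRY_RUN is set, otherwise empty
    $run convert "$img" -colorspace gray -fill white -resize "${pct}%" -sharpen 0x1 "$base.ocr.jpg" &&
        $run tesseract "$base.ocr.jpg" "$base"    # text is written to $base.txt
}
```

For example, ocr_upscaled file.png 300 would try the 300% variant suggested above for large fonts, and leave the text in file.txt.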

Eamonn Kenny

This is what worked for me with a very small piece of non-english text with tiny font size. Amazing. – Avio Dec 13 '18 at 14:47

score -1 · Answer 4 · answered Jun 01 '18 at 23:02

For Linux users, nothing works as well as using Calibre to convert PDF to DOCX. https://calibre-ebook.com/download_linux

Larry Bradley
