Questions tagged [ocr]

OCR (optical character recognition) is the conversion of an image of characters into machine-readable encoded text. Use this tag for questions involving this type of conversion or software that performs OCR. When possible, indicate the software you use and the source and target formats of the conversion.

Tools used for OCR:

tesseract
pdfsandwich (developed until 2018)
gocr
ocrad
ocrfeeder
ocrmypdf
ocropus
cuneiform
clara
Linux-Intelligent-Ocr-Solution

39 questions

4 answers

How to OCR a PDF file and get the text stored within the PDF?

First, apologies if this has been asked before - I searched for a while through the existing posts, but could not find support. I am interested in a solution for Fedora to OCR a multipage non-searchable PDF and to turn this PDF into a new PDF file…

command-line pdf ocr

asked Aug 04 '16 at 15:39

ingli

4 answers

How to use OCR from the command line in Linux?

I have several thousand pages of scanned book pages. Each page is saved individually as a JPG. The writing is clear, but fonts vary, and the pages do include pictures and illustrations. I need to create a list of all of the words appearing in each…

command-line ocr

asked Jul 09 '17 at 21:22

Village

4 answers

Is there some sort of PDF-to-text converter?

I need PDF files converted to text so I can search them in bulk from the command line. Is there a converter for Ubuntu, OBSD or a similar distro? Perhaps a related post: OCR with Ubuntu.

search pdf ocr text

asked Dec 11 '10 at 14:46

otto

5 answers

OCR on Linux systems

I have always found OCR technology to be behind on open source systems. I've also watched the Ocropus project since its infancy. I've tried what I've heard is the best OCR engine available for Linux, Tesseract, and have found it woefully lacking…

opensource-projects ocr documents

asked Aug 16 '10 at 22:27

jjclarkson

2 answers

Tesseract: High CPU Usage and slow speed, only when running multiple processes in parallel

Problem: pytesseract.image_to_string() takes too much time when I run the script through supervisord, but executes almost instantaneously when run directly in shell (on the same server and simultaneously with supervisor scripts). Apart from taking…

ocr tesseract

asked Jul 18 '19 at 08:29

Ashish

3 answers

How can I rasterize all of the text in a PDF?

You know when you have a pdf, which is a scan of a document and it's a really huge file, because it just stores the picture of the scanned document? And there are OCR tools which can help you to make a proper document which just stores the…

linux pdf pdftk ocr

asked Apr 26 '15 at 14:09

Dimitri Schachmann

2 answers

How to find all images containing any text?

I have a lot of images, and I need to find which of them contain any text in English (to delete them). Is it possible to do this automatically?

images ocr

asked Oct 17 '12 at 09:59

Andrey Chetverikov

1 answer

Find PDFs that don't have text

I have many folders with lots of PDFs and I want to Optical Character Recognise those that do not have a text layer. So first, I want to find them. I thought that maybe a pipe with pdfgrep would do the job, but I'm lost. How can I find PDFs that do…

find pdf ocr

asked Jan 15 '21 at 04:11

fich

1 answer

tesseract: is it possible to change font output in OCRed pdf?

Following up on "How to OCR a PDF file and get the text stored within the PDF?", I have successfully produced OCRed PDF pages. In Evince, however, the letters are not shown; by this I mean that I cannot see the characters, but I can select them, copy them…

fonts pdf evince ocr tesseract

asked Aug 27 '16 at 08:14

ingli

0 answers

Replace Scanned Text with OCRed Text in PDF

I have a scanned book as a PDF. When viewed in Evince, the book appears as it did when scanned, with old fashioned fonts that appear as they were scanned. However, Evince recognises the letters as characters, and I am able to select, cut, and copy…

pdf ocr

asked Feb 24 '19 at 00:39

zhanmusi

1 answer

Delete OCR from PDF

I have a PDF file containing corrupted OCR. It is a bunch of handwritten pages with a lot of symbols and abbreviations, and I got this file with an automatically generated OCR. How can I remove the text layer in order to get a lighter file (and to…

pdf ocr

asked Jun 11 '17 at 22:46

Seninha

1 answer

De-obfuscate a picture with statistical information?

I need to get this kind of information into numbers, how? Perhaps…

ocr dsp

asked Feb 04 '12 at 18:01

user2362

3 answers

sed one-liner to replace word-medial capitals

I used OCR to turn some scans into plaintext, but unfortunately the letters 'fi' which are commonly joined in some fonts, got read in as capital W's. Now I need to replace all the W's with 'fi', and these can easily be distinguished by the fact that…

sed ocr

asked May 26 '11 at 23:47

ixtmixilix

1 answer

Linux equivalent of GraphClick?

Is there a piece of Linux software that does what GraphClick does in Mac OS X? That is, is there Linux software that "is a graph digitizer software which allows to automatically retrieve the original (x,y)-data from the image of a scanned graph"?

data-recovery image-editor ocr graph

asked Apr 29 '11 at 15:30

hpy

0 answers

OCR that outputs probability data

I would like to convert printed books I own into audio by scanning them with OCR and then running the text through a TTS engine. These titles are not available as ebooks. Since OCR can make small errors especially when converting images containing…

ocr

asked Sep 27 '13 at 16:17

themirror
