Questions tagged [ocr]

OCR (optical character recognition) is the conversion of an image of characters into machine-readable encoded text. Use this tag for questions involving this type of conversion or software that performs OCR. When possible, indicate the software you use and the source and target of the conversion.

39 questions
78 votes, 4 answers

How to OCR a PDF file and get the text stored within the PDF?

First, apologies if this has been asked before - I searched for a while through the existing posts, but could not find support. I am interested in a solution for Fedora to OCR a multipage non-searchable PDF and to turn this PDF into a new PDF file…
ingli
48 votes, 4 answers

How to use OCR from the command line in Linux?

I have several thousand pages of scanned book pages. Each page is saved individually as a JPG. The writing is clear, but fonts vary, and the pages do include pictures and illustrations. I need to create a list of all of the words appearing in each…
Village
40 votes, 4 answers

Is there some sort of PDF-to-text converter?

I need PDF files as text so I can search over them in bulk from the command line. Is there a converter for Ubuntu, OBSD, or a similar distro? A possibly related post: OCR with Ubuntu, here.
otto
15 votes, 5 answers

OCR on Linux systems

I have always found OCR technology to be behind on open source systems. I've also watched the Ocropus project since its infancy. I've tried what I've heard is the best OCR engine available for Linux, Tesseract, and have found it woefully lacking…
jjclarkson
10 votes, 2 answers

Tesseract: High CPU Usage and slow speed, only when running multiple processes in parallel

Problem: pytesseract.image_to_string() takes too much time when I run the script through supervisord, but executes almost instantaneously when run directly in a shell (on the same server and simultaneously with the supervisor scripts). Apart from taking…
Ashish
7 votes, 3 answers

How can I rasterize all of the text in a PDF?

You know when you have a pdf, which is a scan of a document and it's a really huge file, because it just stores the picture of the scanned document? And there are OCR tools which can help you to make a proper document which just stores the…
6 votes, 2 answers

How to find all images containing any text?

I have a lot of images, and I need to find which of them contain any English text (to delete them). Is it possible to do this automatically?
5 votes, 1 answer

Find PDFs that don't have text

I have many folders with lots of PDFs and I want to Optical Character Recognise those that do not have a text layer. So first, I want to find them. I thought that maybe a pipe with pdfgrep would do the job, but I'm lost. How can I find PDFs that do…
fich
5 votes, 1 answer

tesseract: is it possible to change font output in OCRed pdf?

Following up on how to OCR a pdf file and get the text stored within pdf? I have successfully produced OCRed pdf pages. In Evince, however, the letters are not shown; by this I mean that I cannot see the characters, but I can select them, copy them…
ingli
4 votes, 0 answers

Replace Scanned Text with OCRed Text in PDF

I have a scanned book as a PDF. When viewed in Evince, the book appears as it did when scanned, with old fashioned fonts that appear as they were scanned. However, Evince recognises the letters as characters, and I am able to select, cut, and copy…
zhanmusi
4 votes, 1 answer

Delete OCR from PDF

I have a PDF file containing corrupted OCR. It is a bunch of handwritten pages with a lot of symbols and abbreviations, and I got this file with an automatically generated OCR. How can I remove the text layer in order to get a lighter file (and to…
Seninha
4 votes, 1 answer

De-obfuscate a picture with statistical information?

I need to get this kind of information into numbers, how? Perhaps…
user2362
4 votes, 3 answers

sed one-liner to replace word-medial capitals

I used OCR to turn some scans into plaintext, but unfortunately the letters 'fi' which are commonly joined in some fonts, got read in as capital W's. Now I need to replace all the W's with 'fi', and these can easily be distinguished by the fact that…
ixtmixilix
3 votes, 1 answer

Linux equivalent of GraphClick?

Is there a piece of Linux software that does what GraphClick does on Mac OS X? That is, is there Linux software that "is a graph digitizer software which allows to automatically retrieve the original (x,y)-data from the image of a scanned graph"?
hpy
2 votes, 0 answers

OCR that outputs probability data

I would like to convert printed books I own into audio by scanning them with OCR and then running the text through a TTS engine. These titles are not available as ebooks. Since OCR can make small errors, especially when converting images containing…
themirror