Using tesseract for character recognition, the result is much worse than expected. How can I get better results?

Question

I wanted to add the output of a Linux boot to my question and decided to try optical character recognition, thinking that in 2022 there surely should be decent open source options (I have not tried OCR for a long time). Links found via Web search "praise" tesseract: it is second best on the chart at https://www.linuxlinks.com/ocrtools/, and https://askubuntu.com/questions/16268/whats-the-best-simplest-ocr-solution says:

Tesseract is probably the most accurate open source OCR engine available.

I installed it from the distro via apt-get and ran it. The out-of-the-box result is IMO awful. Why? Maybe it can be easily fixed? Or please advise another package that does the job. The page I tried to recognize has no pictures, so as I see it this is a rather easy task. See the result below:

Edit: in fact, the result was much better when only a small part was processed, but when the whole image is processed the results are not OK. I understand that making the lines more horizontal and less skewed might help a lot; still, I was hoping software had become good at recognizing imperfectly aligned text.

oon usb 1-@: |
“3792661 usb 1-8: New USB device found, idVendor=1343, idProduct:

7.983163] usb 1-8: New USB dev bs P luct=5662, bedDevice=16.6?

re eh peeled haibbetaia a

: new high-speed USB device number 5 PhS |
i

Per Samm SCR Can)
t pela ee rcpt PP cay
: 2.998668) usb 1-8: er
t
Ct

When only a small part is processed:

2.837811) usb 1-8: new high-speed USB device number 5 using xhei_hed

2.979266] usb 1-8: New USB device ECU CREME Cnt ttc cain Tt teen Td
7.983163] usb 1-8: New USB device strings: Mfr=1, Product=2, SerialNumbers@

?.9869291 usb 1-8: Product: Integrated Camera

Added 1:

I tried again with a smaller and less skewed picture. I guess the software treats the time stamps as a separate column; I have not seen options on the man page to tweak that:

f a eg
| 7.849264]
Device= 6.44
f 7 .6492961
| 7.849355]
f 7.849415]
[ 7.849492]
| Van eos
fl 7.861846]
if Va ACB
| 7.864776]
if eel Be
Ha Bs) bs 4
if be A be ge
C ie BD LB
ce B)
te] Bs]
rage
lb eae
8.962076)
ie Ke Lb
9.600567)
9.696957)
9 .6970371

YS SF SS Se

usb 1-8: new high-speed USB device number 4 using xhci_hcd
usb 1-8: New USB device found, idVendor=04f2, idProduct=b449, bed

usb 1-8: New USB device strings: Mfr=3, Product=1, SerialNumber=2
usb 1-8: Product: Integrated Camera

usb 1-8: Manufacturer: Chicony Electronics Co.,Ltd.
usb 1-8: SerialNumber: 6x0001

usb-storage 1-1:1.6: USB Mass Storage device detected

scsi host3:

usb-storage 1-1:1.6

usbcore: registered new interface driver usb-storage
usbcore: registered new interface driver uas

scsi 3:0:6:@: Direct-fAccess General UDisk eg
sd 3:0:0:0: Attached scsi generic sgi type @

eM Pee PM eA PA ed) te) ae
Py Me ee dd

Py ee ee eee dm

sd 3:0:0:0: [sdb] Assuming drive cache: write through

sdb: sdbi sdb2 sdb3

sd 3:0:0:0: [sdb] Attached SCSI removable disk

squashfs: version 4.6 (2609/01/31) Phillip Lougher

Copying live image to RAM...
Ca ewe te Mae
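
On the time stamps ending up in their own column: tesseract's `--psm` (page segmentation mode) option appears to be the relevant setting, and mode 6 tells it to assume a single uniform block of text instead of detecting columns. Below is a minimal sketch, assuming tesseract 4 or later and ImageMagick 6's `convert`. The function name and file names are made up for illustration, and the 40% deskew threshold is only the starting value suggested in ImageMagick's documentation, not something tuned for this photo.

```shell
# Sketch: straighten a photo of the screen, then OCR it as one block of text.
# ocr_boot_photo and all file names are hypothetical placeholders.
ocr_boot_photo() {
    src=$1                      # input photo, e.g. boot.jpg
    base=${src%.*}              # file name without its extension
    clean="${base}-clean.png"   # intermediate, preprocessed image

    # ImageMagick: convert to grayscale and try to straighten skewed lines.
    convert "$src" -colorspace Gray -deskew 40% +repage "$clean" || return 1

    # tesseract: --psm 6 = "assume a single uniform block of text",
    # so time stamps should not be split off as a separate column.
    tesseract "$clean" "${base}-ocr" --psm 6 || return 1

    # tesseract appends .txt to the output base name.
    echo "${base}-ocr.txt"
}
```

Called as `ocr_boot_photo boot.jpg`, this would write `boot-clean.png` and `boot-ocr.txt`. Whether it actually helps on a skewed photo of a screen is not guaranteed.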

I’m voting to close this question because it is about fine-tuning settings for an analysis software. This is impossible to answer without the sample dataset and knowledge of the specific criteria. It would be better placed in a discussion forum on tesseract or other general optical recognition tools. — AdminBee, Jan 10 '22 at 08:09
Tesseract isn't great with fuzzy or skewed images (which is why ocrmypdf has a `-d`, `--deskew` option to help with PDFs made from poor scans). Fortunately, boot failure images are one of the few kinds of images in U&L questions that aren't likely to trigger complaints. — cas, Jan 10 '22 at 08:41
If you want to try deskewing the images yourself, you could try converting your photo to a PDF and then using [ocrmypdf](https://github.com/jbarlow83/OCRmyPDF), or save the image as a tiff and use `tiff_findskew` from [pagetools](https://sourceforge.net/projects/pagetools/). Not guaranteed to work, but I've had some great results with ocrmypdf's `-d` option on some PDFs made from abysmally bad scanned images. BTW `ocrmypdf` and `pagetools` are available as packages for Debian and derivatives, and probably for other distros too. — cas, Jan 10 '22 at 08:42
@cas, thank you. I've tried `ocrmypdf`. It produces text overlaid on the picture; when the text is copied, it is pasted as separate short objects / many short lines, so I do not get the long log lines as single lines. Summary: not OK. — Martian2020, Jan 10 '22 at 12:14
Yeah, well, garbage in = garbage out. There's only so much that can be done to fix poor quality images. Try `pdftotext -layout` on the PDF generated by ocrmypdf - the -layout option often produces better results, especially on PDFs with multiple columns of text. Also BTW, if you have less configured to use `lesspipe`, it will automatically run `pdftotext -layout` if you use less to "view" a pdf. NOTE: pdftotext does not do OCR, it just extracts the text layer (if one exists) from a PDF. — cas, Jan 10 '22 at 12:32
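
The two steps from the comments above (OCR with deskewing, then layout-preserving text extraction) can be chained roughly like this. A sketch only: the function name and file names are hypothetical, `--image-dpi 300` is a guess for a photo that carries no DPI metadata, and feeding an image directly to `ocrmypdf` assumes a reasonably recent version.

```shell
# Sketch: OCR a photo into a searchable PDF, then pull the text out
# with the physical layout preserved. ocr_via_pdf is a made-up name.
ocr_via_pdf() {
    src=$1                   # input image, e.g. photo.png
    base=${src%.*}           # file name without its extension
    pdf="${base}-ocr.pdf"    # searchable PDF written by ocrmypdf

    # --deskew straightens the page before OCR; --image-dpi supplies a
    # resolution when the image has none (300 is an assumed value).
    ocrmypdf --deskew --image-dpi 300 "$src" "$pdf" || return 1

    # pdftotext does no OCR itself; it only extracts the text layer.
    # -layout keeps each long log line together where it can.
    pdftotext -layout "$pdf" "${base}.txt" || return 1

    echo "${base}.txt"
}
```

For example, `ocr_via_pdf photo.png` would leave `photo-ocr.pdf` and `photo.txt` next to the input. If the extracted lines are still fragmented, the image quality is the more likely culprit than the extraction step.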
Finally, the thing you really need to understand about OCR is that it's complicated and can be difficult to get good results. Unless you have high-quality images that are perfectly suited to being OCR-ed, you're probably not going to get good results first time (and probably not at all). You need a fair amount of experience and understanding of how it all works and what's happening. I'm certainly no expert on the topic, but I've messed around with it for enough years to mostly get reasonable results, and even then I often have to do a lot of manual proof-reading and editing with vi. — cas, Jan 10 '22 at 12:37
That said, tesseract is many times better today than it was just a few years ago....it rarely makes mistakes on decent quality images, and most of the mistakes it does make are kind of excusable (e.g. mistaking a "." for a "," or an "i" or "1" for an "l"). — cas, Jan 10 '22 at 12:40
I was playing with `imagemagick-6.q16` to improve the image quality so that the text can be detected better. Although modifying the image increases accuracy in one part of the image, it decreases accuracy in other parts. Please try `convert -units PixelsPerInch in.png -set colorspace Gray -separate -average -bordercolor White -border 10x10 -unsharp 6.8x1+2.69+0 -alpha off -resample 300 out.png` to see what I mean. — Ahmad Ismail, Jan 10 '22 at 13:10
