07-15-2024, 02:33 PM | #1 |
Junior Member
Posts: 3
Karma: 10
Join Date: Sep 2018
Device: Kindle
|
Having trouble OCR-ing a 70MB pdf file
Hey all, I hope this is the appropriate forum section to post this.
I have a PDF of a book I downloaded from archive.org, and I have been trying to OCR it with multiple offline programs and online services, but to no avail. I always get unspecified errors and am not sure whether it has to do with the file size. Any suggestions would be appreciated. |
07-16-2024, 03:08 PM | #2 |
Reading till the spring
Posts: 12,518
Karma: 94058919
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper
|
You need to pre-process it with GIMP, ImageMagick, or k2pdfopt.
Don't use online tools. You haven't said which tools and OS you are using. It's not the file size. |
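For anyone following along, here is a rough sketch of what that kind of pre-processing could look like when scripted. It is only an illustration under assumptions: it assumes Poppler's pdftoppm and ImageMagick 7 (the magick command) are installed, and the function names, the "pages" folder, the 300 dpi setting and the 40% deskew threshold are hypothetical choices, not anything specified in this thread.

```python
import subprocess
from pathlib import Path


def rasterize_cmd(pdf: str, out_prefix: str, dpi: int = 300) -> list[str]:
    """Build a pdftoppm (Poppler) command that writes one PNG per page."""
    # The 300 dpi default is an assumption; adjust it to suit the scan.
    return ["pdftoppm", "-r", str(dpi), "-png", pdf, out_prefix]


def cleanup_cmd(src: str, dst: str) -> list[str]:
    """Build an ImageMagick 7 command: grayscale, deskew, stretch contrast."""
    # The 40% deskew threshold is a commonly used starting value, not a rule.
    return ["magick", src, "-colorspace", "Gray",
            "-deskew", "40%", "+repage", "-normalize", dst]


def preprocess(pdf: str, workdir: str = "pages") -> None:
    """Split the PDF into page images, then write a cleaned copy of each."""
    out = Path(workdir)
    out.mkdir(exist_ok=True)
    subprocess.run(rasterize_cmd(pdf, str(out / "page")), check=True)
    # pdftoppm names its output page-01.png, page-02.png, and so on.
    for png in sorted(out.glob("page-*.png")):
        clean = png.with_name("clean-" + png.name)
        subprocess.run(cleanup_cmd(str(png), str(clean)), check=True)
```

Calling preprocess("book.pdf") would leave clean-page-*.png files next to the raw pages. Splitting the PDF into single pages also makes it easier to see which page, if any, is triggering an error.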
07-17-2024, 08:30 PM | #3 | |
Evangelist
Posts: 422
Karma: 2737916
Join Date: May 2013
Location: Ontario, Canada
Device: Kindle KB, Oasis, Pop_Os!, Jutoh, Kobo Forma
|
Quote:
Sorry, but give up any idea that some one-button OCR is ever going to give you a readable book from old-book scans. |
|
07-18-2024, 10:34 AM | #4 | |
Reading till the spring
Posts: 12,518
Karma: 94058919
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper
|
Quote:
I still see spelling checkers and grammar checkers miss the same stuff as 40 years ago. Minuscule improvements. OCR needs human proofing, and the IA images need to be "cleaned" first. |
|
07-19-2024, 12:02 PM | #5 |
Evangelist
Posts: 422
Karma: 2737916
Join Date: May 2013
Location: Ontario, Canada
Device: Kindle KB, Oasis, Pop_Os!, Jutoh, Kobo Forma
|
Interesting that over the past few years my setup using Tesseract with OCRFeeder as a front end has become considerably better on old book images. Google has been developing it recently, and I understand a new AI/neural network bit has been added, but only for some detail, IIRC.
While old typesetting and generally poor images still cause many errors, especially in punctuation, Tesseract can sometimes read words that I struggle to figure out. I rarely have to clean up IA images any more, unless they are really badly tilted, keystoned, or have something geometrically wrong. On one recent book, Tesseract was doing fine, but I had to clean up the images so I could read them for proofing! |
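If you would rather drive Tesseract directly instead of through a front end, a minimal sketch of a page-by-page run might look like the following. It assumes the tesseract command-line program and the English language data are installed and that the pages are already PNG files in one folder; the function names and the folder layout are hypothetical.

```python
import subprocess
from pathlib import Path


def tesseract_cmd(image: str, out_base: str, lang: str = "eng",
                  fmt: str = "txt") -> list[str]:
    """Build a Tesseract CLI call: tesseract IMAGE OUTPUTBASE -l LANG CONFIG."""
    # "txt" writes OUTPUTBASE.txt; "pdf" and "hocr" are other standard configs.
    return ["tesseract", image, out_base, "-l", lang, fmt]


def plan_ocr(page_dir: str, lang: str = "eng") -> list[list[str]]:
    """Return one Tesseract command per PNG page, in sorted page order."""
    pages = sorted(Path(page_dir).glob("*.png"))
    return [tesseract_cmd(str(p), str(p.with_suffix("")), lang) for p in pages]


def run_ocr(page_dir: str, lang: str = "eng") -> None:
    """Run OCR page by page so a single bad page is easy to identify."""
    for cmd in plan_ocr(page_dir, lang):
        subprocess.run(cmd, check=True)
```

Running run_ocr("pages") would leave one .txt file beside each page image, which is handy for proofing against the scan. The output still needs human proofreading, as noted above.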
|