Having trouble OCR-ing a 70MB pdf file

Lauriso · 07-15-2024, 02:33 PM

Hey all, I hope this is the appropriate forum section to post this.

I have a PDF of a book I downloaded from archive.org, and have been trying to OCR it with multiple offline programs and online services, but to no avail. I always get unspecified errors and am not sure if it's to do with the file size.

Any suggestions would be appreciated.

Quoth · 07-16-2024, 03:08 PM

You need to pre-process it with GIMP, ImageMagick, or k2pdfopt.

Don't use online tools.

You haven't said which tools and OS you are using. It's not the file size.
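
For readers following along, here is a rough sketch of what a pre-processing pass with ImageMagick might look like, driven from Python. It is not the workflow any poster in this thread actually uses. It assumes Poppler's `pdftoppm` and ImageMagick's `convert` are installed and on the PATH; the file name `book.pdf`, the `work` directory, the 300 dpi resolution, and the 40% deskew threshold are all illustrative choices.

```python
import subprocess
from pathlib import Path


def extract_cmd(pdf: str, out_prefix: str, dpi: int = 300) -> list[str]:
    # pdftoppm (Poppler) renders one PNG per page, named <out_prefix>-<n>.png.
    return ["pdftoppm", "-r", str(dpi), "-png", pdf, out_prefix]


def clean_cmd(src: str, dst: str) -> list[str]:
    # ImageMagick: convert to grayscale, then straighten a tilted scan.
    # On ImageMagick 7 the binary is usually "magick" rather than "convert".
    return ["convert", src, "-colorspace", "Gray", "-deskew", "40%", dst]


def preprocess(pdf: str, workdir: str) -> list[Path]:
    """Split the PDF into page images and write a cleaned copy of each."""
    work = Path(workdir)
    work.mkdir(parents=True, exist_ok=True)
    subprocess.run(extract_cmd(pdf, str(work / "page")), check=True)

    cleaned = []
    # sorted() materialises the list first, so the new files are not re-globbed.
    for page in sorted(work.glob("page-*.png")):
        out = page.with_name("clean-" + page.name)
        subprocess.run(clean_cmd(str(page), str(out)), check=True)
        cleaned.append(out)
    return cleaned


if __name__ == "__main__":
    # Hypothetical input file and working directory.
    for image in preprocess("book.pdf", "work"):
        print(image)
```

Working page by page like this also makes it easier to see which page a tool chokes on, rather than getting one unspecified error for the whole file.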

retiredbiker · 07-17-2024, 08:30 PM

Quote:

Originally Posted by Lauriso

Hey all, I hope this is the appropriate forum section to post this.

I have a PDF of a book I downloaded from archive.org, and have been trying to OCR it with multiple offline programs and online services, but to no avail. I always get unspecified errors and am not sure if it's to do with the file size.

Any suggestions would be appreciated.

Look at reply #20 for a complete workflow: https://www.mobileread.com/forums/sh...=361912&page=2

Sorry, but give up any idea that some one-button OCR is ever going to give you a readable book from old-book scans.

Quoth · 07-18-2024, 10:34 AM

Quote:

Originally Posted by retiredbiker

Sorry, but give up any idea that some one-button OCR is ever going to give you a readable book from old-book scans.

Not unless someone invents "real" AI, which might be impossible.

I still see spelling checkers and grammar checkers miss the same stuff as 40 years ago. Minuscule improvements.

OCR needs human proofing and also the IA images need "cleaned" first.

retiredbiker · 07-19-2024, 12:02 PM

Quote:

Originally Posted by Quoth

... and also the IA images need "cleaned" first.

Interesting that over the past few years my setup using Tesseract with OCRFeeder as a front end has become considerably better on old book images. Google has been developing it recently, and I understand a new AI/neural network bit has been added, but only for some detail, IIRR.

While old typesetting and generally poor images still cause many errors - especially punctuation - Tesseract can sometimes read words that I struggle to figure out. I rarely have to clean up IA images any more, unless they are really badly tilted, keystoned, or have something geometrically wrong. With one recent book, Tesseract was doing fine, but I had to clean up the images so I could read them for proofing!
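
As a minimal sketch of running Tesseract directly on page images (without OCRFeeder, which is what the post above actually uses), something like the following could work. It assumes the `tesseract` command-line tool and its English language data are installed; the `work` and `text` directory names and the `clean-*.png` file pattern are hypothetical. As the thread stresses, the resulting text still needs human proofing.

```python
import subprocess
from pathlib import Path


def ocr_cmd(image: str, out_base: str, lang: str = "eng") -> list[str]:
    # "tesseract <image> <outputbase> -l <lang>" writes <outputbase>.txt.
    return ["tesseract", image, out_base, "-l", lang]


def ocr_pages(images: list[Path], out_dir: str, lang: str = "eng") -> list[Path]:
    """OCR each page image to its own text file so pages can be proofed one by one."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    texts = []
    for image in images:
        base = out / image.stem
        subprocess.run(ocr_cmd(str(image), str(base), lang), check=True)
        texts.append(Path(str(base) + ".txt"))
    return texts


if __name__ == "__main__":
    # Hypothetical directory of page images produced by an earlier step.
    pages = sorted(Path("work").glob("clean-*.png"))
    for text_file in ocr_pages(pages, "text"):
        print(text_file)
```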
