Old 07-15-2024, 02:33 PM   #1
Lauriso
Junior Member
Posts: 3
Karma: 10
Join Date: Sep 2018
Device: Kindle
Having trouble OCR-ing a 70MB pdf file

Hey all, I hope this is the appropriate forum section to post this.

I have a PDF of a book I downloaded from archive.org and have been trying to OCR it with multiple offline programs and online services, but to no avail. I always get unspecified errors and am not sure whether it's to do with the file size.

Any suggestions would be appreciated.
Old 07-16-2024, 03:08 PM   #2
Quoth
the rook, bossing Never.
Posts: 12,378
Karma: 92073397
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
You need to pre-process it with GIMP, ImageMagick, or K2pdfopt.

Don't use online tools.

You don't say which tools and OS you are using. It's not the file size.
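
For anyone following along, here is a rough sketch of what that pre-processing step could look like when scripted, not a definitive recipe. It assumes Poppler's pdftoppm and ImageMagick's convert are installed and on the PATH (on ImageMagick 7 the command is usually magick instead). The 300 DPI setting, the 40% deskew threshold, and the "page" file prefix are illustrative choices, not values anyone in this thread has recommended.

```python
import subprocess
from pathlib import Path


def split_pdf_command(pdf_path, out_dir, dpi=300):
    """Build a pdftoppm command that renders every PDF page to a PNG file."""
    # Output files are named <out_dir>/page-<n>.png; "page" is an arbitrary prefix.
    prefix = str(Path(out_dir) / "page")
    # -r sets the resolution in DPI, -png selects PNG output.
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), prefix]


def clean_page_command(src_png, dst_png, deskew_threshold=40):
    """Build an ImageMagick command that greyscales, straightens and
    contrast-stretches one page image before OCR."""
    return [
        "convert", str(src_png),
        "-colorspace", "Gray",               # drop colour from the scan
        "-deskew", f"{deskew_threshold}%",   # straighten slightly tilted pages
        "-normalize",                        # stretch contrast
        str(dst_png),
    ]


def run(cmd):
    """Run one command and raise if the tool reports an error."""
    subprocess.run(cmd, check=True)
```

Typical use would be run(split_pdf_command("book.pdf", "pages")) followed by run(clean_page_command(...)) for each page, where book.pdf and pages are placeholder names and the pages folder has to exist already. Working page by page also makes it easier to see which page a tool is choking on.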
Old 07-17-2024, 08:30 PM   #3
retiredbiker
Evangelist
Posts: 420
Karma: 2737916
Join Date: May 2013
Location: Ontario, Canada
Device: Kindle KB, Oasis, Pop_Os!, Jutoh, Kobo Forma
Quote:
Originally Posted by Lauriso View Post
Hey all, I hope this is the appropriate forum section to post this.

I have a PDF of a book I downloaded from archive.org and have been trying to OCR it with multiple offline programs and online services, but to no avail. I always get unspecified errors and am not sure whether it's to do with the file size.

Any suggestions would be appreciated.
Look at reply #20 for a complete workflow: https://www.mobileread.com/forums/sh...=361912&page=2

Sorry, but give up any idea that some one-button OCR is ever going to give you a readable book from old-book scans.
Old 07-18-2024, 10:34 AM   #4
Quoth
the rook, bossing Never.
Quote:
Originally Posted by retiredbiker View Post
Sorry, but give up any idea that some one-button OCR is ever going to give you a readable book from old-book scans.
Not unless someone invents "real" AI, which might be impossible.

I still see spelling checkers and grammar checkers miss the same stuff as 40 years ago. Minuscule improvements.

OCR needs human proofing, and the IA images need to be "cleaned" first.
Old 07-19-2024, 12:02 PM   #5
retiredbiker
Evangelist
Quote:
Originally Posted by Quoth View Post
... and the IA images need to be "cleaned" first.
Interesting that over the past few years my setup using Tesseract with OCRFeeder as a front end has become considerably better on old book images. Google has been developing it recently, and I understand a new AI/neural network bit has been added, but only for some detail, IIRR.

While old typesetting and generally poor images still cause many errors, especially in punctuation, Tesseract can sometimes read words that I struggle to figure out. I rarely have to clean up IA images any more, unless they are really badly tilted, keystoned, or have something geometrically wrong. With one recent book, Tesseract was doing fine, but I had to clean up the images so I could read them for proofing!
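
If you would rather drive Tesseract from a script than through a front end like OCRFeeder, something along these lines might work. It is only a sketch under assumptions: the tesseract command-line tool is installed with the English language data, and the page images are named like page-12.png (a made-up naming scheme). The helper names are mine, not part of Tesseract.

```python
import re
import subprocess
from pathlib import Path


def page_number(path):
    """Return the trailing number in a name like 'page-012.png', or 0 if none."""
    match = re.search(r"(\d+)$", Path(path).stem)
    return int(match.group(1)) if match else 0


def sorted_pages(paths):
    """Sort page images numerically, so page-10 comes after page-2."""
    return sorted(paths, key=page_number)


def tesseract_command(image_path, lang="eng"):
    """Build a tesseract command; the text lands in <image name without extension>.txt."""
    out_base = str(Path(image_path).with_suffix(""))
    return ["tesseract", str(image_path), out_base, "-l", lang]


def ocr_pages(paths, lang="eng"):
    """OCR each page in order and return the combined text for proofing."""
    chunks = []
    for image in sorted_pages(paths):
        subprocess.run(tesseract_command(image, lang), check=True)
        chunks.append(Path(image).with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(chunks)
```

The result still needs proofing against the page images, as discussed above; the script only saves the clicking.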