Old 10-24-2022, 08:15 AM   #1
Tenome
Enthusiast
Posts: 38
Karma: 26
Join Date: Jan 2022
Device: none
OCR'd PDF to EPUB/TXT/etc. not copying text over (text under image).

I made a searchable OCR'd PDF in ABBYY with the "save text under image" setting (this is what OCR software usually defaults to: the page displays the original scan, with the recognized text in a hidden layer underneath, in case the OCR made a mistake). Whenever I try to convert the PDF in Calibre, though, it ignores the OCR'd text layer and just outputs the original page images. How can I resolve this? I can select, copy and paste the OCR'd text in a PDF viewer, so I know the text is in the PDF. Calibre just isn't seeing it for some reason.

Last edited by Tenome; 10-24-2022 at 08:39 AM.
Old 10-24-2022, 10:17 AM   #2
retiredbiker
Addict
Posts: 388
Karma: 1638210
Join Date: May 2013
Location: Ontario, Canada
Device: Kindle KB, Oasis, Pop_Os!, Jutoh, Kobo Forma
I have found some PDFs that have this problem. Calibre uses the pdftohtml tool to pull the text out of a PDF, and for some reason that can fail. If you take the PDF out of Calibre and run pdftohtml on it from the command line, you get no text, but run pdftotext on the same file and you usually do get the text. I've never seen an explanation of why text that is definitely there does not come out with pdftohtml. Another example of the evil behaviour of PDFs!
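For anyone who wants to repeat that comparison, here is a rough Python sketch that runs both tools on the same file and reports how much text each one returns. It assumes the Poppler command-line tools (pdftotext and pdftohtml) are installed and on your PATH. The function names are made up for this example, and the pdftohtml options used here are not necessarily the ones Calibre passes internally.

```python
import html
import re
import shutil
import subprocess
import sys


def extraction_commands(pdf_path):
    """Return the two Poppler command lines to compare for one PDF."""
    return {
        # "-" as the output file makes pdftotext write to stdout.
        "pdftotext": ["pdftotext", pdf_path, "-"],
        # -i skips images, -noframes gives a single HTML document,
        # -stdout prints it instead of writing files.
        "pdftohtml": ["pdftohtml", "-i", "-noframes", "-stdout", pdf_path],
    }


def visible_text(markup):
    """Very rough tag stripper, good enough to see whether any text came out."""
    return html.unescape(re.sub(r"<[^>]+>", "", markup)).strip()


def run_tool(command):
    """Run one command; return its stdout, or None if it is missing or fails."""
    if shutil.which(command[0]) is None:
        return None
    result = subprocess.run(command, capture_output=True, text=True, errors="replace")
    return result.stdout if result.returncode == 0 else None


def compare(pdf_path):
    """Return the number of text characters each tool produced (None = no run)."""
    report = {}
    for name, command in extraction_commands(pdf_path).items():
        output = run_tool(command)
        if output is None:
            report[name] = None
        elif name == "pdftohtml":
            # Count only the text left after removing the HTML tags.
            report[name] = len(visible_text(output))
        else:
            report[name] = len(output.strip())
    return report


if __name__ == "__main__" and len(sys.argv) == 2:
    for tool, count in compare(sys.argv[1]).items():
        print(tool, "->", "not available or failed" if count is None else f"{count} characters")
```

Save it under any name and pass the PDF path as the only argument. A large count for pdftotext next to a tiny one for pdftohtml would match the behaviour described above. The pdftohtml count may not be exactly zero even when no page text is extracted, because the document title survives the tag stripping.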