08-29-2024, 01:26 PM | #1 |
[SOLVED] [OCR] Extract text layer, fix errors, re-import?
Hello,
I noticed some typos in the text layer that OCR added to a "bitmap" PDF, i.e. one whose pages are actually scanned images. I first tried opening the EPUB generated by ABBYY FineReader: LibreOffice couldn't open it at all, and Sigil could open it after showing an error message, but it lacks a French dictionary for spellchecking (as far as I can tell).

As an alternative, pdftotext or mutool (convert) can extract the text layer from such a PDF, but can they put it back after I fix the typos?

Thank you.

Edit: An easy solution is to convert the PDF to EPUB using ABBYY FineReader, and then run the HTML files inside it through a spellchecker.

Last edited by Shohreh; 08-30-2024 at 04:28 AM.
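For anyone who wants to script the spellcheck step from the edit, here is a rough Python sketch, not a tested workflow. It relies on an EPUB being a ZIP archive with HTML/XHTML files inside, pulls the visible text out of each one, and counts the words that are missing from a word list you supply. The function names (`html_to_text`, `find_unknown_words`) and the word list are my own assumptions for illustration; nothing here comes from FineReader or Sigil. It also treats a "word" simply as a run of letters and ignores single letters, so French elisions like "l'" are skipped rather than checked.

```python
import re
import zipfile
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(markup):
    """Return the visible text of an HTML/XHTML string (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


# Simplification: a word is a run of Unicode letters (no digits, no underscore).
WORD_RE = re.compile(r"[^\W\d_]+")


def find_unknown_words(epub_path, known_words):
    """Count words in the EPUB's HTML files that are not in known_words."""
    known = {w.lower() for w in known_words}
    unknown = Counter()
    with zipfile.ZipFile(epub_path) as epub:
        for name in epub.namelist():
            if not name.lower().endswith((".html", ".xhtml", ".htm")):
                continue
            markup = epub.read(name).decode("utf-8", errors="replace")
            for word in WORD_RE.findall(html_to_text(markup)):
                word = word.lower()
                # Skip single letters (e.g. the "l" left over from "l'école").
                if len(word) > 1 and word not in known:
                    unknown[word] += 1
    return unknown
```

To use it, load a French word list (one word per line, from whatever source you trust) into a list or set, call `find_unknown_words("book.epub", words)`, and look at `.most_common()` on the result to see the likeliest OCR errors first. This only reports suspects; fixing them in the HTML files is still a manual job.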