03-03-2017, 04:38 AM | #1 |
Junior Member
Posts: 1
Karma: 10
Join Date: Mar 2017
Device: Kindle
|
Converting pdf file of a scanned book to epub format
Hi,
I am trying to convert my books to EPUB format so that I can easily have them with me when I am mobile. I don't have any problems with OCR programs; mostly I use ABBYY FineReader for Mac.

1. Before or after OCR, I have to manually deselect all the page numbers and the volume or author names that appear at the top of each page. I couldn't find any script or other solution for that. Do you know of an easy method?
2. The superscript reference numbers at the end of sentences have to be linked manually. Is there a way to do it automatically?
3. The footnotes at the bottom of every page have to be either deselected or manually transferred to the end, so that they don't interrupt reading in EPUB format. Is there an automatic solution for that? |
03-03-2017, 06:18 AM | #2 | |||
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
See how close the Header is in Figure A: FineReader has a serious problem detecting the Header in that book... you can see how it could easily be seen as part of the body text.

And see your typical book in Figure B: FineReader has absolutely zero problems with that.

Only when the Header is as close as, or closer than, in Figure A might FineReader start to become inaccurate with its guesses (maybe it will handle the Headers perfectly though... each book is different).

Potential Solution

Not too sure if FineReader on Mac is the same, but there is a "Save Area Template" option under Area > Save Area Template...: http://help.abbyy.com/FineReader/Fin...hTemplates.htm

I think the Area Templates were intended more for scanning in documents of the exact same type (like hundreds of forms that all have the exact same layout). You may be able to hack an Area Template together for yourself on a per-book basis. I personally haven't found it too useful in the case of books, but your case may be different.

Quote:
Depending on the export format, FineReader does try to do its best, but it botches the "linking back/forth footnotes" pretty badly. The only way to handle it properly is to manually correct them.

There are some tools to kind of help speed up the process though:

1. If you have Microsoft Word, you can use FineReader to export to DOCX (Formatted), and then run Toxaris's EPUB Tools add-in (doesn't work in the Mac version): https://toxaris.nl/en/ Toxaris specialized his tool for a lot of FineReader cleanup (and a ton of other helpful things). If you use his tool and press "Preparation", it can clean up a lot of the FineReader DOCX cruft. You can then fix the document in Word, or export from there and do more thorough cleaning.
2. Taking the HTML and doing lots of fancy Regex (each book is different). Some generic rules can apply though, like searching a book for all <sup>##</sup> (these are most likely superscript footnotes sitting in the text). Or searching for paragraphs starting with a superscript number (this is most likely a footnote).

Things can get a little hairier if you have a complex book (like one with formulas) or OCR errors (maybe a ” [Right Double Quote] might be OCRed as a <sup>9</sup>, or a ° [degree] might be OCRed as <sup>0</sup>).

Quote:
You will most likely have to manually fix/check the links and place them in the proper order/location.

Side Note: You will also have to keep an eye out for footnotes that are missing text, or large footnotes that carry on to a second/third page... these will have to be manually stitched back together.

Last edited by Tex2002ans; 03-03-2017 at 07:10 AM. |
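To make the two generic Regex rules from point 2 a bit more concrete, here is a rough Python sketch. It assumes the OCR output was exported as HTML with footnote markers wrapped in <sup> tags and each footnote sitting in its own <p>; the pattern names, the helper functions and the sample text are all made up for illustration, and a real book will almost certainly need tweaks. It only lists candidates and flags numbers that don't pair up (such as the misread degree sign); it doesn't create any links.

```python
import re

# Inline reference markers such as <sup>12</sup> sitting in the body text.
REF_RE = re.compile(r"<sup>(\d+)</sup>")
# Paragraphs that open with a superscript number: most likely the footnote itself.
NOTE_RE = re.compile(r"<p\b[^>]*>\s*<sup>(\d+)</sup>")


def find_footnote_candidates(html):
    """Return (reference_numbers, note_numbers) found in one HTML string."""
    notes = [int(n) for n in NOTE_RE.findall(html)]
    # Blank out the footnote openers first so they are not counted as references.
    body = NOTE_RE.sub("<p>", html)
    refs = [int(n) for n in REF_RE.findall(body)]
    return refs, notes


def unmatched(refs, notes):
    """Numbers that appear on only one side and need a manual look."""
    return sorted(set(refs) ^ set(notes))


# Hypothetical sample, including a degree sign that was OCRed as <sup>0</sup>.
sample = (
    "<p>First claim.<sup>1</sup> Second claim.<sup>2</sup></p>"
    "<p>It was 45<sup>0</sup> outside.</p>"
    '<p class="fn"><sup>1</sup> Source for the first claim.</p>'
    "<p><sup>2</sup> Source for the second claim.</p>"
)

refs, notes = find_footnote_candidates(sample)
print("references:", refs)
print("footnotes:", notes)
print("check by hand:", unmatched(refs, notes))
```

Anything that shows up under "check by hand" is either an OCR error or a footnote that lost its marker or its text, so it still has to be looked at manually, as described above.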
03-03-2017, 05:50 PM | #3 |
null operator (he/him)
Posts: 20,946
Karma: 27620688
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
@fcemari - Sigil has a plugin for tidying ePub files created from scans and PDFs: ePub Tidy
I don't use the Sigil plugin because I have Word on Windows, so I use the epub-tools add-in that Tex2002ans has already mentioned. From its description, the Sigil plugin appears to do something similar to epub-tools' Preparation. As well as correcting Latin text, it also corrects Greek text.

BR

Last edited by BetterRed; 03-03-2017 at 05:52 PM. |