07-08-2020, 05:51 PM | #1 |
Groupie
Posts: 181
Karma: 304158
Join Date: Jan 2016
Device: none
|
OCRing + EPUBing my first book: Tips?
Hello,
I'd like to turn an out-of-print paper book I have into an EPUB. I just tried taking pictures of a few pages using my smartphone, and fed them to gImageReader (a GUI to Tesseract). The text only has a few errors, and I'll have to manually remove mid-line carriage returns, but it's pretty good. Are there tips you would recommend before I go ahead with the whole 250 pages and turn them into an EPUB (and PDF as well)? Thank you. |
07-08-2020, 06:10 PM | #2 |
Running with scissors
Posts: 1,557
Karma: 14325282
Join Date: Nov 2019
Device: none
|
I've converted some books that are available as PDF + TXT from archive.org. I use SumatraPDF for opening the PDFs since it knows about the invisible/hidden text layer; maybe all PDF viewers do.
I use Sigil and don't need to manually remove the mid-line carriage returns; you can use a search and replace to replace the blank lines between paragraphs with an end-of-paragraph tag followed by a beginning-of-paragraph tag: </p><p>. Then jump to the top of the book and add the missing opening <p> tag, then to the bottom of the book and add the missing closing </p> tag. Then use Sigil's Mend and Prettify to make it look good in Sigil. The hyphens that were at the ends of lines can be found by searching for a hyphen followed by a space; you can't remove them all automatically because sometimes it was a word that's normally hyphenated. The OCR that archive.org uses often reads an exclamation mark (!) or a lowercase L (l) as a 1, so search for digits; there are threads here about this and other common errors, with regexps for finding them. Last edited by hobnail; 07-08-2020 at 06:15 PM. |
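For anyone who would rather script this than do it in Sigil's search and replace, here is a minimal Python sketch of the same idea. It assumes the OCR text separates paragraphs with blank lines; the function names are made up for illustration and are not part of Sigil or any other tool.

```python
import html
import re


def text_to_paragraphs(text: str) -> str:
    """Wrap blank-line-separated blocks of OCR text in <p> tags."""
    # One or more blank lines (possibly holding stray whitespace) end a paragraph.
    blocks = re.split(r"\n\s*\n", text.strip())
    paragraphs = []
    for block in blocks:
        # Join the hard-wrapped lines of one paragraph with single spaces.
        joined = " ".join(line.strip() for line in block.splitlines() if line.strip())
        if joined:
            # Escape &, < and > so the result is valid inside XHTML.
            paragraphs.append(f"<p>{html.escape(joined, quote=False)}</p>")
    return "\n".join(paragraphs)


def hyphen_space_hits(text: str) -> list[str]:
    """Find 'word- word' leftovers from end-of-line hyphens, for manual review."""
    return re.findall(r"\w+- \w+", text)
```

The hyphen hits are only listed, not fixed, because some of them will be words that really are hyphenated.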
07-08-2020, 08:58 PM | #3 |
null operator (he/him)
Posts: 20,997
Karma: 27620706
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
If you have access to MS Word 2007 or later on Windows, there are a couple of very useful add-ins you should have a look at:
Toxaris' ePub Tools: it was specifically created for the issue you're dealing with, see ==>> Index of Useful Links for Book Creators
TransTools (not free) add-in: it has some overlap with Toxaris' add-in, but it also has some unique tools for fixing scanned texts, such as its Unbreaker tool, see ==>> Translator Tools.
I use both. BR |
07-09-2020, 04:31 AM | #4 | |
Imperfect Perfectionist
Posts: 542
Karma: 863576
Join Date: Dec 2011
Location: Ølstykke, Denmark
Device: none
|
Quote:
Regards, Kim |
|
07-09-2020, 04:53 AM | #5 |
Groupie
Posts: 181
Karma: 304158
Join Date: Jan 2016
Device: none
|
Thanks much!
|
07-09-2020, 06:23 AM | #6 |
The Grand Mouse 高貴的老鼠
Posts: 72,511
Karma: 309063598
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Voyage
|
You will need to do a lot of proofreading to catch OCR errors, such as "rn" being read as "m" (and vice versa), etc.
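A rough way to pre-screen for the rn/m confusion, sketched in Python. It assumes you have some set of known-good words (a dictionary file, or the words from an already proofed book); `known_words` and the function name are hypothetical. It does not replace proofreading, because an error that produces another real word will not be flagged.

```python
import re


def rn_m_suspects(text: str, known_words: set[str]) -> list[tuple[str, str]]:
    """Return (word as found, likely intended word) pairs for rn/m swaps."""
    suspects = []
    for word in re.findall(r"[A-Za-z]+", text):
        lower = word.lower()
        if lower in known_words:
            continue  # already a real word, nothing to report
        for wrong, right in (("rn", "m"), ("m", "rn")):
            # Try swapping one occurrence at a time.
            for match in re.finditer(wrong, lower):
                candidate = lower[:match.start()] + right + lower[match.end():]
                if candidate in known_words:
                    suspects.append((word, candidate))
    return suspects
```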
|
07-09-2020, 09:52 AM | #7 |
the rook, bossing Never.
Posts: 12,352
Karma: 92073397
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
|
Use a proper scanner. An archival scanner lets the book sit open in a V shape, thus \/, and uses far better cameras and lenses than any phone. If it's a common book and the scanner has an ADF, the spine is usually cut off so the loose pages can be fed through. An expert copy typist (maybe none left?) can probably beat an inexperienced person with a camera phone, and the result will need much less proofreading/editing.
Do make sure the copyright has expired. That is now quite complicated. You'll want to proofread it entirely several times, with a gap of at least a week between passes. You'll not see most of the errors if you are not experienced at proofing. Pirates do this with ARCs and simply upload a PDF with unproofed text for search to Google Books/Playstore. IMO, the piracy on that and also pirated books packaged as Apps on the Playstore, that Google's book sales/distribution and their scanning of books for search (they DO store entire copyrighted works on public servers, they misled during the court case). Last edited by Quoth; 07-09-2020 at 09:55 AM. |
07-09-2020, 12:08 PM | #8 |
Addict
Posts: 311
Karma: 3196258
Join Date: Oct 2015
Location: Madison, WI
Device: Kindle 5th Gen
|
You might save a little time on the mid-paragraph carriage return thing if you treat that output as markdown. Markdown treats lines separated by a single carriage return as one continuous line/paragraph by default. Basically if you ran your example through pandoc or something, that first block will convert to one paragraph automatically.
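A minimal sketch of driving pandoc from Python for this. It assumes pandoc is installed and on your PATH; the file names and function names are placeholders. pandoc picks the output format from the target's extension, so a target ending in .epub produces an EPUB. Bear in mind that reading OCR text as Markdown also means stray characters such as *, _ or # at the start of a line will be treated as formatting.

```python
import shutil
import subprocess


def pandoc_command(source: str, target: str) -> list[str]:
    """Build a pandoc call that reads the source as Markdown and writes the target."""
    return ["pandoc", "--from", "markdown", source, "--output", target]


def convert(source: str, target: str) -> bool:
    """Run pandoc if it is available; return False when it is not installed."""
    if shutil.which("pandoc") is None:
        return False
    subprocess.run(pandoc_command(source, target), check=True)
    return True
```

For example, `convert("book.txt", "book.epub")` would turn each blank-line-separated block of the OCR output into one paragraph.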
|
07-09-2020, 07:02 PM | #10 |
A Hairy Wizard
Posts: 3,222
Karma: 19000635
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 15/11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
|
You can also look at diybookscanner.org. They have been helping people build book scanners using cameras for several years. They have quite the community over there as well as software suggestions that might save you tons of time.
|
07-10-2020, 05:34 AM | #11 |
Groupie
Posts: 181
Karma: 304158
Join Date: Jan 2016
Device: none
|
Thanks again.
I tried ABBYY FineReader, and it worked much better than gImageReader (i.e. Tesseract). |
07-12-2020, 08:48 PM | #12 | ||
Wizard
Posts: 2,304
Karma: 12126963
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
On cleaning up your images, I would recommend using Scan Tailor Advanced. It crops your images, fixes distortion due to curved pages, and can turn them B&W. I wrote a tutorial with more details about this just a few months ago: "Optimize PDFs from archive.org for E-Ink devices" (especially Post #2+#14).
On OCRing and all other errors/situations that may crop up, I recommend my detailed posts in the 2014 topic, "Delicate text digitalizing + scanning issues". Not too much has changed since then... most of the steps and issues are still exactly the same in 2020. Quote:
Back in 2014, I wrote another post discussing all the ins-and-outs of free vs. proprietary OCR: "Can you OCR the images inside of .pdf files?" Most of the free tools get you the straight text, but then do a poorer job of carrying over the actual formatting (italics/bold, footnotes, superscript, tables, etc.). With fiction, you would probably be okay... but the more complicated the book, the more time you're going to spend trying to correct/re-add all the formatting. Last edited by Tex2002ans; 07-12-2020 at 09:08 PM. |
||
07-13-2020, 05:40 AM | #13 | |
Wizard
Posts: 2,608
Karma: 3000161
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
|
Quote:
See screenshot.
I process only .tif images coming from Scan Tailor.
I recognize text in hOCR format in blocks of 70 pages max.
I save each block as an HTML file (see red arrow).
I insert the block file in LibreOffice and save it as ODT. Each block is 3 MB max.
I remove all bookmarks and sections, block by block.
The result is a clean enough ODT file that will later be converted using ODTImport (a Sigil plugin). Last edited by roger64; 07-13-2020 at 05:47 AM. Reason: image |
|
07-13-2020, 09:37 AM | #14 | |
Guru
Posts: 674
Karma: 4568205
Join Date: Jan 2010
Location: Sweden
Device: Kobo Forma
|
You have no idea how many notes I have due to your posts.
Quote:
BTW, I recently got a new scanner with fairly good software (which uses Finereader for OCR). Its best output is DOCX. But I miss the "verify text" step. Have you, or anyone else, found a better, or at least equal, way to go through the text to find errors? |
|
07-13-2020, 06:18 PM | #15 | |||
Wizard
Posts: 2,304
Karma: 12126963
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Really only used if you have ugly input that needs serious cleaning. You mentioned taking pictures with your smartphone, so that would cause issues like:
So Scan Tailor would take you from something like this: to this: Attachment 177415 (Those images were from the book in the "Optimize PDFs" thread.) Related Side Note: I also gave an example of handling OCR + images in "How to handle images in books while doing OCR of books?". Quote:
If it's the full Finereader, you should be able to open it up and have an Original+OCR split in the Left/Right windows. See my posts in 2013, "Best way to copy text from a PDF or MOBI?". This lets you easily see a magnified version of the exact location in the book, and make sure the text is correct. That's exactly how I squash most errors... right at the source!
That's one thing I've changed within the past few years... now I trust Toxaris's EPUB Tools to clean up Finereader's cruft. When you export, change Finereader to "Formatted Text" and DOCX. Toxaris's EPUB Tools will then clean up the rest.
From there, you could do further cleaning in DOCX (if that's what you're comfortable with), or get it into EPUB as soon as possible + do your cleaning there (that's what I prefer). Quote:
As always, it's best to squash this stuff as close to the source as possible.
1. Clean Input Images = More Accurate OCR
The cleaner the input, the less time wasted fixing errors.
2. Mark/Proofread in Finereader
This is where you make sure "big picture" things are marked: Text, Images, Tables, Headers/Footers. Then it's helpful to focus on all the "blue highlights" (unsure characters) and fix as many of those as you can. Also make sure things like bold/italics/superscripts are carried over properly.
3. Export DOCX (or EPUB) out of Finereader
Do further cleanup. Toxaris's EPUB Tools merges accidentally split paragraphs together, etc. You may have to re-correct "odd" line breaks that may have accidentally been merged, for example, poetry. If you're comfortable with Word, you may want to add in some more Styles/formatting here (headings, blockquotes, captions, [...]).
4. Clean the EPUB
This is where you also make sure all the little things are correct:
And with Sigil/Calibre, you have access to more powerful tools/Regex. For example, one of my favorite tricks is still to search for all hyphenated words in the Spellcheck Lists (I wrote about that all the way back in 2013!). And now that "numbers are words", you can use a similar trick to find whole classes of OCR errors (0<->O, 1<->l). (See "Suggestion: Spellcheck Enhancement (Numbers)").
5. Run through a final Spellcheck/Grammarcheck pass
See my 2018 post in, "Does Tool Exist to Spellcheck/Grammarcheck by Category?". If you spellchecked in Sigil/Calibre, maybe try Word (different dictionaries may point out other misspellings). If you grammarchecked in Word, maybe try LanguageTool or Antidote. Different tools might catch different errors. And I definitely run EPUB Tools's Dialogue Check: it's the best damn thing since sliced bread, and it catches all the mismatching quotation marks + parentheses/brackets.
I'd be interested in learning what sorts of things you marked down in your notes. I've been trying to put together an "FAQ"-type series of posts for the blog... and I have no idea what sorts of things people found useful over the years. PM me if you don't want to type about it here. (Wouldn't want to derail this thread.) Last edited by Tex2002ans; 07-13-2020 at 06:28 PM. |
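The hyphenated-words and "numbers are words" tricks can also be approximated outside Sigil/Calibre. This is only a sketch in plain Python run over the book's extracted text; the function names are invented for illustration and this is not how the Spellcheck Lists work internally.

```python
import re
from collections import Counter


def hyphenated_words(text: str) -> Counter:
    """Count every hyphenated word; rare ones are often line-break leftovers."""
    found = re.findall(r"\b[A-Za-z]+(?:-[A-Za-z]+)+\b", text)
    return Counter(word.lower() for word in found)


def joined_form_also_used(text: str) -> list[str]:
    """Hyphenated words whose unhyphenated spelling also occurs in the text."""
    plain = {word.lower() for word in re.findall(r"\b[A-Za-z]+\b", text)}
    return sorted(w for w in hyphenated_words(text) if w.replace("-", "") in plain)


def mixed_digit_letter_tokens(text: str) -> list[str]:
    """Tokens mixing digits and letters, e.g. possible 0/O or 1/l swaps."""
    return [token for token in re.findall(r"\b\w+\b", text)
            if re.search(r"\d", token) and re.search(r"[A-Za-z]", token)]
```

Legitimate tokens such as "1920s" or "2nd" will show up in the last list too, so treat every hit as something to look at, not something to replace blindly.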
|||