OCRing + EPUBing my first book: Tips?

Shohreh · 07-08-2020, 05:51 PM

Hello,

I'd like to turn an out-of-print paper book I have into an EPUB.

I just tried taking pictures of a few pages using my smartphone, and fed them to gImageReader (a GUI to Tesseract).

The text only has a few errors, and I'll have to manually remove mid-line carriage returns, but it's pretty good.

Are there tips you would recommend before I go ahead with the whole 250 pages and turn them into an EPUB (and PDF as well)?

Thank you.

hobnail · 07-08-2020, 06:10 PM

I've converted some books that are available as PDF + TXT from archive.org. I use SumatraPDF for opening the PDFs since it knows about the invisible/hidden text layer; maybe all PDF readers do.

I use Sigil and don't need to remove the mid-line carriage returns by hand. You can use search and replace to turn the blank lines between paragraphs into an end-of-paragraph tag followed by a beginning-of-paragraph tag: </p><p>. Then jump to the top of the book and add the missing beginning-of-paragraph tag, then to the bottom of the book and add the missing end-of-paragraph tag. Then use Sigil's Mend and Prettify to make it look good in Sigil. The hyphens that were at the ends of lines can be found by searching for a hyphen followed by a space; you can't remove them all because sometimes the word is one that's normally hyphenated.
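For anyone who would rather script this than do it in Sigil, here is a minimal Python sketch of the same two steps. It assumes the OCR text has a blank line between paragraphs; the function names are made up for illustration, and it does not escape characters such as & or < in the text.

```python
import re

def wrap_paragraphs(text):
    """Turn blank-line-separated OCR text into <p>...</p> blocks."""
    # Blank line(s) between paragraphs become "</p><p>".
    body = re.sub(r"\n\s*\n", "</p>\n<p>", text.strip())
    # Add the missing opening tag at the top and closing tag at the bottom.
    return "<p>" + body + "</p>"

def hyphen_space_hits(text):
    """List 'word- word' spots (hyphen followed by whitespace) for manual review."""
    return re.findall(r"\w+-\s+\w+", text)
```

The hyphen hits are only candidates: each one still needs a human decision, since a normally hyphenated word can also land at the end of a line.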

Whatever OCR archive.org uses often reads an exclamation mark (!) or a lowercase L (l) as a 1, so search for digits. There are threads here about this and other common errors, with regexps for finding them.
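One possible regexp for the digit search, sketched in Python. The assumption here is that a token mixing letters and digits is worth a look, while a pure number such as a year is left alone; legitimate tokens like "2nd" will also show up and have to be skipped by eye.

```python
import re

# A word that contains at least one digit AND at least one letter.
SUSPECT = re.compile(r"\b(?=\w*\d)(?=\w*[^\W\d_])\w+\b")

def digit_suspects(text):
    """Return letter+digit tokens, e.g. 'wil1' where 'will' was probably meant."""
    return SUSPECT.findall(text)
```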

BetterRed · 07-08-2020, 08:58 PM

If you have access to MS Word 2007 or later on Windows, there are a couple of very useful add-ins you should have a look at:

Toxaris' ePub Tools: it was created specifically for the issue you're dealing with. See ==>> Index of Useful Links for Book Creators

TransTools (not free): it has some overlap with Toxaris' add-in, but it also has some unique tools for fixing scanned texts, such as its Unbreaker tool. See ==>> Translator Tools.

I use both.

BR

elibrarian · 07-09-2020, 04:31 AM

Quote:

Originally Posted by Shohreh

... I'll have to manually remove mid-line carriage returns, but it's pretty good.

gImageReader has its own tool for that (second button from the left in the output pane), which may (or may not) work for you. You have to select the text before running it (CTRL+A works), and on longer texts it will take some seconds before the result shows.

Regards,

Kim

Shohreh · 07-09-2020, 04:53 AM

Thanks much!

pdurrant · 07-09-2020, 06:23 AM

You will need to do a lot of proofreading to catch OCR errors: "rn" read as "m" (or the reverse), etc.
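A rough way to get a list of rn/m candidates before proofreading, sketched in Python. The heuristic is an assumption, not an established tool: a word is flagged when swapping one "rn" for "m" (or the reverse) gives a word that is more frequent in the same text. It will produce false positives and cannot replace reading the book.

```python
import re
from collections import Counter

def one_swaps(word, old, new):
    """Yield copies of word with a single occurrence of old replaced by new."""
    start = word.find(old)
    while start != -1:
        yield word[:start] + new + word[start + len(old):]
        start = word.find(old, start + 1)

def rn_m_suspects(text):
    """Flag words whose rn<->m swap gives a more frequent word in the same text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    suspects = set()
    for word, n in counts.items():
        for old, new in (("rn", "m"), ("m", "rn")):
            for other in one_swaps(word, old, new):
                if counts[other] > n:
                    suspects.add((word, other))
    return sorted(suspects)
```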

Quoth · 07-09-2020, 09:52 AM

Use a proper scanner. An archival scanner allows the book to sit open thus: \/ and uses far better cameras and lenses than any phone. If it's a common book and the scanner has an ADF, the spine is usually cut off first. An expert copy typist (maybe none left?) can probably beat an inexperienced person with a camera phone, and the result will need much less proofreading/editing.

Do make sure the copyright has expired. That is now quite complicated.

You'll want to proofread it entirely several times, with a gap of at least a week between passes. You'll not see most of the errors if you are not experienced at proofing.

Pirates do this with ARCs and simply upload a PDF with unproofed text for search to Google Books/Playstore. IMO, the piracy on that and also pirated books packaged as Apps on the Playstore, that Google's book sales/distribution and their scanning of books for search (they DO store entire copyrighted works on public servers; they misled during the court case).

phillipgessert · 07-09-2020, 12:08 PM

You might save a little time on the mid-paragraph carriage return thing if you treat that output as markdown. Markdown treats lines separated by a single carriage return as one continuous line/paragraph by default. Basically if you ran your example through pandoc or something, that first block will convert to one paragraph automatically.
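To show what that Markdown rule does to OCR output, here is a small Python imitation of it. This is not pandoc and handles nothing but the paragraph rule; the function name is hypothetical.

```python
import re

def join_soft_wraps(text):
    """Join the lines of each blank-line-separated block into one paragraph."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return [" ".join(line.strip() for line in block.splitlines()) for block in blocks]
```

One caveat if you feed real OCR text to a Markdown converter: lines that happen to start with something like "1." or "-" can be picked up as list items, so a quick look at the converted output is worthwhile.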

DaleDe · 07-09-2020, 03:29 PM

Check our wiki on OCR.

Dale

Turtle91 · 07-09-2020, 07:02 PM

You can also look at diybookscanner.org. They have been helping people build book scanners using cameras for several years. They have quite the community over there as well as software suggestions that might save you tons of time.

Shohreh · 07-10-2020, 05:34 AM

Thanks again.

I tried ABBYY FineReader, and it worked much better than gImageReader (i.e. Tesseract).

Tex2002ans · 07-12-2020, 08:48 PM

Quote:

Originally Posted by Shohreh

I'd like to turn an out-of-print paper book I have into an EPUB.

[...]

Are there tips you would recommend before I go ahead with the whole 250 pages and turn them into an EPUB (and PDF as well)?

I've written extensively about this over the years.

On cleaning up your images, I would recommend using Scan Tailor Advanced. This crops your images, fixes distortion due to curved pages, and can turn them B&W.

I wrote a tutorial + more details about this just a few months ago: "Optimize PDFs from archive.org for E-Ink devices" (especially Post #2+#14).

On OCRing and all other errors/situations that may crop up, I recommend my detailed posts in the 2014 topic, "Delicate text digitalizing + scanning issues".

Not too much has changed since then... most of the steps and issues are still exactly the same in 2020.

Quote:

Originally Posted by Shohreh

I tried ABBYY FineReader, and it worked much better than gImageReader (i.e. Tesseract).

Back in 2014, I wrote another post discussing all the ins-and-outs of free vs. proprietary OCR:

"Can you OCR the images inside of .pdf files?"

Most of the free tools get you the straight text, but then do a poorer job of carrying over the actual formatting (italics/bold, footnotes, superscript, tables, etc.).

With fiction, you would probably be okay... but the more complicated the book, the more time you're going to spend trying to correct/re-add all the formatting.

roger64 · 07-13-2020, 05:40 AM

Quote:

Originally Posted by Shohreh

Hello,

I'd like to turn an out-of-print paper book I have into an EPUB.

I just tried taking pictures of a few pages using my smartphone, and fed them to gImageReader (a GUI to Tesseract).

The text only has a few errors, and I'll have to manually remove mid-line carriage returns, but it's pretty good.

Are there tips you would recommend before I go ahead with the whole 250 pages and turn them into an EPUB (and PDF as well)?

Thank you.

I also use gImageReader (gimagereader-qt5) on Arch Linux. Mine looks slightly different.

See screenshot

I process only .tif images coming from Scan Tailor.
I recognize the text in hOCR format, in blocks of 70 pages max.
I save each block as an HTML file (see red arrow).
I insert the block file into LibreOffice and save it as ODT.
Each block is 3 MB max.
I remove all bookmarks and sections, block by block.

The result is a clean enough ODT file that will later be converted using ODTImport (a Sigil plugin).
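If the page images are numbered, splitting them into blocks of 70 ahead of time is easy to script. A minimal Python sketch; the file names are hypothetical and the block size is just the figure mentioned above.

```python
def batches(pages, size=70):
    """Split a list of page files into consecutive blocks of at most `size`."""
    return [pages[i:i + size] for i in range(0, len(pages), size)]

# Hypothetical file names for a 250-page book.
pages = [f"page{n:03d}.tif" for n in range(1, 251)]
blocks = batches(pages)
```

For a 250-page book this gives three blocks of 70 pages and a last block of 40.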

patrik · 07-13-2020, 09:37 AM

Quote:

Originally Posted by Tex2002ans

I've written extensively about this over the years.

You have no idea how many notes I have due to your posts.

Quote:

On cleaning up your images, I would recommend using Scan Tailor Advanced. This crops your images, fixes distortion due to curved pages, and can turn them B&W.

Do you still use Scan Tailor if you are going to use Finereader afterwards?

BTW, I recently got a new scanner with fairly good software (which uses Finereader for OCR). The best output of it is docx. But I miss the "verify text" step. Have you, or anyone else, found a better, or at least equal, way to go through the text to find errors?

Tex2002ans · 07-13-2020, 06:18 PM

Quote:

Originally Posted by patrik

Do you still use Scan Tailor if you are going to use Finereader afterwards?

Scan Tailor Advanced is best used as a pre-OCR step.

It's really only needed if you have ugly input that needs serious cleaning.

You mentioned taking pictures with your smartphone, so that would cause issues like:

Rotation
Spine showing
- Scan Tailor can easily split Left/Right pages, or crop the spine out of the image.
Bent pages (thus bent/wavy lines of text)
- It can dewarp them so the lines become straight.
Uneven Lighting/Color (Yellowed Pages)
- When converting to grayscale/B&W, you could get a "ring" or tons of black speckles.

So Scan Tailor would take you from something like this:

[Images: Page16.jpg, Page17.jpg]

to this:

[Image: Attachment 177415]

(Those images were from the book in the "Optimize PDFs" thread.)

Related Side Note: I also gave an example of handling OCR + images in "How to handle images in books while doing OCR of books?".

Quote:

Originally Posted by patrik

BTW, I recently got a new scanner with fairly good software (which uses Finereader for OCR).

Is it the full Finereader? Or just some instant scan -> PDF/DOCX thing?

If it's the full Finereader, you should be able to open it up and have an Original+OCR split in the Left/Right windows.

See my posts in 2013, "Best way to copy text from a PDF or MOBI?". This lets you easily see a magnified version of the exact location in the book, and make sure the text is correct.

That's exactly how I squash most errors... right at the source!

Quote:

Originally Posted by patrik

The best output of it is docx.

That's one thing I've changed within the past few years... now I trust Toxaris's EPUB Tools to clean up Finereader's cruft.

When you export, change Finereader to "Formatted Text" and DOCX. Toxaris's EPUB Tools will then clean up the rest.

From there, you could do further cleaning in DOCX (if that's what you're comfortable with), or get it into EPUB as soon as possible + do your cleaning there (that's what I prefer).

Quote:

Originally Posted by patrik

But I miss the "verify text" step. Have you, or anyone else, found a better, or at least equal, way to go through the text to find errors?

There are still multiple rounds of proofing that have to be done. Nothing gets rid of that.

As always, it's best to squash this stuff as close to the source as possible.

1. Clean Input Images = More Accurate OCR

The cleaner the input, the less time wasted fixing errors.

2. Mark/Proofread in Finereader

This is where you make sure "big picture" things are marked—Text, Images, Tables, Headers/Footers.

Then it's helpful to focus on all the "blue highlights" (unsure characters) and fix as many of those as you can.

Also making sure things like bold/italics/superscripts are carried over properly.

3. Export DOCX (or EPUB) out of Finereader

Do further cleanup.

Toxaris's EPUB Tools merges accidental split paragraphs together, etc.

You may have to restore intentional line breaks that were accidentally merged, for example in poetry.

If you're comfortable with Word, you may want to add in some more Styles/formatting here (headings, blockquotes, captions, [...]).

4. Clean the EPUB

This is where you also make sure all the little things are correct:

Headings are <h1>-<h6>
Paragraphs are correct
Indentation is correct
Blockquotes are <blockquote>s
Footnotes are footnotes
Left/Center/Right alignment
[...]

And with Sigil/Calibre, you have access to more powerful tools/Regex.

For example, one of my favorite tricks is still to search for all hyphenated words in the Spellcheck Lists (I wrote about that all the way back in 2013!).

And now that "numbers are words", you can use a similar trick to find whole classes of OCR errors (0<->O, 1<->l). (See "Suggestion: Spellcheck Enhancement (Numbers)").
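Outside Sigil/Calibre, the hyphenated-words trick can be roughly imitated with a short script. This Python sketch is not how the Spellcheck Lists work internally; it only assumes that a hyphenated word whose joined form also appears in the book is a likely leftover end-of-line hyphen.

```python
import re
from collections import Counter

def hyphen_report(text):
    """Map each hyphenated word to (its count, count of its joined form)."""
    words = Counter(re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower()))
    report = {}
    for word, n in words.items():
        if "-" in word:
            joined = word.replace("-", "")
            report[word] = (n, words[joined])
    return report
```

A pair like "book-case" (1 hit) next to "bookcase" (many hits) is a strong hint; a word like "well-known" with no joined form in the text is probably meant to keep its hyphen.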

5. Run through a final Spellcheck/Grammarcheck pass

See my 2018 post in "Does Tool Exist to Spellcheck/Grammarcheck by Category?".

If you spellchecked in Sigil/Calibre, maybe try Word (different dictionaries may point out other misspellings).

If you grammarchecked in Word, maybe try LanguageTool or Antidote. Different tools might catch different errors.

And I definitely run EPUB Tools's Dialogue Check—it's the best damn thing since sliced bread, and it catches all the mismatching quotation marks + parentheses/brackets.
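To give an idea of the kind of check involved, here is a minimal Python sketch that looks for unpaired curly quotes, parentheses, and brackets within one paragraph. It is an assumption-laden illustration, not how EPUB Tools's Dialogue Check actually works, and it ignores straight quotes and apostrophes.

```python
# Opening mark -> closing mark (\u201c and \u201d are curly double quotes).
PAIRS = {"\u201c": "\u201d", "(": ")", "[": "]"}
CLOSERS = {close: open_ for open_, close in PAIRS.items()}

def unbalanced(paragraph):
    """Return the marks in one paragraph that have no matching partner."""
    stack, problems = [], []
    for ch in paragraph:
        if ch in PAIRS:
            stack.append(ch)
        elif ch in CLOSERS:
            if stack and stack[-1] == CLOSERS[ch]:
                stack.pop()
            else:
                problems.append(ch)  # closer with no matching opener
    return problems + stack  # leftover openers were never closed
```

Dialogue that runs over several paragraphs legitimately leaves an opening quote unclosed, so hits from a check like this are things to look at, not automatic errors.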

Quote:

Originally Posted by patrik

You have no idea how many notes I have due to your posts.

I'd be interested in learning what sorts of things you marked down in your notes.

I've been trying to put together an "FAQ"-type series of posts for the blog... and I have no idea what sorts of things people found useful over the years.

PM me if you don't want to type about it here. (Wouldn't want to derail this thread.)
