How do you deal with soft hyphens in OCR texts?

SBT · 06-24-2013, 03:34 PM

What is the best and most efficient way of rejoining words in an OCR-ed text which are split over two lines?

I use a sed script which prefixes the first part of the word to the next line, and replaces the hyphen with #, and then I examine these words for any that contain hard hyphens. Page breaks are handled.
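
The sed script itself isn't posted, so the following is only a rough Python sketch of the same idea, not the original script. The function name `mark_line_end_hyphens` and the regex are assumptions made for illustration, and page breaks are not handled here.

```python
import re

# A word fragment followed by a hyphen at the very end of a line.
# Requiring a word character before the hyphen skips dashes typed as "--".
LINE_END_HYPHEN = re.compile(r"(\S*\w)-\s*$")

def mark_line_end_hyphens(text, marker="#"):
    """Move each line-end fragment to the start of the next line,
    replacing its hyphen with `marker` so it can be reviewed later."""
    out = []
    carry = ""  # fragment carried over from the previous line
    for line in text.split("\n"):
        line = carry + line
        carry = ""
        match = LINE_END_HYPHEN.search(line)
        if match:
            carry = match.group(1) + marker
            line = line[:match.start()].rstrip()
        out.append(line)
    if carry:  # the last line ended in a hyphen
        out.append(carry)
    return "\n".join(out)

print(mark_line_end_hyphens("The quick exam-\nple shows a well-\nknown case."))
```

Every `#` left in the output marks a word to review: `exam#ple` should lose the marker, while `well#known` needs its hard hyphen back.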

How do you solve this problem?

Jellby · 06-24-2013, 03:44 PM

Something similar to that. Much easier in Spanish, where hard hyphens are very seldom used. Search "-\n", replace with "¬" (for instance), then manually check all ¬: most will disappear, some will turn back into "-" or "- ". If I detect some common case, I can bulk-replace (like finding "Jean¬Jacques" several times in the first few chapters).
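
A minimal Python sketch of this mark-then-review workflow, assuming the text is held in one string. The helper names (`mark_breaks`, `marked_words`, `apply_decisions`) are made up for this example, and the decisions themselves still come from a human check.

```python
import re
from collections import Counter

MARK = "¬"  # placeholder from the post; any character absent from the text works

def mark_breaks(text: str) -> str:
    # Join lines at every end-of-line hyphen, leaving a marker to review.
    return text.replace("-\n", MARK)

def marked_words(text: str) -> Counter:
    # Count each distinct word containing the marker, so a repeated case
    # (like "Jean¬Jacques") can be decided once and bulk-replaced.
    return Counter(re.findall(rf"[\w-]*{MARK}[\w{MARK}-]*", text))

def apply_decisions(text: str, decisions: dict[str, str]) -> str:
    # `decisions` maps a marked word to the form chosen after review.
    for marked, fixed in decisions.items():
        text = text.replace(marked, fixed)
    return text

raw = "Jean-\nJacques met a stran-\nger. Jean-\nJacques left."
marked = mark_breaks(raw)
print(marked_words(marked).most_common())
print(apply_decisions(marked, {"Jean¬Jacques": "Jean-Jacques", "stran¬ger": "stranger"}))
```

Sorting by frequency puts the bulk-replace candidates at the top of the list; the rare ones are the ones worth checking by eye.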

Be careful with words like to-day, up-stairs, etc. They might have been written with hyphens in that book, even though they usually aren't today. Whenever a suspect word appears, search the whole book for similar instances and see if there's a pattern.
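
The "search the whole book" check can be sketched in Python as below. The function `count_forms` is a hypothetical helper: it counts how often the joined and the hyphenated spelling occur within a line, ignoring the line-end breaks themselves, so the book's own usage can guide the decision.

```python
import re

def count_forms(text: str, left: str, right: str) -> dict[str, int]:
    """Count whole-word occurrences of `left+right` and `left-right`.

    A word split as "left-\\nright" is not counted, because the
    hyphenated pattern does not allow a newline after the hyphen.
    """
    joined = re.findall(rf"\b{re.escape(left + right)}\b", text, re.IGNORECASE)
    hyphenated = re.findall(
        rf"\b{re.escape(left)}-{re.escape(right)}\b", text, re.IGNORECASE
    )
    return {"joined": len(joined), "hyphenated": len(hyphenated)}

book = "I saw him to-day. He said to-day was fine, not today.\nWe go to-\nday."
print(count_forms(book, "to", "day"))
```

If the hyphenated count clearly dominates, the split word most likely kept its hyphen in that edition.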

Tex2002ans · 06-24-2013, 09:27 PM

If the text is already in HTML, I use this Regex, and replace one by one:

Search:

Code:

-</p>\s+<p>

Replace with nothing.

A faster way might be to use the above search and replace with a hyphen. Then go through your typical "hyphenation search" pass and fix any unneeded hyphens. I personally replace one by one just so I can double check with the PDF that a chunk of text is not missing (Most of my work is from PDF -> EPUB).
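
Outside an editor, the same search can be run with Python's `re` module. This is only a sketch: the function name `join_split_paragraphs` is an assumption, and like the regex above it only matches a bare `<p>` with no attributes.

```python
import re

# The search pattern from the post: a hyphen closing one paragraph,
# whitespace, then the opening of the next paragraph.
BREAK = re.compile(r"-</p>\s+<p>")

def join_split_paragraphs(html: str, replacement: str = "") -> tuple[str, int]:
    """Replace every match and return (new_html, number_of_replacements).

    replacement=""  drops the hyphen and joins the word outright;
    replacement="-" keeps a hyphen for a later hyphenation pass.
    """
    return BREAK.subn(replacement, html)

sample = "<p>This is an exam-</p>\n<p>ple of a split.</p>"
print(join_split_paragraphs(sample))
print(join_split_paragraphs(sample, "-"))
```

Replacing everything at once gives up the one-by-one comparison against the PDF, so the returned count is at least worth checking against the number of matches you expected.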

To clean up hyphens throughout the text, I use the Spellcheck function in Sigil (Tools - Spellcheck - Spellcheck (Alt+Q)). This will give you a list of every word in the book. I then insert a hyphen in the search box (See attached image).

Then I go through the list of words with hyphens and remove the hyphens where needed. I do one pass with "Show All Words" checked, then a pass with it unchecked (to show only words Sigil thinks are misspelled). And then maybe one more pass with "Show All Words" checked. In my case, these 2 or 3 passes usually catch almost all hyphenation errors throughout the book.
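
For plain text outside Sigil, a similar list of hyphenated words can be built with a few lines of Python. This is not how Sigil does it, just a rough equivalent; `hyphenated_words` is a made-up name, and HTML tags should be stripped before running it.

```python
import re
from collections import Counter

def hyphenated_words(text: str) -> Counter:
    """Count every word containing an internal hyphen (case-insensitive)."""
    # \w+(?:-\w+)+ needs a word character on both sides of each hyphen,
    # so a spaced dash like "this - that" is ignored.
    return Counter(w.lower() for w in re.findall(r"\b\w+(?:-\w+)+\b", text))

sample = "A well-known, Well-known up-stairs room. Not a dash - here."
for word, count in hyphenated_words(sample).most_common():
    print(count, word)
```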

DSpider · 06-26-2013, 03:07 PM

I export as RTF, open it in Word 2010, use Ctrl+F and search for "-^p". Then I manually go over each instance throughout the entire book. Yeah, it's tedious... But it takes less than 5 minutes, which is nothing compared to the rest of the digitization process. And I always proofread the final product, so the chances that one of them slipped by are really low.


