06-24-2013, 03:34 PM | #1 |
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
|
How do you deal with soft hyphens in OCR texts?
What is the best and most efficient way of rejoining words in an OCR-ed text which are split over two lines?
I use a sed script which prefixes the first part of the word to the next line, and replaces the hyphen with #, and then I examine these words for any that contain hard hyphens. Page breaks are handled. How do you solve this problem? |
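For readers without the sed script at hand, here is a minimal Python sketch of the same idea: move the first half of a line-end hyphenated word onto the next line and replace the hyphen with `#`, so the joins can be reviewed afterwards. The function name `mark_line_end_hyphens` and the `marker` parameter are made up for illustration, and page-break handling is left out.

```python
import re

def mark_line_end_hyphens(text: str, marker: str = "#") -> str:
    """Move a word fragment ending in '-' to the start of the next line.

    The hyphen is replaced by `marker` so every join can be checked by hand
    (some of them will be genuine hard hyphens).
    """
    out = []
    carry = ""  # fragment waiting to be prefixed to the next line
    for line in text.split("\n"):
        line = carry + line
        carry = ""
        # A run of non-space characters followed by a hyphen at end of line.
        m = re.search(r"(\S+)-$", line)
        if m:
            carry = m.group(1) + marker
            line = line[:m.start()].rstrip()
        out.append(line)
    if carry:
        # The text ended on a hyphenated fragment; keep it rather than drop it.
        out.append(carry)
    return "\n".join(out)
```

Searching the result for `#` then gives the list of words to examine for hard hyphens.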
06-24-2013, 03:44 PM | #2 |
frumious Bandersnatch
Posts: 7,536
Karma: 19000001
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
Something similar to that. Much easier in Spanish, where hard hyphens are very seldom used. Search "-\n", replace with "¬" (for instance), then manually check all ¬: most will disappear, some will turn back into "-" or "- ". If I detect some common case, I can bulk-replace (like finding "Jean¬Jacques" several times in the first few chapters).
Be careful with words like to-day, up-stairs, etc. They might have been written with hyphens in that book, even though they usually aren't today. Whenever a suspect word appears, search the whole book for similar instances and see if there's a pattern. |
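The marker approach above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the helper names are invented, and the text is assumed to use plain `\n` line endings. It replaces "-\n" with "¬", counts the marked words so common cases (like "Jean¬Jacques") stand out for bulk replacement, and checks how a suspect word is spelled elsewhere in the book.

```python
import re
from collections import Counter

def mark_breaks(text: str, marker: str = "¬") -> str:
    """Replace every hyphen + line break with a visible marker."""
    return text.replace("-\n", marker)

def marked_word_counts(marked: str, marker: str = "¬") -> Counter:
    """Count each word that contains the marker, to spot repeated cases."""
    pattern = re.compile(r"\w+(?:" + re.escape(marker) + r"\w+)+")
    return Counter(pattern.findall(marked))

def forms_elsewhere(marked: str, candidate: str, marker: str = "¬"):
    """Return (hyphenated_count, joined_count) for a marked word.

    Shows whether the book itself writes e.g. 'up-stairs' or 'upstairs'
    in places where no line break was involved.
    """
    hyphenated = candidate.replace(marker, "-")
    joined = candidate.replace(marker, "")
    count = lambda w: len(re.findall(r"\b" + re.escape(w) + r"\b", marked))
    return count(hyphenated), count(joined)
```

For example, `marked_word_counts(mark_breaks(text)).most_common()` lists the most frequent candidates first, and `forms_elsewhere(marked, "up¬stairs")` gives a hint about which spelling the book prefers. The final decision still has to be made by eye.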
06-24-2013, 09:27 PM | #3 |
Wizard
Posts: 2,304
Karma: 12587727
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
If the text is already in HTML, I use this Regex, and replace one by one:
Search: Code:
-</p>\s+<p>
A faster way might be to use the above search and replace every match with a plain hyphen, which joins the two paragraphs. Then go through your typical "hyphenation search" pass and fix any unneeded hyphens. I personally replace one by one just so I can double check with the PDF that a chunk of text is not missing (most of my work is from PDF -> EPUB).
To clean up hyphens throughout the text, I use the Spellcheck function in Sigil (Tools - Spellcheck - Spellcheck (Alt+Q)). This will give you a list of every word in the book. I then type a hyphen in the search box (see attached image), look through the resulting list of hyphenated words, and remove the hyphens where needed.
I do one pass with "Show All Words" checked, then a pass with it unchecked (to show only words Sigil thinks are misspelled), and then maybe one more pass with "Show All Words" checked. In my case, these 2 or 3 passes usually catch almost all hyphenation errors throughout the book. |
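Outside Sigil, the same two checks can be approximated with a short Python script. This is a rough sketch, not Sigil's own behaviour: the function names are hypothetical, and the tag stripping is a crude regex that assumes simple, well-formed HTML.

```python
import re
from collections import Counter

# Same pattern as the search above: a hyphen right before a paragraph break.
SPLIT = re.compile(r"-</p>\s+<p>")

def split_paragraph_hits(html: str, context: int = 30):
    """Return (before, after) text snippets around each split paragraph,
    so every hit can be compared against the source PDF one by one."""
    hits = []
    for m in SPLIT.finditer(html):
        before = html[max(0, m.start() - context):m.start()]
        after = html[m.end():m.end() + context]
        hits.append((before, after))
    return hits

def hyphenated_words(html: str) -> Counter:
    """List every hyphenated word with its count, similar in spirit to
    filtering a word list by '-'."""
    text = re.sub(r"<[^>]+>", " ", html)  # crude tag removal
    return Counter(re.findall(r"\b\w+(?:-\w+)+\b", text))
```

The first function only reports matches and leaves the file untouched, which fits the replace-one-by-one workflow; the second gives a list to skim for hyphens that should not be there.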
06-26-2013, 03:07 PM | #4 |
Evangelist
Posts: 450
Karma: 343115
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
|
I export as RTF, open it in Word 2010, use Ctrl+F and search for "-^p". Then I manually go over each instance throughout the entire book. Yeah, it's tedious... But it takes less than 5 minutes, which is nothing compared to the rest of the digitization process. And I always proofread the final product, so the chances that one of them slipped by are really low.
|