Old 06-15-2012, 10:50 AM   #1
Iznogood
Guru
Posts: 932
Karma: 15752887
Join Date: Mar 2011
Location: Norway
Device: Ipad, kindle paperwhite
Tools and methodology for easier proof-reading

Hi

I have recently been experimenting a bit, trying to find tools to help ease the proof-reading phase of the conversion from paper books to epub.

When running OCR, there will always be errors, and every program has its own algorithm for recognizing text. In other words, ABBYY FineReader is good at some things and OmniPage is good at others, so OmniPage can correctly recognize text that FineReader gets wrong.

My idea is that by running the same scan through several OCR programs and comparing the output, I could automatically detect some of the errors. Of course, an ordinary diff will not be sufficient, because the markup will differ greatly, but I have found a program named HTML Match that can process HTML pages and show the differences between them (screenshot attached).

I have also experimented with two editions of the same book; the two editions were published some years apart, were set in slightly different fonts and had other minor differences. I observed that phrases with errors in one edition could be correctly recognized in the other edition.

My theory from this is that by having multiple versions of the source, one can to a large extent detect errors automatically. Versions of text can come from the following sources:
  • scans of a book
  • scans from another (identical) copy of the same book
  • scans from different editions of the book
  • raw scans or scans cleaned with e.g. ScanTailor
It is also possible to extend this list to include versions of the epub from the darknet or from Project Gutenberg. It sounds a bit stupid to scan the book if you already have it from one of these two sources, but you can compare the scan against such a version to find the differences and correct the errors.
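The comparison idea above can be illustrated with a minimal sketch in Python, assuming the two OCR outputs are already available as plain text. The function name `ocr_disagreements` and the sample strings are made up for illustration; the diffing itself uses the standard library's `difflib.SequenceMatcher`, which is not the algorithm used by HTML Match.

```python
import difflib

def ocr_disagreements(text_a, text_b):
    """Return (words_from_a, words_from_b) pairs where two OCR outputs differ."""
    words_a = text_a.split()
    words_b = text_b.split()
    # autojunk=False keeps frequent words such as "the" usable as alignment anchors
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(words_a[i1:i2]), " ".join(words_b[j1:j2])))
    return diffs

# Hypothetical outputs from two OCR engines for the same line of a scan
engine_a = "The quick brown fox jumps over the 1azy dog"
engine_b = "The quick brovvn fox jumps over the lazy dog"

for a, b in ocr_disagreements(engine_a, engine_b):
    print(f"{a!r} <-> {b!r}")
```

Each reported pair is only a candidate: the diff shows that the engines disagree, not which of them is right, so the scan still has to be consulted.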

HTML Match has its errors and there are certainly weaknesses in the algorithm. Is there anyone else "out there" who uses a similar method, or who knows of better tools for diffing HTML files than HTML Match?

I have been fantasizing about writing a similar program myself, correcting some of the errors in the algorithm used by HTML Match and possibly making it interactive. With HTML Match, I have to find the errors, then find them again in the original file and correct them there. It would be better to have just one program to do this in.

Any tips on more suitable software or ways to detect OCR errors are most welcome.
Attachment: htmldiff.jpg (screenshot of HTML Match)

Last edited by Iznogood; 06-15-2012 at 10:52 AM.
Old 06-15-2012, 11:55 AM   #2
Doitsu
Grand Sorcerer
Posts: 5,640
Karma: 23191067
Join Date: Dec 2010
Device: Kindle PW2
Fellow Norwegian MR member SBT posted a similar topic some time ago and came up with an interesting script himself. Unfortunately, you can only use it if you have a Linux machine or a Mac. (Windows users need to install Cygwin.)
Old 06-15-2012, 12:53 PM   #3
Iznogood
Guru
Posts: 932
Karma: 15752887
Join Date: Mar 2011
Location: Norway
Device: Ipad, kindle paperwhite
I run Ubuntu myself and Windows is "confined" to VirtualBox, so I will certainly take a look at his script. If I read his code correctly, he compares everything: HTML markup, CSS styles and HTML text. Since markup can differ without affecting the epub, I don't think an "ordinary" diff or diff3 will do what I want.
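One way to keep markup differences out of the comparison is to reduce each file to its visible words before diffing. The following is a rough sketch using Python's standard `html.parser`; the class name `TextExtractor`, the helper `visible_words` and the list of block-level tags are assumptions made for this example, not part of any tool mentioned in the thread.

```python
from html.parser import HTMLParser

# Tags treated as word boundaries; inline tags such as <i> or <span> are not
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "td", "h1", "h2", "h3", "h4", "h5", "h6"}

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <style> and <script>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in ("style", "script"):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.chunks.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)

def visible_words(html):
    """Return the words a reader would see, independent of tags and classes."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).split()

# Two hypothetical OCR exports of the same sentence with different markup
a = '<p class="c1">It was a <i>dark</i> night.</p>'
b = '<div><span style="x">It was a dark</span> night.</div>'
print(visible_words(a) == visible_words(b))
```

The resulting word lists can then be fed to any text diff. The drawback is that italics and other formatting are ignored entirely, so formatting errors have to be caught separately.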

Last edited by Iznogood; 06-15-2012 at 01:01 PM. Reason: typo
Old 06-16-2012, 01:40 AM   #4
Iznogood
Guru
Posts: 932
Karma: 15752887
Join Date: Mar 2011
Location: Norway
Device: Ipad, kindle paperwhite
After a bit more searching (and researching), I found a program called DiffMerge that can run a diff while ignoring tags and/or classes, depending on how it is configured. It can also merge three sources into one, and best of all, it's cross-platform and free(!).
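To illustrate why a third source helps, here is a small sketch of word-level majority voting across three plain-text versions. It is not how DiffMerge works; the function names `aligned` and `vote` and the sample strings are invented for the example, and the alignment is a simple one based on `difflib` that only handles word-for-word substitutions.

```python
import difflib
from collections import Counter

def aligned(base, other):
    """For each word index in base, give the word other has there, or None."""
    result = [None] * len(base)
    matcher = difflib.SequenceMatcher(None, base, other, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only trust equal runs and same-length replacements
        if tag in ("equal", "replace") and i2 - i1 == j2 - j1:
            for k in range(i2 - i1):
                result[i1 + k] = other[j1 + k]
    return result

def vote(text_a, text_b, text_c):
    """Merge three versions word by word; return (merged_text, unresolved_indexes)."""
    words_a, words_b, words_c = text_a.split(), text_b.split(), text_c.split()
    from_b = aligned(words_a, words_b)
    from_c = aligned(words_a, words_c)
    merged, unresolved = [], []
    for i, word in enumerate(words_a):
        candidates = [w for w in (word, from_b[i], from_c[i]) if w is not None]
        best, count = Counter(candidates).most_common(1)[0]
        if count >= 2:
            merged.append(best)
        else:
            merged.append(word)   # no majority: keep version A and flag it
            unresolved.append(i)
    return " ".join(merged), unresolved

# Hypothetical OCR results, each with a different single error
a = "the c1ock struck twelve"
b = "the clock struck twe1ve"
c = "the clock strvck twelve"
print(vote(a, b, c))
```

With two sources a difference only tells you that one of them is wrong; with three, errors that occur in only one source can be outvoted, and only the positions without a majority need manual attention.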
Old 06-16-2012, 04:19 AM   #5
HarryT
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
All these tools can help but, at the end of the day, there's no substitute for human proofreading. The only way to properly proofread a text is to have the computer screen alongside the printed book and read them in parallel, word by word, comma by comma.
Old 06-16-2012, 05:37 AM   #6
SBT
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
I do proofing broadly similar to norway1456. My workflow is roughly as follows:
  1. Download multiple versions of a book from the Internet Archive
    AND/OR
  2. Do two separate scans; I use 150 and 300 dpi.
  3. Use vimdiff for spotting differences and merging
  4. Put scan images and revised text side by side in an HTML file, import into LibreOffice, run spellcheck, and proofread, with particular attention to paragraphs, italics, and punctuation.
  5. Finally, add HTML code and run text through home-brewed scripts to create XHTML file and epub-file.
I use Adobe Acrobat X Pro; I haven't tried any others, but it seems to do a decent job.
vimdiff isn't exactly user friendly, but when you've learnt the key combinations, it's darn fast, and carpal tunnel friendly.
I try to eliminate trivial differences between the scanned texts before diffing, in particular different lengths in initial spaces. The following regexps handle this:
Code:
1,$s/^ *\([a-z]\)/\1/
1,$s/^    *\([A-Z"']\)/\t\1/
1,$s/^ \([^ ]\)/\1/
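For readers who do not use vim, the three substitutions above can be approximated in Python. This is a sketch of my reading of those patterns, not SBT's own script; the function name `normalize_indent` and the sample text are made up.

```python
import re

def normalize_indent(text):
    """Apply the three leading-space clean-ups in the same order as above."""
    # 1. A line starting with a lowercase letter is a continuation: drop its indent
    text = re.sub(r"^ *([a-z])", r"\1", text, flags=re.MULTILINE)
    # 2. Three or more spaces before a capital or a quote mark: treat as a
    #    paragraph start and replace the indent with a single tab
    text = re.sub(r"^ {3,}([A-Z\"'])", r"\t\1", text, flags=re.MULTILINE)
    # 3. A single stray leading space before any other character: remove it
    text = re.sub(r"^ ([^ \n])", r"\1", text, flags=re.MULTILINE)
    return text

sample = (
    "   The first line\n"
    "  continues here\n"
    " \"Quoted\" start\n"
    "     'Another' paragraph"
)
print(normalize_indent(sample))
```

Running the same normalization on every version before diffing means that the diff only shows differences in the words themselves, not in how each OCR run happened to indent a line.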
Old 06-16-2012, 11:58 AM   #7
mrmikel
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
Woe

Quote:
Originally Posted by HarryT
All these tools can help but, at the end of the day, there's no substitute for human proofreading. The only way to properly proofread a text is to have the computer screen alongside the printed book and read them in parallel, word by word, comma by comma.
This is the woe of the e-book creator. By the time you go through a book creating it, you have had enough of it for a long time unless you are especially fond of it.
Old 06-16-2012, 06:02 PM   #8
SBT
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
Quote:
Originally Posted by mrmikel
This is the woe of the e-book creator. By the time you go through a book creating it, you have had enough of it for a long time unless you are especially fond of it.
A good point. I'm deeply impressed by HarryT's dedication, but for myself I'm satisfied as long as the number of remaining errors does not mar the reading experience noticeably. (I know, 'noticeable' is an unknown variable for each reader...)
To achieve this, I've tried to organize the proofreading workflow so that I can read the book through for a final proofreading without being sickeningly familiar with its contents, and without having to stop at every other sentence to tag a mistaek. After all, I'm supposed to be doing this for fun ....
Old 06-16-2012, 10:24 PM   #9
pholy
Booklegger
Posts: 1,801
Karma: 7999816
Join Date: Jun 2009
Location: Toronto, Ontario, Canada
Device: BeBook(1 & 2010), PEZ, PRS-505, Kobo BT, PRS-T1, Playbook, Kobo Touch
Quote:
After all, I'm supposed to be doing this for fun ....
Ahh, there's the difference. Harry is doing it for posterity! I do admire his dedication.
Old 06-16-2012, 11:16 PM   #10
JSWolf
Resident Curmudgeon
Posts: 76,444
Karma: 136564696
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
The only way to do a PDF and OCR conversion is to include a full A/B comparison in the workflow and if you don't, don't bother to do it at all.
Old 06-17-2012, 05:10 PM   #11
Iznogood
Guru
Posts: 932
Karma: 15752887
Join Date: Mar 2011
Location: Norway
Device: Ipad, kindle paperwhite
I know that proof-reading is a necessity, and that it must be done thoroughly if it is to be of any use at all. The formatting of the book must also be done manually. The OCR program is no good at marking up special parts of the text so that they wrap decently at various font sizes and screen sizes.

But as my countryman SBT points out, it should be done for fun, and therefore the more errors are auto detected, the less interruption in the reading experience while proof-reading, and the more fun it is.

Besides, as a software man, I know that there are, and always will be, bugs in any file, whether software code or HTML pages. While proof-reading, you find and correct maybe 98% of these, but the remaining 2% go undetected. If you use a tool to find errors and highlight them, you might be able to find 99% of the errors.
Old 06-18-2012, 02:08 AM   #12
PeterT
Grand Sorcerer
Posts: 12,754
Karma: 75000002
Join Date: Nov 2007
Location: Toronto
Device: Libra H2O, Libra Colour
It might be overkill, but Project Gutenberg has an associated project, "Distributed Proofreaders", at http://www.pgdp.net/c/

Their approach is to display on the screen the scanned page in image format, and the OCR'ed text. They do make their entire system available at http://sourceforge.net/projects/dproofreaders/

Someone might be interested in running their own personal DP website and using it to handle the OCR validation side; yes I realize that this would still leave the markup to be done separately.
Old 06-18-2012, 04:24 AM   #13
SBT
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
I've wondered what's the best way of handling words split over lines when proofing OCR texts.
I use sed to get all of the word on one line, and then do interactive search&replace in an editor to remove soft hyphens.
I also use sed to automatically detect chapter headings and any subtitles, page headers, page numbers, and paragraphs.
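The first step described above, pulling the second half of a split word up onto the line where the word starts, might look roughly like this in Python. SBT's actual sed commands are not shown in the thread, so this is only a guess at the idea; the name `join_split_words` and the sample text are invented.

```python
import re

# A word fragment ending in "-" at end of line, followed by the rest of the
# word (plus any trailing punctuation) at the start of the next line
SPLIT_WORD = re.compile(r"(\w+)-\n(\S+)[ \t]*\n?")

def join_split_words(text):
    """Move the tail of each split word up to the previous line.

    The hyphen is kept so that real hyphens (well-known) can be told apart
    from line-break hyphens (sen-tence) in a later interactive pass.
    """
    return SPLIT_WORD.sub(r"\1-\2\n", text)

sample = "a long sen-\ntence with a well-\nknown word"
print(join_split_words(sample))
```

Keeping the hyphen at this stage is deliberate: whether it should stay or go depends on the word, which is why the removal is left to an interactive search and replace.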
Old 06-18-2012, 06:32 AM   #14
Doitsu
Grand Sorcerer
Posts: 5,640
Karma: 23191067
Join Date: Dec 2010
Device: Kindle PW2
Quote:
Originally Posted by SBT
I also use sed to automatically detect chapter headings and any subtitles, page headers, page numbers, and paragraphs.
Could you please post your sed script(s)?
Old 06-18-2012, 07:20 AM   #15
Iznogood
Guru
Posts: 932
Karma: 15752887
Join Date: Mar 2011
Location: Norway
Device: Ipad, kindle paperwhite
I also use sed and/or other tools for regex search and replace, but my methods are based on "heuristics" rather than scripts, because the output from the OCR program depends on the input. So my opinion is that detection of chapters is best done manually in each case, but once you have seen the pattern in the HTML file, you can batch search and replace for elements such as chapters, page breaks etc.
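As an illustration of such a batch replace, suppose inspection of one particular OCR export shows that every chapter heading comes out as a paragraph of the form `<p ...>CHAPTER IV</p>`. That pattern is purely an assumption for this example and will differ from book to book; the name `mark_chapters` is likewise made up.

```python
import re

# Pattern observed by hand in one hypothetical OCR export; adjust per book
CHAPTER = re.compile(r"<p[^>]*>\s*(CHAPTER\s+[IVXLC\d]+)\s*</p>")

def mark_chapters(html):
    """Turn chapter paragraphs into <h2> headings; return (new_html, count)."""
    return CHAPTER.subn(r"<h2>\1</h2>", html)

sample = '<p class="s3">CHAPTER IV</p>\n<p>It was late.</p>'
new_html, count = mark_chapters(sample)
print(count)
print(new_html)
```

Returning the number of replacements makes it easy to compare against the chapter count in the printed table of contents and spot headings the pattern missed.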
Tags
ocr, proof-reading

