06-15-2012, 10:50 AM | #1 |
Guru
Posts: 932
Karma: 15752887
Join Date: Mar 2011
Location: Norway
Device: Ipad, kindle paperwhite
|
Tools and methodology for easier proof-reading
Hi
I have recently been experimenting a bit, trying to find tools that ease the proof-reading phase of converting paper books to epub. OCR always produces errors, and every program has its own algorithm for recognizing text. In other words, ABBYY FineReader is good at some things and OmniPage at others, and OmniPage can correctly recognize text that FineReader gets wrong. My idea is that by running the same scan through several OCR programs and comparing the output, I could detect some of the errors automatically. Of course, an ordinary diff will not be sufficient, because the markup will differ greatly, but I have found a program named HTML Match that can process HTML pages and show the differences between them (screenshot attached).
I have also experimented with two editions of the same book; the two editions were published some years apart, set in slightly different fonts, and had other minor differences. I observed that phrases with errors in one edition could be correctly recognized in the other edition. My theory from this is that by having multiple versions of the source, one can to a large extent detect errors automatically. Versions of text can come from the following sources:
HTML Match has its errors, and there are certainly weaknesses in its algorithm. Is there anyone else "out there" using a similar method, or who knows of better tools for diffing HTML files than HTML Match? I have been fantasizing about writing a similar program myself, correcting some of the errors in the algorithm used by HTML Match and possibly making it interactive. With HTML Match, I have to find the errors in the comparison, then locate them in the original file and correct them there. It would be better to do all of this in one program. Any tips on more suitable software or ways to detect OCR errors are most welcome.
Last edited by Iznogood; 06-15-2012 at 10:52 AM. |
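A minimal sketch of the comparison idea from the opening post, assuming the two OCR outputs have already been reduced to plain text. It uses Python's standard `difflib` to align the two word sequences and reports only the places where the runs disagree, which are the candidates for a human to check. The sample strings and the function name `ocr_disagreements` are made up for illustration, and this is not a description of how HTML Match works internally.

```python
import difflib


def ocr_disagreements(text_a, text_b):
    """Return (words_from_a, words_from_b) pairs where two OCR runs differ."""
    words_a = text_a.split()
    words_b = text_b.split()
    # autojunk=False keeps frequent words from being ignored on long texts.
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    suspects = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag != "equal":
            suspects.append((words_a[a0:a1], words_b[b0:b1]))
    return suspects


# Invented sample data standing in for two OCR runs of the same page.
run_a = "The quick brown fox jumps over the lazy dog"
run_b = "The quick brovvn fox jumps over the 1azy dog"

for left, right in ocr_disagreements(run_a, run_b):
    print(" ".join(left), "<->", " ".join(right))
```

A disagreement only says that at least one run is wrong at that spot; it cannot find errors that both programs make identically, so it reduces the proof-reading load rather than replacing it.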
06-15-2012, 11:55 AM | #2 |
Grand Sorcerer
Posts: 5,640
Karma: 23191067
Join Date: Dec 2010
Device: Kindle PW2
|
Fellow Norwegian MR member SBT posted a similar topic some time ago and came up with an interesting script himself. Unfortunately, you can only use it if you have a Linux machine or a Mac. (Windows users need to install Cygwin.)
|
06-15-2012, 12:53 PM | #3 |
Guru
Posts: 932
Karma: 15752887
Join Date: Mar 2011
Location: Norway
Device: Ipad, kindle paperwhite
|
I run Ubuntu myself and Windows is "confined" to VirtualBox, so I will certainly take a look at his script. If I read his code correctly, he compares everything: HTML markup, CSS styles and HTML text. Since markup can differ without affecting the epub, I don't think an "ordinary" diff or diff3 will do what I want.
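One way to keep markup differences out of the comparison is to reduce each HTML file to its visible words before diffing. The sketch below does this with Python's standard `html.parser`; the class name, the list of block-level tags and the two sample pages are assumptions made for illustration, not something taken from SBT's script or from HTML Match.

```python
from html.parser import HTMLParser

# Assumed tag lists for this sketch; extend them to suit the OCR output.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIPPED_TAGS = {"style", "script"}


class TextOnly(HTMLParser):
    """Collect visible text, ignoring tags, attributes and CSS/script bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.chunks.append(" ")  # block boundaries separate words

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.chunks.append(" ")

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_words(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).split()


# Two invented pages with the same text but very different markup.
page_a = ('<html><head><style>p.x{margin:0}</style></head>'
          '<body><p class="x">Call me <i>Ishmael</i>.</p></body></html>')
page_b = '<body><div><span style="font-size:12pt">Call me Ishmael.</span></div></body>'

print(visible_words(page_a) == visible_words(page_b))
```

The resulting word lists can then be fed to any word-level diff, so that only differences in the text itself show up.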
Last edited by Iznogood; 06-15-2012 at 01:01 PM. Reason: typo |
06-16-2012, 01:40 AM | #4 |
Guru
Posts: 932
Karma: 15752887
Join Date: Mar 2011
Location: Norway
Device: Ipad, kindle paperwhite
|
After a bit more searching (and researching), I found a program called DiffMerge that can run a diff while ignoring tags and/or classes, depending on how it is configured. It can also merge three sources into one, and best of all, it's cross-platform and free(!).
|
06-16-2012, 04:19 AM | #5 |
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
|
All these tools can help but, at the end of the day, there's no substitute for human proofreading. The only way to properly proofread a text is to have the computer screen alongside the printed book and read them in parallel, word by word, comma by comma.
|
06-16-2012, 05:37 AM | #6 |
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
|
I do proofing broadly similar to norway1456. My workflow is roughly as follows:
vimdiff isn't exactly user friendly, but when you've learnt the key combinations, it's darn fast, and carpal tunnel friendly. I try to eliminate trivial differences between the scanned texts before diffing, in particular different lengths in initial spaces. The following regexps handle this: Code:
1,$s/^ *\([a-z]\)/\1/
1,$s/^ *\([A-Z"']\)/\t\1/
1,$s/^ \([^ ]\)/\1/
|
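For anyone without vim at hand, the three substitutions above could be approximated in Python as follows. This is only a sketch: the function name `normalize_indent` and the sample text are made up, and the comments give one reading of what each expression is for rather than the poster's own explanation.

```python
import re


def normalize_indent(text):
    """Apply the three leading-whitespace substitutions to every line."""
    # Line starts with a lowercase letter: drop any leading spaces.
    text = re.sub(r"^ *([a-z])", r"\1", text, flags=re.M)
    # Line starts with an uppercase letter or a quote: use exactly one tab.
    text = re.sub(r"^ *([A-Z\"'])", r"\t\1", text, flags=re.M)
    # A single leading space before any other non-space character: drop it.
    # "\n" is excluded because vim's [^ ] does not match across a line end.
    text = re.sub(r"^ ([^ \n])", r"\1", text, flags=re.M)
    return text


# Invented sample with inconsistent indentation.
sample = (
    "   Chapter one began.\n"
    "  and it continued here.\n"
    ' "Quoted line"\n'
    " 42 pages\n"
)
print(repr(normalize_indent(sample)))
```

Running the same normalization over both scanned texts before diffing removes indentation noise, so the remaining differences are more likely to be real OCR disagreements.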
06-16-2012, 11:58 AM | #7 |
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
|
Woe
This is the woe of the e-book creator. By the time you go through a book creating it, you have had enough of it for a long time unless you are especially fond of it.
|
06-16-2012, 06:02 PM | #8 | |
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
|
Quote:
To achieve this, I've tried to organize the proofreading workflow so that I can read the book through for a final proofreading while still not being sickeningly familiar with its contents, and without having to stop at every other sentence to tag a mistaek. After all, I'm supposed to be doing this for fun .... |
|
06-16-2012, 10:24 PM | #9 | |
Booklegger
Posts: 1,801
Karma: 7999816
Join Date: Jun 2009
Location: Toronto, Ontario, Canada
Device: BeBook(1 & 2010), PEZ, PRS-505, Kobo BT, PRS-T1, Playbook, Kobo Touch
|
Quote:
|
|
06-16-2012, 11:16 PM | #10 |
Resident Curmudgeon
Posts: 76,444
Karma: 136564696
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
The only way to do a PDF and OCR conversion is to include a full A/B comparison in the workflow and if you don't, don't bother to do it at all.
|
06-17-2012, 05:10 PM | #11 |
Guru
Posts: 932
Karma: 15752887
Join Date: Mar 2011
Location: Norway
Device: Ipad, kindle paperwhite
|
I know that proof-reading is a necessity, and that it must be done thoroughly to be any good at all. The formatting of the book must also be done manually. The OCR program is no good at wrapping special parts of the text so that they wrap decently at various font sizes and screen sizes.
But as my countryman SBT points out, it should be done for fun. The more errors are detected automatically, the fewer interruptions in the reading experience while proof-reading, and the more fun it is. Besides, as a software man, I know that there are, and always will be, bugs in any file, whether software code or HTML pages. While proof-reading, you find and correct maybe 98% of these, but the remaining 2% go undetected. With a tool that finds and highlights errors, you might be able to catch 99% of them. |
06-18-2012, 02:08 AM | #12 |
Grand Sorcerer
Posts: 12,754
Karma: 75000002
Join Date: Nov 2007
Location: Toronto
Device: Libra H2O, Libra Colour
|
It might be overkill but Project Gutenberg has an associated project "Distributed Proofreaders" at http://www.pgdp.net/c/
Their approach is to display on the screen the scanned page in image format, and the OCR'ed text. They do make their entire system available at http://sourceforge.net/projects/dproofreaders/ Someone might be interested in running their own personal DP website and using it to handle the OCR validation side; yes I realize that this would still leave the markup to be done separately. |
06-18-2012, 04:24 AM | #13 |
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
|
I've wondered what's the best way of handling words split over lines when proofing OCR texts.
I use sed to get all of the word on one line, and then do interactive search&replace in an editor to remove soft hyphens. I also use sed to automatically detect chapter headings and any subtitles, page headers, page numbers, and paragraphs. |
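A rough Python equivalent of the first step, as a sketch only: it assumes the OCR output marks a split word with a plain "-" at the end of the line, which will not hold for every program. The tail of the word is pulled up onto the first line and the hyphen is kept, so that the decision between a soft hyphen and a real one is still made by hand afterwards. The function names and sample text are invented.

```python
import re


def join_split_words(text):
    """Move the second half of a line-split word up to the first line."""
    # (\w-)   last letter plus hyphen at the end of a line
    # (\S+)   the rest of the word (with any trailing punctuation)
    # \n?     swallow the line break if the tail was the whole line
    return re.sub(r"(\w-)\n[ \t]*(\S+)[ \t]*\n?", r"\1\2\n", text)


def hyphen_candidates(text):
    """List hyphenated words for interactive review in an editor."""
    return re.findall(r"\w+-\w+", text)


sample = "The proof-\nreading phase is te-\ndious.\n"
joined = join_split_words(sample)
print(repr(joined))
print(hyphen_candidates(joined))
```

Here "proof-reading" keeps its hyphen while "te-dious" should lose it, which is why the removal step stays interactive.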
06-18-2012, 06:32 AM | #14 |
Grand Sorcerer
Posts: 5,640
Karma: 23191067
Join Date: Dec 2010
Device: Kindle PW2
|
|
06-18-2012, 07:20 AM | #15 |
Guru
Posts: 932
Karma: 15752887
Join Date: Mar 2011
Location: Norway
Device: Ipad, kindle paperwhite
|
I also use sed and/or other tools for regex search and replace, but my methods are based on "heuristics" rather than scripts, because the output from the OCR program depends on the input. So in my opinion, detection of chapters is best done manually in each case, but once you have seen the pattern in the HTML file, you can batch search and replace for elements such as chapters, page breaks, etc.
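As an illustration of that last step, here is a small sketch of a batch replace once a pattern has been spotted by eye. The pattern is purely hypothetical: it assumes a book where every chapter heading came out of the OCR program as a paragraph containing only "CHAPTER" and a Roman numeral, and it would need to be rewritten for each book.

```python
import re

# Hypothetical pattern for one imagined book; adjust after inspecting the file.
CHAPTER = re.compile(r"<p>\s*(CHAPTER\s+[IVXLC]+)\s*</p>")


def mark_chapters(html):
    """Turn matching paragraphs into headings; return (new_html, count)."""
    return CHAPTER.subn(r"<h2>\1</h2>", html)


# Invented sample input.
sample = ("<p>CHAPTER I</p><p>It was a dark night.</p>"
          "<p>CHAPTER II</p><p>Morning came.</p>")
marked, count = mark_chapters(sample)
print(count)
print(marked)
```

Checking the replacement count against the number of chapters in the printed book is a quick way to see whether the pattern missed or over-matched anything.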
|