Tools and methodology for easier proof-reading

Iznogood · 06-15-2012, 09:50 AM

Hi

I have recently been experimenting a bit, trying to find tools to help ease the proof-reading phase of the conversion from paper books to epub.

When running OCR, there will always be errors, and every program has its own algorithm for recognizing text. In other words, ABBYY FineReader is good at some things, OmniPage is good at other things, and OmniPage can correctly recognize text that FineReader gets wrong.

My idea is that by running the same scan through several OCR programs, and comparing the output, I could automatically detect some of the errors. Of course, ordinary diff will not be sufficient, because the markup will differ greatly, but I have found a program named HTML Match that can process html pages and show the differences between them (screenshot attached).

I have also experimented with two editions of the same book; the two editions were published some years apart, set in slightly different fonts, and had other minor differences. I observed that phrases with errors in one edition could be correctly recognized in the other edition.

My theory from this is that by having multiple versions of the source, one can to a large extent detect errors automatically. Versions of text can come from the following sources:

scans of a book
scans from another (identical) copy of the same book
scans from different editions of the book
raw scans or scans cleaned with e.g. ScanTailor

It is also possible to extend this list to include versions of the epub from the darknet or from Project Gutenberg. It sounds a bit stupid to scan the book if you already have it from one of these two sources, but it is possible to compare the two versions against each other to find the differences and correct the errors.

HTML Match has its errors and there are certainly weaknesses in the algorithm. Is there anyone else "out there" using a similar method, or who knows of better tools for diffing html files than HTML Match?

I have been fantasizing about writing a similar program myself, correcting some of the errors in the algorithm used by HTML Match and possibly making it interactive. With HTML Match, I have to find the errors, and then find them in the original file and correct them there. It would be better to have just one program to do this in.

Any tips on more suitable software or ways to detect OCR errors are most welcome.
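
The idea in the opening post (compare the words of two OCR outputs while ignoring the markup) can be sketched in a few lines of Python. This is only an illustration and not how HTML Match works; the names `TextOnly`, `words_of` and `text_differences` are made up for the example, and it assumes each input is a body fragment without `<style>` or `<script>` content.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects the words of the text nodes and ignores all tags."""

    def __init__(self):
        super().__init__()
        self.words = []

    def handle_data(self, data):
        self.words.extend(data.split())


def words_of(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return parser.words


def text_differences(html_a, html_b):
    """Return (words_in_a, words_in_b) pairs where the two texts disagree."""
    a, b = words_of(html_a), words_of(html_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

A word that one OCR program splits with an inline tag (for example `qu<i>ick</i>`) would show up as a difference, so the output is a list of candidates to check, not a list of confirmed errors.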

Doitsu · 06-15-2012, 10:55 AM

Fellow Norwegian MR member SBT posted a similar topic some time ago and came up with an interesting script himself. Unfortunately, you can only use it if you have a Linux machine or a Mac. (Windows users need to install Cygwin.)

Iznogood · 06-15-2012, 11:53 AM

I run Ubuntu myself and Windows is "confined" to VirtualBox, so I will certainly take a look at his script. If I read his code correctly, he compares everything: html markup, css styles and html text. Taking into account that markup can differ without it affecting the epub, I don't think an "ordinary" diff or diff3 will do what I wish to do.

Iznogood · 06-16-2012, 12:40 AM

After a bit more searching (and researching), I did find a program called DiffMerge that is able to run diff and ignore tags and/or classes, depending on how it is configured. It is also capable of merging three sources into one, and the best part of it: it's cross-platform and free(!).
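
With three versions of the same text, a simple majority vote per word is one way to let the versions correct each other. The sketch below is an assumption about how that could look in Python, not a description of DiffMerge's merge; `aligned` and `vote` are hypothetical names, and only one-for-one word substitutions are voted on.

```python
from collections import Counter
from difflib import SequenceMatcher


def aligned(base, other):
    """For each word in base, return the word other has in that position.

    Positions that cannot be paired one-for-one are left as None.
    """
    result = [None] * len(base)
    matcher = SequenceMatcher(None, base, other, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        if tag == "equal" or (tag == "replace" and same_length):
            for k in range(i2 - i1):
                result[i1 + k] = other[j1 + k]
    return result


def vote(text_a, text_b, text_c):
    """Return the majority text and the word positions that were not unanimous."""
    base = text_a.split()
    columns = zip(base, aligned(base, text_b.split()), aligned(base, text_c.split()))
    merged, doubtful = [], []
    for index, candidates in enumerate(columns):
        counts = Counter(word for word in candidates if word is not None)
        # On a tie the word from text_a wins, because it was counted first.
        winner, votes = counts.most_common(1)[0]
        merged.append(winner)
        if votes < 3:
            doubtful.append(index)
    return " ".join(merged), doubtful
```

The list of doubtful positions is the useful part: those are the words worth checking against the scan, whichever version won the vote.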

HarryT · 06-16-2012, 03:19 AM

All these tools can help but, at the end of the day, there's no substitute for human proofreading. The only way to properly proofread a text is to have the computer screen alongside the printed book and read them in parallel, word by word, comma by comma.

SBT · 06-16-2012, 04:37 AM

I do proofing broadly similar to norway1456. My workflow is roughly as follows:

Download multiple versions of a book from the Internet Archive
AND/OR
Do two separate scans, 150 and 300 dpi is what I use.
Use vimdiff for spotting differences and merging
Put scan images and revised text side by side in an HTML file, import into LibreOffice, run spellcheck, and proofread, with particular attention to paragraphs, italics, and punctuation.
Finally, add HTML code and run text through home-brewed scripts to create XHTML file and epub-file.

I use Adobe Acrobat X Pro; I haven't tried any others, but it seems to do a decent job.
vimdiff isn't exactly user friendly, but when you've learnt the key combinations, it's darn fast, and carpal tunnel friendly.
I try to eliminate trivial differences between the scanned texts before diffing, in particular different lengths in initial spaces. The following regexps handle this:

Code:

1,$s/^ *\([a-z]\)/\1/
1,$s/^    *\([A-Z"']\)/\t\1/
1,$s/^ \([^ ]\)/\1/
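
For anyone who wants to run the same clean-up outside vim, here is a rough Python equivalent of the three substitutions, applied in the same order. It assumes the second pattern is meant to match three or more leading spaces, which is how the pattern reads as posted; `normalize_indent` is just a name chosen for the example.

```python
import re


def normalize_indent(text):
    """Apply the three leading-space substitutions to every line, in order."""
    cleaned = []
    for line in text.split("\n"):
        # 1. Drop all leading spaces before a lowercase letter (continuation line).
        line = re.sub(r"^ *([a-z])", r"\1", line)
        # 2. Turn three or more leading spaces before a capital or a quote into a tab.
        line = re.sub(r"^ {3,}([A-Z\"'])", r"\t\1", line)
        # 3. Drop a single leading space before any non-space character.
        line = re.sub(r"^ ([^ ])", r"\1", line)
        cleaned.append(line)
    return "\n".join(cleaned)
```

Running both OCR outputs through the same function before diffing should leave only the differences in the words themselves.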

mrmikel · 06-16-2012, 10:58 AM

Quote:

Originally Posted by HarryT

All these tools can help but, at the end of the day, there's no substitute for human proofreading. The only way to properly proofread a text is to have the computer screen alongside the printed book and read them in parallel, word by word, comma by comma.

This is the woe of the e-book creator. By the time you go through a book creating it, you have had enough of it for a long time unless you are especially fond of it.

SBT · 06-16-2012, 05:02 PM

Quote:

Originally Posted by mrmikel

This is the woe of the e-book creator. By the time you go through a book creating it, you have had enough of it for a long time unless you are especially fond of it.

A good point. I'm deeply impressed by HarryT's dedication, but for myself I'm satisfied as long as the number of remaining errors does not mar the reading experience noticeably. (I know, 'noticeable' is an unknown variable for each reader...)
To achieve this, I've tried to organize the proofreading workflow so that I can read the book through for a final proofreading without being sickeningly familiar with its contents, and without having to stop at every other sentence to tag a mistaek. After all, I'm supposed to be doing this for fun ....

pholy · 06-16-2012, 09:24 PM

Quote:

After all, I'm supposed to be doing this for fun ....

Ahh, there's the difference. Harry is doing it for posterity! I do admire his dedication.

JSWolf · 06-16-2012, 10:16 PM

The only way to do a PDF and OCR conversion is to include a full A/B comparison in the workflow and if you don't, don't bother to do it at all.

Iznogood · 06-17-2012, 04:10 PM

I know that proof-reading is a necessity, and that it must be done thoroughly if it is to be any good at all. The formatting of the book must also be done manually. The OCR program is no good at marking up special parts of the text so that they wrap in a decent way with various font sizes/screen sizes.

But as my countryman SBT points out, it should be done for fun, and therefore the more errors are auto detected, the less interruption in the reading experience while proof-reading, and the more fun it is.

Besides: as a software man, I know that there always are, and always will be, bugs in any file, software code or html pages. While proof-reading, you find and correct maybe 98% of these, but the remaining 2% go undetected. If you use some tool to find errors and highlight them, you might be able to find 99% of the errors.

PeterT · 06-18-2012, 01:08 AM

It might be overkill, but Project Gutenberg has an associated project "Distributed Proofreaders" at http://www.pgdp.net/c/

Their approach is to display on the screen the scanned page in image format, and the OCR'ed text. They do make their entire system available at http://sourceforge.net/projects/dproofreaders/

Someone might be interested in running their own personal DP website and using it to handle the OCR validation side; yes I realize that this would still leave the markup to be done separately.

SBT · 06-18-2012, 03:24 AM

I've wondered what's the best way of handling words split over lines when proofing OCR texts.
I use sed to get all of the word on one line, and then do an interactive search & replace in an editor to remove soft hyphens.
I also use sed to automatically detect chapter headings and any subtitles, page headers, page numbers, and paragraphs.
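
The first step (getting a split word onto one line) could look roughly like this in Python instead of sed. It is a sketch under the assumption that a split word ends its line with a plain `-`; the hyphen is kept on purpose, since a script cannot tell "exam-ple" from "well-known", and the returned list is meant for the interactive review. `join_split_words` is a name invented for the example.

```python
import re

# A word fragment, a hyphen at the end of the line, and the rest of the word
# at the start of the next line (plus any spaces and a line break after it).
SPLIT_WORD = re.compile(r"(\w+)-\n(\S+)[ \t]*\n?")


def join_split_words(text):
    """Move the second half of each split word up; return new text and the joins."""
    joined = []

    def move_up(match):
        word = match.group(1) + "-" + match.group(2)
        joined.append(word)
        return word + "\n"

    return SPLIT_WORD.sub(move_up, text), joined
```

Each entry in the list still contains its hyphen, so the remaining work is to go through them and decide which hyphens are real.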

Doitsu · 06-18-2012, 05:32 AM

Quote:

Originally Posted by SBT

I also use sed to automatically detect chapter headings and any subtitles, page headers, page numbers, and paragraphs.

Could you please post your sed script(s)?

Iznogood · 06-18-2012, 06:20 AM

I also use sed and/or other tools for regex search and replace, but my methods are based on "heuristics" rather than scripts, because the output from the OCR program depends on the input. So my opinion is that detection of chapters is best done manually in each case, but once you have seen the pattern of the html file, you can batch search and replace for such elements as chapters, page breaks etc.
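
Once the pattern in a particular OCR output is known, the batch step might look like the Python sketch below. The pattern here (a paragraph containing only "Chapter" plus a number or Roman numeral) is purely an assumed example, as is the choice of `<h2>`; it would need adjusting for each book.

```python
import re

# Assumed pattern: a paragraph on its own line holding only "Chapter <number>".
CHAPTER = re.compile(
    r"^<p>\s*(chapter\s+(?:[0-9]+|[ivxlc]+))\s*</p>$",
    re.IGNORECASE | re.MULTILINE,
)


def find_chapters(html):
    """List the headings the pattern would change, for checking by eye first."""
    return CHAPTER.findall(html)


def tag_chapters(html):
    """Replace each matching paragraph with a heading element."""
    return CHAPTER.sub(r"<h2>\1</h2>", html)
```

Calling `find_chapters` before `tag_chapters` keeps the manual check in the loop: if the list looks wrong, the pattern is wrong for that file.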
