07-25-2014, 05:55 PM | #1 |
formatting student
Posts: 47
Karma: 38268
Join Date: Dec 2013
Location: South Arkansas, US
Device: several models of Kindle; Several Android tablets & 3 Android phones.
|
Can you OCR the images inside of .pdf files?
Hello All--
Can you OCR the images inside of .pdf files? Do you have to extract the images somehow to .tiff images, and then feed them to OCR software? HOW?? I'd like to be able to (hopefully with no-cost/low-cost software) locate scanned copies of books I own (or Public Domain ones) and, thru whatever processes, end up with a text / rtf / whatever... file of the words on that page/book. Is there a tutorial, or a section of the Wiki, that I've overlooked/not found?? I know I sound like a babbling idiot...... I've downloaded one-too-many public domain epubs from both Amazon and B&N, as well as Google & Internet Archive, that needed to be worked on. I've typed out by hand several books, especially cookbooks, and it's a tedious and long-term project. There's got to be a better way!!! (Moderators- please move this thread to whatever section it belongs in, if I'm in the wrong place.) TIA Kathy MamaDragon |
07-26-2014, 12:05 AM | #2 |
Addict
Posts: 272
Karma: 8000000
Join Date: Oct 2010
Location: Corvallis, OR
Device: Kindle PW2, iPad Pro
|
Kathy, I am a scanner/OCR lover. I use ABBYY Finereader. I just drag the file (PDF) into the software and it does its magic. I can run them for you if you need. You would then compare the opened PDF with the OCR version and correct errors. It can take a bit of time.
|
07-26-2014, 12:59 PM | #3 |
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
|
Be aware that a good error rate for OCR is 1%. That translates to an error PER PAGE, so it will take some good proofreading to make sure there are no errors. It would be a mighty shame to put in 1 cup when 1/4 cup was in the original!!!!
|
07-26-2014, 06:03 PM | #4 | |
frumious Bandersnatch
Posts: 7,536
Karma: 19000001
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
Quote:
Unless the 1% error rate refers to words, not to characters. Then we can estimate around 250 words per page, and 2-3 errors per page. |
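To make the words-versus-characters estimate concrete, here is a rough back-of-the-envelope sketch. The 250 words per page and the 1% rate come from the posts above; the 6 characters per word (about five letters plus a space) is my own assumption, so treat the per-character figure as a ballpark only.

```python
def expected_errors_per_page(units_per_page, error_rate):
    """Expected OCR errors on one page, given how many units (words or
    characters) the page holds and the error rate per unit."""
    return units_per_page * error_rate


WORDS_PER_PAGE = 250   # estimate quoted in the thread
CHARS_PER_WORD = 6     # ASSUMPTION: about five letters plus one space
ERROR_RATE = 0.01      # the "good" 1% rate quoted in the thread

# Case 1: the 1% rate counts wrong words.
per_word = expected_errors_per_page(WORDS_PER_PAGE, ERROR_RATE)

# Case 2: the 1% rate counts wrong characters.
per_char = expected_errors_per_page(WORDS_PER_PAGE * CHARS_PER_WORD, ERROR_RATE)

print(f"1% of words:      about {per_word:.1f} errors per page")
print(f"1% of characters: about {per_char:.1f} errors per page")
```

Under these assumptions a per-word rate works out to about 2.5 errors a page, while a per-character rate works out to about 15, which is why it matters which one a quoted accuracy figure refers to.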
|
07-27-2014, 07:55 AM | #5 |
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
|
I think you are right, Jelby. The error rates I found are per character. That only makes close proofreading more imperative, and almost word-for-word examination is required. Close guesses will work much of the time, but when they don't, the whole meaning of the sentence is easily changed.
Columns are a particular bugbear. It is very common for parts of paragraphs to be shifted down the page and it can be remarkably hard to spot in casual reading. The tops of pages where the text in the right column is only a line or two is where it often happens. Last edited by mrmikel; 07-27-2014 at 08:00 AM. |
07-27-2014, 11:22 AM | #6 | ||||
Wizard
Posts: 2,303
Karma: 12126963
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
https://www.mobileread.com/forums/sho...d.php?t=243021 Overall, I would say it comes down to how much you value your own time.
I have conversions down to an average of ~8-15 hours to go from OCR -> completed EPUB (I tackle non-fiction economics books; different genres are probably faster/slower, and when you first start out, it will be much slower). Manually typing in everything, or working from much less accurate OCR, while "free" (as in, I didn't pay any money for tools), would cost you WAY more in man-hours. Quote:
And remember, pure character accuracy doesn't take into account a HECK of a lot of other formatting in a book as well:
A more expensive OCR program would typically handle these much better than the free OCR stuff. Also, the overall character accuracy depends on what kind of text you are converting. Cookbooks are probably going to have a lot of lists and fractions and images. I bet Finereader would do a much more accurate job at recognizing and accounting for these than a lot of the free OCR programs out there. If you are working from scanned older material, working from a crappy picture (let's say, you take it with your phone), or the scans are subpar (people who write/underline in the books, water damage, blotches, etc. etc.), accuracy goes WAY down. Archive.org scans are probably going to have much higher OCR error rates than if you were working from a crisp digital image from a newer book. Here too, the paid programs will probably handle crappier source material better than the free OCR solutions. Quote:
A lot of these free OCR programs will just export the OCR output. In Finereader, you can use the GUI. It highlights characters that it is "unsure" about. You can then easily look through and pay much closer attention to THOSE sections only. This saves a massive amount of time, since you don't have to waste much of your time looking at every word in the entire book, and you can focus on that 1-5% that is "unsure". You also get the dictionary support, so it underlines words that are spelled wrong. (Again, you can focus a lot more attention on these than if you had to closely scrutinize every word under the sun). You can also QUICKLY A/B compare with the source, and you can have a magnification set up. For example, here are two images just showing off the types of A/B compare. Magnification, or side-by-side: Quote:
The ABSOLUTE WORST is "newspaper" type material, where they have stories that get cut into pieces and "continue on Page C3". So a single page can have about two or three running stories on it, that connect together like a giant spaghetti monster. Last edited by Tex2002ans; 07-27-2014 at 11:57 AM. Reason: Added Images |
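The "only look closely at what the program is unsure about" idea can be roughly approximated with free tools too. This is only a sketch, not how Finereader does it. It assumes, as far as I know, that recent Tesseract versions can write a tab-separated report (the `tsv` output option) with one row per word and a `conf` column. The function name, the threshold of 80, and the sample rows are all made up for illustration.

```python
import csv
import io


def low_confidence_words(tsv_text, threshold=80.0):
    """Return (line_num, text, conf) for every recognized word whose
    confidence is below `threshold`. Rows with no text or a negative
    confidence (layout rows rather than words) are skipped."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    flagged = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = float(row.get("conf") or -1)
        if text and 0 <= conf < threshold:
            flagged.append((int(row["line_num"]), text, conf))
    return flagged


# MADE-UP sample in the column layout assumed above, not real OCR output.
HEADER = ("level page_num block_num par_num line_num word_num "
          "left top width height conf text").split()
ROWS = [
    "5 1 1 1 1 1 10 10 40 12 96 1/4".split(),
    "5 1 1 1 1 2 55 10 30 12 41 cup".split(),
    "5 1 1 1 1 3 90 10 50 12 93 sugar".split(),
]
SAMPLE_TSV = "\n".join("\t".join(cols) for cols in [HEADER] + ROWS)

for line_num, text, conf in low_confidence_words(SAMPLE_TSV):
    print(f"line {line_num}: check '{text}' (confidence {conf:.0f})")
```

A low confidence score is only a hint about where to look first. A confidently wrong "1 cup" would sail straight through, so this narrows the proofreading rather than replacing it.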
||||
07-27-2014, 11:30 AM | #7 | |
formatting student
Posts: 47
Karma: 38268
Join Date: Dec 2013
Location: South Arkansas, US
Device: several models of Kindle; Several Android tablets & 3 Android phones.
|
Quote:
I'm looking at working thru books like McGuffey's Readers, cookbooks, and so forth, that most of the rest of the people who are rejuvenating the Public Domain reading have passed over. Thanks! Kathy MamaDragon |
|
07-27-2014, 11:44 AM | #8 |
formatting student
Posts: 47
Karma: 38268
Join Date: Dec 2013
Location: South Arkansas, US
Device: several models of Kindle; Several Android tablets & 3 Android phones.
|
Don't you just LOVE cross-posting...?!
Tex2002ans- THANKS!!! I'll have to tough it out for now with something less expensive, until I can either find a way to make Finereader pay for itself (performing services for others), or my pocket money funds get a raise.... [in other words - Time I Got.. Funds... not so much] I'm sure that a LOT of your information will still teach me things I need to know. MrMikeL - yes, I'm still looking at word-by-word proofing, but that's better MOST of the time than typing it all in from scratch. Thanks All Kathy |
07-27-2014, 11:45 AM | #9 | |||
Wizard
Posts: 2,303
Karma: 12126963
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Those are probably the most common, there are a few more (XLS, CSV) that probably wouldn't be used in your typical book. Well when the same stuff gets said again and again (I mean, someone JUST posted this topic last week).... It sort of gets boring having to type out a lot of the same info when it was already covered about a thousand times. :P Quote:
And let's be serious.... proofing crappy OCR is boring stuff... proofing better OCR is much less boring. The quicker you finish, the more time you can spend actually READING the cookbooks (or cooking)! You can get a copy of Finereader much cheaper off Ebay or something similar: http://www.ebay.com/sch/i.html?_from...eader&_sacat=0 As I said, just hunt for 9 or 10, they are perfectly fine. No need for 12 (I would actually recommend AGAINST 12). Stick with 9/10/11. Quote:
Also, if you use Microsoft Word (2007+), Toxaris came up with this ePUB Tools addon which really speeds things up: https://www.mobileread.com/forums/sho...d.php?t=213372 Last edited by Tex2002ans; 07-27-2014 at 11:56 AM. |
|||
07-27-2014, 11:56 AM | #10 |
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
|
In forums, TMI is better than TLI!
|
07-27-2014, 09:04 PM | #11 | |
Connoisseur
Posts: 91
Karma: 10
Join Date: Feb 2014
Location: Long Island, NY
Device: Aura, N514KUBKKEP, 4.7.10.413
|
Quote:
|
|
07-28-2014, 02:15 AM | #12 |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
By checking. Some of them are handled correctly by the OCR, but not all. That is one of the reasons that I created my add-on to automate a lot of tasks to fix these (and a lot of other) OCR mistakes. In case of doubt, manual intervention is required.
|
07-28-2014, 01:00 PM | #13 | ||
Wizard
Posts: 2,303
Karma: 12126963
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Finereader does a pretty decent job at separating images from the text, and it is pretty dang good at figuring out tables. (Let me tell you, doing tables manually will make you want to kill yourself). Here is a list of a bunch of different OCR programs: https://en.wikipedia.org/wiki/Compar...ition_software There isn't really a "guide", just that in my experience, the free OCR tools (Tesseract, FreeOCR, etc. etc.) do not recognize a lot of that "complex" formatting as accurately as something like Finereader. And it is exactly as Toxaris stated: Quote:
Also, another disadvantage of the free stuff, you are most likely going to have to do A LOT of your own training. For example, here is the training manual for Tesseract: https://code.google.com/p/tesseract-...ningTesseract3 While the default training included with the program probably works perfectly fine for basic things like novels, and cleaner scans, it will probably require more training if the book you are dealing with has older/more obscure fonts, or when dealing with non-English languages. (Even a lot of "English" books have a lot of accented characters, and letters out of the usual A-Z subset). In Finereader, you are also paying for the massive amount of training that THEY have already done for you (on the millions and millions of documents they process). This again, will lead to more accurate results than otherwise. Remember, the more accurate the OCR is, the less time you have to spend actually cleaning up the wrong output. So with free, sure, it might cost you $0 initially, but then you spend many more hours double-checking/cleaning up the output. Edit: Actually, now that I reread u238110's post, he MAY have meant how I handle coding those things in actual EPUB. I explained Tables/Footnotes/Formulas/Figures/Images towards the bottom of this post (with links to the specific topics/real-life examples): https://www.mobileread.com/forums/sho...68&postcount=8 Headers/Footers can just be trashed. Finereader does a good job at recognizing them in the document, and just allows you to easily export without those included (again, this is an area where the free stuff might lack, and you would have to spend time manually removing). |
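For anyone who wants to see what the free route looks like before any training, here is a minimal sketch of running the Tesseract command-line program on one page image with its stock English data. It assumes `tesseract` is installed and on the PATH. The helper names and the file name `page-001.tif` are hypothetical. The basic call form `tesseract IMAGE OUTPUTBASE -l LANG` writes the recognized text to `OUTPUTBASE.txt`.

```python
import subprocess
from pathlib import Path


def tesseract_command(image, lang="eng"):
    """Build the command line for one page image.

    The output base is the image path without its extension, so
    'page-001.tif' produces 'page-001.txt' next to it.
    """
    out_base = str(Path(image).with_suffix(""))
    return ["tesseract", image, out_base, "-l", lang]


def run_ocr(image, lang="eng"):
    """Run Tesseract on one image and return the recognized text.

    ASSUMPTION: the 'tesseract' executable is installed and on PATH.
    """
    subprocess.run(tesseract_command(image, lang), check=True)
    return Path(image).with_suffix(".txt").read_text(encoding="utf-8")


# Show the command without running it (file name is hypothetical).
print(" ".join(tesseract_command("page-001.tif")))
```

Calling `run_ocr("page-001.tif")` would then return the raw text for that page. Expect it to need the same proofreading discussed above, and more of it on old fonts or poor scans.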
||
07-30-2014, 01:21 AM | #14 |
formatting student
Posts: 47
Karma: 38268
Join Date: Dec 2013
Location: South Arkansas, US
Device: several models of Kindle; Several Android tablets & 3 Android phones.
|
Thanks All, and especially Tex2002ans !!
I do appreciate you all taking the time to answer my query..... I THINK the answer was Yes, ABBYY Finereader, and possibly a few others, are able to OCR the page images within a PDF WITHOUT disassembling the PDF first. If the PDF needs to be disassembled first, I better get someone to teach me how... FineReader will definitely make it onto my Xmas Wish List... In the meantime, I'll work the books that others have already done the heavy lifting on. My area of concentration / preference is the PD books that a lot of the Homeschoolers use: McGuffey's, Ray's (if I can figure out how to make those math problems do what I want WHERE I want them to do it), Primary Source Documents for History; Pleasure / Literature Reading for the younger bunch, and so forth. Some of the story chapter books require minor updating, but a lot more are excellent as they are for vocabulary building, as well as "just for fun." While a lot of the Homeschool crowd are printing PDFs and binding at 5x8, usually it's because they don't have other options. I want to present them with another option -- most, if not all, of that year's books ready to load onto a reader. (When my kids were Homeschooling, they'd much rather have their books on the readers than a half-size notebook or home-bound edition.) Again, Thanks All Kathy MamaDragon |
07-30-2014, 07:44 AM | #15 |
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
|
I think you could break PDFs apart using the combination of Ghostscript and GSview. But finding a server that actually works to download GSview can be a challenge.
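Ghostscript can also do this from the command line without GSview. The sketch below builds a command that renders every page of a PDF to its own grayscale TIFF, which can then be fed to an OCR program. Note that this rasterizes each page rather than pulling out the embedded scan untouched. The function name, file names, and the 300 dpi default are my own choices, and the executable name is an assumption: it is usually `gs` on Linux/Mac and something like `gswin64c` on Windows.

```python
import subprocess


def ghostscript_command(pdf_path, out_pattern="page-%03d.tif",
                        dpi=300, gs_exe="gs"):
    """Build a Ghostscript command that writes one TIFF per PDF page.

    '%03d' in the output pattern is filled in by Ghostscript with the
    page number (page-001.tif, page-002.tif, ...).
    """
    return [
        gs_exe,
        "-dNOPAUSE",            # do not wait for a keypress between pages
        "-dBATCH",              # exit when the file is done
        "-dSAFER",              # restrict file access while processing
        "-sDEVICE=tiffgray",    # 8-bit grayscale TIFF output
        f"-r{dpi}",             # render resolution in dots per inch
        f"-sOutputFile={out_pattern}",
        pdf_path,
    ]


def split_pdf(pdf_path, **options):
    """Run the command. ASSUMPTION: Ghostscript is installed and on PATH."""
    subprocess.run(ghostscript_command(pdf_path, **options), check=True)


# Show the command without running it (file name is hypothetical).
print(" ".join(ghostscript_command("cookbook.pdf")))
```

300 dpi is a common starting point for OCR input, though small or faint print may do better at a higher setting at the cost of much larger files.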
|