Old 11-23-2012, 07:06 AM   #1
neuvivlio
Member
Posts: 15
Karma: 10
Join Date: Nov 2012
Device: none
how to convert a scanned page from a book (looks like photo of page) to clean text?

ok, fairly simple question - i have a book in pdf format, that is scanned with the page and all (looks a bit like a photo of the book almost).
this of course makes for awkward reading, and obviously awkward printing.

how could i take this roughly scanned book, and convert the text into nice clean, legible text?

is it an easy process? several steps involved, great patience, etc? someone fill me in

some of my old books are quite treasured and i'd love to see them get a bit of a second wind by being digitized, even if it does take some work on my part

thank you

Last edited by neuvivlio; 09-20-2019 at 01:40 AM.
Old 11-23-2012, 07:11 AM   #2
HarryT
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
You need a decent OCR program. ABBYY FineReader is one of the best.
Old 11-23-2012, 07:12 AM   #3
neuvivlio
Member
Posts: 15
Karma: 10
Join Date: Nov 2012
Device: none
which i have... have you done this very thing Harry?
Old 11-23-2012, 07:14 AM   #4
HarryT
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
Sure. Lots of times. Not for a book, but for all sorts of other documents (PDFs of journal articles, etc). OCR isn't perfect - you'll still need to proof-read the book to get it perfect - but Abbyy does an excellent job.
Old 11-23-2012, 07:20 AM   #5
neuvivlio
Member
Posts: 15
Karma: 10
Join Date: Nov 2012
Device: none
hmm, well this particular book (as per first photo) is an old anthology of horror stories about 300 pages hehe..
i have the book scanned in, but in order to read it comfortably i'd need to convert it into text somewhat like what we here are typing in.

are there any guides that you know of for this? i guess it is a long drawn out process
Old 11-23-2012, 07:22 AM   #6
HarryT
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
No, it's easy. Scan it as a PDF document (i.e. a document where each page of the PDF is a scanned image). Abbyy will do OCR using an image PDF as the source.
Old 11-23-2012, 07:23 AM   #7
HarryT
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
I'm moving this thread to the "Workshop" forum, which is the best place for it.
Old 11-23-2012, 07:58 AM   #8
neuvivlio
Member
Posts: 15
Karma: 10
Join Date: Nov 2012
Device: none
Thanks Harry - I imported the book in Abbyy as 'image/file -> pdf'
it's analyzing now, or as you guys say, analysing. I guess one good thing to do would be to keep the mode in black & white for this type of operation.

I suppose that creating a table of contents is a manual job? I can see no way that it could be otherwise.
Old 11-23-2012, 11:19 AM   #9
AJ Starr
Guru
Posts: 815
Karma: 1029784
Join Date: May 2008
Location: Nebraska, USA
Device: PEZ, Color Libre, 2@Sony T1, Onyx i62HD
If ABBYY FineReader doesn't work, you might check out word processing programs.

My WordPerfect will open a PDF and convert it to text. However, I then have to use a roundabout method to convert it to e-reader format.

AJ
Old 11-23-2012, 03:48 PM   #10
GMcG
Writer
Posts: 101
Karma: 590630
Join Date: Mar 2011
Location: Munich, Germany
Device: none
@neuvivlio

If the book is already scanned and you have a pdf file, then why can't you open it in Acrobat Reader and save it as txt?
(File --> save as txt)?

George
Old 11-23-2012, 04:22 PM   #11
DSpider
Evangelist
Posts: 450
Karma: 343115
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
Quote:
Originally Posted by GMcG
@neuvivlio

If the book is already scanned and you have a pdf file, then why can't you open it in Acrobat Reader and save it as txt?
(File --> save as txt)?

George
Because the pages are basically JPG images. While some scanners may apply some half-assed OCR underneath those images ("positional OCR"), it's far inferior to ABBYY FineReader. Adobe Acrobat can OCR it as well, but it has a very poor engine backing it up.

Also, saving it as plain text is just awful for e-books, because there's absolutely no formatting at all (italics, bolds, chapter titles, etc). Italics are the soul of a book, and it's what makes the reading experience enjoyable - especially if used right. Trying to manually spot them in the scans, and then manually re-add them is pure madness. You're bound to miss a few, unless you spend a SIGNIFICANT amount of mental effort and you go over them at least twice.

Last edited by DSpider; 11-23-2012 at 05:01 PM.
Old 11-23-2012, 04:55 PM   #12
DSpider
Evangelist
Posts: 450
Karma: 343115
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
My advice to the original poster of this thread is to watch a few tutorials on how to use Word, Acrobat, maybe InDesign and Illustrator, as well. The "Essential" libraries from lynda.com are excellent and I highly recommend them. Of course, there are also open-source alternatives (Adobe software is very expensive), such as LibreOffice, Inkscape, Scribus, GIMP, etc., which can do the job very well. I'm a fan of open-source software (Arch Linux is my primary OS), but I have to admit that Adobe has the flagship software for this industry.


My current workflow is to scan text in grayscale, at 300 dpi, JPG (~90% quality setting), and images (like the covers, pictures or other graphics) in colour, at 600 dpi, TIFF.
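
As a rough check on what those settings produce, the arithmetic can be sketched in Python. The 6 x 9 inch page size is only an assumption for illustration (it is not from this thread), and the sizes are for uncompressed 8-bit-per-channel data, so real JPG and TIFF files will differ.

```python
def scan_pixels(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page scanned at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def raw_size_bytes(width_in: float, height_in: float, dpi: int, channels: int) -> int:
    """Uncompressed size at 8 bits per channel (1 = grayscale, 3 = colour)."""
    width_px, height_px = scan_pixels(width_in, height_in, dpi)
    return width_px * height_px * channels


if __name__ == "__main__":
    # Hypothetical 6 x 9 inch page; substitute the real page size.
    print(scan_pixels(6, 9, 300), raw_size_bytes(6, 9, 300, channels=1))  # text, grayscale
    print(scan_pixels(6, 9, 600), raw_size_bytes(6, 9, 600, channels=3))  # graphics, colour
```

Doubling the resolution quadruples the pixel count, which is one reason to reserve 600 dpi colour for the covers and pictures.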

First I start with the graphics, using Photoshop to get the most out of them, and vectorize what I can by manually tracing them in Illustrator (again, the "Essential" training library from lynda.com should be enough). Then I run the scans through ABBYY FineReader, proofread the result once, export it as RTF, and run a custom macro in Word 2010 SP1 that keeps only the bolds, italics, subscripts and superscripts, and inserts the footnotes as in-line text (separated with tags, so I know where to place them later). This macro outputs a squeaky clean RTF, which I import into InDesign CS5.5, where I start redoing the layout, based on styles. Here I sometimes use Scan Tailor, but just for 1:1 comparison when redoing the layout. Usually I batch rename the images from Scan Tailor to match the page number, so that it's easier to go back and forth.
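
The batch rename mentioned above could be done with a small script along these lines. This is only a sketch, not the tool actually used in this workflow: the function name, the `page_0001.tif` naming scheme and the `*.tif` pattern are all assumptions, and it presumes that sorting the files by name gives the correct page order.

```python
from pathlib import Path


def rename_to_page_numbers(folder: Path, first_page: int = 1,
                           pattern: str = "*.tif") -> list[Path]:
    """Rename images, in sorted filename order, to page_0001.tif, page_0002.tif, ...

    first_page is the printed page number of the first image (an assumption:
    every following image is the next consecutive page).
    """
    sources = sorted(folder.glob(pattern))

    # Pass 1: move everything to temporary names, so that a final name can
    # never collide with a file that has not been renamed yet.
    temps = []
    for index, src in enumerate(sources):
        tmp = src.with_name(f".renaming_{index:04d}{src.suffix}")
        src.rename(tmp)
        temps.append(tmp)

    # Pass 2: assign the final page-numbered names.
    renamed = []
    for offset, tmp in enumerate(temps):
        dst = folder / f"page_{first_page + offset:04d}{tmp.suffix}"
        tmp.rename(dst)
        renamed.append(dst)
    return renamed
```

It is worth trying this on a copy of the folder first, since unnumbered front matter or skipped pages will throw the numbering off.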

And finally, I proofread the final e-book again, but this time on my e-reader, highlighting the parts that I may have missed or that just don't look right. As you can imagine, the quality is very high, but only after putting a lot of time into it. An e-book could take up to a month.

Good luck. Many people give up after the third book or so.
Old 11-23-2012, 07:04 PM   #13
neuvivlio
Member
Posts: 15
Karma: 10
Join Date: Nov 2012
Device: none
thanks for the thoughts and comments gents -

abbyy finereader did a good job of recognizing the text in the original scan of the book.. one problem though: on pages that were intentionally left blank in the book (chapter separators), abbyy tried to read the other side of those pages, and came out with gibberish for them.

once the scanning was done and i had the page by page preview, i looked for some sort of option like "save as blank page" or similar, but saw nothing like this. is this more the realm of a pdf editor?

if i could blank out those pages which were originally just blank pages to begin with i'd have a pretty good pdf made from the original scanned book, abbyy seemed to've picked up everything properly, the original scan was quite good.

thanks!
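
One possible way to find such pages, assuming the recognized text can be exported page by page (nothing here is an ABBYY feature), is a crude heuristic that flags pages whose text is very short or mostly not word-like. The function names and thresholds below are hypothetical and would need tuning; short genuine pages such as a title or dedication page will also be flagged, so the result is a list to review by hand, not to delete automatically.

```python
import re

# A "word-like" token: letters, optionally joined by apostrophes or hyphens.
WORD = re.compile(r"[A-Za-z]+(?:['’-][A-Za-z]+)*")
PUNCTUATION = ".,;:!?\"'()[]-"


def looks_like_noise(text: str, min_chars: int = 40,
                     min_word_ratio: float = 0.6) -> bool:
    """Heuristic: True if a page's OCR text is very short or mostly non-words."""
    if len(text.strip()) < min_chars:
        return True
    tokens = text.split()
    wordlike = sum(1 for t in tokens if WORD.fullmatch(t.strip(PUNCTUATION)))
    return wordlike / len(tokens) < min_word_ratio


def flag_noise_pages(pages: list[str]) -> list[int]:
    """Return 1-based page numbers whose text looks like noise."""
    return [n for n, text in enumerate(pages, start=1) if looks_like_noise(text)]
```

The flagged page numbers can then be checked against the scan and the stray text removed from those pages in whatever editor is at hand.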
Old 11-23-2012, 07:19 PM   #14
neuvivlio
Member
Posts: 15
Karma: 10
Join Date: Nov 2012
Device: none
i saved the scan as an rtf file through abbyy and was surprised to see an almost perfect table of contents when it opened in microsoft word, how did it do this? and why did it not when saved as a pdf?
much to learn, i guess
Old 11-23-2012, 09:10 PM   #15
neuvivlio
Member
Posts: 15
Karma: 10
Join Date: Nov 2012
Device: none
ok, so, i ran the original scans through abbyy fine. deskewed all pages, then saved the pdf.

is there anything else i should run the pdf through to optimize the text, or any other tips on optimizing text? i know there is an optimizer in acrobat, but i don't know if it would be useful in this instance

thanks
All times are GMT -4.