04-27-2011, 02:26 AM | #1 |
Junior Member
Posts: 5
Karma: 10
Join Date: Apr 2011
Device: netbook
|
Do I have to OCR?
I have a newbie question. I am going to start scanning my paperback books soon. I have about 3,000. I don't need to be able to edit the text once I scan it. I did a couple of sample scans and the pages are readable. I am reading on a netbook and not an ereader. So do I have to OCR each of the books? Can I just use the scanned pages as my final product? I have read that the files can be very large if you don't use OCR, but how big is big? The average number of pages for my paperback books is 350-400. I will be scanning everything in black/white. All the pages are just plain text. The only images will be the front and back covers. Thank you for any help.
|
04-27-2011, 05:31 AM | #2 |
Guru
Posts: 932
Karma: 15752887
Join Date: Mar 2011
Location: Norway
Device: Ipad, kindle paperwhite
|
Size is one reason for OCRing your books. Reflowable text is another reason. If you in the future decide to read on a smaller screen, reflowing the text to shorter lines or bigger fonts could be helpful.
I was once in the same situation, and I scanned my books to JPEG images and generated a .cbz file of them. The method is quite simple: add your files to a .zip or .rar archive, thus compressing them to a smaller size and getting only one file per book. Rename .zip to .cbz. If you compressed to .rar, rename the ending to .cbr. Now you can read them using a comic book reader program.
Of course it is also possible to read your image files in an image viewer or similar. It is also possible to convert to PDF and read them that way. PDF files need some computing power to process and load, and big PDFs are difficult to handle on mobile platforms, but perhaps that could be an option on a netbook?
If the need for OCR arises in the future, you can use the images as input to an OCR program and get reflowable text out of them.
Whether this is necessary today or whether reading from images is "good enough" depends on your screen size, the storage on your netbook and of course your reading preferences, and only you can answer those questions.
Last edited by Iznogood; 04-27-2011 at 05:49 AM. |
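The zip-and-rename method described above can also be scripted. Here is a minimal sketch in Python, assuming the scanned pages of one book sit in a single folder as .jpg files whose names sort in reading order. The function name `make_cbz` and the folder layout are made up for illustration.

```python
import zipfile
from pathlib import Path


def make_cbz(page_dir, cbz_path):
    """Pack every .jpg in page_dir into one .cbz archive (a renamed zip).

    Hypothetical helper: assumes page file names sort in reading order,
    e.g. 0001.jpg, 0002.jpg, ...
    """
    pages = sorted(p for p in Path(page_dir).iterdir() if p.suffix.lower() == ".jpg")
    # Deflate is the usual zip compression; already-compressed JPEGs
    # will shrink only slightly, but you still get one file per book.
    with zipfile.ZipFile(cbz_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for page in pages:
            archive.write(page, arcname=page.name)
    return len(pages)
```

Calling `make_cbz("scans/my_book", "my_book.cbz")` (both paths are placeholders) would produce a single file that a comic book reader can open.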
|
04-27-2011, 10:48 AM | #3 | ||
frumious Bandersnatch
Posts: 7,534
Karma: 19000001
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
Quote:
Quote:
|
||
04-27-2011, 11:06 AM | #4 |
Junior Member
Posts: 5
Karma: 10
Join Date: Apr 2011
Device: netbook
|
I was gonna scan to PDF. I have a lot of ebooks already in this format. It works fine when reading on my netbook. The sample scans I took also look good, very clear and readable. So would scanned PDF files be too large? If so, can I make them smaller without OCR? Thanks
|
04-27-2011, 11:42 AM | #5 |
Guru
Posts: 860
Karma: 4380
Join Date: Feb 2008
Location: Almada, Portugal
Device: Cybook Gen3, Sony PRS 505, Kindle DXG and Samsung Galaxy Note
|
Hello
My advice:
1 - Scan in black and white and test OCR on the result. [Remember that most of the work and time goes into proofreading (and correcting) the OCR result, then redoing all the formatting until you have a document that resembles the original, from which you could create a new book (in any format).]
2 - If the OCR results are good enough that you could do OCR and proofreading from these scans one day, treat the PDFs you are making now as both your use files and your base files. Use them on your netbook.
3 - If not (or for the books where black and white scanning did not give good enough quality for OCR), scan in grey or color and/or raise the resolution (400 dpi or even 600 dpi) until you get good OCR results. These PDFs are now your base files. From them, make black and white PDF files. These are now your use files; read them on your netbook.
4 - Make backup copies of all your base files.
Conclusion: a) you are making PDF files to read now; b) you are setting aside (backing up) base files so that in the future, if you want to (or if OCR technology improves to the point of producing near-perfect results with almost no human intervention), you can OCR them without repeating the whole scanning process.
Best regards, |
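To get a feel for how scan mode and resolution drive file size, here is a rough back-of-the-envelope sketch in Python. It computes the uncompressed bitmap size only. The page dimensions (4.25 x 7 inches) are an assumed paperback size, and real PDFs will be smaller, often by a lot, because scanners and PDF tools compress the page images.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of one scanned page in bytes."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


def raw_book_megabytes(pages, dpi, bits_per_pixel, width_in=4.25, height_in=7.0):
    """Uncompressed size of a whole book in MB (1 MB = 1,000,000 bytes).

    The default page size is an assumption, not a measured value.
    """
    return pages * raw_page_bytes(width_in, height_in, dpi, bits_per_pixel) / 1_000_000


# 1 bit = black and white, 8 bits = grayscale, 24 bits = color
for label, bits in [("b/w", 1), ("gray", 8), ("color", 24)]:
    for dpi in (300, 400, 600):
        size = raw_book_megabytes(400, dpi, bits)
        print(f"{label:5s} {dpi} dpi: {size:,.0f} MB uncompressed")
```

Doubling the resolution quadruples the raw size, and grayscale is eight times the size of black and white at the same resolution, which is one reason to keep high-quality base files separate from smaller use files.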
|
05-04-2011, 05:28 AM | #6 |
Evangelist
Posts: 450
Karma: 343115
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
|
You don't have to OCR. But if you'd like to search, highlight, reference and so on, it would be ideal. Otherwise you could use Scan Tailor on the scans, pack them in a PDF and you'd be done. But OCR-ing usually results in a much higher quality output - and quality trumps quantity every time.
Pros:
- cleaner text (free from printing flaws)
- lower filesize
- faster rendering and page flipping on portable devices (which are usually slower)
- fully searchable
- highlighting text is possible
- dictionary look-up
- reflowable text (ePUB, MOBI, etc.)
- body fonts can be replaced if the user wants
- www and email links are clickable
- footnotes can be moved to the end of the document instead of getting in your face
- in-document references (for instance you could simply click "See page 91")
- text-to-speech (for the visually impaired)
...and maybe more.
Cons:
- proof-reading takes time
- layout takes time
- vectorizing the cover takes time (optional)
- font matching takes time (again, optional), and that's if the font is even available. If not, you'd have to edit a similar font, which would take even more time (at least until you get the hang of it)
Is it worth it? Oh yeah. Like I said, quality trumps quantity. Always. Especially if it's a good book, it's worth it. It's always more pleasant to read a book with smooth text than one with jagged, partial, half characters.
Think about it. Out of those 3000 books, which are the top, say, 30 you'd like to keep? The rest I would probably just archive with Scan Tailor (grayscale), keeping the correct layout, etc.
Also, while black and white TIFFs can greatly reduce filesize (especially in .djvu format), they could prove difficult to OCR in the future, as most OCR software has filters that were tweaked to work better with grayscale images. B&W TIFFs can sometimes remove details that would help OCR differentiate "tl" from a "d", for example.
Last edited by DSpider; 05-04-2011 at 05:45 AM. |
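The point above about black and white scans losing detail can be illustrated with a toy example. The sketch below thresholds one row of made-up grayscale pixel values (0 = black, 255 = white) running across the strokes of "tl". The values and thresholds are invented for illustration and do not come from any real scanner or OCR program.

```python
def binarize(row, threshold):
    """Turn grayscale values into 1-bit pixels: '#' = ink, '.' = paper."""
    return "".join("#" if value < threshold else "." for value in row)


# Made-up values: two dark strokes separated by a narrow, slightly
# smudged gap (140) that is lighter than ink but darker than clean paper.
row = [250, 40, 35, 140, 38, 42, 250]

print(binarize(row, 128))  # gap survives: two separate strokes
print(binarize(row, 160))  # gap turns to ink: strokes merge into one blob
```

A grayscale scan keeps the 140 and leaves that decision to the OCR software, while a black and white scan bakes in whatever threshold was used at scan time.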
05-06-2011, 09:30 AM | #7 |
MAPC grad student
Posts: 3
Karma: 10
Join Date: Apr 2011
Location: Georgia, USA
Device: Kindle
|
DSpider, you wrote: - text-to-speech (for the visually impaired)
For me, TTS was a major reason I got my Kindle. I'm not visually impaired, but maybe a little reading-lazy. I enjoy watching the screen change pages as it reads to me in real time. Also, I hook up my Kindle to play over my car speakers while driving--a poor man's audio book of sorts. |
05-07-2011, 11:03 AM | #8 |
Evangelist
Posts: 450
Karma: 343115
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
|
Well, you do wear glasses.
But yeah, "TTS" implies the book has been through active proof-reading and sometimes dictionary look-up of a few (occasional) words which may or may not be read correctly. |
|