Do I have to OCR?

Ceryta · 04-27-2011, 02:26 AM

I have a newbie question. I am going to start scanning my paperback books soon. I have about 3,000. I don't need to be able to edit the text once I scan it. I did a couple of sample scans and the pages are readable. I am reading on a netbook and not an ereader. So do I have to OCR each of the books? Can I just use the scanned pages as my final product? I have read that the files can be very large if you don't use OCR, but how big is big? The average number of pages for my paperback books is 350-400. I will be scanning everything in black/white. All the pages are just plain text. The only images will be the front and back covers. Thank you for any help.
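For a rough feel of "how big is big", here is a back-of-the-envelope sketch. It only computes the uncompressed size of 1-bit (black/white) scans. The page size (4.25 x 7 inches), the 300 dpi resolution and the compression ratio of 10 are all assumptions for illustration, not figures from this thread; real compressed sizes depend on the scanner software, the codec and the page content.

```python
def bitonal_page_bytes(width_in, height_in, dpi):
    """Uncompressed size in bytes of one 1-bit-per-pixel page scan."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Each pixel row is padded up to a whole number of bytes.
    row_bytes = (width_px + 7) // 8
    return row_bytes * height_px


def book_megabytes(pages, width_in, height_in, dpi, compression_ratio=1.0):
    """Estimated size of a whole book in MiB.

    compression_ratio is a placeholder you supply yourself
    (1.0 means no compression at all).
    """
    raw_bytes = pages * bitonal_page_bytes(width_in, height_in, dpi)
    return raw_bytes / compression_ratio / (1024 * 1024)


# Assumed values: 375 pages, 4.25 x 7 inch page, 300 dpi.
raw = book_megabytes(375, 4.25, 7, 300)
# The ratio of 10 is purely illustrative, not a measured figure.
guess = book_megabytes(375, 4.25, 7, 300, compression_ratio=10)
print(f"uncompressed: {raw:.1f} MiB, with assumed 10:1 compression: {guess:.1f} MiB")
```

Scanning a few sample pages and checking the actual file size will give a far more reliable number than any estimate like this.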

Iznogood · 04-27-2011, 05:31 AM

Size is one reason for OCRing your books. Reflowable text is another. If you decide in the future to read on a smaller screen, reflowing the text to shorter lines or bigger fonts could be helpful.

I was once in the same situation, and I scanned my books to jpeg images and generated a cbz-file of them. The method is quite simple: add your files to a .zip or .rar archive, thus compressing them to a smaller size and getting only one file per book. Rename .zip to .cbz. If you compressed to .rar, rename the extension to .cbr. Now you can read them using a Comic Book Reader program.
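The zip-and-rename step above can be sketched with Python's standard `zipfile` module. The function name `make_cbz` and the folder layout are hypothetical; the sketch assumes each book's pages sit in their own folder as .jpg/.jpeg files whose names sort in reading order (for example zero-padded like `page_001.jpg`).

```python
import zipfile
from pathlib import Path


def make_cbz(image_dir, cbz_path):
    """Pack every .jpg/.jpeg page in image_dir into a single .cbz archive.

    Returns the page names in the order they were added.
    """
    pages = sorted(
        p for p in Path(image_dir).iterdir()
        if p.suffix.lower() in {".jpg", ".jpeg"}
    )
    # A .cbz is an ordinary zip archive with a different extension.
    # ZIP_STORED adds the files as they are, without recompressing them.
    with zipfile.ZipFile(cbz_path, "w", compression=zipfile.ZIP_STORED) as archive:
        for page in pages:
            archive.write(page, arcname=page.name)
    return [p.name for p in pages]
```

Usage would look like `make_cbz("scans/my_book", "my_book.cbz")`, where both paths are placeholders for your own folders.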

Of course it is also possible to read your image files in an image viewer or something similar. It is also possible to convert to .pdf and read it there. PDF files need some computing power to process/load, and big PDFs are difficult to handle on mobile platforms, but perhaps it could be an option on a netbook?

If the need for OCRing arises in the future, it is possible to use the images as input to an OCR program and get reflowable text out of it. Whether this is necessary today or whether reading from images is "good enough" depends on your screen size, the storage on your netbook and of course on your reading preferences, and only you can answer those questions.

Jellby · 04-27-2011, 10:48 AM

Quote:

Originally Posted by norway1456

I scanned my books to jpeg images and generated a cbz-file of them. The method is quite simple: add your files to a .zip or .rar archive, thus compressing them to a smaller size

JPEG images are already compressed. Trying to compress them again is going to give very little gain, and could even result in larger files (the same happens if you try to compress MP3 files or DIVX movies).
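The effect is easy to see with a small experiment. This sketch does not use real JPEG files: pseudo-random bytes stand in for already-compressed data (an assumption made to keep the example self-contained), and a repeated sentence stands in for easily compressible data. Both are run through the same Deflate compression that zip archives use.

```python
import random
import zlib


def deflate_ratio(data):
    """Compressed size divided by original size (lower means better compression)."""
    return len(zlib.compress(data, 9)) / len(data)


# Highly repetitive data: compresses very well.
plain_text = b"All the pages are just plain text. " * 2000

# Pseudo-random bytes as a stand-in for data that is already compressed.
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(len(plain_text)))

print(f"repetitive text:    {deflate_ratio(plain_text):.3f}")
print(f"high-entropy bytes: {deflate_ratio(noise):.3f}")
```

A ratio close to (or slightly above) 1 for the second line means the archive is no smaller than its contents, which is what tends to happen when zipping JPEGs.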

Quote:

and getting only one file per book. [...] Now you can read them using a Comic Book Reader program.

These are valid and excellent reasons, however.

Ceryta · 04-27-2011, 11:06 AM

I was gonna scan to pdf. I have a lot of ebooks already in this format. It works fine when reading on my netbook. The sample scans I took also look good, very clear and readable. So would scanned pdf files be too large? If they are, can I make them smaller without OCR? Thanks

DDHarriman · 04-27-2011, 11:42 AM

Hello

My advice:

1 - scan in black and white and test OCR on it.
[Remember that most of the work and time goes into proofreading (and correcting) the OCR result, then redoing the formatting until you have a document that resembles the original and from which you could create a new book (in any format)];

2 - if your OCR results are good enough that you think you could one day OCR and proofread from these scans, consider the PDFs you are now making to be both your use files and your base files. Use them on your netbook;

3 - if not (or for the books where black and white scanning did not give you enough quality for OCR), scan in grey or color and/or increase the resolution (400 dpi or even 600 dpi) until you get good OCR results - these PDFs are now your base files. From these PDFs make black and white PDF files - these are now your use files; read them on your netbook;

4 - make backup copies of all your base files.

Conclusion:

a) you are making PDF files to read now;
b) you are putting aside (backing up) base files so that in the future, if you want to (or if OCR technology improves to the point of producing perfect results with almost no human intervention), you can OCR them without repeating the whole process.

Best regards,

DSpider · 05-04-2011, 05:28 AM

You don't have to OCR. But if you'd like to search, highlight, reference and so on, it would be ideal. Otherwise you could use Scan Tailor on the scans, pack them in a PDF and you'd be done. But OCR-ing usually results in a much higher quality output - and quality trumps quantity every time.

Pros:

- cleaner text (free from printing flaws)
- lower filesize
- faster rendering and page flipping on portable devices (which are usually slower)
- fully search-able
- highlighting text is possible
- dictionary look-up
- reflow-able text (ePUB, MOBI, etc.)
- body fonts can be replaced if the user wants to
- www and email links are click-able
- footnotes can be added to the end of the document instead of getting in your face
- in-document references (for instance you could simply click "See page 91")
- text-to-speech (for the visually impaired)

...and maybe more.

Cons:

- proof-reading takes time
- layout takes time
- vectorizing the cover takes time (optional)
- font matching takes time (again, optional) - that's if the font is even available. If not, you'd have to edit a similar font which would take even more time (at least until you get the hang of it)

Is it worth it? Oh yeah. Like I said, quality trumps quantity. Always. Especially if it's a good book, it's worth it. It's always more of a pleasure to read a book with smooth text than one with jagged, partial, half characters.

Think about it. Out of those 3000 books, which are the top, say, 30 you'd like to keep? The rest I would probably just archive with Scan Tailor (grayscale), keeping the correct layout, etc. Also, while black and white TIFFs can have a huge impact on filesize (especially in .djvu format), they could prove difficult to OCR in the future, as most OCR software has filters that were tweaked to work better with grayscale images. B&W TIFFs can sometimes remove details that would help the OCR differentiate "tl" from "d", for example.

srhamm · 05-06-2011, 09:30 AM

DSpider, you wrote: - text-to-speech (for the visually impaired)
For me, TTS was a major reason I got my Kindle. I'm not visually impaired, but maybe a little reading-lazy. I enjoy watching the screen change pages as it reads it to me in real time. Also, I hook up my Kindle to play over my car speakers while driving--a poor man's audio book of sorts.

DSpider · 05-07-2011, 11:03 AM

Well, you do wear glasses.

But yeah, "TTS" implies the book has been through active proof-reading and sometimes dictionary look-up of a few (occasional) words which may or may not be read correctly.
