Book Scanning tool chains

tomsem · 01-20-2023, 07:22 PM

I recently got a CZUR ET18 Pro overhead scanner:

https://www.amazon.com/gp/product/B07JMTPJ8S/

Mostly I wanted to be able to scan my favorite sheet music and music instructional material so I could use it on my iPad and Mac. I was mostly able to get decent enough results with the included scanning software. It can handle the largest page sizes of the music I have, 2 pages at a time. There's no need to OCR and make a fully navigable PDF out of it (I have an iPad app that can, to some degree, interpret sheet music and export MusicXML and MIDI).

A lot of what I've done would have been easier with external tools, and going forward with future music material, I'll be applying what I've learned. The available cropping and editing tools are rudimentary.

I have a few books lying around which are out of print or not available as ebooks, and I'd like to read them in digital form. This is promoted as a fast way to scan these, so I've been trying that out, and integrating Photoshop and Acrobat into the toolchain. I'm in the middle of finishing with the first one, about 400 pages long, plus about 48 photos mixed in.

The software takes unprocessed images, then applies rules according to the option you've specified: Single Page, Facing Pages, etc. Facing pages will try to find the center 'seam', remove page curl, and generate an image for each page. One scan takes about a second, and you can use a foot switch, a button on your desk, the scanning software, or Auto mode to trigger capture (it has a microphone, I think you might be able to use voice too).

Before moving to the next steps, it's prudent to review each page to make sure 1) you didn't skip pages and 2) you didn't get a bad capture. Some errors can be corrected without re-scanning (e.g. failure to find the 'seam', or you need to apply greyscale or color rules rather than B&W - it generates new images from the unprocessed ones). Then you can easily insert missing pages or replace the bad captures.

In this case the photos were turn of the 20th century and had poor exposure (or chemical aging of the source material), so the swaths of black reflected the downward pointing LED lights. For these situations there's another set of LEDs lower down that you can turn on before shooting; they illuminate from a different angle, without causing reflections in the image.

Even with the best technique there are significant deviations in the dimensions, positioning, and rotation of the page in the image. The cropping algorithm in the facing pages case is appropriately conservative, though it might be good if you could make it a little less so. So each individual page has to be repositioned and cropped; some need minor rotation adjustments.

For small jobs you can get by with the built in tools, but it's not productive for larger ones.

Hence Photoshop. There is a Script called Load Files into Stack... This places a sequence of images into a sequence of layers. Setting the target Canvas size lets me drag the images around so they're more uniformly positioned; guide lines help position common page elements like headers and margins, and adjust rotations to make things straighter. Having only one layer visible at a time lets me work through each page and get it ready for bundling in a PDF (the book I'm working on has a lot of paragraph styles, footnotes and hand-written drawings, so it would be more work to produce an ePub than just replicating the book).

When done, Export Layers to Files, and load into Acrobat and take another pass through all of the pages. It tries (with varying degrees of success) to separate images and text, and straighten text blocks. Blemishes can become objects that you can just delete.

Finally, generate page labels (roman, numeric, other styles can be mixed), and create page links for the TOC and index.

Ta Da! No sweat! (er, actually a lot of sweat)

Even without any change to the toolchain, I figure the next book project will take me maybe 25% of the time.

I'll probably create a macro to move through the list of layers and toggle visibility as it does so.

And since I can imagine writing a script that does 99% of the PDF page links I'd otherwise have to do manually, I'll be looking for one, or trying to write one.
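As a rough sketch of the bookkeeping such a script would need (not the script itself, and not tied to any PDF library): the core problem is turning a printed page label from the Contents or Index into a PDF page position. The TOC line shape, the function names, and the assumption that roman-numbered front matter comes first and arabic numbering then starts at 1 are all hypothetical choices made for illustration.

```python
import re

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}

# Assumed TOC line shape: "Title . . . . 17" or "Preface .... ix".
# A last word made only of roman-numeral letters (e.g. "mix") would be
# misread as a page label; a real script needs stricter checks.
TOC_LINE = re.compile(r"^(.*?)[\s.]+([ivxlcdm]+|\d+)$", re.IGNORECASE)


def roman_to_int(text):
    """Convert a roman numeral such as 'xiv' to 14."""
    total = 0
    largest_seen = 0
    for ch in reversed(text.lower()):
        value = ROMAN_VALUES[ch]
        if value < largest_seen:
            total -= value  # subtractive form, e.g. the 'i' in 'ix'
        else:
            total += value
            largest_seen = value
    return total


def label_to_index(label, front_matter_pages):
    """Map a printed page label to a 0-based PDF page index.

    Assumes the PDF starts with `front_matter_pages` pages labelled
    i, ii, iii, ... followed by body pages labelled 1, 2, 3, ...
    """
    label = label.strip()
    if label.isdigit():
        return front_matter_pages + int(label) - 1
    return roman_to_int(label) - 1


def parse_toc_line(line):
    """Split a TOC line into (title, page_label), or None if it has no label."""
    match = TOC_LINE.match(line.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    for line in ["Preface .... ix", "Chapter 1 . . . . 17"]:
        title, label = parse_toc_line(line)
        print(title, "->", label_to_index(label, front_matter_pages=12))
```

The resulting index is what a PDF library would need as the link destination; where the link rectangle goes on the Contents page would still have to come from the OCR text positions.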

And there has to be a better PDF tool than Acrobat.

It's very inconsistent about object identification. It sometimes takes a contiguous illustration and leaves parts of it on a page wide object so you can't freely re-position it. Sometimes the object consists of a union of all the elements on the page, lumping header/body/footer so you can't reposition those independently.

Every time I want to create a page from a JPG I have to change the default format from PDF.

When I Replace a page, there seems to be no way to trigger its object recognition so I can make adjustments, short of OCRing the entire document.

And I don't yet know how its OCR compares with other products, including ABBYY, which the scanning software includes. If the target is ePub (or even fixed layout ePub) then Acrobat need not apply.

I'm not quite as ready to throw Photoshop under the bus. Importing and exporting layers like this is pretty slow for some reason, but at least it works.

Looking ahead, I'm also planning a screen capture based tool chain, which should be more amenable to scripting.

Quoth · 01-21-2023, 07:22 AM

GIMP also does layers and can import PDF as a layer per page. Can reverse order export as mpng for ImageMagick.

I have an old SCSI scanner also the network scanner on my laser printer (both have sheet feed options), but I'd been wondering about either a setup for my 20+ Megapixel Canon DSLR or getting an A3 overhead scanner. Most of the scanners can be got working on Linux, but the included SW is usually useless. I have tesseract OCR installed on Linux.

The 20 year old scanner is slower than the Brother's scanner at full colour and resolution but seems better quality. Likely OCR work only needs mono and not the highest resolution?

Curve on pages on a book is an issue which some of the overhead scanners claim to fix, but I suspect that is Windows software, not anything in the scanner which is likely just a camera.

Quoth · 01-21-2023, 07:24 AM

Quote:

Originally Posted by tomsem

And there has to be a better PDF tool than Acrobat.

Not used it for years. Nor Photoshop, even before moving from Windows to Linux.

Turtle91 · 01-21-2023, 12:23 PM

The good folks over at DIY Book Scanner have been developing these types of scanners for a long time, although most of their designs tend to use a “v” shaped holder (the cradle) for the book and a “v” shaped glass to flatten the page (the platen), which minimizes warping of the page when it is opened against the resistance of the binding. They also usually have 2 cameras, each aligned directly facing the left or right page.

Both of those techniques minimize the software calculation required to correct image warpage and perspective shift. There were some enterprising programmers working on software that would automagically correct all that AND provide some OCR AND maybe even bake some cookies for you…. I haven’t been active over there in a while so I don’t know the current status of those projects…. They may have dropped the cookie feature

The last I heard they had a scanner that could scan/correct/ocr several hundred pages per hour (800 pph comes to mind). And it’s all free, except for the material and time investment.

Tex2002ans · 01-21-2023, 02:36 PM

Last month I described a lot of this book digitizing process in detail:

2022: "Scanning Books"

You pretty much have these basic steps:

1. Scanning / Taking the photos.
2. Normalizing / Cleaning up the images.
3. OCRing / Converting

Each of these has its own tools + enhancements you can do to make things better.

Like Turtle91 said, DIY Book Scanner is where you can learn a lot of info on the scanning side of things. (Like V-shaped plexiglass to press pages down will help you with much less dewarping in the 2nd stage!)

The better input you get in those initial stages helps, because that becomes the basis for ALL FUTURE stages.

(If your original images are crap/warped, this requires much more work in Stage 2 + Stage 3—much more time spent dewarping/correcting, OCR will be much less accurate, etc.)

- - -

Side Note: With sheet music, I'd suspect you REALLY want your papers straight, so that the bars will appear completely horizontal. (It'll be very easy for the dewarping algorithms to make those look wobbly.)

- - -

Quote:

Originally Posted by tomsem

And I don't yet know how OCR compares with other products, including ABBYY, which the scanning software includes.

Finereader is much better.

It can also detect images/text/tables + headers/footers, etc.

(No idea how it would deal with sheet music though. It would most likely get completely confused because of the complicated layouts.)

Quote:

Originally Posted by tomsem

If the target is ePub (or even fixed layout ePub) then Acrobat need not apply.

Music sheets like this will not be creatable as EPUB. It will have to, sadly, stay as PDF.

Quote:

Originally Posted by tomsem

I'm not quite as ready to throw Photoshop under the bus. Importing and exporting layers like this is pretty slow for some reason, but at least it works.

You can use whatever tools you want for whatever stages you want.

Some will bring more misery than others. :P

For my post-processing stage, I prefer using:

Scan Tailor Advanced

It:

will help you crop/align/resize pages
has built-in dewarping/despeckling
has multiple color/grayscale -> B&W algorithms
[...]

Most importantly, you can easily tweak variables on a per-page basis.

If one page has too many speckles? Raise the strength.

If one page had uneven lighting (or was slightly brighter than the others?)? Well, tweak the B&W strength.
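To make the "B&W strength" idea concrete, here is a minimal sketch of one common way to pick a grayscale -> B&W cutoff (Otsu's method) plus a per-page bias. This is not claimed to be what Scan Tailor Advanced does internally; the function names and the `bias` parameter are hypothetical, and the input is assumed to be a 256-bin histogram of 8-bit gray values.

```python
def otsu_threshold(histogram):
    """Return the gray level t that best separates dark from light pixels.

    `histogram` is a list of 256 pixel counts. Pixels with value <= t are
    treated as ink. Picks the t that maximizes between-class variance.
    """
    total = sum(histogram)
    weighted_total = sum(level * count for level, count in enumerate(histogram))
    best_t = 0
    best_variance = -1.0
    dark_count = 0
    dark_sum = 0
    for t, count in enumerate(histogram):
        dark_count += count
        dark_sum += t * count
        if dark_count == 0:
            continue  # no dark pixels yet, nothing to separate
        light_count = total - dark_count
        if light_count == 0:
            break  # everything is already in the dark class
        dark_mean = dark_sum / dark_count
        light_mean = (weighted_total - dark_sum) / light_count
        variance = dark_count * light_count * (dark_mean - light_mean) ** 2
        if variance > best_variance:
            best_variance = variance
            best_t = t
    return best_t


def binarize(gray_rows, threshold, bias=0):
    """Turn rows of 8-bit gray values into rows of 0 (ink) / 255 (paper).

    `bias` is the per-page tweak: positive values keep more pixels as ink
    (useful for a faint page), negative values keep fewer.
    """
    cutoff = threshold + bias
    return [[0 if value <= cutoff else 255 for value in row] for row in gray_rows]


if __name__ == "__main__":
    page = [[30, 200, 200], [200, 30, 200]]
    hist = [0] * 256
    for row in page:
        for value in row:
            hist[value] += 1
    print(binarize(page, otsu_threshold(hist)))
```

A single global cutoff like this is exactly what struggles with uneven lighting, which is why being able to nudge it page by page matters.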

- - -

And, if needed, you can always then toss it into Photoshop afterwards and do whatever extra refinements there.

Quote:

Originally Posted by tomsem

Looking ahead, I'm also planning a screen capture based tool chain, which should be more amenable to scripting.

Perhaps, you can automate some of these steps/stages, but when reality hits, a lot of these pages will require manual tweaks + elbow grease.

Anyway, I'm looking forward to hearing more from you. Always good to learn more about people's image-cleaning routines.

Quoth · 01-21-2023, 02:38 PM

Actually this is the similar version I bookmarked
https://www.amazon.de/-/en/dp/B09NLMFQGN but I didn't buy because I wondered if with suitable lights & brackets the Canon DSLR I have would do as well. Probably has better lenses. A page turner could operate the remote "shutter" feature.

Amazon De does English and also my local currency.

I've thought the dual camera with V-holder sounded better. Or a camera that rocks once between page turns.

This one claims to also support Linux (directly as I think all of them sort of work on Linux)
https://www.amazon.de/-/en/ET24-dp-B..._title_ce?th=1

More expensive and still not V-holder.

tomsem · 01-21-2023, 03:06 PM

Re sheet music, I haven't had any issues with wavy staff lines, even when there's significant page curl. The software does a very good job of uncurling (uses laser lines to compute geometry). And yes this is all PDF end product (straight out of the CZUR software). For some sheet music books, I'll probably want to go back in and add TOC page links with Acrobat (or whatever I wind up replacing it with) and push the button to reduce file size.

There are at least a couple of GitHub projects that purport to straighten things, not sure how well they work but I've been meaning to check them out.

Fujitsu still makes an overhead scanner for book scanning, and does some sort of straightening. But costs a bit more, and CZUR seems to have more happy customer reviews on average.

I've thought about making a book cradle to help allow the spine to be flatter and help stabilize as I press down to hold pages (there are yellow 'finger cots' to help with this, software removes them from image), but so far I have a few 'chocks' that I put under there and reconfigure as I move through the book. And it is quite fast enough for me.

tomsem · 01-25-2023, 05:27 PM

The ABBYY OCR that comes with CZUR is much better than Acrobat's (or else I cannot figure out how to tell Acrobat to stop detecting Chinese text inline with the specified English). But at least with default options the PDF it created had pages of varying sizes even though the images all had the same dimensions. It doesn't have that issue when creating PDF w/o OCR. Hopefully there's some tweak I can discover, or it's something they can fix with an update.

Removing page curl seems to require some way of determining 3D. The Fujitsu scanner has stereo 'vision', and CZUR has lasers that draw lines across the material and they use that to determine the curl.

This apparently is beyond scope for DIY, which is why the effort to flatten pages is necessary.

I can't find any open source code that can take raw images of books with page curl and flatten them. LIDAR (e.g. iPhone) probably isn't of sufficient resolution to help much with this. Probably you could get somewhere where it is mostly horizontal text, with some CV library to extract the curves, and applying an appropriate transform to straighten them.
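A toy version of that last idea, to show the shape of it: fit a curve to one text baseline, then slide each pixel column vertically so the baseline becomes flat. This is a sketch under heavy assumptions. The baseline points are taken as given (finding them is the hard CV part), a single quadratic is assumed to describe the curl, and only vertical shifts are applied, so the horizontal squeeze near the spine is ignored. All names are made up for illustration.

```python
def fit_quadratic(points):
    """Least-squares fit of y = a*x^2 + b*x + c to (x, y) points."""
    # Power sums for the 3x3 normal equations.
    s = [sum(x ** k for x, _ in points) for k in range(5)]
    t = [sum(y * x ** k for x, y in points) for k in range(3)]
    # Augmented matrix, unknowns ordered (c, b, a).
    m = [
        [s[0], s[1], s[2], t[0]],
        [s[1], s[2], s[3], t[1]],
        [s[2], s[3], s[4], t[2]],
    ]
    n = 3
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    # Back substitution.
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        rest = sum(m[r][k] * coeffs[k] for k in range(r + 1, n))
        coeffs[r] = (m[r][n] - rest) / m[r][r]
    c, b, a = coeffs
    return a, b, c


def column_shifts(a, b, c, width):
    """Per-column vertical offset (in pixels) of the fitted curve above its top-most point."""
    ys = [a * x * x + b * x + c for x in range(width)]
    reference = min(ys)
    return [round(y - reference) for y in ys]


def straighten(image, shifts, background=0):
    """Move each column up by its shift; `image` is a list of rows of pixels."""
    height = len(image)
    width = len(image[0])
    out = [[background] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            src = y + shifts[x]
            if 0 <= src < height:
                out[y][x] = image[src][x]
    return out


if __name__ == "__main__":
    baseline = [(x, (x - 4) ** 2 + 2) for x in range(9)]  # synthetic curled line
    a, b, c = fit_quadratic(baseline)
    print(column_shifts(a, b, c, 9))
```

On real scans the x values run into the thousands, so the fit would be better conditioned after rescaling x, and each text line (or the page edge) would likely need its own curve.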

Tex2002ans · 01-25-2023, 07:16 PM

Quote:

Originally Posted by tomsem

The ABBYY OCR that comes with CZUR is much better than Acrobat's. [...] But at least with default options the PDF it created had pages of varying sizes even though the images all had the same dimensions.

Are you talking varying page sizes in Finereader? Or in Adobe? Or what?

This is partially why I recommend Scan Tailor Advanced as a preprocessing step.

Scan Tailor Advanced will take care of normalizing all page sizes, etc.

In Finereader, Cropping to page sizes "exists", but it's clunky.

And if your page is SHORT by a little bit, it's not easy to add the required whitespace to make all images the same size.

This is why it's always helpful to split things into intermediate stages... not necessarily trusting an "all-in-one, just let me press the button" tool/program.
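The "add the required whitespace" step is simple arithmetic if you do it yourself as an intermediate stage. A minimal sketch, assuming you already know each page image's pixel size and want every page centered on a canvas as large as the biggest one (the function names and the centering choice are assumptions, not what any of these tools do):

```python
def uniform_canvas(sizes):
    """Smallest (width, height) that fits every page in `sizes`."""
    width = max(w for w, _ in sizes)
    height = max(h for _, h in sizes)
    return width, height


def padding_for(size, canvas):
    """Whitespace to add as (left, top, right, bottom) to center a page on the canvas.

    When the extra space is odd, the spare pixel goes to the right/bottom.
    """
    page_w, page_h = size
    canvas_w, canvas_h = canvas
    extra_w = canvas_w - page_w
    extra_h = canvas_h - page_h
    left = extra_w // 2
    top = extra_h // 2
    return left, top, extra_w - left, extra_h - top


if __name__ == "__main__":
    pages = [(2480, 3508), (2480, 3490), (2470, 3508)]
    canvas = uniform_canvas(pages)
    for page in pages:
        print(page, "->", padding_for(page, canvas))
```

The four numbers map directly onto a border/extent operation in whatever image tool does the actual padding.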

Quote:

Originally Posted by tomsem

Removing page curl seems to require some way of determining 3D. The Fujitsu scanner has stereo 'vision', and CZUR has lasers that draw lines across the material and they use that to determine the curl.

I guess if you go high-tech/advanced, lasers would help in the dewarping calculations.

Scan Tailor Advanced just works based on the curve of the edges of the page:

https://github.com/4lex4/scantailor-...inal-dewarping

It works quite well for most of what I tested on.

I guess the laser would do it more accurately + more automatically, but the image-based fixes work well for most books I've processed.

(Plus, I work from already-scanned/-photographed stuff. I'm not the one actually scanning things in.)

Quote:

Originally Posted by tomsem

This apparently is beyond scope for DIY, which is why the effort to flatten pages is necessary.

The flatter the better. Even with all those newfangled lasers.

Getting it right in the image itself will ALWAYS be better than relying on mathematical dewarping!

tomsem · 02-03-2023, 03:50 PM

Thanks for the ScanTailor tip, unfortunately it doesn't run on Mac.

Yes, CZUR has ABBYY Finereader for OCR. I figured out how to keep it from morphing the pages, and it's working well.

Making some progress on a script to add page links to Contents and Index entries.

taddymack · 02-04-2023, 08:44 AM

Quote:

Originally Posted by tomsem

Thanks for the ScanTailor tip, unfortunately it doesn't run on Mac.

Install macports (https://www.macports.org/install.php).
Then
sudo port selfupdate
and
sudo port install scantailor

tomsem · 02-04-2023, 03:02 PM

Quote:

Originally Posted by taddymack

Install macports (https://www.macports.org/install.php).
Then
sudo port selfupdate
and
sudo port install scantailor

Thanks, maybe some other time. If it were in brew, I would do. But not going to install MacPorts just for this.

j.p.s · 02-04-2023, 03:21 PM

Quote:

Originally Posted by tomsem

Thanks, maybe some other time. If it were in brew, I would do. But not going to install MacPorts just for this.

https://github.com/yb85/scantailor-advanced-osx

tomsem · 02-06-2023, 02:00 PM

Quote:

Originally Posted by j.p.s

https://github.com/yb85/scantailor-advanced-osx

Thanks!

rpalyvoda · 12-02-2023, 09:13 PM

It is probably too late, but for sheet music you should use MuseScore. It can import pdf files and turn them into musicxml (the equivalent of epub for music).
