07-18-2023, 03:31 PM | #1 |
Connoisseur
Posts: 66
Karma: 10
Join Date: Jul 2023
Device: None
|
From print to ePub - how I did it.
Greetings, I have a knack for posting in the wrong forum so I hope I got this one right. I recently made my first eBook using Sigil. The eBook was based on an HTML version of a book. Getting to the HTML version was time-consuming, but I could not think of another way. If I had OCR software I could have eliminated some of the steps.
1.) Scan all the pages of the book to images
2.) Save all the page images to a PDF
3.) Open the PDF and run the OCR tool in Acrobat 6
4.) Convert the PDFs to text using PDF24
5.) Run a script on the text file which added <p> tags to the text blocks
6.) Copy the text blocks into an HTML document.
I know, I know - this is not an HTML forum. After I had the HTML ready I then copied and pasted each chapter into a Sigil document. I then added a few scanned images to the Sigil project. The results are here: https://www.EpicRoadTrips.us/epub/
Questions: What would make this process simpler and more efficient?
Thanks, WV-Mike |
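For readers wondering what the script in step 5 might look like: the original script is not shown in the thread, so the following is only a minimal sketch. It assumes text blocks are separated by blank lines, and the function name wrap_paragraphs is made up for illustration.

```python
import html
import re


def wrap_paragraphs(text: str) -> str:
    """Wrap each blank-line-separated block of text in <p>...</p> tags."""
    # Assumption: text blocks are separated by one or more blank lines.
    blocks = re.split(r"\n\s*\n", text.strip())
    paragraphs = []
    for block in blocks:
        # Join hard-wrapped lines inside a block into a single line.
        joined = " ".join(
            line.strip() for line in block.splitlines() if line.strip()
        )
        if joined:
            # Escape &, < and > so stray characters do not break the HTML.
            paragraphs.append(f"<p>{html.escape(joined, quote=False)}</p>")
    return "\n".join(paragraphs)


if __name__ == "__main__":
    sample = "First line of a block\nsecond line.\n\nAnother block & more."
    print(wrap_paragraphs(sample))
```

The output can be pasted straight into the body of an HTML file or a Sigil chapter. Hyphenated line breaks and running headers from the scan would still need cleaning up by hand or with extra rules.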
07-18-2023, 03:41 PM | #2 |
Resident Curmudgeon
Posts: 76,038
Karma: 134368292
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
For one, do not use PDF as an intermediary format. It will add all kinds of errors, and Acrobat 6 is an old version that may not OCR all that well. Get a good OCR program and use that instead.
Last edited by JSWolf; 07-19-2023 at 04:06 AM. |
07-18-2023, 11:43 PM | #3 | ||
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Fantastic! Congrats.
And welcome to the forum. Quote:
I've been writing about this stuff extensively since 2012. For some of the most recent topics, see:
and, just last week, I wrote an even bigger summary here which linked to even more of the previous threads:
That should hold you over on all OCRing + PDF->EPUB + DOCX->EPUB info for... oh, about 100 years. Quote:
I looked up the date, and it looks like Adobe Acrobat 6 was from 2003! My gods, there have been multiple GENERATIONAL leaps in OCR quality since then. Getting much more accurate OCR is one of the biggest and most important steps you can take, because EVERY further stage will be based on how clean your initial text is. You can see the post I wrote about how important accurate OCR is: When you're creating ebooks... it's not JUST the raw text you have to worry about, but correctly recognizing all the formatting too:
Last edited by Tex2002ans; 07-18-2023 at 11:56 PM. |
||
07-19-2023, 12:49 AM | #4 |
Wizard
Posts: 1,332
Karma: 6700864
Join Date: Sep 2021
Location: Australia
Device: Kobo Libra 2
|
@Tex2002ans
In your 2020: "OCRing + EPUBing my first book: Tips?" link, you mention Scan Tailor Advanced. The only release I could find that has an install file is v0.9.11.1 from 2014. https://github.com/scantailor/scantailor/releases Is this the same software you are referring to? It seems quite old, and you mention generational leaps to the OP, so I wonder if the same applies to this software. There is a v0.9.12.1 from 2016, but there does not seem to be any install file associated with that release. |
07-19-2023, 03:41 AM | #5 | ||
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
v1.0.16 was the latest (in 2018).
- - -
Side Note: In September 2019 there was an "Early Access" version, and then it seems like there hasn't been much activity since. I think, since the 2019 stall, some other person created another fork of it here: but I have no idea about that fork or what sorts of bugs/fixes have been done since.
- - -
Side Note #2: Looks like you linked to the original "Scan Tailor". "Scan Tailor Advanced" took all the forks, pulled out all the best features, and combined them all into one super version. The biggest features for me were:
+ lots of other helpful things all listed on their Github.
- - -
Doesn't matter. It's only used as a middle, pre-processing stage where you are cleaning up the raw images. I don't foresee too much changing on that front any time soon.
You can see me apply it in: where I quickly:
You can compare my quickly-generated EPUB vs. the auto-generated Archive.org "EPUB". Still, nowhere near as good as a manually corrected version, but WAY better quality than just spitting out raw text right out of the PDF. Quote:
Even on the free/open-source front, there's been a lot of action, but I haven't been following that too closely... Because those tools tended to:
Last edited by Tex2002ans; 07-19-2023 at 04:06 AM. |
||
07-19-2023, 06:03 AM | #6 | |
Connoisseur
Posts: 66
Karma: 10
Join Date: Jul 2023
Device: None
|
Quote:
I detected few, if any, errors when using Acrobat 6. I have been looking on eBay for a newer version but haven't yet purchased one. Thanks, WV-Mike |
|
07-19-2023, 06:15 AM | #7 | |
Connoisseur
Posts: 66
Karma: 10
Join Date: Jul 2023
Device: None
|
Whew! This is all a bit overwhelming.
I looked at 4lex4 / scantailor-advanced but I cannot see how to use it. I am used to downloading a .msi or .exe for the installation. I don't have a clue how to install scantailor-advanced or then use it. I looked at https://github.com/4lex4/scantailor-advanced#readme However, I saw no instructions for installing it. Thanks to everyone for all this info. As you say: I came to the right place. WV-Mike Quote:
|
|
07-19-2023, 07:02 AM | #8 | |
Connoisseur
Posts: 66
Karma: 10
Join Date: Jul 2023
Device: None
|
Quote:
I am still looking for a standalone program I can install from a .exe or CD. Thanks, WV-Mike |
|
07-19-2023, 08:48 AM | #9 |
the rook, bossing Never.
Posts: 12,268
Karma: 89531599
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
|
There is plenty of good OCR software, both free and pay-once. PDF is a terrible step to include. Better to use TIFF or PNG.
I use Tesseract OCR. |
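As a rough illustration of skipping the PDF step, here is a sketch that runs Tesseract's command-line tool on a folder of page images from Python. It assumes tesseract is installed and on the PATH, that the pages are PNG files with zero-padded names (page_001.png, page_002.png, ...), and that the book is in English; the folder layout and the function names build_command and ocr_folder are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_command(image: Path, lang: str = "eng") -> list[str]:
    """Build the command line: tesseract IMAGE OUTPUTBASE -l LANG."""
    # Tesseract appends ".txt" to the output base itself,
    # so page_001.png produces page_001.txt next to it.
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang]


def ocr_folder(folder: str, lang: str = "eng") -> str:
    """OCR every PNG in the folder and return the combined text."""
    pages = []
    # Sorting by name only gives page order if the names are zero-padded.
    for image in sorted(Path(folder).glob("*.png")):
        subprocess.run(build_command(image, lang), check=True)
        pages.append(image.with_suffix(".txt").read_text(encoding="utf-8"))
    # A blank line between pages keeps the text blocks separated.
    return "\n\n".join(pages)


if __name__ == "__main__":
    # Hypothetical folder name; replace with wherever the scans live.
    print(ocr_folder("scans"))
```

The combined text still needs proofreading, but it never passes through a PDF on the way.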
07-19-2023, 09:38 AM | #10 |
Connoisseur
Posts: 66
Karma: 10
Join Date: Jul 2023
Device: None
|
|
07-19-2023, 12:23 PM | #11 | |
Zealot
Posts: 106
Karma: 1133068
Join Date: Sep 2007
Device: ipaq
|
Quote:
https://github.com/4lex4/scantailor-advanced/releases Under the heading "2019.8.16 Early Access", click "Assets" and download the installer. |
|
07-19-2023, 02:43 PM | #12 | |
Connoisseur
Posts: 66
Karma: 10
Join Date: Jul 2023
Device: None
|
Quote:
To be clear, this software preps the images prior to running OCR software. Is that correct? WV-Mike |
|
07-19-2023, 03:21 PM | #13 |
A Hairy Wizard
Posts: 3,208
Karma: 19000001
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 15/11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
|
Yes.
It helps straighten an image if the capture/camera was slightly off-axis, or de-warp an image if there was any skew. That helps set the characters to the correct orientation and consistent sizing...which makes OCR much better. Some OCR software will do this a little bit, with differing levels of success. It is much better to get a very accurate image in the first place. Scantailor was originally designed for just that deskewing purpose...although it sounds like they have added more functionality. I'll have to check it out again! |
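To make the deskewing idea a bit more concrete, here is a toy sketch of one common way to estimate skew, the projection-profile method: try a range of candidate angles and keep the one that piles the dark pixels into the fewest rows. This is not a claim about how Scan Tailor works internally. The input format (a list of x, y coordinates of dark pixels) and the names estimate_skew and synthetic_page are made up for illustration.

```python
import math
from collections import Counter


def estimate_skew(points, max_angle=5.0, step=0.5):
    """Return the candidate angle (degrees) that best aligns dark pixels into rows."""
    best_angle, best_score = 0.0, -1
    steps = int(round(max_angle / step))
    for i in range(-steps, steps + 1):
        angle = i * step
        rad = math.radians(angle)
        # Rotate every dark pixel back by the candidate angle and count
        # how many pixels land in each (rounded) row.
        rows = Counter(
            int(round(y * math.cos(rad) - x * math.sin(rad))) for x, y in points
        )
        # Straight text lines pile pixels into few rows, giving a high score.
        score = sum(count * count for count in rows.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


def synthetic_page(angle_deg, lines=5, width=200, spacing=10):
    """Fake 'text lines': rows of dark pixels tilted by angle_deg."""
    slope = math.tan(math.radians(angle_deg))
    return [(x, k * spacing + x * slope) for k in range(lines) for x in range(width)]


if __name__ == "__main__":
    print(estimate_skew(synthetic_page(3.0)))
```

Once the angle is known, the image is rotated by the opposite amount before OCR. Real pages are noisier than this synthetic one, which is part of why a dedicated pre-processing tool is worth using.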
07-19-2023, 03:37 PM | #14 | |
Wizard
Posts: 1,332
Karma: 6700864
Join Date: Sep 2021
Location: Australia
Device: Kobo Libra 2
|
Quote:
I've installed it and did a quick trial run on an image I was previously getting poor results with, and it OCR'd almost perfectly. In the few minutes I fiddled around with it, it seemed pretty easy to use. But I'll spend some time understanding it better. I just learnt that images OCR better when using a non-compressed / lossless format. |
|
07-19-2023, 03:39 PM | #15 | |
Wizard
Posts: 1,332
Karma: 6700864
Join Date: Sep 2021
Location: Australia
Device: Kobo Libra 2
|
Quote:
As with all things Github, along the right side of the page you will see Releases. Click on that, look for the latest version which is usually at the top or second one down, expand the Assets button and download the appropriate installer. |
|