11-10-2023, 03:07 PM | #1 |
Always reading something
Posts: 4
Karma: 10
Join Date: Nov 2023
Location: In neverland
Device: tolino Shine 3
|
Is there a pdf->epub option that keeps the design better
I noticed that there are the ePub versions and there are the PDF versions of books.
And if I convert the PDF version to ePub using calibre, the formatting is mostly gone except for a few bold and italic parts. Fonts are ignored too; it's always the generic font. Images are tiny, unoptimized, or out of place. Compare that to an originally designed ePub and how much better it looks. But I have some documents only as PDF. Is there a more intelligent converter that retains much more of the design? |
11-10-2023, 03:16 PM | #2 |
Wizard
Posts: 1,353
Karma: 6794938
Join Date: Sep 2021
Location: Australia
Device: Kobo Libra 2
|
There's not much you can do to improve pdf conversions.
This might explain why... https://www.mobileread.com/forums/sh...d.php?t=118605 |
11-10-2023, 03:22 PM | #3 | |
Always reading something
Posts: 4
Karma: 10
Join Date: Nov 2023
Location: In neverland
Device: tolino Shine 3
|
Quote:
It just needs some guidance. After all, most authors stick to a pattern in their book, shown by formatting and frames. All you need to do is define what is what and then use similarity search, so you can quickly proofread before conversion. Indeed, as I wrote, the open-source Tabula could be of great help here. |
|
11-10-2023, 03:53 PM | #4 | |||
Wizard
Posts: 2,303
Karma: 12126963
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
PDF is meant as an output-only format—not as an input into anything else. As you can see, you get LOTS of pain and junk carried over if you try to "one-button push" convert PDFs. To convert PDF into a proper ebook requires lots of elbow grease. Quote:
Quote:
It automatically marks:
Then, it allows you to:
If you want even more knowledge... I've extensively explained PDF->ebook workflows over the past 12 years. Most recently a few months ago in:
Last edited by Tex2002ans; 11-10-2023 at 04:09 PM. |
|||
11-10-2023, 04:12 PM | #5 |
Always reading something
Posts: 4
Karma: 10
Join Date: Nov 2023
Location: In neverland
Device: tolino Shine 3
|
The books already have a text layer, I don't need OCR.
PS: "I convert ebooks professionally." How does one make money with that? |
11-10-2023, 05:55 PM | #6 |
Bibliophagist
Posts: 40,475
Karma: 156982136
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
|
Umm... in a lot of cases, the text layer is done by OCR to allow searching. If you extract the text layer, in >90% of the PDFs I've looked at, it is total crap and will require way too much work for me to do unpaid. Even with PDFs that are text based, the conversion tends to leave a lot of artifacts which need to be manually cleaned up. Items such as kerned letter pairs and ligatures tend to have a habit of disappearing with some conversions (suddenly pallet becomes pa et for instance).
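The ligature problem can be partly illustrated in code. The following is only a minimal sketch, assuming the extracted text still contains the ligatures as single Unicode code points (U+FB00 to U+FB06); the function name `expand_ligatures` is made up for this example. It cannot recover a glyph that the extraction dropped entirely, as in the "pa et" case, which still needs manual proofing.

```python
import unicodedata

# Latin ligatures in the Unicode "Alphabetic Presentation Forms" block
# (ff, fi, fl, ffi, ffl, long-s t, st).
LIGATURE_RANGE = range(0xFB00, 0xFB07)


def expand_ligatures(text: str) -> str:
    """Replace single-code-point Latin ligatures with plain letters.

    Only the ligature code points are normalized, so other characters
    (accented letters, curly quotes, ...) are left exactly as they were.
    """
    return "".join(
        unicodedata.normalize("NFKC", ch) if ord(ch) in LIGATURE_RANGE else ch
        for ch in text
    )
```

Calling `expand_ligatures("\ufb01nal \ufb02oor")` gives `"final floor"`. Restricting the normalization to the ligature range is a deliberate choice here, since applying NFKC to the whole text would also rewrite other compatibility characters.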
|
11-10-2023, 06:14 PM | #7 | |
Grand Sorcerer
Posts: 5,522
Karma: 100606001
Join Date: Apr 2011
Device: pb360
|
But the text layer has little or no formatting or semantic information.
Quote:
You think what you want is easy. Why don't you just go ahead and do it yourself? |
|
11-11-2023, 07:18 AM | #8 |
the rook, bossing Never.
Posts: 12,341
Karma: 92073397
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
|
Export or copy/paste the text layer to Word/LO Writer and edit, then proof.
What Tex2002ans, Karellen, j.p.s. and DNSB wrote. I actually convert a PROPERLY Styled docx to epub in Calibre without ANY editing of CSS (except images CSS after final proof of text) and then proofread / annotate on a Kobo eink. PDFs are only a source for old PD that's only been scanned and OCRed by someone else. Madness for anything else, except piracy. |
11-11-2023, 09:16 AM | #9 | ||
Wizard
Posts: 2,303
Karma: 12126963
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
No. If you take a closer look, it's extremely likely to be:
Quote:
For example, they might have run the PDF through:
But you can run it on:
Much more accurate OCR means MUCH less time fixing up all the errors and junk in your exported file. Quote:
I also go above and beyond:
For a little more info on the general reasons why you might want a pro converting or looking over your book... I wrote these comments last year:
Last edited by Tex2002ans; 11-11-2023 at 09:23 AM. |
||
11-11-2023, 02:07 PM | #10 | |
Resident Curmudgeon
Posts: 76,368
Karma: 136006198
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Quote:
|
|
11-11-2023, 03:43 PM | #11 | |
Grand Sorcerer
Posts: 6,216
Karma: 16534894
Join Date: Sep 2009
Location: UK
Device: Kobo: KA1, ClaraHD, Forma, Libra2, Clara2E. PocketBook: TouchHD3
|
Quote:
However, the effort involved was huge and the best partial solution I could come up with was to create a series of self-programmed interactive "assistant" utilities to semi-automate the process. Most PDFs I converted introduced some kind of new challenge that I hadn't seen previously. Having experienced the challenges involved first-hand, my conclusion was that I don't think it's possible to create a magic one-click solution that would work for all PDFs. I didn't waste my time even trying to convert multi-column or non-fiction documents with many tables or footnotes. Too hard! Fiction novels only. I won't bore everyone with great detail but I found that the best method for Step 1 of the whole process was to find a utility which would extract the PDF text as an XML tree. This had the benefit of retaining a lot more info about each text snippet extracted, e.g. font used and (x,y) position on page. Unfortunately the drawback to this was that I had to create my own logic for rearranging the text snippets into correct reading order and identifying paragraph starts/ends. The font used can help identify chapter headings, italic/bold, dropcaps, small-caps. The (x,y) position can help identify correct reading order, paragraph starts, scenebreaks and those unwanted PDF headers/footers. P.S. There's no way I would offer this as a service-for-hire unless I could pick and choose the PDFs I was prepared to tackle. It was an interesting (and bloody-minded) personal project but not something I'd want to do on a regular basis. |
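The reading-order and paragraph logic described above might look roughly like the sketch below. It is not the poster's actual utility. It assumes the XML has already been parsed into `Snippet` records, that `y` is measured downward from the top of the page, that italic fonts have "Italic" in their name, and that paragraphs start with an indented first line. All names and thresholds are hypothetical and would need tuning per PDF.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    text: str
    x: float   # left edge of the snippet
    y: float   # distance from the top of the page (assumed to grow downward)
    font: str  # font name as reported by the extraction tool


def strip_margins(snippets, page_height, margin=50.0):
    """Drop running headers/footers: anything in the top or bottom margin."""
    return [s for s in snippets if margin <= s.y <= page_height - margin]


def to_lines(snippets, y_tol=2.0):
    """Group snippets into lines (similar y), ordered top-down, left-to-right."""
    lines = []
    for s in sorted(snippets, key=lambda s: (s.y, s.x)):
        if lines and abs(lines[-1][0].y - s.y) <= y_tol:
            lines[-1].append(s)
        else:
            lines.append([s])
    return [sorted(line, key=lambda s: s.x) for line in lines]


def mark_up(snippet):
    """Wrap italic snippets in <i> tags, judged by font name (an assumption)."""
    if "Italic" in snippet.font:
        return f"<i>{snippet.text}</i>"
    return snippet.text


def to_paragraphs(lines, indent_tol=5.0):
    """Start a new paragraph whenever a line is indented past the left margin."""
    if not lines:
        return []
    left = min(line[0].x for line in lines)
    paragraphs = []
    for line in lines:
        text = " ".join(mark_up(s) for s in line)
        if not paragraphs or line[0].x - left > indent_tol:
            paragraphs.append(text)
        else:
            paragraphs[-1] += " " + text
    return paragraphs
```

A page would be processed as `to_paragraphs(to_lines(strip_margins(snippets, page_height)))`. Multi-column layouts, footnotes, dropcaps, and block-style paragraphs without indents would all break these simple rules, which fits the point that each PDF tends to bring a new challenge.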
|
11-11-2023, 05:26 PM | #12 | |||||
Wizard
Posts: 2,303
Karma: 12126963
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Nowadays, there's a bigger push for:
which are a HUGE step in the right direction. (This attaches important information, such as Heading/Paragraph + Bold/Italic + Headers/Footers/PageNumbers, to the PDF too, so tools like Text-to-Speech can step through the document and navigate correctly.) Theoretically, I suspect this would make this sort of PDF->EPUB conversion easier... but you'd still probably be better off going through a known toolchain, instead of trying to unravel who-knows-what-unique-garbage-is-buried-in-that-PDF.
- - -
But again, PDF is a final OUTPUT format... it's an absolutely trash INPUT format, so it should only be used as a very last resort. And, as always, if possible, it's best to go back to the original source document (DOCX/ODT, RTF, TXT, ...) and convert from there.
- - -
Quote:
Something similar happens when people try to do EPUB->EPUB, unraveling all the spaghetti of HTML+CSS someone created. Most of the time, it's faster/easier to just go back to the drawing board and restart your conversion from scratch. You can see some of that described in this post, where I explained to RbnJrg how I'd handle "surgically correcting" 20 ebooks in the same series:
- - -
Side Note: Since 2021, KevinH has implemented many of those theoretical features in Sigil! They're advanced cleanup tools, but EXTREMELY powerful ways to mass-fix HTML+CSS much more quickly.
- - -
Quote:
Right now, I'm working on an ebook with 1400 Endnotes!* And the 2nd ebook has ~190 Figures! (Ugh, that amount of alt text generation... kill me now... lol.) - - - * Side Note: Speaking of Endnotes... Does anybody here know how to wrestle Microsoft Word into outputting:
- - - Quote:
Then, you just have a spaghetti nest of Calibre-converted classes to mess with, but that would be infinitely easier than this custom PDF-exporting+parsing+converting stuff. (See the "surgical" thread/methods above.) Quote:
Instead, you can just:
And boom, with so much less work, now, you can layer the book's unique quirks on top of THAT BASE document. Easier to take that clean-but-not-quite-correct text and:
than to:
Learn from me + Hitch—the pros—lol. Trying to convert from PDF like that is a dark, dark, path! |
|||||
11-11-2023, 07:19 PM | #13 |
Bibliophagist
Posts: 40,475
Karma: 156982136
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
|
The last time I converted more than one PDF in a batch was a paying gig where an author had gotten rights to her books back but the only copies she had were PDFs the publisher had sent her years back. I did the conversion to docx with basic cleanup. She then pulled them into Word and did the edits and more cleanup before sending them back to me to check the formatting before republishing them.
I suspect I undercharged her but my wife loved her books. |
11-11-2023, 09:11 PM | #14 | |
Grand Sorcerer
Posts: 6,216
Karma: 16534894
Join Date: Sep 2009
Location: UK
Device: Kobo: KA1, ClaraHD, Forma, Libra2, Clara2E. PocketBook: TouchHD3
|
Quote:
IIRC, when I originally experimented with calibre's PDF to EPUB conversion I had difficulties with the first couple of PDFs I tried. One of them failed to retain italics, the other did detect italics but failed to retain scenebreaks. For both of them, trying to remove PDF headers/footers via the convert-search/replace option was a PITA. Maybe I was just unlucky with my choice of PDFs but all 3 of those problems were showstoppers for me so I didn't pursue one-click PDF conversion any further. This was over 10 years ago, maybe it's better now ... but based on the OP's first post, maybe not. Yes, but it is a well-trodden path which is already laid. No point re-inventing the wheel if it works well enough for the occasional new PDF, as I only convert for personal use or sometimes as a favour for a friend. FWIW the main reason for my original post was to say that "where there's a will there's a way" but hoping for a one-size-fits-all "magic button" is likely to end in disappointment. |
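For the header/footer problem specifically, one text-only alternative to hand-written search/replace patterns is to look for lines that repeat at the top or bottom of most pages. The sketch below is an illustration of that idea, not how calibre does it. It assumes the text has already been extracted page by page, and the function name and the `min_share` threshold are invented for the example.

```python
import re
from collections import Counter


def _key(line):
    """Collapse digits so 'Page 12' and 'Page 13' count as the same line."""
    return re.sub(r"\d+", "#", line.strip())


def _edge_indexes(lines):
    """Indexes of the first and last non-blank line on a page (if any)."""
    idx = [i for i, line in enumerate(lines) if line.strip()]
    return idx[:1] + idx[-1:]


def strip_running_lines(pages, min_share=0.6):
    """Remove first/last lines that recur on at least min_share of the pages.

    Blank lines inside the page are kept, since they may be scene breaks.
    """
    split_pages = [page.splitlines() for page in pages]
    counts = Counter()
    for lines in split_pages:
        counts.update({_key(lines[i]) for i in _edge_indexes(lines)})
    repeated = {k for k, n in counts.items() if n >= min_share * len(pages)}

    cleaned = []
    for lines in split_pages:
        drop = {i for i in _edge_indexes(lines) if _key(lines[i]) in repeated}
        kept = [line for i, line in enumerate(lines) if i not in drop]
        cleaned.append("\n".join(kept).strip("\n"))
    return cleaned
```

The obvious weakness is a page whose genuine last line happens to look like a page number (a bare year, say), so the output still needs a proofreading pass.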
|
11-12-2023, 03:07 AM | #15 | |
Wizard
Posts: 2,303
Karma: 12126963
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Yes, a lot of the work I do is also where the original files are completely lost. Think 1990s or 2000s digital publishing. The author/publisher might not even HAVE the original source files anymore... so the PDF (or physical book) is the only file left. People/organizations are very bad at backing up important files. For example, see video games from before 2000: Yep. Full agree on that. |
|
|