[GUI Plugin] Modify ePub - Page 62

cybmole · 04-24-2015, 07:04 AM

i would infer from the class name - mbppagebreak - that this is obsolete code, as all readers now page break anyway at the start of the next html file.

( googling the term seems to confirm that it's more likely to have been inserted by conversion tools than by original publishers; someone more familiar with the kindlegen process as applied to author uploads would know if amazon themselves add it )

so having this trailing at end of each chapter is redundant and can also generate spurious blank pages, depending on how the reader apps deal with an empty div.

i'd class it as redundant, like you do un-needed spans, and strip it because it is clutter that has no effect on how the epub is rendered.

I am pretty sure I have also seen it in some epubs that have not been generated/processed by calibre

jackie_w · 04-24-2015, 07:29 AM

I think the pagebreak constructs that look like

Code:

<div class="mbppagebreak" id="calibre_pb_16"></div>

are what you get when you convert an old-style MOBI using calibre. I used the 'Kindle Unpack' plugin to unpack a few of my old Amazon original MOBIs at random and the source code looked like this in all of them

Code:

<mbp:pagebreak/>

ETA: As the unpacked markup code is all in a single file I'm assuming calibre uses them to decide where to split into multiple files.
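
A rough Python sketch of the splitting idea in the ETA above. This is only an illustration of the assumption, not calibre's actual code; the function name split_on_pagebreaks and the sample markup are made up for the example.

```python
import re

# Matches the legacy MOBI marker shown above, e.g. <mbp:pagebreak/>.
PAGEBREAK = re.compile(r'<mbp:pagebreak\s*/?>', re.I)

def split_on_pagebreaks(markup):
    """Split one big MOBI markup string into per-section chunks.

    Hypothetical helper: empty chunks (for example after a trailing
    marker) are dropped.
    """
    return [part.strip() for part in PAGEBREAK.split(markup) if part.strip()]

sections = split_on_pagebreaks(
    '<p>One</p><mbp:pagebreak/><p>Two</p><mbp:pagebreak/>'
)
```

Each chunk would then become its own HTML file, which would explain why the converted div usually ends up at the very end of a file.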

Rev. Bob · 04-24-2015, 07:40 AM

Quote:

Originally Posted by cybmole

so having this trailing at end of each chapter is redundant and can also generate spurious blank pages, depending on how the reader apps deal with an empty div.

i'd class it as redundant, like you do un-needed spans, and strip it because it is clutter that has no effect on how the epub is rendered.

Oh, I quite understand the reasoning, and even agree with the principle. The trouble is that it's dangerous to make assumptions without thorough investigation.

For instance, I've got several Kindle books that pack multiple chapters into one HTML file. I wouldn't want to remove those internal page-breaks, because then the chapters wouldn't get separated, and that starts interfering with the intended presentation.

("But you can split chapters by having calibre look for headers!"
"Not if the author uses paragraphs with big fonts instead of header tags."
"Why would anyone do that?"
"Got me... but it happens. Go figure.")

Then there's the question of how reliable the class name is. Maybe it's a native EPUB where the author manually placed a page-break element at the bottom of each document, only he called his class "pbr" because it's nice and short. Same effect, different code.

The plugin's simply not built to handle "if the document starts or ends in an empty element whose only function is to generate a page break, remove that element." It's not smart enough for that; that takes human intervention, just as "remove any blank-space paragraphs at the end of a document" does. (Yeah, that happens, too. I've even seen one publisher whose chapters gain an extra containing DIV as the book progresses. Chapter One might have two, Chapter Two three, up until Chapter Ninety-Seven having 98 of the damned things. Ebook code is easy to screw up.)

However, if I can verify that the class name is stable, I could possibly remove such elements at the top or bottom of the BODY element. Maybe. It depends, on those and other factors, and my natural inclination is to not use automation to interfere unless I can be confident in the results. I hate changing that stuff by hand, too, but I hate it much less than trying to figure out where things used to be.

So: While I may be able to do something, I'm not committing to it without having the chance to investigate. Could happen, could not, too soon to tell. Shake the 8-ball, ask again later.

cybmole · 04-24-2015, 09:37 AM

I think jackie_w has nailed it:
the class name and its location and original purpose are stable, and it would only exist where a book has been sold in legacy mobi format, not in azw, and has then been converted.

so arguably it is a calibre conversion artifact and can be detected as such ?
i.e. I'd settle for removal only of matches for

<div class="mbppagebreak" id="calibre_pb_\d+"></div>

that's probably how I zap them in sigil: I had a quick look at my recent sigil find/replace but any like that have dropped off my recently used list

Rev. Bob · 04-24-2015, 09:48 AM

Quote:

Originally Posted by cybmole

I'd settle for removal only of matches for

<div class="mbppagebreak" id="calibre_pb_\d+"></div>

I wouldn't, for the reason I gave above: that would catch matches in the middle of a document, instead of only at the beginning and/or end. I've already explained why that's a terrible idea. You can do so manually if you wish, but I will not.

Now, if it's confined to right after <body *> or right before </body>, that's a different story - but what you describe is not limited in that way, and therefore I will not do it. I may or may not elect to build any sort of mbppagebreak processing into the plugin, but I have already decided that much.

theducks · 04-24-2015, 10:24 AM

Quote:

Originally Posted by cybmole

i would infer from the class name - mbppagebreak - that this is obsolete code, as all readers now page break anyway at the start of the next html file.

( googling the term seems to confirm that it's more likely to have been inserted by conversion tools than by original publishers; someone more familiar with the kindlegen process as applied to author uploads would know if amazon themselves add it )

so having this trailing at end of each chapter is redundant and can also generate spurious blank pages, depending on how the reader apps deal with an empty div.

i'd class it as redundant, like you do un-needed spans, and strip it because it is clutter that has no effect on how the epub is rendered.

I am pretty sure I have also seen it in some epubs that have not been generated/processed by calibre

CAUTION: That code can also exist MID file (probably from the "I want only 1 file" crowd).

cybmole · 04-24-2015, 10:28 AM

ok - all I can say is that I've been doing that regex remove manually for 2-3 years, over 100 books I am sure - & I have never seen that construction anywhere except at the end of a file. it may be that calibre always breaks after one of those, so that it is logically impossible for a calibre conversion to leave one in the middle of an html file ( on default structure detect settings anyway )
but ok , tweak to:
find
<div class="mbppagebreak" id="calibre_pb_\d+"></div>
</body>
replace
</body>
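
For anyone who would rather script it than run the find/replace in Sigil, here is a minimal Python sketch of the same thing. The helper name strip_trailing_pagebreak is made up for the example, and it assumes the exact markup quoted above (same attribute order, double quotes, only whitespace between the div and </body>).

```python
import re

# Same pattern as the find/replace above, anchored to the closing body tag.
# \s* allows for the newline between the div and </body>.
TRAILING_PB = re.compile(
    r'<div class="mbppagebreak" id="calibre_pb_\d+"></div>\s*(</body>)'
)

def strip_trailing_pagebreak(html):
    """Remove the calibre page-break div only when it sits right before </body>."""
    return TRAILING_PB.sub(r'\1', html)

end_of_file = ('<body><p>x</p>\n'
               '<div class="mbppagebreak" id="calibre_pb_16"></div>\n'
               '</body>')
mid_file = ('<body><p>a</p>'
            '<div class="mbppagebreak" id="calibre_pb_1"></div>'
            '<p>b</p></body>')

cleaned = strip_trailing_pagebreak(end_of_file)
untouched = strip_trailing_pagebreak(mid_file)
```

A div in the middle of a file is left alone, because the pattern only matches when </body> follows directly.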

Rev. Bob · 04-24-2015, 11:07 AM

Quote:

Originally Posted by cybmole

but ok , tweak to:
find
<div class="mbppagebreak" id="calibre_pb_\d+"></div>
</body>
replace
</body>

Or, in other words:

Quote:

Originally Posted by Rev. Bob

Now, if it's confined to right after <body *> or right before </body>, that's a different story - but what you describe is not limited in that way, and therefore I will not do it.

PandathePanda · 04-24-2015, 02:39 PM

Tested a PD book from: https://www.mobileread.com/forums/sho...d.php?t=259583 and got the <div class="mbp_pagebreak" ...> inserted into the epub after converting the mobi to epub. Yet after unpacking the azw3 file, the resulting epub does not have this inserted.

So my guess is that it's formatting inserted during the conversion by calibre, and can safely be removed.

DiapDealer · 04-24-2015, 03:06 PM

The problem with assuming that it's calibre-added stuff that can be safely removed is that an ebook could have been edited by someone AFTER the calibre conversion which added the mbppagebreak div stuff: files could have been split and merged in any number of unforeseen ways (or code copied and pasted to somewhere where the pagebreak IS performing a wanted function in the middle of a file).

I agree that only the ones immediately following the <body> tag, or immediately preceding the </body> tag can be safely removed wholesale.

theducks · 04-24-2015, 03:31 PM

I would include any empty (non-text/Image) tag pairs in those locations
IMHO Margins should be used to supply top or bottom whitespace

I see no purpose in a end-of-file anchor either.
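
A sketch of what "any empty tag pairs in those locations" could look like in Python, using the standard-library XML parser on a well-formed XHTML body. Everything here is an assumption for illustration: the helper names, the KEEP_EMPTY list, and the choice to treat whitespace-only paragraphs as empty. Note that it would also drop an empty end-of-file anchor, so any link pointing at that anchor's id would need checking by hand.

```python
import xml.etree.ElementTree as ET

# Assumption: these elements render something even with no text or children.
KEEP_EMPTY = {"img", "image", "svg", "hr", "br"}

def local(tag):
    """Tag name without an XML namespace prefix like {http://...}."""
    return tag.rsplit("}", 1)[-1]

def is_empty(el):
    """True for an element with no children and no non-whitespace text."""
    return (len(el) == 0
            and not (el.text or "").strip()
            and local(el.tag) not in KEEP_EMPTY)

def trim_body_edges(body):
    """Drop empty elements at the very top and very bottom of <body> only."""
    # Top: stop as soon as real text or a non-empty element is reached.
    while len(body) and not (body.text or "").strip() and is_empty(body[0]):
        first = body[0]
        body.text = (body.text or "") + (first.tail or "")
        body.remove(first)
    # Bottom: stop if text follows the last element, or it is not empty.
    while len(body) and not (body[-1].tail or "").strip() and is_empty(body[-1]):
        body.remove(body[-1])
    return body

body = ET.fromstring(
    '<body>\n<div class="pbr"></div>\n<p>text</p><div class="pbr"/>'
    '<p>more</p>\n<p> </p>\n<div class="mbppagebreak"/>\n</body>'
)
trim_body_edges(body)
remaining = [local(e.tag) for e in body]
```

The class name is never looked at, so a "pbr" div is handled the same way as an mbppagebreak one, and the "pbr" div between the two paragraphs survives because it is not at an edge.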

cybmole · 04-24-2015, 03:41 PM

It would be a non-issue, display-wise, if all renderers worked to the same rules. But some add a blank line when they hit an empty tag pair, and some don't. I am not sure if ANY actually perform a page break!

DiapDealer · 04-24-2015, 04:00 PM

Quote:

Originally Posted by cybmole

I am not sure if ANY actually perform a page break!

What do you mean? Almost all renderers will perform a page break if page-break-before: always or page-break-after: always is assigned to the class. Or did you mean something else?
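
One way to check that before touching anything is to look in the stylesheet for classes that actually carry a page-break rule, whatever they happen to be called. A minimal Python sketch, assuming plain flat CSS (no comments, no @media nesting); the function name pagebreak_classes and the sample stylesheet are made up for the example.

```python
import re

RULE = re.compile(r'([^{}]+)\{([^{}]*)\}')            # selector { declarations }
BREAK = re.compile(r'page-break-(?:before|after)\s*:\s*always', re.I)
CLASS = re.compile(r'\.([A-Za-z_][\w-]*)')            # .classname in a selector

def pagebreak_classes(css):
    """Return class names whose rule forces a page break.

    Rough heuristic: every class mentioned in a matching selector is
    reported, so compound selectors can over-report.
    """
    found = set()
    for selector, declarations in RULE.findall(css):
        if BREAK.search(declarations):
            found.update(CLASS.findall(selector))
    return found

css = (
    '.mbppagebreak { display: block; page-break-after: always }\n'
    '.pbr{page-break-before:always}\n'
    '.calibre1 { margin: 0 }'
)
classes = pagebreak_classes(css)
```

This would catch a hand-written "pbr" class as well as the calibre one, without relying on the name.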

AnotherCat · 04-24-2015, 04:34 PM

Quote:

Originally Posted by DiapDealer

...I agree that only the ones immediately following the <body> tag, or immediately preceding the </body> tag can be safely removed wholesale.

That is the approach that I would go along with, for the reasons that were given, and would do the complete job in most cases.

Rev. Bob · 04-24-2015, 04:48 PM

Quote:

Originally Posted by PandathePanda

Tested a PD book from: https://www.mobileread.com/forums/sho...d.php?t=259583 and got the <div class="mbp_pagebreak" ...> inserted into the epub after converting the mobi to epub. Yet after unpacking the azw3 file, the resulting epub does not have this inserted.

So my guess is that it's formatting inserted during the conversion by calibre, and can safely be removed.

Quote:

Originally Posted by DiapDealer

The problem with assuming that it's calibre-added stuff that can be safely removed is that an ebook could have been edited by someone AFTER the calibre conversion which added the mbppagebreak div stuff: files could have been split and merged in any number of unforeseen ways (or code copied and pasted to somewhere where the pagebreak IS performing a wanted function in the middle of a file).

Okay, now that I've had a chance to look through some of my Kindle -> EPUB conversions...

The second book I opened, copyright 2014 and converted in January 2015, repeatedly uses <div class="mbp_pagebreak"/> in the middle of its two text documents to separate chapters. (The first document is frontmatter and a serial story, and the second is an unrelated story with backmatter.) There is an additional instance at the top of the second document.

I am strongly tempted to label this a Calibre issue, an artifact of the conversion process that should be handled by adjusting that feature. That doesn't do anything about any existing conversions, though, so I haven't completely (ahem) closed the book on it yet.

If I do include processing for this, it'll definitely be tied to "is a BODY tag adjacent?" and will handle cases - such as this one - where there's no "calibre_pb_\d+" ID attribute present. That does make things more complicated, though, and further feedback is welcome.
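
To make the "is a BODY tag adjacent?" rule concrete, here is a rough Python regex sketch. It is not the plugin's code, just one possible shape for it. It assumes only the two class spellings seen in this thread, an optional id attribute after the class, and either a self-closing div or an empty open/close pair; the name strip_edge_pagebreaks is made up.

```python
import re

# Assumed forms: <div class="mbppagebreak" id="..."></div>,
# <div class="mbp_pagebreak"/>, with the id attribute optional.
PB_DIV = (
    r'<div\s+class="(?:mbppagebreak|mbp_pagebreak)"'
    r'(?:\s+id="[^"]*")?\s*(?:/>|>\s*</div>)'
)
# One or more page-break divs directly after the opening body tag.
AFTER_BODY_OPEN = re.compile(r'(<body\b[^>]*>\s*)(?:' + PB_DIV + r'\s*)+', re.I)
# One or more page-break divs directly before the closing body tag.
BEFORE_BODY_CLOSE = re.compile(r'(?:\s*' + PB_DIV + r')+(\s*</body>)', re.I)

def strip_edge_pagebreaks(html):
    """Remove page-break divs only where a body tag is adjacent."""
    html = AFTER_BODY_OPEN.sub(r'\1', html)
    return BEFORE_BODY_CLOSE.sub(r'\1', html)

doc = (
    '<body class="c">\n'
    '<div class="mbp_pagebreak"/>\n'
    '<p>one</p>\n'
    '<div class="mbp_pagebreak"/>\n'
    '<p>two</p>\n'
    '<div class="mbppagebreak" id="calibre_pb_3"></div>\n'
    '</body>'
)
result = strip_edge_pagebreaks(doc)
```

In this sample the div between the two paragraphs stays, which is the chapter-separator case described above; only the ones touching the body tags go.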

Meanwhile, I've received a copy of the page-count plugin and information on the checkbox tweaks, so I can look into that. If they're as minor as they sound, I have no qualms about porting them over. Optional feature, doesn't break anything - sounds like a win.
