Modify ePub plugin dev thread

kiwidude · 05-23-2011, 08:24 PM

I'm working on the "Modify ePub" plugin which has various manipulations of an ePub to partner Quality Check, such as removing spurious files, legacy jackets etc.

I decided to use the Container class from calibre/ebooks/epub/fix/container.py as the basis for my manipulations. It is fairly basic, but that is an advantage for this plugin, where the intent is to make as few changes to the ePub files as possible, taking care not to touch the CSS etc.

With a little extension I have that working pretty well, so where necessary I can remove items from the manifest and the spine of the opf, as well as the actual file.

However that does leave one aspect of the ePub file that I am not currently handling - that being the ncx TOC file. At this point the scope of my changes is only a potential desire to remove items from the TOC.

Looking in the Calibre codebase it seems there are several "TOC" type classes around the place (such as metadata/toc.py and in oeb/base.py).

Would any of those be appropriate/useful for what I want to do? Basically I want to parse an ncx file, have an ability to remove an item from the TOC based on the @src attribute matching a value and then get the new structure back so I can persist it with the container.set() function.

Any suggestions appreciated.

kovidgoyal · 05-23-2011, 08:34 PM

If all you want to do is remove entries, I'd say just use lxml. The only tricky part will be making the path references absolute (I don't recall if ncx paths are relative to the dir containing the ncx or the root of the zip file).
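A minimal sketch of the path handling, assuming the src values are resolved relative to the directory containing the ncx (the usual rule for relative URIs in an XML document, but the post above leaves this open, so treat it as an assumption). The helper name ncx_src_to_zip_path and the example paths are hypothetical:

```python
import posixpath
from urllib.parse import unquote, urldefrag


def ncx_src_to_zip_path(ncx_zip_path, src):
    """Turn a navPoint <content src="..."> value into a path inside the zip.

    ASSUMPTION: src is relative to the directory that holds the ncx file.
    """
    # Drop any "#fragment" so only the file part is compared.
    href, _fragment = urldefrag(src)
    base_dir = posixpath.dirname(ncx_zip_path)
    # Zip member names always use forward slashes, hence posixpath.
    return posixpath.normpath(posixpath.join(base_dir, unquote(href)))


print(ncx_src_to_zip_path("OEBPS/toc.ncx", "Text/jacket1.xhtml#heading_id_3"))
```

Normalising both the ncx entries and the manifest hrefs this way lets them be compared as plain strings, whatever folder the ncx happens to live in.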

kiwidude · 05-23-2011, 08:43 PM

That was my first thought too, though when I looked at a nested TOC structure I got scared.

For instance say I have this (from an actual book):

Code:

<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
    <head>
        <meta name="dtb:uid" content="4784ca89-f128-4a73-92ed-b84ac4edb658"/>
        <meta name="dtb:depth" content="2"/>
        <meta name="dtb:totalPageCount" content="0"/>
        <meta name="dtb:maxPageNumber" content="0"/>
    </head>
    <docTitle>
        <text>Grave Surprise</text>
    </docTitle>
    <navMap>
        <navPoint id="navPoint-1" playOrder="1">
            <navLabel>
                <text>Grave Surprise</text>
            </navLabel>
            <content src="Text/jacket1.xhtml"/>
            <navPoint id="navPoint-2" playOrder="2">
                <navLabel>
                    <text>Book Jacket</text>
                </navLabel>
                <content src="Text/jacket1.xhtml#heading_id_3"/>
            </navPoint>
        </navPoint>
        <navPoint id="navPoint-3" playOrder="3">
            <navLabel>
                <text>Grave Surprise</text>
            </navLabel>
            <content src="Text/jacket_split_000.xhtml"/>
            <navPoint id="navPoint-4" playOrder="4">
                <navLabel>
                    <text>Book Jacket</text>
                </navLabel>
                <content src="Text/jacket_split_001.xhtml"/>
            </navPoint>
        </navPoint>
    </navMap>
</ncx>

If this was a flat ncx, then I would have thought it would be easy, in that I could just remove the whole navPoint which contains the content node.

However how should I "fiddle" the structure, if it is not the innermost navPoint that I am removing?

kovidgoyal · 05-23-2011, 08:47 PM

Code:

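NCX = '{http://www.daisy.org/z3986/2005/ncx/}'
# Snapshot the navPoints first, since the tree is modified while looping
for navpoint in list(root.iter(NCX + 'navPoint')):
    if test_navpoint_for_removal(navpoint):
        p = navpoint.getparent()
        idx = p.index(navpoint)
        p.remove(navpoint)
        # Promote only the nested navPoints (not navLabel/content)
        for child in reversed(navpoint.findall(NCX + 'navPoint')):
            p.insert(idx, child)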

You'll have to do a bit of monkeying with the text/tail of various elements to preserve indentation, but that's about it.
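A self-contained sketch of the same idea, run against a trimmed copy of the ncx posted earlier. To keep it runnable without third-party packages it uses the standard library's xml.etree.ElementTree as a stand-in for lxml, which means building a parent map instead of calling getparent(). The function name remove_navpoints_for_file and the playOrder renumbering step are assumptions for illustration, not something taken from calibre, and indentation (text/tail) is left untouched:

```python
import xml.etree.ElementTree as ET

NCX = "{http://www.daisy.org/z3986/2005/ncx/}"

SAMPLE_NCX = """<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <navMap>
    <navPoint id="navPoint-1" playOrder="1">
      <navLabel><text>Grave Surprise</text></navLabel>
      <content src="Text/jacket1.xhtml"/>
      <navPoint id="navPoint-2" playOrder="2">
        <navLabel><text>Book Jacket</text></navLabel>
        <content src="Text/jacket1.xhtml#heading_id_3"/>
      </navPoint>
    </navPoint>
    <navPoint id="navPoint-3" playOrder="3">
      <navLabel><text>Grave Surprise</text></navLabel>
      <content src="Text/jacket_split_000.xhtml"/>
      <navPoint id="navPoint-4" playOrder="4">
        <navLabel><text>Book Jacket</text></navLabel>
        <content src="Text/jacket_split_001.xhtml"/>
      </navPoint>
    </navPoint>
  </navMap>
</ncx>"""


def remove_navpoints_for_file(root, src_file, renumber=True):
    """Remove navPoints whose <content src> targets src_file (fragment ignored).

    Nested navPoints of a removed entry are promoted into its place.
    Returns the number of navPoints removed.
    """
    # ElementTree has no getparent(), so map each child to its parent.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    removed = 0
    # Snapshot the list: the tree is modified inside the loop.
    for navpoint in list(root.iter(NCX + "navPoint")):
        content = navpoint.find(NCX + "content")
        if content is None:
            continue
        if content.get("src", "").split("#", 1)[0] != src_file:
            continue
        parent = parent_of[navpoint]
        idx = list(parent).index(navpoint)
        parent.remove(navpoint)
        # Inserting in reverse at the same index keeps the original order.
        for child in reversed(navpoint.findall(NCX + "navPoint")):
            parent.insert(idx, child)
            parent_of[child] = parent
        removed += 1
    if renumber:
        # ASSUMPTION: sequential playOrder values are wanted after removal.
        for order, navpoint in enumerate(root.iter(NCX + "navPoint"), start=1):
            navpoint.set("playOrder", str(order))
    return removed


root = ET.fromstring(SAMPLE_NCX)
removed = remove_navpoints_for_file(root, "Text/jacket_split_000.xhtml")
nav_map = root.find(NCX + "navMap")
top_level = [np.get("id") for np in nav_map.findall(NCX + "navPoint")]
print(removed, top_level)
```

Here navPoint-3 is dropped and its nested navPoint-4 moves up to the top level of the navMap, which is the "not the innermost navPoint" case asked about above.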

kiwidude · 05-23-2011, 08:50 PM

That was rather quick - come across this problem before?

Awesome, thx Kovid. I shall experiment a bit with that tomorrow.

kovidgoyal · 05-23-2011, 10:36 PM

Quote:

Originally Posted by kiwidude

That was rather quick - come across this problem before?

Considering that the vast majority of what calibre does is manipulate (X)HTML, yeah

jackie_w · 05-24-2011, 08:35 PM

I don't know whether this is relevant to your plans but someone in the epub forum has written an epub utility, called epubFixer, which already has some rather nice facilities for manipulating toc.ncx. Here's a link if you want to know more.

It's obviously not a Calibre plugin, but it can easily be run using your 'Open With' plugin

kiwidude · 05-25-2011, 04:40 AM

Hi Jackie,

Thanks for the link. From my quick look at it there isn't much overlap of the two functions. That tool you linked to looks like something that should or will be functionality in Sigil, as the majority of it that I saw listed involves interactive editing of the ePub.

There was only one automated function I saw that would be applicable in theory and that was zeroing margins. However even that I think is too crude a thing to be doing and instead needs to be done manually IMHO. Otherwise you will lose any appropriately margined subparagraphs. It is using a sledgehammer to crack a nut and expecting to get anything edible afterwards. If you looked at the book first to know that it was safe to do so then the function could be used, but again that sounds like it should be part of Sigil.

This plugin is about applying changes in bulk in a non-interactive fashion, other than choosing your set of changes on a screen before it begins, a bit like converting files in a sense. So there are certain things that Quality Check can detect in bulk that it then makes sense to fix in bulk. So far it does things like removing legacy or all jackets, zeroing xpgt margins, removing missing manifest entries, removing/adding unmanifested files, removing iTunes files and removing calibre bookmarks. All of these have a matching search equivalent in Quality Check.

Longer term I would think this plugin could have features to add/update the cover image, update the book metadata, add/update a jacket etc. The plumbing is there. The biggest issue to not doing these now is that the calibre code to do these features is all based around assuming an oeb object. I instead chose to load the book into the more lightweight container object, to be guaranteed that the disruption to the ePub content was minimal. So I either need to replicate the calibre code, or find a way to combine the approaches, such as also generating an oeb object to call calibre functions, and then copying the bits it has generated into my container.

Perhaps Kovid or someone who knows the conversion pipeline may have some thoughts or suggestions on this.

ldolse · 05-25-2011, 09:22 AM

Quote:

Originally Posted by kiwidude

Longer term I would think this plugin could have features to add/update the cover image, update the book metadata, add/update a jacket etc. The plumbing is there. The biggest issue to not doing these now is that the calibre code to do these features is all based around assuming an oeb object. I instead chose to load the book into the more lightweight container object, to be guaranteed that the disruption to the ePub content was minimal. So I either need to replicate the calibre code, or find a way to combine the approaches, such as also generating an oeb object to call calibre functions, and then copying the bits it has generated into my container.

Perhaps Kovid or someone who knows the conversion pipeline may have some thoughts or suggestions on this.

For a number of candidate changes all you really need is an iterator to walk through the 'text' items in the manifest and then pass those files to various enabled/applicable preprocess/look and feel functions one by one - they're all just text manipulation functions in the earlier stages. After it's gone through all those functions just make sure it's still valid xhtml (there is a function for this somewhere already) and write the new version back. I'm not sure if there is anything special in the regular conversion pipeline to determine which files are valid text elements.
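A rough sketch of that kind of iterator, assuming the opf has already been read into a string and the book's files are available in a plain dict keyed by manifest href. The names iter_text_items and apply_transforms, the TEXT_TYPES set and the sample opf are all hypothetical; none of this is calibre's own code, and the validity check mentioned above is left out:

```python
import xml.etree.ElementTree as ET

OPF = "{http://www.idpf.org/2007/opf}"
# ASSUMPTION: these media types are what counts as a 'text' item.
TEXT_TYPES = {"application/xhtml+xml", "text/html"}

SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <manifest>
    <item id="ch1" href="Text/ch1.xhtml" media-type="application/xhtml+xml"/>
    <item id="css" href="Styles/main.css" media-type="text/css"/>
    <item id="ch2" href="Text/ch2.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
</package>"""


def iter_text_items(opf_xml):
    """Yield (id, href) for each manifest item that looks like a text file."""
    root = ET.fromstring(opf_xml)
    for item in root.iter(OPF + "item"):
        if item.get("media-type") in TEXT_TYPES:
            yield item.get("id"), item.get("href")


def apply_transforms(opf_xml, files, transforms):
    """Run each text-to-text function over every text item, in order.

    files maps href -> file contents; only entries that change are rewritten.
    Returns the list of hrefs that were modified.
    """
    changed = []
    for _item_id, href in iter_text_items(opf_xml):
        original = files[href]
        text = original
        for transform in transforms:
            text = transform(text)
        if text != original:
            files[href] = text
            changed.append(href)
    return changed


files = {
    "Text/ch1.xhtml": "<p>Hello   world</p>",
    "Text/ch2.xhtml": "<p>Unchanged</p>",
    "Styles/main.css": "p { margin: 0 }",
}
changed = apply_transforms(SAMPLE_OPF, files, [lambda t: t.replace("   ", " ")])
print(changed)
```

Writing back only the files that actually changed fits the "minimal disruption" aim of working on the container rather than a full oeb object.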

kovidgoyal · 05-25-2011, 11:34 AM

You should be able to re-use calibre code to update metadata/covers, that does not depend on OEB. There is no code in calibre to update existing jackets, just replace an old one with a new one.

kiwidude · 05-25-2011, 01:25 PM

Quote:

Originally Posted by kovidgoyal

You should be able to re-use calibre code to update metadata/covers, that does not depend on OEB. There is no code in calibre to update existing jackets, just replace an old one with a new one.

Hmmm. Perhaps I wasn't looking at the right code then. I was looking at ebooks.oeb.transforms.cover.py, where it defines the cover templates and populates them in the insert_cover() function. I had thought I could reuse the template definitions, but that I would have to replicate all the remaining code for the actual generation, as the insert_cover() function relies on self.oeb for the guide and manifest logic.

The jacket.py file in the same namespace also has dependencies on the oeb manifest from the Jacket class. The module level functions I could call, but again I think I would be replicating logic from the Jacket class.

Which is all fine if that is what you meant and recommend - but perhaps I am missing something more obvious

kovidgoyal · 05-25-2011, 03:15 PM

Ah, you are talking about manipulating the HTML/SVG wrapper around the cover. Yeah, that would require code duplication. I thought you just meant replacing the cover in the epub, for which you can use metadata.epub; however, it only works if the epub defines a raster cover.

kiwidude · 05-25-2011, 04:01 PM

Cool, at least what I understood of it was correct.

It isn't masses of code to duplicate in either case, but it would be silly of me to do so if my approach should be changed to get a more direct reuse without adding to the potential future maintenance burden.

I guess what I had in mind is that as a user I want the ability to stick my latest howdy doody cover I downloaded into Calibre on the front of my ePub.

As I understand it (which could be utterly wrong) at some point in the history of the book it needs to be converted using Calibre to get the special cover xhtml page as defined in the guide and identified with the metadata tags in that page. I "think" this is what you call the "raster cover"? From that point on, it is possible for that cover to be overwritten when using save to disk etc in the target copy of the ePub.

What that doesn't do is update the cover on the copy of the ePub in your library. For that you have to reconvert the book again (or presumably reimport it back from your save target). And you would also have to do a conversion for a first time book.

So given this plugin is considerably about "avoiding full-on conversion" I was thinking it would be a desirable feature to handle both updating metadata for a previously converted book, and inserting the html/SVG wrapper if it hasn't been converted previously. Either way it means the ePub in your library now will have the latest image to match what you see in the pane and on your device.

That was the theory. Of course, that may be a pit of despair to attempt to implement.

I found/replicated the code to serialise through the metadata and cover to my worker processes, so the raw data is there. I'm guessing there are probably lots of nasty special cases that all those masses of code in Calibre over many years are taking into account that may make it not as "simple" as it sounds...

kovidgoyal · 05-25-2011, 05:53 PM

Most EPUBs created in most software have a raster cover, i.e. a jpg or png image that is the actual cover image, which is referenced in the first html file of the epub. In properly produced epubs, this image is unambiguously identified as the cover by a <meta> tag in the opf that refers to the manifest item corresponding to the cover.

If the epub lacks such unambiguous identification, there is no way to safely replace the cover. You have no way to know if the first html file is a cover, or normal content.

calibre assumes that if there is an entry in the <guide> of the OPF that points to the HTML with type="cover" then the HTML file can be replaced. Otherwise the cover is prepended by inserting a new HTML file at the beginning.

Yeah, covers in EPUB suck.
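The two checks described above can be sketched roughly as follows, assuming an OPF 2.0 file whose elements are in the usual opf namespace. The function name find_cover and the sample opf are hypothetical, and this is not calibre's actual implementation:

```python
import xml.etree.ElementTree as ET

OPF = "{http://www.idpf.org/2007/opf}"

SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata>
    <meta name="cover" content="cover-img"/>
  </metadata>
  <manifest>
    <item id="cover-img" href="Images/cover.jpg" media-type="image/jpeg"/>
    <item id="cover-page" href="Text/cover.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <guide>
    <reference type="cover" title="Cover" href="Text/cover.xhtml"/>
  </guide>
</package>"""


def find_cover(opf_xml):
    """Return (raster_href, cover_page_href); either may be None.

    raster_href: manifest item named by <meta name="cover" content="ID"/>.
    cover_page_href: the <guide> reference with type="cover".
    """
    root = ET.fromstring(opf_xml)
    manifest = {i.get("id"): i.get("href") for i in root.iter(OPF + "item")}

    raster = None
    for meta in root.iter(OPF + "meta"):
        if meta.get("name") == "cover":
            # content holds a manifest id, not a path
            raster = manifest.get(meta.get("content"))
            break

    page = None
    for ref in root.iter(OPF + "reference"):
        if (ref.get("type") or "").lower() == "cover":
            page = ref.get("href")
            break
    return raster, page


raster, page = find_cover(SAMPLE_OPF)
print(raster, page)
```

If raster comes back as None the image cannot be swapped safely, and if page is None there is no HTML wrapper that can be assumed replaceable, which matches the two cases described in the post.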

kiwidude · 05-25-2011, 07:53 PM

Ahhh, thx for filling in a few gaps Kovid. Painful but doable by the sounds of it.

Since the Find Duplicates thread worked pretty well (I thought at least) as a combination of technical discussions and beta versions I have renamed this thread and will attempt the same here.

Attached is the version of the plugin as it stands with the functionality you can see in the screenshot. I have run it against my own library and am happy with the modifications it made. But as with anything if in doubt make a backup of the ePub you are modifying first or copy it to a test library if you want to be 100% sure that if it did do something unwanted you can reverse it.

You are prompted before it updates your library with the modified epub, and you can use the path from the log file to view that modified version manually before clicking yes if you want to see what it has done first.

As always, feedback and suggestions appreciated.
