05-23-2011, 09:24 PM | #1 |
Calibre Plugins Developer
Posts: 4,686
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Modify ePub plugin dev thread
I'm working on the "Modify ePub" plugin, which offers various manipulations of an ePub to partner Quality Check, such as removing spurious files, legacy jackets etc.
I decided to use the Container class from calibre/ebooks/epub/fix/container.py as the basis for my manipulations. It is fairly basic, but that is an advantage for this plugin, where the intent is to change the ePub files as little as possible, taking care not to touch the CSS etc. With a little extension I have that working pretty well, so where necessary I can remove items from the manifest and the spine of the opf, as well as the actual file.
However, that does leave one aspect of the ePub file that I am not currently handling: the ncx TOC file. At this point the scope of my changes is only a potential desire to remove items from the TOC. Looking in the Calibre codebase, it seems there are several "TOC" type classes around the place (such as metadata/toc.py and in oeb/base.py). Would any of those be appropriate/useful for what I want to do?
Basically I want to parse an ncx file, remove an item from the TOC when its @src attribute matches a value, and then get the new structure back so I can persist it with the container.set() function. Any suggestions appreciated. |
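For readers following along, the manifest and spine removal described in this post can be sketched as below. This is only an illustration under stated assumptions: it uses the standard library's ElementTree rather than calibre's Container class, and `remove_from_opf` is a hypothetical helper name, not part of calibre.

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"


def remove_from_opf(root, href):
    """Drop manifest items with this href and any spine itemref pointing at them.

    Hypothetical helper: returns the ids of the manifest items removed.
    """
    manifest = root.find(f"{{{OPF_NS}}}manifest")
    spine = root.find(f"{{{OPF_NS}}}spine")
    removed_ids = []
    # Iterate over a copy, because we mutate the manifest while looping.
    for item in list(manifest):
        if item.get("href") == href:
            removed_ids.append(item.get("id"))
            manifest.remove(item)
    if spine is not None:
        for itemref in list(spine):
            if itemref.get("idref") in removed_ids:
                spine.remove(itemref)
    return removed_ids


sample_opf = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <manifest>
    <item id="jacket" href="Text/jacket1.xhtml" media-type="application/xhtml+xml"/>
    <item id="ch1" href="Text/chapter1.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="jacket"/>
    <itemref idref="ch1"/>
  </spine>
</package>"""

opf_root = ET.fromstring(sample_opf)
removed = remove_from_opf(opf_root, "Text/jacket1.xhtml")
```

Deleting the actual file from the zip and updating the ncx are separate steps that this sketch does not cover.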
05-23-2011, 09:34 PM | #2 |
creator of calibre
Posts: 44,546
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
If all you want to do is remove entries, I'd say just use lxml. The only tricky part will be making the path references absolute (I don't recall if ncx paths are relative to the dir containing the ncx or the root of the zip file).
|
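A minimal sketch of the path handling mentioned above, assuming NCX `src` values resolve relative to the directory containing the NCX file (worth verifying against the EPUB specification and your own books). `resolve_ncx_src` is a hypothetical helper name.

```python
import posixpath
from urllib.parse import unquote, urldefrag


def resolve_ncx_src(ncx_path, src):
    """Return (zip-relative path, fragment) for an NCX content src.

    Assumes src is relative to the directory holding the NCX file.
    """
    url, fragment = urldefrag(src)          # split off "#anchor" if present
    base = posixpath.dirname(ncx_path)      # zip entries always use "/"
    path = posixpath.normpath(posixpath.join(base, unquote(url)))
    return path, fragment
```

For example, `resolve_ncx_src("OEBPS/toc.ncx", "Text/jacket1.xhtml#heading_id_3")` gives a path that can be compared directly against manifest entries resolved the same way.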
05-23-2011, 09:43 PM | #3 |
Calibre Plugins Developer
Posts: 4,686
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
That was my first thought too, though then when I looked at a nested TOC structure I got scared.
For instance, say I have this (from an actual book): Code:
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <head>
    <meta name="dtb:uid" content="4784ca89-f128-4a73-92ed-b84ac4edb658"/>
    <meta name="dtb:depth" content="2"/>
    <meta name="dtb:totalPageCount" content="0"/>
    <meta name="dtb:maxPageNumber" content="0"/>
  </head>
  <docTitle>
    <text>Grave Surprise</text>
  </docTitle>
  <navMap>
    <navPoint id="navPoint-1" playOrder="1">
      <navLabel>
        <text>Grave Surprise</text>
      </navLabel>
      <content src="Text/jacket1.xhtml"/>
      <navPoint id="navPoint-2" playOrder="2">
        <navLabel>
          <text>Book Jacket</text>
        </navLabel>
        <content src="Text/jacket1.xhtml#heading_id_3"/>
      </navPoint>
    </navPoint>
    <navPoint id="navPoint-3" playOrder="3">
      <navLabel>
        <text>Grave Surprise</text>
      </navLabel>
      <content src="Text/jacket_split_000.xhtml"/>
      <navPoint id="navPoint-4" playOrder="4">
        <navLabel>
          <text>Book Jacket</text>
        </navLabel>
        <content src="Text/jacket_split_001.xhtml"/>
      </navPoint>
    </navPoint>
  </navMap>
</ncx>
However, how should I "fiddle" the structure if it is not the innermost navPoint that I am removing? |
05-23-2011, 09:47 PM | #4 |
creator of calibre
Posts: 44,546
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Code:
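NCX = 'http://www.daisy.org/z3986/2005/ncx/'
for navpoint in root.xpath('//ncx:navPoint', namespaces={'ncx': NCX}):
    if test_navpoint_for_removal(navpoint):
        p = navpoint.getparent()
        idx = p.index(navpoint)  # position the nested entries get promoted into
        p.remove(navpoint)
        # Re-insert only the nested navPoints, in order, where the parent was
        for child in reversed(navpoint.findall('{%s}navPoint' % NCX)):
            p.insert(idx, child)
|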
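A self-contained variant of the same idea, for anyone who wants to try it outside calibre. It deliberately swaps lxml for the standard library's ElementTree (which has no `getparent()`, hence the recursion from the parent down), and it adds a `playOrder` renumbering pass that the snippet above does not include. `prune_navpoints` and `renumber_play_order` are hypothetical names, and the sample is a trimmed version of the NCX posted earlier.

```python
import xml.etree.ElementTree as ET

NCX_NS = "http://www.daisy.org/z3986/2005/ncx/"
NAVPOINT = f"{{{NCX_NS}}}navPoint"
CONTENT = f"{{{NCX_NS}}}content"


def prune_navpoints(parent, should_remove):
    """Remove matching navPoints under parent, promoting their nested navPoints."""
    i = 0
    while i < len(parent):
        child = parent[i]
        if child.tag != NAVPOINT:
            i += 1
            continue
        # Prune the subtree first so anything promoted is already clean.
        prune_navpoints(child, should_remove)
        content = child.find(CONTENT)
        src = content.get("src", "") if content is not None else ""
        if should_remove(src):
            nested = [c for c in child if c.tag == NAVPOINT]
            parent.remove(child)
            for offset, nav in enumerate(nested):
                parent.insert(i + offset, nav)
            i += len(nested)
        else:
            i += 1


def renumber_play_order(root):
    """Rewrite playOrder sequentially in document order."""
    for order, nav in enumerate(root.iter(NAVPOINT), start=1):
        nav.set("playOrder", str(order))


SAMPLE_NCX = """<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <navMap>
    <navPoint id="navPoint-1" playOrder="1">
      <navLabel><text>Grave Surprise</text></navLabel>
      <content src="Text/jacket1.xhtml"/>
      <navPoint id="navPoint-2" playOrder="2">
        <navLabel><text>Book Jacket</text></navLabel>
        <content src="Text/jacket1.xhtml#heading_id_3"/>
      </navPoint>
    </navPoint>
    <navPoint id="navPoint-3" playOrder="3">
      <navLabel><text>Grave Surprise</text></navLabel>
      <content src="Text/jacket_split_000.xhtml"/>
      <navPoint id="navPoint-4" playOrder="4">
        <navLabel><text>Book Jacket</text></navLabel>
        <content src="Text/jacket_split_001.xhtml"/>
      </navPoint>
    </navPoint>
  </navMap>
</ncx>"""

ncx_root = ET.fromstring(SAMPLE_NCX)
nav_map = ncx_root.find(f"{{{NCX_NS}}}navMap")
# Remove the outer entry only; its nested navPoint moves up a level.
prune_navpoints(nav_map, lambda src: src == "Text/jacket_split_000.xhtml")
renumber_play_order(ncx_root)
```

Whether to promote nested entries or drop them along with their parent is a design choice; promotion keeps links to files that still exist in the book.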
05-23-2011, 09:50 PM | #5 |
Calibre Plugins Developer
Posts: 4,686
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
That was rather quick - come across this problem before?
Awesome, thx Kovid. I shall experiment a bit with that tomorrow. |
05-23-2011, 11:36 PM | #6 |
creator of calibre
Posts: 44,546
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
|
05-24-2011, 09:35 PM | #7 |
Grand Sorcerer
Posts: 6,224
Karma: 16536676
Join Date: Sep 2009
Location: UK
Device: Kobo: KA1, ClaraHD, Forma, Libra2, Clara2E. PocketBook: TouchHD3
|
I don't know whether this is relevant to your plans but someone in the epub forum has written an epub utility, called epubFixer, which already has some rather nice facilities for manipulating toc.ncx. Here's a link if you want to know more.
It's obviously not a Calibre plugin, but it can easily be run using your 'Open With' plugin |
05-25-2011, 05:40 AM | #8 |
Calibre Plugins Developer
Posts: 4,686
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Hi Jackie,
Thanks for the link. From my quick look at it, there isn't much overlap between the two. The tool you linked to looks like something that should or will be functionality in Sigil, as the majority of what I saw listed involves interactive editing of the ePub.
There was only one automated function I saw that would be applicable in theory, and that was zeroing margins. However, even that I think is too crude a thing to do automatically and instead needs to be done manually IMHO; otherwise you will lose any appropriately margined subparagraphs. It is using a sledgehammer to crack a nut and expecting to get something edible afterwards. If you looked at the book first to know that it was safe to do so then the function could be used, but again that sounds like it should be part of Sigil.
This plugin is about applying changes in bulk in a non-interactive fashion, other than choosing your set of changes on a screen before it begins, a bit like converting files in a sense. So there are certain things that Quality Check can detect in bulk that it then makes sense to fix in bulk. So far it does things like removing legacy or all jackets, zeroing xpgt margins, removing missing manifest entries, removing/adding unmanifested files, removing iTunes files and removing calibre bookmarks. All of these have a matching search equivalent in Quality Check.
Longer term I would think this plugin could have features to add/update the cover image, update the book metadata, add/update a jacket etc. The plumbing is there. The biggest obstacle to doing these now is that the calibre code for these features all assumes an oeb object. I instead chose to load the book into the more lightweight container object, to be guaranteed that the disruption to the ePub content was minimal. So I either need to replicate the calibre code, or find a way to combine the approaches, such as also generating an oeb object to call calibre functions, and then copying the bits it has generated into my container.
Perhaps Kovid or someone who knows the conversion pipeline may have some thoughts or suggestions on this. |
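As an illustration of the "missing manifest entries" and "unmanifested files" checks listed in this post, here is a rough sketch. It is not the plugin's code: `compare_manifest` is a hypothetical name, and the rule for which zip entries to ignore (`mimetype`, `META-INF/` and the OPF itself) is an assumption.

```python
import posixpath
import xml.etree.ElementTree as ET
from urllib.parse import unquote

OPF_NS = "http://www.idpf.org/2007/opf"


def compare_manifest(opf_path, opf_xml, zip_names):
    """Return (missing, unmanifested) lists of zip-relative paths.

    missing: listed in the manifest but absent from the zip.
    unmanifested: present in the zip but not listed in the manifest.
    """
    base = posixpath.dirname(opf_path)
    root = ET.fromstring(opf_xml)
    manifested = set()
    for item in root.iter(f"{{{OPF_NS}}}item"):
        href = unquote(item.get("href", "").split("#")[0])
        manifested.add(posixpath.normpath(posixpath.join(base, href)))
    # Assumption: these entries are never expected in the manifest.
    present = {
        name for name in zip_names
        if not name.endswith("/")
        and not name.startswith("META-INF/")
        and name not in ("mimetype", opf_path)
    }
    return sorted(manifested - present), sorted(present - manifested)
```

In practice `zip_names` would come from `zipfile.ZipFile(path).namelist()`, and the caller decides whether an unmanifested file should be deleted or added to the manifest.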
05-25-2011, 10:22 AM | #9 | |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
For a number of candidate changes all you really need is an iterator to walk through the 'text' items in the manifest and then pass those files to various enabled/applicable preprocess/look and feel functions one by one - they're all just text manipulation functions in the earlier stages. After it's gone through all those functions just make sure it's still valid xhtml (there is a function for this somewhere already) and write the new version back. I'm not sure if there is anything special in the regular conversion pipeline to determine which files are valid text elements. |
|
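The loop described above might look roughly like this. It is a sketch under assumptions: the book is modelled as a plain dict of href to text rather than a calibre container, the transforms are arbitrary `str -> str` callables rather than calibre's preprocess functions, and the validity check is only an XML well-formedness test. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"
TEXT_TYPES = {"application/xhtml+xml", "text/html"}


def iter_text_hrefs(opf_xml):
    """Yield the href of every manifest item with a text media type."""
    root = ET.fromstring(opf_xml)
    for item in root.iter(f"{{{OPF_NS}}}item"):
        if item.get("media-type") in TEXT_TYPES:
            yield item.get("href")


def is_well_formed(markup):
    """Crude check: does it parse as XML? (Named entities such as &nbsp; will fail.)"""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


def apply_transforms(files, opf_xml, transforms):
    """Run each transform over each text file; keep results that still parse."""
    changed = []
    for href in iter_text_hrefs(opf_xml):
        original = files[href]
        text = original
        for transform in transforms:
            text = transform(text)
        if text != original and is_well_formed(text):
            files[href] = text
            changed.append(href)
    return changed
```

Skipping files whose output no longer parses, rather than writing them back, keeps a bad transform from damaging the book.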
05-25-2011, 12:34 PM | #10 |
creator of calibre
Posts: 44,546
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
You should be able to re-use calibre code to update metadata/covers; that does not depend on OEB. There is no code in calibre to update existing jackets, just to replace an old one with a new one.
|
05-25-2011, 02:25 PM | #11 | |
Calibre Plugins Developer
Posts: 4,686
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
The jacket.py file in the same namespace also has dependencies on the oeb manifest from the Jacket class. The module level functions I could call, but again I think I would be replicating logic from the Jacket class. Which is all fine if that is what you meant and recommend - but perhaps I am missing something more obvious |
|
05-25-2011, 04:15 PM | #12 |
creator of calibre
Posts: 44,546
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Ah, you are talking about manipulating the HTML/SVG wrapper around the cover. Yeah, that would require code duplication. I thought you just meant replacing the cover in the epub, for which you can use metadata.epub; however, it only works if the epub defines a raster cover.
|
05-25-2011, 05:01 PM | #13 |
Calibre Plugins Developer
Posts: 4,686
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Cool, at least what I understood of it was correct. It isn't masses of code to duplicate in either case, but it would be silly of me to do so if my approach should be changed to get a more direct reuse without adding to the potential future maintenance burden.
I guess what I had in mind is that as a user I want the ability to stick the latest howdy doody cover I downloaded into Calibre on the front of my ePub. As I understand it (which could be utterly wrong), at some point in the history of the book it needs to be converted using Calibre to get the special cover xhtml page, as defined in the guide and identified with the metadata tags in that page. I "think" this is what you call the "raster cover"? From that point on, it is possible for that cover to be overwritten in the target copy of the ePub when using save to disk etc. What that doesn't do is update the cover on the copy of the ePub in your library. For that you have to convert the book again (or presumably reimport it from your save target). And you would also have to do a conversion for a first-time book.
So given this plugin is largely about "avoiding full-on conversion", I was thinking it would be a desirable feature to handle both updating metadata for a previously converted book, and inserting the html/SVG wrapper if it hasn't been converted previously. Either way it means the ePub in your library will now have the latest image to match what you see in the pane and on your device.
That was the theory. Of course, that may be a pit of despair to attempt to implement. I found/replicated the code to serialise the metadata and cover through to my worker processes, so the raw data is there. I'm guessing there are probably lots of nasty special cases that all those masses of code in Calibre over many years are taking into account that may make it not as "simple" as it sounds... |
05-25-2011, 06:53 PM | #14 |
creator of calibre
Posts: 44,546
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Most EPUBs created in most software have a raster cover, i.e. a jpg or png image that is the actual cover image, which is referenced in the first html file of the epub. In properly produced epubs, this image is unambiguously identified as the cover by a <meta> tag in the OPF that refers to the manifest item corresponding to the cover.
If the epub lacks such unambiguous identification, there is no way to safely replace the cover. You have no way to know if the first html file is a cover, or normal content. calibre assumes that if there is an entry in the <guide> of the OPF that points to the HTML with type="cover" then the HTML file can be replaced. Otherwise the cover is prepended by inserting a new HTML file at the beginning. Yeah, covers in EPUB suck. |
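The two checks described above can be sketched as follows. This assumes the common OPF 2 convention of `<meta name="cover" content="manifest-id"/>` for the raster image and a `<reference type="cover">` entry in the guide for the HTML wrapper; it is not calibre's own detection code, and `find_cover` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"


def find_cover(opf_xml):
    """Return (raster_href, wrapper_href); either may be None if not declared."""
    root = ET.fromstring(opf_xml)
    manifest = {
        item.get("id"): item.get("href")
        for item in root.iter(f"{{{OPF_NS}}}item")
    }
    raster = None
    for meta in root.iter(f"{{{OPF_NS}}}meta"):
        if meta.get("name") == "cover":
            # content holds the id of the manifest item for the cover image
            raster = manifest.get(meta.get("content"))
            break
    wrapper = None
    for ref in root.iter(f"{{{OPF_NS}}}reference"):
        if (ref.get("type") or "").lower() == "cover":
            wrapper = ref.get("href")
            break
    return raster, wrapper
```

If both values come back as `None`, the book gives no unambiguous hint, which is the case where replacing the first HTML file would be unsafe.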
05-25-2011, 08:53 PM | #15 |
Calibre Plugins Developer
Posts: 4,686
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Beta v0.1
Ahhh, thx for filling in a few gaps Kovid. Painful but doable by the sounds of it.
Since the Find Duplicates thread worked pretty well (I thought at least) as a combination of technical discussions and beta versions, I have renamed this thread and will attempt the same here.
Attached is the version of the plugin as it stands, with the functionality you can see in the screenshot. I have run it against my own library and am happy with the modifications it made. But as with anything, if in doubt make a backup of the ePub you are modifying first, or copy it to a test library, if you want to be 100% sure that you can reverse anything unwanted. You are prompted before it updates your library with the modified epub, and you can use the path from the log file to view that modified version manually before clicking yes if you want to see what it has done first.
As always, feedback and suggestions appreciated.
Last edited by kiwidude; 05-31-2011 at 04:13 PM. |
|