AZW Conversion Output Plugin

tylau0 · 07-20-2011, 07:40 AM

This plugin replaces the default mobi periodical generation routine with one that uses the kindlegen program available from Amazon.com. The project is motivated by the fact that the section/article view introduced in Kindle firmware 3.1 does not work properly with mobi periodicals generated by the default Calibre routine: after reading an article and pressing the "back" button on the device to return to that view, the pointer does not land on the last article read (details). Using kindlegen is a viable solution the community has come up with so far (details). This plugin ports that solution into the Calibre plugin framework. After installing it, simply set the output format to azw, either on the command line or in the Calibre graphical interface, to generate periodicals without the problem described above.

History:
v1.0.5 [2014/12/07] - Now compatible with Calibre version 2.12. Note I can no longer generate periodicals using the latest Kindlegen (2.9).
v1.0.4 [2011/07/23] - Now compatible with Calibre version 0.8.11.
v1.0.3 [2011/07/22] - Remove sections that are empty (Support Calibre version up to 0.8.10)
v1.0.2 [2011/07/22] - Use kindlestrip (by Paul Durrant) to trim down the result file
v1.0.1 [2011/07/21] - Fix a few typesetting problems (using calibre's own routine) and make it work with one-feed recipe
v1.0.0 [2011/07/20] - Basic done

TODO:
* Rely more on code from the Calibre source, so that the plugin stays in sync with any updates in Calibre.

Latest news (2011/09/08):
There is already a native solution to the problem. Please check here for details.

kovidgoyal · 07-20-2011, 11:29 AM

No need, I started working on MOBI indexing a few days ago. Hopefully I will be able to figure out the problem. I've written code that decompiles the MOBI, including all indexing information which should allow me to see what the differences between kindlegen generated periodicals and calibre ones are. You can run it with

calibre-debug --inspect-mobi filename.mobi

You will need to be running from latest calibre source for this to work.
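For anyone wanting to run the inspection over several files, a minimal Python sketch follows. It only wraps the command quoted above; the function names are hypothetical, and it assumes calibre-debug is on the PATH and comes from a calibre build recent enough to support --inspect-mobi.

```python
import shutil
import subprocess


def inspect_mobi_command(path, exe="calibre-debug"):
    """Build the argument list for the inspection command quoted above."""
    return [exe, "--inspect-mobi", path]


def inspect_mobi(path):
    """Run the inspection; return the exit code, or None if calibre-debug is not found."""
    cmd = inspect_mobi_command(path)
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd).returncode


if __name__ == "__main__":
    # filename.mobi is a placeholder, as in the post above.
    print(" ".join(inspect_mobi_command("filename.mobi")))
```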

nickredding · 07-20-2011, 03:11 PM

kovid - I'm 99% sure the issue includes the trailing byte sequence following each HTML record. I have been unable to fully decode it because Kindlegen seems to insert (in an apparently inconsistent way) some arbitrary bytes in some of the sequences and I haven't been able to determine what the logic is. However, if you take a Kindlegen-generated document (which works properly on Kindle in Sections & Articles view) and zero out the trailing byte sequences, the document still displays properly on Kindle except now it exhibits the impaired 'back' function in the Sections & Articles view. So the trailing byte sequences are definitely part of the puzzle.

On the other hand, if you look at Amazon generated periodicals (e.g. the New York Times) the trailing byte sequences are consistent and reflect the changes I made to the MOBI code a few months ago (and forwarded to you). However, in those files the NCX entries have additional data bytes that I cannot decode, but may be associated with the issue.

This suggests Amazon is NOT using Kindlegen to format periodicals (not surprising, actually because Kindlegen is a piece of cr*p).

kovidgoyal · 07-20-2011, 03:36 PM

Yeah, I've already figured out it's the trailing byte sequences. I'm working on decoding them now.

I'm currently working off kindlegen generated mobi files, and I've completely deciphered the index, CNCX and TAGX records for those. The trailing byte sequences are still opaque to me, but I don't think they will prove very hard to decode.

Hopefully, understanding and duplicating what kindlegen does with the TBS sequences will allow calibre periodicals to work properly.
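To look at the end of each record by hand, something like the following Python sketch could be a starting point. It is not calibre's code. It assumes the usual Palm database container layout that MOBI files use (a 78-byte header with a 16-bit record count at offset 76, followed by 8-byte record entries that each begin with a 32-bit offset), and it does not try to decode the trailing bytes; it only shows the last few bytes of a record so that two files can be compared. The function names are hypothetical.

```python
import struct


def palmdb_record_spans(data):
    """Return a list of (start, end) byte offsets, one per record."""
    (count,) = struct.unpack_from(">H", data, 76)
    # Each 8-byte entry starts with the record's 32-bit big-endian offset.
    offsets = [struct.unpack_from(">L", data, 78 + 8 * i)[0] for i in range(count)]
    offsets.append(len(data))  # the last record runs to the end of the file
    return list(zip(offsets[:-1], offsets[1:]))


def record_tail(data, index, n=8):
    """Return up to the last n bytes of the record at the given index."""
    start, end = palmdb_record_spans(data)[index]
    return data[max(start, end - n):end]


if __name__ == "__main__":
    # Synthetic two-record container, just to exercise the functions.
    header = bytearray(78)
    struct.pack_into(">H", header, 76, 2)
    entries = struct.pack(">LL", 94, 0) + struct.pack(">LL", 99, 1)
    data = bytes(header) + entries + b"hello" + b"world!!"
    print(palmdb_record_spans(data))
    print(record_tail(data, 1, 3))
```

Comparing the tails of the same text record in a kindlegen file and a calibre file is one way to see where the two differ.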

oneillpt · 07-20-2011, 08:45 PM

Quote:

Originally Posted by tylau0

This plugin overrides the default mobi periodical generation routine with another that makes use of the kindlegen program available at Amazon.com

A plugin or updated mobi generation is clearly the way to go, as this will allow automatic uploading on connection of the Kindle.

If you are continuing to work on this while waiting for an update of Calibre with mobi indexing as desired, or if the feedback is of any use for any other projects you may have under way: I have tested the plugin with six recipes for which Calibre generates correct epub and mobi output (although of course without proper back button behaviour), using both the current and previous versions of kindlegen. Unfortunately the recipe I included to test the masthead was one of those which failed, so I could just as well have tested with the current version only.

Four of the six recipes generated azw output files, the other two failed. Of the four which produced azw output, two had correct back button behaviour, the other two produced azw files which could be viewed with Kindle for PC, but opened on the Kindle itself showing a table of contents but with a message box which displayed "The selected item could not be opened. If you purchased this item from Amazon, delete the item and download it from Archived Items." More comments on this below.

I tested using the ebook-convert command line with "--test -vv --debug-pipeline" to generate small e-books, and generated epub, azw and mobi versions to compare. In one case, the first of the four extracted articles had lost some text in the azw output compared to the epub and mobi versions.

The recipe used was:

Spoiler:

Code:

import re
from calibre.web.feeds.news import BasicNewsRecipe

class IrishTimes(BasicNewsRecipe):
    title          = u'The Irish TimesAZW'
    encoding  = 'ISO-8859-1'
    __author__    = "Derry FitzGerald, Ray Kinsella, David O'Callaghan and Phil Burns"
    language = 'en_IE'
    timefmt = ' (%A, %B %d, %Y)'


    oldest_article = 1.0
    max_articles_per_feed  = 100
    no_stylesheets = True
    simultaneous_downloads= 5

    # Raw string, so the backslash escapes reach the regex engine unchanged
    r = re.compile(r'.*(?P<url>http:\/\/(www.irishtimes.com)|(rss.feedsportal.com\/c)\/.*\.html?).*')
    remove_tags    = [dict(name='div', attrs={'class':'footer'})]
    extra_css      = 'p, div { margin: 0pt; border: 0pt; text-indent: 0.5em } .headline {font-size: large;} \n .fact { padding-top: 10pt  }'

    feeds          = [
                      ('Frontpage', 'http://www.irishtimes.com/feeds/rss/newspaper/index.rss'),
                      ('Ireland', 'http://www.irishtimes.com/feeds/rss/newspaper/ireland.rss'),
                      ('World', 'http://www.irishtimes.com/feeds/rss/newspaper/world.rss'),
                      ('Finance', 'http://www.irishtimes.com/feeds/rss/newspaper/finance.rss'),
                      ('Features', 'http://www.irishtimes.com/feeds/rss/newspaper/features.rss'),
                      ('Sport', 'http://www.irishtimes.com/feeds/rss/newspaper/sport.rss'),
                      ('Opinion', 'http://www.irishtimes.com/feeds/rss/newspaper/opinion.rss'),
                      ('Letters', 'http://www.irishtimes.com/feeds/rss/newspaper/letters.rss'),
                      ('Magazine', 'http://www.irishtimes.com/feeds/rss/newspaper/magazine.rss'),
                      ('Health', 'http://www.irishtimes.com/feeds/rss/newspaper/health.rss'),
                      ('Education & Parenting', 'http://www.irishtimes.com/feeds/rss/newspaper/education.rss'),
                      ('Motors', 'http://www.irishtimes.com/feeds/rss/newspaper/motors.rss'),
                      ('An Teanga Bheo', 'http://www.irishtimes.com/feeds/rss/newspaper/anteangabheo.rss'),
                      ('Commercial Property', 'http://www.irishtimes.com/feeds/rss/newspaper/commercialproperty.rss'),
                      ('Science Today', 'http://www.irishtimes.com/feeds/rss/newspaper/sciencetoday.rss'),
                      ('Property', 'http://www.irishtimes.com/feeds/rss/newspaper/property.rss'),
                      ('The Tickets', 'http://www.irishtimes.com/feeds/rss/newspaper/theticket.rss'),
                      ('Weekend', 'http://www.irishtimes.com/feeds/rss/newspaper/weekend.rss'),
                      ('News features', 'http://www.irishtimes.com/feeds/rss/newspaper/newsfeatures.rss'),
                      ('Obituaries', 'http://www.irishtimes.com/feeds/rss/newspaper/obituaries.rss'),
                    ]


    def print_version(self, url):
        if url.count('rss.feedsportal.com'):
            u = url.replace('0Bhtml/story01.htm','_pf0Bhtml/story01.htm')
        else:
            u = url.replace('.html','_pf.html')
        return u

    def get_article_url(self, article):
        return article.link
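The three test conversions described above could be scripted along these lines. This is only a sketch: the recipe file name, output stem and debug folder are hypothetical, the options are the ones quoted earlier in this post ("--test -vv --debug-pipeline", with --debug-pipeline given a folder to write to), and the azw output only works once the plugin is installed.

```python
def convert_commands(recipe, stem, debug_root="debug"):
    """Build one ebook-convert command per output format to compare."""
    commands = []
    for ext in ("epub", "azw", "mobi"):
        commands.append([
            "ebook-convert", recipe, f"{stem}.{ext}",
            "--test", "-vv",
            # A separate debug folder per format keeps the intermediate html apart.
            "--debug-pipeline", f"{debug_root}/{ext}",
        ])
    return commands


if __name__ == "__main__":
    for cmd in convert_commands("irish_times.recipe", "irish_times"):
        print(" ".join(cmd))
```

Each command can then be passed to subprocess.run, and the per-format debug folders compared to see where the text goes missing.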

The second recipe which produced a useable azw file (loss of text not noticed in this case, but possible of course when more articles are extracted) was:

Spoiler:

The third azw file producing recipe, with problems described above, was:

Spoiler:

This recipe failed at first to produce an azw file, as it was an initial version returning the complete page. The faulty azw file was only generated when the keep_only_tags and remove_tags were added to restrict the text extracted. I found with nickredding's code that more azw files were generated, but the extra azw files (beyond the first two which worked here) also were faulty and showed the same message box.

The fourth recipe which produced a faulty azw file was:

Spoiler:

The two recipes which completely failed were:

Spoiler:

Code:

import re
from calibre import strftime
from time import gmtime
from calibre.web.feeds.news import BasicNewsRecipe

class HaaretzPrint_en(BasicNewsRecipe):
    title                 = 'Haaretz - print editAZW'
    __author__            = 'Darko Miletic'
    description           = "Haaretz.com is the world's leading English-language Website for real-time news and analysis of Israel and the Middle East."
    publisher             = 'Haaretz'
    category              = "news, Haaretz, Israel news, Israel newspapers, Israel business news, Israel financial news, Israeli news,Israeli newspaper, Israeli newspapers, news from Israel, news in Israel, news Israel, news on Israel, newspaper Israel, Israel sports news, Israel diplomacy news"
    oldest_article        = 2
    max_articles_per_feed = 25
    no_stylesheets        = True
    encoding              = 'utf8'
    use_embedded_content  = False
    language              = 'en_IL'
    publication_type      = 'newspaper'
    PREFIX                = 'http://www.haaretz.com'
    masthead_url          = PREFIX + '/images/logos/logoGrey.gif'
    extra_css             = ' body{font-family: Verdana,Arial,Helvetica,sans-serif } '

    preprocess_regexps = [(re.compile(r'</body>.*?</html>', re.DOTALL|re.IGNORECASE),lambda match: '</body></html>')]

    conversion_options = {
                          'comment'  : description
                        , 'tags'     : category
                        , 'publisher': publisher
                        , 'language' : language
                        }

    keep_only_tags    = [dict(attrs={'id':'threecolumns'})]
    remove_attributes = ['width','height']
    remove_tags       = [
                           dict(name=['iframe','link','object','embed'])
                          ,dict(name='div',attrs={'class':'rightcol'})
                        ]


    feeds = [
              (u'News'          , PREFIX + u'/print-edition/news'         )
             ,(u'Opinion'       , PREFIX + u'/print-edition/opinion'      )
             ,(u'International', PREFIX + u'/news/international'      )
             ,(u'Defense and Diplomacy', PREFIX + u'/news/diplomacy-defense'      )
             ,(u'Features'      , PREFIX + u'/print-edition/features'     )
             ,(u'Business'      , PREFIX + u'/print-edition/business'     )
             ,(u'Real estate'   , PREFIX + u'/print-edition/real-estate'  )
             ,(u'Sports'        , PREFIX + u'/print-edition/sports'       )
             ,(u'Travel'        , PREFIX + u'/print-edition/travel'       )
             ,(u'Books'         , PREFIX + u'/print-edition/books'        )
             ,(u'Food & Wine'   , PREFIX + u'/print-edition/food-wine'    )
             ,(u'Arts & Leisure', PREFIX + u'/print-edition/arts-leisure' )
             #,(u'A Special Place in Hell', PREFIX + u'/blogs/a-special-place-in-hell'     )
             #,(u'Strenger than Fiction', PREFIX + u'/blogs/strenger-than-fiction'     )
             #,(u'MESS Report'      , PREFIX + u'/blogs/mess-report'     )
            ]


    def print_version(self, url):
        article = url.rpartition('/')[2]
        return 'http://www.haaretz.com/misc/article-print-page/' + article

    def parse_index(self):
        totalfeeds = []
        lfeeds = self.get_feeds()
        for feedobj in lfeeds:
            feedtitle, feedurl = feedobj
            self.report_progress(0, _('Fetching feed')+' %s...'%(feedtitle if feedtitle else feedurl))
            articles = []
            soup = self.index_to_soup(feedurl)
            for item in soup.findAll(attrs={'class':'text'}):
                sp = item.find('span',attrs={'class':'h3 font-weight-normal'})
                desc = item.find('p')
                description = ''
                if sp:
                    if desc:
                       description = self.tag_to_string(desc)
                    link        = sp.a
                    url         = self.PREFIX + link['href']
                    title       = self.tag_to_string(link)
                    times        = strftime('%a, %d %b %Y %H:%M:%S +0000',gmtime())
                    articles.append({
                                          'title'      :title
                                         ,'date'       :times
                                         ,'url'        :url
                                         ,'description':description
                                        })
            totalfeeds.append((feedtitle, articles))
        return totalfeeds


    def preprocess_html(self, soup):
        for item in soup.findAll(style=True):
            del item['style']
        return soup

which could have tested the masthead with kindlegen 1.1, if it had generated output, and:

Spoiler:

As all six recipes produced epub and mobi versions, my suspicion is that the problem may be with the html extraction: either Calibre removes content which would prove problematic and which is left in here (the lost text with the first recipe suggests that comparing the html extracted by Calibre and by the plugin could be useful; I will report if I find anything of interest in this respect), or kindlegen is simply more sensitive to unwanted or unsupported html than ebook-convert. kindlegen seems to be based on MobiPocket mobigen, which I called without difficulty in my own extended version of the MobiPocket webcompanion. I continued to develop and use that after Amazon bought MobiPocket and dropped the webcompanion, until I bought a Kindle in January and started to use Calibre for news generation. I am therefore more inclined to suspect that it is something in the html passed to kindlegen which causes the failure: five of these six recipes are for publications which extracted without difficulty when I used mobigen in my own software.

tylau0 · 07-21-2011, 09:07 AM

Thanks to oneillpt for the extensive testing.
The missing text was my fault: I was deleting certain <p> tags in the opf file that had content inside.
All the recipes that were not working contain only one feed, a case my implementation did not handle before.
Attached is the modified CalibreKindlegenHelper.py. Use it to replace the one in azwplugin.zip. I'll do further extensive testing soon and pack it into the plugin.

tylau0 · 07-21-2011, 11:07 AM

Check the updated plugin at the top of this thread. It should have all the problems you mentioned fixed.

Thanks again.

P.S. Thanks Kovid and nickredding for working on a Calibre self-contained solution. I am looking forward to that clean fix!

oneillpt · 07-21-2011, 09:17 PM

Quote:

Originally Posted by tylau0

Check the updated plugin at the top of this thread. It should have all the problems you mentioned fixed.

Thanks again.

P.S. Thanks Kovid and nickredding for working on a Calibre self-contained solution. I am looking forward to that clean fix!

I've checked all six recipes I tried, and all now work correctly with proper navigation. I've also verified that masthead images are processed if kindlegen 1.1 is used rather than 1.2. The release notes for 1.2 mention "Bug fixes from older versions of Kindlegen", but I'm going to change over to use the plugin with kindlegen 1.1 for all my other news feeds too and see how it goes. I will report any problems found on this thread.

One change which I would suggest, and which I will try out for myself tomorrow, is to add a compression setting for kindlegen in the plugin. The azw files from kindlegen weigh in at nearly twice the size of the mobi version generated from the same recipe. My Depeche du Midi azw file now comes in at 18 MB for example!

Many thanks for this very useful plugin!

tylau0 · 07-22-2011, 12:07 PM

I have adopted the code from Kindlestrip, which trims the file size by half. Please check the top post for the updated plugin. Thanks.

oneillpt · 07-22-2011, 06:01 PM

Quote:

Originally Posted by tylau0

I have adopted the code from Kindlestrip, which trims the file size by half. Please check the top post for the updated plugin. Thanks.

Thanks. Even modified to use kindlegen with -c2 the azw files were still coming out about 25% larger than the corresponding mobi version. I'll try this version next.
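For reference, a minimal sketch of how a compression level might be passed through to kindlegen from Python. This is not the plugin's code: the function name and file names are hypothetical, -c2 is the flag mentioned above, and the -c0/-c1 levels and the -o output option are assumed from kindlegen's own usage text.

```python
def kindlegen_command(opf_path, output_name, compression=2, exe="kindlegen"):
    """Build a kindlegen command line with the requested compression level."""
    if compression not in (0, 1, 2):
        raise ValueError("compression must be 0, 1 or 2")
    return [exe, opf_path, f"-c{compression}", "-o", output_name]


if __name__ == "__main__":
    print(" ".join(kindlegen_command("periodical.opf", "periodical.mobi")))
```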

I still find that the "The selected item could not be opened. If you purchased ..." message box can occur, although now in a way which does not prevent use of the ebook. It occurs with an extended version of one of the recipes I used earlier to test:

Spoiler:

In this case only the first feed shows any articles (38 at the moment), but the Kindle table of contents includes all the remaining feeds, showing zero articles for each, and the ebook text shows the name of each feed followed by the single line "RSS de diariodelaltoaragon.es" (this seems to be correct as browsing the rss feeds in a web browser gets this single line for these feeds too). Moving down the left (sections) column of the Kindle toc past that first feed to the second which shows zero articles gives the message box, forcing closure of the ebook. The same thing happens when on the last (Calibre Table of Contents) page when attempting to open the Kindle table of contents, requiring paging back or skipping to previous article before the Kindle table of contents can be accessed.

In this case I found this problem by accident - all sections in the Kindle table of contents were visible on screen at the same time, so there was no need to scroll down the sections. In another case, the list of sections required a second page to display, and while scrolling down through a series of sections with zero articles the right hand (articles) column displayed the first articles for the next section with articles, and no message box occurred.

oneillpt · 07-22-2011, 06:42 PM

The latest version now produces an azw file about 10% smaller than the corresponding mobi version.

tylau0 · 07-22-2011, 07:24 PM

v1.0.3 (available at the top post) removes sections without any article. That should fix the issue you raised.
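The idea is roughly the following (a simplified sketch rather than the plugin's actual code, assuming the sections are held as (title, articles) pairs like the ones a recipe's parse_index returns; the function name is hypothetical):

```python
def drop_empty_sections(sections):
    """Keep only the sections that contain at least one article."""
    return [(title, articles) for title, articles in sections if articles]


if __name__ == "__main__":
    sections = [
        ("First feed", [{"title": "Article 1", "url": "http://example.com/1"}]),
        ("Second feed", []),  # a feed that returned no articles
    ]
    print([title for title, _ in drop_empty_sections(sections)])
```

With the empty sections gone, the Kindle table of contents no longer has entries with zero articles to stumble over.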

oneillpt · 07-22-2011, 07:35 PM

A fix in just over 40 minutes. Impressive!

roadlesstraveled · 07-23-2011, 04:51 PM

I just tried to convert an epub file into AZW using your plugin but when the conversion gets to 67% "calibre-parallel.exe" crashes.

I'm not sure if it matters but I'm using the portable version of Calibre. Any advice?

Spoiler:

tylau0 · 07-23-2011, 09:57 PM

v1.0.4 at the top post may have solved the issue you raised. Please check. Thanks.

Quote:

Originally Posted by roadlesstraveled

I just tried to convert an epub file into AZW using your plugin but when the conversion gets to 67% "calibre-parallel.exe" crashes.

I'm not sure if it matters but I'm using the portable version of Calibre. Any advice?
