Old 07-20-2011, 08:40 AM   #1
tylau0
Connoisseur
tylau0 began at the beginning.
 
Posts: 82
Karma: 10
Join Date: Oct 2010
Device: Kindle
[Conversion Output Plugin] AZW output by kindlegen for periodicals

This plugin overrides the default mobi periodical generation routine with one that uses the kindlegen program available from Amazon.com (here). The project is motivated by the fact that the section/article view introduced in Kindle firmware 3.1 does not work properly with mobi periodicals generated by the default Calibre routine: after reading an article and pressing the "back" button on the device to return to that view, the pointer does not land on the last article read (details). Using kindlegen is a viable solution that the community has come up with so far (details). This plugin ports that solution into the Calibre plugin framework. After installing it, simply specify azw as the output format, either on the command line or in the Calibre graphical interface, to generate periodicals without the problem described above.

History:
v1.0.5 [2014/12/07] - Now compatible with Calibre version 2.12. Note that I can no longer generate periodicals using the latest Kindlegen (2.9).
v1.0.4 [2011/07/23] - Now compatible with Calibre version 0.8.11.
v1.0.3 [2011/07/22] - Remove sections that are empty (Support Calibre version up to 0.8.10)
v1.0.2 [2011/07/22] - Use kindlestrip (by Paul Durrant) to trim down the result file
v1.0.1 [2011/07/21] - Fix a few typesetting problems (using calibre's own routine) and make it work with one-feed recipe
v1.0.0 [2011/07/20] - Basic done

TODO:
* Rely more on code from the Calibre source, so that the plugin stays in sync with any updates in Calibre.

Latest news (2011/09/08):
There is already a native solution to the problem. Please check here for details.
Attached Files
File Type: zip azw-v101.zip (54.9 KB, 1727 views)
File Type: zip azw-v102.zip (57.5 KB, 1317 views)
File Type: zip azw-v103.zip (57.5 KB, 1288 views)
File Type: zip azw-v104.zip (10.4 KB, 2314 views)
File Type: zip azw-v105.zip (9.3 KB, 1493 views)

Last edited by tylau0; 12-07-2014 at 06:10 AM.
tylau0 is offline   Reply With Quote
Old 07-20-2011, 12:29 PM   #2
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
 
Posts: 44,559
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
No need, I started working on MOBI indexing a few days ago. Hopefully I will be able to figure out the problem. I've written code that decompiles the MOBI, including all indexing information which should allow me to see what the differences between kindlegen generated periodicals and calibre ones are. You can run it with

calibre-debug --inspect-mobi filename.mobi

You will need to be running from latest calibre source for this to work.
Old 07-20-2011, 04:11 PM   #3
nickredding
onlinenewsreader.net
nickredding knows the difference between 'who' and 'whom'
 
Posts: 324
Karma: 10143
Join Date: Dec 2009
Location: Phoenix, AZ & Victoria, BC
Device: Kindle 3, Kindle Fire, IPad3, iPhone4, Playbook, HTC Inspire
kovid - I'm 99% sure the issue includes the trailing byte sequence following each HTML record. I have been unable to fully decode it because Kindlegen seems to insert (in an apparently inconsistent way) some arbitrary bytes in some of the sequences, and I haven't been able to determine what the logic is. However, if you take a Kindlegen-generated document (which works properly on the Kindle in Sections & Articles view) and zero out the trailing byte sequences, the document still displays properly on the Kindle, except that it now exhibits the impaired 'back' function in the Sections & Articles view. So the trailing byte sequences are definitely part of the puzzle.

On the other hand, if you look at Amazon generated periodicals (e.g. the New York Times) the trailing byte sequences are consistent and reflect the changes I made to the MOBI code a few months ago (and forwarded to you). However, in those files the NCX entries have additional data bytes that I cannot decode, but may be associated with the issue.

This suggests Amazon is NOT using Kindlegen to format periodicals (not surprising, actually because Kindlegen is a piece of cr*p).
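
For readers who want to look at the trailing bytes themselves, here is a minimal Python sketch of how they can be separated from the text of a record. It is not taken from this thread: it assumes the layout described by community reverse-engineering of the MOBI format, in which the "extra data flags" value from the MOBI header says which trailing entries are present, each entry ends with its own size stored as a backward variable-width integer (the size counts the size bytes too), and bit 0 of the flags marks a short multibyte-overlap entry. The names `trailing_data_size` and `split_record` are hypothetical.

```python
def _backward_varint(record: bytes, end: int) -> int:
    """Decode a variable-width integer stored backwards, ending at record[end - 1]."""
    value = 0
    shift = 0
    pos = end
    while pos > 0:
        byte = record[pos - 1]
        value |= (byte & 0x7F) << shift
        shift += 7
        pos -= 1
        # Assumption: the high bit marks the first (most significant) byte.
        if byte & 0x80 or shift >= 28:
            break
    return value


def trailing_data_size(record: bytes, extra_flags: int) -> int:
    """Total number of trailing bytes at the end of one text record."""
    total = 0
    flags = extra_flags >> 1
    while flags:
        if flags & 1:
            # Each flagged entry reports its own size, size bytes included.
            total += _backward_varint(record, len(record) - total)
        flags >>= 1
    if extra_flags & 1:
        # Assumption: bit 0 is the multibyte-overlap entry (1 to 4 bytes).
        total += (record[len(record) - total - 1] & 0x3) + 1
    return total


def split_record(record: bytes, extra_flags: int):
    """Return (text, trailing_bytes) for one record."""
    size = trailing_data_size(record, extra_flags)
    if size > len(record):
        raise ValueError("trailing data larger than the record")
    cut = len(record) - size
    return record[:cut], record[cut:]


if __name__ == "__main__":
    # Synthetic record: text followed by a 3-byte trailing entry.
    text, tail = split_record(b"<p>text</p>" + b"\x00\x00\x83", 0b10)
    print(text, tail.hex())
```

Dumping the trailing part of each record from a kindlegen file and from a calibre file side by side is one way to compare the two, under the assumptions above.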
Old 07-20-2011, 04:36 PM   #4
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
 
Posts: 44,559
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Yeah, I've already figured out it's the trailing byte sequences. I'm working on decoding them now.

I'm currently working off kindlegen generated mobi files, and I've completely deciphered the index, CNCX and TAGX records for those. The trailing byte sequences are still opaque to me, but I don't think they will prove very hard to decode.

Hopefully, understanding and duplicating what kindlegen does with the TBS sequences will allow calibre periodicals to work properly.
Old 07-20-2011, 09:45 PM   #5
oneillpt
Connoisseur
oneillpt began at the beginning.
 
Posts: 62
Karma: 46
Join Date: Feb 2011
Device: Kindle 3 (cracked screen!); PW1; Oasis
Quote:
Originally Posted by tylau0 View Post
This plugin overrides the default mobi periodical generation routine with another that makes use of the kindlegen program available at Amazon.com
A plugin or updated mobi generation is clearly the way to go, as this will allow automatic uploading on connection of the Kindle.

If you are continuing to work on this while waiting for a Calibre update with the desired mobi indexing, or if the feedback is of use for any other projects you have under way: I have tested the plugin with six recipes for which Calibre generates correct epub and mobi output (although of course without proper back button behaviour), using both the current and the previous version of kindlegen. Unfortunately, the recipe I included to test the masthead was one of those which failed, so I could just as well have tested with the current version only.

Four of the six recipes generated azw output files, the other two failed. Of the four which produced azw output, two had correct back button behaviour, the other two produced azw files which could be viewed with Kindle for PC, but opened on the Kindle itself showing a table of contents but with a message box which displayed "The selected item could not be opened. If you purchased this item from Amazon, delete the item and download it from Archived Items." More comments on this below.

I tested using the command line ebook-convert with "--test -vv --debug-pipeline" to generate small e-books, and generated epub, azw and mobi versions to compare. In one case, the first of the four articles extracted had lost some text in the azw output when compared to the epub or mobi versions.
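
The comparison runs described above could be scripted roughly as follows. This is only a sketch: the recipe file name, the output folder and the per-format debug folder names are made up for illustration, and it assumes that --debug-pipeline is given a folder to write into. The azw format is only available with the plugin from this thread installed.

```python
from pathlib import Path


def build_convert_commands(recipe, out_dir, formats=("epub", "azw", "mobi")):
    """Return one ebook-convert argument list per output format."""
    recipe = Path(recipe)
    out_dir = Path(out_dir)
    commands = []
    for fmt in formats:
        output = out_dir / f"{recipe.stem}.{fmt}"
        debug_dir = out_dir / f"debug-{fmt}"  # hypothetical folder name
        commands.append([
            "ebook-convert", str(recipe), str(output),
            "--test", "-vv",
            "--debug-pipeline", str(debug_dir),
        ])
    return commands


if __name__ == "__main__":
    for cmd in build_convert_commands("IrishTimes.recipe", "out"):
        # Print only; pass cmd to subprocess.run(cmd, check=True) to execute.
        print(" ".join(cmd))
```

Keeping a separate debug folder per format makes it easier to compare the intermediate html of the azw run with that of the epub and mobi runs.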

The recipe used was:
Spoiler:
Code:
import re
from calibre.web.feeds.news import BasicNewsRecipe

class IrishTimes(BasicNewsRecipe):
    title          = u'The Irish TimesAZW'
    encoding  = 'ISO-8859-1'
    __author__    = "Derry FitzGerald, Ray Kinsella, David O'Callaghan and Phil Burns"
    language = 'en_IE'
    timefmt = ' (%A, %B %d, %Y)'


    oldest_article = 1.0
    max_articles_per_feed  = 100
    no_stylesheets = True
    simultaneous_downloads= 5

    r = re.compile(r'.*(?P<url>http:\/\/(www.irishtimes.com)|(rss.feedsportal.com\/c)\/.*\.html?).*')
    remove_tags    = [dict(name='div', attrs={'class':'footer'})]
    extra_css      = 'p, div { margin: 0pt; border: 0pt; text-indent: 0.5em } .headline {font-size: large;} \n .fact { padding-top: 10pt  }'

    feeds          = [
                      ('Frontpage', 'http://www.irishtimes.com/feeds/rss/newspaper/index.rss'),
                      ('Ireland', 'http://www.irishtimes.com/feeds/rss/newspaper/ireland.rss'),
                      ('World', 'http://www.irishtimes.com/feeds/rss/newspaper/world.rss'),
                      ('Finance', 'http://www.irishtimes.com/feeds/rss/newspaper/finance.rss'),
                      ('Features', 'http://www.irishtimes.com/feeds/rss/newspaper/features.rss'),
                      ('Sport', 'http://www.irishtimes.com/feeds/rss/newspaper/sport.rss'),
                      ('Opinion', 'http://www.irishtimes.com/feeds/rss/newspaper/opinion.rss'),
                      ('Letters', 'http://www.irishtimes.com/feeds/rss/newspaper/letters.rss'),
                      ('Magazine', 'http://www.irishtimes.com/feeds/rss/newspaper/magazine.rss'),
                      ('Health', 'http://www.irishtimes.com/feeds/rss/newspaper/health.rss'),
                      ('Education & Parenting', 'http://www.irishtimes.com/feeds/rss/newspaper/education.rss'),
                      ('Motors', 'http://www.irishtimes.com/feeds/rss/newspaper/motors.rss'),
                      ('An Teanga Bheo', 'http://www.irishtimes.com/feeds/rss/newspaper/anteangabheo.rss'),
                      ('Commercial Property', 'http://www.irishtimes.com/feeds/rss/newspaper/commercialproperty.rss'),
                      ('Science Today', 'http://www.irishtimes.com/feeds/rss/newspaper/sciencetoday.rss'),
                      ('Property', 'http://www.irishtimes.com/feeds/rss/newspaper/property.rss'),
                      ('The Tickets', 'http://www.irishtimes.com/feeds/rss/newspaper/theticket.rss'),
                      ('Weekend', 'http://www.irishtimes.com/feeds/rss/newspaper/weekend.rss'),
                      ('News features', 'http://www.irishtimes.com/feeds/rss/newspaper/newsfeatures.rss'),
                      ('Obituaries', 'http://www.irishtimes.com/feeds/rss/newspaper/obituaries.rss'),
                    ]


    def print_version(self, url):
        if url.count('rss.feedsportal.com'):
            u = url.replace('0Bhtml/story01.htm','_pf0Bhtml/story01.htm')
        else:
            u = url.replace('.html','_pf.html')
        return u

    def get_article_url(self, article):
        return article.link


The second recipe which produced a useable azw file (loss of text not noticed in this case, but possible of course when more articles are extracted) was:
Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1301251451(BasicNewsRecipe):
    title          = u'Depeche du MidiAZW'
    encoding  = 'Windows-1252'
    oldest_article = 7
    max_articles_per_feed = 100
    remove_javascript     = True
    keep_only_tags = [dict(name='div', attrs={'class':'article'})]
    remove_tags_after = [dict(name='iframe', attrs={'scrolling':'no'})]

    feeds          = [(u'Accueil', u'http://www.ladepeche.fr/rss/39.rss'),
	(u'Ariege', u'http://www.ladepeche.fr/rss/63.rss'), 
	(u'Aude', u'http://www.ladepeche.fr/rss/64.rss'), 
	(u'Haute-Garonne', u'http://www.ladepeche.fr/rss/66.rss'), 
	(u'Lot', u'http://www.ladepeche.fr/rss/68.rss'), 
	(u'Hautes-Pyrenees', u'http://www.ladepeche.fr/rss/70.rss'), 
	(u'Pyrenees', u'http://www.ladepeche.fr/rss/484.rss'), 
	(u'Actu', u'http://www.ladepeche.fr/rss/75.rss'), 
	(u'A la Une', u'http://www.ladepeche.fr/rss/76.rss'), 
	(u"L'evenement", u'http://www.ladepeche.fr/rss/77.rss'), 
	(u'France', u'http://www.ladepeche.fr/rss/164.rss'), 
	(u'Monde', u'http://www.ladepeche.fr/rss/165.rss'), 
	(u'Faits divers', u'http://www.ladepeche.fr/rss/167.rss'), 
	(u'Insolite', u'http://www.ladepeche.fr/rss/168.rss'), 
	(u'Politique', u'http://www.ladepeche.fr/rss/171.rss'), 
	(u'High Tech / Sciences', u'http://www.ladepeche.fr/rss/389.rss'), 
	(u'Sortir a', u'http://www.ladepeche.fr/rss/83.rss'), 
	(u'Meteo', u'http://www.ladepeche.fr/rss/100.rss')
	]


The third azw file producing recipe, with problems described above, was:
Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1311043192(BasicNewsRecipe):
    title          = u'AvuiAZW'
    oldest_article = 7
    max_articles_per_feed = 100

    feeds          = [(u'Avui', u'http://www.avui.cat/puigcerda/nacional.feed?type=rss')]
    
    keep_only_tags = [dict(name='div', attrs={'id':'article-complet'})]
    remove_tags = [dict(name='div', attrs={'class':['botonera']})]

This recipe failed at first to produce an azw file, as it was an initial version returning the complete page. The faulty azw file was only generated when the keep_only_tags and remove_tags were added to restrict the text extracted. I found with nickredding's code that more azw files were generated, but the extra azw files (beyond the first two which worked here) also were faulty and showed the same message box.

The fourth recipe which produced a faulty azw file was:
Spoiler:
Code:
__license__  = 'GPL v3'
__copyright__ = '2011 Phil Burns'
'''
TheJournal.ie
'''
import re

from calibre.web.feeds.news import BasicNewsRecipe

class TheJournal(BasicNewsRecipe):

    __author__             = 'Phil Burns'
    title                  = u'TheJournal.ieAZW'
    oldest_article        = 1
    max_articles_per_feed  = 100
    encoding              = 'utf8'
    language              = 'en_IE'
    timefmt                = ' (%A, %B %d, %Y)'

    no_stylesheets        = True
    remove_tags            = [dict(name='div', attrs={'class':'footer'}),
                          dict(name=['script', 'noscript'])]

    extra_css              = 'p, div { margin: 0pt; border: 0pt; text-indent: 0.5em }'

    feeds                  = [
                          (u'Latest News', u'http://www.thejournal.ie/feed/')]


The two recipes which completely failed were:
Spoiler:
Code:
import re
from calibre import strftime
from time import gmtime
from calibre.web.feeds.news import BasicNewsRecipe

class HaaretzPrint_en(BasicNewsRecipe):
    title                 = 'Haaretz - print editAZW'
    __author__            = 'Darko Miletic'
    description           = "Haaretz.com is the world's leading English-language Website for real-time news and analysis of Israel and the Middle East."
    publisher             = 'Haaretz'
    category              = "news, Haaretz, Israel news, Israel newspapers, Israel business news, Israel financial news, Israeli news,Israeli newspaper, Israeli newspapers, news from Israel, news in Israel, news Israel, news on Israel, newspaper Israel, Israel sports news, Israel diplomacy news"
    oldest_article        = 2
    max_articles_per_feed = 25
    no_stylesheets        = True
    encoding              = 'utf8'
    use_embedded_content  = False
    language              = 'en_IL'
    publication_type      = 'newspaper'
    PREFIX                = 'http://www.haaretz.com'
    masthead_url          = PREFIX + '/images/logos/logoGrey.gif'
    extra_css             = ' body{font-family: Verdana,Arial,Helvetica,sans-serif } '

    preprocess_regexps = [(re.compile(r'</body>.*?</html>', re.DOTALL|re.IGNORECASE),lambda match: '</body></html>')]

    conversion_options = {
                          'comment'  : description
                        , 'tags'     : category
                        , 'publisher': publisher
                        , 'language' : language
                        }

    keep_only_tags    = [dict(attrs={'id':'threecolumns'})]
    remove_attributes = ['width','height']
    remove_tags       = [
                           dict(name=['iframe','link','object','embed'])
                          ,dict(name='div',attrs={'class':'rightcol'})
                        ]


    feeds = [
              (u'News'          , PREFIX + u'/print-edition/news'         )
             ,(u'Opinion'       , PREFIX + u'/print-edition/opinion'      )
             ,(u'International', PREFIX + u'/news/international'      )
             ,(u'Defense and Diplomacy', PREFIX + u'/news/diplomacy-defense'      )
             ,(u'Features'      , PREFIX + u'/print-edition/features'     )
             ,(u'Business'      , PREFIX + u'/print-edition/business'     )
             ,(u'Real estate'   , PREFIX + u'/print-edition/real-estate'  )
             ,(u'Sports'        , PREFIX + u'/print-edition/sports'       )
             ,(u'Travel'        , PREFIX + u'/print-edition/travel'       )
             ,(u'Books'         , PREFIX + u'/print-edition/books'        )
             ,(u'Food & Wine'   , PREFIX + u'/print-edition/food-wine'    )
             ,(u'Arts & Leisure', PREFIX + u'/print-edition/arts-leisure' )
             #,(u'A Special Place in Hell', PREFIX + u'/blogs/a-special-place-in-hell'     )
             #,(u'Strenger than Fiction', PREFIX + u'/blogs/strenger-than-fiction'     )
             #,(u'MESS Report'      , PREFIX + u'/blogs/mess-report'     )
            ]


    def print_version(self, url):
        article = url.rpartition('/')[2]
        return 'http://www.haaretz.com/misc/article-print-page/' + article

    def parse_index(self):
        totalfeeds = []
        lfeeds = self.get_feeds()
        for feedobj in lfeeds:
            feedtitle, feedurl = feedobj
            self.report_progress(0, _('Fetching feed')+' %s...'%(feedtitle if feedtitle else feedurl))
            articles = []
            soup = self.index_to_soup(feedurl)
            for item in soup.findAll(attrs={'class':'text'}):
                sp = item.find('span',attrs={'class':'h3 font-weight-normal'})
                desc = item.find('p')
                description = ''
                if sp:
                    if desc:
                       description = self.tag_to_string(desc)
                    link        = sp.a
                    url         = self.PREFIX + link['href']
                    title       = self.tag_to_string(link)
                    times        = strftime('%a, %d %b %Y %H:%M:%S +0000',gmtime())
                    articles.append({
                                          'title'      :title
                                         ,'date'       :times
                                         ,'url'        :url
                                         ,'description':description
                                        })
            totalfeeds.append((feedtitle, articles))
        return totalfeeds


    def preprocess_html(self, soup):
        for item in soup.findAll(style=True):
            del item['style']
        return soup

which could have tested the masthead with kindlegen 1.1, if it had generated output, and:
Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1311083909(BasicNewsRecipe):
    title          = u'DiarioAltoAragonAZW'
    oldest_article = 7
    max_articles_per_feed = 101

    feeds          = [(u'Portada', u'http://www.diariodelaltoaragon.es/rss.aspx')]
    
    keep_only_tags = [dict(name='div', attrs={'id':'bloquenoticia'})]
    remove_tags = [
       dict(name='div', attrs={'id':['imagen_sin_bordes', 'ctl00ContentPlaceHolder1_pnPopUp', 
          'ctl00ContentPlaceHolder1_divGoogle', 'ctl00_ContentPlaceHolder1_UpdatePanelVotos']}),
       dict(name='iframe'),
       dict(name='a', attrs={'id':['click']}),
       dict(name='a', attrs={'class':['twitter-share-button']})
    ]


As all six recipes produced epub and mobi versions, my suspicion is that the problem lies with the html extraction. Either Calibre removes problematic content which is left in here (the lost text with the first recipe suggests that comparing the html extracted by Calibre with that extracted here could be useful; I will report if I find anything of interest in this respect), or kindlegen is simply more sensitive to unwanted or unsupported html than ebook-convert. kindlegen seems to be based on MobiPocket's mobigen, which I called without difficulty from my own extended version of the MobiPocket webcompanion. I continued to develop and use that version after Amazon bought MobiPocket and dropped the webcompanion, until I bought a Kindle in January and started to use Calibre for news generation. I am therefore more inclined to suspect that something in the html passed to kindlegen causes the failure: five of these six recipes are for publications which extracted without difficulty when I used mobigen in my own software.
Old 07-21-2011, 10:07 AM   #6
tylau0
Connoisseur
tylau0 began at the beginning.
 
Posts: 82
Karma: 10
Join Date: Oct 2010
Device: Kindle
Thanks, oneillpt, for the extensive testing.
The missing text was my fault - I was deleting certain <p> tags in the opf file that had content inside.
All the recipes that were not working contain only one feed, a case my implementation did not handle before.
Attached is the modified CalibreKindlegenHelper.py. Use it to replace the one in azwplugin.zip. I'll do some further extensive testing soon and pack it into the plugin.
Attached Files
File Type: zip CalibreKindlegenHelper.py.zip (2.1 KB, 1105 views)
Old 07-21-2011, 12:07 PM   #7
tylau0
Connoisseur
tylau0 began at the beginning.
 
Posts: 82
Karma: 10
Join Date: Oct 2010
Device: Kindle
Check the updated plugin at the top of this thread. It should have all the problems you mentioned fixed.

Thanks again.

P.S. Thanks Kovid and nickredding for working on a Calibre self-contained solution. I am looking forward to that clean fix!

Last edited by tylau0; 07-21-2011 at 02:45 PM.
Old 07-21-2011, 10:17 PM   #8
oneillpt
Connoisseur
oneillpt began at the beginning.
 
Posts: 62
Karma: 46
Join Date: Feb 2011
Device: Kindle 3 (cracked screen!); PW1; Oasis
Quote:
Originally Posted by tylau0 View Post
Check the updated plugin at the top of this thread. It should have all the problems you mentioned fixed.

Thanks again.

P.S. Thanks Kovid and nickredding for working on a Calibre self-contained solution. I am looking forward to that clean fix!
I've checked all six recipes I tried, and all now work correctly with proper navigation. I've also verified that masthead images are processed if kindlegen 1.1 is used rather than 1.2. The release notes for 1.2 mention "Bug fixes from older versions of Kindlegen", but I'm going to change over to use the plugin with kindlegen 1.1 for all my other news feeds too and see how it goes. I will report any problems found on this thread.

One change which I would suggest, and which I will try out for myself tomorrow, is to add a compression setting for kindlegen in the plugin. The azw files from kindlegen weigh in at nearly twice the size of the mobi version generated from the same recipe. My Depeche du Midi azw file now comes in at 18 MB for example!

Many thanks for this very useful plugin!
Old 07-22-2011, 01:07 PM   #9
tylau0
Connoisseur
tylau0 began at the beginning.
 
Posts: 82
Karma: 10
Join Date: Oct 2010
Device: Kindle
I have adopted the code from Kindlestrip, which trims the file size by half. Please check the top post for the updated plugin. Thanks.
Old 07-22-2011, 07:01 PM   #10
oneillpt
Connoisseur
oneillpt began at the beginning.
 
Posts: 62
Karma: 46
Join Date: Feb 2011
Device: Kindle 3 (cracked screen!); PW1; Oasis
Quote:
Originally Posted by tylau0 View Post
I adopt the code from Kindlestrip that trims the file size by half. Please check the top post for the updated plugin. Thanks.
Thanks. Even modified to use kindlegen with -c2, the azw files were still coming out about 25% larger than the corresponding mobi version. I'll try this version next.
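
For anyone wanting to repeat the compression experiment, a minimal sketch of building the kindlegen call and of the size comparison is shown below. The helper names are hypothetical and this is not the plugin's code. The -c0/-c1/-c2, -o and -verbose switches are the ones listed in kindlegen's usage message as far as I recall; check them against your kindlegen version.

```python
def build_kindlegen_command(opf_path, output_name, compression=2, verbose=False):
    """Build the argument list for one kindlegen run (nothing is executed)."""
    if compression not in (0, 1, 2):
        raise ValueError("compression must be 0, 1 or 2 (-c0, -c1, -c2)")
    # Assumption: -o takes a bare file name, written next to the input file.
    cmd = ["kindlegen", str(opf_path), f"-c{compression}", "-o", output_name]
    if verbose:
        cmd.append("-verbose")
    return cmd


def size_change_percent(azw_bytes, mobi_bytes):
    """How much larger (+) or smaller (-) the azw file is, in percent of the mobi size."""
    return (azw_bytes - mobi_bytes) * 100.0 / mobi_bytes


if __name__ == "__main__":
    print(" ".join(build_kindlegen_command("news.opf", "news.azw")))
    # Example figures only: an azw of 125 units against a mobi of 100 units.
    print(size_change_percent(125, 100))
```

The list form of the command can be handed to subprocess.run without any shell quoting.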

I still find that the "The selected item could not be opened. If you purchased ..." message box can occur, although now in a way which does not prevent use of the ebook. It occurs with an extended version of one of the recipes I used earlier to test:
Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1311083909(BasicNewsRecipe):
    title          = u'DiarioAltoAragon'
    oldest_article = 7
    max_articles_per_feed = 101

    feeds          = [(u'Portada', u'http://www.diariodelaltoaragon.es/rss.aspx'),
(u'Es Noticia', u'http://www.diariodelaltoaragon.es/rss.aspx?Id=14'),
(u'Huesca', u'http://www.diariodelaltoaragon.es/rss.aspx?Id=1'),
(u'Aragón', u'http://www.diariodelaltoaragon.es/rss.aspx?Id=8'),
(u'España', u'http://www.diariodelaltoaragon.es/rss.aspx?Id=5'),
(u'Mundo', u'http://www.diariodelaltoaragon.es/rss.aspx?Id=9'),
(u'Cultura', u'http://www.diariodelaltoaragon.es/rss.aspx?Id=10'),
(u'Última', u'http://www.diariodelaltoaragon.es/rss.aspx?Id=13'),
(u'Opinión', u'http://www.diariodelaltoaragon.es/rss.aspx?Id=4'),
(u'Sociedad', u'http://www.diariodelaltoaragon.es/rss.aspx?Id=11')]
    
    keep_only_tags = [dict(name='div', attrs={'id':'bloquenoticia'})]
    remove_tags = [
       dict(name='div', attrs={'id':['imagen_sin_bordes', 'ctl00ContentPlaceHolder1_pnPopUp', 
          'ctl00ContentPlaceHolder1_divGoogle', 'ctl00_ContentPlaceHolder1_UpdatePanelVotos']}),
       dict(name='iframe'),
       dict(name='a', attrs={'id':['click']}),
       dict(name='a', attrs={'class':['twitter-share-button']})
    ]

In this case only the first feed shows any articles (38 at the moment), but the Kindle table of contents includes all the remaining feeds, showing zero articles for each, and the ebook text shows the name of each feed followed by the single line "RSS de diariodelaltoaragon.es" (this seems to be correct as browsing the rss feeds in a web browser gets this single line for these feeds too). Moving down the left (sections) column of the Kindle toc past that first feed to the second which shows zero articles gives the message box, forcing closure of the ebook. The same thing happens when on the last (Calibre Table of Contents) page when attempting to open the Kindle table of contents, requiring paging back or skipping to previous article before the Kindle table of contents can be accessed.

In this case I found this problem by accident - all sections in the Kindle table of contents were visible on screen at the same time, so there was no need to scroll down the sections. In another case, the list of sections required a second page to display, and while scrolling down through a series of sections with zero articles the right hand (articles) column displayed the first articles for the next section with articles, and no message box occurred.
Old 07-22-2011, 07:42 PM   #11
oneillpt
Connoisseur
oneillpt began at the beginning.
 
Posts: 62
Karma: 46
Join Date: Feb 2011
Device: Kindle 3 (cracked screen!); PW1; Oasis
Latest version even better

The latest version now produces an azw file about 10% smaller than the corresponding mobi version.
Old 07-22-2011, 08:24 PM   #12
tylau0
Connoisseur
tylau0 began at the beginning.
 
Posts: 82
Karma: 10
Join Date: Oct 2010
Device: Kindle
v1.0.3 (available at the top post) removes sections without any article. That should fix the issue you raised.
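
The idea behind that change can be sketched as below. This is an illustration, not the plugin's actual code: it assumes the sections are available as (title, articles) pairs, the same shape a recipe's parse_index returns, and the function name is made up.

```python
def drop_empty_sections(feeds):
    """Keep only the (title, articles) pairs that contain at least one article."""
    return [(title, articles) for title, articles in feeds if articles]


if __name__ == "__main__":
    feeds = [
        ("Portada", [{"title": "Article 1", "url": "http://example.com/1"}]),
        ("Es Noticia", []),  # a feed that returned no articles
        ("Huesca", []),
    ]
    for title, articles in drop_empty_sections(feeds):
        print(title, len(articles))
```

With the empty sections gone, the Kindle table of contents has no zero-article entries left to select.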

Quote:
Originally Posted by oneillpt View Post
Thanks. Even modified to use kindlegen with -c2, the azw files were still coming out about 25% larger than the corresponding mobi version. I'll try this version next.

I still find that the "The selected item could not be opened. If you purchased ..." message box can occur, although now in a way which does not prevent use of the ebook.
Old 07-22-2011, 08:35 PM   #13
oneillpt
Connoisseur
oneillpt began at the beginning.
 
Posts: 62
Karma: 46
Join Date: Feb 2011
Device: Kindle 3 (cracked screen!); PW1; Oasis
A fix in just over 40 minutes. Impressive!
Old 07-23-2011, 05:51 PM   #14
roadlesstraveled
Junior Member
roadlesstraveled began at the beginning.
 
Posts: 2
Karma: 10
Join Date: Jul 2011
Device: Kindle 3
I just tried to convert an epub file into AZW using your plugin but when the conversion gets to 67% "calibre-parallel.exe" crashes.

I'm not sure if it matters but I'm using the portable version of Calibre. Any advice?

Spoiler:
Code:
Problem signature:
  Problem Event Name:                        APPCRASH
  Application Name:                             calibre-parallel.exe
  Application Version:                           0.8.11.0
  Application Timestamp:                    4e299df9
  Fault Module Name:                          StackHash_0a9e
  Fault Module Version:                        0.0.0.0
  Fault Module Timestamp:                 00000000
  Exception Code:                                  c0000005
  Exception Offset:                                fd7f9bad
  OS Version:                                          6.1.7601.2.1.0.256.1
  Locale ID:                                             1033
  Additional Information 1:                  0a9e
  Additional Information 2:                  0a9e372d3b4ad19135b953a78882e789
  Additional Information 3:                  0a9e
  Additional Information 4:                  0a9e372d3b4ad19135b953a78882e789
Old 07-23-2011, 10:57 PM   #15
tylau0
Connoisseur
tylau0 began at the beginning.
 
Posts: 82
Karma: 10
Join Date: Oct 2010
Device: Kindle
v1.0.4 at the top post may have solved the issue you raise. Please check. Thanks.

Quote:
Originally Posted by roadlesstraveled View Post
I just tried to convert an epub file into AZW using your plugin but when the conversion gets to 67% "calibre-parallel.exe" crashes.

I'm not sure if it matters but I'm using the portable version of Calibre. Any advice?
Tags
issue fix, kindle, kindlegen, periodical