Express failing recipe rewrite produces unexpected results

scissors · 01-19-2014, 05:36 AM

The Daily Express recipe has started producing an empty mobi.
I inspected the website, and the tags I used in keep_only_tags don't seem to have changed.

I removed keep_only_tags and the articles came down (plus all the other junk).

I added some re.compile lines to get rid of the footer etc.

The articles now download, but without the images.
The log contains a load of "Failed to find image: ...." lines.

When I look at the mobi in calibre's viewer, the images reference the online image (unsurprisingly, given the failures),

yet if I type these URLs straight into my browser's address bar they load with no problem.

Could anyone explain this change?

Thanks

Express Recipe (old)

Spoiler:

Code:

import re
from calibre.web.feeds.news import BasicNewsRecipe
from calibre import browser
class AdvancedUserRecipe1376229553(BasicNewsRecipe):
    title          = u'Daily Express old'
    __author__ = 'Dave Asbury'
    # 9-9-13 added article author and now use (re.compile(r'>[\w].+? News<'
    # 16-11-13 cover adjustment
    encoding    = 'utf-8'
    remove_empty_feeds = True
    #remove_javascript     = True
    no_stylesheets        = True
    oldest_article = 1
    max_articles_per_feed = 10
    #auto_cleanup = True
    compress_news_images = True
    compress_news_images_max_size = 30
    ignore_duplicate_articles = {'title', 'url'}
    masthead_url = 'http://cdn.images.dailyexpress.co.uk/img/page/express_logo.png'

    preprocess_regexps = [

                (re.compile(r'widget', re.IGNORECASE | re.DOTALL), lambda match: ''),
                        (re.compile(r'Related articles', re.IGNORECASE | re.DOTALL), lambda match: ''),
                        (re.compile(r'Add Your Comment<', re.IGNORECASE | re.DOTALL), lambda match: '<'),
                (re.compile(r'>More [\w].+?<', re.IGNORECASE), lambda match: '><'),
                                (re.compile(r'>[\w].+? News<', re.IGNORECASE), lambda match: '><'),
                         
                ]

    remove_tags = [
                                dict(attrs={'class' : 'quote'}),
                
                                dict(name='footer'),
                                dict(attrs={'id' : 'header_addons'}),
                                dict(attrs={'class' : 'hoverException'}),
                                dict(name='_li'),dict(name='li'),
                                dict(attrs={'class' : 'box related-articles clear'}),
                                dict(attrs={'class' : 'news-list'}),
                                dict(attrs={'class' : 'sponsored-section'}),
                                dict(attrs={'class' : 'pull-quote on-right'}),
                                dict(attrs={'class' : 'pull-quote on-left'}),

                             ]
    keep_only_tags = [
                dict(name='h1'),
                                dict(attrs={'class' : 'publish-info'}),
                                dict(name='h3', limit=2),
                                dict(attrs={'class' : 'clearfix hR new-style'}),
                             ]

    feeds          = [(u'UK News', u'http://www.express.co.uk/posts/rss/1/uk'),
                 (u'World News',u'http://www.express.co.uk/posts/rss/78/world'),
                         (u'Finance',u'http://www.express.co.uk/posts/rss/21/finance'),
                 (u'Sport',u'http://www.express.co.uk/posts/rss/65/sport'),
                 (u'Entertainment',u'http://www.express.co.uk/posts/rss/18/entertainment'),
                         (u'Lifestyle',u'http://www.express.co.uk/posts/rss/8/life&style'),
                 (u'Fun',u'http://www.express.co.uk/posts/rss/110/fun'),
                        ]

    def get_cover_url(self):
        print '============Cover ================='
        print
        soup = self.index_to_soup('http://www.express.co.uk/ourpaper/')
        cov = soup.find(attrs={'src' : re.compile('http://cdn.images.express.co.uk/img/covers/')})
        cov=str(cov)
        print '^^^^^^^', cov
        cov2 =  re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', cov)

        cov=str(cov2)
        cov=cov[2:len(cov)-2]

        print '&&&&&&&&',cov,'***'
        #cover_url=cov
        br = browser()
        br.set_handle_redirect(False)
        try:
            br.open_novisit(cov)
            cover_url = cov
        except:
            cover_url ='http://cdn.images.express.co.uk/img/static/ourpaper/header-back-issue-papers.jpg'

        return cover_url

    extra_css = '''
                    h1{font-weight:bold;font-size:175%;}
                    h2{font-weight:normal;font-size:75%;}
                    #p{font-size:14px;}
                    #body{font-size:14px;}
                    .photo-caption {display: block;margin-left: auto;margin-right: auto;width:100%;font-size:40%;}
                    .publish-info {font-size:50%;}
                    .photo img {display: block;margin-left: auto;margin-right: auto;width:100%;}
      '''

Express Recipe (new)

Spoiler:

Code:

import re

from calibre.web.feeds.news import BasicNewsRecipe
from calibre import browser
class AdvancedUserRecipe1376229553(BasicNewsRecipe):
    title          = u'Daily Express'
    __author__ = 'Dave Asbury'
    # 9-9-13 added article author and now use (re.compile(r'>[\w].+? News<'
    # 16-11-13 cover adjustment
    #19.1.14 changes due to website changes breaking recipe
    encoding    = 'utf-8'
    remove_empty_feeds = True
    remove_javascript     = True
    no_stylesheets        = True
    oldest_article = 1
    max_articles_per_feed = 2
    #auto_cleanup = True
    compress_news_images = True
    compress_news_images_max_size = 30
    ignore_duplicate_articles = {'title', 'url'}
    masthead_url = 'http://cdn.images.dailyexpress.co.uk/img/page/express_logo.png'
    #conversion_options = { 'linearize_tables' : True }
    preprocess_regexps = [
                                   (re.compile(r'<blockquote.*</blockquote>', re.DOTALL|re.IGNORECASE),lambda match: '</header>'),      
                                   (re.compile(r'widget', re.IGNORECASE | re.DOTALL), lambda match: ''),
                                   (re.compile(r'Related articles', re.IGNORECASE | re.DOTALL), lambda match: ''),
                                   (re.compile(r'Add Your Comment<', re.IGNORECASE | re.DOTALL), lambda match: '<'),
                                   (re.compile(r'>More [\w].+?<', re.IGNORECASE), lambda match: '><'),
                                   (re.compile(r'>[\w].+? News<', re.IGNORECASE), lambda match: '><'),
                                   #(re.compile(r'<footer class="mainFooter cf" id="mainFooter">.*</footer>', re.DOTALL|re.IGNORECASE),lambda match: ''),
                                   (re.compile(r'<section class="box related-articles clear">.*</footer>', re.DOTALL|re.IGNORECASE),lambda match: ''),
                                   (re.compile(r'<div style="display:inline;">.*</div>', re.DOTALL|re.IGNORECASE),lambda match: ''),
                                   (re.compile(r'<div style="display:none;">.*</div>', re.DOTALL|re.IGNORECASE),lambda match: ''),
		   #(re.compile(r'<nav>.*</nav>', re.DOTALL|re.IGNORECASE),lambda match: ''),
		   (re.compile(r'<div class="social-widget gplus">.*</header>', re.DOTALL|re.IGNORECASE),lambda match: '</header>'),      
          ]

    remove_tags = [
                                dict(attrs={'class' : 'quote'}),
                                dict(attrs={'class' : 'mainFooter cf'}),
                                dict(name='footer'),
                                dict(attrs={'id' : 'header_addons'}),
                                dict(attrs={'class' : 'hoverException'}),
                                dict(name='_li'),dict(name='li'),
                                dict(attrs={'class' : 'box related-articles clear'}),
                                dict(attrs={'class' : 'news-list'}),
                                dict(attrs={'class' : 'sponsored-section'}),
                                dict(attrs={'class' : 'pull-quote on-right'}),
                                dict(attrs={'class' : 'pull-quote on-left'}),

                             ]
    remove_tags_after = [dict(attrs={'class' : 'clearfix hR new-style'})]
    

    feeds          = [(u'UK News', u'http://www.express.co.uk/posts/rss/1/uk'),
                 (u'World News',u'http://www.express.co.uk/posts/rss/78/world'),
                         (u'Finance',u'http://www.express.co.uk/posts/rss/21/finance'),
                 (u'Sport',u'http://www.express.co.uk/posts/rss/65/sport'),
                 (u'Entertainment',u'http://www.express.co.uk/posts/rss/18/entertainment'),
                         (u'Lifestyle',u'http://www.express.co.uk/posts/rss/8/life&style'),
                 (u'Fun',u'http://www.express.co.uk/posts/rss/110/fun'),
                        ]

    def get_cover_url(self):
        print '============Cover ================='
        print
        soup = self.index_to_soup('http://www.express.co.uk/ourpaper/')
        cov = soup.find(attrs={'src' : re.compile('http://cdn.images.express.co.uk/img/covers/')})
        cov=str(cov)
        print '^^^^^^^', cov
        cov2 =  re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', cov)

        cov=str(cov2)
        cov=cov[2:len(cov)-2]

        print '&&&&&&&&',cov,'***'
        #cover_url=cov
        br = browser()
        br.set_handle_redirect(False)
        try:
            br.open_novisit(cov)
            cover_url = cov
        except:
            cover_url ='http://cdn.images.express.co.uk/img/static/ourpaper/header-back-issue-papers.jpg'

        return cover_url

    extra_css = '''
                    h1{font-weight:bold;font-size:175%;}
                    h2{font-weight:normal;font-size:75%;}
                    #p{font-size:14px;}
                    #body{font-size:14px;}
                    .photo-caption {display: block;margin-left: auto;margin-right: auto;width:100%;font-size:40%;}
                    .publish-info {font-size:50%;}
                    .photo img {display: block;margin-left: auto;margin-right: auto;width:100%;}
      '''

Log file

Spoiler:

kovidgoyal · 01-19-2014, 07:54 AM

That website likely has some invalid markup that is preventing calibre from parsing the page. From a quick look, it looks like it uses invalid HTML comments that look like this:

<!—OVOLABS_2 START—>

Note the use of an mdash instead of a double hyphen.

I fixed the recipe to take care of that. https://github.com/kovidgoyal/calibr...7727e30fcb8ba8
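
To illustrate the kind of repair described above, here is a minimal sketch that rewrites mdash-delimited comments into valid HTML comments before parsing. It is not necessarily what the linked commit does; the names `BAD_COMMENT` and `fix_mdash_comments` are made up for this example.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical pattern: a "comment" opened with "<!" + mdash and closed
# with mdash + ">" (U+2014 is the mdash character).
BAD_COMMENT = re.compile(u'<!\u2014(.*?)\u2014>', re.DOTALL)


def fix_mdash_comments(html):
    """Rewrite mdash-delimited pseudo-comments as real <!-- --> comments."""
    return BAD_COMMENT.sub(lambda m: u'<!--' + m.group(1) + u'-->', html)


sample = u'<p>before</p><!\u2014OVOLABS_2 START\u2014><p>after</p>'
print(fix_mdash_comments(sample))
```

In a recipe, a (pattern, function) pair like this could go into preprocess_regexps, which already takes entries of that shape in the recipes posted here.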

scissors · 01-19-2014, 09:03 AM

Hi Kovid

Thanks for that.

However, I rewrote the recipe as it was getting messy.
This is the new one, which seems a lot faster.

I would ask 1 question, regarding the code for auto clean up.
In the recipe I wanted photos and the writer info to not be cleaned up.

I used the following

auto_cleanup_keep = '//section[@class="photo"]'
#auto_cleanup_keep = '//div[@class="publish-info"]'
auto_cleanup = True

The second line is commented out because when I add it the photos disappear. Can auto_cleanup_keep only be used once?

Kind Regards
Dave

Express, new recipe

Spoiler:

Code:

import re

from calibre.web.feeds.news import BasicNewsRecipe
from calibre import browser
class AdvancedUserRecipe1390132023(BasicNewsRecipe):
    title          = u'Daily Express'
    __author__ = 'Dave Asbury'
   # 19.1.14 written due to website changes
    oldest_article = 1
    max_articles_per_feed = 10
    compress_news_images = True
    compress_news_images_max_size = 30
    ignore_duplicate_articles = {'title', 'url'}
    masthead_url = 'http://cdn.images.dailyexpress.co.uk/img/page/express_logo.png'
    auto_cleanup_keep = '//section[@class="photo"]'
    #auto_cleanup_keep = '//div[@class="publish-info"]' 
    auto_cleanup = True
    no_stylesheets        = False
    preprocess_regexps = [
        (re.compile(r'\| [\w].+?\| [\w].+?\| Daily Express', re.IGNORECASE | re.DOTALL), lambda match: ''),
    ]
    feeds          = [
        (u'UK News', u'http://www.express.co.uk/posts/rss/1/uk'),
        (u'World News', u'http://www.express.co.uk/posts/rss/78/world'),
        (u'Finance', u'http://www.express.co.uk/posts/rss/21/finance'),
        (u'Sport', u'http://www.express.co.uk/posts/rss/65/sport'),
        (u'Entertainment', u'http://www.express.co.uk/posts/rss/18/entertainment'),
        (u'Lifestyle', u'http://www.express.co.uk/posts/rss/8/life&style'),
        (u'Fun', u'http://www.express.co.uk/posts/rss/110/fun'),
    ]

    def get_cover_url(self):
        print '============Cover ================='
        print
        soup = self.index_to_soup('http://www.express.co.uk/ourpaper/')
        cov = soup.find(attrs={'src' : re.compile('http://cdn.images.express.co.uk/img/covers/')})
        cov=str(cov)
        print '^^^^^^^', cov
        cov2 =  re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', cov)

        cov=str(cov2)
        cov=cov[2:len(cov)-2]

        print '&&&&&&&&',cov,'***'
        #cover_url=cov
        br = browser()
        br.set_handle_redirect(False)
        try:
            br.open_novisit(cov)
            cover_url = cov
        except:
            cover_url ='http://cdn.images.express.co.uk/img/static/ourpaper/header-back-issue-papers.jpg'

        return cover_url


    extra_css = '''
                    #h1{font-weight:bold;font-size:175%;}
                    h2{display: block;margin-left: auto;margin-right: auto;width:100%;font-weight:bold;font-size:175%;}
                    #p{font-size:14px;}
                    #body{font-size:14px;}
                    .photo-caption {display: block;margin-left: auto;margin-right: auto;width:100%;font-size:40%;}
                    .publish-info {font-size:50%;}
                    .photo img {display: block;margin-left: auto;margin-right: auto;width:100%;}
      '''
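
The cover lookup in get_cover_url turns the tag into a string, runs a general URL regex over it, then slices the brackets and quotes off the printed list. A shorter route might be to capture the src value directly. The sketch below is only an illustration under that assumption: `COVER_SRC`, `FALLBACK` and `cover_from_tag` are names invented here, and the sample tag is the one printed in the log.

```python
import re

# Same URL prefix the recipe passes to soup.find(); the capture group
# around the full src value is this sketch's addition.
COVER_SRC = re.compile(r'src="(http://cdn\.images\.express\.co\.uk/img/covers/[^"]+)"')

# Fallback image the recipe already uses when the cover cannot be opened.
FALLBACK = 'http://cdn.images.express.co.uk/img/static/ourpaper/header-back-issue-papers.jpg'


def cover_from_tag(tag_html, fallback=FALLBACK):
    """Return the cover URL found in an <img> tag string, or the fallback."""
    match = COVER_SRC.search(tag_html)
    return match.group(1) if match else fallback


tag = ('<img src="http://cdn.images.express.co.uk/img/covers/287x361front/'
       '2014-01-19.jpg" alt="" width="287" height="361" />')
print(cover_from_tag(tag))
# str(None) is what the recipe would see if soup.find() matched nothing.
print(cover_from_tag('None'))
```

This only replaces the string handling; the recipe's br.open_novisit() check on the resulting URL would still be needed.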

PeterT · 01-19-2014, 09:30 AM

From looking at the manual I *think* you want

Code:

auto_cleanup_keep = '//section[@class="photo"]|//div[@class="publish-info"]' 
auto_cleanup = True
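
As background to the suggestion above: assigning the same class attribute twice in Python simply rebinds the name, so only the last value survives, which would explain why adding the second line made the photos disappear. The sketch below uses plain stand-in classes (`RecipeSketch` and `CombinedSketch` are hypothetical, not calibre classes) to show the rebinding and the joined XPath expression.

```python
class RecipeSketch(object):
    # Stand-in for a recipe class; not calibre's BasicNewsRecipe.
    auto_cleanup_keep = '//section[@class="photo"]'
    # Rebinding the same name: this value replaces the one above.
    auto_cleanup_keep = '//div[@class="publish-info"]'


# The two expressions to keep, joined with the XPath union operator "|".
KEEP = ['//section[@class="photo"]', '//div[@class="publish-info"]']


class CombinedSketch(object):
    auto_cleanup_keep = '|'.join(KEEP)


print(RecipeSketch.auto_cleanup_keep)
print(CombinedSketch.auto_cleanup_keep)
```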

scissors · 01-19-2014, 10:33 AM

Quote:

Originally Posted by PeterT

From looking at the manual I *think* you want

Code:

auto_cleanup_keep = '//section[@class="photo"]|//div[@class="publish-info"]' 
auto_cleanup = True

Thanks.

Unfortunately, the publish info /seems/ to get ignored. I tried various tags etc.

No worries. The main article works.
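
One thing that might be worth checking, purely as an assumption and not verified against the site: `@class="publish-info"` in XPath only matches when the class attribute is exactly that string, so an element carrying additional classes would be missed. A common XPath 1.0 idiom tests for the class as one token among several. The helper below (`class_token_xpath` is a made-up name) just builds such an expression as a string.

```python
def class_token_xpath(tag, cls):
    """Build an XPath 1.0 expression matching `tag` elements whose class
    attribute contains `cls` as a whole, space-separated token."""
    return ('//%s[contains(concat(" ", normalize-space(@class), " "), " %s ")]'
            % (tag, cls))


# Hypothetical combined value for auto_cleanup_keep.
keep = '|'.join([
    '//section[@class="photo"]',
    class_token_xpath('div', 'publish-info'),
])
print(keep)
```

Whether this changes what auto cleanup keeps would need testing against a live article page.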

Log file (attached to scissors' first post)

Spoiler:

Fetch news from Daily Express
Failed to initialize plugin: Kindle and Mobipocket DeDRM (0, 4, 18)
Failed to initialize plugin: u'C:\\Users\\Dave\\AppData\\Roaming\\calibre\\plugins\\Kindle and Mobipocket DeDRM.zip'
Resolved conversion options
calibre version: 1.20.0
{'asciiize': False, 'author_sort': None, 'authors': None, 'base_font_size': 13.0, 'book_producer': None, 'change_justification': 'original', 'chapter': None, 'chapter_mark': 'pagebreak', 'comments': None, 'cover': None, 'debug_pipeline': None, 'dehyphenate': True, 'delete_blank_paragraphs': True, 'disable_font_rescaling': False, 'dont_compress': False, 'dont_download_recipe': False, 'duplicate_links_in_toc': False, 'embed_all_fonts': False, 'embed_font_family': None, 'enable_heuristics': False, 'expand_css': False, 'extra_css': None, 'extract_to': None, 'filter_css': None, 'fix_indents': True, 'font_size_mapping': None, 'format_scene_breaks': True, 'html_unwrap_factor': 0.4, 'input_encoding': None, 'input_profile': <calibre.customize.profiles.InputProfile object at 0x0420B5B0>, 'insert_blank_line': False, 'insert_blank_line_size': 0.5, 'insert_metadata': False, 'isbn': None, 'italicize_common_cases': True, 'keep_ligatures': False, 'language': None, 'level1_toc': None,
'level2_toc': None, 'level3_toc': None, 'line_height': 0, 'linearize_tables': False, 'lrf': False, 'margin_bottom': 5.0, 'margin_left': 5.0, 'margin_right': 5.0, 'margin_top': 5.0, 'markup_chapter_headings': True, 'max_toc_links': 50, 'minimum_line_height': 120.0, 'mobi_file_type': 'old', 'mobi_ignore_margins': False, 'mobi_keep_original_images': False, 'mobi_toc_at_start': False, 'no_chapters_in_toc': False, 'no_inline_navbars': True, 'no_inline_toc': False, 'output_profile': <calibre.customize.profiles.KindlePaperWhiteOutp ut object at 0x0420B950>, 'page_breaks_before': None, 'personal_doc': '[PDOC]', 'prefer_author_sort': False, 'prefer_metadata_cover': False, 'pretty_print': False, 'pubdate': None, 'publisher': None, 'rating': None, 'read_metadata_from_opf': None, 'remove_fake_margins': True, 'remove_first_image': False, 'remove_paragraph_spacing': False, 'remove_paragraph_spacing_indent_size': 1.5, 'renumber_headings': True, 'replace_scene_breaks': '', 'search_replace': None, 'series': None, 'series_index': None, 'share_not_sync': False, 'smarten_punctuation': False, 'sr1_replace': '', 'sr1_search': '', 'sr2_replace': '', 'sr2_search': '', 'sr3_replace': '', 'sr3_search': '', 'start_reading_at': None, 'subset_embedded_fonts': False, 'tags': None, 'test': False, 'timestamp': None, 'title': None, 'title_sort': None, 'toc_filter': None, 'toc_threshold': 6, 'toc_title': None, 'unsmarten_punctuation': False, 'unwrap_lines': True, 'use_auto_toc': False, 'verbose': 2} InputFormatPlugin: Recipe Input running Using custom recipe Skipping article Top 10 facts about Popeye (Fri, 17 Jan, 2014 00:01) from feed Fun as it is too old. Skipping article Top 10 facts about Sherlock Holmes (Thu, 16 Jan, 2014 00:01) from feed Fun as it is too old. Skipping article Top 10 facts about museums (Wed, 15 Jan, 2014 00:01) from feed Fun as it is too old. Skipping article Top 10 facts about Casablanca (Tue, 14 Jan, 2014 00:01) from feed Fun as it is too old. 
Skipping article Top 10 facts about Glasgow (Mon, 13 Jan, 2014 00:00) from feed Fun as it is too old. Skipping article Top 10 facts about Siberia (Fri, 10 Jan, 2014 00:00) from feed Fun as it is too old. Skipping article Top ten facts about Cambridge (Thu, 09 Jan, 2014 00:00) from feed Fun as it is too old. Skipping article Top 10 facts about singing (Wed, 08 Jan, 2014 00:00) from feed Fun as it is too old. Skipping article Top 10 facts about Twelfth Night (Mon, 06 Jan, 2014 00:00) from feed Fun as it is too old. Skipping article Top 10 facts about... lords (Fri, 03 Jan, 2014 00:00) from feed Fun as it is too old. ============Cover ================= ^^^^^^^ <img src="http://cdn.images.express.co.uk/img/covers/287x361front/2014-01-19.jpg" alt="" width="287" height="361" /> &&&&&&&& http://cdn.images.express.co.uk/img/...2014-01-19.jpg * Downloading Fetching http://www.express.co.uk/news/uk/454...l-jibe-at-poor Downloading Fetching http://www.express.co.uk/news/uk/454...ling-marriages Downloading Fetching http://www.express.co.uk/news/world/...talian-retrial Downloading Fetching http://www.express.co.uk/news/world/...ive-in-Vietnam Downloading Fetching http://www.express.co.uk/finance/cit...interest-level Processing images... Fetching http://b.scorecardresearch.com/p?c1=...52&cv=2.0&cj=1 Processing images... Processing images... Fetching http://b.scorecardresearch.com/p?c1=...52&cv=2.0&cj=1 Fetching http://b.scorecardresearch.com/p?c1=...52&cv=2.0&cj=1 Processing images... Fetching http://b.scorecardresearch.com/p?c1=...52&cv=2.0&cj=1 Recursion limit reached. Skipping links in http://www.express.co.uk/news/uk/454...l-jibe-at-poor Recursion limit reached. Skipping links in http://www.express.co.uk/news/world/...ive-in-Vietnam Recursion limit reached. Skipping links in http://www.express.co.uk/news/uk/454...ling-marriages Recursion limit reached. Skipping links in http://www.express.co.uk/news/world/...talian-retrial Processing images... Recursion limit reached. 
Skipping links in http://www.express.co.uk/finance/cit...interest-level
http://www.express.co.uk/news/uk/454...l-jibe-at-poor saved to d:\temp\calibre_tnwotw\c1n8sv_plumber\feed_0\article_0\index.xhtml
Downloading
Fetching http://www.express.co.uk/finance/cit...lling-from-MPs
Downloaded article: Edwina Currie's cruel jibe at the poor from http://www.express.co.uk/news/uk/454...l-jibe-at-poor
http://www.express.co.uk/news/world/...ive-in-Vietnam saved to d:\temp\calibre_tnwotw\c1n8sv_plumber\feed_1\article_1\index.xhtml
http://www.express.co.uk/news/uk/454...ling-marriages saved to d:\temp\calibre_tnwotw\c1n8sv_plumber\feed_0\article_1\index.xhtml
Downloading
Fetching http://www.express.co.uk/sport/footb...erto-Gilardino
http://www.express.co.uk/news/world/...talian-retrial saved to d:\temp\calibre_tnwotw\c1n8sv_plumber\feed_1\article_0\index.xhtml
Downloaded article: 'I have to believe my son is alive in Vietnam' from http://www.express.co.uk/news/world/...ive-in-Vietnam
Downloaded article: MPs tell Cameron to fight 'plague' of failing marriages from http://www.express.co.uk/news/uk/454...ling-marriages
Downloading
Downloaded article: Amanda Knox's fears over Italian retrial from http://www.express.co.uk/news/world/...etrial
Fetching http://www.express.co.uk/entertainme...-a-perfect-man
Downloading
Fetching http://www.express.co.uk/sport/footb...b-in-the-WORLD
http://www.express.co.uk/finance/cit...interest-level saved to d:\temp\calibre_tnwotw\c1n8sv_plumber\feed_2\article_0\index.xhtml
Downloading
Fetching http://www.express.co.uk/entertainme...made-me-a-star
Downloaded article: Rate hike risk as jobless toll falls: Mark Carney urged not to raise interest level from http://www.express.co.uk/finance/cit...interest-level
Could not fetch link http://www.express.co.uk/sport/footb...erto-Gilardino
Traceback (most recent call last):
  File "site-packages\calibre\web\fetch\simple.py", line 518, in process_links
  File "site-packages\calibre\web\fetch\simple.py", line 250, in fetch_url
FetchError: Not Found
http://www.express.co.uk/sport/footb...erto-Gilardino saved to
Downloading
Fetching http://www.express.co.uk/life-style/...e-grand-Tourer
Failed to download article: West Ham boss Sam Allardyce targets Italian international Alberto Gilardino from http://www.express.co.uk/sport/footb...erto-Gilardino
Traceback (most recent call last):
  File "site-packages\calibre\utils\threadpool.py", line 95, in run
  File "site-packages\calibre\web\feeds\news.py", line 1106, in fetch_article
  File "site-packages\calibre\web\feeds\news.py", line 1101, in _fetch_article
Exception: Could not fetch article. The debug traceback is available earlier in this log
Processing images...
Recursion limit reached. Skipping links in http://www.express.co.uk/finance/cit...lling-from-MPs
http://www.express.co.uk/finance/cit...lling-from-MPs saved to d:\temp\calibre_tnwotw\c1n8sv_plumber\feed_2\article_1\index.xhtml
Downloading
Fetching http://www.express.co.uk/life-style/...nd-Switzerland
Downloaded article: Energy grid firms set for a grilling from MPs from http://www.express.co.uk/finance/cit...lling-from-MPs
Processing images...
Processing images...
Recursion limit reached. Skipping links in http://www.express.co.uk/life-style/...e-grand-Tourer
Processing images...
Recursion limit reached. Skipping links in http://www.express.co.uk/entertainme...made-me-a-star
Processing images...
Recursion limit reached. Skipping links in http://www.express.co.uk/entertainme...-a-perfect-man
Recursion limit reached. Skipping links in http://www.express.co.uk/sport/footb...b-in-the-WORLD
http://www.express.co.uk/entertainme...made-me-a-star saved to d:\temp\calibre_tnwotw\c1n8sv_plumber\feed_4\article_1\index.xhtml
Downloaded article: Failing to get Bourne lead made me a star from http://www.express.co.uk/entertainme...made-me-a-star
http://www.express.co.uk/life-style/...e-grand-Tourer saved to d:\temp\calibre_tnwotw\c1n8sv_plumber\feed_5\article_0\index.xhtml
http://www.express.co.uk/entertainme...-a-perfect-man saved to d:\temp\calibre_tnwotw\c1n8sv_plumber\feed_4\article_0\index.xhtml
Downloaded article: Honda Civic Tourer: The grand Tourer from http://www.express.co.uk/life-style/...e-grand-Tourer
Downloaded article: Dad may have been a gangster, but to us he was a perfect man from http://www.express.co.uk/entertainme...-a-perfect-man
http://www.express.co.uk/sport/footb...b-in-the-WORLD saved to d:\temp\calibre_tnwotw\c1n8sv_plumber\feed_3\article_0\index.xhtml
Downloaded article: David Moyes insists Man Utd are biggest club in the WORLD from http://www.express.co.uk/sport/footb...b-in-the-WORLD
Processing images...
Recursion limit reached. Skipping links in http://www.express.co.uk/life-style/...nd-Switzerland
http://www.express.co.uk/life-style/...nd-Switzerland saved to d:\temp\calibre_tnwotw\c1n8sv_plumber\feed_5\article_1\index.xhtml
Downloaded article: Slope off for a family skiing break this half-term holiday from http://www.express.co.uk/life-style/...nd-Switzerland
Failed to download the following articles:
West Ham boss Sam Allardyce targets Italian international Alberto Gilardino from Sport
http://www.express.co.uk/sport/footb...erto-Gilardino
Traceback (most recent call last):
  File "site-packages\calibre\utils\threadpool.py", line 95, in run
  File "site-packages\calibre\web\feeds\news.py", line 1106, in fetch_article
  File "site-packages\calibre\web\feeds\news.py", line 1101, in _fetch_article
Exception: Could not fetch article. The debug traceback is available earlier in this log
Parsing all content...
Parsing feed_2/article_0/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_2/article_0/index.html as HTML
Parsing feed_2/article_1/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_2/article_1/index.html as HTML
Parsing feed_0/article_1/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_0/article_1/index.html as HTML
Parsing feed_3/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_3/index.html as HTML
Parsing index.html ...
Forcing index.html into XHTML namespace
Parsing feed_3/article_0/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_3/article_0/index.html as HTML
Parsing feed_4/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_4/index.html as HTML
Parsing feed_1/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_1/index.html as HTML
Parsing feed_0/article_0/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_0/article_0/index.html as HTML
Parsing feed_1/article_0/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_1/article_0/index.html as HTML
Parsing feed_4/article_1/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_4/article_1/index.html as HTML
Parsing feed_0/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_0/index.html as HTML
Parsing feed_5/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_5/index.html as HTML
Parsing feed_5/article_0/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_5/article_0/index.html as HTML
Parsing feed_5/article_1/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_5/article_1/index.html as HTML
Parsing feed_1/article_1/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_1/article_1/index.html as HTML
Parsing feed_4/article_0/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_4/article_0/index.html as HTML
Parsing feed_2/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_2/index.html as HTML
Referenced file u'/search/Geoff%2bHo%3fs%3dGeoff%2bHo%26b%3d1' not found
Referenced file u'/search/Henry%2bFitzherbert%3fs%3dHenry%2bFitzherbert%26b%3d1' not found
Referenced file u'/search/Marco%2bGiannangeli%3fs%3dMarco%2bGiannangeli%26b%3d1' not found
Referenced file u'/search/Helen%2bMassy-Beresford%3fs%3dHelen%2bMassy-Beresford%26b%3d1' not found
Referenced file u'/search/Tom%2bStewart%3fs%3dTom%2bStewart%26b%3d1' not found
Referenced file u'/news/world/454721/I-have-to-believe-my-son-is-alive-in-Vietnam' not found
Referenced file u'/search/Paula%2bMurray%3fs%3dPaula%2bMurray%26b%3d1' not found
Referenced file u'/search/John%2bRichardson%3fs%3dJohn%2bRichardson%26b%3d1' not found
Referenced file u'/news/uk/454730/Edwina-Currie-s-cruel-jibe-at-poor' not found
Referenced file u'/search/Lucy%2bJohnstone%3fs%3dLucy%2bJohnstone%26b%3d1' not found
Referenced file u'//www.googleadservices.com/pagead/conversion.js' not found
Referenced file u'/opensearch.xml' not found
Referenced file u'/entertainment/films/454753/Failing-to-get-Bourne-lead-made-me-a-star' not found
Referenced file u'/finance/city/454745/Rate-hike-risk-as-jobless-toll-falls-Mark-Carney-urged-not-to-raise-interest-level' not found
Referenced file u'/entertainment/books/454754/Dad-may-have-been-a-gangster-but-to-us-he-was-a-perfect-man' not found
Referenced file u'/sport/football/454711/David-Moyes-insists-Man-Utd-are-biggest-club-in-the-WORLD' not found
Referenced file u'/search/David%2bMeikle%3fs%3dDavid%2bMeikle%26b%3d1' not found
Referenced file u'/news/uk/454749/MPs-tell-Cameron-to-fight-plague-of-failing-marriages' not found
Referenced file u'/search/Nicola%2bIseard%3fs%3dNicola%2bIseard%26b%3d1' not found
Referenced file u'/news/world/454726/Amanda-Knox-s-fears-over-Italian-retrial' not found
Referenced file u'/finance/city/454744/Energy-grid-firms-set-for-a-grilling-from-MPs' not found
Referenced file u'//s7.addthis.com/js/300/addthis_.js' not found
Referenced file u'feed_6/index.html' not found
Referenced file u'/search/Mike%2bParker%3fs%3dMike%2bParker%26b%3d1' not found
Referenced file u'/life-style/travel/454731/Best-ski-resorts-for-families-this-half-term-in-Austria-France-Colorado-and-Switzerland' not found
Referenced file u'/life-style/cars/454737/Honda-Civic-Tourer-The-grand-Tourer' not found
Referenced file u'feed_3/article_1/index.html' not found
Reading TOC from NCX...
Merging user specified metadata...
Detecting structure...
Flattening CSS and remapping font sizes...
Source base font size is 12.00000pt
Removing fake margins...
Found 22 items of level: div_8
Found 19 items of level: div_1
Found 44 items of level: div_3
Found 50 items of level: div_2
Found 22 items of level: div_5
Found 33 items of level: div_4
Found 55 items of level: div_7
Found 22 items of level: div_6
Found 132 items of level: p_7
Found 2 items of level: p_2
Ignoring level div_8
Ignoring level div_5
Ignoring level p_2
Ignoring level div_6
div_1 left margin stats: Counter({u'': 11})
div_1 right margin stats: Counter({u'': 11})
div_3 left margin stats: Counter({u'': 44})
div_3 right margin stats: Counter({u'': 44})
div_2 left margin stats: Counter({u'': 11})
div_2 right margin stats: Counter({u'': 11})
div_4 left margin stats: Counter({u'': 33})
div_4 right margin stats: Counter({u'': 33})
div_7 left margin stats: Counter({u'': 55})
div_7 right margin stats: Counter({u'': 55})
p_7 left margin stats: Counter({u'0': 132})
p_7 right margin stats: Counter({u'0': 132})
Cleaning up manifest...
Trimming unused files from manifest...
Trimming u'feed_0/article_1/images/img1.png' from manifest
Trimming u'feed_0/article_0/images/img1.png' from manifest
Trimming u'feed_1/article_0/images/img1.png' from manifest
Trimming u'feed_1/article_1/images/img1.png' from manifest
Creating MOBI Output...
Serializing resources...
Converting TOC for MOBI periodical indexing...
Using mastheadImage supplied in manifest...
Creating MOBI 6 output
Generating in-line TOC...
Applying case-transforming CSS...
Parsing manglecase.css ...
Parsing tocstyle.css ...
Rasterizing SVG images...
Converting XHTML to Mobipocket markup...
Failed to find image: http://cdn.images.express.co.uk/img/...-MP-454730.jpg
Failed to find image: http://cdn.images.express.co.uk/img/...ary/118906.jpg
Failed to find image: http://cdn.images.express.co.uk/img/...nce-454749.jpg
Failed to find image: http://cdn.images.express.co.uk/img/...ner-454726.jpg
Failed to find image: http://cdn.images.express.co.uk/img/...ary/118868.jpg
Failed to find image: http://cdn.images.express.co.uk/img/...and-454721.jpg
Failed to find image: http://cdn.images.express.co.uk/img/...tes-454745.jpg
Failed to find image: http://cdn.images.express.co.uk/img/...ing-454744.jpg
Failed to find image: http://cdn.images.express.co.uk/img/...yes-454711.jpg
Failed to find image: http://cdn.images.express.co.uk/img/...ary/118847.jpg
Failed to find image: http://cdn.images.express.co.uk/img/...ter-454754.jpg
Failed to find image: http://cdn.images.express.co.uk/img/...ary/118905.jpg
Failed to find image: http://cdn.images.express.co.uk/img/...ilm-454753.jpg
Failed to find image: http://cdn.images.express.co.uk/img/...ary/118903.jpg
Failed to find image: http://cdn.images.express.co.uk/img/...rer-454737.jpg
Failed to find image: http://cdn.images.express.co.uk/img/...day-454731.jpg
Failed to find image: http://cdn.images.express.co.uk/img/...ary/118875.jpg
Serializing markup content...
Compressing markup content...
Generating MOBI index for a periodical
MOBI output written to d:\temp\calibre_tnwotw\ulo5md_recipe_out.mobi
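
For anyone wanting to check the image URLs from a log like the one above, here is a rough sketch of a helper that pulls them out. The function name failed_images and the sample text are made up for illustration; the only thing taken from the log is the "Failed to find image:" prefix. It is not part of calibre or of the recipe.

```python
import re

# Matches the "Failed to find image: <url>" entries seen in the log above.
# The URL is taken to be everything up to the next whitespace.
FAILED_IMAGE = re.compile(r'Failed to find image:\s*(\S+)')


def failed_images(log_text):
    """Return the image URLs reported as missing, in log order, without duplicates."""
    seen = []
    for url in FAILED_IMAGE.findall(log_text):
        if url not in seen:
            seen.append(url)
    return seen


if __name__ == '__main__':
    # Hypothetical sample text; in practice paste in the real job log.
    sample = (
        'Converting XHTML to Mobipocket markup...\n'
        'Failed to find image: http://example.com/a.jpg\n'
        'Failed to find image: http://example.com/b.jpg\n'
        'Failed to find image: http://example.com/a.jpg\n'
        'Serializing markup content...\n'
    )
    for url in failed_images(sample):
        print(url)
```

Because the URL is matched up to the next whitespace, this should also cope with a log that has lost its line breaks. Note that the forum shortens long URLs with "...", so the full addresses have to come from the original log rather than from this post.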
