Old 01-19-2014, 05:36 AM   #1
scissors
Addict
Posts: 241
Karma: 1001369
Join Date: Sep 2010
Device: prs300, kindle keyboard 3g
Express failing recipe rewrite produces unexpected results

The Daily Express recipe has started producing an empty MOBI.
I inspected the website, and the tags I used in keep_only_tags don't seem to have changed.

I removed keep_only_tags and the articles came down (plus all the other junk).

I added some re.compile lines to preprocess_regexps to get rid of the footer etc.

The articles now download, but without the images.
The log contains a load of "Failed to find image: ...." lines.

When I look at the MOBI in calibre's viewer, the images still reference the online image URLs (unsurprising, given the failures).

Yet if I type these URLs straight into my browser's address bar, they load with no problem.

Could anyone explain this change?

Thanks

Express Recipe (old)
Spoiler:
Code:
import re
from calibre.web.feeds.news import BasicNewsRecipe
from calibre import browser
class AdvancedUserRecipe1376229553(BasicNewsRecipe):
    title          = u'Daily Express old'
    __author__ = 'Dave Asbury'
    # 9-9-13 added article author and now use (re.compile(r'>[\w].+? News<'
    # 16-11-13 cover adjustment
    encoding    = 'utf-8'
    remove_empty_feeds = True
    #remove_javascript     = True
    no_stylesheets        = True
    oldest_article = 1
    max_articles_per_feed = 10
    #auto_cleanup = True
    compress_news_images = True
    compress_news_images_max_size = 30
    ignore_duplicate_articles = {'title', 'url'}
    masthead_url = 'http://cdn.images.dailyexpress.co.uk/img/page/express_logo.png'

    preprocess_regexps = [

                (re.compile(r'widget', re.IGNORECASE | re.DOTALL), lambda match: ''),
                        (re.compile(r'Related articles', re.IGNORECASE | re.DOTALL), lambda match: ''),
                        (re.compile(r'Add Your Comment<', re.IGNORECASE | re.DOTALL), lambda match: '<'),
                (re.compile(r'>More [\w].+?<', re.IGNORECASE), lambda match: '><'),
                                (re.compile(r'>[\w].+? News<', re.IGNORECASE), lambda match: '><'),
                         
                ]

    remove_tags = [
                                dict(attrs={'class' : 'quote'}),
                
                                dict(name='footer'),
                                dict(attrs={'id' : 'header_addons'}),
                                dict(attrs={'class' : 'hoverException'}),
                                dict(name='_li'),dict(name='li'),
                                dict(attrs={'class' : 'box related-articles clear'}),
                                dict(attrs={'class' : 'news-list'}),
                                dict(attrs={'class' : 'sponsored-section'}),
                                dict(attrs={'class' : 'pull-quote on-right'}),
                                dict(attrs={'class' : 'pull-quote on-left'}),

                             ]
    keep_only_tags = [
                dict(name='h1'),
                                dict(attrs={'class' : 'publish-info'}),
                                dict(name='h3', limit=2),
                                dict(attrs={'class' : 'clearfix hR new-style'}),
                             ]

    feeds          = [(u'UK News', u'http://www.express.co.uk/posts/rss/1/uk'),
                 (u'World News',u'http://www.express.co.uk/posts/rss/78/world'),
                         (u'Finance',u'http://www.express.co.uk/posts/rss/21/finance'),
                 (u'Sport',u'http://www.express.co.uk/posts/rss/65/sport'),
                 (u'Entertainment',u'http://www.express.co.uk/posts/rss/18/entertainment'),
                         (u'Lifestyle',u'http://www.express.co.uk/posts/rss/8/life&style'),
                 (u'Fun',u'http://www.express.co.uk/posts/rss/110/fun'),
                        ]

    def get_cover_url(self):
        print '============Cover ================='
        print
        soup = self.index_to_soup('http://www.express.co.uk/ourpaper/')
        cov = soup.find(attrs={'src' : re.compile('http://cdn.images.express.co.uk/img/covers/')})
        cov=str(cov)
        print '^^^^^^^', cov
        cov2 =  re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', cov)

        cov=str(cov2)
        cov=cov[2:len(cov)-2]

        print '&&&&&&&&',cov,'***'
        #cover_url=cov
        br = browser()
        br.set_handle_redirect(False)
        try:
            br.open_novisit(cov)
            cover_url = cov
        except:
            cover_url ='http://cdn.images.express.co.uk/img/static/ourpaper/header-back-issue-papers.jpg'

        return cover_url

    extra_css = '''
                    h1{font-weight:bold;font-size:175%;}
                    h2{font-weight:normal;font-size:75%;}
                    #p{font-size:14px;}
                    #body{font-size:14px;}
                    .photo-caption {display: block;margin-left: auto;margin-right: auto;width:100%;font-size:40%;}
                    .publish-info {font-size:50%;}
                    .photo img {display: block;margin-left: auto;margin-right: auto;width:100%;}
      '''
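
As an aside on get_cover_url above: turning the tag into a string, running a general URL regex over it and then slicing the brackets off the list's string form is fairly fragile. A minimal sketch of a more direct approach follows, assuming the cover tag looks like the `<img>` printed in the log further down. The names COVER_SRC, FALLBACK_COVER and cover_from_tag are made up for illustration and are not part of calibre.

```python
import re

# Assumption: the cover is an <img> whose src points at the /img/covers/ path,
# as in the tag printed in the log. COVER_SRC is a hypothetical helper pattern.
COVER_SRC = re.compile(
    r'src="(http://cdn\.images\.express\.co\.uk/img/covers/[^"]+)"')

# Same fallback image the recipe already uses when the cover cannot be opened.
FALLBACK_COVER = ('http://cdn.images.express.co.uk/img/static/ourpaper/'
                  'header-back-issue-papers.jpg')


def cover_from_tag(tag_html, fallback=FALLBACK_COVER):
    """Return the cover URL found in tag_html, or fallback if there is none."""
    match = COVER_SRC.search(tag_html)
    return match.group(1) if match else fallback
```

Inside the recipe this would be called as cover_from_tag(str(cov)); when soup.find() finds nothing, str(cov) is 'None' and the fallback is returned.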


Express Recipe (new)
Spoiler:
Code:
import re

from calibre.web.feeds.news import BasicNewsRecipe
from calibre import browser
class AdvancedUserRecipe1376229553(BasicNewsRecipe):
    title          = u'Daily Express'
    __author__ = 'Dave Asbury'
    # 9-9-13 added article author and now use (re.compile(r'>[\w].+? News<'
    # 16-11-13 cover adjustment
    #19.1.14 changes due to website changes breaking recipe
    encoding    = 'utf-8'
    remove_empty_feeds = True
    remove_javascript     = True
    no_stylesheets        = True
    oldest_article = 1
    max_articles_per_feed = 2
    #auto_cleanup = True
    compress_news_images = True
    compress_news_images_max_size = 30
    ignore_duplicate_articles = {'title', 'url'}
    masthead_url = 'http://cdn.images.dailyexpress.co.uk/img/page/express_logo.png'
    #conversion_options = { 'linearize_tables' : True }
    preprocess_regexps = [
                                   (re.compile(r'<blockquote.*</blockquote>', re.DOTALL|re.IGNORECASE),lambda match: '</header>'),      
                                   (re.compile(r'widget', re.IGNORECASE | re.DOTALL), lambda match: ''),
                                   (re.compile(r'Related articles', re.IGNORECASE | re.DOTALL), lambda match: ''),
                                   (re.compile(r'Add Your Comment<', re.IGNORECASE | re.DOTALL), lambda match: '<'),
                                   (re.compile(r'>More [\w].+?<', re.IGNORECASE), lambda match: '><'),
                                   (re.compile(r'>[\w].+? News<', re.IGNORECASE), lambda match: '><'),
                                   #(re.compile(r'<footer class="mainFooter cf" id="mainFooter">.*</footer>', re.DOTALL|re.IGNORECASE),lambda match: ''),
                                   (re.compile(r'<section class="box related-articles clear">.*</footer>', re.DOTALL|re.IGNORECASE),lambda match: ''),
                                   (re.compile(r'<div style="display:inline;">.*</div>', re.DOTALL|re.IGNORECASE),lambda match: ''),
                                   (re.compile(r'<div style="display:none;">.*</div>', re.DOTALL|re.IGNORECASE),lambda match: ''),
		   #(re.compile(r'<nav>.*</nav>', re.DOTALL|re.IGNORECASE),lambda match: ''),
		   (re.compile(r'<div class="social-widget gplus">.*</header>', re.DOTALL|re.IGNORECASE),lambda match: '</header>'),      
          ]

    remove_tags = [
                                dict(attrs={'class' : 'quote'}),
                                dict(attrs={'class' : 'mainFooter cf'}),
                                dict(name='footer'),
                                dict(attrs={'id' : 'header_addons'}),
                                dict(attrs={'class' : 'hoverException'}),
                                dict(name='_li'),dict(name='li'),
                                dict(attrs={'class' : 'box related-articles clear'}),
                                dict(attrs={'class' : 'news-list'}),
                                dict(attrs={'class' : 'sponsored-section'}),
                                dict(attrs={'class' : 'pull-quote on-right'}),
                                dict(attrs={'class' : 'pull-quote on-left'}),

                             ]
    remove_tags_after = [dict(attrs={'class' : 'clearfix hR new-style'})]
    

    feeds          = [(u'UK News', u'http://www.express.co.uk/posts/rss/1/uk'),
                 (u'World News',u'http://www.express.co.uk/posts/rss/78/world'),
                         (u'Finance',u'http://www.express.co.uk/posts/rss/21/finance'),
                 (u'Sport',u'http://www.express.co.uk/posts/rss/65/sport'),
                 (u'Entertainment',u'http://www.express.co.uk/posts/rss/18/entertainment'),
                         (u'Lifestyle',u'http://www.express.co.uk/posts/rss/8/life&style'),
                 (u'Fun',u'http://www.express.co.uk/posts/rss/110/fun'),
                        ]

    def get_cover_url(self):
        print '============Cover ================='
        print
        soup = self.index_to_soup('http://www.express.co.uk/ourpaper/')
        cov = soup.find(attrs={'src' : re.compile('http://cdn.images.express.co.uk/img/covers/')})
        cov=str(cov)
        print '^^^^^^^', cov
        cov2 =  re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', cov)

        cov=str(cov2)
        cov=cov[2:len(cov)-2]

        print '&&&&&&&&',cov,'***'
        #cover_url=cov
        br = browser()
        br.set_handle_redirect(False)
        try:
            br.open_novisit(cov)
            cover_url = cov
        except:
            cover_url ='http://cdn.images.express.co.uk/img/static/ourpaper/header-back-issue-papers.jpg'

        return cover_url

    extra_css = '''
                    h1{font-weight:bold;font-size:175%;}
                    h2{font-weight:normal;font-size:75%;}
                    #p{font-size:14px;}
                    #body{font-size:14px;}
                    .photo-caption {display: block;margin-left: auto;margin-right: auto;width:100%;font-size:40%;}
                    .publish-info {font-size:50%;}
                    .photo img {display: block;margin-left: auto;margin-right: auto;width:100%;}
      '''
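
One thing worth checking in the preprocess_regexps above: patterns such as `<blockquote.*</blockquote>` use a greedy `.*` together with re.DOTALL, so a single match runs from the first opening tag to the last closing tag on the page and removes everything in between. Whether that is related to the missing images is only a guess, but the difference is easy to see on a made-up fragment (the HTML below is invented for illustration, not taken from the Express site):

```python
import re

# Invented sample page with two blockquotes and content between them.
html = ('<p>a</p><blockquote>q1</blockquote>'
        '<p>keep</p><blockquote>q2</blockquote><p>b</p>')

# Greedy: one match spans from the first <blockquote to the last </blockquote>.
greedy = re.compile(r'<blockquote.*</blockquote>', re.DOTALL | re.IGNORECASE)

# Non-greedy: each blockquote is matched on its own.
lazy = re.compile(r'<blockquote.*?</blockquote>', re.DOTALL | re.IGNORECASE)

greedy_out = greedy.sub('', html)  # the <p>keep</p> paragraph is lost
lazy_out = lazy.sub('', html)      # the <p>keep</p> paragraph survives
```

Using `.*?` keeps each substitution local to one element, at the cost of not handling nested tags of the same name.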

Log file
Spoiler:

Fetch news from Daily Express
Failed to initialize plugin: Kindle and Mobipocket DeDRM (0, 4, 18)
Failed to initialize plugin: u'C:\\Users\\Dave\\AppData\\Roaming\\calibre\\plugins\\Kindle and Mobipocket DeDRM.zip'
Resolved conversion options
calibre version: 1.20.0
{'asciiize': False,
'author_sort': None,
'authors': None,
'base_font_size': 13.0,
'book_producer': None,
'change_justification': 'original',
'chapter': None,
'chapter_mark': 'pagebreak',
'comments': None,
'cover': None,
'debug_pipeline': None,
'dehyphenate': True,
'delete_blank_paragraphs': True,
'disable_font_rescaling': False,
'dont_compress': False,
'dont_download_recipe': False,
'duplicate_links_in_toc': False,
'embed_all_fonts': False,
'embed_font_family': None,
'enable_heuristics': False,
'expand_css': False,
'extra_css': None,
'extract_to': None,
'filter_css': None,
'fix_indents': True,
'font_size_mapping': None,
'format_scene_breaks': True,
'html_unwrap_factor': 0.4,
'input_encoding': None,
'input_profile': <calibre.customize.profiles.InputProfile object at 0x0420B5B0>,
'insert_blank_line': False,
'insert_blank_line_size': 0.5,
'insert_metadata': False,
'isbn': None,
'italicize_common_cases': True,
'keep_ligatures': False,
'language': None,
'level1_toc': None,
'level2_toc': None,
'level3_toc': None,
'line_height': 0,
'linearize_tables': False,
'lrf': False,
'margin_bottom': 5.0,
'margin_left': 5.0,
'margin_right': 5.0,
'margin_top': 5.0,
'markup_chapter_headings': True,
'max_toc_links': 50,
'minimum_line_height': 120.0,
'mobi_file_type': 'old',
'mobi_ignore_margins': False,
'mobi_keep_original_images': False,
'mobi_toc_at_start': False,
'no_chapters_in_toc': False,
'no_inline_navbars': True,
'no_inline_toc': False,
'output_profile': <calibre.customize.profiles.KindlePaperWhiteOutput object at 0x0420B950>,
'page_breaks_before': None,
'personal_doc': '[PDOC]',
'prefer_author_sort': False,
'prefer_metadata_cover': False,
'pretty_print': False,
'pubdate': None,
'publisher': None,
'rating': None,
'read_metadata_from_opf': None,
'remove_fake_margins': True,
'remove_first_image': False,
'remove_paragraph_spacing': False,
'remove_paragraph_spacing_indent_size': 1.5,
'renumber_headings': True,
'replace_scene_breaks': '',
'search_replace': None,
'series': None,
'series_index': None,
'share_not_sync': False,
'smarten_punctuation': False,
'sr1_replace': '',
'sr1_search': '',
'sr2_replace': '',
'sr2_search': '',
'sr3_replace': '',
'sr3_search': '',
'start_reading_at': None,
'subset_embedded_fonts': False,
'tags': None,
'test': False,
'timestamp': None,
'title': None,
'title_sort': None,
'toc_filter': None,
'toc_threshold': 6,
'toc_title': None,
'unsmarten_punctuation': False,
'unwrap_lines': True,
'use_auto_toc': False,
'verbose': 2}
InputFormatPlugin: Recipe Input running
Using custom recipe
Skipping article Top 10 facts about Popeye (Fri, 17 Jan, 2014 00:01) from feed Fun as it is too old.
Skipping article Top 10 facts about Sherlock Holmes (Thu, 16 Jan, 2014 00:01) from feed Fun as it is too old.
Skipping article Top 10 facts about museums (Wed, 15 Jan, 2014 00:01) from feed Fun as it is too old.
Skipping article Top 10 facts about Casablanca (Tue, 14 Jan, 2014 00:01) from feed Fun as it is too old.
Skipping article Top 10 facts about Glasgow (Mon, 13 Jan, 2014 00:00) from feed Fun as it is too old.
Skipping article Top 10 facts about Siberia (Fri, 10 Jan, 2014 00:00) from feed Fun as it is too old.
Skipping article Top ten facts about Cambridge (Thu, 09 Jan, 2014 00:00) from feed Fun as it is too old.
Skipping article Top 10 facts about singing (Wed, 08 Jan, 2014 00:00) from feed Fun as it is too old.
Skipping article Top 10 facts about Twelfth Night (Mon, 06 Jan, 2014 00:00) from feed Fun as it is too old.
Skipping article Top 10 facts about... lords (Fri, 03 Jan, 2014 00:00) from feed Fun as it is too old.
============Cover =================

^^^^^^^ <img src="http://cdn.images.express.co.uk/img/covers/287x361front/2014-01-19.jpg" alt="" width="287" height="361" />
&&&&&&&& http://cdn.images.express.co.uk/img/...2014-01-19.jpg ***
Downloading
Fetching http://www.express.co.uk/news/uk/454...l-jibe-at-poor
Downloading
Fetching http://www.express.co.uk/news/uk/454...ling-marriages
Downloading
Fetching http://www.express.co.uk/news/world/...talian-retrial
Downloading
Fetching http://www.express.co.uk/news/world/...ive-in-Vietnam
Downloading
Fetching http://www.express.co.uk/finance/cit...interest-level
Processing images...
Fetching http://b.scorecardresearch.com/p?c1=...52&cv=2.0&cj=1
Processing images...
Processing images...
Fetching http://b.scorecardresearch.com/p?c1=...52&cv=2.0&cj=1
Fetching http://b.scorecardresearch.com/p?c1=...52&cv=2.0&cj=1
Processing images...
Fetching http://b.scorecardresearch.com/p?c1=...52&cv=2.0&cj=1
Recursion limit reached. Skipping links in http://www.express.co.uk/news/uk/454...l-jibe-at-poor
Recursion limit reached. Skipping links in http://www.express.co.uk/news/world/...ive-in-Vietnam
Recursion limit reached. Skipping links in http://www.express.co.uk/news/uk/454...ling-marriages
Recursion limit reached. Skipping links in http://www.express.co.uk/news/world/...talian-retrial
Processing images...
Recursion limit reached. Skipping links in http://www.express.co.uk/finance/cit...interest-level
http://www.express.co.uk/news/uk/454...l-jibe-at-poor saved to d:\temp\calibre_tnwotw\c1n8sv_plumber\feed_0\article_0\index.xhtml
Downloading
Fetching http://www.express.co.uk/finance/cit...lling-from-MPs
Downloaded article: Edwina Currie's cruel jibe at the poor from http://www.express.co.uk/news/uk/454...l-jibe-at-poor
http://www.express.co.uk/news/world/...ive-in-Vietnam saved to d:\temp\calibre_tnwotw\c1n8sv_plumber\feed_1\article_1\index.xhtml
http://www.express.co.uk/news/uk/454...ling-marriages saved to d:\temp\calibre_tnwotw\c1n8sv_plumber\feed_0\article_1\index.xhtml
Downloading
Fetching http://www.express.co.uk/sport/footb...erto-Gilardino
http://www.express.co.uk/news/world/...talian-retrial saved to d:\temp\calibre_tnwotw\c1n8sv_plumber\feed_1\article_0\index.xhtml
Downloaded article: 'I have to believe my son is alive in Vietnam' from http://www.express.co.uk/news/world/...ive-in-Vietnam
Downloaded article: MPs tell Cameron to fight 'plague' of failing marriages from http://www.express.co.uk/news/uk/454...ling-marriages
Downloading
Downloaded article: Amanda Knox's fears over Italian retrial from http://www.express.co.uk/news/world/...etrial
Fetching http://www.express.co.uk/entertainme...-a-perfect-man
Downloading
Fetching http://www.express.co.uk/sport/footb...b-in-the-WORLD
http://www.express.co.uk/finance/cit...interest-level saved to d:\temp\calibre_tnwotw\c1n8sv_plumber\feed_2\article_0\index.xhtml
Downloading
Fetching http://www.express.co.uk/entertainme...made-me-a-star
Downloaded article: Rate hike risk as jobless toll falls: Mark Carney urged not to raise interest level from http://www.express.co.uk/finance/cit...interest-level
Could not fetch link http://www.express.co.uk/sport/footb...erto-Gilardino
Traceback (most recent call last):
File "site-packages\calibre\web\fetch\simple.py", line 518, in process_links
File "site-packages\calibre\web\fetch\simple.py", line 250, in fetch_url
FetchError: Not Found

http://www.express.co.uk/sport/footb...erto-Gilardino saved to
Downloading
Fetching http://www.express.co.uk/life-style/...e-grand-Tourer
Failed to download article: West Ham boss Sam Allardyce targets Italian international Alberto Gilardino from http://www.express.co.uk/sport/footb...erto-Gilardino
Traceback (most recent call last):
File "site-packages\calibre\utils\threadpool.py", line 95, in run
File "site-packages\calibre\web\feeds\news.py", line 1106, in fetch_article
File "site-packages\calibre\web\feeds\news.py", line 1101, in _fetch_article
Exception: Could not fetch article. The debug traceback is available earlier in this log



Processing images...
Recursion limit reached. Skipping links in http://www.express.co.uk/finance/cit...lling-from-MPs
http://www.express.co.uk/finance/cit...lling-from-MPs saved to d:\temp\calibre_tnwotw\c1n8sv_plumber\feed_2\article_1\index.xhtml
Downloading
Fetching http://www.express.co.uk/life-style/...nd-Switzerland
Downloaded article: Energy grid firms set for a grilling from MPs from http://www.express.co.uk/finance/cit...lling-from-MPs
Processing images...
Processing images...
Recursion limit reached. Skipping links in http://www.express.co.uk/life-style/...e-grand-Tourer
Processing images...
Recursion limit reached. Skipping links in http://www.express.co.uk/entertainme...made-me-a-star
Processing images...
Recursion limit reached. Skipping links in http://www.express.co.uk/entertainme...-a-perfect-man
Recursion limit reached. Skipping links in http://www.express.co.uk/sport/footb...b-in-the-WORLD
http://www.express.co.uk/entertainme...made-me-a-star saved to d:\temp\calibre_tnwotw\c1n8sv_plumber\feed_4\article_1\index.xhtml
Downloaded article: Failing to get Bourne lead made me a star from http://www.express.co.uk/entertainme...made-me-a-star
http://www.express.co.uk/life-style/...e-grand-Tourer saved to d:\temp\calibre_tnwotw\c1n8sv_plumber\feed_5\article_0\index.xhtml
http://www.express.co.uk/entertainme...-a-perfect-man saved to d:\temp\calibre_tnwotw\c1n8sv_plumber\feed_4\article_0\index.xhtml
Downloaded article: Honda Civic Tourer: The grand Tourer from http://www.express.co.uk/life-style/...e-grand-Tourer
Downloaded article: Dad may have been a gangster, but to us he was a perfect man from http://www.express.co.uk/entertainme...-a-perfect-man
http://www.express.co.uk/sport/footb...b-in-the-WORLD saved to d:\temp\calibre_tnwotw\c1n8sv_plumber\feed_3\article_0\index.xhtml
Downloaded article: David Moyes insists Man Utd are biggest club in the WORLD from http://www.express.co.uk/sport/footb...b-in-the-WORLD
Processing images...
Recursion limit reached. Skipping links in http://www.express.co.uk/life-style/...nd-Switzerland
http://www.express.co.uk/life-style/...nd-Switzerland saved to d:\temp\calibre_tnwotw\c1n8sv_plumber\feed_5\article_1\index.xhtml
Downloaded article: Slope off for a family skiing break this half-term holiday from http://www.express.co.uk/life-style/...nd-Switzerland
Failed to download the following articles:
West Ham boss Sam Allardyce targets Italian international Alberto Gilardino from Sport
http://www.express.co.uk/sport/footb...erto-Gilardino
Traceback (most recent call last):
File "site-packages\calibre\utils\threadpool.py", line 95, in run
File "site-packages\calibre\web\feeds\news.py", line 1106, in fetch_article
File "site-packages\calibre\web\feeds\news.py", line 1101, in _fetch_article
Exception: Could not fetch article. The debug traceback is available earlier in this log

Parsing all content...
Parsing feed_2/article_0/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_2/article_0/index.html as HTML
Parsing feed_2/article_1/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_2/article_1/index.html as HTML
Parsing feed_0/article_1/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_0/article_1/index.html as HTML
Parsing feed_3/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_3/index.html as HTML
Parsing index.html ...
Forcing index.html into XHTML namespace
Parsing feed_3/article_0/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_3/article_0/index.html as HTML
Parsing feed_4/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_4/index.html as HTML
Parsing feed_1/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_1/index.html as HTML
Parsing feed_0/article_0/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_0/article_0/index.html as HTML
Parsing feed_1/article_0/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_1/article_0/index.html as HTML
Parsing feed_4/article_1/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_4/article_1/index.html as HTML
Parsing feed_0/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_0/index.html as HTML
Parsing feed_5/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_5/index.html as HTML
Parsing feed_5/article_0/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_5/article_0/index.html as HTML
Parsing feed_5/article_1/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_5/article_1/index.html as HTML
Parsing feed_1/article_1/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_1/article_1/index.html as HTML
Parsing feed_4/article_0/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_4/article_0/index.html as HTML
Parsing feed_2/index.html ...
Initial parse failed, using more forgiving parsers
Parsing feed_2/index.html as HTML
Referenced file u'/search/Geoff%2bHo%3fs%3dGeoff%2bHo%26b%3d1' not found
Referenced file u'/search/Henry%2bFitzherbert%3fs%3dHenry%2bFitzherbert%26b%3d1' not found
Referenced file u'/search/Marco%2bGiannangeli%3fs%3dMarco%2bGiannangeli%26b%3d1' not found
Referenced file u'/search/Helen%2bMassy-Beresford%3fs%3dHelen%2bMassy-Beresford%26b%3d1' not found
Referenced file u'/search/Tom%2bStewart%3fs%3dTom%2bStewart%26b%3d1' not found
Referenced file u'/news/world/454721/I-have-to-believe-my-son-is-alive-in-Vietnam' not found
Referenced file u'/search/Paula%2bMurray%3fs%3dPaula%2bMurray%26b%3d1' not found
Referenced file u'/search/John%2bRichardson%3fs%3dJohn%2bRichardson%26b%3d1' not found
Referenced file u'/news/uk/454730/Edwina-Currie-s-cruel-jibe-at-poor' not found
Referenced file u'/search/Lucy%2bJohnstone%3fs%3dLucy%2bJohnstone%26b%3d1' not found
Referenced file u'//www.googleadservices.com/pagead/conversion.js' not found
Referenced file u'/opensearch.xml' not found
Referenced file u'/entertainment/films/454753/Failing-to-get-Bourne-lead-made-me-a-star' not found
Referenced file u'/finance/city/454745/Rate-hike-risk-as-jobless-toll-falls-Mark-Carney-urged-not-to-raise-interest-level' not found
Referenced file u'/entertainment/books/454754/Dad-may-have-been-a-gangster-but-to-us-he-was-a-perfect-man' not found
Referenced file u'/sport/football/454711/David-Moyes-insists-Man-Utd-are-biggest-club-in-the-WORLD' not found
Referenced file u'/search/David%2bMeikle%3fs%3dDavid%2bMeikle%26b%3d1' not found
Referenced file u'/news/uk/454749/MPs-tell-Cameron-to-fight-plague-of-failing-marriages' not found
Referenced file u'/search/Nicola%2bIseard%3fs%3dNicola%2bIseard%26b%3d1' not found
Referenced file u'/news/world/454726/Amanda-Knox-s-fears-over-Italian-retrial' not found
Referenced file u'/finance/city/454744/Energy-grid-firms-set-for-a-grilling-from-MPs' not found
Referenced file u'//s7.addthis.com/js/300/addthis_.js' not found
Referenced file u'feed_6/index.html' not found
Referenced file u'/search/Mike%2bParker%3fs%3dMike%2bParker%26b%3d1' not found
Referenced file u'/life-style/travel/454731/Best-ski-resorts-for-families-this-half-term-in-Austria-France-Colorado-and-Switzerland' not found
Referenced file u'/life-style/cars/454737/Honda-Civic-Tourer-The-grand-Tourer' not found
Referenced file u'feed_3/article_1/index.html' not found
Reading TOC from NCX...
Merging user specified metadata...
Detecting structure...
Flattening CSS and remapping font sizes...
Source base font size is 12.00000pt
Removing fake margins...
Found 22 items of level: div_8
Found 19 items of level: div_1
Found 44 items of level: div_3
Found 50 items of level: div_2
Found 22 items of level: div_5
Found 33 items of level: div_4
Found 55 items of level: div_7
Found 22 items of level: div_6
Found 132 items of level: p_7
Found 2 items of level: p_2
Ignoring level div_8
Ignoring level div_5
Ignoring level p_2
Ignoring level div_6
div_1 left margin stats: Counter({u'': 11})
div_1 right margin stats: Counter({u'': 11})
div_3 left margin stats: Counter({u'': 44})
div_3 right margin stats: Counter({u'': 44})
div_2 left margin stats: Counter({u'': 11})
div_2 right margin stats: Counter({u'': 11})
div_4 left margin stats: Counter({u'': 33})
div_4 right margin stats: Counter({u'': 33})
div_7 left margin stats: Counter({u'': 55})
div_7 right margin stats: Counter({u'': 55})
p_7 left margin stats: Counter({u'0': 132})
p_7 right margin stats: Counter({u'0': 132})
Cleaning up manifest...
Trimming unused files from manifest...
Trimming u'feed_0/article_1/images/img1.png' from manifest
Trimming u'feed_0/article_0/images/img1.png' from manifest
Trimming u'feed_1/article_0/images/img1.png' from manifest
Trimming u'feed_1/article_1/images/img1.png' from manifest
Creating MOBI Output...
Serializing resources...
Converting TOC for MOBI periodical indexing...
Using mastheadImage supplied in manifest...
Creating MOBI 6 output
Generating in-line TOC...
Applying case-transforming CSS...
Parsing manglecase.css ...

Parsing tocstyle.css ...
Rasterizing SVG images...
Converting XHTML to Mobipocket markup...
Failed to find image: http://cdn.images.express.co.uk/img/...-MP-454730.jpg
Failed to find image: http://cdn.images.express.co.uk/img/...ary/118906.jpg

Failed to find image: http://cdn.images.express.co.uk/img/...nce-454749.jpg
Failed to find image: http://cdn.images.express.co.uk/img/...ner-454726.jpg
Failed to find image: http://cdn.images.express.co.uk/img/...ary/118868.jpg
Failed to find image: http://cdn.images.express.co.uk/img/...and-454721.jpg
Failed to find image: http://cdn.images.express.co.uk/img/...tes-454745.jpg
Failed to find image: http://cdn.images.express.co.uk/img/...ing-454744.jpg

Failed to find image: http://cdn.images.express.co.uk/img/...yes-454711.jpg
Failed to find image: http://cdn.images.express.co.uk/img/...ary/118847.jpg
Failed to find image: http://cdn.images.express.co.uk/img/...ter-454754.jpg
Failed to find image: http://cdn.images.express.co.uk/img/...ary/118905.jpg
Failed to find image: http://cdn.images.express.co.uk/img/...ilm-454753.jpg
Failed to find image: http://cdn.images.express.co.uk/img/...ary/118903.jpg
Failed to find image: http://cdn.images.express.co.uk/img/...rer-454737.jpg
Failed to find image: http://cdn.images.express.co.uk/img/...day-454731.jpg
Failed to find image: http://cdn.images.express.co.uk/img/...ary/118875.jpg

Serializing markup content...
Compressing markup content...

Generating MOBI index for a periodical
MOBI output written to d:\temp\calibre_tnwotw\ulo5md_recipe_out.mobi

Old 01-19-2014, 07:54 AM   #2
kovidgoyal
creator of calibre
Posts: 44,261
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
That website likely has some invalid markup that is preventing calibre from parsing the page. From a quick look, it uses invalid HTML comments like this:

<!—OVOLABS_2 START—>

Note the use of an mdash instead of a double hyphen.

I fixed the recipe to take care of that. https://github.com/kovidgoyal/calibr...7727e30fcb8ba8
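
For anyone hitting the same markup in another recipe, here is a rough sketch of the kind of preprocessing that deals with it. This is not necessarily what the linked commit does; BAD_COMMENT and fix_mdash_comments are names invented for illustration. It rewrites comment delimiters written with an mdash (U+2014) into the standard double-hyphen form before the page is parsed.

```python
import re

# Hypothetical pattern: an HTML "comment" whose delimiters use U+2014 (mdash)
# instead of "--". The \u2014 escape avoids needing a source-encoding header.
BAD_COMMENT = re.compile(u'<!\u2014(.*?)\u2014>', re.DOTALL)


def fix_mdash_comments(html):
    """Rewrite mdash-delimited comments into valid <!-- ... --> comments."""
    return BAD_COMMENT.sub(lambda m: u'<!--' + m.group(1) + u'-->', html)
```

In a recipe the same pattern could go into preprocess_regexps as a (BAD_COMMENT, lambda m: u'<!--' + m.group(1) + u'-->') pair, which is the form the recipes in this thread already use.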
Old 01-19-2014, 09:03 AM   #3
scissors
Hi Kovid

Thanks for that.

However, I rewrote the recipe as it was getting messy.
This is the new one, which seems a lot faster.

I do have one question about the auto cleanup code.
In the recipe I wanted the photos and the writer info to be kept rather than cleaned up.

I used the following

auto_cleanup_keep = '//section[@class="photo"]'
#auto_cleanup_keep = '//div[@class="publish-info"]'
auto_cleanup = True

The second line is commented out because when I enable it the photos disappear. Is it the case that auto_cleanup_keep can only be set once?

Kind Regards
Dave


Express, new recipe
Spoiler:
Code:
import re

from calibre.web.feeds.news import BasicNewsRecipe
from calibre import browser
class AdvancedUserRecipe1390132023(BasicNewsRecipe):
    title          = u'Daily Express'
    __author__ = 'Dave Asbury'
    # 19.1.14 written due to website changes
    oldest_article = 1
    max_articles_per_feed = 10
    compress_news_images = True
    compress_news_images_max_size = 30
    ignore_duplicate_articles = {'title', 'url'}
    masthead_url = 'http://cdn.images.dailyexpress.co.uk/img/page/express_logo.png'
    auto_cleanup_keep = '//section[@class="photo"]'
    #auto_cleanup_keep = '//div[@class="publish-info"]' 
    auto_cleanup = True
    no_stylesheets        = False
    preprocess_regexps = [
        # Remove text of the form "| ... | ... | Daily Express"
        (re.compile(r'\| [\w].+?\| [\w].+?\| Daily Express', re.IGNORECASE | re.DOTALL), lambda match: ''),
    ]
    feeds = [
        (u'UK News', u'http://www.express.co.uk/posts/rss/1/uk'),
        (u'World News', u'http://www.express.co.uk/posts/rss/78/world'),
        (u'Finance', u'http://www.express.co.uk/posts/rss/21/finance'),
        (u'Sport', u'http://www.express.co.uk/posts/rss/65/sport'),
        (u'Entertainment', u'http://www.express.co.uk/posts/rss/18/entertainment'),
        (u'Lifestyle', u'http://www.express.co.uk/posts/rss/8/life&style'),
        (u'Fun', u'http://www.express.co.uk/posts/rss/110/fun'),
    ]

    def get_cover_url(self):
        # Generic image used when today's cover cannot be found or fetched
        fallback = 'http://cdn.images.express.co.uk/img/static/ourpaper/header-back-issue-papers.jpg'
        soup = self.index_to_soup('http://www.express.co.uk/ourpaper/')
        # First tag whose src points into the covers directory
        cov = soup.find(attrs={'src': re.compile(r'http://cdn\.images\.express\.co\.uk/img/covers/')})
        if cov is None:
            return fallback
        cover_url = cov['src']
        self.log('Cover URL found:', cover_url)
        br = browser()
        br.set_handle_redirect(False)
        try:
            # Check that the cover image can actually be fetched
            br.open_novisit(cover_url)
        except Exception:
            cover_url = fallback

        return cover_url


    extra_css = '''
                    #h1{font-weight:bold;font-size:175%;}
                    h2{display: block;margin-left: auto;margin-right: auto;width:100%;font-weight:bold;font-size:175%;}
                    #p{font-size:14px;}
                    #body{font-size:14px;}
                    .photo-caption {display: block;margin-left: auto;margin-right: auto;width:100%;font-size:40%;}
                    .publish-info {font-size:50%;}
                    .photo img {display: block;margin-left: auto;margin-right: auto;width:100%;}
      '''
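
The preprocess_regexps entry in the recipe above is easier to follow when run against a sample string. The sketch below uses the same pattern outside calibre; the sample text is invented for illustration and is only an assumption about what the site's pages contain, and strip_suffix is a hypothetical helper name.

```python
import re

# Same pattern as in the recipe: matches "| <word...>| <word...>| Daily Express"
SUFFIX = re.compile(r'\| [\w].+?\| [\w].+?\| Daily Express', re.IGNORECASE | re.DOTALL)


def strip_suffix(text):
    """Remove the first-to-last '| ... | ... | Daily Express' run from text."""
    return SUFFIX.sub('', text)


if __name__ == '__main__':
    # Invented sample; the real page text may differ.
    print(repr(strip_suffix('Storm batters coast | UK | News | Daily Express')))
    # → 'Storm batters coast '
```

Because the pattern is applied to the whole page with re.DOTALL, it can span line breaks, so it is worth checking the output if articles ever come out shorter than expected.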
Old 01-19-2014, 09:30 AM   #4
PeterT
Grand Sorcerer
Posts: 12,545
Karma: 74358018
Join Date: Nov 2007
Location: Toronto
Device: Libra H2O, Libra Colour
From looking at the manual I *think* you want
Code:
auto_cleanup_keep = '//section[@class="photo"]|//div[@class="publish-info"]' 
auto_cleanup = True
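
A likely reason the photos disappeared with two separate lines: in a Python class body, a second assignment to the same name simply replaces the first, so only the last auto_cleanup_keep value is ever seen. The plain-Python sketch below shows this without calibre; the class names and the KEEP_XPATHS list are hypothetical and only there for the demonstration.

```python
class TwoAssignments(object):
    # The second assignment overwrites the first one.
    auto_cleanup_keep = '//section[@class="photo"]'
    auto_cleanup_keep = '//div[@class="publish-info"]'


# Keeping the expressions in a list and joining them with the XPath
# union operator "|" gives a single value that covers both.
KEEP_XPATHS = [
    '//section[@class="photo"]',
    '//div[@class="publish-info"]',
]


class OneUnion(object):
    auto_cleanup_keep = '|'.join(KEEP_XPATHS)


if __name__ == '__main__':
    print(TwoAssignments.auto_cleanup_keep)
    print(OneUnion.auto_cleanup_keep)
```

The joined value is the same string as in the suggestion above; whether auto cleanup then actually keeps both elements still depends on the page markup.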
Old 01-19-2014, 10:33 AM   #5
scissors
Addict
Quote:
Originally Posted by PeterT View Post
From looking at the manual I *think* you want
Code:
auto_cleanup_keep = '//section[@class="photo"]|//div[@class="publish-info"]' 
auto_cleanup = True
Thanks.

Unfortunately, the publish info /seems/ to get ignored. I tried various tags etc.

No worries. The main article works.