08-24-2011, 04:18 AM | #1 |
Enthusiast
Posts: 29
Karma: 244
Join Date: Aug 2011
Location: North Pole, Alaska
Device: Kindle DXG
|
Fairbanks Daily News-miner News Recipe Submission
Here's the best I could do for the Fairbanks Daily News-miner newspaper.
I know only Bash and C well, and a little Python. I figure that when somebody else here in Alaska pulls this into Calibre, they'll have an outline of the known bugs/anomalies to work from, rather than starting from nothing and hunting down each one. (These are marked with inline TODO comments.) I could pretty this up for you people taking submissions and hide the anomalies, but I'm more of an honest guy, and I know that leaving comments (even if sometimes too many) is a far better method! This way, somebody else can easily fix things rather than struggling blind from the beginning.
This recipe is well commented for anybody willing to fix things further:
1) Article titles should likely be in bold font?
2) Only story_item_date is needed; omit the number of views/posts and the pipe symbols.
3) A newline is needed after each index/TOC entry when pulling more than one RSS feed, for some reason.
Sorry, I am not a fan of Python or Qt! (... and for some reason, I can't attach this file and can only post it inline. :-( )
Code:
#import re # Provides preprocess_regexps re.compile
#import string # Provides self.tag_to_string
#from calibre import strftime

from calibre.web.feeds.news import BasicNewsRecipe
#from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag, NavigableString # Provides soup

class FairbanksDailyNewsminer(BasicNewsRecipe):
    title = u'Fairbanks Daily News-miner'
    __author__ = 'Roger'
    oldest_article = 7
    max_articles_per_feed = 100

    description = '''The voice of interior Alaska since 1903'''
    publisher = 'http://www.newsminer.com/'
    category = 'news, Alaska, Fairbanks'
    language = 'en'
    #extra_css = '''
    #    p{font-weight: normal;text-align: justify}
    #    '''

    remove_javascript = True
    use_embedded_content = False
    no_stylesheets = True
    encoding = 'utf8'
    conversion_options = {'linearize_tables':True}

    # TODO: I don't see any photos in my Mobi file with this masthead_url!
    masthead_url = 'http://d2uh5w9wm14i0w.cloudfront.net/sites/635/assets/top_masthead_-_menu_pic.jpg'

    # In order to omit seeing number of views, number of posts and the pipe
    # symbol for divider after the title and date of the article, a regex or
    # manual processing is needed to get just the "story_item_date updated"
    # (which contains the date). Everything else on this line is pretty much not needed.
    #
    # HTML line containing story_item_date:
    # <div class="signature_line"><span title="2011-08-22T23:37:14Z" class="story_item_date updated">Aug 22, 2011</span> | 2370 views | 52 <a href="/pages/full_story/push?article-Officials+tout+new+South+Cushman+homeless+living+facility%20&id=15183753#comments_15183753"><img alt="52 comments" class="dont_touch_me" src="http://d2uh5w9wm14i0w.cloudfront.net/images/comments-icon.gif" title="52 comments" /></a> | <span id="number_recommendations_15183753" class="number_recommendations">9</span> <a href="#1" id="recommend_link_15183753" onclick="Element.remove('recommend_link_15183753'); new Ajax.Request('/community/content/recommend/15183753', {asynchronous:true, evalScripts:true}); return false;"><img alt="9 recommendations" class="dont_touch_me" src="http://d2uh5w9wm14i0w.cloudfront.net/images/thumbs-up-icon.gif" title="9 recommendations" /></a> | <a href="#1" onclick="$j.facebox({ajax: '/community/content/email_friend_pane/15183753'}); return false;"><span style="position: relative;"><img alt="email to a friend" class="dont_touch_me" src="http://d2uh5w9wm14i0w.cloudfront.net/images/email-this.gif" title="email to a friend" /></span></a> | <span><a href="/printer_friendly/15183753" target="_blank"><img alt="print" class="dont_touch_me" src="http://d2uh5w9wm14i0w.cloudfront.net/images/print_icon.gif" title="print" /></a></span><span id="email_content_message_15183753" class="signature_email_message"></span></div>

    # The following was suggested, but it looks like I also need to define self & soup
    # (as well as bring in extra soup depends?)
    #date = self.tag_to_string(soup.find('span', attrs={'class':'story_item_date updated'}))

    #preprocess_regexps = [(re.compile(r'<span[^>]*addthis_separator*>'), lambda match: '') ]
    #preprocess_regexps = [(re.compile(r'span class="addthis_separator">|</span>'), lambda match: '') ]

    #preprocess_regexps = [
    #    (re.compile(r'<start>.*?<end>', re.IGNORECASE | re.DOTALL), lambda match : ''),
    #    ]

    #def get_browser(self):

    #def preprocess_html(soup, first_fetch):
    #    date = self.tag_to_string(soup.find('span', attrs={'class':'story_item_date updated'}))
    #    return

    # Try to keep some tags - some might not be needed here
    keep_only_tags = [
        #date = self.tag_to_string(soup.find('span', attrs={'class':'story_item_date updated'})),
        dict(name='div', attrs={'class':'hnews hentry item'}),
        dict(name='div', attrs={'class':'story_item_headline entry-title'}),
        #dict(name='span', attrs={'class':'story_item_date updated'}),
        dict(name='div', attrs={'class':'full_story'})
        ]

    #remove_tags = [
    #    dict(name='div', attrs={'class':'story_tools'}),
    #    dict(name='p', attrs={'class':'ad_label'}),
    #    ]

    # Try to remove some bothersome tags
    remove_tags = [
        #dict(name='img', attrs={'alt'}),
        dict(name='img', attrs={'class':'dont_touch_me'}),
        dict(name='span', attrs={'class':'number_recommendations'}),
        #dict(name='div', attrs={'class':'signature_line'}),
        dict(name='div', attrs={'class':'addthis_toolbox addthis_default_style'}),
        dict(name='div', attrs={'class':['addthis_toolbox','addthis_default_style']}),
        dict(name='span', attrs={'class':'addthis_separator'}),
        dict(name='div', attrs={'class':'related_content'}),
        dict(name='div', attrs={'class':'comments_container'}),
        dict(name='div', attrs={'id':'comments_container'})
        ]

    # This one works but only gets title, date and clips article content!
    #remove_tags_after = [
    #    dict(name='span', attrs={'class':'story_item_date updated'})
    #    ]

    #remove_tags_after = [
    #    dict(name='div', attrs={'class':'advertisement'}),
    #    ]

    # Try clipping tags before and after to prevent pulling img views/posts numbers after date?
    #remove_tags_before = [
    #    dict(name='span', attrs={'class':'story_item_date updated'})
    #    ]

    #extra_css # tweak the appearance
    # TODO: Change article titles <h2?> to bold?

    # Comment-out or uncomment any of the following RSS feeds according to your
    # liking.
    #
    # TODO: Adding more than one RSS Feed, and newline will be omitted for
    # entries within the Table of Contents or Index of Articles
    #
    # TODO: Some random bits of text are trailing the last page (or TOC on MOBI
    # files), these are bits of public posts and comments and need to also be
    # removed.
    #
    feeds = [
        (u'Alaska News', u'http://newsminer.com/rss/rss_feeds/alaska_news?content_type=article&tags=alaska_news&page_name=rss_feeds&instance=alaska_news'),
        (u'Local News', u'http://newsminer.com/rss/rss_feeds/local_news?content_type=article&tags=local_news&page_name=rss_feeds&offset=0&instance=local_news'),
        # (u'Business', u'http://newsminer.com/rss/rss_feeds/business_news?content_type=article&tags=business_news&page_name=rss_feeds&instance=business_news'),
        # (u'Politics', u'http://newsminer.com/rss/rss_feeds/politics_news?content_type=article&tags=politics_news&page_name=rss_feeds&instance=politics_news'),
        # (u'Sports', u'http://newsminer.com/rss/rss_feeds/sports_news?content_type=article&tags=sports_news&page_name=rss_feeds&instance=sports_news'),
        # (u'Latitude 65 feed', u'http://newsminer.com/rss/rss_feeds/latitude_65?content_type=article&tags=latitude_65&page_name=rss_feeds&offset=0&instance=latitude_65'),
        (u'Sundays', u'http://newsminer.com/rss/rss_feeds/Sundays?content_type=article&tags=alaska_science_forum+scott_mccrea+interior_gardening+in_the_bush+judy_ferguson+book_reviews+theresa_bakker+judith_kleinfeld+interior_scrapbook+nuggets_comics+freeze_frame&page_name=rss_feeds&tag_inclusion=or&instance=Sundays'),
        # (u'Outdoors', u'http://newsminer.com/rss/rss_feeds/Outdoors?content_type=article&tags=outdoors&page_name=rss_feeds&instance=Outdoors'),
        # (u'Fairbanks Grizzlies', u'http://newsminer.com/rss/rss_feeds/fairbanks_grizzlies?content_type=article&tags=fairbanks_grizzlies&page_name=rss_feeds&instance=fairbanks_grizzlies'),
        (u'Newsminer', u'http://newsminer.com/rss/rss_feeds/Newsminer?content_type=article&tags=ted_stevens_bullets+ted_stevens+sports_news+business_news+fairbanks_grizzlies+dermot_cole_column+outdoors+alaska_science_forum+scott_mccrea+interior_gardening+in_the_bush+judy_ferguson+book_reviews+theresa_bakker+judith_kleinfeld+interior_scrapbook+nuggets_comics+freeze_frame&page_name=rss_feeds&tag_inclusion=or&instance=Newsminer'),
        # (u'Opinion', u'http://newsminer.com/rss/rss_feeds/Opinion?content_type=article&tags=editorials&page_name=rss_feeds&instance=Opinion'),
        # (u'Youth', u'http://newsminer.com/rss/rss_feeds/Youth?content_type=article&tags=youth&page_name=rss_feeds&instance=Youth'),
        # (u'Dermot Cole Blog', u'http://newsminer.com/rss/rss_feeds/dermot_cole_blog+rss?content_type=blog+entry&sort_by=posted_on&user_ids=3015275&page_name=blogs_dermot_cole&limit=10&instance=dermot_cole_blog+rss'),
        # (u'Dermot Cole Column', u'http://newsminer.com/rss/rss_feeds/Dermot_Cole_column?content_type=article&tags=dermot_cole_column&page_name=rss_feeds&instance=Dermot_Cole_column'),
        # (u'Sarah Palin', u'http://newsminer.com/rss/rss_feeds/sarah_palin?content_type=article&tags=palin_in_the_news+palin_on_the_issues&page_name=rss_feeds&tag_inclusion=or&instance=sarah_palin')
        ]
|
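The date-only TODO in the recipe comments (keep the story_item_date text, drop views/posts/pipes) could perhaps be approached with a regex along these lines. This is only a sketch in plain Python, not a tested calibre recipe change: the names DATE_SPAN, SIGNATURE_DIV and keep_only_date are made up for illustration, and the sample markup in the usage note is an abbreviated copy of the signature_line quoted in the recipe comments.

```python
import re

# Hypothetical pattern: the <span> whose class list contains
# "story_item_date"; group 1 is its visible text (the date).
DATE_SPAN = re.compile(
    r'<span[^>]*class="[^"]*\bstory_item_date\b[^"]*"[^>]*>(.*?)</span>',
    re.IGNORECASE | re.DOTALL,
)

# Hypothetical pattern: the whole signature_line <div>. This assumes the
# div contains no nested <div>, as in the sample quoted in the recipe.
SIGNATURE_DIV = re.compile(
    r'<div class="signature_line">.*?</div>',
    re.IGNORECASE | re.DOTALL,
)


def _date_only(match):
    """Rebuild one signature_line div so it holds only the date text."""
    date = DATE_SPAN.search(match.group(0))
    if date is None:
        # No date span found: leave the div exactly as it was.
        return match.group(0)
    return '<div class="signature_line">%s</div>' % date.group(1).strip()


def keep_only_date(html):
    """Return html with every signature_line div reduced to its date."""
    return SIGNATURE_DIV.sub(_date_only, html)
```

For example, keep_only_date('<div class="signature_line"><span title="2011-08-22T23:37:14Z" class="story_item_date updated">Aug 22, 2011</span> | 2370 views | 52</div>') leaves just the date inside the div. The commented-out preprocess_regexps lines in the recipe pair a compiled regex with a function taking the match, so an entry like (SIGNATURE_DIV, _date_only) might slot in there, but that is an untested assumption.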
08-25-2011, 09:28 AM | #2 |
Enthusiast
Posts: 29
Karma: 244
Join Date: Aug 2011
Location: North Pole, Alaska
Device: Kindle DXG
|
I've finally figured out that I need to rename recipe.py to recipe.txt in order to upload it.
Hopefully this attaches fine without line-wrap issues. I've also updated this recipe with bold font on article titles, as well as a few other font modifications. It now has a masthead_url (header image for Kindle/MOBI reader devices). There are quite a few comments embedded, but they are necessary if somebody wants to try editing the signature_line (|date line|num of views|num of comments||). Actually, this is looking pretty good. Think I'll relax and read the newspaper now. As far as I'm concerned, go ahead and submit this recipe. |
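The attached recipe.txt is not reproduced in this thread, so the following is only a guess at what a bold-title tweak might look like. It builds on the commented-out extra_css in the first post and reuses the story_item_headline class from keep_only_tags; the constant name EXTRA_CSS and the exact rules are assumptions, not the attachment's contents.

```python
# Hypothetical stylesheet for the recipe's extra_css attribute.
# The selector reuses the "story_item_headline" class that the recipe
# already keeps via keep_only_tags; the specific rules are illustrative.
EXTRA_CSS = '''
.story_item_headline { font-weight: bold; font-size: large }
p { font-weight: normal; text-align: justify }
'''

# Inside the recipe class this would be assigned as a class attribute:
#     extra_css = EXTRA_CSS
```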
08-25-2011, 08:30 PM | #3 |
Enthusiast
Posts: 29
Karma: 244
Join Date: Aug 2011
Location: North Pole, Alaska
Device: Kindle DXG
|
Here's a newly updated recipe, commenting out some feeds that were causing duplicate stories/articles.
1) Commented out the Newsminer RSS feed - this feed combines all of the other RSS feeds into one URL.
2) Commented out the Sundays RSS feed - this feed is for readers who consistently miss the Sunday news. |
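If somebody would rather keep the aggregate feeds enabled, duplicates could in principle be filtered by article URL instead. The sketch below is plain Python over a made-up data shape (a list of (feed title, [(article title, url), ...]) pairs), not calibre's own feed objects, and dedupe_articles is a hypothetical helper. Newer calibre documentation also describes an ignore_duplicate_articles recipe attribute, which may make this unnecessary; check it against your calibre version.

```python
def dedupe_articles(feeds):
    """Drop articles whose URL already appeared in an earlier feed.

    feeds is an assumed shape for illustration:
        [(feed_title, [(article_title, url), ...]), ...]
    Feed order is preserved, so the first feed listing a URL keeps it.
    """
    seen = set()
    result = []
    for feed_title, articles in feeds:
        unique = []
        for article_title, url in articles:
            if url in seen:
                continue  # already listed under an earlier feed
            seen.add(url)
            unique.append((article_title, url))
        result.append((feed_title, unique))
    return result
```

Listing the specific feeds (Alaska News, Local News) before the catch-all Newsminer feed would then leave only the leftover stories under Newsminer.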