08-23-2011, 11:49 AM | #1 |
Enthusiast
Posts: 29
Karma: 244
Join Date: Aug 2011
Location: North Pole, Alaska
Device: Kindle DXG
|
New Fairbanks Daily News-miner News Recipe -- Need Date inclusion only
Any suggestions regarding line #73? How do I include only the date (the span with class "story_item_date updated") and omit the rest of the enclosing div?
I also need to figure out how to make the titles of the news articles bold. Once I polish this off, I figure I can submit it. Code:
from calibre.web.feeds.news import BasicNewsRecipe
import re

class FairbanksDailyNewsminer(BasicNewsRecipe):
    title = u'Fairbanks Daily News-miner'
    __author__ = 'Roger'
    oldest_article = 7
    max_articles_per_feed = 100
    description = 'The voice of interior Alaska since 1903'
    publisher = 'http://www.newsminer.com/'
    category = 'news, Alaska, Fairbanks'
    language = 'en'
    #extra_css = '''
    #    p{font-weight: normal;text-align: justify}
    #    '''
    remove_javascript = True
    use_embedded_content = False
    no_stylesheets = True
    encoding = 'utf8'
    conversion_options = {'linearize_tables':True}
    masthead_url = 'http://d2uh5w9wm14i0w.cloudfront.net/sites/635/assets/top_masthead_-_menu_pic.jpg'

    # I just need "story_item_date updated", trash the rest of the line!
    # <div class="signature_line"><span title="2011-08-22T10:35:58Z" class="story_item_date updated">Aug 22, 2011</span> | 1463 views | 19 <a href="/pages/full_story/push?article........class="signature_email_message"></span></div>
    #preprocess_regexps = [(re.compile(r'<span[^>]*addthis_separator*>'), lambda match: '') ]
    #preprocess_regexps = [(re.compile(r'span class="addthis_separator">|</span>'), lambda match: '') ]
    #preprocess_regexps = [
    #    (re.compile(r'<start>.*?<end>', re.IGNORECASE | re.DOTALL), lambda match : ''),
    #    ]

    keep_only_tags = [
        dict(name='div', attrs={'class':'hnews hentry item'}),
        dict(name='div', attrs={'class':'story_item_headline entry-title'}),
        dict(name='span', attrs={'class':'story_item_date updated'}),
        dict(name='div', attrs={'class':'full_story'})
        ]

    #remove_tags = [
    #    dict(name='div', attrs={'class':'story_tools'}),
    #    dict(name='p', attrs={'class':'ad_label'}),
    #    ]
    remove_tags = [
        dict(name='div', attrs={'class':'signature_line'}),
        dict(name='div', attrs={'class':'addthis_toolbox addthis_default_style'}),
        dict(name='div', attrs={'class':['addthis_toolbox','addthis_default_style']}),
        dict(name='span', attrs={'class':'addthis_separator'}),
        dict(name='div', attrs={'class':'related_content'}),
        dict(name='div', attrs={'class':'comments_container'}),
        #dict(name='div', attrs={'class':'signature_line'}),
        dict(name='div', attrs={'class':'addthis_toolbox addthis_default_style'}),
        dict(name='div', attrs={'id':'comments_container'})
        ]

    #remove_tags_after = [
    #    dict(name='div', attrs={'class':'advertisement'}),
    #    ]

    #extra_css # tweak the appearance (ie. Change titles to bold!)

    # Uncomment the following feeds once Dates are included and Titles are bold!
    feeds = [
        (u'Alaska News', u'http://newsminer.com/rss/rss_feeds/alaska_news?content_type=article&tags=alaska_news&page_name=rss_feeds&instance=alaska_news')
        # (u'Alaska News', u'http://newsminer.com/rss/rss_feeds/alaska_news?content_type=article&tags=alaska_news&page_name=rss_feeds&instance=alaska_news'),
        # (u'Local News', u'http://newsminer.com/rss/rss_feeds/local_news?content_type=article&tags=local_news&page_name=rss_feeds&offset=0&instance=local_news'),
        # (u'Business', u'http://newsminer.com/rss/rss_feeds/business_news?content_type=article&tags=business_news&page_name=rss_feeds&instance=business_news'),
        # (u'Politics', u'http://newsminer.com/rss/rss_feeds/politics_news?content_type=article&tags=politics_news&page_name=rss_feeds&instance=politics_news'),
        # (u'Sports', u'http://newsminer.com/rss/rss_feeds/sports_news?content_type=article&tags=sports_news&page_name=rss_feeds&instance=sports_news'),
        # (u'Latitude 65 feed', u'http://newsminer.com/rss/rss_feeds/latitude_65?content_type=article&tags=latitude_65&page_name=rss_feeds&offset=0&instance=latitude_65'),
        # (u'Sundays', u'http://newsminer.com/rss/rss_feeds/Sundays?content_type=article&tags=alaska_science_forum+scott_mccrea+interior_gardening+in_the_bush+judy_ferguson+book_reviews+theresa_bakker+judith_kleinfeld+interior_scrapbook+nuggets_comics+freeze_frame&page_name=rss_feeds&tag_inclusion=or&instance=Sundays'),
        # (u'Outdoors', u'http://newsminer.com/rss/rss_feeds/Outdoors?content_type=article&tags=outdoors&page_name=rss_feeds&instance=Outdoors'),
        # (u'Fairbanks Grizzlies', u'http://newsminer.com/rss/rss_feeds/fairbanks_grizzlies?content_type=article&tags=fairbanks_grizzlies&page_name=rss_feeds&instance=fairbanks_grizzlies'),
        # (u'Newsminer', u'http://newsminer.com/rss/rss_feeds/Newsminer?content_type=article&tags=ted_stevens_bullets+ted_stevens+sports_news+business_news+fairbanks_grizzlies+dermot_cole_column+outdoors+alaska_science_forum+scott_mccrea+interior_gardening+in_the_bush+judy_ferguson+book_reviews+theresa_bakker+judith_kleinfeld+interior_scrapbook+nuggets_comics+freeze_frame&page_name=rss_feeds&tag_inclusion=or&instance=Newsminer'),
        # (u'Opinion', u'http://newsminer.com/rss/rss_feeds/Opinion?content_type=article&tags=editorials&page_name=rss_feeds&instance=Opinion'),
        # (u'Youth', u'http://newsminer.com/rss/rss_feeds/Youth?content_type=article&tags=youth&page_name=rss_feeds&instance=Youth'),
        # (u'Dermot Cole Blog', u'http://newsminer.com/rss/rss_feeds/dermot_cole_blog+rss?content_type=blog+entry&sort_by=posted_on&user_ids=3015275&page_name=blogs_dermot_cole&limit=10&instance=dermot_cole_blog+rss'),
        # (u'Dermot Cole Column', u'http://newsminer.com/rss/rss_feeds/Dermot_Cole_column?content_type=article&tags=dermot_cole_column&page_name=rss_feeds&instance=Dermot_Cole_column'),
        # (u'Sarah Palin', u'http://newsminer.com/rss/rss_feeds/sarah_palin?content_type=article&tags=palin_in_the_news+palin_on_the_issues&page_name=rss_feeds&tag_inclusion=or&instance=sarah_palin')
        ]

Last edited by rogerx; 08-23-2011 at 12:01 PM. Reason: blah comment
08-23-2011, 12:19 PM | #2 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
08-23-2011, 08:54 PM | #3 |
Enthusiast
Posts: 29
Karma: 244
Join Date: Aug 2011
Location: North Pole, Alaska
Device: Kindle DXG
|
Ah, many thanks! Sorry, I'm a Bash junkie and not a Perl/Python programmer, but I'm patiently trying to learn.
After sleeping on this and comparing it to the Anchorage Daily News (Official Kindle Version feed), I realized I only needed a title page with the published date. However, since individual stories are constantly updated and this date is the actual updated publish/re-edit date, I should probably just use this. Now I just need to google how to change fonts (<h2>, etc.). Once I clean this file up, I'll publish it for other users. I think I'm going to leave many of the feeds commented out (but still in the file in case others are interested), as the Anchorage Daily News (Official Kindle Version feed) only pushes the News, Opinions, Sports, Outdoors and Letters to the Editor sections. Personally, like most, I just want the news/facts.
08-23-2011, 10:23 PM | #4 |
Enthusiast
Posts: 29
Karma: 244
Join Date: Aug 2011
Location: North Pole, Alaska
Device: Kindle DXG
|
I've done some research, and it looks like the above snippet is leading me toward an undesirably complex recipe file.
Even though I get almost 100% good results with a basic news recipe, clipping the date from one (1) undesirable line with this snippet looks to require manually rewriting all of the Calibre functions for the entire HTML fetching & rendering operations (similar to the New York Times recipe). As such, a simple regexp (i.e. the preprocess_regexps Calibre option) should be able to clip the date from the following line of HTML tags. (Note: the undesirable tags occur after the date, immediately following the </span> tag.) Code:
# I just need "story_item_date updated", trash the rest of the line!
# <div class="signature_line"><span title="2011-08-22T10:35:58Z" class="story_item_date updated">Aug 22, 2011</span> | 1463 views | 19 <a href="/pages/full_story/push?article........class="signature_email_message"></span></div>
#preprocess_regexps

Last edited by rogerx; 08-23-2011 at 10:24 PM. Reason: grammar
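For readers following along, here is a rough sketch of what such a preprocess_regexps entry might look like. It keeps the date span and drops everything else inside the signature_line div. The pattern, the helper names (SIGNATURE_LINE, keep_only_date, apply_regexps) and the simplified sample line are assumptions made for illustration, not something tested against the live site. It also assumes the div contains no nested div.

```python
import re

# Hypothetical pattern: assumes the date span comes first inside the
# signature_line div and that the div has no nested <div> elements.
SIGNATURE_LINE = re.compile(
    r'<div class="signature_line">\s*'
    r'(<span[^>]*class="story_item_date updated"[^>]*>.*?</span>)'
    r'.*?</div>',
    re.IGNORECASE | re.DOTALL)


def keep_only_date(match):
    # Rebuild the div around the captured date span only.
    return '<div class="signature_line">' + match.group(1) + '</div>'


# Same shape calibre expects: a list of (compiled regex, callable) tuples.
preprocess_regexps = [(SIGNATURE_LINE, keep_only_date)]


def apply_regexps(html):
    # Stand-in for what happens to the downloaded HTML; illustration only.
    for pattern, func in preprocess_regexps:
        html = pattern.sub(func, html)
    return html


# Simplified, made-up sample modelled on the line quoted above.
sample = ('<div class="signature_line"><span title="2011-08-22T10:35:58Z" '
          'class="story_item_date updated">Aug 22, 2011</span> | 1463 views | 19 '
          '<a href="#">comments</a>'
          '<span class="signature_email_message"></span></div>')

cleaned = apply_regexps(sample)
print(cleaned)
```

Inside the recipe, only the preprocess_regexps assignment would be needed. The remove_tags entry for signature_line in the first post would have to go, otherwise the date is removed along with the div.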
08-24-2011, 05:51 AM | #5 |
Enthusiast
Posts: 29
Karma: 244
Join Date: Aug 2011
Location: North Pole, Alaska
Device: Kindle DXG
|
I've just tested this recipe file on my Kindle DXG instead of FBReader and it looks really good! Better than I thought, as I had been viewing the resulting .mobi file through FBReader, and it looks like FBReader was really screwing me up!
Viewing the .mobi file on my Kindle DXG, everything looks really good except for:
1) The article title should probably be in a bold font.
2) The newline bug looks to really be an FBReader bug. I don't see it on my Kindle! ;-)
3) About the only anomaly: the number of views and number of comments/posts, along with the pipe symbols, persist.
4) The masthead_url image shows on my Kindle! (Another bug specific to FBReader. ;-)
5) I've got four feeds uncommented and am thinking of uncommenting all or most of them. (I have to check wireless charges first.)
I'm posting this here, instead of under the newer post with the inline posting of this news recipe, because that post has yet to show up on the list for the past hour. :-(
08-24-2011, 10:12 AM | #6 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
I've read your posts, but can't tell what you want. The snippet of code I posted was to extract the date for you so you can do something with it. Presumably, you want to display it somewhere. You didn't say what you want to do with it. It certainly doesn't require "manual rewriting all of the Calibre functions for the entire HTML fetching & rendering operations."
As for bolding the title, you can use extra_css. As for "views and number of comments/posts along with pipe symbol," I think you're asking how to remove that "junk", and the answer is you use remove_tags.
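A minimal sketch of both suggestions, written as the attributes they would become inside the recipe class. The CSS selector assumes the headline keeps the story_item_headline class listed in keep_only_tags in the first post, and the class names in remove_tags are taken from the recipe and the sample HTML quoted in this thread. Treat it as an illustration, not a tested stylesheet for this site.

```python
# Fragment: both assignments belong inside the BasicNewsRecipe subclass.

# Assumed selector: the headline div carries the "story_item_headline"
# class from keep_only_tags; adjust it if the real markup differs.
extra_css = '''
    .story_item_headline { font-weight: bold; }
    '''

# Each remove_tags entry describes an element to drop from the article.
remove_tags = [
    dict(name='span', attrs={'class': 'signature_email_message'}),
    dict(name='div', attrs={'class': 'related_content'}),
    dict(name='div', attrs={'id': 'comments_container'}),
]
```

Since remove_tags entries match elements, text sitting directly inside signature_line (the view and comment counts and the pipe symbols) goes away only when the element containing it is removed.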