11-05-2010, 01:29 AM | #1 |
Member
Posts: 12
Karma: 10
Join Date: Oct 2010
Location: UK
Device: Kindle 3 WiFi, Kindle Paperwhite 2013
|
Recipe works when mocked up as Python file, fails when converted to Recipe
Code:
import urllib2
from BeautifulSoup import BeautifulSoup
from calibre.web.feeds.news import BasicNewsRecipe

class Counterpunch(BasicNewsRecipe):
    '''
    Parses counterpunch.com for articles
    '''
    def parse_index(self):
        feeds = []
        title, url = 'Counterpunch', 'http://www.counterpunch.com'
        articles = self.parse_page(url)
        if articles:
            feeds.append((title, articles))
        return feeds

    def parse_page(self, url):
        fd = urllib2.urlopen(url)
        soup = BeautifulSoup(fd, fromEncoding='iso-8859-1')
        articles = []
        current_date = ''
        #Gets all dates and entries in the correctly dispersed way e.g. date, list of articles for date, next date, next list of articles
        #first expression gets entries, second gets dates
        dates_and_articles = soup.findAll(lambda tag: (tag.name == 'p' and
                                          tag.attrs == [(u'class', u'style2')] and
                                          len(tag) == 4 and
                                          'Website of the' not in tag.decode('utf-8')) or
                                          (tag.name == 'font' and tag.attrs == [(u'color', u'#990000'), (u'size', u'-1')]))
        for tag in dates_and_articles:
            #if 'Today\'s\n Stories' in tag.contents:
            if tag.name == 'p':
                #logic to deal with different ways names are printed (color difference I believe)
                if tag.find('span', {'class': 'style1'}):
                    author = tag.contents[0].contents[0] + ': '
                    url = 'http://www.counterpunch.com/' + tag.contents[3].attrs[0][1]
                else:
                    author = tag.contents[0] + ': '
                    url = 'http://www.counterpunch.com/' + tag.contents[3].attrs[0][1]
                title = author + str(tag.contents[3].contents[0])
                articles.append({'title': title, 'url': url, 'description':'', 'date': current_date})
            #if new date, update current_date
            elif tag.name == 'font':
                current_date = tag.contents[0]
                #print('the date is {0}').format(current_date)
        #cut just one days articles for clearer, quicker debugging
        articles = [a for a in articles if a['date'] == 'October 11, 2010']
        return articles

#for debugging on the cmd
#c = Counterpunch()
#print c.parse_index()

This is the first recipe I have written. It is for a site that has no RSS feed.
The articles are in a table at the side of the page, separated by date headings. I mocked it up as a .py file first and got it to a workable state where it prints a list of feeds on the command line. I then made the few small changes needed to turn it into a recipe and tested it with 'ebook-convert counterpunch.recipe test --test -vv', but I get the traceback below:
Code:
1% Converting input to HTML...
InputFormatPlugin: Recipe Input running
1% Fetching feeds...
Traceback (most recent call last):
  File "/tmp/init.py", line 48, in <module>
  File "/home/kovid/build/calibre/src/calibre/ebooks/conversion/cli.py", line 254, in main
  File "/home/kovid/build/calibre/src/calibre/ebooks/conversion/plumber.py", line 836, in run
  File "/home/kovid/build/calibre/src/calibre/customize/conversion.py", line 216, in __call__
  File "/home/kovid/build/calibre/src/calibre/web/feeds/input.py", line 105, in convert
  File "/home/kovid/build/calibre/src/calibre/web/feeds/news.py", line 712, in download
  File "/home/kovid/build/calibre/src/calibre/web/feeds/news.py", line 837, in build_index
  File "/tmp/calibre_0.7.26_tmp_Ep1Dpi/calibre_0.7.26_IUpdj4_recipes/recipe0.py", line 15, in parse_index
    articles = self.parse_page(url)
  File "/tmp/calibre_0.7.26_tmp_Ep1Dpi/calibre_0.7.26_IUpdj4_recipes/recipe0.py", line 28, in parse_page
    dates_and_articles = soup.findAll(lambda tag: (tag.name == 'p' and
  File "/usr/lib/python2.6/site-packages/BeautifulSoup.py", line 768, in findAll
  File "/usr/lib/python2.6/site-packages/BeautifulSoup.py", line 332, in _findAll
  File "/usr/lib/python2.6/site-packages/BeautifulSoup.py", line 890, in search
  File "/usr/lib/python2.6/site-packages/BeautifulSoup.py", line 849, in searchTag
  File "/usr/lib/python2.6/site-packages/BeautifulSoup.py", line 907, in _matches
  File "/tmp/calibre_0.7.26_tmp_Ep1Dpi/calibre_0.7.26_IUpdj4_recipes/recipe0.py", line 31, in <lambda>
    'Website of the' not in tag.decode('utf-8')) or
TypeError: 'NoneType' object is not callable

Can anyone get it to run to grab the feeds for calibre? Thanks
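A possible reading of that last traceback line, offered as a guess rather than a confirmed diagnosis: in BeautifulSoup 3.0.x, a Tag treats any unknown attribute as a child-tag lookup, so tag.decode would mean tag.find('decode'), which returns None, and calling None raises exactly this TypeError. This assumes the BeautifulSoup bundled with calibre is a 3.0.x release with no Tag.decode method, while the standalone script ran against a newer release that has one. The sketch below uses a hypothetical ToyTag class (not the real BeautifulSoup Tag) to reproduce the mechanism in isolation.

```python
class ToyTag(object):
    """Hypothetical stand-in for a BeautifulSoup 3.0.x Tag (not the real class)."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def find(self, name):
        # Return the first direct child with the given tag name, else None.
        for child in self.children:
            if child.name == name:
                return child
        return None

    def __getattr__(self, attr):
        # Only called when normal attribute lookup fails. Unknown attributes
        # are treated as child-tag lookups, so tag.decode is tag.find('decode').
        if attr.startswith('__'):
            raise AttributeError(attr)
        return self.find(attr)


def call_decode(tag):
    """Call tag.decode('utf-8') and return the TypeError message, if one is raised."""
    try:
        tag.decode('utf-8')
    except TypeError as exc:
        return str(exc)
    return None


p = ToyTag('p', [ToyTag('a')])
error = call_decode(p)
print(error)
```

If that is indeed the cause, one untested workaround would be to replace tag.decode('utf-8') in the lambda with str(tag), which Tag objects support directly.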
11-05-2010, 09:55 PM | #2 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
I tested briefly on another machine, and your feed parsed correctly. The article bodies weren't being downloaded, and I didn't debug why, but parsing the index and building the feed from your source page worked fine.
The recipe didn't finish, and I'm not sure whether all your articles were parsed correctly, but most were. I started to play with it: I added a postprocess_html for debugging, cleaned up some comments and added some print statements, and the recipe then finished (with empty articles), but that's as far as I went. I know it's not much, but I thought you might want to know you weren't ignored.
|
12-21-2010, 05:15 PM | #3 |
Junior Member
Posts: 4
Karma: 10
Join Date: Dec 2010
Device: Kindle 3
|
Counterpunch is a good web publication, and as a calibre user I would appreciate it if its recipe were debugged and included in the software distribution.
|
07-28-2011, 06:40 PM | #4 |
Member
Posts: 19
Karma: 10
Join Date: Jul 2010
Device: Calibre
|
It's been almost nine months since the original post. Does anyone know of any developments? I would really like to get hold of a working recipe for CounterPunch. Thanks.
|
07-29-2011, 03:21 PM | #5 |
Member
Posts: 12
Karma: 10
Join Date: Oct 2010
Location: UK
Device: Kindle 3 WiFi, Kindle Paperwhite 2013
|
I rewrote it and got it working.
I have contributed it to Calibre. It will be included from the version released today (0.8.12). If you don't want to update, you can use the file attached to this post. Enjoy!
|
07-31-2011, 07:01 PM | #6 |
Member
Posts: 19
Karma: 10
Join Date: Jul 2010
Device: Calibre
|
Thank you so much. So far so good! I love it!
|
08-05-2011, 10:00 AM | #7 |
Member
Posts: 19
Karma: 10
Join Date: Jul 2010
Device: Calibre
|
There seems to be a limit of 10 entries per day, yet some days have fewer than 10 entries and some have more. How does that work? Is there a way to make sure that no entries are repeated and that all entries eventually get downloaded? I'm new to this, so I'm not sure how it works. Thanks.
|
09-04-2011, 04:57 AM | #8 |
Member
Posts: 12
Karma: 10
Join Date: Oct 2010
Location: UK
Device: Kindle 3 WiFi, Kindle Paperwhite 2013
|
Counterpunch have redesigned their site and now have an RSS feed, which makes things easier for the recipe.
I have rewritten the recipe and submitted it to Calibre. It will be in the next version, which should be released next Friday (9 Sept). In the meantime you can use the version attached to this post.
@aritza The new recipe has a limit of 7 days/100 posts, but since it now works from the RSS feed it is really limited by the number of posts in the feed (25 at this time).
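For anyone wondering how those limits interact: 7 days and 100 posts match the defaults of the oldest_article and max_articles_per_feed attributes on BasicNewsRecipe. The sketch below is not calibre's own code, just a rough illustration of the idea, with a hypothetical select_articles function and a made-up 25-entry feed: entries older than the cutoff are dropped, then the remainder is capped at the per-feed maximum.

```python
from datetime import datetime, timedelta


def select_articles(entries, now, oldest_article=7, max_articles_per_feed=100):
    """Keep entries newer than `oldest_article` days, capped at the per-feed maximum.

    `entries` is a list of (title, published) tuples in feed order.
    This is an illustrative helper, not part of calibre.
    """
    cutoff = now - timedelta(days=oldest_article)
    recent = [entry for entry in entries if entry[1] >= cutoff]
    return recent[:max_articles_per_feed]


# A made-up feed of 25 entries, one published every 12 hours.
now = datetime(2011, 9, 4, 12, 0)
feed = [('Article %d' % i, now - timedelta(hours=12 * i)) for i in range(25)]

selected = select_articles(feed, now)
print(len(selected))
```

With a feed that only carries 25 entries, neither limit can ever yield more than 25 articles, which is why the feed length is the effective cap.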
|