01-23-2008, 10:33 AM | #1 |
Addict
Posts: 223
Karma: 356
Join Date: Aug 2007
Device: Rocket; Hiebook; N700; Sony 505; Kindle DX ...
python coding...
I am trying to write down a simple applet for web2lrf/libprs500, to download the magazine the Atlantic (http://www.theatlantic.com/) - it is free since today...
damn, I dont know python so I have a couple of problems... 1) under http://www.theatlantic.com/doc/current, all the links are relative (e.g. <a href="/doc/200801/millbank">), so I began with: preprocess_regexps = [(re.compile(i[0], re.IGNORECASE | re.DOTALL), i[1]) for i in [ (r'<a href="/', lambda match : match.group().replace(match.group(1), '<a href="http://www.theatlantic.com')), ] ] ... is it right? 2) at the end of every run I get the error (freely translated by me: italian windows version!) Exception exceptions.WindowsError: WindowsError(32, 'Impossible to access the file. File is used by another process') in <bound method atlantic.__de l__ of <atlantic.atlantic object at 0x0111A690>> ignored I add that I get this error even under other scripts I tried to write for other newspapers, but this didnt prevent an LRF output to be written. In this case instead, the LRF just contains the header and nothing else - probably it has something to do with question 1)... any idea? Alessandro |
01-23-2008, 04:04 PM | #2 |
creator of calibre
Posts: 44,559
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
1) No, you need to re-implement the parse_feeds function so that it scans the page http://www.theatlantic.com/doc/current and returns a list of the form
Code:
[('Title', 'URL'), ('Title2', 'URL2'), ...]
You can use the BeautifulSoup class to easily parse the HTML.
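To illustrate the idea of scanning a page and collecting (title, URL) pairs: the real profile would use the BeautifulSoup bundled with libprs500, but a stdlib-only stand-in with html.parser shows the same pattern (the sample markup here is invented for illustration):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (title, url) pairs from <a href="..."> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((''.join(self._text).strip(), self._href))
            self._href = None

page = '<div><a href="/doc/200801/millbank">Millbank</a></div>'
p = LinkCollector()
p.feed(page)
print(p.links)
# → [('Millbank', '/doc/200801/millbank')]
```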
01-25-2008, 05:04 AM | #3 | |
Addict
Posts: 223
Karma: 356
Join Date: Aug 2007
Device: Rocket; Hiebook; N700; Sony 505; Kindle DX ...
Quote:
I'm afraid I didn't see any example of a parse_feeds reimplementation in those, damn.

Alessandro
01-28-2008, 05:30 PM | #4 |
Enthusiast
Posts: 26
Karma: 11777
Join Date: Jun 2007
Location: Brooklyn
Device: PRS-500,Treo 750, Archos 605 Wifi
You might need to do something similar to what I did to download The Nation.
Check out the profile at https://libprs500.kovidgoyal.net/att...s/thenation.py |
01-30-2008, 06:50 AM | #5 |
Addict
Posts: 223
Karma: 356
Join Date: Aug 2007
Device: Rocket; Hiebook; N700; Sony 505; Kindle DX ...
Thanks, secretsubscribe,
I'm beginning to see the light... Now I can download a couple of MB of The Atlantic, but I still have one problem: the text of each article is split into several parts, and at the end of each one there is the usual line reading "Pages: 1 2 3 next>". The URLs those numbers point to are relative, e.g.:
Code:
<span class="hankpym">
  <span class="safaritime">1</span>
  <a href="/doc/200801/miller-education/2">2</a>
  <a href="/doc/200801/miller-education/3">3</a>
</span>
<a href="/doc/200801/miller-education/2">next></a>
so I'd like to replace those. But if I add this:
Code:
preprocess_regexps = \
    [
        (re.compile(i[0], re.IGNORECASE | re.DOTALL), i[1]) for i in
        [
            (r'<a href="/',
             lambda match: match.group().replace(match.group(1),
                                                 '<a href="http://www.theatlantic.com')),
            # ....
        ]
    ]
in addition to your (modified) def parse_feeds, it is no longer able to find any link. So, how can I turn the relative links in the individual articles into absolute ones? Any hint appreciated...

Alessandro
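One stdlib way to resolve the relative pagination links described above, independent of the profile machinery: join each href against the article's own URL with urljoin instead of fighting the regexp ordering (the URLs are taken from the post; whether the profile framework exposes a hook for this is not shown here):

```python
from urllib.parse import urljoin

# The URL of the article page the pagination links were found on.
base = 'http://www.theatlantic.com/doc/200801/miller-education'

for href in ('/doc/200801/miller-education/2', '/doc/200801/miller-education/3'):
    # urljoin resolves a site-relative path against the base URL's scheme and host.
    print(urljoin(base, href))
# → http://www.theatlantic.com/doc/200801/miller-education/2
# → http://www.theatlantic.com/doc/200801/miller-education/3
```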
01-30-2008, 12:25 PM | #6 |
creator of calibre
Posts: 44,559
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
You'll have to increase max_recursions and use --match-regexp
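A sketch of a pattern that would match only the article pages and their numbered continuation pages from this thread (the exact --match-regexp semantics belong to the converter; the pattern itself is plain Python re, and the path shape is inferred from the URLs quoted above):

```python
import re

# Match /doc/<YYYYMM>/<slug> with an optional /<page-number> suffix.
pat = re.compile(r'theatlantic\.com/doc/\d{6}/[\w-]+(/\d+)?$')

assert pat.search('http://www.theatlantic.com/doc/200801/miller-education')
assert pat.search('http://www.theatlantic.com/doc/200801/miller-education/2')
assert not pat.search('http://www.theatlantic.com/about')
```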
01-30-2008, 09:47 PM | #7 |
creator of calibre
Posts: 44,559
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Here's The Atlantic
Code:
## Copyright (C) 2008 Kovid Goyal kovid@kovidgoyal.net
## This program is free software; you can redistribute it and/or modify
## it under the terms of the GNU General Public License as published by
## the Free Software Foundation; either version 2 of the License, or
## (at your option) any later version.
##
## This program is distributed in the hope that it will be useful,
## but WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
## GNU General Public License for more details.
##
## You should have received a copy of the GNU General Public License along
## with this program; if not, write to the Free Software Foundation, Inc.,
## 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

import re

from libprs500.ebooks.lrf.web.profiles import DefaultProfile
from libprs500.ebooks.BeautifulSoup import BeautifulSoup

class Atlantic(DefaultProfile):

    title = 'The Atlantic'
    max_recursions = 2
    INDEX = 'http://www.theatlantic.com/doc/current'

    preprocess_regexps = [
        (re.compile(r'<body.*?<div id="storytop"', re.DOTALL|re.IGNORECASE),
         lambda m: '<body><div id="storytop"'),
    ]

    def parse_feeds(self):
        articles = []
        src = self.browser.open(self.INDEX).read()
        soup = BeautifulSoup(src)
        issue = soup.find('span', attrs={'class':'issue'})
        if issue:
            self.timefmt = ' [%s]'%self.tag_to_string(issue).rpartition('|')[-1].strip().replace('/', '-')
        for item in soup.findAll('div', attrs={'class':'item'}):
            a = item.find('a')
            if a and a.has_key('href'):
                url = a['href']
                url = 'http://www.theatlantic.com/'+url.replace('/doc', 'doc/print')
                title = self.tag_to_string(a)
                byline = item.find(attrs={'class':'byline'})
                date = self.tag_to_string(byline) if byline else ''
                description = ''
                articles.append({
                    'title'       : title,
                    'date'        : date,
                    'url'         : url,
                    'description' : description,
                })
        return {'Current Issue' : articles}
01-31-2008, 05:35 AM | #8 |
Addict
Posts: 223
Karma: 356
Join Date: Aug 2007
Device: Rocket; Hiebook; N700; Sony 505; Kindle DX ...
thank you for the help!
Unfortunately, it dies at once with this error:
Code:
File "C:\Programmi\libprs500\atlantic.py", line 42, in parse_feeds
    self.timefmt = ' [%s]'%self.tag_to_string(issue).rpartition('|')[-1].strip().replace('/', '-')
AttributeError: 'Atlantic' object has no attribute 'tag_to_string'
What do you think?

Alessandro
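The missing tag_to_string is a helper the profile inherits in newer libprs500 builds, which is why the AttributeError appears on older installs. If upgrading were not an option, a stdlib-only stand-in that flattens an HTML fragment to its visible text could fill the gap; note this is a hypothetical sketch operating on an HTML string, not the calibre method, which takes a BeautifulSoup tag:

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Accumulate the text nodes of an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def tag_to_string(fragment):
    # Flatten an HTML fragment to its visible text, stripped of tags.
    p = _TextExtractor()
    p.feed(fragment)
    return ''.join(p.parts).strip()

print(tag_to_string('<span class="issue">January 2008 | Vol. 301</span>'))
# → January 2008 | Vol. 301
```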
01-31-2008, 01:25 PM | #9 |
creator of calibre
Posts: 44,559
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Upgrade to the latest version of libprs500 (The Atlantic is a builtin feed there).