12-03-2007, 08:27 PM | #106 |
creator of calibre
Posts: 44,411
Karma: 23977332
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
@JTravers
Just realized I can't look at the cleanup code as I don't have a subscription. Try the following to debug:
Code:
def cleanup(self):
    # Fetch the URL that cleanup requests and dump the raw response
    res = self.browser.open('whatever the url was')
    print(res.read())
|
12-04-2007, 12:37 AM | #107 | |
Groupie
Posts: 182
Karma: 1078201
Join Date: Sep 2007
Device: iPad Air 2
|
Quote:
Code:
match_regexps = ['http://online.barrons.com/.*?html\?mod=.*?']
|
12-04-2007, 01:12 AM | #108 |
Groupie
Posts: 182
Karma: 1078201
Join Date: Sep 2007
Device: iPad Air 2
|
Still hangs -- both when I log in and when I don't. If you have the time to check, you should be able to test even without logging in. You can use my profile from the prior post.
|
12-04-2007, 04:17 PM | #109 |
creator of calibre
Posts: 44,411
Karma: 23977332
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Hmm, another regression was preventing match_regexps from working. Fixed in svn. Note that in your case match_regexps should be
match_regexps = ['http://online.barrons.com/.*?html\?mod=.*?|file://.*']
As for the cleanup hanging, it seems to be following a long redirect chain. Use the following code to see the HTTP responses being sent by the server.
Code:
def cleanup(self):
    import sys, logging, traceback
    try:
        # Have mechanize log every HTTP response to stdout
        self.browser.set_debug_responses(True)
        logger = logging.getLogger("mechanize")
        logger.addHandler(logging.StreamHandler(sys.stdout))
        logger.setLevel(logging.INFO)
        res = self.browser.open('http://online.barrons.com/logout')
    except Exception:
        # Print the traceback instead of letting the error abort the run
        traceback.print_exc()
|
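For anyone following along with the match_regexps correction in post #109, here is a minimal sketch of what the extra `|file://.*` alternative changes. It only exercises the regular expression with Python's `re` module. Whether libprs500 applies the pattern with `search` or `match` is an assumption here, and the sample URLs are made up for illustration.

```python
import re

# Same pattern as in the post, written as a raw string so that \? is an
# unambiguous regex escape for a literal question mark.
pattern = re.compile(r'http://online.barrons.com/.*?html\?mod=.*?|file://.*')

# Hypothetical URLs, not taken from the Barron's site.
candidates = [
    'http://online.barrons.com/article/SB0000.html?mod=example',
    'file:///tmp/index.html',
    'http://example.com/other.html',
]

# Keep only the links the pattern accepts (assumes a search-style test).
followed = [url for url in candidates if pattern.search(url)]

for url in followed:
    print(url)
```

The first alternative accepts Barron's article links that carry a `mod=` query parameter, and the second accepts any `file://` URL. The third candidate is rejected by both.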
12-04-2007, 04:41 PM | #110 |
Groupie
Posts: 182
Karma: 1078201
Join Date: Sep 2007
Device: iPad Air 2
|
Thanks for all of your help, Kovid.
I'll take a look at the code and link you recommended and see if I can come up with a solution. Once that's all worked out, the profiles I made for WSJ.com and Barrons.com should be pretty much done. I'll probably start working on other finance/investment sites after that. (The WSJ.com blogs should be pretty easy to implement -- and they're free, too!). |
12-05-2007, 03:21 AM | #111 |
Groupie
Posts: 182
Karma: 1078201
Join Date: Sep 2007
Device: iPad Air 2
|
Error
What does the following error mean?
Code:
Traceback (most recent call last):
  File "convert_from.py", line 187, in <module>
  File "convert_from.py", line 181, in main
  File "convert_from.py", line 123, in process_profile
  File "libprs500\ebooks\lrf\web\profiles\__init__.pyo", line 92, in __init__
  File "libprs500\ebooks\lrf\web\profiles\__init__.pyo", line 104, in build_index
  File "libprs500\ebooks\lrf\web\profiles\__init__.pyo", line 159, in parse_feeds
ValueError: too many values to unpack
http://feeds.portfolio.com/portfolio/businessspin
Thanks.
|
12-05-2007, 03:39 AM | #112 |
creator of calibre
Posts: 44,411
Karma: 23977332
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
That means the get_feeds function is not returning a correct sequence.
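A minimal sketch of how this error can arise, assuming that parse_feeds unpacks each entry returned by get_feeds into a (title, url) pair. That is what the traceback suggests, but it is not confirmed in this thread. The `parse_feeds` function below is a hypothetical stand-in, not the library's code, and the feed title is made up.

```python
def parse_feeds(feeds):
    """Hypothetical stand-in: unpack every entry into (title, url)."""
    return [(title, url) for title, url in feeds]

# A list of 2-tuples unpacks cleanly.
good = [('Business Spin', 'http://feeds.portfolio.com/portfolio/businessspin')]

# A list of bare URL strings does not: each string is unpacked character
# by character, which is far more than two values.
bad = ['http://feeds.portfolio.com/portfolio/businessspin']

def check(feeds):
    """Return 'ok' or the ValueError message raised while unpacking."""
    try:
        parse_feeds(feeds)
        return 'ok'
    except ValueError as err:
        return 'ValueError: %s' % err

print(check(good))
print(check(bad))
```

If the profile's get_feeds returns something shaped like `bad`, wrapping each URL in a tuple with a title would be the first thing to try.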
|
12-05-2007, 03:44 AM | #113 |
Groupie
Posts: 182
Karma: 1078201
Join Date: Sep 2007
Device: iPad Air 2
|
I'm trying to set up profiles for some full content feeds, in which I go no further than listing the articles with descriptions (since the descriptions in the feed contain the full content). However, I noticed that linked text in a feed description is removed.
I know html2lrf had a regression which removed linked text completely (which you have already fixed). So I thought maybe this was a regression, too. If not, perhaps you could set it up so that it just strips the links from the descriptions but keeps the text in place. Thanks. |
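To make the request in post #113 concrete, here is a rough sketch of "strip the links but keep the text" for a feed description. This is not how html2lrf or web2lrf handles descriptions. It is only a regex-based illustration, and `strip_links` and the sample string are hypothetical.

```python
import re

# Matches opening <a ...> and closing </a> tags only, not the text between
# them. The \b keeps tags such as <abbr> from being caught.
ANCHOR_TAG = re.compile(r'</?a\b[^>]*>', re.IGNORECASE)

def strip_links(description):
    """Remove anchor tags from an HTML fragment while keeping the link text."""
    return ANCHOR_TAG.sub('', description)

sample = 'Read <a href="http://example.com/story">the full story</a> here.'
stripped = strip_links(sample)
print(stripped)
```

A regex is enough for simple feed descriptions. Heavily nested or malformed markup would be better served by a real HTML parser.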
12-05-2007, 03:47 AM | #114 |
Groupie
Posts: 182
Karma: 1078201
Join Date: Sep 2007
Device: iPad Air 2
|
|
12-05-2007, 11:53 AM | #115 |
creator of calibre
Posts: 44,411
Karma: 23977332
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Can you give me an example of such a feed, so I can debug?
|
12-05-2007, 04:17 PM | #116 |
Groupie
Posts: 182
Karma: 1078201
Join Date: Sep 2007
Device: iPad Air 2
|
Here's one from the profile I was working on.
http://feeds.portfolio.com/portfolio/businessspin
I've attached the lrf generated from the profile, so you can see the results.
|
12-05-2007, 04:28 PM | #117 |
creator of calibre
Posts: 44,411
Karma: 23977332
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Ah ok, should be fixed in svn. Let me know if it still gives you trouble.
|
12-05-2007, 10:09 PM | #118 |
Groupie
Posts: 182
Karma: 1078201
Join Date: Sep 2007
Device: iPad Air 2
|
max_recursions error
Whenever I set max_recursions to 0 or 1 in a profile, I get the following error after the lrf is generated:
Code:
Exception exceptions.WindowsError: WindowsError(32, 'The process cannot access the file because it is being used by another process') in <bound method Portfolio.__del__ of <portfolio.Portfolio object at 0x00FCFCF0>> ignored
Last edited by JTravers; 12-05-2007 at 10:12 PM.
|
12-05-2007, 10:35 PM | #119 |
creator of calibre
Posts: 44,411
Karma: 23977332
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
That error can be safely ignored; all it means is that some temporary file was not deleted.
|
12-06-2007, 03:19 AM | #120 |
Groupie
Posts: 182
Karma: 1078201
Join Date: Sep 2007
Device: iPad Air 2
|
New Profiles
Just to let everyone know, I posted profiles for the Wall Street Journal, Barron's, and Portfolio.com on Kovid's wiki.
https://libprs500.kovidgoyal.net/wiki/UserProfiles
Subscribers to WSJ and Barron's should be able to get all the content using the --username and --password options in web2lrf. Non-subscribers will get the free articles only.
Be aware that because of the way the WSJ and Barron's sites handle concurrent logins, you may get locked out of your account for a short period of time when using the WSJ and Barron's profiles. You would probably have to run the profiles (with login credentials) multiple times before this happens, though, so if you're only running them once within a reasonable period of time, you should be safe.
|
Tags |
libprs500, web2lrf |
|