web2lrf - Page 8

kovidgoyal · 12-03-2007, 08:27 PM

@JTravers

Just realized I can't look at the cleanup code as I don't have a subscription. Try the following to debug:

Code:

def cleanup(self):
    res = self.browser.open('whatever the url was')
    print res.read()

JTravers · 12-04-2007, 12:37 AM

Quote:

Originally Posted by kovidgoyal

@JTravers
match_regexp works on the contents of the href attribute, i.e. the URL itself, not on the <a> tag.

Here's the code I'm using for the link regexp:

Code:

match_regexps = ['http://online.barrons.com/.*?html\?mod=.*?']

But I can see webpages being fetched from entirely different domains than barrons.com. I've attached my profile for Barrons. You should be able to test it (at your convenience, of course) without supplying a username and password, as there are some articles that are available to non-subscribers.

JTravers · 12-04-2007, 01:12 AM

Quote:

Originally Posted by kovidgoyal

@JTravers

Just realized I can't look at the cleanup code as I don't have a subscription. Try the following to debug:

Code:

def cleanup(self):
    res = self.browser.open('whatever the url was')
    print res.read()

Still hangs -- both when I login and when I don't. If you have the time to check, you should be able to test even without logging in. You can use my profile from the prior post.

kovidgoyal · 12-04-2007, 04:17 PM

Hmm, another regression was preventing match_regexps from working. Fixed in svn. Note that in your case match_regexps should be:

match_regexps = ['http://online.barrons.com/.*?html\?mod=.*?|file://.*']
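A quick way to see which URLs a pattern like this admits is to run it against a few candidates outside web2lrf. This is only a sketch: the `should_follow` helper and the sample URLs are made up for illustration, and it assumes a link is followed when any pattern is found in its URL (`re.search`), which may not be exactly how libprs500 applies `match_regexps`.

```python
import re

# Pattern copied from the post, written as a raw string so the backslash
# reaches the regex engine unchanged.
match_regexps = [r'http://online.barrons.com/.*?html\?mod=.*?|file://.*']


def should_follow(url, patterns=match_regexps):
    """Hypothetical helper: True if any pattern is found in the URL.

    Assumption: patterns are applied with re.search, not re.match.
    """
    return any(re.search(p, url) for p in patterns)


# Illustrative URLs only; none of these are taken from the real site.
candidates = [
    'http://online.barrons.com/article/SB1234.html?mod=b_hps',
    'file:///tmp/index.html',
    'http://example.com/ad.html',
]

for url in candidates:
    print(url, should_follow(url))
```

If pages from other domains still get through, printing each URL next to the result like this makes it easier to tell whether the pattern or the fetcher is at fault.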

As for the cleanup hanging, it seems to be following a long redirect chain.

Use the following code to see the HTTP responses being sent by the server:

Code:

def cleanup(self):
    import sys, logging, traceback
    try:
        # Turn on mechanize's response debugging and send its log to stdout
        self.browser.set_debug_responses(True)
        logger = logging.getLogger("mechanize")
        logger.addHandler(logging.StreamHandler(sys.stdout))
        logger.setLevel(logging.INFO)

        res = self.browser.open('http://online.barrons.com/logout')
    except Exception:
        # Print the full traceback if the request fails
        traceback.print_exc()

You may find the documentation at http://wwwsearch.sourceforge.net/mechanize/ useful for understanding how the browser object works.

JTravers · 12-04-2007, 04:41 PM

Thanks for all of your help, Kovid.

I'll take a look at the code and link you recommended and see if I can come up with a solution.

Once that's all worked out, the profiles I made for WSJ.com and Barrons.com should be pretty much done.

I'll probably start working on other finance/investment sites after that. (The WSJ.com blogs should be pretty easy to implement -- and they're free, too!).

JTravers · 12-05-2007, 03:21 AM

What does the following error mean?

Code:

Traceback (most recent call last):
  File "convert_from.py", line 187, in <module>
  File "convert_from.py", line 181, in main
  File "convert_from.py", line 123, in process_profile
  File "libprs500\ebooks\lrf\web\profiles\__init__.pyo", line 92, in __init__
  File "libprs500\ebooks\lrf\web\profiles\__init__.pyo", line 104, in build_index
  File "libprs500\ebooks\lrf\web\profiles\__init__.pyo", line 159, in parse_feeds
ValueError: too many values to unpack

I get it when trying to process the following feed:
http://feeds.portfolio.com/portfolio/businessspin

Thanks.

kovidgoyal · 12-05-2007, 03:39 AM

That means the get_feeds function is not returning a correct sequence.

JTravers · 12-05-2007, 03:44 AM

I'm trying to set up profiles for some full-content feeds, in which I go no further than listing the articles with descriptions (since the descriptions in the feed contain the full content). However, I noticed that linked text in a feed description is removed.

I know html2lrf had a regression which removed linked text completely (which you have already fixed). So I thought maybe this was a regression, too. If not, perhaps you could set it up so that it just strips the links from the descriptions but keeps the text in place.

Thanks.
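To make the request concrete, here is a rough sketch of "strip the links but keep the text". It is not how html2lrf or web2lrf handles descriptions; `strip_links` is a hypothetical helper, and a regex like this only copes with simple, well-formed markup.

```python
import re

# Matches opening and closing <a> tags, with or without attributes.
_A_TAG = re.compile(r'</?a\b[^>]*>', re.IGNORECASE)


def strip_links(html):
    """Hypothetical helper: drop <a ...> and </a>, keep the enclosed text."""
    return _A_TAG.sub('', html)


description = 'Read <a href="http://example.com/story">the full story</a> here.'
print(strip_links(description))
```

Other tags (for example `<abbr>`) are left alone because the pattern requires a word boundary right after the `a`.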

JTravers · 12-05-2007, 03:47 AM

Quote:

Originally Posted by kovidgoyal

That means the get_feeds function is not returning a correct sequence.

User error on my part. I forgot a comma between the feed title and URL.
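For anyone hitting the same traceback, the mistake can be reproduced in a few lines. This is a sketch only: the feed title is invented, and `parse_feeds` here is a stand-in that just unpacks (title, URL) pairs, not the libprs500 function of the same name.

```python
# Correct entry: a (title, url) tuple. The title is a made-up example.
good_feeds = [
    ('Business Spin', 'http://feeds.portfolio.com/portfolio/businessspin'),
]

# Missing comma: Python joins adjacent string literals, and parentheses
# without a comma do not make a tuple, so this entry is one long string.
bad_feeds = [
    ('Business Spin' 'http://feeds.portfolio.com/portfolio/businessspin'),
]


def parse_feeds(feeds):
    """Stand-in for the real parser: expects exactly two items per feed."""
    parsed = []
    for feed in feeds:
        title, url = feed  # a long string here raises ValueError
        parsed.append((title, url))
    return parsed


print(parse_feeds(good_feeds))
try:
    parse_feeds(bad_feeds)
except ValueError as err:
    print('ValueError:', err)
```

Unpacking the concatenated string into two names fails because the string has far more than two characters, which is where "too many values to unpack" comes from.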

kovidgoyal · 12-05-2007, 11:53 AM

Can you give me an example of such a feed so I can debug?

JTravers · 12-05-2007, 04:17 PM

Quote:

Originally Posted by kovidgoyal

Can you give me an example of such a feed so I can debug?

Here's one from the profile I was working on.
http://feeds.portfolio.com/portfolio/businessspin

I've attached the lrf generated from the profile, so you can see the results.

kovidgoyal · 12-05-2007, 04:28 PM

Ah ok, should be fixed in svn. Let me know if it still gives you trouble.

JTravers · 12-05-2007, 10:09 PM

Whenever I set max_recursions to 0 or 1 in a profile, I get the following error after the lrf is generated:

Code:

Exception exceptions.WindowsError: WindowsError(32, 'The process cannot access 
the file because it is being used by another process') in <bound method Portfolio.__del__ of 
<portfolio.Portfolio object at 0x00FCFCF0>> ignored

If I then set max_recursions to 2 or more, the error goes away.

kovidgoyal · 12-05-2007, 10:35 PM

That error can be safely ignored; all it means is that some temporary file was not deleted.
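As general background, Windows error 32 is raised when something tries to delete a file that still has an open handle. The sketch below shows the usual pattern for avoiding it: close every handle before removing the temporary directory. `TempWorkspace` is a hypothetical class, not libprs500 code, and it does not explain why max_recursions of 0 or 1 triggers the message.

```python
import os
import shutil
import tempfile


class TempWorkspace(object):
    """Hypothetical temp-dir holder that closes its files before deleting."""

    def __init__(self):
        self.tdir = tempfile.mkdtemp(prefix='web2lrf_')
        self.handles = []

    def open(self, name):
        # Track every handle so cleanup() can close it later.
        f = open(os.path.join(self.tdir, name), 'wb')
        self.handles.append(f)
        return f

    def cleanup(self):
        # Close first: on Windows an open file cannot be removed.
        for f in self.handles:
            f.close()
        self.handles = []
        shutil.rmtree(self.tdir, ignore_errors=True)


ws = TempWorkspace()
out = ws.open('index.html')
out.write(b'<html></html>')
ws.cleanup()
print(os.path.exists(ws.tdir))
```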

JTravers · 12-06-2007, 03:19 AM

Just to let everyone know, I posted profiles for the Wall Street Journal, Barron's, and Portfolio.com on Kovid's wiki.
https://libprs500.kovidgoyal.net/wiki/UserProfiles

Subscribers to WSJ and Barron's should be able to get all the content using the --username and --password options in web2lrf. Non-subscribers will get the free articles only.

Be aware that because of the peculiarities of how concurrent logins are handled at the WSJ and Barron's sites, you may get locked out of your account for a short period of time using the WSJ and Barron's profiles. You would probably have to run the profiles (with login credentials) multiple times before this happens, though. So if you're only running it once within a reasonable period of time, you should be safe.
