New York Times Scoop!

ignatz · 03-12-2004, 12:03 AM

Okay, I have worked out a rudimentary site file that reads the New York Times front page from the RadioUserland rss feed. It works well, but the format of the output contents page is really ugly and needs some major help. (On the other hand, the story pages look great.) I'm struggling to figure out how to make this change, but in the meantime, feel free to give it a whirl. Any comments are welcome. Any guidance on Perl would be great.

You can either copy the following text and save it in a file with the .site extension (e.g. NYT_Front.site) or just download the attachment.

#NYTimes Front Page
#sitescooper .site file by Ignatz Sol
URL: http://partners.userland.com/nytrss/nytHomepage.xml
Name: New York Times: Front Page
Description: The latest New York Times front page headlines.
ContentsFormat: rss

Levels: 2
StoryURL: http://www.nytimes.com/.+USERLAND.*
StoryToPrintableSub: s/USERLAND/USERLAND&pagewanted=print&position=/

StoryStart: </head>
StoryEnd: /NYT_TEXT
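
For anyone unsure what the StoryToPrintableSub line does to a story link, here is a rough Python sketch of the same substitution. Sitescooper itself is Perl, so this is only an illustration of the rule, not part of the site file. The sample URL in the usage note is made up; only the substitution pattern comes from the site file above.

```python
import re

# Replacement text taken from the StoryToPrintableSub line in the site file.
PRINTABLE_SUFFIX = "USERLAND&pagewanted=print&position="

def to_printable(url: str) -> str:
    """Mimic s/USERLAND/USERLAND&pagewanted=print&position=/ on one URL."""
    # count=1 mirrors Perl's s/// without the /g flag: only the first
    # occurrence of USERLAND is rewritten.
    return re.sub(r"USERLAND", PRINTABLE_SUFFIX, url, count=1)
```

With a hypothetical story link such as http://www.nytimes.com/example.html?partner=USERLAND, to_printable() appends &pagewanted=print&position= right after USERLAND, which is what asks for the printer-friendly version of the story.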

ignatz · 03-12-2004, 11:00 AM

Never mind this one: I've got much better coming. Admittedly, it's not mine. I've found a great scoop by Kennis Koldewyn on the sitescooper mailing list and I'm modifying it to improve it and make it easier for everyone to get what they want from it. Stay tuned, the New York Times is almost within reach.

ignatz · 03-12-2004, 04:52 PM

Okay, now I've got a good one. The core of this scoop came from the sitescooper mailing list and was written by Kennis Koldewyn. I've just expanded and tweaked it a bit. The basic idea is great. You have an html file on your desktop that contains links to all the text-only menus at the NYT. This local html file is your URL. The site file is 3 levels deep, so you get your local file as the top level, then the links to headlines, and finally the stories. In preliminary testing it has performed admirably.

However, there are a few outstanding issues. First, I recommend that you severely limit the categories from which you download. There are a lot of stories available and your converted file can easily get big in a hurry. The raw html file here has every option commented out except for National and International headlines, but I have included every category that you see on this page. What you must do is delete the open and close comment markers on the sections that you want. (Open comment is "<!--" and close comment is "-->".) I've been using only 10 sections and I can quickly go up to 900KB unconverted. (iSilo then shrinks this back down to around 300KB.) If your raw converted filesize is above 500KB, sitescooper will stop scooping. You have to add a parameter to your scooping command to raise the limit. For example, if your command is:

perl sitescooper.pl -site NYTimes.site -misilox

and sitescooper is reporting that it's running over the limit, you can add a parameter like the following:

perl sitescooper.pl -site NYTimes.site -limit 1000 -misilox

This will raise the limit to 1000KB. If that's still not enough, go back and increase it again.

Also, some of the categories keep stories that are way out of date. If the stories are more than 10 days old, the URL that this site file uses gets redirected (because of the way that NYT archives their old content) and you lose the printer-friendly page. So if that story is split over two pages, you won't get the second page. I have tried a few tricks, such as setting the "StoryFollowLinks" parameter in the site file to 1, but that hasn't worked. I'm also looking at possible ways to filter out the older URLs and just not scoop them at all, but that involves some Perl date manipulation, and I haven't got that knack yet.
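
The date filtering idea could look something like the Python sketch below. This is only an illustration of the logic, not something that plugs into sitescooper (which is Perl). It assumes that story URLs carry their publication date as /YYYY/MM/DD/ in the path; the function names and the sample URLs are hypothetical.

```python
import re
from datetime import date, timedelta
from typing import Optional

# Assumption: the story URL embeds its date as /YYYY/MM/DD/.
DATE_IN_URL = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")

def story_date(url: str) -> Optional[date]:
    """Return the date embedded in the URL, or None if there is none."""
    match = DATE_IN_URL.search(url)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits were present but did not form a real calendar date.
        return None

def is_fresh(url: str, today: date, max_age_days: int = 10) -> bool:
    """True if the story is at most max_age_days old (or has no date)."""
    published = story_date(url)
    if published is None:
        # No usable date: keep the story rather than silently dropping it.
        return True
    return today - published <= timedelta(days=max_age_days)
```

For example, with today set to date(2004, 3, 12), a URL containing /2004/03/05/ would be kept and one containing /2004/02/20/ would be skipped.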

Also, sometimes I've seen story pages left blank on one run that work fine on the next run. This may be some sort of network issue or something. But if it doesn't work the first time through, try running it again and see if it picks up what it missed the first time around.

Regardless, in my testing it has worked fabulously. There are no cookie issues. The printer-friendly pages make for nice reading. If you've been waiting for a non-Avantgo NYTimes, here's your chance. If this works for you, please let me know! If you encounter any weird behavior, please let me know. I haven't checked even a quarter of the possible pages, so anything could happen. The movies section had slightly different formatting than the other pages and required a little tweaking. Some other sections might also.

To summarize, download the new_york_times.html file below (actually it shows up as new_york_time.txt, because the html extension is not allowed; once you download it, change the extension back to html). Download the NYTimes.site file. Put them in your sitescooper folder. You will have to edit the URL portion of the site file to reflect exactly where the new_york_times.html file is. Then create a batch file to run this one exclusively, like in the examples at the top of the page, or add the NYTimes.site file to your sites directory and let it run when the rest of your sites run.
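
If you are not sure how to write the location of the local file as a URL, the Python sketch below prints a standard file:// URL for a path on your machine. It assumes the URL line in the site file takes a normal file:// URL, which you should check against the line already in the downloaded NYTimes.site. The helper name and the example path are hypothetical.

```python
from pathlib import Path

def local_file_url(path: str) -> str:
    """Turn a local file path into a file:// URL."""
    # resolve() makes the path absolute, which as_uri() requires;
    # the file does not have to exist for this to work.
    return Path(path).resolve().as_uri()

if __name__ == "__main__":
    # Example path only; point this at your own sitescooper folder.
    print(local_file_url("new_york_times.html"))
```

Spaces and other special characters in the path are percent-encoded in the result, so a folder like "My Documents" comes out as My%20Documents.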

Sitescooper is more complicated than the other guys, but well worth the effort. Any questions or comments? Let me hear it...

ignatz · 03-15-2004, 10:47 PM

I'm now using this scoop daily and it works great. Another advantage of grabbing this much info with Sitescooper is that it can compare against its cache and not spend time reconverting data that it has already processed. With several days to weeks of stories available, this could help speed up conversion.

My iSilo converts at 5:30 am, Sitescooper at 5:45, and then it all syncs at 6. When I walk in and grab my Palm I'm ready to go...

If anyone has concerns about configuring these files, I'd be happy to help, and will customize them for you. Let me know.

melvynadam · 03-22-2004, 09:34 AM

Okay, I've never used Sitescooper but would love to get the NY Times into my iSilo every day. Is there a way of doing this without Sitescooper?

ignatz · 04-08-2004, 10:16 AM

Here are a few "premade" individual section scoops for International, National, and Technology. More are available upon request. Alex, you can add these to the "scoops" section here.

They do require one edit. You must open the .site file and change the path to the appropriate html file (included here).

Alexander Turcic · 04-08-2004, 11:37 AM

Thanks Ignatz. I will do this as soon as I am back at work (next week, after Easter). My machine at work does all the scooping work :P

Zire · 04-08-2004, 01:45 PM

Really would like the NYTimes thing put to rest. Thanks for the update.

