03-12-2004, 12:03 AM | #1 |
mechanoholic
Posts: 582
Karma: 1000217
Join Date: Mar 2004
Location: Sarasota, FL
Device: Nook STR/iPhone 4S/EVO 4G
|
New York Times Scoop!
Okay, I have worked out a rudimentary site file that reads the New York Times front page from the RadioUserland RSS feed. It works well, but the format of the output contents page is really ugly and needs some major help. (On the other hand, the story pages look great.) I'm struggling to figure out how to make this change, but in the meantime, feel free to give it a whirl. Any comments are welcome. Any guidance on Perl would be great.
You can either copy the following text and save it in a file with the .site extension (e.g. NYT_Front.site) or just download the attachment.

#NYTimes Front Page
#sitescooper .site file by Ignatz Sol
URL: http://partners.userland.com/nytrss/nytHomepage.xml
Name: New York Times: Front Page
Description: The latest New York Times front page headlines.
ContentsFormat: rss
Levels: 2
StoryURL: http://www.nytimes.com/.+USERLAND.*
StoryToPrintableSub: s/USERLAND/USERLAND&pagewanted=print&position=/
StoryStart: </head>
StoryEnd: /NYT_TEXT
|
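For anyone wondering what the StoryURL and StoryToPrintableSub lines do to a link, here is a minimal Python sketch of the same two steps. It is only an illustration, not how sitescooper itself is written: the function names and the sample URL are made up, and treating StoryURL as a whole-URL match is an assumption.

```python
import re

# Same pattern as the StoryURL line in the site file above.
STORY_URL = re.compile(r"http://www.nytimes.com/.+USERLAND.*")


def is_story_url(url: str) -> bool:
    # Assumption: the pattern has to cover the whole URL.
    return STORY_URL.fullmatch(url) is not None


def to_printable(url: str) -> str:
    # Mirrors s/USERLAND/USERLAND&pagewanted=print&position=/
    # Perl's s/// without the g flag replaces only the first match,
    # hence count=1.
    return re.sub("USERLAND", "USERLAND&pagewanted=print&position=", url, count=1)


if __name__ == "__main__":
    # Hypothetical story link, only for illustration.
    sample = "http://www.nytimes.com/2004/03/12/national/12EXAMPLE.html?partner=USERLAND"
    if is_story_url(sample):
        print(to_printable(sample))
```

The point of the substitution is just to tack the printer-friendly parameters onto any link that carries the USERLAND marker.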
03-12-2004, 11:00 AM | #2 |
mechanoholic
|
Never mind this one: I've got a much better one coming. Admittedly, it's not mine. I've found a great scoop by Kennis Koldewyn on the sitescooper mailing list and I'm modifying it to improve it and make it easier for everyone to get what they want from it. Stay tuned, the New York Times is almost within reach.
|
03-12-2004, 04:52 PM | #3 |
mechanoholic
|
Okay, now I've got a good one. The core of this scoop came from the sitescooper mailing list and was written by Kennis Koldewyn. I've just expanded and tweaked it a bit. The basic idea is great. You have an HTML file on your desktop that contains links to all the text-only menus at the NYT. This local HTML file is your URL. The site file is 3 levels deep, so you get your local file as the top level, then the links to headlines, and finally the stories. In preliminary testing it has performed admirably.
However, there are a few outstanding issues.

First, I recommend that you severely limit the categories from which you download. There are a lot of stories available and your converted file can easily get big in a hurry. The raw HTML file here has every option commented out except for National and International headlines, but I have included every category that you see on this page. What you must do is delete the open and close comment markers on the sections that you want. (Open comment is "<!--" and close comment is "-->".) I've been using only 10 sections and I can quickly go up to 900KB unconverted. (iSilo then shrinks this back down to around 300KB.)

If your raw converted file size is above 500KB, sitescooper will stop scooping. You have to add a parameter to your scooping command to redefine the limit. For example, if your command is:

perl sitescooper.pl -site NYTimes.site -misilox

and sitescooper is reporting that it's running over the limit, you can add a parameter like the following:

perl sitescooper.pl -site NYTimes.site -limit 1000 -misilox

This will up the limit to 1000KB. If it's still not enough, go back and change it again.

Also, some of the categories keep stories that are way out of date. If the stories are more than 10 days old, the URL that this site file uses gets redirected (because of the way that NYT archives their old content) and you lose the printer-friendly page. So if that story is split over two pages, you won't get the second page. I have tried a few tricks, such as setting the "StoryFollowLinks" parameter in the site file to 1, but that hasn't worked. I'm also looking at possible ways to filter out the older URLs and just not scoop them at all, but that involves some Perl date manipulation, and I haven't got that knack yet.

Also, sometimes I've seen story pages left blank on one run that work fine on the next run. This may be some sort of network issue. If it doesn't work the first time through, try running it again and see if it picks up what it missed the first time around.

Regardless, in my testing it has worked fabulously. There are no cookie issues. The printer-friendly pages make for nice reading. If you've been waiting for a non-Avantgo NYTimes, here's your chance. If this works for you, please let me know! If you encounter any weird behavior, please let me know that too. I haven't checked even a quarter of the possible pages, so anything could happen. The movies section had slightly different formatting than the other pages and required a little tweaking. Some other sections might also.

To summarize: download the new_york_times.html file below (actually it shows up as new_york_time.txt, because the html extension is not allowed; once you download it, change the extension back to html). Download the NYTimes.site file. Put them in your sitescooper folder. You will have to edit the URL portion of the site file to reflect exactly where the new_york_times.html file is. Then create a batch file to run this one exclusively, like in the examples at the top of the page, or add the NYTimes.site file to your sites directory and let it run when the rest of your sites run.

Sitescooper is more complicated than the other guys, but well worth the effort. Any questions or comments? Let me hear them... Last edited by ignatz; 03-12-2004 at 05:25 PM. |
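On the idea of filtering out stories more than 10 days old: the date logic could look something like the rough Python sketch below. It is not something the site file can use as-is, and it rests on a big assumption, namely that story URLs carry the publication date in the path as /YYYY/MM/DD/. The function names and sample URLs are hypothetical.

```python
import re
from datetime import date, timedelta

# Assumption: the publication date appears in the URL path as /YYYY/MM/DD/.
DATE_IN_PATH = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")


def story_date(url):
    """Return the date embedded in the URL, or None if there isn't one."""
    match = DATE_IN_PATH.search(url)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits that look like a date but aren't (e.g. month 13).
        return None


def is_fresh(url, today, max_age_days=10):
    """True if the story is at most max_age_days old (undated URLs are kept)."""
    published = story_date(url)
    if published is None:
        return True
    return today - published <= timedelta(days=max_age_days)


def filter_fresh(urls, today, max_age_days=10):
    """Keep only the URLs that should still have a printer-friendly page."""
    return [url for url in urls if is_fresh(url, today, max_age_days)]


if __name__ == "__main__":
    links = [
        "http://www.nytimes.com/2004/03/11/national/11EXAMPLE.html",
        "http://www.nytimes.com/2004/02/20/movies/20EXAMPLE.html",
    ]
    for link in filter_fresh(links, today=date.today()):
        print(link)
```

URLs without a recognizable date are kept rather than dropped, on the theory that scooping one stale story is better than silently skipping a section index.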
03-15-2004, 10:47 PM | #4 |
mechanoholic
|
I'm now using this scoop daily and it works great. Another advantage of grabbing this much info with Sitescooper is that it can compare against its cache and not spend time reconverting data that it has already processed. With several days to weeks of stories available, this could help speed up conversion.
My iSilo converts at 5:30 am, Sitescooper at 5:45, and then it all syncs at 6. When I walk in and grab my Palm I'm ready to go... If anyone has concerns about configuring these files, I'd be happy to help, and will customize them for you. Let me know. |
03-22-2004, 09:34 AM | #5 |
Connoisseur
Posts: 69
Karma: 10
Join Date: Jan 2004
Location: Israel
Device: Kindle Paperwhite, iPhone, iPad
|
Okay, I've never used Sitescooper but would love to get the NY Times into my iSilo every day. Is there a way of doing this without Sitescooper?
|
04-08-2004, 10:16 AM | #6 |
mechanoholic
|
Here are a few "premade" individual section scoops for International, National, and Technology. More are available upon request. Alex, you can add these to the "scoops" section here.
They do require one edit. You must open the .site file and change the path to the appropriate html file (included here). Last edited by ignatz; 04-08-2004 at 01:33 PM. |
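If you'd rather not hand-edit the path, the one required change could be scripted along these lines. This is just a sketch under assumptions: that the site file has a single "URL:" line, and that a file:// URL is what belongs there for a local HTML file. The function name set_site_url is made up for illustration.

```python
import re
from pathlib import Path


def set_site_url(site_text: str, html_path: str) -> str:
    """Point the first 'URL:' line of a .site file at a local HTML file."""
    # Assumption: a file:// URL is the right form for a local file.
    file_url = Path(html_path).resolve().as_uri()
    # Replace only the first URL line and keep its original indentation.
    return re.sub(
        r"^([ \t]*URL:[ \t]*).*$",
        lambda m: m.group(1) + file_url,
        site_text,
        count=1,
        flags=re.MULTILINE,
    )


if __name__ == "__main__":
    # Hypothetical file names; adjust to wherever you saved the downloads.
    site_file = Path("NYTimes.site")
    if site_file.exists():
        updated = set_site_url(site_file.read_text(), "new_york_times.html")
        site_file.write_text(updated)
```

Run it from the folder where both files were saved, so the relative path resolves to the right place.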
04-08-2004, 11:37 AM | #7 |
Fully Converged
Posts: 18,171
Karma: 14021202
Join Date: Oct 2002
Location: Switzerland
Device: Too many to count here.
|
Thanks, Ignatz. I will do this as soon as I am back at work (next week, after Easter). My machine at work does all the scooping work :P
|
04-08-2004, 01:45 PM | #8 |
Fanatic
Posts: 522
Karma: 14050
Join Date: May 2003
Location: Astoria, NY
Device: Zire 71
|
Really would like the NYTimes thing put to rest. Thanks for the update.
|