10-08-2006, 04:34 AM | #1 |
Addict
Posts: 364
Karma: 1035291
Join Date: Jul 2006
Location: Redmond, WA
Device: iPad Mini,Kindle Paperwhite
|
Web2Book
Hi all
Here's a program to make HTML, RTF, LRF or PDF files from RSS feeds and other websites (PDF output supports rich formatting if you have htmldoc installed). You need .NET Framework 2.0 or later installed to run it. PDF output uses the ISO-8859-15 character set, so some European languages are supported. The program can write the output files to your PC or sync them directly to the Sony Reader over USB if it is attached.
Just go to Tools-Options and make sure the options are set the way you want them, add a bunch of RSS feeds to the datagrid on the main window, and hit Go!
If you want to use feeds that others have already set up, open the File menu and select Subscribe. You'll be shown the set of available published feeds. Check the boxes next to the ones you want and click the Subscribe button, and they'll be added to your setup. Note that you can subscribe separately to webpage entries from the webpages tab.
Attached are three screenshots; if all you want to do is look at RSS content then the last screenshot covers most of what you'll deal with (once you've checked the options in the first screenshot). The complex-looking dialog in the middle is for extracting full HTML from RSS feeds that only include summaries; with some tweaking you can get the app to fetch full content with ads and other noise stripped.
Non-geeks can stop reading here and should just try the app out using the Subscribe facility in the File menu. Hardcore geeks who understand regular expressions, read on for details of how to add new feeds that no-one has published yet.
web2book supports a fairly powerful extension mechanism. Selecting a feed entry and clicking the Customize button brings up the advanced settings. Once in this property view you can also use the Test button to test your configuration for that feed; if all is well it will eventually open your PDF reader with the output for that site.
A fairly detailed log is also generated to help with troubleshooting. Once you are satisfied with the results for the entry you created, you can share it with others by clicking the Publish button.
The properties are mostly there to support getting full versions of articles, possibly via modified links that point to lower-noise printable versions, and extracting a subset of the article HTML (to skip ads, etc). The various properties for Feeds are:
Url - pretty obvious; this is the RSS feed URL.
Enabled - whether to include this feed when you click Go! in the main view.
Days - how many days back to go when using RSS entries.
Content Element - in most cases you can leave this blank; if specified (and if the Link Element field described below is blank) then the body of the element with this name will be used for the article text. If blank then rss2book will look for any of 'description', 'summary' or 'content'.
Link Element - the element in the RSS feed that specifies the link to the full article. Don't specify anything here unless you actually want the full article. Otherwise this will typically be either 'link' or 'guid' for most RSS feeds.
Link Extractor Pattern - this is an optional regular expression that will be applied to the link element to parse it into a collection of one or more substrings. You need to use unnamed groups (i.e. bits of regular expression pattern enclosed in parentheses) to identify the various substrings. If you leave this blank the original link will be used to create a single-element collection. Two simple examples:
(\d+) - will extract the first sequence of digits found in the link element
http://(.*) - will strip off the leading http:// from the link element
Apply extractor to linked content instead of link text - if this is checked, then the extractor pattern above is not applied to the link; instead, we follow the link and retrieve the web page at that link, then apply the extractor pattern to the contents of that page.
This is useful, for example, to extract 'printable version' URLs from article pages if there is no simple textual mapping from an article URL to the corresponding 'printable version' URL, but the 'printable version' URL is contained in the article page (tip: for web pages that have printable versions, the printable version is preferable).
Link Formatter - this is a format string that gets used to create a new link from the collection created above by the link extractor. It consists of a string with parameters {0}, {1}, {2}, etc, which are expanded to the various substrings in the collection. If you leave it blank that is equivalent to "{0}" - i.e. just use the first substring.
Content Extraction Pattern - this is a regular expression that is applied to the article content HTML from the previous step. It should have a single unnamed group; the text that matches that group is used as the final article content HTML. If left blank then the full article content from the link processing step is used.
Content Reformatter - this is similar to the link formatter. It can be used to wrap or insert some additional HTML around the content extracted by the pattern in the last step. If left blank it has no effect. Once again positional parameters {0}, ... are used to identify the matched groups from the content extraction step.
The Tools menu has a regular expression tester that you may find helpful when doing advanced feed setups.
Okay, this probably sounds more complicated than it is, so here are some examples:
Name: BBC News
URL: http://newsrss.bbc.co.uk/rss/newsonl...t_page/rss.xml
Link Element: guid
Link Extractor Pattern: http://(.*)
Link Formatter: http://newsvote.bbc.co.uk/mpapps/pagetools/print/{0}
Content Extraction Pattern:
I.e. get the RSS feed from the URL, pull out the links in the 'guid' elements, strip off the 'http://' part, prepend http://newsvote.bbc.co.uk/mpapps/pagetools/print/, then get the HTML at that link.
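The link steps in the BBC example can be sketched in Python roughly as follows. This is only an illustration of the extractor/formatter idea, not the app's actual code (the app is a .NET program): the function name `build_link`, the sample guid value, and the choice to return None for a link that does not match are all assumptions made for the sketch.

```python
import re

def build_link(link_text, extractor_pattern="", formatter=""):
    """Apply an optional extractor regex, then a {0}-style format string."""
    if extractor_pattern:
        match = re.search(extractor_pattern, link_text)
        if match is None:
            return None  # assumption: a link that does not match is skipped
        # Unnamed groups become the substring collection; if the pattern
        # has no groups, fall back to the whole match (also an assumption).
        parts = match.groups() or (match.group(0),)
    else:
        # Blank extractor: the original link is the only substring.
        parts = (link_text,)
    # A blank formatter is equivalent to "{0}".
    return (formatter or "{0}").format(*parts)

# Hypothetical guid value, made up for illustration.
guid = "http://news.bbc.co.uk/1/hi/example/0000000.stm"
print(build_link(guid,
                 r"http://(.*)",
                 "http://newsvote.bbc.co.uk/mpapps/pagetools/print/{0}"))
```

Python's positional `{0}`, `{1}` placeholders happen to look like the format strings described above, which is why `str.format` is used here; regular expression dialects differ between Python and .NET, so more complex patterns may not carry over unchanged.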
Name: Slate
URL: http://www.slate.com/rss/
Link Element: link
Link Extractor Pattern: (\d+)
Link Formatter: http://www.slate.com/toolbar.aspx?action=read&id={0}
Content Extraction Pattern: (\<font.*)Article URL
I.e. get the RSS from http://www.slate.com/rss/, pull out each 'link' element, extract the first sequence of digits from that element and append it to 'http://www.slate.com/toolbar.aspx?action=read&id=', fetch the HTML at that URL, then extract everything starting from the first '<font' tag up to but not including the text 'Article URL'.
Name: Reuters Top News
URL: http://feeds.reuters.com/reuters/topNews/
Link Element: guid
Link Formatter: http://today.reuters.com/misc/PrinterFriendlyPopup.aspx?type=topNews&storyID={0}
Content Extraction Pattern: (<span class=\"artTitle.*)</td>
I.e. get the RSS at http://feeds.reuters.com/reuters/topNews/, pull out each guid element, append the guid to 'http://today.reuters.com/misc/PrinterFriendlyPopup.aspx?type=topNews&storyID=', then get the HTML at that URL.
If you want to put Wikipedia articles on your reader, use something like:
URL: http://en.wikipedia.org/wiki/Nikola_Tesla
HTML: checked
Content Extraction Pattern: <!-- start content -->(.*)<!-- end content -->
Website entries support some metacharacters in the URL for dates, namely @yyyy, @yy, @mm and @dd. These are expanded to the year, month or day (either four or two digits for the year; two digits for the others). If you specify a Number Of Days entry, then the URL will be expanded for each day in the range and the contents for each day will be concatenated, starting with the oldest and ending with the current day. For example, the following will get one week of Dilbert comic strips:
Url: http://www.unitedmedia.com/comics/di...yyy@mm@dd.html
Number Of Days: 6
Content Extraction Pattern: (<IMG SRC="/comics/dilbert/archive/images/dilbert[^>]*>)
Content Reformatter: {0}<br>
Last edited by geekraver; 04-16-2007 at 12:46 PM. |
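For the date metacharacters and the content extraction step described in the first post, here is a rough Python sketch. It is not the app's code. The function names, the example.com URL template, and the sample HTML are hypothetical; treating Number Of Days as "that many days back, plus today" is an assumption based on 6 giving one week; and using `re.DOTALL` so that `.` spans line breaks is also an assumption about how the patterns are meant to behave.

```python
import re
from datetime import date, timedelta

def expand_date_url(url_template, day):
    """Replace @yyyy, @yy, @mm and @dd with parts of the given date."""
    # @yyyy is replaced before @yy so the longer token is not split.
    return (url_template
            .replace("@yyyy", f"{day.year:04d}")
            .replace("@yy", f"{day.year % 100:02d}")
            .replace("@mm", f"{day.month:02d}")
            .replace("@dd", f"{day.day:02d}"))

def urls_for_range(url_template, number_of_days, today):
    """One URL per day, oldest first, ending with the current day."""
    return [expand_date_url(url_template, today - timedelta(days=back))
            for back in range(number_of_days, -1, -1)]

def extract_content(html, extraction_pattern="", reformatter=""):
    """Keep the text matched by the pattern's groups, optionally wrapped."""
    if not extraction_pattern:
        return html  # blank pattern: use the full page
    match = re.search(extraction_pattern, html, re.DOTALL)
    if match is None:
        return ""  # assumption: nothing is kept when the pattern fails
    # A blank reformatter has no effect, i.e. it behaves like "{0}".
    return (reformatter or "{0}").format(*match.groups())

# Hypothetical template and page content, for illustration only.
template = "http://www.example.com/comics/strip@yyyy@mm@dd.html"
for url in urls_for_range(template, 6, date(2006, 10, 8)):
    print(url)

page = '<td><IMG SRC="/comics/strip/today.gif" ALT="strip"></td>'
print(extract_content(page, r'(<IMG SRC="/comics/strip[^>]*>)', "{0}<br>"))
```

The per-day results would then be concatenated in the order returned by `urls_for_range`, oldest first.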
10-10-2006, 02:36 AM | #2 |
Addict
Posts: 364
Karma: 1035291
Join Date: Jul 2006
Location: Redmond, WA
Device: iPad Mini,Kindle Paperwhite
|
Bumping the thread, as editing the posting and attachments didn't.
|
10-27-2006, 03:11 PM | #3 |
Member
Posts: 23
Karma: 47
Join Date: Oct 2006
Device: Sony Reader/Treo 600
|
I'm surprised by the lack of comment on this app... This is most likely the coolest side function for the Reader yet! The RSS included with Sony's CONNECT software is basically broken compared to this.
Thanks Geekraver! You have a fan... |
10-27-2006, 04:41 PM | #4 |
Recovering Gadget Addict
Posts: 5,381
Karma: 676161
Join Date: May 2004
Location: Pittsburgh, PA
Device: iPad
|
I subscribed to this thread in anticipation of trying it. I agree that RSS solutions are a big deal. Many like me are probably just waiting a bit to see what the best and easiest solutions turn out to be.
So I guess I have the same question... where are all the early software adopters with reports on how great this is? ;-) Let me clarify.. sometimes things don't always come out the way you mean them... I'm not in any way trying to say that if it was good we'd see early adopter reports. I am trying to say I'm surprised that there aren't more people eager to try this because it sounds so good! It makes you really cringe when you realize something you meant in a positive way can sound so negative! Sorry for the ambiguity. |
10-27-2006, 07:50 PM | #5 |
Member
Posts: 23
Karma: 47
Join Date: Oct 2006
Device: Sony Reader/Treo 600
|
I could definitely add more info on using this app, good pointer Bob.
It did take a .NET update on my XP machine and I had to track down the open source version of HTMLDoc. I do admit to being a nerd for a living too, so this might be a bit more daunting for others. After getting it all together, I tried the app and didn't think it worked, but after a reboot it did work. Flawlessly. I'll see if I can backtrack my steps and provide some links if others are having a problem getting this going (pipe up if you are).
The real strength of this app is the table of contents it builds. It separates each RSS site and gives a sub table of contents for that selection. I wasn't even aware the Reader would do this until I used RSS2Book. Then, once viewing, each article in the feed is well formatted, with what appears to be a solid effort to justify each article's text when possible. Too cool. |
10-28-2006, 02:46 AM | #6 |
Addict
Posts: 364
Karma: 1035291
Join Date: Jul 2006
Location: Redmond, WA
Device: iPad Mini,Kindle Paperwhite
|
Glad you like it! I just wish I could figure out why my C# code to interface with the Reader is busted (I've pored over Igor's Python code and can't see what I'm doing wrong, but I get weird errors). If I get past that hurdle then I will finally get the app to sync straight to the Reader.
I'm also interested in how folks think the app should deal with successive updates. I tend to run it about once every three days and replace the old rss2book.pdf on my Reader with a new one, but I have to remember when I last ran it. I could generate a separate PDF for each day. Any other ideas? Unfortunately it doesn't seem like you can delete files on the Reader itself without attaching it to the PC, or I would definitely go the single-day-at-a-time route and just keep dumping files whenever the Reader is attached, relying on them being deleted on the Reader manually once they are read.
One other problem I've noticed is that with a couple of sites I get weird characters when viewing on the Reader. I suspect this is a character set issue. Perhaps it could be worked around by using embedded fonts. Anyone else notice this? |
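If the weird characters do come from text that falls outside the ISO-8859-15 character set used for PDF output (which is only a suspicion at this point), one way to check a piece of feed text is sketched below in Python. The function name `unsupported_characters` and the sample strings are made up for illustration; this is a diagnostic idea, not something the app does.

```python
def unsupported_characters(text, charset="iso-8859-15"):
    """Return the distinct characters in text that the charset cannot encode."""
    missing = []
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing

# Accented letters and the euro sign are part of ISO-8859-15 ...
print(unsupported_characters("caf\u00e9 \u20ac5"))
# ... but typographic (curly) quotes are not.
print(unsupported_characters("\u201cquoted\u201d"))
```

Feeds that show a non-empty result here would be candidates for the character problem, since those characters have to be substituted or dropped before they can be written in that character set.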
10-28-2006, 02:50 AM | #7 |
Recovering Gadget Addict
Posts: 5,381
Karma: 676161
Join Date: May 2004
Location: Pittsburgh, PA
Device: iPad
|
This may be way off base, but would it make sense to keep the last n files or n days? Then a daily load with "keep 7 days" would give you anything that was generated within the last week. If the date was part of the filename, even better. And still better yet if only posts that haven't been downloaded are included in successive files.
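A rough Python sketch of the "keep 7 days" idea with the date in the filename follows. The naming scheme `rss2book-YYYY-MM-DD.pdf` and the function names are hypothetical, and the sketch only covers deciding which dated files are old enough to remove, not tracking which posts were already downloaded.

```python
from datetime import date, timedelta

PREFIX = "rss2book-"   # hypothetical naming scheme: rss2book-YYYY-MM-DD.pdf
SUFFIX = ".pdf"

def dated_filename(day):
    """Build the output filename for one day's run."""
    return f"{PREFIX}{day.isoformat()}{SUFFIX}"

def files_to_delete(filenames, today, keep_days=7):
    """Return the dated files that fall outside the last keep_days days."""
    cutoff = today - timedelta(days=keep_days)
    stale = []
    for name in filenames:
        if not (name.startswith(PREFIX) and name.endswith(SUFFIX)):
            continue  # not one of our dated files, leave it alone
        try:
            day = date.fromisoformat(name[len(PREFIX):-len(SUFFIX)])
        except ValueError:
            continue  # no parsable date in the name, leave it alone
        if day <= cutoff:
            stale.append(name)
    return stale

existing = [dated_filename(date(2006, 10, d)) for d in (20, 21, 22, 28)]
print(files_to_delete(existing, today=date(2006, 10, 28)))
```

With `keep_days=7` and a run on 10-28, files dated 10-22 through 10-28 are kept and anything dated 10-21 or earlier is listed for removal.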
|
10-28-2006, 05:25 AM | #8 |
Addict
Posts: 364
Karma: 1035291
Join Date: Jul 2006
Location: Redmond, WA
Device: iPad Mini,Kindle Paperwhite
|
Well, I found my bugs in prsutils (silly me wasn't packing the structs; my C has really gotten rusty). Should finally be able to get the synching done straight from rss2book within a few days if I can snatch some time from work/family/other projects.
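For anyone wondering what "packing the structs" means, here is a small Python illustration of the general pitfall using the standard `struct` module. The two-field header layout is hypothetical and is not the Reader's actual protocol, which isn't described in this thread.

```python
import struct

# Hypothetical header: a 1-byte type followed by a 4-byte unsigned length.
NATIVE = struct.Struct("@BI")  # native alignment: padding may follow the byte
PACKED = struct.Struct("<BI")  # little-endian, no padding between fields

print("native size:", NATIVE.size)  # platform dependent, often 8
print("packed size:", PACKED.size)  # 1 + 4 bytes

packed_bytes = PACKED.pack(1, 16)
print(packed_bytes.hex())

# Reading packed bytes back with the packed layout recovers the fields.
print(PACKED.unpack(packed_bytes))
```

If the device expects the packed layout and the code sends or reads the padded one, every field after the first is shifted, which tends to show up as weird errors rather than an obvious failure. In C# the usual way to request a packed layout is the `StructLayout` attribute with `Pack = 1`.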
|
10-28-2006, 10:00 AM | #9 |
Member
Posts: 11
Karma: 10
Join Date: Oct 2006
Device: Sony PRS-500
|
I'm a new adopter to this Sony Reader thing, and to RSS as well. This looks like a great app for me, to maximize my usage of the Reader- the idea of just loading a completed PDF from the previous day of RSS feeds in the morning on the way to work sounds great. I'm already sold on the thing since I do a lot of traveling and not having to carry around an extensive library is great, but if I can also throw my news/blog/internet reads on there, that's even better.
Is there any way someone could write a more detailed explanation of how to install and use these programs? I found a version of HTMLdoc, but can't figure out how to use it/install it. And I don't know how to put the RSS feeds into your program, much less make it produce a beautiful PDF like you made. I was getting confused with the three lines in the Options menu, the middle one is the formatting details, but I wasn't quite sure of how to set up the first and third. And like I said, I don't know if it even works if you haven't installed HTMLdoc, which I'm trying to look over at the moment. I agree with HeavyB that this app really makes the reader very cool for users like myself. Appreciate any help you can offer for less literate users like myself. |
10-28-2006, 01:39 PM | #10 |
Member
Posts: 11
Karma: 10
Join Date: Oct 2006
Device: Sony PRS-500
|
Think the problem I'm having is I don't know how to install HTMLdoc. Usually I just look for the .exe, but can't seem to find it.
|
10-28-2006, 01:50 PM | #11 |
Jah Blessed
Posts: 1,295
Karma: 1373
Join Date: Apr 2003
Location: The Netherlands
Device: iPod Touch
|
I think people would be greatly helped if an HTMLDoc binary was bundled with the distribution.
|
10-28-2006, 02:51 PM | #12 |
Enthusiast
Posts: 44
Karma: 1061025
Join Date: Oct 2006
Device: Kindle me, Sony Reader missus
|
binaries: links, haven't tried them yet
I haven't tried these yet but if I get one to work, I'll post.
Looks like htmldoc needs some libraries, but it is a bit Martian to me.
http://mamboxchange.com/projects/htmldoc/
http://www.paehl.com/open_source/?HT...Source_Version
http://tecfa.unige.ch/guides/utils/h...compiling.html
http://www.htmldoc.org/documentation...UNIXLinux.html
Best, tony |
10-28-2006, 03:11 PM | #13 |
Enthusiast
Posts: 35
Karma: 12
Join Date: Oct 2006
Device: Amazon Kindle, Sony Reader
|
Geekraver's rss2book app is really great! I can't recommend it enough. It took me about 15 minutes to get my PC set up to use it.
First I downloaded and installed .NET Framework 2.0 here... http://msdn2.microsoft.com/en-us/net.../aa731542.aspx
Then I downloaded and installed the open source version of HTMLDoc (and the 2 required dlls) here... http://www.paehl.com/open_source/?HT...Source_Version
Then I downloaded and installed rss2book from the top post of this forum thread.
Now I'm reading all my favorite RSS feeds as PDF ebooks on my Sony Reader. I agree with HeavyB above that the virtuoso touch is getting a table of contents of your RSS feeds in the PDF. Try it. You'll like it! |
10-28-2006, 04:14 PM | #14 |
Member
Posts: 11
Karma: 10
Join Date: Oct 2006
Device: Sony PRS-500
|
Um... how do you use the dlls? And what should the output path be? Arghh....
|
10-28-2006, 05:42 PM | #15 |
Addict
Posts: 364
Karma: 1035291
Join Date: Jul 2006
Location: Redmond, WA
Device: iPad Mini,Kindle Paperwhite
|
There are places you can download it, such as here. However, if you install a prebuilt version it may be the commercial version, which I believe has some restrictions. I don't know whether the free version is available precompiled or not.
|