08-29-2007, 08:34 PM | #211 |
Addict
Posts: 364
Karma: 1035291
Join Date: Jul 2006
Location: Redmond, WA
Device: iPad Mini,Kindle Paperwhite
|
|
09-02-2007, 06:42 PM | #212 | |
Junior Member
Posts: 9
Karma: 10
Join Date: Aug 2007
Device: Sony Reader
|
Quote:
Best -ds |
|
09-02-2007, 08:00 PM | #213 |
Resident Curmudgeon
Posts: 76,465
Karma: 136564696
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
McAfee is giving you a false positive. Either update McAfee or find a virus scanner that actually works. Or you could always turn it off, get your RSS feed sorted, and then turn it back on.
|
09-05-2007, 02:14 PM | #214 |
Junior Member
Posts: 5
Karma: 10
Join Date: Aug 2007
Device: sony ereader
|
When I do that the program crashes. Everything I do, the program crashes.
I followed the download instructions, and I'm not a newbie when it comes to computers. I guess I'm out of luck then; oh well, I will stick with book design. When all of the bugs are fixed I will give this program another shot. Last edited by guardianx; 09-05-2007 at 02:22 PM. |
09-06-2007, 06:30 PM | #215 |
Junior Member
Posts: 7
Karma: 10
Join Date: Jun 2007
Device: Sony Reader
|
Anyone else having problems with the subscribe or publish functions? I'm using version 24 and it hangs every time I invoke either.
|
09-06-2007, 07:30 PM | #216 |
Groupie
Posts: 155
Karma: 1044459
Join Date: Jul 2007
Device: prs-500
|
This tool definitely has its merits and I used it for a long time.
Honestly, after feedbooks.com showed up with their newspaper feature and their synchronization tool for my Sony Reader, I can now dock my reader and load up all my RSS feeds from Feedbooks in seconds. I still do appreciate RSS2book for introducing me to properly formatted PDF RSS feeds, and it remains useful for those stubborn websites with limited RSS feeds. F. |
09-07-2007, 07:35 PM | #217 |
Addict
Posts: 364
Karma: 1035291
Join Date: Jul 2006
Location: Redmond, WA
Device: iPad Mini,Kindle Paperwhite
|
The problem is my DSL speed. Verizon cannot upgrade me as I am on frame relay and there is no ATM or FIOS available in my area. I will look into other solutions for hosting this.
|
09-09-2007, 06:48 AM | #218 |
Junior Member
Posts: 1
Karma: 10
Join Date: Sep 2007
Device: SPH-A580
|
Hi! Great program.
Is it possible to generate SEPARATE PDFs for each story in a feed? I'm trying to create an archive of stories from a particular site, and I'd rather have separate PDFs than one giant one with a month's worth of stories. HTMLDoc doesn't seem to have this feature natively either. Maybe I'd have to recursively run your program for each link? Thanks! |
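One possible workaround for the poster's request, sketched outside the tool (the feed element names are the usual RSS ones, and the per-file HTMLDOC invocation is an assumption, not a tested recipe): split the feed into one HTML file per story yourself, then run HTMLDOC once per file so each story becomes its own PDF.

```python
import subprocess
import xml.etree.ElementTree as ET

def split_feed(feed_xml_path):
    """Write one small HTML file per <item> in the feed; return the paths."""
    tree = ET.parse(feed_xml_path)
    paths = []
    for i, item in enumerate(tree.iter("item")):
        title = item.findtext("title", default="untitled")
        body = item.findtext("description", default="")
        path = "story_%03d.html" % i
        with open(path, "w", encoding="utf-8") as f:
            f.write("<html><head><title>%s</title></head><body>%s</body></html>"
                    % (title, body))
        paths.append(path)
    return paths

def to_pdfs(paths):
    # One HTMLDOC run per story yields one PDF per story.
    for path in paths:
        subprocess.run(["htmldoc", "--webpage", "-f",
                        path.replace(".html", ".pdf"), path])
```

This sidesteps recursion entirely: the splitting happens once, and HTMLDOC never sees more than one story at a time.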
09-11-2007, 06:23 PM | #219 |
Junior Member
Posts: 4
Karma: 10
Join Date: Jul 2007
Device: Sony Reader
|
If someone can help me understand how I would pull content from the following website (using the "Web Page" tab of rss2book), it will go a long way toward my understanding not only of how this program works, but also of the regex expressions required to get at the content (and only the content) we are all using this program for:

http://www.timesonline.co.uk/tol/comment/columnists/jeremy_clarkson/

There are a number of links on the page that reference the various blog entries I want to pull, but when I change the rss2book "follow links" setting to depth 2 (or more) I get this error:

Processing clarkson
System.UriFormatException: Invalid URI: The URI scheme is not valid.
   at System.Uri.CreateThis(String uri, Boolean dontEscape, UriKind uriKind)
   at System.Uri..ctor(String uriString)
   at web2book.Utils.ExtractContent(String contentExtractor, String contentFormatter, String url, String html, String linkProcessor, Int32 depth, StringBuilder log)
   at web2book.Utils.GetContent(String link, String html, String linkProcessor, String contentExtractor, String contentFormatter, Int32 depth, StringBuilder log)
   at web2book.Utils.GetHtml(String url, Int32 numberOfDays, String linkProcessor, String contentExtractor, String contentFormatter, Int32 depth, StringBuilder log)
   at web2book.WebPage.GetHtml(ISource mySourceGroup, Int32 displayWidth, Int32 displayHeight, Int32 displayDepth, StringBuilder log)
   at web2book.MainForm.AddSource(ContentSourceList sourceClass, ContentSource source, Boolean isAutoUpdate)

If I leave it set at 1, I get:

Processing clarkson
Final content:
===================
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8" /><meta name="ROBOTS" content="NOARCHIVE" /><script type="text/javascript"> // Variables required for DART. MUST BE IN THE HEAD. var time = new Date(); randnum = (time.getTime()); </script><!-- Code to display title of the HTML page --><title> Jeremy Clarkson Columns & Comment | Times Online </title><meta name="Description" content="The UKs favourite motoring journalist comments on British society and culture in his weekly columns on Times Online"><link rel="shortcut icon" type="image/x-icon" href="/tol//img/favicon.ico" type="image/x-icon" /><link rel="stylesheet" type="text/css" href="/tol/css/alternate.css" title="Alternate Style Sheet" /><link rel="stylesheet" type="text/css" href="/tol/css/tol.css"/> <link rel="stylesheet" type="text/css" href="/tol/css/ie.css"/><link rel="stylesheet" type="text/css" href="/tol/css/typography.css"/><script language="javascript" type="text/javascript" src="/tol/js/tol.js"></script></head><body><div id="top"/><div id="shell"><div id="page"><!-- START REVENUE SCIENCE PIXELLING CODE --><script language="javascript" type="text/javascript" src="/tol/js/DM_client.js"></script><script language="javascript" type="text/javascript"> DM_addToLoc("Network",escape("Times")); DM_addToLoc("SiteName",escape("Times Online")); </script><script language="javascript" type="text/javascript"> // Index page for Revenue sciences"

...there's loads more; this is just part of the content.

The point is, I thought that changing the "Follow links to Depth" setting to 2 would grab not only the page referred to in the URL but also follow the links from that URL's page? I would then need to work out what regex would be needed to tidy up the resulting mass of content. (That would be problem / lesson 2, but one thing at a time!) Am I missing something?

(I realise there is an RSS feed page from which I can pull the current top 4 or 5 blog entries, and adinb has helped me clean this up to be readable; what I want to understand is how to manipulate web pages.)

(Thank you again to adinb, who has been helping me with this problem using the RSS feed and the "Feed" tab of rss2book, via PM. It's people like him that keep these types of forums useful. I thought it might be useful for others to understand how it all works, and to lighten the load on adinb!)

Thank you all in advance. |
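For what it's worth, the first step a depth-2 pass has to perform can be sketched outside the tool: collect candidate article links from the index page, then hand each linked page to the content extractor. This is an illustration of the idea only, not the tool's actual code, and the href pattern below is a guess rather than the site's real markup.

```python
import re

# Hypothetical href pattern for illustration only.
LINK_RE = re.compile(r'href="(/tol/comment/columnists/jeremy_clarkson/article[^"]+)"')

def extract_links(index_html, base="http://www.timesonline.co.uk"):
    """Return absolute URLs for every matching link on the index page."""
    return [base + href for href in LINK_RE.findall(index_html)]
```

If the tool fails before this step (as the UriFormatException suggests), no regex tuning on the content side will help; the link list itself has to be valid first.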
09-13-2007, 12:44 AM | #220 |
Books and more books
Posts: 917
Karma: 69499
Join Date: Mar 2006
Location: White Plains, NY, USA
Device: Nook Color, Itouch, Nokia770, Sony 650, Sony 700(dead), Ebk(given)
|
Error "Thu, 13 Sep 2007 06:31:55 EEST is out of range"
Hi,
I tried to use Rss2book to pull down some newspaper feeds. One worked nicely after I figured out a good regex to get just the text, but for the other, whatever I try, I get the following message repeated as many times as there are feed items, and of course with the appropriate time/date (I am at US Eastern +7 hrs, so when I tried at 11:31 pm US Eastern I got exactly the following; when I tried seven minutes later I got the message with 06:38):

Processing Evenimentul
Thu, 13 Sep 2007 06:31:55 EEST is out of range
Thu, 13 Sep 2007 06:31:55 EEST is out of range
....

Is there anything I can do about it? The feed is not in English, but the same was true for the other newspaper, which works just fine: http://www.evz.ro/rss.php/evz.xml |
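One plausible cause (an assumption, since the tool's parsing code isn't shown): RFC 822 date parsers typically recognize only a small set of zone names (GMT, EST, EDT, and so on), and "EEST" (Eastern European Summer Time, UTC+3) is usually not among them. Mapping the unknown abbreviation to its numeric offset before parsing is one possible workaround, sketched here:

```python
from email.utils import parsedate_tz, mktime_tz

# Map zone abbreviations the parser does not know to numeric offsets.
TZ_FIXES = {"EEST": "+0300", "EET": "+0200"}  # extend as needed

def parse_feed_date(s):
    """Parse an RFC 822 feed date, tolerating a few extra zone names."""
    for name, offset in TZ_FIXES.items():
        s = s.replace(name, offset)
    parts = parsedate_tz(s)
    if parts is None:
        raise ValueError("unparseable date: %r" % s)
    return mktime_tz(parts)  # seconds since the epoch (UTC)
```

The same trick could be applied as a regex replacement on the raw feed before it reaches the tool, if the tool offers a pre-processing hook.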
09-13-2007, 04:07 PM | #221 |
Junior Member
Posts: 7
Karma: 10
Join Date: Jun 2007
Device: Sony Reader
|
Okay, I must be losing my mind.
I've been able to extract The New Yorker with the following setup:

URL: http://feeds.newyorker.com/services/...everything.xml
Link Element: Link
Apply extractor to linked content: checked
Link Reformatter: {0}?printable=true
Content Extraction pattern: <!-- start article rail -->(.*) <!-- end article body -->

Then I changed computers, installed the latest .NET updates, downloaded Web2Book, and duplicated the settings, and now it's not working. I only get the article headings; it doesn't seem to be following the link. Any ideas? What's changed?

thanks,
Andy |
09-19-2007, 12:01 AM | #222 |
Junior Member
Posts: 6
Karma: 10
Join Date: Sep 2007
Location: Hesperia, CA
Device: Sony Reader PRS500 / iPhone 3GS / iPad 32GB
|
Converting webpages (located on my computer) to PDF
I just bought a Sony Reader last week. It is great.
Here is my problem: I have about 3,000 web pages on my local computer, each one a conversion of a single book. I have tried to convert them to PDF by opening them in Internet Explorer and using the local address as the URL in Web2book. Web2book gives me the following message:

System.UriFormatException: Invalid URI: A port was expected because of there is a colon (':') present but the port could not be parsed.
   at System.Uri.CreateThis(String uri, Boolean dontEscape, UriKind uriKind)
   at System.Uri..ctor(String uriString)
   at web2book.Utils.GetUrlResponse(String url, String& error, String postData, ICredentials creds, String contentType)
   at web2book.Utils.GetWebResponse(String url, String& error, String postData, ICredentials creds, String contentType)
   at web2book.Utils.GetContent(String link, String html, String linkProcessor, String contentExtractor, String contentFormatter, Int32 depth, StringBuilder log)
   at web2book.Utils.GetHtml(String url, Int32 numberOfDays, String linkProcessor, String contentExtractor, String contentFormatter, Int32 depth, StringBuilder log)
   at web2book.WebPage.GetHtml(ISource mySourceGroup, Int32 displayWidth, Int32 displayHeight, Int32 displayDepth, StringBuilder log)
   at web2book.MainForm.AddSource(ContentSourceList sourceClass, ContentSource source, Boolean isAutoUpdate)

I can get around this by posting each web page on my Geocities site, but that is a lot of extra work. Any idea how I can convert the local HTML files without doing all that? Thanks!! |
09-23-2007, 03:39 PM | #223 |
Junior Member
Posts: 9
Karma: 10
Join Date: Aug 2007
Device: Sony Reader
|
Help with regular expression
I'm trying to create a Web2Book feed for
http://www.spiegel.de/schlagzeilen/rss/0,5291,,00.xml

I would like to rewrite the links to point to the printable version, but the pattern to replace the link is somewhat complex. A link in the feed looks like this:

http://www.spiegel.de/politik/auslan...506744,00.html

The printable version looks like this:

http://www.spiegel.de/politik/auslan...506744,00.html

From what I can see by examining other links, the constants are:
- http://www.spiegel.de/ (obviously)
- one or more folder names
- the actual file name consists of three numbers separated by commas
- in the printable version, the string "druck-" is added before the third number
- the extension is .html

I'm not so good with regex; help would be appreciated. |
09-24-2007, 06:00 AM | #224 | |
RSS & Gadget Addict!
Posts: 82
Karma: 67
Join Date: May 2005
Location: Albuquerque, NM
Device: Sony PRS-500, iPod Touch, iPhone
|
Quote:
then in the link constructor you could use {1}druck-{2}

I'm all ears for a more efficient regex.

-adin |
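Since adinb's quoted regex did not survive the thread, here is one illustrative way the {1}/{2} groups could be built, as a standalone sketch. It assumes the links end in ",00.html" as the truncated examples suggest, and the full example URL below is constructed from the described pattern, not taken from the actual feed.

```python
import re

# Group 1: everything up to and including the comma before the article
# number; group 2: the article number plus the ",00.html" tail.
LINK_RE = re.compile(r"^(http://www\.spiegel\.de/.*,)(\d+,00\.html)$")

def to_printable(url):
    """Insert 'druck-' before the article number, per the post's rule."""
    return LINK_RE.sub(r"\1druck-\2", url)
```

The greedy ".*," is what pins group 1 to the last possible comma before the article number, so the folder names and leading numbers can vary freely.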
|
09-24-2007, 09:10 PM | #225 |
Junior Member
Posts: 9
Karma: 10
Join Date: Aug 2007
Device: Sony Reader
|
|
|
Similar Threads | ||||
Thread | Thread Starter | Forum | Replies | Last Post |
rss2book release 20 now available | geekraver | Sony Reader | 4 | 01-26-2007 02:36 PM |
rss2book release 19 | geekraver | Sony Reader | 2 | 12-30-2006 11:51 AM |
rss2book release 18 | geekraver | Sony Reader | 0 | 12-22-2006 04:57 AM |
rss2book release 16 | geekraver | Sony Reader | 1 | 12-13-2006 06:56 AM |
rss2book release 13 | geekraver | Sony Reader | 0 | 11-13-2006 03:41 AM |