recipe to pull web page similar to 'print/save as pdf'

JPD · 09-25-2010, 07:20 PM

Let me apologize right up front for my lack of savvy in html, calibre, or programming of any sort. I've been creating eBooks from instructional web sites for my Kindle DX. The web sites are typically set up like book chapters; from the TOC you select a 'chapter', and you click 'next/back' to navigate within each chapter. I go page by page, and do a file/print/save as pdf. Then I open them with Acrobat Pro to customize the metadata. Then I send them to Amazon for conversion, but as they often have scientific notation, figures, etc., they don't convert well, so I end up USB synching and dragging the pdf file from my Mac to the Kindle. Then repeat for every page...

I just discovered Calibre, and thought I'd found salvation. While all the articles are focused on news feeds, I thought it should be simple enough to create a custom 'news source', and use the URL of each web page I want as the 'feed'. Wrong. All I end up with are strings of html code. Am I trying to do something that can't be done, or is it not as simple as just entering a web page's URL into the 'feed' field? Any help would be appreciated.

Starson17 · 09-25-2010, 08:10 PM

Quote:

Originally Posted by JPD

Am I trying to do something that can't be done, or is it not as simple as just entering a web page's URL into the 'feed' field?

It's not as simple as just entering a web page's URL into the 'feed' field. A "feed" has a special format that gets parsed by the recipe. Your web page doesn't have that format. You wrote that you click on 'next/back' to get what you want. However, there are probably lots of links on each page. Links for ads, links to register, links to navigate home, etc. The recipe would need to follow only the links you want, and none of the others.

Can it be done? Probably, but you'd need to write a custom recipe to do what you need. You could look at some of the multipage recipes. You might also consider web scrapers like wget, web2disk, WinHTTrack, etc.
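To illustrate the "follow only the links you want" part, here is a minimal sketch in plain Python (standard library only, not an actual calibre recipe). It assumes the navigation link can be recognised by its visible text, e.g. "next"; the names `LinkTextCollector` and `wanted_links` and the example.com URL are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkTextCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def wanted_links(page_url, html, wanted_text=("next",)):
    """Return absolute URLs of the links whose text is in wanted_text.

    Assumption: the navigation links are identified by their label;
    ads, 'home', 'register' and so on are simply ignored.
    """
    collector = LinkTextCollector()
    collector.feed(html)
    wanted = {t.lower() for t in wanted_text}
    return [urljoin(page_url, href)
            for href, text in collector.links
            if text.lower() in wanted]


if __name__ == "__main__":
    sample = ('<a href="/ads">Buy now</a> '
              '<a href="page2.html">Next</a> '
              '<a href="index.html">Home</a>')
    print(wanted_links("http://example.com/book/page1.html", sample))
```

A real site may mark its navigation differently (an image button, a CSS class), so the matching rule would have to be adapted per site.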

JPD · 09-25-2010, 10:32 PM

Thanks for the tips Starson17, and for verifying that there's more to this than I suspected. I suppose I'll keep plugging along with pdfs as before while trying to 'go to school' on this subject, although even calibre's basic getting started tutorials are over my head right now.

JPD · 09-26-2010, 02:00 AM

I tried using Bookit to convert web pages to mobi, but ran into the same brick wall of an error message others noted. Then I tried Instapaper, as Calibre has a recipe for 'read later' web pages, but it didn't preserve any of the web page formatting or images. So then I tried just viewing the page source of the web page I wanted to convert, saved it as a file, added it as a book to my Calibre library, and did a mobi convert. It worked almost perfectly and preserved all the formatting, but the fatal flaw was that it just had boxes with '?' icons where the images should be. It was not pulling the embedded images, e.g. '<img src="redliq.gif" width="251" height="110" /></p><pre>': if you click on the gif link in the source it brings up the image, but it's not making it into the eBook.

If anyone has any suggestions it would be most welcome.

Starson17 · 09-26-2010, 08:42 AM

Quote:

Originally Posted by JPD

I tried using Bookit to convert web pages to mobi, but ran into the same brick wall of an error message others noted. Then I tried Instapaper, as Calibre has a recipe for 'read later' web pages, but it didn't preserve any of the web page formatting or images. So then I tried just viewing the page source of the web page I wanted to convert, saved it as a file, added it as a book to my Calibre library, and did a mobi convert. It worked almost perfectly and preserved all the formatting, but the fatal flaw was that it just had boxes with '?' icons where the images should be. It was not pulling the embedded images, e.g. '<img src="redliq.gif" width="251" height="110" /></p><pre>': if you click on the gif link in the source it brings up the image, but it's not making it into the eBook.

If anyone has any suggestions it would be most welcome.

Same suggestion as before. Tools like wget can extract a web site, following links you define and returning html. Alternatively, you can try using the recipe system. You may need to turn on recursion. It's hard to make any suggestions to deal with your issues, since you haven't provided any links or copies of sites that you're having problems with.
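As a rough picture of what "following links with recursion" means in tools like wget, here is a small standard-library Python sketch. It is not how wget or calibre implement it; `crawl`, `HrefCollector`, the `prefix` rule and the `max_depth` limit are all assumptions chosen for illustration. The page fetcher is passed in as a function so the example can be tried without touching the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl(start_url, fetch, prefix, max_depth=2):
    """Breadth-first crawl from start_url.

    fetch     -- any callable taking a URL and returning its HTML text
    prefix    -- only URLs starting with this string are followed
    max_depth -- how many link hops away from start_url to go
    Returns a dict {url: html} in the order pages were visited.
    """
    pages = {}
    queue = [(start_url, 0)]
    while queue:
        url, depth = queue.pop(0)
        url = urldefrag(url).url          # drop '#top'-style fragments
        if url in pages or not url.startswith(prefix):
            continue                      # already seen, or off-site
        pages[url] = fetch(url)
        if depth == max_depth:
            continue                      # do not follow links any deeper
        parser = HrefCollector()
        parser.feed(pages[url])
        for href in parser.hrefs:
            queue.append((urljoin(url, href), depth + 1))
    return pages


if __name__ == "__main__":
    # A tiny fake site standing in for a real one.
    site = {
        "http://example.com/book/index.html":
            '<a href="ch1.html#top">1</a> <a href="http://ads.example.net/x">ad</a>',
        "http://example.com/book/ch1.html":
            '<a href="ch2.html">next</a> <a href="index.html">toc</a>',
        "http://example.com/book/ch2.html": '<a href="ch3.html">next</a>',
        "http://example.com/book/ch3.html": "",
    }
    result = crawl("http://example.com/book/index.html",
                   site.__getitem__, "http://example.com/book/")
    print(list(result))
```

For a live site, `fetch` could be a small wrapper around `urllib.request.urlopen`; the prefix and depth limits are what keep the crawl from wandering off to ads or the rest of the site.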

JPD · 09-26-2010, 12:16 PM

Duh. I was saving the html file to my computer and using that file to convert to mobi, but of course all the paths to the images now pointed to where the file was on my computer, while the actual images were on the web site's server.

So I spent the night crawling through the website, going to 'page info/media' for every page, selecting every img, and saving the 2,000 collected .gifs to the same folder the html files were in. Now Calibre gave me a complete mobi with all the images, with one flaw - it plops one of the images at the top of every page. But I was more concerned with not having to repeat what I'd just done for every site, so after much searching found a wonderful FF web-scraper plug-in, iMacros ( https://addons.mozilla.org/en-US/firefox/addon/3863/ ), that will save all the web files, html and imgs. This is an enormous time saver, but I still get the unwanted image at the top of every converted mobi. Any ideas, short of learning to use Sigil and editing them as ePubs (which ain't gonna happen)?

Here's an example of one of the web pages I'm trying to convert:
http://www.chemguide.co.uk/analysis/...ation.html#top

In any zip file I convert to mobi in Calibre, there will be an image at position 1.0 of the eBook.
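The missing-image problem described above comes from relative `src` values such as `redliq.gif`, which only make sense next to the original page on the server. A minimal sketch, assuming plain relative `<img src>` attributes, of how each one can be resolved against the page's URL to get the address to download and the file name to save beside the html file. `ImgCollector`, `image_downloads` and the example.com address are hypothetical; the real page URL in this thread is truncated, so it is not used here.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags like <img ... />.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_downloads(page_url, html):
    """Return (absolute image URL, local file name) for each <img>."""
    parser = ImgCollector()
    parser.feed(html)
    plan = []
    for src in parser.sources:
        absolute = urljoin(page_url, src)   # resolve relative to the page
        filename = posixpath.basename(urlsplit(absolute).path)
        plan.append((absolute, filename))
    return plan


if __name__ == "__main__":
    snippet = '<p><img src="redliq.gif" width="251" height="110" /></p>'
    for url, name in image_downloads("http://example.com/analysis/page.html", snippet):
        print(url, "->", name)
```

Each pair could then be fetched with something like `urllib.request.urlretrieve(url, name)`. This is essentially what a browser's "save complete page" or a scraper plug-in does for you.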

Starson17 · 09-26-2010, 01:23 PM

Quote:

Originally Posted by JPD

duh, I was saving the html file to my computer, and using that file to convert to mobi, but of course all the paths to the images now pointed to where the file was on my computer, while the actual images were on the web site's server.
Here's an example of one of the web pages I'm trying to convert:
http://www.chemguide.co.uk/analysis/...ation.html#top

I opened that page in Firefox, told FF to save it to my desktop, dragged the saved html into Calibre (the images were in the matching folder that FF made when it saved) and opened it to see all images correct. The save from FF saved images locally, and they were correctly picked up by Calibre when it made the book. Where did you have trouble?

JPD · 09-26-2010, 03:05 PM

Thanks for the assistance. I tried it following your protocol, but still get one of the page images at position 1.0, above the 'Electromagnetic Radiation' page heading where the book should actually start. The image still shows correctly where it's supposed to as well. It's almost as if the image is being inserted at the beginning as a book cover, although to the right it shows the generic book image. I don't know what I'm doing differently than you. I'm using a PPC Mac w/ OS 10.5.8, FF 3.6.10, and Calibre 0.7.20.

JPD · 09-26-2010, 03:11 PM

I don't know if this is related, but when I quit calibre I get this error message:

ERROR: ERROR: Unhandled exception: <b>IOError</b>:[Errno 2] No such file or directory: '/var/folders/3g/3g++kTeeHJmwGtYBJz9CQk+++TI/-Tmp-/calibre_0.7.20_tmp_gFSqaR/ipc_result_1_7_q_9c8r.pickle'

Traceback (most recent call last):
  File "/Applications/calibre.app/Contents/Resources/Python/lib/python2.6/site.py", line 147, in main
    return run_entry_point()
  File "/Applications/calibre.app/Contents/Resources/Python/lib/python2.6/site.py", line 116, in run_entry_point
    return getattr(pmod, func)()
  File "site-packages/calibre/utils/ipc/worker.py", line 101, in main
IOError: [Errno 2] No such file or directory: '/var/folders/3g/3g++kTeeHJmwGtYBJz9CQk+++TI/-Tmp-/calibre_0.7.20_tmp_gFSqaR/ipc_result_1_7_q_9c8r.pickle'

Starson17 · 09-26-2010, 03:14 PM

Are you looking at the saved html, an epub or some other converted format? Perhaps you want to ask up in the main forum, as this isn't really a recipe issue and you may find more focused help there.

JPD · 09-26-2010, 03:36 PM

I save the FF page and drag the html file to calibre; at this point it's a zip, and I haven't opened anything yet. I then convert to mobi, and it's then that I view the converted file and there's an image at position 1.0 where the content should actually begin. I'm happy to take this to another forum, but before that I'd like to try and understand why your conversion of the same web page is rendering correctly, without this stray image, and mine is not.

Starson17 · 09-27-2010, 08:15 AM

Quote:

Originally Posted by JPD

I save the FF page and drag the html file to calibre; at this point it's a zip, and I haven't opened anything yet. I then convert to mobi, and it's then that I view the converted file and there's an image at position 1.0 where the content should actually begin. I'm happy to take this to another forum, but before that I'd like to try and understand why your conversion of the same web page is rendering correctly, without this stray image, and mine is not.

I don't convert to mobi, as I have nothing that uses mobi. Hold on ... I just converted to mobi, and I get an image at the top. It's a conversion issue, not a recipe issue. I wish I could help, but ...

kovidgoyal · 09-27-2010, 01:30 PM

MOBI doesn't support floating images, so calibre puts em where they appear in the source document markup

Starson17 · 09-27-2010, 01:43 PM

Quote:

Originally Posted by kovidgoyal

MOBI doesn't support floating images, so calibre puts em where they appear in the source document markup

And what that means is you could grab the image tag in the recipe and put it where you want. That may be a lot of effort for a single page. I'm not sure if that work would carry over to other pages you are interested in.
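To make "grab the image tag and put it where you want" concrete, here is a rough string-level sketch in standard-library Python. It assumes the stray image is an `<img>` that sits in the markup before the page's first `<h1>` heading (and was only floated elsewhere by the browser); that is a guess about the page, not something verified. The function name is made up, and in an actual recipe the equivalent step would presumably go in the recipe's `preprocess_html` hook rather than in a regex like this.

```python
import re

IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
HEADING_END = re.compile(r"</h1\s*>", re.IGNORECASE)


def move_first_image_after_heading(html):
    """Move the first <img> to just after the first </h1>.

    Only acts when that image appears before the heading in the
    source; otherwise the html is returned unchanged.
    """
    img = IMG_TAG.search(html)
    heading = HEADING_END.search(html)
    if not img or not heading or img.start() > heading.start():
        return html
    tag = img.group(0)
    # Cut the tag out, then re-insert it right after the heading.
    without_img = html[:img.start()] + html[img.end():]
    insert_at = heading.end() - len(tag)   # heading shifted left by the cut
    return without_img[:insert_at] + tag + without_img[insert_at:]


if __name__ == "__main__":
    page = ('<body><img src="a.gif" align="right">'
            '<h1>Title</h1><p>Text</p></body>')
    print(move_first_image_after_heading(page))
```

Whether the same rule works on other pages depends on how consistently the site lays out its markup, which is why this may not carry over from one site to the next.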

Perhaps you should just edit the epub before conversion?

JPD · 09-28-2010, 09:21 PM

I think editing the epub before conversion sounds like the best approach for this. Does that require learning Sigil, or is there a simpler, more basic editor for such minor edits that would be approachable to a newbie? And do I convert from zip to epub first, edit, then convert to mobi? If you can advise the tools and basic approach I need, I can take further questions to another forum. I appreciate all your help. Thanks.



