Recipes - Re-usable code - Page 2

achims · 11-03-2011, 05:34 PM

This builds on "Recipe to download an EPUB from feed" by Starson17.
You can use it to download all e-books offered by a news website, in every format you like (epub, pdf, mobi, ...).

To see how it works, first take a look at Starson17's post. His trick is needed to fool the recipe process into having an epub to work on.

Additionally, this recipe looks for links to other e-book formats, downloads them to a common temporary directory and then makes the system call "calibredb add -1 dir", so that all formats are added to the calibre db as one single logical book.

If there are several logical books to download, you'll need to create a directory and make a system call for each one (or drop the -1 option if there is only one format per book).

Note: I have tested this on Linux and it works fine. On other operating systems the system call may need tweaking.

Spoiler:

Code:

import re, zipfile, os
from calibre.ptempfile import PersistentTemporaryDirectory
from calibre.ptempfile import PersistentTemporaryFile
from calibre.web.feeds.news import BasicNewsRecipe

GET_MOBI=False
GET_PDF=True

class DownloadAllFormats(BasicNewsRecipe):

    def build_index(self):
        browser = self.get_browser()

        # find the links (Adjust to your needs!)
        epublink = browser.find_link(text_regex=re.compile('.*Download ePub.*'))
        mobilink = browser.find_link(text_regex=re.compile('.*Download Mobi.*'))
        pdflink = browser.find_link(text_regex=re.compile('.*Download PDF.*'))

        # Cheat calibre's recipe method, as in the post from Starson17
        self.report_progress(0,_('downloading epub'))
        response = browser.follow_link(epublink)
        dir = PersistentTemporaryDirectory()
        epub_file = PersistentTemporaryFile(suffix='.epub',dir=dir)
        epub_file.write(response.read())
        epub_file.close()
        zfile = zipfile.ZipFile(epub_file.name, 'r')
        self.report_progress(0.1,_('extracting epub'))
        zfile.extractall(self.output_dir)
        zfile.close()
        index = os.path.join(self.output_dir, 'content.opf')
        self.report_progress(0.2,_('epub downloaded and extracted'))


        #
        # Now, download the remaining files
        #
        if (GET_MOBI):
           self.report_progress(0.3,_('downloading mobi'))
           mobi_file = PersistentTemporaryFile(suffix='.mobi',dir=dir)
           browser.back()
           response = browser.follow_link(mobilink)
           mobi_file.write(response.read())
           mobi_file.close()

        if (GET_PDF):
           self.report_progress(0.4,_('downloading pdf'))
           pdf_file = PersistentTemporaryFile(suffix='.pdf',dir=dir)
           browser.back()
           response = browser.follow_link(pdflink)
           pdf_file.write(response.read())
           pdf_file.close()

        # Get all formats into Calibre's database as one single book entry
        self.report_progress(0.6,_('Adding files to Calibre db'))
        cmd = "calibredb add -1 " + dir
        os.system(cmd)

        return index
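
For the case of several logical books mentioned above, a minimal sketch might look like the following. It assumes the formats of each book have already been saved into their own directory. The helper name add_books and its run parameter are hypothetical and not part of calibre; only the "calibredb add -1 dir" call comes from the recipe.

```python
import subprocess

def add_books(book_dirs, run=subprocess.call):
    """Run `calibredb add -1 <dir>` once per directory.

    book_dirs: directories that each hold all formats of ONE logical book.
    run: callable that executes a command list (hypothetical hook, so the
         sketch can be tried without calibre installed).
    Returns the list of commands that were issued.
    """
    issued = []
    for book_dir in book_dirs:
        # Passing a list avoids shell quoting problems with spaces in paths
        cmd = ['calibredb', 'add', '-1', book_dir]
        run(cmd)
        issued.append(cmd)
    return issued
```

Passing the command as a list rather than a concatenated string should also make the call less dependent on the shell of the operating system.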

Starson17 · 11-09-2011, 10:01 AM

This is not my code, but there have been many requests for code to handle sites where each article is split across multiple pages, with a button at the bottom of each page that leads to the next one. The append_page code from Darko Miletic's builtin recipe for Adventure Gamers is typical of what is used in this situation.

You may want to look at the source for an article at Adventure Gamers with Firebug or equivalent. The append_page code identifies each "next page" button, follows the link it points to ("nexturl"), finds the article text on that next page, inserts that text into the first page beneath the article text found there, and recursively repeats that process until the last page (identified by not having the "next page" button) is found.

The append_page code is then used in preprocess_html.
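
The page-following idea can be sketched without calibre or BeautifulSoup. In this sketch the PAGES dictionary stands in for the web site and fetch stands in for index_to_soup; both are hypothetical and only illustrate the recursion, not the actual recipe.

```python
# Hypothetical stand-in for a site: URL -> (article text, next URL or None)
PAGES = {
    '/article?page=1': ('First part.', '/article?page=2'),
    '/article?page=2': ('Second part.', '/article?page=3'),
    '/article?page=3': ('Last part.', None),
}

def fetch(url):
    # Stand-in for self.index_to_soup(url): returns the text and the "next page" link
    return PAGES[url]

def collect_article(url):
    """Return the text of all pages, following "next page" links recursively."""
    text, nexturl = fetch(url)
    if nexturl is None:
        # No "next page" button: this is the last page
        return [text]
    return [text] + collect_article(nexturl)

print(' '.join(collect_article('/article?page=1')))
```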

Spoiler:

nickredding · 11-21-2011, 08:56 PM

Kindle Fire treats masthead logos differently than its e-ink cousins, and they end up not looking as good as on e-ink readers. The Fire automatically scales the logos and color-inverts them (so black becomes white, red becomes turquoise, etc.). The logo is displayed on an almost-black background (it's actually a slight gradient).

The Fire also displays the publication front page on the Newsstand bookshelf, so this encouraged me to go looking for a source of these front page images instead of looking at the default calibre image.

The following code fragments can be inserted into your custom recipe to invoke a custom masthead logo and a front page image (if it's available).

Spoiler:

Note that when you develop a masthead logo, plan for it to be color-inverted (so if you want the original color, provide the color-inverted version as the logo). The background should be R/G/B 211/211/211 and (after being inverted) it will blend with the Fire background to appear transparent. If you are really picky you can make the background nearly perfect by using a linear gradient (top to bottom) of 211/211/211 to 214/214/214. The size of the logo isn't all that important since the Fire will scale it, but logos at least 250 pixels wide will look better than smaller ones since upscaling doesn't work as well as reduction.
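
To work out which colors to put into the logo file, a small helper can do the arithmetic. This is only a sketch and it assumes the Fire applies a plain per-channel inversion (255 minus the value), which the post does not state explicitly; invert_rgb is a hypothetical name.

```python
def invert_rgb(color):
    """Return the color-inverted version of an (R, G, B) tuple with 0-255 channels."""
    return tuple(255 - channel for channel in color)

# Background recommended in the post, and what it becomes after inversion
print(invert_rgb((211, 211, 211)))  # (44, 44, 44)
# Pure red turns into a cyan/turquoise
print(invert_rgb((255, 0, 0)))      # (0, 255, 255)
```

Since inversion is its own reverse, the same function gives the color to store in the logo for any color you want to see on screen.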

I have attached 4 Fire-friendly logos in a ZIP file (NY Times, Wall Street Journal, Globe and Mail, National Post).

kiavash · 01-10-2012, 01:21 AM

Quote:

Originally Posted by kiklop74

Let us assume that you have a feed with links that all point to redirected pages. By default Calibre does not handle this case so the safest way of doing this could be summarized like this:

Code:

    def print_version(self, url):
        return self.browser.open_novisit(url).geturl()

Of course, a similar thing can be done with urllib2, but using the internal browser automatically adds support for sites that require login.

Actually, you can get the print page at the same time, too. Just modify the code to something like this:

Code:

    def print_version(self, url):
        return self.browser.open_novisit(url).geturl().replace('/article.asp?HH_ID=', '/Print.asp?Id=')

Of course, modify the replace part for your page.
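
The rewrite step on its own can be tried outside calibre. The sketch below uses a hypothetical example.com URL in place of the address returned by geturl(); to_print_url is an illustrative name, and the two path fragments are the ones from the post.

```python
def to_print_url(final_url):
    # Site-specific rewrite taken from the post; adjust both strings for your site
    return final_url.replace('/article.asp?HH_ID=', '/Print.asp?Id=')

# Hypothetical URL, as it might look after the redirect has been resolved
print(to_print_url('http://example.com/article.asp?HH_ID=123'))
```

A URL that does not contain the article fragment is returned unchanged, so pages without a print version are still fetched normally.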

kiavash · 01-19-2012, 02:35 PM

Some sites require login information to be submitted twice. Below is an example that worked with MWJournal. It submits the credentials first, saves the resulting page to the system temp location, then opens it again and submits. In this case the second page had no form fields to fill in, so it is simply submitted. Other sites may need more information filled in; in that case follow the normal procedure to fill in and submit the form.

Spoiler:

kiavash · 02-10-2012, 01:02 PM

Some sites don't include the figures/images in articles; instead the reader needs to click on an href link to see the image/figure. This isn't possible on many ebook readers. To embed the images into the output ebook, the tag type needs to be changed from <a> to <img>, and the "href" property needs to be changed to "src". The following code does the job by looking for all the links to jpg files and changing them to <img> tags. The code should be included in preprocess_html.

Spoiler:

kiklop74 · 06-13-2012, 06:55 PM

How to search for a specific part of tag attribute:

Code:

dict(attrs={'someattribute':re.compile('(^| )somestring($| )', re.DOTALL)})

For example, to remove all tags that have class Sample (along with other classes) this will do the work:

Code:

remove_tags = [
dict(attrs={'class':re.compile('(^| )Sample($| )', re.DOTALL)})
]

kovidgoyal · 06-13-2012, 11:38 PM

@kiklop74: An easier way would be:

Code:

remove_tags = [
dict(attrs={'class':lambda x: x and 'Sample' in x.split()}),
]
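
The difference between matching a whole class name and matching a substring can be checked in plain Python. This sketch assumes the regular expression is meant to be '(^| )Sample($| )', with single alternation bars; has_sample_class and SAMPLE_RE are hypothetical names used only here.

```python
import re

def has_sample_class(value):
    # Same test as the lambda: the attribute exists and contains the token 'Sample'
    return bool(value and 'Sample' in value.split())

SAMPLE_RE = re.compile('(^| )Sample($| )')

for value in ['Sample', 'big Sample red', 'SampleBox', None]:
    by_split = has_sample_class(value)
    by_regex = bool(value and SAMPLE_RE.search(value))
    print(repr(value), by_split, by_regex)
```

Both approaches accept 'Sample' and 'big Sample red' and reject 'SampleBox' and a missing attribute. The split version also copes with tabs or repeated spaces between class names.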

kiklop74 · 12-09-2012, 01:23 PM

Sites can be badly implemented or overloaded, so that the first fetch of an article fails but the second or third succeeds. To add retry functionality to a calibre recipe you can use this approach:

Code:

# In the imports section add this
from calibre.ptempfile import PersistentTemporaryFile

# later, in the recipe class, add this
class MyRecipeclass(BasicNewsRecipe):
# ...
    temp_files              = []
    articles_are_obfuscated = True

# and then somewhere in the class add this method

    def get_obfuscated_article(self, url):
        attempts = 4
        html = None
        for count in range(attempts):
            try:
                response = self.browser.open(url)
                html = response.read()
                break  # success, stop retrying
            except Exception:
                print("Retrying download...")

        if html is None:
            # every attempt failed, so there is nothing to write to the temp file
            raise Exception('Could not download ' + url)

        tfile = PersistentTemporaryFile('_fa.html')
        tfile.write(html)
        tfile.close()
        self.temp_files.append(tfile)

        return tfile.name

Change the value of the attempts variable to change the number of download attempts. This works just fine; the approach was used in the Financial Times UK recipe.
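
The retry loop can also be written as a small standalone helper, which makes it easy to try without a network connection. This is a sketch, not calibre code: fetch_with_retries and the fetch argument are hypothetical, and in a recipe fetch would be something like a function that calls self.browser.open(url).read(). It assumes attempts is at least 1.

```python
def fetch_with_retries(fetch, url, attempts=4):
    """Call fetch(url) up to `attempts` times and return the first result."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as err:
            # Remember the failure and try again
            last_error = err
            print('Attempt %d of %d failed' % (attempt + 1, attempts))
    # Every attempt failed: re-raise the last error instead of returning nothing
    raise last_error

# Hypothetical fetcher that fails twice before succeeding
calls = {'n': 0}
def flaky(url):
    calls['n'] += 1
    if calls['n'] < 3:
        raise IOError('server busy')
    return b'<html>ok</html>'

print(fetch_with_retries(flaky, 'http://example.com/article'))
```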

kiklop74 · 02-12-2013, 07:16 AM

If you would like to add series support for some of your recipes this is what needs to be done:

Code:

    def get_cover_url(self):
        soup = self.index_to_soup('someurl')
        #determine somehow the series number of the publication
        # and store it in seriesnr variable
        self.conversion_options.update({'series':'My series name'})
        self.conversion_options.update({'series_index':seriesnr})
        # code for cover url if any
        return None

It is useful for magazines or newspapers where you can easily track the issue number of the publication.

All this applies mostly to EPUB; the rest of the formats, AFAIK, do not offer a way to store this metadata.

koliberek · 06-25-2013, 06:14 AM

Can I collect clips, translate them into Polish, put in ebook and publish on the Polish forum? The goal is to make them available to users, who don't speak English. I would like to have permission to publish it.

TIA

kiklop74 · 06-25-2013, 09:52 AM

I doubt it would be a problem. Kovid is the owner of this forum so it is his call in the end.

kovidgoyal · 06-25-2013, 02:23 PM

Feel free to do so, I have no objections.

koliberek · 06-27-2013, 02:23 AM

Thanks a lot

sup · 12-16-2013, 04:04 PM

Quote:

Originally Posted by Pahan

Here is a recipe template that keeps track of already downloaded feed items and only downloads items that it hasn't seen before or whose description, content, or URL have changed. It does so by overriding the parse_feeds method.
Some caveats:

I recommend setting max_articles_per_feed and oldest_article to very high values. The first time, the recipe will download every item in every feed, but after that, it will "remember" not to do it again and will grab all new articles no matter how much time had elapsed since the last time it had been run and how many entries had been added. In particular, if you set max_articles_per_feed to a small value and the feed is one that lists all articles in a particular order, you might never see new articles.
The list of items downloaded for each feed will be stored in "Calibre configuration directory/recipes/recipe_storage/Recipe title/Feed title". This is probably suboptimal, and there ought to be a persistent storage API for recipes, but it's the best I could come up with.
The list of items downloaded is written to disk before the items are actually downloaded. Thus, if an item fails to download for some reason, the recipe won't know, and will not try to download it again. This could probably be fixed by writing the new item lists to temporary files and overriding some method later in the sequence to "commit" by overwriting the downloaded item lists with the new lists. (Thus, if the recipe fails before that, it will never get to that point, so the old lists will remain intact and will redownload next time the recipe is run.)
If there are no new items to download and remove_empty_feeds is set to True, the recipe will return an empty list of feeds, which will cause Calibre to raise an error. As far as I can tell, there is nothing that the recipe can do about that without a lot more coding.
I've tried to make this code portable, but I've only tested it on Linux systems, so let me know if it doesn't work on the other platforms. I am particularly unsure about newline handling.

Spoiler:

This is a simple version of the above method that does not keep track of changes and assumes that what was once put online never changes (which is not true in general, but holds for some feeds). Also, it uses the parse_index method instead of parse_feeds, as it assumes you are scraping a website. The same caveats apply, except for the first one. This recipe only keeps the last twenty articles for any given section; if you need more, change the limit.

Code:

from calibre.constants import config_dir, CONFIG_DIR_MODE
import os

def parse_index(self):
    # Read already downloaded articles
    recipe_dir = os.path.join(config_dir,'recipes')
    # Make sure the storage directory exists before reading or writing the list
    if not os.path.exists(recipe_dir):
        os.makedirs(recipe_dir, mode=CONFIG_DIR_MODE)
    old_articles = os.path.join(recipe_dir,self.title.replace('/',':'))
    past_items = []
    if os.path.exists(old_articles):
        with open(old_articles) as f:
            for h in f:
                l = h.strip().split(" ")
                past_items.append((l[0]," ".join(l[1:])))
    old_urls = [x[0] for x in past_items]
    count_items = {}
    current_items = []
    # Keep a list of only 20 latest articles for each section
    past_items.reverse()
    for item in past_items:
        if item[1] in count_items:
            if count_items[item[1]] < 20:
                count_items[item[1]] += 1
                current_items.append(item)
        else:
            count_items[item[1]] = 1
            current_items.append(item)
    current_items.reverse()
    # do stuff to get 'list_of_articles' containing dictionaries in the form like this {'title':title,'url':url}
    # and to get variable 'feed_name'; see the following link for details:
    # http://manual.calibre-ebook.com/news_recipe.html#calibre.web.feeds.news.BasicNewsRecipe.parse_index
    ans = []
    for article in list_of_articles:
        if article['url'] not in old_urls:
            current_items.append((article['url'],feed_name))
    ans.append((feed_name,list_of_articles))
    # Write already downloaded articles
    with open(old_articles,'w') as f:
        f.write('\n'.join('{} {}'.format(*x) for x in current_items))
    return ans
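
The trimming step (keeping only the newest twenty entries per section) can be tested on its own. The following is a sketch of that one step as a standalone function; keep_latest is a hypothetical name, and the (url, section) pairs are assumed to be ordered oldest first, as in the file the recipe writes.

```python
def keep_latest(items, limit=20):
    """Keep only the last `limit` (url, section) pairs for each section.

    items is ordered oldest first, as in the file written by the recipe.
    """
    counts = {}
    kept = []
    # Walk from newest to oldest so the newest entries win
    for url, section in reversed(items):
        if counts.get(section, 0) < limit:
            counts[section] = counts.get(section, 0) + 1
            kept.append((url, section))
    kept.reverse()  # restore oldest-first order
    return kept

history = [('u%d' % i, 'News') for i in range(25)] + [('s1', 'Sport')]
trimmed = keep_latest(history)
print(len(trimmed))  # 21
print(trimmed[0])    # ('u5', 'News')
```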

Starson17 · 11-09-2011, 10:01 AM · #17 · Multiple Page Sites

Spoiler:

Code:

    INDEX = u'http://www.adventuregamers.com'

    def append_page(self, soup, appendtag, position):
        pager = soup.find('div',attrs={'class':'toolbar_fat_next'})
        if pager:
           nexturl = self.INDEX + pager.a['href']
           soup2 = self.index_to_soup(nexturl)
           texttag = soup2.find('div', attrs={'class':'bodytext'})
           newpos = len(texttag.contents)
           self.append_page(soup2,texttag,newpos)
           texttag.extract()
           appendtag.insert(position,texttag)

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body, 3)
        pager = soup.find('div',attrs={'class':'toolbar_fat'})
        if pager:
           pager.extract()
        return self.adeify_images(soup)

kiavash · 01-19-2012, 02:35 PM · #20

Spoiler:

Code:

    def get_browser(self):
        ...
        raw = br.submit().read()  # submit the form and read the 2nd login page
        # save it to an htm temp file
        with TemporaryFile(suffix='.htm') as fname:
            with open(fname, 'wb') as f:
                f.write(raw)
            br.open_local_file(fname)
            br.select_form(nr=0)  # finds submit on the 2nd form
            didwelogin = br.submit().read()  # submit it and read the return html
        ...
        return br

kiavash · 02-10-2012, 01:02 PM · #21 · Embed images into an ebook

Spoiler:

Code:

    def preprocess_html(self, soup):
        # Includes all the figures inside the final ebook
        # Finds all the jpg links
        for figure in soup.findAll('a', attrs = {'href' : lambda x: x and 'jpg' in x}):
            # makes sure that the link points to the absolute web address
            if figure['href'].startswith('/'):
                figure['href'] = self.site + figure['href']

            figure.name = 'img'  # converts the links to img
            figure['src'] = figure['href']  # with the same address as href
            figure['style'] = 'display:block'  # puts the image on its own line
            del figure['href']
            del figure['target']
        return soup

