Old 09-18-2010, 10:18 PM   #2761
TonytheBookworm
Addict
TonytheBookworm is on a distinguished road
 
 
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Quote:
Originally Posted by marbs View Post
I am having trouble with my recipe:
Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1283848012(BasicNewsRecipe):
    description   = 'TheMarker'
    cover_url      = 'http://static.ispot.co.il/wp-content/upload/2009/09/themarker.jpg'
    title          = u'The Marker1'
    language       = 'he'
    simultaneous_downloads = 1
    delay                  = 6   
    remove_javascript     = True
    timefmt        = '[%a, %d %b, %Y]'
    oldest_article = 1
    remove_tags = [dict(name='tr', attrs={'bgcolor':['#738A94']})          ]
    max_articles_per_feed = 1000
    extra_css='body{direction: rtl;} .article_description{direction: rtl; } a.article{direction: rtl; } .calibre_feed_description{direction: rtl; }'
    feeds          = [(u'Head Lines', u'http://www.themarker.com/tmc/content/xml/rss/hpfeed.xml'), (u'TA Market', u'http://www.themarker.com/tmc/content/xml/rss/sections/marketfeed.xml'), (u'Real Estate', u'http://www.themarker.com/tmc/content/xml/rss/sections/realEstaterfeed.xml'), (u'Wall Street & Global', u'http://www.themarker.com/tmc/content/xml/rss/sections/wallsfeed.xml'), (u'Law', u'http://www.themarker.com/tmc/content/xml/rss/sections/lawfeed.xml'), (u'Media', u'http://www.themarker.com/tmc/content/xml/rss/sections/mediafeed.xml'), (u'Consumer', u'http://www.themarker.com/tmc/content/xml/rss/sections/consumerfeed.xml'), (u'Career', u'http://www.themarker.com/tmc/content/xml/rss/sections/careerfeed.xml'), (u'Car', u'http://www.themarker.com/tmc/content/xml/rss/sections/carfeed.xml'), (u'High Tech', u'http://www.themarker.com/tmc/content/xml/rss/sections/hightechfeed.xml'), (u'Investor Guide', u'http://www.themarker.com/tmc/content/xml/rss/sections/investorGuidefeed.xml')]
    def print_version(self, url):
        baseURL = url.replace('tmc/article.jhtml?ElementId=', 'ibo/misc/printFriendly.jhtml?ElementId=%2Fibo%2Frepositories%2Fstories%2Fm1_2000%2F')
        return baseURL + '.xml'





If not, please tell me where to look.
And thank you Starson for the help so far. I think this message was posted in an orderly fashion.
I don't understand the 'he' (Hebrew) language, but I think this will work for you.
What I did was split the returned URL and then appended to it, sort of like you were doing. I put some print statements in there so you can see what is actually being used as the final print_url when you run
ebook-convert yourecipenamehere.recipe output_dir --test -vv >myrecipe.txt
When you run that, you can see the print statements in myrecipe.txt.

Use this for your print_version code:
Spoiler:

Code:
def print_version(self, url):
        split1 = url.split("=")
        print 'THE SPLIT IS: ', split1 
       
        print_url = 'http://www.themarker.com/ibo/misc/printFriendly.jhtml?ElementId=%2Fibo%2Frepositories%2Fstories%2Fm1_2000%2F' + split1[1]+'.xml'
        print 'THIS URL WILL PRINT: ', print_url # this is a test string to see what the url is it will return
        return print_url
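As a concrete example of what that split produces, here is a rough standalone check using the sample ElementId that appears later in this thread; it is not part of the recipe itself:
Code:
url = 'http://www.themarker.com/tmc/article.jhtml?ElementId=zz20100918_6121'
split1 = url.split("=")
# split1 -> ['http://www.themarker.com/tmc/article.jhtml?ElementId', 'zz20100918_6121']
print_url = ('http://www.themarker.com/ibo/misc/printFriendly.jhtml?ElementId='
             '%2Fibo%2Frepositories%2Fstories%2Fm1_2000%2F' + split1[1] + '.xml')
print print_url
# -> http://www.themarker.com/ibo/misc/printFriendly.jhtml?ElementId=%2Fibo%2Frepositories%2Fstories%2Fm1_2000%2Fzz20100918_6121.xml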
TonytheBookworm is offline  
Old 09-18-2010, 10:31 PM   #2762
bhandarisaurabh
Enthusiast
bhandarisaurabh began at the beginning.
 
Posts: 49
Karma: 10
Join Date: Aug 2009
Device: none
Quote:
Originally Posted by TonytheBookworm View Post
This should work for the CURRENT ARTICLE MONTH/YEAR.
It has a form where you select a different year, but I'm not sure what the actual URLs are that it uses for that, so I just stuck with the current month/year, since I figured that is what you would want anyway. If you look, even though September 2010 is selected on the page, the article content still says August 18 or whatever; that is the same date that is on the original page.

Anyway, the only thing I didn't understand how to do was get the description to drop the text that is inside the <a>. Once that was done, I posted this update.


Updated code to build the description (descr) correctly
Spoiler:

Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, re
class IW(BasicNewsRecipe):
    title      = 'Industry Week'
    __author__ = 'Tonythebookworm'
    description = ''
    language = 'en'
    no_stylesheets      = True
    publisher           = 'Tonythebookworm'
    category            = 'Manufacturing'
    use_embedded_content= False
    oldest_article      = 40
    remove_javascript   = True
    remove_empty_feeds  = True
    
    max_articles_per_feed = 200 # only gets the first 200 articles
    INDEX = 'http://www.industryweek.com'
    
    
    
    remove_tags = [dict(name='div', attrs={'class':['crumbNav']}),
                   dict(name='i')]
    
    def parse_index(self):
        feeds = []
        for title, url in [
                            (u"Current Month", u"http://www.industryweek.com/Archive.aspx"),
                             ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds
        
    def make_links(self, url):
        current_articles = []
        soup = self.index_to_soup(url)

        for item in soup.findAll('a', attrs={'class':'article'}):
            link = item['href']
            if link:
                art_url = self.INDEX + link
                title   = self.tag_to_string(item)
                descr   = item.parent
                item.extract()   # remove the <a> from its parent so only the summary text is left
                descr   = self.tag_to_string(descr)
                current_articles.append({'title': title, 'url': art_url, 'description': descr, 'date': ''})
        return current_articles
      

   
    def print_version(self, url):
        split1 = url.split("=")
        print_url = 'http://www.industryweek.com/PrintArticle.aspx?ArticleID=' + split1[1]
        
        return print_url
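The trick for dropping the text inside the <a> from the description is the item.extract() call in make_links above: pulling the <a> out of its parent before converting the parent to text leaves only the summary. A standalone illustration with BeautifulSoup (the markup below is made up, just shaped like what the recipe expects):
Code:
from calibre.ebooks.BeautifulSoup import BeautifulSoup

html = '<div><a class="article" href="/x">Article title</a> A short summary of the article.</div>'
soup = BeautifulSoup(html)
item = soup.find('a', attrs={'class': 'article'})
parent = item.parent
item.extract()                  # remove the <a> (and its title text) from the tree
print parent.renderContents()   # ->  A short summary of the article.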
Hey, thanks for the recipe. Maybe they release the magazine content online one month before it is available in print. Anyway, thanks. You are a genius!
bhandarisaurabh is offline  
Old 09-18-2010, 10:33 PM   #2763
TonytheBookworm
Addict
TonytheBookworm is on a distinguished road
 
 
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Quote:
Originally Posted by bhandarisaurabh View Post
Hey, thanks for the recipe. Maybe they release the magazine content online one month before it is available in print. Anyway, thanks. You are a genius!
Far from a genius, but thanks for the compliment.
TonytheBookworm is offline  
Old 09-18-2010, 11:14 PM   #2764
TonytheBookworm
Addict
TonytheBookworm is on a distinguished road
 
 
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Working Copy of Popular Science

Finally got Popular Science to work like I want. It goes 7 days back, and I also have it remove any 'Gallery:' entries for image slide shows and so forth.
Attached Files
File Type: rar popscience.rar (1.2 KB, 263 views)
TonytheBookworm is offline  
Old 09-19-2010, 06:26 AM   #2765
AgiZ
Junior Member
AgiZ began at the beginning.
 
Posts: 4
Karma: 10
Join Date: Aug 2010
Device: Nook
Can I request a recipe, or even have the site added into a release, please?
The site is http://slo-tech.com/ and it is the best Slovenian tech news site.
Pleeeeease
AgiZ is offline  
Old 09-19-2010, 08:08 AM   #2766
marbs
Zealot
marbs began at the beginning.
 
Posts: 122
Karma: 10
Join Date: Jul 2010
Device: nook
Quote:
Originally Posted by TonytheBookworm View Post

Use this for your print_version code:
Spoiler:

Code:
def print_version(self, url):
        split1 = url.split("=")
        print 'THE SPLIT IS: ', split1 
       
        print_url = 'http://www.themarker.com/ibo/misc/printFriendly.jhtml?ElementId=%2Fibo%2Frepositories%2Fstories%2Fm1_2000%2F' + split1[1]+'.xml'
        print 'THIS URL WILL PRINT: ', print_url # this is a test string to see what the url is it will return
        return print_url
I ran it a few times and all the articles seem to be downloading with ebook-convert, but when I try it in calibre, I get some empty articles. What should I do?

Also, I just saw that a small number of articles have a different format. The web address ("it.themarker.com" rather than "themarker.com"), the way to get the print version, and the page format are all different. Is there any way to do an "if" or something like that, to deal with the two kinds of articles in different ways?

Thanks again, Tony. You are a life saver.

Last edited by marbs; 09-19-2010 at 09:44 AM.
marbs is offline  
Old 09-19-2010, 03:58 PM   #2767
TonytheBookworm
Addict
TonytheBookworm is on a distinguished road
 
 
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Bad post... the code was totally wrong.

Last edited by TonytheBookworm; 09-19-2010 at 05:09 PM. Reason: sorry about that.
TonytheBookworm is offline  
Old 09-19-2010, 04:18 PM   #2768
marbs
Zealot
marbs began at the beginning.
 
Posts: 122
Karma: 10
Join Date: Jul 2010
Device: nook
I need to go over your code slowly; I am not sure I understand it at all. Can I use it as is? I would love an explanation when you have the time.

BTW, the IT address is "http://it.themarker.com/tmit/article/XXXXX"
and the print version is "http://it.themarker.com/tmit/PrintArticle/XXXXX".

How would you do the clean-up for the different pages (or should I just leave it)?

Thanks again for all your help. I really do appreciate it.

Last edited by marbs; 09-19-2010 at 04:20 PM.
marbs is offline  
Old 09-19-2010, 04:32 PM   #2769
TonytheBookworm
Addict
TonytheBookworm is on a distinguished road
 
 
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Quote:
Originally Posted by marbs View Post
I need to go over your code slowly; I am not sure I understand it at all. Can I use it as is? I would love an explanation when you have the time.

BTW, the IT address is "http://it.themarker.com/tmit/article/XXXXX"
and the print version is "http://it.themarker.com/tmit/PrintArticle/XXXXX".

How would you do the clean-up for the different pages (or should I just leave it)?

Thanks again for all your help. I really do appreciate it.
That's what I get for posting code without testing it... Anyway, this might do the trick. (I can't seem to get it to find an it.themarker.com link, so you're going to have to be my eyes in the field on this one.) What happens is this: you have, for instance, cars.themarker.com, and when it goes to that link it converts it to themarker.com in the cases I have seen. If you know a specific URL that I can test, please let me know, because as far as I can see, things like law.themarker, cars.themarker and careers.themarker all revert to www.themarker.com/xxxxxxxxx and so on.

Here is what I have come up with thus far. Sorry about the previous code.
Spoiler:

Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, re

class AdvancedUserRecipe1283848012(BasicNewsRecipe):
    description   = 'TheMarker'
    cover_url      = 'http://static.ispot.co.il/wp-content/upload/2009/09/themarker.jpg'
    title          = u'The Marker1'
    language       = 'he'
    simultaneous_downloads = 5
    #delay                  = 6   
    remove_javascript     = True
    timefmt        = '[%a, %d %b, %Y]'
    oldest_article = 2
    #remove_tags = [dict(name='tr', attrs={'bgcolor':['#738A94']})          ]
    max_articles_per_feed = 10
    #extra_css='body{direction: rtl;} .article_description{direction: rtl; } a.article{direction: rtl; } .calibre_feed_description{direction: rtl; }'
    feeds          = [(u'Head Lines', u'http://www.themarker.com/tmc/content/xml/rss/hpfeed.xml'), 
                      (u'TA Market', u'http://www.themarker.com/tmc/content/xml/rss/sections/marketfeed.xml'),
                      (u'Real Estate', u'http://www.themarker.com/tmc/content/xml/rss/sections/realEstaterfeed.xml'),
                      (u'Wall Street & Global', u'http://www.themarker.com/tmc/content/xml/rss/sections/wallsfeed.xml'), 
                      (u'Law', u'http://www.themarker.com/tmc/content/xml/rss/sections/lawfeed.xml'), 
                      (u'Media', u'http://www.themarker.com/tmc/content/xml/rss/sections/mediafeed.xml'), 
                      (u'Consumer', u'http://www.themarker.com/tmc/content/xml/rss/sections/consumerfeed.xml'), 
                      (u'Career', u'http://www.themarker.com/tmc/content/xml/rss/sections/careerfeed.xml'), 
                      (u'Car', u'http://www.themarker.com/tmc/content/xml/rss/sections/carfeed.xml'), 
                      (u'High Tech', u'http://www.themarker.com/tmc/content/xml/rss/sections/hightechfeed.xml'), 
                      (u'Investor Guide', u'http://www.themarker.com/tmc/content/xml/rss/sections/investorGuidefeed.xml')]
    # old approach, kept for reference:
    # def print_version(self, url):
    #     baseURL = url.replace('tmc/article.jhtml?ElementId=', 'ibo/misc/printFriendly.jhtml?ElementId=%2Fibo%2Frepositories%2Fstories%2Fm1_2000%2F')
    #     return baseURL + '.xml'
    #
    # example article URL: http://www.themarker.com/tmc/article.jhtml?ElementId=zz20100918_6121
    # example print URL:   http://www.themarker.com/ibo/misc/printFriendly.jhtml?ElementId=%2Fibo%2Frepositories%2Fstories%2Fm1_2000%2Fzz20100918_6121.xml
    def print_version(self, url):
        print 'ORG URL IS: ', url
        # it.themarker.com articles use a different print-version scheme,
        # so branch on the host found in the article URL
        if re.search(r'it\.themarker\.com', url, re.IGNORECASE):
            split2 = url.split("article/")
            print 'FOUND IT: ', url
            print_url = 'http://it.themarker.com/tmit/PrintArticle/' + split2[1]
        else:
            split1 = url.split("=")
            print_url = 'http://www.themarker.com/ibo/misc/printFriendly.jhtml?ElementId=%2Fibo%2Frepositories%2Fstories%2Fm1_2000%2F' + split1[1] + '.xml'
        print 'THIS URL WILL PRINT: ', print_url # test output so you can see the final url
        return print_url
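If it helps to see what the two branches should produce, here is a standalone check you can run with plain Python; it repeats the same branching as print_version above, using the example ElementId from the comments and the placeholder XXXXX id from marbs' post:
Code:
import re

def expected_print_url(url):
    if re.search(r'it\.themarker\.com', url, re.IGNORECASE):
        return 'http://it.themarker.com/tmit/PrintArticle/' + url.split('article/')[1]
    return ('http://www.themarker.com/ibo/misc/printFriendly.jhtml?ElementId='
            '%2Fibo%2Frepositories%2Fstories%2Fm1_2000%2F' + url.split('=')[1] + '.xml')

print expected_print_url('http://www.themarker.com/tmc/article.jhtml?ElementId=zz20100918_6121')
print expected_print_url('http://it.themarker.com/tmit/article/XXXXX')  # XXXXX is just a placeholder id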

Last edited by TonytheBookworm; 09-19-2010 at 07:07 PM. Reason: modified code to find it.themarker.com error was in regex
TonytheBookworm is offline  
Old 09-19-2010, 05:17 PM   #2770
marbs
Zealot
marbs began at the beginning.
 
Posts: 122
Karma: 10
Join Date: Jul 2010
Device: nook
Gmail news

I was reading stuff online. Look what I found:
http://lifehacker.com/157701/get-rss...r-gmail-labels
It is way, way, WAY out of my capabilities.
Could anyone create a news feed for Gmail? One that requires a username and password?
The feed address is https://mail.google.com/mail/feed/atom/label/
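A minimal sketch of how a recipe can log in for a password-protected feed like this one, using calibre's needs_subscription option and the browser's HTTP auth support. The recipe title and the 'news' label are placeholders, and this is untested against Gmail itself, so treat it as a starting point only:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class GmailLabelFeed(BasicNewsRecipe):
    title                 = u'Gmail label feed'   # placeholder title
    oldest_article        = 7
    max_articles_per_feed = 100
    use_embedded_content  = True    # the atom feed carries the message snippets
    needs_subscription    = True    # calibre will prompt for username and password

    # 'news' is a hypothetical label name; replace it with your own Gmail label
    feeds = [(u'Gmail: news', u'https://mail.google.com/mail/feed/atom/label/news')]

    def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
        if self.username is not None and self.password is not None:
            # answer the HTTP basic auth challenge from mail.google.com
            br.add_password('https://mail.google.com/mail/feed/atom', self.username, self.password)
        return br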
marbs is offline  
Old 09-19-2010, 06:34 PM   #2771
TonytheBookworm
Addict
TonytheBookworm is on a distinguished road
 
 
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Quote:
Originally Posted by marbs View Post
i need to go over your code slowly. i am not sure i understand it at all. can i use it as is? i would love an explanation when you have the time.

BTW, the IT address is "http://it.themarker.com/tmit/article/XXXXX"
and the print version is "http://it.themarker.com/tmit/PrintArticle/XXXXX"

how would you do the clean up for the different pages (or should i just leave it?)

thanks again for all your help. i really do appreciate it.
Look at the updated code I posted. Test it on your end and see if it works for you. I changed the regular expression and it finds the link correctly on my end: it finds it.themarker.com and changes it, and anything else it leaves in the themarker.com/********* form.
Here is the code:
Spoiler:

Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, re

class AdvancedUserRecipe1283848012(BasicNewsRecipe):
    description   = 'TheMarker'
    cover_url      = 'http://static.ispot.co.il/wp-content/upload/2009/09/themarker.jpg'
    title          = u'The Marker1'
    language       = 'he'
    simultaneous_downloads = 5
    #delay                  = 6   
    remove_javascript     = True
    timefmt        = '[%a, %d %b, %Y]'
    oldest_article = 2
    #remove_tags = [dict(name='tr', attrs={'bgcolor':['#738A94']})          ]
    max_articles_per_feed = 10
    #extra_css='body{direction: rtl;} .article_description{direction: rtl; } a.article{direction: rtl; } .calibre_feed_description{direction: rtl; }'
    feeds          = [(u'Head Lines', u'http://www.themarker.com/tmc/content/xml/rss/hpfeed.xml'), 
                      (u'TA Market', u'http://www.themarker.com/tmc/content/xml/rss/sections/marketfeed.xml'),
                      (u'Real Estate', u'http://www.themarker.com/tmc/content/xml/rss/sections/realEstaterfeed.xml'),
                      (u'Wall Street & Global', u'http://www.themarker.com/tmc/content/xml/rss/sections/wallsfeed.xml'), 
                      (u'Law', u'http://www.themarker.com/tmc/content/xml/rss/sections/lawfeed.xml'), 
                      (u'Media', u'http://www.themarker.com/tmc/content/xml/rss/sections/mediafeed.xml'), 
                      (u'Consumer', u'http://www.themarker.com/tmc/content/xml/rss/sections/consumerfeed.xml'), 
                      (u'Career', u'http://www.themarker.com/tmc/content/xml/rss/sections/careerfeed.xml'), 
                      (u'Car', u'http://www.themarker.com/tmc/content/xml/rss/sections/carfeed.xml'), 
                      (u'High Tech', u'http://www.themarker.com/tmc/content/xml/rss/sections/hightechfeed.xml'), 
                      (u'Investor Guide', u'http://www.themarker.com/tmc/content/xml/rss/sections/investorGuidefeed.xml')]
    # old approach, kept for reference:
    # def print_version(self, url):
    #     baseURL = url.replace('tmc/article.jhtml?ElementId=', 'ibo/misc/printFriendly.jhtml?ElementId=%2Fibo%2Frepositories%2Fstories%2Fm1_2000%2F')
    #     return baseURL + '.xml'
    #
    # example article URL: http://www.themarker.com/tmc/article.jhtml?ElementId=zz20100918_6121
    # example print URL:   http://www.themarker.com/ibo/misc/printFriendly.jhtml?ElementId=%2Fibo%2Frepositories%2Fstories%2Fm1_2000%2Fzz20100918_6121.xml
    def print_version(self, url):
        print 'ORG URL IS: ', url
        # it.themarker.com articles use a different print-version scheme,
        # so branch on the host found in the article URL
        if re.search(r'it\.themarker\.com', url, re.IGNORECASE):
            split2 = url.split("article/")
            print 'FOUND IT: ', url
            print_url = 'http://it.themarker.com/tmit/PrintArticle/' + split2[1]
        else:
            split1 = url.split("=")
            print_url = 'http://www.themarker.com/ibo/misc/printFriendly.jhtml?ElementId=%2Fibo%2Frepositories%2Fstories%2Fm1_2000%2F' + split1[1] + '.xml'
        print 'THIS URL WILL PRINT: ', print_url # test output so you can see the final url
        return print_url

Last edited by TonytheBookworm; 09-19-2010 at 06:50 PM.
TonytheBookworm is offline  
Old 09-19-2010, 06:36 PM   #2772
noxxx
Junior Member
noxxx began at the beginning.
 
Posts: 1
Karma: 10
Join Date: Sep 2010
Device: Kindle 3
Tagesanzeiger

I updated the feed for the Tagesanzeiger (Swiss newspaper).

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1284927619(BasicNewsRecipe):
    title = u'Tagesanzeiger'
    publisher = u'Tamedia AG'
    oldest_article = 2
    max_articles_per_feed = 100
    description = 'tagesanzeiger.ch: Nichts verpassen'
    category = 'News, Politik, Nachrichten, Schweiz, Zürich'
    language = 'de'
    conversion_options = {
                             'comments'  : description
                            ,'tags'      : category
                            ,'language'  : language
                            ,'publisher' : publisher
                         }
    
    remove_tags = [
         dict(name='img')
        ,dict(name='div', attrs={'class':['swissquote ad','boxNews','centerAD','contentTabs2','sbsLabel']})
        ,dict(name='div', attrs={'id':['colRightAd','singleRight','singleSmallRight','MailInfo','metaLine','sidebarSky','contentFooter','commentInfo','commentInfo2','commentInfo3','footerBottom','clear','boxExclusiv','singleLogo','navSearch','headerLogin','headerBottomRight','horizontalNavigation','subnavigation','googleAdSense','footerAd','contentbox','articleGalleryNav']})
        ,dict(name='form', attrs={'id':['articleMailForm','commentform']})
        ,dict(name='div', attrs={'style':['position:absolute']})
        ,dict(name='script', attrs={'type':['text/javascript']})
        ,dict(name='p', attrs={'class':['schreiben','smallPrint','charCounter','caption']})
    ]

    feeds = [
         (u'Front', u'http://www.tagesanzeiger.ch/rss.html')
        ,(u'Zürich', u'http://www.tagesanzeiger.ch/zuerich/rss.html')
        ,(u'Schweiz', u'http://www.tagesanzeiger.ch/schweiz/rss.html')
        ,(u'Ausland', u'http://www.tagesanzeiger.ch/ausland/rss.html')
        ,(u'Digital', u'http://www.tagesanzeiger.ch/digital/rss.html')
        ,(u'Wissen', u'http://www.tagesanzeiger.ch/wissen/rss.html')
        ,(u'Panorama', u'http://www.tagesanzeiger.ch/panorama/rss.html')
        ,(u'Wirtschaft', u'http://www.tagesanzeiger.ch/wirtschaft/rss.html')
        ,(u'Sport', u'http://www.tagesanzeiger.ch/sport/rss.html')
        ,(u'Kultur', u'http://www.tagesanzeiger.ch/kultur/rss.html')
        ,(u'Leben', u'http://www.tagesanzeiger.ch/leben/rss.html')
        ,(u'Auto', u'http://www.tagesanzeiger.ch/auto/rss.html')
    ]

    def print_version(self, url):
        return url + '/print.html'
Any suggestions are welcome...

Cheers noxxx
noxxx is offline  
Old 09-19-2010, 10:51 PM   #2773
bhandarisaurabh
Enthusiast
bhandarisaurabh began at the beginning.
 
Posts: 49
Karma: 10
Join Date: Aug 2009
Device: none

Quote:
Originally Posted by TonytheBookworm View Post
Far from a genius, but thanks for the compliment.
Can you help me with the recipe for Business Standard? The website has been updated.
I want help with the print section of the recipe.
If the article URL is http://www.business-standard.com/ind...ky-way/406220/
and the print URL is
http://www.business-standard.com/ind...ono=406220&tp=
then how would we define the print section?
bhandarisaurabh is offline  
Old 09-19-2010, 11:34 PM   #2774
TonytheBookworm
Addict
TonytheBookworm is on a distinguished road
 
 
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Quote:
Originally Posted by bhandarisaurabh View Post
Can you help me with the recipe for Business Standard? The website has been updated.
I want help with the print section of the recipe.
If the article URL is http://www.business-standard.com/ind...ky-way/406220/
and the print URL is
http://www.business-standard.com/ind...ono=406220&tp=
then how would we define the print section?
Something like this would be my first guess; if it doesn't work I'll have to test it later.
Spoiler:

Code:
def print_version(self, url):
        print 'ORG URL IS: ', url
        split1 = url.split("/")
        print 'THE SPLIT IS: ', split1
        # the article id is the last non-empty path segment; the URL ends with a
        # trailing slash, so the final element of the split is an empty string
        autono = split1[-2]

        print_url = 'http://www.business-standard.com/india/printpage.php?autono=' + autono + '&tp='
        return print_url
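A quick check of what this should return, using a made-up full article URL (the path in the post above is truncated, so the slug here is hypothetical; only the trailing id 406220 is taken from the question):
Code:
url = 'http://www.business-standard.com/india/news/some-article-slug/406220/'  # hypothetical slug
split1 = url.split("/")
print split1[-2]
# -> '406220'
# so print_version(url) would return:
# http://www.business-standard.com/india/printpage.php?autono=406220&tp=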

Last edited by TonytheBookworm; 09-20-2010 at 02:24 AM. Reason: typo
TonytheBookworm is offline  
Old 09-20-2010, 08:27 AM   #2775
greenapple
Evangelist
greenapple will become famous soon enough
 
Posts: 404
Karma: 664
Join Date: Dec 2009
Device: Kindle Paperwhite, Kindle DX, Kobo Aura HD
Hi, I'd like to request a reddit.com feed recipe.
It doesn't seem to download when I use a basic configuration.
Thanks!
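For anyone who wants to try it, a minimal sketch of a basic recipe for this request; it assumes reddit's front-page RSS feed is at http://www.reddit.com/.rss, which should be verified, and it does no cleanup of the linked pages:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class RedditFrontPage(BasicNewsRecipe):
    title                 = u'reddit.com'
    description           = 'reddit front page'
    language              = 'en'
    oldest_article        = 2
    max_articles_per_feed = 50
    no_stylesheets        = True
    use_embedded_content  = False

    # assumed feed URL; subreddits expose similar feeds like /r/<name>/.rss
    feeds = [(u'Front Page', u'http://www.reddit.com/.rss')]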
greenapple is offline  