Old 08-25-2012, 02:13 AM   #1
rainrdx
Connoisseur
Posts: 55
Karma: 13316
Join Date: Jul 2012
Device: iPad
The Chronicle of Higher Education

The magazine for those who lurk in a corner of the evil academia XD

Code:
import re
from calibre.web.feeds.recipes import BasicNewsRecipe
from collections import OrderedDict

class Chronicle(BasicNewsRecipe):

    title       = 'The Chronicle of Higher Education'
    __author__  = 'Rick Shang'

    description = 'Weekly news and job-information source for college and university faculty members, administrators, and students.'
    language = 'en'
    category = 'news'
    encoding = 'UTF-8'
    keep_only_tags = [
        dict(name='div', attrs={'class':'article'}),
    ]
    remove_tags = [dict(name='div', attrs={'class':'related module1'})]
    no_javascript = True
    no_stylesheets = True

    needs_subscription = True

    def get_browser(self):
        # Pass the recipe instance explicitly; current calibre expects it
        br = BasicNewsRecipe.get_browser(self)
        if self.username is not None and self.password is not None:
            br.open('http://chronicle.com/myaccount/login')
            br.select_form(nr=1)
            br['username'] = self.username
            br['password'] = self.password
            br.submit()
        return br

    def parse_index(self):
        # Go to the issue
        soup0 = self.index_to_soup('http://chronicle.com/section/Archives/39/')
        issue = soup0.find('ul', attrs={'class':'feature-promo-list'}).li
        issueurl = "http://chronicle.com" + issue.a['href']

        # Find date (the text after the last ': ' in the issue link)
        dates = self.tag_to_string(issue.a).split(': ')[-1]
        self.timefmt = u' [%s]' % dates

        # Go to the main body
        soup = self.index_to_soup(issueurl)
        div0 = soup.find('div', attrs={'id':'article-body'})

        # Map section title -> list of articles, keeping page order
        feeds = OrderedDict()
        for div in div0.findAll('div', attrs={'class':'module1'}):
            section_title = self.tag_to_string(div.find('h3'))
            for post in div.findAll('li', attrs={'class':'sub-promo'}):
                articles = []
                a = post.find('a', href=True)
                title = self.tag_to_string(a)
                url = "http://chronicle.com" + a['href'].strip()
                author = ""
                desc = self.tag_to_string(post.find('p'))
                articles.append({'title':title, 'url':url, 'description':desc, 'date':''})

                if articles:
                    if section_title not in feeds:
                        feeds[section_title] = []
                    feeds[section_title] += articles
        # items() works on both Python 2 and Python 3
        ans = [(key, val) for key, val in feeds.items()]
        return ans

    def preprocess_html(self, soup):
        # Process all the images: swap each Tableau placeholder for its <noscript> link
        for div in soup.findAll('div', attrs={'class':'tableauPlaceholder'}):
            noscripts = div.find('noscript').a
            div.replaceWith(noscripts)
        for div0 in soup.findAll('div', text='Powered by Tableau'):
            div0.extract()
        return soup
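
For anyone reading along, here is a minimal sketch of the data shape that parse_index builds above: a list of (section title, list of article dicts) tuples, with sections kept in the order they first appear. It runs without calibre. The helper names `article` and `group_by_section` and the sample rows are made up for illustration and are not part of the recipe.

```python
from collections import OrderedDict


def article(title, path):
    """Build one article dict with the same keys the recipe fills in.

    Hypothetical helper, for illustration only.
    """
    return {
        'title': title,
        'url': 'http://chronicle.com' + path,
        'description': '',
        'date': '',
    }


def group_by_section(rows):
    """Group (section_title, article) pairs into [(section_title, [articles])].

    Sections keep the order in which they are first seen.
    """
    feeds = OrderedDict()
    for section_title, item in rows:
        feeds.setdefault(section_title, []).append(item)
    return list(feeds.items())


# Made-up sample rows, not taken from a real issue.
rows = [
    ('News', article('First', '/article/first')),
    ('Commentary', article('Second', '/article/second')),
    ('News', article('Third', '/article/third')),
]

index = group_by_section(rows)
for section, articles in index:
    print(section, [a['title'] for a in articles])
```

The recipe does the same thing with an explicit "if section_title not in feeds" check; setdefault is just a shorter way to write it.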
Old 08-25-2012, 05:48 PM   #2
rainrdx
Updated with a (terrible) cover (but the best I can get)

Code:
import re
from calibre.web.feeds.recipes import BasicNewsRecipe
from collections import OrderedDict

class Chronicle(BasicNewsRecipe):

    title       = 'The Chronicle of Higher Education'
    __author__  = 'Rick Shang'

    description = 'Weekly news and job-information source for college and university faculty members, administrators, and students.'
    language = 'en'
    category = 'news'
    encoding = 'UTF-8'
    keep_only_tags = [
        dict(name='div', attrs={'class':'article'}),
    ]
    remove_tags = [dict(name='div', attrs={'class':'related module1'})]
    no_javascript = True
    no_stylesheets = True

    needs_subscription = True

    def get_browser(self):
        # Pass the recipe instance explicitly; current calibre expects it
        br = BasicNewsRecipe.get_browser(self)
        if self.username is not None and self.password is not None:
            br.open('http://chronicle.com/myaccount/login')
            br.select_form(nr=1)
            br['username'] = self.username
            br['password'] = self.password
            br.submit()
        return br

    def parse_index(self):
        # Go to the issue
        soup0 = self.index_to_soup('http://chronicle.com/section/Archives/39/')
        issue = soup0.find('ul', attrs={'class':'feature-promo-list'}).li
        issueurl = "http://chronicle.com" + issue.a['href']

        # Find date (the text after the last ': ' in the issue link)
        dates = self.tag_to_string(issue.a).split(': ')[-1]
        self.timefmt = u' [%s]' % dates

        # Find cover: the image in the div that follows the first 'promo' div
        cover = soup0.find('div', attrs={'class':'promo'}).findNext('div')
        self.cover_url = "http://chronicle.com" + cover.find('img')['src']

        # Go to the main body
        soup = self.index_to_soup(issueurl)
        div0 = soup.find('div', attrs={'id':'article-body'})

        # Map section title -> list of articles, keeping page order
        feeds = OrderedDict()
        for div in div0.findAll('div', attrs={'class':'module1'}):
            section_title = self.tag_to_string(div.find('h3'))
            for post in div.findAll('li', attrs={'class':'sub-promo'}):
                articles = []
                a = post.find('a', href=True)
                title = self.tag_to_string(a)
                url = "http://chronicle.com" + a['href'].strip()
                author = ""
                desc = self.tag_to_string(post.find('p'))
                articles.append({'title':title, 'url':url, 'description':desc, 'date':''})

                if articles:
                    if section_title not in feeds:
                        feeds[section_title] = []
                    feeds[section_title] += articles
        # items() works on both Python 2 and Python 3
        ans = [(key, val) for key, val in feeds.items()]
        return ans

    def preprocess_html(self, soup):
        # Process all the images: swap each Tableau placeholder for its <noscript> link
        for div in soup.findAll('div', attrs={'class':'tableauPlaceholder'}):
            noscripts = div.find('noscript').a
            div.replaceWith(noscripts)
        for div0 in soup.findAll('div', text='Powered by Tableau'):
            div0.extract()
        return soup
Old 09-01-2012, 09:12 PM   #3
rainrdx
Update:
Bug Fix
Code:
import re
from calibre.web.feeds.recipes import BasicNewsRecipe
from collections import OrderedDict

class Chronicle(BasicNewsRecipe):

    title       = 'The Chronicle of Higher Education'
    __author__  = 'Rick Shang'

    description = 'Weekly news and job-information source for college and university faculty members, administrators, and students.'
    language = 'en'
    category = 'news'
    encoding = 'UTF-8'
    keep_only_tags = [
        dict(name='div', attrs={'class':'article'}),
    ]
    remove_tags = [
        dict(name='div', attrs={'class':['related module1', 'maintitle']}),
        dict(name='div', attrs={'id':['section-nav', 'icon-row']}),
    ]
    no_javascript = True
    no_stylesheets = True

    needs_subscription = True

    def get_browser(self):
        # Pass the recipe instance explicitly; current calibre expects it
        br = BasicNewsRecipe.get_browser(self)
        if self.username is not None and self.password is not None:
            br.open('http://chronicle.com/myaccount/login')
            br.select_form(nr=1)
            br['username'] = self.username
            br['password'] = self.password
            br.submit()
        return br

    def parse_index(self):
        # Go to the issue
        soup0 = self.index_to_soup('http://chronicle.com/section/Archives/39/')
        issue = soup0.find('ul', attrs={'class':'feature-promo-list'}).li
        issueurl = "http://chronicle.com" + issue.a['href']

        # Find date (the text after the last ': ' in the issue link)
        dates = self.tag_to_string(issue.a).split(': ')[-1]
        self.timefmt = u' [%s]' % dates

        # Find cover: the image in the div that follows the first 'promo' div
        cover = soup0.find('div', attrs={'class':'promo'}).findNext('div')
        self.cover_url = "http://chronicle.com" + cover.find('img')['src']

        # Go to the main body
        soup = self.index_to_soup(issueurl)
        div = soup.find('div', attrs={'id':'article-body'})

        # Map section title -> list of articles, keeping page order
        feeds = OrderedDict()
        section_title = ''
        for post in div.findAll('li'):
            articles = []
            a = post.find('a', href=True)
            if a is not None:
                title = self.tag_to_string(a)
                url = "http://chronicle.com" + a['href'].strip()
                author = ""
                # The section is the preceding <h3>, or an <h4> if there is no <h3>
                sectiontitle = post.findPrevious('h3')
                if sectiontitle is None:
                    sectiontitle = post.findPrevious('h4')
                section_title = self.tag_to_string(sectiontitle)
                desc = self.tag_to_string(post.find('p'))
                articles.append({'title':title, 'url':url, 'description':desc, 'date':''})

                if articles:
                    if section_title not in feeds:
                        feeds[section_title] = []
                    feeds[section_title] += articles
        # items() works on both Python 2 and Python 3
        ans = [(key, val) for key, val in feeds.items()]
        return ans

    def preprocess_html(self, soup):
        # Process all the images: swap each Tableau placeholder for its <noscript> link
        for div in soup.findAll('div', attrs={'class':'tableauPlaceholder'}):
            noscripts = div.find('noscript').a
            div.replaceWith(noscripts)
        for div0 in soup.findAll('div', text='Powered by Tableau'):
            div0.extract()
        return soup
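
To illustrate the idea behind this fix (walk every list item and file it under the heading that comes before it), here is a rough standalone sketch using only the standard library's html.parser, so it runs without calibre. The class name `SectionTracker` and the sample HTML are invented for the example. One difference to be aware of: this sketch uses the nearest preceding h3 or h4, whereas the recipe above prefers any earlier h3 and only falls back to h4 when no h3 exists.

```python
from html.parser import HTMLParser


class SectionTracker(HTMLParser):
    """Pair the first link of each <li> with the nearest preceding heading.

    Hypothetical helper for illustration; not part of the recipe.
    """

    HEADINGS = ('h3', 'h4')

    def __init__(self):
        super().__init__()
        self.section = ''        # text of the most recent heading
        self.pairs = []          # (section, href) in document order
        self._in_heading = False
        self._want_link = False  # True while inside an <li> with no link yet

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._in_heading = True
            self.section = ''
        elif tag == 'li':
            self._want_link = True
        elif tag == 'a' and self._want_link:
            href = dict(attrs).get('href')
            if href:
                self.pairs.append((self.section.strip(), href.strip()))
                self._want_link = False

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False
        elif tag == 'li':
            self._want_link = False

    def handle_data(self, data):
        # Only heading text is collected; everything else is ignored.
        if self._in_heading:
            self.section += data


# Invented sample markup, loosely shaped like an issue page.
SAMPLE = """
<div id="article-body">
  <h3>News</h3>
  <ul>
    <li><a href="/article/one">One</a><p>First story</p></li>
    <li>No link here</li>
  </ul>
  <h4>Commentary</h4>
  <ul><li><a href="/article/two">Two</a></li></ul>
</div>
"""

tracker = SectionTracker()
tracker.feed(SAMPLE)
pairs = tracker.pairs
print(pairs)
```

List items without a link are skipped, which mirrors the "if a is not None" check in the recipe.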
Old 09-24-2012, 06:15 PM   #4
rainrdx
Updated with the fixed cover and aesthetic modifications.

Code:
import re
from calibre.web.feeds.recipes import BasicNewsRecipe
from collections import OrderedDict

class Chronicle(BasicNewsRecipe):

    title       = 'The Chronicle of Higher Education'
    __author__  = 'Rick Shang'

    description = 'Weekly news and job-information source for college and university faculty members, administrators, and students.'
    language = 'en'
    category = 'news'
    encoding = 'UTF-8'
    keep_only_tags = [
        dict(name='div', attrs={'class':'article'}),
    ]
    remove_tags = [
        dict(name='div', attrs={'class':['related module1', 'maintitle']}),
        dict(name='div', attrs={'id':['section-nav', 'icon-row', 'enlarge-popup']}),
        dict(name='a', attrs={'class':'show-enlarge enlarge'}),
    ]
    no_javascript = True
    no_stylesheets = True

    needs_subscription = True

    def get_browser(self):
        # Pass the recipe instance explicitly; current calibre expects it
        br = BasicNewsRecipe.get_browser(self)
        if self.username is not None and self.password is not None:
            br.open('http://chronicle.com/myaccount/login')
            br.select_form(nr=1)
            br['username'] = self.username
            br['password'] = self.password
            br.submit()
        return br

    def parse_index(self):
        # Go to the issue
        soup0 = self.index_to_soup('http://chronicle.com/section/Archives/39/')
        issue = soup0.find('ul', attrs={'class':'feature-promo-list'}).li
        issueurl = "http://chronicle.com" + issue.a['href']

        # Find date (the text after the last ': ' in the issue link)
        dates = self.tag_to_string(issue.a).split(': ')[-1]
        self.timefmt = u' [%s]' % dates

        # Find cover: first tag in the sidebar whose src points at the current issue
        cover = soup0.find('div', attrs={'class':'side-content'}).find(attrs={'src':re.compile("photos/biz/Current")})
        if cover is not None:
            # The src may already be absolute; only prefix the host when it is not
            if "chronicle.com" in cover['src']:
                self.cover_url = cover['src']
            else:
                self.cover_url = "http://chronicle.com" + cover['src']

        # Go to the main body
        soup = self.index_to_soup(issueurl)
        div = soup.find('div', attrs={'id':'article-body'})

        # Map section title -> list of articles, keeping page order
        feeds = OrderedDict()
        section_title = ''
        for post in div.findAll('li'):
            articles = []
            a = post.find('a', href=True)
            if a is not None:
                title = self.tag_to_string(a)
                url = "http://chronicle.com" + a['href'].strip()
                author = ""
                # The section is the preceding <h3>, or an <h4> if there is no <h3>
                sectiontitle = post.findPrevious('h3')
                if sectiontitle is None:
                    sectiontitle = post.findPrevious('h4')
                section_title = self.tag_to_string(sectiontitle)
                desc = self.tag_to_string(post.find('p'))
                articles.append({'title':title, 'url':url, 'description':desc, 'date':''})

                if articles:
                    if section_title not in feeds:
                        feeds[section_title] = []
                    feeds[section_title] += articles
        # items() works on both Python 2 and Python 3
        ans = [(key, val) for key, val in feeds.items()]
        return ans

    def preprocess_html(self, soup):
        # Process all the images: swap each Tableau placeholder for its <noscript> link
        for div in soup.findAll('div', attrs={'class':'tableauPlaceholder'}):
            noscripts = div.find('noscript').a
            div.replaceWith(noscripts)
        for div0 in soup.findAll('div', text='Powered by Tableau'):
            div0.extract()
        return soup
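
The cover fix above has to cope with image src values that are sometimes absolute and sometimes site-relative. As a possible alternative to the substring check, here is a small sketch using urljoin from the Python 3 standard library, which resolves both cases against a base URL. The function name `absolute_cover_url` and the image paths are made up for the example; this is not what the recipe itself does.

```python
from urllib.parse import urljoin

BASE_URL = 'http://chronicle.com'


def absolute_cover_url(src, base=BASE_URL):
    """Return an absolute URL for a cover image src.

    Absolute URLs pass through unchanged; site-relative and
    protocol-relative ones are resolved against the base.
    Hypothetical helper, for illustration only.
    """
    return urljoin(base, src.strip())


# Made-up example paths, not real cover locations.
relative = absolute_cover_url('/img/photos/biz/Current/cover.jpg')
absolute = absolute_cover_url('http://chronicle.com/img/photos/biz/Current/cover.jpg')
protocol_relative = absolute_cover_url('//chronicle.com/img/photos/biz/Current/cover.jpg')

print(relative)
print(absolute)
print(protocol_relative)
```

All three calls end up at the same address, so the recipe would not need to special-case each form.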

Last edited by rainrdx; 09-24-2012 at 06:25 PM.
Old 01-14-2013, 08:48 PM   #5
rainrdx
Update to accommodate different article page layouts

Code:
import re
from calibre.web.feeds.recipes import BasicNewsRecipe
from collections import OrderedDict

class Chronicle(BasicNewsRecipe):

    title       = 'The Chronicle of Higher Education'
    __author__  = 'Rick Shang'

    description = 'Weekly news and job-information source for college and university faculty members, administrators, and students.'
    language = 'en'
    category = 'news'
    encoding = 'UTF-8'
    keep_only_tags = [
        dict(name='div', attrs={'class':['article', 'blog-mod']}),
    ]
    remove_tags = [
        dict(name='div', attrs={'class':['related module1', 'maintitle', 'entry-utility', 'object-meta']}),
        dict(name='div', attrs={'id':['section-nav', 'icon-row', 'enlarge-popup', 'confirm-popup']}),
        dict(name='a', attrs={'class':'show-enlarge enlarge'}),
    ]
    no_javascript = True
    no_stylesheets = True

    needs_subscription = True

    def get_browser(self):
        # Pass the recipe instance explicitly; current calibre expects it
        br = BasicNewsRecipe.get_browser(self)
        if self.username is not None and self.password is not None:
            br.open('http://chronicle.com/myaccount/login')
            br.select_form(nr=1)
            br['username'] = self.username
            br['password'] = self.password
            br.submit()
        return br

    def parse_index(self):
        # Go to the issue
        soup0 = self.index_to_soup('http://chronicle.com/section/Archives/39/')
        issue = soup0.find('ul', attrs={'class':'feature-promo-list'}).li
        issueurl = "http://chronicle.com" + issue.a['href']

        # Find date (the text after the last ': ' in the issue link)
        dates = self.tag_to_string(issue.a).split(': ')[-1]
        self.timefmt = u' [%s]' % dates

        # Find cover: first tag in the sidebar whose src points at the current issue
        cover = soup0.find('div', attrs={'class':'side-content'}).find(attrs={'src':re.compile("photos/biz/Current")})
        if cover is not None:
            # The src may already be absolute; only prefix the host when it is not
            if "chronicle.com" in cover['src']:
                self.cover_url = cover['src']
            else:
                self.cover_url = "http://chronicle.com" + cover['src']

        # Go to the main body
        soup = self.index_to_soup(issueurl)
        div = soup.find('div', attrs={'id':'article-body'})

        # Map section title -> list of articles, keeping page order
        feeds = OrderedDict()
        section_title = ''
        for post in div.findAll('li'):
            articles = []
            a = post.find('a', href=True)
            if a is not None:
                title = self.tag_to_string(a)
                url = "http://chronicle.com" + a['href'].strip()
                author = ""
                # The section is the preceding <h3>, or an <h4> if there is no <h3>
                sectiontitle = post.findPrevious('h3')
                if sectiontitle is None:
                    sectiontitle = post.findPrevious('h4')
                section_title = self.tag_to_string(sectiontitle)
                desc = self.tag_to_string(post.find('p'))
                articles.append({'title':title, 'url':url, 'description':desc, 'date':''})

                if articles:
                    if section_title not in feeds:
                        feeds[section_title] = []
                    feeds[section_title] += articles
        # items() works on both Python 2 and Python 3
        ans = [(key, val) for key, val in feeds.items()]
        return ans

    def preprocess_html(self, soup):
        # Process all the images: swap each Tableau placeholder for its <noscript> link
        for div in soup.findAll('div', attrs={'class':'tableauPlaceholder'}):
            noscripts = div.find('noscript').a
            div.replaceWith(noscripts)
        for div0 in soup.findAll('div', text='Powered by Tableau'):
            div0.extract()
        return soup
Old 06-29-2024, 02:48 PM   #6
dave99999
Junior Member
Posts: 1
Karma: 10
Join Date: Jun 2024
Device: Boox Note Air 2 Plus
Update recipe

Looks like this recipe has been sitting broken for a long time. Any chance of an update? I fear I lack the skills...