11-29-2010, 03:00 PM | #1 |
Junior Member
Posts: 1
Karma: 10
Join Date: Nov 2010
Device: Kindle 2gw
|
Reason Magazine request
Hello,
I would like to replicate the print edition of Reason on my Kindle 2. The RSS feed http://feeds.feedburner.com/reason/Articles updates daily with new articles, and http://reason.com/rss lists a large number of feeds by staff writer, topic, etc. Can I recreate the print edition from the RSS, or from http://reason.com/issues/december-2010, which lays the whole issue out as links? Thanks
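For reference, calibre can build a periodical from that article feed with a very small recipe. This is only a minimal sketch (the class name and feed title are illustrative); it produces a feed-ordered book, not a true replica of the print layout, which would require overriding parse_index() against the issue page instead:

```python
# Minimal calibre recipe sketch: builds a periodical from Reason's
# article feed. Class name and title are illustrative.
from calibre.web.feeds.news import BasicNewsRecipe


class ReasonArticles(BasicNewsRecipe):
    title = 'Reason Articles'
    language = 'en'
    oldest_article = 31           # roughly one print issue's worth of days
    max_articles_per_feed = 100
    no_stylesheets = True
    use_embedded_content = False  # fetch full articles, not feed summaries

    feeds = [
        ('Articles', 'http://feeds.feedburner.com/reason/Articles'),
    ]
```

This fragment only runs inside calibre (Fetch News, or `ebook-convert recipe.recipe out.mobi`), so it is not standalone-executable.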
03-25-2022, 03:29 PM | #2 |
Member
Posts: 11
Karma: 10
Join Date: Apr 2020
Device: Kobo Aura One
|
I have created a recipe that works when pointed at an archived edition of the magazine, but when I set Needs Subscription to True, it does not download any of the articles. The current issue is May 2022 and is only available to subscribers.
I have tested my login on the website itself, so I know the credentials are good, and the recipe does not report a login error. However, none of the links become active. It feels as if my login works but the recipe starts downloading before the links are activated. Does anyone have tips for troubleshooting subscription recipes? Testing only seems to work without a subscription, so I can only test against archived issues, which do work. I have included my recipe below; I started from the recipe for The Atlantic. Thanks for any help anyone can provide. Code:
#!/usr/bin/env python
# vim:fileencoding=utf-8
# License: GPLv3 Copyright: 2015, Kovid Goyal <kovid at kovidgoyal.net>
from __future__ import unicode_literals

import json
from xml.sax.saxutils import escape, quoteattr

from calibre.web.feeds.news import BasicNewsRecipe


# {{{ parse article JSON
def process_image_block(lines, block):
    caption = block.get('captionText')
    caption_lines = []
    if caption:
        if block.get('attributionText', '').strip():
            caption += ' (' + block['attributionText'] + ')'
        caption_lines.append('<p style="font-style: italic">' + caption + '</p>')
    lines.append('<div style="text-align: center"><img src={}/>'.format(quoteattr(block['url'])))
    lines.extend(caption_lines)
    lines.append('</div>')


def json_to_html(raw):
    data = json.loads(raw)
    # open('/t/p.json', 'w').write(json.dumps(data, indent=2))
    data = sorted((v['data'] for v in data['props']['pageProps']['urqlState'].values()), key=len)[-1]
    article = json.loads(data)['article']
    lines = []
    lines.append('<h1 style="align: center">' + escape(article['title']) + '</h1>')
    lines.append('<h2 style="align: center">' + escape(article['dek']) + '</h2>')
    auts = ', '.join(x['displayName'] for x in article['authors'])
    if auts:
        lines.append('<p style="align: center">by ' + escape(auts) + '</p>')
    if article.get('leadArt') and 'image' in article['leadArt']:
        process_image_block(lines, article['leadArt']['image'])
    for item in article['content']:
        tn = item.get('__typename', '')
        if tn.endswith('Image'):
            process_image_block(lines, item)
            continue
        html = item.get('innerHtml')
        if html is None or '</iframe>' in html:
            continue
        if 'innerHtml' not in item:
            continue
        tagname = item.get('tagName', 'P').lower()
        lines.append('<{0}>{1}</{0}>'.format(tagname, html))
    return '<html><body><div id="from-json-by-calibre">' + '\n'.join(lines) + '</div></body></html>'


class NoJSON(ValueError):
    pass


def extract_html(soup):
    script = soup.findAll('script', id='__NEXT_DATA__')
    if not script:
        raise NoJSON('No script tag with JSON data found')
    raw = script[0].contents[0]
    return json_to_html(raw)
# }}}


def classes(classes):
    q = frozenset(classes.split(' '))
    return dict(
        attrs={'class': lambda x: x and frozenset(x.split()).intersection(q)}
    )


def prefix_classes(classes):
    q = classes.split()

    def test(x):
        if x:
            for cls in x.split():
                for c in q:
                    if cls.startswith(c):
                        return True
        return False
    return dict(attrs={'class': test})


class Reason(BasicNewsRecipe):

    title = 'Reason'
    description = 'Free minds and free markets'
    INDEX = 'https://reason.com/magazine/'
    # INDEX = 'https://reason.com/issue/april-2022/'
    __author__ = 'Howard Cornett'
    language = 'en'
    encoding = 'utf-8'
    needs_subscription = True

    remove_tags = [
        classes(
            'next-post-link the-tags tag rcom-social tools comments-header-show logo-header navbar navbar-expanded-lg primary content-info sidebar magicSidebar advertisement logo entry-subtitle'
        ),
    ]
    no_stylesheets = True
    remove_attributes = ['style']
    extra_css = '''
        .credit { text-align: right; font-size: 75%; display: block }
        .figcaption { font-size: 75% }
        .caption { font-size: 75% }
        .lead-img { display: block }
        p.dropcap:first-letter {
            float: left; text-transform: uppercase; font-weight: bold;
            font-size: 5.55em; line-height: 0.83; margin: 0;
            padding-right: 7px; margin-bottom: -2px; text-align: center;
        }
    '''

    def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
        return br

    def preprocess_raw_html(self, raw_html, url):
        try:
            return extract_html(self.index_to_soup(raw_html))
        except NoJSON:
            self.log.warn('No JSON found in: {} falling back to HTML'.format(url))
        except Exception:
            self.log.exception('Failed to extract JSON data from: {} falling back to HTML'.format(url))
        return raw_html

    def preprocess_html(self, soup):
        for img in soup.findAll('img', attrs={'data-lazy-src': True}):
            # img['src'] = img['data-lazy-src'].split()[0]
            data_lazy_src = img['data-lazy-src']
            if ',' in data_lazy_src:
                img['src'] = data_lazy_src.split(',')[0]
            else:
                img['src'] = data_lazy_src.split()[0]
        return soup

    # def print_version(self, url):
    #     ans = url.partition('?')[0] + '?single_page=true'
    #     if '/video/' in ans:
    #         ans = None
    #     return ans

    def parse_index(self):
        soup = self.index_to_soup(self.INDEX)
        cover = soup.find('img', title=lambda value: value and value.startswith('Reason Magazine,'))
        if cover is not None:
            self.cover_url = cover['data-lazy-src']
        current_section, current_articles = 'Cover Story', []
        feeds = []
        for div in soup.findAll('div', attrs={'class': lambda x: x and set(x.split()).intersection({'issue-header-right', 'toc-category-list'})}):
            # print('The value of div is ', div)
            for h3 in div.findAll('h3', attrs={'class': True}):
                cls = h3['class']
                if hasattr(cls, 'split'):
                    cls = cls.split()
                if 'toc-department' in cls:
                    if current_articles:
                        feeds.append((current_section, current_articles))
                    current_articles = []
                    current_section = self.tag_to_string(h3)
                    self.log('\nFound section:', current_section)
                    title = h3.find_next_sibling().a.text
                    url = h3.find_next_sibling().a['href']
                    desc = h3.find_next_sibling().p.text
                    current_articles.append({
                        'title': title, 'url': url, 'description': desc})
            for h2 in div.findAll('h2', attrs={'class': True}):
                cls = h2['class']
                if hasattr(cls, 'split'):
                    cls = cls.split()
                if 'toc-department' in cls:
                    if current_articles:
                        feeds.append((current_section, current_articles))
                    current_articles = []
                    current_section = self.tag_to_string(h2)
                    self.log('\nFound section:', current_section)
                    for article in div.findAll('article', attrs={'class': True}):
                        h4 = article.find('h4')
                        if h4.a is not None:
                            title = h4.a.text
                            url = h4.a['href']
                        else:
                            title = ''
                            url = ''
                        p = article.find('p', class_='entry-subtitle')
                        desc = h4.find_next_sibling().text
                        current_articles.append({
                            'title': title, 'url': url, 'description': desc})
        if current_articles:
            feeds.append((current_section, current_articles))
        return feeds


if __name__ == '__main__':
    import sys
    from calibre.ebooks.BeautifulSoup import BeautifulSoup
    print(extract_html(BeautifulSoup(open(sys.argv[-1]).read())))

calibre_most_common_ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'
03-26-2022, 10:26 PM | #3 |
creator of calibre
Posts: 44,409
Karma: 23977332
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
You need to actually perform the login in get_browser(); just adding needs_subscription is not enough.
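For anyone finding this thread later, the pattern Kovid means looks roughly like this. The URL and form field names below are illustrative, not the real ones; inspect the site's actual login form to find them:

```python
# Sketch of logging in from get_browser() in a calibre recipe.
# 'https://example.com/login', the form index, and the field names
# 'username'/'password' are placeholders for the site's real ones.
from calibre.web.feeds.news import BasicNewsRecipe


class SubscriberRecipe(BasicNewsRecipe):
    needs_subscription = True  # only makes calibre prompt for credentials

    def get_browser(self):
        # needs_subscription alone does nothing: the recipe must submit
        # the login form itself so the session cookies are in place
        # before any article is fetched.
        br = BasicNewsRecipe.get_browser(self)
        if self.username is not None and self.password is not None:
            br.open('https://example.com/login')
            br.select_form(nr=0)            # or select by form id/name
            br['username'] = self.username  # field names vary per site
            br['password'] = self.password
            br.submit()
        return br
```

This fragment only runs inside calibre, which supplies the mechanize-based browser and the username/password attributes.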
|
03-28-2022, 10:43 AM | #4 |
Member
Posts: 11
Karma: 10
Join Date: Apr 2020
Device: Kobo Aura One
|
Success!
Thank you, Kovid! I totally missed that detail. I added that to my recipe and now it works! Here is the successful recipe.
Code:
#!/usr/bin/env python
# vim:fileencoding=utf-8
# License: GPLv3 Copyright: 2015, Kovid Goyal <kovid at kovidgoyal.net>
from __future__ import unicode_literals

import json
from xml.sax.saxutils import escape, quoteattr

from calibre.web.feeds.news import BasicNewsRecipe


# {{{ parse article JSON
def process_image_block(lines, block):
    caption = block.get('captionText')
    caption_lines = []
    if caption:
        if block.get('attributionText', '').strip():
            caption += ' (' + block['attributionText'] + ')'
        caption_lines.append('<p style="font-style: italic">' + caption + '</p>')
    lines.append('<div style="text-align: center"><img src={}/>'.format(quoteattr(block['url'])))
    lines.extend(caption_lines)
    lines.append('</div>')


def json_to_html(raw):
    data = json.loads(raw)
    # open('/t/p.json', 'w').write(json.dumps(data, indent=2))
    data = sorted((v['data'] for v in data['props']['pageProps']['urqlState'].values()), key=len)[-1]
    article = json.loads(data)['article']
    lines = []
    lines.append('<h1 style="align: center">' + escape(article['title']) + '</h1>')
    lines.append('<h2 style="align: center">' + escape(article['dek']) + '</h2>')
    auts = ', '.join(x['displayName'] for x in article['authors'])
    if auts:
        lines.append('<p style="align: center">by ' + escape(auts) + '</p>')
    if article.get('leadArt') and 'image' in article['leadArt']:
        process_image_block(lines, article['leadArt']['image'])
    for item in article['content']:
        tn = item.get('__typename', '')
        if tn.endswith('Image'):
            process_image_block(lines, item)
            continue
        html = item.get('innerHtml')
        if html is None or '</iframe>' in html:
            continue
        if 'innerHtml' not in item:
            continue
        tagname = item.get('tagName', 'P').lower()
        lines.append('<{0}>{1}</{0}>'.format(tagname, html))
    return '<html><body><div id="from-json-by-calibre">' + '\n'.join(lines) + '</div></body></html>'


class NoJSON(ValueError):
    pass


def extract_html(soup):
    script = soup.findAll('script', id='__NEXT_DATA__')
    if not script:
        raise NoJSON('No script tag with JSON data found')
    raw = script[0].contents[0]
    return json_to_html(raw)
# }}}


def classes(classes):
    q = frozenset(classes.split(' '))
    return dict(
        attrs={'class': lambda x: x and frozenset(x.split()).intersection(q)}
    )


def prefix_classes(classes):
    q = classes.split()

    def test(x):
        if x:
            for cls in x.split():
                for c in q:
                    if cls.startswith(c):
                        return True
        return False
    return dict(attrs={'class': test})


class Reason(BasicNewsRecipe):

    title = 'Reason'
    description = 'Free minds and free markets'
    INDEX = 'https://reason.com/magazine/'
    # INDEX = 'https://reason.com/issue/april-2022/'
    __author__ = 'Howard Cornett'
    language = 'en'
    encoding = 'utf-8'
    needs_subscription = True

    remove_tags = [
        classes(
            'next-post-link the-tags tag rcom-social tools comments-header-show logo-header navbar navbar-expanded-lg primary content-info sidebar magicSidebar advertisement logo entry-subtitle'
        ),
    ]
    no_stylesheets = True
    remove_attributes = ['style']
    extra_css = '''
        .credit { text-align: right; font-size: 75%; display: block }
        .figcaption { font-size: 75% }
        .caption { font-size: 75% }
        .lead-img { display: block }
        p.dropcap:first-letter {
            float: left; text-transform: uppercase; font-weight: bold;
            font-size: 5.55em; line-height: 0.83; margin: 0;
            padding-right: 7px; margin-bottom: -2px; text-align: center;
        }
    '''

    def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
        if self.username is not None and self.password is not None:
            br.open('https://reason.com/login')
            br.select_form(id='login_form')
            br['text_username'] = self.username
            br['password_password'] = self.password
            br.submit()
        return br

    def preprocess_raw_html(self, raw_html, url):
        try:
            return extract_html(self.index_to_soup(raw_html))
        except NoJSON:
            self.log.warn('No JSON found in: {} falling back to HTML'.format(url))
        except Exception:
            self.log.exception('Failed to extract JSON data from: {} falling back to HTML'.format(url))
        return raw_html

    def preprocess_html(self, soup):
        for img in soup.findAll('img', attrs={'data-lazy-src': True}):
            # img['src'] = img['data-lazy-src'].split()[0]
            data_lazy_src = img['data-lazy-src']
            if ',' in data_lazy_src:
                img['src'] = data_lazy_src.split(',')[0]
            else:
                img['src'] = data_lazy_src.split()[0]
        return soup

    # def print_version(self, url):
    #     ans = url.partition('?')[0] + '?single_page=true'
    #     if '/video/' in ans:
    #         ans = None
    #     return ans

    def parse_index(self):
        soup = self.index_to_soup(self.INDEX)
        cover = soup.find('img', title=lambda value: value and value.startswith('Reason Magazine,'))
        if cover is not None:
            self.cover_url = cover['src']
        current_section, current_articles = 'Cover Story', []
        feeds = []
        for div in soup.findAll('div', attrs={'class': lambda x: x and set(x.split()).intersection({'issue-header-right', 'toc-category-list'})}):
            # print('The value of div is ', div)
            for h3 in div.findAll('h3', attrs={'class': True}):
                cls = h3['class']
                if hasattr(cls, 'split'):
                    cls = cls.split()
                if 'toc-department' in cls:
                    if current_articles:
                        feeds.append((current_section, current_articles))
                    current_articles = []
                    current_section = self.tag_to_string(h3)
                    self.log('\nFound section:', current_section)
                    title = h3.find_next_sibling().a.text
                    url = h3.find_next_sibling().a['href']
                    desc = h3.find_next_sibling().p.text
                    current_articles.append({
                        'title': title, 'url': url, 'description': desc})
            for h2 in div.findAll('h2', attrs={'class': True}):
                cls = h2['class']
                if hasattr(cls, 'split'):
                    cls = cls.split()
                if 'toc-department' in cls:
                    if current_articles:
                        feeds.append((current_section, current_articles))
                    current_articles = []
                    current_section = self.tag_to_string(h2)
                    self.log('\nFound section:', current_section)
                    for article in div.findAll('article', attrs={'class': True}):
                        h4 = article.find('h4')
                        if h4.a is not None:
                            title = h4.a.text
                            url = h4.a['href']
                        else:
                            title = ''
                            url = ''
                        p = article.find('p', class_='entry-subtitle')
                        desc = h4.find_next_sibling().text
                        current_articles.append({
                            'title': title, 'url': url, 'description': desc})
        if current_articles:
            feeds.append((current_section, current_articles))
        return feeds


if __name__ == '__main__':
    import sys
    from calibre.ebooks.BeautifulSoup import BeautifulSoup
    print(extract_html(BeautifulSoup(open(sys.argv[-1]).read())))

calibre_most_common_ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'
03-28-2022, 01:04 PM | #5 |
creator of calibre
Posts: 44,409
Karma: 23977332
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
|