03-10-2023, 06:21 AM | #1 |
Junior Member
Posts: 8
Karma: 10
Join Date: Mar 2023
Device: kindle paperwhite
|
My SZ Recipe does not fetch all articles
Hi there,
I am very frustrated right now and really hope someone can help me out here. I am trying to fetch news from www.sueddeutsche.de, which works fine for some articles but not for others. The paragraph text is somehow hidden in the HTML and doesn't get extracted. I have a subscription, so the articles should be visible even though they are behind a paywall. But the fetch fails only on some articles, regardless of whether they are behind the paywall or not. This article, for example, does not work: view-source:https://www.sueddeutsche.de/politik/...215?print=true I use the print=true parameter because the output is much cleaner than... I am really looking forward to any idea or code example. If we can figure this out here, I'd be happy to share the recipe via calibre, because the last recipes I found are quite old... Thank you!! PHP Code:
|
03-10-2023, 02:11 PM | #2 |
Fanatic
Posts: 543
Karma: 82944
Join Date: May 2021
Device: kindle
|
Maybe auto_cleanup fails.
You can check by removing auto_cleanup; the recipe will then load everything from the page. That should tell you whether the fetched link itself has no content or whether it's a login issue. |
|
03-10-2023, 06:13 PM | #3 | |
Junior Member
Posts: 8
Karma: 10
Join Date: Mar 2023
Device: kindle paperwhite
|
Quote:
I've rechecked and tested via console output and debug mode. It now seems to be a login issue after all. But what is wrong with the "def get_browser(self)" section? |
|
03-19-2023, 05:19 AM | #4 |
Junior Member
Posts: 8
Karma: 10
Join Date: Mar 2023
Device: kindle paperwhite
|
Hi,
unfortunately I am not able to resolve the issue. When I enter the login data on https://id.sueddeutsche.de/login and press Enter, it does not log in automatically; I have to click the Login button. In addition, in the settings of my SZ profile I can see my logged-in sessions, but not the browser from the calibre news recipe. Does that mean that browser.submit() is probably not working either and I am not logged in after all? Is there an alternative to browser.submit()? Here is the form from id.sueddeutsche.de/login Code:
<div id="loginbox">
  <form class="top-boxes" id="login-form" method="post" role="form" action="/login">
    <div class="form-group floating-label js-required">
      <label for="id_login">E-Mail Adresse</label>
      <input type="text" name="login" id="login_login-form" class="form-control" />
    </div>
    <div class="form-group floating-label js-required">
      <label for="id_password">Passwort</label>
      <input type="password" name="password" id="password_login-form" class="form-control" />
      <div class="field-help help"><a href="/resetpassword">Passwort vergessen</a></div>
    </div>
    <div class="form-group rememberme checkbox-group">
      <div class="table-box">
        <div class="custom-checkbox">
          <input type="checkbox" name="remember_me" id="id_remember_me" value="on" class="form-control" checked="checked" />
          <div class="box"><div class="tick"></div></div>
        </div>
        <div class="label-box"><label for="id_remember_me">Angemeldet bleiben</label></div>
      </div>
    </div>
    <div class="form-group hidden">
      <input type="hidden" name="login_ticket" id="login_ticket_login-form" value="LT-l0wVOXDqTgF9GfUzQhy7HuN63LIni" />
    </div>
    <div id="creTracking-login"></div>
|
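A side note on the form above: besides login and password it carries a checkbox (remember_me) and a hidden login_ticket, so a programmatic submit presumably has to send those along as well. The following is a minimal sketch, standard library only and not calibre API, that lists the named input fields of a form. The class and function names (FormFieldCollector, form_fields) and the shortened sample markup with its LT-example value are made up for illustration.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect name/value pairs of <input> elements inside one form."""

    def __init__(self, form_id):
        super().__init__()
        self.form_id = form_id
        self.in_form = False
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form' and attrs.get('id') == self.form_id:
            self.in_form = True
        elif tag == 'input' and self.in_form and 'name' in attrs:
            # Inputs without a value attribute are recorded as empty strings.
            self.fields[attrs['name']] = attrs.get('value', '')

    def handle_endtag(self, tag):
        if tag == 'form':
            self.in_form = False


def form_fields(html, form_id='login-form'):
    """Return {name: value} for the inputs of the form with the given id."""
    parser = FormFieldCollector(form_id)
    parser.feed(html)
    return parser.fields


# Shortened, hypothetical version of the login form quoted above.
SAMPLE = (
    '<form id="login-form" method="post" action="/login">'
    '<input type="text" name="login" id="login_login-form" />'
    '<input type="password" name="password" id="password_login-form" />'
    '<input type="checkbox" name="remember_me" value="on" checked="checked" />'
    '<input type="hidden" name="login_ticket" value="LT-example" />'
    '</form>'
)

print(form_fields(SAMPLE))
```

Running the same function on the page actually fetched by the recipe would show whether the hidden ticket is present in what the recipe's browser receives.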
03-19-2023, 05:22 AM | #5 |
Junior Member
Posts: 8
Karma: 10
Join Date: Mar 2023
Device: kindle paperwhite
|
Hi,
unfortunately my recipe does not work and I can't figure out how to solve it. When I log in manually at https://id.sueddeutsche.de/login and press Enter after filling in the fields, nothing happens. Maybe the browser.submit() function is not working either? Is there an alternative to submit() for logging in with the browser session of the news recipe? |
|
03-19-2023, 05:24 AM | #6 |
Junior Member
Posts: 8
Karma: 10
Join Date: Mar 2023
Device: kindle paperwhite
|
Also, in my SZ profile I can see all my logged in devices.
But the browser session of calibre is not visible so I assume my login does not work. |
03-21-2023, 01:03 PM | #7 |
Fanatic
Posts: 543
Karma: 82944
Join Date: May 2021
Device: kindle
|
Try
Code:
def get_browser(self):
    def is_form_login(form):
        return "id" in form.attrs and form.attrs['id'] == "login-form"

    browser = BasicNewsRecipe.get_browser(self)
    # Login
    url = 'https://id.sueddeutsche.de/login'
    browser.open(url)
    browser.select_form(predicate=is_form_login)
    browser['login'] = self.username
    browser['password'] = self.password
    browser.submit()
    return browser
|
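To make the predicate idea in the snippet above concrete, here is a small stand-alone sketch of selecting a form by its id attribute. It does not use the real browser: StubForm and first_match are hypothetical stand-ins, and returning None when nothing matches is a choice of this sketch, not a statement about how select_form behaves.

```python
class StubForm:
    """Hypothetical stand-in for a parsed form; only `attrs` is modelled."""

    def __init__(self, **attrs):
        self.attrs = attrs


def is_form_login(form):
    # Same test as in the recipe, written with dict.get to avoid the
    # separate "id" in form.attrs check.
    return form.attrs.get('id') == 'login-form'


def first_match(forms, predicate):
    """Return the first form the predicate accepts, or None."""
    return next((form for form in forms if predicate(form)), None)


forms = [
    StubForm(id='search-form'),
    StubForm(**{'class': 'newsletter'}),  # a form without any id
    StubForm(id='login-form'),
]
chosen = first_match(forms, is_form_login)
print(chosen.attrs)
```

The form without an id is skipped cleanly, which is the case the "id" in form.attrs guard in the recipe protects against.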
04-11-2023, 02:05 PM | #8 | |
Junior Member
Posts: 8
Karma: 10
Join Date: Mar 2023
Device: kindle paperwhite
|
Quote:
I debugged the output and the login works. In the console output I can read my profile ID, which is only visible after a successful login. But unfortunately only two or three articles are readable. The strange thing is that some articles behind the paywall are readable and others are not. The rest of the articles are truncated. Any ideas? Code:
# -*- coding: utf-8 -*-
__license__ = 'GPL v3'
#import
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup
from calibre import strftime
import time


##SZ
class Sueddeutsche(BasicNewsRecipe):
    title = u'SZ8'
    description = 'News from Germany'
    publisher = u'Süddeutsche Zeitung'
    category = 'news, politics'
    timefmt = ' [%a, %d %b %Y]'
    oldest_article = 1
    max_articles_per_feed = 10
    language = 'de'
    encoding = 'utf-8'
    publication_type = 'newspaper'
    remove_empty_feeds = True
    needs_subscription = True
    simultaneous_downloads = 1
    recursions = 0

    feeds = [
        #(u'Politik', u'http://rss.sueddeutsche.de/rss/Politik'),
        (u'SZ', u'https://www.sueddeutsche.de/news/rss?search=&sort=date&dep%5B%5D=politik&typ%5B%5D=article&all%5B%5D=sys&all%5B%5D=time&sys%5B%5D=sz&catsz%5B%5D=szTopThemes'),
    ]

    def get_browser(self):
        def is_form_login(form):
            return "id" in form.attrs and form.attrs['id'] == "login-form"

        browser = BasicNewsRecipe.get_browser(self)
        # Login
        url = 'https://id.sueddeutsche.de/login'
        browser.open(url)
        browser.select_form(predicate=is_form_login)
        #browser.select_form(nr=0)  # first form
        browser['login'] = self.username
        browser['password'] = self.password
        browser.submit()
        return browser

    def print_version(self, url):
        if '?' in url:
            new_url = self.browser.open(url + '&print=true').geturl()
        else:
            new_url = self.browser.open(url + '?print=true').geturl()
        return new_url
|
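One detail of print_version above: it opens every article URL once just to compute the print URL, then returns whatever address the server redirected to. If only the print=true parameter is wanted, the string can be built without a request. A minimal sketch, standard library only; with_print_flag is a hypothetical helper and the example URLs are invented. Note that it deliberately skips the redirect resolution the recipe performs, so it is not a drop-in equivalent.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_print_flag(url):
    """Return url with print=true set exactly once in its query string."""
    parts = urlsplit(url)
    # Keep every existing parameter except an earlier print flag.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != 'print'
    ]
    query.append(('print', 'true'))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Invented example URLs, with and without an existing query string.
print(with_print_flag('https://www.sueddeutsche.de/politik/example-1.123'))
print(with_print_flag('https://www.sueddeutsche.de/politik/example-1.123?reduced=true'))
```

Inside a recipe, print_version could simply return the result of such a helper, assuming the print page is reachable without following the redirect first.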
|
04-12-2023, 02:34 AM | #9 |
Fanatic
Posts: 543
Karma: 82944
Join Date: May 2021
Device: kindle
|
Maybe don't use the print_version part.
Check once without it. If it works, you can add auto_cleanup = True. Why is the SZ feed link so long? Just use (u'SZ', u'https://www.sueddeutsche.de/news/rss'), |
05-02-2023, 02:36 PM | #10 |
Junior Member
Posts: 8
Karma: 10
Join Date: Mar 2023
Device: kindle paperwhite
|
Sorry again for my late reply.
I've tested again in every way I could think of, but the problem persists. The feed I was using limits the articles to a specific kind and source. But downloading articles that are restricted, and not just simple dpa news, still doesn't work. Any other suggestions? |
05-03-2023, 03:20 AM | #11 |
Fanatic
Posts: 543
Karma: 82944
Join Date: May 2021
Device: kindle
|
PM me your login details and attach the recipe; I can check.
|
05-21-2023, 03:23 AM | #12 |
Junior Member
Posts: 8
Karma: 10
Join Date: Mar 2023
Device: kindle paperwhite
|
Hi,
I still can't get it to work... Thanks @unkn0wn for all your input. The initial login procedure works, but it probably doesn't stay logged in (without JavaScript?). Maybe we need something similar to the WSJ or Irish Times recipes? Current status is: Code:
# -*- coding: utf-8 -*-
__license__ = 'GPL v3'
'''
Fetch sueddeutsche.de
'''
from calibre.web.feeds.news import BasicNewsRecipe, classes


class Sueddeutsche(BasicNewsRecipe):
    title = u'SZ'
    description = 'News from Germany, Access to online content'
    publisher = u'Süddeutsche Zeitung'
    category = 'news, politics, Germany'
    timefmt = ' [%a, %d %b %Y]'
    oldest_article = 1
    max_articles_per_feed = 100
    language = 'de'
    encoding = 'utf-8'
    publication_type = 'newspaper'
    remove_attributes = ['style', 'height', 'width']
    needs_subscription = True
    use_embedded_content = False
    no_stylesheets = True

    def get_browser(self):
        def is_form_login(form):
            return "id" in form.attrs and form.attrs['id'] == "login-form"

        browser = BasicNewsRecipe.get_browser(self)
        url = 'https://id.sueddeutsche.de/login'
        browser.open(url)
        browser.select_form(predicate=is_form_login)
        #browser.select_form(nr=0)
        browser['login'] = self.username
        browser['password'] = self.password
        browser.submit()
        return browser

    keep_only_tags = [
        classes('lp_is_start custom-1qvpywd')
    ]

    remove_tags = [
        dict(name=['button', 'aside', 'nav']),
        classes('teaserable-layout teaserable-layout--teaser')
    ]

    feeds = [
        (u'SZ', u'https://www.sueddeutsche.de/news/rss'),
    ]

    def preprocess_html(self, soup):
        for pic in soup.findAll('picture'):
            if nos := pic.find('noscript'):
                nos.name = 'div'
        for img in soup.findAll('img', attrs={'src': lambda n: n and n.startswith('data:')}):
            img.extract()
        return soup

    def print_version(self, url):
        return url.split('?')[0]
|
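On the suspicion that the session does not persist: the login happens on id.sueddeutsche.de while the articles live on www.sueddeutsche.de, so one thing worth looking at is which domains the cookies held after login are scoped to. The following is only a debugging sketch under assumptions: it works on any iterable of cookie objects that have .domain and .name attributes, the demo uses the standard library's CookieJar rather than the recipe's browser, and the cookie names are invented. Whether the recipe's browser exposes its cookies in this form has not been confirmed in this thread.

```python
from collections import defaultdict
from http.cookiejar import Cookie, CookieJar


def cookies_by_domain(cookies):
    """Group cookie names by the domain they are scoped to."""
    grouped = defaultdict(list)
    for cookie in cookies:
        grouped[cookie.domain].append(cookie.name)
    return {domain: sorted(names) for domain, names in grouped.items()}


def make_cookie(name, domain):
    """Build a throwaway session cookie for the demo (values are invented)."""
    return Cookie(
        version=0, name=name, value='x', port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith('.'),
        path='/', path_specified=True, secure=True, expires=None,
        discard=True, comment=None, comment_url=None, rest={},
    )


# Hypothetical jar: one cookie bound to the login host only,
# one valid for the whole sueddeutsche.de domain.
jar = CookieJar()
jar.set_cookie(make_cookie('hypothetical_ticket', 'id.sueddeutsche.de'))
jar.set_cookie(make_cookie('hypothetical_session', '.sueddeutsche.de'))

for domain, names in sorted(cookies_by_domain(jar).items()):
    print(domain, names)
```

If everything set during login turned out to be scoped to the login host only, that would be consistent with article requests arriving without a session, though it would not prove it.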
08-14-2024, 11:53 PM | #13 |
Fanatic
Posts: 543
Karma: 82944
Join Date: May 2021
Device: kindle
|
|
|