Old 03-10-2023, 06:21 AM   #1
Sushi5675
Junior Member
Sushi5675 began at the beginning.
 
Posts: 8
Karma: 10
Join Date: Mar 2023
Device: kindle paperwhite
My SZ Recipe does not fetch all articles

Hi there,

I am very frustrated right now and really hope someone can help me out here.

I am trying to fetch news from www.sueddeutsche.de, which works fine for some articles but not for others. The paragraph text is somehow hidden in the HTML code and doesn't get extracted.

I have a subscription, so the articles should be visible even though they are behind a paywall. But fetching fails only for some articles, regardless of whether they are behind a paywall or not.

This article, for example, is not working:

view-source:https://www.sueddeutsche.de/politik/...215?print=true

I use the print=true parameter because it is much cleaner than...

I am really looking forward to any idea or code example.
If we can figure that out here, I'd be happy to share the recipe via calibre, because the last recipes I found are quite old...

Thank you!!

Code:
# -*- coding: utf-8 -*-
__license__ = 'GPL v3'

#import
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup
from calibre import strftime

##SZ
class Sueddeutsche(BasicNewsRecipe):
    title = u'SZ8'
    description = 'News from Germany'
    publisher = u'Süddeutsche Zeitung'
    category = 'news, politics'
    timefmt = ' [%a, %d %b %Y]'
    oldest_article = 1
    max_articles_per_feed = 10
    language = 'de'
    encoding = 'utf-8'
    publication_type = 'newspaper'
    remove_empty_feeds = True
    needs_subscription = True
    use_embedded_content = False
    no_stylesheets = True
    remove_javascript = False
    auto_cleanup = True
    #simultaneous_downloads = 1
    #articles_are_obfuscated = True

    #add login
    def get_browser(self):
        browser = BasicNewsRecipe.get_browser(self)
        # Login
        url = 'https://id.sueddeutsche.de/login'
        browser.open(url)
        browser.select_form(nr=0)  # first form
        browser['login'] = self.username
        browser['password'] = self.password
        browser.submit()
        return browser

    feeds = [
        (u'Politik', u'http://rss.sueddeutsche.de/rss/Politik'),
    ]

    def print_version(self, url):
        return url + '?print=true'
Old 03-10-2023, 02:11 PM   #2
unkn0wn
Fanatic
unkn0wn can do the Funky Gibbon.
 
Posts: 543
Karma: 82944
Join Date: May 2021
Device: kindle
Maybe auto_cleanup fails.

You can check by removing auto_cleanup; it'll then load everything from the page. Maybe then you'll know whether the fetched link itself doesn't actually have any content, or whether it's a login issue.
Old 03-10-2023, 06:13 PM   #3
Sushi5675
Junior Member
Sushi5675 began at the beginning.
 
Posts: 8
Karma: 10
Join Date: Mar 2023
Device: kindle paperwhite
Quote:
Originally Posted by unkn0wn View Post
Maybe auto_cleanup fails.

You can check by removing auto_cleanup; it'll then load everything from the page. Maybe then you'll know whether the fetched link itself doesn't actually have any content, or whether it's a login issue.
Thank you for your reply. Unfortunately it doesn't work with auto_cleanup = False or with the line removed.

I've rechecked and tested via console output and debugging mode.

And now it seems to be a login issue after all.

But what is wrong with the "def get_browser(self)" section?
Old 03-19-2023, 05:19 AM   #4
Sushi5675
Junior Member
Sushi5675 began at the beginning.
 
Posts: 8
Karma: 10
Join Date: Mar 2023
Device: kindle paperwhite
Hi,

unfortunately I am not able to resolve the issue.

When I enter the login data on
https://id.sueddeutsche.de/login
and press Enter, it does not log in automatically. I have to click the Login button.

In addition, in the settings of my SZ profile I can see my logged-in sessions, but not the browser from the calibre news recipe.

Does that mean that the browser.submit() function is probably not working either, and I am not logged in after all?

Is there an alternative to the browser.submit() function?

Here is the form from id.sueddeutsche.de/login:

Code:
<div id="loginbox">
  <form class="top-boxes" id="login-form" method="post" role="form" action="/login">
    <div class="form-group floating-label js-required">
      <label for="id_login">E-Mail Adresse</label>
      <input type="text" name="login" id="login_login-form" class="form-control" />
    </div>
    <div class="form-group floating-label js-required">
      <label for="id_password">Passwort</label>
      <input type="password" name="password" id="password_login-form" class="form-control" />
      <div class="field-help help"><a href="&#x2F;resetpassword">Passwort vergessen</a></div>
    </div>
    <div class="form-group rememberme checkbox-group">
      <div class="table-box">
        <div class="custom-checkbox">
          <input type="checkbox" name="remember_me" id="id_remember_me" value="on" class="form-control" checked="checked" />
          <div class="box"><div class="tick"></div></div>
        </div>
        <div class="label-box"><label for="id_remember_me">Angemeldet bleiben</label></div>
      </div>
    </div>
    <div class="form-group hidden">
      <input type="hidden" name="login_ticket" id="login_ticket_login-form" value="LT-l0wVOXDqTgF9GfUzQhy7HuN63LIni" />
    </div>
    <div id="creTracking-login"></div>
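Besides login and password, the form posted above carries a checkbox and a hidden login_ticket field. A minimal sketch, using only the standard library, that lists the named inputs of the form with id "login-form"; the HTML string is a shortened copy of the markup above, and the helper names (FormFieldLister, list_form_fields) are made up for illustration. It does not talk to the site; it only shows which fields a submit of that form would have to include.

```python
from html.parser import HTMLParser

# Shortened copy of the login form posted above (same field names).
LOGIN_FORM_HTML = '''
<form class="top-boxes" id="login-form" method="post" action="/login">
  <input type="text" name="login" id="login_login-form" />
  <input type="password" name="password" id="password_login-form" />
  <input type="checkbox" name="remember_me" value="on" checked="checked" />
  <input type="hidden" name="login_ticket" value="LT-l0wVOXDqTgF9GfUzQhy7HuN63LIni" />
</form>
'''


class FormFieldLister(HTMLParser):
    """Collect (name, type) for every named <input> inside the form with a given id."""

    def __init__(self, form_id):
        super().__init__()
        self.form_id = form_id
        self.in_form = False
        self.fields = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form' and attrs.get('id') == self.form_id:
            self.in_form = True
        elif tag == 'input' and self.in_form and 'name' in attrs:
            self.fields.append((attrs['name'], attrs.get('type', 'text')))

    def handle_endtag(self, tag):
        if tag == 'form':
            self.in_form = False


def list_form_fields(html, form_id='login-form'):
    """Return the (name, type) pairs of the inputs in the chosen form."""
    parser = FormFieldLister(form_id)
    parser.feed(html)
    return parser.fields


fields = list_form_fields(LOGIN_FORM_HTML)
for name, kind in fields:
    print(name, kind)
```

Since browser.submit() sends the hidden fields of the selected form along with the values set in the recipe, the ticket should not need to be handled by hand, as long as the right form is selected.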
Old 03-19-2023, 05:22 AM   #5
Sushi5675
Junior Member
Sushi5675 began at the beginning.
 
Posts: 8
Karma: 10
Join Date: Mar 2023
Device: kindle paperwhite
Hi,

unfortunately my recipe does not work and I can't figure out how to solve it.


When I log in manually at https://id.sueddeutsche.de/login and press Enter after filling in the fields, nothing happens.

Maybe the browser.submit() function is not working either?
Is there an alternative to submit() for logging in with the browser session of the news recipe?
Old 03-19-2023, 05:24 AM   #6
Sushi5675
Junior Member
Sushi5675 began at the beginning.
 
Posts: 8
Karma: 10
Join Date: Mar 2023
Device: kindle paperwhite
Also, in my SZ profile I can see all my logged-in devices.
But the calibre browser session is not visible, so I assume my login does not work.
Old 03-21-2023, 01:03 PM   #7
unkn0wn
Fanatic
unkn0wn can do the Funky Gibbon.
 
Posts: 543
Karma: 82944
Join Date: May 2021
Device: kindle
try
Code:
    def get_browser(self):
        
        def is_form_login(form):
            return "id" in form.attrs and form.attrs['id'] == "login-form"

        browser = BasicNewsRecipe.get_browser(self)
        # Login
        url = 'https://id.sueddeutsche.de/login'
        browser.open(url)
        browser.select_form(predicate=is_form_login)
        browser['login'] = self.username
        browser['password'] = self.password
        browser.submit()
        return browser
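The difference to select_form(nr=0) is that nr=0 takes whatever form comes first on the page, while the predicate picks the form by its id. A small sketch of that selection logic, runnable without calibre: FakeForm and pick_form are hypothetical stand-ins for the form objects and the selection loop, and the only thing assumed about real forms is the attrs dictionary that the predicate above already relies on.

```python
class FakeForm:
    """Hypothetical stand-in for the form objects that select_form passes to a predicate."""

    def __init__(self, **attrs):
        self.attrs = attrs


def is_form_login(form):
    # Same check as in the recipe, written with dict.get
    return form.attrs.get('id') == 'login-form'


def pick_form(forms, predicate):
    """Return the first form for which predicate is true, else None."""
    for form in forms:
        if predicate(form):
            return form
    return None


forms = [
    FakeForm(id='search-form', method='get'),
    FakeForm(id='login-form', method='post', action='/login'),
]
chosen = pick_form(forms, is_form_login)
print(chosen.attrs['action'])
```

With a page laid out like the list above, nr=0 would have selected the search form. Whether the real login page has another form in front of the login form is not known from this thread.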
Old 04-11-2023, 02:05 PM   #8
Sushi5675
Junior Member
Sushi5675 began at the beginning.
 
Posts: 8
Karma: 10
Join Date: Mar 2023
Device: kindle paperwhite
Quote:
Originally Posted by unkn0wn View Post
try
Code:
    def get_browser(self):
        
        def is_form_login(form):
            return "id" in form.attrs and form.attrs['id'] == "login-form"

        browser = BasicNewsRecipe.get_browser(self)
        # Login
        url = 'https://id.sueddeutsche.de/login'
        browser.open(url)
        browser.select_form(predicate=is_form_login)
        browser['login'] = self.username
        browser['password'] = self.password
        browser.submit()
        return browser
Thanks unkn0wn, the notification of your post didn't work, so please excuse my late reply.

I debugged the output and the login works.
In the console output I can read my profile ID, which is only visible after a successful login.

But unfortunately only two or three articles are readable.

The strange thing is that some articles behind the paywall are readable and others are not. The rest of the articles are shortened.

Any ideas?


Code:
# -*- coding: utf-8 -*-
__license__ = 'GPL v3'

#import
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup
from calibre import strftime
import time


##SZ
class Sueddeutsche(BasicNewsRecipe):
    title = u'SZ8'
    description = 'News from Germany'
    publisher = u'Süddeutsche Zeitung'
    category = 'news, politics'
    timefmt = ' [%a, %d %b %Y]'
    oldest_article = 1
    max_articles_per_feed = 10
    language = 'de'
    encoding = 'utf-8'
    publication_type = 'newspaper'
    remove_empty_feeds = True
    needs_subscription = True

    
    simultaneous_downloads = 1
    recursions = 0

    feeds = [  
        #(u'Politik', u'http://rss.sueddeutsche.de/rss/Politik'),
        
        (u'SZ', u'https://www.sueddeutsche.de/news/rss?search=&sort=date&dep%5B%5D=politik&typ%5B%5D=article&all%5B%5D=sys&all%5B%5D=time&sys%5B%5D=sz&catsz%5B%5D=szTopThemes'),
    ]
    
    def get_browser(self):
            def is_form_login(form):
                return "id" in form.attrs and form.attrs['id'] == "login-form"
            browser = BasicNewsRecipe.get_browser(self)
            # Login
            url = 'https://id.sueddeutsche.de/login'
            browser.open(url)
            browser.select_form(predicate=is_form_login)
            #browser.select_form(nr=0)  # first form
            browser['login'] = self.username
            browser['password'] = self.password
            browser.submit()
            return browser
    
    def print_version(self, url):
            if '?' in url:
                new_url = self.browser.open(url + '&print=true').geturl()
            else: 
                new_url = self.browser.open(url + '?print=true').geturl()
            return new_url
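The '?' versus '&' branching in print_version can also be done with the standard library. A minimal sketch, assuming only that the print view is selected by a print=true query parameter as in the recipe; add_print_param is a made-up helper name and the example.com URLs are placeholders. Unlike the recipe version it does not open the URL, so it does not follow redirects.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def add_print_param(url):
    """Return url with print=true in its query string, keeping other parameters."""
    parts = urlsplit(url)
    # Keep existing parameters, but drop an old print=... so it is not duplicated.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != 'print']
    query.append(('print', 'true'))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(add_print_param('https://example.com/politik/artikel'))
print(add_print_param('https://example.com/politik/artikel?reduced=true'))
```

Inside a recipe this could be called from print_version(self, url). It only tidies up the URL handling and is not expected to change which articles come back complete.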
Old 04-12-2023, 02:34 AM   #9
unkn0wn
Fanatic
unkn0wn can do the Funky Gibbon.
 
Posts: 543
Karma: 82944
Join Date: May 2021
Device: kindle
Maybe don't use the print_version part.
Check once without it.
If it works, you can add auto_cleanup = True.

Why is the SZ feed link so long?
Just use (u'SZ', u'https://www.sueddeutsche.de/news/rss'),
Old 05-02-2023, 02:36 PM   #10
Sushi5675
Junior Member
Sushi5675 began at the beginning.
 
Posts: 8
Karma: 10
Join Date: Mar 2023
Device: kindle paperwhite
Sorry again for my late reply.

I've tested again in every possible way, but the problem persists.

The feed I was using limits the articles to a specific type and source. But downloading articles that are restricted and not simple dpa news still doesn't work.

Any other suggestions?
Old 05-03-2023, 03:20 AM   #11
unkn0wn
Fanatic
unkn0wn can do the Funky Gibbon.
 
Posts: 543
Karma: 82944
Join Date: May 2021
Device: kindle
PM me your login details and attach the recipe; I can check.
Old 05-21-2023, 03:23 AM   #12
Sushi5675
Junior Member
Sushi5675 began at the beginning.
 
Posts: 8
Karma: 10
Join Date: Mar 2023
Device: kindle paperwhite
Hi,

I still can't get it to work... Thanks @unkn0wn for all your input.

The initial login procedure works, but it is probably not staying logged in (without JavaScript?). Maybe we need something similar to the WSJ or Irish Times recipes?

Current status is:

Code:
# -*- coding: utf-8 -*-
__license__ = 'GPL v3'

'''
Fetch sueddeutsche.de
'''
from calibre.web.feeds.news import BasicNewsRecipe, classes

class Sueddeutsche(BasicNewsRecipe):

    title = u'SZ'
    description = 'News from Germany, Access to online content'
    publisher = u'Süddeutsche Zeitung'
    category = 'news, politics, Germany'
    timefmt = ' [%a, %d %b %Y]'
    oldest_article = 1
    max_articles_per_feed = 100
    language = 'de'
    encoding = 'utf-8'
    publication_type = 'newspaper'
    remove_attributes = ['style', 'height', 'width']
    needs_subscription = True
    use_embedded_content = False
    no_stylesheets = True
    
    def get_browser(self):
        
        def is_form_login(form):
            return "id" in form.attrs and form.attrs['id'] == "login-form"
        
        browser = BasicNewsRecipe.get_browser(self)

        url = 'https://id.sueddeutsche.de/login'
        browser.open(url)

        browser.select_form(predicate=is_form_login)
        #browser.select_form(nr=0)  
        browser['login'] = self.username
        browser['password'] = self.password
        browser.submit()

        return browser
    
    keep_only_tags = [
        classes('lp_is_start custom-1qvpywd')
    ]
    
    remove_tags = [
        dict(name=['button', 'aside', 'nav']),
        classes('teaserable-layout teaserable-layout--teaser')
    ]

    feeds = [	
         (u'SZ', u'https://www.sueddeutsche.de/news/rss'),       
    ]
    
    def preprocess_html(self, soup):
        for pic in soup.findAll('picture'):
            if nos := pic.find('noscript'):
                nos.name = 'div'
        for img in soup.findAll('img', attrs={'src':lambda n: n and n.startswith('data:')}):
            img.extract()
        return soup
    
    def print_version(self, url):
        return url.split('?')[0]
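One way to test the suspicion that the session is not kept would be to look at the cookies held after the login. A minimal sketch using the standard library's http.cookiejar; describe_cookies and make_cookie are made-up helpers, the cookie names are invented for the demo, and the assumption (not confirmed in this thread) is that the recipe's browser exposes a cookie jar that can be iterated and whose entries have name, domain and expires attributes.

```python
import time
from http.cookiejar import Cookie, CookieJar


def describe_cookies(jar, domain_part, now=None):
    """Return sorted (name, kind) pairs for cookies whose domain contains domain_part.

    kind is 'session' (no expiry set), 'expired', or 'persistent'.
    """
    now = time.time() if now is None else now
    result = []
    for cookie in jar:
        if domain_part not in cookie.domain:
            continue
        if cookie.expires is None:
            kind = 'session'
        elif cookie.expires <= now:
            kind = 'expired'
        else:
            kind = 'persistent'
        result.append((cookie.name, kind))
    return sorted(result)


def make_cookie(name, domain, expires):
    # Demo helper only: build a cookie with placeholder values for unused fields.
    return Cookie(
        version=0, name=name, value='x', port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith('.'),
        path='/', path_specified=True, secure=True, expires=expires,
        discard=expires is None, comment=None, comment_url=None, rest={},
    )


# Invented cookies, standing in for whatever the login actually sets.
jar = CookieJar()
jar.set_cookie(make_cookie('demo_session', '.sueddeutsche.de', None))
jar.set_cookie(make_cookie('demo_remember', 'id.sueddeutsche.de', 2000))
jar.set_cookie(make_cookie('demo_other', 'example.com', None))
summary = describe_cookies(jar, 'sueddeutsche.de', now=1000)
print(summary)
```

If the jar after browser.submit() held cookies only for id.sueddeutsche.de and none for www.sueddeutsche.de, that would fit the "logged in but articles still shortened" picture, but this is a guess to be checked against the real output.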
Old 08-14-2024, 11:53 PM   #13
unkn0wn
Fanatic
unkn0wn can do the Funky Gibbon.
 
Posts: 543
Karma: 82944
Join Date: May 2021
Device: kindle
https://github.com/kovidgoyal/calibr...88f0a3983787d4