Old 07-24-2024, 11:54 AM   #1
unkn0wn
Fanatic
Posts: 564
Karma: 82944
Join Date: May 2021
Device: kindle
HTTP/3 support?

I see that mechanize doesn't support HTTP/3 requests yet.
Is it possible to do this some other way?
Old 07-24-2024, 12:31 PM   #2
kovidgoyal
creator of calibre
Posts: 44,699
Karma: 24967300
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
You can use the read_url() function from scraper/simple.py
Old 07-26-2024, 02:49 AM   #3
unkn0wn
How do I set custom headers?
I think I'll have to learn a bit about Qt and try these things.

Also, we will still be using mechanize to download images after processing the HTML content that I got with read_url in get_obfuscated.. so they'll also fail.
Old 07-26-2024, 02:59 AM   #4
kovidgoyal
What server supports HTTP/3 but not HTTP/1.1? The general solution for this is to create a new recipe class that uses QtWebEngine as the browser. That, however, is a long project. If you just need a quick hack, override get_browser() in your recipe to return self, then implement the open method in your recipe. The fetcher will use it to download everything, including images.

Code:
def open(self, url):
   return read_url(self.storage, url)  # here self.storage should be a list created in the __init__ method of your recipe
As for adding headers, there is currently no facility for that; however, it should be easy to add.
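
A minimal standalone sketch of the quick hack described in this post: the recipe returns itself from get_browser() and provides open(). So that it runs outside calibre, fake_read_url and FakeRecipe are hypothetical stand-ins for read_url and a BasicNewsRecipe subclass. That callers need a file-like response with read() and geturl() is an assumption of this sketch, not something confirmed here.

```python
from io import BytesIO


def fake_read_url(storage, url):
    # Hypothetical stand-in for calibre's read_url(storage, url), used only
    # so this sketch runs outside calibre. It ignores storage and returns
    # canned HTML text.
    return '<html><body>fetched ' + url + '</body></html>'


class Response(BytesIO):
    # File-like response that also remembers its URL. That callers need
    # read() and geturl() is an assumption of this sketch.

    def __init__(self, url, text):
        super().__init__(text.encode('utf-8'))
        self.url = url

    def geturl(self):
        return self.url


class FakeRecipe:
    # Stands in for a BasicNewsRecipe subclass; only the browser-related
    # part of the pattern is shown.

    def __init__(self):
        self.storage = []  # list handed to the fetch function on every call

    def get_browser(self, *args, **kwargs):
        # The recipe acts as its own "browser".
        return self

    def open(self, url, *args, **kwargs):
        # Extra arguments from the caller are accepted and ignored.
        return Response(url, fake_read_url(self.storage, url))


recipe = FakeRecipe()
browser = recipe.get_browser()
response = browser.open('https://example.com/page')
body = response.read()
print(response.geturl())
print(body.decode('utf-8'))
```

In a real recipe the stand-in fetch function would be replaced by read_url, and whether text wrapped this way is enough for every download the fetcher makes would still need testing.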
Old 07-26-2024, 04:48 AM   #5
unkn0wn
This is the recipe.
(Science magazine, which was requested here.)

I got a 403 error, but the site could be accessed with read_url.
The recipe below works, but images fail with 403.
Code:
#!/usr/bin/env python
from calibre.scraper.simple import read_url
from calibre.web.feeds.news import BasicNewsRecipe, classes
from calibre.ebooks.BeautifulSoup import BeautifulSoup


def absurl(url):
    if url.startswith('/'):
        url = 'https://www.science.org' + url
    return url

class science(BasicNewsRecipe):
    title = 'Science Journal'
    __author__ = 'unkn0wn'
    description = (
        'Science continues to publish the very best in research across the sciences, with articles that '
        'consistently rank among the most cited in the world.'
    )
    encoding = 'utf-8'
    no_javascript = True
    no_stylesheets = True
    remove_attributes = ['style', 'height', 'width']
    language = 'en'

    extra_css = '''
        .news-article__figure__caption {font-size:small; text-align:center;}
        .contributors, .core-self-citation, .meta-panel__left-content, .news-article__hero__top-meta,
        .news-article__hero__bottom-meta, #bibliography, #elettersSection {font-size:small;}
        img {display:block; margin:0 auto;}
        .core-lede {font-style:italic; color:#202020;}
    '''

    ignore_duplicate_articles = {'url'}

    def preprocess_html(self, soup):
        for p in soup.findAll(attrs={'role':'paragraph'}):
            p.name = 'p'
        return soup

    keep_only_tags = [
        classes('meta-panel__left-content news-article__hero__info news-article__hero__figure bodySection'),
        dict(name='h1', attrs={'property':'name'}),
        classes('core-lede contributors core-self-citation'),
        dict(attrs={'data-core-wrapper':'content'})
    ]

    remove_tags = [
        classes('pb-ad')
    ]

    articles_are_obfuscated = True
    def get_obfuscated_article(self, url):
        return { 'data': read_url([], url), 'url': url }

    def parse_index(self):
        url = 'https://www.science.org/toc/science/current'

        soup = BeautifulSoup(read_url([], url))
        tme = soup.find(**classes('journal-issue__vol'))
        if tme:
            self.timefmt = ' [%s]' % self.tag_to_string(tme).strip()
        det = soup.find(attrs={'id':'journal-issue-details'})
        if det:
            self.description = self.tag_to_string(det).strip()

        feeds = []

        div = soup.find('div', attrs={'class':'toc__body'})
        for sec in div.findAll('section', **classes('toc__section')):
            name = sec.find(**classes('sidebar-article-title--decorated'))
            section = self.tag_to_string(name).strip()
            self.log(section)

            articles = []

            for card in sec.findAll(**classes('card-header')):
                ti = card.find(**classes('article-title'))
                url = absurl(ti.a['href'])
                title = self.tag_to_string(ti).strip()
                desc = ''
                meta = card.find(**classes('card-meta'))
                if meta:
                    desc = self.tag_to_string(meta).strip()
                self.log('          ', title, '\n\t', desc, '\n\t', url)
                articles.append({'title': title, 'description':desc, 'url': url})
            feeds.append((section, articles))
        return feeds
I made some changes and tried something, but I got a TypeError.
Should I not have used index_to_soup with read_url?
File "<string>", line 66, in parse_index
File "calibre\web\feeds\news.py", line 731, in index_to_soup
TypeError: science.open() got an unexpected keyword argument 'timeout'
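
Going by the traceback, index_to_soup appears to pass a timeout keyword to the browser's open method, which an open(self, url) signature rejects. The sketch below reproduces that failure in isolation and shows one possible fix, accepting and ignoring extra arguments. All three function names are hypothetical, and the timeout value is arbitrary.

```python
def open_strict(url):
    # Mirrors a signature that only accepts the URL.
    return 'fetched ' + url


def open_tolerant(url, *args, **kwargs):
    # Accepts and ignores whatever else the caller passes, e.g. timeout.
    return 'fetched ' + url


def call_like_index_to_soup(opener, url):
    # Hypothetical caller: assumes, based on the traceback, that the
    # framework passes a timeout keyword when opening a URL.
    return opener(url, timeout=120)


try:
    call_like_index_to_soup(open_strict, 'https://example.com/')
    strict_error = None
except TypeError as err:
    strict_error = str(err)

tolerant_result = call_like_index_to_soup(open_tolerant, 'https://example.com/')
print(strict_error)
print(tolerant_result)
```

Ignoring timeout keeps the call from failing, but it also means the caller's timeout is silently not honoured.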

Last edited by unkn0wn; 07-26-2024 at 04:52 AM.
Old 07-26-2024, 06:24 AM   #6
kovidgoyal
Something like this (untested)

Code:
#!/usr/bin/env python
import threading
from io import BytesIO

from calibre.ebooks.BeautifulSoup import BeautifulSoup
from calibre.scraper.simple import read_url
from calibre.web.feeds.news import BasicNewsRecipe, classes


def absurl(url):
    if url.startswith('/'):
        url = 'https://www.science.org' + url
    return url



class Response(BytesIO):
    def __init__(self, url, text):
        super().__init__(text.encode('utf-8'))
        self.url = url
    def geturl(self): return self.url


class Browser:

    def __init__(self, local):
        self.thread_local = local
        self.thread_local.storage = []

    def open(self, url, *a, **kw):
        return Response(url, read_url(self.thread_local.storage, url))  # Response takes (url, text), in that order
    open_novisit = open


class science(BasicNewsRecipe):
    title = 'Science Journal'
    __author__ = 'unkn0wn'
    description = (
        'Science continues to publish the very best in research across the sciences, with articles that '
        'consistently rank among the most cited in the world.'
    )
    encoding = 'utf-8'
    no_javascript = True
    no_stylesheets = True
    remove_attributes = ['style', 'height', 'width']
    language = 'en'
    simultaneous_downloads = 1

    extra_css = '''
        .news-article__figure__caption {font-size:small; text-align:center;}
        .contributors, .core-self-citation, .meta-panel__left-content, .news-article__hero__top-meta,
        .news-article__hero__bottom-meta, #bibliography, #elettersSection {font-size:small;}
        img {display:block; margin:0 auto;}
        .core-lede {font-style:italic; color:#202020;}
    '''

    ignore_duplicate_articles = {'url'}

    keep_only_tags = [
        classes('meta-panel__left-content news-article__hero__info news-article__hero__figure bodySection'),
        dict(name='h1', attrs={'property':'name'}),
        classes('core-lede contributors core-self-citation'),
        dict(attrs={'data-core-wrapper':'content'})
    ]

    remove_tags = [
        classes('pb-ad')
    ]

    def __init__(self, *a, **kw):
        self.thread_local = threading.local()
        super().__init__(*a, **kw)

    def get_browser(self, *a, **kw):
        return Browser(self.thread_local)

    def preprocess_html(self, soup):
        for p in soup.findAll(attrs={'role':'paragraph'}):
            p.name = 'p'
        return soup

    def parse_index(self):
        url = 'https://www.science.org/toc/science/current'

        soup = BeautifulSoup(read_url([], url))
        tme = soup.find(**classes('journal-issue__vol'))
        if tme:
            self.timefmt = ' [%s]' % self.tag_to_string(tme).strip()
        det = soup.find(attrs={'id':'journal-issue-details'})
        if det:
            self.description = self.tag_to_string(det).strip()

        feeds = []

        div = soup.find('div', attrs={'class':'toc__body'})
        for sec in div.findAll('section', **classes('toc__section')):
            name = sec.find(**classes('sidebar-article-title--decorated'))
            section = self.tag_to_string(name).strip()
            self.log(section)

            articles = []

            for card in sec.findAll(**classes('card-header')):
                ti = card.find(**classes('article-title'))
                url = absurl(ti.a['href'])
                title = self.tag_to_string(ti).strip()
                desc = ''
                meta = card.find(**classes('card-meta'))
                if meta:
                    desc = self.tag_to_string(meta).strip()
                self.log('          ', title, '\n\t', desc, '\n\t', url)
                articles.append({'title': title, 'description':desc, 'url': url})
            feeds.append((section, articles))
        return feeds
Old 07-26-2024, 06:36 AM   #7
kovidgoyal
Actually, that won't work: read_url() is currently only meant for HTML files, not binary data.
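
A small illustration of why a text-only fetch path is a problem for images, assuming the bytes get decoded to str somewhere along the way and are later re-encoded as UTF-8. How read_url actually handles non-HTML responses is not shown in this thread, and errors='replace' is used here purely for illustration.

```python
# First bytes of a PNG file: not valid UTF-8 because of the 0x89 byte.
png_header = b'\x89PNG\r\n\x1a\n'

# Assumption: a text-oriented fetcher hands back str, so the bytes are
# decoded at some point. errors='replace' is chosen for illustration only.
as_text = png_header.decode('utf-8', errors='replace')

# Re-encoding the text does not give back the original bytes.
round_tripped = as_text.encode('utf-8')

print(png_header)
print(round_tripped)
print(round_tripped == png_header)
```

The invalid byte is replaced during decoding, so the re-encoded data no longer starts with a PNG signature and the image would be unusable.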
Old 07-26-2024, 11:07 AM   #8
unkn0wn
Hmm.
I'll just ignore the missing images and put up the recipes, then.

It has about 4 sub-publications that could work with the same code.
Old 07-26-2024, 11:31 AM   #9
kovidgoyal
I suggest you wait a little. I might have some time to implement a proper solution in the next week or two.
Old 07-26-2024, 01:53 PM   #10
unkn0wn
okay
All times are GMT -4. The time now is 12:29 PM.

