07-24-2024, 11:54 AM | #1 |
Fanatic
Posts: 542
Karma: 82944
Join Date: May 2021
Device: kindle
HTTP/3 support?
I see that mechanize doesn't support HTTP/3 requests yet.
Is it possible to do this some other way?
07-24-2024, 12:31 PM | #2 |
creator of calibre
Posts: 44,524
Karma: 24495784
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
You can use the read_url() function from scraper/simple.py
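A minimal sketch of the call shape, based on how read_url() is used later in this thread: the first argument is a list that read_url() uses to persist browser state between calls, and the return value is the rendered page as a string. Here read_url is a stand-in defined locally so the snippet runs without calibre; in a real recipe you would import it from calibre.scraper.simple.

```python
# Sketch of the read_url() call shape. In a recipe you would instead do:
#     from calibre.scraper.simple import read_url

def read_url(storage, url):
    # stand-in: the real function fetches the URL with a headless Qt
    # browser and returns the rendered page as a str
    return '<html><body>rendered %s</body></html>' % url

storage = []  # reused across calls so browser state persists
html = read_url(storage, 'https://www.science.org/toc/science/current')
print(html)
```

The storage list is created once (e.g. in the recipe's __init__) and passed to every call.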
07-26-2024, 02:49 AM | #3 |
Fanatic
Posts: 542
Karma: 82944
Join Date: May 2021
Device: kindle
How do I set custom headers?
I think I'll have to learn a bit about Qt and try these things. Also, mechanize will still be used when downloading images after processing the HTML content that I got with read_url in get_obfuscated_article(), so those downloads will also fail.
07-26-2024, 02:59 AM | #4 |
creator of calibre
Posts: 44,524
Karma: 24495784
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
What server supports HTTP/3 but not HTTP/1.1? The general solution for this is to create a new recipe class that uses QtWebEngine as the browser; that, however, is a long project. If you just need a quick hack instead, override get_browser() in your recipe to return self, then implement an open() method in your recipe. These will be used by the fetcher to download everything, including images.
Code:
def open(self, url):
    # here self.storage should be a list created in the
    # __init__ method of your recipe
    return read_url(self.storage, url)
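The quick hack can be sketched end to end with hypothetical stand-ins for calibre's pieces, so the control flow is visible outside calibre: get_browser() returns the recipe itself, and the fetcher then calls the recipe's own open() for every download, images included.

```python
# Sketch of the quick hack: the recipe acts as its own browser.
# read_url and the recipe class are stand-ins, not calibre's real ones.

def read_url(storage, url):  # stand-in for calibre.scraper.simple.read_url
    return '<html>content of %s</html>' % url


class QuickHackRecipe:  # stands in for a BasicNewsRecipe subclass
    def __init__(self):
        self.storage = []  # the list read_url uses, created in __init__

    def get_browser(self, *a, **kw):
        return self  # hand the recipe itself to the fetcher

    def open(self, url, *a, **kw):
        # accept and ignore extra keyword arguments (e.g. timeout=)
        # that the fetcher may pass along
        return read_url(self.storage, url)


recipe = QuickHackRecipe()
browser = recipe.get_browser()
print(browser.open('https://example.com/article', timeout=60))
```

Note the `*a, **kw` in open(): the fetcher passes extra arguments such as timeout=, so a bare one-argument open() would raise a TypeError.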
07-26-2024, 04:48 AM | #5 |
Fanatic
Posts: 542
Karma: 82944
Join Date: May 2021
Device: kindle
This is the recipe (Science magazine, was requested here). It got a 403 error, but the site could be accessed with read_url. The recipe below works, but images fail with 403.
Code:
#!/usr/bin/env python
from calibre.scraper.simple import read_url
from calibre.web.feeds.news import BasicNewsRecipe, classes
from calibre.ebooks.BeautifulSoup import BeautifulSoup


def absurl(url):
    if url.startswith('/'):
        url = 'https://www.science.org' + url
    return url


class science(BasicNewsRecipe):
    title = 'Science Journal'
    __author__ = 'unkn0wn'
    description = (
        'Science continues to publish the very best in research across the sciences, with articles that '
        'consistently rank among the most cited in the world.'
    )
    encoding = 'utf-8'
    no_javascript = True
    no_stylesheets = True
    remove_attributes = ['style', 'height', 'width']
    language = 'en'
    extra_css = '''
        .news-article__figure__caption {font-size:small; text-align:center;}
        .contributors, .core-self-citation, .meta-panel__left-content,
        .news-article__hero__top-meta, .news-article__hero__bottom-meta,
        #bibliography, #elettersSection {font-size:small;}
        img {display:block; margin:0 auto;}
        .core-lede {font-style:italic; color:#202020;}
    '''
    ignore_duplicate_articles = {'url'}

    keep_only_tags = [
        classes('meta-panel__left-content news-article__hero__info news-article__hero__figure bodySection'),
        dict(name='h1', attrs={'property': 'name'}),
        classes('core-lede contributors core-self-citation'),
        dict(attrs={'data-core-wrapper': 'content'}),
    ]
    remove_tags = [
        classes('pb-ad'),
    ]

    articles_are_obfuscated = True

    def get_obfuscated_article(self, url):
        return {'data': read_url([], url), 'url': url}

    def preprocess_html(self, soup):
        for p in soup.findAll(attrs={'role': 'paragraph'}):
            p.name = 'p'
        return soup

    def parse_index(self):
        url = 'https://www.science.org/toc/science/current'
        soup = BeautifulSoup(read_url([], url))
        tme = soup.find(**classes('journal-issue__vol'))
        if tme:
            self.timefmt = ' [%s]' % self.tag_to_string(tme).strip()
        det = soup.find(attrs={'id': 'journal-issue-details'})
        if det:
            self.description = self.tag_to_string(det).strip()

        feeds = []
        div = soup.find('div', attrs={'class': 'toc__body'})
        for sec in div.findAll('section', **classes('toc__section')):
            name = sec.find(**classes('sidebar-article-title--decorated'))
            section = self.tag_to_string(name).strip()
            self.log(section)
            articles = []
            for card in sec.findAll(**classes('card-header')):
                ti = card.find(**classes('article-title'))
                url = absurl(ti.a['href'])
                title = self.tag_to_string(ti).strip()
                desc = ''
                meta = card.find(**classes('card-meta'))
                if meta:
                    desc = self.tag_to_string(meta).strip()
                self.log('  ', title, '\n\t', desc, '\n\t', url)
                articles.append({'title': title, 'description': desc, 'url': url})
            feeds.append((section, articles))
        return feeds
Should I have not used index_to_soup with read_url?

  File "<string>", line 66, in parse_index
  File "calibre\web\feeds\news.py", line 731, in index_to_soup
TypeError: science.open() got an unexpected keyword argument 'timeout'

Last edited by unkn0wn; 07-26-2024 at 04:52 AM.
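The traceback above is a signature mismatch: index_to_soup() passes timeout= to the browser's open(), but a one-argument open() can't accept it. A minimal illustration with stand-in classes (not calibre's code):

```python
# The mismatch behind the TypeError, in miniature.

class Strict:
    def open(self, url):  # like a bare one-argument open()
        return 'fetched %s' % url


class Tolerant:
    def open(self, url, *a, **kw):  # swallows timeout= and friends
        return 'fetched %s' % url


try:
    Strict().open('https://www.science.org', timeout=120)
    strict_ok = True
except TypeError:
    strict_ok = False

print(strict_ok)  # the strict signature rejects the timeout= keyword
print(Tolerant().open('https://www.science.org', timeout=120))
```

Accepting and discarding the extra arguments is enough to satisfy the fetcher's call.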
07-26-2024, 06:24 AM | #6 |
creator of calibre
Posts: 44,524
Karma: 24495784
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Something like this (untested)
Code:
#!/usr/bin/env python
import threading
from io import BytesIO

from calibre.ebooks.BeautifulSoup import BeautifulSoup
from calibre.scraper.simple import read_url
from calibre.web.feeds.news import BasicNewsRecipe, classes


def absurl(url):
    if url.startswith('/'):
        url = 'https://www.science.org' + url
    return url


class Response(BytesIO):

    def __init__(self, url, text):
        super().__init__(text.encode('utf-8'))
        self.url = url

    def geturl(self):
        return self.url


class Browser:

    def __init__(self, local):
        self.thread_local = local
        self.thread_local.storage = []

    def open(self, url, *a, **kw):
        # note: Response() takes the url first, then the fetched text
        return Response(url, read_url(self.thread_local.storage, url))

    open_novisit = open


class science(BasicNewsRecipe):
    title = 'Science Journal'
    __author__ = 'unkn0wn'
    description = (
        'Science continues to publish the very best in research across the sciences, with articles that '
        'consistently rank among the most cited in the world.'
    )
    encoding = 'utf-8'
    no_javascript = True
    no_stylesheets = True
    remove_attributes = ['style', 'height', 'width']
    language = 'en'
    simultaneous_downloads = 1
    extra_css = '''
        .news-article__figure__caption {font-size:small; text-align:center;}
        .contributors, .core-self-citation, .meta-panel__left-content,
        .news-article__hero__top-meta, .news-article__hero__bottom-meta,
        #bibliography, #elettersSection {font-size:small;}
        img {display:block; margin:0 auto;}
        .core-lede {font-style:italic; color:#202020;}
    '''
    ignore_duplicate_articles = {'url'}

    keep_only_tags = [
        classes('meta-panel__left-content news-article__hero__info news-article__hero__figure bodySection'),
        dict(name='h1', attrs={'property': 'name'}),
        classes('core-lede contributors core-self-citation'),
        dict(attrs={'data-core-wrapper': 'content'}),
    ]
    remove_tags = [
        classes('pb-ad'),
    ]

    def __init__(self, *a, **kw):
        self.thread_local = threading.local()
        super().__init__(*a, **kw)

    def get_browser(self, *a, **kw):
        return Browser(self.thread_local)

    def preprocess_html(self, soup):
        for p in soup.findAll(attrs={'role': 'paragraph'}):
            p.name = 'p'
        return soup

    def parse_index(self):
        url = 'https://www.science.org/toc/science/current'
        soup = BeautifulSoup(read_url([], url))
        tme = soup.find(**classes('journal-issue__vol'))
        if tme:
            self.timefmt = ' [%s]' % self.tag_to_string(tme).strip()
        det = soup.find(attrs={'id': 'journal-issue-details'})
        if det:
            self.description = self.tag_to_string(det).strip()

        feeds = []
        div = soup.find('div', attrs={'class': 'toc__body'})
        for sec in div.findAll('section', **classes('toc__section')):
            name = sec.find(**classes('sidebar-article-title--decorated'))
            section = self.tag_to_string(name).strip()
            self.log(section)
            articles = []
            for card in sec.findAll(**classes('card-header')):
                ti = card.find(**classes('article-title'))
                url = absurl(ti.a['href'])
                title = self.tag_to_string(ti).strip()
                desc = ''
                meta = card.find(**classes('card-meta'))
                if meta:
                    desc = self.tag_to_string(meta).strip()
                self.log('  ', title, '\n\t', desc, '\n\t', url)
                articles.append({'title': title, 'description': desc, 'url': url})
            feeds.append((section, articles))
        return feeds
07-26-2024, 06:36 AM | #7 |
creator of calibre
Posts: 44,524
Karma: 24495784
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Actually, that won't work: read_url() is currently meant only for HTML files, not binary data.
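The limitation is easy to demonstrate with plain Python, nothing calibre-specific: read_url() returns a str, and raw image bytes are generally not valid UTF-8, so they cannot survive a text round-trip.

```python
# Why a text-only fetcher breaks image downloads: binary data is
# generally not valid UTF-8, so it cannot be carried as a str.

png_header = b'\x89PNG\r\n\x1a\n'  # the 8-byte signature of every PNG file

try:
    png_header.decode('utf-8')
    survived = True
except UnicodeDecodeError:
    survived = False

print(survived)  # the PNG signature is not decodable as UTF-8
```

So any image routed through a str-returning open() would be corrupted or fail outright on undecodable bytes.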
07-26-2024, 11:07 AM | #8 |
Fanatic
Posts: 542
Karma: 82944
Join Date: May 2021
Device: kindle
hmmm
I'll just ignore the missing images and put up the recipes then. It has about 4 sub-publications that could work with the same code.
07-26-2024, 11:31 AM | #9 |
creator of calibre
Posts: 44,524
Karma: 24495784
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
I suggest you wait a little; I might have some time to implement a proper solution in the next week or two.
07-26-2024, 01:53 PM | #10 |
Fanatic
Posts: 542
Karma: 82944
Join Date: May 2021
Device: kindle
|
okay