04-17-2024, 03:52 AM   #1
martencarlos
Junior Member
Posts: 7
Karma: 10
Join Date: Apr 2024
Device: Kindle paperwhite 2022
Help to finish the recipe of my favorite news site

Hi all,

I am trying to create a working news recipe for elcorreo.com, a Spanish news site. There is a built-in recipe, but it only downloads the links and nothing else, not even the cover.

link to official news site: www.elcorreo.com

I found that if you open any article and replace ".html" with "_amp.html" in the URL, you can use the Immersive Reader in Edge to read the full article. Inspecting the "_amp.html" page, you can find a script tag containing JSON with the content of the article.

So I started to create a custom recipe, but I can only get so far. I managed to retrieve the cover, title, subtitle and main image and to delete the rest that is not relevant. I need help adding the content of the article: rewriting the article URLs to fetch, and finding the script tag that contains the JSON with the article content.

This is the code that I have so far:

Spoiler:
#!/usr/bin/env python
__license__ = 'GPL v3'
__author__ = 'Carlos Marten based on Kovid Goyal official version'
__copyright__ = '2008, Kovid Goyal kovid@kovidgoyal.net'
description = 'Elcorreo Newspaper (Spain) - v1.0 16.04.2022'
__docformat__ = 'restructuredtext en'

'''
Elcorreo.com
'''

from calibre.web.feeds.news import BasicNewsRecipe
from html5_parser import parse
import datetime
from datetime import date


class Elcorreo(BasicNewsRecipe):
    __author__ = 'Carlos Marten'
    description = 'Elcorreo'
    now = datetime.datetime.now()
    title = u'El Correo [' + str(date.today()) + ']'
    publisher = u'Ediciones El Pa\xeds SL'
    category = 'News, politics, culture, economy, general interest'

    language = 'es'
    timefmt = '[%a, %d %b, %Y]'
    oldest_article = 5
    max_articles_per_feed = 4
    recursion = 2

    no_stylesheets = True
    remove_attributes = ['width', 'height', 'display', 'margin', 'padding', 'position', 'border']
    remove_javascript = True
    use_embedded_content = False
    ignore_duplicate_articles = {'title', 'url'}
    compress_news_images = False

    # auto_cleanup = True
    # scale_news_images_to_device = True

    def getcoverurl():
        # Runs once while the class body executes, so it takes no self.
        now = datetime.datetime.now()
        # Zero-pad the month: the original '/0' + month produced '010' to '012'
        # from October on. The day is left unpadded, as in the original.
        return 'https://portada.iperiodico.es/%d/%02d/%d_elcorreo.750.jpg' % (now.year, now.month, now.day)
    cover_url = getcoverurl()

    def preprocess_html(self, soup):
        # Replace every link that holds only text with that plain text.
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        return soup

    extra_css = '''
        img {
            all: initial;
            width: 100%
        }
        h1 { font-size: 22px }
        h2 { font-size: 20px }
    '''

    keep_only_tags = [
        dict(name='h1', attrs={'class': [
            'v-a-t',  # title
        ]}),
        dict(name='h2', attrs={'class': [
            'v-a-sub-t',  # subtitle
        ]}),

        dict(name='script', attrs={'type': 'application/ld+json'}),  # JSON with article (closed)

        dict(name='article', attrs={'class': [
            'v-a v-a--d v-a--d-bs v-a--p-b',  # article
        ]}),
        dict(name='div', attrs={'class': [
            'amp-access-hide',  # article (closed)
        ]}),
    ]

    remove_tags = [
        dict(attrs={'class': [
            'v-drpw__w',  # social
            'v-mdl-tpc',  # section topics related
            'content-exclusive-bg',  # paywall
            'v-d__btn-c',  # comment and report error
            'v-i-b',  # share
            'v-pill-m',  # play icon and enlarge image
            'v-mdl-ath__c',  # comments
        ]}),
        dict(attrs={'class': [
            'v-a-img',  # image
        ]}),
    ]

    def postprocess_html(self, soup, first):
        return soup

    feeds = [
        (u'Portada', u'https://www.elcorreo.com/rss/2.0/portada/'),
    ]

    calibre_most_common_ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'



I really tried with the calibre API and by reviewing other recipes, but I cannot manage to do it. I think I need to do something in the preprocess_html function, but I have no clue, really.

Can someone with extensive recipe knowledge help?
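
The JSON idea described in this post could be sketched roughly as follows. This is only a minimal sketch, not the fix that was eventually applied to the built-in recipe. It assumes the AMP page carries a schema.org JSON-LD block with "headline" and "articleBody" keys, which the thread does not confirm, and the helper name article_html_from_ld_json is hypothetical.

```python
import html
import json
import re

# Matches every <script type="application/ld+json"> block in a page.
LD_JSON_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def article_html_from_ld_json(raw_html):
    """Build a minimal HTML page from the first JSON-LD block that has an
    articleBody. Returns None when no such block is found.

    Assumption: the site uses the schema.org keys 'headline' and 'articleBody'.
    """
    for match in LD_JSON_RE.finditer(raw_html):
        try:
            data = json.loads(match.group(1))
        except ValueError:
            continue  # not valid JSON, try the next script tag
        # A JSON-LD block may hold a single object or a list of objects.
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and item.get('articleBody'):
                title = html.escape(item.get('headline', ''))
                paragraphs = ''.join(
                    '<p>%s</p>' % html.escape(p.strip())
                    for p in item['articleBody'].split('\n') if p.strip()
                )
                return '<html><body><h1>%s</h1>%s</body></html>' % (title, paragraphs)
    return None
```

In a recipe, one place to call such a helper would be BasicNewsRecipe.preprocess_raw_html(self, raw_html, url), returning the rebuilt page when the helper finds a body and raw_html otherwise. keep_only_tags would then have to match the rebuilt markup.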
04-17-2024, 07:24 AM   #2
martencarlos
I think I found how to replace the desktop URL with the mobile URL by adding this code:

# Replace the desktop URL with the mobile (AMP) URL
def get_article_url(self, article):
    desktopUrl = BasicNewsRecipe.get_article_url(self, article)
    if not desktopUrl:
        return desktopUrl  # no link in the feed entry, nothing to rewrite
    mobileUrl = desktopUrl.replace(".html", "_amp.html")
    return mobileUrl

But now I can't retrieve the article image, and I am still missing the article content, which sits as JSON inside the script tag.
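
A slightly stricter version of the URL rewrite is sketched below. It only touches a ".html" suffix at the end of the path, so query strings survive and URLs that are already AMP are left alone. The function name to_amp_url is hypothetical, and the "_amp.html" convention is taken from the first post rather than from any site documentation.

```python
from urllib.parse import urlsplit, urlunsplit


def to_amp_url(url):
    """Rewrite '.../article.html' to '.../article_amp.html'.

    Only the end of the path is changed; scheme, host, query and
    fragment are passed through untouched.
    """
    parts = urlsplit(url)
    path = parts.path
    if path.endswith('.html') and not path.endswith('_amp.html'):
        path = path[:-len('.html')] + '_amp.html'
    return urlunsplit(parts._replace(path=path))
```

Inside get_article_url, the str.replace call could then be swapped for to_amp_url(desktopUrl).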
04-17-2024, 08:05 AM   #3
unkn0wn
Fanatic
Posts: 543
Karma: 82944
Join Date: May 2021
Device: kindle
The built-in recipe isn't working?
04-17-2024, 08:49 AM   #4
martencarlos
Quote:
Originally Posted by unkn0wn View Post
The built-in recipe isn't working?
No, it only includes the links to the index and the link to each article.
04-18-2024, 03:59 AM   #5
unkn0wn
https://github.com/kovidgoyal/calibr...7b66c77715216f

I just tested this and the output is too large (>120 MB). Help me comment out some of the feeds.

There are so many articles from this website, just in the past 24 hours.
Code:
feeds = [
        ('Portada', 'http://www.elcorreo.com/rss/atom/portada'),
        ('Mundo', 'http://www.elcorreo.com/rss/atom/?section=internacional'),
        ('Bizkaia', 'http://www.elcorreo.com/rss/atom/?section=bizkaia'),
        ('Opinión', 'https://www.elcorreo.com/rss/atom/?section=opinion'),
        ('Internacional', 'https://www.elcorreo.com/rss/atom/?section=internacional'),
        ('Ciencia', 'https://www.elcorreo.com/rss/atom/?section=ciencia'),
        ('Guipuzkoa', 'http://www.elcorreo.com/rss/atom/?section=gipuzkoa'),
        ('Araba', 'http://www.elcorreo.com/rss/atom/?section=araba'),
        ('La Rioja', 'http://www.elcorreo.com/rss/atom/?section=larioja'),
        ('Miranda', 'http://www.elcorreo.com/rss/atom/?section=miranda'),
        ('Economía', 'http://www.elcorreo.com/rss/atom/?section=economia'),
        ('Culturas', 'http://www.elcorreo.com/rss/atom/?section=culturas'),
        ('Politica', 'http://www.elcorreo.com/rss/atom/?section=politica'),
        ('De tiendas', 'https://www.elcorreo.com/rss/atom/?section=de-tiendas'),
        ('Deportes', 'https://www.elcorreo.com/rss/atom/?section=deportes'),
        ('Elecciones', 'https://www.elcorreo.com/rss/atom/?section=elecciones'),
        ('Sociedad', 'https://www.elcorreo.com/rss/atom/?section=sociedad'),
        ('Vivir', 'https://www.elcorreo.com/rss/atom/?section=vivir'),
        ('Tecnología', 'http://www.elcorreo.com/rss/atom/?section=tecnologia'),
        ('Gente - Estilo', 'http://www.elcorreo.com/rss/atom/?section=gente-estilo'),
        ('Planes', 'http://www.elcorreo.com/rss/atom/?section=planes'),
        ('Athletic', 'http://www.elcorreo.com/rss/atom/?section=athletic'),
        ('Alavés', 'http://www.elcorreo.com/rss/atom/?section=alaves'),
        ('Bilbao Basket', 'http://www.elcorreo.com/rss/atom/?section=bilbaobasket'),
        ('Baskonia', 'http://www.elcorreo.com/rss/atom/?section=baskonia'),
        ('Deportes', 'http://www.elcorreo.com/rss/atom/?section=deportes'),
        ('Jaiak', 'http://www.elcorreo.com/rss/atom/?section=jaiak'),
        ('La Blanca', 'http://www.elcorreo.com/rss/atom/?section=la-blanca-vitoria'),
        ('Aste Nagusia', 'http://www.elcorreo.com/rss/atom/?section=aste-nagusia-bilbao'),
        ('Semana Santa', 'http://www.elcorreo.com/rss/atom/?section=semana-santa'),
        ('Festivales', 'http://www.elcorreo.com/rss/atom/?section=festivales')
    ]
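
Some entries in this list point at the same section more than once ('Mundo' and 'Internacional', and 'Deportes' twice), differing only in http versus https. A minimal sketch for dropping such repeats before trimming further is given below. The helper name dedupe_feeds is hypothetical, and treating http and https as the same feed is an assumption.

```python
from urllib.parse import urlsplit


def dedupe_feeds(feeds):
    """Keep the first (title, url) pair for each distinct feed address.

    Two URLs count as the same feed when host, path and query match,
    ignoring the scheme (http vs https) and a trailing slash.
    """
    seen = set()
    unique = []
    for title, url in feeds:
        parts = urlsplit(url)
        key = (parts.netloc.lower(), parts.path.rstrip('/'), parts.query)
        if key not in seen:
            seen.add(key)
            unique.append((title, url))
    return unique
```

The result can be assigned straight back, for example feeds = dedupe_feeds(feeds) at the end of the recipe's class body once the helper is defined above the class.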
04-18-2024, 10:55 AM   #6
martencarlos
Hi,

I kept the most important ones. There can't be that many articles in 24 hours. I guess a lot of them are duplicates that appear in more than one feed. This is not a big newspaper.
Also, maybe the images are not optimized.

feeds = [
('Portada', 'http://www.elcorreo.com/rss/atom/portada'),
('Mundo', 'http://www.elcorreo.com/rss/atom/?section=internacional'),
('Bizkaia', 'http://www.elcorreo.com/rss/atom/?section=bizkaia'),
('Opinión', 'https://www.elcorreo.com/rss/atom/?section=opinion'),
('Internacional', 'https://www.elcorreo.com/rss/atom/?section=internacional'),
('Ciencia', 'https://www.elcorreo.com/rss/atom/?section=ciencia'),
('Economía', 'http://www.elcorreo.com/rss/atom/?section=economia'),
('Politica', 'http://www.elcorreo.com/rss/atom/?section=politica'),
('Deportes', 'https://www.elcorreo.com/rss/atom/?section=deportes'),
('Tecnología', 'http://www.elcorreo.com/rss/atom/?section=tecnologia'),
('Deportes', 'http://www.elcorreo.com/rss/atom/?section=deportes'),
]
04-18-2024, 11:35 AM   #7
martencarlos
Btw, I just had a look at the new recipe code and used it to download the newspaper. I don't know how you did it, but it looks perfect. Thank you very much!

And you were right, there are a lot of articles and they are not duplicated.

Any ideas on how we could reduce the size so that it can be sent via email to the Kindle?

Maybe optimize the images?

Thanks again! I really appreciate it.
04-18-2024, 12:20 PM   #8
martencarlos
OK, by adding the following I managed to shrink the EPUB to a decent size to send via email:

max_articles_per_feed = 10 #articles
compress_news_images = True
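
If the file is still too large, BasicNewsRecipe has a few more size-related options that can be combined with the two above. The sketch below collects them in a plain class so it runs without calibre. The class name SizeLimits is hypothetical and the numeric values are assumptions chosen for illustration, not settings tested against this site.

```python
class SizeLimits:
    """Size-related BasicNewsRecipe attributes, gathered for copying
    into a recipe class body. All values are illustrative assumptions."""

    # Fewer articles per feed means fewer pages and images.
    max_articles_per_feed = 10

    # Re-encode downloaded images as JPEG at reduced quality.
    compress_news_images = True

    # Upper bound per image in KB. Ignored unless compress_news_images
    # is True, and overrides compress_news_images_auto_size when set.
    compress_news_images_max_size = 50

    # Maximum (width, height) in pixels to scale images down to.
    scale_news_images = (800, 800)
```

Copying the four attributes into the recipe class body has the same effect as the two lines above plus the two extra limits.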
06-27-2024, 03:00 AM   #9
martencarlos
Hello,

Sorry for reopening the thread, but the built-in recipe for 'El Correo' stopped working.

I get the following error message:
Spoiler:


Fetch news from El Correo [2024-06-27]
Conversion options changed from defaults:
verbose: 2
output_profile: 'kindle_pw3'
Resolved conversion options
calibre version: 7.12.0
{'add_alt_text_to_img': False,
'asciiize': False,
'author_sort': None,
'authors': None,
'base_font_size': 0,
'book_producer': None,
'change_justification': 'original',
'chapter': None,
'chapter_mark': 'pagebreak',
'comments': None,
'cover': None,
'debug_pipeline': None,
'dehyphenate': True,
'delete_blank_paragraphs': True,
'disable_font_rescaling': False,
'dont_download_recipe': False,
'dont_split_on_page_breaks': True,
'duplicate_links_in_toc': False,
'embed_all_fonts': False,
'embed_font_family': None,
'enable_heuristics': False,
'epub_flatten': False,
'epub_inline_toc': False,
'epub_max_image_size': 'none',
'epub_toc_at_end': False,
'epub_version': '2',
'expand_css': False,
'extra_css': None,
'extract_to': None,
'filter_css': None,
'fix_indents': True,
'flow_size': 260,
'font_size_mapping': None,
'format_scene_breaks': True,
'html_unwrap_factor': 0.4,
'input_encoding': None,
'input_profile': <calibre.customize.profiles.InputProfile object at 0x000001E66270EED0>,
'insert_blank_line': False,
'insert_blank_line_size': 0.5,
'insert_metadata': False,
'isbn': None,
'italicize_common_cases': True,
'keep_ligatures': False,
'language': None,
'level1_toc': None,
'level2_toc': None,
'level3_toc': None,
'line_height': 0,
'linearize_tables': False,
'lrf': False,
'margin_bottom': 5.0,
'margin_left': 5.0,
'margin_right': 5.0,
'margin_top': 5.0,
'markup_chapter_headings': True,
'max_toc_links': 50,
'minimum_line_height': 120.0,
'no_chapters_in_toc': False,
'no_default_epub_cover': False,
'no_inline_navbars': False,
'no_svg_cover': False,
'output_profile': <calibre.customize.profiles.KindlePaperWhite3Output object at 0x000001E662722E50>,
'page_breaks_before': None,
'prefer_metadata_cover': False,
'preserve_cover_aspect_ratio': False,
'pretty_print': True,
'pubdate': None,
'publisher': None,
'rating': None,
'read_metadata_from_opf': None,
'remove_fake_margins': True,
'remove_first_image': False,
'remove_paragraph_spacing': False,
'remove_paragraph_spacing_indent_size': 1.5,
'renumber_headings': True,
'replace_scene_breaks': '',
'search_replace': None,
'series': None,
'series_index': None,
'smarten_punctuation': False,
'sr1_replace': '',
'sr1_search': '',
'sr2_replace': '',
'sr2_search': '',
'sr3_replace': '',
'sr3_search': '',
'start_reading_at': None,
'subset_embedded_fonts': False,
'tags': None,
'test': False,
'timestamp': None,
'title': None,
'title_sort': None,
'toc_filter': None,
'toc_threshold': 6,
'toc_title': None,
'transform_css_rules': None,
'transform_html_rules': None,
'unsmarten_punctuation': False,
'unwrap_lines': True,
'use_auto_toc': False,
'verbose': 2}
InputFormatPlugin: Recipe Input running
Downloading recipe urn: custom:1005
Using user agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Failed feed: Portada
Traceback (most recent call last):
File "calibre\web\feeds\news.py", line 1722, in parse_feeds
File "mechanize\_mechanize.py", line 241, in open_novisit
File "mechanize\_mechanize.py", line 313, in _mech_open
mechanize._response.get_seek_wrapper_class.<locals>.httperror_seek_wrapper: HTTP Error 403: Forbidden

Failed feed: Mundo
Traceback (most recent call last):
File "calibre\web\feeds\news.py", line 1722, in parse_feeds
File "mechanize\_mechanize.py", line 241, in open_novisit
File "mechanize\_mechanize.py", line 313, in _mech_open
mechanize._response.get_seek_wrapper_class.<locals>.httperror_seek_wrapper: HTTP Error 403: Forbidden

Failed feed: Bizkaia
Traceback (most recent call last):
File "calibre\web\feeds\news.py", line 1722, in parse_feeds
File "mechanize\_mechanize.py", line 241, in open_novisit
File "mechanize\_mechanize.py", line 313, in _mech_open
mechanize._response.get_seek_wrapper_class.<locals>.httperror_seek_wrapper: HTTP Error 403: Forbidden

Failed feed: Opinión
Traceback (most recent call last):
File "calibre\web\feeds\news.py", line 1722, in parse_feeds
File "mechanize\_mechanize.py", line 241, in open_novisit
File "mechanize\_mechanize.py", line 313, in _mech_open
mechanize._response.get_seek_wrapper_class.<locals>.httperror_seek_wrapper: HTTP Error 403: Forbidden

Failed feed: Internacional
Traceback (most recent call last):
File "calibre\web\feeds\news.py", line 1722, in parse_feeds
File "mechanize\_mechanize.py", line 241, in open_novisit
File "mechanize\_mechanize.py", line 313, in _mech_open
mechanize._response.get_seek_wrapper_class.<locals>.httperror_seek_wrapper: HTTP Error 403: Forbidden

Failed feed: Economía
Traceback (most recent call last):
File "calibre\web\feeds\news.py", line 1722, in parse_feeds
File "mechanize\_mechanize.py", line 241, in open_novisit
File "mechanize\_mechanize.py", line 313, in _mech_open
mechanize._response.get_seek_wrapper_class.<locals>.httperror_seek_wrapper: HTTP Error 403: Forbidden

Failed feed: Planes
Traceback (most recent call last):
File "calibre\web\feeds\news.py", line 1722, in parse_feeds
File "mechanize\_mechanize.py", line 241, in open_novisit
File "mechanize\_mechanize.py", line 313, in _mech_open
mechanize._response.get_seek_wrapper_class.<locals>.httperror_seek_wrapper: HTTP Error 403: Forbidden

Traceback (most recent call last):
File "runpy.py", line 198, in _run_module_as_main
File "runpy.py", line 88, in _run_code
File "site.py", line 83, in <module>
File "site.py", line 78, in main
File "site.py", line 50, in run_entry_point
File "calibre\utils\ipc\worker.py", line 215, in main
File "calibre\gui2\convert\gui_conversion.py", line 31, in gui_convert_recipe
File "calibre\gui2\convert\gui_conversion.py", line 25, in gui_convert
File "calibre\ebooks\conversion\plumber.py", line 1127, in run
File "calibre\customize\conversion.py", line 245, in __call__
File "calibre\ebooks\conversion\plugins\recipe_input.py", line 138, in convert
File "calibre\web\feeds\news.py", line 1069, in download
File "calibre\web\feeds\news.py", line 1259, in build_index
ValueError: No articles found, aborting
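
Every feed in the log above fails with HTTP 403 while the recipe is sending the Googlebot user agent. Whether the user agent is the cause is not established in this thread. The small standalone probe sketched below is one way to check, outside calibre, how the server answers a given feed URL for a given user agent string. The helper names build_request and probe are hypothetical.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent):
    """Create a GET request that sends the given User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


def probe(url, user_agent, timeout=30):
    """Return the HTTP status code the server answers with.

    HTTP errors such as 403 are returned as codes instead of raised;
    connection-level failures (DNS, timeout) still raise URLError.
    """
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


# Example use (needs network access, so it is left commented out):
# feed = 'http://www.elcorreo.com/rss/atom/portada'
# for ua in ('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
#            'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'):
#     print(ua, probe(feed, ua))
```

If one user agent gets a 200 where another gets a 403, that points at user-agent filtering. If both get a 403, the block is more likely tied to something else, such as the client address.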

Last edited by theducks; 06-27-2024 at 04:14 AM. Reason: SPOILER LOG files

Tags
elcorreo, elcorreo.com, recipe, recipe broken, recipe request


All times are GMT -4.