Help to finish the recipe of my favorite news site

martencarlos · 04-17-2024, 03:52 AM

Hi all,

I am trying to create a working news recipe for elcorreo.com, a Spanish news site. There is a built-in recipe, but it only downloads the links and nothing else, not even the cover.

Link to the official news site: www.elcorreo.com

I found that if you open any article and replace ".html" with "_amp.html" in the URL, you can open the 'immersive reader' in Edge to read the full article. Inspecting the "_amp.html" page, you can also find a JSON script tag with the content of the article.

So I started writing a custom recipe, but I can only get so far. I managed to retrieve the cover, title, subtitle and main image and to delete the rest that is not relevant. I still need help adding the article content: replacing the article URLs with their "_amp.html" versions and finding the script tag that contains the JSON with the article content.

This is the code that I have so far:

Spoiler:

I really tried with the calibre API and by reviewing other recipes, but I cannot manage to do it. I think I need to do something in the preprocess_html function, but I have no clue, really.

Can someone with extensive recipe knowledge help?
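
The script-tag lookup described in this post could be sketched roughly as follows. This is only a minimal, standard-library illustration: it assumes the JSON sits in a script tag of type "application/ld+json" (the type used in the recipe's keep_only_tags) and that the article text is stored under an "articleBody" key, which is an assumption about the page and not something confirmed in the thread. The names LdJsonCollector and extract_article_body are hypothetical.

```python
import json
from html.parser import HTMLParser


class LdJsonCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_ld_json = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script' and dict(attrs).get('type') == 'application/ld+json':
            self._in_ld_json = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_ld_json:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == 'script' and self._in_ld_json:
            self.blocks.append(''.join(self._buffer))
            self._in_ld_json = False


def extract_article_body(html):
    """Return the first 'articleBody' found in a JSON script block, else None.

    Assumption: the page stores the article text under 'articleBody'.
    """
    collector = LdJsonCollector()
    collector.feed(html)
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except ValueError:
            continue  # skip blocks that are not valid JSON
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and item.get('articleBody'):
                return item['articleBody']
    return None
```

Inside a recipe, a function like this would most likely be called on the raw page source before the tags are filtered, and the returned text wrapped in paragraph tags.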

martencarlos · 04-17-2024, 07:24 AM

I think I found how to replace the desktop URL with the mobile URL by adding this code:

#replace desktop url with mobile url
def get_article_url(self, article):
    desktopUrl = BasicNewsRecipe.get_article_url(self, article)
    mobileUrl = desktopUrl.replace(".html", "_amp.html")
    return mobileUrl

But now I can't retrieve the article image, and I am still missing the article content, which is in the JSON inside the script tag.
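
The URL rewrite above replaces every ".html" in the string and would add a second "_amp" to a URL that already has one. A slightly more defensive variant might look like the sketch below. The helper name to_amp_url is hypothetical, and the code assumes article URLs end in ".html" with no query string.

```python
def to_amp_url(url):
    """Turn a desktop article URL into its "_amp.html" variant.

    Assumption: article URLs end in ".html" and carry no query string.
    """
    if url.endswith('_amp.html'):
        return url  # already the AMP version
    if url.endswith('.html'):
        return url[:-len('.html')] + '_amp.html'
    return url  # leave anything else untouched
```

In the recipe, get_article_url would then return to_amp_url(BasicNewsRecipe.get_article_url(self, article)).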

unkn0wn · 04-17-2024, 08:05 AM

Isn't the built-in recipe working?

martencarlos · 04-17-2024, 08:49 AM

Quote:

Originally Posted by unkn0wn

Isn't the built-in recipe working?

No, it only includes the index links and the link to each article.

unkn0wn · 04-18-2024, 03:59 AM

https://github.com/kovidgoyal/calibr...7b66c77715216f

I just tested this and the output is too large (>120 MB). Help me weed out some of the feeds.

There are so many articles from this website, just in the past 24 hours.

Code:

feeds = [
        ('Portada', 'http://www.elcorreo.com/rss/atom/portada'),
        ('Mundo', 'http://www.elcorreo.com/rss/atom/?section=internacional'),
        ('Bizkaia', 'http://www.elcorreo.com/rss/atom/?section=bizkaia'),
        ('Opinión', 'https://www.elcorreo.com/rss/atom/?section=opinion'),
        ('Internacional', 'https://www.elcorreo.com/rss/atom/?section=internacional'),
        ('Ciencia', 'https://www.elcorreo.com/rss/atom/?section=ciencia'),
        ('Guipuzkoa', 'http://www.elcorreo.com/rss/atom/?section=gipuzkoa'),
        ('Araba', 'http://www.elcorreo.com/rss/atom/?section=araba'),
        ('La Rioja', 'http://www.elcorreo.com/rss/atom/?section=larioja'),
        ('Miranda', 'http://www.elcorreo.com/rss/atom/?section=miranda'),
        ('Economía', 'http://www.elcorreo.com/rss/atom/?section=economia'),
        ('Culturas', 'http://www.elcorreo.com/rss/atom/?section=culturas'),
        ('Politica', 'http://www.elcorreo.com/rss/atom/?section=politica'),
        ('De tiendas', 'https://www.elcorreo.com/rss/atom/?section=de-tiendas'),
        ('Deportes', 'https://www.elcorreo.com/rss/atom/?section=deportes'),
        ('Elecciones', 'https://www.elcorreo.com/rss/atom/?section=elecciones'),
        ('Sociedad', 'https://www.elcorreo.com/rss/atom/?section=sociedad'),
        ('Vivir', 'https://www.elcorreo.com/rss/atom/?section=vivir'),
        ('Tecnología', 'http://www.elcorreo.com/rss/atom/?section=tecnologia'),
        ('Gente - Estilo', 'http://www.elcorreo.com/rss/atom/?section=gente-estilo'),
        ('Planes', 'http://www.elcorreo.com/rss/atom/?section=planes'),
        ('Athletic', 'http://www.elcorreo.com/rss/atom/?section=athletic'),
        ('Alavés', 'http://www.elcorreo.com/rss/atom/?section=alaves'),
        ('Bilbao Basket', 'http://www.elcorreo.com/rss/atom/?section=bilbaobasket'),
        ('Baskonia', 'http://www.elcorreo.com/rss/atom/?section=baskonia'),
        ('Deportes', 'http://www.elcorreo.com/rss/atom/?section=deportes'),
        ('Jaiak', 'http://www.elcorreo.com/rss/atom/?section=jaiak'),
        ('La Blanca', 'http://www.elcorreo.com/rss/atom/?section=la-blanca-vitoria'),
        ('Aste Nagusia', 'http://www.elcorreo.com/rss/atom/?section=aste-nagusia-bilbao'),
        ('Semana Santa', 'http://www.elcorreo.com/rss/atom/?section=semana-santa'),
        ('Festivales', 'http://www.elcorreo.com/rss/atom/?section=festivales')
    ]

martencarlos · 04-18-2024, 10:55 AM

Hi,

I left the most important ones. There can't be that many articles in 24 hours. I guess a lot are duplicates that appear in more than one feed. This is not a big newspaper.
Also, maybe the images are not optimized.

feeds = [
        ('Portada', 'http://www.elcorreo.com/rss/atom/portada'),
        ('Mundo', 'http://www.elcorreo.com/rss/atom/?section=internacional'),
        ('Bizkaia', 'http://www.elcorreo.com/rss/atom/?section=bizkaia'),
        ('Opinión', 'https://www.elcorreo.com/rss/atom/?section=opinion'),
        ('Internacional', 'https://www.elcorreo.com/rss/atom/?section=internacional'),
        ('Ciencia', 'https://www.elcorreo.com/rss/atom/?section=ciencia'),
        ('Economía', 'http://www.elcorreo.com/rss/atom/?section=economia'),
        ('Politica', 'http://www.elcorreo.com/rss/atom/?section=politica'),
        ('Deportes', 'https://www.elcorreo.com/rss/atom/?section=deportes'),
        ('Tecnología', 'http://www.elcorreo.com/rss/atom/?section=tecnologia'),
        ('Deportes', 'http://www.elcorreo.com/rss/atom/?section=deportes'),
    ]
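
Some entries in the list above point at the same section ('Mundo' and 'Internacional' both use section=internacional, and 'Deportes' appears twice, once over http and once over https). A small sketch of how such repeats could be dropped before handing the list to the recipe follows. The helper name dedupe_feeds is hypothetical, and treating http and https URLs as the same feed is an assumption.

```python
from urllib.parse import urlsplit


def dedupe_feeds(feeds):
    """Keep the first (title, url) pair for each distinct feed URL.

    Assumption: URLs differing only in scheme or a trailing slash
    are the same feed.
    """
    seen = set()
    unique = []
    for title, url in feeds:
        parts = urlsplit(url)
        key = (parts.netloc.lower(), parts.path.rstrip('/'), parts.query)
        if key in seen:
            continue
        seen.add(key)
        unique.append((title, url))
    return unique
```

The result can be assigned directly, for example feeds = dedupe_feeds([...]) in the class body, as long as the helper is defined above the class.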

martencarlos · 04-18-2024, 11:35 AM

Btw, I just had a look at the new Recipe code and used it to download the newspaper. I don't know how you did it but it looks perfect. Thank you very much!

And you were right, there are a lot of articles and they are not duplicated.

Any ideas how we could reduce the size so it can be sent via email to the Kindle?

Maybe optimize the images?

Thanks again! I really appreciate it.

martencarlos · 04-18-2024, 12:20 PM

OK, by adding the following I managed to shrink the EPUB to a decent size to send via email:

max_articles_per_feed = 10 #articles
compress_news_images = True
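
For context, here is a sketch of where these two settings sit in a recipe class, together with two further image options that BasicNewsRecipe provides (compress_news_images_max_size and scale_news_images). The class name ElCorreoCompact and all the values are illustrative assumptions, not what the built-in recipe uses. The import fallback is only there so the snippet also loads outside calibre.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Outside calibre: stand-in base class so the sketch still loads.
    BasicNewsRecipe = object


class ElCorreoCompact(BasicNewsRecipe):  # hypothetical class name
    title = 'El Correo (compact)'

    # The two settings from the post above.
    max_articles_per_feed = 10
    compress_news_images = True

    # Optional extras; the values are illustrative assumptions.
    compress_news_images_max_size = 30   # target size per image, in KB
    scale_news_images = (800, 800)       # maximum (width, height) in pixels
```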

martencarlos · 06-27-2024, 03:00 AM

Hello,

Sorry for reopening the thread but the built-in recipe for 'El correo' stopped working.

I get the following error message:

Spoiler:

Spoiler contents from martencarlos's first post (04-17-2024, 03:52 AM), the recipe code so far:

Code:

#!/usr/bin/env python
__license__ = 'GPL v3'
__author__ = 'Carlos Marten based on Kovid Goyal official version'
__copyright__ = '2008, Kovid Goyal kovid@kovidgoyal.net'
description = 'Elcorreo Newspaper (Spain) - v1.0 16.04.2022'
__docformat__ = 'restructuredtext en'

'''
Elcorreo.com
'''

from calibre.web.feeds.news import BasicNewsRecipe
from html5_parser import parse
import datetime
from datetime import date


class Elcorreo(BasicNewsRecipe):
    __author__ = 'Carlos Marten'
    description = 'Elcorreo'

    now = datetime.datetime.now()
    title = u'El Correo ['+str(date.today())+']'
    publisher = u'Ediciones El Pa\xeds SL'
    category = 'News, politics, culture, economy, general interest'
    language = 'es'
    timefmt = '[%a, %d %b, %Y]'
    oldest_article = 5
    max_articles_per_feed = 4
    recursion = 2
    no_stylesheets = True
    remove_attributes = ['width', 'height','display','margin','padding', 'position','border']
    remove_javascript = True
    use_embedded_content = False
    ignore_duplicate_articles = {'title', 'url'}
    compress_news_images = False
    #auto_cleanup = True
    #scale_news_images_to_device = True

    def getcoverurl():
        now = datetime.datetime.now()
        return 'https://portada.iperiodico.es/'+str(now.year)+'/0'+str(now.month)+'/'+str(now.day)+'_elcorreo.750.jpg'

    cover_url = getcoverurl()

    def preprocess_html(self, soup):
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        return soup

    extra_css = '''
        img{ all: initial; width: 100% }
        h1 { font-size: 22px }
        h2 { font-size: 20px }
    '''

    keep_only_tags = [
        dict(name='h1', attrs={'class': [
            'v-a-t', #title
        ]}),
        dict(name='h2', attrs={'class': [
            'v-a-sub-t', #subtitle
        ]}),
        dict(name='script', attrs={'type': 'application/ld+json',}), #json with article (closed)
        dict(name='article', attrs={'class': [
            'v-a v-a--d v-a--d-bs v-a--p-b', #article
        ]}),
        dict(name='div', attrs={'class': [
            'amp-access-hide', #article (closed)
        ]}),
    ]

    remove_tags = [
        dict(attrs={'class': [
            'v-drpw__w', #social
            'v-mdl-tpc', #section topics related
            'content-exclusive-bg', #paywall
            'v-d__btn-c', #comment and report error
            'v-i-b', #share
            'v-pill-m', #play icon and enlarge image
            'v-mdl-ath__c', #comments
        ]},),
        dict(attrs={'class': [
            'v-a-img', #image
        ]},),
    ]

    def postprocess_html(self, soup, first):
        return soup

    feeds = [
        (u'Portada', u'https://www.elcorreo.com/rss/2.0/portada/'),
    ]

    calibre_most_common_ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'

Spoiler contents from martencarlos's post of 06-27-2024, 03:00 AM, the error log:

Fetch news from El Correo [2024-06-27]
Conversion options changed from defaults:
  verbose: 2
  output_profile: 'kindle_pw3'
Resolved conversion options
calibre version: 7.12.0
{'add_alt_text_to_img': False, 'asciiize': False, 'author_sort': None, 'authors': None,
 'base_font_size': 0, 'book_producer': None, 'change_justification': 'original',
 'chapter': None, 'chapter_mark': 'pagebreak', 'comments': None, 'cover': None,
 'debug_pipeline': None, 'dehyphenate': True, 'delete_blank_paragraphs': True,
 'disable_font_rescaling': False, 'dont_download_recipe': False,
 'dont_split_on_page_breaks': True, 'duplicate_links_in_toc': False,
 'embed_all_fonts': False, 'embed_font_family': None, 'enable_heuristics': False,
 'epub_flatten': False, 'epub_inline_toc': False, 'epub_max_image_size': 'none',
 'epub_toc_at_end': False, 'epub_version': '2', 'expand_css': False, 'extra_css': None,
 'extract_to': None, 'filter_css': None, 'fix_indents': True, 'flow_size': 260,
 'font_size_mapping': None, 'format_scene_breaks': True, 'html_unwrap_factor': 0.4,
 'input_encoding': None,
 'input_profile': <calibre.customize.profiles.InputProfile object at 0x000001E66270EED0>,
 'insert_blank_line': False, 'insert_blank_line_size': 0.5, 'insert_metadata': False,
 'isbn': None, 'italicize_common_cases': True, 'keep_ligatures': False, 'language': None,
 'level1_toc': None, 'level2_toc': None, 'level3_toc': None, 'line_height': 0,
 'linearize_tables': False, 'lrf': False, 'margin_bottom': 5.0, 'margin_left': 5.0,
 'margin_right': 5.0, 'margin_top': 5.0, 'markup_chapter_headings': True,
 'max_toc_links': 50, 'minimum_line_height': 120.0, 'no_chapters_in_toc': False,
 'no_default_epub_cover': False, 'no_inline_navbars': False, 'no_svg_cover': False,
 'output_profile': <calibre.customize.profiles.KindlePaperWhite3Output object at 0x000001E662722E50>,
 'page_breaks_before': None, 'prefer_metadata_cover': False,
 'preserve_cover_aspect_ratio': False, 'pretty_print': True, 'pubdate': None,
 'publisher': None, 'rating': None, 'read_metadata_from_opf': None,
 'remove_fake_margins': True, 'remove_first_image': False,
 'remove_paragraph_spacing': False, 'remove_paragraph_spacing_indent_size': 1.5,
 'renumber_headings': True, 'replace_scene_breaks': '', 'search_replace': None,
 'series': None, 'series_index': None, 'smarten_punctuation': False,
 'sr1_replace': '', 'sr1_search': '', 'sr2_replace': '', 'sr2_search': '',
 'sr3_replace': '', 'sr3_search': '', 'start_reading_at': None,
 'subset_embedded_fonts': False, 'tags': None, 'test': False, 'timestamp': None,
 'title': None, 'title_sort': None, 'toc_filter': None, 'toc_threshold': 6,
 'toc_title': None, 'transform_css_rules': None, 'transform_html_rules': None,
 'unsmarten_punctuation': False, 'unwrap_lines': True, 'use_auto_toc': False,
 'verbose': 2}
InputFormatPlugin: Recipe Input running
Downloading recipe urn: custom:1005
Using user agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Failed feed: Portada
Traceback (most recent call last):
  File "calibre\web\feeds\news.py", line 1722, in parse_feeds
  File "mechanize\_mechanize.py", line 241, in open_novisit
  File "mechanize\_mechanize.py", line 313, in _mech_open
mechanize._response.get_seek_wrapper_class.<locals>.httperror_seek_wrapper: HTTP Error 403: Forbidden
Failed feed: Mundo
Traceback (most recent call last):
  File "calibre\web\feeds\news.py", line 1722, in parse_feeds
  File "mechanize\_mechanize.py", line 241, in open_novisit
  File "mechanize\_mechanize.py", line 313, in _mech_open
mechanize._response.get_seek_wrapper_class.<locals>.httperror_seek_wrapper: HTTP Error 403: Forbidden
Failed feed: Bizkaia
Traceback (most recent call last):
  File "calibre\web\feeds\news.py", line 1722, in parse_feeds
  File "mechanize\_mechanize.py", line 241, in open_novisit
  File "mechanize\_mechanize.py", line 313, in _mech_open
mechanize._response.get_seek_wrapper_class.<locals>.httperror_seek_wrapper: HTTP Error 403: Forbidden
Failed feed: Opinión
Traceback (most recent call last):
  File "calibre\web\feeds\news.py", line 1722, in parse_feeds
  File "mechanize\_mechanize.py", line 241, in open_novisit
  File "mechanize\_mechanize.py", line 313, in _mech_open
mechanize._response.get_seek_wrapper_class.<locals>.httperror_seek_wrapper: HTTP Error 403: Forbidden
Failed feed: Internacional
Traceback (most recent call last):
  File "calibre\web\feeds\news.py", line 1722, in parse_feeds
  File "mechanize\_mechanize.py", line 241, in open_novisit
  File "mechanize\_mechanize.py", line 313, in _mech_open
mechanize._response.get_seek_wrapper_class.<locals>.httperror_seek_wrapper: HTTP Error 403: Forbidden
Failed feed: Economía
Traceback (most recent call last):
  File "calibre\web\feeds\news.py", line 1722, in parse_feeds
  File "mechanize\_mechanize.py", line 241, in open_novisit
  File "mechanize\_mechanize.py", line 313, in _mech_open
mechanize._response.get_seek_wrapper_class.<locals>.httperror_seek_wrapper: HTTP Error 403: Forbidden
Failed feed: Planes
Traceback (most recent call last):
  File "calibre\web\feeds\news.py", line 1722, in parse_feeds
  File "mechanize\_mechanize.py", line 241, in open_novisit
  File "mechanize\_mechanize.py", line 313, in _mech_open
mechanize._response.get_seek_wrapper_class.<locals>.httperror_seek_wrapper: HTTP Error 403: Forbidden
Traceback (most recent call last):
  File "runpy.py", line 198, in _run_module_as_main
  File "runpy.py", line 88, in _run_code
  File "site.py", line 83, in <module>
  File "site.py", line 78, in main
  File "site.py", line 50, in run_entry_point
  File "calibre\utils\ipc\worker.py", line 215, in main
  File "calibre\gui2\convert\gui_conversion.py", line 31, in gui_convert_recipe
  File "calibre\gui2\convert\gui_conversion.py", line 25, in gui_convert
  File "calibre\ebooks\conversion\plumber.py", line 1127, in run
  File "calibre\customize\conversion.py", line 245, in __call__
  File "calibre\ebooks\conversion\plugins\recipe_input.py", line 138, in convert
  File "calibre\web\feeds\news.py", line 1069, in download
  File "calibre\web\feeds\news.py", line 1259, in build_index
ValueError: No articles found, aborting

Last edited by theducks; 06-27-2024 at 04:14 AM. Reason: SPOILER LOG files

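The log shows every feed request returning HTTP 403 while the recipe identifies itself with a Googlebot user agent. The thread does not say what the cause is. One thing that could be checked is whether the status code changes with the user agent, as in the rough sketch below. The function name probe_user_agents and its opener parameter are hypothetical; the parameter exists only so the check can be exercised without network access.

```python
import urllib.error
import urllib.request


def probe_user_agents(url, user_agents, opener=urllib.request.urlopen):
    """Request url once per user agent and return {label: HTTP status}."""
    results = {}
    for label, agent in user_agents.items():
        request = urllib.request.Request(url, headers={'User-Agent': agent})
        try:
            with opener(request, timeout=30) as response:
                results[label] = response.status
        except urllib.error.HTTPError as err:
            results[label] = err.code  # e.g. 403 when the server refuses
    return results
```

Called with one of the feed URLs and a dictionary such as {'googlebot': ..., 'browser': ...}, it returns the status code seen for each user agent. If the codes differ, the user agent is a likely factor; if they are all 403, the cause lies elsewhere.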