11-27-2015, 05:51 AM | #16 |
Member
Posts: 18
Karma: 10
Join Date: Jan 2011
Device: sony prs-650
|
|
11-27-2015, 06:52 AM | #17 |
creator of calibre
Posts: 43,994
Karma: 22669822
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
The size difference is simply because the new recipe does not reduce image quality.
|
12-14-2015, 04:38 AM | #18 |
Connoisseur
Posts: 67
Karma: 10
Join Date: Oct 2012
Device: Kindle 3
|
Supplemental feeds
The Guardian changed its web format drastically in November 2015. Before that, articles from the extra sections were stored in named folders, e.g. "Cook", "G2", etc., and the old script would scrape them all in. A member of the Guardian's User Help team sent me a link to a missing article from the Cook section, pointing me to the URL www.theguardian.com/lifeandstyle/2015/nov/14/, and further investigation showed that nearly all articles from supplements are now stored in date folders.
Following Kovid's recommendation on adding feeds, I added these lines to the bottom of the Guardian recipe:

    def parse_index(self):
        feeds = self.parse_section(self.base_url)
        feeds += self.parse_section('http://www.theguardian.com/politics/'+strftime('%Y/%b/%d'))
        feeds += self.parse_section('http://www.theguardian.com/lifeandstyle/'+strftime('%Y/%b/%d'))
        feeds += self.parse_section('http://www.theguardian.com/uk/commentisfree/'+strftime('%Y/%b/%d'))
        feeds += self.parse_section('http://www.theguardian.com/travel/'+strftime('%Y/%b/%d'))
        feeds += self.parse_section('http://www.theguardian.com/lifeandstyle/food-and-drink/'+strftime('%Y/%b/%d'))
        feeds += self.parse_section('http://www.theguardian.com/tv-and-radio/'+strftime('%Y/%b/%d'))
        feeds += self.parse_section('http://www.theguardian.com/theguardian/theguide/'+strftime('%Y/%b/%d'))
        return feeds

This works well for the Saturday Guardian, which is my main interest. Other sections can be added for other days as needed. For it to work, two lines need to be added near the top of the script:

    from calibre import strftime

(I have it at line 11.) This brings in the PC time via calibre, to use in the feed URLs above. I have used the trick of resetting my PC time to a previous Saturday to scrape an earlier issue!

    ignore_duplicate_articles = {'title', 'url'}

(My line 38.) This is needed because there may be several links to the same article in different parts of the newspaper.

Hope this may be of some use to other Guardian readers dismayed by the loss of wanted supplements! And thanks to Kovid for very helpful suggestions.
Paddy |
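A possible alternative to resetting the PC clock, sketched in plain Python outside calibre: compute the date folder for the most recent Saturday and build the section URLs from that. The helper names (`issue_path`, `most_recent_saturday`, `dated_section_url`) are made up for illustration and are not part of the recipe or of calibre. The month is written in lowercase to match the folder seen in the URL above (2015/nov/14); whether the site also accepts the capitalised month that strftime('%b') usually gives is not something this sketch assumes either way.

```python
from datetime import date, timedelta

# English month abbreviations written out so the result does not depend on
# the PC's locale or on how strftime('%b') capitalises the month.
MONTHS = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
          'jul', 'aug', 'sep', 'oct', 'nov', 'dec']

BASE = 'http://www.theguardian.com/'


def issue_path(day):
    """Return the date folder for an issue, e.g. '2015/dec/12'."""
    return '%d/%s/%02d' % (day.year, MONTHS[day.month - 1], day.day)


def most_recent_saturday(today):
    """Return today if it is a Saturday, else the Saturday before it."""
    days_back = (today.weekday() - 5) % 7  # Monday is 0, Saturday is 5
    return today - timedelta(days=days_back)


def dated_section_url(section, day):
    """Build a dated section URL (hypothetical helper, not in the recipe)."""
    return BASE + section + '/' + issue_path(day)


saturday = most_recent_saturday(date(2015, 12, 14))
print(dated_section_url('travel', saturday))
# prints http://www.theguardian.com/travel/2015/dec/12
```

Inside the recipe, something like `date.today()` could be passed in place of the fixed date, so that any run during the week still picks up the previous Saturday's issue.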
12-14-2015, 09:56 AM | #19 |
Member
Posts: 18
Karma: 10
Join Date: Jan 2011
Device: sony prs-650
|
Thanks Paddy, but I'm afraid all that is beyond me. I don't understand!
Kieran |
12-14-2015, 01:39 PM | #20 |
Connoisseur
Posts: 67
Karma: 10
Join Date: Oct 2012
Device: Kindle 3
|
Kieran, if you go to "Add a custom news source" and customise the built-in recipe for The Guardian and Observer, you will find Kovid has done most of the work for you (see message #12 in this thread). Scan down to the bottom of the recipe and you will see he has added the Sports section; it looks like this:
    def parse_index(self):
        feeds = self.parse_section(self.base_url)
        feeds += self.parse_section('http://www.theguardian.com/uk/sport', 'Sport - ')
        return feeds

I don't want the Sports section, so I took that out and replaced it with the sections I do want, e.g. Travel. But I had to add the dates, or the file is enormous (it swells from 7 MB to 72 MB!), I assume because it scrapes everything it finds. The date relates to a specific issue, e.g. 2015/dec/12 for last Saturday. Does that help, or make it worse?
Paddy |
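For anyone who finds the repeated lines hard to follow, the same swap of sections can be sketched as a list plus a loop. This is only an illustration in plain Python: `build_feeds` and `fake_parse_section` are invented names, the fake function merely echoes the URL it is given rather than scraping anything, and 'FRONT-PAGE-URL' is a placeholder for whatever the recipe's own base_url is.

```python
# Sections wanted for the Saturday issue; edit this list to taste.
SECTIONS = ['travel', 'lifeandstyle', 'tv-and-radio']


def build_feeds(parse_section, base_url, sections, issue_path):
    """Collect feeds from the front page plus each dated section."""
    feeds = list(parse_section(base_url))
    for section in sections:
        url = 'http://www.theguardian.com/%s/%s' % (section, issue_path)
        feeds += parse_section(url)
    return feeds


def fake_parse_section(url):
    """Stand-in for the recipe's parse_section; it only echoes the URL."""
    return [(url, [])]


feeds = build_feeds(fake_parse_section, 'FRONT-PAGE-URL', SECTIONS, '2015/dec/12')
for title, articles in feeds:
    print(title)
```

In the real recipe the loop body would call self.parse_section instead of the stand-in, so adding or dropping a section means editing one entry in the list rather than a whole line of code.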
12-14-2015, 01:45 PM | #21 |
Member
Posts: 18
Karma: 10
Join Date: Jan 2011
Device: sony prs-650
|
Thanks, I'll let you know after I've had a play tomorrow. It doesn't look much like COBOL which I was familiar with in the early 80s!
|
12-15-2015, 04:55 AM | #22 |
Connoisseur
Posts: 67
Karma: 10
Join Date: Oct 2012
Device: Kindle 3
|
My expertise was Z80 machine code I'm afraid, dates me a bit!! So Python takes a bit of getting used to, but nice to dabble with a purpose...
Paddy |
12-16-2015, 04:09 AM | #23 |
Connoisseur
Posts: 67
Karma: 10
Join Date: Oct 2012
Device: Kindle 3
|
|
12-16-2015, 04:13 AM | #24 |
Member
Posts: 18
Karma: 10
Join Date: Jan 2011
Device: sony prs-650
|
Thanks for the tips. Unfortunately my back has decided to trap a nerve, so I can neither sit down nor stand up comfortably.
I'm taking a break from this till it clears up. |
12-16-2015, 01:46 PM | #25 | |
Connoisseur
Posts: 67
Karma: 10
Join Date: Oct 2012
Device: Kindle 3
|
Quote:
Paddy |
|
12-16-2015, 01:51 PM | #26 |
Member
Posts: 18
Karma: 10
Join Date: Jan 2011
Device: sony prs-650
|
The ibuprofen is starting to work and I've found that I can use the computer chair if it's tilted just so.
So I've been fiddling with some success. Probably I'll have some further questions, but thanks again to you both. Kieran |
12-16-2015, 01:54 PM | #27 | |
Member
Posts: 18
Karma: 10
Join Date: Jan 2011
Device: sony prs-650
|
Quote:
Yes I would like to peruse your code. |
|
12-17-2015, 05:36 AM | #28 |
Connoisseur
Posts: 67
Karma: 10
Join Date: Oct 2012
Device: Kindle 3
|
Guardian recipe with dates
Kieran, attached to this message. It's a text file (Notepad), just copy and paste it into a new custom news source: it will name itself automatically when you save it.
I'm also on Ibuprofen, strained my shoulder planing a rain-swollen hardwood door yesterday. Couple of old crocks! Paddy |
12-18-2015, 06:09 PM | #29 |
Junior Member
Posts: 5
Karma: 10
Join Date: Feb 2015
Device: Sony prs-t3
|
Hi, I've been having problems with the Guardian too for the last few weeks. Even though I've next to zero knowledge of programming, I tried the previous copy and paste fix [thanks Paddy].
It worked, but it still didn't have the section I was looking for, which has been missing from the download for 3 weeks now: culture>books. If anybody could work up the lines of code that I could paste into the previous custom recipe I'd be very grateful. |
12-19-2015, 04:57 AM | #30 |
Member
Posts: 18
Karma: 10
Join Date: Jan 2011
Device: sony prs-650
|
I haven't tried to download this section but if you go onto the web page for books, it is:
http://www.theguardian.com/books
rather than:
http://www.theguardian.com/culture/books/
Similarly, I wanted the Tech section, but I see that is in fact now called Technology.
Edit: I have taken Paddy's file from the post above and added 2 book sections. It is very rough and ready code but produced lots of book section stuff. |
|