Guardian download - Page 2

didsbury · 11-27-2015, 05:51 AM

https://www.dropbox.com/s/cktwcc20jc...ibre1.rtf?dl=0

Can you see these three screenshots?

kovidgoyal · 11-27-2015, 06:52 AM

the size difference is simply because the new recipe does not reduce image quality.

paddyrm · 12-14-2015, 04:38 AM

The Guardian changed its web format drastically in November 2015. Prior to that, articles from the extra sections were stored in named folders, e.g. "Cook", "G2" etc., and the old script would scrape all of these in. A member of the Guardian's User Help team sent me a link to a missing article from the Cook section, pointing me to the URL www.theguardian.com/lifeandstyle/2015/nov/14/, and further investigation showed that nearly all articles from supplements are now stored in date folders.

Following Kovid's recommendation on adding feeds, I added these lines to the bottom of the Guardian recipe:

    def parse_index(self):
        feeds = self.parse_section(self.base_url)
        feeds += self.parse_section('http://www.theguardian.com/politics/'+strftime('%Y/%b/%d'))
        feeds += self.parse_section('http://www.theguardian.com/lifeandstyle/'+strftime('%Y/%b/%d'))
        feeds += self.parse_section('http://www.theguardian.com/uk/commentisfree/'+strftime('%Y/%b/%d'))
        feeds += self.parse_section('http://www.theguardian.com/travel/'+strftime('%Y/%b/%d'))
        feeds += self.parse_section('http://www.theguardian.com/lifeandstyle/food-and-drink/'+strftime('%Y/%b/%d'))
        feeds += self.parse_section('http://www.theguardian.com/tv-and-radio/'+strftime('%Y/%b/%d'))
        feeds += self.parse_section('http://www.theguardian.com/theguardian/theguide/'+strftime('%Y/%b/%d'))
        return feeds

This works well for the Saturday Guardian, which is my main interest. Other sections can be added for other days as needed.
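To see what those date-folder URLs look like without running calibre, here is a minimal stand-alone sketch. The names `MONTHS`, `BASE`, `dated_section_url` and `dated_section_urls` are made up for illustration and are not part of the recipe. It spells the month in lower case to match the URL quoted above (2015/nov/14); Python's `%b` normally gives "Nov" in an English locale, and the thread does not say whether the site treats the two spellings alike.

```python
from datetime import date

# Lower-case month abbreviations, as in the URL quoted in the post (2015/nov/14).
MONTHS = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
          'jul', 'aug', 'sep', 'oct', 'nov', 'dec']

BASE = 'http://www.theguardian.com/'


def dated_section_url(section, day):
    """Return the date-folder URL for one section on one day."""
    return '%s%s/%d/%s/%02d' % (BASE, section.strip('/'), day.year,
                                MONTHS[day.month - 1], day.day)


def dated_section_urls(sections, day):
    """Return one date-folder URL per section, in the order given."""
    return [dated_section_url(section, day) for section in sections]


if __name__ == '__main__':
    # Print the URLs for two of the sections on Saturday 14 November 2015.
    for url in dated_section_urls(['travel', 'lifeandstyle'], date(2015, 11, 14)):
        print(url)
```

Pasting one of the printed URLs into a browser is a quick way to check that a section really has a date folder before adding it to the recipe.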

For it to work two lines need to be added near the top of the script:

from calibre import strftime
(I have it at line 11.) This brings in the PC's clock time via calibre, for use in the feed URLs above. I have used the trick of resetting my PC clock to a previous Saturday to scrape an earlier issue!

ignore_duplicate_articles = {'title', 'url'}
(My line 38.) This is needed because there may be several links to the same article in different parts of the newspaper.
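As a rough illustration of what that setting is for, the sketch below drops any article whose title or URL has already been seen. This is not calibre's own code (the thread does not show how calibre applies the setting internally); `drop_duplicates` and the sample data are hypothetical.

```python
def drop_duplicates(articles, keys=('title', 'url')):
    """Keep the first article seen for each distinct title and each distinct URL."""
    seen = {key: set() for key in keys}
    kept = []
    for article in articles:
        # Skip the article if its title OR its URL has already been kept.
        if any(article[key] in seen[key] for key in keys):
            continue
        for key in keys:
            seen[key].add(article[key])
        kept.append(article)
    return kept


if __name__ == '__main__':
    sample = [
        {'title': 'Roast parsnips', 'url': 'u1'},   # first copy, kept
        {'title': 'Roast parsnips', 'url': 'u2'},   # same title, dropped
        {'title': 'Winter walks', 'url': 'u1'},     # same URL, dropped
        {'title': 'Winter walks', 'url': 'u3'},     # new title and URL, kept
    ]
    for article in drop_duplicates(sample):
        print(article['title'], article['url'])
```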

Hope this may be of some use to other Guardian readers dismayed by the loss of wanted supplements! And thanks to Kovid for very helpful suggestions.

Paddy

didsbury · 12-14-2015, 09:56 AM

Thanks Paddy, but I'm afraid all that is beyond me. I don't understand!

Kieran

paddyrm · 12-14-2015, 01:39 PM

Kieran, if you add a custom news source and choose to customise the built-in recipe for The Guardian and Observer, Kovid has done most of the work for you (see message #12 in this thread). Scan down to the bottom of the recipe and you will see he has added the Sports section. It looks like this:

    def parse_index(self):
        feeds = self.parse_section(self.base_url)
        feeds += self.parse_section('http://www.theguardian.com/uk/sport', 'Sport - ')
        return feeds

I don't want the Sports section, so I took that out and replaced it with the sections I do want, e.g. Travel. But I had to add the dates, or the file is enormous (it swells from 7 MB to 72 MB!), I assume because without a date it scrapes everything it finds. The date relates to a specific issue, e.g. 2015/dec/12 for last Saturday.
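If you want the date of the most recent Saturday issue worked out for you rather than typed in, something along these lines would do it. This is only a sketch using Python's standard `datetime` module; `last_saturday` is a made-up helper and is not in the recipe.

```python
from datetime import date, timedelta


def last_saturday(today):
    """Return the most recent Saturday on or before the given date."""
    # date.weekday() counts Monday as 0, so Saturday is 5.
    days_back = (today.weekday() - 5) % 7
    return today - timedelta(days=days_back)


if __name__ == '__main__':
    # Monday 14 December 2015 maps back to Saturday 12 December 2015.
    print(last_saturday(date(2015, 12, 14)))
```

The result could then be used to build the date part of each section URL, instead of relying on today's date.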

Does that help, or make it worse?

Paddy

didsbury · 12-14-2015, 01:45 PM

Thanks, I'll let you know after I've had a play tomorrow. It doesn't look much like COBOL which I was familiar with in the early 80s!

paddyrm · 12-15-2015, 04:55 AM

My expertise was Z80 machine code I'm afraid, dates me a bit!! So Python takes a bit of getting used to, but nice to dabble with a purpose...

Paddy

paddyrm · 12-16-2015, 04:09 AM

Quote:

Originally Posted by didsbury

Thanks, I'll let you know after I've had a play tomorrow

Tip: try one section at a time, e.g. Travel or Lifeandstyle. Get that working, then build up the rest of the sections you want. -- Paddy

didsbury · 12-16-2015, 04:13 AM

Thanks for the tips. Unfortunately my back has decided to trap a nerve, so I can neither sit down nor stand up comfortably.

I'm taking a break from this till it clears up.

paddyrm · 12-16-2015, 01:46 PM

Quote:

Originally Posted by didsbury

Thanks for the tips. Unfortunately my back has decided to trap a nerve, so I can neither sit down nor stand up comfortably.

I'm taking a break from this till it clears up.

Sorry to hear that, hope it clears before Christmas, though it would be a good excuse to drink lots! I could always email you the script to copy into a new custom recipe, which worked well with my No 2 son, another G reader.

Paddy

didsbury · 12-16-2015, 01:51 PM

The ibuprofen is starting to work and I've found that I can use the computer chair if it's tilted just so.

So I've been fiddling with some success.

Probably I'll have some further questions, but thanks again to you both.

Kieran

didsbury · 12-16-2015, 01:54 PM

Quote:

Originally Posted by paddyrm

Sorry to hear that, hope it clears before Christmas, though it would be a good excuse to drink lots! I could always email you the script to copy into a new custom recipe, which worked well with my No 2 son, another G reader.

Paddy

Our posts cross!

Yes I would like to peruse your code.

paddyrm · 12-17-2015, 05:36 AM

Quote:

Originally Posted by didsbury

Our posts cross!

Yes I would like to peruse your code.

Kieran, attached to this message. It's a text file (Notepad), just copy and paste it into a new custom news source: it will name itself automatically when you save it.

I'm also on Ibuprofen, strained my shoulder planing a rain-swollen hardwood door yesterday. Couple of old crocks!

Paddy

Worzel · 12-18-2015, 06:09 PM

Hi, I've been having problems with the Guardian too for the last few weeks. Even though I've next to zero knowledge of programming, I tried the previous copy and paste fix [thanks Paddy].

It worked, but it still didn't have the section I was looking for, which has been missing from the download for 3 weeks now...
culture>books

If anybody could work up the lines of code that I could paste into the previous custom recipe I'd be very grateful.

didsbury · 12-19-2015, 04:57 AM

I haven't tried to download this section but if you go onto the web page for books, it is:

http://www.theguardian.com/books

rather than:

http://www.theguardian.com/culture/books/

Similarly I wanted the Tech section but I see that is in fact now called Technology.

Edit: I have taken Paddy's file from the post above and added 2 book sections. It is very rough and ready code but produced lots of book section stuff.
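For anyone who just wants the line to paste, a books feed would presumably follow the same pattern as Paddy's other sections. The sketch below is self-contained so it can be tried outside calibre: `GuardianSketch`, its `base_url` value and the stub `parse_section` are stand-ins, not the real recipe, and it is an assumption that the books section uses the same date folders as the supplements. Only the line marked "the line to add" belongs in the recipe's `parse_index`.

```python
from time import strftime  # inside the recipe, use: from calibre import strftime


class GuardianSketch(object):
    """Stand-in for the real recipe class; only what this example needs."""

    base_url = 'http://www.theguardian.com/uk'  # hypothetical value

    def parse_section(self, url, title_prefix=''):
        # Stub: the real recipe method fetches the page and builds the feeds.
        return [(title_prefix + url, [])]

    def parse_index(self):
        feeds = self.parse_section(self.base_url)
        # The line to add. Assumes books has date folders like the other sections.
        feeds += self.parse_section('http://www.theguardian.com/books/' + strftime('%Y/%b/%d'), 'Books - ')
        return feeds


if __name__ == '__main__':
    for name, articles in GuardianSketch().parse_index():
        print(name)
```

If the dated URL turns up nothing, trying the plain http://www.theguardian.com/books address is the obvious fallback, at the cost of a bigger download.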



