08-10-2011, 04:47 PM | #16 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
|
08-10-2011, 05:11 PM | #17 |
Enthusiast
Posts: 37
Karma: 10
Join Date: Jul 2011
Device: Kindle
|
|
08-10-2011, 05:22 PM | #18 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
I'll be nice and look at your page - hold on .... There aren't any div tags like that. You should probably be doing something like this:
Code:
for section in soup.findAll('li'):
Code:
for post in section.findAll('a', href=True):
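To see the shape of those two nested loops without running a full recipe, here is a minimal sketch. It is a stand-in only: it uses the standard library's ElementTree instead of calibre's BeautifulSoup, and the markup, the `SAMPLE` name and the `links_by_section` helper are made up for illustration, so the real page may well be laid out differently.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample markup (an assumption, not the real page).
SAMPLE = """
<ul>
  <li><b>News</b>
    <a href="/articles/one.html">First story</a>
    <a name="anchor-only">No link here</a>
  </li>
  <li><b>Arts</b>
    <a href="/articles/two.html">Second story</a>
  </li>
</ul>
"""

def links_by_section(markup):
    """Return [(section_title, [(text, href), ...]), ...]."""
    root = ET.fromstring(markup.strip())
    result = []
    # Outer loop: one pass per <li>, playing the role of soup.findAll('li')
    for section in root.iter('li'):
        heading = section.find('b')
        title = heading.text if heading is not None else ''
        links = []
        # Inner loop: keep only <a> tags that carry an href,
        # playing the role of section.findAll('a', href=True)
        for post in section.iter('a'):
            if 'href' in post.attrib:
                links.append((post.text, post.attrib['href']))
        result.append((title, links))
    return result

sections = links_by_section(SAMPLE)
for title, links in sections:
    print(title, links)
```

The point is only the nesting: the inner loop runs once per section and sees just that section's links.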
|
08-10-2011, 05:34 PM | #19 |
Enthusiast
Posts: 37
Karma: 10
Join Date: Jul 2011
Device: Kindle
|
I have an idea of what the <div> tag is; I just never understand the recipe code that refers to it.
Thanks, I appreciate you looking at it. That makes a lot of sense, and the Firebug extension is a great help. Here is my next problem: I really don't understand the whole indentation thing in Python. It always seems to give me errors. For example, when I add
Code:
for post in section.findAll('a', href=True):
08-10-2011, 05:59 PM | #20 |
Grand Sorcerer
Posts: 12,752
Karma: 75000002
Join Date: Nov 2007
Location: Toronto
Device: Libra H2O, Libra Colour
|
Python uses indentation to "nest" code. Be consistent and use spaces rather than tabs.
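As a small illustration of what "nesting" means here, the sketch below runs the same statements with one line indented at two different levels. The function names and sample URLs are made up for the example.

```python
def count_inside_if(urls):
    relative = 0
    total = 0
    for url in urls:
        if url.startswith('/'):
            relative += 1
            total += 1      # nested under the if: runs only for relative URLs
    return relative, total

def count_outside_if(urls):
    relative = 0
    total = 0
    for url in urls:
        if url.startswith('/'):
            relative += 1
        total += 1          # nested under the for only: runs for every URL
    return relative, total

urls = ['/a.html', 'http://example.com/b.html', '/c.html']
inside = count_inside_if(urls)
outside = count_outside_if(urls)
print(inside, outside)
```

Moving `total += 1` four spaces to the left changes which block it belongs to, and so changes the result, even though the text of the line is identical.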
Not being an expert, I think the idea is you want something close to this:
Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

class WSWS(BasicNewsRecipe):
    title = 'World Socialist Web Site'
    __author__ = 'International Committee of The Fourth International'
    description = 'WSWS'
    no_stylesheets = True
    remove_javascript = True

    def parse_index(self):
        articles = []
        soup = self.index_to_soup('http://wsws.org/mobile/')
        cover = None
        feeds = []
        for section in soup.findAll('li'):
            section_title = self.tag_to_string(section.find('b'))
            articles = []
            for post in section.findAll('a', href=True):
                url = post['href']
                if url.startswith('/'):
                    url = 'http://www.wsws.org'+url
                title = self.tag_to_string(post)
                if str(post).find('class=') > 0:
                    klass = post['class']
                    if klass != "":
                        self.log()
                        self.log('--> post: ', post)
                        self.log('--> url: ', url)
                        self.log('--> title: ', title)
                        self.log('--> class: ', klass)
                        articles.append({'title':title, 'url':url})
            if articles:
                feeds.append((section_title, articles))
        return feeds
Code:
for post in section.findAll('a', href=True):
    url = post['href']
    if url.startswith('/'):
        url = 'http://www.wsws.org'+url
    title = self.tag_to_string(post)
    if str(post).find('class=') > 0:
        klass = post['class']
        if klass != "":
            self.log()
            self.log('--> post: ', post)
            self.log('--> url: ', url)
            self.log('--> title: ', title)
            self.log('--> class: ', klass)
            articles.append({'title':title, 'url':url})
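The inner loop's logic can be tried outside calibre with plain dictionaries standing in for the link tags. This is a sketch under assumptions: the sample links are invented, `build_articles` is a hypothetical helper, it follows one reading of the loop (keep a link only when it has a non-empty class), and it swaps the direct `post['class']` lookup for `.get('class', '')` so that a link without a class does not raise an error.

```python
BASE = 'http://www.wsws.org'

def build_articles(links):
    """links: dicts standing in for <a> tag attributes, plus a 'text' key."""
    articles = []
    for link in links:
        url = link['href']
        if url.startswith('/'):
            url = BASE + url               # make site-relative links absolute
        klass = link.get('class', '')      # empty string when class is absent
        if klass != '':
            articles.append({'title': link['text'], 'url': url})
    return articles

# Invented sample data, not taken from the real page.
sample = [
    {'href': '/articles/a.shtml', 'text': 'Story A', 'class': 'headline'},
    {'href': '/about.shtml', 'text': 'About'},
    {'href': 'http://www.wsws.org/b.shtml', 'text': 'Story B', 'class': 'headline'},
]
articles = build_articles(sample)
for article in articles:
    print(article)
```

If the class check is not needed for a given page, dropping it and appending every link is the simpler variant.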
08-10-2011, 06:20 PM | #21 |
Enthusiast
Posts: 37
Karma: 10
Join Date: Jul 2011
Device: Kindle
|
Thanks for the suggestion, Peter. After trying that out, it gets stuck at the klass = post['class'] line. I'm not sure that I need those lines, because they are for getting rid of extra links and my links seem pretty straightforward. I also think that klass has something to do with the other specific page, but I'm not sure.
Ugh, I'm still having a hard time with indentation errors and I'm not sure what to do. Calibre will tell me that I have an error on line 29, so I look at it in Komodo Edit to match up the line numbers. I have gone as far as deleting line 29 but still get the error. I don't know what the problem could be.
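One way to narrow down an error like this is to check the recipe text for tabs in the indentation and to ask Python itself where it stops parsing. The sketch below is a rough aid, not a calibre feature; `find_tab_lines` and `syntax_problem` are hypothetical helper names, and the sample sources are made up. The line Python reports is where it noticed the problem, which is sometimes a line or so after where the problem began.

```python
def find_tab_lines(source):
    """Return 1-based numbers of lines whose leading whitespace has a tab."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        indent = line[:len(line) - len(line.lstrip())]
        if '\t' in indent:
            hits.append(number)
    return hits

def syntax_problem(source, filename='<recipe>'):
    """Return (line number, message) for the first syntax error, or None."""
    try:
        compile(source, filename, 'exec')
    except SyntaxError as err:   # also covers IndentationError and TabError
        return err.lineno, err.msg
    return None

# Made-up samples: GOOD uses spaces only, BAD mixes in a tab on line 3.
GOOD = "for x in [1]:\n    y = x\n"
BAD = "for x in [1]:\n    y = x\n\tz = x\n"

good_tabs = find_tab_lines(GOOD)
good_problem = syntax_problem(GOOD)
bad_tabs = find_tab_lines(BAD)
bad_problem = syntax_problem(BAD)
print(good_tabs, good_problem)
print(bad_tabs, bad_problem)
```

To use it on a real file, read the recipe into a string with `open(path).read()` and pass that string to both functions.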
08-11-2011, 09:44 AM | #22 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
Code:
ebook-convert _Test_1.recipe _Test_1 --test -vv > _Test.txt
|