File name formatting - Page 2

Starson17 · 08-10-2011, 03:47 PM

Quote:

Originally Posted by yoss15

I did, but I have no clue what that section means, like I said I don't have any experience with this stuff.

If you want to learn, I'll answer your questions. You should use Firebug and Firefox to inspect the source page.

yoss15 · 08-10-2011, 04:11 PM

Quote:

Originally Posted by Starson17

If you want to learn, I'll answer your questions. You should use Firebug and Firefox to inspect the source page.

I would really appreciate it. For starters what does the for section in soup.findAll line do?

Starson17 · 08-10-2011, 04:22 PM

Quote:

Originally Posted by yoss15

I would really appreciate it. For starters what does the for section in soup.findAll line do?

The job of parse_index is to look at a page and find links on that page to articles. The for section in soup.findAll line is "finding all" tags that have a link in them to an article. More specifically, it's the beginning of that process. Do you know what a <div> tag is? That line finds every part of the page that is tagged <div class="content">.

I'll be nice and look at your page - hold on ....

There aren't any div tags like that.

You should probably be doing something like this:

Code:

for section in soup.findAll('li'):

Then something like:

Code:

for post in section.findAll('a', href=True):

The first loop finds every <li> tag, and the second finds the <a> tags inside each one that have an href.
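To see the nested "list item, then link" idea outside of calibre, here is a minimal sketch. It is not BeautifulSoup and not recipe code: it uses Python's standard html.parser as a stand-in, and the class name LiLinkCollector and the SAMPLE page are made up for illustration. It only shows which links the two loops would end up visiting.

```python
from html.parser import HTMLParser


class LiLinkCollector(HTMLParser):
    """Collect the href of every <a> found inside an <li> element."""

    def __init__(self):
        super().__init__()
        self.li_depth = 0   # greater than 0 while we are inside an <li>
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.li_depth += 1
        elif tag == 'a' and self.li_depth > 0:
            href = dict(attrs).get('href')
            if href is not None:      # roughly what href=True asks for
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == 'li' and self.li_depth > 0:
            self.li_depth -= 1


# Hypothetical page fragment, not taken from the real site.
SAMPLE = """
<ul>
  <li><b>News</b> <a href="/articles/one.html">One</a></li>
  <li><a name="no-href">Anchor without a link</a></li>
</ul>
<a href="/outside.html">Not inside a list item</a>
"""

collector = LiLinkCollector()
collector.feed(SAMPLE)
print(collector.links)   # ['/articles/one.html']
```

The link outside the list and the anchor without an href are both skipped, which is the filtering the two findAll lines are meant to do.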

yoss15 · 08-10-2011, 04:34 PM

I have an idea of what the <div> tag is, I just never understand any of the recipe code referring to it.

Thanks, I appreciate you looking at it. That makes a lot of sense, also that Firebug extension is a great help.

Here is my next problem. I really don't understand the whole indent thing in Python. It always seems to give me errors. For example, when I add

Code:

for post in section.findAll('a', href=True):

should it be indented under the other line? How do I properly indent it? Hitting tab seems to send it way too far, but even with 1-5 spaces it still gives errors. I just don't understand it.

PeterT · 08-10-2011, 04:59 PM

Python uses indentation to "nest" code. Be consistent and use spaces rather than tabs.
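If you suspect a stray tab is behind the indentation errors, a small check like the one below can point at the offending lines. This is only a sketch: find_tab_indents is a made-up helper name, and SAMPLE stands in for the text of your recipe file.

```python
def find_tab_indents(source):
    """Return the 1-based numbers of lines whose leading whitespace contains a tab."""
    bad = []
    for number, line in enumerate(source.splitlines(), start=1):
        # Everything before the first non-whitespace character is the indent.
        indent = line[:len(line) - len(line.lstrip())]
        if '\t' in indent:
            bad.append(number)
    return bad


# Hypothetical three-line snippet: the last line is indented with a tab.
SAMPLE = (
    "for section in sections:\n"
    "    for post in section:\n"
    "\tprint(post)\n"
)
print(find_tab_indents(SAMPLE))   # [3]
```

To check a real file, read it into a string first (for example with open(path).read()) and pass that in.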

Not being an expert I think the idea is you want something close to this:

Code:

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

class WSWS(BasicNewsRecipe):

    title      = 'World Socialist Web Site'
    __author__ = 'International Committee of The Fourth International'
    description = 'WSWS'

    no_stylesheets = True
    remove_javascript     = True

    def parse_index(self):
        articles = []
        soup = self.index_to_soup('http://wsws.org/mobile/')
        cover = None
        feeds = []
        for section in soup.findAll('li'):
            section_title = self.tag_to_string(section.find('b'))
            articles = []
            for post in section.findAll('a', href=True):
                url = post['href']
                if url.startswith('/'):
                    url = 'http://www.wsws.org'+url
                    title = self.tag_to_string(post)
                    if str(post).find('class=') > 0:
                        klass = post['class']
                        if klass != "":
                            self.log()
                            self.log('--> post:  ', post)
                            self.log('--> url:   ', url)
                            self.log('--> title: ', title)
                            self.log('--> class: ', klass)
                            articles.append({'title':title, 'url':url})
            if articles:
                feeds.append((section_title, articles))
        return feeds

so the idea is you loop through all sections that are identified by "li" entries and then for each entry found use the loop

Code:

            for post in section.findAll('a', href=True):
                url = post['href']
                if url.startswith('/'):
                    url = 'http://www.wsws.org'+url
                    title = self.tag_to_string(post)
                    if str(post).find('class=') > 0:
                        klass = post['class']
                        if klass != "":
                            self.log()
                            self.log('--> post:  ', post)
                            self.log('--> url:   ', url)
                            self.log('--> title: ', title)
                            self.log('--> class: ', klass)
                            articles.append({'title':title, 'url':url})

to append each article to the list of articles
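For what it's worth, here is a stripped-down sketch of the list that loop builds, written as plain Python so it can be run without calibre. build_feeds and the sample data are made up for illustration, and the sketch leaves out the class check and the logging, so it keeps every link rather than only the ones with a class attribute.

```python
BASE_URL = 'http://www.wsws.org'   # same prefix the recipe code uses


def build_feeds(sections):
    """sections is a list of (section_title, [(link_text, href), ...]) pairs."""
    feeds = []
    for section_title, links in sections:
        articles = []                     # start a fresh list for every section
        for title, url in links:
            if url.startswith('/'):       # relative link: make it absolute
                url = BASE_URL + url
            articles.append({'title': title, 'url': url})
        if articles:                      # skip sections that had no links
            feeds.append((section_title, articles))
    return feeds


# Hypothetical input, standing in for what the two findAll loops would find.
feeds = build_feeds([
    ('News', [('One', '/articles/one.html')]),
    ('Empty section', []),
])
print(feeds)
```

The result is a list of (section title, list of article dictionaries) pairs, which is the shape the parse_index code above returns as feeds.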

yoss15 · 08-10-2011, 05:20 PM

Thanks for the suggestion, Peter. After trying that out, it gets stuck at the klass = post['class'] line. I'm not sure that I need those lines because they are for getting rid of extra links, but my links seem pretty straightforward. I also think that klass has something to do with the other specific page, but I'm not sure.

Ugh I'm still having a hard time with indentation errors and I'm not sure what to do.

Calibre will tell me that I have an error on line 29, so I look at it in Komodo Edit to match up the line numbers. I have gone as far as deleting line 29 but still get the error. I don't know what the problem could be.

Starson17 · 08-11-2011, 08:44 AM

Quote:

Originally Posted by yoss15

How do I properly indent it? Hitting tab seems to send it way too far, but even with 1-5 spaces it still gives errors. I just don't understand it.

Don't use any tabs - all spaces. Within any one block of code, indent every line by the same amount. PeterT's indents look correct to me. I prefer running my recipe like this:

Code:

ebook-convert _Test_1.recipe _Test_1 --test   -vv > _Test.txt

That runs my _Test_1.recipe file and puts the html produced by my recipe into the _Test_1 folder and all of my print statements and verbose comments into the _Test.txt file. I use underscores so those files are all at the top of my directory list, and I keep the recipe and text output (_Test.txt) files open in my editor at all times. I also have a batch file that runs the line above. I run the file, read the output file to see any errors or print statement output, then revise and run again.
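If you would rather not write a batch file, the same run-and-log step could be sketched in Python along these lines. The function names build_command and run_recipe are made up; the command and its --test and -vv options are copied from the line above, and the sketch assumes ebook-convert is on your PATH.

```python
import subprocess


def build_command(recipe, out_dir):
    """Return the ebook-convert argument list from the command shown above."""
    return ['ebook-convert', recipe, out_dir, '--test', '-vv']


def run_recipe(recipe, out_dir, log_path):
    """Run the recipe and write its standard output to log_path."""
    with open(log_path, 'w') as log:
        # Only stdout is redirected, mirroring the "> _Test.txt" part.
        return subprocess.call(build_command(recipe, out_dir), stdout=log)


# Show the command that would be run, without actually running it.
print(' '.join(build_command('_Test_1.recipe', '_Test_1')))
```

Calling run_recipe('_Test_1.recipe', '_Test_1', '_Test.txt') would then do the actual run and leave the output in _Test.txt to read in your editor.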


