02-17-2011, 07:41 PM | #1 |
Junior Member
Posts: 9
Karma: 10
Join Date: Feb 2011
Device: Kindle
|
Help Please: remove_tags doesn't work in WSJ Chinese
Hello,
I modified a recipe for WSJ Chinese and used remove_tags and remove_tags_after to remove the unwanted navigation bars and links. Unfortunately, it didn't work. Could someone please take a look and suggest what might be wrong? Thanks a lot.
from calibre.web.feeds.news import BasicNewsRecipe
class AdvancedUserRecipe1277443666(BasicNewsRecipe):
title = u'x WSJ 華爾街日報'
oldest_article = 32
max_articles_per_feed = 3
feeds = [
(u'\u8981\u805E', u'http://chinese.wsj.com/big5/rss01.xml'),
(u'\u7279\u5BEB', u'http://chinese.wsj.com/big5/rss02.xml'),
#(u'\u4E2D\u6E2F\u53F0', u'http://chinese.wsj.com/big5/rssbch.xml'),
#(u'\u570B\u969B\u8CA1\u7D93', u'http://chinese.wsj.com/big5/rssglobal.xml'),
#(u'\u4E2D\u570B\u80A1\u5E02', u'http://chinese.wsj.com/big5/rsschinastock.xml'),
#(u'\u9999\u6E2F\u80A1\u5E02', u'http://chinese.wsj.com/big5/rssHKstock.xml'),
#(u'\u5916\u532F\u5E02\u5834', u'http://chinese.wsj.com/big5/rssforex.xml')
#(u'\u5168\u7403\u91D1\u878D\u5E02\u5834', u'http://chinese.wsj.com/big5/rssmarkets.xml')
(u'\u79D1\u6280', u'http://chinese.wsj.com/big5/rsstech.xml')
#(u'\u80FD\u6E90\u8207\u6C7D\u8ECA', u'http://chinese.wsj.com/big5/rssautoene.xml')
]
remove_tags = [dict(name='div', attrs={'class':['homepage']})]
remove_tags_after = dict(id='bodypart')
remove_javascript = True
|
02-18-2011, 10:07 AM | #2 |
Connoisseur
Posts: 73
Karma: 44
Join Date: Sep 2010
Device: kindle 3
|
I'm not an expert, but I think you have to set keep_only_tags; this is the tag that contains your article.
Check the other recipe examples in the folder where Calibre is installed: Calibre2\resources\recipes. I use sciencedaily.recipe as a template.
Last edited by sorin; 02-18-2011 at 11:10 AM. |
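As a rough illustration of the suggestion above, a recipe that keeps only the article container might look like the sketch below. The class name is made up for illustration, the 'bodypart' id and the feed URL are taken from the first post and have to match what the site's pages actually use, and the try/except stand-in is only there so the file can be parsed outside calibre.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Assumption: running outside calibre, so use an empty stand-in base class.
    class BasicNewsRecipe(object):
        pass


class KeepOnlyTagsSketch(BasicNewsRecipe):  # hypothetical class name
    title = u'WSJ Chinese (sketch)'
    oldest_article = 32
    max_articles_per_feed = 2

    feeds = [
        (u'\u8981\u805E', u'http://chinese.wsj.com/big5/rss01.xml'),
    ]

    # Keep only the element that wraps the article text.
    # 'bodypart' is an assumed id; check the page source for the real one.
    keep_only_tags = [dict(name='div', attrs={'id': 'bodypart'})]

    # Strip unwanted pieces that are still left inside the kept element.
    remove_tags = [dict(name='div', attrs={'class': ['homepage']})]
```

Note that every attribute is indented under the class line; otherwise Python does not treat it as part of the recipe.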
02-18-2011, 10:17 PM | #3 |
Junior Member
Posts: 9
Karma: 10
Join Date: Feb 2011
Device: Kindle
|
Well, I did use keep_only_tags as shown below, and I confirmed that <div id="bodypart"> is present in the HTML. Unfortunately, it still does not work, so I wonder whether I am missing something. Any suggestions? Thanks.
===
from calibre.web.feeds.news import BasicNewsRecipe
class AdvancedUserRecipe1277443666(BasicNewsRecipe):
title = u'x WSJ 華爾街日報'
oldest_article = 32
max_articles_per_feed = 2
feeds = [
(u'\u8981\u805E', u'http://chinese.wsj.com/big5/rss01.xml'),
(u'Report', u'http://chinese.wsj.com/big5/rss02.xml'),
#(u'\u4E2D\u6E2F\u53F0', u'http://chinese.wsj.com/big5/rssbch.xml'),
#(u'\u570B\u969B\u8CA1\u7D93', u'http://chinese.wsj.com/big5/rssglobal.xml'),
#(u'\u4E2D\u570B\u80A1\u5E02', u'http://chinese.wsj.com/big5/rsschinastock.xml'),
#(u'\u9999\u6E2F\u80A1\u5E02', u'http://chinese.wsj.com/big5/rssHKstock.xml'),
#(u'\u5916\u532F\u5E02\u5834', u'http://chinese.wsj.com/big5/rssforex.xml')
#(u'\u5168\u7403\u91D1\u878D\u5E02\u5834', u'http://chinese.wsj.com/big5/rssmarkets.xml')
#(u'\u79D1\u6280', u'http://chinese.wsj.com/big5/rsstech.xml')
#(u'\u80FD\u6E90\u8207\u6C7D\u8ECA', u'http://chinese.wsj.com/big5/rssautoene.xml')
]
remove_javascript = True
keep_only_tags = [ dict(name='div', attrs={'id':'bodypart'}) ]
# remove_tags = [dict(name='div', attrs={'class':['homepage']})]
#remove_tags_after = dict(id='bodypart')
#remove_javascript = True
|
02-19-2011, 02:47 AM | #4 |
Connoisseur
Posts: 73
Karma: 44
Join Date: Sep 2010
Device: kindle 3
|
You have to indent the class attributes (title and so on) with spaces so that they sit inside the class body. Check this recipe:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1277443666(BasicNewsRecipe):
    title = u'x WSJ 華爾街日報'
    oldest_article = 32
    max_articles_per_feed = 2

    feeds = [
        (u'\u8981\u805E', u'http://chinese.wsj.com/big5/rss01.xml'),
        (u'Report', u'http://chinese.wsj.com/big5/rss02.xml'),
        #(u'\u4E2D\u6E2F\u53F0', u'http://chinese.wsj.com/big5/rssbch.xml'),
        #(u'\u570B\u969B\u8CA1\u7D93', u'http://chinese.wsj.com/big5/rssglobal.xml'),
        #(u'\u4E2D\u570B\u80A1\u5E02', u'http://chinese.wsj.com/big5/rsschinastock.xml'),
        #(u'\u9999\u6E2F\u80A1\u5E02', u'http://chinese.wsj.com/big5/rssHKstock.xml'),
        #(u'\u5916\u532F\u5E02\u5834', u'http://chinese.wsj.com/big5/rssforex.xml')
        #(u'\u5168\u7403\u91D1\u878D\u5E02\u5834', u'http://chinese.wsj.com/big5/rssmarkets.xml')
        #(u'\u79D1\u6280', u'http://chinese.wsj.com/big5/rsstech.xml')
        #(u'\u80FD\u6E90\u8207\u6C7D\u8ECA', u'http://chinese.wsj.com/big5/rssautoene.xml')
    ]

    remove_javascript = True
    keep_only_tags = [dict(name='div', attrs={'id': 'bodypart'})]
    # remove_tags = [dict(name='div', attrs={'class':['homepage']})]
    #remove_tags_after = dict(id='bodypart')
    #remove_javascript = True
---------------
Read this thread; it has a command line that is very useful for testing recipes.
Last edited by sorin; 02-19-2011 at 03:16 AM. |
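To see why the leading spaces matter, here is a small sketch (not part of calibre; the helper name and the two sample strings are made up for illustration) that asks Python itself whether a recipe source parses:

```python
def compiles(source):
    """Return True if the recipe source parses as Python, False otherwise."""
    try:
        compile(source, '<recipe>', 'exec')
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


# Attribute at column 0: Python does not treat it as part of the class body.
unindented = "class Recipe(object):\ntitle = u'x WSJ'\n"

# Same attribute indented by four spaces: it now belongs to the class.
indented = "class Recipe(object):\n    title = u'x WSJ'\n"

print(compiles(unindented))  # False
print(compiles(indented))    # True
```

Running the same check on the text of a real .recipe file is a quick way to rule out an indentation problem before looking at the tag filters.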
02-20-2011, 10:55 PM | #5 |
Junior Member
Posts: 9
Karma: 10
Join Date: Feb 2011
Device: Kindle
|
Hi Sorin,
I did put spaces before each line. The result is the same. Any other suggestions? Thanks. |
02-21-2011, 04:10 AM | #6 |
Connoisseur
Posts: 73
Karma: 44
Join Date: Sep 2010
Device: kindle 3
|
Copy your recipe into the folder where Calibre is installed and run this at the command prompt:
C:\Program Files\Calibre2>ebook-convert YourRecipe.recipe D:\temp --test -vv
Check the console for errors, and check index.html in D:\temp. |
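If you prefer to run that test from a script and keep the console output for inspection, something like the following sketch could work. It assumes ebook-convert is on the PATH; the two function names are made up for illustration, and the paths you pass in are your own.

```python
import subprocess


def build_test_command(recipe_path, output_dir):
    """Build the ebook-convert call from the post as an argument list."""
    # --test limits the download to a couple of feeds and articles;
    # -vv makes the console output more verbose.
    return ['ebook-convert', recipe_path, output_dir, '--test', '-vv']


def run_recipe_test(recipe_path, output_dir):
    """Run the command and return (exit_code, combined console output)."""
    result = subprocess.run(
        build_test_command(recipe_path, output_dir),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.returncode, result.stdout
```

For example, run_recipe_test('YourRecipe.recipe', r'D:\temp') returns the exit code and everything the command printed, which you can then search for error messages.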