Help Please: remove_tags doesn't work in WSJ Chinese

Jmot · 02-17-2011, 07:41 PM

Hello,

I modified a recipe for WSJ Chinese and used remove_tags and remove_tags_after to remove the unwanted navigation bars and links. Unfortunately, it didn't work. Could someone please take a look and offer some suggestions? Thanks a lot.

from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1277443666(BasicNewsRecipe):
title = u'x WSJ 華爾街日報'
oldest_article = 32
max_articles_per_feed = 3

feeds = [
(u'\u8981\u805E', u'http://chinese.wsj.com/big5/rss01.xml'),
(u'\u7279\u5BEB', u'http://chinese.wsj.com/big5/rss02.xml'),
#(u'\u4E2D\u6E2F\u53F0', u'http://chinese.wsj.com/big5/rssbch.xml'),
#(u'\u570B\u969B\u8CA1\u7D93', u'http://chinese.wsj.com/big5/rssglobal.xml'),
#(u'\u4E2D\u570B\u80A1\u5E02', u'http://chinese.wsj.com/big5/rsschinastock.xml'),
#(u'\u9999\u6E2F\u80A1\u5E02', u'http://chinese.wsj.com/big5/rssHKstock.xml'),
#(u'\u5916\u532F\u5E02\u5834', u'http://chinese.wsj.com/big5/rssforex.xml')
#(u'\u5168\u7403\u91D1\u878D\u5E02\u5834', u'http://chinese.wsj.com/big5/rssmarkets.xml')
(u'\u79D1\u6280', u'http://chinese.wsj.com/big5/rsstech.xml')
#(u'\u80FD\u6E90\u8207\u6C7D\u8ECA', u'http://chinese.wsj.com/big5/rssautoene.xml')
]

remove_tags = [dict(name='div', attrs={'class':['homepage']})]
remove_tags_after = dict(id='bodypart')
remove_javascript = True

sorin · 02-18-2011, 10:07 AM

I'm not an expert, but I think you have to set keep_only_tags; this is the tag that contains your article.
Check the other recipe examples in the folder where Calibre is installed: Calibre2\resources\recipes.
I use sciencedaily.recipe as a template.
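Before relying on keep_only_tags, it can help to confirm that the tag spec really matches something in the page. Here is a minimal sketch using only the Python standard library. It is not how Calibre does its matching; SpecFinder, count_matches and the SAMPLE page are made-up names for illustration, and the comparison is a plain exact match on attribute values.

```python
from html.parser import HTMLParser


class SpecFinder(HTMLParser):
    """Count start tags matching a spec such as name='div', attrs={'id': 'bodypart'}."""

    def __init__(self, name, attrs):
        super().__init__()
        self.name = name
        self.wanted = attrs
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        found = dict(attrs)
        if tag == self.name and all(
            found.get(key) == value for key, value in self.wanted.items()
        ):
            self.count += 1


def count_matches(html, name, attrs):
    """Return how many tags in the HTML text match the given name and attributes."""
    finder = SpecFinder(name, attrs)
    finder.feed(html)
    return finder.count


# Hypothetical page, standing in for a downloaded article
SAMPLE = (
    '<html><body><div class="homepage">nav</div>'
    '<div id="bodypart"><p>Article</p></div></body></html>'
)

if __name__ == "__main__":
    print(count_matches(SAMPLE, "div", {"id": "bodypart"}))
```

If the count is zero for the saved article page, the spec in the recipe has nothing to keep, and the id or class name should be checked against the page source.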

Jmot · 02-18-2011, 10:17 PM

Well, I did use keep_only_tags as shown below, and I confirmed that there is a <div id="bodypart"> in the HTML. Unfortunately, it still does not work, so I'm wondering if I'm missing something. Any suggestions? Thanks.

===
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1277443666(BasicNewsRecipe):
title = u'x WSJ 華爾街日報'
oldest_article = 32
max_articles_per_feed = 2

feeds = [
(u'\u8981\u805E', u'http://chinese.wsj.com/big5/rss01.xml'),
(u'Report', u'http://chinese.wsj.com/big5/rss02.xml'),
#(u'\u4E2D\u6E2F\u53F0', u'http://chinese.wsj.com/big5/rssbch.xml'),
#(u'\u570B\u969B\u8CA1\u7D93', u'http://chinese.wsj.com/big5/rssglobal.xml'),
#(u'\u4E2D\u570B\u80A1\u5E02', u'http://chinese.wsj.com/big5/rsschinastock.xml'),
#(u'\u9999\u6E2F\u80A1\u5E02', u'http://chinese.wsj.com/big5/rssHKstock.xml'),
#(u'\u5916\u532F\u5E02\u5834', u'http://chinese.wsj.com/big5/rssforex.xml')
#(u'\u5168\u7403\u91D1\u878D\u5E02\u5834', u'http://chinese.wsj.com/big5/rssmarkets.xml')
#(u'\u79D1\u6280', u'http://chinese.wsj.com/big5/rsstech.xml')
#(u'\u80FD\u6E90\u8207\u6C7D\u8ECA', u'http://chinese.wsj.com/big5/rssautoene.xml')
]

remove_javascript = True

keep_only_tags = [
dict(name='div', attrs={'id':'bodypart'})
]

# remove_tags = [dict(name='div', attrs={'class':['homepage']})]
#remove_tags_after = dict(id='bodypart')
#remove_javascript = True

sorin · 02-19-2011, 02:47 AM

You have to put some spaces before the variables (like title) in your class, so that they are indented under the class line. Check this recipe:

from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1277443666(BasicNewsRecipe):

    title = u'x WSJ 華爾街日報'
    oldest_article = 32
    max_articles_per_feed = 2

    feeds = [
        (u'\u8981\u805E', u'http://chinese.wsj.com/big5/rss01.xml'),
        (u'Report', u'http://chinese.wsj.com/big5/rss02.xml'),
        #(u'\u4E2D\u6E2F\u53F0', u'http://chinese.wsj.com/big5/rssbch.xml'),
        #(u'\u570B\u969B\u8CA1\u7D93', u'http://chinese.wsj.com/big5/rssglobal.xml'),
        #(u'\u4E2D\u570B\u80A1\u5E02', u'http://chinese.wsj.com/big5/rsschinastock.xml'),
        #(u'\u9999\u6E2F\u80A1\u5E02', u'http://chinese.wsj.com/big5/rssHKstock.xml'),
        #(u'\u5916\u532F\u5E02\u5834', u'http://chinese.wsj.com/big5/rssforex.xml')
        #(u'\u5168\u7403\u91D1\u878D\u5E02\u5834', u'http://chinese.wsj.com/big5/rssmarkets.xml')
        #(u'\u79D1\u6280', u'http://chinese.wsj.com/big5/rsstech.xml')
        #(u'\u80FD\u6E90\u8207\u6C7D\u8ECA', u'http://chinese.wsj.com/big5/rssautoene.xml')
    ]

    remove_javascript = True

    # Keep only the div that holds the article body
    keep_only_tags = [
        dict(name='div', attrs={'id':'bodypart'})
    ]

    #remove_tags = [dict(name='div', attrs={'class':['homepage']})]
    #remove_tags_after = dict(id='bodypart')
    #remove_javascript = True

---------------
Read this thread; it has a command line that is very useful for testing recipes.
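
To see why the spaces matter, here is a small sketch that compiles two versions of a class body with Python's built-in compile(). The class name Recipe and the object base class are stand-ins so the snippet runs without Calibre; check_source is a made-up helper, not part of any library.

```python
# Same class body twice: once flush left, once indented under the class line.
FLAT = "class Recipe(object):\ntitle = u'x WSJ'\noldest_article = 32\n"
INDENTED = "class Recipe(object):\n    title = u'x WSJ'\n    oldest_article = 32\n"


def check_source(source):
    """Return None if the source compiles, otherwise a short error description."""
    try:
        compile(source, "<recipe>", "exec")
    except SyntaxError as err:  # IndentationError is a subclass of SyntaxError
        return "%s: %s" % (type(err).__name__, err.msg)
    return None


if __name__ == "__main__":
    print("flat:", check_source(FLAT))
    print("indented:", check_source(INDENTED))
```

The flat version is rejected with an IndentationError before any of the recipe settings are read, while the indented version compiles.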

Jmot · 02-20-2011, 10:55 PM

Hi Sorin,

I did put spaces before each line. The result is the same. Any other suggestions? Thanks.

sorin · 02-21-2011, 04:10 AM

Copy your recipe into the folder where Calibre is installed and run this at the command prompt:
C:\Program Files\Calibre2>ebook-convert YourRecipe.recipe D:\temp --test -vv
Check the console for errors, and check index.html in D:\temp.
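
If you test often, the same command can be wrapped in a small script. This is only a sketch: build_test_command and run_recipe_test are made-up helper names, and it assumes ebook-convert is on your PATH. Note that the test flag is written with two plain hyphens.

```python
import shutil
import subprocess


def build_test_command(recipe_path, output_dir):
    """Return the ebook-convert argument list for a quick recipe test run."""
    return [
        "ebook-convert",
        str(recipe_path),
        str(output_dir),
        "--test",  # two ASCII hyphens, not a dash character
        "-vv",     # verbose output, as in the command above
    ]


def run_recipe_test(recipe_path, output_dir):
    """Run the test build; return the exit code, or None if ebook-convert is not found."""
    if shutil.which("ebook-convert") is None:
        return None
    return subprocess.run(build_test_command(recipe_path, output_dir)).returncode


if __name__ == "__main__":
    # Only prints the command; call run_recipe_test() to actually run it.
    print(" ".join(build_test_command("YourRecipe.recipe", r"D:\temp")))
```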
