05-13-2019, 02:48 PM | #1 |
Wizard
Posts: 1,033
Karma: 11123121
Join Date: Mar 2013
Location: Guben, Brandenburg, Germany
Device: Kobo Clara 2E, Tolino Shine 3
|
Replacement of Replacement Character
Now that I'm adjusting my news download, I have one small question. In the online original, my news source uses quotation marks of this sort:
Code:
„...“
In the downloaded news they are replaced by the replacement character:
Code:
�...�
No big problem, but ... ugly. Is it possible to edit the recipe so that it replaces the replacement characters with quotation marks (of any kind)? The original site is encoded in ISO-8859-1, and so is the recipe. I changed the recipe's encoding to utf-8, but this didn't help. |
05-14-2019, 02:32 AM | #2 |
creator of calibre
Posts: 44,303
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Make sure the encoding field in the recipe matches the encoding of the website and you will be fine. If you want to do search and replace in the recipe, you can use preprocess_regexps.
|
05-14-2019, 02:49 AM | #3 |
Wizard
|
This was the first thing I tried, despite my technical ignorance: I checked whether the encoding of the original website matched the encoding of the recipe, and to my astonishment it did. So this does not seem to be the culprit.
How do I use preprocess_regexps, step by step, please? (I really am technically ignorant, sorry.)
Edit: In the meantime I noticed that in some individual articles the quotation marks are displayed correctly, although their source code is the same as that of the other articles. Hm ... this is getting interesting.
Edit': There is one difference, however: in the articles with the replacement character, quotes are represented by „...“, whereas in the correctly displayed articles they are "...".
Last edited by Leonatus; 05-14-2019 at 03:22 AM. |
05-14-2019, 06:18 AM | #4 |
Wizard
|
I read in calibre's documentation that preprocess_regexps should look like this:
Code:
preprocess_regexps = [
    (re.compile(r'<!--Article ends here-->.*</body>', re.DOTALL|re.IGNORECASE),
     lambda match: '</body>'),
]
|
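For a step-by-step picture, here is a minimal sketch of what a preprocess_regexps list amounts to. Each entry is a pair: a compiled pattern, and a function that receives the match object and returns the replacement text. The loop below is an assumption about how such pairs are applied (one `sub` call per pair, in order), not calibre's actual code; `apply_rules` and `sample_html` are hypothetical names.

```python
import re

# Same shape as the recipe attribute: a list of (compiled pattern, function) pairs.
preprocess_regexps = [
    (re.compile(r'<!--Article ends here-->.*</body>', re.DOTALL | re.IGNORECASE),
     lambda match: '</body>'),
]


def apply_rules(html, rules):
    """Hypothetical helper: run each rule over the page source, in order."""
    for pattern, func in rules:
        html = pattern.sub(func, html)
    return html


# Made-up page source, for illustration only.
sample_html = '<body><p>Text</p><!--Article ends here--><div>footer</div></body>'
cleaned = apply_rules(sample_html, preprocess_regexps)
print(cleaned)
```

In a recipe, only the list itself is needed, as a class attribute, together with `import re` at the top of the file.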
05-14-2019, 02:14 PM | #5 |
Developer
Posts: 155
Karma: 280
Join Date: Nov 2010
Device: Kindle 3 (Keyboard) 3G / iPad 9 WiFi / Google Pixel 6a (Android)
|
Untested:
Code:
preprocess_regexps = [
    (re.compile(r'[„“]'), lambda match: '"'),
]
|
05-14-2019, 02:56 PM | #6 |
Wizard
|
Thank you, but it doesn't work. The replacement characters still appear.
|
05-14-2019, 03:05 PM | #7 |
Developer
|
I don't think I have ever used Unicode in regular expressions. Did you just copy my code, or did you replace the „“ characters in it with the ones copied from the source webpage?
Otherwise this variant might work better:
Code:
preprocess_regexps = [
    (re.compile(r'„|“'), lambda match: '"'),
]
|
05-14-2019, 03:14 PM | #8 |
Wizard
|
The variant didn't work either. I had simply copied and pasted the code from your post; the characters reproduced in #1 were originally copied from the website and from calibre's ebook-viewer, respectively (the display is the same as on my reader).
The original recipe is this:
Code:
from calibre.web.feeds.news import BasicNewsRecipe


class AdvancedUserRecipe1295262156(BasicNewsRecipe):
    title = u'kath.net'
    __author__ = 'Bobus'
    description = u'Katholische Nachrichten'
    oldest_article = 7
    language = 'de'
    max_articles_per_feed = 100
    no_stylesheets = True
    encoding = 'iso-8859-1'

    feeds = [(u'kath.net', u'https://www.kath.net/2005/xml/index.xml')]

    def print_version(self, url):
        return url + "/print/yes"

    def get_browser(self, *a, **kwargs):
        kwargs['verify_ssl_certificates'] = False
        return BasicNewsRecipe.get_browser(self, *a, **kwargs)

    extra_css = 'td.textb {font-size: medium;}'
|
05-14-2019, 04:12 PM | #9 |
Developer
|
Sorry, all the things I googled and tried didn't work. I'm running out of ideas.
|
05-15-2019, 12:49 AM | #10 |
creator of calibre
|
You need to replace the replacement character, not the quote, since the quote will already have been replaced by the replacement character by the time preprocess_regexps runs.
|
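A sketch of that advice, untested against the actual feed: the pattern targets U+FFFD itself, written as the escape `\ufffd` so that it does not depend on how the recipe file is saved. The sample string and the choice of a plain `"` as the replacement are assumptions for illustration.

```python
import re

REPLACEMENT_CHARACTER = '\ufffd'  # U+FFFD, the Unicode replacement character

preprocess_regexps = [
    # Replace every U+FFFD with a plain double quote.
    (re.compile(REPLACEMENT_CHARACTER), lambda match: '"'),
]

# Made-up sample of already-decoded text, for illustration only.
sample = '\ufffdHallo\ufffd, sagte er.'
for pattern, func in preprocess_regexps:
    sample = pattern.sub(func, sample)
print(sample)
```

This only helps if the text really contains U+FFFD at that stage, and it cannot restore which quote was the opening one and which the closing one.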
05-15-2019, 02:06 AM | #11 |
Wizard
|
Hm, that was my thought, too, but it didn't work either, at least not when following Siebert's suggestion. Anyway, thanks for the help!
|
05-15-2019, 10:21 AM | #12 |
Wizard
|
Should I perhaps escape the replacement character, and how do I do this?
|
05-15-2019, 10:51 AM | #13 | |
Well trained by Cats
Posts: 30,341
Karma: 58032210
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
In theory you could escape any character: \e\s\c\a\p\e. (If in doubt, I escape the symbols I search for. Not all of them really need to be escaped.) |
|
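On escaping in Python's `re` module, a small sketch: `re.escape` adds backslashes where they are needed, and U+FFFD has no special meaning in a pattern, so it matches with or without escaping. One caution: in Python a backslash in front of a letter is not a harmless escape, since `\s`, for instance, means "any whitespace". The sample strings are made up.

```python
import re

# re.escape protects characters that have a special meaning in a pattern.
escaped_dot = re.escape('.')
print(escaped_dot)

# U+FFFD is not a metacharacter: the raw and the escaped form match the same text.
text = 'a\ufffdb'
plain = re.sub('\ufffd', '"', text)
escaped = re.sub(re.escape('\ufffd'), '"', text)
print(plain == escaped)

# A backslash before a letter changes the meaning: \s is "whitespace", not the letter s.
matches_space = re.fullmatch(r'\s', ' ') is not None
matches_letter = re.fullmatch(r'\s', 's') is not None
print(matches_space, matches_letter)
```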
05-15-2019, 11:11 AM | #14 | |
Wizard
|
Quote:
Edit: In the Wikipedia article "Specials (Unicode block)" I found this: "... It has become increasingly common for software to interpret invalid UTF-8 by guessing the bytes are in another byte-based encoding such as ISO-8859-1."
Last edited by Leonatus; 05-15-2019 at 11:19 AM. |
|
05-15-2019, 07:40 PM | #15 |
Enthusiast
Posts: 36
Karma: 10
Join Date: Dec 2017
Location: Los Angeles, CA
Device: Smart Phone
|
According to Wikipedia (see ISO-8859-1 and Windows-1252), webpages and emails are commonly mislabeled with the encoding ISO-8859-1 when it should be Windows-1252. Most web browsers and email clients will treat this encoding as Windows-1252. This practice is so prevalent that it became part of the HTML5 specification. So any webpage that claims to be encoded with ISO-8859-1 should be treated as being encoded with Windows-1252.
Code:
encoding = 'windows-1252'
Last edited by lui1; 05-15-2019 at 07:51 PM. Reason: fix typos |
|
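To illustrate why the codec matters, a sketch with made-up bytes: in Windows-1252 the bytes 0x84 and 0x93 are the German quotes „ and “, in ISO-8859-1 the same bytes are invisible C1 control characters, and as UTF-8 they are invalid, so a lenient decoder substitutes U+FFFD. Which of these paths calibre actually takes for this feed is not established in the thread; the byte string is an assumption.

```python
# Made-up sample: "Hallo" wrapped in the bytes 0x84 and 0x93.
raw = b'\x84Hallo\x93'

as_cp1252 = raw.decode('windows-1252')           # German low and high quotes
as_latin1 = raw.decode('iso-8859-1')             # C1 control characters U+0084, U+0093
as_utf8 = raw.decode('utf-8', errors='replace')  # invalid bytes become U+FFFD

print(as_cp1252)
print([hex(ord(c)) for c in as_latin1 if ord(c) > 0x7f])
print(as_utf8 == '\ufffdHallo\ufffd')
```

Only the Windows-1252 reading yields printable quotation marks, which is consistent with the suggestion to set the recipe's encoding to 'windows-1252'.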