Replacement of Replacement Character

Leonatus · 05-13-2019, 02:48 PM

Now that I'm adjusting my news download, I've got one small question: in the online original, my news has quotation marks of this sort:

Code:

„...“

In the downloaded news they are replaced by the replacement character:

Code:

�...�

No big problem, but ... ugly.
Is it possible to edit the recipe so that it replaces the replacement characters with quotation marks (of any kind)?
The original site is encoded in ISO-8859-1, and so is the recipe. I changed the recipe's encoding to utf-8, but this didn't help.

kovidgoyal · 05-14-2019, 02:32 AM

Make sure the encoding field in the recipe matches the encoding of the website and you will be fine. If you want to do search and replace in the recipe, you can use preprocess_regexps.

Leonatus · 05-14-2019, 02:49 AM

That was the first thing I tried, in spite of my technical ignorance: I checked whether the encoding of the original website matched the encoding of the recipe, and to my astonishment it did. So this does not seem to be the culprit.

Could you please explain how to use preprocess_regexps step by step (I'm really technically ignorant, sorry)?

Edit: In the meantime I noticed that in some articles the quotation marks are displayed correctly, although their source code looks the same as that of the other articles. Hm... this is getting interesting.

Edit 2: There is one difference, however: in the articles with the replacement character, quotes are represented by „...“, whereas in the correctly displayed articles they are "...".

Leonatus · 05-14-2019, 06:18 AM

I read in Calibre's documentation that preprocess_regexps should look like this:

Code:

preprocess_regexps = [
   (re.compile(r'<!--Article ends here-->.*</body>', re.DOTALL|re.IGNORECASE),
    lambda match: '</body>'),
]

Unfortunately, I have no idea how to proceed in order to replace all „ and “ with ". Could one of the pros here please give me a hint on how to do this?

siebert · 05-14-2019, 02:14 PM

Untested:

Code:

preprocess_regexps = [
   (re.compile(r'[„“]'),
    lambda match: '"'),
]

Leonatus · 05-14-2019, 02:56 PM

Thank you, but it doesn't work. The replacement characters still appear.

siebert · 05-14-2019, 03:05 PM

I don't think I've ever used Unicode in regular expressions. Did you just copy my code, or did you try to replace the „“ chars in it with the ones copied from the source webpage?

Otherwise this variant might work better:

Code:

preprocess_regexps = [
   (re.compile(r'„|“'),
    lambda match: '"'),
]

Or you could post the whole recipe here, so I can test it.

Leonatus · 05-14-2019, 03:14 PM

The variant didn't work either. I had simply copied and pasted the code from your post; the characters reproduced in post #1 were originally copied from the website and from Calibre's ebook viewer, respectively (the display is the same as on my reader).
This is the original recipe:

Code:

from calibre.web.feeds.news import BasicNewsRecipe


class AdvancedUserRecipe1295262156(BasicNewsRecipe):
    title = u'kath.net'
    __author__ = 'Bobus'
    description = u'Katholische Nachrichten'
    oldest_article = 7
    language = 'de'
    max_articles_per_feed = 100
    no_stylesheets = True
    encoding = 'iso-8859-1'

    feeds = [(u'kath.net', u'https://www.kath.net/2005/xml/index.xml')]

    def print_version(self, url):
        return url + "/print/yes"

    def get_browser(self, *a, **kwargs):
        kwargs['verify_ssl_certificates'] = False
        return BasicNewsRecipe.get_browser(self, *a, **kwargs)

    extra_css = 'td.textb {font-size: medium;}'

Thank you for testing!

siebert · 05-14-2019, 04:12 PM

Sorry, all the things I googled and tried didn't work. I'm running out of ideas.

kovidgoyal · 05-15-2019, 12:49 AM

You need to replace the replacement character, not the quote, since the quote will already have been replaced by the replacement character by the time preprocess_regexps runs.
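
A minimal sketch of what such a rule could look like, not tested against this recipe. It assumes the broken quotes arrive either as U+FFFD or as the C1 control characters U+0084 and U+0093 (which is what decoding the Windows-1252 quote bytes as ISO-8859-1 produces); `apply_rules` and `sample` are hypothetical and only there to exercise the list outside calibre.

```python
import re

# Assumption: the text seen by the rule contains U+FFFD or the C1 control
# characters U+0084 / U+0093 in place of the original quotes.
preprocess_regexps = [
    (re.compile(u'[\ufffd\u0084\u0093]'), lambda match: '"'),
]


def apply_rules(html, rules):
    """Hypothetical helper: apply (pattern, function) pairs in order."""
    for pattern, func in rules:
        html = pattern.sub(func, html)
    return html


# Made-up sample containing both kinds of broken quote.
sample = u'\ufffdZitat\ufffd und \x84Zitat\x93'
cleaned = apply_rules(sample, preprocess_regexps)
print(cleaned)
```

In a recipe, only the `preprocess_regexps` list would go into the class body, with `import re` at the top of the file.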

Leonatus · 05-15-2019, 02:06 AM

Quote:

Originally Posted by kovidgoyal

You need to replace the replacement character, not the quote, since the quote will already have been replaced by the replacement character by the time preprocess_regexps runs.

Hm, that was my thought, too, but it didn't work either, at least not when following siebert's suggestion. Anyway, thanks for the help!

Leonatus · 05-15-2019, 10:21 AM

Should I perhaps escape the replacement character, and how do I do this?

theducks · 05-15-2019, 10:51 AM

Quote:

Originally Posted by Leonatus

Should I perhaps escape the replacement character, and how do I do this?

The backslash is the 'escape'. \\ allows the \ to be the target.
In theory you could escape any symbol this way (letters are different: \s, \d and the like have meanings of their own).
(If in doubt, I escape the symbols I search for. Not all of them really need to be escaped.)
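
For literal text, Python's `re.escape` does the escaping for you. A small sketch with made-up sample strings; note that the replacement character itself is not a regex metacharacter, so escaping it should not change what it matches.

```python
import re

literal = '5.00 (ca.)'

# Unescaped: "." matches any character and "(ca.)" is read as a group.
loose = re.compile(literal)
# Escaped: every metacharacter is matched literally.
strict = re.compile(re.escape(literal))

loose_hit = loose.search('5x00 cab') is not None
strict_hit = strict.search('5x00 cab') is not None
exact_hit = strict.search('Preis 5.00 (ca.)') is not None

# The replacement character U+FFFD matches itself, escaped or not.
repl_hit = re.compile(re.escape(u'\ufffd')).search(u'a\ufffdb') is not None

print(loose_hit, strict_hit, exact_hit, repl_hit)
```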

Leonatus · 05-15-2019, 11:11 AM

Quote:

Originally Posted by theducks

The backslash is the 'escape'. \\ allows the \ to be the target.
In theory you could escape any symbol this way (letters are different: \s, \d and the like have meanings of their own).
(If in doubt, I escape the symbols I search for. Not all of them really need to be escaped.)

I did this, but to no avail. My thought now is that perhaps the ISO 8859-1 code for the replacement character should be searched for, but this is very much beyond my abilities.
Edit: In the Wikipedia article "Specials (Unicode block)" I found this: "... It has become increasingly common for software to interpret invalid UTF-8 by guessing the bytes are in another byte-based encoding such as ISO-8859-1."

lui1 · 05-15-2019, 07:40 PM

According to Wikipedia (see ISO-8859-1 and Windows-1252), webpages and emails are commonly mislabeled with the encoding ISO-8859-1 when it should be Windows-1252. Most web browsers and email clients will treat this encoding as Windows-1252. This practice is so prevalent that it became part of the HTML5 specification. So any webpage which claims to be encoded with ISO-8859-1 should be treated as being encoded with Windows-1252.

Code:

encoding = 'windows-1252'
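
To see why this matters for the quotes in question, here is a small sketch outside calibre. The byte string `raw` is a made-up sample; 0x84 and 0x93 are the bytes Windows-1252 uses for „ and “, while ISO-8859-1 maps the same bytes to invisible C1 control characters.

```python
# Made-up sample: "Zitat" wrapped in the Windows-1252 bytes for „ and “.
raw = b'\x84Zitat\x93'

as_cp1252 = raw.decode('windows-1252')
as_latin1 = raw.decode('iso-8859-1')

# repr() makes the control characters in the second result visible.
print(repr(as_cp1252))
print(repr(as_latin1))
```

The first result contains U+201E and U+201C, the quotation marks from the website; the second contains U+0084 and U+0093, which have no printable glyph.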

