06-13-2012, 11:45 AM | #1 |
Addict
Posts: 320
Karma: 56788
Join Date: Jun 2011
Device: Kindle
|
\b matches accented characters
I was trying to catch instances where a blank space had been inserted in place of an apostrophe, rendering strings such as "John s " or "we ve " or "don t ", etc. So, I came up with:
Code:
\s(?=([st]|re|ve|ll)\b)
Any ideas? |
06-13-2012, 02:54 PM | #2 |
Grand Sorcerer
Posts: 27,605
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Turn on the Unicode properties (*UCP) so \b becomes Unicode-aware. Otherwise it treats those accented characters as non-word characters, so it sees a word boundary right before them.
Code:
(*UCP)\s(?=([st]|re|ve|ll)\b)
Code:
<p>a séance töten don t</p>
<p>don tyou see sheriff s</p>
<p>we ll I'll be a mönkey s uncle</p> |
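For anyone who wants to see the difference outside the editor, here is a rough Python sketch. It is an analogy, not PCRE: Python's built-in re module has no (*UCP) verb, so I'm assuming its default Unicode-aware mode stands in for "UCP on" and the re.ASCII flag stands in for "UCP off". The function name restore_apostrophes is made up for the illustration.

```python
import re

# The pattern from the thread, minus the PCRE-only (*UCP) prefix.
PATTERN = r"\s(?=([st]|re|ve|ll)\b)"


def restore_apostrophes(text, unicode_aware=True):
    """Replace a stray space before s, t, re, ve or ll with an apostrophe."""
    # Assumption: re.ASCII limits \w (and therefore \b) to [a-zA-Z0-9_],
    # which mimics the behaviour seen when (*UCP) is absent.
    flags = 0 if unicode_aware else re.ASCII
    return re.sub(PATTERN, "'", text, flags=flags)


if __name__ == "__main__":
    sample = "a s\u00e9ance t\u00f6ten don t"  # "a séance töten don t"
    print(restore_apostrophes(sample, unicode_aware=False))  # a'séance'töten don't
    print(restore_apostrophes(sample))                       # a séance töten don't
```

In ASCII mode the "s" of "séance" and the "t" of "töten" are followed by what the engine considers a non-word character, so \b fires and the space is wrongly replaced. In Unicode mode only "don t" is touched.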
06-13-2012, 05:29 PM | #3 |
Addict
Posts: 320
Karma: 56788
Join Date: Jun 2011
Device: Kindle
|
Thanks, as always, DD... works like a charm
Tangentially, does using (*UCP) mean that [a-z] would match the same characters as \p{Ll}? |
06-13-2012, 06:21 PM | #4 | |
Grand Sorcerer
Posts: 27,605
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
You're quite welcome.
Quote:
|
|
06-14-2012, 04:12 AM | #5 |
frumious Bandersnatch
Posts: 7,516
Karma: 19000001
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
Besides, there are languages where A and Z are not the first and last letters of the alphabet. For instance, in Danish and Norwegian the alphabet is A ... Z Æ Ø Å (yes, dictionaries run from A to Å). What would [a-z] mean in these cases if it were to be extended to non-ASCII characters?
|
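To make the [a-z] versus \p{Ll} question concrete, here is a small Python sketch. Two assumptions: Python's built-in re does not support \p{Ll}, so unicodedata.category(ch) == "Ll" is used as a stand-in for it, and the helper names and the sample Danish word are just for illustration. As far as I know, a literal range like [a-z] is defined by code points in both PCRE and Python, so a Unicode switch does not widen it.

```python
import re
import unicodedata


def ascii_range_letters(text):
    """Characters matched by the literal range [a-z] (code points 97 to 122)."""
    return re.findall(r"[a-z]", text)


def lowercase_letters(text):
    """Stand-in for a lowercase-letter property: Unicode category Ll."""
    return [ch for ch in text if unicodedata.category(ch) == "Ll"]


if __name__ == "__main__":
    word = "bl\u00e5b\u00e6rgr\u00f8d"  # sample Danish word "blåbærgrød"
    print(ascii_range_letters(word))  # ['b', 'l', 'b', 'r', 'g', 'r', 'd']
    print(lowercase_letters(word))    # all ten letters, including å, æ, ø
    # Code points for z, æ, ø, å: the alphabet order Æ Ø Å is not code point order.
    print([ord(c) for c in "z\u00e6\u00f8\u00e5"])  # [122, 230, 248, 229]
```

The last line shows the point about alphabets: å has a lower code point than æ and ø even though it comes last in the Danish alphabet, so a range can only ever follow code point order, not dictionary order.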
06-14-2012, 07:40 AM | #6 | |
Imperfect Perfectionist
Posts: 480
Karma: 724664
Join Date: Dec 2011
Location: Ølstykke, Denmark
Device: none
|
Quote:
Regards, Kim |
|
06-14-2012, 08:32 AM | #7 | |
Grand Sorcerer
Posts: 27,605
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Quote:
Why not just use \p{L} and catch all potential Unicode letters? That's more than likely what people are intending to catch when they use [A-Za-z] anyway (whether they consciously realize it or not). Or do people purposely mean to exclude certain characters that occur in words like café or façade or naïve? Just a thought.
I just know I've found that when using "letters" for search criteria in a regexp on an English-language text... thinking strictly in terms of "English letters" will often produce results I didn't really intend. The original topic of this thread is a perfect example of this. So I've learned to approach Regex Find & Replace from a "Unicode first" frame of mind when it comes to ebooks.
Last edited by DiapDealer; 06-14-2012 at 08:36 AM. |
|
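A quick Python sketch of the [A-Za-z] versus "any letter" difference on the words mentioned above. Note the assumption: Python's built-in re has no \p{L}, so the common idiom [^\W\d_] (word characters minus digits and underscore) is used as an approximation. It is not identical to \p{L} (it can also let through a few numeric symbols), but it behaves the same on ordinary text like this.

```python
import re

# "English letters" only.
ASCII_LETTERS = re.compile(r"[A-Za-z]+")

# Approximation of \p{L}+ for Python's built-in re (an assumption, see above):
# anything that is a word character but not a digit or underscore.
UNICODE_LETTERS = re.compile(r"[^\W\d_]+")

if __name__ == "__main__":
    text = "caf\u00e9 fa\u00e7ade na\u00efve"  # "café façade naïve"
    print(ASCII_LETTERS.findall(text))    # ['caf', 'fa', 'ade', 'na', 've']
    print(UNICODE_LETTERS.findall(text))  # ['café', 'façade', 'naïve']
```

The ASCII range chops each word at the accented letter, which is rarely what anyone searching for "words" intends.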
06-14-2012, 08:39 AM | #8 |
Addict
Posts: 320
Karma: 56788
Join Date: Jun 2011
Device: Kindle
|
related question:
Why does \b match letters that come after an apostrophe? E.g. a search for ’\b matches the apostrophes in "there’s", "it’s", "Bob’s", etc. |
06-14-2012, 08:59 AM | #9 | ||
Imperfect Perfectionist
Posts: 480
Karma: 724664
Join Date: Dec 2011
Location: Ølstykke, Denmark
Device: none
|
Quote:
Quote:
But I'll definitely try \p{L}. One learns new tricks every day.
Regards, Kim |
||
06-14-2012, 09:15 AM | #10 | |
Grand Sorcerer
Posts: 27,605
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Quote:
\b matches in three kinds of positions:
* Before the first character in the string, if the first character is a word character.
* After the last character in the string, if the last character is a word character.
* Between two characters in the string, where one is a word character and the other is not a word character.
A word character (without the (*UCP) flag) is [a-zA-Z0-9_], which is what \w matches.
"There's", for better or worse, is not one word in the eyes of regex, because an apostrophe is not a word character. "There" would be one word and "s" would be another. What are you wishing ’\b would find?
Last edited by DiapDealer; 06-14-2012 at 09:45 AM. |
|
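The three boundary rules above can be checked by listing every position where \b matches. This is a Python sketch rather than PCRE, with re.ASCII assumed to play the role of "no (*UCP)"; boundary_positions is a hypothetical helper name.

```python
import re


def boundary_positions(text, unicode_aware=True):
    """Return every index in text at which \\b matches."""
    # Assumption: re.ASCII approximates matching without (*UCP).
    flags = 0 if unicode_aware else re.ASCII
    return [m.start() for m in re.finditer(r"\b", text, flags)]


if __name__ == "__main__":
    # T h e r e ' s  ->  boundaries at the start, either side of the
    # apostrophe, and the end.
    print(boundary_positions("There's"))  # [0, 5, 6, 7]

    word = "s\u00e9ance"  # "séance"
    print(boundary_positions(word, unicode_aware=False))  # [0, 1, 2, 6]
    print(boundary_positions(word))                       # [0, 6]
```

Positions 5 and 6 in "There's" are why ’\b finds the apostrophe in "there’s": the apostrophe is not a word character, the "s" after it is, so there is a boundary between them.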
06-14-2012, 12:16 PM | #11 | |
Addict
Posts: 320
Karma: 56788
Join Date: Jun 2011
Device: Kindle
|
Quote:
The scenario I'm trying to catch is instances in which OCR software interpreted a ” as a ’. My guess is that the appropriate regex would be something like ’\b(?!\p{Ll}). I could also probably add a negative lookbehind to exclude common instances of the ’ functioning as a (plural) possessive or denoting an omitted character (maybe something like (?<!s|in)). Mostly my question was academic: just a way for me to get a better understanding of how and why regex behaves the way it does.
Last edited by ElMiko; 06-14-2012 at 12:18 PM. |
|
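Here is a rough Python translation of the guessed pattern, only to show what it would and would not flag. Several assumptions: Python's re wants fixed-width lookbehinds, so (?<!s|in) is split into two lookbehinds; \p{Ll} is not available, so the "not followed by a lowercase letter" check is done with unicodedata instead; and suspect_positions is a made-up name.

```python
import re
import unicodedata

APOSTROPHE = "\u2019"  # the curly apostrophe / right single quote

# (?<!s|in) split in two because Python requires fixed-width lookbehinds.
CANDIDATE = re.compile(r"(?<!s)(?<!in)" + APOSTROPHE + r"\b")


def suspect_positions(text):
    """Indexes of curly apostrophes the guessed pattern would flag."""
    hits = []
    for m in CANDIDATE.finditer(text):
        # \b after a non-word character guarantees a word character follows.
        following = text[m.end()]
        # Stand-in for (?!\p{Ll}): skip when a lowercase letter follows.
        if unicodedata.category(following) != "Ll":
            hits.append(m.start())
    return hits


if __name__ == "__main__":
    print(suspect_positions("it\u2019s"))                    # []
    print(suspect_positions("\u201cGo!\u2019He left."))      # [4]
    print(suspect_positions("\u201cGo,\u2019 he said"))      # []
```

One thing the sketch makes visible: because of the \b, the pattern only fires when the ’ is directly followed by a word character, so a ’ followed by a space (as in the last sample) is not flagged.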
06-14-2012, 12:50 PM | #12 | |
Grand Sorcerer
Posts: 27,605
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Quote:
A lot of times (but certainly not always) in a closing quote situation, the previous character is going to be punctuation of some kind. Quotes within quotes will probably foul things up, though. Last edited by DiapDealer; 06-14-2012 at 01:03 PM. |
|
|