#1 |
Addict
Posts: 326
Karma: 56788
Join Date: Jun 2011
Device: Kindle
|
\b matches accented characters
I was trying to catch instances where a blank space had been inserted in place of an apostrophe, rendering strings such as "John s " or "we ve " or "don t ", etc. So, I came up with:
Code:
\s(?=([st]|re|ve|ll)\b)
Any ideas?
#2 |
Grand Sorcerer
Posts: 28,208
Karma: 202024788
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Turn on Unicode properties with (*UCP) so that \b becomes Unicode-aware. Otherwise it treats those accented characters as non-word characters, so it sees a word boundary right before them.
Code:
(*UCP)\s(?=([st]|re|ve|ll)\b)
Code:
<p>a séance töten don t</p>
<p>don tyou see sheriff s</p>
<p>we ll I'll be a mönkey s uncle</p>
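For anyone who wants to see the difference outside the editor, here is a minimal Python sketch. It is only a stand-in for the PCRE engine discussed in this thread: Python's `re` has no `(*UCP)` verb, so `re.ASCII` is assumed here to imitate the default ASCII-only `\b`, and Python's default Unicode matching is assumed to imitate `(*UCP)`. The helper name `stray_space_positions` is made up for illustration.

```python
import re

# The pattern from the first post: a space followed by s, t, re, ve or ll
# and then a word boundary.
PATTERN = r"\s(?=([st]|re|ve|ll)\b)"

# "a séance töten don t", with precomposed é (U+00E9) and ö (U+00F6).
SAMPLE = "a s\u00e9ance t\u00f6ten don t"

def stray_space_positions(text, ascii_only):
    """Return the index of every space the pattern matches.

    ascii_only=True imitates an engine whose \\b only knows [A-Za-z0-9_];
    ascii_only=False imitates a Unicode-aware \\b.
    """
    flags = re.ASCII if ascii_only else 0
    return [m.start() for m in re.finditer(PATTERN, text, flags)]

# ASCII-only: é and ö count as non-word characters, so the spaces before
# "séance" and "töten" are matched as well as the one before the final "t".
print(stray_space_positions(SAMPLE, ascii_only=True))   # [1, 8, 18]

# Unicode-aware: only the space in "don t" is matched.
print(stray_space_positions(SAMPLE, ascii_only=False))  # [18]
```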
#3 |
Addict
Posts: 326
Karma: 56788
Join Date: Jun 2011
Device: Kindle
|
Thanks, as always, DD... works like a charm.
Tangentially, does using (*UCP) mean that [a-z] would match the same characters as \p{Ll}?
#4 | |
Grand Sorcerer
Posts: 28,208
Karma: 202024788
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
You're quite welcome.
Quote:
|
|
#5 |
frumious Bandersnatch
Posts: 7,541
Karma: 19001081
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
Besides, there are languages where A and Z are not the first and last letters of the alphabet. For instance, in Danish and Norwegian the alphabet is A ... Z Æ Ø Å (yes, dictionaries run from A to Å). What would [a-z] mean in these cases if it were to be extended to non-ASCII characters?
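A quick way to see why a range cannot follow an alphabet: a character class range is a span of code points, not of dictionary order. The Python sketch below uses Python's `re` as a stand-in for the editor's engine, and `in_a_to_aring` is a made-up helper. It builds the range from a to å and checks a few characters against it.

```python
import re

# Range from "a" (U+0061) to "å" (U+00E5), compared purely by code point.
A_TO_ARING = re.compile("[a-\u00e5]")

def in_a_to_aring(ch):
    """Return True if the single character ch falls inside [a-å]."""
    return A_TO_ARING.fullmatch(ch) is not None

# z, {, ×, æ, ø
for ch in ["z", "{", "\u00d7", "\u00e6", "\u00f8"]:
    print(f"U+{ord(ch):04X} {ch!r}: {in_a_to_aring(ch)}")

# "{" (U+007B) and "×" (U+00D7) are inside the range although they are not
# letters, while æ (U+00E6) and ø (U+00F8) are outside it because their
# code points come after å (U+00E5).
```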
|
#6 | |
Imperfect Perfectionist
Posts: 594
Karma: 863576
Join Date: Dec 2011
Location: Ølstykke, Denmark
Device: none
|
Quote:
Regards, Kim |
|
#7 | |
Grand Sorcerer
Posts: 28,208
Karma: 202024788
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Quote:
Why not just use \p{L} and catch all potential Unicode letters? That's more than likely what people are intending to catch when they use [A-Za-z] anyway (whether they consciously realize it or not). Or do people purposely mean to exclude certain characters that occur in words like café or façade or naïve? Just a thought.
I just know I've found that when using "letters" as search criteria in a regex on an English-language text, thinking strictly in terms of "English letters" will often produce results I didn't really intend. The original topic of this thread is a perfect example of this. So I've learned to approach Regex Find & Replace from a "Unicode first" frame of mind when it comes to ebooks.
Last edited by DiapDealer; 06-14-2012 at 09:36 AM.
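As a rough illustration of the difference, here is a small Python sketch. Note the assumption: Python's built-in `re` does not support `\p{L}`, so `[^\W\d_]` is used as an approximation of "any Unicode letter". It is not an exact equivalent of `\p{L}`.

```python
import re

# "café façade naïve" with precomposed accented letters.
TEXT = "caf\u00e9 fa\u00e7ade na\u00efve"

# "English letters" only: every accented letter splits its word.
ascii_words = re.findall(r"[A-Za-z]+", TEXT)

# Approximation of \p{L}+: word characters that are neither digits
# nor the underscore.
unicode_words = re.findall(r"[^\W\d_]+", TEXT)

print(ascii_words)    # ['caf', 'fa', 'ade', 'na', 've']
print(unicode_words)  # ['café', 'façade', 'naïve']
```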
|
#8 |
Addict
Posts: 326
Karma: 56788
Join Date: Jun 2011
Device: Kindle
|
Related question:
Why does \b match letters that come after an apostrophe? E.g. a search for ’\b matches the apostrophes in "there’s", "it’s", "Bob’s", etc.
#9 | ||
Imperfect Perfectionist
Posts: 594
Karma: 863576
Join Date: Dec 2011
Location: Ølstykke, Denmark
Device: none
|
Quote:
Quote:
But I'll definitely try \p{L}. One learns new tricks every day.
Regards, Kim
#10 | |
Grand Sorcerer
Posts: 28,208
Karma: 202024788
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Quote:
\b matches at a word boundary, which is one of three positions:
* Before the first character in the string, if the first character is a word character.
* After the last character in the string, if the last character is a word character.
* Between two characters in the string, where one is a word character and the other is not a word character.
A word character (without the (*UCP) flag) is [a-zA-Z0-9_], i.e. \w.
"There's" is, for better or worse, not one word in the eyes of regex, because an apostrophe is not a word character. "There" would be one word and "s" would be another.
What are you wishing ’\b would find?
Last edited by DiapDealer; 06-14-2012 at 10:45 AM.
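The boundary rules above can be checked with a short Python sketch. Python's `re` is assumed here as a stand-in for the editor's engine (its `\b` follows the same three rules), and `count_hits` is a made-up helper name.

```python
import re

# The search from the question: a right single quote (U+2019) followed
# by a word boundary.
APOSTROPHE_THEN_BOUNDARY = re.compile("\u2019\\b")

def count_hits(text):
    """Count how many times the pattern matches in text."""
    return len(APOSTROPHE_THEN_BOUNDARY.findall(text))

# The apostrophe is not a word character but the following "s" is,
# so there is a boundary between them in each word.
print(count_hits("there\u2019s it\u2019s Bob\u2019s"))  # 3

# Here the quote mark is followed by a space. Neither is a word
# character, so there is no boundary and no match.
print(count_hits("\u201cStop!\u2019 he said."))  # 0

# The apostrophe splits the contraction into two "words".
print(re.findall(r"\w+", "There\u2019s"))  # ['There', 's']
```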
|
#11 | |
Addict
Posts: 326
Karma: 56788
Join Date: Jun 2011
Device: Kindle
|
Quote:
The scenario I'm trying to catch is instances in which OCR software interpreted a ” as a ’. My guess is that the appropriate regex would be something like ’\b(?!\p{Ll}). I could also probably add a negative lookbehind to exclude common instances of the ’ functioning as a (plural) possessive or denoting an omitted character (maybe something like (?<!s|in)).
Mostly my question was academic: just a way for me to get a better understanding of how and why regex behaves the way it does.
Last edited by ElMiko; 06-14-2012 at 01:18 PM.
|
#12 | |
Grand Sorcerer
Posts: 28,208
Karma: 202024788
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Quote:
Often (but certainly not always) in a closing-quote situation, the previous character is going to be punctuation of some kind. Quotes within quotes will probably foul things up, though.
Last edited by DiapDealer; 06-14-2012 at 02:03 PM.
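One way that punctuation heuristic might be expressed, as a Python sketch only: the pattern, the punctuation set `[.,!?]` and the name `suspect_positions` are hypothetical illustrations, and Python's `re` stands in for the editor's engine.

```python
import re

# A right single quote (U+2019) immediately preceded by sentence
# punctuation: a candidate for a mis-read closing double quote.
SUSPECT_CLOSING_QUOTE = re.compile("(?<=[.,!?])\u2019")

def suspect_positions(text):
    """Return the index of every ’ that directly follows punctuation."""
    return [m.start() for m in SUSPECT_CLOSING_QUOTE.finditer(text)]

# The ’ after "Stop!" is flagged; the apostrophe in "It’s" is not.
print(suspect_positions("\u201cStop!\u2019 he said. It\u2019s late."))  # [6]

# A genuine closing single quote (a quote within a quote) is flagged too,
# which is the false positive warned about above.
print(suspect_positions("\u2018Go now,\u2019"))  # [8]
```

Any hits would still need to be reviewed by eye before replacing.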
|