05-28-2024, 10:55 PM | #1 |
Connoisseur
Posts: 78
Karma: 2138296
Join Date: Nov 2016
Device: ipad, Kindle Scribe, Kobo Libra 2
|
Using Regex Find/Change to change by Unicode
I often have to handle diacritic characters set in non-Unicode (8-bit) fonts, which need changing to their Unicode equivalents.
To do this I can use a GREP search in InDesign's Find/Change to, for example, replace character \x{00E1} with character \x{0101} for ā. It's not a big issue for me, but I wondered why this is not possible with the Regex Find/Replace in Sigil. I looked in the documentation but could not find any information about this there. Thanks, Jim |
05-28-2024, 11:20 PM | #2 |
Sigil Developer
Posts: 8,265
Karma: 5568412
Join Date: Nov 2009
Device: many
|
I am confused. Are you talking about changing font lookup tables, or about converting non-UTF-8 encoded files to UTF-8?
If the latter, and the 8-bit non-UTF-8 encoding is properly specified in the XHTML character set metadata, Sigil should recognize it and properly re-encode all XHTML files from that encoding to UTF-8. If you are talking about pasting Latin-1 or some other code page encoded text into Sigil and then trying to fix it in Sigil using regular expression find and replace, you can do that as well, since the PCRE2 library Sigil uses supports hex codes such as \xe1 in the search, which you can then replace with whatever Unicode value you want. Just look up any good reference on regular expressions or the online documentation for the PCRE2 library, for example https://www.pcre.org/current/doc/html/pcre2unicode.html, where you can find \x and other escapes. |
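As a rough illustration of the re-encoding step described above (not Sigil's actual import code), here is a minimal Python sketch. The function name `reencode_to_utf8` and the sample bytes are assumptions made for the example; the point is only that once the source encoding is known, conversion to UTF-8 is a decode followed by an encode.

```python
def reencode_to_utf8(raw: bytes, declared_encoding: str) -> bytes:
    """Decode bytes using the declared encoding, then re-encode as UTF-8."""
    text = raw.decode(declared_encoding)
    return text.encode("utf-8")


# "café á" stored as Latin-1: one byte per character (hypothetical sample).
latin1_bytes = b"caf\xe9 \xe1"
utf8_bytes = reencode_to_utf8(latin1_bytes, "latin-1")
print(utf8_bytes)  # b'caf\xc3\xa9 \xc3\xa1'
```

The single Latin-1 byte E1 becomes the two-byte UTF-8 sequence C3 A1, which is why the declared encoding has to be right before any conversion happens.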
05-28-2024, 11:52 PM | #3 | |
Connoisseur
Posts: 78
Karma: 2138296
Join Date: Nov 2016
Device: ipad, Kindle Scribe, Kobo Libra 2
|
Many thanks for that reference, Kevin.
I cannot see how to post a screen capture of a test page I just made, so I will try to show it here with a copy of the code from that page: Quote:
I can do this with a GREP (regex) search in InDesign by looking for \x{00E1} and replacing it with \x{0101}. But when I try to do the same in Find/Replace in Sigil with the Regex option turned on, it does not find the á character on that page. I wonder what I am doing wrong in this instance? Last edited by oston; 05-29-2024 at 12:02 AM. |
|
05-29-2024, 12:07 AM | #4 |
Well trained by Cats
Posts: 30,567
Karma: 58055868
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
@oston
Switch the MR editor to Advanced, then use the paperclip to attach a (screen capture) FILE |
05-29-2024, 12:15 AM | #5 |
Connoisseur
Posts: 78
Karma: 2138296
Join Date: Nov 2016
Device: ipad, Kindle Scribe, Kobo Libra 2
|
Many thanks @theducks.
The attached screen capture shows my entries in Find/Replace, trying to use the hex code to change the á character (00E1) to ā (0101) |
05-29-2024, 01:30 AM | #6 |
Wizard
Posts: 1,435
Karma: 8560466
Join Date: Sep 2021
Location: Australia
Device: Kobo Libra 2
|
What options have you selected under the Replace box?
|
05-29-2024, 09:05 AM | #7 |
Grand Sorcerer
Posts: 28,108
Karma: 201052868
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
You're saying that \x{00E1} doesn't find the á character on line 22 of the code in your screenshot?
It finds it OK when I paste your code above into Sigil and search for \x{00E1}. The Sigil Unicode property setting shouldn't matter in this particular case. In fact none of the Regex settings should really affect that particular search. I tested on Windows and Linux. Note: you won't be able to use \x{FFFF} syntax in Sigil's replace field (you'll have to use the character you want), but I got the impression it was finding á with \x{00E1} that you were having problems with. Last edited by DiapDealer; 05-29-2024 at 11:05 AM. |
05-29-2024, 09:46 AM | #8 | |
Connoisseur
Posts: 78
Karma: 2138296
Join Date: Nov 2016
Device: ipad, Kindle Scribe, Kobo Libra 2
|
Quote:
But when I try to replace with ā using \x{0101}, the á is replaced with the literal string \x{0101}. I have no idea why the search did not work yesterday but does today. |
|
05-29-2024, 10:28 AM | #9 |
Sigil Developer
Posts: 8,265
Karma: 5568412
Join Date: Nov 2009
Device: many
|
I still do not understand how a non-UTF-8 byte sequence E1 got into the file in the first place. Either the original XHTML was cp1252 or Latin-1 encoded and did not indicate that when being read in, so that it could not be properly converted to UTF-8, or a copy from a cp1252 or Latin-1 source was pasted in without proper conversion.
Either way, the find replace step should not be needed unless earlier steps broke someplace. The actual font used has nothing really to do with reading in and properly encoding a text file. The problem typically comes from not properly specifying the original encoding of the file inside it near the top. Without that, Sigil's auto detection code can sometimes incorrectly guess the input encoding. Detecting the difference between latin-x/cp-125x and utf-8 is actually quite hard from small snippets of text. |
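To make the detection problem concrete, here is a small Python sketch (an illustration only, not Sigil's auto-detection code; `candidate_decodings` is a made-up helper). It tries several encodings on the same bytes and reports which ones accept them.

```python
def candidate_decodings(raw: bytes, encodings=("utf-8", "latin-1", "cp1252")):
    """Return {encoding: text} for every encoding that decodes raw without error."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            pass  # this encoding rejects the byte sequence
    return results


# A lone E1 byte is not valid UTF-8, so only the 8-bit encodings accept it.
print(candidate_decodings(b"\xe1"))

# The bytes C3 A1 are valid in all three encodings but mean different things:
# "á" as UTF-8, "Ã¡" as Latin-1 or cp1252.
print(candidate_decodings(b"\xc3\xa1"))
```

Because 8-bit encodings accept almost any byte sequence, a short snippet rarely contains enough evidence to rule them out, which is why an explicit encoding declaration near the top of the file matters.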
05-29-2024, 10:44 AM | #10 |
Connoisseur
Posts: 78
Karma: 2138296
Join Date: Nov 2016
Device: ipad, Kindle Scribe, Kobo Libra 2
|
The InDesign files where this happens are old files. The font families used for the text are all 8-bit fonts, where the diacritics were coded at different places than in the later Unicode fonts.
In the page that you saw in the capture, the top two lines of diacritics used TNR in the InDesign document, where ā is 0101, the now standard place for ā. The lower two lines were in GaramondNo8BPS, a very old font family where the codes of the diacritics were not standard codes; there ā is placed at code 00E1. I don't know if this answers your confusion, Kevin. It would be helpful to be able to use the Sigil Regex find/replace with Find: \x{00E1} Replace: \x{0101} in the same way that I do in InDesign. But as I said at the beginning, this is NOT a big issue for me. It's more a matter of just wanting to know what I am doing wrong with the Regex search and replace using hex char strings. |
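For anyone scripting this kind of remapping outside Sigil or InDesign, here is a minimal Python sketch. The table name `LEGACY_TO_UNICODE`, the helper `fix_legacy_diacritics`, and the sample word are assumptions for illustration; only the 00E1 to 0101 pair comes from this thread, and any further entries would have to be read off the legacy font's own character chart.

```python
# Map the code point used by the legacy 8-bit font to the standard Unicode
# code point. Only this one pair is taken from the thread.
LEGACY_TO_UNICODE = {
    0x00E1: 0x0101,  # slot that holds á in Unicode fonts -> ā (a with macron)
}


def fix_legacy_diacritics(text: str) -> str:
    """Apply the legacy-to-Unicode mapping to every character in text."""
    return text.translate(LEGACY_TO_UNICODE)


print(fix_legacy_diacritics("n\u00e1ma"))  # nāma
```

A blanket replacement like this would also change any genuine á in the text, so in practice it should only be run over the passages that were set in the legacy font.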
05-29-2024, 10:48 AM | #11 |
Sigil Developer
Posts: 8,265
Karma: 5568412
Join Date: Nov 2009
Device: many
|
And as for using \x followed by 2 hex chars in the replacement, according to the code that should work, see:
https://github.com/Sigil-Ebook/Sigil...extBuilder.cpp So any byte sequence can be made by a string of \xHH escapes, where H is a hex character (0-9, a-f, A-F). If need be, we can add support for longer hex-specified chars to that code by supporting braces { and }, so that would work as well. Is support for the \x{abcd} format in replacements desired, or can people live with just \xab\xcd, which should be supported now according to the ReplacementBuilder code? Update: \xab\xcd actually does *not* work. Instead of combining the two consecutive byte values into one unsigned int16, the code promotes each of them separately to its own uint16 value. I will add support for replacement using \x{hhhhhhhh}, converted to a sequence of uint16 values. Last edited by KevinH; 05-29-2024 at 01:24 PM. |
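The difference between \xab\xcd and \x{abcd} can be sketched in Python. This is not the ReplacementBuilder code; `expand_hex_escapes` and `utf16_units` are hypothetical helpers that only mimic the behaviour described in the post: each two-digit escape becomes its own 16-bit value, while the braced form names one code point, which may need two UTF-16 units.

```python
import re

# Matches either the braced form (1 to 8 hex digits) or the two-digit form.
HEX_ESCAPE = re.compile(r"\\x\{([0-9A-Fa-f]{1,8})\}|\\x([0-9A-Fa-f]{2})")


def expand_hex_escapes(replacement: str) -> str:
    """Replace each hex escape in a replacement string with its character."""
    def to_char(match: re.Match) -> str:
        digits = match.group(1) or match.group(2)
        # chr() raises ValueError for values above 0x10FFFF.
        return chr(int(digits, 16))
    return HEX_ESCAPE.sub(to_char, replacement)


def utf16_units(text: str) -> list[int]:
    """Return the UTF-16 code units of text (surrogate pairs above U+FFFF)."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


print(expand_hex_escapes(r"\x{0101}"))                                   # ā
print([hex(u) for u in utf16_units(expand_hex_escapes(r"\xab\xcd"))])    # ['0xab', '0xcd']
print([hex(u) for u in utf16_units(expand_hex_escapes(r"\x{abcd}"))])    # ['0xabcd']
print([hex(u) for u in utf16_units(expand_hex_escapes(r"\x{1F600}"))])   # ['0xd83d', '0xde00']
```

The last line shows why the braced form maps to a sequence of uint16 values rather than a single one: code points above U+FFFF need a surrogate pair in UTF-16.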
05-29-2024, 10:51 AM | #12 |
Well trained by Cats
Posts: 30,567
Karma: 58055868
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Just use the 'Insert Special Character' tool (Omega icon). Set the insert point in Replace then click on the character in the tool
|
05-29-2024, 10:52 AM | #13 |
A Hairy Wizard
Posts: 3,248
Karma: 19222221
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 15/11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
|
Couldn’t you just use text f/r ?? (Don’t use regex)
|
05-29-2024, 10:55 AM | #14 | |
Sigil Developer
Posts: 8,265
Karma: 5568412
Join Date: Nov 2009
Device: many
|
So your InDesign files are improperly mixing unicode (utf-8) encoded text with most likely latin-1 encoded text in the same file to match an older font.
Mixing two different text encodings in the same file breaks the XHTML spec completely. To Sigil on import it would look like a pure UTF-8 encoded text file but with rogue encoded chars. At least that means there is no breakage in Sigil's import epub code, which is what I was worried about. Quote:
|
|
05-29-2024, 11:31 AM | #15 | |
Grand Sorcerer
Posts: 28,108
Karma: 201052868
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Quote:
Right now, replacing á with ā using codepoints would need to search for \x{e1}, but replace with \xe3 (the code for ā is not 0101 in any listing I've found). That seems counterintuitive, no? EDIT: my mistake. The code for ā is indeed 0101. My eyes and tiny screens resulted in the confusion of the tilde for the macron. Last edited by DiapDealer; 05-29-2024 at 12:06 PM. |
|
|