Old 05-28-2024, 09:55 PM   #1
oston
Connoisseur
Posts: 78
Karma: 2138296
Join Date: Nov 2016
Device: ipad, Kindle Scribe, Kobo Libra 2
Using Regex Find/Change to change by Unicode

I often have to handle diacritic characters set in non-Unicode (8-bit) fonts, which need changing to their Unicode equivalents.

To do this I can use a Grep search in InDesign's Find/Change to, for example, replace character \x{00E1} with character \x{0101} for ā.

It's not a big issue for me, but I wondered why this is not possible with the Regex Find/Replace in Sigil.

I looked in the documentation but could not find any information about this there.

Thanks
Jim
Old 05-28-2024, 10:20 PM   #2
KevinH
Sigil Developer
Posts: 8,156
Karma: 5450818
Join Date: Nov 2009
Device: many
I am confused. Are you talking about changing font lookup tables or changing non-utf-8 encoded files into utf-8?

If the latter, if the 8 bit non-utf-8 encoding is properly specified in the xhtml character set meta data, Sigil should recognize it and properly re-encode all xhtml files from the encoding to utf-8.

If you are talking about pasting latin-1 or some other code-page encoded text into Sigil and then trying to fix it in Sigil using regular expression find and replace, you can do that as well, since the pcre2 library used supports hex byte codes such as \xe1, which you can change to whatever unicode value you want.

Just look up any good reference on regular expressions or the online documentation on the pcre2 (library).

For example:

https://www.pcre.org/current/doc/html/pcre2unicode.html

where you can find \x and other escapes.
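
As a rough illustration of the idea (not Sigil itself), here is a minimal Python sketch that finds a character by its code point and swaps in another. It uses Python's `re` module rather than PCRE2, so the escape is written `\u00e1` instead of `\x{00E1}`; the sample string is made up for the demonstration.

```python
import re

# Hypothetical sample text containing the legacy character á (U+00E1).
sample = '<p class="Basic-Paragraph">\u00e1</p>'

# Raw-string pattern: the regex engine itself interprets \u00e1 as U+00E1.
pattern = re.compile(r"\u00e1")

# Replace every match with ā (U+0101), supplied as an ordinary character.
fixed = pattern.sub("\u0101", sample)

print(fixed)
```

The replacement here is passed as the character itself rather than as an escape for the regex engine to decode.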
Old 05-28-2024, 10:52 PM   #3
oston
Connoisseur
Many thanks for that reference, Kevin.

I cannot see how to post a screen capture of a test page I just made, so I will try to show it here with a copy of the code from that page:
Quote:
<p class="Basic-Paragraph">Pāḷi Characters</p>

<p class="Basic-Paragraph">ā ḍ ī ḷ ṃ ṅ ñ ṇ ṭ ū</p>

<p class="Basic-Paragraph">Ā Ḍ Ī Ḷ Ṃ Ṅ Ñ Ṇ Ṭ Ū</p>

<p class="Basic-Paragraph">GarNo8BPS</p>

<p class="Basic-Paragraph">á ð ì ÿ í ò ó þ ú</p>

<p class="Basic-Paragraph">Á Ð Ì Ÿ Í Ò Ó Þ Ú</p>
When this test page was exported to epub, the ā character in the non-Unicode font Gar8BPS came through as á, which has Unicode code point 00E1. I want to change it to the correct ā, which has code point 0101.
I can do this with a Grep (regex) search in InDesign by looking for \x{00E1} and replacing it with \x{0101}.
But when I try to do the same in Find/Replace in Sigil with the Regex option turned on, it does not find the á character on that page.

I wonder what I am doing wrong in this instance?
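
For readers who want to do this kind of remapping outside of a Find/Replace dialog, a small translation table is one possible approach. The sketch below is in Python and is only an assumption about how one might go about it; the table name and helper function are hypothetical, and only the á → ā pair is confirmed in this thread. Any other pairs would have to be read off the legacy font's own character table.

```python
# Mapping from legacy-font code points to the intended Unicode code points.
# Only this one pair is confirmed in the thread; extend it from the font's
# own character table rather than by guessing.
LEGACY_TO_UNICODE = {
    0x00E1: 0x0101,  # á -> ā
}


def fix_legacy(text: str) -> str:
    """Return text with every mapped legacy character replaced."""
    return text.translate(LEGACY_TO_UNICODE)


line = '<p class="Basic-Paragraph">\u00e1 \u00f0 \u00ec</p>'
print(fix_legacy(line))
```

A blanket mapping like this is only safe on text that was set in the legacy font, because a genuine á elsewhere in the book would be changed as well.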

Last edited by oston; 05-28-2024 at 11:02 PM.
Old 05-28-2024, 11:07 PM   #4
theducks
Well trained by Cats
Posts: 30,440
Karma: 58055868
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
@oston
Switch the MR editor to Advanced, then use the paperclip to attach a (screen capture) FILE.
Old 05-28-2024, 11:15 PM   #5
oston
Connoisseur
Many thanks @theducks.
The attached screen capture shows my entries in Find/Replace, trying to use the hex code to change the á character (00E1) to ā (0101).
Attachment: unicode capture.png
Old 05-29-2024, 12:30 AM   #6
Karellen
Wizard
Posts: 1,351
Karma: 6794938
Join Date: Sep 2021
Location: Australia
Device: Kobo Libra 2
What options have you selected under the Replace box?
Old 05-29-2024, 08:05 AM   #7
DiapDealer
Grand Sorcerer
Posts: 28,036
Karma: 199464182
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
You're saying that \x{00E1} doesn't find the á character on line 22 of the code in your screenshot?

It finds it OK when I paste your code above into Sigil and search for \x{00E1}. The Sigil Unicode property setting shouldn't matter in this particular case. In fact none of the Regex settings should really affect that particular search. I tested on Windows and Linux.

Note: you won't be able to use \x{FFFF} syntax in Sigil's replace field (you'll have to use the character you want), but I got the impression it was finding á with \x{00E1} that you were having problems with.
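
If you only know the code point and need the actual character to paste into the Replace field, any tool that converts a hex value to a character will do. A minimal Python sketch, where the helper name is hypothetical:

```python
def char_from_hex(code_point_hex: str) -> str:
    """Convert a hex code point such as '0101' to the character itself."""
    return chr(int(code_point_hex, 16))


replacement = char_from_hex("0101")
print(replacement)  # the character to paste into the Replace field
```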

Last edited by DiapDealer; 05-29-2024 at 10:05 AM.
Old 05-29-2024, 08:46 AM   #8
oston
Connoisseur
Quote:
Originally Posted by DiapDealer View Post
You're saying that \x{00E1} doesn't find the á character on line 22 of the code in your screenshot?

It finds it OK when I paste your code above into Sigil and search for \x{00E1}. The Sigil Unicode property setting shouldn't matter in this particular case. In fact none of the Regex settings should really affect that particular search. I tested on Windows and Linux.
I have all 3 options checked. There is some factor on my computer that I do not understand. Yesterday the search in the screen shot did not find the á character, but this morning the search did find the character.

But when I try to replace it with ā using \x{0101}, the á is replaced with the literal string \x{0101}.

I have no idea why the search did not work yesterday but does today.
Old 05-29-2024, 09:28 AM   #9
KevinH
Sigil Developer
I still do not understand how a non-utf-8 byte sequence E1 got into the file in the first place. Either the original xhtml was cp1252 or latin-1 encoded and did not indicate that when being read in, so it could not be properly converted to utf-8, or text copied from a cp1252 or latin-1 source was pasted in without proper conversion.

Either way, the find replace step should not be needed unless earlier steps broke someplace.

The actual font used has nothing really to do with reading in and properly encoding a text file. The problem typically comes from not properly specifying the original encoding of the file inside it near the top. Without that, Sigil's auto detection code can sometimes incorrectly guess the input encoding. Detecting the difference between latin-x/cp-125x and utf-8 is actually quite hard from small snippets of text.
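
To make the encoding point concrete, here is a small Python sketch (not Sigil's detection code) showing why a lone byte E1 is a problem: it is á when read as latin-1, it is not valid utf-8 on its own, and the utf-8 encoding of á is a different two-byte sequence.

```python
raw = b"\xe1"  # the single byte E1

# Interpreted as latin-1, the byte is the character á (U+00E1).
as_latin1 = raw.decode("latin-1")

# Interpreted as utf-8, a lone E1 is an incomplete multi-byte sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Properly encoded utf-8 stores á as two bytes, C3 A1.
utf8_bytes = as_latin1.encode("utf-8")

print(as_latin1, utf8_ok, utf8_bytes.hex())
```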
Old 05-29-2024, 09:44 AM   #10
oston
Connoisseur
The InDesign files where this happens are old files. The font families used for the text are all 8-bit fonts in which the diacritics were coded at different positions than in the later Unicode fonts.
In the page that you saw in the capture, the top two lines of diacritics used TNR in the InDesign document, where ā is 0101, the now-standard place for ā.
The lower two lines were in GaramondNo8BPS, a very old font family where the codes of the diacritics were not standard codes. ā is placed at code 00E1.

I don't know if this answers your confusion, Kevin.

It would be helpful to be able to use the Sigil Regex find/replace with
Find: \x{00E1}
Replace: \x{0101}
in the same way that I do in InDesign.

But as I said at the beginning, this is NOT a big issue for me. It's more a matter of just wanting to know what I am doing wrong with the Regex search and replace using hex char strings.
Old 05-29-2024, 09:48 AM   #11
KevinH
Sigil Developer
And as for using \x followed by 2 hex chars in the replacement, according to the code that should work, see:

https://github.com/Sigil-Ebook/Sigil...extBuilder.cpp

So any byte sequence can be made by a string of \xHH where H is a hex character (0-9,a-f,A-F).

If need be we can add support for longer hex specified chars to that code by supporting braces { and } so that would work as well.

Is support for the \x{abcd} format in replacements desired, or can people live with just \xab\xcd, which should be supported now according to the ReplacementBuilder code?

Update: \xab\xcd actually does *not* work. Instead of combining the two consecutive byte values into one unsigned int16, the code promotes each of them separately to its own u16 value. I will add support for replacement using \x{hhhhhhhh}, converted to a sequence of uint16 values.
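
The difference between the two behaviours can be modelled in a few lines of Python. This is only a sketch of the arithmetic described above, not Sigil's C++ code, and both function names are hypothetical.

```python
def promote_separately(byte_values):
    """Each byte becomes its own 16-bit unit (the behaviour described above)."""
    return [b & 0xFFFF for b in byte_values]


def combine_pair(high, low):
    """Pack two bytes into a single 16-bit unit, high byte first."""
    return ((high << 8) | low) & 0xFFFF


# \x01\x01 promoted separately gives two control characters, not one letter.
separate = promote_separately([0x01, 0x01])

# Packed together, the same two bytes give 0x0101, the code point of ā.
combined = combine_pair(0x01, 0x01)

print([hex(u) for u in separate])
print(hex(combined), chr(combined))
```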

Last edited by KevinH; 05-29-2024 at 12:24 PM.
Old 05-29-2024, 09:51 AM   #12
theducks
Well trained by Cats
Just use the 'Insert Special Character' tool (Omega icon). Set the insert point in Replace, then click on the character in the tool.
Old 05-29-2024, 09:52 AM   #13
Turtle91
A Hairy Wizard
Posts: 3,219
Karma: 19000635
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 15/11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
Couldn’t you just use text f/r ?? (Don’t use regex)
Old 05-29-2024, 09:55 AM   #14
KevinH
Sigil Developer
So your InDesign files are improperly mixing Unicode (utf-8) encoded text with, most likely, latin-1 encoded text in the same file to match an older font.
Mixing two different text encodings in the same file breaks the xhtml spec completely. To Sigil, on import, it would look like a pure utf-8 encoded text file but with rogue encoded chars.

At least that means there is no breakage in Sigil's import epub code, which is what I was worried about.

Quote:
Originally Posted by oston View Post
The InDesign files where this happens are old files. The font families used for the text are all 8-bit fonts in which the diacritics were coded at different positions than in the later Unicode fonts.
In the page that you saw in the capture, the top two lines of diacritics used TNR in the InDesign document, where ā is 0101, the now-standard place for ā.
The lower two lines were in GaramondNo8BPS, a very old font family where the codes of the diacritics were not standard codes. ā is placed at code 00E1.

I don't know if this answers your confusion, Kevin.

It would be helpful to be able to use the Sigil Regex find/replace with
Find: \x{00E1}
Replace: \x{0101}
in the same way that I do in InDesign.

But as I said at the beginning, this is NOT a big issue for me. It's more a matter of just wanting to know what I am doing wrong with the Regex search and replace using hex char strings.
Old 05-29-2024, 10:31 AM   #15
DiapDealer
Grand Sorcerer
Quote:
Originally Posted by KevinH View Post
Is support of using \x{abcd} format in replacements desired or can people live with just \xab\xcd which should be supported now according to the ReplacementBuilder code.
It was my understanding that PCRE2 does not support the \xFF or \xFFFF syntax for Unicode codepoints, only \x{FF} or \x{FFFF}. For instance, I can't find á using \xe1 (or \x00e1); the braces must be used. Shouldn't search and replace be consistent in what they support?

Right now, replacing á with ā using codepoints would need to search for \x{e1}, but replace with \xe3 (the code for ā is not 0101 in any listing I've found). That seems counterintuitive, no?

EDIT: my mistake. The code for ā is indeed 0101. My eyes and tiny screens resulted in the confusion of the tilde for the macron.
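
When accented characters are hard to tell apart on a small screen, looking up their official Unicode names is a quick check. A short Python sketch using the standard `unicodedata` module:

```python
import unicodedata

# The three code points discussed in this thread.
code_points = (0x00E1, 0x00E3, 0x0101)

# Map each code point to its official Unicode character name.
names = {cp: unicodedata.name(chr(cp)) for cp in code_points}

for cp, name in names.items():
    print(f"U+{cp:04X} {chr(cp)} {name}")
```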

Last edited by DiapDealer; 05-29-2024 at 11:06 AM.