ALERT: Potential Issues with Sigil 2.2.X and rtl languages and Normalization Forms - Page 3

KevinH · 06-28-2024, 05:07 PM

The Unicode standard is not carved in stone and is in its 15th revision. Would you believe that new emojis are one of its driving issues?

And AI and text analytics are some of the areas pushing most strongly for searchability and for using precomposed forms.

democrite · 06-28-2024, 05:14 PM

I can understand the usefulness for most users. But there seem to be enough potential issues with some languages, perhaps more so for non-Western ones, and partly given the opinion some hold of Unicode in general, that I strongly feel it should be a setting and not something strictly enforced.

KevinH · 06-28-2024, 05:42 PM

Understood. As I said, given demand, I will create an environment variable to disable it once we get things working completely.

democrite · 06-28-2024, 05:49 PM

Great. I wasn't sure whether the variable was only a possibility or definitely planned. Thank you very much.

democrite · 06-29-2024, 02:38 PM

I’m not that familiar with this, but I mention it since it seems relevant here, and it seems that Sigil and some eReaders may not support it.

Is canonical Unicode equivalence relevant here? Not necessarily storing the text in some normalized form, but it seems the spec may require that apps normalize text just for the purpose of processing. If so, shouldn't search in apps behave the way the spec requires?

KevinH · 06-29-2024, 05:00 PM

Easy to do when processing an external file, but the position of any match may then no longer line up with the original file, because composing characters shrinks the length of the string representation. This makes replacement a big issue.

This is even more of a problem inside an editor that does not force-normalize the strings in advance. Any match information (the position, and whether there is a match at all) will be incorrect unless the underlying file is normalized the same way as the search string. This makes search and replace impossible.
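
To make the offset problem concrete, here is a minimal sketch using Python's standard `unicodedata` module. The sample string and the variable names are assumptions chosen for illustration; this is not Sigil's code. It searches a normalized copy of a decomposed string and then applies the resulting offset to the original.

```python
import unicodedata

# Text typed with a decomposed e-acute: 'e' followed by U+0301 COMBINING ACUTE ACCENT
original = "cafe\u0301 au lait"
normalized = unicodedata.normalize("NFC", original)

needle = "lait"
pos_in_normalized = normalized.find(needle)  # offset in the normalized copy
pos_in_original = original.find(needle)      # offset in the text actually on disk

# The normalized copy is one code point shorter, so every offset after the
# accented character is shifted by one relative to the original.
print(len(original), len(normalized))        # 13 12
print(pos_in_original, pos_in_normalized)    # 9 8

# Applying the normalized offset to the original selects the wrong span,
# so a replacement made at that position would corrupt the text.
wrong_span = original[pos_in_normalized:pos_in_normalized + len(needle)]
print(repr(wrong_span))                      # ' lai'
```

Each additional decomposed character before the match shifts the offset by one more code point, which is why a replace based on positions found in a normalized copy cannot be trusted against an unnormalized file.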

democrite · 06-29-2024, 06:55 PM

It indeed seems challenging.

It seems that reading systems normalize text for purposes such as search and sorting, and platforms seem to have APIs for handling this. I haven't used any readers other than Apple Books in a while, yet it seems to do OK with canonical equivalence and diacritic-insensitive search. I'm not sure whether an editor should behave the same way, but maybe it would be useful as an option?

I'm not really familiar with it, but might this help?

https://doc.qt.io/qtforpython-6.6/Py...aryFinder.html

KevinH · 06-29-2024, 07:25 PM

That is just a Python interface to an existing Qt class that helps find word boundaries, etc., based on Unicode character classes.

As it stands, since Calibre's editor is successful using the NFC approach, we should be able to accomplish the same with enough redesign. The source and find fields will all use the same Unicode normalization forms so they will just work. The days of removing accents just to search for pseudo text are long gone.
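
As a rough illustration of why matching forms "just work", the sketch below (plain Python with the standard `unicodedata` module, not Sigil's or calibre's actual implementation) compares a search that mixes forms with one where both the document and the query are converted to NFC first. The helper name `nfc` and the sample strings are assumptions.

```python
import unicodedata


def nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (precomposed)."""
    return unicodedata.normalize("NFC", text)


document = "re\u0301sume\u0301"   # decomposed: 'e' + U+0301, twice
query = "r\u00e9sum\u00e9"        # precomposed: U+00E9, as most keyboards produce

# Mixed forms: the strings render identically but the code points differ,
# so a plain substring search finds nothing.
raw_hit = document.find(query)

# Same form on both sides: the search succeeds at offset 0.
nfc_hit = nfc(document).find(nfc(query))

print(raw_hit, nfc_hit)           # -1 0
```

Because the document itself is stored in NFC under this approach, the offsets returned by the search refer to the real text and can be used directly for replacement.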

Our use_nfc branch is undergoing testing now. When it is stable we will make a Beta release for people to test with, and report back any issue.

Once we get that working, I will eventually add a simple application-level setting, controlled by an environment variable, that determines whether to run the conversion to NFC form at all.

If we can *not* get things to work as we want, then we will move to creating a Sigil tool to run NFC normalization for the entire book, but that is just a fallback plan.

Over the last 5 years, a typical Sigil release has seen 40 thousand to 80 thousand Windows downloads and 10 thousand to 20 thousand macOS downloads, plus uncounted additional downloads on Linux, MacPorts, Homebrew, and other places.

So that is over 100,000 regular users who bother to download multiple releases of Sigil. With that dedicated user base and a long enough beta testing period, we should be able to track down and fix any remaining issues.

democrite · 06-30-2024, 08:08 PM

Quote:

Originally Posted by KevinH

The days of removing accents just to search for pseudo text are long gone.

Maybe this is something for a future date. Since some may not want to normalize (I'm not sure whether ligatures or other features are also affected), it may someday be worth normalizing internally just for search in such cases. It could be a mess, and I'm not sure how readers handle this. Maybe some keep copies in RAM that are normalized and also stripped of diacritics to speed up search.

Being able to match canonical equivalents, and possibly to do diacritic-insensitive search, might be useful for some, though such usage is probably rarer. For the latter, there can be typos or OCR errors in diacritics or accents, and some languages have terms that are spelled the same except for an accent or diacritic, so that kind of search might be useful for some.
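
For what it's worth, a diacritic-insensitive comparison of the kind described above can be sketched in a few lines of Python with the standard `unicodedata` module. The function names `strip_marks` and `loose_contains` are made up for this example, and dropping every nonspacing mark (category `Mn`) is an assumption that suits Latin-script text; it would damage scripts where such marks carry vowels or other essential information.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose, drop nonspacing marks (category Mn), then recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


def loose_contains(haystack: str, needle: str) -> bool:
    """Diacritic-insensitive substring test; the stored text is left untouched."""
    return strip_marks(needle) in strip_marks(haystack)


# Works whether the stored text is composed or decomposed.
print(strip_marks("r\u00e9sum\u00e9"))                          # resume
print(loose_contains("Le re\u0301sume\u0301 final", "resume"))  # True

# Words that differ only by accents collapse to the same key, which helps
# with OCR errors but also loses real distinctions.
print(strip_marks("p\u00eache") == strip_marks("p\u00e9ch\u00e9"))  # True
```

Note that this only answers whether a match exists. Mapping a hit back to a position in the original text runs into the offset problem mentioned earlier in the thread.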

KevinH · 06-30-2024, 10:45 PM

And spellcheck and accent counting or checking can still be done on NFC normalized forms. The accents are not lost when combined. The resulting text is visually identical whether decomposed or composed. In fact, hunspell spelling dictionaries need to choose a single form and rely on normalization when used with languages that make heavy use of accents.

You seem to think NFC form is somehow worse than decomposed NFD form. That is just not the case. The end result is identical. It is mixing forms that causes issues for spellchecking, search and replace, etc. Choosing composed (NFC) over decomposed (NFD) should result in shorter strings, less memory, faster search and replace, etc. Plus it is what most keyboard input produces. Even git repos can be set to use precomposed forms using automatic conversion.
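
A small sketch of the size and equivalence claims, using Python's standard `unicodedata` module (`is_normalized` needs Python 3.8 or later). The sample word and variable names are assumptions for illustration.

```python
import unicodedata

composed = "caf\u00e9"                                  # NFC: single code point U+00E9
decomposed = unicodedata.normalize("NFD", composed)     # 'e' + U+0301

# Shorter in code points and in UTF-8 bytes when composed.
print(len(composed), len(decomposed))                                   # 4 5
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))   # 5 6

# The two render the same but do not compare equal as raw strings,
# which is what breaks search and spellcheck when forms are mixed.
print(composed == decomposed)                                           # False

# Nothing is lost by composing: converting back and forth is lossless here.
print(unicodedata.normalize("NFC", decomposed) == composed)             # True

# A cheap way to detect text that is not yet in NFC.
print(unicodedata.is_normalized("NFC", decomposed))                     # False
```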

As I have now said repeatedly, an environment variable will be added to disable this in case the user disagrees, as long as all file paths, links, and URLs use NFC form, which is the minimum the current EPUB3 spec requires to ensure basic functionality.
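
Purely as an illustration of that policy, here is a hedged Python sketch: body text is normalized unless an opt-out environment variable is set, while paths and links are always normalized. The variable name `EXAMPLE_SKIP_NFC` and the function names are hypothetical; no actual Sigil variable or function has been named in this thread.

```python
import os
import unicodedata

# Hypothetical name for illustration only, not a real Sigil environment variable.
OPT_OUT_VAR = "EXAMPLE_SKIP_NFC"


def nfc_enabled(environ=os.environ) -> bool:
    """NFC conversion of body text is on unless the opt-out variable is set."""
    return environ.get(OPT_OUT_VAR, "") == ""


def prepare_text(text: str, environ=os.environ) -> str:
    """Normalize body text to NFC, unless the user has opted out."""
    if nfc_enabled(environ):
        return unicodedata.normalize("NFC", text)
    return text


def prepare_path(path: str) -> str:
    """File paths, links, and URLs are always NFC, regardless of the opt-out."""
    return unicodedata.normalize("NFC", path)


print(prepare_text("e\u0301", {}) == "\u00e9")                     # True
print(prepare_text("e\u0301", {OPT_OUT_VAR: "1"}) == "e\u0301")    # True
print(prepare_path("Text/re\u0301sume\u0301.xhtml"))
```

Passing the environment in as a parameter is only a convenience for testing; a real application would more likely read the variable once at startup.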

kovidgoyal · 06-30-2024, 11:51 PM

@KevinH: If I were you I wouldn't bother with an env var; it will just create confusion and bugs. There is absolutely no reason to use NFD, ever. In all the years the calibre editor has existed there hasn't been a single bug report about NFD. Having Sigil produce EPUBs with NFD text is just going to result in broken experiences in other software that doesn't normalize text before operating on it. I highly doubt that all of the dozens of EPUB viewers out there do that, for instance, which means search will not work in them for Sigil-produced EPUBs with the env var set.

There is *no* good reason to use NFD that I know of, ever, when serializing text. About the only advantage of NFD is that it makes certain operations simpler to implement, such as removing accents. But the difference is negligible. In either case you need to use lookup tables; it is just that in one case the lookup table is larger, and anyway this is a problem that was solved robustly in libraries a long time ago.

KevinH · 07-01-2024, 09:09 AM

@Kovid,
Understood and thanks for your help on this. Much appreciated!
