06-28-2024, 06:07 PM | #31 |
Sigil Developer
Posts: 8,260
Karma: 5568412
Join Date: Nov 2009
Device: many
|
The Unicode standard is not carved in stone and is in its 15th revision. Would you believe that new emojis are one of its driving issues?
And AI and text analytics are some of the areas pushing most strongly for searchability and for using precomposed forms. Last edited by KevinH; 06-28-2024 at 06:09 PM. |
06-28-2024, 06:14 PM | #32 |
Evangelist
Posts: 440
Karma: 77256
Join Date: Sep 2011
Device: none
|
I can understand the usefulness for most users. But there seem to be enough potential issues with some languages, perhaps more so non-Western ones, and given the opinion some seem to hold of Unicode in general, that I strongly feel it should be a setting and not something strictly enforced.
|
06-28-2024, 06:42 PM | #33 |
Sigil Developer
Posts: 8,260
Karma: 5568412
Join Date: Nov 2009
Device: many
|
Understood. As I said, given demand, I will create an environment variable to disable it once we get things working completely.
|
06-28-2024, 06:49 PM | #34 |
Evangelist
Posts: 440
Karma: 77256
Join Date: Sep 2011
Device: none
|
ALERT: Potential Issues with Sigil 2.2.X and rtl languages and Normalization Forms
Great. I wasn't sure whether the variable was only a possibility or definitely planned. Thank you very much.
|
06-29-2024, 03:38 PM | #35 |
Evangelist
Posts: 440
Karma: 77256
Join Date: Sep 2011
Device: none
|
I’m not that familiar with this, but I mention it since it seems relevant here, and it seems that Sigil and some eReaders may not support it.
Is canonical Unicode equivalence relevant here? Not necessarily storing the text in some normalized form, but it seems the spec may require that apps normalize text just for the purpose of processing. If so, shouldn't search in apps behave the way the spec requires? |
06-29-2024, 06:00 PM | #36 |
Sigil Developer
Posts: 8,260
Karma: 5568412
Join Date: Nov 2009
Device: many
|
Easy to do when processing an external file, but the position of a match may then not correspond to the original file, because combining characters into precomposed forms shortens the string representation. This makes replacement a big issue.
This is even more of a problem when done inside an editor that does not force-normalize the strings in advance. Any match information (the position, and whether there is a match at all) will be incorrect unless the underlying file is normalized the same way as the search string. This makes search and replacement impossible. |
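The offset problem described above can be shown with a small sketch using Python's standard `unicodedata` module. The sample string and the variable names are made up for illustration; this is not Sigil code.

```python
import unicodedata

# "café au lait" with a decomposed é: 'e' + U+0301 COMBINING ACUTE ACCENT.
original = "cafe\u0301 au lait"
# The same text after NFC normalization: é becomes the single code point U+00E9.
normalized = unicodedata.normalize("NFC", original)

needle = "au"
pos_original = original.find(needle)
pos_normalized = normalized.find(needle)

print(len(original), len(normalized))   # 13 12
print(pos_original, pos_normalized)     # 6 5

# Applying the offset found in the normalized copy to the original text
# cuts at the wrong place and damages the replacement.
wrong = original[:pos_normalized] + "with" + original[pos_normalized + len(needle):]
right = original[:pos_original] + "with" + original[pos_original + len(needle):]
print(wrong == right)                   # False
```

Every composed character before the match shifts the offset by one more code point, so the error grows with the length of the file.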
06-29-2024, 07:55 PM | #37 |
Evangelist
Posts: 440
Karma: 77256
Join Date: Sep 2011
Device: none
|
It indeed seems challenging.
It seems that reading systems normalize text for purposes such as search and sorting, and platforms seem to have APIs for handling this. I haven't used any readers other than Apple Books in a while, yet it seems to do OK with canonical equivalence and diacritic-insensitive search. I'm not sure whether an editor should behave the same way, but maybe it would be useful as an option? I'm not really familiar with it, but might this help? https://doc.qt.io/qtforpython-6.6/Py...aryFinder.html |
06-29-2024, 08:25 PM | #38 |
Sigil Developer
Posts: 8,260
Karma: 5568412
Join Date: Nov 2009
Device: many
|
That is just a Python interface to an existing Qt class that helps find word boundaries, etc., based on Unicode character classes.
As it stands, since Calibre's editor is successful using the NFC approach, we should be able to accomplish the same with enough redesign. The source and find fields will all use the same Unicode normalization form, so they will just work. The days of removing accents just to search for pseudo text are long gone.
Our use_nfc branch is undergoing testing now. When it is stable we will make a Beta release for people to test with and report back any issues. Once we get that working, I will eventually add a simple application-level variable, controlled by an environment variable, that will determine whether to run the conversion to NFC form at all. If we can *not* get things to work as we want, then we will move to creating a Sigil tool to run NFC normalization for the entire book, but that is just a fallback plan.
Over the last 5 years, a typical Sigil release has involved 40 thousand to 80 thousand Windows downloads and 10 thousand to 20 thousand macOS downloads, plus uncounted additional downloads on Linux, MacPorts, Homebrew, and other places. So that is over 100,000 regular users who bother to download multiple releases of Sigil. With that dedicated user base and a long enough beta testing period, we should be able to track down and fix any remaining issues. |
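As a rough illustration of an environment-variable switch of the kind mentioned above, here is a minimal Python sketch. The variable name `EXAMPLE_SKIP_NFC` and the function `load_text` are hypothetical, invented for this example only; they are not Sigil's actual variable or code.

```python
import os
import unicodedata

# Hypothetical variable name, chosen for this example only (not Sigil's).
ENV_VAR = "EXAMPLE_SKIP_NFC"

def load_text(raw: str, environ=os.environ) -> str:
    """Return text as an editor might load it: NFC by default, untouched if opted out."""
    # Any value other than empty or "0" is treated as "skip normalization".
    if environ.get(ENV_VAR, "") not in ("", "0"):
        return raw
    return unicodedata.normalize("NFC", raw)

# Default: the decomposed é (2 code points) is loaded as one code point.
print(len(load_text("e\u0301", environ={})))                 # 1
# Opt-out set: the text is left exactly as it was stored.
print(len(load_text("e\u0301", environ={ENV_VAR: "1"})))     # 2
```

Passing the environment in as a parameter is only a convenience for trying both settings without changing the real process environment.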
06-30-2024, 09:08 PM | #39 | |
Evangelist
Posts: 440
Karma: 77256
Join Date: Sep 2011
Device: none
|
Quote:
Being able to match canonically equivalent text, and possibly diacritic-insensitive search, might be useful for some, though such usage is perhaps rarer. For the latter, there could be typos or OCR errors in diacritics or accents (some languages have terms spelled the same but differing in an accent or diacritic), so that kind of search might be useful for some. |
|
06-30-2024, 11:45 PM | #40 |
Sigil Developer
Posts: 8,260
Karma: 5568412
Join Date: Nov 2009
Device: many
|
And spellcheck and accent counting or checking can still be done on NFC-normalized forms. The accents are not lost when combined. The resulting text is visibly identical whether decomposed or composed. In fact, Hunspell spelling dictionaries need to choose a single form and rely on normalization when used with languages that make heavy use of accents.
You seem to think NFC form is somehow worse than decomposed NFD form. That is just not the case. The end result is identical. It is mixing forms that causes issues for spellchecking, search and replace, etc. Choosing composed (NFC) over decomposed (NFD) should result in shorter strings, less memory, faster search and replace, etc. Plus it is what most keyboard input produces. Even git repos can be set to use precomposed forms using automatic conversion. As I have now said repeatedly, an environment variable will be added to disable this in case the user disagrees, as long as all file paths, links, and URLs use NFC form, which is what the current EPUB3 spec requires at a minimum to ensure basic functionality. |
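A short Python sketch of the point about mixed forms, using only the standard `unicodedata` module. The sample word is arbitrary and the variable names are illustrative.

```python
import unicodedata

# The same word, "résumé", stored in two different forms.
stored = "re\u0301sume\u0301"    # decomposed (NFD): e + combining acute accent
query = "r\u00e9sum\u00e9"       # composed (NFC): what most keyboards produce

# Mixed forms: the two render identically but do not compare or match.
print(stored == query)           # False
print(query in stored)           # False

# Same form on both sides: the search just works.
stored_nfc = unicodedata.normalize("NFC", stored)
query_nfc = unicodedata.normalize("NFC", query)
print(query_nfc in stored_nfc)   # True

# The composed form is also the shorter of the two.
print(len(unicodedata.normalize("NFD", stored)), len(stored_nfc))   # 8 6
```

Normalizing both sides to NFD would match equally well; what breaks the search is the mix, not either form by itself.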
07-01-2024, 12:51 AM | #41 |
creator of calibre
Posts: 44,711
Karma: 24967300
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
@KevinH: If I were you I wouldn't bother with an env var; it will just create confusion and bugs. There is absolutely no reason to use NFD, ever. In all the years the calibre editor has existed there hasn't been a single bug report about NFD. Having Sigil produce EPUBs with NFD text is just going to result in broken experiences in other software that doesn't normalize text before operating on it. I highly doubt that all of the dozens of EPUB viewers out there do that, for instance, which means search will not work in them for Sigil-produced EPUBs with the env var set.
There is *no* good reason to use NFD that I know of, ever, when serializing text. About the only advantage of NFD is that it makes certain operations simpler to implement, such as removing accents. But the difference is negligible. In either case you need to use lookup tables; in one case the lookup table is just larger. And anyway, this is a problem that was solved robustly in libraries a long time ago. |
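For reference, the accent-removal operation mentioned above is usually done by decomposing in memory only, whatever form the text is stored in. A minimal Python sketch with the standard `unicodedata` module follows; the function name `strip_accents` is made up for this example.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Remove combining marks, e.g. to build a key for accent-insensitive search."""
    # Decompose temporarily so each accent becomes a separate combining character.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every character with a non-zero canonical combining class.
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left so the result is back in NFC.
    return unicodedata.normalize("NFC", kept)

# Works the same whether the input was stored composed or decomposed.
print(strip_accents("r\u00e9sum\u00e9"))      # resume
print(strip_accents("re\u0301sume\u0301"))    # resume
```

Letters that have no canonical decomposition, such as "ø" or "ł", pass through unchanged, so a real diacritic-insensitive search would need additional folding rules.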
07-01-2024, 10:09 AM | #42 |
Sigil Developer
Posts: 8,260
Karma: 5568412
Join Date: Nov 2009
Device: many
|
@Kovid,
Understood and thanks for your help on this. Much appreciated! |
|