Old 06-28-2024, 06:07 PM   #31
KevinH
Sigil Developer
Posts: 8,160
Karma: 5450818
Join Date: Nov 2009
Device: many
The Unicode standard is not carved in stone; it is now in its 15th revision. Would you believe that new emojis are one of its driving issues?

And AI and text analytics are among the areas pushing most strongly for searchability and for using precomposed forms.

Last edited by KevinH; 06-28-2024 at 06:09 PM.
Old 06-28-2024, 06:14 PM   #32
democrite
Evangelist
Posts: 440
Karma: 77256
Join Date: Sep 2011
Device: none
I can understand the usefulness for most users. But there seem to be enough potential issues with some languages, perhaps more so for non-Western ones, and partly given the opinion some apparently hold of Unicode in general, that I strongly feel it should be a setting and not something strictly enforced.
Old 06-28-2024, 06:42 PM   #33
KevinH
Sigil Developer
Understood. As I said, given demand, I will create an environment variable to disable it once we get things working completely.
Old 06-28-2024, 06:49 PM   #34
democrite
Evangelist
ALERT: Potential Issues with Sigil 2.2.X and rtl languages and Normalization Forms

Great. I wasn't sure whether the variable was only a possibility or definitely planned. Thank you very much.
Old 06-29-2024, 03:38 PM   #35
democrite
Evangelist
I'm not that familiar with this, but I mention it since it seems relevant here, and it seems that Sigil and some e-readers may not support it.

Is canonical Unicode equivalence relevant here? Not necessarily storing the text in some normalized form, but it seems the spec may require that apps normalize text just for the purpose of processing. If so, shouldn't search in apps behave the way the spec requires?
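
For readers unfamiliar with the term, here is a minimal Python sketch of canonical equivalence, using only the standard `unicodedata` module. The sample word is an arbitrary choice; the point is that two strings that render identically can differ code point by code point until both are put into the same normalization form.

```python
import unicodedata

composed = "caf\u00e9"      # 'é' as one precomposed code point (U+00E9)
decomposed = "cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT (U+0301)

# A raw comparison sees two different code point sequences.
raw_equal = composed == decomposed

# After normalizing both sides to the same form, they compare equal.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
nfd_equal = (unicodedata.normalize("NFD", composed)
             == unicodedata.normalize("NFD", decomposed))

print(raw_equal, nfc_equal, nfd_equal)  # False True True
```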
Old 06-29-2024, 06:00 PM   #36
KevinH
Sigil Developer
That is easy to do when processing an external file, but the position of a match may then no longer correspond to the original file, because composing characters shrinks the length of the string representation. This makes replacement a big issue.

This is even more of a problem inside an editor that does not force-normalize its strings in advance. Any match information (the position, and whether there is a match at all) will be incorrect unless the underlying file is normalized the same way as the search string. This makes search and replacement impossible.
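
The offset problem can be illustrated with a small Python sketch. The sample text is made up, and this is only an illustration of the general issue, not of how Sigil's search is implemented: a match position found in a normalized copy does not point at the same place in the un-normalized original.

```python
import unicodedata

# Hypothetical file content stored in decomposed (NFD) form.
original = "re\u0301sume\u0301 and cafe\u0301"

# Normalized copy used only for searching.
normalized = unicodedata.normalize("NFC", original)
needle = unicodedata.normalize("NFC", "cafe\u0301")   # becomes "café", 4 code points

pos_in_normalized = normalized.find(needle)
pos_in_original = original.find(unicodedata.normalize("NFD", needle))

print(len(original), len(normalized))       # 18 15
print(pos_in_normalized, pos_in_original)   # 11 13

# Applying the normalized offset to the original selects the wrong text,
# so a replacement made at that position would corrupt the file.
wrong_slice = original[pos_in_normalized:pos_in_normalized + len(needle)]
print(repr(wrong_slice))                    # 'd ca'
```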
Old 06-29-2024, 07:55 PM   #37
democrite
Evangelist
It indeed seems challenging.

It seems that reading systems normalize text for purposes such as search and sorting, and platforms seem to have APIs for handling this. I haven't used any readers other than Apple Books in a while, yet it seems to do OK with canonical equivalence and diacritic-insensitive search. I'm not sure whether an editor should behave the same way, but maybe it would be useful as an option?

I'm not really familiar with it, but might this help?

https://doc.qt.io/qtforpython-6.6/Py...aryFinder.html
Old 06-29-2024, 08:25 PM   #38
KevinH
Sigil Developer
That is just a Python interface to an existing Qt class that helps find word boundaries, etc., based on Unicode character classes.

As it stands, since calibre's editor is successful using the NFC approach, we should be able to accomplish the same with enough redesign. The source and find fields will all use the same Unicode normalization form, so they will just work. The days of removing accents just to search for pseudo text are long gone.
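
A rough Python sketch of that idea, where the source text, the find field, and the replacement are all brought to NFC before matching. The helper names `nfc` and `replace_all` are made up for illustration and are not Sigil or calibre functions.

```python
import unicodedata

def nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (composed)."""
    return unicodedata.normalize("NFC", text)

def replace_all(source: str, find: str, replacement: str) -> str:
    """Replace every match after putting all three strings in NFC."""
    return nfc(source).replace(nfc(find), nfc(replacement))

source = "Cafe\u0301 menu: cafe\u0301 au lait"   # decomposed, as it might sit in a file
find = "caf\u00e9"                               # precomposed, as typically typed

# Without normalization the typed search string never matches.
unchanged = source.replace(find, "coffee")

# With both sides in NFC the (case-sensitive) match is found and replaced.
result = replace_all(source, find, "coffee")
print(result)
```

Because the stored text itself is in NFC here, match positions refer to the same string that gets edited, which avoids the offset mismatch described earlier in the thread.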

Our use_nfc branch is undergoing testing now. When it is stable we will make a Beta release for people to test with and report back any issues.

Once we get that working, I will eventually add a simple application-level variable, controlled by an environment variable, that will determine whether to run the conversion to NFC form at all.
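
As a sketch only of what such an opt-out switch could look like: the variable name `EXAMPLE_DISABLE_NFC` is hypothetical (no name has been given in this thread), and the actual mechanism in Sigil may well differ.

```python
import os
import unicodedata

# Hypothetical environment variable name, chosen only for this example.
ENV_VAR = "EXAMPLE_DISABLE_NFC"

def maybe_normalize(text: str, environ=None) -> str:
    """Convert text to NFC unless the opt-out variable is set to a non-empty value."""
    env = os.environ if environ is None else environ
    if env.get(ENV_VAR):
        return text                      # user opted out: leave the text untouched
    return unicodedata.normalize("NFC", text)

# Default behaviour: decomposed input is composed.
default_result = maybe_normalize("e\u0301", environ={})

# With the variable set, the text passes through unchanged.
opt_out_result = maybe_normalize("e\u0301", environ={ENV_VAR: "1"})
```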

If we can *not* get things to work as we want, then we will move to creating a Sigil tool to run NFC normalization for the entire book, but that is just a fallback plan.

Over the last 5 years, a typical Sigil release has involved 40 thousand to 80 thousand Windows downloads and 10 thousand to 20 thousand macOS downloads, plus uncounted additional downloads on Linux, MacPorts, Homebrew, and other places.

So that is over 100,000 regular users who bother to download multiple releases of Sigil. With that dedicated user base and a long enough beta testing period, we should be able to track down and fix any remaining issues.
Old 06-30-2024, 09:08 PM   #39
democrite
Evangelist
Quote:
Originally Posted by KevinH
The days of removing accents just to search for pseudo text are long gone.
Maybe this is something for a future date or consideration. Since some may not want to normalize (I'm not sure whether ligatures or other features also come into it), maybe it would someday be worth also normalizing internally for search in such cases. It could be a mess, and I'm not sure how some readers handle this. Maybe some keep copies in RAM that are normalized and also stripped of diacritics to speed up search.

Being able to match canonically equivalent text, and possibly diacritic-insensitive search, might be useful for some, though usage is perhaps rarer. For the latter, there could be typos or OCR errors in diacritics or accents, and some languages have terms that are spelled the same apart from an accent or diacritic, so that kind of search might be useful for some.
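
A minimal sketch of diacritic-insensitive matching in Python, assuming the common approach of decomposing to NFD and dropping nonspacing marks. The helper names are made up for illustration. It only answers whether a match exists; mapping a hit back to a position in the original text is the harder part, and letters with no canonical decomposition (for example "ø") are left as they are.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Decompose to NFD, then drop nonspacing marks (general category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def loose_contains(haystack: str, needle: str) -> bool:
    """Case-insensitive, diacritic-insensitive containment test."""
    return strip_marks(needle).casefold() in strip_marks(haystack).casefold()

hit_plain_needle = loose_contains("El ni\u00f1o come\u0301", "nino")
hit_accented_needle = loose_contains("my resume", "R\u00e9sum\u00e9")
miss = loose_contains("cafe\u0301", "tea")
```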
Old 06-30-2024, 11:45 PM   #40
KevinH
Sigil Developer
And spellcheck and accent counting or checking can still be done on NFC-normalized forms. The accents are not lost when combined. The resulting text is visually identical whether decomposed or composed. In fact, Hunspell spelling dictionaries need to choose a single form and rely on normalization when used with languages that make heavy use of accents.

You seem to think NFC form is somehow worse than decomposed NFD form. That is just not the case. The end result is identical. It is mixing forms that causes issues for spellchecking, search and replace, etc. Choosing composed (NFC) over decomposed (NFD) should result in shorter strings, less memory, faster search and replace, etc. Plus it is what most keyboard input produces. Even Git repos can be set to use precomposed forms using automatic conversion.
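
A small Python check of the size and round-trip claims, using an arbitrary sample word. It shows only code point and UTF-8 byte counts; it says nothing about actual search speed.

```python
import unicodedata

word = "r\u00e9sum\u00e9"                 # "résumé"
nfc = unicodedata.normalize("NFC", word)
nfd = unicodedata.normalize("NFD", word)

# The composed form uses fewer code points and fewer UTF-8 bytes.
print(len(nfc), len(nfd))                                   # 6 8
print(len(nfc.encode("utf-8")), len(nfd.encode("utf-8")))   # 8 10

# No accent information is lost: each form converts back to the other.
round_trip_to_nfc = unicodedata.normalize("NFC", nfd) == nfc
round_trip_to_nfd = unicodedata.normalize("NFD", nfc) == nfd
```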

As I have now said repeatedly, an environment variable will be added to disable this in case the user disagrees, as long as all file paths, links, and URLs use NFC form, which is what the current EPUB3 spec requires minimally to ensure basic functionality.
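
For the file path part, here is a hedged Python sketch of how paths could be checked for NFC form. The manifest entries are invented for the example, `non_nfc_entries` is a made-up helper, and `unicodedata.is_normalized` needs Python 3.8 or later.

```python
import unicodedata

def non_nfc_entries(paths):
    """Return the paths that are not already in NFC form."""
    return [p for p in paths if not unicodedata.is_normalized("NFC", p)]

# Invented example entries; the second one uses decomposed accents.
manifest = [
    "OEBPS/Text/caf\u00e9.xhtml",
    "OEBPS/Text/re\u0301sume\u0301.xhtml",
    "OEBPS/Styles/main.css",
]

offenders = non_nfc_entries(manifest)

# Normalizing every entry leaves nothing to report.
fixed = [unicodedata.normalize("NFC", p) for p in manifest]
remaining = non_nfc_entries(fixed)
```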
Old 07-01-2024, 12:51 AM   #41
kovidgoyal
creator of calibre
Posts: 44,566
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
@KevinH: If I were you I wouldn't bother with an env var; it will just create confusion and bugs. There is absolutely no reason to use NFD, ever. In all the years the calibre editor has existed there hasn't been a single bug report about NFD. Having Sigil produce EPUBs with NFD text is just going to result in broken experiences in other software that doesn't normalize text before operating on it. I highly doubt that all of the dozens of EPUB viewers out there do that, for instance, which means search will not work in them for Sigil-produced EPUBs with the env var set.

There is *no* good reason to use NFD that I know of, ever, when serializing text. About the only advantage of NFD is that it makes certain operations simpler to implement, such as removing accents. But the difference is negligible. In either case you need to use lookup tables; in one case the lookup table is just larger. And anyway, this is a problem that was solved robustly in libraries a long time ago.
Old 07-01-2024, 10:09 AM   #42
KevinH
Sigil Developer
@Kovid,
Understood and thanks for your help on this. Much appreciated!
All times are GMT -4.