Old 06-26-2024, 04:08 PM   #16
KevinH
Sigil Developer
Posts: 8,269
Karma: 5568868
Join Date: Nov 2009
Device: many
Okay, will do.

I have a force_nfc candidate version of Sigil ready locally but I have NOT tested it yet. If it seems to work, I will push it to my personal repo first and we can play around from there.

Thanks!
Old 06-26-2024, 05:44 PM   #17
KevinH
@DiapDealer,

Quote:
Originally Posted by DiapDealer
I don't believe I'd worry about making any more invasive changes to another Qt5 source release. For the few people it might impact, the recommendation would be to upgrade to a version based on the first Qt6-only source release.
Okay, I brought all of our late 2.2.0 and 2.2.1 changes over from master, pulled in the qt6only changes from my repo, and then added in a bunch of force nfc changes.

So this is a bleeding edge tree that will need some heavy testing.

I pushed it to my personal github repo in the qt6only branch.

git clone https://github.com/kevinhendricks/Sigil.git
cd Sigil
git checkout qt6only

If you agree, I would like to create a "qt5final" branch from current master to preserve it for any Linux distributions that are too old for Qt 6.2.X or 6.4.X. If we find that all the force nfc stuff is solid, we can even add to that branch and make a new qt5final update release. But I want to make sure the nfc force stuff is rock solid first.

Once we have that, I will merge in my "qt6only" branch into Sigil-Ebook master in the form of sets of patches so that we do not lose any history of what has been done and can revert them in pieces if necessary. Something along the lines of:

1. remove old qt5 support cleanup patch
2. remove need for Qt6 compat5 (no qstringref patch, no qtextcodec patch)
3. force nfc patches

Just let me know when you want me to start the transition.

Last edited by KevinH; 06-26-2024 at 08:49 PM.
Old 06-27-2024, 10:01 AM   #18
democrite
Evangelist
Posts: 440
Karma: 77256
Join Date: Sep 2011
Device: none
Trying a test from one of my recent EPUBs, I recall one instance of a Chinese character transliterated to pinyin that used a combining diacritic.

I generally do edits with BBEdit on macOS and use Sigil only for changes other than xhtml editing. I mainly use Apple Books and search by typing works fine.

My main concern is typing text and being able to search for it later in another editor. Some may use multiple editors. Sigil may not be around forever, or a user may someday even years from now decide to switch to a different editor.

I'm not sure what the ideal solution is. Maybe it is indeed better to leave the text alone, or to have a preference to normalize to either form (NFC or NFD), plus commands to do the conversion. Ideally I'd like to be able to find any text by typing, with any editor, with pasted text normalized to whatever matches keyboard input on each platform, if needed. That might be some additional work, but perhaps it would be easiest for the user?

It could be for such reasons that normalization of file contents was removed from the spec.
Attached Files
File Type: zip unicode normalization with BBEdit.txt.zip (785 Bytes, 411 views)

Last edited by democrite; 06-27-2024 at 10:11 AM.
Old 06-27-2024, 10:20 AM   #19
KevinH
You seem to misunderstand.

The epub3 spec did not remove it; it narrowed the focus to urls and file paths, where it matters most, since string comparison and matching must work for the epub links to function at all.

In order for search to work in general, all of the text being searched and all of the search strings need to form text characters from unicode using an identical sequence. If more than one sequence creates the same character, then mixing them in the same text causes problems, as does searching with one sequence in the find field when the text uses the other sequence for the same word or words.

And in either form, *no* text is ever lost or unreadable. It will be 100% searchable only when the search string and all the text to be searched are in the same form.
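A minimal Python sketch of the mismatch, using only the standard `unicodedata` module. The sample sentence and variable names are made up for illustration; this is not Sigil's code.

```python
import unicodedata

# "é" written two ways: one precomposed code point, or "e" + combining acute accent.
composed = "caf\u00e9"      # the sequence NFC produces
decomposed = "cafe\u0301"   # the sequence NFD produces

# Both render identically, but they are different code point sequences,
# so a plain substring search for one does not find the other.
text = f"Meet me at the {decomposed} at noon."
found_raw = composed in text

# With the text and the search string in the same form, the search succeeds.
nfc_text = unicodedata.normalize("NFC", text)
nfc_needle = unicodedata.normalize("NFC", composed)
found_normalized = nfc_needle in nfc_text

print(found_raw, found_normalized)  # False True
```

Nothing is lost in either form; only the code point sequence used to spell the accented letter differs.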

As for Sigil and Calibre going away, the code to convert a file between normalization forms in python is trivial (less than 5 lines), the same is true in Qt, and it is available in all good string libraries.
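For example, a conversion along these lines should do it. This is a sketch: the function name `convert_file` and the UTF-8 default are assumptions, and `newline=""` is passed so that existing line endings are left untouched.

```python
import unicodedata

def convert_file(path, form="NFC", encoding="utf-8"):
    """Rewrite a text file in place in the given normalization form (NFC or NFD)."""
    with open(path, "r", encoding=encoding, newline="") as f:
        text = f.read()
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(unicodedata.normalize(form, text))
```

Calling `convert_file("chapter1.xhtml", "NFC")` (a hypothetical file name) would precompose the file; passing `"NFD"` would decompose it instead.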

And finally, the world seems to be standardizing on NFC forms for just the reasons above.

So no worries.

Last edited by KevinH; 06-27-2024 at 10:25 AM.
Old 06-27-2024, 10:28 AM   #20
democrite
I thought that maybe the EPUB3 spec changed NFC normalization of file contents from a requirement to a recommendation, but I'm not positive.

Perhaps I wasn't clear in my previous message. Typing a combined diacritic in another editor, I was not able to find the same sequence after it was saved with Sigil 2.2.1. So I think maybe the concerns remain true. If any type of normalization is done that differs from keyed input, a user may not be able to search their typed text using a different editor, if they use multiple editors or someday wish to switch to another?
Old 06-27-2024, 10:33 AM   #21
DiapDealer
Grand Sorcerer
Posts: 28,122
Karma: 201056482
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
We can't guarantee that all Sigil-produced text will be searchable (let alone in the same normalized form) by all other editors right now. The changes we're proposing won't help or hurt that. We're just making sure that Sigil is consistent in the way it internally approaches this (and handles it according to the epub spec). We can't worry whether other editors do the same. Not all editors are EPUB editors.

Last edited by DiapDealer; 06-27-2024 at 10:39 AM.
Old 06-27-2024, 10:40 AM   #22
democrite
What I'm saying is that whatever text I type, at least for now, I'm not able to search for it after it's saved. Maybe such is the case if I stick with Sigil, but not if I use another editor. I just want to be able to search for the same text I typed previously by typing in the same.
Old 06-27-2024, 10:50 AM   #23
KevinH
And that is what I already have working in the repo I posted about. It will convert text pasted (or typed) into the find field to nfc form to match the nfc form of the text. The end user will see a working search and replace when typing or pasting. But that can only happen if one normalization form is used throughout.
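The idea can be sketched in Python as follows. Sigil itself is written in C++ with Qt, so this is only an illustration of the approach; `find_offsets` is a hypothetical helper, and it assumes the document text was already converted to NFC when it was loaded.

```python
import unicodedata

def find_offsets(nfc_text, query):
    """Return the start offsets of query in nfc_text, normalizing the query to NFC first."""
    needle = unicodedata.normalize("NFC", query)
    offsets = []
    if not needle:
        return offsets
    start = nfc_text.find(needle)
    while start != -1:
        offsets.append(start)
        start = nfc_text.find(needle, start + 1)
    return offsets

# Document text as it would look after being normalized on load.
document = unicodedata.normalize("NFC", "nu\u0308\u030c and n\u01da")

# A query typed or pasted as u + combining diaeresis + combining caron still matches.
typed_query = "u\u0308\u030c"
print(find_offsets(document, typed_query))  # [1, 8]
```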

That is why this alert exists and explains that people who work with languages where issues like this are prevalent should stick to Sigil 2.1.0 until the next release of Sigil.

Even with Sigil 2.2.x, once saved and re-read in, you can copy a word that illustrates this issue from Sigil and paste it into the find field, and it all works. Your keyboard input method is just not generating text in the order that matches the NFC form. In the version I posted about above, this does not happen anymore.

This will happen for any editor (except for calibre's) where multiple sequences that generate a single character exist and more than one sequence is used for that same character in the text (or in the find field). This problem is not Sigil specific.

Sigil 2.2.x converts text on being read in to NFC form to help make it universally searchable across various platforms and e-readers. As it turns out it just needed more work to more completely convert find info as well.

So I do not understand what you are asking here?

Last edited by KevinH; 06-27-2024 at 11:06 AM.
Old 06-27-2024, 11:14 AM   #24
KevinH
As for BBEdit, check out Release 14.0 Notes for:

Quote:
Added "Precompose Unicode" to the Text menu. This command will convert decomposed Unicode pairs (such as a letter followed by a combining accent or diaresis) into a single Unicode character, where possible. Precompose Unicode is also available as a Text Factory operation as well as via the AppleScript interface.
Old 06-28-2024, 02:26 PM   #25
KevinH
And interestingly, both Mac and Windows keyboards and input methods regularly create precomposed inputs (NFC). I finally found this documented on the Windows site:

Quote:
Windows, Microsoft applications, and the .NET Framework generally generate characters in form C using normal input methods. For most purposes on Windows, form C is the preferred form. For example, characters in form C are produced by Windows keyboard input.
However, characters imported from the Web and other platforms can introduce other normalization forms into the data stream.
So really, the only issue here comes from cutting and pasting from documents that do not use precomposed text such as PDF documents and some web pages.
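Pasted text like that can be spotted with a few lines of Python. This is a sketch using `unicodedata.is_normalized` (available since Python 3.8); the helper name `non_nfc_lines` and the sample text are made up for illustration.

```python
import unicodedata

def non_nfc_lines(text):
    """Return (line_number, line) pairs for the lines that are not already in NFC."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if not unicodedata.is_normalized("NFC", line)
    ]

pasted = "typed: \u00e9\npasted from a PDF: e\u0301\nplain ascii"
for number, line in non_nfc_lines(pasted):
    # ascii() makes the combining mark visible as an escape sequence.
    print(number, ascii(line))
```

Only the second line is reported, since it spells the accented letter with a combining mark.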
Old 06-28-2024, 04:31 PM   #26
democrite
Kevin,

Thank you very much for taking the time to be very helpful and also for really trying to consider all viewpoints and thoroughly think through the issue. Such is quite rare.

My issue is with characters that have multiple diacritics. I had attached a file in my previous message. One character in question is in a recent EPUB using the pinyin term nǚ.

After converting to NFC, as far as I can tell, there is no way to return to the character as entered. Some are concerned with data preservation and want the original representation. As far as I can tell, after conversion to NFC there is no way to search for that character as entered by typing it in again. The same would perhaps be true in other cases too, with numerous other editors, current or future.

I am aware of the BBEdit command. There is also one to decompose unicode. There is also an expert preference to precompose Unicode when pasting. Yet as far as I can tell, none help with that specific issue.

Not everyone is going to eternally exclusively use Sigil for their editing. Sigil might not be around forever. Someone may use other editors. I use BBEdit.

Looking around, there seems to be a tendency to not want to automatically convert text. From what I recall, JetBrains at one point automatically converted text to NFC, a user complained through a bug report with others sharing the same sentiment, and so they backed out of it.

Trying to use various AI services such as Perplexity, I found various issues, many of which I'm not sure if they apply:

Quote:
normalizing ≯ [U+003E GREATER-THAN SIGN + U+0338 COMBINING LONG SOLIDUS OVERLAY] to ≯ [U+226F NOT GREATER-THAN] can corrupt XML code
Quote:
some people use compatibility characters in their content without realizing it, like ¼ [U+00BC VULGAR FRACTION ONE QUARTER] or № [U+2116 NUMERO SIGN]. Normalizing this content may affect the look or readability
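The two quoted claims involve different kinds of equivalence, which a short Python check can separate. This is a sketch using the standard `unicodedata` module: the combining-solidus case is a canonical composition that NFC performs, while the fraction and numero characters are compatibility characters that NFC leaves alone and only the NFKC/NFKD forms rewrite.

```python
import unicodedata

# ">" followed by U+0338 COMBINING LONG SOLIDUS OVERLAY composes under NFC
# into the single code point U+226F NOT GREATER-THAN.
composed_gt = unicodedata.normalize("NFC", ">\u0338")

# Compatibility characters are unchanged by NFC ...
quarter_nfc = unicodedata.normalize("NFC", "\u00bc")   # VULGAR FRACTION ONE QUARTER
numero_nfc = unicodedata.normalize("NFC", "\u2116")    # NUMERO SIGN

# ... and are only rewritten by the compatibility forms.
quarter_nfkc = unicodedata.normalize("NFKC", "\u00bc")
numero_nfkc = unicodedata.normalize("NFKC", "\u2116")

print(ascii(composed_gt))                       # '\u226f'
print(ascii(quarter_nfc), ascii(quarter_nfkc))  # '\xbc' '1\u20444'
print(ascii(numero_nfc), ascii(numero_nfkc))    # '\u2116' 'No'
```

So the second concern would apply to NFKC rather than to NFC, which is the form discussed in this thread.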
From a W3C document, Normalization in HTML and CSS:

Quote:
there may sometimes be good reasons to mix normalized forms.
Quote:
You should also try to avoid automatically converting content from one normalization form to another, as it may obliterate some important code point distinctions, such as in the carefully crafted examples of világ above, or in filenames or URLs, or text included in the page from elsewhere, etc.
The document also shows a screen shot of Dreamweaver which had a setting to either select none, NFC, or NFD.

https://www.w3.org/International/que...-normalization

Whether this applies or not, I'm not sure, but among other issues found, perhaps some would create EPUBs containing such content:

Quote:
Some types of linguistic analysis or text mining projects may rely on or benefit from the distinctions that normalization would eliminate. The original representation might carry valuable information for these specialized use cases.
Quote:
Not normalizing allows the text to remain in its original form as entered by users. This could potentially be beneficial in some specialized linguistic or cultural contexts where the specific Unicode representation carries meaning.
Quote:
Some historical texts or languages use variant forms of characters that cannot be accurately handled by Unicode normalization. This can result in loss of nuance or meaning when normalizing text.
quoted from a node.js document:

Quote:
What happens when the Unicode standard advances to include a slightly different normalization algorithm (as has happened in the past)?
There was also this W3C document, Unicode in XML and other Markup Languages, which seems to have been withdrawn yet may still provide useful information about possible issues.

https://www.w3.org/TR/unicode-xml/

A document I found though I haven't read it thoroughly yet seems like it might be of use, Unicode® Standard Annex #15 UNICODE NORMALIZATION FORMS:

https://unicode.org/reports/tr15/

I think it is a good idea to thoroughly continue to investigate this issue, and not go by the recommendations of a few users. As Sigil has a much lower userbase than other editors, it may be more difficult to get good feedback concerning all the possible issues.

There also seemed to be mention of reading systems doing diacritic-insensitive search. At least several readers seem to handle that fine, yet it seems to take more work, and maybe on newer, faster devices it is less of an issue.

It seems that reading systems, and perhaps other systems, normalize text before search. Maybe that is a better approach: leave the source text as is, and just normalize it and the search string for search operations.
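A rough Python sketch of that alternative, where the stored text is never modified and only temporary copies are normalized for the comparison. The function name `matching_lines` and the sample text are assumptions for illustration; reporting line numbers avoids having to map match offsets back into the unnormalized source.

```python
import unicodedata

def matching_lines(source_text, query, form="NFC"):
    """Return the 1-based numbers of lines that contain query, comparing normalized copies."""
    needle = unicodedata.normalize(form, query)
    return [
        number
        for number, line in enumerate(source_text.splitlines(), start=1)
        if needle in unicodedata.normalize(form, line)
    ]

# Mixed source: line 1 is decomposed, line 2 is precomposed, line 3 has no diacritics.
source = "first line: nu\u0308\u030c\nsecond line: n\u01da\nthird line: nu"

print(matching_lines(source, "n\u01da"))          # [1, 2]
print(matching_lines(source, "nu\u0308\u030c"))   # [1, 2]
```

The trade-off is that every search pays for normalization again, and replace operations need extra care because offsets in the normalized copy differ from offsets in the source.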

At one point, it seemed you had thought maybe it is better to leave text alone. I strongly think that should remain how it is, with commands as you suggested to convert to NFC or NFD. Plus possibly preferences for such. Any changes that you have made to support exclusively NFC, I think such is best left as a preference, maybe default or maybe not; I'm not sure.

Last edited by democrite; 06-28-2024 at 04:59 PM.
Old 06-28-2024, 05:19 PM   #27
democrite
And concerning the decompose unicode command in BBEdit, which I'm guessing converts to NFD: that seems to convert all text, including characters with single diacritics or accents. So that isn't helpful either, as any conversion seems to make it impossible to return to the original data representation.
Old 06-28-2024, 05:22 PM   #28
KevinH
Some of your arguments have merit but an epub produced on one platform must be readable and searchable on many different e-readers.

Precomposed form works with multiple accents as well.
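For the pinyin example mentioned earlier, a quick Python check with the standard `unicodedata` module shows what happens. It assumes the text was typed as u followed by a combining diaeresis and then a combining caron.

```python
import unicodedata

typed = "nu\u0308\u030c"   # n, u, COMBINING DIAERESIS, COMBINING CARON

nfc = unicodedata.normalize("NFC", typed)
print([f"U+{ord(c):04X}" for c in nfc])   # ['U+006E', 'U+01DA']
print(unicodedata.name(nfc[1]))           # LATIN SMALL LETTER U WITH DIAERESIS AND CARON

# Decomposing the NFC result gives back the fully decomposed sequence.
nfd = unicodedata.normalize("NFD", nfc)
print(nfd == typed)                       # True
```

The round trip only reproduces the input here because the input was fully decomposed to begin with.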

Researchers of historic texts who study dead languages and the use of accents/diacritics use only primary handwritten or typed sources that are as old as possible. This is not the realm of epubs, as someone had to choose a form for digital storage and that was not the original author. So most of those points are moot. In addition, the fonts chosen for the epub have a greater impact on any visual stylistic interpretation than invisible normalization forms that do not lose accents.

Yes mixed normalization forms can not be searched. And mixed normalization forms converted can not easily be converted back.
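A small Python illustration of why mixed text does not round-trip (a sketch with a made-up sample string): once both spellings are unified, no record remains of which occurrence used which form.

```python
import unicodedata

# The same letter spelled two ways within one string.
mixed = "\u00e9 and e\u0301"

as_nfc = unicodedata.normalize("NFC", mixed)         # both occurrences precomposed
back_to_nfd = unicodedata.normalize("NFD", as_nfc)   # both occurrences decomposed

print(ascii(mixed))        # '\xe9 and e\u0301'
print(ascii(as_nfc))       # '\xe9 and \xe9'
print(ascii(back_to_nfd))  # 'e\u0301 and e\u0301'
```

Neither result equals the original mixed string, although all three render identically.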

Even though the form is different the actual text is visually *identical*. The reader of any epub can not tell which normalization form is being used. Only when searching do issues become obvious.

As I said before, the latest version of Sigil, now in its own repo branch, handles copying and pasting text into its find field and functions to prevent these search issues.

So you appear to be arguing against changes that actually help epubs be more universally searchable.

In addition decomposed text has become an important attack vector hiding website urls to enable redirecting them from real websites. NFC and precomposing is the right way to handle that along with unicode variants being made to be visually differentiable. So the push toward NFC will probably continue.


Work will continue on this.

If demand warrants, we can add an environment variable to allow the user of Sigil to control this, but then no support or bug reports for searches failing will be accepted if the environment variable approach is employed by the user.


In the meanwhile use Sigil-2.1.0 if you do not want to use NFC conversions.

Last edited by KevinH; 06-28-2024 at 05:54 PM.
Old 06-28-2024, 05:27 PM   #29
democrite
My search issues are because I primarily use BBEdit for xhtml editing and Sigil only for all else. I mentioned that in my original message, and it is why I mentioned that Sigil might not be around forever and some may use other editors, either current or future.

I'm not necessarily arguing against making EPUBs more universally searchable. I am no expert but it just seems that the issue is more complicated than it seems.

Thus I think a preference to control this, plus commands to convert text, or maybe also automatically convert pasted text seem better.

As for whether demand warrants it, I am not sure. Use cases with potential normalization problems are mentioned around the web, but having enough Sigil users run into them seems rare and might not happen for a while. Still, I suspect that it is better to just leave text alone by default and have commands and preferences for normalization if someone wants them. Would it be useful to have a setting to exclude some character combinations from automatic normalization?

For non-Western languages it seems there is some tendency of opinion that normalization issues are complicated though I haven't thoroughly read through such.

Your original thought I think still remains better:

Quote:
maybe just do what everyone else seems to be doing (except for Kovid) and just punt and allow NFD text mixed with NFC text.
Yes I'll continue using the older Sigil.

Last edited by democrite; 06-28-2024 at 05:38 PM.
Old 06-28-2024, 05:39 PM   #30
democrite
Another issue I found somewhere which I'm not sure if remains true. It just seems to suggest there could be multiple issues that are unknown to many and may not come up for a while.

Quote:
U+0387 GREEK ANO TELEIA (·) is defined as canonical equivalent to U+00B7 MIDDLE DOT (·). This was a mistake, as the characters are really distinct and should be rendered differently and treated differently in processing. But it’s too late to change that, since this part of Unicode has been carved into stone. Consequently, if you convert data to NFC or otherwise discard differences between canonically equivalent strings, you risk getting wrong characters.
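This particular case is easy to confirm in Python with the standard `unicodedata` module. Because U+0387 has a singleton canonical decomposition, both NFC and NFD replace it with U+00B7.

```python
import unicodedata

ano_teleia = "\u0387"
print(unicodedata.name(ano_teleia))   # GREEK ANO TELEIA

as_nfc = unicodedata.normalize("NFC", ano_teleia)
as_nfd = unicodedata.normalize("NFD", ano_teleia)

print(unicodedata.name(as_nfc))       # MIDDLE DOT
print(as_nfc == as_nfd == "\u00b7")   # True
```

After normalization there is no way to tell whether a given middle dot was originally entered as an ano teleia.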