06-26-2024, 04:08 PM | #16 |
Sigil Developer
Posts: 8,258
Karma: 5568412
Join Date: Nov 2009
Device: many
|
Okay, will do.
I have a force_nfc candidate version of Sigil ready locally but I have NOT tested it yet. If it seems to work, I will push it to my personal repo first and we can play around from there. Thanks! |
06-26-2024, 05:44 PM | #17 | |
Sigil Developer
Posts: 8,258
Karma: 5568412
Join Date: Nov 2009
Device: many
|
@DiapDealer,
Quote:
So this is a bleeding-edge tree that will need some heavy testing. I pushed it to my personal GitHub repo in the qt6only branch:
git clone https://github.com/kevinhendricks/Sigil.git
cd Sigil
git checkout qt6only
If you agree, I would like to create a "qt5final" branch from current master to preserve it for any Linux distributions that are too old for Qt 6.2.X or 6.4.X. If we find that all the force nfc stuff is solid, we can even add it to that branch and make a new qt5final update release. But I want to make sure the force nfc stuff is rock solid first.
Once we have that, I will merge my "qt6only" branch into Sigil-Ebook master as sets of patches, so that we do not lose any history of what has been done and can revert them in pieces if necessary. Something along the lines of:
1. remove old qt5 support cleanup patch
2. remove need for Qt6 compat5 (no qstringref patch, no qtextcodec patch)
3. force nfc patches
Just let me know when you want me to start the transition.
Last edited by KevinH; 06-26-2024 at 08:49 PM. |
|
06-27-2024, 10:01 AM | #18 |
Evangelist
Posts: 440
Karma: 77256
Join Date: Sep 2011
Device: none
|
Trying a test from one of my recent EPUBs, I recall one instance of a Chinese character transliterated to pinyin that used a combining diacritic.
I generally edit with BBEdit on macOS and use Sigil only for changes other than XHTML editing. I mainly use Apple Books, and searching by typing works fine there.
My main concern is typing text and being able to search for it later in another editor. Some people may use multiple editors. Sigil may not be around forever, or a user may decide, even years from now, to switch to a different editor.
I'm not sure what the ideal solution is. Maybe it is indeed better to leave the text alone, or to have a preference to normalize to either form, plus commands to do so. Ideally, I'd like to be able to find any text by typing, in any editor, with pasted text normalized to whatever matches keyboard input on each platform, if needed. That might be some additional work, but perhaps it would be easiest for the user?
It could be for such reasons that normalization of file contents was removed from the spec.
Last edited by democrite; 06-27-2024 at 10:11 AM. |
06-27-2024, 10:20 AM | #19 |
Sigil Developer
Posts: 8,258
Karma: 5568412
Join Date: Nov 2009
Device: many
|
You seem to misunderstand.
The EPUB 3 spec did not remove it; it focused the requirement on URLs and file paths, where it matters most, because string comparison and matching must work for the epub's links to function at all.
For search to work in general, all of the text being searched and all of the search strings need to build their characters from Unicode code points using an identical sequence. If more than one sequence creates the same character, then mixing them in the same text causes problems, as does searching with one sequence in the find field when the text uses the other sequence for the same word or words.
In either form, *no* text is ever lost or unreadable. It will be 100% searchable only when the search string and all the text to be searched are in the same form.
As for Sigil and Calibre going away: the code to convert a file between normalization forms in Python is trivial (less than 5 lines), the same is true in Qt, and it is available in all good string libraries.
And finally, the world seems to be standardizing on NFC for just the reasons above. So no worries.
Last edited by KevinH; 06-27-2024 at 10:25 AM. |
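For illustration, here is a minimal Python sketch of the "less than 5 lines" file conversion mentioned above, using the standard library's unicodedata module. The function name convert_file and its parameters are hypothetical and not taken from Sigil or calibre; the sample strings only show that two different code point sequences can produce the same visible character.

```python
import unicodedata

def convert_file(src_path, dst_path, form="NFC"):
    """Rewrite a UTF-8 text file in the given normalization form (hypothetical helper)."""
    with open(src_path, encoding="utf-8") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(unicodedata.normalize(form, text))

# Two sequences for the same visible character:
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# They render identically but compare unequal until normalized to one form.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
same_after_nfd = unicodedata.normalize("NFD", composed) == decomposed
```

Passing form="NFD" converts in the other direction, which is why the choice of form is a policy decision rather than a technical limitation.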
06-27-2024, 10:28 AM | #20 |
Evangelist
Posts: 440
Karma: 77256
Join Date: Sep 2011
Device: none
|
I thought that maybe the EPUB3 spec changed NFC normalization of file contents from a requirement to a recommendation, but I'm not positive.
Perhaps I wasn't clear in my previous message. Typing a combined diacritic in another editor, I was not able to find the same sequence after it was saved with Sigil 2.2.1. So I think maybe the concerns remain true. If any type of normalization is done that differs from keyed input, a user may not be able to search their typed text using a different editor, if they use multiple editors or someday wish to switch to another? |
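One way to check whether typed text and saved text really differ in their code point sequence is to list the code points on each side. The following is only a diagnostic sketch in Python; the helper name describe_codepoints is hypothetical, and the sample strings stand in for whatever a particular keyboard or editor actually produces.

```python
import unicodedata

def describe_codepoints(text):
    """List each code point in text as 'U+XXXX NAME' (hypothetical diagnostic helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]

# Assumed example: the typed input uses a combining accent,
# and the saved file holds the NFC (precomposed) form.
typed = "e\u0301"
saved = unicodedata.normalize("NFC", typed)

for label, value in (("typed", typed), ("saved", saved)):
    print(label, describe_codepoints(value))
```

If the two lists differ, a plain byte-for-byte search for the typed text will miss the saved text even though both look the same on screen.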
06-27-2024, 10:33 AM | #21 |
Grand Sorcerer
Posts: 28,082
Karma: 199770456
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
We can't guarantee that all Sigil produced text will be searchable (let alone in the same normalized form) by all other editors right now. The changes we're proposing won't help or hurt that. We're just making sure that Sigil is consistent in the way it internally approaches this (and handles it according to epub spec). We can't worry whether other editors do the same. Not all editors are EPUB editors.
Last edited by DiapDealer; 06-27-2024 at 10:39 AM. |
06-27-2024, 10:40 AM | #22 |
Evangelist
Posts: 440
Karma: 77256
Join Date: Sep 2011
Device: none
|
What I'm saying is that whatever text I type, at least for now, I'm not able to search for it after it's saved. Maybe such is the case if I stick with Sigil, but not if I use another editor. I just want to be able to search for the same text I typed previously by typing in the same.
|
06-27-2024, 10:50 AM | #23 |
Sigil Developer
Posts: 8,258
Karma: 5568412
Join Date: Nov 2009
Device: many
|
And that is what I already have working in the repo I posted about. It converts text pasted (or typed) into the find field to NFC form to match the NFC form of the text. The end user will see a working search and replace when typing or pasting. But that can only happen if one normalization form is used throughout.
That is why this alert exists: it explains that people who work with languages where issues like this are prevalent should stick to Sigil 2.1.0 until the next release of Sigil.
Even with Sigil 2.2.x, once the file is saved and read back in, you can copy a word that illustrates this issue from Sigil and paste it into the find field, and it all works. Your keyboard input method is just not generating text in the order that matches the NFC form. In the version I posted about above, this does not happen anymore.
This will happen in any editor (except for calibre's) whenever multiple sequences can generate a single character and more than one of those sequences is used for that character in the text (or in the find field). This problem is not Sigil specific.
Sigil 2.2.x converts text to NFC form as it is read in, to help make it universally searchable across various platforms and e-readers. As it turns out, it just needed more work to convert the find input as well.
So I do not understand what you are asking here?
Last edited by KevinH; 06-27-2024 at 11:06 AM. |
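The idea of converting the find field to the same form as the text can be sketched in a few lines of Python. This is not Sigil's actual code (Sigil is written in C++ with Qt); the function names nfc and find_all are hypothetical, and the offsets returned refer to the NFC form of the text.

```python
import unicodedata

def nfc(text):
    """Return text in NFC form."""
    return unicodedata.normalize("NFC", text)

def find_all(haystack, needle):
    """Return the start offsets of needle in haystack, comparing both in NFC."""
    haystack = nfc(haystack)
    needle = nfc(needle)
    if not needle:
        return []
    hits = []
    start = haystack.find(needle)
    while start != -1:
        hits.append(start)
        start = haystack.find(needle, start + 1)
    return hits

# Assumed example: the document is stored in NFC, the keyboard produced a combining accent.
text = "caf\u00e9 and caf\u00e9"
typed = "cafe\u0301"

raw_hit = text.find(typed)           # plain search on the raw input
normalized_hits = find_all(text, typed)
```

In this sketch the plain search misses both words, while the normalized search finds them, which matches the behavior described for pasted versus typed input.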
06-27-2024, 11:14 AM | #24 | |
Sigil Developer
Posts: 8,258
Karma: 5568412
Join Date: Nov 2009
Device: many
|
As for BBEdit, check out Release 14.0 Notes for:
Quote:
|
|
06-28-2024, 02:26 PM | #25 | |
Sigil Developer
Posts: 8,258
Karma: 5568412
Join Date: Nov 2009
Device: many
|
And interestingly, both Mac and Windows keyboards and input methods regularly create precomposed input (NFC). I finally found this documented on the Windows site:
Quote:
|
|
06-28-2024, 04:31 PM | #26 | ||||||||
Evangelist
Posts: 440
Karma: 77256
Join Date: Sep 2011
Device: none
|
Kevin,
Thank you very much for taking the time to be so helpful, and for really trying to consider all viewpoints and think the issue through thoroughly. That is quite rare.
My issue is with characters that have multiple diacritics. I attached a file to my previous message. One character in question is in a recent EPUB that uses the pinyin term nǚ. After converting to NFC, as far as I can tell, there is no way to return to the character as entered. Some people are concerned with data preservation and want the original representation. As far as I can tell, after conversion to NFC there is no way to search for that character as entered by typing it in again. The same would be true, perhaps for other cases too, in possibly numerous other editors, current or future.
I am aware of the BBEdit command. There is also one to decompose Unicode, and an expert preference to precompose Unicode when pasting. Yet as far as I can tell, none of them help with that specific issue.
Not everyone is going to use Sigil exclusively and forever for their editing. Sigil might not be around forever. Someone may use other editors. I use BBEdit.
Looking around, there seems to be a tendency not to want text converted automatically. From what I recall, JetBrains at one point automatically converted text to NFC; a user complained in a bug report, others shared the same sentiment, and JetBrains backed the change out.
Using various AI services such as Perplexity, I found various issues, though I'm not sure how many of them apply: Quote:
Quote:
Quote:
Quote:
https://www.w3.org/International/que...-normalization
I'm not sure whether this applies, but among the other issues I found, perhaps some people would create EPUBs containing such content: Quote:
Quote:
Quote:
Quote:
https://www.w3.org/TR/unicode-xml/
A document I found that might be of use, though I haven't read it thoroughly yet, is Unicode® Standard Annex #15, UNICODE NORMALIZATION FORMS: https://unicode.org/reports/tr15/
I think it is a good idea to keep investigating this issue thoroughly, and not to go by the recommendations of a few users. As Sigil has a much smaller userbase than other editors, it may be more difficult to get good feedback on all the possible issues.
There also seemed to be mention of reading systems doing diacritic-insensitive search. At least several readers seem to handle that fine, though it seems to take more work, and maybe on newer, faster devices that is less of an issue. It seems that readers, and perhaps other systems, normalize text before searching. Maybe that is a better approach: leave the source text as is, and normalize both it and the search string only for search operations.
At one point, you seemed to think that maybe it is better to leave text alone. I strongly think that should remain how it is, with commands, as you suggested, to convert to NFC or NFD, plus possibly preferences for such. Any changes that you have made to support NFC exclusively are, I think, best left as a preference, maybe on by default or maybe not; I'm not sure.
Last edited by democrite; 06-28-2024 at 04:59 PM. |
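The approach suggested above, leaving the stored text alone and normalizing only for the comparison, could look roughly like this in Python. It is a sketch under assumptions: the function name matching_lines is hypothetical, and the three spellings of nǚ are canonically equivalent sequences chosen for illustration, not necessarily what any given keyboard emits.

```python
import unicodedata

def matching_lines(lines, query, form="NFC"):
    """Return indexes of lines containing query, comparing normalized copies only."""
    key = unicodedata.normalize(form, query)
    return [i for i, line in enumerate(lines)
            if key in unicodedata.normalize(form, line)]

# Three canonically equivalent spellings of the pinyin syllable, plus a non-match.
lines = [
    "n\u01da",           # n + precomposed U WITH DIAERESIS AND CARON
    "nu\u0308\u030c",    # n + u + COMBINING DIAERESIS + COMBINING CARON
    "n\u00fc\u030c",     # n + precomposed U WITH DIAERESIS + COMBINING CARON
    "nu",                # plain letters, no diacritics
]
original = list(lines)

hits = matching_lines(lines, "nu\u0308\u030c")
unchanged = lines == original   # the stored text is never rewritten
```

The trade-off is that every search pays the normalization cost, and any other tool that compares the stored text byte for byte will still see three different strings.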
06-28-2024, 05:19 PM | #27 |
Evangelist
Posts: 440
Karma: 77256
Join Date: Sep 2011
Device: none
|
And concerning the Decompose Unicode command in BBEdit, which I'm guessing converts to NFD: it seems to convert all text, including characters with single diacritics or accents. That isn't helpful either, as any conversion seems to make it impossible to return to the original data representation.
|
06-28-2024, 05:22 PM | #28 |
Sigil Developer
Posts: 8,258
Karma: 5568412
Join Date: Nov 2009
Device: many
|
Some of your arguments have merit but an epub produced on one platform must be readable and searchable on many different e-readers.
Precomposed form works with multiple accents as well.
Researchers of historic texts who study dead languages and the use of accents/diacritics use only primary handwritten or typed sources that are as old as possible. This is not the realm of epubs, as someone had to choose a form for digital storage, and that was not the original author. So most of those points are moot. In addition, the fonts chosen for the epub have a greater impact on any visual stylistic interpretation than invisible normalization forms that do not lose accents.
Yes, mixed normalization forms cannot be searched. And text in mixed normalization forms, once converted, cannot easily be converted back. Even though the form is different, the actual text is visually *identical*. The reader of any epub cannot tell which normalization form is being used. Only when searching do issues become obvious.
As I said before, the latest version of Sigil, now in its own repo branch, handles copying and pasting text into its find field and works to prevent these search issues. So you appear to be arguing against changes that actually help epubs be more universally searchable.
In addition, decomposed text has become an important attack vector for disguising website URLs so that users are redirected away from the real websites. NFC and precomposing is the right way to handle that, along with making Unicode variants visually distinguishable. So the push toward NFC will probably continue.
Work will continue on this. If demand warrants, we can add an environment variable to allow the Sigil user to control this, but then no support requests or bug reports for failing searches will be accepted if the user employs the environment variable approach.
In the meantime, use Sigil-2.1.0 if you do not want NFC conversions.
Last edited by KevinH; 06-28-2024 at 05:54 PM. |
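The point that mixed forms cannot easily be converted back can be shown with a short Python sketch. The helper name classify is hypothetical, and it relies on unicodedata.is_normalized, which is available in Python 3.8 and later; the sample text is an assumed example, not taken from any real epub.

```python
import unicodedata

def classify(text):
    """Report whether text is in NFC, NFD, both (e.g. plain ASCII), or neither."""
    is_nfc = unicodedata.is_normalized("NFC", text)
    is_nfd = unicodedata.is_normalized("NFD", text)
    if is_nfc and is_nfd:
        return "both"
    if is_nfc:
        return "NFC"
    if is_nfd:
        return "NFD"
    return "mixed"

# Assumed example: the same word stored once precomposed and once decomposed.
mixed = "caf\u00e9 cafe\u0301"

as_nfc = unicodedata.normalize("NFC", mixed)
as_nfd = unicodedata.normalize("NFD", mixed)
round_trip = unicodedata.normalize("NFD", as_nfc)

# Converting between NFC and NFD is reversible, but neither path restores the mix,
# because nothing records which occurrence used which sequence.
restorable = round_trip == mixed
```

Running classify on a file before converting it gives a quick indication of whether the conversion changes anything and whether the original byte sequence could be reconstructed afterwards.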
06-28-2024, 05:27 PM | #29 | |
Evangelist
Posts: 440
Karma: 77256
Join Date: Sep 2011
Device: none
|
My search issues arise because I primarily use BBEdit for XHTML editing and Sigil only for everything else. I mentioned that in my original message, and it is why I mentioned that Sigil might not be around forever and that some may use other editors, either current or future.
I'm not necessarily arguing against making EPUBs more universally searchable. I am no expert, but the issue just seems more complicated than it looks. Thus I think a preference to control this, plus commands to convert text, and maybe also automatic conversion of pasted text, would be better.
As for whether demand warrants it, I am not sure. Cases of potential normalization problems are mentioned around the web, but having enough Sigil users who will run into them seems rare and might not happen for a while. I suspect it is better to just leave text alone by default and have commands and preferences for normalization if someone wants it. Would it be useful to have a setting that excludes some character combinations from automatic normalization? For non-Western languages, there seems to be some tendency toward the opinion that normalization issues are complicated, though I haven't read through that thoroughly.
Your original thought, I think, still remains better: Quote:
Last edited by democrite; 06-28-2024 at 05:38 PM. |
|
06-28-2024, 05:39 PM | #30 | |
Evangelist
Posts: 440
Karma: 77256
Join Date: Sep 2011
Device: none
|
Here is another issue I found somewhere, though I'm not sure whether it remains true. It suggests there could be multiple issues that are unknown to many and may not come up for a while.
Quote:
|
|
|