06-25-2024, 01:33 PM | #1 |
Sigil Developer
Posts: 8,160
Karma: 5450818
Join Date: Nov 2009
Device: many
|
ALERT: Potential Issues with Sigil 2.2.X and rtl languages and Normalization Forms
Hi All,
I NEED YOUR HELP: The original epub2 spec said that all Content documents must use Unicode Normalization Form C (NFC). The epub3 spec now says that all file paths and urls must use Normalization Form C. Calibre, as far as I can tell, enforces NFC for every xhtml file read in. So I need help from users who work with RTL languages, and also with LTR languages that use lots of accents.

Starting in Sigil 2.2.0, every file read in was converted to NFC. This in turn has caused problems with RTL languages like Hebrew and Arabic, while fixing other issues for heavily accented languages and for search. From the unicode.org spec, both Hebrew and Arabic use special combining character classes, and text should appear the same whether the input is NFC or NFD - but it will *not* compare as identical, due to byte-order changes in accent and character combinations. Either way, mixing the same text in NFC form in some places and NFD form in others will cause much pain and many headaches for anyone trying to use Find and Replace, or for the end epub reader doing a search.

So what I would like to know, for *each* platform:

1. If you *type on your keyboard* and input Hebrew or Arabic or any language with lots of accents, is that text stored in form NFC or form NFD, or some other mixed form, inside the document you are editing (under Word, LibreOffice, Kate, emacs, or whatever text editor you use)?

2. Try copying the same text to the clipboard from your web browser and pasting it into a text editor: what Unicode normalization form does it use?

When running this test, just save the text directly to a file and post it as a zip here (in this thread), along with info on language, platform, editor used, and source (typing on the keyboard vs copy and paste from the clipboard). I can determine the form used by converting to utf-8 and dumping the hex code.

For Arabic and Hebrew users of Sigil, I would revert to using Sigil 2.1.0 until we find out what is going on with Unicode normalization and RTL languages, especially on macOS.

If anyone is an expert on Unicode Normalization Forms, especially how they are handled for RTL languages and whether normal keyboard input on each platform generates NFC or NFD form, I would love for you to post what you know here. I do not even know how to input Hebrew or Arabic from a keyboard, so I am at a loss. I can only cut and paste from somewhere else, but who knows whether the text copied was generated in NFC, NFD, or some unknown form.

The biggest issue here is NOT readability or lost text (no losses occur) but Find and Replace and end-user search. To the end user the text will appear correct, but if it differs from what they type on the keyboard or cut and paste into Find, the search will come back empty.

Any help or guidance here would be greatly appreciated. Thank you!

Last edited by KevinH; 06-25-2024 at 01:37 PM. |
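[Editorial illustration, not Sigil code: a minimal Python sketch of the problem described above. Two strings that render identically can still compare unequal, and a plain substring search will miss one form unless both sides are normalized first.]

Code:
import unicodedata

composed = "\u00e9cole"            # "école" with precomposed é (NFC)
decomposed = "e\u0301cole"         # "école" as e + COMBINING ACUTE ACCENT (NFD)

print(composed, decomposed)        # both render as "école"
print(composed == decomposed)      # False -- different code point sequences
print(decomposed in "une " + composed)   # False -- a naive "find" misses it

# Normalizing both sides to the same form makes them comparable again.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == composed)             # True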
06-25-2024, 01:44 PM | #2 |
Sigil Developer
Posts: 8,160
Karma: 5450818
Join Date: Nov 2009
Device: many
|
P.S. Perhaps the solution internal to Sigil is to convert everything to NFC on being read in or pasted, and then make sure that whatever is typed into the Find and Replace fields is converted to NFC form as well?
Last edited by KevinH; 06-25-2024 at 04:38 PM. |
06-25-2024, 06:23 PM | #3 | |
Grand Sorcerer
Posts: 5,640
Karma: 23191067
Join Date: Dec 2010
Device: Kindle PW2
|
I'm by no means a Unicode expert, but, AFAIK, most Windows apps do not perform Unicode normalizations when opening or saving files.
For example, when I entered the string Häagen-Daz in BabelPad (freeware) and saved it as a UTF-8 plain-text file with a BOM, it was shown in a hex editor as:

EF BB BF 48 C3 A4 61 67 65 6E 2D 44 61 7A

As expected, the ä umlaut was encoded as C3 A4. When I entered the same string in LibreOffice and saved it as a UTF-8 plain-text file with a BOM, I got the exact same result (+ 0D 0A at the end). I'm not aware of any Windows app that has normalization options for accented characters.

Based on my experience, Arabic letters are usually saved as letters from the 0600–06FF range (Arabic), even though many of the letters are actually rendered using glyphs from the FE70–FEFF range (Arabic Presentation Forms-B). Take, for example, the Arabic word من (min = from). It consists of a small circle on the right, which is the initial form of the letter MEEM, and a semi-circle with a dot above it, which is the final form of the letter NOON. When saved to a UTF-8 file, it was saved as D9 85 D9 86:

MEEM U+0645 م d9 85
NOON U+0646 ن d9 86

Quote:
Please visit the Yamli website, enter the sample word marhaban (= welcome), and click on the first suggestion (مرحباً). Then copy it to a Mac editor and save it. On my machine it was saved as:

D9 85 D8 B1 D8 AD D8 A8 D8 A7 D9 8B

I.e. the codes for MEEM, REH, HAH, BA, ALEF, FATHATAN.

I don't know what problems the Mac user reported, but, IIRC, very old versions of InDesign and other DTP apps came with RTL plugins that replaced characters from the 0600–06FF range with presentation forms from the FE70–FEFF range. Visually, the words would look exactly the same. Take again the word من: if you encode it using presentation forms from the FE70–FEFF range, it would look like this: ﻣﻦ. You could only tell the difference if you saved the string and examined it with a hex editor.

INITIAL MEEM U+FEE3 ﻣ ef bb a3
FINAL NOON U+FEE6 ﻦ ef bb a6

(In Hebrew, only five letters have a different final form.)

I.e., it's quite possible that the user who reported the RTL problem is using outdated software or software that can't handle RTL text.

In Sigil 2.2.1, my accented characters test file is still rendered correctly on my Windows 11 machine. What does the Book Browser look like on a Mac?

Last edited by Doitsu; 06-25-2024 at 06:27 PM. |
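[Editorial illustration, in the spirit of the hex-dump checks above: a small Python sketch that reports whether a saved UTF-8 file is already in NFC and/or NFD. The file names are whatever you pass on the command line; requires Python 3.8+ for unicodedata.is_normalized().]

Code:
import sys
import unicodedata

def report(path):
    data = open(path, "rb").read()
    text = data.decode("utf-8-sig")      # tolerate a BOM, as in the BabelPad test
    print(path)
    print("  first bytes:", data[:16].hex(" "))
    for form in ("NFC", "NFD"):
        print(f"  already in {form}?", unicodedata.is_normalized(form, text))

if __name__ == "__main__":
    for path in sys.argv[1:]:
        report(path)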
|
06-25-2024, 07:03 PM | #4 |
Sigil Developer
Posts: 8,160
Karma: 5450818
Join Date: Nov 2009
Device: many
|
It looks identical to your screen cap.
Here is the Arabic in question. The first code block below is the paragraph in NFD form, the second is the same paragraph in NFC form, and the third is a hex diff of the lines where the two forms differ: Code:
<p>إن تقييم الله لأيوب مسجّل في العهد القديم: "لِأَنَّهُ لَيْسَ مِثْلُهُ فِي ٱلْأَرْضِ. رَجُلٌ كَامِلٌ وَمُسْتَقِيمٌ يَتَّقِي ٱللهَ وَيَحِيدُ عَنِ ٱلشَّرِّ" (أيُّوب 1: 8)</p> Code:
<p>إن تقييم الله لأيوب مسجّل في العهد القديم: "لِأَنَّهُ لَيْسَ مِثْلُهُ فِي ٱلْأَرْضِ. رَجُلٌ كَامِلٌ وَمُسْتَقِيمٌ يَتَّقِي ٱللهَ وَيَحِيدُ عَنِ ٱلشَّرِّ" (أيُّوب 1: 8)</p> Code:
** **
00000110: d98e d986 d991 d98e d987 d98f 20d9 84d9 (nfd)
00000110: d98e d986 d98e d991 d987 d98f 20d9 84d9 (nfc)
** **
** **
00000190: 8ed8 aad9 91d9 8ed9 82d9 90d9 8a20 d9b1 (nfd)
00000190: 8ed8 aad9 8ed9 91d9 82d9 90d9 8a20 d9b1 (nfc)
** **
** **
000001c0: d986 d990 20d9 b1d9 84d8 b4d9 91d9 8ed8 (nfd)
000001c0: d986 d990 20d9 b1d9 84d8 b4d9 8ed9 91d8 (nfc)
** **
** **
** **
000001d0: b1d9 91d9 9022 2028 d8a3 d98a d991 d98f (nfd)
000001d0: b1d9 90d9 9122 2028 d8a3 d98a d98f d991 (nfc)
** **
** **

By cutting and pasting out of Chrome, BBEdit, and Sigil 2.1.0 itself, pasting into Sigil, and immediately saving that file (not writing it out to a zip) on macOS, I can see that I can paste either form into Sigil and Qt does not normalize anything. It is a straight copy and paste.

That means I cannot easily prevent mixed forms (NFD and NFC) of the same text from being entered into the epub, which means a search for one of the characters that changed will not find the other, depending on which version gets pasted into the Find window. Under these conditions, Find and Replace becomes pretty worthless. I am just not sure how to handle things.

I have no idea if the different forms will be kept in MR in this post, but if they are, I would like you to copy the NFD line, paste it into Sigil and Word (or LibreOffice), and save it on Windows, so that I can see if it gets changed to NFC or is just left as is. If it is left as is, then normalizing the text to NFC on load would never be enough, as it would not catch all changes. I would be better off normalizing to NFC on saving in Sigil and then maybe using NFC in the Find and Replace fields as well.

Or maybe just do what everyone else seems to be doing (except for Kovid) and punt: allow NFD text mixed with NFC text. Search and replace positions in the text will be off if I leave the underlying text as is but NFC-normalize a copy of it just for Find and Replace. So I am just not sure what to do here.

Last edited by KevinH; 06-25-2024 at 07:41 PM. |
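[Editorial illustration: a rough Python sketch of how one could flag exactly the mixed-form situation described above. The file name is hypothetical; requires Python 3.8+ for unicodedata.is_normalized().]

Code:
import unicodedata

def normalization_report(path):
    text = open(path, encoding="utf-8").read()
    is_nfc = unicodedata.is_normalized("NFC", text)
    is_nfd = unicodedata.is_normalized("NFD", text)
    if is_nfc and is_nfd:
        return "forms identical for this text (nothing to reorder or compose)"
    if is_nfc:
        return "pure NFC"
    if is_nfd:
        return "pure NFD"
    return "mixed / unnormalized -- searches may silently miss matches"

print(normalization_report("Section0001.xhtml"))   # hypothetical file name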
06-25-2024, 07:08 PM | #5 |
Sigil Developer
Posts: 8,160
Karma: 5450818
Join Date: Nov 2009
Device: many
|
One way to handle this might be to provide a Sigil Tool command to convert all text to either NFD or NFC form, at the user's choice. That way, if the user runs this command before running Find and Replace, Find and Replace would stand some chance of working perfectly without forcing one normalization form or the other.

Alternatively, I could force-convert each file to NFC form and force the Find and Replace contents to NFC form, so that search and replace actually works and the actual file position of the cursor makes sense. Nothing I know of will work when a file has the identical string once in NFD form and once in NFC form, which is what I was trying to find out: how often might that happen? It seems some keyboard input methods produce text in NFC, others produce NFD, and others produce characters in the order typed. And from my testing, it seems copying and pasting can easily end up with mixed forms.
Alternatively I could instead force convert each file to NFC form and force Find and Replace contents to NFC form so that search and replace actually works and actual file position of cursor in the file makes sense. Nothing I know will work when the file has the identical string one in NFD form and one in NFC form which is what I was trying to see how often that might happen. It seems some keyboard input methods form chars/text as NFC while other form chars as NFD while others form chars in the order typed. And from my testing it seems copying and pasting can easily end up with mixed forms. Last edited by KevinH; 06-25-2024 at 07:11 PM. |
06-25-2024, 10:55 PM | #6 |
Sigil Developer
Posts: 8,160
Karma: 5450818
Join Date: Nov 2009
Device: many
|
I tried the Yamli website and chose the first one offered and it produced the exact same sequence of bytes on my macOS that you saw on Windows:
00000000: d985 d8b1 d8ad d8a8 d8a7 d98b

But I then took that file and ran NFC normalization on it and called it test_nfc.txt, and I took the same file and ran NFD normalization on it and called it test_nfd.txt. The results were identical byte for byte, so I am not sure that accents and combining order mattered here at all, as the NFD form was identical to the NFC form. Is that what you expected? |
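[Editorial illustration of why this particular word comes out byte-identical under both normalizations: it contains only one combining mark (FATHATAN) on its own base letter, and Arabic has no precomposed letter+mark characters to compose into, so neither reordering nor composition applies.]

Code:
import unicodedata

word = "\u0645\u0631\u062d\u0628\u0627\u064b"   # MEEM REH HAH BEH ALEF FATHATAN
nfc = unicodedata.normalize("NFC", word)
nfd = unicodedata.normalize("NFD", word)
print(nfc == nfd == word)                        # True
print(word.encode("utf-8").hex(" "))             # d9 85 d8 b1 d8 ad d8 a8 d8 a7 d9 8b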
06-26-2024, 02:19 AM | #7 |
creator of calibre
Posts: 44,567
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Normalize to NFC both the needle and the haystack when searching. That is the only sane way to do this. And normalize HTML to NFC on save and on load, and when getting the text from QPlainTextEdit in order to perform any operation on it.
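[Editorial illustration of the "normalize needle and haystack" approach, not code from Calibre or Sigil. The key point is that the reported offsets only line up with the stored text if the stored text is itself kept in NFC.]

Code:
import re
import unicodedata

def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)

def find_all(needle: str, haystack: str):
    """Yield match spans; offsets are positions in the NFC-normalized haystack."""
    pattern = re.compile(re.escape(nfc(needle)))
    for m in pattern.finditer(nfc(haystack)):
        yield m.span()

# If the document itself is also stored in NFC (normalized on load/save),
# the reported offsets are valid positions in the stored text as well.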
|
06-26-2024, 08:09 AM | #8 | ||
Grand Sorcerer
Posts: 5,640
Karma: 23191067
Join Date: Dec 2010
Device: Kindle PW2
|
Quote:
The bytes that differ correspond to these two diacritics:

U+0651 ّ d9 91 ARABIC SHADDA
U+064E َ d9 8e ARABIC FATHA

The first difference in the sample paragraph is in this word: لِأَنَّهُ [li'annahu]. In terms of rendering, the actual order of these two diacritics makes no difference whatsoever.

NFD: لِأَنَّهُ
NFC: لِأَنَّهُ

(I also attached the strings as a plain-text file.)

FYI, here are the specs: UNICODE ARABIC MARK ORDERING ALGORITHM. In short, it says that if an Arabic letter is combined with both ARABIC SHADDA and other diacritics, e.g. ARABIC FATHA/KASRA/DAMMA, ARABIC SHADDA should be saved last, because it has a higher canonical combining class. This explains the differences that you found.

In real life it doesn't make any difference, because the strings are usually rendered exactly the same. Moreover, Arabic with multiple diacritics is primarily used in religious texts and some textbooks. In mass media, diacritics are rarely used, and mostly only for disambiguation. I.e., this is mostly a cosmetic issue.

Since diacritics are also somewhat difficult to enter, some apps that support Arabic text allow users to search for strings with diacritics as if they didn't have any. I.e., the user could search for لأنه or لانه and the app would also find لِأَنَّهُ. However, I don't know of any EPUB app that has this option.

Quote:
I totally agree with Kovid on this. Force converting NFD to NFC is the easier solution. |
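[Editorial illustration of the diacritic-insensitive matching idea mentioned above, not a feature of Sigil or any EPUB app. Note that stripping nonspacing marks after NFD also removes the hamza seat, so لِأَنَّهُ reduces to لانه, matching the more lenient of the two searches.]

Code:
import unicodedata

def strip_marks(text: str) -> str:
    # NFD first so composed characters (e.g. alef-with-hamza) split into
    # base + combining marks, then drop every nonspacing mark (category Mn)
    # and put the rest back into NFC.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return unicodedata.normalize("NFC", stripped)

word = "\u0644\u0650\u0623\u064e\u0646\u0651\u064e\u0647\u064f"   # li'annahu with diacritics
print(strip_marks(word))   # bare-letter form, easier to type and search for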
||
06-26-2024, 09:31 AM | #9 | |
Sigil Developer
Posts: 8,160
Karma: 5450818
Join Date: Nov 2009
Device: many
|
Quote:
I can force NFC form every time the QTextDocument gets loaded as well. I can force NFC form when the epub is loaded and saved. I can intercept copying out of a QTextDocument, so I can force it to NFC form as well.

That just leaves any keyboard input method that produces decomposed form (I am not sure how big an issue that is), and pasting off the system clipboard from other apps into the Find and Replace fields or directly into the text. I find that system copy and paste on macOS cannot always be intercepted.

Any suggestions on how to handle any of those cases?

Thanks, KevinH
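[Editorial sketch of the paste-interception idea in PySide6; Sigil's editor is C++/Qt, so this is only an illustration of the approach, not Sigil code.]

Code:
import unicodedata
from PySide6.QtWidgets import QPlainTextEdit

class NfcPlainTextEdit(QPlainTextEdit):
    def insertFromMimeData(self, source):
        # Anything pasted (or dropped) as text is normalized to NFC before it
        # reaches the underlying document.
        if source.hasText():
            self.insertPlainText(unicodedata.normalize("NFC", source.text()))
        else:
            super().insertFromMimeData(source)

# Keyboard input methods that commit decomposed text arrive via key/IME events
# rather than this path, so this alone does not cover that case.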
|
06-26-2024, 09:41 AM | #10 | |
Sigil Developer
Posts: 8,160
Karma: 5450818
Join Date: Nov 2009
Device: many
|
Hi Doitsu,
Thanks for all your help on this. I could not even tell which characters were changed without a hex dump. You are correct: nothing is lost in either representation in Sigil 2.2.x. I just need to figure out how to 100% intercept text pasted into the xhtml file and into any replace fields, or force NFC when the Find and Replace buttons are pressed. That really just leaves keyboard input methods that might generate decomposed form to worry about, but as you said, that should not be a big deal.

Thank you!
KevinH
Quote:
|
|
06-26-2024, 10:09 AM | #11 |
Grand Sorcerer
Posts: 28,045
Karma: 199464182
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Will we be able to add something to the plugin framework to automatically force NFC on altered/added xhtml? Any plugin that has its own edit/search/diff checks might be on its own to ensure NFC when searching after an edit, but it would be nice if plugin devs didn't have to worry about doing this themselves in general.
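[Editorial sketch: until/unless the framework does it automatically, an edit plugin could normalize its own output along these lines, assuming the usual BookContainer calls (bk.text_iter(), bk.readfile(), bk.writefile()). Whether the framework should do this for plugins is the open question above.]

Code:
import unicodedata

def run(bk):
    for (manifest_id, href) in bk.text_iter():
        data = bk.readfile(manifest_id)          # xhtml contents as a unicode string
        nfc = unicodedata.normalize("NFC", data)
        if nfc != data:
            bk.writefile(manifest_id, nfc)
    return 0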
|
06-26-2024, 10:29 AM | #12 |
creator of calibre
Posts: 44,567
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Typing and pasting don't matter, as long as all code that gets text from the plain text edit normalizes the text before using it. Basically, you need to be careful at all sites that operate on the contents of the plain text edit. Anything that gets the text from it, either directly or via a text cursor, and then uses/operates on that text should be normalizing it first. Depending on your architecture, that might be easy or hard to enforce.
|
06-26-2024, 12:13 PM | #13 | |
Sigil Developer
Posts: 8,160
Karma: 5450818
Join Date: Nov 2009
Device: many
|
Well, that is not going to be fun, as our CodeView uses QTextCursors to get selected text directly in many, many places. I will have to create an access routine that accepts the cursor, grabs the text, and fixes it up, and then use it everywhere.
Thanks! KevinH Quote:
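[Editorial sketch of that access routine in PySide6 form; Sigil's CodeView is C++, so treat this purely as an illustration of the idea that every caller goes through one helper instead of calling selectedText() directly.]

Code:
import unicodedata
from PySide6.QtGui import QTextCursor

def selected_text_nfc(cursor: QTextCursor) -> str:
    # QTextCursor.selectedText() uses U+2029 (paragraph separator) for line
    # breaks, so map those back to '\n' before normalizing.
    text = cursor.selectedText().replace("\u2029", "\n")
    return unicodedata.normalize("NFC", text)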
|
|
06-26-2024, 12:34 PM | #14 | |
Sigil Developer
Posts: 8,160
Karma: 5450818
Join Date: Nov 2009
Device: many
|
Quote:
Since all of these force-NFC changes are going to be invasive, I am going to create a branch. Will we want to include these changes in a final Qt5 source release, or, because they will be invasive, only impact heavily diacritized (biblical-level) RTL text, and lose no data, should we make these changes on our Qt6-only branch? Just let me know.
|
06-26-2024, 12:52 PM | #15 |
Grand Sorcerer
Posts: 28,045
Karma: 199464182
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
I don't believe I'd worry about making any more invasive changes to another Qt5 source release. For the few people it might impact, the recommendation would be to upgrade to a version based on the first Qt6-only source release.
Last edited by DiapDealer; 06-26-2024 at 04:18 PM. |
|