05-04-2024, 05:36 PM | #1 |
Sigil Developer
Posts: 8,258
Karma: 5568412
Join Date: Nov 2009
Device: many
|
Help - need epub sample that uses non-ascii file names and paths
Sigil has received a bug report that it could not open an epub, made by InDesign, that used non-ascii file names.

I have tried the official non-ascii file name testcase from the epub3 samples page and Sigil had no problems with it. But that testcase used Japanese, and I cannot tell if Japanese has the same issues with combining characters and unicode normalization (NFC vs NFD, etc) that European non-ascii characters often have.

So can anyone either provide or point me to a good test case that uses non-ascii filenames in an epub that may use combined forms (when an accent is added to a base character in decomposed form), or alternatively characters that can be decomposed into two or more separate ones (typically a base character plus one or more accents or other diacritic marks)? If anyone has one, I would love to have a copy of it, even if it is just a single non-ascii filename in a sample epub.

In addition, is this a known problem with InDesign: either not properly unicode normalizing its non-ascii filenames to NFC as the spec calls for, or not properly url encoding them in the manifest? Or does the epub zip archive produced by InDesign not properly set the flag that says its file names use utf-8?

Also, inside a zip archive that uses the utf-8 flag, whether characters are stored decomposed or composed would matter. Does anyone know what the normalization rule is for file names inside a zip archive? On macOS, filesystem paths are typically stored decomposed (NFD), but the web and elsewhere seem to all be NFC (composed). So could a zip archive built on a Mac by InDesign be storing file names/paths in NFD form, when in fact they should probably be in NFC form?

I just do not use InDesign, so I have no idea if this is an InDesign issue or a hidden Sigil bug.

Thanks,
KevinH
Last edited by KevinH; 05-04-2024 at 05:50 PM. |
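For anyone who wants to check a suspect epub themselves, here is a minimal Python sketch of the two checks asked about above: whether each zip entry has the utf-8 name flag set (general purpose bit 11, 0x800), and whether its name is already in NFC. The helper name `inspect_names` and the in-memory sample archive are made up for illustration; this is not how Sigil or InDesign does it.

```python
import io
import unicodedata
import zipfile

def inspect_names(zip_bytes):
    """Return (name, utf8_flag_set, is_nfc) for each entry in a zip archive."""
    report = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            # General purpose bit 11 marks the file name as utf-8 encoded.
            utf8_flag = bool(info.flag_bits & 0x800)
            # A name is NFC if normalizing it to NFC leaves it unchanged.
            is_nfc = unicodedata.normalize("NFC", info.filename) == info.filename
            report.append((info.filename, utf8_flag, is_nfc))
    return report

# Build a tiny archive in memory with one decomposed (NFD) file name.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("OEBPS/Text/re\u0301sume\u0301.xhtml", "<html/>")
    zf.writestr("mimetype", "application/epub+zip")

report = inspect_names(buf.getvalue())
for name, utf8_flag, is_nfc in report:
    print(repr(name), "utf8 flag:", utf8_flag, "NFC:", is_nfc)
```

To inspect a real file, pass `open("book.epub", "rb").read()` instead of the in-memory bytes (`book.epub` being whatever file you have).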
05-05-2024, 07:54 AM | #2 |
Grand Sorcerer
Posts: 5,651
Karma: 23456789
Join Date: Dec 2010
Device: Kindle PW2
|
I created a simple test file with Sigil that contains Latin, Hebrew and Arabic file names with combining diacritics. It also contains one file name without combining diacritics. Sigil had no problems with any of them.
However, EPUBCheck reported that it couldn't find the second résumé.xhtml file, which contained the letter e followed by a combining acute diacritic.
Code:
Col: -1: ERROR(RSC-001): File "OEBPS/Text/re%CC%81sume%CC%81.xhtml" could not be found.
Maybe this is the problem that the user is referring to. IMHO, using combining Unicode characters instead of precomposed characters isn't a good idea anyway, because many fonts don't contain glyphs for them.
Last edited by Doitsu; 05-05-2024 at 07:58 AM. |
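The `%CC%81` in that message is the utf-8 percent-encoding of U+0301 (combining acute accent). A short Python sketch shows how the decomposed and composed spellings of the same visible file name end up as different urls; the variable names are only for illustration.

```python
import unicodedata
from urllib.parse import quote

# "e" followed by U+0301 COMBINING ACUTE ACCENT (decomposed, NFD-style).
decomposed = "re\u0301sume\u0301.xhtml"
# The same name with precomposed U+00E9 (NFC).
composed = unicodedata.normalize("NFC", decomposed)

encoded_nfd = quote(decomposed)
encoded_nfc = quote(composed)

print(encoded_nfd)  # re%CC%81sume%CC%81.xhtml
print(encoded_nfc)  # r%C3%A9sum%C3%A9.xhtml
```

The two strings look identical on screen but compare unequal, so a lookup that uses one form will miss a file stored under the other.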
05-05-2024, 11:11 AM | #3 |
Sigil Developer
Posts: 8,258
Karma: 5568412
Join Date: Nov 2009
Device: many
|
Thank you Doitsu!
Good to know that even Epubcheck gets things wrong sometimes. |
05-05-2024, 01:58 PM | #4 |
Sigil Developer
Posts: 8,258
Karma: 5568412
Join Date: Nov 2009
Device: many
|
Ah, it seems Sigil on a Mac can create a .zip (.epub) whose file names are in Unicode Normalization Form D, which will not be in sync with manifest urls that are in Unicode Normalization Form C (and vice versa).
So at the boundary Sigil will have to convert zip file names and all files to Unicode Normalization Form C. The worst part is that the macOS HFS+ filesystem forced a modified Form D on all paths and filenames but the newer APFS filesystem does not, meaning that file names and paths in mixed normalization forms can exist side by side on a Mac using APFS. Just what I needed. |
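As a rough illustration of what normalizing at the boundary could look like, here is a Python sketch that maps every archive name to its NFC form and refuses to continue if two differently spelled names would collapse into one. The function name `nfc_name_map` is hypothetical and this is not Sigil's actual code, just the idea under the assumption that names are already decoded to Unicode strings.

```python
import unicodedata

def nfc_name_map(names):
    """Map each archive name to its NFC form; raise if two names collide."""
    mapping = {}
    seen = {}  # NFC form -> original name that produced it
    for name in names:
        nfc = unicodedata.normalize("NFC", name)
        if nfc in seen and seen[nfc] != name:
            # Possible on APFS, where both spellings can exist side by side.
            raise ValueError(
                f"{seen[nfc]!r} and {name!r} both normalize to {nfc!r}"
            )
        seen[nfc] = name
        mapping[name] = nfc
    return mapping

names = ["OEBPS/Text/re\u0301sume\u0301.xhtml", "OEBPS/Text/cover.xhtml"]
mapping = nfc_name_map(names)
for original, normalized in mapping.items():
    print(repr(original), "->", repr(normalized))
```

The collision check matters because renaming to NFC silently would otherwise let one file overwrite another when both spellings are present.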