04-28-2024, 09:55 AM | #1 |
Junior Member
Posts: 9
Karma: 11018
Join Date: Feb 2024
Device: none
|
Non-breaking Space numerical entity NOT preserved after generating TOC
An issue has popped up where Sigil, seemingly at random, has converted about half of all instances of the non-breaking space numerical entity (already included in Preserve Entities!) to a very hard-to-see, red, broken underline symbol.
After 8 hours of investigating, I have found a way to easily reproduce the issue. Details and background (excuse my less-than-perfect English):
I've copied text from my plain text editor into Sigil's Code View and added basic formatting like headings, paragraphs and lists. Nothing fancy, as I like to keep things simple and not introduce complicated styling that might not work in some EPUB reading apps. I'll add lots of images later, but 80% of the text is now imported and formatted. This is my first time using Sigil. I have read the entire Sigil User Guide twice, plus searched the user forum.
My e-book (EPUB version 3) is about astronomy and contains lots of numbers, units of measurement and symbols. A non-breaking space is often needed in these kinds of expressions. It's not optional, it's a must. (I frequently use non-breaking spaces on my astronomy website as well.)
In Sigil's preferences, the Preserve Entities list already contains the numerical entity (& # 160; – obviously without the spaces), so according to the user guide, this should preserve the entity and prevent Mend from converting it to a Unicode character.
I made a clip containing the numerical entity and added non-breaking spaces here and there. In total, my document contained 295 non-breaking spaces, coded as the above-mentioned numerical entity. After closing, opening, editing various chapters, and saving many times, I could clearly see in Code View that the numerical entity was present. Great!
However, when I opened Sigil yesterday, I saw that many of the non-breaking spaces were gone! Or rather, some had been converted to a red, broken underline symbol, which I later found out is the same symbol that appears when using the Insert Special Character panel. The underline is extremely thin and hard to see. Using Find, I discovered that 150 of the 295 non-breaking spaces had been converted from the numerical entity to the almost invisible underline symbol.
Luckily, all non-breaking spaces seem to work as they should when checking in Preview. But since this character is so important in my book, I would really like to be able to see it clearly in Code View, i.e. I would like Sigil to respect the Preserve Entities list, which, even by default, includes this particular numerical entity.
I didn't understand why this had happened. Why is the numerical entity not preserved, and why have half of them been converted (0% or 100% would make more sense)? Even in a single short chapter, some instances have been converted and some not. I couldn't make sense of it.
Please note that I have not changed anything in Sigil's Basics setting, i.e. "Mend Not Well Formed HTML Source Code ON" has both boxes (Open and Save) ticked. I will run various tools and checks before publishing my e-book, but so far I have not used any of the "Reformat" or "Check" tools in the Tools menu.
The culprit? The only thing I have actively run is Generate Table of Contents. I did that for the first time a few days ago, so I wondered if that might have had an impact. Using a demo file, I found out that Sigil preserves the non-breaking space numerical entity when editing, saving, closing and opening the file. However, after using Generate Table of Contents, about half of them are converted to the thin underline symbol! Maybe this inconsistency is a bug? |
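For anyone who wants to repeat the count outside Sigil's Find dialog, the two forms can be tallied with a short script. This is a minimal Python sketch, not anything Sigil provides: the function name `count_nbsp_forms` and the sample string are made up for illustration, and it only looks for the decimal entity and the literal U+00A0 character.

```python
NBSP_ENTITY = "&#160;"   # the numeric entity as typed in Code View
NBSP_CHAR = "\u00a0"     # the literal no-break space character


def count_nbsp_forms(xhtml_source: str) -> tuple[int, int]:
    """Return (entity_count, literal_count) for one XHTML source string."""
    return xhtml_source.count(NBSP_ENTITY), xhtml_source.count(NBSP_CHAR)


# Hypothetical sample: two entities and one literal character.
sample = "<p>150&#160;million&#160;km, or 1\u00a0au</p>"
entities, literals = count_nbsp_forms(sample)
print(f"entities: {entities}, literal characters: {literals}")
```

Running it over the text of each chapter file before and after Generate Table of Contents would show whether entities have turned into literal characters.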
04-28-2024, 10:47 AM | #2 |
Sigil Developer
Posts: 8,156
Karma: 5450818
Join Date: Nov 2009
Device: many
|
Sigil's xhtml parser does not need or use entities. Instead, it converts those entities to their actual character values. That is fine for most entities except the whitespace ones, so the CodeView editor highlights those with a red underscore so you know the character is there and working.
Adding numeric or named entities to Sigil's Preserve Entities preferences will convert those characters back to their entities so that you, the editor user, can see them. But since you already have it in Preserve Entities, it should have kept them as entities. Just run Mend All and your entities should re-appear. I will look into why Generate Table of Contents is sometimes missing the Preserve Entities replacement step. Thank you for your bug report. |
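The round trip described above can be illustrated with Python's standard XML parser. This is only a sketch of the general idea and not Sigil's code: `restore_preserved` and the `PRESERVED` mapping are hypothetical stand-ins for the Preserve Entities replacement step.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the Preserve Entities list: character -> entity.
PRESERVED = {"\u00a0": "&#160;"}


def restore_preserved(serialized: str, preserved: dict[str, str]) -> str:
    """Replace each preserved character with its entity form."""
    for char, entity in preserved.items():
        serialized = serialized.replace(char, entity)
    return serialized


source = "<p>384&#160;400 km</p>"

# Parsing resolves the entity to the actual character.
element = ET.fromstring(source)
assert element.text == "384\u00a0400 km"

# Serializing writes the literal character, so the entity is gone...
serialized = ET.tostring(element, encoding="unicode")

# ...unless a final replacement step puts it back before saving.
restored = restore_preserved(serialized, PRESERVED)
print(restored == source)
```

If a save path skips the last step, the literal character ends up in the file, which matches the symptom reported in the first post.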
04-28-2024, 11:42 AM | #3 |
Sigil Developer
Posts: 8,156
Karma: 5450818
Join Date: Nov 2009
Device: many
|
Yes, a longstanding but luckily basically harmless bug. When you generate a toc and manipulate things by selecting and omitting headers, Sigil adds a marker class "not included" in toc header attributes to record the fact that this header is not included. In that case, or if you promote or demote a header, or change the title of a header, Sigil must parse the file to make those changes and then save it.
Sigil neglected to run it through the CharToEntity conversion before saving that file. So some unicode chars were not properly converted back to their entity equivalents. Just running Mend All will put back your Preserved Entities, so nothing was ever lost, just converted to its actual unicode character equivalent. That bug fix has now been pushed to the Sigil master repo and will appear in our next release. Thank you for your detective work on this bug, which allowed me to recreate it and greatly narrow down where it might be. Last edited by KevinH; 04-28-2024 at 12:16 PM. |
04-29-2024, 09:43 AM | #4 |
Junior Member
Posts: 9
Karma: 11018
Join Date: Feb 2024
Device: none
|
Thank you very much for implementing a fix for the next release.
Side note: The numerical entity (& # 160;) makes my code look rather messy, but it does mean I can easily identify where I have added non-breaking spaces. The red underscore, at least on my 5K iMac, is so thin that it's barely visible. If the red underscore had been thicker, I would probably have preferred that to the numerical entity. But I won't ask you to consider tweaking it. I'm very happy that the bug was identified. Also, thank you for the tip about Mend All as a workaround for bringing the numerical entity back in my epub! Case closed. |
04-29-2024, 10:20 AM | #5 |
Sigil Developer
Posts: 8,156
Karma: 5450818
Join Date: Nov 2009
Device: many
|
FWIW ...
that red line is used to represent many different types of special whitespace characters that otherwise have no visible means to show up, to help distinguish them from normal whitespace. It is applied by the syntax highlighter. epub3 requires numeric entities, while epub2 allows either numeric or named entities. As nested named entities became famous for being used to create entity expansion attacks on web servers, numeric entities are now the only type allowed or used, except for the small set of basic xml recognized entities: gt, lt, etc. |
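If the red line is too thin to spot on a given display, a similar check can be scripted outside Sigil. The Python sketch below has nothing to do with Sigil's syntax highlighter: `find_special_whitespace` is a hypothetical helper, and treating every Unicode whitespace character other than space, tab, newline and carriage return as "special" is an assumption, not Sigil's actual list.

```python
import unicodedata

# Whitespace that is expected in ordinary source text and not reported.
ORDINARY = {" ", "\t", "\n", "\r"}


def find_special_whitespace(text: str) -> list[tuple[int, str]]:
    """Return (index, Unicode name) for each unusual whitespace character."""
    return [
        (i, unicodedata.name(ch, "UNNAMED"))
        for i, ch in enumerate(text)
        if ch.isspace() and ch not in ORDINARY
    ]


# Hypothetical sample containing a no-break space and a thin space.
for index, name in find_special_whitespace("1\u00a0au = 149.6\u2009Gm"):
    print(index, name)
```

Note that `str.isspace()` does not cover zero-width characters such as U+200B, so those would need to be added to the check separately.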
Tags |
non-breaking space, numerical entity, table of contents |