Add title="" to h* based on existing TOC -- suggestion for new feature (or plugin?) - Page 5

Hitch · 07-08-2020, 09:43 AM

Quote:

Originally Posted by Mister L

I had typed out a reply to this and then the stupid browser ate it.

Can't be bothered to try and remember all of it. In short: From what I've seen, previous attempts have been intended to fix or draw from the headings in the html files. DNSB will correct me if I'm wrong about his method. I agree, this is doomed to failure because too complex and unpredictable. No-one has said to me, "I (or someone) tried to automate copying the titles out of the toc.ncx or nav.xhtml and pasting them into the corresponding html files without touching the heading formatting at all, it's impossible for reason X."

KevinH understood precisely what I was trying to do and summed it up very succinctly (idea simplified since then to remove danger of breakage, now I just want to paste the titles into html comments instead of a title attribute or h1 tag) but apart from that the discussion has mostly been about how complicated / impossible it would be to make a plugin to fix the headings. Which is 1. 100% true, no argument from me, and also 2. 100% irrelevant, because I don't want a plugin to fix the bleeding headings, I want a plugin to copy the titles out of the toc.ncx or nav and paste them into an html comment in the corresponding files, which is an entirely different proposition and takes a detour around the whole problem of the unpredictable headings (that is a separate problem the way I see it and much better --and easier-- dealt with using regex).

You yourself mention the headings in the html file (h1, h2) which makes me think you haven't understood either what I've been getting at, which is understandable if you have only a passing / zero interest in the question and haven't been following the thread very closely, but in fact I am pretty sure I have found a way to do the specific thing I want to do, I just don't have the technical competencies to implement it. Maybe someday I will, in which case at least this thread will have forced me to think about precisely how to do this and what kind of problems need to be avoided.

Don't you find it remotely ODD that somehow, EVERYBODY on this thread is apparently misreading your posts, or too damned stoopid to understand your SIMPLE meanings?

You're a formatter, not a programmer, right? Perhaps in your experience as a formatter, a customer--who "lacked the technical competencies to do a specific thing that they wanted done," asked you to do something, assuming that it was simple to do, precisely BECAUSE they lacked those technical competencies?

/done here.

Hitch

KevinH · 07-08-2020, 11:19 AM

I really do not want to get into the middle of this as I have been tied up and have not been following this thread at all until now, but there does appear to be clear miscommunication innocently going on by all parties.

That said, please let me try to clarify things in the hope this does not degenerate further.

1. Assume you have an epub where the titles in the ncx or nav are correct and what you want.

2. Assume further that the actual headings tags in the xhtml files are either missing (they used p's) or not formatted in some sane way.

(And yes this can be a common problem with Gutenberg and some other books).

Now, the problem is if you want to regenerate the ncx or nav, using the Sigil tools because you want to later move, split, etc, you will not get back the ncx or nav entries that are there now. You will lose those "good" titles as they would be replaced during the regeneration of the ncx or nav with stuff taken from bad or missing heading tags.

So what is being asked for here, is a plugin that will parse an existing ncx or nav and extract the link back and the title text. (Actually just supporting parsing an ncx would be enough as an ncx can be autogenerated from a nav on epub3).
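A minimal sketch of that first step (parse the ncx and pull out each link and its title text), assuming only Python's standard library. The sample NCX and the function name `ncx_entries` are made up for illustration; this is not code from any existing plugin.

```python
import xml.etree.ElementTree as ET

NCX_NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}

# Hypothetical sample; a real toc.ncx would be read from the epub.
SAMPLE_NCX = """<?xml version="1.0" encoding="utf-8"?>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <navMap>
    <navPoint id="np1" playOrder="1">
      <navLabel><text>Chapter One</text></navLabel>
      <content src="Text/ch01.xhtml#c1"/>
      <navPoint id="np2" playOrder="2">
        <navLabel><text>A Sub-Section</text></navLabel>
        <content src="Text/ch01.xhtml#s1"/>
      </navPoint>
    </navPoint>
  </navMap>
</ncx>"""


def ncx_entries(ncx_text):
    """Return (level, src, title) for every navPoint, in document order."""
    root = ET.fromstring(ncx_text.encode("utf-8"))
    entries = []

    def walk(parent, level):
        for point in parent.findall("ncx:navPoint", NCX_NS):
            title = point.findtext(
                "ncx:navLabel/ncx:text", default="", namespaces=NCX_NS)
            content = point.find("ncx:content", NCX_NS)
            src = content.get("src", "") if content is not None else ""
            # Collapse runs of whitespace inside the label
            entries.append((level, src, " ".join(title.split())))
            walk(point, level + 1)  # nested navPoints are one level deeper

    nav_map = root.find("ncx:navMap", NCX_NS)
    if nav_map is not None:
        walk(nav_map, 1)
    return entries
```

Each tuple carries the nesting level as well, since that is probably what would decide between h1, h2, and so on later.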

Then use that link to determine the destination file and in that file add a title attribute to the target heading tag that contains the title text extracted from the ncx/nav. If no target heading tag exists, then insert a new "no display" heading tag with the extracted title attribute.

If working with title attributes is too hard, then instead simply add a comment tag with the title text immediately before the destination element.
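Roughly, the two options just described could look like this. It is a regex-based sketch under several assumptions: the target element carries a double-quoted `id` attribute matching the link's fragment, the comment format `<!-- toc: ... -->` is invented here, and the "no display" heading case is left out (a non-heading target simply gets the comment). The name `mark_target` is hypothetical.

```python
import re
from html import escape


def mark_target(xhtml, fragment, title):
    """Record `title` at the element whose id is `fragment`.

    Headings (h1-h6) get a title attribute; any other element gets an
    HTML comment immediately before it. The text is returned unchanged
    if no element with that id is found.
    """
    pattern = re.compile(
        r'<(?P<tag>[A-Za-z][\w:.-]*)\b'
        r'(?P<attrs>[^<>]*?\sid="%s"[^<>]*?)(?P<close>/?>)'
        % re.escape(fragment)
    )
    match = pattern.search(xhtml)
    if match is None:
        return xhtml
    if re.fullmatch(r"h[1-6]", match.group("tag")):
        # Drop any existing title attribute, then add the TOC text
        attrs = re.sub(r'\s+title="[^"]*"', "", match.group("attrs"))
        new_tag = '<%s%s title="%s"%s' % (
            match.group("tag"), attrs,
            escape(title, quote=True), match.group("close"))
        return xhtml[:match.start()] + new_tag + xhtml[match.end():]
    # "--" is not allowed inside a comment, so soften it before inserting
    comment = "<!-- toc: %s -->\n" % title.replace("--", "- -")
    return xhtml[:match.start()] + comment + xhtml[match.start():]
```

A real plugin would more likely use a proper XHTML parser than a regex; the point here is only that the heading markup itself is never rewritten beyond the one attribute.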

After this plugin was run then:

If the title attribute on heading tags had been set, then regenerating the ncx or nav would preserve the good titles for the most part.

If instead comments are added, then a follow-up regular expression search and replace can then be more easily done taking the title text from the just preceding comment.
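As an illustration of that follow-up search and replace, here is a sketch in Python's `re` syntax. It assumes the comment looks like `<!-- toc: ... -->` (an invented format) and that the fake heading is a `p` element directly after it; both are assumptions, and the pattern would need adapting to each book.

```python
import re
from html import escape

# A "toc" comment followed by a paragraph that is acting as a heading.
COMMENT_THEN_P = re.compile(
    r'<!-- toc: (?P<title>.*?) -->\s*'
    r'<p\b(?P<attrs>[^<>]*)>(?P<body>.*?)</p>',
    re.DOTALL,
)


def promote_to_heading(xhtml, level=2):
    """Replace each comment + <p> pair with an <hN title="..."> heading."""
    def repl(m):
        # The comment holds raw text, so escape it for use in an attribute
        title = escape(m.group("title"), quote=True)
        return '<h%d%s title="%s">%s</h%d>' % (
            level, m.group("attrs"), title, m.group("body"), level)
    return COMMENT_THEN_P.sub(repl, xhtml)
```

The same pattern and replacement could be typed into a regex find-and-replace dialog instead, with numbered groups in place of the named ones.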

That is what is being asked for here.

Hopefully, this will make everything clearer to everyone involved.

Hope this helps.

KevinH

KevinH · 07-08-2020, 11:35 AM

FWIW, as everybody has noted, and as is evidently clear, miscommunication is the bane of IT support! The problem is that English words (any language really) are imprecise with many possible interpretations and meanings, and what seems perfectly clear to one party is gobbledygook to the other and vice versa.

This is why the true language of math and its supporting notation were developed (speaking as a just retired programmer and analytics/stats prof here!) because English words (even seemingly well defined) were simply not precise enough.

The problem with teaching stats to students was that only one party understood the precise notation that most textbooks used, and so my professor role really reduced to being that of a translator back to English, with the hope of still being precise via lots of examples. That was not always fruitful: most people only learned things by rote and, even more unfortunately, good thinking (the forest) was typically lost in the trees!

To make matters worse, the overlap of the domains of IT and English is less well structured than math and almost impossible to pin down (leaving lawyers as the only winners!).

For IT/software, the advent of rapid prototyping has helped, but obviously not fixed the issues.

DNSB · 07-08-2020, 12:04 PM

I simply went from the first post wherein it was stated:

Quote:

How easy / possible would it be to reverse engineer an existing TOC and add the existing titles as they appear, to a title="" in an h* tag at the appropriate point in the book? If the file is really badly made and the chapter titles are in some random tag like p or div it might be necessary to add a blank h* with a display:none to it.

A while back I attempted to create some code that would handle that task. What I found was that there were so many screwed up epubs that needed special case handling that I could not have my code reliably handle what on the surface appeared to be a simple task. Someone else may take a different approach that would be more successful. If so, I would happily use their code and learn from it.

DiapDealer · 07-08-2020, 12:06 PM

I guess I just don't follow why--given that the ncx (or nav) was already declared "good"--there would be any reason for regenerating the ncx from the text the proposed plugin took from that good ncx and plugged into the epub's html (utilizing attributes of h tags or contents of html comments that were later regexed into same).

Why insert non-rendering attributes into the html that can really only be used to regenerate the ncx if the ncx has already been declared sufficient?

Is the whole point to make an already functional, textually satisfactory NCX/NAV regeneratable from the html?

I could understand a desire for a plugin that truly reverses the html to ncx/nav process: namely making the various chapter/section headings in the html match the ncx/nav. But what I THINK I'm hearing, is a desire for a plugin that makes it possible to regenerate an NCX/NAV that doesn't need regenerated by inserting non-rendering html (or easily regexable non-rendering html comments) into the epub's xhtml.

If done correctly, an ncx/nav generated from the attributes inserted into the html by the proposed plugin (from the original ncx/nav) would look and function exactly like the original ncx/nav, no?

DNSB · 07-08-2020, 12:15 PM

Quote:

Originally Posted by DiapDealer

I guess I just don't follow why--given that the ncx (or nav) was already declared "good"--there would be any reason for regenerating the ncx from the text the proposed plugin took from that good ncx and plugged into the epub's html (utilizing attributes of h tags or contents of html comments that were later regexed into same).

Why insert non-rendering attributes into the html that can really only be used to regenerate the ncx if the ncx has already been declared sufficient?

Is the whole point to make an already functional, textually satisfactory NCX/NAV regeneratable from the html?

Given the OP mentioned splitting omnibuses back into the individual books as a use case, I suspect he is wanting to be able to recreate the navigation documents for the individual books without having to resort to manual editing.

So, yes, being able to regenerate the NCX/NAV from the html is the whole point.

KevinH · 07-08-2020, 12:17 PM

Yes, as any later use of Sigil tools (i.e. split into chapters, or merging, or moves or ...) may force you to either hand edit the ncx or regenerate it. Regenerating it would be easiest but will lose content unless heading title attributes are first set.

KevinH · 07-08-2020, 12:25 PM

Come to think of it, my ePub3-itizer plugin already has the code to parse an ncx and extract destination links and source text. You could convert those links to book paths, and use the plugin interface to open the correct destination file ... so a rough prototype should be doable.
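The "convert those links to book paths" step might look something like the following. This is a sketch using only the standard library, not the ePub3-itizer code; it assumes a book path is a "/"-separated path from the epub root, and the name `resolve_target` is hypothetical.

```python
import posixpath
from urllib.parse import unquote, urldefrag


def resolve_target(ncx_bookpath, src):
    """Turn an NCX content src into (book path of target file, fragment id).

    `src` is relative to the folder that holds the ncx and may be
    percent-encoded; the fragment is "" when the link has no "#id" part.
    """
    href, fragment = urldefrag(src)
    start_dir = posixpath.dirname(ncx_bookpath)
    bookpath = posixpath.normpath(posixpath.join(start_dir, unquote(href)))
    return bookpath, fragment
```

For example, `resolve_target("OEBPS/toc.ncx", "Text/ch01.xhtml#c1")` gives `("OEBPS/Text/ch01.xhtml", "c1")`. An empty fragment is the case where the entry points at a whole file rather than a specific element.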

If no one else wants to take a shot at this, I will ... but ... I am tied up for the next two weeks or so.

Quote:

Originally Posted by DNSB

I simply went from the first post wherein it was stated:

A while back I attempted to create some code that would handle that task. What I found was that there were so many screwed up epubs that needed special case handling that I could not have my code reliably handle what on the surface appeared to be a simple task. Someone else may take a different approach that would be more successful. If so, I would happily use their code and learn from it.

DiapDealer · 07-08-2020, 12:28 PM

Seems like an overly Rube Goldberg-ian process to me, but I'll happily add it to the plugin index when somebody develops and uploads it.

Tex2002ans · 07-08-2020, 04:03 PM

Quote:

Originally Posted by KevinH

That said, please let me try to clarify things in the hope this does not degenerate further.

[...]

Hopefully, this will make everything clearer to everyone involved.

Hope this helps.

Fantastic summary of what Mister L intends.

Quote:

Originally Posted by DiapDealer

Why insert non-rendering attributes into the html that can really only be used to regenerate the ncx if the ncx has already been declared sufficient?

Omnibuses (combining or splitting) are one use-case.

For example, the case I gave before of TOC:

Code:

“Article Title” by First Last

where the HTML might be:

Code:

<h2>Article Title</h2>
<p class="author">First Last</p>

Let's say I have Volumes 1-10 of journal articles in a single EPUB, and now I want to split each volume into 10 individual EPUBs.

I have a perfectly good TOC already generated... so (theoretical) plugin should be able to:

Code:

<h2 title="“Article Title” by First Last">Article Title</h2>
<p class="author">First Last</p>

This allows me to import Volume 1's HTML files into a separate EPUB, then regenerate using Sigil's Tools > Table of Contents > Generate Table of Contents.
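To make the idea concrete, here is a toy TOC builder that prefers a heading's title attribute over its visible text. It is only an illustration of the behaviour described in this thread, written with Python's standard `html.parser`; it is not Sigil's code, and the class and function names are made up.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, toc_text) for h1-h6, preferring the title attribute."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.entries = []
        self._level = None   # level of the heading currently open, if any
        self._title = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._title = dict(attrs).get("title")
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == "h%d" % self._level:
            visible = " ".join("".join(self._text).split())
            # Fall back to the visible text when there is no title attribute
            self.entries.append((self._level, self._title or visible))
            self._level = None


def toc_from_html(xhtml):
    collector = HeadingCollector()
    collector.feed(xhtml)
    collector.close()
    return collector.entries
```

Fed the h2 above, this yields the full "Article Title by First Last" style entry rather than the bare heading text, which is what lets the split-off volume regenerate the same TOC.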

DiapDealer · 07-08-2020, 04:34 PM

Sorry. I can usually at least comprehend someone else's use case for things like this, but I'm just not getting this one.

I think in terms of creating ebooks and fixing broken ebooks. That's about it. Turning someone else's ebook into something else (or multiple something elses) is simply not something I would bother doing. Why would one even want to split an omnibus ebook in the first place?

The good news is that I don't have to "get it."

BetterRed · 07-08-2020, 08:12 PM

Quote:

Originally Posted by KevinH

To make matters worse, the overlap of the domains of IT and English is less well structured than math and almost impossible to pin down (leaving lawyers as the only winners!).

Ain't that the truth.

Financiers, marketeers and bureaucrats are also destroyers of English. 'Bubble' being the latest overused fad word, yesterday I found myself typing '…a picket fence as a border bubble barrier…'

Quote:

Originally Posted by KevinH

For IT/software, the advent of rapid prototyping has helped, but obviously not fixed the issues.

But I'm not sure I agree with that, unless you mean rapid prototyping helps proliferate the verbing of nouns etc.

BR

KevinH · 07-08-2020, 11:12 PM

Since I am not an author or epub developer, just an avid epub user, I rarely create new epubs. That said ... I often see almost this exact use case in older Gutenberg epubs. The missing headings (p tags used instead), a working ncx, and all chapters in one big file needing to be split at some point, horrible file naming, etc. I often find myself cleaning these up before adding them to my own library.

Quote:

Originally Posted by DiapDealer

Sorry. I can usually at least comprehend someone else's use case for things like this, but I'm just not getting this one.

I think in terms of creating ebooks and fixing broken ebooks. That's about it. Turning someone else's ebook into something else (or multiple something elses) is simply not something I would bother doing. Why would one even want to split an omnibus ebook in the first place?

The good news is that I don't have to "get it."

slowsmile · 07-09-2020, 05:00 AM

@Mister L... I've now discovered why no-one will be able to create the plugin you want from your spec. I continued working on the plugin, which is actually working now, more or less, according to your own spec (see below). But here is the point -- that plugin will only work for the epub that I used to test it; it will never work for any other epub. Why? Well, here's the problem with your spec:

Quote:

Some html files have NO toc marker at all (eg the cover, titre.html...): in these files, the toc reference should be copied from the ncx into a new h1 tag OR an html comment at the top of the html file, whichever is easiest to code.

How can you expect the plugin to find those p tags if those p tags have no associated unique string or id? It can't be done, that's wishful thinking. So instead, and as a challenge because I was so bored (it's a COVID thing), I hard-coded the location of the p tags using either the file position or the file names of the cover.html, titre.html and carte.html files. And it worked for the test epub. But unless you somehow add a helpful string or id to those p tags, or perhaps even use a hidden h1 tag (using the display:none class property) together with an appropriate heading string that can be easily equated and compared to its NCX equivalent, that plugin will never fly for any epub other than your test epub.

I also found a lone non-breaking space lurking in between the empty h1 tags in the Citation file in your test epub which unhappily screwed up the results. How did that get there? I didn't bother to fix that, no point since no one's going to use the plugin anyway. By that time, as you'll appreciate, I'd had enough.
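Two small helpers sketch how these cases might be handled without hard-coding file names: one treats a heading that contains only a non-breaking space as empty, the other drops the TOC text into a comment at the top of the body when the ncx entry points at a whole file with no marker in it. Both are assumptions about one possible approach, not the code from the plugin described above; the comment format and the function names are invented.

```python
import re
from html import unescape


def heading_is_blank(inner_html):
    """True if a heading's content has no visible text (NBSP counts as blank)."""
    # Strip tags, decode entities such as &nbsp; / &#160;, then trim.
    text = unescape(re.sub(r"<[^<>]+>", "", inner_html))
    return text.strip() == ""


def comment_at_top(xhtml, title):
    """Insert a TOC comment right after the opening <body ...> tag."""
    comment = "\n<!-- toc: %s -->" % title.replace("--", "- -")
    # A function replacement avoids backslash surprises in the title text
    return re.sub(r"(<body\b[^<>]*>)",
                  lambda m: m.group(1) + comment, xhtml, count=1)
```

Python's `str.strip()` already treats U+00A0 as whitespace, so the only extra work is decoding the entity form first.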

Mister L · 07-09-2020, 08:29 PM

Quote:

Originally Posted by KevinH

That said, please let me try to clarify things in the hope this does not degenerate further.

(...)

Hopefully, this will make everything clearer to everyone involved.

Hope this helps.

KevinH

That's exactly right.

Quote:

Originally Posted by KevinH

Come to think of it, my ePub3-itizer plugin already has the code to parse an ncx and extract destination links and source text. You could convert those links to book paths, and use the plugin interface to open the correct destination file ... so a rough prototype should be doable.

If no one else wants to take a shot at this, I will ... but ... I am tied up for the next two weeks or so.

That would be amazing if you manage to find the time to get to it, I'd be really grateful.

Quote:

Originally Posted by Tex2002ans

Fantastic summary of what Mister L intends.

Omnibuses (combining or splitting) are one use-case.

For example, the case I gave before of TOC:

Code:

“Article Title” by First Last

where the HTML might be:

Code:

<h2>Article Title</h2>
<p class="author">First Last</p>

Let's say I have Volumes 1-10 of journal articles in a single EPUB, and now I want to split each volume into 10 individual EPUBs.

I have a perfectly good TOC already generated... so (theoretical) plugin should be able to:

Code:

<h2 title="“Article Title” by First Last">Article Title</h2>
<p class="author">First Last</p>

This allows me to import Volume 1's HTML files into a separate EPUB, then regenerate using Sigil's Tools > Table of Contents > Generate Table of Contents.

Yes exactly. I hadn't thought of the particular use case you describe until you mentioned it and I'm sure there are plenty more I haven't encountered or thought of. I mentioned previously in this thread some of the more common use cases for me:
- splitting an omnibus
- creating an omnibus from previously published individual books
- adding new material to a previously published book (first chapter of a different book, as a preview; new introduction; etc.)
- cleaning up a book (from Project Gutenberg) which was very badly formatted to begin with.

As for the reason someone would have for doing any of these, in my case most of the time it's because that is what the client (a publisher) has hired me to do.



