ePub creation tools : what's missing ? wishlist / dialogue - Page 4

Valloric · 02-09-2009, 06:22 PM

Quote:

Originally Posted by Komenor

Ok, but if the user can edit the source code, it's harder to certify the quality of the ePub output. We can use Tidy, of course, but it is an additional pre-requirement to the application...

You cannot guarantee that your application's output will be valid epub. Not in any realistic (and useful) editor.

Let me elaborate...

Any ebook editor needs to be able to import (X)HTML. That's a given. If it's a good editor, then it will handle a lot more than just HTML, but let's stick to just that for now.

OK, so the application accepts an HTML file. Is the file valid HTML? You can't make that a precondition. I'm sorry, you just can't. Most HTML out there is nowhere near being valid, and the user could need to import HTML he didn't write himself.

So the app needs to accept invalid HTML, that is, HTML that displays OK in a modern browser but does not follow the required standards. And with that, you just blew any possibility of having a guarantee that the epub you export will always be standards compliant.

Why?

Well, you can't design a useful algorithm that accepts invalid HTML and outputs valid HTML. A useful algorithm would have these requirements, for any input:

1. Always output valid HTML.
2. The resultant HTML would always correctly represent the content of the original HTML and the intent of its author.

The first one is easy. If you remove the second one, for any input, just output whatever you like. But with the second requirement, you get a specification that cannot be fulfilled by any implementation, because it's incomputable.

Now, you could design an algorithm that fulfills both requirements for some input, but not for all. And no, not even Tidy can give you that, because it is theoretically impossible.

So you're stuck now. You can't guarantee your users that you will always output a valid epub file no matter what they import. You can do your best (and you should), but in the end... The second requirement is much more important than the first one. So you fix what you can and possibly tell the user about what you can't.

If they really care about producing a valid epub file, they will have to fix the errors your app can't fix themselves. And so you make it easy for them and give them access to the source. And if they introduce any errors whilst editing the source, it's their fault. They will probably have to fix it by editing the source, too.

Now if you wanted an editor that could only create epub files from scratch, then you could guarantee standard compliance if you disallow direct source code editing. But you don't want to make that kind of editor.

Your output can only be as good as the input (maybe slightly better, for trivial errors in the original file). The editor can't turn shit into gold, and can not give guarantees about compliance. Any that does is flat-out lying.

llasram · 02-09-2009, 11:01 PM

Quote:

Originally Posted by Valloric

1. Always output valid HTML.
2. The resultant HTML would always correctly represent the content of the original HTML and the intent of its author.

The first one is easy. If you remove the second one, for any input, just output whatever you like. But with the second requirement, you get a specification that cannot be fulfilled by any implementation, because it's incomputable.

Well, it would depend on what you meant by "represent the content of the original HTML." It would be fairly easy to strip all semantic tag information from source HTML and translate it into nothing but <div/>, <span/>, <a/>, and <img/> tags with appropriate CSS. That would make it trivial to output valid XHTML which retained exactly the same formatting characteristics as specified by the author.

Jellby · 02-10-2009, 04:25 AM

The program could accept invalid (X)HTML, and issue a warning if the final (X)HTML does not validate.

Valloric · 02-10-2009, 08:42 AM

Quote:

Originally Posted by llasram

Well, it would depend on what you meant by "represent the content of the original HTML." It would be fairly easy to strip all semantic tag information from source HTML and translate it into nothing but <div/>, <span/>, <a/>, and <img/> tags with appropriate CSS. That would make it trivial to output valid XHTML which retained exactly the same formatting characteristics as specified by the author.

Again, this would work for some input, but not for all. I also put "the intent of the author" in that prerequisite too. The author of the original file could write relatively complex HTML that does not validate and that you could not convert into standards compliant XHTML which faithfully represents the input file.

There's really no point discussing it, this is computer science 101: conversion of input from one language with non-deterministic rules (that is, non-validating HTML) to another with deterministic rules (standards compliant XHTML) whilst keeping all of the source information. An algorithm to perform this conversion for all input cannot be designed. It is theoretically impossible.

But that doesn't mean the application can't fix some errors and output valid XHTML. I'm just saying you can't guarantee compliance and not have to mangle the input in some situations. And even then it wouldn't work for some cases.

Quote:

Originally Posted by Jellby

The program could accept invalid (X)HTML, and issue a warning if the final (X)HTML does not validate.

My working idea too. Fix what you can, inform about what you can't, but don't mangle the input in any way or form. It is more important to guarantee to the user that you won't make some tiny change half-way through the novel he's importing than it is to guarantee standards compliance.

You can't piss off your users by trying to twist and turn their HTML into something it can't automatically become.

llasram · 02-10-2009, 10:15 AM

Quote:

Originally Posted by Valloric

Again, this would work for some input, but not for all. I also put "the intent of the author" in that prerequisite too. The author of the original file could write relatively complex HTML that does not validate and that you could not convert into standards compliant XHTML which faithfully represents the input file.

There's really no point discussing it, this is computer science 101: conversion of input from one language with non-deterministic rules (that is, non-validating HTML) to another with deterministic rules (standards compliant XHTML) whilst keeping all of the source information. An algorithm to perform this conversion for all input cannot be designed. It is theoretically impossible.

I really don't understand what you're getting at, I'm afraid. I could write "fubby ducky loopy sunbird" and mean "Good morning, how are you?" and there would be no chance of conversion because the intent is all in my mind. With arbitrarily bad HTML the only possible interpretation of the author's intent is how some renderer renders that content. All contemporary HTML renderers use the same CSS box model for all rendering. Converting arbitrarily bad HTML into XHTML which displays the same is simply a matter of applying the same rules the browser does in order to produce the box model instance it renders.

XHTML validity is a property of two components: XML validity and adherence to the XHTML schema, yah? Conversion of HTML w/o closing tags to valid XML with complete elements can be tricky, but the browser necessarily does essentially the same thing in deciding what content ends up within what boxes. The Python lxml.html library calibre uses does an excellent job, matching for all practical purposes what most Web browsers produce. Producing schema-validating XHTML is where my proposal to strip all semantic tags comes in. CSS-based rendering doesn't care if you have a <div/> within a <p/> or a <sup/> within an <a/>. One just needs to extract the CSS applied to each element, then convert the element tags into ones which validate against the schema.
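
As a rough illustration of the "HTML w/o closing tags to XML with complete elements" step, here is a minimal sketch using only Python's standard html.parser module. It is not how lxml.html or Tidy work. The names TagBalancer, balance, and VOID are made up for this example, and the repair rule (close whatever is still open, drop stray end tags) is a deliberately naive assumption. It only aims at well-formed XML, not at schema-valid XHTML, and it ignores comments, doctypes, and duplicate attributes.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical, incomplete list of elements that never take an end tag.
VOID = {"br", "hr", "img", "meta", "link", "input"}


class TagBalancer(HTMLParser):
    """Re-emit HTML as well-formed XML by closing whatever was left open."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # pieces of the repaired markup
        self.stack = []  # names of the elements that are currently open

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (value None) are written as attr="".
        attr_text = "".join(
            f' {name}="{escape(value or "", quote=True)}"' for name, value in attrs
        )
        if tag in VOID:
            self.out.append(f"<{tag}{attr_text}/>")
        else:
            self.out.append(f"<{tag}{attr_text}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag: nothing to close, so drop it
        # Close every element opened after `tag`, then `tag` itself.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def balance(html_text):
    """Return html_text with every element explicitly closed."""
    parser = TagBalancer()
    parser.feed(html_text)
    parser.close()
    while parser.stack:  # close anything still open at end of input
        parser.out.append(f"</{parser.stack.pop()}>")
    return "".join(parser.out)
```

With this sketch, balance("<p>one<b>two</p>") gives <p>one<b>two</b></p>. The same rule also shows where guessing creeps in: <b><i>x</b>y</i> becomes <b><i>x</i></b>y, which is well-formed but drops the italics on "y", whereas a browser would typically keep them.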

Valloric · 02-10-2009, 11:11 AM

Quote:

Originally Posted by llasram

With arbitrarily bad HTML the only possible interpretation of the author's intent is how some renderer renders that content. All contemporary HTML renderers use the same CSS box model for all rendering.

Quote:

Originally Posted by llasram

The Python lxml.html library calibre uses does an excellent job, matching for all practical purposes what most Web browsers produce.

There is no argument here.

I agree that you could very well design an algorithm that converts non-valid HTML into valid XHTML for most HTML people will write. It's what your "lxml.html" library does (although I've never used it) and it's what Tidy does as well.

But you can't do it for all possible arbitrarily bad HTML. You're assuming the user checked how his source displayed in a browser. If he did, then it's not a matter of parsing arbitrarily bad HTML. It's not a non-deterministic rule system anymore: the source follows the deterministic rendering rules of the browser he used to check his work. Converting from a deterministic language to another deterministic language is certainly possible. And while you could say that the vast majority of HTML authors would do just that (check the display in a browser) before importing, you can't categorically state it.

So let's sum this up... you can create an algorithm that can convert most practical non-conforming HTML into valid XHTML, but not all HTML one could write. If one were to say he could, one would be showing a grave ignorance of computer science theory.

Komenor · 02-11-2009, 09:25 AM

Quote:

Originally Posted by Valloric

You cannot guarantee that your application's output will be valid epub. Not in any realistic (and useful) editor.

Let me elaborate...

Any ebook editor needs to be able to import (X)HTML. That's a given. If it's a good editor, then it will handle a lot more than just HTML, but let's stick to just that for now.

I never said that my hypothetical editor would be able to import (X)HTML!

If it is only for changing fonts, justification, and other text formatting, the editor should import plain text files and nothing else.
It would then provide tools for text formatting (plus, eventually, support for tables and pictures).

It is a choice: a "poor" editor with certified XHTML/ePub output or a good editor with no certification (or warnings on bad inputs).

Valloric · 02-11-2009, 12:46 PM

Quote:

Originally Posted by Komenor

It is a choice: a "poor" editor with certified XHTML/ePub output or a good editor with no certification (or warnings on bad inputs).

A "good" editor would embed some sort of validation of the final epub file. So if you don't get a warning when exporting, you're in the clear. And most of the time, the editor will be able to convert the user's non-conforming HTML into conforming XHTML.

Here are several use cases:

1. The user imports valid HTML. It is easily converted into XHTML. He then makes certain edits, and tries to export the book as an epub file. The epub file is created, the validator runs through it and finds no errors. All is well in ebook land.

2. The user imports invalid HTML. An algorithm tries to correct the input and create valid XHTML, and succeeds. The user then makes certain edits, and tries to export the book as an epub file. The epub file is created, the validator runs through it and finds no errors. All is well in ebook land.

3. The user imports invalid HTML. An algorithm tries to correct the input and create valid XHTML, and does not succeed: errors are thrown, the user is informed. The user opens the source view and tries to fix the problems. The user then makes certain other edits, and tries to export the book as an epub file. The epub file is created, the validator runs through it and finds no errors. All is well in ebook land.

4. The user imports invalid HTML. An algorithm tries to correct the input and create valid XHTML, and does not succeed: errors are thrown, the user is informed. The user opens the source view and tries to fix the problems. The user then makes certain other edits, and tries to export the book as an epub file. The epub file is created, the validator runs through it and finds errors. The user is informed, but the file remains--maybe the user doesn't care (if it's a file for personal use... who knows). If he does care, he makes more changes, and tries to export the file. The change/export process repeats until no errors are thrown.

So you see, the user can get an epub file that is certifiably valid.
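
The four use cases above can be sketched roughly as follows. This is only an illustration under stated assumptions: validate is a stand-in that checks XML well-formedness with Python's standard library (a real editor would run a full epub validator instead), and import_html, export, and the fixer argument are hypothetical names, not part of any existing tool.

```python
import xml.etree.ElementTree as ET


def validate(xhtml):
    """Stand-in validator: reports XML well-formedness errors only."""
    try:
        ET.fromstring(xhtml)
    except ET.ParseError as err:
        return [str(err)]
    return []


def import_html(source, fixer):
    """Return (text to edit, errors the user still has to fix by hand)."""
    source_errors = validate(source)
    if not source_errors:
        return source, []          # use case 1: input was already fine
    fixed = fixer(source)
    if not validate(fixed):
        return fixed, []           # use case 2: the automatic fix worked
    # Use cases 3 and 4: leave the input untouched and tell the user.
    return source, source_errors


def export(xhtml):
    """Always 'write' the book; hand back whatever the validator found."""
    return {"written": True, "errors": validate(xhtml)}
```

With a toy fixer such as lambda s: s + "</p>", importing "<p>open" comes back repaired with no errors, while "<p><b>x</p>" comes back unchanged together with one error for the user to fix in the source view. The export step never refuses to produce the file; it only reports, so the user decides whether to repeat the change/export cycle.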

mtravellerh · 02-12-2009, 04:59 PM

Now THAT makes sense. Can't wait for that piece of software, honestly!

Timoleon · 02-13-2009, 12:25 PM

Valloric's comments #53 should be used as a touchstone for any decent ePub editor. Great analysis and synopsis!

GeoffC · 02-13-2009, 12:39 PM

Seemingly complex, though?

Valloric · 02-13-2009, 04:36 PM

Quote:

Originally Posted by GeoffC

Seemingly complex, though?

It's as complex as it needs to be. If you remove something, you negatively impact the quality and usefulness of the editor.

From the programmer's perspective though, it is fairly complex. But the user doesn't care about that, does he? Of course he doesn't, nor should he.

GeoffC · 02-14-2009, 04:48 AM

It was, of course, the complexity of the programmer's task that I was referring to.

richardigp · 02-23-2009, 06:01 AM

We just threw an Open Office to ePub Convertor into the fray. It goes by the name of eScape. It does most of the advanced styling and formatting that is on the wish list above: auto-generation of OPF, NCX, etc., and free-form modification of stylesheets to create a book the way you want it to look. You can read about it and try it here. It's completely free for non-commercial use, but not Open Source.

It's a different approach. Rather than try to interpret endless inline and para styles, we define custom Structure-Styles and you have to put those on. There is a growing online tutorial here, so you can see if you can live with this different approach.

There are about 30 styles including drop & raised caps, small-caps, and lots of other blocks like epigraph, extract, notebox, code, boxed text, poem, notes, references, etc. All major book sections are predefined. If you want to comment or make suggestions, please do so at our Publishing With XML blog.

Valloric · 02-23-2009, 02:31 PM

Quote:

Originally Posted by richardigp

It's a different approach. Rather than try to interpret endless inline and para styles, we define custom Structure-Styles and you have to put those on.

Questions:

1. How do you convert existing epub books to your format? Is it even possible to load existing epub books and edit them?

2. How do you guarantee display fidelity? Last time I checked, OpenOffice.org did not have an advanced XHTML renderer.

3. SVG? OO.org doesn't support it. Do you?

4. How do you handle the "longdesc" attribute? Do you support it?

5. Object tags?

6. DTBook?

7. XML islands?

8. Font embedding?

These are just from the top of my head. Haven't yet had the time to try out eScape, but I'm going to.
