02-09-2009, 07:22 PM | #46 | |
Created Sigil, FlightCrew
Posts: 1,982
Karma: 350515
Join Date: Feb 2008
Device: Kobo Clara HD
|
Quote:
Let me elaborate... Any ebook editor needs to be able to import (X)HTML. That's a given. If it's a good editor, then it will handle a lot more than just HTML, but let's stick to just that for now.
OK, so the application accepts an HTML file. Is the file valid HTML? You can't make that a precondition. I'm sorry, you just can't. Most HTML out there is nowhere near being valid, and the user could need to import HTML he didn't write himself. So the app needs to accept invalid HTML, that is, HTML that displays OK in a modern browser but that does not follow the required standards.
And with that, you just blew any possibility of having a guarantee that the epub you export will always be standards compliant. Why? Well, you can't design a useful algorithm that accepts invalid HTML and outputs valid HTML. A useful algorithm would have these requirements, for any input:
1. Always output valid HTML.
2. The resultant HTML would always correctly represent the content of the original HTML and the intent of its author.
The first one is easy. If you drop the second one, you can just output whatever you like for any input, as long as it is valid. But with the second requirement, you get a specification that cannot be fulfilled by any implementation, because it's incomputable. Now, you could design an algorithm that fulfills both requirements for some input, but not for all. And no, not even Tidy can give you that, because it is theoretically impossible.
So you're stuck now. You can't guarantee your users that you will always output a valid epub file no matter what they import. You can do your best (and you should), but in the end... The second requirement is much more important than the first one. So you fix what you can and possibly tell the user about what you can't. If they really care about producing a valid epub file, they will have to fix the errors your app can't fix themselves. And so you make it easy for them and give them access to the source.
And if they introduce any errors whilst editing the source, it's their fault. They will probably have to fix it by editing the source, too. Now if you wanted an editor that could only create epub files from scratch, then you could guarantee standards compliance if you disallowed direct source code editing. But you don't want to make that kind of editor. Your output can only be as good as the input (maybe slightly better, for trivial errors in the original file). The editor can't turn shit into gold, and cannot give guarantees about compliance. Any that does is flat-out lying.
Last edited by Valloric; 02-09-2009 at 07:26 PM. |
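To make "fix what you can and tell the user about what you can't" concrete, here is a rough sketch in Python using only the standard library's html.parser. It is not how Tidy or any real editor works; the class name BestEffortRepair and the function repair are made up for illustration, and the repair rules (close dangling tags, drop stray end tags) are just one guess at the author's intent among several possible ones.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Elements that never have content; emitted as self-closing tags.
VOID = {"br", "hr", "img", "meta", "link", "input"}


class BestEffortRepair(HTMLParser):
    """Hypothetical repairer: balances tags and records every guess it makes.

    Comments, doctypes and processing instructions are silently dropped
    to keep the sketch short.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # pieces of the repaired markup
        self.stack = []  # currently open elements
        self.notes = []  # what was changed, to show the user

    def handle_starttag(self, tag, attrs):
        attr_text = "".join(f" {k}={quoteattr(v or '')}" for k, v in attrs)
        if tag in VOID:
            self.out.append(f"<{tag}{attr_text} />")
        else:
            self.out.append(f"<{tag}{attr_text}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return  # already emitted as self-closing
        if tag not in self.stack:
            self.notes.append(f"dropped stray </{tag}>")
            return
        # Close everything opened since the matching start tag.
        while True:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break
            self.notes.append(f"closed <{open_tag}> early at </{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data))


def repair(html):
    """Return (repaired_markup, notes). The notes list every guess made."""
    parser = BestEffortRepair()
    parser.feed(html)
    parser.close()
    while parser.stack:
        tag = parser.stack.pop()
        parser.out.append(f"</{tag}>")
        parser.notes.append(f"closed <{tag}> at end of input")
    return "".join(parser.out), parser.notes


fixed, notes = repair("<p><b>one<i>two</b>three</i>")
print(fixed)
for note in notes:
    print("note:", note)
```

For the mis-nested input above, the tags come out balanced, but "three" is no longer inside the i element. Whether the author wanted it italic is exactly the kind of intent the repair can only guess at, which is why the notes are returned alongside the markup.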
|
02-10-2009, 12:01 AM | #47 | |
Reticulator of Tharn
Posts: 618
Karma: 400000
Join Date: Jan 2007
Location: EST
Device: Sony PRS-505
|
Quote:
|
|
02-10-2009, 05:25 AM | #48 |
frumious Bandersnatch
Posts: 7,536
Karma: 19000001
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
The program could accept invalid (X)HTML, and issue a warning if the final (X)HTML does not validate.
|
02-10-2009, 09:42 AM | #49 | ||
Created Sigil, FlightCrew
Posts: 1,982
Karma: 350515
Join Date: Feb 2008
Device: Kobo Clara HD
|
Quote:
There's really no point discussing it, this is computer science 101: conversion of input from one language with non-deterministic rules (that is, non-validating HTML) to another with deterministic rules (standards compliant XHTML) whilst keeping all of the source information. An algorithm to perform this conversion for all input cannot be designed. It is theoretically impossible. But that doesn't mean the application can't fix some errors and output valid XHTML. I'm just saying you can't guarantee compliance and not have to mangle the input in some situations. And even then it wouldn't work for some cases.
Quote:
You can't piss off your users by trying to twist and turn their HTML into something it can't automatically become. |
||
02-10-2009, 11:15 AM | #50 | |
Reticulator of Tharn
Posts: 618
Karma: 400000
Join Date: Jan 2007
Location: EST
Device: Sony PRS-505
|
Quote:
XHTML validity is a property of two components: XML validity and adherence to the XHTML schema, yah? Conversion of HTML w/o closing tags to valid XML with complete elements can be tricky, but the browser necessarily does essentially the same thing in deciding what content ends up within what boxes. The Python lxml.html library calibre uses does an excellent job, matching for all practical purposes what most Web browsers produce. Producing schema-validating XHTML is where my proposal to strip all semantic tags comes in. CSS-based rendering doesn't care if you have a <div/> within a <p/> or a <sup/> within an <a/>. One just needs to extract the CSS applied to each element, then convert the element tags into ones which validate against the schema. |
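One possible reading of the tag-stripping proposal, sketched in Python with the standard library's xml.etree.ElementTree. This is an assumption-heavy illustration, not calibre's or lxml.html's behaviour: the ASSUMED_STYLES table is a made-up stand-in for "the CSS applied to each element" (a real tool would compute it from the stylesheets), and flatten_tags is a hypothetical name. The input must already be well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-tag styles standing in for the computed CSS of each element.
ASSUMED_STYLES = {
    "p": "display:block; margin:1em 0",
    "div": "display:block",
    "sup": "display:inline; vertical-align:super; font-size:smaller",
    "a": "display:inline; text-decoration:underline",
}


def flatten_tags(root):
    """Rename every styled element to <div>, keeping its look as inline CSS.

    The original tag name is kept in the class attribute. Other attributes
    (such as href) are left untouched and would still need separate handling.
    """
    for el in root.iter():
        style = ASSUMED_STYLES.get(el.tag)
        if style is None:
            continue  # unknown elements (here: body) are left alone
        el.set("class", el.tag)
        el.set("style", style)
        el.tag = "div"
    return root


source = '<body><p>one <div>two</div><a href="#n"><sup>1</sup></a></p></body>'
root = flatten_tags(ET.fromstring(source))
print(ET.tostring(root, encoding="unicode"))
```

A div nested inside a div is allowed where a div inside a p is not, so the nesting problem goes away. The price is that the markup loses its meaning: the former a element is now just an underlined box, not a link.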
|
02-10-2009, 12:11 PM | #51 | ||
Created Sigil, FlightCrew
Posts: 1,982
Karma: 350515
Join Date: Feb 2008
Device: Kobo Clara HD
|
Quote:
Quote:
I agree that you could very well design an algorithm that converts non-valid HTML into valid XHTML for most HTML people will write. It's what your "lxml.html" library does (although I've never used it) and it's what Tidy does as well. But you can't do it for all possible arbitrarily bad HTML. You're assuming the user checked how his source displayed in a browser. If he did, then it's not a matter of parsing arbitrarily bad HTML. It's not a non-deterministic rule system anymore: the source follows the deterministic rendering rules of the browser he used to check his work. Converting from a deterministic language to another deterministic language is certainly possible. And while you could say that the vast majority of HTML authors would do just that (check the display in a browser) before importing, you can't categorically state it. So let's sum this up... you can create an algorithm that can convert most practical non-conforming HTML into valid XHTML, but not all HTML one could write. If one were to say he could, one would be showing a grave ignorance of computer science theory. |
||
02-11-2009, 10:25 AM | #52 | |
Enthusiast
Posts: 29
Karma: 100
Join Date: Dec 2008
Location: France
Device: Sony PRS-505
|
Quote:
If it is only for modifying the fonts, the justification and other text formatting, the editor should accept only plain text files for import and nothing else. Then give tools for text formatting (plus possibly "tables" and "pictures" support). It is a choice: a "poor" editor with certified XHTML/ePub output, or a good editor with no certification (or warnings on bad inputs). |
|
02-11-2009, 01:46 PM | #53 | |
Created Sigil, FlightCrew
Posts: 1,982
Karma: 350515
Join Date: Feb 2008
Device: Kobo Clara HD
|
Quote:
Here are several use cases:
1. The user imports valid HTML. It is easily converted into XHTML. He then makes certain edits, and tries to export the book as an epub file. The epub file is created, the validator runs through it and finds no errors. All is well in ebook land.
2. The user imports invalid HTML. An algorithm tries to correct the input and create valid XHTML, and succeeds. The user then makes certain edits, and tries to export the book as an epub file. The epub file is created, the validator runs through it and finds no errors. All is well in ebook land.
3. The user imports invalid HTML. An algorithm tries to correct the input and create valid XHTML, and does not succeed: errors are thrown, the user is informed. The user opens the source view and tries to fix the problems. The user then makes certain other edits, and tries to export the book as an epub file. The epub file is created, the validator runs through it and finds no errors. All is well in ebook land.
4. The user imports invalid HTML. An algorithm tries to correct the input and create valid XHTML, and does not succeed: errors are thrown, the user is informed. The user opens the source view and tries to fix the problems. The user then makes certain other edits, and tries to export the book as an epub file. The epub file is created, the validator runs through it and finds errors. The user is informed, but the file remains--maybe the user doesn't care (if it's a file for personal use... who knows). If he does care, he makes more changes, and tries to export the file. The change/export process repeats until no errors are thrown.
So you see, the user can get an epub file that is certifiably valid.
Last edited by Valloric; 02-11-2009 at 02:13 PM. Reason: typo |
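The export step shared by these use cases can be sketched as a short loop. This is only an illustration in Python: find_errors is a stand-in validator that checks XML well-formedness and nothing more (a real epub validator checks far more), and export_until_valid and the edit_source callback are hypothetical names for the application's export routine and the user working in the source view.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional, Tuple


def find_errors(xhtml: str) -> List[str]:
    """Stand-in validator: reports XML well-formedness errors only."""
    try:
        ET.fromstring(xhtml)
    except ET.ParseError as err:
        return [str(err)]
    return []


def export_until_valid(
    xhtml: str,
    edit_source: Callable[[str, List[str]], Optional[str]],
    max_rounds: int = 10,
) -> Tuple[str, List[str]]:
    """Validate, and hand any errors back to the user until none remain.

    edit_source(xhtml, errors) models the user fixing things in the source
    view. It returns the edited text, or None if the user decides to keep
    the file as it is (use case 4: "maybe the user doesn't care").
    """
    errors = find_errors(xhtml)
    for _ in range(max_rounds):
        if not errors:
            break  # use cases 1-3: the validator finds nothing
        edited = edit_source(xhtml, errors)
        if edited is None:
            break  # the file is kept, errors and all
        xhtml = edited
        errors = find_errors(xhtml)
    return xhtml, errors


# A "user" who fixes the unclosed paragraph when told about it.
fixed, remaining = export_until_valid("<p>hi", lambda text, errs: text + "</p>")
print(fixed, remaining)
```

The function never refuses to export. It returns the text together with whatever errors remain, so the decision to keep an invalid file for personal use stays with the user.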
|
02-12-2009, 05:59 PM | #54 |
book creator
Posts: 9,656
Karma: 3856660
Join Date: Oct 2008
Location: Luxembourg
Device: Kindle Scribe
|
Now THAT makes sense. Can't wait for that piece of software, honestly!
|
02-13-2009, 01:25 PM | #55 |
Time Enough at Last
Posts: 387
Karma: 1151316
Join Date: Feb 2008
Location: New England
Device: iPad 3, iPhone 5, Kindle 3, Fire, Sony PRS-350
|
Valloric's comment #53 should be used as a touchstone for any decent ePub editor. Great analysis and synopsis!
|
02-13-2009, 01:39 PM | #56 |
Chocolate Grasshopper ...
Posts: 27,599
Karma: 20821184
Join Date: Mar 2008
Location: Scotland
Device: Muse HD , Cybook Gen3 , Pocketbook 302 (Black) , Nexus 10: wife has PW
|
Seemingly complex, though?
|
02-13-2009, 05:36 PM | #57 |
Created Sigil, FlightCrew
Posts: 1,982
Karma: 350515
Join Date: Feb 2008
Device: Kobo Clara HD
|
It's as complex as it needs to be. If you remove something, you negatively impact the quality and usefulness of the editor.
From the programmer's perspective though, it is fairly complex. But the user doesn't care about that, does he? Of course he doesn't, nor should he. |
02-14-2009, 05:48 AM | #58 |
Chocolate Grasshopper ...
Posts: 27,599
Karma: 20821184
Join Date: Mar 2008
Location: Scotland
Device: Muse HD , Cybook Gen3 , Pocketbook 302 (Black) , Nexus 10: wife has PW
|
It was, of course, the complexity of the programmer's task that I was referring to.
|
02-23-2009, 07:01 AM | #59 |
Enthusiast
Posts: 27
Karma: 18
Join Date: Dec 2008
Location: Currently living in Pune India
Device: Sony
|
We just threw an OpenOffice to ePub converter into the fray. It goes by the name of eScape. It does most of the advanced styling and formatting that is on the wish list above: auto-generation of OPF, NCX, etc., and free-form modification of stylesheets to create a book the way you want it to look. You can read about it and try it here. It's completely free for non-commercial use, but not Open Source.
It's a different approach. Rather than try to interpret endless inline and paragraph styles, we define custom Structure-Styles and you have to apply those yourself. There is a growing online tutorial here, so you can see if you can live with this different approach. There are about 30 styles including drop & raised caps, small-caps, and lots of other blocks like epigraph, extract, notebox, code, boxed text, poem, notes, references, etc. All major book sections are predefined. If you want to comment or make suggestions, please do so at our Publishing With XML blog. |
02-23-2009, 03:31 PM | #60 | |
Created Sigil, FlightCrew
Posts: 1,982
Karma: 350515
Join Date: Feb 2008
Device: Kobo Clara HD
|
Quote:
1. How do you convert existing epub books to your format? Is it even possible to load existing epub books and edit them?
2. How do you guarantee display fidelity? Last time I checked, OpenOffice.org did not have an advanced XHTML renderer.
3. SVG? OO.org doesn't support it. Do you?
4. How do you handle the "longdesc" attribute? Do you support it?
5. Object tags?
6. DTBook?
7. XML islands?
8. Font embedding?
These are just off the top of my head. I haven't yet had the time to try out eScape, but I'm going to. |
|