02-03-2014, 01:23 PM | #16 |
A curiosus lector!
Posts: 463
Karma: 2015140
Join Date: Jun 2012
Device: Sony PRS-T1, Kobo Touch
|
skreutzer, I do not know, but I think your project is very ambitious. I have no problem with that, but this could complicate things, at least initially.
By the way, could you summarize point by point, in list form, the main aspects and goals of your project? (dickloraine has asked some good questions about it.) You know, sometimes the trees hide the forest!

I think the idea suggested by roger64 is very good, and if that is the case, why not start with the 'easy' stuff? Let me explain. Writer2LaTeX/Writer2xhtml has not been updated for some time, so why not start with this update if you can do it? The source code is available and can be modified under the GNU LGPL, if I'm not mistaken. I am convinced that hardcore Writer2xhtml users will be happy to suggest to you what should be improved. Another positive aspect of this suggestion is that it allows you to put aside the issue of free and non-free OSes/apps, since LibreOffice and AOO are cross-platform.

Finally, another interesting thing to do would be to create, as you suggested yourself, a kind of wizard that guides users toward a better understanding of the enormous value of semantic encoding. With suitable templates and some macros, for example, the end user could create a project (within LO or AOO) step by step and be shown what to do to properly encode his or her documents. |
02-03-2014, 07:03 PM | #17 | |
Wizard
Posts: 2,304
Karma: 12587727
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
I have been a fan of this idea ever since you posted the first topic.. just have been a little busy with a few large projects.. so I haven't gotten the time to sit and write my usual in-depth tomes.
I stumbled upon this many months ago when looking up some EPUB3 information. A company called Infogrid Pacific works on one of the few EPUB3 reading programs, AZARDI. They also have a program called "Infogrid: Digital Publisher", which claims to do exactly this: http://www.infogridpacific.com/DigitalPublisher.html

You use their program to create their intermediary format, and then it lets you output that one file into a wide range of output formats:

Their blog had some very useful information on EPUB3, and there are some good nuggets of information there on OCR/different formats (I haven't taken a look in about a year though.... I recall most of the posts being self-promotional): http://www.infogridpacific.com/blog/

Maybe you will be able to gather some good ideas from their documentation/manuals/blogs/posts.

Quote:
I actually spent a nice chunk of time in December looking for a way to go backwards from my consistent XHTML -> LaTeX -> PDF. Jellby pointed me towards:
(This is an ongoing project; at the pace I am going, getting distracted with more and more book conversions, this EPUB -> PDF research will probably take me years!) Now, I see a few large problems:
Let me just reiterate, I am an extremely small minority of the users. (I am one of the few here who is paid to convert (most here do it for personal usage or as a hobby)). Non-fiction is much harder/more complex than just handling your simple fictional work (which is probably the vast majority of writers getting books converted). I try to push consistency across all of my books, so that it will make it way easier to swap things around if needed. For example, we had a ton of discussion in this topic about footnotes: https://www.mobileread.com/forums/sho...d.php?t=225045 I treat them the same across all my books, so I can easily just regex them if needed (early on I used to have superscript footnotes, now I have them in the [##] format). |
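The kind of footnote clean-up described above (superscript markers rewritten to the [##] format) can be sketched with a single regular expression. This is only a minimal sketch under assumptions: the markup shape (a number in `<sup>` inside a link), the function name `bracket_footnotes`, and the sample string are all made up for illustration and are not taken from any actual book.

```python
import re

# Assumed (hypothetical) markup: the note number sits in <sup> directly
# inside a link, e.g. <a href="#fn3" id="ref3"><sup>3</sup></a>.
# Superscripts outside a link (such as exponents) are left alone.
SUP_NOTE = re.compile(r"(<a\b[^>]*>)\s*<sup>\s*(\d+)\s*</sup>\s*(</a>)")


def bracket_footnotes(xhtml: str) -> str:
    """Rewrite linked superscript note markers as [N], keeping the link."""
    return SUP_NOTE.sub(r"\1[\2]\3", xhtml)


sample = (
    '<p>Some claim.<a href="#fn3" id="ref3"><sup>3</sup></a> '
    "And x<sup>2</sup>.</p>"
)
converted = bracket_footnotes(sample)
print(converted)
```

A one-line substitution like this only stays safe because the notes are marked up identically in every book; with inconsistent markup, each variant would need its own pattern.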
|
02-03-2014, 09:40 PM | #18 |
Software Developer
Posts: 190
Karma: 89000
Join Date: Jan 2014
Location: Germany
Device: PocketBook Touch Lux 3
|
@dickloraine:
The company made a great effort to implement an automated processing workflow, which isn't different from what everybody in the field does. However, they saved themselves the much greater effort of hand-formatting all their titles, so in digital publishing (especially in volume) there aren't many alternatives to replacing the old publishing process with an automated one. Their 250 styles represent 250 print layouts.
I would absolutely go into the discussion about free software, because there are already lots and lots of proprietary, non-free "solutions" around, which you are neither able/allowed to access nor technically able to use, because they depend on a proprietary environment and proprietary tools. As for self-publishers, it seems there is no solution available at all at the moment which can be used independently.

You definitely should learn more about free software. Proprietary software might at any time prevent you from doing the job. The "open" approach risks becoming proprietary at any time. You aren't forced to switch your operating system. On the contrary, there are so many ports of free software to proprietary operating systems that one gets almost all the benefits of the free world even in a proprietary environment (since free software isn't discriminatory and can freely be used and modified, no matter what the operating system environment might be), while proprietary software is completely unusable on several levels in the free software world due to artificial restrictions.

It is very likely that all software used within the automated processing workflows is also available on proprietary systems and that the workflow implementation itself can easily be ported to them. Still, you should not expect that I myself have any interest in doing so and wasting time on things which are quite contrary to the initial goal of the project, which is developing automated processing workflows with and as free software. If such a workflow is dependent on proprietary software, there's absolutely no way to run it while still respecting the digital freedoms of the user, since the user is forced to use non-freedom-respecting software in order to run the free software.
@Arios: I don't think it is too ambitious, since I've already implemented such an automated processing workflow in the past for one of my own projects (in case you haven't found the link in the Sigil thread): http://www.freie-bibel.de/official/p...lisierung.html (unfortunately, the text itself is in German, but you might still look at the images). I'm now only trying to generalize it while making it as easy as possible to use, which will take some time and effort, but I'm confident that over time some progress can be made.

You are right, I should try harder to communicate the concept behind the kind of solution I have in mind, since the discussion of what the details of such an implementation would be is starting to look repetitive. Usually, I would prefer to just do the programming and demonstrate the results as a real-world application, but on the other hand I started this thread to ask some questions, give some updates, etc., so some kind of project description could be quite useful for people who just want to find out whether they could make use of an automated processing workflow or not.

I'm actually planning to provide support for writer2xhtml standalone. Writer2latex doesn't seem that usable to me at the moment, since it requires ODT (which is XML) as input, so in order to use it as a backend, an application would be required to write ODT or to convert to it. I would consider LaTeX an "irreversible" output format, since it isn't XML and can't directly be used in an XML-based workflow (while LaTeX to XHTML or LaTeX to ODT doesn't seem to be high-priority). At the moment, I'm aiming for ODT to XHTML, and XHTML to EPUB and LaTeX (or XSL-FO), so the latter part can even be used with XHTML input from any application and without requiring ODT, while EPUB and LaTeX output would be available for both ODT and XHTML sources.
Free software isn't a mere technical issue, but to look at the technical aspect: as you know, LibreOffice is written in Java, so a Java VM has to be implemented for a proprietary operating system. Such a VM competes with the proprietary operating systems' own VMs, so those operating system environments might make things artificially difficult for a Java VM implementation. If Oracle decides to discontinue its proprietary VM implementations for the proprietary operating systems, all freely licensed Java software will effectively become unusable on proprietary operating systems, leaving only the free implementation of a Java VM (OpenJDK) intact - and I guess there wouldn't be much interest in porting it to the proprietary operating systems for obvious reasons (lots of work, no gain at all, contrary to the goals of free software). So the LibreOffice ports to proprietary systems are in constant danger of losing their technical foundation, and if LibreOffice can't be supported any longer on proprietary operating systems, just guess what its users will do then: they'll switch to a proprietary word processor. Since Java bytecode is portable, Java programs are automatically cross-platform, but it still depends on licensing whether you are allowed to use, modify, and distribute a program or not.

To demonstrate the benefits of semantic encoding, especially if combined with automated processing, I would provide a set of default layouts with the automated processing workflow, including ODT document templates with predefined styles.
Then I would show how text in the ODT documents could be easily rendered into PDF and EPUB without manual adjustments, as long as the template styles are supported by the backend and the template styles get applied semantically in the ODT document (additionally showing that the visual appearance of the generated files can easily be changed, without manual adjustments to the ODT, by just changing the style implementation in the backend). I observed that some self-publishers use predefined Microsoft Word templates for print formatting, but they have to do e-book formatting separately (and keep both in sync), or they do both by manual direct formatting. I haven't got a better idea for showing the benefits of semantic encoding than demonstrating that these manual efforts aren't necessary if semantic formatting is applied in the first place.

@Tex2002ans: I know a pretty good EPUB3 reading software: it's called "webbrowser" ;-) No, seriously, just display the TOC in a side pane of a browser window, that's it. "Infogrid: Digital Publisher": yes, exactly such an automated processing workflow like they have, but for everyone as free software.

Regarding your other observations: for the input, at some point in time, somebody has to apply formatting to a text in order to prepare it for output generation. At this point, whoever and whenever it is, one has to decide between direct formatting and semantic markup. If the decision is made in favor of semantic markup, it will be beneficial for the person who made the decision, and if not, this person will exclude himself from the benefits of automated processing. It can be the author, it can be an intermediary. For the output, I don't think a "one size fits all" approach would be the best way to handle it; I would rather develop several individual processors for different features and layouts.
Initially, I would aim for basic book features, and maybe there might be more complex document representation as well (probably by integrating more sophisticated software that already exists for such tasks?). However, as you mentioned the specific example of referencing, such things aren't difficult at all for automated processing, and as EPUB->LaTeX->PDF is basically XHTML->LaTeX->PDF, I already do something very similar quite often (however, not with the goal of reproducing the browser's visual appearance of the XHTML in the PDF).

As for your question on an intermediate format: there might be lots of intermediate formats, some probably more complex, some simpler. If somebody wants to implement complex output, he also defines what the input should look like (which markup is supported by the backend), so input files can be transformed to this "publicly specified interface" of the backend in order to produce the output, be it by one intermediate step or several intermediate steps as part of a larger workflow. Just combine some scripts that do the transformations or add data/structure. However, I've started pretty simple with only basic features, and will expand from there as needed.

Consistency across all processed books will automatically be guaranteed, since the processing workflow won't change its internals by itself ;-) If the initial layout definition is done right in the first place, I also think this is an advantage for the ordinary writer, who gets a quality layout without the risk of accidentally inserting errors into the design. Last edited by skreutzer; 02-09-2014 at 10:03 AM. |
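The point made above, that semantically applied styles let the backend decide the appearance, can be illustrated with a small sketch. It is not skreutzer's implementation; the style names, the `BACKENDS` table, and the `render` function are hypothetical and only assume that each paragraph carries one semantic style name.

```python
from html import escape as xhtml_escape


def latex_escape(text: str) -> str:
    """Escape a few LaTeX special characters (backslashes are not handled)."""
    for ch in "&%$#_":
        text = text.replace(ch, "\\" + ch)
    return text


# Hypothetical semantic styles as an author might apply them from a template.
document = [
    ("title", "A Sample Book"),
    ("chapter-heading", "Chapter One"),
    ("body", "It was a dark & stormy night."),
]

# Each backend decides how a style looks; the source document never changes.
BACKENDS = {
    "xhtml": {
        "escape": xhtml_escape,
        "styles": {
            "title": ("<h1>", "</h1>"),
            "chapter-heading": ("<h2>", "</h2>"),
            "body": ("<p>", "</p>"),
        },
    },
    "latex": {
        "escape": latex_escape,
        "styles": {
            "title": ("\\title{", "}"),
            "chapter-heading": ("\\chapter{", "}"),
            "body": ("", ""),
        },
    },
}


def render(doc, backend_name: str) -> str:
    """Render (style, text) pairs with the chosen backend's style table."""
    backend = BACKENDS[backend_name]
    parts = []
    for style, text in doc:
        if style not in backend["styles"]:
            # Unsupported styles are reported instead of silently dropped.
            raise ValueError(f"style {style!r} not supported by {backend_name}")
        prefix, suffix = backend["styles"][style]
        parts.append(prefix + backend["escape"](text) + suffix)
    return "\n\n".join(parts)


xhtml_output = render(document, "xhtml")
latex_output = render(document, "latex")
print(xhtml_output)
print(latex_output)
```

Changing the look of every generated book then means editing one entry in a style table, while a directly formatted source would have to be touched by hand.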
02-03-2014, 11:33 PM | #19 |
Grand Sorcerer
Posts: 12,773
Karma: 75003038
Join Date: Nov 2007
Location: Toronto
Device: Libra H2O, Libra Colour
|
One minor suggestion. I realize English is not your first language but you might want to consider shorter paragraphs in your posts.
While I might be interested in what you are saying, I do find my eyes glazing over at the walls of text I see in your posts. |
02-09-2014, 10:05 AM | #20 |
Software Developer
Posts: 190
Karma: 89000
Join Date: Jan 2014
Location: Germany
Device: PocketBook Touch Lux 3
|
Thanks for your hint! I'm used to reading long texts, so I tend to write long texts. In any case, I'm glad to improve any of my posts; please just notify me (probably by PM, if you don't want to use a discussion thread for it).
|
02-09-2014, 11:01 PM | #21 | |
Wizard
Posts: 2,608
Karma: 3000161
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
|
Quote:
Last edited by roger64; 02-09-2014 at 11:02 PM. Reason: .../... |
|
02-10-2014, 12:13 AM | #22 | ||
Wizard
Posts: 2,304
Karma: 12587727
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
But it does make it look less daunting if you actually use the quote boxes and answer each thing in chunks! Quote:
|
||
02-11-2014, 08:12 AM | #23 |
Fanatic
Posts: 580
Karma: 810184
Join Date: Sep 2010
Location: Norway
Device: prs-t1, tablet, Nook Simple, assorted kindles, iPad
|
What about TEI? Popular, loads of tools, XSLT for EPUB Conversion?
|
02-13-2014, 04:39 PM | #24 |
A curiosus lector!
Posts: 463
Karma: 2015140
Join Date: Jun 2012
Device: Sony PRS-T1, Kobo Touch
|
skreutzer,
Sorry to be so late, and thanks for your reply (post # 18). I don't know you, skreutzer, but I'm sure you can do it. So "ambitious" in my previous post was not the right term. Sorry about that. I'd say now: "too many details". And, by the way, your project is very interesting. So my request is just to ask you to briefly point out the main elements of your project:
Do you see what I ask for? Cheers! |
02-20-2014, 03:41 PM | #25 |
Wizard
Posts: 2,304
Karma: 12587727
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
While looking through some articles, I stumbled upon these two, which I thought might be of interest in the case of automating book workflows. Perhaps you might be able to gather some gems from this discussion/articles:
http://programming.oreilly.com/2013/...uthorship.html
http://www.balisage.net/Proceedings/...einfeld01.html
Glad to see that other great minds think alike, with the "digital first" and then work backwards to print... instead of the dreadful waste that is happening currently by going the other way around. |
02-22-2014, 11:10 AM | #26 | ||
Software Developer
Posts: 190
Karma: 89000
Join Date: Jan 2014
Location: Germany
Device: PocketBook Touch Lux 3
|
Yes, TEI is fine for automated processing workflows, but neither is TEI an input format (I never received TEI from anybody, also word processors don't support it as output format), nor an output format (no TEI readers). But TEI could be used as intermediate format, so I'll certainly look into the TEI processing tools. It might save the time to implement PDF and EPUB generators, and if TEI tools can be integrated into an automated workflow which is easy to set up and to use, I guess it could be quite advantageous.
Quote:
Note: as mentioned, this freely licensed software can also be used (with few or no adjustments) in non-free, proprietary environments (which is part of the digital freedoms a user deserves), while it could be hard or even impossible to integrate non-free, proprietary tools into the free workflow or free environment if they don't at least support some open formats, protocols, etc. Quote:
In the meantime I've started a little Java GUI programming, so there's now a metadata editor for the configuration file of my EPUB generator. I plan to add another GUI helper for the configuration file, so that the entire EPUB conversion can be managed not only by an automated processing workflow, but also by ordinary users. I would like to set up a basic processing workflow, usable both automatically and by GUI, so that real texts can be processed with it.

I found out that OpenOffice/LibreOffice can be configured in a way that semantic markup can be applied quite conveniently. Further, I'm involved in automatically processing output generated by a tool which reads from a Wiki software, so Wikis as online front-ends seem to be pretty convenient, probably semantic, text editors for special kinds of writing activity. I also found a freely licensed automated processing workflow called Booktype, which I will investigate as well as the TEI processing tools. Probably it is easier than expected to provide a freely licensed automated processing workflow, by just combining what's already there into some kind of "single installer", and by making it easy to use for everybody (both ordinary users and professional formatters/typesetters).

And sorry, no new River Valley TV video link, my progress is too slow and the videos are too interesting for me personally, even when not directly related to the topic of automated digital publishing. |
||
03-13-2014, 02:26 PM | #27 |
Software Developer
Posts: 190
Karma: 89000
Join Date: Jan 2014
Location: Germany
Device: PocketBook Touch Lux 3
|
I haven't made any progress in the fields mentioned above lately because I was busy implementing GUI helpers for a first primitive automated processing workflow, see http://vimeo.com/89003773 (just look at the pictures if you don't speak German). This should demonstrate how the automated workflow could be used manually. From there, I might extend it with PDF generation, ODT to XHTML integration, layout definition etc.
|
03-13-2014, 05:15 PM | #28 |
A curiosus lector!
Posts: 463
Karma: 2015140
Join Date: Jun 2012
Device: Sony PRS-T1, Kobo Touch
|
skreutzer,
Thanks for your reply # 26 (I'm so slow sometimes!). Now things are clearer. |
03-13-2014, 07:40 PM | #29 | |
Wizard
Posts: 2,304
Karma: 12587727
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
|
|
04-03-2014, 07:37 PM | #30 |
Software Developer
Posts: 190
Karma: 89000
Join Date: Jan 2014
Location: Germany
Device: PocketBook Touch Lux 3
|
Some pretty basic XHTML to LaTeX to PDF conversion has been added and is demonstrated by a workflow based upon the principle of Single Source Publishing:
https://vimeo.com/90901780
Sorry, only in German :-( |
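For readers who cannot follow the German video, the XHTML to LaTeX step might look roughly like the sketch below. It is not the code shown in the video: the tag-to-command tables and the function names are assumptions for illustration, it only handles a handful of elements, and producing the PDF would still need a separate LaTeX run that is not shown.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical mapping from XHTML elements to LaTeX; a real backend would
# cover far more elements and would also look at class attributes.
BLOCKS = {"h1": "\\chapter{%s}", "h2": "\\section{%s}", "p": "%s"}
INLINES = {"em": "\\emph{%s}", "i": "\\emph{%s}",
           "strong": "\\textbf{%s}", "b": "\\textbf{%s}"}


def latex_escape(text):
    """Escape some LaTeX special characters (backslashes are not handled)."""
    if not text:
        return ""
    for ch in "&%$#_{}":
        text = text.replace(ch, "\\" + ch)
    return text


def inline_to_latex(elem):
    """Convert the text and inline children of one element."""
    result = latex_escape(elem.text)
    for child in elem:
        tag = child.tag.replace(XHTML_NS, "")
        # Unknown inline elements keep their text but lose their markup.
        result += INLINES.get(tag, "%s") % inline_to_latex(child)
        result += latex_escape(child.tail)
    return result


def xhtml_to_latex(source):
    """Convert the known block-level children of <body> to a LaTeX fragment."""
    body = ET.fromstring(source).find(XHTML_NS + "body")
    blocks = []
    for elem in body:
        tag = elem.tag.replace(XHTML_NS, "")
        if tag in BLOCKS:
            blocks.append(BLOCKS[tag] % inline_to_latex(elem))
    return "\n\n".join(blocks)


sample = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Demo</title></head>
<body>
<h1>First Chapter</h1>
<p>Profit rose by 5% and <em>nobody</em> noticed.</p>
</body>
</html>"""

latex_fragment = xhtml_to_latex(sample)
print(latex_fragment)
```

Because the input has to be well-formed XML for the parser to accept it, a step like this only works on clean XHTML, which fits the XML-based workflow discussed in this thread.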
Tags |
automated processing, epub, pdf, xhtml, xml |
|