08-20-2008, 12:01 PM | #1 |
frumious Bandersnatch
Posts: 7,536
Karma: 19000001
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
&nbsp; lost when converting to mobi
Hello,
I'm using html2mobi to create Mobipocket files. In French typography it's customary to put spaces around punctuation (before colons, inside quotation marks, etc.). These spaces should ideally be thin non-breaking spaces, but since Mobipocket apparently does not support that entity, I've decided to use the normal non-breaking space (&nbsp;) instead. However, it seems html2mobi converts some &nbsp; into normal spaces, so I get line breaks in the wrong places when reading the mobi file on an ebook reader. This is an example HTML file:
Code:
<HTML>
<HEAD>
</HEAD>
<BODY>
<DIV HEIGHT="2em"> Je demande pardon aux enfants d’avoir dédié ce livre à une grande personne. J’ai une excuse sérieuse : cette grande personne est le meilleur ami que j’ai au monde. J’ai une autre excuse : cette grande personne peut tout comprendre, même les livres pour enfants. J’ai une troisième excuse : cette grande personne habite la France où elle a faim et froid. Elle a besoin d’être consolée. Si toutes ces excuses ne suffisent pas, je veux bien dédier ce livre à l’enfant qu’a été autrefois cette grande personne. Toutes les grandes personnes ont d’abord été des enfants. (Mais peu d’entre elles s’en souviennent.) Je corrige donc ma dédicace : </DIV>
<P HEIGHT="1em">On disait dans le livre : « Les serpents boas avalent leur proie tout entière, sans la mâcher. Ensuite ils ne peuvent plus bouger et ils dorment pendant les six mois de leur digestion ».</P>
<P>J’ai alors beaucoup réfléchi sur les aventures de la jungle et, à mon tour, j’ai réussi, avec un crayon de couleur, à tracer mon premier dessin. Mon dessin numéro 1. Il était comme ça :</P>
</BODY>
</HTML>
Is this a known problem? Is there a workaround? Is it possible to fix that?
P.S. I'm using mobiperl 0.38 under Linux (perl v5.8.6). |
08-20-2008, 07:00 PM | #2 |
Grand Sorcerer
Posts: 7,452
Karma: 7185064
Join Date: Oct 2007
Location: Linköpng, Sweden
Device: Kindle Voyage, Nexus 5, Kindle PW
|
The example does not look like correct HTML since <p> is missing before the start of the text.
If it is a real bug remind me next week and I will fix it. |
08-21-2008, 09:48 AM | #3 |
frumious Bandersnatch
Posts: 7,536
Karma: 19000001
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
Well, I've reduced the example to this:
Code:
<HTML>
<HEAD>
</HEAD>
<BODY>
<P>J’ai une excuse sérieuse :</P>
<P>J'ai une excuse sérieuse :</P>
<P>J'ai une excuse sérieuse :</P>
</BODY>
</HTML>
Code:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />
</head>
<body>
<p>J’ai une excuse sérieuse :
<p>J'ai une excuse sérieuse :
<p>J'ai une excuse sérieuse :
</body>
</html> |
08-21-2008, 09:54 AM | #4 |
Resident Curmudgeon
Posts: 76,395
Karma: 136466962
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Why not put in the closing quote and see what happens?
Also have you tried Mobipocket Creator to see how it comes out since it does support HTML as one of the input formats? |
08-21-2008, 10:03 AM | #5 |
Resident Curmudgeon
Posts: 76,395
Karma: 136466962
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
I just converted your HTML code to PRC using Mobipocket Creator and then back to HTML using mobi2oeb.
Code:
<html><head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <style type="text/css"> blockquote { margin: 0em 0em 0em 1.25em; text-align: justify; } p { margin: 0em; text-align: justify; } </style> </head><body><p>J’ai une excuse sérieuse :</p> <p>J'ai une excuse sérieuse :</p> <p>J'ai une excuse sérieuse :</p> <br style="page-break-after:always" /></body></html> |
08-21-2008, 11:41 AM | #6 | |
frumious Bandersnatch
Posts: 7,536
Karma: 19000001
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
Good point, though it's an opening quote that I need. I tried with "‘J’ai une excuse ..." and it also converted the "&nbsp;" into a normal space.
|
08-21-2008, 11:51 AM | #7 |
Resident Curmudgeon
Posts: 76,395
Karma: 136466962
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Actually, the lack of </p> is a bug in mobi2html. When I converted the PRC, I used mobi2oeb, which is part of Calibre.
|
08-21-2008, 12:22 PM | #8 |
frumious Bandersnatch
Posts: 7,536
Karma: 19000001
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
After further testing I found a workaround by modifying the html2mobi script, adding:
$tree->no_space_compacting(1);
just before the "$tree->parse_file" in "sub one_html_file". (A similar addition is needed in mobi2html.)
Then I found this:
Fixed HTML::TreeBuilder to not remove &nbsp; where it shouldn't, using patch supplied in RT 17481.
That's dated 2006. The only TreeBuilder.pm on my system seems to be from Perl 5.8.2, dated September 2003, so it is quite possible this "bug" is already fixed on newer systems.
Edit: I found a way to add the </p> as well, using "my $html = $tree->as_HTML(undef,undef,{});" instead of "my $html = $tree->as_HTML;" in mobi2html.
Last edited by Jellby; 08-21-2008 at 12:32 PM. |
08-21-2008, 12:43 PM | #9 |
Resident Curmudgeon
Posts: 76,395
Karma: 136466962
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Well, I'm using Windows and the exe version and it still has the same bug.
|
08-21-2008, 03:00 PM | #10 |
frumious Bandersnatch
Posts: 7,536
Karma: 19000001
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
More data:
In HTML/Entities.pm, &raquo; is defined as "»", while &rsquo; and &rdquo; are chr(8217) and chr(8221); &mdash; is also chr(8212), and indeed it causes the same problem. Maybe the presence of these non-Latin-1 (or whatever) characters in an HTML element causes the whole element to be encoded in Unicode (or whatever), where the equivalent of &nbsp; (defined as "\240") is recognized as a space character. If the "space compacting" is done in Latin-1 encoding, the &nbsp; is not changed, but if it is done in Unicode it is "compacted" into a space. |
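The guess above can be checked with core Perl alone, without HTML::TreeBuilder. This is only a sketch under stated assumptions: the helper name is_space is made up for the illustration, and utf8::upgrade is used to force the internal Unicode form that, by assumption, the element text ends up in once it holds a character such as chr(8217).

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the whole string matches \s, else 0.
sub is_space {
    my ($s) = @_;
    return $s =~ /^\s+$/ ? 1 : 0;
}

# A non-breaking space stored as a plain 8-bit (Latin-1) string.
my $bytes = "\xA0";

# The same character, forced into Perl's internal Unicode form.
my $wide = "\xA0";
utf8::upgrade($wide);

# Concatenating with a character above 255 (here the right single
# quote, chr(8217)) upgrades the whole string, nbsp included.
my $mixed = "\xA0" . chr(8217);
my $nbsp_from_mixed = substr($mixed, 0, 1);

printf "8-bit nbsp is whitespace:      %d\n", is_space($bytes);
printf "upgraded nbsp is whitespace:   %d\n", is_space($wide);
printf "nbsp next to chr(8217):        %d\n", is_space($nbsp_from_mixed);
```

On a Perl run without "use locale" and without the unicode_strings feature, I would expect the 8-bit string not to count as whitespace and the other two to count, which would match the behaviour described in this post.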
08-22-2008, 05:13 AM | #11 |
frumious Bandersnatch
Posts: 7,536
Karma: 19000001
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
Well, the 3.18 version of HTML::TreeBuilder.pm (Sep 2003) has:
$text =~ s/\s+/ /g
(this includes all whitespace), whereas the 3.21 version (Nov 2006) has:
$text =~ s/[\n\r\f\t ]+/ /g
(this includes only newline, return, formfeed, tab and space).
With this second instruction, the non-breaking space (Unicode 0x00A0) is not included in the regular expression and would not be converted into a normal space. I have not tried this yet (will do that this afternoon), but I guess this is the culprit. Probably the Windows .exe version is compiled with an older version of the Perl package. |
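The two substitutions can be compared side by side outside of HTML::TreeBuilder. This is a minimal sketch, not the module's actual code path: the helper names compact_old and compact_new are hypothetical, the sample text is invented, and the string is upgraded by hand on the assumption that the parsed text is held in Unicode form.

```perl
use strict;
use warnings;

# Hypothetical helpers wrapping the two substitutions quoted above.
sub compact_old {
    my ($text) = @_;
    $text =~ s/\s+/ /g;              # 3.18 style: every \s run
    return $text;
}

sub compact_new {
    my ($text) = @_;
    $text =~ s/[\n\r\f\t ]+/ /g;     # 3.21 style: explicit list only
    return $text;
}

# "sérieuse", a non-breaking space (0xA0), a colon, then two newlines.
my $sample = "s\xE9rieuse\xA0:\n\n";
utf8::upgrade($sample);              # assumption: Unicode semantics apply

my $old = compact_old($sample);
my $new = compact_new($sample);

printf "old keeps nbsp: %s\n", ($old =~ /\xA0/ ? "yes" : "no");
printf "new keeps nbsp: %s\n", ($new =~ /\xA0/ ? "yes" : "no");
```

With the upgraded string, the first substitution should lose the non-breaking space and the second should keep it, while both still collapse the trailing newlines into a single space.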
08-22-2008, 08:56 AM | #12 |
Resident Curmudgeon
Posts: 76,395
Karma: 136466962
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
You may want to PM Tompe to let him know of this issue.
|
08-22-2008, 07:04 PM | #13 |
Grand Sorcerer
Posts: 7,452
Karma: 7185064
Join Date: Oct 2007
Location: Linköpng, Sweden
Device: Kindle Voyage, Nexus 5, Kindle PW
|
I tested the example on my Linux machine and got:
Code:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />
</head>
<body>
<p>J’ai une excuse sérieuse :
<p>J'ai une excuse sérieuse :
<p>J'ai une excuse sérieuse :
</body></html>
It is not mobi2html that causes the </p> to disappear. It is the saving of the file in html2mobi that for some reason does not keep the end tag. |
08-23-2008, 12:43 AM | #14 |
Resident Curmudgeon
Posts: 76,395
Karma: 136466962
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Actually, I used Mobipocket Creator for the test. mobi2oeb pulled back the code you see above, and mobi2html pulled back the code like this:
Code:
<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252" /></head><body><p>J’ai une excuse sérieuse :<p>J'ai une excuse sérieuse :<p>J'ai une excuse sérieuse :<br style="page-break-after:always" /></body></html> |
08-23-2008, 04:43 AM | #15 |
frumious Bandersnatch
Posts: 7,536
Karma: 19000001
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
As I said, it's all in the HTML::Tree perl package.
Oldish versions of the package (which seems to be the case for the version included in the Windows .exe distribution of mobiperl) in some cases convert &nbsp; into spaces when condensing whitespace. Newer versions seem to fix this problem by condensing only real spaces, tabs and newlines.
As for the </p> tag, it's an option in the as_HTML procedure (http://search.cpan.org/~sburke/HTML-...UMPING_METHODS). By default </p>, </li>, </dt> and </dd> are omitted; this can be avoided by calling as_HTML(undef,undef,{}) instead of just as_HTML(), and it happens in both html2mobi and mobi2html. |
|