Old 08-20-2008, 12:01 PM   #1
Jellby
frumious Bandersnatch
Posts: 7,536
Karma: 19000001
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
&nbsp; lost when converting to mobi

Hello,

I'm using html2mobi to create mobipocket files. In French typography it's customary to have lots of spaces around punctuation (before colons, inside quotation marks, etc.). These spaces should ideally be thin non-breaking spaces, but since mobipocket apparently does not support the corresponding entity, I've decided to use normal non-breaking spaces (&nbsp;) instead. However, it seems html2mobi converts some &nbsp; into normal spaces, so that I get line breaks in the wrong places when reading the mobi file with an ebook reader.

This is an example HTML file

Code:
<HTML>
<HEAD>
</HEAD>
<BODY>

<DIV HEIGHT="2em">
Je demande pardon aux enfants d’avoir d&eacute;di&eacute; ce livre
&agrave; une grande personne. J’ai une excuse s&eacute;rieuse&nbsp;: cette
grande personne est le meilleur ami que j’ai au monde. J’ai une
autre excuse&nbsp;: cette grande personne peut tout comprendre, m&ecirc;me les
livres pour enfants. J’ai une troisi&egrave;me excuse&nbsp;: cette grande
personne habite la France o&ugrave; elle a faim et froid. Elle a besoin
d’&ecirc;tre consol&eacute;e. Si toutes ces excuses ne suffisent pas, je
veux bien d&eacute;dier ce livre &agrave; l’enfant qu’a
&eacute;t&eacute; autrefois cette grande personne. Toutes les grandes personnes
ont d’abord &eacute;t&eacute; des enfants. (Mais peu d’entre elles
s’en souviennent.) Je corrige donc ma d&eacute;dicace&nbsp;:
</DIV>

<P HEIGHT="1em">On disait dans le livre&nbsp;: &laquo;&nbsp;Les serpents boas
avalent leur proie tout enti&egrave;re, sans la m&acirc;cher. Ensuite ils ne
peuvent plus bouger et ils dorment pendant les six mois de leur
digestion&nbsp;&raquo;.</P>

<P>J’ai alors beaucoup r&eacute;fl&eacute;chi sur les aventures de la
jungle et, &agrave; mon tour, j’ai r&eacute;ussi, avec un crayon de
couleur, &agrave; tracer mon premier dessin. Mon dessin num&eacute;ro 1. Il
&eacute;tait comme &ccedil;a&nbsp;:</P>

</BODY>

</HTML>
If I convert the HTML to mobi and then back to HTML (with mobi2html), only the &nbsp; entities in the middle paragraph are preserved; the others are turned into normal spaces.

Is this a known problem? Is there a workaround? Is it possible to fix that?

P.S. I'm using mobiperl 0.38 under Linux (Perl v5.8.6).
Old 08-20-2008, 07:00 PM   #2
tompe
Grand Sorcerer
Posts: 7,452
Karma: 7185064
Join Date: Oct 2007
Location: Linköpng, Sweden
Device: Kindle Voyage, Nexus 5, Kindle PW
The example does not look like correct HTML since <p> is missing before the start of the text.

If it is a real bug remind me next week and I will fix it.
Old 08-21-2008, 09:48 AM   #3
Jellby
Well, I've reduced the example to this:

Code:
<HTML>
<HEAD>
</HEAD>
<BODY>
<P>J&rsquo;ai une excuse s&eacute;rieuse&nbsp;:</P>
<P>J&apos;ai une excuse s&eacute;rieuse&nbsp;:</P>
<P>J'ai une excuse s&eacute;rieuse&nbsp;:</P>
</BODY>
</HTML>
which, after converting to mobi and back to HTML gives:

Code:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />
</head>
<body>
<p>J&rsquo;ai une excuse s&eacute;rieuse :
<p>J'ai une excuse s&eacute;rieuse&nbsp;:
<p>J'ai une excuse s&eacute;rieuse&nbsp;:
</body>
</html>
The problem seems to be the "&rsquo;", which causes the "&nbsp;" to be converted to a space. The same happens with "&rdquo;", but not with "&raquo;". My guess is something is checking for balanced quote marks. I'd say this is a bug, anyway.
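
The loss can be checked mechanically by counting non-breaking spaces before and after the round trip. The following is only a sketch in Python (not the Perl that mobiperl is written in); the helper name `count_nbsp` is made up for illustration, and the two strings are the paragraphs from the reduced example and from the converted output above.

```python
import html

def count_nbsp(markup: str) -> int:
    """Count non-breaking spaces (U+00A0) after decoding HTML entities."""
    return html.unescape(markup).count("\u00a0")

# Paragraphs of the reduced test file.
source = (
    "<P>J&rsquo;ai une excuse s&eacute;rieuse&nbsp;:</P>\n"
    "<P>J&apos;ai une excuse s&eacute;rieuse&nbsp;:</P>\n"
    "<P>J'ai une excuse s&eacute;rieuse&nbsp;:</P>\n"
)

# Paragraphs after html2mobi followed by mobi2html.
round_trip = (
    "<p>J&rsquo;ai une excuse s&eacute;rieuse :\n"
    "<p>J'ai une excuse s&eacute;rieuse&nbsp;:\n"
    "<p>J'ai une excuse s&eacute;rieuse&nbsp;:\n"
)

lost = count_nbsp(source) - count_nbsp(round_trip)
print(f"non-breaking spaces lost in the round trip: {lost}")
```

With these two strings the count drops from three to two, i.e. exactly the one &nbsp; in the paragraph containing "&rsquo;".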
Old 08-21-2008, 09:54 AM   #4
JSWolf
Resident Curmudgeon
Posts: 76,395
Karma: 136466962
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Why not put in the closing quote and see what happens?

Also have you tried Mobipocket Creator to see how it comes out since it does support HTML as one of the input formats?
Old 08-21-2008, 10:03 AM   #5
JSWolf
I just converted your HTML code to PRC using Mobipocket Creator and then back to HTML using mobi2oeb.

Code:
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<style type="text/css">
blockquote { margin: 0em 0em 0em 1.25em; text-align: justify; }
p { margin: 0em; text-align: justify; }
</style>
</head><body><p>J&rsquo;ai une excuse s&eacute;rieuse&nbsp;:</p>
<p>J&apos;ai une excuse s&eacute;rieuse&nbsp;:</p>
<p>J'ai une excuse s&eacute;rieuse&nbsp;:</p> <br style="page-break-after:always" /></body></html>
That is the resulting code I get back from the PRC.
Old 08-21-2008, 11:41 AM   #6
Jellby
Quote:
Originally Posted by JSWolf View Post
Why not put in the closing quote and see what happens?
Good point, though what I need is an opening quote. I tried with "&lsquo;J&rsquo;ai une excuse ..." and it also converted the "&nbsp;" into a normal space.

Quote:
Also have you tried Mobipocket Creator to see how it comes out since it does support HTML as one of the input formats?

I just converted your HTML code to PRC using Mobipocket Creator and then back to HTML using mobi2oeb.
Good, thanks for trying. It shows MC does not have the same problem. It also shows that the closing </p> tags are retained, which html2mobi/mobi2html doesn't do.
Old 08-21-2008, 11:51 AM   #7
JSWolf
Actually, the lack of </p> is a bug in mobi2html. When I converted the PRC, I used mobi2oeb, which is part of Calibre.
Old 08-21-2008, 12:22 PM   #8
Jellby
After further testing I could find a workaround by modifying the html2mobi script, adding:

$tree->no_space_compacting(1);

just before the "$tree->parse_file" in "sub one_html_file". (A similar addition is needed in mobi2html.) Then I found this:

Fixed HTML::TreeBuilder to not remove &nbsp; where it shouldn't, using patch supplied in RT 17481.

That entry is dated 2006. The only TreeBuilder.pm on my system seems to be from Perl 5.8.2, dated September 2003, so it is quite possible this "bug" is already fixed on newer systems.

Edit: I found a way to add the </p> as well, using "my $html = $tree->as_HTML(undef,undef,{});" instead of "my $html = $tree->as_HTML;" in mobi2html.

Last edited by Jellby; 08-21-2008 at 12:32 PM.
Old 08-21-2008, 12:43 PM   #9
JSWolf
Well, I'm using Windows and the exe version and it still has the same bug.
Old 08-21-2008, 03:00 PM   #10
Jellby
More data:

In HTML/Entities.pm, &raquo; is defined as "»", while &rsquo; and &rdquo; are defined as chr(8217) and chr(8221). &mdash; is likewise defined as chr(8212), and indeed it causes the same problem.

Maybe the presence of these non-Latin-1 (or whatever) characters in an HTML element causes the whole element to be encoded in Unicode (or whatever), where the equivalent of &nbsp; (defined as "\240") is recognized as a whitespace character. If the "space compacting" is done in Latin-1 encoding, the &nbsp; is not changed, but if it is done in Unicode it is "compacted" into a space.
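
The guess above can be illustrated with a loose analogy in Python (not Perl, so it shows the general mechanism rather than what HTML::TreeBuilder actually does). The assumption is that a paragraph without wide characters is handled as Latin-1 bytes, while one containing &rsquo; is handled as Unicode text; Python's `\s` treats byte 0xA0 as non-whitespace in a bytes pattern, but treats U+00A0 as whitespace in a text pattern.

```python
import re

# Latin-1 bytes: a stand-in for a paragraph with a plain apostrophe.
as_bytes = "J'ai une excuse s\u00e9rieuse\u00a0:".encode("latin-1")

# Unicode text: the same paragraph with a curly apostrophe (U+2019, &rsquo;).
as_text = "J\u2019ai une excuse s\u00e9rieuse\u00a0:"

# The same "compact all whitespace" substitution, applied in both modes.
compacted_bytes = re.sub(rb"\s+", b" ", as_bytes)
compacted_text = re.sub(r"\s+", " ", as_text)

# Byte mode: 0xA0 is not whitespace, so the non-breaking space survives.
print(b"\xa0" in compacted_bytes)    # True
# Text mode: U+00A0 is whitespace, so it is replaced by a normal space.
print("\u00a0" in compacted_text)    # False
```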
Old 08-22-2008, 05:13 AM   #11
Jellby
Well, the 3.18 version of HTML::TreeBuilder.pm (Sep 2003) has:

$text =~ s/\s+/ /g
(this includes all whitespace)

where the 3.21 version (Nov 2006) has:

$text =~ s/[\n\r\f\t ]+/ /g
(this includes only newline, return, form feed, tab and space)

With this second expression, the non-breaking space (Unicode 0x00A0) is not matched by the regular expression and would not be converted into a normal space. I have not tried this yet (will do that this afternoon), but I guess this is the culprit. Probably the Windows .exe version is built with an older version of the Perl package.
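
The difference between the two substitutions can be tried out side by side. This is a sketch in Python rather than Perl, so it is only an analogy: the two patterns are copied from the 3.18 and 3.21 lines quoted above, while the names `OLD_PATTERN`, `NEW_PATTERN` and `compact`, and the sample string, are made up for illustration.

```python
import re

OLD_PATTERN = re.compile(r"\s+")             # as in 3.18: any whitespace
NEW_PATTERN = re.compile(r"[\n\r\f\t ]+")    # as in 3.21: explicit list only

def compact(text: str, pattern: re.Pattern) -> str:
    """Collapse every run matched by `pattern` into a single normal space."""
    return pattern.sub(" ", text)

# Double space, tab and newline should be compacted; U+00A0 should survive.
sample = "J\u2019ai  une\texcuse\ns\u00e9rieuse\u00a0:"

old_result = compact(sample, OLD_PATTERN)
new_result = compact(sample, NEW_PATTERN)

print("\u00a0" in old_result)   # False: the non-breaking space was replaced
print("\u00a0" in new_result)   # True: the non-breaking space was kept
```

Both patterns still compact the ordinary whitespace; only the handling of the non-breaking space differs.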
Old 08-22-2008, 08:56 AM   #12
JSWolf
You may want to PM Tompe to let him know of this issue.
Old 08-22-2008, 07:04 PM   #13
tompe
I tested the example on my Linux machine and got:

Code:
<html>
<head>                                                                        
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />
</head>
<body>
<p>J&rsquo;ai une excuse s&eacute;rieuse&nbsp;:
<p>J'ai une excuse s&eacute;rieuse&nbsp;:
<p>J'ai une excuse s&eacute;rieuse&nbsp;:
</body></html>
Which I assume is correct, or is it not?

It is not mobi2html that causes the </p> to disappear. It is the saving of the file in html2mobi that for some reason does not keep the end tag.
Old 08-23-2008, 12:43 AM   #14
JSWolf
Actually, I used Mobipocket Creator for the test. mobi2oeb pulled back the code as you see above, and mobi2html pulled back the code like this:

Code:
<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252" /></head><body><p>J&rsquo;ai une excuse s&eacute;rieuse&nbsp;:<p>J'ai une excuse s&eacute;rieuse&nbsp;:<p>J'ai une excuse s&eacute;rieuse&nbsp;:<br style="page-break-after:always" /></body></html>
I am using the Windows exe version of 0.38.
Old 08-23-2008, 04:43 AM   #15
Jellby
As I said, it's all in the HTML::Tree perl package.

Older versions of the package (which seems to include the version bundled in the Windows .exe distribution of mobiperl) in some cases convert &nbsp; into spaces when condensing whitespace. Newer versions seem to fix this problem by condensing only real spaces, tabs and newlines.

As for the </p> tag, it's an option in the as_HTML procedure (http://search.cpan.org/~sburke/HTML-...UMPING_METHODS). By default </p>, </li>, </dt> and </dd> are omitted; this can be avoided by calling as_HTML(undef,undef,{}) instead of just as_HTML(). It happens both in html2mobi and mobi2html.
All times are GMT -4.