html2mobi (a mobigen replacement written in Perl) - Page 2

jbenny · 11-26-2007, 05:49 AM

Something that may be of help to you is the source to a program called "pdbshred". It can extract the HTML and image files from a Mobipocket and Peanut ebook. You can find the program and source (in C) to the program by Googling for "pdbshred source". I would post a direct link, but because of some additional functionality in the program, some here may not like a direct link.

A similar program is called "makedoc", but it doesn't extract the images.

cstross · 11-26-2007, 07:45 AM

Looks extremely cute ...

Are you planning on packaging this and sticking it on CPAN when it's stable?

schmidt349 · 11-26-2007, 08:48 AM

Well done Tompe! Looks like you beat me to the punch.

Would you mind if I bolted an XSL backend onto your code, effectively making it "xml2html2mobi?" It's a mouthful, but would be quite useful.

God I love Perl.

tompe · 11-26-2007, 09:25 AM

Quote:

Originally Posted by cstross

Looks extremely cute ...

Are you planning on packaging this and sticking it on CPAN when it's stable?

Might be a good idea. I have never done it before but why not. If I can find out how to put a script on CPAN.

I will release the html2mobi script here in a day or two so I can get some feedback. I have to fix one serious bug and write some documentation.

tompe · 11-26-2007, 09:32 AM

Quote:

Originally Posted by schmidt349

Would you mind if I bolted an XSL backend onto your code, effectively making it "xml2html2mobi?" It's a mouthful, but would be quite useful.

God I love Perl.

Yes, it is a pleasure to program Perl :-)

I should probably write some packages to make it easier to do an xml2html2mobi. I wanted to have just one file to make it easier to use, but maybe I should just split it up and submit it to CPAN. That can be the next step after it works and is tested more.

I used XML::Parser::Lite::Tree to parse the opf file but I am not sure this was a good idea. Do you know of any better library for opf files or for XML? I really liked HTML::Element and HTML::TreeBuilder so something similar for XML would be nice. Or a specific opf file library.

tompe · 11-26-2007, 01:42 PM

I have a problem. My converter generates a mobi file that is not entirely correct. It works perfectly in FBReader. On my Gen3 it works, but the number of pages is 650 when it should be around 25; there are a lot of empty pages at the end. My Palm T5 refuses to load the file and says corrupt database 0x0209 (2).

What I wonder is whether this is a problem with the Palmdoc things or with the HTML that I packed in the Palmdoc format.

I may have forgotten to set some parameter in the Palm::PDB package, but I tried loading a working mobi file and then replacing the text, and that did not work either.

Ideas?

tompe · 11-26-2007, 04:01 PM

I realised that I had not written any Mobipocket header in record 0 at all, and I was fooled by it working so well with FBReader. Is there a specification anywhere of the data that should be in record 0? I have googled for it but cannot find it.

igorsk · 11-26-2007, 05:07 PM

No spec. A few fields are documented in pdbshred but they're probably not what you need. I'm working on a more or less complete doc but here's what you should be able to get away with:

Quote:

0 DWord dwSignature //'MOBI'
4 DWord dwSize //including first two fields (put 0x18 here)
8 DWord dwType //pub type: 2=book, 3=palmdoc, 4=audio, 257=news, 258=feed, 259=magazine etc
C DWord dwCodepage //1252=western, 65001 = UTF8. Better not use anything else
10 DWord dwUniqueId //? filled from rand() calls
14 DWord dwFileFormatVer //seems to correspond to Mobipocket reader ver. put 3 here

This is in addition to the palmdoc header, naturally.
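The six fields above can be packed as fixed-width integers. Here is a minimal sketch in Python (the thread's script is Perl, where `pack` would play the same role). The function name `build_minimal_mobi_header` and its defaults are illustrative, and big-endian byte order is an assumption based on how Palm database files normally store integers.

```python
import random
import struct

def build_minimal_mobi_header(pub_type=2, codepage=1252, format_version=3):
    """Pack the six listed fields into a 0x18-byte block (hypothetical helper)."""
    unique_id = random.getrandbits(32)  # the post says this is filled from rand()
    return struct.pack(
        ">4sIIIII",
        b"MOBI",         # 0x00 dwSignature
        0x18,            # 0x04 dwSize, counting the signature and this field
        pub_type,        # 0x08 dwType: 2 = book
        codepage,        # 0x0C dwCodepage: 1252 or 65001
        unique_id,       # 0x10 dwUniqueId
        format_version,  # 0x14 dwFileFormatVer
    )

header = build_minimal_mobi_header()
print(len(header), header[:4])
```

Later in the thread the required length for format version 3 is revised to 0x74, so treat this as the layout of the first six fields rather than a complete header.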

tompe · 11-26-2007, 07:09 PM

Quote:

Originally Posted by igorsk

No spec. A few fields are documented in pdbshred but they're probably not what you need. I'm working on a more or less complete doc but here's what you should be able to get away with:

This is in addition to the palmdoc header, naturally.

Thanks. I have now managed to write record 0 so now I can add the MOBI header also.

When I unpacked a mobi file I saw three records after the last image, with sizes 36, 52 and 4. What are these? One contained the string FLIS and one the string FCIS. Maybe the end of the document is not detected because I have not written these records.
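To inspect trailing records like these, it helps to dump every record's size and first four bytes. This is a minimal sketch in Python, assuming the standard Palm database layout: a 78-byte file header with the record count as a big-endian 16-bit value at offset 76, followed by 8-byte record entries whose first four bytes are the record's file offset. `list_records` is an illustrative name, not part of html2mobi or Palm::PDB.

```python
import struct

def list_records(data):
    """Return (index, size, first four bytes) for each record in a PDB image."""
    (count,) = struct.unpack_from(">H", data, 76)
    offsets = [struct.unpack_from(">I", data, 78 + 8 * i)[0] for i in range(count)]
    offsets.append(len(data))  # the last record runs to the end of the file
    return [
        (i, offsets[i + 1] - offsets[i], data[offsets[i]:offsets[i] + 4])
        for i in range(count)
    ]

# Tiny synthetic database with two made-up records, built only to exercise
# the parser; the contents and sizes are not taken from a real book.
records = [b"FLIS" + bytes(32), b"\xe9\x8e\x0d\x0a"]
pos = 78 + 8 * len(records) + 2  # file header, record list, 2 padding bytes
pdb = bytearray(78)
struct.pack_into(">H", pdb, 76, len(records))
for i, rec in enumerate(records):
    pdb += struct.pack(">IB3s", pos, 0, i.to_bytes(3, "big"))
    pos += len(rec)
pdb += b"\x00\x00" + b"".join(records)

for index, size, magic in list_records(bytes(pdb)):
    print(index, size, magic)
```

Run against a real file (read with `open(path, "rb").read()`), the same function would show where the text records end, where the images start, and what follows them.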

tompe · 11-26-2007, 07:14 PM

How long must the MOBI header be?

At position 0xF4 I see the string EXTH, followed by some strings that indicate that the author and title are stored there. Does this belong to the header?

igorsk · 11-26-2007, 08:24 PM

I believe FCIS and FLIS have something to do with dictionary indices. Do you set the unpacked size and number of records in Palmdoc header correctly?

tompe · 11-26-2007, 08:36 PM

Quote:

Originally Posted by igorsk

I believe FCIS and FLIS have something to do with dictionary indices. Do you set the unpacked size and number of records in Palmdoc header correctly?

The last record, which was 4 bytes, contains E9 8E 0D 0A. I wonder if this is important...

The number of records is correct: I tried including the image records in that number, but then FBReader started to display garbage after the end of the text. I will double-check the unpacked size. I have not set the pointer to the first image either.

Now I have the strange phenomenon that the images are correct in FBReader but seem to be shifted on my Gen3. The "library" image seems to work: I just put it in the last record and it was displayed correctly on the Gen3. The change I made was to set the record "id" to an increasing number for the text content instead of using 0.

Well, it moves forward. Hopefully I will fix the problem with the size and the image order soon so I have a first alpha version of the scripts.

igorsk · 11-26-2007, 08:53 PM

The "number of records" in palmdoc header (Word at 0x8) needs to be set to the number of records containing only text (no pictures). E.g. if you have compressed text in records 1,2 and 3, then set it to 3. The uncompressed size (dword at 4) has to be the full uncompressed size of all text.
By the way, I was wrong. Mobi format 3 needs MOBI header to be 0x74 bytes long, not 0x18. The fields are mostly irrelevant except for the number of the first record with images I mentioned above (at 0x5C).
There are also DATP records that contain a mapping from uncompressed offsets to record numbers, but I haven't figured out their format yet and am not sure if they're mandatory...
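Putting these corrections together, record 0 might be assembled as below. This is a sketch in Python under several assumptions: the PalmDOC header is the usual 16 bytes, the MOBI offsets (including 0x5C) are counted from the `MOBI` signature as in the earlier field list, all integers are big-endian, and every MOBI field not mentioned in the thread is left as zero. Whether a reader accepts zeros in those fields is untested. `build_record0` and its parameters are illustrative names.

```python
import struct

PALMDOC_LEN = 0x10  # assumed size of the PalmDOC header
MOBI_LEN = 0x74     # MOBI header length for format version 3, per the post

def build_record0(text_length, text_record_count, first_image_record,
                  compression=2, codepage=1252, unique_id=0x12345678):
    """Assemble a PalmDOC header followed by a 0x74-byte MOBI header (sketch)."""
    palmdoc = struct.pack(
        ">HHIHHI",
        compression,        # 0x00: 1 = none, 2 = PalmDOC compression
        0,                  # 0x02: unused
        text_length,        # 0x04: uncompressed size of all text
        text_record_count,  # 0x08: records holding text only, no images
        4096,               # 0x0A: maximum size of one text record
        0,                  # 0x0C: remaining bytes left as zero here
    )
    mobi = bytearray(MOBI_LEN)
    struct.pack_into(">4sIIIII", mobi, 0,
                     b"MOBI", MOBI_LEN, 2, codepage, unique_id, 3)
    struct.pack_into(">I", mobi, 0x5C, first_image_record)
    return palmdoc + bytes(mobi)

# Text in records 1, 2 and 3, so the first image would be record 4.
record0 = build_record0(text_length=10000, text_record_count=3,
                        first_image_record=4)
print(len(record0))
```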

tompe · 11-26-2007, 09:23 PM

Got it nearly working on my Gen3 when I extended the MOBI header. The only problem now is that the title says "libc-2.3.6" and the header information is wrong...

Strangely enough the library image works without me including it. Maybe it takes the first record with an image and uses this.

# 4 DWord dwSize //including first two fields (put 0x18 here)

If I put 0x18 here it does not work. If I put 0xE4 here as in my example document then it works but the title did not work. So what does this number mean?

tompe · 11-26-2007, 09:57 PM

Quote:

Originally Posted by tompe

# 4 DWord dwSize //including first two fields (put 0x18 here)

If I put 0x18 here it does not work. If I put 0xE4 here as in my example document then it works but the title did not work. So what does this number mean?

I just realized what this field is. It is a pointer to the block that starts with EXTH. The first number after that seems to be the size of this block. But I have not managed to see how it is coded.

Maybe I should try to set this pointer to 0 and see if that means that this block does not exist.
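The numbers in this thread are consistent with that reading: a 16-byte PalmDOC header plus a header length of 0xE4 gives 0x10 + 0xE4 = 0xF4, exactly where the EXTH string was seen. Here is a small Python sketch of that lookup, assuming big-endian integers and that the 32-bit value right after `EXTH` is the block's length in bytes (a guess taken from the post above, not something the thread confirms). `find_exth` is an illustrative name.

```python
import struct

PALMDOC_LEN = 0x10  # assumed size of the PalmDOC header at the start of record 0

def find_exth(record0):
    """Return (offset, length) of the EXTH block, or None if it is absent."""
    signature, mobi_len = struct.unpack_from(">4sI", record0, PALMDOC_LEN)
    if signature != b"MOBI":
        raise ValueError("record 0 has no MOBI header")
    offset = PALMDOC_LEN + mobi_len  # EXTH would start right after the header
    if record0[offset:offset + 4] != b"EXTH":
        return None
    (length,) = struct.unpack_from(">I", record0, offset + 4)
    return offset, length

# Synthetic record 0: zeroed headers with only the fields this sketch reads.
record0 = bytearray(PALMDOC_LEN + 0xE4)
struct.pack_into(">4sI", record0, PALMDOC_LEN, b"MOBI", 0xE4)
record0 += b"EXTH" + struct.pack(">I", 12) + bytes(4)

print(find_exth(bytes(record0)))
```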



