KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files - Page 13

siebert · 09-14-2011, 07:54 AM

Quote:

Originally Posted by pdurrant

An interesting idea. I haven't really explored the dev hub.

It seems to support only subversion (*shudder*)...

As I've said before, I pushed my git repository to github and any fellow developer should feel free to create forks for their own development which can be merged if a feature is ready: https://github.com/siebert/mobiunpack

Ciao,
Steffen

KevinH · 09-14-2011, 09:27 AM

Hi fandrieu,

Great work!

I will take a shot at combining your latest version with a version that uses Siebert's readTag routine to parse the TAGX which can be found in the indx0 section to find the field bitmaps for each tag and parse them. That way we can forget about all of the if type == 0x1f lines and just use the correct bitmaps to decipher which fields are present and then read them.
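A rough sketch of that idea in Python 3, not Siebert's actual readTag routine. It assumes the reverse-engineered TAGX layout: the marker `TAGX`, a 4-byte block length, a 4-byte control byte count, then 4-byte rows of (tag, values per entry, bitmask, end flag). The names `read_tagx` and `tags_present` and the sample bytes are made up for illustration.

```python
import struct

def read_tagx(data, offset=0):
    """Parse a TAGX block into (control_byte_count, rows)."""
    if data[offset:offset + 4] != b"TAGX":
        raise ValueError("no TAGX marker at offset %d" % offset)
    # Assumed layout: block length, then number of control bytes per entry.
    length, control_byte_count = struct.unpack_from(">LL", data, offset + 4)
    rows = []
    for pos in range(offset + 12, offset + length, 4):
        tag, values_per_entry, mask, end_flag = struct.unpack_from(">BBBB", data, pos)
        rows.append((tag, values_per_entry, mask, end_flag))
    return control_byte_count, rows

def tags_present(control_byte, rows):
    """Return the tag ids whose bitmask is set in one control byte."""
    return [tag for tag, _n, mask, end_flag in rows
            if not end_flag and control_byte & mask]

# Hand-built sample: three tags plus an end-of-control-byte row.
sample = b"TAGX" + struct.pack(">LL", 28, 1) + bytes([
    1, 1, 0x01, 0,
    2, 1, 0x02, 0,
    3, 1, 0x04, 0,
    0, 0, 0x00, 1,
])
count, rows = read_tagx(sample)
print(tags_present(0x05, rows))  # tags 1 and 3 are flagged in this control byte
```

With a table like this, the per-type `if` chains reduce to one loop over the rows.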

Thanks!

KevinH

KevinH · 09-14-2011, 09:38 AM

Hi DaleDe,

Quote:

Originally Posted by DaleDe

This is great interaction and development. I wonder if the dev hub available here would be better for the purpose.

Interesting idea. Paul and I hosted a code.google.com site for mobiunpack.py but we received almost no contributions or input over the years. Siebert was the first new developer to come on and he found the source on this site (not our code.google.com dev site) and after his extensive changes he added his own git site.

Based on similar experiences from other small (couple of files only) dev projects, it appears to me that using development specific hosting with its own hurdle of concurrent versioning tool (git vs svn vs mercurial vs cvs vs rcs, etc.) and the lack of visits by users who might have an "itch to scratch" simply lowers contributions.

I think the same thing happens with users of both Sigil and Calibre. They are constantly pointed to other official sites but most of the impetus for change is done or initiated via MR.

So unless we are disrupting things with our posts, I would prefer to keep things here just to maximize our exposure to new users (and hopefully potential developers) who might want to contribute a new feature or quick fix.

My 2 cents ...

KevinH

fandrieu · 09-14-2011, 10:07 AM

KevinH, sorry to flood the thread with zips, but here is a new version.

I tried the NCX code against all the mobis I could lay my hands on...

The only "real" error I got was with really fat ebooks (technical books with more than a thousand entries): the INDX1 is split across more than one section!

I first added a few checks to prevent exceptions, but more importantly found out that the actual number of "data" INDX sections is stored in the INDX0.

So I modified the code to take this into account and parse multiple INDXx.
In the zip file you'll find a file for this test case, a dummy book with 4000 entries on 5 levels (that's a 600kb ncx...)
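For readers following along, here is a minimal Python 3 sketch of reading that count and walking the data sections. It is not the code from the zip. The field names are borrowed from the debug output posted later in this thread, and the reading of `count` in INDX0 as "number of data INDX sections" is the assumption described above; `parse_indx_header`, `data_sections` and `fake_section` are hypothetical names.

```python
import struct

# Field names as they appear in the thread's debug output; meanings are
# reverse-engineered guesses, not documented facts.
INDX_FIELDS = ("len", "nul1", "type", "gen", "start", "count",
               "code", "lng", "total", "ordt", "ligt")

def parse_indx_header(section):
    """Read the big-endian 32-bit header fields that follow the INDX marker."""
    if section[:4] != b"INDX":
        raise ValueError("not an INDX section")
    values = struct.unpack_from(">%dL" % len(INDX_FIELDS), section, 4)
    return dict(zip(INDX_FIELDS, values))

def data_sections(sections, indx0_index):
    """Yield every data INDX section that follows INDX0."""
    header = parse_indx_header(sections[indx0_index])
    # Assumption: in INDX0, 'count' is the number of data sections.
    for i in range(1, header["count"] + 1):
        yield sections[indx0_index + i]

def fake_section(count):
    """Build a header-only INDX section for the demo below."""
    values = [192, 0, 0, 0, 0, count, 0xFFFFFFFF, 0xFFFFFFFF, 0, 0, 0]
    return b"INDX" + struct.pack(">11L", *values)

# INDX0 announcing two data sections, followed by those two sections.
sections = [fake_section(2), fake_section(54), fake_section(30)]
found = list(data_sections(sections, 0))
print(len(found), [parse_indx_header(s)["count"] for s in found])
```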

While I was at it, as suggested by siebert, I used his tagx code to parse the rest of INDX0, but the script still doesn't do anything with the data.

Please use this version instead of the previous one if you plan on integrating the changes.

Thanks, fand.

PS: i also included the (simplistic) script I used to test the code on all my books, if anyone is interested...

DiapDealer · 09-14-2011, 12:39 PM

Hi, the above (mobiunpack_testncx2.zip) test script isn't recognizing the ncx in most of my mobi's. The multi-level stuff seems to be off by one. Any of my mobi's that have a strictly flat ncx (one level), the script mistakenly reports as having "No ncx." And with a mobi that has a two-level ncx, the script builds a one-level (flat ncx file)... ignoring the parent level if an entry has a parent.

I may be wrong, but I seem to remember something about calibre flattening the ncx regardless. I'm not sure the Kindle properly handles a multi-level ncx file. Something about only the parent levels (and not the children) showing on the progress bar as "jump points" (which is the only useful function the ncx provides on a Kindle). I could be completely mistaken about all that, though... I'll have to do some testing.

fandrieu · 09-14-2011, 01:28 PM

Quote:

Originally Posted by DiapDealer

Hi, the above (mobiunpack_testncx2.zip) test script isn't recognizing the ncx in most of my mobi's. The multi-level stuff seems to be off by one. Any of my mobi's that have a strictly flat ncx (one level), the script mistakenly reports as having "No ncx." And with a mobi that has a two-level ncx, the script builds a one-level (flat ncx file)... ignoring the parent level if an entry has a parent.

I wouldn't be surprised if it's off by one; quite the contrary, I don't expect the code to be correct at this stage.

But for now I couldn't find a book to reproduce the problem, that's pretty weird, i'll look into it further...

Quote:

Originally Posted by DiapDealer

I may be wrong, but I seem to remember something about calibre flattening the ncx regardless. I'm not sure the Kindle properly handles a multi-level ncx file. Something about only the parent levels (and not the children) showing on the progress bar as "jump points" (which is the only useful function the ncx provides on a Kindle). I could be completely mistaken about all that, though... I'll have to do some testing.

As far as I know that's all correct, and it makes multi-level NCX pretty useless for now.
But anyway kindlegen does produce this kind of file and my goal with this code was to extract as much from the mobi as possible, so that you can re-compile the files from mobiunpack into an as-identical-as-possible new mobi...

KevinH · 09-14-2011, 03:22 PM

Hi All,

Okay, I took fandrieu's latest, and modified it to pass the tagx info to the readINDX1 routine and fixed an off by one in the code that sorts the NCX.

I think this should now be close.

PS: Actually I still think sortINDX has an off-by-one issue and my change may not be the correct one! My change fixed my problem but will probably fail for some other case. Recursion is so fun!

Either way it needs to be worked on and fixed. We should also re-factor things into classes and maybe even separate it into files that encapsulate the various functions in some smarter way.
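As an illustration only, here is one non-recursive way to nest a flat entry list, in Python 3. It is not the sortINDX code from the attached scripts. It assumes each entry carries a `parent` index with -1 meaning "top level"; the name `nest_entries` and the dict keys are hypothetical. Note that index 0 is a valid parent in this scheme, which is the sort of boundary where an off-by-one can creep in.

```python
def nest_entries(entries):
    """Turn a flat list of entries into a tree.

    entries: list of dicts with 'label' and 'parent', where 'parent' is
    the list index of the parent entry, or -1 for a top-level entry.
    """
    nodes = [{"label": e["label"], "children": []} for e in entries]
    roots = []
    for node, entry in zip(nodes, entries):
        parent = entry["parent"]
        if parent < 0:          # -1: no parent, so this is a top-level entry
            roots.append(node)
        else:                   # 0 is a valid parent index, so test < 0, not <= 0
            nodes[parent]["children"].append(node)
    return roots

flat = [
    {"label": "Part I", "parent": -1},
    {"label": "Chapter 1", "parent": 0},
    {"label": "Chapter 2", "parent": 0},
    {"label": "Part II", "parent": -1},
]
tree = nest_entries(flat)
print([(n["label"], len(n["children"])) for n in tree])
```

A strictly flat NCX (every parent equal to -1) still yields one root per entry, so it should never be reported as missing.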

DiapDealer · 09-14-2011, 05:08 PM

I'm getting good results with these latest scripts. I'm still trying to find something in one of my books that breaks it, but I'm not having much luck.

Quote:

Originally Posted by KevinH

Either way it needs to be worked on and fixed. We should also re-factor things into classes and maybe even separate it into files that encapsulate the various functions in some smarter way.

I'm all for class-ifying, but if given a vote, I would rather that mobiunpack remain one self-contained script.

fandrieu · 09-14-2011, 05:28 PM

Quote:

Originally Posted by DiapDealer

I'm getting good results with these latest scripts. I'm still trying to find something in one of my books that breaks it, but I'm not having much luck.

I just found a book with the same kind of problem:
calibre fetched a scheduled feed just while i was testing some files, so i tried the resulting "periodical" mobi and that was it

It seems the problem is with the INDX parsing, i got the output:

Code:

parsed INDX header:
len 192 nul1 0 type 1 gen 0 start 1256 count 54 code 4294967295 lng 4294967295 total 0 ordt 0 ligt 0
contextual data @ xB
DF	0	-1	1	6
contextual data @ x98
2	2	E2	-1	-1
contextual data @ x127
46	2	E2	-1	-1

which shows that from the second entry everything is mangled.
There's actually an extra VWI in the first "DF" entry so the rest is shifted.

I guess the right way to fix this is to use the TAGX data to reliably know what to expect in the entries.
In this particular case our current "type-based" rules might work if we took into account the differences between book & periodical style indexes... but I have yet to fiddle with that...
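To show why one unexpected VWI shifts everything after it, here is a small Python 3 sketch of a forward variable-width integer decoder. It assumes the commonly described encoding (7 data bits per byte, high bit set on the last byte); `read_vwi` and the sample bytes are made up for illustration and are not taken from the script.

```python
def read_vwi(data, offset):
    """Decode one forward variable-width integer; return (value, bytes_used)."""
    value = 0
    used = 0
    while True:
        byte = data[offset + used]
        used += 1
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:  # assumed: the high bit marks the last byte
            return value, used

# Three VWIs back to back: one byte, two bytes, one byte.
entry = bytes([0x81, 0x04, 0x82, 0x85])
pos = 0
decoded = []
while pos < len(entry):
    value, used = read_vwi(entry, pos)
    decoded.append(value)
    pos += used
print(decoded)
```

If the parser does not know that the second value exists, it reads the following field from the wrong offset, and every later field in the entry is off as well.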

EDIT:
I missed KevinH's last post...
Thanks for the tagx code i'll look into it
And yes there were some errors in the sortINDX code

i actually (silently, out of shame) reuploaded the zip earlier with >= replaced by > in the first test and other fixes

EDIT2:
tagx: pretty impressive, many thanks for quickly implementing this tagx bit i had skipped altogether

sortINDX: you got the second ">0" error but missed the one i mentioned above

refactor: i was toying with the oop approach before but wouldn't do it to keep in sync with other versions, but i have a mobiunpack_ootest.py somewhere...

pdurrant · 09-14-2011, 06:04 PM

Bear in mind that calibre-generated Mobipocket files might not be valid in all instances, since the code was written with reverse-engineered info, not with documentation of the format.

KevinH · 09-14-2011, 07:02 PM

Hi All,

Okay I merged the fixes that fandrieu made to his version (fixes to sortINDX, other changes) and added in a few other typo fixes and now I think we have a version we can use as the basis for public testing and as a basis for refactoring into classes while trying to keep to just one file.

Very nice work fandrieu!

mobiunpack_fand_updated2.zip is attached.

KevinH

DiapDealer · 09-14-2011, 07:58 PM

The above script is slightly broken for MOBI's that have no NCX (when DEBUG_NCX is set to False). In that circumstance, the outncx variable is referenced before it's assigned in the unpackBook function. The <spine> element is also incorrect in the opf for a MOBI with no ncx file.

I made two small changes to the unpackBook function that make it work for MOBI's with no NCX. A quick diff will reveal the simple changes.
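The attached diff is not reproduced here, so the following Python 3 sketch only illustrates the two kinds of fix described: giving the flag a value before any branch reads it, and leaving the `toc` attribute off `<spine>` when there is no NCX. The function names and the manifest id `ncx` are assumptions, not the actual unpackBook code.

```python
def spine_open_tag(has_ncx):
    """Return the opening <spine> tag for the OPF."""
    # The toc attribute must name a manifest item, so omit it when no NCX exists.
    return '<spine toc="ncx">' if has_ncx else "<spine>"

def unpack_book_sketch(ncx_entries):
    """Stand-in for the relevant part of an unpack routine."""
    outncx = False            # assigned up front, before any branch can read it
    if ncx_entries:
        outncx = True         # an NCX file would be written here
    return outncx, spine_open_tag(outncx)

print(unpack_book_sketch([]))
print(unpack_book_sketch(["Chapter 1"]))
```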

I'm having quite a bit of success with unpacking various books and rebuilding them with Kindlegen.

KevinH · 09-14-2011, 08:06 PM

Hi DiapDealer,

Nice catch! I never actually tested it on a book without an NCX.
If your version seems to work for everyone, then we have one to release before we attempt the refactoring/adding of classes.

Thanks,

KevinH

Quote:

Originally Posted by DiapDealer

The above script is slightly broken for MOBI's that have no NCX (when DEBUG_NCX is set to False). In that circumstance, the outncx variable is referenced before it's assigned in the unpackBook function. The <spine> element is also incorrect in the opf for a MOBI with no ncx file.

I made two small changes to the unpackBook function that make it work for MOBI's with no NCX. A quick diff will reveal the simple changes.

I'm having quite a bit of success with unpacking various books and rebuilding them with Kindlegen.

fandrieu · 09-14-2011, 09:54 PM

Hehe, i didn't take the time to check your latest fixes (pretty late here), but you seem to have spotted the misplaced outncx=False line

I just wanted to add another bit that troubled me:
I merged the (hopefully fixed) sortINDX & buildNCX functions, removing an "evolutionary" clutch with the added bonus of correct indenting (but didn't take much time to test it though...)

siebert · 09-15-2011, 10:42 AM

Hi,

I've looked into the latest source provided by fandrieu and the handling seems to take some shortcuts. I assume that the ncx index also contains an IDXT section, so why don't you use it to find the start and end position of each entry, so you can verify that you've decoded all bytes?
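A minimal Python 3 sketch of that check, under stated assumptions: the IDXT block is taken to be the marker `IDXT` followed by one big-endian 16-bit start offset per entry, the last entry is taken to end where IDXT begins, and the IDXT offset and entry count are assumed to come from the section header. `entry_bounds` and `check_consumed` are hypothetical names, and the sample section is hand-built.

```python
import struct

def entry_bounds(section, idxt_offset, entry_count):
    """Return a (start, end) byte range for every entry listed in IDXT."""
    if section[idxt_offset:idxt_offset + 4] != b"IDXT":
        raise ValueError("no IDXT marker at offset %d" % idxt_offset)
    starts = [struct.unpack_from(">H", section, idxt_offset + 4 + 2 * i)[0]
              for i in range(entry_count)]
    ends = starts[1:] + [idxt_offset]   # assumed: last entry ends where IDXT begins
    return list(zip(starts, ends))

def check_consumed(start, end, bytes_decoded):
    """True if decoding an entry used exactly the bytes IDXT says it has."""
    return start + bytes_decoded == end

# 8 dummy header bytes, a 4-byte entry, a 6-byte entry, then the IDXT block.
body = b"\x00" * 8 + b"AAAA" + b"BBBBBB"
section = body + b"IDXT" + struct.pack(">HH", 8, 12)
bounds = entry_bounds(section, len(body), 2)
print(bounds)
```

Calling `check_consumed` after each entry would have flagged the shifted periodical entries as soon as the first one was decoded.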

The tag handling code will work only if all bitmasks are single bits. Is this always the case? I would then at least add an assertion which will fail for non-single bitmasks.
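Such an assertion could look like the Python 3 sketch below. The row layout (tag, values per entry, bitmask, end flag) is the assumed TAGX row format, and `is_single_bit` and `check_masks` are hypothetical names; whether multi-bit masks actually occur in real files is exactly the open question.

```python
def is_single_bit(mask):
    """True if exactly one bit is set (a non-zero power of two)."""
    return mask != 0 and mask & (mask - 1) == 0

def check_masks(rows):
    """Fail loudly if any TAGX row uses a bitmask the parser cannot handle."""
    for tag, values_per_entry, mask, end_flag in rows:
        if end_flag:
            continue  # end-of-control-byte rows carry no mask
        assert is_single_bit(mask), "tag %d has multi-bit mask 0x%02x" % (tag, mask)

# Passes: every mask is a single bit.
check_masks([(1, 1, 0x01, 0), (2, 1, 0x02, 0), (0, 0, 0x00, 1)])
print(is_single_bit(0x04), is_single_bit(0x06))
```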

Ciao,
Steffen
