KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files - Page 4

pdurrant · 09-21-2010, 02:12 PM

Quote:

Originally Posted by becky330

I am new to this and am trying to understand how to use the mobiunpack python script. I used calibre to convert a book I created from .epub to .mobi. I have installed Python on my Mac and when I run the mobiunpack.py then add my .mobi file in the terminal window, it says "permission denied". Why is it doing this since it is my file? How do I actually use the script???

Exactly why you're getting the error message depends on the exact command, and what permissions you have for the locations you specify.
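For anyone who does want to stay in the terminal, here is a minimal sketch of how the permissions in question could be checked from Python. The function name `diagnose` and the file names in the usage comment are made up for illustration; nothing here is part of mobiunpack.py. A common cause of "permission denied" is launching a script directly when it has no execute bit, which running it through the `python` interpreter avoids, but that is only a guess about this case.

```python
import os
import sys


def diagnose(path):
    """Return a list of human-readable permission findings for `path`."""
    findings = []
    if not os.path.exists(path):
        findings.append("missing: " + path)
        return findings
    if not os.access(path, os.R_OK):
        findings.append("not readable: " + path)
    # Output folders must be writable for the unpacked files to land there.
    if os.path.isdir(path) and not os.access(path, os.W_OK):
        findings.append("directory not writable: " + path)
    # A script without the execute bit cannot be launched as ./script.py.
    if os.path.isfile(path) and path.endswith(".py") and not os.access(path, os.X_OK):
        findings.append("no execute bit (run it via the python interpreter instead): " + path)
    return findings


if __name__ == "__main__":
    # Hypothetical usage: python check_perms.py mobiunpack.py book.mobi outdir
    for p in sys.argv[1:]:
        for line in diagnose(p) or ["ok: " + p]:
            print(line)
```

An empty list means none of the checks found a problem for that path.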

I'd suggest forgetting the terminal and using the AppleScript application I've just uploaded here: https://www.mobileread.com/forums/sho...836#post774836

Just drag and drop the Mobipocket file onto the script, and it'll decode into a folder in the same location as the file. The first time you run it, it will ask you to find the copy of the MobiUnpack.py script you have on your hard disk.

If you have any further problems, just ask.

GeoffC · 09-22-2010, 10:39 AM

Becky

Welcome to mobileread ....

st_albert · 10-12-2010, 07:43 PM

I've finally gotten around to "discovering" mobiunpack, and now I have a few questions.

1) On both Linux and Windows, the output .html file seems to have Apple/Mac style end-of-line characters. Can this be fixed easily? I'm not a Python programmer by any means, but I did try changing things like "f = open(outsrc, 'wb')" to "f = open(outsrc, 'w')" without effect.

2) I'm guessing the .html file produced is not supposed to be valid HTML, e.g. it lacks a <!DOCTYPE..> header, and the <guide> section in the <head> shouldn't be there. The presence of the <mbp:pagebreak /> tags is trivial.

Anyhow, it's a great tool for seeing what is going on inside the mobipocket file! Thanks for your efforts, all of you, whoever you are!

KevinH · 10-12-2010, 08:08 PM

Hi,

I think the line endings depend on which type of machine was used to generate the original Mobi file. The ones you tested must have been made on a Mac platform. Luckily HTML itself is immune to line ending differences. But encodings (specifically UTF-8) may need the high bit set, so I would keep the 'wb'.

If you are on Linux or Mac OSX, simply use tr to remove or change them:

To replace carriage returns '\r' with new lines '\n':

cat FILE.html | tr '\r' '\n' > temp.html
mv temp.html FILE.html

To simply remove the carriage returns without replacing them:

cat FILE.html | tr -d '\r' > temp.html
mv temp.html FILE.html
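The same clean-up can be sketched in Python, which also works on Windows where tr is not normally available. This is only an illustration; the function names are made up and not part of mobiunpack.py. Unlike the first tr command, it treats a CR+LF pair as a single line break instead of turning it into two.

```python
def normalize_newlines(data, keep_breaks=True):
    """Convert CRLF and bare CR in a bytes object to LF, or drop the CRs."""
    if keep_breaks:
        # Handle CRLF first so it does not become two line feeds.
        return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return data.replace(b"\r", b"")


def rewrite_file(path, keep_breaks=True):
    # Binary mode on both ends, so UTF-8 bytes pass through untouched.
    with open(path, "rb") as f:
        data = f.read()
    with open(path, "wb") as f:
        f.write(normalize_newlines(data, keep_breaks))
```

Calling `rewrite_file("FILE.html")` replaces the file in place, so try it on a copy first.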

BTW: There is another tool: mobiml2html.py that will take the Mobi specific html file created by mobiunpack.py and make it xhtml if you want to archive things or convert them to epub.

It is available as python source code with a GUI front-end from the same site as a zip archive

http://code.google.com/p/ebook-conve...s.zip&can=2&q=

or you can check out the source tree itself:
http://code.google.com/p/ebook-conve...ource/checkout

It is also available in the "tools" package mentioned on the ApprenticeAlf site.

Hope this helps,

KevinH

st_albert · 10-13-2010, 11:22 AM

KevinH, Thanks for all the info. No, the files were not created on a mac. They were built on Linux and tested on Linux and Windows.

Actually it turns out that they seem to have no EOL characters at all. The "tr" command didn't change anything in the file. I had guessed Mac format because that's what Notepad++ guessed.

In the end I used perl to add linebreaks between all tags (e.g. "s/></>\n</g"). That turns out to be overkill, but at least the file is readable and editable.
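For readers without perl, roughly the same substitution can be sketched in Python. The function names are hypothetical, and like the perl one-liner it breaks between every pair of adjacent tags, whether or not that is overkill for a given file.

```python
def break_between_tags(data):
    """Insert a line feed between every '>' that is directly followed by '<'."""
    return data.replace(b"><", b">\n<")


def add_breaks_to_file(src, dst):
    # Work on bytes so the file's encoding is left exactly as it was.
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as f:
        f.write(break_between_tags(data))
```

Writing to a separate `dst` file keeps the unpacked original around in case the result needs comparing.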

The clean-up tools you linked to work very well indeed.

adamselene · 10-15-2010, 02:07 AM

A few things:

* MobiPocket is an old format, derived from HTML2 with some extensions. In HTML2 times, there was no !DOCTYPE, and in any case there is no need in MobiPocket to differentiate between document languages (because there is only one), so you shouldn't expect it to be there. In fact, quite a bit of what mobigen/kindlegen does is to convert HTML4 and XHTML to HTML2 by rewriting tags and flattening CSS into old-style tags.

* <guide> is one of the extensions. Basically they took an entire chunk of the .opf file and stuck it in the <head> tag so that devices could generate menus to navigate to parts of the document. There are historical reasons for doing it this way, originating with MobiPocket's predecessor formats, which were basically just one big HTML document wrapped in a Palm database file. There are many other ways this could have been done, but creating multiple files/streams within the Palm database would get awkward for several reasons, not least of all because links are all flattened to absolute file positions.

* mobigen/kindlegen specifically removes line breaks to make the file smaller, so you shouldn't expect to see any.
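To make the point about absolute file positions concrete, here is a small sketch in Python. It assumes that links in the unpacked markup appear as `filepos=` attributes holding a decimal byte offset into the raw HTML; that detail and the function name `link_targets` are assumptions of this sketch, not something stated in the thread.

```python
import re

# Assumed form: filepos=0000012345, optionally quoted.
FILEPOS = re.compile(rb"filepos=[\"']?(\d+)")


def link_targets(raw, context=20):
    """Map each filepos offset to the bytes found at that position."""
    targets = {}
    for m in FILEPOS.finditer(raw):
        pos = int(m.group(1))
        # The offset counts bytes from the start of the raw markup.
        targets[pos] = raw[pos:pos + context]
    return targets
```

This also shows why editing the unpacked file is fragile: adding or removing even one byte ahead of a target shifts it away from the offset its links point to.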

Honestly, MobiPocket is such a crappy format that I would strongly advise avoiding it at all costs, with the sole exception of using it as an output format to display on a Kindle. For all other purposes, you should use ePub. I only wrote the original mobiunpack.py because I tried to decompress the dictionary with other tools, it took more than 30 minutes, and I wanted to demonstrate that it could be done much better (even in Python).

st_albert · 10-15-2010, 11:17 AM

Quote:

Originally Posted by adamselene

A few things:

...

Honestly, MobiPocket is such a crappy format that I would strongly advise avoiding it at all costs, with the sole exception of using it as an output format to display on a Kindle. For all other purposes, you should use ePub. I only wrote the original mobiunpack.py because I tried to decompress the dictionary with other tools, it took more than 30 minutes, and I wanted to demonstrate that it could be done much better (even in Python).

Yes, I have to agree with you there, regarding mobi vs. epub format. Unfortunately, I'm pretty sure (don't have access to actual sales figures) that Kindle is our largest e-book sales outlet. So I'm always interested in learning how to deal with it better.

Thanks for the background information. I find it fascinating. Mobiunpack is a great tool for looking at what's inside the mobi package, and thanks to it I can actually SEE what you're talking about. I've been dabbling in ebook format conversions since Aportis Doc and Peanut Reader on Palm Pilots, but it has only been recently that I've taken a more "professional" interest. So much to learn!

sklamb · 11-14-2010, 04:43 PM

Having finally bought my Kindle just as the price of modern digital books went up, I naturally turned to the wonderful world of out-of-copyright material for the bulk of my reading pleasure. Of course the quality of digitizing does vary a lot, and I'm just grateful for all the work that people have done already to make it possible to read books I'd otherwise not be able to get. However, I have a surprising number of (non-DRM) ebooks which need only a small number of errors corrected, and I'm OCD enough to want to do that if I can. I know calibre would solve some of these problems, but for editing an ebook originally generated in PRC this script seems much more suitable. Unfortunately I don't have Python installed on my Windows XP computer and I don't really want to get involved with all the complications that would involve just to do some PRC proofreading....

Is there any possibility that some kind person might convert this script into a Windows executable, as has been done for the mobiperl scripts?

I know it's an imposition and I feel guilty about not doing it for myself, but I'm getting older and doing something like installing Python doesn't seem as much fun as it used to.

sklamb · 11-14-2010, 04:51 PM

Sorry...adding this post because I can't figure out how else to subscribe to this thread...had the wrong option set when I posted the first time... darn :newbie !

ATDrake · 11-14-2010, 04:56 PM

1) Installing Python on Windows is as easy as double-clicking the installer from ActiveState Python Community Edition. Actually using it is admittedly a bit trickier, but perhaps someone will make a widgetized version.

2) You can subscribe to any thread without posting in it by clicking the Thread Tools button in the bar above the top post and choosing Subscribe.

3) Welcome to MobileRead!

sklamb · 11-14-2010, 05:10 PM

Duh...thank you for that, ATDrake. (Especially as I apparently didn't succeed the other way....)

I may just have to grit my teeth and take on Python as well as the prc format (and XML and all the other things I only vaguely sorta know about). Somehow I hadn't expected getting a Kindle to turn me back into any sort of computer geek after decades of just being a user!

DiapDealer · 11-14-2010, 07:13 PM

Quote:

Is there any possibility that some kind person might convert this script into a Windows executable, as has been done for the mobiperl scripts?

The Windows program for mobiperl still requires Perl to be installed (it used to, anyway). So even though someone might write a different front-end for MobiUnpack (there's already a Tk front-end)... chances are, it will still require Python.
(Even though a Python to C port of MobiUnpack probably wouldn't be that difficult... there'd then be two separate versions to maintain.)

sklamb · 11-14-2010, 08:21 PM

Very humbly...what's Tk? I thought what was available was the original script and an applet for the Mac....

DiapDealer · 11-14-2010, 08:47 PM

Quote:

Very humbly...what's Tk?

It's just some standard GUI type stuff that comes standard with almost all versions of Python. The Tools archive (from Alf's blog) has a GUI front-end for MobiUnpack that will work for pretty much any O.S. (that has python installed, of course). It allows you to choose the files and output directories with standard file dialogs and familiar buttons and such. You have to install python, but none of the scripts really require you to get down and dirty with command-line stuff if you don't want to... while still allowing those who actually prefer to get down and dirty, to do so.
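To give an idea of what such a Tk front-end amounts to, here is a minimal sketch. It is not the GUI from the Tools archive. The script location and the argument order (input file, then output folder) are assumptions, and the module is called `tkinter` in Python 3 but `Tkinter` in the Python 2 releases current when this was posted.

```python
import subprocess
import sys


def build_command(script, book, outdir):
    """Command line for running the unpack script through the current interpreter."""
    return [sys.executable, script, book, outdir]


def main():
    # Imported here so the helper above still works where Tk is not installed.
    import tkinter as tk
    from tkinter import filedialog

    root = tk.Tk()
    root.withdraw()  # no main window, only the two dialogs
    book = filedialog.askopenfilename(title="Choose a Mobi/PRC file")
    outdir = filedialog.askdirectory(title="Choose an output folder")
    root.destroy()
    if book and outdir:
        # "mobiunpack.py" in the current directory is an assumption of this sketch.
        subprocess.call(build_command("mobiunpack.py", book, outdir))

# Call main() to try it; both dialogs return an empty string if cancelled.
```

Going through `sys.executable` means the script is run by the same Python that shows the dialogs, so no command line is needed at any point.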

sklamb · 11-14-2010, 09:28 PM

Quote:

Originally Posted by DiapDealer

It's just some standard GUI type stuff that comes standard with almost all versions of Python. The Tools archive (from Alf's blog) has a GUI front-end for MobiUnpack that will work for pretty much any O.S. (that has python installed, of course).

Oh, dear...Alf's blog? I'm going to need an address, I'm afraid....

...No, never mind, I worked that bit out for myself. Thank you so much for all your help!



