KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files - Page 35

pdurrant · 03-13-2013, 05:32 AM

I shall have to leave detailed discussion of DATP sections to KevinH and DiapDealer.

Nice to see you back at MobileRead. Thanks again for the original code that's been developed into KindleUnpack.

Hitch · 03-13-2013, 06:51 PM

Quote:

Originally Posted by adamselene

So I actually decided to try the splitting feature, to see how much space it would save on a Kindle loaded with converted ePub books. The answer turned out to be about 11.5% on my corpus of 1277 books. But that's not what this post is about…

I compared a KF8 stripped by KindleUnpack with one generated from the same original file by Amazon's Personal Document Service, and found that, aside from some minor changes in the metadata (including addition of the atv:kin:1 tag that they harvest and upload to track documents), it does something different with DATP sections near the end of the document. There are two in the original file and the KindleUnpack KF8, but one is removed in the Amazon KF8, and it's put in a slightly different location in the file. I wonder whether this matters. Do you have any idea what this section is for? It looks like a table of offsets.

Hi:

Most of this is beyond me, but if I may, you can't actually "generate" a K8 file from Amazon's PDS. If you email a file to your own Kindle addy, or use the PDS in any other way, what you get back is not a K8; it's the old mobi (prc) format. So, you can't compare apples-to-apples (in any sense) for an actual K8 created by KindleGen/KP versus the "mobi" (prc) file that you'll get from PDS. You can test this yourself by sending a K8 with, say, an embedded font to the PDS--what you get back will be equivalent to a book made with MBPC. Then sideload the same K8 file with an embedded font to a Fire device directly, either by USB or via an actual (not faux) wifi connection. The "send to Kindle by wifi" prompt you can see on your computer does not use Wifi; it emails the document/book via the PDS. So when I say "wifi," I mean an app like "Wifi File Explorer," which is genuine wifi. You'll see the difference; the USB or wifi-d book will have the embedded font; the PDS book will not.

Hope that helps. The DATP stuff is too deep for yours truly, but I thought before you tried to sort this, you should use files that are equivalent.

Hitch

adamselene · 03-13-2013, 07:06 PM

Quote:

Originally Posted by Hitch

Most of this is beyond me, but if I may, you can't actually "generate" a K8 file from Amazon's PDS. If you email a file to your own Kindle addy, or use the PDS in any other way, what you get back is not a K8; it's the old mobi (prc) format.

That turns out not to be true. I emailed some combo KF7/KF8 files made by KindleGen, and PDS turned them into standalone KF8s for transfer to the Kindle Paperwhite.

I note that the file size listed on Amazon site indicates that it stored a combo file in the cloud. I haven't tested downloading to an older device without KF8, but presumably it gets stripped to KF7.

adamselene · 03-13-2013, 07:16 PM

Quote:

Originally Posted by adamselene

I haven't tested downloading to an older device without KF8, but presumably it gets stripped to KF7.

I just tested that, and it works as I expected. In this case, the only difference from the KindleUnpacked version is in the metadata; all the data sections are identical to the Amazon stripped version.

Hitch · 03-13-2013, 08:08 PM

Quote:

Originally Posted by adamselene

That turns out not to be true. I emailed some combo KF7/KF8 files made by KindleGen, and PDS turned them into standalone KF8s for transfer to the Kindle Paperwhite.

I note that the file size listed on Amazon site indicates that it stored a combo file in the cloud. I haven't tested downloading to an older device without KF8, but presumably it gets stripped to KF7.

Well, if that's true, that's new. As of merely 6 weeks ago, the K8 files that were emailed via PDS were still being converted as if they were K7 (or K6, or whatever). The files mailed to my Fire were not converting properly, and I discussed this with Amazon, and my Tech. Account Manager, in...the end of January, I think it was. I'll check it again.

ETA: Yup--I just sent a K8-formatted book to my Fire, and now it's working. That's very cool, thank you for this discussion--I wouldn't have found out for ages, given that I'd stopped using the PDS for this very reason. (That, and it's faster to just wifi it, but, still...this way I can tell my clients to email the files to their devices. It will save me untold brain-damage. COOL!)

Hitch

adamselene · 03-13-2013, 09:00 PM

Well, it would hardly be the first time Amazon quietly changed something without bothering to tell anyone.

Transferring over USB is why I wanted to strip the files myself. Using PDS has the advantage that more content can be kept in the cloud and fetched from the device, and it syncs reading location. (The latter two things don't seem to work on my Kindle 2, but content can be pushed from the web site.)

Hitch · 03-14-2013, 06:35 AM

Quote:

Originally Posted by adamselene

Well, it would hardly be the first time Amazon quietly changed something without bothering to tell anyone.

Transferring over USB is why I wanted to strip the files myself. Using PDS has the advantage that more content can be kept in the cloud and fetched from the device, and it syncs reading location. (The latter two things don't seem to work on my Kindle 2, but content can be pushed from the web site.)

No: it certainly wouldn't (be the first time Amazon changed something...);

I could harangue for days over the horsepucky with the SRL change in/around December, which doesn't show up until after the Publishing Workflow (in other words, after the book is put on sale)...and then only in books for which there's no discernible or describable or document-able criterion. I've had not less than 20 back and forth emails with the Mgr of Digital Operations about this one, because it's just WHACK.

Anyway, though: thanks again. I really wouldn't have found out for ages, simply because it's not a method we ever used a lot, and on the rare occasions we did, post the advent of K8, the doc conversion was still old-school.

Hitch

nickredding · 03-16-2013, 10:34 PM

I'm occasionally getting a codec error unpacking calibre-generated news downloads:

Code:

...
Write ncx
Find link anchors
Insert data into html
Insert hrefs into html
Remove empty anchors from html
Insert image references into html
Write opf
Error: 'ascii' codec can't decode byte 0xe2 in position 84: ordinal not in range(128)


Error: Unpacking Failed

I can't determine where in the unpacking code this is happening (I'm assuming it's in WriteOPF since there is no OPF file after this crash).

It would be nice if KindleUnpack would report (including where the offending byte data is) and then ignore this type of error and carry on instead of terminating. I suppose it's possible there is an error somewhere in calibre, but the resulting files work fine on kindles, ipads, etc., so whatever it is it's harmless, and anyway there is no way to figure out where the issue might be in calibre without some useful information from KindleUnpack.

Normally I would try to isolate the issue in KindleUnpack myself, but the code has changed and grown so much since I last worked on it that it would be a major project for me to get back into it. Hopefully, someone who is up to speed on the current code can deal with this.
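The "report and carry on" behaviour asked for here could look roughly like the sketch below. It is not KindleUnpack code: `decode_or_report` and its `label` parameter are hypothetical names, and it is written for Python 3. It catches the decode failure, prints the bytes around the offending position, and then substitutes a replacement character instead of terminating.

```python
def decode_or_report(raw, encoding="utf-8", label="metadata"):
    """Decode raw bytes; on failure, report the offending bytes and carry on.

    Hypothetical helper for illustration only, not part of KindleUnpack.
    """
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start and err.end index the bytes the codec could not decode.
        context = raw[max(0, err.start - 10):err.end + 10]
        print("Warning: %s: %s (near %r)" % (label, err, context))
        # Substitute U+FFFD for the bad bytes instead of aborting.
        return raw.decode(encoding, errors="replace")


# A Latin-1 style byte (0xe9) that is not valid utf-8 in this position.
title = decode_or_report(b"Caf\xe9 news", label="title")
print(title)
```

Whether silently replacing bytes is acceptable depends on where the text ends up; for an OPF file a warning plus a best-effort value is arguably better than no OPF at all.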

KevinH · 03-16-2013, 11:28 PM

Hi Nick,
That error can typically be generated deep inside the python library code when unicode data is passed between threads and the default python encoding ends up being used; on some platforms that default is ascii, which causes the error. I thought all of those were fixed in the very latest version of KindleUnpack. Perhaps not. Or perhaps some full unicode data is used in a filename, book title or link target that should have been properly converted to utf-8 before being written to the opf. Either way, please post a zip archive of the problem news feed ebook and I will try to track down what is happening and get it fixed.

KevinH
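For anyone following along, the error text itself is easy to reproduce. The sketch below is written for Python 3, where the ascii decode has to be requested explicitly; Python 2 performed the same decode implicitly whenever byte strings and unicode strings were mixed. It only illustrates the class of error and makes no claim about where it is raised inside KindleUnpack. The byte 0xe2 is the first byte of the utf-8 encoding of a left smart quote.

```python
# utf-8 encoding of U+2018 (left single quotation mark): three bytes, starting with 0xe2.
smart_quote = "\u2018".encode("utf-8")

try:
    smart_quote.decode("ascii")
    message = None
except UnicodeDecodeError as err:
    # Same wording as the error in the unpack log, apart from the position.
    message = str(err)

print(message)
```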

nickredding · 03-17-2013, 12:12 AM

Kevin - attached is a file that generates this fault.

adamselene · 03-17-2013, 12:41 AM

Quote:

Originally Posted by adamselene

Transferring over USB is why I wanted to strip the files myself. Using PDS has the advantage that more content can be kept in the cloud and fetched from the device, and it syncs reading location. (The latter two things don't seem to work on my Kindle 2, but content can be pushed from the web site.)

This is a bit tangential to the topic of the thread, but just following up about my experiment using PDS…

The interface Amazon provides for maintaining Personal Documents is horrid and basically unusable for maintaining a library. In addition to laborious paging interfaces on the web site, there's no ability to categorize, no way to update a document if you've changed it, and not even any attempt to suppress duplicates.

In addition to that, leaving WiFi on to sync location between the Touch and Paperwhite clearly caused the battery on both to drain significantly faster, which greatly nonplussed me.

So, screw that—I'll stick with maintaining my Kindles with rsync.

KevinH · 03-17-2013, 11:55 AM

Hi Nick,

It seems the Description metadata item in your np.mobi testcase is properly utf-8 encoded (and it does correctly encode and use non-ascii characters) - notice the smart quotes and accented chars in this snippet.

I looked at the Description in a hex editor and all of the smart quotes appear to be utf-8 encoded and not cp1251.

---
Key: "Description"
Value: "Daily news from the National Post

Articles in this issue:
Is the war on cancer an ‘utter failure’?: A sobering look at how billions in research money is spent

Jean Chrétien: A capable caretaker, but no statesman
---
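The hex-editor check can also be done in a few lines. This is a minimal Python 3 sketch, not taken from KindleUnpack; `looks_like_utf8` is a hypothetical helper and the sample text is lifted from the Description above. In utf-8 the opening smart quote is the three bytes e2 80 98, while a single-byte Windows code page such as cp1251 stores it as the one byte 0x91.

```python
def looks_like_utf8(raw):
    """Return True if the bytes decode cleanly as utf-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "\u2018utter failure\u2019"   # smart quotes as in the Description

as_utf8 = text.encode("utf-8")
as_cp1251 = text.encode("cp1251")

print(as_utf8[:3])      # bytes of the opening quote in utf-8
print(as_cp1251[:1])    # byte of the opening quote in cp1251
print(looks_like_utf8(as_utf8), looks_like_utf8(as_cp1251))
```

The leading 0xe2 is the same byte named in the error message, which fits with the Description being utf-8 that later gets treated as ascii.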

The error you reported seems to happen because utf-8 bytes in the Description metadata element are not properly being handled in either the unescape or xmlescape python library routines.

In other words, the bug fix we made to properly escape html in the metadata text fields (you can't have html inside the opf xml metadata, e.g. dc:description) now fails when utf-8 text passes through those libraries.

To prove this I made the following change to mobi_opf.py to disable the html escaping.

Code:

--- mobi_opf.py~	2013-01-12 23:40:42.000000000 -0500
+++ mobi_opf.py	2013-03-17 11:38:06.000000000 -0400
@@ -47,7 +47,8 @@
                 for value in metadata[key]:
                     # Strip all tag attributes for the closing tag.
                     closingTag = tag.split(" ")[0]
-                    data.append('<%s>%s</%s>\n' % (tag, xmlescape(self.h.unescape(value)), closingTag))
+                    # data.append('<%s>%s</%s>\n' % (tag, xmlescape(self.h.unescape(value)), closingTag))
+                    data.append('<%s>%s</%s>\n' % (tag, value, closingTag))
                 del metadata[key]
 
         def handleMetaPairs(data, metadata, key, name):

And now it unpacks just fine.

I am not sure how these library routines work, but somewhere inside they are assuming the string is ascii, or converting it through ascii, and this causes the error when the bytestring is in fact utf-8.

So I will have to dig around in those libraries to see how to fix their issues with handling properly encoded bytestrings. The fix may take a while, but in the meantime you can simply disable the unescaping via the patch above.

Quote:

Originally Posted by nickredding

Kevin - attached is a file that generates this fault.

Thanks for the testcase.

Take care,

KevinH

KevinH · 03-17-2013, 12:19 PM

Hi Nick,

Okay, I think xmlescape and HTMLParser both work better with full unicode strings. At that point, all metadata has already been encoded as utf-8, so I have modified mobi_opf.py to convert all required pieces from utf-8 to full unicode, pass them through the xmlescape and unescape methods, and then convert back to the utf-8 needed for the opf file.

So please give this mobi_opf.py version a try and let me know if it fixes your issues.

Thanks,

Kevin
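The decode, unescape, escape, re-encode round trip described in this post can be sketched as follows. This is not the attached mobi_opf.py: `clean_metadata_value` is a hypothetical name, the sample value is made up from the Description text, and the code is written for Python 3, where `html.unescape` stands in for the `self.h.unescape` call seen in the patch.

```python
from html import unescape
from xml.sax.saxutils import escape as xmlescape


def clean_metadata_value(raw):
    """utf-8 bytes in, utf-8 bytes out, with the escaping done on unicode text.

    Hypothetical helper for illustration; not the code in mobi_opf.py.
    """
    text = raw.decode("utf-8")          # utf-8 bytes -> full unicode
    text = xmlescape(unescape(text))    # resolve html entities, then escape for xml
    return text.encode("utf-8")         # back to utf-8 for the opf file


value = "Jean Chr\u00e9tien &amp; the \u2018war\u2019 <b>on</b> cancer".encode("utf-8")
print(clean_metadata_value(value).decode("utf-8"))
```

Unescaping first means an entity that is already escaped, such as `&amp;`, is not escaped a second time, while stray markup like `<b>` comes out as `&lt;b&gt;` and cannot break the xml.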

nickredding · 03-17-2013, 07:34 PM

Quote:

Originally Posted by KevinH

So please give this mobi_opf.py version a try and let me know if it fixes your issues.

It works -- thanks.

DiapDealer · 03-24-2013, 09:24 AM

Quote:

Originally Posted by KevinH

Okay, I think xmlescape and HTMLParser both work better with full unicode strings. At that point, all metadata has already been encoded as utf-8, so I have modified mobi_opf.py to convert all required pieces from utf-8 to full unicode, pass them through the xmlescape and unescape methods, and then convert back to the utf-8 needed for the opf file.

Just a heads up: there are three more places in the mobi_opf script where data gets the unescape->escape treatment in addition to the handleTag and handleMetaPairs methods.

Would it make sense to do something similar (full unicode) in those additional three locations?



