07-19-2011, 08:26 PM | #76 |
Sigil Developer
Posts: 8,160
Karma: 5450818
Join Date: Nov 2009
Device: many
|
Hi Steffen,
Okay, here is a slightly revised version of what you did. I must admit my image name replacement is slower than yours, but it is still much faster than the old version. If need be, we can condition this code on whether we are processing a dictionary or not and add back your fixed image file extension version purely for speed. I called it v0.28 to differentiate it. If it works okay for you, we can then integrate it into your git repository.

Last edited by KevinH; 07-19-2011 at 08:27 PM. Reason: fix typos |
07-20-2011, 02:44 AM | #77 |
The Grand Mouse 高貴的老鼠
Posts: 72,518
Karma: 309063598
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Voyage
|
If anyone has a good suggestion for how to fix the problem of loss of multiple metadata entries, I'd love to hear it. (i.e. if there's more than one author listed, we only save and write out one of them.)
|
07-20-2011, 06:19 AM | #78 | |
Developer
Posts: 155
Karma: 280
Join Date: Nov 2010
Device: Kindle 3 (Keyboard) 3G / iPad 9 WiFi / Google Pixel 6a (Android)
|
Quote:
Then it should be as easy as storing a list of strings instead of a single string in metadata[name], and for the output just iterating over the list.

Ciao, Steffen |
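A minimal sketch of the list-based approach Steffen describes; the helper name `add_metadata` and the `Creator` key are illustrative here, not the actual mobiunpack code:

```python
def add_metadata(metadata, name, value):
    # Append every occurrence of a tag instead of overwriting the last one.
    metadata.setdefault(name, []).append(value)

metadata = {}
add_metadata(metadata, 'Creator', 'Author One')
add_metadata(metadata, 'Creator', 'Author Two')

# On output, iterate over the list so no value is lost.
opf_lines = ['<dc:creator>%s</dc:creator>' % v for v in metadata['Creator']]
```

A single-valued tag simply becomes a one-element list, so the output loop handles both cases the same way.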
|
07-20-2011, 06:30 AM | #79 | |
The Grand Mouse 高貴的老鼠
Posts: 72,518
Karma: 309063598
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Voyage
|
Quote:
Anyone better than me at Python like to give it a go? |
|
07-20-2011, 06:33 AM | #80 | |
Developer
Posts: 155
Karma: 280
Join Date: Nov 2010
Device: Kindle 3 (Keyboard) 3G / iPad 9 WiFi / Google Pixel 6a (Android)
|
Quote:
As far as I know these non-image sections appear only at the end, after all images, never between images, so skipping over them should not confuse the image name index. In that case we should also treat image sections whose type we can't determine as an error and print a message to stdout. I think it might be even faster to just search for all <img> tags in the source and then merge the source with the replaced <img> tags, like we do the merge in the "apply dictionary metadata and anchors" section. Of course we have to skip over the original <img> tags in this merge, but since the match objects contain the position of the found string, that should be very easy. Since we then have only the list of <img> tags in memory instead of all the split source file data, I expect that solution to be faster than your implementation.

Ciao, Steffen |
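A rough sketch of the merge Steffen describes, using the match positions from re.finditer. The recindex attribute is real MOBI markup, but the tag-fixing rule and the image file naming are purely illustrative:

```python
import re

def replace_img_tags(source, fix_tag):
    # Find all <img> tags, then stitch the source back together: copy the
    # untouched text between matches and replace only the tags themselves.
    # match.start() and match.end() give each tag's position in the source.
    pieces = []
    last_end = 0
    for match in re.finditer(r'<img[^>]*>', source):
        pieces.append(source[last_end:match.start()])
        pieces.append(fix_tag(match.group()))
        last_end = match.end()
    pieces.append(source[last_end:])
    return ''.join(pieces)

# Illustrative rule: rewrite recindex attributes into image file names.
def fix_tag(tag):
    return re.sub(r'recindex="(\d+)"',
                  lambda m: 'src="images/image%05d.jpg"' % int(m.group(1)),
                  tag)
```

Only the tags and the slice boundaries live in extra memory; the bulk of the source is copied once into the result.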
|
07-20-2011, 06:47 AM | #81 |
The Grand Mouse 高貴的老鼠
Posts: 72,518
Karma: 309063598
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Voyage
|
Some Mobipocket files that have been edited with the Perl tools may have images after the non-image bits at the end.
|
07-20-2011, 11:11 AM | #82 |
Sigil Developer
Posts: 8,160
Karma: 5450818
Join Date: Nov 2009
Device: many
|
Hi Steffen,
> I'm not so happy that you disabled skipping over sections which have been recognized as non-image sections.

I am not sure what your concern here is. A file is only created if it is a known image type. The remaining code in the loop simply invokes imghdr, which just looks at a few selected byte strings near the front of the data (very much like what you are doing, so it should be very fast), and then appends a placeholder to a list. Nothing here will impact processing time much, if at all, versus your version.

> As far as I know these non-image sections appear only at the end after all images, never between images, so it should not confuse the image name index if we just skip over them.

As Paul indicated, this may not always be the case, so this version is safer.

> In that case we should also handle image sections where we can't determine the type as an error and print some message to stdout.

Feel free to add that if you like. My main concern was properly adding the image filename extensions so that later post-processing to xhtml works properly (i.e. for those not using kindlegen or mobipocket creator).

> I think it might be even faster to just search all <img> tags in the source and then merge the source with the replaced <img> tags like we do the merge in the "apply dictionary metadata and anchors" section.

That is similar to what is happening here. Regular expressions are used to split the string into segments where all of the odd pieces (1, 3, 5, 7, ...) are the img tags and the even pieces are everything else before or after. Then when we do replacements, all we are doing is dropping an element from the list and replacing it, and we only process the img tags themselves. So there is no need to create and delete 26MB-100MB copies all of the time. And then you simply put it back together using join.
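The cheap type check Kevin mentions can be sketched like this. The magic-byte signatures are the standard PNG/JPEG/GIF ones, but the function name and return values are illustrative, not the actual mobiunpack code:

```python
def get_image_type(data):
    # Inspect only a few bytes at the front of the section data,
    # much like the stdlib imghdr module does.
    if data[:8] == b'\x89PNG\r\n\x1a\n':
        return 'png'
    if data[:3] == b'\xff\xd8\xff':
        return 'jpeg'
    if data[:6] in (b'GIF87a', b'GIF89a'):
        return 'gif'
    return None  # not a recognized image; the caller appends a placeholder
```

Because only a fixed-size prefix is examined, the check costs the same regardless of section size.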
> Of course we have to skip over the original <img> tags in this merge, but as the match objects contain the position information where the found string was located, it should be very easy.

> Since we have only the list of <img> tags in memory instead of all the split source file data, I expect that solution to be faster than your implementation.

Makes sense. Please feel free to make any changes you like. I only have one old dictionary to test with and so can't really fine-tune it much. If your way is faster and keeps the proper image file name extensions, I am all for it. Once we have that stable, I am going to compare timings of FastConcat with hugeFile set against FastConcat without it, to see how much of a penalty it is to do everything in memory with lists of string segments rather than one huge string constantly being added to.

Take care, Kevin |
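Kevin's split-and-join scheme, roughly; the sample markup and the replacement rule are illustrative:

```python
import re

source = 'text1<img recindex="00001">text2<img recindex="00002">text3'

# Splitting with a capturing group keeps the delimiters: the result
# alternates non-tag text (even indices) and <img> tags (odd indices).
parts = re.split(r'(<img[^>]*>)', source)

# Only the odd elements are <img> tags; replace them in place so the
# large even pieces are never copied or modified.
for i in range(1, len(parts), 2):
    parts[i] = re.sub(r'recindex="(\d+)"',
                      lambda m: 'src="images/image%05d.jpg"' % int(m.group(1)),
                      parts[i])

result = ''.join(parts)
```

Each replacement touches only a short tag string, and the full-size string is materialized exactly once by the final join.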
07-20-2011, 01:50 PM | #83 |
Sigil Developer
Posts: 8,160
Karma: 5450818
Join Date: Nov 2009
Device: many
|
Hi,
For fun ... I ran mobiunpack_v0.28.py on my one dictionary (file size is 27,585,020 bytes) and timed it (clock time from date in a shell script both before and after mobiunpack), then hard-coded hugeFile to False and re-ran.

With hugeFile set to True (uses file IO to temporary files):

Run  Start     Stop      Elapsed Time
1    12:25:21  12:26:39  1 minute 18 seconds
2    12:26:45  12:28:02  1 minute 17 seconds

With hugeFile set to False (uses lists of strings and "".join(strlist)):

Run  Start     Stop      Elapsed Time
1    12:29:18  12:30:32  1 minute 14 seconds
2    12:30:38  12:31:53  1 minute 15 seconds

It was as I expected: there is no "memory issue" when using lists of strings. In most OSes, file IO has overhead and typically writes data to large memory buffers (buffered IO) that are not actually flushed to disk unless pushed or until closed. So any slight savings in memory use is offset by the disk overhead.

So it appears there is no real advantage to using temporary file IO over using lists of strings and a final join. Please try the same thing with your dictionaries and see if you get the same results. If so, we can probably remove the file IO approach, remove FastConcat, and just go with the string list approach.

Thanks, Kevin |
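The two accumulation styles being compared can be sketched as follows; both produce identical output, but the list-and-join form can avoid repeatedly reallocating one ever-growing string:

```python
def build_with_concat(chunks):
    # One huge string constantly being added to.
    text = ''
    for c in chunks:
        text += c
    return text

def build_with_join(chunks):
    # Accumulate segments in a list and join once at the end.
    parts = []
    for c in chunks:
        parts.append(c)
    return ''.join(parts)

chunks = ['x' * 1024] * 500
```

As Kevin's timings suggest, for file-sized data the join form is at least as fast as writing the segments to a temporary file and reading them back.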
07-20-2011, 04:45 PM | #84 | |
Developer
Posts: 155
Karma: 280
Join Date: Nov 2010
Device: Kindle 3 (Keyboard) 3G / iPad 9 WiFi / Google Pixel 6a (Android)
|
Quote:
Ciao, Steffen |
|
07-20-2011, 04:50 PM | #85 | ||
Developer
Posts: 155
Karma: 280
Join Date: Nov 2010
Device: Kindle 3 (Keyboard) 3G / iPad 9 WiFi / Google Pixel 6a (Android)
|
Quote:
Quote:
So I've changed the code to skip non-image sections again but still work for such broken files. Ciao, Steffen |
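One way to skip known non-image records without breaking the image index is to classify each section rather than stopping at the first non-image one. FLIS and FCIS are real record types found near the end of MOBI files, but the function name and return values here are illustrative:

```python
def classify_section(data):
    # FLIS and FCIS records normally sit at the end of a MOBI file, but
    # files edited with other tools may have images after them, so every
    # section is classified instead of ending the scan early.
    if data[:4] in (b'FLIS', b'FCIS'):
        return 'skip'
    if (data[:8] == b'\x89PNG\r\n\x1a\n'
            or data[:3] == b'\xff\xd8\xff'
            or data[:6] in (b'GIF87a', b'GIF89a')):
        return 'image'
    return 'unknown'  # report these so nothing is silently dropped
```

Sections classified as 'skip' still consume an index slot only if the numbering scheme requires it; 'unknown' sections can trigger the error message discussed earlier.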
||
07-20-2011, 04:54 PM | #86 |
The Grand Mouse 高貴的老鼠
Posts: 72,518
Karma: 309063598
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Voyage
|
|
07-20-2011, 05:05 PM | #87 | |
Developer
Posts: 155
Karma: 280
Join Date: Nov 2010
Device: Kindle 3 (Keyboard) 3G / iPad 9 WiFi / Google Pixel 6a (Android)
|
Quote:
But I've noticed that several tags are currently not handled by mobiunpack (e.g. 202-209, 300). I would like to get some input about how mobiunpack should handle them. I doubt that mobigen/kindlegen supports all these tags (if any), but there are already tags that are exported to the opf file even though they are ignored by mobigen/kindlegen (e.g. the ASIN). Are there other tools which actually support these tags and use the values, or are they just for information? In the latter case I would like to mark them as such (for example by putting them into a comment section) to make clear that their value won't affect the generated mobi. Another solution would be to define a new list of ignored tags, so it's clear that we are aware of those tags but deliberately don't include them in the opf file.

Ciao, Steffen |
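A sketch of the "list of ignored tags" idea; the tag numbers come from the discussion above, but the description strings are placeholders, not verified EXTH meanings:

```python
# Known EXTH tags that are deliberately not written to the opf file.
# Keeping them in an explicit table distinguishes "seen and skipped"
# from "truly unknown", which can then be warned about.
IGNORED_EXTH_TAGS = {
    204: 'creator software (placeholder description)',
    300: 'font signature (placeholder description)',
}

def describe_exth_tag(tag_id):
    if tag_id in IGNORED_EXTH_TAGS:
        return 'ignored (%s)' % IGNORED_EXTH_TAGS[tag_id]
    return 'unknown tag %d' % tag_id
```

Anything reported as unknown could also be dumped into an OPF comment so the information is preserved without affecting regeneration.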
|
07-20-2011, 05:16 PM | #88 | |
The Grand Mouse 高貴的老鼠
Posts: 72,518
Karma: 309063598
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Voyage
|
Quote:
We could have a list of tags for export as comments, where we have some idea of what the tags mean, and then also do a simple dump into comments of any completely unknown tags. The plan (if it can be called that) behind the opf generation was to add as much info from the EXTH as possible that was valid in an OPF file, whether or not KindleGen would use it. I'm looking forward to seeing what you come up with. I do have some test files with multiple authors. |
|
07-22-2011, 01:12 PM | #89 |
Sigil Developer
Posts: 8,160
Karma: 5450818
Join Date: Nov 2009
Device: many
|
Hi,
Instead of making all metadata elements lists, which is a bit messy code-wise (especially for something that is not a common event), it may be easier and cleaner to check if a value with that key already exists and, if so, append a string delimiter (this can be any unique identifier string we want, '&#$%' or whatever) and then add the new data to the end. That way, whether there is one author or many, all data is stored in a simple string in the metadata dictionary. This is clean and easy to do using .get(key, '') to return either the current value for that key or the empty string; if it is not empty, you append the string delimiter, then you append the new value. It also works with encoding to utf-8 quite easily.

When we go to write it out, simply split on the string delimiter and write out each piece. If there is no delimiter present in the string, you will only write out one.

As for keeping all values for metadata, I am for that, but we need to be careful in that some mobis will have binary data in some metadata values (left over from keys previously used for DRM, etc.) and we can run into byte values that do not exist in utf-8. So we may want to hex or base64 encode these values if you want to maintain them in some way.

My two cents, Kevin |
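A minimal sketch of the delimiter approach Kevin outlines; the delimiter value and the helper name are arbitrary choices for illustration:

```python
DELIM = '&#$%'  # any unique marker string would do

def add_value(metadata, key, value):
    # Append to the existing value with a delimiter, or start fresh.
    existing = metadata.get(key, '')
    if existing:
        existing += DELIM
    metadata[key] = existing + value

metadata = {}
add_value(metadata, 'Creator', 'Author One')
add_value(metadata, 'Creator', 'Author Two')

# On output, split on the delimiter; a single value still yields exactly
# one entry, because str.split returns a one-item list in that case.
authors = metadata['Creator'].split(DELIM)
```

The trade-off versus the all-list design is that the delimiter must never occur inside a real metadata value, which is why a deliberately unlikely marker string is used.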
07-22-2011, 01:56 PM | #90 | ||
Developer
Posts: 155
Karma: 280
Join Date: Nov 2010
Device: Kindle 3 (Keyboard) 3G / iPad 9 WiFi / Google Pixel 6a (Android)
|
Quote:
One might implement a solution that uses strings for single values and a list of strings only if multiple values exist, using type() to distinguish the two cases, but I've already refactored my all-list solution to be usable. I'm almost done (the temporary file code was also removed). Do you want me to just publish it when it's finished, or do you want to take a look before (let me know your email address then)? Quote:
By having a list for them, the code can now warn about any unknown tag it might encounter.

Ciao, Steffen |
||
|