11-30-2019, 07:47 AM | #1 |
Guru
Posts: 911
Karma: 149881
Join Date: Jul 2013
Location: Netherlands
Device: HiSenseA5ProCC, Cracked OnyxNotePro, Note5, Kobo Glo, Aura
|
Pocketbook dictionary format revisited
I am maintaining the script PocketBookDic to convert dictionaries to xdxf-, Pocketbook dic- and Stardict ifo-format. It's at the github repository PocketBookDic.
Conversion to dic-format needs a Windows program, converter.exe, and language configuration files. They can be found at the github repository LanguageFilesPocketbookConverter.
Nowadays, I am looking into more heuristic approaches to convert free-format dictionaries, such as mobi-files and Kindle dictionaries (azw-, azw3-files). Typically, such a conversion has an intermediate html-stage, which has to be interpreted and converted to the central xdxf-format before it can be converted to other formats such as Stardict optimized for Koreader, Pocketbook dic-format or mdict-format.*
Over 20 dictionaries in xdxf- (human readable), Stardict- and Pocketbook's dic- (binary) format can be found here. (The xdxf-files can be converted with converter.exe to the dic-files. If you want to tweak your dictionary, this is the place to do it.)
16th November 2021: Getkey just did some testing and the scripting for the Pocketbook format has been updated to handle unicode characters better. All Pocketbook dictionaries have been recreated.
For those who can't be charmed by the tinkering needed for conversion: post a request with a link to your dictionary files and I'll try to convert them.
11th November 2022: Due to an excessive amount of traffic (more than 50GB this month) pCloud is restricting access. As it is a moving total, access should be restored within a few days. (The last 2 days in the graph show restricted access. Apparently, nobody is going to pay for pCloud.)
November 13th, 2022: pCloud access is indeed restored.
January 8th, 2024: Link to a mirror of the pCloud account as of 8th of January '24. Hopefully this will spread the traffic.
March 7th, 2024: This is one of the links to the 6.5GB stardict-tarball that has been floating around the internet for over a decade. This is the starting place for checking whether a dictionary is available in Stardict format.
___________________________
*Only implemented to check whether there would be a speed improvement over Stardict on Onyx Boox systems. It didn't improve.
Last edited by Markismus; 03-07-2024 at 04:36 AM. Reason: Updated the info |
12-01-2019, 06:05 PM | #2 |
Guru
Posts: 911
Karma: 149881
Join Date: Jul 2013
Location: Netherlands
Device: HiSenseA5ProCC, Cracked OnyxNotePro, Note5, Kobo Glo, Aura
|
I spent yesterday trying to guess the restrictions of Pocketbook's dictionary converter.exe* to get the whole of the Oxford Dictionary 2nd Edition into dic-format. The Oxford dictionary has entries of up to 115k characters, so it is not odd that converter.exe crashes, just irritating. Duden (de-de) and Oxford Learners Dictionary 8th Ed. (en-en) work with a little tweaking of the xdxf-files.**
Wish I had a clue about that format so I could skip the program converter.exe: the Perl script already runs up to 250 lines! Does anyone have or know a link to the source code of converter.exe? Does anyone know the format of Pocketbook's dic-format, so I can generate it straight from xdxf- or csv-format? The known restrictions of converter.exe are
Possible resolutions are:
If someone has tinkered with this before and has pointers for me, I would be much obliged.
____________________________________
* I used DictionaryConverter-neu 171109. Search this forum or look here for more info.
** For the conversion of dictionaries to xdxf-format I used linguae. Search this forum or look here for more info.
*** This is different from @Rkomar's post, which states that he converted a dictionary with 33283 lines. It seems to be the limit on one dictionary entry.
EDIT: I just removed all the lines > 4096 bytes. The result was:
Loading collates...
Loading morphems...
Loading keyboard...
Loading dictionary file...
140407 words loaded
Sorting dictionary...
Searching for equal words...
Packing dictionary...
maximum block count reached
So it doesn't crash anymore; however, it still can't pack it. It is slightly larger than Rkomar's claim of 33283 lines: 1,185,340 lines. That's why I wanted it! Maybe if I split the dictionary into 6 parts for Pocketbook, instead of the 2 parts it is in now for Stardict... crappy.
Last edited by Markismus; 11-16-2021 at 03:32 PM. |
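The filtering step from the EDIT above (dropping every line longer than 4096 bytes) could look roughly like the sketch below. It is Python rather than the Perl of the actual script, and the function name `drop_long_lines` and the choice of UTF-8 for measuring size are assumptions of this sketch, not taken from PocketBookDic.

```python
MAX_LINE_BYTES = 4096  # value from the post; the exact limit inside converter.exe is an assumption


def drop_long_lines(lines, limit=MAX_LINE_BYTES):
    """Return (kept, dropped): lines whose UTF-8 size fits the limit, and the rest."""
    kept, dropped = [], []
    for line in lines:
        size = len(line.encode("utf-8"))  # measure bytes, not characters
        (kept if size <= limit else dropped).append(line)
    return kept, dropped


if __name__ == "__main__":
    sample = ["<ar><k>short</k>ok</ar>", "<ar><k>long</k>" + "x" * 5000 + "</ar>"]
    kept, dropped = drop_long_lines(sample)
    print(len(kept), "kept,", len(dropped), "dropped")
```

Returning the dropped lines as well makes it easy to see which entries were lost and might be worth shortening by hand.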
12-03-2019, 04:26 PM | #3 |
Guru
Posts: 911
Karma: 149881
Join Date: Jul 2013
Location: Netherlands
Device: HiSenseA5ProCC, Cracked OnyxNotePro, Note5, Kobo Glo, Aura
|
I have a working Perl script and it's on github. It converts mobi- (KindleUnpacked html), csv-, Stardict- and xdxf-format to Pocketbook dic-format and Stardict formats.
I've successfully converted Liddell-Scott-Jones, Oxford's Learners dictionary, Duden (de-de), a Latin-English dictionary, Nouveau Littre 2011, the Oxford English Dictionary 2nd Ed. and Wordnet. The results in both xdxf- (human readable) and dic- (binary) format are here. (The xdxf-files can be converted with converter.exe to the dic-files. If you want to tweak your dictionary, this is the place to do it.) You will also need
The zip-file attached contains the newest converter.exe patched by ezdiy from post #6. Last edited by Markismus; 11-14-2020 at 08:10 AM. |
12-03-2019, 05:46 PM | #4 |
Zealot
Posts: 121
Karma: 156515
Join Date: Oct 2019
Device: KT, KPW4, PB740-2
|
I've patched the binary to remove the block count limit (I'm using it for a small 200k-word dict though, not sure if it really works with larger dicts) and it seems to work for me (TM). I've also tried to remove the 4 kbyte entry limit, though I'm not sure if successfully (I don't have dicts with defs this long to test).
https://drive.google.com/file/d/1uRx...Q-_cm-QO1r-PX/ |
12-03-2019, 06:16 PM | #5 |
Guru
Posts: 911
Karma: 149881
Join Date: Jul 2013
Location: Netherlands
Device: HiSenseA5ProCC, Cracked OnyxNotePro, Note5, Kobo Glo, Aura
|
How did you patch that? Do you have the source code?
EDIT: No luck. Still crashes on the Oxford dictionary part 1. Last edited by Markismus; 12-03-2019 at 06:26 PM. |
12-03-2019, 07:37 PM | #6 | ||
Zealot
Posts: 121
Karma: 156515
Join Date: Oct 2019
Device: KT, KPW4, PB740-2
|
Quote:
Turns out the "100kb limit" is actually 64k (after removal of tags). This is a hard limit of the DIC format. I've patched the binary to not crash, but to truncate and report the offending line that is over the limit. But there's not much more that can be done - you'll have to abbreviate the entry or split it via perl. Out of the whole dict there's only one such entry though. Further, the chunks between each < are still limited to 4k I think, though that can be easily fixed with some re-formatting from perl with no information loss. Quote:
|
||
12-04-2019, 02:53 AM | #7 | |||
Guru
Posts: 911
Karma: 149881
Join Date: Jul 2013
Location: Netherlands
Device: HiSenseA5ProCC, Cracked OnyxNotePro, Note5, Kobo Glo, Aura
|
@ezdiy Great! Thank you!
Quote:
Quote:
What is the limiting entity, precisely? I saw with Greek letters, that it isn't bytes: Some accepted entries stayed below 3500 chars, while being 7500 Bytes. But the chars are not exactly 4k either, somewhat less. Is there a way to encode for resources? Audio tags for pronunciation? I know Stardict-tools can convert Lingvo audio resources to Stardict format, however, I have no idea how to implement them in xdxf-format, yet. Would be great to use the audio feature of the pocketbook! Image resources would be nice, too. Maybe with bbencode? I encoded fonts that way into xml when further processing needed it. Quote:
Last edited by Markismus; 12-04-2019 at 03:52 AM. |
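On the characters-versus-bytes observation above: in UTF-8, basic Greek letters take two bytes each and polytonic letters (with breathings and accents) take three, so an entry can be much larger in bytes than in characters. A small Python sketch to illustrate; the sample words and the helper name `char_and_byte_length` are chosen here for illustration only.

```python
def char_and_byte_length(text):
    """Return (number of code points, number of UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))


if __name__ == "__main__":
    samples = [
        "logos",                                              # plain ASCII
        "\u03bb\u03cc\u03b3\u03bf\u03c2",                     # Greek "logos", monotonic
        "\u1f04\u03bd\u03b8\u03c1\u03c9\u03c0\u03bf\u03c2",   # Greek "anthropos", polytonic first letter
    ]
    for sample in samples:
        chars, size = char_and_byte_length(sample)
        print(f"{sample}: {chars} chars, {size} bytes")
```

A ratio of a bit more than two bytes per character, as in the 3500 chars versus 7500 bytes mentioned above, is what a mix of two- and three-byte letters would give.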
|||
12-04-2019, 05:37 AM | #8 |
Guru
Posts: 771
Karma: 625816
Join Date: Sep 2013
Device: EnergySistemEreaderPro, Nook STG, Pocketbook 622, Bookeen Cybooks ...
|
Nice, someone is working on the pocketbook dictionary format.
Do you guys know this program? http://linguae.stalikez.info/ |
12-04-2019, 05:41 AM | #9 |
Guru
Posts: 911
Karma: 149881
Join Date: Jul 2013
Location: Netherlands
Device: HiSenseA5ProCC, Cracked OnyxNotePro, Note5, Kobo Glo, Aura
|
@nhedgehog Yes, I used it to get the first xdxf-formatted files. It crashes rather neatly and was not unproblematic to install. You wouldn't have to use it anymore with the script. (See the second footnote of the first post in this thread.)
|
12-04-2019, 05:43 AM | #10 | |
Guru
Posts: 771
Karma: 625816
Join Date: Sep 2013
Device: EnergySistemEreaderPro, Nook STG, Pocketbook 622, Bookeen Cybooks ...
|
This may be interesting too (from a Russian Forum)
Quote:
|
|
12-04-2019, 05:51 AM | #11 |
Guru
Posts: 911
Karma: 149881
Join Date: Jul 2013
Location: Netherlands
Device: HiSenseA5ProCC, Cracked OnyxNotePro, Note5, Kobo Glo, Aura
|
@nhedgehog Nice. And it works on the final binary pocketbook dictionary.
If you have a convertible dictionary, the script allows you to alter or keep the name, too: Code:
$ clear; perl pocketbookdic.pl
Read dict/stardict-Oxford_English_Dictionary_2nd_Ed._P1-2.4.2/Oxford English Dictionary 2nd Ed. P1.xdxf, returning array. Exiting FiletoArray
]lang_from is "". Would you like to change it? (press enter to keep default [eng]
lang_to is "". Would you like to change it? (press enter to keep default [eng]
format is "visual". Would you like to change it? (press enter to keep default [visual]
<xdxf lang_from="eng" lang_to="eng" format="visual">
Full_name is "Oxford English Dictionary 2nd Ed. P1". Would you like to change it? (press enter to keep default [Oxford English Dictionary 2nd Ed. P1]
Last edited by Markismus; 12-04-2019 at 06:02 AM. |
12-04-2019, 06:30 AM | #12 |
Guru
Posts: 911
Karma: 149881
Join Date: Jul 2013
Location: Netherlands
Device: HiSenseA5ProCC, Cracked OnyxNotePro, Note5, Kobo Glo, Aura
|
@ezdiy It works! I converted part 1 of the Oxford English Dictionary 2nd Ed.
I've tested it and it works on my Inkpad 3 Pro. However, I don't know how the double entries work out. So that is something that remains to be tested with a specially devised dictionary. Last edited by Markismus; 11-14-2020 at 08:11 AM. |
12-04-2019, 01:39 PM | #13 |
Connoisseur
Posts: 55
Karma: 8430
Join Date: Mar 2016
Device: PW3, Clara HD, PB740
|
Ooooh nice work guys~
Suggestion: maybe create an output format for https://github.com/ilius/pyglossary (or penelope, but it's no longer maintained AFAIK) and get rid of that horrible platform-specific and buggy exe?
The .dic structure seems fairly basic, with a fixed header, a list of sections by "Alpha", morphems.txt, keyboard.txt, and the sections. ZLIB with max compression.
SIRSteiner has figured out 3 formatting options (b, i, br), there may be more. https://www.mobileread.com/forums/sh...0&postcount=14
What do you mean by double entries?
Last edited by Marco77; 12-04-2019 at 02:59 PM. Reason: more stuff |
12-04-2019, 02:54 PM | #14 |
Guru
Posts: 911
Karma: 149881
Join Date: Jul 2013
Location: Netherlands
Device: HiSenseA5ProCC, Cracked OnyxNotePro, Note5, Kobo Glo, Aura
|
The script generates Pocketbook dic- and xdxf-files for all input, and Stardict xml-files as an intermediary when converting any Stardict triplet of files. Since Penelope can handle both Stardict- and xdxf-files, you’re ready to go.
Ezdiy already patched the most horrible aspects of converter.exe and the script basically smooths out all the wrinkles left. So there is no buggy binary involved anymore. Just buggy Windows users.
As a solution to overly large articles, I split them into multiple articles with the same heading + a symbol. Currently the script uses only the empty symbol (“”). Ideally, we should figure out how to use morphems.txt to make the app judge them as identical. Anyway, I don’t know whether this works or needs tweaking.
Could you elaborate on how you analyzed the dic-files? I would really like to generate them directly from Perl.
Last edited by Markismus; 12-04-2019 at 03:04 PM. |
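The splitting approach described above (several articles with the same heading, each holding part of the definition) might be sketched as follows. This is Python, not the Perl of the script; the function name `split_article`, the splitting at word boundaries and the default limit are assumptions of this sketch. It works on plain text only, so real xdxf markup would need tag-aware splitting.

```python
def split_article(key, body, max_bytes=60000, suffix=""):
    """Split `body` into chunks of at most `max_bytes` UTF-8 bytes at word boundaries.

    Returns a list of (key + suffix, chunk) pairs. With suffix "" every part
    keeps the identical heading. Runs of whitespace are collapsed to one space.
    """
    chunks, current, size = [], [], 0
    for word in body.split():
        word_size = len(word.encode("utf-8"))
        if word_size > max_bytes:
            raise ValueError("single word exceeds max_bytes")
        extra = word_size + (1 if current else 0)  # +1 for the joining space
        if current and size + extra > max_bytes:
            chunks.append(" ".join(current))       # close the full chunk
            current, size = [word], word_size
        else:
            current.append(word)
            size += extra
    if current:
        chunks.append(" ".join(current))
    return [(key + suffix, chunk) for chunk in chunks]


if __name__ == "__main__":
    for heading, part in split_article("set", "aa bb cc dd", max_bytes=5):
        print(heading, "->", part)
```

The `suffix` parameter is where a distinguishing symbol could go if identical headings turn out not to work on the device.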
12-04-2019, 03:11 PM | #15 | |||||
Zealot
Posts: 121
Karma: 156515
Join Date: Oct 2019
Device: KT, KPW4, PB740-2
|
Quote:
Code:
v24 = "full_name";
Str = "?xml";
v23 = "xdxf";
v29 = "i";
v25 = "description";
v26 = "ar"; // this one for each definition entry
v27 = "k";
v28 = "b" // maybe this is for <br> too, due shared prefix?
Unknown tags, it seems to strip, keeping only the text within - I *think*, not really sure. Quote:
All things considered, here's how you determine the entry limit:
1. Take one <ar> entry.
2. Unescape all entities and strip all tags; for recognized ones (i, k, b), count an additional byte. Wrap over-long lines with newlines at word boundaries.
3. The resulting text is what would get encoded (64k bytes limit per whole <ar> body, 4k bytes per line).
For this to make any sense at all, you should encode the input similarly, i.e. keep only i, k, b tags per <ar>. Convert all entities to utf8, except for lt, gt, amp, quot (that can be done internally by convert). The limits should be slightly below 64k and 4k (something like 4k-16 and 64k-16), as it depends on some slack space in there internally and the limits are enforced like that, too. Quote:
Quote:
Quote:
Last edited by ezdiy; 12-04-2019 at 03:33 PM. |
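The three steps for determining the entry limit can be turned into a rough pre-check before feeding a file to the converter. The sketch below is Python and makes several assumptions that are not confirmed by the post: one extra byte is counted for every opening or closing i, k or b tag, newlines are counted as one byte each, and the limits are taken as exactly 64k-16 and 4k-16. It only reports over-long lines; wrapping them is left out.

```python
import html
import re

ENTRY_LIMIT = 64 * 1024 - 16  # "something like 64k-16" per the post; approximate
LINE_LIMIT = 4 * 1024 - 16    # "something like 4k-16" per the post; approximate
RECOGNIZED = {"i", "k", "b"}  # tags the post lists as recognized

TAG_RE = re.compile(r"</?\s*([A-Za-z_][\w:-]*)[^>]*>")


def encoded_lines(ar_body):
    """Approximate the text that would be encoded for one <ar> body.

    Returns (lines, extra): the tag-stripped, entity-unescaped lines and the
    number of recognized tags (assumed here to cost one byte each).
    """
    extra = sum(1 for m in TAG_RE.finditer(ar_body)
                if m.group(1).lower() in RECOGNIZED)
    # Strip tags first, then unescape, so that &lt;b&gt; is not mistaken for a tag.
    text = html.unescape(TAG_RE.sub("", ar_body))
    return text.split("\n"), extra


def check_entry(ar_body):
    """Report the estimated encoded size and any lines over the per-line limit."""
    lines, extra = encoded_lines(ar_body)
    sizes = [len(line.encode("utf-8")) for line in lines]
    total = sum(sizes) + (len(lines) - 1) + extra  # newlines count one byte each
    return {
        "total_bytes": total,
        "entry_ok": total <= ENTRY_LIMIT,
        "long_lines": [i for i, s in enumerate(sizes) if s > LINE_LIMIT],
    }


if __name__ == "__main__":
    print(check_entry("<k>word</k><i>noun</i> a &amp; b"))
```

Entries with `entry_ok` false would have to be split or abbreviated; entries with indices in `long_lines` only need extra newlines at word boundaries.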
|||||
|