Pocketbook dictionary format revisited

Markismus · 11-30-2019, 06:47 AM

I am maintaining the script PocketBookDic to convert dictionaries to xdxf-, Pocketbook dic- and Stardict ifo-format. It's at the github repository PocketBookDic.

Conversion to dic-format needs a Windows program, converter.exe, and language configuration files. They can be found at the github repository LanguageFilesPocketbookConverter.

Nowadays, I am looking into more heuristic approaches to convert free-format dictionaries, such as mobi-files and Kindle dictionaries (azw-, azw3-files). Typically the conversion has an intermediate html-stage, which has to be interpreted and converted to the central xdxf-format before it can be converted to other formats such as Stardict optimized for Koreader, Pocketbook dic-format or mdict-format.*

Over 20 dictionaries in xdxf- (human readable), Stardict- and Pocketbook's dic- (binary) format can be found here. (The xdxf-files can be converted with converter.exe to the dic-files. If you want to tweak your dictionary, this is the place to do it.)

16th November 2021: Getkey just did some testing and the scripting for pocketbook format has been updated to handle unicode characters better. All pocketbook dictionaries have been recreated.

For those that can't be charmed by the tinkerings needed for conversion, post a request and link to your dictionary files and I'll try and convert them.

11th November 2022:

Due to an excessive amount of traffic (more than 50GB this month) pCloud is restricting access. As it is a moving total, access should be restored within a few days. (The last 2 days in the graph show restricted access. Apparently, nobody is going to pay for pCloud.)

November 13th, 2022:
pCloud access is indeed restored.


January 8th, 2024:
Link to mirror of the pCloud account on 8th of January '24

Hopefully this will spread the traffic.

March 7th, 2024:
This is one of the links to the 6.5GB stardict-tarball that has been floating around the internet for over a decade. This is the starting place for checking whether a dictionary is available in Stardict format.
___________________________
*Only implemented to check whether there would be a speed improvement over Stardict on Onyx Boox systems. It didn't improve.

Markismus · 12-01-2019, 05:05 PM

I spent yesterday trying to guess the restrictions of Pocketbook's dictionary converter.exe* to get the whole of the Oxford Dictionary 2nd Edition into dic-format. The Oxford dictionary has entries up to 115k characters, so it is not odd that converter.exe crashes, just irritating. Duden (de-de) and Oxford Learners Dictionary 8th Ed. (en-en) work with a little tweaking of the xdxf-files.**

Wish I had a clue about that format so I could skip the program converter.exe: The Perl script already runs up to 250 lines!
Does anyone have or know a link to the source code of converter.exe? Does anyone know the format of pocketbook's dic-format, so I can generate it straight from xdxf- or csv-format?

The restrictions known of converter.exe are:

1. A line should not be >4096 bytes. It cuts the line after this length and reports that the XML is missing closing tags.
2. If '&' or '>' are found in the XML content outside of tags, etc., it quits and reports malformed XML.
3. If a dictionary entry definition (a block enclosed by <def> and </def> tags) exceeds 100kB, it crashes without a message. (103916 bytes works, but 104992 bytes already crashes.)***

Possible resolutions are:

1. Split the dictionary entry at the tags or use something like prettify, auto-indent.
2. '&' and '<' should be replaced with '&amp;' and '&lt;'.
3. I can resolve this by splitting an entry into multiple entries with identical lemmas.
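The second resolution can be sketched as follows. This is a minimal Python illustration rather than the Perl of the actual script; the function name `escape_text` is hypothetical, and it assumes the caller passes only text content (the parts outside of tags), since the function cannot tell a real tag from a stray '<'.

```python
import re

# '&' that does not already start something that looks like an entity
# (named, decimal or hexadecimal). Existing entities are left untouched.
_BARE_AMP = re.compile(r"&(?![A-Za-z][A-Za-z0-9]*;|#\d+;|#[xX][0-9A-Fa-f]+;)")


def escape_text(text: str) -> str:
    """Escape stray '&' and '<' in text that sits outside of tags.

    Hypothetical helper: the input must be text content only, not markup.
    """
    text = _BARE_AMP.sub("&amp;", text)
    return text.replace("<", "&lt;")
```

For example, `escape_text("Tom & Jerry <3 &amp; co")` leaves the existing `&amp;` alone and only escapes the stray characters.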

If someone has tinkered with this before and has pointers for me, I would be much obliged.

____________________________________
* I used DictionaryConverter-neu 171109. Search this forum or look here for more info.
** For the conversion of dictionaries to xdxf-format I used linguae. Search this forum or look here for more info.
*** This is different from @Rkomar's post that states that he converted a dictionary with 33283 lines. It seems to be the limit on one dictionary entry.

EDIT:
I just removed all the lines>4096 bytes. The result was:
Loading collates...
Loading morphems...
Loading keyboard...
Loading dictionary file...
140407 words loaded
Sorting dictionary...
Searching for equal words...
Packing dictionary...

maximum block count reached

So it doesn't crash anymore, however, it still can't pack it.

It is slightly larger than Rkomar's claim of 33283 lines: 1,185,340 lines. That's why I wanted it! Maybe it will work if I split the dictionary into 6 parts for Pocketbook, instead of the 2 parts it has now for Stardict... crappy

Markismus · 12-03-2019, 03:26 PM

I have a working Perl script and it's on github. It converts mobi- (KindleUnpacked html), csv-, Stardict- and xdxf-format to Pocketbook dic-format and Stardict formats.

I've successfully converted Liddell-Scott-Jones, Oxford's Learners dictionary, Duden (de-de), a Latin-English dictionary, Nouveau Littre 2011, the Oxford English Dictionary 2nd Ed. and Wordnet.

The results in both xdxf- (human readable) and dic- (binary) format are here. (The xdxf-files can be converted with converter.exe to the dic-files. If you want to tweak your dictionary, this is the place to do it.)

You will also need

pocketbook converter binary and its language configuration files. I've zipped them in the uploaded PockebookDic.zip.
Install Perl
Install Stardict-tools (if you want to convert from Stardict ifo-, dict- and idx-files).

See github for further info.

The zip-file attached contains the newest converter.exe patched by ezdiy from post #6.

ezdiy · 12-03-2019, 04:46 PM

I've patched the binary to remove block count limit (I'm using it for small 200k word dict though, not sure if it really works with larger dicts) and seems to work for me (TM). I've also tried to remove the 4kbyte entry limit, though not sure if successfully (I don't have dicts with defs this long to test).

https://drive.google.com/file/d/1uRx...Q-_cm-QO1r-PX/

Markismus · 12-03-2019, 05:16 PM

How did you patch that? Do you have the source code?

EDIT:
No luck. Still crashes on the Oxford dictionary part 1.

ezdiy · 12-03-2019, 06:37 PM

Quote:

Originally Posted by Markismus

EDIT:
No luck. Still crashes on the Oxford dictionary part 1.

This one works: https://drive.google.com/file/d/1D_h...T-6f8BYgv/view

Turns out the "100kb limit" is actually 64k (after removal of tags). This is a hard limit of DIC format. I've patched the binary to not crash, truncate and report the offending line over limit. But there's not much more that can be done - you'll have to abbreviate the entry or split it via perl. Out of the whole dict there's only one such entry though. Further, the chunks between each < are still limited to 4k, I think, though that can be easily fixed with some re-formatting from perl with no information loss.
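The re-formatting mentioned above could look roughly like this. It is a Python sketch under assumptions, not the script's Perl: the name `reflow_at_tags` is hypothetical, the default limit uses the "4k minus a little slack" figure given later in the thread, and the input is assumed to be a single line (for instance one entry). Newlines are only inserted directly in front of a '<', so removing them again restores the original text.

```python
import re


def reflow_at_tags(xml: str, limit: int = 4096 - 16) -> str:
    """Break one long line into several, only ever in front of a '<'.

    Sizes are counted in UTF-8 bytes, not characters. A single chunk that
    is itself longer than `limit` is emitted unchanged on its own line.
    """
    # Split in front of every '<' without consuming it (Python 3.7+).
    chunks = [c for c in re.split(r"(?=<)", xml) if c]
    lines, current, size = [], [], 0
    for chunk in chunks:
        n = len(chunk.encode("utf-8"))
        if current and size + n > limit:
            lines.append("".join(current))
            current, size = [], 0
        current.append(chunk)
        size += n
    if current:
        lines.append("".join(current))
    return "\n".join(lines)
```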

Quote:

How did you patch that? Do you have the source code?

Kinda.

Markismus · 12-04-2019, 01:53 AM

@ezdiy Great! Thank you!

Quote:

Turns out the "100kb limit" is actually 64k (after removal of tags).

What tags are retained in the conversion? Are color-tags removed? Blockquote, ex, abr?

Quote:

you'll have to abbreviate the entry or split it via perl. Out of the whole dict there's only one such entry though.

The maximum article- and line-lengths are already implemented, so I’ll tune them in the script. That's why the reconstructed xdxf-file still only had one left, that was too long. The original is teeming with them.

What is the limiting entity, precisely? I saw with Greek letters, that it isn't bytes: Some accepted entries stayed below 3500 chars, while being 7500 Bytes. But the chars are not exactly 4k either, somewhat less.

Is there a way to encode for resources? Audio tags for pronunciation? I know Stardict-tools can convert Lingvo audio resources to Stardict format, however, I have no idea how to implement them in xdxf-format, yet. Would be great to use the audio feature of the pocketbook!

Image resources would be nice, too. Maybe with bbencode? I encoded fonts that way into xml when further processing needed it.

Quote:

Kinda

That looks a bit like the disassembler I used as a kid. (I had to hack CGA games to work on my dad's monochrome Hercules graphics card.) What could I look into for that, nowadays?

nhedgehog · 12-04-2019, 04:37 AM

Nice, someone is working on the pocketbook dictionary format.
Do you guys know this program?
http://linguae.stalikez.info/

Markismus · 12-04-2019, 04:41 AM

@nhedgehog Yes, I used it to get the first xdxf-formatted files. It crashes rather neatly and was not unproblematic to install. You wouldn't have to use it anymore with the script. (See the second footnote of the first post in this thread.)

nhedgehog · 12-04-2019, 04:43 AM

This may be interesting too (from a Russian Forum)

Quote:

The name that a Pocketbook displays for any *.dic dictionary can be corrected as follows:
1. Create a text file and write the desired dictionary name in it.
2. Using the wu8.exe program from Alex_None, convert this file to UTF-8.
3. Open the converted file containing the desired name in a hex viewer.
4. Open the *.dic dictionary in a hex editor.
5. Starting at offset 0x40, replace the unreadable name with the desired one, byte by byte.
There is a limit on the length of the name: a maximum of 31 characters (other data starts at offset 0x80). The name must be terminated with two zero bytes (at the latest at offsets 0x7e and 0x7f).
Step 2 can also be done with ordinary Notepad; in that case, when viewing the file in hex mode, ignore the first 3 bytes, EF BB BF.

There is an app (dicrename.exe) in one of the converter folders.
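The quoted recipe could be automated along these lines. This is an unverified Python sketch that follows the quote literally: the name is encoded as UTF-8, written at offset 0x40, and has to leave room for two zero bytes before offset 0x80. Filling the rest of the field with zero bytes is my own assumption, and `set_dic_name` is a hypothetical name, not the bundled dicrename.exe. The function works on bytes in memory, so try it on a copy of the dictionary first.

```python
NAME_OFFSET = 0x40
NAME_END = 0x80                      # other data starts here (per the quote)
NAME_FIELD = NAME_END - NAME_OFFSET  # 64 bytes, including the terminator
MAX_CHARS = 31                       # limit stated in the quote


def set_dic_name(data: bytes, name: str) -> bytes:
    """Return a copy of a .dic image with the display name replaced."""
    encoded = name.encode("utf-8")   # encoding as stated in the quote
    if len(name) > MAX_CHARS or len(encoded) > NAME_FIELD - 2:
        raise ValueError("name too long for the 0x40-0x7f field")
    if len(data) < NAME_END:
        raise ValueError("file too short to contain a name field")
    # Assumption: zero-fill the remainder, which also gives the two
    # terminating zero bytes the quote asks for.
    field = encoded.ljust(NAME_FIELD, b"\x00")
    return data[:NAME_OFFSET] + field + data[NAME_END:]
```

Usage would be something like reading the file with `open(path, "rb").read()`, calling `set_dic_name`, and writing the result to a new file.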

Markismus · 12-04-2019, 04:51 AM

@nhedgehog Nice. And it works on the final binary pocketbook dictionary.
If you have a convertible dictionary, the script allows you to alter or keep the name, too:

Code:

$ clear; perl pocketbookdic.pl 

Read dict/stardict-Oxford_English_Dictionary_2nd_Ed._P1-2.4.2/Oxford English Dictionary 2nd Ed. P1.xdxf, returning array. Exiting FiletoArray
 ]lang_from is "". Would you like to change it? (press enter to keep default [eng] 
 lang_to is "". Would you like to change it? (press enter to keep default [eng] 
 format is "visual". Would you like to change it? (press enter to keep default [visual] 
<xdxf lang_from="eng" lang_to="eng" format="visual">
Full_name is "Oxford English Dictionary 2nd Ed. P1".
Would you like to change it? (press enter to keep default [Oxford English Dictionary 2nd Ed. P1]

Markismus · 12-04-2019, 05:30 AM

@ezdiy It works! I converted part 1 of the Oxford English Dictionary 2nd Ed.

I've tested it and it works on my Inkpad 3 Pro. However, I don't know how the double entries work out. So that is something that remains to be tested with a specially devised dictionary.

Marco77 · 12-04-2019, 12:39 PM

Ooooh nice work guys~

Suggestion: maybe create an output format for https://github.com/ilius/pyglossary (or penelope, but it's no longer maintained AFAIK) and get rid of that horrible platform-specific and buggy exe?

The .dic structure seems fairly basic, with a fixed header, a list of sections by "Alpha", morphems.txt, keyboard.txt, and the sections. ZLIB with max compression.
SIRSteiner has figured out 3 formatting options (b, i, br), there may be more. https://www.mobileread.com/forums/sh...0&postcount=14

What do you mean by double entries?

Markismus · 12-04-2019, 01:54 PM

The script generates pocketbook dic- and xdxf-files for all input and Stardict xml-files as intermediary when converting any Stardict triplet of files. Since Penelope can handle both Stardict- and xdxf-files, you’re ready to go.

Ezdiy already patched the most horrible aspects of converter.exe and the script basically smooths all wrinkles left. So there is no buggy binary involved anymore. Just buggy Windows users.

As a solution to too large articles, I split them into multiple articles with the same heading + a symbol. Currently the script uses only the empty string ("") as that symbol. Ideally, we should figure out how to use morphems.txt to make the app judge them as identical. Anyway, I don't know whether this works or needs tweaking.
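The splitting idea can be sketched like this in Python (the real script is Perl, and its exact logic is not shown in this thread). The name `split_article` and the default limit are assumptions. The sketch works on plain text, cuts only at spaces, and gives every part the same heading, i.e. the "empty symbol" variant; it does not try to keep tags balanced across parts.

```python
def split_article(key: str, body: str, limit: int = 64 * 1024 - 16):
    """Split one article into several (key, part) pairs of <= limit bytes.

    Sizes are UTF-8 bytes. A single word longer than `limit` is emitted
    as its own, still oversized, part.
    """
    parts, current, size = [], [], 0
    for word in body.split(" "):
        n = len(word.encode("utf-8"))
        sep = 1 if current else 0        # the joining space, if any
        if current and size + sep + n > limit:
            parts.append(" ".join(current))
            current, size, sep = [], 0, 0
        current.append(word)
        size += sep + n
    if current:
        parts.append(" ".join(current))
    return [(key, part) for part in parts]
```

Joining the parts with single spaces gives back the original body, so nothing is lost by the split.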

Could you elaborate on how you analyzed the dic-files? I would really like to generate them directly from Perl.

ezdiy · 12-04-2019, 02:11 PM

Quote:

Originally Posted by Markismus

@ezdiy Great! Thank you!
What tags are retained in the conversion? Are color-tags removed? Blockquote, ex, abr?

The tags it interprets and encodes as special values are:

Code:

  v24 = "full_name";
  Str = "?xml";
  v23 = "xdxf";
  v29 = "i";
  v25 = "description";
  v26 = "ar"; // this one for each definition entry
  v27 = "k";
  v28 = "b"; // maybe this is for <br> too, due to the shared prefix?

(see next_tag in disassembly).
Unknown tags, it seems to strip, keeping only the text within - I *think*, not really sure.

Quote:

What is the limiting entity, precisely? I saw with Greek letters, that it isn't bytes: Some accepted entries stayed below 3500 chars, while being 7500 Bytes. But the chars are not exactly 4k either, somewhat less.

&escapes; are unescaped (i.e. the limit is counted after unescaping). But it recognizes only lt, gt, quot and amp. All other entities will be put in the output as-is. This may needlessly waste space in the 64k total when there's actually a valid utf8 encoding or, worse, unknown entities may not even be displayed properly (as opposed to their utf8). The input/output is most certainly utf8 only, as it internally performs utf8-aware language-specific collations. However, you must always count underlying bytes, NOT characters. That is, bytes::length() is what matters (per-line and per-entry limits). Character length() can be anything and is irrelevant.

All things considered, here's how you determine entry limit:
1. take one <ar> entry
2. unescape all entities, strip all tags; for recognized ones (i, k, b), count an additional byte. Wrap over-long lines with newlines at a word boundary.
3. the resulting text is what would get encoded (64k bytes limit per whole <ar> body, 4k bytes per line).

For this to make any sense at all, you should encode the input similarly, i.e. keep only i, k, b tags per <ar>. Convert all entities to utf8, except for lt, gt, amp, quot (that can be done internally by convert).
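That pre-encoding advice might be sketched as below, in Python rather than Perl. The name `normalize_entry` is hypothetical, and the tag pattern is a simplification that assumes reasonably well-formed markup. Numeric references that decode to '<', '>', '&' or '"' are turned back into the four named entities, so no bare markup characters are introduced; unknown named entities are left as they are.

```python
import html
import re

KEEP_TAGS = {"i", "k", "b"}
KEEP_ENTITIES = {"lt", "gt", "amp", "quot"}

_TAG = re.compile(r"</?([A-Za-z][A-Za-z0-9_]*)[^<>]*>")
_ENTITY = re.compile(r"&(#\d+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")


def normalize_entry(body: str) -> str:
    """Drop all tags except i/k/b and turn entities into UTF-8 text."""

    def fix_tag(m):
        # Unknown tags are removed; their inner text stays in place.
        return m.group(0) if m.group(1) in KEEP_TAGS else ""

    def fix_entity(m):
        if m.group(1) in KEEP_ENTITIES:
            return m.group(0)
        text = html.unescape(m.group(0))
        # e.g. "&#60;" must not become a bare "<"
        return html.escape(text) if text in ("<", ">", "&", '"') else text

    return _ENTITY.sub(fix_entity, _TAG.sub(fix_tag, body))
```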

The limits should be slightly below 64k and 4k (something like 4k-16 and 64k-16), as it depends on some slack space in there internally and the limits are enforced like that, too.
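Putting the steps and limits above together, a size check could look like this. It is a Python sketch with several assumptions: the names are hypothetical, one extra byte is counted for every occurrence of an i, k or b tag (opening and closing counted separately, which the description leaves open), and the slack of 16 bytes is only the rough figure mentioned above. Only the per-entry limit is checked here.

```python
import re

SLACK = 16                        # "something like" 16 bytes, per the post
ENTRY_LIMIT = 64 * 1024 - SLACK   # per <ar> body
LINE_LIMIT = 4 * 1024 - SLACK     # per line (not checked in this sketch)

_TAG = re.compile(r"</?([A-Za-z][A-Za-z0-9_]*)[^<>]*>")
# "&amp;" comes last so that "&amp;lt;" is not unescaped twice.
_ENTITIES = {"&lt;": "<", "&gt;": ">", "&quot;": '"', "&amp;": "&"}


def encoded_size(ar_body: str) -> int:
    """Estimate the encoded size of one <ar> body in bytes."""
    markers = sum(1 for m in _TAG.finditer(ar_body)
                  if m.group(1) in ("i", "k", "b"))
    text = _TAG.sub("", ar_body)
    for entity, char in _ENTITIES.items():
        text = text.replace(entity, char)
    # Bytes, not characters: Greek letters take 2 bytes each in UTF-8.
    return len(text.encode("utf-8")) + markers


def fits_entry_limit(ar_body: str) -> bool:
    return encoded_size(ar_body) <= ENTRY_LIMIT
```

This also shows why character counts are misleading: an entry of mostly Greek text reaches the byte limit at roughly half the number of characters.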

Quote:

That looks a bit like the de-assembler I used as a kid. (I had to hack CGA games to work on my dad's monochrome Hercules graphics card.) What could I look into for that, nowadays?

It's IDA/hexrays.

Quote:

Is there a way to encode for resources? Audio tags for pronunciation? I know Stardict-tools can convert Lingvo audio resources to Stardict format, however, I have no idea how to implement them in xdxf-format, yet. Would be great to use the audio feature of the pocketbook!

Image resources would be nice, too. Maybe with bbencode? I encoded fonts that way into xml when further processing needed it.

No, the format is a dead end for this reason.

Quote:

Originally Posted by Marco77

Ooooh nice work guys~
Suggestion: maybe create an output format for https://github.com/ilius/pyglossary (or penelope, but it's no longer maintained AFAIK) and get rid of that horrible platform-specific and buggy exe?

Patching the exe is much simpler when it's about quick & dirty solutions. Ultimately, the proper solution is to just use coolreader/koreader, and ditch this dictionary obscurity altogether.

