07-27-2019, 02:52 PM | #1 |
Grand Sorcerer
Posts: 5,527
Karma: 100606001
Join Date: Apr 2011
Device: pb360
|
azw3r highlight and note extraction info
I've figured out enough of the azw3r format to extract personal highlights, notes, and maybe bookmarks. (All strictly by inspection.) I've also written a C program to extract highlights and notes (in a text format possibly most suitable as an intermediate stage) and a perl script that uses the extracted highlights and notes to mark up the rawml for the book. azw3r.pl is a perl alternative to the C program which takes the same arguments and produces the same output. Both of these can now extract highlighted text from the book's rawml file. Both might also be used with yjr files from KFX books, but without the capability to extract highlighted text.
Since jhowell's KRDS parser krds.py https://www.mobileread.com/forums/sh...d.php?t=322172 is general and complete, I've put the details of my partial reverse engineering in spoiler tags. Spoiler:
The C code and perl scripts are on GitHub at https://github.com/jps-e/azw3r and attached here, along with a sed script to make the rawml viewable in a web browser. ETA: The C and perl have been updated. ETA: New release attached to this post as azw3r-0.1.7.zip. See post #29 for details of added features. Last edited by j.p.s; 09-07-2019 at 07:25 PM. Reason: New release 0.1.7 |
07-28-2019, 02:47 PM | #2 |
BLAM!
Posts: 13,497
Karma: 26047188
Join Date: Jun 2010
Location: Paris, France
Device: Kindle 2i, 3g, 4, 5w, PW, PW2, PW5; Kobo H2O, Forma, Elipsa, Sage, C2E
|
Not being a Java guy at all, I've always wondered if those (and a few other things) weren't some weird Java binary storage/serialized format...
|
07-28-2019, 03:06 PM | #3 | |
Wizard
Posts: 1,086
Karma: 6719822
Join Date: Jul 2012
Device: Palm Pilot M105
|
Quote:
I would doubt that it's Java serialization since that is rather fragile; a slight change to a class could break compatibility. But for other binary encodings, who knows. |
|
07-28-2019, 03:42 PM | #4 |
BLAM!
Posts: 13,497
Karma: 26047188
Join Date: Jun 2010
Location: Paris, France
Device: Kindle 2i, 3g, 4, 5w, PW, PW2, PW5; Kobo H2O, Forma, Elipsa, Sage, C2E
|
Because most of the Kindle backend is in Java.
|
07-28-2019, 05:38 PM | #5 |
Grand Sorcerer
Posts: 5,527
Karma: 100606001
Join Date: Apr 2011
Device: pb360
|
|
08-09-2019, 03:31 PM | #6 |
curly᷂͓̫̙᷊̥̮̾ͯͤͭͬͦͨ ʎʌɹnɔ
Posts: 3,008
Karma: 50506927
Join Date: Dec 2010
Location: ♁ ᴺ₄₅°₃₀' ᵂ₇₃°₃₇' ±₆₀"
Device: K3₃.₄.₃ PW3&4₅.₁₃.₃
|
|
08-10-2019, 03:17 AM | #7 |
hopeless n00b
Posts: 5,110
Karma: 19597086
Join Date: Jan 2009
Location: in the middle of nowhere
Device: PW4, PW3, Libra H2O, iPad 10.5, iPad 11, iPad 12.9
|
Awesome work! Question though, how do you use this (syntax)? I'm assuming Linux only? Will this work on a Linux LiveUSB?
Thanks! |
08-10-2019, 09:41 AM | #8 |
curly᷂͓̫̙᷊̥̮̾ͯͤͭͬͦͨ ʎʌɹnɔ
Posts: 3,008
Karma: 50506927
Join Date: Dec 2010
Location: ♁ ᴺ₄₅°₃₀' ᵂ₇₃°₃₇' ±₆₀"
Device: K3₃.₄.₃ PW3&4₅.₁₃.₃
|
En passant, I've also been using Kindle Mate to store notes, highlights and vocabulary builder words (but not bookmarks).
|
08-10-2019, 10:37 AM | #9 | |
hopeless n00b
Posts: 5,110
Karma: 19597086
Join Date: Jan 2009
Location: in the middle of nowhere
Device: PW4, PW3, Libra H2O, iPad 10.5, iPad 11, iPad 12.9
|
Quote:
|
|
08-10-2019, 01:26 PM | #10 | |
Grand Sorcerer
Posts: 5,527
Karma: 100606001
Join Date: Apr 2011
Device: pb360
|
Quote:
And it looks like on older firmware, mbp for MOBI. I played around with them a bit and their formats for highlights and notes are all different and deserve their own threads. After they are all sorted out, maybe someone can start a thread for an application that automatically handles all of them. In all cases, there is "junk" between the notes header and the text of the note. I might edit this post later to show excerpts inside spoiler tags. |
|
08-10-2019, 02:54 PM | #11 | |
Grand Sorcerer
Posts: 5,527
Karma: 100606001
Join Date: Apr 2011
Device: pb360
|
Quote:
Microsoft has been misleadingly claiming POSIX compliance for decades, but it is my understanding that the Windows Subsystem for Linux (or whatever it is called) is the real deal. If you have a C compiler and the program doesn't work, I can make a small change so that it just reads the entire azw3r file into a buffer, since it is very unlikely that one would ever be too large for that. Maybe a Pythonista will come along and crank out a Python equivalent or improvement.
To compile:
Code:
cc -o azw3r azw3r.c
Code:
azw3r -i name.azw3r > name.notes
azw3r -h -i name.azw3r > name.highlights
azw3r -h -n -i name.azw3r > name.notes
azw3r -i name.azw3r | sort -n > name.notes
Code:
97434 97443 Note: 'Not correct definition for this book.'
114792 114796 Note: 'Should be in x-ray terms category.'
135617 135632 Note: 'Same as Tut'
533488 533494 Note: 'Not a person.'
553723 553726 Note: 'Not a podcast.'
712228 712235 Note: 'Not a video game.' |
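For anyone who wants to post-process the listing, here is a minimal Python sketch that reads lines shaped like the sample output above. It is not part of the azw3r release. It assumes one note per line, that the two leading numbers are start and end positions, and that the note text sits between the first and last single quote on the line; the names `Note` and `parse_notes` are made up for illustration.

```python
import re
from typing import List, NamedTuple


class Note(NamedTuple):
    start: int  # first number on the line (assumed to be a start position)
    end: int    # second number on the line (assumed to be an end position)
    text: str   # note text without the surrounding single quotes


# Assumed line shape, taken from the sample listing: <start> <end> Note: '<text>'
NOTE_LINE = re.compile(r"^(\d+)\s+(\d+)\s+Note:\s+'(.*)'\s*$")


def parse_notes(listing: str) -> List[Note]:
    """Return the notes found in a listing, ordered by start position.

    Lines that do not match the assumed shape are skipped silently.
    """
    notes = []
    for line in listing.splitlines():
        match = NOTE_LINE.match(line.strip())
        if match:
            notes.append(Note(int(match.group(1)), int(match.group(2)), match.group(3)))
    return sorted(notes, key=lambda note: note.start)


if __name__ == "__main__":
    sample = (
        "135617 135632 Note: 'Same as Tut'\n"
        "97434 97443 Note: 'Not correct definition for this book.'\n"
    )
    for note in parse_notes(sample):
        print(note.start, note.end, note.text)
```

The greedy match up to the last quote means an apostrophe inside a note is kept as part of the text. Notes that span several lines, if the tools ever emit them, would need different handling.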
|
08-11-2019, 07:11 PM | #12 |
Grand Sorcerer
Posts: 5,527
Karma: 100606001
Join Date: Apr 2011
Device: pb360
|
I've attached a perl script, azw3r.pl, to the first post as the gzipped file azw3r.pl.gz. It provides the same functionality as the azw3r.c program and should run on any platform that has perl installed. Same syntax, e.g.
Code:
azw3r.pl -i name.azw3r > name.notes
or
perl azw3r.pl -i name.azw3r > name.notes
Last edited by j.p.s; 08-11-2019 at 07:14 PM. |
08-11-2019, 09:18 PM | #13 |
Grand Sorcerer
Posts: 5,527
Karma: 100606001
Join Date: Apr 2011
Device: pb360
|
Update:
The azw3r C program and perl script, as is, work on KFX yjr files for highlights or notes separately, that is, using only the -n or the -h option and not both at the same time. It is OK for the yjr file to have both highlights and notes in it. (The C program works with both at the same time on azw3r files.) The perl script, unlike the C program, does work fine for listing both highlights and notes at the same time for both yjr and azw3r files. The perl script notes_insert.pl is not able to process the listings for KFX yjr files. The perl script azw3r.pl probably does not work for notes longer than 255 characters on any file type. This should be easy to fix. |
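When highlights and notes have to be listed in two separate runs, the two listings can be combined afterwards. The Python sketch below is one hypothetical way to do that, not something shipped with azw3r: it assumes every listing line begins with a numeric position, as in the notes sample earlier in the thread, and orders the lines by that number the way `sort -n` would. The function name `merge_listings` and the highlight line used in the demo are invented for illustration.

```python
import re
from typing import List

# Leading integer at the start of a listing line (assumed to be a position).
LEADING_INT = re.compile(r"^\s*(\d+)")


def _position(line: str) -> int:
    """Numeric sort key; lines without a leading number sort as 0."""
    match = LEADING_INT.match(line)
    return int(match.group(1)) if match else 0


def merge_listings(*listings: str) -> List[str]:
    """Combine several listings into one list of lines ordered by position.

    Blank lines are dropped. The sort is stable, so lines that share a
    position keep the order in which the listings were passed in.
    """
    lines = []
    for listing in listings:
        lines.extend(line for line in listing.splitlines() if line.strip())
    return sorted(lines, key=_position)


if __name__ == "__main__":
    notes = "300 310 Note: 'second'\n100 105 Note: 'first'\n"
    highlights = "200 210 hypothetical highlight line\n"  # made-up line shape
    for line in merge_listings(notes, highlights):
        print(line)
```

Sorting on the parsed integer rather than on the raw string avoids the usual problem of "1000" sorting before "200".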
08-13-2019, 04:28 PM | #14 |
Junior Member
Posts: 9
Karma: 10
Join Date: Jul 2019
Device: Kindle PW 2
|
help please
Hi JPS, very interesting work.
Could you please be so kind as to help me a little bit? I have this problem here, and I'd like to understand whether your solution can help me. https://www.mobileread.com/forums/sh...44#post3878444 Thanks! |
08-13-2019, 07:10 PM | #15 | |
Grand Sorcerer
Posts: 5,527
Karma: 100606001
Join Date: Apr 2011
Device: pb360
|
Quote:
You might also look at jhowell's kindle reader data store KRDS https://www.mobileread.com/forums/sh...d.php?t=322172 Last edited by j.p.s; 08-17-2019 at 02:55 PM. Reason: correct typo word -> work |
|
Tags |
azw3r, highlights, highlights and notes, notes |