Old 03-07-2012, 05:18 AM   #381
snarkophilus
Wannabe Connoisseur
Posts: 426
Karma: 2516674
Join Date: Apr 2011
Location: Geelong, Australia
Device: Kobo Libra 2, Kobo Aura 2, Sony PRS-T1, Sony PRS-350, Palm TX
Quote:
Originally Posted by Yoths View Post
The number after colon seems to be the character offset, but even that is counted strange (cf. [in the middle of the first line #2] or [at the end of the first line] - this may be because of the — sign, but even after replacing them by a "-" the offsets don't match)
From http://code.google.com/p/epub-revision/wiki/ImplementationProposalSimpleAnnotations it says:
Quote:
N after semicolon (optional) is byte offset in utf8-encoded text node
I think if you count the mdash as two characters that works out?
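One thing worth checking when comparing offsets: in UTF-8 an em dash (U+2014) encodes to three bytes, while a plain "-" is one byte, so a byte offset and a character offset drift apart after every dash. Here is a minimal sketch of the difference. The helper name utf8_byte_offset and the sample string are made up for illustration, and whether the reader really counts bytes this way is only what the wiki page suggests.

```python
def utf8_byte_offset(text: str, char_index: int) -> int:
    """Return how many UTF-8 bytes come before the character at char_index."""
    return len(text[:char_index].encode("utf-8"))


# Hypothetical text node containing an em dash (U+2014).
sample = "Yes\u2014no"

# "n" is the 5th character (index 4), but 6 bytes precede it:
# 3 bytes for "Yes" plus 3 bytes for the em dash.
print(utf8_byte_offset(sample, 4))

# After replacing the dash with "-", the byte and character offsets agree again.
print(utf8_byte_offset(sample.replace("\u2014", "-"), 4))
```

If the offsets still don't match after swapping the dashes for "-", it may be worth looking for other multi-byte characters (curly quotes, non-breaking spaces) in the same text node.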


I've also figured out that for my "current_position" table, the current location in the book is that of the last character on the page, not the first character. For example, the current location at the first page of a chapter in one book is OEBPS/Text/asimov-youthebook-3.html#point(/1/4/12/1:248). The /1/4 seems to be constant everywhere I've seen (1 for html, 4 for ????), and indeed the last character on the page is the 248th character in the 12th element (remembering to count each <p> as an element and also the whitespace between each paragraph as an element).

Also, the end of a chapter (or, more rightly I suspect, of an HTML file) is simply OEBPS/Text/asimov-youthebook-3.html#point(:1), so it would seem a bare :1 represents that. How that is actually worked out (there's no reference to the DOM tree structure!) I have no idea.
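To make the pieces of these location strings easier to compare, here is a rough sketch that splits one into the file name, the list of node steps and the optional offset after the colon. The names Point, POINT_RE and parse_point are invented for this example, and the pattern only covers the forms seen in this thread, so treat it as a guess at the format rather than a full parser.

```python
import re
from typing import List, NamedTuple, Optional


class Point(NamedTuple):
    file: Optional[str]    # part before '#', if present
    steps: List[int]       # e.g. [1, 4, 12, 1]
    offset: Optional[int]  # number after ':', if present


# Pattern inferred only from the examples in this thread (an assumption).
POINT_RE = re.compile(
    r"^(?:(?P<file>[^#]*)#)?point\((?P<path>(?:/\d+)*)(?::(?P<offset>\d+))?\)$"
)


def parse_point(location: str) -> Point:
    """Split a 'file#point(/a/b/c:n)' string into its parts."""
    match = POINT_RE.match(location)
    if match is None:
        raise ValueError(f"unrecognised location: {location!r}")
    steps = [int(s) for s in match.group("path").split("/") if s]
    offset = match.group("offset")
    return Point(match.group("file"), steps,
                 int(offset) if offset is not None else None)


print(parse_point("OEBPS/Text/asimov-youthebook-3.html#point(/1/4/12/1:248)"))
print(parse_point("OEBPS/Text/asimov-youthebook-3.html#point(:1)"))
```

The second call comes back with an empty list of steps and an offset of 1, which at least makes the "end of file" case easy to spot when dumping the table.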

For your first example, point(/1/4/2/6/1/1), we have the same /1/4 prefix. I'm guessing the /2 is caused by the <div> and the whitespace before that (which my book didn't have), then the 6 is the 6th element (<h3>, <hr>, <p> and the whitespace before each), the next 1 is the <em> (with no whitespace before the <em>, so it's 1 and not 2) and the final 1 is the text node after the <em>. I haven't double checked all that, but it mostly makes sense at my first attempt.

In your example for near the start of the second line we have point(/1/4/2/8/1:3), so the 8 means it's two nodes after the first line (remembering that the whitespace between paragraphs counts as a node). There's only a single /1 after that, as there is a single <p> without the <em> that the first paragraph has, and obviously "DON" starts 3 characters after the start of the text node.
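The counting described above can be tried out with a small sketch using Python's xml.dom.minidom, which keeps whitespace-only text nodes as children. The SAMPLE markup and the resolve helper are made up for illustration (it is not the real book), and the idea that each step is a 1-based index into a node's children, whitespace included, is the working guess from this thread rather than anything confirmed.

```python
from xml.dom import minidom

# Hypothetical chapter, laid out like the examples discussed above.
SAMPLE = """<html>
<head><title>t</title></head>
<body>
<div>
<h3>Title</h3>
<hr/>
<p><em>First</em> line</p>
<p>Oh DON'T worry</p>
</div>
</body>
</html>"""


def resolve(doc, steps):
    """Follow 1-based child indices, counting whitespace text nodes too."""
    node = doc
    for step in steps:
        node = node.childNodes[step - 1]
    return node


doc = minidom.parseString(SAMPLE)

# /1/4 lands on <body> here: whitespace, <head>, whitespace, <body>.
print(resolve(doc, [1, 4]).tagName)

# point(/1/4/2/6/1/1): the text node reached via the <em> in the first <p>.
print(resolve(doc, [1, 4, 2, 6, 1, 1]).data)

# point(/1/4/2/8/1:3): skip 3 characters into the second paragraph's text.
print(resolve(doc, [1, 4, 2, 8, 1]).data[3:])
```

In this sample the 4 falls out naturally as <body> being the fourth child node of <html> once the whitespace before <head> and <body> is counted, which may be where the constant /1/4 comes from. The sample text is plain ASCII, so character and byte offsets are the same here.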

Hopefully we're on the way to understanding this a bit better now!

Cheers,
Simon.