MobileRead Forums - View Single Post - Sony PRS-T1 in Calibre

snarkophilus · 03-07-2012, 05:18 AM

Quote:

Originally Posted by Yoths

The number after colon seems to be the character offset, but even that is counted strange (cf. [in the middle of the first line #2] or [at the end of the first line] - this may be because of the — sign, but even after replacing them by a "-" the offsets don't match)

From http://code.google.com/p/epub-revision/wiki/ImplementationProposalSimpleAnnotations it says:

Quote:

N after semicolon (optional) is byte offset in utf8-encoded text node

I think if you count the mdash as two characters that works out?
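One way to test the byte-offset reading is to count UTF-8 bytes rather than characters. Below is a rough Python sketch; the function name and the sample string are made up for illustration, not taken from the reader or the book. One caveat: in UTF-8 an em dash (U+2014) encodes to three bytes, so if the offsets really are UTF-8 bytes, each dash would add two to a plain character count rather than one.

```python
def utf8_byte_offset(text: str, char_index: int) -> int:
    """Return how many UTF-8 bytes come before text[char_index]."""
    return len(text[:char_index].encode("utf-8"))


# Hypothetical text node: "Yes", an em dash (U+2014), then "no".
sample = "Yes\u2014no"

print(utf8_byte_offset(sample, 3))  # bytes before the dash
print(utf8_byte_offset(sample, 4))  # bytes before the "n", i.e. after the dash
```

Comparing these numbers with the offsets the reader stores for the same spot should show whether it counts bytes or characters.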

I've also figured out that for my "current_position" table, the current location in the book is that of the last character on the page, not the first character. For example, the current location at the first page of a chapter in one book is OEBPS/Text/asimov-youthebook-3.html#point(/1/4/12/1:248). The /1/4 seems to be constant everywhere I've seen (1 for html, 4 for ????), and indeed the last character on the page is the 248th character in the 12th element (remembering to count each <p> as an element and the whitespace between each paragraph as an element).

Also, the end of a chapter (or, more precisely, I suspect, of an HTML file) is simply OEBPS/Text/asimov-youthebook-3.html#point(:1), so it would seem a bare :1 represents that. How that is actually worked out (no reference to the DOM tree structure!) I have no idea.
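For anyone who wants to pick these location strings apart programmatically, here is a minimal Python sketch. It only assumes the three shapes seen in this thread: a file name followed by #point(...), a bare point(...), and the special point(:1). The names Location and parse_location are my own, not anything from the reader's database or from Calibre.

```python
import re
from typing import List, NamedTuple, Optional


class Location(NamedTuple):
    file: str               # part before '#', or "" if absent
    steps: List[int]        # the /1/4/12/1 part as integers
    offset: Optional[int]   # the number after ':', or None if absent


# Optional "file#", then point( /n/n/... optionally followed by :n )
_POINT_RE = re.compile(
    r"^(?:(?P<file>[^#]*)#)?point\((?P<steps>(?:/\d+)*)(?::(?P<offset>\d+))?\)$"
)


def parse_location(value: str) -> Location:
    """Split a location string into file, step list and offset."""
    match = _POINT_RE.match(value)
    if match is None:
        raise ValueError("unrecognised location: %r" % value)
    steps = [int(s) for s in match.group("steps").split("/") if s]
    offset = match.group("offset")
    return Location(
        match.group("file") or "",
        steps,
        int(offset) if offset is not None else None,
    )


print(parse_location("OEBPS/Text/asimov-youthebook-3.html#point(/1/4/12/1:248)"))
print(parse_location("OEBPS/Text/asimov-youthebook-3.html#point(:1)"))
```

The end-of-file form comes out with an empty step list and an offset of 1, which at least makes it easy to special-case.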

For your first example, point(/1/4/2/6/1/1), we have the same /1/4 prefix. I'm guessing the /2 is caused by the <div> and the whitespace before that (which my book didn't have), then the 6 is the 6th element (<h3>, <hr>, <p> and whitespace before each), the next 1 is the <em> (with no whitespace before the <em>, so it's 1 and not 2) and the final 1 is the text node after the <em>. I haven't double checked all that, but it mostly makes sense at my first attempt.

In your example for near the start of the second line we have point(/1/4/2/8/1:3), so the 8 means it's two nodes after the first line (remembering that the whitespace between paragraphs counts as a node), there's only a single /1 after that as there is a single <p> without the <em> that the first paragraph has, and obviously "DON" starts 3 characters after the start of the text node.
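To sanity-check this way of counting, here is a small Python sketch that walks a DOM with every step treated as a 1-based index into the child nodes, whitespace text nodes included. That rule is only my reading of the examples above, not a confirmed description of what the reader does. The markup is invented to mimic the layout described (it is not the real book), and resolve is a hypothetical helper name.

```python
from xml.dom import minidom

# Made-up markup laid out like the structure discussed above; not the real book.
SAMPLE = (
    "<html>\n"
    "<head><title>t</title></head>\n"
    "<body>\n"
    "<div>\n"
    "<h3>Heading</h3>\n"
    "<hr/>\n"
    "<p><em>First</em> line</p>\n"
    "<p>My DON is here</p>\n"
    "</div>\n"
    "</body>\n"
    "</html>"
)


def resolve(root, steps):
    """Follow 1-based child indexes, counting whitespace text nodes too."""
    node = root
    for step in steps:
        node = node.childNodes[step - 1]
    return node


doc = minidom.parseString(SAMPLE)

# point(/1/4/2/8/1:3): walk to the text node, then apply the offset.
# The sample text is plain ASCII, so characters and bytes coincide here.
text_node = resolve(doc, [1, 4, 2, 8, 1])
print(text_node.data[3:6])

# point(/1/4/2/6/1/...): the fifth step lands on the <em> of the first <p>.
print(resolve(doc, [1, 4, 2, 6, 1]).tagName)
```

In this invented markup the 4 lands on <body> simply because of the whitespace nodes around <head>. Whether that is the real reason for the constant /1/4 prefix is only a guess.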

Hopefully we're on the way to understanding this a bit better now!

Cheers,
Simon.