Counting is hard: word breaks and XPath

isarl · 12-02-2022, 11:26 AM

Hello, thank you for taking the time to read my post.

I am interested in counting words in ebooks. “Simple!” you might respond. “Use the Count Pages plugin!” If this is your suggestion, then thank you – I already do! However, I am specifically interested in counting the words in only part of an ebook.

I got this idea when I started using KoboUtilities and examined the format in which the Kobo stores my “last reading position”, which looks like this:

Code:

Text/Chapter02.xhtml#kobo.27.1

In other words, it stores the location as the name of the document in the book's spine, and an XHTML tag ID (these tags and their ID values are inserted by the Kobo driver when converting to .kepub). This means that I can use the OEB container data types provided by Calibre to do something like:

Code:

from calibre.ebooks.oeb.polish.container import get_container
from calibre.spell.break_iterator import count_words

book = get_container("/path/to/book.kepub")
chapter = book.parsed("Text/Chapter02.xhtml")
# Gather every text node inside every element that follows the Kobo span
excerpt = ''.join(chapter.xpath("//*[@id='kobo.27.1']/following::*//text()"))
count_words(excerpt)

The above code counts all words in the document (“chapter”/section/split) of that name after the indicated XHTML span. However, because str.join is unaware of display rules for HTML block elements, successive paragraph tags are joined without any space, and usually the first word of the next paragraph is counted along with the last word of the preceding paragraph (although some punctuation results in the correct count). Example:

Code:

>>> count_words("Hello.How are you?")
3
>>> count_words("‘Hello.’How are you?")
4

I am hesitant to roll my own word counting function as I strongly suspect that the ICU code core to count_words is much better than anything I can come up with. Is there perhaps a better XPath query I can use for this, or some other mechanism to excerpt the content I wish to count? Should I still use XPath, only be more intelligent about how I count words? (Perhaps I can drop the "//text()" suffix and be smart about iterating over the returned nodeset, e.g. counting words for each paragraph tag separately? But I'm not sure how I would do this without exhaustively enumerating every possible block-type tag name I might have to consider, and this also completely ignores that an individual book might have style rules which change one or more block elements to display inline.)
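
One possible direction for the block-element problem, as a rough sketch rather than a tested solution: walk the tree in document order and insert a separator whenever a block-level element opens or closes, so that the word counter never sees two paragraphs glued together. Everything named here (`BLOCK_TAGS`, `text_after`, `SAMPLE`) is hypothetical, the tag list is a hand-picked assumption, and the sketch uses the standard library's `xml.etree.ElementTree` so that it runs outside Calibre. It does not consult the book's CSS, so a block element restyled as inline would still get a separator.

```python
import xml.etree.ElementTree as ET

# Assumption: a hand-picked list of block-level tags. A book's stylesheet
# can override these defaults, which this sketch does not detect.
BLOCK_TAGS = frozenset({
    "p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "blockquote",
    "pre", "br", "hr", "tr", "td", "th", "dt", "dd", "section", "article",
})


def local_name(tag):
    """Strip the '{namespace}' prefix; non-string tags (e.g. lxml comments) give ''."""
    return tag.rpartition("}")[2] if isinstance(tag, str) else ""


def text_after(root, start_id, sep="\n"):
    """Return the text that follows the element with id=start_id.

    Text is collected in document order, and `sep` is inserted wherever a
    block-level element starts or ends.
    """
    parts = []
    started = False

    def walk(elem):
        nonlocal started
        is_element = isinstance(elem.tag, str)
        is_block = local_name(elem.tag) in BLOCK_TAGS
        if started and is_block:
            parts.append(sep)
        if started and is_element and elem.text:
            parts.append(elem.text)
        for child in elem:
            walk(child)
        # Collection begins once the start element has been closed.
        if is_element and elem.get("id") == start_id:
            started = True
        if started and is_block:
            parts.append(sep)
        if started and elem.tail:
            parts.append(elem.tail)

    walk(root)
    return "".join(parts)


# Hypothetical sample with no whitespace between the paragraph tags.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<p><span id="kobo.1.1">Skipped.</span></p>'
    '<p><span id="kobo.2.1">‘Hello.’</span></p>'
    '<p><span id="kobo.3.1">How are <b>y</b>ou?</span></p>'
    '</body></html>'
)

excerpt = text_after(ET.fromstring(SAMPLE), "kobo.1.1")
```

The resulting `excerpt` could then be handed to `count_words` in place of the `''.join(...)` result. Inline elements such as `<b>` do not receive a separator, so a word split across inline tags stays in one piece.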

My ultimate goal with this code is to take two reading positions like Kobo stores, and count the words between them. There is extra logic involved in determining, “Are the starting and ending positions in the same document? Do the named documents exist in this book? Do the named tags exist in their documents?” which I have omitted here for the sake of brevity.
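
The bookkeeping described above might look something like the following sketch. It only handles the string side of things: splitting a stored position into a document name and a tag ID, and working out which spine documents lie between two positions. The names `Position`, `parse_position`, and `documents_between` are hypothetical, and the spine is assumed to be a plain list of document names in reading order rather than any particular Calibre object.

```python
from typing import List, NamedTuple


class Position(NamedTuple):
    document: str    # name of the document, e.g. "Text/Chapter02.xhtml"
    element_id: str  # ID of the tag, e.g. "kobo.27.1"


def parse_position(location: str) -> Position:
    """Split 'document#id' into its two parts, rejecting malformed input."""
    document, sep, element_id = location.partition("#")
    if not sep or not document or not element_id:
        raise ValueError(f"not a 'document#id' position: {location!r}")
    return Position(document, element_id)


def documents_between(spine: List[str], start: Position, end: Position) -> List[str]:
    """Return the spine documents from start's document through end's document.

    Assumes `spine` lists document names in reading order. Raises ValueError
    if either document is missing or if start comes after end.
    """
    first = spine.index(start.document)  # ValueError if not in the spine
    last = spine.index(end.document)
    if first > last:
        raise ValueError("start position comes after end position")
    return spine[first:last + 1]


# Hypothetical spine and positions, for illustration only.
spine = ["Text/Chapter01.xhtml", "Text/Chapter02.xhtml", "Text/Chapter03.xhtml"]
start = parse_position("Text/Chapter02.xhtml#kobo.27.1")
end = parse_position("Text/Chapter03.xhtml#kobo.4.2")
span = documents_between(spine, start, end)
```

When `span` holds a single document, both positions are in the same file. Otherwise the first document would be counted from the start tag onward, the last one up to the end tag, and any documents in between in full. Checking that each tag ID actually exists in its document still needs the parsed tree and is left out here.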

Thank you again for taking the time to read my post! Even if you can't help, I appreciate your time, and I hope you have a lovely day.

~isarl

kovidgoyal · 12-02-2022, 01:17 PM

See the code in polish/spell.py for many different ways of counting words, some of which you should be able to adapt.

isarl · 12-02-2022, 02:16 PM

Thank you Kovid! I appreciate the pointer and look forward to exploring the classes and methods available.
