12-02-2022, 11:26 AM | #1 |
Addict
Posts: 287
Karma: 2534928
Join Date: Nov 2022
Location: Canada
Device: Kobo Aura 2
|
Counting is hard: word breaks and XPath
Hello, thank you for taking the time to read my post.
I am interested in counting words in ebooks. “Simple!” you might respond. “Use the Count Pages plugin!” If this is your suggestion, then thank you – I already do! However, I am specifically interested in counting the words in only part of an ebook. I got this idea when I started using KoboUtilities and examined the format in which the Kobo stores my “last reading position”, which looks like this:
Code:
Text/Chapter02.xhtml#kobo.27.1
Code:
from calibre.ebooks.oeb.polish.container import get_container
from calibre.spell.break_iterator import count_words

book = get_container("/path/to/book.kepub")
# parsed() returns the lxml tree for a file inside the book
chapter = book.parsed("Text/Chapter02.xhtml")
# Join every text node found inside elements that follow the stored position
excerpt = ''.join(chapter.xpath("//*[@id='kobo.27.1']/following::*//text()"))
count_words(excerpt)
Code:
>>> count_words("Hello.How are you?")
3
>>> count_words("‘Hello.’How are you?")
4
My ultimate goal with this code is to take two reading positions like Kobo stores, and count the words between them. There is extra logic involved in determining, “Are the starting and ending positions in the same document? Do the named documents exist in this book? Do the named tags exist in their documents?” which I have omitted here for the sake of brevity.
Thank you again for taking the time to read my post! Even if you can't help, I appreciate your time, and I hope you have a lovely day.
~isarl
Last edited by isarl; 12-02-2022 at 11:28 AM. |
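For readers following along, here is a minimal sketch of the "count the words between two positions" idea for the simplest case, where both positions sit in the same document. It is not the code from this post: it assumes the standard library's ElementTree in place of calibre's container, and a simple regular expression (the hypothetical WORD_RE) in place of calibre's count_words. The helper names split_position, text_between and count_words_between are made up for illustration, and the sketch assumes the end element's own text is excluded from the count.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for calibre's count_words(): a run of word characters,
# optionally joined by internal apostrophes, counts as one word.
WORD_RE = re.compile(r"\w+(?:['’]\w+)*")


def split_position(position):
    """Split 'Text/Chapter02.xhtml#kobo.27.1' into (file name, element id)."""
    name, _, element_id = position.partition("#")
    return name, element_id


def text_between(root, start_id, end_id):
    """Return the text pieces after the start element and before the end element."""
    pieces = []
    state = {"started": False, "done": False}

    def walk(element):
        if element.get("id") == end_id:
            state["done"] = True
            return
        if state["started"] and element.text:
            pieces.append(element.text)
        for child in element:
            walk(child)
            if state["done"]:
                return
            if state["started"] and child.tail:
                pieces.append(child.tail)
        if element.get("id") == start_id:
            # Collection begins once the start element has closed,
            # so the start element's own text is skipped.
            state["started"] = True

    walk(root)
    if not (state["started"] and state["done"]):
        raise ValueError("start or end id not found, or end precedes start")
    return pieces


def count_words_between(root, start_id, end_id, separator=" "):
    """Join the collected text pieces with `separator` and count regex words."""
    text = separator.join(text_between(root, start_id, end_id))
    return len(WORD_RE.findall(text))


# Made-up sample document in the style of a kepub chapter.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<p><span id="kobo.1.1">‘Hello.’</span>'
    '<span id="kobo.1.2">How are you?</span></p>'
    '<p><span id="kobo.2.1">I am fine.</span> '
    '<span id="kobo.2.2">Goodbye.</span></p>'
    '</body></html>'
)

root = ET.fromstring(SAMPLE)
name, start_id = split_position("Text/Chapter01.xhtml#kobo.1.1")
print(name, count_words_between(root, start_id, "kobo.2.2"))
```

The separator is a judgment call. Joining with a space keeps adjacent sentences from being glued together, as in "Hello.How", but it can split a word that inline markup interrupts; joining with an empty string has the opposite trade-off.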
12-02-2022, 01:17 PM | #2 |
creator of calibre
Posts: 44,566
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
See the code in polish/spell.py for many different ways of counting words, some of which you should be able to adapt.
|
12-02-2022, 02:16 PM | #3 |
Addict
Posts: 287
Karma: 2534928
Join Date: Nov 2022
Location: Canada
Device: Kobo Aura 2
|
Thank you Kovid! I appreciate the pointer and look forward to exploring the classes and methods available.
|