02-15-2023, 04:54 AM | #1 |
Connoisseur
Posts: 88
Karma: 503050
Join Date: Mar 2021
Device: Kindle Voyage
|
Getting Text content of book
What is the proper way of extracting the text from a book from a plugin?
I know I can do
Code:
os.system('ebook-convert', ... |
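For reference, a minimal sketch of the ebook-convert route using `subprocess` instead of `os.system`. It assumes `ebook-convert` is on the PATH and that an output name ending in `.txt` selects plain-text output; the helper names `build_convert_command` and `text_via_ebook_convert` are made up for illustration and are not part of calibre.

```python
import subprocess
import tempfile
from pathlib import Path


def build_convert_command(src, dst):
    """Return the argument list for: ebook-convert <src> <dst>."""
    return ["ebook-convert", str(src), str(dst)]


def text_via_ebook_convert(book_path):
    """Convert a book to TXT in a temporary directory and return the text."""
    with tempfile.TemporaryDirectory() as tmp:
        out_path = Path(tmp) / "book.txt"
        # check=True raises CalledProcessError if the conversion fails.
        subprocess.run(build_convert_command(book_path, out_path),
                       check=True, capture_output=True)
        # Assumption: the TXT output is written as UTF-8.
        return out_path.read_text(encoding="utf-8")
```

Passing the arguments as a list avoids shell quoting problems with book paths that contain spaces.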
02-15-2023, 06:16 AM | #2 |
creator of calibre
Posts: 44,310
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
That way works fine.
|
02-15-2023, 06:27 AM | #3 |
Grand Sorcerer
Posts: 6,212
Karma: 16534894
Join Date: Sep 2009
Location: UK
Device: Kobo: KA1, ClaraHD, Forma, Libra2, Clara2E. PocketBook: TouchHD3
|
I don't know whether it's better or faster, but the calibre plugin 'Count Pages' contains some code for extracting a book's text into one big string. It uses this when calculating a word count for the book.
|
02-15-2023, 09:23 AM | #4 |
Addict
Posts: 287
Karma: 2534928
Join Date: Nov 2022
Location: Canada
Device: Kobo Aura 2
|
Instead of using Calibre's objects I find it simplest to use the Python library ebooklib. Calibre's container types work with exact MIME types whereas ebooklib simply lets me ask for all ITEM_DOCUMENTs in an ebook. Here is some sample code I have written which demonstrates using it to read ebook contents:
Code:
import ebooklib
from ebooklib import epub
from lxml import etree

# Read the EPUB and collect its document (XHTML) items.
book = epub.read_epub("path/to/book.epub")
docs = list(book.get_items_of_type(ebooklib.ITEM_DOCUMENT))
# beware non-UTF8 content! E.g. you might need to .decode("latin1"), or some other encoding, instead.
doctree = etree.fromstring(docs[0].get_body_content().decode())

Good luck with your project.
Last edited by isarl; 02-15-2023 at 09:25 AM. |
02-15-2023, 10:09 AM | #5 | |
Connoisseur
Posts: 88
Karma: 503050
Join Date: Mar 2021
Device: Kindle Voyage
|
Quote:
Unfortunately it is not better, and indeed not good enough. I have some files which look like they were generated as EPUB files by Microsoft Word, and the count_pages algorithm produces text that is about four times larger than the output of ebook-convert. (A quick glance shows thousands of font-family entries which have not been removed by count_pages.) |
|
02-15-2023, 12:52 PM | #6 |
Connoisseur
Posts: 88
Karma: 503050
Join Date: Mar 2021
Device: Kindle Voyage
|
|
02-15-2023, 02:18 PM | #7 |
Addict
Posts: 287
Karma: 2534928
Join Date: Nov 2022
Location: Canada
Device: Kobo Aura 2
|
|
02-15-2023, 03:43 PM | #8 | |
null operator (he/him)
Posts: 20,912
Karma: 27620686
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
Quote:
The ePUBTools Word addin can create EPUBs from within Word, and there are a number of tools, including calibre, that will convert MS Word's native format DOCX files to EPUB. Added: InDesign is a more likely candidate as the source of poorly formed EPUBs. BR Last edited by BetterRed; 02-15-2023 at 03:57 PM. |
|
02-15-2023, 08:41 PM | #9 |
creator of calibre
Posts: 44,310
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
If you care about speed, use the extract_text() function from calibre.db.fts.text.
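A minimal sketch of how that might look from plugin code. It assumes extract_text() takes the path to a book file and returns its text as a single string; check the source in your calibre version before relying on that. `book_text` and `word_count` are hypothetical helper names.

```python
def book_text(path_to_book):
    """Return the book's text using calibre's full-text-search extractor."""
    # Imported lazily: the calibre package is only importable inside
    # calibre's own Python (a plugin, or a script run with calibre-debug).
    from calibre.db.fts.text import extract_text
    # Assumption: extract_text(path) returns the book text as a str.
    return extract_text(str(path_to_book))


def word_count(text):
    """Count whitespace-separated words (a deliberately naive measure)."""
    return len(text.split())
```

Inside a plugin this could be used as `word_count(book_text(path))`, where `path` points at one of the book's format files.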
|
02-16-2023, 09:00 AM | #10 | |
Connoisseur
Posts: 88
Karma: 503050
Join Date: Mar 2021
Device: Kindle Voyage
|
Quote:
It also takes only about 1/20 of the time of a call to ebook-convert. Thanks again. |
|
02-16-2023, 05:22 PM | #11 |
the rook, bossing Never.
Posts: 12,248
Karma: 89531599
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
|
Calibre converting DOCX to EPUB works better than the plugins I've tried on Word & LO Writer. It seems to work better than InDesign and other commercial tools, judging by the commercial ebooks from big publishers.
InDesign should only be used for fancy colour coffee table books and glossy magazines. Calibre is just about perfect from properly formatted DOCX (made by Word, or by an extra Save As in LO Writer) for novels. Also, for ordinary novels, direct PDF export from a differently formatted copy of the WP file beats InDesign too. Also, you can now only rent InDesign. |
Tags |
plugins |