11-21-2022, 06:55 AM   #8
ownedbycats
Custom User Title
ownedbycats ought to be getting tired of karma fortunes by now.ownedbycats ought to be getting tired of karma fortunes by now.ownedbycats ought to be getting tired of karma fortunes by now.ownedbycats ought to be getting tired of karma fortunes by now.ownedbycats ought to be getting tired of karma fortunes by now.ownedbycats ought to be getting tired of karma fortunes by now.ownedbycats ought to be getting tired of karma fortunes by now.ownedbycats ought to be getting tired of karma fortunes by now.ownedbycats ought to be getting tired of karma fortunes by now.ownedbycats ought to be getting tired of karma fortunes by now.ownedbycats ought to be getting tired of karma fortunes by now.
 
ownedbycats's Avatar
 
Posts: 8,740
Karma: 62032183
Join Date: Oct 2018
Location: Canada
Device: Kobo Libra H2O, formerly Aura HD
Quote:
Originally Posted by chaley
The plugin is keeping TXT files in memory. According to this thread on Quora, 1 GB is approximately 900,000 pages, or 179 million "standard" words (5 single-byte characters plus a space). If the text is 100% UTF-8 extended characters then using the same assumptions as in the thread there are 11 bytes per word. A GB would be 98 million words or 488,000 pages.

I don't know how many people will be comparing 900,000 page books, or even 488,000 page books.
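
For reference, chaley's arithmetic works out like this (a quick Python sketch; it assumes 1 GiB and the ~200 words per page implied by the 900,000-page figure):

Code:
# Rough reproduction of the estimates quoted above.
GIB = 1024 ** 3        # the quoted figures line up with 1 GiB
WORDS_PER_PAGE = 200   # implied by 179 million words ~= 900,000 pages

def words_and_pages(bytes_per_word):
    words = GIB // bytes_per_word
    return words, words // WORDS_PER_PAGE

# ASCII text: 5 one-byte characters plus a space = 6 bytes per word
print(words_and_pages(6))   # -> (178956970, 894784)  ~179M words, ~900,000 pages

# All 2-byte UTF-8 characters: 5 * 2 + 1 = 11 bytes per word
print(words_and_pages(11))  # -> (97612893, 488064)   ~98M words, ~488,000 pages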
Just curious: how would this work for PDFs?

While I probably wouldn't run it on those books specifically, PDFs downloaded from the Internet Archive use some kind of layered compression, and pdftotext can end up extracting gigabytes of image layers into the temp folder alongside the text layer. (This happens when indexing for FTS or running the word count plugin, both of which should only need the text layer.) Would the plugin try to keep all of that in memory?
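
For what it's worth, here's a rough way to check how much of one of those PDFs actually survives as text once the image layers are discarded (a sketch only; it calls the standalone poppler pdftotext rather than whatever calibre bundles, and archive_scan.pdf is just a placeholder name):

Code:
import os
import subprocess
import tempfile

def extracted_text_size(pdf_path):
    """Run pdftotext into a temp dir and report how many bytes of text
    would actually end up held in memory, ignoring any image layers."""
    with tempfile.TemporaryDirectory() as tmp:
        txt_path = os.path.join(tmp, "out.txt")
        subprocess.run(["pdftotext", pdf_path, txt_path], check=True)
        with open(txt_path, encoding="utf-8", errors="replace") as f:
            text = f.read()
    return len(text.encode("utf-8"))

size = extracted_text_size("archive_scan.pdf")  # placeholder filename
print(f"text layer: {size / 1024 / 1024:.1f} MiB in memory")

The temp directory is cleaned up when the block exits, so only the decoded text string sticks around.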
