Quote:
Originally Posted by chaley
The plugin is keeping TXT files in memory. According to this thread on Quora, 1 GB is approximately 900,000 pages, or 179 million "standard" words (5 single-byte characters plus a space). If the text is 100% UTF-8 extended characters, then, using the same assumptions as in the thread, there are 11 bytes per word. A GB would then be 98 million words, or 488,000 pages.
I don't know how many people will be comparing 900,000 page books, or even 488,000 page books.
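For reference, the quoted figures work out if you read "1 GB" as a binary gigabyte (2^30 bytes) and assume roughly 200 words per page, which is what the thread's numbers imply (179M words / 900k pages). A quick sketch of that arithmetic:

```python
# Back-of-envelope check of the quoted estimates.
# Assumptions (not stated explicitly in the quote): 1 GB = 2**30 bytes,
# and ~200 words per page, as implied by 179M words / 900k pages.
GIB = 2**30
WORDS_PER_PAGE = 200

# ASCII case: 5 one-byte characters + 1 space = 6 bytes per word
ascii_words = GIB // 6                       # ~179 million words
ascii_pages = ascii_words // WORDS_PER_PAGE  # ~895,000 -> "approximately 900,000"

# All-extended UTF-8 case: 5 two-byte characters + 1 space = 11 bytes per word
utf8_words = GIB // 11                       # ~98 million words
utf8_pages = utf8_words // WORDS_PER_PAGE    # ~488,000 pages

print(ascii_words, ascii_pages, utf8_words, utf8_pages)
```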
Just curious: how would this work for PDFs?
While I probably wouldn't run it on those books specifically, PDFs downloaded from the Internet Archive use some sort of layered compression, which means that pdftotext can extract
gigabytes of image layers into the temp folder alongside the text layer. (This happens when indexing for FTS or when running the word count plugin, both of which should only need the text layer.) Would the plugin try to keep all of that in memory?