[GUI Plugin] English Noun Frequency

DaltonST · 08-05-2015, 10:56 AM

Summary: Determines 'English Noun Frequencies' for words in a particular book's text, and will optionally:

Add frequencies for the chosen number of frequent nouns to the book's Comments;
Create new Tags using the chosen number of frequent nouns for Tags;
Update a Custom Column with the chosen number of frequent nouns for a Custom Column;
Update nothing but log the frequent nouns using the number chosen for Comments;
Translate the English Comments to another language, showing both;
Accumulate the Top 100 English Nouns with frequency counts across all of your books and all of your libraries.
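
For readers who want a feel for the basic idea, here is a minimal sketch of counting the most frequent words in a text. It is not the plugin's code: the `DISCARD` set and the `top_words` function are hypothetical names, the discard list is a tiny stand-in for the plugin's much larger word lists, and the sketch does not try to tell nouns apart from other words.

```python
import re
from collections import Counter

# Hypothetical stand-in for a real 'always discard' word list.
DISCARD = {"the", "a", "and", "of", "to", "in", "was", "on"}


def top_words(text, limit, discard=DISCARD):
    """Return the `limit` most frequent words as (word, count) pairs."""
    # Lowercase the text and keep only runs of ASCII letters.
    words = re.findall(r"[a-z]+", text.lower())
    # Count every word that is not in the discard set.
    counts = Counter(w for w in words if w not in discard)
    return counts.most_common(limit)


sample = ("The ship left the harbor. The ship was slow, and the harbor "
          "was cold; the ship sank.")
print(top_words(sample, 2))
```

The `limit` argument plays the role of the "chosen number of frequent nouns" in the options above.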

Questions & Answers:

Spoiler:

Requires Minimum Calibre Version: 6.0.0

Version History:

Spoiler:

DaltonST · 08-05-2015, 10:56 AM

For future use only.

DaltonST · 08-05-2015, 10:57 AM

For future use.

DaltonST · 10-01-2015, 11:27 AM

Release 1.0.4 has been posted, and provides some enhanced ToolTips.

DaltonST

insertrealname · 10-06-2015, 03:55 PM

Plugin no longer works on Calibre 2.40 (Windows 7).

Here is the Calibre Debug Log:

Spoiler:

Code:

calibre Debug log
calibre 2.40  isfrozen: True is64bit: False
Windows-7-6.1.7601-SP1 Windows ('32bit', 'WindowsPE')
32bit process running on 64bit windows
('Windows', '7', '6.1.7601')
Python 2.7.9
Windows: ('7', '6.1.7601', 'SP1', 'Multiprocessor Free')
Successfully initialized third party plugins: DeDRM && EpubMerge && Overdrive Link && SmartEject && English Noun Frequency && Count Pages
Starting up...
macmenuhack file_path:C:\Users\XXXXX\AppData\Roaming\calibre\plugins\fanficfare_macmenuhack.txt
Started up in 8.43 seconds with 411 books
windows_user_name XXXXX
Clearing or initializing globals
ENF Control
Loading user custom word rules for use
Building custom column list
Current book id is:  442  ______________________________________________________________________________________
Determing ENF for a single book
Building book path
Loading book file:  C:/Users/XXXXX/Documents/Calibre Library/Marci McDonald/The Armageddon Factor (442)/The Armageddon Factor - Marci McDonald.epub
Loading epub file:  C:/Users/XXXXX/Documents/Calibre Library/Marci McDonald/The Armageddon Factor (442)/The Armageddon Factor - Marci McDonald.epub
Extracting epub text:  C:/Users/XXXXX/Documents/Calibre Library/Marci McDonald/The Armageddon Factor (442)/The Armageddon Factor - Marci McDonald.epub
Filtering text
Length of the input, file_data:  0
Length of text_data prior to re.sub's :  0
Length of text_data after re.sub's :  0
Length of text_data after fancy apostrophes & quotation marks replacement to a simple single quote:  0
Length of text_data after contraction filtering :  0
Length of text_data after apostrophes and quotes are replaced :  0
Length of text_data after html was stripped :  0
Length of text_data after symbols are stripped:  0
Length of text_data after failed contractions are stripped:  0
Length of text_data after contiguous spaces have been reduced to three (3) before and after each remaining word:  0
Condensing text
Applying Change Pair Rules:  #1 of 4
Pass #1 of 2:  Changing Bad Words to Spaces
re.sub  -  miscellany plus |  
Pass #2 of 2:  Changing Bad Words to Spaces - prep ASCII list of current words
Pass #1 of 3:  Changing Bad Words to Spaces - words < 5 letters and not in any good words set
Pass #2 of 3:  Changing Bad Words to Spaces - identifying words per custom bad words set
Pass #3 of 3:  Changing Bad Words to Spaces - identifying words per standard bad words set
Pass #1 of 2:  Identifying English Names to Change to Spaces
Pass #1 of 1:  Identifying Specific Suffixes of Adjectives & Adverbs
Pass #1 of 1:  Identifying All '......ed' verb forms plus all  '......ing' forms that are NOT deverbal nouns (and are already in the standard good list) to Change to Spaces
Changing Previously Identified Words to Spaces
# first custom dict pass at changing plurals to their singulars
# first standard dict pass at changing plurals to their singulars
finished with the first pass for plurals
Finished condense_text
Analyzing text
Applying Change Pair Rules:  #2 of 4
Applying Plural Pairs pass #2 of 2
finished with the second (and more comprehensive) pass for plurals
The length of final_text_list is:  0
Counting the frequency of the entire current list of filtered words
The length of common_list is:  0
Finished counting
Trimming the initial frequency list, and Accumulating the frequencies for the final list of 'good' words
Finished Trimming and Accumulating Frequency Counts
Finalizing the List of Most Frequent Words
Finished the Finalizing of the List of Most Frequent Words
full_book_path for current book:  C:/Users/XXXXX/Documents/Calibre Library/Marci McDonald/The Armageddon Factor (442)/The Armageddon Factor - Marci McDonald.epub
Finalizing accumulated most frequent nouns
Clearing or initializing globals
Job: 1 English Noun Frequency finished
Starting job: English Noun Frequency 
	Starting 'English Noun Frequency' 
	Library DB: C:/Users/XXXXX/Documents/Calibre Library/metadata.db 
	Tue Oct 06 14:38:13 2015 
	Python: Windows   CPython   2.7.9 
	SQLite Version: 3.8.4    [APSW] 
	PRAGMA main.busy_timeout = 2000 
	  
	Beginning 'English Noun Frequency' Processing 
	═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════ 
	  
	Chosen Options:  
	 
	------------------------------------------- 
	 
	Update Comments?  True 
	Maximum Words to Add to Comments:   20 
	Comments Location:  Append 
	Remove Previous ENF Comments Prior to Update?  True 
	  
	------------------------------------------- 
	  
	Update Custom Column?  True 
	Maximum Words in Custom Column:  5 
	Custom Column Specified:  #enf 
	Sort Custom Column Words Alphabetically (not by Frequency)?  False 
	  
	------------------------------------------- 
	  
	Update Nothing.  Just Log the List of Words?  False 
	Update Nothing.  Just Remove Previous ENF Comments?  False 
	  
	------------------------------------------- 
	  
	Accumulate the Most Frequent Nouns in this .csv File:   C:/Users/XXXXX/Documents/Calibre Library/accumulated_most_frequent_nouns.csv 
	Accumulate the Most Frequent Nouns for all books for all jobs?  True 
	Pause the Accumulation of Most Frequent Nouns?  False 
	  
	------------------------------------------- 
	  
	Delete Global First Names?  True 
	Delete the Top 100 Most Common Nouns?  True 
	  
	------------------------------------------- 
	  
	Add New Tags?  False 
	Maximum New Tags:  5 
	Only Add New Tags, or Replace All Existing Tags?  Add 
	  
	------------------------------------------- 
	  
	Is Translation of English Nouns Active?  False 
	English will be Translated to this Language:   None 
	Custom Translation Mapping File to Use:   Select Custom Translation File 
	  
	------------------------------------------- 
	  
	Number of English word pairs in the standard 'singular:plural pair' list:  4,557 
	  
	Number of English words in the standard 'always discard' list:  18,927 
	  
	Number of global first names in the standard  'first names to discard' list:  3,536 
	  
	Number of English words in the standard 'always keep' list:  44,824 
	  
	Number of English words in the standard 'obscenities' list:  49 
	  
	Number of English word pairs in the standard 'change pairs' list:  19 
	  
	Number of English words in the standard 'acronyms to capitalize' list:  54 
	  
	  
	Number of 'User custom good words' loaded from the Calibre Plugin Directory:   0 
	  
	Number of 'User custom bad words' loaded from the Calibre Plugin Directory:    0 
	  
	The 'user custom word change pairs' that were loaded, if any, have been lost. 
	  
	Number of 'User custom word change pairs' loaded from the Calibre Plugin Directory:                                0 
	  
	Number of 'User custom word change pairs' that force a word to all upper case after counting is complete:          0 
	  
	Number of 'User custom word change pairs' that force a word to title case after counting is complete:              0 
	  
	Number of 'User custom word change pairs' that will be Defaulted:                                                  0 
	  
	  
	Default:  Any 'Most Frequent Noun' that does not have a specific rule to force it to all upper case will be titlecased. 
	  
	  
	  
	Number of 'User custom singular:plural pairs' loaded from the Calibre Plugin Directory:    0 
	  
	  
	  
	Lists have been synchronized by 'Priority':  Custom User Good Words > Custom User Bad Words > Standard Good Words > Standard First (Bad) Names > Standard Bad Words. 
	  
	  
	------------------------------------------- 
	  
	  
	Number of selected books for which to determine 'English Noun Frequency':     1 
	  
	  
	Priority sequence in which book formats will be searched until one is found to use:     (1st) TXT    (2nd) EPUB    (3rd) PDF 
	  
	  
	  
	═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════ 
	Book: C:/Users/XXXXX/Documents/Calibre Library/Marci McDonald/The Armageddon Factor (442)/The Armageddon Factor - Marci McDonald.epub 
	  
	  
	Number of verb forms, adjectives and adverbs (not nouns or deverbal nouns) that were deleted based upon their English suffixes: 0 
	  
	  
					----------------------------------------------- 
	  
	  
					----------------------------------------------- 
	  
	No Nouns Were Found in this Book with the Format Shown in the Path. 
	  
	  
	Elapsed time to process this book was: 0 seconds 
	  
	  
	═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════ 
	The accumulated most frequent nouns inception-to-date frequencies were written to your personal .csv file. 
	  
	The number of words with their corresponding frequencies saved to your personal .csv file:        1,934 
	  
	  
	_______________________________________________________________________________________________ 
	  
	Percentage of the  0 total words from this entire Job remaining after discarding all undesired English words: 0.00%, or: 0 net words 
	_______________________________________________________________________________________________ 
	  
	Format: EPUB             Books: 1 
	Format: PDF              Books: 0 
	Format: TXT              Books: 0 
	Format: UNSUPPORTED      Books: 0 
	  
	_______________________________________________________________________________________________ 
	  
	'English Noun Frequency' has completed. 
	  
	  
	  
	  
	Job complete.

DaltonST · 10-06-2015, 06:47 PM

I just ran it against all of the books in one of my test libraries for Calibre 2.40 Windows 64bit, and it worked perfectly.

It looks to me that you are jumping to a conclusion without having posted any empirical data to prove that it is not the .epub's fault.

For example, you did not attach a screen-snip indicating that the "Count Pages" plug-in actually found real pages of "text". Finding none is quite common in .PDF files that were created from scans, since "images" are not "text". "Count Pages" might find zero pages of "text" in a .PDF that is 5 MB in size.

Your log showed no errors, and looks normal other than the fact that it extracted no text from the .epub. I suspect that your .epub has problems.

I suggest that you 'fix' your .epub by:

(1) reconverting it from an epub to an epub;

(2) running it against the excellent "Modify Epub" plug-in, clicking almost all of the checkboxes;

(3) running it against the "Count Pages" plug-in, confirming that it has "real" pages of text, and is not just an .epub version of a scanned .PDF; and,

(4) converting the reconverted and "fixed" .epub to a .txt format, and then running ENF again for that book. ENF will use any .txt it finds before using a .epub format, and will use a .epub format before using a .PDF format. The log indicates that priority, and also will indicate which format it used.

After (4) above, open the .txt format in Notepad, and read it. Are there a large number of real English words? If there are, please PM me the .txt file so I can test with it.
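
As a quick alternative to reading the file by eye, the following sketch counts the word-like tokens inside an .epub's (X)HTML files, using only the Python standard library. It is not part of ENF or "Count Pages"; `count_epub_words` and `_TextCollector` are hypothetical names, and the sketch assumes the book's text is stored as UTF-8 .xhtml/.html/.htm entries in the archive. A result of zero would suggest an image-only or otherwise problematic .epub.

```python
import io
import re
import zipfile
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def count_epub_words(epub):
    """Count runs of ASCII letters in an EPUB's (X)HTML files.

    `epub` may be a filesystem path or a binary file-like object.
    """
    total = 0
    with zipfile.ZipFile(epub) as zf:
        for name in zf.namelist():
            if not name.lower().endswith((".xhtml", ".html", ".htm")):
                continue  # skip images, CSS, fonts, metadata
            parser = _TextCollector()
            parser.feed(zf.read(name).decode("utf-8", errors="replace"))
            parser.close()
            total += len(re.findall(r"[A-Za-z]+", " ".join(parser.chunks)))
    return total


# Demo on an in-memory archive standing in for a real .epub file.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("OEBPS/ch1.xhtml",
                "<html><body><p>The quick brown fox</p></body></html>")
demo.seek(0)
print(count_epub_words(demo))
```

For a real book, pass the path to the .epub file instead of the in-memory demo archive.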

Thanks.

DaltonST

DaltonST · 10-31-2015, 12:11 PM

Release 1.0.5 has been posted. Minor performance tweaks.

Absolutely nothing was changed that would 'break' any ebook that processed properly in Release 1.0.4.

Please note that .PDF files that were created from scans have "images", not "text". For that reason, ENF would find zero "text" in a .PDF that is physically huge in size. The "Count Pages" plugin would find nothing as well.

If you have problems with a particular .EPUB file, please review this post for a suggested course of action: https://www.mobileread.com/forums/sho...d.php?t=263684

DaltonST

Sidetrack · 04-17-2016, 03:46 PM

An interesting plugin. I'd be interested in seeing a similar output of the frequency of "Proper" names and non-English word usage. The characters, places, and invented words do a lot to categorize and compare books. Seems like it would allow you to glossarize books and then compare glossaries to other works.

Pulling proper names out of the copyright pages and "ends" of the book might give you some interesting info on publishers, translators, editors, etc...

jecilop · 06-07-2016, 06:04 PM

@DaltonST,

I have been looking at this plug-in and trying to apply it but am struggling. After re-reading your description here as well as the Q&A, I'm left with the following:

1) Is it possible to set up the plug-in to read capitalized words directly, without first extracting them and converting them to lowercase? I have to wonder if this is what contributes to it taking a LONG time on ONE book. For someone with tens of thousands of books in Calibre, the plug-in then becomes a waste of time in the running (not in the data it could provide). Am I wrong about this contributing to the time it takes to run one book? (For example, I could run Quality Check/Search Epubs and go through a lot of books in comparatively little time.) I am curious, but I also concede that I do not know the code needed to make this work.

2) When I first installed the plug-in (seeing it only in Calibre's list of available plug-ins), I understood it to mean that I could tell it to include the frequency of words I defined. For example, say I want to know the frequency of the word "hall" in a book. This would be basic text and thus include combinations that include it such as "hallmark" as well as including any capitalized version such as "Hall" or "Hallway".
Now, rereading the description and trying to play with the plug-in (at which point I noticed the time it took to run it on default settings for one book), I believe this is not possible.

Is it possible that you could modify your app or create another based on similar principles that does a word count for user-specified words and creates tags based on this?

The purpose would be just as you noted: info about a book that can be very helpful to a user. For example, in my case I'm not fond of books full of vulgarity. Sometimes, you just don't know what you are going to be reading. I'd like to take what I already do via Calibre and improve my "word existence" search to include a count of the frequency of the word I specify, as well as creating tags in a customized column based on the returned info (rather than the comments; example of tag: hall-50). This will help me to better categorize books as to the content and feel of the book.
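
To make the request concrete, here is a minimal sketch of the kind of count described above. It is not something ENF does; `word_count_tags` is a hypothetical function, and the "word-count" tag format simply follows the "hall-50" example. Matching is case-insensitive and counts substrings, so "hall" also matches "Hall", "hallway" and "hallmark".

```python
import re


def word_count_tags(text, words):
    """Build 'word-count' tags for each user-specified word found in text."""
    tags = []
    for word in words:
        # re.escape keeps any punctuation in the word from acting as regex syntax.
        n = len(re.findall(re.escape(word), text, flags=re.IGNORECASE))
        if n:
            tags.append(f"{word.lower()}-{n}")
    return tags


sample = "The Hall led to a hallway; a hallmark of the hall."
print(word_count_tags(sample, ["hall", "stairs"]))
```

Words that never occur produce no tag, so "stairs" is dropped from the result for this sample.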

If this isn't something you can do, do you know of a similar app?

DaltonST · 06-07-2016, 06:15 PM

@jecilop:

The OP says exactly why it was written, and exactly what it does.

What you want is not why it was written, and is not what it does.

Sounds like you should uninstall it.

DaltonST

jecilop · 06-07-2016, 06:46 PM

Ok, thanks for that input. That was the next step, but I thought I civilly asked you about it.

Please consider not everyone who asks about your app is a tool of some sort. I'm not criticizing it. I was just wondering if I missed something in my understanding or if you could expand on it if not.

glennhefley · 06-03-2017, 01:53 AM

Hi, I found your plugin interesting, and oddly useful. I've been playing around with some ideas of how best to utilize a set of data (words) on which several studies have now been done, confirming the results. When I read the description of this plugin I thought you might find it interesting as well.

I just posted, this morning actually, a blog post with the full result data file and the published paper describing the intent and method of gathering, with a bit of purple writing around it... hey, I was tired. There is something there, but I'm just not sure what.

Google, I also discovered this morning, is stepping up their involvement in the word game with the
Sideways Dictionary: https://sidewaysdictionary.com/#/term/phishing
and several other projects on the Jigsaw site:
https://jigsaw.google.com/projects/
My blog is at https://psyopwriter.blog/2017/06/02/...and-dominance/

I would be interested in hearing your thoughts if you have the time. Not sure when I'll stop by again.

Thanks for the plugin, however; it has given me several ideas.

dxcore35 · 03-20-2018, 08:34 AM

After clicking the button for choosing the default location of the accumulated .csv file, I get this error:

Quote:

File "calibre_plugins.english_noun_frequency.enf_dialog ", line 1045, in choose_accumulated_most_common_words_csv_file
File "calibre_plugins.english_noun_frequency.enf_dialog ", line 1072, in build_csv_file_default_path
NameError: global name 'isosx' is not defined

Just change:

Code:

IsOsX()

to

Code:

utils.IsOsX()

DaltonST · 03-20-2018, 10:41 AM

No, actually, "isosx" is a constant. It was not imported prior to use. The correct fix is: from calibre.constants import isosx

I do not have OSX, so this OSX-specific code has never been tested before now.

I will upload a new version in the near future.
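
A minimal sketch of that fix in context is shown below. Only the import line comes from this post; the `ImportError` fallback is an assumption added so the snippet also runs outside calibre, and `default_csv_dir` and its folder choices are purely hypothetical, not the plugin's actual `build_csv_file_default_path` logic.

```python
import os
import sys

try:
    # The import named in this post; only available when running inside calibre.
    from calibre.constants import isosx
except ImportError:
    # Assumption for running outside calibre: detect macOS from sys.platform.
    isosx = sys.platform == "darwin"


def default_csv_dir(home=None):
    """Hypothetical helper: choose a default folder for the accumulated .csv."""
    home = home or os.path.expanduser("~")
    # isosx is a plain boolean constant, so it is tested directly, never called.
    if isosx:
        return os.path.join(home, "Documents")
    return home


print(default_csv_dir())
```

The key point is that `isosx` is read as a value (`if isosx:`), not invoked as a function.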

DaltonST

dxcore35 · 03-21-2018, 06:38 AM

Thank you for the update, the bug is eliminated!

