Old 04-26-2011, 06:08 PM   #1
kiwidude
Calibre Plugins Developer
Posts: 4,682
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
[GUI Plugin] Find Duplicates

This plugin will help you to identify duplicate authors, titles, formats, series, publishers, tags and identifiers in your calibre libraries.
  • Duplicate authors are where you have multiple variants of an author due to spacing, punctuation, spelling differences, initials or word order. e.g. Kevin Anderson / Kevin J. Anderson / Anderson, Kevin
  • Duplicate titles are where you have multiple book entries with either the same or varying titles. e.g. Martian Way / The Martian Way / The Martian Way (2010)
  • Duplicate formats are where the contents of a particular format like ePub are binary identical to another in your library
The plugin offers a variety of matching algorithms for finding possible groups of duplicate candidates.
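To make the idea of "similar" title matching concrete, here is a minimal sketch that groups the example titles above under one comparison key. The normalizer (`normalize_title`) and its rules are assumptions made for illustration only; they are not the plugin's actual algorithm.

```python
import re
from collections import defaultdict

# Hypothetical rules for illustration; NOT the plugin's real "similar" algorithm.
LEADING_ARTICLES = ("the ", "a ", "an ")

def normalize_title(title):
    """Reduce a title to a rough comparison key."""
    key = title.lower()
    key = re.sub(r"\(.*?\)|\[.*?\]", " ", key)  # drop bracketed parts such as "(2010)"
    key = re.sub(r"[^a-z0-9 ]", " ", key)       # drop punctuation (ASCII-only sketch)
    key = " ".join(key.split())                 # collapse runs of whitespace
    for article in LEADING_ARTICLES:
        if key.startswith(article):
            key = key[len(article):]
            break
    return key

def group_candidates(titles_by_id):
    """Map {book_id: title} to groups of ids whose titles share a key."""
    buckets = defaultdict(list)
    for book_id, title in titles_by_id.items():
        buckets[normalize_title(title)].append(book_id)
    return [sorted(ids) for ids in buckets.values() if len(ids) > 1]

books = {1: "Martian Way", 2: "The Martian Way", 3: "The Martian Way (2010)", 4: "Dune"}
print(group_candidates(books))  # the three Martian Way entries form one group
```

Each book is visited once and dropped into a bucket, so the cost grows with the number of books rather than the number of book pairs.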

If duplicates are found, you are presented with the results and can resolve the variations (e.g. by deleting or merging). You can also exempt groups from future duplicate comparisons.

Main Features:
  • Searches either your entire library or only the books within any search restriction set at the time you run Find Duplicates.
  • Choose your desired combination of title and author matching from any of "identical", "similar", "soundex", "fuzzy" or "ignore" algorithms.
  • Choose alternative algorithms such as matching identifiers or binary comparison.
  • View the results either one group at a time, or showing all duplicate candidates at once using highlighting to show the groups.
  • When doing author duplicate searches (ignore title), optionally highlight the authors under consideration in the tag browser for ease of renaming
  • Sort the result groups either by title/author (default) or by the size of the group
  • Fine tune the soundex algorithm options to make the matching "fuzzier" or more exact.
  • Optionally include the languages field when comparing titles, so intentionally using the same book title in different languages does not show as duplicates.
  • Optionally have binary duplicate formats automatically removed from your library when doing a binary comparison.
  • Mark the current group as exempt or all groups as exempt from appearing as duplicates again
  • Review your duplicate exemptions with the opportunity to reverse the exemption allowing duplicate consideration again
  • Exempt either individual books (title searches) or authors (author searches)
  • Clicking the clear search button, setting a different restriction or choosing an explicit Clear duplicate results menu option will exit duplicate search mode.
  • Switching libraries or restarting Calibre will also clear any duplicate search results. Your exemptions will be remembered and are stored per library.
  • Customize the keyboard shortcuts for a number of the menu options.
  • Find metadata variations for authors, publishers, series and tags to eradicate unwanted duplicates with an alternative simplified UI to rename them.
  • Find duplicates across multiple libraries, producing a report.
  • When placed on the toolbar, clicking the toolbar button without duplicate groups displayed will display the Find Duplicates options dialog. When results are displayed, clicking on the button will move to the next result. Ctrl+click or shift+click to navigate to the previous result.
  • Use delete key to remove entry from library list in cross library search options.
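As an illustration of the binary comparison feature listed above, the sketch below groups files whose bytes are identical by hashing their contents. It is only one plausible way to do this; the function names (`file_digest`, `binary_duplicates`) and the choice of SHA-256 are assumptions, not a description of how the plugin does it.

```python
import hashlib
import os
import tempfile
from collections import defaultdict

def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large books are never loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def binary_duplicates(paths):
    """Return groups of paths whose contents hash to the same digest."""
    by_digest = defaultdict(list)
    for path in paths:
        by_digest[file_digest(path)].append(path)
    return [group for group in by_digest.values() if len(group) > 1]

# Small demonstration with throwaway files (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    samples = {"a.epub": b"same bytes", "b.epub": b"same bytes", "c.epub": b"other bytes"}
    paths = []
    for name, data in samples.items():
        path = os.path.join(tmp, name)
        with open(path, "wb") as handle:
            handle.write(data)
        paths.append(path)
    groups = [[os.path.basename(p) for p in g] for g in binary_duplicates(paths)]

print(groups)  # a.epub and b.epub end up in the same group
```

A cheap refinement would be to compare file sizes first and hash only files that share a size.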

Special Notes:
Paypal Donations:
  • If you find this or any of my other plugins useful please feel free to show your appreciation. I have spent many hundreds of unpaid hours in their development and support so any encouragement for me to continue is appreciated!
Attached Files
File Type: zip Find Duplicates.zip (507.3 KB, 13517 views)

Last edited by kiwidude; 03-17-2024 at 12:28 AM. Reason: New version
Old 04-26-2011, 06:53 PM   #2
lbik
Reader
Posts: 46
Karma: 162
Join Date: Nov 2010
Location: Hannover
Device: Kindle KB and Kindle Fire HD 8.9
Thank you. Works good.
Old 04-27-2011, 12:37 AM   #3
snafa
Junior Member
Posts: 2
Karma: 10
Join Date: Feb 2011
Device: kobo
I installed this in version 0.7.57 using the newest version of plugin updater and get this error when I try to open the drop down menu. I get a different version of this error when I just click the Find Duplicates icon on the toolbar.

Quote:
Traceback (most recent call last):
File "calibre_plugins.find_duplicates.action", line 113, in about_to_show_menu
File "calibre_plugins.find_duplicates.action", line 131, in update_actions_enabled
AttributeError: 'FindDuplicatesAction' object has no attribute 'duplicate_finder'
I had version .3.0 I think and it worked ok.
Old 04-27-2011, 01:18 AM   #4
collin8579
Member
Posts: 21
Karma: 10
Join Date: Mar 2011
Device: Kindle
So out of curiosity, why couldn't this be a content-based search instead of title/author?
calibre can read the contents and display them.
I know it would take longer,
but if you have a book with 95% of the same words, it's probably a dupe regardless.
Old 04-27-2011, 01:51 AM   #5
DoctorOhh
US Navy, Retired
Posts: 9,867
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Kindle PaperWhite SE 11th Gen
Quote:
Originally Posted by collin8579 View Post
I know it would take longer
I think you may have answered your own question. I'm not a programmer but after following the discussion in the thread that came up with this plugin I think saying it would take longer might just be a bit of an understatement.

Then again, what the heck do I know. The reply should be educational.
Old 04-27-2011, 01:54 AM   #6
darthyoda6
Connoisseur
Posts: 94
Karma: 124056
Join Date: Nov 2010
Location: Canada
Device: Kobo Clara HD, Kindle Paperwhite 10th Gen, Kindle 7th Gen
Quote:
Originally Posted by collin8579
but if you have a book with 95% of the same words, it's probably a dupe regardless
Not always true. Some websites are lazy, and the description can be the same or nearly identical across books in a series (i.e. books 2 and 3). It's not often, but I have seen it.
Old 04-27-2011, 04:26 AM   #7
kiwidude
Calibre Plugins Developer
Quote:
Originally Posted by collin8579
So out of curiosity, why couldn't this be a content-based search instead of title/author?
calibre can read the contents and display them.
I know it would take longer,
but if you have a book with 95% of the same words, it's probably a dupe regardless.
It wouldn't be slow. Slow is far too generous. Glacial would be a better choice of words.

For a start, every format of every book has to be converted to a single format. If you have ever seen the posts on this forum about how it took one particular conversion x hours to run - well multiply that out for users with large libraries and you can see it would have a running time of days if not weeks.

What about all those books that calibre can't convert, like image based PDF files, CBZ files etc? Or people who have empty book entries for wish list items or representing their paperback editions which have no electronic versions to compare? Don't those deserve duplicate consideration too?

Then to round it all off, every time you add even just a single book format to your library, you would have to incur the whole penalty all over again, as it must compare that book's content with every other book. Well, unless you kept that whole temp directory structure of hundreds of thousands if not millions of files around, but even then you must still incur a very expensive cost of reading all the file contents and applying a fuzzy heuristic to compare the text.

By comparison, with this plugin I can test 40000 books in under a second and once my exemptions are in place any future comparisons will take negligible time to perform and maintain.
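To put rough numbers on "compare that book's content with every other book", here is a back-of-envelope sketch. It assumes a naive approach in which every book is compared against every other book once; the helper `pair_count` is purely illustrative and measures nothing about the plugin itself.

```python
def pair_count(n):
    """Number of distinct unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2

for n in (1_000, 40_000):
    print(f"{n} books -> {pair_count(n)} pairwise comparisons")
```

For 40000 books that is 799,980,000 content comparisons for a full run, each one involving whole-book text, whereas matching on title/author keys only needs to look at each book's metadata once.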

That is not to say a content based search would not have some advantages of course. One problem this plugin cannot help you with is books that had the wrong filename or metadata when imported. So you think you have book 5 in a series but in actual fact it is just a copy of book 3 or whatever. However a visual inspection will reveal that, which you should do before you merge identical formats anyway. That was one of the reasons I requested starson to enhance automerge so that identical formats do not have to be discarded, giving you a chance to compare them first.

So, there are some of the reasons why I didn't take that approach. It just isn't workable in my opinion, or certainly not for many users.
Old 04-27-2011, 05:08 AM   #8
kiwidude
Calibre Plugins Developer
Quote:
Originally Posted by snafa View Post
I installed this in version 0.7.57 using the newest version of plugin updater and get this error when I try to open the drop down menu. I get a different version of this error when I just click the Find Duplicates icon on the toolbar.


I had version .3.0 I think and it worked ok.
snafa - did you restart Calibre after updating the plugin? Having done so, are you still getting the error? Also, from running an old beta previously there might be some issue. Try deleting the "Find Duplicates.json" file from your plugins configuration directory.
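For anyone who prefers to script that last step, here is a minimal sketch that moves the settings file aside instead of deleting it, so it can be restored if needed. The helper name `reset_plugin_settings` is hypothetical, and the location of the plugins configuration directory is an assumption you must supply yourself (it varies by platform). Close calibre before running anything like this.

```python
import tempfile
from pathlib import Path

def reset_plugin_settings(plugins_config_dir, filename="Find Duplicates.json"):
    """Rename the plugin's settings file to *.bak; return the backup path or None."""
    settings = Path(plugins_config_dir) / filename
    if not settings.exists():
        return None  # nothing to reset
    backup = settings.with_name(settings.name + ".bak")
    settings.replace(backup)  # overwrites any older backup of the same name
    return backup

# Demonstration against a throwaway directory standing in for the real one.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "Find Duplicates.json").write_text("{}")
    backup_name = reset_plugin_settings(tmp).name
    second_try = reset_plugin_settings(tmp)

print(backup_name, second_try)
```

The plugin should then start from default settings the next time calibre launches.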
Old 04-27-2011, 12:29 PM   #9
snafa
Junior Member
Deleted the Find Duplicates.json and that fixed it. Thank you

Last edited by snafa; 04-27-2011 at 12:43 PM.
Old 04-27-2011, 03:10 PM   #10
drMerry
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
A little (big?) problem.
Calibre on my old PC used 1.2 GB of memory on one particular search.
The error text was lost from the clipboard when I killed calibre.

I used the plugin on the database kiwidude already knows.
I exempted the large list of duplicates (also known to kiwidude).

used:
Title soundex 8
author similar
Show all groups
Sort groups by number of duplicates
Calibre memory size on start: 130 MB
So memory grew about 10 times.
After closing the error, calibre was still open and memory did not decrease.

EDIT:
Because Ctrl+\ was lost, I added \ as the next-result shortcut.

Last edited by drMerry; 04-27-2011 at 03:11 PM. Reason: added custom made option
Old 04-27-2011, 03:30 PM   #11
drMerry
Addict
Feature request:
It would be nice to add an option to exempt books based on author, title-part or tag
and authors based on tags or part of name.

Then it would be possible to exempt:
books of calibre (news)
books with special tag (other version, second edition)
books with special part in name [other version] [.. edition] <- tricky, what would you do in case of 4th edition, 4th edition and 5th edition. Ignore all or show the 2 4th edition versions?

Authors with special label (my fav author, English Author 1950, Dutch Author 1968)
Authors with special parts (Jr.)
Old 04-27-2011, 03:37 PM   #12
kiwidude
Calibre Plugins Developer
drMerry - re your "feature request" - you can do this already by applying a search restriction before you do your duplicate search. Come up with a search that covers all the stuff you want to exempt, save it as a saved search for reuse purposes, set it as the search restriction and you should be good to go.

Re your other problem. Memory usage during "normal" comparisons isn't an issue. I suspect what you have done however is created an enormous exemption group. How many members did it have in it? That is something we may need to think of a more optimal storage strategy for, because storage grows very quickly (roughly with the square of the group size) if your groups start having hundreds (or more) members in them that you try to exempt.
Old 04-27-2011, 03:48 PM   #13
drMerry
Addict
@1 This is a good solution I think, because it is already part of Calibre. On the other hand, I already have a lot of these searches, but that is a personal thing; the solution works for me.

@2
I have a large group indeed.
I exempted the books that gave a problem previously (I did not rename them yet).
I exempted books I previously marked as not duplicates (I put [other version] in the title).
So at the moment there are 269 books exempt (not needed if I use the solution for 1).
The script is fast (I have 2 PCs, and even on my old PC it is a fast process, though slower with more exemptions), so I think a complete test would be no big problem.
To solve the problem, maybe you could use one of the following workflows (I do not know how it is implemented at the moment):

A:
Spoiler:
Test all books
Filter groups that have only exempts in it
remove filtered groups from process (and mem)
output duplicate list


B:
Spoiler:
Test only books not in exempt
Drawback: you would not find new books that match exempt books
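
Workflow A above could be sketched roughly as follows. This is only an illustration of the proposal: the function name and the data shapes (groups as lists of book ids, exemptions as a set of ids) are assumptions, not how the plugin stores anything.

```python
def drop_fully_exempt_groups(groups, exempt_ids):
    """Keep only groups that contain at least one non-exempt book."""
    exempt = set(exempt_ids)
    return [group for group in groups if not set(group) <= exempt]

groups = [[1, 2], [3, 4, 5], [6, 7]]   # candidate groups from a full test of all books
exempt_ids = {1, 2, 3}                 # books already marked as exempt
print(drop_fully_exempt_groups(groups, exempt_ids))
```

The group [1, 2] is dropped because every member is exempt, while [3, 4, 5] survives because books 4 and 5 are new. That avoids the drawback of workflow B, where a new book matching an exempt book would never be reported.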
Old 04-27-2011, 03:58 PM   #14
kiwidude
Calibre Plugins Developer
@drMerry - I think the simplest solution to your issue right now is to do Show all book exemptions, remove all those ones in that group, and instead use a search restriction before you search for duplicates.

The problem I believe is due to the way exemptions are stored, as every book is being stored as being exempt with every other book. This isn't a scalable approach if (as you have) your group contains a massive number of books.
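
To show why that does not scale, here is a minimal sketch of "every book stored as exempt with every other book". The dictionary-of-sets layout is an assumption chosen for illustration and may not match the plugin's real storage format.

```python
from itertools import permutations

def pairwise_exemptions(group):
    """Record each book as exempt against every other book in the group."""
    exemptions = {book_id: set() for book_id in group}
    for book_id, other_id in permutations(group, 2):
        exemptions[book_id].add(other_id)
    return exemptions

def stored_entries(exemptions):
    """Total number of (book, other book) entries held in storage."""
    return sum(len(others) for others in exemptions.values())

for size in (3, 269):
    print(size, "books ->", stored_entries(pairwise_exemptions(range(size))), "entries")
```

A group of 3 books needs 6 entries, but a group of 269 needs 269 x 268 = 72,092. Storing the group once as a single set of ids would need only 269 entries, which is one possible direction for a different storage scheme.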

Right now I will see what others think on the dev thread about how we solve it - either we prevent you marking the group as exempt in the first place by putting in a threshold, or we change the way exemptions are stored. However you have a workaround in the meantime I believe.

In what I would term "normal" usage your exemption groups should not be very big - the 99% scenario I perceive as being 2-3 books/authors in a group. However, if you allow very fuzzy searches and, as in your case, store a large number of near-duplicate titles (as people who store magazines etc. will), this situation will arise.
Old 04-27-2011, 04:03 PM   #15
drMerry
Addict
Correction
Option 1 is not the same.
I can add a filter but if I add a filter like
not Title:"2nd edition"

It would not show duplicates for
2nd edition, 2nd-edition and 2 nd edition

If the option were provided as an exemption, it would be applied by the plugin at run time. So 2nd-edition would match 2nd edition and be shown, because it is a new book.
It would also show new books with 2nd edition, because your exemptions are set per book. New books would not yet have the exempt flag set (the flag is set on books at the moment I add a tag, not every time the plugin runs).