Duplicate detection plugin - Page 7

chaley · 04-16-2011, 10:19 AM

Ahhh, I get it now.

I think that dealing with authors as idolse suggests is in fact a different function, but one that could be integrated rather nicely by using restrictions.

What we want to see is something like the tag browser, listing the authors that are similar. Assume I search for fuzzy authors, and get groups of similar authors. Let's assume I get 10 variants of E Smith. If I can set a restriction to a group, then in the tag browser I will see only those ten variants, along with any co-authors. By clicking on the authors in the TB, I can look at the books and change any authors I wish. Flat-out mistakes can be fixed directly on the tag browser.

All this would take would be to have a mode/option where the restriction is set to a group instead of all duplicates. Of course, I can do this myself by using 'Next Group' to set the search bar, then using the new "*Current search" restriction option to copy that to the restriction.

Given how easy this is, I am not convinced that it needs to be made into dup-check mode. However, I do note that because I changed the line in the restriction box to the search, it is a bit hard to know which line to select to copy the search. I think I will add the '*' to the beginning of the search so that the line in the combobox is more obvious.

Perhaps it would work if 'show one group at a time' set the restriction but left the search empty? In fact, that might be better in general...

kiwidude · 04-16-2011, 10:41 AM

@charles - I'm going to need to think about what you said but fundamentally it sounds like what has been niggling in the back of my mind - wanting an author-centric view of the duplicates for this mode.

From my last few posts, I think it also requires different logic for determining the groups' contents, rationalising/managing them as you navigate, and applying any exemptions.

Using the tag browser to give the author centric filtered view is a great idea, particularly as you say you can quickly rename authors or ctrl+click to see certain combinations of authors you want to focus on.

The "one group at a time" using restriction instead of search is something I hadn't considered. I guess my only slight hesitation is that our reliance on restrictions will preclude the user from (easily) doing any kind of more generic searching while they are contemplating that group. Say, for instance, you wanted for some reason to see all the books for that author. You would have to exit duplicate search mode first and then start it again.

However that "limitation" would also apply to the show all duplicate groups mode, which has a restriction. And perhaps it really just isn't necessary: you have the books you are considering duplicates in front of you, so what more do you want?

I'll apply the restriction for one group at a time - at least that way the behaviour is consistent with show all groups.

kiwidude · 04-16-2011, 02:34 PM

Quote:

Originally Posted by chaley

My opinion:

If all selections are in the group, show a dialog saying that the entire group will be added, not just the selected books.

If some selections are in a different group, show a dialog saying that the entire group will be added and that the selections outside the group will be ignored.

If only one book is selected, and if that book is in the group, then show a dialog saying that the entire group will be added.

Use three different ignore_me checkbox names.

And if no selected rows are in the current group, we have a fourth dialog?

I'm sitting here looking into this now and just not liking it much.

My natural instinct (admittedly as I am testing rather than "using") is to ignore where the green highlighting is and grab a bunch of rows on screen that grab my attention.

I'm thinking all the warning dialogs in the world are not going to get my brain around the fact that I selected rows 3,4,5 but rows 1 & 2 as the "current group" are what exemptions will get added for. The dialog will tell me, but after the first time the brain will just go "yeah yeah" and ignore it in future. I think the disconnect between row selections and actual affected rows is too great.

So I propose one of two things: Either...
(1) When you choose the menu option to Mark group as exempt, it moves the selection to the current group row(s) that are affected. Then the dialog appears. So you get a visual reminder (until you stop the dialog nagging of course). Or...
(2) I go back to the idea of row selection based exemptions.

My reasoning for not allowing the user to do free-form selection was that if the user did not select all the rows within a group, you can get loads of confusion about partitioning without re-running searches etc. And quite possibly the user doesn't actually understand what they are doing, and you would either end up with weird cross group exemptions that make no sense or nothing "happening" because behind the scenes we rationalise them out.

So if...
Group 1 has books (10,11,12)
Group 2 has books (13,14)

What does it mean if the user selects books 12 & 13? Do we think they are trying to say that 12 & 13 are not duplicates of each other? Clearly it makes no sense to create a duplicate exemption for them, since they were not put in the same group.

Alternatively, did they mean that 12 is not a duplicate of anything in its group and neither is 13 in its group? So they want exemptions created of (10,12), (11,12) and (13,14) resulting in only (10,11) being left?

I think we have to give the user the benefit of the doubt in that they mean "not duplicates of each other". So the selection may contain a mixture of valid pairs plus invalid "single row from a group" selections.
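To make the "not duplicates of each other" reading concrete, here is a minimal sketch in Python. It is only an illustration of the idea under discussion, not the plugin's code: the function name `exemption_pairs` and the plain dict of groups are assumptions made for the example.

```python
from itertools import combinations


def exemption_pairs(groups, selected):
    """Hypothetical helper: split a free-form selection into valid pairs and strays.

    groups   -- dict mapping a group id to the list of book ids in that group
    selected -- iterable of book ids the user has selected
    Returns (pairs, ignored): pairs are selected books that share a group,
    ignored are selected books with no selected partner in any group.
    """
    selected = set(selected)
    pairs = set()
    for members in groups.values():
        chosen = sorted(selected.intersection(members))
        # Only books from the same group can be "not duplicates of each other".
        pairs.update(combinations(chosen, 2))
    paired = {book for pair in pairs for book in pair}
    return pairs, selected - paired


groups = {1: [10, 11, 12], 2: [13, 14]}
pairs, ignored = exemption_pairs(groups, [12, 13])
print(len(pairs), sorted(ignored))  # nothing to exempt; 12 and 13 are strays
```

With the groups from the example above, selecting 12 & 13 yields no exemption at all, which is the "nothing happening" case that would need explaining to the user.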

That leaves the final issue with this second option: updating the UI. Unless we re-run the search, for various combinations nothing will actually change on screen. So perhaps we treat it like removing an exemption from the "show all" screen, which does exactly that: it re-runs the search each time you mark a selection as exempt, so any partitioning gets applied etc.
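As a rough picture of what "partitioning gets applied" could mean when the search is re-run, the sketch below splits one group using a set of exemption pairs. It uses a simple greedy pass, which is an assumption for illustration; the plugin's real partitioning logic may well differ, and `partition_group` is a made-up name.

```python
def partition_group(group, exemptions):
    """Greedy sketch: split a group so no subgroup contains an exempt pair.

    group      -- list of book ids reported as duplicates of each other
    exemptions -- iterable of (book_a, book_b) pairs marked "not duplicates"
    Subgroups left with a single book are dropped, since one book alone
    is not a duplicate of anything.
    """
    exempt = {frozenset(pair) for pair in exemptions}
    subgroups = []
    for book in group:
        for sub in subgroups:
            # A book may join a subgroup only if it is not exempt from any member.
            if all(frozenset((book, other)) not in exempt for other in sub):
                sub.append(book)
                break
        else:
            subgroups.append([book])
    return [sub for sub in subgroups if len(sub) > 1]


# The earlier example: exempting (10,12) and (11,12) leaves only (10,11).
print(partition_group([10, 11, 12], [(10, 12), (11, 12)]))  # [[10, 11]]
print(partition_group([13, 14], [(13, 14)]))                # []
```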

Thoughts?

drMerry · 04-16-2011, 04:36 PM

Quote:

Originally Posted by kiwidude

So I propose one of two things: Either...
(1) When you choose the menu option to Mark group as exempt, it moves the selection to the current group row(s) that are affected. Then the dialog appears. So you get a visual reminder (until you stop the dialog nagging of course). Or...
(2) I go back to the idea of row selection based exemptions.
....
Thoughts?

I like the new plugin very much.

I think option 1 is a nice one. I myself like to be able to select them all quickly.
I've got more than 6000 books at the moment.
Because I did not look very closely when adding, and Calibre copies the items rather than moving them, I have around 1000 duplicates in this set.
It takes a lot of time if I handle them one at a time.

Another problem I found on version 0.3 (maybe it was mentioned earlier, but I have not had much time to read it all, and if it was not mentioned I think it is important to know) is that the plugin does not use the selection part (the drop down on the left of the main screen).

I have one author for whom I know I have a lot of duplicates. I made a saved search on him and made the sub selection.
When I start the duplicate scan, it starts scanning all my files, not the 250 I had from this author.

Starson17 · 04-16-2011, 07:02 PM

FYI, I get an error after searching for dupes if all dupes have either been removed or marked exempt.

Spoiler:

kiwidude · 04-16-2011, 07:25 PM

@Starson17 - yeah I found that one too

chaley · 04-17-2011, 05:49 AM

Quote:

Originally Posted by kiwidude

So I propose one of two things: Either...
(1) When you choose the menu option to Mark group as exempt, it moves the selection to the current group row(s) that are affected. Then the dialog appears. So you get a visual reminder (until you stop the dialog nagging of course). Or...
(2) I go back to the idea of row selection based exemptions.

I agree with #1. Selecting all the books in the group will definitely show me what I need to see. Note that you might need to clear/change a search to ensure that all books in the group are visible.

Of course, changing the selections will toss the ones I have, which can cause angst if I was getting ready to edit some metadata. However, I can exhibit learning behavior and not do that.

kiwidude · 04-17-2011, 05:55 AM

Quote:

Originally Posted by chaley

I agree with #1. Selecting all the books in the group will definitely show me what I need to see. Note that you might need to clear/change a search to ensure that all books in the group are visible.

Of course, changing the selections will toss the ones I have, which can cause angst if I was getting ready to edit some metadata. However, I can exhibit learning behavior and not do that.

Good choice

Thanks Charles.

chaley · 04-17-2011, 07:37 AM

@kiwidude: Found what seems to be a problem with the find_dups plugin initialization. The highlight config flag is being reset to False during startup, regardless of what the stored preference value is.

Sequence:
- Start calibre. Highlighting is disabled.
- Enable highlighting
- Exit calibre, then restart
- The highlighting flag is set to the saved value of True
- Find_dups is initialized and clears it back to false.

The stack trace for when the flag is erroneously set to False is:

Code:

  File "site.py", line 103, in main
  File "site.py", line 85, in run_entry_point
  File "calibre_dev\src\calibre\debug.py", line 187, in main
  File "calibre_dev\src\calibre\gui2\main.py", line 382, in main
  File "calibre_dev\src\calibre\gui2\main.py", line 286, in run_gui
  File "calibre_dev\src\calibre\gui2\main.py", line 253, in initialize
  File "calibre_dev\src\calibre\gui2\main.py", line 234, in initialize_db
  File "calibre_dev\src\calibre\gui2\main.py", line 203, in initialize_db_stage2
  File "calibre_dev\src\calibre\gui2\main.py", line 159, in start_gui
  File "calibre_dev\src\calibre\gui2\ui.py", line 329, in initialize
  File "calibre_plugins.find_duplicates.action", line 50, in initialization_complete
  File "calibre_plugins.find_duplicates.duplicates", line 261, in __init__
  File "calibre_plugins.find_duplicates.duplicates", line 277, in clear_duplicates_mode
  File "calibre_plugins.find_duplicates.duplicates", line 283, in restore_previous_gui_state
  File "calibre_dev\src\calibre\gui2\search_box.py", line 382, in set_highlight_only_button_icon
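
One way to picture the sequence above, and a possible guard against it, is the sketch below. The method names `clear_duplicates_mode` and `restore_previous_gui_state` come from the stack trace; everything else (the class name, the dict standing in for GUI state, the `saved_state` attribute) is a hypothetical stand-in, and this is not necessarily how the plugin was actually fixed.

```python
class DuplicateFinderSketch:
    """Toy model of a plugin object that saves and restores GUI state."""

    def __init__(self, gui_state):
        self.gui_state = gui_state   # stand-in for calibre's real GUI/preferences
        self.saved_state = None      # nothing saved until duplicate mode is entered
        self.clear_duplicates_mode() # called at startup, as in the trace

    def enter_duplicates_mode(self):
        # Remember what the user had before the plugin changes anything.
        self.saved_state = dict(self.gui_state)
        self.gui_state['restriction'] = 'duplicates'  # hypothetical value

    def clear_duplicates_mode(self):
        self.restore_previous_gui_state()

    def restore_previous_gui_state(self):
        # Guard: at startup there is no saved state, so leave the user's
        # stored preferences (e.g. the highlight flag) untouched.
        if self.saved_state is None:
            return
        self.gui_state.clear()
        self.gui_state.update(self.saved_state)
        self.saved_state = None


state = {'highlight': True, 'restriction': ''}
DuplicateFinderSketch(state)
print(state['highlight'])  # True: startup no longer resets the flag
```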

kiwidude · 04-17-2011, 07:44 AM

Thx Charles, I will re-test that scenario for the next version. All that code has had to change with hooking into the search cleared event. I don't know if I was just too tired when I wrote it, but you have no idea how something that sounds so simple could cause me so many issues (there are so many subtle ways that the event can get fired). I "think" I got there in the end but I will try to give it a good thrashing. It was a really good suggestion, and it is nice to have a simple toolbar button to exit out of the search mode. It's just that with all the permutations, in the end I took a bit of a brute force approach to ensuring I was not connected when I didn't want to be. The independent nature of the plugin from other actions the user could be doing in the gui makes it a bit more complex than I would have hoped.

@drMerry - the duplicate search is "supposed" to respect the value in that restriction combo on the left. I will retest this with the next version too, as that code has all had to be massaged for this release.

kiwidude · 04-18-2011, 03:34 PM

Got a bit distracted writing the new Goodreads metadata plugin for a few days but that is done and this got some attention again.

Changes in this beta:

- Run the find duplicates search again after removing a duplicate
- Reapply a sort after doing a fresh find duplicates
- Select all rows in the current group when using mark current group as exempt
- Respond to the clear search button on the toolbar to exit duplicate search modes
- When viewing search results one at a time, apply a search restriction rather than a search
- Fix bugs related to highlighting and search restrictions not being remembered/reapplied correctly
- Fix error when no duplicates were found
- Implement a dialog for Manage exemptions for book, with checkboxes allowing you to remove them

I've set the minimum required version of Calibre to 0.7.56 since 0.7.55 is known to have other issues.

So... other than any new grenades I have planted in the code I think from a "Find duplicate book" perspective this is functionally complete?

The big todo item now becomes the handling of the find duplicate author algorithms. I have given it zero thought since my last posts but am expecting it to have some significant differences that I need to think through before I ramble on about here again. Thinking through how I did things with my own tool I believe it will require a separate exemption list, as this is between author name pairings not book pairs. That has a fair few implications but am sure we will figure something out.

As always, feedback appreciated.

Starson17 · 04-18-2011, 04:54 PM

Quote:

Originally Posted by kiwidude

I believe it will require a separate exemption list, as this is between author name pairings not book pairs.

I agree - an author name exemption list would be the way to go. In theory, you could structure it as book pairs (All books by AuthorA vs. all books by AuthorB) but if you add a new book, you'd have to consider the authors again, and it would still need to be a separate list from the book exception list.

Thinking out loud here - Suppose I tell you that AuthorA and AuthorB are not the same, even though the algorithm sees them as similar. Can I then say anything about whether BookA by AuthorA and BookB by AuthorB are the same? I suppose not. Father and son write a book, but I've got format 1 under Father's name and Format 2 under the son's name.

kiwidude · 04-19-2011, 04:45 AM

Quote:

Originally Posted by Starson17

Thinking out loud here - Suppose I tell you that AuthorA and AuthorB are not the same, even though the algorithm sees them as similar. Can I then say anything about whether BookA by AuthorA and BookB by AuthorB are the same? I suppose not. Father and son write a book, but I've got format 1 under Father's name and Format 2 under the son's name.

It is a good question as to whether there is crossover from the author exemption list to the book find algorithms. My first instinct was to say the answer is that there should be. Your example, if I understand it correctly, is the result of a metadata entry error, as the book has been given the wrong author. It just so happens that you may see it appearing in duplicate searches because father and son share a similar name.

Say I have Steve Smith and S. Smith as authors, and I decide from a duplicate author search that these are not duplicate authors. As I am displaying all books by those two authors at once before I make that exemption, that is my opportunity to make sure that any wrong author values on individual books between the two are rectified (this is where the Search the Internet plugin with Fantastic Fiction is gold to me).

Then if it happened to be the case that both authors had written a book with a title that is similar enough to appear in a duplicate search, you might argue that it should automatically be excluded, as you have already said the author sets are distinct.

However if we did this I see the potential issue of you adding another format for this book in future to your library where once again the author has the wrong value on it. Now you will never see it appear as a duplicate, unless you removed the author exclusions. That is a bit nasty and subtle.

Note that unless you run the 'xxx title, ignore author' book algorithms you are unlikely to have an overlap for the above scenario, as it needs a more fuzzy author match which will only be offered for author based searches, not book ones. Similar author just does punctuation and comma name flipping. And there must be a relatively small % of books in the world which are written with an exact enough title match by different authors that have such subtly different author names. So I think it is safer to not apply the author exclusion list to book searches and let the user make book based exemptions instead. At least that way if they import books in future with the wrong name on them, they have a chance of picking that up from a duplicate book based search. Not if it is a new title of course, but there is only so much we can do!
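For readers wondering what "punctuation and comma name flipping" might look like, here is a small sketch. It is a guess at the general idea only, not the plugin's actual 'similar author' algorithm, and `normalize_author` is a name invented for the example.

```python
import re


def normalize_author(name):
    """Hypothetical 'similar author' key: flip 'Last, First' and drop punctuation."""
    if ',' in name:
        # "Smith, Steve" -> "Steve Smith"
        last, _, first = name.partition(',')
        name = first.strip() + ' ' + last.strip()
    # Replace punctuation with spaces, then collapse whitespace and lowercase.
    name = re.sub(r'[^\w\s]', ' ', name)
    return ' '.join(name.lower().split())


print(normalize_author('Smith, Steve'))  # steve smith
print(normalize_author('Steve Smith'))   # steve smith
print(normalize_author('S. Smith'))      # s smith
```

Under this kind of normalisation "Smith, Steve" and "Steve Smith" collapse to the same key, while "S. Smith" does not, which is why the Steve Smith / S. Smith case would only surface in a fuzzier author based search.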

chaley · 04-19-2011, 06:07 AM

First, the new version works very well.

Comments:

- I like the manage duplicates dialog.

- if I run a test that finds one group, then mark that group as exempt, I get the message "No further duplicate groups exist for 'None'". If I subsequently run the test, I get "No duplicate groups were found using 'similar title, similar author'". Perhaps the 'None' was supposed to be 'similar title, similar author'?

- Using the restriction in 'One group at a time' mode does exactly what I expect and want. The tag browser is very useful for (tada) browsing, because it shows only the values for the books in question. I can quickly scan other metadata such as series and tags simply by looking at the items in the browser, rather than scrolling the library view and sorting.

- I was unable to make anything break by pushing the clear button or by clearing the restriction. However, using the tag browser to do searches has the side effect of leaving duplicate_check mode when cycling through searches, because one of the states clears the search. I don't know if this is a problem, and if it is, I don't know how to fix it.

- The problem where the use_marks configuration flag was being reset has been fixed.

Quote:

Originally Posted by kiwidude

It is a good question as to whether there is crossover from the author exemption list to the book find algorithms. My first instinct was to say the answer is that there should be.

I am still not convinced that we need author exemptions, much less to use them in book searches. The reason is that we are dealing with books, not authors. Searching for fuzzy author, ignore title, I get a list of books. I may know that S Smith and Steve Smith are different authors, but I don't know that all the books have the correct author. If I mark these authors as exempt, then how do I check for mistakes (repeats kiwidude's argument)? One author at a time? I argue that seeing the books by both authors together helps me see errors more easily than seeing the books one author at a time. In addition, an author search might find duplicate books, such as S. Smith "2 Vampires" and Steve Smith "Two Vampires" (and other variants). An author exemption would block showing these.

As for mixing author exemptions with book exemptions, kiwidude's 'complexity of interaction' argument is spot-on. I imagine trying to write documentation describing how things work, and end up pulling my hair out.

Finally, and probably a red herring, there are situations where S Smith and Steve Smith are in fact the same author, but listed differently on purpose. This happens all the time in academic papers, where the author name varies slightly from paper to paper. Do I need another kind of exemption to handle these?

I do recognize that other people might want to work differently. There is nothing that forces me to use author exemptions. My argument against them is based mostly on complexity, especially as this code will be integrated into trunk, where it might be touched (maintained) by more than one person as calibre evolves.

kiwidude · 04-19-2011, 06:56 AM

Quote:

Originally Posted by chaley

- if I run a test that finds one group, then mark that group as exempt, I get the message "No further duplicate groups exist for 'None'". If I subsequently run the test, I get "No duplicate groups were found using 'similar title, similar author'". Perhaps the 'None' was supposed to be 'similar title, similar author'?

Oops, I'll look into that. I confess to not testing "resolving the last duplicate" because I got lazy and tired of continually recreating duplicate scenarios to test.

Quote:

- I was unable to make anything break by pushing the clear button or by clearing the restriction. However, using the tag browser to do searches has the side effect of leaving duplicate_check mode when cycling through searches, because one of the states clears the search. I don't know if this is a problem, and if it is, I don't know how to fix it.

Darn it, I knew I would miss a permutation of that clear event and it would come back to bite me. Yes it is a problem.

It is all caused by hooking into the wrong signal. What I am really interested in is the user clicking the clear button action on the toolbar, not in the search being cleared. I have added all sorts of filth to the code to try to disconnect/connect around doing actions which result in the search being cleared, but that doesn't work when as you say actions like tag browser clicking result in another scenario I can't differentiate between.

I would like to rip all my filth out and instead directly hook into the triggered signal of the clear search button action. Do you have any objections/thoughts on that? I should have pulled the pin on my current hacks and proposed this days ago, but I was playing whack-a-mole with the event triggering instead of taking a fresh perspective.
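The difference between the two hooks can be shown with a toy model. The `Signal` class below is a plain-Python stand-in for a Qt signal, and `SearchBoxSketch`, `clear_button_triggered` and `DuplicateModeListener` are all hypothetical names; none of this is calibre's real search box API.

```python
class Signal:
    """Minimal stand-in for a Qt signal: connect slots, then emit."""

    def __init__(self):
        self._slots = []

    def connect(self, slot):
        self._slots.append(slot)

    def emit(self):
        for slot in list(self._slots):
            slot()


class SearchBoxSketch:
    def __init__(self):
        self.cleared = Signal()                 # fires whenever the search is cleared
        self.clear_button_triggered = Signal()  # fires only on the toolbar button

    def click_clear_button(self):
        self.clear_button_triggered.emit()
        self.cleared.emit()

    def tag_browser_click(self):
        # Cycling tag browser states can clear the search as a side effect.
        self.cleared.emit()


class DuplicateModeListener:
    def __init__(self, search_box):
        self.active = True
        # Hook the button itself, not the generic "search cleared" event.
        search_box.clear_button_triggered.connect(self.exit_mode)

    def exit_mode(self):
        self.active = False


box = SearchBoxSketch()
mode = DuplicateModeListener(box)
box.tag_browser_click()
print(mode.active)  # True: tag browser clicks no longer end duplicate mode
box.click_clear_button()
print(mode.active)  # False: only the explicit button exits the mode
```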

Quote:

I am still not convinced that we need author exemptions, much less to use them in book searches...

The problem if we only run with the algorithms currently in the plugin is that they do not help the user find books by the same author with a simple variation in initials/first name.

So to make this plugin more complete/useful imho we *need* an ignore title based search.

But the problem with trying to treat such searches as "book searches" is that our normal exemption model and grouping model do not fit. As I think we are all agreed, you will want to see all the books by those authors who have been found to be similar, to then be able to review which are genuine data entry/import errors versus author names that for whatever reason you decide should not be treated as duplicates of each other.

It also sounds like we are in agreement that trying to apply such author based exemptions to book searches is a bad idea. So that takes one aspect of the complexity out.

Quote:

Finally, and probably a red herring, there are situations where S Smith and Steve Smith are in fact the same author, but listed differently on purpose. This happens all the time in academic papers, where the author name varies slightly from paper to paper. Do I need another kind of exemption to handle these?

Not an issue in my opinion. If you flag those two authors as exemptions you are saying to the plugin that you do not want those authors to be displayed again as duplicates of each other. That your reason is that they are different people or different variations you want to preserve is not relevant imho. The intention is that when you next run the author based search you are not faced with spending brain cycles on making that same choice again.
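A separate author exemption list along these lines could be as simple as a set of unordered name pairs. The sketch below is an assumption about shape only; the class name `AuthorExemptions` and its methods are invented for illustration, and nothing here is existing plugin code.

```python
class AuthorExemptions:
    """Hypothetical store of author name pairs the user has said to stop showing."""

    def __init__(self):
        self._pairs = set()

    def add(self, author_a, author_b):
        # frozenset makes the pair unordered: (A, B) is the same as (B, A).
        self._pairs.add(frozenset((author_a, author_b)))

    def is_exempt(self, author_a, author_b):
        return frozenset((author_a, author_b)) in self._pairs


exemptions = AuthorExemptions()
exemptions.add('S Smith', 'Steve Smith')
print(exemptions.is_exempt('Steve Smith', 'S Smith'))  # True
print(exemptions.is_exempt('Steve Smith', 'E Smith'))  # False
```

Note the store records only that the pair should not be shown again; it carries no reason, which matches the point that "different people" and "variants worth preserving" need not be distinguished.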

Quote:

I do recognize that other people might want to work differently. There is nothing that forces me to use author exemptions. My argument against them is based mostly on complexity, especially as this code will be integrated into trunk, where it might be touched (maintained) by more than one person as calibre evolves.

Totally agree that maintenance is an issue to be potentially concerned with. Until I work through all the details I won't know how much of an impact this has. Obviously there is a lot of commonality, but there are significant differences as well.

I only started thinking through all the implications and how it would fit last night. For instance when you are reviewing groups of authors, you are not going to want the "show all duplicates/highlight mode" option - instead it will be one group at a time and then the tag browser to filter within that group as you like or rename authors etc. So the Find duplicates dialog either needs a different dialog/menu option, or rearranging so that the options for how to view the results are either disabled or made a suboption of book based searches.

But I need to finish reviewing what is involved before I know for sure the impact. There is already a house of cards that has started to be built from the permutations of individual versus group review and in particular adding duplicate exemptions. I have no interest in making a rod for my own back or anyone else's by making this more complex than it is currently. However I am convinced we do need ignore title searches, and if I have to rewrite the way I have done the code so far to support them then better to do that now and get it sorted while it is fresh in my mind than down the track imho.


Post #101 ("v0.4 Beta", kiwidude, 04-18-2011, 03:34 PM): Last edited by kiwidude; 04-19-2011 at 12:45 PM. Reason: Later version in thread

