04-10-2011, 02:12 PM | #1 |
Calibre Plugins Developer
Posts: 4,673
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Duplicate detection plugin
Call me a masochist, but I can't resist the challenge of getting something up and running for this. I've tried to say no, but I can't. It is a disease.
We had a bunch of really good discussions on the Duplicate Detection thread - Starson17 & Kovid implemented the automerge changes which occupied the first four pages, then the posts more relevant to the plugin began in earnest with some good stuff being thrown around. I considered continuing this post on that thread, but it is clear it is now all about plugin design decisions and has no place in the library management forum. So having reviewed the thread, I have had a few thoughts.
You guys have any further thoughts on this before I take the plunge? |
04-11-2011, 04:08 AM | #2 |
Calibre Plugins Developer
Posts: 4,673
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
A couple more thoughts on the false positive stuff:
|
04-11-2011, 08:22 AM | #3 |
Grand Sorcerer
Posts: 11,916
Karma: 7176769
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
not duplicates
We must first decide what it means to say two books are not duplicates. My position is that books marked as not duplicates are not duplicates, independent of what algorithm is used. Note that the 'not duplicate' relationship is not transitive: if books 1 & 2 are not duplicates and books 2 & 3 are not duplicates, we can say nothing about whether books 1 and 3 are duplicates.
I think that the natural implementation is to have pairs of books that are known to be 'not duplicates'. When sets of duplicates are built, they must be partitioned so that pairs of 'not duplicates' are never in the same set. Using the above example, if the algorithm produces a set [1, 2, 3], then partitioning will produce a set [1, 3]. If we have a duplicate set [1, 2, 3, 4] with not-duplicate [1, 2] and [3, 4], then the result will be four sets [1, 3], [1, 4], [2, 3], [2, 4]. The algorithm I posted before does this partitioning.

I am not convinced that a custom column is the best place to keep the 'not duplicate' pairs. First, the data would be duplicated in both books, creating a maintenance problem (change one means the other must change). Second, the data must be 'multiple' because a book can participate in multiple pairs. I think I am saying that the plugin must accept a selection of books that are not dups and write that information to its own storage.

As for whether or not to use a persistent column for the results: I tend toward persistence. I see three advantages of using a custom column.
1) I can process a group, then delete the group marker to indicate I am done. Running the algorithm again could find these books again, unless I am meticulous about marking not-dups.
2) I can do the work over time, knowing where I stopped.
3) I can select all the groups that a given book is in, giving a books view.

For this last one, the plugin could help by writing the transitive group information into a custom column. This would solve the sorting problem because the values written could be sortable: the base book could be #1, with the rest of the books being some other number.

The problem you raise regarding books 2 and 3 'not duplicates' is also real, but I think it is a consequence of using book mode.
Book-based browsing says something about a relationship between the base book and any other book, but nothing about the relationship between books that are not the base. Have fun! |
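The partitioning described above can be sketched in a few lines of Python. This is my own illustration of the idea, not the actual algorithm posted earlier in the thread; the name partition_duplicates is hypothetical, and the brute-force subset enumeration is only reasonable for the small duplicate sets the plugin deals with:

```python
from itertools import combinations

def partition_duplicates(dup_set, not_dups):
    """Split one duplicate set into the maximal subsets that contain
    no known 'not duplicate' pair.

    dup_set:  a set of book ids the algorithm flagged as duplicates
    not_dups: a set of frozenset pairs marked as not duplicates
    """
    books = sorted(dup_set)
    valid = []
    # Try subsets from largest to smallest; keep a subset only if it
    # contains no forbidden pair and is not inside an already-kept set.
    for size in range(len(books), 1, -1):
        for combo in combinations(books, size):
            s = set(combo)
            if any(pair <= s for pair in not_dups):
                continue
            if any(s <= kept for kept in valid):
                continue
            valid.append(s)
    return valid
```

Running this on the two examples above reproduces them: [1, 2, 3] with not-dups (1,2) and (2,3) yields just [1, 3], and [1, 2, 3, 4] with not-dups (1,2) and (3,4) yields the four sets [1, 3], [1, 4], [2, 3], [2, 4].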
04-11-2011, 09:42 AM | #4 |
Calibre Plugins Developer
Posts: 4,673
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Hey Charles, thanks for the replies.
I don't think we differ at all on how the data should be partitioned, and a set-based approach to viewing the results has a number of advantages over book-based in terms of simplicity of the UI etc.

Where to store the duplicate pair exemption is an interesting one. Sounds like we are agreed that the algorithm is not relevant to the pairing. The maintenance aspect is a point I touched on in a slightly different fashion - I don't believe you would need to store it on both books if you process in book id order, but there is certainly an issue when one of the books in the pair becomes merged or deleted. Now we have library uuid it is more feasible to store library specific ids in a config file, so I'll add that to the mix.

I'm not sure I understand all your persistence reasons, as surely similar justification for not using a custom column to store exemptions applies to storing groups? Take for instance the example of (1,2) and (1,3) being found. We display (1,2). Now if the user goes "next group" without doing anything, we will want to display (1,3). If the user merges book 2 into book 1 before going next group, we will still want to display (1,3). However if they merge book 1 into book 2, then the (1,3) second group is invalidated. If that information has been persisted into a custom column, the plugin potentially will be doing a lot of repeated querying to find a next group, validate all the ids, clear the persisted column if the group is not valid, move to the next group etc. I would have thought that might be a little easier to do when not persisted in the database?

For point (3) of viewing groups a book is in, you can still do that. You would always have to rely on some kind of in-memory mapping the plugin kept of groups and their members. So I don't see that reason as a differentiator - in fact it could well be easier just using in-memory results because you don't have to re-read all the custom group ids and re-build the mappings as you would if relying on the persisted values?

I think point (2) of doing the resolving across Calibre sessions is absolutely valid, though I think it comes down to the workflow again and how long it takes to run the duplicate search. For me I either resolve the pair in some way or else it will come up again the next time I run the search. Whether I do that now or the next time I open Calibre (and trigger another search) doesn't matter to me - that pair still needs to be resolved in some way. But of course that's just my opinion.

I think at this point I will still go for the "marked" memory-based approach and see how well it works. Converting it to use persisted custom columns at a later point should only involve adding to the logic, I would have thought, not having to bin everything. If I also store the false positives in the config file then that means I can avoid custom columns completely. I would like to avoid them if I can, as it just avoids the issues of column visibility, deletions, attempted sorting, naming conflicts, users fiddling with values etc.

It is a good point about using a custom column for transitive-based sorting. Though unless you forced that column visible on the view, one stray click by the user and the sorting is lost; they would likely have to do "next group, previous group" to get it restored. However at least for the first cut, since transitive groups are not being displayed, I think I can avoid this issue for now anyway. |
04-11-2011, 10:17 AM | #5 |
Grand Sorcerer
Posts: 11,916
Karma: 7176769
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
If both of these assumptions are true, then your examples don't create a problem. In the first case (2 -> 1), book 1 will still have all its groups, so next_group will show (1,3). If 1 is merged into 2, then 1 no longer exists and next_group will show only book 3.

I also assume that telling the plugin that two books are not duplicates does not rerun the partitioning algorithm. The new non-dup info would be used the next time I run it.
|
04-11-2011, 10:35 AM | #6 |
Calibre Plugins Developer
Posts: 4,673
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Hey Charles, no we are on the same page when it comes to what is stored in the CC.

What I thought would not be useful as a user though is to get shown a duplicate group that had only one entry in it when I go "next result". So just displaying book 3 by itself wouldn't give me a high level of confidence the plugin was "working right", even though we know the reasons why there is just one in the group now. It is that extra processing behind the scenes to invalidate that group etc. that was my concern with the CC versus in-memory approach. For instance it is going to be impossible to give the user an updated number of how many more "real" duplicate groups there are, unless you go through and invalidate them every time they hit next result, to remove all deleted books and reset the group counts. Something that could be done in memory less expensively/easily, I would argue, than via a CC. Because there are no hooks for this plugin to know that a user has deleted a book or merged a book, it must check each id in the duplicates space every time you hit next to give that count.

You are of course correct that you wouldn't need to rebuild the mappings to display the book view, so thanks for correcting me on that. Brain fart.

And about making a column visible - yeah, that's where you get into the problem of a column being there that you only want while you are resolving duplicates, and the myriad of issues about ways to hide it etc. If you get sidetracked while resolving and want to do other queries/restart Calibre etc., you have this column in your view that "annoys" when not being used. If I can avoid it then I would prefer to. |
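The prune-on-every-next-result bookkeeping described above might look something like this in-memory sketch. It is my own illustration, not the plugin's actual code; next_valid_group and its calling convention are assumptions. Because there are no delete/merge hooks, every advance re-checks group members against the ids that still exist in the library:

```python
def next_valid_group(groups, start, existing_ids):
    """Scan forward from index `start`, pruning each duplicate group to
    the book ids that still exist in the library. Return (index, survivors)
    for the first group that still has at least two members, or (None, [])
    when no valid group remains.
    """
    for pos in range(start, len(groups)):
        survivors = [bid for bid in groups[pos] if bid in existing_ids]
        if len(survivors) >= 2:
            return pos, survivors
    return None, []
```

For example, with groups [(1,2), (1,3), (4,5)] and book 1 merged away, "next result" from the start would skip both groups containing 1 and land on (4,5). The same pass can count the remaining valid groups for the status bar.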
04-11-2011, 10:43 AM | #7 |
Grand Sorcerer
Posts: 11,916
Karma: 7176769
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
The above aside, as you are doing the work, your vote is worth more than mine. In addition, given the quality of all your other work, I am sure that whatever you do will do the job admirably. Have at it! |
|
04-11-2011, 10:56 AM | #8 |
Calibre Plugins Developer
Posts: 4,673
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Haha, dunno about the quality part mate, I've had a couple of crappy bugs with all the plugin rewrites recently - trying to context switch too often, and Eclipse/Pydev has decided to no longer be of any use at indicating missing imports etc. May as well be writing in Notepad again. Might be time to try Aptana, which John says has fewer issues.
I enjoy the debate to verify I'm not wandering too far down the wrong path. You guys inevitably suggest things I haven't considered. For instance just now you mentioned the tag browser. Using a custom column would of course let you "navigate" your groups that way, and as you say give you an ongoing count of the groups, which is something I hadn't figured out how to do other than via the status bar.

I think that the group count is invalid unless you go through and "prune" (clear) group ids for books that no longer have more than one member though, and there has to be some action by the user to trigger that unless it is hooked into delete events (which, as I discovered with a previous plugin and discussed here before, don't get fired by database2.py anyway with that notify stuff). But a starting point is better than nothing, and I reckon 90% of it should be reusable should we steer down another path once people start feeding back on usability etc. |
04-11-2011, 08:18 PM | #9 |
Calibre Plugins Developer
Posts: 4,673
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
v0.1.1 beta
Ok, here is the progress so far. The big ticket item missing at this point is handling false positives. However the rest of it is pretty much there. So far I've put in five different algorithms:
It returns the matches, and you can navigate through them either by clicking on the toolbar button (ctrl+click on the button for previous) or using keyboard shortcuts (ctrl+\ for next result, ctrl+shift+\ for previous). Groups are dynamically removed every time you move to the next result.

False positives are my next issue. The first part, which perhaps Chaley has thoughts on, is deciding how to store them. For instance the simple case is that a user marks (1,2) as not dups, then (1,3) as not dups. So I could store those as tuples in a list for the library, like [(1,2),(1,3)]. However in theory we could allow the user to select more than 2 rows. For instance they select (1,2,3). Now what does that actually "mean"? Presumably the net effect is indicating [(1,2),(1,3),(2,3)]. Do we break it down and store it that way, or do we store it as [(1,2,3)]? Now what happens if the user then selects (1,2,3,4), as obviously that is a superset? Or what if they select (1,2,5)? I haven't got my head around whether we should flatten this out yet etc.

Presumably we also need to cater for the user screwing up and allow them to "undo" a no duplicate set. And how could we show the user visually what duplicate exemptions they have in place? That might need a special screen being built for it I think (and let them delete exemptions from that screen).

Feedback as always appreciated, be it on the code, usability or whatever. So far it is fairly simple I think.

Last edited by kiwidude; 04-12-2011 at 11:25 AM. Reason: Removed attachment, see later posts |
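One way to sidestep the "break it down or store it whole" question is to flatten every selection into its implied pairs, since the not-duplicate relation is fundamentally pairwise; supersets then merge for free. A small sketch (my own illustration - selection_to_pairs is a hypothetical helper, not plugin code):

```python
from itertools import combinations

def selection_to_pairs(book_ids):
    """Flatten a 'mark as not duplicates' selection into the pairwise
    exemptions it implies, e.g. (1, 2, 3) -> {1,2}, {1,3}, {2,3}.
    frozensets make the pairs order-independent and hashable."""
    return {frozenset(p) for p in combinations(sorted(set(book_ids)), 2)}

# A superset selection like (1,2,3,4) just contributes its extra pairs;
# already-known pairs are deduplicated by the set union.
exemptions = selection_to_pairs([1, 2, 3])
exemptions |= selection_to_pairs([1, 2, 3, 4])
```

With this representation an overlapping selection like (1,2,5) is unambiguous too: it simply adds {1,5} and {2,5} alongside the existing {1,2}.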
04-11-2011, 09:11 PM | #10 |
Junior Member
Posts: 6
Karma: 10
Join Date: Mar 2010
Device: PRS 505
|
I used this tonight on my collection and it worked very well. If the program found more than two matching books, it would hang after merging. Other than that it worked great - what a timesaver.
|
04-12-2011, 02:55 AM | #11 |
Calibre Plugins Developer
Posts: 4,673
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
@tomd365 - thanks for trying this and the feedback. The "hanging" sounds like a concern though. I have made two changes - one was a fix to the logic doing the cleaning up when you move to the next book, and the second is an improvement to the "xxx author, similar title" algorithms to strip whitespace from the fuzzied result, so that titles of "Manichu" and "Manichu -" would end up in the same group.
If you can give me some exact steps to replicate any hanging you experienced it would be much appreciated. It wasn't very clear from your post. Did you mean more than two matching groups, or more than two books in a group? If the latter - how many groups, just one or multiple? What merges did you do - all of them required to get down to one record or just a couple? When did it actually "hang" - when you did next record, while you were merging or when you ran the duplicate check again? Last edited by kiwidude; 04-12-2011 at 03:15 AM. Reason: Fix some shocking grammar. |
04-12-2011, 07:18 AM | #12 |
Grand Sorcerer
Posts: 11,916
Karma: 7176769
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
|
Code:

for bk in not_dups:
    nd_map[bk] |= not_dups

Editing the not_dups list is mostly a presentation issue. The easiest way to do it (but perhaps not the easiest way to use it) would be to select a book and see the list of not_dups for that book (the value of nd_map[bk]). The user could check (or del or something) any book from that set, say with id bk2. The code would do:

Code:

for bk2 in checked_books:
    nd_map[bk] -= {bk2}
    nd_map[bk2] -= {bk}

Marking (1,2,3,4) as not duplicates would give:

{ 1: {1,2,3,4}, 2: {1,2,3,4}, 3: {1,2,3,4}, 4: {1,2,3,4} }

Then adding (1,2,5) would result in:

{ 1: {1,2,3,4,5}, 2: {1,2,3,4,5}, 3: {1,2,3,4}, 4: {1,2,3,4}, 5: {1,2,5} }
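The dict-of-sets idea can be made fully runnable with a small wrapper that creates missing entries before the union (the function name is mine, not from the thread). The (1,2,3,4) then (1,2,5) merge behaviour falls out naturally:

```python
def mark_not_duplicates(nd_map, selection):
    """Record that every book in `selection` is not a duplicate of the
    others. Each book's entry accumulates across selections, so later
    overlapping selections simply extend the existing sets."""
    sel = set(selection)
    for bk in sel:
        nd_map.setdefault(bk, set())
        nd_map[bk] |= sel
    return nd_map

nd = {}
mark_not_duplicates(nd, [1, 2, 3, 4])
mark_not_duplicates(nd, [1, 2, 5])
# nd[1] and nd[2] grow to {1,2,3,4,5}; nd[3] stays {1,2,3,4}; nd[5] is {1,2,5}
```

Note that books 3 and 4 never learn about book 5, which is exactly the non-transitivity we want to preserve.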
|
04-12-2011, 10:32 AM | #13 |
Calibre Plugins Developer
Posts: 4,673
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Hey Charles, thanks for the comments.
Yeah, I was rather tired last night when I posted but wanted to ask for input before I crashed. I had come to the same conclusion of a dictionary by book of exclusions, but wanted to make sure I wasn't missing some obvious alternative.

Part of my hesitation was thinking about ways users could mark not duplicates. For instance if I let it be a free-for-all, then it could be a lot of data. Say for instance I allowed a user to mark any books they liked together as not duplicates. So I have a library of 1,000 books as a starting point, and I am confident I have no duplicates. If I selected the whole lot and said mark as not duplicates, then I would ensure that any duplicates that came up in future were not books I had considered. I could skip through the book groups with a very "loose" duplicate matching algorithm, then rather than resolving each group I could just resolve the ones I knew were dups, then select my whole library and mark it as not containing duplicates again. From a user perspective this is kind of simple. From a data perspective it is a bit of a disaster, storing a cross-map table of every id with every other id.

So to prevent that scenario, I would instead only allow a user to mark as not duplicates books in the found sets - if the selection you make is not found together in the same set with the search algorithm, you can't add them as a duplicate exemption. That would keep the data volumes down, at the expense of more clicks required by the user. Perhaps a compromise would be an additional menu option of something like "mark all groups as not duplicates". So a user could still skip through their duplicate results found, and then if they are happy they never want to see those again they can, with one click, add the various pairings within those groups as marked pairs. That would be dramatically less data while minimising the user effort. Does that sound useful?

As for the presentation, yeah, need to think about that one. It has the same issues (I think) as a book-based view. You have a root book, and you have the related books marked as not duplicates. From a user perspective you perhaps initially want to see the set of all books marked as not duplicates, then when you click on one be able to see what books it is marked not duplicates with. As you say, the whole nasty issue of transitivity comes up when you visually display that information to consider the pairings.

Maybe the easiest way is to only allow the user to remove all exceptions for a book. So we have a menu option of "Show marked not duplicates" which displays every book that has some kind of a not duplicate relationship. Then a user can choose a book and choose "Remove from not duplicates", which breaks all associations other books have with it. So if I have (1,2) and (1,3) marked as not duplicates, I choose "Show" which will display 1, 2, 3. If I choose "Remove" on book 1, then both its exclusion relationship with 2 and with 3 are deleted. If 2 & 3 have no exclusions with other books then they also get removed from the exclusion dictionary, so when it refreshes the view it would display no results. So if a user wanted to remove just the (1,2) relationship, they would need to do the remove on book 2. Then when it refreshes it would still be showing book 1 and book 3. Without having some visual indication of the relationship pair graph between the books I think it is the simplest approach.

Any thoughts/objections? |
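The "Remove from not duplicates" behaviour described above could be sketched like this, assuming the dict-of-sets exemption store discussed earlier in the thread (the function name is mine; this is an illustration, not plugin code):

```python
def remove_book_exemptions(nd_map, bk):
    """Break every not-duplicate association involving `bk`, and drop
    any book left with no remaining exemption partners."""
    partners = nd_map.pop(bk, set()) - {bk}
    for other in partners:
        entry = nd_map.get(other)
        if entry is None:
            continue
        entry.discard(bk)
        if entry <= {other}:   # only itself left -> no exemptions remain
            del nd_map[other]
    return nd_map
```

Walking through the (1,2) and (1,3) example: removing book 1 empties the whole dictionary (2 and 3 had no other partners), while removing book 2 instead leaves just the (1,3) relationship showing.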
04-12-2011, 10:45 AM | #14 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
This is my small test library and has a lot of funky data from hundreds of odd tests. |
|
04-12-2011, 11:08 AM | #15 |
Calibre Plugins Developer
Posts: 4,673
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Hi Starson17 - ahhh, it must be a very small library (less than 20 books perhaps?). I borrowed some code from my quality check plugin and it would suffer from the same issue. I will stick a new version up shortly.
|
|