12-09-2010, 01:16 PM | #1 |
Connoisseur
Posts: 88
Karma: 200
Join Date: Nov 2010
Location: Dortmund, Germany
Device: Kindle Paperwhite (10th Generation)
|
Find duplicate books...
Hi!
I got the feeling that the duplicate check that calibre offers by default upon adding is not good enough, so I thought I'd try to understand how the search source code works and write a plugin that runs some searches to find duplicates and displays them in the book list. Sadly, I haven't been able to figure out either how to properly search from a plugin, or how to tell it to display only the books my search returned. Could someone point me in the right direction?

Essentially I just want it to display all the books with the ids returned by "select group_concat(id) from books group by UPPER(title) having count(*) > 1;" That query could probably be tweaked, but I thought I'd figure out the basics first :P
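For reference, the same query can be tried outside calibre against a copy of the library's metadata.db (rough sketch only; the path below is just an example, and it's best to work on a copy so nothing touches the live library):

    import sqlite3

    conn = sqlite3.connect('/path/to/library-copy/metadata.db')  # example path
    rows = conn.execute(
        'SELECT group_concat(id) FROM books '
        'GROUP BY UPPER(title) HAVING count(*) > 1').fetchall()
    for (ids,) in rows:
        print(ids)   # comma-separated ids of books sharing a title
    conn.close()
|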
12-09-2010, 02:22 PM | #2 |
Connoisseur
Posts: 88
Karma: 200
Join Date: Nov 2010
Location: Dortmund, Germany
Device: Kindle Paperwhite (10th Generation)
|
Okay, combining the hello world gui plugin with
    db = self.gui.library_view.model().db
    dupes = db.conn.get('select group_concat(id) from books group by UPPER(title) having count(*) > 1;')

allowed me to build a message box displaying the ids of duplicate books. Next step: messing with the view and trying to figure out if my direct SQL access is kinda bad :P
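For anyone following along, the message box part looks roughly like this (a sketch only; the dialog title and the assumption about the row shape returned by conn.get are mine, and info_dialog is the simple message box helper from calibre.gui2):

    from calibre.gui2 import info_dialog

    db = self.gui.library_view.model().db
    dupes = db.conn.get(
        'select group_concat(id) from books '
        'group by UPPER(title) having count(*) > 1;')
    # assuming conn.get returns one tuple per row, with the
    # group_concat string as the first element
    msg = '\n'.join(row[0] for row in dupes)
    info_dialog(self.gui, 'Duplicate titles', msg, show=True)
|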
12-09-2010, 02:28 PM | #3 |
creator of calibre
Posts: 44,573
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Direct SQL access is fine, if it is read-only. If you want to make changes, you should use the API methods in LibraryDatabase2, as they do various things to maintain consistency.
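A rough illustration of the distinction (the book id and title below are placeholders; set_title is one of the database2.py methods that keep things consistent):

    db = self.gui.library_view.model().db

    # read-only access via the raw connection is fine
    rows = db.conn.get('SELECT id, title FROM books')

    # changes should go through the LibraryDatabase2 API so caches,
    # sort fields, etc. stay consistent
    some_id = rows[0][0]
    db.set_title(some_id, 'Corrected Title')   # placeholder title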
|
12-09-2010, 07:47 PM | #4 |
Connoisseur
Posts: 88
Karma: 200
Join Date: Nov 2010
Location: Dortmund, Germany
Device: Kindle Paperwhite (10th Generation)
|
Yay, my first version of the plugin is working. It searches for equal titles (both case-sensitive and case-insensitive) and then displays the books it found.
I started working on something like "similar" titles, but that was hard to put into a nice query... http://bugs.calibre-ebook.com/ticket/4571 |
12-09-2010, 10:06 PM | #5 |
creator of calibre
Posts: 44,573
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
You'll get a lot more flexibility if you use the Python cache rather than SQL. Look at find_identical_books in database2.py
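Usage looks roughly like this (a sketch; the title and author values are placeholders — find_identical_books takes a Metadata object and returns a set of matching book ids):

    from calibre.ebooks.metadata.book.base import Metadata

    db = self.gui.library_view.model().db
    mi = Metadata('Some Title', ['Some Author'])   # placeholder metadata
    duplicate_ids = db.find_identical_books(mi)    # ids of books with the same
                                                   # author(s) and a fuzzy-equal title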
|
12-10-2010, 06:00 AM | #6 |
Wizard
Posts: 3,472
Karma: 48036360
Join Date: Aug 2009
Location: where the sun lives, or so they say
Device: Pocketbook Era, Pocketbook Inkpad 4, Kobo Libra 2, Kindle Scribe
|
wrong thread, sorry
Last edited by aceflor; 12-10-2010 at 06:15 AM. Reason: wrong thread, sorry |
12-10-2010, 10:42 AM | #7 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
find_identical_books in database2.py is used in the autosort/automerge code to find books that have identical author(s) and nearly identical titles (see fuzzy_title). If the autosort/automerge option is on, incoming books are compared to existing book records with find_identical_books, and the incoming format is added to the existing record.
I considered doing fuzzy matching on authors, and more aggressive fuzzy matching on titles, but for automatic merging there were too many errors. If you're going to write a duplicate finder, you can be aggressive, provided you only display the results and merging is done manually. I don't know if you want to compare only titles, or if you also want to consider authors.

I was thinking about multiple types of dupe finding, selectable during the search:
1) Match on title (fuzzy) only,
2) match on title (fuzzy) and exact author (this is what find_identical_books produces), and
3) match on title (fuzzy, but aggressive, ignoring plurals) and author (fuzzy - ignoring initials, Jr., second authors, etc.).

You will find a couple of threads on duplicate finding, if you search here, that provide SQL searches to be run with the calibre-debug tool or with an SQL database browser like SQLiteSpy. I've found that most of my "duplicates" on title only are not really duplicates, just similar or identical titles.
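To give a feel for the "match on title (fuzzy) only" flavour, something along these lines would do (a sketch only — this is not calibre's fuzzy_title, just an approximation of that kind of normalization, with toy data):

    import re
    from collections import defaultdict

    _ARTICLE = re.compile(r'^(a|an|the)\s+', re.IGNORECASE)
    _PUNCT = re.compile(r'[^\w\s]')

    def fuzzy_title_key(title):
        # lowercase, drop a leading article, strip punctuation, collapse spaces
        t = _ARTICLE.sub('', title.strip().lower())
        t = _PUNCT.sub('', t)
        return re.sub(r'\s+', ' ', t)

    # Group books by the normalized key; any group with more than one id
    # is a candidate duplicate.
    groups = defaultdict(list)
    for book_id, title in [(1, 'The Hobbit'), (2, 'Hobbit'), (3, 'Dune')]:
        groups[fuzzy_title_key(title)].append(book_id)
    dupes = [ids for ids in groups.values() if len(ids) > 1]
    print(dupes)   # -> [[1, 2]]
|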
12-10-2010, 11:28 AM | #8 |
Calibre Plugins Developer
Posts: 4,688
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
I ended up writing my own external tool to query the database and find duplicates based on quite a range of criteria. It is fairly "fuzzy" in that it strips off things like a leading "A " or "The ", rips out characters like colons and apostrophes, and pumps out various sets of results. It also does "starts with" checks, since the same book often turns up with a longer version of the title. Similar "starts with" checks are done on authors (taking the first initial into account). And as all my authors are supposed to be stored LN, FN, I also look for names stored as FN LN (no comma), or for authors whose names were imported the wrong way around and so stored as FN, LN.
It "does a job" and helps me eliminate many duplicates I would otherwise have. However, one of the issues, as Starson17 says, is that for certain types of checks there are genuine exceptions to the rule, and you can waste time re-verifying those exceptions every time you run the duplicate check, particularly when you have lots of books. The "fuzzier" the search, the better the chance of finding duplicates, but the more false positives you have to keep looking through.

My current solution to this is to:
- run the duplicate check and process the results until I'm happy with them;
- run the check again. The output this time should just be the stuff I am happy to treat as exceptions. I save that output as a text file;
- the next time I run the check, open the previous output in Notepad++, paste the new output into another tab and use its built-in diff functionality to highlight just the "new" duplicates it has detected;
- once that is done, go back to the second step, thereby overwriting my baseline.

All that baseline/persistence/identify-only-new-stuff could be built into a tool (a rough sketch of the idea follows below), but the above was just a quick and dirty "get it done" approach I use. Will be interested to see what evolves from other ideas people have. Just food for thought.
12-10-2010, 11:45 AM | #9 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
|
12-10-2010, 11:54 AM | #10 |
Calibre Plugins Developer
Posts: 4,688
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Yes, you are right, it does evolve a bit, and probably like me you tune it to the way you store books. For instance, I strip off "(Omnibus)", since I know that is how I store those types of titles, amongst various other things.
One point I did not mention, which you brought up again with the find_identical_books comment: one of the common things I find is that the built-in logic calibre runs at import time does you no good if the filename was not "close enough" at the time of import. For instance, a common thing I miss is a missing space after the series hyphen, so my title gets imported as something like "Series X-Title name", which the calibre logic cannot pick up. Now, that is easy to spot in calibre when you review your newly added books; you fix the title/series up correctly and think the job is done. However, of course, that can now result in a duplicate.

My point being that regardless of how much cleverness goes into the "merge" logic, there will always be situations where, as the result of an edit, you end up with a duplicate that only some sort of after-the-fact check can pick up, replicating the similar-title and other more extensive checks. |
12-10-2010, 12:03 PM | #11 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
|
Similar Threads
Thread | Thread Starter | Forum | Replies | Last Post |
How can you get rid of duplicate books? | pmatch1104 | Calibre | 4 | 12-03-2010 12:08 AM |
where do l find books.... | caddie | Deals and Resources (No Self-Promotion or Affiliate Links) | 11 | 03-13-2010 09:29 AM |
PRS-600 Duplicate books | radcliffe287 | Sony Reader | 4 | 12-18-2009 07:54 AM |
Duplicate books on reader | bassett520 | Calibre | 2 | 11-29-2009 09:51 PM |
Duplicate books - multiple formats | mranlett | Calibre | 5 | 09-26-2009 08:02 AM |