12-09-2010, 01:16 PM | #1 |
Connoisseur
Posts: 88
Karma: 200
Join Date: Nov 2010
Location: Dortmund, Germany
Device: Kindle Paperwhite (10th Generation)
|
Find duplicate books...
Hi!
I got the feeling that the duplicate check that calibre offers by default upon adding is not good enough, so I thought I'd try to understand how the search source code works and write a plugin that runs some searches to find duplicates and displays them in the book list. Sadly, I haven't been able to figure out either how to properly search from a plugin, or how to tell it to display only the books my search returned. Could someone point me in the right direction?

Essentially I just want it to display all the books with the ids returned by "select group_concat(id) from books group by UPPER(title) having count(*) > 1;" That query could probably be tweaked, but I thought I'd figure out the basics first :P
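For reference, the same query can be tried outside calibre against a copy of the library's metadata.db (rough sketch only; the path below is just an example, and it's best to work on a copy so nothing touches the live library):

    import sqlite3

    conn = sqlite3.connect('/path/to/library-copy/metadata.db')  # example path
    rows = conn.execute(
        'SELECT group_concat(id) FROM books '
        'GROUP BY UPPER(title) HAVING count(*) > 1').fetchall()
    for (ids,) in rows:
        print(ids)   # comma-separated ids of books sharing a title
    conn.close()
|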
12-09-2010, 02:22 PM | #2 |
Connoisseur
Posts: 88
Karma: 200
Join Date: Nov 2010
Location: Dortmund, Germany
Device: Kindle Paperwhite (10th Generation)
|
Okay, combining the hello world gui plugin with
    db = self.gui.library_view.model().db
    dupes = db.conn.get('select group_concat(id) from books group by UPPER(title) having count(*) > 1;')

allowed me to build a message box displaying the ids of duplicate books. Next step: messing with the view and trying to figure out if my direct SQL access is kinda bad :P
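For anyone following along, the message box part looks roughly like this (a sketch only; the dialog title and the assumption about the row shape returned by conn.get are mine, and info_dialog is the simple message box helper from calibre.gui2):

    from calibre.gui2 import info_dialog

    db = self.gui.library_view.model().db
    dupes = db.conn.get(
        'select group_concat(id) from books '
        'group by UPPER(title) having count(*) > 1;')
    # assuming conn.get returns one tuple per row, with the
    # group_concat string as the first element
    msg = '\n'.join(row[0] for row in dupes)
    info_dialog(self.gui, 'Duplicate titles', msg, show=True)
|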
12-09-2010, 02:28 PM | #3 |
creator of calibre
Posts: 44,573
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Direct SQL access is fine, if it is read-only. If you want to make changes, you should use the API methods in LibraryDatabase2, as they do various things to maintain consistency.
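A rough illustration of the distinction (the book id and title below are placeholders; set_title is one of the database2.py methods that keep things consistent):

    db = self.gui.library_view.model().db

    # read-only access via the raw connection is fine
    rows = db.conn.get('SELECT id, title FROM books')

    # changes should go through the LibraryDatabase2 API so caches,
    # sort fields, etc. stay consistent
    some_id = rows[0][0]
    db.set_title(some_id, 'Corrected Title')   # placeholder title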
|
12-09-2010, 07:47 PM | #4 |
Connoisseur
Posts: 88
Karma: 200
Join Date: Nov 2010
Location: Dortmund, Germany
Device: Kindle Paperwhite (10th Generation)
|
Yay, my first version of the plugin is working. It searches for equal titles (both case-sensitive and case-insensitive) and then displays the books it found.
I started working on something like "similar" titles, but that was hard to put into a nice query... http://bugs.calibre-ebook.com/ticket/4571 |
12-09-2010, 10:06 PM | #5 |
creator of calibre
Posts: 44,573
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
You'll get a lot more flexibility if you use the Python cache rather than SQL. Look at find_identical_books in database2.py
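Usage looks roughly like this (a sketch; the title and author values are placeholders — find_identical_books takes a Metadata object and returns a set of matching book ids):

    from calibre.ebooks.metadata.book.base import Metadata

    db = self.gui.library_view.model().db
    mi = Metadata('Some Title', ['Some Author'])   # placeholder metadata
    duplicate_ids = db.find_identical_books(mi)    # ids of books with the same
                                                   # author(s) and a fuzzy-equal title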
|
12-10-2010, 06:00 AM | #6 |
Wizard
Posts: 3,472
Karma: 48036360
Join Date: Aug 2009
Location: where the sun lives, or so they say
Device: Pocketbook Era, Pocketbook Inkpad 4, Kobo Libra 2, Kindle Scribe
|
wrong thread, sorry
Last edited by aceflor; 12-10-2010 at 06:15 AM. Reason: wrong thread, sorry |
12-10-2010, 10:42 AM | #7 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
find_identical_books in database2.py is used in the autosort/automerge code to find books that have identical author(s) and nearly identical titles (see fuzzy_title). If the autosort/automerge option is on, incoming books are compared to existing book records with find_identical_books, and the incoming format is added to the existing record.
I considered doing fuzzy matching on authors, and more aggressive fuzzy matching on titles, but for automatic merging there were too many errors. If you're going to write a duplicate finder, you can be aggressive, provided you only display the results and merging is done manually. I don't know if you want to compare only titles, or if you also want to consider authors.

I was thinking about multiple types of dupe finding, selectable during the search:
1) Match on title (fuzzy) only,
2) match on title (fuzzy) and exact author (this is what find_identical_books produces), and
3) match on title (fuzzy, but aggressive, ignoring plurals) and author (fuzzy - ignoring initials, Jr., second authors, etc.).

You will find a couple of threads on duplicate finding, if you search here, that provide SQL searches to be run with the calibre-debug tool or with an SQL database browser like SQLiteSpy. I've found that most of my "duplicates" on title only are not really duplicates, just similar or identical titles.
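To give a feel for the "match on title (fuzzy) only" flavour, something along these lines would do (a sketch only — this is not calibre's fuzzy_title, just an approximation of that kind of normalization, with toy data):

    import re
    from collections import defaultdict

    _ARTICLE = re.compile(r'^(a|an|the)\s+', re.IGNORECASE)
    _PUNCT = re.compile(r'[^\w\s]')

    def fuzzy_title_key(title):
        # lowercase, drop a leading article, strip punctuation, collapse spaces
        t = _ARTICLE.sub('', title.strip().lower())
        t = _PUNCT.sub('', t)
        return re.sub(r'\s+', ' ', t)

    # Group books by the normalized key; any group with more than one id
    # is a candidate duplicate.
    groups = defaultdict(list)
    for book_id, title in [(1, 'The Hobbit'), (2, 'Hobbit'), (3, 'Dune')]:
        groups[fuzzy_title_key(title)].append(book_id)
    dupes = [ids for ids in groups.values() if len(ids) > 1]
    print(dupes)   # -> [[1, 2]]
|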
12-10-2010, 11:28 AM | #8 |
Calibre Plugins Developer
Posts: 4,688
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
I ended up writing my own external tool to query the database and find duplicates based on quite a range of criteria. It is fairly "fuzzy" in that it strips off things like a leading "A " or "The ", rips out characters like colons and apostrophes, and pumps out various sets of results. It also does "starts with" checks, since the same book often turns up with a longer version of the title. Similar "starts with" checks are done on authors (taking the first initial into account). And as all my authors are supposed to be stored LN, FN, I also look for names stored as FN LN (no comma), or for authors whose names were imported the wrong way around and so stored as FN, LN.
It "does a job" and helps me eliminate many duplicates I would otherwise have. However, one of the issues, as Starson17 says, is that for certain types of checks there are genuine exceptions to the rule, and you can waste time re-verifying those exceptions every time you run the duplicate check, particularly when you have lots of books. The "fuzzier" the search, the better the chance of finding duplicates, but the more false positives you have to keep looking through.

My current solution to this is to:
- run the duplicate check and process the results until I'm happy with them;
- run the check again. The output this time should just be the stuff I am happy to treat as exceptions. I save that output as a text file;
- the next time I run the check, open the previous output in Notepad++, paste the new output into another tab and use its built-in diff functionality to highlight just the "new" duplicates it has detected;
- once that is done, go back to the second step, thereby overwriting my baseline.

All that baseline/persistence/identify-only-new-stuff could be built into a tool (a rough sketch of the idea follows below), but the above was just a quick and dirty "get it done" approach I use. Will be interested to see what evolves from other ideas people have. Just food for thought.
12-10-2010, 11:45 AM | #9 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
|
12-10-2010, 11:54 AM | #10 |
Calibre Plugins Developer
Posts: 4,688
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Yes, you are right, it does evolve a bit, and probably like me you tune it to the way you store books. For instance, I strip off "(Omnibus)", since I know that is how I store those types of titles, amongst various other things.
One point I did not mention, which you brought up again with the find_identical_books comment: one of the common things I find is that the built-in logic calibre runs at import time does you no good if the filename was not "close enough" at the time of import. For instance, a common thing I miss is a missing space after the series hyphen, so my title gets imported as something like "Series X-Title name", which the calibre logic cannot pick up. Now, that is easy to spot in calibre when you review your newly added books; you fix the title/series up correctly and think the job is done. However, of course, that can now result in a duplicate.

My point being that regardless of how much cleverness goes into the "merge" logic, there will always be situations where, as the result of an edit, you end up with a duplicate that only some sort of after-the-fact check can pick up, replicating the similar-title and other more extensive checks. |
12-10-2010, 12:03 PM | #11 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
|
Similar Threads
Thread | Thread Starter | Forum | Replies | Last Post |
How can you get rid of duplicate books? | pmatch1104 | Calibre | 4 | 12-03-2010 12:08 AM |
where do l find books.... | caddie | Deals and Resources (No Self-Promotion or Affiliate Links) | 11 | 03-13-2010 09:29 AM |
PRS-600 Duplicate books | radcliffe287 | Sony Reader | 4 | 12-18-2009 07:54 AM |
Duplicate books on reader | bassett520 | Calibre | 2 | 11-29-2009 09:51 PM |
Duplicate books - multiple formats | mranlett | Calibre | 5 | 09-26-2009 08:02 AM |