Old 07-02-2016, 11:22 AM   #1
rjherald
Junior Member
 
Posts: 2
Karma: 10
Join Date: Jul 2016
Device: iPad Air 2
easiest way to remove duplicates

I am sure this has been asked a bunch of times, but is there an easy way to remove duplicates other than the one-at-a-time method? Any help is appreciated, thanks.

Old 07-02-2016, 12:47 PM   #2
theducks
Well trained by Cats
 
Posts: 29,893
Karma: 55267620
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Start with the Find Duplicates Plugin
Be sure to configure it to your way of working, deciding what should count as a duplicate.
If you have all the found results showing, you can use the standard 'selection clicks' and then delete those.
Old 07-02-2016, 02:48 PM   #3
Acanthus
Junior Member
 
Posts: 2
Karma: 10
Join Date: Jul 2016
Device: Generic
As always, there are several options (within Calibre or external) and a few things to consider:
  1. Even if you have two byte-identical ebooks, their metadata may differ, so you may want to merge the two records.
    The Find Duplicates plugin can handle this reasonably well: using binary comparison, it can be set to delete the duplicate ebook format. To merge the records, you will need to walk the results and decide book by book whether you want a merge or not. (There may be pairs of books where, for example, the PDFs are identical yet the other formats are not; merging those would lose the alternate formats.) A bare-bones sketch of the binary-comparison idea follows this list.
  2. If only one of the formats is a duplicate and the other ones are not, you will most likely have to decide manually which ones to keep, depending on the reason for the differing files:
    • different people may have converted the original file to their preferred format with different options, converter versions, or post-processing stages, resulting in conversions of different quality.
    • a format may have an additional book jacket or modified metadata slapped onto it, so the resulting files no longer match under binary comparison.
    • bookmarks and other metadata may have been added by a careless reader (application).
    No automatic tool I am aware of can take these things into consideration, so if you want to sanitize your library, you are in for some dedicated weeding time.
  3. A format may be broken due to a faulty file transfer or bit-rot (this goes undetected by tools that only look at file size). Automatic merging of book records may leave you with the broken copy instead of the good one, so you will want to identify those files first and deal with them. (Some formats may have a chance of recovery; in most cases the best recovery is to look for a sane backup or copy.)
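
Since binary comparison comes up in points 1 and 3 above, here is a bare-bones sketch of the idea in Python: hash every format file under the library folder and group files whose content is byte-identical (which also catches copies that should match but differ through bit-rot, something a size-only check misses). This is not how the Find Duplicates plugin works internally; the library path and the extension list are assumptions you will need to adjust.

Code:
# Sketch only: group ebook files by content hash to find byte-identical copies.
import hashlib
from collections import defaultdict
from pathlib import Path

LIBRARY = Path.home() / "Calibre Library"      # assumed location, adjust to yours
EXTS = {".epub", ".mobi", ".azw3", ".pdf"}     # formats to consider

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large books don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

groups = defaultdict(list)
for p in LIBRARY.rglob("*"):
    if p.is_file() and p.suffix.lower() in EXTS:
        groups[sha256_of(p)].append(p)

for digest, paths in groups.items():
    if len(paths) > 1:
        print(f"Byte-identical content ({digest[:12]}...):")
        for p in paths:
            print("   ", p)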

My steps to weed through libraries suffering from duplicates and other mayhem:
  1. Identify the bad apples, either from within Calibre (using a plugin) or from your OS, using a script. The Quality Check plugin at least has the side effect of identifying defective EPUB (ZIP) files when it looks for missing container.xml files within the archive, among similar tasks - just copy the resulting error list to the clipboard, paste it into a file, and fetch the path names of the defective files (or their Calibre IDs) from that file for further processing. (A bare-bones version of that container.xml check is sketched after this list.)
  2. Unify the spelling and formatting of author names and initials, so all books by the same author show up together - both in the file system and in Calibre.
  3. Quit Calibre, and make a backup of the library to an external medium for safekeeping. Label and date it.
  4. Now to identify the duplicate formats:
    1. True duplicate import: all formats and metadata are identical; the only differences are the ID and the import date/timestamp. Delete these automatically. (Note: when deleting from the command line, use the calibredb command to remove those books, don't just delete the files.)
      This can be done by scripting; the method depends on your OS and your level of expertise in whatever scripting language you choose. (I am not aware of a plugin that handles these cases automatically.) A rough wrapper along these lines is sketched after this list.
      For Linux/BSD/macOS systems, rmlint does a good job of identifying duplicate files (hash/checksum based). The shell script it generates to remove the duplicate files would need to be modified to use calibredb, though.
      Other tools include fdupes and fslint, among others. There used to be a Wikipedia comparison page: 2013 snapshot, 2015 snapshot
      Note that even though these tools do most of the file-level comparing, you still need to write a wrapper that handles the Calibre logistics, i.e. checks whether two books contain conflicting formats that would mean data loss if merged, and whether the metadata needs additional handling.
    2. calibre_report_duplicates by mekk so far seems to be the only candidate that checks all formats within a book. Unfortunately it relies on file size as the ultimate criterion, which does not catch bit-rot and other fatalities. It does give good summaries of books with additional formats, though.
    3. All the books that have only one or two formats in common need to be examined and resolved manually. Often, when I don't want to sacrifice the time, I end up tagging the copies that seem inferior and moving them to an 'attic' repository, to help with text recovery in case the preferred copy turns out to be a lemon on closer inspection (OCR damage and other goodies).
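
As mentioned in step 1, here is a bare-bones version of the container.xml check. It only mimics one of the checks the Quality Check plugin performs, and the library path is again an assumption.

Code:
# Sketch only: flag EPUBs that are not valid ZIP archives or that lack
# the mandatory META-INF/container.xml entry.
import zipfile
from pathlib import Path

LIBRARY = Path.home() / "Calibre Library"      # assumed location, adjust to yours

for epub in LIBRARY.rglob("*.epub"):
    try:
        with zipfile.ZipFile(epub) as zf:
            if "META-INF/container.xml" not in zf.namelist():
                print(f"Missing container.xml: {epub}")
            elif zf.testzip() is not None:     # CRC check of every member
                print(f"Corrupt archive member: {epub}")
    except zipfile.BadZipFile:
        print(f"Not a valid ZIP file: {epub}")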

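And for step 4.1, a rough sketch of what that calibredb wrapper could look like: list the books, hash their format files, group records whose title, authors, and format contents are all identical, then print (not run) the matching calibredb remove command. The JSON field names are written from memory, so verify them against calibredb list --help on your install, and quit Calibre before letting calibredb touch the library.

Code:
# Sketch only: find "true duplicate" records and print calibredb remove
# commands for the newer copies. Review the output before running anything.
import hashlib
import json
import subprocess
from collections import defaultdict

def list_books():
    # --for-machine makes calibredb list emit JSON; check your version's help
    # if the field names differ.
    out = subprocess.run(
        ["calibredb", "list", "--for-machine",
         "--fields", "id,title,authors,formats"],
        capture_output=True, text=True, check=True)
    return json.loads(out.stdout)

def file_hash(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(1 << 20):
            h.update(chunk)
    return h.hexdigest()

def fingerprint(book):
    # Title/authors stand in for "all metadata" here -- a real wrapper would
    # compare every field you care about.
    digests = tuple(sorted(file_hash(p) for p in (book.get("formats") or [])))
    return (book.get("title"), str(book.get("authors")), digests)

groups = defaultdict(list)
for book in list_books():
    groups[fingerprint(book)].append(book["id"])

for ids in groups.values():
    if len(ids) > 1:
        keep, *drop = sorted(ids)              # keep the lowest (oldest) id
        print(f"keep id {keep}, duplicates: {drop}")
        print("  calibredb remove", ",".join(str(i) for i in drop))
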
Last edited by Acanthus; 07-02-2016 at 02:55 PM. Reason: cruft