cb_find_duplicates

Camiel Bouchier

Goal

Over time I collected numerous media files, documents and the like, copying them back and forth between several sources. The result is many duplicates that needlessly occupy disk space. Of course there are other duplicate file finders around; I used Auslogics Duplicate File Finder for some time. The problem with most of those alternatives is that once the duplicates are found, you basically have to decide manually which to keep and which to delete.

This is where cb_find_duplicates comes into the picture: with Lua scripts you can specify exactly which files to keep and which not. The decision can be based on modification time, creation time, filename and similar properties. Deleting is not the only option either. You can instead choose to hardlink the duplicates to each other, which keeps every file in place while the content occupies disk space only once.

Approach

To find duplicates, cb_find_duplicates uses a fairly simple algorithm:

  • In a first pass, files that have the same length are collected.
  • For the files with the same length, a partial (5M) md5sum is calculated. Duplicate files must have the same partial md5sum.
  • For the files that have the same partial md5sum, a full md5sum is calculated. Files with the same length and the same full md5sum are considered duplicates.

This approach, along with some multithreading, results in an acceptable speed for finding the duplicates. Usually it is some 30% faster than the duplicate file finder I used before.
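
The three steps above can be sketched in Python as follows. This is only an illustration of the idea, not the actual implementation of cb_find_duplicates: the function names are made up for this example, "5M" is assumed to mean the first 5 MiB of each file, and the multithreading is left out.

```python
import hashlib
import os
from collections import defaultdict

# Assumption: "5M" is read here as the first 5 MiB of the file.
PARTIAL_BYTES = 5 * 1024 * 1024


def md5_of(path, limit=None, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, or of its first `limit` bytes."""
    digest = hashlib.md5()
    remaining = limit
    with open(path, "rb") as handle:
        while remaining is None or remaining > 0:
            want = chunk_size if remaining is None else min(chunk_size, remaining)
            chunk = handle.read(want)
            if not chunk:
                break
            digest.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return digest.hexdigest()


def split_by(paths, key):
    """Group paths by key(path) and keep only groups with more than one member."""
    groups = defaultdict(list)
    for path in paths:
        groups[key(path)].append(path)
    return [group for group in groups.values() if len(group) > 1]


def find_duplicates(root):
    """Return lists of paths under `root` that share length and full MD5."""
    all_files = [
        os.path.join(folder, name)
        for folder, _dirs, names in os.walk(root)
        for name in names
    ]
    result = []
    # Step 1: same length. Step 2: same partial MD5. Step 3: same full MD5.
    for same_size in split_by(all_files, os.path.getsize):
        for same_partial in split_by(same_size, lambda p: md5_of(p, PARTIAL_BYTES)):
            result.extend(split_by(same_partial, md5_of))
    return result
```

The point of the staging is that the expensive full checksum is only computed for files that already agree on length and on the partial checksum, so most files are ruled out after little or no reading.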

Duplicate selection

Duplicate files can then be selected manually, via a Lua scripting interface, or with a combination of both.

For each group of duplicates, a selection script receives the list of files with the following information:

  • modification time
  • creation time
  • size
  • selection status in the GUI

Based on that, a Lua script can make whatever decision it likes on which files to select or unselect for further processing. The distribution includes example scripts.
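
To give an idea of what such a rule looks like, here is a sketch in Python rather than Lua. It does not show the real scripting interface: the `DuplicateFile` record, its field names and the `path` field are assumptions made for this example, and the example scripts in the distribution remain the reference for the actual Lua API. The rule keeps the file with the earliest modification time and selects all others in the group.

```python
from dataclasses import dataclass


@dataclass
class DuplicateFile:
    # Hypothetical record mirroring the information listed above.
    path: str                 # assumed to be available alongside the listed fields
    modification_time: float
    creation_time: float
    size: int
    selected: bool = False    # stands in for the selection status in the GUI


def select_all_but_oldest(group):
    """Select every file in the group except the one modified earliest."""
    # Ties on modification time are broken by path so the result is stable.
    keeper = min(group, key=lambda f: (f.modification_time, f.path))
    for entry in group:
        entry.selected = entry is not keeper
    return keeper
```

Note that the rule overwrites any selection made by hand; a script that should respect manual choices would have to check the existing selection status first.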

Selected duplicates can be deleted or converted to a hardlink.
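
Converting a duplicate to a hardlink could look roughly like the Python sketch below. How cb_find_duplicates does this internally is not described here, so the function name and the temporary file suffix are assumptions for illustration. Hardlinks only work when both files are on the same volume and the file system supports them.

```python
import os


def replace_with_hardlink(keeper, duplicate):
    """Make `duplicate` a hardlink to `keeper`, dropping the duplicate's own copy."""
    if os.path.samefile(keeper, duplicate):
        return  # both names already refer to the same underlying file
    temporary = duplicate + ".hardlink-tmp"  # hypothetical temporary name
    os.link(keeper, temporary)        # second directory entry for the keeper's data
    os.replace(temporary, duplicate)  # swap it in for the duplicate
```

Creating the link under a temporary name first means the duplicate is only replaced once the link exists, so a failure in `os.link` leaves the original file untouched.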

Installation

Head to the latest release and download it. Currently you will find only a Windows version with a standard NSIS installer.

cb_find_duplicates has been written with portability in mind. It should be fairly straightforward to build a Linux version. Contact me if you are interested.

License

(C) 2015-2021 by Camiel Bouchier (camiel@bouchier.be)

This file is part of cb_find_duplicates. All rights reserved. You are granted a non-exclusive and non-transferable license to use this software for personal or internal business purposes.

THIS SOFTWARE IS PROVIDED "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL Camiel Bouchier BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
