Search for duplicate files on your hard drive - Paul Courbis


   


   

Search for duplicate files on your hard drive

Saturday 16 August 2008, by Paul Courbis

After enough downloading, you end up with tons of files in multiple copies. With current disk sizes, cleaning up becomes increasingly difficult. Here is one way to list the files present in multiple copies on your hard drives.

The following explanation assumes a Unix system (Linux, HP-UX, etc.) or Windows with a Unix-like layer (such as Cygwin).

First step: create the list of files with their MD5

MD5 (Message Digest 5) computes a checksum of a file's content. The probability that two different files have the same "signature" by accident is infinitesimal, so two files with the same signature can be considered identical. Just use the find command. For example, to list all the AVI movies on the disk:

find / -iname "*.avi" -print0 | xargs -0 md5sum > /tmp/AllFiles.md5

Warning: this command may take some time to run! The results are stored in /tmp/AllFiles.md5.
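To illustrate why the sum works as a signature, here is a minimal sketch, assuming GNU md5sum is available (as on most Linux systems and under Cygwin). The scratch directory and the file names are made up for the demonstration. Two files with the same content get the same sum whatever their names, and each line written by md5sum holds the 32-character sum, two spaces, then the file name.

```shell
#!/bin/sh
# Hypothetical scratch files, created only for this demonstration.
DIR=$(mktemp -d)
trap 'rm -rf "$DIR"' EXIT

printf 'same content\n'  > "$DIR/a.avi"
printf 'same content\n'  > "$DIR/copy of a.avi"
printf 'other content\n' > "$DIR/b.avi"

# The find pipeline of this step, restricted to the scratch directory,
# with md5sum as the command run by xargs. -print0 and -0 keep file
# names containing spaces in one piece.
find "$DIR" -iname "*.avi" -print0 | xargs -0 md5sum > "$DIR/AllFiles.md5"

# Keep only the sum (first field) of each file to compare them.
SUM_A=$(md5sum < "$DIR/a.avi" | cut -d ' ' -f 1)
SUM_COPY=$(md5sum < "$DIR/copy of a.avi" | cut -d ' ' -f 1)
SUM_B=$(md5sum < "$DIR/b.avi" | cut -d ' ' -f 1)

cat "$DIR/AllFiles.md5"
```

The first two sums are equal and the third differs, which is exactly the property the second step relies on.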

Second step: determine the list of duplicates

The following script builds that list by sorting the files by their MD5 sum and displaying the names of the files whose sums are identical. The only limitation: file names containing tabs will be truncated.

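#!/bin/sh

# A literal tab character, used as the field separator below.
TAB=$(printf '\t')

# Sort by MD5 sum so that identical files end up on adjacent lines,
# then replace the first run of spaces (after the sum) with one tab.
cat /tmp/AllFiles.md5                          |
       sort -u                                 |
       sed "s/  */${TAB}/"                     |
       awk -F '\t' '
BEGIN  { FIRST="yes"; MD5=""; OLDF="" }

       {
         if ( $1 == MD5 )
         {
            # Same sum as the previous line: this is a duplicate.
            if ( FIRST == "yes" )
            {
               # First duplicate of the group: print the header
               # and the name of the file seen on the previous line.
               FIRST="no"
               printf( "\n\nFILES WITH %s:\n\n\t%s\n",
                        $1, OLDF );
            }
            printf( "\t%s\n", $2 );
         }
         else
         {
            # New sum: remember it together with the file name.
            MD5=$1
            OLDF=$2
            FIRST="yes"
         }
       }
'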
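On systems with GNU coreutils, the same grouping can be sketched more briefly with uniq, which can compare only the first 32 characters of each line (the length of an MD5 sum written in hexadecimal) and print every line of each repeated group. This is an alternative that assumes GNU sort and uniq are installed, not a replacement for the script above. The function name list_duplicates, the sample paths and the sums in the sample list are placeholders made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the groups of lines sharing the same MD5 sum.
# Relies on GNU uniq: -w 32 compares only the first 32 characters and
# --all-repeated=separate prints all duplicated lines, with an empty
# line between groups.
list_duplicates() {
    sort -u "$1" | uniq -w 32 --all-repeated=separate
}

# Small hand-made list in md5sum format (placeholder sums, not real ones).
LIST=$(mktemp)
trap 'rm -f "$LIST"' EXIT
cat > "$LIST" <<'EOF'
0123456789abcdef0123456789abcdef  /movies/one.avi
fedcba9876543210fedcba9876543210  /movies/two.avi
00112233445566778899aabbccddeeff  /movies/three.avi
0123456789abcdef0123456789abcdef  /backup/one.avi
00112233445566778899aabbccddeeff  /old/three.avi
EOF

list_duplicates "$LIST"
```

With the real list, the call would be list_duplicates /tmp/AllFiles.md5. Unlike the awk script, this variant prints the sum on every line instead of once per group, and file names containing tabs are left untouched.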

And now ...
