The following explanation assumes a Unix system (Linux, HP-UX, etc.) or Windows with a Unix-like layer such as Cygwin.
First step: create the list of files with their MD5
MD5 (Message Digest 5) is a checksum computed from the contents of a file. The probability that two different files have the same "signature" is infinitesimal, so two files with the same signature can be considered identical. Just use the find command. For example, to list all the AVI movies on the disk:
find / -iname "*.avi" -print0 | xargs -0 md5sum > /tmp/AllFiles.md5
Warning: this command may take some time to run! The results are stored in /tmp/AllFiles.md5.
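To see what the list looks like before scanning the whole disk, here is a minimal sketch that runs the same kind of find | xargs pipeline on a small scratch directory. It assumes md5sum from GNU coreutils is installed; the scratch directory and the file names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical scratch directory and file names, for illustration only.
demo=$(mktemp -d)
printf 'same content\n' > "$demo/a.avi"
cp "$demo/a.avi" "$demo/copy of a.avi"      # identical content, different name
printf 'other content\n' > "$demo/b.avi"

# Compute the MD5 sum of every .avi file under the scratch directory.
# -print0 / -0 keep file names containing spaces in one piece.
find "$demo" -type f -iname "*.avi" -print0 | xargs -0 md5sum > "$demo/AllFiles.md5"

# Each line is: <32 hex digits>, two spaces, <file name>
cat "$demo/AllFiles.md5"

# Count the lines and the distinct sums: the two identical files share one sum.
lines=$(wc -l < "$demo/AllFiles.md5" | tr -d ' ')
distinct=$(cut -c1-32 "$demo/AllFiles.md5" | sort -u | wc -l | tr -d ' ')
echo "$lines files, $distinct distinct sums"

rm -rf "$demo"
```

The first 32 characters of each line are the sum itself, which is what the second step compares.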
Second step: determine the list of duplicates
The following script determines the list by sorting the files by their MD5 sum and displaying the names of the files with identical sums. The only limitation: if the file names contain tabs, they will be truncated.
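#!/bin/sh
# Group the files listed in /tmp/AllFiles.md5 by MD5 sum and print each group.
# md5sum separates the sum from the name with spaces; sed turns the first run
# of spaces into a tab so that awk can split on it.
# Note: the \t escape in sed is a GNU extension; elsewhere, type a literal tab.
sort -u /tmp/AllFiles.md5 |
sed 's/  */\t/' |
awk -F '\t' '
BEGIN { FIRST = "yes"; MD5 = ""; OLDF = "" }
{
    if ( $1 == MD5 )
    {
        # Same sum as the previous line: this file is a duplicate.
        if ( FIRST == "yes" )
        {
            # First duplicate of the group: print the header and the first file.
            FIRST = "no"
            printf( "\nFILES WITH %s:\n%s\n", $1, OLDF );
        }
        printf( " %s\n", $2 );
    }
    else
    {
        # New sum: remember it and the file that carries it.
        MD5 = $1
        OLDF = $2
        FIRST = "yes"
    }
}
'
And now ...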
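As a cross-check of the grouping step, the same idea (sort by sum, keep only the sums that repeat) can be sketched with uniq alone. This is not the script above but a shorter alternative, and it assumes GNU uniq, since -w and --all-repeated are GNU options. The sample list and its paths are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample list in md5sum format (sum, two spaces, file name).
list=$(mktemp)
cat > "$list" <<'EOF'
d41d8cd98f00b204e9800998ecf8427e  /movies/a.avi
0cc175b9c0f1b6a831c399e269772661  /movies/b.avi
d41d8cd98f00b204e9800998ecf8427e  /backup/a copy.avi
EOF

# Sort so that equal sums are adjacent, then compare only the first
# 32 characters (the MD5 sum) and keep every line whose sum repeats.
# Groups of duplicates are separated by a blank line.
dups=$(sort "$list" | uniq -w32 --all-repeated=separate)
printf '%s\n' "$dups"

rm -f "$list"
```

Because the comparison uses a fixed character count and not a field separator, file names containing tabs are left intact in this variant.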