Find Duplicate Files

Julio Batista Silva

Last updated on Sep 10, 2012

Duplicate files take up unnecessary space on the disk. Fortunately, there are tools that automate the search for duplicates.

Fdupes

Install fdupes:

julio@acer ~> sudo pacman -S fdupes

Run fdupes in recursive mode (-r) and redirect the output to a file:

julio@acer ~/Documents/Ebooks> fdupes -r . > dupes1.txt

On my computer, this command took about 7 minutes to analyze 23500 files. The output file, dupes1.txt, had 5714 lines. fdupes prints each set of identical files as a block of paths, with a blank line between sets.
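To make sense of a listing this long, it can help to count the sets before reading it. The sketch below assumes the default fdupes layout (one path per line, a blank line between sets). The name count_dupes is a hypothetical helper, not part of fdupes.

```shell
# Hypothetical helper: summarise fdupes-style output read from stdin.
# Assumes one path per line and a blank line between sets of duplicates.
count_dupes() {
    awk 'BEGIN { RS = ""; FS = "\n" }   # paragraph mode: one record per set
         { sets++; files += NF }        # NF is the number of paths in the set
         END { printf "%d sets, %d files, %d redundant\n", sets, files, files - sets }'
}

# Usage: count_dupes < dupes1.txt
```

The "redundant" figure is the number of files that could go if one copy of each set were kept.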

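With the -f flag, fdupes omits the first file in each set of matches, so the output lists only the redundant copies:

julio@acer ~/Documents/Ebooks> fdupes -rf . > dupes2.txt

This run also took about 7 minutes for the same 23500 files, and dupes2.txt had 3878 lines.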

After removing the blank lines from dupes2.txt with sed -i '/^$/d' dupes2.txt, the file was left with 2054 lines.

Many of the files fdupes flagged as duplicates were intentionally identical. Example code that ships with programming books is often repeated, and some version control files (git, svn, etc.) were flagged as duplicates but should not be deleted.
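One way to protect version control data is to filter those paths out of the list before acting on it. This is only a sketch: skip_vcs is a hypothetical helper, and it assumes the repositories use the usual .git and .svn directory names.

```shell
# Hypothetical filter: drop every path that lies inside a .git or .svn directory.
# Reads paths on stdin and writes the remaining paths to stdout.
skip_vcs() {
    grep -v -E '(^|/)\.(git|svn)(/|$)'
}

# Usage: skip_vcs < dupes2.txt > dupes_filtered.txt
```

The pattern matches .git or .svn only as a whole path component, so a file such as notes.gitignore stays in the list.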

If you want to reduce disk space usage but avoid breaking anything, you can create a script that replaces all duplicate files with hard links.
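A minimal sketch of such a script follows. It assumes the default fdupes output (sets separated by blank lines, without -f, so the first path of each set is present) and that all files live on the same filesystem, since hard links cannot cross filesystems. The name link_dupes is a hypothetical helper.

```shell
# Hypothetical helper: replace duplicates with hard links.
# Reads fdupes-style output on stdin: one path per line, blank line between sets.
# The first path of each set is kept; later paths become hard links to it.
link_dupes() {
    local keep="" f
    while IFS= read -r f; do
        if [ -z "$f" ]; then
            keep=""                  # blank line: the next path starts a new set
        elif [ -z "$keep" ]; then
            keep=$f                  # first file of the set is the one we keep
        elif ! [ "$keep" -ef "$f" ] && cmp -s -- "$keep" "$f"; then
            ln -f -- "$keep" "$f"    # contents still match: swap in a hard link
        fi
    done
}

# Usage (assumes fdupes is installed): fdupes -r . | link_dupes
```

The cmp check re-verifies each pair right before linking, in case a file changed after the scan. Keep in mind that hard-linked files share their contents, so editing one copy afterwards changes all of them.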

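To delete every file listed in dupes2.txt, run the loop below in a POSIX shell such as bash. Be careful: the deletion is irreversible, so review the list first.

julio@acer ~/Documents/Ebooks> while IFS= read -r f; do rm -- "$f"; done < dupes2.txt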

Gemini

A good paid alternative for Mac is Gemini, which lists all duplicates in a user-friendly interface and allows you to preview them before sending them to the trash.
