An efficient Windows console application to locate and deduplicate files by content. It recursively scans a directory (with optional extension filtering), groups same-sized files, compares their contents in buffered chunks, reports duplicate groups, and optionally replaces duplicate files with NTFS hard links to save space.
Features:
- Recursively enumerate files under a root folder
- Optional, case-insensitive extension filter (e.g. .txt)
- Skip files smaller than 16 KB
- Group files by size, then by content using a pivot-based, buffered comparison
- Report duplicate groups and total reclaimed bytes
- Replace duplicates with hard links to the master file
Requirements:
- Windows 7 or later on an NTFS volume
- A C++17-capable compiler (e.g. MSVC)
- Windows SDK for WinAPI functions
- Console configured for Unicode output
You can compile with the Microsoft Visual C++ compiler (cl.exe) from a Developer Command Prompt:
cl /EHsc /W4 /std:c++17 HandleDuplicateFiles.cpp
Alternatively, create a Visual Studio project:
- New → Visual C++ → Empty Project
- Add HandleDuplicateFiles.cpp to Source Files
- Project → Properties → Configuration Properties → General → Character Set: Use Unicode
- Build the solution
Usage:
HandleDuplicateFiles.exe <root_folder> [extension_filter]
- root_folder: The top-level directory to scan.
- extension_filter (optional): File extension filter including the dot (e.g. .jpg, .txt). Case-insensitive. If omitted, all files ≥ 16 KB are considered.
Find duplicate .txt files in C:\Projects:
HandleDuplicateFiles.exe C:\Projects .txt
Scan all files ≥ 16 KB under D:\Media:
HandleDuplicateFiles.exe D:\Media
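The filtering rules described above (a case-insensitive extension match and a 16 KB minimum size) can be sketched as a small predicate. This is a minimal illustration, not the tool's actual code: the names `kMinFileSize`, `ToLower`, and `PassesFilter` are hypothetical, and an empty filter string is assumed to stand for "no extension filter given".

```cpp
#include <algorithm>
#include <cstdint>
#include <cwctype>
#include <string>

// Hypothetical constant: the 16 KB minimum size stated in this README.
constexpr std::uintmax_t kMinFileSize = 16 * 1024;

// Returns a lowercased copy of a wide string for case-insensitive comparison.
inline std::wstring ToLower(std::wstring s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](wchar_t c) { return static_cast<wchar_t>(std::towlower(c)); });
    return s;
}

// Hypothetical helper: true if a file with this extension and size should be
// considered. An empty filter is assumed to mean "accept every extension".
inline bool PassesFilter(const std::wstring& extension, std::uintmax_t size,
                         const std::wstring& filter) {
    if (size < kMinFileSize) return false;  // skip files smaller than 16 KB
    if (filter.empty()) return true;        // no filter: all extensions pass
    return ToLower(extension) == ToLower(filter);
}
```

With these assumptions, `PassesFilter(L".TXT", 20000, L".txt")` accepts the file, while any file under 16384 bytes is rejected regardless of its extension.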
How it works:
- EnumerateFilesAndGroupBySize: Recursively visits each file, skips reparse points, filters by extension and minimum size, and buckets paths by file size.
- GroupFilesByContentUsingMap: For each size bucket with ≥ 2 files:
  - Picks the first file as the pivot.
  - Compares buffered chunks of the pivot against batches of up to 256 “right” files via CompareFilesBufferedAdvanced.
  - Builds a map of comparison keys (byte offset + mismatch byte) → lists of files.
  - Recursively groups non-matching subsets by their next pivot key.
  - Files matching the pivot exactly join the duplicate group.
- Reporting: Prints each duplicate group with its byte size and computes total reclaimed bytes.
- Deduplication: For each group, the first file remains the master; others are deleted and replaced with hard links pointing to it (unless they already share the same NTFS file ID).
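The size bucketing and pivot-based grouping described above can be sketched in portable C++17. This is a simplified illustration under stated assumptions, not the project's implementation: `GroupBySize`, `FirstMismatch`, `GroupByContent`, and `Mismatch` are hypothetical names, `std::filesystem` stands in for the WinAPI calls, symlinks stand in for reparse points, the mismatch byte is assumed to be taken from the right-hand file, and each right-hand file is compared on its own rather than in batches of up to 256.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <map>
#include <optional>
#include <stdexcept>
#include <system_error>
#include <utility>
#include <vector>

namespace fs = std::filesystem;
using PathList = std::vector<fs::path>;

// Buckets regular files under `root` by size. Symlinks are skipped as a
// rough, portable stand-in for the reparse-point check (an assumption).
inline std::map<std::uintmax_t, PathList> GroupBySize(const fs::path& root) {
    std::map<std::uintmax_t, PathList> buckets;
    std::error_code walkEc;
    fs::recursive_directory_iterator it(
        root, fs::directory_options::skip_permission_denied, walkEc), end;
    for (; !walkEc && it != end; it.increment(walkEc)) {
        std::error_code ec;
        if (it->is_symlink(ec) || !it->is_regular_file(ec)) continue;
        const std::uintmax_t size = it->file_size(ec);
        if (!ec) buckets[size].push_back(it->path());
    }
    return buckets;
}

// First point at which a file differs from the pivot.
struct Mismatch {
    std::uintmax_t offset;  // byte offset of the first difference
    unsigned char byte;     // byte of the right-hand file at that offset
};

// Compares two same-sized files chunk by chunk. Returns std::nullopt when
// they are identical, otherwise the first mismatch. Throws if a file cannot
// be opened. Precondition: both files have the same size.
inline std::optional<Mismatch> FirstMismatch(const fs::path& pivot, const fs::path& right,
                                             std::size_t bufferSize = 64 * 1024) {
    std::ifstream a(pivot, std::ios::binary), b(right, std::ios::binary);
    if (!a || !b) throw std::runtime_error("cannot open file for comparison");
    std::vector<char> bufA(bufferSize), bufB(bufferSize);
    std::uintmax_t offset = 0;
    while (a && b) {
        a.read(bufA.data(), static_cast<std::streamsize>(bufferSize));
        b.read(bufB.data(), static_cast<std::streamsize>(bufferSize));
        const auto n = static_cast<std::size_t>(std::min(a.gcount(), b.gcount()));
        if (n == 0) break;
        const char* first = bufA.data();
        const auto diff = std::mismatch(first, first + n, bufB.data());
        if (diff.first != first + n) {
            const auto i = static_cast<std::size_t>(diff.first - first);
            return Mismatch{offset + i, static_cast<unsigned char>(bufB[i])};
        }
        offset += n;
    }
    return std::nullopt;
}

// Splits one size bucket into groups of byte-identical files. Files that
// match the pivot join its group; the rest are keyed by (offset, byte) and
// each keyed subset is grouped recursively with a new pivot.
inline std::vector<PathList> GroupByContent(PathList files) {
    std::vector<PathList> groups;
    if (files.empty()) return groups;
    PathList same{files.front()};  // the pivot starts its own group
    std::map<std::pair<std::uintmax_t, unsigned char>, PathList> byKey;
    for (std::size_t i = 1; i < files.size(); ++i) {
        if (const auto m = FirstMismatch(files.front(), files[i]))
            byKey[{m->offset, m->byte}].push_back(files[i]);
        else
            same.push_back(files[i]);
    }
    groups.push_back(std::move(same));
    for (auto& entry : byKey) {
        auto sub = GroupByContent(std::move(entry.second));
        std::move(sub.begin(), sub.end(), std::back_inserter(groups));
    }
    return groups;
}
```

Files that share a comparison key agree with each other up to and including the mismatch offset, so only they can still be duplicates of one another; files with different keys never need to be compared again. `GroupByContent` also returns single-file groups, so a caller would keep only groups with two or more members as duplicate groups.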
Notes:
- Must run with sufficient permissions to delete files and create hard links.
- Hard links only work on NTFS volumes; duplicates on other file systems will be reported but not linked.
- Large file sets increase memory use, which grows with the number of open file streams and the size of the grouping structures.
This project is released under the MIT License. See LICENSE for details.