
aliakseis/HandleDuplicateFiles


HandleDuplicateFiles

An efficient Windows console application to locate and deduplicate files by content. It recursively scans a directory (with optional extension filtering), groups same-sized files, compares their contents in buffered chunks, reports duplicate groups, and optionally replaces duplicate files with NTFS hard links to save space.

Features

  • Recursively enumerate files under a root folder

  • Optional, case-insensitive extension filter (e.g. .txt)

  • Skip files smaller than 16 KB

  • Group files by size, then by content using a pivot-based, buffered comparison

  • Report duplicate groups and total reclaimed bytes

  • Replace duplicates with hard links to the master file

Requirements

  • Windows 7 or later on an NTFS volume

  • A C++17-capable compiler (e.g. MSVC)

  • Windows SDK for WinAPI functions

  • Console configured for Unicode output

Build Instructions

You can compile with the Microsoft Visual C++ compiler (cl.exe) from a Developer Command Prompt:

cl /EHsc /W4 HandleDuplicateFiles.cpp

Alternatively, create a Visual Studio project:

  1. New → Visual C++ → Empty Project

  2. Add HandleDuplicateFiles.cpp to Source Files

  3. Project → Properties → Configuration Properties → General → Character Set: Use Unicode

  4. Build the solution

Usage

HandleDuplicateFiles.exe root_folder [extension_filter]

  • root_folder: The top-level directory to scan.

  • extension_filter (optional): File extension filter including the dot (e.g. .jpg, .txt). Case-insensitive. If omitted, all files ≥ 16 KB are considered.

Examples

Find duplicate .txt files in C:\Projects:

HandleDuplicateFiles.exe C:\Projects .txt

Scan all files ≥ 16 KB under D:\Media:

HandleDuplicateFiles.exe D:\Media

How It Works

  1. EnumerateFilesAndGroupBySize: Recursively visits each file, skips reparse points, filters by extension and minimum size, and buckets paths by file size.

  2. GroupFilesByContentUsingMap: For each size bucket with ≥ 2 files:

    • Picks the first file as the pivot.

    • Compares buffered chunks of the pivot against batches of up to 256 “right” files via CompareFilesBufferedAdvanced.

    • Builds a map of comparison keys (byte offset + mismatch byte) → lists of files.

    • Recursively groups non-matching subsets by their next pivot key.

    • Files matching pivot exactly join the duplicate group.

  3. Reporting: Prints each duplicate group with its byte size and computes the total bytes reclaimable by deduplication.

  4. Deduplication: For each group, the first file remains the master; the others are deleted and replaced with hard links pointing to it (unless they already share the same NTFS file ID).

Permissions & Notes

  • Must run with sufficient permissions to delete files and create hard links.

  • Hard links only work on NTFS volumes; duplicates on other file systems will be reported but not linked.

  • Large file sets can consume memory proportional to the number of open streams and grouping structures.

License

This project is released under the MIT License. See LICENSE for details.
