What is File Deduplication (dedup/dedupe)?
File deduplication is a process that eliminates duplicate copies of data. For example, if you have 10 identical copies of the same file on a CoW (copy-on-write) filesystem, deduplication recognizes that all 10 copies are the same and stores the data only once; the other 9 files become references to that single copy. This can save a lot of space on your storage device.
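To make the sharing mechanism concrete, here is a minimal Go sketch, assuming Linux, a CoW filesystem such as btrfs, and the golang.org/x/sys/unix package. The file names are placeholders, and this illustrates the filesystem feature itself, not chkbit's own code:

package main

import (
    "log"
    "os"

    "golang.org/x/sys/unix"
)

func main() {
    src, err := os.Open("original.bin")
    if err != nil {
        log.Fatal(err)
    }
    defer src.Close()

    dst, err := os.Create("copy.bin")
    if err != nil {
        log.Fatal(err)
    }
    defer dst.Close()

    // FICLONE makes copy.bin reference the same on-disk extents as
    // original.bin, so no data is duplicated until one file is modified.
    if err := unix.IoctlFileClone(int(dst.Fd()), int(src.Fd())); err != nil {
        log.Fatal(err) // e.g. EOPNOTSUPP on a filesystem without CoW support
    }
}

After the clone, both paths report the full file size, but the bytes are stored only once until one of the files is modified.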
Finding Duplicate Files
Because chkbit already stores a hash of every file, it can quickly build a duplicates database. The initial scan will take some time, but after that you only need to scan for changes.
Files that already share the same space (either via CoW or hard links) are detected and not included in the reported reclaimable space.
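In essence, detection groups files by their content hash; any hash shared by more than one file marks a set of duplicate candidates. The following Go sketch shows that grouping step in isolation. It hashes the files given on the command line with SHA-256 for the demonstration, whereas chkbit reuses the hashes already stored in its index:

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "io"
    "os"
)

// hashFile returns the SHA-256 hex digest of a file's contents.
func hashFile(path string) (string, error) {
    f, err := os.Open(path)
    if err != nil {
        return "", err
    }
    defer f.Close()
    h := sha256.New()
    if _, err := io.Copy(h, f); err != nil {
        return "", err
    }
    return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
    // Group the paths given on the command line by content hash.
    groups := map[string][]string{}
    for _, path := range os.Args[1:] {
        h, err := hashFile(path)
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            continue
        }
        groups[h] = append(groups[h], path)
    }
    // Groups with more than one file are duplicate candidates; the
    // actual bytes are verified before any deduplication happens.
    for h, paths := range groups {
        if len(paths) > 1 {
            fmt.Println(h[:12], paths)
        }
    }
}

chkbit performs this grouping, along with size accounting, when you run detect: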
$ chkbit dedup detect .
msg collect matching hashes (min=8K)
msg update file sizes (for legacy indexes)
msg collect matching files
- 13s elapsed
- 1049744 files processed
- 75288.61 files/second
Detected 53576 hashes that are shared by 464530 files:
- Minimum required space: 353.7G
- Maximum required space: 3.4T
- Actual used space: 372.4G
- Reclaimable space: 18.7G
- Efficiency: 99.40%
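One way to read these figures (an interpretation of the sample numbers above, not a definition from chkbit's source): maximum required space is what the 464530 files would occupy with no sharing at all, minimum required space is what they would occupy if every duplicate group were stored exactly once, and actual used space is what they occupy right now. Reclaimable space is then the gap between actual and minimum (372.4G - 353.7G = 18.7G), and the efficiency figure appears to measure how close actual usage is to that minimum across the span between maximum and minimum ((3.4T - 372.4G) / (3.4T - 353.7G) ≈ 99.4%).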
Usage
chkbit requires an index to operate:
- if you have not used chkbit before, create an atom index
- if you already have an atom index, continue to the next step
- if you were using a split index, you need to fuse it first with
$ chkbit fuse .
(you can continue to use the split index; this just creates an atom index for dedup detect)
Run detect to update the dedup database:
$ chkbit dedup detect .
On supported systems you will see information about the reclaimable space (see above). Otherwise only basic stats will be shown.
To view details and a list of identified files, run show:
$ chkbit dedup show . -f
You can also export this data to JSON:
$ chkbit dedup show . --json
Finally, run dedup either on all identified files
$ chkbit dedup run .
or on individual hashes (that were listed by show) with
$ chkbit dedup run . HASH HASH ...
If you wish to manually compare files or run dedup, see the chkbit util command.
Requirements
- chkbit dedup does not create hard links and requires a CoW (copy-on-write) filesystem
- Detecting duplicates is supported on all operating systems
- Detecting reclaimable space is supported on Linux and macOS
- Running deduplication is currently supported on Linux with a CoW filesystem (btrfs, xfs); macOS support is coming later
Deduplication is not possible on Windows with NTFS.
Is it Safe?
Modern copy-on-write (CoW) filesystems automatically create a private copy when you write to a deduplicated file, so you do not modify the other shared copies.
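Here is a short Go sketch of that copy-on-write behavior, again assuming Linux, a CoW filesystem, and the golang.org/x/sys/unix package (the file names and contents are invented for the demo):

package main

import (
    "fmt"
    "log"
    "os"

    "golang.org/x/sys/unix"
)

func main() {
    if err := os.WriteFile("a.txt", []byte("hello world\n"), 0o644); err != nil {
        log.Fatal(err)
    }

    // Make b.txt a CoW copy of a.txt: two names, one on-disk copy.
    src, err := os.Open("a.txt")
    if err != nil {
        log.Fatal(err)
    }
    dst, err := os.Create("b.txt")
    if err != nil {
        log.Fatal(err)
    }
    if err := unix.IoctlFileClone(int(dst.Fd()), int(src.Fd())); err != nil {
        log.Fatal(err)
    }
    src.Close()
    dst.Close()

    // Overwrite part of b.txt in place. The filesystem copies the shared
    // extent for b.txt before applying the write (copy-on-write).
    f, err := os.OpenFile("b.txt", os.O_WRONLY, 0)
    if err != nil {
        log.Fatal(err)
    }
    if _, err := f.WriteAt([]byte("HOWDY"), 0); err != nil {
        log.Fatal(err)
    }
    f.Close()

    // a.txt is unaffected by the write to b.txt.
    a, err := os.ReadFile("a.txt")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("a.txt still contains %q\n", a) // "hello world\n"
}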
While the deduplication process is as safe as it can be, whenever you handle data you should always have backups in place.
Deduplication is done via the OS. On Linux/btrfs this is actually an atomic operation where the kernel first verifies that the data matches and only then deduplicates the files.
After you’ve deduplicated your files, you should run chkbit check to verify that there were no errors.
How does it work?
- chkbit’s main focus is to detect file corruption. It does this by building a database of hashes (checksums) for every file.
- The same database can be used to identify duplicate files by comparing their checksum (there is a small chance of collisions - these are verified later).
- For each group of files with the same hash, chkbit checks where they are stored on the disk to detect already deduplicated (shared) and still duplicated (exclusive) space.
- Once you decide to deduplicate, chkbit sends the ‘suggested’ duplicates to the Linux kernel for deduplication by the filesystem.
- The kernel verifies that the actual bytes of both files match and then deduplicates the files in an atomic operation.
- The space for the duplicate is now free to use again.
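For the curious, here is a minimal Go sketch of those last steps, assuming Linux and the golang.org/x/sys/unix package; chkbit's actual implementation may differ, and the file names are placeholders. The FIDEDUPERANGE ioctl hands the kernel a source byte range plus one or more destination files, and the kernel shares the extents only if the bytes really match:

package main

import (
    "fmt"
    "log"
    "os"

    "golang.org/x/sys/unix"
)

func main() {
    // Source file whose extents will be shared (placeholder names).
    src, err := os.Open("a.dat")
    if err != nil {
        log.Fatal(err)
    }
    defer src.Close()

    // Destination file, opened with write access (a read-only
    // descriptor also works if you own the file).
    dst, err := os.OpenFile("b.dat", os.O_RDWR, 0)
    if err != nil {
        log.Fatal(err)
    }
    defer dst.Close()

    fi, err := src.Stat()
    if err != nil {
        log.Fatal(err)
    }

    // Ask the kernel to dedupe the whole file in one request. Real tools
    // loop in chunks, since filesystems cap the length per call (often 16 MiB).
    rng := unix.FileDedupeRange{
        Src_offset: 0,
        Src_length: uint64(fi.Size()),
        Info: []unix.FileDedupeRangeInfo{
            {Dest_fd: int64(dst.Fd()), Dest_offset: 0},
        },
    }
    if err := unix.IoctlFileDedupeRange(int(src.Fd()), &rng); err != nil {
        log.Fatal(err)
    }

    // The kernel reports per destination whether the bytes matched and
    // how much was actually deduplicated; a negative status is an errno.
    switch info := rng.Info[0]; info.Status {
    case unix.FILE_DEDUPE_RANGE_SAME:
        fmt.Printf("deduplicated %d bytes\n", info.Bytes_deduped)
    case unix.FILE_DEDUPE_RANGE_DIFFERS:
        fmt.Println("contents differ, nothing was shared")
    default:
        fmt.Println("dedupe failed:", unix.Errno(-info.Status))
    }
}

If the filesystem does not support deduplication (for example NTFS or another non-CoW filesystem), the call fails with an error such as EOPNOTSUPP instead of touching any data.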