Duperemove

Duperemove is a simple tool for finding duplicated extents and
submitting them for deduplication. When given a list of files it will
hash their contents on a block by block basis and compare those hashes
to each other, finding and categorizing extents that match each
other. When given the -d option, duperemove will submit those
extents for deduplication using the btrfs-extent-same ioctl.
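
The "hash each block, then compare hashes" idea can be sketched in a
few lines of Python. This is only an illustration, not duperemove's
actual code: the block size, the use of SHA-256, and the function names
hash_blocks and find_matching_blocks are all assumptions made for the
example. It also stops at matching single blocks, and does not merge
neighbouring blocks into longer extents or categorize them.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 128 * 1024  # assumed block size, chosen for illustration only


def hash_blocks(path, block_size=BLOCK_SIZE):
    """Yield (offset, digest) for each block of the file at path."""
    with open(path, "rb") as f:
        offset = 0
        while True:
            block = f.read(block_size)
            if not block:
                break
            # SHA-256 is a stand-in here, not necessarily what duperemove uses.
            yield offset, hashlib.sha256(block).digest()
            offset += len(block)


def find_matching_blocks(paths, block_size=BLOCK_SIZE):
    """Group (path, offset) locations by digest and keep only duplicates."""
    by_digest = defaultdict(list)
    for path in paths:
        for offset, digest in hash_blocks(path, block_size):
            by_digest[digest].append((path, offset))
    return [locations for locations in by_digest.values() if len(locations) > 1]
```

Each returned group lists the places where one identical block was
seen. Matching hashes are only a hint; in dedupe mode the kernel does
its own comparison before sharing any data.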

Duperemove has two major modes of operation, one of which is a subset
of the other.


Readonly / Non-deduplicating Mode

When run without -d (the default), duperemove will print out one or
more tables of matching extents that it has determined would be ideal
candidates for deduplication. Readonly mode is therefore useful for
seeing what duperemove might do when run with -d. The output could
also be used by some other software to submit the extents for
deduplication at a later time.

It is important to note that this mode will not print out *all*
instances of matching extents, just those it would consider for
deduplication.

Another important note is that duperemove does not concern itself with
the underlying representation of the extents. Some of them could be
compressed, undergoing I/O, or even already deduplicated. In dedupe
mode, the kernel handles those details and therefore we try not to
replicate that work. Think of duperemove as trying for 'bulk'
deduplication.


Deduping Mode

This functions similarly to readonly mode with the exception that the
duplicated extents found in our "read, hash, and compare" step will
actually be submitted for deduplication. At the end, a total count of
bytes that were processed by the kernel will be printed.

Keep in mind that the byte count we report here (received from the
kernel) is NOT the total amount of space saved by deduplication but
rather the amount of data that the kernel also found to be identical.
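
To show where such a byte count could come from, here is a rough
Python sketch of building an extent-same request and adding up the
per-destination results. It is an illustration under assumptions, not
duperemove's implementation: the struct layouts are assumed to match
struct btrfs_ioctl_same_args and struct btrfs_ioctl_same_extent_info
from the kernel headers, the ioctl number is the encoding used on x86
and similar architectures, and the helper names are made up.

```python
import fcntl
import struct

# Assumed: _IOWR(0x94, 54, 24-byte header) as encoded on x86-style arches.
BTRFS_IOC_FILE_EXTENT_SAME = 0xC0189436
ARGS_FMT = "=QQHHI"  # logical_offset, length, dest_count, reserved1, reserved2
INFO_FMT = "=qQQiI"  # fd, logical_offset, bytes_deduped, status, reserved
ARGS_SIZE = struct.calcsize(ARGS_FMT)
INFO_SIZE = struct.calcsize(INFO_FMT)


def pack_request(src_offset, length, dests):
    """Build the ioctl buffer; dests is a list of (fd, offset) pairs."""
    buf = bytearray(struct.pack(ARGS_FMT, src_offset, length, len(dests), 0, 0))
    for fd, offset in dests:
        # bytes_deduped and status start at 0; the kernel fills them in.
        buf += struct.pack(INFO_FMT, fd, offset, 0, 0, 0)
    return buf


def total_bytes_deduped(buf):
    """Sum bytes_deduped over destinations whose status is 0 (success)."""
    _, _, count, _, _ = struct.unpack_from(ARGS_FMT, buf, 0)
    total = 0
    for i in range(count):
        _fd, _off, nbytes, status, _ = struct.unpack_from(
            INFO_FMT, buf, ARGS_SIZE + i * INFO_SIZE)
        if status == 0:
            total += nbytes
    return total


def dedupe(src_fd, src_offset, length, dests):
    """Submit one request against src_fd and return the kernel's byte count."""
    buf = pack_request(src_offset, length, dests)
    fcntl.ioctl(src_fd, BTRFS_IOC_FILE_EXTENT_SAME, buf)  # results written in place
    return total_bytes_deduped(buf)
```

The sum counts every destination range the kernel accepted, including
ranges that may already have been sharing storage, which is one reason
it should not be read as space saved.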

See the duperemove man page for further details about running duperemove.


FAQ

* Is there an upper limit to the amount of data duperemove can process?

Right now duperemove has been tested on small numbers (5-10) of VM
images or ISO files. I don't believe there should be a major problem
scaling that up to 50 or so.


* Why does it not print out all duplicate extents?

Internally, duperemove classifies extents based on various criteria
such as length, number of identical extents, etc. The printout we give
is based on the results of that classification.


* How can I find out my space savings after a dedupe?

The easiest way to do this is to run df before the dedupe operation,
then run df again about 60 seconds after the operation. It is common
for btrfs space reporting to be 'behind' while delayed updates get
processed, so an immediate df after deduping might not show any
savings.
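
The same before/after comparison can be scripted. The sketch below is
a hedged example rather than part of duperemove: it uses Python's
shutil.disk_usage as a stand-in for df, and the names used_bytes and
measure_savings, along with the run_dedupe callback, are hypothetical.

```python
import shutil
import time


def used_bytes(mount_point):
    """Return the used space, in bytes, of the filesystem at mount_point."""
    return shutil.disk_usage(mount_point).used


def measure_savings(mount_point, run_dedupe, settle_seconds=60):
    """Measure used space before run_dedupe() and again after a delay."""
    before = used_bytes(mount_point)
    run_dedupe()
    # Wait for btrfs delayed updates before taking the second reading.
    time.sleep(settle_seconds)
    return before - used_bytes(mount_point)
```

Here run_dedupe would be whatever starts the dedupe, for example a
function that runs duperemove with -d on the files of interest. Other
writes to the filesystem during the wait will skew the result.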


USAGE EXAMPLES

TODO
