capidup package

Quickly find duplicate files in directories.

CapiDup recursively crawls through all the files in a list of directories and identifies duplicate files. Duplicate files are files with the exact same content, regardless of their name, location or timestamp.

This package is designed to be fast. It detects and groups duplicate files in a single pass over each file, so CapiDup doesn’t need to compare each file to every other.
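The idea of grouping files by a content hash instead of comparing them pairwise can be sketched as follows. This is only an illustration under stated assumptions, not CapiDup's actual implementation: the function name group_by_content is hypothetical, files are first bucketed by size, and each candidate is read whole into memory for brevity.

```python
import hashlib
import os
from collections import defaultdict


def group_by_content(paths):
    """Group paths whose files share both size and MD5 digest (hypothetical helper)."""
    # Bucket by size first: files of different sizes cannot be identical.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate
        # Hash each remaining file once and bucket by digest.
        by_digest = defaultdict(list)
        for path in same_size:
            with open(path, "rb") as handle:
                digest = hashlib.md5(handle.read()).hexdigest()
            by_digest[digest].append(path)
        groups.extend(g for g in by_digest.values() if len(g) > 1)
    return groups
```

Each file is hashed at most once, so the work grows with the number of files rather than with the number of file pairs.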

capidup.finddups module

This module implements the public API. The rest is just for internal use by the package.

Public functions

capidup.finddups.find_duplicates(filenames, max_size)

Find duplicates in a list of files, comparing up to max_size bytes.

Returns a 2-tuple: (duplicate_groups, errors).

duplicate_groups is a (possibly empty) list of lists. Each inner list holds the names of files that are copies of one another, so it always has at least two entries.

errors is a list of messages for any errors that occurred. If it is empty, there were no errors.

For example, assuming a1 and a2 are identical, c1 and c2 are identical, and b is different from all others:

>>> dups, errs = find_duplicates(['a1', 'a2', 'b', 'c1', 'c2'], 1024)
>>> dups
[['a1', 'a2'], ['c1', 'c2']]
>>> errs
[]

Note that b is not included in the results, as it has no duplicates.
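A caller will usually want to walk both halves of the returned tuple. The sketch below shows one way to do that; summarize is a hypothetical helper, not part of capidup, and the input values are written out by hand in the shape shown in the example above so that the snippet runs without the package installed.

```python
def summarize(duplicate_groups, errors):
    """Return printable lines for a (duplicate_groups, errors) result."""
    lines = []
    for number, group in enumerate(duplicate_groups, start=1):
        lines.append("Group %d: %s" % (number, ", ".join(group)))
    for message in errors:
        lines.append("Error: %s" % message)
    return lines


# Hand-written values with the same shape as the result of find_duplicates().
dups = [['a1', 'a2'], ['c1', 'c2']]
errs = []
for line in summarize(dups, errs):
    print(line)
```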

capidup.finddups.find_duplicates_in_dirs(directories, exclude_dirs=None, exclude_files=None, follow_dirlinks=False)

Recursively scan a list of directories, looking for duplicate files.

exclude_dirs, if provided, should be a list of glob patterns. Subdirectories whose names match these patterns are excluded from the scan.

exclude_files, if provided, should be a list of glob patterns. Files whose names match these patterns are excluded from the scan.

follow_dirlinks controls whether to follow symbolic links to subdirectories while crawling.

Returns a 2-tuple: (duplicate_groups, errors).

duplicate_groups is a (possibly empty) list of lists. Each inner list holds the names of files that are copies of one another, so it always has at least two entries.

errors is a list of messages for any errors that occurred. If it is empty, there were no errors.

For example, assuming ./a1 and /dir1/a2 are identical, /dir1/c1 and /dir2/c2 are identical, /dir2/b is different from all others, that any subdirectories called tmp should not be scanned, and that files ending in .bak should be ignored:

>>> dups, errs = find_duplicates_in_dirs(['.', '/dir1', '/dir2'], ['tmp'], ['*.bak'])
>>> dups
[['./a1', '/dir1/a2'], ['/dir1/c1', '/dir2/c2']]
>>> errs
[]
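To get a feel for which names the exclude_dirs and exclude_files patterns would catch, the following sketch tests names against glob patterns with the standard fnmatch module. It assumes the patterns follow fnmatch's shell-style rules, which the text above does not state explicitly; is_excluded is a hypothetical helper, not a capidup function.

```python
import fnmatch


def is_excluded(name, patterns):
    """Return True if name matches any glob pattern; None means no patterns."""
    return any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns or [])


print(is_excluded("notes.bak", ["*.bak"]))  # a file ending in .bak
print(is_excluded("tmp", ["tmp"]))          # a directory named exactly tmp
print(is_excluded("tmpdir", ["tmp"]))       # no wildcard, so no match
```

Under this assumption a pattern without wildcards, such as tmp, matches only that exact name.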

Public data members

capidup.finddups.MD5_CHUNK_SIZE = 524288

Chunk size in bytes, when reading from file to calculate MD5.
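Reading in fixed-size chunks keeps memory use bounded regardless of file size. A minimal sketch of chunked MD5 hashing with the standard hashlib module is shown below; md5_of_file and CHUNK_SIZE are hypothetical names chosen for illustration, and the package's own reading loop may differ.

```python
import hashlib

CHUNK_SIZE = 524288  # same value as MD5_CHUNK_SIZE


def md5_of_file(path, chunk_size=CHUNK_SIZE):
    """Return the hex MD5 digest of a file, reading chunk_size bytes at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # read() returns b"" at end of file, which stops the iteration.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```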

capidup.finddups.PARTIAL_MD5_READ_MULT = 4096

Divisor of the partial read size, in bytes.

When hashing a portion of a file for comparison, the size of that portion will be a multiple of this value.

Tip

A good choice on GNU/Linux would be multiples of page size (usually 4096 bytes on x86).

capidup.finddups.PARTIAL_MD5_THRESHOLD = 8192

Above this file size in bytes, we do a partial comparison first.

capidup.finddups.PARTIAL_MD5_MAX_READ = 65536

Maximum size of the partial read, in bytes.

capidup.finddups.PARTIAL_MD5_READ_RATIO = 4

The partial read covers 1/n of the file size, where n is this value, up to a maximum of PARTIAL_MD5_MAX_READ bytes.
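One plausible way these four values could combine is sketched below. The exact arithmetic is an assumption based only on the descriptions above (round down to a multiple, apply the cap, treat small files as hashed in full); partial_read_size is a hypothetical function, and the package's real calculation may differ.

```python
MULT = 4096       # PARTIAL_MD5_READ_MULT
THRESHOLD = 8192  # PARTIAL_MD5_THRESHOLD
MAX_READ = 65536  # PARTIAL_MD5_MAX_READ
RATIO = 4         # PARTIAL_MD5_READ_RATIO


def partial_read_size(file_size):
    """Return the number of bytes to hash first, or None for small files."""
    if file_size <= THRESHOLD:
        return None  # assumed: files at or below the threshold are hashed in full
    size = file_size // RATIO  # 1/n of the file size
    size -= size % MULT        # round down to a multiple of MULT
    # Assumed: read at least one multiple, and never more than the cap.
    return max(MULT, min(size, MAX_READ))


print(partial_read_size(100000))   # 24576
print(partial_read_size(1000000))  # 65536
```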