capidup package
Quickly find duplicate files in directories.
CapiDup recursively crawls through all the files in a list of directories and identifies duplicate files. Duplicate files are files with the exact same content, regardless of their name, location or timestamp.
This package is designed to be quite fast. It uses a smart algorithm to detect and group duplicate files using a single pass on each file (that is, CapiDup doesn’t need to compare each file to every other).
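The grouping idea can be illustrated with a minimal sketch. It is not CapiDup's actual implementation; it only shows how files can be bucketed by size and then by a content hash, so that no file is compared pairwise against every other. The function name group_duplicates is hypothetical.

```python
import hashlib
import os
from collections import defaultdict


def group_duplicates(paths):
    """Group paths with identical content (illustrative sketch only)."""
    # Bucket by size first: files of different sizes cannot be duplicates.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size means no possible duplicate
        # Hash each remaining file once and bucket by digest.
        by_digest = defaultdict(list)
        for path in same_size:
            with open(path, "rb") as f:
                digest = hashlib.md5(f.read()).hexdigest()
            by_digest[digest].append(path)
        groups.extend(sorted(g) for g in by_digest.values() if len(g) > 1)
    return sorted(groups)
```

Each file is opened and read at most once, and files with a unique size are never read at all.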
capidup.finddups module
This module implements the public API. The rest is just for internal use by the package.
Public functions
capidup.finddups.find_duplicates(filenames, max_size)
Find duplicates in a list of files, comparing up to max_size bytes.
Returns a 2-tuple: (duplicate_groups, errors).
duplicate_groups is a (possibly empty) list of lists: the names of files that have at least two copies, grouped together.
errors is a list of messages for any errors that occurred. If empty, there were no errors.
For example, assuming a1 and a2 are identical, c1 and c2 are identical, and b is different from all others:
>>> dups, errs = find_duplicates(['a1', 'a2', 'b', 'c1', 'c2'], 1024)
>>> dups
[['a1', 'a2'], ['c1', 'c2']]
>>> errs
[]
Note that b is not included in the results, as it has no duplicates.
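A caller usually needs to handle both elements of the returned tuple. The sketch below uses a hypothetical helper, format_report, and feeds it the sample result from the example above rather than calling capidup, so it runs without the package installed.

```python
def format_report(duplicate_groups, errors):
    """Render a (duplicate_groups, errors) result as text lines (hypothetical helper)."""
    lines = []
    for number, group in enumerate(duplicate_groups, start=1):
        # Every group holds at least two names with identical content.
        lines.append(f"group {number}: {', '.join(group)}")
    for message in errors:
        lines.append(f"error: {message}")
    if not duplicate_groups and not errors:
        lines.append("no duplicates found")
    return lines


# Sample values copied from the documented example, not produced by a real scan.
dups, errs = [['a1', 'a2'], ['c1', 'c2']], []
for line in format_report(dups, errs):
    print(line)
```

Checking errors even when duplicate_groups is non-empty matters, because the function returns the errors alongside the groups it did find.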
capidup.finddups.find_duplicates_in_dirs(directories, exclude_dirs=None, exclude_files=None, follow_dirlinks=False)
Recursively scan a list of directories, looking for duplicate files.
exclude_dirs, if provided, should be a list of glob patterns. Subdirectories whose names match these patterns are excluded from the scan.
exclude_files, if provided, should be a list of glob patterns. Files whose names match these patterns are excluded from the scan.
follow_dirlinks controls whether to follow symbolic links to subdirectories while crawling.
Returns a 2-tuple: (duplicate_groups, errors).
duplicate_groups is a (possibly empty) list of lists: the names of files that have at least two copies, grouped together.
errors is a list of messages for any errors that occurred. If empty, there were no errors.
For example, assuming ./a1 and /dir1/a2 are identical, /dir1/c1 and /dir2/c2 are identical, /dir2/b is different from all others, that any subdirectories called tmp should not be scanned, and that files ending in .bak should be ignored:
>>> dups, errs = find_duplicates_in_dirs(['.', '/dir1', '/dir2'], ['tmp'], ['*.bak'])
>>> dups
[['./a1', '/dir1/a2'], ['/dir1/c1', '/dir2/c2']]
>>> errs
[]
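How such exclusion patterns can be applied during a directory walk is sketched below. This is an illustration, not the package's code: it assumes patterns are matched against base names with fnmatch-style rules, and the names matches_any and walk_filtered are hypothetical.

```python
import fnmatch
import os


def matches_any(name, patterns):
    """Return True if name matches at least one glob pattern."""
    return any(fnmatch.fnmatch(name, pattern) for pattern in (patterns or []))


def walk_filtered(top, exclude_dirs=None, exclude_files=None, follow_dirlinks=False):
    """Yield file paths under top, skipping excluded names (illustrative sketch)."""
    for dirpath, dirnames, filenames in os.walk(top, followlinks=follow_dirlinks):
        # Prune in place so os.walk does not descend into excluded subdirectories.
        dirnames[:] = [d for d in dirnames if not matches_any(d, exclude_dirs)]
        for name in filenames:
            if not matches_any(name, exclude_files):
                yield os.path.join(dirpath, name)
```

Pruning dirnames in place means an excluded subdirectory is skipped entirely, so nothing beneath it is visited.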
Public data members
capidup.finddups.MD5_CHUNK_SIZE = 524288
Chunk size in bytes, when reading from file to calculate MD5.
capidup.finddups.PARTIAL_MD5_READ_MULT = 4096
Divisor of the partial read size, in bytes.
When hashing a portion of a file for comparison, the size of that portion will be a multiple of this value.
Tip: A good choice on GNU/Linux would be a multiple of the page size (usually 4096 bytes on x86).
capidup.finddups.PARTIAL_MD5_THRESHOLD = 8192
Above this file size in bytes, we do a partial comparison first.
capidup.finddups.PARTIAL_MD5_MAX_READ = 65536
Maximum size of the partial read, in bytes.
capidup.finddups.PARTIAL_MD5_READ_RATIO = 4
Partial reads cover 1/n of the file size, where n is this value, as long as the result stays below PARTIAL_MD5_MAX_READ.
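One plausible way these settings could combine is sketched below. The rule in partial_read_size is an assumption inferred from the descriptions above, not taken from the package source, and both function names are hypothetical. The constants are copied here so the sketch runs without capidup installed.

```python
import hashlib

# Values copied from the data members documented above.
MD5_CHUNK_SIZE = 524288
PARTIAL_MD5_READ_MULT = 4096
PARTIAL_MD5_THRESHOLD = 8192
PARTIAL_MD5_MAX_READ = 65536
PARTIAL_MD5_READ_RATIO = 4


def partial_read_size(file_size):
    """Return an assumed partial read size, or None for files at or below the threshold."""
    if file_size <= PARTIAL_MD5_THRESHOLD:
        return None
    size = min(file_size // PARTIAL_MD5_READ_RATIO, PARTIAL_MD5_MAX_READ)
    # Round down to a multiple of PARTIAL_MD5_READ_MULT, but read at least one unit.
    size -= size % PARTIAL_MD5_READ_MULT
    return max(size, PARTIAL_MD5_READ_MULT)


def md5_of_stream(stream, limit=None):
    """Hash up to limit bytes (or everything) from a binary stream, chunk by chunk."""
    md5 = hashlib.md5()
    remaining = limit
    while remaining is None or remaining > 0:
        want = MD5_CHUNK_SIZE if remaining is None else min(MD5_CHUNK_SIZE, remaining)
        chunk = stream.read(want)
        if not chunk:
            break  # end of stream reached before the limit
        md5.update(chunk)
        if remaining is not None:
            remaining -= len(chunk)
    return md5.hexdigest()
```

Under these assumptions, a 100000-byte file would get a 24576-byte partial read: one quarter of the size is 25000 bytes, rounded down to a multiple of 4096.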