Checkem
=======

Find duplicate files efficiently, using Perl on Unix-like operating systems,
and maybe other ones too (untested). Requires only modules that have been in
Perl core since 5.7.3 at the latest. On earlier Perls, you will need to install
`Digest`.

Requires at least one directory argument:

    $ checkem .
    $ checkem ~tom ~chantelle
    $ checkem /usr /usr/local

You can install it in `/usr/local/bin` with:

    # make install

You can define a `PREFIX` to install it elsewhere:

    $ make install PREFIX="$HOME"/.local

There's a (presently) very basic test suite:

    $ make test

Q&A
---

### Can I compare sets of files rather than sets of directories?

Sure. This uses [`File::Find`][1] under the hood, which like POSIX
[`find(1)`][2] will still apply tests and actions to its initial arguments even
if they're not directories. This means you could do something like this to just
look for duplicate `.iso` files, provided you don't have more than `ARG_MAX`:

    $ checkem ~/media/*.iso

Or even this, for a `find(1)` that supports the `+` terminator (POSIX):

    $ find ~/media -type f -name \*.iso -exec checkem {} +

### Why is this faster than just hashing every file?

It checks the size of each file first, and only ends up hashing them if they're
the same size but have different devices and/or inode numbers (i.e. they're not
hard links). Hashing is an expensive last resort, and in many situations this
won't end up running a single hash comparison.

### I keep getting `.git` metadata files listed as duplicates.

They're accurate, but you probably don't care. Filter them out by paragraph
block. If you have a POSIX-fearing `awk`, you could do something like this:

    $ checkem /dir | awk -v RS= -v ORS='\n\n' '!index($0,"/.git")'

Or, if you were born after the Unix epoch:

    $ checkem /dir | perl -00 -ne 'print if 0>index $_,"/.git"'

### How could I make it even quicker?

Run it on a fast disk, mostly. For large directories or large files, it will
probably be I/O bound in most circumstances.

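
To see where the time actually goes, it may help to look at the size-first
strategy from "Why is this faster than just hashing every file?" as code. The
following is only a rough Python sketch of that idea, not the Perl
implementation: the function name `find_duplicates`, the SHA-256 default, the
1 MiB read size, and the choice to count a set of hard links as a single file
are all assumptions made for illustration.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicates(paths, algorithm="sha256"):
    """Return groups (sorted lists of paths) whose contents are identical.

    Illustrative sketch only; name, default algorithm, and hard-link
    handling are assumptions, not checkem's actual behaviour.
    """
    # Stage 1: bucket by size. A file with a unique size cannot have a
    # duplicate, so it is never opened at all.
    by_size = defaultdict(list)
    for path in paths:
        st = os.stat(path)
        by_size[st.st_size].append((path, (st.st_dev, st.st_ino)))

    groups = []
    for entries in by_size.values():
        # Stage 2: hard links share a (device, inode) pair, so keep one
        # representative path per pair (assumption: links are not reported).
        by_inode = {}
        for path, key in entries:
            by_inode.setdefault(key, path)
        if len(by_inode) < 2:
            continue

        # Stage 3: hash only the survivors, reading in chunks so that
        # large files are not loaded into memory at once.
        by_digest = defaultdict(list)
        for path in by_inode.values():
            digest = hashlib.new(algorithm)
            with open(path, "rb") as handle:
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            by_digest[digest.hexdigest()].append(path)
        groups.extend(sorted(g) for g in by_digest.values() if len(g) > 1)
    return groups
```

Called with a list of file paths, this returns one list per set of identical
files. Only files that survive the size and inode checks are ever read, which
is why the remaining cost is mostly disk reads rather than hashing.
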

If you end up hashing a lot of files because their sizes are the same, and
you're not worried about [SHA-1 technically being broken in practice][3], using
SHA-1 instead is a tiny bit faster:

    $ CHECKEM_ALG=SHA-1 checkem /dir

Realistically, though, this is almost certainly splitting hairs.

Theoretically, you could read only the first *n* bytes of each hash-needing
file and hash those with some suitable inexpensive function *f*, and just
compare those before resorting to checking the entire file with a safe hash
function *g*. You'd need to decide on suitable values for *n*, *f*, and *g* in
such a case; it might be useful for very large sets of files that will almost
certainly differ in the first *n* bytes. If there's interest in this at all,
I'll write it in as optional behaviour.

Contributors
------------

* Timothy Goddard (pruby) fixed two bugs.

License
-------

Copyright (c) [Tom Ryder][4]. Distributed under an [MIT License][5].

[1]: https://metacpan.org/pod/File::Find
[2]: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/find.html
[3]: https://shattered.io/
[4]: https://sanctum.geek.nz/
[5]: https://www.opensource.org/licenses/MIT