Checkem
=======

Find duplicate files efficiently, using Perl on Unix-like operating systems,
and maybe other ones too (untested). Requires only modules that have been in
Perl core since 5.7.3 at the latest. On earlier Perls, you will need to install
`Digest`.

Requires at least one directory argument:

    $ checkem .
    $ checkem ~tom ~chantelle
    $ checkem /usr /usr/local

You can install it in `/usr/local/bin` with:

    # make install

You can define a `PREFIX` to install it elsewhere:

    $ make install PREFIX="$HOME"/.local

There's a (presently) very basic test suite:

    $ make test

Q&A
---

### Can I compare sets of files rather than sets of directories?

Sure. This uses [`File::Find`][1] under the hood, which like POSIX
[`find(1)`][2] will still apply tests and actions to its initial arguments even
if they're not directories. This means you could do something like this to just
look for duplicate `.iso` files, provided you don't have more than `ARG_MAX`:

    $ checkem ~/media/*.iso

Or even this, for a `find(1)` that supports the `+` terminator (POSIX):

    $ find ~/media -type f -name \*.iso -exec checkem {} +

### Why is this faster than just hashing every file?

It checks the size of each file first, and only ends up hashing them if they're
the same size but have different devices and/or inode numbers (i.e. they're not
hard links). Hashing is an expensive last resort, and in many situations this
won't end up running a single hash comparison.

### I keep getting `.git` metadata files listed as duplicates.

They're accurate, but you probably don't care. Filter them out by paragraph
block. If you have a POSIX-fearing `awk`, you could do something like this:

    $ checkem /dir | awk -v RS= -v ORS='\n\n' '!index($0,"/.git")'

Or, if you were born after the Unix epoch:

    $ checkem /dir | perl -00 -ne 'print if 0>index $_,"/.git"'

### How could I make it even quicker?

Run it on a fast disk, mostly. For large directories or large files, it will
probably be I/O bound in most circumstances.

If you end up hashing a lot of files because their sizes are the same, and
you're not worried about [SHA-1 technically being broken in practice][3], it's
a tiny bit faster:

    $ CHECKEM_ALG=SHA-1 checkem /dir

Realistically, though, this is almost certainly splitting hairs.

Theoretically, you could read only the first *n* bytes of each hash-needing
file and hash those with some suitable inexpensive function *f*, and just
compare those before resorting to checking the entire file with a safe hash
function *g*. You'd need to decide on suitable values for *n*, *f*, and *g* in
such a case; it might be useful for very large sets of files that will almost
certainly differ in the first *n* bytes. If there's interest in this at all,
I'll write it in as optional behaviour.

Contributors
------------

* Timothy Goddard (pruby) fixed two bugs.

License
-------

Copyright (c) [Tom Ryder][4]. Distributed under an [MIT License][5].

[1]: https://metacpan.org/pod/File::Find
[2]: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/find.html
[3]: https://shattered.io/
[4]: https://sanctum.geek.nz/
[5]: https://www.opensource.org/licenses/MIT
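The size-first strategy described under "Why is this faster than just hashing
every file?" can be sketched roughly as follows. This is an illustrative
Python sketch, not the script's actual Perl; function names, the choice of
SHA-256, and whole-file reads are all simplifications:

```python
import hashlib
import os
from collections import defaultdict


def find_dupes(roots):
    """Return groups of duplicate file paths under the given directories."""
    # size -> {(device, inode): one path}; hard links share (device, inode),
    # so only one path per pair is kept and never compared against itself.
    by_size = defaultdict(dict)
    for root in roots:
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                st = os.stat(path)
                by_size[st.st_size].setdefault((st.st_dev, st.st_ino), path)

    dupes = defaultdict(list)  # digest -> paths with that digest
    for files in by_size.values():
        if len(files) < 2:
            continue  # unique size: no hashing needed (the common case)
        for path in files.values():
            # Expensive last resort; reads the whole file for simplicity
            with open(path, 'rb') as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            dupes[digest].append(path)
    return [paths for paths in dupes.values() if len(paths) > 1]
```

Files with a unique size never reach the hashing stage at all, which is why a
tree of mostly differently-sized files costs little more than a `stat(2)` per
file.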
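The theoretical "first *n* bytes" pre-filter described above could look
something like this. Again a hypothetical Python sketch, not proposed
implementation code: `PREFIX_BYTES` stands in for *n*, and MD5 and SHA-256 are
stand-ins for the cheap function *f* and the safe function *g*:

```python
import hashlib

PREFIX_BYTES = 4096  # n: how much of each file the cheap pass reads


def prefix_key(path):
    # f: a cheap digest of only the first n bytes. A mismatch here proves
    # the files differ, so only matching prefixes need the full hash.
    with open(path, 'rb') as fh:
        return hashlib.md5(fh.read(PREFIX_BYTES)).digest()


def full_key(path):
    # g: the safe whole-file hash, read in chunks to bound memory use
    h = hashlib.sha256()
    with open(path, 'rb') as fh:
        for chunk in iter(lambda: fh.read(1 << 16), b''):
            h.update(chunk)
    return h.digest()
```

As the README notes, this only pays off when same-sized files usually differ
early on; when prefixes routinely match, every file is read twice.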