Checkem
=======

Find duplicate files efficiently, using Perl on Unix-like operating systems,
and maybe other ones too (untested). Requires only modules that have been in
Perl core since 5.7.3 at the latest. On earlier Perls, you will need to
install `Digest`.

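If you're not sure whether your Perl already has `Digest`, a quick check
(just an illustration, not part of Checkem itself) is to try loading it:

    $ perl -MDigest -e1 && echo 'Digest OK'
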
Requires at least one directory argument:

    $ checkem .
    $ checkem ~tom ~chantelle
    $ checkem /usr /usr/local

You can install it in `/usr/local/bin` with:

    # make install

You can define a `PREFIX` to install it elsewhere:

    $ make install PREFIX="$HOME"/.local

There's a (presently) very basic test suite:

    $ make test

+Q&A
+---
+
+### Can I compare sets of files rather than sets of directories?
+
+Sure. This uses [`File::Find`][1] under the hood, which like POSIX
+[`find(1)`][2] will still apply tests and actions to its initial arguments even
+if they're not directories. This means you could do something like this to just
+look for duplicate `.iso` files, provided you don't have more than `ARG_MAX`:
+
+ $ checkem ~/media/*.iso
+
+Or even this, for a `find(1)` that supports the `+` terminator (POSIX):
+
+ $ find ~/media -type f -name \*.iso -exec checkem {} +
+
### Why is this faster than just hashing every file?

It checks the size of each file first, and only ends up hashing files if
they're the same size but have different devices and/or inode numbers (i.e.
they're not hard links to the same file). Hashing is an expensive last
resort, and in many situations this won't end up running a single hash
comparison.

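As a rough sketch of that strategy (illustrative only, not Checkem's actual
implementation; the use of core `Digest::SHA` with SHA-256 is an assumption
here), you could bucket paths by size, skip repeated device and inode pairs,
and digest only the survivors:

    #!/usr/bin/env perl
    # Sketch: hash a file only when another file of the same size exists
    # on a different device/inode pair (i.e. it isn't just a hard link).
    use strict;
    use warnings;
    use Digest::SHA;

    my ( %seen, %by_size );
    for my $path ( grep { -f } @ARGV ) {
        my ( $dev, $ino, $size ) = ( stat $path )[ 0, 1, 7 ];
        next unless defined $size;
        next if $seen{"$dev:$ino"}++;    # hard link to a file already queued
        push @{ $by_size{$size} }, $path;
    }

    # Only size buckets with at least two members can contain duplicates
    for my $paths ( grep { @$_ > 1 } values %by_size ) {
        my %by_hash;
        for my $path (@$paths) {
            my $hash = Digest::SHA->new(256)->addfile($path)->hexdigest;
            push @{ $by_hash{$hash} }, $path;
        }
        print join( "\n", @$_ ), "\n\n" for grep { @$_ > 1 } values %by_hash;
    }
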
### I keep getting `.git` metadata files listed as duplicates.

They're accurate, but you probably don't care. Filter them out by paragraph
block. If you have a POSIX-fearing `awk`, you could do something like this:

    $ checkem /dir | awk -v RS= -v ORS='\n\n' '!index($0,"/.git")'

Or, if you were born after the Unix epoch:

    $ checkem /dir | perl -00 -ne 'print if 0>index $_,"/.git"'

### How could I make it even quicker?

Run it on a fast disk, mostly. For large directories or large files, it will
almost certainly be I/O bound.

If you end up hashing a lot of files because their sizes are the same, and
you're not worried about [SHA-1 technically being broken in practice][3],
selecting it as the digest algorithm is a tiny bit faster:

    $ CHECKEM_ALG=SHA-1 checkem /dir

Realistically, though, this is almost certainly splitting hairs.

Theoretically, you could read only the first *n* bytes of each hash-needing
file, hash those with some suitably inexpensive function *f*, and compare
those digests before resorting to checking the entire file with a safe hash
function *g*. You'd need to decide on suitable values for *n*, *f*, and *g*
in such a case; it might be useful for very large sets of files that will
almost certainly differ in the first *n* bytes. If there's interest in this
at all, I'll write it in as optional behaviour.

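For the curious, here's one possible shape for that idea (a sketch only, not
implemented in Checkem; the 4096-byte prefix and the choice of MD5 for *f*
and SHA-256 for *g* are arbitrary assumptions):

    #!/usr/bin/env perl
    # Sketch: compare a cheap digest (f) of each file's first $n bytes,
    # and pay for a full, safe digest (g) only when the prefixes collide.
    use strict;
    use warnings;
    use Digest::MD5;
    use Digest::SHA;

    my $n = 4096;    # arbitrary prefix length; tune for your file set

    # f: digest only the first $n bytes of the file
    sub prefix_digest {
        my ($path) = @_;
        open my $fh, '<:raw', $path or return;
        my $got = read $fh, my $buf, $n;
        return unless defined $got;
        return Digest::MD5->new->add($buf)->hexdigest;
    }

    # g: digest the whole file
    sub full_digest {
        my ($path) = @_;
        return Digest::SHA->new(256)->addfile($path)->hexdigest;
    }

    # Files can only be duplicates if their prefix digests match
    my %by_prefix;
    for my $path ( grep { -f } @ARGV ) {
        my $prefix = prefix_digest($path);
        push @{ $by_prefix{$prefix} }, $path if defined $prefix;
    }
    for my $group ( grep { @$_ > 1 } values %by_prefix ) {
        my %by_full;
        push @{ $by_full{ full_digest($_) } }, $_ for @$group;
        print join( "\n", @$_ ), "\n\n" for grep { @$_ > 1 } values %by_full;
    }
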
Contributors
------------

* Timothy Goddard (pruby) fixed two bugs.

License
-------

Copyright (c) [Tom Ryder][4]. Distributed under an [MIT License][5].

[1]: https://metacpan.org/pod/File::Find
[2]: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/find.html
[3]: https://shattered.io/
[4]: https://sanctum.geek.nz/
[5]: https://www.opensource.org/licenses/MIT