author    Tom Ryder <tom@sanctum.geek.nz>    2018-06-29 17:30:33 +1200
committer Tom Ryder <tom@sanctum.geek.nz>    2018-06-29 17:30:33 +1200
commit    b96f6049e48875f59a99dcf42624674c4520eb6a (patch)
tree      3100b44e9c394351ebc09ce4483766d76b25e066 /README.markdown
parent    Remove Carp dependency (diff)
Rename README to .md
Diffstat (limited to 'README.markdown')
-rw-r--r--    README.markdown    96
1 file changed, 0 insertions, 96 deletions
diff --git a/README.markdown b/README.markdown
deleted file mode 100644
index d09bc87..0000000
--- a/README.markdown
+++ /dev/null
@@ -1,96 +0,0 @@
-Checkem
-=======
-
-Find duplicate files efficiently, using Perl, on Unix-like operating systems
-and possibly others (untested). It requires only modules that have been in
-Perl core since 5.7.3 at the latest; on earlier Perls, you will need to
-install `Digest`.
-
-Requires at least one directory argument:
-
-    $ checkem .
-    $ checkem ~tom ~chantelle
-    $ checkem /usr /usr/local
-
-You can install it in `/usr/local/bin` with:
-
-    # make install
-
-You can define a `PREFIX` to install it elsewhere:
-
-    $ make install PREFIX="$HOME"/.local
-
-There's a (presently) very basic test suite:
-
-    $ make test
-
-Q&A
----
-
-### Can I compare sets of files rather than sets of directories?
-
-Sure. This uses [`File::Find`][1] under the hood, which, like POSIX
-[`find(1)`][2], still applies its tests and actions to its initial arguments
-even if they're not directories. This means you could do something like this
-to look only for duplicate `.iso` files, provided the expanded list doesn't
-exceed `ARG_MAX`:
-
-    $ checkem ~/media/*.iso
-
-Or even this, for a `find(1)` that supports the `+` terminator (POSIX):
-
-    $ find ~/media -type f -name \*.iso -exec checkem {} +
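-
-If you'd like to see that behaviour for yourself, here's a minimal
-`File::Find` sketch (an illustration only, not part of checkem) that prints
-every name it visits, including any plain files given as starting points:
-
-    #!/usr/bin/env perl
-    use strict;
-    use warnings;
-    use File::Find;
-
-    @ARGV or die "Usage: $0 path [path ...]\n";
-
-    # find() applies the callback to each starting point itself, directory
-    # or not, before descending into any directories among them.
-    find(sub { print "$File::Find::name\n" }, @ARGV);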
-
-### Why is this faster than just hashing every file?
-
-It checks the size of each file first, and only ends up hashing files if they
-share a size but differ in device and/or inode number (i.e. they're not hard
-links to the same data). Hashing is an expensive last resort; in many
-situations it never runs a single hash comparison.
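-
-As an illustration only (a simplification, not checkem's actual code), the
-strategy looks roughly like this, given a list of paths in `@files` and a
-hypothetical `hash_file` function standing in for your preferred digest:
-
-    # Group candidate files by size; only same-sized files can match.
-    my %by_size;
-    push @{ $by_size{ -s $_ } }, $_ for @files;
-
-    for my $group ( grep { @$_ > 1 } values %by_size ) {
-        my ( %seen, %by_hash );
-        for my $file (@$group) {
-            my ( $dev, $ino ) = stat $file;
-            # Hard links share a device and inode: they're duplicates by
-            # definition, so there's nothing to hash.
-            next if $seen{"$dev:$ino"}++;
-            push @{ $by_hash{ hash_file($file) } }, $file;
-        }
-        # Any hash bucket holding more than one file is a duplicate set.
-    }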
-
-### I keep getting `.git` metadata files listed as duplicates.
-
-They're accurate, but you probably don't care. Since each set of duplicates
-is printed as a paragraph block, you can filter them out by paragraph. If you
-have a POSIX-fearing `awk`, you could do something like this:
-
-    $ checkem /dir | awk -v RS= -v ORS='\n\n' '!index($0,"/.git")'
-
-Or, if you were born after the Unix epoch:
-
-    $ checkem /dir | perl -00 -ne 'print if 0>index $_,"/.git"'
-
-### How could I make it even quicker?
-
-Run it on a fast disk, mostly. For large directories or large files, it will
-be I/O bound in most circumstances.
-
-If you end up hashing a lot of files because their sizes are the same, and
-you're not worried about [SHA-1 being broken in practice][3], switching the
-digest algorithm to SHA-1 is a tiny bit faster:
-
-    $ CHECKEM_ALG=SHA-1 checkem /dir
-
-Realistically, though, this is almost certainly splitting hairs.
-
-Theoretically, you could read only the first *n* bytes of each hash-needing
-file and hash those with some suitable inexpensive function *f*, and just
-compare those before resorting to checking the entire file with a safe hash
-function *g*. You'd need to decide on suitable values for *n*, *f*, and *g* in
-such a case; it might be useful for very large sets of files that will almost
-certainly differ in the first *n* bytes. If there's interest in this at all,
-I'll write it in as optional behaviour.
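-
-For the curious, that pre-filter might be sketched like this (purely
-hypothetical; nothing like it is implemented), with *n* = 4096 bytes and MD5
-standing in for the cheap function *f*:
-
-    use Digest;
-
-    # Digest only the first $n bytes with a cheap algorithm; only files
-    # whose prefix digests collide would go on to the full, safe hash (g).
-    sub prefix_digest {
-        my ( $path, $n ) = @_;
-        open my $fh, '<', $path or die "$path: $!";
-        binmode $fh;
-        defined read( $fh, my $buf, $n ) or die "$path: $!";
-        return Digest->new('MD5')->add($buf)->hexdigest;
-    }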
-
-Contributors
-------------
-
-* Timothy Goddard (pruby) fixed two bugs.
-
-License
--------
-
-Copyright (c) [Tom Ryder][4]. Distributed under an [MIT License][5].
-
-[1]: https://metacpan.org/pod/File::Find
-[2]: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/find.html
-[3]: https://shattered.io/
-[4]: https://sanctum.geek.nz/
-[5]: https://www.opensource.org/licenses/MIT