Checkem
=======

Find duplicate files efficiently, using Perl on Unix-like operating systems,
and maybe other ones too (untested). Requires Perl core modules, including
`Digest::SHA`; it should work with any Perl newer than v5.10.0, and possibly
even older versions if you install some extra modules.

Requires at least one directory argument:

    $ checkem .
    $ checkem ~tom ~chantelle
    $ checkem /usr /usr/local

You can install it in `/usr/local/bin` with:

    # make install

You can define a `PREFIX` to install it elsewhere:

    $ make install PREFIX="$HOME"/.local

Q&A
---

### Why is this faster than just hashing every file?

It checks the size of each file first, and only hashes files that are the same
size but have different device and/or inode numbers (i.e. they're not hard
links to the same file). Hashing is an expensive last resort, and in many
situations it won't need to run a single hash comparison.
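
If it helps to picture the approach, here is a rough Perl sketch of the same
idea. It is an illustration, not the script's actual code: the names and the
hard-coded SHA-256 digest are placeholders. Files are grouped by size, entries
sharing a device and inode are collapsed into one, and hashing only happens
inside groups that still hold more than one distinct file.

    use strict;
    use warnings;
    use Digest::SHA;
    use File::Find;

    # Group every file under the given directories by its size,
    # remembering its device:inode pair to spot hard links later.
    my %by_size;
    find(sub {
        return unless -f;
        my ($dev, $ino, $size) = (stat)[0, 1, 7];
        push @{ $by_size{$size} },
            { path => $File::Find::name, id => "$dev:$ino" };
    }, @ARGV);

    for my $group (values %by_size) {

        # Collapse hard links: keep one path per device:inode pair.
        my %uniq;
        $uniq{ $_->{id} } //= $_->{path} for @$group;
        my @paths = values %uniq;
        next if @paths < 2;    # nothing to compare, so nothing to hash

        # Only now pay for hashing, and report any matching digests.
        my %by_digest;
        for my $path (@paths) {
            my $digest = Digest::SHA->new('sha256')->addfile($path)->hexdigest;
            push @{ $by_digest{$digest} }, $path;
        }
        for my $dupes (values %by_digest) {
            print join("\n", @$dupes), "\n\n" if @$dupes > 1;
        }
    }

The point is just that the expensive digest step only ever runs inside a group
of same-sized, distinct files.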

### I keep getting `.git` metadata files listed as duplicates.

They're accurate, but you probably don't care. Filter them out by paragraph
block. If you have a POSIX-fearing `awk`, you could do something like this:

    $ checkem /dir | awk 'BEGIN{RS="";ORS="\n\n"} !/\/\.git/'

### How could I make it even quicker?

Run it on a fast disk, mostly. With large directories or large files, it will
probably be I/O bound.

If you end up hashing a lot of files because their sizes are the same, and
you're not worried about SHA-1 technically being broken in practice, SHA-1 is a
tiny bit faster:

    $ CHECKEM_ALG=sha1 checkem /dir

Theoretically, you could read only the first *n* bytes of each file that needs
hashing, hash those with some suitably inexpensive function *f*, and compare
those digests before resorting to checking the entire file with a safe hash
function *g*.

You'd need to decide on suitable values for *n*, *f*, and *g* in such a case;
it might be useful for very large sets of files that will almost certainly
differ in the first *n* bytes. If there's interest in this at all, I'll write
it in as optional behaviour.
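
If that ever happens, it might look something like this purely hypothetical
sketch, in which the prefix length, the cheap function *f* (MD5 here), the safe
function *g* (SHA-256 here), and the subroutine names are all placeholder
choices rather than anything checkem actually does:

    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);
    use Digest::SHA;

    # Hypothetical prefix length n; the right value depends on your data.
    my $n = 64 * 1024;

    # Cheap "f": digest of only the first $n bytes of the file.
    sub prefix_digest {
        my ($path) = @_;
        open my $fh, '<:raw', $path or die "$path: $!";
        read $fh, my $buf, $n;
        close $fh;
        return md5_hex($buf // q{});
    }

    # Safe "g": digest of the whole file, used only when prefixes collide.
    sub full_digest {
        my ($path) = @_;
        return Digest::SHA->new('sha256')->addfile($path)->hexdigest;
    }

Files whose prefix digests differ can be ruled out without ever being read in
full; only the remaining collisions need the full-file hash.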

License
-------

Copyright (c) [Tom Ryder][1]. Distributed under an [MIT License][2].

[1]: https://sanctum.geek.nz/
[2]: https://www.opensource.org/licenses/MIT