RMLINT(1)
rmlint - find duplicate files and other space waste efficiently
rmlint [TARGET_DIR_OR_FILES ...] [//] [TAGGED_TARGET_DIR_OR_FILES ...] [-] [OPTIONS]
rmlint finds space waste and other broken things on your filesystem. Its main focus lies on finding duplicate files and directories.
It is able to find the following types of lint:

- Duplicate files and duplicated directories.
- Nonstripped binaries (i.e. binaries with debug symbols).
- Broken symbolic links.
- Empty files and directories.
- Files with broken user and/or group ID.
rmlint itself WILL NOT DELETE ANY FILES. It does however produce executable output (for example a shell script) to help you delete the files if you want to. Another design principle is that it should work well together with other tools like find. Therefore we do not replicate features of other well-known programs, such as pattern matching and finding duplicate filenames. However, we provide many convenience options for common use cases that are hard to build from scratch with standard tools.
In order to find the lint, rmlint is given one or more directories to traverse. If no directories or files are given, the current working directory is assumed. By default, rmlint will ignore hidden files and will not follow symlinks (see Traversal Options). rmlint will first find "other lint" and then search the remaining files for duplicates.
rmlint tries to be helpful by guessing which file of a group of duplicates is the original (i.e. the file that should not be deleted). It does this by using different sorting strategies that can be controlled via the -S option. By default it chooses the first-named path on the commandline. If two duplicates come from the same path, it will also apply different fallback sort strategies (see the documentation of -S).
This behaviour can also be overridden if you know that a certain directory contains duplicates and another one originals. In this case you write the original directory after specifying a single // on the commandline. Everything that comes after it is a preferred (or "tagged") directory. If there are duplicates from an unpreferred and from a preferred directory, the preferred one will always count as the original. Special options can also be used to always keep files in preferred directories (-k) and to only find duplicates that are present in both given directories (-m).
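For example (the directory names here are illustrative):

$ rmlint to_check/ // originals/ -k # everything under originals/ is tagged and will never be removed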
We advise new users to take a short look at all the options rmlint has to offer, and maybe to test some examples before letting it run on production data. WRONG ASSUMPTIONS ARE THE BIGGEST ENEMY OF YOUR DATA. There are some extended examples at the end of this manual, and the description of each option that is not self-explanatory includes examples as well.
One of the following groups can be specified at the beginning of the list:
Any of the following lint types can be added individually, or deselected by prefixing with a -:
WARNING: It is good practice to enclose the description in single or double quotes. In obscure cases argument parsing might otherwise fail in weird ways, especially when spaces are used as separators.
Example:
$ rmlint -T "df,dd" # Only search for duplicate files and directories
$ rmlint -T "all -df -dd" # Search for all lint except duplicate files and dirs.
If -o is specified, rmlint's default outputs are overridden. With -O the defaults are preserved. Either -o or -O may be specified multiple times to get multiple outputs, including multiple outputs of the same format.
Examples:
$ rmlint -o json # Stream the json output to stdout
$ rmlint -O csv:/tmp/rmlint.csv # Output an extra csv file to /tmp
If the value is omitted it is set to a value meaning "enabled".
Examples:
$ rmlint -c sh:link # Smartly link duplicates instead of removing
$ rmlint -c progressbar:fancy # Use a different theme for the progressbar
If no argument is given, "rw" is assumed. Note that r does basically nothing user-visible, since rmlint ignores unreadable files anyway. It is just there for the sake of completeness.
By default this check is not done.
$ rmlint -z rx $(echo $PATH | tr ":" " ") # Look at all executable files in $PATH
The following hash families are available, in approximate descending order of cryptographic strength:

sha3, blake

sha

highway, md

metro, murmur, xxhash
The weaker hash functions still offer excellent distribution properties, but are potentially more vulnerable to malicious crafting of duplicate files.
The full list of hash functions (in decreasing order of checksum length) is:
512-bit: blake2b, blake2bp, sha3-512, sha512
384-bit: sha3-384
256-bit: blake2s, blake2sp, sha3-256, sha256, highway256, metro256, metrocrc256
160-bit: sha1
128-bit: md5, murmur, metro, metrocrc
64-bit: highway64, xxhash
The use of 64-bit hash length for detecting duplicate files is not recommended, due to the probability of a random hash collision.
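For example, to select a specific hash function via -a / --algorithm (the path is illustrative):

$ rmlint -a sha256 large_dir/ # use sha256 instead of the default algorithm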
Convenience shortcut for -o progressbar -o summary -o sh:rmlint.sh -o json:rmlint.json -VVV.
NOTE: This flag clears all previous outputs. If you want additional outputs, specify them after this flag using -O.
IMPORTANT: Definition of equal: Two directories are considered equal by rmlint if they contain the exact same data, no matter how the files containing the data are named. Imagine that rmlint creates a long, sorted stream out of the data found in the directory and compares this in a magic way to another directory. This means that the layout of the directory is not considered to be important by default. Also empty files will not count as content. This might be surprising to some users, but remember that rmlint generally cares only about content, not about any other metadata or layout. If you want to only find trees with the same hierarchy you should use --honour-dir-layout / -j.
Output is deferred until all duplicates have been found. Duplicate directories are printed first, followed by any remaining duplicate files that are isolated or inside of any original directories.
--rank-by applies to directories too, but 'p' or 'P' (path index) has no defined (i.e. useful) meaning. Sorting only takes place when the number of preferred files in the directory differs.
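A short sketch (the tree names are illustrative):

$ rmlint -D tree_a/ tree_b/ # also report whole directories containing the same data
$ rmlint -D -j tree_a/ tree_b/ # additionally require an identical directory layout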
NOTES:
The letter may also be written uppercase (similar to -S / --rank-by) to reverse the sorting. Note that rmlint has to hold back all results until the end of the run before sorting and printing them.
The size format is about the same as dd(1) uses. A valid example would be: "100KB-2M". This limits duplicates to a range from 100 Kilobyte to 2 Megabyte.
It's also possible to specify only one size. In this case the size is interpreted as "bigger or equal". If you want to filter for files up to this size you can add a - in front (-s -1M == -s 0-1M).
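Examples:

$ rmlint -s 100KB-2M # only consider files between 100 Kilobyte and 2 Megabyte
$ rmlint -s -1M # same as -s 0-1M, i.e. only files up to 1 Megabyte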
Edge case: The default excludes empty files from the duplicate search. Normally these are treated specially by rmlint by handling them as other lint. If you want to include empty files as duplicates you should lower the limit to zero:
$ rmlint -T df --size 0
If --no-hardlinked is given, only one file (of a set of hardlinked files) is considered; all the others are ignored. This means they are neither deleted nor even shown in the output. The "highest ranked" file of the set is the one that is considered.
-n expects a file from which it can read the timestamp. After the rmlint run, the file will be updated with the current timestamp. If the file does not initially exist, no filtering is done, but the stampfile is still written.
-N, in contrast, takes the timestamp directly and will not write anything.
Note that rmlint will find duplicates newer than the timestamp, even if the original is older. If you want to find only duplicates where both the original and the duplicate are newer than the timestamp, you can use find(1):
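$ find . -mtime -1 | rmlint - # e.g. only consider files modified within the last day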
Note: you can make rmlint write out a compatible timestamp with:
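$ rmlint -O stamp:stdout # print a compatible timestamp to stdout when the run finishes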
Note that the combinations of -kM and -Km are prohibited by rmlint. See https://github.com/sahib/rmlint/issues/244 for more information.
Alphabetical sort will only use the basename of the file and ignore its case. One can have multiple criteria, e.g.: -S am will choose first alphabetically; if tied then by mtime. Note: original path criteria (specified using //) will always take first priority over -S options.
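For example:

$ rmlint -S am # prefer the alphabetically first file; break ties by modification time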
For more fine grained control, it is possible to give a regular expression to sort by. This can be useful when you know a common fact that identifies original paths (like a path component being src or a certain file ending).
To use the regular expression you simply enclose it in the criteria string by adding <REGULAR_EXPRESSION> after specifying r or x. Example: -S 'r<.*\.bak$>' makes all files that have a .bak suffix original files.
Warning: When using r or x, try to make your regex as specific as possible! Good practice includes adding a $ anchor at the end of the regex.
This is very useful if you want to reformat, refilter or resort the output you got from a previous run. Usage is simple: just pass --replay on the second run, along with the new formatters or filters you want. Pass the .json files of the previous runs in addition to the paths you ran rmlint on. You can also merge several previous runs by specifying more than one .json file; in this case all given files are merged and output as one big run.
If you want to view only the duplicates of certain subdirectories, just pass them on the commandline as usual.
The usage of // has the same effect as in a normal run. It can be used to prefer one .json file over another. However, note that running rmlint in --replay mode involves no real disk traversal, i.e. only duplicates from previous runs are printed. Therefore specifying new paths will simply have no effect. As a security measure, --replay will ignore files whose mtime changed in the meantime (i.e. the mtime in the .json file differs from the current one); these files might have been modified and are silently ignored.
By design, some options will not have any effect. Those are:
NOTE: In --replay mode, a new .json file will be written to rmlint.replay.json in order to avoid overwriting rmlint.json.
See the individual options below for more details and some examples.
CAUTION: This could potentially lead to false positives if file contents are somehow modified without changing the file modification time, since rmlint uses the mtime to determine whether a checksum is outdated. This is not a problem if you use the clone or reflink operation on a filesystem like btrfs: there an outdated checksum entry would simply lead to some duplicate work being done in the kernel, but would do no harm otherwise.
NOTE: Many tools do not support extended file attributes properly, resulting in a loss of the information when copying the file or editing it.
NOTE: You can specify --xattr-write and --xattr-read at the same time. This will read from existing checksums at the start of the run and update all hashed files at the end.
Usage example:
$ rmlint large_file_cluster/ -U --xattr-write # First run should be slow.
$ rmlint large_file_cluster/ --xattr-read # Second run should be faster.
# Or do the same in just one run:
$ rmlint large_file_cluster/ --xattr
This is mainly useful in conjunction with --xattr-write/read. When re-running rmlint on a large dataset this can greatly speed up a re-run in some cases. Please refer to --xattr-read for an example.
If you want to output unique files, please look into the uniques output formatter.
The size-description has the same format as for --size, therefore you can do something like this (use this if you have 1GB of memory available):
$ rmlint -u 512M # Limit paranoid mem usage to 512 MB
Only look at the content of files in the range from low to (and including) high. This means that if the range is less than -q 0% to -Q 100%, only partial duplicates are searched for. If the file size is less than the clamp limits, the file is ignored during traversal. Be careful when using this option; you can easily get dangerous results for small files.
This is useful in a few cases where a file consists of a constant-sized header or footer. With this option you can just compare the data in between. It might also be useful for approximate comparison, where it suffices that files are identical in the middle part.
Example:
$ rmlint -q 10% -Q 512M # Only read the last 90% of a file, but read at max. 512MB
However, with three (or more) files, the mtime difference between two duplicates can be bigger than the mtime window T, i.e. several files may be chained together by the window. Example: If T is 1, the four files fooA (mtime: 00:00:00), fooB (00:00:01), fooC (00:00:02), fooD (00:00:03) would all belong to the same duplicate group, although the mtime of fooA and fooD differs by 3 seconds.
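Assuming the window is set via the --mtime-window=T option (T in seconds), a one-second window would look like this:

$ rmlint --mtime-window=1 # only group files whose mtimes differ by at most 1 second (groups may still chain, as described above)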
Available options:
Available options:
Default is remove.
Available options:
This formatter is extremely useful if you need to script more complex behaviour that is not directly possible with rmlint's built-in options. A very handy tool here is jq. Here is an example to output all original files directly from an rmlint run:
$ rmlint -o json | jq -r '.[1:-1][] | select(.is_original) | .path'
Outputs a timestamp of the time rmlint was run. See also the --newer-than and --newer-than-stamp options.
Available options:
See also: -g (--progress) for a convenience shortcut option.
Available options:
Available options:
This will only work when Shredder and its dependencies were installed. See also: http://rmlint.readthedocs.org/en/latest/gui.html
The gui has its own set of options; see --gui --help for a list. These should be placed at the end, i.e. rmlint --gui [options] when calling it from the commandline.
Note: This even works for directories and also in combination with paranoid mode (pass -pp for byte comparison); remember that rmlint does not care about the layout of the directory, only about the content of the files in it. At least two paths need to be given on the commandline.
By default this will use hashing to compare the files and/or directories.
This command is similar to cp --reflink=always <src> <dest>, except that it (a) checks that src and dest have identical data, and (b) makes no changes to dest's metadata.
Running with -r option will enable deduplication of read-only [btrfs] snapshots (requires root).
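Usage sketch (the file names are illustrative):

$ rmlint --dedupe src_file dest_file # verify identical data, then reflink dest_file to src_file
$ sudo rmlint --dedupe -r snapshot_file dest_file # also handle read-only btrfs snapshots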
This is a collection of common use cases and other tricks:
$ rmlint # Check the current working directory for duplicates
$ rmlint -g # The same, but with a progressbar and summary
$ rmlint large_dir/ # First run; writes rmlint.json
$ rmlint --replay rmlint.json large_dir -S MaD # Replay the previous run with a different rank-by order
$ rmlint --replay a.json // b.json -k # Merge two previous runs, keeping everything tagged in b.json
$ rmlint -T "df,dd" . # Only look for duplicate files and duplicate directories
$ rmlint -pp . # Byte-by-byte comparison (the most paranoid mode)
$ rmlint -e
$ find /usr/lib -iname '*.so' -type f | rmlint - # find all duplicate .so files
$ find /usr/lib -iname '*.so' -type f -print0 | rmlint -0 # as above but handles filenames with newline character in them
$ find ~/pics -iname '*.png' | ./rmlint - # compare png files only
$ rmlint -s 2GB # Find everything >= 2GB
$ rmlint -s 0-2GB # Find everything up to 2GB
$ rmlint --perms wx # Only consider files that are writable and executable
$ rmlint -c sh:link # Make the shell script link duplicates instead of removing them
$ rmlint -o sh -c sh:cmd='echo "original:" "$2" "is the same as" "$1"'
$ rmlint -c 'sh:cmd=shred -un 10 "$1"'
$ rmlint backup // data --keep-all-tagged --must-match-tagged # Only find duplicates in backup/ that also exist in data/; never remove anything in data/
$ rmlint --equal a b c && echo "Files are equal" || echo "Files are not equal"
$ rmlint --is-reflink a b && echo "Files are reflinks" || echo "Files are not reflinks"
$ rmlint --xattr # Cache checksums in extended file attributes to speed up re-runs
$ rmlint -o uniques # Only print files that have no duplicates
$ rmlint t -o json -o uniques:unique_files | jq -r '.[1:-1][] | select(.is_original) | .path' | sort > original_files
$ cat unique_files original_files
Reading the manpages of these tools might help when working with rmlint:
Extended documentation and an in-depth tutorial can be found at: http://rmlint.readthedocs.org
If you found a bug, have a feature request or want to say something nice, please visit https://github.com/sahib/rmlint/issues.
Please make sure to describe your problem in detail. Always include the version of rmlint (--version). If you experienced a crash, please include at least one of the following pieces of information, obtained with a debug build of rmlint:
You can build a debug build of rmlint like this:
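A sketch, assuming a checkout of the upstream repository (rmlint builds with scons):

$ git clone https://github.com/sahib/rmlint.git
$ cd rmlint
$ scons DEBUG=1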
rmlint is licensed under the terms of the GPLv3.
See the COPYRIGHT file that came with the source for more information.
rmlint was written by Christopher Pahl and Daniel Thomas.
Also see http://rmlint.rtfd.org for other people who helped us.
If you consider a donation you can use Flattr or buy us a beer if we meet:
https://flattr.com/thing/302682/libglyr
Christopher Pahl, Daniel Thomas
2014-2023, Christopher Pahl & Daniel Thomas
September 30, 2023