agedu(1) | Simon Tatham | agedu(1) |
agedu - correlate disk usage with last-access times to identify large and disused data
agedu [ options ] action [action...]
agedu scans a directory tree and reports how much disk space each directory and subdirectory uses, and how much of that space is taken up by files that were last accessed a long time ago.
In other words, agedu is a tool you might use to help you free up disk space. It lets you see which directories are taking up the most space, as du does; but unlike du, it also distinguishes between large collections of data which are still in use and ones which have not been accessed in months or years - for instance, large archives downloaded, unpacked, used once, and never cleaned up. Where du helps you find what's using your disk space, agedu helps you find what's wasting your disk space.
agedu has several operating modes. In one mode, it scans your disk and builds an index file containing a data structure which allows it to efficiently retrieve any information it might need. Typically, you would use it in this mode first, and then run it in one of a number of `query' modes to display a report of the disk space usage of a particular directory and its subdirectories. Those reports can be produced as plain text (much like du) or as HTML. agedu can even run as a miniature web server, presenting each directory's HTML report with hyperlinks to let you navigate around the file system to similar reports for other directories.
So you would typically start using agedu by telling it to do a scan of a directory tree and build an index. This is done with a command such as
$ agedu -s /home/fred
which will build a large data file called agedu.dat in your current directory. (If that current directory is inside /home/fred, don't worry - agedu is smart enough to discount its own index file.)
Having built the index, you would now query it for reports of disk space usage. If you have a graphical web browser, the simplest and nicest way to query the index is by running agedu in web server mode:
$ agedu -w
which will print (among other messages) a URL on its standard output along the lines of
URL: http://127.0.0.1:48638/
(That URL will always begin with `127.', meaning that it's in the localhost address space. So only processes running on the same computer can even try to connect to that web server, and also there is access control to prevent other users from seeing it - see below for more detail.)
Now paste that URL into your web browser, and you will be shown a graphical representation of the disk usage in /home/fred and its immediate subdirectories, with varying colours used to show the difference between disused and recently-accessed data. Click on any subdirectory to descend into it and see a report for its subdirectories in turn; click on parts of the pathname at the top of any page to return to higher-level directories. When you've finished browsing, you can just press Ctrl-D to send an end-of-file indication to agedu, and it will shut down.
After that, you probably want to delete the data file agedu.dat, since it's pretty large. In fact, the command agedu -R will do this for you; and you can chain agedu commands on the same command line, so that instead of the above you could have done
$ agedu -s /home/fred -w -R
for a single self-contained run of agedu which builds its index, serves web pages from it, and cleans it up when finished.
In some situations, you might want to scan the directory structure of one computer, but run agedu's user interface on another. In that case, you can do your scan using the agedu -S option in place of agedu -s. This makes agedu skip building an index file and instead write out its scan results in plain text on standard output. You can then funnel that output to the other machine using SSH (or whatever other technique you prefer), and there run agedu -L to load the textual dump and turn it into an index file. For example, you might run a command like this (plus any ssh options you need) on the machine you want to scan:
$ agedu -S /home/fred | ssh indexing-machine agedu -L
or, equivalently, run something like this on the other machine:
$ ssh machine-to-scan agedu -S /home/fred | agedu -L
Either way, the agedu -L command will create an agedu.dat index file, which you can then use with agedu -w just as above.
(Another way to do this might be to build the index file on the first machine as normal, and then just copy it to the other machine once it's complete. However, for efficiency, the index file is formatted differently depending on the CPU architecture that agedu is compiled for. So if that doesn't match between the two machines - e.g. if one is a 32-bit machine and one 64-bit - then agedu.dat files written on one machine will not work on the other. The technique described above using -S and -L should work between any two machines.)
If you don't have a graphical web browser, you can do text-based queries instead of using agedu's web interface. Having scanned /home/fred in any of the ways suggested above, you might run
$ agedu -t /home/fred
which again gives a summary of the disk usage in /home/fred and its immediate subdirectories; but this time agedu will print it on standard output, in much the same format as du. If you then want to find out how much old data is there, you can add the -a option to show only files last accessed at least a certain length of time ago. For example, to show only files which haven't been looked at in six months or more:
$ agedu -t /home/fred -a 6m
That's the essence of what agedu does. It has other modes of operation for more complex situations, and the usual array of configurable options. The following sections contain a complete reference for all its functionality.
This section describes the operating modes supported by agedu. Each of these is in the form of a command-line option, sometimes with an argument. Multiple operating-mode options may appear on the command line, in which case agedu will perform the specified actions one after another. For instance, as shown in the previous section, you might want to perform a disk scan and immediately launch a web server giving reports from that scan.
By default, the scan is restricted to a single file system (since the expected use of agedu is that you would probably use it because a particular disk partition was running low on space). You can remove that restriction using the --cross-fs option; other configuration options allow you to include or exclude files or entire subdirectories from the scan. See the next section for full details of the configurable options.
The index file is created with restrictive permissions, in case the file system you are scanning contains confidential information in its structure.
Index files are dependent on the characteristics of the CPU architecture you created them on. You should not expect to be able to move an index file between different types of computer and have it continue to work. If you need to transfer the results of a disk scan to a different kind of computer, see the -D and -L options below.
The web server runs until agedu receives an end-of-file event on its standard input. (The expected usage is that you run it from the command line, immediately browse web pages until you're satisfied, and then press Ctrl-D.) To disable the EOF behaviour, use the --no-eof option.
In case the index file contains any confidential information about your file system, the web server protects the pages it serves from access by other people. On Linux, this is done transparently by means of using /proc/net/tcp to check the owner of each incoming connection; failing that, the web server will require a password to view the reports, and agedu will print the password it invented on standard output along with the URL.
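To give a feel for the /proc/net/tcp check, here is a minimal Python sketch of how the owner of an incoming connection could be looked up. It is not agedu's own code: the function names decode_endpoint and owner_uid are hypothetical, and the sketch assumes IPv4 connections and a little-endian host (where Linux prints 127.0.0.1 as 0100007F).

```python
import socket
import struct


def decode_endpoint(field):
    """Turn a /proc/net/tcp endpoint such as '0100007F:BDFE' into (ip, port).

    Assumption: little-endian host, so the address bytes appear reversed.
    """
    hex_addr, hex_port = field.split(":")
    addr = socket.inet_ntoa(struct.pack("<I", int(hex_addr, 16)))
    return addr, int(hex_port, 16)


def owner_uid(proc_net_tcp_text, client, server):
    """Return the uid owning the client end of a connection, or None.

    `client` and `server` are (ip, port) tuples. The client's socket shows
    up with the client endpoint as its local address and the server
    endpoint as its remote address; the uid is the eighth column.
    """
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 8:
            continue
        if (decode_endpoint(fields[1]) == client
                and decode_endpoint(fields[2]) == server):
            return int(fields[7])
    return None
```

A server using this approach would read /proc/net/tcp when a connection arrives, pass in the peer address and its own listening address, and serve the page only if the returned uid equals its own (os.getuid()).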
Configurable options for this mode let you specify your own address and port number to listen on, and also specify your own choice of authentication method (including turning authentication off completely) and a username and password of your choice.
Used on its own, -t merely lists the total disk usage in each subdirectory; agedu's additional ability to distinguish unused from recently-used data is not activated. To activate it, use the -a option to specify a minimum age.
The directory structure stored in agedu's index file is treated as a set of literal strings. This means that you cannot refer to directories by synonyms. So if you ran agedu -s ., then all the path names you later pass to the -t option must be either `.' or begin with `./'. Similarly, symbolic links within the directory you scanned will not be followed; you must refer to each directory by its canonical, symlink-free pathname.
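The literal-string rule can be pictured with a short Python sketch. The helper name path_could_match is hypothetical, and the sketch only models the prefix requirement described above; it does not check whether the path was actually found during the scan.

```python
def path_could_match(scan_root, query):
    """Return True if `query` is the scanned root or lies textually below it.

    The comparison is purely on strings: no normalisation, no symlink
    resolution, no conversion between relative and absolute paths.
    """
    return query == scan_root or query.startswith(scan_root + "/")
```

With a scan root of `.', the queries `.' and `./src' pass this test, whereas `src' and `/home/fred/src' do not, even if they name the same directory on disk.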
(The output of agedu -D on an existing index file will not be exactly identical to what agedu -S would have originally produced, due to a difference in treatment of last-access times on directories. However, it should be effectively equivalent for most purposes. See the documentation of the --dir-atime option in the next section for further detail.)
By default, a single HTML report will be generated and simply written to standard output, with no hyperlinks pointing to other similar pages. If you also specify the -d option (see below), agedu will instead write out a collection of HTML files with hyperlinks between them, and call the top-level file index.html.
The actual CGI program itself should be a tiny wrapper around agedu which passes it the --cgi option, and also (probably) -f to locate the index file. agedu will do everything else. For example, your script might read
#!/bin/sh
/some/path/to/agedu --cgi -f /some/other/path/to/agedu.dat
(Note that agedu will produce the entire CGI output, including status code, HTTP headers and the full HTML document. If you try to surround the call to agedu --cgi with code that adds your own HTML header and footer, you won't get the results you want, and agedu's HTTP-level features such as auto-redirecting to canonical versions of URIs will stop working.)
No access control is performed in this mode: restricting access to CGI scripts is assumed to be the job of the web server.
The ordinary dump file format is reasonably readable, but loading it into an index file using agedu -L requires it to be sorted in a specific order, which is complicated to describe and difficult to implement using ordinary Unix sorting tools. So if you want to construct your own data dump from a source of your own that agedu itself doesn't know how to scan, you will need to make sure it's sorted in the right order.
To help with this, agedu provides a secondary dump format which is `sortable', in the sense that ordinary sort(1) without arguments will arrange it into the right order. However, the sortable format is much less readable and also twice the size, so you wouldn't want to write it directly!
So the recommended procedure is to generate dump data in the ordinary format; then pipe it through agedu --presort to turn it into the sortable format; then sort it; then pipe it into agedu -L (which can accept either the normal or the sortable format as input). For example:
generate_custom_data.sh | agedu --presort | sort | agedu -L
If you need to transform the sorted dump file back into the ordinary format, agedu --postsort can do that. But since agedu -L can accept either format as input, you may not need to.
This section describes the various configuration options that affect agedu's operation in one mode or another.
The following option affects nearly all modes (except -S):
The following options affect the disk-scanning modes, -s and -S:
(Note that this default, which keeps the scan within a single file system unless --cross-fs is given, is the opposite way round from the corresponding option in du.)
Note that in most Unix shells, wildcards will probably need to be escaped on the command line, to prevent the shell from expanding the wildcard before agedu sees it.
--prune-path is similar to --prune, except that the wildcard is matched against the entire pathname instead of just the filename at the end of it. So whereas --prune *a*b* will match any file whose actual name contains an a somewhere before a b, --prune-path *a*b* will also match a file whose name contains b and which is inside a directory containing an a, or any file inside a directory of that form, and so on.
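The difference between matching the filename and matching the whole pathname can be sketched in Python. This is an illustration only: it assumes the wildcards behave like Python's fnmatch (in which `*' may match `/' characters, as the example above implies), and the function names are hypothetical.

```python
from fnmatch import fnmatchcase
import posixpath


def matches_name(pattern, path):
    """Match the wildcard against the final filename only (as for --prune)."""
    return fnmatchcase(posixpath.basename(path), pattern)


def matches_path(pattern, path):
    """Match the wildcard against the entire pathname (as for --prune-path)."""
    return fnmatchcase(path, pattern)
```

For the path ./data/blob, matches_name('*a*b*', ...) is false because the filename `blob' has no a before its b, but matches_path('*a*b*', ...) is true because the a in `data' precedes the b in `blob'.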
As above, --exclude-path is similar to --exclude, except that the wildcard is matched against the entire pathname.
For example, if you wanted to see only the disk space taken up by MP3 files, you might run
$ agedu -s . --exclude '*' --include '*.mp3'
which will cause everything to be omitted from the scan, but then the MP3 files to be put back in. If you then wanted only a subset of those MP3s, you could then exclude some of them again by adding, say, `--exclude-path './queen/*'' (or, more efficiently, `--prune-path ./queen') on the end of that command.
As with the previous two options, --include-path is similar to --include except that the wildcard is matched against the entire pathname.
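The way successive --exclude and --include options interact can be modelled as follows. This Python sketch assumes that the last rule on the command line which matches a file decides its fate, which is what the MP3 example above suggests; the rule representation and the name is_included are hypothetical, and the pruning of whole directories is not modelled.

```python
from fnmatch import fnmatchcase
import posixpath


def is_included(path, rules):
    """Decide whether `path` stays in the scan.

    `rules` is a list of (action, scope, pattern) tuples in command-line
    order, where action is "include" or "exclude" and scope is "name"
    (match the filename) or "path" (match the whole pathname).
    Assumption: a later matching rule overrides an earlier one.
    """
    included = True  # everything is scanned unless a rule says otherwise
    for action, scope, pattern in rules:
        subject = path if scope == "path" else posixpath.basename(path)
        if fnmatchcase(subject, pattern):
            included = (action == "include")
    return included


# The example above: --exclude '*' --include '*.mp3' --exclude-path './queen/*'
RULES = [
    ("exclude", "name", "*"),
    ("include", "name", "*.mp3"),
    ("exclude", "path", "./queen/*"),
]
```

Under these rules a file such as ./abba/waterloo.mp3 is kept, ./notes.txt is dropped by the first rule, and anything below ./queen is dropped again by the last one.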
By default, progress reports during a scan are displayed on agedu's standard error channel, if that channel points to a terminal device. If you need to manually enable or disable them, three options are available: --progress unconditionally enables the progress reports, --no-progress unconditionally disables them, and --tty-progress reverts to the default behaviour which is conditional on standard error being a terminal.
Rather than using the atime recorded on each directory, agedu makes up a fake atime for every directory it scans, which is equal to the newest atime of any file in or below that directory (or the directory's last modification time, whichever is newest). This is based on the assumption that all important accesses to directories are actually accesses to the files inside those directories, so that when any file is accessed all the directories on the path leading to it should be considered to have been accessed as well.
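The faking rule can be written down as a small recursive computation. The following Python sketch uses hypothetical File and Directory classes rather than agedu's real data structures, and it assumes that a subdirectory's own modification time also propagates up to its parents, which the description above leaves open.

```python
from dataclasses import dataclass, field


@dataclass
class File:
    atime: int  # last-access time, e.g. seconds since the epoch


@dataclass
class Directory:
    mtime: int  # the directory's own last-modification time
    files: list = field(default_factory=list)
    subdirs: list = field(default_factory=list)


def fake_atime(directory):
    """Newest of: the directory's mtime and every atime in or below it.

    Assumption: subdirectory mtimes count as well, via the recursion.
    """
    newest = directory.mtime
    for f in directory.files:
        newest = max(newest, f.atime)
    for sub in directory.subdirs:
        newest = max(newest, fake_atime(sub))
    return newest
```

So a directory last modified at time 100 whose only recently read file sits two levels down with atime 300 is itself treated as accessed at time 300.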
In unusual cases it is possible that a directory itself might embody important data which is accessed by reading the directory. In that situation, agedu's atime-faking policy will misreport the directory as disused. In the unlikely event that such directories form a significant part of your disk space usage, you might want to turn off the faking. The --dir-atime option does this: it causes the disk scan to read the original atimes of the directories it scans.
The faking of atimes on directories also requires a processing pass over the index file after the main disk scan is complete. --dir-atime also turns this pass off. Hence, this option affects the -L option as well as -s and -S.
(The previous section mentioned that there might be subtle differences between the output of agedu -s /path -D and agedu -S /path. This is why. Doing a scan with -s and then dumping it with -D will dump the fully faked atimes on the directories, whereas doing a scan-to-dump with -S will dump only partially faked atimes - specifically, each directory's last modification time - since the subsequent processing pass will not have had a chance to take place. However, loading either of the resulting dump files with -L will perform the atime-faking processing pass, leading to the same data in the index file in each case. In normal usage it should be safe to ignore all of this complexity.)
Another use for this mode might be to find recently created large data. If your disk has been gradually filling up for years, the default mode of agedu will let you find unused data to delete; but if you know your disk had plenty of space recently and now it's suddenly full, and you suspect that some rogue program has left a large core dump or output file, then agedu --mtime might be a convenient way to locate the culprit.
For most files, the physical size of a file will be larger than the logical size, reflecting the fact that filesystem layouts generally allocate a whole number of blocks of the disk to each file, so some space is wasted at the end of the last block. So counting only the logical file size will typically cause under-reporting of the disk usage (perhaps large under-reporting in the case of a very large number of very small files).
On the other hand, sometimes a file with a very large logical size can have `holes' where no data is actually stored, in which case using the logical size of the file will over-report its disk usage. So the use of logical sizes can give wrong answers in both directions.
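The two kinds of size can be inspected directly from a file's metadata. This Python sketch is only an illustration of the distinction, not of how agedu itself measures files; the function names are hypothetical, the 4096-byte block size is an example value, and st_blocks is assumed to be available and counted in 512-byte units, as on Linux and other Unix systems.

```python
import os


def round_up_to_blocks(logical_size, block_size=4096):
    """Space needed to hold `logical_size` bytes in whole blocks."""
    return -(-logical_size // block_size) * block_size


def logical_and_physical_size(path):
    """Return (logical, physical) sizes in bytes for one file.

    logical  = the length of the file's contents (st_size)
    physical = the space actually allocated on disk (st_blocks * 512)
    """
    st = os.lstat(path)
    return st.st_size, st.st_blocks * 512
```

With 4096-byte blocks, a one-byte file would occupy a whole block, which is why a tree of many tiny files is under-reported by logical size. A file with holes shows the opposite pattern on file systems that support sparse files: its physical size is smaller than its logical size.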
The following option affects all the modes that generate reports: the web server mode -w, the stand-alone HTML generation mode -H and the text report mode -t.
The following option affects the text report mode -t.
The following options affect the stand-alone HTML generation mode -H and the text report mode -t.
In text mode, the default is 1, meaning that the report will include the directory given on the command line and all of its immediate subdirectories. A depth of two includes another level below that, and so on; a depth of zero means only the directory on the command line.
In HTML mode, specifying this option switches agedu from writing out a single HTML file to writing out multiple files which link to each other. A depth of 1 means agedu will write out an HTML file for the given directory and also one for each of its immediate subdirectories.
If you want agedu to recurse as deeply as possible, give the special word `max' as an argument to -d.
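The meaning of the depth argument can be illustrated with a short Python sketch that picks out which directories a report would cover. The function name dirs_in_report is hypothetical, and directories are compared as literal strings.

```python
def dirs_in_report(all_dirs, root, depth):
    """Select the directories within `depth` levels below `root`.

    `depth` is a non-negative integer, or the string "max" for no limit.
    Depth 0 selects only `root`; depth 1 adds its immediate subdirectories.
    """
    selected = []
    for d in all_dirs:
        if d == root:
            level = 0
        elif d.startswith(root + "/"):
            # Count the path components below the root.
            level = d[len(root) + 1:].count("/") + 1
        else:
            continue  # not under the requested directory at all
        if depth == "max" or level <= depth:
            selected.append(d)
    return selected
```

Given /home/fred, /home/fred/a and /home/fred/a/b, a depth of 1 selects the first two and a depth of `max' selects all three.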
The following option affects only the stand-alone HTML generation mode -H, and even then, only in recursive mode (with -d):
This system of file naming is less intuitive than the default of naming files after the sub-pathname they index. It's also less stable: the same pathname will not necessarily be represented by the same filename if agedu -H is re-run after another scan of the same directory tree. However, it does have the virtue that it keeps the filenames short, so that even if your directory tree is very deep, the output HTML files won't exceed any OS limit on filename length.
The following options affect the web server mode -w, and in some cases also the stand-alone HTML generation mode -H:
The argument to -r consists of a single age, or two ages separated by a minus sign. An age is a number, followed by one of `y' (years), `m' (months), `w' (weeks) or `d' (days). (This syntax is also used by the -a option.) The first age in the range represents the oldest data, and will be coloured red in the HTML; the second age represents the newest, coloured green. If the second age is not specified, it will default to zero (so that green means data which has been accessed just now).
For example, -r 2y will mark data in red if it has been unused for two years or more, and green if it has been accessed just now. -r 2y-3m will similarly mark data red if it has been unused for two years or more, but will mark it green if it was last accessed three months ago or more recently.
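The age and range syntax is simple enough to parse in a few lines. This Python sketch is not taken from agedu; the names parse_age and parse_range are hypothetical, ages are assumed to be whole numbers, and an omitted second age is represented as zero days.

```python
import re

_AGE = re.compile(r"(\d+)([ymwd])")


def parse_age(text):
    """Parse an age such as '6m' into (6, 'm')."""
    match = _AGE.fullmatch(text)
    if match is None:
        raise ValueError("not a valid age: %r" % (text,))
    return int(match.group(1)), match.group(2)


def parse_range(text):
    """Parse '2y' or '2y-3m' into (oldest, newest) ages.

    The first age is the `red' end of the scale, the second the `green'
    end; if the second is missing it defaults to zero.
    """
    oldest, separator, newest = text.partition("-")
    return parse_age(oldest), (parse_age(newest) if separator else (0, "d"))
```

For instance, parse_range("2y-3m") yields ((2, 'y'), (3, 'm')), and parse_range("2y") yields ((2, 'y'), (0, 'd')).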
If you specify this option, agedu will not print its URL on standard output (since you are expected to know what address you told it to listen to).
A typical use for this would be `--launch=browse', which uses the XDG `browse' command to automatically open the agedu web interface in your default browser. However, other uses are possible: for example, you could provide a command which communicates the URL to some other software that will use it for something.
The data file is pretty large. The core of agedu is the tree-based data structure it uses in its index in order to efficiently perform the queries it needs; this data structure requires O(N log N) storage. This is larger than you might expect; a scan of my own home directory, containing half a million files and directories and about 20GB of data, produced an index file over 60MB in size. Furthermore, since the data file must be memory-mapped during most processing, it can never grow larger than available address space, so a really big filesystem may need to be indexed on a 64-bit computer. (This is one reason for the existence of the -D and -L options: you can do the scanning on the machine with access to the filesystem, and the indexing on a machine big enough to handle it.)
The data structure also does not usefully permit access control within the data file, so it would be difficult - even given the willingness to do additional coding - to run a system-wide agedu scan on a cron job and serve the right subset of reports to each user.
In certain circumstances, agedu can report false positives (reporting files as disused which are in fact in use) as well as the more benign false negatives (reporting files as in use which are not). This arises when a file is, semantically speaking, `read' without actually being physically read. Typically this occurs when a program checks whether the file's mtime has changed and only bothers re-reading it if it has; programs which do this include rsync(1) and make(1). Such programs will fail to update the atime of unmodified files despite depending on their continued existence; a directory full of such files will be reported as disused by agedu even in situations where deleting them will cause trouble.
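The effect is easy to reproduce. The Python sketch below imitates the kind of check such programs make: it consults only the file's metadata and never reads its contents, so the file's atime is left untouched. The function name is hypothetical, and the sketch does not claim to reflect how rsync or make are implemented.

```python
import os


def looks_modified(path, last_seen_mtime):
    """Return True if the file's mtime is newer than `last_seen_mtime`.

    Only stat(2) is called, which does not update the file's atime.
    """
    return os.stat(path).st_mtime > last_seen_mtime
```

However many times this check runs on an unchanged file, the file's last-access time stays where it was, and the file will eventually look disused.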
Finally, of course, agedu's normal usage mode depends critically on the OS providing last-access times which are at least approximately right. So a file system mounted with Linux's `noatime' option, or the equivalent on any other OS, will not give useful results! (However, the Linux mount option `relatime', which distributions now tend to use by default, should be fine for all but specialist purposes: it reduces the accuracy of last-access times so that they might be wrong by up to 24 hours, but if you're looking for files that have been unused for months or years, that's not a problem.)
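Before trusting a report, you may want to check how the file system in question is mounted. The following Python sketch parses text in the format of Linux's /proc/mounts and reports, for each mount point, whether `noatime' or `relatime' appears among its options. The function name atime_policy is hypothetical, and mount points containing escaped spaces are not handled specially.

```python
def atime_policy(mounts_text):
    """Map each mount point to 'noatime', 'relatime' or 'other'.

    Each line of /proc/mounts has the fields:
    device, mount point, file system type, options, dump, pass.
    """
    policies = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        mount_point = fields[1]
        options = fields[3].split(",")
        if "noatime" in options:
            policies[mount_point] = "noatime"
        elif "relatime" in options:
            policies[mount_point] = "relatime"
        else:
            policies[mount_point] = "other"
    return policies
```

On a Linux system you could pass in the contents of /proc/mounts; any mount point reported as `noatime' is one where agedu's age-based reports should not be relied upon.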
agedu is free software, distributed under the MIT licence. Type agedu --licence to see the full licence text.
2008-11-02 | Simon Tatham |