grcollect - perform data transposition on tabulated input data
grcollect [options] <input> [...]
[-o <output>|-b <basename>]
The main purpose of the program `grcollect` is twofold. First, it
performs data transposition: the input (read from files or standard
input) is sorted and split into separate files, where the splitting is
based on a key taken from the input data itself. When the input comes from
several files and each key is unique within a given file, this process is
called data transposition, since it resembles storing a two-dimensional
data matrix with each row in a separate file and then transposing it, i.e.
storing each column in a separate file. Second, `grcollect` computes
statistics on the data associated with each key. These statistics include
averages (mean, median, mode) and scatter estimates (standard deviation or
median deviance) with optional rejection of outlier points, as well as
sums, counts and so on.
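
The transposition described above can be illustrated with a minimal Python
sketch. It is not the `grcollect` implementation: the function name, the
in-memory dictionaries and the zero-based `key_col` are assumptions made
for illustration only, and the real program writes files and works within
a memory limit.

```python
from collections import defaultdict

def transpose(inputs, key_col=0):
    """Group whitespace-separated records by the key found in `key_col`.

    `inputs` maps a (hypothetical) source name to its list of text lines.
    The result maps each key to the lines that carried it, in input order.
    `key_col` is zero-based here purely for Python convenience.
    """
    by_key = defaultdict(list)
    for name in inputs:
        for line in inputs[name]:
            fields = line.split()
            if len(fields) <= key_col:
                continue  # blank or too-short line: no key to split on
            by_key[fields[key_col]].append(line)
    return dict(by_key)

# Two "row" files, each holding one record per key ...
frames = {
    "frame1.dat": ["star1 10.1", "star2 12.3"],
    "frame2.dat": ["star1 10.2", "star2 12.2"],
}
# ... become one "column" collection per key.
per_key = transpose(frames)
```

Each key appears once per input file, so every entry of `per_key` ends up
with one record from each file, which is the matrix transposition analogy.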
- -h, --help
- Gives a general summary of the command line options.
- --long-help,
--help-long
- Gives a detailed list of command line options.
- --wiki-help,
--help-wiki, --mediawiki-help,
--help-mediawiki
- Gives a detailed list of command line options in Mediawiki format.
- --version,
--version-short, --short-version
- Gives version information about the program.
- <input> [,<input>, ...]
- Name of the input file. At least one file must be specified. Reading
from standard input can be forced by using a single dash "-" as the
input file name. Additional dashes are silently ignored.
- -c, --col-base <key
column index>
- Column index for the key.
- -b, --basename
<base-%b-name>
- Base name of the output files. The base name string should contain at
least one "%b" tag, which is replaced by the respective key
string when the file is created.
- -x, --extension
<extension>, -p, --prefix <prefix>
- Equivalent to "-b|--basename
<prefix>%b.<extension>". Note that in practice,
<prefix> might be some sort of directory name and extension is a
regular file extension, but the above substitution is done literally.
Therefore, the "dot" between the key and the <extension>
is always inserted in the final name of the output files but a trailing
slash is required at the end of <prefix> if the files are to be
created in that particular directory. Note also that in this case the
target directory must exist before `grcollect` is invoked, otherwise the
output files cannot be created.
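
The literal substitution described for -b, -p and -x can be sketched as
follows. The helper `output_name` is hypothetical and only mirrors the
documented rule (replace "%b" by the key, join prefix, key, dot and
extension as plain text); it makes no claim about how `grcollect` handles
other cases.

```python
def output_name(key, basename=None, prefix="", extension=""):
    """Build an output file name for `key` by literal substitution."""
    if basename is None:
        # -p <prefix> -x <extension> is equivalent to -b <prefix>%b.<extension>
        basename = f"{prefix}%b.{extension}"
    return basename.replace("%b", key)

with_slash = output_name("star1", prefix="lc/", extension="dat")
without_slash = output_name("star1", prefix="lc", extension="dat")
explicit = output_name("star1", basename="out-%b.txt")
```

With the trailing slash the file lands in the `lc` directory
(`lc/star1.dat`); without it the prefix is simply glued to the key
(`lcstar1.dat`).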
- -C, --comment
- Insert a commented line (starting with "#") containing
information about the version and command line invocation of
`grcollect` at the beginning of the transposed files.
- -S,
--additional-comment <...>
- Insert additional commented lines (starting with "#") at the
beginning of the transposed files.
- -d, --col-stat
<>[,...]
- Comma-separated list of column indices on which the statistics are to be
calculated. Columns with non-numerical contents are ignored. Note that this
option implies the cumulative statistics mode of `grcollect`.
- -o, --output
<filename>
- The name of the output file to which the output statistics are written.
The total number of columns in this file will be 1+C*N, where C is the
number of columns (see -d|--col-stat) on which the statistics are
calculated and N is the number of statistic quantities (see
--stat). The first column in the output file is the key, which is
followed by the per-column list of statistics, in the same order as the
user defined after -d|--col-stat and --stat.
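
The 1+C*N column layout can be made concrete with a short sketch. The
function and the label format are hypothetical, used only to show the
ordering stated above (key first, then all statistics for the first
column, then all statistics for the next one); `grcollect` itself is not
claimed to write such labels.

```python
def output_columns(stat_columns, stats):
    """List the output columns in order: the key, then per-column statistics."""
    layout = ["key"]
    for col in stat_columns:      # order given after -d|--col-stat
        for stat in stats:        # order given after --stat
            layout.append(f"{stat}[{col}]")
    return layout

# C = 2 columns and N = 3 statistics give 1 + 2*3 = 7 output columns.
layout = output_columns([2, 3], ["mean", "stddev", "count"])
```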
- -s, --stat <list of
statistics>
- Comma-separated list of statistics to be estimated on the input data.
These can be one or more of the following:
- count
- Total number of records, for the given key.
- rcount
- The number of records after rejecting outliers (i.e. it is always the same
as the "count" value if no "--rejection" was
used).
- mean, median, mode
- Mean, median or mode statistics of the data.
- rmean, rmedian,
rmode
- Mean, median or mode, after rejecting outliers.
- {mean|median|mode}stddev, {mean|median|mode}meddev, stddev
- Scatter of the data around the mean, median or mode. The scatter can
either be standard deviation (stddev) or median deviance (meddev). The
literal "stddev" is the classic standard deviation, equivalent
to "meanstddev".
- r{mean|median|mode}stddev,
r{mean|median|mode}meddev, rstddev
- The same scatters as above but after rejecting outliers.
- sum, rsum
- Sum of the data: the total sum and the sum after rejecting outliers, respectively.
- sum2, rsum2
- Sum of the squares, total and after rejecting outliers.
- min, max
- Minimal and maximal data values.
- rmin, rmax
- Minimal and maximal data values after the rejection of outliers.
- -r, --rejection
column=<index>,<rejection parameters>
- Comma-separated directives for outlier rejection for the specified column.
The rejection parameters are:
- iterations=<n>
- Maximum number of iterations to reject outliers.
- mean, median,
mode
- Use the mean, median or mode for the center of the rejection.
- stddev, meddev,
absolute=<limit>
- Use the standard deviation or median deviance for rejection limit units or
define an absolute limit for rejection level.
Note that each column can have a different rejection method, so more
than one "--rejection ..." command line option can be given when
invoking `grcollect`.
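
As an illustration of how such directives could combine, here is a minimal
sketch of iterative outlier rejection. It is an assumption-laden example,
not the algorithm used by `grcollect`: the function name, the `limit`
multiplier, the definition of median deviance as the median of absolute
deviations, and the stopping rule are all chosen for illustration.

```python
import statistics

def reject_outliers(values, center="median", unit="meddev",
                    limit=3.0, iterations=2):
    """Iteratively drop points lying too far from the chosen center.

    center: "mean" or "median" (mode is omitted in this sketch)
    unit:   "stddev", "meddev", or "absolute" (limit used as-is)
    """
    centers = {"mean": statistics.mean, "median": statistics.median}
    kept = list(values)
    for _ in range(iterations):
        if len(kept) < 2:
            break
        c = centers[center](kept)
        if unit == "stddev":
            # root-mean-square deviation around the chosen center
            scale = (sum((v - c) ** 2 for v in kept) / len(kept)) ** 0.5
        elif unit == "meddev":
            # assumed definition: median of the absolute deviations
            scale = statistics.median([abs(v - c) for v in kept])
        else:
            scale = 1.0  # "absolute": the limit is already in data units
        survivors = [v for v in kept if abs(v - c) <= limit * scale]
        if len(survivors) == len(kept):
            break  # nothing rejected, further iterations change nothing
        kept = survivors
    return kept

data = [10.0, 10.1, 9.9, 10.2, 9.8, 50.0]
clean = reject_outliers(data, center="median", unit="meddev")
clean_abs = reject_outliers(data, center="median", unit="absolute", limit=1.0)
```

In terms of the statistics listed earlier, "count" would correspond to
`len(data)` and "rcount" to `len(clean)` in this sketch.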
- -m, --max-memory
<memory>[kmg]
- Maximum amount of memory available for `grcollect`. The unit suffixes
"k", "m" or "g" can be used for kilobytes,
megabytes and gigabytes, respectively. On 32-bit systems, the maximum
memory is limited to 3 gigabytes. Note that `grcollect` does not use any
operating system specific method to determine the maximum amount of
memory, so it should always be set by the user. The default value of 8
megabytes is rather small, so for massive data transposition (tens or
hundreds of gigabytes) this limit is worth setting according to the
physical memory available.
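
The `<memory>[kmg]` notation can be read as in the sketch below. The
helper is hypothetical; in particular, treating the units as binary
multiples (1024-based) and a bare number as a byte count are assumptions,
since the text above does not specify either.

```python
def parse_memory(text):
    """Convert a size such as "8m" or "2g" to a number of bytes."""
    # Assumption: binary multiples; the documentation does not say which.
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    text = text.strip().lower()
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)  # assumption: no suffix means plain bytes

default_limit = parse_memory("8m")   # the documented default of 8 megabytes
larger_limit = parse_memory("2g")
```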
- -t, --tmpdir
<directory>
- Directory for temporary file storage. Note that the default temporary
directory is always the current one (which is equivalent to specifying
"--tmpdir ./"), since in a usual configuration the /tmp
directory is small and, moreover, may be a "tmpfs", i.e. a
temporary file system mounted in physical memory itself.
Report bugs to <apal@szofi.net>, see also
https://fitsh.net/.
Copyright © 1996, 2002, 2004-2008, 2010-2016, 2018-2020;
Pal, Andras <apal@szofi.net>