grmatch - pairing lines by identifier or by cross-matching
grmatch [options] -r <reference> -i <input> [-o <output>]
The program `grmatch` matches lines read from two input files: a reference
file and an input file. All implemented algorithms are symmetric, in the sense
that the result should be the same if these two files are swapped. The order
of the files matters only when a geometrical transformation is also returned
(see point matching below): in this case, swapping the files yields the
inverse of the original transformation.

The lines (rows) can be matched using various criteria:

1. Lines can be matched by identifier, where the identifier can be any
   concatenation of arbitrary, space-separated columns found in the files.
   Generally, the identifier is represented by a single column (e.g. it is an
   astronomical catalog identifier). The behaviour of the program can be tuned
   for the cases when more than one row has the same identifier.
2. Lines can be matched using a 2-dimensional point matching algorithm. In
   this method, the program expects two columns from each of the reference and
   input files, which are treated as X and Y coordinates. Given both point
   lists, the program tries to find the geometrical transformation which
   transforms the points from the frame of the reference list to the frame of
   the input list and, simultaneously, tries to find as many pairs as
   possible. The parameters of the geometrical transformation and of the whole
   algorithm can be fine-tuned.
3. Lines can be matched using an arbitrary- (N-) dimensional coordinate
   matching algorithm. This method expects N columns from each of the
   reference and input files, which are treated as X_1, ..., X_N Cartesian
   coordinates, and it assumes that both point sets are in the same reference
   frame. The point 'A' from the reference list and the point 'P' from the
   input list form a pair if the closest point to 'A' from the input list is
   'P' and vice versa.
- -r <file>, --reference <file>, --input-reference <file>
- Mandatory, name of the reference file.
- <inputfile>, -i <inputfile>, --input <inputfile>
- Name of the input file. If this switch is omitted, the input is read from
stdin (specifying some input is mandatory).
- --separator-reference <char>|space, --separator-input <char>|space
- Character separating the fields of the reference and the input files,
respectively. By default, the fields are separated by whitespace; this can be
made explicit by specifying 'space' here. Otherwise, <char> must be a single
character. For instance, use '--separator-reference ,' and/or
'--separator-input ,' to process CSV files.
- -o <output>, --output <output>, --output-matched <output>
- Name of the output file containing the matched lines. Each matched line is
a pasted line: the first part is from the reference file and the second part
is from the input file, and the two parts are joined by a TAB character. This
switch is optional; if it is not specified, no such output is generated.
- --output-matched-reference <out>, --output-matched-input <out>
- Names of the output files containing the lines corresponding to matches but
only from the reference file or from the input file, respectively.
- --output-excluded-reference <out>, --output-excluded-input <out>
- Names of the files which contain the valid but excluded lines from the
reference and from the input. These outputs are disjoint from the previous
outputs and, together with them, contain all valid lines.
- --output-id <out>
- Name of the file which contains only the identifiers of the matched lines.
If the primary matching method is not identifier matching, the column indices
of the identifiers must also be specified by --col-ref-id and --col-inp-id.
- --output-transformation <output-transformation-file>
- Name of the output file containing the geometrical transformation, in
human-readable format, if the matching method is point matching (otherwise
this option has no effect). The commented version of this file includes some
statistics about the matching (the total number of lines used and matched,
the required CPU time, the final triangulation level, the fit residuals and
the like).
In all of the above input/output file specifications, replacing the file name
with "-" (a single minus sign) forces reading from stdin or writing to stdout.
Note that any part of a line after a "#" (hash mark) is treated as a comment
and is therefore ignored.
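The field and comment conventions described above can be illustrated with a
short Python sketch. This is not the parser used by grmatch; the function name
`split_fields` is hypothetical, and stripping blanks around separated fields
is an assumption of the sketch rather than documented behaviour.

```python
def split_fields(line, separator=None):
    """Return the fields of one line, or None for blank/comment-only lines.

    separator=None splits on runs of whitespace (the documented default);
    otherwise `separator` must be a single character such as ','.
    """
    # Everything after the first '#' is a comment and is ignored.
    content = line.split("#", 1)[0].rstrip("\r\n")
    if not content.strip():
        return None
    if separator is None:
        return content.split()
    if len(separator) != 1:
        raise ValueError("separator must be a single character")
    # Assumption of this sketch: blanks around each field are dropped.
    return [field.strip() for field in content.split(separator)]
```

With this sketch, a whitespace-separated line and a CSV line carrying the same
values yield the same list of fields, and a line holding only a comment is
skipped.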
- --match-points
- This switch forces the usage of the point matching method. By default,
this method is assumed to be used, therefore this switch can be
omitted.
- --col-ref <x>,<y>, --col-inp <x>,<y>
- The column indices containing the X and Y coordinates, for the reference
and for the input file, respectively. The index of the first column is always
1, the index of the second is 2 and so on. Lines in which these columns do not
contain valid real numbers are omitted.
- -a <order>, --order <order>
- This switch specifies the polynomial order of the resulting geometrical
transformation. It can be an arbitrary positive integer. Note that if the
order is A, at least (A+1)*(A+2)/2 valid points are needed from both the
reference and the input file to fit the transformation.
- --max-distance <maxdist>
- The maximal accepted distance between the matched points, measured in the
coordinate frame of the input coordinate list (and not in the coordinate frame
of the reference coordinate list). Possible pairs (which are valid pairs
according to the symmetric coordinate matching algorithms) are excluded if
their Euclidean distance is larger than maxdist. Note that this option has no
default value: if it is omitted, all possible pairs found by the symmetric
matching are returned, which in practice can lead to unexpected behaviour.
One should always specify a reasonable maximal distance, which can be
estimated only from knowledge of the physics behind the input files.
See more options concerning point matching in the section "Fine-Tuning of
Point Matching" below. That section also describes the tuning of the
triangulation used by the point matching algorithm. For a more detailed
description of the point matching algorithms based on pattern and triangle
matching, see [1], [2] or [3].
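The point-count requirement stated for --order follows from the number of
coefficients of a two-dimensional polynomial of order A, which is
(A+1)*(A+2)/2 per output coordinate. A minimal Python sketch (the function
name is hypothetical and is not part of grmatch):

```python
def min_points_for_order(order):
    """Minimum number of valid points needed to fit a 2D polynomial
    transformation of the given order: (A+1)*(A+2)/2."""
    if order < 1:
        raise ValueError("order must be a positive integer")
    return (order + 1) * (order + 2) // 2
```

For instance, a first-order (linear) transformation needs at least 3 points,
a second-order one 6 and a third-order one 10, in each of the two files.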
- --match-coord, --match-coords
- This switch forces the usage of the coordinate matching method. Note that
because of the common options with the point matching method, one should
specify this switch to force the usage of the coordinate matching method
(the default method is point matching, see above).
- --col-ref <x>[,<y>,[<z>...]], --col-inp <x>[,<y>,[<z>...]]
- The column indices containing the spatial coordinates, for the reference
and for the input file, respectively. The index of the first column is always
1, the index of the second is 2 and so on. Lines in which these columns do not
contain valid real numbers are omitted. Note that the dimension of the
coordinate matching space is specified indirectly, by the number of column
indices listed here. Because of this, the number of column indices must be the
same for the reference and for the input; if the dimensions are mismatched,
the program exits unsuccessfully.
- --max-distance <maxdist>
- The maximal accepted distance between the matched points. Possible pairs
(which are valid pairs according to the symmetric coordinate matching
algorithms) are excluded if their Euclidean distance is larger than maxdist.
Note that this option has no default value: if it is omitted, all possible
pairs found by the symmetric matching are returned (see also point matching,
above).
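The symmetric pairing rule (a reference point 'A' and an input point 'P' form
a pair only if each is the closest point to the other) together with the
distance cut can be sketched in Python as follows. This is a brute-force
illustration under the stated rule, not the implementation used by grmatch;
the function names are hypothetical.

```python
import math


def nearest_index(point, candidates):
    """Index of the candidate closest to `point` (brute-force search)."""
    return min(range(len(candidates)),
               key=lambda j: math.dist(point, candidates[j]))


def match_coordinates(ref, inp, max_distance=None):
    """Return (reference_index, input_index) pairs of mutual nearest
    neighbours; pairs farther apart than max_distance are dropped."""
    if not ref or not inp:
        return []
    pairs = []
    for i, a in enumerate(ref):
        j = nearest_index(a, inp)
        # Symmetry check: 'A' must also be the closest reference point to 'P'.
        if nearest_index(inp[j], ref) != i:
            continue
        if max_distance is not None and math.dist(a, inp[j]) > max_distance:
            continue
        pairs.append((i, j))
    return pairs
```

Without a distance limit, two isolated points that happen to be each other's
nearest neighbours are paired even when they are far apart, which is why a
reasonable --max-distance is recommended.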
- --match-id, --match-identifiers
- This switch forces the usage of the identifier matching method.
- --col-ref-id <i>[,<j>,[<k>...]], --col-inp-id <i>[,<j>,[<k>...]]
- Column index or indices containing the identifiers, from the reference and
from the input file, respectively.
- --no-ambiguity, --first-ambiguity, --any-ambiguity, --full-ambiguity
- These options tune the behaviour of the matching when there is more than
one occurrence of a given identifier in the reference and/or input file. If
--no-ambiguity is specified, these identifiers are discarded; this is the
default method. If --first-ambiguity is specified, only the first occurrence
is treated as a matched line, independently of the number of occurrences. If
--any-ambiguity is specified, the lines are paired sequentially for as long as
there is any left from both the reference and the input. For example, if a
given identifier occurs 4 times in the reference and 6 times in the input
file, 4 matched pairs are returned. Finally, if --full-ambiguity is specified,
all possible combinations of the lines are treated as matched lines. For
example, if a given identifier occurs 4 times in the reference and 6 times in
the input file, all 4*6=24 combinations are returned as matched pairs.
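The four ambiguity modes can be compared with a small Python sketch. It
follows the description above and is not taken from the grmatch sources; the
function name and the mode strings "no", "first", "any" and "full" are
hypothetical labels for the corresponding switches.

```python
from collections import defaultdict
from itertools import product


def match_by_id(ref_ids, inp_ids, mode="no"):
    """Return (reference_index, input_index) pairs for one ambiguity mode."""
    ref_rows = defaultdict(list)
    inp_rows = defaultdict(list)
    for i, key in enumerate(ref_ids):
        ref_rows[key].append(i)
    for j, key in enumerate(inp_ids):
        inp_rows[key].append(j)

    pairs = []
    for key, r in ref_rows.items():
        s = inp_rows.get(key)
        if not s:
            continue  # identifier is missing from the input
        if mode == "no":
            # Identifiers occurring more than once on either side are dropped.
            if len(r) == 1 and len(s) == 1:
                pairs.append((r[0], s[0]))
        elif mode == "first":
            pairs.append((r[0], s[0]))
        elif mode == "any":
            pairs.extend(zip(r, s))       # pair sequentially until one runs out
        elif mode == "full":
            pairs.extend(product(r, s))   # all combinations
        else:
            raise ValueError("unknown mode: " + mode)
    return pairs
```

With 4 occurrences of an identifier in the reference and 6 in the input, the
sketch returns 0, 1, 4 and 24 pairs for that identifier in the four modes,
matching the counts given above.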
- --triangulation <parameters>
- This switch is followed by comma-separated directives, which specify the
parameters of the triangulation-based point matching algorithm:
- delaunay, level=<level>, full, auto, unitarity=<U>
- These directives specify the triangulation level used for point matching.
"delaunay" forces the usage of the Delaunay triangles only. This is the
fastest method; however, it works only if the points in the reference and
input lists overlap almost completely and describe almost the same point sets
(with a ratio of common points above 60-70%). The "level" directive specifies
the level of the expansion of the Delaunay triangulation (see [1] for more
details). In practice, the lower the ratio of common points and/or the ratio
of the overlap, the higher the level that should be used. Specifying
"level=1" or "level=2" gives a robust but still fast method for general
usage. The directive "full" forces full triangulation. This can be
overwhelmingly slow and requires a very large amount of memory if there are
more than 40-50 points (the time and memory needed are proportional to the
6th(!) and 3rd power of the number of points, respectively). The directive
"auto" increases the level of the triangulation expansion automatically until
a proper match is found. A match is considered a good match if the unitarity
of the transformation is less than the unitarity U specified by the
"unitarity=U" directive (see also the section Notes/Unitarity below).
- mixed, conformable, reverse
- These directives define the chirality of the triangle spaces to be used.
In practice, this means the following. If we do not know whether the input
and reference lists are inverted with respect to each other, we should use
the "mixed" triangle space. If we are sure that the input and reference lists
are not inverted, we can use the "conformable" triangle space. If we know
that the input and reference lists are inverted, we can use the "reverse"
space. Note that although the "mixed" triangle space can always yield a good
match, it is wise to fix the chirality by specifying "conformable" or
"reverse" if we really know that the point sets are not inverted or are
inverted with respect to each other. If the chirality is fixed, the program
yields more matched pairs, the appropriate triangulation level can be smaller
and, in "auto" mode, the program returns the match definitely faster.
- maxnumber=<max>, maxref=<mr>, maxinp=<mi>
- These directives specify the maximal number of points which are used for
triangulation (for any type of triangulation). Specifying "maxnumber" is
equivalent to defining "maxref" and "maxinp" with the same value. Then, the
first <mr> points from the reference and the first <mi> points from the input
list are used to generate the triangle sets. The "first" points are selected
using the optional information found in one of the columns, see the following
switches.
(Note that there should be only one --triangulation switch: all desired
directives should be written in the same argument, separated by commas.)
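How "maxref" and "maxinp" interact with the ordering columns
(--col-ref-ordering and --col-inp-ordering) can be sketched in Python. The
sketch only mirrors the documented sign convention, where a plain index means
descending order and a negative index means ascending order; the function name
and the sample rows are hypothetical.

```python
def select_for_triangulation(rows, ordering_column, max_points):
    """Pick the first `max_points` rows after sorting by one column.

    rows            -- list of field lists (strings), one per line
    ordering_column -- signed 1-based column index, e.g. 4 or -4
    """
    if ordering_column == 0:
        raise ValueError("column indices start at 1")
    index = abs(ordering_column) - 1
    ascending = ordering_column < 0   # negative sign: smallest values first
    valid = []
    for row in rows:
        try:
            value = float(row[index])
        except (IndexError, ValueError):
            continue  # no valid number in the ordering column: line excluded
        valid.append((value, row))
    valid.sort(key=lambda item: item[0], reverse=not ascending)
    return [row for _, row in valid[:max_points]]
```

For a flux-like column the plain index selects the brightest objects, while
for a magnitude-like column the negative index does.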
- --col-ref-ordering [-]<w>, --col-inp-ordering [-]<w>
- These switches each specify one column index, from the reference and from
the input file respectively, which is used to order these lists and to select
the first "maxref" and "maxinp" points (see above) for the generation of the
two triangle meshes. Both columns should contain valid real numbers,
otherwise the whole(!) line is excluded (not only from sorting but from the
whole matching procedure). If there is no negative sign before the column
index, the data are sorted in descending(!) order, therefore the lines with
the highest(!) values are selected for triangulation. If there is a negative
sign before the index, the data are sorted in ascending order by these
values, therefore the lines with the smallest(!) values are selected for
triangulation. For example, if we want to match star lists, we might want to
use only the brightest ones to generate the triangle sets. If the
brightnesses of the stars are specified by their fluxes, we should not use
the negative sign (the list should be sorted in descending order to select
the first few lines as the brightest stars), and if the brightness is given
as a magnitude, we have to use the negative sign.
- --fit iterations=<N>,firstrejection=<F>,sigma=<S>
- Like --triangulation, this switch is followed by some directives. These
directives specify the number <N> of iterations ("iterations=<N>") for point
matching. The "firstrejection" directive specifies the serial number <F> of
the first iteration where points farther than the <S> "sigma" level are
excluded in the next iteration. Note that in practice this type of iteration
is not really important (due to, for instance, the limitation of the outliers
by the --max-distance switch); however, some suspicious users can be
convinced by such arguments.
- --weight reference|input,column=<wi>,[magnitude],[power=<p>]
- These directives specify the weights which are used during the fit of the
geometrical transformation. In practice this is useful, for example, in the
following situation. When matching star lists, the fainter stars are believed
to have larger astrometric errors, therefore they should have a smaller
influence on the fit. The weights can be taken from the reference (specify
"reference") or from the input (specify "input"), from the column given by
the weight index <wi>. The weights can be derived from stellar magnitudes; if
so, specify "magnitude" to convert the values read from magnitude to flux.
The actual weight is then the "power"th power of the flux. The default value
of "power" is 1; however, for the maximum-likelihood estimation of an assumed
Gaussian distribution, the weights should be the second power of the fluxes.
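The weighting described for --weight can be sketched in Python as below. The
conversion from magnitude to flux is assumed here to be the standard Pogson
relation with an arbitrary zero point, flux = 10^(-0.4*m); the source only
states that magnitudes are converted to fluxes, so the exact formula and the
function name are assumptions of this sketch.

```python
def fit_weight(value, is_magnitude=False, power=1.0):
    """Weight of one point: the flux raised to `power`.

    Assumption: magnitudes are converted with flux = 10**(-0.4 * m); the
    zero point only rescales all weights by a common factor.
    """
    flux = 10.0 ** (-0.4 * value) if is_magnitude else value
    return flux ** power
```

Under this assumption a star 5 magnitudes fainter gets a weight 100 times
smaller with power=1 and 10000 times smaller with power=2.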
Some notes on unitarity. The unitarity of a geometrical transformation
measures how much it differs from the closest transformation which is affine
and a combination of dilation, rotation and shift. For such a transformation
the unitarity is 0, and if the second-order terms in a transformation distort
such a unitary transformation, the unitarity will have the same magnitude as
this second-order effect. For example, mapping a part of a sphere with a size
of d degrees has a unitarity of 1-cos(d). Therefore, for astrometric
purposes, a reasonable value of the critical unitarity in "auto"
triangulation mode can be estimated as 2 or 3 times 1-cos(d/2), where d is
the size of the field in which astrometry should be performed.
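The rule of thumb above can be evaluated with a few lines of Python. The
function name is hypothetical, and the field size d is assumed to be given in
degrees, as in the text.

```python
import math


def critical_unitarity(field_size_deg, factor=2.0):
    """Estimate factor * (1 - cos(d/2)) for a field of size d in degrees."""
    half_field = math.radians(field_size_deg / 2.0)
    return factor * (1.0 - math.cos(half_field))
```

For a 10 degree field and a factor of 2 this gives roughly 0.0076, a value
that could be passed through the "unitarity=<U>" directive of
--triangulation.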
Report bugs to <apal@szofi.net>, see also
https://fitsh.net/.
Copyright © 1996, 2002, 2004-2008, 2010-2016, 2018-2020;
Pal, Andras <apal@szofi.net>