VNL-JOIN(1) | vnlog | VNL-JOIN(1) |
vnl-join - joins two log files on a particular field
$ cat a.vnl
# a b
AA 11
bb 12
CC 13
dd 14
dd 123

$ cat b.vnl
# a c
aa 1
cc 3
bb 4
ee 5
- 23

Try to join unsorted data on field 'a':

$ vnl-join -j a a.vnl b.vnl
# a b c
join: /dev/fd/5:3: is not sorted: CC 13
join: /dev/fd/6:3: is not sorted: bb 4

Sort the data, and join on 'a':

$ vnl-join --vnl-sort - -j a a.vnl b.vnl | vnl-align
# a b c
bb 12 4

Sort the data, and join on 'a', ignoring case:

$ vnl-join -i --vnl-sort - -j a a.vnl b.vnl | vnl-align
# a b c
AA 11 1
bb 12 4
CC 13 3

Sort the data, and join on 'a'. Also print the unmatched lines from both files:

$ vnl-join -a1 -a2 --vnl-sort - -j a a.vnl b.vnl | vnl-align
# a b c
- - 23
AA 11 -
CC 13 -
aa - 1
bb 12 4
cc - 3
dd 123 -
dd 14 -
ee - 5

Sort the data, and join on 'a'. Print the unmatched lines from both files, and output ONLY column 'c' from the 2nd input:

$ vnl-join -a1 -a2 -o 2.c --vnl-sort - -j a a.vnl b.vnl | vnl-align
# c
23
-
-
1
4
3
-
-
5
Usage: vnl-join [join options]
                [--vnl-sort -|[sdfgiMhnRrV]+]
                [ --vnl-[pre|suf]fix[1|2] xxx |
                  --vnl-[pre|suf]fix xxx,yyy,zzz |
                  --vnl-autoprefix |
                  --vnl-autosuffix ]
                logfile1 logfile2
This tool joins two vnlog files on a given field. "vnl-join" is a wrapper around the GNU coreutils "join" tool. Since this is a wrapper, most commandline options and behaviors of the "join" tool are present; consult the join(1) manpage for details. The differences from GNU coreutils "join" include:
Columns are referenced by name, not by index. So instead of saying
join -j1
to join on the first column, you say
vnl-join -j time
to join on column "time".
Options specific to "vnl-join" are available to adjust the field names in the output:
--vnl-prefix1 --vnl-suffix1 --vnl-prefix2 --vnl-suffix2 --vnl-prefix --vnl-suffix --vnl-autoprefix --vnl-autosuffix
See below for details.
Past that, everything "join" does is supported, so see that man page for detailed documentation. Note that all non-legend comments are stripped out, since it's not obvious where they should end up.
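To illustrate what referencing columns by name involves, here is a minimal sketch of how a field name could be translated into the numeric index that plain "join" expects. The helper "field_index" is hypothetical and is not part of vnlog; it assumes the legend is the first line that starts with "# ". This is not necessarily how "vnl-join" does it internally.

```shell
# Hypothetical helper (not part of vnlog): print the 1-based index of
# column NAME in the legend of vnlog FILE. Assumes the legend is the
# first line that starts with "# ".
field_index() {
    awk -v name="$1" '
        /^# / {
            # $1 is the "#" marker, so data column k is field k+1 here
            for (i = 2; i <= NF; i++)
                if ($i == name) { print i - 1; exit }
            exit
        }
    ' "$2"
}

# Example: a small vnlog with three columns
tmp=$(mktemp)
printf '%s\n' '# time x y' '1 10 20' > "$tmp"
field_index time "$tmp"   # prints 1
field_index y "$tmp"      # prints 3
rm -f "$tmp"
```

If the name is not found in the legend, the helper prints nothing, so a caller would need to check for an empty result.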
By default, the field names in the output match those in the input. This is what you want most of the time. It is possible, however, that a column name adjustment is needed. One common use case is joining files that have identically-named columns, which would produce duplicate columns in the output. Example: we fixed a bug in a program, and want to compare the results before and after the fix. The program produces an x-y trajectory as a function of time, so both the buggy and the fixed versions of the program produce a vnlog with a legend
# time x y
Joining this on "time" will produce a vnlog with a legend
# time x y x y
which is confusing, and not what you want. Instead, we invoke "vnl-join" as
vnl-join --vnl-suffix1 _buggy --vnl-suffix2 _fixed -j time buggy.vnl fixed.vnl
And in the output we get a legend
# time x_buggy y_buggy x_fixed y_fixed
Much better.
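The renaming itself is simple to picture. The following sketch applies a suffix to every column of a legend except the join field. The function name "suffix_legend" is made up for this illustration, and the code only mimics the effect on the legend; it is not how "vnl-join" is implemented.

```shell
# Hypothetical helper: read a legend line on stdin and append SUFFIX
# to every column name except the join field KEY.
suffix_legend() {
    awk -v key="$1" -v suf="$2" '{
        out = "#"
        # field 1 is the "#" marker; the column names start at field 2
        for (i = 2; i <= NF; i++)
            out = out " " ($i == key ? $i : ($i suf))
        print out
    }'
}

printf '%s\n' '# time x y' | suffix_legend time _buggy
# prints: # time x_buggy y_buggy
```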
Note that "vnl-join" provides several ways of specifying this. The syntax above works only for 2-way joins. An alternate syntax, a comma-separated list, is available for N-way joins. The same suffixes could be specified like this:
vnl-join -a- --vnl-suffix _buggy,_fixed -j time buggy.vnl fixed.vnl
Finally, if the filenames are structured, "vnl-join" can infer the desired prefixes or suffixes from them. The same as above could be expressed even more simply:
vnl-join --vnl-autosuffix -j time buggy.vnl fixed.vnl
This works by looking at the set of passed-in filenames and stripping out their common leading and trailing strings.
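As a rough sketch of that idea, the function below strips the longest common leading and trailing strings from a list of names, using only POSIX shell features. The name "strip_common" is hypothetical, and the sketch makes no claim about how "vnl-join" turns the remaining strings into the final prefixes or suffixes.

```shell
# Hypothetical helper: print each argument with the longest common
# leading and trailing strings removed. Plain POSIX sh, so the working
# variables (pre, suf, n) are global.
strip_common() {
    pre=$1 suf=$1
    # Shrink the candidate prefix and suffix until every name matches them
    for n in "$@"; do
        while :; do
            case $n in "$pre"*) break ;; esac
            pre=${pre%?}    # drop the last character of the prefix
        done
        while :; do
            case $n in *"$suf") break ;; esac
            suf=${suf#?}    # drop the first character of the suffix
        done
    done
    for n in "$@"; do
        n=${n#"$pre"}
        n=${n%"$suf"}
        printf '%s\n' "$n"
    done
}

strip_common buggy.vnl fixed.vnl
# prints:
# buggy
# fixed
```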
The GNU coreutils "join" tool expects its inputs to be sorted on the join field, because it can then process the data in a single pass. If the input isn't sorted, we can use shell process substitution to sort it:
$ vnl-join -j key <(vnl-sort -s -k key a.vnl) <(vnl-sort -s -k key b.vnl)
For convenience "vnl-join" provides a "--vnl-sort" option. This allows the above to be equivalently expressed as
$ vnl-join -j key --vnl-sort - a.vnl b.vnl
The "-" after "--vnl-sort" indicates that we want to sort the input only. If we also want to sort the output, pass the short option letters that "sort" accepts instead of the "-". For instance, to sort the input for "join" and to sort the output numerically, in reverse, do this:
$ vnl-join -j key --vnl-sort rg a.vnl b.vnl
The reason this shorthand exists is to work around a quirk of "join". The sort order is assumed by "join" to be lexicographical, without any way to change this. For "sort", this is the default sort order, but "sort" has many options to change the sort order, options which are sorely missing from "join". A real-world example affected by this is the joining of numerical data. If you have "a.vnl":
# time a
8 a
9 b
10 c
and "b.vnl":
# time b
9 d
10 e
Then you cannot use "vnl-join" directly to join the data on time:
$ vnl-join -j time a.vnl b.vnl
# time a b
join: /dev/fd/4:3: is not sorted: 10 c
join: /dev/fd/5:2: is not sorted: 10 e
9 b d
10 c e
Instead you must re-sort both files lexicographically, and then (because you almost certainly want to) sort the result back into numerical order:
$ vnl-join -j time <(vnl-sort -s -k time a.vnl) <(vnl-sort -s -k time b.vnl) | vnl-sort -s -n -k time
# time a b
9 b d
10 c e
Yuck. The shorthand described earlier makes the interface part of this palatable:
$ vnl-join -j time --vnl-sort n a.vnl b.vnl
# time a b
9 b d
10 c e
Note that the input sort is stable: "vnl-join" will invoke "vnl-sort -s". If you want a stable post-sort, you need to ask for it with "--vnl-sort s...".
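To make the three steps explicit (sort the inputs lexicographically, join, sort the result numerically), here is a sketch that uses only GNU coreutils "sort" and "join" on legend-free copies of the example data. The temporary file names are arbitrary, and "LC_ALL=C" is set here only to pin the lexicographic order for the demonstration. This shows the idea behind the shorthand; it is not the code "vnl-join" runs.

```shell
# Legend-free copies of the example data (file names are arbitrary)
tmp=$(mktemp -d)
printf '%s\n' '8 a' '9 b' '10 c' > "$tmp/a"
printf '%s\n' '9 d' '10 e'       > "$tmp/b"

# Step 1: stable lexicographic sort on the join field; "10" sorts before "8"
LC_ALL=C sort -s -k1,1 "$tmp/a" > "$tmp/a.sorted"
LC_ALL=C sort -s -k1,1 "$tmp/b" > "$tmp/b.sorted"

# Step 2: join on field 1. Step 3: put the result back into numeric order
result=$(LC_ALL=C join -j 1 "$tmp/a.sorted" "$tmp/b.sorted" |
         LC_ALL=C sort -n -k1,1)
printf '%s\n' "$result"
# prints:
# 9 b d
# 10 c e

rm -r "$tmp"
```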
The GNU coreutils "join" tool is inherently designed to join exactly two files. "vnl-join" extends this capability by chaining together a number of "join" invocations to produce a generic N-way join. This works exactly how you would expect with the following caveats:
This and the other "vnl-xxx" tools that wrap coreutils are written specifically to work with the Linux kernel and the GNU coreutils. None of these have been tested with BSD tools or with non-Linux kernels, and I'm sure things don't just work. It's probably not too effortful to get that running, but somebody needs to at least bug me for that. Or better yet, send me nice patches :)
https://github.com/dkogan/vnlog/
Dima Kogan "<dima@secretsauce.net>"
Copyright 2018 Dima Kogan "<dima@secretsauce.net>"
This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.
2019-01-22 |