PBSTRAIN(1) | User Commands | PBSTRAIN(1) |
NAME
pbsTrain - estimate a discrete encoding scheme for probabilistic biological sequences (PBSs)
DESCRIPTION
Estimate a discrete encoding scheme for probabilistic biological sequences (PBSs) based on training data. The input file should be a table of probability vectors, with a row for each distinct vector, and a column of counts (positive integers) followed by d columns for the elements of the d-dimensional probability vectors (see example below). It may be produced with 'prequel' using the --suff-stats option. Output is a code file that can be used with pbsEncode, pbsDecode, etc. By default, a code of size 255 is created, so that encoded PBSs can be represented with one byte per position (the 256th letter in the code is reserved for gaps). The --nbytes option allows larger codes to be created, if desired.
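For concreteness, a training table for d = 4 might look like the following (hypothetical counts and values; the column spacing and the absence of a header line are assumptions here, since the file is normally generated by prequel rather than written by hand):

```
1048  0.25  0.25  0.25  0.25
512   0.91  0.03  0.03  0.03
77    0.10  0.40  0.40  0.10
```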
The code is estimated by a two-part procedure designed to minimize the "training error" (defined as the total KL divergence) of the encoded training data with respect to the original training data. First, a "grid" is defined for the probability simplex, partitioning it into regions where the simplex intersects the "cells" (hypercubes) of a regular lattice in d-dimensional space. This grid has n "rows" per dimension. By default, n is given the largest possible value such that the number of simplex regions is no larger than the target code size, but smaller values of n can be specified using --nrows. Each simplex region is assigned a letter in the code, and the representative point for that letter is set equal to the mean (weighted by the counts) of all vectors in the training data that fall in that region. This can be shown to minimize the training error for this initial encoding scheme. (If no vectors fall in a region, the representative point is instead set equal to the centroid of the region, which can be shown to minimize the expected KL divergence of points uniformly distributed in the region.)
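This first stage can be sketched as follows (a minimal Python illustration under stated assumptions, not the pbsTrain implementation; in particular, the rule for mapping a vector to a grid cell -- flooring each coordinate -- is an assumption, and the function names are hypothetical):

```python
from collections import defaultdict

import numpy as np

def region_index(p, n):
    """Grid cell containing probability vector p, with n "rows" per
    dimension; coordinates equal to 1 are clamped into the top row."""
    return tuple(min(int(x * n), n - 1) for x in p)

def initial_code(counts, vectors, n):
    """One representative point per occupied region: the count-weighted
    mean of the training vectors falling in that region, which minimizes
    the training error (total KL divergence) for this initial encoding."""
    sums = defaultdict(lambda: np.zeros(vectors.shape[1]))
    totals = defaultdict(float)
    for c, v in zip(counts, vectors):
        r = region_index(v, n)
        sums[r] += c * v
        totals[r] += c
    return {r: sums[r] / totals[r] for r in sums}

counts = np.array([10.0, 5.0, 1.0])
vectors = np.array([[0.7, 0.2, 0.1],
                    [0.6, 0.3, 0.1],
                    [0.1, 0.1, 0.8]])
code = initial_code(counts, vectors, n=2)
```

Here the first two vectors fall in the same region, so they share one representative point (their count-weighted mean), while the third gets its own.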
In the second part of the estimation procedure, the remaining letters in the code are defined by a greedy algorithm, which attempts to further minimize the training error. Briefly, on each step, the simplex region with the largest contribution to the total error is identified, and the next letter in the code is assigned to that region. In this new encoding, there are multiple letters, hence multiple representative points, per region; the representative point for a given vector is taken to be the closest, in terms of KL divergence, of the representative points associated with the simplex region in which that vector falls. When a new representative point is added to a region, all representative points for that region are reoptimized using a k-means type algorithm. This procedure is repeated, letter by letter, until the number of code letters equals the target code size.
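The reassignment-and-reoptimization step can be sketched like this (again a hedged Python illustration, not PHAST's code: the seeding of the new representative point at a training vector and the fixed iteration count are assumptions):

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q), treating 0 * log 0 as 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def assign(vectors, reps):
    """Index of the nearest representative point, by KL, for each vector."""
    return np.array([min(range(len(reps)), key=lambda j: kl(v, reps[j]))
                     for v in vectors])

def refine(counts, vectors, reps, iters=20):
    """k-means-style reoptimization of the representative points within a
    region: alternate nearest-representative assignment with resetting each
    representative to the count-weighted mean of its assigned vectors."""
    reps = [np.asarray(r, float) for r in reps]
    for _ in range(iters):
        a = assign(vectors, reps)
        for j in range(len(reps)):
            w = counts[a == j]
            if w.sum() > 0:
                reps[j] = (w[:, None] * vectors[a == j]).sum(0) / w.sum()
    return reps

# Two training vectors in one simplex region; adding a second representative
# point and refining drives this region's contribution to the error to zero:
counts = np.array([3.0, 1.0])
vectors = np.array([[0.9, 0.05, 0.05],
                    [0.1, 0.1, 0.8]])
reps = refine(counts, vectors, [vectors.mean(0), vectors[1].copy()], iters=5)
err = sum(c * kl(v, reps[j])
          for c, v, j in zip(counts, vectors, assign(vectors, reps)))
```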
EXAMPLES
Generate training data using prequel:

    prequel --suff-stats mydata.fa mytree.mod training

Now estimate a code from the training data:

    pbsTrain training.stats > mycode
The code file contains some metadata followed by a list of code indices and representative points: each index of the code is listed with its representative probability vector (p1, p2, ..., pd).
The reported "average training error" is the training error divided by the number of data points (the sum of the counts).
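As a worked illustration of this quantity (hypothetical numbers, with KL divergence computed in nats):

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) in nats (all entries assumed positive)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

# Two distinct training vectors, with counts, both encoded by a single
# representative point (their count-weighted mean):
counts = np.array([3.0, 1.0])
vectors = np.array([[0.8, 0.2],
                    [0.5, 0.5]])
rep = (counts[:, None] * vectors).sum(0) / counts.sum()  # [0.725, 0.275]

total_err = sum(c * kl(v, rep) for c, v in zip(counts, vectors))
avg_err = total_err / counts.sum()  # training error per data point
```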
OPTIONS
--nrows, -n <n>
    Number of "rows" per dimension in the simplex grid. Default is the maximum possible for the code size.
--nbytes, -b <b>
    Number of bytes per encoded position (default 1, for a code of size 255). Larger values allow larger codes to be created, with one letter still reserved for gaps.
--no-greedy, -G
    Skip greedy optimization -- just assign a single representative point to each region of the probability simplex, equal to the (weighted) mean of all vectors from the training data that fall in that region.
--no-train, -x <dim>
    Ignore training data; simply partition the probability simplex of dimension <dim> and use the centroid of each region as its representative point.
--log, -l <file>
    Write a log of the optimization procedure to the specified file.
--help, -h
    Print this help message and exit.
May 2016 | pbsTrain 1.4 |