Cramer Distance

The Cramer statistic was introduced by Baringhaus & Franz (2004) for multivariate two-sample testing. It is defined as the Euclidean distances between the two data sets minus half of the distances measured within each data set.
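The definition above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the form with a mean between-sample distance, half of each mean within-sample distance, and an mn/(m+n) prefactor; the function names are hypothetical and this is not TurbuStat's implementation, whose exact definition is given in Koch et al. 2017.

```python
import numpy as np


def pairwise_euclidean(a, b):
    """Euclidean distance between every row of `a` and every row of `b`."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def cramer_statistic_sketch(x, y):
    """Two-sample Cramer statistic for samples stored as rows of x and y.

    Assumed form: between-sample mean distance minus half of each
    within-sample mean distance, scaled by m * n / (m + n).
    """
    m, n = x.shape[0], y.shape[0]
    between = pairwise_euclidean(x, y).sum() / (m * n)
    within_x = pairwise_euclidean(x, x).sum() / (2 * m ** 2)
    within_y = pairwise_euclidean(y, y).sum() / (2 * n ** 2)
    return (m * n / (m + n)) * (between - within_x - within_y)


rng = np.random.default_rng(0)
sample = rng.normal(size=(30, 5))
shifted = sample + 2.0  # same shape of distribution, offset mean

same = cramer_statistic_sketch(sample, sample)
different = cramer_statistic_sketch(sample, shifted)
```

In this sketch, comparing a sample with itself gives zero because the between-sample term cancels the two within-sample terms, while the shifted sample gives a positive value.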

Yeremi et al. 2014 applied this to position-position-velocity data cubes by selecting a sample of the brightest pixels in each spectral channel to reduce the cube to a 2D data matrix. It was also tested in Koch et al. 2017, and the definition used in TurbuStat can be found there.

Warning

Koch et al. 2017 find that this test is unsuitable for comparing data cubes that have a large difference in their mean intensities. When using this metric, be sure that the intensity distributions have similar mean intensities or apply some normalization prior to running the metric. Be cautious when interpreting these results and ensure that the distances are compared to a well-understood fiducial.
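One simple way to apply a normalization before running the metric is to rescale one cube so that its mean intensity matches the other. The sketch below is only an assumption about what a reasonable normalization might look like; the function name is hypothetical, the arrays are synthetic, and other choices (for example, standardizing each cube) may suit a given data set better.

```python
import numpy as np


def match_mean_intensity(data, data_fid):
    """Rescale `data` so that its mean intensity equals that of `data_fid`.

    NaNs are ignored when computing the means.
    """
    mean = np.nanmean(data)
    mean_fid = np.nanmean(data_fid)
    if mean == 0:
        raise ValueError("Cannot rescale data with zero mean intensity.")
    return data * (mean_fid / mean)


# Synthetic stand-ins for two cubes with very different mean intensities.
rng = np.random.default_rng(0)
fid = rng.random((4, 8, 8))
bright = 10.0 * rng.random((4, 8, 8))

rescaled = match_mean_intensity(bright, fid)
```

The rescaled array could then be written back into the FITS HDU's data before it is passed to Cramer_Distance.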

Using

The data in this tutorial are available here.

We need to import the Cramer_Distance class, along with a few other common packages:

>>> from turbustat.statistics import Cramer_Distance
>>> from astropy.io import fits
>>> import numpy as np
>>> import matplotlib.pyplot as plt

And we load in the two data sets. The Cramer statistic needs two cubes:

>>> cube = fits.open("Design4_flatrho_0021_00_radmc.fits")[0]  
>>> cube_fid = fits.open("Fiducial0_flatrho_0021_00_radmc.fits")[0]  

Cramer_Distance takes the two cubes as inputs. Minimum intensity values for the statistic to consider can be specified with noise_value1 and noise_value2.

>>> cramer = Cramer_Distance(cube_fid, cube, noise_value1=-np.inf,
...                          noise_value2=-np.inf)  

Note that, since the Cramer statistic defaults to using the upper 20% of the values in each spectral channel, there may not be large differences in the distance when the noise values are low.
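To make the role of the upper-20% selection and the noise value concrete, here is a rough NumPy sketch of reducing a cube to a 2D data matrix. It is not TurbuStat's code: the function name, the zero-padding of channels with too few valid pixels, and the assumption that the spectral axis comes first are all choices made for this illustration.

```python
import numpy as np


def brightest_pixel_matrix(cube, fraction=0.2, noise_value=-np.inf):
    """Build a (channels x pixels) matrix of the brightest values per channel.

    Assumes `cube` has shape (n_channels, ny, nx). Values at or below
    `noise_value` are dropped; short rows are padded with zeros.
    """
    n_chan = cube.shape[0]
    n_keep = int(round(fraction * cube.shape[1] * cube.shape[2]))
    matrix = np.zeros((n_chan, n_keep))
    for i, channel in enumerate(cube):
        vals = channel[np.isfinite(channel)]
        vals = vals[vals > noise_value]
        # Sort in descending order and keep the brightest n_keep values.
        vals = np.sort(vals)[::-1][:n_keep]
        matrix[i, :vals.size] = vals
    return matrix


# Tiny cube: 2 channels of 5 x 2 pixels, holding the values 0..19.
cube = np.arange(20, dtype=float).reshape(2, 5, 2)

no_cut = brightest_pixel_matrix(cube)
with_cut = brightest_pixel_matrix(cube, noise_value=18.5)
```

With no noise cut, each row holds the two brightest values of its channel. A low noise value leaves those brightest values untouched, which is why the distance changes little until the cut approaches the bright end of the distribution.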

The 2D data matrices and the Cramer statistic can now be calculated with:

>>> cramer.distance_metric(normalize=True, n_jobs=1, verbose=True)  

Setting verbose=True creates a figure in which the data matrix for each data cube is shown. The x-axis shows the spectral channels and the y-axis shows the largest pixel values in each spectral channel, ordered with the largest at the bottom. Custom labels can be set with label1 and label2 in the distance_metric call above.

The argument n_jobs sets how many cores to use when calculating pairwise distances with the sklearn pairwise_distances function. This is the slowest step in computing the Cramer statistic; see cramer_statistic for more information.
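As an illustration of how n_jobs enters a pairwise-distance calculation, the sketch below calls scikit-learn's pairwise_distances directly on two small random matrices. The arrays are made up for the example, and this is not a reproduction of TurbuStat's internal code path.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Random stand-ins for two 2D data matrices (rows are samples).
rng = np.random.default_rng(42)
matrix1 = rng.random((50, 20))
matrix2 = rng.random((60, 20))

# Same calculation on one core and on two cores.
dist_serial = pairwise_distances(matrix1, matrix2,
                                 metric="euclidean", n_jobs=1)
dist_parallel = pairwise_distances(matrix1, matrix2,
                                   metric="euclidean", n_jobs=2)
```

Both calls return the same 50 x 60 distance matrix. For small inputs like these, the overhead of starting extra workers can outweigh any speed-up, so a larger n_jobs is mainly useful for large data matrices.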

The distance between the data cubes is:

>>> cramer.distance  
0.18175851051788378

distance_metric performs two steps: format_data to find the 2D data matrix for each cube, and cramer_statistic to calculate the distance. These steps can be run separately to allow for changes in the keyword arguments of both.

References

Baringhaus & Franz 2004