PDF Distance
See the tutorial for a description of PDFs.
There are multiple ways to define the distance between PDFs. Two of the metrics are non-parametric:
- The Hellinger distance between the PDFs (computed over the same set of bins):
  \[d_{\rm Hellinger}(p_1,p_2) = \frac{1}{\sqrt{2}}\left\{\sum_{\tilde{I}} \left[ \sqrt{p_1(\tilde{I})} - \sqrt{p_{2}(\tilde{I})} \right]^2\right\}^{1/2},\]
  where \(p_i\) are the histogram values at the bin \(\tilde{I}\).
- The Kolmogorov-Smirnov distance between the ECDFs of the PDFs:
  \[d_{\rm KS}(P_1, P_2) = \sup_{\tilde{I}} \left| P_1(\tilde{I}) - P_2(\tilde{I}) \right|,\]
  where \(P_i\) is the ECDF at the value \(\tilde{I}\).
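The two non-parametric definitions above can be sketched with NumPy. This is a minimal illustration, not the code used inside PDF_Distance: the function names, the synthetic log-normal samples, the choice of 20 bins, and the normalization of each histogram to unit sum are all assumptions made for the example.

```python
import numpy as np


def hellinger(p1, p2):
    """Hellinger distance between two histograms defined on the same bins."""
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    # Assumption: normalize to unit sum so the distance lies between 0 and 1.
    p1 = p1 / p1.sum()
    p2 = p2 / p2.sum()
    return np.sqrt(np.sum((np.sqrt(p1) - np.sqrt(p2)) ** 2)) / np.sqrt(2)


def ks_distance(data1, data2):
    """Largest absolute difference between the ECDFs of two samples."""
    data1 = np.sort(np.ravel(data1))
    data2 = np.sort(np.ravel(data2))
    # The ECDFs are step functions, so the supremum is reached at a sample value.
    grid = np.concatenate([data1, data2])
    ecdf1 = np.searchsorted(data1, grid, side="right") / data1.size
    ecdf2 = np.searchsorted(data2, grid, side="right") / data2.size
    return np.max(np.abs(ecdf1 - ecdf2))


# Synthetic stand-ins for two sets of image values (not the tutorial data).
rng = np.random.default_rng(0)
a = rng.lognormal(mean=0.0, sigma=0.45, size=4096)
b = rng.lognormal(mean=0.3, sigma=0.45, size=4096)

# Both histograms must share the same bin edges for the Hellinger distance.
bins = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), 21)
p_a, _ = np.histogram(a, bins=bins)
p_b, _ = np.histogram(b, bins=bins)

d_hellinger = hellinger(p_a, p_b)
d_ks = ks_distance(a, b)
```

Identical inputs give a distance of zero for both metrics, and histograms or samples with no overlap give a distance of one.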
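There is also one parametric distance metric included in PDF_Distance: the t-statistic of the difference in the fitted log-normal widths:
\[d_{\rm LN}(w_1, w_2) = \frac{\left| w_1 - w_2 \right|}{\sqrt{\sigma_{w_1}^2 + \sigma_{w_2}^2}},\]
where \(w_i\) is the width of the log-normal distribution fit and \(\sigma_{w_i}\) is the standard error of that width.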
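As a rough sketch of this statistic, the snippet below assumes the usual form of a t-statistic for a difference: the absolute difference of the two widths divided by their standard errors added in quadrature. The function name and the input numbers are made up for illustration and are not taken from PDF_Distance.

```python
import math


def lognormal_width_tstat(w1, w2, w1_err, w2_err):
    """t-statistic of the difference between two fitted widths.

    Assumes the form |w1 - w2| / sqrt(w1_err**2 + w2_err**2).
    """
    return abs(w1 - w2) / math.sqrt(w1_err ** 2 + w2_err ** 2)


# Made-up widths and standard errors, chosen so the arithmetic is easy to
# check by hand: 0.5 / sqrt(0.09 + 0.16) = 0.5 / 0.5, i.e. approximately 1.
d_ln = lognormal_width_tstat(1.0, 0.5, 0.3, 0.4)
```

Because the difference is scaled by the fit uncertainties, two widths that differ by less than their combined standard error give a value below one.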
More information on the distance metric definitions can be found in Koch et al. 2017.
Using
The data in this tutorial are available here.
We need to import the PDF_Distance class, along with a few other common packages:
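>>> from turbustat.statistics import PDF_Distance
>>> from astropy.io import fits
>>> import matplotlib.pyplot as plt
>>> from os.path import join as osjoin  # used below to build the file paths
>>> data_path = "."  # assumption: folder holding the downloaded tutorial data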
And we load in the two data sets. PDF_Distance can be given two 2D images or cubes. For this example, we will use two integrated intensity images:
>>> moment0 = fits.open(osjoin(data_path, "Design4_flatrho_0021_00_radmc_moment0.fits"))[0]
>>> moment0_fid = fits.open(osjoin(data_path, "Fiducial0_flatrho_0021_00_radmc_moment0.fits"))[0]
These two images are given as the inputs to PDF_Distance. Other parameters can be set here, including the minimum image values to be included in the histograms (min_val1/min_val2), whether to fit a log-normal distribution (do_fit), and what type of normalization to use on the data (normalization_type; see the PDF tutorial):
>>> pdf = PDF_Distance(moment0_fid, moment0, min_val1=0.0, min_val2=0.0,
...                    do_fit=True, normalization_type=None)
This will create and run two PDF instances using a common set of bins for the histograms. These can be accessed as pdf1 and pdf2.
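One way to picture a common set of bins is to build the bin edges from the combined range of both data sets after the minimum-value cuts, then histogram each data set on those edges. The sketch below only illustrates that idea with NumPy. The helper name, the `nbins` parameter, the strict `>` cut, and the synthetic images are assumptions, and PDF_Distance may choose its bins differently.

```python
import numpy as np


def common_bin_histograms(img1, img2, min_val1=0.0, min_val2=0.0, nbins=30):
    """Histogram two images on shared bin edges (hypothetical helper)."""
    data1 = np.ravel(img1)
    data2 = np.ravel(img2)
    # Keep finite values above each data set's minimum (assumed strict cut).
    data1 = data1[np.isfinite(data1) & (data1 > min_val1)]
    data2 = data2[np.isfinite(data2) & (data2 > min_val2)]
    # Shared edges spanning the full range of both data sets.
    low = min(data1.min(), data2.min())
    high = max(data1.max(), data2.max())
    edges = np.linspace(low, high, nbins + 1)
    hist1, _ = np.histogram(data1, bins=edges)
    hist2, _ = np.histogram(data2, bins=edges)
    return edges, hist1, hist2


# Synthetic 128 x 128 log-normal images standing in for the FITS data.
rng = np.random.default_rng(1)
img_a = rng.lognormal(mean=5.7, sigma=0.45, size=(128, 128))
img_b = rng.lognormal(mean=5.4, sigma=0.44, size=(128, 128))
edges, hist_a, hist_b = common_bin_histograms(img_a, img_b)
```

Since both histograms share `edges`, their values can be compared bin by bin, which is what a bin-wise metric such as the Hellinger distance needs.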
To calculate the distances, we run:
>>> pdf.distance_metric(verbose=True)
Optimization terminated successfully.
         Current function value: 6.335450
         Iterations: 36
         Function evaluations: 72
Optimization terminated successfully.
         Current function value: 6.007851
         Iterations: 34
         Function evaluations: 69
                              Likelihood Results
==============================================================================
Dep. Variable:                      y   Log-Likelihood:            -1.0380e+05
Model:                     Likelihood   AIC:                         2.076e+05
Method:            Maximum Likelihood   BIC:                         2.076e+05
Date:                Wed, 14 Nov 2018
Time:                        09:58:10
No. Observations:               16384
Df Residuals:                   16382
Df Model:                           2
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
par0           0.4553      0.003    181.019      0.000       0.450       0.460
par1         299.8377      1.067    281.114      0.000     297.747     301.928
==============================================================================
                              Likelihood Results
==============================================================================
Dep. Variable:                      y   Log-Likelihood:                -98433.
Model:                     Likelihood   AIC:                         1.969e+05
Method:            Maximum Likelihood   BIC:                         1.969e+05
Date:                Wed, 14 Nov 2018
Time:                        09:58:10
No. Observations:               16384
Df Residuals:                   16382
Df Model:                           2
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
par0           0.4360      0.002    181.019      0.000       0.431       0.441
par1         225.6771      0.769    293.602      0.000     224.171     227.184
==============================================================================

This returns a summary of the log-normal fits (if do_fit=True) and a plot of the PDF and ECDF of each data set. The solid lines in the plot are the fitted distributions.
By default, all three distance metrics are run. For these images, the distances are:
>>> pdf.hellinger_distance
0.23007068347013115
>>> pdf.ks_distance
0.24285888671875
>>> pdf.lognormal_distance
5.561198154785891
Each distance metric can be run separately by running its function in PDF_Distance, or by setting the statistic keyword in distance_metric.
Because the Hellinger distance requires that the PDF histograms have the same bins, there is no input to give a pre-computed fiducial PDF, unlike most of the other distance metric classes.