public class KolmogorovSmirnovTest
extends java.lang.Object
The K-S test uses a statistic based on the maximum deviation of the empirical distribution of sample data points from the distribution expected under the null hypothesis. For one-sample tests evaluating the null hypothesis that a set of sample data points follow a given distribution, the test statistic is \(D_n=\sup_x |F_n(x)-F(x)|\), where \(F\) is the expected distribution and \(F_n\) is the empirical distribution of the \(n\) sample data points. The distribution of \(D_n\) is estimated using a method based on [1] with certain quick decisions for extreme values given in [2].
Two-sample tests are also supported, evaluating the null hypothesis that the two samples
x and y come from the same underlying distribution. In this case, the test
statistic is \(D_{n,m}=\sup_t | F_n(t)-F_m(t)|\) where \(n\) is the length of x, \(m\) is
the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of
the values in x and \(F_m\) is the empirical distribution of the y values. The
default 2-sample test method, kolmogorovSmirnovTest(double[], double[]) works as
follows:
approximateP(double, int, int) for details on
the approximation.
If the product of the sample sizes is less than 10000 and the sample
data contains ties, random jitter is added to the sample data to break ties before applying
the algorithm above. Alternatively, the bootstrap(double[], double[], int, boolean)
method, modeled after ks.boot
in the R Matching package [3], can be used if ties are known to be present in the data.
In the two-sample case, \(D_{n,m}\) has a discrete distribution. This makes the p-value
associated with the null hypothesis \(H_0 : D_{n,m} \ge d \) differ from \(H_0 : D_{n,m} > d \)
by the mass of the observed value \(d\). To distinguish these, the two-sample tests use a boolean
strict parameter. This parameter is ignored for large samples.
The methods used by the 2-sample default implementation are also exposed directly:
exactP(double, int, int, boolean) computes exact 2-sample p-valuesapproximateP(double, int, int) uses the asymptotic distribution The boolean
arguments in the first two methods allow the probability used to estimate the p-value to be
expressed using strict or non-strict inequality. See
kolmogorovSmirnovTest(double[], double[], boolean).References:
| Modifier and Type | Field and Description |
|---|---|
protected static double |
KS_SUM_CAUCHY_CRITERION
Convergence criterion for
ksSum(double, double, int) |
protected static int |
LARGE_SAMPLE_PRODUCT
When product of sample sizes exceeds this value, 2-sample K-S test uses asymptotic
distribution to compute the p-value.
|
protected static int |
MAXIMUM_PARTIAL_SUM_COUNT
Bound on the number of partial sums in
ksSum(double, double, int) |
protected static int |
MONTE_CARLO_ITERATIONS
Deprecated.
|
protected static double |
PG_SUM_RELATIVE_ERROR
Convergence criterion for the sums in #pelzGood(double, double, int)}
|
private RandomGenerator |
rng
Random data generator used by
monteCarloP(double, int, int, boolean, int) |
protected static int |
SMALL_SAMPLE_PRODUCT
Deprecated.
|
| Constructor and Description |
|---|
KolmogorovSmirnovTest()
Construct a KolmogorovSmirnovTest instance with a default random data generator.
|
KolmogorovSmirnovTest(RandomGenerator rng)
Deprecated.
|
| Modifier and Type | Method and Description |
|---|---|
double |
approximateP(double d,
int n,
int m)
Uses the Kolmogorov-Smirnov distribution to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\)
is the 2-sample Kolmogorov-Smirnov statistic.
|
double |
bootstrap(double[] x,
double[] y,
int iterations)
Computes
bootstrap(x, y, iterations, true). |
double |
bootstrap(double[] x,
double[] y,
int iterations,
boolean strict)
Estimates the p-value of a two-sample
Kolmogorov-Smirnov test
evaluating the null hypothesis that
x and y are samples drawn from the same
probability distribution. |
private static int |
c(int i,
int j,
int m,
int n,
long cmn,
boolean strict)
The function C(i, j) defined in [4] (class javadoc), formula (5.5).
|
private static long |
calculateIntegralD(double d,
int n,
int m,
boolean strict)
Given a d-statistic in the range [0, 1] and the two sample sizes n and m,
an integral d-statistic in the range [0, n*m] is calculated, that can be used for
comparison with other integral d-statistics.
|
double |
cdf(double d,
int n)
Calculates \(P(D_n < d)\) using the method described in [1] with quick decisions for extreme
values given in [2] (see above).
|
double |
cdf(double d,
int n,
boolean exact)
Calculates
P(D_n < d) using method described in [1] with quick decisions for extreme
values given in [2] (see above). |
double |
cdfExact(double d,
int n)
Calculates
P(D_n < d). |
private void |
checkArray(double[] array)
Verifies that
array has length at least 2. |
private FieldMatrix<BigFraction> |
createExactH(double d,
int n)
Creates
H of size m x m as described in [1] (see above). |
private RealMatrix |
createRoundedH(double d,
int n)
Creates
H of size m x m as described in [1] (see above)
using double-precision. |
private double |
exactK(double d,
int n)
Calculates the exact value of
P(D_n < d) using the method described in [1] (reference
in class javadoc above) and BigFraction (see
above). |
double |
exactP(double d,
int n,
int m,
boolean strict)
Computes \(P(D_{n,m} > d)\) if
strict is true; otherwise \(P(D_{n,m} \ge
d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic. |
(package private) static void |
fillBooleanArrayRandomlyWithFixedNumberTrueValues(boolean[] b,
int numberOfTrueValues,
RandomGenerator rng)
Fills a boolean array randomly with a fixed number of
true values. |
private static void |
fixTies(double[] x,
double[] y)
If there are no ties in the combined dataset formed from x and y, this
method is a no-op.
|
private static boolean |
hasTies(double[] x,
double[] y)
Returns true iff there are ties in the combined sample
formed from x and y.
|
private long |
integralKolmogorovSmirnovStatistic(double[] x,
double[] y)
Computes the two-sample Kolmogorov-Smirnov test statistic, \(D_{n,m}=\sup_x |F_n(x)-F_m(x)|\)
where \(n\) is the length of
x, \(m\) is the length of y, \(F_n\) is the
empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\)
is the empirical distribution of the y values. |
private double |
integralMonteCarloP(long d,
int n,
int m,
int iterations)
Uses Monte Carlo simulation to approximate \(P(D_{n,m} >= d/(n*m))\) where \(D_{n,m}\) is the
2-sample Kolmogorov-Smirnov statistic.
|
private static void |
jitter(double[] data,
RealDistribution dist)
Adds random jitter to
data using deviates sampled from dist. |
double |
kolmogorovSmirnovStatistic(double[] x,
double[] y)
Computes the two-sample Kolmogorov-Smirnov test statistic, \(D_{n,m}=\sup_x |F_n(x)-F_m(x)|\)
where \(n\) is the length of
x, \(m\) is the length of y, \(F_n\) is the
empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\)
is the empirical distribution of the y values. |
double |
kolmogorovSmirnovStatistic(RealDistribution distribution,
double[] data)
Computes the one-sample Kolmogorov-Smirnov test statistic, \(D_n=\sup_x |F_n(x)-F(x)|\) where
\(F\) is the distribution (cdf) function associated with
distribution, \(n\) is the
length of data and \(F_n\) is the empirical distribution that puts mass \(1/n\) at
each of the values in data. |
double |
kolmogorovSmirnovTest(double[] x,
double[] y)
Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test
evaluating the null hypothesis that
x and y are samples drawn from the same
probability distribution. |
double |
kolmogorovSmirnovTest(double[] x,
double[] y,
boolean strict)
Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test
evaluating the null hypothesis that
x and y are samples drawn from the same
probability distribution. |
double |
kolmogorovSmirnovTest(RealDistribution distribution,
double[] data)
Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test
evaluating the null hypothesis that
data conforms to distribution. |
double |
kolmogorovSmirnovTest(RealDistribution distribution,
double[] data,
boolean exact)
Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test
evaluating the null hypothesis that
data conforms to distribution. |
boolean |
kolmogorovSmirnovTest(RealDistribution distribution,
double[] data,
double alpha)
Performs a Kolmogorov-Smirnov
test evaluating the null hypothesis that
data conforms to distribution. |
double |
ksSum(double t,
double tolerance,
int maxIterations)
Computes \( 1 + 2 \sum_{i=1}^\infty (-1)^i e^{-2 i^2 t^2} \) stopping when successive partial
sums are within
tolerance of one another, or when maxIterations partial sums
have been computed. |
double |
monteCarloP(double d,
int n,
int m,
boolean strict,
int iterations)
Uses Monte Carlo simulation to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the
2-sample Kolmogorov-Smirnov statistic.
|
private static double |
n(int i,
int j,
int m,
int n,
long cnm,
boolean strict)
The function N(i, j) defined in [4] (class javadoc).
|
double |
pelzGood(double d,
int n)
Computes the Pelz-Good approximation for \(P(D_n < d)\) as described in [2] in the class javadoc.
|
private double |
roundedK(double d,
int n)
Calculates
P(D_n < d) using method described in [1] and doubles (see above). |
protected static final int MAXIMUM_PARTIAL_SUM_COUNT
ksSum(double, double, int)protected static final double KS_SUM_CAUCHY_CRITERION
ksSum(double, double, int)protected static final double PG_SUM_RELATIVE_ERROR
@Deprecated protected static final int SMALL_SAMPLE_PRODUCT
protected static final int LARGE_SAMPLE_PRODUCT
@Deprecated protected static final int MONTE_CARLO_ITERATIONS
monteCarloP(double, int, int, boolean, int).
Deprecated as of version 3.6, as this method is no longer needed.private final RandomGenerator rng
monteCarloP(double, int, int, boolean, int)public KolmogorovSmirnovTest()
@Deprecated public KolmogorovSmirnovTest(RandomGenerator rng)
rng - random data generator used by monteCarloP(double, int, int, boolean, int)public double kolmogorovSmirnovTest(RealDistribution distribution, double[] data, boolean exact)
data conforms to distribution. If
exact is true, the distribution used to compute the p-value is computed using
extended precision. See cdfExact(double, int).distribution - reference distributiondata - sample being being evaluatedexact - whether or not to force exact computation of the p-valuedata is a sample from
distributionInsufficientDataException - if data does not have length at least 2NullArgumentException - if data is nullpublic double kolmogorovSmirnovStatistic(RealDistribution distribution, double[] data)
distribution, \(n\) is the
length of data and \(F_n\) is the empirical distribution that puts mass \(1/n\) at
each of the values in data.distribution - reference distributiondata - sample being evaluatedInsufficientDataException - if data does not have length at least 2NullArgumentException - if data is nullpublic double kolmogorovSmirnovTest(double[] x,
double[] y,
boolean strict)
x and y are samples drawn from the same
probability distribution. Specifically, what is returned is an estimate of the probability
that the kolmogorovSmirnovStatistic(double[], double[]) associated with a randomly
selected partition of the combined sample into subsamples of sizes x.length and
y.length will strictly exceed (if strict is true) or be at least as
large as strict = false) as kolmogorovSmirnovStatistic(x, y).
exactP(double, int, int, boolean). approximateP(double, int, int)
for details on the approximation.
If x.length * y.length < 10000 and the combined set of values in
x and y contains ties, random jitter is added to x and y to
break ties before computing \(D_{n,m}\) and the p-value. The jitter is uniformly distributed
on (-minDelta / 2, minDelta / 2) where minDelta is the smallest pairwise difference between
values in the combined sample.
If ties are known to be present in the data, bootstrap(double[], double[], int, boolean)
may be used as an alternative method for estimating the p-value.
x - first sample datasety - second sample datasetstrict - whether or not the probability to compute is expressed as a strict inequality
(ignored for large samples)x and y represent
samples from the same distributionInsufficientDataException - if either x or y does not have length at
least 2NullArgumentException - if either x or y is nullbootstrap(double[], double[], int, boolean)public double kolmogorovSmirnovTest(double[] x,
double[] y)
x and y are samples drawn from the same
probability distribution. Assumes the strict form of the inequality used to compute the
p-value. See kolmogorovSmirnovTest(RealDistribution, double[], boolean).x - first sample datasety - second sample datasetx and y represent
samples from the same distributionInsufficientDataException - if either x or y does not have length at
least 2NullArgumentException - if either x or y is nullpublic double kolmogorovSmirnovStatistic(double[] x,
double[] y)
x, \(m\) is the length of y, \(F_n\) is the
empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\)
is the empirical distribution of the y values.x - first sampley - second samplex and
y represent samples from the same underlying distributionInsufficientDataException - if either x or y does not have length at
least 2NullArgumentException - if either x or y is nullprivate long integralKolmogorovSmirnovStatistic(double[] x,
double[] y)
x, \(m\) is the length of y, \(F_n\) is the
empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\)
is the empirical distribution of the y values. Finally \(n m D_{n,m}\) is returned
as long value.x - first sampley - second samplex and
y represent samples from the same underlying distributionInsufficientDataException - if either x or y does not have length at
least 2NullArgumentException - if either x or y is nullpublic double kolmogorovSmirnovTest(RealDistribution distribution, double[] data)
data conforms to distribution.distribution - reference distributiondata - sample being being evaluateddata is a sample from
distributionInsufficientDataException - if data does not have length at least 2NullArgumentException - if data is nullpublic boolean kolmogorovSmirnovTest(RealDistribution distribution, double[] data, double alpha)
data conforms to distribution.distribution - reference distributiondata - sample being being evaluatedalpha - significance level of the testdata is a sample from distribution
can be rejected with confidence 1 - alphaInsufficientDataException - if data does not have length at least 2NullArgumentException - if data is nullpublic double bootstrap(double[] x,
double[] y,
int iterations,
boolean strict)
x and y are samples drawn from the same
probability distribution. This method estimates the p-value by repeatedly sampling sets of size
x.length and y.length from the empirical distribution of the combined sample.
When strict is true, this is equivalent to the algorithm implemented in the R function
ks.boot, described in Jasjeet S. Sekhon. 2011. 'Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching package for R.' Journal of Statistical Software, 42(7): 1-52.
x - first sampley - second sampleiterations - number of bootstrap resampling iterationsstrict - whether or not the null hypothesis is expressed as a strict inequalitypublic double bootstrap(double[] x,
double[] y,
int iterations)
bootstrap(x, y, iterations, true).
This is equivalent to ks.boot(x,y, nboots=iterations) using the R Matching
package function. See #bootstrap(double[], double[], int, boolean).x - first sampley - second sampleiterations - number of bootstrap resampling iterationspublic double cdf(double d,
int n)
throws MathArithmeticException
cdfExact(double, int) because calculations are based on
double rather than BigFraction.d - statisticn - sample sizeMathArithmeticException - if algorithm fails to convert h to a
BigFraction in expressing d as \((k
- h) / m\) for integer k, m and \(0 \le h < 1\)public double cdfExact(double d,
int n)
throws MathArithmeticException
P(D_n < d). The result is exact in the sense that BigFraction/BigReal is
used everywhere at the expense of very slow execution time. Almost never choose this in real
applications unless you are very sure; this is almost solely for verification purposes.
Normally, you would choose cdf(double, int). See the class
javadoc for definitions and algorithm description.d - statisticn - sample sizeMathArithmeticException - if the algorithm fails to convert h to a
BigFraction in expressing d as \((k
- h) / m\) for integer k, m and \(0 \le h < 1\)public double cdf(double d,
int n,
boolean exact)
throws MathArithmeticException
P(D_n < d) using method described in [1] with quick decisions for extreme
values given in [2] (see above).d - statisticn - sample sizeexact - whether the probability should be calculated exact using
BigFraction everywhere at the expense of
very slow execution time, or if double should be used convenient places to
gain speed. Almost never choose true in real applications unless you are very
sure; true is almost solely for verification purposes.MathArithmeticException - if algorithm fails to convert h to a
BigFraction in expressing d as \((k
- h) / m\) for integer k, m and \(0 \le h < 1\).private double exactK(double d,
int n)
throws MathArithmeticException
P(D_n < d) using the method described in [1] (reference
in class javadoc above) and BigFraction (see
above).d - statisticn - sample sizeMathArithmeticException - if algorithm fails to convert h to a
BigFraction in expressing d as \((k
- h) / m\) for integer k, m and \(0 \le h < 1\).private double roundedK(double d,
int n)
P(D_n < d) using method described in [1] and doubles (see above).d - statisticn - sample sizepublic double pelzGood(double d,
int n)
d - value of d-statistic (x in [2])n - sample sizeprivate FieldMatrix<BigFraction> createExactH(double d, int n) throws NumberIsTooLargeException, FractionConversionException
H of size m x m as described in [1] (see above).d - statisticn - sample sizeNumberIsTooLargeException - if fractional part is greater than 1FractionConversionException - if algorithm fails to convert h to a
BigFraction in expressing d as \((k
- h) / m\) for integer k, m and \(0 <= h < 1\).private RealMatrix createRoundedH(double d, int n) throws NumberIsTooLargeException
H of size m x m as described in [1] (see above)
using double-precision.d - statisticn - sample sizeNumberIsTooLargeException - if fractional part is greater than 1private void checkArray(double[] array)
array has length at least 2.array - array to testNullArgumentException - if array is nullInsufficientDataException - if array is too shortpublic double ksSum(double t,
double tolerance,
int maxIterations)
tolerance of one another, or when maxIterations partial sums
have been computed. If the sum does not converge before maxIterations iterations a
TooManyIterationsException is thrown.t - argumenttolerance - Cauchy criterion for partial sumsmaxIterations - maximum number of partial sums to computeTooManyIterationsException - if the series does not convergeprivate static long calculateIntegralD(double d,
int n,
int m,
boolean strict)
strict is
true or not, the returned value divided by (n*m) is greater than
(resp greater than or equal to) the given d value (allowing some tolerance).d - a d-statistic in the range [0, 1]n - first sample sizem - second sample sizestrict - whether the returned value divided by (n*m) is allowed to be equal to dpublic double exactP(double d,
int n,
int m,
boolean strict)
strict is true; otherwise \(P(D_{n,m} \ge
d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic. See
kolmogorovSmirnovStatistic(double[], double[]) for the definition of \(D_{n,m}\).
The returned probability is exact, implemented by unwinding the recursive function definitions presented in [4] (class javadoc).
d - D-statistic valuen - first sample sizem - second sample sizestrict - whether or not the probability to compute is expressed as a strict inequalitydpublic double approximateP(double d,
int n,
int m)
kolmogorovSmirnovStatistic(double[], double[]) for the definition of \(D_{n,m}\).
Specifically, what is returned is \(1 - k(d \sqrt{mn / (m + n)})\) where \(k(t) = 1 + 2
\sum_{i=1}^\infty (-1)^i e^{-2 i^2 t^2}\). See ksSum(double, double, int) for
details on how convergence of the sum is determined. This implementation passes ksSum
1.0E-20 as tolerance and
100000 as maxIterations.
d - D-statistic valuen - first sample sizem - second sample sizedstatic void fillBooleanArrayRandomlyWithFixedNumberTrueValues(boolean[] b,
int numberOfTrueValues,
RandomGenerator rng)
true values.
The method uses a simplified version of the Fisher-Yates shuffle algorithm.
By processing first the true values followed by the remaining false values
less random numbers need to be generated. The method is optimized for the case
that the number of true values is larger than or equal to the number of
false values.b - boolean arraynumberOfTrueValues - number of true values the boolean array should finally haverng - random data generatorpublic double monteCarloP(double d,
int n,
int m,
boolean strict,
int iterations)
kolmogorovSmirnovStatistic(double[], double[]) for the definition of \(D_{n,m}\).
The simulation generates iterations random partitions of m + n into an
n set and an m set, computing \(D_{n,m}\) for each partition and returning
the proportion of values that are greater than d, or greater than or equal to
d if strict is false.
d - D-statistic valuen - first sample sizem - second sample sizeiterations - number of random partitions to generatestrict - whether or not the probability to compute is expressed as a strict inequalitydprivate double integralMonteCarloP(long d,
int n,
int m,
int iterations)
Here d is the D-statistic represented as long value.
The real D-statistic is obtained by dividing d by n*m.
See also monteCarloP(double, int, int, boolean, int).
d - integral D-statisticn - first sample sizem - second sample sizeiterations - number of random partitions to generated/(n*m))private static void fixTies(double[] x,
double[] y)
x - first sampley - second sampleprivate static boolean hasTies(double[] x,
double[] y)
x - first sampley - second sampleprivate static void jitter(double[] data,
RealDistribution dist)
data using deviates sampled from dist.
Note that jitter is applied in-place - i.e., the array values are overwritten with the result of applying jitter.
data - input/output data array - entries overwritten by the methoddist - probability distribution to sample for jitter valuesjava.lang.NullPointerException - if either of the parameters is nullprivate static int c(int i,
int j,
int m,
int n,
long cmn,
boolean strict)
i - first path parameterj - second path paramterm - first sample sizen - second sample sizecmn - integral D-statistic (see calculateIntegralD(double, int, int, boolean))strict - whether or not the null hypothesis uses strict inequalityprivate static double n(int i,
int j,
int m,
int n,
long cnm,
boolean strict)
i - first path parameterj - second path parameterm - first sample sizen - second sample sizecnm - integral D-statistic (see calculateIntegralD(double, int, int, boolean))strict - whether or not the null hypothesis uses strict inequalityCopyright (c) 2003-2017 Apache Software Foundation