tdigest-0.2.1: On-line accumulation of rank-based statistics

Safe HaskellNone
LanguageHaskell2010

Data.TDigest.Vector

Contents

Description

A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means. . See original paper: "Computing extremely accurate quantiles using t-digest" by Ted Dunning and Otmar Ertl for more details https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf.

Examples

>>> quantile 0.99 (tdigest [1..1000] :: TDigest 25)
Just 990.5
>>> quantile 0.99 (tdigest [1..1000] :: TDigest 3)
Just 990.3...

t-Digest is more precise in tails, especially median is imprecise:

>>> median (forceCompress $ tdigest [1..1000] :: TDigest 10)
Just 500.5

Semigroup

This operation isn't strictly associative, but statistical variables shouldn't be affected.

>>> let td xs = tdigest xs :: TDigest 10
>>> median (td [1..500] <> (td [501..1000] <> td [1001..1500]))
Just 750.5
>>> median ((td [1..500] <> td [501..1000]) <> td [1001..1500])
Just 750.5

The linear is worst-case scenario:

>>> let td' xs = tdigest (fairshuffle xs) :: TDigest 10
>>> median (td' [1..500] <> (td' [501..1000] <> td' [1001..1500]))
Just 750.5
>>> median ((td' [1..500] <> td' [501..1000]) <> td' [1001..1500])
Just 750.5
Synopsis

Construction

data TDigest (compression :: Nat) #

TDigest is a vector of centroids plus not yet merged elements.

The size of structure is dictated by compression, *𝛿*. And is *O(𝛿)*.

Instances
KnownNat comp => Reducer Double (TDigest comp) #

Both cons and snoc are insert

Instance details

Defined in Data.TDigest.Vector.Internal

Methods

unit :: Double -> TDigest comp #

snoc :: TDigest comp -> Double -> TDigest comp #

cons :: Double -> TDigest comp -> TDigest comp #

Show (TDigest compression) # 
Instance details

Defined in Data.TDigest.Vector.Internal

Methods

showsPrec :: Int -> TDigest compression -> ShowS #

show :: TDigest compression -> String #

showList :: [TDigest compression] -> ShowS #

KnownNat comp => Semigroup (TDigest comp) # 
Instance details

Defined in Data.TDigest.Vector.Internal

Methods

(<>) :: TDigest comp -> TDigest comp -> TDigest comp #

sconcat :: NonEmpty (TDigest comp) -> TDigest comp #

stimes :: Integral b => b -> TDigest comp -> TDigest comp #

KnownNat comp => Monoid (TDigest comp) # 
Instance details

Defined in Data.TDigest.Vector.Internal

Methods

mempty :: TDigest comp #

mappend :: TDigest comp -> TDigest comp -> TDigest comp #

mconcat :: [TDigest comp] -> TDigest comp #

NFData (TDigest comp) # 
Instance details

Defined in Data.TDigest.Vector.Internal

Methods

rnf :: TDigest comp -> () #

KnownNat comp => HasHistogram (TDigest comp) Maybe # 
Instance details

Defined in Data.TDigest.Vector.Internal

tdigest :: (Foldable f, KnownNat comp) => f Double -> TDigest comp #

Strict foldl' over Foldable structure.

Population

singleton :: Double -> TDigest comp #

Make a TDigest of a single data point.

insert #

Arguments

:: KnownNat comp 
=> Double

element

-> TDigest comp 
-> TDigest comp 

Insert single value into TDigest.

insert' #

Arguments

:: KnownNat comp 
=> Double

element

-> TDigest comp 
-> TDigest comp 

Insert single value, don't compress TDigest even if needed.

This may violate the insertion buffer size invariant.

For sensibly bounded input, it makes sense to let TDigest grow (it might grow linearly in size), and after that compress it once.

Compression

>>> let digest = foldl' (flip insert') mempty [0..1000] :: TDigest 5
>>> (size digest, size $ compress digest)
(1001,173)
>>> (quantile 0.1 digest, quantile 0.1 $ compress digest)
(Just 99.6...,Just 99.6...)

compress :: forall comp. KnownNat comp => TDigest comp -> TDigest comp #

forceCompress :: forall comp. KnownNat comp => TDigest comp -> TDigest comp #

Statistics

minimumValue :: KnownNat comp => TDigest comp -> Mean #

Center of left-most centroid. Note: may be different than min element inserted.

>>> minimumValue (tdigest [1..100] :: TDigest 3)
1.0

maximumValue :: KnownNat comp => TDigest comp -> Mean #

Center of right-most centroid. Note: may be different than max element inserted.

>>> maximumValue (tdigest [1..100] :: TDigest 3)
100.0

Percentile

median :: KnownNat comp => TDigest comp -> Maybe Double #

Median, i.e. quantile 0.5.

quantile :: KnownNat comp => Double -> TDigest comp -> Maybe Double #

Calculate quantile of a specific value.

Mean & Variance

>>> stddev (tdigest $ fairshuffle [0..100] :: TDigest 10)
Just 29.0...

mean :: KnownNat comp => TDigest comp -> Maybe Double #

Mean.

>>> mean (tdigest [1..100] :: TDigest 10)
Just 50.5

Note: if you only need the mean, calculate it directly.

variance :: KnownNat comp => TDigest comp -> Maybe Double #

Variance.

stddev :: KnownNat comp => TDigest comp -> Maybe Double #

Standard deviation, square root of variance.

CDF

cdf :: KnownNat comp => Double -> TDigest comp -> Double #

Cumulative distribution function.

Note: if this is the only thing you need, it's more efficient to count this directly.

icdf :: KnownNat comp => Double -> TDigest comp -> Maybe Double #

An alias for quantile

Debug

size :: TDigest comp -> Int #

Size of structure

validate :: TDigest comp -> Either String (TDigest comp) #

Check various invariants in the TDigest structure.