Summarize pdqr-functions with summ_*()

The concept of summary functions is to take one or more pdqr-function(s) and return a summary value (which doesn’t have to be a number). The method argument is used to choose a function-specific computation algorithm.

Note that some summary functions can accumulate pdqr approximation error (summ_moment(), for example). For better precision, increase the number of intervals for the piecewise-linear density using either the n argument of density() in new_*() or the n_grid argument in as_*().

We will use the following distributions throughout this vignette:

library(pdqr)

my_beta <- as_d(dbeta, shape1 = 2, shape2 = 5)
my_norm <- as_d(dnorm, mean = 0.5)
my_beta_mix <- form_mix(list(my_beta, my_beta + 1))

Although all of these are continuous, discrete distributions are also fully supported.
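
For instance, a “discrete” pdqr-function can be created directly from a data frame of values and probabilities (a minimal sketch with illustrative numbers; my_dis is a name introduced here for later examples):

# Discrete distribution with values 1:4 and given probabilities
my_dis <- new_d(data.frame(x = 1:4, prob = c(0.1, 0.2, 0.3, 0.4)), type = "discrete")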

Basic numerical summary

Center
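
A minimal sketch of center summaries (assuming ‘pdqr’ functions summ_mean(), summ_median(), summ_mode(), and the wrapper summ_center(); outputs omitted):

summ_mean(my_beta)
summ_median(my_beta)
summ_mode(my_beta)

# Wrapper with `method` argument
summ_center(my_beta, method = "mean")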

Spread
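
A minimal sketch of spread summaries (assuming ‘pdqr’ functions summ_sd(), summ_var(), summ_iqr(), summ_mad(), summ_range(), and the wrapper summ_spread(); outputs omitted):

summ_sd(my_beta)
summ_var(my_beta)
summ_iqr(my_beta)
summ_mad(my_beta)
summ_range(my_beta)

# Wrapper with `method` argument
summ_spread(my_beta, method = "sd")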

Moments

summ_moment() has extra arguments central, standard, and absolute (which can be combined) for controlling the nature of the moment.
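
For example (a sketch; argument names follow the ‘pdqr’ documentation, outputs omitted):

# Third moment
summ_moment(my_beta, order = 3)

# Flags can be combined
summ_moment(my_beta, order = 3, central = TRUE, standard = TRUE, absolute = TRUE)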

There are wrappers for the most common moments, skewness and kurtosis.
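
For example (a sketch assuming the wrappers summ_skewness() and summ_kurtosis(); outputs omitted):

summ_skewness(my_beta)
summ_kurtosis(my_beta)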

Quantiles

summ_quantile(f, probs) is essentially a stricter version of as_q(f)(probs).
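
For example (a sketch; outputs omitted):

probs <- c(0.25, 0.5, 0.75)
summ_quantile(my_beta, probs = probs)

# Less strict analogue
as_q(my_beta)(probs)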

Entropy

summ_entropy() computes differential entropy (which can be negative) for “continuous” type pdqr-functions, and information entropy for “discrete” ones.
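
For example (a sketch; outputs omitted, my_dis is the “discrete” pdqr-function defined above):

# Differential entropy of a "continuous" pdqr-function (can be negative)
summ_entropy(my_beta)

# Information entropy of a "discrete" pdqr-function
summ_entropy(my_dis)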

summ_entropy2() computes an entropy-based summary of the relation between a pair of distributions. There are two methods: the default “relative” (for relative entropy, i.e. Kullback-Leibler divergence) and “cross” (for cross-entropy). It handles differing supports by using the clip value (default exp(-20)) instead of 0 during log() computation. The order of inputs matters: summ_entropy2() uses the support of the first pdqr-function as the integration/summation reference.
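
For example (a sketch; outputs omitted):

# Kullback-Leibler divergence (default method "relative")
summ_entropy2(my_beta, my_norm)

# Cross-entropy
summ_entropy2(my_beta, my_norm, method = "cross")

# Order of input matters
summ_entropy2(my_norm, my_beta)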

Regions

Distributions can be summarized with regions: unions of closed intervals. A region is represented as a data frame with rows representing intervals and two columns, “left” and “right”, containing the left and right interval edges respectively.

Single interval

summ_interval() summarizes the input pdqr-function with a single interval based on the desired coverage level supplied in the level argument. It has three methods: “minwidth”, “percentile”, and “sigma”.
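
A sketch comparing the three methods (each call returns a single-row region; outputs omitted):

summ_interval(my_beta, level = 0.9, method = "minwidth")
summ_interval(my_beta, level = 0.9, method = "percentile")
summ_interval(my_beta, level = 0.9, method = "sigma")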

Highest density region

summ_hdr() computes the highest density region (HDR) of a distribution: the set of intervals with the lowest total width among all sets with total probability not less than the input level. For a unimodal distribution it is essentially the same as summ_interval() with the “minwidth” method.
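
For example (a sketch; outputs omitted):

# For unimodal `my_beta` this is close to `summ_interval()` with "minwidth" method
summ_hdr(my_beta, level = 0.9)

# For the bimodal `my_beta_mix` the region typically consists of several rows
summ_hdr(my_beta_mix, level = 0.9)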

Work with region

There is a region_*() family of functions which helps to work with regions.
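
A sketch of their usage (the exact signatures are assumptions based on the ‘pdqr’ documentation; outputs omitted):

beta_mix_hdr <- summ_hdr(my_beta_mix, level = 0.9)

region_is_in(beta_mix_hdr, c(0, 0.3, 1, 1.3))  # which points lie inside region
region_prob(beta_mix_hdr, my_beta_mix)         # total probability of region
region_height(beta_mix_hdr, my_beta_mix)       # minimum density inside region
region_width(beta_mix_hdr)                     # total width of region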

Distance

Function summ_distance() takes two pdqr-functions and returns a distance between the two distributions they represent. Many methods of computation are available. This might be useful for comparative statistical inference.

# Kolmogorov-Smirnov distance
summ_distance(my_beta, my_norm, method = "KS")
#> [1] 0.419766

# Total variation distance
summ_distance(my_beta, my_norm, method = "totvar")
#> [1] 0.730451

# Probability of one distribution being bigger than the other, normalized to [0;1]
summ_distance(my_beta, my_norm, method = "compare")
#> [1] 0.1678761

# Wasserstein distance: "average path a density point should travel while
# transforming from one into another"
summ_distance(my_beta, my_norm, method = "wass")
#> [1] 0.6952109

# Cramer distance: integral of squared difference of p-functions
summ_distance(my_beta, my_norm, method = "cramer")
#> [1] 0.1719884

# "Align" distance: path length for which one of distribution should be "moved"
# towards the other so that they become "aligned" (probability of one being
# greater than the other is 0.5)
summ_distance(my_beta, my_norm, method = "align")
#> [1] 0.2147014

# "Entropy" distance: `KL(f, g) + KL(g, f)`, where `KL()` is Kullback-Leibler
# divergence. Usually should be used for distributions with same support, but
# works even if they are different (with big numerical penalty).
summ_distance(my_beta, my_norm, method = "entropy")
#> [1] 13.05768

Separation and classification

Separation

Function summ_separation() computes a threshold that optimally separates the distributions represented by a pair of input pdqr-functions. In other words, summ_separation() solves a binary classification problem with a one-dimensional linear classifier: values not greater than some threshold are classified as one class, and values greater than the threshold as another. The order of input functions doesn’t matter.
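
For example (a sketch using the “KS” method, as in summ_distance(); output omitted):

# Threshold separating `my_beta` and `my_norm` based on Kolmogorov-Smirnov statistic
summ_separation(my_beta, my_norm, method = "KS")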

Classification metrics

Functions summ_classmetric() and summ_classmetric_df() compute metric(s) of a classification setup similar to the one used in summ_separation(). Here the classifier threshold should be supplied, and the order of inputs matters. Classification is assumed to be done as follows: any x value not greater than the threshold is classified as “negative”; any greater value as “positive”. Classification metrics are computed based on two pdqr-functions: f, which represents the distribution of values that should be classified as “negative” (“true negative”), and g, the same for “positive” (“true positive”).
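
A sketch of usage (the particular metric names "GM", "F1", and "MCC" are assumed to be among the supported methods; outputs omitted):

t <- 0.7

# Single metric
summ_classmetric(f = my_beta, g = my_norm, threshold = t, method = "F1")

# Several metrics at once, returned as a data frame
summ_classmetric_df(f = my_beta, g = my_norm, threshold = t, method = c("GM", "F1", "MCC"))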

With summ_roc() and summ_rocauc() one can compute a data frame of ROC curve points and the ROC AUC value respectively. There is also a roc_plot() function for predefined plotting of the ROC curve.
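
For example (a sketch; summ_roc() output is assumed to be a data frame of threshold and rate columns, outputs omitted):

my_roc <- summ_roc(my_beta, my_norm)
head(my_roc)

summ_rocauc(my_beta, my_norm)

# Predefined plotting of ROC curve
roc_plot(my_roc)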

Ordering

‘pdqr’ has functions that can order a set of distributions: summ_order(), summ_sort(), and summ_rank(), which are analogues of order(), sort(), and rank() respectively. They take a list of pdqr-functions as input, establish their ordering based on the specified method, and return the desired output.

There are two sets of methods: “compare”, which orders based on the probability of one distribution being greater than another, and the measures of center (“mean”, “median”, “mode”):

# Here the only clear "correct" ordering is that `a <= b`.
f_list <- list(a = my_beta, b = my_beta + 1, c = my_norm)

# Returns an integer vector representing a permutation which rearranges f_list
# in desired order
summ_order(f_list, method = "compare")
#> [1] 1 3 2

# In this particular case of `f_list` all orderings agree with each other, but
# generally this is not the case: for any pair of methods there is a case
# when they disagree with each other
summ_order(f_list, method = "mean")
#> [1] 1 3 2
summ_order(f_list, method = "median")
#> [1] 1 3 2
summ_order(f_list, method = "mode")
#> [1] 1 3 2

# Use `decreasing = TRUE` to sort decreasingly
summ_order(f_list, method = "compare", decreasing = TRUE)
#> [1] 2 3 1

# Sort list
summ_sort(f_list)
#> $a
#> Density function of continuous type
#> Support: ~[0, 0.95557] (10000 intervals)
#> 
#> $c
#> Density function of continuous type
#> Support: ~[-4.25342, 5.25342] (10000 intervals)
#> 
#> $b
#> Density function of continuous type
#> Support: ~[1, 1.95557] (10000 intervals)
summ_sort(f_list, decreasing = TRUE)
#> $b
#> Density function of continuous type
#> Support: ~[1, 1.95557] (10000 intervals)
#> 
#> $c
#> Density function of continuous type
#> Support: ~[-4.25342, 5.25342] (10000 intervals)
#> 
#> $a
#> Density function of continuous type
#> Support: ~[0, 0.95557] (10000 intervals)

# Rank elements: 1 indicates "the smallest", `length(f_list)` - "the biggest"
summ_rank(f_list)
#> a b c 
#> 1 3 2

Other

Functions summ_prob_true() and summ_prob_false() should be used to extract probabilities from boolean pdqr-functions: outputs of comparison operators (like >=, ==, etc.):

summ_prob_true(my_beta >= my_norm)
#> [1] 0.416062
summ_prob_false(my_beta >= 2*my_norm)
#> [1] 0.6391

summ_pval() computes p-value(s) of observed statistic(s) based on the distribution. You can compute left, right, or two-sided p-values with methods “left”, “right”, and “both” respectively. By default, multiple input values are adjusted for multiple comparisons (using stats::p.adjust()):

# By default two-sided p-value is computed
summ_pval(my_beta, obs = 0.7)
#> [1] 0.02186803
summ_pval(my_beta, obs = 0.7, method = "left")
#> [1] 0.989066
summ_pval(my_beta, obs = 0.7, method = "right")
#> [1] 0.01093401

# Multiple values are adjusted with `p.adjust()` with "holm" method by default
obs_vec <- seq(0, 1, by = 0.1)
summ_pval(my_beta, obs = obs_vec)
#>  [1] 0.0000000000 1.0000000000 1.0000000000 1.0000000000 1.0000000000
#>  [6] 1.0000000000 0.4915085377 0.1530761780 0.0255840348 0.0009720023
#> [11] 0.0000000000

# Use `adjust = "none"` to not adjust
summ_pval(my_beta, obs = obs_vec, adjust = "none")
#>  [1] 0.0000000000 0.2285302047 0.6892806594 0.8403488674 0.4665584871
#>  [6] 0.2187482323 0.0819180896 0.0218680254 0.0031980044 0.0001080003
#> [11] 0.0000000000