Information Theory

Information Theory measures in philentropy

The laws of probability, so true in general, so fallacious in particular.

- Edward Gibbon

Information theory and statistics were beautifully fused by Solomon Kullback. This fusion made it possible to quantify correlations and similarities between random variables with a more sophisticated toolkit. Modern fields such as machine learning and statistical data science build on this fusion, and many of the most powerful statistical techniques in use today rest on an information-theoretic foundation.

The philentropy package aims to follow this tradition and therefore implements the most important information theory measures.

Shannon’s Entropy H(X)

\(H(X) = -\sum\limits_{i=1}^n P(x_i) * log_b(P(x_i))\)

[1] 3.103643
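
As a cross-check of the formula, here is a minimal plain-Python sketch that is independent of philentropy. The function name `shannon_entropy` and the input vector (the integers 1 to 10, normalized) are assumptions chosen for illustration; the input behind the value printed above is not shown. With this input the formula gives approximately 3.1036 bits.

```python
import math

def shannon_entropy(probs, base=2.0):
    """Shannon entropy H(X) = -sum_i P(x_i) * log_b(P(x_i))."""
    # Terms with P(x_i) = 0 are skipped, using the convention 0 * log(0) = 0.
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# Illustrative input (an assumption): the integers 1..10 normalized to sum to 1.
total = sum(range(1, 11))
prob = [i / total for i in range(1, 11)]

h_bits = shannon_entropy(prob)            # base 2, measured in bits
h_nats = shannon_entropy(prob, math.e)    # natural logarithm, measured in nats
print(h_bits, h_nats)
```

Changing the base only rescales the result: the value in nats equals the value in bits multiplied by ln(2).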

Shannon’s Joint-Entropy H(X,Y)

\(H(X,Y) = -\sum\limits_{i=1}^n\sum\limits_{j=1}^m P(x_i, y_j) * log_b(P(x_i, y_j))\)

[1] 6.372236
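
The double sum can be sketched in plain Python on a small joint probability table. The function name `joint_entropy` and both 2 x 2 tables are hypothetical and only illustrate the formula; they are not the input behind the value printed above.

```python
import math

def joint_entropy(joint, base=2.0):
    """H(X,Y) = -sum_i sum_j P(x_i, y_j) * log_b(P(x_i, y_j))."""
    # Cells with probability 0 are skipped (convention 0 * log(0) = 0).
    return -sum(p * math.log(p, base) for row in joint for p in row if p > 0)

# Hypothetical joint tables over two binary variables (rows: x, columns: y).
independent_fair = [[0.25, 0.25],
                    [0.25, 0.25]]    # two independent fair coins
perfectly_coupled = [[0.5, 0.0],
                     [0.0, 0.5]]     # Y always equals X

h_independent = joint_entropy(independent_fair)   # 2 bits
h_coupled = joint_entropy(perfectly_coupled)      # 1 bit
print(h_independent, h_coupled)
```

Two independent fair coins carry two bits together, while two perfectly coupled coins carry only one.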

Shannon’s Conditional-Entropy H(X | Y)

\(H(X|Y) = \sum\limits_{i=1}^n\sum\limits_{j=1}^m P(x_i, y_j) * log_b( P(y_j) / P(x_i, y_j) )\)

[1] 0
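
A minimal plain-Python sketch of the conditional entropy H(X | Y), computed from a joint table whose rows index x and whose columns index y. The function name and the two example tables are hypothetical. The sketch relies on the standard identity H(X | Y) = H(X,Y) - H(Y), written here as a single sum.

```python
import math

def conditional_entropy(joint, base=2.0):
    """H(X|Y) = sum_i sum_j P(x_i, y_j) * log_b(P(y_j) / P(x_i, y_j))."""
    # Marginal P(y_j): sum each column of the joint table.
    p_y = [sum(col) for col in zip(*joint)]
    return sum(
        p_xy * math.log(p_y[j] / p_xy, base)
        for row in joint
        for j, p_xy in enumerate(row)
        if p_xy > 0
    )

coupled = [[0.5, 0.0],
           [0.0, 0.5]]         # Y determines X completely
independent = [[0.25, 0.25],
               [0.25, 0.25]]   # Y says nothing about X

h_coupled = conditional_entropy(coupled)           # 0 bits
h_independent = conditional_entropy(independent)   # 1 bit
print(h_coupled, h_independent)
```

When Y determines X, no uncertainty about X remains and the conditional entropy is zero; when the two are independent, it equals H(X).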

Mutual Information I(X,Y)

\(I(X,Y) = \sum\limits_{i=1}^n\sum\limits_{j=1}^m P(x_i, y_j) * log_b( P(x_i, y_j) / ( P(x_i) * P(y_j) ) )\)

[1] 3.311973
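
The mutual information formula can be sketched in plain Python by deriving both marginals from a joint table. The function name and the example tables are hypothetical, and the sketch takes a single joint table as input, which is a simplification for illustration.

```python
import math

def mutual_information(joint, base=2.0):
    """I(X,Y) = sum_i sum_j P(x_i, y_j) * log_b(P(x_i, y_j) / (P(x_i) * P(y_j)))."""
    p_x = [sum(row) for row in joint]          # marginal over rows
    p_y = [sum(col) for col in zip(*joint)]    # marginal over columns
    return sum(
        p_xy * math.log(p_xy / (p_x[i] * p_y[j]), base)
        for i, row in enumerate(joint)
        for j, p_xy in enumerate(row)
        if p_xy > 0
    )

independent = [[0.25, 0.25],
               [0.25, 0.25]]   # P(x, y) = P(x) * P(y) everywhere
coupled = [[0.5, 0.0],
           [0.0, 0.5]]         # Y always equals X

mi_independent = mutual_information(independent)   # 0 bits
mi_coupled = mutual_information(coupled)           # 1 bit
print(mi_independent, mi_coupled)
```

Independent variables share no information, whereas two perfectly coupled fair coins share exactly one bit.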

Kullback-Leibler Divergence

\(KL(P || Q) = \sum\limits_{i=1}^n p_i * log_2(p_i / q_i) = H(P, Q) - H(P)\)

where p_i and q_i denote the i-th probabilities of P and Q, \(H(P, Q) = -\sum\limits_{i=1}^n p_i * log_2(q_i)\) denotes the cross entropy of P relative to Q (not the joint entropy), and H(P) denotes the entropy of probability distribution P. If P = Q then KL(P || Q) = 0, and if P != Q then KL(P || Q) > 0.

The KL divergence is a non-symmetric measure of the directed divergence between two probability distributions P and Q. Of the properties of a distance metric it fulfills only non-negativity: in general KL(P || Q) != KL(Q || P), and the triangle inequality does not hold.

Because of the relation KL(P || Q) = H(P, Q) - H(P), the Kullback-Leibler divergence is the amount by which the cross entropy H(P, Q) exceeds the entropy H(P). It is therefore also known as relative entropy.

# KL(x, unit = "log2") # Default
Kulback-Leibler Divergence using unit 'log2'.
kullback-leibler 
       0.1392629 
# KL(x, unit = "log")
Kulback-Leibler Divergence using unit 'log'.
kullback-leibler 
      0.09652967 
# KL(x, unit = "log10")
Kulback-Leibler Divergence using unit 'log10'.
kullback-leibler 
       0.0419223 
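
The following plain-Python sketch evaluates the KL formula directly and is independent of philentropy. The inputs (the integers 1 to 10 and 20 to 29, each normalized) and all function names are assumptions for illustration. With these inputs the sketch yields about 0.1393 in base 2, which agrees with the log2 value printed above.

```python
import math

def shannon_entropy(p, base=2.0):
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

def cross_entropy(p, q, base=2.0):
    """Cross entropy of P relative to Q: -sum_i p_i * log_b(q_i)."""
    return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q, base=2.0):
    """KL(P || Q) = sum_i p_i * log_b(p_i / q_i)."""
    # Assumes q_i > 0 wherever p_i > 0; terms with p_i = 0 contribute nothing.
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

# Illustrative inputs (an assumption).
P = [i / 55 for i in range(1, 11)]      # 1..10 normalized (sum is 55)
Q = [k / 245 for k in range(20, 30)]    # 20..29 normalized (sum is 245)

kl_log2 = kl_divergence(P, Q, 2.0)
kl_log = kl_divergence(P, Q, math.e)
kl_log10 = kl_divergence(P, Q, 10.0)
kl_reverse = kl_divergence(Q, P, 2.0)   # swapping the arguments changes the value
print(kl_log2, kl_log, kl_log10, kl_reverse)
```

The three units differ only by a constant factor, and the difference between cross entropy and entropy reproduces the divergence. Swapping P and Q gives a different number, which illustrates the lack of symmetry.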

Jensen-Shannon Divergence

The JSD() function computes the Jensen-Shannon Divergence JSD(P || Q) between two probability distributions P and Q with equal weights π_1 = π_2 = 1/2.

The Jensen-Shannon Divergence JSD(P || Q) between two probability distributions P and Q is defined as:

\(JSD(P || Q) = 0.5 * (KL(P || R) + KL(Q || R))\)

where R = 0.5 * (P + Q) denotes the mid-point of the probability vectors P and Q, and KL(P || R) and KL(Q || R) denote the Kullback-Leibler Divergence of P relative to R and of Q relative to R, respectively.

# JSD(x, unit = "log2") # Default
Jensen-Shannon Divergence using unit 'log2'.
jensen-shannon 
    0.03792749 
# JSD(x, unit = "log")
Jensen-Shannon Divergence using unit 'log'.
jensen-shannon 
    0.02628933 
# JSD(x, unit = "log10")
Jensen-Shannon Divergence using unit 'log10'.
jensen-shannon 
    0.01141731 
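
The definition via the mid-point R can be sketched in plain Python as follows. The function names and the inputs (the integers 1 to 10 and 20 to 29, each normalized) are assumptions for illustration; with these inputs the sketch gives roughly 0.0379 in base 2, in line with the log2 value printed above.

```python
import math

def kl_divergence(p, q, base=2.0):
    # Terms with p_i = 0 contribute nothing; q_i > 0 is assumed wherever p_i > 0.
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q, base=2.0):
    """JSD(P || Q) = 0.5 * (KL(P || R) + KL(Q || R)) with R = 0.5 * (P + Q)."""
    r = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * (kl_divergence(p, r, base) + kl_divergence(q, r, base))

# Illustrative inputs (an assumption).
P = [i / 55 for i in range(1, 11)]
Q = [k / 245 for k in range(20, 30)]

jsd_pq = jsd(P, Q)
jsd_qp = jsd(Q, P)
jsd_pp = jsd(P, P)
print(jsd_pq, jsd_qp, jsd_pp)
```

Because R is strictly positive wherever P or Q is, both KL terms are always finite, which is one practical advantage over the plain KL divergence.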

Alternatively, users can specify count data.

Jensen-Shannon Divergence using unit 'log2'.
jensen-shannon 
    0.03792749
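
One simple way to turn counts into probabilities is to divide each count by the total. The plain-Python sketch below uses that relative-frequency estimator as an assumption; the helper names and the count vectors are hypothetical. Since the counts 1 to 10 and 20 to 29 normalize to the same probability vectors as before, the divergence is unchanged.

```python
import math

def estimate_probabilities(counts):
    """Relative frequencies: each count divided by the total count."""
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q, base=2.0):
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q, base=2.0):
    r = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * (kl_divergence(p, r, base) + kl_divergence(q, r, base))

# Hypothetical count vectors.
counts_p = list(range(1, 11))     # 1, 2, ..., 10
counts_q = list(range(20, 30))    # 20, 21, ..., 29

prob_p = estimate_probabilities(counts_p)
prob_q = estimate_probabilities(counts_q)
jsd_from_counts = jsd(prob_p, prob_q)
print(jsd_from_counts)
```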

Users can also compute all pairwise distances between the rows of a probability matrix:

           v1           v2           v3
v1 0.00000000 0.0379274917 0.0435852218
v2 0.03792749 0.0000000000 0.0002120578
v3 0.04358522 0.0002120578 0.0000000000
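
A pairwise matrix of this kind can be sketched in plain Python by applying the two-vector divergence to every pair of rows. The function names and the three row vectors are assumptions for illustration and are not taken from the output above.

```python
import math

def kl_divergence(p, q, base=2.0):
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q, base=2.0):
    r = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * (kl_divergence(p, r, base) + kl_divergence(q, r, base))

def pairwise_jsd(rows, base=2.0):
    """Square matrix whose entry (a, b) is the JSD between row a and row b."""
    return [[jsd(a, b, base) for b in rows] for a in rows]

# Hypothetical probability matrix: each row is one probability vector.
prob_matrix = [
    [i / 55 for i in range(1, 11)],      # 1..10 normalized
    [k / 245 for k in range(20, 30)],    # 20..29 normalized
    [k / 345 for k in range(30, 40)],    # 30..39 normalized
]

dist = pairwise_jsd(prob_matrix)
for row in dist:
    print(["%.8f" % value for value in row])
```

The diagonal is zero and the matrix is symmetric, so only the upper triangle needs to be computed when the number of rows is large.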

Properties of the Jensen-Shannon Divergence:

  • JSD is non-negative.

  • JSD is a symmetric measure JSD(P || Q) = JSD(Q || P).

  • JSD = 0, if and only if P = Q.

Generalized Jensen-Shannon Divergence

The generalized Jensen-Shannon Divergence \(gJSD_{\pi_1,...,\pi_n}(P_1, ..., P_n)\) enables distance comparisons between multiple probability distributions \(P_1,...,P_n\):

\(gJSD_{\pi_1,...,\pi_n}(P_1, ..., P_n) = H(\sum_{i = 1}^n \pi_i*P_i) - \sum_{i = 1}^n \pi_i*H(P_i)\)

where \(\pi_1,...,\pi_n\) denote the weights selected for the probability vectors \(P_1,...,P_n\) and \(H(P_i)\) denotes the Shannon Entropy of probability vector \(P_i\).

#> No weights were specified ('weights = NULL'), thus equal weights for all
#> distributions will be calculated and applied.
#> Metric: 'gJSD'; unit = 'log2'; comparing: 3 vectors (v1, ... , v3).
#> Weights: v1 = 0.333333333333333, v2 = 0.333333333333333, v3 = 0.333333333333333
[1] 0.03512892

As you can see, the gJSD function prints out the exact number of vectors that were used to compute the generalized JSD. By default, the weights are uniformly distributed (weights = NULL).

Users can also supply non-uniform weights through the weights argument:

#> Metric: 'gJSD'; unit = 'log2'; comparing: 3 vectors (v1, ... , v3).
#> Weights: v1 = 0.5, v2 = 0.25, v3 = 0.25
[1] 0.04081969
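
The entropy-based definition of the generalized JSD can be sketched in plain Python as shown below. The function names, the validation rule for the weights, and the example vectors are assumptions for illustration; this is not the package's implementation.

```python
import math

def shannon_entropy(p, base=2.0):
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

def generalized_jsd(distributions, weights=None, base=2.0):
    """gJSD = H(sum_i w_i * P_i) - sum_i w_i * H(P_i)."""
    n = len(distributions)
    if weights is None:
        weights = [1.0 / n] * n    # equal weights by default
    if len(weights) != n or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("need one weight per distribution, summing to 1")
    length = len(distributions[0])
    # Weighted mixture of all distributions, computed position by position.
    mixture = [sum(w * d[k] for w, d in zip(weights, distributions))
               for k in range(length)]
    weighted_entropies = sum(w * shannon_entropy(d, base)
                             for w, d in zip(weights, distributions))
    return shannon_entropy(mixture, base) - weighted_entropies

# Hypothetical probability vectors.
P = [i / 55 for i in range(1, 11)]
Q = [k / 245 for k in range(20, 30)]
S = [k / 345 for k in range(30, 40)]

gjsd_two = generalized_jsd([P, Q])                          # equal weights, two vectors
gjsd_weighted = generalized_jsd([P, Q, S], [0.5, 0.25, 0.25])
gjsd_identical = generalized_jsd([P, P, P])
print(gjsd_two, gjsd_weighted, gjsd_identical)
```

With two vectors and equal weights the generalized form reduces to the ordinary Jensen-Shannon Divergence, and identical vectors give a value of zero up to floating-point rounding.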

Finally, users can use the argument est.prob to empirically estimate probability vectors when they wish to specify count vectors as input:

#> No weights were specified ('weights = NULL'), thus equal weights for all distributions will be calculated and applied.
#> Metric: 'gJSD'; unit = 'log2'; comparing: 3 vectors (v1, ... , v3).
#> Weights: v1 = 0.333333333333333, v2 = 0.333333333333333, v3 = 0.333333333333333
[1] 0.03512892