
Comparing splits using information theory

Martin R. Smith

2020-07-09

To understand the information-based metrics implemented in ‘TreeDist’, it is useful to recall some basic concepts of information theory (see MacKay (2003) for an introduction).

Shannon information

Information is usually measured in bits. One bit is the amount of information generated by tossing a fair coin: to record the outcome of a coin toss, I must record either an ‘H’ or a ‘T’; as the two symbols are equally likely, there is no way to compress the results of multiple tosses.

The Shannon (1948) information content of an outcome $x$ is defined to be $h(x) = -\log_2 P(x)$, which simplifies to $\log_2 n$ when all $n$ outcomes are equiprobable. Thus, the information content of a fair coin toss is $\log_2 2 = 1$ bit; the information content of rolling a six-sided die is $\log_2 6 \approx 2.58$ bits; and the information content of selecting at random any one of the 105 unrooted binary six-leaf trees is $\log_2 105 \approx 6.71$ bits.
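These values can be reproduced with a couple of lines of R; the only package function needed is NUnrooted() from ‘TreeTools’ (also used below), which counts the unrooted binary trees on a given number of leaves.

library('TreeTools', quietly = TRUE, warn.conflicts = FALSE)

-log2(1 / 2)             # fair coin toss: 1 bit
-log2(1 / 6)             # roll of a six-sided die: about 2.58 bits
-log2(1 / NUnrooted(6))  # one of the 105 six-leaf trees: about 6.71 bits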

Application to splits

The split $S_1$ = AB|CDEF is found in 15 of the 105 six-leaf trees; as such, the probability that a randomly drawn tree contains $S_1$ is $P(S_1) = \frac{15}{105}$, and the information content $h(S_1) = -\log_2\frac{15}{105} \approx 2.81$ bits. Steel & Penny (2006) dub this quantity the phylogenetic information content.

Likewise, the split $S_2$ = ABC|DEF occurs in nine six-leaf trees, so $h(S_2) = -\log_2\frac{9}{105} \approx 3.54$ bits. Three six-leaf trees contain both splits, so in combination the splits deliver $h(S_1, S_2) = -\log_2\frac{3}{105} \approx 5.13$ bits of information.

Because $h(S_1, S_2) < h(S_1) + h(S_2)$, some of the information in $S_1$ is also present in $S_2$. The information in common between $S_1$ and $S_2$ is $h_\textrm{shared}(S_1, S_2) = h(S_1) + h(S_2) - h(S_1, S_2) \approx 1.22$ bits. The information unique to $S_1$ and $S_2$ is $h_\textrm{different}(S_1, S_2) = 2 h(S_1, S_2) - h(S_1) - h(S_2) \approx 3.91$ bits.

These quantities can be calculated using functions from the ‘TreeTools’ and ‘TreeDist’ packages.

library('TreeTools', quietly = TRUE, warn.conflicts = FALSE)
library('TreeDist')
treesMatchingSplit <- c(
  AB.CDEF = TreesMatchingSplit(2, 4),
  ABC.DEF = TreesMatchingSplit(3, 3)
)
treesMatchingSplit
## AB.CDEF ABC.DEF 
##      15       9
proportionMatchingSplit <- treesMatchingSplit / NUnrooted(6)
proportionMatchingSplit
##    AB.CDEF    ABC.DEF 
## 0.14285714 0.08571429
splitInformation <- -log2(proportionMatchingSplit)
splitInformation
##  AB.CDEF  ABC.DEF 
## 2.807355 3.544321
treesMatchingBoth <- TreesConsistentWithTwoSplits(6, 2, 3)
combinedInformation <- -log2(treesMatchingBoth / NUnrooted(6))

sharedInformation <- sum(splitInformation) - combinedInformation
sharedInformation
## [1] 1.222392
# Or more concisely:
SplitSharedInformation(n = 6, 2, 3)
## [1] 1.222392

Entropy

Entropy is the average information content of each outcome, weighted by its probability: $\sum_i -p_i \log_2 p_i$. Where all $n$ outcomes are equiprobable, this simplifies to $\log_2 n$.
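As a minimal check of this definition, the entropy of a fair six-sided die can be computed directly in base R:

p <- rep(1/6, 6)   # six equiprobable outcomes
-sum(p * log2(p))  # = log2(6), about 2.58 bits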

Consider a case in which Jane rolls a die, and truthfully reports the answers to two questions about the outcome $x$:

$S_1$: “Is the roll even?”

$S_2$: “Is the roll greater than 3?”

The joint entropy of $S_1$ and $S_2$ is the entropy of the association matrix that considers each possible combination of answers:

             S1: x odd               S1: x even
S2: x ≤ 3    x ∈ {1, 3}; p = 2/6     x = 2; p = 1/6
S2: x > 3    x = 5; p = 1/6          x ∈ {4, 6}; p = 2/6

$$H(S_1, S_2) = -\left(\tfrac{2}{6}\log_2\tfrac{2}{6} + \tfrac{1}{6}\log_2\tfrac{1}{6} + \tfrac{1}{6}\log_2\tfrac{1}{6} + \tfrac{2}{6}\log_2\tfrac{2}{6}\right) \approx 1.92 \text{ bits}$$

Note that this is less than the $\log_2 6 \approx 2.58$ bits required to determine the exact value of the roll: knowledge of $S_1$ and $S_2$ is not guaranteed to be sufficient to identify $x$ unambiguously.
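This joint entropy can be verified numerically from the cell probabilities in the table above:

jointP <- c(2, 1, 1, 2) / 6  # cell probabilities from the table above
-sum(jointP * log2(jointP))
## [1] 1.918296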

The mutual information between $S_1$ and $S_2$ describes how much knowledge of $S_1$ reduces our uncertainty about $S_2$ (and vice versa): if we learn from $S_1$ that the roll is even, we become a little more confident that it is greater than three.

The mutual information $I(S_1; S_2)$ corresponds to the sum of the individual entropies, minus the joint entropy:

$$I(S_1; S_2) = H(S_1) + H(S_2) - H(S_1, S_2)$$

The entropy distance, also termed the variation of information (Meilă, 2007), is the information that $S_1$ and $S_2$ do not have in common:

$$H_D(S_1, S_2) = H(S_1, S_2) - I(S_1; S_2) = 2 H(S_1, S_2) - H(S_1) - H(S_2)$$
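For the dice example, both quantities follow directly from the entropies computed above; a short base-R sketch:

p1 <- c(3, 3) / 6            # S1: odd / even
p2 <- c(3, 3) / 6            # S2: <= 3 / > 3
jointP <- c(2, 1, 1, 2) / 6  # joint probabilities from the table above

H1 <- -sum(p1 * log2(p1))
H2 <- -sum(p2 * log2(p2))
H12 <- -sum(jointP * log2(jointP))

H1 + H2 - H12      # mutual information: about 0.08 bits
2 * H12 - H1 - H2  # entropy distance: about 1.84 bits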

Application to splits

Let a split $S$ divide $n$ leaves into two partitions, $A$ and $B$. The probability that a randomly chosen leaf $x$ belongs to partition $k$ is $P(x \in k) = \frac{|k|}{n}$. $S$ thus corresponds to a random variable with entropy $H(S) = -\frac{|A|}{n}\log_2\frac{|A|}{n} - \frac{|B|}{n}\log_2\frac{|B|}{n}$ (Meilă, 2007).

The entropy of a split corresponds to the average number of bits necessary to transmit the partition label for a given leaf.
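For example, the split AB|CDEF considered above separates two leaves from four, so its entropy is a little under one bit:

p <- c(2, 4) / 6   # partition sizes of AB|CDEF, as proportions of the six leaves
-sum(p * log2(p))  # about 0.92 bits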

The joint entropy of two splits, $S_1$ and $S_2$, corresponds to the entropy of the association matrix of probabilities that a randomly selected leaf belongs to each pair of partitions:

             S1: x ∈ A1                    S1: x ∈ B1
S2: x ∈ A2   P(A1, A2) = |A1 ∩ A2| / n     P(B1, A2) = |B1 ∩ A2| / n
S2: x ∈ B2   P(A1, B2) = |A1 ∩ B2| / n     P(B1, B2) = |B1 ∩ B2| / n

$$\begin{aligned} H(S_1, S_2) = -\big(&P(A_1, A_2) \log_2 P(A_1, A_2) + P(B_1, A_2) \log_2 P(B_1, A_2) \\ &+ P(A_1, B_2) \log_2 P(A_1, B_2) + P(B_1, B_2) \log_2 P(B_1, B_2)\big) \end{aligned}$$

These values can then be substituted into the definitions of mutual information and entropy distance given above.
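Returning to the splits AB|CDEF and ABC|DEF from the first example, the four intersections contain two, one, zero and three of the six leaves. The arithmetic can be sketched in base R (an illustrative calculation, not the implementation used in ‘TreeDist’):

# Leaf counts in each intersection of partitions, for
# S1 = AB|CDEF (A1 = AB, B1 = CDEF) and S2 = ABC|DEF (A2 = ABC, B2 = DEF)
counts <- matrix(c(2, 1,   # row A2: |A1 ∩ A2| = 2 (AB), |B1 ∩ A2| = 1 (C)
                   0, 3),  # row B2: |A1 ∩ B2| = 0,      |B1 ∩ B2| = 3 (DEF)
                 nrow = 2, byrow = TRUE)
jointP <- counts / 6
p1 <- colSums(jointP)  # marginal probabilities for S1
p2 <- rowSums(jointP)  # marginal probabilities for S2

H <- function(p) -sum(p[p > 0] * log2(p[p > 0]))  # entropy, ignoring empty cells

H(jointP)                      # joint entropy: about 1.46 bits
H(p1) + H(p2) - H(jointP)      # mutual information: about 0.46 bits
2 * H(jointP) - H(p1) - H(p2)  # entropy distance: 1 bit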

As $S_1$ and $S_2$ become more different, the disposition of $S_1$ gives less information about the configuration of $S_2$, and the mutual information decreases accordingly.

References

MacKay, D. J. C. (2003). Information theory, inference, and learning algorithms. Cambridge: Cambridge University Press. Retrieved from https://www.inference.org.uk/itprnn/book.pdf

Meilă, M. (2007). Comparing clusterings—an information based distance. Journal of Multivariate Analysis, 98(5), 873–895. doi:10.1016/j.jmva.2006.11.013

Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656.

Steel, M. A., & Penny, D. (2006). Maximum parsimony and the phylogenetic information in multistate characters. In V. A. Albert (Ed.), Parsimony, phylogeny, and genomics (pp. 163–178). Oxford: Oxford University Press.