To understand the information-based metrics implemented in ‘TreeDist’, it is useful to recall some basic concepts of information theory (see MacKay (2003) for an introduction).
Information is usually measured in bits. One bit is the amount of information generated by tossing a fair coin: to record the outcome of a coin toss, I must record either an H or a T, and with each of the two symbols equally likely, there is no way to compress the results of multiple tosses.
The Shannon (1948) information content of an outcome $x$ is defined to be $h(x) = -\log_2{P(x)}$, which simplifies to $\log_2{n}$ when all $n$ outcomes are equiprobable. Thus, the information content of a fair coin toss is $\log_2{2} = 1$ bit; the information content of rolling a six-sided die is $\log_2{6} \approx 2.58$ bits; and the information content of selecting at random any one of the 105 unrooted binary six-leaf trees is $\log_2{105} \approx 6.71$ bits.
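These values can be checked directly in R; the sketch below uses only base R's `log2()` and `NUnrooted()` from 'TreeTools' (loaded in the code block below), which counts the unrooted binary trees on a given number of leaves.

log2(2)                        # fair coin toss: 1 bit
log2(6)                        # roll of a six-sided die: ~2.58 bits
log2(TreeTools::NUnrooted(6))  # one of the 105 six-leaf trees: ~6.71 bits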
The split $S_1 = \textrm{AB|CDEF}$ is found in 15 of the 105 six-leaf trees; as such, the probability that a randomly drawn tree contains $S_1$ is $P(S_1) = \frac{15}{105}$, and the information content $h(S_1) = -\log_2{\frac{15}{105}} \approx 2.81$ bits. Steel & Penny (2006) dub this quantity the phylogenetic information content.
Likewise, the split $S_2 = \textrm{ABC|DEF}$ occurs in nine six-leaf trees, so $h(S_2) = -\log_2{\frac{9}{105}} \approx 3.54$ bits. Three six-leaf trees contain both splits, so in combination the splits deliver $h(S_1, S_2) = -\log_2{\frac{3}{105}} \approx 5.13$ bits of information.
Because $h(S_1, S_2) < h(S_1) + h(S_2)$, some of the information in $S_1$ is also present in $S_2$. The information in common between $S_1$ and $S_2$ is $h_{shared}(S_1, S_2) = h(S_1) + h(S_2) - h(S_1, S_2) \approx 1.22$ bits. The information unique to $S_1$ and $S_2$ is $h_{different}(S_1, S_2) = 2h(S_1, S_2) - h(S_1) - h(S_2) \approx 3.91$ bits.
These quantities can be calculated using functions in the ‘TreeTools’ package.
library('TreeTools', quietly = TRUE, warn.conflicts = FALSE)
library('TreeDist')

treesMatchingSplit <- c(
  AB.CDEF = TreesMatchingSplit(2, 4),
  ABC.DEF = TreesMatchingSplit(3, 3)
)
treesMatchingSplit
## AB.CDEF ABC.DEF
##      15       9

proportionMatchingSplit <- treesMatchingSplit / NUnrooted(6)
proportionMatchingSplit
##    AB.CDEF    ABC.DEF
## 0.14285714 0.08571429

splitInformation <- -log2(proportionMatchingSplit)
splitInformation
##  AB.CDEF  ABC.DEF
## 2.807355 3.544321

treesMatchingBoth <- TreesConsistentWithTwoSplits(6, 2, 3)
combinedInformation <- -log2(treesMatchingBoth / NUnrooted(6))

sharedInformation <- sum(splitInformation) - combinedInformation
sharedInformation
## [1] 1.222392

# Or more concisely:
SplitSharedInformation(n = 6, 2, 3)
## [1] 1.222392
Entropy is the average information content of each outcome, weighted by its probability: $\sum{-p \log_2{p}}$. Where all $n$ outcomes are equiprobable, this simplifies to $\log_2{n}$.
Consider a case in which Jane rolls a die and makes two true statements about the outcome $x$:

$S_1$: “The roll is even.”

$S_2$: “The roll is greater than 3.”
The joint entropy of $S_1$ and $S_2$ is the entropy of the association matrix that considers each possible outcome:

| | $S_1$: $x$ odd | $S_1$: $x$ even |
|---|---|---|
| $S_2$: $x \le 3$ | $x \in \{1, 3\}$; $p = \frac{2}{6}$ | $x = 2$; $p = \frac{1}{6}$ |
| $S_2$: $x > 3$ | $x = 5$; $p = \frac{1}{6}$ | $x \in \{4, 6\}$; $p = \frac{2}{6}$ |

$$H(S_1, S_2) = -\left(\tfrac{2}{6}\log_2{\tfrac{2}{6}} + \tfrac{1}{6}\log_2{\tfrac{1}{6}} + \tfrac{1}{6}\log_2{\tfrac{1}{6}} + \tfrac{2}{6}\log_2{\tfrac{2}{6}}\right) \approx 1.92 \textrm{ bits}$$
Note that this is less than the $\log_2{6} \approx 2.58$ bits required to determine the exact value of the roll: knowledge of $S_1$ and $S_2$ is not guaranteed to be sufficient to identify $x$ unambiguously.
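As a check on this arithmetic, the joint entropy can be computed from the four probabilities in the association matrix using base R alone (the variable name below is illustrative, not part of any package):

dieJointP <- c(2, 1, 1, 2) / 6   # joint probabilities from the table above
-sum(dieJointP * log2(dieJointP))  # joint entropy: ~1.92 bits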
The mutual information between $S_1$ and $S_2$ describes how much knowledge of $S_1$ reduces our uncertainty about $S_2$ (or vice versa): if we learn from $S_1$ that the roll is even, we become a little more confident that it is also greater than 3 ($S_2$).
The mutual information $I(S_1;S_2)$ corresponds to the sum of the individual entropies, minus the joint entropy:

$$I(S_1;S_2) = H(S_1) + H(S_2) - H(S_1,S_2)$$

The entropy distance, also termed the variation of information (Meilă, 2007), is the information that $S_1$ and $S_2$ do not have in common:

$$H_D(S_1,S_2) = H(S_1,S_2) - I(S_1;S_2) = 2H(S_1,S_2) - H(S_1) - H(S_2)$$

Now let a split $S$ divide $n$ leaves into two partitions, $A$ and $B$. The probability that a randomly chosen leaf $x$ belongs to partition $k$ is $P(x \in k) = \frac{\lvert k \rvert}{n}$; $S$ thus corresponds to a random variable with entropy $H(S) = -\frac{\lvert A \rvert}{n}\log_2{\frac{\lvert A \rvert}{n}} - \frac{\lvert B \rvert}{n}\log_2{\frac{\lvert B \rvert}{n}}$ (Meilă, 2007).
The entropy of a split corresponds to the average number of bits necessary to transmit the partition label for a given leaf.
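As an illustration, the split AB|CDEF divides six leaves into partitions of two and four leaves, giving an entropy of roughly 0.92 bits; the even split ABC|DEF attains the maximum of 1 bit. `SplitEntropySketch()` below is a hypothetical helper written for this example, not a 'TreeDist' function.

SplitEntropySketch <- function(sizes) {
  # Entropy, in bits, of a split whose partitions have the given sizes
  p <- sizes / sum(sizes)
  -sum(p * log2(p))
}
SplitEntropySketch(c(2, 4))  # AB|CDEF: ~0.92 bits
SplitEntropySketch(c(3, 3))  # ABC|DEF: 1 bit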
The joint entropy of two splits, $S_1$ and $S_2$, corresponds to the entropy of the association matrix of probabilities that a randomly selected leaf belongs to each pair of partitions:

| | $S_1$: $x \in A_1$ | $S_1$: $x \in B_1$ |
|---|---|---|
| $S_2$: $x \in A_2$ | $P(A_1, A_2) = \frac{\lvert A_1 \cap A_2 \rvert}{n}$ | $P(B_1, A_2) = \frac{\lvert B_1 \cap A_2 \rvert}{n}$ |
| $S_2$: $x \in B_2$ | $P(A_1, B_2) = \frac{\lvert A_1 \cap B_2 \rvert}{n}$ | $P(B_1, B_2) = \frac{\lvert B_1 \cap B_2 \rvert}{n}$ |

$$H(S_1, S_2) = -P(A_1, A_2)\log_2{P(A_1, A_2)} - P(B_1, A_2)\log_2{P(B_1, A_2)} - P(A_1, B_2)\log_2{P(A_1, B_2)} - P(B_1, B_2)\log_2{P(B_1, B_2)}$$
These values can then be substituted into the definitions of mutual information and entropy distance given above.
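To make that substitution concrete, the base-R sketch below computes these quantities for the splits $S_1 = \textrm{AB|CDEF}$ and $S_2 = \textrm{ABC|DEF}$ considered earlier, whose association matrix contains the counts $\lvert A_1 \cap A_2 \rvert = 2$, $\lvert B_1 \cap A_2 \rvert = 1$, $\lvert A_1 \cap B_2 \rvert = 0$ and $\lvert B_1 \cap B_2 \rvert = 3$ of $n = 6$ leaves. The variable and helper names are illustrative only, not 'TreeDist' functions.

# Association matrix for S1 = AB|CDEF (rows) and S2 = ABC|DEF (columns)
splitJointP <- matrix(c(2, 0,   # A1 & A2, A1 & B2
                        1, 3),  # B1 & A2, B1 & B2
                      nrow = 2, byrow = TRUE) / 6

EntropyBits <- function(p) -sum(p[p > 0] * log2(p[p > 0]))  # zero cells contribute nothing

entropy1 <- EntropyBits(rowSums(splitJointP))  # H(S1): ~0.92 bits
entropy2 <- EntropyBits(colSums(splitJointP))  # H(S2): 1 bit
jointEntropy <- EntropyBits(splitJointP)       # H(S1, S2): ~1.46 bits
mutualInfo <- entropy1 + entropy2 - jointEntropy  # I(S1; S2): ~0.46 bits
jointEntropy - mutualInfo                      # entropy distance: 1 bit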
As $S_1$ and $S_2$ become more different, the disposition of $S_1$ gives less information about the configuration of $S_2$, and the mutual information decreases accordingly.
MacKay, D. J. C. (2003). Information theory, inference, and learning algorithms. Cambridge: Cambridge University Press. Retrieved from https://www.inference.org.uk/itprnn/book.pdf
Meilă, M. (2007). Comparing clusterings—an information based distance. Journal of Multivariate Analysis, 98(5), 873–895. doi:10.1016/j.jmva.2006.11.013
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656.
Steel, M. A., & Penny, D. (2006). Maximum parsimony and the phylogenetic information in multistate characters. In V. A. Albert (Ed.), Parsimony, phylogeny, and genomics (pp. 163–178). Oxford: Oxford University Press.