The Poisson binomial distribution (in the following abbreviated as PBD) is becoming increasingly important, especially in the areas of statistics, finance, insurance mathematics and quality management. This package provides functions for two types of PBDs: ordinary and generalized PBDs (henceforth referred to as O-PBDs and G-PBDs).
The O-PBD is the distribution of the sum of a number \(n\) of independent Bernoulli-distributed random indicators \(X_i \in \{0, 1\}\) \((i = 1, ..., n)\): \[X := \sum_{i = 1}^{n}{X_i}.\] Each of the \(X_i\) possesses a predefined probability of success \(p_i := P(X_i = 1)\) (subsequently \(P(X_i = 0) = 1 - p_i =: q_i\)). With this, mean, variance and skewness can be expressed as \[E(X) = \sum_{i = 1}^{n}{p_i} \quad \quad Var(X) = \sum_{i = 1}^{n}{p_i q_i} \quad \quad Skew(X) = \frac{\sum_{i = 1}^{n}{p_i q_i(q_i - p_i)}}{\sqrt{Var(X)}^3}.\] All possible observations are in \(\{0, ..., n\}\).
The G-PBD is defined very similar. Again, it is the distribution of a sum random variables, but here, each \(X_i \in \{u_i, v_i\}\) with \(P(X_i = u_i) =: p_i\) and \(P(X_i = v_i) = 1 - p_i =: q_i\). Using ordinary Bernoulli-distributed random variables \(Y_i\), \(X_i\) can be expressed as \(X_i = u_i Y_i + v_i(1 - Y_i) = v_i + Y_i \cdot (u_i - v_i)\). As a result, mean, variance and skewness are given by \[E(X) = \sum_{i = 1}^{n}{v_i} + \sum_{i = 1}^{n}{p_i (u_i - v_i)} \quad \quad Var(X) = \sum_{i = 1}^{n}{p_i q_i(u_i - v_i)^2} \\Skew(X) = \frac{\sum_{i = 1}^{n}{p_i q_i(q_i - p_i)(u_i - v_i)^3}}{\sqrt{Var(X)}^3}.\] All possible observations are in \(\{U, ..., V\}\) with \(U := \sum_{i = 1}^{n}{\min\{u_i, v_i\}}\) and \(V := \sum_{i = 1}^{n}{\max\{u_i, v_i\}}\). Note that the size \(m := V - U\) of the distribution does not generally equal \(n\)!
Computing these distributions exactly is computationally demanding, but in the last few years, some efficient algorithms have been developed. Particularly significant in this respect are the works of Hong (2013), who derived the DFT-CF procedure for O-PBDs, Biscarri, Zhao & Brunner (2018) who developed two immensely faster algorithms for O-PBDs, namely the DC and DC-FFT procedures, and Zhang, Hong and Balakrishnan (2018) who further developed Hong’s (2013) DFT-CF algorithm for G-PBDs (in the following, this generalized procedure is referred to as G-DFT-CF). Still, only a few R packages exist for the calculation of either ordinary and generalized PBDs, e.g. poibin
and poisbinom
for O-PBDs and GPB
for G-PDBs. Before the release of this PoissonBinomial
package, there has been no R package that implemented the DC and DC-FFT algorithms of Biscarri, Zhao & Brunner (2018), as they only published a reference implementation for R, but refrained from releasing it as a package. Additionally, there are no comparable approaches for G-PBDs to date.
The poibin
package implements the DFT-CF algorithm along with the exact recursive method of Barlow & Heidtmann (1984) and Normal and Poisson approximations. However, both exact procedures of this package possess some disadvantages, i.e. they are relatively slow at computing very large distributions, with the recursive algorithm being also very memory consuming. The G-DFT-CF procedure is implemented in the GPB
package and inherits this performance drawback. The poisbinom
package provides a more efficient and much faster DFT-CF implementation. The performance improvement over the poibin
package lies in the use of the FFTW C library. Unfortunately, it sometimes yields some negative probabilities in the tail regions, especially for large distributions. However, this numerical issue has not been addressed to date. This PoissonBinomial
also utilizes FFTW for both DFT-CF and G-DFT-CF algorithms, but corrects that issue. In addition to the disadvantages regarding computational speed (poibin
and GPB
) or numerics (poisbinom
), especially for very large distributions, the aforementioned packages do not provide headers for their internal C/C++ functions, so that they cannot be imported directly by C or C++ code of other packages that use for example Rcpp
.
In some situations, people might have to deal with Poisson binomial distributions that include Bernoulli variables with \(p_i \in \{0, 1\}\). Calculation performance can be further optimized by handling these indicators before the actual computations. Approximations also benefit from this in terms of accuracy. None of the aforementioned packages implements such optimizations. Therefore, the advantages of this PoissonBinomial
package can be summarized as follows:
In total, this package includes 10 different algorithms of computing ordinary Poisson binomial distributions, including optimized versions of the Normal, Refined Normal and Poisson approaches, and 5 approaches for generalized PBDs. In addition, the implementation of the exact recursive procedure for O-PBDs was rewritten so that it is considerably less memory intensive: the poibin
implementation needs the memory equivalent of \((n + 1)^2\) values of type double
, while ours only needs \(3 \cdot (n + 1)\).
In this package implements the following exact algorithms for computing ordinary Poisson binomial distributions:
For generalized Poisson binomial distributions, this package provides:
Examples and performance comparisons of these procedures are presented in a separate vignette.
In addition, the following O-PBD approximation methods are included:
For G-PBDs, there are
Examples and performance comparisons of these approaches are provided in a separate vignette as well.
Handling special cases, such as ordinary binomial distributions, zeros and ones is useful to speed up performance. Unfortunately, some approximations do not work well for Bernoulli trials with \(p_i \in \{0, 1\}\), e.g. the Geometric Mean Binomial Approximations. This is why handling these values before the actual computation of the distribution is not only a performance tweak, but sometimes even a necessity. It is achieved by some simple preliminary considerations.
These cases are illustrated in the following example:
# Case 1
dpbinom(NULL, rep(0.3, 7))
#> [1] 0.0823543 0.2470629 0.3176523 0.2268945 0.0972405 0.0250047 0.0035721
#> [8] 0.0002187
dbinom(0:7, 7, 0.3)
#> [1] 0.0823543 0.2470629 0.3176523 0.2268945 0.0972405 0.0250047 0.0035721
#> [8] 0.0002187
# equal results
dpbinom(NULL, c(0, 0, 0, 0, 0, 0, 0))
#> [1] 1 0 0 0 0 0 0 0
dpbinom(NULL, c(1, 1, 1, 1, 1, 1, 1))
#> [1] 0 0 0 0 0 0 0 1
# Case 2
dpbinom(NULL, c(0, 0, 0, 0, 1, 1, 1))
#> [1] 0 0 0 1 0 0 0 0
# Case 3
dpbinom(NULL, c(0, 0, 0.1, 0.2, 0.4, 0.8, 1))
#> [1] 0.0000 0.0864 0.4344 0.3784 0.0944 0.0064 0.0000 0.0000
dpbinom(NULL, c(0, 0, 0.4, 1))
#> [1] 0.0 0.6 0.4 0.0 0.0
These cases are illustrated in the following example:
set.seed(1)
pp <- runif(7)
va <- sample(0:6, 7, TRUE)
vb <- sample(0:6, 7, TRUE)
# Case 1
dgpbinom(NULL, pp, rep(1, 7), rep(0, 7))
#> [1] 8.114776e-05 3.112722e-03 4.063146e-02 2.115237e-01 3.793308e-01
#> [6] 2.735489e-01 8.297278e-02 8.798512e-03
dpbinom(NULL, pp)
#> [1] 8.114776e-05 3.112722e-03 4.063146e-02 2.115237e-01 3.793308e-01
#> [6] 2.735489e-01 8.297278e-02 8.798512e-03
# equal results
dgpbinom(NULL, pp, rep(0, 7), rep(1, 7))
#> [1] 8.798512e-03 8.297278e-02 2.735489e-01 3.793308e-01 2.115237e-01
#> [6] 4.063146e-02 3.112722e-03 8.114776e-05
dpbinom(NULL, 1 - pp)
#> [1] 8.798512e-03 8.297278e-02 2.735489e-01 3.793308e-01 2.115237e-01
#> [6] 4.063146e-02 3.112722e-03 8.114776e-05
# equal results
dgpbinom(NULL, pp, c(rep(1, 3), rep(0, 4)), c(rep(0, 3), rep(1, 4)))
#> [1] 3.062225e-02 1.998504e-01 3.769239e-01 2.828424e-01 9.450797e-02
#> [6] 1.426764e-02 9.620692e-04 2.331571e-05
dpbinom(NULL, c(pp[1:3], 1 - pp[4:7])) # reorder for 0 and 1
#> [1] 3.062225e-02 1.998504e-01 3.769239e-01 2.828424e-01 9.450797e-02
#> [6] 1.426764e-02 9.620692e-04 2.331571e-05
# equal results
# Case 2 a)
dgpbinom(NULL, rep(0, 7), rep(4, 7), rep(2, 7))
#> [1] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
dgpbinom(7 * 2, rep(0, 7), rep(4, 7), rep(2, 7))
#> [1] 1
# equal results
# Case 2 b)
dgpbinom(NULL, rep(1, 7), rep(4, 7), rep(2, 7))
#> [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
dgpbinom(7 * 4, rep(1, 7), rep(4, 7), rep(2, 7))
#> [1] 1
# equal results
# Case 2 c)
dgpbinom(NULL, rep(0.3, 7), rep(4, 7), rep(2, 7))
#> [1] 0.0823543 0.0000000 0.2470629 0.0000000 0.3176523 0.0000000 0.2268945
#> [8] 0.0000000 0.0972405 0.0000000 0.0250047 0.0000000 0.0035721 0.0000000
#> [15] 0.0002187
dbinom(0:7, 7, 0.3)
#> [1] 0.0823543 0.2470629 0.3176523 0.2268945 0.0972405 0.0250047 0.0035721
#> [8] 0.0002187
# equal results, but on different support set
# Case 2 d)
dgpbinom(NULL, pp, rep(4, 7), rep(2, 7))
#> [1] 8.114776e-05 0.000000e+00 3.112722e-03 0.000000e+00 4.063146e-02
#> [6] 0.000000e+00 2.115237e-01 0.000000e+00 3.793308e-01 0.000000e+00
#> [11] 2.735489e-01 0.000000e+00 8.297278e-02 0.000000e+00 8.798512e-03
dpbinom(NULL, pp)
#> [1] 8.114776e-05 3.112722e-03 4.063146e-02 2.115237e-01 3.793308e-01
#> [6] 2.735489e-01 8.297278e-02 8.798512e-03
# equal results, but on different support set
# Case 3 a)
dgpbinom(NULL, c(0, 0, 0, 0, 0, 0, 0), va, vb)
#> [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
dgpbinom(sum(vb), rep(0, 7), va, vb)
#> [1] 1
# equal results
# Case 3 b)
dgpbinom(NULL, c(1, 1, 1, 1, 1, 1, 1), va, vb)
#> [1] 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
dgpbinom(sum(va), rep(1, 7), va, vb)
#> [1] 1
# equal results
# Case 3 c)
dgpbinom(NULL, c(0, 0, 0, 1, 1, 1, 1), va, vb)
#> [1] 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
dgpbinom(sum(va[4:7], vb[1:3]), c(0, 0, 0, 1, 1, 1, 1), va, vb)
#> [1] 1
# equal results
# Case 4
dgpbinom(NULL, c(0, 0, 0.3, 0.6, 1, 1, 1), va, vb)
#> [1] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.18 0.00 0.00 0.12 0.42 0.00 0.00 0.28
#> [16] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sure <- sum(va[5:7], vb[1:2])
x.transf <- sum(pmin(va[3:4], vb[3:4])):sum(pmax(va[3:4], vb[3:4]))
dgpbinom(sure + x.transf, c(0, 0, 0.3, 0.6, 1, 1, 1), va, vb)
#> [1] 0.18 0.00 0.00 0.12 0.42 0.00 0.00 0.28
dgpbinom(x.transf, c(0.3, 0.6), va[3:4], vb[3:4])
#> [1] 0.18 0.00 0.00 0.12 0.42 0.00 0.00 0.28
# equal results
dgpbinom(NULL, c(0, 0, 0, 0.6, 1, 1, 1), va, vb)
#> [1] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.0 0.0 0.4 0.0 0.0 0.0 0.0
#> [20] 0.0 0.0 0.0 0.0 0.0 0.0
sure <- sum(va[5:7], vb[1:3])
x.transf <- va[4]:vb[4]
dgpbinom(sure + x.transf, c(0, 0, 0, 0.6, 1, 1, 1), va, vb)
#> [1] 0.6 0.0 0.0 0.4
dgpbinom(x.transf, 0.6, va[4], vb[4])
#> [1] 0.6 0.0 0.0 0.4
# equal results; essentially transformed Bernoulli
How to import and use the internal C++ functions in Rcpp
based packages is described in a separate vignette.