Modelling the number of reverse dependencies

2020-07-18

In the introduction we have seen that reverse dependencies of all CRAN packages seem to follow the power law. This echoes the phenomenon in other software dependencies observed by, for example, LaBelle and Wallingford, 2004, Baxter et al., 2006, Jenkins and Kirk, 2007, Wu et al. 2007, Louridas et al., 2008, Zheng et al., 2008, Kohring, 2009, Li et al., 2013, Bavota et al. 2015, and Cox et al., 2015. In this vignette, we will fit the discrete power law to model this number of reverse dependencies, using functions with the suffix upp. While we shall focus on “Depends”, which is one of the serveral kinds of dependencies in R, the same analysis can be carried out for all other types.

library(crandep)
library(igraph)
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:igraph':
## 
##     as_data_frame, groups, union
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
data(cran_dependencies)

Obtaining the reverse dependencies

We continue with the dependency network g0.depends in the introduction. As we are using the forward dependencies to construct the network, the number of reverse dependencies is equivalent to the in-degree of a package, which can be obtained via the function igraph::degree(). We then construct a data frame to hold this in-degree information.

g0.depends <- cran_dependencies %>%
    dplyr::filter(type == "depends" & !reverse) %>%
    df_to_graph(nodelist = dplyr::rename(cran_dependencies, name = from))
d0.depends <- g0.depends %>% igraph::degree(mode = "in")
df0.depends <-
    data.frame(name = names(d0.depends), degree = as.integer(d0.depends)) %>%
    dplyr::arrange(dplyr::desc(degree), name)
head(df0.depends, 10)
##        name degree
## 1      MASS    430
## 2   ggplot2    309
## 3    Matrix    250
## 4  survival    227
## 5   mvtnorm    192
## 6      Rcpp    184
## 7   lattice    160
## 8        sp    126
## 9    igraph    104
## 10  foreach     85

For the purpose of verification, we use the reverse dependencies to construct the network, then look at the out-degrees of the packages this time.

g0.rev_depends <- cran_dependencies %>%
    dplyr::filter(type == "depends" & reverse) %>% # note the difference to above
    df_to_graph(nodelist = dplyr::rename(cran_dependencies, name = from))
d0.rev_depends <- g0.rev_depends %>% igraph::degree(mode = "out") # note the difference to above
df0.rev_depends <-
    data.frame(name = names(d0.rev_depends), degree = as.integer(d0.rev_depends)) %>%
    dplyr::arrange(dplyr::desc(degree), name)
head(df0.rev_depends, 10)
##        name degree
## 1      MASS    430
## 2   ggplot2    309
## 3    Matrix    250
## 4  survival    227
## 5   mvtnorm    192
## 6      Rcpp    184
## 7   lattice    160
## 8        sp    126
## 9    igraph    104
## 10  foreach     85
identical(df0.depends, df0.rev_depends) # expected TRUE
## [1] TRUE

Exploratory analysis and selecting the threshold

We construct a data frame for the empirical frequencies and survival function at the whole range of data.

df1.depends <- df0.depends %>%
    dplyr::filter(degree > 0L) %>% # to prevent warning when plotting on log-log scale
    dplyr::count(degree, name = "frequency") %>%
    dplyr::arrange(dplyr::desc(degree)) %>%
    dplyr::mutate(survival = cumsum(frequency)/sum(frequency))

Before fitting the discrete power law, to be determined first is a threshold above which it is appropriate. We will visualise the degree distribution to determine such threshold.

gg0 <- df1.depends %>%
    ggplot2::ggplot() +
    ggplot2::geom_point(aes(degree, frequency), size = 0.75) +
    ggplot2::scale_x_log10() +
    ggplot2::scale_y_log10() +
    ggplot2::coord_cartesian(ylim = c(1L, 1e+3L)) +
    ggplot2::theme_bw(12)
gg0

The power law seems appropriate for the whole range of data, and so, for illustration purposes, the threshold will be set at 1 inclusive. 0’s will be excluded anyway because, as we will see, the probability mass function (PMF) is not well-defined at 0.

While it is straightforward to determine the threshold here, such linearity over the whole range might not be seen for other data. The package poweRlaw provides functions for and references to more systematic/objective procedures of selecting the threshold.

Fitting the discrete power law

We use the function mcmc_upp() to fit the discrete power law, of which the PMF is proportional to \(x^{-\alpha}\), where \(\alpha\) is the lone scalar parameter. Here we will use the parameter \(\xi_1=1/(\alpha-1)\) to align with the parameterisation of mcmc_mix() and other distributions in extreme value theory, which is an extension of the power law.

The Bayesian approach is used here for inference, meaning that a prior has to be set for the parameters. We assume a uniform distribution \(U(a_{\xi_1}=0, b_{\xi_1}=100)\) for \(\xi_1\). Markov chain Monte Carlo (MCMC) is used as the inference algorithm.

x <- dplyr::filter(df0.depends, degree > 0L)$degree # data
u <- 1L # threshold
xi1 <- 1.0 # initial value
a_xi1 <- 0.0 # lower bound of uniform distribution
b_xi1 <- 100.0 # upper bound of uniform distribution
set.seed(3075L)
mcmc0.depends <- mcmc_upp(x = x, u = u, xi1 = xi1, a_xi1 = a_xi1, b_xi1 = b_xi1) # takes seconds

Now we have the samples representing the posterior distribution of \(\xi_1\):

mcmc0.depends %>%
    ggplot2::ggplot() +
    ggplot2::geom_density(aes(xi1)) +
    ggplot2::theme_bw(12)

We can obtain the resulting posterior of \(\alpha=1/\xi_1+1\).

mcmc0.depends <- mcmc0.depends %>%
    dplyr::mutate(alpha = 1.0 / xi1 + 1.0)
mcmc0.depends %>%
    ggplot2::ggplot() +
    ggplot2::geom_density(aes(alpha)) +
    ggplot2::theme_bw(12)

This means the number of reverse “Depends” follows approximately a power law with exponent 1.77. We can also calculate the fitted frequencies and survival function, using dupp() and Supp() respectively.

n0 <- length(x) ## or sum(df1.depends$frequency)
freq0 <- n0 * dupp(x = df1.depends$degree, u = 1L, xi1 = mean(mcmc0.depends$xi1))
surv0 <- Supp(x = df1.depends$degree, u = 1L, xi1 = mean(mcmc0.depends$xi1))
df1.depends <- df1.depends %>%
    dplyr::mutate(frequency.fitted = freq0, survival.fitted = surv0)

Finally, we overlay the fitted line on the plot above to check goodness-of-fit:

gg1 <- df1.depends %>%
    ggplot2::ggplot() +
    ggplot2::geom_point(aes(degree, frequency), size = 0.75) +
    ggplot2::geom_line(aes(degree, frequency.fitted), col = 2, lty = 2) +
    ggplot2::scale_x_log10() +
    ggplot2::scale_y_log10() +
    ggplot2::coord_cartesian(ylim = c(1L, 1e+3L)) +
    ggplot2::theme_bw(12)
gg1

The corresponding plot using the survival function can also be obtained:

gg2 <- df1.depends %>%
    ggplot2::ggplot() +
    ggplot2::geom_point(aes(degree, survival), size = 0.75) +
    ggplot2::geom_line(aes(degree, survival.fitted), col = 2, lty = 2) +
    ggplot2::scale_x_log10() +
    ggplot2::scale_y_log10() +
    ggplot2::theme_bw(12)
gg2

This shows the discrete power law doesn’t fit as good as it seems according to the frequency plot. Potential improvements include using a higher threshold, which inevitably means throwing away some data, or using a more flexible distribution, such as those in extreme value theory.