This vignette provides an introduction to the functions facilitating the analysis of the dependencies of CRAN packages.
To obtain information about the various kinds of dependencies of a package, we can use the function get_dep_all(), which takes the package name and the type of dependencies as the first and second arguments, respectively. Currently, the second argument accepts Depends, Imports, LinkingTo, Suggests, Reverse_depends, Reverse_imports, Reverse_linking_to, and Reverse_suggests, or any variation of these in letter case.
get_dep_all("dplyr", "Imports")
#> [1] "ellipsis" "generics" "glue" "lifecycle" "magrittr"
#> [6] "methods" "R6" "rlang" "tibble" "tidyselect"
#> [11] "utils" "vctrs"
get_dep_all("MASS", "depends")
#> [1] "grDevices" "graphics" "stats" "utils"
get_dep_all("MASS", "dePends") # should give same result
#> [1] "grDevices" "graphics" "stats" "utils"
Imports and Depends are the most common types of dependencies in R packages, but there are other types such as Suggests. For more information on different types of dependencies, see the official guidelines and http://r-pkgs.had.co.nz/description.html.
As the information on all dependencies of one package is on the same page on CRAN, we can use get_dep_df() instead of get_dep_all() to avoid scraping the same page multiple times. The output will be a data frame instead of a character vector.
get_dep_df("dplyr", c("imports", "LinkingTo"))
#> from to type reverse
#> 1 dplyr ellipsis imports FALSE
#> 2 dplyr generics imports FALSE
#> 3 dplyr glue imports FALSE
#> 4 dplyr lifecycle imports FALSE
#> 5 dplyr magrittr imports FALSE
#> 6 dplyr methods imports FALSE
#> 7 dplyr R6 imports FALSE
#> 8 dplyr rlang imports FALSE
#> 9 dplyr tibble imports FALSE
#> 10 dplyr tidyselect imports FALSE
#> 11 dplyr utils imports FALSE
#> 12 dplyr vctrs imports FALSE
The column type is the type of the dependency converted to lower case. Also, LinkingTo is now converted to linking_to for consistency. For the four reverse dependencies, the substring "reverse_" will not be shown in type; instead the reverse column will be TRUE. This can be illustrated by the following:
get_dep_all("abc", "depends")
#> [1] "abc.data" "nnet" "quantreg" "MASS" "locfit"
get_dep_all("abc", "reverse_depends")
#> [1] "abctools" "EasyABC"
get_dep_df("abc", c("depends", "reverse_depends"))
#> from to type reverse
#> 1 abc abc.data depends FALSE
#> 2 abc nnet depends FALSE
#> 3 abc quantreg depends FALSE
#> 4 abc MASS depends FALSE
#> 5 abc locfit depends FALSE
#> 6 abc abctools depends TRUE
#> 7 abc EasyABC depends TRUE
Theoretically, for each forward dependency
#> from to type reverse
#> 1 A B c FALSE
there should be an equivalent reverse dependency
#> from to type reverse
#> 1 B A c TRUE
Aligning the type in the forward and reverse dependencies enables this to be checked easily.
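The check described above can be sketched in base R on a toy edge list. The package names A, B and C and the helpers flip_reverse() and key() are made up for illustration and are not part of the package; only the four-column layout is taken from the output of get_dep_df().

```r
# Toy edge list in the same four-column layout as the output of get_dep_df().
# The package names A, B and C are hypothetical.
edges <- data.frame(
  from    = c("A", "A", "B", "C"),
  to      = c("B", "C", "A", "A"),
  type    = c("imports", "imports", "imports", "imports"),
  reverse = c(FALSE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn reverse rows into the forward rows they imply
# by swapping the from and to columns.
flip_reverse <- function(df) {
  data.frame(
    from = df$to,
    to   = df$from,
    type = df$type,
    stringsAsFactors = FALSE
  )
}

forward <- edges[!edges$reverse, c("from", "to", "type")]
implied <- flip_reverse(edges[edges$reverse, ])

# Hypothetical helper: one sorted string per edge, so that the comparison
# does not depend on row order.
key <- function(df) sort(paste(df$from, df$to, df$type, sep = "|"))

all_matched <- identical(key(forward), key(implied))
all_matched
```

Because type is stored identically for both directions, the comparison only needs the swap of from and to; no recoding of the type labels is required.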
To build a dependency network, we have to obtain the dependencies for multiple packages. For illustration, we choose the core packages of the tidyverse, and find out what each package Imports. We put all the dependencies into one data frame, in which the package in the from column imports the package in the to column. This is essentially the edge list of the dependency network.
df0.imports <- rbind(
  get_dep_df("ggplot2", "Imports"),
  get_dep_df("dplyr", "Imports"),
  get_dep_df("tidyr", "Imports"),
  get_dep_df("readr", "Imports"),
  get_dep_df("purrr", "Imports"),
  get_dep_df("tibble", "Imports"),
  get_dep_df("stringr", "Imports"),
  get_dep_df("forcats", "Imports")
)
head(df0.imports)
#> from to type reverse
#> 1 ggplot2 digest imports FALSE
#> 2 ggplot2 glue imports FALSE
#> 3 ggplot2 grDevices imports FALSE
#> 4 ggplot2 grid imports FALSE
#> 5 ggplot2 gtable imports FALSE
#> 6 ggplot2 isoband imports FALSE
tail(df0.imports)
#> from to type reverse
#> 61 stringr magrittr imports FALSE
#> 62 stringr stringi imports FALSE
#> 63 forcats ellipsis imports FALSE
#> 64 forcats magrittr imports FALSE
#> 65 forcats rlang imports FALSE
#> 66 forcats tibble imports FALSE
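For a longer list of packages, the repeated calls can be collapsed into a loop. The following is a minimal sketch; the helper collect_deps() and the stand-in fake_fetch() are hypothetical names introduced here, and the sketch assumes only that get_dep_df() takes the package name and the type as its first two arguments, as in the calls above. The stand-in is used so that the example runs without scraping CRAN.

```r
# Hypothetical helper: apply a fetching function to each package name and
# stack the resulting data frames into one edge list.
collect_deps <- function(pkgs, type, fetch = crandep::get_dep_df) {
  do.call(rbind, lapply(pkgs, fetch, type))
}

# Hypothetical stand-in for get_dep_df(), returning made-up dependencies
# in the same four-column layout.
fake_fetch <- function(name, type) {
  data.frame(
    from    = name,
    to      = c("rlang", "glue"),
    type    = tolower(type),
    reverse = FALSE,
    stringsAsFactors = FALSE
  )
}

demo <- collect_deps(c("pkgA", "pkgB"), "Imports", fetch = fake_fetch)
demo

# With the real function, assuming the package is installed and CRAN is
# reachable, the call would look like:
# core <- c("ggplot2", "dplyr", "tidyr", "readr",
#           "purrr", "tibble", "stringr", "forcats")
# df0.imports <- collect_deps(core, "Imports")
```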
With the help of the ‘igraph’ package, we can use this data frame to build a graph object that represents the dependency network.
g0.imports <- igraph::graph_from_data_frame(df0.imports)
set.seed(1457L)
old.par <- par(mar = rep(0.0, 4))
plot(g0.imports, vertex.label.cex = 1.5)
par(old.par)
The nature of a dependency network makes it a directed acyclic graph (DAG). We can use the ‘igraph’ function is_dag() to check.
Note that this applies only to Imports (and Depends), due to their nature. The acyclic property does not hold for a network of, for example, Suggests.
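As a small illustration of the check, the sketch below builds two toy graphs with ‘igraph’ (assumed to be installed): one acyclic, and one in which an extra edge closes a cycle, as can happen when packages suggest each other. The package names A, B and C are made up.

```r
# Toy edge list without cycles: A -> B, A -> C, B -> C.
acyclic <- data.frame(
  from = c("A", "A", "B"),
  to   = c("B", "C", "C"),
  stringsAsFactors = FALSE
)

# Adding C -> A closes the cycle A -> B -> C -> A.
cyclic <- rbind(
  acyclic,
  data.frame(from = "C", to = "A", stringsAsFactors = FALSE)
)

dag_ok  <- igraph::is_dag(igraph::graph_from_data_frame(acyclic))
dag_bad <- igraph::is_dag(igraph::graph_from_data_frame(cyclic))
c(acyclic = dag_ok, cyclic = dag_bad)

# For the graph built earlier in this vignette, the call would be
# igraph::is_dag(g0.imports)
```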
It is possible to set a boundary on the nodes to which the edges are directed, using the function df_to_graph(). The second argument takes in a data frame that contains the list of such nodes in the column name.
df0.nodes <- data.frame(name = c("ggplot2", "dplyr", "tidyr", "readr", "purrr", "tibble", "stringr", "forcats"), stringsAsFactors = FALSE)
g0.core <- df_to_graph(df0.imports, df0.nodes)
set.seed(259L)
old.par <- par(mar = rep(0.0, 4))
plot(g0.core, vertex.label.cex = 1.5)
par(old.par)
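Conceptually, setting such a boundary amounts to keeping only the edges that point to a listed node. The base R sketch below shows that idea on a made-up edge list; it is not the implementation of df_to_graph(), which may do more than this filtering.

```r
# Made-up edge list and node list, for illustration only.
edges <- data.frame(
  from = c("dplyr", "dplyr", "tidyr", "tidyr"),
  to   = c("tibble", "rlang", "dplyr", "vctrs"),
  stringsAsFactors = FALSE
)
nodes <- data.frame(
  name = c("dplyr", "tidyr", "tibble"),
  stringsAsFactors = FALSE
)

# Keep only the edges directed to a node inside the boundary.
bounded <- edges[edges$to %in% nodes$name, ]
bounded
```

Here the edges to rlang and vctrs are dropped because neither appears in the name column.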
Since networks according to Imports or Depends are DAGs, we can obtain the topological ordering using, for example, Kahn’s (1962) sorting algorithm.
topo_sort_kahn(g0.core)
#> id id_num
#> 1 forcats 1
#> 2 ggplot2 2
#> 3 readr 3
#> 4 tidyr 4
#> 5 dplyr 5
#> 6 purrr 6
#> 7 tibble 7
In the topological ordering, represented by the column id_num, a low (high) number represents being at the front (back) of the ordering. If package A Imports package B, i.e. there is a directed edge from A to B, then A will be topologically before B. As the package ‘tibble’ doesn’t import any of the other packages in this network but is imported by most of them, it naturally goes to the back of the ordering. This ordering may not be unique for a DAG, and other admissible orderings can be obtained by setting random = TRUE in the function:
set.seed(387L); topo_sort_kahn(g0.core, random = TRUE)
#> id id_num
#> 1 ggplot2 1
#> 2 readr 2
#> 3 forcats 3
#> 4 tidyr 4
#> 5 purrr 5
#> 6 dplyr 6
#> 7 tibble 7
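To make the idea behind the ordering concrete, here is a minimal base R sketch of Kahn's algorithm on a made-up edge list. The function kahn_order() is a hypothetical name and is not how topo_sort_kahn() is implemented; it only mirrors the layout of its output (columns id and id_num) and the convention that an importing package comes before the package it imports.

```r
# Hypothetical helper: Kahn's algorithm on an edge list with columns
# from and to. Stops with an error if the graph contains a cycle.
kahn_order <- function(edges) {
  from  <- as.character(edges$from)
  to    <- as.character(edges$to)
  nodes <- unique(c(from, to))

  # In-degree: number of edges pointing at each node.
  indeg  <- setNames(integer(length(nodes)), nodes)
  counts <- table(to)
  indeg[names(counts)] <- as.integer(counts)

  # Start with the nodes that no edge points to.
  queue    <- nodes[indeg == 0L]
  ordering <- character(0)

  while (length(queue) > 0L) {
    current  <- queue[1L]
    queue    <- queue[-1L]
    ordering <- c(ordering, current)
    # Removing `current` lowers the in-degree of every node it points to.
    for (target in to[from == current]) {
      indeg[target] <- indeg[target] - 1L
      if (indeg[target] == 0L) queue <- c(queue, target)
    }
  }

  if (length(ordering) < length(nodes)) stop("the graph contains a cycle")
  data.frame(id = ordering, id_num = seq_along(ordering),
             stringsAsFactors = FALSE)
}

# Made-up edges among a few package names.
toy <- data.frame(
  from = c("dplyr", "tidyr", "tidyr", "ggplot2"),
  to   = c("tibble", "dplyr", "tibble", "tibble"),
  stringsAsFactors = FALSE
)
ord <- kahn_order(toy)
ord
```

Taking the next node from the queue at random instead of always taking the first is one way other admissible orderings could arise.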
We can also apply the topological sorting to the bigger dependency network.
Ultimately, we can use get_dep_df() to obtain all dependencies of all packages available on CRAN. This package provides an example dataset cran_dependencies that contains all such dependencies as of 2020-05-09.
data(cran_dependencies)
cran_dependencies
#> # A tibble: 211,381 x 4
#> from to type reverse
#> <chr> <chr> <chr> <lgl>
#> 1 A3 xtable depends FALSE
#> 2 A3 pbapply depends FALSE
#> 3 A3 randomForest suggests FALSE
#> 4 A3 e1071 suggests FALSE
#> 5 aaSEA DT imports FALSE
#> 6 aaSEA networkD3 imports FALSE
#> 7 aaSEA shiny imports FALSE
#> 8 aaSEA shinydashboard imports FALSE
#> 9 aaSEA magrittr imports FALSE
#> 10 aaSEA Bios2cor imports FALSE
#> # … with 211,371 more rows
We can build the dependency network in the same way as above. Furthermore, we can verify that the forward and reverse dependency networks are (almost) the same.
# The pipe %>% comes from 'magrittr' (re-exported by 'dplyr'); load one of them first.
g0.depends <- cran_dependencies %>%
  dplyr::filter(type == "depends" & !reverse) %>%
  df_to_graph(nodelist = dplyr::rename(cran_dependencies, name = from))
g0.rev_depends <- cran_dependencies %>%
  dplyr::filter(type == "depends" & reverse) %>%
  df_to_graph(nodelist = dplyr::rename(cran_dependencies, name = from))
g0.depends
#> IGRAPH 3372080 DN-- 4810 8070 --
#> + attr: name (v/c), type (e/c), reverse (e/l)
#> + edges from 3372080 (vertex names):
#> [1] A3 ->xtable A3 ->pbapply abc ->abc.data
#> [4] abc ->nnet abc ->quantreg abc ->MASS
#> [7] abc ->locfit abcdeFBA ->Rglpk abcdeFBA ->rgl
#> [10] abcdeFBA ->corrplot abcdeFBA ->lattice ABCp2 ->MASS
#> [13] abctools ->abc abctools ->abind abctools ->plyr
#> [16] abctools ->Hmisc abd ->nlme abd ->lattice
#> [19] abd ->mosaic abn ->nnet abn ->MASS
#> [22] abn ->lme4 abodOutlier->cluster AbSim ->ape
#> + ... omitted several edges
g0.rev_depends
#> IGRAPH 7434ace DN-- 4810 8070 --
#> + attr: name (v/c), type (e/c), reverse (e/l)
#> + edges from 7434ace (vertex names):
#> [1] abc ->abctools abc ->EasyABC abc.data->abc
#> [4] abd ->tigerstats abind ->abctools abind ->BCBCSF
#> [7] abind ->CPMCGLM abind ->depth abind ->dgmb
#> [10] abind ->dynamo abind ->fractaldim abind ->informR
#> [13] abind ->interplot abind ->magic abind ->mlma
#> [16] abind ->mlogitBMA abind ->multicon abind ->MultiPhen
#> [19] abind ->multipol abind ->mvmesh abind ->mvSLOUCH
#> [22] abind ->plfm
#> + ... omitted several edges
Their size (number of edges) and order (number of nodes) should be very close if not identical to each other. Because of the dependency direction, their edge lists should be the same but with the column names from and to swapped.
One may notice that there are external reverse dependencies which won’t appear in the forward dependencies if the scraping is limited to CRAN packages. We can find these external reverse dependencies by setting nodelist = NULL in df_to_graph():
df1.rev_depends <- cran_dependencies %>%
  dplyr::filter(type == "depends" & reverse) %>%
  df_to_graph(nodelist = NULL, gc = FALSE) %>%
  igraph::as_data_frame() # to obtain the edge list
df1.depends <- cran_dependencies %>%
  dplyr::filter(type == "depends" & !reverse) %>%
  df_to_graph(nodelist = NULL, gc = FALSE) %>%
  igraph::as_data_frame()
dfa.diff.depends <- dplyr::anti_join(
  df1.rev_depends,
  df1.depends,
  c("from" = "to", "to" = "from")
)
head(dfa.diff.depends)
#> from to type reverse
#> 1 abind baySeq depends TRUE
#> 2 abind CNORdt depends TRUE
#> 3 abind FISHalyseR depends TRUE
#> 4 abind flowMap depends TRUE
#> 5 abind riboSeqR depends TRUE
#> 6 abind RNAinteract depends TRUE
This means we are extracting the reverse dependencies whose forward equivalents are not listed. The column to shows the packages external to CRAN. On the other hand, if we apply dplyr::anti_join() with the order of the two edge lists switched,
dfb.diff.depends <- dplyr::anti_join(
  df1.depends,
  df1.rev_depends,
  c("from" = "to", "to" = "from")
)
head(dfb.diff.depends)
#> from to type reverse
#> 1 abctools parallel depends FALSE
#> 2 abd grid depends FALSE
#> 3 AcceptanceSampling methods depends FALSE
#> 4 AcceptanceSampling stats depends FALSE
#> 5 accrued grid depends FALSE
#> 6 acid stats depends FALSE
the column to lists those which are not on the page of available packages (anymore). These are either defunct or core packages.
We can also obtain the degree for each package and each type:
df0.summary <- dplyr::count(cran_dependencies, from, type, reverse)
df0.summary
#> # A tibble: 34,861 x 4
#> from type reverse n
#> <chr> <chr> <lgl> <int>
#> 1 A3 depends FALSE 2
#> 2 A3 suggests FALSE 2
#> 3 ABACUS imports FALSE 2
#> 4 ABACUS suggests FALSE 2
#> 5 ABC.RAP imports FALSE 3
#> 6 ABC.RAP suggests FALSE 2
#> 7 ABCanalysis imports FALSE 1
#> 8 ABCanalysis suggests TRUE 4
#> 9 ABCoptim imports FALSE 4
#> 10 ABCoptim linking_to FALSE 1
#> # … with 34,851 more rows
We can look at the “winner” in each of the reverse dependencies:
df0.summary %>%
  dplyr::filter(reverse) %>%
  dplyr::group_by(type) %>%
  dplyr::top_n(1, n)
#> # A tibble: 4 x 4
#> # Groups: type [4]
#> from type reverse n
#> <chr> <chr> <lgl> <int>
#> 1 MASS depends TRUE 455
#> 2 Rcpp linking_to TRUE 2082
#> 3 ggplot2 imports TRUE 2038
#> 4 knitr suggests TRUE 5806
This is not surprising given the nature of each package. To take the summarisation one step further, we can obtain the frequencies of the degrees, and visualise the empirical degree distribution neatly on the log-log scale:
df1.summary <- df0.summary %>%
  dplyr::count(type, reverse, n)
#> Storing counts in `nn`, as `n` already present in input
#> ℹ Use `name = "new_name"` to pick a new name.
gg0.summary <- df1.summary %>%
  dplyr::mutate(reverse = ifelse(reverse, "reverse", "forward")) %>%
  ggplot2::ggplot() +
  ggplot2::geom_point(ggplot2::aes(n, nn)) +
  ggplot2::facet_grid(type ~ reverse) +
  ggplot2::scale_x_log10() +
  ggplot2::scale_y_log10() +
  ggplot2::labs(x = "Degree", y = "Number of packages") +
  ggplot2::theme_bw(20)
gg0.summary
This shows that the reverse dependencies, in particular Reverse_depends and Reverse_imports, appear to follow a power law, a pattern that has been empirically observed in various academic fields.
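To see what the two counting steps produce, the sketch below repeats them in base R on a made-up edge list: first the degree of each package, then the number of packages sharing each degree. The data are hypothetical, and aggregate() is used here only as a stand-in for dplyr::count().

```r
# Made-up reverse "imports" rows: A has 3, B has 2, C and D have 1 each.
edges <- data.frame(
  from    = c("A", "A", "A", "B", "B", "C", "D"),
  type    = "imports",
  reverse = TRUE,
  stringsAsFactors = FALSE
)

# Step 1: degree n of each package within each (type, reverse) combination.
deg <- aggregate(
  n ~ from + type + reverse,
  data = transform(edges, n = 1L),
  FUN  = sum
)

# Step 2: number of packages nn that share each degree n.
freq <- aggregate(
  nn ~ type + reverse + n,
  data = transform(deg, nn = 1L),
  FUN  = sum
)
freq

# On real data, the points (n, nn) could then be drawn on log-log axes,
# for example with plot(freq$n, freq$nn, log = "xy").
```

A roughly straight line on the log-log scale is what suggests a power law; a visual impression of this kind is not a formal test.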
Methods in social network analysis, such as community detection algorithms and/or stochastic block models, can be applied to study the properties of the dependency network. Ideally, by analysing the dependencies of all CRAN packages, we can obtain a bird’s-eye view of the ecosystem.