missSBM: a case study with war networks

The war network

The war data set comes with the missSBM package:

data("war")

This data set contains a list of two networks (beligerent and alliance) where the nodes are countries; an edge in the network beligerent means that the two countries have been at war at least once between years 1816 to 2007; an edge in network alliance means that the two countries have had a formal alliance between years 1816 and 2012. The network beligerent have less nodes since countries which have not been at war at all are not considered.

These two networks were extracted from http://www.correlatesofwar.org/ (see Sarkees and Wayman (2010) for war data, and Gibler (2008) for formal alliance). Version 4.0 was used for war data and version 4.1 for formal alliance. On the top of the two networks, two covariates were considered. One covariate is concerned with military power of the states (see Singer, Bremer, and Stuckey (1972), version 5.0 was used) and the other is concerned with trade exchanges between country (see Barbieri, Keshk, and Pollins (2016) and Barbieri, Keshk, and Pollins (2009), version 4.0 was used). In the following, we focus on the network war$beligerent, which is provided as an igraph object:

par(mar = c(0,0,0,0))
plot(war$beligerent, 
     vertex.shape="none", vertex.label=V(war$beligerent)$name,
     vertex.label.color = "steel blue", vertex.label.font=1.5,
     vertex.label.cex=.6, edge.color="gray70", edge.width = 2)

To pursue our analysis, we extract the adjacency matrix of the network, a covariate on the vertices describing the military power of each country, and a covariate on the dyads describing the intensity of trade between two countries.

beligerent_adjacency <- as_adj(war$beligerent, sparse = FALSE)
beligerent_power     <- war$beligerent$power
beligerent_trade     <- war$beligerent$trade

Generating missing data

Even though the dataset was complete, we can assume that some data may be missing for the sake of illustration.
More specifically, the data collection may be missing for some countries in the sense that data were collected comprehensively for a subset of countries and for the other countries we only observe their edges with the first subset and not within them. Thus, the sampling is node-centered and collects edges information accordingly (there will be a block of missing data on the diagonal of the adjacency matrix). To this end we rely on the function sample in missSBM:

sampledNet_war <- missSBM::sample(beligerent_adjacency, sampling = "node", parameters = .8)
plot(sampledNet_war)

Estimation with missing data

We can now adjust a Stochastic Block Model with the function estimate under this type of sampling:

vBlocks <- 1:5
collection_sbm <- missSBM::estimate(sampledNet_war, vBlocks = vBlocks, sampling = "node")
res_unsmoothed <- data.frame(
  ICL     = collection_sbm$ICL,
  nBlocks = vBlocks, 
  type    = "raw"
)

Smoothing

The smooth function allows the user to produce a smoothed version of the Integrated Classification Likelihood Criterion, commonly used to perform model selection. This will make the choice of the number of group/block more robust.

smooth(collection_sbm, "both")
res_smoothed <- data.frame(
  ICL     = collection_sbm$ICL,
  nBlocks = vBlocks, 
  type    = "smoothed"
)

Let us now check that the smoothing did its job correctly:

rbind(res_unsmoothed, res_smoothed) %>% 
  ggplot(aes(x = nBlocks, y = ICL, group = type, color = type)) + 
    geom_line() + theme_bw()

Estimation on fully observed network

We would like to compare our results with the clustering obtained on the fully observed network. To this end, we adjust - and smooth - a collection of SBM on the original adjacency matrix:

collection_sbm_full <- 
  missSBM::estimate(
    sampledNet  = prepare_data(beligerent_adjacency), 
    vBlocks     = vBlocks, 
    sampling    = "node"
  )
smooth(collection_sbm_full, "forward", control = list(iterates = 3))

As expected, the ICL on the fully observed network is better. But more interestingly, the number of groups selected may differ in the the presence of missing data.

res_missing <- res_smoothed
res_missing$type <- "missing"
res_full <- data.frame(
  ICL     = collection_sbm_full$ICL,
  nBlocks = vBlocks, 
  type    = "full"
)
rbind(res_missing, res_full) %>% 
  ggplot(aes(x = nBlocks, y = ICL, group = type, color = type)) + 
    geom_line() + theme_bw()

Indeed, two classes found on the fully observed network fuse in the SBM fitted on the partially observed network.

table(
  collection_sbm$bestModel$fittedSBM$memberships,
  collection_sbm_full$bestModel$fittedSBM$memberships
  )

##    
##      1  2  3
##   1 33  0  0
##   2  1  4  9
##   3 30  6  0

The model finally fitted on the network data can be represented thanks to a plot method applying on objects with class SBM:

par(mfrow = c(2,2))
plot(collection_sbm$bestModel$fittedSBM, type = "network")
plot(collection_sbm$bestModel$fittedSBM, type = "connectivity")
plot(collection_sbm_full$bestModel$fittedSBM, type = "network")
plot(collection_sbm_full$bestModel$fittedSBM, type = "connectivity")

Taking covariates into account

This part shows how to account for covariates in the model.

Military power

We first consider a covariate reflecting the military power of the country, hence associated to the nodes. We typically expect a part of the network to be explained by this covariate.

Let us first prepare the data for missSBM inference on the full set of data. When preparing the data the covariate should be provided in a list as a vector:

sampleNet_cov <- prepare_data(beligerent_adjacency, list(beligerent_power))

Then we run the inference on the fully observed network¹:

vBlocks <- 1:4
collection_sbm_power_full <- estimate(sampleNet_cov, vBlocks = vBlocks, sampling = "node", useCovariates = TRUE)

The option useCovariates = TRUE specifies that the covariate(s) should be used in the SBM i.e. the distribution of edges depends on the covariate(s).

The covariate provided as a vector is transferred on edges through an $\ell_1$ similarity: for edge $(i,j)$ the associated covariate is defined by $|x_i-x_j|$ where $x_i$ denotes the covariate for node $i$. Another similarity measure could be provided via the option similarity.

The estimated effect of the covariate is obtained through

collection_sbm_power_full$bestModel$fittedSBM$covarParam

## [1] -0.2138825

The covariate could be responsible for the sampling. The state with bigger military power are more likely to be fully observed than the others. We will simulate this sampling. An intercept is considered by default in the sampling model.

nWar <- nrow(beligerent_adjacency)
parameters_sample <- 600
sampleNet_power_miss <- missSBM::sample(
   beligerent_adjacency,
   sampling = "covar-node",
   parameters = parameters_sample, covariates = list(beligerent_power), intercept = -2
  )
boxplot(1/(1 + exp(-cbind(1,beligerent_power) %*% c(-2, parameters_sample))) ~ sampleNet_power_miss$observedNodes, ylab="mil power",xlab = "observed node")

plot(sampleNet_power_miss)

Then, we can estimate the model by setting the sampling to be covar-node. We can still choose whether to consider or not the covariate in the SBM.

collection_sbm_power_miss <- estimate(sampleNet_power_miss, vBlocks = vBlocks, sampling = "covar-node", useCovariates = TRUE)

## 
## 
##  Adjusting Variational EM for Stochastic Block Model
## 
##  Imputation assumes a 'covar-node' network-sampling process
## 
##  Initialization of model with 1 blocks. 
 Initialization of model with 2 blocks. 
 Initialization of model with 3 blocks. 
 Initialization of model with 4 blocks. 
##  Performing VEM inference for model with 1 blocks.
 Performing VEM inference for model with 2 blocks.
 Performing VEM inference for model with 3 blocks.
 Performing VEM inference for model with 4 blocks.

Then we can access the estimated sampling parameters:

collection_sbm_power_miss$bestModel$fittedSampling$parameters

## [1]  -1.815746 595.186057

and the parameters in the SBM associated with the covariate:

collection_sbm_power_miss$bestModel$fittedSBM$covarParam

## [1] 0.308483

Trade data

Another covariate is the average trade exchange between the states. This covariate is related to pairs of states hence to the dyads. We first build a matrix of dissimilarity according to this covariate:

trade <- beligerent_trade 
trade[is.na(trade)] <- 0
trade <- trade + t(trade)
trade <- log(trade + 1)
diag(trade) = 0

We then conduct a similar analysis as with the power to see how it can be accounted for in the SBM.

We first sample according to it:

parameters_sample <- 1
sampleNet_trade_miss <- missSBM::sample(beligerent_adjacency, sampling = "covar-dyad", parameters = parameters_sample, covariates = list(trade), intercept = -2)
plot(sampleNet_trade_miss)

The choice of the sampling parameters leads to more likely observe dyads with important trade exchange. We then perform estimation on the missing data drawn according to the trade covariate.

collection_sbm_trade_miss <- estimate(sampleNet_trade_miss ,vBlocks = vBlocks, sampling = "covar-dyad", useCovariates  = TRUE)

## 
## 
##  Adjusting Variational EM for Stochastic Block Model
## 
##  Imputation assumes a 'covar-dyad' network-sampling process
## 
##  Initialization of model with 1 blocks. 
 Initialization of model with 2 blocks. 
 Initialization of model with 3 blocks. 
 Initialization of model with 4 blocks. 
##  Performing VEM inference for model with 1 blocks.
 Performing VEM inference for model with 2 blocks.
 Performing VEM inference for model with 3 blocks.
 Performing VEM inference for model with 4 blocks.

collection_sbm_trade_miss$bestModel$fittedSampling$parameters

## [1] -1.7805356  0.9356245

collection_sbm_trade_miss$bestModel$fittedSBM$covarParam

## [1] 0.1581317

Note that smoothing is not performed to alleviate computational time.↩

missSBM: a case study with war networks

missSBM team

Last update: 2019-09-16

Prerequisites