Introduction

The flows package contains functions that select flows, provide statistics on selections and offer map and graph visualisations.
The first part of the vignette reviews several methods of flow selection, the second part presents the main functions of the package and the last one proposes an example of analysis based on commuter data in the French Grand Est region.

Analysis of geographic flows: issues and methods

In the field of spatial analysis, working on flows means focusing on the relationships between places rather than on their characteristics. Analysing and representing flows often requires a selection to ease interpretation.

One of the first methods developed was the so-called dominant flows (or nodal regions) method, proposed by Nystuen and Dacey in 1961 (Dacey 1961). Working on telephone flows between cities in the Seattle area, they sought to highlight the hierarchy between locations. According to this method, a place i is dominated by a place j if two conditions are met:

the most important flow from i is emitted towards j;
the sum of the flows received by j is greater than the sum of the flows received by i.

This method creates what is called in graph theory a tree (acyclic graph) or a forest (a set of unconnected trees) with three types of nodes: dominant, dominated and intermediate. Although the method creates a clear functional hierarchy, its major drawback is that it undervalues flow intensities.
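The two conditions above can be sketched in base R on a small matrix. This is only an illustration under stated assumptions: the matrix `flows_toy` and its values are made up, and the code is not the implementation used by the flows package.

```r
# Toy flow matrix: rows are origins, columns are destinations (made-up values).
flows_toy <- matrix(c( 0, 50, 10,  5,
                      20,  0,  5,  5,
                      30, 10,  0,  2,
                       1,  2, 40,  0),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(LETTERS[1:4], LETTERS[1:4]))

# Total flow received by each place (column sums).
received <- colSums(flows_toy)

# Condition 1: destination of the largest flow emitted by each origin.
main_dest <- colnames(flows_toy)[apply(flows_toy, 1, which.max)]
names(main_dest) <- rownames(flows_toy)

# Condition 2: that destination must receive more than the origin does.
is_dominated <- received[main_dest] > received[names(main_dest)]

# One row per place: its main destination and whether it is dominated by it.
dominance <- data.frame(origin = names(main_dest),
                        main_dest = unname(main_dest),
                        dominated = unname(is_dominated),
                        stringsAsFactors = FALSE)
dominance
```

In this toy case A is dominated by B and D is dominated by C, while B and C are not dominated by their main destination, so the result is a forest of two small trees.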

Various methods have subsequently been proposed to better reflect this intensity, one of the most frequently used being the so-called major flows method: it selects only the most important flows, absolute or relative, either locally or globally. Analysing commuter data between cities, one may choose to select:

all flows greater than 100;
the 50 first flows (global criterion);
the 10 first flows emitted by each city (local criterion).

These criteria can also be expressed in relative form:

flows that represent more than 10% of the active population of each city (local criterion);
flows that take into account 80% of all commuters (global criterion).
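The criteria listed above can be illustrated with a few lines of base R. The sketch below uses a made-up matrix `m` and smaller cut-offs than in the text (2 flows instead of 50, 1 flow per origin instead of 10) so that the result stays readable; the object names are hypothetical and the code does not call the flows package.

```r
# Toy commuter matrix (made-up values), origins in rows, destinations in columns.
m <- matrix(c(  0, 120,  30,
              400,   0,  60,
               90, 150,   0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# Absolute threshold: keep flows greater than 100.
sel_threshold <- (m > 100) * 1

# Global criterion: keep the 2 largest flows of the whole matrix
# (ties at the cut-off would keep more than 2 flows).
cutoff_global <- sort(as.vector(m), decreasing = TRUE)[2]
sel_global <- (m >= cutoff_global) * 1

# Local criterion: keep the largest flow emitted by each origin (row).
sel_local <- t(apply(m, 1, function(x) (x == max(x)) * 1))

# Relative global criterion: the largest flows needed to cover 80% of the total.
ord <- order(m, decreasing = TRUE)
n_keep <- which(cumsum(m[ord]) >= 0.8 * sum(m))[1]
sel_share <- m * 0
sel_share[ord[seq_len(n_keep)]] <- 1
```

Each selection is a 0/1 matrix of the same size as `m`, so the number of selected flows is simply `sum()` of the selection.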

These methods often highlight hierarchies between places, but the loss of information created by the selection is rarely questioned. It therefore seems useful to propose statistical indicators that assess the volume of information lost and the characteristics of the selected flows.

The package flows

A typical data workflow may be:

data preparation;
flow selection;
statistical data and graphical outputs on the selection made;
graph or map representation (dominant flows).

Data Preparation

Flow data can be found in wide (matrix) or long format (i-j-fij, i.e. origin - destination - flow intensity). As all flows functions take flow data in wide format, the prepflows function transforms a link list into a square matrix. prepflows has four arguments: a data.frame to transform (mat), the origin (i), the destination (j) and the flow intensity (fij).

library(flows)
# Import data
data(nav)
head(nav)

##     i namei      wi   j                 namej          wj         fij
## 1 001 Paris 5599722 001                 Paris 5599722.265 1698.155329
## 2 001 Paris 5599722 048                Troyes   75561.974    3.909858
## 3 001 Paris 5599722 129                  Sens   24625.065  286.788719
## 4 001 Paris 5599722 529              Vouziers    2119.563    4.047245
## 5 001 Paris 5599722 025                 Dijon  164439.563    5.406881
## 6 001 Paris 5599722 752 Saint-Julien-du-Sault    1048.426    8.097588

# Prepare data
myflows <- prepflows(mat = nav, i = "i", j = "j", fij = "fij")
myflows[1:4,1:4]

##          001         009         020         024
## 001 1698.155      0.0000      0.0000      0.0000
## 009    0.000 298895.3551    402.2043    281.4378
## 020    0.000    263.9613 154742.7863   3040.1983
## 024    0.000    258.6355   4500.3492 129716.7266
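To make the long-to-wide transformation concrete, here is a minimal base R sketch of the same idea. It is not the code of prepflows: the data.frame `links` and its values are hypothetical, and duplicated origin-destination pairs are assumed not to occur (they would be overwritten rather than summed).

```r
# Hypothetical link list in long format (i, j, fij); values are made up.
links <- data.frame(i   = c("a", "a", "b", "c"),
                    j   = c("b", "c", "a", "c"),
                    fij = c(10, 5, 7, 3),
                    stringsAsFactors = FALSE)

# All units appearing as origin or destination, so that the matrix is square.
units <- sort(unique(c(links$i, links$j)))

# Start from a matrix of zeros and fill in the observed flows,
# using (row name, column name) pairs as a two-column index.
wide <- matrix(0, nrow = length(units), ncol = length(units),
               dimnames = list(units, units))
wide[cbind(links$i, links$j)] <- links$fij
wide
```

Pairs that are absent from the link list stay at 0, which matches the zeros visible in the extract of myflows above.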

Flow Selection

Three selection methods based on the flow origins are accessible through the firstflows function:

nfirst: the k first flows from all origins;
xfirst: all flows greater than a threshold k;
xsumfirst: as many flows as necessary for each origin so that their sum is at least equal to k.

Figure 1: The three methods of the firstflows function
Black links are the selected ones.
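Of the three methods, xsumfirst is the least obvious, so a small base R sketch of the underlying idea may help. The helper `select_until_sum` is hypothetical and is not part of flows; in particular, keeping every non-zero flow when k is never reached is an assumption of this sketch, not documented package behaviour.

```r
# Hypothetical helper: for one origin, flag the largest flows until their
# cumulative sum reaches k.
select_until_sum <- function(x, k) {
  ord <- order(x, decreasing = TRUE)
  reached <- which(cumsum(x[ord]) >= k)
  # Assumption: if k is never reached, keep every non-zero flow of that origin.
  n_keep <- if (length(reached) > 0) reached[1] else sum(x > 0)
  out <- numeric(length(x))
  out[ord[seq_len(n_keep)]] <- 1
  out
}

# Toy flow matrix (made-up values), origins in rows.
m <- matrix(c( 0, 60, 30, 10,
              25,  0, 25, 50,
               5,  5,  0,  5,
              80, 10, 10,  0),
            nrow = 4, byrow = TRUE,
            dimnames = list(letters[1:4], letters[1:4]))

# Apply the rule to each row (origin), with k = 70.
sel <- t(apply(m, 1, select_until_sum, k = 70))
dimnames(sel) <- dimnames(m)
sel
```

Origin d needs a single flow (80) to reach 70, origins a and b need two, and origin c never reaches 70, so all three of its non-zero flows are kept.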

Methods taking into account the total volume of flows are implemented in the firstflowsg function. They are identical to the ones described above: selection of the k first flows, selection of flows greater than k and selection of flows such that their sum is at least equal to k.

The domflows function selects flows based on a dominance test. This function may be used to select flows obeying the second criterion of the Nystuen and Dacey method.

All these functions take as input a square matrix of flows and generate binary matrices of the same size. Selected flows are coded 1, others 0. It is therefore possible to combine criteria of selection through element-wise multiplication of matrices (Figure 2).

Figure 2: Flow selection and criteria combination
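The combination by element-wise multiplication can be sketched as follows. The two binary matrices are built by hand in base R on a made-up matrix `m`; with the package they would come from the selection functions described above.

```r
# Toy flow matrix (made-up values), origins in rows.
m <- matrix(c(  0,  60,  40,
              120,   0, 500,
               80, 700,   0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# Criterion 1: flows greater than 100.
sel1 <- (m > 100) * 1

# Criterion 2: the largest flow emitted by each origin.
sel2 <- t(apply(m, 1, function(x) (x == max(x)) * 1))

# A flow is kept only if both criteria select it (1 * 1 = 1, otherwise 0).
both <- sel1 * sel2

# Multiplying by the original matrix recovers the intensities of the kept flows.
kept <- m * both
kept
```

Here the largest flow from A (60) passes the second criterion but not the first, so only the flows B to C and C to B survive the combination.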

The statmat function provides various indicators and graphical outputs on a flow matrix to allow statistically relevant selections. Measures provided are density (number of present flows divided by the number of possible flows); number, size and composition of connected components; sum, quartiles and average intensity of flows. In addition, four graphics can be plotted: degree distribution curve (by default, outdegree), weighted degree distribution curve, Lorenz curve and boxplot on flow intensities.

# Import data
data(nav)
myflows <- prepflows(mat = nav, i = "i", j = "j", fij = "fij")

# Get statistics about the matrix
statmat(mat = myflows, output = "none", verbose = TRUE)

## matrix dimension: 159 X 159 
## nb. links: 3350 
## density: 0.1333493 
## nb. of components (weak) 1 
## nb. of components (weak, size > 1) 1 
## sum of flows: 2306585 
## min: 0.8795206 
## Q1: 4.008417 
## median: 9.544442 
## Q3: 54.80416 
## max: 298895.4 
## mean: 688.5328 
## sd: 7765.105

# Plot Lorenz curve only
statmat(mat = myflows, output = "lorenz", verbose = FALSE)

# Graphics only
statmat(mat = myflows, output = "all", verbose = FALSE)

# Statistics only
mystats <- statmat(mat = myflows, output = "none", verbose = FALSE)
str(mystats)

## List of 16
##  $ matdim      : int [1:2] 159 159
##  $ nblinks     : num 3350
##  $ density     : num 0.133
##  $ connectcomp : int 1
##  $ connectcompx: int 1
##  $ sizecomp    :'data.frame':    1 obs. of  3 variables:
##   ..$ idcomp  : int 1
##   ..$ sizecomp: num 159
##   ..$ wcomp   : num 2306585
##  $ compocomp   :'data.frame':    159 obs. of  2 variables:
##   ..$ id    : chr [1:159] "001" "009" "020" "024" ...
##   ..$ idcomp: num [1:159] 1 1 1 1 1 1 1 1 1 1 ...
##  $ degrees     :'data.frame':    159 obs. of  3 variables:
##   ..$ id     : chr [1:159] "001" "009" "020" "024" ...
##   ..$ degree : num [1:159] 7 89 78 76 87 61 65 55 44 49 ...
##   ..$ wdegree: num [1:159] 2021 318296 170691 148765 157823 ...
##  $ sumflows    : num 2306585
##  $ min         : num 0.88
##  $ Q1          : num 4.01
##  $ median      : num 9.54
##  $ Q3          : num 54.8
##  $ max         : num 298895
##  $ mean        : num 689
##  $ sd          : num 7765

# Sum of flows
mystats$sumflows

## [1] 2306585
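Two of these indicators are easy to reproduce by hand. The density reported above (0.1333493) is consistent with 3350 / (159 × 158), i.e. with the diagonal left out of the possible flows; the sketch below assumes that definition and a matrix without self-flows. It uses made-up values and base R only, and the Lorenz curve coordinates are one common construction rather than a copy of what statmat draws.

```r
# Toy flow matrix (made-up values) with an empty diagonal.
m <- matrix(c(0, 10,  0, 0,
              5,  0, 85, 0,
              0,  0,  0, 0,
              0,  0,  0, 0),
            nrow = 4, byrow = TRUE)

n <- nrow(m)
flows_present <- m[m > 0]

# Density: present flows divided by possible flows (assumed to be n * (n - 1)).
density <- length(flows_present) / (n * (n - 1))

# Lorenz curve coordinates: cumulative share of links (smallest flows first)
# against cumulative share of the total volume.
sorted <- sort(flows_present)
lorenz <- data.frame(share_links  = seq_along(sorted) / length(sorted),
                     share_volume = cumsum(sorted) / sum(sorted))
lorenz
```

In this toy case the two smallest links out of three carry only 15% of the volume, the kind of concentration that motivates selecting a few major flows.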

To ease comparisons, the compmat function outputs a data.frame that provides statistics on differences between two matrices (for example a matrix and a selection from this matrix).

Visualisation helps analysis. The plotDomFlows function produces a graph where the sizes and colors of vertices depend on their position in the graph (dominant, intermediate or dominated) and the thicknesses of links depend on flow intensities.

The plotMapDomFlows function maps the selected flows according to the same principles.
Both functions only apply to a dominant flows selection¹.

Commuter flows in the French Grand Est

As an illustration, we present a brief analysis of commuter flows between urban areas of the Grand Est region in France².

We compare two different thresholds (500 and 1000) on the total volume of flows.

# Import data
data(nav)
myflows <- prepflows(mat = nav, i = "i", j = "j", fij = "fij")

# Remove the matrix diagonal
diag(myflows) <- 0

# Selection of flows > 500
flowSel1 <- firstflowsg(mat = myflows, method = "xfirst", k = 500)
# Selection of flows > 1000
flowSel2 <- firstflowsg(mat = myflows, method = "xfirst", k = 1000)

# Compare initial matrix and selected matrices
compmat(mat1 = myflows, mat2 = myflows * flowSel1, digits = 1)

##                  mat1     mat2  absdiff reldiff
## nblinks        3191.0    137.0   3054.0    95.7
## sumflows     313298.7 193196.2 120102.4    38.3
## connectcompx      1.0     10.0      9.0      NA
## min               0.9    502.4       NA      NA
## Q1                4.0    583.6       NA      NA
## median            8.2    880.0       NA      NA
## Q3               40.5   1701.5       NA      NA
## max            8654.0   8654.0       NA      NA
## mean             98.2   1410.2       NA      NA
## sd              399.9   1343.1       NA      NA

compmat(mat1 = myflows, mat2 = myflows * flowSel2, digits = 1)

##                  mat1     mat2  absdiff reldiff
## nblinks        3191.0     62.0   3129.0    98.1
## sumflows     313298.7 145365.0 167933.7    53.6
## connectcompx      1.0      7.0      6.0      NA
## min               0.9   1020.7       NA      NA
## Q1                4.0   1252.6       NA      NA
## median            8.2   1791.3       NA      NA
## Q3               40.5   2938.4       NA      NA
## max            8654.0   8654.0       NA      NA
## mean             98.2   2344.6       NA      NA
## sd              399.9   1543.3       NA      NA

If we select flows greater than 500 commuters, we lose 95.7% of all links but only 38.3% of the volume of flows. With a threshold of 1000 commuters, 98.1% of links are lost but only 53.6% of the volume of flows.
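These percentages are the reldiff column of the two tables, i.e. the absolute difference divided by the value for the initial matrix. A quick check in base R, with the figures copied from the compmat outputs above (the object names are only illustrative):

```r
# Values copied from the two compmat tables above.
links_all   <- 3191
volume_all  <- 313298.7
links_kept  <- c(k500 = 137, k1000 = 62)
volume_kept <- c(k500 = 193196.2, k1000 = 145365.0)

# Relative loss (%) = (initial - kept) / initial * 100
links_lost  <- round((links_all - links_kept) / links_all * 100, 1)
volume_lost <- round((volume_all - volume_kept) / volume_all * 100, 1)

rbind(links_lost, volume_lost)
```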

The following example selects flows that represent at least 20% of the sum of outgoing flows for each urban area.

# Import data
data(nav)
myflows <- prepflows(mat = nav, i = "i", j = "j", fij = "fij")

# Remove the matrix diagonal
diag(myflows) <- 0

# Share of each outgoing flow, as a percentage of its origin's total outflow
myflows2 <- myflows / rowSums(myflows) * 100

# Select flows that represent at least 20% of the sum of outgoing flows for 
# each urban area.
flowSel <- firstflows(mat = myflows2, method = "xfirst", k = 20)

# Compare initial and selected matrices
compmat(mat1 = myflows,mat2 = flowSel * myflows)

##                mat1   mat2 absdiff reldiff
## nblinks        3191    239    2952      93
## sumflows     313299 166697  146601      47
## connectcompx      1      6       5      NA
## min               1      3      NA      NA
## Q1                4    156      NA      NA
## median            8    323      NA      NA
## Q3               41    585      NA      NA
## max            8654   8654      NA      NA
## mean             98    697      NA      NA
## sd              400   1150      NA      NA

This selection keeps only 7% of all links and 53% of the volume of flows.
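The row-percentage step in this example relies on R recycling the vector returned by rowSums down the columns of the matrix, so that each flow is divided by the total outflow of its own origin. The sketch below shows this on a made-up matrix; the handling of an origin without any outgoing flow is an assumption added for illustration, since the example above does not need it explicitly.

```r
# Toy flow matrix (made-up values); origin C emits nothing.
m <- matrix(c( 0, 30, 70,
              10,  0, 10,
               0,  0,  0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# rowSums() has one value per row; R recycles it down the columns,
# so every element of row r is divided by the total of row r.
pct <- m / rowSums(m) * 100

# An origin with no outgoing flow gives 0 / 0 = NaN; one possible choice
# (an assumption here) is to treat these shares as 0.
pct[is.nan(pct)] <- 0
pct
```

Each row of `pct` with at least one flow sums to 100, so a threshold of 20 on this matrix reads as "at least 20% of the origin's outgoing flows".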

We decide to use this selection as the first criterion of our analysis. The second one will be a dominant flow selection based on the sum of incoming flows.

# Import data
data(nav)
myflows <- prepflows(mat = nav, i = "i", j = "j", fij = "fij")

# Remove the matrix diagonal
diag(myflows) <- 0

# Select flows that represent at least 20% of the sum of outgoing flows for 
# each urban area.
flowSel1 <- firstflows(mat = myflows/rowSums(myflows)*100, method = "xfirst", 
                       k = 20)


# Select the dominant flows (incoming flows criterion)
flowSel2 <- domflows(mat = myflows, w = colSums(myflows), k = 1)

# Combine selections
flowSel <- myflows * flowSel1 * flowSel2

# Node weights
inflows <- data.frame(id = colnames(myflows), w = colSums(myflows))

# Plot dominant flows map
opar <- par(mar = c(0,0,2,0))
sp::plot(GE, col = "#cceae7", border = NA)
plotMapDomFlows(mat = flowSel, spdf = UA, spdfid = "ID", w = inflows, wid = "id",
                wvar = "w", wcex = 0.05, add = TRUE,
                legend.flows.pos = "topright",
                legend.flows.title = "Nb. of commuters")
title("Dominant Flows of Commuters")
mtext(text = "INSEE, 2011", side = 4, line = -1, adj = 0.01, cex = 0.8)

par(opar)


# Statistics on major urban areas
inflows <- data.frame(id = colnames(flowSel), w = colSums(flowSel))
UA.df <- unique(data.frame(id = c(nav$i, nav$j),name = c(nav$namei, nav$namej)))
UAindegreew <- merge(inflows, UA.df, by = "id", all.x = TRUE)
UAindegreew[order(UAindegreew$w, decreasing = TRUE),][1:10,]

##     id         w                                          name
## 3  020 14869.995                                         Nancy
## 2  009 14499.285                 Strasbourg (partie francaise)
## 7  034 11900.563                                      Mulhouse
## 4  024 11719.812                                          Metz
## 15 073  5871.828                                       Belfort
## 17 082  5609.468 Sarrebruck (ALL) - Forbach (partie francaise)
## 6  029  5466.937                                         Reims
## 8  041  5391.573                                      Besancon
## 16 080  3409.050                          Charleville-Mezieres
## 12 060  3297.406                              Chalon-sur-Saone

The top of the node hierarchy clearly shows, in descending order, the dominance of Nancy, Strasbourg, Mulhouse and Metz, each attracting more than 10 000 commuters.

One could easily repeat these selections, with higher or lower thresholds, to identify the most robust connections and the intermittent ones.
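Such a sensitivity check can be sketched as a simple loop over thresholds. The example below is a base R illustration on a made-up matrix, with hypothetical object names; with the package, the selection inside the loop would be replaced by the relevant selection function and the summary by compmat.

```r
# Toy flow matrix (made-up values) with an empty diagonal.
m <- matrix(c(  0, 900,   40,  10,
              600,   0,  300,  20,
               50, 450,    0, 700,
               30,  60, 1200,   0),
            nrow = 4, byrow = TRUE)

thresholds <- c(100, 500, 1000)

# For each threshold k, share of links and of volume kept by "flows > k".
sweep_res <- do.call(rbind, lapply(thresholds, function(k) {
  sel <- m > k
  data.frame(k = k,
             links_kept_pct  = round(sum(sel) / sum(m > 0) * 100, 1),
             volume_kept_pct = round(sum(m[sel]) / sum(m) * 100, 1))
}))
sweep_res

# Links that survive every threshold are the most robust ones.
robust <- which(m > max(thresholds), arr.ind = TRUE)
```

Links that disappear as soon as the threshold rises slightly are the intermittent ones; those still present at the highest threshold form the robust core.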

Conclusion

The flows package aims to enable a relevant selection of flows, while leaving maximum flexibility to the user.

References

Dacey, M. 1961. “A Graph Theory Interpretation of Nodal Regions.” Papers and Proceedings of the Regional Science Association 7: 29–42.

¹ Visualisation functions are provided only for the nodal regions / dominant flows method, since other R packages already handle general graph and map representations.
² The data come from the 2011 French National Census (Recensement Général de la Population de l’INSEE). The area includes five administrative regions: Champagne-Ardenne, Lorraine, Alsace, Bourgogne, and Franche-Comté. Cities are urban areas (2010 borders).

Introduction to the flows package

Timothée Giraud, Laurent Beauguitte, Marianne Guérois

2016-12-05