Seriation

N. Frerebeau

2020-03-19

# Load packages
library(tabula)
library(magrittr)

1 Introduction

The matrix seriation problem in archaeology is based on three conditions and two assumptions, which Dunnell (1970) summarizes as follows.

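The homogeneity conditions state that all the groups included in a seriation must:

1. Be of comparable duration.
2. Belong to the same cultural tradition.
3. Come from the same local area.

The mathematical assumptions state that the distribution of any historical or temporal class:

1. Is continuous through time.
2. Exhibits the form of a unimodal curve.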

These assumptions create a distributional model, and ordering is accomplished by arranging the matrix so that the class distributions approximate the required pattern. The resulting order is inferred to be chronological.
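As an illustration of the distributional model, the check below tests whether the frequencies of one class, read in a proposed order, rise to a single peak and then fall. This is a minimal sketch that assumes the required pattern is a single-peaked curve, as in frequency seriation. The helper is_unimodal_sketch() and the toy vectors are hypothetical and not part of tabula.

```r
## Hypothetical helper (not part of tabula): TRUE if the values never rise
## again once they have started to fall. Ties between neighbours are ignored,
## and strictly increasing or decreasing sequences also pass.
is_unimodal_sketch <- function(x) {
  steps <- sign(diff(x))
  steps <- steps[steps != 0]
  all(diff(steps) <= 0)
}

## Toy relative frequencies of one class across five assemblages
well_ordered <- c(0.05, 0.20, 0.45, 0.25, 0.05)
badly_ordered <- c(0.20, 0.05, 0.45, 0.05, 0.25)

is_unimodal_sketch(well_ordered)
is_unimodal_sketch(badly_ordered)
```

A seriation method can be thought of as searching for the row order under which as many classes as possible pass such a check.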

2 Visualization

Several types of graphs are available in tabula, which uses ggplot2 for plotting. This makes it easy to customize diagrams (e.g. using themes and scales).

2.1 Spot plot

A spot matrix allows direct examination of the data (above or below some threshold):

Spot plot

Spot plot of co-occurrence

2.2 Heatmap

An abundance matrix can be displayed as a heatmap of relative abundances (frequencies), or as percentages of the independence value (in French, “pourcentages de valeur d’indépendance”, PVI).

Heatmap

PVI is calculated for each cell as the percentage of the column theoretical independence value: PVI values greater than \(1\) represent positive deviations from independence, whereas PVI values smaller than \(1\) represent negative deviations (Desachy 2004). The PVI matrix makes it possible to explore deviations from independence (an intuitive graphical approach to \(\chi^2\)): a high-contrast matrix shows substantial deviations that are unlikely to be due to chance (Desachy 2004).

Matrigraphe
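The PVI computation can be sketched in base R. The block below is only an illustration under a stated assumption: the theoretical independence value of a cell is taken to be the usual expected count (row total times column total, divided by the grand total), and PVI is expressed as a ratio so that it can be compared to 1. The helper pvi_sketch() and the toy counts are hypothetical and not part of tabula.

```r
## Hypothetical helper (not part of tabula): ratio of observed counts to the
## counts expected if rows and columns were independent.
pvi_sketch <- function(x) {
  x <- as.matrix(x)
  expected <- outer(rowSums(x), colSums(x)) / sum(x)
  x / expected
}

## Toy count table: three assemblages (rows) by three types (columns)
counts <- matrix(c(10, 2, 1,
                    4, 8, 3,
                    1, 3, 9),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("A", "B", "C"), c("t1", "t2", "t3")))

pvi <- pvi_sketch(counts)
round(pvi, 2)

## Cells over-represented relative to independence
pvi > 1
```

In a table where every row has the same profile, all ratios equal 1, so the heatmap would show no contrast at all.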

2.3 Bar plot

Bertin (1977) diagrams and Ford (1962) diagrams (battleship curves) can also be plotted, with a statistical threshold.

Bertin diagram

Ford diagram

3 Reciprocal ranking

Reciprocal ranking iteratively rearranges rows and/or columns according to their weighted rank in the data matrix until convergence (Ihm 2005).

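For a given incidence matrix \(C\) with \(m\) rows and \(p\) columns, the rows are rearranged in increasing order of:

\[ x_{i} = \sum_{j = 1}^{p} j \frac{c_{ij}}{c_{i \cdot}} \]

The columns are then rearranged in increasing order of:

\[ y_{j} = \sum_{i = 1}^{m} i \frac{c_{ij}}{c_{\cdot j}} \]

These two steps are repeated until convergence. Note that this procedure may fail to converge and loop indefinitely.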
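The two formulas above can be turned into a short base R loop. This is a minimal sketch, not the tabula implementation: reciprocal_sketch() and its stop argument are hypothetical names, rows and columns are assumed to be non-empty, and an iteration cap guards against the non-convergence mentioned above.

```r
## Hypothetical helper (not the tabula implementation): reciprocal ranking
## of a binary or count matrix, with a cap on the number of iterations.
reciprocal_sketch <- function(x, stop = 100) {
  x <- as.matrix(x) * 1  # coerce logical values to numeric
  rows <- seq_len(nrow(x))
  cols <- seq_len(ncol(x))
  for (k in seq_len(stop)) {
    m <- x[rows, cols, drop = FALSE]
    ## x_i: mean column rank of row i, weighted by its entries
    xi <- as.vector(m %*% seq_len(ncol(m))) / rowSums(m)
    new_rows <- rows[order(xi)]
    m <- x[new_rows, cols, drop = FALSE]
    ## y_j: mean row rank of column j, weighted by its entries
    yj <- as.vector(seq_len(nrow(m)) %*% m) / colSums(m)
    new_cols <- cols[order(yj)]
    ## Converged when a full pass leaves both orders unchanged
    if (identical(new_rows, rows) && identical(new_cols, cols)) {
      return(list(rows = rows, columns = cols, converged = TRUE))
    }
    rows <- new_rows
    cols <- new_cols
  }
  warning("No convergence within ", stop, " iterations.")
  list(rows = rows, columns = cols, converged = FALSE)
}

## Toy example: a banded incidence matrix with shuffled rows and columns
band <- matrix(c(1, 1, 0, 0,
                 0, 1, 1, 0,
                 0, 0, 1, 1), nrow = 3, byrow = TRUE)
shuffled <- band[c(2, 3, 1), c(3, 1, 4, 2)]

res <- reciprocal_sketch(shuffled)
shuffled[res$rows, res$columns]
```

On this toy matrix the loop settles after two passes and the band along the diagonal is recovered. Real data may converge to a different local arrangement depending on the starting order.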

## Build an incidence matrix with random data
set.seed(12345)
incidence1 <- IncidenceMatrix(data = sample(0:1, 400, TRUE, c(0.6, 0.4)),
                              nrow = 20)

## Get seriation order on rows and columns
## If no convergence is reached before the maximum number of iterations (100), 
## it stops with a warning.
(indices <- seriate_reciprocal(incidence1, margin = c(1, 2), stop = 100))
#> <PermutationOrder: 0b378f50-729a-4887-8632-235f9ddbac8d>
#> Permutation order for matrix seriation:
#> - Row order: 1 4 20 3 9 16 19 10 13 2 11 7 17 5 6 18 14 15 8 12...
#> - Column order: 1 16 9 4 8 14 3 20 13 2 6 18 7 17 5 11 19 12 15 10...
#> - Method: reciprocal

## Permute matrix rows and columns
incidence2 <- permute(incidence1, indices)

## Plot matrix
plot_heatmap(incidence1) + 
  ggplot2::labs(title = "Original matrix") +
  ggplot2::scale_fill_manual(values = c("TRUE" = "black", "FALSE" = "white"))
plot_heatmap(incidence2) + 
  ggplot2::labs(title = "Rearranged matrix") +
  ggplot2::scale_fill_manual(values = c("TRUE" = "black", "FALSE" = "white"))

The positive difference from the column mean percentage (in French, “écart positif au pourcentage moyen”, EPPM) represents a deviation from the situation of statistical independence (Desachy 2004). As independence can be interpreted as the absence of relationships between types and the chronological order of the assemblages, EPPM is a useful graphical tool to explore the significance of the relationship between rows and columns in a seriation (Desachy 2004).
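One way to read this definition is sketched below in base R. It assumes that assemblages are in rows and types in columns, that the "mean percentage" of a type is its share in the whole table, and that negative differences are set to zero. The helper eppm_sketch() and the toy counts are hypothetical and are not the code used by tabula.

```r
## Hypothetical helper (not part of tabula): share of each type within an
## assemblage minus its share in the whole table, truncated at zero.
eppm_sketch <- function(x) {
  x <- as.matrix(x)
  row_share <- x / rowSums(x)        # profile of each assemblage
  mean_share <- colSums(x) / sum(x)  # average profile of the table
  deviation <- sweep(row_share, 2, mean_share, FUN = "-")
  pmax(deviation, 0)
}

## Toy count table: three assemblages (rows) by three types (columns)
counts <- matrix(c(10, 2, 1,
                    4, 8, 3,
                    1, 3, 9),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("A", "B", "C"), c("t1", "t2", "t3")))

round(eppm_sketch(counts), 2)
```

Only cells where a type is over-represented keep a non-zero value, which is what makes the seriated diagram easy to read.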

## Replicates Desachy 2004 results

## Coerce dataset to an abundance matrix
compiegne_counts <- as_count(compiegne)

## Plot original data matrix
plot_ford(compiegne_counts, EPPM = TRUE) +
  ggplot2::labs(title = "Original dataset") +
  khroma::scale_fill_bright()

## Get seriation order for columns on EPPM using the reciprocal averaging method
## Expected column order: N, A, C, K, P, L, B, E, I, M, D, G, O, J, F, H
compiegne_indices <- seriate_reciprocal(compiegne_counts, EPPM = TRUE, margin = 2)

## Permute columns
compiegne_seriation <- permute(compiegne_counts, compiegne_indices)

## Plot new matrix
plot_ford(compiegne_seriation, EPPM = TRUE) +
  ggplot2::labs(title = "Reordered dataset") +
  khroma::scale_fill_bright()

4 Correspondence analysis

4.1 Seriation

Correspondence analysis (CA) is an effective method for the seriation of archaeological assemblages. The order of the rows and columns is given by the coordinates along one dimension of the CA space, assumed to account for temporal variation. The direction of temporal change within the correspondence analysis space is arbitrary: additional information is needed to determine the actual order in time.
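The idea can be sketched with base R only, by computing the correspondence analysis from a singular value decomposition of the standardized residuals and ordering rows and columns along the first dimension. This is an illustration rather than the tabula implementation: ca_order_sketch() and the toy matrix are hypothetical.

```r
## Hypothetical helper (not the tabula implementation): order rows and
## columns along the first axis of a correspondence analysis.
ca_order_sketch <- function(x) {
  x <- as.matrix(x)
  P <- x / sum(x)                       # correspondence matrix
  row_mass <- rowSums(P)
  col_mass <- colSums(P)
  expected <- outer(row_mass, col_mass)
  S <- (P - expected) / sqrt(expected)  # standardized residuals
  dec <- svd(S)
  ## Standard coordinates on the first dimension
  row_coord <- dec$u[, 1] / sqrt(row_mass)
  col_coord <- dec$v[, 1] / sqrt(col_mass)
  list(rows = order(row_coord), columns = order(col_coord))
}

## Toy abundance matrix with a clear diagonal structure, then shuffled
ordered <- matrix(c(10,  5,  0,  0,
                     5, 10,  5,  0,
                     0,  5, 10,  5,
                     0,  0,  5, 10), nrow = 4, byrow = TRUE)
shuffled <- ordered[c(3, 1, 4, 2), c(2, 4, 1, 3)]

res <- ca_order_sketch(shuffled)
shuffled[res$rows, res$columns]
```

The sign of a singular vector is arbitrary, so the recovered sequence may run in either direction. The toy matrix is symmetric under reversal, which is why the rearranged matrix looks the same both ways.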

4.2 Refining

Peeples and Schachner (2012) propose a procedure to identify samples that are subject to sampling error, or that have underlying structural relationships, and that might be influencing the ordering along the CA space. This relies on a partial bootstrap approach to CA-based seriation where each sample is replicated n times. The maximum dimension length of the convex hull around the point cloud of each sample is then used to remove the samples that exceed a given cutoff value.

According to Peeples and Schachner (2012), “[this] point removal procedure [results in] a reduced dataset where the position of individuals within the CA are highly stable and which produces an ordering consistent with the assumptions of frequency seriation.”
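The partial bootstrap can be sketched for a single sample as follows. The block makes several assumptions that are not taken from the source: replicates are drawn from a multinomial distribution with the observed counts as probabilities, they are projected onto the first two CA dimensions as supplementary rows, and the "maximum dimension length" is taken to be the largest distance between two vertices of the convex hull. The helper hull_length_sketch() is hypothetical and is not the code behind refine_seriation.

```r
## Hypothetical helper (not the tabula implementation): bootstrap one row of
## a count table and measure the spread of its replicates in the CA plane.
hull_length_sketch <- function(x, i, n = 100) {
  x <- as.matrix(x)
  ## CA of the full table (first two dimensions)
  P <- x / sum(x)
  row_mass <- rowSums(P)
  col_mass <- colSums(P)
  expected <- outer(row_mass, col_mass)
  dec <- svd((P - expected) / sqrt(expected))
  col_std <- dec$v[, 1:2] / sqrt(col_mass)  # column standard coordinates
  ## n multinomial replicates of row i, one replicate per row
  reps <- t(stats::rmultinom(n, size = sum(x[i, ]), prob = x[i, ]))
  ## Project the replicate profiles as supplementary rows
  coords <- (reps / rowSums(reps)) %*% col_std
  ## Convex hull of the point cloud and its largest vertex-to-vertex distance
  hull <- coords[grDevices::chull(coords), , drop = FALSE]
  max(stats::dist(hull))
}

## Toy abundance matrix
counts <- matrix(c(10,  5,  0,  0,
                    5, 10,  5,  0,
                    0,  5, 10,  5,
                    0,  0,  5, 10), nrow = 4, byrow = TRUE)

set.seed(123)
hull_length_sketch(counts, i = 2, n = 200)
```

Repeating this for every row gives one length per sample, and samples whose length exceeds the cutoff would be marked for removal. Small samples tend to produce wider hulls because their replicates vary more.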

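## Replicates Peeples and Schachner 2012 results

## Coerce dataset to an abundance matrix
## (assumes the zuni dataset shipped with tabula)
zuni_counts <- as_count(zuni)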

## Samples with convex hull maximum dimension length greater than the cutoff
## value will be marked for removal.
## Define cutoff as one standard deviation above the mean
fun <- function(x) { mean(x) + sd(x) }

## Get indices of samples to be kept
## Warning: this may take a few seconds!
set.seed(123)
(zuni_keep <- refine_seriation(zuni_counts, cutoff = fun, n = 1000))
#> Partial bootstrap CA seriation refinement:
#> - Cutoff values: 2.22 (rows) - 0.37 (columns)
#> - Rows to keep: 349 of 420 (83%)
#> - Columns to keep: 14 of 18 (78%)

## Plot convex hull
## blue: convex hull for samples; red: convex hull for types
### All bootstrap samples
ggplot2::ggplot(mapping = ggplot2::aes(x = x, y = y, group = id)) +
  ggplot2::geom_vline(xintercept = 0, linetype = 2) +
  ggplot2::geom_hline(yintercept = 0, linetype = 2) +
  ggplot2::geom_polygon(data = zuni_keep[["rows"]], 
                        fill = "blue", alpha = 0.05) +
  ggplot2::geom_polygon(data = zuni_keep[["columns"]], 
                        fill = "red", alpha = 0.5) +
  ggplot2::coord_fixed() + 
  ggplot2::labs(title = "Whole dataset", x = "Dim. 1", y = "Dim. 2") + 
  ggplot2::theme_bw()
### Only retained samples
ggplot2::ggplot(mapping = ggplot2::aes(x = x, y = y, group = id)) +
  ggplot2::geom_vline(xintercept = 0, linetype = 2) +
  ggplot2::geom_hline(yintercept = 0, linetype = 2) +
  ggplot2::geom_polygon(data = subset(zuni_keep[["rows"]], 
                                      id %in% names(zuni_keep[["keep"]][[1]])),
                        fill = "blue", alpha = 0.05) +
  ggplot2::geom_polygon(data = zuni_keep[["columns"]], 
                        fill = "red", alpha = 0.5) +
  ggplot2::coord_fixed() + 
  ggplot2::labs(title = "Selected samples", x = "Dim. 1", y = "Dim. 2") + 
  ggplot2::theme_bw()

## Histogram of convex hull maximum dimension length
hull_length <- cbind.data.frame(length = zuni_keep[["lengths"]][[1]])
ggplot2::ggplot(data = hull_length, mapping = ggplot2::aes(x = length)) +
  ggplot2::geom_histogram(breaks = seq(0, 4.5, by = 0.5), fill = "grey70") +
  ggplot2::geom_vline(xintercept = fun(hull_length$length), colour = "red") +
  ggplot2::labs(title = "Convex hull max. dim.", 
                x = "Maximum length", y = "Count") + 
  ggplot2::theme_bw()

If the result of refine_seriation is used as an input argument in seriate, a correspondence analysis is performed on the subset of objects that matches the samples to be kept. The excluded samples are then projected onto the dimensions of the CA coordinate space using the row transition formulae. Finally, the row coordinates on the first dimension give the seriation order.
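The projection step can be illustrated in base R. In correspondence analysis, the row transition formula gives the principal coordinate of a row as the average of the column standard coordinates weighted by the row profile, and the same formula places rows that took no part in the analysis. The sketch below is not the tabula implementation: project_rows_sketch() and the toy data are hypothetical.

```r
## Hypothetical helper (not the tabula implementation): CA on the kept rows,
## then projection of excluded rows on the first dimension.
project_rows_sketch <- function(active, supplementary) {
  active <- as.matrix(active)
  supplementary <- as.matrix(supplementary)
  P <- active / sum(active)
  row_mass <- rowSums(P)
  col_mass <- colSums(P)
  expected <- outer(row_mass, col_mass)
  dec <- svd((P - expected) / sqrt(expected))
  ## Principal coordinates of the active rows on dimension 1
  active_coord <- dec$d[1] * dec$u[, 1] / sqrt(row_mass)
  ## Row transition formula: profile-weighted mean of column standard coordinates
  col_std <- dec$v[, 1] / sqrt(col_mass)
  profiles <- supplementary / rowSums(supplementary)
  supp_coord <- as.vector(profiles %*% col_std)
  list(active = stats::setNames(active_coord, rownames(active)),
       supplementary = stats::setNames(supp_coord, rownames(supplementary)))
}

## Toy abundance matrix: sample s3 is excluded, then projected
counts <- matrix(c(10,  5,  0,  0,
                    5, 10,  5,  0,
                    0,  5, 10,  5,
                    0,  0,  5, 10), nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("s", 1:4), paste0("t", 1:4)))
kept <- counts[c(1, 2, 4), ]
excluded <- counts[3, , drop = FALSE]

res <- project_rows_sketch(kept, excluded)

## Seriation order of all samples along the first dimension
names(sort(c(res$active, res$supplementary)))
```

Applying the formula to the kept rows themselves returns their own principal coordinates, which is a convenient sanity check for this kind of projection.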

References

Bertin, Jacques. 1977. La graphique et le traitement graphique de l’information. Nouvelle bibliothèque scientifique. Paris: Flammarion.

Desachy, Bruno. 2004. “Le sériographe EPPM: un outil informatisé de sériation graphique pour tableaux de comptages.” Revue archéologique de Picardie 3 (1): 39–56. https://doi.org/10.3406/pica.2004.2396.

Dunnell, Robert C. 1970. “Seriation Method and Its Evaluation.” American Antiquity 35 (3): 305–19. https://doi.org/10.2307/278341.

Ford, J. A. 1962. A Quantitative Method for Deriving Cultural Chronology. Technical Manual 1. Washington, DC: Pan American Union.

Ihm, Peter. 2005. “A Contribution to the History of Seriation in Archaeology.” In Classification the Ubiquitous Challenge, edited by Claus Weihs and Wolfgang Gaul, 307–16. Berlin Heidelberg: Springer. https://doi.org/10.1007/3-540-28084-7_34.

Peeples, Matthew A., and Gregson Schachner. 2012. “Refining Correspondence Analysis-Based Ceramic Seriation of Regional Data Sets.” Journal of Archaeological Science 39 (8): 2818–27. https://doi.org/10.1016/j.jas.2012.04.040.