Creating Neighbours using sf objects

Introduction

This vignette tracks the legacy nb vignette, which was based on part of the first (2008) edition of ASDAR. It adds hints to the code in the nb vignette to use the sf vector representation instead of the sp vector representation to create neighbour objects.

Summary

This is a summary of the results below:

  • In general, if you need to reproduce results from using sp objects in spdep, coerce sf objects to sp objects before constructing neighbour objects (particularly if polygon centroids are used for point representation).

  • Further, for new work, you can either coerce from sf to sp objects and just use spdep to create nb objects, or use sf functions to create sparse geometry binary predicate objects and coerce these to neighbour objects.

  • Polygon validity matters: sf geometries need to be valid; sp geometries (and their use in spdep) pre-date OGC SF validity.

  • sf functions (st_relate, st_is_within_distance, st_centroid, etc.) appear to be an order of magnitude slower than equivalent sp/spdep/rgeos functions (poly2nb, dnearneigh, gCentroid, etc.)

  • sf functions appear to scale linearly in n, like sp/spdep functions

Data set

We’ll use the whole NY 8 county set of boundaries, as they challenge the implementations more than just the Syracuse subset. The description of the input geometries from ADSAR is: New York leukemia: used and documented extensively in Waller and Gotway (2004) and with data made available in Chap. 9 of [http://web1.sph.emory.edu/users/lwaller/ch9index.htm] (http://web1.sph.emory.edu/users/lwaller/ch9index.htm); the data import process is described in the help file of NY_data in spdep; geometries downloaded from the CIESIN server at https://sedac.ciesin.columbia.edu/data/set/acrp-boundary-1992/data-download, file /pub/census/usa/tiger/ny/bna_st/t8_36.zip, and extensively edited; a zip archive NY_data.zip of shapefiles and a GAL format neighbours list is on the book website. Further, the zipfile is now at: [a new location requiring login] (http://sedac.ciesin.columbia.edu/ftpsite/pub/census/usa/tiger/ny/bna_st/t8_36.zip). The object listw_NY is directly imported from nyadjwts.dbf on the Waller & Gotway (2004) chapter 9 website.

The version of the New York 8 counties geometries used in ASDAR and included as a shapefile in spdep was converted from the original BNA file using an external utility program to convert to MapInfo format and converted on from there using GDAL 1.4.1 (the OGR BNA driver was not then available; it entered OGR at 1.5.0, release at the end of 2007), and contains invalid geometries. What was found in mid-2007 was that included villages were in/excluded by in-out umbilical cords to the boundary of the enclosing tract, when the underlying BNA file was first converted to MapInfo (holes could not exist then).

Here we will use a GPKG file created as follows (rgdal could also be used with the same output; GDAL here is built with GEOS, so the BNA vector driver will use geometry tests: The BNA driver supports reading of polygons with holes or lakes. It determines what is a hole or a lake only from geometrical analysis (inclusion, non-intersection tests) and ignores completely the notion of polygon winding (whether the polygon edges are described clockwise or counter-clockwise). GDAL must be built with GEOS enabled to make geometry test work.):

library(sf)
sf_bna <- st_read("t8_36.bna", stringsAsFactors=FALSE)
table(st_is_valid(sf_bna))
sf_bna$AREAKEY <- gsub("\\.", "", sf_bna$Primary.ID)
data(NY_data, package="spData")
key <- as.character(nydata$AREAKEY)
sf_bna1 <- sf_bna[match(key, sf_bna$AREAKEY), c("AREAKEY")]
sf_bna2 <- merge(sf_bna1, nydata, by="AREAKEY")
sf_bna2_utm18 <- st_transform(sf_bna2, "+proj=utm +zone=18 +datum=NAD27")
st_write(sf_bna2_utm18, "NY8_bna_utm18.gpkg")

nb and listw objects (copied from the nb_igraph vignette)

Since the spdep package was created, spatial weights objects have been constructed as lists with three components and a few attributes, in old-style class listw objects. The first component of a listw object is an nb object, a list of n integer vectors, with at least a character vector region.id attribute with n unique values (like the row.names of a data.frame object); n is the number of spatial entities. Component i of this list contains the integer identifiers of the neighbours of i as a sorted vector with no duplication and values in 1:n; if i has no neighbours, the component is a vector of length 1 with value 0L. The nb object may contain an attribute indicating whether it is symmetric or not, that is whether i is a neighbour of j implies that j is a neighbour of i. Some neighbour definitions are symmetric by construction, such as contiguities or distance thresholds, others are asymmetric, such as k-nearest neighbours. The nb object redundantly stores both i-j and j-i links.

The second component of a listw object is a list of n numeric vectors, each of the same length as the corresponding non-zero vectors in the nbobject. These give the values of the spatial weights for each i-j neighbour pair. It is often the case that while the neighbours are symmetric by construction, the weights are not, as for example when weights are row-standardised by dividing each row of input weights by the count of neighbours or cardinality of the neighbour set of i. In the nb2listwfunction, it is also possible to pass through general weights, such as inverse distances, shares of boundary lengths and so on.

The third component of a listw object records the style of the weights as a character code, with "B" for binary weights taking values zero or one (only one is recorded), "W" for row-standardised weights, and so on. In order to subset listw objects, knowledge of the style may be necessary.

Comparison of sp and sf approaches

First some housekeeping and setup to permit this vignette to be built when packages are missing or out-of-date:

Let us read the GPKG file with valid geometries in to ‘sf’ and ‘sp’ objects:

Contiguity neighbours for polygon support

Here we first generate a queen contiguity nb object using the legacy spdep approach. This first either uses a pre-computed list of vectors of probable neighbours or finds intersecting bounding boxes internally. Then the points on the boundaries of each set of polygons making up an observation are checked for a distance less than snap to any of the points of the set of polygons making up an observation included in the set of candidate neighbours. Because contiguity is symmetric, only i to j contiguities are tested. A queen contiguity is found as soon as one point matches, a rook contiguity as soon as two points match:

Using rgeos STR trees to check the intersection of envelopes (bounding boxes) is much faster than the internal method in poly2nb for large n. Because contiguity is symmetric by definition, the queries only return intersections for higher indices.

Using sf::st_relate, we can define an un-snapped relational pattern for queen contiguities:

The output from st_queen is a list with attributes:

As we can see, the sf-based contiguity test is an order of magnitude slower than spdep::poly2nb; fortunately, it also scales linearly in the number of observations. spdep::poly2nb uses two heuristics, first to find candidate neighbours from intersecting bounding boxes, and second to use the symmetry of the relationship to halve the number of remaining tests. This means that performance is linear in n, but with overhead for identifying candidates, and back-filling symmetric neighbours. Further, spdep::poly2nb stops searching for queen contiguity as soon as the first neighbour point is found within snap distance (if not identical, which is tested first); second neighbour point for rook contiguities. spdep::poly2nb was heavily optimised when written, as processor speed was a major constraint at that time.

The addition of STR tree queries to identify candidates permits the construction of contiguous neighbour objects for quite large objects, for example the ZCTA 2016 shapefile with 33144 features. sf::st_read imports the data in about 3 s., rgdal::readOGR in under 8 s; in both cases the polygon geometries are valid. Finding the candidate neighbours with rgeos::gUnarySTRtreeQuery takes 4.5 s, and spdep::poly2nb a further 4.4 s. So for the sp variant, the total time is about 17 s., and using sf::st_read 3 s. and coercion to sp 7.5 s., then rgeos::gUnarySTRtreeQuery 4.5 s, and spdep::poly2nb 4.4 s., in total about 20 s. Running st_queen defined above using sf::st_relate takes about 136 s. for a total of 139 s. to generate a queen neighbour object. The contiguity neighbour objects using st_queen and spdep::poly2nb are identical.

Using sf::st_is_within_distance to emulate the snap= argument in spdep::poly2nb is very time-consuming; it takes more than 70 s. to run:

After removal of self-contiguities yields the same sets of neighbours.

We can convert an object of class sgbp (sparse geometry binary predicate) to nb in this way, taking care to represent observations with no neighbours with integer 0:

The neighbour objects produced by st_queen and spdep::poly2nb contain the same sets of neighbours:

To get around the time penalty of using GEOS functions in sf to find contiguous neighbours, we may coerce to the sp representation first:

It is the use of GEOS functionality that costs time, as we can see by using rgeos::gTouches:

or equivalently sf::st_touches with very similar timings, like those of sf::st_relate in st_queen:

The output objects are the same, once again. What we now have as queen contiguity neighbours are:

Contiguity neighbours from invalid polygons

Next, we explore a further possible source of differences in neighbour object reproduction, using the original version of the tract boundaries used in ASDAR, but with some invalid geometries as mentioned earlier:

We can see that there are a number of differences between the neighbour sets derived from the fully valid geometries and the older partly invalid set:

spdep::poly2nb does not object to using invalid geometries, as it only uses the boundary points defining the polygons (as do the rgeos STR tree construction and query functions, because points are sufficient to construct bounding boxes).

Using the standard “trick”, we can buffer by 0 to try to make the geometries valid:

Hoverver, in doing this, we change the geometries, so the new sets of neighbours still differ from those made with the valid geometries in the same ways as before imposing validity:

Tne neighbour sets are the same for the old boundaries with or without imposing validity:

Using the sf route, we also see invalid geometries:

On trying to find contiguities, sf::st_relate fails:

Using the buffer by 0 approach, we can impose validity:

We can now find the neighbours, but with differences from the sets found from the valid polygons:

There are also differences between the sets found when imposing validity on the sf and sp routes.

Point-based neighbours

Finding points for polygon objects

Getting a 2D matrix of centroid coordinates (centroids of largest exterior ring) from SpatialPolygons* is just ‘coordinates’, which takes trivial time because the values are computed (as label points) when the objects are constructed:

‘row.names’ gives the IDs (from feature FIDs if read with ‘rgdal::readOGR’):

We can use ‘st_centroid’ to get the centroids; if the sf object is sfc_MULTIPOLYGON, the ‘of_largest_polygon’ attribute should be set to replicate sp ‘coordinates’ behaviour. Curiously, st_centroid is quite time-consuming:

Before getting the coordinate matrix, we need to drop any Z or M coordinates:

We need to check whether coordinates are planar or not:

Then ‘st_coordinates’ can be used to get the coordinate matrix (here simply copying-out 2D point coordinates):

Unfortunately, the coordinate matrices differ:

The sp route derives the point coordinate values from the centroid of the largest member polygon of the observation, treated as its exterior ring. In sf, and qualifying to request the centroid of the largest POLYGON in a MULTIPOLYGON object, the centroid takes account of possible interior rings. Both are centroids, but the sp coordinates method for Polygons objects returns that of the largest (not metric, treating coordinates as planar) exterior ring, while the sf centroid function takes account of interior rings:

The reason for the discrepancy is that the label point (a.k.a. centroid) returned by sp::coordinates for Polygons objects takes the centroid gross of holes. This is not done by sf::st_centroid, nor by rgeos::gCentroid (which is much quicker than st_centroid even when using the same underlying GEOS code and converting to GEOS SF representation internally):

The output coordinates are still off, but much less:

The affected objects are not those with holes, and probably discrepancies between the sp and sf object implementations of the GEOS function may be disregarded.

Graph-based neighbours

From this, we can check the graph-based neighbours (planar coordinates only):

Using ‘st_triangulate’ is unsatisfactory, as is ‘rgeos::gDelaunayTriangulation’ because they do not maintain node order, see the example in ‘?rgeos::gDelaunayTriangulation’ for details:

The ‘soi.graph’ function may be re-written to use sf functionality internally, but for now just the triangulation and coordinates:

The discrepancy is for one of the polygons with a hole:

Here the difference between the points is enough to remove the triangulation link between 97 and 99 under the sphere of influence criterion.

If we use the GEOS-based coordinates, the distance is reduced to zero:

and the neighbour objects are the same. Using rgeos::gCentroid centroids rather than sp label points would be the choice to make for new work:

If we coerce from sf to sp representation before extracting point coordinates, we get:

It turns out that the geometries are identical after coercion (as we would expect from obtaining the same contiguity neighbours above):

so we can reproduce the sp-based label points by coercing first; if we need to reproduce existing work, this is the best choice:

The remaining two graph-based methods appear to be less sensitive to the differences between the coordinates:

K-nearest neighbours

K-nearest neighbours use the coordinate matrices, and can handle Great Circle distances, but this is not demonstrated here, as the data set used is planar:

While the first nearest neighbours are found correctly for this data set, the same sensitivity is present with regard to coordinate position at larger k:

Distance neighbours

Distance neighbours need a threshold - nbdists shows the maximum distance to first nearest neighbour:

dnearneigh can also handle Great Circle distances, but this is not demonstrated here, as the data set used is planar:

We could use more sf functionality, but it is an order of magnitude slower, through buffering all points by the threshold while distance is symmetric, so i to j and j to i are equal. Buffering also presupposes planar coordinates. In addition, the point itself intersects, so has to be removed:

And it does not match:

This time the problem is in the number of line segments in each buffer circle quadrant, so if we smooth the circle, things get better:

We can do the same with sf::st_is_within_distance too: