Introduction

The objective of this prototype cluster analysis it to perform unsupervised learning to identify distinct and homogeneous market areas in Dallas County with respect to their

socio-economic,
demographic and
infrastructure

characteristics of the underlying census tracts.

The package “ClustGeo” is the workhorse of this analysis. It performs hierarchical cluster analysis using the Ward parametrization of the Lance & Williams algorithm under an imposed spatial contiguity constraint.

Potential improvements could be:

Include additional features of the census tracts to the analysis. However, maintain a balance between the socio-economic, demographic and infrastructure themes. These features could be:
- school district information,
- characteristics of the housing and rental stock as well as occupancy,
- available employment opportunities etc.
Add relevant geographic characteristics to the analysis such as the distance to the CBD, road network or recreational amenities.
Depending on data availability increase the spatial resolution and move from census tracts to block groups.
Increase the number of clusters in order to boost the homogeneity of the market areas.
Edit the spatial contiguity matrix to introduce physical boundaries between market areas, for instance, tracts located to left or right of the Trinity river.

Definition of the spatial dissimilarity metric

Set the spatial tessellation of the census tracts up and prepare the contiguity space dissimilarity among the census tracts.

data(tractShp)
tractShp <- tractShp[tractShp$NIGHTPOP!=0, ]         # Remove 2 airport tracts with NA's

##
## Identify spatial neighboring census tracts and process them
##
nb <- spdep::poly2nb(tractShp, queen=F)              # extract first order neighbors links
B <- spdep::nb2mat(nb, style="B")                    # convert neighbor list to binary matrix

## Visualize first order neighbors
plot(tractShp, col="palegreen3", border=grey(0.9), axes=T) 
plot(nb, coords=coordinates(tractShp), pch=19, cex=0.1, col="blue", add=T)
title("Spatial Neighbors Links among Tracts")


## Transform the contiguity matrix into spatial dissimilarity matrix
geoDist <- 1-B; 
diag(geoDist) <- 0; 
geoDist <- as.dist(geoDist)

##
## Alternative spatial dissimilarity matrices
##
# # Calculate shortest path distances between the tracts
# BNa <- ifelse(B==0,NA,B)                             # recode 0's to NA's
# allPath <- e1071::allShortestPaths(BNa)              # calucate the shortest path among all nodes
# topoDist <- as.dist(allPath$length)                  # number of steps from origin to destination node
# e1071::extractPath(allPath, 1, 527)                  # example: path between tract 1 and tract 527
#
# # spherical distances matrix among tracts in km
# sphDist <- as.dist(sp::spDists(tractShp, longlat=T))

Definition of the feature dissimilarity metric

Set the variables of the census tracts up, impute 25 missing median home values and define the feature space dissimilarity matrix.

##
## Select variables for feature dissimilarity organized into three groups:
##   [a] demographic features
##   [b] socio-economic features
##   [c] housing infrastructure features
##

## Calculate additional features
tractShp$PRE1960 <- tractShp$PCTB1950+tractShp$PCTB1940+tractShp$PCTBPRE
tractShp$POST2000 <- tractShp$PCTB2000+tractShp$PCTB2010

## Feature list
varKeep <- c("PCTWHITE","PCTBLACK","PCTHISPAN","MEDAGE",
             "PCTUNIVDEG","HHMEDINC","PCTUNEMP","PCTNOHINS",
             "POPDEN","PCTDAYPOP","PCTHUVAC","MEDVALHOME","PRE1960","POST2000")
xVars <- tractShp@data
xVars <- xVars[varKeep]
summary(xVars)
R>     PCTWHITE          PCTBLACK        PCTHISPAN         MEDAGE     
R>  Min.   :  5.586   Min.   : 0.000   Min.   : 0.00   Min.   :20.30  
R>  1st Qu.: 48.713   1st Qu.: 5.062   1st Qu.:17.13   1st Qu.:29.70  
R>  Median : 66.040   Median :14.509   Median :34.02   Median :33.10  
R>  Mean   : 62.669   Mean   :21.250   Mean   :38.36   Mean   :34.55  
R>  3rd Qu.: 79.745   3rd Qu.:27.923   3rd Qu.:59.53   3rd Qu.:38.60  
R>  Max.   :100.000   Max.   :93.397   Max.   :93.71   Max.   :65.70  
R>                                                                    
R>    PCTUNIVDEG       HHMEDINC         PCTUNEMP        PCTNOHINS    
R>  Min.   : 1.00   Min.   : 14754   Min.   : 0.000   Min.   : 0.00  
R>  1st Qu.:11.00   1st Qu.: 39048   1st Qu.: 3.460   1st Qu.:12.10  
R>  Median :22.00   Median : 51139   Median : 5.393   Median :22.00  
R>  Mean   :30.64   Mean   : 62459   Mean   : 5.973   Mean   :21.61  
R>  3rd Qu.:47.00   3rd Qu.: 72096   3rd Qu.: 7.697   3rd Qu.:30.48  
R>  Max.   :91.00   Max.   :250001   Max.   :25.191   Max.   :56.51  
R>                                                                   
R>      POPDEN           PCTDAYPOP        PCTHUVAC        MEDVALHOME     
R>  Min.   :   82.61   Min.   :13.96   Min.   : 0.000   Min.   :  12600  
R>  1st Qu.: 2363.97   1st Qu.:26.96   1st Qu.: 4.512   1st Qu.:  93225  
R>  Median : 3708.51   Median :33.11   Median : 7.355   Median : 133450  
R>  Mean   : 5231.91   Mean   :38.77   Mean   : 8.230   Mean   : 208912  
R>  3rd Qu.: 5382.85   3rd Qu.:44.88   3rd Qu.:11.221   3rd Qu.: 238850  
R>  Max.   :80322.40   Max.   :99.81   Max.   :43.878   Max.   :2000001  
R>                                                      NA's   :25       
R>     PRE1960          POST2000     
R>  Min.   : 0.000   Min.   : 0.000  
R>  1st Qu.: 2.506   1st Qu.: 3.147  
R>  Median :11.378   Median : 8.905  
R>  Mean   :22.478   Mean   :15.460  
R>  3rd Qu.:37.149   3rd Qu.:22.819  
R>  Max.   :92.321   Max.   :89.306  
R> 

## Replace missing values in MEDVALHOME by neighbors average
xVars <- DMwR::knnImputation(xVars, k=5, scale=T, meth="weighAve")
summary(xVars$MEDVALHOME)
R>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
R>   12600   93950  134000  208647  242093 2000001

## ID each row of xVars
row.names(xVars) <- 1:nrow(xVars)

## Calculate feature distance matrix
featDist <- dist(scale(xVars))

Identification of the combined dissimilarity metric

Identify the \(\alpha\) parameter which balances the convex mixture of the feature space dissimilarities (black curve) with the contiguity space dissimilarities (red curve):

\[(1-\alpha)*FeatureDissimilarity + \alpha*ContiguityDissimilarity\].

Preference should be given to the feature dissimilarity (black) and for neither dissimilarity metric the intra-cluster homogeneity should not drop rapidly for the sequence of \(\alpha\)-values. Low \(\alpha\) values may lead to spatially split clusters.

For this prototype analysis the number of clusters is set to \(K=12\), because \(12\) is the maximum number of factor levels that the function “mapColorQual” can display. A work-around to increase the number of clusters is to split \(K>12\) clusters across several maps of maximal \(12\) factor levels. Please see step 8 below.

The selected mixture parameter \(\alpha\) is set at \(\alpha=0.2\) because for smaller \(\alpha\)-values the intra-cluster homogeneity for spatial contiguity metric drops rapidly.

##
## Evaluate mixture of feature and spatial dissimilarity.
##
K <- 12                            # Number of distinct clusters
range.alpha <- seq(0, 1, by=0.1)   # Evaluation range
cr <- choicealpha(featDist, geoDist, range.alpha, K, graph=TRUE)

Spatially constrained cluster analysis

Show the cluster generation history in a dendrogram and highlight the 12 market areas of census tracts.

## 
## Perform spatially constrained cluster analysis
##
tree <- hclustgeo(featDist, geoDist, alpha=0.2)
plot(tree, hang=-1)
rect.hclust(tree, k=K)

Evaluate the number of census tracts per market area.

## Number of census tracts per market area
neighClus <- as.factor(cutree(tree, K))        # Determine cluster membership
table(neighClus)                               # number of tracts in each cluster
R> neighClus
R>  1  2  3  4  5  6  7  8  9 10 11 12 
R> 33 23 92 32 55 91 46 59 23  9 22 42

Evaluation of the identified market areas

Map the identified homogeneous market area. An interpretation can be performed by their locations within Dallas County and by the levels and dispersion of their underlying features.

Note that some internally homogeneous clusters of market areas are distributed disjunctly across Dallas County. This implies that areas of similar characteristics can be found in several locations of Dallas County.

Map identified market areas

##
## Map Results
##
mapColorQual(neighClus, tractShp, 
             map.title ="Spatially constrained Cluster Analysis",
             legend.title="Cluster\nId.", legend.cex=0.9)
plot(lakesShp, col="skyblue", border="skyblue",add=T)
plot(hwyShp, col="cornsilk2", lwd=4, add=T)

Attribute analysis of the identified market areas

Evaluate the unique feature compositions for each market area. A market area should significantly differ from the remaining market areas in its variable profile by at least one feature or a combination of features.

To maintain a sufficient visual resolution, the variable box-plots by market areas are split into two panels.

##
## Evaluate cluster characteristics
##
plotBoxesByFactor(xVars[,1:7], neighClus, ncol=2, zTrans=T, varwidth=F)

plotBoxesByFactor(xVars[,8:14], neighClus, ncol=2, zTrans=T, varwidth=F)

Map selected clusters

To map just a set of selected clusters or, alternatively, for more than 12 generated clusters a sequence of maps with consecutive subsets of clusters the remaining clusters are set to NA. Tract with NA’s are displayed in light grey.

Clusters 4 and 10 are predominant employment centers with a proportionally higher number of day-time population relative to the nighttime population.

## Select subset nLevels[c(4,10)] and set remaining clusters to NA
nLevels <- levels(neighClus)
neighClusSet <- as.factor(ifelse(neighClus %in% nLevels[c(4,10)], neighClus, NA))
mapColorQual(neighClusSet, tractShp, 
             map.title ="Clusters of Major Employment Centers",
             legend.title="Cluster\nId.", legend.cex=0.9)
plot(lakesShp, col="skyblue", border="skyblue",add=T)
plot(hwyShp, col="cornsilk2", lwd=4, add=T)

Dallas Market Area Analysis with Contiguity Constraint

Michael Tiefelsdorf

2020-02-23