HiClimR: Hierarchical Climate Regionalization
HiClimR is a tool for Hierarchical Climate Regionalization applicable to any correlation-based clustering. Climate regionalization is the process of dividing an area into smaller regions that are homogeneous with respect to a specified climatic metric. Several features are added to facilitate the applications of climate regionalization (or spatiotemporal analysis in general) and to implement a cluster validation function with an objective tree cutting to find an optimal number of clusters for a user-specified confidence level. These include options for preprocessing and postprocessing as well as efficient code execution for large datasets and options for splitting big data and computing only the upper-triangular half of the correlation/dissimilarity matrix to overcome memory limitations. Hybrid hierarchical clustering reconstructs the upper part of the tree above a cut to get the best of the available methods. Multivariate clustering (MVC) provides options for filtering all variables before preprocessing, detrending and standardization of each variable, and applying weights for the preprocessed variables.
HiClimR adds several features and a new clustering method (called, regional linkage) to hierarchical clustering in R (hclust function in stats library) including:
BLAS library on 64-Bit machines
ATLASOpenBLASIntel MKLregional linkage or minimum inter-regional correlationward’s minimum variance or error sum of squares methodsingle linkage or nearest neighbor methodcomplete linkage or diameteraverage linkage, group average, or UPGMA methodmcquitty’s or WPGMA methodmedian, Gower’s or WPGMC methodcentroid or UPGMC methodregional linkage methodThe regional linkage method is explained in the context of a spatiotemporal problem, in which N spatial elements (e.g., weather stations) are divided into k regions, given that each element has a time series of length M. It is based on inter-regional correlation distance between the temporal means of different regions (or elements at the first merging step). It modifies the update formulae of average linkage method by incorporating the standard deviation of the merged region timeseries, which is a function of the correlation between the individual regions, and their standard deviations before merging. It is equal to the average of their standard deviations if and only if the correlation between the two merged regions is 100%. In this special case, the regional linkage method is reduced to the classic average linkage clustering method.
Badr et al. (2015) describes the regionalization algorithms, features, and data processing tools included in the package and presents a demonstration application in which the package is used to regionalize Africa on the basis of interannual precipitation variability. The figure below shows a detailed flowchart for the package. Cyan blocks represent helper functions, green is input data or parameters, yellow indicates agglomeration Fortran code, and purple shows graphics options. For multivariate clustering (MVC), the input data is a list of matrices (one matrix for each variable with the same number of rows to be clustered; the number of columns may vary per variable). The blue dashed boxes involve a loop for all variables to apply mean and/or variance thresholds, detrending, and/or standardization per variable before weighing the preprocessed variables and binding them by columns in one matrix for clustering. x is the input N x M data matrix, xc is the coarsened N0 x M data matrix where N0 ≤ N (N0 = N only if lonStep = 1 and latStep = 1), xm is the masked and filtered N1 x M1 data matrix where N1 ≤ N0 (N1 = N0 only if the number of masked stations/points is zero) and M1 ≤ M (M1 = M only if no columns are removed due to missing values), and x1 is the reconstructed N1 x M1 data matrix if PCA is performed.

HiClimR is applicable to any correlation-based clustering.
There are many ways to install an R package from precompiled binaries or source code. For more details, you may search for how to install an R package, but here are the most convenient ways to install HiClimR:
This is the easiest way to install an R package on Windows, Mac, or Linux. You just fire up an R shell and type:
In theory the package should just install, however, you may be asked to select your local mirror (i.e. which server should you use to download the package). If you are using R-GUI or R-Studio, you can find a menu for package installation where you can just search for HiClimR and install it.
This is intended for developers and requires a development environment (compilers, libraries, … etc) to install the latest development release of HiClimR. On Linux and Mac, you can download the source code and use R CMD INSTALL to install it. In a convenient way, you may use devtools as follows:
devtools from CRAN:Make sure you have a working development environment:
Rtools.Install HiClimR from GitHub source:
The source code repository can be found on GitHub at hsbadr/HiClimR.
HiClimR is licensed under GPL-2 | GPL-3. The code is modified by Hamada S. Badr from src/library/stats/R/hclust.R part of R package Copyright © 1995-2019 The R Core Team.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
A copy of the GNU General Public License is available at https://www.r-project.org/Licenses.
Copyright © 2013-2019 Earth and Planetary Sciences (EPS), Johns Hopkins University (JHU).
To cite HiClimR in publications, please use:
Hamada S. Badr, Zaitchik, B. F. and Dezfuli, A. K. (2015): A Tool for Hierarchical Climate Regionalization, Earth Science Informatics, 8(4), 949-958, https://doi.org/10.1007/s12145-015-0221-7.
Hamada S. Badr, Zaitchik, B. F. and Dezfuli, A. K. (2014): HiClimR: Hierarchical Climate Regionalization, Comprehensive R Archive Network (CRAN), https://cran.r-project.org/package=HiClimR.
| Version | Date | Comment | Author | |
|---|---|---|---|---|
| May 1992 | Original | F. Murtagh | ||
| Dec 1996 | Modified | Ross Ihaka | ||
| Apr 1998 | Modified | F. Leisch | ||
| Jun 2000 | Modified | F. Leisch | ||
| 1.0.0 | 03/07/14 | HiClimR | Hamada S. Badr | badr@jhu.edu |
| 1.0.1 | 03/08/14 | Updated | Hamada S. Badr | badr@jhu.edu |
| 1.0.2 | 03/09/14 | Updated | Hamada S. Badr | badr@jhu.edu |
| 1.0.3 | 03/12/14 | Updated | Hamada S. Badr | badr@jhu.edu |
| 1.0.4 | 03/14/14 | Updated | Hamada S. Badr | badr@jhu.edu |
| 1.0.5 | 03/18/14 | Updated | Hamada S. Badr | badr@jhu.edu |
| 1.0.6 | 03/25/14 | Updated | Hamada S. Badr | badr@jhu.edu |
| 1.0.7 | 03/30/14 | Hybrid | Hamada S. Badr | badr@jhu.edu |
| 1.0.8 | 05/06/14 | Updated | Hamada S. Badr | badr@jhu.edu |
| 1.0.9 | 05/07/14 | CRAN | Hamada S. Badr | badr@jhu.edu |
| 1.1.0 | 05/15/14 | Updated | Hamada S. Badr | badr@jhu.edu |
| 1.1.1 | 07/14/14 | Updated | Hamada S. Badr | badr@jhu.edu |
| 1.1.2 | 07/26/14 | Updated | Hamada S. Badr | badr@jhu.edu |
| 1.1.3 | 08/28/14 | Updated | Hamada S. Badr | badr@jhu.edu |
| 1.1.4 | 09/01/14 | Updated | Hamada S. Badr | badr@jhu.edu |
| 1.1.5 | 11/12/14 | Updated | Hamada S. Badr | badr@jhu.edu |
| 1.1.6 | 03/01/15 | GitHub | Hamada S. Badr | badr@jhu.edu |
| 1.2.0 | 03/27/15 | MVC | Hamada S. Badr | badr@jhu.edu |
| 1.2.1 | 05/24/15 | Updated | Hamada S. Badr | badr@jhu.edu |
| 1.2.2 | 07/21/15 | Updated | Hamada S. Badr | badr@jhu.edu |
| 1.2.3 | 08/05/15 | Updated | Hamada S. Badr | badr@jhu.edu |
| 2.0.0 | 12/22/18 | NOTE | Hamada S. Badr | badr@jhu.edu |
| 2.1.0 | 01/01/19 | NetCDF | Hamada S. Badr | badr@jhu.edu |
| 2.1.1 | 01/02/19 | Updated | Hamada S. Badr | badr@jhu.edu |
| 2.1.2 | 01/04/19 | Updated | Hamada S. Badr | badr@jhu.edu |
| 2.1.3 | 01/10/19 | Updated | Hamada S. Badr | badr@jhu.edu |
| 2.1.4 | 01/20/19 | Updated | Hamada S. Badr | badr@jhu.edu |
| 2.1.5 | 12/10/19 | inherits | Hamada S. Badr | badr@jhu.edu |
| 2.1.6 | 02/22/20 | Updated | Hamada S. Badr | badr@jhu.edu |
inherits() to check class inheritanceHiClimR2nc: Updated documentation and examples\code{} instead of \bold{} for classesHiClimR2nc: Fixed timeseries variable definitionREADME: Link HiClimR to CRAN package pagefastCor: Fixed row/col names of the correlation matrixfastCor: Cleaned up zero-variance data checkmulti-variate with multivariateweightedVar to weightMVCDESCRIPTIONREADMEfastCor: Removed zero-variance datafastCor: Introduced optBLASfastCor: Code cleanupREADME and all URLsgeogMask confusing country codes/namesgeogMask filtering InDispute areasx should be created using as.vector(t(x0))x0 is the n by m original data matrixn = length(unique(lon)) and m = length(unique(lat))coarseR now returns the original row numbersREADME corrections and updatesUndefined global functionsREADME corrections and updatespch and cex)geogMask supports ungridded data-180 to 180 (not 0 to 360)rownames(TestCase$x) for example!
longitude,latitudeverbose fixes and updatesREADME corrections and updatesx can now be a list of matrices (one matrix for each variable)
length(x) = nvars where nvars is the number of variablesN = number of objects (e.g., stations) to be clusteredM may vary for each variables
nvars (number of variables)
length(meanThresh) = length(x) = nvarslength(varThresh) = length(x) = nvarslength(detrend) = length(x) = nvarslength(standardize) = length(x) = nvarslength(weightMVC) = length(x) = nvarsmeanThresh and varThresh thresholdsdetrend option, if requestedstandardize option, if requested
weightMVC option (default is 1)fastCor can now split the data matrix into nSplit splitsupperTri to fastCor function
verbose for printing processing informationdendrogram for plotting dendrogram\dontrun{} to skip time-consuming examples
k = 2, for objective tree cutting
k = NULL in validClimR functionformatRError in -mask : invalid argument to unary operator
lonSkip and latSkip renamed to lonStep and latStep, respectivelystandardize = FALSE
data component are now corrected
geogMaskInDispute is added to geogMask function to optionally consider areas in dispute for geographic masking by countryfastCor function that degrades its performance on 32-bit machines has been fixed
BLAS library, such as ATLAS, OpenBLAS, or the commercial Intel MKLres parameter) in geographic masking is removedLazyLoad and LazyData are enabled in the description fileworldMask and TestCase data are converted to lists to avoid conflicts of variable names (lon, lat, info, and mask) with lazy loadinghybrid is added to enable a second clustering step
regional linkage for reconstructing the upper part of the tree at a cutkH (number of clusters to restart with using the regional linkage method)kH = NULL, the tree will be reconstructed for the upper part with the first merging cost larger than the mean merging cost for the entire tree
coarseR function is called inside the core HiClimR functioncoords component to the output tree for the longitude and latitude coordinates
validClimR function does not require lon and lat arguments
coords component)HiClimR internally calls all other functionsregional linkage method
k should be specifiedHiClimR to regional linkage methodretData.
data of the output treeHiCLimR preprocessing options for further analysisregion component
NvalidCLimR that enables users to exclude very small clusters from validation indices interCor, intraCor, diffCor, and statSum, by setting a value for the minimum cluster size (minSize > 1)
validClimR in clustFlag component, which takes a value of 1 for valid clusters or 0 for excluded clustersHiClimR (currently, regional linkage) method, noisy spatial elements (or stations) are isolated in very small-size clusters or individuals since they do not correlate well with any other elementscoarseR function for coarsening spatial resolution of the input matrix xHiClimR package that modifies hclust function in stats libraryN spatial elements (e.g., stations) are divided into k regions, given that each element has observations (or timeseries) of length M
average update formulae by incorporating the standard deviation of the mean of the merged region100%.average linkage clustering methodvalidClimRlibrary(HiClimR)
#----------------------------------------------------------------------------------#
# Typical use of HiClimR for single-variate clustering: #
#----------------------------------------------------------------------------------#
## Load the test data included/loaded in the package (1 degree resolution)
x <- TestCase$x
lon <- TestCase$lon
lat <- TestCase$lat
## Generate/check longitude and latitude mesh vectors for gridded data
xGrid <- grid2D(lon = unique(TestCase$lon), lat = unique(TestCase$lat))
lon <- c(xGrid$lon)
lat <- c(xGrid$lat)
## Single-Variate Hierarchical Climate Regionalization
y <- HiClimR(x, lon = lon, lat = lat, lonStep = 1, latStep = 1, geogMask = FALSE,
continent = "Africa", meanThresh = 10, varThresh = 0, detrend = TRUE,
standardize = TRUE, nPC = NULL, method = "ward", hybrid = FALSE, kH = NULL,
members = NULL, nSplit = 1, upperTri = TRUE, verbose = TRUE,
validClimR = TRUE, k = 12, minSize = 1, alpha = 0.01,
plot = TRUE, colPalette = NULL, hang = -1, labels = FALSE)
#----------------------------------------------------------------------------------#
# Additional Examples: #
#----------------------------------------------------------------------------------#
## Use Ward's method
y <- HiClimR(x, lon = lon, lat = lat, lonStep = 1, latStep = 1, geogMask = FALSE,
continent = "Africa", meanThresh = 10, varThresh = 0, detrend = TRUE,
standardize = TRUE, nPC = NULL, method = "ward", hybrid = FALSE, kH = NULL,
members = NULL, nSplit = 1, upperTri = TRUE, verbose = TRUE,
validClimR = TRUE, k = 5, minSize = 1, alpha = 0.01,
plot = TRUE, colPalette = NULL, hang = -1, labels = FALSE)
## Use data splitting for big data
y <- HiClimR(x, lon = lon, lat = lat, lonStep = 1, latStep = 1, geogMask = FALSE,
continent = "Africa", meanThresh = 10, varThresh = 0, detrend = TRUE,
standardize = TRUE, nPC = NULL, method = "ward", hybrid = TRUE, kH = NULL,
members = NULL, nSplit = 10, upperTri = TRUE, verbose = TRUE,
validClimR = TRUE, k = 12, minSize = 1, alpha = 0.01,
plot = TRUE, colPalette = NULL, hang = -1, labels = FALSE)
## Use hybrid Ward-Regional method
y <- HiClimR(x, lon = lon, lat = lat, lonStep = 1, latStep = 1, geogMask = FALSE,
continent = "Africa", meanThresh = 10, varThresh = 0, detrend = TRUE,
standardize = TRUE, nPC = NULL, method = "ward", hybrid = TRUE, kH = NULL,
members = NULL, nSplit = 1, upperTri = TRUE, verbose = TRUE,
validClimR = TRUE, k = 12, minSize = 1, alpha = 0.01,
plot = TRUE, colPalette = NULL, hang = -1, labels = FALSE)
## Check senitivity to kH for the hybrid method aboverequire(HiClimR)
#----------------------------------------------------------------------------------#
# Typical use of HiClimR for multivariate clustering: #
#----------------------------------------------------------------------------------#
## Load the test data included/loaded in the package (1 degree resolution)
x1 <- TestCase$x
lon <- TestCase$lon
lat <- TestCase$lat
## Generate/check longitude and latitude mesh vectors for gridded data
xGrid <- grid2D(lon = unique(TestCase$lon), lat = unique(TestCase$lat))
lon <- c(xGrid$lon)
lat <- c(xGrid$lat)
## Test if we can replicate single-variate region map with repeated variable
y <- HiClimR(x=list(x1, x1), lon = lon, lat = lat, lonStep = 1, latStep = 1,
geogMask = FALSE, continent = "Africa", meanThresh = list(10, 10),
varThresh = list(0, 0), detrend = list(TRUE, TRUE), standardize = list(TRUE, TRUE),
nPC = NULL, method = "ward", hybrid = FALSE, kH = NULL,
members = NULL, nSplit = 1, upperTri = TRUE, verbose = TRUE,
validClimR = TRUE, k = 12, minSize = 1, alpha = 0.01,
plot = TRUE, colPalette = NULL, hang = -1, labels = FALSE)
## Generate a random matrix with the same number of rows
x2 <- matrix(rnorm(nrow(x1) * 100, mean=0, sd=1), nrow(x1), 100)
## Multivariate Hierarchical Climate Regionalization
y <- HiClimR(x=list(x1, x2), lon = lon, lat = lat, lonStep = 1, latStep = 1,
geogMask = FALSE, continent = "Africa", meanThresh = list(10, NULL),
varThresh = list(0, 0), detrend = list(TRUE, FALSE), standardize = list(TRUE, TRUE),
weightMVC = list(1, 1), nPC = NULL, method = "ward", hybrid = FALSE, kH = NULL,
members = NULL, nSplit = 1, upperTri = TRUE, verbose = TRUE,
validClimR = TRUE, k = 12, minSize = 1, alpha = 0.01,
plot = TRUE, colPalette = NULL, hang = -1, labels = FALSE)
## You can apply all clustering methods and optionsrequire(HiClimR)
#----------------------------------------------------------------------------------#
# Miscellaneous examples to provide more information about functionality and usage #
# of the helper functions that can be used separately or for other applications. # #
#----------------------------------------------------------------------------------#
## Load test case data
x <- TestCase$x
## Generate longitude and latitude mesh vectors
xGrid <- grid2D(lon = unique(TestCase$lon), lat = unique(TestCase$lat))
lon <- c(xGrid$lon)
lat <- c(xGrid$lat)
## Coarsening spatial resolution
xc <- coarseR(x = x, lon = lon, lat = lat, lonStep = 2, latStep = 2)
lon <- xc$lon
lat <- xc$lat
x <- xc$x
## Use fastCor function to compute the correlation matrix
t0 <- proc.time(); xcor <- fastCor(t(x)); proc.time() - t0
## compare with cor function
t0 <- proc.time(); xcor0 <- cor(t(x)); proc.time() - t0
## Check the valid options for geographic masking
geogMask()
## geographic mask for Africa
gMask <- geogMask(continent = "Africa", lon = lon, lat = lat, plot = TRUE,
colPalette = NULL)
## Hierarchical Climate Regionalization Without geographic masking
y <- HiClimR(x, lon = lon, lat = lat, lonStep = 1, latStep = 1, geogMask = FALSE,
continent = "Africa", meanThresh = 10, varThresh = 0, detrend = TRUE,
standardize = TRUE, nPC = NULL, method = "ward", hybrid = FALSE, kH = NULL,
members = NULL, nSplit = 1, upperTri = TRUE, verbose = TRUE,
validClimR = TRUE, k = 12, minSize = 1, alpha = 0.01,
plot = TRUE, colPalette = NULL, hang = -1, labels = FALSE)
## With geographic masking (you may specify the mask produced above to save time)
y <- HiClimR(x, lon = lon, lat = lat, lonStep = 1, latStep = 1, geogMask = TRUE,
continent = "Africa", meanThresh = 10, varThresh = 0, detrend = TRUE,
standardize = TRUE, nPC = NULL, method = "ward", hybrid = FALSE, kH = NULL,
members = NULL, nSplit = 1, upperTri = TRUE, verbose = TRUE,
validClimR = TRUE, k = 12, minSize = 1, alpha = 0.01,
plot = TRUE, colPalette = NULL, hang = -1, labels = FALSE)
## With geographic masking and contiguity constraint
## Change contigConst as appropriate
y <- HiClimR(x, lon = lon, lat = lat, lonStep = 1, latStep = 1, geogMask = TRUE,
continent = "Africa", contigConst = 1, meanThresh = 10, varThresh = 0, detrend = TRUE,
standardize = TRUE, nPC = NULL, method = "ward", hybrid = FALSE, kH = NULL,
members = NULL, nSplit = 1, upperTri = TRUE, verbose = TRUE,
validClimR = TRUE, k = 12, minSize = 1, alpha = 0.01,
plot = TRUE, colPalette = NULL, hang = -1, labels = FALSE)
## Find minimum significant correlation at 95% confidence level
rMin <- minSigCor(n = nrow(x), alpha = 0.05, r = seq(0, 1, by = 1e-06))
## Validtion of Hierarchical Climate Regionalization
z <- validClimR(y, k = 12, minSize = 1, alpha = 0.01, plot = TRUE, colPalette = NULL)
## Apply minimum cluster size (minSize = 25)
z <- validClimR(y, k = 12, minSize = 25, alpha = 0.01, plot = TRUE, colPalette = NULL)
## The optimal number of clusters, including small clusters
k <- length(z$clustFlag)
## The selected number of clusters, after excluding small clusters (if minSize > 1)
ks <- sum(z$clustFlag)
## Dendrogram plot
plot(y, hang = -1, labels = FALSE)
## Tree cut
cutTree <- cutree(y, k = k)
table(cutTree)
## Visualization for gridded data
RegionsMap <- matrix(y$region, nrow = length(unique(y$coords[, 1])), byrow = TRUE)
colPalette <- colorRampPalette(c("#00007F", "blue", "#007FFF", "cyan",
"#7FFF7F", "yellow", "#FF7F00", "red", "#7F0000"))
image(unique(y$coords[, 1]), unique(y$coords[, 2]), RegionsMap, col = colPalette(ks))
## Visualization for gridded or ungridded data
plot(y$coords[, 1], y$coords[, 2], col = colPalette(max(y$region, na.rm = TRUE))[y$region], pch = 15, cex = 1)
## Change pch and cex as appropriate!
## Export region map and mean timeseries into NetCDF-4 file
library(ncdf4)
y.nc <- HiClimR2nc(y=y, ncfile="HiClimR.nc", timeunit="years", dataunit="mm")
## The NetCDF-4 file is still open to add other variables or close it
nc_close(y.nc)