Using gama - a Genetic Approach to Maximize clustering criterion

Jairson Rodrigues, Germano Vasconcelos, Renato Tinós

15 Feb 2019

Introduction

Clustering refers to the task of assigning categories to describe a dataset and its points according to their similarities or dissimilarities. It is a form of unsupervised learning that detects hidden structures and separates the data into partitions, such that elements in the same partition are similar to each other and dissimilar to the elements in other partitions, according to some criterion or distance metric. It is widely used to classify datasets when the target class is not known.

Anil K Jain and Dubes (1988) divide clustering techniques into three broad categories: nonexclusive (overlapping), partitional, and hierarchical. Overlapping algorithms can be subdivided into soft (full membership in one or more clusters) and fuzzy (degrees of membership in one or more clusters). Hierarchical and partitional clustering are related to each other: both produce hard partitions (mutually disjoint subsets), and a hierarchical clustering is a nested sequence of partitional ones.

There is a broad variety of methods and algorithms for hard, hierarchical, or overlapping clustering. One of the most popular, simplest, and still widely used clustering algorithms, K-means, was first published in 1955, and thousands of clustering algorithms have been published since then. Some of them opened useful research directions, including semi-supervised clustering, ensemble clustering, simultaneous feature selection during data clustering, and large-scale data clustering (Anil K. Jain 2010).

Another way to do clustering is by using evolutionary approaches to find the best partitions. From an optimization perspective, clustering can be formally considered as a particular kind of NP-hard grouping problem (Falkenauer 1998). Under this assumption, a large number of evolutionary algorithms for solving clustering problems have been proposed in the literature. These algorithms are based on the optimization of some objective function (i.e., the so-called fitness function) that guides the evolutionary search (Hruschka et al. 2009).

gama concept

Following the taxonomy proposed by Hruschka et al. (2009), we present gama, a Genetic Approach to Maximize a clustering criterion: an R package for evolutionary hard partitional clustering that uses guided operators (operators informed by the quality of individual clusters) for a fixed, a priori known number of partitions, encoded as real-valued, centroid-based partitions.

The main advantage of the proposed technique is that it gives the user a way to perform clustering guided by the maximization of a chosen cluster validation criterion. The genetic search enables the algorithm to find a centroid configuration whose quality improves as the number of generations grows, until convergence to a local (reasonable) or global maximum, if one exists.

This document shows a small set of usage examples for gama (version 1.0.3). It was written in R Markdown, using the knitr package (Y. Xie 2014). See also help(package = "gama") for a list of available functions and methods.

Hereafter, we assume that the gama package is already installed and loaded in the current R session, by entering the following command:

library(gama)

Optimization problem

In the domain of real numbers bounded by the minimum and maximum values of the given dataset, the problem is to discover the centroid or centroids that best separate the dataset into k partitions. The search is guided by the maximization of one of four available clustering validation criteria.

The best partition varies according to the nature of each criterion (Desgraupes 2013). The four criteria available in gama are the Average Silhouette Width (ASW, the default), the Calinski-Harabasz index (CH), the C-Index (CI), and the Dunn index (DI).
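
The criterion is selected through a single argument when calling gama. A minimal sketch, using the cpu.pca dataset described in the next section and the "CH" value that appears in the later examples:

# guide the genetic search by the Calinski-Harabasz index
# instead of the default Average Silhouette Width criterion
data(cpu.pca)
obj <- gama(dataset = cpu.pca, k = 3, fitness.criterion = "CH")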

Available datasets

We deployed gama with six different datasets, all of them containing real-valued data. Two are in-house datasets of CPU consumption metrics from a real big data processing platform during the execution of distributed machine learning (DML) algorithms. The others are synthetic datasets made available by the scientific community for benchmark purposes (Fränti and Sieranoja 2018). They are described below.

In-house datasets

To collect CPU execution metrics, we ran two distributed machine learning (DML) algorithms. The first was a recommendation system problem solved by Alternating Least Squares (ALS) (Goldberg et al. 1992; Hu, Koren, and Volinsky 2008) over a 1.68 GB dataset of users and products. The second was a dimensionality reduction problem solved by Principal Component Analysis (PCA) (Jolliffe 2002) over a 257 MB dataset of floating point numbers. The experiments were hosted on the Google Cloud Dataproc provider and comprised a master node and seven slave nodes, all with the same specification: a 16-core Intel Xeon CPU @ 2.60GHz with 2 threads per core, 106 GB RAM, 500 GB HDD storage, and 375 GB SSD storage, managed by the Debian GNU/Linux 8.10 (jessie) operating system. The solid-state drive (SSD) on each node supported auxiliary operating system tasks. In total, the cluster comprised 128 CPU cores, 848 GB RAM, and 3.4 TB of storage, running the big data processing frameworks Apache Spark 2.2.0 [http://spark.apache.org] and Apache Hadoop 2.8.2 [http://hadoop.apache.org].

We collected, every five seconds, the CPU consumption metrics regarding user algorithm load, system load, I/O wait, and software interrupt requests. The datasets are described as follows:

  • cpu.als: a data frame containing 308 observations and four dimensions. The number of partitions is unknown;
  • cpu.pca: a data frame containing 938 observations and four dimensions. The number of partitions is unknown.

The images above show the geometric shapes of the in-house datasets, decomposed from their four original dimensions into their two principal components. The number of partitions of both datasets is unknown, but there is a strong visual suggestion of three partitions in cpu.pca.
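
The two-principal-component view can be reproduced with a standard PCA in base R. A minimal sketch (not the package's own plotting code):

data(cpu.pca)

# project the four CPU metrics onto their first two principal components
pc <- prcomp(cpu.pca)
plot(pc$x[, 1:2], xlab = "PC1", ylab = "PC2",
     main = "cpu.pca: first two principal components")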

Third-party datasets

Originally, every dataset contained one additional dimension: a label identifying the partition to which each data point belongs. We intentionally removed this last dimension.

  1. aggregation: a synthetic dataset (Gionis, Mannila, and Tsaparas 2007) that contains features that are known to create difficulties for the selected algorithms such as narrow bridges between clusters, uneven-sized clusters, and so on. It contains 788 observations and two dimensions, intentionally forming seven partitions;
  2. compound: a synthetic dataset (Zahn 1970) that contains groups of different density points, varied shapes, and necks between partitions. It contains 399 observations and two dimensions, forming six clusters;
  3. flame: a dataset used in research on a fuzzy clustering method for the analysis of DNA microarray data (Fu and Medico 2007). It contains 240 observations and two dimensions, intentionally forming two partitions;
  4. path.based: a synthetic dataset (Chang and Yeung 2008) consisting of a circular cluster with an opening near the bottom and two Gaussian distributed clusters inside it. It contains 300 observations and two dimensions, intentionally forming three partitions.
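
Assuming the four datasets load with data() in the same way as the in-house ones, a quick check of their sizes against the descriptions above:

data(aggregation); data(compound); data(flame); data(path.based)

# number of observations per dataset: 788, 399, 240, and 300 expected
sapply(list(aggregation, compound, flame, path.based), nrow)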

Third-party software and data

Some tools, techniques, and third-party packages were applied to build gama. We used the Intel HiBench (S. Huang et al. 2010) big data benchmark framework to generate the datasets about CPU execution performance of large machine learning workloads. The other data available in the package are synthetic datasets made available by the scientific community for clustering benchmark purposes (Fränti and Sieranoja 2018). We also used the following libraries and R packages: cluster (Maechler et al. 2018) and NbClust (Charrad et al. 2014) for cluster validation, GA (Scrucca 2013) for genetic algorithms, and Rfast (Papadakis et al. 2018) to speed up R calculations over matrices.

Examples

Segmentation of well-defined partitions

In this example, we will use the cpu.pca dataset. As observed earlier, it has three well-separated partitions, so the gama function will be called with a specific value of k = 3, as follows:

data(cpu.pca)

# call gama evolutionary clustering for k = 3 partitions
# plot.internals = FALSE: do not plot the fitness evolution across generations
obj <- gama(dataset = cpu.pca, k = 3, plot.internals = FALSE, seed.p = 42)
## 
## Details for the object of class 'gama':
## 
## Original data (first rows):
##         user     system     iowait softirq
## 1 0.04995629 0.01248907 0.00000000       0
## 2 0.03746254 0.02497502 0.00000000       0
## 3 0.03747190 0.02498126 0.01249063       0
## 4 0.06242197 0.02496879 0.00000000       0
## 5 0.04995629 0.02497814 0.00000000       0
## 6 0.04994381 0.00000000 0.00000000       0
## 
## Cluster partitions:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [141] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [176] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 2 2 2 3 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2
## [211] 2 2 2 3 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2
## [246] 2 3 1 2 2 2 2 2 2 3 2 2 2 2 2 3 2 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2
## [281] 2 3 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2
## [316] 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2
## [351] 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 3 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2
## [386] 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2
## [421] 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2
## [456] 3 2 2 2 2 2 2 3 3 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2
## [491] 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2
## [526] 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 3 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2
## [561] 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2
## [596] 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2
## [631] 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2
## [666] 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2
## [701] 3 2 2 2 2 2 2 3 3 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2
## [736] 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2
## [771] 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2
## [806] 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2
## [841] 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2
## [876] 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2
## [911] 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 2
## 
## Cluster Centers:
##          user     system    iowait   softirq
## 1  0.08846322  0.7074869 0.5285097 0.2379882
## 2 39.09156199 19.8591386 0.1718969 0.1096162
## 3 59.69671874 16.9813975 0.4664263 0.1289043
## 
## Average Silhouette Width index (ASW):
## [1] 0.8628907
## 
## Calinski Harabasz index (CH):
## [1] 4188.998
## 
## C-Index (CI):
## [1] 0.01254375
## 
## Dunn index (DI):
## [1] 0.03160641
## 
## Call:
## gama(dataset = cpu.pca, k = 3, seed.p = 42, plot.internals = FALSE)
## 
## Runtime:
## Time difference of 14.21 secs

# plot the graph of partitions
gama.plot.partitions(obj)

Segmentation of an unknown number of partitions

In this example, we will use the cpu.als dataset. The best value for k is derived by the 'broad' method of estimation, which takes a majority vote over the k values suggested by several validation indices, as the output below shows. Since the dataset describes CPU load, we define a user function, my.penalty, which prevents the algorithm from choosing centroids whose summed CPU loads exceed 100% (a physical limit). This specific call sets 150 generations, passes the penalty function, keeps the default option of plotting the internal graphs (evolution and silhouette), and lets the algorithm infer the best value for k.
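
The code chunk for this example is not echoed here, so the sketch below reconstructs it from the Call field in the output. The signature gama expects for penalty.function is an assumption: we suppose the candidate solution arrives as a flat numeric vector holding the k centroids, four coordinates each.

# hypothetical penalty (signature assumed): the total amount by which
# the candidate centroids' summed CPU loads
# (user + system + iowait + softirq) exceed the 100% physical limit
my.penalty <- function(chromosome, ...) {
  centers <- matrix(chromosome, ncol = 4, byrow = TRUE)
  loads <- rowSums(centers)
  sum(pmax(loads - 100, 0))
}

data(cpu.als)
obj <- gama(dataset = cpu.als, generations = 150, seed.p = 42,
            penalty.function = my.penalty)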

## Looking for the best k value by 'broad' method...
## There was found 6 suggestions for k = [ 2, 3, 4, 7, 9, 10 ].
## Best k suggestion by majority voting = 3.
## 
## 
## Details for the object of class 'gama':
## 
## Original data (first rows):
##         user     system    iowait softirq
## 1 0.16239850 0.02498438 0.4996877       0
## 2 0.07497189 0.01249531 0.5622891       0
## 3 0.16237822 0.03747190 0.4746440       0
## 4 0.11241569 0.04996253 0.6120410       0
## 5 0.07495315 0.03747658 0.7495315       0
## 6 0.14985015 0.04995005 0.5619381       0
## 
## Cluster partitions:
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 2 3 2 2 3 3 2 3 3 3 3
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 1 2 2 1 1 2 2 2 2 2 2 2 2
##  [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [106] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [141] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [176] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [211] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [246] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [281] 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1
## 
## Cluster Centers:
##       user   system    iowait   softirq
## 3 11.61683 6.504285 0.5638404 0.1192261
## 2 23.04150 6.109322 0.4010132 0.1845900
## 1 66.35791 3.057684 0.6460705 0.1349795
## 
## Average Silhouette Width index (ASW):
## [1] 0.8137554
## 
## Calinski Harabasz index (CH):
## [1] 414.4831
## 
## C-Index (CI):
## [1] 0.0348298
## 
## Dunn index (DI):
## [1] 0.02320579
## 
## Call:
## gama(dataset = cpu.als, generations = 150, seed.p = 42, penalty.function = my.penalty)
## 
## Runtime:
## Time difference of 6.47 secs
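
The partitions can then be plotted. The call below is reconstructed from the discussion that follows; the view argument and its value are taken from the text:

# plot the partitions, summing all dimensions of each observation
gama.plot.partitions(obj, view = "total.sum")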

One can observe the parameter view = 'total.sum' in the call above. This kind of partition visualization plots, for each observation, the sum of all its dimensions. It is especially useful for datasets like cpu.als, where the sum of the dimensions (the total CPU load) is what matters for visualization.

Segmentation of complex datasets

In the next examples, we will use the aggregation, compound, and path.based datasets. One can observe that the clusters have groups of points with different densities, varied shapes, and necks between partitions. The datasets contain features that are known to create difficulties for the selected algorithm, such as narrow bridges between clusters and uneven-sized clusters. For demonstration purposes, we call gama with a small number of generations; we strongly recommend larger values for complex datasets.

First, a call to the algorithm over the compound dataset with the a priori known number of partitions, k = 6. The evaluation method is changed from the default (silhouette) to the Calinski-Harabasz index to exemplify the adaptable fitness criterion feature.
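
The corresponding chunk, reconstructed from the Call field echoed in the output below:

data(compound)

obj <- gama(dataset = compound, k = 6, seed.p = 42,
            fitness.criterion = "CH", plot.internals = FALSE)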

## 
## Details for the object of class 'gama':
## 
## Original data (first rows):
##      x1    x2
## 1 26.75 22.15
## 2 29.80 22.15
## 3 31.55 21.10
## 4 27.70 20.85
## 5 29.90 19.95
## 6 26.80 19.05
## 
## Cluster partitions:
##   [1] 1 5 5 5 5 5 5 5 5 5 5 5 5 4 5 5 5 5 4 4 4 5 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [36] 2 2 2 2 2 2 2 2 2 2 5 5 5 5 2 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
##  [71] 5 2 5 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 5 5 5 5 5 5 5
## [106] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
## [141] 5 5 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [176] 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [211] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
## [246] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
## [281] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 4 4 4 4 4
## [316] 6 6 6 6 6 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [351] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [386] 4 4 4 4 4 4 6 6 6 6 6 6 6 6
## 
## Cluster Centers:
##         x1       x2
## 6 10.27127 12.05669
## 3 10.17120 17.09351
## 4 22.79056 10.66915
## 1 20.22489 20.35384
## 5 32.14655 15.68631
## 2 37.34206 14.41635
## 
## Average Silhouette Width index (ASW):
## [1] 0.5978951
## 
## Calinski Harabasz index (CH):
## [1] 849.4788
## 
## C-Index (CI):
## [1] 0.03711695
## 
## Dunn index (DI):
## [1] 0.02963503
## 
## Call:
## gama(dataset = compound, k = 6, seed.p = 42, fitness.criterion = "CH", 
##     plot.internals = FALSE)
## 
## Runtime:
## Time difference of 18.93 secs

The algorithm was not able to detect complex shapes such as the clusters that exist within other clusters.

Second, a call to detect partitions in the aggregation dataset (k = 7), using the Dunn index as the fitness criterion.
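
Again reconstructed from the Call field echoed below:

data(aggregation)

obj <- gama(dataset = aggregation, k = 7, seed.p = 42,
            fitness.criterion = "DI", plot.internals = FALSE)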

## 
## Details for the object of class 'gama':
## 
## Original data (first rows):
##      x1    x2
## 1 15.55 28.65
## 2 14.90 27.55
## 3 14.45 28.35
## 4 14.15 28.80
## 5 13.75 28.05
## 6 13.35 28.45
## 
## Cluster partitions:
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [36] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [71] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [106] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [141] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 6 6 6 6 3 2 2 2 2 2
## [176] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [211] 2 2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 2 2 2 2 2 2 2 6 6 6 2 2 2 2 2 2
## [246] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 2 2 2 2 2 2 2 2 2 6 6 2 2
## [281] 2 2 2 2 2 2 6 2 2 2 2 6 6 6 6 2 2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
## [316] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
## [351] 6 6 6 6 6 7 7 7 7 7 7 6 7 7 6 6 7 7 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
## [386] 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 4 7 7 7 7 7 7 7 7 7 6 6 6 6 6 6 6
## [421] 6 6 6 6 6 6 6 7 7 6 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
## [456] 7 7 7 7 7 7 7 7 7 7 7 7 7 7 4 7 7 7 7 7 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [491] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [526] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [561] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1 1 1 1
## [596] 1 1 1 1 1 1 1 1 1 1 1 4 4 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [631] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [666] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [701] 1 1 1 1 1 1 1 1 1 5 5 5 5 3 3 5 5 5 5 5 5 5 5 5 5 5 3 3 5 5 3 5 1 1 5
## [736] 5 1 1 1 1 1 1 1 5 5 5 5 5 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [771] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 
## Cluster Centers:
##          x1       x2
## 2  6.490798 10.65759
## 6 16.617178 13.84415
## 3 11.569065 21.43069
## 7 20.684434 14.64211
## 5 24.175935 17.06886
## 4 30.645513 12.35962
## 1 27.905204 19.84107
## 
## Average Silhouette Width index (ASW):
## [1] 0.5313879
## 
## Calinski Harabasz index (CH):
## [1] 1094.469
## 
## C-Index (CI):
## [1] 0.04818904
## 
## Dunn index (DI):
## [1] 0.032834
## 
## Call:
## gama(dataset = aggregation, k = 7, seed.p = 42, fitness.criterion = "DI", 
##     plot.internals = FALSE)
## 
## Runtime:
## Time difference of 49.39 secs

Finally, the path.based dataset, for which a specific method infers the best value for k before the clustering algorithm is called.
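
The helper used to estimate best.k is not echoed in this output; the sketch below simply assigns the value that the 'minimal' method reports and reconstructs the clustering call from the Call field:

data(path.based)

# k = 2 is the suggestion the 'minimal' method (second derivative
# approximation over the WCSSE elbow curve) reports in the output below
best.k <- 2

obj <- gama(dataset = path.based, k = best.k, seed.p = 42,
            plot.internals = FALSE)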

## Looking for the best k value by 'minimal' method...

## Best k suggestion by using second derivative aproximation over 
## 'Within-Cluster Sum of Squares Error' (WCSSE) graphic = 2. See 'elbow' graphic for details.
## 
## Details for the object of class 'gama':
## 
## Original data (first rows):
##      x1   x2
## 1 11.25 5.05
## 2 10.95 4.70
## 3  9.85 5.80
## 4  9.80 5.75
## 5  9.15 6.80
## 6  8.65 6.60
## 
## Cluster partitions:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [106] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [141] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [176] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1
## [211] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [246] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [281] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 
## Cluster Centers:
##         x1       x2
## 1 13.67276 14.23638
## 2 22.62516 13.37190
## 
## Average Silhouette Width index (ASW):
## [1] 0.6974735
## 
## Calinski Harabasz index (CH):
## [1] 371.0553
## 
## C-Index (CI):
## [1] 0.1020544
## 
## Dunn index (DI):
## [1] 0.03698117
## 
## Call:
## gama(dataset = path.based, k = best.k, seed.p = 42, plot.internals = FALSE)
## 
## Runtime:
## Time difference of 3.95 secs


sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.3
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] gama_1.0.3 knitr_1.21
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0           GA_3.2               cluster_2.0.7-1     
##  [4] magrittr_1.5         NbClust_3.0          munsell_0.5.0       
##  [7] colorspace_1.4-0     rlang_0.3.1          foreach_1.4.4       
## [10] plyr_1.8.4           stringr_1.3.1        tools_3.5.2         
## [13] parallel_3.5.2       Rfast_1.9.2          grid_3.5.2          
## [16] gtable_0.2.0         xfun_0.4             cli_1.0.1           
## [19] htmltools_0.3.6      iterators_1.0.10     lazyeval_0.2.1      
## [22] yaml_2.2.0           digest_0.6.18        assertthat_0.2.0    
## [25] clusterCrit_1.2.8    tibble_2.0.1         crayon_1.3.4        
## [28] ggplot2_3.1.0        ArgumentCheck_0.10.2 codetools_0.2-15    
## [31] evaluate_0.12        rmarkdown_1.11       labeling_0.3        
## [34] stringi_1.2.4        pillar_1.3.1         compiler_3.5.2      
## [37] scales_1.0.0         RcppZiggurat_0.1.5   pkgconfig_2.0.2

References

Chang, Hong, and Dit-Yan Yeung. 2008. “Robust Path-Based Spectral Clustering.” Pattern Recognition 41 (1). Elsevier: 191–203.

Charrad, Malika, Nadia Ghazzali, Véronique Boiteau, and Azam Niknafs. 2014. “NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set.” Journal of Statistical Software 61 (6): 1–36. http://www.jstatsoft.org/v61/i06/.

Desgraupes, Bernard. 2013. “Clustering Indices.” University of Paris Ouest-Lab Modal’X 1: 34.

Falkenauer, Emanuel. 1998. Genetic Algorithms and Grouping Problems. Wiley New York.

Fränti, Pasi, and Sami Sieranoja. 2018. “K-Means Properties on Six Clustering Benchmark Datasets.” http://cs.uef.fi/sipu/datasets/.

Fu, Limin, and Enzo Medico. 2007. “FLAME, a Novel Fuzzy Clustering Method for the Analysis of DNA Microarray Data.” BMC Bioinformatics 8 (1). BioMed Central: 3.

Gionis, Aristides, Heikki Mannila, and Panayiotis Tsaparas. 2007. “Clustering Aggregation.” ACM Transactions on Knowledge Discovery from Data (TKDD) 1 (1). ACM: 4.

Goldberg, David, David Nichols, Brian M Oki, and Douglas Terry. 1992. “Using collaborative filtering to weave an information tapestry.” Communications of the ACM 35 (12). ACM: 61–70.

Hruschka, Eduardo Raul, Ricardo J G B Campello, Alex A Freitas, and Others. 2009. “A survey of evolutionary algorithms for clustering.” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 39 (2). IEEE: 133–55.

Hu, Yifan, Yehuda Koren, and Chris Volinsky. 2008. “Collaborative filtering for implicit feedback datasets.” In Data Mining, 2008. ICDM ’08. Eighth IEEE International Conference on, 263–72. IEEE.

Huang, Shengsheng, Jie Huang, Jinquan Dai, Tao Xie, and Bo Huang. 2010. “The HiBench benchmark suite: Characterization of the MapReduce-based data analysis.” In 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010), 41–51. IEEE. doi:10.1109/ICDEW.2010.5452747.

Jain, Anil K, and Richard C Dubes. 1988. “Algorithms for Clustering Data.” Prentice-Hall, Inc.

Jain, Anil K. 2010. “Data clustering: 50 years beyond K-means.” Pattern Recognition Letters 31 (8). Elsevier B.V.: 651–66. doi:10.1016/j.patrec.2009.09.011.

Jolliffe, I T. 2002. “Principal Component Analysis, Second Edition.” Encyclopedia of Statistics in Behavioral Science 30 (3): 487. doi:10.2307/1270093.

Maechler, Martin, Peter Rousseeuw, Anja Struyf, Mia Hubert, and Kurt Hornik. 2018. Cluster: Cluster Analysis Basics and Extensions.

Papadakis, Manos, Michail Tsagris, Marios Dimitriadis, Stefanos Fafalios, Ioannis Tsamardinos, Matteo Fasiolo, Giorgos Borboudakis, John Burkardt, Changliang Zou, and Kleanthi Lakiotaki. 2018. Rfast: A Collection of Efficient and Extremely Fast R Functions.

Scrucca, Luca. 2013. “GA: A Package for Genetic Algorithms in R.” Journal of Statistical Software, Articles 53 (4): 1–37. doi:10.18637/jss.v053.i04.

Xie, Yihui. 2014. “Knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman; Hall/CRC. http://www.crcpress.com/product/isbn/9781466561595.

Zahn, Charles T. 1970. “Graph Theoretical Methods for Detecting and Describing Gestalt Clusters.” IEEE Trans. Comput. 20 (SLAC-PUB-0672-REV): 68.