Clustering Package

Luis Alfonso Pérez Martos

2020-04-23

Clustering is a concise way of modelling data: given a data set, we partition it into groups whose members are as similar as possible. A review of the clustering algorithms implemented in R shows a large number of packages that implement or improve an algorithm or a piece of functionality.

The Clustering package contains multiple implementations of algorithms such as: gmm, kmeans-arma, kmeans-rcpp, fuzzy_cm, fuzzy_gg, fuzzy_gk, hclust, apclusterk, aggExcluster, clara, daisy, diana, fanny, gama, mona, pam, pvclust, pvpick.

It can also use different similarity measures to calculate the distance between points, such as Euclidean, Manhattan, Jaccard, Gower, Mahalanobis, Correlation and Minkowski.
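As a rough illustration of what some of these measures compute, the sketch below uses only base R (the stats package) rather than the Clustering package itself. The two points and all variable names are made up for the example.

```r
# Two made-up points in three dimensions (illustrative data only).
x <- rbind(a = c(1, 2, 3),
           b = c(4, 6, 3))

# stats::dist() supports several of the measures listed above.
d_euclidean <- as.numeric(dist(x, method = "euclidean"))         # sqrt(3^2 + 4^2 + 0^2)
d_manhattan <- as.numeric(dist(x, method = "manhattan"))         # 3 + 4 + 0
d_minkowski <- as.numeric(dist(x, method = "minkowski", p = 3))  # (3^3 + 4^3)^(1/3)

# A correlation-based dissimilarity is commonly defined as 1 - Pearson correlation.
d_correlation <- 1 - cor(x["a", ], x["b", ])

print(c(euclidean = d_euclidean, manhattan = d_manhattan,
        minkowski = d_minkowski, correlation = d_correlation))
```

The Euclidean and Manhattan distances between these two points are 5 and 7, which shows how the choice of measure changes the notion of "close" that a clustering algorithm works with.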

Furthermore, the package offers functions to:

Clustering

This is the main method of the package. The clustering method runs a set of clustering algorithms. To get information about the method's parameters, use ?function or help(function), for example ?clustering. Datasets can be loaded in two different ways:

Once the method has been executed, we obtain the results divided into four parts:


df <- Clustering::clustering(df = Clustering::basketball,  
                             packages = c("clusterr"), min = 4, max = 6)

Here we have a dataframe with the result of the execution. In it you can see all the algorithms, the similarity measures used, the variables classified in order of importance, the execution time of the algorithms and the evaluation metrics.

Algorithm Distance Clusters Dataset Ranking timeExternal entropy variation_information precision recall f_measure fowlkes_mallows_index connectivity dunn silhouette timeInternal
gmm gmm_euclidean 4 dataframe 1 0.0203 0.3161 4.762 0.1822 0.451 0.2595 0.2867 34.09 0.1646 0.23 0.007
gmm gmm_euclidean 4 dataframe 2 0.0257 0.3085 4.741 0.1113 0.4005 0.1742 0.2111 34.09 0.1646 0.23 0.0071
gmm gmm_euclidean 4 dataframe 3 0.2384 0.0064 4.72 0 0 0 0 34.09 0.1646 0.23 0.009
gmm gmm_euclidean 4 dataframe 4 0.239 0.0032 4.143 0 0 0 0 34.09 0.1646 0.23 0.0091
gmm gmm_euclidean 4 dataframe 5 0.3997 0 3.671 0 0 0 0 34.09 0.1646 0.23 0.0108
gmm gmm_euclidean 5 dataframe 1 0.0245 0.4175 4.363 0.1637 0.2865 0.2084 0.2165 42.08 0.1619 0.25 0.0065
gmm gmm_euclidean 5 dataframe 2 0.0271 0.3857 4.346 0.1109 0.2823 0.1592 0.1769 42.08 0.1619 0.25 0.0071
gmm gmm_euclidean 5 dataframe 3 0.1838 0.0064 4.342 0 0 0 0 42.08 0.1619 0.25 0.0078
gmm gmm_euclidean 5 dataframe 4 0.1863 0.0032 4.321 0 0 0 0 42.08 0.1619 0.25 0.0078
gmm gmm_euclidean 5 dataframe 5 0.2108 0 4.022 0 0 0 0 42.08 0.1619 0.25 0.0153
gmm gmm_euclidean 6 dataframe 1 0.0278 0.433 4.439 0.1744 0.2791 0.2147 0.2206 51.46 0.1619 0.23 0.0064
gmm gmm_euclidean 6 dataframe 2 0.0289 0.4209 4.179 0.1062 0.2473 0.1486 0.1621 51.46 0.1619 0.23 0.0069
gmm gmm_euclidean 6 dataframe 3 0.1654 0.0064 4.159 0 0 0 0 51.46 0.1619 0.23 0.0075
gmm gmm_euclidean 6 dataframe 4 0.1811 0.0032 4.138 0 0 0 0 51.46 0.1619 0.23 0.0079
gmm gmm_euclidean 6 dataframe 5 0.1918 0 3.954 0 0 0 0 51.46 0.1619 0.23 0.0154
gmm gmm_manhattan 4 dataframe 1 0.0149 0.3161 4.762 0.1822 0.451 0.2595 0.2867 35.59 0.1348 0.23 0.0065
gmm gmm_manhattan 4 dataframe 2 0.0186 0.3085 4.741 0.1113 0.4005 0.1742 0.2111 35.59 0.1348 0.23 0.0065
gmm gmm_manhattan 4 dataframe 3 0.1536 0.0064 4.72 0 0 0 0 35.59 0.1348 0.23 0.0065
gmm gmm_manhattan 4 dataframe 4 0.1605 0.0032 4.143 0 0 0 0 35.59 0.1348 0.23 0.0068
gmm gmm_manhattan 4 dataframe 5 0.1839 0 3.671 0 0 0 0 35.59 0.1348 0.23 0.0069
gmm gmm_manhattan 5 dataframe 1 0.0257 0.4258 4.35 0.167 0.2828 0.21 0.2173 46.83 0.1322 0.26 0.0072
gmm gmm_manhattan 5 dataframe 2 0.0446 0.3892 4.338 0.1114 0.2742 0.1584 0.1747 46.83 0.1322 0.26 0.0092
gmm gmm_manhattan 5 dataframe 3 0.1582 0.0064 4.317 0 0 0 0 46.83 0.1322 0.26 0.0116
gmm gmm_manhattan 5 dataframe 4 0.2633 0.0032 4.296 0 0 0 0 46.83 0.1322 0.26 0.012
gmm gmm_manhattan 5 dataframe 5 0.4502 0 4.059 0 0 0 0 46.83 0.1322 0.26 0.0301
gmm gmm_manhattan 6 dataframe 1 0.0421 0.4555 4.298 0.1669 0.2608 0.2035 0.2085 54.87 0.1467 0.25 0.0089
gmm gmm_manhattan 6 dataframe 2 0.0918 0.4052 4.161 0.1148 0.2606 0.1594 0.173 54.87 0.1467 0.25 0.0092
gmm gmm_manhattan 6 dataframe 3 0.2203 0.0064 4.14 0 0 0 0 54.87 0.1467 0.25 0.0112
gmm gmm_manhattan 6 dataframe 4 0.3068 0.0032 4.119 0 0 0 0 54.87 0.1467 0.25 0.0187
gmm gmm_manhattan 6 dataframe 5 0.3333 0 4.102 0 0 0 0 54.87 0.1467 0.25 0.0292
kmeans_arma kmeans_arma 4 dataframe 1 0.0009 0 0 0 0 0 0 44.21 0.1495 0.23 0.0088
kmeans_arma kmeans_arma 4 dataframe 2 0.001 0 0 0 0 0 0 44.21 0.1495 0.23 0.009
kmeans_arma kmeans_arma 4 dataframe 3 0.0013 0 0 0 0 0 0 44.21 0.1495 0.23 0.0099
kmeans_arma kmeans_arma 4 dataframe 4 0.0016 0 0 0 0 0 0 44.21 0.1495 0.23 0.0103
kmeans_arma kmeans_arma 4 dataframe 5 0.0023 0 0 0 0 0 0 44.21 0.1495 0.23 0.0128
kmeans_arma kmeans_arma 5 dataframe 1 0.0008 0 0 0 0 0 0 49.22 0.1538 0.26 0.0082
kmeans_arma kmeans_arma 5 dataframe 2 0.001 0 0 0 0 0 0 49.22 0.1538 0.26 0.0107
kmeans_arma kmeans_arma 5 dataframe 3 0.0011 0 0 0 0 0 0 49.22 0.1538 0.26 0.0111
kmeans_arma kmeans_arma 5 dataframe 4 0.0012 0 0 0 0 0 0 49.22 0.1538 0.26 0.0172
kmeans_arma kmeans_arma 5 dataframe 5 0.0017 0 0 0 0 0 0 49.22 0.1538 0.26 0.022
kmeans_arma kmeans_arma 6 dataframe 1 0.001 0 0 0 0 0 0 57.63 0.1619 0.24 0.0081
kmeans_arma kmeans_arma 6 dataframe 2 0.0011 0 0 0 0 0 0 57.63 0.1619 0.24 0.0087
kmeans_arma kmeans_arma 6 dataframe 3 0.0012 0 0 0 0 0 0 57.63 0.1619 0.24 0.0102
kmeans_arma kmeans_arma 6 dataframe 4 0.0013 0 0 0 0 0 0 57.63 0.1619 0.24 0.0103
kmeans_arma kmeans_arma 6 dataframe 5 0.0017 0 0 0 0 0 0 57.63 0.1619 0.24 0.0124
kmeans_rcpp kmeans_rcpp 4 dataframe 1 0.0173 0.3728 4.627 0.1697 0.5 0.23 0.2461 51.04 0.1741 0.23 0.0073
kmeans_rcpp kmeans_rcpp 4 dataframe 2 0.0266 0.3494 4.606 0.1003 0.3567 0.1511 0.1753 51.04 0.1741 0.23 0.0075
kmeans_rcpp kmeans_rcpp 4 dataframe 3 0.1904 0.0032 4.606 0.0009 0.3065 0.0018 0.021 51.04 0.1741 0.23 0.0078
kmeans_rcpp kmeans_rcpp 4 dataframe 4 0.1909 0.0032 4.531 0 0 0 0 51.04 0.1741 0.23 0.0079
kmeans_rcpp kmeans_rcpp 4 dataframe 5 0.2305 0 3.804 0 0 0 0 51.04 0.1741 0.23 0.0143
kmeans_rcpp kmeans_rcpp 5 dataframe 1 0.0218 0.4269 4.551 0.1663 0.5 0.2104 0.2183 66.85 0.152 0.19 0.007
kmeans_rcpp kmeans_rcpp 5 dataframe 2 0.0233 0.4135 4.329 0.1019 0.2865 0.1457 0.1613 66.85 0.152 0.19 0.0078
kmeans_rcpp kmeans_rcpp 5 dataframe 3 0.1641 0.0032 4.308 0.0011 0.2554 0.0022 0.0232 66.85 0.152 0.19 0.0079
kmeans_rcpp kmeans_rcpp 5 dataframe 4 0.1798 0.0032 4.308 0 0 0 0 66.85 0.152 0.19 0.0081
kmeans_rcpp kmeans_rcpp 5 dataframe 5 0.2077 0 4.079 0 0 0 0 66.85 0.152 0.19 0.0097
kmeans_rcpp kmeans_rcpp 6 dataframe 1 0.0284 0.4545 4.331 0.1703 0.2458 0.2012 0.2046 74.78 0.1522 0.19 0.007
kmeans_rcpp kmeans_rcpp 6 dataframe 2 0.0335 0.4169 4.104 0.1152 0.2419 0.1561 0.167 74.78 0.1522 0.19 0.0073
kmeans_rcpp kmeans_rcpp 6 dataframe 3 0.1752 0.0064 4.083 0 0 0 0 74.78 0.1522 0.19 0.0074
kmeans_rcpp kmeans_rcpp 6 dataframe 4 0.1962 0.0032 4.062 0 0 0 0 74.78 0.1522 0.19 0.0078
kmeans_rcpp kmeans_rcpp 6 dataframe 5 0.2077 0 4.037 0 0 0 0 74.78 0.1522 0.19 0.0208
mini_kmeans mini_kmeans 4 dataframe 1 0.001 0 0 0 0 0 0 50.35 0.1571 0.21 0.007
mini_kmeans mini_kmeans 4 dataframe 2 0.001 0 0 0 0 0 0 50.35 0.1571 0.21 0.0071
mini_kmeans mini_kmeans 4 dataframe 3 0.0011 0 0 0 0 0 0 50.35 0.1571 0.21 0.0072
mini_kmeans mini_kmeans 4 dataframe 4 0.0011 0 0 0 0 0 0 50.35 0.1571 0.21 0.0073
mini_kmeans mini_kmeans 4 dataframe 5 0.0016 0 0 0 0 0 0 50.35 0.1571 0.21 0.0077
mini_kmeans mini_kmeans 5 dataframe 1 0.0009 0 0 0 0 0 0 76.4 0.1216 0.17 0.0073
mini_kmeans mini_kmeans 5 dataframe 2 0.0009 0 0 0 0 0 0 76.4 0.1216 0.17 0.0074
mini_kmeans mini_kmeans 5 dataframe 3 0.0009 0 0 0 0 0 0 76.4 0.1216 0.17 0.0076
mini_kmeans mini_kmeans 5 dataframe 4 0.001 0 0 0 0 0 0 76.4 0.1216 0.17 0.0077
mini_kmeans mini_kmeans 5 dataframe 5 0.0014 0 0 0 0 0 0 76.4 0.1216 0.17 0.0087
mini_kmeans mini_kmeans 6 dataframe 1 0.0008 0 0 0 0 0 0 76.53 0.15 0.17 0.007
mini_kmeans mini_kmeans 6 dataframe 2 0.0008 0 0 0 0 0 0 76.53 0.15 0.17 0.0072
mini_kmeans mini_kmeans 6 dataframe 3 0.001 0 0 0 0 0 0 76.53 0.15 0.17 0.0076
mini_kmeans mini_kmeans 6 dataframe 4 0.001 0 0 0 0 0 0 76.53 0.15 0.17 0.008
mini_kmeans mini_kmeans 6 dataframe 5 0.0012 0 0 0 0 0 0 76.53 0.15 0.17 0.0106

This property tells us whether an internal evaluation of the groups was performed:

#> [1] TRUE

This property tells us whether an external evaluation of the groups was performed:

#> [1] TRUE

Algorithms executed

#> [1] "gmm"         "kmeans_arma" "kmeans_rcpp" "mini_kmeans"

Similarity Metrics

#> [1] "gmm_euclidean" "gmm_manhattan" "kmeans_arma"   "kmeans_rcpp"  
#> [5] "mini_kmeans"

To obtain the ranked variables instead of the metric values, set the variables parameter to TRUE:


df_variable <- Clustering::clustering(df = Clustering::basketball,  
                             packages = c("clusterr"), min = 4, max = 6, variables = TRUE)
Algorithm Distance Clusters Dataset Ranking timeExternal entropy variation_information precision recall f_measure fowlkes_mallows_index connectivity dunn silhouette timeInternal
gmm gmm_euclidean 4 dataframe 1 4 2 3 2 2 2 2 1 1 1 4
gmm gmm_euclidean 4 dataframe 2 2 4 5 4 4 4 4 2 2 2 5
gmm gmm_euclidean 4 dataframe 3 5 3 1 1 1 1 1 3 3 3 2
gmm gmm_euclidean 4 dataframe 4 1 5 4 3 3 3 3 4 4 4 1
gmm gmm_euclidean 4 dataframe 5 3 1 2 5 5 5 5 5 5 5 3
gmm gmm_euclidean 5 dataframe 1 3 2 3 2 2 2 2 1 1 1 2
gmm gmm_euclidean 5 dataframe 2 1 4 4 4 4 4 4 2 2 2 4
gmm gmm_euclidean 5 dataframe 3 4 3 5 1 1 1 1 3 3 3 1
gmm gmm_euclidean 5 dataframe 4 2 5 1 3 3 3 3 4 4 4 3
gmm gmm_euclidean 5 dataframe 5 5 1 2 5 5 5 5 5 5 5 5
gmm gmm_euclidean 6 dataframe 1 3 2 4 2 2 2 2 1 1 1 3
gmm gmm_euclidean 6 dataframe 2 1 4 3 4 4 4 4 2 2 2 5
gmm gmm_euclidean 6 dataframe 3 4 3 5 1 1 1 1 3 3 3 1
gmm gmm_euclidean 6 dataframe 4 2 5 1 3 3 3 3 4 4 4 2
gmm gmm_euclidean 6 dataframe 5 5 1 2 5 5 5 5 5 5 5 4
gmm gmm_manhattan 4 dataframe 1 5 2 3 2 2 2 2 1 1 1 2
gmm gmm_manhattan 4 dataframe 2 1 4 5 4 4 4 4 2 2 2 1
gmm gmm_manhattan 4 dataframe 3 3 3 1 1 1 1 1 3 3 3 4
gmm gmm_manhattan 4 dataframe 4 2 5 4 3 3 3 3 4 4 4 5
gmm gmm_manhattan 4 dataframe 5 4 1 2 5 5 5 5 5 5 5 3
gmm gmm_manhattan 5 dataframe 1 5 2 4 2 2 2 2 1 1 1 3
gmm gmm_manhattan 5 dataframe 2 1 4 3 4 4 4 4 2 2 2 2
gmm gmm_manhattan 5 dataframe 3 3 3 5 1 1 1 1 3 3 3 4
gmm gmm_manhattan 5 dataframe 4 2 5 1 3 3 3 3 4 4 4 5
gmm gmm_manhattan 5 dataframe 5 4 1 2 5 5 5 5 5 5 5 1
gmm gmm_manhattan 6 dataframe 1 4 2 4 2 4 2 2 1 1 1 2
gmm gmm_manhattan 6 dataframe 2 1 4 3 4 2 4 4 2 2 2 1
gmm gmm_manhattan 6 dataframe 3 5 3 5 1 1 1 1 3 3 3 3
gmm gmm_manhattan 6 dataframe 4 2 5 1 3 3 3 3 4 4 4 5
gmm gmm_manhattan 6 dataframe 5 3 1 2 5 5 5 5 5 5 5 4
kmeans_arma kmeans_arma 4 dataframe 1 5 1 1 1 1 1 1 1 1 1 5
kmeans_arma kmeans_arma 4 dataframe 2 3 2 2 2 2 2 2 2 2 2 2
kmeans_arma kmeans_arma 4 dataframe 3 4 3 3 3 3 3 3 3 3 3 4
kmeans_arma kmeans_arma 4 dataframe 4 2 4 4 4 4 4 4 4 4 4 3
kmeans_arma kmeans_arma 4 dataframe 5 1 5 5 5 5 5 5 5 5 5 1
kmeans_arma kmeans_arma 5 dataframe 1 5 1 1 1 1 1 1 1 1 1 1
kmeans_arma kmeans_arma 5 dataframe 2 2 2 2 2 2 2 2 2 2 2 3
kmeans_arma kmeans_arma 5 dataframe 3 3 3 3 3 3 3 3 3 3 3 4
kmeans_arma kmeans_arma 5 dataframe 4 1 4 4 4 4 4 4 4 4 4 5
kmeans_arma kmeans_arma 5 dataframe 5 4 5 5 5 5 5 5 5 5 5 2
kmeans_arma kmeans_arma 6 dataframe 1 3 1 1 1 1 1 1 1 1 1 5
kmeans_arma kmeans_arma 6 dataframe 2 1 2 2 2 2 2 2 2 2 2 3
kmeans_arma kmeans_arma 6 dataframe 3 5 3 3 3 3 3 3 3 3 3 4
kmeans_arma kmeans_arma 6 dataframe 4 2 4 4 4 4 4 4 4 4 4 1
kmeans_arma kmeans_arma 6 dataframe 5 4 5 5 5 5 5 5 5 5 5 2
kmeans_rcpp kmeans_rcpp 4 dataframe 1 3 4 5 2 3 2 2 1 1 1 4
kmeans_rcpp kmeans_rcpp 4 dataframe 2 1 2 1 4 2 4 4 2 2 2 5
kmeans_rcpp kmeans_rcpp 4 dataframe 3 4 3 3 3 4 3 3 3 3 3 3
kmeans_rcpp kmeans_rcpp 4 dataframe 4 2 5 4 1 1 1 1 4 4 4 1
kmeans_rcpp kmeans_rcpp 4 dataframe 5 5 1 2 5 5 5 5 5 5 5 2
kmeans_rcpp kmeans_rcpp 5 dataframe 1 3 2 4 2 3 2 2 1 1 1 4
kmeans_rcpp kmeans_rcpp 5 dataframe 2 1 4 5 4 2 4 4 2 2 2 5
kmeans_rcpp kmeans_rcpp 5 dataframe 3 4 3 1 3 4 3 3 3 3 3 3
kmeans_rcpp kmeans_rcpp 5 dataframe 4 2 5 3 1 1 1 1 4 4 4 2
kmeans_rcpp kmeans_rcpp 5 dataframe 5 5 1 2 5 5 5 5 5 5 5 1
kmeans_rcpp kmeans_rcpp 6 dataframe 1 4 2 4 2 2 2 2 1 1 1 2
kmeans_rcpp kmeans_rcpp 6 dataframe 2 1 4 3 4 4 4 4 2 2 2 5
kmeans_rcpp kmeans_rcpp 6 dataframe 3 3 3 5 1 1 1 1 3 3 3 1
kmeans_rcpp kmeans_rcpp 6 dataframe 4 2 5 1 3 3 3 3 4 4 4 4
kmeans_rcpp kmeans_rcpp 6 dataframe 5 5 1 2 5 5 5 5 5 5 5 3
mini_kmeans mini_kmeans 4 dataframe 1 2 1 1 1 1 1 1 1 1 1 1
mini_kmeans mini_kmeans 4 dataframe 2 1 2 2 2 2 2 2 2 2 2 4
mini_kmeans mini_kmeans 4 dataframe 3 5 3 3 3 3 3 3 3 3 3 2
mini_kmeans mini_kmeans 4 dataframe 4 3 4 4 4 4 4 4 4 4 4 5
mini_kmeans mini_kmeans 4 dataframe 5 4 5 5 5 5 5 5 5 5 5 3
mini_kmeans mini_kmeans 5 dataframe 1 4 1 1 1 1 1 1 1 1 1 1
mini_kmeans mini_kmeans 5 dataframe 2 3 2 2 2 2 2 2 2 2 2 3
mini_kmeans mini_kmeans 5 dataframe 3 2 3 3 3 3 3 3 3 3 3 2
mini_kmeans mini_kmeans 5 dataframe 4 5 4 4 4 4 4 4 4 4 4 5
mini_kmeans mini_kmeans 5 dataframe 5 1 5 5 5 5 5 5 5 5 5 4
mini_kmeans mini_kmeans 6 dataframe 1 3 1 1 1 1 1 1 1 1 1 1
mini_kmeans mini_kmeans 6 dataframe 2 5 2 2 2 2 2 2 2 2 2 5
mini_kmeans mini_kmeans 6 dataframe 3 4 3 3 3 3 3 3 3 3 3 2
mini_kmeans mini_kmeans 6 dataframe 4 1 4 4 4 4 4 4 4 4 4 3
mini_kmeans mini_kmeans 6 dataframe 5 2 5 5 5 5 5 5 5 5 5 4

To obtain only the best-ranked variables or values for the external metrics, run the following method:


df_best_ranked_external <- Clustering::best_ranked_external_metrics(df$result)
Algorithm Distance Clusters Dataset Ranking timeExternal entropy variation_information precision recall f_measure fowlkes_mallows_index
gmm gmm_euclidean 4 dataframe 1 0.0203 0.3161 4.762 0.1822 0.451 0.2595 0.2867
gmm gmm_euclidean 5 dataframe 1 0.0245 0.4175 4.363 0.1637 0.2865 0.2084 0.2165
gmm gmm_euclidean 6 dataframe 1 0.0278 0.433 4.439 0.1744 0.2791 0.2147 0.2206
gmm gmm_manhattan 4 dataframe 1 0.0149 0.3161 4.762 0.1822 0.451 0.2595 0.2867
gmm gmm_manhattan 5 dataframe 1 0.0257 0.4258 4.35 0.167 0.2828 0.21 0.2173
gmm gmm_manhattan 6 dataframe 1 0.0421 0.4555 4.298 0.1669 0.2608 0.2035 0.2085
kmeans_arma kmeans_arma 4 dataframe 1 0.0009 0 0 0 0 0 0
kmeans_arma kmeans_arma 5 dataframe 1 0.0008 0 0 0 0 0 0
kmeans_arma kmeans_arma 6 dataframe 1 0.001 0 0 0 0 0 0
kmeans_rcpp kmeans_rcpp 4 dataframe 1 0.0173 0.3728 4.627 0.1697 0.5 0.23 0.2461
kmeans_rcpp kmeans_rcpp 5 dataframe 1 0.0218 0.4269 4.551 0.1663 0.5 0.2104 0.2183
kmeans_rcpp kmeans_rcpp 6 dataframe 1 0.0284 0.4545 4.331 0.1703 0.2458 0.2012 0.2046
mini_kmeans mini_kmeans 4 dataframe 1 0.001 0 0 0 0 0 0
mini_kmeans mini_kmeans 5 dataframe 1 0.0009 0 0 0 0 0 0
mini_kmeans mini_kmeans 6 dataframe 1 0.0008 0 0 0 0 0 0
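The rows above coincide with the Ranking 1 rows of the full result table, restricted to the external columns. The base R sketch below reproduces that selection on a small excerpt. It is an observation about this output, not a description of how best_ranked_external_metrics is implemented, and result_excerpt is a hypothetical data frame holding a few values copied from the gmm_euclidean rows for four clusters.

```r
# Hypothetical excerpt: three gmm_euclidean rows (4 clusters) from the full table.
result_excerpt <- data.frame(
  Algorithm  = "gmm",
  Distance   = "gmm_euclidean",
  Clusters   = 4,
  Ranking    = 1:3,
  entropy    = c(0.3161, 0.3085, 0.0064),
  precision  = c(0.1822, 0.1113, 0),
  silhouette = c(0.23, 0.23, 0.23),   # internal metric, dropped below
  stringsAsFactors = FALSE
)

# Keep only the top-ranked row and the external columns.
external_columns <- c("Algorithm", "Distance", "Clusters", "Ranking",
                      "entropy", "precision")
best_external <- result_excerpt[result_excerpt$Ranking == 1, external_columns]

print(best_external)
```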

We can also obtain the best-ranked values for the internal evaluation:


df_best_ranked_internal <- Clustering::best_ranked_internal_metrics(df$result)
Algorithm Distance Clusters Dataset Ranking timeInternal connectivity dunn silhouette
gmm gmm_euclidean 4 dataframe 1 0.007 34.09 0.1646 0.23
gmm gmm_euclidean 5 dataframe 1 0.0065 42.08 0.1619 0.25
gmm gmm_euclidean 6 dataframe 1 0.0064 51.46 0.1619 0.23
gmm gmm_manhattan 4 dataframe 1 0.0065 35.59 0.1348 0.23
gmm gmm_manhattan 5 dataframe 1 0.0072 46.83 0.1322 0.26
gmm gmm_manhattan 6 dataframe 1 0.0089 54.87 0.1467 0.25
kmeans_arma kmeans_arma 4 dataframe 1 0.0088 44.21 0.1495 0.23
kmeans_arma kmeans_arma 5 dataframe 1 0.0082 49.22 0.1538 0.26
kmeans_arma kmeans_arma 6 dataframe 1 0.0081 57.63 0.1619 0.24
kmeans_rcpp kmeans_rcpp 4 dataframe 1 0.0073 51.04 0.1741 0.23
kmeans_rcpp kmeans_rcpp 5 dataframe 1 0.007 66.85 0.152 0.19
kmeans_rcpp kmeans_rcpp 6 dataframe 1 0.007 74.78 0.1522 0.19
mini_kmeans mini_kmeans 4 dataframe 1 0.007 50.35 0.1571 0.21
mini_kmeans mini_kmeans 5 dataframe 1 0.0073 76.4 0.1216 0.17
mini_kmeans mini_kmeans 6 dataframe 1 0.007 76.53 0.15 0.17

To obtain the best evaluation by algorithm:


df_best_validation_external <- Clustering::evaluate_best_validation_external_by_metrics(df$result)
Algorithm Distance timeExternal entropy variation_information precision recall f_measure fowlkes_mallows_index
gmm gmm_euclidean 0.0278 0.433 4.762 0.1822 0.451 0.2595 0.2867
gmm gmm_manhattan 0.0421 0.4555 4.762 0.1822 0.451 0.2595 0.2867
kmeans_arma kmeans_arma 0.001 0 0 0 0 0 0
kmeans_rcpp kmeans_rcpp 0.0284 0.4545 4.627 0.1703 0.5 0.23 0.2461
mini_kmeans mini_kmeans 0.001 0 0 0 0 0 0
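The figures in this table are consistent with taking, for each algorithm and distance, the column-wise maximum over the Ranking 1 rows. The base R sketch below shows that aggregation on an excerpt of the values printed earlier. It is only a reading of the output under that assumption, not the package's implementation, and ranking_one is a hypothetical name.

```r
# Ranking 1 rows for two algorithm/distance pairs, copied from the tables above.
ranking_one <- data.frame(
  Algorithm = rep(c("gmm", "kmeans_rcpp"), each = 3),
  Distance  = rep(c("gmm_euclidean", "kmeans_rcpp"), each = 3),
  Clusters  = rep(4:6, times = 2),
  entropy   = c(0.3161, 0.4175, 0.433, 0.3728, 0.4269, 0.4545),
  precision = c(0.1822, 0.1637, 0.1744, 0.1697, 0.1663, 0.1703),
  recall    = c(0.451, 0.2865, 0.2791, 0.5, 0.5, 0.2458),
  stringsAsFactors = FALSE
)

# Column-wise maximum of each metric per algorithm and distance.
best_by_algorithm <- aggregate(
  cbind(entropy, precision, recall) ~ Algorithm + Distance,
  data = ranking_one,
  FUN  = max
)

print(best_by_algorithm)
```

For gmm_euclidean this yields 0.433, 0.1822 and 0.451, the same entropy, precision and recall values shown in the table.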

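Based on the results obtained, the gmm algorithm behaves better on most of the external metrics (precision, f_measure and fowlkes_mallows_index), although kmeans_rcpp reaches a higher recall (0.5 against 0.451).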

From the algorithm with the best rating we can select the most appropriate number of clusters.


df_result_external <- Clustering::result_external_algorithm_by_metric(df$result,"gmm")
Algorithm Clusters timeExternal entropy variation_information precision recall f_measure fowlkes_mallows_index
gmm 4 0.0203 0.3161 4.762 0.1822 0.451 0.2595 0.2867
gmm 5 0.0257 0.4258 4.363 0.167 0.2865 0.21 0.2173
gmm 6 0.0421 0.4555 4.439 0.1744 0.2791 0.2147 0.2206
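One simple way to read this table is to pick, for each metric, the number of clusters with the highest value. The base R sketch below does that for three metrics where higher is better. The data frame gmm_by_clusters is a hypothetical copy of the values printed above, and the selection rule is an assumption made for the example rather than something the package prescribes.

```r
# Values copied from the gmm table above.
gmm_by_clusters <- data.frame(
  Clusters  = c(4, 5, 6),
  precision = c(0.1822, 0.167, 0.1744),
  recall    = c(0.451, 0.2865, 0.2791),
  f_measure = c(0.2595, 0.21, 0.2147)
)

# For each metric, the cluster count that achieves the maximum value.
metrics <- c("precision", "recall", "f_measure")
best_clusters <- sapply(metrics, function(metric) {
  gmm_by_clusters$Clusters[which.max(gmm_by_clusters[[metric]])]
})

print(best_clusters)
```

With these values all three metrics point to four clusters.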

The checks performed for the external evaluation metrics can also be performed for the internal evaluation.


df_best_validation_internal <-   
  Clustering::evaluate_best_validation_internal_by_metrics(df$result)
Algorithm Distance timeInternal connectivity dunn silhouette
gmm gmm_euclidean 0.007 51.46 0.1646 0.25
gmm gmm_manhattan 0.0089 54.87 0.1467 0.26
kmeans_arma kmeans_arma 0.0088 57.63 0.1619 0.26
kmeans_rcpp kmeans_rcpp 0.0073 74.78 0.1741 0.23
mini_kmeans mini_kmeans 0.0073 76.53 0.1571 0.21

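In this case the preferred algorithm depends on the metric being evaluated: for example, kmeans_rcpp has the highest dunn value (0.1741), while gmm_manhattan and kmeans_arma share the highest silhouette (0.26).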
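The per-metric comparison can be sketched in base R as follows. The data frame internal_best is a hypothetical copy of the dunn and silhouette columns from the table above, and the sketch assumes that a higher value is better for both metrics. Ties are kept instead of being broken arbitrarily.

```r
# dunn and silhouette values copied from the internal validation table.
internal_best <- data.frame(
  Distance   = c("gmm_euclidean", "gmm_manhattan", "kmeans_arma",
                 "kmeans_rcpp", "mini_kmeans"),
  dunn       = c(0.1646, 0.1467, 0.1619, 0.1741, 0.1571),
  silhouette = c(0.25, 0.26, 0.26, 0.23, 0.21),
  stringsAsFactors = FALSE
)

# For each metric, list every entry that reaches the maximum value.
winners <- lapply(c(dunn = "dunn", silhouette = "silhouette"), function(metric) {
  values <- internal_best[[metric]]
  internal_best$Distance[values == max(values)]
})

print(winners)
```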

To plot any metric as a function of the number of clusters and the algorithm, use the plotting function that matches the kind of evaluation metric, internal or external:


Clustering::plot_external_validation(df,"variation_information")
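For readers who want a similar view without the package, the sketch below draws variation_information against the number of clusters with base R graphics. It is only a rough stand-in under the assumption that one line per algorithm is wanted, not a reproduction of what plot_external_validation draws. The values are copied from the Ranking 1 rows above, and the object names are made up for the example.

```r
# Number of clusters evaluated in the run above.
clusters <- c(4, 5, 6)

# variation_information per algorithm (one column each), copied from the tables.
variation_information <- cbind(
  gmm         = c(4.762, 4.363, 4.439),
  kmeans_rcpp = c(4.627, 4.551, 4.331)
)

# One line per algorithm, with points marked at each cluster count.
matplot(clusters, variation_information, type = "b",
        pch = 1:2, lty = 1:2, col = 1:2, xaxt = "n",
        xlab = "Number of clusters", ylab = "variation_information")
axis(1, at = clusters)
legend("topright", legend = colnames(variation_information),
       pch = 1:2, lty = 1:2, col = 1:2)
```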