Introduction to the evclust package

Thierry Denoeux

2020-01-07

The package evclust contains methods for evidential clustering. In evidential clustering, cluster membership uncertainty is represented by Dempster-Shafer mass functions. The user is invited to read the papers cited in this vignette to get familiar with the main concepts underlying evidential clustering. These papers can be downloaded from the author’s web site, at https://www.hds.utc.fr/~tdenoeux. Here, we provide a guided tour of the main functions in the evclust package. It consists in six main functions, implementing six different evidential clustering algorithms:

You first need to install this package:

library("evclust")

Evidential c-means (ECM) algorithm

The Evidential \(c\)-means (ECM) algorithm (Masson and Denoeux 2008) is a \(c\)-means-like algorithm that minimizes a cost function by searching alternatively the space of prototypes and the space of credal partitions. Unlike the hard and fuzzy \(c\)-means algorithms, ECM associates a prototype not only to clusters, but also to sets of clusters. The prototype associated to a set of clusters is defined as the barycenter of the prototypes of each single cluster in the set. The cost function to be minimized insures that objects close to a prototype have a large mass assigned to the corresponding set of clusters.

Consider, for instance, the fourclass data. This dataset consists in four clusters in a two-dimensional space.

data(fourclass)
x<-fourclass[,1:2]
y<-fourclass[,3]
plot(x[,1],x[,2],pch=y,col=y,xlab=expression(x[1]),ylab=expression(x[2]))

We can run ECM with c=4 clusters on this data as follows:

clus<-ecm(x,c=4,type='full',alpha=1,beta=2,delta=sqrt(20),disp=FALSE)

The option type='full' is actually the option. It implies that mass functions in the credal partition will have \(2^c\) focal sets. You can get basic information about the credal partition using the method summary:

summary(clus)
## ------ Credal partition ------
## 4 classes,400 objects
## Generated by ecm
## Focal sets:
##       [,1] [,2] [,3] [,4]
##  [1,]    0    0    0    0
##  [2,]    1    0    0    0
##  [3,]    0    1    0    0
##  [4,]    1    1    0    0
##  [5,]    0    0    1    0
##  [6,]    1    0    1    0
##  [7,]    0    1    1    0
##  [8,]    1    1    1    0
##  [9,]    0    0    0    1
## [10,]    1    0    0    1
## [11,]    0    1    0    1
## [12,]    1    1    0    1
## [13,]    0    0    1    1
## [14,]    1    0    1    1
## [15,]    0    1    1    1
## [16,]    1    1    1    1
## Value of the criterion=265.38
## Nonspecificity=0.32
## Prototypes:
##            [,1]       [,2]
## [1,] -0.5321957  4.1806930
## [2,]  4.4363717  4.6389264
## [3,] -0.4517042 -1.0674512
## [4,]  4.3309815 -0.2225598
## Number of outliers=1.00

We can restrict the focals set to pairs by changing the type argument.

clus<-ecm(x,c=4,type='pairs',alpha=1,beta=2,delta=sqrt(20),disp=FALSE)
summary(clus)
## ------ Credal partition ------
## 4 classes,400 objects
## Generated by ecm
## Focal sets:
##       [,1] [,2] [,3] [,4]
##  [1,]    0    0    0    0
##  [2,]    1    0    0    0
##  [3,]    0    1    0    0
##  [4,]    0    0    1    0
##  [5,]    0    0    0    1
##  [6,]    1    1    0    0
##  [7,]    1    0    1    0
##  [8,]    1    0    0    1
##  [9,]    0    1    1    0
## [10,]    0    1    0    1
## [11,]    0    0    1    1
## [12,]    1    1    1    1
## Value of the criterion=308.06
## Nonspecificity=0.25
## Prototypes:
##            [,1]        [,2]
## [1,]  4.1994436 -0.08173186
## [2,] -0.4750072  4.06488537
## [3,] -0.3439293 -0.87892163
## [4,]  4.3184187  4.56133236
## Number of outliers=1.00

A plot of the credal partition can be generated as follows:

plot(clus,x,mfrow=c(2,2),ytrue=y,Approx=2)

In this plot, the lower and upper approximations of each cluster are plotted as solid and interrupted lines, respectively. Since we selected Approx=2, the lower and upper approximations are defined as follows. Let \(A_i\) be the focal set of mass function \(m_i\) with the largest mass. The lower approximation of cluster \(\omega_k\) is the set of objects such that \(A_i=\{\omega_k\}\), while the upper approximation is the set of objects such that \(\omega_k \in A_i\). The outliers are the objects such that \(A_i=\emptyset\). They are displayed as circles in the figure above.

RECM

The ECM algorithm can only be used with attribute data. The Relational Evidential c-means algorithm (RECM) (Masson and Denœux 2009) is a version of ECM that can handle dissimilarity data, i.e., data that consist in a matrix on dissimilarities between \(n\) objects. Consider, for instance, the Proteins dataset. It consists of a dissimilarity matrix derived from the structural comparison of 213 protein sequences. Each of these proteins is known to belong to one of four classes of globins: hemoglobin-\(\alpha\) (HA), hemoglobin-\(\beta\) (HB), myoglobin (M) and heterogeneous globins (G). We can display these data using Multidimensional Scaling:

data(protein)
z<- cmdscale(protein$D,k=2)
plot(z[,1],z[,2],xlab='axis 1',ylab='axis 2')

We can see that the data points are grouped in four clusters. We can run RECM on this dataset at follows,

clus <- recm(D=protein$D,c=4,type='pairs',alpha=0.2,beta=1.1,delta2=20,disp=FALSE)
summary(clus)
## ------ Credal partition ------
## 4 classes,213 objects
## Generated by recm
## Focal sets:
##       [,1] [,2] [,3] [,4]
##  [1,]    0    0    0    0
##  [2,]    1    0    0    0
##  [3,]    0    1    0    0
##  [4,]    0    0    1    0
##  [5,]    0    0    0    1
##  [6,]    1    1    0    0
##  [7,]    1    0    1    0
##  [8,]    1    0    0    1
##  [9,]    0    1    1    0
## [10,]    0    1    0    1
## [11,]    0    0    1    1
## [12,]    1    1    1    1
## Value of the criterion=771.08
## Nonspecificity=0.07
## Number of outliers=0.00
plot(clus,X=z,mfrow=c(2,2),ytrue=protein$y)

k-EVCLUS

The RECM algorithm requires to store the whole dissimilarity matrix. Consequently, it is not suitable to handle large dissimilarity datasets. The k-EVCLUS algorithm, recently introduced in (Denœux, Sriboonchitta, and Kanjanatarakul 2016), overcomes this limitation. It is an improved version of the EVCLUS algorithm introduced in (Denœux and Masson 2004). The EVCLUS algorithm applies ideas from Multidimensional Scaling to clustering: given a dissimilarity matrix, it finds a credal partition such that the degrees of conflict between mass functions match the dissimilarities, dissimilar objects being represented by highly conflicting mass functions; this is achieved by iteratively minimizing a stress function. The k-EVCLUS algorithm uses a more efficient optimization procedure, and uses only the dissimilarities between each object and \(k\) randomly chosen other objects. As a result, the complexity is reduced from quadratic to linear.

As an example, let us generate a dataset from the same distribution as the fourclass dataset, but with \(n=1000\) objects.

n<-1000
x<-matrix(rt(2*n,df=5),n,2)
y<-c(rep(1,n/4),rep(2,n/4),rep(3,n/4),rep(4,n/4))
x[(y==2)|(y==4),2]<- 5 + x[(y==2)|(y==4),2]
x[(y==3)|(y==4),1]<- 5 + x[(y==3)|(y==4),1]

plot(x[,1],x[,2],pch=y,col=y,xlab=expression(x[1]),ylab=expression(x[2]))

There are \(1000\times 999/2=499,500\) pairwises distances for these data. However, by choosing \(k=50\), we can use only \(1000\times 50=50,000\) distances, i.e., only 10 percent of the original distances to generate a credal partition.

clus<-kevclus(x=x,c=4,k=50,disp=FALSE)
## [1] 1.00000000 0.01925626 0.01925626
summary(clus)
## ------ Credal partition ------
## 4 classes,1000 objects
## Generated by kevclus
## Focal sets:
##      [,1] [,2] [,3] [,4]
## [1,]    0    0    0    0
## [2,]    1    0    0    0
## [3,]    0    1    0    0
## [4,]    0    0    1    0
## [5,]    0    0    0    1
## [6,]    1    1    1    1
## Value of the criterion=0.02
## Nonspecificity=0.13
## Number of outliers=10.00

For a partition created by kevclus, the plot.credpart functions generates two plots: the usual plot of the credal partition in the attribute space, and the Shepard diagram. This is a plot of the degrees of conflict vs. the transformed dissimilarities. If the argument X is not supplied, only the Shepard diagram is plotted.

plot(clus,X=x,ytrue=y)

EKNN-clus

The clustering methods described above require the number of cluters to be fixed in advance. In contrast, EK-NNclus (Denœux, Kanjanatarakul, and Sriboonchitta 2015) is another evidential clustering algorithm that automatically determines the number of clusters. Starting from an initial partition, E\(K\)-NNclus iteratively reassigns objects to clusters using the E\(K\)-NN rule, until a stable partition is obtained. After convergence, the cluster membership of each object is described by a Dempster-Shafer mass function assigning a mass to each cluster and to the whole set of clusters. The mass assigned to the set of clusters can be used to identify outliers. The method can be implemented in a competitive Hopfield neural network, whose energy function is related to the plausibility of the partition. The procedure can thus be seen as searching for the most plausible partition of the data. The E\(K\)-NNclus algorithm depends on two parameters, the number \(K\) of neighbors and a scale parameter \(q\), which can be fixed using simple heuristics.

As an example, let us apply E\(K\)-NNclus to the s2 dataset. This dataset contains 5000 two-dimensional vectors grouped in 15 Gaussian clusters.

data(s2)
plot(s2[,1],s2[,2],xlab=expression(x[1]),ylab=expression(x[2]),pch='.')

To apply E\(K\)-NNclus, we need to fix \(k\) and \(q\). Results presented (Denœux, Kanjanatarakul, and Sriboonchitta 2015) suggest that a value of \(K\) of the order of two or there times \(\sqrt{n}\) and \(q\) greater than 0.5 are suitable. For instance, let use set \(K=200\) and \(q=0.9\). The algorithm is initialized with a random partition in 500 clusters. It is run ntrials=5 times, and the result with the lowest value of the energy function is kept.

y0<-sample(500,size=5000,replace=TRUE)
clus<- EkNNclus(s2,K=200,y0=y0,ntrials=5,q=0.9,disp=FALSE)
## Summary of the partition
summary(clus)
## ------ Credal partition ------
## 14 classes,5000 objects
## Generated by EkNNclus
## Focal sets:
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
##  [1,]    0    0    0    0    0    0    0    0    0     0     0     0     0
##  [2,]    1    0    0    0    0    0    0    0    0     0     0     0     0
##  [3,]    0    1    0    0    0    0    0    0    0     0     0     0     0
##  [4,]    0    0    1    0    0    0    0    0    0     0     0     0     0
##  [5,]    0    0    0    1    0    0    0    0    0     0     0     0     0
##  [6,]    0    0    0    0    1    0    0    0    0     0     0     0     0
##  [7,]    0    0    0    0    0    1    0    0    0     0     0     0     0
##  [8,]    0    0    0    0    0    0    1    0    0     0     0     0     0
##  [9,]    0    0    0    0    0    0    0    1    0     0     0     0     0
## [10,]    0    0    0    0    0    0    0    0    1     0     0     0     0
## [11,]    0    0    0    0    0    0    0    0    0     1     0     0     0
## [12,]    0    0    0    0    0    0    0    0    0     0     1     0     0
## [13,]    0    0    0    0    0    0    0    0    0     0     0     1     0
## [14,]    0    0    0    0    0    0    0    0    0     0     0     0     1
## [15,]    0    0    0    0    0    0    0    0    0     0     0     0     0
## [16,]    1    1    1    1    1    1    1    1    1     1     1     1     1
##       [,14]
##  [1,]     0
##  [2,]     0
##  [3,]     0
##  [4,]     0
##  [5,]     0
##  [6,]     0
##  [7,]     0
##  [8,]     0
##  [9,]     0
## [10,]     0
## [11,]     0
## [12,]     0
## [13,]     0
## [14,]     0
## [15,]     1
## [16,]     1
## Value of the criterion=544366.05
## Nonspecificity=0.00
## Number of outliers=0.00
## Plot of the partition
plot(clus,X=s2,Outliers=FALSE)

We can see that E\(K\)-NNclus generally finds 15 or 14 clusters for these data. In constrast with the previous algorithms, E\(K\)-NNclus generates a credal partition with normalized mass functions. Outliers tend to have a large mass \(m(\Omega)\) on the frame of discernment, but the credal partition is otherwise usually very close to a hard partition.

Generating a credal partition with a large number of clusters

If no restriction is imposed on the focal sets, the number of parameters to be estimated in evidential clustering grows exponentially with the number \(c\) of clusters, which makes it intractable unless \(c\) is small. If we allow masses to be assigned to pairs of clusters, as suggested in [@denoeux04b] and [@masson08], the number of focal sets becomes proportional to \(c^2\), which is manageable for moderate values of \(c\) (say, until 10), but still impractical for larger \(n\). It is clear, however, that only a few pairs of contiguous clusters will be assigned some mass during the learning process. To determine which pairs of clusters can potentially become focal sets, we can use the following three-step approach (Denœux, Sriboonchitta, and Kanjanatarakul 2016):

  1. In the first step, a clustering algorithm (ecm, recm or kevclus) is run in the basic configuration, with focal sets of cardinalities 0, 1 and (optionally) \(c\). A credal partition \(M_0\) is obtained.
  2. The similarity between each pair of clusters \((\omega_j,\omega_\ell)\) is analyzed, and we determine the set \(P_K\) of pairs \(\{\omega_j,\omega_\ell\}\) that are mutual \(K\) nearest neighbors.
  3. The clustering algorithm is run again, starting from the previous credal partition \(M_0\), and adding as focal sets the pairs in \(P_K\).

As an example, consider again the s2 dataset with the ecm algorithm. We first construct a credal partition with \(c=15\) clusters and \(f=16\) focal sets (the empty set and the sigletons):

data(s2)
clus<-ecm(x=s2,c=15,type='simple',Omega=FALSE,delta=1,disp=FALSE)
plot(x=clus,X=s2,Outliers = TRUE)

We then search for pairs of clusters that are \(K=2\) mutual nearest neighbors:

P<-createPairs(clus,k=2)
P$pairs
##       row col
##  [1,]   2   4
##  [2,]   3   4
##  [3,]   6   7
##  [4,]   1  11
##  [5,]   2  12
##  [6,]   6  12
##  [7,]   1  14
##  [8,]  11  14
##  [9,]  10  15
## [10,]  13  15

Finally, we run again ecm, with the pairs of clusters found in the previous step:

clus1<-ecm(x=s2,c=15,type='pairs',Omega=FALSE,pairs=P$pairs,g0=clus$g,delta=1,disp=FALSE)
plot(x=clus1,X=s2,Outliers = TRUE,Approx=2)

This time, we can see that the outer approximations of neighboring clusters overlap, reflecting uncertainty in the partition.

Evidential clustering with pairwise constraints

The clustering task can be made easier if we have some prior knowledge of the underlying partition. Such knowledge can take the form of must-link (ML) constraints (pairs of objects that are known to belong to the same clusters) and cannot-link (CL) constraints (pairs of objects that are known to belong to different clusters). The CECM algorithm (Antoine et al. 2012) is a variant of ECM that makes it possible to handle such constraints.

Consider, for instance, the Four-class dataset. It is actually composed of four clusters, but let us assume that the partition of interest has only two groups, with clusters 1-2 and 3-4 in the first group. We can display the corresponding binary partition:

data(fourclass)
x<-as.matrix(fourclass[,1:2])
y<-as.vector(fourclass[,3])
## We group classes 1-2 and 3-4
y<-2-(y<=2)
plot(x[,1],x[,2],pch=y,col=y)

Depending on the initial conditions, ecm may converge to this partition, or to a partition close to the following one:

y1<-((fourclass[,3]==1) | (fourclass[,3]==3))+1
plot(x[,1],x[,2],pch=y1,col=y1)

To guide the clustering algorithm to the desired partition, we can provide ML and CL constraints. They can be randomly generated using the create_MLCL function:

const<-create_MLCL(y,nbConst=30)

Here, we generated 30 constraints, of ML or VL type. These constraints can be passed to function cecm. Other important arguments of cecm are the coefficient bal, which determines the tradeoff between the stress and penalty terms in the cost function, and distance, which determines the distance function used. The default (distance=0) is the Euclidean function. By setting distance=1, we get generalized Euclidean (Mahalanobis) distances for each cluster.

clus<-cecm(x,c=2,type='full',ntrials=5,ML=const$ML,CL=const$CL,delta=10,bal=0.5,distance=1,disp=FALSE)
## [1] 1.0000000 0.4479564 0.4479564
## [1] 2.0000000 0.5424532 0.4479564
## [1] 3.0000000 0.4791905 0.4479564
## [1] 4.0000000 0.4552794 0.4479564
## [1] 5.0000000 0.4370583 0.4370583
plot(clus,X=x,ytrue=y,Outliers=TRUE,Approx=1)

More sophisticated active learning stategies for selecting the constraints are described in (Antoine et al. 2012).

BootClus

This is the most recent algorithm for generating a credal partition. It works by bootstrapping mixture models (Denoeux 2019). The current implementation uses the mclust package and is thus restricted to Gaussian mixtures, but the approach can be applied to any mixture model. The idea is to compute bootstrap percentile intervals on each pairwise probability \(P_{ij}\) that objects \(i\) and \(j\) belong to the same cluster. We then search for a credal partition such that the belief and plausibilities that each pair of objects \(i\) and \(j\) belong to the same cluster approximate the bounds of the confidence intervals in the least squares sense.

Let us illustrate this approach with the Faithful geyser data that are part of package mclust:

data("faithful")
X<-faithful

We run function bootclus with B=100 bootstrap samples to shorten execution time (better results are obtained with at least 500 bootstrap samples). The argument alpha specifies the levels of the quantiles used to compute the confidence intervals; here, by choosing 5% and 95% quantiles, we obtain equal-tail 90% confidence inetrvals. The argument G=3 is passed to function Mclustcalled inside bootclus though the list param, to obtain a partition with three clusters:

param=list(G=3)
res.faithful<-bootclus(X,alpha=c(0.05,0.95),B=100,param=param)

Function bootclus returns the credal partition Clus, an object of class credpart:

plot(res.faithful$Clus,X)

The confidence inetrvals as well as the approximating belief-plausibility intervals are also returned by function bootclus as arrays CI and BelPl, respectively. the following graphs show the pairwise belief and and plausibility degrees vs. the lower and upper bounds of the confidence intervals, which allows us to assess the quality of the approximation:

plot(as.dist(res.faithful$CI[1,,]),as.dist(res.faithful$BelPl[1,,]),pch=".",xlab="",ylab="")
title(xlab="Lower bound of 90% CI",ylab="Belief",line=2.2,cex.lab=1.7)
abline(0,1)

plot(as.dist(res.faithful$CI[2,,]),as.dist(res.faithful$BelPl[2,,]),pch=".",xlab="",ylab="")
title(xlab="Upper bound of 90% CI",ylab="Plausibility",line=2.2,cex.lab=1.7)
abline(0,1)

References

Antoine, V., B. Quost, M.-H. Masson, and T. Denoe“ux. 2012. “CECM: Constrained Evidential c-Means Algorithm.” Computational Statistics & Data Analysis 56 (4): 894–914.

Denoeux, T. 2019. “Calibrated Model-Based Evidential Clustering Using Bootstrapping (Preprint arXiv:1912.06137).”

Denœux, T., and M.-H. Masson. 2004. “EVCLUS: Evidential Clustering of Proximity Data.” IEEE Trans. on Systems, Man and Cybernetics B 34 (1): 95–109.

Denœux, T., O. Kanjanatarakul, and S. Sriboonchitta. 2015. “E\(K\)-NNclus: A Clustering Procedure Based on the Evidential \(K\)-Nearest Neighbor Rule.” Knowledge-Based Systems 88: 57–69.

Denœux, T., S. Sriboonchitta, and O. Kanjanatarakul. 2016. “Evidential Clustering of Large Dissimilarity Data.” Knowledge-Based Systems 106: 179–95.

Masson, M.-H., and T. Denoeux. 2008. “ECM: An Evidential Version of the Fuzzy c-Means Algorithm.” Pattern Recognition 41 (4): 1384–97.

Masson, M.-H., and T. Denœux. 2009. “RECM: Relational Evidential c-Means Algorithm.” Pattern Recognition Letters 30: 1015–26.