Copyright 2019 Faustine Bousquet (faustine.bousquet@tabmo.io or faustine.bousquet@umontpellier.fr) from TabMo and IMAG (Institut Montpelliérain Alexander Grothendieck, University of Montpellier). The binomialMix package is available under the Apache 2.0 license.
The binomialMix package provides a clustering method for longitudinal, non-Gaussian data. It relies on an EM algorithm applied to generalized linear models (GLM).
You can install the binomialMix R package with the following R command:
# install.packages("devtools")
# Either of the two calls below installs the package from GitLab
devtools::install_git("https://gitlab.com/tabmo/binomialmix")
devtools::install_gitlab("tabmo/binomialMix")
You can also work directly from the git repository (https://gitlab.com/tabmo/binomialmix).
Once you have cloned the git repository, install the binomialMix package from your local copy, for example with devtools::install().
Of course, you can use your own data. The required format is the following:

- a data frame
- a column with a factor id identifying the objects you want to cluster
- a target variable
- a weight variable, since we are dealing with binomial data
- at least one column used as an explanatory variable
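The format described above can be illustrated with a small synthetic data frame. This is only a sketch: the column names mirror the ones used in the example further down (id, ctr, impressions, timeSlot, day), but the values are made up, and treating ctr as a proportion of successes out of impressions is an assumption.

```r
# Hypothetical toy data: names follow the adcampaign example, values are simulated
set.seed(1)
my_data <- data.frame(
  id          = factor(rep(c("c1", "c2", "c3"), each = 4)),      # objects to cluster
  timeSlot    = factor(rep(c("morning", "evening"), times = 6)), # explanatory variable
  day         = factor(rep(c("mon", "mon", "tue", "tue"), times = 3)),
  impressions = sample(500:2000, 12, replace = TRUE)             # weights (number of trials)
)

# Target variable, assumed here to be clicks / impressions
clicks      <- rbinom(12, size = my_data$impressions, prob = 0.02)
my_data$ctr <- clicks / my_data$impressions

str(my_data)
```

Each id appears on several rows, which is what makes the data longitudinal.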
Run the clustering algorithm

Here, we want to cluster advertising campaigns. Each campaign (column “id”) is made up of n_c observations from the whole dataset, so we have repeated measures for the same id level. The explanatory variables could be day, timeSlot or app_or_site. We want to try K=3 clusters.
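library(binomialMix)

# Model: target variable explained by timeSlot and day
model_formula <- "ctr~timeSlot+day"
# Column holding the binomial weights
weighted_variable <- "impressions"
nb_cluster <- 3             # number of clusters K
df_tocluster <- adcampaign  # data to cluster
col_id <- "id"              # column identifying the campaigns

result_K3 <- runEM(model_formula,
                   weighted_variable,
                   nb_cluster,
                   df_tocluster,
                   col_id)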
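To give an intuition of what an EM algorithm for a mixture of binomial GLMs does, here is a minimal base-R sketch on simulated data. It is not the binomialMix implementation: the data, the variable names and the stopping rule are all assumptions, and the M step simply calls glm() with weights equal to the number of trials times the posterior probabilities.

```r
# Sketch only: NOT the binomialMix implementation; every name below is hypothetical.
set.seed(42)

# Simulated data: 30 ids, 8 repeated observations each, 2 latent clusters
n_id <- 30; n_obs <- 8; K <- 2
id <- rep(seq_len(n_id), each = n_obs)
z  <- rep(sample(1:2, n_id, replace = TRUE), each = n_obs)  # true cluster of each id
x  <- rnorm(n_id * n_obs)                                   # one explanatory variable
beta_true <- cbind(c(-3, 0.5), c(-1, -0.5))                 # column k = coefficients of cluster k
n  <- rep(200, n_id * n_obs)                                # binomial trials (the weights)
y  <- rbinom(length(x), n, plogis(beta_true[1, z] + beta_true[2, z] * x))
dat <- data.frame(id, x, n, y, p = y / n)

# Random soft assignment of ids to clusters as a starting point
tau <- matrix(runif(n_id * K), n_id, K)
tau <- tau / rowSums(tau)
loglik <- numeric(0)

for (iter in 1:100) {
  # M step: update proportions, then fit one weighted GLM per cluster.
  # glm() uses IRLS, i.e. Fisher scoring for the logit link; quasibinomial
  # gives the same coefficients as binomial without non-integer-count warnings.
  prop <- colMeans(tau)
  fits <- lapply(seq_len(K), function(k) {
    d <- dat
    d$w <- d$n * tau[d$id, k]
    glm(p ~ x, family = quasibinomial, data = d, weights = w)
  })

  # E step: per-id log density under each cluster, then posterior probabilities
  lognum <- sapply(seq_len(K), function(k) {
    mu <- predict(fits[[k]], newdata = dat, type = "response")
    log(prop[k]) + rowsum(dbinom(dat$y, dat$n, mu, log = TRUE), dat$id)[, 1]
  })
  m    <- apply(lognum, 1, max)                 # log-sum-exp for numerical stability
  ll_i <- m + log(rowSums(exp(lognum - m)))
  tau  <- exp(lognum - ll_i)
  loglik <- c(loglik, sum(ll_i))
  if (iter > 1 && abs(diff(tail(loglik, 2))) < 1e-8) break
}

round(sapply(fits, coef), 2)  # estimated coefficients, one column per cluster
```

All observations sharing an id are assigned to clusters jointly, which is why the log densities are summed per id before the posterior probabilities are computed.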
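Plotting the evolution of the log-likelihood over iterations
# Plotting the log-likelihood (qplot() is deprecated in recent ggplot2 versions):
# install.packages("ggplot2")
library(ggplot2)
loglik <- result_K3[[1]]
ggplot(data.frame(iteration = seq_along(loglik), loglik = loglik),
       aes(iteration, loglik)) +
  geom_point()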
Matrix of estimated beta coefficients (values from the last iteration):
##            [,1]       [,2]       [,3]
## [1,] -3.8126661 -5.2914380 -3.2418550
## [2,] -0.4134079  0.3794783  0.4115441
## [3,] -0.2975236  0.2407683  0.4076950
## [4,] -0.1948168  0.2122175  0.3753815
## [5,] -0.1590104  0.4028323  0.1885215
## [6,] -0.2160946  0.3545593  0.1872363
Vector of cluster proportions (values from the last iteration):
## [1] 0.1871000 0.7246125 0.0883000
Matrix of probabilities that each campaign belongs to each cluster (values from the last iteration):
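A membership probability matrix of this kind is typically turned into a hard clustering by assigning each campaign to its most probable cluster. The sketch below uses a small hypothetical matrix (the values and campaign names are made up), not the actual output of runEM.

```r
# Hypothetical posterior matrix: one row per campaign, one column per cluster
post <- rbind(c(0.02, 0.97, 0.01),
              c(0.90, 0.08, 0.02),
              c(0.10, 0.15, 0.75))
rownames(post) <- c("campaign_a", "campaign_b", "campaign_c")

# Index of the largest probability in each row
cluster <- max.col(post, ties.method = "first")
names(cluster) <- rownames(post)

cluster
table(cluster)  # cluster sizes
```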
BIC value (printed as a character string):
## [1] "BIC=387914.537681485"
ICL value (printed as a character string):
## [1] "ICL value=387919.96962191"
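For readers unfamiliar with these criteria, the sketch below shows one common way to compute them. It assumes the "smaller is better" convention BIC = -2 logL + (number of parameters) x log(n), and ICL = BIC plus twice the entropy of the membership probabilities; whether binomialMix uses exactly these formulas, and which n it uses, is an assumption. The helper name and the input values are hypothetical.

```r
# Hypothetical helper, not part of binomialMix; the conventions are assumptions.
bic_icl <- function(loglik, n_par, n, tau) {
  bic     <- -2 * loglik + n_par * log(n)
  entropy <- -sum(ifelse(tau > 0, tau * log(tau), 0))  # 0 * log(0) treated as 0
  c(BIC = bic, ICL = bic + 2 * entropy)
}

# Made-up membership probabilities for three objects and three clusters
tau <- rbind(c(0.98, 0.01, 0.01),
             c(0.05, 0.90, 0.05),
             c(0.00, 0.00, 1.00))

bic_icl(loglik = -1200, n_par = 20, n = 500, tau = tau)
```

Under this convention ICL is never smaller than BIC, and the two coincide when every object is assigned to a single cluster with probability 1.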
Total number of EM iterations (printed as a character string):
## [1] "Number of EM iteration :10"
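Since the three values above are printed as labelled character strings, you may want to convert them back to numbers before comparing models. A minimal sketch, using the strings shown in this README as input (the vector names are hypothetical):

```r
# Strings copied from the output shown above
out <- c("BIC=387914.537681485",
         "ICL value=387919.96962191",
         "Number of EM iteration :10")

# Drop everything up to the last "=" or ":" and convert the rest to numeric
vals <- as.numeric(sub(".*[=:]", "", out))
names(vals) <- c("BIC", "ICL", "n_iter")
vals
```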
Matrix of the number of Fisher scoring iterations at each M step (here 3 rows, one per cluster, and one column per M step):
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## [1,]    4    3    4    6    3    3    2    1    1
## [2,]    3    2    2    2    2    2    2    1    1
## [3,]    5    4    2    2    3    1    1    1    1