Copyright 2019 Faustine Bousquet (faustine.bousquet@tabmo.io or faustine.bousquet@umontpellier.fr) from TabMo and IMAG (Institut Montpelliérain Alexander Grothendieck, University of Montpellier). The binomialMix package is available under the Apache2 license.
The binomialMix package provides a clustering method for longitudinal and non-Gaussian data. It uses an EM algorithm for generalized linear models (GLM). For now, model-based clustering for mixtures of binomial data is available.
You can install the binomialMix R package with either of the following R commands:
# install.packages("devtools")
devtools::install_git("https://gitlab.com/tabmo/binomialmix")
devtools::install_gitlab("tabmo/binomialMix")
You can also use the git repository directly:
git clone https://gitlab.com/tabmo/binomialMix
Once you have cloned the git repository, run the following command to install the binomialMix package:
devtools::install("/path/to/binomialMix/pkg") # edit the path
Imagine that you are working for an advertising company. You need to make groups of campaigns with similar profiles.
# our library for mixture modelling:
library(binomialMix)
# if not installed :
#install.packages("pander", repos="http://cran.us.r-project.org")
#install.packages("ggplot2", repos="http://cran.us.r-project.org")
#library(pander)
library(qpdf)
data(adcampaign)
# Display the first rows of the dataset
head(adcampaign)
## id timestamp_ymd yearDay day timeSlot app_or_site impressions click
## 1 14 2019-01-01 1 3 1 app 2675 117
## 2 14 2019-01-01 1 3 1 app 729 16
## 3 14 2019-01-01 1 3 2 app 1016 33
## 4 14 2019-01-01 1 3 2 app 342 6
## 5 14 2019-01-01 1 3 3 app 3431 92
## 6 14 2019-01-01 1 3 3 app 864 9
## ctr
## 1 0.04373832
## 2 0.02194787
## 3 0.03248031
## 4 0.01754386
## 5 0.02681434
## 6 0.01041667
Note: you can of course use your own data. The required format is the following:
a data frame (e.g. adcampaign from binomialMix)
a factor column identifying the objects you want to cluster (e.g. id from adcampaign)
a target variable (e.g. ctr from adcampaign)
a weight variable, since the data are binomial (e.g. impressions from adcampaign)
at least one column to use as an explanatory variable (e.g. day from adcampaign)
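As a minimal sketch of this format, the snippet below builds a small toy data frame in base R. The object name my_data, the campaign labels and all simulated values are hypothetical; whether day must be a factor is an assumption, so adapt the column types to your own data.

```r
# Hypothetical toy data in the layout described above (all values are simulated).
set.seed(1)
n <- 12
my_data <- data.frame(
  id          = factor(rep(c("A", "B", "C"), each = 4)),  # objects to cluster
  day         = factor(rep(1:4, times = 3)),              # explanatory variable
  impressions = sample(200:2000, n, replace = TRUE)       # binomial weights (number of trials)
)

# Simulated clicks, then the target expressed as a proportion of successes
my_data$click <- rbinom(n, size = my_data$impressions, prob = 0.03)
my_data$ctr   <- my_data$click / my_data$impressions

# Basic checks on the expected format
stopifnot(
  is.data.frame(my_data),
  is.factor(my_data$id),
  all(my_data$impressions > 0),
  all(my_data$ctr >= 0 & my_data$ctr <= 1)
)
str(my_data)
```

A data frame built this way has the same kind of columns as adcampaign (an id, a target proportion, a weight and at least one explanatory variable).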
The objective of the study is to group advertising campaigns into clusters. For each campaign, time slot, day of the week and ad placement (app or site), we observe the number of clicks and the number of impressions. The click-through rate (CTR) is the number of clicks divided by the number of impressions. CTR values differ a lot from one observation to another, as does the total length of a campaign: some last only a few days, while others run for months. Each campaign (column “id”) is therefore composed of n_c observations from the whole dataset, and we have repeated measures for the same id level. The available explanatory variables are:
day
timeSlot
app_or_site
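The CTR definition can be checked on the six rows of adcampaign printed earlier. The sketch below retypes those rows by hand so that it runs without the package; the object name first_rows is only illustrative.

```r
# The six rows printed earlier, retyped by hand so the snippet is self-contained
first_rows <- data.frame(
  id          = factor(rep(14, 6)),
  timeSlot    = c(1, 1, 2, 2, 3, 3),
  impressions = c(2675, 729, 1016, 342, 3431, 864),
  click       = c(117, 16, 33, 6, 92, 9)
)

# CTR = clicks / impressions
first_rows$ctr <- first_rows$click / first_rows$impressions
round(first_rows$ctr, 8)

# Number of observations n_c per campaign (here a single campaign, id 14)
n_c <- table(first_rows$id)
n_c
```

The rounded values match the ctr column of the printed output, and n_c counts the repeated measures available for each id level.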
Let’s now try to cluster our dataset into K groups.
# The dataframe to cluster:
df_tocluster<-adcampaign
# We choose two explanatory variables:
model_formula<-"ctr~timeSlot+day"
# Since this is a binomial mixture model, we define the weight variable
weighted_variable<-"impressions"
# We want to analyse results for K=3.
K<-3
# We define the column identifying the individuals to cluster:
col_id<-"id"
set.seed(1992)
# We run our EM algorithm developed for mixtures of binomial and longitudinal data:
result_K3<-runEM(model_formula,
weighted_variable,
K,
df_tocluster,
col_id)
The output of the runEM function provides the following values:
Loglikelihood for each EM iteration
Estimation of model parameters (β, λ, π )
BIC and ICL values
Number of Fisher scoring iterations needed for each M-step
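To make these quantities more concrete, here is a toy EM for a mixture of binomials without explanatory variables, written in base R. It is only a sketch under simplifying assumptions and not the code behind runEM: each cluster has a single click probability p instead of GLM coefficients β, so the M-step has a closed form and needs no Fisher scoring. All data are simulated and every name is hypothetical.

```r
set.seed(42)
# Simulate 30 hypothetical campaigns with 5 observations each, from 2 clusters
n_camp <- 30
n_obs  <- 5
true_p <- c(0.02, 0.08)
z  <- rep(1:2, each = n_camp / 2)                        # true cluster of each campaign
id <- rep(seq_len(n_camp), each = n_obs)                 # campaign of each observation
m  <- sample(500:2000, n_camp * n_obs, replace = TRUE)   # impressions
y  <- rbinom(length(m), size = m, prob = true_p[z[id]])  # clicks
y_c <- as.numeric(tapply(y, id, sum))                    # clicks per campaign
m_c <- as.numeric(tapply(m, id, sum))                    # impressions per campaign

K      <- 2
p      <- c(0.01, 0.10)   # initial click probability of each cluster
lambda <- rep(1 / K, K)   # initial mixing proportions
loglik <- numeric(0)

for (iter in 1:100) {
  # Log-density of each campaign under each cluster (n_camp x K matrix)
  logdens <- sapply(seq_len(K), function(k)
    as.numeric(tapply(dbinom(y, m, p[k], log = TRUE), id, sum)))
  logjoint <- sweep(logdens, 2, log(lambda), "+")
  # Log-sum-exp per campaign, for numerical stability
  mx  <- apply(logjoint, 1, max)
  lse <- mx + log(rowSums(exp(logjoint - mx)))
  loglik <- c(loglik, sum(lse))
  # E-step: posterior probabilities (rows = campaigns, columns = clusters)
  post <- exp(logjoint - lse)
  # M-step: closed-form updates of the proportions and click probabilities
  lambda <- colMeans(post)
  p <- as.vector(crossprod(post, y_c) / crossprod(post, m_c))
  # Stop when the log-likelihood no longer increases
  if (iter > 1 && loglik[iter] - loglik[iter - 1] < 1e-8) break
}

round(lambda, 3)
round(p, 4)
```

In this sketch, loglik plays the role of the log-likelihood trace, lambda the cluster proportions and post the posterior probabilities. Note that post is stored here with campaigns in rows, which is the transpose of the layout shown later in this document.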
Evolution of the Loglikelihood over EM iterations
library(ggplot2)
qplot(seq_along(result_K3[[1]]), result_K3[[1]],
xlab="Number of EM iterations",
ylab="Loglikelihood")
Estimated β parameters
Let’s have a look at the estimated parameters for each cluster k. Below, we only show the estimates from the last EM iteration.
result_K3[[2]][[length(result_K3[[2]])]]
## k=1 k=2 k=3
## [1,] -3.27524617 -6.02001952 -5.0421272
## [2,] 0.31581767 0.12798712 0.4729844
## [3,] 0.20178096 0.26338824 0.5358001
## [4,] 0.26372976 0.43174257 0.7074703
## [5,] 0.09812297 0.41739277 0.8413444
## [6,] 0.08388539 0.09588596 0.7098054
Estimated proportion of campaigns λ for each cluster
We now look at how the campaigns of the adcampaign dataset are distributed across clusters, in order to analyze the size of each cluster. We only display the values for the last iteration of the EM algorithm.
result_K3[[3]][[length(result_K3[[3]])]]
## [1] 0.114300 0.498075 0.387625
Matrix of probabilities for each campaign to belong to the different clusters
We analyze the contribution of each campaign to the K clusters. The columns correspond to the campaigns and the rows to the clusters k.
# We only display the results for the first 10 campaigns (10 columns)
set.seed(1992)
result_K3[[4]][[length(result_K3[[4]])]][,1:10]
## ID_1 ID_2 ID_3 ID_4 ID_5 ID_6 ID_7 ID_8 ID_9 ID_10
## k=1 0 0 0 0 0 0 0 0.000 0.000 0
## k=2 0 0 1 0 0 0 1 0.999 0.096 1
## k=3 1 1 0 1 1 1 0 0.001 0.904 0
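A common next step is to turn these probabilities into a hard assignment by giving each campaign to its most probable cluster. The sketch below does this in base R on the ten columns printed above, retyped by hand so that it runs on its own; the names post and map_cluster are illustrative.

```r
# Posterior probabilities for the first 10 campaigns, retyped from the output above
post <- matrix(
  c(0, 0, 0, 0, 0, 0, 0, 0.000, 0.000, 0,
    0, 0, 1, 0, 0, 0, 1, 0.999, 0.096, 1,
    1, 1, 0, 1, 1, 1, 0, 0.001, 0.904, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("k=", 1:3), paste0("ID_", 1:10))
)

# Each column should be a probability distribution over the clusters
stopifnot(isTRUE(all.equal(unname(colSums(post)), rep(1, 10))))

# Hard assignment: most probable cluster for each campaign
map_cluster <- apply(post, 2, which.max)
map_cluster

# Number of campaigns assigned to each cluster among these 10
table(factor(map_cluster, levels = 1:3))
```

Among these ten campaigns, only ID_8 and ID_9 have probabilities that are not exactly 0 or 1 at the displayed precision, so the hard assignment loses very little information here.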
Analysis of BIC and ICL values
Analyzing the BIC and ICL values is essential for choosing the right number of clusters. We can compare the BIC/ICL values obtained for several values of K and choose the K that minimizes one or both of these criteria.
result_K3[[5]][[length(result_K3[[5]])]] # BIC value
result_K3[[6]][[length(result_K3[[6]])]] # ICL value
## [1] "BIC=372360.14"
## [1] "ICL=372367.72"
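For reference, the sketch below writes down one common formulation of both criteria in the "smaller is better" convention, where ICL adds an entropy penalty to BIC. These definitions are an assumption and may differ from what the package computes internally; the function names are hypothetical. The snippet also shows how the printed character strings could be parsed into numbers before comparing several values of K.

```r
# Generic criteria in the "smaller is better" convention.
# These definitions are assumptions for illustration, not the package's code.
bic <- function(loglik, n_params, n) -2 * loglik + n_params * log(n)

# Entropy penalty computed from a matrix of posterior probabilities
entropy_penalty <- function(post) {
  terms <- ifelse(post > 0, post * log(post), 0)  # convention: 0 * log(0) = 0
  -2 * sum(terms)
}

icl <- function(loglik, n_params, n, post) {
  bic(loglik, n_params, n) + entropy_penalty(post)
}

# The values printed above are character strings, so they must be parsed
# before they can be compared across several values of K.
parse_criterion <- function(x) as.numeric(sub("^[A-Z]+=", "", x))
bic_K3 <- parse_criterion("BIC=372360.14")
icl_K3 <- parse_criterion("ICL=372367.72")
icl_K3 - bic_K3   # gap between the two criteria for K = 3
```

To choose K, one would run runEM for each candidate value, parse the two criteria in this way and keep the K with the smallest value. If the package follows the formulation above, the small gap between ICL and BIC would indicate well-separated clusters.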
Analysis of the number of Fisher scoring iterations for each M step
To see the number of Fisher scoring iterations needed at each M step, we can display the following matrix.
matrix(unlist(result_K3[[7]]),ncol=length(result_K3[[7]])-1)
## iter_1 iter_2 iter_3 iter_4 iter_5 iter_6 iter_7 iter_8
## k=1 4 3 3 2 1 1 1 1
## k=2 4 3 1 3 2 1 1 1
## k=3 4 3 3 2 2 1 1 1
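To illustrate what one Fisher scoring run looks like, the sketch below fits a single weighted binomial GLM with a hand-written loop and compares it with glm() from base R. It is a simplified illustration on simulated data, not the package's M-step: in a mixture, the weights would typically also involve the posterior probabilities, which is left out here. The function fisher_scoring and all variable names are hypothetical.

```r
set.seed(7)
n <- 200
x <- factor(sample(1:3, n, replace = TRUE))   # hypothetical explanatory variable
m <- sample(200:2000, n, replace = TRUE)      # impressions (binomial weights)
X <- model.matrix(~ x)
beta_true <- c(-3, 0.3, 0.6)
y <- rbinom(n, size = m, prob = as.vector(plogis(X %*% beta_true))) / m  # observed CTR

# Fisher scoring for a binomial GLM with logit link and prior weights m
fisher_scoring <- function(X, y, m, tol = 1e-8, max_iter = 25) {
  beta <- rep(0, ncol(X))
  for (iter in seq_len(max_iter)) {
    mu    <- as.vector(plogis(X %*% beta))
    W     <- m * mu * (1 - mu)              # Fisher information weights
    score <- crossprod(X, m * (y - mu))     # gradient of the log-likelihood
    info  <- crossprod(X, W * X)            # expected information matrix
    step  <- as.vector(solve(info, score))
    beta  <- beta + step
    if (max(abs(step)) < tol) break         # stop once the update is negligible
  }
  list(coefficients = beta, iterations = iter)
}

fit <- fisher_scoring(X, y, m)
ref <- glm(y ~ x, family = binomial, weights = m)

fit$iterations
cbind(fisher_scoring = fit$coefficients, glm = unname(coef(ref)))
```

The iteration count returned by this loop is the kind of number reported in the matrix above. A count that falls as EM proceeds, as in that matrix, is consistent with the parameters changing less and less from one EM iteration to the next.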