R-package binomialMix tutorial

Copyright 2019 Faustine Bousquet (faustine.bousquet@tabmo.io or faustine.bousquet@umontpellier.fr) from TabMo and IMAG (Institut Montpelliérain Alexander Grothendieck, University of Montpellier). The binomialMix package is available under the Apache2 license.

STEP 2: Use-case tutorial

Imagine that you are working for an advertising company. You need to make groups of campaigns with similar profiles.

1. First, you need to import the following library:

# our library for mixture modelling:
library(binomialMix)
# if not installed : 
#install.packages("pander", repos="http://cran.us.r-project.org")
#install.packages("ggplot2", repos="http://cran.us.r-project.org")
#library(pander)
library(qpdf)

2. Let’s have a look at the dataset:

data(adcampaign)

##   id timestamp_ymd yearDay day timeSlot app_or_site impressions click
## 1 14    2019-01-01       1   3        1         app        2675   117
## 2 14    2019-01-01       1   3        1         app         729    16
## 3 14    2019-01-01       1   3        2         app        1016    33
## 4 14    2019-01-01       1   3        2         app         342     6
## 5 14    2019-01-01       1   3        3         app        3431    92
## 6 14    2019-01-01       1   3        3         app         864     9
##          ctr
## 1 0.04373832
## 2 0.02194787
## 3 0.03248031
## 4 0.01754386
## 5 0.02681434
## 6 0.01041667

NB : Of course, you can use your own data. The format you need to have is the following:

a dataframe type is needed (ex: adcampaign from binomialMix)
a column with factor id representing the objects you want to cluster (ex: id from adcampaign )
a target value (ex: ctr from adcampaign)
a weighted value variable as we are in case of binomial data (ex: impressions from adcampaign)
at least, one column as explicative variable (ex: day from adcampaign)

3. Let’s make some clusters!

The objective of the study is to group advertising campaigns into clusters. We observe by campaign, time slot, day of week and ad slot campaign (like app or site) the observed number of clicks and impressions. CTR corresponds to the number of click on the number of impressions. CTR value differs a lot from one observation to another, as well as the total length of a campaign. Some last fews days and others broadcast for months. Then, each campaigns (column “id”) is composed of n_c observations from the whole dataset and we have repeated mesure for a same id level. The available explicative variables are:

day
timeSlot
app_or_site

Let’s now try to cluster our dataset into K groups.

# The dataframe to cluster: 
df_tocluster<-adcampaign
# We choose two explainable variables:
model_formula<-"ctr~timeSlot+day"
# As we are in a case of binomial mixture model, we define the weighted variable
weighted_variable<-"impressions"
# We want to analyse results for K=3.
K<-3
# We define the individual to cluster:
col_id<-"id"
set.seed(1992)
# We run our EM algorithm developped for mixture of binomial and longitudinal dataset:
result_K3<-runEM(model_formula,
                weighted_variable,
                K,
                df_tocluster,
                col_id)

4. Analysis of clustering results:

The output of the runEM function provides the following values:

Loglikelihood for each EM iteration
Estimation of model parameters (β, λ, π )
BIC and ICL values
Number of fisher iteration needed for each M-Step

Plotting evolution of Loglikelihood over iteration

library(ggplot2)
qplot(seq_along(result_K3[[1]]), result_K3[[1]],
      xlab="Number of EM iterations",
      ylab="Loglikelihood")

Estimated β parameters

Let’s have a look at the estimated parameters for each cluster k. We only show the estimation from the last EM iteration in the following.

result_K3[[3]][[length(result_K3[[3]])]]

##              k=1         k=2        k=3
## [1,] -3.27524617 -6.02001952 -5.0421272
## [2,]  0.31581767  0.12798712  0.4729844
## [3,]  0.20178096  0.26338824  0.5358001
## [4,]  0.26372976  0.43174257  0.7074703
## [5,]  0.09812297  0.41739277  0.8413444
## [6,]  0.08388539  0.09588596  0.7098054

Estimated proportion of campaigns λ for each cluster

We want to have a look at the repartition of our campaigns for adcampaign dataset to analyze the size of each cluster. We only display value for the last iteration of EM algorithm.

result_K3[[3]][[length(result_K3[[3]])]]

## [1] 0.114300 0.498075 0.387625

Matrix of proability for each campaign to belong to the different clusters

We analyze the contribution of each campaign to the K clusters. The columns define the campaigns and the rows the different cluster k.

# We only display the results for the first 10 campaigns (10 columns)
set.seed(1992)
result_K3[[4]][[length(result_K3[[4]])]][,1:10]

##     ID_1 ID_2 ID_3 ID_4 ID_5 ID_6 ID_7  ID_8  ID_9 ID_10
## k=1    0    0    0    0    0    0    0 0.000 0.000     0
## k=2    0    0    1    0    0    0    1 0.999 0.096     1
## k=3    1    1    0    1    1    1    0 0.001 0.904     0

Analyze of BIC and ICL values

The analyze of BIC and ICL values is essential when we want to choose the right number of clusters. We can compare BIC/ICL values and choose the K that minimize one or both of these criteria.

result_K3[[5]][[length(result_K3[[5]])]] # BIC value 
result_K3[[6]][[length(result_K3[[6]])]] # ICL value

## [1] "BIC=372360.14"

## [1] "ICL=372367.72"

Analyze of Fisher scoring number of iterations for each M step

If we want to know the number of Fisher scoring iterations at each M step, we can display the following matrix.

matrix(unlist(result_K3[[7]]),ncol=length(result_K3[[7]])-1)

##     iter_1 iter_2 iter_3 iter_4 iter_5 iter_6 iter_7 iter_8
## k=1      4      3      3      2      1      1      1      1
## k=2      4      3      1      3      2      1      1      1
## k=3      4      3      3      2      2      1      1      1