For the analyses implemented in the dad package, the data \(\mathbf{X}\) (Table 1) of interest have three kinds of objects: occasions (or groups) \(\times\) individuals \(\times\) variables. In what follows, the terms occasion and group are used interchangeably: in the case of three-way data, the term occasion would be preferable, while in the case of multigroup data the term group would be more appropriate.
The groups define a partition of the individuals on which the variables are measured. If \(T\) denotes the number of groups, then for each \(t\) in \(\left\{1,\ldots,T\right\}\) the rows of the table \(\mathbf{X}_t\) correspond to \(n_t\) observations \(\mathbf{x}_{t1},\ldots,\mathbf{x}_{tn_t}\) of \(X_t\), a random vector with \(p\) components.
Table 1: For each group (or occasion) \(t = 1,\ldots,T\), the same \(p\) variables are observed for \(n_t\) individuals.
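As a rough illustration of this layout, the sketch below builds a list of \(T\) tables using base R only. The built-in iris data and the call to split() are assumptions made for the example: they stand in for multigroup data and are not necessarily how the dad package stores its objects.

```r
# Illustration with base R only: the built-in iris data stand in for multigroup data.
# Species plays the role of the group; the four measurements are the p variables.
X <- split(iris[, 1:4], iris$Species)

n_groups <- length(X)          # T, the number of groups
n_t <- sapply(X, nrow)         # n_t, the number of individuals in each group
p <- unique(sapply(X, ncol))   # p, the number of variables (the same in every group)

print(n_groups)
print(n_t)
print(p)
```

Each element of X corresponds to one table \(\mathbf{X}_t\), with \(n_t\) rows and \(p\) columns.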
For discriminant analysis, the data of interest are similar to the previous ones, except that there are two categories of occasions (or groups). The first category, consisting of \(T\) occasions, is partitioned into \(K\) subsets deriving from a factor \(G\) defined on the occasions. The second category consists of occasions, numbered \(T+1,\ \ldots\), for which we have data of type \(\mathbf{X}\) but not the value of \(G\).
Table 2: Each occasion \(t\) (\(t=1,\ldots,T\)) matches a table with \(n_t\) rows and \(p\) columns (see Table 1). The variable \(G\) defined on the occasions takes its values in \(\{1,\ldots,K\}\). For each \(k=1,\ldots,K\), the value \(k\) is taken \(T_k\) times. The \(G\) values of the occasions \(T+1,\,\ldots\) are not available and have to be predicted.
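The structure described in Table 2 can be pictured with a small, entirely hypothetical table of occasions. The occasion labels, the number of occasions and the values of \(G\) below are made up for illustration; only base R is used.

```r
# Hypothetical example: six occasions, the first T = 4 with a known value of G,
# the last two with G missing (to be predicted).
occasions <- data.frame(
  occasion = paste0("t", 1:6),
  G = factor(c(1, 2, 1, 2, NA, NA), levels = 1:2)
)

known <- occasions[!is.na(occasions$G), ]    # occasions 1, ..., T
unknown <- occasions[is.na(occasions$G), ]   # occasions T + 1, ...

n_known <- nrow(known)   # T
T_k <- table(known$G)    # T_k, the number of occasions in each class k

print(T_k)
```

The counts in T_k sum to \(T\), and the rows of unknown are the occasions whose class has to be predicted.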
When the individuals are organised into groups, the analyst may want to take this data organisation into account by associating a mathematical object with each group and applying multivariate techniques to these objects. In the dad package, which is devoted to such data, the objects are probability density functions. These densities are either all continuous (numeric data with the Lebesgue measure as reference measure) or all discrete (categorical data with the counting measure as reference measure) and are subjected to the following analyses:
These three multivariate techniques are essentially based on distance indices between probability density functions. The literature abounds with such indices: for example, the Encyclopedia of Distances of Deza and Deza (2013, pp. 235–245) lists some forty. The package computes ten of them, covering both the case of discrete densities and that of continuous densities. The results returned by the three multivariate techniques depend on the distance index used. The choice of a distance index depends above all on the modelling hypotheses: discrete or continuous data, Gaussian or not, and so on.
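To make the dependence on the distance index concrete, here is a minimal sketch comparing two discrete densities with two different indices. The helper names l2_dist and hellinger_dist, the probability values and the normalisation chosen for the Hellinger distance are assumptions made for this example; they are not the functions of the dad package.

```r
# Two discrete densities on the same three categories (made-up values).
f <- c(a = 0.50, b = 0.50, c = 0.00)
g <- c(a = 0.25, b = 0.25, c = 0.50)

# L2 distance with respect to the counting measure.
l2_dist <- function(f, g) sqrt(sum((f - g)^2))

# Hellinger distance, using the convention H^2 = (1/2) * sum((sqrt(f) - sqrt(g))^2);
# some texts drop the factor 1/2.
hellinger_dist <- function(f, g) sqrt(0.5 * sum((sqrt(f) - sqrt(g))^2))

d_l2 <- l2_dist(f, g)
d_h <- hellinger_dist(f, g)

print(c(L2 = d_l2, Hellinger = d_h))
```

The two indices give different values for the same pair of densities, so any technique built on pairwise distances can give different results depending on the index chosen.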
Thus, for each distance index, the dad package implements:
The dad package uses objects of class "folder" or "folderh". These objects are lists of data frames having particular formats.
folder
Such objects are lists of data frames which have the same column names. They are created by the functions folder or as.folder (see their help in R).
data("roses")
rosesf <- as.folder(roses[,c("Sha", "Den", "Sym", "rose")], groups = "rose")
print(rosesf, max = 9)
## $A
## Sha Den Sym
## 1 7.0 6.7 6.7
## 2 7.1 7.8 8.1
## 3 7.0 6.8 7.4
## [ reached 'max' / getOption("max.print") -- omitted 39 rows ]
##
## $B
## Sha Den Sym
## 43 8.1 7.7 3.0
## 44 8.6 5.9 6.7
## 45 7.7 6.7 7.4
## [ reached 'max' / getOption("max.print") -- omitted 39 rows ]
##
## $C
## Sha Den Sym
## 85 0.7 9.3 1.4
## 86 2.3 7.7 2.4
## 87 3.6 7.9 7.2
## [ reached 'max' / getOption("max.print") -- omitted 39 rows ]
##
## $D
## Sha Den Sym
## 127 9.2 1.8 9.0
## 128 9.0 2.3 9.2
## 129 6.9 2.6 7.6
## [ reached 'max' / getOption("max.print") -- omitted 39 rows ]
##
## $E
## Sha Den Sym
## 169 5.6 1.7 8.2
## 170 7.5 3.4 8.6
## 171 5.8 3.9 5.8
## [ reached 'max' / getOption("max.print") -- omitted 39 rows ]
##
## $F
## Sha Den Sym
## 211 8.3 8.0 6.5
## 212 8.4 7.8 3.3
## 213 9.2 8.2 7.6
## [ reached 'max' / getOption("max.print") -- omitted 39 rows ]
##
## $G
## Sha Den Sym
## 253 8.6 2.0 5.4
## 254 8.5 2.3 7.9
## 255 7.6 3.5 7.1
## [ reached 'max' / getOption("max.print") -- omitted 39 rows ]
##
## $H
## Sha Den Sym
## 295 6.5 4.3 2.6
## 296 6.6 2.9 2.9
## 297 8.4 5.1 6.4
## [ reached 'max' / getOption("max.print") -- omitted 39 rows ]
##
## $I
## Sha Den Sym
## 337 4.9 6.5 7.6
## 338 5.8 6.6 7.9
## 339 4.3 5.6 6.0
## [ reached 'max' / getOption("max.print") -- omitted 39 rows ]
##
## $J
## Sha Den Sym
## 379 4.9 5.2 8.9
## 380 4.6 8.1 8.6
## 381 3.5 7.8 7.4
## [ reached 'max' / getOption("max.print") -- omitted 39 rows ]
##
## attr(,"class")
## [1] "folder"
## attr(,"same.rows")
## [1] FALSE
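Because a folder is a list of data frames sharing the same column names, per-group summaries can be sketched with lapply. The example below is only a sketch: it uses a plain list built with base R split() from the built-in iris data as a stand-in for a "folder" object, and it computes the mean vector and covariance matrix of each group, that is, the parameters one would need if each group were described by a Gaussian density.

```r
# Stand-in for a folder: a plain list of data frames with identical column names.
groups <- split(iris[, c("Sepal.Length", "Sepal.Width")], iris$Species)

# Check that all elements share the same column names, as in a folder.
same_cols <- all(sapply(groups, function(df) identical(names(df), names(groups[[1]]))))

# Per-group mean vector and covariance matrix.
group_means <- lapply(groups, colMeans)
group_vars <- lapply(groups, var)

print(same_cols)
print(group_means[[1]])
print(group_vars[[1]])
```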
folderh
In the most useful case, such objects are hierarchical lists of two data frames df1 and df2 related by means of a key which describes the “1 to N” relationship between the data frames. They are created by the function folderh (see its help in R for the case of three data frames or more).
data(roseflowers)
df1 <- roseflowers$variety
df2 <- roseflowers$flower
fh1 <- folderh(df1, "rose", df2)
print(fh1)
## $df1
## place rose variety
## 34 outdoors 34 v1
## 40 outdoors 40 v4
## 60 outdoors 60 v3
## 66 glasshouse 66 v3
## 68 glasshouse 68 v4
##
## $df2
## rose numflower diameter height nleaves
## 1 34 1 94.5 57.0 8
## 2 34 2 89.5 54.0 10
## 3 40 1 57.0 21.5 9
## 4 40 2 52.5 20.5 5
## 5 40 3 51.5 14.0 7
## 6 60 1 53.0 23.0 4
## 7 60 2 52.0 24.5 9
## 8 66 1 35.0 9.5 4
## 9 66 2 35.0 14.0 6
## 10 66 3 36.0 13.5 7
## 11 68 1 45.5 19.5 10
##
## attr(,"class")
## [1] "folderh"
## attr(,"keys")
## [1] "rose"
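The “1 to N” relationship carried by the key can be checked with base R alone. The sketch below rebuilds by hand a few rows copied from the printed output above (two roses and their five flowers) instead of using the folderh object itself; the helper variable names are chosen for the example.

```r
# A few rows copied from the printed output above, rebuilt by hand.
df1 <- data.frame(place = c("outdoors", "outdoors"),
                  rose = c(34, 40),
                  variety = c("v1", "v4"))
df2 <- data.frame(rose = c(34, 34, 40, 40, 40),
                  numflower = c(1, 2, 1, 2, 3),
                  diameter = c(94.5, 89.5, 57.0, 52.5, 51.5))

# Each value of the key appears once in df1 and possibly several times in df2.
key_is_unique <- !anyDuplicated(df1$rose)
flowers_per_rose <- table(df2$rose)

# Joining on the key attaches the rose-level columns to every flower.
joined <- merge(df1, df2, by = "rose")

print(flowers_per_rose)
print(joined)
```

Here the key "rose" identifies one row of df1 and one or more rows of df2, which is the relationship the folderh object records in its "keys" attribute.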
Deza, M.M. and Deza, E. (2013). Encyclopedia of Distances. Springer-Verlag, Heidelberg.