Empirical Distribution Ordering Inference Framework (EDOIF)

Given a dataset of careers and incomes, how large is the difference in income between any pair of careers? Given a dataset of travel-time records, how much longer does a trip take when choosing public transportation mode A instead of B? We developed a framework named “EDOIF” to answer such questions.

EDOIF is a nonparametric framework based on the principle of “Estimation Statistics”. Its main purpose is to infer the order of empirical distributions from different categories, based on the probability of finding a value in one distribution that is greater than the expectation of another distribution. Given a set of ordered pairs, each consisting of a category and a real value, the framework is capable of

  1. inferring the domination order of categories and representing it in the form of a graph;
  2. estimating the magnitude of the difference between a pair of categories in the form of mean-difference confidence intervals; and
  3. visualizing the domination order and the magnitudes of difference between categories.
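
To make item 2 concrete, here is a minimal sketch in base R of a percentile-bootstrap confidence interval for a mean difference, together with a Mann-Whitney p-value for a single pair of categories. It is not EDOIF's internal implementation. The helper name `bootMeanDiffCI` and the simulated vectors `x` and `y` are hypothetical and exist only for illustration, and the one-sided alternative in the test is an assumption.

```r
# Hypothetical helper (not part of the EDOIF package): percentile-bootstrap
# confidence interval for mean(y) - mean(x).
bootMeanDiffCI <- function(x, y, bootT = 1000, alpha = 0.05) {
  # Resample each group with replacement and record the mean difference.
  diffs <- replicate(bootT, {
    mean(sample(y, replace = TRUE)) - mean(sample(x, replace = TRUE))
  })
  # The (alpha/2, 1 - alpha/2) quantiles give the percentile interval.
  quantile(diffs, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

set.seed(1)
x <- rnorm(150, mean = 10, sd = 8)  # illustrative "lower" category
y <- rnorm(150, mean = 50, sd = 8)  # illustrative "higher" category

ci <- bootMeanDiffCI(x, y)

# One-sided Mann-Whitney (Wilcoxon rank-sum) test: is y shifted above x?
# (Assumption: the direction of the alternative is chosen for illustration.)
pval <- wilcox.test(y, x, alternative = "greater")$p.value

cat(sprintf("Mean diff 95CI: [%f, %f], p-val %.4f\n", ci[1], ci[2], pval))
```

An interval that lies entirely above zero, combined with a small p-value, is the kind of evidence that would support ordering `x` below `y`.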

Installation

Run the following command in an R terminal:

remotes::install_github("DarkEyes/EDOIF")

This requires the “remotes” package to be installed before installing EDOIF.

Example: Inferring orders of categories based on their empirical distributions

library(EDOIF)

#== Simulation: generating distributions of five categories:
# Category5 dominates Category4
# Category4 dominates Category3
# Category3 dominates Category2
# Category2 and Category1 share the same mean, so neither dominates the other

nInv=150 # number of samples per category
initMean=10
stepMean=20
std=8

simData1<-list()
simData1$Values<-rnorm(nInv,mean=initMean,sd=std)
simData1$Group<-rep(c("Category1"),times=nInv)
simData1$Values<-c(simData1$Values,rnorm(nInv,mean=initMean,sd=std) )
simData1$Group<-c(simData1$Group,rep(c("Category2"),times=nInv))
simData1$Values<-c(simData1$Values,rnorm(nInv,mean=initMean+2*stepMean,sd=std) )
simData1$Group<-c(simData1$Group,rep(c("Category3"),times=nInv) )
simData1$Values<-c(simData1$Values,rnorm(nInv,mean=initMean+3*stepMean,sd=std) )
simData1$Group<-c(simData1$Group, rep(c("Category4"),times=nInv) )
simData1$Values<-c(simData1$Values,rnorm(nInv,mean=initMean+4*stepMean,sd=std) )
simData1$Group<-c(simData1$Group, rep(c("Category5"),times=nInv) )

#== parameter setting
bootT=1000 # number of bootstrap resamples (sampling with replacement)
alpha=0.05 # Significance level

#== Calling the class constructor
A1<-EDOIF(simData1$Values,simData1$Group, bootT=bootT, alpha=alpha, methodType ="perc") 

#== Visualizing results
print(A1) # print the results in text mode
plot(A1, fontSize=10) # plot the results in graphic mode

Graphic mode results

  1. A plot of the mean of each of the five categories with its (1 - alpha) confidence interval (95% for alpha = 0.05). The horizontal axis represents categories and the vertical axis represents values within distributions of categories.

  2. A dominant-distribution network of five categories. A node represents a category and an edge represents a dominant-distribution relation between categories. If there is an edge from category A to B, then A dominates B. A larger node size implies a higher mean value of a category.

  3. A plot of the mean differences between pairs of the five categories with their (1 - alpha) confidence intervals.

Text mode results

EDOIF (Empirical Distribution Ordering Inference Framework)
=======================================================
Alpha = 0.050000, Number of bootstrap resamples = 1000, CI type = perc
Using Mann-Whitney test to report whether A ≺ B
A dominant-distribution network density:0.900000
Distribution: Category1
Mean:10.840671 95CI:[ 9.706981,12.014179]
Distribution: Category2
Mean:11.044785 95CI:[ 9.806991,12.446037]
Distribution: Category3
Mean:50.462935 95CI:[ 49.208005,51.757706]
Distribution: Category4
Mean:70.299726 95CI:[ 69.103924,71.502505]
Distribution: Category5
Mean:91.190505 95CI:[ 89.895480,92.518455]
=======================================================
Mean difference of Category2 (n=150) minus Category1 (n=150): Category1 ⊀ Category2
 :p-val 0.4463
Mean Diff:0.204114 95CI:[ -1.545130,1.930609]

Mean difference of Category3 (n=150) minus Category1 (n=150): Category1 ≺ Category3
 :p-val 0.0000
Mean Diff:39.622264 95CI:[ 37.984831,41.378232]

Mean difference of Category4 (n=150) minus Category1 (n=150): Category1 ≺ Category4
 :p-val 0.0000
Mean Diff:59.459055 95CI:[ 57.921328,61.127817]

Mean difference of Category5 (n=150) minus Category1 (n=150): Category1 ≺ Category5
 :p-val 0.0000
Mean Diff:80.349835 95CI:[ 78.620391,82.133270]

Mean difference of Category3 (n=150) minus Category2 (n=150): Category2 ≺ Category3
 :p-val 0.0000
Mean Diff:39.418150 95CI:[ 37.543210,41.241722]

Mean difference of Category4 (n=150) minus Category2 (n=150): Category2 ≺ Category4
 :p-val 0.0000
Mean Diff:59.254941 95CI:[ 57.304359,61.098774]

Mean difference of Category5 (n=150) minus Category2 (n=150): Category2 ≺ Category5
 :p-val 0.0000
Mean Diff:80.145720 95CI:[ 78.313321,82.040234]

Mean difference of Category4 (n=150) minus Category3 (n=150): Category3 ≺ Category4
 :p-val 0.0000
Mean Diff:19.836791 95CI:[ 18.047421,21.762239]

Mean difference of Category5 (n=150) minus Category3 (n=150): Category3 ≺ Category5
 :p-val 0.0000
Mean Diff:40.727570 95CI:[ 39.004372,42.627946]

Mean difference of Category5 (n=150) minus Category4 (n=150): Category4 ≺ Category5
 :p-val 0.0000
Mean Diff:20.890780 95CI:[ 19.079287,22.625807]
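
The reported network density of 0.9 is consistent with dividing the number of "≺" relations by the number of category pairs. The sketch below reproduces that arithmetic from the text output above. The formula is an assumption inferred from the output and is not confirmed as the package's definition; the variable names are illustrative.

```r
# Assumed reading: density = (#pairs with a "≺" relation) / (#category pairs).
categories <- paste0("Category", 1:5)

# All 10 unordered pairs, one per row, as (lower index, higher index).
pairs <- t(combn(categories, 2))

# Relations copied from the text output: every pair is "≺" except
# Category1 vs Category2, which is reported as "⊀".
dominates <- !(pairs[, 1] == "Category1" & pairs[, 2] == "Category2")

density <- sum(dominates) / nrow(pairs)
cat(sprintf("edges = %d, pairs = %d, density = %f\n",
            sum(dominates), nrow(pairs), density))
```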

Citation

Chainarong Amornbunchornvej, Navaporn Surasvadi, Anon Plangprasopchok, and Suttipong Thajchayapong (2019). A nonparametric framework for inferring orders of categorical data from category-real ordered pairs. arXiv preprint arXiv:1911.06723.

Contact