Introduction

This vignette shows how to use the gfmR package in R. The package implements the group fused multinomial regression model described in "Automatic Response Category Combination in Multinomial Logistic Regression" by Bradley S. Price, Charles J. Geyer, and Adam J. Rothman. The vignette demonstrates the major functions in the package using the example presented in that manuscript: finding response category groupings of self-identified political party in the nes96 data from the faraway package, based on age, education, and income. For a description of the methodology we refer the reader to the manuscript, which can be found at: https://arxiv.org/abs/1705.03594.

Data Description

library(gfmR)
## Loading required package: parallel
## Loading required package: nnet
## Loading required package: faraway
data(nes96)
head(nes96)
##   popul TVnews selfLR ClinLR DoleLR     PID age  educ   income    vote
## 1     0      7 extCon extLib    Con  strRep  36    HS $3Kminus    Dole
## 2   190      1 sliLib sliLib sliCon weakDem  20  Coll $3Kminus Clinton
## 3    31      7    Lib    Lib    Con weakDem  24 BAdeg $3Kminus Clinton
## 4    83      4 sliLib    Mod sliCon weakDem  28 BAdeg $3Kminus Clinton
## 5   640      7 sliCon    Con    Mod  strDem  68 BAdeg $3Kminus Clinton
## 6   110      3 sliLib    Mod    Con weakDem  21  Coll $3Kminus Clinton

We have 7 levels of party identification (PID):

levels(nes96$PID)
## [1] "strDem"  "weakDem" "indDem"  "indind"  "indRep"  "weakRep" "strRep"

Tuning Parameter Selection

Penalized likelihood methods rely on tuning parameter selection, so that is where we will begin our discussion. To show the basic functionality of the software we first need to understand the data requirements. The first requirement is a matrix of category counts for the response variable: the rows of the Y matrix correspond to observations and the columns to observed category counts. Note that the current implementation assumes a multinomial experiment of size 1.

We construct the matrix Response to serve as our response in this example.

attach(nes96)
Response=matrix(0,944,7)
for(i in 1:944){
  if(PID[i]=="strRep"){Response[i,1]=1}
  if(PID[i]=="weakRep"){Response[i,2]=1}
  if(PID[i]=="indRep"){Response[i,3]=1}
  if(PID[i]=="indind"){Response[i,4]=1}
  if(PID[i]=="indDem"){Response[i,5]=1}
  if(PID[i]=="weakDem"){Response[i,6]=1}
  if(PID[i]=="strDem"){Response[i,7]=1}
}
head(Response)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]    1    0    0    0    0    0    0
## [2,]    0    0    0    0    0    1    0
## [3,]    0    0    0    0    0    1    0
## [4,]    0    0    0    0    0    1    0
## [5,]    0    0    0    0    0    0    1
## [6,]    0    0    0    0    0    1    0
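As an aside, an indicator matrix like the one built by the loop above can also be produced with model.matrix. The sketch below uses a small hypothetical factor rather than the full nes96 data; the column reversal matches the loop's strRep-to-strDem column ordering, which is the reverse of the factor level order.

```r
# One-hot encode a factor into an indicator matrix (sketch; small toy data,
# not the full nes96 sample)
pid <- factor(c("strRep", "weakDem", "strDem"),
              levels = c("strDem", "weakDem", "indDem", "indind",
                         "indRep", "weakRep", "strRep"))
Y <- model.matrix(~ pid - 1)       # columns follow the factor level order
Y <- Y[, rev(seq_len(ncol(Y)))]    # reverse to match the loop's ordering
```

Each row of Y then has a single 1 in the column for that observation's category, exactly as required by gfmR.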

Next we define our penalty set, the set whose elements will be fused together to create the estimator. We are going to use the ordered example from the manuscript, but the unordered example could be used as well.

Hmat2=matrix(0,dim(Response)[2],dim(Response)[2])
for(j in 1:6){
  Hmat2[j,j+1]=1
  Hmat2[j+1,j]=1
}
Hmat2[3,5]=1
Hmat2[5,3]=1
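An equivalent way to build the same fusion matrix is from an explicit edge list, which makes the graph structure easier to read; this is just a sketch of an alternative construction, not part of the gfmR API.

```r
# The same adjacency ("fusion") matrix, built from an explicit edge list:
# a chain over the 7 ordered categories plus one extra edge between 3 and 5
edges <- rbind(cbind(1:6, 2:7),  # chain: adjacent ordered categories
               c(3, 5))          # extra edge fusing categories 3 and 5
H <- matrix(0, 7, 7)
H[edges] <- 1
H <- H + t(H)                    # symmetrize: edges are undirected
```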

The next step is to establish the set of predictors that we will use to analyze the data. We simply use the model matrix produced by lm.

ModMat <- lm(popul ~ age + educ + income, x = TRUE)$x

X <- cbind(ModMat[, 1], apply(ModMat[, -1], 2, scale))

Finally, we create a 5-fold cross validation partition by randomly assigning observations to folds.

set.seed(1010)
n=dim(Response)[1]
sampID=rep(5,n)
samps=sample(1:n)
mine=floor(n/5)
for(j in 1:4){
  sampID[samps[((j-1)*mine+1):(j*mine)]]=j
}
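As a more compact alternative, the fold assignment can be done in one line; this sketch is equivalent in spirit to the loop above (the exact assignment differs because the random draws are used differently).

```r
# Randomly assign n observations to 5 roughly equal folds
set.seed(1010)
n <- 944
sampID <- sample(rep(1:5, length.out = n))
table(sampID)  # each fold has about n/5 observations
```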

The function GFMR.cv performs the cross validation. We have added multicore functionality for platforms that support it; Windows users should set n.cores=1. The example provided here has 944 observations with 7 response categories. We use 5 cores to speed up the 5-fold cross validation.

o1 <- GFMR.cv(Response, X, lamb = 2^seq(4.2, 4.3, .1), H = Hmat2, sampID = sampID, n.cores = 5, rho = 10^2)
names(o1)
## [1] "vl"     "vl.sd"  "lambda" "vl.mat"
o1$vl
## [1] -355.8967 -355.8744
which(o1$vl==max(o1$vl))
## [1] 2
o1$lambda[2]
## [1] 19.69831

Basic Model Run

Once the tuning parameter has been selected, we refit the model on the entire data set. We have adjusted the iterations and tuning parameters for speed and convergence.

mod <- GroupFusedMulti(Response, X, lambda = 2^4.3, H = Hmat2, rho = 10^2, iter = 50, tol1 = 10^-4, tol2 = 10^-4)
mod
## Number of groups[1] 5
## The corresponding groups are[[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3 4 5
## 
## [[4]]
## [1] 6
## 
## [[5]]
## [1] 7

Finally, we see the result of the tuning parameter selection: a model with 5 groups, in which the independent Republicans, independent Democrats, and pure independents are combined into a single category.