When the outcome variable is binary, such as alive/dead or yes/no, the most popular analytic method is logistic regression.
\[\textrm{logit}(\mathbb{E}[y]) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots \]
The name “logistic” probably comes from the equation below, which is obtained by applying the inverse of the logit function to both sides of the equation above.
\[ \mathbb{E}[y] = \textrm{logistic}( \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots)\]
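Here, logit and logistic denote the usual pair of inverse functions:
\[ \textrm{logit}(p) = \log\frac{p}{1-p}, \qquad \textrm{logistic}(z) = \frac{1}{1+e^{-z}} \]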
The link function of logistic regression is logit(). If we replace it with log(), the result looks like this:
\[ \textrm{log}(\mathbb{E}[y]) = \beta_0 + \beta_1 x_1 + \cdots \] This equation represents “relative risk regression”, a.k.a. log-binomial regression.
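In R, the difference is just a choice of link in the binomial family. A minimal sketch, assuming a hypothetical data frame dat with a 0/1 outcome y and predictors x1 and x2:
fit.logit <- glm(y ~ x1 + x2, family = binomial(link = "logit"), data = dat)
fit.log   <- glm(y ~ x1 + x2, family = binomial(link = "log"),   data = dat)
The log link often needs help to converge, which is exactly what the rest of this post is about.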
Risk, Relative Risk
Risk is just another term for probability. For instance, “the probability of being hit by lightning” can be rephrased as “the risk of being hit by lightning”.
Relative risk, or risk ratio (RR), is the ratio of two probabilities (risks); it compares the probabilities of two events. For example, compare the probability of being hit by lightning when standing with nothing to the probability of being hit by lightning when holding an open umbrella. Dividing the second probability by the first tells us how many times more likely we are to be hit with an open umbrella than with nothing at all. This is the relative risk, or risk ratio. If it is 2, on average we will get hit twice (with an open umbrella) for every one hit (with nothing).
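As a quick illustration with made-up numbers:
p_nothing  <- 0.0001    # hypothetical risk of being hit with nothing
p_umbrella <- 0.0002    # hypothetical risk of being hit with an open umbrella
p_umbrella / p_nothing  # relative risk (risk ratio) = 2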
The name “relative risk regression” seems to come from the fact that the coefficients of a relative risk regression are closely related to relative risk! Let’s imagine a relative risk regression with only one predictor \(x\), which is \(1\) for having an umbrella open and \(0\) for having nothing. We can compare \(y|x=0\) and \(y|x=1\).
\[\log(y_{x=1}) = \beta_0 + \beta_1\] \[\Rightarrow y_{x=1} = \exp(\beta_0 + \beta_1)\]
\[\Rightarrow y_{x=1} = \exp(\beta_0)\exp(\beta_1)\]
\[y_{x=0} = \exp(\beta_0)\]
Combining the last two equations, we can derive the following.
\[y_{x=1}/y_{x=0} = \exp(\beta_1)\]
If we interpret \(y_{x=1}\) as the probability of being hit when \(x=1\) (with an umbrella open), then the relative risk, or risk ratio, is \(\exp(\beta_1)\)!
The risk of being hit with an umbrella open divided by the risk of being hit with nothing is the exponential of the coefficient \(\beta_1\). So if \(\beta_1\) equals 1, having an umbrella open makes the risk approximately 2.718 (\(\exp(1) = 2.718\cdots\)) times bigger. On average, you are likely to be hit 2.718 times (with an umbrella open) for every one time people are hit with nothing.
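We can check this with a quick simulation (a sketch with made-up parameters: the true risks are 0.05 with nothing and 0.10 with an open umbrella, so the true risk ratio is 2):
set.seed(1)
n <- 100000
x <- rbinom(n, 1, 0.5)            # 1 = umbrella open, 0 = nothing
p <- exp(log(0.05) + log(2) * x)  # true risks: 0.05 and 0.10
y <- rbinom(n, 1, p)
sim.fit <- glm(y ~ x, family = binomial(link = "log"),
               start = c(log(mean(y)), 0))
exp(coef(sim.fit))["x"]           # should be close to the true RR of 2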
Difficulties of applying MLE
Open any mathematical statistics textbook and you will see the wonderful properties of the MLE (maximum likelihood estimate). So MLE is the way to go when estimating the coefficients of a relative risk regression. But fitting a relative risk regression is difficult because it means optimizing the likelihood with constrained parameters. See the equations below.
\[\log(y|x_1) = \beta_0 + \beta_1 x_1\] \[y|x_1 = \exp(\beta_0 + \beta_1 x_1)\]
Since \(y\) stands for a probability, \(\exp(\beta_0 + \beta_1 x_1)\) cannot be less than \(0\) or greater than \(1\) for any possible \(x_1\)! Another problem is that since the parameters can lie on the boundary of the parameter space, it becomes difficult to estimate their variance.
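Since the exponential is always positive, the binding constraint is the upper bound:
\[ \exp(\beta_0 + \beta_1 x_1) \le 1 \;\Leftrightarrow\; \beta_0 + \beta_1 x_1 \le 0 \quad \text{for every possible } x_1 \]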
Using R for Relative Risk Regression
We can use the traditional function glm() for relative risk regression, but the package logbin seems to offer more convenience and functionality; in particular, it lets us choose the estimation method. Let’s get to it!
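If the packages used below are not installed yet, they are all on CRAN (a one-time setup step):
install.packages(c("logbin", "glm2", "dplyr", "tidyr", "ggplot2", "car"))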
First we will use the heart attack data (data(heart)). The description of the data can be found with ?heart:
This data set is a cross-tabulation of data on 16949 individuals who experienced a heart attack (ASSENT-2 Investigators, 1999). There are 4 categorical factors each at 3 levels, together with the number of patients and the number of deaths for each observed combination of the factors. This data set is useful for illustrating the convergence properties of glm and glm2.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(logbin) # https://github.com/mdonoghoe/logbin
require(glm2, quietly = TRUE)
data(heart)
head(heart)
## Deaths Patients AgeGroup Severity Delay Region
## 1 49 2611 1 1 1 1
## 2 1 74 1 1 1 2
## 3 2 96 1 1 1 3
## 4 30 2888 1 1 2 1
## 5 0 81 1 1 2 2
## 6 8 155 1 1 2 3
We can fit the relative risk regression model to the data as follows. Notice that the response part of the formula is cbind(# of successes, # of failures).
start.p <- sum(heart$Deaths) / sum(heart$Patients)
fit <-
logbin(cbind(Deaths, Patients-Deaths) ~
factor(AgeGroup) + factor(Severity)
+ factor(Delay) + factor(Region),
data = heart)
fit$converged
Using a binary response variable, we can do the following.
sum(duplicated(heart %>% select(AgeGroup:Region)))
## [1] 0
heart2 <- heart %>%
group_by(AgeGroup, Severity, Delay, Region) %>%
summarise(data.frame(dead = c(rep(1,Deaths),
rep(0,Patients-Deaths)))) %>%
ungroup()
## `summarise()` has grouped output by 'AgeGroup', 'Severity', 'Delay', 'Region'.
## You can override using the `.groups` argument.
fit2 <-
logbin(dead ~
factor(AgeGroup) + factor(Severity)
+ factor(Delay) + factor(Region),
data = heart2)
fit2$converged
For me, it took a LONG time! Here is a faster way.1
start.p <- sum(heart$Deaths) / sum(heart$Patients)
fit <-
logbin(cbind(Deaths, Patients-Deaths) ~
factor(AgeGroup) + factor(Severity)
+ factor(Delay) + factor(Region),
data = heart,
start = c(log(start.p), -rep(1e-4,8)),
method = 'glm2')
cat('Is fit converged? ', fit$converged, '\n')
## Is fit converged? TRUE
fit2 <-
logbin(dead ~
factor(AgeGroup) + factor(Severity)
+ factor(Delay) + factor(Region),
data = heart2,
start = c(log(start.p), -rep(1e-4,8)),
method = 'glm2')
cat('Is fit2 converged? ', fit2$converged, '\n')
## Is fit2 converged? TRUE
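Once a fit has converged, exponentiating the coefficients gives the estimated risk ratios, just like \(\exp(\beta_1)\) above (a quick sketch; coef() simply pulls the stored coefficients out of the fit):
exp(coef(fit))  # estimated risk ratio for each factor level vs. its reference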
Here is a tip: use the cbind(# of successes, # of failures) form. Using the binary response took longer!
The results are almost identical:
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
compareCoefs(fit, fit2)
## Calls:
## 1: logbin(formula = cbind(Deaths, Patients - Deaths) ~ factor(AgeGroup) +
## factor(Severity) + factor(Delay) + factor(Region), data = heart, start =
## c(log(start.p), -rep(1e-04, 8)), method = "glm2")
## 2: logbin(formula = dead ~ factor(AgeGroup) + factor(Severity) +
## factor(Delay) + factor(Region), data = heart2, start = c(log(start.p),
## -rep(1e-04, 8)), method = "glm2")
##
## Model 1 Model 2
## (Intercept) -4.0275 -4.0273
## SE 0.0889 0.0889
##
## factor(AgeGroup)2 1.104 1.104
## SE 0.089 0.089
##
## factor(AgeGroup)3 1.9268 1.9266
## SE 0.0924 0.0924
##
## factor(Severity)2 0.7035 0.7035
## SE 0.0701 0.0701
##
## factor(Severity)3 1.3767 1.3768
## SE 0.0955 0.0955
##
## factor(Delay)2 0.0590 0.0589
## SE 0.0693 0.0693
##
## factor(Delay)3 0.1718 0.1720
## SE 0.0808 0.0808
##
## factor(Region)2 0.0757 0.0757
## SE 0.1775 0.1775
##
## factor(Region)3 0.483 0.483
## SE 0.111 0.111
##
The authors of logbin state that it solves problems that might pop up with other packages. Let’s compare!
start.p <- sum(heart$Deaths) / sum(heart$Patients)
t.glm <- system.time(
fit.glm <-
logbin(cbind(Deaths, Patients-Deaths) ~
factor(AgeGroup) + factor(Severity)
+ factor(Delay) + factor(Region),
data = heart,
start = c(log(start.p), -rep(1e-4, 8)),
method = "glm",
maxit = 10000)
)
t.glm2 <- system.time(
fit.glm2 <- update(fit.glm, method='glm2'))
t.cem <- system.time(
fit.cem <- update(fit.glm, method = "cem")
#fit.cem <- update(fit.glm, method='cem', start = NULL)
)
t.em <- system.time(
fit.em <- update(fit.glm, method = "em"))
t.cem.acc <- system.time(
fit.cem.acc <- update(fit.cem, accelerate = "squarem"))
t.em.acc <- system.time(
fit.em.acc <- update(fit.em, accelerate = "squarem"))
objs = list("glm"=fit.glm,
"glm2"=fit.glm2,
"cem"=fit.cem,
"em"=fit.em,
"cem.acc" = fit.cem.acc,
"em.acc" = fit.em.acc)
params = c('converged', "loglik", "iter")
to_dataframe = function(objs, params) {
  # collect the requested components from each fitted model into a data frame
  dat = data.frame(model = names(objs))
  for (param in params) {
    dat[[param]] = sapply(objs, function(x) x[[param]])
  }
  return(dat)
}
dat = to_dataframe(objs, params)
dat$time = c(t.glm['elapsed'],
t.glm2['elapsed'],
t.cem['elapsed'],
t.em['elapsed'],
t.cem.acc['elapsed'],
t.em.acc['elapsed'])
Let’s see the result.
print(dat)
## model converged loglik iter time
## 1 glm FALSE -186.7366 10000 1.61
## 2 glm2 TRUE -179.9016 14 0.00
## 3 cem TRUE -179.9016 223196, 8451 42.47
## 4 em TRUE -179.9016 6492 2.34
## 5 cem.acc TRUE -179.9016 4215, 114 3.78
## 6 em.acc TRUE -179.9016 81 0.09
The authors of the logbin package stated that cem is the best method, but it took the longest time. glm2 was the fastest and converged, but it requires sensible starting values. So we cannot tell which one will win when the data are larger and the model is more complex.
In the next post, I will explain how the model and the meaning of the coefficients change with different link functions.
This one uses the glm2 package. I think logbin is just a wrapper in this case. I omitted warnings and messages.↩︎