Maximum Likelihood

Background

Recall, that most of our interesting distributions come in parameterized families. For example, \(\textrm{Binom}(n,p)\), is a two-parameter family of discrete distributions. When \(p\) and \(n\) are specified then the density function for the distribution is given by this equation:

\[ \textrm{P}(X=k) = {n \choose k}p^k(1-p)^{n-k} \]

Probability Histogram of Binomial Distribution

Number of trials

Probability of Success

The Likelihood Function

On the other hand if we have an observed value \(X=k\), and assuming we know the value of \(n\), then we can produce a likelihood function. This is just the function that describes the probability of observering \(X=k\) for different values of the parameter \(p\). In other words– the parameter function fixed \(n\) and \(p\) and shows the result of varying \(k\). While the likelihood function (in this scenario) fixes \(k\) and \(n\) and sees what happens when we vary \(p\).

The result is not a probability distribution (review the Week 7 and Week 8 notes). But, given \(k\), we can choose the value of \(p\) for which the probability of attaining the observed \(k\) is highest.

In the notes I used the explicit formula represented above… but R has a wide variety of built-in distributions… Below is an interactive demonstration of the likelihood function for \(0 \le p \le 1\) with the option of choosing various values of \(n\) and \(k\):

Number of trials

Number of successes (as percentage of total trials)

Maximum Likelihood Estimate

You will notice that in all cases– there is a single maximum. The value of \(p\) for which this occurs is the maximum likelihood estimate for \(p\). You might also notice that this is a smooth curve– even though \(\textrm{Binom}(n,p)\) is a discrete distribution. That’s because we are smoothly varying the possible values of \(p\). Those values are not discrete– even though the allowed values of \(k\) are discrete.

You will also notice that the desired maximum is achieved at \(p=\frac{k}{n}\). This fact can be derived analytically by taking the derivative of the likelihood function (with respect to \(p\)) and finding which values of \(p\) make the derivative equal to 0. (Hint… the answer is at \(p=\frac{k}{n}\)), but we can also use an approximation technique.

Naive Grid Search

Here’s one, very inefficient technique. Call it naive grid search:

maximizer=function(f,a,b,tolerance=0.01){
  x=seq(a,b,length.out=1000)
  F=Vectorize(f) #Just in case f isn't already vector-capable
  best=which.max(F(x))
  if(best==1 | best==length(x)){return(x[best])} # maximum occurs at boundary
  lower=x[best-1]
  upper=x[best+1]
  if(abs(upper-lower)<tolerance){return(x[best])} #Ensures recursion eventually stops
  return(maximizer(f,x[best-1],x[best+1],tolerance))
}

Here’s an example of using it:

n=10;k=3
lik=function(p){dbinom(k,n,p)}
plot(0,0,type="n",xlim=c(0,1),ylim=c(0,0.4),main=paste0("Likelihood for Binom(",n,",p) when X=",k),xlab="prob p",ylab="Likelihood")
  curve(lik(x),0,1,add=TRUE,type="h",col="skyblue")  
  curve(lik(x),0,1,add=TRUE,lwd=2)
near.best=maximizer(lik,0,1,0.0001)
abline(v=near.best)

near.best

## [1] 0.3000007

Exercise 1

In the exercise below enter the maximizer function (by hand.. this is not a good copy-and-paste situation). Make sure you understand what it is doing.

Using the dpois() function in R and a parameter value of \(\mu=4\), create the following graphs using the code block below:

A barplot of the distribution for \(Y \sim \textrm{Pois}(4)\) showing \(0 \le X \le 10\) (use the named argument names.arg to provide the labels for the bars.)
For a given value of \(Y=3\), produce the likelihood graph for \(0 \le \mu \le 10\).
Use the maximizer() function to estimate the best value of \(\mu\) for the observed value of \(Y=3\)
Add a vertical line to your graph of the likelihood function at that location Replicate the code above to generate a maximum likelihood curve for a Poisson

(I suspect the maximum likelyhood estimate is not going to surprise you).

Likelihood for multiple observations

Now, in these examples we’ve been using a situation in which we had one observation. What if we had multiple observations? Let’s consider \(n\) independent observations. All of which come from the same distribution. We’ll call the random variables \(Y_1, \ldots, Y_n\) and the observed values as \(y_1, \ldots, y_n\). The probability function will depend upon the exact distribution of \(Y_i\), but for now:

\[ \textrm{P}(Y_1=y_1 \textrm{ and } Y_2=y_2 \textrm{ and }\cdots \textrm{ and }Y_n=y_n ) = \textrm{P}(Y_1=y_1)\textrm{P}(Y_2=y_2)\cdots \textrm{P}(Y_n=y_n) \] The independence of the \(n\) observations makes the calculation of the joint probability the product of the individual probabilities… nice! Let’s suppose that \(Y_i \sim \textrm{Pois}(\mu)\). Recall the probability density function for the Poisson distribution:

\[ P(X=k) = e^{-\mu}\frac{\mu^k}{k!} \] Then the joint probability is

\[ \textrm{Joint Prob} = e^{-\mu}\frac{\mu^{y_1}}{y_1!}e^{-\mu}\frac{\mu^{y_2}}{y_2!}\cdots e^{-\mu}\frac{\mu^{y_n}}{y_n!} \]

Now we can do our Likelihood trick and reinterpret this as a function of \(\mu\). This time, however, there is more data:

\[ \mathcal{L}(\mu) = e^{-\mu}\frac{\mu^{y_1}}{y_1!}e^{-\mu}\frac{\mu^{y_2}}{y_2!}\cdots e^{-\mu}\frac{\mu^{y_n}}{y_n!} \] The maximizer() doesn’t care. It can handle this without any problems

Let’s generate 20 output values from a known distribution and see how close our maximim likelihood estimate works:

set.seed(1)
Y=rpois(20,3)
plot(Y)

barplot(table(Y))

Our likelihood function is going to depend upon the values in the vector \(Y\). We hid that in our example of the \(\textrm{Binom}\) distribution. Reread and notice that the value of \(k\) (we also used \(k=3\) in that example) was part of the calculation. I am going to do something similar to make the presentation easier to follow… but again, our likelihood function will have a value that depends upon all of the \(y_1, y_2, \ldots, y_n\) values in the variable \(Y\). Furthermore I do not want to write my function using all those factorial and \(e^{-\mu}\) terms… luckily R has all of the already encoded in dpois(k,3):

lik=function(mu){#Assume Y holds the relevant data
  prod(dpois(Y,mu))
}

Lik=Vectorize(lik)
plot(0,0,type="n",xlim=c(0,10),main="Likelihood for Pois(mu) when Y=[y1,y2,...,yn]",xlab="mu",ylab="Likelihood",ylim=c(0,3.5e-17))
  curve(Lik(x),0,10,add=TRUE,type="h",col="skyblue")  
  curve(Lik(x),0,10,add=TRUE,lwd=2)

Now before I attempt to maximize this… notice how very, very small the y scale is. That should make some sense. After all the probability of getting any particular collection of values is actually quite low. Numbers that tiny should also start to make you worry a bit. Are we getting into a realm where our floating point representations are sufficient?

So we introduce our next key idea in the following section.

Log Likelihood

Rather than maximizing the likelihood we want maximize the log likelihood. That’s just the likelihood with a \(\log()\) function applied to it. We get several advantages:

The \(\log()\) function (where defined) is strictly monotonic. It preserves order. This means that the value of \(\mu\) for which \(\mathcal{L}(\mu)\) achieves a maximum is the same as that for whcih \(\log \mathcal{L}(\mu)\) achieves a maximum.
Multiplication turns into addition… so instead of a bunch of terms of the form \(\textrm{P}(Y_i=y_i)\) being multiplied, we add a bunch terms of the form \(\log\textrm{P}(Y_i=y_i)\).
All those pesky \(e^{-15}\), etc terms become far more manageable
The potential for floating point rounding errors messing up the results is diminished.

So let’s calculate the log likelihood function. Notice that for any individual observation \(y_i\):

\[ \begin{aligned} \textrm{P}(Y_i=y_i) &= e^{-\mu}\frac{\mu^{y_i}}{y_i!}\\ \log\textrm{P}(Y_i=y_i) &= \log\biggl(e^{-\mu}\frac{\mu^{y_i}}{y_i!}\biggr)\\ &= \log\biggl(e^{-\mu}\biggr)+\log\biggl(\frac{\mu^{y_i}}{y_i!}\biggr)\\ &= -\mu + \log(\mu^{y_i})-\log(y_i!)\\ &= -\mu + y_i\log(\mu)-\log(y_i!)\\ \end{aligned} \]

The only potentially problematic term is \(\log(y_i!)\). The issue is that if we calculated \(y_i!\) as part of our intermediate work we might produce a value too big to be represented by the normal numeric values in R. Luckily, R has an lfactorial() specifically for calculating such things.

Okay… So, our overall log likelihood is the sum of terms like those up above. Also notice that out log likelihood function is NOT happy with \(\mu=0\) (that’s because \(\log(\mu)\) is not defined for \(\mu=0\))… We’ll start at \(\mu=0.1\):

log.lik=function(mu){# Assume Y contains the values y_1, ..., y_n
  sum(Y*log(mu)-mu - lfactorial(Y))
}
Log.lik=Vectorize(log.lik)
par(mfrow=c(1,2))
plot(0,0,type="n",xlim=c(0,10),main="Log Likelihood",xlab="mu",ylab="Log Likelihood",ylim=c(-160,-30))
  curve(Log.lik(x),0,10,add=TRUE,lwd=2)
plot(0,0,type="n",xlim=c(0,10),main="Likelihood",xlab="mu",ylab="Likelihood",ylim=c(0,3.5e-17))
  curve(Lik(x),0,10,add=TRUE,type="h",col="skyblue")  
  curve(Lik(x),0,10,add=TRUE,lwd=2)

Let’s use the maximizer on both:

maximizer(lik,0,10,0.001)

## [1] 3.349997

maximizer(log.lik,0,10,0.001)

## [1] 3.349997

So… probably 3.35. Now we know that the true value was \(\mu=3\), but that’s the problem with random data. And therein lies an important cautionary tale: The maximum likely estimate generated by a collection of random values drawn from the same distribution is frequently not the true parameter value (which is why we use confidence intervals). But, it’s still the (by several measures) the best we can do… so that’s our approach.

If you want to do a fun exercise in calculus take the derivative with respect to \(\mu\) of the log likelihood. Set equal to 0 and solve. You’ll find that the answer to that equation is \(\mu = \overline{Y}\):

mean(Y)

## [1] 3.35

Yep. Math works.

Now it’s your turn:

Find the mean of the 30 randomly generated values.
Create a function to calculate the log likelihood for the Poisson family
Plot the likelihood function from \(0 \le \mu \le 9\):

set.seed(10)
y=rpois(30,2.3)

Bringing in the model

If you re-read the last few sections carefully you’ll notice that all our \(y_i\) values were drawn from the same distribution (in our examples \(\textrm{Pois}(3)\)). That’s not the way it works when we are doing regression. In that situation we assume that the distribution describing \(Y_i\) (the ith sample) is directly related to the values of the explanatory variables \(X_1, X_2, \ldots X_p\) (where I’m using \(p\) to denote the number of explanatory variables).

So this means that \(\textrm{P}(Y_i=y_i)\) is actually going to use different formulas for different \(i\). Broadly speaking (if I use the Poisson family):

\[ Y_i \sim \textrm{P}(\mu_i) \]

That’s correct as far as it goes, but it overlooks the dependence of \(\mu_i\) on the explanatory variable values \(x_{i1},x_{i2},\ldots x_{in}\). The first subscript denotes the observation (aka the row) and the second denotes the variable (aka the column). So we have the linking relationship:

\[ \begin{aligned} g(\mu_i) &= b_0 + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_p x_{ip} + \textrm{error}\\ \mu_i &= g^{-1}\biggl(b_0 + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_p x_{ip}\biggr) + \textrm{error}\\ \end{aligned} \] So here’s the magic. If \(g\) is 1-1 then \(g^{-1}\) exists and is strictly monotonic. (We usually use \(g\) monotonically increasing… which also means that \(g^{-1}\) is also monotonically increasing).

For example, in logistic regression \(\mu_i = p_i\) and \(g(p) = \log\left(\frac{p}{1-p}\right)\) (otherwise known as the log odds) We say that the link function is the logit. We have also seen that if \(g(p) = \frac{p}{{1-p}}\), then \(g^{-1}(\textrm{log odds}) = \frac{1+\exp(\textrm{log odds})}{\exp{\textrm{log odds}}}\)

Recall:

\[ \begin{aligned} \log\biggl(\frac{p}{1-p}\biggr)&=\textrm{log odds}\\ \frac{p}{1-p} &= \exp(\textrm{log odds})\\ \frac{1-p}{p}&= \frac{1}{\exp(\textrm{log odds})} & \textrm{taking reciprocal of both sides}\\ \frac{1}{p} - 1&= \frac{1}{\exp(\textrm{log odds})}\\ \frac{1}{p} &= \frac{1}{\exp(\textrm{log odds})}+1\\ \frac{1}{p} &= \frac{1+\exp(\textrm{log odds})}{\exp(\textrm{log odds})}\\ p&=\frac{\exp(\textrm{log odds})}{1+\exp(\textrm{log odds})}\\ &=\frac{1}{\exp(-\log\textrm{odds})+1} \end{aligned} \]

Oh right: \(g^{-1}\) is the logistic curve! And the linear part of the equation is predicting the value of \(\log \textrm{odds}\)! Lets let \(\eta_i = b_0 + b_1 X_{i1} + b_2 X_{i2} + \cdots + b_p X_{ip}\). We now know that \(Y_i\), with corresponding fixed, explanatory values \(X_{i*}\) should have this

\[ Y_i \sim \textrm{Binom}\bigg(n_i,\frac{1}{\exp(-\eta_i)+1}\bigg) \]

where \(n_i\) is the number of “trials” contributing to the value of \(Y_i\).

Let’s imagine that we have one explanatory variable and so our linear model has two parameters. An intercept which we will, imaginatively, call \(b_0\) and and coefficient for the explanatory called \(b_1\). In order to make this more concrete. Let’s use the UCLA data set from the regression tutorial and try to make our prediction using \(\textrm{gre}\):

df <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

To more closely match what we’ve discussed above we would want to split the observations based off the values of the explanatory variables. It actually turns out that we do NOT need to do that. We can treat each observation (which is a 0 or a 1) as its own observation. Although I won’t prove it– the maximum likelehood estimate for \(b_0\) and \(b_1\) will be the same whether we group the observations into binomial situations with \(n\ge 1\) or treat them all as Bernoulli distributions with \(n=1\). Either way we will get the same estimate. As a reminder the response is \(\textrm{admit}\) and the explanatory that we will use is \(\textrm{gre}\). With that in mind:

\[ \textrm{P}(Y_i=y_i) = \begin{cases} \frac{1}{\exp(-(b_0 + b_1\ \textrm{gre}))+1} & \textrm{ if }y_i = 1\\ 1- \frac{1}{\exp(-(b_0 + b_1\ \textrm{gre}))+1}& \textrm{ if }y_i = 0\end{cases} \] And so we produce the likelihood equation in terms of the two parameters, \(b_0\) and \(b_1\):

\[ \mathcal{L}_i(b_0,b_1) = \begin{cases} \frac{1}{\exp(-(b_0 + b_1\ \textrm{gre}))+1} & \textrm{ if }y_i = 1\\ 1- \frac{1}{\exp(-(b_0 + b_1\ \textrm{gre}))+1}& \textrm{ if }y_i = 0\end{cases} \]

(I used \(\mathcal{L}_i()\) to indicate the contribution of observation \(i\) to the overall Likelihood.

We will then want to calculate the \(\log\textrm{likelihood}\). Although we don’t have the nicest analytic formula, we can write this quite concisely in code by making the following identifications:

\[ \begin{aligned} \eta_i &= b_0 + b_1 x_{i}\\\ \eta &= b_0 + b_1 x\\ \textrm{logistic}(\eta) &= \frac{1}{1+\exp(-\eta)} \end{aligned} \]

In order to build a maximizer we are going to need to deal with two parameters. So first let’s set up the log likelihood function. As you read the code below notice the two uses of Vectorize. This ensures that the function will produce an output for every combination of values in vector \(b_0\) and in vector \(b_1\):

logistic=Vectorize(function(x){1/(1+exp(-x))})

log.lik=function(b0,b1){#Assume that X contains explanatory and Y contains response
    eta=function(x){b0 + b1*x} #Assume b0, b1 are scalar
    prob=logistic(sapply(df$gre,eta))
    comp=1-prob
  sum(log(ifelse(df$admit==1,prob,comp)))
}
log.lik=Vectorize(log.lik,"b0")
log.lik=Vectorize(log.lik,"b1") #Ensure that applying log.lik to two vectors makes table of outputs

Next we need to create our function that searches for the maximum. I’m including a very inefficient example below. It is also very sensitive to initial conditions (read more about that later in this tutorial). The code is hidden (you’ll need to look at the raw .RMD file) because the exercise below asks you to try to write one of your own.

glm(family=binomial,data=df,admit~gre)

## 
## Call:  glm(formula = admit ~ gre, family = binomial, data = df)
## 
## Coefficients:
## (Intercept)          gre  
##   -2.901344     0.003582  
## 
## Degrees of Freedom: 399 Total (i.e. Null);  398 Residual
## Null Deviance:       500 
## Residual Deviance: 486.1     AIC: 490.1

(opt1<-maximizer(log.lik,-3,-2,0,0.05,0.001,0.00001,15))

## [1] -2.894814843  0.003571429

One short-coming of the grid-search approach is that it can be easy to converge to the wrong answer. The changes in x and the changes in y might not be on remotely similar scales and the algorithm might get locked into the wrong region of parameter space. The number of subdivisions made at each level can influence this– sometimes increasing the number of subdivisions makes the estimate less accurate because the algorithm didn’t happen to land very close to the true optimimum:

(opt2<-maximizer(log.lik,-3,-2,0,0.05,0.001,0.00001,20))

## [1] -2.449067667  0.002856425

Let’s compare:

log.lik(opt1[1],opt1[2])

## [1] -243.0281

log.lik(opt2[1],opt2[2])

## [1] -243.3116

Her’s a contour map reminiscent of the first iteration of the grid-searcher:

x=seq(-3,-2,length.out=20)
y=seq(0,0.05,length.out=20)
samples<-log.lik(x,y)
contour(x,y,samples)

So the problem is that the maximizer finds the best contour line in the wrong region. The grid size shrinks down and it can’t get back out to the proper region quickly enough. This can be addressed by a smarter choice of initial parameter bounds– but that requires a fair amount of interaction from the user:

x=seq(-3,-2,length.out=20)
y=seq(0,0.005,length.out=20)
samples<-log.lik(x,y)
contour(x,y,samples)

You can probably tell from the contour map the region you would prefer to focus your search for the best value. You are intuitively following a gradient ascent method.

Exercise: Attempt to create your own maximizer. Do not spend more than an hour on this exercise. If you find yourself not moving forward, you can look at the code for the maximizer that is in the source code of this document.

df <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

logistic=Vectorize(function(x){1/(1+exp(-x))})

log.lik=function(b0,b1){#Assume that X contains explanatory and Y contains response
    eta=function(x){b0 + b1*x} #Assume b0, b1 are scalar
    prob=logistic(sapply(df$gre,eta))
    comp=1-prob
  sum(log(ifelse(df$admit==1,prob,comp)))
}
log.lik=Vectorize(log.lik,"b0")
log.lik=Vectorize(log.lik,"b1") #Ensure that applying log.lik to two vectors makes table of outputs

glm(family=binomial,data=df,admit~gre)

maximizer=function(f,x1,x2,y1,y2,tolerance.x,tolerance.y,n){
  #Your code goes here
}

maximizer(log.lik,-3,-2,0,0.05,0.001,0.00001,15)

The important thing to take away from this are as follows:

You can write a function to numerically search for an optimum
Somebody has almost certainly made a more efficient version
You don’t need to understand any deeper theory to search for optimmum numericaly
All GLM fitting, ultimately is a form of searching for parameter values that have the highest likelihood for the given data set.

Likelihood ratios and deviance

So… we’ve chosen a parameterized family of distributions that we think is a good fit for our reponse variable, and we have a link function and its inverse that allow us to take a good ole linear expression of the form \(b_0 + b_1 X_1 + \cdots + b_p X_p\) and transform it into the parameter for that family of distributions. We know, in the background, that there is likelihood function (or a log likelihood function). That likelihood function makes use of the data and is, in turn, used by some numerical approximating algorithm to estimates the values of \(b_0, b_1, \ldots b_p\) that result in a model that maximizes the likelihood of the observed data. This all happens automatically if we use a function like glm(). Note that the model has parameters \(b_0, b_1, \ldots b_p\) and the conditional distribution of \(P(Y|X)\) is a member of parameterized family. That second “parameter” is not refering to \(b_0, b_1, \ldots b_p\). In the logistic distribution that we’ve been playing with, the families of distributions is binomial and we need to determine the value of \(p\) for particular collections of explanatory values. In the example above \(p=\textrm{logistic}(b_0 + b_1\ \textrm{gre})\). The link function is the \(\texrm{log odds}\) (also known as the \(\textrm{logit}\)) and the inverse is the \(\textrm{logistic function}\) which turns the \(\textrm{log odds}\) back into the probability parameter \(p_i\): \(Y_i \sim \textrm{Binom}(n_i,p_i)\).

Likelihood ratios

Now how do we compare models? I’m not going into the theory of how confidence intervals can be generated for the coefficients in a GLM. But I do want to touch upon Likelihood ratios. The idea is that you start with two nested models. (That means that there are two models– one simpler than the other. The simpler model uses a subset of the explanatory variables from the other. We will refer to it as \(M_{\textrm{restricted}}\) and the larger model as \(M_{\textrm{full}}\)). Increasing the number of explanatory variables provides more model parameters which make it possible to have more control over the value of response family parameter. The likelihood of a fuller model will always be an improvement.

So the general approach is:

\[ -2\log\biggl(\frac{\mathcal{L}(M_{\textrm{restricted}})}{\mathcal{L}(M_{\textrm{full}})}\biggr) \] The ratio will be less than 1 (and hence produce a negative value) when the log is applied. The -2 makes the result positive. The overall statistic (at least asymptotically) has a \(\chi^2\) distribution with \(p_{\textrm{full}} - p_{\textrm{nested}}\) degrees of freedom. (Ie– the difference in the number of parameters between the two models). Here’s a quick example:

model.full<-glm(data=df,admit~gre+gpa+rank,family=binomial())
model.gre<-glm(data=df,admit~gre,family=binomial())
anova(model.gre,model.full,test="Chisq")

Notice that it is smart enough to do an analysis of deviance (which is the GLM equivalent to an analysis of variance). So we have statistically significant evidence that the full model is doing a better job than the reduced model.

Saturated Model

But what is this deviance thing that we keep seeing?

To undestand that we need to understand the saturated model. This is a special model that doesn’t take the linear predictors into considerations. Every observed response \(y_1, y_2, \ldots y_n\) is taken to have its own, independent distribution. Rather than the explanatory values, we just assume that \(y_i\) perfectly describes the mean for the distribution of \(Y_i\). For the logistic regression this is particularly simple. Whatever we observe (be it 0 or 1) has a 100% chance of occuring (under the saturated model). So the Log Likelihood of that model is a sum of terms of the form \(\log(1)\). But all of those are 0. In other words, the log likelihood of the saturated model is precisely 0.

Deviance

The deviance is a ratio test between the model under consideration and the saturated model: \[ \begin{aligned} \textrm{Deviance} &= -2 \log\left(\frac{\mathcal{L}(M)}{\mathcal{L}(M_{\textrm{saturated}})}\right)\\ &= -2 \biggl(\log\left(\mathcal{L}(M)\right)-\log\left(\mathcal{L}(M_{\textrm{saturated}})\right)\biggr) \end{aligned} \]

In the logistic case, the log likelihood was exactly 0. Furthermore, for the main model, \(p_i\) is the fitted value (ie \(\textrm{logistic(log odds)}\)). An individual observation has a Bernouli distribution:

\[ \textrm{P}(Y_i=y)= p_i^y(1-p_i)^{1-y}\qquad\textrm{where }y=0,1 \]

And hence:

\[ \begin{aligned} \log(\textrm{P}(Y_i=y))&= \log(p_i^y(1-p_i)^{1-y})\qquad\textrm{where }y=0,1\\ &=y\log(p_i) + (1-y)\log(1-p_i) \end{aligned} \]

\[ \begin{aligned} \textrm{Deviance of logistic} &= -2 \log\left(\frac{\mathcal{L}(M)}{\mathcal{L}(M_{\textrm{saturated}})}\right)\\ &= -2 \biggl(\log\left(\mathcal{L}(M)\right)-\log\left(\mathcal{L}(M_{\textrm{saturated}})\right)\biggr)\\ &= -2 \biggl(\log\left(\mathcal{L}(M)\right)- 0\biggr)\\ &= -2 \biggl(\log\left(\mathcal{L}(M)\right)\biggr)\\ &= -2 \biggl(\sum \left(y_i\log(p_i) + (1-y_i)\log(1-p_i)\right)\biggr)\\ \end{aligned} \] Or, in R:

observed=df$admit
fitted=fitted.values(model.gre)
-2*sum(observed*log(fitted) + (1-observed)*log(1-fitted))

## [1] 486.0561

And you will notice that we have correctly derived the deviance reported by the glm function:

model.gre

## 
## Call:  glm(formula = admit ~ gre, family = binomial(), data = df)
## 
## Coefficients:
## (Intercept)          gre  
##   -2.901344     0.003582  
## 
## Degrees of Freedom: 399 Total (i.e. Null);  398 Residual
## Null Deviance:       500 
## Residual Deviance: 486.1     AIC: 490.1

Deviance residuals

In normal regression the variance in the response is broken down into variance components:

\[ \begin{aligned} \textrm{Observed Response} &= \textrm{Fitted Value} + \textrm{Residual}\\ \textrm{VAR}(\textrm{Response}) &= \textrm{VAR}(\textrm{Fitted Value}) + \textrm{VAR}(\textrm{Residual})\\ \end{aligned} \] But this is because of the independence between the fitted values and the residuals. In otherwords– if we look at the distribution of all residuals vs residuals that are associated to specific combinations of explanatory values then independence demands that we have:

\[ P(\textrm{residual}=e)= P(\textrm{residual}=e |\textrm{explanatory values}) \]

But if the variance of the residuals is different for different explanatory values, then the above equation is false. This means the fundamental assumption between using Analysis of Variance to explore the model and its nested sub-models fails!

Note, for reference, that if the conditions are satisifed, then the variance in the response is the sum of the variance of the fitted values and the variance of the residuals. The variance of the fitted values is purely due to the model. The variance of the residuals however is related to the conditional distribution of \(Y\) (given \(X\)).

If we take the variance due to the model and divide by the total variance. This is, amazingly equal to \(r^2\) (the square of the correlation coefficient).

But you might notice a similarity with the likelihood ratio idea. (In that we’re looking at ratios).

You might also notice that each observation is the sum of a fitted value and a residual (or error): \(y_i = \widehat{y_i} + \varepsilon_i\). Also recall that expected value of the residuals is 0. So look at how the variance relationship plays out in a bit more detail:

\[ \begin{aligned} \textrm{VAR}(Y) &= \textrm{VAR}(\widehat{Y}) + \textrm{VAR}(\textrm{err})\\ \mathbb{E}\biggl((Y-\overline{Y})^2\biggr) &= \mathbb{E}\biggl((\widehat{Y}-\overline{\widehat{Y}})^2\biggr)+\mathbb{E}\biggl((\varepsilon-0)^2\biggr)\\ (Y-\overline{Y})^2 &= (\widehat{Y}-\overline{\widehat{Y}})^2+(\varepsilon)^2\\ \end{aligned} \]

There’s a lot of symbol overload in that last collection of equations… But the key idea is that the variance of the response can be written as sum of terms– ONE FOR EACH observation in the data table:

\[ \begin{aligned} \sum(y_i-\overline{Y})^2 &= \sum \big(\widehat{y_i} - \mathbb{E}(\widehat{Y})\big)^2 + \varepsilon_i^2\\ \end{aligned} \]

This does NOT mean that \((y_i-\overline{Y})^2 = (\widehat{y_i} - \mathbb{E}(\widehat{Y}))^2 + \varepsilon_i^2\). The equation is only true because we sum over all the observations

Deviance residuals

GLM models offer the deviance residual. The idea is to break the deviance of the model into terms whose sum of squares equals the deviance. For logistic regression… we already have the necessary terms… we just need their square-roots:

\[ d_i = \begin{cases} \sqrt{-2\log(p_i)}&\textrm{if }Y_i=1\\ -\sqrt{-2\log(1-p_i)}&\textrm{if }Y_i = 0 \end{cases} \]

Make sure you understand how what is expressed here is consistent with the earlier equation (the key is that \(y_i\) can only be 0 or 1).

Exercise: Rebuild the admit ~ gre logistic regression model using glm() and, using the fitted.values(model) calculate the deviance residuals and plot them. Verify that the sum of their squares is the deviance for the model.

df <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
model.gre<- #Your code here
deviances<- #Your code here
plot(deviances)