The goal of this series of posts is to show that practically all the features of logistic regression can be explained in terms of information theory.
Let’s start with definitions. A binary logistic regression model with a dichotomous outcome Y is defined by the equation
ln[P(Y = 1)/P(Y = 0)] = ln[p/(1-p)] = b0 + b1X1 + b2X2 + … + bkXk (1)
where p is the predicted probability of EVENT (Y = 1), 1 – p is the probability of NONEVENT (Y = 0), X = (X1, X2, …, Xk) is a vector of predictors or explanatory variables, and b = (b0, b1, b2, …, bk) are the corresponding coefficients. For comparison, we define the linear probability model as
p = a0 + a1X1 + a2X2 + … + akXk (2)
Note that we use different notation for the coefficients in (1) and (2). The coefficients (ai) are estimated by the Ordinary Least Squares (OLS) method, and the coefficients (bi) by the Maximum Likelihood (ML) method.
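To make the two estimation methods concrete, here is a minimal sketch that fits both models to the same data with a single predictor. The toy data and the function names are hypothetical and chosen only for illustration. OLS is computed from its closed-form formulas, and ML by a plain Newton-Raphson iteration, which is one common way to maximize the likelihood but not the only one.

```python
import math

# Hypothetical toy data (not from the post): one predictor x, binary outcome y.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 0, 1, 1]


def fit_linear_probability(x, y):
    """OLS fit of p = a0 + a1*x, equation (2), via the closed-form formulas."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a1 = sxy / sxx
    return mean_y - a1 * mean_x, a1


def score(b0, b1, x, y):
    """Gradient of the log likelihood of model (1) with respect to (b0, b1)."""
    resid = [yi - 1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi, yi in zip(x, y)]
    return sum(resid), sum(r * xi for r, xi in zip(resid, x))


def fit_logistic(x, y, iterations=25):
    """ML fit of ln[p/(1-p)] = b0 + b1*x, equation (1), by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        g0, g1 = score(b0, b1, x, y)
        # Entries of the 2x2 information matrix (negative Hessian).
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 linear system H * delta = g.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


a0, a1 = fit_linear_probability(x, y)
b0, b1 = fit_logistic(x, y)
print(f"linear probability model: a0 = {a0:.3f}, a1 = {a1:.3f} (probability units)")
print(f"logistic model:           b0 = {b0:.3f}, b1 = {b1:.3f} (log odds units)")
```

The slope a1 is expressed in units of probability per unit of x, while b1 is expressed in units of log odds per unit of x, which is why the two sets of coefficients are not directly comparable.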
As Paul Allison (2012), p. 26, puts it:
“For the linear probability model, a coefficient of 0.25 tells you that the predicted probability p of the event increases by 0.25 for every 1-unit increase in the explanatory variable. By contrast, a logit coefficient of 0.25 tells you that the log odds increases by 0.25 for every 1-unit increase in the explanatory variable. But who knows what a 0.25 increase in log odds means?”
Paul von Hippel in his post of July 5, 2015 supports this statement in even stronger terms:
“In the linear model, if a1 is (say) .05, that means that a one-unit increase in X1 is associated with a 5 percentage point increase in the probability that Y is one. Just about everyone has some understanding of what it would mean to increase by 5 percentage points their probability of, say, voting, or dying, or becoming obese. The logistic model is less interpretable. In the logistic model, if b1 is .05, that means that a one-unit increase in X1 is associated with a .05 increase in the log odds that Y is 1. And what does that mean? I’ve never met anyone with any intuition for log odds” (in both citations bold print is mine).
Some researchers question the interpretability and intuitiveness not only of log odds but even of odds themselves (see the discussions in Allison (2015) and (2017), and von Hippel (2015)). Hosmer and Lemeshow (1989), p. 42, temper this discussion by saying “The interpretation given for the odds ratio is based on the fact that in many instances it approximates a quantity called the relative risk.”
Nevertheless, log odds, or logits, do have quite a simple and intuitive interpretation of their own. It is known (see, for example, Rissanen (1989), p. 38) that every random event E has two numbers associated with it: its probability p = P(E) and its information I(E), i.e. the information contained in the message that E occurred. P(E) and I(E) are related to each other by the equation
I(E) = -ln P(E) (3)
where ln is the natural logarithm, so information is measured in nats, not in bits (1 nat ≈ 1.44 bits). It may be argued that using I(E) instead of P(E) provides a more natural scale, especially in the case of rare events. For example, the three values P(E) = 0.02, 0.01, and 0.001 are almost equidistant on the probability scale, but the corresponding values of I(E) are 3.91, 4.61, and 6.91 nats, which better reflect how much rarer the third event is. We see that rarer events provide more information when observed. Note that (3) can also be associated with the minimal code length needed to convey the information content. So information is as fundamental a concept as probability.
Now, let y1, y2, ..., yi, ..., yn be independent observations of the variable Y, let n be the total sample size, and let p = P(E) = P(Y = 1) and 1 – p = P(NE) = P(Y = 0). In this notation, we have the following chain of equalities:
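The numbers quoted for equation (3) can be reproduced with a few lines of Python. This is only an illustrative sketch; the function names are hypothetical.

```python
import math


def information_nats(p):
    """Information I(E) = -ln P(E) of an event with probability p, in nats."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log(p)


def nats_to_bits(nats):
    """Convert nats to bits: 1 nat = 1/ln(2) bits, about 1.44 bits."""
    return nats / math.log(2)


for p in (0.02, 0.01, 0.001):
    i_nats = information_nats(p)
    print(f"P(E) = {p:<6} I(E) = {i_nats:.2f} nats = {nats_to_bits(i_nats):.2f} bits")
```

Halving the probability from 0.02 to 0.01 adds ln 2 ≈ 0.69 nats (exactly one bit), while dividing it by ten from 0.01 to 0.001 adds ln 10 ≈ 2.30 nats.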
ln[p/(1-p)] = ln[P(E)/(1-P(E))] = ln[P(E)] - ln[1-P(E)] = I(NE) - I(E) (4)
Thus, the logit is nothing but the discriminative information difference between event E and nonevent NE, which we denote DID(E, NE). From equations (3) and (4) we see that
(a) the less probable event E is, the more information it contains, i.e. the more nats are needed to encode it, and vice versa;
(b) as the value of p goes from 0 to 1, the corresponding value of DID (log odds) goes symmetrically from minus infinity to plus infinity;
(c) if p = 0.5, then DID(E, NE) = 0, i.e. E and NE cannot be discriminated in terms of information content;
(d) in general the DID (log odds) value for probability p is minus the DID (log odds) value for probability 1-p.
If anybody is uncomfortable with possible negative values of DID (when p < 0.5), the absolute value of DID can be used, although in this case the information about the sign of DID is lost.
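Equation (4) and properties (b) to (d) are easy to check numerically. The sketch below computes the DID directly as I(NE) - I(E); the function names are hypothetical and used only for illustration.

```python
import math


def information(p):
    """I(E) = -ln P(E), in nats, as in equation (3)."""
    return -math.log(p)


def did(p):
    """DID(E, NE) = I(NE) - I(E), which equals the logit ln[p/(1-p)]."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return information(1.0 - p) - information(p)


for p in (0.01, 0.1, 0.5, 0.9, 0.99):
    print(f"p = {p:<5} DID = {did(p):+.3f} nats   logit = {math.log(p / (1 - p)):+.3f}")
```

The two printed columns agree, the value at p = 0.5 is zero, and the values for p and 1-p differ only in sign.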
DID(E, NE) can serve as an information measure of the ability to discriminate between E and NE. In this regard, it is interesting to compare logistic regression with linear discriminant analysis (LDA). These techniques are defined by the same mathematical equations, but they work under different assumptions. According to Hosmer and Lemeshow (1989), pp. 34-36, and Menard (2010), pp. 319-320, logistic regression is preferable when the basic assumption of LDA, multivariate normality of the predictors, is not satisfied. And yet in practice, logistic regression and LDA often give similar results. See, for example, Hastie, Tibshirani, and Friedman (2013), pp. 121-122:
“It is our experience that the models give very similar results, even when LDA is used inappropriately, such as with qualitative predictors…We have seen that linear discriminant analysis and logistic regression both estimate linear decision boundaries in similar but slightly different ways”
After estimating the coefficients (b0, b1, …, bk) in (1) by maximum likelihood, we have to decide whether the current case is an EVENT (Y = 1) or a NONEVENT (Y = 0). For this purpose, we can choose a decision cut-off either in terms of predicted probability or in terms of log odds (DID). In a logistic model, units of log odds or DID (i.e. nats) are converted back to units of probability by the formula
p = exp(ln[p/(1-p)]) / (1 + exp(ln[p/(1-p)])) =
1 / (1 + exp(-ln[p/(1-p)])) = 1 / (1 + exp(-(b0 + b1X1 + b2X2 + … + bkXk))) (5)
where exp is the exponential function. Although the predicted probabilities may be used for the decision, it seems more convenient and natural to decide in terms of log odds, i.e. information. According to Allison (2017), in many applications researchers are not really interested in predicted probabilities. In the same article Allison adds emphatically: “… there’s nothing sacred about probabilities. An odds is just as legitimate a measure of the chance that an event will occur as a probability. And with a little training and experience, I believe that most people can get comfortable with odds”. We can extend this statement: there is nothing sacred about either probabilities or odds. Log odds, i.e. the discriminative information difference (DID), are as legitimate as probabilities and odds. Moreover, in terms of the logit (rather than the probability), finding a decision threshold becomes a linear problem: choose a cut-off C such that
Declare EVENT if ln[p/(1-p)] = b0 + b1X1 + b2X2 + … + bkXk > C (6)
Declare NONEVENT if ln[p/(1-p)] = b0 + b1X1 + b2X2 + … + bkXk <= C
Of course, a cut-off point depends on the relative consequences of a false positive and a false negative.
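The conversion (5) and the decision rule (6) can be sketched as follows. The coefficients, the predictor values, and the probability cut-off of 0.7 are hypothetical numbers chosen only for illustration. Because the conversion from log odds to probability is strictly increasing, a cut-off C on the logit scale is equivalent to the cut-off 1 / (1 + exp(-C)) on the probability scale.

```python
import math


def logit(p):
    """Log odds (DID) in nats for a probability p."""
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Equation (5): convert log odds z back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def declare(z, cutoff):
    """Decision rule (6), applied on the log odds scale."""
    return "EVENT" if z > cutoff else "NONEVENT"


# Hypothetical coefficients of a model with a single predictor.
b0, b1 = -3.0, 0.5

# Hypothetical probability cut-off, expressed once on each scale.
p_cut = 0.7
C = logit(p_cut)

decisions = []
for x in (2.0, 6.0, 8.0, 10.0):
    z = b0 + b1 * x          # linear predictor, in nats
    p = inv_logit(z)         # the same quantity on the probability scale
    decisions.append((declare(z, C), p > p_cut))
    print(f"x = {x:>4}: log odds = {z:+.2f}, p = {p:.3f}, decision = {declare(z, C)}")
```

On the logit scale the rule is a single linear inequality in the predictors, and every decision matches the one obtained by comparing p with the probability cut-off.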
Thus, in this post we have shown that the logit is easily interpreted in terms of information as the discriminative information difference (DID). Since the logit is a linear function of the coefficients (bi), it is natural to expect a similar interpretation for the coefficients themselves. According to Hosmer and Lemeshow (1989), p. 39: “Proper interpretation of the coefficient in a logistic regression model depends on being able to place meaning on the difference between two logits”. This is easy to show when there is a single independent variable of continuous or binary type. First, consider the case of a single binary predictor, say X, with values ‘1’ for ‘female’ and ‘0’ for ‘male’. Then b can be interpreted as the difference in DID (the ability to discriminate between E and NE) between ‘female’ and ‘male’. In the same way, with a single continuous predictor, say age in years, the coefficient b can be interpreted as the change in DID per year of age. In the general case of multiple logistic regression this interpretation is not as straightforward, and the estimated coefficients must be interpreted with care.
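For the single binary predictor, the interpretation can be illustrated with a small worked example. The counts below are hypothetical. With one binary predictor the model fits the two groups exactly, so the maximum likelihood estimates coincide with the sample logits: b0 is the logit for ‘male’ and b1 is the difference between the two logits.

```python
import math

# Hypothetical 2x2 table of counts (not from the post).
counts = {
    "female": {"event": 30, "nonevent": 70},   # X = 1
    "male":   {"event": 20, "nonevent": 80},   # X = 0
}


def sample_logit(group):
    """Sample log odds (DID) of EVENT versus NONEVENT within a group, in nats."""
    return math.log(group["event"] / group["nonevent"])


did_female = sample_logit(counts["female"])
did_male = sample_logit(counts["male"])

b0 = did_male                  # logit when X = 0
b1 = did_female - did_male     # difference between the two logits
odds_ratio = math.exp(b1)

print(f"DID(female) = {did_female:+.3f} nats, DID(male) = {did_male:+.3f} nats")
print(f"b0 = {b0:+.3f}, b1 = {b1:+.3f} nats, odds ratio = {odds_ratio:.3f}")
```

Here b1 = ln(12/7), about 0.54 nats: the message "EVENT rather than NONEVENT" is discriminated about 0.54 nats more strongly for ‘female’ than for ‘male’.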
In our next blog posts, it will be shown that the main estimation method for logistic regression, Maximum Likelihood (ML), has a simple and natural interpretation in terms of information theory. As a result, R-Squared measures based on the log likelihood (see our post Shtatland (2018)) acquire an information interpretation, and so do some goodness-of-fit tests (for example, the deviance test or the likelihood ratio test).
References
Allison, P.D. (2012) “Logistic Regression Using SAS: Theory and Application”. Cary, NC: SAS Institute.
Allison, P.D. (2014) “Measures of fit for logistic regression”. https://support.sas.com/resources/papers/proceedings14/1485-2014.pdf
Allison, P.D. (2017) “In Defense of Logit – Part 1”. https://statisticalhorizons.com/in-defense-of-logit-part-1
Hastie, T., Tibshirani, R. and Friedman, J. (2013) “The Elements of Statistical Learning: Data Mining, Inference, and Prediction”. Springer.
Hosmer, D.W. and Lemeshow, S. (1989) “Applied Logistic Regression”. New York: Wiley.
Shtatland, E.S. (2018) “Do we really need more than one R-Squared in logistic regression?” http://statisticalmiscellany.blogspot.com/
Von Hippel, P. (2015) “Linear vs. Logistic Probability Models: Which is Better, and When?” http://statisticalhorizons.com/linear-vs-logistic