This title is a paraphrase of the titles of two
publications: “What’s the best R-squared for logistic regression?” by Allison
(2013), available at http://www.statisticalhorizons.com/r2logistic,
and “One more time about R-Squared measures of fit in logistic regression” by Shtatland
et al. (2002), available at http://www.lexjansen.com/nesug/nesug02/st/st004.pdf.
A comprehensive and almost exhaustive review of this topic can be found in Menard (2010), Chapter 3, pp.
48 – 62, and we strongly recommend this publication.
For simplicity, we will discuss only the case of binary logistic
regression with single-trial syntax, though a number of our results can be
generalized to the case of multinomial logistic regression. Following Menard
(2000), we now introduce the minimal necessary notation:
Y is the dependent variable taking values coded “1” in the
case of EVENT or “0” in the case of NOT-EVENT;
n is the total sample size, i.e. the number of observations y1, y2, ..., yi, ..., yn
of the dependent variable;
L(0) is the maximized likelihood of the model containing the
intercept only, the so-called “null model”;
L(M) is the maximized likelihood of the current model M containing
all predictors available;
L(S) is the maximized likelihood of the saturated model that
contains as many predictors as observations;
ŷi is the predicted value of yi
obtained from the model; actually ŷi is a continuous probability
with a value between 0 and 1, unlike yi which takes values either
“1” or “0”;
ȳ is the arithmetic mean of the observed values yi of the dependent variable, also a
continuous probability; ȳ is also known as the base rate, prevalence, or marginal
proportion.
There are many candidates for R-Squared measure in logistic
regression (see Menard (2010) for a review). We are interested, first and
foremost, in only two of them because they are the most popular among researchers
and the most frequently reported in statistical software. They are R2CS
and R2MF, which we define and discuss later.
Nevertheless, we will start with another R-Squared, determined in our notations
above as follows:
R2O = 1 – ∑(yi – ŷi)^2 / ∑(yi – ȳ)^2 (1)
since R2O is frequently used as a
standard for comparison between other R-Squared measures in logistic
regression. Note that R2O looks like the ordinary least
squares R2OLS in linear regression, except that the ŷi
are obtained by maximizing the likelihood function of the logistic regression
rather than by minimizing a sum of squares.
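As a minimal sketch of formula (1), the function below computes R2O from observed outcomes and fitted probabilities. The function name r2_o and the data are our own illustrative assumptions; in particular, the fitted probabilities are made up and do not come from an actual maximum likelihood fit.

```python
def r2_o(y, y_hat):
    """R2O from formula (1): 1 - sum((y - y_hat)^2) / sum((y - y_bar)^2)."""
    n = len(y)
    y_bar = sum(y) / n  # base rate
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_hat))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical outcomes and fitted probabilities (illustration only)
y = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [0.8, 0.3, 0.6, 0.9, 0.2, 0.4, 0.7, 0.1]
print(round(r2_o(y, y_hat), 3))  # 0.7
```

If every fitted probability equals the base rate ȳ, as in the null model, the same function returns 0.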
Now, let us introduce R2CS and R2MF,
using the notation above:
R2CS = 1 – exp{–2[lnL(M) – lnL(0)] / n} (2)
R2MF = 1 – lnL(M) / lnL(0) (3)
R2CS is usually attributed to Cox and
Snell (1989), which explains the notation “CS”. R2MF is usually
attributed to McFadden (1974), hence the notation “MF”.
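The two measures can be sketched in a few lines once the maximized log-likelihoods are available. Here we assume the usual likelihood-ratio form of the Cox and Snell measure, 1 – [L(0) / L(M)]^(2/n). The function names and the numeric inputs are hypothetical and serve only as an illustration.

```python
import math

def r2_cs(lnL_M, lnL_0, n):
    """Cox and Snell measure: 1 - (L(0)/L(M))**(2/n), computed on the log scale."""
    return 1.0 - math.exp(2.0 * (lnL_0 - lnL_M) / n)

def r2_mf(lnL_M, lnL_0):
    """McFadden measure: 1 - lnL(M)/lnL(0)."""
    return 1.0 - lnL_M / lnL_0

# Hypothetical log-likelihoods (not taken from any real data set)
n = 100
lnL_0 = -69.31   # intercept-only model
lnL_M = -50.0    # model with predictors
print(round(r2_cs(lnL_M, lnL_0, n), 4), round(r2_mf(lnL_M, lnL_0), 4))
```

Both functions return 0 when the current model does not improve on the null model, i.e. when lnL(M) = lnL(0).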
Note that R2MF is a particular case of a more general, entropy-based R-Squared (R2Ent) given by the formula
R2Ent = [lnL(M) – lnL(0)] / [lnL(S) – lnL(0)] (4)
It is known that for binary logistic regression with
single-trial syntax lnL(S) = 0, so R2Ent reduces to R2MF.
Note that R-Squared (4) can also be interpreted as a deviance-based measure R2DEV
(see Hosmer and Lemeshow (1989, pp. 147 – 148) and Cameron and Windmeijer
(1997)). Thus, in our case we have R2MF = R2Ent
= R2DEV.
Desirable properties of R-Squared measures mentioned above include interpretation in terms of the information gain (IG) when using the model with available predictors in comparison with the null model (see Kent (1983), Cameron and Windmeijer (1997), and Shtatland et al. (2000)).
As far as we know, all publications on R-Squared measures in logistic regression that refer to R2CS and R2MF consider them as independent measures. But this is not the case. There exists a very interesting and very important functional relationship between them. It was shown in our SUGI (2000) and NESUG (2002) presentations that
R2CS = 1 – exp(-R2MF * T) (5)
where
T = - 2lnL(0) / n (6)
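For readers who want to see why (5) holds, the short derivation below may help. It is our own sketch, assuming the usual likelihood-ratio form of the Cox and Snell measure and using only definitions (3) and (6).

```latex
\begin{align*}
R^2_{CS} &= 1 - \left[\frac{L(0)}{L(M)}\right]^{2/n}
          = 1 - \exp\left\{-\frac{2}{n}\bigl[\ln L(M) - \ln L(0)\bigr]\right\} \\
\ln L(M) - \ln L(0) &= -R^2_{MF}\,\ln L(0)
          \qquad \text{(from (3))} \\
-\frac{2}{n}\bigl[\ln L(M) - \ln L(0)\bigr]
         &= R^2_{MF}\cdot\frac{2\ln L(0)}{n} = -R^2_{MF}\,T
          \qquad \text{(from (6))} \\
R^2_{CS} &= 1 - \exp\left(-R^2_{MF}\,T\right)
\end{align*}
```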
Since the maximized log-likelihood for the null model lnL(0)
can be written as
lnL(0) = n[ȳ*lnȳ + (1 – ȳ)*ln(1 – ȳ)]
formula (6) can be rewritten as follows
T = -2[ȳ*lnȳ + (1 – ȳ)*ln(1 – ȳ)] (7)
It means that the quantity T is nothing but twice the entropy of the Bernoulli distribution with probability ȳ. Note that in our previous publications we used another notation for R2CS, namely R2SAS, because at that time SAS offered only this R-Squared measure (not to mention the unfortunate so-called Nagelkerke adjustment). The properties of entropy determine the properties of our key quantity T, which are discussed below.
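The equivalence of formulas (6) and (7) can be checked numerically. In the sketch below, the function names and the sample are hypothetical; the null log-likelihood is computed directly as a sum over observations, with every fitted probability equal to the base rate.

```python
import math

def null_loglik(y):
    """lnL(0): log-likelihood when every fitted probability equals the base rate.
    Assumes the sample contains both events and non-events."""
    y_bar = sum(y) / len(y)
    return sum(math.log(y_bar) if yi == 1 else math.log(1.0 - y_bar) for yi in y)

def t_from_base_rate(y_bar):
    """T from formula (7): twice the Bernoulli entropy (natural logarithms)."""
    if y_bar in (0.0, 1.0):
        return 0.0  # limiting convention 0 * ln(0) = 0
    return -2.0 * (y_bar * math.log(y_bar) + (1.0 - y_bar) * math.log(1.0 - y_bar))

y = [1] * 30 + [0] * 70                         # hypothetical sample, base rate 0.3
n = len(y)
t_from_lik = -2.0 * null_loglik(y) / n          # formula (6)
t_from_entropy = t_from_base_rate(sum(y) / n)   # formula (7)
print(t_from_lik, t_from_entropy)
```

The two printed values agree up to floating-point rounding, since (7) is just (6) with the closed form of lnL(0) substituted.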
As a function of ȳ, T is defined on the interval [0, 1] and is symmetric
with respect to ȳ = 1/2. It is equal to 0 at the ends ȳ = 0 and ȳ = 1,
increases from 0 to its maximum 2ln2 on [0, 1/2], and then
decreases back to 0 on [1/2, 1]. In Shtatland et al. (2002), it was shown that if 1 < T <=
2ln2, which corresponds approximately to 0.2 < ȳ < 0.8, then R2CS
is slightly greater than R2MF. Otherwise, R2CS
is smaller than R2MF, and very substantially so when ȳ is
close to either 0 or 1. It is interesting that exactly in the interval 0.2 < ȳ < 0.8, the linear
probability model provides a good approximation to logistic regression (see
Allison, 2017). The properties of T, R2MF and R2CS
mentioned above provide us with theoretical justification of empirical results
on the comparison between R2CS and R2MF
when ȳ takes values around 0 (or 1) vs. when ȳ is close to 0.5 (see Allison (2013)
and numerous examples in Menard (2000), Mittlböck and Schemper (1996) and
Shtatland et al. (2002)). In addition, formula (5) explains directly why the
maximal value of R2CS is equal to 0.75 and not to 1 like
R2MF: since R2MF <= 1 and T <= 2ln2, we have R2CS <= 1 – exp(–2ln2) = 0.75.
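The shape of T and the upper bound of R2CS can be illustrated with a short sketch based on formulas (5) and (7). The function names t_of and r2_cs_from_mf are our own, chosen for illustration.

```python
import math

def t_of(y_bar):
    """T from formula (7) for a base rate strictly between 0 and 1."""
    return -2.0 * (y_bar * math.log(y_bar) + (1.0 - y_bar) * math.log(1.0 - y_bar))

def r2_cs_from_mf(r2_mf, y_bar):
    """Formula (5): R2CS as a function of R2MF and the base rate."""
    return 1.0 - math.exp(-r2_mf * t_of(y_bar))

# T is symmetric around 0.5 and peaks there at 2*ln(2)
for y_bar in (0.05, 0.2, 0.5, 0.8, 0.95):
    print(y_bar, round(t_of(y_bar), 3))

# Largest possible R2CS: perfect fit (R2MF = 1) at base rate 0.5
print(r2_cs_from_mf(1.0, 0.5))  # close to 0.75, up to floating-point rounding
```

In this sketch T crosses the value 1 very close to ȳ = 0.2 (and, by symmetry, ȳ = 0.8), which matches the interval quoted above.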
Entropy and information are so closely related that they are sometimes called “two sides of the same coin”. That is why the “entropy loss” (see Efron, 1978) is nothing more than the equivalent “information gain” (see Kent, 1983; Shtatland et al., 2002).
Note also that if ȳ = 0.5, then we are in the state of maximal uncertainty or maximal possible entropy. Adding some relevant predictors can only increase information, or decrease entropy. Thus, the notions of entropy and information are fundamental for understanding the model building process. Any step-by-step construction of better and better models is always accompanied by an entropy decrease / information gain process.
There exists a physical (thermodynamic) analogy of this process. A system in thermodynamic equilibrium is characterized by maximum entropy. To move the system away from equilibrium, it is necessary to spend some energy, i.e. to do some work. In our context, it means that some additional predictors, i.e. explanatory variables, have to be introduced into the model.
Summarizing, we can conclude that R2MF is clearly preferable to R2CS in at least two respects. First, R2MF has a desirable, intuitively reasonable, and immediate interpretation in terms of information gain or entropy loss. The corresponding interpretation for R2CS is not as direct. Second, according to Menard (2000, p. 24), R2MF stands out for its relative independence from the base rate compared to other R-Squared measures. At the same time, R2CS is highly correlated with ȳ (see Menard, 2000, p. 23). From formulas (5) and (7), it can be seen that R2CS depends on the base rate ȳ through the quantity T. As a result, R2CS demonstrates a rather counterintuitive and odd trait: for a fixed R2MF, it increases as the base rate (more exactly, ȳ or 1 – ȳ, whichever is smaller) increases from 0 to 0.5. That is why we recommend using R2MF and not R2CS. All of this explains why more and more researchers have joined the “McFadden camp” since 2000. Of course, it does not prevent them from using R2O as a supplemental measure. As to the newer, so-called Tjur R2, we will consider this measure elsewhere in the future.
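The dependence of R2CS on the base rate can be made concrete with a small sketch that holds R2MF fixed and varies ȳ, using formulas (5) and (7). The fixed value 0.3 and the grid of base rates are hypothetical choices for illustration.

```python
import math

def r2_cs_from_mf(r2_mf, y_bar):
    """Formula (5) with T taken from formula (7); y_bar strictly between 0 and 1."""
    t = -2.0 * (y_bar * math.log(y_bar) + (1.0 - y_bar) * math.log(1.0 - y_bar))
    return 1.0 - math.exp(-r2_mf * t)

r2_mf = 0.3  # hypothetical McFadden value, held fixed
base_rates = [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5]
r2_cs_values = [r2_cs_from_mf(r2_mf, p) for p in base_rates]

# Same information gain relative to the null model, very different R2CS
for p, v in zip(base_rates, r2_cs_values):
    print(f"base rate {p:.2f}: R2CS = {v:.3f}")
```

Under these assumptions the printed R2CS values grow steadily as the base rate moves from 0.01 toward 0.5, even though R2MF never changes.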
For researchers who are still in doubt about whether to use R-Squared
measures at all, we would like to cite Menard (2010, p. 52): “If you want R2,
why not use R2?”
References:
Allison, P. D. (2013) “What’s the best R-squared for logistic
regression?”, http://www.statisticalhorizons.com/r2logistic.
Allison, P. D. (2014) “Measures of fit for logistic regression”,
https://support.sas.com/resources/papers/proceedings14/1485-2014.pdf.
Allison, P. D. (2017) “In Defense of Logit – Part 1”, https://statisticalhorizons.com/in-defense-of-logit-part-1.
Cameron, A. C. and Windmeijer, F. A. G. (1997) “An
R-squared measure of goodness of fit for some common nonlinear regression models”,
Journal of Econometrics, Vol. 77, No. 2, pp. 329 – 342.
Cox, D. R. and Snell, E. J. (1989) “Analysis of binary data”,
Second Edition, Chapman & Hall, London.
Efron, B. (1978), “Regression and ANOVA with zero-one data:
Measures of residual variation”, Journal of the American Statistical
Association, 73, 113 – 121.
Hosmer, D. W. and Lemeshow, S. (1989), “Applied logistic regression”,
New York: Wiley.
Kent, J. T. (1983), “Information gain and a general measure
of correlation”, Biometrika, 70, 163 – 173.
Menard, S. (2000) “Coefficients of determination for
multiple logistic regression analysis”, The American Statistician, 54, 17 – 24.
Menard, S. (2002), “Applied logistic regression analysis”,
Sage University Paper Series (Second edition).
Menard, S. (2010), “Logistic regression: From introductory
to advanced concepts and applications”, Sage University Paper Series, Chapter
3, pp. 48 – 62.
McFadden, D. (1974), “Conditional logit analysis of
qualitative choice behavior”, pp. 105 – 142 in Zarembka (ed.), Frontiers in
Econometrics. Academic Press.
Mittlböck, M. and Schemper, M. (1996), “Explained variation
in logistic regression”, Statistics in Medicine, 15, 1987 – 1997.
Shtatland, E. S., Moore, S. and Barton, M. B. (2000) “Why we
need R^2 measure of fit (and not only one) in PROC LOGISTIC and PROC GENMOD”,
SUGI 2000 Proceedings, Paper 256 - 25, Cary, SAS Institute Inc.
Shtatland, E. S., Kleinman, K. and Cain, E. M. (2002) “One
more time about R^2 measures of fit in logistic regression”, NESUG 2002
Proceedings.