Saturday, December 2, 2017

And yet, once again: What’s the best R-squared for logistic regression?

This title paraphrases the titles of two earlier publications: “What’s the best R-squared for logistic regression?” by Allison (2013) at http://www.statisticalhorizons.com/r2logistic and “One more time about R-Squared measures of fit in logistic regression” by Shtatland et al. (2002) at http://www.lexjansen.com/nesug/nesug02/st/st004.pdf.

A comprehensive and almost exhaustive review of this topic can be found in Menard (2010), Chapter 3, pp. 48–62, and we strongly recommend this publication.
For simplicity, we will discuss only the case of binary logistic regression with single-trial syntax, though a number of our results can be generalized to multinomial logistic regression. Following Menard (2000), we now introduce the minimal necessary notation:
Y is the dependent variable taking values coded “1” in the case of EVENT or “0” in the case of NOT-EVENT;
n is the total sample size, i.e., the number of observations y1, y2, ..., yn of the dependent variable;
L(0) is the maximized likelihood of the model containing the intercept only, the so-called “null model”;
L(M) is the maximized likelihood of the current model M containing all predictors available;
L(S) is the maximized likelihood of the saturated model that contains as many predictors as observations;
ŷi is the predicted value of yi obtained from the model; ŷi is actually a continuous probability with a value between 0 and 1, unlike yi, which takes only the values “1” or “0”;
ȳ is the arithmetic mean of the dependent variable, also a continuous probability; ȳ is also known as the base rate, prevalence, or marginal proportion.
There are many candidates for an R-Squared measure in logistic regression (see Menard (2010) for a review). We are interested first and foremost in just two of them, because they are the most popular among researchers and the most frequently reported in statistical software. They are R2CS and R2MF, which we define and discuss below. Nevertheless, we will start with another R-Squared, defined in the notation above as follows:
R2O = 1 – ∑(yi – ŷi)² / ∑(yi – ȳ)²                                (1)
since R2O is frequently used as a standard of comparison for other R-Squared measures in logistic regression. Note that R2O looks like the ordinary least squares R2OLS in linear regression, except that in logistic regression the ŷi are calculated by maximizing the likelihood function, not by minimizing a sum of squares.
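
As an illustration, here is a minimal sketch of formula (1) in Python, assuming the numpy and statsmodels packages and using simulated (hypothetical) data in place of a real study:

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical example data: one predictor, binary outcome coded 0/1
    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.2 * x))))

    # Fit the logistic regression by maximum likelihood
    model = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
    y_hat = model.predict()          # fitted probabilities, the ŷi

    # Formula (1): one minus the ratio of residual to total sum of squares
    r2_o = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    print(r2_o)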
Now, let us introduce R2CS and R2MF, using notations above:
R2CS = 1 – exp{–2[lnL(M) – lnL(0)] / n}                 (2)
R2MF = 1 – lnL(M) / lnL(0)                                     (3)
R2CS is usually attributed to Cox and Snell (1989), which explains the notation “CS”. R2MF is usually attributed to McFadden (1974), hence the notation “MF”.
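
Both measures are easy to compute directly from the two maximized log-likelihoods. Continuing the hedged Python sketch above (statsmodels exposes lnL(M) and lnL(0) as the llf and llnull attributes of a fitted model):

    n = len(y)
    ll_m = model.llf       # lnL(M), log-likelihood of the current model
    ll_0 = model.llnull    # lnL(0), log-likelihood of the intercept-only model

    r2_cs = 1 - np.exp(-2 * (ll_m - ll_0) / n)   # formula (2)
    r2_mf = 1 - ll_m / ll_0                      # formula (3)
    print(r2_cs, r2_mf)
    print(model.prsquared)   # statsmodels' built-in pseudo R-squared is McFadden's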

Note that R2MF is a particular case of the more general, entropy-based R-Squared (R2Ent) given by the formula
R2Ent = [lnL(M) – lnL(0)] / [lnL(S) – lnL(0)]          (4)
It is known that for binary logistic regression with single-trial syntax lnL(S) = 0, so R2Ent reduces to R2MF. Note that R-Squared (4) can also be interpreted as the deviance-based measure R2DEV (see Hosmer and Lemeshow (1989, pp. 147–148) and Cameron and Windmeijer (1997)). Thus, in our case we have R2MF = R2Ent = R2DEV.
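For the reader’s convenience, here is the standard one-line argument behind lnL(S) = 0: the saturated model fits each observation perfectly, so ŷi = yi, and therefore

    \ln L(S) = \sum_{i=1}^{n} \left[ y_i \ln \hat{y}_i + (1 - y_i) \ln(1 - \hat{y}_i) \right]
             = \sum_{i=1}^{n} \left[ y_i \ln y_i + (1 - y_i) \ln(1 - y_i) \right] = 0,

since, with the usual convention 0·ln 0 = 0, every term is either 1·ln 1 = 0 or 0·ln 0 = 0.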
Desirable properties of the R-Squared measures mentioned above include an interpretation in terms of the information gain (IG) achieved by the model with the available predictors relative to the null model (see Kent (1983), Cameron and Windmeijer (1997), and Shtatland et al. (2000)).


As far as we know, all publications on R-Squared measures in logistic regression that refer to R2CS and R2MF treat them as independent measures. But this is not the case: there exists a very interesting and very important functional relationship between them. It was shown in our SUGI (2000) and NESUG (2002) presentations that

R2CS = 1 – exp(-R2MF * T)                                      (5)

where

T = –2lnL(0) / n                                                        (6)

Since the maximized log-likelihood for the null model lnL(0) can be written as

lnL(0) = n[ȳ*lnȳ + (1 – ȳ)*ln(1 – ȳ)]
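
This closed form follows because the maximum likelihood estimate of the event probability under the intercept-only model is the base rate ȳ, the same for every observation, so that

    \ln L(0) = \sum_{i=1}^{n} \left[ y_i \ln \bar{y} + (1 - y_i) \ln(1 - \bar{y}) \right]
             = n \left[ \bar{y} \ln \bar{y} + (1 - \bar{y}) \ln(1 - \bar{y}) \right],

using ∑yi = nȳ.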


formula (6) can be rewritten as follows

T = -2[ȳ*lnȳ + (1 – ȳ)*ln(1 – ȳ)]                              (7)


This means that the quantity T is nothing but the doubled entropy of a Bernoulli distribution with probability ȳ. Note that in our previous publications we used the notation R2SAS instead of R2CS, because at that time this was the only R-Squared measure reported in SAS (not to mention the unfortunate so-called Nagelkerke adjustment). The properties of the entropy function predetermine the properties of our key quantity T; they are discussed below.
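
Relationship (5) is exact, and it is easy to verify numerically. Continuing the hypothetical Python sketch above (y, r2_cs, and r2_mf were computed earlier):

    # T computed from the base rate via formula (7)
    y_bar = y.mean()
    t = -2 * (y_bar * np.log(y_bar) + (1 - y_bar) * np.log(1 - y_bar))

    # Identity (5): R2CS = 1 - exp(-R2MF * T), up to floating-point error
    print(np.isclose(r2_cs, 1 - np.exp(-r2_mf * t)))   # True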


As a function of ȳ, T is defined on the interval [0, 1] and is symmetric with respect to ȳ = 1/2. It equals 0 at the endpoints ȳ = 0 and ȳ = 1, increases on [0, 1/2], attains its maximum 2ln2 at ȳ = 1/2, and then decreases back to 0. In Shtatland et al. (2002), it was shown that if 1 < T <= 2ln2, which corresponds to 0.2 < ȳ < 0.8, then R2CS is slightly greater than R2MF. Otherwise, R2CS is smaller than R2MF, and very substantially so when ȳ is close to either 0 or 1. It is interesting that exactly on this interval the linear probability model provides a good approximation to logistic regression (see Allison, 2017). The properties of T, R2MF, and R2CS mentioned above provide a theoretical justification of the empirical comparisons between R2CS and R2MF when ȳ takes values near 0 (or 1) versus when ȳ is close to 0.5 (see Allison (2013) and numerous examples in Menard (2000), Mittlböck and Schemper (1996), and Shtatland et al. (2002)). In addition, formula (5) explains directly why the maximal value of R2CS is 0.75 and not 1, as for R2MF: even a perfect model with R2MF = 1 at the most favorable base rate ȳ = 0.5, where T = 2ln2, yields R2CS = 1 – exp(–2ln2) = 1 – 1/4 = 0.75.
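
These properties are easy to see numerically. The short Python sketch below tabulates T and the ceiling of R2CS, i.e. the value 1 – exp(–T) attained by a hypothetical perfect model with R2MF = 1, for several base rates:

    import numpy as np

    for y_bar in (0.01, 0.05, 0.2, 0.5):
        t = -2 * (y_bar * np.log(y_bar) + (1 - y_bar) * np.log(1 - y_bar))
        print(f"ybar = {y_bar:4.2f}   T = {t:5.3f}   max R2CS = {1 - np.exp(-t):5.3f}")

At ȳ = 0.5 this reproduces the ceiling of 0.75 discussed above, while at ȳ = 0.01 the ceiling drops to roughly 0.11.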


Entropy and information are so closely related that they are sometimes called “two sides of the same coin”. That is why the “entropy loss” (see Efron, 1978) is nothing more than the equivalent “information gain” (see Kent (1983) and Shtatland et al. (2002)).

Note also that if ȳ = 0.5, then we are in the state of maximal uncertainty, i.e. maximal possible entropy. Adding relevant predictors can only increase information, i.e. decrease entropy. Thus, the notions of entropy and information are fundamental for understanding the model building process: any step-by-step construction of better and better models is accompanied by an entropy-decrease / information-gain process.


There exists a physical (thermodynamic) analogy to this process. A system in thermodynamic equilibrium is characterized by maximum entropy. To move the system away from equilibrium, it is necessary to spend some energy, i.e. to do some work. In our context, this means that additional predictors, i.e. explanatory variables, have to be introduced into the model.

Summarizing, we can conclude that R2MF is certainly preferable to R2CS in at least two respects. First, R2MF has a desirable, intuitively reasonable, immediate interpretation in terms of information gain or entropy loss; for R2CS this interpretation is not as direct. Second, according to Menard (2000, p. 24), R2MF stands out for its relative independence from the base rate compared to other R-Squared measures, whereas R2CS is highly correlated with the base rate (see Menard, 2000, p. 23). From formulas (5) and (7), it can be seen that R2CS depends on the base rate ȳ essentially through the quantity T. As a result, R2CS demonstrates a rather counterintuitive and odd trait: it increases as the base rate (more exactly, ȳ or 1 – ȳ, whichever is smaller) increases from 0 to 0.5. That is why we recommend using R2MF and not R2CS. All of this explains perfectly well why more and more researchers have joined the “McFadden camp” since 2000. Of course, it does not prevent them from using R2O as a supplemental measure. As for the new, so-called Tjur R2, we will consider this measure elsewhere in the future.


As some researchers still have doubts about whether to use R-Squared measures at all, we would like to cite Menard (2010, p. 52): “If you want R2, why not use R2?”

References:

Allison, P. D. (2013), “What’s the best R-squared for logistic regression?” (http://www.statisticalhorizons.com/r2logistic).
Allison, P. D. (2014), “Measures of fit for logistic regression” (https://support.sas.com/resources/papers/proceedings14/1485-2014.pdf).
Allison, P. D. (2017), “In Defense of Logit – Part 1” (https://statisticalhorizons.com/in-defense-of-logit-part-1).
Cameron, C. A. and Windmeijer, F. A. G. (1997), “An R-squared measure of goodness of fit for some common nonlinear regression models”, Journal of Econometrics, 77, 329–342.
Cox, D. R. and Snell, E. J. (1989), “Analysis of Binary Data”, Second Edition, London: Chapman & Hall.
Efron, B. (1978), “Regression and ANOVA with zero-one data: Measures of residual variation”, Journal of the American Statistical Association, 73, 113–121.
Hosmer, D. W. and Lemeshow, S. (1989), “Applied Logistic Regression”, New York: Wiley.
Kent, J. T. (1983), “Information gain and a general measure of correlation”, Biometrika, 70, 163–173.
McFadden, D. (1974), “Conditional logit analysis of qualitative choice behavior”, pp. 105–142 in P. Zarembka (ed.), Frontiers in Econometrics, New York: Academic Press.
Menard, S. (2000), “Coefficients of determination for multiple logistic regression analysis”, The American Statistician, 54, 17–24.
Menard, S. (2002), “Applied Logistic Regression Analysis”, Second Edition, Sage University Paper Series.
Menard, S. (2010), “Logistic Regression: From Introductory to Advanced Concepts and Applications”, Sage, Chapter 3, pp. 48–62.
Mittlböck, M. and Schemper, M. (1996), “Explained variation for logistic regression”, Statistics in Medicine, 15, 1987–1997.
Shtatland, E. S., Moore, S. and Barton, M. B. (2000), “Why we need R^2 measure of fit (and not only one) in PROC LOGISTIC and PROC GENMOD”, SUGI 2000 Proceedings, Paper 256-25, Cary, NC: SAS Institute Inc.
Shtatland, E. S., Kleinman, K. and Cain, E. M. (2002), “One more time about R^2 measures of fit in logistic regression”, NESUG 2002 Proceedings.