Saturday, December 2, 2017

And yet, once again: What’s the best R-squared for logistic regression?

This title paraphrases the titles of two earlier publications: “What’s the best R-squared for logistic regression?” by Allison (2013) at http://www.statisticalhorizons.com/r2logistic and “One more time about R-Squared measures of fit in logistic regression” by Shtatland et al. (2002) at http://www.lexjansen.com/nesug/nesug02/st/st004.pdf.

A comprehensive and almost exhaustive review of this topic can be found in Menard (2010), Chapter 3, pp. 48–62, and we strongly recommend this publication.
For simplicity, we will discuss only the case of binary logistic regression with single-trial syntax, though a number of our results can be generalized to multinomial logistic regression. Following Menard (2000), we now introduce the minimal necessary notation:
Y is the dependent variable taking values coded “1” in the case of EVENT or “0” in the case of NOT-EVENT;
n is the total sample size, i.e., the number of observations y1, y2, ..., yn of the dependent variable;
L(0) is the maximized likelihood of the model containing the intercept only, the so-called “null model”;
L(M) is the maximized likelihood of the current model M containing all predictors available;
L(S) is the maximized likelihood of the saturated model that contains as many predictors as observations;
ŷi is the predicted value of yi obtained from the model; ŷi is actually a continuous probability with a value between 0 and 1, unlike yi, which takes only the values “1” or “0”;
ȳ is the arithmetic mean of the dependent variable, also a continuous probability; ȳ is also known as the base rate, prevalence, or marginal proportion.
There are many candidates for an R-Squared measure in logistic regression (see Menard (2010) for a review). We are interested first and foremost in just two of them, because they are the most popular among researchers and the most frequently reported in statistical software. They are R2CS and R2MF, which we define and discuss below. Nevertheless, we will start with another R-Squared, defined in the notation above as follows:
R2O = 1 – ∑(yi – ŷi)² / ∑(yi – ȳ)²                                (1)
since R2O is frequently used as a standard of comparison for other R-Squared measures in logistic regression. Note that R2O looks like the ordinary least squares R2OLS in linear regression, except that in logistic regression the ŷi are calculated by maximizing the likelihood function, not by minimizing a sum of squares.
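
As an illustration, here is a minimal sketch of formula (1) in Python, assuming the numpy and statsmodels packages and using simulated (hypothetical) data in place of a real study:

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical example data: one predictor, binary outcome coded 0/1
    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.2 * x))))

    # Fit the logistic regression by maximum likelihood
    model = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
    y_hat = model.predict()          # fitted probabilities, the ŷi

    # Formula (1): one minus the ratio of residual to total sum of squares
    r2_o = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    print(r2_o)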
Now, let us introduce R2CS and R2MF, using notations above:
R2CS = 1 – exp{–2[lnL(M) – lnL(0)] / n}                 (2)
R2MF = 1 – lnL(M) / lnL(0)                                     (3)
R2CS is usually attributed to Cox and Snell (1989), which explains the notation “CS”. R2MF is usually attributed to McFadden (1974), hence the notation “MF”.
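
Both measures are easy to compute directly from the two maximized log-likelihoods. Continuing the hedged Python sketch above (statsmodels exposes lnL(M) and lnL(0) as the llf and llnull attributes of a fitted model):

    n = len(y)
    ll_m = model.llf       # lnL(M), log-likelihood of the current model
    ll_0 = model.llnull    # lnL(0), log-likelihood of the intercept-only model

    r2_cs = 1 - np.exp(-2 * (ll_m - ll_0) / n)   # formula (2)
    r2_mf = 1 - ll_m / ll_0                      # formula (3)
    print(r2_cs, r2_mf)
    print(model.prsquared)   # statsmodels' built-in pseudo R-squared is McFadden's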

Note that R2MF is a particular case of the more general, entropy-based R-Squared (R2Ent) given by the formula
R2Ent = [lnL(M) – lnL(0)] / [lnL(S) – lnL(0)]          (4)
It is known that for binary logistic regression with single-trial syntax lnL(S) = 0, so R2Ent reduces to R2MF. Note that R-Squared (4) can also be interpreted as the deviance-based measure R2DEV (see Hosmer and Lemeshow (1989, pp. 147–148) and Cameron and Windmeijer (1997)). Thus, in our case we have R2MF = R2Ent = R2DEV.
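For the reader’s convenience, here is the standard one-line argument behind lnL(S) = 0: the saturated model fits each observation perfectly, so ŷi = yi, and therefore

    \ln L(S) = \sum_{i=1}^{n} \left[ y_i \ln \hat{y}_i + (1 - y_i) \ln(1 - \hat{y}_i) \right]
             = \sum_{i=1}^{n} \left[ y_i \ln y_i + (1 - y_i) \ln(1 - y_i) \right] = 0,

since, with the usual convention 0·ln 0 = 0, every term is either 1·ln 1 = 0 or 0·ln 0 = 0.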
Desirable properties of the R-Squared measures mentioned above include an interpretation in terms of the information gain (IG) achieved by the model with the available predictors relative to the null model (see Kent (1983), Cameron and Windmeijer (1997), and Shtatland et al. (2000)).


As far as we know, all publications on R-Squared measures in logistic regression that refer to R2CS and R2MF treat them as independent measures. But this is not the case: there exists a very interesting and very important functional relationship between them. It was shown in our SUGI (2000) and NESUG (2002) presentations that

R2CS = 1 – exp(-R2MF * T)                                      (5)

where

T = –2lnL(0) / n                                                        (6)

Since the maximized log-likelihood for the null model lnL(0) can be written as

lnL(0) = n[ȳ*lnȳ + (1 – ȳ)*ln(1 – ȳ)]
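
This closed form follows because the maximum likelihood estimate of the event probability under the intercept-only model is the base rate ȳ, the same for every observation, so that

    \ln L(0) = \sum_{i=1}^{n} \left[ y_i \ln \bar{y} + (1 - y_i) \ln(1 - \bar{y}) \right]
             = n \left[ \bar{y} \ln \bar{y} + (1 - \bar{y}) \ln(1 - \bar{y}) \right],

using ∑yi = nȳ.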


formula (6) can be rewritten as follows

T = -2[ȳ*lnȳ + (1 – ȳ)*ln(1 – ȳ)]                              (7)


This means that the quantity T is nothing but the doubled entropy of a Bernoulli distribution with probability ȳ. Note that in our previous publications we used the notation R2SAS instead of R2CS, because at that time this was the only R-Squared measure reported in SAS (not to mention the unfortunate so-called Nagelkerke adjustment). The properties of the entropy function predetermine the properties of our key quantity T; they are discussed below.
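
Relationship (5) is exact, and it is easy to verify numerically. Continuing the hypothetical Python sketch above (y, r2_cs, and r2_mf were computed earlier):

    # T computed from the base rate via formula (7)
    y_bar = y.mean()
    t = -2 * (y_bar * np.log(y_bar) + (1 - y_bar) * np.log(1 - y_bar))

    # Identity (5): R2CS = 1 - exp(-R2MF * T), up to floating-point error
    print(np.isclose(r2_cs, 1 - np.exp(-r2_mf * t)))   # True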


As a function of ȳ, T is defined on the interval [0, 1] and is symmetric with respect to ȳ = 1/2. It equals 0 at the endpoints ȳ = 0 and ȳ = 1, increases on [0, 1/2], attains its maximum 2ln2 at ȳ = 1/2, and then decreases back to 0. In Shtatland et al. (2002), it was shown that if 1 < T <= 2ln2, which corresponds to 0.2 < ȳ < 0.8, then R2CS is slightly greater than R2MF. Otherwise, R2CS is smaller than R2MF, and very substantially so when ȳ is close to either 0 or 1. It is interesting that exactly on this interval the linear probability model provides a good approximation to logistic regression (see Allison, 2017). The properties of T, R2MF, and R2CS mentioned above provide a theoretical justification of the empirical comparisons between R2CS and R2MF when ȳ takes values near 0 (or 1) versus when ȳ is close to 0.5 (see Allison (2013) and numerous examples in Menard (2000), Mittlböck and Schemper (1996), and Shtatland et al. (2002)). In addition, formula (5) explains directly why the maximal value of R2CS is 0.75 and not 1, as for R2MF: even a perfect model with R2MF = 1 at the most favorable base rate ȳ = 0.5, where T = 2ln2, yields R2CS = 1 – exp(–2ln2) = 1 – 1/4 = 0.75.
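
These properties are easy to see numerically. The short Python sketch below tabulates T and the ceiling of R2CS, i.e. the value 1 – exp(–T) attained by a hypothetical perfect model with R2MF = 1, for several base rates:

    import numpy as np

    for y_bar in (0.01, 0.05, 0.2, 0.5):
        t = -2 * (y_bar * np.log(y_bar) + (1 - y_bar) * np.log(1 - y_bar))
        print(f"ybar = {y_bar:4.2f}   T = {t:5.3f}   max R2CS = {1 - np.exp(-t):5.3f}")

At ȳ = 0.5 this reproduces the ceiling of 0.75 discussed above, while at ȳ = 0.01 the ceiling drops to roughly 0.11.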


Entropy and information are so closely related that they are sometimes called “two sides of the same coin”. That is why the “entropy loss” (see Efron, 1978) is nothing more than the equivalent “information gain” (see Kent (1983) and Shtatland et al. (2002)).

Note also that if ȳ = 0.5, then we are in the state of maximal uncertainty, i.e. maximal possible entropy. Adding relevant predictors can only increase information, i.e. decrease entropy. Thus, the notions of entropy and information are fundamental for understanding the model building process: any step-by-step construction of better and better models is accompanied by an entropy-decrease / information-gain process.


There exists a physical (thermodynamic) analogy to this process. A system in thermodynamic equilibrium is characterized by maximum entropy. To move the system away from equilibrium, it is necessary to spend some energy, i.e. to do some work. In our context, this means that additional predictors, i.e. explanatory variables, have to be introduced into the model.

Summarizing, we can conclude that R2MF is certainly preferable to R2CS in at least two respects. First, R2MF has a desirable, intuitively reasonable, immediate interpretation in terms of information gain or entropy loss; for R2CS this interpretation is not as direct. Second, according to Menard (2000, p. 24), R2MF stands out for its relative independence from the base rate compared to other R-Squared measures, whereas R2CS is highly correlated with the base rate (see Menard, 2000, p. 23). From formulas (5) and (7), it can be seen that R2CS depends on the base rate ȳ essentially through the quantity T. As a result, R2CS demonstrates a rather counterintuitive and odd trait: it increases as the base rate (more exactly, ȳ or 1 – ȳ, whichever is smaller) increases from 0 to 0.5. That is why we recommend using R2MF and not R2CS. All of this explains perfectly well why more and more researchers have joined the “McFadden camp” since 2000. Of course, it does not prevent them from using R2O as a supplemental measure. As for the new, so-called Tjur R2, we will consider this measure elsewhere in the future.


As some researchers still have doubts about whether to use R-Squared measures at all, we would like to cite Menard (2010, p. 52): “If you want R2, why not use R2?”

References:

Allison, P. D. (2013), “What’s the best R-squared for logistic regression?” (http://www.statisticalhorizons.com/r2logistic).
Allison, P. D. (2014), “Measures of fit for logistic regression” (https://support.sas.com/resources/papers/proceedings14/1485-2014.pdf).
Allison, P. D. (2017), “In Defense of Logit – Part 1” (https://statisticalhorizons.com/in-defense-of-logit-part-1).
Cameron, C. A. and Windmeijer, F. A. G. (1997), “An R-squared measure of goodness of fit for some common nonlinear regression models”, Journal of Econometrics, 77, 329–342.
Cox, D. R. and Snell, E. J. (1989), “Analysis of Binary Data”, Second Edition, London: Chapman & Hall.
Efron, B. (1978), “Regression and ANOVA with zero-one data: Measures of residual variation”, Journal of the American Statistical Association, 73, 113–121.
Hosmer, D. W. and Lemeshow, S. (1989), “Applied Logistic Regression”, New York: Wiley.
Kent, J. T. (1983), “Information gain and a general measure of correlation”, Biometrika, 70, 163–173.
McFadden, D. (1974), “Conditional logit analysis of qualitative choice behavior”, pp. 105–142 in P. Zarembka (ed.), Frontiers in Econometrics, New York: Academic Press.
Menard, S. (2000), “Coefficients of determination for multiple logistic regression analysis”, The American Statistician, 54, 17–24.
Menard, S. (2002), “Applied Logistic Regression Analysis”, Second Edition, Sage University Paper Series.
Menard, S. (2010), “Logistic Regression: From Introductory to Advanced Concepts and Applications”, Sage, Chapter 3, pp. 48–62.
Mittlböck, M. and Schemper, M. (1996), “Explained variation for logistic regression”, Statistics in Medicine, 15, 1987–1997.
Shtatland, E. S., Moore, S. and Barton, M. B. (2000), “Why we need R^2 measure of fit (and not only one) in PROC LOGISTIC and PROC GENMOD”, SUGI 2000 Proceedings, Paper 256-25, Cary, NC: SAS Institute Inc.
Shtatland, E. S., Kleinman, K. and Cain, E. M. (2002), “One more time about R^2 measures of fit in logistic regression”, NESUG 2002 Proceedings.