Saturday, April 21, 2018

Do we really need more than one R-Squared in logistic regression?


In the previous post we advocated the McFadden R-Squared as our preferred choice for binary logistic regression; in other words, we joined the so-called “McFadden Camp” (see Paul Allison’s post “What’s the best R-Squared for logistic regression?” (2013)). But not all researchers think this way. As DeMaris (1992), p. 56, puts it: “… It may not be prudent to rely on only one measure for assessing predictive efficacy - particularly in view of the lack of consensus on which measure is most appropriate. Perhaps the best strategy is to report more than one measure for any given analysis. If the model has predictive power, this should be reflective in some degree by all of the measures discussed above”.

Many different R-Squared statistics have been proposed in the past four and a half decades (see, e.g., review publications: Windmeijer (1995), Cameron and Windmeijer (1997), Mittlbock and Schemper (1996), Menard (2000), (2010), Smith and McKenna (2013), Walker and Smith (2016)). These statistics can be divided into three main categories:
(1) R-Squared measures based on likelihoods;
(2) R-Squared measures based on the sum of squares;
(3) Measures based on squared correlations between observed outcomes and estimated probabilities.
Standing alone is the so-called Tjur R-Squared.


R-Squared measures based on model likelihoods
 
This category is the most populated. It includes R^2(MF) (McFadden R^2) and R^2(CS) (Cox – Snell R^2), already familiar from the previous post. Here and below we use notation drawn partly from our previous post, Menard (2000), and Smith and McKenna (2013):

(1)  R^2(MF) = [lnL(M) – lnL(0)] / [lnL(S) – lnL(0)] = 1 – lnL(M) / lnL(0)

(the second equality holds because the saturated log-likelihood lnL(S) equals zero for ungrouped binary data)

(2)  R^2(CS) = 1 – [L(0) / L(M)]^(2/n)

(3)  R^2(N) = R^2(CS) / max R^2(CS) = R^2(CS) / (1 – L(0)^(2/n))   (the Nagelkerke R^2)

Also included are:

(4)  Aldrich – Nelson R-Squared, defined by
R^2(AN) = G(M) / (G(M) + n) = 2(lnL(M) – lnL(0)) / [2(lnL(M) – lnL(0)) + n]
(see Smith and McKenna (2013))

(5)  Veall – Zimmermann R-Squared, given by the formula
R^2(VZ) = {2(lnL(M) – lnL(0)) / [2(lnL(M) – lnL(0)) + n]} * [(-2lnL(0) + n) / (-2lnL(0))] = R^2(AN) / max R^2(AN)
(see Walker and Smith (2016)). R^2(VZ) was introduced as a correction to R^2(AN), which cannot reach the value of one. It is easy to see that max R^2(AN) is equal to 2ln2 / (2ln2 + 1) ≈ 0.581 (compare with max R^2(CS) = 0.75).

(6)  Estrella R-Squared, defined by the formula
R^2(Est) = 1 – [lnL(M) / lnL(0)]^(-2lnL(0)/n)
(see Walker and Smith (2016))
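
As a quick illustration, here is a minimal Python sketch that computes the six likelihood-based measures above from lnL(0), lnL(M) and n; the function name and the numeric values in the example are ours, chosen purely for illustration:

import math

def likelihood_r2(lnL0, lnLM, n):
    # R-Squared measures (1) - (6) from the null and fitted log-likelihoods and the sample size
    r2_mf  = 1.0 - lnLM / lnL0                               # (1) McFadden
    r2_cs  = 1.0 - math.exp(2.0 * (lnL0 - lnLM) / n)         # (2) Cox - Snell
    r2_n   = r2_cs / (1.0 - math.exp(2.0 * lnL0 / n))        # (3) Nagelkerke
    g_m    = 2.0 * (lnLM - lnL0)                             # likelihood-ratio statistic G(M)
    r2_an  = g_m / (g_m + n)                                 # (4) Aldrich - Nelson
    r2_vz  = r2_an * (-2.0 * lnL0 + n) / (-2.0 * lnL0)       # (5) Veall - Zimmermann
    r2_est = 1.0 - (lnLM / lnL0) ** (-2.0 * lnL0 / n)        # (6) Estrella
    return {"MF": r2_mf, "CS": r2_cs, "N": r2_n,
            "AN": r2_an, "VZ": r2_vz, "Est": r2_est}

# hypothetical values: lnL(0) = -138.6, lnL(M) = -110.2, n = 200
print(likelihood_r2(-138.6, -110.2, 200))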

We do not consider here corrections of these R-Squared measures for the number of predictors. From this abundance of R-Squared indexes we choose only two: R^2(MF) and R^2(CS). As shown in Shtatland et al. (2000) and (2002), there exists an important and very simple functional relationship between R^2(MF) and R^2(CS) (see also our previous post, Menard (2010), p. 50, and our comment to Allison (2013)):

R^2(CS) = 1 – exp(-R^2(MF)*T)                          (1)                          

where the quantity T is equal to -2lnL(0) / n. Since the maximized log-likelihood for the null model can be written as

lnL(0) = n[ȳ*lnȳ + (1 – ȳ)*ln(1 – ȳ)]

the formula for T can be rewritten as follows

T = -2[ȳ*lnȳ + (1 – ȳ)*ln(1 – ȳ)]                                       (2)

Thus, T is nothing but twice the entropy of a Bernoulli distribution with success probability ȳ (the base rate).
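
A tiny numeric illustration of equations (1) and (2) in Python, with hypothetical values of the base rate and of R^2(MF):

import math

y_bar = 0.3                                                                      # hypothetical base rate
T = -2.0 * (y_bar * math.log(y_bar) + (1.0 - y_bar) * math.log(1.0 - y_bar))    # equation (2)
r2_mf = 0.20                                                                     # hypothetical McFadden R-Squared
r2_cs = 1.0 - math.exp(-r2_mf * T)                                               # equation (1)
print(round(T, 4), round(r2_cs, 4))                                              # T ≈ 1.2217, R^2(CS) ≈ 0.2168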

Desirable properties of an R-squared include interpretation in terms of the information content of the data (see Cameron and Windmeijer (1997)). It is shown in Shtatland et al. (2000) and (2002) (see also our previous post) that R^2(MF) has an intuitively reasonable, immediate interpretation in terms of information theory: it can be considered as the ratio of the information gain (IG) achieved when moving from the null model (intercept only) to the current model by adding the available predictors, 2(lnL(M) – lnL(0)), to the quantity of all potentially recoverable information, -2lnL(0). Note that R^2(CS) also allows an information interpretation, though not as direct as that of R^2(MF). Indeed, let us transform the equation

R^2(CS) = 1 – [L(0) / L(M)]^(2/n)

which defines R^2(CS) to

-ln (1 - R^2(CS)) = 2(lnL(M) – lnL(0))/n = IG /n = IGR

Here the information gain rate, IGR, is nothing but the information gain per observation. The logarithmic function -ln(1 – x) is a well-known measure from information theory (Theil and Chung (1988)). Using -ln(1 – R^2(CS)) rather than R^2(CS) provides a more natural scale for interpretation. Taking the first derivative of both sides of this equation with respect to R^2(CS), we have

dIGR/dR^2(CS) = 1 / (1 - R^2(CS))

From this formula we arrive at the following important conclusions:

If R^2(CS) = 0, then dIGR/dR^2(CS) = 1, i.e. an increase in R^2(CS) translates into an equal increase in IGR;
If R^2(CS) = 0.5, then dIGR/dR^2(CS) = 2, i.e. the increase in IGR is twice as large as the increase in R^2(CS);
If R^2(CS) = 0.75 (the maximal value possible), then dIGR/dR^2(CS) = 4, i.e. the increase in IGR is four times as large as the increase in R^2(CS).
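
These sensitivities are easy to verify numerically; a minimal sketch (the three values of R^2(CS) are simply the ones discussed above):

import math

def igr(r2_cs):
    # information gain rate implied by R^2(CS): IGR = -ln(1 - R^2(CS))
    return -math.log(1.0 - r2_cs)

h = 1e-6
for r2 in (0.0, 0.5, 0.75):
    slope = (igr(r2 + h) - igr(r2)) / h                       # finite-difference estimate of dIGR/dR^2(CS)
    print(r2, round(slope, 3), round(1.0 / (1.0 - r2), 3))    # compare with 1 / (1 - R^2(CS))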

Note that equations (1) and (2) can also be found in Sharma and McGee (2008), though without the information-entropy interpretation, which is one of the most desirable properties of an R-squared (see, for example, Cameron and Windmeijer (1997)).

Concluding our comparison of R^2(CS) and R^2(MF), note that these measures are related to the base rate ȳ quite differently. According to Menard (2000), R^2(MF) stands out for its relative independence from the base rate, compared with R^2(CS) and other R-Squared measures. At the same time, R^2(CS) is highly correlated with ȳ (see Menard, 2000, p. 23). From formulas (1) and (2), it can be seen that R^2(CS) depends on the base rate ȳ essentially through the quantity T. As a result, R^2(CS) demonstrates a rather counterintuitive and odd trait: it increases as the base rate (more exactly, ȳ or 1 – ȳ, whichever is smaller) increases from 0 to 0.5, absurdly suggesting that ȳ itself could be used as a kind of R^2 measure.

It is easy to see that any candidate for R^2 in logistic regression that is expressed as a function of the statistic [lnL(M) – lnL(0)] can also be rewritten in terms of R^2(MF) and T. For example, R^2(AN), R^2(VZ) and R^2(Est) can be written as:

R^2(AN) = R^2(MF) / (R^2(MF)+1/T)                            (3)

R^2(VZ) = R^2(MF) * (1 + T) / (1 + T* R^2(MF))         (4)

R^2(Est) = 1 – exp[T*ln(1 - R^2(MF))]                            (5)

Thus, it can be concluded from the formulas above that R^2(MF) plays a central role in the category of R-Squared statistics based on model likelihoods: if we know R^2(MF) and T, then all other members of this category can be calculated by hand.
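
For example, the following sketch (with the same hypothetical lnL(0), lnL(M) and n as in the earlier sketch) checks that formulas (3), (4) and (5) reproduce the direct definitions of R^2(AN), R^2(VZ) and R^2(Est) given above:

import math

lnL0, lnLM, n = -138.6, -110.2, 200                     # hypothetical values
T = -2.0 * lnL0 / n
r2_mf = 1.0 - lnLM / lnL0

r2_an  = r2_mf / (r2_mf + 1.0 / T)                      # formula (3)
r2_vz  = r2_mf * (1.0 + T) / (1.0 + T * r2_mf)          # formula (4)
r2_est = 1.0 - math.exp(T * math.log(1.0 - r2_mf))      # formula (5)

# direct definitions for comparison
g_m = 2.0 * (lnLM - lnL0)
print(r2_an,  g_m / (g_m + n))
print(r2_vz,  (g_m / (g_m + n)) * (-2.0 * lnL0 + n) / (-2.0 * lnL0))
print(r2_est, 1.0 - (lnLM / lnL0) ** (-2.0 * lnL0 / n))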

Furthermore, as shown above, R^2(MF) (and partly R^2(CS)) has a clear and direct interpretation in terms of the information content of the data, the potentially recoverable information, the information gain due to adding predictors, etc. (see Shtatland (2002), Menard (2010) p. 48, and our previous post). This interpretation is no less natural and important than the one used in the OLS approach. At the same time, R^2(AN), R^2(VZ) and R^2(Est) have lost this interpretation and look like ad hoc measures. Summarizing, we conclude that R^2(MF) is clearly preferable to any other R-Squared in this category.

R-Squared measures based on the sum of squares – OLS analogs

We consider here only the most popular measures. The first, based on sums of squared residuals, is used under various names (R^2(O), R^2(SS), R^2(Efron), R^2(res), etc.) and is given by the formula familiar from our previous post:

R^2(O) = 1 – ∑(yi – ŷi)^2 / ∑(yi – ȳ)^2                          (6)

The second, the so-called model R-Squared, R^2(mod), is defined by the formula:

R^2(mod) = ∑(ŷi – ȳ)^2 / ∑(yi – ȳ)^2                             (7)

Recall that yi is the observed outcome (1 or 0), ŷi is the predicted value of yi (actually, the predicted probability), and ȳ is the arithmetic mean of the yi. R^2(O) looks exactly like the ordinary least squares R^2 for linear regression, except that the ŷi are calculated by maximizing the likelihood function rather than by minimizing the sum of squares. It is frequently used as a standard of comparison for other R-Squared measures. For some researchers (see, for example, Mittlbock and Schemper (1996)), R^2(O) is the measure of choice. Also, it is concluded in Menard (2010), p. 56, that there are certain benefits in using R^2(O) not instead of R^2(MF), but as a supplemental measure. However, R^2(O) has two serious disadvantages: it does not automatically increase when the model is extended by an additional predictor, and it can be negative in some rare, rather degenerate cases.
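
A minimal Python sketch of formulas (6) and (7), assuming y is an array of observed 0/1 outcomes and p is an array of fitted probabilities from the model (the function name is ours):

import numpy as np

def r2_sum_of_squares(y, p):
    # R^2(O), formula (6), and R^2(mod), formula (7), from observed 0/1 outcomes y
    # and fitted probabilities p
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    y_bar = y.mean()
    sst = np.sum((y - y_bar) ** 2)
    r2_o   = 1.0 - np.sum((y - p) ** 2) / sst
    r2_mod = np.sum((p - y_bar) ** 2) / sst
    return r2_o, r2_mod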
 
Measures based on squared correlations

There are plenty of them. In Mittlbock and Schemper (1996), six such measures are discussed. We will mention only the most popular measure in this category, based on the squared Pearson correlation:

R^2(cor) = {∑(yi – ȳ)(ŷi – ȳ) / ([∑(yi – ȳ)^2]^0.5 [∑(ŷi – ȳ)^2]^0.5)}^2

It is known that R^2(cor) is very close to R^2(O); moreover, they are almost identical numerically (Liao and McGee (2003)). Therefore, it is reasonable to use only one of them, and we prefer R^2(O).
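
For completeness, R^2(cor) is a one-liner under the same assumptions about y and p as in the previous sketch:

import numpy as np

def r2_cor(y, p):
    # squared Pearson correlation between the observed outcomes and the fitted probabilities
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return np.corrcoef(y, p)[0, 1] ** 2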

Tjur R-Squared

Finally, we consider the Tjur R-Squared, R^2(Tjur). Since it is not so easy to define R^2(Tjur) by a single formula, we refer to Tjur (2009) and Allison (2013), (2014) for a verbal definition: in short, it is the difference between the mean predicted probability for the observations with outcome 1 and the mean predicted probability for the observations with outcome 0. Actually, this measure was introduced earlier in Cramer (1999), so it would be fair to use the double name Cramer – Tjur. According to Allison (2013), the measure has a lot of intuitive appeal, it is easy to calculate, and it is also closely related to the R-Squared measures based on sums of squares:

R^2(Tjur) = (R^2(O) + R^2(mod)) /2                                (8)

R^2(Tjur) = (R^2(mod) * R^2(cor))^0.5                           (9)
Following Allison (2013), we propose to use R^2(Tjur) as an attractive alternative and as a good supplement to R^2(MF) (the best R-Squared based on likelihoods) and R^2(O) (our preferred choice in the category of OLS analogs).
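
A minimal sketch of R^2(Tjur) computed directly from its verbal definition, again assuming arrays y of observed 0/1 outcomes and p of fitted probabilities (the function name is ours):

import numpy as np

def r2_tjur(y, p):
    # difference between the mean fitted probability for the observed 1s
    # and the mean fitted probability for the observed 0s
    y = np.asarray(y)
    p = np.asarray(p, dtype=float)
    return p[y == 1].mean() - p[y == 0].mean()

Relation (8) can then be checked numerically by comparing r2_tjur(y, p) with the average of the two measures returned by r2_sum_of_squares(y, p) from the earlier sketch.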

Conclusion

Let us return to the DeMaris quotation: “… It may not be prudent to rely on only one measure for assessing predictive efficacy - particularly in view of the lack of consensus on which measure is most appropriate. Perhaps the best strategy is to report more than one measure for any given analysis”. Following this advice, we propose to report three measures together: R^2(MF), R^2(O) and R^2(Tjur), hoping that they make up a good “three pillars of R-Squared measures” and a good pathway out of “the jungle of R-Squareds” for binary logistic regression.
        
References:

Allison, P. D. (2013) “What’s the best R-squared for logistic regression?” (http://www.statisticalhorizons.com/r2logistic).
Allison, P. D. (2014) “Measures of fit for logistic regression”, https://support.sas.com/resources/papers/proceedings14/1485-2014.pdf.
Cameron, C. A. and Windmeijer, F. A. G. (1997) "An R-squared measure of goodness of fit for some common nonlinear regression models", Journal of Econometrics, Vol. 77, No.2, pp. 329-342.
Cox, D. R. and Snell E. J. (1989) “Analysis of binary data” Second Edition, Chapman & Hall, London.
Cramer, J. S. (1999) “Predictive performance of the binary logit model in unbalanced samples”, The Statistician, 48, 85 – 94.
DeMaris, A. (1992) “Logit modeling” Sage University Paper Series.
Efron, B. (1978), “Regression and ANOVA with zero-one data: Measures of residual variation”, Journal of the American Statistical Association, 73, 113 – 121.
Hosmer, D. W. and Lemeshow, S. (1989), “Applied logistic regression”, New York: Wiley.
Kent, J. T. (1983). “Information gain and a general measure of correlation”,  Biometrika, 70, 163 – 173.
McFadden, D. (1974), “Conditional logit analysis of qualitative choice behavior”, pp. 105 – 142 in Zarembka (ed.), Frontiers in Econometrics. Academic Press.
Menard, S. (2000) “Coefficients of determination for multiple logistic regression analysis”, The American Statistician, 54, 17 – 24.
Menard, S. (2002), “Applied logistic regression analysis”, Sage University Paper Series (Second edition).
Menard, S. (2010), “Logistic regression: From introductory to advanced concepts and applications”, Sage University Paper Series, Chapter 3, pp. 48 – 62.
Mittlbock, M. and Schemper, M. (1996), “Explained variation in logistic regression”, Statistics in Medicine, 15, 1987 – 1997.
Sharma, D. and McGee, D. (2008), “Estimating proportion of explained variation for an underlying linear model using logistic regression analysis”, Journal of Statistical Research, 42, 59 – 69.
Shtatland, E. S., Moore, S. and Barton, M. B. (2000) “Why we need R^2 measure of fit (and not only one) in PROC LOGISTIC and PROC GENMOD”, SUGI 2000 Proceedings, Paper 256 - 25, Cary, SAS Institute Inc.
Shtatland, E. S., Kleinman, K. and Cain, E. M. (2002) “One more time about R^2 measures of fit in logistic regression”, NESUG 2002 Proceedings.
Smith, T. J. and McKenna, C. M. (2013) “A comparison of logistic regression pseudo R^2 indices”, Multiple Linear Regression Viewpoints, 39, 17 - 26. 
Theil, H. and Chung, C.F. (1988) “Information-theoretic measures of fit for univariate and multivariate linear regressions”, The American Statistician, 42, 249 - 252.
Tjur, T. (2009) “Coefficients of determination in logistic regression models – a new proposal: the coefficient of discrimination”, The American Statistician, 63, 366 – 372.
Walker, D. A. and Smith, T. J. (2016) “Nine pseudo R^2 indices for binary logistic regression models”, Journal of Modern Applied Statistical Methods, 15, 848 – 854.
Windmeijer, F. A. G. (1995) “Goodness of fit measures in binary choice models”, Econometric Reviews, 14, 101 – 116.