In the previous post we advocated the McFadden R-Squared as our preferred choice for binary logistic regression; in other words, we joined the so-called “McFadden Camp” (see Paul Allison’s post “What’s the best R-Squared for logistic regression” (2013)). But not all researchers think this way. As DeMaris (1992), p. 56, puts it: “… It may not be prudent to rely on only one measure for assessing predictive efficacy - particularly in view of the lack of consensus on which measure is most appropriate. Perhaps the best strategy is to report more than one measure for any given analysis. If the model has predictive power, this should be reflected in some degree by all of the measures discussed above”.
Many different R-Squared statistics have been proposed in
the past four and a half decades (see, e.g., review publications: Windmeijer
(1995), Cameron and Windmeijer (1997), Mittlbock and Schemper (1996), Menard
(2000), (2010), Smith and McKenna (2013), Walker and Smith (2016)). These
statistics can be divided into three main categories:
(1) R-Squared measures based on likelihoods;
(2) R-Squared measures based on the sum of squares;
(3) Measures based on squared correlations of observed outcomes and estimated probabilities.
Standing alone is the so-called Tjur R-Squared.
R-Squared measures based on model likelihoods
This category is the most populated. It includes R^2(MF) (McFadden R^2) and R^2(CS) (Cox–Snell R^2), familiar from our previous post. Here and below we use notation taken partially from our previous post, Menard (2000), and Smith and McKenna (2013):
(1) R^2(MF) = [lnL(0) – lnL(M)] / [lnL(0) – lnL(S)] = 1 – lnL(M) / lnL(0),
where lnL(0), lnL(M) and lnL(S) are the maximized log-likelihoods of the intercept-only (null) model, the fitted model and the saturated model, respectively; for ungrouped binary data lnL(S) = 0, which gives the second equality.
(2) R^2(CS) = 1 – [L(0) / L(M)]^(2/n)
(3) R^2(N) = R^2(CS) / max R^2(CS) = R^2(CS) / (1 – L(0)^(2/n)) (Nagelkerke R-Squared)
Also included are:
(4) Aldrich–Nelson R-Squared, defined by
R^2(AN) = G(M) / (G(M) + n) = 2(lnL(M) – lnL(0)) / [2(lnL(M) – lnL(0)) + n],
where G(M) is the model chi-square statistic (see Smith and McKenna (2013));
(5) Veall–Zimmermann R-Squared, given by the formula
R^2(VZ) = {2(lnL(M) – lnL(0)) / [2(lnL(M) – lnL(0)) + n]} * [(-2lnL(0) + n) / (-2lnL(0))] = R^2(AN) / max R^2(AN)
(see Walker and Smith (2016)). R^2(VZ) was introduced as a correction to R^2(AN), which cannot reach the value of one. It is easy to see that max R^2(AN) is equal to 2ln2 / (2ln2 + 1) ≈ 0.581 (compare to max R^2(CS) = 0.75);
(6) Estrella R-Squared, defined by the formula
R^2(Est) = 1 – [lnL(M) / lnL(0)]^(-2lnL(0)/n)
(see Walker and Smith (2016)).
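To make the six definitions above concrete, here is a minimal Python sketch (our own illustration, not from the cited papers). The function name and the toy log-likelihood values are made up; the only inputs are the maximized log-likelihoods lnL(0) and lnL(M) and the sample size n.

```python
import math

def likelihood_based_r2(lnL0, lnLM, n):
    """Likelihood-based pseudo R-squared measures (1)-(6) from the null and
    fitted maximized log-likelihoods and the sample size n."""
    G = 2.0 * (lnLM - lnL0)                                  # model chi-square G(M)
    r2_mf  = 1.0 - lnLM / lnL0                               # (1) McFadden
    r2_cs  = 1.0 - math.exp(2.0 * (lnL0 - lnLM) / n)         # (2) Cox-Snell
    r2_n   = r2_cs / (1.0 - math.exp(2.0 * lnL0 / n))        # (3) Nagelkerke: R2_CS / max R2_CS
    r2_an  = G / (G + n)                                     # (4) Aldrich-Nelson
    r2_vz  = r2_an * (-2.0 * lnL0 + n) / (-2.0 * lnL0)       # (5) Veall-Zimmermann: R2_AN / max R2_AN
    r2_est = 1.0 - (lnLM / lnL0) ** (-2.0 * lnL0 / n)        # (6) Estrella
    return dict(MF=r2_mf, CS=r2_cs, N=r2_n, AN=r2_an, VZ=r2_vz, Est=r2_est)

# toy example: n = 200 observations with hypothetical log-likelihoods
print(likelihood_based_r2(lnL0=-138.6, lnLM=-110.0, n=200))
```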
We do not consider here corrections of these R-Squared measures for the number of predictors. From this abundance of R-Squared indexes we choose only two: R^2(MF) and R^2(CS). As shown in Shtatland et al. (2000, 2002), there exists an important and very simple functional relationship between R^2(MF) and R^2(CS) (see also our previous post, Menard (2010), p. 50, and our comment to Allison (2013)):
R^2(CS) = 1 – exp(-R^2(MF)*T) (1)
where the quantity T is equal to -2lnL(0) / n. Since the maximized log-likelihood for the null model can be written as
lnL(0) = n[ȳ*lnȳ + (1 – ȳ)*ln(1 – ȳ)],
the formula for T can be rewritten as follows:
T = -2[ȳ*lnȳ + (1 – ȳ)*ln(1 – ȳ)] (2)
Thus, T is nothing but twice the entropy of a Bernoulli distribution with success probability ȳ (the base rate).
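A quick numerical check of relations (1) and (2) (our own sketch, with made-up numbers): T is computed directly as the doubled Bernoulli entropy of the base rate, and 1 – exp(-R^2(MF)*T) reproduces R^2(CS) obtained from the log-likelihoods.

```python
import math

def doubled_bernoulli_entropy(ybar):
    """T = -2[ybar*ln(ybar) + (1 - ybar)*ln(1 - ybar)], equation (2)."""
    return -2.0 * (ybar * math.log(ybar) + (1.0 - ybar) * math.log(1.0 - ybar))

# hypothetical example: n = 200 observations with base rate 0.3
n, ybar = 200, 0.3
lnL0 = n * (ybar * math.log(ybar) + (1 - ybar) * math.log(1 - ybar))  # null log-likelihood
lnLM = lnL0 + 20.0            # assume the predictors add 20 units of log-likelihood
T = doubled_bernoulli_entropy(ybar)            # same as -2*lnL0/n

r2_mf = 1.0 - lnLM / lnL0
r2_cs_direct   = 1.0 - math.exp(2.0 * (lnL0 - lnLM) / n)   # definition of R^2(CS)
r2_cs_relation = 1.0 - math.exp(-r2_mf * T)                 # equation (1)
print(round(r2_cs_direct, 6), round(r2_cs_relation, 6))     # identical
```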
Desirable properties of an
R-squared include interpretation in terms of the information content of the
data (see Cameron and Windmeijer (1997)). It is shown in Shtatland et al. (2000, 2002) (see also our previous post) that R^2(MF) has an intuitively reasonable, immediate interpretation in terms of information theory: it can be considered as the ratio of the information gain (IG) obtained when moving from the intercept-only null model to the current model by adding the available predictors, 2(lnL(M) – lnL(0)), to the quantity of all potentially recoverable information, -2lnL(0). Note that R^2(CS) also allows an information interpretation, though not as direct as that of R^2(MF). Indeed, let us transform the equation
R^2(CS) = 1 – [L(0) / L(M)]^(2/n),
which defines R^2(CS), into
-ln(1 – R^2(CS)) = 2(lnL(M) – lnL(0)) / n = IG / n = IGR
Here the information gain rate, IGR, is nothing but the information gain per observation. The logarithmic function -ln(1 – x) is a well-known
measure from information theory (Theil and Chung (1988)). Using -ln(1 - R^2(CS)) rather than R^2(CS)
provides a more natural scale for interpretation. Taking the first derivative
of both sides of this equation with respect to R^2(CS), we have
dIGR/dR^2(CS) = 1 / (1 - R^2(CS))
From this formula we arrive at the following important
conclusions:
If R^2(CS) = 0, then dIGR/dR^2(CS) = 1, i.e. an increase in R^2(CS) translates into an equal increase in IGR;
If R^2(CS) = 0.5, then dIGR/dR^2(CS) = 2, i.e. the increase in IGR is twice as large as the increase in R^2(CS);
If R^2(CS) = 0.75 (the maximal value possible), then dIGR/dR^2(CS) = 4, i.e. the increase in IGR is four times as large as the increase in R^2(CS).
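These marginal rates are easy to reproduce with a few lines of Python (our own illustration; the three values of R^2(CS) are just the cases listed above).

```python
import math

# information gain rate IGR = -ln(1 - R^2(CS)) and its derivative with respect to R^2(CS)
for r2_cs in (0.0, 0.5, 0.75):
    igr = -math.log(1.0 - r2_cs)
    slope = 1.0 / (1.0 - r2_cs)          # dIGR/dR^2(CS)
    print(f"R2_CS = {r2_cs:4.2f}  IGR = {igr:5.3f}  dIGR/dR2_CS = {slope:4.1f}")
```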
Note that equations (1) and (2) can also be found in Sharma and McGee (2008), though without the informational-entropic interpretation, which is one of the most desirable properties of an R-squared (see, for example, Cameron and Windmeijer (1997)).
Concluding our comparison of R^2(CS) and R^2(MF), note that these measures are related to the base rate ȳ quite differently. According to Menard (2000), R^2(MF) stands out for its relative independence from the base rate, compared to R^2(CS) and other R-Squared measures. At the same time, R^2(CS) is highly correlated with ȳ (see Menard, 2000, p. 23). From formulas (1) and (2) it can be seen that R^2(CS) depends on the base rate ȳ essentially through the quantity T. As a result, R^2(CS) demonstrates a rather counterintuitive and odd trait: it increases as the base rate (more exactly, ȳ or 1 – ȳ, whichever is smaller) increases from 0 to 0.5, absurdly implying that ȳ itself could be used as some kind of R^2 measure.
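To illustrate this dependence numerically (our own sketch, with made-up base rates), the ceiling max R^2(CS) = 1 – exp(-T) can be tabulated for several values of ȳ; it grows steadily as ȳ moves from near 0 toward 0.5.

```python
import math

# ceiling of R^2(CS) as a function of the base rate ybar:
# max R^2(CS) = 1 - exp(-T), with T the doubled Bernoulli entropy of ybar
for ybar in (0.05, 0.1, 0.2, 0.3, 0.5):
    T = -2.0 * (ybar * math.log(ybar) + (1.0 - ybar) * math.log(1.0 - ybar))
    print(f"ybar = {ybar:4.2f}  T = {T:5.3f}  max R2_CS = {1.0 - math.exp(-T):5.3f}")
```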
It is easy to see that any candidate for R^2 in logistic regression that is expressed as a function of the statistic [lnL(M) – lnL(0)] and n can also be rewritten in terms of R^2(MF) and T. For example, R^2(AN), R^2(VZ) and R^2(Est) can be written as:
R^2(AN) = R^2(MF) / (R^2(MF) + 1/T) (3)
R^2(VZ) = R^2(MF) * (1 + T) / (1 + T * R^2(MF)) (4)
R^2(Est) = 1 – exp[T * ln(1 – R^2(MF))] (5)
Thus, it can be concluded from the
formulas above that R^2(MF) plays a central role in the category of R-Squared
statistics based on the model likelihoods - if we know R^2(MF) and T, then all other
members of this category can be calculated by hand.
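A numerical sanity check of formulas (3)-(5) (our own sketch, with hypothetical log-likelihoods): each measure is computed both directly from lnL(0), lnL(M) and n, and from R^2(MF) and T; the two routes agree.

```python
import math

n, lnL0, lnLM = 200, -120.0, -95.0
T = -2.0 * lnL0 / n
r2_mf = 1.0 - lnLM / lnL0
G = 2.0 * (lnLM - lnL0)

# direct definitions
an_direct  = G / (G + n)
vz_direct  = an_direct * (-2.0 * lnL0 + n) / (-2.0 * lnL0)
est_direct = 1.0 - (lnLM / lnL0) ** (-2.0 * lnL0 / n)

# formulas (3)-(5), in terms of R^2(MF) and T only
an_via_t  = r2_mf / (r2_mf + 1.0 / T)                   # (3)
vz_via_t  = r2_mf * (1.0 + T) / (1.0 + T * r2_mf)       # (4)
est_via_t = 1.0 - math.exp(T * math.log(1.0 - r2_mf))   # (5)

print(an_direct, an_via_t)      # equal
print(vz_direct, vz_via_t)      # equal
print(est_direct, est_via_t)    # equal
```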
Furthermore, as shown above, R^2(MF) (and partly R^2(CS)) has a clear and direct interpretation in terms of the information content of the data, the potentially recoverable information, the information gain due to adding predictors, etc. (see Shtatland et al. (2002), Menard (2010), p. 48, and our previous post). This interpretation is no less natural and important than the one used in the OLS approach. At the same time, R^2(AN), R^2(VZ) and R^2(Est) have lost this interpretation and look like ad hoc measures. Summarizing, we can conclude that R^2(MF) is clearly preferable to any other R-Squared measure in this category.
R-Squared measures based on the sum of squares – OLS analogs
We consider here only the most popular measures. The first one, based on sums of squared residuals, is used under various names: R^2(O), R^2(SS), R^2(Efron), R^2(res), etc., and is given by the formula familiar from our previous post:
R^2(O) = 1 – ∑(yi – ŷi)^2 / ∑(yi – ȳ)^2 (6)
The second one, the so-called model R-Squared, R^2(mod), is defined by the formula:
R^2(mod) = ∑(ŷi – ȳ)^2 / ∑(yi – ȳ)^2 (7)
Recall that yi is the observed outcome (1 or 0), ŷi is the predicted value of yi (actually, the predicted probability), and ȳ is the arithmetic mean of the yi. R^2(O) looks exactly like the ordinary least squares R^2(OLS) for linear regression, except that the ŷi are calculated by maximizing the likelihood function rather than by minimizing the sum of squares. It is frequently used as a standard of comparison for other R-Squared measures. For some researchers (see, for example, Mittlbock and Schemper (1996)), R^2(O) is the measure of choice. Also, it is concluded in Menard (2010), p. 56, that there are certain benefits in using R^2(O) not instead of R^2(MF), but as a supplemental measure. However, R^2(O) has two serious disadvantages: it does not automatically increase when the model is extended by an additional predictor, and it can be negative in some rare, rather degenerate cases.
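Both sum-of-squares measures are easy to compute from the outcome vector and the fitted probabilities. The following NumPy sketch (ours; the function name and the toy data are made up) implements formulas (6) and (7).

```python
import numpy as np

def r2_sum_of_squares(y, p):
    """R^2(O) (equation (6)) and R^2(mod) (equation (7)) from 0/1 outcomes y
    and fitted probabilities p."""
    y, p = np.asarray(y, float), np.asarray(p, float)
    ybar = y.mean()
    sst = np.sum((y - ybar) ** 2)
    r2_o   = 1.0 - np.sum((y - p) ** 2) / sst    # (6) residual-based, Efron-type
    r2_mod = np.sum((p - ybar) ** 2) / sst       # (7) model R-Squared
    return r2_o, r2_mod

# toy example with hypothetical fitted probabilities
y = [1, 0, 1, 1, 0, 0, 1, 0]
p = [0.8, 0.3, 0.6, 0.7, 0.2, 0.4, 0.9, 0.1]
print(r2_sum_of_squares(y, p))
```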
Measures based on squared correlations
There are plenty of them. In Mittlbock and Schemper (1996) six such measures are discussed. We will mention only the most popular measure in this category, based on the squared Pearson correlation:
R^2(cor) = {∑(yi – ȳ)(ŷi – ȳ) / ([∑(yi – ȳ)^2]^0.5 * [∑(ŷi – ȳ)^2]^0.5)}^2
It is known that R^2(cor) is very close to R^2(O); moreover, they are almost identical numerically (Liao and McGee (2003)). Therefore, it is reasonable to use only one of them, and we prefer R^2(O).
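R^2(cor) itself is a one-liner with NumPy (our sketch, reusing the toy data from above). Note that for a maximum-likelihood logistic fit with an intercept the fitted probabilities average exactly to ȳ, so centering the ŷi at ȳ, as in the formula above, coincides with the usual Pearson correlation.

```python
import numpy as np

def r2_squared_correlation(y, p):
    """R^2(cor): squared Pearson correlation between outcomes and fitted probabilities."""
    return np.corrcoef(np.asarray(y, float), np.asarray(p, float))[0, 1] ** 2

y = [1, 0, 1, 1, 0, 0, 1, 0]
p = [0.8, 0.3, 0.6, 0.7, 0.2, 0.4, 0.9, 0.1]
print(r2_squared_correlation(y, p))   # compare with R^2(O) from the previous sketch
```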
Tjur R-Squared
Finally, we consider the Tjur R-Squared, R^2(Tjur). Since it is not so easy to define R^2(Tjur) by a formula, we refer to Tjur (2009) and Allison (2013), (2014) for a verbal definition. Actually, this measure was introduced earlier in Cramer (1999), so it would be fair to use the double name Cramer–Tjur. According to Allison (2013), the measure has a lot of intuitive appeal, it is easy to calculate, and it is also closely related to the R-Squared measures based on sums of squares:
R^2(Tjur) = (R^2(O) + R^2(mod)) / 2 (8)
R^2(Tjur) = (R^2(mod) * R^2(cor))^0.5 (9)
Following Allison (2013), we propose to
use R^2(Tjur) as an
attractive alternative and as a good supplement to R^2(MF)
(the best R-Squared
based on likelihoods) and R^2(O) (our preferred choice in the category of OLS
analogs).
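Operationally, following Tjur (2009) and Allison (2013), R^2(Tjur) is the difference between the average fitted probability among the events (yi = 1) and among the non-events (yi = 0). The sketch below (ours, with made-up data) computes it that way and checks relation (8); the relation holds exactly when the fitted probabilities average to ȳ, as they do for a maximum-likelihood logistic fit with an intercept.

```python
import numpy as np

def tjur_r2(y, p):
    """Tjur's coefficient of discrimination: mean fitted probability among events
    minus mean fitted probability among non-events."""
    y, p = np.asarray(y), np.asarray(p, float)
    return p[y == 1].mean() - p[y == 0].mean()

# toy data; the fitted probabilities average to ybar, as they would for a
# logistic fit with an intercept -- the condition under which (8) holds exactly
y = np.array([1, 0, 1, 1, 0, 0, 1, 0])
p = np.array([0.8, 0.3, 0.6, 0.7, 0.2, 0.4, 0.9, 0.1])

ybar = y.mean()
sst = np.sum((y - ybar) ** 2)
r2_o = 1.0 - np.sum((y - p) ** 2) / sst
r2_mod = np.sum((p - ybar) ** 2) / sst
print(tjur_r2(y, p), (r2_o + r2_mod) / 2.0)   # relation (8): both give the same value
```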
Conclusion
Let us return to the DeMaris quotation: “… It may not be prudent to rely on only one measure for assessing predictive efficacy - particularly in view of the lack of consensus on which measure is most appropriate. Perhaps the best strategy is to report more than one measure for any given analysis”. Following this advice, we propose to report three measures together: R^2(MF), R^2(O) and R^2(Tjur), hoping that they make up a good “three pillars of R-Squared measures” and a good pathway out of “the jungle of R-Squareds” for binary logistic regression.
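To close with a practical illustration (our own sketch, not part of the cited papers): the snippet below fits a logistic model with statsmodels on simulated data and reports the proposed trio. The data, coefficients and variable names are made up; it assumes statsmodels’ Logit results, which expose the fitted (llf) and null (llnull) log-likelihoods.

```python
import numpy as np
import statsmodels.api as sm

# simulate a small binary data set (hypothetical coefficients)
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=(n, 2))
eta = -0.5 + 1.0 * x[:, 0] - 0.8 * x[:, 1]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

model = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
p = model.predict()

# the three "pillars"
r2_mf = 1.0 - model.llf / model.llnull            # McFadden (same as model.prsquared)
sst = np.sum((y - y.mean()) ** 2)
r2_o = 1.0 - np.sum((y - p) ** 2) / sst           # Efron / sum-of-squares
r2_tjur = p[y == 1].mean() - p[y == 0].mean()     # Tjur's coefficient of discrimination

print(f"R2(MF) = {r2_mf:.3f}, R2(O) = {r2_o:.3f}, R2(Tjur) = {r2_tjur:.3f}")
```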
References:
Allison, P. D. (2013) “What’s the best R-squared for logistic regression?” (http://www.statisticalhorizons.com/r2logistic).
Allison, P. D. (2014) “Measures of fit for logistic regression” (https://support.sas.com/resources/papers/proceedings14/1485-2014.pdf).
Cameron, A. C. and Windmeijer, F. A. G. (1997) “An R-squared measure of goodness of fit for some common nonlinear regression models”, Journal of Econometrics, Vol. 77, No. 2, pp. 329-342.
Cox, D. R. and Snell, E. J. (1989) “Analysis of binary data”, Second Edition, Chapman & Hall, London.
Cramer, J. S. (1999) “Predictive performance of the binary logit model in unbalanced samples”, The Statistician, 48, 85-94.
DeMaris, A. (1992) “Logit modeling”, Sage University Paper Series.
Efron, B. (1978) “Regression and ANOVA with zero-one data: Measures of residual variation”, Journal of the American Statistical Association, 73, 113-121.
Hosmer, D. W. and Lemeshow, S. (1989) “Applied logistic regression”, New York: Wiley.
Kent, J. T. (1983) “Information gain and a general measure of correlation”, Biometrika, 70, 163-173.
Menard, S. (2000) “Coefficients of determination for multiple logistic regression analysis”, The American Statistician, 54, 17-24.
Menard, S. (2002) “Applied logistic regression analysis”, Sage University Paper Series (Second edition).
Menard, S. (2010) “Logistic regression: From introductory to advanced concepts and applications”, Sage University Paper Series, Chapter 3, pp. 48-62.
McFadden, D. (1974) “Conditional logit analysis of qualitative choice behavior”, pp. 105-142 in Zarembka (ed.), Frontiers in Econometrics, Academic Press.
Mittlbock, M. and Schemper, M. (1996) “Explained variation in logistic regression”, Statistics in Medicine, 15, 1987-1997.
Sharma, D. and McGee, D. (2008) “Estimating proportion of explained variation for an underlying linear model using logistic regression analysis”, Journal of Statistical Research, 42, 59-69.
Shtatland, E. S., Moore, S. and Barton, M. B. (2000) “Why we need R^2 measure of fit (and not only one) in PROC LOGISTIC and PROC GENMOD”, SUGI 2000 Proceedings, Paper 256-25, Cary, NC: SAS Institute Inc.
Shtatland, E. S., Kleinman, K. and Cain, E. M. (2002) “One more time about R^2 measures of fit in logistic regression”, NESUG 2002 Proceedings.
Smith, T. J. and McKenna, C. M. (2013) “A comparison of logistic regression pseudo R^2 indices”, Multiple Linear Regression Viewpoints, 39, 17-26.
Theil, H. and Chung, C. F. (1988) “Information-theoretic measures of fit for univariate and multivariate linear regressions”, The American Statistician, 42, 249-252.
Tjur, T. (2009) “Coefficients of determination in logistic regression models – a new proposal: the coefficient of discrimination”, The American Statistician, 63, 366-372.
Walker, D. A. and Smith, T. J. (2016) “Nine pseudo R^2 indices for binary logistic regression models”, Journal of Modern Applied Statistical Methods, 15, 848-854.
Windmeijer, F. A. G. (1995) “Goodness of fit measures in binary choice models”, Econometric Reviews, 14, 101-116.