Sunday, August 9, 2020

Logistic Regression and Information Theory: Part 4 - How Good Is Our Model?


After estimating logistic regression coefficients (usually by the maximum likelihood approach) we inevitably come to the question: How good is our model? According to Allison (2014):

“There are two very different approaches to answering this question. One is to get a statistic that measures how well you can predict the dependent variable based on the independent variables. I’ll refer to these kinds of statistics as measures of predictive power . . . Predictive power statistics available in PROC LOGISTIC include R-square, the area under the ROC curve, and several rank-order correlations . . . The other approach to evaluating model fit is to compute a goodness-of-fit statistic. With PROC LOGISTIC, you can get the deviance, the Pearson chi-square, or the Hosmer-Lemeshow tests. These are formal tests of the null hypothesis that the fitted model is correct . . . What many researchers fail to realize is that measures of predictive power and goodness-of-fit statistics are testing very different things. It’s not at all uncommon for models with very high R-squares to produce unacceptable goodness-of-fit statistics. And conversely, models with very low R-squares, can fit the data very well according to goodness-of-fit tests.”

Thus, to answer the question "How good is our model?", we first have to be clear about the purpose of the model. It is known that any regression model (including logistic regression) serves two primary goals:

(1) to predict the dependent variable (aka response or output variable) from the corresponding independent variables (aka predictors); this is a practical goal, which can be pursued by a statistician or statistical programmer;
(2) to explain the relationship between the response variable and the independent (explanatory) variables; this is rather a theoretical goal, requiring subject-matter knowledge.

In this blogpost, we limit ourselves to the measures of predictive power mentioned above, more specifically to R2 measures based on likelihood.

Basic notations


Given a sample of observations or subjects (yi, Xi), i=1,…, n, let:

(1) yi denotes the dependent class variable with values ‘1’ (EVENT) or ‘0’ (NONEVENT), where EVENT can mean Disease, Buy, Vote, Out-of-Order, etc., and correspondingly NONEVENT means No-Disease, No-Buy, No-Vote, In-Order, etc.;

(2) Xi denotes a vector of independent variables (x1, x2,…, xk), or covariates, or predictors, which can be understood as the attributes of the i-th subject, such as age, education, test results and so on, depending on context;

(3) The class variable yi and the vector of attributes Xi are related by the equation

ln[P(yi = 1) / (1 – P(yi = 1))] = b0 + b1x1 + … + bkxk       (1)

or equivalently

P(yi = 1) = exp(b0 + b1x1 + … + bkxk) / (1 + exp(b0 + b1x1 + … + bkxk))       (1’)

where the coefficients b0, b1, b2, …, bk have to be estimated (usually by the maximum likelihood method);

(4) The estimates b*0, b*1, …, b*k, plugged into (1’), provide us with an estimate of P(yi = 1); based on this estimate we have to predict the actual value of yi, which is still unknown at this point;

(5) The likelihood of the logistic regression model (1) is given by equation

L(y1, y2, …, yn; X1, X2, …, Xn) = Π_{i=1}^{n} P(yi | Xi)       (2)

Here, we have dropped the vector of parameters b = (b0, b1, b2, …, bk) for notational brevity.

(6) The logarithm of the likelihood, which is maximized over the coefficients, is as follows

lnL(y1, y2, …, yn; X1, X2, …, Xn) = Σ_{i=1}^{n} ln P(yi | Xi) =

= Σ_{i=1}^{n} [yi ln P(yi = 1 | Xi) + (1 – yi) ln(1 – P(yi = 1 | Xi))]       (3)

(A minimal numerical sketch of equations (1’), (2) and (3) is given right after this list.)
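The pieces above are easy to compute directly. Below is a minimal numpy sketch of equations (1’), (2) and (3); the coefficients and the tiny sample are made up purely for illustration, and numpy is assumed to be available.

```python
import numpy as np

# Hypothetical coefficients b0, b1, b2 and a small made-up sample,
# used only to illustrate equations (1'), (2) and (3).
b = np.array([-1.0, 0.8, 0.5])              # b0 (intercept), b1, b2
X = np.array([[1.0, 0.2, 1.3],              # each row: [1, x1, x2] for one subject
              [1.0, 1.5, 0.1],
              [1.0, 0.3, 0.4]])
y = np.array([0, 1, 0])                     # observed classes y_i

# Equation (1'): P(y_i = 1 | X_i) = exp(Xb) / (1 + exp(Xb))
p = 1.0 / (1.0 + np.exp(-X @ b))

# Equation (3): log-likelihood = sum_i [ y_i ln p_i + (1 - y_i) ln(1 - p_i) ]
loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Equation (2): the likelihood itself is the product of P(y_i | X_i)
lik = np.exp(loglik)
print(p, loglik, lik)
```

In practice the coefficients would, of course, come from a maximum likelihood fit rather than being fixed by hand.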

R2 measures based on likelihood


There are several R-Squared measures based on likelihood (see Shtatland (2018)). The most popular are R2MF (McFadden R2) and R2CS (Cox–Snell R2), defined as follows:

R2MF = 1 – lnL(M) / lnL(0) = (lnL(0) – lnL(M)) / lnL(0) = (lnL(M) – lnL(0))/(-lnL(0))       (4)

R2CS = 1 – [L(0) / L(M)]^(2/n)       (5)

Here, L(M) and L(0) are the maximized likelihoods of the model M with all available predictors (x1, x2, …, xk) and of the null model ”0” without predictors (intercept only), respectively.

It is interesting that R2CS can be expressed as a simple function of R2MF:

R2CS = 1 – exp(-2H·R2MF)       (6)

where

H(ŷ) = -[ŷ ln ŷ + (1 – ŷ) ln(1 – ŷ)] and ŷ = (Σ_{i=1}^{n} yi)/n       (7)

Note that the quantity H is nothing but the entropy of the Bernoulli distribution with parameter ŷ. Equations (6) and (7) were originally derived in Shtatland et al (2000, 2002). It is easy to get similar formulas in terms of R2MF and H for other R2 measures based on likelihood (see, for example, Sharma (2006), Sharma and McGee (2008), Sharma et al (2011), and Shtatland (2018)), but they are rather bulky and difficult to interpret. Besides, those R2 measures are not as popular as R2MF and R2CS.
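As an illustration, here is a short Python sketch that computes R2MF and R2CS from formulas (4) and (5) and checks relation (6)-(7) numerically. The simulated data and true coefficients are arbitrary, and statsmodels is assumed to be available.

```python
import numpy as np
import statsmodels.api as sm

# Simulated data purely for illustration; the true coefficients are arbitrary.
rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=(n, 2))
eta = -0.5 + 1.0 * x[:, 0] + 0.7 * x[:, 1]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

res = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
lnL_M = res.llf      # maximized log-likelihood of model M with predictors
lnL_0 = res.llnull   # maximized log-likelihood of the null (intercept-only) model

# Equations (4) and (5)
R2_MF = 1.0 - lnL_M / lnL_0
R2_CS = 1.0 - np.exp((2.0 / n) * (lnL_0 - lnL_M))

# Equations (6) and (7): R2_CS as a function of R2_MF and the Bernoulli entropy H(y_bar)
y_bar = y.mean()
H = -(y_bar * np.log(y_bar) + (1.0 - y_bar) * np.log(1.0 - y_bar))
R2_CS_via_MF = 1.0 - np.exp(-2.0 * H * R2_MF)

print(R2_MF, R2_CS, R2_CS_via_MF)    # the last two should agree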

R2MF and R2CS - information interpretation


Note also that an information-theoretic interpretation of R2 measures in logistic regression is one of the most desirable properties, on a level with such standard properties as: (1) 0 ≤ R2 ≤ 1; and (2) R2 is nondecreasing as predictors are added (see, for example, Cameron and Windmeijer (1997) and references therein). We have been in this “R2 information interpretation” camp since 2000: Shtatland et al (2000), Shtatland et al (2002), Shtatland (2018). It has been shown in these publications that both R2MF and R2CS have an explicit and direct interpretation in terms of the information content of the data, the potentially recoverable information, and the information gain due to added predictors. This interpretation is based on the fact that the negative log-likelihood can be understood as the information contained in the observed data. As a result, we can interpret the quantity (-lnL(0)) for the model without predictors as the potentially recoverable information in the data (yi, i=1,…, n), and the quantity (-lnL(M)) for the model with predictors as the information content of the enlarged data (yi, Xi, i=1,…, n). Thus, it is natural to interpret the quantity lnL(M) – lnL(0) as the information gain (IG) due to switching from the null model with intercept only to the current model with the available predictors. Note that R2MF, as the ratio of the information gain IG to all potentially recoverable information, shows not only how good model M is compared to the null model, but also how bad it is compared to the saturated model.

R2CS and R2MF scales vs information scale


After obtaining the information interpretation of R2CS and R2MF, it is interesting to compare their scales with the information scale. As to R2MF, the situation is simple. Formula (4) can be re-written as

R2MF = (lnL(M) – lnL(0))/(-lnL(0)) = [(1/n)(lnL(M) – lnL(0))]/[-(1/n)lnL(0)] = IGR/H       (8)

where the information gain rate IGR = (1/n)(lnL(M) – lnL(0)) is the information gain per observation, and H = -(1/n)lnL(0) = H(ŷ) is the estimate of the entropy of the available data; H does not depend on (Xi, i = 1, 2, …, n) and can be treated as a constant. Thus the R2MF scale is the same as the information scale (up to this constant); that is, the information capacity per unit of R2MF is the same along the whole interval 0 ≤ R2MF ≤ 1. As to R2CS, the situation is more sophisticated and interesting. Indeed, let us transform equation (5),

R2CS = 1 – [L(0) / L(M)]^(2/n)

which defines R2CS, into

-ln(1 - R2CS) = 2(lnL(M) – lnL(0))/n = 2·IG/n = 2·IGR       (9)

Note that there is an alternative interpretation of IG/n for large n: its limit as n → ∞ can be interpreted as the entropy loss associated with using model M with predictors instead of the null model “0” without predictors. The logarithmic function -ln(1 – x) is a well-known measure from information theory (Theil and Chung (1988)), and using -ln(1 - R2CS) rather than R2CS itself provides a more natural scale for interpretation. Solving equation (9) for IGR and differentiating with respect to R2CS, we have

dIGR/dR2CS = 1/[2(1 - R2CS)]       (10)

From this formula we arrive at the following important conclusions:

If R2CS = 0, then dIGR/dR2CS = 1/2;

If R2CS = 0.5, then dIGR/dR2CS = 1, i.e. a given increase in R-Square corresponds to twice as large an increase in IGR as it does near R2CS = 0;

If R2CS = 0.75 (the maximal value possible), then dIGR/dR2CS = 2, i.e. a given increase in R-Square corresponds to four times as large an increase in IGR as it does near R2CS = 0.
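A quick numerical check of equations (9) and (10), in plain Python, with the three values of R2CS chosen to match the cases above:

```python
import numpy as np

# For a few values of R2_CS, recover IGR from equation (9) and the slope
# dIGR/dR2_CS from equation (10).
for r2_cs in (0.0, 0.5, 0.75):
    igr = -0.5 * np.log(1.0 - r2_cs)        # IGR = -(1/2) ln(1 - R2_CS)
    slope = 1.0 / (2.0 * (1.0 - r2_cs))     # dIGR/dR2_CS
    print(f"R2_CS = {r2_cs:.2f}   IGR = {igr:.4f}   dIGR/dR2_CS = {slope:.2f}")
```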

Asymptotic behavior of R2MF and R2CS


Note that all the characteristics of logistic regression discussed above (likelihoods, maximized likelihoods, significance tests, and R2 measures, including R2MF and R2CS) are statistics and thus random quantities. Some of these statistics depend on (yi, i=1,…, n) only; others depend on (yi, Xi, i=1,…, n). It is very interesting and important to study their asymptotic behavior as n → ∞. The driving force in our analysis is the Law of Large Numbers (LLN). Indeed,

(1) ŷ = (Σ_{i=1}^{n} yi)/n → π (the unknown true population proportion) as n → ∞ (here and below the symbol “→” means convergence in probability);

(2) -lnL(0)/n = H(ŷ) = -[ŷ ln ŷ + (1 – ŷ) ln(1 – ŷ)] → -[π ln π + (1 – π) ln(1 – π)] = H(π) as n → ∞, where H(π) is the entropy of the Bernoulli distribution with parameter π and H(ŷ) is the estimate of H(π);

(3) It can be shown that as n → ∞, the quantity (1/n)(lnL(M) – lnL(0)) converges to some nonrandom finite limit (see Hu, Shao and Palta (2006)).

It is reasonable to interpret this limiting value of (1/n) (lnL(M) – lnL(0)) as an averaged information gain or entropy loss due to switching from the null model with intercept only to the current model with available predictors. A natural notation for this limit is

(1/n)(lnL(M) – lnL(0)) → H(0) – H(M) as n → ∞

where H(M) = -lim (1/n)lnL(M) is the averaged entropy associated with model M and H(0) = -lim (1/n)lnL(0) is the averaged entropy associated with the null model. As we have seen above, H(0) = H(π) = H(Y). Thus, we have

R2MF → 1 - H(M) / H(0)       (11)

R2CS → 1 - exp[2(H(M) - H(0))] = 1 - exp[-2(H(0) - H(M))]       (12)

Formula (12) was derived in Hu, Shao and Palta (2006) using a somewhat different approach and notation.

When assessing the predictive strength of a model on large data sets, the quantities R2MF and R2CS should be treated as estimators of their limiting values (11) and (12).
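This stabilization can be seen in a small simulation. The sketch below refits the same logistic model on progressively larger samples and prints R2MF and R2CS, which settle down to stable limiting values as n grows; statsmodels is assumed to be available and the true coefficients are arbitrary.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

def pseudo_r2(n):
    # Generate a sample of size n from an arbitrary "true" logistic model,
    # fit the model, and return (R2_MF, R2_CS) from formulas (4) and (5).
    x = rng.normal(size=(n, 1))
    p = 1.0 / (1.0 + np.exp(-(-0.3 + 1.2 * x[:, 0])))
    y = rng.binomial(1, p)
    res = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
    r2_mf = 1.0 - res.llf / res.llnull
    r2_cs = 1.0 - np.exp((2.0 / n) * (res.llnull - res.llf))
    return r2_mf, r2_cs

for n in (200, 2000, 20000, 200000):
    print(n, pseudo_r2(n))
```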

References


Allison, P.D. (2014) “Measures of fit for logistic regression”. https://support.sas.com/resources/papers/proceedings14/1485-2014.pdf

Cameron, C. A. and Windmeijer, F. A. G. (1997) "An R-squared measure of goodness of fit for some common nonlinear regression models", Journal of Econometrics, Vol. 77, No.2, pp. 329-342.

Cox, D. R. and Snell E. J. (1989) “Analysis of binary data” 2nd Edition, Chapman & Hall, London.

Hosmer, D.W. and Lemeshow, S. (1989) “Applied Logistic Regression”. New York: Wiley.

Hu B., Shao J. and Palta M. (2006) “Pseudo-R2 in Logistic Regression Model” Statistica Sinica, 16, 847- 860.

McFadden, D. (1974), “Conditional logit analysis of qualitative choice behavior”, pp. 105 -142 in Zarembka (ed.), Frontiers in Econometrics. Academic Press.

Sharma, D.R. (2006) “Logistic Regression, Measures of Explained Variation and the Base Rate Problem”, Ph.D. Thesis, Florida State University, USA.

Sharma, D. and McGee, D. (2008) “Estimating proportion of explained variation for an underlying linear model using logistic regression analysis” J. Stat. Res., 42, No. 1, pp. 59-69.

Sharma, D., McGee, D., and Golam Kibria, B.M. (2011) “Measures of Explained Variation and the Base-Rate Problem for Logistic Regression”, American Journal of Biostatistics, 2 (1): 11-19.

Shtatland, E. S., Moore, S. and Barton, M. B. (2000) “Why we need R^2 measure of fit (and not only one) in PROC LOGISTIC and PROC GENMOD”, SUGI 2000 Proceedings, Paper 256 - 25, Cary, SAS Institute Inc.

Shtatland, E. S., Kleinman, K. and Cain, E. M. (2002) “One more time about R^2 measures of fit in logistic regression”, NESUG 2002 Proceedings.

Shtatland, E.S. (2018) “Do we really need more than one R-Squared in logistic regression?” http://statisticalmiscellany.blogspot.com/.

Theil, H. and Chung, C. F. (1988) “Information-theoretic measures of fit for univariate and multivariate linear regressions”, The American Statistician, 42, No 4, 249 – 252.

Windmeijer, F.A.G. (1995) “Goodness-of-fit measures in binary choice models”, Econometric Reviews, 14, 101-116.
