Sunday, August 9, 2020

Logistic Regression and Information Theory: Part 4 -

How Good Is Our Model?

After estimating logistic regression coefficients, usually by maximum likelihood, we inevitably come to the question: How good is our model? According to Allison (2014):

“There are two very different approaches to answering this question. One is to get a statistic that measures how well you can predict the dependent variable based on the independent variables. I’ll refer to these kinds of statistics as measures of predictive power . . . Predictive power statistics available in PROC LOGISTIC include R-square, the area under the ROC curve, and several rank-order correlations . . . The other approach to evaluating model fit is to compute a goodness-of-fit statistic. With PROC LOGISTIC, you can get the deviance, the Pearson chi-square, or the Hosmer-Lemeshow tests. These are formal tests of the null hypothesis that the fitted model is correct . . . What many researchers fail to realize is that measures of predictive power and goodness-of-fit statistics are testing very different things. It’s not at all uncommon for models with very high R-squares to produce unacceptable goodness-of-fit statistics. And conversely, models with very low R-squares, can fit the data very well according to goodness-of-fit tests.”

Thus, to answer the question "How good is our model?", we must first be clear about what the model is for. Any regression model (including logistic regression) has two primary goals:

(1) to predict the dependent variable (aka response or output variable) from the corresponding independent variables (aka predictors); this is a practical goal, which can be achieved by a statistician or statistical programmer;
(2) to explain the relationship between the response variable and the independent (explanatory) variables; this is a more theoretical goal, requiring subject-matter knowledge.

In this blogpost, we limit ourselves to the measures of predictive power mentioned above, more specifically to R2 based on likelihood.

Basic notations

Given a sample of observations or subjects (yi, Xi), i=1,…, n, let:

(1) yi denote the dependent class variable with values ‘1’ (EVENT) or ‘0’ (NONEVENT), where EVENT can stand for Disease, Buy, Vote, Out-of-Order, etc., and NONEVENT correspondingly for No-Disease, No-Buy, No-Vote, In-Order, etc.;

(2) Xi denote a vector of independent variables (x1, x2,…, xk), also called covariates or predictors, which can be understood as the attributes of the i-th subject, such as age, education, test results and so on, depending on context;

(3) The class variable yi and the vector of attributes Xi are related by the equation

$$\ln\left[\frac{P(y_i = 1)}{1 - P(y_i = 1)}\right] = b_0 + b_1 x_1 + \dots + b_k x_k \qquad (1)$$

or equivalently

$$P(y_i = 1) = \frac{\exp(b_0 + b_1 x_1 + \dots + b_k x_k)}{1 + \exp(b_0 + b_1 x_1 + \dots + b_k x_k)} \qquad (1')$$

where the coefficients b0, b1, b2, …, bk have to be estimated (usually by the maximum likelihood method);

(4) The estimates b*0, b*1, …, b*k, substituted into (1’), give an estimate of P(yi = 1); based on this estimate we have to predict the actual value of yi, which is still unknown at this point;

(5) The likelihood of the logistic regression model (1) is given by the equation

$$L(y_1, \dots, y_n; X_1, \dots, X_n) = \prod_{i=1}^{n} P(y_i \mid X_i) \qquad (2)$$

Here, we have dropped the vector of parameters b = (b0, b1, b2, …, bk) for notational brevity.

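(6) The logarithm of the maximized likelihood is

$$\ln L(y_1, \dots, y_n; X_1, \dots, X_n) = \sum_{i=1}^{n} \ln P(y_i \mid X_i) = \sum_{i=1}^{n} \left[ y_i \ln P(y_i = 1 \mid X_i) + (1 - y_i) \ln\bigl(1 - P(y_i = 1 \mid X_i)\bigr) \right] \qquad (3)$$

where P(yi = 1 | Xi) is given by (1’) with the estimates b*0, b*1, …, b*k substituted.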
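Items (3) to (6) can be sketched numerically. The snippet below is only an illustration under stated assumptions: the toy data, the coefficient vector `b`, and the helper names `predicted_probability` and `log_likelihood` are hypothetical, and the coefficients are chosen by hand rather than estimated by maximum likelihood.

```python
import numpy as np

def predicted_probability(X, b):
    """P(y_i = 1 | X_i) from equation (1'); b[0] is the intercept b0."""
    eta = b[0] + X @ b[1:]              # linear predictor b0 + b1*x1 + ... + bk*xk
    return 1.0 / (1.0 + np.exp(-eta))   # same as exp(eta) / (1 + exp(eta))

def log_likelihood(y, p):
    """Equation (3): sum of y_i*ln(p_i) + (1 - y_i)*ln(1 - p_i)."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

# Hypothetical toy data: four subjects, one predictor.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
b = np.array([-1.5, 1.0])   # illustrative coefficients, NOT fitted estimates

p = predicted_probability(X, b)
ll = log_likelihood(y, p)
print(p, ll)
```

The log-likelihood is always negative for probabilities strictly between 0 and 1, which is why (–lnL) is the natural nonnegative quantity to work with later on.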

R2 measures based on likelihood

There are several R-Squared measures based on likelihood (see Shtatland (2018)). The most popular are R2MF (McFadden R2) and R2CS (Cox-Snell R2), defined as follows:

R2MF = 1 – lnL(M) / lnL(0) = (lnL(0) – lnL(M)) / lnL(0) = (lnL(M) – lnL(0))/(-lnL(0))       (4)

R2CS = 1 – [L(0) / L(M)]^(2/n)       (5)

Here, L(M) and L(0) are the maximized likelihoods of the model M with all available predictors (x1, x2,…, xk) and of the null model "0" without predictors, respectively.
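Definitions (4) and (5) translate directly into code once the two maximized log-likelihoods are known. The following is a minimal sketch; the function names and the numeric values of lnL(0), lnL(M) and n are hypothetical and serve only to show the arithmetic.

```python
import math

def r2_mcfadden(ll_model, ll_null):
    """Equation (4): R2MF = 1 - lnL(M) / lnL(0)."""
    return 1.0 - ll_model / ll_null

def r2_cox_snell(ll_model, ll_null, n):
    """Equation (5) on the log scale: R2CS = 1 - exp(2 * (lnL(0) - lnL(M)) / n)."""
    return 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)

# Hypothetical maximized log-likelihoods for a sample of n = 100 observations.
n = 100
ll_null = -69.0    # lnL(0), intercept-only model
ll_model = -55.0   # lnL(M), model with predictors

r2_mf = r2_mcfadden(ll_model, ll_null)
r2_cs = r2_cox_snell(ll_model, ll_null, n)
print(round(r2_mf, 4), round(r2_cs, 4))
```

Working with log-likelihoods rather than the likelihoods themselves avoids numerical underflow, since L(0) and L(M) are products of n probabilities.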

It is interesting that R2CS can be expressed as a simple function of R2MF:

R2CS = 1 – exp(–2·H·R2MF)       (6)

where

H = H(ŷ) = –[ŷ ln ŷ + (1 – ŷ) ln(1 – ŷ)] and ŷ = (1/n) Σ_{i=1}^{n} yi       (7)

Note that the quantity H is nothing but the entropy of the Bernoulli distribution with parameter ŷ. Equations (6) and (7) were originally derived in Shtatland et al (2000) and (2002). It is easy to obtain similar formulas in terms of R2MF and H for other likelihood-based R2 measures (see, for example, Sharma (2006), Sharma and McGee (2008), Sharma et al (2011), and Shtatland (2018)), but they are rather bulky and difficult to interpret. Besides, those R2 measures are not as popular as R2MF and R2CS.
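The relation between R2CS and R2MF can be checked numerically. The sketch below assumes a small hypothetical data set; the vector `p` stands in for fitted probabilities and is made up rather than estimated, which is harmless here because the relation is a purely algebraic consequence of –lnL(0) = n·H(ŷ) and holds for any value of lnL(M).

```python
import numpy as np

def bernoulli_entropy(q):
    """Entropy (in nats) of a Bernoulli distribution with parameter q."""
    return float(-(q * np.log(q) + (1.0 - q) * np.log(1.0 - q)))

# Hypothetical outcomes and stand-in "fitted" probabilities (not real estimates).
y = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1], dtype=float)
p = np.array([0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.3, 0.6, 0.9, 0.7])

n = y.size
y_bar = y.mean()                 # sample proportion of events
H = bernoulli_entropy(y_bar)     # entropy of the sample proportion

# The null model predicts y_bar for every observation.
ll_null = float(np.sum(y * np.log(y_bar) + (1 - y) * np.log(1 - y_bar)))
ll_model = float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

r2_mf = 1.0 - ll_model / ll_null
r2_cs_direct = 1.0 - np.exp(2.0 * (ll_null - ll_model) / n)   # definition of R2CS
r2_cs_via_mf = 1.0 - np.exp(-2.0 * H * r2_mf)                 # via entropy and R2MF

print(r2_cs_direct, r2_cs_via_mf)
```

The two printed values agree up to floating-point rounding.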

R2MF and R2CS - information interpretation

An information-theoretic interpretation is one of the most desirable properties of an R2 measure in logistic regression, on a level with such standard properties as (1) 0 ≤ R2 ≤ 1 and (2) R2 is nondecreasing as predictors are added (see, for example, Cameron and Windmeijer (1997) and references therein). We have been in this "R2 information interpretation" camp since 2000: Shtatland et al (2000), Shtatland et al (2002), Shtatland (2018). These publications show that both R2MF and R2CS have an explicit and direct interpretation in terms of the information content of the data, the potentially recoverable information, and the information gain due to added predictors.

This interpretation rests on the fact that the negative log-likelihood can be understood as the information contained in the observed data. As a result, we can interpret the quantity (–lnL(0)) for the model without predictors as the potentially recoverable information in the data (yi, i=1,…, n), and the quantity (–lnL(M)) for the model with predictors as the information content of the enlarged data (yi, Xi, i=1,…, n). Thus, it is natural to interpret the quantity lnL(M) – lnL(0) as the information gain (IG) due to switching from the null model with intercept only to the current model with the available predictors. Note that R2MF, as the ratio of the information gain IG to all potentially recoverable information, shows not only how good model M is compared to the null model, but also how bad it is compared to the saturated model.

R2CS and R2MF scales vs information scale

Having obtained an information interpretation for R2CS and R2MF, it is interesting to compare their scales with the information scale. For R2MF, the situation is simple. Formula (4) can be rewritten as

R2MF = (lnL(M) – lnL(0))/(-lnL(0)) =
= [(1/n) (lnL(M) – lnL(0))]/[- (1/n)lnL(0)] = IGR/H       (8)

where (–lnL(0)) does not depend on (Xi, i = 1, 2, …, n) and can be treated as a constant; the information gain rate, IGR = (1/n)(lnL(M) – lnL(0)), is the information gain per observation; and H = –(1/n)lnL(0) = H(ŷ) is the estimate of the entropy in the available data. The R2MF scale is therefore the same as the information scale (up to this constant); that is, the information capacity per R2MF unit is the same along the whole interval 0 ≤ R2MF ≤ 1. For R2CS, the situation is more subtle and more interesting. Indeed, let us transform equation (5)

R2CS = 1 – [L(0) / L(M)]^(2/n)

which defines R2CS, into

–ln (1 - R2CS) = 2(lnL(M) – lnL(0))/n =2 IG /n = 2IGR       (9)
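The step from (5) to (9) is short enough to write out. A sketch using only the definitions above, with IG = lnL(M) – lnL(0) and IGR = IG/n:

```latex
\begin{aligned}
1 - R^2_{CS} &= \left[\frac{L(0)}{L(M)}\right]^{2/n} \\
\ln\bigl(1 - R^2_{CS}\bigr) &= \frac{2}{n}\bigl(\ln L(0) - \ln L(M)\bigr) \\
-\ln\bigl(1 - R^2_{CS}\bigr) &= \frac{2}{n}\bigl(\ln L(M) - \ln L(0)\bigr) = \frac{2\,IG}{n} = 2\,IGR
\end{aligned}
```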

Note that there is an alternative interpretation of (IG/n) as n → ∞. Indeed, this limit can be interpreted as entropy loss associated with using model M with predictors instead of the null model 0 without predictors. The logarithmic function -ln(1 – x) is a well-known measure from information theory (Theil and Chung (1988)). Using -ln(1 - R2CS) rather than R2CS provides a more natural scale for interpretation. Taking the first derivative of both sides of equation (9) with respect to R2CS, we have

d(2·IGR)/dR2CS = 1 / (1 – R2CS), or equivalently dIGR/dR2CS = 1 / [2(1 – R2CS)]       (10)

From this formula we arrive at the following important conclusions, stated on the 2·IGR = –ln(1 – R2CS) scale:

If R2CS = 0, then d(2·IGR)/dR2CS = 1, i.e. an increase in R-Square corresponds to an equal increase in 2·IGR;

If R2CS = 0.5, then d(2·IGR)/dR2CS = 2, i.e. the increase in 2·IGR is twice as large as the increase in R-Square;

If R2CS = 0.75 (the maximal value possible, attained when ŷ = 0.5), then d(2·IGR)/dR2CS = 4, i.e. the increase in 2·IGR is four times as large as the increase in R-Square.
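These slopes, and the 0.75 ceiling, can be checked numerically from equation (9) alone. The sketch below differentiates 2·IGR = –ln(1 – R2CS) by central differences; the helper names and the step size `h` are illustrative choices. The ceiling follows from R2CS ≤ 1 – exp(–2H(ŷ)) with H(ŷ) at most ln 2, which is reached at ŷ = 0.5.

```python
import math

def igr_from_r2cs(r2_cs):
    """Equation (9) solved for IGR: IGR = -ln(1 - R2CS) / 2."""
    return -0.5 * math.log(1.0 - r2_cs)

def slope_of_two_igr(r2_cs, h=1e-6):
    """Central-difference estimate of d(2*IGR)/dR2CS at the given point."""
    upper = 2.0 * igr_from_r2cs(r2_cs + h)
    lower = 2.0 * igr_from_r2cs(r2_cs - h)
    return (upper - lower) / (2.0 * h)

# Largest attainable R2CS: H(y_bar) is at most ln 2, reached at y_bar = 0.5.
r2_cs_max = 1.0 - math.exp(-2.0 * math.log(2.0))

slopes = {r: slope_of_two_igr(r) for r in (0.0, 0.5, 0.75)}
for r, s in slopes.items():
    print(f"R2CS = {r:.2f}: numerical slope {s:.4f}, 1/(1 - R2CS) = {1.0 / (1.0 - r):.4f}")
```

The numerical slopes match 1/(1 – R2CS), so equal steps in R2CS near its upper end carry several times more information than the same steps near zero.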

Asymptotic behavior of R2MF and R2CS

Note that all characteristics of logistic regression discussed above (likelihoods, maximized likelihoods, significance tests, and R2 measures, including R2MF and R2CS) are statistics and thus random quantities. Some of these statistics depend on (yi, i=1,…, n) only; others depend on (yi, Xi, i=1,…, n). It is interesting and important to study their asymptotic behavior as n → ∞. The driving force in our analysis is the Law of Large Numbers (LLN). Indeed,

(1) ŷ = (1/n) Σ_{i=1}^{n} yi → π (the unknown true population proportion) as n → ∞ (here and below the symbol → means convergence in probability);

(2) –lnL(0)/n = H(ŷ) = –[ŷ ln ŷ + (1 – ŷ) ln(1 – ŷ)] → –[π ln π + (1 – π) ln(1 – π)] = H(π) as n → ∞, where H(π) is the entropy of the Bernoulli distribution with parameter π and H(ŷ) is the estimate of H(π);

(3) It can be shown that as n → ∞, the quantity (1/n) (lnL(M) – lnL(0)) converges to some nonrandom finite limit (see Hu, Shao and Palta (2006)).

It is reasonable to interpret this limiting value of (1/n) (lnL(M) – lnL(0)) as an averaged information gain or entropy loss due to switching from the null model with intercept only to the current model with available predictors. A natural notation for this limit is

lim (1/n) (lnL(M) – lnL(0)) = H(0) – H(M)

where H(M) = –lim (1/n) lnL(M) is an averaged entropy associated with model M and H(0) = –lim (1/n) lnL(0) is an averaged entropy associated with the null model. As we have seen above, H(0) = H(π) = H(Y). Thus, we have

R2MF → 1 - H(M) / H(0)       (11)

R2CS → 1 – exp(2(H(M) – H(0)))       (12)

Formula (12) was derived in Hu, Shao and Palta (2006) using a somewhat different approach and notation.

The quantities R2MF and R2CS should be treated as estimators of their limiting values when assessing the predictive strength of a model on large data sets.
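A small simulation can illustrate limits (11) and (12). This is a sketch under simplifying assumptions that are not in the text above: there is a single binary predictor x, so the maximum likelihood fit of the model with intercept and x simply reproduces the event proportion within each group and no iterative fitting is needed. The population probabilities, the sample size and the random seed are all hypothetical. In this setting H(M) is the conditional entropy of y given x.

```python
import numpy as np

def h(q):
    """Bernoulli entropy in nats."""
    return -(q * np.log(q) + (1.0 - q) * np.log(1.0 - q))

# Assumed population: P(x = 1) = 0.5, P(y = 1 | x = 0) = 0.2, P(y = 1 | x = 1) = 0.7.
p_x1 = 0.5
p_y_given_x = np.array([0.2, 0.7])
pi = (1.0 - p_x1) * p_y_given_x[0] + p_x1 * p_y_given_x[1]          # marginal P(y = 1)

H0 = h(pi)                                                          # H(0) = H(pi)
HM = (1.0 - p_x1) * h(p_y_given_x[0]) + p_x1 * h(p_y_given_x[1])    # H(M) = H(y | x)
limit_mf = 1.0 - HM / H0                                            # right side of (11)
limit_cs = 1.0 - np.exp(2.0 * (HM - H0))                            # right side of (12)

rng = np.random.default_rng(0)
n = 200_000
x = rng.integers(0, 2, size=n)                       # 0 or 1 with equal probability
y = (rng.random(n) < p_y_given_x[x]).astype(float)   # Bernoulli draws given x

def max_loglik(y_group):
    """Maximized Bernoulli log-likelihood of one group: -m * h(group proportion)."""
    return -y_group.size * h(y_group.mean())

ll_null = max_loglik(y)                                     # intercept-only model
ll_model = max_loglik(y[x == 0]) + max_loglik(y[x == 1])    # intercept + binary x

r2_mf = 1.0 - ll_model / ll_null
r2_cs = 1.0 - np.exp(2.0 * (ll_null - ll_model) / n)
print(f"R2MF: sample {r2_mf:.4f}, limit {limit_mf:.4f}")
print(f"R2CS: sample {r2_cs:.4f}, limit {limit_cs:.4f}")
```

With a sample this large the sample values should sit close to their limits; rerunning with smaller n shows how much they fluctuate as estimators.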


