Sunday, August 9, 2020

Logistic Regression and Information Theory: Part 4 -

How Good Is Our Model?


After estimating the logistic regression coefficients – usually by the maximum likelihood approach – we inevitably come to the question: How good is our model? According to Allison (2014):

“There are two very different approaches to answering this question. One is to get a statistic that measures how well you can predict the dependent variable based on the independent variables. I’ll refer to these kinds of statistics as measures of predictive power . . . Predictive power statistics available in PROC LOGISTIC include R-square, the area under the ROC curve, and several rank-order correlations . . . The other approach to evaluating model fit is to compute a goodness-of-fit statistic. With PROC LOGISTIC, you can get the deviance, the Pearson chi-square, or the Hosmer-Lemeshow tests. These are formal tests of the null hypothesis that the fitted model is correct . . . What many researchers fail to realize is that measures of predictive power and goodness-of-fit statistics are testing very different things. It’s not at all uncommon for models with very high R-squares to produce unacceptable goodness-of-fit statistics. And conversely, models with very low R-squares, can fit the data very well according to goodness-of-fit tests.”

Thus, to answer the question "How good is our model?", we first have to be clear about the purpose. It is known that any regression model (including logistic regression) has two primary goals:

(1) to predict the dependent variable (aka the response or output variable) from the corresponding independent variables (aka predictors); this is a practical goal, which can be pursued by a statistician or statistical programmer;
(2) to explain the relationship between the response variable and the independent (explanatory) variables; this is a more theoretical goal, requiring subject-matter knowledge.

In this blogpost, we limit ourselves to the measures of predictive power mentioned above, more specifically to the R2 measures based on the likelihood.

Basic notations


Given a sample of observations or subjects (yi, Xi), i=1,…, n, let:

(1) yi denote the dependent class variable with values ‘1’ (EVENT) or ‘0’ (NONEVENT), where EVENT can stand for Disease, Buy, Vote, Out-of-Order, etc., and correspondingly NONEVENT for No-Disease, No-Buy, No-Vote, In-Order, etc.;

(2) Xi denote a vector of independent variables (x1, x2, …, xk), also called covariates or predictors, which can be understood as the attributes of the i-th subject, such as age, education, test results and so on, depending on the context;

(3) the class variable yi and the vector of attributes Xi are related by the equation

ln[P(yi = 1) / (1 – P(yi = 1))] = b0 + b1x1 + … + bkxk       (1)

or equivalently

P(yi = 1) = exp(b0 + b1x1 + … + bkxk) / (1 + exp(b0 + b1x1 + … + bkxk))       (1’)

where the coefficients b0, b1, b2, …, bk have to be estimated (usually by the maximum likelihood method);

(4) the estimates b*0, b*1, …, b*k, substituted into (1’), provide us with an estimate of P(yi = 1), and based on this estimate we have to predict the actual value of yi, which is still unknown at this point;

(5) the likelihood of the logistic regression model (1) is given by the equation

L(y1, y2, …, yn; X1, X2, …, Xn) = Π_{i=1}^{n} P(yi | Xi)       (2)

Here, we have dropped the vector of parameters b = (b0, b1, b2, …, bk) for notational brevity.

(6) the logarithm of the likelihood is as follows (a small numerical sketch of (2)–(3) is given right after this list):

lnL(y1, y2, …, yn; X1, X2, …, Xn) = Σ_{i=1}^{n} ln[P(yi | Xi)] =       (3)

= Σ_{i=1}^{n} [yi ln P(yi = 1 | Xi) + (1 – yi) ln(1 – P(yi = 1 | Xi))]
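To make (2)–(3) concrete, here is a minimal Python sketch that fits a logistic regression and evaluates lnL(M) and lnL(0) directly from the predicted probabilities. The simulated data, the seed, and the use of scikit-learn (with a very large C to approximate unpenalized maximum likelihood) are illustrative assumptions, not part of the original derivation.

```python
# Minimal sketch: computing the log-likelihood (3) of a fitted logistic model.
# The simulated data and all names here are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 2))                        # two predictors x1, x2
eta = -0.5 + 1.0 * X[:, 0] - 0.7 * X[:, 1]         # assumed linear predictor
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))    # 0/1 outcomes

# Fit by (essentially unpenalized) maximum likelihood.
model = LogisticRegression(C=1e10, max_iter=1000).fit(X, y)
p = model.predict_proba(X)[:, 1]                   # estimates of P(y_i = 1 | X_i)

# Equation (3): log-likelihood of the fitted model M.
lnL_M = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Null model "0" (intercept only): P(y_i = 1) is just the sample proportion.
ybar = y.mean()
lnL_0 = n * (ybar * np.log(ybar) + (1 - ybar) * np.log(1 - ybar))

print(lnL_M, lnL_0)   # both negative, lnL(M) >= lnL(0)
```

Since the null model is nested in M, lnL(M) ≥ lnL(0), and the gap lnL(M) – lnL(0) is exactly the information gain discussed below.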

R2 measures based on likelihood


There are several R-squared measures based on the likelihood (see Shtatland (2018)). The most popular are R2MF (McFadden R2) and R2CS (Cox–Snell R2), defined as follows:

R2MF = 1 – lnL(M) / lnL(0) = (lnL(0) – lnL(M)) / lnL(0) = (lnL(M) – lnL(0))/(-lnL(0))       (4)

R2CS = 1 – [L(0) / L(M)]^(2/n)       (5)

Here, L(M) and L(0) are the maximized likelihoods of the model M with all available predictors (x1, x2, …, xk) and of the null model ”0” without predictors, respectively.

It is interesting that R2CS can be expressed as a simple function of R2MF:

R2CS = 1 – exp(–2H·R2MF)       (6)

where

H(ŷ) = –[ŷ ln ŷ + (1 – ŷ) ln(1 – ŷ)] and ŷ = (Σ_{i=1}^{n} yi)/n       (7)

Note that the quantity H is nothing but the entropy of the Bernoulli distribution with parameter ŷ. Equations (6) and (7) were originally derived in Shtatland et al. (2000, 2002). It is easy to obtain similar formulas in terms of R2MF and H for other likelihood-based R2 measures (see, for example, Sharma (2006), Sharma and McGee (2008), Sharma et al. (2011), and Shtatland (2018)), but they are rather bulky and difficult to interpret. Besides, those R2 measures are not as popular as R2MF and R2CS.
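The following sketch computes R2MF and R2CS from the two log-likelihoods and checks identity (6)–(7). The values of n, lnL(0) and lnL(M) are hypothetical placeholders; in practice they would come from a fitted model and its intercept-only counterpart, as in the sketch above.

```python
# Sketch: McFadden and Cox-Snell R2 from log-likelihoods, with a check of
# identity (6)-(7).  The numbers n, lnL_0, lnL_M are illustrative placeholders.
import numpy as np

n = 1000
lnL_0 = -693.15        # null (intercept-only) model, balanced outcome
lnL_M = -580.00        # model with predictors

R2_MF = 1.0 - lnL_M / lnL_0                        # equation (4)
R2_CS = 1.0 - np.exp((2.0 / n) * (lnL_0 - lnL_M))  # equation (5)

H = -lnL_0 / n                                     # entropy estimate H(y-bar), eq. (7)
R2_CS_via_6 = 1.0 - np.exp(-2.0 * H * R2_MF)       # equation (6)

print(round(R2_MF, 4), round(R2_CS, 4), round(R2_CS_via_6, 4))
assert np.isclose(R2_CS, R2_CS_via_6)
```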

R2MF and R2CS - information interpretation


Note also that an information-theoretic interpretation of R2 measures in logistic regression is one of the most desirable properties, on a par with such standard properties as: (1) 0 ≤ R2 ≤ 1; and (2) R2 is nondecreasing as predictors are added (see, for example, Cameron and Windmeijer (1997) and references therein). We have been in this “R2 information interpretation camp” since 2000: Shtatland et al. (2000), Shtatland et al. (2002), Shtatland (2018). It has been shown in these publications that both R2MF and R2CS have an explicit and direct interpretation in terms of the information content of the data, the potentially recoverable information, and the information gain due to added predictors. This interpretation is based on the fact that the negative log likelihood can be understood as the information contained in the observed data. As a result, we can interpret the quantity (–lnL(0)) for the model without predictors as the potentially recoverable information in the data (yi, i=1,…, n), and the quantity (–lnL(M)) for the model with predictors as the information content of the enlarged data (yi, Xi, i=1,…, n). Thus, it is natural to interpret the quantity lnL(M) – lnL(0) as the information gain (IG) due to switching from the null model with intercept only to the current model with the available predictors. Note that R2MF, as the ratio of the information gain, IG, to all potentially recoverable information, shows not only how good model M is compared to the null model, but also how far it falls short of the saturated model.

R2CS and R2MF scales vs information scale


After obtaining the information interpretation for R2CS and R2MF, it is interesting to compare their scales with the information scale. As to R2MF, the situation is simple: formula (4) can be re-written as

R2MF = (lnL(M) – lnL(0))/(–lnL(0)) = [(1/n)(lnL(M) – lnL(0))]/[–(1/n)lnL(0)] = IGR/H       (8)

where (–lnL(0)) does not depend on (Xi, i = 1, 2, …, n) and can be treated as a constant; the information gain rate, IGR, is the information gain per observation; and H = –(1/n)lnL(0) = H(ŷ) is the estimate of the entropy of the outcome in the available data. Thus the R2MF scale is the same as the information scale (up to this constant): the information content per unit of R2MF is the same along the whole interval 0 ≤ R2MF ≤ 1.
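A quick numerical check of the decomposition (8), again with hypothetical log-likelihood values:

```python
# Check of (8): R2_MF = IGR / H, with illustrative values.
import numpy as np

n, lnL_0, lnL_M = 1000, -693.15, -580.00   # hypothetical log-likelihoods
IGR = (lnL_M - lnL_0) / n                  # information gain per observation
H = -lnL_0 / n                             # estimated entropy of the outcome
print(IGR / H, 1.0 - lnL_M / lnL_0)        # the two expressions agree
```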

As to R2CS, the situation is more subtle and interesting. Indeed, transforming equation (5), R2CS = 1 – [L(0) / L(M)]^(2/n), which defines R2CS, we obtain

–ln(1 – R2CS) = 2(lnL(M) – lnL(0))/n = 2·IG/n = 2·IGR       (9)

Note that there is an alternative interpretation of IG/n as n → ∞: this limit can be interpreted as the entropy loss associated with using model M with predictors instead of the null model 0 without predictors. The logarithmic function –ln(1 – x) is a well-known measure from information theory (Theil and Chung (1988)), and using –ln(1 – R2CS) rather than R2CS itself provides a more natural scale for interpretation. Taking the first derivative of both sides of equation (9) with respect to R2CS, we have

d[–ln(1 – R2CS)]/dR2CS = 2·dIGR/dR2CS = 1/(1 – R2CS)       (10)

From this formula we arrive at the following important conclusions:

If R2CS = 0, then the derivative equals 1, i.e., an increase in R-Square produces an equal increase on the information scale –ln(1 – R2CS) = 2·IGR;

If R2CS = 0.5, then the derivative equals 2, i.e., the increase on the information scale is twice as large as the increase in R-Square;

If R2CS = 0.75 (the maximal value possible, attained when the outcome is balanced), then the derivative equals 4, i.e., the increase on the information scale is four times larger than the increase in R-Square. These three points are illustrated numerically in the short sketch below.
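A minimal numerical illustration of this non-linear scale (pure arithmetic, no data required):

```python
# Sketch of the non-linear information scale behind R2_CS: as R2_CS grows, the
# same increase in R2_CS corresponds to ever larger increases in
# -ln(1 - R2_CS) = 2*IGR.  The three values below match the conclusions above.
import numpy as np

for r2 in (0.0, 0.5, 0.75):
    info = -np.log1p(-r2)        # information scale, -ln(1 - R2_CS) = 2*IGR
    slope = 1.0 / (1.0 - r2)     # derivative of the information scale, eq. (10)
    print(f"R2_CS={r2:.2f}  -ln(1-R2_CS)={info:.3f}  d/dR2_CS={slope:.1f}")
```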

Asymptotic behavior of R2MF and R2CS


Note that all the characteristics of logistic regression discussed above (likelihoods, maximized likelihoods, significance tests, and R2 measures, including R2MF and R2CS) are statistics and thus random quantities. Some of these statistics depend on (yi, i=1,…, n) only, others depend on (yi, Xi, i=1,…, n). It is very interesting and important to study their asymptotic behavior as n → ∞. The driving force in our analysis is the Law of Large Numbers (LLN). Indeed,

(1) ŷ = (Σ_{i=1}^{n} yi)/n → π (the unknown true population proportion)

as n → ∞ (here and below the symbol → denotes convergence in probability);

(2) –lnL(0)/n = H(ŷ) = –[ŷ ln ŷ + (1 – ŷ) ln(1 – ŷ)] → –[π ln π + (1 – π) ln(1 – π)] = H(π) as n → ∞, where H(π) is the entropy of the Bernoulli distribution with parameter π and H(ŷ) is its estimate;

(3) It can be shown that, as n → ∞, the quantity (1/n)(lnL(M) – lnL(0)) converges to a nonrandom finite limit (see Hu, Shao and Palta (2006)).

It is reasonable to interpret this limiting value of (1/n)(lnL(M) – lnL(0)) as an averaged information gain, or entropy loss, due to switching from the null model with intercept only to the current model with the available predictors. A natural notation for this limit is

(1/n)(lnL(M) – lnL(0)) → H(0) – H(M)

where H(M) = –lim(1/n)lnL(M) is the averaged entropy associated with model M and H(0) = –lim(1/n)lnL(0) is the averaged entropy associated with the null model. As we have seen above, H(0) = H(π), the entropy of the outcome Y. Thus, we have

R2MF → 1 - H(M) / H(0)       (11)

R2CS → 1 – exp[2(H(M) – H(0))]       (12)

Formula (12) was derived in Hu, Shao and Palta (2006) using a somewhat different approach and notation.

For large data sets, the quantities R2MF and R2CS should be treated as estimators of these limiting values when assessing the predictive strength of the model; the simulation sketch below illustrates how they settle down as n grows.
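A small simulation sketch of this stabilization, under an assumed data-generating logistic model (the coefficients, seed, and sample sizes are illustrative):

```python
# Simulation sketch: R2_MF and R2_CS computed on increasingly large samples
# from one fixed logistic model settle down near their limiting values
# (11)-(12).  The data-generating coefficients are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def r2_pair(n):
    X = rng.normal(size=(n, 2))
    eta = -0.5 + 1.0 * X[:, 0] - 0.7 * X[:, 1]
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
    p = LogisticRegression(C=1e10, max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
    lnL_M = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    ybar = y.mean()
    lnL_0 = n * (ybar * np.log(ybar) + (1 - ybar) * np.log(1 - ybar))
    return 1 - lnL_M / lnL_0, 1 - np.exp((2 / n) * (lnL_0 - lnL_M))

for n in (500, 5_000, 50_000):
    r2_mf, r2_cs = r2_pair(n)
    print(f"n={n:>6}  R2_MF={r2_mf:.3f}  R2_CS={r2_cs:.3f}")
```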

References


Allison, P.D. (2014) “Measures of fit for logistic regression”. https://support.sas.com/resources/papers/proceedings14/1485-2014.pdf

Cameron, C. A. and Windmeijer, F. A. G. (1997) "An R-squared measure of goodness of fit for some common nonlinear regression models", Journal of Econometrics, Vol. 77, No.2, pp. 329-342.

Cox, D. R. and Snell E. J. (1989) “Analysis of binary data” 2nd Edition, Chapman & Hall, London.

Hosmer, D.W. and Lemeshow, S. (1989) “Applied Logistic Regression”. New York: Wiley.

Hu B., Shao J. and Palta M. (2006) “Pseudo-R2 in Logistic Regression Model” Statistica Sinica, 16, 847- 860.

McFadden, D. (1974), “Conditional logit analysis of qualitative choice behavior”, pp. 105 -142 in Zarembka (ed.), Frontiers in Econometrics. Academic Press.

Sharma, D.R. (2006) “Logistic Regression, Measures of Explained Variation and the Base Rate Problem”, Ph.D. Thesis, Florida State University, USA.

Sharma, D. and McGee, D. (2008) “Estimating proportion of explained variation for an underlying linear model using logistic regression analysis” J. Stat. Res., 42, No. 1, pp. 59-69.

Sharma, D., McGee, D., and Golam Kibria, B.M. (2011) “Measures of Explained Variation and the Base-Rate Problem for Logistic Regression”, American Journal of Biostatistics, 2 (1): 11-19.

Shtatland, E. S., Moore, S. and Barton, M. B. (2000) “Why we need R^2 measure of fit (and not only one) in PROC LOGISTIC and PROC GENMOD”, SUGI 2000 Proceedings, Paper 256 - 25, Cary, SAS Institute Inc.

Shtatland, E. S., Kleinman, K. and Cain, E. M. (2002) “One more time about R^2 measures of fit in logistic regression”, NESUG 2002 Proceedings.

Shtatland, E.S. (2018) “Do we really need more than one R-Squared in logistic regression?” http://statisticalmiscellany.blogspot.com/.

Theil, H. and Chung, C. F. (1988) “Information-theoretic measures of fit for univariate and multivariate linear regressions”, The American Statistician, 42, No 4, 249 – 252.

Windmeijer, F.A.G. (1995), “Goodness-of-fit measures in binary choice models”, Econometric Reviews, 14, 101-116.

Sunday, August 2, 2020

Logistic Regression and Information Theory: Part 3 - Maximum Likelihood Is Equivalent To Minimum Cross Entropy and Kullback-Leibler Divergence

The binary logistic regression model with a dichotomous outcome Y is defined by the equation

ln[P(Y = 1)/P(Y = 0)] = ln[p/(1 – p)] = b0 + b1X1 + b2X2 + … + bkXk       (1)

where p is the probability of EVENT (Y = 1) predicted by model (1) and 1 – p is the predicted probability of NONEVENT (Y = 0); X = (X1, X2, …, Xk) is a vector of predictors or explanatory variables; b = (b0, b1, b2, …, bk) are the related coefficients. The actual or true probability π is unknown and is estimated by p. It is well known (see, for example, Hosmer and Lemeshow (1989) or Menard (2010)) that maximum likelihood is the basic method for estimating the parameters b = (b0, b1, b2, …, bk) in (1). It means that, given independent observations (class labels) y1, y2, …, yn of the output variable Y and the corresponding vectors of predictors X1, X2, …, Xn, we have to find the values of the coefficients b* = (b*0, b*1, b*2, …, b*k) which maximize the likelihood

L(y1, y2, …, yn; X1, X2, …, Xn; b0, b1, b2, …, bk) = Π_{i=1}^{n} P(Y = yi | Xi; b0, b1, b2, …, bk)       (2)
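As a rough illustration of what such a maximum likelihood fit looks like in code, here is a minimal sketch using statsmodels; the simulated data, seed, and coefficient values are illustrative assumptions, not taken from the post.

```python
# Minimal sketch of maximum likelihood estimation of (1) via statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
eta = 0.3 + 1.2 * x1 - 0.8 * x2                  # b0 + b1*x1 + b2*x2 (assumed)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))  # Bernoulli outcomes

X = sm.add_constant(np.column_stack([x1, x2]))   # design matrix with intercept
fit = sm.Logit(y, X).fit(disp=0)                 # maximizes the likelihood (2)

print(fit.params)        # ML estimates b*_0, b*_1, b*_2
print(fit.llf)           # maximized log-likelihood lnL
```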

But maximizing the product in (2) by using derivatives is much more inconvenient and cumbersome than maximizing a sum. Moreover, if the sample size n is large enough, we end up operating with very small numbers and inevitably run into arithmetic underflow (a short numerical illustration of this follows after equation (3)). That is why the logarithm of the likelihood, lnL, is used:

lnL(y1, y2, …, yn; X1, X2, …, Xn) = Σ_{i=1}^{n} ln[P(Y = yi | Xi)]       (3)

Here, we have dropped the vector of parameters b = (b0, b1, b2, …, bk) for notational brevity, keeping in mind that the probabilities and likelihoods do depend on these parameters.
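The underflow point is easy to demonstrate numerically; in the sketch below the per-observation probabilities are arbitrary illustrative numbers.

```python
# Why the log is taken: multiplying many probabilities underflows in double
# precision, while summing their logarithms does not.
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.01, 0.99, size=5000)   # per-observation probabilities P(Y = y_i | X_i)

likelihood = np.prod(p)                  # direct product, as in equation (2)
log_likelihood = np.sum(np.log(p))       # sum of logs, as in equation (3)

print(likelihood)        # 0.0 -- arithmetic underflow
print(log_likelihood)    # a perfectly ordinary (large negative) number
```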

Due to the relationship between the probability of event E and the information associated with this event (see formula (3) in Shtatland (2019)), the right side of (3) can be rewritten in terms of information as follows:

Σni=1 ln[P(Y = yi | Xi)] = - Σni=1 I(Y = yi | Xi)       (4)

Thus, maximizing the likelihood L (or the log likelihood lnL) is equivalent to minimizing the negative log likelihood, i.e., the information content of the event that we observe the sequence y1, y2, …, yn. So, in addition to the technical advantages mentioned above, taking the logarithm of the likelihood gives the negative log likelihood an information interpretation. Below, we arrive at the same conclusion from a different angle.

In the case of binary logistic regression, with yi equal to either 1 or 0 (the only case we consider in this post), the equation for the negative log likelihood can be rewritten as:

–Σ_{i=1}^{n} ln[P(Y = yi | Xi)] = –Σ_{i=1}^{n} [yi ln P(Y = 1 | Xi) + (1 – yi) ln(1 – P(Y = 1 | Xi))]       (5)

(see equation (1.4) in Hosmer and Lemeshow (1989)). It is interesting that the right-hand side of (5) can be given an alternative meaning as the so-called cross-entropy, a well-known measure in machine learning (Hastie, Tibshirani, and Friedman (2009)).
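The identity between the negative log likelihood (5) and the total binary cross-entropy can be checked directly. In the sketch below the labels and model probabilities are arbitrary illustrative values, and scikit-learn's log_loss is used only as an independent cross-entropy implementation.

```python
# Sketch: the negative log-likelihood (5) and the total binary cross-entropy
# are the same number.
import numpy as np
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.4, size=200)                   # observed 0/1 labels
p = np.clip(rng.uniform(size=200), 1e-6, 1 - 1e-6)   # model probabilities P(Y = 1 | X_i)

neg_loglik = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))   # equation (5)
cross_entropy = log_loss(y, p, normalize=False)                 # summed binary cross-entropy

print(neg_loglik, cross_entropy)
assert np.isclose(neg_loglik, cross_entropy)
```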

Generally, in information theory, the cross-entropy between two probability distributions p(ω) and q(ω) on the same underlying finite set of outputs Ω = {ω} is defined as:

H(p;q) = - Σω p(ω)lnq(ω)       (6)

It is a central concept in information theory. There are two other fundamental concepts closely related to cross-entropy: entropy itself, H(p), and Kullback-Leibler divergence, DKL (see Cover and Thomas (2006)). Entropy of a probability distribution p(ω) is defined as follows:

H(p) = - Σω p(ω)lnp(ω),       (6')

i.e., entropy is just the averaged information associated with the distribution p(ω).

Kullback-Leibler divergence is defined by the formula:

DKL(pǁq) = Σω p(ω) ln[p(ω)/q(ω)].

It is easy to see that:

(a) H(p) ≥ 0; and H(p) = 0 if and only if p is a degenerate distribution;

(b) H(p;q) ≥ H(p) ≥ 0; and H(p;q) = H(p) if and only if distributions p and q are identical (p≡q);

(c) DKL(pǁq) = Σω p(ω) ln p(ω) – Σω p(ω) ln q(ω) = H(p;q) - H(p) ≥ 0       (7)

and DKL(pǁq) = 0 if and only if distributions are identical (p≡q).

Kullback-Leibler divergence is known under a variety of names, including relative entropy, Kullback–Leibler distance, information divergence, and information for discrimination. In spite of the name “Kullback–Leibler distance”, DKL(pǁq) is not a true distance between distributions since it is not symmetric and does not satisfy the triangle inequality. But it does satisfy the following inequality:

½[Σω|p(ω) - q(ω)|]^2 ≤ DKL(pǁq)       (8)

where Σω|p(ω) - q(ω)| is the so-called total variation (L1) distance. This inequality is known as Pinsker’s inequality. It is useful, particularly for proving convergence results and for bounding one measure of dissimilarity in terms of another. One implication of inequality (8) is that convergence in relative entropy implies convergence in total variation. In applications, {p(ω)} typically represents the true distribution of the data, which is unknown; the observations y1, y2, …, yn are drawn from this distribution, while {q(ω)} represents a model distribution that approximates {p(ω)}. In order to find the distribution that is closest to {p(ω)}, we have to build a model with a predictive distribution q that minimizes DKL(pǁq) or, equivalently by (7), the cross-entropy H(p;q), since H(p) does not depend on the parameters of the model.
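For concreteness, the following sketch evaluates H(p), H(p;q), DKL(pǁq) and the Pinsker bound for two small, arbitrarily chosen discrete distributions.

```python
# Sketch: entropy, cross-entropy, KL divergence and Pinsker's inequality for
# two small discrete distributions chosen only for illustration.
import numpy as np

p = np.array([0.7, 0.2, 0.1])    # "true" distribution on a 3-point set
q = np.array([0.5, 0.3, 0.2])    # model distribution

H_p  = -np.sum(p * np.log(p))            # entropy, (6')
H_pq = -np.sum(p * np.log(q))            # cross-entropy, (6)
D_kl =  np.sum(p * np.log(p / q))        # Kullback-Leibler divergence

print(H_p, H_pq, D_kl)
assert np.isclose(D_kl, H_pq - H_p) and D_kl >= 0          # identity (7)
assert 0.5 * np.sum(np.abs(p - q)) ** 2 <= D_kl            # Pinsker, (8)
```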

In the case of binary logistic regression, we have Ω = {1, 0}, i.e., the set of label values associated with EVENT and NONEVENT, respectively. If the class label yi is interpreted as a (degenerate) probability of being in class 1, then logistic regression provides the corresponding estimate of the probability that the i-th data point is in class 1. The probability distribution p in this case is the degenerate observed label distribution {yi, 1 – yi}, which equals (1, 0) or (0, 1) depending on whether yi = 1 or yi = 0. The probability distribution q is estimated from the logistic regression model, with q(ω = 1) = P(Yi = 1 | Xi) and q(ω = 0) = P(Yi = 0 | Xi). Thus, the term –[yi ln P(Y = 1 | Xi) + (1 – yi) ln(1 – P(Y = 1 | Xi))] in (5) is nothing but the binary cross-entropy for the i-th observation (yi, Xi), between the label distribution {yi, 1 – yi} and the model distribution {P(Y = 1 | Xi), 1 – P(Y = 1 | Xi)}. At the same time, this term is the contribution of the pair (yi, Xi) to the negative log likelihood. As a result, (5) is simply the total cross-entropy and, simultaneously, the total negative log likelihood. Hence, negative log likelihood and binary cross-entropy are identical measures; negative log likelihood is the basic measure in statistics (including logistic regression), while cross-entropy and DKL are the favorite measures in machine learning. So, maximizing the log likelihood is equivalent to minimizing the binary cross-entropy or the Kullback-Leibler divergence, which are frequently used in problems of model selection, testing for goodness of fit, etc.
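As a closing illustration of this equivalence, the sketch below fits the same simulated data once by explicit maximum likelihood (statsmodels) and once by minimizing the binary cross-entropy (scikit-learn, with a very large C so that the penalty is negligible); the data-generating coefficients and seed are illustrative assumptions.

```python
# Sketch: maximum likelihood (statsmodels) and cross-entropy minimization
# (scikit-learn, essentially unpenalized) recover the same coefficients.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 2))
eta = 0.5 + 1.0 * X[:, 0] - 1.5 * X[:, 1]        # assumed true coefficients
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

ml_fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)               # maximum likelihood
ce_fit = LogisticRegression(C=1e10, max_iter=1000).fit(X, y)       # cross-entropy minimization

print(ml_fit.params)                     # [b0, b1, b2]
print(ce_fit.intercept_, ce_fit.coef_)   # the same values up to numerical tolerance
```

Using a very large C is simply a pragmatic way to approximate an unpenalized fit across scikit-learn versions; it does not change the point that the two criteria coincide.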

References

Cover, T. M. and Thomas, J. A. (2006) “Elements of Information Theory”, 2nd Edition. John Wiley, New Jersey.
Hastie, T., Tibshirani, R., and Friedman, J. (2009) “The Elements of Statistical Learning: Data Mining, Inference, and Prediction”, Springer Series in Statistics, 2nd Edition, Springer, New York.
Hosmer, D.W. and Lemeshow, S. (1989) “Applied Logistic Regression”. New York: Wiley.
Menard, S.W. (2010) “Logistic Regression: From Introductory to Advanced Concepts and Applications”, Thousand Oaks, CA: Sage.
Shtatland, E.S. (2019) “Logistic Regression and Information Theory: Part 1 - Do log odds have any intuitive meaning?” https://statisticalmiscellany.blogspot.com/2019/09/logistic-regression-and-information.html