Sunday, August 9, 2020

Logistic Regression and Information Theory: Part 4 -

How Good Is Our Model?


After estimating the logistic regression coefficients – usually by the maximum likelihood approach – we inevitably come to the question: How good is our model? According to Allison (2014):

“There are two very different approaches to answering this question. One is to get a statistic that measures how well you can predict the dependent variable based on the independent variables. I’ll refer to these kinds of statistics as measures of predictive power . . . Predictive power statistics available in PROC LOGISTIC include R-square, the area under the ROC curve, and several rank-order correlations . . . The other approach to evaluating model fit is to compute a goodness-of-fit statistic. With PROC LOGISTIC, you can get the deviance, the Pearson chi-square, or the Hosmer-Lemeshow tests. These are formal tests of the null hypothesis that the fitted model is correct . . . What many researchers fail to realize is that measures of predictive power and goodness-of-fit statistics are testing very different things. It’s not at all uncommon for models with very high R-squares to produce unacceptable goodness-of-fit statistics. And conversely, models with very low R-squares, can fit the data very well according to goodness-of-fit tests.”

Thus, to answer the question "How good is our model?", we first have to be clear about the purpose. It is known that any regression model (including logistic regression) has two primary goals:

(1) to predict the dependent variable (aka the response or output variable) from the corresponding independent variables (aka predictors); this is a practical goal, which can be pursued by a statistician or statistical programmer;
(2) to explain the relationship between the response variable and the independent (explanatory) variables; this is a more theoretical goal, requiring subject-matter knowledge.

In this blogpost, we limit ourselves to the measures of predictive power mentioned above, more specifically to the R2 measures based on the likelihood.

Basic notations


Given a sample of observations or subjects (yi, Xi), i=1,…, n, let:

(1) yi denote the dependent class variable with values ‘1’ (EVENT) or ‘0’ (NONEVENT), where EVENT can stand for Disease, Buy, Vote, Out-of-Order, etc., and correspondingly NONEVENT for No-Disease, No-Buy, No-Vote, In-Order, etc.;

(2) Xi denote a vector of independent variables (x1, x2, …, xk), also called covariates or predictors, which can be understood as the attributes of the i-th subject, such as age, education, test results and so on, depending on the context;

(3) the class variable yi and the vector of attributes Xi are related by the equation

ln[P(yi = 1) / (1 – P(yi = 1))] = b0 + b1x1 + … + bkxk       (1)

or equivalently

P(yi = 1) = exp(b0 + b1x1 + … + bkxk) / (1 + exp(b0 + b1x1 + … + bkxk))       (1’)

where the coefficients b0, b1, b2, …, bk have to be estimated (usually by the maximum likelihood method);

(4) the estimates b*0, b*1, …, b*k, substituted into (1’), provide us with an estimate of P(yi = 1), and based on this estimate we have to predict the actual value of yi, which is still unknown at this point;

(5) the likelihood of the logistic regression model (1) is given by the equation

L(y1, y2, …, yn; X1, X2, …, Xn) = Π_{i=1}^{n} P(yi | Xi)       (2)

Here, we have dropped the vector of parameters b = (b0, b1, b2, …, bk) for notational brevity.

(6) the logarithm of the likelihood is as follows (a small numerical sketch of (2)–(3) is given right after this list):

lnL(y1, y2, …, yn; X1, X2, …, Xn) = Σ_{i=1}^{n} ln[P(yi | Xi)] =       (3)

= Σ_{i=1}^{n} [yi ln P(yi = 1 | Xi) + (1 – yi) ln(1 – P(yi = 1 | Xi))]
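To make (2)–(3) concrete, here is a minimal Python sketch that fits a logistic regression and evaluates lnL(M) and lnL(0) directly from the predicted probabilities. The simulated data, the seed, and the use of scikit-learn (with a very large C to approximate unpenalized maximum likelihood) are illustrative assumptions, not part of the original derivation.

```python
# Minimal sketch: computing the log-likelihood (3) of a fitted logistic model.
# The simulated data and all names here are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 2))                        # two predictors x1, x2
eta = -0.5 + 1.0 * X[:, 0] - 0.7 * X[:, 1]         # assumed linear predictor
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))    # 0/1 outcomes

# Fit by (essentially unpenalized) maximum likelihood.
model = LogisticRegression(C=1e10, max_iter=1000).fit(X, y)
p = model.predict_proba(X)[:, 1]                   # estimates of P(y_i = 1 | X_i)

# Equation (3): log-likelihood of the fitted model M.
lnL_M = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Null model "0" (intercept only): P(y_i = 1) is just the sample proportion.
ybar = y.mean()
lnL_0 = n * (ybar * np.log(ybar) + (1 - ybar) * np.log(1 - ybar))

print(lnL_M, lnL_0)   # both negative, lnL(M) >= lnL(0)
```

Since the null model is nested in M, lnL(M) ≥ lnL(0), and the gap lnL(M) – lnL(0) is exactly the information gain discussed below.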

R2 measures based on likelihood


There are several R-squared measures based on the likelihood (see Shtatland (2018)). The most popular are R2MF (McFadden R2) and R2CS (Cox–Snell R2), defined as follows:

R2MF = 1 – lnL(M) / lnL(0) = (lnL(0) – lnL(M)) / lnL(0) = (lnL(M) – lnL(0))/(-lnL(0))       (4)

R2CS = 1 – [L(0) / L(M)]^(2/n)       (5)

Here, L(M) and L(0) are the maximized likelihoods of the model M with all available predictors (x1, x2, …, xk) and of the null model ”0” without predictors, respectively.

It is interesting that R2CS can be expressed as a simple function of R2MF:

R2CS = 1 – exp(–2H·R2MF)       (6)

where

H(ŷ) = –[ŷ ln ŷ + (1 – ŷ) ln(1 – ŷ)] and ŷ = (Σ_{i=1}^{n} yi)/n       (7)

Note that the quantity H is nothing but the entropy of the Bernoulli distribution with parameter ŷ. Equations (6) and (7) were originally derived in Shtatland et al. (2000, 2002). It is easy to obtain similar formulas in terms of R2MF and H for other likelihood-based R2 measures (see, for example, Sharma (2006), Sharma and McGee (2008), Sharma et al. (2011), and Shtatland (2018)), but they are rather bulky and difficult to interpret. Besides, those R2 measures are not as popular as R2MF and R2CS.
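The following sketch computes R2MF and R2CS from the two log-likelihoods and checks identity (6)–(7). The values of n, lnL(0) and lnL(M) are hypothetical placeholders; in practice they would come from a fitted model and its intercept-only counterpart, as in the sketch above.

```python
# Sketch: McFadden and Cox-Snell R2 from log-likelihoods, with a check of
# identity (6)-(7).  The numbers n, lnL_0, lnL_M are illustrative placeholders.
import numpy as np

n = 1000
lnL_0 = -693.15        # null (intercept-only) model, balanced outcome
lnL_M = -580.00        # model with predictors

R2_MF = 1.0 - lnL_M / lnL_0                        # equation (4)
R2_CS = 1.0 - np.exp((2.0 / n) * (lnL_0 - lnL_M))  # equation (5)

H = -lnL_0 / n                                     # entropy estimate H(y-bar), eq. (7)
R2_CS_via_6 = 1.0 - np.exp(-2.0 * H * R2_MF)       # equation (6)

print(round(R2_MF, 4), round(R2_CS, 4), round(R2_CS_via_6, 4))
assert np.isclose(R2_CS, R2_CS_via_6)
```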

R2MF and R2CS - information interpretation


Note also that an information-theoretic interpretation of R2 measures in logistic regression is one of the most desirable properties, on a par with such standard properties as: (1) 0 ≤ R2 ≤ 1; and (2) R2 is nondecreasing as predictors are added (see, for example, Cameron and Windmeijer (1997) and references therein). We have been in this “R2 information interpretation camp” since 2000: Shtatland et al. (2000), Shtatland et al. (2002), Shtatland (2018). It has been shown in these publications that both R2MF and R2CS have an explicit and direct interpretation in terms of the information content of the data, the potentially recoverable information, and the information gain due to added predictors. This interpretation is based on the fact that the negative log likelihood can be understood as the information contained in the observed data. As a result, we can interpret the quantity (–lnL(0)) for the model without predictors as the potentially recoverable information in the data (yi, i=1,…, n), and the quantity (–lnL(M)) for the model with predictors as the information content of the enlarged data (yi, Xi, i=1,…, n). Thus, it is natural to interpret the quantity lnL(M) – lnL(0) as the information gain (IG) due to switching from the null model with intercept only to the current model with the available predictors. Note that R2MF, as the ratio of the information gain, IG, to all potentially recoverable information, shows not only how good model M is compared to the null model, but also how far it falls short of the saturated model.

R2CS and R2MF scales vs information scale


After obtaining the information interpretation for R2CS and R2MF, it is interesting to compare their scales with the information scale. As to R2MF, the situation is simple: formula (4) can be re-written as

R2MF = (lnL(M) – lnL(0))/(–lnL(0)) = [(1/n)(lnL(M) – lnL(0))]/[–(1/n)lnL(0)] = IGR/H       (8)

where (–lnL(0)) does not depend on (Xi, i = 1, 2, …, n) and can be treated as a constant; the information gain rate, IGR, is the information gain per observation; and H = –(1/n)lnL(0) = H(ŷ) is the estimate of the entropy of the outcome in the available data. Thus the R2MF scale is the same as the information scale (up to this constant): the information content per unit of R2MF is the same along the whole interval 0 ≤ R2MF ≤ 1.
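A quick numerical check of the decomposition (8), again with hypothetical log-likelihood values:

```python
# Check of (8): R2_MF = IGR / H, with illustrative values.
import numpy as np

n, lnL_0, lnL_M = 1000, -693.15, -580.00   # hypothetical log-likelihoods
IGR = (lnL_M - lnL_0) / n                  # information gain per observation
H = -lnL_0 / n                             # estimated entropy of the outcome
print(IGR / H, 1.0 - lnL_M / lnL_0)        # the two expressions agree
```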

As to R2CS, the situation is more subtle and interesting. Indeed, transforming equation (5), R2CS = 1 – [L(0) / L(M)]^(2/n), which defines R2CS, we obtain

–ln(1 – R2CS) = 2(lnL(M) – lnL(0))/n = 2·IG/n = 2·IGR       (9)

Note that there is an alternative interpretation of IG/n as n → ∞: this limit can be interpreted as the entropy loss associated with using model M with predictors instead of the null model 0 without predictors. The logarithmic function –ln(1 – x) is a well-known measure from information theory (Theil and Chung (1988)), and using –ln(1 – R2CS) rather than R2CS itself provides a more natural scale for interpretation. Taking the first derivative of both sides of equation (9) with respect to R2CS, we have

d[–ln(1 – R2CS)]/dR2CS = 2·dIGR/dR2CS = 1/(1 – R2CS)       (10)

From this formula we arrive at the following important conclusions:

If R2CS = 0, then the derivative equals 1, i.e., an increase in R-Square produces an equal increase on the information scale –ln(1 – R2CS) = 2·IGR;

If R2CS = 0.5, then the derivative equals 2, i.e., the increase on the information scale is twice as large as the increase in R-Square;

If R2CS = 0.75 (the maximal value possible, attained when the outcome is balanced), then the derivative equals 4, i.e., the increase on the information scale is four times larger than the increase in R-Square. These three points are illustrated numerically in the short sketch below.
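A minimal numerical illustration of this non-linear scale (pure arithmetic, no data required):

```python
# Sketch of the non-linear information scale behind R2_CS: as R2_CS grows, the
# same increase in R2_CS corresponds to ever larger increases in
# -ln(1 - R2_CS) = 2*IGR.  The three values below match the conclusions above.
import numpy as np

for r2 in (0.0, 0.5, 0.75):
    info = -np.log1p(-r2)        # information scale, -ln(1 - R2_CS) = 2*IGR
    slope = 1.0 / (1.0 - r2)     # derivative of the information scale, eq. (10)
    print(f"R2_CS={r2:.2f}  -ln(1-R2_CS)={info:.3f}  d/dR2_CS={slope:.1f}")
```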

Asymptotic behavior of R2MF and R2CS


Note that all the characteristics of logistic regression discussed above (likelihoods, maximized likelihoods, significance tests, and R2 measures, including R2MF and R2CS) are statistics and thus random quantities. Some of these statistics depend on (yi, i=1,…, n) only, others depend on (yi, Xi, i=1,…, n). It is very interesting and important to study their asymptotic behavior as n → ∞. The driving force in our analysis is the Law of Large Numbers (LLN). Indeed,

(1) ŷ = (Σ_{i=1}^{n} yi)/n → π (the unknown true population proportion)

as n → ∞ (here and below the symbol → denotes convergence in probability);

(2) –lnL(0)/n = H(ŷ) = –[ŷ ln ŷ + (1 – ŷ) ln(1 – ŷ)] → –[π ln π + (1 – π) ln(1 – π)] = H(π) as n → ∞, where H(π) is the entropy of the Bernoulli distribution with parameter π and H(ŷ) is its estimate;

(3) It can be shown that, as n → ∞, the quantity (1/n)(lnL(M) – lnL(0)) converges to a nonrandom finite limit (see Hu, Shao and Palta (2006)).

It is reasonable to interpret this limiting value of (1/n)(lnL(M) – lnL(0)) as an averaged information gain, or entropy loss, due to switching from the null model with intercept only to the current model with the available predictors. A natural notation for this limit is

(1/n)(lnL(M) – lnL(0)) → H(0) – H(M)

where H(M) = –lim(1/n)lnL(M) is the averaged entropy associated with model M and H(0) = –lim(1/n)lnL(0) is the averaged entropy associated with the null model. As we have seen above, H(0) = H(π), the entropy of the outcome Y. Thus, we have

R2MF → 1 - H(M) / H(0)       (11)

R2CS → 1 – exp[2(H(M) – H(0))]       (12)

Formula (12) was derived in Hu, Shao and Palta (2006) using a somewhat different approach and notation.

For large data sets, the quantities R2MF and R2CS should be treated as estimators of these limiting values when assessing the predictive strength of the model; the simulation sketch below illustrates how they settle down as n grows.
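A small simulation sketch of this stabilization, under an assumed data-generating logistic model (the coefficients, seed, and sample sizes are illustrative):

```python
# Simulation sketch: R2_MF and R2_CS computed on increasingly large samples
# from one fixed logistic model settle down near their limiting values
# (11)-(12).  The data-generating coefficients are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def r2_pair(n):
    X = rng.normal(size=(n, 2))
    eta = -0.5 + 1.0 * X[:, 0] - 0.7 * X[:, 1]
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
    p = LogisticRegression(C=1e10, max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
    lnL_M = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    ybar = y.mean()
    lnL_0 = n * (ybar * np.log(ybar) + (1 - ybar) * np.log(1 - ybar))
    return 1 - lnL_M / lnL_0, 1 - np.exp((2 / n) * (lnL_0 - lnL_M))

for n in (500, 5_000, 50_000):
    r2_mf, r2_cs = r2_pair(n)
    print(f"n={n:>6}  R2_MF={r2_mf:.3f}  R2_CS={r2_cs:.3f}")
```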

References


Allison, P.D. (2014) “Measures of fit for logistic regression”. https://support.sas.com/resources/papers/proceedings14/1485-2014.pdf

Cameron, C. A. and Windmeijer, F. A. G. (1997) "An R-squared measure of goodness of fit for some common nonlinear regression models", Journal of Econometrics, Vol. 77, No.2, pp. 329-342.

Cox, D. R. and Snell E. J. (1989) “Analysis of binary data” 2nd Edition, Chapman & Hall, London.

Hosmer, D.W. and Lemeshow, S. (1989) “Applied Logistic Regression”. New York: Wiley.

Hu B., Shao J. and Palta M. (2006) “Pseudo-R2 in Logistic Regression Model” Statistica Sinica, 16, 847- 860.

McFadden, D. (1974), “Conditional logit analysis of qualitative choice behavior”, pp. 105 -142 in Zarembka (ed.), Frontiers in Econometrics. Academic Press.

Sharma, D.R. (2006) “Logistic Regression, Measures of Explained Variation and the Base Rate Problem”, Ph.D. Thesis, Florida State University, USA.

Sharma, D. and McGee, D. (2008) “Estimating proportion of explained variation for an underlying linear model using logistic regression analysis” J. Stat. Res., 42, No. 1, pp. 59-69.

Sharma, D., McGee, D., and Golam Kibria, B.M. (2011) “Measures of Explained Variation and the Base-Rate Problem for Logistic Regression”, American Journal of Biostatistics, 2 (1): 11-19.

Shtatland, E. S., Moore, S. and Barton, M. B. (2000) “Why we need R^2 measure of fit (and not only one) in PROC LOGISTIC and PROC GENMOD”, SUGI 2000 Proceedings, Paper 256 - 25, Cary, SAS Institute Inc.

Shtatland, E. S., Kleinman, K. and Cain, E. M. (2002) “One more time about R^2 measures of fit in logistic regression”, NESUG 2002 Proceedings.

Shtatland, E.S. (2018) “Do we really need more than one R-Squared in logistic regression?” http://statisticalmiscellany.blogspot.com/.

Theil, H. and Chung, C. F. (1988) “Information-theoretic measures of fit for univariate and multivariate linear regressions”, The American Statistician, 42, No 4, 249 – 252.

Windmeijer, F.A.G. (1995), “Goodness-of-fit measures in binary choice models”, Econometric Reviews, 14, 101-116.

Sunday, August 2, 2020

Logistic Regression and Information Theory: Part 3 - Maximum Likelihood Is Equivalent To Minimum Cross Entropy and Kullback-Leibler Divergence

The binary logistic regression model with a dichotomous outcome Y is defined by the equation

ln[P(Y = 1)/P(Y = 0)] = ln[p/(1 – p)] = b0 + b1X1 + b2X2 + … + bkXk       (1)

where p is the probability of EVENT (Y = 1) predicted by model (1) and 1 – p is the predicted probability of NONEVENT (Y = 0); X = (X1, X2, …, Xk) is a vector of predictors or explanatory variables; b = (b0, b1, b2, …, bk) are the related coefficients. The actual or true probability π is unknown and is estimated by p. It is well known (see, for example, Hosmer and Lemeshow (1989) or Menard (2010)) that maximum likelihood is the basic method for estimating the parameters b = (b0, b1, b2, …, bk) in (1). It means that, given independent observations (class labels) y1, y2, …, yn of the output variable Y and the corresponding vectors of predictors X1, X2, …, Xn, we have to find the values of the coefficients b* = (b*0, b*1, b*2, …, b*k) which maximize the likelihood

L(y1, y2, …, yn; X1, X2, …, Xn; b0, b1, b2, …, bk) = Π_{i=1}^{n} P(Y = yi | Xi; b0, b1, b2, …, bk)       (2)
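As a rough illustration of what such a maximum likelihood fit looks like in code, here is a minimal sketch using statsmodels; the simulated data, seed, and coefficient values are illustrative assumptions, not taken from the post.

```python
# Minimal sketch of maximum likelihood estimation of (1) via statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
eta = 0.3 + 1.2 * x1 - 0.8 * x2                  # b0 + b1*x1 + b2*x2 (assumed)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))  # Bernoulli outcomes

X = sm.add_constant(np.column_stack([x1, x2]))   # design matrix with intercept
fit = sm.Logit(y, X).fit(disp=0)                 # maximizes the likelihood (2)

print(fit.params)        # ML estimates b*_0, b*_1, b*_2
print(fit.llf)           # maximized log-likelihood lnL
```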

But maximizing the product in (2) by using derivatives is much more inconvenient and cumbersome than maximizing a sum. Moreover, if the sample size n is large enough, we end up operating with very small numbers and inevitably run into arithmetic underflow (a short numerical illustration of this follows after equation (3)). That is why the logarithm of the likelihood, lnL, is used:

lnL(y1, y2, …, yn; X1, X2, …, Xn) = Σ_{i=1}^{n} ln[P(Y = yi | Xi)]       (3)

Here, we have dropped the vector of parameters b = (b0, b1, b2, …, bk) for notational brevity, keeping in mind that the probabilities and likelihoods do depend on these parameters.
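The underflow point is easy to demonstrate numerically; in the sketch below the per-observation probabilities are arbitrary illustrative numbers.

```python
# Why the log is taken: multiplying many probabilities underflows in double
# precision, while summing their logarithms does not.
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.01, 0.99, size=5000)   # per-observation probabilities P(Y = y_i | X_i)

likelihood = np.prod(p)                  # direct product, as in equation (2)
log_likelihood = np.sum(np.log(p))       # sum of logs, as in equation (3)

print(likelihood)        # 0.0 -- arithmetic underflow
print(log_likelihood)    # a perfectly ordinary (large negative) number
```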

Due to the relationship between the probability of event E and the information associated with this event (see formula (3) in Shtatland (2019)), the right side of (3) can be rewritten in terms of information as follows:

Σni=1 ln[P(Y = yi | Xi)] = - Σni=1 I(Y = yi | Xi)       (4)

Thus, maximizing the likelihood L (or the log likelihood lnL) is equivalent to minimizing the negative log likelihood, i.e., the information content of the event that we observe the sequence y1, y2, …, yn. So, in addition to the technical advantages mentioned above, taking the logarithm of the likelihood gives the negative log likelihood an information interpretation. Below, we arrive at the same conclusion from a different angle.

In the case of binary logistic regression, with yi equal to either 1 or 0 (the only case we consider in this post), the equation for the negative log likelihood can be rewritten as:

–Σ_{i=1}^{n} ln[P(Y = yi | Xi)] = –Σ_{i=1}^{n} [yi ln P(Y = 1 | Xi) + (1 – yi) ln(1 – P(Y = 1 | Xi))]       (5)

(see equation (1.4) in Hosmer and Lemeshow (1989)). It is interesting that the right-hand side of (5) can be given an alternative meaning as the so-called cross-entropy, a well-known measure in machine learning (Hastie, Tibshirani, and Friedman (2009)).
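The identity between the negative log likelihood (5) and the total binary cross-entropy can be checked directly. In the sketch below the labels and model probabilities are arbitrary illustrative values, and scikit-learn's log_loss is used only as an independent cross-entropy implementation.

```python
# Sketch: the negative log-likelihood (5) and the total binary cross-entropy
# are the same number.
import numpy as np
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.4, size=200)                   # observed 0/1 labels
p = np.clip(rng.uniform(size=200), 1e-6, 1 - 1e-6)   # model probabilities P(Y = 1 | X_i)

neg_loglik = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))   # equation (5)
cross_entropy = log_loss(y, p, normalize=False)                 # summed binary cross-entropy

print(neg_loglik, cross_entropy)
assert np.isclose(neg_loglik, cross_entropy)
```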

Generally, in information theory, the cross-entropy between two probability distributions p(ω) and q(ω) on the same underlying finite set of outputs Ω = {ω} is defined as:

H(p;q) = - Σω p(ω)lnq(ω)       (6)

It is a central concept in information theory. There are two other fundamental concepts closely related to cross-entropy: entropy itself, H(p), and Kullback-Leibler divergence, DKL (see Cover and Thomas (2006)). Entropy of a probability distribution p(ω) is defined as follows:

H(p) = - Σω p(ω)lnp(ω),       (6')

i.e., entropy is just the averaged information associated with the distribution p(ω).

Kullback-Leibler divergence is defined by the formula:

DKL(pǁq) = Σω p(ω) ln[p(ω)/q(ω)].

It is easy to see that:

(a) H(p) ≥ 0; and H(p) = 0 if and only if p is a degenerate distribution;

(b) H(p;q) ≥ H(p) ≥ 0; and H(p;q) = H(p) if and only if distributions p and q are identical (p≡q);

(c) DKL(pǁq) = Σω p(ω) ln p(ω) – Σω p(ω) ln q(ω) = H(p;q) - H(p) ≥ 0       (7)

and DKL(pǁq) = 0 if and only if distributions are identical (p≡q).

Kullback-Leibler divergence is known under a variety of names, including relative entropy, Kullback–Leibler distance, information divergence, and information for discrimination. In spite of the name “Kullback–Leibler distance”, DKL(pǁq) is not a true distance between distributions since it is not symmetric and does not satisfy the triangle inequality. But it does satisfy the following inequality:

½[Σω|p(ω) - q(ω)|]^2 ≤ DKL(pǁq)       (8)

where Σω|p(ω) - q(ω)| is the so-called total variation (L1) distance. This inequality is known as Pinsker’s inequality. It is useful, particularly for proving convergence results and for bounding one measure of dissimilarity in terms of another. One implication of inequality (8) is that convergence in relative entropy implies convergence in total variation. In applications, {p(ω)} typically represents the true distribution of the data, which is unknown; the observations y1, y2, …, yn are drawn from this distribution, while {q(ω)} represents a model distribution that approximates {p(ω)}. In order to find the distribution that is closest to {p(ω)}, we have to build a model with a predictive distribution q that minimizes DKL(pǁq) or, equivalently by (7), the cross-entropy H(p;q), since H(p) does not depend on the parameters of the model.
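For concreteness, the following sketch evaluates H(p), H(p;q), DKL(pǁq) and the Pinsker bound for two small, arbitrarily chosen discrete distributions.

```python
# Sketch: entropy, cross-entropy, KL divergence and Pinsker's inequality for
# two small discrete distributions chosen only for illustration.
import numpy as np

p = np.array([0.7, 0.2, 0.1])    # "true" distribution on a 3-point set
q = np.array([0.5, 0.3, 0.2])    # model distribution

H_p  = -np.sum(p * np.log(p))            # entropy, (6')
H_pq = -np.sum(p * np.log(q))            # cross-entropy, (6)
D_kl =  np.sum(p * np.log(p / q))        # Kullback-Leibler divergence

print(H_p, H_pq, D_kl)
assert np.isclose(D_kl, H_pq - H_p) and D_kl >= 0          # identity (7)
assert 0.5 * np.sum(np.abs(p - q)) ** 2 <= D_kl            # Pinsker, (8)
```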

In the case of binary logistic regression, we have Ω = {1, 0}, i.e., the set of label values associated with EVENT and NONEVENT, respectively. If the class label yi is interpreted as a (degenerate) probability of being in class 1, then logistic regression provides the corresponding estimate of the probability that the i-th data point is in class 1. The probability distribution p in this case is the degenerate observed label distribution {yi, 1 – yi}, which equals (1, 0) or (0, 1) depending on whether yi = 1 or yi = 0. The probability distribution q is estimated from the logistic regression model, with q(ω = 1) = P(Yi = 1 | Xi) and q(ω = 0) = P(Yi = 0 | Xi). Thus, the term –[yi ln P(Y = 1 | Xi) + (1 – yi) ln(1 – P(Y = 1 | Xi))] in (5) is nothing but the binary cross-entropy for the i-th observation (yi, Xi), between the label distribution {yi, 1 – yi} and the model distribution {P(Y = 1 | Xi), 1 – P(Y = 1 | Xi)}. At the same time, this term is the contribution of the pair (yi, Xi) to the negative log likelihood. As a result, (5) is simply the total cross-entropy and, simultaneously, the total negative log likelihood. Hence, negative log likelihood and binary cross-entropy are identical measures; negative log likelihood is the basic measure in statistics (including logistic regression), while cross-entropy and DKL are the favorite measures in machine learning. So, maximizing the log likelihood is equivalent to minimizing the binary cross-entropy or the Kullback-Leibler divergence, which are frequently used in problems of model selection, testing for goodness of fit, etc.
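As a closing illustration of this equivalence, the sketch below fits the same simulated data once by explicit maximum likelihood (statsmodels) and once by minimizing the binary cross-entropy (scikit-learn, with a very large C so that the penalty is negligible); the data-generating coefficients and seed are illustrative assumptions.

```python
# Sketch: maximum likelihood (statsmodels) and cross-entropy minimization
# (scikit-learn, essentially unpenalized) recover the same coefficients.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 2))
eta = 0.5 + 1.0 * X[:, 0] - 1.5 * X[:, 1]        # assumed true coefficients
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

ml_fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)               # maximum likelihood
ce_fit = LogisticRegression(C=1e10, max_iter=1000).fit(X, y)       # cross-entropy minimization

print(ml_fit.params)                     # [b0, b1, b2]
print(ce_fit.intercept_, ce_fit.coef_)   # the same values up to numerical tolerance
```

Using a very large C is simply a pragmatic way to approximate an unpenalized fit across scikit-learn versions; it does not change the point that the two criteria coincide.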

References

Cover, T. M. and Thomas, J. A. (2006) “Elements of Information Theory”, 2nd Edition. John Wiley, New Jersey.
Hastie, T., Tibshirani, R., and Friedman, J. (2009) “The Elements of Statistical Learning: Data Mining, Inference, and Prediction”, Springer Series in Statistics, 2nd Edition, Springer, New York.
Hosmer, D.W. and Lemeshow, S. (1989) “Applied Logistic Regression”. New York: Wiley.
Menard, S.W. (2010) “Logistic Regression: From Introductory to Advanced Concepts and Applications”, Thousand Oaks, CA: Sage.
Shtatland, E.S. (2019) “Logistic Regression and Information Theory: Part 1 - Do log odds have any intuitive meaning?” https://statisticalmiscellany.blogspot.com/2019/09/logistic-regression-and-information.html