Sunday, August 2, 2020

Logistic Regression and Information Theory: Part 3 - Maximum Likelihood Is Equivalent To Minimum Cross Entropy and Kullback-Leibler Divergence

The binary logistic regression model with a dichotomous outcome Y is defined by the equation

ln[P(Y = 1)/P(Y = 0)] = ln[p/(1 - p)] = b0 + b1X1 + b2X2 + … + bkXk       (1)

where p is the probability of EVENT (Y = 1) predicted by model (1) and 1 - p is the predicted probability of NONEVENT (Y = 0); X = (X1, X2, …, Xk) is a vector of predictors or explanatory variables; b = (b0, b1, b2, …, bk) are the corresponding coefficients. The actual or real probability π is unknown and is estimated by p. It is well known (see, for example, Hosmer and Lemeshow (1989) or Menard (2010)) that maximum likelihood is the basic method for estimating the parameters b = (b0, b1, b2, …, bk) in (1). This means that, given independent observations (class labels) y1, y2, ..., yn of the output variable Y and the corresponding vectors of predictors X1, X2, …, Xn, we have to find the values of the coefficients b* = (b*0, b*1, b*2, …, b*k) that maximize the likelihood

L(y1, y2, ..., yn; X1, X2, …, Xn; b0, b1, b2, …, bk) = Π_{i=1}^n P(Y = yi | Xi; b0, b1, b2, …, bk)       (2)
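For readers who like to see formulas in code, here is a minimal sketch (in Python, with made-up coefficients and data) of how model (1) produces the probabilities that enter the likelihood (2); all numbers below are hypothetical:

```python
# A sketch of model (1) and likelihood (2) with made-up coefficients and data.
import numpy as np

b = np.array([-0.5, 1.2, -0.8])               # hypothetical b0, b1, b2
X = np.array([[1.0,  0.3,  1.5],              # each row: 1 (intercept), X1, X2
              [1.0, -1.1,  0.4],
              [1.0,  2.0, -0.7]])
y = np.array([1, 0, 1])                       # observed class labels yi

logit = X @ b                                 # b0 + b1*X1 + b2*X2, the right side of (1)
p = 1.0 / (1.0 + np.exp(-logit))              # solving (1) for p: the logistic function

# P(Y = yi | Xi) is p where yi = 1 and (1 - p) where yi = 0
p_obs = np.where(y == 1, p, 1.0 - p)
likelihood = np.prod(p_obs)                   # the product in (2)
print(p, likelihood)
```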

But maximizing this product by taking derivatives is much more inconvenient and cumbersome than maximizing a sum. Also, if the sample size n is large enough, we will be operating with very small numbers and will inevitably run into arithmetic underflow. That is why the logarithm of the likelihood, lnL, is used:

lnL(y1, y2, ..., yn; X1, X2, …, Xn) = Σ_{i=1}^n ln[P(Y = yi | Xi)]       (3)

Here we have dropped the vector of parameters b = (b0, b1, b2, …, bk) for notational brevity, keeping in mind that the probabilities and likelihoods do depend on these parameters.
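The underflow problem is easy to reproduce. The following sketch uses simulated per-observation probabilities (not a fitted model) to show that the product (2) collapses to zero for large n, while the log likelihood (3) remains perfectly manageable:

```python
# Underflow of the likelihood product (2) vs. the log likelihood sum (3).
# The probabilities P(Y = yi | Xi) below are simulated, not from a real model.
import numpy as np

rng = np.random.default_rng(0)
probs = rng.uniform(0.3, 0.9, size=2000)   # 2000 per-observation probabilities

likelihood = np.prod(probs)                # the product in (2)
log_likelihood = np.sum(np.log(probs))     # the sum in (3)

print(likelihood)       # 0.0 (arithmetic underflow)
print(log_likelihood)   # a finite negative number (roughly -1100 for this simulation)
```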

Due to the relationship between the probability of event E and the information associated with this event (see formula (3) in Shtatland (2019)), the right side of (3) can be rewritten in terms of information as follows:

Σ_{i=1}^n ln[P(Y = yi | Xi)] = - Σ_{i=1}^n I(Y = yi | Xi)       (4)

Thus, maximizing the likelihood L (or the log likelihood lnL) is equivalent to minimizing the negative log likelihood, i.e., the information content of the event that we observe the sequence y1, y2, ..., yn. So, in addition to the technical advantages mentioned above, taking the logarithm of the likelihood gives the negative log likelihood an information-theoretic interpretation. Below, we arrive at the same conclusion from a different angle.
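As a quick numerical illustration of (4): the log likelihood is just minus the total surprisal -ln P of the observed labels. The probabilities below are again simulated:

```python
# Identity (4): log likelihood = minus the total information content (surprisal).
# The per-observation probabilities are simulated for illustration only.
import numpy as np

rng = np.random.default_rng(1)
p_obs = rng.uniform(0.3, 0.9, size=10)      # P(Y = yi | Xi) for ten observations

log_likelihood = np.sum(np.log(p_obs))      # left side of (4)
total_information = np.sum(-np.log(p_obs))  # sum of I(Y = yi | Xi)

print(np.isclose(log_likelihood, -total_information))   # True
```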

In the case of binary logistic regression, with yi equal to either 1 or 0 (the only case we consider in this post), the negative log likelihood can be rewritten as:

-Σ_{i=1}^n ln[P(Y = yi | Xi)] = -Σ_{i=1}^n { yi ln[P(Y = 1 | Xi)] + (1 - yi) ln[1 - P(Y = 1 | Xi)] }       (5)

(see equation (1.4) in Hosmer and Lemeshow (1989)). It is interesting that the right side of (5) admits an alternative interpretation as the so-called cross-entropy, a well-known measure in machine learning (Hastie, Tibshirani, and Friedman (2009)).
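Identity (5) is easy to check numerically. The sketch below uses simulated labels and probabilities (no actual model fitting) and confirms that the negative log likelihood equals the summed binary cross-entropy form:

```python
# Checking identity (5): negative log likelihood = binary cross-entropy (summed).
# Labels y and probabilities p1 are simulated, not taken from a fitted model.
import numpy as np

rng = np.random.default_rng(2)
n = 500
p1 = rng.uniform(0.05, 0.95, size=n)        # P(Y = 1 | Xi) for each observation
y = rng.binomial(1, p1)                     # observed labels yi

# Left side of (5): -sum of ln P(Y = yi | Xi)
p_obs = np.where(y == 1, p1, 1.0 - p1)
nll = -np.sum(np.log(p_obs))

# Right side of (5): the binary cross-entropy form
bce = -np.sum(y * np.log(p1) + (1 - y) * np.log(1.0 - p1))

print(np.isclose(nll, bce))                 # True
```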

Generally, in information theory, the cross-entropy between two probability distributions p(ω) and q(ω) on the same underlying finite set of outcomes Ω = {ω} is defined as:

H(p;q) = - Σ_ω p(ω) ln q(ω)       (6)

Cross-entropy is a central concept in information theory. There are two other fundamental concepts closely related to it: entropy itself, H(p), and the Kullback-Leibler divergence, DKL (see Cover and Thomas (2006)). The entropy of a probability distribution p(ω) is defined as follows:

H(p) = - Σ_ω p(ω) ln p(ω),       (6')

i.e., entropy is just the average information associated with the distribution p(ω).
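Definitions (6) and (6') translate directly into code. Here is a minimal sketch for finite distributions, using the usual convention 0·ln 0 = 0; the example distributions are arbitrary:

```python
# Cross-entropy (6) and entropy (6') for finite distributions, in nats.
import numpy as np

def entropy(p):
    """H(p) = -sum over w of p(w) ln p(w), with the convention 0 * ln 0 = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """H(p; q) = -sum over w of p(w) ln q(w); terms with p(w) = 0 contribute nothing."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

p = [0.7, 0.2, 0.1]                       # arbitrary example distributions
q = [0.5, 0.3, 0.2]
print(entropy(p), cross_entropy(p, q))    # cross_entropy(p, q) >= entropy(p)
```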

The Kullback-Leibler divergence is defined by the formula:

DKL(pǁq) = Σ_ω p(ω) ln[p(ω)/q(ω)].

It is easy to see that:

(a) H(p) ≥ 0; and H(p) = 0 if and only if p is a degenerate distribution;

(b) H(p;q) ≥ H(p) ≥ 0; and H(p;q) = H(p) if and only if distributions p and q are identical (p≡q);

(c) DKL(pǁq) = Σ_ω p(ω) ln[p(ω)] - Σ_ω p(ω) ln[q(ω)] = H(p;q) - H(p) ≥ 0       (7)

and DKL(pǁq) = 0 if and only if distributions are identical (p≡q).
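Decomposition (7) can also be checked numerically. In the sketch below, p and q are arbitrary example distributions on a three-point set:

```python
# Verifying (7): DKL(p || q) = H(p; q) - H(p) >= 0, for two example distributions.
import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

d_kl = np.sum(p * np.log(p / q))          # definition of DKL(p || q)
h_p = -np.sum(p * np.log(p))              # entropy H(p), definition (6')
h_pq = -np.sum(p * np.log(q))             # cross-entropy H(p; q), definition (6)

print(np.isclose(d_kl, h_pq - h_p))       # True
print(d_kl >= 0.0)                        # True
```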

Kullback-Leibler divergence is known under a variety of names, including relative entropy, Kullback–Leibler distance, information divergence, and information for discrimination. In spite of the name “Kullback–Leibler distance”, DKL(pǁq) is not a true distance between distributions since it is not symmetric and does not satisfy the triangle inequality. But it does satisfy the following inequality:

½ [Σ_ω |p(ω) - q(ω)|]² ≤ DKL(pǁq)       (8)

where Σ_ω |p(ω) - q(ω)| is the so-called total variation distance. This inequality is known as Pinsker’s inequality. It is useful, particularly for proving convergence results and for bounding one measure of dissimilarity in terms of another. One implication of inequality (8) is that convergence in relative entropy implies convergence in total variation. In applications, {p(ω)} typically represents the true distribution of the data, which is unknown. The observations y1, y2, ..., yn are drawn from this distribution, while {q(ω)} represents a model distribution that approximates {p(ω)}. In order to find the distribution that is closest to {p(ω)}, we have to build a model with a predictive distribution q that minimizes DKL(pǁq) or, equivalently, the cross-entropy H(p;q), which is the same thing due to (7), since H(p) does not depend on the parameters of the model.
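Pinsker’s inequality (8) can be spot-checked on randomly generated distributions, as in the sketch below (random Dirichlet draws, purely for illustration):

```python
# A spot check of Pinsker's inequality (8) on random pairs of distributions.
import numpy as np

rng = np.random.default_rng(3)
for _ in range(5):
    p = rng.dirichlet(np.ones(4))         # a random distribution on 4 outcomes
    q = rng.dirichlet(np.ones(4))
    d_kl = np.sum(p * np.log(p / q))      # DKL(p || q)
    l1 = np.sum(np.abs(p - q))            # the total variation term in (8)
    print(0.5 * l1**2 <= d_kl)            # prints True each time, as (8) guarantees
```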

In the case of binary logistic regression, we have Ω = {1, 0}, i.e., the set of label values associated with EVENT and NONEVENT, respectively. The class label yi can be interpreted as the (degenerate) probability of being in class 1, while logistic regression provides an estimate of the probability that the data point is in class 1. The distribution p in this case is the degenerate observed label distribution {yi, 1 - yi}, which takes the values (1, 0) or (0, 1) depending on whether yi = 1 or yi = 0. The distribution q is estimated from the logistic regression model, with q(ω = 1) = P(Y = 1 | Xi) and q(ω = 0) = P(Y = 0 | Xi). Thus, the term -[yi ln P(Y = 1 | Xi) + (1 - yi) ln(1 - P(Y = 1 | Xi))] in (5) is nothing but the binary cross-entropy, for the i-th observation (yi, Xi), between the label distribution {yi, 1 - yi} and {P(Y = 1 | Xi), 1 - P(Y = 1 | Xi)}. At the same time, this term is the contribution of the pair (yi, Xi) to the negative log likelihood. As a result, (5) is just the total cross-entropy measure and, simultaneously, the total negative log likelihood. Hence, negative log likelihood and binary cross-entropy are identical measures; negative log likelihood is the basic measure in statistics (including logistic regression), while cross-entropy and DKL are the favorite measures in machine learning. So, maximizing the log likelihood is equivalent to minimizing the binary cross-entropy or the Kullback-Leibler divergence, which are frequently used in problems of model selection, testing for goodness of fit, etc.
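To tie everything together, the following sketch fits a logistic regression on simulated data (assuming scikit-learn is available) and confirms that the negative log likelihood of the fitted model coincides with the total binary cross-entropy reported by log_loss:

```python
# Negative log likelihood of a fitted logistic regression = total binary cross-entropy.
# Data are simulated; the "true" coefficients are made up for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(4)
n = 1000
X = rng.normal(size=(n, 2))                          # two predictors X1, X2
true_logit = -0.5 + 1.2 * X[:, 0] - 0.8 * X[:, 1]    # hypothetical "true" model (1)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit)))

model = LogisticRegression().fit(X, y)
p1 = model.predict_proba(X)[:, 1]                    # fitted P(Y = 1 | Xi)

nll = -np.sum(y * np.log(p1) + (1 - y) * np.log(1.0 - p1))   # negative log likelihood (5)
total_bce = log_loss(y, p1, normalize=False)                 # summed binary cross-entropy

print(np.isclose(nll, total_bce))                    # True: identical measures
```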

References

Cover, T. M. and Thomas, J. A. (2006) “Elements of Information Theory”, 2nd Edition. John Wiley, New Jersey.
Hastie, T., Tibshirani, R., and Friedman, J. (2009) “The Elements of Statistical Learning: Data Mining, Inference, and Prediction”, 2nd Edition, Springer Series in Statistics, Springer, New York.
Hosmer, D.W. and Lemeshow, S. (1989) “Applied Logistic Regression”. New York: Wiley.
Menard, S.W. (2010) “Logistic Regression: From Introductory to Advanced Concepts and Applications”, Thousand Oaks, CA: Sage.
Shtatland, E.S. (2019) “Logistic Regression and Information Theory: Part 1 - Do log odds have any intuitive meaning?” https://statisticalmiscellany.blogspot.com/2019/09/logistic-regression-and-information.html
