Logistic Regression
Logistic Regression?
In statistics, the logistic model (or logit model) is a statistical model that models the probability of an event taking place by having the log odds for the event be a linear combination of one or more independent variables. (Wikipedia)
As above, when Y is the label (discrete values) and X1, X2 are the features, how can we find the relationship between X1, X2 and Y (the label)?
First, let's take a look at classification.
[Classification]
Y (the label) takes discrete values; for instance, in binary classification there are two possible values, 0 and 1.
Many real-world tasks are classification problems, even though regression is what we study most in statistics.
[True conditional probability function = Bayes classifier]
Just as linear regression distinguishes the true regression function from its estimate, logistic regression has its own pair of functions.
The true conditional probability function in classification, P(x) = P(Y=1|X=x) = E(Y|X=x), plays the same role as the true regression function, f(x) = E(Y|X=x), in regression: it is the ideal function for predicting Y from the X variables.
With the true conditional probability function, the label with the largest conditional probability is chosen as the predicted Y label; this decision rule is called the Bayes classifier.
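As a small illustration (hypothetical probabilities, not from any real data), the Bayes classifier simply picks the label whose conditional probability is largest; for binary labels this means predicting 1 whenever P(Y=1|X=x) exceeds 0.5:

```python
import numpy as np

# Hypothetical conditional probabilities P(Y=1 | X=x) at a few x values.
# In practice the true P(x) is unknown; this only illustrates the decision rule.
p_y1_given_x = np.array([0.10, 0.45, 0.80, 0.95])

# Bayes classifier: pick the label with the largest conditional probability.
# For binary labels, that means predicting 1 whenever P(Y=1|X=x) > 0.5.
bayes_labels = (p_y1_given_x > 0.5).astype(int)
print(bayes_labels)  # [0 0 1 1]
```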
Linear regression? See the earlier post: 2023.02.24 - [머신러닝] - Regression (fin-engin-story.tistory.com)
[Conditional probability estimate]
The conditional probability estimate in classification, hat{P(x)} = 1 / (1 + e^-z) where z = hat{f(x)} = beta0 + sum of (beta i * Xi), plays the same role as the linear regression estimate, hat{f(x)} = beta0 + sum of (beta i * Xi), in regression.
In linear regression, we obtain the best linear regression approximation, fL(x), by gradient descent minimizing MSE or RSS. Once we have found the best parameters (betas), we can plug the x variables into this approximation and get predicted y values.
Likewise, for classification, we obtain the best conditional probability estimate by gradient descent minimizing the loss function, -1/n * Sum of ( y * ln(hat{y}) + (1-y) * ln(1-hat{y}) ). Once we have found the best parameters (betas), we can turn the estimated probability into a label: if the probability is 0.5 or above, the label is 1, otherwise 0 (depending on the chosen threshold).
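A minimal NumPy sketch of these two pieces, the sigmoid probability estimate and the loss it is trained to minimize (the betas and data below are made up purely for illustration):

```python
import numpy as np

def sigmoid(z):
    # conditional probability estimate: hat{P(x)} = 1 / (1 + e^-z)
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(y, y_hat, eps=1e-12):
    # -1/n * sum( y*ln(y_hat) + (1-y)*ln(1-y_hat) ), clipped to avoid ln(0)
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# toy example: z = beta0 + beta1*x1 with made-up betas and data
beta0, beta1 = -1.0, 2.0
x1 = np.array([0.0, 0.5, 1.0, 1.5])
y  = np.array([0,   0,   1,   1  ])

y_hat = sigmoid(beta0 + beta1 * x1)
print(y_hat)                         # estimated probabilities of y = 1
print(cross_entropy_loss(y, y_hat))  # loss for these betas
```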
[Why do we use the conditional probability estimate, rather than the linear regression estimate, as the estimator of the probability?]
The conditional probability estimate is meant to estimate the probability that Y = 1 given X = x.
If the linear regression estimate is used as the probability estimator, it can work reasonably well as a classifier (for example, when there is only one variable, x1), but it can produce values below 0 or above 1, which cannot be probabilities.
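A quick way to see this on toy 0/1 data (np.polyfit stands in here for an ordinary least-squares fit): the fitted line returns values below 0 and above 1 once we move away from the middle of the data.

```python
import numpy as np

# toy binary data: the label switches from 0 to 1 as x grows
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0,   0,   0,   1,   1,   1  ])

# ordinary least-squares line y ~ beta0 + beta1*x (np.polyfit returns [slope, intercept])
beta1, beta0 = np.polyfit(x, y, deg=1)

for x_new in (0.0, 3.5, 8.0):
    p = beta0 + beta1 * x_new
    print(x_new, round(p, 3))  # about -0.4 at x=0 and 1.66 at x=8: not valid probabilities
```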
[How to induce conditional probability estimate, hat{P(x)}]
Let's assume there is only one input variable, x1 (a simple linear regression model).
Starting from the linear regression form, Y = beta0 + beta1*x1 + error,
we replace Y with the log odds: ln( P(Y=1|X=x1) / (1 - P(Y=1|X=x1)) ) = beta0 + beta1*x1 + error.
Solving for the probability gives hat{P(Y=1|X=x1)} = 1 / (1 + e^-z), where z = beta0 + beta1*x1.
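Spelling out the algebra behind that last step (one predictor; the error term is dropped for the population relationship), a short worked derivation:

```latex
% From the log-odds model to the sigmoid (P abbreviates P(Y=1 | X = x_1)):
\ln\frac{P}{1-P} = \beta_0 + \beta_1 x_1 = z
\;\Longrightarrow\;
\frac{P}{1-P} = e^{z}
\;\Longrightarrow\;
P = \frac{e^{z}}{1+e^{z}} = \frac{1}{1+e^{-z}}
```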
[How to do gradient descent]
hat{y} = 1 / (1 + e^-z), where z = hat{beta0} + sum of (hat{beta i} * xi)
In case of linear regression, the loss function is MSE = 1/n * sum of (y – hat{y})^2,
whereas, in the case of logistic regression, the loss function is represented down below.
L(hat{y}, y) = -ln(hat{y}) if y = 1
-ln(1-hat{y}) if y = 0
-> averaged over the n samples: -1/n * Sum of ( y * ln(hat{y}) + (1-y) * ln(1-hat{y}) )
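A minimal gradient-descent sketch for this loss in plain NumPy (made-up data and learning rate; a sketch of the idea, not a production implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy data: one feature plus an intercept column, so betas = [beta0, beta1]
x1 = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
X  = np.column_stack([np.ones_like(x1), x1])
y  = np.array([0, 0, 0, 1, 1, 1])

betas = np.zeros(2)   # start from beta0 = beta1 = 0
lr = 0.5              # made-up learning rate

for step in range(2000):
    y_hat = sigmoid(X @ betas)
    # gradient of the cross-entropy loss w.r.t. the betas: (1/n) * X^T (y_hat - y)
    grad = X.T @ (y_hat - y) / len(y)
    betas -= lr * grad

print(betas)                                    # fitted [beta0, beta1]
print((sigmoid(X @ betas) >= 0.5).astype(int))  # labels at threshold 0.5
```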
[Different Plots]
1. scatterplot -> plots Y = 0 or 1 directly / hard to see any trend
2. histogram -> a little better / easier to interpret, but a linear regression function still cannot be applied
3. conditional density plot -> probability plot (P(Y=1|X=x)),
A. somehow applies linear regression, but has problems
i. linear regression -> can give negative values / but a probability must lie between 0 and 1
ii. linear regression -> poor job of estimating p(x) (conditional probability of Y=1)
iii. linear regression -> still, it is an okay classifier
B. Logistic regression (sigmoid function) -> better for finding the probability
Logistic regression -> does a better job of estimating p(x) (the conditional probability of Y=1); see the plotting sketch below
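A rough matplotlib sketch of these three views on synthetic data (the exact plots in the original post may look different; the data and coefficients here are invented):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# synthetic binary data: the probability of Y=1 rises with x
x = np.sort(rng.uniform(-4, 4, 200))
p_true = 1.0 / (1.0 + np.exp(-1.5 * x))
y = rng.binomial(1, p_true)

# linear regression fit: an okay classifier but a poor probability estimate
b1, b0 = np.polyfit(x, y, deg=1)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].scatter(x, y, s=10)
axes[0].set_title("scatterplot of 0/1 labels")
axes[1].hist([x[y == 0], x[y == 1]], bins=20, label=["y=0", "y=1"])
axes[1].legend()
axes[1].set_title("histograms by label")
axes[2].scatter(x, y, s=10, alpha=0.4)
axes[2].plot(x, b0 + b1 * x, label="linear fit")
axes[2].plot(x, p_true, label="sigmoid P(Y=1|x)")
axes[2].set_title("probability estimates")
axes[2].legend()
plt.tight_layout()
plt.show()
```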
[Summary]
Linear Regression: Y = real values
Logistic Regression: Y = discrete values (labels)
Linear Regression : true regression function, f(x) = E(Y|X=x)
Logistic Regression: true conditional probability function, P(x) = P(Y=1|X=x) = E(Y|X=x) -> the label with the largest probability becomes the y label
- this is called the ‘Bayes classifier’
Linear Regression: linear regression estimate, hat{f(x)} = beta0 + sum of (beta i * Xi)
-> prediction: hat{y} = hat{f(x)} (the model itself is Y = f(x) + errors)
Logistic Regression: conditional probability estimate, hat{P(x)} = Sigmoid( hat{f(x)} ) = 1 / (1 + e^-hat{f(x)})
Linear Regression: Loss Function, L(hat{y}, y) = MSE = 1/n * sum of (hat{y} – y)^2
Logistic Regression: Loss Function, L(hat{y}, y) = -1/n * Sum of ( y * ln(hat{y}) + (1-y) * ln(1-hat{y}) )
Gradient Descent -> same
Linear Regression: best linear regression approximation, fL(x) -> predicted y values
Logistic Regression: best conditional probability estimate -> 0.5 or above -> label 1, otherwise label 0 (depending on the threshold)
Linear Regression: performance evaluation -> Adj R-squared, p-value of each coefficient, MSE
Logistic Regression: performance evaluation -> error rate = 1/n * sum of ( I(yi != hat{yi}) ),
where I( ) is the indicator function (it returns 1 if the condition is met and 0 otherwise)
the more misclassified points (more 1s from the indicator), the higher the error rate; see the sketch below
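Computed directly, the indicator is just an elementwise comparison (toy labels, purely for illustration):

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 0])   # made-up predictions

# I(y_i != hat{y_i}) is 1 for each misclassified point, 0 otherwise
indicator = (y_true != y_pred).astype(int)
error_rate = indicator.mean()        # 1/n * sum of the indicators
print(indicator, error_rate)         # [0 0 1 0 1] 0.4
```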