We started learning logistic regression in class today. Logistic regression is a statistical method for examining datasets in which one or more independent variables influence an outcome. It is typically used for binary classification tasks, where the goal is to predict one of two outcomes, such as whether an email is spam, whether a customer will buy a product, or whether a student will pass or fail an exam.
This statistical model, also known as the logit model, is widely used in classification and predictive analytics.
The purpose of logistic regression is to estimate the probability of an event, such as voting or not voting, given a set of independent variables. Because the outcome is a probability, the dependent variable is bounded between 0 and 1.
In logistic regression, the odds, that is, the probability of success divided by the probability of failure, are transformed using the logit function. This transformation is also known as the log odds, or the natural logarithm of the odds, and is stated mathematically by the following formulas.
pi = 1 / (1 + exp(-(Beta_0 + Beta_1 * X_1 + … + Beta_k * X_k)))

logit(pi) = ln(pi / (1 - pi)) = Beta_0 + Beta_1 * X_1 + … + Beta_k * X_k
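As a quick numeric check (a minimal Python sketch, not part of the original notes), the logit and the logistic (sigmoid) function are inverses of each other:

```python
import math

def sigmoid(x):
    """Logistic (sigmoid) function: maps log odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    """Log odds: the natural logarithm of p / (1 - p)."""
    return math.log(p / (1.0 - p))

# The two functions invert each other:
p = 0.8
assert abs(sigmoid(logit(p)) - p) < 1e-12

# A probability of 0.8 corresponds to odds of 4 to 1 in favour:
print(logit(0.8))  # ln(0.8 / 0.2) = ln(4) ≈ 1.386
```

So the right-hand side of the regression equation lives on the log-odds scale, and applying the sigmoid converts it back into a probability.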
In this logistic regression equation, logit(pi) is the dependent (response) variable and X is the independent variable. The beta parameters, or coefficients, are typically estimated by maximum likelihood estimation (MLE). Over several iterations, this estimation approach tests different values of beta in order to maximise the fit of the log odds.
I carried out the analysis as follows:
- Importing the necessary libraries: We import the pandas library for data manipulation and the scikit-learn library for logistic regression and model evaluation.
- Creating a DataFrame: We create a DataFrame from the provided dataset, which contains information about police shootings.
- Creating a binary target variable: We create a binary target variable, where 0 represents “not shot” and 1 represents “shot”, based on the “manner_of_death” column.
- Selecting independent variables: We select the independent variables for logistic regression, which are “age” and “signs_of_mental_illness”.
- Splitting the dataset: We split the dataset into training and testing sets using the train_test_split function from scikit-learn.
- Fitting a logistic regression model: We fit a logistic regression model to the training data using the LogisticRegression class from scikit-learn.
- Making predictions: We use the trained model to make predictions on the test set.
- Calculating accuracy and confusion matrix: We calculate the accuracy of the model and generate a confusion matrix using the accuracy_score and confusion_matrix functions from scikit-learn.
- Printing the results: We print the accuracy and confusion matrix.
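The steps above can be sketched as follows. The real police-shootings dataset is not included here, so a small synthetic DataFrame with the same column names stands in for it; the exact values and the `random_state` are my own:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic stand-in for the police-shootings data (same column names as in the notes).
df = pd.DataFrame({
    "age": [25, 34, 47, 19, 52, 41, 29, 60, 38, 23, 45, 31],
    "signs_of_mental_illness": [0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0],
    "manner_of_death": ["shot", "shot", "not shot", "shot", "shot", "not shot",
                        "shot", "shot", "not shot", "shot", "shot", "shot"],
})

# Binary target: 1 = "shot", 0 = "not shot".
y = (df["manner_of_death"] == "shot").astype(int)
X = df[["age", "signs_of_mental_illness"]]

# Hold out a test set, fit the model, and evaluate it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
```

With the real dataset, `df` would instead be loaded from the data file, but every step after that is the same.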
The output —
Accuracy: 0.9567341242149338
Confusion Matrix:
[[   0   62]
 [   0 1371]]

(Note that the confusion matrix shows the model predicting class 1 for every test case; the high accuracy mostly reflects the class imbalance in the data.)
These iterations produce the log-likelihood function, and logistic regression seeks to maximise this function in order to find the optimal parameter values.
Once the optimal coefficient (or coefficients, if there are several independent variables) has been found, the conditional probability for each observation can be computed, logged, and summed to yield a predicted probability. In binary classification, a probability below 0.5 predicts 0, while a probability above 0.5 predicts 1.
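That thresholding step is a one-liner (the 0.5 cutoff is the usual default; the example probabilities here are made up):

```python
import numpy as np

# Predicted probabilities for five hypothetical observations.
probs = np.array([0.12, 0.48, 0.50, 0.73, 0.91])

# Threshold at 0.5: strictly greater than 0.5 predicts class 1.
preds = (probs > 0.5).astype(int)
print(preds)  # [0 0 0 1 1]
```

Whether exactly 0.5 maps to 0 or 1 is a convention; here a strict comparison sends it to class 0.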
It is crucial to assess the model’s goodness of fit—the degree to which it accurately predicts the dependent variable—after it has been constructed. One popular technique for evaluating the model’s fit is the Hosmer-Lemeshow test.
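scikit-learn does not ship a Hosmer-Lemeshow test, but a minimal sketch with NumPy and SciPy is possible; the function name, the choice of g = 10 groups, and the simulated data below are my own assumptions:

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y_true, y_prob, g=10):
    """Hosmer-Lemeshow goodness-of-fit test: sort observations by predicted
    probability, split them into g groups, and compare observed vs expected
    event counts in each group with a chi-square statistic (df = g - 2)."""
    order = np.argsort(y_prob)
    y_true = np.asarray(y_true)[order]
    y_prob = np.asarray(y_prob)[order]
    stat = 0.0
    for idx in np.array_split(np.arange(len(y_prob)), g):
        n = len(idx)
        obs = y_true[idx].sum()       # observed events in the group
        exp = y_prob[idx].sum()       # expected events in the group
        p_bar = exp / n               # mean predicted probability
        stat += (obs - exp) ** 2 / (n * p_bar * (1 - p_bar))
    p_value = chi2.sf(stat, df=g - 2)
    return stat, p_value

# Simulated check: outcomes drawn from the predicted probabilities themselves,
# so the model is well calibrated by construction.
rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.95, 200)
y = rng.binomial(1, p)
stat, pval = hosmer_lemeshow(y, p)
print(f"HL statistic = {stat:.2f}, p-value = {pval:.3f}")
```

A large p-value means no evidence of poor fit; a small one suggests the predicted probabilities disagree with the observed outcome rates.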