We started learning logistic regression in class today. Logistic regression is a statistical method for examining datasets in which one or more independent variables influence an outcome. It is typically used for binary classification tasks, where the goal is to predict one of two outcomes, such as whether an email is spam, whether a customer will buy a product, or whether a student will pass or fail an exam.
This statistical model, also known as the logit model, is widely used in classification and predictive analytics.
The purpose of logistic regression is to estimate the probability of an event, such as voting or not voting, given a set of independent variables. Because the outcome is a probability, the dependent variable is bounded between 0 and 1.
In logistic regression, the odds, that is, the probability of success divided by the probability of failure, are transformed using the logit function. This transformation is also known as the log odds, or the natural logarithm of the odds, and is stated mathematically by the following formulas.
pi = 1 / (1 + exp(-(Beta_0 + Beta_1 * X_1 + … + Beta_k * X_k)))

logit(pi) = ln(pi / (1 - pi)) = Beta_0 + Beta_1 * X_1 + … + Beta_k * X_k
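As a quick numeric check (a minimal Python sketch, not part of the original notes), the logit and the logistic (sigmoid) function are inverses of each other:

```python
import math

def sigmoid(x):
    """Logistic (sigmoid) function: maps log odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    """Log odds: the natural logarithm of p / (1 - p)."""
    return math.log(p / (1.0 - p))

# The two functions invert each other:
p = 0.8
assert abs(sigmoid(logit(p)) - p) < 1e-12

# A probability of 0.8 corresponds to odds of 4 to 1 in favour:
print(logit(0.8))  # ln(0.8 / 0.2) = ln(4) ≈ 1.386
```

So the right-hand side of the regression equation lives on the log-odds scale, and applying the sigmoid converts it back into a probability.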
In this logistic regression equation, logit(pi) is the dependent (response) variable and X is the independent variable. The beta parameters, or coefficients, are typically estimated by maximum likelihood estimation (MLE). Over several iterations, this estimation approach tests different values of beta in order to maximise the fit of the log odds.
I carried out the analysis as follows:
- Importing the necessary libraries: We import the pandas library for data manipulation and the scikit-learn library for logistic regression and model evaluation.
- Creating a DataFrame: We create a DataFrame from the provided dataset, which contains information about police shootings.
- Creating a binary target variable: We create a binary target variable, where 0 represents “not shot” and 1 represents “shot”, based on the “manner_of_death” column.
- Selecting independent variables: We select the independent variables for logistic regression, which are “age” and “signs_of_mental_illness”.
- Splitting the dataset: We split the dataset into training and testing sets using the train_test_split function from scikit-learn.
- Fitting a logistic regression model: We fit a logistic regression model to the training data using the LogisticRegression class from scikit-learn.
- Making predictions: We use the trained model to make predictions on the test set.
- Calculating accuracy and confusion matrix: We calculate the accuracy of the model and generate a confusion matrix using the accuracy_score and confusion_matrix functions from scikit-learn.
- Printing the results: We print the accuracy and confusion matrix.
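The steps above can be sketched as follows. The real police-shootings dataset is not included here, so a small synthetic DataFrame with the same column names stands in for it; the exact values and the `random_state` are my own:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic stand-in for the police-shootings data (same column names as in the notes).
df = pd.DataFrame({
    "age": [25, 34, 47, 19, 52, 41, 29, 60, 38, 23, 45, 31],
    "signs_of_mental_illness": [0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0],
    "manner_of_death": ["shot", "shot", "not shot", "shot", "shot", "not shot",
                        "shot", "shot", "not shot", "shot", "shot", "shot"],
})

# Binary target: 1 = "shot", 0 = "not shot".
y = (df["manner_of_death"] == "shot").astype(int)
X = df[["age", "signs_of_mental_illness"]]

# Hold out a test set, fit the model, and evaluate it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
```

With the real dataset, `df` would instead be loaded from the data file, but every step after that is the same.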
The output —
Accuracy: 0.9567341242149338
Confusion Matrix:
[[   0   62]
 [   0 1371]]

(Note that the confusion matrix shows the model predicting class 1 for every test case; the high accuracy mostly reflects the class imbalance in the data.)
These iterations produce the log-likelihood function, and logistic regression seeks to maximise this function in order to find the optimal parameter values.
Once the optimal coefficient (or coefficients, if there are several independent variables) has been found, the conditional probability for each observation can be computed, logged, and summed to yield a predicted probability. In binary classification, a probability below 0.5 predicts 0, while a probability above 0.5 predicts 1.
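That thresholding step is a one-liner (the 0.5 cutoff is the usual default; the example probabilities here are made up):

```python
import numpy as np

# Predicted probabilities for five hypothetical observations.
probs = np.array([0.12, 0.48, 0.50, 0.73, 0.91])

# Threshold at 0.5: strictly greater than 0.5 predicts class 1.
preds = (probs > 0.5).astype(int)
print(preds)  # [0 0 0 1 1]
```

Whether exactly 0.5 maps to 0 or 1 is a convention; here a strict comparison sends it to class 0.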
It is crucial to assess the model’s goodness of fit—the degree to which it accurately predicts the dependent variable—after it has been constructed. One popular technique for evaluating the model’s fit is the Hosmer-Lemeshow test.
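scikit-learn does not ship a Hosmer-Lemeshow test, but a minimal sketch with NumPy and SciPy is possible; the function name, the choice of g = 10 groups, and the simulated data below are my own assumptions:

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y_true, y_prob, g=10):
    """Hosmer-Lemeshow goodness-of-fit test: sort observations by predicted
    probability, split them into g groups, and compare observed vs expected
    event counts in each group with a chi-square statistic (df = g - 2)."""
    order = np.argsort(y_prob)
    y_true = np.asarray(y_true)[order]
    y_prob = np.asarray(y_prob)[order]
    stat = 0.0
    for idx in np.array_split(np.arange(len(y_prob)), g):
        n = len(idx)
        obs = y_true[idx].sum()       # observed events in the group
        exp = y_prob[idx].sum()       # expected events in the group
        p_bar = exp / n               # mean predicted probability
        stat += (obs - exp) ** 2 / (n * p_bar * (1 - p_bar))
    p_value = chi2.sf(stat, df=g - 2)
    return stat, p_value

# Simulated check: outcomes drawn from the predicted probabilities themselves,
# so the model is well calibrated by construction.
rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.95, 200)
y = rng.binomial(1, p)
stat, pval = hosmer_lemeshow(y, p)
print(f"HL statistic = {stat:.2f}, p-value = {pval:.3f}")
```

A large p-value means no evidence of poor fit; a small one suggests the predicted probabilities disagree with the observed outcome rates.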