Project 2 – Fatal Police Shootings – Revised Submission
Dec 1
In today’s analysis of economic indicators, I visually explored the time series of crucial variables in our dataset. Using the seaborn library, I plotted a line plot that shows the evolution of key factors over time. The variables I used include logan_passengers, total_jobs, unemp_rate, med_housing_price, and housing_sales_vol. The result is a timeline across dates, where each line on the plot traces one of these variables. Building this historical timeline of the dataset allowed me to grasp the patterns, trends, and potential connections between these variables.
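A minimal sketch of the kind of line plot described above, assuming the indicators live in a CSV with a parseable date column (the file name is a placeholder):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Placeholder path; the real dataset may be loaded differently.
df = pd.read_csv("economic_indicators.csv")
df["date"] = pd.to_datetime(df["date"])

cols = ["logan_passengers", "total_jobs", "unemp_rate",
        "med_housing_price", "housing_sales_vol"]

# Reshape to long form so seaborn draws one line per indicator.
long_df = df.melt(id_vars="date", value_vars=cols,
                  var_name="indicator", value_name="value")

sns.lineplot(data=long_df, x="date", y="value", hue="indicator")
plt.title("Key Economic Indicators Over Time")
plt.xlabel("Date")
plt.tight_layout()
plt.show()
```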
After that, for insights into the overall direction and behavior of ‘total_jobs’ over the forecasted period, I plotted a linear trend that signifies a consistent, straight-line pattern in the variable’s values.
For that, I performed time series forecasting using the ARIMA (AutoRegressive Integrated Moving Average) model for the ‘total_jobs’ variable in the dataset. I began by importing two functions from the statsmodels.tsa package: seasonal_decompose for decomposing the time series and ARIMA for building the forecasting model. The time series data for ‘total_jobs’ was extracted for the period from January 2013 to December 2019, which served as the training data for the model.
I then initialized an ARIMA model with a specified order of (1, 1, 1) and fitted it to the training data using the fit() method. After fitting, I forecasted future values for the ‘total_jobs’ variable, specifying the number of steps to forecast into the future (in this case, 12 steps, or months).
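A sketch of that forecasting step; recent statsmodels versions expose ARIMA as statsmodels.tsa.arima.model.ARIMA, and the file path and monthly index here are placeholders:

```python
import matplotlib.pyplot as plt
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Placeholder path; a monthly DatetimeIndex lets the forecast carry real dates.
df = pd.read_csv("economic_indicators.csv",
                 parse_dates=["date"], index_col="date").asfreq("MS")

# Training window: January 2013 through December 2019.
series = df.loc["2013-01-01":"2019-12-31", "total_jobs"]

# Fit an ARIMA(1, 1, 1) model and forecast 12 months ahead.
fitted = ARIMA(series, order=(1, 1, 1)).fit()
forecast = fitted.forecast(steps=12)

plt.plot(series.index, series, color="blue", label="Actual")
plt.plot(forecast.index, forecast, color="red", label="Forecast")
plt.xlabel("Date")
plt.ylabel("total_jobs")
plt.legend()
plt.show()
```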
The visual output of the code is a graph that shows the actual time series values for ‘total_jobs’ as a blue line and the forecasted values as a red line. The x-axis represents time (date), and the y-axis represents the values of the ‘total_jobs’ variable. The roughly linear progression of the graph indicates a trend in the ‘total_jobs’ data, suggesting a systematic increase or decrease over time.
Finally, I’ve crafted a web app that acts as a control panel for delving into the depths of our dataset. Picture the dataset as a treasure chest full of various gems, each gem representing different aspects like jobs, housing prices, and more. The app zooms in on two particular gems: total jobs and median housing prices. The dropdown menus act like magical selectors (X and Y) that allow me to choose which gems to compare. With the first dropdown (X), I can pick what I want on the left side of the comparison (let’s call it the x-axis), say the total number of jobs. The second dropdown (Y) lets me choose what I want on the right side (the y-axis), perhaps the median housing prices.
Now, imagine the scatter plot in the middle as a mystical map that shows where these selected gems interact. Initially, the map shows something (we’re not sure what) just to fill the space. As I change my choices with the dropdowns, the map updates, unveiling how total jobs relate to median housing prices. It’s like having a window into the dataset, allowing us to witness patterns and connections between these two key aspects.
This setup essentially turns the exploration of data into an interactive adventure. It’s a dashboard where I pick two things, and it produces a visual representation to help me understand the correlation between those two things.
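A rough sketch of how such a dashboard could be wired up with Dash and Plotly Express; the file path is a placeholder, and the layout is stripped down to the two dropdowns and the scatter plot described above:

```python
import pandas as pd
import plotly.express as px
from dash import Dash, dcc, html, Input, Output

df = pd.read_csv("economic_indicators.csv")      # placeholder path
options = ["total_jobs", "med_housing_price"]    # the two "gems"

app = Dash(__name__)
app.layout = html.Div([
    dcc.Dropdown(id="x-col", options=options, value="total_jobs"),
    dcc.Dropdown(id="y-col", options=options, value="med_housing_price"),
    dcc.Graph(id="scatter"),
])

@app.callback(Output("scatter", "figure"),
              Input("x-col", "value"),
              Input("y-col", "value"))
def update_scatter(x_col, y_col):
    # Redraw the scatter plot whenever either dropdown changes.
    return px.scatter(df, x=x_col, y=y_col)

if __name__ == "__main__":
    app.run(debug=True)
```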
Nov 27
Today, I studied how a 3D scatter plot can help us understand more about economic indicators. After my study, I understood that it reveals the intricate relationships among three variables—total jobs, unemployment rate, and labor force participation rate—by visualizing their distribution and patterns in a three-dimensional space. In this context, the 3D scatter plot offers numerous advantages.
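For reference, here is one way such a 3D scatter plot could be drawn with Plotly Express; the labor force participation column name and the file path are placeholders:

```python
import pandas as pd
import plotly.express as px

df = pd.read_csv("economic_indicators.csv")      # placeholder path

# One point per observation, positioned by the three indicators.
fig = px.scatter_3d(df, x="total_jobs", y="unemp_rate",
                    z="labor_force_part_rate",
                    title="3D Scatter of Economic Indicators")
fig.show()
```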
Nov 23
Correlation Matrix Heatmap:
Today, I delved deeper into our dataset by calculating a correlation matrix for three pivotal economic indicators: total jobs, unemployment rate, and labor force participation rate. This correlation matrix serves as a quantitative measure of the relationships between these variables, presenting correlation coefficients for all pairs. To bring these numerical insights to life, I’ve generated a heatmap using Plotly Express (px.imshow), where each cell’s color vividly communicates the strength and direction of the correlation between the respective variables. The title, “Correlation Matrix Heatmap,” succinctly captures the essence of the visualization, elucidating our primary objective of illustrating the interdependencies among the selected economic indicators.
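A sketch of that heatmap, with the column names as assumptions about the dataset and the file path as a placeholder:

```python
import pandas as pd
import plotly.express as px

df = pd.read_csv("economic_indicators.csv")      # placeholder path
cols = ["total_jobs", "unemp_rate", "labor_force_part_rate"]

# Pairwise correlation coefficients for the three indicators.
corr = df[cols].corr()

fig = px.imshow(corr, text_auto=True, zmin=-1, zmax=1,
                color_continuous_scale="RdBu_r",
                title="Correlation Matrix Heatmap")
fig.show()
```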
Pairplot of Selected Variables:
Shifting focus to the Pairplot of Selected Variables, I’ve harnessed the capabilities of the seaborn library (sns.pairplot) to craft a visual narrative for total jobs, unemployment rate, labor force participation rate, median housing price, and housing sales volume. This pair plot offers a simultaneous and comprehensive view of relationships across all variable pairs. The scatter plots reveal nuanced interactions between numeric variables, while histograms along the diagonal provide insights into individual variable distributions. The title, “Pairplot of Selected Variables,” reinforces the plot’s purpose, underscoring its role in providing a visual panorama of relationships within the chosen economic and housing-related indicators. This plot proves instrumental in uncovering potential patterns, trends, and outliers, contributing to a more holistic understanding of how these variables interact.
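And a sketch of the pair plot, again with assumed column names and a placeholder file path:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("economic_indicators.csv")      # placeholder path
selected = ["total_jobs", "unemp_rate", "labor_force_part_rate",
            "med_housing_price", "housing_sales_vol"]

# Scatter plots for every pair of variables, histograms on the diagonal.
grid = sns.pairplot(df[selected])
grid.fig.suptitle("Pairplot of Selected Variables", y=1.02)
plt.show()
```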
In summary, these visualizations are invaluable tools for unraveling the intricate relationships within our dataset. The correlation matrix heatmap distills complex correlations into a visually accessible form, while the pair plot extends our exploration to a broader set of variables, illuminating potential patterns and trends and offering a nuanced perspective on the data’s interplay.
Nov 17
Nov 13 – Project 3 – Economic Indicators
To start with project 3, we are given the “Employee Earning” dataset from Analyse Boston, which is made available by the City of Boston. It includes the names, job descriptions, and earnings information of all city employees from January 2011 to December 2022, including base pay, overtime, and total compensation. Employee earnings are collected and stored in a CSV file for each year. Definitions for the variables used in the analysis of employee earnings are included in another file.
I started by loading the dataset into a pandas DataFrame, df, and then conducted Principal Component Analysis (PCA) on it using the scikit-learn library. The initial step involved selecting a subset of features from the original dataset, which included columns such as ‘logan_intl_flights,’ ‘hotel_occup_rate,’ ‘hotel_avg_daily_rate,’ and ‘total_jobs.’ These features were chosen as potential candidates for dimensionality reduction.
To ensure a consistent scale across features, I applied standardization using the StandardScaler class from scikit-learn. Standardization transforms the data to have zero mean and unit variance, a crucial step for PCA, which is sensitive to the scale of the features.
Next, I applied PCA to the standardized features. PCA is a technique that transforms the original features into a set of linearly uncorrelated variables called principal components. These components capture the maximum variance in the data. I extracted the explained variance ratio for each principal component, providing insights into the proportion of the total variance that each component explains.
To aid in determining the number of principal components to retain, I plotted the cumulative explained variance against the number of components. This visualization is helpful in understanding how much variance is preserved as we increase the number of components. In this example, the plot facilitated the decision to retain components that collectively explain at least 95% of the variance.
With the optimal number of components identified, I reapplied PCA, creating a new DataFrame containing the principal components. I then concatenated this principal component DataFrame with the original dataset, resulting in a new DataFrame that includes both the original features and the principal components.
The final step involved displaying the first few rows of the DataFrame with principal components, offering a glimpse into how the dataset has been transformed. This process allows for a more concise representation of the data while retaining the essential information captured by the principal components. Adjustments can be made to this methodology based on specific dataset characteristics and analysis goals.
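A condensed sketch of this PCA workflow with scikit-learn; the feature list and the 95% threshold come from the description above, while the file path is a placeholder:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("economic_indicators.csv")      # placeholder path
features = ["logan_intl_flights", "hotel_occup_rate",
            "hotel_avg_daily_rate", "total_jobs"]

# Standardize to zero mean and unit variance before PCA.
X = StandardScaler().fit_transform(df[features])

# Fit a full PCA first to inspect the cumulative explained variance.
pca_full = PCA().fit(X)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)

plt.plot(range(1, len(cum_var) + 1), cum_var, marker="o")
plt.axhline(0.95, linestyle="--")
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.show()

# Keep just enough components to explain at least 95% of the variance.
n_components = int(np.argmax(cum_var >= 0.95)) + 1
pcs = PCA(n_components=n_components).fit_transform(X)

pc_df = pd.DataFrame(pcs, columns=[f"PC{i + 1}" for i in range(n_components)],
                     index=df.index)
df_with_pcs = pd.concat([df, pc_df], axis=1)
print(df_with_pcs.head())
```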
Project 2 – Comprehensive Analysis of Fatal Police Encounters: Unveiling Demographic Disparities, Trends and Analytical Methodologies
[Project 1] Re-submission of Final Report
Oct 23
Today, I worked further on Logistic Regression.
Binary logistic regression, ordinal logistic regression, and multinomial logistic regression are the three primary subtypes of logistic regression.
When the dependent variable is binary, Binary Logistic Regression—the most popular of the three forms of logistic regression—is employed. The dependent variable can take only two possible values.
Accuracy Score
The accuracy score is a commonly used metric to measure the performance of a classification model. It calculates the percentage of correct predictions made by the model. The accuracy score ranges from 0 to 1, where 1 represents a perfect prediction.
Classification Report
The classification report provides a comprehensive summary of the model’s performance for each class in a classification problem. It includes metrics such as precision, recall, and F1-score, which help evaluate the model’s ability to correctly classify instances of each class.
The code provided consists of two main parts: importing the necessary modules and evaluating the logistic regression model’s performance.
First, we import the accuracy_score and classification_report functions from the sklearn.metrics module using the from keyword.
Next, we have the report variable, which stores the classification report generated by the classification_report function. This function takes two arguments: y_test and y_pred. y_test represents the true labels of the test data, while y_pred represents the predicted labels generated by the logistic regression model.
Finally, I printed the accuracy score and the classification report using the print function.
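A minimal sketch of the evaluation code described above, assuming y_test and y_pred already exist from the fitted logistic regression model:

```python
from sklearn.metrics import accuracy_score, classification_report

# y_test: true labels of the test set; y_pred: the model's predictions.
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Logistic Regression Model Accuracy:", accuracy)
print("Classification Report:")
print(report)
```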
Output –
Logistic Regression Model Accuracy: 0.9567341242149338

Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        62
           1       0.96      1.00      0.98      1371

    accuracy                           0.96      1433
   macro avg       0.48      0.50      0.49      1433
weighted avg       0.92      0.96      0.94      1433
Oct 20
We started learning logistic regression in class today. Logistic regression is a statistical method for examining datasets in which one or more independent factors affect a classification outcome. It is typically used for binary classification tasks, where the goal is to forecast an outcome such as whether an email is spam or not, whether a client will buy a product, or whether a student will pass or fail an exam.
In classification and predictive analytics, this statistical model—also referred to as the logit model—is frequently employed.
The purpose of logistic regression is to calculate the likelihood of an event, like voting or not voting, given a collection of independent factors. Because the result is a probability, the dependent variable ranges from 0 to 1.
In logistic regression, the odds, i.e. the likelihood of success divided by the likelihood of failure, are transformed using a logit function. This transformation is also known as the log odds, or the natural logarithm of the odds, and is stated mathematically by the following formulae.
Logit(pi) = 1 / (1 + exp(-pi))

ln(pi / (1 - pi)) = Beta_0 + Beta_1 * X_1 + … + Beta_k * X_k
The dependent variable, or response variable, in this logistic regression equation is logit(pi), while the independent variable is X. Maximum likelihood estimation (MLE) is commonly used to estimate the beta parameters, or coefficients, in this model. Over several iterations, this estimation approach tests different values of beta in order to maximise the fit of the log odds.
I implemented this analysis as follows (a code sketch appears after the list):
- Importing the necessary libraries: We import the pandas library for data manipulation and the scikit-learn library for logistic regression and model evaluation.
- Creating a DataFrame: We create a DataFrame from the provided dataset, which contains information about police shootings.
- Creating a binary target variable: We create a binary target variable, where 0 represents “not shot” and 1 represents “shot”, based on the “manner_of_death” column.
- Selecting independent variables: We select the independent variables for logistic regression, which are “age” and “signs_of_mental_illness”.
- Splitting the dataset: We split the dataset into training and testing sets using the train_test_split function from scikit-learn.
- Fitting a logistic regression model: We fit a logistic regression model to the training data using the LogisticRegression class from scikit-learn.
- Making predictions: We use the trained model to make predictions on the test set.
- Calculating accuracy and confusion matrix: We calculate the accuracy of the model and generate a confusion matrix using the accuracy_score and confusion_matrix functions from scikit-learn.
- Printing the results: We print the accuracy and confusion matrix.
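Here is the sketch referenced above. The column names and label values are assumptions based on the public Washington Post dataset, and the file path is a placeholder:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

df = pd.read_csv("fatal_police_shootings.csv")   # placeholder path

# Binary target: 1 = "shot", 0 = otherwise (assumed label value).
y = (df["manner_of_death"] == "shot").astype(int)

# Independent variables; booleans are cast to integers, age is imputed simply.
X = df[["age", "signs_of_mental_illness"]].copy()
X["signs_of_mental_illness"] = X["signs_of_mental_illness"].astype(int)
X["age"] = X["age"].fillna(X["age"].median())

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
```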
The output —
Accuracy: 0.9567341242149338

Confusion Matrix:
[[   0   62]
 [   0 1371]]
The log-likelihood function is generated by these iterations, and the goal of logistic regression is to maximise this function in order to find the optimal parameter values.
Conditional probabilities for every observation may be computed, logged, and added together to provide a forecast probability when the ideal coefficient—or coefficients, if there are several independent variables—has been identified. A probability less than 0.5 in binary classification predicts 0 whereas a probability greater than 0.5 predicts 1.
It is crucial to assess the model’s goodness of fit—the degree to which it accurately predicts the dependent variable—after it has been constructed. One popular technique for evaluating the model’s fit is the Hosmer-Lemeshow test.
Oct 18
After discussing my questions in the doubt class today, I performed three machine learning classification tasks on the fatal police shootings dataset. These classifications helped me gain a better understanding of predicting the manner of death, the perceived threat level, and whether the incident was recorded on a body camera.
The insights I gained are as follows –
- Age is a crucial factor that is consistently present in every task. This implies that a number of factors related to police contacts are significantly influenced by the age of the person involved.
- The degree of perceived threat has an impact both on determining whether a body camera was in use and on predicting the manner of death. This emphasizes how crucial the perception of threat is in interactions with the police.
- Being armed has a significant impact on all tasks, highlighting the importance of guns in these circumstances.
- While not the most important factor, characteristics like race nonetheless have a significant impact on some activities. This can suggest that there are structural or social issues at work.
- The perceived threat appears to be one of the elements that influences the deployment of a body camera.
Oct 16
I used the gender feature to visualize the dataset in order to better understand its demographics. With almost 2,000 male victims compared to fewer than 250 female victims, the visualization showed a considerable gap in fatalities. I then started investigating which state had the most incidents.
According to my research, California had the highest number of incidents (344), followed by Texas (192 incidents) and Florida (126 incidents).
I also tried to do a temporal analysis on the dataset, and at first it seemed that the most shootings occurred in 2017.
Racial Breakdown: In this section, I employ the sns.countplot() function from the Seaborn library to generate a countplot. This plot offers a visual representation of the distribution of fatal shootings across different years, and I utilize the ‘race’ parameter as the hue. The resulting visualization provides valuable insights into the racial breakdown of these incidents.
Presence of Body Cameras: Similar to the preceding section, in this segment, I use a countplot to depict the presence of body cameras over the years. By setting the hue parameter to ‘body_camera,’ I can assess the changes in the utilization of body cameras during fatal shootings.
Signs of Mental Illness: This section maintains the same structure as the previous two sections, but it centers its focus on the signs of mental illness. The countplot displays the distribution of fatal shootings across different years, with the ‘signs_of_mental_illness’ parameter as the hue.
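A compact sketch of the three countplots described above; the column names follow the dataset, and the file path is a placeholder:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("fatal_police_shootings.csv")   # placeholder path
df["year"] = pd.to_datetime(df["date"]).dt.year

# One countplot per attribute, each broken down by year.
for hue_col in ["race", "body_camera", "signs_of_mental_illness"]:
    sns.countplot(data=df, x="year", hue=hue_col)
    plt.title(f"Fatal shootings per year by {hue_col}")
    plt.show()
```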
I found a discrepancy between my Python output and the data that needs more research. In this instance, even though the dataset has data for the year 2023, my analysis records no shootings for that year.
I also looked into which agencies were engaged in the most incidents, and I found that agency ID 38 had the most incidents overall. I used the ‘armed’ feature to visualize the dataset in order to have a deeper understanding of how many victims were armed and what kinds of weapons they carried. The graphic showed that more than 1,200 victims had firearms, almost 400 had knives, and about 200 had no weapons at all.
In order to gain understanding of incident location patterns, I also carried out an experiment to examine the relationship between latitude and longitude. Plotting this association allowed for the creation of a scatter plot that clearly showed the locations of the incidents and provided insight into their relationship. I intend to investigate more features as I go along in order to have a deeper comprehension of the dataset. A snapshot of this analysis is provided below:
Oct 13
Today during the Friday doubt class, we discussed whether, in our initial exploration, we noticed any correlations or interesting trends between variables, such as the presence or absence of weapons and the outcome of these incidents.
During my initial exploration of the dataset, I did observe some correlations and trends related to the presence or absence of weapons and the outcomes of these incidents. It appeared that there were variations in outcomes based on the presence of weapons. However, to provide more specific and statistically significant insights, we would need to perform a more in-depth analysis, possibly using techniques such as logistic regression to model the relationship between the presence of weapons and the likelihood of specific outcomes.
We also talked about which visualization techniques, other than histograms, we are considering to further analyze and interpret the dataset.
In addition to histograms, I’m considering a variety of visualization techniques to gain a more comprehensive understanding of the dataset. These techniques include scatter plots, which will allow us to explore relationships between two numerical variables, such as age and the type of threat. Additionally, we are contemplating the use of bar charts to effectively visualize categorical data and compare counts or proportions, particularly in the context of understanding the distribution of races among the victims. Time series plots will help us uncover trends and patterns in fatal police shootings over time, while box plots will provide insights into the distribution of numerical data, aiding in the identification of outliers and variations across different variables.
I also discussed heatmaps with Gary, which offer a means to discover correlations between various attributes, shedding light on the interplay of factors in these incidents. However, we also talked about geospatial maps, which can be employed if the dataset contains location data, allowing us to map incidents on a geographic scale and reveal geographic patterns and hotspots.
I’m trying to plot pie charts that can illustrate proportions of specific categorical variables, such as the proportion of different kinds of threats. The selection of the most appropriate visualization technique will depend on the specific questions and insights that I aim to derive from the dataset.
Oct 11- Project 2- Understanding the Data
We began by analysing the data that the Washington Post had gathered on fatal police shootings in the US.
We spoke about the questions in class that would help us come up with a starting point for the analysis.
- How many fatal police shootings occurred between 2015 and 2023?
- How much of a factor does mental illness have in shootings by police?
- Are these deadly police shootings biased towards any particular gender, age, or race?
- Is there a weapon that fugitives use that sees more deadly shots than others?
- Which agencies, if any, are involved in more deadly shootings?
We discovered when reviewing the data that we are missing information about the race of the police officers that were engaged in the shooting. Apart from that, we don’t have information on the reason behind the fugitive’s altercation with the police or if the shot was warranted. The absence of this data might impose some restrictions on our study.
In order to validate the observed skewness, I plan to draw histograms for age, latitude, and longitude later in the project to visually portray the data distributions.
Oct 8 – Submission of CDC Data Analysis Report
Oct 2
As I was working on my report today, it became evident that conducting a t-test on the independent variables within the urban-rural dataset is imperative. Specifically, I focused on the obesity data in the urban-rural context and uncovered a T-statistic of -9.73523394627364, along with a remarkably low p-value of 5.8473952385946553e-40.
These statistical results strongly suggest a substantial distinction between these groups, with the first group exhibiting a significantly lower mean compared to the other group.
The exceedingly low p-value implies that the observed difference is unlikely to be due to random fluctuations but instead represents a meaningful and real divergence between the groups.
Also, when examining the inactivity data, I noted a T-statistic of -9.923344884720628, accompanied by an associated p-value of 9.236304752700284e-14.
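A sketch of how such a comparison can be run with scipy.stats.ttest_ind; how the urban and rural groups are encoded is an assumption, so the column names and file path here are placeholders:

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("urban_rural_obesity.csv")      # placeholder path

# Split the obesity rates into the two groups being compared.
urban = df.loc[df["urban_rural"] == "urban", "obesity_rate"]
rural = df.loc[df["urban_rural"] == "rural", "obesity_rate"]

# Two-sample t-test (Welch's version, which does not assume equal variances).
t_stat, p_value = stats.ttest_ind(urban, rural, equal_var=False)
print("T-statistic:", t_stat)
print("p-value:", p_value)
```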
Up to this point, I have conducted a variety of analyses on the dataset, including linear regression, multiple regression, and t-tests. My report thoroughly documents the outcomes of these analyses as well as any notable challenges encountered along the way.
Sept 27
Today, I explored the Cross Validation and Bootstrap methodologies, seeking to grasp their intricacies on a deeper level. Cross Validation, as I learned, necessitates the initial division of the dataset into two vital segments: the training dataset and the testing dataset. A pivotal choice in this process is the determination of ‘k,’ representing the number of subsets or folds into which the entire dataset will be split. The training dataset comprises ‘k-1’ folds, reserving one fold for the testing dataset. I discovered that the careful selection of a suitable performance metric is of utmost importance to effectively gauge the model’s performance. The Cross Validation process involves repeating the ‘k-1’ iterations, each time calculating the performance metric. The mean of these performance metrics offers a robust estimate of the model’s overall performance, with the ultimate goal being a precise estimate of the test error.
On the other hand, Bootstrap, I found, employs a resampling technique that allows for replacement, setting it apart from Cross Validation. Unlike Cross Validation, which conducts multiple testing iterations using various training datasets and a fixed test dataset, Bootstrap employs a different strategy. Bootstrap proves particularly valuable when dealing with datasets of limited size. It repeatedly samples data points from the observed dataset with replacement, enabling the estimation of the underlying data distribution.
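A small illustrative sketch of both ideas on synthetic data (scikit-learn and NumPy); nothing here refers to the actual CDC dataset:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=3, noise=10, random_state=0)

# k-fold cross-validation: average the test metric over k = 5 held-out folds.
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_mean_squared_error")
print("Mean cross-validated MSE:", -scores.mean())

# Bootstrap: resample with replacement and look at the spread of an estimate.
rng = np.random.default_rng(0)
boot_means = [rng.choice(y, size=len(y), replace=True).mean()
              for _ in range(1000)]
print("Bootstrap mean:", np.mean(boot_means), "+/-", np.std(boot_means))
```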
Reflecting on our ongoing project, I’ve come to realize that Bootstrap appears to be the more suitable approach for our modeling endeavors, given that our CDC dataset comprises approximately 300 data points. Bootstrap’s ability to work effectively with smaller datasets makes it the preferred methodology in this specific scenario. My journey into these methodologies has equipped me with a newfound appreciation for the nuances and versatility they bring to the world of data analysis and modeling.
Sept 25
While working on the project during class today, I encountered the following points:
1. Data Going Haywire: Sometimes, the given data can get all wonky because it wasn’t taken care of properly. You know, it gets corrupted and messy.
2. Manual Oopsies: Then there’s the human factor. Data might vanish because someone forgot to jot it down or made a good old-fashioned mistake.
Now, on the given CDC Diabetes 2018 dataset, we had around 3,000 rows of diabetes info, a bit over 300 rows for obesity, and a little more than 1,000 rows for inactivity. But when I mashed all three sets together, we ended up with just a smidge over 300 rows of data. But guess what? There were still 9 empty spots in the “% inactive” column.
Here’s the kicker: If we had a boatload of data, I might’ve considered tossing those rows with the gaps. But with only 300 rows, I wasn’t too thrilled about shrinking our dataset even further.
So, after doing some digging, I stumbled upon a nifty solution known as “Imputing the Missing Values.” There are actually four ways to do this, you know.
First off, there’s the “Arbitrary Value” method. You just toss in some random number like -3, 0, or 7. But here’s the deal: it’s not the best because it can mess with the data and make it harder to spot patterns. However, I am still considering if I want to use this approach as compared to the bootstrap method.
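For illustration, a sketch of the arbitrary-value approach with pandas; the file and column names are placeholders for the merged CDC data:

```python
import pandas as pd

merged = pd.read_csv("cdc_merged_2018.csv")      # placeholder path

print(merged["% INACTIVE"].isna().sum())         # the missing entries

# Arbitrary-value imputation: replace missing values with a chosen constant.
merged["% INACTIVE"] = merged["% INACTIVE"].fillna(-3)

print(merged["% INACTIVE"].isna().sum())         # now 0
```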
Sept 22
So after yesterday’s class, I started learning resampling tricks. I decided to dive into this whole resampling thing, especially this Cross Validation stuff.
- Resampling? It’s like when you take a test and then you keep redoing the same test over and over again. Or maybe you make up new tests based on the one you already did.
- Why Bother with Resampling? Well, imagine you’re trying to predict stuff with your fancy model, but you don’t have new data to test it on. That’s when resampling comes to the rescue. It helps you make up new data to see how your model does.
- Cross Validation’s Mission: Cross Validation is like your model’s bodyguard. It watches out for those sneaky mistakes caused by your model being too obsessed with the training data.
- Test Error vs. Training Error: Test Error is like the average oopsies your model makes when it meets new data. Training Error is more like the oopsies it makes when it’s practicing on its old pals.
- What’s Overfitting? Picture this: your model is trying to draw a line that’s so snug with certain dots that it forgets about the other dots. That’s overfitting, and it’s not great for making predictions.
- Cross-Validation 101: You split your data into two teams—the training team and the validation team. The training team is like your model’s personal trainer, getting it in shape. Then, the model tries to predict stuff for the validation team to see how well it learned.
Sept 20
Today we discussed a linear model fit to data where both variables are non-normally distributed (skewed, with high variance and high kurtosis), with the help of a crab example.
For example, pre-molt describes the shell’s size prior to molting, while post-molt refers to the dimensions of a crab’s shell after molting.
Today’s proposed data model attempts to predict pre-molt size from post-molt size.
We also tried to understand whether the difference between the two means is statistically significant, and concluded that, by standard statistical inference across a large number of cases, the p-value in this case was less than 0.05, which makes us reject the null hypothesis that there is no real difference.
Post that, we also talked about the t-test analysis. The t-test is like a detective tool for numbers. It helps us figure out if two groups of numbers are different because of luck or if there’s something real going on. For example, imagine you have two groups, like group A and group B. You want to know if their averages (or means) are different. The t-test does this job. It uses a special math thing called “Student’s t-distribution” to make its decision. This math thing is like the rulebook it follows. Now, here’s the tricky part. The t-test wishes it knew the size of a special secret ingredient (let’s call it the “scaling term”) in the math, but it doesn’t. So, it has to guess it from the numbers. If it guesses right (under some specific conditions), it can use its rulebook to say if the groups are different or not.
So, in a nutshell, the t-test helps us see if two groups’ averages are different for real, not just by chance. It’s like the Sherlock Holmes of statistics for comparing numbers.
Sept 18
Today, I had an enlightening experience exploring data visualization with the Python library Seaborn. To kick things off, I imported Seaborn into my Python environment and loaded a dataset for analysis.
What immediately caught my attention was Seaborn’s ability to produce visually appealing and informative plots. I began by creating scatter plots to visualize relationships between variables like diabetes and obesity, as well as diabetes and inactivity. Seaborn’s rich color palette was particularly noteworthy, enhancing the clarity and interpretability of the data.
As I delved deeper into data visualization, I decided to experiment with pair plots. These plots provided a holistic view of the entire dataset, revealing correlations between different variables in one comprehensive display. It was akin to having a bird’s-eye view of the data landscape, offering invaluable insights.
The pinnacle of my exploration was the utilization of heat maps. These dynamic visualizations offered an unparalleled means of uncovering hidden patterns and dependencies within the data. The colors on the heat map vividly conveyed the strength and direction of correlations between variables, making complex relationships readily accessible.
In summary, my journey into Seaborn was not just an introduction to a powerful library; it was an enlightening experience in the art of data visualization. Seaborn’s color palette, pair plots, and heat maps equipped me with essential tools to explore datasets in-depth, enriching my data analysis capabilities.
Sept 15
In today’s class, we encountered the concept of Multiple Correlation, which pertains to measuring the degree of association among three quantitative variables. I learned that typically, we assess the correlation between two variables, but in the context of the current obesity, inactivity, and diabetes data, a three-variable correlation becomes essential.
- Multiple Correlation: This method allows us to gauge how three variables, denoted as x, y, and Z, relate to one another.
- Coefficient Definition: The multiple correlation coefficient comes into play, where x and y function as independent variables, while Z takes on the role of the dependent variable.
- Variable Elimination: When assessing the correlation between two variables, it’s possible to eliminate one of them for simplification.
In my project analysis, I followed a structured approach (a code sketch follows the list):
- Data Consolidation: I initially merged data from three different sheets to facilitate interpretation. The primary key for this merging process was identified as “FIPDS” or “FIPS.”
- Relationship Exploration: Post data merging, I endeavored to establish a connection between inactivity and diabetes. This involved plotting a graph, with diabetes as the independent variable on the x-axis and inactivity as the dependent variable on the y-axis.
- Descriptive Statistics: I further conducted statistical analysis by calculating the mean, median, mode, variance, and standard deviation for the data.
- Three-Variable Relationship: My next step entails exploring the relationship among all three variables and subsequently plotting and analyzing the corresponding graph.
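Here is the sketch referenced above; the sheet and column names are assumptions about the CDC workbook, with FIPS used as the merge key:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the three sheets from the (placeholder) CDC workbook.
diabetes = pd.read_excel("cdc_2018.xlsx", sheet_name="Diabetes")
obesity = pd.read_excel("cdc_2018.xlsx", sheet_name="Obesity")
inactivity = pd.read_excel("cdc_2018.xlsx", sheet_name="Inactivity")

# Merge on the county FIPS code so only matching counties remain.
merged = diabetes.merge(obesity, on="FIPS").merge(inactivity, on="FIPS")

# Scatter plot: diabetes on the x-axis, inactivity on the y-axis.
plt.scatter(merged["% DIABETIC"], merged["% INACTIVE"])
plt.xlabel("% diabetic")
plt.ylabel("% inactive")
plt.show()

# Descriptive statistics for the three variables.
print(merged[["% DIABETIC", "% OBESE", "% INACTIVE"]].describe())
```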
In summary, the utilization of Multiple Correlation and a systematic approach for data analysis has been implemented in my project to understand the intricate relationships between obesity, inactivity, and diabetes.
Sept 13
Before I began with the data analysis, I did some essential data preparation steps, such as cleaning up the data and merging the datasets. This merge brought together information from the three datasets we had, and the result was a dataset containing only around 360 data points.
It seemed to me like a significant reduction in data volume, but I think it is still enough observations to make it suitable for applying the central limit theorem.
After this, I created some fundamental scatterplots. These scatterplots involved pairing up two variables at a time. I’m trying to understand the dataset’s characteristics and to spot any visible trends or patterns that could emerge.
Considering the absence of informative data specifically related to diabetes, I’m leaning towards the use of a linear regression model. The idea is to build a predictive model for a variable Y, with a focus on diabetes. Since there isn’t direct diabetes data available, I’m using the “inactivity” variable as a predictor in relation to obesity.
I’m still in the process of understanding how to perform linear regression analysis practically on this dataset. Once I’m done learning, I will write about that in my next post.
Sept 11
Moving forward with the course lecture, we delved into the course structure and our first project topic: Linear regression. Personally, I find it essential to revisit the fundamentals before diving deep into any subject, so I went back to the course material. Upon reading, I realized the significance of grasping the data thoroughly when employing statistical methods for analysis. Connecting with the data is vital to gain insights into its inherent nature. Since our data originates from a real source, it’s imperative that our predictions reflect a realistic approach rather than blindly fitting it into an overly simplified model.
It’s crucial to acknowledge that real-world data carries inherent errors, and these errors should be accorded due consideration to preserve the authenticity of the data. I came across Carl Friedrich Gauss’s linear least squares model, which minimizes the sum of squared errors to approximate the data points with a linear model. Nevertheless, this model can be unstable and unreliable when the data contain outliers. My plan is to begin by plotting individual models based on available data points, then progress to establishing a correlation between obesity and inactivity to predict diabetes percentages accurately.
Sept 9
After going through the provided dataset and its concept, I think that the data analysis could potentially be performed using linear or multiple regression methods. However, I am not sure about the data’s consistency and adequacy. I am uncertain about how to effectively apply the regression method to this dataset. For instance, the “Obesity” sheet contains only 364 values, whereas the “Diabetes” sheet contains 3143 values. This discrepancy raises concerns about the reliability and comprehensiveness of the dataset for regression analysis.