Sept 27

Today, I explored the Cross Validation and Bootstrap methodologies, trying to understand them on a deeper level. Cross Validation, as I learned, begins by dividing the dataset into two vital segments: the training dataset and the testing dataset. A pivotal choice in this process is the determination of 'k', the number of subsets or folds into which the entire dataset will be split. In each round, 'k-1' folds form the training dataset, reserving one fold for the testing dataset. I discovered that the careful selection of a suitable performance metric is of utmost importance to effectively gauge the model's performance. The Cross Validation process repeats k times, so that every fold serves once as the test set, and the performance metric is calculated on each round. The mean of these performance metrics offers a robust estimate of the model's overall performance, with the ultimate goal being a precise estimate of the test error.
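To make this concrete for myself, here is a minimal sketch of k-fold Cross Validation in Python with scikit-learn. The synthetic data and the choice of a plain linear model are placeholders for illustration, not our actual CDC data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data purely for illustration (not the CDC dataset).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=300)

# k = 5 folds: each fold serves once as the test set, the other k-1 as training.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y,
                         cv=cv, scoring="neg_mean_squared_error")

# The mean of the per-fold metrics estimates the test error.
print("estimated test MSE:", -scores.mean())
```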

On the other hand, Bootstrap, I found, is a resampling technique that samples with replacement, setting it apart from Cross Validation. Whereas Cross Validation partitions the data into folds and tests on each fold in turn without replacement, Bootstrap repeatedly draws data points from the observed dataset with replacement, which lets us approximate the underlying distribution and the variability of statistics computed from it. Bootstrap proves particularly valuable when dealing with datasets of limited size.
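And here is a minimal sketch of the Bootstrap idea: resample the observed data with replacement many times and recompute a statistic (here, simply the mean) each time. The numbers are made up purely for illustration.

```python
import numpy as np

# Synthetic "observed" sample of ~300 points, purely for illustration.
rng = np.random.default_rng(0)
observed = rng.normal(loc=10.0, scale=2.0, size=300)

# Draw B bootstrap samples of the same size, with replacement,
# and recompute the statistic of interest (here, the mean) each time.
B = 1000
boot_means = np.array([
    rng.choice(observed, size=observed.size, replace=True).mean()
    for _ in range(B)
])

# The spread of the bootstrap means estimates the sampling variability.
print("bootstrap standard error of the mean:", boot_means.std(ddof=1))
```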

Reflecting on our ongoing project, I’ve come to realize that Bootstrap appears to be the more suitable approach for our modeling endeavors, given that our CDC dataset comprises approximately 300 data points. Bootstrap’s ability to work effectively with smaller datasets makes it the preferred methodology in this specific scenario. My journey into these methodologies has equipped me with a newfound appreciation for the nuances and versatility they bring to the world of data analysis and modeling.

Sept 25

While working on the project during class today, I encountered the following issues:

1. Data Going Haywire: Sometimes, the given data can get all wonky because it wasn’t taken care of properly. You know, it gets corrupted and messy.

2. Manual Oopsies: Then there’s the human factor. Data might vanish because someone forgot to jot it down or made a good old-fashioned mistake.

Now, in the given CDC Diabetes 2018 dataset, we had around 3,000 rows of diabetes info, a bit over 300 rows for obesity, and a little more than 1,000 rows for inactivity. But when I mashed all three sets together, we ended up with just a smidge over 300 rows of data. But guess what? There were still 9 empty spots in the "% inactive" column.
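To keep track of this, here is roughly what my merge-and-check step looks like in pandas. The file names and column names below are assumptions, not the exact ones in the CDC sheets.

```python
import pandas as pd

# File and column names here are assumptions about the CDC 2018 sheets.
diabetes = pd.read_csv("diabetes_2018.csv")      # ~3,000 rows
obesity = pd.read_csv("obesity_2018.csv")        # ~360 rows
inactivity = pd.read_csv("inactivity_2018.csv")  # ~1,000 rows

# Join on the county FIPS code; a left join for inactivity keeps the rows
# where that value is missing, which is why the gaps show up.
merged = (diabetes.merge(obesity, on="FIPS")
                  .merge(inactivity, on="FIPS", how="left"))

print("rows after merging:", len(merged))
print("missing '% INACTIVE' values:", merged["% INACTIVE"].isna().sum())
```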

Here’s the kicker: If we had a boatload of data, I might’ve considered tossing those rows with the gaps. But with only 300 rows, I wasn’t too thrilled about shrinking our dataset even further.

So, after doing some digging, I stumbled upon a nifty solution known as “Imputing the Missing Values.” There are actually four ways to do this, you know.

First off, there's the "Arbitrary Value" method. You just toss in some fixed, arbitrary number like -3, 0, or 7. But here's the deal: it's not the best because it can distort the data and make it harder to spot patterns. However, I'm still considering whether I want to use this approach as compared to the bootstrap method.
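Just so I remember what the "Arbitrary Value" method would look like in code, here is a tiny sketch in pandas; the file and column names are assumptions.

```python
import pandas as pd

# Hypothetical merged dataset with a few missing values in "% INACTIVE".
merged = pd.read_csv("merged_cdc_2018.csv")  # assumed file name

# Arbitrary-value imputation: fill the gaps with a fixed, clearly
# out-of-range constant (e.g. -3) so the imputed rows stay identifiable.
merged["% INACTIVE"] = merged["% INACTIVE"].fillna(-3)

print("remaining missing values:", merged["% INACTIVE"].isna().sum())
```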

Sept 22

So after yesterday's class, I started learning resampling tricks. I decided to dive into this whole resampling thing, especially Cross Validation.

  • Resampling? It’s like when you take a test and then you keep redoing the same test over and over again. Or maybe you make up new tests based on the one you already did.
  • Why Bother with Resampling? Well, imagine you're trying to predict stuff with your fancy model, but you don't have new data to test it on. That's when resampling comes to the rescue. It lets you reuse the data you already have to see how your model does.
  • Cross Validation’s Mission: Cross Validation is like your model’s bodyguard. It watches out for those sneaky mistakes caused by your model being too obsessed with the training data.
  • Test Error vs. Training Error: Test Error is like the average oopsies your model makes when it meets new data. Training Error is more like the oopsies it makes when it’s practicing on its old pals.
  • What’s Overfitting? Picture this: your model is trying to draw a line that’s so snug with certain dots that it forgets about the other dots. That’s overfitting, and it’s not great for making predictions.
  • Cross-Validation 101: You split your data into two teams—the training team and the validation team. The training team is like your model's personal trainer, getting it in shape. Then, the model tries to predict stuff for the validation team to see how well it learned. (There's a quick sketch of this split right below.)
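Here is the quick sketch I mentioned: a plain training/validation split with a linear model, on synthetic data purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data purely for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=300)

# The "training team" fits the model; the "validation team" checks it.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=1)

model = LinearRegression().fit(X_train, y_train)
train_err = mean_squared_error(y_train, model.predict(X_train))
val_err = mean_squared_error(y_val, model.predict(X_val))

print("training error  :", train_err)
print("validation error:", val_err)  # typically a bit higher than the training error
```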

Sept 20

Today we discussed a linear model fit to data where both variables are non-normally distributed (skewed, with high variance and high kurtosis), with the help of a crab-molt example.

For example, pre-molt describes the shell's size prior to molting, while post-molt refers to the dimensions of a crab's shell after molting.

The model proposed today attempts to predict pre-molt size from post-molt size.

We also tried to understand whether the difference between the two states is statistically significant or not, and came to the conclusion that, by standard statistical inference, the p-value in this case was less than 0.05, which makes us reject the null hypothesis that there is no real difference.

After that, we also talked about t-test analysis. The t-test is like a detective tool for numbers. It helps us figure out if two groups of numbers are different because of luck or if there's something real going on. For example, imagine you have two groups, like group A and group B. You want to know if their averages (or means) are different. The t-test does this job. It uses a special math thing called "Student's t-distribution" to make its decision. This math thing is like the rulebook it follows. Now, here's the tricky part. The t-test wishes it knew the size of a special secret ingredient (let's call it the "scaling term") in the math, but it doesn't. So, it has to estimate it from the numbers. If that estimate is good (under some specific conditions), it can use its rulebook to say if the groups are different or not.

So, in a nutshell, the t-test helps us see if two groups’ averages are different for real, not just by chance. It’s like the Sherlock Holmes of statistics for comparing numbers.
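To see the detective at work, here is a small sketch of a two-sample t-test with SciPy, using two made-up groups A and B.

```python
import numpy as np
from scipy import stats

# Two made-up groups, purely for illustration.
rng = np.random.default_rng(7)
group_a = rng.normal(loc=10.0, scale=2.0, size=50)
group_b = rng.normal(loc=11.0, scale=2.0, size=50)

# Two-sample t-test: are the group means different beyond chance?
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

print("t statistic:", round(t_stat, 3))
print("p-value    :", round(p_value, 4))
if p_value < 0.05:
    print("reject the null hypothesis of equal means")
else:
    print("not enough evidence to reject the null hypothesis")
```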

Sept 18

Today, I had an enlightening experience exploring data visualization with the Python library Seaborn. To kick things off, I imported Seaborn into my Python environment and loaded a dataset for analysis.

What immediately caught my attention was Seaborn’s ability to produce visually appealing and informative plots. I began by creating scatter plots to visualize relationships between variables like diabetes and obesity, as well as diabetes and inactivity. Seaborn’s rich color palette was particularly noteworthy, enhancing the clarity and interpretability of the data.

As I delved deeper into data visualization, I decided to experiment with pair plots. These plots provided a holistic view of the entire dataset, revealing correlations between different variables in one comprehensive display. It was akin to having a bird’s-eye view of the data landscape, offering invaluable insights.

The pinnacle of my exploration was the utilization of heat maps. These dynamic visualizations offered an unparalleled means of uncovering hidden patterns and dependencies within the data. The colors on the heat map vividly conveyed the strength and direction of correlations between variables, making complex relationships readily accessible.
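For my own reference, here is roughly what those three kinds of plots look like in code. The merged file and the column names are assumptions about my dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical merged dataset; the file and column names are assumptions.
df = pd.read_csv("merged_cdc_2018.csv")
cols = ["% DIABETIC", "% OBESE", "% INACTIVE"]

# Scatter plot: diabetes vs. obesity.
sns.scatterplot(data=df, x="% OBESE", y="% DIABETIC")
plt.show()

# Pair plot: every pairwise relationship in one display.
sns.pairplot(df[cols])
plt.show()

# Heat map of the correlation matrix.
sns.heatmap(df[cols].corr(), annot=True, cmap="coolwarm")
plt.show()
```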

In summary, my journey into Seaborn was not just an introduction to a powerful library; it was an enlightening experience in the art of data visualization. Seaborn’s color palette, pair plots, and heat maps equipped me with essential tools to explore datasets in-depth, enriching my data analysis capabilities.

Sept 15

In today’s class, we encountered the concept of Multiple Correlation, which pertains to measuring the degree of association among three quantitative variables. I learned that typically, we assess the correlation between two variables, but in the context of the current obesity, inactivity, and diabetes data, a three-variable correlation becomes essential.

  1. Multiple Correlation: This method allows us to gauge how three variables, denoted as x, y, and Z, relate to one another.
  2. Coefficient Definition: The multiple correlation coefficient comes into play, where x and y function as independent variables, while Z takes on the role of the dependent variable (see the sketch after this list).
  3. Variable Elimination: When assessing the correlation between two of the variables, the effect of the third can be eliminated (held fixed) to simplify the analysis.
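Here is the sketch mentioned above: with two predictors x and y and a dependent variable Z, the multiple correlation R can be computed from the three pairwise correlations. The data below are synthetic, just to show the calculation.

```python
import numpy as np

# Synthetic stand-ins for % obese (x), % inactive (y), and % diabetic (Z).
rng = np.random.default_rng(0)
x = rng.normal(30, 5, 300)
y = 0.5 * x + rng.normal(0, 3, 300)
z = 0.3 * x + 0.4 * y + rng.normal(0, 2, 300)

r_xy = np.corrcoef(x, y)[0, 1]
r_zx = np.corrcoef(z, x)[0, 1]
r_zy = np.corrcoef(z, y)[0, 1]

# Multiple correlation of Z on (x, y):
# R^2 = (r_zx^2 + r_zy^2 - 2*r_zx*r_zy*r_xy) / (1 - r_xy^2)
R = np.sqrt((r_zx**2 + r_zy**2 - 2 * r_zx * r_zy * r_xy) / (1 - r_xy**2))
print(f"multiple correlation of Z on x and y: {R:.3f}")
```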

In my project analysis, I followed a structured approach:

  1. Data Consolidation: I initially merged data from three different sheets to facilitate interpretation. The primary key for this merging process was identified as “FIPDS” or “FIPS.”
  2. Relationship Exploration: Post data merging, I endeavored to establish a connection between inactivity and diabetes. This involved plotting a graph, with diabetes as the independent variable on the x-axis and inactivity as the dependent variable on the y-axis.
  3. Descriptive Statistics: I further conducted statistical analysis by calculating the mean, median, mode, variance, and standard deviation for the data (a small sketch of this step follows the list).
  4. Three-Variable Relationship: My next step entails exploring the relationship among all three variables and subsequently plotting and analyzing the corresponding graph.
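Here is the small sketch I mentioned for the descriptive statistics step; the file and column names are assumptions about my merged dataset.

```python
import pandas as pd

# Hypothetical merged dataset; column names are assumptions.
df = pd.read_csv("merged_cdc_2018.csv")

for col in ["% DIABETIC", "% INACTIVE", "% OBESE"]:
    series = df[col].dropna()
    print(col)
    print("  mean    :", series.mean())
    print("  median  :", series.median())
    print("  mode    :", series.mode().iloc[0])  # first mode if there are several
    print("  variance:", series.var())
    print("  std dev :", series.std())
```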

In summary, I have applied Multiple Correlation and a systematic data analysis approach in my project to understand the intricate relationships between obesity, inactivity, and diabetes.

Sept 13

Before I began with the data analysis, I did some essential data preparation steps, such as cleaning up the data and merging the datasets. This merge brought together information from the three datasets we had, and the result was a dataset containing only around 360 data points.

It seemed to me like a significant reduction in data volume, but I think it's still enough observations to make it suitable for applying the central limit theorem.

After this, I created some fundamental scatterplots, pairing up two variables at a time. I'm trying to understand the dataset's characteristics and to spot any visible trends or patterns that could emerge.

Considering the absence of additional informative data specifically related to diabetes, I'm leaning towards the use of a linear regression model. The idea is to build a predictive model with diabetes as the response variable Y. Since there isn't more direct diabetes data available, I'm using the "inactivity" variable as a predictor, together with obesity.
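A rough sketch of the kind of regression model I have in mind, assuming a merged DataFrame with the column names used below (the names and file are assumptions):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical merged dataset; file and column names are assumptions.
df = pd.read_csv("merged_cdc_2018.csv").dropna()

X = df[["% INACTIVE", "% OBESE"]]  # predictors
y = df["% DIABETIC"]               # response

model = LinearRegression().fit(X, y)
print("coefficients:", dict(zip(X.columns, model.coef_)))
print("intercept   :", model.intercept_)
print("R^2 on the data:", model.score(X, y))
```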

I'm still in the process of understanding how to perform linear regression analysis practically on this dataset. Once I'm done learning, I will write about that in my next post.

Sept 11

Moving forward with the course lecture, we delved into the course structure and our first project topic: Linear regression. Personally, I find it essential to revisit the fundamentals before diving deep into any subject, so I went back to the course material. Upon reading, I realized the significance of grasping the data thoroughly when employing statistical methods for analysis. Connecting with the data is vital to gain insights into its inherent nature. Since our data originates from a real source, it’s imperative that our predictions reflect a realistic approach rather than blindly fitting it into an overly simplified model.

It's crucial to acknowledge that real-world data carries inherent errors, and these errors should be accorded due consideration to preserve the authenticity of the data. I came across Carl Friedrich Gauss's linear least squares model, which fits a linear model by minimizing the sum of squared errors between the data points and the fitted line. Nevertheless, this approach can be unstable and unreliable when the data contain outliers or violate its assumptions. My plan is to begin by plotting individual models based on available data points, then progress to establishing a correlation between obesity and inactivity to predict diabetes percentages accurately.
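As a reminder to myself of what least squares actually computes, here is a tiny sketch with made-up numbers: the slope and intercept that minimize the sum of squared errors.

```python
import numpy as np

# Made-up points; in the project these would be, e.g., % inactivity (x)
# and % diabetic (y) from the merged data.
x = np.array([15.0, 18.0, 20.0, 22.0, 25.0, 28.0])
y = np.array([7.1, 7.8, 8.2, 8.9, 9.4, 10.1])

# Ordinary least squares: the slope b and intercept a that minimize
# sum((y - (a + b*x))**2).
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

print(f"fitted line: y = {a:.3f} + {b:.3f} * x")
```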

Sept 9

After going through the provided dataset and its concept, I think that the data analysis could potentially be performed using linear or multiple regression methods. However, I am not sure about the data's consistency and adequacy, and I am uncertain about how to effectively apply the regression method to this dataset. For instance, the "Obesity" sheet contains only 364 values, whereas the "Diabetes" sheet contains 3143 values. This discrepancy raises concerns about the reliability and comprehensiveness of the dataset for regression analysis.