Before starting the data analysis, I did some essential data preparation: cleaning up the data and merging the datasets. The merge brought together information from the three datasets we had, and the result was a dataset containing only around 360 data points.
That seemed to me like a significant reduction in data volume, but I think it is still enough observations for the central limit theorem to apply.
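To illustrate why a merge can shrink the data so much, here is a minimal sketch in pandas. It assumes the merge was an inner join on a shared key, which keeps only the rows present in all three datasets. The frame names, the `fips` key, the column names, and all values are made up for illustration and are not taken from the actual data.

```python
import pandas as pd

# Toy stand-ins for the three datasets (hypothetical names and values).
diabetes = pd.DataFrame({"fips": [1, 2, 3, 4, 5],
                         "pct_diabetic": [9.1, 10.4, 8.7, 11.2, 9.9]})
obesity = pd.DataFrame({"fips": [2, 3, 4, 6],
                        "pct_obese": [30.1, 28.5, 33.0, 27.4]})
inactivity = pd.DataFrame({"fips": [3, 4, 5, 6],
                           "pct_inactive": [22.3, 25.1, 24.0, 21.8]})

# An inner join keeps only keys that appear in every frame,
# so the result can be much smaller than any single input.
merged = (
    diabetes
    .merge(obesity, on="fips", how="inner")
    .merge(inactivity, on="fips", how="inner")
)

print(len(diabetes), len(obesity), len(inactivity), "->", len(merged))
print(merged)
```

If keeping more rows matters, `how="outer"` would retain every key and fill the gaps with missing values, at the cost of having to handle those gaps later.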
After this, I created some basic scatterplots, pairing up two variables at a time. The goal was to understand the dataset’s characteristics and to spot any visible trends or patterns.
Since the data doesn’t contain much information that relates directly to diabetes, I’m leaning towards a linear regression model. The idea is to pick one variable as the response Y and build a predictive model for it, while keeping the focus on diabetes. Because direct diabetes data isn’t available, I’m using the “inactivity” variable as a predictor for obesity.
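As a rough idea of what such a model looks like, here is a small sketch of a simple linear regression with obesity as Y and inactivity as the predictor, using NumPy's `polyfit`. The numbers are synthetic and only meant to show the mechanics; the variable names are my own placeholders, and this is not necessarily the approach I will end up using on the real data.

```python
import numpy as np

# Synthetic example values (not from the real dataset).
pct_inactive = np.array([18.0, 20.0, 22.0, 24.0, 26.0, 28.0])  # predictor X
pct_obese = np.array([25.0, 26.5, 28.9, 29.8, 31.6, 33.1])     # response Y

# Fit Y = intercept + slope * X by ordinary least squares.
# For degree 1, polyfit returns the coefficients as [slope, intercept].
slope, intercept = np.polyfit(pct_inactive, pct_obese, 1)

# R^2: the share of the variance in Y explained by the fitted line.
predicted = intercept + slope * pct_inactive
ss_res = np.sum((pct_obese - predicted) ** 2)
ss_tot = np.sum((pct_obese - pct_obese.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r_squared:.3f}")
```

The slope is the estimated change in the obesity percentage for each one-point increase in the inactivity percentage, and R² gives a quick sense of how well a straight line describes the relationship.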
I’m still learning how to carry out linear regression analysis on this dataset in practice. Once I’ve worked that out, I’ll write about it in my next post.