Today, I studied how a 3D scatter plot can help us understand more about economic indicators. After my study, I understood that it reveals the intricate relationships among three variables (total jobs, unemployment rate, and labor force participation rate) by visualizing their distribution and patterns in three-dimensional space. In this context, the 3D scatter plot offers numerous advantages.
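As a rough sketch of how such a plot can be built with Plotly Express (the file name and column names below are assumptions for illustration, not the exact ones from my notebook):

```python
import pandas as pd
import plotly.express as px

# Load the economic indicators (file and column names are placeholders).
df = pd.read_csv("economic-indicators.csv")

# Each point is one observation positioned by the three indicators.
fig = px.scatter_3d(
    df,
    x="total_jobs",
    y="unemp_rate",
    z="labor_force_part_rate",
    title="3D Scatter Plot of Economic Indicators",
)
fig.show()
```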
Nov 23
Correlation Matrix Heatmap:
In today’s analysis, I delved deeper into our dataset by calculating a correlation matrix for three pivotal economic indicators: total jobs, unemployment rate, and labor force participation rate. This correlation matrix serves as a quantitative measure of the relationships between these variables, presenting correlation coefficients for all pairs. To bring these numerical insights to life, I generated a heatmap using Plotly Express (px.imshow), where each cell’s color communicates the strength and direction of the correlation between the respective variables. The title, “Correlation Matrix Heatmap,” succinctly captures the essence of the visualization and our primary objective of illustrating the interdependencies among the selected economic indicators.
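The heatmap can be reproduced along these lines, reusing the DataFrame df from the sketch above (column names are again assumed):

```python
import plotly.express as px

# Indicators of interest (column names assumed for illustration).
cols = ["total_jobs", "unemp_rate", "labor_force_part_rate"]

# Pairwise Pearson correlation coefficients.
corr = df[cols].corr()

fig = px.imshow(
    corr,
    text_auto=True,                  # show the coefficient in each cell
    color_continuous_scale="RdBu_r",
    zmin=-1,
    zmax=1,
    title="Correlation Matrix Heatmap",
)
fig.show()
```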
Pairplot of Selected Variables:
Shifting focus to the Pairplot of Selected Variables, I’ve harnessed the capabilities of the seaborn library (sns.pairplot) to craft a visual narrative for total jobs, unemployment rate, labor force participation rate, median housing price, and housing sales volume. This pair plot offers a simultaneous and comprehensive view of relationships across all variable pairs. The scatter plots reveal nuanced interactions between numeric variables, while histograms along the diagonal provide insights into individual variable distributions. The title, “Pairplot of Selected Variables,” reinforces the plot’s purpose, underscoring its role in providing a visual panorama of relationships within the chosen economic and housing-related indicators. This plot proves instrumental in uncovering potential patterns, trends, and outliers, contributing to a more holistic understanding of how these variables interact.
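A minimal sketch of the pair plot, again with assumed column names:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Columns to compare pairwise (names are placeholders for the real ones).
selected = [
    "total_jobs",
    "unemp_rate",
    "labor_force_part_rate",
    "med_housing_price",
    "housing_sales_vol",
]

# Scatter plots for every pair of variables, histograms on the diagonal.
sns.pairplot(df[selected])
plt.suptitle("Pairplot of Selected Variables", y=1.02)
plt.show()
```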
In summary, these visualizations are invaluable tools for unraveling the intricate relationships within our dataset. The correlation matrix heatmap distills complex correlations into a visually accessible form, while the pair plot extends our exploration to a broader set of variables, illuminating potential patterns and trends and offering a nuanced perspective on the data’s interplay.
November 17
Nov 13 – Project 3 – Economic Indicators
To start with Project 3, we are given the “Employee Earnings” dataset from Analyze Boston, which is made available by the City of Boston. It includes the names, job descriptions, and earnings information of all city employees from January 2011 to December 2022, including base pay, overtime, and total compensation. Employee earnings are collected and stored in a CSV file for each year, and definitions for the variables used in the analysis are included in a separate file.
I started by loading the dataset into a DataFrame named df and then began Principal Component Analysis (PCA) using the scikit-learn library. The initial step involved selecting a subset of features from the original dataset, including columns such as ‘logan_intl_flights’, ‘hotel_occup_rate’, ‘hotel_avg_daily_rate’, and ‘total_jobs’. These features were chosen as candidates for dimensionality reduction.
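A minimal sketch of this first step, assuming the indicators live in a single CSV file (the file name is a placeholder):

```python
import pandas as pd

# Load the dataset (file name assumed for illustration).
df = pd.read_csv("economic-indicators.csv")

# Candidate features for dimensionality reduction.
features = [
    "logan_intl_flights",
    "hotel_occup_rate",
    "hotel_avg_daily_rate",
    "total_jobs",
]
X = df[features]
```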
To ensure a consistent scale across features, I applied standardization using the StandardScaler from scikit-learn. Standardization transforms the data to have zero mean and unit variance, a crucial step for PCA, which is sensitive to the scale of the features.
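In code, the standardization step might look like this (continuing from the feature subset above):

```python
from sklearn.preprocessing import StandardScaler

# Rescale every feature to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```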

Next, I applied PCA to the standardized features. PCA is a technique that transforms the original features into a set of linearly uncorrelated variables called principal components. These components capture the maximum variance in the data. I extracted the explained variance ratio for each principal component, providing insights into the proportion of the total variance that each component explains.
To aid in determining the number of principal components to retain, I plotted the cumulative explained variance against the number of components. This visualization is helpful in understanding how much variance is preserved as we increase the number of components. In this example, the plot facilitated the decision to retain components that collectively explain at least 95% of the variance.
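A hedged sketch of fitting PCA and plotting the cumulative explained variance (the plotting details in my notebook may differ):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Fit PCA on the standardized features and inspect the variance per component.
pca = PCA()
pca.fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Cumulative explained variance versus number of components.
plt.plot(range(1, len(cumulative) + 1), cumulative, marker="o")
plt.axhline(0.95, color="red", linestyle="--", label="95% of variance")
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.legend()
plt.show()

# Smallest number of components that explains at least 95% of the variance.
n_components = int(np.argmax(cumulative >= 0.95)) + 1
```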

With the optimal number of components identified, I reapplied PCA, creating a new DataFrame containing the principal components. I then concatenated this principal component DataFrame with the original dataset, resulting in a new DataFrame that includes both the original features and the principal components.
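Roughly, using the n_components chosen above:

```python
import pandas as pd
from sklearn.decomposition import PCA

# Re-fit PCA with the chosen number of components.
pca = PCA(n_components=n_components)
components = pca.fit_transform(X_scaled)

# Put the components in a DataFrame and attach them to the original data.
pc_cols = [f"PC{i + 1}" for i in range(n_components)]
df_pca = pd.concat(
    [df.reset_index(drop=True), pd.DataFrame(components, columns=pc_cols)],
    axis=1,
)
```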

The final step involved displaying the first few rows of the DataFrame with principal components, offering a glimpse into how the dataset has been transformed. This process allows for a more concise representation of the data while retaining the essential information captured by the principal components. Adjustments can be made to this methodology based on specific dataset characteristics and analysis goals.
Project 2 – Comprehensive Analysis of Fatal Police Encounters: Unveiling Demographic Disparities, Trends and Analytical Methodologies
[Project 1] Re-submission of Final Report
Oct 23
Today, I worked further on Logistic Regression.
Binary logistic regression, ordinal logistic regression, and multinomial logistic regression are the three primary subtypes of logistic regression.
Binary logistic regression, the most popular of the three forms, is employed when the dependent variable is binary, meaning the outcome can take only two possible values (for example, 0 or 1).
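As a quick, hedged illustration of fitting a binary logistic regression with scikit-learn (synthetic data stands in for the project’s actual features and target):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary data as a stand-in for the real features and labels.
X, y = make_classification(n_samples=1000, n_features=5, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the binary logistic regression model and predict on the test split.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```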
Accuracy Score
The accuracy score is a commonly used metric for measuring the performance of a classification model. It calculates the proportion of correct predictions made by the model and ranges from 0 to 1, where 1 represents perfect prediction.
Classification Report
The classification report provides a comprehensive summary of the model’s performance for each class in a classification problem. It includes metrics such as precision, recall, and F1-score, which help evaluate the model’s ability to correctly classify instances of each class.
The code provided consists of two main parts: importing the necessary modules and evaluating the logistic regression model’s performance.
First, we import the accuracy_score and classification_report functions from the sklearn.metrics module with a from ... import statement.

Next, the model’s accuracy is computed with accuracy_score, and the report variable stores the classification report generated by the classification_report function. Both functions take two arguments: y_test, the true labels of the test data, and y_pred, the predicted labels generated by the logistic regression model.
Finally, I printed the accuracy score and the classification report using the print function.
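A minimal sketch of this evaluation, assuming y_test and y_pred come from a fitted model as in the snippet above:

```python
from sklearn.metrics import accuracy_score, classification_report

# Compare the predicted labels against the true test labels.
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Logistic Regression Model Accuracy:", accuracy)
print("Classification Report:")
print(report)
```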
Output –
Logistic Regression Model Accuracy: 0.9567341242149338
Classification Report:
             precision    recall  f1-score   support

           0      0.00      0.00      0.00        62
           1      0.96      1.00      0.98      1371

    accuracy                          0.96      1433
   macro avg      0.48      0.50      0.49      1433
weighted avg      0.92      0.96      0.94      1433