Nov 27

Today, I studied how a 3D scatter plot can help us understand economic indicators. After my study, I understood that it reveals the intricate relationships among three variables (total jobs, unemployment rate, and labor force participation rate) by visualizing their distribution and patterns in a three-dimensional space. In this context, the 3D scatter plot offers several advantages (a short code sketch follows the list):

  1. Visualizing Multivariate Relationships:
    • With three variables in play, the 3D scatter plot enables me to visualize the joint distribution of total jobs, unemployment rate, and labor force participation rate at once. I found this valuable when navigating complex relationships where multiple factors may influence one another.
  2. Identifying Clusters and Patterns:
    • The three-dimensional space becomes a canvas for identifying clusters, patterns, or trends in the data that might elude detection in traditional two-dimensional plots. Clusters within the 3D scatter plot have the potential to unveil subgroups or nuanced patterns within the dataset.
  3. Examining Correlations:
    • The visualization allows me to visually assess the correlations between the three variables. For instance, if total jobs and labor force participation rate exhibit simultaneous increases while the unemployment rate decreases, this correlation pattern may come to light within the 3D scatter plot.
  4. Outlier Detection:
    • Outliers, representing noteworthy or unusual observations, are more easily discerned within the 3D space. The identification of outliers, whether in one variable or across multiple variables, proves crucial for comprehending the unique characteristics of the dataset.
  5. Enhanced Data Exploration:
    • The 3D scatter plot offers a more immersive and interactive approach to exploring the data. I can manipulate the plot, rotating it and viewing it from different angles, thereby uncovering hidden patterns or relationships that may remain concealed in static, 2D visualizations.
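
As a concrete starting point, here is a minimal sketch of such a plot with Plotly Express. The file name and column names are assumptions and may differ from the actual dataset.

import pandas as pd
import plotly.express as px

# Load the indicators; the file name is a placeholder
df = pd.read_csv("economic-indicators.csv")

# Each point is one month, positioned by the three indicators
fig = px.scatter_3d(
    df,
    x="total_jobs",
    y="unemp_rate",
    z="labor_force_part_rate",
    title="3D Scatter of Economic Indicators",
)
fig.show()

Rotating the rendered figure (by dragging in the browser) is what makes point 5 above practical: clusters that overlap in any single 2D projection often separate once the camera moves.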

Nov 23

Correlation Matrix Heatmap:

Today, I delved into our dataset by calculating a correlation matrix for three pivotal economic indicators: total jobs, unemployment rate, and labor force participation rate. This correlation matrix serves as a quantitative measure of the relationships between these variables, presenting correlation coefficients for all pairs. To bring these numerical insights to life, I generated a heatmap using Plotly Express (px.imshow), where each cell’s color communicates the strength and direction of the correlation between the respective variables. The title, “Correlation Matrix Heatmap,” succinctly captures the essence of the visualization, reflecting our primary objective of illustrating the interdependencies among the selected economic indicators.
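
A sketch of the heatmap code, assuming df is the loaded indicators DataFrame; the column names are assumptions.

import plotly.express as px

indicators = ["total_jobs", "unemp_rate", "labor_force_part_rate"]
corr = df[indicators].corr()          # pairwise Pearson correlations

fig = px.imshow(
    corr,
    text_auto=".2f",                  # print each coefficient inside its cell
    zmin=-1, zmax=1,                  # anchor the color scale to the full range
    color_continuous_scale="RdBu_r",
    title="Correlation Matrix Heatmap",
)
fig.show()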

Pairplot of Selected Variables:

Shifting focus to the Pairplot of Selected Variables, I’ve harnessed the capabilities of the seaborn library (sns.pairplot) to craft a visual narrative for total jobs, unemployment rate, labor force participation rate, median housing price, and housing sales volume. This pair plot offers a simultaneous and comprehensive view of relationships across all variable pairs. The scatter plots reveal nuanced interactions between numeric variables, while histograms along the diagonal provide insights into individual variable distributions. The title, “Pairplot of Selected Variables,” reinforces the plot’s purpose, underscoring its role in providing a visual panorama of relationships within the chosen economic and housing-related indicators. This plot proves instrumental in uncovering potential patterns, trends, and outliers, contributing to a more holistic understanding of how these variables interact.
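
A matching sketch for the pair plot; the two housing-related column names are guesses based on the description above.

import seaborn as sns
import matplotlib.pyplot as plt

cols = ["total_jobs", "unemp_rate", "labor_force_part_rate",
        "med_housing_price", "housing_sales_vol"]

sns.pairplot(df[cols])   # scatter plots off the diagonal, histograms on it
plt.suptitle("Pairplot of Selected Variables", y=1.02)
plt.show()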

In summary, these visualizations are invaluable tools for unraveling the intricate relationships within our dataset. The correlation matrix heatmap distills complex correlations into a visually accessible form, while the pair plot extends our exploration to a broader set of variables, illuminating potential patterns and trends and offering a nuanced perspective on the data’s interplay.

Nov 17

I initiated the analysis by importing economic indicators data and performing data preprocessing, which included converting the ‘Year’ and ‘Month’ columns to a date-time format. This step was crucial for temporal analysis as it allowed for a more nuanced exploration of trends over time.
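
A sketch of that preprocessing step; the file name is a placeholder.

import pandas as pd

df = pd.read_csv("economic-indicators.csv")   # placeholder file name

# Combine 'Year' and 'Month' into one datetime column for temporal analysis
df["Date"] = pd.to_datetime(df["Year"].astype(str) + "-" + df["Month"].astype(str) + "-01")
df = df.sort_values("Date")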

I also employed Matplotlib to create a line plot, offering a visual representation of the passenger counts at Logan Airport across the temporal dimension. The resulting plot highlighted a discernible trend in passenger counts, and notable fluctuations suggested potential variations in travel patterns or external factors influencing airport activity.
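
The plot itself is a few lines; 'logan_passengers' is an assumed column name for the monthly counts.

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 4))
plt.plot(df["Date"], df["logan_passengers"])
plt.xlabel("Date")
plt.ylabel("Passengers")
plt.title("Logan Airport Passenger Counts Over Time")
plt.tight_layout()
plt.show()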

To delve deeper into the economic context, I computed a correlation matrix for three key indicators: total jobs, unemployment rate, and labor force participation rate. The matrix revealed insightful relationships, including a robust negative correlation (-0.87) between total jobs and the unemployment rate, signifying an inverse relationship between employment levels and unemployment. Simultaneously, a robust positive correlation (0.86) emerged between total jobs and the labor force participation rate, suggesting that as job opportunities increased, so did participation in the labor force. Additionally, a moderate negative correlation (-0.57) between the unemployment rate and labor force participation rate hinted at the complex interplay between these two indicators.
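
Computing the matrix is one line with pandas; the column names are assumptions.

# Pairwise correlations among the three indicators
corr = df[["total_jobs", "unemp_rate", "labor_force_part_rate"]].corr()
print(corr.round(2))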

To enhance the analytical depth, I turned to Plotly Express for dynamic and interactive visualizations. The initial line chart provided a comprehensive view of the temporal evolution of total jobs, unemployment rate, and labor force participation rate. The subsequent dynamic bubble chart introduced additional dimensions, with the size of bubbles representing the unemployment rate, the color denoting the year, and an animated sequence unfolding over months. The third visualization, a scatter plot with animation, offered a view into the relationship between total jobs and labor force participation rate over the years, employing size and color variations to incorporate the unemployment rate as a contextual element.
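
A sketch of the third visualization; the column names are assumptions based on the description above.

import plotly.express as px

# Animated scatter: one frame per year; bubble size and color carry the
# unemployment rate as context
fig = px.scatter(
    df,
    x="total_jobs",
    y="labor_force_part_rate",
    size="unemp_rate",
    color="unemp_rate",
    animation_frame="Year",
    title="Total Jobs vs. Labor Force Participation Rate",
)
fig.show()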

Continuing the analysis, I created a heatmap to visually represent the correlation matrix, providing an easily interpretable snapshot of the interrelationships among the economic indicators. Additionally, a 3D scatter plot was generated to explore the intricate relationships among total jobs, unemployment rate, and labor force participation rate in a three-dimensional space, offering a nuanced perspective on their co-variation.

In summary, the analysis encompassed a comprehensive exploration of temporal trends in Logan Airport passenger counts. It further delved into the intricate correlations among key economic indicators over time, leveraging a diverse set of visualization techniques to provide a nuanced understanding of the underlying dynamics.

Nov 13 – Project 3 – Economic Indicators

To start with Project 3, we are given the “Employee Earnings” dataset from Analyze Boston, which is made available by the City of Boston. It includes the names, job descriptions, and earnings information (base pay, overtime, and total compensation) of all city employees from January 2011 to December 2022. Employee earnings are collected and stored in a CSV file for each year, and definitions for the variables used in the analysis are included in a separate file.

I started by loading the dataset into a DataFrame, df, and then conducted Principal Component Analysis (PCA) on it using the scikit-learn library. The initial step involved selecting a subset of features from the original dataset, including the columns ‘logan_intl_flights’, ‘hotel_occup_rate’, ‘hotel_avg_daily_rate’, and ‘total_jobs’. These features were chosen as candidates for dimensionality reduction.

To ensure a consistent scale across features, I applied standardization using the StandardScaler from scikit-learn. Standardization transforms the data to have zero mean and unit variance, a crucial step for PCA, which is sensitive to the scale of the features.
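
A sketch of the selection and standardization steps, assuming df is the loaded DataFrame.

from sklearn.preprocessing import StandardScaler

# Feature subset selected for PCA
features = ["logan_intl_flights", "hotel_occup_rate",
            "hotel_avg_daily_rate", "total_jobs"]
X = df[features]

# Rescale each feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)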

Next, I applied PCA to the standardized features. PCA is a technique that transforms the original features into a set of linearly uncorrelated variables called principal components. These components capture the maximum variance in the data. I extracted the explained variance ratio for each principal component, providing insights into the proportion of the total variance that each component explains.

To aid in determining the number of principal components to retain, I plotted the cumulative explained variance against the number of components. This visualization is helpful in understanding how much variance is preserved as we increase the number of components. In this case, the plot facilitated the decision to retain components that collectively explain at least 95% of the variance.
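
A sketch of the variance inspection, continuing from X_scaled above.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Fit PCA with all components to inspect the variance profile
pca = PCA()
pca.fit(X_scaled)

cum_var = np.cumsum(pca.explained_variance_ratio_)
plt.plot(range(1, len(cum_var) + 1), cum_var, marker="o")
plt.axhline(0.95, linestyle="--", color="gray")   # 95% variance threshold
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.show()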

With the optimal number of components identified, I reapplied PCA, creating a new DataFrame containing the principal components. I then concatenated this principal component DataFrame with the original dataset, resulting in a new DataFrame that includes both the original features and the principal components.

The final step involved displaying the first few rows of the DataFrame with principal components, offering a glimpse into how the dataset has been transformed. This process allows for a more concise representation of the data while retaining the essential information captured by the principal components. Adjustments can be made to this methodology based on specific dataset characteristics and analysis goals.
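
A sketch of these final steps, continuing from the variables above.

import pandas as pd

# Smallest number of components explaining at least 95% of the variance
n_components = int(np.argmax(cum_var >= 0.95)) + 1

pca = PCA(n_components=n_components)
pcs = pca.fit_transform(X_scaled)

pc_df = pd.DataFrame(pcs, index=df.index,
                     columns=[f"PC{i + 1}" for i in range(n_components)])
df_with_pcs = pd.concat([df, pc_df], axis=1)
print(df_with_pcs.head())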

Oct 23

Today, I worked further on logistic regression.

Binary logistic regression, ordinal logistic regression, and multinomial logistic regression are the three primary subtypes of logistic regression.

Binary Logistic Regression, the most popular of the three forms of logistic regression, is employed when the dependent variable is binary; that is, the outcome can take only two values.
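
A minimal sketch of fitting one with scikit-learn; X and y stand in for a feature matrix and a 0/1 target from the dataset.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X: feature matrix, y: binary (0/1) target, both assumed given
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000)   # raise max_iter to aid convergence
model.fit(X_train, y_train)
y_pred = model.predict(X_test)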

Accuracy Score

The accuracy score is a commonly used metric to measure the performance of a classification model. It calculates the proportion of correct predictions made by the model. The accuracy score ranges from 0 to 1, where 1 represents a perfect prediction.

Classification Report

The classification report provides a comprehensive summary of the model’s performance for each class in a classification problem. It includes metrics such as precision, recall, and F1-score, which help evaluate the model’s ability to correctly classify instances of each class.

The code provided consists of two main parts: importing the necessary modules and evaluating the logistic regression model’s performance.

First, we import the accuracy_score and classification_report functions from the sklearn.metrics module with a from ... import statement.

Next, we have the report variable, which stores the classification report generated by the classification_report function. This function takes two arguments: y_test and y_pred. y_test represents the true labels of the test data, while y_pred represents the predicted labels generated by the logistic regression model.

Finally, I printed the accuracy score and the classification report using the print function.
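
Putting the pieces together, a reconstruction of the snippet described above, with y_test and y_pred as in the earlier sketch.

from sklearn.metrics import accuracy_score, classification_report

accuracy = accuracy_score(y_test, y_pred)        # fraction of correct predictions
report = classification_report(y_test, y_pred)   # per-class precision/recall/F1

print("Logistic Regression Model Accuracy:", accuracy)
print("Classification Report:")
print(report)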

Output –

Logistic Regression Model Accuracy: 0.9567341242149338
Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        62
           1       0.96      1.00      0.98      1371

    accuracy                           0.96      1433
   macro avg       0.48      0.50      0.49      1433
weighted avg       0.92      0.96      0.94      1433