Take-home Exercise 3

This is my Take-home Exercise on exploring the financial health of Ohio in USA, espically focusing on the prosperity of the businesses there.

Joyce Tseng https://www.linkedin.com/in/joyce-tseng-a7115a1aa/ (School of Computing and Information Systems (SMU))https://scis.smu.edu.sg/master-it-business
2022-05-14

1. Overview

In this take-home exercise, we reveal Ohio’s economy and look more in details on its business prosperity by creating interactive data visualizations with ggplot2 and its extension packages in R. The data is provided and can be downloaded from VAST Challenge 2022 and the topic is about the first point in challenge 3.

2. Data Preparation

2.1 Installing Packages

The following packages and libraries are installed for this exercise. Important and relative packages for plotting an interactive graph in R are:

packages = c('tidyverse', 'readxl', 'knitr', 'dplyr', 'ggplot2', 'ggiraph',
             'plotly', 'DT', 'patchwork','crosstalk', 'ggridges')

for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

2.2 Importing the Dataset

To reveal Ohio’s financial health, we would only focus on their main commercial businesses, but not including catering industry like restaurants and Pubs. Therefore, the only dataset would be used in this exercise is regard to Ohio residents’ Job information. There are 6 variables in this dataset, including their working hours, working days, hourly rate and education requirement and so on.

Jobs <- read_csv("Datasets/Attributes/Jobs.csv")

2.3 Data Wrangling and Exploration

DERIVING A NEW COLUMN

Here we would like to create a new column to view how many employees are there in each company. By doing so, we use the mutate() of dplyr to generate the frequency by employer_id.

# add frequency column
final_data <- final_data %>%
  group_by(employerId)%>%
  dplyr::mutate(count=n())

RENAMING THE COLUMN

The default name of using count to derive the above new column would call count, which is not intuitive to understand. We would like to change the column name and call it No.Employees.

# rename the column
final_data <- final_data %>%
  rename(c('No.Employees' = 'count'))

REMOVING USELESS COLUMNS

There are some columns that would not be used in our visualization later. To keep the data more clean, we remove some useless columns by subset.

# Remove columns
final_data <- subset(final_data, select = -c(daysToWork, buildingType, type))

CONVERTING DATA TYPE

Since we would use the deriving columns that we created previously as the category to classify the raw data, we need to convert the data type from numeric to categorical.

# Convert numeric column into categorical
final_data$No.Employees <- as.factor(final_data$No.Employees)

3. Data Visualizations and Insights

3.1 Distribution of Numbers of Companies on Numbers of Employees

Generally, numbers of employees can be regarded as a company’s productivity. Thus, to identify which business is more popular and appears to be more prosperous in Ohio, we would first take a look at their employees number. Since there are over 500 of businesses in our dataset, we use numbers of employees as our category to be the x-axis and display which enterprises have hired the most and the least workers.

To create an interactive graph, it is not enough to only use ggplot2 to make it. The bar chart of numbers of companies on numbers of employees is plotted as follows:

Through the above function, when selecting each whole bar on the graph, the below table would automatically show more details for the corresponding row data.

dd <- highlight_key(final_data)

graph1 <- ggplot(dd, aes(x = No.Employees)) +
  geom_bar(fill = '#468499') +
  theme(axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
        panel.background= element_blank(), axis.line= element_line(color= 'grey')) +
  ggtitle("The Distribution of Numbers of Companies on Numbers of Employees") +
  xlab("Numbers of Employees") +
  ylab("Count")

gg <- highlight(ggplotly(graph1),
                "plotly_selected")

crosstalk::bscols(gg, 
                  widths = c(12,12),
                  DT::datatable(dd,
                                rownames = FALSE,
                                colnames = c('Education Requirement' = 'educationRequirement',
                                             'Employer Id' = 'employerId',
                                             'Hourly Rate' = 'hourlyRate',
                                             'Numbers of Employees' = 'No.Employees'), 
                                filter = 'top',
                                class = 'display'))

From the plot, it can be observed:

3.2 Distribution of Hourly Rate in Prosperous Businesses

It is not enough to just look at the numbers of labours in a company to display its prosperity. Here, we would like to see the distribution of hourly rate by numbers of employees and further observe whether larger scale of companies would pay more to their employees. The violin graph is plotted by ploy_ly(), which is a function in plotly and is more or less same as ggplotly() to construct an interacitve graph.

graph2 <- plot_ly(final_data, 
                 x = ~No.Employees, 
                 y = ~hourlyRate,
                 split = ~No.Employees,
                 type = 'violin',
                 box = list(visible = T),
                 meanline = list(visible = T)) %>%
  layout(title = "The Distribution of Hourly Rate in Prosperous Businesses",
         xaxis = list(title = 'Numbers of Employees'),
         yaxis = list(title = 'Hourly Rate'))

graph2

The violin plot with boxplot can easily show the distribution of hourly rate in each category and disclose some statistical information from the dataset. The graph indicates:

3.3 Distribution of Hourly Rate by Education Requirement

Ridge plot is a good visualization to display the distribution of continuous value based on a categorical variable. The possible factor to affect the hourly rate is education requirement. We would like to see whether higher education level would be granted with more wage.

ggplot(final_data, 
       aes(x = hourlyRate, y = educationRequirement, fill = factor(stat(quantile)))) + 
  stat_density_ridges(geom = "density_ridges_gradient",
                      calc_ecdf = TRUE,
                      quantiles = c(0.025, 0.975)) +
  scale_fill_manual(
    name = "Probability", values = c("#FF0000A0", "#A0A0A0A0", "#0000FFA0"),
    labels = c("(0, 0.025]", "(0.025, 0.975]", "(0.975, 1]")) +
  ggtitle("The Distribution of Hourly Rate by Education Requirement") +
  xlab("Hourly Rate") +
  theme(axis.title.y = element_blank()) 

It can be observed that people who are graduate surely earn more money than others. The density of hourly rate in graduate distribution is around 20 to 30 dollars; however, others are around or below than 15 dollars. In addition, its range in 95% of probability distribution is much wider than other, meaning that most people in this category have higher wage.

3.4 Distribution of Numbers of Employees by Education Requirement

CREATING STATISTICAL DATATABLE

After exploring the distribution of hourly rate by different education requirements, we would like to see more statistical data about hourly rate based on education requirement by companies’ size. Hence, we clean the needed columns and create a new data table for the following graph.

# Create statistical data information
graph1_sum <- final_data %>%
  group_by(No.Employees, educationRequirement) %>%
  dplyr::summarise(n=n(), mean=mean(hourlyRate), sd=sd(hourlyRate)) %>%
  mutate(se=sd/sqrt(n-1))

# Convert int column into category
graph1_sum$No.Employees <- as.factor(graph1_sum$No.Employees)

# Round off to two decimals
graph1_sum$mean <- round(graph1_sum$mean, digit = 2) 
graph1_sum$se <- round(graph1_sum$se, digit = 2) 
graph1_sum$sd <- round(graph1_sum$sd, digit = 2) 

PLOTTING THE GRAPH

The bar graph displays the percentage of different education requirement roles by companies’ size. Moreover, it is connected to the below data table and can select a bar to see the corresponding data.

d1 <- highlight_key(graph1_sum)

final_data$tooltip <- c(paste0(
  "No. Employees:", final_data$No.Employees,
  "Percentage: ", final_data$n,
  "Eduction: ", final_data$educationRequirement))

graph3 <- ggplot(d1, aes(x = No.Employees, y = n, fill = educationRequirement)) +
  geom_bar(position="fill", stat="identity") +
  theme(legend.position="top") +
  ggtitle("Numbers of Employess by Education Requirement") +
  xlab("Numbers of Employees") +
  ylab("Percentage\n(%)")


gg3 <- highlight(ggplotly(graph3, tooltip = final_data$tooltip),
                "plotly_selected")

crosstalk::bscols(gg3, 
                  widths = c(12,12),
                  DT::datatable(d1,
                                rownames = FALSE,
                                colnames = c('Numbers of Employees' = 'No.Employees',
                                             'Education Requirement' = 'educationRequirement',
                                             'N' = 'n',
                                             'Average Hourly Rate' = 'mean',
                                             'Hourly Rate Std' = 'sd', 
                                             'Hourly Rate SE' = 'se'),
                                class = 'display'))

There are some intereting findings in this interactive graph:

4. Learing Points

This take-home exercise helps us to hone our skills to design a good visualization for other people to use. Furthermore, it also makes me more understand the usage of different packages in R for plotting a graph, especially ggplot2 with ggirafe and ggplotly.

My key takeaways are: