This is my Take-home Exercise on exploring the financial health of Ohio in USA, espically focusing on the prosperity of the businesses there.
In this take-home exercise, we reveal Ohio’s economy and look more in
details on its business prosperity by creating interactive data
visualizations with ggplot2
and its extension packages in
R. The data is provided and can be downloaded from VAST Challenge 2022
and the topic is about the first point in challenge 3.
The following packages and libraries are installed for this exercise. Important and relative packages for plotting an interactive graph in R are:
tidyverse
designs for data science and is used
extensively for data preparation and wrangling.ggiraph
creates dynamic ggplot graphs with hover and
click actions.DT
provides an R interface to the JavaScript library DataTables that creates interactive
table on html page.patchwork
combines multiple ggplot objects into a
single figure.tooltip
a column of datasets are displayed when the
mouse is over responding elements.packages = c('tidyverse', 'readxl', 'knitr', 'dplyr', 'ggplot2', 'ggiraph',
'plotly', 'DT', 'patchwork','crosstalk', 'ggridges')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
To reveal Ohio’s financial health, we would only focus on their main commercial businesses, but not including catering industry like restaurants and Pubs. Therefore, the only dataset would be used in this exercise is regard to Ohio residents’ Job information. There are 6 variables in this dataset, including their working hours, working days, hourly rate and education requirement and so on.
Jobs <- read_csv("Datasets/Attributes/Jobs.csv")
DERIVING A NEW COLUMN
Here we would like to create a new column to view how many employees
are there in each company. By doing so, we use the mutate() of
dplyr
to generate the frequency by employer_id.
# add frequency column
final_data <- final_data %>%
group_by(employerId)%>%
dplyr::mutate(count=n())
RENAMING THE COLUMN
The default name of using count to derive the above new column would call count, which is not intuitive to understand. We would like to change the column name and call it No.Employees.
# rename the column
final_data <- final_data %>%
rename(c('No.Employees' = 'count'))
REMOVING USELESS COLUMNS
There are some columns that would not be used in our visualization later. To keep the data more clean, we remove some useless columns by subset.
CONVERTING DATA TYPE
Since we would use the deriving columns that we created previously as the category to classify the raw data, we need to convert the data type from numeric to categorical.
# Convert numeric column into categorical
final_data$No.Employees <- as.factor(final_data$No.Employees)
Generally, numbers of employees can be regarded as a company’s productivity. Thus, to identify which business is more popular and appears to be more prosperous in Ohio, we would first take a look at their employees number. Since there are over 500 of businesses in our dataset, we use numbers of employees as our category to be the x-axis and display which enterprises have hired the most and the least workers.
To create an interactive graph, it is not enough to only use
ggplot2
to make it. The bar chart of numbers of companies
on numbers of employees is plotted as follows:
plotly
package,
which can help us create an object of class
crosstalk::SharedData to connect and show corresponding outputs
of two graphs or a graph with data table.plotly
package. It can link to multiple plotly graphs and set various options
for brushing them.Through the above function, when selecting each whole bar on the graph, the below table would automatically show more details for the corresponding row data.
dd <- highlight_key(final_data)
graph1 <- ggplot(dd, aes(x = No.Employees)) +
geom_bar(fill = '#468499') +
theme(axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
panel.background= element_blank(), axis.line= element_line(color= 'grey')) +
ggtitle("The Distribution of Numbers of Companies on Numbers of Employees") +
xlab("Numbers of Employees") +
ylab("Count")
gg <- highlight(ggplotly(graph1),
"plotly_selected")
crosstalk::bscols(gg,
widths = c(12,12),
DT::datatable(dd,
rownames = FALSE,
colnames = c('Education Requirement' = 'educationRequirement',
'Employer Id' = 'employerId',
'Hourly Rate' = 'hourlyRate',
'Numbers of Employees' = 'No.Employees'),
filter = 'top',
class = 'display'))
From the plot, it can be observed:
It is not enough to just look at the numbers of labours in a company
to display its prosperity. Here, we would like to see the distribution
of hourly rate by numbers of employees and further observe whether
larger scale of companies would pay more to their employees. The violin
graph is plotted by ploy_ly(), which is a function in
plotly
and is more or less same as ggplotly() to
construct an interacitve graph.
graph2 <- plot_ly(final_data,
x = ~No.Employees,
y = ~hourlyRate,
split = ~No.Employees,
type = 'violin',
box = list(visible = T),
meanline = list(visible = T)) %>%
layout(title = "The Distribution of Hourly Rate in Prosperous Businesses",
xaxis = list(title = 'Numbers of Employees'),
yaxis = list(title = 'Hourly Rate'))
graph2
The violin plot with boxplot can easily show the distribution of hourly rate in each category and disclose some statistical information from the dataset. The graph indicates:
Ridge plot is a good visualization to display the distribution of continuous value based on a categorical variable. The possible factor to affect the hourly rate is education requirement. We would like to see whether higher education level would be granted with more wage.
ggplot(final_data,
aes(x = hourlyRate, y = educationRequirement, fill = factor(stat(quantile)))) +
stat_density_ridges(geom = "density_ridges_gradient",
calc_ecdf = TRUE,
quantiles = c(0.025, 0.975)) +
scale_fill_manual(
name = "Probability", values = c("#FF0000A0", "#A0A0A0A0", "#0000FFA0"),
labels = c("(0, 0.025]", "(0.025, 0.975]", "(0.975, 1]")) +
ggtitle("The Distribution of Hourly Rate by Education Requirement") +
xlab("Hourly Rate") +
theme(axis.title.y = element_blank())
It can be observed that people who are graduate surely earn more money than others. The density of hourly rate in graduate distribution is around 20 to 30 dollars; however, others are around or below than 15 dollars. In addition, its range in 95% of probability distribution is much wider than other, meaning that most people in this category have higher wage.
CREATING STATISTICAL DATATABLE
After exploring the distribution of hourly rate by different education requirements, we would like to see more statistical data about hourly rate based on education requirement by companies’ size. Hence, we clean the needed columns and create a new data table for the following graph.
# Create statistical data information
graph1_sum <- final_data %>%
group_by(No.Employees, educationRequirement) %>%
dplyr::summarise(n=n(), mean=mean(hourlyRate), sd=sd(hourlyRate)) %>%
mutate(se=sd/sqrt(n-1))
# Convert int column into category
graph1_sum$No.Employees <- as.factor(graph1_sum$No.Employees)
# Round off to two decimals
graph1_sum$mean <- round(graph1_sum$mean, digit = 2)
graph1_sum$se <- round(graph1_sum$se, digit = 2)
graph1_sum$sd <- round(graph1_sum$sd, digit = 2)
PLOTTING THE GRAPH
The bar graph displays the percentage of different education requirement roles by companies’ size. Moreover, it is connected to the below data table and can select a bar to see the corresponding data.
d1 <- highlight_key(graph1_sum)
final_data$tooltip <- c(paste0(
"No. Employees:", final_data$No.Employees,
"Percentage: ", final_data$n,
"Eduction: ", final_data$educationRequirement))
graph3 <- ggplot(d1, aes(x = No.Employees, y = n, fill = educationRequirement)) +
geom_bar(position="fill", stat="identity") +
theme(legend.position="top") +
ggtitle("Numbers of Employess by Education Requirement") +
xlab("Numbers of Employees") +
ylab("Percentage\n(%)")
gg3 <- highlight(ggplotly(graph3, tooltip = final_data$tooltip),
"plotly_selected")
crosstalk::bscols(gg3,
widths = c(12,12),
DT::datatable(d1,
rownames = FALSE,
colnames = c('Numbers of Employees' = 'No.Employees',
'Education Requirement' = 'educationRequirement',
'N' = 'n',
'Average Hourly Rate' = 'mean',
'Hourly Rate Std' = 'sd',
'Hourly Rate SE' = 'se'),
class = 'display'))
There are some intereting findings in this interactive graph:
This take-home exercise helps us to hone our skills to design a good
visualization for other people to use. Furthermore, it also makes me
more understand the usage of different packages in R for plotting a
graph, especially ggplot2 with ggirafe
and
ggplotly
.
My key takeaways are:
Adding interactivity is like a auxiliary tool for us to tell more stories in a graph; however, it is not always a must. It is possible to be an additional function for people to use or even bother them to read from the visualization. Thus, we need to think about the purpose for our visualization design.
There are many packages can help us build an interactive plot in R, but some of them cannot use together and even would affect another execution. We should always plan first before starting to write down the code so that we can select the correct package to create what we expect.