This is my take-home exercise on exploring the demographics of Ohio, USA, using data from VAST Challenge 2022.
In this take-home exercise, we explore the demographics of Ohio, USA, by creating data visualizations with ggplot2 in R. The data is provided by, and can be downloaded from, VAST Challenge 2022. The data visualizations included in this exercise are a pie chart, a bar chart, a faceted histogram, a boxplot, and a scatterplot.
By default, ggplot2 orders the categories of a discrete variable alphabetically by the values in the column.
The dplyr package can summarize each group into fewer rows, so we can derive the average income of each participant across timestamps.
The packages tidyverse, plotly, readxl, knitr, dplyr, ggplot2, and grid are used in this exercise. The code chunk below installs any that are missing and loads them into the RStudio environment.
packages <- c('tidyverse', 'plotly', 'readxl', 'knitr', 'dplyr', 'ggplot2',
              'grid')
# Install any package that is not yet available, then attach it
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
Data import was accomplished using read_csv() from the readr package (loaded as part of tidyverse), which reads CSV files into a tibble.
participants <- read_csv("Datasets/Attributes/Participants.csv")
FinancialJournal <- read_csv("Datasets/Journals/FinancialJournal.csv")
# Inspect the structure of every column in each data frame
glimpse(participants)
glimpse(FinancialJournal)
DERIVING NUMBERS OF KIDS AND ADULTS
To derive new columns from existing ones, mutate() from dplyr is used together with the conditional function if_else().
participants_mutated <- participants %>%
  mutate(kids = if_else(haveKids == TRUE & age < 50, householdSize - 2, 0)) %>%
  mutate(adults = if_else(haveKids == TRUE & age < 50, 0, householdSize))
MERGING THE TWO DATA FRAMES
Before combining the data frames, a join table is created that holds the average wage of each participant, keyed by participant ID. The base R function merge() is then used to add the new column, wage, to the original data frame.
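The code for this step is not shown above, so here is a minimal sketch of how the join table and the merge could look. It runs on toy data; the column names participantId, category, and amount, and the choice to keep only rows with category "Wage", are assumptions about the VAST Challenge 2022 financial journal rather than the exact code used in this exercise.

```r
library(dplyr)

# Toy stand-ins for the real tables (assumed column names)
participants_toy <- tibble(participantId = c(0, 1, 2),
                           age = c(36, 25, 52))
journal_toy <- tibble(
  participantId = c(0, 0, 1, 1, 2),
  category      = c("Wage", "Wage", "Wage", "Food", "Wage"),
  amount        = c(100, 200, 50, -10, 80)
)

# Join table: one row per participant holding the mean of its wage entries
wage_by_participant <- journal_toy %>%
  filter(category == "Wage") %>%
  group_by(participantId) %>%
  summarise(wage = mean(amount), .groups = "drop")

# merge() matches rows on the shared key column and appends wage
participants_wage <- merge(participants_toy, wage_by_participant,
                           by = "participantId")
```

With the real data, participants_mutated and FinancialJournal would take the place of the two toy tables.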
PIVOTING DATA
The kids and adults counts sit in two separate columns of the original data frame. This wide layout is not convenient for ggplot2, so we use the gather() function from tidyr to pivot them into a longer structure.
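As an illustration of this pivot, the sketch below totals the two columns and gathers them into one row per group. The toy values are made up, and the output column names group and value are assumptions chosen to match the columns that the pie chart code reads from Population.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the kids and adults columns of participants_mutated
toy <- tibble(kids = c(1, 0, 2), adults = c(0, 2, 0))

# Total each column, then pivot the two totals into key/value rows
Population_toy <- toy %>%
  summarise(kids = sum(kids), adults = sum(adults)) %>%
  gather(key = "group", value = "value", kids, adults)
```

In current versions of tidyr, gather() is superseded by pivot_longer(), which could be used here in the same way.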
PLOTTING THE GRAPH
To look at the demographics of a city, population composition is the first thing we would like to know. A pie chart is a suitable choice because it displays the proportions of a single variable.
The graph was plotted using ggplot2 as follows:
ggplot(Population, aes(x = "", y = value, fill = group)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar("y", start = 0) +
  theme_void() +
  geom_label(aes(label = value),
             position = position_stack(vjust = 0.5),
             show.legend = FALSE) +
  geom_text(aes(label = paste(round(value / sum(value) * 100, 1), "%"), x = 1.3),
            position = position_stack(vjust = 0.5)) +
  scale_fill_brewer() +
  ggtitle("Population Composition in Ohio USA") +
  theme(plot.title = element_text(hjust = 0.5))
After a quick look at the population, another important demographic factor is education level. The bar chart shows the number of people in each category, sorted in descending order.
The graph was plotted using ggplot2 as follows:
ggplot(participants_mutated,
       aes(x = reorder(educationLevel, educationLevel,
                       function(x) -length(x)))) +
  geom_bar(fill = "lightsteelblue1") +
  ylim(0, 580) +
  # after_stat(count) replaces the deprecated ..count.. notation
  geom_text(stat = "count",
            aes(label = paste0(after_stat(count), ", ",
                               round(after_stat(count) / sum(after_stat(count)) * 100, 1),
                               "%")),
            vjust = -1) +
  ylab("No. of\nParticipants") +
  theme(axis.title.y = element_text(angle = 0)) +
  ggtitle("Population Distribution on Education Level") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(axis.title.x = element_blank())
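The reorder() call in the chart above is what turns the default alphabetical order into a descending-by-count order. The sketch below shows the effect on a small made-up vector; the values are illustrative only and are not taken from the dataset.

```r
# Toy vector standing in for participants_mutated$educationLevel
edu <- c("Low", "Bachelors", "Graduate", "Bachelors", "HighSchoolOrCollege",
         "HighSchoolOrCollege", "HighSchoolOrCollege", "Graduate", "Bachelors",
         "HighSchoolOrCollege")

# Default: a character vector becomes a factor with alphabetical levels
default_levels <- levels(factor(edu))

# reorder(x, x, FUN) scores each category with FUN and sorts ascending;
# negating the group size therefore puts the largest group first
by_count <- levels(reorder(edu, edu, function(x) -length(x)))

print(default_levels)
print(by_count)
```

Because ggplot2 draws discrete axes in factor-level order, passing the reordered factor to aes(x = ...) is enough to sort the bars.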
Next, we look at the age distribution within each education level. Each panel also shows, as a grey background, the age distribution of all participants for comparison.
The graph was plotted using ggplot2 as follows:
participants_mutated$educationLevel <-
  factor(participants_mutated$educationLevel,
         levels = c('Low', 'HighSchoolOrCollege', 'Bachelors', 'Graduate'))
data <- participants_mutated
# Background data: drop the facet variable so every panel shows all participants
data_bg <- select(data, -educationLevel)
ggplot(data, aes(x = age, fill = educationLevel)) +
  geom_histogram(data = data_bg, fill = "grey", alpha = .5) +
  geom_histogram(colour = "white") +
  facet_wrap(. ~ educationLevel) +
  guides(fill = "none") +
  theme_bw() +
  theme(strip.background = element_rect(fill = "beige")) +
  xlab("Age") +
  theme(axis.title.x = element_text(angle = 0)) +
  theme(axis.title.y = element_blank()) +
  ggtitle("Numbers of People in Different Age in Each Education Level") +
  theme(plot.title = element_text(hjust = 0.5))
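The grey background in the plot above relies on how ggplot2 treats a layer whose data lacks the faceting variable: that layer is repeated in every panel. The following sketch checks this on made-up data by counting the observations each layer contributes to each panel. The toy ages and the bins = 4 setting are illustrative assumptions.

```r
library(ggplot2)

# Toy data: four participants in each of two education levels
toy <- data.frame(
  age = c(20, 25, 30, 35, 40, 45, 50, 55),
  educationLevel = rep(c("Low", "Graduate"), each = 4)
)

# Background copy without the facet variable, so it is not split by panel
toy_bg <- toy[, names(toy) != "educationLevel", drop = FALSE]

p <- ggplot(toy, aes(x = age)) +
  geom_histogram(data = toy_bg, fill = "grey", alpha = .5, bins = 4) +
  geom_histogram(aes(fill = educationLevel), colour = "white", bins = 4) +
  facet_wrap(~ educationLevel) +
  guides(fill = "none")

# layer_data() returns the computed bins of one layer, with a PANEL column
bg <- layer_data(p, 1)
fg <- layer_data(p, 2)
bg_counts <- tapply(bg$count, bg$PANEL, sum)  # all participants, per panel
fg_counts <- tapply(fg$count, fg$PANEL, sum)  # panel's own group only
```

Here the background layer contributes the full sample to each panel, while the coloured layer contributes only the group that the panel belongs to.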
A boxplot is a convenient way to compare multiple distributions, since the boxes can be placed side by side along a discrete variable. Here we use boxplots to compare the wages of participants by education level.
The graph was plotted using ggplot2 as follows:
ggplot(data, aes(x = educationLevel, y = wage)) +
  geom_boxplot() +
  # Mark the median of each group; fun replaces the deprecated fun.y
  geom_point(stat = "summary",
             fun = median,
             colour = "red",
             size = 1.5) +
  ylab("Wage\n(USD)") +
  theme(axis.title.y = element_text(angle = 0)) +
  theme(axis.title.x = element_blank()) +
  ggtitle("Median Income by Education Level") +
  theme(plot.title = element_text(hjust = 0.5))
Since age may also affect income, we use a scatterplot of wage against age, which makes it easy to read the value of each point on both axes.
The graph was plotted using ggplot2 as follows:
ggplot(participants_mutated,
       aes(x = age, y = wage, colour = factor(educationLevel))) +
  geom_point(size = 1) +
  facet_grid(. ~ educationLevel, scales = "free") +
  theme(strip.background = element_rect(fill = "beige")) +
  guides(colour = "none") +
  xlab("Age") +
  theme(axis.title.x = element_text(angle = 0)) +
  ylab("Wage\n(USD)") +
  theme(axis.title.y = element_text(angle = 0)) +
  ggtitle("Wage by Age in Each Education Level") +
  theme(plot.title = element_text(hjust = 0.5))
From the plots, we can say that the education level in Ohio is high. Almost 40% of people hold a bachelor's degree or higher, and over 50% went to work after high school or college. In general, a higher education level is expected to come with a higher income, and the boxplot above supports this: the median wage rises from the high school or college group to the graduate group. Although the median for the low education group is higher than for the others, that group is small (less than 10% of participants) and its wages are widely spread, so we would not conclude that a low education level alone leads to a higher income. The scatterplot, however, shows that most people earn under 250 dollars regardless of education level. We can therefore infer that those in Ohio whose income is over 250 dollars belong to a higher income class.
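The shares and medians quoted in this discussion can be cross-checked with a grouped summary. The sketch below shows one way to do so with dplyr; the toy rows and their wages are made up for illustration and do not reproduce the figures above.

```r
library(dplyr)

# Toy stand-in for participants_mutated with the merged wage column
toy <- tibble(
  educationLevel = c("Low", "HighSchoolOrCollege", "HighSchoolOrCollege",
                     "Bachelors", "Graduate"),
  wage = c(90, 60, 80, 120, 200)
)

# One row per education level: group size, median wage, and share of total
summary_by_edu <- toy %>%
  group_by(educationLevel) %>%
  summarise(n = n(),
            median_wage = median(wage),
            .groups = "drop") %>%
  mutate(share_pct = round(n / sum(n) * 100, 1))
```

Running the same pipeline on participants_mutated would give the counts and percentages behind the bar chart and the medians behind the boxplot.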
This take-home exercise was a great opportunity to become familiar with cleaning data and building visualizations with R packages, especially tidyverse, ggplot2, and their extensions.
My key takeaways are: