1. Overview

In this take-home exercise, we explore the demographics of Ohio, USA, by creating data visualizations with ggplot2 in R. The data is provided by, and can be downloaded from, VAST Challenge 2022. The visualizations in this exercise are:

The proportion of kids and adults in the population.
The number of participants at each of four education levels: Low, High School or College, Bachelors, and Graduate.
The number of participants at each education level by age.
The median wage by education level.
The distribution of wage over age by education level.

2. Data Preparation

2.1 Challenges Faced

The raw data only provides a Boolean value indicating whether each surveyed volunteer has kids. It gives no further detail, such as the kids’ ages, that would let us identify the population composition.
Participants’ income is recorded at many different timestamps, and the records are not consistent across participants.
By default, ggplot2 arranges the categories of a column in alphabetical order, not in a meaningful order such as size.

2.2 Ways to Overcome Challenges

In the US, the legal age of adulthood is generally 18, so we define a kid as a person under 18. According to a Forbes study, the average age of first-time mothers in the USA is 21 to 26, and that of first-time fathers is 27 to 31. Given this, we can infer that the kids of participants aged 50 or over are already adults.
The summarize() function in the dplyr package collapses each group into fewer rows. We can therefore derive each participant's average income across the different timestamps.
The relevel() function in R helps us reorder the levels of a factor: it moves a chosen reference level to the front, and factor() with an explicit levels argument sets the complete order of the grouping variable.
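
As a minimal sketch of the level-ordering idea, the base R snippet below uses a made-up vector of education levels. The spellings of the category values are an assumption and may differ from those in the actual data. It contrasts the default alphabetical order with relevel(), which moves a single level to the front, and with factor(levels = ...), which fixes the whole order.

```r
# Assumed category spellings, listed in the order we want the plot to show
edu_order <- c("Low", "HighSchoolOrCollege", "Bachelors", "Graduate")

# Toy data (not from the challenge files); default levels are alphabetical
edu <- factor(c("Graduate", "Low", "Bachelors", "HighSchoolOrCollege", "Low"))
levels(edu)

# relevel() moves one reference level to the front and keeps the rest as is
levels(relevel(edu, ref = "Low"))

# factor() with an explicit levels argument sets the complete order
edu_ordered <- factor(edu, levels = edu_order)
levels(edu_ordered)
```

Since ggplot2 draws discrete categories in factor-level order, a column reordered this way should appear in the intended sequence on the axis or legend.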

3. Sketch

4. Step-by-step Description

4.1 Installing Packages

The packages tidyverse, plotly, readxl, knitr, dplyr, ggplot2, and grid are used in this exercise. The code chunk below installs any that are missing and loads them into the RStudio environment.

packages = c('tidyverse', 'plotly', 'readxl', 'knitr', 'dplyr', 'ggplot2', 
             'grid')

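for(p in packages){
  # Install the package only if it cannot be loaded
  if(!require(p, character.only = TRUE)){
    install.packages(p)
  }
  # Attach the package to the current session
  library(p, character.only = TRUE)
}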

4.2 Importing the Dataset

The data is imported with read_csv() from the readr package (loaded as part of tidyverse), which reads csv files into a tibble.

participants <- read_csv("Datasets/Attributes/Participants.csv")
FinancialJournal <- read_csv("Datasets/Journals/FinancialJournal.csv")

# Inspect the structure of every column in each data frame
glimpse(participants)
glimpse(FinancialJournal)

4.3 Data Wrangling

DERIVING NUMBERS OF KIDS AND ADULTS

To derive new columns from the existing ones, mutate() from dplyr is used together with the conditional function if_else().

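participants_mutated <- participants %>%
  # Participants under 50 with kids: members beyond two adults count as kids
  mutate(kids = if_else(haveKids == TRUE & age < 50, householdSize - 2, 0)) %>%
  # The remaining household members are adults, so kids + adults = householdSize
  mutate(adults = householdSize - kids)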
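
To make the derivation concrete, here is a small sketch on made-up rows (not taken from Participants.csv). It assumes every household with under-18 children contains exactly two adults, so that kids and adults always add up to householdSize. The helper column has_minor and the toy values are purely illustrative.

```r
library(dplyr)

# Made-up participants covering the three cases of interest
toy <- data.frame(
  participantId = c(1, 2, 3),
  householdSize = c(3, 3, 1),
  haveKids      = c(TRUE, TRUE, FALSE),
  age           = c(35, 60, 28)
)

toy_counts <- toy %>%
  mutate(
    # Kids are counted only for participants under 50 who report having kids
    has_minor = haveKids & age < 50,
    kids      = if_else(has_minor, householdSize - 2, 0),
    # Assumption: everyone in the household who is not a kid is an adult
    adults    = householdSize - kids
  )

toy_counts
```

The first participant contributes one kid and two adults, the second (aged 60) contributes three adults, and the third contributes one adult.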

MERGING THE TWO DATA FRAMES

Before combining the data frames, we first create a join table, Wage, which holds the average wage of each participant by participant ID. The base R merge() function then adds the new column, wage, to the original data frame.

Wage <- FinancialJournal %>%
  filter(category == "Wage") %>%
  group_by(participantId) %>%
  select(participantId, amount) %>%
  summarise(wage = mean(amount))

participants_mutated <- merge(x = participants_mutated, 
                              y = Wage[ , c("participantId", "wage")], 
                              by = "participantId", all.x = TRUE)
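
For readers who prefer to stay within dplyr, the same kind of left join can be sketched with left_join(). The example below runs on made-up tables, and the names toy_participants, toy_journal, toy_wage and toy_joined are hypothetical. It is one possible equivalent, not the method used in this exercise.

```r
library(dplyr)

# Made-up stand-ins for the participants and financial journal tables
toy_participants <- data.frame(participantId = c(0, 1, 2), age = c(36, 25, 35))
toy_journal <- data.frame(
  participantId = c(0, 0, 1, 1),
  category      = c("Wage", "Wage", "Wage", "Food"),
  amount        = c(100, 200, 50, -10)
)

# Average wage per participant, ignoring non-wage records
toy_wage <- toy_journal %>%
  filter(category == "Wage") %>%
  group_by(participantId) %>%
  summarise(wage = mean(amount))

# Keep every participant; those without wage records get NA
toy_joined <- left_join(toy_participants, toy_wage, by = "participantId")
toy_joined
```

As with all.x = TRUE in merge(), participants without a match in the wage table are kept and receive NA in the wage column.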

4.4 Plot Population Proportion

PIVOTING DATA

In the original data frame, kids and adults are two separate columns, each holding its own total. This wide structure is awkward for ggplot2 to plot, so we use the gather() function to pivot the two columns into a key column and a value column.

Population <- participants_mutated %>%
  select(kids, adults) %>%
  summarise(kids = sum(kids), adults = sum(adults)) 

Population <- gather(Population, kids, adults, key = group, value = value)
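
Note that gather() is superseded in current versions of tidyr, although it still works. As a hedged alternative, the sketch below shows the same reshaping with pivot_longer() on made-up totals. The name toy_population and its numbers are illustrative only.

```r
library(tidyr)

# Made-up totals in the same wide layout as Population
toy_population <- data.frame(kids = 120, adults = 880)

# One row per group: the column names go to "group", the counts to "value"
toy_long <- pivot_longer(
  toy_population,
  cols      = c(kids, adults),
  names_to  = "group",
  values_to = "value"
)
toy_long
```

The result has one row for kids and one for adults, which is the long layout that ggplot2 aesthetics such as fill can map directly.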

PLOTTING THE GRAPH

The population composition is the first thing we want to know about a city's demographics. A pie chart suits this purpose because it is mainly used to display the proportions of the categories of a variable.