Quick Guide to Create Frequency Table from Wide Table Using R
library(dplyr)
library(reshape2)
library(ggplot2)
This post presents a way to display the distribution of multiple variables (with the same comparable unit) side by side and compare their distributions. Normally the data will come in a wide table form in the dataframe, making it impossible to directly use group labels to compare the distributions.
Let’s start by creating an example dataframe:
# generate a dataframe with 10 observations and 4 variables
user_id <- seq(1, 10)
var1_time <- runif(10, 1, 10)
var2_time <- sample(seq(30, 50), 10)
var3_time <- rnorm(10, 70, 9)
df <- data.frame(user_id, var1_time, var2_time, var3_time)
glimpse(df)
## Observations: 10
## Variables: 4
## $ user_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
## $ var1_time <dbl> 2.054537, 8.347375, 7.089574, 3.198027, 1.708012, 7....
## $ var2_time <int> 38, 41, 33, 34, 32, 43, 30, 45, 47, 42
## $ var3_time <dbl> 88.80084, 68.84027, 70.66759, 75.27730, 77.33516, 73...
This is what we’ll usually have in our dataframes. The variables are represented horizontally as columns. However, for data visualization purposes, we will align the values of var1_time
, var2_time
, and var3_time
as if they are one variable with different group levels; only in this way can we generate a side by side distribution comparison. Here, we need to turn this df
into a long table format. In this long format, we will change the dataframe so that it only contains three variables: user_id
, var_label
, time_in_seconds
. The R package reshape2
is ideal for this task, as shown below:
melt(df,
# ID variables, variables not to split
id.vars = c('user_id'),
# Source columns to split
measure.vars = c('var1_time','var2_time','var3_time'),
# Name of the group label variable
variable.name = 'var_name',
# Name of the integrated value
value.name = 'time_in_seconds') -> df_long
glimpse(df_long)
## Observations: 30
## Variables: 3
## $ user_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, ...
## $ var_name <fctr> var1_time, var1_time, var1_time, var1_time, v...
## $ time_in_seconds <dbl> 2.054537, 8.347375, 7.089574, 3.198027, 1.7080...
We can see that df_long
follows what we desires: it kept the variables we don’t want to split (i.e. user_id
) but “melted” other variables into one unified measure (i.e. time_in_seconds
) with “group labels” (i.e. var_name
). Now We can create a visual to compare the distributions of these three variables side by side, as shown below:
g1 <- ggplot(df_long) +
geom_point(aes(x=var_name, y=time_in_seconds), alpha=0.3, color="tomato") +
geom_boxplot(aes(x=var_name, y=time_in_seconds), alpha=0) +
ggtitle("Distribution of Time (Seconds)") + xlab("Variable Names") +
ylab("Time (Seconds)") + theme_light()
g1
Leave a Comment