Hands-on Exercise 4C - Visualising Uncertainty

Author

Goh Si Hui

Published

February 1, 2024

Modified

February 2, 2024

1 About this Exercise

In this hands-on exercise, we will learn how to:

plot static error bars using ggplot2
plot interactive error bars using ggplot2, plotly and DT
create hypothetical outcome plots (HOPs) using the ungeviz package.

2 Getting Started

Before we start, let us ensure that the required R packages have been installed and import the relevant data for this hands-on exercise.

2.1 Installing and Loading the Packages

For this exercise, we will use the following packages:

tidyverse: a family of R packages for the data science process,
plotly: to create interactive plots,
gganimate: to create animated plots,
DT: to display interactive HTML tables,
crosstalk: to implement cross-widget interactions (currently, linked brushing and filtering),
ggdist: to visualise distributions and uncertainty, and
ungeviz: to create hypothetical outcome plots (HOPs).

The first code chunk below installs ungeviz from GitHub using install_github() of the devtools package. The second code chunk uses p_load() of the pacman package to check whether the required packages are installed on the computer. If they are, they are loaded into R. Otherwise, pacman installs the missing packages before loading them.

devtools::install_github("wilkelab/ungeviz") #you only need to do this step once

pacman::p_load(ungeviz, plotly, crosstalk,
               DT, ggdist, ggridges,
               colorspace, gganimate, tidyverse)
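
To see the kind of check that p_load() performs, the following minimal sketch reports which of the CRAN packages above are not yet installed, without installing anything. It uses base R only; the variable names are illustrative, and this is not how pacman itself is implemented.

```r
# CRAN packages used in this exercise
# (ungeviz is installed from GitHub separately)
pkgs <- c("plotly", "crosstalk", "DT", "ggdist", "ggridges",
          "colorspace", "gganimate", "tidyverse")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that would still need to be installed
missing_pkgs <- pkgs[!installed]
missing_pkgs
```

An empty result means every package in the list is already available.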

2.2 Importing Data into R

We will use Exam_data.csv for this exercise.

exam <- read_csv("data/Exam_data.csv")

3 About uncertainty and point estimates

A point estimate is a single number, such as a mean. Uncertainty, on the other hand, is expressed as standard error, confidence interval, or credible interval.

Important

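Don’t confuse the uncertainty of a point estimate with the variation in the sample. The standard error describes how precisely the mean is estimated, whereas the standard deviation describes how much the individual observations vary.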
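
The distinction can be illustrated with a small simulation. This is a sketch on simulated scores (not the exam data), assuming a true mean of 70 and a standard deviation of 15; these values and the variable names are chosen purely for illustration.

```r
set.seed(2024)

# Simulated scores (illustrative only, not the exam data)
small <- rnorm(25,   mean = 70, sd = 15)
large <- rnorm(2500, mean = 70, sd = 15)

# Variation in the sample: stays close to 15 whatever the sample size
sd_small <- sd(small)
sd_large <- sd(large)

# Uncertainty of the mean: shrinks as the sample grows
se_small <- sd_small / sqrt(length(small))
se_large <- sd_large / sqrt(length(large))

round(c(sd_small = sd_small, sd_large = sd_large,
        se_small = se_small, se_large = se_large), 2)
```

The two standard deviations should be similar, while the standard error of the larger sample should be roughly a tenth of that of the smaller one, because the sample is 100 times larger.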

4 Visualizing the uncertainty of point estimates: ggplot2 methods

To plot error bars of maths scores by race using the data provided, we will first derive the necessary summary statistics using the following code chunk.

my_sum <- exam %>%
  group_by(RACE) %>%
  summarise(
    n = n(),
    mean = mean(MATHS),
    sd = sd(MATHS)
    ) %>%
  mutate(se = sd/sqrt(n-1))

Note

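group_by() of the dplyr package groups the observations by RACE. summarise() computes the count of observations, the mean and the standard deviation of MATHS for each group. mutate() derives the standard error of the mean maths score by RACE. The output is saved as a tibble called my_sum.

The standard error here is computed as sd/sqrt(n-1). The more common definition is sd/sqrt(n), which gives slightly smaller values. The difference is negligible for large groups but noticeable for small ones such as Others (n = 9).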

We will then display the my_sum tibble in HTML table format using the following code chunk.

knitr::kable(head(my_sum), format = 'html')

RACE	n	mean	sd	se
Chinese	193	76.50777	15.69040	1.132357
Indian	12	60.66667	23.35237	7.041005
Malay	108	57.44444	21.13478	2.043177
Others	9	69.66667	10.72381	3.791438
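
As a quick check on the se column, the sketch below recomputes it from the n and sd values in the table above and compares it with the conventional sd/sqrt(n) form. The vector names are illustrative.

```r
# Group sizes and standard deviations copied from the my_sum table
n   <- c(Chinese = 193, Indian = 12, Malay = 108, Others = 9)
sds <- c(Chinese = 15.69040, Indian = 23.35237,
         Malay = 21.13478, Others = 10.72381)

se_n_minus_1 <- sds / sqrt(n - 1)  # formula used in this exercise
se_n         <- sds / sqrt(n)      # conventional standard error of the mean

# Ratio of the two forms; closer to 1 for larger groups
round(se_n / se_n_minus_1, 3)
```

The first vector reproduces the se column of the table, and the ratio shows how much the choice of denominator matters for the smaller groups.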

4.1 Standard Error Bars of Point Estimates

We can visualise the standard error bars of mean maths score by race using the following code chunk.

Note that the error bars are computed by using the formula mean+/-se.

Important

geom_point() uses stat = "identity" by default. It is stated explicitly here to make clear that the precomputed means are plotted as they are.

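ggplot(my_sum) +
  # Error bars span one standard error either side of the mean
  geom_errorbar(
    aes(x = RACE,
        ymin = mean - se,
        ymax = mean + se),
    width = 0.2,
    colour = "black",
    alpha = 0.9,
    linewidth = 0.5) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  # Red points mark the group means
  geom_point(
    aes(x = RACE,
        y = mean),
    stat = "identity",
    colour = "red",
    size = 1.5,
    alpha = 1) +
  ggtitle("Standard error of mean maths score by race")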
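
As an alternative to precomputing a summary table, ggplot2 can derive the mean and standard error from the raw data with stat_summary() and mean_se(). The sketch below uses the built-in iris data as a stand-in for the exam data. Note that mean_se() uses sd/sqrt(n), so its bars would differ slightly from those based on my_sum.

```r
library(ggplot2)

# iris ships with R and stands in for the exam data here
p <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  # mean_se() returns the mean and the mean -/+ one standard error
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2) +
  # Red points mark the group means
  stat_summary(fun = mean, geom = "point", colour = "red", size = 1.5) +
  ggtitle("Standard error of mean sepal length by species")

p
```

This keeps the summary step inside the plot, which is convenient for quick exploration, while the my_sum approach makes the computed values available for tables and tooltips.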

4.2 Confidence Interval of Point Estimates

Instead of plotting the standard error bar of point estimates, we can also plot the confidence intervals of mean maths score by race.

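ggplot(my_sum) +
  # 95% interval using the normal approximation: mean +/- 1.96 * se
  geom_errorbar(
    aes(x = reorder(RACE, -mean),
        ymin = mean - 1.96 * se,
        ymax = mean + 1.96 * se),
    width = 0.2,
    colour = "black",
    alpha = 0.9,
    linewidth = 0.5) +
  # Use the same ordering for the points as for the error bars
  geom_point(
    aes(x = reorder(RACE, -mean),
        y = mean),
    stat = "identity",
    colour = "red",
    size = 1.5,
    alpha = 1) +
  labs(x = "Race",
       y = "Maths score",
       title = "95% confidence interval of mean maths score by race")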

Note

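The confidence intervals are computed by using the formula mean+/-1.96*se, where 1.96 is the critical value of the standard normal distribution for a 95% interval. The error bars are sorted by the average maths scores in descending order using reorder(). labs() of ggplot2 is used to set the axis label text and the plot title.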
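
The multipliers come from the normal distribution, which is an approximation when groups are small. The sketch below, using the group sizes from my_sum, shows where 1.96 (95%) and 2.58 (99%) come from and how the corresponding t critical values compare. Using t-based multipliers is an optional refinement, not part of the original exercise.

```r
# Group sizes from my_sum
n <- c(Chinese = 193, Indian = 12, Malay = 108, Others = 9)

# Normal critical values behind the 1.96 and 2.58 multipliers
z_95 <- qnorm(0.975)
z_99 <- qnorm(0.995)

# t critical values for a 95% interval with n - 1 degrees of freedom
t_95 <- qt(0.975, df = n - 1)

round(c(z_95 = z_95, z_99 = z_99), 2)
round(t_95, 2)
```

The t critical values are always larger than 1.96, and the gap is widest for the smallest groups, so normal-based intervals for Indian and Others are somewhat too narrow.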

4.3 Interactive error bars

We can also plot interactive error bars for the 99% confidence interval of mean maths scores by race using the following code chunk.

# Wrap the summary table so the plot and the table share selections
shared_df <- SharedData$new(my_sum)

bscols(widths = c(4, 8),
       ggplotly((ggplot(shared_df) +
                   # 99% interval: mean +/- 2.58 * se
                   geom_errorbar(aes(
                     x = reorder(RACE, -mean),
                     ymin = mean - 2.58 * se,
                     ymax = mean + 2.58 * se),
                     width = 0.2,
                     colour = "black",
                     alpha = 0.9,
                     linewidth = 0.5) +
                   geom_point(aes(
                     x = reorder(RACE, -mean),
                     y = mean,
                     # Tooltip text shown on hover
                     text = paste("Race:", RACE,
                                  "<br>N:", n,
                                  "<br>Avg. Scores:", round(mean, digits = 2),
                                  "<br>99% CI:[",
                                  round((mean - 2.58 * se), digits = 2), ",",
                                  round((mean + 2.58 * se), digits = 2), "]")),
                     stat = "identity",
                     colour = "red",
                     size = 1.5,
                     alpha = 1) +
                   xlab("Race") +
                   ylab("Average Scores") +
                   theme_minimal() +
                   theme(axis.text.x = element_text(
                     angle = 45, vjust = 0.5, hjust = 1)) +
                   ggtitle("99% confidence interval of average<br>maths scores by race")),
                tooltip = "text"),
       DT::datatable(shared_df,
                     rownames = FALSE,
                     class = "compact",
                     width = "100%",
                     options = list(pageLength = 10,
                                    scrollX = TRUE),
                     # Rename by column name so each header matches its data
                     colnames = c("No. of pupils" = "n",
                                  "Avg Scores" = "mean",
                                  "Std Dev" = "sd",
                                  "Std Error" = "se")) %>%
         formatRound(columns = c("mean", "sd", "se"),
                     digits = 2))

5 Visualizing the uncertainty of point estimates: ggdist methods

5.1 About ggdist

ggdist is an R package that provides a flexible set of ggplot2 geoms and stats designed especially for visualising distributions and uncertainty.

It is designed for both frequentist and Bayesian uncertainty visualization, taking the view that uncertainty visualization can be unified through the perspective of distribution visualization:

for frequentist models, one visualises confidence distributions or bootstrap distributions (see vignette("freq-uncertainty-vis"));
for Bayesian models, one visualises probability distributions (see the tidybayes package, which builds on top of ggdist).

5.2 Confidence Intervals of Mean Maths Scores

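We can use stat_pointinterval() or stat_gradientinterval() of ggdist to display the distribution of maths scores by race. By default, both summarise the raw scores with the median and the 66% and 95% quantile intervals (median_qi with .width = c(0.66, 0.95)), so the intervals describe the spread of the scores rather than confidence intervals of the mean.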

stat_pointinterval()

exam %>%
  ggplot(aes(x = RACE, 
             y = MATHS)) +
  # Defaults: median point with 66% and 95% quantile intervals of the scores
  stat_pointinterval() +
  labs(
    title = "Visualising the distribution of maths scores by race",
    subtitle = "Median point + multiple-interval plot")

stat_gradientinterval()

exam %>%
  ggplot(aes(x = RACE, 
             y = MATHS)) +
  # The gradient fades with the density of the scores
  stat_gradientinterval(
    fill = "skyblue",
    show.legend = TRUE
  ) +
  labs(
    title = "Visualising the distribution of maths scores by race",
    subtitle = "Gradient + interval plot")

Important

These two functions come with many arguments. Please refer to the ggdist syntax reference for more details.
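
To show the uncertainty of the mean itself rather than the spread of the raw scores, one option is to plot a bootstrap distribution of the mean, in line with the frequentist use of ggdist described in section 5.1. The sketch below is one way to do this, using the built-in iris data as a stand-in for the exam data. The number of resamples (1000) and the object names are arbitrary choices for illustration.

```r
library(ggplot2)
library(ggdist)

set.seed(1234)

# Bootstrap the mean of Sepal.Length within each species
# (iris ships with R and stands in for the exam data)
boot_means <- do.call(rbind, lapply(split(iris, iris$Species), function(d) {
  data.frame(
    Species   = d$Species[1],
    boot_mean = replicate(1000, mean(sample(d$Sepal.Length, replace = TRUE)))
  )
}))

# The point is the mean of the bootstrap means; the intervals are the
# 66% and 95% percentile intervals of the bootstrap distribution
p <- ggplot(boot_means, aes(x = Species, y = boot_mean)) +
  stat_pointinterval(point_interval = mean_qi, .width = c(0.66, 0.95)) +
  labs(title = "Bootstrap intervals of mean sepal length",
       subtitle = "Mean point + multiple-interval plot")

p
```

Because the plotted values are resampled means, the 95% interval here approximates a 95% confidence interval of each group mean.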

6 Visualising Uncertainty with Hypothetical Outcome Plots (HOPs)

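library(ungeviz)

ggplot(data = exam,
       aes(x = factor(RACE), y = MATHS)) +
  # Raw scores, jittered to reduce overplotting
  geom_point(position = position_jitter(height = 0.3, width = 0.05),
             size = 0.4,
             color = "#0072B2",
             alpha = 1/2) +
  # Horizontal lines for the observations sampled per race in each of 25 draws
  geom_hpline(data = sampler(25, group = RACE),
              height = 0.6,
              color = "#D55E00") +
  theme_bw() +
  # Animate through the draws, one state per value of .draw
  transition_states(.draw, 1, 3)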

Note

.draw is a column generated by sampler() that identifies the sample draw.
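
To make the role of .draw concrete, the sketch below builds a comparable table by hand in base R. It assumes that sampler(25, group = RACE) takes 25 draws and samples one observation per group in each draw. This is a hand-rolled approximation for illustration, not ungeviz's implementation, and it uses the built-in iris data as a stand-in for the exam data.

```r
set.seed(1234)

n_draws <- 25

# For each draw, sample one row per group; .draw records the draw number
hops <- do.call(rbind, lapply(seq_len(n_draws), function(i) {
  picked <- do.call(rbind, lapply(split(iris, iris$Species), function(d) {
    d[sample(nrow(d), 1), ]
  }))
  rownames(picked) <- NULL
  picked$.draw <- i
  picked
}))

# One row per species in every draw
table(hops$.draw)
```

Each animation frame of a HOP then corresponds to the rows that share one value of .draw.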

7 References

Kam, T. S. (2023). R for Visual Analytics [Web-book]. https://r4va.netlify.app/.