Hands-on Exercise 4: Fundamentals of Statistics for Visual Analytics

Author

Wei Yanrui

Published

January 30, 2024

Modified

February 24, 2024

1. Roadmap for Studying

(Figure: roadmap_4 — roadmap for this exercise)

2. Visualize Distribution

2.1 Getting Started

2.1.1 Install and Load R packages

  • tidyverse: a collection of R packages for the data science workflow

  • ggridges: a ggplot2 extension specially designed for plotting ridgeline plots

  • ggdist: for visualising distributions and uncertainty

pacman::p_load(ggdist, ggridges, tidyverse,
               ggthemes,colorspace)

2.1.2 Import data

Exam_data.csv will be used in this exercise.

exam <- read_csv("data/Exam_data.csv")

2.2 Visualize Distribution with Ridgeline Plot

Ridgeline plot (also called a joyplot): reveals the distribution of a numeric variable for several groups. It

  • is used when the number of groups to represent is medium to high

  • uses space more efficiently than separate density plots

  • works best when there is a clear pattern in the result, such as an obvious ranking among the groups

2.2.1 ggridges package

  • geom_ridgeline(): creates plots that look like a series of mountain ridges by drawing a line representing the height of the distribution for each group.

  • geom_density_ridges(): in addition to what geom_ridgeline() does, geom_density_ridges() estimates the densities automatically and adds a shaded area under the lines, presenting the same density information in a more visually filled way.

The ridgeline plot below is plotted by using geom_density_ridges()

ggplot(exam,
       aes(x=ENGLISH,
           y=CLASS))+
  geom_density_ridges(
    scale=3,
    rel_min_height=0.01,
    bandwidth=3.4,
    fill=lighten("#7097BB",.3),
    color="white"
  )+
  scale_x_continuous(
    name="English grades",
    expand=c(0,0)
  )+
  scale_y_discrete(name=NULL,
                   expand=expansion(add=c(0.2,2.6)))+
  theme_ridges()

Observation: note the peak of each group's distribution, the spread within each group, and the tendency of the line linking the peaks across groups.

Code Notes
  1. scale: controls the vertical scaling of the density curves for each category, a larger value increases the vertical spacing between the curves, making them easier to distinguish from one another

  2. rel_min_height: sets a minimum relative height for each density curve. It’s a way to ensure that curves representing very small density values don’t become invisible

  3. bandwidth: controls the degree of smoothness in the density estimation. A larger bandwidth results in smoother curves with less detail, while a smaller bandwidth leads to more jagged curves with more detail

  4. fill: color of shade

  5. color: color of outline

  6. lighten(): returns a lighter version of the specified color

  7. scale_x_continuous(): sets the scale for the x-axis as continuous

    7.1 name="English grades": title of x axis. If you want to remove the label of axis, then use name=NULL

    7.2 expand=c(0,0):

    7.2.1 1st value (multiplicative): a multiplier that determines how much padding to add on both sides of the axis, relative to the range of the data, e.g. expand=c(0.1,0) adds padding equal to 10% of the x-range on each side. The same applies to the y-axis.

    7.2.2 2nd value (additive): a fixed amount of padding, in data units, added beyond the min and max of the actual data, e.g. if an axis runs from 20 to 80, expand=c(0,10) makes it run from 10 to 90, but the data points themselves don't change at all.

  8. scale_y_discrete(): sets the scale for the y-axis as discrete
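The two styles of expansion can also be written with expansion(), which makes the multiplicative and additive parts explicit. A minimal sketch using toy data (the data frame and values are hypothetical, for illustration only):

```r
library(ggplot2)

# Toy data to illustrate the two kinds of axis expansion
df <- data.frame(x = c(20, 80), y = c(1, 2))

ggplot(df, aes(x, y)) +
  geom_point() +
  # multiplicative: padding equal to 10% of the x-range on each side
  scale_x_continuous(expand = expansion(mult = 0.1)) +
  # additive: a fixed 10 data units added on each side of the y-axis
  scale_y_continuous(expand = expansion(add = 10))
```

The vector form expand=c(0.1, 10) is shorthand for expansion(mult = 0.1, add = 10) applied to both ends of the axis.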

2.2.2 Change fill colors

  • geom_ridgeline_gradient(): works like geom_ridgeline(), but fills the area under each ridgeline with colors that vary along the x axis (alpha transparency in the fill is not allowed)

  • geom_density_ridges_gradient(): works like geom_density_ridges(), but likewise fills the area under each ridgeline with colors that vary along the x axis (alpha transparency in the fill is not allowed)

ggplot(exam,
       aes(x=ENGLISH,
           y=CLASS,
           fill=stat(x)))+
  geom_density_ridges_gradient(
    scale=3,
    rel_min_height=0.01)+
  scale_fill_viridis_c(name="Temp. [F]",
                       option="C")+
  scale_x_continuous(
    name="English grades",
    expand=c(0,0))+
  scale_y_discrete(name=NULL,
                   expand=expansion(add=c(0.2,2.6)))+
  theme_ridges()

Code Notes
  1. fill=stat(x): maps the fill to the x values computed by the stat, so each ridge is colored as a gradient along the English scores rather than with a single flat color (in recent ggplot2 versions, after_stat(x) is the preferred spelling)

  2. scale_fill_viridis_c(): applies a continuous color scale from the viridis palette to the fill aesthetic

    2.1 name="Temp. [F]": sets the name for the color scale in the legend (apparently carried over from a documentation example; a label such as "English grades" would fit this data better)

    2.2 option="C": selects one of the color maps available in the viridis package; each option (A, B, C, D, E) provides a different color scheme. Because the same color scale is shared across all groups, the distributions can be compared by color alone, without checking the absolute numbers.

2.2.3 Map the probabilities onto color

  • stat_density_ridges(): the stat counterpart of geom_density_ridges(); it computes the statistics (density, and optionally the ECDF) that the geom then draws
ggplot(exam,
       aes(x=ENGLISH,
           y=CLASS,
           fill=0.5-abs(0.5-stat(ecdf))))+
  stat_density_ridges(geom="density_ridges_gradient",
                      calc_ecdf=TRUE)+
  scale_fill_viridis_c(name="Tail probability",
                       direction = -1)+
  theme_ridges()

Code Notes
  1. stat(ecdf): computes the empirical cumulative distribution function

    1.1 0.5-abs(0.5-stat(ecdf)): a way to visualize how far each value is from the median. The closer the fill color is to the center of the color scale, the closer the ECDF at that point is to 0.5, indicating values near the median.

  2. stat_density_ridges(): adds a statistical transformation layer that calculates the density ridgelines and also the empirical cumulative distribution function (when calc_ecdf=TRUE)

    2.1 geom="density_ridges_gradient": indicates that a gradient fill will be used, which will reflect the fill aesthetic defined earlier

    2.2 calc_ecdf=TRUE: indicates that the function should calculate the empirical cumulative distribution function for each group of data.

  3. direction = -1: reverses the color scale, so that high values are colored with what would normally be low-end colors and vice versa. This is commonly done to match intuitive color associations (e.g., red for higher values, blue for lower values)

2.2.4 Add quantile lines to ridgeline plots

ggplot(exam,
       aes(x = ENGLISH, 
           y = CLASS, 
           fill = factor(stat(quantile))
           )) +
  stat_density_ridges(
    geom = "density_ridges_gradient",
    calc_ecdf = TRUE, 
    quantiles = 4,
    quantile_lines = TRUE) +
  scale_fill_viridis_d(name = "Quartiles") +
  theme_ridges()

Instead of using a number to define the quantiles, we can also specify quantiles by cut points, such as the 2.5% and 97.5% tails, to colour the ridgeline plot as shown in the figure below.

ggplot(exam,
       aes(x=ENGLISH,
           y=CLASS,
           fill=factor(stat(quantile))))+
  stat_density_ridges(
    geom="density_ridges_gradient",
    calc_ecdf=TRUE,
    quantiles=c(0.025,0.975))+
  scale_fill_manual(
    name="Probability",
    values=c("#FF0000A0", "#A0A0A0A0", "#0000FFA0"),
    labels=c("(0, 0.025]", "(0.025, 0.975]", "(0.975, 1]"))+
  theme_ridges()

Code Notes
  1. factor(stat(quantile)): the computed quantile indicator is converted to a discrete factor, so each quantile band gets a different color in the plot.

  2. quantiles=c(0.025,0.975): specifies which quantiles to calculate for the fill aesthetic.

2.3 Visualize Distribution with Raincloud Plot

A raincloud plot adds a half-density ("cloud") to a distribution plot.

  • used to compare the distribution of a continuous variable across different groups
  • highlights multiple modalities (an indicator that sub-groups may exist)

2.3.1 Plot a Half Eye graph

  • stat_halfeye(): produces a Half Eye visualization, which contains a half-density and a slab-interval.
ggplot(exam,
       aes(x=RACE,
           y=ENGLISH))+
  stat_halfeye(adjust=0.5,
               justification=-0.2,
               .width=0,
               point_colour=NA)

Code Notes
  1. adjust=0.5: controls the bandwidth of the kernel density estimate, which affects the smoothness of the density shape. A smaller value makes the shape more sensitive to small changes.

  2. justification=-0.2: controls the positioning of the half-eye relative to the categorical axis. A negative value shifts it away from each category's position (here, to the right), leaving room for the boxplot added later.

  3. .width=0: removes the slab interval (the line summarizing the data spread and confidence interval).

  4. point_colour=NA: sets the color of the median point of the half-eye; NA means the point is not drawn.

2.3.2 Add the boxplot with geom_boxplot()

ggplot(exam,
       aes(x=RACE,
           y=ENGLISH))+
  stat_halfeye(adjust=0.5,
               justification=-0.2,
               .width=0,
               point_colour=NA)+
  geom_boxplot(width=.20,
               outlier.shape=NA)

2.3.3 Add the dot plot with geom_dots()

  • stat_dots(): produces a half-dotplot, which is similar to a histogram that indicates the number of data points in each bin
ggplot(exam,
       aes(x=RACE,
           y=ENGLISH))+
  stat_halfeye(adjust=0.5,
               justification=-0.2,
               .width=0,
               point_colour=NA)+
  geom_boxplot(width=.20,
               outlier.shape=NA)+
  stat_dots(side="left",
            justification=1.2,
            binwidth=.5,
            dotsize=2)

Notes
  1. The half-eye alone only shows the shape of each distribution; adding stat_dots() also shows the number of observations in each bin, giving a better sense of the actual sample size behind each group.

2.3.4 Flip horizontally

ggplot(exam,
       aes(x=RACE,
           y=ENGLISH))+
  stat_halfeye(adjust=0.5,
               justification=-0.2,
               .width=0,
               point_colour=NA)+
  geom_boxplot(width=.20,
               outlier.shape=NA)+
  stat_dots(side="left",
            justification=1.2,
            binwidth=.5,
            dotsize=2)+
  coord_flip()+
  theme_economist()

3. Visual Statistical Analysis

3.1 Getting Started

3.1.1 Install and Launch R package

  • ggstatsplot: create graphics with details from statistical tests included in the information-rich plots themselves

Learn more about ggstatsplot on the package's website.

pacman::p_load(ggstatsplot, tidyverse)

3.1.2 Import Data

exam <- read.csv("Data/Exam_data.csv")

3.2 One-sample test: gghistostats() method

set.seed(1234)

gghistostats(
  data=exam,
  x=ENGLISH,
  type="bayes",
  test.value=60,
  xlab="English scores"
)

Code Notes
  1. set.seed(1234): sets a seed for random number generation to ensure that if the code involves any randomness (like random sampling), it will produce the same results every time you run it. The number 1234 is just a starting point for generating random numbers.

  2. type="bayes": the type of analysis to perform. Bayesian statistics is a different approach from traditional statistics, allowing you to update your beliefs about something based on new evidence.

  3. test.value=60: the null value for the one-sample test; the analysis assesses how strongly the data support or contradict a mean English score of 60.

Analysis
  1. log(BF01) = -31.45: The Bayes Factor comparing the null hypothesis to the alternative is given in log scale. A negative value suggests that the data are more likely under the alternative hypothesis.

  2. Posterior difference = 7.16: After analyzing the data and taking into account any prior knowledge, the analysis estimates that the difference we are interested in (like the average score) is about 7.16 units. This is not a fixed number but an estimate that can change with more information.

  3. ETI 95% [5.44, 8.75]: ETI stands for “Equal-Tailed Interval,” which is a range of values that the true difference is likely to fall within. The “95%” part means we can be 95% confident that the true value is somewhere between 5.44 and 8.75. It’s like saying, “We’re pretty sure the true value is in this range, but we don’t know exactly where.”

  4. µMAP = 74.74: This is the maximum a posteriori estimate, which is the most credible value for the mean score given the data.

  5. Conclusion: there is evidence to suggest that the mean English score is not 60.

3.3 Bayes Factor

A Bayes factor is the ratio (positive number) of the likelihood of one particular hypothesis to the likelihood of another. It can be interpreted as a measure of the strength of evidence in favor of one theory among two competing theories

Example:

  1. You have two competing options or hypotheses. Let’s call them Option A and Option B.
  2. You collect some data or evidence. The Bayes Factor helps you assess how well the data supports Option A versus Option B.
  3. The Bayes Factor gives you a ratio. If the Bayes Factor is greater than 1, the data support Option A more than Option B. If it’s less than 1, the data support Option B more than Option A.
  4. The bigger the Bayes Factor, the stronger the evidence in favor of Option A. A Bayes Factor of 10 means the data support Option A ten times more strongly than Option B.
  5. The smaller the Bayes Factor (below 1), the stronger the evidence in favor of Option B. A Bayes Factor of 0.1 means the data support Option B ten times more strongly than Option A.
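When both hypotheses are simple, the ratio can be computed directly from the two likelihoods. A toy sketch (the coin, counts, and probabilities are hypothetical, for illustration only):

```r
# Bayes factor for 60 heads in 100 flips:
# Option A: fair coin (p = 0.5); Option B: biased coin (p = 0.7).
heads <- 60
flips <- 100
bf <- dbinom(heads, flips, prob = 0.5) / dbinom(heads, flips, prob = 0.7)
bf  # a bit above 1, so the data mildly favour Option A
```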

3.4 Two-sample mean test: ggbetweenstats() method

ggbetweenstats(
  data=exam,
  x=GENDER,
  y=MATHS,
  type="np",
  message=FALSE
)

Code Notes
  1. type="np": refers to "non-parametric", meaning the analysis does not assume your data follow a specific mathematical distribution.

  2. message=FALSE: is used to control whether or not you want informational messages to be displayed while creating the plot.

Analysis
  1. Mann-Whitney U test: This is a statistical test comparing the two groups. The value W = 13011.00, p = 0.91 indicates the test statistic and the p-value. A p-value higher than 0.05 (like 0.91 here) typically means that there’s no statistical evidence of a difference in math scores between females and males.

  2. Rank-biserial correlation: r = 7.04e-03 is very close to zero, indicating a negligible effect size: there is almost no association between gender and maths score.

  3. Confidence Interval: CI95% [-0.12, 0.13]: indicates the 95% confidence interval for the rank-biserial correlation coefficient, which includes zero, suggesting no effect.

  4. n_obs = 322: Number of Observations tells us the total number of observations (students) included in the analysis.

  5. Conclusion: there does not appear to be a significant difference in math scores between female and male students in this particular sample.

3.5 Oneway ANOVA test: ggbetweenstats() method

ggbetweenstats(
  data=exam,
  x=RACE,
  y=ENGLISH,
  type="p",
  mean.ci=TRUE,
  pairwise.comparisons=TRUE,
  pairwise.display = "s",
  pairwise.method="fdr",
  messages=FALSE
)

Code Notes
  1. type="p": refers to "parametric", meaning the analysis assumes your data follow a specific mathematical distribution.

  2. mean.ci=TRUE: indicates that you want to include confidence intervals around the mean values in your plot.

  3. pairwise.comparisons=TRUE: means that you want to compare the groups pairwise, which allows you to see if there are statistically significant differences between any two racial groups.

  4. pairwise.display = "s": specifies which pairwise comparisons to display; "s" shows only the statistically significant ones:

    4.1 “ns” → only non-significant

    4.2 “s” → only significant

    4.3 “all” → everything

  5. pairwise.method="fdr": specifies the method for adjusting for multiple comparisons. "fdr" stands for the "False Discovery Rate", a way to control the risk of false discoveries when comparing multiple groups.

Analysis
  1. F(3, 23.8) = 10.15, p = 1.71e-04: indicates that there are statistically significant differences between the groups’ means. The F statistic is 10.15, and the p-value is 0.000171, which is well below the common threshold of 0.05 for significance.

  2. the omega squared effect size=0.50: suggests a moderate to large effect size. However, the confidence interval for this effect size includes 0, which can introduce some uncertainty about the effect size estimate.

  3. P_FDR-adj: The pairwise test results are indicated by lines connecting the groups. The presence of lines (with p-value annotations) suggests that specific group comparisons were made. The pink dots on the lines indicate that the differences were statistically significant after adjusting for multiple comparisons using the False Discovery Rate (FDR) method.

  4. Conclusion: the mean score of “Others” differs significantly from the mean scores of “Chinese”, “Indian” and “Malay” (as denoted by the pink dots and lines).

3.6 Significant test of Correlation: ggscatterstats()

ggscatterstats(
  data=exam,
  x=MATHS,
  y=ENGLISH,
  marginal=TRUE
)

Code Notes
  1. marginal=TRUE: adds marginal distribution plots (histograms by default) along the sides of the scatter plot.
Analysis
  1. r = 0.83, CI 95% [0.79, 0.86]: a correlation coefficient of 0.83 indicates a strong positive relationship between math and English scores.

  2. t(320) = 26.72, p = 1.70e-83: indicates the results of a t-test that tests the hypothesis of no correlation. The t-statistic is 26.72 and the p-value is extremely small (1.70 x 10^-83), which provides strong evidence against the null hypothesis of no correlation.

  3. log(BF01) = -183.55: the Bayes factor comparing the null hypothesis of no relationship against the alternative, on a log scale. The very large negative value indicates overwhelming evidence against the null hypothesis.

  4. (eta squared) = 1.41: effect size, the value being greater than 1 is unusual for eta squared, which should be between 0 and 1. This might be a typographical error in the annotation.

  5. Conclusion: the plot suggests a strong positive relationship between math and English scores, meaning students who score high in math also tend to score high in English, and this relationship is statistically significant.

3.7 Significant test of Association: ggbarstats()

  • data wrangling: the MATHS scores are binned into a 4-class variable using cut()
exam1 <- exam %>%
  mutate(MATHS_bins = 
           cut(MATHS, 
               breaks = c(0,60,75,85,100)))
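A quick way to see how cut() assigns values to bins, using a toy vector rather than the exam data (intervals are right-closed by default, so a score of exactly 60 falls into the first bin):

```r
# Toy scores placed around the break points
scores <- c(45, 60, 61, 75, 76, 85, 86, 100)
cut(scores, breaks = c(0, 60, 75, 85, 100))
# bins: (0,60], (60,75], (75,85], (85,100]
```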
  • Association test
ggbarstats(exam1,
           x=MATHS_bins,
           y=GENDER)

Analysis
  1. Chi-squared(3), p = 0.79, Cramér’s V = 0.00: a test of independence was performed to see if there’s a relationship between gender and math score bins. The p-value of 0.79 is high, indicating no statistical evidence of association between the two variables, which is further supported by a Cramér’s V of 0, showing no effect size.

  2. log_e(BF01) = 4.73: A positive value suggests evidence for the null hypothesis.

  3. p = 0.62 for Females, p = 0.83 for Males: both for male and female, the distribution across the score bins appears to be random or as expected, not showing any particular trend or bias.

  4. Conclusion: there is no significant association between gender and math scores.

4. Visualize model diagnostic and model parameters

4.1 Getting Started

4.1.1 Install and load R packages

  • performance: for assessing and checking the quality of statistical models. It contains functions for computing various performance metrics, model diagnostics, and checks for regression models, among others.

  • parameters: for processing the parameters of statistical models. It provides functions to summarize, manipulate, and plot model parameters and supports a wide range of models.

  • see: to create plots and visualizations for statistical models and data structures. It works well with the performance and parameters packages, among others, to create attractive and informative plots.

pacman::p_load(readxl,performance,parameters,see)

4.1.2 Import Data

car_resale <- read_xls("Data/ToyotaCorolla.xls","data")
car_resale
# A tibble: 1,436 × 38
      Id Model    Price Age_08_04 Mfg_Month Mfg_Year     KM Quarterly_Tax Weight
   <dbl> <chr>    <dbl>     <dbl>     <dbl>    <dbl>  <dbl>         <dbl>  <dbl>
 1    81 TOYOTA … 18950        25         8     2002  20019           100   1180
 2     1 TOYOTA … 13500        23        10     2002  46986           210   1165
 3     2 TOYOTA … 13750        23        10     2002  72937           210   1165
 4     3  TOYOTA… 13950        24         9     2002  41711           210   1165
 5     4 TOYOTA … 14950        26         7     2002  48000           210   1165
 6     5 TOYOTA … 13750        30         3     2002  38500           210   1170
 7     6 TOYOTA … 12950        32         1     2002  61000           210   1170
 8     7  TOYOTA… 16900        27         6     2002  94612           210   1245
 9     8 TOYOTA … 18600        30         3     2002  75889           210   1245
10    44 TOYOTA … 16950        27         6     2002 110404           234   1255
# ℹ 1,426 more rows
# ℹ 29 more variables: Guarantee_Period <dbl>, HP_Bin <chr>, CC_bin <chr>,
#   Doors <dbl>, Gears <dbl>, Cylinders <dbl>, Fuel_Type <chr>, Color <chr>,
#   Met_Color <dbl>, Automatic <dbl>, Mfr_Guarantee <dbl>,
#   BOVAG_Guarantee <dbl>, ABS <dbl>, Airbag_1 <dbl>, Airbag_2 <dbl>,
#   Airco <dbl>, Automatic_airco <dbl>, Boardcomputer <dbl>, CD_Player <dbl>,
#   Central_Lock <dbl>, Powered_Windows <dbl>, Power_Steering <dbl>, …

4.2 Build Multiple Regression Model using lm()

model <- lm(Price ~ Age_08_04 + Mfg_Year + KM + Weight + Guarantee_Period, data = car_resale)
model

Call:
lm(formula = Price ~ Age_08_04 + Mfg_Year + KM + Weight + Guarantee_Period, 
    data = car_resale)

Coefficients:
     (Intercept)         Age_08_04          Mfg_Year                KM  
      -2.637e+06        -1.409e+01         1.315e+03        -2.323e-02  
          Weight  Guarantee_Period  
       1.903e+01         2.770e+01  

The fitted equation is: Price = -2,637,000 - 14.09×Age_08_04 + 1315×Mfg_Year - 0.02323×KM + 19.03×Weight + 27.70×Guarantee_Period

Code Notes
  1. lm(): creates a linear regression model: lm(y~x1+x2+x3+…, data=…).

4.3 Model Diagnostic: check multicollinearity

Multicollinearity: several independent variables in a model are correlated

check_collinearity(model)
# Check for Multicollinearity

Low Correlation

             Term  VIF     VIF 95% CI Increased SE Tolerance Tolerance 95% CI
               KM 1.46 [ 1.37,  1.57]         1.21      0.68     [0.64, 0.73]
           Weight 1.41 [ 1.32,  1.51]         1.19      0.71     [0.66, 0.76]
 Guarantee_Period 1.04 [ 1.01,  1.17]         1.02      0.97     [0.86, 0.99]

High Correlation

      Term   VIF     VIF 95% CI Increased SE Tolerance Tolerance 95% CI
 Age_08_04 31.07 [28.08, 34.38]         5.57      0.03     [0.03, 0.04]
  Mfg_Year 31.16 [28.16, 34.48]         5.58      0.03     [0.03, 0.04]
Analysis
  1. Term: predictor variables.

  2. VIF: Variance Inflation Factor. This measures how much the variance of the estimated regression coefficients is increased due to multicollinearity. A VIF value of 1 means there is no correlation between the kth predictor and the remaining predictor variables, and hence no multicollinearity. As a rule of thumb, a VIF greater than 5 or 10 indicates a problematic amount of multicollinearity.

  3. Tolerance: the inverse of VIF (1/VIF) and indicates the proportion of variance of the predictor that is not explained by the other predictors. A small tolerance (close to 0) indicates that the variable is highly correlated with other variables.

  4. Conclusion: Age_08_04 and Mfg_Year are highly correlated, can consider removing one of the highly correlated variables or combining them into a single variable.
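The VIF reported above can be reproduced by hand: regress one predictor on the remaining predictors and apply VIF = 1/(1 − R²). A sketch (assumes car_resale is loaded and uses the same predictors as the model):

```r
# Auxiliary regression of Age_08_04 on the other predictors
aux <- lm(Age_08_04 ~ Mfg_Year + KM + Weight + Guarantee_Period,
          data = car_resale)
# A high R^2 (predictor well explained by the others) inflates the VIF
1 / (1 - summary(aux)$r.squared)  # matches the ~31 reported for Age_08_04
```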

Plot the collinearity in terms of VIF

check_c <- check_collinearity(model)
plot(check_c)

Code Notes
  1. Why can the result be plotted directly? Some model-diagnostic functions return an object of a specific class. check_collinearity() returns a data-frame-like object carrying the VIF results plus metadata, and the see package provides a matching plot() method for that class.
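This can be checked by inspecting the returned object directly (a sketch, assuming the model object from above):

```r
check_c <- check_collinearity(model)
class(check_c)                # the class that plot() dispatches on
head(as.data.frame(check_c))  # the underlying table of VIF results
```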

4.4 Model Diagnostic: Check normality assumption

Normality assumption: statistical tests on regression coefficients (like t-tests) rely on the assumption of normally distributed residuals, so it is necessary to check whether the residuals are approximately normally distributed before trusting those tests.

model1 <- lm(Price ~ Age_08_04+KM+Weight+Guarantee_Period,
             data=car_resale)
check_n <- check_normality(model1)
plot(check_n)

Analysis
  1. Q-Q plot: quantile-quantile plot, to assess if a set of data plausibly came from some theoretical distribution such as a Normal distribution.

  2. x-axis: represents the quantiles from a standard normal distribution, which is the theoretical distribution we are comparing the data against.

  3. y-axis: shows the quantiles from the data being analyzed—in this case, the residuals from the linear regression model.

  4. residuals: the differences between the observed values and the values predicted by the model. Each data point has one residual. On the Q-Q plot, each dot represents one of these residuals, not the actual data points. The purpose of this plot is to see how the residuals compare to a normal distribution.

  5. reference line: the x-axis shows theoretical standard-normal quantiles and the y-axis shows the deviation of the sample quantiles from them. If the residuals followed a normal distribution exactly, the points would lie on the horizontal reference line at y=0, which is the reference line in this case.

  6. Conclusion: the normality assumption of the linear regression model is not fully met.

  • Middle Range: In the middle range of the data (between about -2 and 2 on the x-axis), the residuals mostly follow the reference line. This suggests that in this range, the residuals are behaving as we would expect if they were normally distributed.

  • Tails: at both ends of the plot (the tails), the residuals deviate from the reference line. This is indicated by the points curving away from the line. In the left tail, the residuals are lower than expected for a normal distribution, and in the right tail, they are higher than expected. The residuals have heavier tails than a normal distribution, which means there are more extreme values (outliers) than we would expect if the residuals were truly normal.

  • It might still be okay to use the model for predictions or to understand relationships between variables, especially if it’s for exploratory purposes or the sample size is large. However, if you are performing statistical tests that rely on the assumption of normally distributed residuals (like t-tests for coefficients), the results of those tests might not be entirely reliable. You might need to consider transformations of the data, robust statistical methods, or other models that do not assume normality of residuals.
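One such transformation can be sketched as follows (not part of the original exercise; log-transforming a right-skewed response such as price often brings the residual tails closer to normal):

```r
# Hypothetical remedy: refit with a log-transformed response,
# then re-check the normality of the residuals
model_log <- lm(log(Price) ~ Age_08_04 + KM + Weight + Guarantee_Period,
                data = car_resale)
plot(check_normality(model_log))
```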

4.5 Model Diagnostic: Check homogeneity of variances

Homogeneity of variances means “equal spread” or “equal variability” across different groups. It refers to the consistency of the spread or variance of the residuals across all levels of the independent variables.

When you are doing an experiment or a study, you often want to make sure that the groups you are comparing are similar in terms of how spread out their data is. If one group’s data is all over the place while another group’s data is very tightly packed, that is a problem for statistical tests which assume that the variances are equal across all the groups being compared.

Example:

Imagine you’re trying to predict the scores of students on a math test using the number of hours they studied. After everyone takes the test, you compare your predictions to the actual scores—the differences are called “residuals.”

If your predictions are equally reliable for students who studied a little and students who studied a lot, then the spread of those residuals will be pretty consistent. This is like saying whether a student studied for 2 hours or 10 hours, your prediction might be off by about 3 points on average either way. This consistency in prediction error is what we mean by “homogeneity of variance.”

If you notice that for students who studied a little, your predictions are really close, but for students who studied a lot, your predictions are way off—maybe you’re underestimating their scores. This means the spread of the residuals isn’t consistent. For those who studied a little, the residuals are small and clustered close together. For those who studied a lot, the residuals are large and spread out. This inconsistency is what we’d call “heteroscedasticity.”

“homogeneity of variance” means your prediction mistakes are about the same size no matter how much someone studied. If that’s not the case, and your mistakes vary a lot depending on how much they studied, then there’s a problem with “heteroscedasticity”.
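The study-time story can be simulated to see heteroscedasticity directly. A sketch with made-up data:

```r
set.seed(42)
# Error spread grows with hours studied -> heteroscedastic residuals
hours <- runif(200, 1, 10)
score <- 50 + 4 * hours + rnorm(200, sd = hours)
fan <- lm(score ~ hours)
plot(fitted(fan), resid(fan))  # fan shape: spread widens left to right
```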

Reasons for heteroscedasticity:

  1. Non-Linear Relationship: The relationship between study time and scores isn’t straight-line (linear).
  2. Missing Factors: There could be other factors that affect scores in addition to study time.
  3. Study Time Measurement: Perhaps “study time” isn’t measured in the best way. For example, just counting hours might not take into account the quality of the study.
  4. Extreme Values: There might be students with very high or very low study times that are affecting the prediction errors. These could be outliers, and they might need special attention in your analysis.
  5. Incorrect Model: You might be using the wrong type of model to make your predictions. Maybe the relationship between study time and scores is more complex than the model you’re using can handle.
check_h <- check_heteroscedasticity(model1)
plot(check_h)

Analysis
  1. x-axis (Fitted Values): the predicted values by your regression model.

  2. y-axis (Standardized residuals): the residuals from your model that have been standardized. Standardization typically means subtracting the mean and dividing by the standard deviation, so these values are expressed in terms of how many standard deviations they are from the mean residual (which is typically 0).

  3. Conclusion: the plot shows a pattern where the spread of residuals seems to increase with the fitted values — the residuals are quite close together on the left (for smaller fitted values) and spread out more on the right (for larger fitted values). This pattern is called heteroscedasticity, which means the variability of the errors is not consistent across the range of data. This can be problematic for regression models because they assume that the residuals are equally spread across all levels of the independent variables (homoscedasticity).

4.6 Model Diagnostic: Complete check

To create a composite checks on a statistical model.

check_model(model1)

Analysis
  1. Linearity: The reference line should be flat and horizontal if the relationship between the independent variables and the dependent variable is linear. If the residuals are randomly dispersed around the horizontal line, it suggests linearity. A pattern or systematic structure in the residuals would suggest non-linearity.

  2. Posterior Predictive Check: This plot compares the distribution of observed data against the distribution of data predicted by the model. Ideally, the model-predicted data should closely match the observed data, which would mean the model is accurately capturing the data’s distribution. In the plot, if the lines for observed and predicted data are close together, it indicates a good fit.

  3. Influential Observations: This plot, often called a Cook’s distance plot, helps identify influential observations. Points that fall outside the dashed lines (contour lines of Cook’s distance) can have a disproportionate influence on the model’s parameters.

4.7 Visualize Regression Parameters: see methods

To display the point estimates and confidence intervals of the parameters (coefficients) from a statistical model.

plot(parameters(model1))

Analysis
  1. dot-and-whisker plot: to display the point estimates and confidence intervals of the parameters (coefficients) from a statistical model.

  2. x-axis: represents the value of the estimated coefficients for each variable. These coefficients indicate the expected change in the dependent variable for a one-unit change in the predictor, assuming all other variables are held constant.

  3. Dots: Each dot represents the estimated coefficient for the corresponding variable. Its position on the horizontal axis shows whether the effect is positive or negative and how large it is (i.e., the direction and strength of the relationship between this independent variable and the dependent variable).

  4. Whiskers (Horizontal Lines): lines extending from the dots represent the confidence intervals for these estimates, often at a 95% confidence level. If a whisker crosses the vertical zero line (the dashed line), we cannot be confident that the variable has a statistically significant effect on the dependent variable at the chosen confidence level. (An interval that is entirely negative or entirely positive is fine; only one that crosses zero indicates an insignificant effect.)

  5. Conclusion:

    • Age_08_04: negative and does not cross the zero line, meaning negative relationship (the dependent variable decreases as “Age_08_04” increases)
    • Weight: positive relationship
    • Guarantee Period: the coefficient is positive, but its confidence interval does cross the zero line, suggesting that the relationship between ‘Guarantee Period’ and the dependent variable is not statistically significant at the given level of confidence.
    • KM: dot is at zero line, estimated effect is very small or negligible.
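The crossing-zero rule can be checked numerically with confint(); a sketch on R's built-in mtcars data (not the exercise's model1):

```r
fit <- lm(mpg ~ wt + qsec, data = mtcars)
ci  <- confint(fit, level = 0.95)  # one row per coefficient: 2.5% and 97.5% bounds

# A coefficient is significant at the 5% level iff its interval excludes zero
significant <- ci[, 1] > 0 | ci[, 2] < 0
significant
```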

4.8 Visualize Regression Parameters: ggcoefstats() methods

pacman::p_load(ggstatsplot)
ggcoefstats(model1, output="plot")

Analysis
  1. values next to the dots:
     1.1 estimated β: the estimated coefficient. These values are the “effect sizes,” showing how much the dependent variable is expected to change with a one-unit change in the predictor, holding other variables constant.
     1.2 t-value: the coefficient divided by its standard error.
     1.3 degrees of freedom: reflects the sample size (number of observations minus number of estimated parameters).
     1.4 p-value: indicates the probability of observing the data if the null hypothesis were true. A p-value less than 0.05 typically indicates statistical significance.

  2. conclusion:
     2.1 intercept: significant, with a large negative coefficient.
     2.2 Age_08_04: has the most substantial negative effect on the dependent variable, with a highly significant p-value (almost 0).
     2.3 Weight: has a significant positive effect on the dependent variable, with a very low p-value, indicating strong evidence against the null hypothesis.
     2.4 KM: has a small, but significant, negative effect.
     2.5 Guarantee_Period: has a negative effect, though smaller in magnitude than “KM”.
     2.6 AIC/BIC: measures of the model’s quality, with lower values generally indicating a better fit.
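The relationship t-value = coefficient / standard error can be verified directly from any fitted model; a sketch using mtcars:

```r
fit   <- lm(mpg ~ wt, data = mtcars)
coefs <- summary(fit)$coefficients  # Estimate, Std. Error, t value, Pr(>|t|)

t_manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]
all.equal(unname(t_manual), unname(coefs[, "t value"]))  # TRUE
```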

5. Visualize Uncertainty

5.1 Concept

Estimate: a best guess. In order to learn something about the whole population, we take a smaller group (sample), collect data, look at that, and then make an educated guess or estimate about the larger population based on what we see in the sample.

  • Point Estimate: the single best guess.
  • Interval Estimate: a range of numbers where the true answer might fall.

Uncertainty: how confident we are in our estimates across repeated sampling tests.

  • Confidence Interval: indicates the uncertainty of an estimated value, i.e. the range within which the estimates from repeated sampling tests tend to fall.
    • Large confidence interval: the results of repeated tests are dispersed and point to a wide range, so the uncertainty of the estimate is large
    • Small confidence interval: the results of repeated tests are concentrated and point to a narrow range, so the uncertainty of the estimate is small
    • Note: the uncertainty of an estimated value is not the same as its accuracy relative to the actual value.

Accuracy: how close the estimated values or ranges are to the actual values.

  • Confidence Level: indicates the accuracy of the estimated value relative to the actual value, i.e. the proportion of sampling tests whose estimated ranges contain the actual value
    • High confidence level: the actual value falls inside almost every test’s estimated range
    • Low confidence level: the actual value falls inside few of the tests’ estimated ranges

Confidence Level vs Confidence Interval:

  • The higher the confidence level, the wider the confidence interval

  • Confidence level of 99% doesn’t mean to be better than that of 95%
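Both points can be checked with a small simulation (assuming a normal population and the usual t-based interval):

```r
set.seed(2024)
true_mean <- 50
cover95 <- cover99 <- logical(2000)
w95 <- w99 <- numeric(2000)
for (i in 1:2000) {
  s  <- rnorm(30, mean = true_mean, sd = 10)  # one sampling test
  se <- sd(s) / sqrt(30)
  ci95 <- mean(s) + c(-1, 1) * qt(0.975, df = 29) * se
  ci99 <- mean(s) + c(-1, 1) * qt(0.995, df = 29) * se
  cover95[i] <- ci95[1] <= true_mean && true_mean <= ci95[2]
  cover99[i] <- ci99[1] <= true_mean && true_mean <= ci99[2]
  w95[i] <- diff(ci95)
  w99[i] <- diff(ci99)
}
mean(cover95)          # close to 0.95: share of intervals containing the true mean
mean(w99) > mean(w95)  # TRUE: higher confidence level means wider intervals
```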

5.2 Getting Started

5.2.1 Install and load packages

The packages used for this exercise include:

  • tidyverse, a family of R packages for data science process

  • plotly for creating interactive plot

  • gganimate for creating animation plot

  • DT for displaying interactive html table

  • crosstalk for implementing cross-widget interactions (currently, linked brushing and filtering)

  • ggdist for visualising distribution and uncertainty

devtools::install_github("wilkelab/ungeviz")
Code Notes
  1. devtools: package to install another package from other sources, meaning the package is not available on the Comprehensive R Archive Network (CRAN).

  2. install_github(): to install R packages that are hosted on GitHub

  3. "wilkelab/ungeviz": specifies the GitHub repository where the package is located. The format is “username/repository”, so in this case, “wilkelab” is the username or account name on GitHub, and “ungeviz” is the name of the repository where the package code resides.

pacman::p_load(ungeviz,plotly,crosstalk,DT,
               ggdist,ggridges,colorspace,
               gganimate,tidyverse)

5.2.2 Import data

exam <- read_csv("data/Exam_data.csv")

5.3 Visualize uncertainty: ggplot2 methods

The uncertainty of a point estimate is not the same as a confidence interval, nor is it the same as variation in the sample.

  • Uncertainty of point estimate: quantifies how much the estimate might vary if different samples were taken from the same population.

  • Confidence interval: quantifies the uncertainty about the true parameter value by providing a range of plausible values.

  • Variation in the sample: is commonly measured by standard deviation.
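The difference between sample variation (sd) and uncertainty of the mean (se) shows up clearly as the sample size grows; a small sketch on simulated data:

```r
set.seed(123)
population <- rnorm(1e5, mean = 60, sd = 15)

summ <- function(n) {
  s <- sample(population, n)
  c(sd = sd(s), se = sd(s) / sqrt(n))  # sample spread vs uncertainty of the mean
}
small_n <- summ(100)    # sd near 15, se near 1.5
large_n <- summ(10000)  # sd still near 15, but se shrinks towards 0.15
```

With more data the spread of individual values stays the same, but our estimate of the mean becomes much more certain.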

5.3.1 Derive necessary summary statistics from data

Calculate mean, standard deviation and standard error of the mean from data.

my_sum <- exam %>%
  group_by(RACE) %>%
  summarise(
    n=n(),
    mean=mean(MATHS),
    sd=sd(MATHS)
  ) %>%
  mutate(se=sd/sqrt(n-1))
Code Notes
  1. summarise(): is used to compute the count of observations, the mean and the standard deviation. It creates new columns for the specified summary statistics: n, mean, sd

  2. n=n(): the number of observations (or rows) in each group.

Display tibble data frame in an html table format.

An HTML table format refers to the way tables are structured and displayed in HTML (Hypertext Markup Language), the standard markup language for creating web pages and web applications. Unlike tables created in non-web-based environments, HTML tables can be styled extensively with CSS and manipulated with JavaScript, and can serve as interactive components in forms, data visualisations, and more.

knitr::kable(head(my_sum),format = 'html')
RACE        n      mean        sd        se
Chinese   193  76.50777  15.69040  1.132357
Indian     12  60.66667  23.35237  7.041005
Malay     108  57.44444  21.13478  2.043177
Others      9  69.66667  10.72381  3.791438

5.3.2 Plot standard error bars of point estimates

ggplot(my_sum)+
  geom_errorbar(
    aes(x=RACE,
        ymin=mean-se,
        ymax=mean+se),
    width=0.2,
    colour="black",
    alpha=0.9,
    size=0.5)+
  geom_point(aes(x=RACE,y=mean),
             stat="identity",
             color="red",
             size=1.5,
             alpha=1)+
  ggtitle("Standard error of mean maths score by race")

Code Notes
  1. Graphs: the standard-error-of-mean chart can be viewed as two graphs combined in one plot: a graph of the standard error and a graph of the mean. The definition of the y-axis value differs between the two graphs, so we put aes() inside each geom_*() instead of putting it in ggplot()

  2. stat="identity": This argument within geom_point specifies that the values for x and y should be taken directly from the data without any additional statistical transformation. It is the default statistical transformation for geom_point(). If you want to change the default behavior with a different statistical transformation (e.g., counting occurrences of discrete values), you need to define it within geom_point() by using stat="count"

5.3.3 Plot confidence interval

ggplot(my_sum)+
  geom_errorbar(
    aes(x=reorder(RACE,-mean),
        ymin=mean-1.96*se,
        ymax=mean+1.96*se),
    width=0.2,
    colour="black",
    alpha=0.9,
    size=0.5)+
  geom_point(aes(x=RACE,y=mean),
             stat="identity",
             color="red",
             size=1.5,
             alpha=1)+
  labs(x="Race",title="95% confidence interval of mean maths score by race")

Code Notes
  1. reorder(RACE,-mean): is used to change the order of levels of a factor based on the values of another variable. In this case, it reorders the levels of RACE based on the values of mean in descending order.
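reorder() can be tried in isolation; a sketch using the group means from the summary table above:

```r
race <- c("Chinese", "Indian", "Malay", "Others")
avg  <- c(76.5, 60.7, 57.4, 69.7)  # mean maths scores from my_sum

lv <- levels(reorder(race, -avg))  # levels sorted by descending mean
lv
```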

5.3.4 Plot confidence interval interactive with table

shared_df = SharedData$new(my_sum)

bscols(widths = c(4,8),
       ggplotly((ggplot(shared_df) +
                   geom_errorbar(aes(
                     x=reorder(RACE, -mean),
                     ymin=mean-2.58*se, 
                     ymax=mean+2.58*se), 
                     width=0.2, 
                     colour="black", 
                     alpha=0.9, 
                     size=0.5) +
                   geom_point(aes(
                     x=RACE, 
                     y=mean, 
                     text = paste("Race:", `RACE`, 
                                  "<br>N:", `n`,
                                  "<br>Avg. Scores:", round(mean, digits = 2),
                                   "<br>99% CI:[", 
                                  round((mean-2.58*se), digits = 2), ",",
                                  round((mean+2.58*se), digits = 2),"]")),
                     stat="identity", 
                     color="red", 
                     size = 1.5, 
                     alpha=1) + 
                   xlab("Race") + 
                   ylab("Average Scores") + 
                   theme_minimal() + 
                   theme(axis.text.x = element_text(
                     angle = 45, vjust = 0.5, hjust=1)) +
                   ggtitle("99% Confidence interval of average<br>maths scores by race")), 
                tooltip = "text"), 
       DT::datatable(shared_df, 
                     rownames = FALSE, 
                     class="compact", 
                     width="100%", 
                     options = list(pageLength = 10,
                                    scrollX=T), 
                     colnames = c("No. of pupils", 
                                  "Avg Scores",
                                  "Std Dev",
                                  "Std Error")) %>%
         formatRound(columns=c('mean', 'sd', 'se'),
                     digits=2))
Code Notes
  1. shared_df = SharedData$new(my_sum): to wrap the ‘my_sum’ data frame into an object that can be shared between different interactive visualizations or data presentations, maintain synchronization between them or to manage data more efficiently.

  2. bscols(): from the crosstalk package, arranges HTML widgets side by side in a Bootstrap grid layout. It allows you to specify the width proportions of the columns, and works in rendered HTML documents as well as in Shiny applications.

  3. widths=c(4,8): the first column will occupy 4 units of width which is used for ggplotly, while the second column will occupy 8 units of width used for datatable in the Bootstrap grid system.

  4. formatRound(columns=c('mean', 'sd', 'se'), digits=2): used within the DT package to format numerical columns in the DataTable with values rounded to 2 decimal places.

5.4 Visualize uncertainty: ggdist methods

ggdist is designed for both frequentist and Bayesian uncertainty visualization, taking the view that uncertainty visualization can be unified through the perspective of distribution visualization.

  • Frequentist Model: probabilities are based on the frequency of events. It relies on the idea that if you repeat an experiment infinitely many times, the relative frequency of an event will converge to its true probability. It relies solely on observed data for inference.

  • Bayesian Model: probabilities represent a degree of belief or uncertainty. It incorporates prior beliefs about the parameters being estimated along with the observed data to update and refine those beliefs. It incorporates both prior beliefs and observed data, updating beliefs as new data becomes available.

5.4.1 Visualize confidence interval of mean: ggdist methods (stat_pointinterval)

exam %>%
  ggplot(aes(x=RACE,
             y=MATHS))+
  stat_pointinterval()+
  labs(
    title="Visualizing confidence intervals of mean maths score",
    subtitle="Mean Point + Multiple-interval plot")

Code Notes
  1. stat_pointinterval(): is used to build a visual for displaying distribution of mean with confidence intervals

Add arguments within stat_pointinterval()

exam %>%
  ggplot(aes(x=RACE,
             y=MATHS))+
  stat_pointinterval(.width=0.95,
                     .point=median,
                     .interval=qi)+
  labs(
    title="Visualizing confidence intervals of median maths score",
    subtitle="Median Point + Multiple-interval plot")

Code Notes
  1. .interval=qi: specifies the method used to calculate the intervals for the median maths score. qi computes quantile intervals: a 95% interval is taken between the 2.5th and 97.5th percentiles of the sample distribution. (When .width is not specified, stat_pointinterval() draws 66% and 95% intervals by default.)
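In base R terms, a 95% quantile interval is simply the 2.5th and 97.5th percentiles of the values; a sketch on simulated scores:

```r
set.seed(7)
scores <- rnorm(1000, mean = 70, sd = 10)

qi_95 <- quantile(scores, probs = c(0.025, 0.975))  # 95% quantile interval
qi_95
```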

5.4.2 Visualize confidence interval of mean in different confidence level: ggdist methods (stat_pointinterval)

exam %>%
  ggplot(aes(x = RACE, 
             y = MATHS)) +
  stat_pointinterval(
    # ggdist uses .width (not conf.level) to set the interval widths
    .width = c(0.90, 0.99),
    show.legend = FALSE) +   
  labs(
    title = "Visualising confidence intervals of mean math score",
    subtitle = "Mean Point + Multiple-interval plot")

5.4.3 Visualize confidence interval of mean: ggdist methods (stat_gradientinterval)

exam %>%
  ggplot(aes(x = RACE, 
             y = MATHS)) +
  stat_gradientinterval(   
    fill = "skyblue",      
    show.legend = TRUE     
  ) +                        
  labs(
    title = "Visualising confidence intervals of mean math score",
    subtitle = "Gradient + interval plot")

5.5 Visualize uncertainty: Hypothetical Outcome Plots (HOPs)

Hypothetical Outcome Plots (HOPs) visualize uncertainty by animating a finite set of hypothetical outcomes, i.e. individual draws (samples) from the underlying distribution, rather than summarising it as a static interval. As the animation cycles through the draws, viewers experience the sample-to-sample variability directly, which makes HOPs useful for judging how much an estimate could plausibly vary.

library(ungeviz)

ggplot(data = exam, 
       aes(x = factor(RACE), y = MATHS)) +
  geom_point(position = position_jitter(
    height = 0.3, width = 0.05), 
    size = 0.4, color = "#0072B2", alpha = 1/2) +
  geom_hpline(data = sampler(25, group = RACE), height = 0.6, color = "#D55E00") +
  theme_bw() + 
  # `.draw` is a generated column indicating the sample draw
  transition_states(.draw, 1, 3)

Code Notes
  1. position_jitter(): jitters the points slightly along the x-axis and y-axis to prevent overplotting, making it easier to see the density of points. Unlike a normal point plot, which places each point at its actual data values, position_jitter() displaces each point by a small random offset; the plotted positions therefore carry a deliberate slight “error” relative to the actual data.

  2. geom_hpline(): adds horizontal lines to the plot. These lines are generated from the sampler() function, which creates samples of the dataset exam. sampler(25, group = RACE) means that it will create a subset of the original dataset ‘exam’ with 25 observations from each group defined by the ’RACE’variable.

  3. height = 0.6: sets the height of horizontal lines

  4. transition_states(.draw, 1, 3): sets up the animation using gganimate, transitioning between states defined by the .draw column (one state per sample draw). The second and third arguments are the relative transition_length (1) and state_length (3), i.e. how long the animation spends moving between states versus pausing at each state, so the animation cycles through the 25 draws one after another.

  5. draw: In statistical or simulation contexts, a “draw” or “iteration” usually refers to a single execution or realization of a process or simulation. Each draw represents one instance of the simulation or resampling procedure.
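A rough base-R sketch of the idea behind sampler() (not ungeviz's actual implementation): generate several bootstrap draws per group and tag each with a .draw id that the animation can cycle through:

```r
set.seed(99)
scores <- data.frame(group = rep(c("A", "B"), each = 50),
                     value = rnorm(100, mean = 60, sd = 10))

draws <- do.call(rbind, lapply(1:25, function(d) {
  # one bootstrap draw: resample rows within each group, with replacement
  idx <- unlist(tapply(seq_len(nrow(scores)), scores$group,
                       function(i) sample(i, length(i), replace = TRUE)))
  cbind(scores[idx, ], .draw = d)
}))
nrow(draws)                  # 25 draws x 100 rows = 2500
length(unique(draws$.draw))  # 25
```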

6. Build Funnel Plot for fair comparisons

Funnel plot is a specially designed data visualisation for conducting unbiased comparison between outlets, stores or business entities.

6.1 Getting Started

6.1.1 Install and load packages

The packages used in this exercise include:

  • readr for importing csv into R.

  • FunnelPlotR for creating funnel plot.

  • ggplot2 for creating funnel plot manually.

  • knitr for building static html table.

  • plotly for creating interactive funnel plot.

pacman::p_load(tidyverse,FunnelPlotR,plotly,knitr)

6.1.2 Import data

covid19 <- read_csv("data/COVID-19_DKI_Jakarta.csv") %>%
  mutate_if(is.character, as.factor)

For this hands-on exercise, we are going to compare the cumulative COVID-19 cases and death by sub-district (i.e. kelurahan) as at 31st July 2021, DKI Jakarta.

6.2 FunnelPlotR methods

FunnelPlotR package uses ggplot to generate funnel plots. It requires a numerator (events of interest), denominator (population to be considered) and group. The key arguments selected for customisation are:

  • limit: specify the plot limits, either at the 95% confidence level or the 99% confidence level. This determines the width of the funnel plot and the corresponding confidence intervals around the summary measure.

  • label_outliers: When set to true, this argument will label outliers on the funnel plot.

  • Poisson_limits: Setting this argument to true adds Poisson limits to the plot. Poisson limits are calculated based on the assumption that the events of interest follow a Poisson distribution. These limits help assess whether observed variation in event rates is within the range of random chance.

  • Poisson Distribution: describe the number of events that occur within a fixed interval of time or space, given a known average rate of occurrence and the assumption that the events happen independently of each other. When dealing with funnel plots, we often want to assess whether observed variations in event rates (like mortality rates, complication rates, etc.) are within the range of random chance or if they might indicate something significant. Poisson limits in the context of funnel plots help us with this assessment by setting boundaries to check whether the observed data points are within what we’d expect from random variation alone. If data points fall outside these Poisson limits, it suggests that there might be factors other than chance affecting the event rates and warrants further investigation.

  • OD_adjust: This argument adds overdispersed limits to the plot when set to true. Overdispersion occurs when there is more variability in the data than expected based on a Poisson distribution.

  • xrange and yrange: specify the range to display for the x-axis and y-axis, respectively. They act as a zoom function, allowing you to focus on a specific region of interest within the funnel plot.
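The Poisson-limits idea can be sketched with base R's qpois() (illustrative numbers, not the COVID-19 data):

```r
lambda <- 20  # hypothetical expected event count for an entity of this size
limits <- qpois(c(0.001, 0.999), lambda)  # counts consistent with chance alone
limits

observed <- 45  # a hypothetical observed count
outside <- observed < limits[1] || observed > limits[2]
outside  # TRUE: beyond random variation, warrants investigation
```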

6.2.1 FunnelPlotR methods: The basic plot

funnel_plot(
  numerator = covid19$Positive,
  denominator = covid19$Death,
  group = covid19$`Sub-district`
)

A funnel plot object with 267 points of which 0 are outliers. 
Plot is adjusted for overdispersion. 

6.2.2 FunnelPlotR methods: Makeover 1

funnel_plot(
  numerator = covid19$Death,
  denominator = covid19$Positive,
  group = covid19$`Sub-district`,
  data_type = "PR",     #<<
  xrange = c(0, 6500),  #<<
  yrange = c(0, 0.05)   #<<
)

A funnel plot object with 267 points of which 7 are outliers. 
Plot is adjusted for overdispersion. 
Code Notes
  1. data_type = "PR": determines the type of data being plotted.

  2. PR stands for “Proportion” or “Prevalence Rate”, while SR(by default) stands for “Standardized Rate” or “Standardized Ratio”.

Example:

City A:

Total population: 100,000; Number of people with diabetes: 10,000.

City B:

Total population: 200,000; Number of people with diabetes: 20,000.

Prevalence Rate (PR):

  • City A: (Number of people with diabetes in City A) / (Total population of City A) = 10,000 / 100,000 = 0.1 or 10%
  • City B: (Number of people with diabetes in City B) / (Total population of City B) = 20,000 / 200,000 = 0.1 or 10%

Standardized Rate (SR):

  • Standard Population: Total population of City A + Total population of City B = 100,000 + 200,000 = 300,000
  • City A: (Number of people with diabetes in City A) / (Total standard population) * 100,000 = (10,000 / 300,000) * 100,000 = 3,333.33 cases per 100,000 population
  • City B: (Number of people with diabetes in City B) / (Total standard population) * 100,000 = (20,000 / 300,000) * 100,000 = 6,666.67 cases per 100,000 population

6.2.3 FunnelPlotR methods: Makeover 2

funnel_plot(
  numerator = covid19$Death,
  denominator = covid19$Positive,
  group = covid19$`Sub-district`,
  data_type = "PR",   
  xrange = c(0, 6500),  
  yrange = c(0, 0.05),
  label = NA,
  title = "Cumulative COVID-19 Fatality Rate by Cumulative Total Number of COVID-19 Positive Cases", #<<           
  x_label = "Cumulative COVID-19 Positive Cases", #<<
  y_label = "Cumulative Fatality Rate"  #<<
)

A funnel plot object with 267 points of which 7 are outliers. 
Plot is adjusted for overdispersion. 

6.3 Funnel Plot for Fair Visual Comparison: ggplot2 methods

6.3.1 Compute the basic derived fields

Derive cumulative death rate and standard error of cumulative death rate.

df <- covid19 %>%
  mutate(rate = Death / Positive) %>%
  mutate(rate.se = sqrt((rate*(1-rate)) / (Positive))) %>%
  filter(rate > 0)

Derive weighted mean of death rate.

fit.mean <- weighted.mean(df$rate, 1/df$rate.se^2)
Code Notes
  1. weighted.mean(df$rate, 1/df$rate.se^2): computes the weighted mean of the rates in the “rate” column, where the weights are given by the inverses of the squared standard errors. This means that rates with smaller standard errors contribute more to the weighted mean, reflecting higher confidence in those estimates, while rates with larger standard errors contribute less.
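The inverse-variance weighting can be verified by hand on a few illustrative rates (hypothetical numbers, not the covid19 data):

```r
rate    <- c(0.010, 0.020, 0.015)
rate.se <- c(0.002, 0.008, 0.004)
w <- 1 / rate.se^2  # smaller standard error => larger weight

wm <- weighted.mean(rate, w)
all.equal(wm, sum(w * rate) / sum(w))  # TRUE: same formula written out
wm < mean(rate)  # TRUE: the precisely estimated 0.010 dominates the average
```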

6.3.2 Calculate lower and upper limits for 95% and 99.9% CI

number.seq <- seq(1, max(df$Positive), 1)
number.ll95 <- fit.mean - 1.96 * sqrt((fit.mean*(1-fit.mean)) / (number.seq)) 
number.ul95 <- fit.mean + 1.96 * sqrt((fit.mean*(1-fit.mean)) / (number.seq)) 
number.ll999 <- fit.mean - 3.29 * sqrt((fit.mean*(1-fit.mean)) / (number.seq)) 
number.ul999 <- fit.mean + 3.29 * sqrt((fit.mean*(1-fit.mean)) / (number.seq)) 
dfCI <- data.frame(number.ll95, number.ul95, number.ll999, 
                   number.ul999, number.seq, fit.mean)
Code Notes
  1. seq(1, max(df$Positive), 1): creates a sequence of numbers from 1 to the maximum value of the “Positive” column in the data frame ‘df’, with a step size of 1. This generates the candidate sample sizes (cumulative case counts) at which the funnel’s control limits are evaluated.

6.3.3 Plot a static funnel plot

p <- ggplot(df, aes(x = Positive, y = rate)) +
  geom_point(aes(label=`Sub-district`), 
             alpha=0.4) +
  geom_line(data = dfCI, 
            aes(x = number.seq, 
                y = number.ll95), 
            size = 0.4, 
            colour = "grey40", 
            linetype = "dashed") +
  geom_line(data = dfCI, 
            aes(x = number.seq, 
                y = number.ul95), 
            size = 0.4, 
            colour = "grey40", 
            linetype = "dashed") +
  geom_line(data = dfCI, 
            aes(x = number.seq, 
                y = number.ll999), 
            size = 0.4, 
            colour = "grey40") +
  geom_line(data = dfCI, 
            aes(x = number.seq, 
                y = number.ul999), 
            size = 0.4, 
            colour = "grey40") +
  geom_hline(data = dfCI, 
             aes(yintercept = fit.mean), 
             size = 0.4, 
             colour = "grey40") +
  coord_cartesian(ylim=c(0,0.05)) +
  annotate("text", x = 1, y = -0.13, label = "95%", size = 3, colour = "grey40") + 
  annotate("text", x = 4.5, y = -0.18, label = "99.9%", size = 3, colour = "grey40") + 
  ggtitle("Cumulative Fatality Rate by Cumulative Number of COVID-19 Cases") +
  xlab("Cumulative Number of COVID-19 Cases") + 
  ylab("Cumulative Fatality Rate") +
  theme_light() +
  theme(plot.title = element_text(size=12),
        legend.position = c(0.91,0.85), 
        legend.title = element_text(size=7),
        legend.text = element_text(size=7),
        legend.background = element_rect(colour = "grey60", linetype = "dotted"),
        legend.key.height = unit(0.3, "cm"))
p

6.3.4 Interactive Funnel Plot: plotly + ggplot2

fp_ggplotly <- ggplotly(p,
  tooltip = c("label", 
              "x", 
              "y"))
fp_ggplotly