An introduction to statistical inference and practice. Statistical models and parameters, estimators of parameters and their statistical properties, methods of estimation, confidence intervals, hypothesis testing, likelihood function, the linear model. Use of statistical computation for data analysis and simulation.
Course material was provided by Professor Karen Huynh Wong; modifications were made by me.
In this R Lab, you will be guided by your TA to familiarize yourself with the R environment, using and formatting in R Markdown, and producing well-organized outputs.
You can use
?command_name
to find the documentation of various commands.
Some additional resources you may find helpful as you learn are the R Markdown Cheatsheet and the LaTeX Cheatsheet. As you work more in LaTeX, the notation will get easier and become second nature!
R packages are collections of functions, code, and data sets developed by others that can be used to supplement the basic tools in R. R comes with some basic packages, but working with them can sometimes be clunky. You’ll begin by learning how to install and load packages, and the difference between these two actions!
Begin by installing and loading the tidyverse
package,
which is a collection of R packages. We will use this extensively in our
course, as all packages within share similar structure and design,
making it a very intuitive and versatile tool to master!
Note: Never include a package install in your R chunks in an R Markdown file. R will not be able to knit your document! Instead, run it strictly within your console.
# Students: What do you think the 'message = F' in the R chunk option does?
# Try knitting your document with and without the option
# Can also try `warning = F` to hide warnings
# echo = F` to run the chunk but hide the code
# install.packages("tidyverse")
library(tidyverse)
# Use Ctrl + Alt + I shortcut for creating code chunks
Let’s examine some datasets that are already available in R (see
library(help = "datasets")
for a list of datasets in base
R), starting with the ToothGrowth dataset.
# ?ToothGrowth
# Variable Assignment
dat <- ToothGrowth
# Mostly useful for smaller datasets, but for larger ones we prefer to use glimpse and summary
# View(dat)
glimpse(dat)
## Rows: 60
## Columns: 3
## $ len <dbl> 4.2, 11.5, 7.3, 5.8, 6.4, 10.0, 11.2, 11.2, 5.2, 7.0, 16.5, 16.5, 15.2, 17.3, 22.5, 17.3, 13.6, 14.5, 18.8, 15.5, 23.6, 18.5, 33.9, 25.5, 26.4, 32.5, 26.7, 21.5…
## $ supp <fct> VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, OJ, OJ, OJ, OJ, OJ, OJ, OJ, OJ, OJ, OJ, …
## $ dose <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 0.5, 0.5, …
summary(dat)
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
It looks like there are two groups of data in the
ToothGrowth
data set. What are they?
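One quick way to check (a small sketch, consistent with the summary output above):
# The two groups are the levels of the `supp` factor
levels(dat$supp)
table(dat$supp)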
What if you only want to study the growth based on one source of vitamin C? How do you go about extracting this information?
# After loading and examining your data, and learning about selecting columns or elements
# from your data set, consider the following two commands below.
# Briefly comment below what these two commands do:
# What does
which(dat$supp == 'VC') #do?
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
# Extracts which indices are TRUE
# What does
dat[dat$supp=='VC', 1] #do?
## [1] 4.2 11.5 7.3 5.8 6.4 10.0 11.2 11.2 5.2 7.0 16.5 16.5 15.2 17.3 22.5 17.3 13.6 14.5 18.8 15.5 23.6 18.5 33.9 25.5 26.4 32.5 26.7 21.5 23.3 29.5
# extracts the `len` variable/column, where `supp` is 'VC'
# Alternative 1
# Pipe operator shortcut: Ctrl + Shift + M
dat %>%
filter(supp == "VC") %>%
select(len)
## len
## 1 4.2
## 2 11.5
## 3 7.3
## 4 5.8
## 5 6.4
## 6 10.0
## 7 11.2
## 8 11.2
## 9 5.2
## 10 7.0
## 11 16.5
## 12 16.5
## 13 15.2
## 14 17.3
## 15 22.5
## 16 17.3
## 17 13.6
## 18 14.5
## 19 18.8
## 20 15.5
## 21 23.6
## 22 18.5
## 23 33.9
## 24 25.5
## 25 26.4
## 26 32.5
## 27 26.7
## 28 21.5
## 29 23.3
## 30 29.5
# Alternative 2
# This is basically Alternative 1 without the pipe operator
select(filter(dat, supp == "VC"), len)
## len
## 1 4.2
## 2 11.5
## 3 7.3
## 4 5.8
## 5 6.4
## 6 10.0
## 7 11.2
## 8 11.2
## 9 5.2
## 10 7.0
## 11 16.5
## 12 16.5
## 13 15.2
## 14 17.3
## 15 22.5
## 16 17.3
## 17 13.6
## 18 14.5
## 19 18.8
## 20 15.5
## 21 23.6
## 22 18.5
## 23 33.9
## 24 25.5
## 25 26.4
## 26 32.5
## 27 26.7
## 28 21.5
## 29 23.3
## 30 29.5
# Below we can also extract the other 2 columns
dat[dat$supp=='VC', 2]
## [1] VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC
## Levels: OJ VC
dat[dat$supp=='VC', 3]
## [1] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
After examining the data set, what are some study questions we can ask? Make some suggestions in the Teams chat!
Below are some questions suggested by students in the chat:
* Do dosage and length have a positive correlation?
* Which vitamin C source contributes to the most growth?
Find the (sample) mean and standard deviation of tooth length in the ascorbic acid (VC) group and again in the orange juice (OJ) group. What can be said, based on the computed numbers, about the differences between the two groups?
Do this using the formula instead of the functions
mean()
or sd()
, to practice using R as a
calculator!
# Task for students: try to recreate the mean and variance values through
# Calculations in R, and compare them with the values you should get through
# mean() and sd() commands in R
# Create tooth length variable for the vitamin C
vc <- dat[dat$supp=='VC', 1]
# take the sum of the tooth length, then divide by the number of values
n <- length(vc)
mean.vc <- sum(vc)/n
mean.vc
## [1] 16.96333
# Can also compute using a for loop
mean.vc2 <- 0
for (i in 1:n){
mean.vc2 <- mean.vc2 + vc[i]/n
}
mean.vc2
## [1] 16.96333
# Compare to mean() function
mean(vc)
## [1] 16.96333
# Standard deviation
var.vc <- sum((vc - mean.vc)^2)/(n-1)
sd.vc <- sqrt(var.vc)
sd.vc
## [1] 8.266029
# Compare to sd() function
sd(vc)
## [1] 8.266029
Below is a repeat of the above but for OJ
(orange
juice):
# Orange Juice
oj <- dat[dat$supp=='OJ', 1]
# Mean
n <- length(oj)
mean.oj <- sum(oj)/n
mean.oj
## [1] 20.66333
# Can also compute using a for loop
mean.oj2 <- 0
for (i in 1:n){
mean.oj2 <- mean.oj2 + oj[i]/n
}
mean.oj2
## [1] 20.66333
# Compare to mean()
mean(oj)
## [1] 20.66333
# Standard deviation
var.oj <- sum((oj - mean.oj)^2)/(n-1)
sd.oj <- sqrt(var.oj)
sd.oj
## [1] 6.605561
# Compare to sd()
sd(oj)
## [1] 6.605561
Trying to type the variance formula in plain text is not visually
appealing, nor is it helpful for communication (it’s very difficult to read!).
The beauty of LaTeX is that the format is fairly intuitive. For example,
$\frac{numerator}{denominator}$
will render as \(\frac{numerator}{denominator}\) (single dollar
signs indicate the start and end of the math setting; double dollar signs
will render the math centred on the page). For instance, the sample
variance can be written as:
\[s^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}\] Try to write the sample mean in LaTeX and quote the values computed in the R chunk above!
For the mean we can use
\bar{x}_n = \frac{\sum_{i=1}^n x_i}{n}
, which renders as: \[\bar{x}_n = \frac{\sum_{i=1}^n x_i}{n}\]
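As a hedged illustration, quoting the values computed in the R chunks above for the ascorbic acid (VC) group: \[\bar{x}_{VC} \approx 16.96, \qquad s_{VC} \approx 8.27\]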
Numerical summaries are powerful for providing concrete, numerical
evidence to support our claims. Graphical displays can also tell the
story behind our numbers in a clear, visual manner, and can sometimes be
a more powerful communication tool than numbers alone because we can see
the scale of the differences. Let’s use
the ggplot2
package in tidyverse
to produce
side-by-side box plots of how tooth length growth differs between the
two sources of vitamin C:
# Focus of this segment: understanding the layering language of ggplot
# Note also the R chunk options that control the alignment and size of
# The output figures
dat %>%
ggplot(aes(x = supp, y = len)) +
geom_boxplot() + # passing aes(x,y) into here works too!
labs(title = "Guinea Pig Tooth Length Growth by Vitamin C Source",
subtitle = "ToothGrowth Dataset",
x = "Vitamin C Source",
y = "Tooth Length Growth") +
scale_x_discrete(labels = c("Orange Juice", "Ascorbic Acid")) +
theme_bw() # there are many different themes from which to choose!
What can we notice about tooth length growth between the two supplement groups?
Practice using visual and numeric features to support your observations. Here are some things that we notice:
# Two-sided t-test
t.test(oj,vc)
##
## Welch Two Sample t-test
##
## data: oj and vc
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1710156 7.5710156
## sample estimates:
## mean of x mean of y
## 20.66333 16.96333
# One-sided t-test
t.test(oj,vc,"greater")
##
## Welch Two Sample t-test
##
## data: oj and vc
## t = 1.9153, df = 55.309, p-value = 0.03032
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.4682687 Inf
## sample estimates:
## mean of x mean of y
## 20.66333 16.96333
In this R Lab, you will be guided by the TA to demonstrate the steps involved in performing an exploratory data analysis.
* the tidyverse package
* the lubridate package for manipulating dates
* the summarize function
# Load Required Packages
library(tidyverse)
library(lubridate)
# Set work directory
setwd("~/GitHub/dang-kevin.github.io/STA238")
# Read in Data
weather <- read.csv("weatherstats_vancouver_daily.csv")
# Using file.choose() allows you to manually select the file you want
# weather <- read.csv(file.choose())
# Quick overview of the data
glimpse(weather)
## Rows: 1,000
## Columns: 71
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, …
## $ date <chr> "2022-01-16", "2022-01-15", "2022-01-14", "2022-01-13", "2022-01-12", "2022-01-11", "2022-01-10", "2022-01-09", "2022-01-08", "2022-01-…
## $ max_temperature <dbl> 6.3, 7.0, 9.8, 11.1, 11.5, 10.0, 8.5, 6.7, 3.4, 4.6, 3.9, 0.8, 4.9, 5.4, 5.0, 1.4, -2.5, -0.7, -4.1, -4.4, -7.8, -5.7, 0.6, 3.6, 7.0, 8…
## $ avg_hourly_temperature <dbl> 5.29, 5.66, 6.94, 7.56, 9.87, 8.97, 4.78, 2.55, 1.52, 2.64, 0.80, -1.56, 2.12, 3.35, 3.31, -2.01, -6.64, -4.03, -7.09, -8.05, -10.80, -…
## $ avg_temperature <dbl> 5.55, 5.15, 7.45, 7.50, 10.19, 8.35, 3.55, 2.60, 0.64, 2.95, 0.19, -1.29, 2.55, 3.50, 2.75, -4.20, -6.70, -4.39, -7.90, -8.60, -11.55, …
## $ min_temperature <dbl> 4.8, 3.3, 5.1, 3.9, 8.9, 6.7, -1.4, -1.5, -2.1, 1.3, -3.5, -3.4, 0.2, 1.6, 0.5, -9.8, -10.9, -8.1, -11.7, -12.8, -15.3, -10.8, -5.8, 0.…
## $ max_humidex <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ min_windchill <int> NA, NA, NA, NA, NA, NA, NA, -2, -5, NA, -7, -10, NA, NA, NA, -13, -17, -14, -18, -20, -23, -16, -13, NA, NA, NA, -10, -7, -5, NA, -6, -…
## $ max_relative_humidity <int> 100, 100, 99, 99, 99, 99, 99, 99, 96, 99, 99, 98, 99, 96, 96, 82, 93, 94, 92, 91, 85, 90, 95, 98, 96, 98, 99, 93, 99, 98, 97, 95, 91, 9…
## $ avg_hourly_relative_humidity <dbl> 99.0, 99.2, 95.7, 94.8, 96.8, 96.3, 93.4, 93.7, 87.3, 77.5, 98.0, 90.9, 90.2, 87.3, 84.6, 71.9, 74.8, 85.0, 69.2, 80.9, 67.9, 72.5, 87.…
## $ avg_relative_humidity <dbl> NA, 99.0, 93.0, 93.0, 94.5, 95.5, 91.5, 91.5, 79.0, 78.5, 98.0, NA, 85.0, 84.5, 82.0, 70.5, 72.5, 80.0, 70.0, 79.0, 71.0, 72.5, 84.0, 9…
## $ min_relative_humidity <int> 97, 98, 87, 87, 90, 92, 84, 84, 62, 58, 97, 82, 71, 73, 68, 59, 52, 66, 48, 67, 57, 55, 73, 91, 72, 83, 68, 61, 72, 66, 81, 70, 62, 83,…
## $ max_dew_point <dbl> 6.1, 6.5, 7.4, 8.5, 10.2, 9.4, 6.9, 3.6, 2.4, 3.8, 3.9, 0.2, 1.6, 3.1, 2.8, -3.4, -7.3, -4.3, -6.2, -8.9, -10.6, -8.2, -0.4, 2.0, 4.0, …
## $ avg_hourly_dew_point <dbl> 5.1, 5.6, 6.3, 6.7, 9.4, 8.4, 3.8, 1.6, -0.5, -1.1, 0.5, -2.9, 0.6, 1.4, 0.9, -6.5, -10.7, -6.3, -12.1, -10.8, -15.7, -11.9, -4.7, 0.7,…
## $ avg_dew_point <dbl> 5.3, 4.8, 6.3, 6.2, 9.2, 8.1, 3.4, 1.6, -1.0, -0.6, 0.5, -2.2, 0.5, 1.2, 0.4, -6.2, -10.1, -7.1, -11.1, -11.6, -15.0, -11.9, -4.5, 1.0,…
## $ min_dew_point <dbl> 4.6, 3.1, 5.3, 4.0, 8.2, 6.8, -0.1, -0.5, -4.5, -5.1, -2.9, -4.6, -0.6, -0.6, -1.9, -9.1, -12.8, -9.9, -16.0, -14.3, -19.4, -15.6, -8.5…
## $ max_wind_speed <int> 11, 13, 14, 16, 23, 28, 28, 15, 24, 61, 26, 23, 31, 36, 44, 36, 18, 25, 23, 17, 21, 19, 32, 26, 31, 32, 20, 15, 24, 35, 27, 36, 26, 22,…
## $ avg_hourly_wind_speed <dbl> 4.96, 6.21, 7.67, 8.58, 15.88, 20.58, 15.42, 7.08, 14.21, 34.17, 16.83, 13.12, 16.38, 22.42, 30.17, 19.33, 9.96, 12.38, 14.17, 9.12, 15…
## $ avg_wind_speed <dbl> 6.0, 8.0, 8.0, 9.0, 14.5, 21.0, 15.0, 8.0, 14.5, 34.0, 15.0, 15.0, 16.5, 23.0, 32.5, 24.0, 9.5, 15.0, 14.0, 10.0, 12.5, 15.5, 22.0, 15.…
## $ min_wind_speed <int> 1, 3, 2, 2, 6, 14, 2, 1, 5, 7, 4, 7, 2, 10, 21, 12, 1, 5, 5, 3, 4, 12, 12, 4, 8, 9, 1, 1, 1, 4, 6, 1, 2, 6, 4, 7, 7, 1, 3, 2, 3, 1, 3, …
## $ max_wind_gust <int> NA, NA, NA, NA, 38, 41, 38, NA, 32, 83, 38, NA, 46, 62, 68, 49, NA, 32, 32, NA, NA, NA, 42, 39, 46, 46, NA, NA, 31, 46, 31, 47, 47, 38,…
## $ wind_gust_dir_10s <int> NA, NA, NA, NA, 9, 10, 7, NA, 7, 29, 10, NA, 20, 20, 14, 9, NA, 7, 6, NA, NA, NA, 9, 10, 26, 14, NA, NA, 31, 20, 9, 29, 7, 9, 8, 16, 16…
## $ max_pressure_sea <dbl> 102.76, 102.90, 103.29, 103.06, 101.97, 102.24, 102.69, 102.79, 102.71, 101.45, 101.05, 102.52, 101.75, 100.67, 101.16, 102.51, 102.41,…
## $ avg_hourly_pressure_sea <dbl> 102.34, 102.82, 103.10, 102.30, 101.77, 101.94, 102.49, 102.65, 102.25, 99.95, 100.12, 102.08, 101.31, 99.77, 100.39, 102.10, 101.90, 1…
## $ avg_pressure_sea <dbl> 102.35, 102.81, 103.06, 102.37, 101.81, 101.95, 102.38, 102.67, 102.12, 100.22, 100.04, 101.87, 101.25, 99.99, 100.27, 101.81, 101.78, …
## $ min_pressure_sea <dbl> 101.94, 102.71, 102.83, 101.68, 101.64, 101.66, 102.06, 102.54, 101.53, 99.00, 99.03, 101.22, 100.75, 99.32, 99.37, 101.11, 101.14, 100…
## $ max_pressure_station <dbl> 102.71, 102.85, 103.24, 103.01, 101.92, 102.19, 102.64, 102.74, 102.66, 101.40, 101.00, 102.47, 101.70, 100.62, 101.11, 102.46, 102.36,…
## $ avg_hourly_pressure_station <dbl> 102.29, 102.77, 103.05, 102.25, 101.72, 101.89, 102.44, 102.60, 102.20, 99.90, 100.07, 102.03, 101.26, 99.72, 100.34, 102.05, 101.85, 1…
## $ avg_pressure_station <dbl> 102.30, 102.76, 103.01, 102.32, 101.75, 101.90, 102.33, 102.61, 102.07, 100.17, 99.99, 101.82, 101.20, 99.94, 100.22, 101.76, 101.72, 1…
## $ min_pressure_station <dbl> 101.89, 102.66, 102.78, 101.63, 101.59, 101.61, 102.01, 102.49, 101.48, 98.95, 98.98, 101.17, 100.70, 99.27, 99.32, 101.06, 101.09, 100…
## $ max_visibility <int> 16100, 24100, 48300, 48300, 32200, 32200, 32200, 40200, 48300, 48300, 24100, 32200, 40200, 32200, 48300, 48300, 48300, 40200, 40200, 40…
## $ avg_hourly_visibility <dbl> 6950.0, 10870.8, 31570.8, 34525.0, 17116.7, 24008.3, 24812.5, 30837.5, 23845.8, 36287.5, 10545.8, 19091.7, 19437.5, 29579.2, 26158.3, 3…
## $ avg_visibility <int> 8350, 12150, 33800, 36200, 19300, 18500, 22550, 20600, 26150, 28200, 12350, 16400, 22500, 20950, 26550, 36200, 36200, 20500, 20300, 321…
## $ min_visibility <int> 600, 200, 19300, 24100, 6400, 4800, 12900, 1000, 4000, 8100, 600, 600, 4800, 9700, 4800, 24100, 24100, 800, 400, 24100, 4800, 2400, 120…
## $ max_health_index <dbl> 2.0, 2.2, 1.8, 2.3, 2.4, 2.3, 2.8, 3.0, 2.9, 2.8, 2.5, 3.2, 2.9, 2.5, 2.4, 3.2, 4.1, 3.6, 3.4, 3.9, 2.8, 2.4, 2.3, 2.3, 2.9, 2.4, 3.8, …
## $ avg_hourly_health_index <dbl> 1.3, 1.5, 1.3, 1.7, 1.7, 1.8, 2.1, 2.0, 2.2, 2.2, 2.1, 2.2, 2.2, 2.2, 2.0, 2.3, 2.6, 2.5, 2.6, 2.6, 2.3, 2.0, 1.9, 1.9, 2.2, 1.9, 2.5, …
## $ avg_health_index <dbl> 1.5, 1.6, 1.4, 1.6, 1.8, 1.8, 2.2, 2.2, 2.2, 2.2, 2.1, 2.5, 2.4, 2.2, 2.0, 2.1, 3.1, 2.7, 2.7, 3.0, 2.4, 2.1, 2.0, 2.0, 2.3, 1.8, 2.8, …
## $ min_health_index <dbl> 1.0, 1.0, 1.0, 1.0, 1.2, 1.3, 1.7, 1.5, 1.6, 1.6, 1.8, 1.7, 1.8, 1.9, 1.7, 1.0, 2.1, 1.8, 2.0, 2.0, 2.0, 1.8, 1.7, 1.6, 1.7, 1.2, 1.7, …
## $ heatdegdays <dbl> 12.4, 12.8, 10.6, 10.5, 7.8, 9.7, 14.4, 15.4, 17.4, 15.1, 17.8, 19.3, 15.4, 14.5, 15.2, 22.2, 24.7, 22.4, 25.9, 26.6, 29.6, 26.2, 20.6,…
## $ cooldegdays <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ growdegdays_5 <dbl> 0.6, 0.2, 2.5, 2.5, 5.2, 3.4, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
## $ growdegdays_7 <dbl> 0.0, 0.0, 0.4, 0.5, 3.2, 1.4, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
## $ growdegdays_10 <dbl> 0.0, 0.0, 0.0, 0.0, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
## $ precipitation <dbl> 0.0, 0.2, 0.0, 0.2, 29.3, 17.7, 4.6, 0.0, 3.0, 7.6, 16.4, 0.9, 7.0, 7.6, 8.6, 0.0, 0.0, 4.7, 0.0, 0.0, 0.8, 3.0, 1.6, 8.7, 2.4, 3.4, 6.…
## $ rain <dbl> NA, 0.2, 0.0, 0.2, 29.3, 17.7, 4.6, 0.0, 3.0, 7.6, 6.8, NA, 5.6, 7.6, 4.9, 0.0, 0.0, NA, 0.0, 0.0, 0.0, 0.0, 0.0, 5.1, 2.4, 3.4, 6.5, 0…
## $ snow <dbl> NA, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.0, NA, 0.4, 0.0, 3.7, 0.0, 0.0, 2.0, 0.2, 0.0, 1.0, 4.0, 2.2, 6.9, 0.0, 0.0, 0.0, 0…
## $ snow_on_ground <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, 2, 12, 0, 0, 2, 7, NA, NA, 16, 3, 1, 1, 2, 0, 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ sunrise <chr> "08:01:00", "08:02:00", "08:03:00", "08:04:00", "08:04:00", "08:05:00", "08:05:00", "08:06:00", "08:06:00", "08:07:00", "08:07:00", "08…
## $ sunset <chr> "16:43:00", "16:42:00", "16:40:00", "16:39:00", "16:38:00", "16:36:00", "16:35:00", "16:34:00", "16:32:00", "16:31:00", "16:30:00", "16…
## $ daylight <dbl> 8.70, 8.67, 8.62, 8.58, 8.57, 8.52, 8.50, 8.47, 8.43, 8.40, 8.38, 8.37, 8.35, 8.32, 8.30, 8.28, 8.27, 8.25, 8.23, 8.23, 8.22, 8.22, 8.2…
## $ sunrise_f <dbl> 8.02, 8.03, 8.05, 8.07, 8.07, 8.08, 8.08, 8.10, 8.10, 8.12, 8.12, 8.12, 8.12, 8.13, 8.13, 8.13, 8.13, 8.13, 8.13, 8.12, 8.12, 8.12, 8.1…
## $ sunset_f <dbl> 16.72, 16.70, 16.67, 16.65, 16.63, 16.60, 16.58, 16.57, 16.53, 16.52, 16.50, 16.48, 16.47, 16.45, 16.43, 16.42, 16.40, 16.38, 16.37, 16…
## $ min_uv_forecast <int> 1, NA, 1, 1, NA, NA, NA, 1, NA, 1, NA, NA, NA, NA, NA, NA, 1, 1, 1, 1, 1, NA, NA, NA, NA, NA, NA, 1, 1, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ max_uv_forecast <int> 1, NA, 1, 1, NA, NA, NA, 1, NA, 1, NA, NA, NA, NA, NA, NA, 1, 1, 1, 1, 1, NA, NA, NA, NA, NA, NA, 1, 1, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ min_high_temperature_forecast <int> 7, 7, 9, 9, 10, 10, 8, 6, 2, 3, 3, 2, 3, 2, 5, -1, -3, -1, -6, -5, -9, -4, 0, 3, 4, 8, 1, 1, 5, 6, 2, 5, 6, 3, 3, 4, 6, 6, 5, 7, 6, 3, …
## $ max_high_temperature_forecast <int> 7, 8, 9, 10, 10, 10, 8, 6, 3, 3, 3, 2, 4, 4, 5, -1, -2, -1, -5, -4, -8, -4, 1, 3, 5, 8, 1, 1, 5, 6, 3, 6, 6, 5, 5, 5, 6, 6, 5, 7, 6, 3,…
## $ min_low_temperature_forecast <int> 4, 5, 6, 6, 9, 6, 7, 0, 2, -1, 4, -2, -2, 1, 3, 0, -9, -8, -8, -10, -9, -11, -6, -1, 0, 2, 2, -5, -3, 3, 1, -2, 2, 2, 1, 2, 1, 6, 1, 1,…
## $ max_low_temperature_forecast <int> 4, 5, 6, 6, 9, 9, 8, 0, 4, -1, 4, 0, -2, 1, 3, 0, -7, -7, -4, -10, -7, -10, -6, 0, 1, 3, 2, -4, -3, 3, 1, 1, 3, 2, 1, 2, 1, 6, 1, 2, 6,…
## $ solar_radiation <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ max_cloud_cover_4 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ avg_hourly_cloud_cover_4 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ avg_cloud_cover_4 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ min_cloud_cover_4 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ max_cloud_cover_8 <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 6, 5, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, …
## $ avg_hourly_cloud_cover_8 <dbl> 6.7, 6.3, 8.0, 6.2, 7.9, 8.0, 6.9, 2.6, 7.6, 6.5, 7.9, 7.3, 7.5, 7.8, 8.0, 7.9, 3.5, 4.6, 4.6, 5.3, 6.1, 6.7, 7.1, 7.5, 7.5, 7.9, 5.7, …
## $ avg_cloud_cover_8 <dbl> 5.0, 4.5, 8.0, 4.5, 7.5, 8.0, 4.0, 4.0, 7.0, 4.5, 6.5, 5.0, 5.5, 7.0, 8.0, 7.0, 4.5, 4.5, 4.0, 4.0, 4.0, 4.5, 5.0, 4.5, 6.0, 7.5, 4.0, …
## $ min_cloud_cover_8 <int> 2, 1, 8, 1, 7, 8, 0, 0, 6, 1, 5, 2, 3, 6, 8, 6, 1, 1, 0, 0, 0, 1, 2, 1, 4, 7, 0, 0, 0, 2, 2, 1, 7, 1, 7, 5, 2, 3, 2, 2, 8, 5, 0, 0, 0, …
## $ max_cloud_cover_10 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ avg_hourly_cloud_cover_10 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ avg_cloud_cover_10 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ min_cloud_cover_10 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
Note that since this data is from a csv file, functions like
?weather
will not provide us with information about the
data set. For the purposes of this lab, we assume that the variable
names are sufficiently informative to intuit their contents. In
practice, however, information about the data and its variables can be
found in a documentation file that is usually available from the source
of the data.
Our goal during exploratory data analysis (EDA) is to develop an understanding of our data. Real data is complicated, often containing missing data and irregular formatting.
In this lab, we will demonstrate this process with the question: “What is the monthly total rainfall in Vancouver over 2021?”
The rain variable

The variable rain
contains the daily rainfall measured
in millimeters. Our first task is to check for any missing data. We
could do this using the base R commands from Lab 0, but the code gets
messy for complicated manipulations. Instead, we will use the pipe
operator %>%
from the tidyverse
package
along with some data processing functions. The pipe operator
%>%
allows us to take the output from one function and
“pipe it in” to the argument of the next function. You will find that
data manipulations using the pipe operator are much more readable than
the equivalents in base R!
# Examples of pipe operator
#1. Pulling the first 5 rows of data containing date, rain, and average dew point
# Tidyverse way
weather %>%
select(date, rain, avg_dew_point) %>%
head(5)
## date rain avg_dew_point
## 1 2022-01-16 NA 5.3
## 2 2022-01-15 0.2 4.8
## 3 2022-01-14 0.0 6.3
## 4 2022-01-13 0.2 6.2
## 5 2022-01-12 29.3 9.2
# base R way
head(weather[c("date", "rain","avg_dew_point")], 5)
## date rain avg_dew_point
## 1 2022-01-16 NA 5.3
## 2 2022-01-15 0.2 4.8
## 3 2022-01-14 0.0 6.3
## 4 2022-01-13 0.2 6.2
## 5 2022-01-12 29.3 9.2
# head(weather[, c("date", "rain","avg_dew_point")], 5)
#2. Adding new columns of data to the existing weather data:
# Tidyverse way
weather <- weather %>%
mutate(new_variable_1 = 1.5 * avg_hourly_temperature,
new_variable_2 = rain + 10)
# base R way
weather$new_variable_1 <- 1.5 * weather$avg_hourly_temperature
weather$new_variable_2 <- weather$rain + 10
First we will check which dates have missing rainfall data:
# Base R way
weather$date[is.na(weather$rain)]
## [1] "2022-01-16" "2022-01-05" "2021-12-30" "2020-12-08" "2020-12-07" "2020-12-06" "2020-12-05" "2020-12-04" "2020-12-03" "2020-12-02" "2020-12-01"
# Tidyverse way
weather %>%
filter(is.na(rain)) %>%
# can also use select(date)
pull(date)
## [1] "2022-01-16" "2022-01-05" "2021-12-30" "2020-12-08" "2020-12-07" "2020-12-06" "2020-12-05" "2020-12-04" "2020-12-03" "2020-12-02" "2020-12-01"
# To show the days that DO NOT have missing rain values, add '!' in front of is.na()
# e.g. weather$date[!is.na(weather$rain)]
We see that there is one missing value of rainfall in the year 2021. This might be a cause for concern if this were a day with particularly high rainfall; unfortunately, we do not know if this was the case. The best we can do in these situations is to tell the reader what is missing from our data so that they are aware of the limitations of our analysis.
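As a quick, hedged check (a base R sketch; note that date is still a character string at this point), we can count the missing rain values by calendar year:
# Count missing `rain` values by year using the first four characters of `date`
table(substr(weather$date[is.na(weather$rain)], 1, 4))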
The date variable

Since we are concerned with total monthly rainfall, our next task is to convert the date variable into a form that is amenable to analysis. First we examine the format of date:
# str() compactly displays the structure of an object
# Good alternative to glimpse()
str(weather$date)
## chr [1:1000] "2022-01-16" "2022-01-15" "2022-01-14" "2022-01-13" "2022-01-12" "2022-01-11" "2022-01-10" "2022-01-09" "2022-01-08" "2022-01-07" "2022-01-06" "2022-01-05" ...
We find that each entry of date
is a character string of
the form “YYYY-MM-DD”. To prepare this data for use in
lubridate
, we convert date
from a character
vector to a Date object.
# First we convert date to a Date object
# To explain format, look at `?strptime`
# The base R way
weather$date <- as.Date(weather$date, format = "%Y-%m-%d")
# The tidyverse way using mutate()
# Make sure to reassign the dataset to ensure that the mutation step is saved
weather <- weather %>%
mutate(date = as.Date(date, format = "%Y-%m-%d"))
Now that we have a Date object, we can use functions from
lubridate
to easily convert date
into many
useful quantities. We use the functions year
,
month
, and day
to define the corresponding
variables in our data.
# Demonstrate some lubridate functions
# Extract weekday as character
weekdays(weather$date) %>% str()
## chr [1:1000] "Sunday" "Saturday" "Friday" "Thursday" "Wednesday" "Tuesday" "Monday" "Sunday" "Saturday" "Friday" "Thursday" "Wednesday" "Tuesday" "Monday" "Sunday" ...
# Extract month as integer
lubridate::month(weather$date) %>% str()
## num [1:1000] 1 1 1 1 1 1 1 1 1 1 ...
# Extract month as ordered factor
month(weather$date, label = TRUE) %>% str()
## Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 1 1 1 1 1 1 1 1 1 1 ...
# Extract the year as integer
year(weather$date) %>% str()
## num [1:1000] 2022 2022 2022 2022 2022 ...
# summarizing the pieces above
# Create the year, month, and day variables
weather <- weather %>%
mutate(year = year(date),
month = month(date, label = TRUE),
day = day(date))
Now that we have examined and processed our data, we can start to
answer our original question: “What is the monthly total rainfall in
Vancouver over 2021?” To begin, we might use filter
and
select
to inspect our data:
weather %>%
filter(year == 2021) %>%
select(year, month, day, rain) %>%
head(10)
## year month day rain
## 1 2021 Dec 31 0.0
## 2 2021 Dec 30 NA
## 3 2021 Dec 29 0.0
## 4 2021 Dec 28 0.0
## 5 2021 Dec 27 0.0
## 6 2021 Dec 26 0.0
## 7 2021 Dec 25 0.0
## 8 2021 Dec 24 5.1
## 9 2021 Dec 23 2.4
## 10 2021 Dec 22 3.4
This output, however, is not as helpful as a numerical summary of the
data. The functions group_by
and summarize
provide a useful method for summarizing the data: first we choose the
variable(s) to group by, and then we define the numerical summaries we
wish to compute. For example, we might be interested in the total and the
maximum rainfall each month.
# Assign to new dataframe to use for plotting
df <- weather %>%
filter(year == 2021) %>%
group_by(month) %>%
summarise(rain_tot = sum(rain, na.rm = T),
rain_max = max(rain, na.rm = T))
df
## # A tibble: 12 × 3
## month rain_tot rain_max
## <ord> <dbl> <dbl>
## 1 Jan 176. 29.5
## 2 Feb 75 19.2
## 3 Mar 34 6
## 4 Apr 38 13.5
## 5 May 30.2 8.2
## 6 Jun 37.2 12.5
## 7 Jul 0 0
## 8 Aug 36.7 22.1
## 9 Sep 155 50.9
## 10 Oct 148. 43.3
## 11 Nov 312. 52.5
## 12 Dec 93.5 31.3
These numerical summaries may be useful in some contexts, but a graph is generally better for examining these summaries. Thankfully, the results of the summary can easily be piped into ggplot to make such graphs!
df %>%
ggplot(aes(x = month, y = rain_tot)) +
geom_col(fill = "skyblue") +
labs(title = "Total monthly rainfall, Vancouver 2021",
x = "Month",
y = "Total rainfall (mm)") +
# Alternative labelling method instead of labs()
# ggtitle("Total monthly rainfall, Vancouver 2021") +
# xlab("Month") +
# ylab("Total rainfall (mm)") +
theme_bw()
Some quick observations:
In this R Lab, you will be guided by the TA to demonstrate the steps involved in estimator analysis.
* apply functions for looping
* gridExtra and the function grid.arrange to combine plots
* latex2exp to include relevant details in plot labels

In our work we often find ourselves simulating data from a
probability distribution. R provides built-in functions to simulate
from the most common distributions (e.g. rnorm
,
rbinom
, etc.), but we sometimes need to sample from a
distribution that is not included in base R.
Fortunately, the inverse CDF method provides an all-purpose method for simulating from continuous probability densities, and in the next section we will demonstrate how it can be used to sample from an arbitrary continuous density provided we know the CDF and its inverse.
Suppose \(X\) is a continuous random variable with distribution function \(F\). A realization of \(X\) can be simulated as follows:

1. Simulate \(U\) from a Uniform\((0,1)\) distribution.
2. Set \(X = F^{-1}(U)\); then \(X\) has distribution function \(F\).
To demonstrate this method, consider the following unusual density function, \[\begin{equation*} f(x; \alpha) = \begin{cases} \frac{\alpha}{\alpha - 1} \frac{1}{x^2}, & 1 < x \leq \alpha,\\ 0 & \text{otherwise} \end{cases} \end{equation*}\] This density is definitely not included in R, so it’s a perfect opportunity to use the inverse CDF method!
The first step will be to find the distribution function \(F\). By integration we find that \[\begin{equation*} F(x) = \begin{cases} 0 & x \leq 1 \\ \frac{\alpha}{\alpha - 1}\left(1 - \frac{1}{x}\right) & 1 < x \leq \alpha \\ 1 & \alpha < x. \end{cases} \end{equation*}\] Next we find the inverse CDF by setting \(u = F(x)\), \((u \in [0,1])\), and solving for \(x\): \[\begin{equation*} F^{-1}(u) = \left[1 - \frac{\alpha - 1}{\alpha}u\right]^{-1}, \quad (u \in [0,1]). \end{equation*}\]
With these expressions we can now program a function
rdens
to sample from the density \(f\) using the inverse CDF method.
# Define the inverse CDF
F_inv <- function(u, a){
(1-((a-1)/a)*u)^(-1)
}
# Define a function that performs the sampling
rdens <- function(n,a){
u <- runif(n) # u <- runif(n = n, min = 0, max = 1)
x <- F_inv(u, a)
return(x)
}
It is important to verify that our function rdens
behaves as expected. To check our function, we’ll plot a few samples of
size \(n = 5000\) from the function to
make sure the density looks correct.
library(tidyverse)
library(latex2exp)
# Plot the result
data.frame(x = rdens(n = 5000, a = 10)) %>%
ggplot(aes(x = x, y = ..density..)) +
geom_histogram(binwidth = 0.2, center = 0.1) +
ggtitle(label = TeX(r'(A sample from $f(x)$)'),
subtitle = TeX(r'($n = 5000, \alpha = 10$)')) +
theme_bw()
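As an extra check (a sketch, not part of the original lab), we can overlay the true density \(f(x; \alpha)\) on the histogram; the bars should follow the curve closely:
# Overlay the theoretical density f(x; a) = (a/(a-1))/x^2 on 1 < x <= a
a <- 10
curve_df <- data.frame(x = seq(1.001, a, length.out = 500))
curve_df$f <- (a / (a - 1)) / curve_df$x^2
data.frame(x = rdens(n = 5000, a = a)) %>%
  ggplot(aes(x = x)) +
  geom_histogram(aes(y = ..density..), binwidth = 0.2, center = 0.1) +
  geom_line(data = curve_df, aes(x = x, y = f), colour = "red") +
  theme_bw()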
Now that we can sample from our distribution, we might be interested
in the behaviour of certain estimators. Suppose we wish to estimate the
parameter \(\alpha\) using a sample
from the distribution. Since \(\alpha\)
controls the width of the distribution, we might try estimating it using
the maximum value of our sample. By the expression for the density, we
define our estimator to be \[
\hat \alpha = \max(X_1,...,X_n)
\] It is easy to compute this estimate for each sample we
generate using rdens
. All that is needed is to evaluate
max
on each sample. For generalizability, we will write a
wrapper function get_ahat
.
# Write wrapper function
get_ahat <- function(x){max(x)}
# Evaluate the estimator on a sample
ahat <- get_ahat(rdens(n = 100, a = 10))
cat(ahat)
## 9.0586
By re-running the above chunk we see that there is substantial variation in the results, and the estimates never quite reach the true value of \(10\). Thus it appears that our estimator is biased downwards; however, it is not easy to calculate this bias analytically due to the form of the estimator. In these cases, we typically turn to simulation methods to examine the estimator.
In this section, we will construct a simulation study to estimate the bias of the estimator when the sample size is \(n = 100\) and the true value of the parameter is \(\alpha = 10\). To do this, we will simulate 1000 independent samples of size 100 from the density and compute the estimate on each of the samples. From this we will obtain a sample from the sampling distribution of the estimator, which we can then use to estimate the bias and other quantities.
# Simulate 1000 datasets of size 100
sim <- replicate(n = 1000, rdens(n = 100, a = 10))
str(sim)
## num [1:100, 1:1000] 6.73 4.05 8.97 1.18 1.22 ...
# For each sample (each column), compute the statistic
ahat <- apply(sim, MARGIN = 2, FUN = get_ahat)
str(ahat)
## num [1:1000] 8.97 9.85 9.94 9.58 9.81 ...
Now that we have a sample of size \(1000\) from the sampling distribution of \(\hat\alpha\), we plot it to learn more about the distribution:
# Plot the sampling distribution of the estimator
data.frame(ahat) %>%
ggplot(aes(x = ahat, y = ..density..)) +
geom_histogram(binwidth = 0.2, center = 0.1) +
ggtitle(
label = TeX(r'(Sampling distribution for $\hat{\alpha}$)'),
subtitle = TeX(r'($n = 100, \alpha = 10$, (1000 replications))')) +
xlab(TeX(r'($\hat{\alpha}$)')) +
theme_bw()
From the above sampling distribution, we get a better idea of its sampling variation, bias, and skewness, which are difficult to determine from numeric summaries alone. To get a more precise estimate of the bias, we simply compute the mean of the sampling distribution and subtract the true value \(\alpha = 10\) from it.
# Estimated bias of the estimator
bias <- mean(ahat) - 10
cat(bias)
## -0.7583627
# 95% percentile interval of the sampling distribution
ci <- quantile(ahat, probs = c(0.025, 0.975))
cat(ci)
## 7.556135 9.975918
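Since the bias above is itself estimated from 1000 replications, a hedged extra step is to attach a Monte Carlo standard error to it:
# Monte Carlo standard error of the estimated bias
sd(ahat) / sqrt(length(ahat))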
A natural question once we have an estimator is whether or not it is consistent, that is, whether the estimator converges in probability to the true value as the sample size increases. Once again, this is typically difficult to prove analytically, so we might satisfy ourselves by examining the behaviour of the estimator via a simulation study. To do this, we simulate the sampling distribution of the estimator for sample sizes of \(n = 50\), \(100\), \(500\), and \(1000\) and plot the distributions.
# Function for simulations
sim_study <- function(B, n, a){
# Simulate B samples of size n
sim <- replicate(n = B, rdens(n = n, a = a))
ahat <- apply(sim, MARGIN = 2, FUN = get_ahat)
# Compute bias
bias <- mean(ahat) - a
# Plot
fig <- data.frame(ahat) %>%
ggplot(aes(x = ahat, y = ..density..)) +
geom_histogram(binwidth = 0.2, center = 0.1) +
scale_x_continuous(limits = c(5,10)) +
ggtitle(paste0("n =", n)) +
xlab(TeX(r'($\hat{\alpha}$)')) +
theme_bw()
return(list(fig = fig, bias = bias))
}
# Set parameters of simulation study
a <- 10
B <- 1000
n <- c(50, 100, 500, 1000)
# Set seed for reproducibility
set.seed(324)
# Run simulation for each value in n
result <- lapply(n, function(n) sim_study(B = B, n = n, a = a))
names(result) <- paste0("n_", n)
# Plot all four plots on a grid using grid.arrange
library(gridExtra)
grid.arrange(result$n_50$fig,
result$n_100$fig,
result$n_500$fig,
result$n_1000$fig)
From the above plots, what can you say about the estimator as the sample size increases? Does it appear consistent? Is the bias shrinking? How else does the distribution change?
We can also see our numeric estimates of the bias as follows:
# Print the bias in a data.frame
bias <- lapply(result, function(x) x$bias) %>% as.data.frame()
bias
## n_50 n_100 n_500 n_1000
## 1 -1.311561 -0.7815966 -0.1703174 -0.09099329
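A hedged sketch to visualize how the estimated bias shrinks as the sample size grows (using the n and bias objects defined above):
# Plot estimated bias against sample size
data.frame(n = n, bias = unlist(bias)) %>%
  ggplot(aes(x = n, y = bias)) +
  geom_point() +
  geom_line() +
  theme_bw()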
In this R Lab, you will be guided by your TA through a second example of constructing empirical bootstrapped confidence intervals (like we did in class this week). You will also learn how to produce your own Normal QQ plots as a means of verifying normality of data.
In this lab we will be considering laboratory values associated with diagnosing Type 2 diabetes, specifically glycated hemoglobin (A1C). Diabetes mellitus is one of the leading causes of death in the United States, and estimating population-level prevalence is therefore an area of interest in public health research. This data is a subset of the National Health and Nutrition Examination Survey (NHANES) for 2017-2018.
In this presentation we will be considering the following research question:
“What is the average value of A1C in the U.S. population in the period 2017-2018?”
Investigating this question provides us with a crude index for the prevalence of diabetes in the U.S. population.
Since our inferences are based on a small sample from the whole population, we will want to quantify the uncertainty in our estimates due to sampling variation. To this end, we will construct confidence intervals for our estimates using a bootstrap approach.
The first task is to load the NHANES data into memory using the
function read.csv
.
library(tidyverse)
library(latex2exp)
# Load the Data:
# It may be helpful to manually set your working directory using `setwd`
setwd("~/GitHub/dang-kevin.github.io/STA238")
diabetes <- read.csv("diabetes.csv")
# Get a summary of the variables in the nhanes data
# str(diabetes)
glimpse(diabetes)
## Rows: 135
## Columns: 3
## $ seqn <int> 100934, 99455, 99224, 99184, 97031, 98571, 96236, 101113, 95690, 102726, 94994, 101064, 94804, 98358, 95054, 99477, 101612, 100377, 101096, 99024, 94516, 102922…
## $ gh <dbl> 4.8, 5.2, 5.4, 5.6, 5.2, 5.0, 8.3, 5.8, 7.4, 4.9, 4.9, 5.4, 5.4, 5.9, 4.9, 5.6, 5.7, 5.5, 5.8, 5.0, 5.4, NA, 5.1, 5.7, 6.0, NA, 5.1, 5.7, 5.3, 5.6, NA, 5.0, 9.5…
## $ fpg <int> 79, 108, 103, 103, 98, 88, 107, 98, 106, 106, 90, 103, 101, 104, 112, 123, 96, 97, 109, 94, 88, NA, 92, 102, 108, NA, 97, 139, 96, 93, NA, 104, 234, 97, 164, 87…
In the above printout, seqn
is a subject ID number in
the database, gh
is the percentage of glycated hemoglobin
(A1C), and fpg
is fasting plasma glucose in mg/dL. The
latter two variables are lab measurements used for diagnosing Type II
diabetes, though our focus in the presentation portion of the lab will
be on the A1C values in gh
.
Before we get started on the analysis proper, it is a good idea to account for any missing values in the data.
# Print a summary of `gh` to check for missing data
summary(diabetes)
## seqn gh fpg
## Min. : 93732 Min. : 4.600 Min. : 74.0
## 1st Qu.: 96355 1st Qu.: 5.275 1st Qu.: 95.0
## Median : 99289 Median : 5.500 Median :103.0
## Mean : 98779 Mean : 5.791 Mean :111.9
## 3rd Qu.:101137 3rd Qu.: 5.800 3rd Qu.:112.0
## Max. :102935 Max. :10.900 Max. :313.0
## NA's :11 NA's :11
We see from the summary printout that there are 11 missing values. For the purposes of this lab we simply note this fact and remove them from the data.
# For this analysis, we remove missing values
# You learned other methods of doing this in R Lab #1
diabetes <- diabetes[complete.cases(diabetes),]
# diabetes <- na.omit(diabetes)
Recall that our goal is to estimate the mean A1C of the U.S. population from 2017-2018.
The estimate based on our sample is:
# Compute the mean A1C in our sample
x_bar <- mean(diabetes$gh)
x_bar
## [1] 5.791129
We would, however, like to include a confidence interval for this estimate to reflect the uncertainty due to sampling variability. In the next sections we will demonstrate how to use a bootstrap approach to construct this confidence interval.
The first component of any bootstrap approach is deciding how the bootstrapped datasets will be simulated; will we use a parametric or empirical approach? To help us decide, we should first examine the distribution of our data to see if it resembles a parametric density or not.
# Histogram of `gh`
diabetes %>%
ggplot(aes(x = gh)) +
geom_histogram(aes(y=..density..),
binwidth = 0.2,
fill = 'thistle2',
colour = 'black') +
labs(title = "Sample distribution of A1C",
x = "A1C",
y = "Density")
On inspection, this distribution looks positively skewed, suggesting that a parametric bootstrap based on a normal sampling distribution would be inappropriate. To get a closer look, we can plot the quantiles of the sample against those of a normal distribution in a Q-Q plot.
# QQ-plot to assess normality of `gh`
diabetes %>%
ggplot(aes(sample = gh)) +
geom_qq() +
geom_qq_line() +
labs(title = "Normal Q-Q plot",
subtitle = "Data: A1C",
x = TeX(r'($N(0,1)$ Quantiles)'),
y = "A1C Quantiles")
The Q-Q plot confirms our suspicions that the data is non-normally distributed. Indeed, about 16% of A1C values reside in the heavy right tail of the sample. (To see this, note that the points that leave the normal line fall above the \(N(0,1)\) quantile value of \(Z \approx 1\), which corresponds to \(P(Z \geq 1) \approx 0.16\).)
Because of this non-normality, we opt for an empirical bootstrap approach to construct confidence intervals for the mean A1C.
To construct the confidence intervals, we will use what are called studentized bootstrap confidence intervals. This involves computing the studentized mean \[ t^* = \frac{\bar x_n^* - \bar x_n}{s^*_n/\sqrt n} \] for each bootstrap sample, where \(\bar x_n^*\) and \(s^*_n\) are the bootstrap mean and standard deviation, respectively, and \(\bar x_n\) is the sample mean of the original data. The confidence interval for the population mean \(\mu\) is then \[ \left(\bar x_n - c_u^*\frac{s_n}{\sqrt{n}}, \bar x_n - c_l^*\frac{s_n}{\sqrt{n}}\right), \] where \(c_l^*\) and \(c_u^*\) are the lower and upper critical values of the sampling distribution of \(t^*\) that give the desired significance level. In this lab, we will construct a \(95\%\) confidence interval by taking \(c_l^*\) and \(c_u^*\) to be the \(2.5\%\) and \(97.5\%\) quantiles of the sampling distribution, respectively.
## Using For-Loop
# Define the required inputs:
mean.sample <- mean(diabetes$gh) # sample mean of original data
sd.sample <- sd(diabetes$gh) #sample sd of original data
B <- 2000
n.sample <- length(diabetes$gh)
sim.t <- c() #empty vector to store studentized means
## Bootstrapping Step ##
for (i in 1:B){
boot.sample <- sample(diabetes$gh, n.sample, replace = T)
boot.mean <- mean(boot.sample)
boot.sd <- sd(boot.sample)
# Compute studentized mean and store it into sim.t
sim.t[i] <- (boot.mean-x_bar)/(boot.sd/sqrt(n.sample))
}
## Find the Critical Values ##
(crit.t <- quantile(sim.t, probs = c(0.975, 0.025)))
## 97.5% 2.5%
## 1.676319 -2.351053
# Compute Bootstrapped CI
(ci.student <- mean.sample - crit.t*sd.sample/sqrt(n.sample))
## 97.5% 2.5%
## 5.639214 6.004191
If we performed this analysis many times on new samples of people, the true population mean A1C would be contained in intervals calculated by the above method in \(95\%\) of replications. This allows us to interpret the calculated interval: based on our data, we are 95% confident that the average A1C in the population could be as low as 5.6392143 or as high as 6.0041909. (The interpretation of confidence intervals can be tricky!)
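For comparison (a sketch, not part of the original analysis), the corresponding normal-theory interval for the mean A1C could be obtained with t.test():
# t-based 95% confidence interval for the mean A1C
t.test(diabetes$gh)$conf.int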
In this R Lab, you will be guided by your TA on how to adapt
bootstrapping and simulation to construct confidence intervals and
perform hypothesis tests on two-sample data when normality assumptions
fail. At this point, it is assumed that you will have decent familiarity
with creating histograms using ggplot
to assess distribution
shape.
* filter in dplyr (a package already contained in tidyverse) to filter data in a dataframe according to some criteria

The data we will be working with today is from a study reported in A Longitudinal Study of the Development of Elementary School Children’s Private Speech. In this study, researchers recorded instances where children were engaged in private speech during a randomly chosen 10 second interval. Private speech refers to speech spoken aloud that is either addressed to oneself or to no particular listener. Many such intervals were observed for each child, and the percentage of intervals where private speech occurred for each child was recorded. One of the outcomes of interest for this study, and the focus of our tutorial, is the following research question:
Does the amount of private speech differ between boys and girls?
To put this question into a mathematical context, let \((X_1,...,X_m)\) and \((Y_1,...,Y_n)\) denote the percentages of private speech observed for boys and girls, respectively. Furthermore, assume that each observation \(X_i\) comes from a population with mean \(\mu_1\) and each \(Y_j\) from a population with mean \(\mu_2\). This problem can then be viewed as making inferences about the difference in means \(\mu_1 - \mu_2\), or determining whether boys and girls differ in their engagement in private speech.
To begin exploring this question, let us first load in the study data and the relevant libraries for our analysis.
library(tidyverse)
library(latex2exp)
library(gridExtra)
# Load in the speech data
# Remember to set your working directory and ensure data files are stored
# in the same location
setwd("~/GitHub/dang-kevin.github.io/STA238")
speech <- read.csv("speech.csv")
glimpse(speech)
## Rows: 33
## Columns: 3
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33
## $ gender <chr> "male", "male", "male", "male", "male", "male", "male", "male", "male", "male", "male", "male", "male", "male", "male", "male", "male", "male", "female", …
## $ percentage <dbl> 4.9, 5.5, 6.5, 0.0, 0.0, 3.0, 2.8, 6.4, 1.0, 0.9, 0.0, 28.1, 8.7, 1.6, 5.1, 17.0, 4.7, 28.1, 0.0, 1.3, 2.2, 0.0, 1.3, 0.0, 0.0, 0.0, 0.0, 3.9, 0.0, 10.1, …
Next we will plot the data for boys and girls separately to get an idea of their distribution. For the purposes of this analysis, the labels boy and girl refer to a child being labeled as male or female in the data set.
###########################################
### Remove the eval=F from the R chunk! ###
###########################################
# Plot the samples for boys and girls using gridExtra
# Density histograms to compare the density histogram of the two samples
plot.boys <- speech %>%
filter(gender == "male") %>%
ggplot(aes(x = percentage, y =..density..)) + #density histogram
geom_histogram(binwidth = 0.5, center = 0.25, #bin aesthetics
color = "black",
fill = "thistle2") +
labs(title = "Private speech: Boys",
x = "Percentage",
y = "Density") +
xlim(c(0,30)) #common range
plot.girls <- speech %>%
filter(gender == "female") %>%
ggplot(aes(x = percentage, y =..density..)) +
geom_histogram(binwidth = 0.5, center = 0.25,
color = "black",
fill ="thistle2") +
labs(title = "Private speech: Girls",
x = "Percentage",
y = "Density") +
xlim(c(0,30))
grid.arrange(plot.boys, plot.girls)
From these plots it is clear that the data is not normal, and with the limited sample size, any confidence intervals or hypothesis tests based on the assumption of normality will be misleading. In such situations, we typically relax the assumption of normal data and turn to non-parametric approaches to form inferences about the quantities of interest.
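To complement the plots (a hedged sketch), we can also compute numeric summaries of the two groups:
# Sample size, mean, and standard deviation of `percentage` by gender
speech %>%
  group_by(gender) %>%
  summarise(n = n(), mean = mean(percentage), sd = sd(percentage))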
In the following sections, we will demonstrate how confidence intervals can be obtained using a bootstrap approach and how the so-called permutation test can be used to perform hypothesis testing with minimal distributional assumptions.
In this first section we will demonstrate how to construct a confidence interval for the difference in means \(\mu_1 - \mu_2\) using a bootstrap approach. The general approach is very similar to that of the one-sample problem:
Draw a bootstrap sample \((X_1^*,...,X_m^*)\) with replacement from the first group \((X_1,...,X_m)\)
Draw a bootstrap sample \((Y_1^*,...,Y_n^*)\) with replacement from the second group \((Y_1,...,Y_n)\)
Compute the difference in means \(\bar X^* - \bar Y^*\) for the bootstrap samples
Repeat this process many times to obtain the bootstrapped sampling distribution
Construct a bootstrap confidence interval \((C_{lower}^*, C_{upper}^*)\) from the sampling distribution.
In this presentation, we will be using the percentile method to construct the \((1-\alpha)\)-level confidence intervals in the final step, where \(C_{lower}^*\) and \(C_{upper}^*\) are taken to be the \(\alpha/2\) and \(1 - \alpha/2\) sample quantiles of the bootstrap sampling distribution, respectively. This method is simple, and makes few assumptions about the shape of the bootstrap sampling distribution, though the resulting interval is typically biased. To see another approach similar to the studentized approach from the previous tutorial, see the discussion in section 10.6 of the textbook.
Let’s use this recipe to construct a bootstrapped confidence interval in R:
###########################################
### Remove the eval=F from the R chunk! ###
###########################################
# Create vectors containing the observed data for each group
# obs.boys <- speech$percentage[speech$gender == "male"]
# obs.girls <- speech$percentage[speech$gender == "female"]
# Index of data entries that are "male"
index.boys <- speech$gender == "male"
# Percentage entries for the same index that were identified as "male"
obs.boys <- speech$percentage[index.boys]
# Percentage entries for the indices that are NOT male
obs.girls <- speech$percentage[!index.boys]
# Define the bootstrap experiment
set.seed(431)
B <- 5000
boot.mean.diff <- c()
for(b in 1:B){
# Bootstrap sample for each gender
boot.boys <- sample(obs.boys, replace = T)
boot.girls <- sample(obs.girls, replace = T)
# Compute difference in bootstrap means
boot.mean.diff[b] <- mean(boot.boys) - mean(boot.girls)
}
The 95% bootstrap confidence interval for the difference in means is then
###########################################
### Remove the eval=F from the R chunk! ###
###########################################
# Compute the 95% percentile confidence interval
ci.mean.diff <- quantile(boot.mean.diff, probs = c(0.025, 0.975))
ci.mean.diff
## 2.5% 97.5%
## 1.289917 9.473417
The corresponding confidence interval based on the assumption of normal data can be calculated to be \((1.29, 9.47)\), which has a smaller lower limit compared to our bootstrapped interval. Given that our data is definitely not normal, we have more confidence in the validity of the bootstrap-based interval.
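A sketch of how the normal-theory interval mentioned above could be computed, assuming the obs.boys and obs.girls vectors defined earlier:
# Welch two-sample t-based 95% CI for the difference in means
t.test(obs.boys, obs.girls)$conf.int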
Another common approach to studying the difference in means between two groups is the hypothesis testing approach. Here, we consider the following null and alternative hypotheses: \[H_0: \mu_1 = \mu_2 \quad \text{versus} \quad H_1: \mu_1 \neq \mu_2.\]
Since our data is not normally distributed, we can no longer use the two-sample \(t-\)test approach, where the null distribution of the sample statistic is a known \(t-\)distribution.
If we are comfortable assuming that the only possible difference in the distribution of the two samples is in their means \(\mu_1, \mu_2\), then we can use what is called the permutation test to perform this hypothesis test.
Under the null hypothesis \(\mu_1 = \mu_2\), where both sampling distributions are the same, every observation could equally well belong to either group and so the group labels have no effect on the outcome. Since the group labels are exchangeable, we can randomly assign them to the observed values and we will obtain what looks like a new observation from the null distribution with the same probability as the original sample. By comparing our observed difference \(\bar X - \bar Y\) to the difference measured on these new permuted data, we can get a sense of how unlikely it is to observe such a difference under the null hypothesis.
Following this reasoning, the recipe for the permutation test is as follows:
Compute the test statistic \(T_{obs} = \bar X - \bar Y\) on the original data (this test statistic is labeled T, but does not imply that it has a T-distribution).
Create a permuted data set by randomly assigning the group labels to the observed values and compute the test statistic \(T^* = \bar X^* - \bar Y^*\).
Repeat this process for every permutation of the data (or a very large number of permutations)
Compute \(p-\)value of \(T_{obs}\) by computing the proportion of permuted test statistics \(T^*\) that are at least as “extreme” as \(T_{obs}\).
We implement the permutation test for our data as follows:
# Compute the observed difference between genders
obs.mean.boys <- mean(speech$percentage[index.boys])
obs.mean.girls <- mean(speech$percentage[!index.boys])
obs.mean.diff <- obs.mean.boys - obs.mean.girls
# obs.mean.diff <- mean(obs.boys) - mean(obs.girls)
# Permutation test
set.seed(134)
B <- 5000
perm.mean.diff <- c()
for(b in 1:B){
# Generate new permutation of data
perm <- sample(speech$percentage, replace = F)
# Compute the mean for each group for the permutation
perm.mean.boys <- mean(perm[index.boys])
perm.mean.girls <- mean(perm[!index.boys])
# Compute the difference in means for the permutation
perm.mean.diff[b] <- perm.mean.boys - perm.mean.girls
}
The \(p\)-value for our particular observed value \(T_{obs}\) is then
# Compute p-value
mean(abs(perm.mean.diff) >= abs(obs.mean.diff))
## [1] 0.0302
where we have defined “extreme” to mean \(|T^*|\geq |T_{obs}|\), which is a two-tailed test. Therefore we reject the null hypothesis that \(\mu_1 = \mu_2\) at the \(5\%\) level.
The assumption that both distributions are the same in every possible way except perhaps the mean is a very strong assumption that should typically be checked. In practice we usually require this to be approximately true by checking that the sample variances are roughly equal. If we check this for the speech data set, we find that it is unlikely to be satisfied, and we may want to use another approach for testing this hypothesis.
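A quick, hedged way to check this assumption in R is to compare the sample variances of the two groups:
# Compare sample variances of `percentage` between genders
tapply(speech$percentage, speech$gender, var)
# Equivalently: var(obs.boys); var(obs.girls)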