Hypothesis testing

The t-test

set.seed(20221011) # Seed for reproducibility
n <- 100 # Sample size

mu1 <- 0 # Mean of sample 1
mu2 <- 0.1 # Mean of sample 2

sd1 <- 1 # Standard deviation 1
sd2 <- 1 # Standard deviation 2

# Generate two samples of size n of normally-distributed data.
# Sample x1 has a mean of mu1 and standard deviation sd1
# Sample x2 has a mean of mu2 and standard deviation sd2
x1 <- rnorm(n, mu1, sd1)
x2 <- rnorm(n, mu2, sd2)

There are many ways to look at our data. Two examples are:

1. Density plots
2. Box plots

# First look at your data
# We draw density plots to see if our data are different
# Overlay the two plots
plot(density(x1), xlim=c(-5,5), ylim=c(0,0.5), col = 1, lty = 2, main = "Comparing two population means")
abline(v=mu1, col=1, lty = 2)
lines(density(x2), col = 3, lty = 4)
abline(v=mu2, col=3, lty = 4)

# The formula interface to boxplot() takes a data frame
# We'll put our data into a data frame with one column holding the values and another holding the group label
df <- data.frame(c(x1,x2),
                    c(rep(1,length(x1)), rep(2,length(x2))))
colnames(df) <- c("value","group")
boxplot(value ~ group, data = df)

# Now run a t-test comparing the two means. 
# Do you reject your null hypothesis?
# How does it change if you change n or mu1 or mu2?

# The t.test() function has several parameters
# If we put in two vectors, we can compare them
t.test(x1, x2) 
## 
##  Welch Two Sample t-test
## 
## data:  x1 and x2
## t = -0.79626, df = 196, p-value = 0.4268
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3866333  0.1642225
## sample estimates:
##   mean of x   mean of y 
## -0.02866188  0.08254352
# We gain degrees of freedom with an equal-variance t-test (though it makes little difference once n is around 30+)
# ve is a logical value: TRUE if the standard deviations used to generate our two samples are equal, FALSE otherwise
# Try putting sd1 == sd2 into the console and seeing what it returns!
ve <- sd1 == sd2

t.test(x1,x2,var.equal = ve)
## 
##  Two Sample t-test
## 
## data:  x1 and x2
## t = -0.79626, df = 198, p-value = 0.4268
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3866161  0.1642053
## sample estimates:
##   mean of x   mean of y 
## -0.02866188  0.08254352
# For a one-sample t-test of whether the mean differs from zero, just pass one vector:
t.test(x2)
## 
##  One Sample t-test
## 
## data:  x2
## t = 0.88153, df = 99, p-value = 0.3802
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.1032517  0.2683387
## sample estimates:
##  mean of x 
## 0.08254352
t.test(x2, alternative = "greater") # for a one-sided test, set alternative = "greater"
## 
##  One Sample t-test
## 
## data:  x2
## t = 0.88153, df = 99, p-value = 0.1901
## alternative hypothesis: true mean is greater than 0
## 95 percent confidence interval:
##  -0.07292973         Inf
## sample estimates:
##  mean of x 
## 0.08254352
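
Notice that the two-sided p-value above (0.3802) is exactly twice the one-sided one (0.1901). That is no accident: when the t statistic is positive, the "greater" one-sided p-value is half the two-sided p-value. A minimal sketch with made-up data:

```r
# Made-up data with a clearly positive mean, so the t statistic is positive
x <- c(0.5, 1.2, -0.3, 0.8, 0.1, 0.9, -0.2, 0.6)

two_sided <- t.test(x)$p.value
one_sided <- t.test(x, alternative = "greater")$p.value

all.equal(one_sided, two_sided / 2)  # TRUE here, since mean(x) > 0
```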

We can retrieve relevant values by storing the test result in an R object and using $ to extract its components.

t_test_results <- t.test(x1, x2)

names(t_test_results) # This will tell you what is stored in the t_test_results object
## [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
## [6] "null.value"  "alternative" "method"      "data.name"
t_test_results$statistic # For example, this extracts the t-statistic
##          t 
## -0.7962618
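
Other components listed by names() can be pulled out the same way. A short sketch (regenerating the same samples with the same seed, so the numbers match the output above):

```r
set.seed(20221011)            # same seed as above
x1 <- rnorm(100, 0, 1)
x2 <- rnorm(100, 0.1, 1)
res <- t.test(x1, x2)

res$conf.int                      # the 95% CI for the difference in means
attr(res$conf.int, "conf.level")  # its confidence level (0.95 by default)
res$method                        # which test was run
```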

t-test exercises

As an exercise, extract the t-statistic and p-value you calculated above using $

#t_test_results$

Now run the t-test we did above, but make the sample size much larger (say, n=1000 or n=10000). Do you notice a difference in the result of the t-test?

#set.seed(20221011) # Seed for reproducibility
#n <-  # Sample size (CHANGE THIS)

#mu1 <- 0 # Mean of sample 1
#mu2 <- 0.1 # Mean of sample 2

#sd1 <- 1 # Standard deviation 1
#sd2 <- 1 # Standard deviation 2

#ve <- sd1 == sd2

# Generate two samples of size n of normally-distributed data.
# Sample x1 has a mean of mu1 and standard deviation sd1
# Sample x2 has a mean of mu2 and standard deviation sd2
#x1 <- rnorm(n, mu1, sd1)
#x2 <- rnorm(n, mu2, sd2)

#t.test(x1,x2,var.equal = ve)
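
One way to see why the larger sample matters is base R's power.t.test(), which computes the power of a two-sample t-test to detect a given mean difference. A sketch for the delta = 0.1, sd = 1 setting used above:

```r
# Power to detect a mean difference of 0.1 (sd = 1) at each per-group n
power.t.test(n = 100,   delta = 0.1, sd = 1)$power  # low: most such samples look "not significant"
power.t.test(n = 10000, delta = 0.1, sd = 1)$power  # near 1: almost always significant
```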

Try the t-test again. Make the sample size smaller (say, n=20), but change the means.

#set.seed(20221011) # Seed for reproducibility
#n <- 20 # Sample size

#mu1 <-  # Mean of sample 1 (CHANGE THIS)
#mu2 <-  # Mean of sample 2 (CHANGE THIS)

#sd1 <- 1 # Standard deviation 1
#sd2 <- 1 # Standard deviation 2

#ve <- sd1 == sd2

# Generate two samples of size n of normally-distributed data.
# Sample x1 has a mean of mu1 and standard deviation sd1
# Sample x2 has a mean of mu2 and standard deviation sd2
#x1 <- rnorm(n, mu1, sd1)
#x2 <- rnorm(n, mu2, sd2)

#t.test(x1, x2, var.equal = ve)

ANOVA and the F-test

What if the data have more than two groups? This leads naturally into Analysis of Variance (ANOVA).

We will create three “populations”, each with its own measurements. The three groups will have different means, so the ANOVA is likely to give a statistically significant result.

set.seed(20221011)

# Our artificial data will have three different means
m1 <- rnorm(30, mean = 25, sd = 5)
m2 <- rnorm(30, mean = 30, sd = 6)
m3 <- rnorm(30, mean = 22, sd = 7)

ms <- c(m1, m2, m3) # Combine our three measures into one vector

# Our vector of group labels A, B, and C
ls <- as.factor(c(rep("A",30),rep("B",30),rep("C",30)))

# Make it into a data frame
df_anova <- data.frame(ms, ls)
colnames(df_anova) <- c("measure","group")

First, we should visualize our data. We can use a boxplot to compare all the populations against each other.

boxplot(measure ~ group, data = df_anova)

These groups certainly look different, but we should run an ANOVA just in case. We’ll use the aov command. The command takes the formula y ~ x as an input, where y is the variable we are measuring and x is the grouping.

# Run the ANOVA
results_aov <- aov(measure ~ group, data = df_anova)

# Call the results of the ANOVA
print("---Printing results of ANOVA---")
## [1] "---Printing results of ANOVA---"
results_aov
## Call:
##    aov(formula = measure ~ group, data = df_anova)
## 
## Terms:
##                    group Residuals
## Sum of Squares  1332.216  3485.576
## Deg. of Freedom        2        87
## 
## Residual standard error: 6.32962
## Estimated effects may be unbalanced
# Print the summary
print("---Printing summary---")
## [1] "---Printing summary---"
summary(results_aov)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## group        2   1332   666.1   16.63 7.67e-07 ***
## Residuals   87   3486    40.1                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We now have a significant result. What else can we say? Let’s use $ to examine the ANOVA object:

results_aov$coefficients
## (Intercept)      groupB      groupC 
##   25.382285    4.521143   -4.900450

These coefficients are expressed relative to Group A: the intercept is Group A's mean, and groupB and groupC are the differences between those groups' means and Group A's.
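
To make that concrete, here is a sketch verifying that the intercept equals Group A's mean and that the other coefficients are differences from it (regenerating the same data with the same seed):

```r
set.seed(20221011)
m1 <- rnorm(30, mean = 25, sd = 5)
m2 <- rnorm(30, mean = 30, sd = 6)
m3 <- rnorm(30, mean = 22, sd = 7)
df_anova <- data.frame(measure = c(m1, m2, m3),
                       group = factor(rep(c("A", "B", "C"), each = 30)))

fit <- aov(measure ~ group, data = df_anova)
group_means <- tapply(df_anova$measure, df_anova$group, mean)

# (Intercept) is the mean of the reference group, A
all.equal(unname(coef(fit)["(Intercept)"]), unname(group_means["A"]))
# groupB is the difference between B's mean and A's mean
all.equal(unname(coef(fit)["groupB"]),
          unname(group_means["B"] - group_means["A"]))
```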

We can also use plot to see what our residuals look like.

plot(results_aov)
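
plot() on an aov object actually produces four diagnostic plots in sequence (residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage). A sketch showing all four at once in a 2x2 panel (regenerating the same data with the same seed):

```r
set.seed(20221011)
df_anova <- data.frame(
  measure = c(rnorm(30, 25, 5), rnorm(30, 30, 6), rnorm(30, 22, 7)),
  group   = factor(rep(c("A", "B", "C"), each = 30))
)
results_aov <- aov(measure ~ group, data = df_anova)

op <- par(mfrow = c(2, 2))  # 2x2 grid so all four plots appear together
plot(results_aov)
par(op)                     # restore the previous plot layout
```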


ANOVA exercises

Load the base R dataset PlantGrowth using the command data("PlantGrowth"). Visualize the groups using a boxplot and run an ANOVA on them. Extract the coefficients using $.

data("PlantGrowth") # Load the data (it may take a second)

Run the TukeyHSD() command on the ANOVA you just created. Which pairs of groups differ?

# TukeyHSD(...)
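
So the PlantGrowth answer stays open, here is TukeyHSD() run instead on the artificial three-group data from earlier; it reports every pairwise difference in means with a confidence interval and an adjusted p-value:

```r
set.seed(20221011)
df_anova <- data.frame(
  measure = c(rnorm(30, 25, 5), rnorm(30, 30, 6), rnorm(30, 22, 7)),
  group   = factor(rep(c("A", "B", "C"), each = 30))
)

tk <- TukeyHSD(aov(measure ~ group, data = df_anova))
tk$group  # one row per pair of groups: diff, lwr, upr, p adj
```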

For two- and three-way ANOVA models, there are good examples at: https://www.datanovia.com/en/lessons/anova-in-r/. They require the datarium package.