I usually include this code at the top of scripts so that R doesn’t automatically read text as factors.
The following data tables are available for analysis:
Table Descriptions
Table locations
Table name | Google sheet key | GitHub filename |
---|---|---|
sites | 1LJj8BAngXFjBTvpOJv_0WXV_OwTXtEHk00p_WKsMuJ8 | BIO46_2018_sample_sites.csv |
lichens | 11d3-k-3OaIOfm8EfneUiNGiumfPBdMBHMViaTbVtzwY | lichens.csv |
fungal_isolates | 16ku-VRUHMZHz5XgHtD2ESDhdem401sBnbbPEEVeYtrg | |
infection_prevalence | 1iJ1bM4qGkL8M068wJAr52BU801hHo6HldQClU75FCMc | infection_prevalence.csv |
interaction_experiment_byplate | 1ty7Wz0oiEIFJcfSCy8FFIVd3eyiwI0iMc0AyGVdFrpY | interaction_experiment_byplate.csv |
interaction_experiment_bycompetitor | 16YxUnW2o1WfYYab5IUNTzxFVcrKVvwerbR7pGLJzHZQ | interaction_experiment_bycompetitor.csv |
env_loggers_summary | 1ktSM1PYOnazM663Fg_bdTYRPA8yUlHH2VYFriHFleEo | env_loggers_summary.csv |
infection_experiment | 1RwIhDQNeAWcLI9ZlrlpekDWQ0cjPM_UkBgb5FFBDiM8 | infection_experiment.csv |
taxa | 1pbttwIAHZ8jijHdB9rvwkdxkh0rkLfKoegm07PxjB9Y | taxa.csv |
Three ways to load these data:
1) From the Google Drive sheets
# Load package
library(googlesheets)
# Authorize googlesheets to access you Google Drive
gs_auth()
# Register the lichens data table
lichens_gs <- gs_key("11d3-k-3OaIOfm8EfneUiNGiumfPBdMBHMViaTbVtzwY")
# Read the lichens_gs googlesheet into R
lichens <- gs_read(lichens_gs)
2) From the GitHub repository
# Define the URL of the data repository
github_url <- "https://raw.githubusercontent.com/jescoyle/BIO46/master/2018/Data/"
# Read in the csv file using the url and the specific file name
# The following reads in three different tables
interactions <- read.csv(file.path(github_url, "interaction_experiment_bycompetitor.csv"))
sites <- read.csv(file.path(github_url, "BIO46_2018_sample_sites.csv"))
infection <- read.csv(file.path(github_url, "infection_prevalence.csv"))
3) From a file on your computer
First download the files to your BIO46 folder.
# Set the working directory
setwd("C:/Users/your_username/Documents/BIO46")
# Read in a csv file that you downloaded to that directory
read.csv("lichens.csv")
Functions for manipulating dataframes are in the dplyr
and tidyr
packages. Be sure to load them once before using these functions.
# Load the packages
library(dplyr)
library(tidyr)
Use select
to choose specific columns and filter
to choose specific rows based on the values in the columns. In the code below, the resulting dataframe is saved to an object named shrub_inf
that can be used later.
# Show the site and lichen number of lichens on shrubs that have more than 0 ascomata (i.e. infected)
shrub_inf <- lichens %>%
filter(Ascomata > 0 & Substrate_type == "shrub") %>%
select(LichenID, Site, Ascomata)
shrub_inf
# A tibble: 5 x 3
LichenID Site Ascomata
<chr> <int> <int>
1 G2-2 106 4
2 G2-3 106 3
3 G3-6 108 2
4 G4-1 51 3
5 G4-2 51 6
Summarize values in columnes with the summarise
function. Use group_by
to specify the columns used to group data before calculating summary statistics. In the code examples below, the resulting dataframes are saved to an objects named substrate_summary
and mean_growth
that can be used later.
# For lichens on different substrates,
# Count the number of lichens found
# and calculate mean and maximum number of ascomata
substrate_summary <- lichens %>%
group_by(Substrate_type) %>%
summarise(Num_lichens = n(),
Ascomata = mean(Ascomata),
Ascomata_max = max(Ascomata)
)
substrate_summary
# A tibble: 2 x 4
Substrate_type Num_lichens Ascomata Ascomata_max
<chr> <int> <dbl> <dbl>
1 shrub 9 2.000000 2.000000
2 tree 23 2.043478 2.043478
# Calculate the mean radial growth of each isolate growing solo or in co-culture
mean_growth <- interactions %>%
group_by(IsolateID, Culture_type) %>%
summarise(Radius_mm_day7 = mean(Radius_mm_day7),
Radius_mm_day14 = mean(Radius_mm_day14)
)
mean_growth
# A tibble: 17 x 4
# Groups: IsolateID [?]
IsolateID Culture_type Radius_mm_day7 Radius_mm_day14
<chr> <chr> <dbl> <dbl>
1 S1-B coculture 7.333333 NA
2 S1-B solo 11.000000 13.50000
3 S2-B coculture 7.666667 NA
4 S2-B solo 5.350000 10.50000
5 T15-L1-F11 solo 20.000000 35.00000
6 T15-L1-F15 coculture 27.500000 NA
7 T15-L1-F15 solo 42.666667 56.33333
8 T15-L1-F18 solo 12.500000 20.00000
9 T15-L1-F22 coculture 18.750000 19.08333
10 T15-L1-F22 solo 55.000000 55.00000
11 T24-L1-F11 solo 11.500000 21.00000
12 T24-L1-F13 solo 10.000000 20.00000
13 T25-L2-F12 solo 7.000000 27.00000
14 T26-L3-F23 solo 14.000000 28.00000
15 T26-L3-F7 solo 9.000000 20.00000
16 T32-L3-F17 solo 14.375000 24.50000
17 T54-L3-F4 solo 16.500000 27.50000
To combine multiple columns into the same column, use the gather
function. key =
is the name of a new column that will identify which column the data originally came from. value =
is the name of a new column that will contain the values that were originally in separate columns.
# Put the radius measurements from day 7 and day 14 into a single column (named Radius_mm)
mean_growth_long <- mean_growth %>%
gather(key = Measurement_day, value = Radius_mm, Radius_mm_day7, Radius_mm_day14)
mean_growth_long
# A tibble: 34 x 4
# Groups: IsolateID [13]
IsolateID Culture_type Measurement_day Radius_mm
<chr> <chr> <chr> <dbl>
1 S1-B coculture Radius_mm_day7 7.333333
2 S1-B solo Radius_mm_day7 11.000000
3 S2-B coculture Radius_mm_day7 7.666667
4 S2-B solo Radius_mm_day7 5.350000
5 T15-L1-F11 solo Radius_mm_day7 20.000000
6 T15-L1-F15 coculture Radius_mm_day7 27.500000
7 T15-L1-F15 solo Radius_mm_day7 42.666667
8 T15-L1-F18 solo Radius_mm_day7 12.500000
9 T15-L1-F22 coculture Radius_mm_day7 18.750000
10 T15-L1-F22 solo Radius_mm_day7 55.000000
# ... with 24 more rows
Add new columns based on calculations with other columns using mutate
.
# Add a new column to lichens that indicates whether it is infected (ascomata > 0)
lichens <- lichens %>%
mutate(Infected = Ascomata > 0)
# Display the LichenID, Ascomata and Infected columns
lichens %>%
select(LichenID, Ascomata, Infected)
# A tibble: 32 x 3
LichenID Ascomata Infected
<chr> <int> <lgl>
1 G1-11 0 FALSE
2 G1-3 10 TRUE
3 G1-4 14 TRUE
4 G1-6 2 TRUE
5 G1-7 0 FALSE
6 G1-8 0 FALSE
7 G1-9 3 TRUE
8 G2-2 4 TRUE
9 G2-3 3 TRUE
10 G2-4 0 FALSE
# ... with 22 more rows
# Add columns to mean_growth_long that indicate the number of days
# and then calculate growth rate (mm per day)
mean_growth_long <- mean_growth_long %>%
mutate(Days = ifelse(Measurement_day == "Radius_mm_day7", 7, 14),
Growth_mm_day = Radius_mm / Days
)
mean_growth_long
# A tibble: 34 x 6
# Groups: IsolateID [13]
IsolateID Culture_type Measurement_day Radius_mm Days Growth_mm_day
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 S1-B coculture Radius_mm_day7 7.333333 7 1.0476190
2 S1-B solo Radius_mm_day7 11.000000 7 1.5714286
3 S2-B coculture Radius_mm_day7 7.666667 7 1.0952381
4 S2-B solo Radius_mm_day7 5.350000 7 0.7642857
5 T15-L1-F11 solo Radius_mm_day7 20.000000 7 2.8571429
6 T15-L1-F15 coculture Radius_mm_day7 27.500000 7 3.9285714
7 T15-L1-F15 solo Radius_mm_day7 42.666667 7 6.0952381
8 T15-L1-F18 solo Radius_mm_day7 12.500000 7 1.7857143
9 T15-L1-F22 coculture Radius_mm_day7 18.750000 7 2.6785714
10 T15-L1-F22 solo Radius_mm_day7 55.000000 7 7.8571429
# ... with 24 more rows
Merge the columns of two dataframes using a join
function. left_join
keeps all of the rows of the left dataframe. right_join
keeps all of the rows of the right dataframe. full_join
keeps all rows in both dataframes. The function will use the column names that are the same in both dataframes to match up the rows.
# Select some columns from the site data
sites_light <- sites %>%
select(Site, Canopy_openness_pct)
# Add infection prevalence data to the abbreviated site data
inf_sites <- left_join(sites_light, infection)
inf_sites
Site Canopy_openness_pct Date Num_Eprunastri Num_infected Num_healthy
1 5 62.41 2018-02-08 16 2 14
2 8 60.28 <NA> NA NA NA
3 9 71.68 2018-02-08 9 0 9
4 10 76.39 2018-02-08 12 0 12
5 10 76.39 2018-02-22 10 0 9
6 11 58.11 <NA> NA NA NA
7 13 41.91 2018-02-22 18 0 9
8 15 57.49 2018-02-08 16 0 9
9 18 47.64 2018-02-08 24 6 3
10 19 25.59 2018-02-08 0 0 0
11 28 48.71 2018-02-22 10 1 8
12 37 75.38 2018-02-08 25 1 8
13 37 75.38 2018-02-22 9 0 9
14 38 58.71 2018-02-22 5 0 5
15 45 42.81 2018-02-08 4 3 1
16 47 15.67 2018-02-08 16 0 9
17 49 43.74 2018-02-08 0 0 0
18 51 22.41 <NA> NA NA NA
19 52 32.40 <NA> NA NA NA
20 54 29.60 2018-02-22 0 0 0
21 58 36.06 2018-02-22 17 1 8
22 61 NA 2018-02-08 0 0 0
23 61 NA 2018-02-22 3 0 3
24 65 45.88 <NA> NA NA NA
25 67 29.61 2018-02-22 9 0 9
26 72 30.28 2018-02-22 8 0 8
27 74 28.04 2018-02-08 20 0 9
28 75 25.93 2018-02-08 7 1 6
29 75 25.93 2018-02-22 5 0 5
30 76 31.15 2018-02-22 0 0 0
31 79 72.15 2018-02-22 9 0 9
32 87 20.47 <NA> NA NA NA
33 92 66.55 2018-02-22 10 3 7
34 94 25.46 <NA> NA NA NA
35 95 51.15 <NA> NA NA NA
36 97 45.26 2018-02-22 9 0 9
37 98 40.20 2018-02-08 0 0 0
38 99 33.04 2018-02-08 1 0 1
39 103 28.58 2018-02-08 3 0 3
40 103 28.58 2018-02-22 3 0 3
41 104 36.58 <NA> NA NA NA
42 106 57.46 2018-02-08 0 0 0
43 108 53.13 2018-02-08 5 1 4
44 109 48.09 <NA> NA NA NA
45 110 52.65 2018-02-08 35 0 9
46 110 52.65 2018-02-22 13 1 8
47 111 32.07 <NA> NA NA NA
48 112 29.93 2018-02-08 12 6 6
49 114 63.93 2018-02-22 4 0 4
50 43 70.00 <NA> NA NA NA
We will use functions from the ggplot2
package for data visualization.
# Load the package
library(ggplot2)
A scatterplot shows how two variable relate to one another. Each point corresponds to a row in a dataframe. You can include a linear regression line using geom_smooth(method = "lm")
or a logistic regression line using geom_smooth(method = "glm")
.
# Plot the number of ascomata versus a lichen's initial mass
lichens %>%
ggplot(aes(x = Mass_mg_init, y = Ascomata)) +
geom_point() +
labs(x = "Initial mass (mg)",
y = "Number of ascomata"
)
# Create the same plot with a linear regression line.
lichens %>%
ggplot(aes(x = Mass_mg_init, y = Ascomata)) +
geom_point() +
geom_smooth(method = "lm",
se = TRUE,
level = 0.95) +
labs(x = "Initial mass (mg)",
y = "Number of ascomata"
)
# Create a plot of infectedion status versus initial mass
# including a logistic regression line
lichens %>%
mutate(Infected = as.numeric(Infected)) %>% # convert to 0 or 1
ggplot(aes(x = Mass_mg_init, y = Infected)) +
geom_point() +
geom_smooth(method = "glm",
method.args = list(family = "binomial"),
se = TRUE,
level = 0.95) +
labs(x = "Initial mass (mg)",
y = "Infected"
)
A histogram shows the frequency of observed X values.
# Create a histogram that counts how many lichens have different numbers of ascomata
lichens %>%
ggplot(aes(x = Ascomata)) +
geom_histogram(binwidth = 5, boundary = 5) +
labs(x = "Num. ascomata",
y = "Number of lichens"
)
A boxplot shows the distribution (median and quartiles) of Y values in different X categories.
The following boxplot uses data from the mean_growth_long
dataframe created above in the Manipulating data -> Stacking columns section. Here are the first 10 rows…
IsolateID | Culture_type | Measurement_day | Radius_mm | Days | Growth_mm_day |
---|---|---|---|---|---|
S1-B | coculture | Radius_mm_day7 | 7.33 | 7 | 1.05 |
S1-B | solo | Radius_mm_day7 | 11.00 | 7 | 1.57 |
S2-B | coculture | Radius_mm_day7 | 7.67 | 7 | 1.10 |
S2-B | solo | Radius_mm_day7 | 5.35 | 7 | 0.76 |
T15-L1-F11 | solo | Radius_mm_day7 | 20.00 | 7 | 2.86 |
T15-L1-F15 | coculture | Radius_mm_day7 | 27.50 | 7 | 3.93 |
T15-L1-F15 | solo | Radius_mm_day7 | 42.67 | 7 | 6.10 |
T15-L1-F18 | solo | Radius_mm_day7 | 12.50 | 7 | 1.79 |
T15-L1-F22 | coculture | Radius_mm_day7 | 18.75 | 7 | 2.68 |
T15-L1-F22 | solo | Radius_mm_day7 | 55.00 | 7 | 7.86 |
# Boxplot of the growth rates of fungal isolates growing solo and in co-culture
mean_growth_long %>%
ggplot(aes(x = Culture_type, y = Growth_mm_day)) +
geom_boxplot() +
labs(x = "Culture type",
y = "Growth rate (mm / day)"
)
Barcharts can show the number of observations in different categories.
# Plot the number of infected vs. uninfected lichens on trees vs. shrubs
lichens %>%
ggplot(aes(x = Substrate_type)) +
geom_bar(aes(fill = Infected),
position = "dodge" # remove this if you want the bars stacked
) +
labs(x = "Host plant",
y = "Num. lichens"
)
Barcharts can also show the mean and standard devation of observations in different categories.
# Calculate mean and sd of growth on day 7 across taxa
rad7_summary <- interactions %>%
group_by(TaxonID, Culture_type) %>%
summarise(Radius_mm = mean(Radius_mm_day7),
Radius_sd = sd(Radius_mm_day7)
)
# Plot the mean and sd of different taxa
# growing solo versus in co-culture
rad7_summary %>%
ggplot(aes(x = TaxonID, y = Radius_mm, fill = Culture_type)) +
geom_col(position = "dodge") +
geom_errorbar(aes(ymin = Radius_mm - Radius_sd,
ymax = Radius_mm + Radius_sd),
position = position_dodge(width = 0.9),
width = 0.5
) +
labs(x = "",
y = "Colony radius day 7 (mm)",
fill = "Culture type"
) +
theme(axis.text.x = element_text(angle=90,
vjust = 0.5,
hjust = 1
)
)
To save a figure, click the “Export” button in the Plots window of RStudio.
Alternatively, save the figure directly from your code by running the following immediately after generating a figure. Change the extension to .pdf or .svg if you prefer a vector format image.
# Save the last plot as my_file_name.png in the current working directory.
ggsave("my_file_name.png")
We’ll use the lichens
dataframe downloaded above to demonstrate analyses for when Y is categorical. Here are the first 10 rows…
Team | LichenID | Site | Substrate_type | Substrate_status | Ascomata | Experiment | Mass_mg_init | Infected |
---|---|---|---|---|---|---|---|---|
1 | G1-11 | 9 | shrub | living | 0 | infection | NA | FALSE |
1 | G1-3 | 5 | tree | dead | 10 | growth | 118 | TRUE |
1 | G1-4 | 5 | tree | dead | 14 | growth | 121 | TRUE |
1 | G1-6 | 99 | tree | living | 2 | growth | 46 | TRUE |
1 | G1-7 | 112 | tree | living | 0 | infection | NA | FALSE |
1 | G1-8 | 112 | tree | living | 0 | growth | 171 | FALSE |
1 | G1-9 | 112 | tree | living | 3 | growth control | 66 | TRUE |
2 | G2-2 | 106 | shrub | living | 4 | growth control | 90 | TRUE |
2 | G2-3 | 106 | shrub | living | 3 | growth | 28 | TRUE |
2 | G2-4 | 47 | tree | living | 0 | infection | NA | FALSE |
To apply the following tests, first create a contingeny table of the data that tabulates how many observations are from each combination of X and Y categories. These tests evaluate whether the counts in the rows of the contingency table are different from one another.
# Count the number of infected vs. uninfected lichens on trees vs shrubs
lichen_count <- table(lichens$Substrate_type, lichens$Infected)
# Display table
lichen_count
FALSE TRUE
shrub 4 5
tree 13 10
Use this test when you have many observations and the expected number of observations in each cell of the table is greater than five. This test is not valid when you have few observations.
# Test whether infection frequency is different between trees and shrubs
chisq.test(lichen_count)
Pearson's Chi-squared test with Yates' continuity correction
data: lichen_count
X-squared = 0.049105, df = 1, p-value = 0.8246
Use this test when you have few observations and when you know ahead of time how many observations are in each category of X and each category of Y (e.g. the marginal sums are fixed). Statisticians recommend not using this test and instead using unconditional tests (see Lydersen et al. 2009).
# Fisher's exact test with a two-sided hypothesis
fisher.test(lichen_count)
Fisher's Exact Test for Count Data
data: lichen_count
p-value = 0.6989
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.09605767 3.78195930
sample estimates:
odds ratio
0.6249097
Use this test when you have few observations and when you do not necessarily know ahead of time how many observations are in each category of X and each category of Y (e.g. the marginal sums are random).
# Load the package containing the function
library(Exact)
# Barnard's test used when the number of observations in each category of X is known
# X categories should be in the rows of the table, otherwise, specify, cond.row = FALSE
# method = "CSM" specifies Barnard's method, but you can use other options
barnard_test <- exact.test(lichen_count,
model = "binomial",
method = "CSM",
cond.row = TRUE,
to.plot = FALSE)
# Print test results
barnard_test
csm
data: 4 out of 9 vs. 13 out of 23
test statistic = NA, first sample size = 9, second sample size = 23, p-value = 0.6508
alternative hypothesis: true difference in proportion is not equal to 0
sample estimates:
difference in proportion
-0.1207729
# Boschloo's test used when the number of observations in each category is unknown for X or Y
boschloo_test <- exact.test(lichen_count,
model = "multinomial",
method = "boschloo")
# Print test results
boschloo_test
boschloo
data: x11=4, x12=5, x21=13, x22=10
test statistic = 0.69886, total sample size = 32, p-value = 0.6139
alternative hypothesis: true difference in product of proportion is not equal to 0
sample estimates:
difference in product of proportion
-0.02441406
Use this form of logistic regression when the dependent response variable has two outcomes (e.g. False / True, Absent / Present).
# Fit a logistic regression model with:
# Y = whether a lichen is infected
# X = the initial mass of the lichen
mod_infVmass <- glm(Infected ~ Mass_mg_init,
data = lichens,
family = binomial(link = "logit")
)
# View the results
summary(mod_infVmass)
Call:
glm(formula = Infected ~ Mass_mg_init, family = binomial(link = "logit"),
data = lichens)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7169 -1.2013 0.6215 1.0695 1.2216
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.232116 0.657943 -0.353 0.724
Mass_mg_init 0.005853 0.004471 1.309 0.190
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 31.755 on 23 degrees of freedom
Residual deviance: 29.402 on 22 degrees of freedom
(8 observations deleted due to missingness)
AIC: 33.402
Number of Fisher Scoring iterations: 4
# Exponentiate the mode coeffifients (intercept and slope)
# to see the effect of mass on the odds of being infected
exp(coef(mod_infVmass))
(Intercept) Mass_mg_init
0.7928544 1.0058706
We’ll use the interactions
dataframe downloaded above to demonstrate ANOVA and t-tests. Here are the first 10 rows…
IsolateID | TaxonID | Radius_mm_day7 | Radius_mm_day14 | Competitor_isolate | Competitor_taxon | Culture_type | PlateID |
---|---|---|---|---|---|---|---|
S2-B | NA | 7.50 | 7.5 | T32-L3-F17 | Cladosporium_sp1 | coculture | CJ-P1 |
S2-B | NA | 6.00 | 8.5 | T24-L1-F13 | Nemania_serpens_gen2 | coculture | CJ-P10 |
S2-B | NA | 4.00 | 5.5 | T26-L3-F7 | Nemania_serpens_gen2 | coculture | CJ-P11 |
T24-L1-F13 | Nemania_serpens_gen2 | 10.00 | 20.0 | NA | NA | solo | CJ-P12 |
T26-L3-F7 | Nemania_serpens_gen2 | 9.00 | 20.0 | NA | NA | solo | CJ-P13 |
S2-B | NA | 6.50 | 9.0 | T24-L1-F11 | Cladosporium_sp1 | coculture | CJ-P2 |
S2-B | NA | 5.70 | 9.0 | NA | NA | solo | CJ-P3 |
T32-L3-F17 | Cladosporium_sp1 | 15.75 | 25.0 | NA | NA | solo | CJ-P4 |
T24-L1-F11 | Cladosporium_sp1 | 10.00 | 21.0 | NA | NA | solo | CJ-P5 |
S2-B | NA | 6.00 | 8.5 | T15-L1-F18 | Nemania_serpens_gen1 | coculture | CJ-P6 |
Use ANOVA when you want to test whether a normally-distributed dependent variable differs among categories. ANOVA assumes that the variance within groups is the same across groups.
# Replace unknown (NA) taxon names with their isolate ID (e.g. S1-A)
int_noNAtaxa <- interactions %>%
mutate(TaxonID = ifelse(is.na(TaxonID), IsolateID, TaxonID))
# ANOVA testing whether day 7 radius differs among taxa
# Fit model
mod_rad7Vtaxa <- lm(Radius_mm_day7 ~ factor(TaxonID), data = int_noNAtaxa)
# Perform ANOVA
anova(mod_rad7Vtaxa)
Analysis of Variance Table
Response: Radius_mm_day7
Df Sum Sq Mean Sq F value Pr(>F)
factor(TaxonID) 7 4443.4 634.77 8.0596 2.062e-06 ***
Residuals 47 3701.7 78.76
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Post-hoc t-test
ANOVA only tells you whether the categories are different from one another (variance between categories > variance within categories). If you want to know which categories differ from one another, use Tukey’s Honest Significant Difference test after performing an ANOVA to determine whether the pairs of categories have different means.
# First fit an AOV (analysis of variance) object
aov_rad7Vtaxa <- aov(mod_rad7Vtaxa)
# Then computed Tukey's HSD
TukeyHSD(aov_rad7Vtaxa)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = mod_rad7Vtaxa)
$`factor(TaxonID)`
diff lwr upr p adj
Nemania_serpens_gen1-Cladosporium_sp1 0.3125000 -24.0608489 24.685849 1.0000000
Nemania_serpens_gen2-Cladosporium_sp1 -4.2708333 -25.7661066 17.224440 0.9982478
Nigrospora_sp1-Cladosporium_sp1 7.0625000 -24.4033581 38.528358 0.9962309
Pezizomycetes_sp5-Cladosporium_sp1 10.9910714 -4.9650309 26.947174 0.3793044
S1-B-Cladosporium_sp1 -4.9375000 -21.3700085 11.495008 0.9787338
S2-B-Cladosporium_sp1 -5.6920455 -22.1245539 10.740463 0.9539829
Sordariomycetes_sp1_gen1-Cladosporium_sp1 17.1736111 0.2612206 34.086002 0.0442053
Nemania_serpens_gen2-Nemania_serpens_gen1 -4.5833333 -30.2750989 21.108432 0.9991229
Nigrospora_sp1-Nemania_serpens_gen1 6.7500000 -27.7191205 41.219121 0.9984024
Pezizomycetes_sp5-Nemania_serpens_gen1 10.6785714 -10.5962316 31.953375 0.7526636
S1-B-Nemania_serpens_gen1 -5.2500000 -26.8844027 16.384403 0.9939164
S2-B-Nemania_serpens_gen1 -6.0045455 -27.6389481 15.629857 0.9864880
Sordariomycetes_sp1_gen1-Nemania_serpens_gen1 16.8611111 -5.1400025 38.862225 0.2511248
Nigrospora_sp1-Nemania_serpens_gen2 11.3333333 -21.1644652 43.831132 0.9523310
Pezizomycetes_sp5-Nemania_serpens_gen2 15.2619048 -2.6435101 33.167320 0.1471494
S1-B-Nemania_serpens_gen2 -0.6666667 -18.9978973 17.664564 1.0000000
S2-B-Nemania_serpens_gen2 -1.4212121 -19.7524427 16.910018 0.9999968
Sordariomycetes_sp1_gen1-Nemania_serpens_gen2 21.4444444 2.6818317 40.207057 0.0150268
Pezizomycetes_sp5-Nigrospora_sp1 3.9285714 -25.2031524 33.060295 0.9998617
S1-B-Nigrospora_sp1 -12.0000000 -41.3953648 17.395365 0.8963273
S2-B-Nigrospora_sp1 -12.7545455 -42.1499102 16.640819 0.8633919
Sordariomycetes_sp1_gen1-Nigrospora_sp1 10.1111111 -19.5551844 39.777407 0.9577291
S1-B-Pezizomycetes_sp5 -15.9285714 -27.2680796 -4.589063 0.0012500
S2-B-Pezizomycetes_sp5 -16.6831169 -28.0226250 -5.343609 0.0006367
Sordariomycetes_sp1_gen1-Pezizomycetes_sp5 6.1825397 -5.8418549 18.206934 0.7297755
S2-B-S1-B -0.7545455 -12.7551529 11.246062 0.9999993
Sordariomycetes_sp1_gen1-S1-B 22.1111111 9.4613602 34.760862 0.0000343
Sordariomycetes_sp1_gen1-S2-B 22.8656566 10.2159057 35.515407 0.0000180
Use a t-test whether you want to test whether the mean of a normally-distributed dependent variable differs between two categories.
The arguments in the t.test
function can be used to specify whether the variances should be assumed equal among groups (Student’s t-test, var.equal = TRUE
) or unequal (Welch’s t-test, var.equal = FALSE
). The paired
arguments determines whether you want an independent sample or dependent sample test. Use a dependent sample test (paired = TRUE
) when you have measured the same individuals in both categories.
We’ll use the interactions
data to demonstrate and test whether the radius measured on day 7 differed from the radius measured on day 14.
# Student's t-test, two-sided hypothesis
t.test(interactions$Radius_mm_day14,
interactions$Radius_mm_day7,
var.equal = TRUE)
Two Sample t-test
data: interactions$Radius_mm_day14 and interactions$Radius_mm_day7
t = 1.8523, df = 102, p-value = 0.06688
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.3311459 9.6784186
sample estimates:
mean of x mean of y
21.00000 16.32636
# Student's t-test, one-sided hypothesis
t.test(interactions$Radius_mm_day14,
interactions$Radius_mm_day7,
var.equal = TRUE,
alternative = "greater")
Two Sample t-test
data: interactions$Radius_mm_day14 and interactions$Radius_mm_day7
t = 1.8523, df = 102, p-value = 0.03344
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
0.4852784 Inf
sample estimates:
mean of x mean of y
21.00000 16.32636
# Welch's t-test (unequal variance), one-sided hypothesis
t.test(interactions$Radius_mm_day14,
interactions$Radius_mm_day7,
var.equal = FALSE,
alternative = "greater")
Welch Two Sample t-test
data: interactions$Radius_mm_day14 and interactions$Radius_mm_day7
t = 1.8425, df = 97.82, p-value = 0.03422
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
0.4614567 Inf
sample estimates:
mean of x mean of y
21.00000 16.32636
Since measurements are not independent- we measured the same plates on day 7 and day 14, a paired test is more appropriate. Note that the two columns must have “individuals” in the same order.
# Paired Welch's t-test treating plates as individuals
t.test(interactions$Radius_mm_day14,
interactions$Radius_mm_day7,
var.equal = FALSE,
alternative = "greater",
paired = TRUE)
Paired t-test
data: interactions$Radius_mm_day14 and interactions$Radius_mm_day7
t = 6.3926, df = 48, p-value = 3.142e-08
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
4.035136 Inf
sample estimates:
mean of the differences
5.470408
We’ll use the inf_sites
dataframe created above in the Manipulating Data section to demonstrate fitting linear regression models. Here are the first 20 rows…
Site | Canopy_openness_pct | Date | Num_Eprunastri | Num_infected | Num_healthy |
---|---|---|---|---|---|
5 | 62.41 | 2018-02-08 | 16 | 2 | 14 |
8 | 60.28 | NA | NA | NA | NA |
9 | 71.68 | 2018-02-08 | 9 | 0 | 9 |
10 | 76.39 | 2018-02-08 | 12 | 0 | 12 |
10 | 76.39 | 2018-02-22 | 10 | 0 | 9 |
11 | 58.11 | NA | NA | NA | NA |
13 | 41.91 | 2018-02-22 | 18 | 0 | 9 |
15 | 57.49 | 2018-02-08 | 16 | 0 | 9 |
18 | 47.64 | 2018-02-08 | 24 | 6 | 3 |
19 | 25.59 | 2018-02-08 | 0 | 0 | 0 |
28 | 48.71 | 2018-02-22 | 10 | 1 | 8 |
37 | 75.38 | 2018-02-08 | 25 | 1 | 8 |
37 | 75.38 | 2018-02-22 | 9 | 0 | 9 |
38 | 58.71 | 2018-02-22 | 5 | 0 | 5 |
45 | 42.81 | 2018-02-08 | 4 | 3 | 1 |
47 | 15.67 | 2018-02-08 | 16 | 0 | 9 |
49 | 43.74 | 2018-02-08 | 0 | 0 | 0 |
51 | 22.41 | NA | NA | NA | NA |
52 | 32.40 | NA | NA | NA | NA |
54 | 29.60 | 2018-02-22 | 0 | 0 | 0 |
Use normal linear regression when the dependent variable is continuous and (mostly) normally distributed.
# Linear regression of number of E prunastri vs. canopy openness
mod_eprunVcanopy <- lm(Num_Eprunastri ~ Canopy_openness_pct, data = inf_sites)
# View the results
summary(mod_eprunVcanopy)
Call:
lm(formula = Num_Eprunastri ~ Canopy_openness_pct, data = inf_sites)
Residuals:
Min 1Q Median 3Q Max
-10.692 -5.383 -2.009 4.355 24.815
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.63260 3.79711 1.220 0.231
Canopy_openness_pct 0.10545 0.07611 1.386 0.175
Residual standard error: 8.038 on 34 degrees of freedom
(14 observations deleted due to missingness)
Multiple R-squared: 0.05345, Adjusted R-squared: 0.02561
F-statistic: 1.92 on 1 and 34 DF, p-value: 0.1749
Poisson regression is useful when the dependent variable is a discrete count. The number of E. prunastri is a count, not actually continuous and since there are many zeros and low counts (see histogram below) we may want to use Poisson regression instead.
# Histogram to view the distribution of Num_Eprunastri
inf_sites %>%
ggplot(aes(x = Num_Eprunastri)) +
geom_histogram(binwidth = 5, boundary = 5)
# Fit a Poisson regression
mod_eprunVcanopy_pois <- glm(Num_Eprunastri ~ Canopy_openness_pct,
data = inf_sites,
family = "poisson"
)
# View the results
summary(mod_eprunVcanopy_pois)
Call:
glm(formula = Num_Eprunastri ~ Canopy_openness_pct, family = "poisson",
data = inf_sites)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.5922 -1.9392 -0.6726 1.3738 6.1363
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.732272 0.160744 10.777 < 2e-16 ***
Canopy_openness_pct 0.010847 0.003022 3.589 0.000332 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 261.88 on 35 degrees of freedom
Residual deviance: 249.06 on 34 degrees of freedom
(14 observations deleted due to missingness)
AIC: 375.2
Number of Fisher Scoring iterations: 5
# Create a scatterplot
# fitted Poisson regression line in blue
# fitted linear regression line in red
inf_sites %>%
ggplot(aes(x = Canopy_openness_pct, y = Num_Eprunastri)) +
geom_point() +
geom_smooth(method = "glm",
method.args = list(family = "poisson"),
color = "blue") +
geom_smooth(method = "lm",
color = "red")