In this lesson you will be examining data on 18 fungi isolated from Evernia prunastri lichens during the Winter 2017 BIO46 class. In this week’s lab you will be responsible for designing and conducting a competition experiment in which you grow these isolates together and in isolation in order to determine which ones can inhibit each other. You will also have access to two isolates each of the S1 and S2 fungal isolates that we used in the infection experiment and which were isolated from the ascomata of the fungal infection we are studying. However, S1 and S2 are not in this table because we don’t yet know anything about them. As you work through the analysis of this data, think about what hypothesis you might want to test and which pairs of isolates you would need to compete against one another.
In this exercise you will learn:
Before we start, load the packages that we will be using.
library(tidyr)
library(dplyr)
library(ggplot2)
In this exercise you will be using a data file that you saved in your BIO46 folder. The file is named W4_isolates.csv and is in a format called csv (for comma separated values) which means that if you open the file a text editor, you will see that divisions between columns are denoted with commas. You should have saved this file in a folder named data inside your BIO46 folder. The file path to the csv file is therefore: “BIO46/data/W4_isolates.csv”. In order for R to find a file on you computer you need to specify this file path.
When you start RStudio, R starts out in a folder called the working directory (directory = folder). You can discover which folder that R thinks you are in by typing getwd()
into the Console. “wd” stands for working directory . To change the working directory you type setwd("/folder_name/sub_folder_name")
. When you are working on analyses for our class you should always set the working directory to be the file path to your BIO46 folder.
Here is an example of how to do this. (Note: running this code probably will not work on your computer because the file path is not the same on your computer.)
# Set the working directory on Windows
setwd("C:/Users/your_username/Documents/BIO46")
# Set the working directory on Mac
setwd("/Users/your_username/Documents/BIO46")
Problem 1. Change the code above to set the working directory to be the BIO46 folder on your computer. You will only need to use the line of code that is appropriate for your operating system (Mac/Windows).
Hint: Need help figuring out what the file path is? Try searching: “find a file path on mac”.
Now that R knows which folder to start in, we can read in the csv file to a dataframe named isolates
. Notice that we specify the path to the file from the working directory. Since we previously told R to change the working directory to the BIO46 folder, the following code tells R to look for a file named W4_isolates.csv that is inside a folder named data and to read the contents into a dataframe. The read.table()
function expects this file to be a table of data in which the columns are separated by commas (sep = ","
). The header = TRUE
argument tells the function that the first row of the table contains column names.
# Read in the csv file to a dataframe
isolates <- read.table("data/W4_isolates.csv", header = TRUE, sep = ",")
You can examine this table by going to the Environment window and clicking on isolates
. You should see that each row of the table corresponds to a single fungus that was isolated from an Evernia prunastri lichen in last year’s BIO46 class.
Column name | Description |
---|---|
IsolateID | unique identifier for each fungal isolate |
TaxonID | tentative taxonomic identification (e.g. which species it is) |
LichenID | which lichen individual the isolate came from |
Thallus_location | whether the fungus was isolated from a tip or from the base |
Growth_rate_mm_day | growth rate on MEA media at standard lab conditions (mm day-1) |
Num_isolates | number of other fungi isolated from the same lichen |
Num_taxa | number of other fungal taxa isolated from the same lichen |
Class, Order, Family, Species | taxonomic placement* |
* Notice that some isolates could only be identified to Class.
When we read the data into R, we used the assignment operator (<-
) to create a new object named isolates
. We could have chosen any name we want, with a few restrictions:
my dog
my_dog
or my.dog
or myDog
18fungi
fungi18
or fungi_18
my_data
flower_data
or flowers
c
or t
T
or F
data
min
, max
, mean
, sum
, diff
, and other obvious arithmetic functionsError: object not found
then you can use the name.Most of these rules also apply when you are naming columns in a dataframe.
Let’s use R to summarize the data. First, how many isolates do we have for each taxon (i.e. putative “species”)? Each row of isolates
corresponds to a single fungal isolate and the TaxonID
indicates which taxon each isolate belongs to. Therefore we need to group the rows of isolates
by the values in TaxonID
and then count the number of rows in each group. We can do this using the group_by()
and summarise()
functions in the dplyr
package (which we already loaded).
In the code below, the %>%
is called a pipe and the way it works is to send the result of a computation on the left side into the first argument of the function on the right side (or on the next line). For example, X %>% Y %>% Z
means “do X
and send the result to Y
and then send that result to Z. An alternative way to write this using function composition is Z(Y(X))
.
Here’s what each line of the following code does, without the code present:
isolates
.TaxonID
.Available_isolates
.
n()
is a function that counts the number of rows.summarise()
is a function that creates a new dataframe using the new column names and calculation that we specify.## How many isolates are there for each taxon?
# Use the isolates data
isolates %>%
# Group the rows of the data by TaxonID
group_by(TaxonID) %>%
# Count the rows and save as a new column named Available_isolates.
summarise(Available_isolates = n())
## # A tibble: 9 x 2
## TaxonID Available_isolates
## <fctr> <int>
## 1 Cladosporium_sp1 3
## 2 Nemania_serpens_gen1 3
## 3 Nemania_serpens_gen2 3
## 4 Nigrospora_sp1 1
## 5 Peziza_arvernensis 1
## 6 Pezizomycetes_sp2 3
## 7 Pezizomycetes_sp5 1
## 8 Sordariomycetes_sp1_gen1 2
## 9 Sordariomycetes_sp1_gen4 1
So, there are four taxa for which we have 3 independent isolates and four taxa for which we only have one isolate. What do you think the result will be if we remove the second line that groups the rows according to the TaxonID?
# Use the isolates data
isolates %>%
# Count the rows and save as a new column named Available_isolates.
summarise(Available_isolates = n())
Let’s see how growth rates on MEA media vary among taxa. To do this we can modify the code above to also calculate the mean of the column Growth_rate_mm_day
within each group.
# Summarise the data within taxa
isolates %>%
group_by(TaxonID) %>%
summarise(Available_isolates = n(),
Growth_rate_mm_day = mean(Growth_rate_mm_day)
)
## # A tibble: 9 x 3
## TaxonID Available_isolates Growth_rate_mm_day
## <fctr> <int> <dbl>
## 1 Cladosporium_sp1 3 1.289667
## 2 Nemania_serpens_gen1 3 2.743000
## 3 Nemania_serpens_gen2 3 2.223952
## 4 Nigrospora_sp1 1 8.000000
## 5 Peziza_arvernensis 1 1.714000
## 6 Pezizomycetes_sp2 3 2.561905
## 7 Pezizomycetes_sp5 1 11.750000
## 8 Sordariomycetes_sp1_gen1 2 3.143000
## 9 Sordariomycetes_sp1_gen4 1 1.429000
Which taxon had the highest growth rate? Why?
Let’s use a different column to group the rows. What if we want to know whether growth rates differ between fungi isolated from the tips versus bases of the lichens? Try this yourself:
Problem 2. Modify the code above so that it counts the number of isolates and calculates the mean growth rate of isolates from the tips versus the bases of Evernia prunastri.
Hint: What is the name of the column that indicates whether an isolate came from the tip or base?
We can see that growth rates differ between fungi isolated from tips and bases, but is this difference statistically significant? To test this we should conduct an ANOVA and plot the results. (Why is ANOVA the approriate analysis?)
To conduct and ANOVA in R we use the lm()
function to define a linear model and then use the anova()
function on the result to calculate the F statistic and its p-value. Remember that the format is lm(Y ~ X, data = dataframe_name)
.
# Fit a linear model with growth rate as the response and thallus location as the predictor.
mod_growth_loc <- lm(Growth_rate_mm_day ~ Thallus_location, data = isolates)
# Calculate the F statistic and p-value
anova(mod_growth_loc)
## Analysis of Variance Table
##
## Response: Growth_rate_mm_day
## Df Sum Sq Mean Sq F value Pr(>F)
## Thallus_location 1 0.518 0.5181 0.0609 0.8082
## Residuals 16 136.076 8.5048
The low F value (0.061) means that the variance within groups is high relative to the variance between groups, which indicates that there is little difference in growth rates (the Y variable) between fungi from tips versus bases (the X grouping variable). The high p-value (0.81) substantiates this.
You should always plot the data! One reason is to check for outliers that may be skewing the results. Another reason is because statistical tests sometimes detect effects that aren’t actually very large or meaningful.
Boxplots are an easy way to show the variability of data within groups (e.g. when Y is continuous and X is categorical). To make a boxplot with ggplot
we first specify the the data to use (isolates
) and then specify the x
and y
variable columns within the aes()
function. Then we add (+
) the boxplot with geom_boxplot()
. I also added (+
) informative labels with the labs()
function.
# Make a boxplot of growth rates of isolates from different thallus locations
ggplot(isolates, aes(x = Thallus_location, y = Growth_rate_mm_day)) +
geom_boxplot() +
labs(x = "Thallus location",
y = expression("Growth rate"~(mm~day^-1))
)
Suppose we want to know whether growth rates differ between isolates that occured alone in the lichen versus those that co-occured with at least one other species. The column Num_taxa
tells us how many other taxa occured within the same lichen, but what we need is a column that differentiates between isolates occuring with 1 species (themselves) and those occuring with more than 1 species.
The mutate()
function will add a column to a dataframe based on a calculation that you specify. The following code makes a new column named Solo
that is TRUE
if the isolate occured alone (with Num_taxa == 1
) and is FALSE
otherwise. Notice that the double ==
tests whether the left and right sides are equal to one another, whereas the single =
assigns the calculation Num_taxa == 1
to the column named Solo
.
# Add a column that indicates whether an isolate occurred alone in the lichen.
isolates %>%
mutate(Solo = Num_taxa == 1)
## IsolateID TaxonID LichenID Thallus_location
## 1 T15-L1-F18 Nemania_serpens_gen1 T15-L1 base
## 2 T24-L1-F11 Cladosporium_sp1 T24-L1 tip
## 3 T24-L1-F13 Nemania_serpens_gen2 T24-L1 base
## 4 T25-L2-F12 Nemania_serpens_gen2 T25-L2 base
## 5 T26-L3-F21 Cladosporium_sp1 T26-L3 base
## 6 T26-L3-F23 Nemania_serpens_gen1 T26-L3 base
## 7 T26-L3-F7 Nemania_serpens_gen2 T26-L3 tip
## 8 T32-L3-F17 Cladosporium_sp1 T32-L3 base
## 9 T36-L2-F21 Pezizomycetes_sp2 T36-L2 base
## 10 T36-L3-F17 Pezizomycetes_sp2 T36-L3 base
## 11 T38-L1-F2 Nemania_serpens_gen1 T38-L1 base
## 12 T4-L2-F20 Pezizomycetes_sp2 T4-L2 tip
## 13 T15-L1-F11 Nigrospora_sp1 T15-L1 tip
## 14 T15-L1-F14 Peziza_arvernensis T15-L1 base
## 15 T15-L1-F15 Sordariomycetes_sp1_gen1 T15-L1 base
## 16 T15-L1-F16 Sordariomycetes_sp1_gen4 T15-L1 base
## 17 T15-L1-F22 Pezizomycetes_sp5 T15-L1 base
## 18 T54-L3-F4 Sordariomycetes_sp1_gen1 T54-L3 tip
## Growth_rate_mm_day Num_isolates Num_taxa Class
## 1 1.7860000 8 6 Sordariomycetes
## 2 0.1785714 6 5 Dothideomycetes
## 3 1.1428571 6 5 Sordariomycetes
## 4 3.6000000 2 2 Sordariomycetes
## 5 1.1190000 7 5 Dothideomycetes
## 6 2.1430000 7 5 Sordariomycetes
## 7 1.9290000 7 5 Sordariomycetes
## 8 2.5714286 3 2 Dothideomycetes
## 9 1.0714286 3 2 Pezizomycetes
## 10 1.9000000 2 2 Pezizomycetes
## 11 4.3000000 1 1 Sordariomycetes
## 12 4.7142857 9 4 Pezizomycetes
## 13 8.0000000 8 6 Pezizomycetes
## 14 1.7140000 8 6 Pezizomycetes
## 15 4.2860000 8 6 Sordariomycetes
## 16 1.4290000 8 6 Sordariomycetes
## 17 11.7500000 8 6 Pezizomycetes
## 18 2.0000000 1 1 Sordariomycetes
## Order Family Species Solo
## 1 Xylariales Xylariaceae Nemania_serpens FALSE
## 2 Capnodiales Davidiellaceae Cladosporium_sp1 FALSE
## 3 Xylariales Xylariaceae Nemania_serpens FALSE
## 4 Xylariales Xylariaceae Nemania_serpens FALSE
## 5 Capnodiales Davidiellaceae Cladosporium_sp1 FALSE
## 6 Xylariales Xylariaceae Nemania_serpens FALSE
## 7 Xylariales Xylariaceae Nemania_serpens FALSE
## 8 Capnodiales Davidiellaceae Cladosporium_sp1 FALSE
## 9 Pezizomycetes_sp2 FALSE
## 10 Pezizomycetes_sp2 FALSE
## 11 Xylariales Xylariaceae Nemania_serpens TRUE
## 12 Pezizomycetes_sp2 FALSE
## 13 Trichosphaeriales Nigrospora_sp1 FALSE
## 14 Pezizales Pezizaceae Peziza_arvernensis FALSE
## 15 Sordariomycetes_sp1 FALSE
## 16 Sordariomycetes_sp1 FALSE
## 17 Pezizomycetes_sp5 FALSE
## 18 Sordariomycetes_sp1 TRUE
We can see that a column named Solo
containing TRUE/FALSE
was added to the end, but the ouput is hard to read. Let’s change the code so that it filters the data to only those rows where Solo
is TRUE
. The filter()
function returns all rows for which the condition specified is true.
# Display isolates that occurred alone in the lichen.
isolates %>%
mutate(Solo = Num_taxa == 1) %>%
filter(Solo == TRUE)
## IsolateID TaxonID LichenID Thallus_location
## 1 T38-L1-F2 Nemania_serpens_gen1 T38-L1 base
## 2 T54-L3-F4 Sordariomycetes_sp1_gen1 T54-L3 tip
## Growth_rate_mm_day Num_isolates Num_taxa Class Order
## 1 4.3 1 1 Sordariomycetes Xylariales
## 2 2.0 1 1 Sordariomycetes
## Family Species Solo
## 1 Xylariaceae Nemania_serpens TRUE
## 2 Sordariomycetes_sp1 TRUE
We could have done this with fewer lines of code:
# Display isolates that occured alone in the lichen
isolates %>%
filter(Num_taxa == 1)
## IsolateID TaxonID LichenID Thallus_location
## 1 T38-L1-F2 Nemania_serpens_gen1 T38-L1 base
## 2 T54-L3-F4 Sordariomycetes_sp1_gen1 T54-L3 tip
## Growth_rate_mm_day Num_isolates Num_taxa Class Order
## 1 4.3 1 1 Sordariomycetes Xylariales
## 2 2.0 1 1 Sordariomycetes
## Family Species
## 1 Xylariaceae Nemania_serpens
## 2 Sordariomycetes_sp1
Better, but let’s just select the columns we want to look at:
# Display isolates that occured alone in the lichen
isolates %>%
filter(Num_taxa == 1) %>%
select(IsolateID, TaxonID, Growth_rate_mm_day, Num_taxa)
## IsolateID TaxonID Growth_rate_mm_day Num_taxa
## 1 T38-L1-F2 Nemania_serpens_gen1 4.3 1
## 2 T54-L3-F4 Sordariomycetes_sp1_gen1 2.0 1
Problem 3. Modify the code above so that it returns just the taxa names for the isolates occuring with at least 5 other taxa.
Hint:
Code symbol | Definition |
---|---|
== |
equal to |
!= |
not equal to |
> |
greater than |
< |
less than |
>= |
greater than or equal to |
<= |
less than or equal to |
Now that you know how to group, summarize and filter data, solve the following problems.
Problem 4. Do growth rates differ among isolates belonging to different Classes? Answer this question by plotting the
isolates
data and conducting an appropriate statistical test.
Problem 5. Do isolates from lichens with fewer other taxa have faster growth rates? Answer this question by plotting the data and conducting an appropriate statistical analysis.
Hint: This might be a review question that requires you to use the last Data analysis exercise.