I usually include this code at the top of scripts so that R doesn’t automatically read text as factors.

Data manipulation

Loading Data

The following data tables are available for analysis:

Table Descriptions

sites: Location and canopy cover of Jasper Ridge survey sites.
lichens: Info about lichens collected at Jasper Ridge, including outcomes of growth and infection experiments.
fungal_isolates: Growth rates and identities of fungi isolated from lichens.
infection_prevalence: Abundance of E. prunastri and infection prevalence at survey sites.
interaction_experiment_byplate: Outcome of fungal isolate interaction experiment for each plate.
interaction_experiment_bycompetitor: Outcome of fungal isolate interaction experiment for each competitor on a plate.
env_loggers_summary: Temperature and humidity summary data for each site.
infection_experiment: Results of infection experiment.
taxa: Table of taxonomic names of fungal isolates.
site_X_species_tables: Tables that show whether different fungal taxa were present or absent in different lichens an at different sites. Each row is a lichen or a site and the columns are the different taxa. Check out the different tables on the Google Drive or on GitHub.

Table locations

Table name	Google sheet key	GitHub filename
sites	1LJj8BAngXFjBTvpOJv_0WXV_OwTXtEHk00p_WKsMuJ8	BIO46_2018_sample_sites.csv
lichens	11d3-k-3OaIOfm8EfneUiNGiumfPBdMBHMViaTbVtzwY	lichens.csv
fungal_isolates	16ku-VRUHMZHz5XgHtD2ESDhdem401sBnbbPEEVeYtrg
infection_prevalence	1iJ1bM4qGkL8M068wJAr52BU801hHo6HldQClU75FCMc	infection_prevalence.csv
interaction_experiment_byplate	1ty7Wz0oiEIFJcfSCy8FFIVd3eyiwI0iMc0AyGVdFrpY	interaction_experiment_byplate.csv
interaction_experiment_bycompetitor	16YxUnW2o1WfYYab5IUNTzxFVcrKVvwerbR7pGLJzHZQ	interaction_experiment_bycompetitor.csv
env_loggers_summary	1ktSM1PYOnazM663Fg_bdTYRPA8yUlHH2VYFriHFleEo	env_loggers_summary.csv
infection_experiment	1RwIhDQNeAWcLI9ZlrlpekDWQ0cjPM_UkBgb5FFBDiM8	infection_experiment.csv
taxa	1pbttwIAHZ8jijHdB9rvwkdxkh0rkLfKoegm07PxjB9Y	taxa.csv

Three ways to load these data:

1) From the Google Drive sheets

# Load package
library(googlesheets)

# Authorize googlesheets to access you Google Drive
gs_auth()

# Register the lichens data table
lichens_gs <- gs_key("11d3-k-3OaIOfm8EfneUiNGiumfPBdMBHMViaTbVtzwY")

# Read the lichens_gs googlesheet into R
lichens <- gs_read(lichens_gs)

2) From the GitHub repository

# Define the URL of the data repository
github_url <- "https://raw.githubusercontent.com/jescoyle/BIO46/master/2018/Data/"

# Read in the csv file using the url and the specific file name
# The following reads in three different tables
interactions <- read.csv(file.path(github_url, "interaction_experiment_bycompetitor.csv"))
sites <- read.csv(file.path(github_url, "BIO46_2018_sample_sites.csv"))
infection <- read.csv(file.path(github_url, "infection_prevalence.csv"))

3) From a file on your computer

First download the files to your BIO46 folder.

# Set the working directory
setwd("C:/Users/your_username/Documents/BIO46")

# Read in a csv file that you downloaded to that directory
read.csv("lichens.csv")

Manipulating Data

Functions for manipulating dataframes are in the dplyr and tidyr packages. Be sure to load them once before using these functions.

# Load the packages
library(dplyr)
library(tidyr)

Subsetting

Use select to choose specific columns and filter to choose specific rows based on the values in the columns. In the code below, the resulting dataframe is saved to an object named shrub_inf that can be used later.

# Show the site and lichen number of lichens on shrubs that have more than 0 ascomata (i.e. infected)
shrub_inf <- lichens %>%
  filter(Ascomata > 0 & Substrate_type == "shrub") %>%
  select(LichenID, Site, Ascomata)


shrub_inf

# A tibble: 5 x 3
  LichenID  Site Ascomata
     <chr> <int>    <int>
1     G2-2   106        4
2     G2-3   106        3
3     G3-6   108        2
4     G4-1    51        3
5     G4-2    51        6

Summarizing

Summarize values in columnes with the summarise function. Use group_by to specify the columns used to group data before calculating summary statistics. In the code examples below, the resulting dataframes are saved to an objects named substrate_summary and mean_growth that can be used later.

# For lichens on different substrates,
# Count the number of lichens found 
#  and calculate mean and maximum number of ascomata
substrate_summary <- lichens %>%
  group_by(Substrate_type) %>%
  summarise(Num_lichens  = n(),
            Ascomata = mean(Ascomata),
            Ascomata_max = max(Ascomata)
  )

substrate_summary

# A tibble: 2 x 4
  Substrate_type Num_lichens Ascomata Ascomata_max
           <chr>       <int>    <dbl>        <dbl>
1          shrub           9 2.000000     2.000000
2           tree          23 2.043478     2.043478

# Calculate the mean radial growth of each isolate growing solo or in co-culture
mean_growth <- interactions %>%
  group_by(IsolateID, Culture_type) %>%
  summarise(Radius_mm_day7 = mean(Radius_mm_day7),
            Radius_mm_day14 = mean(Radius_mm_day14)
  )

mean_growth

# A tibble: 17 x 4
# Groups:   IsolateID [?]
    IsolateID Culture_type Radius_mm_day7 Radius_mm_day14
        <chr>        <chr>          <dbl>           <dbl>
 1       S1-B    coculture       7.333333              NA
 2       S1-B         solo      11.000000        13.50000
 3       S2-B    coculture       7.666667              NA
 4       S2-B         solo       5.350000        10.50000
 5 T15-L1-F11         solo      20.000000        35.00000
 6 T15-L1-F15    coculture      27.500000              NA
 7 T15-L1-F15         solo      42.666667        56.33333
 8 T15-L1-F18         solo      12.500000        20.00000
 9 T15-L1-F22    coculture      18.750000        19.08333
10 T15-L1-F22         solo      55.000000        55.00000
11 T24-L1-F11         solo      11.500000        21.00000
12 T24-L1-F13         solo      10.000000        20.00000
13 T25-L2-F12         solo       7.000000        27.00000
14 T26-L3-F23         solo      14.000000        28.00000
15  T26-L3-F7         solo       9.000000        20.00000
16 T32-L3-F17         solo      14.375000        24.50000
17  T54-L3-F4         solo      16.500000        27.50000

Stacking columns

To combine multiple columns into the same column, use the gather function. key = is the name of a new column that will identify which column the data originally came from. value = is the name of a new column that will contain the values that were originally in separate columns.

# Put the radius measurements from day 7 and day 14 into a single column (named Radius_mm)
mean_growth_long <- mean_growth %>%
  gather(key = Measurement_day, value = Radius_mm, Radius_mm_day7, Radius_mm_day14)

mean_growth_long

# A tibble: 34 x 4
# Groups:   IsolateID [13]
    IsolateID Culture_type Measurement_day Radius_mm
        <chr>        <chr>           <chr>     <dbl>
 1       S1-B    coculture  Radius_mm_day7  7.333333
 2       S1-B         solo  Radius_mm_day7 11.000000
 3       S2-B    coculture  Radius_mm_day7  7.666667
 4       S2-B         solo  Radius_mm_day7  5.350000
 5 T15-L1-F11         solo  Radius_mm_day7 20.000000
 6 T15-L1-F15    coculture  Radius_mm_day7 27.500000
 7 T15-L1-F15         solo  Radius_mm_day7 42.666667
 8 T15-L1-F18         solo  Radius_mm_day7 12.500000
 9 T15-L1-F22    coculture  Radius_mm_day7 18.750000
10 T15-L1-F22         solo  Radius_mm_day7 55.000000
# ... with 24 more rows

Calculating new columns

Add new columns based on calculations with other columns using mutate.

# Add a new column to lichens that indicates whether it is infected (ascomata > 0)
lichens <- lichens %>%
  mutate(Infected = Ascomata > 0) 

# Display the LichenID, Ascomata and Infected columns
lichens %>%
    select(LichenID, Ascomata, Infected)

# A tibble: 32 x 3
   LichenID Ascomata Infected
      <chr>    <int>    <lgl>
 1    G1-11        0    FALSE
 2     G1-3       10     TRUE
 3     G1-4       14     TRUE
 4     G1-6        2     TRUE
 5     G1-7        0    FALSE
 6     G1-8        0    FALSE
 7     G1-9        3     TRUE
 8     G2-2        4     TRUE
 9     G2-3        3     TRUE
10     G2-4        0    FALSE
# ... with 22 more rows

# Add columns to mean_growth_long that indicate the number of days
# and then calculate growth rate (mm per day)
mean_growth_long <- mean_growth_long %>%
  mutate(Days = ifelse(Measurement_day == "Radius_mm_day7", 7, 14), 
         Growth_mm_day = Radius_mm / Days
  )
         
mean_growth_long

# A tibble: 34 x 6
# Groups:   IsolateID [13]
    IsolateID Culture_type Measurement_day Radius_mm  Days Growth_mm_day
        <chr>        <chr>           <chr>     <dbl> <dbl>         <dbl>
 1       S1-B    coculture  Radius_mm_day7  7.333333     7     1.0476190
 2       S1-B         solo  Radius_mm_day7 11.000000     7     1.5714286
 3       S2-B    coculture  Radius_mm_day7  7.666667     7     1.0952381
 4       S2-B         solo  Radius_mm_day7  5.350000     7     0.7642857
 5 T15-L1-F11         solo  Radius_mm_day7 20.000000     7     2.8571429
 6 T15-L1-F15    coculture  Radius_mm_day7 27.500000     7     3.9285714
 7 T15-L1-F15         solo  Radius_mm_day7 42.666667     7     6.0952381
 8 T15-L1-F18         solo  Radius_mm_day7 12.500000     7     1.7857143
 9 T15-L1-F22    coculture  Radius_mm_day7 18.750000     7     2.6785714
10 T15-L1-F22         solo  Radius_mm_day7 55.000000     7     7.8571429
# ... with 24 more rows

Combining tables

Merge the columns of two dataframes using a join function. left_join keeps all of the rows of the left dataframe. right_join keeps all of the rows of the right dataframe. full_join keeps all rows in both dataframes. The function will use the column names that are the same in both dataframes to match up the rows.

# Select some columns from the site data
sites_light <- sites %>%
  select(Site, Canopy_openness_pct)

# Add infection prevalence data to the abbreviated site data
inf_sites <- left_join(sites_light, infection)

inf_sites

   Site Canopy_openness_pct       Date Num_Eprunastri Num_infected Num_healthy
1     5               62.41 2018-02-08             16            2          14
2     8               60.28       <NA>             NA           NA          NA
3     9               71.68 2018-02-08              9            0           9
4    10               76.39 2018-02-08             12            0          12
5    10               76.39 2018-02-22             10            0           9
6    11               58.11       <NA>             NA           NA          NA
7    13               41.91 2018-02-22             18            0           9
8    15               57.49 2018-02-08             16            0           9
9    18               47.64 2018-02-08             24            6           3
10   19               25.59 2018-02-08              0            0           0
11   28               48.71 2018-02-22             10            1           8
12   37               75.38 2018-02-08             25            1           8
13   37               75.38 2018-02-22              9            0           9
14   38               58.71 2018-02-22              5            0           5
15   45               42.81 2018-02-08              4            3           1
16   47               15.67 2018-02-08             16            0           9
17   49               43.74 2018-02-08              0            0           0
18   51               22.41       <NA>             NA           NA          NA
19   52               32.40       <NA>             NA           NA          NA
20   54               29.60 2018-02-22              0            0           0
21   58               36.06 2018-02-22             17            1           8
22   61                  NA 2018-02-08              0            0           0
23   61                  NA 2018-02-22              3            0           3
24   65               45.88       <NA>             NA           NA          NA
25   67               29.61 2018-02-22              9            0           9
26   72               30.28 2018-02-22              8            0           8
27   74               28.04 2018-02-08             20            0           9
28   75               25.93 2018-02-08              7            1           6
29   75               25.93 2018-02-22              5            0           5
30   76               31.15 2018-02-22              0            0           0
31   79               72.15 2018-02-22              9            0           9
32   87               20.47       <NA>             NA           NA          NA
33   92               66.55 2018-02-22             10            3           7
34   94               25.46       <NA>             NA           NA          NA
35   95               51.15       <NA>             NA           NA          NA
36   97               45.26 2018-02-22              9            0           9
37   98               40.20 2018-02-08              0            0           0
38   99               33.04 2018-02-08              1            0           1
39  103               28.58 2018-02-08              3            0           3
40  103               28.58 2018-02-22              3            0           3
41  104               36.58       <NA>             NA           NA          NA
42  106               57.46 2018-02-08              0            0           0
43  108               53.13 2018-02-08              5            1           4
44  109               48.09       <NA>             NA           NA          NA
45  110               52.65 2018-02-08             35            0           9
46  110               52.65 2018-02-22             13            1           8
47  111               32.07       <NA>             NA           NA          NA
48  112               29.93 2018-02-08             12            6           6
49  114               63.93 2018-02-22              4            0           4
50   43               70.00       <NA>             NA           NA          NA

Data visualization

We will use functions from the ggplot2 package for data visualization.

# Load the package
library(ggplot2)

Scatterplot

A scatterplot shows how two variable relate to one another. Each point corresponds to a row in a dataframe. You can include a linear regression line using geom_smooth(method = "lm") or a logistic regression line using geom_smooth(method = "glm").

# Plot the number of ascomata versus a lichen's initial mass
lichens %>%
  ggplot(aes(x = Mass_mg_init, y = Ascomata)) +
    geom_point() +
    labs(x = "Initial mass (mg)",
         y = "Number of ascomata"
    )

# Create the same plot with a linear regression line.
lichens %>%
  ggplot(aes(x = Mass_mg_init, y = Ascomata)) +
    geom_point() +
    geom_smooth(method = "lm",
                se = TRUE,
                level = 0.95) +
    labs(x = "Initial mass (mg)",
         y = "Number of ascomata"
    )

# Create a plot of infectedion status versus initial mass
# including a logistic regression line
lichens %>%
  mutate(Infected = as.numeric(Infected)) %>% # convert to 0 or 1
  ggplot(aes(x = Mass_mg_init, y = Infected)) +
    geom_point() +
    geom_smooth(method = "glm",
                method.args = list(family = "binomial"),
                se = TRUE,
                level = 0.95) +
    labs(x = "Initial mass (mg)",
         y = "Infected"
    )

Histogram

A histogram shows the frequency of observed X values.

# Create a histogram that counts how many lichens have different numbers of ascomata
lichens %>%
  ggplot(aes(x = Ascomata)) +
    geom_histogram(binwidth = 5, boundary = 5) +
    labs(x = "Num. ascomata",
         y = "Number of lichens"
    )

Boxplot

A boxplot shows the distribution (median and quartiles) of Y values in different X categories.

The following boxplot uses data from the mean_growth_long dataframe created above in the Manipulating data -> Stacking columns section. Here are the first 10 rows…

IsolateID	Culture_type	Measurement_day	Radius_mm	Days	Growth_mm_day
S1-B	coculture	Radius_mm_day7	7.33	7	1.05
S1-B	solo	Radius_mm_day7	11.00	7	1.57
S2-B	coculture	Radius_mm_day7	7.67	7	1.10
S2-B	solo	Radius_mm_day7	5.35	7	0.76
T15-L1-F11	solo	Radius_mm_day7	20.00	7	2.86
T15-L1-F15	coculture	Radius_mm_day7	27.50	7	3.93
T15-L1-F15	solo	Radius_mm_day7	42.67	7	6.10
T15-L1-F18	solo	Radius_mm_day7	12.50	7	1.79
T15-L1-F22	coculture	Radius_mm_day7	18.75	7	2.68
T15-L1-F22	solo	Radius_mm_day7	55.00	7	7.86

# Boxplot of the growth rates of fungal isolates growing solo and in co-culture
mean_growth_long %>%
  ggplot(aes(x = Culture_type, y = Growth_mm_day)) +
    geom_boxplot() +
    labs(x = "Culture type",
         y = "Growth rate (mm / day)"
    )

Barchart

Barcharts can show the number of observations in different categories.

# Plot the number of infected vs. uninfected lichens on trees vs. shrubs
lichens %>%
  ggplot(aes(x = Substrate_type)) +
    geom_bar(aes(fill = Infected),
             position = "dodge" # remove this if you want the bars stacked
    ) +
    labs(x = "Host plant",
         y = "Num. lichens"
    )

Barcharts can also show the mean and standard devation of observations in different categories.

# Calculate mean and sd of growth on day 7 across taxa 
rad7_summary <- interactions %>%
  group_by(TaxonID, Culture_type) %>%
  summarise(Radius_mm = mean(Radius_mm_day7),
            Radius_sd = sd(Radius_mm_day7)
  )


# Plot the mean and sd of different taxa 
# growing solo versus in co-culture
rad7_summary %>%
  ggplot(aes(x = TaxonID, y = Radius_mm, fill = Culture_type)) +
    geom_col(position = "dodge") +
    geom_errorbar(aes(ymin = Radius_mm - Radius_sd,
                      ymax = Radius_mm + Radius_sd),
                  position = position_dodge(width = 0.9),
                  width = 0.5
    ) +
    labs(x = "",
         y = "Colony radius day 7 (mm)",
         fill = "Culture type"
    ) +
    theme(axis.text.x = element_text(angle=90, 
                                     vjust = 0.5, 
                                     hjust = 1
                        )
    )

Saving Figures

To save a figure, click the “Export” button in the Plots window of RStudio.

Alternatively, save the figure directly from your code by running the following immediately after generating a figure. Change the extension to .pdf or .svg if you prefer a vector format image.

# Save the last plot as my_file_name.png in the current working directory.
ggsave("my_file_name.png")

Statistics

Y categorical

We’ll use the lichens dataframe downloaded above to demonstrate analyses for when Y is categorical. Here are the first 10 rows…

Team	LichenID	Site	Substrate_type	Substrate_status	Ascomata	Experiment	Mass_mg_init	Infected
1	G1-11	9	shrub	living	0	infection	NA	FALSE
1	G1-3	5	tree	dead	10	growth	118	TRUE
1	G1-4	5	tree	dead	14	growth	121	TRUE
1	G1-6	99	tree	living	2	growth	46	TRUE
1	G1-7	112	tree	living	0	infection	NA	FALSE
1	G1-8	112	tree	living	0	growth	171	FALSE
1	G1-9	112	tree	living	3	growth control	66	TRUE
2	G2-2	106	shrub	living	4	growth control	90	TRUE
2	G2-3	106	shrub	living	3	growth	28	TRUE
2	G2-4	47	tree	living	0	infection	NA	FALSE

X categorical

To apply the following tests, first create a contingeny table of the data that tabulates how many observations are from each combination of X and Y categories. These tests evaluate whether the counts in the rows of the contingency table are different from one another.

# Count the number of infected vs. uninfected lichens on trees vs shrubs
lichen_count <- table(lichens$Substrate_type, lichens$Infected)
  
# Display table 
lichen_count

       
        FALSE TRUE
  shrub     4    5
  tree     13   10

Chi-square test

Use this test when you have many observations and the expected number of observations in each cell of the table is greater than five. This test is not valid when you have few observations.

# Test whether infection frequency is different between trees and shrubs
chisq.test(lichen_count)


    Pearson's Chi-squared test with Yates' continuity correction

data:  lichen_count
X-squared = 0.049105, df = 1, p-value = 0.8246

Fisher’s exact test

Use this test when you have few observations and when you know ahead of time how many observations are in each category of X and each category of Y (e.g. the marginal sums are fixed). Statisticians recommend not using this test and instead using unconditional tests (see Lydersen et al. 2009).

# Fisher's exact test with a two-sided hypothesis
fisher.test(lichen_count)


    Fisher's Exact Test for Count Data

data:  lichen_count
p-value = 0.6989
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.09605767 3.78195930
sample estimates:
odds ratio 
 0.6249097

Unconditional exact tests

Use this test when you have few observations and when you do not necessarily know ahead of time how many observations are in each category of X and each category of Y (e.g. the marginal sums are random).

# Load the package containing the function
library(Exact)

# Barnard's test used when the number of observations in each category of X is known
# X categories should be in the rows of the table, otherwise, specify, cond.row = FALSE
# method = "CSM" specifies Barnard's method, but you can use other options
barnard_test <- exact.test(lichen_count, 
                           model = "binomial", 
                           method = "CSM", 
                           cond.row = TRUE, 
                           to.plot = FALSE)

# Print test results
barnard_test


    csm

data:  4 out of 9 vs. 13 out of 23
test statistic = NA, first sample size = 9, second sample size = 23, p-value = 0.6508
alternative hypothesis: true difference in proportion is not equal to 0
sample estimates:
difference in proportion 
              -0.1207729

# Boschloo's test used when the number of observations in each category is unknown for X or Y
boschloo_test <- exact.test(lichen_count, 
                            model = "multinomial", 
                            method = "boschloo")

# Print test results
boschloo_test


    boschloo

data:  x11=4, x12=5, x21=13, x22=10
test statistic = 0.69886, total sample size = 32, p-value = 0.6139
alternative hypothesis: true difference in product of proportion is not equal to 0
sample estimates:
difference in product of proportion 
                        -0.02441406

X continuous

Logistic regression

Use this form of logistic regression when the dependent response variable has two outcomes (e.g. False / True, Absent / Present).

# Fit a logistic regression model with:
# Y = whether a lichen is infected
# X = the initial mass of the lichen
mod_infVmass <- glm(Infected ~ Mass_mg_init, 
                   data = lichens, 
                   family = binomial(link = "logit")
)

# View the results
summary(mod_infVmass)


Call:
glm(formula = Infected ~ Mass_mg_init, family = binomial(link = "logit"), 
    data = lichens)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7169  -1.2013   0.6215   1.0695   1.2216  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.232116   0.657943  -0.353    0.724
Mass_mg_init  0.005853   0.004471   1.309    0.190

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 31.755  on 23  degrees of freedom
Residual deviance: 29.402  on 22  degrees of freedom
  (8 observations deleted due to missingness)
AIC: 33.402

Number of Fisher Scoring iterations: 4

# Exponentiate the mode coeffifients (intercept and slope)
# to see the effect of mass on the odds of being infected
exp(coef(mod_infVmass))

 (Intercept) Mass_mg_init 
   0.7928544    1.0058706

Y continuous

X categorical

We’ll use the interactions dataframe downloaded above to demonstrate ANOVA and t-tests. Here are the first 10 rows…

IsolateID	TaxonID	Radius_mm_day7	Radius_mm_day14	Competitor_isolate	Competitor_taxon	Culture_type	PlateID
S2-B	NA	7.50	7.5	T32-L3-F17	Cladosporium_sp1	coculture	CJ-P1
S2-B	NA	6.00	8.5	T24-L1-F13	Nemania_serpens_gen2	coculture	CJ-P10
S2-B	NA	4.00	5.5	T26-L3-F7	Nemania_serpens_gen2	coculture	CJ-P11
T24-L1-F13	Nemania_serpens_gen2	10.00	20.0	NA	NA	solo	CJ-P12
T26-L3-F7	Nemania_serpens_gen2	9.00	20.0	NA	NA	solo	CJ-P13
S2-B	NA	6.50	9.0	T24-L1-F11	Cladosporium_sp1	coculture	CJ-P2
S2-B	NA	5.70	9.0	NA	NA	solo	CJ-P3
T32-L3-F17	Cladosporium_sp1	15.75	25.0	NA	NA	solo	CJ-P4
T24-L1-F11	Cladosporium_sp1	10.00	21.0	NA	NA	solo	CJ-P5
S2-B	NA	6.00	8.5	T15-L1-F18	Nemania_serpens_gen1	coculture	CJ-P6

ANOVA

Use ANOVA when you want to test whether a normally-distributed dependent variable differs among categories. ANOVA assumes that the variance within groups is the same across groups.

# Replace unknown (NA) taxon names with their isolate ID (e.g. S1-A)
int_noNAtaxa <- interactions %>%
  mutate(TaxonID = ifelse(is.na(TaxonID), IsolateID, TaxonID))

# ANOVA testing whether day 7 radius differs among taxa
# Fit model
mod_rad7Vtaxa <- lm(Radius_mm_day7 ~ factor(TaxonID), data = int_noNAtaxa)

# Perform ANOVA
anova(mod_rad7Vtaxa)

Analysis of Variance Table

Response: Radius_mm_day7
                Df Sum Sq Mean Sq F value    Pr(>F)    
factor(TaxonID)  7 4443.4  634.77  8.0596 2.062e-06 ***
Residuals       47 3701.7   78.76                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Post-hoc t-test

ANOVA only tells you whether the categories are different from one another (variance between categories > variance within categories). If you want to know which categories differ from one another, use Tukey’s Honest Significant Difference test after performing an ANOVA to determine whether the pairs of categories have different means.

# First fit an AOV (analysis of variance) object
aov_rad7Vtaxa <- aov(mod_rad7Vtaxa)

# Then computed Tukey's HSD
TukeyHSD(aov_rad7Vtaxa)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = mod_rad7Vtaxa)

$`factor(TaxonID)`
                                                     diff         lwr       upr     p adj
Nemania_serpens_gen1-Cladosporium_sp1           0.3125000 -24.0608489 24.685849 1.0000000
Nemania_serpens_gen2-Cladosporium_sp1          -4.2708333 -25.7661066 17.224440 0.9982478
Nigrospora_sp1-Cladosporium_sp1                 7.0625000 -24.4033581 38.528358 0.9962309
Pezizomycetes_sp5-Cladosporium_sp1             10.9910714  -4.9650309 26.947174 0.3793044
S1-B-Cladosporium_sp1                          -4.9375000 -21.3700085 11.495008 0.9787338
S2-B-Cladosporium_sp1                          -5.6920455 -22.1245539 10.740463 0.9539829
Sordariomycetes_sp1_gen1-Cladosporium_sp1      17.1736111   0.2612206 34.086002 0.0442053
Nemania_serpens_gen2-Nemania_serpens_gen1      -4.5833333 -30.2750989 21.108432 0.9991229
Nigrospora_sp1-Nemania_serpens_gen1             6.7500000 -27.7191205 41.219121 0.9984024
Pezizomycetes_sp5-Nemania_serpens_gen1         10.6785714 -10.5962316 31.953375 0.7526636
S1-B-Nemania_serpens_gen1                      -5.2500000 -26.8844027 16.384403 0.9939164
S2-B-Nemania_serpens_gen1                      -6.0045455 -27.6389481 15.629857 0.9864880
Sordariomycetes_sp1_gen1-Nemania_serpens_gen1  16.8611111  -5.1400025 38.862225 0.2511248
Nigrospora_sp1-Nemania_serpens_gen2            11.3333333 -21.1644652 43.831132 0.9523310
Pezizomycetes_sp5-Nemania_serpens_gen2         15.2619048  -2.6435101 33.167320 0.1471494
S1-B-Nemania_serpens_gen2                      -0.6666667 -18.9978973 17.664564 1.0000000
S2-B-Nemania_serpens_gen2                      -1.4212121 -19.7524427 16.910018 0.9999968
Sordariomycetes_sp1_gen1-Nemania_serpens_gen2  21.4444444   2.6818317 40.207057 0.0150268
Pezizomycetes_sp5-Nigrospora_sp1                3.9285714 -25.2031524 33.060295 0.9998617
S1-B-Nigrospora_sp1                           -12.0000000 -41.3953648 17.395365 0.8963273
S2-B-Nigrospora_sp1                           -12.7545455 -42.1499102 16.640819 0.8633919
Sordariomycetes_sp1_gen1-Nigrospora_sp1        10.1111111 -19.5551844 39.777407 0.9577291
S1-B-Pezizomycetes_sp5                        -15.9285714 -27.2680796 -4.589063 0.0012500
S2-B-Pezizomycetes_sp5                        -16.6831169 -28.0226250 -5.343609 0.0006367
Sordariomycetes_sp1_gen1-Pezizomycetes_sp5      6.1825397  -5.8418549 18.206934 0.7297755
S2-B-S1-B                                      -0.7545455 -12.7551529 11.246062 0.9999993
Sordariomycetes_sp1_gen1-S1-B                  22.1111111   9.4613602 34.760862 0.0000343
Sordariomycetes_sp1_gen1-S2-B                  22.8656566  10.2159057 35.515407 0.0000180

t-test

Use a t-test whether you want to test whether the mean of a normally-distributed dependent variable differs between two categories.

The arguments in the t.test function can be used to specify whether the variances should be assumed equal among groups (Student’s t-test, var.equal = TRUE) or unequal (Welch’s t-test, var.equal = FALSE). The paired arguments determines whether you want an independent sample or dependent sample test. Use a dependent sample test (paired = TRUE) when you have measured the same individuals in both categories.

We’ll use the interactions data to demonstrate and test whether the radius measured on day 7 differed from the radius measured on day 14.

# Student's t-test, two-sided hypothesis
t.test(interactions$Radius_mm_day14, 
       interactions$Radius_mm_day7,
       var.equal = TRUE)


    Two Sample t-test

data:  interactions$Radius_mm_day14 and interactions$Radius_mm_day7
t = 1.8523, df = 102, p-value = 0.06688
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.3311459  9.6784186
sample estimates:
mean of x mean of y 
 21.00000  16.32636

# Student's t-test, one-sided hypothesis
t.test(interactions$Radius_mm_day14, 
       interactions$Radius_mm_day7,
       var.equal = TRUE,
       alternative = "greater")


    Two Sample t-test

data:  interactions$Radius_mm_day14 and interactions$Radius_mm_day7
t = 1.8523, df = 102, p-value = 0.03344
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 0.4852784       Inf
sample estimates:
mean of x mean of y 
 21.00000  16.32636

# Welch's t-test (unequal variance), one-sided hypothesis
t.test(interactions$Radius_mm_day14, 
       interactions$Radius_mm_day7,
       var.equal = FALSE,
       alternative = "greater")


    Welch Two Sample t-test

data:  interactions$Radius_mm_day14 and interactions$Radius_mm_day7
t = 1.8425, df = 97.82, p-value = 0.03422
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 0.4614567       Inf
sample estimates:
mean of x mean of y 
 21.00000  16.32636

Since measurements are not independent- we measured the same plates on day 7 and day 14, a paired test is more appropriate. Note that the two columns must have “individuals” in the same order.

# Paired Welch's t-test treating plates as individuals
t.test(interactions$Radius_mm_day14, 
       interactions$Radius_mm_day7,
       var.equal = FALSE,
       alternative = "greater",
       paired = TRUE)


    Paired t-test

data:  interactions$Radius_mm_day14 and interactions$Radius_mm_day7
t = 6.3926, df = 48, p-value = 3.142e-08
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 4.035136      Inf
sample estimates:
mean of the differences 
               5.470408

X continuous

We’ll use the inf_sites dataframe created above in the Manipulating Data section to demonstrate fitting linear regression models. Here are the first 20 rows…

Site	Canopy_openness_pct	Date	Num_Eprunastri	Num_infected	Num_healthy
5	62.41	2018-02-08	16	2	14
8	60.28	NA	NA	NA	NA
9	71.68	2018-02-08	9	0	9
10	76.39	2018-02-08	12	0	12
10	76.39	2018-02-22	10	0	9
11	58.11	NA	NA	NA	NA
13	41.91	2018-02-22	18	0	9
15	57.49	2018-02-08	16	0	9
18	47.64	2018-02-08	24	6	3
19	25.59	2018-02-08	0	0	0
28	48.71	2018-02-22	10	1	8
37	75.38	2018-02-08	25	1	8
37	75.38	2018-02-22	9	0	9
38	58.71	2018-02-22	5	0	5
45	42.81	2018-02-08	4	3	1
47	15.67	2018-02-08	16	0	9
49	43.74	2018-02-08	0	0	0
51	22.41	NA	NA	NA	NA
52	32.40	NA	NA	NA	NA
54	29.60	2018-02-22	0	0	0

Linear regression

Use normal linear regression when the dependent variable is continuous and (mostly) normally distributed.

# Linear regression of number of E prunastri vs. canopy openness
mod_eprunVcanopy <- lm(Num_Eprunastri ~ Canopy_openness_pct, data = inf_sites)

# View the results
summary(mod_eprunVcanopy)


Call:
lm(formula = Num_Eprunastri ~ Canopy_openness_pct, data = inf_sites)

Residuals:
    Min      1Q  Median      3Q     Max 
-10.692  -5.383  -2.009   4.355  24.815 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)          4.63260    3.79711   1.220    0.231
Canopy_openness_pct  0.10545    0.07611   1.386    0.175

Residual standard error: 8.038 on 34 degrees of freedom
  (14 observations deleted due to missingness)
Multiple R-squared:  0.05345,   Adjusted R-squared:  0.02561 
F-statistic:  1.92 on 1 and 34 DF,  p-value: 0.1749

Count (Poisson) regression

Poisson regression is useful when the dependent variable is a discrete count. The number of E. prunastri is a count, not actually continuous and since there are many zeros and low counts (see histogram below) we may want to use Poisson regression instead.

# Histogram to view the distribution of Num_Eprunastri
inf_sites %>%
  ggplot(aes(x = Num_Eprunastri)) +
    geom_histogram(binwidth = 5, boundary = 5)

# Fit a Poisson regression
mod_eprunVcanopy_pois <- glm(Num_Eprunastri ~ Canopy_openness_pct, 
                             data = inf_sites,
                             family = "poisson"
)

# View the results
summary(mod_eprunVcanopy_pois)


Call:
glm(formula = Num_Eprunastri ~ Canopy_openness_pct, family = "poisson", 
    data = inf_sites)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.5922  -1.9392  -0.6726   1.3738   6.1363  

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)         1.732272   0.160744  10.777  < 2e-16 ***
Canopy_openness_pct 0.010847   0.003022   3.589 0.000332 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 261.88  on 35  degrees of freedom
Residual deviance: 249.06  on 34  degrees of freedom
  (14 observations deleted due to missingness)
AIC: 375.2

Number of Fisher Scoring iterations: 5

# Create a scatterplot
# fitted Poisson regression line in blue
# fitted linear regression line in red
inf_sites %>%
  ggplot(aes(x = Canopy_openness_pct, y = Num_Eprunastri)) +
    geom_point() +
    geom_smooth(method = "glm",
                method.args = list(family = "poisson"),
                color = "blue") +
    geom_smooth(method = "lm",
                color = "red")

R Summary Sheet

BIO 46, Winter 2018

Data manipulation

Loading Data

Manipulating Data

Subsetting

Summarizing

Stacking columns

Calculating new columns

Combining tables

Data visualization

Scatterplot

Histogram

Boxplot

Barchart

Saving Figures

Statistics

Y categorical

X categorical

Chi-square test

Fisher’s exact test

Unconditional exact tests

X continuous

Logistic regression

Y continuous

X categorical

ANOVA

t-test

X continuous

Linear regression

Count (Poisson) regression