Quantitative Microbiome Measurement: Estimating bias from mock communities using the fido R package

Michael R. McLaren

Estimating bias from mock communities using the fido R package

mock communities bias estimation R computational workflow

A basic demonstration of how to use the fido R package for estimating bias from mock communities.

Author

Affiliation

Michael R. McLaren

North Carolina State University

Published

Feb. 6, 2021

Citation

McLaren, 2021

Status: Draft.

This post aims to give a minimum working example of how to use the fido R package (Silverman et al. 2019) to estimate bias from mock communities using the bias modeling framework of McLaren, Willis, and Callahan (2019). It is meant to supplement the vignettes included in fido package, particularly the introduction to pibble models and the vignette on setting priors in pibble models. I therefore gloss over many important details about using fido and pibble models in general.

Bias estimation using fido’s pibble models has several advantages over the method used in McLaren, Willis, and Callahan (2019) (as implemented in the metacal package), including

Pibble models can be used to model the log relative efficiencies as linear functions of sample predictor variables, something not yet supported by metacal.
Pibble models model the random counting error inherent in the sequencing process, which removes the need for pseudocounts and should improve accuracy in cases where taxa have small read counts (say, ).
The multivariate Normal prior on the coefficients provides a natural way to regularize efficiency estimates and (perhaps) account for phylogenetic relatedness among taxa

Pibble models do require careful selection of complex, multivariate priors, making their use more difficult at first. The scale of variation in efficiencies and in phylogenetic conservation for a given protocol and set of taxa are not typically known ahead of time, so it may often be warranted to experiment with different priors or use rather weakly informative ones. Hierarchical modeling of the efficiency parameters would be useful to overcome this limitation, but is not supported by fido; an “Empirical Bayes” approach may be useful to achieve a similar effect.

R setup

library(tidyverse)

library(metacal)
library(fido)

library(ggdist)
library(cowplot)
library(patchwork)

# Other packages used: driver, data.table

I will use the data from the cell-mixtures experiment of Brooks et al. (2015), which is included in the metacal package. Each sample is a cellular mock community consisting of an even mixture of one to seven of a set of seven bacterial species. The following code chunk imports and lightly cleans the data.

Show code

# The sample data and the observed and actual species abundances are stored as
# `.csv` files. The abundances are stored in standard OTU-table format with
# samples as rows and taxa as columns.
sam <- system.file(
  "extdata", "brooks2015-sample-data.csv", package = "metacal"
) %>%
  read_csv(col_types = "cffcic") %>%
  dplyr::rename_with(str_to_lower)
observed <- system.file(
  "extdata", "brooks2015-observed.csv", package = "metacal"
) %>%
  read_csv %>%
  select(-Other) %>%
  column_to_rownames("Sample") %>%
  as("matrix")
actual <- system.file(
  "extdata", "brooks2015-actual.csv", package = "metacal"
) %>%
  read_csv %>%
  column_to_rownames("Sample") %>%
  as("matrix")
stopifnot(setequal(colnames(actual), colnames(observed)))
stopifnot(setequal(rownames(actual), rownames(observed)))
stopifnot(setequal(rownames(actual), sam$sample))
# For working with fido, we want to make sure that the corresponding entries of
# `actual` and `observed` exactly match up, both are oriented with taxa as
# rows, and the order of samples also matches those in the sample data.
# (Currently, the abundance matrices are oriented with taxa as columns.)
observed <- observed %>% t
observed <- observed[, sam$sample]
actual <- actual %>% t
actual <- actual[rownames(observed), colnames(observed)]
stopifnot(identical(dimnames(observed), dimnames(actual)))
stopifnot(identical(colnames(observed), sam$sample))

Our starting point is a (tibble) data frame sam with the sample metadata

sam %>% glimpse

#> Rows: 80
#> Columns: 6
#> $ sample       <chr> "s1-1", "s1-2", "s1-3", "s1-4", "s1-5", "s1-6"…
#> $ plate        <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ barcode      <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
#> $ mixture_type <chr> "Cells", "Cells", "Cells", "Cells", "Cells", "…
#> $ num_species  <int> 3, 3, 3, 3, 2, 3, 2, 2, 3, 3, 3, 2, 3, 2, 3, 2…
#> $ species_list <chr> "Lactobacillus_crispatus;Prevotella_bivia;Stre…

a matrix observed with the observed counts

observed %>% corner

#>                          s1-1  s1-2  s1-3  s1-4  s1-5
#> Atopobium_vaginae           1  2930  2073  2217     0
#> Gardnerella_vaginalis       1  1650  1514     0   754
#> Lactobacillus_crispatus 13670 22645     3 16037     3
#> Lactobacillus_iners         0     0     1     0     0
#> Prevotella_bivia         7544     1 15772     5 14670

and a matrix actual with the nominal actual proportions of each species in each mock sample,

actual %>% corner %>% round(2)

#>                         s1-1 s1-2 s1-3 s1-4 s1-5
#> Atopobium_vaginae       0.00 0.33 0.33 0.33  0.0
#> Gardnerella_vaginalis   0.00 0.33 0.33 0.00  0.5
#> Lactobacillus_crispatus 0.33 0.33 0.00 0.33  0.0
#> Lactobacillus_iners     0.00 0.00 0.00 0.00  0.0
#> Prevotella_bivia        0.33 0.00 0.33 0.00  0.5

I have ensured that the corresponding entries of actual and observed exactly match up—that is, the rows and columns represent the taxa and samples in exactly the same order—and oriented both with taxa as rows. I have also ensured that the set and order of samples in sam exactly matches those in these abundance matrices.

n_samples <- ncol(observed)
n_taxa <- nrow(observed)

Samples differ in the number and identify of taxa they contain, with most samples containing just a subset of the 7 taxa,

sam %>% count(num_species)

#> # A tibble: 5 x 2
#>   num_species     n
#>         <int> <int>
#> 1           1     9
#> 2           2    25
#> 3           3    43
#> 4           4     1
#> 5           7     2

These samples have all been measured by the same protocol, but are split across two sequencing plates,

sam %>% count(plate)

#> # A tibble: 2 x 2
#>   plate     n
#>   <fct> <int>
#> 1 1        40
#> 2 2        40

Some entries in the count matrix have small positive values despite the corresponding entry being 0 in the matrix of actual proportions,

actual %>% corner %>% round(2)

#>                         s1-1 s1-2 s1-3 s1-4 s1-5
#> Atopobium_vaginae       0.00 0.33 0.33 0.33  0.0
#> Gardnerella_vaginalis   0.00 0.33 0.33 0.00  0.5
#> Lactobacillus_crispatus 0.33 0.33 0.00 0.33  0.0
#> Lactobacillus_iners     0.00 0.00 0.00 0.00  0.0
#> Prevotella_bivia        0.33 0.00 0.33 0.00  0.5

observed %>% corner

#>                          s1-1  s1-2  s1-3  s1-4  s1-5
#> Atopobium_vaginae           1  2930  2073  2217     0
#> Gardnerella_vaginalis       1  1650  1514     0   754
#> Lactobacillus_crispatus 13670 22645     3 16037     3
#> Lactobacillus_iners         0     0     1     0     0
#> Prevotella_bivia         7544     1 15772     5 14670

Such behavior is expected due to index hopping and other types of cross-sample contamination, but is not accounted for by the framework of McLaren2019 and we must set these counts to zero before proceeding. (Note: The metacal estimate_bias() function will do this automatically).

observed[actual == 0] <- 0

Estimation using `metacal::estimate_bias()`

We estimate bias using metacal as a point of comparison. Use of metacal::estimate_bias() is explained in the metacal tutorial. Entries that are non-zero in the observed matrix but zero in the actual matrix will be automatically zero’d by the function (see above). Metacal also requires that all entries in observed corresponding to positive entries in actual are positive, and so I add a pseudocount prior; fido will not require this step.

mc_fit <- estimate_bias(
  observed,
  actual, 
  margin = 2, # samples as columns
  boot = TRUE
)
summary(mc_fit)

#> Summary of a metacal bias fit.
#> 
#> Estimated relative efficiencies:
#> # A tibble: 7 x 4
#>   taxon                    estimate gm_mean gm_se
#>   <chr>                       <dbl>   <dbl> <dbl>
#> 1 Atopobium_vaginae           0.285   0.286  1.04
#> 2 Gardnerella_vaginalis       0.160   0.160  1.05
#> 3 Lactobacillus_crispatus     2.29    2.29   1.03
#> 4 Lactobacillus_iners         4.68    4.68   1.02
#> 5 Prevotella_bivia            1.79    1.79   1.04
#> 6 Sneathia_amnii              4.59    4.58   1.04
#> 7 Streptococcus_agalactiae    0.250   0.250  1.03
#> 
#> Geometric standard error estimated from 1000 bootstrap replicates.

The estimated efficiencies are geometrically centered, so that the estimates of individual taxa should be interpreted as being relative to all taxa.

We can visualize the interval estimates from the bootstrap replicates by first creating a tidy data frame with the replicates

# Convert stored bootstrap results into a tidy data frame
mc_boot <- mc_fit$bootrep %>%
  as_tibble %>%
  mutate(.draw = row_number()) %>%
  pivot_longer(cols = -.draw, names_to = "taxon", values_to = "estimate") %>%
  mutate(
    across(taxon, fct_reorder, estimate)
  )

and plotting with geoms such as stat_interval from the ggdist package,

mc_boot %>%
  ggplot(aes(y = taxon, x = estimate)) +
  stat_interval(.width = c(0.5, 0.9)) +
  scale_color_brewer() +
  theme_minimal_hgrid() +
  scale_x_log10() +
  plot_annotation(title = "Metacal bias estimate")

Estimation using `fido::pibble()`

The pibble model defines a matrix of parameters , such that the observed counts for sample are multinomially distributed, , and the expected ALR transform of is given by the linear model

where

is a matrix of regression coefficients and

is the vector of covariates for sample

. Compare this with the MWC model in the case where all taxa are assumed present, for which the expected ALR library composition is

where

is the vector of relative efficiencies that we wish to estimate. To fit the MWC model in a linear regression function that supports sample-specific offsets, (such as base::lm()), we can simply set the offsets equal the actual ALR abundances and use a constant (intercept) term for

. Fido does not support offsets, so we instead take the approach recommended by Justin Silverman of adding indicator (dummy) variables to our model, one for each mock sample, and setting the priors on the corresponding coefficients to reflect our knowledge of the true sample compositions.

A further wrinkle is that pibble models assume that all taxa are in all samples (though potentially at very low frequencies). Therefore, to use the above approach we must treat samples for which a taxon was not added as having a very small amount of that taxon. By setting the frequency low enough such that it would have been very unlikely to yield any sequencing reads anyways, the estimates we obtain should be approximately equivalent to what we’d get using metacal::estimate_bias() and treating these entries as true biological zeros. This is a valid approach to dealing with biological zeros in pibble models (but not with the metacal estimator) thanks to the multinomial layer in the probability model.

In practice, we set our design matrix to have columns: one for each sample, and a final column for the protocol effect. The first columns correspond to indicator variables that are 0 except for sample . The -th column is 1 everywhere, like a standard intercept term, that will capture the bias.

# Design matrix
X <- model.matrix(~ 0 + sample,
  data = sam
) %>% 
  cbind(bias = 1) %>%
  t
dim(X)

#> [1] 81 80

X[c(1:3, nrow(X)), 1:3]

#>             1 2 3
#> samples1-1  1 0 0
#> samples1-10 0 0 0
#> samples1-11 0 0 0
#> bias        1 1 1

Next, we set our prior on the coefficients to reflect the actual ALR abundances in the samples. To avoid zeros in the actual abundances prior to ALR transformation, we will add a small value chosen to give a proportion that is less than 1 divided by the maximum sample read depth and so is expected to have negligible effect on the resulting estimates. (Note, setting the zero values to smaller numbers may create numerical issues that result in warnings and require experimentation to achieve reliable fits.) The mean for the bias term is set to 0 (corresponding to all taxa having equal efficiency). The matrix Theta denotes the prior mean of the coefficient matrix,

colSums(observed) %>% max %>% {1 / .}

#> [1] 1.948103e-05

actual.alr <- (actual + 1e-7) %>% 
  # driver::alr assumes taxa are columns
  t %>%
  driver::alr() %>%
  t
nms <- rownames(X) %>% 
  str_subset("^sample") %>% 
  str_replace("^sample", "")
Theta <- cbind(actual.alr[, nms], bias = 0)

Gamma is the prior covariance between the coefficients of different covariates. In this illustration, I will set Gamma to be diagonal, to reflect the assumption that the samples were constructed independently and thus are subject to independent construction error. The first diagonal elements for the mock covariates reflect our uncertainty in the true abundances and can be adjusted to reflect the tolerance in constructing and quantifying the mocks. The final element corresponds to our uncertainty in the ALR efficiencies and should typically be much larger.

v <- c(rep(0.02, n_samples), 1)
Gamma <- diag(v)

To proceed with model fitting, it remains to set a prior on the covariance between taxa ALR values. For now I simply follow the pibble introductory vignette and suggest more consideration of this step, as well as the choice Gamma, in real applications.

upsilon <- n_taxa + 3
Omega <- diag(n_taxa)
G <- cbind(diag(n_taxa - 1), -1)
Xi <- (upsilon - n_taxa) * G %*% Omega %*% t(G)

In this dataset, each sample is an independently constructed mock; however, in general one might have technical replicates of some mocks. In that case, it would be appropriate to instead have one indicator variable for each independently constructed mock rather than each sample, each corresponding to a row in actual and with appropriate adjustments to the design matrix and priors.

Next, I construct the pibble model object for the prior and fit to the observed counts following the pibble vignette. Note the conversion to CLR coordinates.

priors <- pibble(NULL, X, upsilon, Theta, Gamma, Xi)
print(priors)

#>  pibblefit Object (Priors Only): 
#>   Number of Samples:      80 
#>   Number of Categories:       7 
#>   Number of Covariates:       81 
#>   Number of Posterior Samples:    2000 
#>   Contains Samples of Parameters:Eta  Lambda  Sigma
#>   Coordinate System:      alr, reference category: 7

priors <- to_clr(priors)
names_covariates(priors) <- rownames(X)
priors$Y <- observed
posterior <- refit(priors, optim_method="adam")
names_categories(posterior) <- rownames(observed)

We can see a summary of the (CLR) efficiency estimates as follows,

s <- summary(posterior)
s$Lambda %>%
  filter(str_detect(covariate, "bias")) %>%
  select(-Parameter) %>%
  arrange(-mean)

#> # A tibble: 7 x 8
#>   coord            covariate   p2.5    p25    p50   mean    p75  p97.5
#>   <chr>            <chr>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#> 1 clr_Lactobacill… bias       1.42   1.49   1.53   1.53   1.56   1.63 
#> 2 clr_Sneathia_am… bias       1.41   1.48   1.52   1.52   1.55   1.62 
#> 3 clr_Lactobacill… bias       0.729  0.789  0.820  0.820  0.851  0.910
#> 4 clr_Prevotella_… bias       0.459  0.515  0.549  0.550  0.582  0.652
#> 5 clr_Atopobium_v… bias      -1.35  -1.28  -1.24  -1.24  -1.21  -1.14 
#> 6 clr_Streptococc… bias      -1.47  -1.40  -1.37  -1.37  -1.33  -1.27 
#> 7 clr_Gardnerella… bias      -1.92  -1.84  -1.80  -1.80  -1.76  -1.69

I will compare the fido and metacal estimates by using the posterior samples (fido) and bootstrap replicates (metacal) to generate interval estimates. First, I put the fido posterior samples in a data frame in long (tidy) format,

fido_post <- posterior$Lambda %>%
  data.table::as.data.table() %>%
  rlang::set_names("coord", "term", ".draw", "value") %>%
  print

#>                                 coord       term .draw     value
#>       1:        clr_Atopobium_vaginae       bias     1 -1.207854
#>       2:        clr_Atopobium_vaginae       bias     2 -1.304963
#>       3:        clr_Atopobium_vaginae       bias     3 -1.253822
#>       4:        clr_Atopobium_vaginae       bias     4 -1.287016
#>       5:        clr_Atopobium_vaginae       bias     5 -1.275280
#>      ---                                                        
#> 1133996: clr_Streptococcus_agalactiae samples2-9  1996 -6.458650
#> 1133997: clr_Streptococcus_agalactiae samples2-9  1997 -6.421403
#> 1133998: clr_Streptococcus_agalactiae samples2-9  1998 -6.449342
#> 1133999: clr_Streptococcus_agalactiae samples2-9  1999 -6.332566
#> 1134000: clr_Streptococcus_agalactiae samples2-9  2000 -6.435391

(Alternately, you can use fido::pibble_tidy_samples(posterior).) To make this data frame compatible with the mc_boot data frame, we must exponentiate the CLR values and adjust the taxa and column names, This data frame has the actual abundances as well, so I first filter to just the bias coefficients.

fido_post0 <- fido_post %>%
  filter(term == "bias") %>%
  rename(taxon = coord, estimate = value) %>%
  mutate(
    across(estimate, exp),
    across(taxon, str_sub, 5) # Remove the "clr_"
  )

Now we can bind these estimates with the mc_boot estimate to view them side-by-side,

bind_rows(metacal = mc_boot, fido = fido_post0, .id = "method") %>%
  mutate(across(taxon, fct_reorder, estimate)) %>%
  ggplot(aes(y = method, x = estimate)) +
  stat_interval(.width = c(0.5, 0.9)) +
  scale_color_brewer() +
  theme_minimal_hgrid() +
  scale_x_log10() +
  facet_wrap(~taxon, ncol = 1) +
  plot_annotation("Geometrically-centered efficiency estimates")

Comparing the fido interval estimates side-by-side with the metacal estimates shows close agreement, with the fido intervals being larger and mostly containing the metacal intervals. The size of the fido intervals is likely sensitive to our choice of priors, which determine the assumed precision of the known mock abundances as well as an implicit signal-to-noise ratio in the sequencing measurements.

Multiple protocols or protocol covariates

In progress; for now just some hints as to how to include covariates in a pibble bias model. The basic idea is to replace the constant “bias” column in the design matrix while leaving the sample indicators unchanged. For example, suppose we wanted to estimate separate efficiencies for each of the two sequencing plates in the Brooks data,

sam %>% count(plate)

#> # A tibble: 2 x 2
#>   plate     n
#>   <fct> <int>
#> 1 1        40
#> 2 2        40

We could use the following design matrix,

X <- model.matrix(~ 0 + sample + I(plate == "1") + I(plate == "2"),
  data = sam
) %>% 
  t
# simplify names for the two plate indicators
rownames(X)[c(nrow(X) - 1, nrow(X))] <- str_c("plate", 1:2)
# partial view:
X[c(1:3, nrow(X) - 1, nrow(X)), c(1:2, 40:41)]

#>             1 2 40 41
#> samples1-1  1 0  0  0
#> samples1-10 0 0  0  0
#> samples1-11 0 0  0  0
#> plate1      1 1  1  0
#> plate2      0 0  0  1

The coefficients of the two plate terms will give the bias associated with each plate; differences indicate the presence of a batch effect. Alternatively, we could use a design such as

X <- model.matrix(~ 0 + sample + plate,
  data = sam
) %>% 
  cbind(bias = 1) %>%
  t
X[c(1:3, nrow(X) - 1, nrow(X)), c(1:2, 40:41)]

#>             1 2 40 41
#> samples1-1  1 0  0  0
#> samples1-10 0 0  0  0
#> samples1-11 0 0  0  0
#> plate2      0 0  0  1
#> bias        1 1  1  1

In this case, the bias coefficients give the bias in plate 1, and the plate2 coefficients gives the difference in bias of plate 2 vs. plate 1. Now, values in the plate2 coefficients that deviate from 0 indicate a batch effect. The same inferences can be draw from fitted models using either approach; however, the priors will have different interpretations and may lead one to prefer one approach versus another.

Session info

Click for session info

sessioninfo::session_info()

#> ─ Session info ──────────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.4 (2021-02-15)
#>  os       Arch Linux                  
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       America/New_York            
#>  date     2021-02-25                  
#> 
#> ─ Packages ──────────────────────────────────────────────────────────────────────
#>  package        * version    date       lib source                          
#>  ade4             1.7-16     2020-10-28 [1] CRAN (R 4.0.3)                  
#>  ape              5.4-1      2020-08-13 [1] CRAN (R 4.0.2)                  
#>  arrayhelpers     1.1-0      2020-02-04 [1] CRAN (R 4.0.2)                  
#>  assertthat       0.2.1      2019-03-21 [1] CRAN (R 4.0.0)                  
#>  backports        1.2.1      2020-12-09 [1] CRAN (R 4.0.3)                  
#>  Biobase          2.50.0     2020-10-27 [1] Bioconductor                    
#>  BiocGenerics     0.36.0     2020-10-27 [1] Bioconductor                    
#>  biomformat       1.18.0     2020-10-27 [1] Bioconductor                    
#>  Biostrings       2.58.0     2020-10-27 [1] Bioconductor                    
#>  broom            0.7.3      2020-12-16 [1] CRAN (R 4.0.3)                  
#>  cellranger       1.1.0      2016-07-27 [1] CRAN (R 4.0.0)                  
#>  cli              2.2.0      2020-11-20 [1] CRAN (R 4.0.3)                  
#>  cluster          2.1.0      2019-06-19 [2] CRAN (R 4.0.4)                  
#>  coda             0.19-4     2020-09-30 [1] CRAN (R 4.0.2)                  
#>  codetools        0.2-18     2020-11-04 [2] CRAN (R 4.0.4)                  
#>  colorspace       2.0-0      2020-11-11 [1] CRAN (R 4.0.3)                  
#>  cowplot        * 1.1.0      2020-09-08 [1] CRAN (R 4.0.2)                  
#>  crayon           1.3.4      2017-09-16 [1] CRAN (R 4.0.0)                  
#>  data.table       1.13.4     2020-12-08 [1] CRAN (R 4.0.3)                  
#>  DBI              1.1.0      2019-12-15 [1] CRAN (R 4.0.0)                  
#>  dbplyr           2.0.0      2020-11-03 [1] CRAN (R 4.0.3)                  
#>  digest           0.6.27     2020-10-24 [1] CRAN (R 4.0.3)                  
#>  distill          1.2        2021-02-03 [1] Github (rstudio/distill@a99e7d5)
#>  distributional   0.2.1      2020-10-06 [1] CRAN (R 4.0.3)                  
#>  downlit          0.2.1      2020-11-04 [1] CRAN (R 4.0.3)                  
#>  dplyr          * 1.0.2      2020-08-18 [1] CRAN (R 4.0.2)                  
#>  driver           0.1.1      2020-10-04 [1] Github (jsilve24/driver@16e4499)
#>  ellipsis         0.3.1      2020-05-15 [1] CRAN (R 4.0.0)                  
#>  evaluate         0.14       2019-05-28 [1] CRAN (R 4.0.0)                  
#>  fansi            0.4.2      2021-01-15 [1] CRAN (R 4.0.3)                  
#>  farver           2.0.3      2020-01-16 [1] CRAN (R 4.0.0)                  
#>  fido           * 0.1.13     2020-10-04 [1] Github (jsilve24/fido@4dda897)  
#>  forcats        * 0.5.0      2020-03-01 [1] CRAN (R 4.0.0)                  
#>  foreach          1.5.1      2020-10-15 [1] CRAN (R 4.0.3)                  
#>  fs               1.5.0      2020-07-31 [1] CRAN (R 4.0.2)                  
#>  generics         0.1.0      2020-10-31 [1] CRAN (R 4.0.3)                  
#>  ggdist         * 2.3.0      2020-10-31 [1] CRAN (R 4.0.3)                  
#>  ggplot2        * 3.3.2      2020-06-19 [1] CRAN (R 4.0.1)                  
#>  glue             1.4.2      2020-08-27 [1] CRAN (R 4.0.2)                  
#>  gtable           0.3.0      2019-03-25 [1] CRAN (R 4.0.0)                  
#>  haven            2.3.1      2020-06-01 [1] CRAN (R 4.0.1)                  
#>  here             1.0.1      2020-12-13 [1] CRAN (R 4.0.3)                  
#>  highr            0.8        2019-03-20 [1] CRAN (R 4.0.0)                  
#>  hms              0.5.3      2020-01-08 [1] CRAN (R 4.0.0)                  
#>  htmltools        0.5.1.1    2021-01-22 [1] CRAN (R 4.0.3)                  
#>  httr             1.4.2      2020-07-20 [1] CRAN (R 4.0.2)                  
#>  igraph           1.2.6      2020-10-06 [1] CRAN (R 4.0.3)                  
#>  IRanges          2.24.0     2020-10-27 [1] Bioconductor                    
#>  iterators        1.0.13     2020-10-15 [1] CRAN (R 4.0.3)                  
#>  jsonlite         1.7.2      2020-12-09 [1] CRAN (R 4.0.3)                  
#>  knitr            1.31       2021-01-27 [1] CRAN (R 4.0.3)                  
#>  lattice          0.20-41    2020-04-02 [2] CRAN (R 4.0.4)                  
#>  lifecycle        0.2.0      2020-03-06 [1] CRAN (R 4.0.0)                  
#>  lubridate        1.7.9.2    2020-11-13 [1] CRAN (R 4.0.3)                  
#>  magrittr         2.0.1      2020-11-17 [1] CRAN (R 4.0.3)                  
#>  MASS             7.3-53     2020-09-09 [2] CRAN (R 4.0.4)                  
#>  Matrix           1.3-2      2021-01-06 [2] CRAN (R 4.0.4)                  
#>  metacal        * 0.1.0.9010 2020-10-21 [1] local                           
#>  mgcv             1.8-33     2020-08-27 [2] CRAN (R 4.0.4)                  
#>  modelr           0.1.8      2020-05-19 [1] CRAN (R 4.0.0)                  
#>  multtest         2.46.0     2020-10-27 [1] Bioconductor                    
#>  munsell          0.5.0      2018-06-12 [1] CRAN (R 4.0.0)                  
#>  nlme             3.1-152    2021-02-04 [2] CRAN (R 4.0.4)                  
#>  nvimcom        * 0.9-102    2021-02-25 [1] local                           
#>  patchwork      * 1.1.1      2020-12-17 [1] CRAN (R 4.0.3)                  
#>  permute          0.9-5      2019-03-12 [1] CRAN (R 4.0.0)                  
#>  phyloseq         1.34.0     2020-10-27 [1] Bioconductor                    
#>  pillar           1.4.7      2020-11-20 [1] CRAN (R 4.0.3)                  
#>  pkgconfig        2.0.3      2019-09-22 [1] CRAN (R 4.0.0)                  
#>  plyr             1.8.6      2020-03-03 [1] CRAN (R 4.0.0)                  
#>  prettyunits      1.1.1      2020-01-24 [1] CRAN (R 4.0.0)                  
#>  progress         1.2.2      2019-05-16 [1] CRAN (R 4.0.2)                  
#>  ps               1.5.0      2020-12-05 [1] CRAN (R 4.0.3)                  
#>  purrr          * 0.3.4      2020-04-17 [1] CRAN (R 4.0.0)                  
#>  R6               2.5.0      2020-10-28 [1] CRAN (R 4.0.3)                  
#>  RColorBrewer     1.1-2      2014-12-07 [1] CRAN (R 4.0.0)                  
#>  Rcpp             1.0.6      2021-01-15 [1] CRAN (R 4.0.3)                  
#>  readr          * 1.4.0      2020-10-05 [1] CRAN (R 4.0.3)                  
#>  readxl           1.3.1      2019-03-13 [1] CRAN (R 4.0.0)                  
#>  reprex           0.3.0      2019-05-16 [1] CRAN (R 4.0.1)                  
#>  reshape2         1.4.4      2020-04-09 [1] CRAN (R 4.0.0)                  
#>  rhdf5            2.34.0     2020-10-27 [1] Bioconductor                    
#>  rhdf5filters     1.2.0      2020-10-27 [1] Bioconductor                    
#>  Rhdf5lib         1.12.0     2020-10-27 [1] Bioconductor                    
#>  rlang            0.4.10     2020-12-30 [1] CRAN (R 4.0.3)                  
#>  rmarkdown      * 2.6        2020-12-14 [1] CRAN (R 4.0.3)                  
#>  rprojroot        2.0.2      2020-11-15 [1] CRAN (R 4.0.3)                  
#>  rstudioapi       0.13       2020-11-12 [1] CRAN (R 4.0.3)                  
#>  rvest            0.3.6      2020-07-25 [1] CRAN (R 4.0.2)                  
#>  S4Vectors        0.28.0     2020-10-27 [1] Bioconductor                    
#>  scales           1.1.1      2020-05-11 [1] CRAN (R 4.0.0)                  
#>  sessioninfo      1.1.1      2018-11-05 [1] CRAN (R 4.0.0)                  
#>  stringi          1.5.3      2020-09-09 [1] CRAN (R 4.0.3)                  
#>  stringr        * 1.4.0      2019-02-10 [1] CRAN (R 4.0.0)                  
#>  survival         3.2-7      2020-09-28 [2] CRAN (R 4.0.4)                  
#>  svUnit           1.0.3      2020-04-20 [1] CRAN (R 4.0.2)                  
#>  tibble         * 3.0.4      2020-10-12 [1] CRAN (R 4.0.3)                  
#>  tidybayes        2.3.1      2020-11-02 [1] CRAN (R 4.0.3)                  
#>  tidyr          * 1.1.2      2020-08-27 [1] CRAN (R 4.0.2)                  
#>  tidyselect       1.1.0      2020-05-11 [1] CRAN (R 4.0.0)                  
#>  tidyverse      * 1.3.0      2019-11-21 [1] CRAN (R 4.0.0)                  
#>  useful           1.2.6      2018-10-08 [1] CRAN (R 4.0.0)                  
#>  utf8             1.1.4      2018-05-24 [1] CRAN (R 4.0.0)                  
#>  vctrs            0.3.6      2020-12-17 [1] CRAN (R 4.0.3)                  
#>  vegan            2.5-7      2020-11-28 [1] CRAN (R 4.0.3)                  
#>  withr            2.3.0      2020-09-22 [1] CRAN (R 4.0.2)                  
#>  xfun             0.20       2021-01-06 [1] CRAN (R 4.0.3)                  
#>  xml2             1.3.2      2020-04-23 [1] CRAN (R 4.0.0)                  
#>  XVector          0.30.0     2020-10-27 [1] Bioconductor                    
#>  yaml             2.2.1      2020-02-01 [1] CRAN (R 4.0.0)                  
#>  zlibbioc         1.36.0     2020-10-27 [1] Bioconductor                    
#> 
#> [1] /home/michael/R/x86_64-pc-linux-gnu-library/4.0
#> [2] /usr/lib/R/library

References

Brooks, J Paul, David J Edwards, Michael D Harwich, Maria C Rivera, Jennifer M Fettweis, Myrna G Serrano, Robert A Reris, et al. 2015. “The truth about metagenomics: quantifying and counteracting bias in 16S rRNA studies.” BMC Microbiol. 15 (1): 66. https://doi.org/10.1186/s12866-015-0351-6.

McLaren, Michael R, Amy D Willis, and Benjamin J Callahan. 2019. “Consistent and correctable bias in metagenomic sequencing experiments.” eLife 8: 46923. https://doi.org/10.7554/eLife.46923.

Silverman, Justin D., Kimberly Roche, Zachary C. Holmes, Lawrence A. David, and Sayan Mukherjee. 2019. “Bayesian Multinomial Logistic Normal Models through Marginally Latent Matrix-T Processes.” arXiv. http://arxiv.org/abs/1903.11695.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/mikemc/qmm, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

McLaren (2021, Feb. 6). Quantitative Microbiome Measurement: Estimating bias from mock communities using the fido R package. Retrieved from https://microbiomemeasurement.org/posts/bias-estimation-from-mocks-with-fido/

BibTeX citation

@misc{mclaren2021estimating,
  author = {McLaren, Michael R.},
  title = {Quantitative Microbiome Measurement: Estimating bias from mock communities using the fido R package},
  url = {https://microbiomemeasurement.org/posts/bias-estimation-from-mocks-with-fido/},
  year = {2021}
}

Estimating bias from mock communities using the fido R package

Author

Affiliation

Published

Citation

R setup

Estimation using metacal::estimate_bias()

Estimation using fido::pibble()

Multiple protocols or protocol covariates

Session info

Footnotes

References

Corrections

Reuse

Citation

Estimation using `metacal::estimate_bias()`

Estimation using `fido::pibble()`