Bioinformatics and suggested analyses

This page and its children demonstrate analyses that are commonly needed by users of the GTL.

See below for a series of QC analyses that are worth considering.

Note: when reading OTU tables into R, be aware that vsearch inserts a "#" and a space at the start of the first field name. To get around this, you can read the data into R like so:

read.table("otutable", strip.white = TRUE, comment.char = "", sep = "\t", header = TRUE)

Note that R will prepend an "X" to the field names because they start with an integer and are thus not syntactically valid names. Not a big deal, but be aware of this when doing data wrangling (passing check.names = FALSE to read.table() avoids the renaming).

  1. Count reads per replicate and get a feel for which samples failed. Perhaps a treatment or a sampling location simply didn't pan out.

  2. Check out technical replicates and make sure they look similar. Combine read counts from them, if desired. If you don't want to combine them, then be careful to avoid pseudoreplication in downstream analyses.

  3. Check for cross-contamination using coligos, and decide what to do about any cross-contamination you find. See THIS link for instructions on how to determine which OTUs are the ISD and coligos.

  4. Check out your negative controls and see what is in there, then decide how to deal with contaminants. Note that deleting from the whole dataset anything that shows up in a blank is not a good approach: often one sees very minor contamination from some ubiquitous microbe that really is present out in nature. Currently, I recommend deleting taxa that are very abundant in blanks but not abundant in replicates. For instance, if 5% or more of the total reads from a taxon are in a blank, then I might consider deleting that taxon. I may also delete a taxon if it is a known contaminant; see Salter et al. 2014 and other papers exploring the 'kitome'. One may also want to redo analyses with and without the putative contaminants to see if anything changes.

  5. Double-check that you do not have multiple OTUs that are the same taxon but differ by an indel or another variant. Combine such OTUs if desired.

  6. Generate taxonomic hypotheses for each OTU. Typically this is done using a classifier such as SINTAX and a database of curated sequences, such as SILVA or UNITE. If you do not know what I am talking about, then please talk to Josh, Paul, John, Gordon, or someone else who has done this before. A new tool called AutoTax may be useful as well (see: https://mbio.asm.org/content/11/5/e01557-20#sec-1). Another possible tool we could use for both ITS and 16S classification is IDTAXA (see: https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-018-0521-5). When generating taxonomic hypotheses, make sure to report the date accessed or the version of the database used, because the results will change depending on the database version.

  7. Convert from relative to absolute abundances if you want to do that. It can be instructive to compare results between relative and absolute datasets; they will almost certainly differ substantially.

  8. Remove nontarget OTUs. These are sequences from hosts, nontarget eukaryotes, etc.

  9. While not an analysis, most journals will require that your data be deposited online. Consider depositing the raw sequences, the OTU table, all code used, and the consensus OTU sequences. The university can host data with a DOI for free; contact Josh if you need to do that soon, otherwise we will eventually post instructions here. Post your data and code at the latest possible date (or keep the code in a version-controlled repo); otherwise you will need to repost after the many rounds of revision that are inevitable when publishing.
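Step 1 above can be sketched in R, assuming the OTU table has OTUs as rows and replicates as columns (all sample names, counts, and the read-count threshold below are hypothetical; pick a threshold suited to your run):

```r
# Toy OTU table: rows are OTUs, columns are replicates (hypothetical names/counts)
otus <- data.frame(
  OTU_ID       = c("otu1", "otu2", "otu3"),
  sampleA_rep1 = c(120, 30, 0),
  sampleA_rep2 = c(110, 25, 5),
  sampleB_rep1 = c(2, 1, 0),  # a likely failure: very few reads
  stringsAsFactors = FALSE
)

# Total reads per replicate (drop the OTU_ID column first)
read_counts <- colSums(otus[, -1])

# Flag replicates below an arbitrary threshold (here 10 reads, for the toy data)
failed <- names(read_counts)[read_counts < 10]
```

Sorting read_counts and plotting them (e.g., with barplot()) makes failed replicates easy to spot by eye.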
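For step 2, a quick sanity check on a pair of technical replicates is to correlate their (log-transformed) counts before summing them; this sketch uses hypothetical counts and an arbitrary similarity cutoff:

```r
# Toy counts for two technical replicates of the same sample (hypothetical)
rep1 <- c(otu1 = 120, otu2 = 30, otu3 = 0)
rep2 <- c(otu1 = 110, otu2 = 25, otu3 = 5)

# Crude similarity check: correlation of log-transformed counts
similarity <- cor(log1p(rep1), log1p(rep2))

# If the replicates look alike, sum them into a single column
combined <- rep1 + rep2
```

More formal options exist (e.g., a Bray-Curtis distance between replicates), but a simple correlation catches gross failures.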
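The "5% or more of a taxon's reads in a blank" rule of thumb from step 4 can be sketched like this (taxon names and counts are hypothetical; Ralstonia is used only because it appears in the kitome literature):

```r
# Toy data: read counts per taxon in real replicates vs. a blank (hypothetical)
counts <- data.frame(
  taxon = c("Pseudomonas_sp", "Ralstonia_sp", "Bacillus_sp"),
  reps  = c(5000, 200, 900),  # summed reads across real replicates
  blank = c(10, 300, 0),      # reads in the negative control
  stringsAsFactors = FALSE
)

# Fraction of each taxon's total reads that occur in the blank
counts$frac_in_blank <- counts$blank / (counts$reps + counts$blank)

# Flag taxa with >= 5% of their reads in the blank as putative contaminants
putative_contaminants <- counts$taxon[counts$frac_in_blank >= 0.05]
```

As noted above, consider rerunning downstream analyses with and without the flagged taxa rather than deleting them outright.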
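Once you have decided (step 5) that two OTUs are really the same taxon, their counts can be summed with a simple merge key; the OTU names below are hypothetical:

```r
# Toy table: two OTUs judged to be the same taxon (hypothetical names/counts)
otus <- data.frame(
  OTU_ID = c("otu1", "otu1_indel_variant", "otu2"),
  s1 = c(100, 20, 50),
  s2 = c(80, 10, 60),
  stringsAsFactors = FALSE
)

# Map each OTU onto the OTU it should be merged with
merge_key <- c(otu1 = "otu1", otu1_indel_variant = "otu1", otu2 = "otu2")

# Sum counts within each merged OTU
merged <- aggregate(otus[, -1],
                    by  = list(OTU_ID = unname(merge_key[otus$OTU_ID])),
                    FUN = sum)
```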
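One way to do the conversion in step 7 is to divide each taxon's reads by the reads attributable to an internal standard (ISD) in the same replicate, which puts replicates on a common, absolute scale up to a constant. A minimal sketch, assuming an OTU labeled "ISD" is present (all counts hypothetical):

```r
# Toy table with an internal standard (ISD) row (hypothetical counts)
otus <- data.frame(
  OTU_ID = c("ISD", "otu1", "otu2"),
  s1 = c(50, 100, 25),
  s2 = c(200, 100, 50),
  stringsAsFactors = FALSE
)

counts <- as.matrix(otus[otus$OTU_ID != "ISD", -1])
isd    <- unlist(otus[otus$OTU_ID == "ISD", -1])

# Divide each replicate's counts by that replicate's ISD reads
absolute <- sweep(counts, 2, isd, "/")
```

Note that this simple ratio breaks down when the ISD gets zero or very few reads in a replicate; such replicates need separate handling.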

Non-QC analyses

Further analyses will likely be project specific: whether one chooses Dirichlet-multinomial modeling, a certain ordination technique, etc. will depend on the scientific goals of the project. If everyone wants to do the same analysis (e.g., metric multidimensional scaling), then we can talk about standard code to make that happen, to help us all catch errors.
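For metric multidimensional scaling specifically, base R's cmdscale() is enough for a sketch; the community matrix below is simulated, and for real data one would usually substitute an ecological distance such as Bray-Curtis (e.g., vegan::vegdist) for the Euclidean distance used here:

```r
# Toy community matrix: replicates in rows, taxa in columns (simulated counts)
set.seed(1)
comm <- matrix(rpois(30, lambda = 20), nrow = 6,
               dimnames = list(paste0("rep", 1:6), paste0("otu", 1:5)))

# Distances between replicates (Euclidean here; swap in Bray-Curtis for real data)
d <- dist(comm)

# Classical (metric) MDS into two dimensions
ord <- cmdscale(d, k = 2)

# ord is a 6 x 2 matrix of coordinates, ready for plot(ord)
```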

Another commonly implemented non-QC analysis is alpha diversity. Though it doesn't tell us nearly as much as some of the more sophisticated analyses, I think we will still have to do it.

Please all edit this page as you see fit. This is just a starting point and much detail needs to be added. @Alex Buerkle @Gregg Randolph @Gordon Custer @Reilly Dibner @Alessandra Ceretto @Ella DeWolf @Seifeddine Ben Tekaya @Félix Brédoire @Abby Hoffman @Macy Ricketts @John Calder @Paul Ayayee @Erin Bentley