Microbial bioinformatics at UWYO

Orientation

A page to organize the bioinformatics-related tasks required to complete the mission of the Microbial Ecology Collaborative at UWYO. This page only describes sequence processing and curation; for information on statistics, including modeling efforts, see /wiki/spaces/UWMICBIO/pages/557301. When complete, this page should include SOPs for sequence processing, code, and the background materials necessary for someone relatively new to microbial sequencing to understand the process.

At the bottom of this page is a table for self-assigning a working group role. Please check the legend under the table to help decide which category to choose.

CURRENT SOP

In the future, an agreed-upon SOP will be posted here. As the SOP evolves, old versions will be stored here as well.

Existing pipelines to be considered:

Bash script wrapping up USEARCH/VSEARCH commands (J. Harrison's current pipeline): https://github.com/JHarrisonEcoEvo/BioinformaticsTools/blob/master/fastq_to_OTUtable.sh

DADA2 1.8 Tutorial

CODE and TUTORIALS

A brief tutorial on sequence processing that takes the reader from fastq files to an OTU table (click /wiki/spaces/UWMICBIO/pages/28606515).

How to use Teton (click HERE).

Wishlist (i.e. things to do)

Note: some of these tasks are broader in scope than the microbial bioinformatics working group alone, but they are mentioned here anyway. As we figure out how to address each question, a page will be added documenting protocols and knowledge. If beneficial, these broad needs will be split into tasks on Jira and assigned to people, but for now I am just listing them here.

  • Decide on standardized data processing steps to facilitate merging data generated from various substrates.
    • Incorporate these standards into existing bash scripts or make new ones.
    • Decide upon summary data and analytical notes to provide end users (current bash script outputs a lot of useful summary data already, but should be revisited).
  • Train data handlers how to process sequence data as it arrives.
    • The outcome should be an OTU table and taxonomic hypotheses for all ESVs delivered to whoever generated the data and to the folks storing all the data.
    • Decide if scripts should be amended to automatically generate many commonly desired analyses (e.g., ordination, diversity index calculation). This may be more trouble than it is worth; however, we should discuss the pros and cons. Automatic incorporation of hierarchical Bayesian modeling is not practical at this stage (due to the need to tailor analyses to the data and the computational expense).
  • Decide on how to back up data (not limited to just microbial bioinformatics).
  • Decide how metadata should be curated. This includes not only sample provenance and associated covariates (e.g., plant traits), but also ensuring credit is given to whoever did the collecting, lab work, etc. (not limited to just microbial bioinformatics).
  • Generate an ESV database so we can see exactly where each microbe we sequence was found, including substrate. This will also help us identify common sequences that show up in negative controls (idea: the easiest way to do this will be for the handlers to output ESVs from each dataset as it arrives and merge them with a database. In other words, I don't think this should be the responsibility of the folks doing the science).
Team members and roles

Please add your name to the table above.

Explanations of categories

"Downstream user" - This category is appropriate for those that do not envision doing microbial sequencing in the future or who will have collaborators that do the bioinformatics. For instance, say you are doing a one-off microbial project, then there is not a need to learn the ever changing methods required to process the data, instead data processing would be conducted by data handlers. Handlers would give you back a microbe by sample matrix that you could then use to answer your question. This choice might be especially good for students that are not pursuing a career in microbial ecology or for others that do not need/want to learn the details of sequence processing.

"Learner and user of methods" - This will likely be the best choice for most folks. You may want to self-assign as this category if you want to understand how each step in the sequence processing pipeline works and potentially implement the pipeline, but are not anticipating a future career dedicated to staying up to date with new methods or doing method development. Data handlers will set up tutorial sessions to explain methods to people in this category. You will not need to have grasp of Linux use and scripting to understand the bulk of the tutorial. However, you will need to know how to use Linux to actually do sequence processing.

"Data handler/leadership" - Category for those that envision a career processing microbial data. Data handlers will process data for downstream users and assist learners. Data handlers will require detailed knowledge of how to compute in Linux with HPC systems and will need to be able to occasionally write bash and/or Python scripts. SOPs for sequence processing and the necessary scripts to implement those SOPs will be created by the data handlers. We envision a streamlined, standardized sequence processing pipeline developed and implemented by the handlers. Data will come in, the handlers will run some scripts, and then give the output to others.

Topics to discuss/agree upon

Data handlers (Linda van Diepen, Paul Ayayee, Gordon Custer, Alex Buerkle): as time allows, please consider the following pipeline(s) and add comments/questions as desired. When we meet, we can figure out how to address each question.

Existing questions are in red. Branches denote different pipelines depending on the method chosen.

I've (gc) added some comments in blue. 

Summary:

I looked for comparisons of pipelines with simulated fastq data but have come up empty (I will continue to dig). However, USEARCH and DADA2 seem to produce similar beta-diversity results, but slightly different (though correlated) alpha-diversity results, likely because of the larger number of ESVs DADA2 calls.

USEARCH strengths/weaknesses:

    Much faster than other options for all steps. Multithreading is built in, and it remains quick on very large datasets.

    Errors are corrected during sequence merging in an elegant way.


DADA2 strengths/weaknesses:

    Possibly has better error modeling for ESV generation than does UNOISE (need to think more about this).

    Processes samples one at a time, whereas UNOISE processes them all at once. There are pros and cons to this (DADA2 can do both; you have the option to pool samples). The pro is that in cases where a particular taxon is common in one sample but rare everywhere else, DADA2 is more likely to correctly call that taxon an ESV. On the other hand, when there are few reads for a sample, DADA2 will likely suggest many false positives. I suspect this is why, in several papers, DADA2 typically generates more ESVs than does UNOISE.

    It is unclear how to implement as many QC steps in DADA2 as in USEARCH. We can write methods to do this, but it seems like this will be on us rather than part of the core software functionality. I've split the quality filtering in DADA2 into multiple steps. Though it takes longer, this has allowed us to retain the most reads.

My suggestion is that we take the best of both U/VSEARCH and DADA2 and merge the pipelines.

DEBLUR - I do not see any reason to use this method at this time. It is not as fast as USEARCH and, perhaps, not as thorough as DADA2. After rereading the paper, I still do not think this method is as good as either of the others listed above.


MED - I have not used this method, but it always performs worse than other methods in the simulations I have read, so perhaps we can omit this method from consideration.


SEEKDEEP - I have not used this method either. Need to learn more about it.


Steps. Note: sequences retained after each step will be written to a summary file. These steps assume PE reads that can be merged. If we have single-end reads or PE reads that are too short to merge, then we will need a different pipeline. We should not have this problem assuming 2x250 PE on the HiSeq. Agreed. However, we might have some problems with ITS reads, as they vary in length. I think we should have two pipelines: one for the sequences that merge and one for the sequences that don't.

  1. Data download and curation.
    1. How best to do this?
      1. We should pull raw data directly from the sequencing facility and store it in a folder of raw reads (read-only). We can then demultiplex and move the output .fastq files to a demultiplexed folder (also read-only) from which we import data for processing.
    2. Necessity of a README file explaining what the covariates are?
      1. Do you think we need this for processing? Maybe this should be stored with the final ESV/OTU tables.
    3. Download data to a backed-up portion of Teton (where is this?)
      1. I think we will want to have this data elsewhere as well. Teton isn't consistently backed up like petaLibrary (though I don't know if petaLibrary is the answer either). Should we think about offsite backups for raw data? AWS or somewhere else?
    4. Copy data to a working directory. Never operate directly on the raw data.
    5. Include metadata and covariates with the sequencing data.
  2. Unzip data.
    1. Use GNU parallel to unzip the files in parallel (see the sketch after this list).
  3. Demultiplex. I made a Python script that uses Levenshtein distances to do this, but it is slow. Alex may have a faster version. If not, then we should figure out a way to parallelize this, as it can take days to demultiplex.
  4. QC: crosstalk (index switching) should not be an issue for us since we use custom internal barcodes. No QC is needed here, but it is probably worth remembering this benefit of our approach and stating it in our methods.
  5. QC: if desired, spot-check a few files to look at overall read quality. I have had no problem with read quality in a while, there are too many reads to visually check, and problematic reads will get removed later anyway, so this step is not very important in my opinion. However, if I have problems merging reads, then I revisit this step and look for a massive decline in base-calling accuracy towards the end of the read. If so, then I tweak my merging specifications.
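
A minimal sketch of these data-prep steps as shell commands, under some assumptions: the folder names (raw/, working/, demultiplexed/), lane1.fastq, and barcodes.fasta are placeholders, and demultiplexing is shown with cutadapt only to illustrate the idea (the current approach uses a custom Python script).

# Keep an untouched, read-only copy of the raw reads, then work on a copy.
md5sum raw/*.fastq.gz > raw/checksums.md5      # verify the transfer from the sequencing facility
chmod -R a-w raw/                              # make the raw-read folder read only
mkdir -p working demultiplexed
cp raw/*.fastq.gz working/

# Step 2: unzip in parallel with GNU parallel.
parallel gunzip ::: working/*.fastq.gz

# Step 3 (one option): demultiplex with cutadapt from a FASTA file of barcodes.
cutadapt -g file:barcodes.fasta -e 0.1 -o "demultiplexed/{name}.fastq" working/lane1.fastq

# Step 5: optional spot check of read quality on a few files.
fastqc demultiplexed/sample01.fastq demultiplexed/sample02.fastq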

BRANCH 1. U/VSEARCH

    If using the USEARCH pipeline:

  1. Merge forward and reverse reads - I like USEARCH best for this, as it does a nice job of using quality scores from both strands to figure out tricky bases; this is a strength of USEARCH (see the sketch after this list). Can we merge now and then learn error rates on a subset of merged reads using DADA2? Yes, this is what Paul and I have done. We used the USEARCH merge_fastq_pairs.sh command to merge our reads, then we import the merged reads into DADA2 and perform quality filtering, error learning, and dereplication.
    1. Currently I am allowing a maximum of 10 different bases in overlapping regions (this helps when overlapping regions are very long, in which case errors tend to accrue and prevent many merges).
    2. At least 80% of the two strands must agree to merge.
    3. Minimum overlap is 12 bp.
    4. Merging parameters are important and influence results substantially, so we should agree on these. I think this will need to be assessed on a per-lane basis. We will receive varying quality, and as such we will want to tweak our parameters for each lane.
  2. QC step: ensure merging was generally successful (i.e., 70% or more of sequences merged; normally it is in the 90+% range). If merging was poor, then a read truncation step could help (removing low-quality bases from the ends of reads), and relaxing the merging parameters a bit could also help. Remember that chimeras are removed later.
    1. Should we remove low-quality bases towards the end of the read anyway? I don't normally, as they don't seem to be causing me problems. However, this could be a way to get a few more reads to merge. In my experience, if we use the merge-first approach then we can leave them and clip anything that's really poor following merging.
  3. QC step: make sure merged reads are all oriented in the same direction. USEARCH does this easily. Not sure about DADA2.
  4. Remove primer-binding regions. Some folks say these areas are prone to sequencing errors because the primer can bind to sequences with a slight mismatch, creating discrepancies in this region. Also, I don't think there is any information at this locus, as this region should be the same among ESVs. Thus, I propose deleting this portion of the sequence. This will require user input, as primer sequence length will vary with the primer. Note: removing primer-binding regions was important back in the 97% OTU days because these regions changed overall sequence length, and thus the denominator in percentage calculations. We have incorporated cutadapt in our pipeline to check for non-biological bases. We can just make a vector of potential primer sequences and remove any that are found.
  5. Filter out low-quality sequences. I have been removing sequences with more than a single expected error, as modeled by USEARCH. This stringency seems important for ESV-based analyses. Filtering is done post-merging so that base-quality info from both reads can be used to assign bases with higher confidence. Perhaps we might bump this up to 2 (or even 3) expected errors, as in DADA2, to retain more reads (erroneous reads should still be caught by the denoiser)? We have allowed 1 and sometimes 2 expected errors. The quality filtering in DADA2 is very similar to U/VSEARCH's.
  6. Combine filtered reads and dereplicate.
  7. Call ESVs using UNOISE (see the other branch for DADA2). I am not considering 97% OTUs, as I see no reason to use these anymore. However, if 97% OTUs are desired they can be generated here. Current scripts output both ESVs and 97% OTUs just in case somebody wants them.
    1. QC: chimeras are removed automatically during this step. The number of chimeric sequences will be recorded.
  8. QC step: remove very short ESVs. Does 64 bp in length seem reasonable (this is what I have arbitrarily been using)? I suppose we could use primer-dimer length as a minimum, which could be more sensible. This typically doesn't remove many ESVs and shouldn't matter at all when we use BluePippin. We should look at possible dimer lengths and remove anything below that. However, these shorter reads might cause problems in our "have we ever seen this sequence" database.
  9. QC step: remove low-complexity ESVs (i.e., those composed of many repeated motifs). I have been removing ESVs comprising more than 25% repeated motifs. This step may not be necessary; it typically does not remove much, if anything.
  10. Remove PhiX.
    1. We could also remove other ubiquitous contaminants here, too, such as human and dog?
  11. QC: check for offset ESVs (ESVs that are the same but shifted). This should not be a problem so long as we use one primer, UNLESS we have errors in demultiplexing. If barcodes don't get removed properly and an extra base gets left in, we could generate an offset. Normally this shouldn't be a problem.
  12. Concatenate sequences in preparation for ESV table building. Sample info must be in the headers, and the files must be FASTA. Simple commands can accomplish this.
  13. Make an ESV table. USEARCH has a function to do this, so, thankfully, we don't have to do it by hand anymore. This step is the slowest part of this branch: it takes the ESVs and aligns the original, unfiltered data to them, thus recovering a handful of reads that were filtered out earlier.
    1. Minor cleaning steps to make the ESV table easier to parse (e.g., remove some hashes and spaces).
  14. QC: check sequence coverage. How much of the starting data made it into the ESV table? Hopefully most of it did.
  15. QC: check that all ESVs made it into the table.
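
A minimal sketch of this branch as USEARCH commands (VSEARCH has close equivalents for most steps). File names and primer sequences are placeholders, the parameters are the ones proposed above, and the PhiX filter is shown on the merged reads because that is where usearch's filter_phix operates; treat this as a starting point rather than the agreed SOP.

# Merge paired reads, using quality scores from both strands (parameters from step 1).
usearch -fastq_mergepairs demultiplexed/*_R1.fastq -relabel @ \
    -fastq_maxdiffs 10 -fastq_pctid 80 -fastq_minovlen 12 -fastqout merged.fq

# Remove primer-binding regions (FWDPRIMER/REVPRIMER_RC are placeholders).
cutadapt -g FWDPRIMER -a REVPRIMER_RC -o trimmed.fq merged.fq

# Remove PhiX.
usearch -filter_phix trimmed.fq -output nophix.fq

# Quality filter on expected errors (1.0 here; possibly relax to 2-3).
usearch -fastq_filter nophix.fq -fastq_maxee 1.0 -fastaout filtered.fa

# Dereplicate.
usearch -fastx_uniques filtered.fa -fastaout uniques.fa -sizeout -relabel Uniq

# Call ESVs (ZOTUs) with UNOISE3; chimeras are removed during this step.
usearch -unoise3 uniques.fa -zotus zotus.fa

# Drop very short ESVs (64 bp is the arbitrary cutoff discussed above).
usearch -sortbylength zotus.fa -fastaout zotus_len.fa -minseqlength 64

# Build the ESV (ZOTU) table by mapping the unfiltered merged reads back to the ESVs.
usearch -otutab trimmed.fq -zotus zotus_len.fa -otutabout zotutab.txt -mapout zmap.txt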


BRANCH 2. DADA2


  1. Remove non-biological bases and any N base calls (a sketch of this branch follows the list).
  2. Filter out PhiX, reads with more than X expected errors (the default is 2), and reads with undefined bases. A trimming operation is also performed, which truncates reads once base quality declines. I have moved away from using a set length and towards using quality parameters as a means to control what makes it through. This is especially true for ITS data. The truncLen parameter in DADA2 truncates reads to a set length and discards anything shorter than it.
  3. Learn error rates, can be multithreaded. From dada2 help: “The learnErrors method learns this error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. As in many machine-learning problems, the algorithm must begin with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors).” Need to learn more about the actual math.
  4. QC: plot some of the error rates to make sure the model did a good job.
  5. Dereplicate. DADA2 saves ESV quality info, which is nice.
  6. Call ESVs.
  7. Merge reads. To me, it is odd that this happens at this point in the DADA2 pipeline. It seems like one could leverage more info if it happened first. Perhaps we could hack their pipeline by using USEARCH to merge reads first, keeping Q scores, and then call ESVs with DADA2. I have not tried this, so I don't know if it would work.
  8. Make ESV table.
  9. Remove chimeras.
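
A minimal sketch of this branch, written as a bash wrapper around the dada2 R package (which must be installed); the folder name, file-name patterns, and filtering parameters are illustrative assumptions, not the agreed SOP.

#!/bin/bash
# Run the stock DADA2 workflow on demultiplexed paired-end reads; everything between
# the EOF markers is R code using standard dada2 functions.
R --vanilla --quiet <<'EOF'
library(dada2)
path   <- "demultiplexed"                                   # placeholder folder
fnFs   <- sort(list.files(path, pattern = "_R1.fastq", full.names = TRUE))
fnRs   <- sort(list.files(path, pattern = "_R2.fastq", full.names = TRUE))
filtFs <- file.path(path, "filtered", basename(fnFs))
filtRs <- file.path(path, "filtered", basename(fnRs))

# Step 1 (primer removal, e.g. with cutadapt) is assumed to have been done beforehand.
# Step 2: filter on expected errors, drop reads with Ns, and remove PhiX.
filterAndTrim(fnFs, filtFs, fnRs, filtRs, maxN = 0, maxEE = c(2, 2),
              truncQ = 2, rm.phix = TRUE, multithread = TRUE)

# Steps 3-4: learn error rates and plot them as a QC check.
errF <- learnErrors(filtFs, multithread = TRUE)
errR <- learnErrors(filtRs, multithread = TRUE)
pdf("error_model_check.pdf"); print(plotErrors(errF, nominalQ = TRUE)); dev.off()

# Steps 5-6: dereplicate and call ESVs.
derepFs <- derepFastq(filtFs)
derepRs <- derepFastq(filtRs)
dadaFs  <- dada(derepFs, err = errF, multithread = TRUE)
dadaRs  <- dada(derepRs, err = errR, multithread = TRUE)

# Steps 7-9: merge reads, build the ESV table, and remove chimeras.
merged <- mergePairs(dadaFs, derepFs, dadaRs, derepRs)
seqtab <- makeSequenceTable(merged)
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus", multithread = TRUE)
saveRDS(seqtab.nochim, "esv_table.rds")
EOF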

BRANCH 3. Proposed DADA2/V(U)SEARCH pipeline: 

  1. Remove non-biological bases and any N base calls with cutadapt.
  2. Initial quality filtering to remove any N base calls; truncate reads at the first N.
  3. *Merge using U/VSEARCH, retaining quality information (see the sketch below this list).
  4. Import the merged FASTQ reads into DADA2.
  5. Filter using base-quality information, with no minimum-length parameters.
    1. This can be done in a stepwise fashion. First remove bases below a quality threshold, then remove reads with more than X expected errors ...
  6. Learn errors - Put simply, the algorithm distinguishes biological variants from sequencing errors.
    1. We will need to think about how to incorporate previous sequencing runs into this step ("have we seen this before" database). We will have a wealth of sequence info by the end that we will want to leverage for our error learning step. 
      1. We will want to keep a copy of every sequence and the number of times it was observed, regardless of whether it is assigned as a sequencing error or a biological variant. This will allow us to use previous runs to ID rare taxa.
  7. Dereplicate - tabulates the number of times each read is seen in each sample. We lose our duplication here, and everything afterward runs much faster.
    1. Any new sequences will need to be added to the "have we seen this before" database.
  8. *Merge in DADA2 (if we decide against a U/VSEARCH merge-first approach, we can merge in DADA2 here).
  9. Tabulate into an ESV/OTU table.
  10. Remove chimeric sequences. 
  11. Assign taxonomy. We will have to think about how we will deal with new versions of each database. DADA2 currently supports Greengenes, Silva, RDP, and UNITE (others can be made available as well). 

Export the ESV/OTU table to a new folder containing the metadata and the ESV/OTU table, with a common name for matching.
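
A minimal sketch of the distinctive front end of this branch: primer removal with cutadapt, then a quality-aware merge in VSEARCH that writes FASTQ output for import into DADA2. Primer sequences and file names are placeholders, and the remaining steps run in DADA2 just as in the Branch 2 sketch.

# Step 1: remove non-biological bases; --max-n 0 discards any read still containing an N
# (a simple stand-in for the "truncate at the first N" idea in step 2).
cutadapt -g FWDPRIMER -G REVPRIMER --max-n 0 --discard-untrimmed \
    -o trimmed_R1.fastq -p trimmed_R2.fastq sample_R1.fastq sample_R2.fastq

# Step 3: merge with VSEARCH, keeping FASTQ output so the quality scores reach DADA2.
vsearch --fastq_mergepairs trimmed_R1.fastq --reverse trimmed_R2.fastq \
    --fastq_maxdiffs 10 --fastq_minovlen 12 --fastqout merged.fastq

# Steps 4-11 then proceed in DADA2 on merged.fastq (filterAndTrim, learnErrors, dada,
# makeSequenceTable, removeBimeraDenovo, assignTaxonomy), as in the Branch 2 sketch.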


MERGING of branches.


  1. Assign taxonomy. USEARCH uses the SINTAX algorithm; DADA2 uses a naive Bayesian classifier. Many other options exist. Does anyone have a preference of algorithm? (See the sketch after this list.) I will continue to add to this step, as this is one of the fastest-changing parts of the pipeline in terms of new tool availability.
  2. Choice of training database is critically important. I typically use two training databases and use information from both. For 16S, I use Greengenes and SILVA; for ITS, I use Warcup and UNITE. Does anyone want to use other databases? See the comments above in BRANCH 3 for DADA2 taxonomic assignment databases.
  3. Remove host DNA, cpDNA, and mtDNA. Info on this is gained from the databases used during taxonomic hypothesis generation, but I also recommend we use MIDORI (Machida et al. 2017), MITOFUN (Ntertilis et al. 2013), and the RefSeq mitochondria and plastid databases (Tatusova et al. 2014). Often these extra databases don't detect many other unwanted ESVs, but they provide an extra measure of thoroughness.

  4. Put ESVs and metadata into a database. This includes capturing ESVs that show up in negative controls, so we can build a database of possible contaminants. NOTE: I do not think we should spend a ton of time on this up front. We can deal with it after we have a workable pipeline that lets people do science. IMO, we could spend a ton of time on this before we even need it, and that would be very counterproductive.
  1. How will the database work? Who inputs data?
  2. What form should we get data into for inclusion?
  3. How do we access the database to get info? For instance, regarding possible contaminants.
  4. Maybe the database could spit out possible contaminants on the fly as data are added. We could then automatically search incoming data for these ESVs and flag them?
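
A minimal sketch of taxonomy assignment with SINTAX via VSEARCH, plus a screen of the ESVs against organelle/host references; the database files named here (silva_sintax.fa, refseq_mito_plastid.fa) are placeholders we would need to download, format, and agree on.

# Assign taxonomy with SINTAX (repeat with a second database, e.g. a UNITE file for ITS).
vsearch --sintax zotus.fa --db silva_sintax.fa --strand both \
    --sintax_cutoff 0.8 --tabbedout taxonomy_silva.txt

# Flag ESVs matching mitochondrial/plastid/host references for removal.
vsearch --usearch_global zotus.fa --db refseq_mito_plastid.fa --id 0.90 \
    --strand both --blast6out organelle_hits.txt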


PROPOSED END of standardized pipeline. We could let end users remove contaminants from their data using negative controls, confirm success using positive controls and mock communities, and do stats. It may be difficult to automate these steps given varying experimental designs, sample names, etc. However, we should establish guidelines for people to follow.

    Possible guidelines include:

  1. Sum up reads that are not in negative controls and compare these to negative-control reads. If an ESV is in both sets of reads but is proportionally more abundant in control samples, then remove it. I have been using a threshold of >1%, arbitrarily: if >1% of an ESV's total abundance is in the negative controls, then I remove it from the data (see the sketch after this list). I do not like completely removing anything that is in the negative controls regardless of abundance, because sometimes the negative controls have a few reads from taxa that are super abundant in the non-control data, meaning the controls were slightly contaminated or that taxon is ubiquitous. In either case, one wouldn't want to remove that taxon from the data. Surely we can do better than what I have been doing, but this may not be of critical importance, and developing better methods could be a time sink. Ideas welcome. In time, we will have a database of ESVs commonly found in lab reagents and can better identify contaminants.
  2. Check mock community and positive controls to see how well sequencing worked, accuracy of taxonomic hypotheses, etc. ESVs will almost certainly expand on the expected biodiversity within mock communities, and end users should be aware of this.
  3. We could provide guidelines regarding basic statistics, but this may be going a step too far. For instance, how to do a PERMANOVA, calculate diversity indices, etc. Perhaps this is best left up to end users.
  4. We should incorporate code review into all statistics. This is probably not the place to talk about how to implement that, but I am writing it here just as something to think about.
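
A minimal sketch of the >1% rule in guideline 1, written as an awk pass over a tab-delimited ESV table (ESVs in rows, samples in columns); the assumption that negative-control sample names begin with "NEG" is purely illustrative.

# Flag ESVs for which >1% of total reads fall in negative-control columns.
awk -F'\t' '
NR == 1 { for (i = 2; i <= NF; i++) if ($i ~ /^NEG/) neg[i] = 1; next }
{
  tot = 0; negsum = 0
  for (i = 2; i <= NF; i++) { tot += $i; if (i in neg) negsum += $i }
  if (tot > 0 && negsum / tot > 0.01) print $1, negsum, tot, negsum / tot
}' zotutab.txt > flagged_contaminants.txt

The flagged list could then be reviewed (or cross-checked against the contaminant database) before any ESVs are actually dropped.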


Considerations:


  1. For ITS sequences, some folks like to try to remove the ITS region using the ITSxtractor tool. I do not do this because it seems a good way to introduce bias, is unnecessary now that we use ESVs, and is super slow. If someone disagrees, then I would be keen to hear why.
  2. Everything above assumes reads merge! If we have many biologically relevant reads that do not merge because the read lengths output by the machine are too short, then we have many additional considerations. Do we want to build in machinery that deals with unmerged reads on the fly?
    1. Should we concatenate forward and reverse reads?
    2. We will need to decide on a global trimming length; otherwise, just one extra base on one side or the other of a read could lead to different ESVs.
    3. Concatenation may complicate chimera detection, but I am not sure.



Once we have decided how we want to process this information, we will need to ensure the proper dependencies are installed on Teton. We might want to develop a Singularity image so we can easily duplicate processing.
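
A minimal sketch of how that could work, assuming a Singularity definition file (pipeline.def) listing the pipeline dependencies has already been written; the file names are placeholders.

# Build the image once (requires root or --fakeroot), then copy pipeline.sif to Teton.
sudo singularity build pipeline.sif pipeline.def

# Every run then uses the same software versions, e.g.:
singularity exec pipeline.sif vsearch --version
singularity exec pipeline.sif Rscript -e 'packageVersion("dada2")'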