Comment: comments for moving forward with MEC pipeline development

...

Existing questions are in red. Branches denote different pipelines depending on method chosen.

I've (gc) added some comments in blue. 

Summary:

    I looked for comparisons of pipelines with simulated fastq data but have come up empty (will continue to dig). However, USEARCH and DADA2 seem to produce similar beta-diversity results, but slightly different (though correlated) alpha-diversity results, likely a result of the increased number of ESVs DADA2 calls.

USEARCH strengths/weaknesses:

    Way faster than other options for all steps. Multithreading built in. Is quick on really huge datasets.

    Errors are corrected during sequence merging in an elegant way.

...

    Processes samples one at a time, whereas UNOISE processes them all at once. There are pros and cons to this (DADA2 can do both; you have the option to pool samples). The pro is that in cases where a particular taxon is common in one sample but rare everywhere else, DADA2 is more likely to correctly call that taxon an ESV. On the other hand, when there are few reads for a sample, DADA2 will likely suggest many false positives. I suspect this is why in several papers DADA2 typically generates more ESVs than UNOISE does.

    It is unclear how to implement as many QC steps with DADA2 compared to USEARCH. We can make methods to do this, but it seems like this will be on us and not part of the core software functionality. I've split the quality filtering (QF) in DADA2 into multiple steps; though it takes longer, this has allowed for retention of the most reads.

My suggestion is that we take the best of both U/VSEARCH and DADA2 and merge the pipelines.

DEBLUR - I do not see any reason to use this method at this time. It is not as fast as USEARCH and, perhaps, not as thorough as DADA2. After rereading the paper, I still do not think this method is as good as either of the others listed above.

...

Steps. Note: sequences retained after each step will be written to a summary file. These steps assume PE reads that can be merged. If we have single-end reads or PE reads that are too short to merge, then we will need a different pipeline. We should not have this problem assuming 2x250 PE on the HiSeq.

...

 Agreed. However, we might have some problems with ITS reads as they vary in length. I think we should have two pipelines: one for sequences that merge and one for sequences that don't.

  1. Data download and curation.
    1. How best to do this?
        1. We should pull raw data directly from the sequencing facility and store it in a folder of raw reads (read-only folder). We can start demultiplexing and move the output .fastq files to a demultiplexed folder where we can import data for processing (read-only folder).
      1. Necessity of a ReadME file explaining what the covariates are?
        1. Do you think we need this for processing? Maybe this should be stored with final ESV/OTU tables. 
      1. Download data to backed up portion of Teton (where is this?)
        1. I think we will want to have this data elsewhere as well. Teton isn't consistently backed up like petaLibrary (though I don't know if petaLibrary is the answer either). Should we think about offsite backups for raw data? AWS or somewhere else?
      2. Copy data to working directory. Never operate directly on the raw data.
      3. Include metadata and covariates with the sequencing data.
    1. Unzip data.
      1. Use GNU parallel unzip.
    2. Demultiplex. I made a python script that uses Levenshtein distances to do this, but it is slow. Alex may have a faster version. If not, then we should figure out a way to parallelize this as it can take days to demultiplex.
    3. QC: crosstalk (index switching) should not be an issue for us since we use custom internal barcodes. No QC needed here, but it is probably worth remembering this benefit of our approach and stating that in methods.
    4. QC: if desired, spot check a few files to look at overall read quality. I have had no problem with read quality in awhile, there are too many reads to visually check, and problematic reads will get removed later anyway, so this step is not very important in my opinion. However, if I have problems merging reads, then I revisit this step and take a look to see if I have massive decline in base calling accuracy towards the end of the read. If so, then I tweak my merging specifications.
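
    Here's a small R sketch of how the per-file steps could be parallelized (unzipping is shown, but demultiplexing could be split across files the same way), using the base parallel package as an alternative to GNU parallel. The directory path and core count are placeholders.

```r
# Sketch: parallel per-file decompression from R (an alternative to GNU parallel unzip).
# The directory path and core count are placeholders; adjust to the working directory layout.
library(parallel)

fastq_gz <- list.files("working_dir/raw_copies", pattern = "\\.fastq\\.gz$",
                       full.names = TRUE)

# R.utils::gunzip decompresses one file; mclapply fans the calls out across cores.
# remove = FALSE keeps the .gz copies so we never touch the raw data itself.
invisible(mclapply(fastq_gz, function(f) {
  R.utils::gunzip(f, remove = FALSE, overwrite = TRUE)
}, mc.cores = 8))
```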

    BRANCH 1. U/VSEARCH

        If using the USEARCH pipeline:

    1. Merge forward and reverse reads - I like USEARCH best for this as it does a nice job of using quality scores from both strands to figure out tricky bases. This is a strength of USEARCH. Can we merge now and then learn error rates on a subset of merged reads using DADA2? Yes, this is what Paul and I have done. We used the USEARCH merge_fastq_pairs.sh command to merge our reads, then we import the merged reads into DADA2 and perform quality filtering, error learning, and dereplication.
      1. Currently I am allowing a maximum of 10 different bases in overlapping regions (this helps when overlapping regions are very long, in which case errors tend to accrue and prevent many merges).
      2. At least 80% of the two strands must agree to merge.
      3. Minimum overlap is 12 bp.
      4. Merging parameters are important and influence results substantially, so we should agree on these (see the command sketch at the end of this branch).
        1. I think this will need to be assessed on a per-lane basis. We will receive varying quality and as such we will want to tweak our parameters for each lane.
    2. QC step: ensure merging was generally successful (i.e. 70% or more of sequences merged, normally it is in the 90+% range). If merging was bad, then a read truncation step could help (removing low quality bases from the ends of reads) and relaxing merging parameters a bit could also help. Remember chimeras are removed later.
      1. Should we remove low quality bases towards the end of the read anyway? I don't normally, as they don't seem to be causing me problems. However, this could be a way to get a few more reads to merge. In my experience, if we use the merge-first approach then we can leave them and clip anything that's really poor following merging.
    3. QC step: make sure merged reads are all oriented in the same direction. Usearch does this easily. Not sure about Dada2.
    4. Remove primer binding regions. Some folks say these areas are prone to sequencing errors because the primer can bind to sequences with a slight mismatch, thus you get discrepancies in this region. Also, I don't think there is any information at this locus, as this region should be the same among ESVs. Thus, I propose deleting this portion of the sequence. This will require user input as primer sequence length will vary with primer. Note: removing primer binding regions was important back in the 97% OTU days because these regions changed overall sequence length, and thus the denominator in percentage calculations. We have incorporated CutAdapt into our pipeline to check for non-biological bases. We can just make a vector of potential primer sequences and remove any that are found.
    5. Filter out crappy sequences. I have been removing sequences with more than a single expected error, as modeled by USEARCH. This stringency seems important for ESV based analyses. Filtering is done post-merging so base quality info from both reads can be used to assign bases with higher confidence. Perhaps we might bump this up to 2 as per DADA2 (or even 3) errors to retain more reads (erroneous reads should still be caught by the denoiser)? We have allowed 1 and sometimes 2 expected errors. The QF of DADA2 is very similar to U(V)SEARCH.
    6. Combine filtered reads and dereplicate
    7. Call ESVs using UNOISE (see other branch for dada2). I am not considering 97% OTUs as I see no reason to use these anymore. However, if 97% OTUs are desired they can be generated here. Current scripts output both ESVs and 97OTUs just in case somebody wants them.
      1. QC: Chimeras are removed automatically during this step. Number of chimeric sequences will be recorded.
    8. QC step: remove very short ESVs. Does 64 bp in length seem reasonable (this is what I have arbitrarily been using)? I suppose we could use primer dimer length as a minimum and that could be more sensible. This doesn't remove many ESVs typically and shouldn't matter at all when we use Bluepippin. We should look at possible dimer lengths and remove anything below that. However, these shorter reads might cause problems in our "have we ever seen this sequence" database.
    9. QC step: remove low complexity ESVs (i.e. those composed of many repeated motifs). I have been removing ESVs comprising more than 25% repeated motifs. This step may not be necessary, it typically does not remove much, if anything.
    10. Remove PhiX
      1. We could also remove other ubiquitous contaminants here too, like human and dog? 
    11. QC: check for offset ESVs (those are ESVs that are the same but shifted). This should not be a problem, so long as we use one primer, UNLESS we have errors in demultiplexing. If barcodes don't get removed properly and an extra base gets left in we could generate an offset. Normally shouldn’t be a problem.
    12. Concatenate sequences in prep for ESV table building. Sample info must be in header, must be fasta files. Simple commands can be used to accomplish these goals.
    13. Make an ESV table. Usearch has a function to do this, so, thankfully, we don't have to do it by hand anymore. This step is the slowest part of this branch. It takes the ESVs and aligns the original, unfiltered data to them, thus retaining a handful of reads that were filtered out earlier.
      1. Minor cleaning steps to make the ESV table easier to parse (e.g. remove some hashes and spaces)
    14. QC: check sequence coverage. How much of the starting data made it into the ESVs table. Hopefully most of it did.
    15. QC: check that all ESVs made it into the table.
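
    For reference, here's a rough sketch of the main commands in this branch, driven from R with system2() so it can live in the same script as the DADA2 steps. The usearch binary name, file paths, and per-sample handling are placeholders, and the flag names are from the USEARCH docs as I remember them, so double-check them against whatever version we install.

```r
# Sketch of the U/VSEARCH branch called from R via system2().
# Binary name, file paths, and sample handling are placeholders; flags should be
# verified against the installed USEARCH version (VSEARCH has near-equivalent --flags).
usearch <- "usearch"

# Step 1: merge pairs with the parameters discussed above (10 max diffs, 80% id, 12 bp overlap).
system2(usearch, c("-fastq_mergepairs", "sampleA_R1.fastq",
                   "-reverse",          "sampleA_R2.fastq",
                   "-fastq_maxdiffs",   "10",
                   "-fastq_pctid",      "80",
                   "-fastq_minovlen",   "12",
                   "-fastqout",         "sampleA_merged.fq"))

# Step 5: quality filter, discarding reads with more than 1 expected error.
system2(usearch, c("-fastq_filter", "all_merged.fq",
                   "-fastq_maxee",  "1.0",
                   "-fastaout",     "filtered.fa"))

# Step 6: dereplicate with size annotations (needed by UNOISE).
system2(usearch, c("-fastx_uniques", "filtered.fa",
                   "-sizeout", "-relabel", "Uniq",
                   "-fastaout", "uniques.fa"))

# Step 7: denoise into ESVs (ZOTUs); chimera removal happens as part of this step.
system2(usearch, c("-unoise3", "uniques.fa", "-zotus", "zotus.fa"))

# Step 13: build the ESV table by mapping the original merged reads back onto the ZOTUs.
system2(usearch, c("-otutab",    "all_merged.fq",
                   "-zotus",     "zotus.fa",
                   "-otutabout", "esv_table.txt"))
```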

    BRANCH 2. DADA2


    1. Remove non-biological bases and any N base calls.
    2. Filter out PhiX, reads with more than X expected errors (default is 2), and reads with undefined bases. A trimming operation is also performed, which truncates reads once base quality declines. I have moved away from using a set length and towards using quality parameters as a means to control what makes it through. This is especially true for ITS data. The truncLen parameter in DADA2 discards anything shorter than that minimum length.
    3. Learn error rates, can be multithreaded. From dada2 help: “The learnErrors method learns this error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. As in many machine-learning problems, the algorithm must begin with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors).” Need to learn more about the actual math.
    4.  QC: plot some of the error rates to make sure the model did a good job.
    5. Dereplicate. Dada2 saves ESV quality info, which is nice.
    6. Call ESVs.
    7. Merge reads. To me, it is weird that this happens at this point in the dada2 pipeline. It seems like one could leverage more info if this happened first. Perhaps we could hack their pipeline to do this by using USEARCH to merge reads first, keeping q scores, then call ESVs with DADA2. I have not tried this so don’t know if it would work.
    8. Make ESV table.
    9. Remove chimeras.
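
    Here's a minimal R sketch of this branch using the standard dada2 functions. File paths, maxEE values, and thread settings are placeholders that we'd tune per lane.

```r
# Sketch of the DADA2 branch; paths and filtering thresholds are placeholders.
library(dada2)

fnFs   <- sort(list.files("demultiplexed", pattern = "_R1.fastq", full.names = TRUE))
fnRs   <- sort(list.files("demultiplexed", pattern = "_R2.fastq", full.names = TRUE))
filtFs <- file.path("filtered", basename(fnFs))
filtRs <- file.path("filtered", basename(fnRs))

# Step 2: filter out PhiX, reads with Ns, and reads with > 2 expected errors.
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     maxN = 0, maxEE = c(2, 2), rm.phix = TRUE,
                     truncQ = 2, multithread = TRUE)

# Steps 3-4: learn the error model and eyeball the fit.
errF <- learnErrors(filtFs, multithread = TRUE)
errR <- learnErrors(filtRs, multithread = TRUE)
plotErrors(errF, nominalQ = TRUE)

# Steps 5-6: dereplicate and call ESVs per sample.
derepFs <- derepFastq(filtFs)
derepRs <- derepFastq(filtRs)
dadaFs  <- dada(derepFs, err = errF, multithread = TRUE)
dadaRs  <- dada(derepRs, err = errR, multithread = TRUE)

# Steps 7-9: merge pairs, build the ESV table, and remove chimeras.
merged        <- mergePairs(dadaFs, derepFs, dadaRs, derepRs)
seqtab        <- makeSequenceTable(merged)
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus", multithread = TRUE)
```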

    BRANCH 3. Proposed DADA2/V(U)SEARCH pipeline: 

    1. Remove non-biological bases and any N base calls (CutAdapt).
    2. Initial quality filtering to remove any N base pairs. Remove base pairs after the first N.
    3. *Merge using U(V)SEARCH retaining quality information
    4. Import merged .FASTQ reads into DADA2
    5. Filter using base pair quality information, with no minimum length parameters.
      1. This can be done in a step-wise fashion: first remove base pairs below quality X, then remove reads with more than X expected errors (EE) ...
    6. Learn Errors - Put simply, the algorithm determines biological variants vs. sequencing errors
      1. We will need to think about how to incorporate previous sequencing runs into this step ("have we seen this before" database). We will have a wealth of sequence info by the end that we will want to leverage for our error learning step. 
        1. We will want to keep a copy of every sequence and number of times observed regardless of whether or not it is assigned as a sequencing error or biological variant. This will allow us to use previous runs to ID rare taxa. 
    7. Dereplicate - tabulates the number of times each read is seen in each sample. We lose our duplicated reads here and everything after runs much faster.
      1. Any new sequences will need to be added to the "have we seen this before" database.
    8. *Merge in DADA2 (if we decide against a U/VSEARCH merge-first approach, we can merge in DADA2 here).
    9. Tabulate into ESV/OTU table
    10. Remove chimeric sequences. 
    11. Assign taxonomy. We will have to think about how we will deal with new versions of each database. DADA2 currently supports Greengenes, Silva, RDP, and UNITE (others can be made available as well). 

    Export the ESV/OTU table to a new folder that also contains the metadata, using a common name so the two can be matched.
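
    Here's a rough R sketch of how this hybrid branch could fit together, with cutadapt and VSEARCH called through system2() and the merged reads handed off to DADA2. The primer sequences, binary names, file paths, and taxonomy training fasta are all placeholders, not settled choices.

```r
# Sketch of the proposed hybrid branch: cutadapt and VSEARCH via system2(),
# then DADA2 on the merged reads. Primers, paths, and training fasta are placeholders.
library(dada2)

# Step 1: remove primers/non-biological bases and reads containing Ns (cutadapt).
system2("cutadapt", c("-g", "FWD_PRIMER", "-G", "REV_PRIMER",
                      "--discard-untrimmed", "--max-n", "0",
                      "-o", "trimmed_R1.fastq", "-p", "trimmed_R2.fastq",
                      "sampleA_R1.fastq", "sampleA_R2.fastq"))

# Step 3: merge with VSEARCH, keeping the posterior quality scores for DADA2.
system2("vsearch", c("--fastq_mergepairs", "trimmed_R1.fastq",
                     "--reverse",          "trimmed_R2.fastq",
                     "--fastq_maxdiffs",   "10",
                     "--fastq_minovlen",   "12",
                     "--fastqout",         "sampleA_merged.fastq"))

# Steps 4-5: import merged reads into DADA2 and filter on quality only (no fixed length).
filterAndTrim("sampleA_merged.fastq", "sampleA_filt.fastq",
              maxN = 0, maxEE = 1, truncQ = 2, multithread = TRUE)

# Steps 6-7: learn errors, dereplicate, and call ESVs; pool = TRUE pools samples,
# which is one way to help recover taxa that are rare in any single sample.
err   <- learnErrors("sampleA_filt.fastq", multithread = TRUE)
derep <- derepFastq("sampleA_filt.fastq")
esvs  <- dada(derep, err = err, pool = TRUE, multithread = TRUE)

# Steps 9-11: tabulate, remove chimeras, and assign taxonomy (training fasta is a placeholder).
seqtab <- makeSequenceTable(esvs)
seqtab <- removeBimeraDenovo(seqtab, method = "consensus", multithread = TRUE)
tax    <- assignTaxonomy(seqtab, "silva_nr_train_set.fa.gz", multithread = TRUE)
```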


    MERGING of branches.


    1. Assign taxonomy. USEARCH uses the SINTAX algorithm, DADA2 uses a naive Bayesian classifier. Many other options exist. Does anyone have a preference of algorithm? I will continue to add to this step as this is one of the fastest changing parts of this pipeline in terms of new tool availability.
    2. Choice of training database is critically important. I typically use two training databases and use information from both. For 16S, I use GreenGenes and Silva; for ITS, I use Warcup and UNITE. Does anyone want to use other databases? See comments above in BRANCH 3 for DADA2 taxonomic assignment databases.
    3. Remove host DNA, cpDNA, and mtDNA. Info on this is gained from the databases used during taxonomic hypothesis generation, but I also recommend we use MIDORI (Machida et al. 2017), MITOFUN (Ntertilis et al. 2013), and the RefSeq mitochondria and plastid databases (Tatusova et al. 2014). Often these extra databases don't detect many additional unwanted ESVs, but they provide an extra measure of thoroughness.
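
    As one possible way to implement the organelle/host removal, here's a small R sketch that filters the DADA2 taxonomy table (the tax and seqtab object names come from the sketches above). The search terms and rank labels depend on the training database, so treat them as assumptions; ESVs flagged by the MIDORI/MITOFUN/RefSeq screens could be dropped the same way.

```r
# tax: taxonomy matrix from assignTaxonomy() (rows = ESV sequences, columns = ranks);
# seqtab: the chimera-free ESV table. Search terms are database-dependent assumptions
# (e.g. Silva labels chloroplasts at Order and mitochondria at Family).
unwanted <- apply(tax, 1, function(lineage) {
  lineage <- lineage[!is.na(lineage)]   # unassigned ranks come back as NA
  any(grepl("Chloroplast|Mitochondria|Eukaryota", lineage, ignore.case = TRUE))
})

# Keep ESV-table columns whose sequence is not flagged; assignTaxonomy() rownames
# are the same sequences as the table's column names, so we can index by name.
keep         <- !unwanted[colnames(seqtab)]
seqtab.clean <- seqtab[, keep, drop = FALSE]
tax.clean    <- tax[keep, , drop = FALSE]

# ESVs hit by a separate MIDORI / MITOFUN / RefSeq organelle screen (e.g. a
# vsearch --usearch_global run) can be removed from seqtab.clean the same way.
```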

    ...