...

    Possibly has better error modeling for ESV generation than UNOISE does (need to think more about this).

    Seems to be agnostic to overall data size, whereas other methods are not. Thus, with UNOISE one will get different results for smaller datasets (according to Caruso et al., 2019), but one won't with DADA2. I need to vet this. Also, this issue may not matter to us because I imagine that as data increase in size the variation in UNOISE output attenuates.

    Processes samples one at a time, whereas UNOISE processes them all at once. There are pros and cons to this. The pro is that in cases where a particular taxon is common in one sample but rare everywhere else, DADA2 is more likely to correctly call that taxon an ESV. On the other hand, when there are few reads for a sample, DADA2 will likely produce many false positives. I suspect this is why in several papers DADA2 typically generates more ESVs than does UNOISE.

    It is unclear how to implement as many QC steps with DADA2 as with USEARCH. We can write methods to do this, but it seems like this will be on us and not part of the core software functionality.
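To make the per-sample versus pooled tradeoff concrete, here is a toy sketch. It is not the algorithm used by DADA2 or UNOISE; it only shows how pooling changes the abundance of a candidate sequence relative to a similar, abundant sequence. The read counts and the helper name `abundance_ratio` are made up for illustration.

```python
def abundance_ratio(candidate_count: int, neighbor_count: int) -> float:
    """Reads of a candidate sequence divided by reads of a similar, abundant sequence."""
    return candidate_count / neighbor_count


# Made-up read counts per sample: (candidate taxon, similar abundant neighbor).
# The candidate is common in sample_1 and rare or absent everywhere else.
counts = {
    "sample_1": (400, 500),
    "sample_2": (0, 5000),
    "sample_3": (1, 4500),
}

# Per-sample view: each sample is judged on its own counts.
per_sample = {name: abundance_ratio(c, n) for name, (c, n) in counts.items()}

# Pooled view: counts are summed across samples before the ratio is taken.
pooled = abundance_ratio(
    sum(c for c, _ in counts.values()),
    sum(n for _, n in counts.values()),
)

for name, ratio in per_sample.items():
    print(f"{name}: {ratio:.4f}")
print(f"pooled: {pooled:.4f}")
```

In sample_1 the candidate is nearly as abundant as its neighbor, so it is hard to dismiss as sequencing error. Once the samples are pooled, the same candidate is a small fraction of the neighbor's reads and looks much more like noise.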

...

DEBLUR - I do not see any reason to use this method at this time. It is not as fast as USEARCH and, perhaps, not as thorough as DADA2. After rereading the paper, I still do not think this method is as good as either of the others listed above. Others who know about this method, please add any opinions here.


MED - I have not used this method, but it has performed worse than other methods in every simulation study I have read, so perhaps we can omit this method from consideration.


SEEKDEEP - I have not used this method either. Need to learn more about it.


Steps. Note: sequences retained after each step will be written to a summary file. These steps assume PE reads that can be merged. If we have single-end reads or PE reads that are too short to merge, then we will need a different pipeline. We should not have this problem assuming 2x250 PE on the HiSeq.

...

  1. For ITS sequences, some folks like to try to remove the ITS region using the ITSxtractor tool. I do not do this because it seems a good way to introduce bias, is unnecessary now that we use ESVs, and is super slow. If someone disagrees, then I would be keen to hear why.
  2. Everything above assumes reads merge! If many biologically relevant reads do not merge because the read lengths output by the machine are too short, then we have many additional considerations. Do we want to build in machinery that deals with unmerged reads on the fly?
    1. Should we concatenate forward and reverse reads?
    2. We will need to decide on a global trimming length; otherwise just one extra base on one side or the other of a read could lead to different ESVs.
    3. Concatenation may complicate chimera detection, but I am not sure.
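A minimal sketch of what concatenating unmerged reads with a global trimming length could look like. The function names, the spacer of Ns, and the choice to reverse-complement the reverse read are my assumptions for illustration; none of this comes from USEARCH or DADA2.

```python
# Hypothetical helpers; not part of any existing pipeline or tool.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def concatenate_pair(forward, reverse, fwd_len, rev_len, spacer="NNNNNNNNNN"):
    """Trim both reads to fixed lengths and join them with a spacer.

    Returns None when either read is shorter than its trim length, so every
    retained sequence has exactly the same length.
    """
    if len(forward) < fwd_len or len(reverse) < rev_len:
        return None
    fwd = forward[:fwd_len]
    # Trim the reverse read from its own 5' end, then orient it to the forward strand.
    rev = reverse_complement(reverse[:rev_len])
    return fwd + spacer + rev


# Two forward reads that differ only by one extra trailing base give the
# same concatenated sequence once a global trim length is applied.
a = concatenate_pair("ACGTAC", "TTGGA", fwd_len=4, rev_len=3, spacer="NN")
b = concatenate_pair("ACGTACG", "TTGGA", fwd_len=4, rev_len=3, spacer="NN")
print(a, b, a == b)
```

Dropping reads shorter than the trim length is one option among several; padding them would keep more reads but would reintroduce length differences, which is the problem the global trim is meant to avoid.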
    Now is not the time, but we may want to consider inventing our own denoiser. The three existing tools use various distances (Levenshtein, Hamming, etc.) to build error models to detect ESVs. None of these methods seems to make use of known sequence error rates from the machine, and I bet improvements could be made. As I learn more, I will muse further on this here.
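For reference, a small sketch of the Hamming and Levenshtein distances mentioned above, plus the standard Phred conversion from a quality score to an error probability. Treating Phred scores as the "known error rates from the machine" is my assumption about what a home-grown denoiser might use; the function names are illustrative, and this is not how any of the existing tools is implemented.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x from a
                current[j - 1] + 1,          # insert y into a
                previous[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def phred_to_error_probability(q: float) -> float:
    """Standard Phred definition: P(error) = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# A single-base shift changes every position under Hamming,
# but costs only one deletion plus one insertion under Levenshtein.
print(hamming("ACGTACGT", "CGTACGTA"))
print(levenshtein("ACGTACGT", "CGTACGTA"))
print(phred_to_error_probability(30))
```

The shifted pair shows why the choice of distance matters: an indel near the start of a read makes two otherwise identical sequences look maximally different under Hamming distance, while Levenshtein distance treats them as close.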