How to generate taxonomic hypotheses for OTUs

OTU sequences can be matched against databases of sequences from identified taxa. A good match is evidence that the sequence could have come from that taxon. There is much more to generating taxonomic hypotheses from sequence data, and if this is new to you, I recommend reading the papers supporting the various algorithms and digging into the literature debating the merits of different approaches. Here, I will show one way to generate taxonomic hypotheses. Other ways exist, and I urge readers who prefer other approaches to add to this page.

The SINTAX algorithm

Details of this algorithm can be found here: https://www.drive5.com/usearch/manual/cmd_sintax.html

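Basically, this algorithm splits a query sequence into k-mers and compares them with the k-mers of each sequence in a reference database. It does this repeatedly on random subsamples of the query’s k-mers, each time finding the reference sequence that shares the most k-mers with the subsample. The proportion of subsamples that point to a given taxon (the bootstrap support) is used as evidence that the taxon is a suitable taxonomic hypothesis for the query sequence. Running the algorithm is pretty simple: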

vsearch -sintax OTUFILE -db REFERENCEDATABASE -tabbedout OUTPUT -sintax_cutoff 0.8 -strand both -threads 32

The ‘-sintax_cutoff’ option sets the minimum bootstrap support (here 0.8), i.e., the proportion of random k-mer subsamples that must agree on a taxon, for the algorithm to report that taxon as a possible match at a given rank. For info on other options, see the vsearch manual and the usearch webpage.
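To make the idea of support and a cutoff more concrete, here is a toy sketch in Python. It is not the vsearch implementation. As I understand the SINTAX documentation linked above, support comes from repeatedly subsampling the query's k-mers and recording which reference wins each round; the sketch does that for a single rank only. The function names (kmers, toy_sintax), the default parameter values, and the example sequences are all made up for illustration.

```python
import random
from collections import Counter


def kmers(seq, k=8):
    """Return the set of all overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_sintax(query, references, k=8, n_boot=100, n_sample=32, seed=1):
    """Toy bootstrap support for a single rank (hypothetical helper).

    references maps a taxon name to one reference sequence.
    Returns {taxon: support}, where support is the fraction of bootstrap
    rounds in which that taxon's reference shared the most k-mers with a
    random subsample of the query's k-mers.
    """
    rng = random.Random(seed)
    query_kmers = sorted(kmers(query, k))  # sorted so results are reproducible
    if not query_kmers:
        # Query is shorter than k, so there is nothing to compare.
        return {taxon: 0.0 for taxon in references}
    ref_kmers = {taxon: kmers(seq, k) for taxon, seq in references.items()}
    wins = Counter()
    for _ in range(n_boot):
        # Subsample the query's k-mers with replacement.
        sample = rng.choices(query_kmers, k=n_sample)
        scores = {
            taxon: sum(km in ref for km in sample)
            for taxon, ref in ref_kmers.items()
        }
        best = max(scores, key=scores.get)
        if scores[best] > 0:  # a round with no shared k-mers supports nobody
            wins[best] += 1
    return {taxon: wins[taxon] / n_boot for taxon in references}


# Made-up example: two references and a query identical to the first one.
references = {
    "GenusA": "ACCACAACCCAACACCAAACCACACCAAC",
    "GenusB": "GTTGTGGTTTGGTGTTGGGTTGTGTTGGT",
}
support = toy_sintax("ACCACAACCCAACACCAAACCACACCAAC", references)
# Keep only taxa whose support reaches the cutoff, as -sintax_cutoff 0.8 would.
passing = [taxon for taxon, value in support.items() if value >= 0.8]
print(support, passing)
```

The real algorithm reports support for every rank in the lineage of the best-matching reference, so treat this only as a way to build intuition about what a cutoff of 0.8 means.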

Different results will be obtained from different databases. There is much debate over which database is currently best. I don’t have strong opinions on this, other than that UNITE seems to be very good for fungi. See https://www.drive5.com/usearch/manual/faq_tax_db.html for the opinions of the developer of the SINTAX algorithm. He doesn’t like a lot of the big, standard databases, like SILVA and Greengenes. A lot of people do not follow his recommendations and use these databases anyway. I am not convinced it matters that much which one you choose, given that we should be thinking of these tools as hypothesis generation mechanisms, not definitive ways to identify taxa.
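If you run the same OTUs against two databases, it can help to see where the suggestions agree. The Python sketch below is one way to do that. It assumes the file written by -tabbedout has four tab-separated columns (query label, lineage with support values, strand, and the lineage that passed the cutoff) and that ranks are written as a letter, a colon, and a name (e.g. g:Aspergillus), separated by commas. Check a few lines of your own output before relying on that. The function names (read_sintax, compare_at_rank) are made up for illustration.

```python
def read_sintax(path):
    """Map each query label to {rank_letter: taxon_name}.

    Uses the fourth column (assumed to hold the lineage that passed
    the cutoff). Queries with no passing ranks get an empty dict.
    """
    assignments = {}
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if not fields[0]:
                continue  # skip blank lines
            filtered = fields[3] if len(fields) > 3 else ""
            lineage = {}
            for entry in filtered.split(","):
                rank, sep, name = entry.partition(":")
                if sep and name:
                    lineage[rank] = name
            assignments[fields[0]] = lineage
    return assignments


def compare_at_rank(a, b, rank="g"):
    """Label each OTU found in both results as agree, disagree, or unassigned."""
    result = {}
    for label in sorted(a.keys() & b.keys()):
        name_a = a[label].get(rank)
        name_b = b[label].get(rank)
        if name_a is None or name_b is None:
            result[label] = "unassigned"  # missing at this rank in at least one
        elif name_a == name_b:
            result[label] = "agree"
        else:
            result[label] = "disagree"
    return result
```

For example, compare_at_rank(read_sintax("unite.tsv"), read_sintax("other.tsv"), rank="g") would compare genus-level suggestions, where the two file names are placeholders for your own outputs. Databases do not always spell or format taxon names the same way, so some of the disagreements this reports may be differences in naming rather than in taxonomy.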

To reiterate, even if taxonomic hypotheses are congruent across databases, it is still possible that the suggested hypothesis is wrong. I suggest using these hypotheses cautiously. Even more caution is needed when trying to assign ecological function based on taxonomy. I have used databases like FUNGuild to do this in the past, but I am pretty skeptical of that approach now and would not recommend it. If others have different opinions here, those would be welcome.