How to identify and remove contaminant sequences

Background and recommendations for removal of contaminants from one’s data

Laboratory reagents and solvents harbor bacteria and fungi. The problem is so ubiquitous that some researchers coined the term 'kitome' to describe the consortia of microbes commonly found in DNA extraction kits. Every project that we sequence should have project-specific negative controls for each DNA extraction kit and all PCR ingredients used. In addition, the GTL will include negative controls in sequencing runs as we see fit (ideally, in every run). Note that we may colloquially refer to negative controls as 'blanks'.

Most of the time, researchers will want to remove contaminant sequences from their data. To do this, we match the sequences that show up in our negative controls against those in our samples. Matches are removed, or otherwise flagged as possible contaminants (see below).

This page will show you how to identify contaminants within your OTUs. Once contaminants are identified, one is faced with the thorny challenge of what to do with them. At first glance, it might seem best to discard possible contaminants by simply removing the line from the OTU table that describes the read counts for that contaminating taxon in each sample. The problem with this approach is two-fold. First, contaminating taxa can occur in nature, so the presence of a taxon in a negative control doesn't mean that its presence in other samples signifies laboratory-derived contamination. Second, cross-contamination happens, and when it does one might see sequences in the negative control from taxa that genuinely occur in one's samples and thus shouldn't be omitted from the data altogether.

There is no consensus on how best to deal with contaminants at this time. I have settled on two techniques that reviewers have accepted and that make sense to me as a good compromise. First, I scan through possible contaminants to see whether any taxa have shown up as contaminants before, either in my own experience or in papers discussing the 'kitome' (e.g., Salter et al. 2014). Frankly, unless you know your system well (e.g., like Seif knows what to expect in his rocks), this approach can be challenging. The other approach I use is to calculate, for each possible contaminating taxon, the proportion of its total reads that occur in the negative controls. For instance, I might calculate that 90% of all the sequences in my data assigned to E. coli showed up in my negative controls. In that situation, it seems quite likely that E. coli is a laboratory-derived contaminant. So, if the proportion I calculate is high, say more than 5% or 10%, then I flag the taxon as a likely contaminant and normally delete it from the data. This solution should keep one from deleting an abundant taxon that occurs in one's samples but has migrated into the negative control through cross-contamination, because that abundant taxon should be far more abundant in the biological samples than in the negative controls.

I recommend trying a few different thresholds for the proportion of the reads for a taxon that must show up in a negative control to instigate deletion (e.g., 5% to 15%) and see how the results are affected.
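The proportion-and-threshold idea above can be sketched with a few lines of shell. This is a minimal, hypothetical illustration, not GTL code: the table, its column layout (OTU ID, reads in negative controls, reads in biological samples), and the counts are all made up for the example.

```shell
# Hypothetical tab-delimited table: OTU ID, reads in blanks, reads in samples.
printf 'otu_a\t90\t10\notu_b\t2\t5000\notu_c\t30\t170\n' > otu_counts.tsv

# Flag any OTU whose share of total reads found in the blanks exceeds
# the threshold; try several thresholds and compare the results.
for thresh in 0.05 0.10 0.15; do
  echo "threshold $thresh:"
  awk -v t="$thresh" '{p = $2 / ($2 + $3); if (p > t) printf "  flag %s (%.2f)\n", $1, p}' otu_counts.tsv
done
```

Here otu_a (90% of its reads in blanks) is flagged at every threshold, otu_c (15%) drops out at the strictest threshold, and otu_b (well under 1%) is never flagged, which is the behavior one wants for an abundant real taxon that merely cross-contaminated a blank.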

There is clearly a need for modeling contamination as a function of prevalence among samples and relative abundances within samples and controls. Some folks are working on this (e.g., Davis et al. 2018) and I hope software will continue to improve.

Contamination is always worth considering and trying to minimize, however, unless one is working with samples with few taxa in them it is unlikely that contaminants will affect one’s results much. This is because contamination should not be confounded with treatment groups, assuming proper randomization of samples during processing. Moreover, contaminants should comprise a small proportion of the data. That said, if one is working in a taxon-poor substrate or is interested in very rare taxa then problems due to contamination should be given a lot of consideration.

How to identify and remove contaminants

Establishment of a reference database of contaminants

EDIT: I don’t think many people have added to the contaminant database. Moreover, for this database to be useful, everyone would need to ensure that their blanks were not egregiously contaminated before adding putative contaminants to the database…Therefore, I suspect most of us should take the simpler approach and simply look at what occurs in blanks and remove any OTUs that are of much higher relative abundance in the blanks compared to the biological samples (as described above).

I have copied the OTUs from the ‘control’ project in library 1B (psomagen_26may20) into these two files:

/project/microbiome/ref_db/possible_contaminants_16S.fasta
/project/microbiome/ref_db/possible_contaminants_ITS.fasta

These files can serve as our database for contaminants as they came from negative controls.

Below, I show how to grow these files to include information from additional negative controls, beyond those in library 1B. We do not have a 'control' project for the other libraries, so it wasn't a simple matter to extract data from negative controls from those libraries. Below, I suggest a possible way for clients to contribute information from their own blanks to the aforementioned files. This represents a first pass toward information sharing; a better approach will likely present itself.

Using u/vsearch to identify possible contaminants in an OTU file

Searching an OTU file with usearch is straightforward. The following command will do the trick:

usearch -usearch_global PATH_TO_YOUR_DATA/zotus_nonchimeric.fa -db /project/microbiome/ref_db/possible_contaminants_16S.fasta -strand both -blast6out OUTFILE -id 0.97 -maxaccepts 0

IMPORTANT: if you are searching your 16S OTUs for contaminants use the 16S contaminant fasta. If you are searching the ITS OTUs for contaminants then use the ITS fasta.

Usearch supports many output formats. I used blast6 formatting here, but perhaps other options are worth considering for different workflows. See the usearch help for more.

The -id flag sets the minimum fractional identity for a match (here 0.97, i.e., a 97% match). If you want exact matches only, I suggest using the search_exact algorithm instead of usearch_global; as its name implies, it performs an exact search and does not use the heuristics that usearch_global does.

The -maxaccepts flag controls how many hits are recorded between an OTU and sequences in the database. Since many sequences in the contaminant database are variants of some common taxon, a single OTU will often match multiple entries in the database. When -maxaccepts is set to 0, a complete search is performed. See the usearch help for more.

To count the number of distinct OTUs that matched the contaminant database, one can use something like this:

cut -f 1 possible_contaminants_16S_asle.b6 | sort -u | wc -l

I suggest performing this analysis via an interactive session on Teton, because it doesn't take very long to run (typically just a few seconds). If you need help setting up an interactive session, loading modules, etc., please contact Josh Harrison or one of your colleagues who can help.

Removing contaminants

Once contaminants are identified, one can remove them, if desired (see discussion above). We will post code to remove contaminants here as it is developed. However, I urge everyone to consider a bespoke treatment of contaminants, depending on their own project. In fact, one may decide to eschew the database linked above altogether and instead use only one's own blanks.
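As a starting point until polished code is posted, here is a hedged sketch of one way to drop flagged OTUs from a tab-delimited OTU table. The file names (contaminant_ids.txt, otu_table.tsv) and their contents are assumptions for the example; contaminant_ids.txt is imagined to hold one OTU ID per line, e.g., the deduplicated first column of the blast6 output from the search above.

```shell
# Hypothetical inputs: a list of OTU IDs flagged as contaminants, and an
# OTU table whose first column is the OTU ID.
printf 'otu_a\n' > contaminant_ids.txt
printf 'otu_a\t90\t10\notu_b\t2\t5000\n' > otu_table.tsv

# First pass (NR==FNR) loads the IDs to drop; second pass prints only
# rows of the OTU table whose ID is not in that set.
awk 'NR==FNR {drop[$1]; next} !($1 in drop)' contaminant_ids.txt otu_table.tsv > otu_table_clean.tsv
```

If you would rather flag than delete (see the discussion above), print every row and append a marker column instead of filtering.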

Adding to the contaminant database

Please add information to the contaminant database from your own blanks. To do this use the following steps:

Step 1

cat file1.fasta >> file2.fasta

The '>>' operator means append, so this command appends the contents of file1.fasta to the end of file2.fasta. Therefore, to add new contaminant info to the database your command will look like this:

cat YourContaminants.fasta >> ExistingContaminantDatabase.fasta

Please make sure to add the contaminants to the correct database, either the 16S or the ITS database.

THIS MODIFIES the database file in place (appending does not overwrite existing entries, but there is no undo), so if you are new to this sort of thing, be careful. If you are concerned about messing something up, you could instead write to a new file:

cat YourContaminants.fasta ExistingContaminantDatabase.fasta > newCombinedContaminants

If the new combined file has the expected number of lines (the sum of the line counts of the two input files; use wc -l to count lines in a file), keep it and delete the input files. Then rename the new combined contaminant file to match the original database name.
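The line-count check can be scripted so nothing is deleted until the counts agree. A minimal sketch follows; the file names mirror the placeholders above and the FASTA contents are invented for the example.

```shell
# Stand-in files so the sketch is self-contained; substitute your real files.
printf '>seqA\nACGT\n' > YourContaminants.fasta
printf '>seqB\nTTTT\n>seqC\nGGGG\n' > ExistingContaminantDatabase.fasta
cat YourContaminants.fasta ExistingContaminantDatabase.fasta > newCombinedContaminants

# The combined file should have exactly as many lines as the two inputs.
a=$(wc -l < YourContaminants.fasta)
b=$(wc -l < ExistingContaminantDatabase.fasta)
c=$(wc -l < newCombinedContaminants)
[ "$c" -eq $((a + b)) ] && echo "line counts match"
```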

Never fear, we will back up the database, so if something gets messed up, we can fix it.

Step 2

Some of the contaminants you added may have already been present in the database. So we will dereplicate the database using usearch, like this:

usearch -fastx_uniques DATABASE -fastaout DEREPLICATED_DATABASE -sizeout
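To see what dereplication accomplishes, here is a toy illustration in plain awk. It assumes a FASTA where each sequence sits on a single line; usearch's fastx_uniques handles wrapped sequences and adds ;size= annotations, which this sketch does not.

```shell
# Toy FASTA with a duplicated sequence (otu1 and otu3 are identical).
printf '>otu1\nACGT\n>otu2\nTTTT\n>otu3\nACGT\n' > tiny.fasta

# Keep the first header seen for each unique sequence; drop later repeats.
awk '/^>/ {h = $0; next} !seen[$0]++ {print h; print $0}' tiny.fasta > tiny_uniques.fasta
```

After running this, tiny_uniques.fasta contains only otu1 and otu2, since otu3's sequence was already present.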

Step 3

Clean up. Remove any temporary files that were made and rename the new version of the database to the same name as the old, now deleted, database (i.e., the file names provided at the top of this page).

Step 4

Commit the changes made to our git repo:

git commit -m "YOURMESSAGE" possible_contaminants_16S.fasta

Make the message informative. For instance, “Josh Harrison added negative controls from ASLE project”

If you don’t know what git is and want to learn more this page is pretty good: http://swcarpentry.github.io/2014-04-14-wise/advanced/git/local.html


Davis, N. M., Proctor, D. M., Holmes, S. P., Relman, D. A., & Callahan, B. J. (2018). Simple statistical identification and removal of contaminant sequences in marker-gene and metagenomics data. Microbiome, 6(1), 226.

Salter, S. J., Cox, M. J., Turek, E. M., Calus, S. T., Cookson, W. O., Moffatt, M. F., ... & Walker, A. W. (2014). Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. BMC Biology, 12(1), 87.