How to identify the OTUs that are the ISD or coligos

@Gordon Custer @Reilly Dibner @Macy Ricketts @Seifeddine Ben Tekaya @Ella DeWolf @John Calder @Erin Bentley @Paul Ayayee @Alessandra Ceretto @Abby Hoffman

Hi all, I am gonna post some code here that you can use to identify which of your OTUs are the ISD and the coligos. I will also be making a better, stand-alone program that will build separate OTU tables for just the coligos and ISD. You will find these coligo otu tables in your project folders. Note that even after I make this new program, there will still be ISD and Coligos in your main otu table that you will need to identify…hence this page. Please comment on confusing text and ask questions. Cheers, jh


To identify which OTU is the ISD one can use this call to usearch:

usearch -usearch_global zotus_nonchimeric.fa -db /project/microbiome/ref_db/synthgene.fasta -strand both -blast6out YOUROUTPUT -id 0.85

Replace YOUROUTPUT with a file name of your choosing. I recommend something easy to understand like “ISD_OTUs.txt”. The “blast6out” tag specifies the format of the output. The output tells you which OTU matched the ISD and how close the match was. To learn more about the output formats available, including the blast6out see this page: https://www.drive5.com/usearch/manual/output_files.html

The “strand” flag refers to which direction of the DNA to search (either sense or anti-sense). You could use “plus” or “both”. Since this is so fast I think searching in both directions makes sense (see what I did there).

“maxaccepts” refers to how many hits the algorithm will accept before stopping. Setting this to 0 does a complete search. The default is 1. You can read more here: https://www.drive5.com/usearch/manual/termination_options.html.

Important note: the synthgene.fasta file has several known variants of the ISD that have shown up. Therefore, you will see most of your OTUs that map to the ISD will map to each of these variants. Having these variants in the query database is useful if you wanted to do an exact search, but at 85% match having them all in there is redundant. But, again, this is so fast it doesn’t matter. I mention this so that when you check out the output it will be clear why an OTU will match the synthgene variant 1 and variant 2.

Don’t feel strange that there will be multiple OTUs associated with the ISD. This is to be expected given our degenerate primers, and various sources of technical error (and matches what other people report too for their ISDs). I suggest summing the reads for all the otus that belong to the ISD and using that for your normalization calculations.

How to identify the OTUs that are Coligos

To identify which otus are coligos you can use the R script located at: /project/microbiome/scripts/coligoFinder.R. Note that it may be advisable to process your data such that low-complexity reads are omitted from the outset. Low-complexity reads are those that contain a lot of repeats (homopolymers). Removing low-complexity reads removes coligos because coligo sequences end in a poly-A tail followed by a long poly-G tail. Usearch/Vsearch allows low-complexity sequences to be removed. This can be done before the other steps in one’s bioninformatic pipeline and coligos processed separately via the code in the section below that is called “Making a coligo table”.

Usage of the script is like so:

Rscript /project/microbiome/scripts/coligoFinder.R ./zotus_nonchimeric.fa /project/microbiome/ref_db/coligoSequences.csv

where zotus_nonchimeric is the path to your OTU file. Note for the script to work you need to use the “coligoSequences” file I stuck in the ref_db/ directory in the microbiome project (absolute path shown above).

This script will output a text file showing which otu corresponds to which coligo and the well for that coligo.

@Ella DeWolf used this on her 16s table and found ~100,000 otus that matched coligo sequences. Looking at those otus, they were much longer than we might expect for coligos so we might be finding coligo sequences within longer biological reads. To attempt to work around this, I added the primer sequence to either end of the coligos and only found ~2,000 otus that matched coligos that look to be the right length. I made an edited script for this at /project/microbiome/scripts/coligoFinder_16s_primers.R. Note from JH: we have this script set to match with up to 2 edits, this might be a bit strict, so I advise playing with this and bumping up the allowable edit distance in the call to agrep and seeing what happens. Examine the OTUs that are now flagged as coligos and see if they look like coligos.

You can use the output of this script to remove unwanted coligos from your OTU table using the script “CleanUpOTUtable.R” that is here: /project/microbiome/scripts/

To use the script: Rscript /project/microbiome/scripts/CleanUpOTUtable.R OTUTABLE OUTPUTfrom_coligoFinder

Note: the script was broken for a few days because of changes to one of the input files that were necessitated by other improvements further upstream. This script works as of Oct 9, 2020.

Making a coligo table

It is useful to make an OTU table that just has coligo counts in it. That way one can scan down a column and see which samples had more than a single coligo.

Here is some simple code to make such a table (ask Josh if you don’t understand what something means here). First we remove forward primers from the forward sequences. Best to pipe in your sequences into the call to cutadapt. Then we tweak the sequence headers a bit, remove the first 13 bases, convert to fasta, and make a table. Note that this script will miss a coligo sequence here and there because it does not attempt to correct technical errors. For simplicities sake, I have not made a fancier version of this script that would perform error correction. This is because the goal is to detect problematic contamination, so if we miss a sequence here or there it doesn’t matter. In the future, we can optimize if we like. Put the following code (after changing the input and output strings to match your needs) into a slurm submission script:

module load cutadapt

cutadapt -g ^GTGYCAGCMGCCGCGGTAA -o R1out_no16s YOURFORWARDSEQUENCES -e 0.25

cutadapt -g ^CTTGGTCATTTAGAGGAAGTAAT -o R1out_no16s_noits R1out_no16s -e 0.25

sed -i 's/\s/_/g' R1out_no16s_noits

sed -i 's/^>16/>rna16/' R1out_no16s_noits

sed -i 's/-/_/g' R1out_no16s_noits

awk '{if ($1 ~ /@/) {print}else{print substr ($0, 0, 13)}}' R1out_no16s_noits > short_polyg_filtered.fa

module load vsearch

vsearch --fastq_filter short_polyg_filtered.fa -fastaout short_polyg_filteredFA.fa

echo "Making coligo table"

vsearch -search_exact short_polyg_filteredFA.fa -db /project/microbiome/ref_db/coligos_and_abbreviatedISD.fa -strand plus -otutabout coligoTable  -maxaccepts 0 -minseqlen 5


Once the table has been made you can switch to R and see how bad cross-contamination was. You should expect occasional cross-contamination, unfortunately. So far, we have seen minor instances in all libraries. Note that you will likely need to make minor modifications to the code below to suit your own needs (due to variation in file headers, demux keys, and so on).

In R:

dat <- read.table("YOURTABLE", stringsAsFactors = F, header = T)
dat[1:5,1:5]
dim(dat)

key <- read.csv("YOURDEMUXKEY, stringsAsFactors = F)

names(dat) <- gsub("X(.*)","\1",names(dat)) #This part may need changed depending on your data.

key$forward_barcode <- toupper(key$forward_mid)
key$reverse_barcode <- toupper(key$reverse_mid)

key$combo <- paste(key$locus, key$forward_barcode, key$reverse_barcode, key$samplename, sep = "_")

for(i in 1:length(key$combo)){
names(dat)[which(key$combo[i] == names(dat))] <- key$wellposition[i]
}

dat$OTUID <- gsub("Coligo_", "",dat$OTUID)

dat <- dat[dat$OTUID != "ISD",]

percent_target <- NA

for(i in 2:length(names(dat))){
if(dat[dat$OTUID == names(dat)[i],i] > 15){
percent_target[i] <- dat[dat$OTUID == names(dat)[i],i] / sum(dat[,i])
}
}
mean(na.omit(percent_target))
summary(percent_target)
table(percent_target > 0.90)


Please let me know if you detect any bugs, have suggestions, etc. Please consider the output of these scripts carefully to make sure they meet your expectations. Cheers!