Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

To identify which otus are coligos you can use the R script located at: /project/microbiome/scripts/coligoFinder.R. Note that it may be advisable to process your data such that low-complexity reads are omitted from the outset. Low-complexity reads are those that contain a lot of repeats (homopolymers). Removing low-complexity reads removes coligos because coligo sequences end in a poly-A tail followed by a long poly-G tail. Usearch/Vsearch allows low-complexity sequences to be removed. This can be done before the other steps in one’s bioninformatic pipeline and coligos processed separately via the code in the section below that is called “Making a coligo table”.

Usage of the script is like so:

...

Note: the script was broken for a few days because of changes to one of the input files that were necessitated by other improvements further upstream. This script works as of Oct 9, 2020.

Making a coligo table

It is useful to make an OTU table that just has coligo counts in it. That way one can scan down a column and see which samples had more than a single coligo.

Here is some simple code to make such a table (ask Josh if you don’t understand what something means here). First we remove forward primers from the forward sequences. Best to pipe in your sequences into the call to cutadapt. Then we tweak the sequence headers a bit, remove the first 13 bases, convert to fasta, and make a table. Note that this script will miss a coligo sequence here and there because it does not attempt to correct technical errors. For simplicities sake, I have not made a fancier version of this script that would perform error correction. This is because the goal is to detect problematic contamination, so if we miss a sequence here or there it doesn’t matter. In the future, we can optimize if we like. Put the following code (after changing the input and output strings to match your needs) into a slurm submission script:

module load cutadapt

cutadapt -g ^GTGYCAGCMGCCGCGGTAA -o R1out_no16s YOURFORWARDSEQUENCES -e 0.25

cutadapt -g ^CTTGGTCATTTAGAGGAAGTAA -o R1out_no16s_noits R1out_no16s -e 0.25

sed -i 's/\s/_/g' R1out_no16s_noits

sed -i 's/^>16/>rna16/' R1out_no16s_noits

sed -i 's/-/_/g' R1out_no16s_noits

awk '{if ($1 ~ /@/) {print}else{print substr ($0, 0, 13)}}' R1out_no16s_noits > short_polyg_filtered.fa

module load vsearch

vsearch --fastq_filter short_polyg_filtered.fa -fastaout short_polyg_filteredFA.fa

echo "Making coligo table"

vsearch -search_exact short_polyg_filteredFA.fa -db /project/microbiome/ref_db/coligos_and_abbreviatedISD.fa -strand plus -otutabout coligoTable  -maxaccepts 0 -minseqlen 5

...

Once the table has been made you can switch to R and see how bad cross-contamination was. You should expect occasional cross-contamination, unfortunately. So far, we have seen minor instances in all libraries. Note that you will likely need to make minor modifications to the code below to suit your own needs (due to variation in file headers, demux keys, and so on).

In R:

dat <- read.table("YOURTABLE", stringsAsFactors = F, header = T)
dat[1:5,1:5]
dim(dat)

key <- read.csv("YOURDEMUXKEY, stringsAsFactors = F)

names(dat) <- gsub("X(.*)","\1",names(dat)) #This part may need changed depending on your data.

key$forward_barcode <- toupper(key$forward_mid)
key$reverse_barcode <- toupper(key$reverse_mid)

key$combo <- paste(key$locus, key$forward_barcode, key$reverse_barcode, key$samplename, sep = "_")

for(i in 1:length(key$combo)){
names(dat)[which(key$combo[i] == names(dat))] <- key$wellposition[i]
}

dat$OTUID <- gsub("Coligo_", "",dat$OTUID)

dat <- dat[dat$OTUID != "ISD",]

percent_target <- NA

for(i in 2:length(names(dat))){
if(dat[dat$OTUID == names(dat)[i],i] > 15){
percent_target[i] <- dat[dat$OTUID == names(dat)[i],i] / sum(dat[,i])
}
}
mean(na.omit(percent_target))
summary(percent_target)
table(percent_target > 0.90)

...

Please let me know if you detect any bugs, have suggestions, etc. Please consider the output of these scripts carefully to make sure they meet your expectations. Cheers!

...