Coligo creation and quality control

We use synthetic sequences that are spiked into individual wells of 96-well plates to track instances of cross-contamination. We call these sequences ‘coligos’, which is short for cross-contamination checking oligo.

Version 1 coligos were only 13 nucleotides long. They were sequences pulled from Hawkins et al. 2018 and were selected to allow differentiation via edit distance and to avoid extreme GC bias or homopolymers. I chose to make them short to save money, since shorter oligos cost a lot less than longer ones. However, their short length caused several inconveniences, including that the usearch_global algorithm had trouble differentiating them. This meant we had to use custom code or a strict search that didn’t allow for inevitable sequence variation. Also, the merging algorithms for paired-end reads don’t work well for short sequences. Still, version 1 coligos proved their utility as a simple quality control tool. They highlighted errors in organizing barcode plates and let us quality check our work. Now on to version 2.

Version 2 coligos were designed Oct. 9, 2020 and were originally nucleotides long. The script below can be revised to make coligos of different lengths. I took 12 randomly selected 17-base sequences from Hawkins et al. 2018, any two of which were at least an edit distance of two apart (this is the minimum pairwise edit distance in Hawkins’s sequences; the ones we sampled were much more differentiable, see below). I concatenated these twelve sequences to make a candidate coligo. I did this many times to build a large pool of candidate coligos to be whittled down based on our needs.

First, I calculated pairwise Levenshtein distances between coligos. The minimum distance was 93. Second, I took random 15-base sections from our ISD, matched them to the coligos, and removed any coligos that matched. Third, I compared 15-base sections of the 515, 806, ITS1f, and ITS2 primers to the coligos and removed any hits. I did this for the normal sense primer sequences, their reverses, complements, and reverse complements. Fourth, I removed any coligos with GC content less than 40% or more than 60%. Fifth, I checked our N1, N2, and RNaseP primers against the coligos, in the same way as the primers above, and removed any hits. Sixth, I looked for parts of the Illumina adaptor in the coligos and removed those. In addition to searching for the adaptors like I did with the primers, I also tried the beginning, middle, and end of the adaptor with an edit distance of 3 or 4, just to catch more possible matches. The R script to do all this is linked below.

I couldn’t think of much else to check, so I wrote the coligos out as a fasta and used SINTAX to compare them to the Silva database to see if they matched anything biological. I did the same thing with the UNITE database.
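For readers who want to see the shape of the design and screening steps, here is a minimal Python sketch. It is not the R script mentioned above, and every function name in it is made up for illustration. The barcodes are random placeholders rather than the Hawkins et al. 2018 sequences, the "primer" is a hypothetical string, and the complement table assumes plain A/C/G/T, so degenerate primer bases would need extra handling. The adaptor search with an edit distance of 3 or 4 is not covered.

```python
import random
from itertools import combinations


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


# Assumption: sequences contain only A, C, G, T (no degenerate bases).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def orientations(seq):
    """Sense, reverse, complement, and reverse complement of a sequence."""
    comp = seq.translate(COMPLEMENT)
    return [seq, seq[::-1], comp, comp[::-1]]


def windows(seq, k):
    """All k-base windows of seq (the whole sequence if it is no longer than k)."""
    if len(seq) <= k:
        return {seq}
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hits_contaminant(coligo, contaminant, k=15):
    """True if any k-base window of the contaminant, in any orientation, occurs in the coligo."""
    return any(w in coligo for o in orientations(contaminant) for w in windows(o, k))


def gc_fraction(seq):
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def make_candidates(barcodes, n_candidates, n_parts=12, rng=None):
    """Concatenate n_parts barcodes, sampled without replacement, into each candidate."""
    rng = rng or random.Random()
    return ["".join(rng.sample(barcodes, n_parts)) for _ in range(n_candidates)]


def screen(candidates, contaminants, k=15, gc_min=0.40, gc_max=0.60):
    """Keep candidates inside the GC window that contain no contaminant window."""
    return [c for c in candidates
            if gc_min <= gc_fraction(c) <= gc_max
            and not any(hits_contaminant(c, x, k) for x in contaminants)]


if __name__ == "__main__":
    rng = random.Random(1)
    # Placeholder 17-base barcodes; the real ones came from Hawkins et al. 2018.
    barcodes = ["".join(rng.choice("ACGT") for _ in range(17)) for _ in range(50)]
    candidates = make_candidates(barcodes, n_candidates=8, rng=rng)
    # Hypothetical primer sequence, for illustration only.
    contaminants = ["ACGTTGCAAGGCTTACCGATTG"]
    kept = screen(candidates, contaminants)
    if len(kept) > 1:
        # Minimum pairwise Levenshtein distance among the surviving candidates.
        print(min(levenshtein(a, b) for a, b in combinations(kept, 2)))
```

The pure-Python distance function is quadratic in sequence length, so a large candidate pool would probably call for a compiled edit-distance library instead.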

vsearch -sintax ~/synced/coligoSequences_longerVersion.fasta -db /project/microbiome/ref_db/unite4_02_20.fa -tabbedout coligoLong.sintax -strand both -sintax_cutoff 0.8
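One way to confirm that nothing was classified is to scan the tabbed output for queries with a taxonomy above the cutoff. The Python sketch below assumes the four-column layout I understand vsearch to write when -sintax_cutoff is given (query label, taxonomy with bootstrap values, strand, taxonomy passing the cutoff); check that against your vsearch version. The function name is hypothetical.

```python
def assigned_queries(path):
    """Return {query label: taxonomy above cutoff} for queries that got any assignment."""
    hits = {}
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\r\n").split("\t")
            if not fields or not fields[0]:
                continue  # skip blank lines
            # Assumed layout: label, taxonomy with bootstraps, strand, taxonomy above cutoff.
            above_cutoff = fields[3].strip() if len(fields) > 3 else ""
            if above_cutoff:
                hits[fields[0]] = above_cutoff
    return hits
```

Calling assigned_queries("coligoLong.sintax") and getting back an empty dictionary would correspond to no coligo matching anything at the 0.8 cutoff.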

None of the coligos matched anything, so I have made a spreadsheet that shows the new coligo sequences with their primers and well positions. This spreadsheet is linked directly below.


This is because the longer ones cost $10k and the midlength ones cost $6k. I am too much of a cheapskate for that! I will buy more when we run out of the cheap ones, I guess.


Hawkins, J. A., Jones, S. K., Finkelstein, I. J., & Press, W. H. (2018). Indel-correcting DNA barcodes for high-throughput sequencing. Proceedings of the National Academy of Sciences, 115(27), E6217-E6226.