...
Merge reads, filter, and (optionally) trim primer regions
– Alex Buerkle is rerunning the following steps, with updated code that corresponds to the new options we're using across libraries.
In /project/microbiome/data/seq/psomagen_17sep20_novaseq2/tfmergedreads, we used run_slurm_mergereads.pl to crawl all of the project folders and sample files (created in the splitting step above) to merge or join read pairs, filter based on base quality, and trim primer regions from the reads. This writes a new set of fasta files (rather than fastq) for each sample and project, to be used in subsequent steps. These files are found in the 16S/ and ITS/ folders in tfmergedreads/. Statistics on the initial number of reads, the number of reads that merged, and the number of reads that remain after filtering are in filtermergestats.csv in each project folder. For the full lane, these summaries were concatenated in tfmergedreads/ with
...
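For users who want to rerun this step on their own files, the core merge-and-filter operations can be approximated directly with vsearch. This is only a sketch: the file names are placeholders, and run_slurm_mergereads.pl may use different tools or options.

```bash
# merge paired reads for one sample (placeholder file names)
vsearch --fastq_mergepairs sample_R1.fastq --reverse sample_R2.fastq \
        --fastqout sample_merged.fastq

# quality-filter the merged reads and write fasta, as used downstream
# (--fastq_maxee 1.0 is an illustrative threshold, not necessarily ours)
vsearch --fastq_filter sample_merged.fastq --fastq_maxee 1.0 \
        --fastaout sample_tfmergedreads.fa
```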
I used commands.R in that folder to make a plot of the number of reads per sample (horizontal axis) against the number of reads that were removed (vertical axis), either because they did not merge (this does not account for reads that were joined/concatenated because the amplicon region was so long that we didn't have sufficient overlap to merge) or because they did not meet quality criteria and were filtered out. Purple is for 16S and orange is for ITS. It might be interesting to make that plot for each of the projects in the library (TODO), and possibly to add marginal histograms (or put quantiles on the plots).
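For a quick shell check of the same statistics, something like the following works; the column layout here is hypothetical, so verify it against the header of filtermergestats.csv first.

```bash
# list samples that lost more than half of their initial reads
# (assumes columns: sample,initial,merged,filtered -- hypothetical; check the real header)
awk -F, 'NR > 1 && $4 < 0.5 * $2 { print $1, $2, $4 }' filtermergestats.csv
```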
...
Find unique sequences, cluster, and count OTUs
We ran a single script for each library that generated OTU tables for 16S and ITS. The tables can be found in the otu/ directory for each library: e.g., in the /project/microbiome/data/seq/psomagen_17sep20_novaseq2/otu/16S and /project/microbiome/data/seq/psomagen_17sep20_novaseq2/otu/ITS directories for all projects. The script to do this is also there: /project/microbiome/data/seq/psomagen_17sep20_novaseq2/otu/run_slurm_mkotu.pl.
This script crawls all of the projects (and loci) in the tfmergedreads/ folder, concatenates all input sequence files, and passes them through a UNIX pipe for analysis (rather than making another copy of all the sequence data). The key part of the script is the job that is created for each project:
```perl
$job .= "cd $allprojectsotu{$tfmdirectory};\n"; # this is the otu directory
$job .= "cat $tfmdirectory/*tfmergedreads.fa $tfmdirectory/joined/*tfmergedreads.fa | vsearch --derep_fulllength - --threads 32 --output uniqueSequences.fa --sizeout;\n";
$job .= "vsearch --cluster_unoise uniqueSequences.fa --sizein --minsize 8 --consout zotus.fa;\n";
$job .= "vsearch --uchime3_denovo zotus.fa --nonchimeras zotus_nonchimeric.fa --threads 32;\n";
$job .= "cat $tfmdirectory/*tfmergedreads.fa $tfmdirectory/joined/*tfmergedreads.fa | vsearch --usearch_global - --db zotus_nonchimeric.fa --otutabout - --id 0.99 --threads 32 | sed -e 's/^#OTU ID/OTUID/' > 'otutable';\n";
```
These steps can be repeated by individual users with different input data and vsearch options, as desired; please see below for instructions with examples. With the settings we used, each project took only a few hours.
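For instance, a user who prefers traditional 97% OTUs to denoised ESVs could swap the unoise step for a clustering step along these lines (a sketch; the output names are placeholders and the rest of the pipeline stays the same):

```bash
# cluster the dereplicated sequences at 97% identity instead of denoising
vsearch --cluster_size uniqueSequences.fa --sizein --id 0.97 \
        --centroids otus97.fa --relabel Otu
```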
As noted above, we are not trimming primer regions, so there are potentially some spurious OTUs/ESVs introduced by sequence variation in the primer regions.
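Users who are concerned about this can trim primers themselves before clustering. A minimal sketch with cutadapt, using the common 515F 16S primer purely as an example (substitute the primer for your locus and your own file names):

```bash
# trim the 5' primer from each read and drop reads where it is not found
# (GTGYCAGCMGCCGCGGTAA is the 515F 16S primer, shown only as an example)
cutadapt -g GTGYCAGCMGCCGCGGTAA --discard-untrimmed \
         -o sample_trimmed.fa sample_tfmergedreads.fa
```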
...
Make coligo table
In /project/microbiome/data/seq/psomagen_17sep20_novaseq2/coligoISD, there are 16S and ITS directories for all projects. These contain a file named coligoISDtable.txt with counts of the coligos and the ISD found in the trimmed forward reads, per sample. The file run_slurm_mkcoligoISDtable.pl has the code that passes over all of the projects and uses vsearch to make the table.
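In outline, the counting amounts to matching each sample's trimmed forward reads against the coligo and ISD reference sequences and tabulating the hits. A sketch of that idea (the reference file name and identity threshold here are hypothetical, not the script's actual settings):

```bash
# tabulate matches of forward reads to the coligo + ISD references
# (coligoISD_refs.fa is a hypothetical name; see the script for the real reference)
vsearch --usearch_global sample_forward.fa --db coligoISD_refs.fa \
        --id 0.97 --otutabout coligoISDtable.txt
```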
...
These steps can be repeated by individual users with different input data and vsearch options, as desired. Please see below for instructions with examples.
...
Next steps for individual users and projects
Info: The notes below have not been updated to parallel the steps taken above (analysis updated 20 March 2021). Some parts of what follows likely still apply and are correct. More up-to-date documentation and suggestions are in Bioinformatics v3.0.
The next tasks each user will need to perform are QC of the data, generation of taxonomic hypotheses for ESVs/OTUs, identification of the ESVs that are the synthgene and coligos (and combining the data for those ESVs sensibly), and finally statistical analysis. We go through each of these tasks below and provide some guidelines, along with a few tips for those new to bioinformatics. Experienced users: we will need to work together to help folks who are new to the game.
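For the taxonomy step, one option is the SINTAX classifier built into vsearch. A sketch, assuming a SINTAX-formatted reference database (the database path is a placeholder; e.g., a SILVA or UNITE release prepared for SINTAX):

```bash
# assign taxonomic hypotheses to the non-chimeric ESVs with SINTAX
# (sintax_refdb.fa is a placeholder for a SINTAX-formatted reference fasta)
vsearch --sintax zotus_nonchimeric.fa --db sintax_refdb.fa \
        --tabbedout taxonomy.txt --sintax_cutoff 0.8
```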
...