Info

Status (28 September 2020)

...

(updated 20 March 2021)

I repeated the trimming, merging/joining, and OTU table generation steps, so that equivalent methods have now been applied to NovaSeq runs 1A, 1B, 1C, 3, and 4. I also redid coligo table construction.

Table of Contents

Demultiplexing and splitting

The directory for the raw sequence data (gz-compressed) and the parsed and split reads is /project/microbiome/data/seq/psomagen_17sep20_novaseq2/rawdata (corresponding to micro NovaSeq Run #2). Files for individual samples are in /project/microbiome/data/seq/psomagen_17sep20_novaseq2/rawdata/sample_fastq/.

Demultiplexing

...

run_splitFastq_fwd.sh and run_splitFastq_rev.sh run splitFastq_manyInputfiles.pl, which steps through the many pairs of files to split reads by sample and project, and places them in /project/microbiome/data/seq/psomagen_17sep20_novaseq2/rawdata/sample_fastq/.

splitFastq.pl and splitFastq_manyInputfiles.pl will need tweaking in the future, until sample names and the format of the key for demultiplexing and metadata stabilize. The number of columns in the key has differed among our three completed sequence lanes.
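Until the key format stabilizes, it is worth sanity-checking the column count of the demultiplexing key before launching the split scripts. A minimal sketch, assuming a tab-delimited key file (the file name here is a placeholder):

Code Block
language: bash

# Tally the number of columns per row of the demultiplexing key.
# One distinct count means the key is consistent; several counts mean
# the format changed and splitFastq.pl likely needs tweaking.
awk -F'\t' '{print NF}' demux_key.txt | sort | uniq -c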

...

In a sequence library’s rawdata/ directory (e.g., /project/microbiome/data/seq/psomagen_17sep20_novaseq2/rawdata) run:

module load swset/2018.05 gcc/7.3.0 usearch/10.0.240

...

In /project/microbiome/data/seq/psomagen_17sep20_novaseq2/tfmergedreads, we used run_slurm_mergereads.pl to crawl all of the project folders and sample files (created in the splitting step above) to merge or join read pairs, filter on base quality, and optionally trim primer regions from the reads (a sketch of the underlying merge-and-filter commands is below). We are now not trimming primer regions. We found that we could not readily eliminate low-frequency, potentially spurious OTUs downstream by masking the primer regions with vsearch: it would not properly mask these regions (by marking them with lower-case bases) and then soft-mask them when we do vsearch --cluster_size. This step writes a new set of fasta files (rather than fastq) for each sample and project, to be used in subsequent steps. These files are found in the 16S/ and ITS/ folders in tfmergedreads/. Statistics on the initial number of reads, the number of reads that merged, and the number of reads that remain after filtering are in filtermergestats.csv in each project folder. For the full lane these summaries were concatenated in tfmergedreads/ with

...
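The per-sample merge-and-filter work that run_slurm_mergereads.pl schedules is of the following general form. This is a minimal sketch, not the script itself: the file names are placeholders, and the actual merge and quality thresholds are set inside run_slurm_mergereads.pl.

Code Block
language: bash

# Merge overlapping read pairs for one sample (placeholder file names).
vsearch --fastq_mergepairs sample_R1.fastq --reverse sample_R2.fastq \
        --fastqout sample_merged.fastq --fastq_maxdiffs 10

# Quality-filter the merged reads and write fasta for the downstream steps.
vsearch --fastq_filter sample_merged.fastq --fastq_maxee 1.0 \
        --fastaout sample_tfmergedreads.fa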

I used commands.R in that folder to plot the number of reads per sample (horizontal axis) against the number of reads that were removed (vertical axis), either because they did not merge (this does not account for reads that were joined/concatenated because the amplicon region was so long that there was insufficient overlap to merge) or because they did not meet quality criteria and were filtered out. Purple points are 16S and orange points are ITS. It might be interesting to make that plot for each of the projects in the library (TODO), and possibly to add marginal histograms (or quantiles) to the plots.

...

Find unique sequences, cluster, and count OTUs

We ran a single script for each library that generated OTU tables for 16S and ITS. These tables can be found in the otu/ directory for each library: e.g., the /project/microbiome/data/seq/psomagen_17sep20_novaseq2/otu/16S and /project/microbiome/data/seq/psomagen_17sep20_novaseq2/otu/ITS directories hold results for all projects. The script that does this is also there: /project/microbiome/data/seq/psomagen_17sep20_novaseq2/otu/run_slurm_mkotu.pl. It crawls all of the projects (and loci) in the tfmergedreads/ folder and concatenates all input sequence files, passing them through a UNIX pipe for analysis (rather than making another copy of all the sequence data). For the 20 March 2021 rerun, the --search_exact command (the last line of the code block below) was added, and --relabel 'otu' was added to the --cluster_unoise command. The key part of the script is the job that is created for each project:

Code Block
breakoutMode: full-width
language: bash

$job .= "cd $allprojectsotu{$tfmdirectory};\n"; # this is the otu directory
$job .= "cat $tfmdirectory/*tfmergedreads.fa $tfmdirectory/joined/*tfmergedreads.fa | sed -e 's/\\s/_/g' | sed -e 's/^>16/>rna16/' | vsearch --derep_fulllength - --threads 32 --output uniqueSequences.fa --sizeout;\n";
$job .= "vsearch --cluster_unoise uniqueSequences.fa --relabel 'otu' --sizein --sizeout --consout zotus.fa --minsize 8;\n";
$job .= "vsearch --uchime3_denovo zotus.fa --chimeras chimera_out.fa --nonchimeras zotus_nonchimeric.fa --threads 32;\n";
$job .= "cat $tfmdirectory/*tfmergedreads.fa $tfmdirectory/joined/*tfmergedreads.fa | sed -e 's/\\s/_/g' | sed -e 's/^>16/>rna16/' | vsearch --usearch_global - --db zotus_nonchimeric.fa --otutabout - --id 0.99 --threads 32 | sed 's/^#OTU ID/OTUID/' > 'otutable';\n";
$job .= "cat $tfmdirectory/*tfmergedreads.fa $tfmdirectory/joined/*tfmergedreads.fa | sed -e 's/\\s/_/g' | sed -e 's/^>16/>rna16/' | vsearch --search_exact - --db zotus_nonchimeric.fa --otutabout - --threads 32 | sed 's/^#OTU ID/OTUID/' > 'otutable.esv';\n";
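For readability, here is roughly what one generated job looks like as plain bash, with the Perl variables replaced by a hypothetical project path (the real paths are filled in by run_slurm_mkotu.pl):

Code Block
language: bash

# Hypothetical rendering of one generated job; TFMDIR stands in for the
# project's tfmergedreads directory that the Perl interpolates.
TFMDIR=/project/microbiome/data/seq/psomagen_17sep20_novaseq2/tfmergedreads/16S/someproject

# Dereplicate all reads for the project (headers normalized by sed first).
cat $TFMDIR/*tfmergedreads.fa $TFMDIR/joined/*tfmergedreads.fa \
  | sed -e 's/\s/_/g' -e 's/^>16/>rna16/' \
  | vsearch --derep_fulllength - --threads 32 --output uniqueSequences.fa --sizeout

# Denoise into zero-radius OTUs, then remove chimeras.
vsearch --cluster_unoise uniqueSequences.fa --relabel 'otu' --sizein --sizeout --consout zotus.fa --minsize 8
vsearch --uchime3_denovo zotus.fa --chimeras chimera_out.fa --nonchimeras zotus_nonchimeric.fa --threads 32

# Map all reads back to the non-chimeric zOTUs to build the OTU table.
cat $TFMDIR/*tfmergedreads.fa $TFMDIR/joined/*tfmergedreads.fa \
  | sed -e 's/\s/_/g' -e 's/^>16/>rna16/' \
  | vsearch --usearch_global - --db zotus_nonchimeric.fa --otutabout - --id 0.99 --threads 32 \
  | sed 's/^#OTU ID/OTUID/' > otutable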

Make coligo table

In /project/microbiome/data/seq/psomagen_17sep20_novaseq2/coligoISD, there are 16S and ITS directories for all projects. These contain a file named coligoISDtable.txt with per-sample counts of the coligos and the ISD found in the trimmed forward reads. The script run_slurm_mkcoligoISDtable.pl contains the code that passes over all of the projects and uses vsearch to make the table.
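The core operation is a per-sample exact match of the trimmed forward reads against the coligo and ISD reference sequences, tabulated with vsearch. A minimal sketch, assuming a reference fasta named coligo_ISD.fa and a per-sample forward-read fasta (the actual file names and options live in run_slurm_mkcoligoISDtable.pl):

Code Block
language: bash

# Count exact matches of trimmed forward reads to the coligo/ISD references
# and write the per-sample counts as a table (file names are placeholders).
vsearch --search_exact sample_fwd.fa --db coligo_ISD.fa \
        --otutabout coligoISDtable.txt --threads 32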

...

These steps can be repeated by individual users with different input data and vsearch options, as desired. The settings we used took only a few hours per project. As noted above, we are not trimming primer regions, so some spurious OTUs/ESVs are potentially introduced by sequence variation in the primer regions. Please see below for instructions with examples.
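For example, a user who wanted to keep rarer sequence variants could rerun just the denoising step on the dereplicated sequences with a lower minimum cluster size. A minimal sketch with a hypothetical --minsize of 4 rather than the 8 used above:

Code Block
language: bash

# Re-denoise with a lower minimum cluster size to retain rarer variants.
vsearch --cluster_unoise uniqueSequences.fa --relabel 'otu' --sizein --sizeout \
        --consout zotus_minsize4.fa --minsize 4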

...

Next steps for individual users and projects

Info

The notes below have not been updated to parallel the steps taken above (analysis updated 20 March 2021). Some parts likely still apply and are correct. More up-to-date documentation and suggestions are in Bioinformatics v3.0.

The next tasks each user will need to perform are QC of the data, generation of taxonomic hypotheses for ESVs/OTUs, identification of the ESVs that are the synthgene and coligos (and sensibly combining the data for those ESVs), and finally statistical analysis. We will go through each of these tasks below and provide some guidelines, along with a few tips for those new to bioinformatics. Experienced users: we will need to work together to help folks who are new to the game.
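For generating taxonomic hypotheses, one option that stays within the tools used above is vsearch's SINTAX classifier. A minimal sketch, assuming a SINTAX-formatted reference database; the reference file name and cutoff are placeholders, not project defaults:

Code Block
language: bash

# Assign taxonomy to the OTU/ESV representative sequences with SINTAX.
vsearch --sintax zotus_nonchimeric.fa --db sintax_reference_db.fa \
        --sintax_cutoff 0.8 --tabbedout zotus_taxonomy.txt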

...