...
The directory for the raw sequence data (gz compressed) and the parsed and split reads is /project/microbiome/data/seq/psomagen_17sep20_novaseq2/rawdata
(corresponding to micro NovaSeq Run #2). Files for individual samples are in /project/microbiome/data/seq/psomagen_17sep20_novaseq2/rawdata/sample_fastq/
.
Demultiplexing
...
run_splitFastq_fwd.sh
and run_splitFastq_rev.sh
run splitFastq_manyInputfiles.pl
, which steps through the many pairs of files to split reads by sample and project, and place them in /project/microbiome/data/seq/psomagen_17sep20_novaseq2/rawdata/sample_fastq/.
splitFastq.pl
and splitFastq_manyInputfiles.pl
will need tweaking in the future, until sample names and the format of the key for demultiplexing and metadata stabilize. The number of columns has differed among our three completed sequence lanes.
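As a rough illustration of what splitting reads by sample and project involves, here is a minimal Python sketch. It is not the logic of splitFastq.pl. The key layout (barcode mapped to a project and sample), the exact-prefix matching, and the names read_fastq and split_by_barcode are all assumptions made for the example.

```python
import io

def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual

def split_by_barcode(handle, key):
    """Group reads by the (project, sample) whose barcode starts the read.

    `key` maps barcode -> (project, sample); this layout is hypothetical.
    The barcode is removed from the read and its quality string.
    Reads that match no barcode are collected under None.
    """
    by_sample = {}
    for header, seq, qual in read_fastq(handle):
        for barcode, dest in key.items():
            if seq.startswith(barcode):
                n = len(barcode)
                by_sample.setdefault(dest, []).append((header, seq[n:], qual[n:]))
                break
        else:
            by_sample.setdefault(None, []).append((header, seq, qual))
    return by_sample

# Toy data: two barcoded reads and one read with no matching barcode.
key = {"ACGT": ("projA", "sample1"), "TTGA": ("projB", "sample7")}
fastq = io.StringIO(
    "@r1\nACGTGGGG\n+\nIIIIIIII\n"
    "@r2\nTTGACCCC\n+\nIIIIIIII\n"
    "@r3\nNNNNAAAA\n+\nIIIIIIII\n"
)
groups = split_by_barcode(fastq, key)
```

In a sketch like this, a change in the number of key columns only affects how `key` is built, which is one way to keep the splitting logic stable while the key format is still changing.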
...
In a sequence library’s rawdata/
directory (e.g., /project/microbiome/data/seq/psomagen_17sep20_26may20novaseq2/rawdata
) run:
module load swset/2018.05 gcc/7.3.0 usearch/10.0.240
...
In /project/microbiome/data/seq/psomagen_17sep20_novaseq2/tfmergedreads
, we used run_slurm_mergereads.pl
to crawl all of the project folders and sample files (created in the splitting step above) to merge read pairs, filter based on base quality, and optionally trim primer regions from the reads. We are now not trimming primer regions, even though we found that we could not readily eliminate low-frequency, potentially spurious OTUs downstream by masking with vsearch; it would not properly mask these regions [by marking them with lower-case bases] and then soft-mask them when we do vsearch cluster_size
. The script writes a new set of fasta files for each sample and project, rather than fastq, to be used in subsequent steps. These files are found in the 16S/
and ITS/
folders in tfmergedreads/
. Statistics on the initial number of reads, the number of reads that merged, and the number of reads that remain after filtering are in filtermergestats.csv
in each project folder. For the full lane these summaries were concatenated in tfmergedreads/
with
...
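For reference, the soft-masking idea mentioned above (marking primer regions with lower-case bases so that a downstream tool can ignore them) can be sketched in a few lines of Python. This is only an illustration of the convention, not what run_slurm_mergereads.pl or vsearch does. The function name soft_mask_ends and the region lengths are hypothetical.

```python
def soft_mask_ends(seq, left, right):
    """Lower-case the first `left` and last `right` bases; keep the rest upper-case.

    `left` and `right` stand in for primer-region lengths (hypothetical values).
    """
    seq = seq.upper()
    if left + right >= len(seq):
        # The masked regions cover the whole read.
        return seq.lower()
    middle_end = len(seq) - right
    return seq[:left].lower() + seq[left:middle_end] + seq[middle_end:].lower()

masked = soft_mask_ends("ACGTACGTACGT", left=3, right=2)
print(masked)  # acgTACGTACgt
```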
I used commands.R
in that folder to make a plot of the number of reads per sample (horizontal axis) against the number of reads that were removed because they did not merge, or did not meet quality criteria and were filtered out (vertical axis). Purple is for 16S and orange is for ITS. It might be interesting to do that plot for each of the projects in the library (TODO), and possibly to have marginal histograms (or put quantiles on the plots).
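The quantity on the vertical axis is simple arithmetic on the three counts described above: reads removed = initial reads minus reads remaining after filtering. A minimal Python sketch of that calculation follows. The column names (sample, initial, merged, filtered) are assumptions, since the actual layout of filtermergestats.csv is not shown here, and this is not the code in commands.R.

```python
import csv
import io

def summarize_removed(handle):
    """Return per-sample counts of reads lost at the merge and filter steps.

    Expects columns sample, initial, merged, filtered (hypothetical names).
    """
    out = []
    for row in csv.DictReader(handle):
        initial = int(row["initial"])
        merged = int(row["merged"])
        kept = int(row["filtered"])
        out.append({
            "sample": row["sample"],
            "initial": initial,
            "not_merged": initial - merged,    # pairs that failed to merge
            "failed_filter": merged - kept,    # merged reads dropped on quality
            "removed": initial - kept,         # total removed (vertical axis)
        })
    return out

# Toy data standing in for a concatenated stats file.
toy = io.StringIO(
    "sample,initial,merged,filtered\n"
    "s1,1000,900,850\n"
    "s2,500,300,290\n"
)
summary = summarize_removed(toy)
```

Splitting the removed reads into not_merged and failed_filter would also make it easy to see which step accounts for most of the loss in a given project.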
...
Find unique sequences, cluster, and count OTUs
...