Bioinformatics for Novaseq run 3

 

Status (13 October 2020)

  • Data arrived on 9 October. @Alex Buerkle processed the data to merged reads and handed off to @Joshua Harrison for otutable generation, etc.

 

Demultiplexing and splitting

The directory for the raw sequence data and for the parsed and split reads is /project/microbiome/data/seq/psomagen_9oct20_novaseq3/rawdata (corresponding to https://microcollaborative.atlassian.net/wiki/spaces/MICLAB/pages/651558946). The raw data are typically gz-compressed; use run_pigz.sh and run_unpigz.sh to compress and decompress with multithreaded pigz via SLURM. Files for individual samples will be in /project/microbiome/data/seq/psomagen_9oct20_novaseq3/rawdata/sample_fastq/.

Demultiplexing

The work is done by run_parse_count_onSplitInput.pl. As the name implies, we split the raw data into many files so that the parsing can be done in parallel on many nodes. The approximate string matching we are doing requires ~140 CPU-hours, so we split the task across 191 jobs; done this way, the parsing takes less than one hour of wall time.
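A small helper makes the mapping from SLURM array index to chunk files explicit. This is a hedged sketch: the helper name is mine, and how chunks are actually assigned to the 191 jobs is internal to run_parse_count_onSplitInput.pl.

```shell
# Hypothetical helper: map a SLURM array task id to the pair of split chunks
# it should parse. printf's %03d matches split's --suffix-length=3 zero padding.
chunk_pair() {
  printf 'novaseq3_R1_%03d.fastq novaseq3_R2_%03d.fastq\n' "$1" "$1"
}

# In an sbatch script with '#SBATCH --array=0-190', one task might then run
# (assumed invocation; the real arguments are defined by the Perl script):
#   run_parse_count_onSplitInput.pl $(chunk_pair "$SLURM_ARRAY_TASK_ID")
```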

Splitting the raw (uncompressed) data was accomplished with the program split (run in an interactive SLURM job), with 16 × 10^6 lines (4 × 10^6 reads) written to each file (plus a remainder in the final file). These files were written to /gscratch and, as intermediate files that can be reconstructed readily, will not be retained long-term.

mkdir -p /gscratch/buerkle/psomagen_9oct20_novaseq3/rawdata
cd /gscratch/buerkle/psomagen_9oct20_novaseq3/rawdata
split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq /pfs/tsfs1/project/microbiome/data/seq/psomagen_9oct20_novaseq3/rawdata/NovaSeq3_R1.fastq novaseq3_R1_
split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq /pfs/tsfs1/project/microbiome/data/seq/psomagen_9oct20_novaseq3/rawdata/NovaSeq3_R2.fastq novaseq3_R2_

making 220 R1 files and 220 R2 files, with structured names (e.g., for the R1 set):

/gscratch/buerkle/psomagen_9oct20_novaseq3/rawdata/novaseq3_R1_000.fastq
/gscratch/buerkle/psomagen_9oct20_novaseq3/rawdata/novaseq3_R1_001.fastq
etc.
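Because split was given -l 16000000 (a multiple of 4), each chunk should contain only whole 4-line FASTQ records. A quick sanity check along these lines (the function name is mine) flags any chunk in which a read was broken across files:

```shell
# Warn about any FASTQ chunk whose line count is not a multiple of 4,
# i.e., a read record was split across files.
check_chunks() {
  dir=$1
  for f in "$dir"/*.fastq; do
    [ -e "$f" ] || continue          # skip if the glob matched nothing
    n=$(wc -l < "$f")
    [ $((n % 4)) -eq 0 ] || echo "WARNING: $f has $n lines (not a multiple of 4)"
  done
}

# e.g., check_chunks /gscratch/buerkle/psomagen_9oct20_novaseq3/rawdata
```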

run_parse_count_onSplitInput.pl also writes to /gscratch.

novaseq3_demux.csv was provided by @Joshua Harrison. This file is used to map MIDs to sample names and projects.

Splitting

The info lines for each read in parsed_novaseq*_R1.fastq and parsed_novaseq*_R2.fastq have the locus, the forward mid, the reverse mid, and the sample name. These can be used with the demux key to separate reads into loci, projects, and samples, in the folder sample_fastq/. The reads are in separate files for each sequenced sample, including replicates. The unique combination of forward and reverse MIDs (for a locus) is part of the filename and allows replicates to be distinguished and subsequently merged.

run_splitFastq_fwd.sh and run_splitFastq_rev.sh run splitFastq_manyInputfiles.pl, which steps through the many pairs of files to split reads by sample and project, and places them in /project/microbiome/data/seq/psomagen_9oct20_novaseq3/rawdata/sample_fastq/.
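A minimal sketch of the splitting idea, assuming (hypothetically) that each parsed header carries whitespace-separated fields @READID locus fwdMID revMID sampleName; the real header layout and output naming are defined by splitFastq_manyInputfiles.pl:

```shell
# Route each 4-line FASTQ record to a per-sample file, keyed on an assumed
# sample-name field in the header. On real data with many samples, awk's
# open-file limit would call for close() as samples change.
split_by_sample() {
  in=$1; outdir=$2
  mkdir -p "$outdir"
  awk -v outdir="$outdir" '
    NR % 4 == 1 {                # header line of each FASTQ record
      sample = $5                # assumed position of the sample name
      out = outdir "/" sample ".fastq"
    }
    { print >> out }
  ' "$in"
}
```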

splitFastq.pl and splitFastq_manyInputfiles.pl will need tweaking in the future, until sample names and the format of the key for demultiplexing and metadata stabilize. The number of columns has differed among our three completed sequencing lanes.

Calculate summary statistics on reads

In a sequence library’s rawdata/ directory (e.g., /project/microbiome/data/seq/psomagen_9oct20_novaseq3/rawdata) run:

module load swset/2018.05 gcc/7.3.0 usearch/10.0.240

aggregate_usearch_fastx_info.pl
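aggregate_usearch_fastx_info.pl presumably wraps usearch's -fastx_info report for each per-sample file and collects the results; a hedged sketch of that per-file step (the helper name and the three-level sample_fastq/ layout are assumptions):

```shell
# Run usearch -fastx_info on every per-sample FASTQ under the given directory,
# writing one .info.txt report next to each input
# (assumed layout: sample_fastq/<locus>/<project>/<sample>.fastq).
run_fastx_info() {
  for f in "$1"/*/*/*.fastq; do
    [ -e "$f" ] || continue
    usearch -fastx_info "$f" -output "${f%.fastq}.info.txt"
  done
}
```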

Merge and filter reads

In /project/microbiome/data/seq/psomagen_9oct20_novaseq3/tfmergedreads, we used run_slurm_mergereads.pl to crawl the project folders and sample files (created in the splitting step above), merge read pairs, and filter on base quality. We are now not trimming primer regions, even though we found with previous libraries that we could not readily eliminate low-frequency, potentially spurious OTUs downstream by masking with vsearch: it would not properly mask these regions (by marking them with lower-case bases) and then soft-mask them when we run vsearch --cluster_size. This step writes a new set of fasta files (rather than fastq) for each sample and project, to be used in subsequent steps. These files are found in the 16S/ and ITS/ folders in tfmergedreads/. Statistics on the initial number of reads, the number of reads that merged, and the number of reads that remain after filtering are in filtermergestats.csv in each project folder. For the full lane these summaries were concatenated in tfmergedreads/ with

cat */*/filtermergestats.csv > all_filtermergestats.csv
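With all_filtermergestats.csv in hand, a one-liner can total the counts across the lane. This assumes a hypothetical column layout of file,initialReads,mergedReads,filteredReads; check the actual columns in filtermergestats.csv before relying on it:

```shell
# Sum per-sample read counts across the lane (column order is an assumption).
sum_filtermerge() {
  awk -F, '{ init += $2; merged += $3; kept += $4 }
           END { printf "initial=%d merged=%d kept=%d\n", init, merged, kept }' "$1"
}

# e.g., sum_filtermerge all_filtermergestats.csv
```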

I used commands.R in that folder to plot, for each sample, the number of reads (horizontal axis) against the number of reads that were removed because they did not merge or did not meet quality criteria and were filtered out (vertical axis). Purple is for 16S and orange is for ITS. It might be interesting to make that plot for each of the projects in the library (TODO), and possibly to add marginal histograms (or put quantiles on the plots).

Initially @Alex Buerkle / I stopped there. But after finding an error in how we were using vsearch for other libraries, on Apr 5, 2021 I ran otutable generation for this library too. See otu/ for code and output. These data were not processed to work with joined reads (as we’ve done with other libraries). @Joshua Harrison: please let me know if you’d like me to redo the trim, merge, join, and filter steps and then rerun the otutable generation.