ITS Bioinformatics for ALF1
Status (29 Aug 2022)
This is a draft modified from Bioinformatics for Novaseq run 4.
Raw Data arrived 29-08-2022
Demultiplexing and splitting
The directory for the raw sequence data (typically gz compressed; use run_pigz.sh and run_unpigz.sh to compress and decompress with multithreaded pigz via SLURM) and for the parsed and split reads is /project/gtl/data/raw/ALF1/ITS/rawdata. Files for individual samples will be in /project/gtl/data/distribution/ALF1/ITS/rawdata/sample_fastq/.
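For reference, here is a minimal sketch of what a SLURM wrapper around multithreaded pigz could look like; the actual run_pigz.sh / run_unpigz.sh scripts, resource requests, and file globs may differ.
#!/bin/bash
#SBATCH --job-name=pigz
#SBATCH --time=02:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
# Compress every fastq in the raw data directory using all CPUs given by SLURM.
# Swap "pigz" for "unpigz" (or "pigz -d") and *.fq for *.fq.gz to decompress instead.
cd /project/gtl/data/raw/ALF1/ITS/rawdata
pigz -p "${SLURM_CPUS_PER_TASK}" *.fq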
Demultiplexing
The work is done by run_parse_count_onSplitInput.pl. As the name implies, we split the raw data into many files (492), so that the parsing can be done in parallel by many nodes. The approximate string matching that we are doing requires ~140 hours of CPU time, so we split the task across many jobs. By doing so, the parsing takes less than one hour.
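As a rough illustration of the parallelization only (not the actual contents of run_parse_count_onSplitInput.pl; the way the script takes its input files is an assumption here), a SLURM array job can hand one split chunk pair to each task:
#!/bin/bash
#SBATCH --job-name=parse
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --array=0-166
# Each array task parses one R1/R2 chunk pair; the array range 0-166 assumes the
# 167 chunks produced by the split described below. Passing the two chunk file
# names directly to the Perl script is an assumption for illustration.
cd /gscratch/grandol1/ALFITS/rawdata
IDX=$(printf "%03d" "${SLURM_ARRAY_TASK_ID}")
perl run_parse_count_onSplitInput.pl ALFITS_R1_${IDX}.fastq ALFITS_R2_${IDX}.fastq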
Splitting the compressed raw data was accomplished with the program split (in an interactive SLURM job), with 16 × 10⁶ lines (4 × 10⁶ reads) written to each file (plus a remainder in the final file). These files were written to /gscratch and, as intermediate files that can be readily reconstructed, will not be retained long-term. Note that in a subsequent run, I used unpigz on the raw data and piped the output to split (see https://microcollaborative.atlassian.net/wiki/pages/resumedraft.action?draftId=1292926977):
mkdir -p /gscratch/grandol1/ALFITS/rawdata
cd /gscratch/grandol1/ALFITS/rawdata
unpigz --to-stdout /project/gtl/data/raw/ALF1/ITS/rawdata/1ALF_ITS_P1_1.fq.gz | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - ALFITS_R1_
unpigz --to-stdout /project/gtl/data/raw/ALF1/ITS/rawdata/1ALF_ITS_P1_2.fq.gz | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - ALFITS_R2_
This produced 167 R1 files and 167 R2 files with structured names (e.g., for the R1 set):
/gscratch/grandol1/ALFITS/rawdata/ALFITS_R1_000.fastq
/gscratch/grandol1/ALFITS/rawdata/ALFITS_R1_001.fastq
etc.
run_parse_count_onSplitInput.pl also writes its output to /gscratch.
ALF1_ITS_Demux.csv is used to map MIDs to sample names and projects.
Splitting to fastq for individuals
The info lines for each read in parsed_ALFITS_R1_*.fastq and parsed_ALFITS_R2_*.fastq have the locus, the forward MID, the reverse MID, and the sample name. These can be used with the demux key to separate reads into loci, projects, and samples, in the folder sample_fastq/. The reads are in separate files for each sequenced sample, including replicates. The unique combination of forward and reverse MIDs (for a locus) is part of the filename and allows replicates to be distinguished and subsequently merged.
run_splitFastq_fwd.sh and run_splitFastq_rev.sh run splitFastq_manyInputfiles.pl, which steps through the many pairs of files to split reads by sample and project and place them in /project/gtl/data/raw/ALF1/16S/rawdata/sample_fastq/.
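Purely as a sketch of the routing logic (splitFastq_manyInputfiles.pl is the script actually used; the info-line layout of read ID, locus, forward MID, reverse MID, sample name and the output naming are assumptions here), reads could be directed to per-sample files like this:
# Hypothetical illustration only; field order and output file naming are assumptions.
# The header of each 4-line record chooses the output file; the whole record is
# appended there. close() after each record avoids hitting the open-file limit.
mkdir -p sample_fastq
awk 'NR % 4 == 1 { split($0, f, " "); out = "sample_fastq/" f[5] "_" f[3] "_" f[4] ".R1.fastq" }
     { print >> out; if (NR % 4 == 0) close(out) }' parsed_ALFITS_R1_000.fastq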
Calculate summary statistics on reads
In a sequence library’s rawdata/ directory (e.g., /project/gtl/data/raw/ALF1/ITS/rawdata/) I made run_aggregate.sh to run aggregate_usearch_fastx_info.pl with a SLURM job. Summaries are written to summary_sample_fastq.csv.
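For context, the underlying per-file report comes from usearch's fastx_info command; a minimal loop over the per-sample files might look like the following sketch (the real run_aggregate.sh / aggregate_usearch_fastx_info.pl, their paths, and how the reports are aggregated into the CSV may differ):
# Run usearch -fastx_info on each per-sample fastq and collect the reports.
# Directory, file glob, and the usearch binary name are assumptions.
cd /project/gtl/data/raw/ALF1/ITS/rawdata/sample_fastq
mkdir -p fastx_info
for fq in *.fastq; do
    usearch -fastx_info "$fq" -output fastx_info/"${fq%.fastq}"_info.txt
done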
Remaining Steps
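The commands below gather the 16S and ITS per-sample fastq directories under a single combined sample_fastq/ location: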
mkdir /project/gtl/data/raw/ALF1/rawdata/sample_fastq
cd /project/gtl/data/raw/ALF1/16S/rawdata/sample_fastq
cp -R 16S /project/gtl/data/raw/ALF1/rawdata/sample_fastq
cd /project/gtl/data/raw/ALF1/ITS/rawdata/sample_fastq
cp -R ITS /project/gtl/data/raw/ALF1/rawdata/sample_fastq/
Go to 16S Bioinformatics for ALF1 for the remaining processing steps, starting from the section “Trim, merge and filter reads”.
@Gregg Randolph: please see /project/gtl/data/raw/ALF1/16S/tfmergedreads, where I made mergereads.nf and teton.conf, and edited trim_merge.pl into trim_mergecab.pl (initially I copied the file because I didn't have permissions to run it, but I then found I needed to make some changes, which are in the *cab.pl version).
You run the Nextflow script with:
module load nextflow
nextflow run -bg mergereads.nf -c teton.config
See inside mergereads.nf for other ways of running it (i.e., not in the background).
I first tried this on a pair of input files, and one of the vsearch steps in the middle failed because the inputs were too small. I then ran it on all of the input. The Nextflow script completes, but one of the vsearch steps appears to produce no output; it might be that some of the input files are genuinely too small. I can see that the trimming step is working (see output/trimmed/), but we are not getting the other files (in unmerged/, joined/, and in output/ itself).
Logs and other files for debugging from Nextflow are made automatically in work/. I remove this folder between jobs so that I can tell which output came from which run. For example, see work/ff/d9576360a5295e6c92a0183e485944/.command.log and neighboring files with ls -al work/ff/d9576360a5295e6c92a0183e485944/. In this case I think vsearch is silently failing to write anything. Importantly, you can see the exact command that was being run, which helps with debugging: see work/ff/d9576360a5295e6c92a0183e485944/.command.sh. I think a next step would be to try the commands on some sample data (an R1 and an R2 fastq file) that we expect should work.
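As a starting point for that test, the merge step could be exercised by hand on one known-good pair of files. This is a sketch using vsearch's standard paired-read merging options, not necessarily the exact command in mergereads.nf; the input file names are placeholders.
# Manually merge one R1/R2 pair to check that vsearch behaves as expected.
# testdata_R1.fastq and testdata_R2.fastq are placeholder names for a known-good sample.
module load vsearch   # or however vsearch is provided on the cluster; an assumption
vsearch --fastq_mergepairs testdata_R1.fastq \
        --reverse testdata_R2.fastq \
        --fastqout test_merged.fastq \
        --fastqout_notmerged_fwd test_unmerged_R1.fastq \
        --fastqout_notmerged_rev test_unmerged_R2.fastq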
Right now each job requests 1 hour from SLURM. I suspect the actual run time will be much shorter, so we can tailor the requested time down once we know what the jobs typically need.