Status (02 May 2022)


Demultiplexing and splitting

The directory for the raw sequence data and for the parsed and split reads is /project/microbiome/data_queue/seq/5ALA/rawdata. The raw data are typically gz compressed; use run_pigz.sh and run_unpigz.sh to compress and decompress with multithreaded pigz under SLURM. Files for individual samples will be in /project/microbiome/data_queue/seq/5ALA/rawdata/sample_fastq/.
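
As a minimal sketch of what a pigz wrapper like run_pigz.sh might contain (the script name, SLURM options, and file glob below are assumptions, not the real script):

```shell
# Hypothetical sketch of a SLURM wrapper around pigz; the real run_pigz.sh
# may differ. pigz -p N compresses with N threads.
cat > run_pigz_example.sh <<'EOF'
#!/bin/bash
#SBATCH --cpus-per-task=8
#SBATCH --time=02:00:00
# use one compression thread per allocated CPU; default to 8 outside SLURM
pigz -p "${SLURM_CPUS_PER_TASK:-8}" *.fastq
EOF
bash -n run_pigz_example.sh && echo "syntax ok"
```

run_unpigz.sh would presumably be the same shape with `unpigz` (or `pigz -d`) in place of `pigz`.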

Demultiplexing

The work is done by run_parse_count_onSplitInput.pl. As the name implies, we split the raw data into many files (492) so that the parsing can be done in parallel on many nodes. The approximate string matching we do requires ~140 hours of CPU time; by splitting the task across many jobs, the parsing completes in less than one hour.

...

Code Block
mkdir -p /gscratch/grandol1/NS6IllTest6-17/rawdata
cd /gscratch/grandol1/NS6IllTest6-17/rawdata
unpigz --to-stdout /project/microbiome/data_queue/seq/NS6IllTest6-17/rawdata/NovaSeq6IllTest_S1_poolR1_1001.fastq.gz | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - NS6TOAD_AIR_R1_
unpigz --to-stdout /project/microbiome/data_queue/seq/NS6IllTest6-17/rawdata/NovaSeq6IllTest_S1_poolR2_2001.fastq.gz | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - NS6TOAD_AIR_R2_

making 257 R1 files and 257 R2 files, with structured names (e.g., for the R1 set):

/gscratch/grandol1/5ALAIllTest6-17/rawdata/NS6TOAD_AIR_R1_000.fastq
/gscratch/grandol1/5ALAIllTest6-17/rawdata/NS6TOAD_AIR_R1_001.fastq
etc.

run_parse_count_onSplitInput.pl also writes to /gscratch.
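The per-chunk parsing can be pictured as a SLURM job array with one task per split file. This is an illustrative sketch only: the real submission logic lives in run_parse_count_onSplitInput.pl, and the job name, time limit, and parser invocation below are assumptions.

```shell
# Illustrative job-array sketch (NOT the real pipeline script): one array
# task per split chunk NS6TOAD_AIR_R1_000.fastq ... NS6TOAD_AIR_R1_256.fastq.
cat > parse_array.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=parse_chunks
#SBATCH --array=0-256
#SBATCH --time=01:00:00
# pick this task's chunk from the split files
CHUNK=$(printf "NS6TOAD_AIR_R1_%03d.fastq" "$SLURM_ARRAY_TASK_ID")
# the real parser call is inside run_parse_count_onSplitInput.pl;
# the echo below is a placeholder
echo "would parse $CHUNK"
EOF
bash -n parse_array.sbatch && echo "syntax ok"
```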

NS6_Demux.csv is used to map MIDs to sample names and projects.

Splitting to fastq for individuals

The info lines for each read in parsed_NS*_R1.fastq and parsed_NS*_R2.fastq have the locus, the forward mid, the reverse mid, and the sample name. These can be used with the demux key to separate reads into loci, projects, and samples, in the folder sample_fastq/. The reads are in separate files for each sequenced sample, including replicates. The unique combination of forward and reverse MIDs (for a locus) is part of the filename and allows replicates to be distinguished and subsequently merged.
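As a toy illustration of this routing step, reads can be sent to per-sample files with awk keyed on the sample field of the info line. The header layout below ("@id locus fwdMID revMID sample") is an assumed illustration, not the pipeline's exact format.

```shell
mkdir -p sample_fastq
# Toy parsed reads; the header layout is an assumption for illustration.
cat > parsed_toy_R1.fastq <<'EOF'
@read1 16S F01 R02 sampleA
ACGT
+
IIII
@read2 16S F01 R03 sampleB
TTGG
+
IIII
EOF
# route each 4-line fastq record to a file named after the sample field
awk 'NR % 4 == 1 { split($0, f, " "); out = "sample_fastq/" f[5] ".fastq" }
     { print >> out }' parsed_toy_R1.fastq
```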

...

Stopped at above step on 2/01/23 6:05pm

Calculate summary statistics on reads

In a sequence library’s rawdata/ directory (e.g., /project/microbiome/data_queue/seq/NS6/rawdata) I made run_aggregate.sh to run aggregate_usearch_fastx_info.pl as a SLURM job. Summaries are written to summary_sample_fastq.csv.
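The kind of per-file summary being aggregated (the script's name suggests it collects usearch fastx_info output) can be illustrated with plain awk on a toy fastq; the awk below is only a stand-in, not the pipeline's script.

```shell
# Toy fastq with reads of length 4, 6, and 8
cat > toy_summary.fastq <<'EOF'
@r1
ACGT
+
IIII
@r2
ACGTAC
+
IIIIII
@r3
ACGTACGT
+
IIIIIIII
EOF
# sequence lines are every 2nd line of each 4-line record
awk 'NR % 4 == 2 { n++; total += length($0) }
     END { printf("reads=%d mean_len=%.1f\n", n, total / n) }' \
  toy_summary.fastq > summary_toy.txt
cat summary_toy.txt
```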

Trim, merge and filter reads

In /project/microbiome/data_queue/seq/NS6/tfmergedreads, we used run_slurm_mergereads.pl to crawl the project folders and sample files (created in the splitting step above), merge read pairs, and filter on base quality. This script follows the steps in https://microcollaborative.atlassian.net/wiki/spaces/MICLAB/pages/1123778569/Bioinformatics+v3.0?focusedCommentId=1280377080#comment-1280377080, including trimming primers and joining unmerged reads. It writes a new set of fasta files (rather than fastq) for each sample and project, to be used in subsequent steps. These files are found in the 16S/ and ITS/ folders in tfmergedreads/; for example, see the contents of /project/microbiome/data/seq/NS6/tfmergedreads/16S/
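The base-quality filtering can be pictured with a maximum-expected-errors rule, as usearch/vsearch define it: each Phred score Q contributes 10^(-Q/10) expected errors, and a read is discarded when the sum over its bases exceeds a threshold. The toy awk below illustrates that rule only; it is not the pipeline's actual filter, and the threshold of 1.0 is an assumption.

```shell
# read1 has high-quality bases (Phred 40, 'I'), read2 low-quality (Phred 2, '#')
cat > toy_filter.fastq <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
@read2
ACGTACGT
+
########
EOF
# keep reads whose summed expected errors (Phred+33 encoding) are <= 1.0
awk 'BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
     NR % 4 == 2 { seq = $0 }
     NR % 4 == 0 {
       ee = 0
       for (i = 1; i <= length($0); i++)
         ee += 10^(-(ord[substr($0, i, 1)] - 33) / 10)
       if (ee <= 1.0) print seq
     }' toy_filter.fastq > kept.txt
cat kept.txt
```

Here read1 accumulates 8 × 10^-4 expected errors and is kept; read2 accumulates about 5 and is dropped.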

...

I used commands.R in that folder to plot the number of reads per sample (horizontal axis) against the number of reads that were removed because they did not merge or did not meet quality criteria and were filtered out (vertical axis). Purple is 16S and orange is ITS. It might be interesting to make that plot for each of the projects in the library (TODO), and possibly to add marginal histograms (or put quantiles on the plots).

...

Make OTU table

In /project/microbiome/data_queue/seq/NS6/otu, I ran run_slurm_mkotu.pl, which I modified to also pick up the joined reads (in addition to the merged reads).
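run_slurm_mkotu.pl drives the actual clustering; as a toy illustration of what the resulting OTU table is (per-sample counts of reads assigned to each OTU), here is the tabulation step in awk with made-up assignments:

```shell
# made-up read-to-OTU assignments: one "sample OTU" pair per read
cat > assignments.txt <<'EOF'
sampleA OTU1
sampleA OTU1
sampleA OTU2
sampleB OTU1
EOF
# count reads per (sample, OTU) pair and emit a tab-separated table
awk '{ count[$1 "\t" $2]++ }
     END { for (k in count) print k "\t" count[k] }' assignments.txt \
  | sort > otutable_toy.txt
cat otutable_toy.txt
```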

Make coligo table

In /project/microbiome/data_queue/seq/NS6/coligoIS or /project/microbiome/data/seq/NS6/coligoISD, there are 16S and ITS directories for all projects. These contain a file named coligoISDtable.txt with per-sample counts of the coligos and the ISD found in the trimmed forward reads. The script run_slurm_mkcoligoISDtable.pl passes over all of the projects and uses vsearch to build the table.
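As a toy version of the counting idea (the real table is built with vsearch by run_slurm_mkcoligoISDtable.pl; the sequences and exact-match grep below are only an illustration):

```shell
# toy trimmed forward reads; sequences are made up
cat > trimmed_toy_R1.fasta <<'EOF'
>r1
AACCGGTTAACC
>r2
GGTTAACCGGTT
>r3
AACCGGTTAACC
EOF
# tally whole-line exact matches for each (made-up) coligo sequence
: > coligo_counts_toy.txt
for coligo in AACCGGTTAACC GGTTAACCGGTT; do
  n=$(grep -c -x "$coligo" trimmed_toy_R1.fasta || true)
  printf "%s\t%s\n" "$coligo" "$n" >> coligo_counts_toy.txt
done
cat coligo_counts_toy.txt
```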

...