Redo2 Bioinformatics
Status (02 May 2022)
Everything below is modified from Bioinformatics for Novaseq run 4 ). Data processing finished .
Demultiplexing and splitting
The directory for the raw sequence data (typically gz compressed; use run_pigz.sh and run_unpigz.sh to compress and decompress with multithreaded pigz, using SLURM) and the parsed and split reads is /project/microbiome/data_queue/seq/ReRun2/rawdata
. Files for individual samples will be in /project/microbiome/data_queue/seq/ReRun2/rawdata/sample_fastq/
.
Demultiplexing
The work is done by run_parse_count_onSplitInput.pl
. As the name implies, we split the raw data into many files (492), so that the parsing can be done in parallel by many nodes. The approximate string matching that we are doing requires ~140 hours of CPU time, so we are splitting the task across many jobs. By doing so, the parsing takes less than one hour.
Splitting the compressed raw data was accomplished with the program split
(done in an interactive SLURM job), with 16x106 lines (4x106 reads) being written to each file (with a remainder in the final file). These files were written to /gscratch and, as intermediate files that can be reconstructed readily, will not be retained long-term. Note that in a subsequent run, I used unpigz
on the raw to send down pipe to split (see https://microcollaborative.atlassian.net/wiki/pages/resumedraft.action?draftId=1292926977).
mkdir -p /gscratch/grandol1/ReRun2/rawdata
cd /gscratch/grandol1/ReRun2/rawdata
unpigz --to-stdout /project/microbiome/data_queue/seq/ReRun2/rawdata/ReRun2EPSCoR_S1_R1_001.fastq.gz | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - ReRun2_R1_ ;
unpigz --to-stdout /project/microbiome/data_queue/seq/ReRun2/rawdata/ReRun2EPSCoR_S1_R2_001.fastq.gz | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - ReRun2_R2_
making 257 R1 files and 257 R2 files, with structured names (e.g., for the R1 set):
/gscratch/grandol1/IllTest6-17/rawdata/TOAD_AIR_R1_000.fastq
/gscratch/grandol1/IllTest6-17/rawdata/TOAD_AIR_R1_001.fastq
etc.
run_parse_count_onSplitInput.pl
also writes to /gscratch
.
ReRun2_Demux.csv
is used to map MIDS to sample names and projects.
Splitting to fastq for individuals
The info lines for each read in parsed_*_R1.fastq
and parsed_*_R2.fastq
have the locus, the forward mid, the reverse mid, and the sample name. These can be used with the demux key to separate reads into loci, projects, and samples, in the folder sample_fastq/
. The reads are in separate files for each sequenced sample, including replicates. The unique combination of forward and reverse MIDs (for a locus) is part of the filename and allows replicates to be distinguished and subsequently merged.
run_splitFastq_fwd.sh
and run_splitFastq_rev.sh
run splitFastq_manyInputfiles.pl
, which steps through the many pairs of files to split reads by sample and project, and place them in /project/microbiome/data_queue/seq/*/rawdata/sample_fastq/.
splitFastq.pl
and splitFastq_manyInputfiles.pl
will need tweaking in the future, whenever sample names and the format of the key for demultiplexing and metadata changes. The number of columns has differed among some of early sequence lanes, which necessitated changes to this parsing script.
Stopped at above step on 2/01/23 6:05pm
Calculate summary statistics on reads
In a sequence library’s rawdata/
directory (e.g., /project/microbiome/data_queue/seq/*/rawdata
) I made run_aggregate.sh
, to run aggregate_usearch_fastx_info.pl
with a slurm job. Summaries are written to summary_sample_fastq.csv
.
Trim, merge and filter reads
In /project/microbiome/data_queue/seq/*/tfmergedreads
, we used run_slurm_mergereads.pl
to crawl the project folders and sample files (created in the splitting step above) to merge read pairs, and filter based on base quality. This script conforms to the steps in https://microcollaborative.atlassian.net/wiki/spaces/MICLAB/pages/1123778569/Bioinformatics+v3.0?focusedCommentId=1280377080#comment-1280377080, including trimming primers, and joining unmerged reads. This writes a new set of fasta files for each sample and project, rather than fastq, to be used in subsequent steps. These files are found in the 16S/
and ITS/
folders in tfmergedreads/
. For example, see contents of /project/microbiome/data/seq/*/tfmergedreads/16S/
Within each of these directories are files for the trimmed, merged, and filtered reads, in subfolders trimmed/
, joined/
, and unmerged/
(the last one is used as a working directory, should be empty; unmerged reads are filtered and joined and put in joined/
if they can be joined; the joined directory can be empty, if all unmerged reads were coligos for example).
Statistics on the initial number reads, the number of reads that merged, and the number of reads that remain after filtering are in filtermergestats.csv
in each project folder. Please note that this will not include the number of reads that failed to merge, but we were able to join. This category is likely to include ITS sequences for which the amplicon was large enough that our 2x250bp reads could not span the whole length. The greater number removed in ITS (orange) in the plot below is consistent with this idea. For the full lane these summaries were concatenated in tfmergedreads/
with
cat */*/filtermergestats.csv > all_filtermergestats.csv
Make OTU table
In /project/microbiome/data_queue/seq/*/otu
, I ran run_slurm_mkotu.pl
, which I modified to also pick up the joined reads (in addition the merged reads).
Make coligo table
In /project/microbiome/data_queue/seq/*/coligoIS
or /project/microbiome/data/seq/*/coligoISD
, there are 16S
and ITS
directories for all projects. These contain a file named coligoISDtable.txt
with counts of the coligos and the ISD found in the trimmed forward reads, per sample. The file run_slurm_mkcoligoISDtable.pl
has the code that passes over all of the projects and uses vsearch
for making the table.