Status (28 April / 02 May 2022)
Demultiplexing and splitting
The directory for the raw sequence data (typically gz compressed; use run_pigz.sh and run_unpigz.sh to compress and decompress with multithreaded pigz, using SLURM) and for the parsed and split reads is /project/microbiome/data_queue/seq/cu_24feb21novaseq4NS5/rawdata. Files for individual samples will be in /project/microbiome/data_queue/seq/cu_24feb21novaseq4NS5/rawdata/sample_fastq/.
In this case, rather than a pair of files, six files were delivered. Files that included R2 and R3 in the name were indexing reads that make no sense for our library constructs, so I have deleted them with rm WyomingPool_*_R[2-3]_*.fastq.gz. That leaves four files, which I have concatenated for the sake of simplicity:
cat WyomingPool_L2_S2_L002_R1_001.fastq WyomingPool_L1_S1_L001_R1_001.fastq > WyomingPool_novaseq4_R1.fastq
cat WyomingPool_L2_S2_L002_R4_001.fastq WyomingPool_L1_S1_L001_R4_001.fastq > WyomingPool_novaseq4_R2.fastq
(Note that I renamed R4 to R2, because R2 is the typical label for the reverse reads.)
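For reference, run_pigz.sh and run_unpigz.sh are thin SLURM wrappers around pigz/unpigz. The sketch below shows the general shape such a wrapper could take; the account, time limit, and thread count are placeholders rather than the values in the actual scripts.

#!/bin/bash
#SBATCH --job-name=pigz
#SBATCH --account=microbiome        # placeholder account
#SBATCH --time=04:00:00             # placeholder time limit
#SBATCH --nodes=1
#SBATCH --cpus-per-task=16          # pigz threads

# Compress all fastq files in the library's rawdata directory with multithreaded pigz.
# A run_unpigz.sh counterpart would call unpigz instead.
cd /project/microbiome/data_queue/seq/cu_24feb21novaseq4NS5/rawdata
pigz -p "${SLURM_CPUS_PER_TASK}" *.fastq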
Demultiplexing
The work is done by run_parse_count_onSplitInput.pl. As the name implies, we split the raw data into many files (240 per read direction), so that the parsing can be done in parallel on many nodes. The approximate string matching that we are doing requires ~140 hours of CPU time, so we split the task across many jobs; at roughly 35 minutes of work per split file, the whole parsing step finishes in less than one hour of wall-clock time.
...
mkdir -p /gscratch/buerklegrandol1/cu_24feb21novaseq4NS5/rawdata
cd /gscratch/buerklegrandol1/cu_24feb21novaseq4NS5/rawdata
unpigz --to-stdout /project/microbiome/data_queue/seq/cu_24feb21novaseq4NS5/rawdata/WyomingPool_novaseq4_R1.fastq.gz | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - novaseq4NS5_R1_
unpigz --to-stdout /project/microbiome/data_queue/seq/cu_24feb21novaseq4NS5/rawdata/WyomingPool_novaseq4_R2.fastq.gz | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - novaseq4NS5_R2_
making 240 R1 files and 240 R2 files, with structured names (e.g., for the R1 set):
/gscratch/buerklegrandol1/cu_24feb21novaseq4NS5/rawdata/novaseq4NS5_R1_000.fastq
/gscratch/buerklegrandol1/cu_24feb21novaseq4NS5/rawdata/novaseq4NS5_R1_001.fastq
etc.
run_parse_count_onSplitInput.pl also writes to /gscratch.
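The submission logic lives in run_parse_count_onSplitInput.pl, and the parallelization presumably amounts to one job per split file pair. The SLURM array below only illustrates that pattern; the parser script name and its arguments are hypothetical, not the real interface.

#!/bin/bash
#SBATCH --job-name=parse
#SBATCH --array=0-239               # one task per split file pair (suffixes 000..239)
#SBATCH --cpus-per-task=1
#SBATCH --time=02:00:00             # placeholder

# Zero-padded index matching the split suffixes produced above.
IDX=$(printf "%03d" "${SLURM_ARRAY_TASK_ID}")
DIR=/gscratch/buerklegrandol1/cu_24feb21novaseq4NS5/rawdata

# Hypothetical invocation of a per-pair parser; the real script and arguments differ.
perl parse_one_pair.pl "${DIR}/novaseq4NS5_R1_${IDX}.fastq" "${DIR}/novaseq4NS5_R2_${IDX}.fastq"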
NovaSeq4NS5_Demux.csv was provided by Gregg Randolph. This file is used to map MIDs to sample names and projects.
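A quick sanity check of the key can catch obvious problems before parsing. The snippet below is only illustrative; the column holding the project name is a guess (the key's column layout has varied between lanes, as noted below), so adjust the field number to the actual file.

# Count rows (samples) in the demultiplexing key.
wc -l NovaSeq4NS5_Demux.csv

# Tally samples per project, assuming (hypothetically) the project name is in column 2.
cut -d',' -f2 NovaSeq4NS5_Demux.csv | sort | uniq -c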
...
The info lines for each read in parsed_novaseqNS*_R1.fastq and parsed_novaseqNS*_R2.fastq have the locus, the forward MID, the reverse MID, and the sample name. These can be used with the demux key to separate reads into loci, projects, and samples, in the folder sample_fastq/. The reads are in separate files for each sequenced sample, including replicates. The unique combination of forward and reverse MIDs (for a locus) is part of the filename and allows replicates to be distinguished and subsequently merged.
run_splitFastq_fwd.sh and run_splitFastq_rev.sh run splitFastq_manyInputfiles.pl, which steps through the many pairs of files to split reads by sample and project, and places them in /project/microbiome/data_queue/seq/cu_24feb21novaseq4NS5/rawdata/sample_fastq/.
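The actual splitting is done by splitFastq.pl and splitFastq_manyInputfiles.pl; the awk sketch below only illustrates the general idea of routing reads to per-sample files based on the parsed info line. It assumes, purely for illustration, that the sample name is the last whitespace-separated field of each header; the real header layout and output naming (locus, MID pair, project) are richer than this.

# Illustrative only: route each 4-line fastq record to a per-sample file,
# keyed on the (assumed) last whitespace-separated field of the header line.
mkdir -p sample_fastq
awk 'FNR % 4 == 1 { n = split($0, f, " "); sample = f[n] }
     { out = "sample_fastq/" sample ".fastq"; print >> out }' parsed_novaseqNS*_R1.fastq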
splitFastq.pl and splitFastq_manyInputfiles.pl will need tweaking in the future, whenever sample names or the format of the key for demultiplexing and metadata change. The number of columns has differed among some of the early sequence lanes, which necessitated changes to this parsing script.
...
In a sequence library’s rawdata/ directory (e.g., /project/microbiome/data_queue/seq/cu_24feb21novaseq4NS5/rawdata) I made run_aggregate.sh to run aggregate_usearch_fastx_info.pl with a SLURM job. Summaries are written to summary_sample_fastq.csv.
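For context, usearch's -fastx_info command reports read counts and length/quality summaries for a fastq file, and aggregate_usearch_fastx_info.pl collects that information across the per-sample files. A rough sketch of the same idea is below; the directory layout and output naming are placeholders, and the real aggregation and formatting are done by run_aggregate.sh and the Perl script.

# Run usearch's fastx_info on every per-sample fastq and write one small report per file.
cd /project/microbiome/data_queue/seq/cu_24feb21novaseq4NS5/rawdata/sample_fastq
for fq in */*.fastq; do
    usearch -fastx_info "$fq" -output "${fq%.fastq}_info.txt"
done
# A summary table (summary_sample_fastq.csv) would then be assembled from the *_info.txt reports.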
...
In /project/microbiome/data_queue/seq/cu_24feb21novaseq4NS5/tfmergedreads, we used run_slurm_mergereads.pl to crawl the project folders and sample files (created in the splitting step above), merge read pairs, and filter based on base quality. This script conforms to the steps in https://microcollaborative.atlassian.net/wiki/spaces/MICLAB/pages/1123778569/Bioinformatics+v3.0?focusedCommentId=1280377080#comment-1280377080, including trimming primers and joining unmerged reads. This writes a new set of fasta files (rather than fastq) for each sample and project, to be used in subsequent steps. These files are found in the 16S/ and ITS/ folders in tfmergedreads/. For example, see the contents of /project/microbiome/data/seq/cu_24feb21novaseq4NS5/tfmergedreads/16S/ayayee/.
Within each of these directories are files for the trimmed, merged, and filtered reads, in the subfolders trimmed/, joined/, and unmerged/. The last is used as a working directory and should end up empty; unmerged reads are filtered and joined and put in joined/ if they can be joined. The joined/ directory can itself be empty if, for example, all unmerged reads were coligos.
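For orientation, the per-sample operations are roughly of the following shape. This is a sketch, not the actual run_slurm_mergereads.pl: the tools shown are the ones commonly used for these steps (vsearch for merging, filtering, and joining; cutadapt for primer trimming), and every filename, primer sequence, and threshold here is a placeholder rather than the pipeline's real setting.

SAMPLE=example_sample                 # placeholder sample name

# 1. Merge overlapping read pairs, keeping the pairs that fail to merge.
vsearch --fastq_mergepairs ${SAMPLE}_R1.fastq --reverse ${SAMPLE}_R2.fastq \
        --fastqout ${SAMPLE}_merged.fastq \
        --fastqout_notmerged_fwd ${SAMPLE}_unmerged_R1.fastq \
        --fastqout_notmerged_rev ${SAMPLE}_unmerged_R2.fastq

# 2. Trim primers from the merged reads (FWD_PRIMER and REV_PRIMER_RC are placeholders).
cutadapt -g FWD_PRIMER -a REV_PRIMER_RC -o ${SAMPLE}_trimmed.fastq ${SAMPLE}_merged.fastq

# 3. Quality filter and write fasta for the downstream steps.
vsearch --fastq_filter ${SAMPLE}_trimmed.fastq --fastq_maxee 1.0 --fastaout ${SAMPLE}_tfmerged.fasta

# 4. Join the pairs that could not be merged, then treat them the same way as above.
vsearch --fastq_join ${SAMPLE}_unmerged_R1.fastq --reverse ${SAMPLE}_unmerged_R2.fastq \
        --fastqout ${SAMPLE}_joined.fastq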
...
In /project/microbiome/data_queue/seq/cu_24feb21novaseq4NS5/otu, I ran run_slurm_mkotu.pl, which I modified to also pick up the joined reads (in addition to the merged reads). I reran this to address an error in how we were using vsearch, whereby unique sequences were being incorrectly merged in the OTU table because they did not have unique names.
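The naming problem above is worth spelling out: if reads from different samples do not carry unique, sample-tagged labels before pooling, distinct sequences can be collapsed by name when the table is built. The sketch below shows the usual relabel-then-cluster pattern; it is not run_slurm_mkotu.pl, and the identity threshold, filenames, and sample annotation convention are assumptions.

# Give every read a unique label and a sample annotation before pooling; this is the
# kind of step that keeps distinct sequences from being collapsed by name.
vsearch --fastx_filter sample1_tfmerged.fasta --relabel "sample1." --sample sample1 \
        --fastaout sample1_relabel.fasta

# Pool all samples, dereplicate, and cluster into OTUs (placeholder identity threshold).
cat *_relabel.fasta > all.fasta
vsearch --derep_fulllength all.fasta --sizeout --output uniques.fasta
vsearch --cluster_size uniques.fasta --id 0.97 --centroids otus.fasta

# Map reads back to the OTU centroids and write a sample-by-OTU count table.
vsearch --usearch_global all.fasta --db otus.fasta --id 0.97 --otutabout otutable.txt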
Make coligo table
In /project/microbiome/data_queue/seq/psomagen_6mar20NS5/coligoISD, /project/microbiome/data/seq/psomagen_26may20NS5/coligoISD, and /project/microbiome/data/seq/psomagen_29jan21novaseq1cNS5/coligoISD, there are 16S and ITS directories for all projects. These contain a file named coligoISDtable.txt with counts of the coligos and the ISD found in the trimmed forward reads, per sample. The file run_slurm_mkcoligoISDtable.pl has the code that passes over all of the projects and uses vsearch to make the table.
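The table itself comes from run_slurm_mkcoligoISDtable.pl, but the vsearch step it wraps is essentially a global search of the trimmed forward reads against the coligo and ISD reference sequences, tabulated per sample. A rough sketch is below; the reference file name, identity threshold, and input layout are placeholders, and the read headers are assumed to carry sample annotations as in the OTU step above.

# Match trimmed forward reads against coligo + ISD reference sequences and
# tabulate hits per sample; coligo_ISD_ref.fasta and --id are placeholders.
vsearch --usearch_global trimmed_fwd_reads.fasta \
        --db coligo_ISD_ref.fasta \
        --id 0.97 \
        --otutabout coligoISDtable.txt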