...

The directory for the raw sequence data (typically gz compressed; use run_pigz.sh and run_unpigz.sh to compress and decompress with multithreaded pigz, using SLURM) and the parsed and split reads is /project/microbiome/data_queue/seq/NS5/rawdata. Files for individual samples will be in /project/microbiome/data_queue/seq/NS5/rawdata/sample_fastq/. In this case, rather than a pair of files, eight files were delivered. Files that included I1 and I2 in the name were indexing reads that make no sense for our library constructs, so I have deleted them with rm RG_SP_500_*_I[1-2]_*.fastq.gz. That leaves four files, which I have concatenated into two files for the sake of simplicity.

cat RG_SP_500_S1_R1_001.fastq.gz RG_SP_500_S2_R1_001.fastq.gz > Novaseq5_R1.fastq.gz

cat RG_SP_500_S1_R2_001.fastq.gz RG_SP_500_S2_R2_001.fastq.gz > Novaseq5_R2.fastq.gz
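Concatenating gzip files with cat produces a valid multi-member gzip stream, so the combined file should contain exactly the sum of the reads in its parts. The following is a minimal sketch of that check on toy data. The helper count_reads and the toy file names are hypothetical, and it uses gzip rather than pigz so that it runs anywhere.

```shell
# Count reads in a gzip-compressed FASTQ file (4 lines per read).
count_reads() {
    local n_lines
    n_lines=$(gzip -dc "$1" | wc -l)
    echo $(( n_lines / 4 ))
}

# Toy input in a temporary directory (hypothetical file names).
tmpdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' | gzip > "$tmpdir/a_R1.fastq.gz"
printf '@r2\nTTGA\n+\nIIII\n@r3\nGGCC\n+\nIIII\n' | gzip > "$tmpdir/b_R1.fastq.gz"

# Concatenate the compressed files directly, without decompressing first.
cat "$tmpdir/a_R1.fastq.gz" "$tmpdir/b_R1.fastq.gz" > "$tmpdir/combined_R1.fastq.gz"

n_a=$(count_reads "$tmpdir/a_R1.fastq.gz")
n_b=$(count_reads "$tmpdir/b_R1.fastq.gz")
n_combined=$(count_reads "$tmpdir/combined_R1.fastq.gz")
echo "parts: $(( n_a + n_b )) reads, combined: $n_combined reads"
```

The same count_reads call on the real R1 and R2 files should return identical numbers, since paired reads must stay in step.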

Demultiplexing

The work is done by run_parse_count_onSplitInput.pl. As the name implies, we split the raw data into many files (240), so that the parsing can be done in parallel by many nodes. The approximate string matching that we are doing requires ~140 hours of CPU time, so we are splitting the task across many jobs. By doing so, the parsing takes less than one hour.
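As a rough check on these numbers, the arithmetic can be sketched as follows. It assumes one job per split file, an even distribution of work, and enough nodes that all jobs run at the same time; none of that is guaranteed by the scheduler.

```shell
cpu_hours=140   # approximate total CPU time, from the text above
n_jobs=240      # number of split input files, assuming one job per file

# Integer arithmetic: total minutes divided evenly across the jobs.
minutes_per_job=$(( cpu_hours * 60 / n_jobs ))
echo "about ${minutes_per_job} minutes per job"
```

This gives about 35 minutes per job, which is consistent with the parsing finishing in less than one hour.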

...

Code Block
mkdir -p /gscratch/grandol1/NS5/rawdata
cd /gscratch/grandol1/NS5/rawdata
unpigz --to-stdout /project/microbiome/data_queue/seq/NS5/rawdata/Novaseq5_R1.fastq.gz | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - NS5_R1_
unpigz --to-stdout /project/microbiome/data_queue/seq/NS5/rawdata/Novaseq5_R2.fastq.gz | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - NS5_R2_

Below This Point is yet to be done

making 240 R1 files and 240 R2 files, with structured names (e.g., for the R1 set):

...
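The naming scheme that split produces with these options can be checked on a small input before running it on the full data. This is a sketch only: the toy file, the toy_R1_ prefix, and the 4-line chunk size are hypothetical, and --additional-suffix requires GNU coreutils split.

```shell
tmpdir=$(mktemp -d)

# Toy input: 3 reads, 12 lines.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n@r3\nGGCC\n+\nIIII\n' > "$tmpdir/toy.fastq"

# Same options as the real command, but with 4 lines (one read) per chunk.
# The line count must be a multiple of 4 so that no read is cut in two.
split -l 4 -d --suffix-length=3 --additional-suffix=.fastq "$tmpdir/toy.fastq" "$tmpdir/toy_R1_"

n_chunks=$(ls "$tmpdir"/toy_R1_*.fastq | wc -l)
ls "$tmpdir"/toy_R1_*.fastq
```

The chunks are numbered with three digits starting at 000 (toy_R1_000.fastq, toy_R1_001.fastq, and so on). With -l 16000000, each chunk of the real data holds 4,000,000 reads.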

run_parse_count_onSplitInput.pl also writes to /gscratch.

NS5_Demux.csv is used to map MIDs to sample names and projects.

...