Status (02 May 2022)

  • Data arrived by sftp on 28 April 2022. Everything below is a draft modified from Bioinformatics for Novaseq run 4. I will begin processing tomorrow, 4-29-22. Most of the text should remain the same. Current figures are irrelevant placeholders. Data processing finished 5-12-22.

Demultiplexing and splitting

...

The work is done by run_parse_count_onSplitInput.pl. As the name implies, we split the raw data into many files (240492), so that the parsing can be done in parallel by many nodes. The approximate string matching that we are doing requires ~140 hours of CPU time, so we are splitting the task across many jobs. By doing so, the parsing takes less than one hour.
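As a rough check on the "less than one hour" figure, the arithmetic can be sketched as below. This assumes the ~140 CPU hours divide evenly across jobs, that every job runs at the same time with no queue wait, and that there is one job per pair of R1/R2 chunks (240 pairs, per the split step on this page). The function name est_minutes is hypothetical and is not part of the pipeline.

```shell
# Hypothetical helper (not part of the pipeline): estimate wall-clock
# minutes per job when CPU_HOURS of work is divided evenly over N_JOBS
# jobs that all run concurrently.
# Usage: est_minutes CPU_HOURS N_JOBS
est_minutes() {
    awk -v h="$1" -v n="$2" 'BEGIN { printf "%.0f\n", h * 60 / n }'
}
```

Calling est_minutes 140 240 gives about 35 minutes per job, which is consistent with the parsing finishing in under an hour if all jobs start promptly.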

...

Code Block
mkdir -p /gscratch/grandol1/NS5/rawdata
cd /gscratch/grandol1/NS5/rawdata
unpigz --to-stdout /project/microbiome/data_queue/seq/NS5/rawdata/RG_SP_500_S1_R1_001.fastq.gz | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - NS5_R1_
unpigz --to-stdout /project/microbiome/data_queue/seq/NS5/rawdata/RG_SP_500_S1_R2_001.fastq.gz | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - NS5_R2_
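Because 16000000 is a multiple of 4, each chunk should hold whole FASTQ records (4 lines per read), and the R1 and R2 chunks with the same numeric suffix should hold the same reads. A minimal sketch of a sanity check on the split output follows. The function check_chunks is hypothetical and not part of the pipeline; it assumes the NS5_R1_NNN.fastq and NS5_R2_NNN.fastq names produced by the split commands above, and it only walks the R1 set, so an R2 chunk with no R1 partner would go unnoticed.

```shell
# Hypothetical helper (not part of the pipeline): sanity-check split output.
# For every R1 chunk in DIR, confirm that the matching R2 chunk exists,
# that the R1 line count is a multiple of 4 (whole FASTQ records), and
# that both chunks have the same number of lines.
# Usage: check_chunks DIR
check_chunks() {
    dir=$1
    status=0
    for r1 in "$dir"/NS5_R1_*.fastq; do
        # An unmatched glob is left as a literal pattern, so test existence.
        [ -e "$r1" ] || { echo "no R1 chunks in $dir" >&2; return 1; }
        base=${r1##*/}              # e.g. NS5_R1_000.fastq
        suffix=${base#NS5_R1_}      # e.g. 000.fastq
        r2="$dir/NS5_R2_$suffix"
        if [ ! -e "$r2" ]; then
            echo "missing R2 chunk for $r1" >&2
            status=1
            continue
        fi
        n1=$(wc -l < "$r1")
        n2=$(wc -l < "$r2")
        if [ $((n1 % 4)) -ne 0 ] || [ $((n1)) -ne $((n2)) ]; then
            echo "line-count problem: $r1 ($n1) vs $r2 ($n2)" >&2
            status=1
        fi
    done
    return $status
}
```

Run as check_chunks /gscratch/grandol1/NS5/rawdata; a nonzero exit status means at least one chunk pair needs a closer look before parsing.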

...

making 240 R1 files and 240 R2 files, with structured names (e.g., for the R1 set):

...

In /project/microbiome/data_queue/seq/psomagen_6mar20NS5/coligoISD, /project/microbiome/data/seq/psomagen_26may20NS5/coligoISD, and /project/microbiome/data/seq/psomagen_29jan21novaseq1cNS5/coligoISD, there are 16S and ITS directories for all projects. These contain a file named coligoISDtable.txt with counts of the coligos and the ISD found in the trimmed forward reads, per sample. The file run_slurm_mkcoligoISDtable.pl has the code that passes over all of the projects and uses vsearch for making the table.

5-2-2022 Alex Buerkle transferred all of this to /project/microbiome/data/seq/cu_29april22novaseq5