...

The work is done by run_parse_count_onSplitInput.pl. As the name implies, we split the raw data into many files (176) so that the parsing can be done in parallel across many nodes. The approximate string matching requires ~48 hours of CPU time, so we split the task across many jobs; run this way, the parsing takes less than one hour.

...

# Create a scratch directory for the split files and work there.
mkdir -p /gscratch/grandol1/ReRun4/rawdata
cd /gscratch/grandol1/ReRun4/rawdata
# Decompress each read file and split it into 16,000,000-line chunks.
# A FASTQ record is 4 lines, so each chunk holds 4,000,000 whole reads.
unpigz --to-stdout /project/microbiome/data_queue/seq/ReRun4/rawdata/16SreRun4_S1_R1_001.fastq.gz | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - ReRun4_R1_
unpigz --to-stdout /project/microbiome/data_queue/seq/ReRun4/rawdata/16SreRun4_S1_R2_001.fastq.gz | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - ReRun4_R2_

...

/gscratch/grandol1/ReRun4/rawdata/ReRun4_R1_000.fastq
/gscratch/grandol1/ReRun4/rawdata/ReRun4_R1_001.fastq
etc.
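
Before submitting the parsing jobs, it may be worth confirming that the split produced matching forward and reverse chunks. The following is a minimal sketch, not part of the original workflow: the function name check_pairs is hypothetical, and it assumes standard 4-line FASTQ records and the PREFIX_R1_NNN.fastq / PREFIX_R2_NNN.fastq naming shown above. It reports any R1 chunk that lacks an R2 mate, any pair whose line counts differ, and any line count that is not a multiple of 4.

```shell
#!/usr/bin/env bash
# check_pairs DIR PREFIX  (hypothetical helper, not part of the original scripts)
# For every DIR/PREFIX_R1_NNN.fastq, verify that the matching R2 chunk exists,
# that both chunks have the same number of lines, and that the count is a
# multiple of 4 (one FASTQ record = 4 lines). Returns non-zero on any problem.
check_pairs() {
    local dir=$1 prefix=$2
    local r1 r2 base n1 n2 status=0
    for r1 in "$dir"/"$prefix"_R1_*.fastq; do
        # If the glob matched nothing, r1 is the literal pattern.
        if [ ! -e "$r1" ]; then
            echo "no R1 chunks found in $dir" >&2
            return 1
        fi
        base=$(basename "$r1")
        r2="$dir/${base/_R1_/_R2_}"
        if [ ! -f "$r2" ]; then
            echo "missing R2 mate for $base"
            status=1
            continue
        fi
        # Arithmetic expansion strips any padding that wc adds.
        n1=$(( $(wc -l < "$r1") ))
        n2=$(( $(wc -l < "$r2") ))
        if [ "$n1" -ne "$n2" ] || [ $(( n1 % 4 )) -ne 0 ]; then
            echo "bad pair: $base ($n1 vs $n2 lines)"
            status=1
        fi
    done
    return "$status"
}
```

It would be called with the directory and prefix used for the split, for example: check_pairs /gscratch/grandol1/ReRun4/rawdata ReRun4. No output and a zero exit status mean every chunk pair looks consistent.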

...