Content Comparison

...

The work is done by run_parse_count_onSplitInput.pl. As the name implies, we split the raw data into many files (176) so that many nodes can parse them in parallel. The approximate string matching requires ~48 hours of CPU time, so we split the task across many jobs. Run this way, the parsing takes less than one hour.

...

Code Block

# Create the working directory, then split each read file into numbered chunks.
# 16,000,000 lines = 4,000,000 reads per chunk (a FASTQ record is 4 lines).
mkdir -p /gscratch/grandol1/ReRun4/rawdata
cd /gscratch/grandol1/ReRun4/rawdata
unpigz --to-stdout /project/microbiome/data_queue/seq/ReRun4/rawdata/16SreRun4_S1_R1_001.fastq.gz | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - ReRun4_R1_
unpigz --to-stdout /project/microbiome/data_queue/seq/ReRun4/rawdata/16SreRun4_S1_R2_001.fastq.gz | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - ReRun4_R2_

This produces R1 and R2 files with structured names (e.g., for the R1 set):

/gscratch/grandol1/ReRun4/rawdata/ReRun4_R1_000.fastq
/gscratch/grandol1/ReRun4/rawdata/ReRun4_R1_001.fastq
etc.
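
To see what the split flags do before running them on a full sequencing run, the following is a minimal sketch on a tiny synthetic FASTQ. The file names (demo_R1.fastq.gz, demo_R1_), the temporary directory, and the chunk size of 8 lines are assumptions made for illustration, and gzip -dc stands in for unpigz so the sketch runs where pigz is not installed. It assumes GNU split, as the commands above do.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical scratch directory and file names, for illustration only.
workdir=$(mktemp -d)
cd "$workdir"

# Build a tiny FASTQ with 10 records (4 lines each = 40 lines) and compress it.
for i in $(seq 1 10); do
    printf '@read%d\nACGT\n+\nIIII\n' "$i"
done | gzip > demo_R1.fastq.gz

# Same split flags as the real command, but 8 lines (2 reads) per chunk
# instead of 16000000. The line count must stay a multiple of 4 so that
# no FASTQ record is cut in half.
gzip -dc demo_R1.fastq.gz | split -l 8 -d --suffix-length=3 --additional-suffix=.fastq - demo_R1_

# Count the chunks and confirm that each one starts on a record header.
chunks=(demo_R1_*.fastq)
n_chunks=${#chunks[@]}
echo "chunks: $n_chunks"
for f in "${chunks[@]}"; do
    IFS= read -r first < "$f"
    [[ $first == @read* ]] || { echo "bad chunk: $f" >&2; exit 1; }
done
```

With 40 input lines and 8 lines per chunk, this writes demo_R1_000.fastq through demo_R1_004.fastq. The same arithmetic applies to the real data: the number of chunks is the total line count divided by 16000000, rounded up.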

...

Code Block

# Clear out chunks from any previous run; "&&" skips the rm if the cd fails.
cd /gscratch/grandol1/ReRun3ReRun4/rawdata && rm -- *
unpigz --to-stdout /project/microbiome/data_queue/seq/ReRun3ReRun4/rawdata/ReRun3ReRun4_ITS_S1_R1_001.fastq.gz | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - ReRun3ReRun4_R1_
unpigz --to-stdout /project/microbiome/data_queue/seq/ReRun3ReRun4/rawdata/ReRun3ReRun4_ITS_S1_R2_001.fastq.gz | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - ReRun3ReRun4_R2_
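
Because R1 and R2 are split independently, it can be worth checking that every R1 chunk has an R2 chunk with the same number of lines before launching the parallel jobs. The sketch below shows one way to do that on synthetic data. The demo_ file names and the temporary directory are hypothetical, and this check is a suggestion rather than part of the documented workflow.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical scratch directory, for illustration only.
workdir=$(mktemp -d)
cd "$workdir"

# Synthetic paired reads: 6 records per direction, split at 2 records per chunk.
for r in R1 R2; do
    for i in $(seq 1 6); do
        printf '@read%d/%s\nACGT\n+\nIIII\n' "$i" "$r"
    done | split -l 8 -d --suffix-length=3 --additional-suffix=.fastq - "demo_${r}_"
done

# For each R1 chunk, derive the mate file name and compare line counts.
mismatches=0
for r1 in demo_R1_*.fastq; do
    r2=${r1/_R1_/_R2_}
    if [[ ! -f $r2 ]]; then
        echo "missing mate file for $r1" >&2
        mismatches=$((mismatches + 1))
        continue
    fi
    n1=$(wc -l < "$r1")
    n2=$(wc -l < "$r2")
    if [[ $n1 -ne $n2 ]]; then
        echo "line count differs: $r1 ($n1) vs $r2 ($n2)" >&2
        mismatches=$((mismatches + 1))
    fi
done
echo "mismatched chunk pairs: $mismatches"
```

On the real data, the same loop could be pointed at the ReRun3ReRun4_R1_*.fastq files in the rawdata directory; a nonzero count would suggest that one of the two splits was interrupted or that the input files differ in length.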

Version	Old Version 1	New Version Current
Changes made by	Gregg Randolph	Gregg Randolph
Saved on	Oct 04, 2023	Feb 14, 2024
