...

Splitting the raw (uncompressed) data was accomplished with the program split, with 1x10^6 lines (2.5x10^5 reads, at four lines per FASTQ record) being written to each file (with a remainder in the final file). These files were written to /gscratch and, as intermediate files that can be readily reconstructed, will not be retained long-term.

Code Block
mkdir -p /gscratch/grandol1/LarkGrouseTest1/rawdata
cd /gscratch/grandol1/LarkGrouseTest1/rawdata
unpigz --to-stdout /project/gtl/data/raw/LarkGrouseTest1/rawdata/LarkGrouseTest_S1_L001_R1_001.fastq.gz | split -l 1000000 -d --suffix-length=3 --additional-suffix=.fastq - LarkGrouseTest1_R1_ ;
unpigz --to-stdout /project/gtl/data/raw/LarkGrouseTest1/rawdata/LarkGrouseTest_S1_L001_R2_001.fastq.gz | split -l 1000000 -d --suffix-length=3 --additional-suffix=.fastq - LarkGrouseTest1_R2_

making 20 R1 files and 20 R2 files, with structured names (e.g., for the R1 set):

/gscratch/grandol1/LarkGrouseTest1/rawdata/LarkGrouseTest1_R1_000.fastq
/gscratch/grandol1/LarkGrouseTest1/rawdata/LarkGrouseTest1_R1_001.fastq
/gscratch/grandol1/LarkGrouseTest1/rawdata/LarkGrouseTest1_R1_002.fastq
/gscratch/grandol1/LarkGrouseTest1/rawdata/LarkGrouseTest1_R1_003.fastq etc.
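
The naming scheme and the choice of chunk size can be checked on a toy input before running the full split. The following is a minimal sketch, assuming GNU split (needed for --additional-suffix) and a standard four-line FASTQ layout. The temporary directory, the toy_R1_ prefix, and the read names are hypothetical and are not part of the actual workflow. It uses 16 lines per file in place of 1000000. Both are multiples of 4, so no read is divided between two files.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical scratch directory for the demonstration (not a path from the workflow).
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Build a toy FASTQ with 10 reads (40 lines); names and sequences are invented.
for i in $(seq 1 10); do
    printf '@read%d\nACGT\n+\nIIII\n' "$i"
done > toy.fastq

# Same split options as in the real commands, but 16 lines (4 reads) per file.
split -l 16 -d --suffix-length=3 --additional-suffix=.fastq toy.fastq toy_R1_

# 40 lines at 16 per file gives toy_R1_000.fastq and toy_R1_001.fastq (16 lines each)
# plus toy_R1_002.fastq holding the 8-line remainder.
n_files=$(ls toy_R1_*.fastq | wc -l)
last_lines=$(wc -l < toy_R1_002.fastq)

# Because the chunk size is a multiple of 4, every file should begin with a header line.
bad_starts=0
for f in toy_R1_*.fastq; do
    first_char=$(head -c 1 "$f")
    if [ "$first_char" != "@" ]; then
        bad_starts=$((bad_starts + 1))
    fi
done

echo "files=$n_files last_lines=$last_lines bad_starts=$bad_starts"
```

The same loop over the real LarkGrouseTest1_R1_*.fastq and LarkGrouseTest1_R2_*.fastq files would be one way to confirm that both sets contain the same number of files and that each file begins at a read boundary.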

run_parse_count_onSplitInput.pl also writes to /gscratch.

LarkGrouseTest1_Demux.csv was created from an R notebook and a MISO report. This file is used to map MIDs to sample names and projects.

...