...
Splitting the raw (uncompressed) data was accomplished with the program split, with 1x10^6 lines (2.5x10^5 reads, at four lines per FASTQ record) being written to each file (with a remainder in the final file). These files were written to /gscratch and, as intermediate files that can be reconstructed readily, will not be retained long-term.
Code Block |
---|
mkdir -p /gscratch/grandol1/LarkGrouseTest1/rawdata ; cd /gscratch/grandol1/LarkGrouseTest1/rawdata ; unpigz --to-stdout /project/gtl/data/raw/LarkGrouseTest1/rawdata/LarkGrouseTest_S1_L001_R1_001.fastq.gz | split -l 1000000 -d --suffix-length=3 --additional-suffix=.fastq - LarkGrouseTest1_R1_ ; unpigz --to-stdout /project/gtl/data/raw/LarkGrouseTest1/rawdata/LarkGrouseTest_S1_L001_R2_001.fastq.gz | split -l 1000000 -d --suffix-length=3 --additional-suffix=.fastq - LarkGrouseTest1_R2_ |
making 20 R1 files and 20 R2 files, with structured names (e.g., for the R1 set):
/gscratch/grandol1/LarkGrouseTest1/rawdata/LarkGrouseTest1_R1_000.fastq
/gscratch/grandol1/LarkGrouseTest1/rawdata/LarkGrouseTest1_R1_001.fastq
/gscratch/grandol1/LarkGrouseTest1/rawdata/LarkGrouseTest1_R1_002.fastq
/gscratch/grandol1/LarkGrouseTest1/rawdata/LarkGrouseTest1_R1_003.fastq etc.
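Because split cuts on line counts rather than on reads, it can be worth confirming that no FASTQ record was broken across chunks and that each R1 chunk has an R2 partner with the same number of reads. The following is a minimal sketch of such a check, not part of the original workflow. It assumes standard four-line FASTQ records; the function names count_reads and check_pairs and their arguments are hypothetical.

```shell
# Sanity checks for split FASTQ chunks (illustrative sketch only).
# Assumption: every record occupies exactly four lines.

# Print the number of reads in one FASTQ file.
# Fails if the line count is not a multiple of 4.
count_reads() {
    local n
    n=$(wc -l < "$1") || return 1
    if [ $((n % 4)) -ne 0 ]; then
        echo "incomplete record in $1 ($n lines)" >&2
        return 1
    fi
    echo $((n / 4))
}

# Check that every R1 chunk in a directory has an R2 chunk
# with the same numeric suffix and the same read count.
# $1 = directory holding the chunks, $2 = file name prefix
check_pairs() {
    local dir="$1" prefix="$2" r1 r2 n1 n2 suffix
    for r1 in "$dir/${prefix}_R1_"*.fastq; do
        [ -e "$r1" ] || { echo "no R1 chunks found in $dir" >&2; return 1; }
        suffix=${r1#"$dir/${prefix}_R1_"}
        r2="$dir/${prefix}_R2_${suffix}"
        n1=$(count_reads "$r1") || return 1
        n2=$(count_reads "$r2") || return 1
        if [ "$n1" -ne "$n2" ]; then
            echo "read count mismatch: $r1 ($n1) vs $r2 ($n2)" >&2
            return 1
        fi
    done
    echo "all chunks paired"
}

# Example call (paths as used on this page):
#   check_pairs /gscratch/grandol1/LarkGrouseTest1/rawdata LarkGrouseTest1
```

Since 1000000 is divisible by 4, every chunk except possibly the last should hold exactly 250000 reads; a failure here would point to a truncated or malformed input file rather than to split itself.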
run_parse_count_onSplitInput.pl also writes to /gscratch.
LarkGrouseTest1_DemuxJHDemux.csv was created from a Google Sheet notebook and a MISO report, and was further edited to correct some project-name assignments. This file is used to map MIDs to sample names and projects.
...