Raw reads
We received four files with sequence reads. Two of these contain the 1x100bp reads, one file per lane, because two lanes were used on the instrument. The other two exist only because CU unnecessarily ran indexing reads on the fragments; I deleted these nonsense files. The two files with the raw reads of interest are in /project/microbiome/data/seq/HMAX1/rawreads:
WyomingPool_L1_S1_L001_R1_001.fastq.gz (22 GB) – 457,726,974 reads (109 GBytes uncompressed)
WyomingPool_L2_S2_L002_R1_001.fastq.gz (22 GB) – 450,678,667 reads (107 GBytes uncompressed)
I used unpigz.sh to decompress the fastq files, because our parser does not read from gzipped files.
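The read counts listed above can be checked against the decompressed files. The following is a minimal sketch, not necessarily how the counts above were obtained; it assumes strict four-line FASTQ records (no wrapped sequence lines), and the function name count_reads is made up for illustration.

```shell
#!/bin/sh
# count_reads (hypothetical helper): number of reads in an uncompressed FASTQ file.
# Assumes every record is exactly 4 lines: header, sequence, '+', qualities.
count_reads() {
  echo $(( $(wc -l < "$1") / 4 ))
}

# Report a count for every file given on the command line (does nothing without arguments).
for f in "$@"; do
  printf '%s\t%s\n' "$f" "$(count_reads "$f")"
done
```

If saved as, say, count_reads.sh, it could be run as: sh count_reads.sh WyomingPool_L1_S1_L001_R1_001.fastq WyomingPool_L2_S2_L002_R1_001.fastq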
Demultiplexing
In /project/microbiome/analyses/gtl/HMAX1
I removed extraneous spaces in the file that maps MIDs to individual identifiers (Hmax1Demux.csv). Also, the original Hmax1Demux.csv didn’t follow the scheme we have used for GBS: MIDname, MID, sample id. So, I made a fixed version (now we have Hmax1Demux_fixed.csv):
sed -E 's/^([[:alnum:]-]+),([[:alnum:]-]+),([[:alnum:]-]+).*/\3,\1,\2/' Hmax1Demux.csv > Hmax1Demux_fixed.csv
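To see what this substitution does, it can be run on a single made-up row. The row below is hypothetical (not taken from Hmax1Demux.csv), and the original column order of MID, sample id, MIDname is only inferred from the way the sed command moves the third field to the front. The wrapper function reorder_fields is named for illustration only.

```shell
#!/bin/sh
# Same substitution as above, wrapped in a function so it can be tried on test input.
reorder_fields() {
  sed -E 's/^([[:alnum:]-]+),([[:alnum:]-]+),([[:alnum:]-]+).*/\3,\1,\2/'
}

# Hypothetical row: MID, sample id, MIDname, plus a trailing column that gets dropped.
printf 'ACGTACGT,sample-01,MID001,extra\n' | reorder_fields
# prints: MID001,ACGTACGT,sample-01
```

Note that the character class [[:alnum:]-] does not include underscores or periods, so a row with such a character in one of its first two fields would pass through unchanged. Comparing line counts and spot-checking Hmax1Demux_fixed.csv would catch that.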
Demultiplexing the two files in parallel took more than the two days I initially allocated to it (in part because ~10% of the data do not match our MIDs, since we did not filter contaminants). So I broke the data into 228 parts (each with up to 16 million lines, i.e. 4 million reads) and ran 228 jobs in parallel.
mkdir /gscratch/buerkle/data/HMAX1
cd /gscratch/buerkle/data/HMAX1
cat /project/microbiome/data/seq/HMAX1/rawreads/WyomingPool* | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - WyomingPool_HMAX1_
mkdir rawreads
mv WyomingPool_HMAX1_* rawreads/
/project/microbiome/analyses/gtl/HMAX1/demultiplex/run_parsebarcodes_onSplitInput.pl
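The count of 228 parts follows from the read counts listed under Raw reads. The arithmetic can be sketched as below, assuming four lines per read; the variable names are illustrative.

```shell
#!/bin/bash
# Read counts as reported for the two lanes.
reads_l1=457726974
reads_l2=450678667
lines_per_chunk=16000000

# Four FASTQ lines per read (assumes unwrapped records).
total_lines=$(( (reads_l1 + reads_l2) * 4 ))

# Ceiling division: the last chunk is only partly filled.
chunks=$(( (total_lines + lines_per_chunk - 1) / lines_per_chunk ))

echo "$total_lines lines -> $chunks chunks"
# prints: 3633622564 lines -> 228 chunks
```

Because 16,000,000 is a multiple of 4, each chunk boundary falls between records, so no read is split across two files as long as every record is exactly four lines.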
Note that I did not do separate contaminant filtering (which I did for Penstemon), because the parsing code and other downstream steps should knock out contaminants. I can double-check this.