
Raw reads

In March 2021, we received eight files with sequence reads. Four of these contain the 1x100bp reads, one per lane (pool) used on the instrument. The other four exist because CU unnecessarily ran indexing reads on the fragments; I deleted those useless files. The four files with the raw reads of interest are listed below (in /project/evolgen/data/local/alfalfa/alf1GBS_NS1_mar21/, with original files in /project/microbiome/data/seq/alfalfa/GBS/Alf1GBS_NS1/).

  1. Pool1_S1_L001_R1_001.fastq.gz (20 GBytes) – 416,256,593 reads (99 GBytes uncompressed)

  2. Pool2_S2_L002_R1_001.fastq.gz (20 GBytes) – 405,613,054 reads (97 GBytes uncompressed)

  3. Pool3_S1_L001_R1_001.fastq.gz (23 GBytes) – 482,619,189 reads (115 GBytes uncompressed)

  4. Pool4_S2_L002_R1_001.fastq.gz (19 GBytes) – 427,134,565 reads (102 GBytes uncompressed)

I used unpigz.sh to decompress the fastq files, because our parser does not read from gzipped files.
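unpigz.sh wraps the decompression; a minimal sketch of the equivalent with pigz directly (unpigz is just pigz -d), plus a read-count sanity check (a FASTQ record is four lines), assuming pigz is on the path and run where the files live:

cd /project/evolgen/data/local/alfalfa/alf1GBS_NS1_mar21/
for f in Pool*_R1_001.fastq.gz; do
    pigz -d -p 16 "$f"    # multithreaded gunzip; -p 16 is an example thread count
done
# reads = lines / 4; for Pool1 this should print 416256593 reads
wc -l Pool1_S1_L001_R1_001.fastq | awk '{printf "%d reads\n", $1/4}'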

Demultiplexing

The MID-to-sample-name files didn't follow the scheme we have used for GBS (MIDname, MID, sample id), and we don't want the header row. So I made a fixed version of each pool's key (e.g., Pool1Alf1GbsTest_Demux_fixed.csv):

tail -n+2 Pool1Alf1GbsTest_Demux.csv | sed -E 's/^([._[:alnum:]-]+),([._[:alnum:]-]+),([._[:alnum:]-]+).*/\3,\1,\2/' > Pool1Alf1GbsTest_Demux_fixed.csv
tail -n+2 Pool2Alf1GbsTest_Demux.csv | sed -E 's/^([._[:alnum:]-]+),([._[:alnum:]-]+),([._[:alnum:]-]+).*/\3,\1,\2/' > Pool2Alf1GbsTest_Demux_fixed.csv
tail -n+2 Pool3Alf1GbsTest_Demux.csv | sed -E 's/^([._[:alnum:]-]+),([._[:alnum:]-]+),([._[:alnum:]-]+).*/\3,\1,\2/' > Pool3Alf1GbsTest_Demux_fixed.csv
tail -n+2 Pool4Alf1GbsTest_Demux.csv | sed -E 's/^([._[:alnum:]-]+),([._[:alnum:]-]+),([._[:alnum:]-]+).*/\3,\1,\2/' > Pool4Alf1GbsTest_Demux_fixed.csv
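Since the same field reordering applies to all four keys, a loop version (equivalent to the four commands above) plus a quick field-count check on the output might look like:

for pool in 1 2 3 4; do
    tail -n+2 Pool${pool}Alf1GbsTest_Demux.csv \
        | sed -E 's/^([._[:alnum:]-]+),([._[:alnum:]-]+),([._[:alnum:]-]+).*/\3,\1,\2/' \
        > Pool${pool}Alf1GbsTest_Demux_fixed.csv
done
# every fixed line should have exactly three comma-separated fields
awk -F, 'NF != 3 {print FILENAME ": " FNR ": " $0}' Pool?Alf1GbsTest_Demux_fixed.csv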

To speed up the demultiplexing (which would have taken several days per input file), I split the raw files into 435 files of 16 million lines (4 million reads) each. I needed to respect the different libraries, because they use overlapping MIDs; splitting each pool's file separately with its own filename prefix (below) takes care of that. I also modified run_parsebarcodes_onSplitInput.pl to work with multiple pools/demux keys.

mkdir -p /gscratch/buerkle/data/alfalfa/rawreads
cd /gscratch/buerkle/data/alfalfa/rawreads
split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq /project/evolgen/data/local/alfalfa/alf1GBS_NS1_mar21/Pool1_S1_L001_R1_001.fastq alf1_pool1_
split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq /project/evolgen/data/local/alfalfa/alf1GBS_NS1_mar21/Pool2_S2_L002_R1_001.fastq alf1_pool2_
split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq /project/evolgen/data/local/alfalfa/alf1GBS_NS1_mar21/Pool3_S1_L001_R1_001.fastq alf1_pool3_
split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq /project/evolgen/data/local/alfalfa/alf1GBS_NS1_mar21/Pool4_S2_L002_R1_001.fastq alf1_pool4_
/project/evolgen/assem/alf1GBS_NS1_mar21/demultiplex/run_parsebarcodes_onSplitInput.pl
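As a check, the per-pool file counts should be ceil(4 × reads / 16,000,000), i.e., 105, 102, 121, and 107, which sums to the 435 split files:

for p in 1 2 3 4; do
    printf 'pool%d: ' "$p"; ls alf1_pool${p}_*.fastq | wc -l
done
ls alf1_pool?_*.fastq | wc -l    # expect 435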

I gave each of the 435 jobs 6 hours to complete.
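The submission details live in run_parsebarcodes_onSplitInput.pl; the sketch below only illustrates the resource request per task, assuming a SLURM scheduler on the cluster (the directives are illustrative, not copied from the script):

#!/bin/bash
#SBATCH --time=6:00:00     # the 6-hour limit given to each job
#SBATCH --array=0-434      # 435 tasks, one per split fastq file
# run_parsebarcodes_onSplitInput.pl pairs each alf1_poolN_### file with the
# matching PoolNAlf1GbsTest_Demux_fixed.csv key and runs the barcode parser on it.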

