
Raw reads

In March 2021, we received eight files with sequence reads. Four of these contain the 1x100bp reads, one for each of the four lanes (pools) used on the instrument. The other four exist because CU unnecessarily ran indexing reads on the fragments; I deleted those nonsense files. The four files with the raw reads of interest are listed below (these are in /project/evolgen/data/local/alfalfa/alf1GBS_NS1_mar21/, with the original files in /project/microbiome/data/seq/alfalfa/GBS/Alf1GBS_NS1/).

  1. Pool1_S1_L001_R1_001.fastq.gz (20 GBytes) – 416,256,593 reads (99 GBytes uncompressed)

  2. Pool2_S2_L002_R1_001.fastq.gz (20 GBytes) – 405,613,054 reads (97 GBytes uncompressed)

  3. Pool3_S1_L001_R1_001.fastq.gz (23 GBytes) – 482,619,189 reads (115 GBytes uncompressed)

  4. Pool4_S2_L002_R1_001.fastq.gz (19 GBytes) – 427,134,565 reads (102 GBytes uncompressed)
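For the record, read counts like those above can be recomputed from the line counts (a FASTQ record is four lines). A minimal sketch of one way to do this directly on the compressed files, not necessarily how these counts were originally generated:

for f in Pool*_R1_001.fastq.gz; do
  # four lines per FASTQ record, so reads = lines / 4
  echo "$f: $(( $(unpigz -c "$f" | wc -l) / 4 )) reads"
done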

I used unpigz.sh to decompress the fastq files, because our parser does not read from gzipped files.
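unpigz.sh is a local wrapper whose contents are not reproduced here; the equivalent step is roughly the following (a sketch assuming it simply runs unpigz over the pool files; the thread count is an arbitrary choice):

# Decompress in parallel, keeping the .gz originals (-k).
for f in /project/evolgen/data/local/alfalfa/alf1GBS_NS1_mar21/Pool*_R1_001.fastq.gz; do
  unpigz -k -p 16 "$f"
done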

Demultiplexing

The MID-to-sample-name files didn’t follow the scheme we have used for GBS (MIDname, MID, sample id), and we don’t want the header. So I made fixed versions (now we have Pool*Alf1GbsTest_Demux_fixed.csv):

tail -n+2 Pool1Alf1GbsTest_Demux.csv | sed -E 's/^([._[:alnum:]-]+),([._[:alnum:]-]+),([._[:alnum:]-]+).*/\3,\1,\2/' > Pool1Alf1GbsTest_Demux_fixed.csv
tail -n+2 Pool2Alf1GbsTest_Demux.csv | sed -E 's/^([._[:alnum:]-]+),([._[:alnum:]-]+),([._[:alnum:]-]+).*/\3,\1,\2/' > Pool2Alf1GbsTest_Demux_fixed.csv
tail -n+2 Pool3Alf1GbsTest_Demux.csv | sed -E 's/^([._[:alnum:]-]+),([._[:alnum:]-]+),([._[:alnum:]-]+).*/\3,\1,\2/' > Pool3Alf1GbsTest_Demux_fixed.csv
tail -n+2 Pool4Alf1GbsTest_Demux.csv | sed -E 's/^([._[:alnum:]-]+),([._[:alnum:]-]+),([._[:alnum:]-]+).*/\3,\1,\2/' > Pool4Alf1GbsTest_Demux_fixed.csv
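To illustrate the reorder (the input column order of MID, sample id, MIDname is inferred from the \3,\1,\2 replacement; the values below are made up):

echo 'ACGTACGT,alf_001,MID001' | sed -E 's/^([._[:alnum:]-]+),([._[:alnum:]-]+),([._[:alnum:]-]+).*/\3,\1,\2/'
# -> MID001,ACGTACGT,alf_001   (MIDname, MID, sample id)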

To speed up the demultiplexing (which would have taken several days on each input file), I split the concatenated raw files into chunks of 16 million lines (4 million reads) each.

mkdir /gscratch/buerkle/data/alfalfa
cd /gscratch/buerkle/data/alfalfa
cat /project/evolgen/data/local/alfalfa/alf1GBS_NS1_mar21/Pool*_S* | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - alf1_
mkdir rawreads
mv alf1_* rawreads/
/project/evolgen/assem/alf1GBS_NS1_mar21/demultiplex/run_parsebarcodes_onSplitInput.pl
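With those split flags the chunks get zero-padded, three-digit names (alf1_000.fastq, alf1_001.fastq, …), each holding 16,000,000 lines, i.e. 4,000,000 reads, except the last. A quick sanity check (my suggestion, not part of the original run):

ls rawreads/ | head -n 3        # alf1_000.fastq  alf1_001.fastq  alf1_002.fastq
wc -l rawreads/alf1_000.fastq   # expect 16000000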

In /project/evolgen/assem/alf1GBS_NS1_mar21/demultiplex I started parse_barcodes_slurm_pool1.sh etc. (one script for each pool). Note that I did not do separate contaminant filtering (as I did for Penstemon), because the parsing code and other downstream steps should knock out contaminants. I can double-check this.
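The per-pool scripts were presumably submitted along these lines (hypothetical; the scripts themselves are not shown on this page):

for i in 1 2 3 4; do
  sbatch parse_barcodes_slurm_pool${i}.sh
done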
