...

I used unpigz.sh to decompress the fastq files, because our parser does not read from gzipped files.
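
For reference, a minimal sketch of the kind of decompression step unpigz.sh performs, assuming it simply runs unpigz (pigz's decompressor) over the gzipped raw files; the actual script and the .fastq.gz glob shown here are assumptions, not the script itself:

Code Block
languagebash
# Hypothetical sketch of the decompression step; actual unpigz.sh may differ.
# unpigz behaves like gunzip: it writes the .fastq and removes the .gz.
for f in /project/evolgen/data/local/alfalfa/alf1GBS_NS1_mar21/*.fastq.gz; do
  unpigz "$f"
done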

Demultiplexing

In /project/evolgen/assem/alf1GBS_NS1_mar21/demulitplex I …
The MID-to-sample-name files also didn't follow the scheme we have used for GBS (MIDname, MID, sample id), and we don't want the header. So I made a fixed version (now we have Hmax1Demux_fixed.csv):

Code Block
languagebash
tail -n+2 Pool1Alf1GbsTest_Demux.csv | sed -E 's/^([._[:alnum:]-]+),([._[:alnum:]-]+),([._[:alnum:]-]+).*/\3,\1,\2/' > Pool1Alf1GbsTest_Demux_fixed.csv
tail -n+2 Pool2Alf1GbsTest_Demux.csv | sed -E 's/^([._[:alnum:]-]+),([._[:alnum:]-]+),([._[:alnum:]-]+).*/\3,\1,\2/' > Pool2Alf1GbsTest_Demux_fixed.csv
tail -n+2 Pool3Alf1GbsTest_Demux.csv | sed -E 's/^([._[:alnum:]-]+),([._[:alnum:]-]+),([._[:alnum:]-]+).*/\3,\1,\2/' > Pool3Alf1GbsTest_Demux_fixed.csv
tail -n+2 Pool4Alf1GbsTest_Demux.csv | sed -E 's/^([._[:alnum:]-]+),([._[:alnum:]-]+),([._[:alnum:]-]+).*/\3,\1,\2/' > Pool4Alf1GbsTest_Demux_fixed.csv
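
The same transformation can be written as a single loop; this is just a compact equivalent of the four commands above, assuming all four files follow the PoolNAlf1GbsTest_Demux.csv naming pattern:

Code Block
languagebash
# Same tail/sed reordering as above, looped over the four pool files:
# drop the header, then rewrite column order from (col1,col2,col3) to (col3,col1,col2).
for pool in 1 2 3 4; do
  tail -n+2 "Pool${pool}Alf1GbsTest_Demux.csv" \
    | sed -E 's/^([._[:alnum:]-]+),([._[:alnum:]-]+),([._[:alnum:]-]+).*/\3,\1,\2/' \
    > "Pool${pool}Alf1GbsTest_Demux_fixed.csv"
done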

To speed up the demultiplexing (which would have taken several days per input file), I split the concatenated raw files into chunks of 16 million lines each (4 million reads; because each fastq record spans four lines, a 16-million-line chunk never splits a record).

Code Block
languagebash
# working space on scratch for the split raw reads
mkdir /gscratch/buerkle/data/alfalfa
cd /gscratch/buerkle/data/alfalfa
# concatenate the raw pool files and split into 16-million-line chunks named alf1_000.fastq, alf1_001.fastq, ...
cat /project/evolgen/data/local/alfalfa/alf1GBS_NS1_mar21/Pool*_S* | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - alf1_
mkdir rawreads
mv alf1_* rawreads/
# run the barcode parsing across the split input files
/project/evolgen/assem/alf1GBS_NS1_mar21/demultiplex/run_parsebarcodes_onSplitInput.pl
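
An optional sanity check (not part of the original steps) is to confirm the split produced intact fastq chunks, i.e. every chunk's line count is a multiple of four and the chunks together sum to the line count of the concatenated input:

Code Block
languagebash
# Optional check: each chunk should hold complete 4-line fastq records,
# and the grand total should match `wc -l` on the concatenated raw input.
cd /gscratch/buerkle/data/alfalfa/rawreads
for f in alf1_*.fastq; do
  n=$(wc -l < "$f")
  if [ $((n % 4)) -ne 0 ]; then
    echo "incomplete fastq records in $f"
  fi
done
wc -l alf1_*.fastq | tail -n 1   # grand total across all chunks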

In /project/evolgen/assem/alf1GBS_NS1_mar21/demulitplex I launched parse_barcodes_slurm_pool1.sh etc. (one for each pool). Note that I did not do separate contaminant filtering (which I did for Penstemon), because the parsing code and other downstream steps should knock out contaminants. I can double-check this.
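
For context, a minimal sketch of what one of these per-pool submission scripts might look like; the #SBATCH resource values, the account name, and the parsing command and its arguments below are placeholders, not the actual script used:

Code Block
languagebash
#!/bin/bash
# Hypothetical sketch of parse_barcodes_slurm_pool1.sh; resource requests and the
# parsing invocation are illustrative placeholders only.
#SBATCH --job-name=parse_pool1
#SBATCH --account=evolgen
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=2-00:00:00

cd /gscratch/buerkle/data/alfalfa/rawreads
# Placeholder command: substitute the real barcode-parsing program and options;
# the idea is one job per pool, reading the fixed MID-to-sample file and a split chunk.
perl parse_barcodes.pl Pool1Alf1GbsTest_Demux_fixed.csv alf1_000.fastq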

...