...
I used unpigz.sh to decompress the fastq files, because our parser does not read from gzipped files.
Demultiplexing
In /project/evolgen/assem/alf1GBS_NS1_mar21/demultiplex
I …
Also, the MID-to-sample-name files didn't follow the scheme we have used for GBS (MIDname, MID, sample id), and we don't want the header. So I made a fixed version (now we have Hmax1Demux_fixed.csv):
```
tail -n+2 Pool1Alf1GbsTest_Demux.csv | sed -E 's/^([._[:alnum:]-]+),([._[:alnum:]-]+),([._[:alnum:]-]+).*/\3,\1,\2/' > Pool1Alf1GbsTest_Demux_fixed.csv
tail -n+2 Pool2Alf1GbsTest_Demux.csv | sed -E 's/^([._[:alnum:]-]+),([._[:alnum:]-]+),([._[:alnum:]-]+).*/\3,\1,\2/' > Pool2Alf1GbsTest_Demux_fixed.csv
tail -n+2 Pool3Alf1GbsTest_Demux.csv | sed -E 's/^([._[:alnum:]-]+),([._[:alnum:]-]+),([._[:alnum:]-]+).*/\3,\1,\2/' > Pool3Alf1GbsTest_Demux_fixed.csv
tail -n+2 Pool4Alf1GbsTest_Demux.csv | sed -E 's/^([._[:alnum:]-]+),([._[:alnum:]-]+),([._[:alnum:]-]+).*/\3,\1,\2/' > Pool4Alf1GbsTest_Demux_fixed.csv
```
To speed up the demultiplexing (which would have taken several days on each input file), I split the concatenated raw files into files with 16 million lines each.
```
mkdir /gscratch/buerkle/data/alfalfa
cd /gscratch/buerkle/data/alfalfa
cat /project/evolgen/data/local/alfalfa/alf1GBS_NS1_mar21/Pool*_S* | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - alf1_
mkdir rawreads
mv alf1_* rawreads/
/project/evolgen/assem/alf1GBS_NS1_mar21/demultiplex/run_parsebarcodes_onSplitInput.pl
```
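Because FASTQ records are four lines each, the line count given to `split -l` must be a multiple of 4 (16,000,000 lines = 4,000,000 reads per chunk) so that no record is torn across chunks. A miniature sketch of the chunking and naming, using three hypothetical reads and GNU split with the same flags as above:

```shell
# Tiny stand-in for the real split: 3 fake 4-line FASTQ records,
# chunked at 2 reads (8 lines) per file, named alf1_000.fastq, alf1_001.fastq, ...
tmp=$(mktemp -d) && cd "$tmp"
for i in 1 2 3; do printf '@read%s\nACGT\n+\nIIII\n' "$i"; done > reads.fastq
split -l 8 -d --suffix-length=3 --additional-suffix=.fastq reads.fastq alf1_
ls alf1_*.fastq
# -> alf1_000.fastq  alf1_001.fastq
```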
In /project/evolgen/assem/alf1GBS_NS1_mar21/demultiplex
I launched parse_barcodes_slurm_pool1.sh etc. (one for each pool). Note that I did not do separate contaminant filtering (which I did for Penstemon), because the parsing code and other downstream steps should knock out contaminants. I can double-check this.
...