Bioinformatics

Raw reads

We received four files with sequence reads. Two of these contain the 1x100bp reads, one file per lane, because two lanes were used on the instrument. The other two exist only because CU unnecessarily ran indexing reads on the fragments; I deleted these nonsense files. The two files with the raw reads of interest are listed here (they are in /project/microbiome/data/seq/HMAX1/rawreads):

WyomingPool_L1_S1_L001_R1_001.fastq.gz (22 GB) – 457,726,974 reads (109 GBytes uncompressed)
WyomingPool_L2_S2_L002_R1_001.fastq.gz (22 GB) – 450,678,667 reads (107 GBytes uncompressed)
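These notes do not record how the read counts above were obtained. One way to reproduce them is sketched here, assuming each read occupies exactly four lines (no wrapped sequence lines). The function name count_fastq_reads is hypothetical and is not one of our existing scripts.

```shell
# count_fastq_reads FILE
# Print the number of records in a FASTQ file (hypothetical helper).
# Assumes exactly four lines per record; accepts plain or gzipped input.
count_fastq_reads() {
  case "$1" in
    *.gz) gzip -dc "$1" ;;   # stream-decompress without writing to disk
    *)    cat "$1" ;;
  esac | awk 'END { printf "%d\n", NR / 4 }'
}
```

For example, count_fastq_reads WyomingPool_L1_S1_L001_R1_001.fastq.gz should print the read count for lane 1, if the four-lines-per-read assumption holds.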

I used unpigz.sh to decompress the fastq files, because our parser does not read from gzipped files.

Demultiplexing

In /project/microbiome/analyses/gtl/HMAX1 I removed extraneous spaces in the file that maps MIDs to individual identifiers (Hmax1Demux.csv). Also, the original Hmax1Demux.csv didn’t follow the column order we have used for GBS (MIDname, MID, sample id), so I made a reordered version, Hmax1Demux_fixed.csv:

sed -E 's/^([[:alnum:]-]+),([[:alnum:]-]+),([[:alnum:]-]+).*/\3,\1,\2/' Hmax1Demux.csv > Hmax1Demux_fixed.csv
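To show what the sed expression above does, here is a small sketch on made-up rows; the real contents of Hmax1Demux.csv are not reproduced in these notes. From the substitution I am inferring that the original column order was MID, sample id, MIDname, which is an assumption. The second step is a sanity check I would suggest, not something recorded as having been run.

```shell
# Two hypothetical rows in the assumed original order (MID, sample id, MIDname);
# the first row carries trailing spaces to show that they are dropped.
demo_in=$(mktemp)
printf '%s\n' \
  'ACGTACGT,HM-001,MID1   ' \
  'TTGCAAGC,HM-002,MID2' > "$demo_in"

# Same expression as above: emit columns 3,1,2 and discard anything after column 3.
demo_out=$(sed -E 's/^([[:alnum:]-]+),([[:alnum:]-]+),([[:alnum:]-]+).*/\3,\1,\2/' "$demo_in")
printf '%s\n' "$demo_out"

# Sanity check: count output lines that are NOT exactly three alnum/hyphen fields.
# grep -c exits non-zero when the count is 0, hence the "|| true".
bad_lines=$(printf '%s\n' "$demo_out" | grep -Evc '^[[:alnum:]-]+,[[:alnum:]-]+,[[:alnum:]-]+$' || true)
echo "lines not matching the three-column pattern: $bad_lines"

rm -f "$demo_in"
```

One caveat: the character class only allows letters, digits, and hyphens. A row whose first or second field contains any other character (an underscore, say) will not match and passes through sed unchanged, so a check like the one above is worth running on Hmax1Demux_fixed.csv.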

On 06 Aug 2021 I started parse_barcodes_slurm_L1.sh and parse_barcodes_slurm_L2.sh.