HPAU bioinformatics

Raw reads

We received four files of sequence reads. Two contain the 1x100 bp reads of interest, one file per lane, because two lanes were used on the instrument. The other two exist only because CU unnecessarily ran indexing reads on the fragments; I deleted those two files. The two files with the raw reads of interest are listed below (they were originally in /project/gtl/data/raw/HPAU1/rawreads).

HPAU1_Pool1_S1_L001_R1_001.fastq.gz (17GB)
HPAU1_Pool1_S1_L002_R1_001.fastq.gz (16GB)

I combined the two files into one using:

cat HPAU1_Pool1_S1_L001_R1_001.fastq.gz HPAU1_Pool1_S1_L002_R1_001.fastq.gz > HPAU1.fastq.gz

HPAU1.fastq.gz (33GB)
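
Concatenating the two .gz files directly works because a gzip stream may consist of several members placed back to back. Before the originals are removed, a check along the following lines could confirm that nothing was lost. This is a minimal sketch on toy data, not a record of what was actually run: the temporary directory and the lane1/lane2/combined file names are placeholders, and on the real 17GB and 16GB files the decompression passes would take a while.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy stand-ins for the two lane files (placeholder names, not the real data).
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' | gzip > "$tmp/lane1.fastq.gz"
printf '@r2\nTTGA\n+\nIIII\n' | gzip > "$tmp/lane2.fastq.gz"

# Byte-level concatenation of gzip members gives a valid multi-member gzip file.
cat "$tmp/lane1.fastq.gz" "$tmp/lane2.fastq.gz" > "$tmp/combined.fastq.gz"

# Test the integrity of the combined file (exits non-zero if it is corrupt).
gzip -t "$tmp/combined.fastq.gz"

# Compare the decompressed line counts of the inputs and the combined file.
lines_in=$(( $(gzip -dc "$tmp/lane1.fastq.gz" | wc -l) + $(gzip -dc "$tmp/lane2.fastq.gz" | wc -l) ))
lines_out=$(( $(gzip -dc "$tmp/combined.fastq.gz" | wc -l) ))
echo "input lines: $lines_in, combined lines: $lines_out"
```

If the two counts agree and gzip -t exits cleanly, the originals can be removed with more confidence.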

I then removed the originals with:

rm HPAU1_Pool*

Then I split the combined file into chunks of 16,000,000 lines each for faster processing:

mkdir -p /gscratch/grandol1/HPAU1/rawreads
cd /gscratch/grandol1/HPAU1/rawreads
unpigz --to-stdout /project/gtl/data/raw/HPAU1/rawreads/HPAU1.fastq.gz | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - HPAU1_

This produced 204 files.
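
Assuming the standard four-line FASTQ record (header, sequence, +, quality), 16,000,000 lines is 4,000,000 reads per chunk, and because 16,000,000 is divisible by 4 no record is cut across two chunks. 204 chunks then implies more than 812 million and at most 816 million reads in total. The sketch below shows the same split options on a toy file and checks that every chunk holds whole records; the toy file, the chunk size of 8 lines, and the toy_ prefix are made up for illustration, and the split options require GNU coreutils.

```shell
#!/usr/bin/env bash
set -euo pipefail

tmp=$(mktemp -d)
cd "$tmp"

# Build a toy FASTQ with 10 records (40 lines); names and sequences are invented.
for i in $(seq 1 10); do
  printf '@read%d\nACGT\n+\nIIII\n' "$i"
done > toy.fastq

# Same options as the real command, but 8 lines (2 records) per chunk.
split -l 8 -d --suffix-length=3 --additional-suffix=.fastq toy.fastq toy_

# Count the chunks and flag any whose line count is not a multiple of 4.
n_chunks=$(( $(ls toy_*.fastq | wc -l) ))
bad=0
for f in toy_*.fastq; do
  n=$(( $(wc -l < "$f") ))
  if [ $(( n % 4 )) -ne 0 ]; then
    bad=$(( bad + 1 ))
  fi
done
echo "chunks: $n_chunks, chunks with partial records: $bad"
```

The same loop could be pointed at the HPAU1_*.fastq chunks; only the last chunk is expected to be shorter than the rest.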

Demultiplexing

In /project/gtl/raw/HPAU1 I removed extraneous spaces from the file that maps MIDs to individual identifiers (HPAU1Demux.csv). The original HPAU1Demux.csv also didn’t follow the column scheme we have used for GBS (MIDname, MID, sample id), so I made a fixed version, HPAU1Demux_fixed.csv.
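
The whitespace cleanup can be sketched with awk as follows. This is an illustration only, not the command that was actually used: the input rows, MID names, barcodes, and file names are invented, and any column reordering needed to reach the MIDname, MID, sample id layout is not shown because it depends on the original column order.

```shell
#!/usr/bin/env bash
set -euo pipefail

tmp=$(mktemp -d)

# Invented example rows with stray spaces around the fields.
printf 'MID_001 , ACGTACGT ,sampleA\nMID_002,TTGACCAA , sampleB \n' > "$tmp/demux_raw.csv"

# Trim leading and trailing spaces and tabs from every comma-separated field.
awk -F',' 'BEGIN { OFS = "," }
  { for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i); print }' \
  "$tmp/demux_raw.csv" > "$tmp/demux_fixed.csv"

# Count rows that do not have exactly three fields (MIDname, MID, sample id).
bad_rows=$(awk -F',' 'NF != 3 { n++ } END { print n + 0 }' "$tmp/demux_fixed.csv")
echo "rows without exactly three fields: $bad_rows"
cat "$tmp/demux_fixed.csv"
```

A non-zero count of malformed rows would be worth resolving before demultiplexing, since a shifted column could assign reads to the wrong sample.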

cd /gscratch/grandol1/HPAU1/rawreads/

Parse split files

/project/gtl/data/raw/HPAU1/demultiplex/run_parsebarcodes_onSplitInput.pl

Recombine by sample name and mid

/project/gtl/data/raw/HPAU1/demultiplex/run_splitFastq_gbs.sh