RHIR1 bioinformatics

We received four files of sequence reads. Two contain the 1x100 bp reads, one file per lane, because two lanes were used on the instrument. The other two exist only because CU unnecessarily ran indexing reads on the fragments; I deleted those two files. The two files with the raw reads of interest are listed below (these started in /project/gtl/data/raw/HPAU1/rawreads):

  1. RHIR1_Pool2_S1_L001_R1_001.fastq.gz (22GB)

  2. RHIR1_Pool2_S1_L002_R1_001.fastq.gz (22GB)

I combined the two files into one using:

cat RHIR1_Pool2_S1_L001_R1_001.fastq.gz RHIR1_Pool2_S1_L002_R1_001.fastq.gz > RHIR1.fastq.gz

  1. RHIR1.fastq.gz (44GB)
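Concatenating gzip files with cat works because gzip accepts several compressed members in one file. Before deleting the per-lane originals, it may be worth confirming that the combined file is intact and holds as many reads as the two inputs together. The following is a minimal sketch of such a check, not something that was run for this project; the function names count_reads and check_concat are hypothetical helpers, and a FASTQ record is assumed to span exactly four lines.

```shell
# count_reads FILE.gz : print the number of FASTQ records (4 lines per record)
count_reads() (
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
)

# check_concat COMBINED PART... : succeed only if COMBINED passes the gzip
# integrity test and holds exactly as many reads as all PARTs together
check_concat() (
    combined=$1
    shift
    gzip -t "$combined" || exit 1
    total=0
    for part in "$@"; do
        total=$(( total + $(count_reads "$part") ))
    done
    [ "$(count_reads "$combined")" -eq "$total" ]
)
```

With the file names used here, the call would be:

check_concat RHIR1.fastq.gz RHIR1_Pool2_S1_L001_R1_001.fastq.gz RHIR1_Pool2_S1_L002_R1_001.fastq.gz && echo "safe to remove originals"

On files of this size the check decompresses every file at least once, so it takes a while.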

I then removed the originals with:

rm RHIR1_Pool*

Then, I split the combined file into smaller files for faster processing:

mkdir -p /gscratch/grandol1/RHIR1/rawreads
cd /gscratch/grandol1/RHIR1/rawreads
unpigz --to-stdout /project/gtl/data/raw/RHIR1/rawreads/RHIR1.fastq.gz | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - RHIR1_

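This made 236 files (RHIR1_000.fastq through RHIR1_235.fastq), each with up to 16,000,000 lines, i.e. 4,000,000 reads.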
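Because 16,000,000 is a multiple of 4, no FASTQ record should be cut in two at a file boundary, and 236 files imply between 940 and 944 million reads in total. A quick way to confirm both points is sketched below; check_chunks is a hypothetical helper, not part of the original workflow, and it assumes four-line FASTQ records.

```shell
# check_chunks DIR PREFIX : verify that every PREFIX*.fastq file in DIR holds
# only whole FASTQ records, then print the file count and total read count
check_chunks() (
    dir=$1
    prefix=$2
    nfiles=0
    nreads=0
    for f in "$dir"/"$prefix"*.fastq; do
        # an unmatched glob stays literal, so test that the file exists
        [ -e "$f" ] || { echo "no chunks found" >&2; exit 1; }
        lines=$(wc -l < "$f")
        if [ $(( lines % 4 )) -ne 0 ]; then
            echo "partial record in $f" >&2
            exit 1
        fi
        nfiles=$(( nfiles + 1 ))
        nreads=$(( nreads + lines / 4 ))
    done
    echo "$nfiles files, $nreads reads"
)
```

For the files made above, the call would be:

check_chunks /gscratch/grandol1/RHIR1/rawreads RHIR1_

The read total it prints can be compared against the read count of RHIR1.fastq.gz.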

Demultiplexing

In /project/gtl/data/RHIR1/rawreads/, I removed extraneous spaces from the file that maps MIDs to individual identifiers (RHIR1_Demux.csv), writing the result to a fixed version (RHIR1Demux_fixed.csv):

sed -E 's/^([[:alnum:]-]+),([[:alnum:]-]+),([[:alnum:]-]+).*/\1,\2,\3/' RHIR1_Demux.csv > RHIR1Demux_fixed.csv
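To make the effect of this expression concrete, the sketch below runs it on a few made-up rows. The rows are hypothetical and are not taken from RHIR1_Demux.csv, whose real column contents are not shown here; fix_demux is just a wrapper name for this illustration. The expression keeps the first three comma-separated fields and drops everything after the third, which removes trailing spaces, carriage returns, and extra columns. Two limits are worth knowing: a field may only contain letters, digits, and hyphens, so an identifier with an underscore is cut at the underscore, and a line with a space directly after a comma does not match and passes through unchanged.

```shell
# fix_demux : apply the same sed expression to standard input
fix_demux() {
    sed -E 's/^([[:alnum:]-]+),([[:alnum:]-]+),([[:alnum:]-]+).*/\1,\2,\3/'
}

# Hypothetical rows: trailing spaces, an extra column, an underscore in the
# identifier, and a space after the first comma
printf '%s\n' \
    'MID1,ACGTACGT,RH-001   ' \
    'MID2,TTGCAAGC,RH-002,unused' \
    'MID3,GGATCCAA,RH_003' \
    'MID4, CCTTAAGG,RH-004' | fix_demux
# MID1,ACGTACGT,RH-001
# MID2,TTGCAAGC,RH-002
# MID3,GGATCCAA,RH      (identifier truncated at the underscore)
# MID4, CCTTAAGG,RH-004 (no match, line left unchanged)
```

Comparing line counts and spot-checking a few identifiers between RHIR1_Demux.csv and RHIR1Demux_fixed.csv would show whether either limit applies to the real file.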

cd /gscratch/grandol1/RHIR1/rawreads/

Parse split files

/project/gtl/data/raw/RHIR1/demultiplex/run_parsebarcodes_onSplitInput.pl

Recombine by sample name and MID

/project/gtl/data/raw/RHIR1/demultiplex/run_splitFastq_gbs.sh