RHIR1 bioinformatics
We received four files with sequence reads. Two of these contain the 1x100 bp reads, one per lane, because two lanes were used on the instrument. The other two exist because CU unnecessarily ran indexing reads on the fragments; I deleted these nonsense files. The two files with the raw reads of interest are listed below (these started in /project/gtl/data/raw/HPAU1/rawreads).
RHIR1_Pool2_S1_L001_R1_001.fastq.gz (22GB)
RHIR1_Pool2_S1_L002_R1_001.fastq.gz (22GB)
I combined the two files into one using:
cat RHIR1_Pool2_S1_L001_R1_001.fastq.gz RHIR1_Pool2_S1_L002_R1_001.fastq.gz > RHIR1.fastq.gz
RHIR1.fastq.gz (44GB)
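Concatenating gzip files with cat works because a gzip stream may consist of several members back to back. Before removing the originals, it can be worth confirming that the combined file decompresses to the same number of lines as the two inputs together. The following is a minimal sketch on toy data, not the commands that were actually run: the file names (lane1.fastq.gz, lane2.fastq.gz, combined.fastq.gz) are hypothetical, and gzip is used in place of pigz so that it runs anywhere.

```shell
# Toy demonstration with hypothetical file names; not the real RHIR1 data.
tmpdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' | gzip > "$tmpdir/lane1.fastq.gz"
printf '@read2\nTTGA\n+\nIIII\n' | gzip > "$tmpdir/lane2.fastq.gz"

# Concatenated gzip members form a valid gzip stream.
cat "$tmpdir/lane1.fastq.gz" "$tmpdir/lane2.fastq.gz" > "$tmpdir/combined.fastq.gz"

# The combined line count should equal the sum of the two inputs.
lines1=$(( $(gzip -dc "$tmpdir/lane1.fastq.gz" | wc -l) ))
lines2=$(( $(gzip -dc "$tmpdir/lane2.fastq.gz" | wc -l) ))
expected=$(( lines1 + lines2 ))
combined=$(( $(gzip -dc "$tmpdir/combined.fastq.gz" | wc -l) ))
echo "expected=$expected combined=$combined"

rm -rf "$tmpdir"
```

On the real 22GB files the same comparison would take a while, so it is probably best run once, just before the rm step.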
I then removed the originals with:
rm RHIR1_Pool*
Then, I split the files for faster processing:
mkdir -p /gscratch/grandol1/RHIR1/rawreads
cd /gscratch/grandol1/RHIR1/rawreads
unpigz --to-stdout /project/gtl/data/raw/RHIR1/rawreads/RHIR1.fastq.gz | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - RHIR1_
This made 236 files.
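The chunk size of 16000000 lines is a multiple of 4, so each file should hold whole reads (4000000 per full chunk), assuming standard four-line FASTQ records. A rough sketch of how this could be checked is below. It uses a toy file and a chunk size of 8 lines rather than the real data, and the toy_ prefix and file names are hypothetical. The split options other than -l are the same as above and require GNU split.

```shell
# Toy demonstration; the toy_ prefix and file names are hypothetical.
tmpdir=$(mktemp -d)

# Build a toy FASTQ with 5 records (20 lines).
for i in 1 2 3 4 5; do
  printf '@read%s\nACGT\n+\nIIII\n' "$i"
done | gzip > "$tmpdir/toy.fastq.gz"

# Same split options as above, but 8 lines (2 records) per chunk.
gzip -dc "$tmpdir/toy.fastq.gz" \
  | split -l 8 -d --suffix-length=3 --additional-suffix=.fastq - "$tmpdir/toy_"

# Count chunks and flag any whose line count is not a multiple of 4.
n_chunks=0
bad_chunks=0
for f in "$tmpdir"/toy_*.fastq; do
  n_chunks=$((n_chunks + 1))
  lines=$(( $(wc -l < "$f") ))
  if [ $((lines % 4)) -ne 0 ]; then
    bad_chunks=$((bad_chunks + 1))
  fi
done
echo "chunks=$n_chunks bad=$bad_chunks"

rm -rf "$tmpdir"
```

A nonzero count of bad chunks would suggest a truncated input or records that are not four lines long.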
Demultiplexing
In /project/gtl/data/RHIR1/rawreads/
I removed extraneous spaces in the file that maps MIDs to individual identifiers (RHIR1_Demux.csv) and saved a fixed version (RHIR1Demux_fixed.csv):
sed -E 's/^([[:alnum:]-]+),([[:alnum:]-]+),([[:alnum:]-]+).*/\1,\2,\3/' RHIR1_Demux.csv > RHIR1Demux_fixed.csv
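To illustrate what this sed expression does, here is a small sketch on made-up rows. The real column layout and values of RHIR1_Demux.csv are not shown in these notes, so the rows below are purely hypothetical. The expression keeps the first three comma-separated fields (letters, digits and hyphens only) and drops everything after the third field. Note that a row that does not match the pattern, for example one with a leading space, passes through unchanged.

```shell
# Hypothetical rows; the real contents of RHIR1_Demux.csv may differ.
expr='s/^([[:alnum:]-]+),([[:alnum:]-]+),([[:alnum:]-]+).*/\1,\2,\3/'

# Trailing spaces, commas, or extra text after the third field are removed.
fixed=$(printf '%s\n' 'GBS-01,ACGTAC,RH-001   ,,' 'GBS-02,TTGACA,RH-002 extra' \
  | sed -E "$expr")
printf '%s\n' "$fixed"

# A row starting with a space does not match and is left as it was.
unchanged=$(printf '%s\n' ' GBS-03,CCGTAA,RH-003' | sed -E "$expr")
printf '[%s]\n' "$unchanged"
```

Because non-matching rows are passed through silently, it may be worth comparing the line counts of the original and fixed files and looking over the fixed file for any remaining spaces.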
cd /gscratch/grandol1/RHIR1/rawreads/
Parse split files
/project/gtl/data/raw/RHIR1/demultiplex/run_parsebarcodes_onSplitInput.pl
Recombine by sample name and MID
/project/gtl/data/raw/RHIR1/demultiplex/run_splitFastq_gbs.sh