Raw reads
We received four files with sequence reads. Two of these contain the 1x100 bp reads, because two lanes were used on the instrument. The other two exist only because CU unnecessarily ran indexing reads on the fragments. I deleted these nonsense files. The two files with the raw reads of interest are (these
started in /project/gtl/data/raw/HPAU1/rawreads).
RHIR1_Pool2_S1_L001_R1_001.fastq.gz (22 GB), 457,726,974 reads (109 GB uncompressed)
RHIR1_Pool2_S1_L002_R1_001.fastq.gz (22GB)
I combined the two files into one using:
cat RHIR1_Pool2_S1_L001_R1_001.fastq.gz RHIR1_Pool2_S1_L002_R1_001.fastq.gz > RHIR1.fastq.gz
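Concatenating gzip files with cat works because a gzip stream may consist of several members, which decompress to the concatenation of their contents. A minimal sketch of a sanity check on toy data follows. The file names lane1.fastq.gz, lane2.fastq.gz and combined.fastq.gz are hypothetical stand-ins, not the project files.

```shell
# Toy demonstration in a temporary directory; all file names are hypothetical.
tmpdir=$(mktemp -d)

# One 4-line fastq record per toy "lane".
printf '@r1\nACGT\n+\nIIII\n' | gzip -c > "$tmpdir/lane1.fastq.gz"
printf '@r2\nTTGA\n+\nIIII\n' | gzip -c > "$tmpdir/lane2.fastq.gz"

# Same approach as above: plain cat of the two compressed files.
cat "$tmpdir/lane1.fastq.gz" "$tmpdir/lane2.fastq.gz" > "$tmpdir/combined.fastq.gz"

# gzip -t checks the integrity of the combined multi-member file.
gzip -t "$tmpdir/combined.fastq.gz"

# Count reads: a fastq record is 4 lines.
combined_lines=$(gzip -dc "$tmpdir/combined.fastq.gz" | wc -l)
combined_reads=$((combined_lines / 4))
echo "reads in combined file: $combined_reads"

rm -rf "$tmpdir"
```

On the real data the same count could be compared against the sum of the counts for the two lane files before removing the originals.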
...
I used unpigz.sh to decompress the fastq files, because our parser does not read from gzipped files.
RHIR1.fastq.gz (44GB)
I then removed the originals with:
rm RHIR1_Pool*
Then, I split the files for faster processing:
Code Block |
---|
mkdir -p /gscratch/grandol1/RHIR1/rawreads
cd /gscratch/grandol1/RHIR1/rawreads
unpigz --to-stdout /project/gtl/data/raw/RHIR1/rawreads/RHIR1.fastq.gz | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - RHIR1_ |
Making 236 files
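Because 16,000,000 is a multiple of 4, no fastq record should straddle two of the split files. A sketch of how one might confirm that on toy data is below. The chunk_ prefix and the 8-line chunk size are illustrative assumptions, and the portable `-a 3` option stands in for the GNU long options used above.

```shell
# Toy demonstration; names and sizes are hypothetical.
tmpdir=$(mktemp -d)

# Five toy reads = 20 lines.
for i in 1 2 3 4 5; do
  printf '@read%s\nACGT\n+\nIIII\n' "$i"
done > "$tmpdir/toy.fastq"

# Split into chunks of 8 lines (2 reads); the last chunk holds the remainder.
split -l 8 -a 3 "$tmpdir/toy.fastq" "$tmpdir/chunk_"

# Every chunk should have a line count divisible by 4.
n_chunks=0
bad_chunks=0
for f in "$tmpdir"/chunk_*; do
  n_chunks=$((n_chunks + 1))
  n_lines=$(wc -l < "$f")
  if [ $((n_lines % 4)) -ne 0 ]; then
    bad_chunks=$((bad_chunks + 1))
  fi
done
echo "$n_chunks chunks, $bad_chunks with a partial read"

rm -rf "$tmpdir"
```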
Stopped Here on 05-19-2022
Demultiplexing
In /project/gtl/microbiomedata/analysesRHIR1/gtlrawreads/HMAX1
I removed extraneous spaces in the file that maps MIDs to individual identifiers (Hmax1DemuxRHIR1_Demux.csv). Also, the original Hmax1Demux.csv didn’t follow the scheme we have used for GBS: MIDname, MID, sample id. So, I made a fixed version (now we have Hmax1DemuxRHIR1Demux_fixed.csv):
Code Block | ||
---|---|---|
| ||
sed -E 's/^([[:alnum:]-]+),([[:alnum:]-]+),([[:alnum:]-]+).*/\3,\1,\2/' Hmax1DemuxRHIR1_Demux.csv > Hmax1Demux_fixed.csv |
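To see what the substitution does, here is a small worked example on a single row. The input column order (MID, sample id, MIDname) is inferred from the sed expression and from the fixed key shown in the diff further down; the row itself is made up for illustration.

```shell
# Hypothetical input row in the original column order: MID, sample id, MIDname.
input_row='CGATATAG,MN-ABR-BLA-10-L,C01'

# Same expression as above: capture three comma-separated fields
# (letters, digits, hyphens), drop anything after them, and emit
# them as field 3, field 1, field 2.
fixed_row=$(printf '%s\n' "$input_row" |
  sed -E 's/^([[:alnum:]-]+),([[:alnum:]-]+),([[:alnum:]-]+).*/\3,\1,\2/')

echo "$fixed_row"
```

Note that a row with any other character in one of the first three fields (a space or underscore, say) would not match and would pass through unchanged, so it may be worth checking the output for rows that still have the old order.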
This demux key indicated that several samples were technical replicates. The user also reported one error, which we corrected by hand (see diff below) to make Hmax1Demux_fixed2.csv (this file also lacks a final newline, but I verified that this is not a problem).
Code Block |
---|
196c196
< C01,CGATATAG,MN-ABR-BLA-9-L
---
> C01,CGATATAG,MN-ABR-BLA-10-L |
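For the missing final newline mentioned above, a quick check like the following could be used. It is only a sketch on a toy file: key_no_newline.csv is a hypothetical name, and whether a given parser copes with the unterminated last line still has to be verified separately, as was done here.

```shell
# Toy key file whose last line has no terminating newline (hypothetical name).
tmpdir=$(mktemp -d)
key="$tmpdir/key_no_newline.csv"
printf 'C01,CGATATAG,MN-ABR-BLA-10-L' > "$key"

# Command substitution strips a trailing newline, so an empty result
# means the last byte of the file is a newline.
ends_with_newline() {
  [ -z "$(tail -c 1 "$1")" ]
}

if ends_with_newline "$key"; then
  echo "final newline present"
else
  echo "final newline missing"
fi

# wc -l counts newline characters, so it misses the unterminated last row;
# awk still reads that row as a record.
wc_count=$(wc -l < "$key" | tr -d ' ')
awk_count=$(awk 'END { print NR }' "$key")
echo "wc -l: $wc_count, awk NR: $awk_count"

rm -rf "$tmpdir"
```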
Demultiplexing the two files in parallel took more than the two days I initially allocated to it (in part because ~10% of the data do not match our MIDs, since we did not filter contaminants). So I broke the data into 228 parts (each with 16 million lines, i.e. 4 million reads) and ran 228 jobs in parallel. I repeated this when we learned of the one error in the demux key.
Code Block | ||
---|---|---|
| ||
mkdir /gscratch/buerkle/data/HMAX1
cd /gscratch/buerkle/data/HMAX1
cat /project/microbiome/data/seq/HMAX1/rawreads/WyomingPool* | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - WyomingPool_HMAX1_
mkdir rawreads
mv WyomingPool_HMAX1_* rawreads/
/project/microbiome/analyses/gtl/HMAX1RHIR1Demux_fixed.csv |
cd /gscratch/grandol1/RHIR1/rawreads/
Parse split files
/project/gtl/data/raw/RHIR1/demultiplex/run_parsebarcodes_onSplitInput.pl
Note that I did not do separate contaminant filtering (which I did for Penstemon), because the parsing code and other downstream steps should knock out contaminants. I can double-check this.
I modified the script from the 16S/ITS work that splits fastq files into separate files based on the information in their info line. It is /project/microbiome/analyses/gtl/HMAX1/demultiplex/splitFastq_manyInputfiles_gbs.pl and is run with run_splitFastq_gbs.sh, in the same directory. Output was initially in /project/microbiome/analyses/gtl/HMAX1/demultiplex/sample_fastq. All of this is now in /project/microbiome/data/seq/HMAX1/demultiplex, so that it is reachable through Globus at /project/microbiome/data/seq/HMAX1/.
Eight individuals were duplicated, with different MIDs. Was this planned? I didn’t account for this in the parsing script (the info line only has the individual sample ID, not the MID). I could add it back in, but then the replicates would need to be merged. As it is now, all reads for an individual go into one file. There are also four tubes labeled 'BLANK' that will all have been merged (all the reads went into BLANK.GGATCCTT.fq).
Compressed all sample_fastq/ files with pigz, using sbatch /project/microbiome/data/seq/HMAX1/demultiplex/run_pigz.sh
Moved fastq for all four blank samples (data are all in one file because names are collapsed, as noted above) to a subfolder (/project/microbiome/data/seq/HMAX1/demultiplex/sample_fastq/blanks), to get them out of the way.
Started denovo assembly in /gscratch/buerkle/data/HMAX1/denovo
Completed first step for dDocent and am running cd-hit for 92%, 96% and 98% minimum match. Initially didn’t give these enough wall time and in reruns I bumped up the number of cores to 16.
...
Summarized denovo assemblies by cd-hit minimum match: 98% (46,971,194 contigs), 96% (27,839,279 contigs), 92% (18,493,729 contigs). Um, that’s a lot. Previously they aligned to the Helianthus annuus genome (v1.0), so we will try that. This was in "Testing for evolutionary change in restoration: A genomic comparison between ex situ, native, and commercial seed sources of Helianthus maximiliani". Fetching annuus genome (v2.0) now. It is in /project/evolgen/data/public/genomes/helianthus/GCF_002127325.2_HanXRQr2.0-SUNRISE_genomic.fna.gz
...
Next up, use bwa to map against reference genome.
To do:
...
Summarize the parse report files in /gscratch with some code to iterate over all the individual reports and get an overall count.
...
Recombine by sample name and mid
/project/gtl/data/raw/RHIR1/demultiplex/run_splitFastq_gbs.sh
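For the to-do item on summarizing the parse reports, something along these lines might work. This is only a sketch: the actual report format is not recorded here, so the `label: count` lines, the labels reads_matched and reads_unmatched, and the parsereport_*.txt file names are all hypothetical and would need to be adapted to what the parsing script really writes.

```shell
# Toy reports with a made-up "label: count" format (hypothetical names and labels).
tmpdir=$(mktemp -d)
printf 'reads_matched: 90\nreads_unmatched: 10\n' > "$tmpdir/parsereport_000.txt"
printf 'reads_matched: 85\nreads_unmatched: 15\n' > "$tmpdir/parsereport_001.txt"

# Sum the counts per label over all report files.
totals=$(cat "$tmpdir"/parsereport_*.txt |
  awk -F': *' '{ sum[$1] += $2 } END { for (k in sum) print k, sum[k] }' |
  sort)

echo "$totals"

rm -rf "$tmpdir"
```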