1Sauger bioinformatics
Raw reads
We received two files with sequence reads.
Demultiplexing
Â
mkdir /gscratch/grandol1/1SaugOdds /gscratch/grandol1/1SaugOdds/rawreads
cd /gscratch/grandol1/1SaugOdds
unpigz --to-stdout /project/gtl/data/distribution/Wagner/SJohnson/1Saug/rawreads/1SaugOdds.fastq.gz | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - 1SaugOdds_
/project/gtl/data/distribution/Wagner/SJohnson/1Saug/demultiplex/run_parsebarcodes_onSplitInput_Odds.pl
Repeat for Evens
mkdir /gscratch/grandol1/1SaugEvens /gscratch/grandol1/1SaugEvens/rawreads
cd /gscratch/grandol1/1SaugEvens
unpigz --to-stdout /project/gtl/data/distribution/Wagner/SJohnson/1Saug/rawreads/1SaugEvens.fastq.gz | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - 1SaugEvens_
/project/gtl/data/distribution/Wagner/SJohnson/1Saug/demultiplex/run_parsebarcodes_onSplitInputEvens.pl
I modified the script from 16S/ITS work for splitting fastq files based on information in their info line, to different files. It is: /project/gtl/data/distribution/Wagner/SJohnson/1Saug/splitFastq_manyInputfiles_gbs.pl
and is run with run_splitFastq_gbs.sh
, in the same directory. Output is in /project/gtl/data/distribution/Wagner/SJohnson/1Saug/demultiplex
.
Â
compressed all
sample_fastq/
files with pigz: usingsbatch /project/gtl/data/distribution/Wagner/SJohnson/1Saug/demultiplex/run_pigz.sh
started denovo assembly in
/gscratch/grandol1/data/HMAX1/denovo
Completed first step for dDocent and am running cd-hit for 92%, 96% and 98% minimum match. Initially didn’t give these enough wall time and in reruns I bumped up the number of cores to 16.
Stopped here 10-23-2024:
Assembly
Sep 10, 2022 Working in /project/microbiome/data/seq/HMAX1/assem
and assembling all reads in /project/microbiome/data/seq/HMAX1/demultiplex/sample_fastq/
against /project/evolgen/data/public/genomes/helianthus/GCF_002127325.2_HanXRQr2.0-SUNRISE_genomic.fna
.
Ran
bwa index -a bwtsw GCF_002127325.2_HanXRQr2.0-SUNRISE_genomic.fna
by hand in an interactive node (took roughly one hour)Commands are in
0_assem.nf
. Run this withnextflow run -bg 0_assem.nf -c teton.config
. These are jobs are using:module load swset/2018.05 gcc/7.3.0 bwa/0.7.17 samtools/1.12
as specified inteton.config
in this directory (bwa is version 0.7.17-r1188). Output is in/project/microbiome/data/seq/HMAX1/assem/sambam/
. Gave each job 60 minutes, which was unnecessarily long, but conservative. Longest running jobs I could see were less than 20 minutes. Moved all 468 inputs files through in about 30 minutes total.I removed the duplicative sam and unsorted bam files with:
rm -f *.sam *[^d].bam
, saving ~270 GB of space
Variant calling
Following steps from https://github.com/zgompert/DimensionsExperiment.
Built bcftools version 1.16 and installed in
/project/evolgen/bin/
.bcftools
needed reference genome in bzip2 format, not gzip. So I now simply have an unzipped reference genome, which I have reindexed.Completed this step with something like:
sbatch --account=evolgen --time=1-00:00 --nodes=1 --mem=8G --mail-type=ENDÂ 0_call_variants.sh
(this took 12 hours and 40 minutes and 552 MB of RAM; I asked for 120GB, which likely gave me the whole node and made it a bit faster)Filtered vcf with
1_filter_variants.sh
, which contains notes on the criteria that I used (could be altered to suit). This set is based on a fairly tight set of criteria (it matches what we used for a recent paper), which could be modified as needed. Note that currently there is no explicit minor allele frequency filtering. There are 5016 sites inhmax_variants_filtered.vcf
## minimum mapping quality of 30 was already enforced in bcftools mpileup ## These are written as exclusion filters # ------ INFO/DP < 952 # 2x depth overall: obtained with INFO/DP > 951 (476 x 2 = 952) ##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth"> # ------ INFO/AC1 < 10 ## a minimum of 10 alt reads to support a polymorphism # ------ INFO/BQBZ > 1e-5 ##INFO=<ID=BQBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Base Quality Bias (closer to 0 is better)"> # ------ INFO/RPBZ > 1e-5 ##INFO=<ID=RPBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Read Position Bias (closer to 0 is better)"> ## biallelic snps obtained in bcftools view ## 476 individuals (80% with data would mean 380 individuals; 380/476=0.798) ## set bcftools view to include only sites with the fraction of missingness less than 0.2