
...

10 Sep 2022 Working in /project/microbiome/data/seq/HMAX1/assem and assembling all reads in /project/microbiome/data/seq/HMAX1/demultiplex/sample_fastq/ against /project/evolgen/data/public/genomes/helianthus/GCF_002127325.2_HanXRQr2.0-SUNRISE_genomic.fna.gz.

Ran bwa index -a bwtsw GCF_002127325.2_HanXRQr2.0-SUNRISE_genomic.fna.gz by hand in an interactive node (took roughly one hour).
Commands are in 0_assem.nf. Run this with nextflow run -bg 0_assem.nf -c teton.config. These jobs use: module load swset/2018.05 gcc/7.3.0 bwa/0.7.17 samtools/1.12, as specified in teton.config in this directory (bwa is version 0.7.17-r1188). Output is in /project/microbiome/data/seq/HMAX1/assem/sambam/. Gave each job 60 minutes, which was conservative and longer than necessary: the longest-running jobs I could see took less than 20 minutes. Moved all 476 468 input files through in about 30 minutes total.
I removed the redundant sam and unsorted bam files with: rm -f *.sam *[^d].bam, saving ~270 GB of space.
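The cleanup glob above can be dry-run before deleting anything. The sketch below works in a scratch directory with made-up file names; it assumes (this is not stated above) that the sorted bam files end in "d.bam", e.g. *.sorted.bam, which is what the pattern relies on. It uses [!d], the POSIX spelling of the [^d] negation that bash also accepts.

```shell
# Scratch directory with hypothetical file names (not the real sambam/ contents).
tmp=$(mktemp -d)
touch "$tmp/s1.sam" "$tmp/s1.bam" "$tmp/s1.sorted.bam" "$tmp/s1.sorted.bam.bai"

# List what the rm pattern would match, without removing anything.
# *.sam      : every sam file
# *[!d].bam  : bam files whose name does NOT end in "d.bam" (so sorted bams are kept)
matched=$(cd "$tmp" && ls *.sam *[!d].bam)
echo "$matched"

# Clean up the scratch directory.
rm -rf "$tmp"
```

Replacing ls with rm -f in the real output directory would then remove only the listed files.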

Variant calling

Following steps from https://github.com/zgompert/DimensionsExperiment.
Built bcftools version 1.16 and installed it in /project/evolgen/bin/.
bcftools needed the reference genome in bgzip format, not gzip. So I now simply have an uncompressed reference genome, which I have reindexed.
Completed this step with something like: sbatch --account=evolgen --time=1-00:00 --nodes=1 --mem=120G 8G --mail-type=END 0_call_variants.sh

...

(this took 12 hours and 40 minutes and 552 MB of RAM; I asked for 120GB, which likely gave me the whole node and made it a bit faster)
Filtered vcf with 1_filter_variants.sh

...

Variant filtering

Used the following filters: 2X coverage (2302 reads), 10 alt. reads, not fixed, Mann-Whitney P for BQB = 0.01, Mann-Whitney P for RPB = 0.01, minimum mapping quality 30, missing data for fewer than 230 (80% with data), biallelic SNPs only.

Ended up with 64,061 SNPs in /uufs/chpc.utah.edu/common/home/gompert-group2/data/dimension_lyc_gbs/Variants/filtered2x_lmel_variants.vcf.

...

, which contains notes on the criteria that I used. These criteria are fairly tight (they match what we used for a recent paper) and could be modified as needed. Note that there is currently no explicit minor allele frequency filtering. There are 5016 sites in hmax_variants_filtered.vcf.

Code Block

## minimum mapping quality of 30 was already enforced in bcftools mpileup

## These are written as exclusion filters
# ------ INFO/DP < 952
# 2x depth overall: obtained with INFO/DP > 951 (476 x 2 = 952)
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">

# ------ INFO/AC1 < 10 
## a minimum of 10 alt reads to support a polymorphism

# ------ INFO/BQBZ > 1e-5
##INFO=<ID=BQBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Base Quality Bias (closer to 0 is better)">

# ------ INFO/RPBZ > 1e-5
##INFO=<ID=RPBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Read Position Bias (closer to 0 is better)">

## biallelic snps obtained in bcftools view

## 476 individuals (80% with data would mean 380 individuals; 380/476=0.798)
## set bcftools view to include only sites with the fraction of missingness less than 0.2
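The depth and missingness thresholds in the notes above follow from the number of individuals. As a minimal sketch, the arithmetic can be reproduced as below; the variable names are hypothetical and the script only prints the resulting expressions, it does not run bcftools. F_MISSING is the bcftools expression for the fraction of missing genotypes; how it is wired into 1_filter_variants.sh is an assumption here.

```shell
# Number of individuals, taken from the notes above.
n_ind=476

# 2x depth overall: exclude sites with INFO/DP below n_ind * 2.
min_depth=$((n_ind * 2))

# 80% with data: integer part of 0.8 * n_ind.
min_with_data=$((n_ind * 80 / 100))

# Fraction of individuals with data at that cutoff, to three decimals.
frac=$(awk -v a="$min_with_data" -v n="$n_ind" 'BEGIN { printf "%.3f", a / n }')

# Print the filter expressions (hypothetical wiring, not run here).
echo "exclude: INFO/DP < ${min_depth}"
echo "include: F_MISSING < 0.2 (${min_with_data} of ${n_ind} individuals = ${frac})"
```

Changing n_ind updates both thresholds together if samples are added or dropped.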

To do:

Summarize the parse report files in /gscratch with some code to iterate over all the individual reports and get an overall count.
