...
Eight individuals were duplicated, with different MIDs. Was this planned? I didn't account for this in the parsing script (the info line only has the individual sample ID, not the MID; I could add it back in, but then the replicates would need to be merged). As it is now, all reads for an individual are going into one file. There are also four tubes labeled 'BLANK' whose reads will all have been merged (they all went into BLANK.GGATCCTT.fq).
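One way to confirm which individuals are affected is to pull duplicated sample IDs out of the demultiplexing key. A minimal sketch, assuming (the file name and column order are guesses) a tab-delimited key with the MID in column 1 and the sample ID in column 2:
cut -f1,2 HMAX1_demux_key.txt | sort -u | awk -F'\t' '{n[$2]++} END {for (id in n) if (n[id] > 1) print id, n[id]}'
This lists each sample ID that appears with more than one MID, along with how many MIDs it has.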
Moved 8 files that contained no lines into empty/ with:
wc -l *.gz | awk '{if ($1 == "0") print $2 }' | xargs mv -t empty/
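Counting newlines in gzipped data is a bit indirect; a sketch of a check that looks at the decompressed content instead (same empty/ destination assumed):
for f in *.gz; do [ -z "$(zcat "$f" | head -c 1)" ] && mv "$f" empty/; done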
Compressed all sample_fastq/ files with pigz, using:
sbatch /project/microbiome/data/seq/HMAX1/demultiplex/run_pigz.sh
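run_pigz.sh is not copied here; it is presumably a small SLURM script along these lines (the directives and thread count are placeholders, not the actual settings):
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=16
#SBATCH --time=04:00:00
cd /project/microbiome/data/seq/HMAX1/demultiplex/sample_fastq/
# compress every fastq in place, using the allotted cores
pigz -p 16 *.fq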
Moved the fastq for all four blank samples (data are all in one file because the names collapsed; noted above) to a subfolder (/project/microbiome/data/seq/HMAX1/demultiplex/sample_fastq/blanks), to get them out of the way.
Started de novo assembly in /gscratch/buerkle/data/HMAX1/denovo.
Completed the first step for dDocent and am running cd-hit at 92%, 96%, and 98% minimum match (a sketch of the command is below). Initially I didn't give these jobs enough wall time, and in the reruns I bumped the number of cores up to 16.
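The inputs come out of the dDocent first step; for one of the similarity thresholds the cd-hit-est call is presumably along these lines (input and output names are placeholders):
cd-hit-est -i uniq.seqs.fasta -o reference_c92 -c 0.92 -T 16 -M 0 -g 1
Here -c sets the minimum identity (0.92, 0.96, or 0.98 for the three runs), -T the number of threads, -M 0 removes the memory cap, and -g 1 selects the slower but more accurate clustering mode.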
...
Working in /project/microbiome/data/seq/HMAX1/assem and assembling all reads in /project/microbiome/data/seq/HMAX1/demultiplex/sample_fastq/ against /project/evolgen/data/public/genomes/helianthus/GCF_002127325.2_HanXRQr2.0-SUNRISE_genomic.fna.gz.
Ran
bwa index -a bwtsw GCF_002127325.2_HanXRQr2.0-SUNRISE_genomic.fna.gz
by hand on an interactive node (took roughly one hour). Commands are in 0_assem.nf.
Run this with:
nextflow run -bg 0_assem.nf -c teton.config
These jobs use module load swset/2018.05 gcc/7.3.0 bwa/0.7.17 samtools/1.12, as specified in teton.config in this directory (bwa is version 0.7.17-r1188). Output is in /project/microbiome/data/seq/HMAX1/assem/sambam/.
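0_assem.nf is not reproduced here; the per-sample mapping step it wraps is presumably along these lines (thread count and file names are placeholders):
bwa mem -t 4 GCF_002127325.2_HanXRQr2.0-SUNRISE_genomic.fna.gz sample.fq | samtools sort -o sambam/sample.sorted.bam -
samtools index sambam/sample.sorted.bam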
Gave each job 60 minutes, which was unnecessarily long but conservative; the longest-running jobs I could see took less than 20 minutes. Moved all 477 468 input files through in about 30 minutes total. I removed the duplicative sam and unsorted bam files with:
rm -f *.sam *[^d].bam
saving ~270 GB of space.
Variant calling
Following steps from https://github.com/zgompert/DimensionsExperiment.
Built bcftools version 1.16 and installed it in /project/evolgen/bin/.
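This was presumably the standard build from the release tarball, along these lines (the prefix is chosen so that make install puts the binary in /project/evolgen/bin/):
wget https://github.com/samtools/bcftools/releases/download/1.16/bcftools-1.16.tar.bz2
tar -xjf bcftools-1.16.tar.bz2
cd bcftools-1.16
./configure --prefix=/project/evolgen
make && make install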
The example commands below are copied from that repository:
bcftools mpileup -C 50 -d 250 -f /uufs/chpc.utah.edu/common/home/gompert-group3/data/LmelGenome/Lmel_dovetailPacBio_genome.fasta -q 30 -Q 20 -I -b lmel_bams.txt -o lmel_variants.bcf -O u -a FORMAT/AD,FORMAT/DP
bcftools call -v -c -p 0.01 -P 0.001 -O v -o lmel_variants.vcf lmel_variants.bcf
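Adapted for HMAX1, the same two steps presumably look something like this (the bam list and output names are placeholders; the actual commands are in 0_call_variants.sh, below):
bcftools mpileup -C 50 -d 250 -f GCF_002127325.2_HanXRQr2.0-SUNRISE_genomic.fna -q 30 -Q 20 -I -b hmax_bams.txt -o hmax_variants.bcf -O u -a FORMAT/AD,FORMAT/DP
bcftools call -v -c -p 0.01 -P 0.001 -O v -o hmax_variants.vcf hmax_variants.bcf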
Variant filtering with vcfFilter.pl and filterSomeMore.pl
Used the following filters: 2X coverage (2302 reads), at least 10 alt. reads, not fixed, Mann-Whitney P for BQB = 0.01, Mann-Whitney P for RPB = 0.01, minimum mapping quality of 30, missing data for fewer than 230 individuals (80% with data), and biallelic SNPs only.
Ended up with 64,061 SNPs in /uufs/chpc.utah.edu/common/home/gompert-group2/data/dimension_lyc_gbs/Variants/filtered2x_lmel_variants.vcf.
...
bcftools needed the reference genome bgzip-compressed (or uncompressed), not plain gzip. So I now simply have an uncompressed reference genome, which I have reindexed.
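Re-preparing the reference presumably amounted to something like this (a sketch; at minimum the faidx index that bcftools mpileup -f reads needs to be built against the uncompressed file):
gunzip GCF_002127325.2_HanXRQr2.0-SUNRISE_genomic.fna.gz
samtools faidx GCF_002127325.2_HanXRQr2.0-SUNRISE_genomic.fna
Completed the variant-calling step with something like: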
sbatch --account=evolgen --time=1-00:00 --nodes=1 --mem=8G --mail-type=END 0_call_variants.sh
(This took 12 hours and 40 minutes and 552 MB of RAM; I asked for 120 GB, which likely gave me the whole node and made it a bit faster.) Filtered the vcf with 1_filter_variants.sh, which contains notes on the criteria I used. This is a fairly tight set of criteria (it matches what we used for a recent paper) and could be modified as needed. Note that currently there is no explicit minor allele frequency filtering. There are 5016 sites in hmax_variants_filtered.vcf.
## minimum mapping quality of 30 was already enforced in bcftools mpileup
## These are written as exclusion filters
# ------ INFO/DP < 952
# 2x depth overall: obtained with INFO/DP > 951 (476 x 2 = 952)
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
# ------ INFO/AC1 < 10
## a minimum of 10 alt reads to support a polymorphism
# ------ INFO/BQBZ > 1e-5
##INFO=<ID=BQBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Base Quality Bias (closer to 0 is better)">
# ------ INFO/RPBZ > 1e-5
##INFO=<ID=RPBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Read Position Bias (closer to 0 is better)">
## biallelic snps obtained in bcftools view
## 476 individuals (80% with data would mean 380 individuals; 380/476=0.798)
## set bcftools view to include only sites with the fraction of missingness less than 0.2
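1_filter_variants.sh holds the actual commands; one way to express those exclusions with bcftools view, mirroring the thresholds in the notes above (a sketch, not the script itself):
bcftools view -m2 -M2 -v snps -e 'INFO/DP < 952 || INFO/AC1 < 10 || INFO/BQBZ > 1e-5 || INFO/RPBZ > 1e-5 || F_MISSING >= 0.2' -O v -o hmax_variants_filtered.vcf hmax_variants.vcf
Here -m2 -M2 -v snps keeps biallelic SNPs only, and F_MISSING handles the missingness fraction.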
To do:
Summarize the parse report files in /gscratch with some code to iterate over all the individual reports and get an overall count.
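A possible starting point for that, assuming (this is a guess at the report format) that each report contains a single summary line such as 'Good mids count: N':
grep -h 'Good mids count' /gscratch/path/to/parse_reports/* | awk '{sum += $NF} END {print sum}'
The path and the grep pattern would need to be adjusted to whatever the parse report files actually contain.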