Raw reads
We received four files with sequence reads. Two of these contain the 1x100bp reads, because two lanes were used on the instrument. Two of these because CU unnecessarily ran indexing reads on the fragments. I deleted these nonsense files. The two files with the raw reads of interest are (these are in /project/microbiome/data/seq/HMAX1/rawreads
).
...
summarized denovo assemblies: 98 (46,971,194 contigs), 96 (27,839,279 contigs), 92 (18,493,729 contigs). Um, that’s a lot. Previously they aligned to the Helianthus annuus genome (v1.0). This was in Testing for evolutionary change in restoration: A genomic comparison between ex situ, native, and commercial seed sources of Helianthus maximiliani. We will use the v2.0 annuus genome (v2.0). It is in /
project/evolgen/data/public/genomes/helianthus/GCF_002127325.2_HanXRQr2.0-SUNRISE_genomic.fna.gz
Next up, use
bwa
to map against reference genome.
Assembly
Working in /project/microbiome/data/seq/HMAX1/assem
and assembling all reads in /project/microbiome/data/seq/HMAX1/demultiplex/sample_fastq/
against /project/evolgen/data/public/genomes/helianthus/GCF_002127325.2_HanXRQr2.0-SUNRISE_genomic.fna.gz
.
Ran
bwa index -a bwtsw GCF_002127325.2_HanXRQr2.0-SUNRISE_genomic.fna.gz
by hand in an interactive nodeCommands are in
0_assem.nf
. Run this withnextflow run -bg 0_assem.nf -c teton.config
. These are jobs are using:module load swset/2018.05 gcc/7.3.0 bwa/0.7.17 samtools/1.12
as specified inteton.config
in this directory (bwa is version 0.7.17-r1188)
… working here …
Variant calling
Copying steps from https://github.com/zgompert/DimensionsExperiment
Variant calling with
bcftools
version 1.9
Code Block |
---|
bcftools mpileup -C 50 -d 250 -f /uufs/chpc.utah.edu/common/home/gompert-group3/data/LmelGenome/Lmel_dovetailPacBio_genome.fasta -q 30 -Q 20 -I -b lmel_bams.txt -o lmel_variants.bcf -O u -a FORMAT/AD,FORMAT/DP bcftools call -v -c -p 0.01 -P 0.001 -O v -o lmel_variants.vcf lmel_variants.bcf |
Variant filtering with
vcfFilter.pl
andfilterSomeMore.pl
Used the following filters: 2X coverage (2302 reads), 10 alt. reads, not fixed, Man-Whitney P for BQB = 0.01, Man-Whitney P for RPB = 0.01, minimum mapping quality 30, missing data for fewer than 230 (80% with data), biallelic SNPs only.
Ended up with 64,061 SNPs in /uufs/chpc.utah.edu/common/home/gompert-group2/data/dimension_lyc_gbs/Variants/filtered2x_lmel_variants.vcf.
Next, I dropped SNPs with > mean + 3 SD coverage (possible repeats). This left 63,194 SNPs in morefilter_filtered2x_lmel_variants.vcf.
To do:
Summarize the parse report files in /gscratch with some code to iterate over all the individual reports and get an overall count.variant calling