...

  • Compressed all sample_fastq/ files with pigz, using sbatch /project/microbiomegtl/data/distribution/Wagner/seqSJohnson/HMAX11Saug/demultiplex/run_pigz.sh

  • Moved the fastq for all four blank samples (the data are all in one file because the names are collapsed, as noted above) to a subfolder (/project/microbiome/data/seq/HMAX1/demultiplex/sample_fastq/blanks) to get them out of the way.

  • Started de novo assembly in /gscratch/buerklegrandol1/data/HMAX1/denovo. Completed the first step for dDocent and am running cd-hit at 92%, 96%, and 98% minimum match. Initially I didn’t give these jobs enough wall time; in the reruns I bumped the number of cores up to 16.
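The three cd-hit runs could be generated with a loop like the one below. This is only a sketch, not the commands that were actually submitted: it assumes the nucleotide clustering program cd-hit-est, and the input and output file names (reads.uniq.fasta, clustered_NN.fasta) are hypothetical placeholders. It prints the commands rather than running them, so they can be inspected or pasted into an sbatch script.

```shell
#!/usr/bin/env bash
# Sketch: build one cd-hit-est command per minimum-match threshold.
# ASSUMPTIONS: cd-hit-est is the program used; file names are hypothetical.
in_fasta="reads.uniq.fasta"   # hypothetical input from the first dDocent step
threads=16                    # core count used in the reruns

cdhit_cmds=()
for pct in 92 96 98; do
    # -c: sequence identity threshold, -T: threads, -M 0: no memory limit
    cmd="cd-hit-est -i ${in_fasta} -o clustered_${pct}.fasta -c 0.${pct} -T ${threads} -M 0"
    cdhit_cmds+=("$cmd")
    echo "$cmd"
done
```

Each printed line can go into its own job script so that the three thresholds run in parallel, each with its own wall-time request.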

...

Stopped here 10-23-2024:

Assembly

Working in /project/microbiome/data/seq/HMAX1/assem and assembling all reads in /project/microbiome/data/seq/HMAX1/demultiplex/sample_fastq/ against /project/evolgen/data/public/genomes/helianthus/GCF_002127325.2_HanXRQr2.0-SUNRISE_genomic.fna.

...

  • Following steps from https://github.com/zgompert/DimensionsExperiment.

  • Built bcftools version 1.16 and installed in /project/evolgen/bin/.

  • bcftools needed the reference genome in bgzip (BGZF) format, not plain gzip. So I now simply have an uncompressed reference genome, which I have reindexed.

  • Completed this step with something like: sbatch --account=evolgen --time=1-00:00 --nodes=1 --mem=8G --mail-type=END 0_call_variants.sh (the run took 12 hours and 40 minutes and used 552 MB of RAM; for that run I had asked for 120 GB, which likely gave me the whole node and made it a bit faster)

  • Filtered the vcf with 1_filter_variants.sh, which contains notes on the criteria that I used. This is a fairly tight set of criteria (it matches what we used for a recent paper) and could be modified as needed. Note that currently there is no explicit minor allele frequency filtering. There are 5016 sites in hmax_variants_filtered.vcf.

    • Code Block
      ## minimum mapping quality of 30 was already enforced in bcftools mpileup
      
      ## These are written as exclusion filters
      # ------ INFO/DP < 952
      # 2x depth overall: obtained with INFO/DP > 951 (476 x 2 = 952)
      ##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
      
      # ------ INFO/AC1 < 10
      ## a minimum of 10 copies of the alt allele to support a polymorphism (AC1 is the estimated alt allele count, not a read count)
      
      # ------ INFO/BQBZ > 1e-5
      ##INFO=<ID=BQBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Base Quality Bias (closer to 0 is better)">
      
      # ------ INFO/RPBZ > 1e-5
      ##INFO=<ID=RPBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Read Position Bias (closer to 0 is better)">
      
      ## biallelic snps obtained in bcftools view
      
      ## 476 individuals (80% with data would mean 380 individuals; 380/476=0.798)
      ## set bcftools view to include only sites with the fraction of missingness less than 0.2
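The criteria in these notes could be assembled into bcftools calls roughly as follows. This is a sketch based only on the comments above, not the contents of 1_filter_variants.sh: the input file name hmax_variants.vcf is a hypothetical placeholder, and the split between bcftools view and bcftools filter is an assumption. The depth threshold is derived from the sample count so the arithmetic (476 x 2 = 952) stays visible.

```shell
#!/usr/bin/env bash
# Sketch: derive thresholds and build the exclusion expression.
# ASSUMPTIONS: in_vcf is a hypothetical name; the real script may differ.
n_ind=476                         # individuals in the data set
min_depth=$(( n_ind * 2 ))        # 2x depth overall
max_missing=0.2                   # keep sites with < 20% missing genotypes

in_vcf="hmax_variants.vcf"        # hypothetical unfiltered input
out_vcf="hmax_variants_filtered.vcf"

# Sites matching any of these conditions are excluded.
exclude_expr="INFO/DP < ${min_depth} || INFO/AC1 < 10 || INFO/BQBZ > 1e-5 || INFO/RPBZ > 1e-5"
echo "exclude: ${exclude_expr}"

# Run only when bcftools and the input are actually present.
if command -v bcftools >/dev/null 2>&1 && [ -f "$in_vcf" ]; then
    # Biallelic SNPs with acceptable missingness, then the exclusion filters.
    bcftools view -m2 -M2 -v snps -i "F_MISSING < ${max_missing}" "$in_vcf" |
        bcftools filter -e "$exclude_expr" -o "$out_vcf"
fi
```

Keeping the thresholds in variables makes it easy to loosen a single criterion, or to add a minor allele frequency filter later, without rewriting the expression.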

To do:

  • Summarize the parse report files in /gscratch with some code to iterate over all the individual reports and get an overall count.
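One possible starting point for this to-do item is sketched below. The report format is not described on this page, so everything about it here is an assumption: the sketch supposes that each report is a text file named parsereport_* and contains lines of the form "Label: integer" (for example a hypothetical "Good mids count: 12345"). The function name and arguments are likewise made up for illustration and would need adjusting to the real files.

```shell
#!/usr/bin/env bash
# Sketch: sum one labelled count across all parse reports in a directory.
# ASSUMPTIONS: files are named parsereport_* and hold "Label: integer" lines.
sum_report_counts() {
    local dir="$1" label="$2"
    # Concatenate every report, then add up the value after the chosen label.
    cat "$dir"/parsereport_* 2>/dev/null |
        awk -F': *' -v label="$label" \
            '$1 == label { total += $2 } END { print total + 0 }'
}

# Example call (directory and label are placeholders):
# sum_report_counts /path/to/reports "Good mids count"
```

If no report files match, the function prints 0 rather than failing, which makes an empty or mistyped directory easy to spot.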