Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Raw reads

We received four files with sequence reads. Two of these contain the 1x100bp reads, because two lanes were used on the instrument. Two of these because CU unnecessarily ran indexing reads on the fragments. I deleted these nonsense files. The two files with the raw reads of interest are (these are in /project/microbiome/data/seq/HMAX1/rawreads).

...

I used unpigz.sh to decompress the fastq files, because our parser does not read from gzipped files.

Demultiplexing

In /project/microbiome/analyses/gtl/HMAX1 I removed extraneous spaces in the file that maps MIDS to individual identifiers (Hmax1Demux.csv). Also, the original Hmax1Demux.csv didn’t follow the scheme we have used for GBS: MIDname, MID, sample id. So, I made a fixed version (now we have Hmax1Demux_fixed.csv):

Code Block
languagebash
sed -E 's/^([[:alnum:]-]+),([[:alnum:]-]+),([[:alnum:]-]+).*/\3,\1,\2/' Hmax1Demux.csv > Hmax1Demux_fixed.csv

On I started parse_barcodes_slurm_L1.sh and parse_barcodes_slurm_L2.sh This demux key indicated that there were several samples that were technical replicates, but the user also reported one error, which we corrected by hand (see diff below) and made Hmax1Demux_fixed2.csv (this file also lacks a final newline at the end of the file, but I verified this is not a problem

Code Block
196c196
< C01,CGATATAG,MN-ABR-BLA-9-L
---
> C01,CGATATAG,MN-ABR-BLA-10-L

Demultiplexing on the two files in parallel took more than the two days I initially allocated to it (in part because of the ~10% of the data that do not match our MIDS, because we did not filter contaminants). So I broke the data into 228 parts (each with 16 million lines) and ran 228 jobs in parallel. I repeated this when we learned of the one error in the demux key.

Code Block
languagebash
mkdir /gscratch/buerkle/data/HMAX1
cd /gscratch/buerkle/data/HMAX1
cat /project/microbiome/data/seq/HMAX1/rawreads/WyomingPool* | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - WyomingPool_HMAX1_
mkdir rawreads
mv WyomingPool_HMAX1_* rawreads/
/project/microbiome/analyses/gtl/HMAX1/demultiplex/run_parsebarcodes_onSplitInput.pl

Note that I did not do separate contaminant filtering (which I did for Penstemon), because the parsing code and other downstream steps should knock out contaminants. I can double-check this.

I modified the script from 16S/ITS work for splitting fastq files based on information in their info line, to different files. It is: /project/microbiome/analyses/gtl/HMAX1/demultiplex/splitFastq_manyInputfiles_gbs.pl and is run with run_splitFastq_gbs.sh, in the same directory. Output was initially in /project/microbiome/analyses/gtl/HMAX1/demultiplex/sample_fastq. All of this now is in /project/microbiome/data/seq/HMAX1/demultiplex, so that it is reachable thru globus at /project/microbiome/data/seq/HMAX1/.

Eight individuals were duplicated, with different MIDs. Was this planned? I didn’t account for this in the parsing script (the info line only has the individual sample ID, not the MID. I could add it back in. But then the replicates would need to be merged. As is now, all reads for an individual are going into one file. There are also four tubes labeled ‘BLANK' that will all have been merged (all the reads went into BLANK.GGATCCTT.fq).

On I moved 8 files that contained no lines into empty/ with wc -l *.gz | awk '{if ($1 == "0") print $2 }' | xargs  mv -t empty/

  • compressed all sample_fastq/ files with pigz: using sbatch /project/microbiome/data/seq/HMAX1/demultiplex/run_pigz.sh

  • moved fastq for all four blank samples (data are all in one file because names are collapsed; noted above) to a subfolder (/project/microbiome/data/seq/HMAX1/demultiplex/sample_fastq/blanks), to get them out of the way.

  • started denovo assembly in /gscratch/buerkle/data/HMAX1/denovo Completed first step for dDocent and am running cd-hit for 92%, 96% and 98% minimum match. Initially didn’t give these enough wall time and in reruns I bumped up the number of cores to 16.

Assembly

Working in /project/microbiome/data/seq/HMAX1/assem and assembling all reads in /project/microbiome/data/seq/HMAX1/demultiplex/sample_fastq/ against /project/evolgen/data/public/genomes/helianthus/GCF_002127325.2_HanXRQr2.0-SUNRISE_genomic.fna.

  1. Ran bwa index -a bwtsw GCF_002127325.2_HanXRQr2.0-SUNRISE_genomic.fna by hand in an interactive node (took roughly one hour)

  2. Commands are in 0_assem.nf. Run this with nextflow run -bg 0_assem.nf -c teton.config. These are jobs are using: module load swset/2018.05 gcc/7.3.0 bwa/0.7.17 samtools/1.12 as specified in teton.config in this directory (bwa is version 0.7.17-r1188). Output is in /project/microbiome/data/seq/HMAX1/assem/sambam/. Gave each job 60 minutes, which was unnecessarily long, but conservative. Longest running jobs I could see were less than 20 minutes. Moved all 468 inputs files through in about 30 minutes total.

  3. I removed the duplicative sam and unsorted bam files with: rm -f *.sam *[^d].bam, saving ~270 GB of space

Variant calling

  • Following steps from https://github.com/zgompert/DimensionsExperiment.

  • Built bcftools version 1.16 and installed in /project/evolgen/bin/.

  • bcftools needed reference genome in bzip2 format, not gzip. So I now simply have an unzipped reference genome, which I have reindexed.

  • Completed this step with something like: sbatch --account=evolgen --time=1-00:00 --nodes=1 --mem=8G --mail-type=END  0_call_variants.sh (this took 12 hours and 40 minutes and 552 MB of RAM; I asked for 120GB, which likely gave me the whole node and made it a bit faster)

  • Filtered vcf with 1_filter_variants.sh, which contains notes on the criteria that I used (could be altered to suit). This set is based on a fairly tight set of criteria (it matches what we used for a recent paper), which could be modified as needed. Note that currently there is no explicit minor allele frequency filtering. There are 5016 sites in hmax_variants_filtered.vcf

    • Code Block
      ## minimum mapping quality of 30 was already enforced in bcftools mpileup
      
      ## These are written as exclusion filters
      # ------ INFO/DP < 952
      # 2x depth overall: obtained with INFO/DP > 951 (476 x 2 = 952)
      ##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
      
      # ------ INFO/AC1 < 10 
      ## a minimum of 10 alt reads to support a polymorphism
      
      # ------ INFO/BQBZ > 1e-5
      ##INFO=<ID=BQBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Base Quality Bias (closer to 0 is better)">
      
      # ------ INFO/RPBZ > 1e-5
      ##INFO=<ID=RPBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Read Position Bias (closer to 0 is better)">
      
      ## biallelic snps obtained in bcftools view
      
      ## 476 individuals (80% with data would mean 380 individuals; 380/476=0.798)
      ## set bcftools view to include only sites with the fraction of missingness less than 0.2

To do:

  • Summarize the parse report files in /gscratch with some code to iterate over all the individual reports and get an overall count.