1Sauger bioinformatics

Raw reads

We received two files with sequence reads: 1SaugOdds.fastq.gz and 1SaugEvens.fastq.gz.

Demultiplexing

 

mkdir /gscratch/grandol1/1SaugOdds /gscratch/grandol1/1SaugOdds/rawreads
cd /gscratch/grandol1/1SaugOdds
unpigz --to-stdout /project/gtl/data/distribution/Wagner/SJohnson/1Saug/rawreads/1SaugOdds.fastq.gz | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - 1SaugOdds_
/project/gtl/data/distribution/Wagner/SJohnson/1Saug/demultiplex/run_parsebarcodes_onSplitInput_Odds.pl

Repeat for Evens

mkdir /gscratch/grandol1/1SaugEvens /gscratch/grandol1/1SaugEvens/rawreads
cd /gscratch/grandol1/1SaugEvens
unpigz --to-stdout /project/gtl/data/distribution/Wagner/SJohnson/1Saug/rawreads/1SaugEvens.fastq.gz | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - 1SaugEvens_
/project/gtl/data/distribution/Wagner/SJohnson/1Saug/demultiplex/run_parsebarcodes_onSplitInputEvens.pl

I modified the script from the 16S/ITS work that splits fastq records into different files based on information in their info (header) line. It is /project/gtl/data/distribution/Wagner/SJohnson/1Saug/splitFastq_manyInputfiles_gbs.pl and is run with run_splitFastq_gbs.sh, in the same directory. Output is in /project/gtl/data/distribution/Wagner/SJohnson/1Saug/demultiplex. A minimal sketch of this kind of splitting is shown below.

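The splitting step amounts to routing each four-line fastq record to a per-sample file keyed on a field in the header line. A minimal awk sketch of that idea is below, assuming the sample ID is the last colon-separated field of the header; the real splitFastq_manyInputfiles_gbs.pl may key on different information and name its outputs differently.

# Sketch only: send each 4-line fastq record to a per-sample file.
# Assumes sample_fastq/ exists and the sample ID is the last colon-separated
# field of the header line (an assumption, not the script's actual logic).
awk 'NR % 4 == 1 { n = split($1, a, ":"); id = a[n] }
     { print >> ("sample_fastq/" id ".fastq") }' 1SaugOdds_000.fastq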
 

  • compressed all sample_fastq/ files with pigz, submitted with sbatch /project/gtl/data/distribution/Wagner/SJohnson/1Saug/demultiplex/run_pigz.sh

  • started the de novo assembly in /gscratch/grandol1/data/HMAX1/denovo. Completed the first step for dDocent and am running cd-hit at 92%, 96%, and 98% minimum match (a sketch of the clustering command is below). Initially I didn’t give these jobs enough wall time, and in the reruns I bumped the number of cores up to 16.

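A hedged sketch of the clustering step: dDocent-style reference assembly clusters the reduced read set with cd-hit-est at a chosen identity threshold. The file names below are placeholders, not the actual files in the denovo directory.

# Sketch only: cluster candidate reference sequences at three identity thresholds.
# uniq_seqs.fasta stands in for the dDocent first-step output; -T 16 matches the
# 16 cores used in the reruns, and -M 0 removes cd-hit's memory cap.
for c in 0.92 0.96 0.98; do
  cd-hit-est -i uniq_seqs.fasta -o reference_${c}.fasta -c ${c} -T 16 -M 0 -g 1
done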
Stopped here 10-23-2024:

Assembly

Sep 10, 2022. Working in /project/microbiome/data/seq/HMAX1/assem, assembling all reads in /project/microbiome/data/seq/HMAX1/demultiplex/sample_fastq/ against /project/evolgen/data/public/genomes/helianthus/GCF_002127325.2_HanXRQr2.0-SUNRISE_genomic.fna.

  1. Ran bwa index -a bwtsw GCF_002127325.2_HanXRQr2.0-SUNRISE_genomic.fna by hand on an interactive node (took roughly one hour).

  2. Commands are in 0_assem.nf. Run this with nextflow run -bg 0_assem.nf -c teton.config. These jobs use module load swset/2018.05 gcc/7.3.0 bwa/0.7.17 samtools/1.12, as specified in teton.config in this directory (bwa is version 0.7.17-r1188). Output is in /project/microbiome/data/seq/HMAX1/assem/sambam/. Gave each job 60 minutes, which was unnecessarily long but conservative; the longest-running jobs I saw took less than 20 minutes. All 468 input files went through in about 30 minutes total. A sketch of the per-sample alignment command is below, after this list.

  3. I removed the duplicative sam and unsorted bam files with rm -f *.sam *[^d].bam, saving ~270 GB of space.

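The per-sample alignment step in 0_assem.nf presumably amounts to a bwa mem / samtools pipeline along the following lines; the sample name, read group, and thread count are illustrative assumptions, and the actual commands live in 0_assem.nf.

# Sketch only: align one demultiplexed sample against the reference and sort.
# The sample name and read group are placeholders.
REF=/project/evolgen/data/public/genomes/helianthus/GCF_002127325.2_HanXRQr2.0-SUNRISE_genomic.fna
bwa mem -t 4 -R "@RG\tID:sampleA\tSM:sampleA" $REF sampleA.fastq > sambam/sampleA.sam
samtools view -b sambam/sampleA.sam > sambam/sampleA.bam
samtools sort -o sambam/sampleA_sorted.bam sambam/sampleA.bam

Naming the sorted output *_sorted.bam would be consistent with the cleanup glob in step 3, which keeps only the bam files ending in d.bam.
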
Variant calling