LRII.6 Bioinformatics

Assign reads and otus to samples:

/project/microbiome/data_queue/seq/LRII_LocAd2_3_16_22/rawdata

salloc --account=microbiome -t 0-06:00

mkdir -p /gscratch/grandol1/LRII_LocAd2_3_16_22/rawdata

cd /gscratch/grandol1/LRII_LocAd2_3_16_22/rawdata

unpigz --to-stdout /project/microbiome/data_queue/seq/LRII_LocAd2_3_16_22/rawdata/LRII-5-RMJan22_S1_L001_R1_001.fastq.gz | split -l 1000000 -d --suffix-length=3 --additional-suffix=.fastq - LRII_LocAd2_3_16_22_R1_ ;unpigz --to-stdout /project/microbiome/data_queue/seq/LRII_LocAd2_3_16_22/rawdata/LRII-5-RMJan22_S1_L001_R2_001.fastq.gz | split -l 1000000 -d --suffix-length=3 --additional-suffix=.fastq - LRII_LocAd2_3_16_22_R2_

//project/microbiome/data_queue/seq/LRII_LocAd2_3_16_22/rawdata/run_parse_count_onSplitInput.pl

cd /project/microbiome/data_queue/seq/LRII_LocAd2_3_16_22/rawdata

./run_splitFastq_fwd.sh

./run_splitFastq_rev.sh

cd /project/microbiome/data_queue/seq/LRII_LocAd2_3_16_22/rawdata

./run_aggregate.sh

Exploration of Data created so far:

From my understanding, Line 9 from above simply splits the raw data into equal sized files, but the total number of reads should remain constant.

cd /gscratch/grandol1/LRII_LocAd2_3_16_22/rawdata

wc -l LRII*

Should return 8x the number of paired end reads (2x for R1 and 4x for the structure of fastq files).

This returns: 33809712 total

Divided by 8: 4226214

Line 11 then reads through all of the split files and assigns each read to a sample (parsed), to PhiX or non target (phixOther), or a mid error (truemiderrors). The reads assigned to these should add up to the numbers above.

wc -l parsed*

Returns: 25644064 total

Divided by 8: 3205508 assigned to samples.

Assigned/Total (*100) = percent assigned: ~76%

The target for samples was 80%.

Things get more confusing with the phixOther and truemiderror files, because they do not appear to be true fastq files nor do they appear to be Fasta. So, I do not know how to count reads.

Blasting random lines from phixOther returns a mix of phiX and ‘uncultured bacterium 16S’. I see no way of disentangling this.

So, let us explore the results of lines 13 to 17. These should be found in /project/microbiome/data_queue/seq/loc_ad2/rawdata/sample_fastq/16S/locad2

and

/project/microbiome/data_queue/seq/loc_ad2/rawdata/sample_fastq/16S/LRII

For locad2:

wc -l locad2*

Returns: 18304944 total

The file formats appear the same as the “parsed*” files above.

Divided by 8: 2288118

For LRII:

wc -l LRII*

Returns: 7339120 total

Divided by 8: 917390

LRII + locad2:

2288118+917390= 3205508

Same as the parsed read count above.

~~cd /project/microbiome/data_queue/seq/loc_ad2/tfmergedreads~~

~~./run_slurm_mergereads.pl~~

~~cd /project/microbiome/data_queue/seq/LowReadII/otu~~

~~./run_slurm_mkotu.pl~~

Assign taxonomy

salloc --account=microbiome -t 0-02:00 --mem=500G

module load swset/2018.05  gcc/7.3.0

module load vsearch/2.15.1

vsearch --sintax zotus.fa --db /project/microbiome/users/grandol1/ref_db/gg_16s_13.5.fa -tabbedout LRII.sintax -sintax_cutoff 0.8

~~Output:~~

~~Reading file /project/microbiome/users/grandol1/ref_db/gg_16s_13.5.fa 100%~~

~~1769520677 nt in 1262986 seqs, min 1111, max 2368, avg 1401~~

~~Counting k-mers 100%~~

~~Creating k-mer index 100%~~

~~Classifying sequences 100%~~

~~Classified 4038 of 4042 sequences (99.90%)~~

Convert into useful form:

awk -F "\t" '{OFS=","} NR==1 {print "OTU_ID","SEQS","SIZE","DOMAIN","KINGDOM","PHYLUM","CLASS","ORDER","FAMILY","GENUS","SPECIES"} {gsub(";", ","); gsub("centroid=", ""); gsub("seqs=", ""); gsub("size=", ""); match($4, /d:[^,]+/, d); match($4, /k:[^,]+/, k); match($4, /p:[^,]+/, p); match($4, /c:[^,]+/, c); match($4, /o:[^,]+/, o); match($4, /f:[^,]+/, f); match($4, /g:[^,]+/, g); match($4, /s:[^,]+/, s); print $1, d[0]=="" ? "NA" : d[0], k[0]=="" ? "NA" : k[0], p[0]=="" ? "NA" : p[0], c[0]=="" ? "NA" : c[0], o[0]=="" ? "NA" : o[0], f[0]=="" ? "NA" : f[0], g[0]=="" ? "NA" : g[0], s[0]=="" ? "NA" : s[0] }' LRII.sintax > LRIItaxonomy.csv