Assign reads and OTUs to samples:
Code Block:
/project/microbiome/data_queue/seq/LowReadII/rawdata
...
salloc --account=microbiome -t 0-06:00
...
mkdir -p /gscratch/grandol1/LowReadII/rawdata
...
cd /gscratch/grandol1/LowReadII/rawdata
...
unpigz --to-stdout /project/microbiome/data_queue/seq/LowReadII/rawdata/Low-Read-II_S1_L001_R1_001.fastq | split -l 1000000 -d --suffix-length=3 --additional-suffix=.fastq - LowReadII_R1_ ;
...
unpigz --to-stdout /project/microbiome/data_queue/seq/LowReadII/rawdata/Low-Read-II_S1_L001_R2_001.fastq | split -l 1000000 -d --suffix-length=3 --additional-suffix=.fastq - LowReadII_R2_
...
/project/microbiome/data_queue/seq/LowReadII/rawdata/run_parse_count_onSplitInput.pl
...
cd /project/microbiome/data_queue/seq/LowReadII/rawdata
...
./run_splitFastq_fwd.sh
...
./run_splitFastq_rev.sh
...
./run_aggregate.sh
...
cd /project/microbiome/data_queue/seq/LowReadII/tfmergedreads
...
./run_slurm_mergereads.pl
...
cd /project/microbiome/data_queue/seq/LowReadII/otu
...
./run_slurm_mkotu.pl
vsearch --sintax OTUFILE --db REFERENCEDATABASE --tabbedout OUTPUT --sintax_cutoff 0.8 --strand both --threads 32
Assign taxonomy:
Code Block:
salloc --account=microbiome -t 0-02:00 --mem=500G
...
module load swset/2018.05 gcc/7.3.0
...
module load vsearch/2.15.1
...
vsearch --sintax zotus.fa --db /project/microbiome/users/grandol1/ref_db/gg_16s_13.5.fa --tabbedout LRII.sintax --sintax_cutoff 0.8
Output:
Reading file /project/microbiome/users/grandol1/ref_db/gg_16s_13.5.fa 100%
...
Classified 4038 of 4042 sequences (99.90%)
Convert into useful form:
Code Block:
# Parse the sintax tabbed output into a CSV. Field 1 is "centroid=otuN;seqs=X;size=Y";
# field 4 is the taxonomy truncated at the 0.8 cutoff.
# Note: the three-argument match() requires gawk.
awk -F "\t" '{OFS=","}
  NR==1 {print "OTU_ID","SEQS","SIZE","DOMAIN","KINGDOM","PHYLUM","CLASS","ORDER","FAMILY","GENUS","SPECIES"}
  {
    gsub(";", ","); gsub("centroid=", ""); gsub("seqs=", ""); gsub("size=", "");
    match($4, /d:[^,]+/, d); match($4, /k:[^,]+/, k); match($4, /p:[^,]+/, p); match($4, /c:[^,]+/, c);
    match($4, /o:[^,]+/, o); match($4, /f:[^,]+/, f); match($4, /g:[^,]+/, g); match($4, /s:[^,]+/, s);
    print $1, d[0]=="" ? "NA" : d[0], k[0]=="" ? "NA" : k[0], p[0]=="" ? "NA" : p[0], c[0]=="" ? "NA" : c[0],
      o[0]=="" ? "NA" : o[0], f[0]=="" ? "NA" : f[0], g[0]=="" ? "NA" : g[0], s[0]=="" ? "NA" : s[0]
  }' LRII.sintax > LRIItaxonomy.csv
...
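As a quick sanity check on the conversion, the CSV can be inspected in R. This is a minimal sketch of our own (not part of the original workflow); it assumes the file name written by the awk command above.
Code Block:
# Load the CSV written by the awk one-liner; the literal "NA" strings become real NAs
tax <- read.csv("LRIItaxonomy.csv", header = TRUE, stringsAsFactors = FALSE)
head(tax)                 # OTU_ID, SEQS, SIZE, then DOMAIN ... SPECIES
table(is.na(tax$FAMILY))  # count OTUs lacking a family-level call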
LowReadII Analysis
This experiment was intended to answer the questions below.
Questions:
1. How can we best accommodate low concentration input DNAs?
2. Concentrate input DNAs?
3. Concentrate/normalize low yield products based on qPCR?
4. Both?
5. Just put more volume of original DNA in?
The questions actually probed are 1, 5, and 3, though one could argue that 2 and 5 are similar enough, short of also doing a cleanup of the DNA.
Based on sample sequencing outputs from the 2-step prep, we selected 4 average-read samples and 8 low-read samples. We used our 1-step prep for this experiment, though. All samples were normalized to 10 ng/ul or less. 2 ul of each average-read sample was used for the 1-step PCR. The low-read samples were repeated 3 times at different volumes (2 ul, 4 ul, and 8 ul). 2 ul of 10 ng/ul Zymo Mock Community (Listeria monocytogenes - 12%, Pseudomonas aeruginosa - 12%, Bacillus subtilis - 12%, Escherichia coli - 12%, Salmonella enterica - 12%, Lactobacillus fermentum - 12%, Enterococcus faecalis - 12%, Staphylococcus aureus - 12%, Saccharomyces cerevisiae - 2%, and Cryptococcus neoformans - 2%) was used as a positive control in quadruplicate, and 2 ul of ultrapure water was used as a negative control. After PCR and cleanup, 2 ul from each of these was pooled. All samples have "_2", "_4", or "_8" appended to the sample name to indicate the DNA volume used in PCR.
The entire setup above was repeated with the same samples, with the intention of addressing question 3. After PCR and cleanup, these preps were all quantified via qPCR, and various volumes were combined so that each prep was equimolar in the pool. These sample names have "_N" appended to them, for "normalized".
Both pools were then quantified via qPCR, adjusted to 1 nM, and combined in equal volumes to maintain the 1 nM concentration. After dilution to 50 pM, this was the pool used for sequencing on the iSeq.
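For reference, the equimolar pooling and final dilution both come down to C1V1 = C2V2. A minimal R sketch of the arithmetic (the concentrations and volumes below are made-up illustration values, not the actual prep numbers):
Code Block:
# Equimolar pooling: volume of each prep so every prep contributes the same
# number of molecules. All concentrations here are hypothetical illustrations.
prep_conc_nM <- c(A = 8.2, B = 3.1, C = 12.5)  # qPCR-derived concentrations
target_fmol <- 10                              # femtomoles wanted from each prep
vol_ul <- target_fmol / prep_conc_nM           # V = n / C (1 nM == 1 fmol/ul)
vol_ul

# Dilute a 1 nM pool to 50 pM: C1V1 = C2V2, so V2 = C1*V1/C2
pool_ul    <- 20                               # hypothetical volume of 1 nM pool
diluent_ul <- pool_ul * (1000 / 50) - pool_ul  # 1 nM = 1000 pM
diluent_ul                                     # ul of diluent to add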
Read in the OTU table and transpose it. Remove the values added to the front of the sample names (MID assignments). Sort by sample name and remove the extraneous text "centroid=".
Code Block:
library(tidyverse)
library(ggplot2)
#read in otu table
LowReadOtu <- read.csv("/Volumes/Macintosh HD/Users/gregg/Desktop/WorkingFolder/otutable.txt", sep = "", header = TRUE, stringsAsFactors = FALSE)
#LowReadReads <- read.csv("/Volumes/Macintosh HD/Users/gregg/Desktop/WorkingFolder/filtermergestats.csv", header=FALSE, stringsAsFactors = FALSE, na.strings = "")
#LowReadReads <- separate(LowReadReads, 1, sep="[.]", into = c(NA, "samplename", NA, NA, NA, NA, NA, NA, NA))
#LowReadOtu <- as.data.frame(LowReadOtu)
#transpose
LowReadOtu<- LowReadOtu %>%
rownames_to_column() %>%
gather(variable, value, -rowname) %>%
spread(rowname, value)
colnames(LowReadOtu) <- LowReadOtu[1,]
LowReadOtu <- LowReadOtu[-1, ]
#simplify naming
LowReadOtu$OTUID <- gsub("rna16S_[A-Z]+_[A-Z]+_","", LowReadOtu[,1])
#Sort by name
LowReadOtu <- arrange(LowReadOtu, OTUID)
#simplify otu names
colnames(LowReadOtu) <- gsub("centroid=", "", colnames(LowReadOtu))
LowReadOtu <- type_convert(LowReadOtu)
LowReadOtu[is.na(LowReadOtu)] <- 0
substrRight <- function(x, n){
substr(x, nchar(x)-n+1, nchar(x))
}
#flag qPCR-normalized samples: names ending in "N" (gsub leaves "" for a digit, "N" otherwise)
LowReadOtu$Normalized <- gsub("[0-9]", "", substrRight(LowReadOtu$OTUID, 1))
LowReadOtu <- arrange(LowReadOtu, Normalized)
#extract the DNA volume used: the greedy .* makes the capture grab the last digit in the name
LowReadOtu$Volume <- sub('.*(\\d{1}).*', '\\1', LowReadOtu$OTUID)
#Total reads assigned to otus
LowReadOtu <- LowReadOtu %>%
rowwise()%>%
mutate(reads = sum(c_across(otu1:otu1967)))
#make sums first observation
LowReadOtu <- cbind(LowReadOtu[1], LowReadOtu[,3803], LowReadOtu[2:3802])
#Calculate Percent ISD for all samples
LowReadOtu$PercISD <- LowReadOtu$otu1 / LowReadOtu$reads
LowReadOtu <- LowReadOtu[,c(1, 3804, 2:3803)]
#add repeats together maintaining V and N categorizations
LRIIotu <- LowReadOtu %>%
group_by(OTUID, Volume, Normalized) %>%
summarise(TotalReads = sum(reads))
ggplot(LRIIotu, aes(TotalReads, Volume, color=OTUID, shape=Normalized))+
geom_point(show.legend = FALSE)
...
Looking at total reads, neither qPCR normalization nor the volume of DNA used appears to have a consistent impact. Why does there appear to be a disconnect between qPCR results and reads? We separated out each sample and compared it to itself under the different conditions.
Code Block:
#fix odd naming convention SAG_ vs SAG#
LRIIotu$OTUID <- gsub("SAG_", "SAG", LRIIotu$OTUID)
#remove V and N appendices from samplenames
LRIIotu$OTUID <- gsub("_.*$", "", LRIIotu$OTUID)
ggplot(LRIIotu[45:72,], aes(TotalReads, Volume, color=OTUID, shape=Normalized))+
geom_point()+
facet_wrap(~OTUID)
ggplot(LRIIotu[1:44,], aes(TotalReads, Volume, color=OTUID, shape=Normalized))+
geom_point()+
facet_wrap(~OTUID)
ggplot(LRIIotu, aes(TotalReads, Volume, color=OTUID, shape=Normalized))+
geom_point()+
facet_wrap(~OTUID)
...
There still appears to be a disconnect, though from 2 to 4 ul there is a more consistent increase for the non-normalized samples. Reads do not consistently go up with increased starting DNA volume. So, is there a problem with our sample setup? Is our demux key incorrect? Are the blanks largely ISD?
Code Block:
#Explore how the blanks performed
BlanksOnly <- LowReadOtu[c(1:8, 73:80),]
library(janitor)
BlanksOnly <- BlanksOnly %>%
adorn_totals("row")
#keep only columns with a nonzero total (row 17 is the adorn_totals "Total" row)
BlanksOnly <- BlanksOnly[which(BlanksOnly[17,] != 0)]
#Some contamination, but mostly ISD. One replicate of Blank1_2_N is suspect at only 67% ISD.
#How many samples are more than 65% ISD?
LowReadOtu$OTUID[which(LowReadOtu$PercISD > 0.65)]
[1] "Blank1_2" "Blank1_2" "Blank2_2" "Blank2_2" "Blank3_2" "Blank3_2" "Blank4_2" "Blank4_2"
[9] "SAG_S30P2R_8" "SAG_S30P2R_8" "SAG_S38P1R_8" "SAG_S38P1R_8" "Blank1_2_N" "Blank1_2_N" "Blank2_2_N" "Blank2_2_N"
[17] "Blank3_2_N" "Blank3_2_N" "Blank4_2_N" "Blank4_2_N" "SAG_S15P4R_8_N" "SAG_S38P1R_8_N" "SAG_S38P1R_8_N"
All of the 8 ul iterations of SAG_S38P1R appear to be largely ISD, and all of the samples falling in the blank range are 8 ul samples. Does too much DNA, or too much inhibitor carried along with the DNA, favor the ISD?
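One quick way to probe this is to summarize the ISD fraction by volume and normalization status. A minimal sketch using the LowReadOtu columns built above (the grouping choice is ours, not part of the original analysis):
Code Block:
# Median ISD fraction per DNA volume / normalization group; if high input
# volume favors the ISD, the 8 ul groups should stand out.
LowReadOtu %>%
  group_by(Volume, Normalized) %>%
  summarise(MedianPercISD = median(PercISD), n = n(), .groups = "drop") %>%
  arrange(desc(MedianPercISD))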
Blanks look good, excluding one suspect at only 67% ISD. What about the Mock Community? Is the Mock Community largely the 8 expected bacterial taxa (Listeria monocytogenes, Pseudomonas aeruginosa, Bacillus subtilis, Escherichia coli, Salmonella enterica, Lactobacillus fermentum, Enterococcus faecalis, Staphylococcus aureus)?
Code Block:
#Read in taxonomy data
LowReadSyntax <- read.csv("/Volumes/Macintosh HD/Users/gregg/Desktop/WorkingFolder/LRIItaxonomy.csv", header = TRUE, stringsAsFactors = FALSE)
#Simplify table
LowReadSyntax <- LowReadSyntax[,c(1, 3:11) ]
#Is Mock Community largely List mono, Pseu aer, Bac sub, E coli, Sal ent, Lac fer, Ent fae, and Sta aur ??? Isolate MC
MConly <- LowReadOtu[c(9:16, 81:88),]
#add totals per otu
MConly <- MConly %>%
adorn_totals("row")
#remove empty otu columns
MConly <- MConly[which(MConly[17,] != 0)]
#isolate taxonomy data for MC samples only
MConlyTaxa <- LowReadSyntax[LowReadSyntax$OTU_ID %in% colnames(MConly)[4:65],]
#On a cursory examination, Family seems to be the most consistently called lowest taxonomic level for these samples at 2x151 sequencing.
MCtaxa <- unique(MConlyTaxa$FAMILY)
MCtaxa
[1] NA "f:mitochondria" "f:Lactobacillaceae" "f:Pseudomonadaceae" "f:Enterobacteriaceae"
[6] "f:Bacillaceae" "f:Staphylococcaceae" "f:Listeriaceae" "f:Enterococcaceae" "f:[Chthoniobacteraceae]"
So, there are some contaminants that should not be in the Mock Community.
Code Block:
#Mock community bacterial community should be:
#Listeria monocytogenes - 12%, Pseudomonas aeruginosa - 12%, Bacillus subtilis - 12%, Escherichia coli - 12%, Salmonella enterica - 12%, Lactobacillus fermentum - 12%, Enterococcus faecalis - 12%, Staphylococcus aureus - 12%, Saccharomyces cerevisiae - 2%, and Cryptococcus neoformans - 2%.
#At the family level, elements 3 through 9 of MCtaxa are the expected families. Salmonella seems to be the least resolved on this iSeq run.
MCtaxa <- MCtaxa[3:9]
#isolate the contaminants
MCcontaminantsTaxa <- subset(MConlyTaxa, !(FAMILY %in% MCtaxa))
#otu1 is not a contaminant; it is ISD. Isolate the others.
MCcontaminantsTaxa <- MCcontaminantsTaxa[2:9,]
#isolate the relevant contaminant otus
MCcon <- MCcontaminantsTaxa$OTU_ID
MCcon
[1] "otu31" "otu18" "otu51" "otu1443" "otu1766" "otu2540" "otu3273" "otu3278"
Are these contaminants present in the blanks? Four of them are.
Code Block:
#Are these contaminants in the blanks? 4 of them are
McBlankContaminants <- BlanksOnly[, c("OTUID", "reads", "otu31", "otu18", "otu51", "otu2540")]
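A programmatic way to confirm which of the eight contaminant OTUs show up in the blanks (a minimal sketch; it relies on BlanksOnly having already been reduced to columns with nonzero totals):
Code Block:
# Intersect the contaminant otu ids with the otu columns remaining in
# BlanksOnly; only contaminants actually observed in the blanks survive.
intersect(MCcon, colnames(BlanksOnly))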
The Mock Community appears to be as expected, with some contamination. How do the samples look? Could any of them be MC or blanks, remembering that some 8 ul DNA samples are showing high ISD?
Code Block:
#Look at only Samples
Samples <- LowReadOtu[c(17:72, 89:144),]
Samples <- Samples %>%
  adorn_totals("row")
#Remove empty columns
Samples <- Samples[which(Samples[113,] != 0)]
#Cursorily, these are not mostly ISD or MC
#Lets remove Blank contaminant otus and see
SamplesnoBlank <- Samples %>%
  select(-one_of(colnames(BlanksOnly)[5:65]))
#That removes 61 otus. How does it impact read counts? Redo the calculations
SamplesnoBlank <- SamplesnoBlank %>%
  rowwise() %>%
  mutate(NewReads = sum(c_across(otu1:otu1967)))
SamplesnoBlank <- SamplesnoBlank[,c(1, 3696, 2:3695)]
#Less than 5% change in read totals after removing otus contaminant in the blanks
SamplesnoBlank[113,2]/SamplesnoBlank[113,4]
0.9532478
Less than 5% of the reads for the non-control samples come from the contaminants present in the blanks. How does this carry over per sample?
Code Block:
#How about per sample?
SamplesnoBlank$PercDiff <- 1 - (SamplesnoBlank$NewReads/SamplesnoBlank$reads)
SamplesnoBlank <- SamplesnoBlank[, c(1, 3697, 2:3696)]
SamplesnoBlank <- arrange(SamplesnoBlank, PercDiff)
#There are actually 10 samples where more than ~10% of the reads come from the blank contaminants, the largest at 33%
PoorSamples <- SamplesnoBlank[104:113,]
PoorSamples <- arrange(PoorSamples, OTUID)
PoorSamples[,1:6]
OTUID | *PercDiff | NewReads | PercISD | reads | otu1
---|---|---|---|---|---
SAG_S30P2R_2 | 0.3068182 | 610 | 0.0750000000 | 880 | 66
SAG_S30P2R_4 | 0.1120219 | 650 | 0.3360655738 | 732 | 246
SAG_S38P1R_2 | 0.3333333 | 2 | 0.0000000000 | 3 | 0
SAG_S44P2R_2 | 0.1930036 | 669 | 0.0024125452 | 829 | 2
SAG_S44P2R_2 | 0.3133940 | 528 | 0.0494148244 | 769 | 38
SAG_S44P2R_2_N | 0.1647915 | 2524 | 0.0274652548 | 3022 | 83
SAG_S44P2R_2_N | 0.2193375 | 4360 | 0.0044762757 | 5585 | 25
SAG190238_4_N | 0.1249619 | 2871 | 0.0262115209 | 3281 | 86
SAG191508_2 | 0.1002424 | 5197 | 0.0001731302 | 5776 | 1
SAG191508_2 | 0.1058133 | 4876 | 0.0001833853 | 5453 | 1
*PercDiff is the percent of reads coming from the contaminants in the blanks.
The graph shows all samples having more than 10% of their reads from blank contaminants.
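The plotting code for that graph was not captured above; a minimal sketch that rebuilds it from the PoorSamples table (the geometry and axis choices are our assumptions, not the original figure's):
Code Block:
# Plot the 10 worst samples: fraction of reads attributable to blank
# contaminants (PercDiff), one bar per sample. Aesthetics are illustrative.
ggplot(PoorSamples, aes(x = OTUID, y = PercDiff, fill = OTUID)) +
  geom_col(show.legend = FALSE) +
  labs(x = "Sample", y = "Fraction of reads from blank contaminants") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))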
We should repeat the post-qPCR normalization portion of this experiment, because our qPCR machine needed recalibration and the redone results are much different for the "normalized" samples. Increasing the volume of DNA put into the PCR reaction, however, does not appear to have the desired impact.