LRII.5 Bioinformatics

The red Low Read II samples were renormalized via pooling, and the new MC samples were added to that same pool at 1 ul per sample. This pool was then adjusted to 1 nM. The pool of the repeated RMJan22 samples was also adjusted to 1 nM. 50 ul of the Low Read pool was added to 100 ul of the RMJan22 pool. However, when the Low Read pool was run on qPCR, one replicate was much higher than the other two, so the intended 1:2 ratio might be off. ~~We are running an RNaseP plate to recheck the recalibration of the 7500 qPCR machine.~~ We ran an RNaseP test plate to check the recalibration of the qPCR machine and it looks good. Because the percentage of reads passing filter was so low (35%), we are going to re-qPCR new dilutions of the pools and try this again.
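As a rough check on what the pooling volumes imply, the sketch below computes the share of the combined pool contributed by each input. It assumes both pools really are at 1 nM, which is exactly what the inconsistent qPCR replicate puts in doubt. `share` is a hypothetical helper name.

```shell
# Percent of the combined pool contributed by one input volume.
# Assumption: both pools are at the same molarity (1 nM), so the
# volume share equals the expected share of reads.
share() {
  awk -v part="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * part / total }'
}

low_read_ul=50
rmjan22_ul=100
total_ul=$((low_read_ul + rmjan22_ul))

echo "Low Read: $(share "$low_read_ul" "$total_ul")% of pool"
echo "RMJan22:  $(share "$rmjan22_ul" "$total_ul")% of pool"
```

If the Low Read pool was actually more concentrated than 1 nM, its true share of reads would be higher than the volume share printed here.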

RNaseP Check of qPCR machine recalibration:

Recalibration looks good. The 5,000 copy samples all reported back as close to 5,000 copies and the 10,000 copy samples all reported back as roughly 10,000 copies with no real outliers. A snapshot is below:

Sample Name	Detector	Task	Ct	StdDev Ct	Qty	Mean Qty	StdDev Qty
5K	RNase P	Unknown	27.80	0.052	5170.54	4935.10	168.205
5K	RNase P	Unknown	27.80	0.052	5164.36	4935.10	168.205
10K	RNase P	Unknown	26.77	0.058	10193.14	9923.36	380.987
10K	RNase P	Unknown	26.82	0.058	9863.28	9923.36	380.987
NTC	RNase P	NTC	Undetermined
NTC	RNase P	NTC	Undetermined
NTC	RNase P	NTC	Undetermined
NTC	RNase P	NTC	Undetermined
Standard1	RNase P	Standard	29.94	0.039	1250.00
Standard1	RNase P	Standard	29.92	0.039	1250.00
Standard1	RNase P	Standard	30.00	0.039	1250.00
Standard1	RNase P	Standard	30.00	0.039	1250.00
Standard2	RNase P	Standard	28.92	0.059	2500.00
Standard2	RNase P	Standard	28.87	0.059	2500.00
Standard2	RNase P	Standard	28.82	0.059	2500.00
Standard2	RNase P	Standard	28.78	0.059	2500.00
Standard3	RNase P	Standard	27.81	0.034	5000.00
Standard3	RNase P	Standard	27.82	0.034	5000.00
Standard3	RNase P	Standard	27.86	0.034	5000.00
Standard3	RNase P	Standard	27.88	0.034	5000.00
Standard4	RNase P	Standard	26.91	0.013	10000.00
Standard4	RNase P	Standard	26.93	0.013	10000.00
Standard4	RNase P	Standard	26.90	0.013	10000.00
Standard4	RNase P	Standard	26.89	0.013	10000.00
Standard5	RNase P	Standard	25.76	0.062	20000.00
Standard5	RNase P	Standard	25.69	0.062	20000.00
Standard5	RNase P	Standard	25.67	0.062	20000.00
Standard5	RNase P	Standard	25.61	0.062	20000.00
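
To put a number on "close to", the sketch below computes the percent deviation of each reported quantity from its nominal copy number. The quantities are the four Unknown rows from the snapshot; the nominal values (5000 and 10000) are an assumption read off the sample names, and `pct_dev` is a hypothetical helper name.

```shell
# Percent deviation of a reported quantity from its nominal copy number.
pct_dev() {
  awk -v qty="$1" -v nominal="$2" 'BEGIN { printf "%.1f\n", 100 * (qty - nominal) / nominal }'
}

# Each line: reported Qty, then the nominal value assumed from the sample name.
while read -r qty nominal; do
  echo "$qty vs $nominal: $(pct_dev "$qty" "$nominal")%"
done <<'EOF'
5170.54 5000
5164.36 5000
10193.14 10000
9863.28 10000
EOF
```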

Assign reads and OTUs to samples:

/project/microbiome/data_queue/seq/LowReadII/rawdata

salloc --account=microbiome -t 0-06:00

mkdir -p /gscratch/grandol1/loc_ad2/rawdata

cd /gscratch/grandol1/loc_ad2/rawdata

unpigz --to-stdout /project/microbiome/data_queue/seq/loc_ad2/rawdata/LRII-RMJAN22_S1_L001_R1_001.fastq.gz | split -l 1000000 -d --suffix-length=3 --additional-suffix=.fastq - LowReadII_R1_ ;unpigz --to-stdout /project/microbiome/data_queue/seq/loc_ad2/rawdata/LRII-RMJAN22_S1_L001_R2_001.fastq.gz | split -l 1000000 -d --suffix-length=3 --additional-suffix=.fastq - LowReadII_R2_

/project/microbiome/data_queue/seq/loc_ad2/rawdata/run_parse_count_onSplitInput.pl

cd /project/microbiome/data_queue/seq/loc_ad2/rawdata

./run_splitFastq_fwd.sh

./run_splitFastq_rev.sh

cd /project/microbiome/data_queue/seq/loc_ad2/rawdata

./run_aggregate.sh

Exploration of Data created so far:

From my understanding, the unpigz/split step above simply splits the raw data into equal-sized files, so the total number of reads should remain constant.
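
A toy check of that assumption, using made-up reads rather than the real data: split a small FASTQ file with the same style of `split` options and confirm the total line count is unchanged. This assumes GNU coreutils `split` (needed for `--additional-suffix`); `split_preserves_lines` and the `toy_R1_` prefix are hypothetical names.

```shell
# Returns 0 if splitting a toy FASTQ file leaves the total line count unchanged.
split_preserves_lines() {
  workdir=$(mktemp -d)
  # Three fake 4-line FASTQ records (12 lines in total).
  for i in 1 2 3; do
    printf '@read%s\nACGT\n+\nIIII\n' "$i"
  done > "$workdir/toy.fastq"
  before=$(wc -l < "$workdir/toy.fastq")
  # The chunk size must be a multiple of 4 so no record is cut in half.
  split -l 8 -d --suffix-length=3 --additional-suffix=.fastq \
    "$workdir/toy.fastq" "$workdir/toy_R1_"
  after=$(cat "$workdir"/toy_R1_*.fastq | wc -l)
  rm -r "$workdir"
  [ "$before" -eq "$after" ]
}

split_preserves_lines && echo "line count preserved"
```

The chunk size of 1000000 used on the real data is also a multiple of 4, so each split file should hold whole FASTQ records.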

cd /gscratch/grandol1/loc_ad2/rawdata

wc -l LowRead*

This should return 8x the number of paired-end reads (2x for R1 and R2, and 4x for the four lines per FASTQ record).

This returns: 21045424 total

Divided by 8: 2630678
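
The same conversion as a small reusable sketch. `pairs_from_lines` is a hypothetical helper; it also rejects totals that are not a multiple of 8, since such a total cannot come from complete, matching R1 and R2 files.

```shell
# Convert a total `wc -l` line count over R1 + R2 FASTQ files into read pairs.
# Returns non-zero if the count is not a multiple of 8.
pairs_from_lines() {
  if [ $(( $1 % 8 )) -ne 0 ]; then
    echo "line count $1 is not a multiple of 8" >&2
    return 1
  fi
  echo $(( $1 / 8 ))
}

pairs_from_lines 21045424   # prints 2630678
```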

The run_parse_count_onSplitInput.pl step then reads through all of the split files and assigns each read to a sample (parsed), to PhiX or non-target (phixOther), or to a MID error (truemiderrors). The reads assigned to these three categories should add up to the total above.

wc -l parsed*

Returns: 15371232

Divided by 8: 1921404 assigned to samples.

Assigned/Total (*100) = percent assigned: ~73%

The target for samples was 83%, so we are about 10 percentage points short (12% below target in relative terms).
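
The arithmetic can be recomputed from the two read counts. This is a minimal sketch; `pct_assigned` and `shortfall` are hypothetical helper names, and the 83% target is taken from the note above.

```shell
# Percent of all read pairs that were assigned to samples.
pct_assigned() {
  awk -v assigned="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * assigned / total }'
}

# Gap to the target, in percentage points and relative to the target.
shortfall() {
  awk -v observed="$1" -v target="$2" 'BEGIN {
    printf "%.1f points, %.1f%% relative\n", target - observed, 100 * (target - observed) / target
  }'
}

observed=$(pct_assigned 1921404 2630678)
echo "assigned: ${observed}%"
shortfall "$observed" 83
```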

Things get more confusing with the phixOther and truemiderror files, because they appear to be neither true FASTQ nor FASTA files, so I do not know how to count the reads in them.

Blasting random lines from phixOther returns a mix of phiX and ‘uncultured bacterium 16S’. I see no way of disentangling this.

So, let us explore the results of the run_splitFastq_fwd.sh and run_splitFastq_rev.sh steps. These should be found in /project/microbiome/data_queue/seq/loc_ad2/rawdata/sample_fastq/16S/locad2

and

/project/microbiome/data_queue/seq/loc_ad2/rawdata/sample_fastq/16S/LRII

For locad2:

wc -l locad2*

Returns: 4071280

The file formats appear the same as the “parsed*” files above.

Divided by 8: 508910

For LRII:

wc -l LRII*

Returns: 11299952

Divided by 8: 1412494

LRII + locad2: 1921404

Even if all the unassigned reads are from locad2, this does not recover the expected ratio of 2 locad2 : 1 LRII.

[508910+(2630678-1921404)] = 1218184 total possible locad2 reads
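
For comparison with the expected 2:1, the sketch below turns these counts into locad2:LRII ratios, both as observed and under the best case where every unassigned read is credited to locad2. `ratio` is a hypothetical helper name.

```shell
# Ratio of two read counts, to two decimal places.
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

locad2=508910
lrii=1412494
unassigned=$((2630678 - 1921404))   # total read pairs minus assigned read pairs

echo "observed locad2:LRII    = $(ratio "$locad2" "$lrii"):1"
echo "upper bound locad2:LRII = $(ratio "$((locad2 + unassigned))" "$lrii"):1"
echo "expected from volumes   = 2.00:1"
```

Both ratios come out below 1:1, so locad2 is under-represented relative to LRII even in the best case.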

~~cd /project/microbiome/data_queue/seq/loc_ad2/tfmergedreads~~

~~./run_slurm_mergereads.pl~~

~~cd /project/microbiome/data_queue/seq/LowReadII/otu~~

~~./run_slurm_mkotu.pl~~

Assign taxonomy

salloc --account=microbiome -t 0-02:00 --mem=500G

module load swset/2018.05  gcc/7.3.0

module load vsearch/2.15.1

vsearch --sintax zotus.fa --db /project/microbiome/users/grandol1/ref_db/gg_16s_13.5.fa -tabbedout LRII.sintax -sintax_cutoff 0.8

~~Output:~~

~~Reading file /project/microbiome/users/grandol1/ref_db/gg_16s_13.5.fa 100%~~

~~1769520677 nt in 1262986 seqs, min 1111, max 2368, avg 1401~~

~~Counting k-mers 100%~~

~~Creating k-mer index 100%~~

~~Classifying sequences 100%~~

~~Classified 4038 of 4042 sequences (99.90%)~~

Convert the SINTAX output into a CSV (this needs gawk, because the three-argument form of match() is a gawk extension):

awk -F "\t" '{OFS=","} NR==1 {print "OTU_ID","SEQS","SIZE","DOMAIN","KINGDOM","PHYLUM","CLASS","ORDER","FAMILY","GENUS","SPECIES"} {gsub(";", ","); gsub("centroid=", ""); gsub("seqs=", ""); gsub("size=", ""); match($4, /d:[^,]+/, d); match($4, /k:[^,]+/, k); match($4, /p:[^,]+/, p); match($4, /c:[^,]+/, c); match($4, /o:[^,]+/, o); match($4, /f:[^,]+/, f); match($4, /g:[^,]+/, g); match($4, /s:[^,]+/, s); print $1, d[0]=="" ? "NA" : d[0], k[0]=="" ? "NA" : k[0], p[0]=="" ? "NA" : p[0], c[0]=="" ? "NA" : c[0], o[0]=="" ? "NA" : o[0], f[0]=="" ? "NA" : f[0], g[0]=="" ? "NA" : g[0], s[0]=="" ? "NA" : s[0] }' LRII.sintax > LRIItaxonomy.csv
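
For machines without gawk, the following is a minimal portable sketch of the same idea, not a drop-in replacement for the command above. It assumes the usual four-column SINTAX tabbed output (label, predictions with confidences, strand, predictions filtered by the cutoff) and, unlike the command above, leaves the OTU label untouched instead of splitting out seqs and size. `sintax_to_csv` is a hypothetical helper name and the input line is made up for illustration.

```shell
# Turn SINTAX tabbed output (read from stdin) into a CSV with one column per rank.
# Ranks missing from column 4 (the cutoff-filtered taxonomy) are written as NA.
sintax_to_csv() {
  awk -F '\t' '
    BEGIN {
      n = split("d k p c o f g s", ranks, " ")
      print "OTU_ID,DOMAIN,KINGDOM,PHYLUM,CLASS,ORDER,FAMILY,GENUS,SPECIES"
    }
    {
      # Start every record with all ranks unset.
      for (i = 1; i <= n; i++) tax[ranks[i]] = "NA"
      # Column 4 looks like "d:Bacteria,p:Firmicutes"; key each entry by its rank letter.
      m = split($4, parts, ",")
      for (j = 1; j <= m; j++) {
        rank = substr(parts[j], 1, 1)
        if (substr(parts[j], 2, 1) == ":" && (rank in tax)) tax[rank] = parts[j]
      }
      line = $1
      for (i = 1; i <= n; i++) line = line "," tax[ranks[i]]
      print line
    }'
}

# Made-up example record: class falls below the cutoff, so it is absent from column 4.
printf 'Zotu1\td:Bacteria(1.00),p:Firmicutes(0.95),c:Bacilli(0.60)\t+\td:Bacteria,p:Firmicutes\n' |
  sintax_to_csv
```

On the real data this would be run as `sintax_to_csv < LRII.sintax`.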