LRII.6 Bioinformatics
Assign reads and OTUs to samples:
Assign Reads:
/project/microbiome/data_queue/seq/LRII_LocAd2_3_16_22/rawdata
salloc --account=microbiome -t 0-06:00
mkdir -p /gscratch/grandol1/LRII_LocAd2_3_16_22/rawdata
cd /gscratch/grandol1/LRII_LocAd2_3_16_22/rawdata
unpigz --to-stdout /project/microbiome/data_queue/seq/LRII_LocAd2_3_16_22/rawdata/LRII-5-RMJan22_S1_L001_R1_001.fastq.gz | split -l 1000000 -d --suffix-length=3 --additional-suffix=.fastq - LRII_LocAd2_3_16_22_R1_
unpigz --to-stdout /project/microbiome/data_queue/seq/LRII_LocAd2_3_16_22/rawdata/LRII-5-RMJan22_S1_L001_R2_001.fastq.gz | split -l 1000000 -d --suffix-length=3 --additional-suffix=.fastq - LRII_LocAd2_3_16_22_R2_
/project/microbiome/data_queue/seq/LRII_LocAd2_3_16_22/rawdata/run_parse_count_onSplitInput.pl
cd /project/microbiome/data_queue/seq/LRII_LocAd2_3_16_22/rawdata
./run_splitFastq_fwd.sh
./run_splitFastq_rev.sh
cd /project/microbiome/data_queue/seq/LRII_LocAd2_3_16_22/rawdata
./run_aggregate.sh
Process through to OTUs:
salloc --account=microbiome -t 0-06:00
cd /project/microbiome/data_queue/seq/LRII_LocAd2_3_16_22/tfmergedreads
./run_slurm_mergereads.pl
cd /project/microbiome/data_queue/seq/LRII_LocAd2_3_16_22/otu
./run_slurm_mkotu.pl
Exploration of Data created so far:
As I understand it, Line 9 above (the unpigz/split command) simply splits the raw data into equal-sized files, so the total number of reads should remain unchanged.
cd /gscratch/grandol1/LRII_LocAd2_3_16_22/rawdata
wc -l LRII*
This should return 8x the number of paired-end reads: 2x because R1 and R2 are counted separately, and 4x because each FASTQ record spans four lines.
This returns: 33809712 total
Divided by 8: 4226214
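The count above can be wrapped in a small helper. This is only a sketch: the function name count_read_pairs is hypothetical, and it assumes every file passed in is a plain four-line-per-record FASTQ and that both the R1 and R2 files are included.

```shell
# Count read pairs across split FASTQ files.
# Assumes 4 lines per record and that both R1 and R2 files are passed in,
# so total lines / 8 = number of read pairs.
count_read_pairs() {
    # cat the files together so wc prints a single number, not a per-file list
    total_lines=$(cat "$@" | wc -l | tr -d ' ')
    if [ $(( total_lines % 8 )) -ne 0 ]; then
        echo "warning: $total_lines lines is not a multiple of 8" >&2
    fi
    echo $(( total_lines / 8 ))
}

# Example (run from the rawdata directory on /gscratch):
#   count_read_pairs LRII*
```

The warning is there because a line total that is not a multiple of 8 would suggest a truncated file or unequal R1 and R2 counts.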
Line 11 then reads through all of the split files and assigns each read to a sample (parsed), to PhiX or non-target (phixOther), or to a MID error (truemiderrors). The read counts across these three categories should add up to the total above.
wc -l parsed*
Returns: 25644064 total
Divided by 8: 3205508 assigned to samples.
Percent assigned = assigned / total x 100 = 3205508 / 4226214 x 100 = ~76%
The target for samples was 80%.
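The percentage can be computed directly from the two wc -l totals, since the factor of 8 cancels. A minimal sketch, with percent_assigned as a hypothetical helper name:

```shell
# Percent of reads assigned to samples, from two wc -l line totals.
# $1 = line total of the parsed* files
# $2 = line total of all split raw files
percent_assigned() {
    LC_ALL=C awk -v a="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * a / t }'
}

percent_assigned 25644064 33809712   # prints 75.8
```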
Things get more confusing with the phixOther and truemiderrors files, because they appear to be neither true FASTQ nor FASTA files, so I do not know how to count the reads in them.
BLASTing random lines from phixOther returns a mix of PhiX and ‘uncultured bacterium 16S’ hits. I see no way of disentangling these.
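One way to get a rough handle on these files without knowing their exact layout is to tally which characters start each line, and to count the lines that look like bare nucleotide sequence. The sketch below is an assumption-laden starting point, not a description of what the parsing script writes: both function names are hypothetical, and the sequence count only approximates the read count if each sequence sits on a single line.

```shell
# Tally lines by their first character, to see which record markers occur
# ('@' for FASTQ headers, '>' for FASTA headers, '+' for FASTQ separators).
first_char_tally() {
    awk '{ counts[substr($0, 1, 1)]++ }
         END { for (ch in counts) print ch, counts[ch] }' "$1" | sort
}

# Count lines made up only of nucleotide codes. If each sequence occupies
# exactly one line, this approximates the number of reads in the file.
count_sequence_lines() {
    grep -c -E '^[ACGTNacgtn]+$' "$1"
}

# Example (substitute an actual phixOther or truemiderrors file name):
#   first_char_tally some_phixOther_file
#   count_sequence_lines some_phixOther_file
```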
So, let us explore the results of lines 13 to 17. These should be found in:
/project/microbiome/data_queue/seq/loc_ad2/rawdata/sample_fastq/16S/locad2
and
/project/microbiome/data_queue/seq/loc_ad2/rawdata/sample_fastq/16S/LRII
For locad2:
wc -l locad2*
Returns: 18304944 total
The file format appears to be the same as for the “parsed*” files above, so the same division by 8 applies.
Divided by 8: 2288118
For LRII:
wc -l LRII*
Returns: 7339120 total
Divided by 8: 917390
LRII + locad2:
2288118 + 917390 = 3205508
Same as the parsed read count above.
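The same consistency check can be scripted so it is easy to repeat after a rerun. A minimal sketch, with check_sum as a hypothetical helper that takes the expected total followed by the per-project counts:

```shell
# Check that per-project read counts add up to an expected total.
# $1 = expected total; remaining arguments = counts to sum.
check_sum() {
    expected=$1
    shift
    sum=0
    for n in "$@"; do
        sum=$(( $sum + $n ))
    done
    if [ "$sum" -eq "$expected" ]; then
        echo "OK: $sum"
    else
        echo "MISMATCH: got $sum, expected $expected"
        return 1
    fi
}

check_sum 3205508 2288118 917390   # prints OK: 3205508
```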