Bioinformatics for loc ad1
Step 0. Demultiplex
module load swset/2018.05
module load gcc/7.3.0
module load usearch/10.0.240
module load perl-text-levenshtein-xs
perl //project/microbiome/data/seq/cu_24feb21novaseq4/rawdata/parse_count.pl loc_ad1_Demux.csv 5RM1-16S_S1_L001_R1_001.fastq 5RM1-16S_S1_L001_R2_001.fastq FS10000611
perl //project/microbiome/data/seq/cu_24feb21novaseq4/rawdata/parse_count.pl ../loc_ad1_Demux.csv 5RM1-16S-TAKE2_S1_L001_R1_001.fastq 5RM1-16S-TAKE2_S1_L001_R2_001.fastq FS10000611
Step 1. load modules
module load cutadapt/2.10 usearch/10.0.240 vsearch/2.15.1
The only other software that will be used is fastp, which is located here: /project/microbiome/bin/fastp
Step 2. Remove primers using cutadapt and clean up headers
Remove forward 16s primer
cutadapt -g ^GTGYCAGCMGCCGCGGTAA -o trimmed_5RM1-16S_S1_L001_R1_001.fastq parsed_5RM1-16S_S1_L001_R1_001.fastq -e 0.25
Remove forward 16s primer
cutadapt -g ^GTGYCAGCMGCCGCGGTAA -o trimmed_5RM1-16S-TAKE2_S1_L001_R1_001.fastq parsed_5RM1-16S-TAKE2_S1_L001_R1_001.fastq -e 0.25
Remove reverse 16s primer
cutadapt -g ^GGACTACHVGGGTWTCTAAT -o trimmed_5RM1-16S_S1_L001_R2_001.fastq parsed_5RM1-16S_S1_L001_R2_001.fastq -e 0.25
Remove reverse 16s primer
cutadapt -g ^GGACTACHVGGGTWTCTAAT -o trimmed_5RM1-16S-TAKE2_S1_L001_R2_001.fastq parsed_5RM1-16S-TAKE2_S1_L001_R2_001.fastq -e 0.25
Clean up the headers of all files to remove problematic characters.
sed 's/\s/_/g' trimmed_5RM1-16S_S1_L001_R1_001.fastq | sed -e 's/^@16/@rna16/' - | sed -e 's/-/_/g' > cleaned_trimmed_5RM1-16S_S1_L001_R1_001.fastq
sed 's/\s/_/g' trimmed_5RM1-16S_S1_L001_R2_001.fastq | sed -e 's/^@16/@rna16/' - | sed -e 's/-/_/g' > cleaned_trimmed_5RM1-16S_S1_L001_R2_001.fastq
Clean up the headers of all files to remove problematic characters.
sed 's/\s/_/g' trimmed_5RM1-16S-TAKE2_S1_L001_R1_001.fastq | sed -e 's/^@16/@rna16/' - | sed -e 's/-/_/g' > cleaned_trimmed_5RM1-16S-TAKE2_S1_L001_R1_001.fastq
sed 's/\s/_/g' trimmed_5RM1-16S-TAKE2_S1_L001_R2_001.fastq | sed -e 's/^@16/@rna16/' - | sed -e 's/-/_/g' > cleaned_trimmed_5RM1-16S-TAKE2_S1_L001_R2_001.fastq
Step 3. Make coligo table
Note, I changed my mind and think it makes sense to include coligo processing in any comprehensive bash script, so long as we continue to use coligos.
The input to this call to awk is the output from step 2 (primers removed and headers cleaned up with sed).
awk '{if ($1 ~ /@/) {print}else{print substr ($0, 0, 13)}}' cleaned_trimmed_5RM1-16S_S1_L001_R1_001.fastq > coligo_5RM1-16S_S1_L001_R1_001.fastq
vsearch --fastq_filter coligo_5RM1-16S_S1_L001_R1_001.fastq --fastaout filter_coligo_5RM1-16S_S1_L001_R1_001.fasta
Output of the following command is our coligo table. It is left to the client to look at the coligo table, we don’t process it further here.
vsearch -search_exact filter_coligo_5RM1-16S_S1_L001_R1_001.fasta -db /project/microbiome/ref_db/coligos_and_abbreviatedISD.fa -strand plus -otutabout final_coligo_5RM1-16S_S1_L001_R1_001.fasta -minseqlen 5
The input to this call to awk is the output from step 2 (primers removed and headers cleaned up with sed).
awk '{if ($1 ~ /@/) {print}else{print substr ($0, 0, 13)}}' cleaned_trimmed_5RM1-16S-TAKE2_S1_L001_R1_001.fastq > coligo_5RM1-16S-TAKE2_S1_L001_R1_001.fastq
vsearch --fastq_filter coligo_5RM1-16S-TAKE2_S1_L001_R1_001.fastq --fastaout filter_coligo_5RM1-16S-TAKE2_S1_L001_R1_001.fasta
Output of the following command is our coligo table. It is left to the client to look at the coligo table, we don’t process it further here.
vsearch -search_exact filter_coligo_5RM1-16S-TAKE2_S1_L001_R1_001.fasta -db /project/microbiome/ref_db/coligos_and_abbreviatedISD.fa -strand plus -otutabout final_coligo_5RM1-16S-TAKE2_S1_L001_R1_001.fasta -minseqlen 5
Step 4. Remove low complexity reads (this gets rid of coligos)
usearch -filter_lowc cleaned_trimmed_5RM1-16S_S1_L001_R1_001.fastq -reverse cleaned_trimmed_5RM1-16S_S1_L001_R2_001.fastq -output NoLow_5RM1-16S_S1_L001_R1_001.fastq -output2 NoLow_5RM1-16S_S1_L001_R2_001.fastq
And Again:
usearch -filter_lowc cleaned_trimmed_5RM1-16S-TAKE2_S1_L001_R1_001.fastq -reverse cleaned_trimmed_5RM1-16S-TAKE2_S1_L001_R2_001.fastq -output NoLow_5RM1-16S-TAKE2_S1_L001_R1_001.fastq -output2 NoLow_5RM1-16S-TAKE2_S1_L001_R2_001.fastq
There doesn’t appear to be a similar command in vsearch.
Step 5. Merge reads (note we are saving both merged and unmerged reads so there are multiple outputs to this command)
vsearch --fastq_mergepairs NoLow_5RM1-16S_S1_L001_R1_001.fastq --reverse NoLow_5RM1-16S_S1_L001_R2_001.fastq --fastqout Merged_5RM1-16S_S1_L001_R1_001.fastq --fastq_maxdiffs 12 --fastq_allowmergestagger --fastq_minovlen 10 --fastq_minmergelen 60 --fastqout_notmerged_fwd UnMerged_5RM1-16S_S1_L001_R1_001.fastq --fastqout_notmerged_rev UnMerged_5RM1-16S_S1_L001_R2_001.fastq
And Again:
vsearch --fastq_mergepairs NoLow_5RM1-16S-TAKE2_S1_L001_R1_001.fastq --reverse NoLow_5RM1-16S-TAKE2_S1_L001_R2_001.fastq --fastqout Merged_5RM1-16S-TAKE2_S1_L001_R1_001.fastq --fastq_maxdiffs 12 --fastq_allowmergestagger --fastq_minovlen 10 --fastq_minmergelen 60 --fastqout_notmerged_fwd UnMerged_5RM1-16S-TAKE2_S1_L001_R1_001.fastq --fastqout_notmerged_rev UnMerged_5RM1-16S-TAKE2_S1_L001_R2_001.fastq
Step 6. Filter reads
Filter merged reads using:
vsearch -fastq_filter Merged_5RM1-16S_S1_L001_R1_001.fastq -fastq_maxee 1 -fastaout filteredMerged_5RM1-16S_S1_L001_R1_001.fastq
Filter merged reads using:
vsearch -fastq_filter Merged_5RM1-16S-TAKE2_S1_L001_R1_001.fastq -fastq_maxee 1 -fastaout filteredMerged_5RM1-16S-TAKE2_S1_L001_R1_001.fastq
Filter reads that didn’t merge using:
/project/microbiome/bin/fastp --in1 UnMerged_5RM1-16S_S1_L001_R1_001.fastq --in2 UnMerged_5RM1-16S_S1_L001_R2_001.fastq -q 15 -u 40 -l 107 --out1 FilteredUnMerged_5RM1-16S_S1_L001_R1_001.fastq --out2 FilteredUnMerged_5RM1-16S_S1_L001_R2_001.fastq
Filter reads that didn’t merge using:
/project/microbiome/bin/fastp --in1 UnMerged_5RM1-16S-TAKE2_S1_L001_R1_001.fastq --in2 UnMerged_5RM1-16S-TAKE2_S1_L001_R2_001.fastq -q 15 -u 40 -l 107 --out1 FilteredUnMerged_5RM1-16S-TAKE2_S1_L001_R1_001.fastq --out2 FilteredUnMerged_5RM1-16S-TAKE2_S1_L001_R2_001.fastq
Step 7. Truncate and join reads that didn’t merge
vsearch -fastx_filter FilteredUnMerged_5RM1-16S_S1_L001_R1_001.fastq --fastq_trunclen 215 -fastqout TruncFilteredUnMerged_5RM1-16S_S1_L001_R1_001.fastq
vsearch -fastx_filter FilteredUnMerged_5RM1-16S_S1_L001_R2_001.fastq --fastq_trunclen 215 -fastqout TruncFilteredUnMerged_5RM1-16S_S1_L001_R2_001.fastq
And Again
vsearch -fastx_filter FilteredUnMerged_5RM1-16S-TAKE2_S1_L001_R1_001.fastq --fastq_trunclen 215 -fastqout TruncFilteredUnMerged_5RM1-16S-TAKE2_S1_L001_R1_001.fastq
vsearch -fastx_filter FilteredUnMerged_5RM1-16S-TAKE2_S1_L001_R2_001.fastq --fastq_trunclen 215 -fastqout TruncFilteredUnMerged_5RM1-16S-TAKE2_S1_L001_R2_001.fastq
Step 8. Dereplicate reads
I think it makes sense to cat both merged (output from step 5 above) and joined reads (output from step 7 above) and pipe them into this command. Otherwise the command must be run twice, once on the merged reads then again on the joined reads and then the output of both of those commands should be concatenated.
vsearch -derep_fulllength Merged_5RM1-16S_S1_L001_R1_001.fastq --output DeRepMerged_5RM1-16S_S1_L001_R1_001.fastq --sizeout --sizein
And Again:
vsearch -derep_fulllength Merged_5RM1-16S-TAKE2_S1_L001_R1_001.fastq --output DeRepMerged_5RM1-16S-TAKE2_S1_L001_R1_001.fastq --sizeout --sizein
Step 9. Make OTUs and remove chimeras
vsearch --cluster_unoise DeRepMerged_5RM1-16S_S1_L001_R1_001.fastq --centroids DenoisedDeRepMerged_5RM1-16S_S1_L001_R1_001.fastq --minsize 8 --relabel 'otu' --sizein --sizeout
vsearch --uchime3_denovo DenoisedDeRepMerged_5RM1-16S_S1_L001_R1_001.fastq --nonchimeras NoChiDenoisedDeRepMerged_5RM1-16S_S1_L001_R1_001.fastq
And Again:
vsearch --cluster_unoise DeRepMerged_5RM1-16S-TAKE2_S1_L001_R1_001.fastq --centroids DenoisedDeRepMerged_5RM1-16S-TAKE2_S1_L001_R1_001.fastq --minsize 8 --relabel 'otu' --sizein --sizeout
vsearch --uchime3_denovo DenoisedDeRepMerged_5RM1-16S-TAKE2_S1_L001_R1_001.fastq --nonchimeras NoChiDenoisedDeRepMerged_5RM1-16S-TAKE2_S1_L001_R1_001.fastq
Step 10. Make OTU table
The input to this is concatenated merged and joined reads (reads should be filtered and in fasta format as output from steps 7 and 5 above).
The “-db” argument below stands for database and should be the fasta sequence for the OTUs/ESVs (from step 9) also note that the OUTPUT here is in single quotes.
vsearch --usearch_global DeRepMerged_5RM1-16S_S1_L001_R1_001.fastq --db NoChiDenoisedDeRepMerged_5RM1-16S_S1_L001_R1_001.fastq --otutabout 'OTU_5RM1-16S_S1_L001_R1_001.fastq' --id 0.99
And Again:
vsearch --usearch_global DeRepMerged_5RM1-16S-TAKE2_S1_L001_R1_001.fastq --db NoChiDenoisedDeRepMerged_5RM1-16S-TAKE2_S1_L001_R1_001.fastq --otutabout 'OTU_5RM1-16S-TAKE2_S1_L001_R1_001.fastq' --id 0.99
Remove the pesky “#OTU ID” that vsearch puts in the first cell of the table:
sed -i 's/^#OTU ID/OTUID/' OTU_5RM1-16S_S1_L001_R1_001.fastq
And Again:
sed -i 's/^#OTU ID/OTUID/' OTU_5RM1-16S-TAKE2_S1_L001_R1_001.fastq