Bioinformatics for loc ad1

Step 0. Demultiplex

module load swset/2018.05

module load gcc/7.3.0

module load usearch/10.0.240

module load perl-text-levenshtein-xs

perl //project/microbiome/data/seq/cu_24feb21novaseq4/rawdata/parse_count.pl loc_ad1_Demux.csv 5RM1-16S_S1_L001_R1_001.fastq 5RM1-16S_S1_L001_R2_001.fastq FS10000611

perl //project/microbiome/data/seq/cu_24feb21novaseq4/rawdata/parse_count.pl ../loc_ad1_Demux.csv 5RM1-16S-TAKE2_S1_L001_R1_001.fastq 5RM1-16S-TAKE2_S1_L001_R2_001.fastq FS10000611

Step 1. load modules

module load cutadapt/2.10 usearch/10.0.240 vsearch/2.15.1

The only other software that will be used is fastp, which is located here: /project/microbiome/bin/fastp

Step 2. Remove primers using cutadapt and clean up headers

Remove forward 16s primer

cutadapt -g ^GTGYCAGCMGCCGCGGTAA -o trimmed_5RM1-16S_S1_L001_R1_001.fastq parsed_5RM1-16S_S1_L001_R1_001.fastq -e 0.25

Remove forward 16s primer

cutadapt -g ^GTGYCAGCMGCCGCGGTAA -o trimmed_5RM1-16S-TAKE2_S1_L001_R1_001.fastq parsed_5RM1-16S-TAKE2_S1_L001_R1_001.fastq -e 0.25

Remove reverse 16s primer

cutadapt -g ^GGACTACHVGGGTWTCTAAT -o trimmed_5RM1-16S_S1_L001_R2_001.fastq parsed_5RM1-16S_S1_L001_R2_001.fastq -e 0.25

Remove reverse 16s primer

cutadapt -g ^GGACTACHVGGGTWTCTAAT -o trimmed_5RM1-16S-TAKE2_S1_L001_R2_001.fastq parsed_5RM1-16S-TAKE2_S1_L001_R2_001.fastq -e 0.25

Clean up the headers of all files to remove problematic characters.

sed 's/\s/_/g' trimmed_5RM1-16S_S1_L001_R1_001.fastq | sed -e 's/^@16/@rna16/' - | sed -e 's/-/_/g' > cleaned_trimmed_5RM1-16S_S1_L001_R1_001.fastq

sed 's/\s/_/g' trimmed_5RM1-16S_S1_L001_R2_001.fastq | sed -e 's/^@16/@rna16/' - | sed -e 's/-/_/g' > cleaned_trimmed_5RM1-16S_S1_L001_R2_001.fastq

Clean up the headers of all files to remove problematic characters.

sed 's/\s/_/g' trimmed_5RM1-16S-TAKE2_S1_L001_R1_001.fastq | sed -e 's/^@16/@rna16/' - | sed -e 's/-/_/g' > cleaned_trimmed_5RM1-16S-TAKE2_S1_L001_R1_001.fastq

sed 's/\s/_/g' trimmed_5RM1-16S-TAKE2_S1_L001_R2_001.fastq | sed -e 's/^@16/@rna16/' - | sed -e 's/-/_/g' > cleaned_trimmed_5RM1-16S-TAKE2_S1_L001_R2_001.fastq

Step 3. Make coligo table

Note, I changed my mind and think it makes sense to include coligo processing in any comprehensive bash script, so long as we continue to use coligos.

The input to this call to awk is the output from step 2 (primers removed and headers cleaned up with sed).

awk '{if ($1 ~ /@/) {print}else{print substr ($0, 0, 13)}}' cleaned_trimmed_5RM1-16S_S1_L001_R1_001.fastq > coligo_5RM1-16S_S1_L001_R1_001.fastq

vsearch --fastq_filter coligo_5RM1-16S_S1_L001_R1_001.fastq --fastaout filter_coligo_5RM1-16S_S1_L001_R1_001.fasta

Output of the following command is our coligo table. It is left to the client to look at the coligo table, we don’t process it further here.

vsearch -search_exact filter_coligo_5RM1-16S_S1_L001_R1_001.fasta -db /project/microbiome/ref_db/coligos_and_abbreviatedISD.fa -strand plus -otutabout final_coligo_5RM1-16S_S1_L001_R1_001.fasta -minseqlen 5

The input to this call to awk is the output from step 2 (primers removed and headers cleaned up with sed).

awk '{if ($1 ~ /@/) {print}else{print substr ($0, 0, 13)}}' cleaned_trimmed_5RM1-16S-TAKE2_S1_L001_R1_001.fastq > coligo_5RM1-16S-TAKE2_S1_L001_R1_001.fastq

vsearch --fastq_filter coligo_5RM1-16S-TAKE2_S1_L001_R1_001.fastq --fastaout filter_coligo_5RM1-16S-TAKE2_S1_L001_R1_001.fasta

Output of the following command is our coligo table. It is left to the client to look at the coligo table, we don’t process it further here.

vsearch -search_exact filter_coligo_5RM1-16S-TAKE2_S1_L001_R1_001.fasta -db /project/microbiome/ref_db/coligos_and_abbreviatedISD.fa -strand plus -otutabout final_coligo_5RM1-16S-TAKE2_S1_L001_R1_001.fasta -minseqlen 5

Step 4. Remove low complexity reads (this gets rid of coligos)

usearch -filter_lowc cleaned_trimmed_5RM1-16S_S1_L001_R1_001.fastq -reverse cleaned_trimmed_5RM1-16S_S1_L001_R2_001.fastq -output NoLow_5RM1-16S_S1_L001_R1_001.fastq -output2 NoLow_5RM1-16S_S1_L001_R2_001.fastq

And Again:

usearch -filter_lowc cleaned_trimmed_5RM1-16S-TAKE2_S1_L001_R1_001.fastq -reverse cleaned_trimmed_5RM1-16S-TAKE2_S1_L001_R2_001.fastq -output NoLow_5RM1-16S-TAKE2_S1_L001_R1_001.fastq -output2 NoLow_5RM1-16S-TAKE2_S1_L001_R2_001.fastq

There doesn’t appear to be a similar command in vsearch.

Step 5. Merge reads (note we are saving both merged and unmerged reads so there are multiple outputs to this command)

vsearch --fastq_mergepairs NoLow_5RM1-16S_S1_L001_R1_001.fastq --reverse NoLow_5RM1-16S_S1_L001_R2_001.fastq --fastqout Merged_5RM1-16S_S1_L001_R1_001.fastq --fastq_maxdiffs 12 --fastq_allowmergestagger --fastq_minovlen 10 --fastq_minmergelen 60 --fastqout_notmerged_fwd UnMerged_5RM1-16S_S1_L001_R1_001.fastq --fastqout_notmerged_rev UnMerged_5RM1-16S_S1_L001_R2_001.fastq

And Again:

vsearch --fastq_mergepairs NoLow_5RM1-16S-TAKE2_S1_L001_R1_001.fastq --reverse NoLow_5RM1-16S-TAKE2_S1_L001_R2_001.fastq --fastqout Merged_5RM1-16S-TAKE2_S1_L001_R1_001.fastq --fastq_maxdiffs 12 --fastq_allowmergestagger --fastq_minovlen 10 --fastq_minmergelen 60 --fastqout_notmerged_fwd UnMerged_5RM1-16S-TAKE2_S1_L001_R1_001.fastq --fastqout_notmerged_rev UnMerged_5RM1-16S-TAKE2_S1_L001_R2_001.fastq

Step 6. Filter reads

Filter merged reads using:

vsearch -fastq_filter Merged_5RM1-16S_S1_L001_R1_001.fastq -fastq_maxee 1 -fastaout filteredMerged_5RM1-16S_S1_L001_R1_001.fastq

Filter merged reads using:

vsearch -fastq_filter Merged_5RM1-16S-TAKE2_S1_L001_R1_001.fastq -fastq_maxee 1 -fastaout filteredMerged_5RM1-16S-TAKE2_S1_L001_R1_001.fastq

Filter reads that didn’t merge using:

/project/microbiome/bin/fastp --in1 UnMerged_5RM1-16S_S1_L001_R1_001.fastq --in2 UnMerged_5RM1-16S_S1_L001_R2_001.fastq -q 15 -u 40 -l 107 --out1 FilteredUnMerged_5RM1-16S_S1_L001_R1_001.fastq --out2 FilteredUnMerged_5RM1-16S_S1_L001_R2_001.fastq

Filter reads that didn’t merge using:

/project/microbiome/bin/fastp --in1 UnMerged_5RM1-16S-TAKE2_S1_L001_R1_001.fastq --in2 UnMerged_5RM1-16S-TAKE2_S1_L001_R2_001.fastq -q 15 -u 40 -l 107 --out1 FilteredUnMerged_5RM1-16S-TAKE2_S1_L001_R1_001.fastq --out2 FilteredUnMerged_5RM1-16S-TAKE2_S1_L001_R2_001.fastq

Step 7. Truncate and join reads that didn’t merge

vsearch -fastx_filter FilteredUnMerged_5RM1-16S_S1_L001_R1_001.fastq --fastq_trunclen 215 -fastqout TruncFilteredUnMerged_5RM1-16S_S1_L001_R1_001.fastq

vsearch -fastx_filter FilteredUnMerged_5RM1-16S_S1_L001_R2_001.fastq --fastq_trunclen 215 -fastqout TruncFilteredUnMerged_5RM1-16S_S1_L001_R2_001.fastq

And Again

vsearch -fastx_filter FilteredUnMerged_5RM1-16S-TAKE2_S1_L001_R1_001.fastq --fastq_trunclen 215 -fastqout TruncFilteredUnMerged_5RM1-16S-TAKE2_S1_L001_R1_001.fastq

vsearch -fastx_filter FilteredUnMerged_5RM1-16S-TAKE2_S1_L001_R2_001.fastq --fastq_trunclen 215 -fastqout TruncFilteredUnMerged_5RM1-16S-TAKE2_S1_L001_R2_001.fastq

Step 8. Dereplicate reads

I think it makes sense to cat both merged (output from step 5 above) and joined reads (output from step 7 above) and pipe them into this command. Otherwise the command must be run twice, once on the merged reads then again on the joined reads and then the output of both of those commands should be concatenated.

vsearch -derep_fulllength Merged_5RM1-16S_S1_L001_R1_001.fastq --output DeRepMerged_5RM1-16S_S1_L001_R1_001.fastq --sizeout --sizein

And Again:

vsearch -derep_fulllength Merged_5RM1-16S-TAKE2_S1_L001_R1_001.fastq --output DeRepMerged_5RM1-16S-TAKE2_S1_L001_R1_001.fastq --sizeout --sizein

Step 9. Make OTUs and remove chimeras

vsearch --cluster_unoise DeRepMerged_5RM1-16S_S1_L001_R1_001.fastq --centroids DenoisedDeRepMerged_5RM1-16S_S1_L001_R1_001.fastq --minsize 8 --relabel 'otu' --sizein --sizeout

vsearch --uchime3_denovo DenoisedDeRepMerged_5RM1-16S_S1_L001_R1_001.fastq --nonchimeras NoChiDenoisedDeRepMerged_5RM1-16S_S1_L001_R1_001.fastq

And Again:

vsearch --cluster_unoise DeRepMerged_5RM1-16S-TAKE2_S1_L001_R1_001.fastq --centroids DenoisedDeRepMerged_5RM1-16S-TAKE2_S1_L001_R1_001.fastq --minsize 8 --relabel 'otu' --sizein --sizeout

vsearch --uchime3_denovo DenoisedDeRepMerged_5RM1-16S-TAKE2_S1_L001_R1_001.fastq --nonchimeras NoChiDenoisedDeRepMerged_5RM1-16S-TAKE2_S1_L001_R1_001.fastq

Step 10. Make OTU table

The input to this is concatenated merged and joined reads (reads should be filtered and in fasta format as output from steps 7 and 5 above).

The “-db” argument below stands for database and should be the fasta sequence for the OTUs/ESVs (from step 9) also note that the OUTPUT here is in single quotes.

vsearch --usearch_global DeRepMerged_5RM1-16S_S1_L001_R1_001.fastq --db NoChiDenoisedDeRepMerged_5RM1-16S_S1_L001_R1_001.fastq --otutabout 'OTU_5RM1-16S_S1_L001_R1_001.fastq' --id 0.99

And Again:

vsearch --usearch_global DeRepMerged_5RM1-16S-TAKE2_S1_L001_R1_001.fastq --db NoChiDenoisedDeRepMerged_5RM1-16S-TAKE2_S1_L001_R1_001.fastq --otutabout 'OTU_5RM1-16S-TAKE2_S1_L001_R1_001.fastq' --id 0.99

Remove the pesky “#OTU ID” that vsearch puts in the first cell of the table:

sed -i 's/^#OTU ID/OTUID/' OTU_5RM1-16S_S1_L001_R1_001.fastq
And Again:

sed -i 's/^#OTU ID/OTUID/' OTU_5RM1-16S-TAKE2_S1_L001_R1_001.fastq

Genome Technologies Laboratory

Bioinformatics for loc ad1

Related content