Info
(23 September 2020) A first pass through the bioinformatics for this library is complete and otutables are available for review (Paul Ayayee, Erin Bentley, John Calder, Joshua Harrison, Seifeddine Ben Tekaya). Alex Buerkle used the same bioinformatic steps as were used for runs 1A and 1B.

Status (1 October 2020)

  • Based on Gordon Custer and Joshua Harrison's work to understand why chimeras were not being found and screened out, Alex Buerkle made the suggested change to a setting in vsearch and started rerunning all otutable generation. This should be complete on 2 October and should result in ~2% fewer OTUs.

  • (20 March 2021)

    I repeated the trim, merge, join, and otutable generation steps, so that we have equivalent methods applied across Novaseq 1A, 1B, 1C, 3, and 4. I also redid otu table generation and coligo table construction.

    Table of Contents

    Demultiplexing and splitting

    ...

    In /project/microbiome/data/seq/psomagen_17sep20_novaseq2/tfmergedreads, we used run_slurm_mergereads.pl to crawl all of the project folders and sample files (created in the splitting step above) to merge or join read pairs and to filter based on base quality. We are now not trimming primer regions, even though we found that we could not readily eliminate low-frequency, potentially spurious OTUs downstream by masking with vsearch; it would not properly mask these regions [by marking them with lower-case bases] and then soft-mask them when we cluster. This step writes a new set of fasta files (rather than fastq) for each sample and project, to be used in subsequent steps. These files are found in the 16S/ and ITS/ folders in tfmergedreads/. Statistics on the initial number of reads, the number of reads that merged, and the number of reads that remain after filtering are in filtermergestats.csv in each project folder. For the full lane these summaries were concatenated in tfmergedreads/ with

    ...

    I used commands.R in that folder to make a plot of the number of reads per sample (horizontal axis) against the number of reads that were removed (vertical axis), either because they did not merge (this does not account for reads that were joined/concatenated because the amplicon region was so long that we didn’t have sufficient overlap for merging) or because they did not meet quality criteria and were filtered out. Purple is for 16S and orange is for ITS. It might be interesting to make that plot for each of the projects in the library (TODO), and possibly to add marginal histograms (or put quantiles on the plots).
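    As an illustration, the quantity plotted (the fraction of a sample's reads removed before the otutable step) can be computed directly from a concatenated filtermergestats.csv. This is a minimal sketch; the column names used here are assumptions, not necessarily the file's actual headers:

    ```python
    import csv, io

    # Sketch: per-sample fraction of reads removed (failed merge/join or quality
    # filter), from filtermergestats.csv. Column names are assumed, not verified.
    def removed_fraction(csv_text):
        frac = {}
        for row in csv.DictReader(io.StringIO(csv_text)):
            initial = int(row["initial"])
            # removed = reads that failed to merge/join plus reads filtered on quality
            frac[row["sample"]] = (initial - int(row["passed_filter"])) / initial
        return frac

    example = """sample,initial,merged,passed_filter
    S1,10000,9500,9000
    S2,8000,6400,6000
    """
    # Strip the indentation introduced by this document's formatting
    example = "\n".join(line.strip() for line in example.splitlines()) + "\n"

    print(removed_fraction(example))  # {'S1': 0.1, 'S2': 0.25}
    ```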

    ...

    Find unique sequences, cluster, and count OTUs

    We ran a single script for each library that generated otu tables for 16S and ITS, which can be found in the /project/microbiome/data/seq/psomagen_17sep20_novaseq2/otu/16S and /project/microbiome/data/seq/psomagen_17sep20_novaseq2/otu/ITS directories for all projects. The script to do this is /project/microbiome/data/seq/psomagen_17sep20_novaseq2/otu/run_slurm_mkotu.pl. This script crawls all of the projects (and loci) in the tfmerged/ folder, concatenates all input sequence files, and passes them through a UNIX pipe for analysis (rather than making another copy of all the sequence data). This was rerun with the addition of line 5 (--search_exact) and with --relabel 'otu' added to line 2. The key part of the script is the job that is created for each project:

    ...


    Code Block
        $job .= "cat $tfmdirectory/*tfmergedreads.fa $tfmdirectory/joined/*tfmergedreads.fa | vsearch --derep_fulllength - --threads 32 --output uniqueSequences.fa --sizeout;\n";
        $job .= "vsearch --cluster_unoise uniqueSequences.fa --relabel 'otu' --sizein --sizeout --consout zotus.fa --minsize 8;\n";
        $job .= "vsearch --uchime3_denovo zotus.fa --chimeras chimera_out.fa --nonchimeras zotus_nonchimeric.fa --threads 32;\n";
        $job .= "cat $tfmdirectory/*tfmergedreads.fa $tfmdirectory/joined/*tfmergedreads.fa | vsearch --usearch_global - --db zotus_nonchimeric.fa --otutabout - --id 0.99 --threads 32 | sed 's/^#OTU ID/OTUID/' > 'otutable';\n";
        $job .= "cat $tfmdirectory/*tfmergedreads.fa $tfmdirectory/joined/*tfmergedreads.fa | vsearch --search_exact - --db zotus_nonchimeric.fa --otutabout - --threads 32 | sed 's/^#OTU ID/OTUID/' > 'otutable.esv';\n";

    These steps can be repeated by individual users with different input data and vsearch options, as desired; please see below for instructions with examples. With the settings we used, each project took only a few hours.
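    For downstream work in a scripting language: the otutable written above is tab-separated, with OTU IDs in the first column (header OTUID after the sed fix) and one column of counts per sample. A minimal reader, assuming exactly that layout:

    ```python
    import csv, io

    # Sketch: load an otutable (tab-separated; first column OTUID, remaining
    # columns are per-sample counts) into a nested dict {otu: {sample: count}}.
    def read_otutable(text):
        reader = csv.reader(io.StringIO(text), delimiter="\t")
        header = next(reader)
        samples = header[1:]
        return {row[0]: dict(zip(samples, map(int, row[1:]))) for row in reader}

    example = "OTUID\tS1\tS2\notu1\t120\t0\notu2\t3\t45\n"
    table = read_otutable(example)
    print(table["otu1"]["S1"])  # 120
    ```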

    As noted above, we are not trimming primer regions, so there are potentially some spurious OTUs/ESVs introduced by sequence variation in the primer regions.
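    One common screen for such artifacts, shown here purely as an illustration and not as part of the pipeline above, is to drop OTUs whose total count across all samples falls below a threshold:

    ```python
    # Illustrative only: remove OTUs whose total count across samples is below a
    # minimum, a simple way to reduce low-frequency, potentially spurious OTUs.
    # The threshold value is arbitrary here and should be chosen per analysis.
    def drop_rare_otus(table, min_total=10):
        return {otu: counts for otu, counts in table.items()
                if sum(counts.values()) >= min_total}

    table = {"otu1": {"S1": 120, "S2": 0}, "otu2": {"S1": 3, "S2": 4}}
    print(drop_rare_otus(table))  # {'otu1': {'S1': 120, 'S2': 0}}
    ```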

    ...

    Make coligo table

    In /project/microbiome/data/seq/psomagen_17sep20_novaseq2/coligoISD, there are 16S and ITS directories for all projects. These contain a file named coligoISDtable.txt with counts of the coligos and the ISD found in the trimmed forward reads, per sample. The file run_slurm_mkcoligoISDtable.pl has the code that passes over all of the projects and uses vsearch for making the table.
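    Conceptually, the coligo/ISD tabulation counts, per sample, the trimmed forward reads that contain a known short sequence. The real table is produced with vsearch by the script above; the sequences and sample names in this sketch are made up:

    ```python
    # Conceptual sketch of the coligo/ISD tabulation: count reads per sample that
    # contain a target subsequence. The actual pipeline uses vsearch matching,
    # not Python substring search; reads and target here are invented.
    def count_matches(reads_by_sample, target):
        return {sample: sum(target in read for read in reads)
                for sample, reads in reads_by_sample.items()}

    reads = {"S1": ["ACGTTTGG", "TTGGCCAA"], "S2": ["ACGTACGT"]}
    print(count_matches(reads, "TTGG"))  # {'S1': 2, 'S2': 0}
    ```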

    ...

    These steps can be repeated by individual users with different input data and vsearch options, as desired. Please see below for instructions with examples.

    ...

    Next steps for individual users and projects

    Info

    The notes below have not been updated to parallel the steps taken above (analysis updated 20 March 2021). Some parts of the notes likely still apply and are correct. More up-to-date documentation and suggestions are in Bioinformatics v3.0.

    The next tasks each user will need to perform are: QC of the data; generation of taxonomic hypotheses for ESVs/OTUs; identification of the ESVs that correspond to the synthgene and the coligos, and sensible combination of the data for those ESVs; and finally statistical analysis. We will go through each of these tasks below and provide some guidelines. We also provide a few tips for those new to bioinformatics; experienced users will need to work together to help folks who are new to the game.

    ...