Bioinformatics for LarkGrouseTest1
Status (20 March 2021)
I repeated the trim, merge, join, and OTU table generation steps, so that equivalent methods have now been applied to NovaSeq runs 1A, 1B, 1C, 3, and 4. I also redid OTU table generation and coligo table construction.
- 1 Demultiplexing and splitting
- 1.1 Demultiplexing
- 1.2 Splitting
- 2 Calculate summary statistics on reads
- 3 Merge reads, filter, and (optionally) trim primer regions
- 4 Find unique sequences, cluster, and count OTUs
- 5 Make coligo table
- 6 Next steps for individual users and projects
- 6.1 Basic info for those who are new to bioinformatics
- 6.2 Best practices
- 6.3 Explanation of files in your project folder
- 6.4 Taxonomic hypothesis generation
- 6.5 Finding coligos and ISDs
- 6.6 Quality control
- 6.7 Rerunning otu table generation
- 6.8 Statistics
- 6.9 Further considerations
- 6.10 Bioinformatic wish list
Demultiplexing and splitting
The directory for the raw sequence data (gz compressed) and the parsed and split reads is /project/gtl/data/raw/LarkGrouseTest1/rawdata (corresponding to micro NovaSeq Run #2). Files for individual samples are in /project/gtl/data/raw/LarkGrouseTest1/rawdata/sample_fastq/.
Demultiplexing
The work is done by run_parse_count_onSplitInput.pl. As the name implies, we split the raw data into many files, so that the parsing can be done in parallel by many nodes. The approximate string matching that we are doing requires ~140 hours of CPU time, so we are splitting the task across ~220 jobs. By doing so, the parsing takes less than one hour.
Splitting the raw (uncompressed) data was accomplished with the program split, with 1,000,000 lines (250,000 reads, as each read occupies four lines of a fastq file) being written to each file (with a remainder in the final file). These files were written to /gscratch and, as intermediate files that can be reconstructed readily, will not be retained long-term.
mkdir -p /gscratch/grandol1/LarkGrouseTest1/rawdata
cd /gscratch/grandol1/LarkGrouseTest1/rawdata
unpigz --to-stdout /project/gtl/data/raw/LarkGrouseTest1/rawdata/LarkGrouseTest_S1_L001_R1_001.fastq.gz | split -l 1000000 -d --suffix-length=3 --additional-suffix=.fastq - LarkGrouseTest1_R1_ ;
unpigz --to-stdout /project/gtl/data/raw/LarkGrouseTest1/rawdata/LarkGrouseTest_S1_L001_R2_001.fastq.gz | split -l 1000000 -d --suffix-length=3 --additional-suffix=.fastq - LarkGrouseTest1_R2_
making 20 R1 files and 20 R2 files, with structured names (e.g., for the R1 set):
/gscratch/grandol1/LarkGrouseTest1/rawdata/LarkGrouseTest1_R1_000.fastq
/gscratch/grandol1/LarkGrouseTest1/rawdata/LarkGrouseTest1_R1_001.fastq
/gscratch/grandol1/LarkGrouseTest1/rawdata/LarkGrouseTest1_R1_002.fastq
/gscratch/grandol1/LarkGrouseTest1/rawdata/LarkGrouseTest1_R1_003.fastq etc.
run_parse_count_onSplitInput.pl also writes to /gscratch.
LarkGrouseTest1_Demux.csv was created using R and a MISO report. This file is used to map MIDs to sample names and projects.
Splitting
The info lines for each read in parsed_LarkGrouseTest1*_R1.fastq and parsed_LarkGrouseTest1*_R2.fastq have the locus, the forward MID, the reverse MID, and the sample name. These can be used with LarkGrouseTest1_Demux.csv to separate reads into loci, projects, and samples, in the folder sample_fastq/. The reads are in separate files for each sequenced sample, including replicates. The unique combination of forward and reverse MIDs (for a locus) is part of the filename and allows replicates to be distinguished and subsequently merged.
run_splitFastq_fwd.sh and run_splitFastq_rev.sh run splitFastq_manyInputfiles.pl, which steps through the many pairs of files to split reads by sample and project, and places them in /project/gtl/data/raw/LarkGrouseTest1/rawdata/sample_fastq/.
splitFastq.pl and splitFastq_manyInputfiles.pl will need tweaking in the future, until sample names and the format of the key for demultiplexing and metadata stabilize. The number of columns has differed among our three completed sequence lanes.
Calculate summary statistics on reads
In a sequence library’s rawdata/ directory (e.g., /project/gtl/data/raw/LarkGrouseTest1/rawdata) run:
module load swset/2018.05 gcc/7.3.0 usearch/10.0.240
aggregate_usearch_fastx_info.pl
Merge reads, filter, and (optionally) trim primer regions
In /project/gtl/data/raw/LarkGrouseTest1/tfmergedreads, we used run_slurm_mergereads.pl to crawl all of the project folders and sample files (created in the splitting step above) to merge or join read pairs, filter based on base quality, and trim primer regions from the reads. This writes a new set of fasta files (rather than fastq) for each sample and project, to be used in subsequent steps. These files are found in the 16S/ and ITS/ folders in tfmergedreads/. Statistics on the initial number of reads, the number of reads that merged, and the number of reads that remain after filtering are in filtermergestats.csv in each project folder. For the full lane these summaries were concatenated in tfmergedreads/ with
cat */*/filtermergestats.csv > all_filtermergestats.csv
I used commands.R in that folder to plot, for each sample, the number of reads (horizontal axis) against the number of reads that were removed (vertical axis), either because they did not merge (this does not account for reads that were joined/concatenated because the amplicon region was too long for sufficient overlap to merge) or because they did not meet quality criteria and were filtered out. Purple is for 16S and orange is for ITS. It might be interesting to make that plot for each of the projects in the library (TODO), and possibly to add marginal histograms (or put quantiles on the plots).
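As a quick command-line check of the same numbers, lane-wide totals can be computed directly from all_filtermergestats.csv. This is a minimal sketch, assuming the comma-separated field order described under “Explanation of files in your project folder” below (file name, initial reads, reads after merging, reads after filtering):
awk -F',' '{ initial += $2; merged += $3; filtered += $4 } END { printf "initial=%d  after_merge=%d  after_filter=%d\n", initial, merged, filtered }' all_filtermergestats.csv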
Find unique sequences, cluster, and count OTUs
We generated otu tables for 16S and ITS, which can be found in the /project/microbiome/data/seq/psomagen_17sep20_novaseq2/otu/16S and /project/microbiome/data/seq/psomagen_17sep20_novaseq2/otu/ITS directories for all projects. The script to do this is /project/microbiome/data/seq/psomagen_17sep20_novaseq2/otu/run_slurm_mkotu.pl.
This script crawls all of the projects (and loci) in the tfmergedreads/ folder, concatenates all input sequence files, and passes them through a UNIX pipe for analysis (rather than making another copy of all the sequence data). This was rerun on Apr 5, 2021 with the addition of line 5 (--search_exact) and with --relabel 'otu' added to line 2. The key part of the script is the job that is created for each project:
# Dereplicate all merged and joined reads for the project
$job .= "cat $tfmdirectory/*tfmergedreads.fa $tfmdirectory/joined/*tfmergedreads.fa | vsearch --derep_fulllength - --threads 32 --output uniqueSequences.fa --sizeout;\n";
# Denoise the unique sequences into ESVs (zOTUs)
$job .= "vsearch --cluster_unoise uniqueSequences.fa --relabel 'otu' --sizein --sizeout --consout zotus.fa --minsize 8 ;\n";
# Remove chimeric ESVs
$job .= "vsearch --uchime3_denovo zotus.fa --nonchimeras zotus_nonchimeric.fa --threads 32;\n";
# Map all reads back to the non-chimeric ESVs at 99% identity to build the otu table
$job .= "cat $tfmdirectory/*tfmergedreads.fa $tfmdirectory/joined/*tfmergedreads.fa | vsearch --usearch_global - --db zotus_nonchimeric.fa --otutabout - --id 0.99 --threads 32 | sed 's/^#OTU ID/OTUID/' > 'otutable';\n";
# Exact-match the reads to the ESVs to build the ESV table (double quotes so $tfmdirectory interpolates)
$job .= "cat $tfmdirectory/*tfmergedreads.fa $tfmdirectory/joined/*tfmergedreads.fa | vsearch --search_exact - --db zotus_nonchimeric.fa --otutabout - --id 0.99 --threads 32 | sed 's/^#OTU ID/OTUID/' > 'otutable.esv';\n";
Make coligo table
In /project/microbiome/data/seq/psomagen_17sep20_novaseq2/coligoISD, there are 16S and ITS directories for all projects. These contain a file named coligoISDtable.txt with counts of the coligos and the ISD found in the trimmed forward reads, per sample. The file run_slurm_mkcoligoISDtable.pl has the code that passes over all of the projects and uses vsearch for making the table.
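The exact options used are recorded in run_slurm_mkcoligoISDtable.pl. Purely as an illustrative sketch of the kind of vsearch call involved (the read file, reference file name, and identity threshold below are hypothetical placeholders, not the script’s actual parameters):
vsearch --usearch_global trimmed_forward_reads.fa --db coligo_and_ISD_reference.fa --otutabout coligoISDtable.txt --id 0.95 --strand plus --threads 32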
These steps can be repeated by individual users with different input data and vsearch options, as desired. Please see below for instructions with examples.
Next steps for individual users and projects
The notes below have not been updated to parallel the steps taken above (analysis updated 20 March 2021). Some parts of what follows likely still apply and are correct. More up-to-date documentation and suggestions are in https://microcollaborative.atlassian.net/wiki/pages/createpage.action?spaceKey=MICLAB&title=Bioinformatics%20v3.0&linkCreation=true&fromPageId=2088173569
The next tasks each user will need to perform are QC of the data, generation of taxonomic hypotheses for ESVs/OTUs, identification of those ESVs that are the synthgene and coligo and combining data for those ESVs sensibly, and finally statistical analysis. We will go through each of these tasks below and provide some guidelines. We also provide a few tips for those new to bioinformatics. For the experienced users, we will need to work together to help folks that are new to the game.
Basic info for those who are new to bioinformatics
If you are new to bioinformatics and much of this page is unintelligible, then do not fear. Please talk to those in your lab with experience. If you want to do bioinformatics yourself, great! There is a learning curve, so be patient. It is best to start with basic Linux use because to do much of anything with sequencing data it is important to know how to operate in the Linux environment. One can google around and find various tutorials for the basics. Much useful info regarding Teton use has been collated by the Data Science group at UWYO in a Knowledge base. Also see the ARCC wiki for Teton.
Please consult these resources and figure out the basics of using Teton. You will need to know how to operate in Linux to work on Teton.
Best practices
To avoid wasting a large amount of disk space and potentially causing other problems please do not copy the raw fastq files to your scratch or home directories on Teton. If you want to use them, then do so while leaving them in place (see below for an example) and do not keep intermediate files. However, please feel free to copy OTU tables and work on your data as you see fit.
It is very easy to get confused when doing bioinformatics. One makes tens to thousands of files and has to keep them all organized. It is ideal to copy all one's commands to a digital notebook. This may seem onerous, but it will save time in the long run. This will be particularly helpful if you need to ask someone for help. It is very challenging to provide bioinformatic advice when it is unclear what has been done.
Explanation of files in your project folder
If one navigates to /project/microbiome/data/seq/ one will find folders labeled psomagen_date, where date corresponds with the date we received data from Psomagen.
Inside those folders one will find these directories: otu, rawdata, and tfmergedreads.
otu - contains otu tables for either 16S or ITS, organized by project (project names are user provided). Inside each project-specific folder, one will find “chimera_out.fa”, “zotus_nonchimeric.fa”, “uniqueSequences.fa”, “zotus.fa”, and “otutable”.
“chimera_out.fa” contains chimeric sequences.
“zotus_nonchimeric.fa” contains non-chimeric ESVs. If you want to do something with the ESV sequences, use this file.
“uniqueSequences.fa” contains all the unique sequences in the project.
“zotus.fa” has ESVs (which usearch refers to as zOTUs); these differ slightly from the unique sequences in that some low-frequency sequences contained in “uniqueSequences.fa” are not included in “zotus.fa”.
rawdata - contains raw data and scripts used to parse them (see above).
tfmergedreads - contains trimmed, filtered, merged reads. Note that we are not currently trimming reads, though one could do this if desired. Within this folder one will find folders labeled 16S or ITS, and inside them are fasta files labeled by sample name, MID, user, and project name. These reads were processed as described by Alex earlier on this page. Filtering and merging statistics can be extracted from the file “all_filtermergestats.csv”, which includes a row for each file: the first field contains the file name, the second field is an integer giving how many reads were in the file originally, the third field is how many reads were left after merging, and the fourth field is how many reads were left after filtering.
Taxonomic hypothesis generation
One will want to determine which ESVs correspond to which taxa. This can be done via vsearch like so:
vsearch -sintax ESVFILE -db REFERENCEDATABASE -tabbedout OUTPUT -sintax_cutoff 0.8 -strand both -threads 32
This uses the sintax algorithm to match the input ESV sequences, contained within “ESVFILE”, to the reference database. The reference database contains many annotated sequences; for example, there is a 16S sequence for E. coli. By matching one’s sequences to those in the database, one can hypothesize a taxonomy for each sequence. For more details, see the SINTAX paper. Please read the vsearch help for more options to specify for this algorithm. The options provided here are sufficient, but it is worth being familiar with the possibilities offered by the tool.
There are many reference databases one can choose and, of course, taxonomic hypotheses will vary to some extent by one’s choice. Several reference databases are at /project/microbiome/ref_db. One can download and make one’s own database as well. This is explained in the documentation of vsearch.
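As a concrete usage example (a sketch only; the reference file name below is a hypothetical placeholder, so substitute whichever database under /project/microbiome/ref_db, or elsewhere, that you choose):
module load swset/2018.05 gcc/7.3.0 vsearch/2.9.0
vsearch -sintax zotus_nonchimeric.fa -db /project/microbiome/ref_db/SOME_REFERENCE_DB.fa -tabbedout zotus_taxonomy.sintax -sintax_cutoff 0.8 -strand both -threads 32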
Finding coligos and ISDs
One will want to identify the internal standard (ISD; previously referred to as a synthgene). One can compare ESVs to known variants of the ISD using usearch_global.
vsearch -usearch_global zotus_nonchimeric.fa -db /project/microbiome/ref_db/synthgene.fasta -strand plus -blast6out ~/matches.b6 -id 0.85 -query_cov 0.8 -maxaccepts 50
For more on accept options see: https://drive5.com/usearch/manual/accept_options.html. Note that J Harrison chose the values in the preceding example somewhat arbitrarily (they worked well with a subset of the data we obtained in May 2020). It is worth playing with the parameters to see how results change. Look at the ESV sequences suggested to be ISDs and see if they match expectations.
Those ESVs that correspond to the ISD can be combined. The easiest way to do this is to sum the corresponding rows in the OTU table. A script can be made to pull in the matches provided by vsearch and combine those rows of the OTU table. If this is challenging then please ask for help. Here is a file that has the ISD sequences, including a few known variants of the ISD that occur in the data.
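As a minimal command-line sketch of that idea (assuming the blast6out file from the command above has been copied to the working directory as matches.b6, alongside the tab-delimited otutable; this is a hypothetical example, not a polished script):
# collect the ids of ESVs that matched the ISD (column 1 of the blast6 output)
cut -f1 matches.b6 | sort -u > isd_ids.txt
# sum the otutable rows for those ESVs into a single ISD_combined row
awk -F'\t' -v OFS='\t' '
  NR == FNR { isd[$1] = 1; next }      # first file: ESV ids that matched the ISD
  FNR == 1  { print; next }            # keep the otutable header
  $1 in isd { for (i = 2; i <= NF; i++) sum[i] += $i; n = NF; next }
  { print }                            # all other rows pass through unchanged
  END { if (n) { printf "ISD_combined"; for (i = 2; i <= n; i++) printf "%s%d", OFS, sum[i]; printf "\n" } }
' isd_ids.txt otutable > otutable_ISDcombined.txt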
--
Identify coligos. One needs to go through the ESV file and figure out which sequences are coligos. Here is an example script and input files to do this.
Quality control
Some suggestions for QC
Check the dimensions of the OTU table (see the sketch after this list). Are all the samples present that were supposed to be present? If not, which samples are missing, and did they have very low DNA?
How much is in the negative control? There shouldn’t be too much, but some contaminants are expected.
Check whether the sequences in one’s positive control match expectations. If sufficient data are not provided in the output from one’s preferred taxonomic hypothesis generation software, then one can blast sequences here: https://blast.ncbi.nlm.nih.gov/Blast.cgi?LINK_LOC=blasthome&PAGE_TYPE=BlastSearch&PROGRAM=blastn (of course, if genuinely common sequences aren’t identified by the taxonomic hypothesis generation software, then it would be worth revisiting that step!). If you don’t have a positive control, then check out some of the most abundant OTUs and see what they are. Do they match expectations?
Check the ESV sequences: are there any very short sequences that are not coligos? This can be done with functionality in vsearch.
Look for low complexity ESVs.
Check and see if the ISD is present in every sample. If not, that is ok, but something to be aware of as it will influence absolute abundance estimation.
See if coligos are in wells that they are not supposed to be in. If so, then consider carefully what to do. If contamination is very minor, then one could estimate it to report in publications and probably ignore it. For instance, if contamination happened in two samples and was 0.0001% of the data for those samples, then there is no need to worry. If cross-contamination is higher, then one will likely need to develop a bespoke approach to deal with it.
Check out technical replicates and quantify how much variation there is among them in terms of the metrics cared about (e.g., relative abundance profiles).
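A minimal sketch for the first item above, checking the dimensions of the tab-delimited otutable produced earlier (the header row holds OTUID plus one column per sample):
awk -F'\t' 'NR == 1 { print NF - 1, "samples" } END { print NR - 1, "ESVs" }' otutable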
Rerunning otu table generation
We anticipate that users will want to rerun otu table generation with different input sequences or search options for clustering (and masking of portions of the sequence, like the primer regions). For example, to combine 16S reads from the snow18 and snow19 projects in the psomagen_6mar20 library, one could do so as follows, from the command line:
srun --pty --account="microbiome" --ntasks-per-node=32 --nodes=1 --mem=120G -t 0-2:00 /bin/bash
module load swset/2018.05 gcc/7.3.0 vsearch/2.9.0
cat /project/microbiome/data/seq/psomagen_6mar20/tfmergedreads/16S/snow1*/*tfmergedreads.fa | sed -e 's/\s/_/g' | sed -e 's/^>16/>rna16/' | vsearch -derep_fulllength - -threads 32 --output uniqueSequences.fa --sizeout
vsearch --cluster_unoise uniqueSequences.fa --sizein --consout zotus.fa --minsize 12 --id 0.99 --relabel 'Otu' --iddef 2
vsearch --uchime3_denovo zotus.fa --chimeras chimera_out.fa --nonchimeras zotus_nonchimeric.fa --threads 32
cat /project/microbiome/data/seq/psomagen_6mar20/tfmergedreads/16S/snow1*/*tfmergedreads.fa | sed -e 's/\s/_/g' | sed -e 's/^>16/>rna16/' | vsearch --usearch_global - --db zotus_nonchimeric.fa --otutabout 'otutable' --id 0.99
Note that if biom format files are desired they can be output by vsearch. See the vsearch help for more.
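For example, a sketch repeating the final mapping step above with biom output (only the output option differs; file names are as in the commands above):
cat /project/microbiome/data/seq/psomagen_6mar20/tfmergedreads/16S/snow1*/*tfmergedreads.fa | sed -e 's/\s/_/g' | sed -e 's/^>16/>rna16/' | vsearch --usearch_global - --db zotus_nonchimeric.fa --biomout otutable.biom --id 0.99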
Statistics
Everyone will have different statistics to estimate from their data. We advocate the Dirichlet-multinomial model described here (see the supplement for a vignette describing how to apply the model). After relative abundances are estimated we suggest dividing them by the relative abundance estimate of the internal standard, thus placing the data on the scale of the internal standard, which is a proxy for actual abundance. For more, see this preprint by the Buerkle lab.
It is possible to extract an estimate of the variation in relative abundance estimates among technical replicates (calculate variation of the MCMC samples for each relative abundance estimate from the DM model). We have not experimented with this too much yet, but it seems like it would be a valuable consideration. Please let us know if you develop software to do this.
Further considerations
It is worth considering masking or deleting primer regions and combining ESVs to avoid having multiple ESVs for the same taxon. Primers are degenerate, so their inclusion in bioinformatics leads to more ESVs; this may or may not reflect biology. Therefore, we chose to leave primer-binding regions on sequences and leave it up to the client how to proceed. It is likely that one's choice will have little bearing on beta-diversity or PERMANOVA, but it may influence differential relative abundance testing. We plan to experiment with masking so that the user has the option of including masked regions or filtering them out.
Bioinformatic wish list
Bioinformatic raw sequence processing performed with primer regions removed.