Info |
---|
Status (1 12 October 2020)
|
Table of Contents |
---|
Demultiplexing and splitting
The directory for the raw sequence data (typically gz compressed; use run_pigz.sh and run_unpigz.sh to compress and decompress with multithreaded pigz, using SLURM) and the parsed and split reads is /project/microbiome/data/seq/psomagen_17sep209oct20_novaseq2novaseq3/rawdata
(corresponding to micro NovaSeq Run #2#3). Files for individual samples are will be in /project/microbiome/data/seq/psomagen_17sep209oct20_novaseq2novaseq3/rawdata/sample_fastq/
.
Demultiplexing
The work is done by run_parse_count_onSplitInput.pl
. As the name implies, we split the raw data into many files, so that the parsing can be done in parallel by many nodes. The approximate string matching that we are doing requires ~140 hours of CPU time, so we are splitting the task across ~220 jobs. By doing so, the parsing takes less than one hour.
Splitting the raw (uncompressed) data was accomplish with the program split
(done in an interactive SLURM job), with 16x106 lines (4x106 reads) being written to each file (with a remainder in the final file). These files were written to /gscratch and, as intermediate files that can be reconstructed readily, will not be retained long-term.
...
Code Block |
---|
cd /gscratch/buerkle/psomagen_9oct20_novaseq3/rawdata split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq /pfs/tsfs1/project/microbiome/data/seq/psomagen_9oct20_17sep20novaseq3/rawdata/NovaSeq1ANovaSeq3_R1.fastq novaseq1Anovaseq3_R1_ ; split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq /pfs/tsfs1/project/microbiome/data/seq/psomagen_17sep209oct20_novaseq3/rawdata/NovaSeq1ANovaSeq3_R2.fastq novaseq1Anovaseq3_R2_ |
making 220 R1 files and 220 R2 files, with structured names (e.g., for the R1 set):/gscratch/buerkle/psomagen_17sep20_novaseq2/rawdata/novaseq2_R1_000.fastq
/gscratch/buerkle/psomagen_17sep20_novaseq2/rawdata/novaseq2_R1_001.fastq
/gscratch/buerkle/psomagen_17sep209oct20_novaseq2novaseq3/rawdata/novaseq2novaseq3_R1_002000.fastq
/gscratch/buerkle/psomagen_17sep209oct20_novaseq2novaseq3/rawdata/novaseq2novaseq3_R1_003001.fastq
etc.
run_parse_count_onSplitInput.pl
also writes to /gscratch
.
NovaSeq2novaseq3_DemuxJHdemux.csv
was exported from a Google Sheet notebook and further edited to correct some assignment of project namesprovided by Joshua Harrison. This file is used to map MIDS to sample names and projects.
...
Info |
---|
(text below are next steps and will be edited as I move through the steps). |
Splitting
The info lines for each read in parsed_NovaSeq2*_R1.fastq
and parsed_NovaSeq2*_R2.fastq
have the locus, the forward mid, the reverse mid, and the sample name. These can be used with NovaSeq2_DemuxJH.csv
to separate reads into loci, projects, and samples, in the folder sample_fastq/
. The reads are in separate files for each sequenced sample, including replicates. The unique combination of forward and reverse MIDs (for a locus) is part of the filename and allows replicates to be distinguished and subsequently merged.
...
splitFastq.pl
and splitFastq_manyInputfiles.pl
will need tweaking in the future, until sample names and the format of the key for demultiplexing and metadata stabilizes. The number of columns has differed among our three completed sequence lanes.
Calculate summary statistics on reads
In a sequence library’s rawdata/
directory (e.g., /project/microbiome/data/seq/psomagen_17sep20_novaseq2/rawdata
) run:
...
aggregate_usearch_fastx_info.pl
Merge reads, filter, and (optionally) trim primer regions
In /project/microbiome/data/seq/psomagen_17sep20_novaseq2/tfmergedreads
, we used run_slurm_mergereads.pl
to crawl all of the project folders and sample files (created in the splitting step above) to merge read pairs, filter based on base quality, and optionally trim primer regions from the reads. We are now not trimming primer regions, even though we found that we could not readily eliminate low frequency, potentially spurious OTUs downstream by masking with vsearch; it would not properly mask these regions [by marking them with lower-case bases] and the soft mask them when we do vsearch cluster_size
. This writes a new set of fasta files for each sample and project, rather than fastq, to be used in subsequent steps. These files are found in the 16S/
and ITS/
folders in tfmergedreads/
. Statistics on the initial number reads, the number of reads that merged, and the number of reads that remain after filtering are in filtermergestats.csv
in each project folder. For the full lane these summaries were concatenated in tfmergedreads/
with
...
I used commands.R
in that folder to make a plot of numbers of reads per sample (horizontal axis) and the number reads that were removed because they did not merge, or did meet quality criteria and were filtered out (vertical axis). Purple is for 16S and orange is for ITS. It might be interesting to do that plot for each of the projects in the library (TODO), and possibly to have marginal histograms (or put quantiles on the plots).
...
Find unique sequences, cluster, and count OTUs
We ran a single script for each library that generated otu tables for 16S and ITS. These are found in the otu/ directory for each library. The script we ran is also there: e.g., /project/microbiome/data/seq/psomagen_6mar20/otu/run_slurm_mkotu.pl
...
Finally, please note that vsearch
does not allow hyphens in the sample name, so all instances of hyphens in the sample names will appear as underscores in the otutable.
...
Next steps for individual users and projects
The next tasks each user will need to perform are QC of the data, generation of taxonomic hypotheses for ESVs/OTUs, identification of those ESVs that are the synthgene and coligo and combining data for those ESVs sensibly, and finally statistical analysis. We will go through each of these tasks below and provide some guidelines. We also provide a few tips for those new to bioinformatics. For the experienced users, we will need to work together to help folks that are new to the game.
Basic info for those who are new to bioinformatics
If you are new to bioinformatics and much of this page is unintelligible, then do not fear. Please talk to those in your lab with experience. If you want to do bioinformatics yourself, great! There is a learning curve, so be patient. It is best to start with basic Linux use because to do much of anything with sequencing data it is important to know how to operate in the Linux environment. One can google around and find various tutorials for the basics. Much useful info regarding Teton use has been collated by the Data Science group at UWYO in a Knowledge base. Also see the ARCC wiki for Teton.
Please consult these resources and figure out the basics of using Teton. You will need to know how to operate in Linux to work on Teton.
Best practices
To avoid wasting a large amount of disk space and potentially causing other problems please do not copy the raw fastq files to your scratch or home directories on Teton. If you want to use them, then do so while leaving them in place (see below for an example) and do not keep intermediate files. However, please feel free to copy OTU tables and work on your data as you see fit.
It is very easy to get confused when doing bioinformatics. One makes tens to thousands of files and has to keep them all organized. It is ideal to copy all one's commands to a digital notebook. This may seem onerous, but it will save time in the long run. This will be particularly helpful if you need to ask someone for help. It is very challenging to provide bioinformatic advice when it is unclear what has been done.
Explanation of files in your project folder
If one navigates to /project/microbiome/data/seq/ one will find folders labeled psomagen_date, where date corresponds with the date we received data from Psomagen.
...
tfmergedreads - contains trimmed, filtered, merged reads. Note that we are not currently trimming reads, though one could do this if desired. Within this folder one will find folders labeled 16S or ITS and inside them will be fasta files labeled by sample name, MID, user and project name. These reads were processed as described by Alex earlier on this page. One can extract filtering and merging statistics from the file, “all_filtermergestats.csv”. This file includes a row for each file. The first field contains the file name, the second field should be an integer and tells how many reads were in the file originally, third field is how many reads were left after merging, fourth field is how many reads left after filtering.
Taxonomic hypothesis generation
One will want to determine which ESVs correspond to which taxa. This can be done via vsearch like so:
...
There are many reference databases one can choose and, of course, taxonomic hypotheses will vary to some extent by one’s choice. Several reference databases are at /project/microbiome/ref_db. One can download and make one’s own database as well. This is explained in the documentation of vsearch.
Finding coligos and ISDs
One will want to identify the internal standard (ISD; previously referred to as a synthgene). One can compare ESVs to known variants of the ISD using usearch_global.
...
View file | ||
---|---|---|
|
View file | ||
---|---|---|
|
View file | ||
---|---|---|
|
Quality control
Some suggestions for QC
Check the dimensions of the OTU table. Are all the samples present that were supposed to be present? If not, then what are the missing ones, did they have very low DNA?
How much is in the negative control? There shouldn’t be too much, but some contaminants are expected.
Check and see if the sequences in ones positive control match expectations. If sufficient data are not provided in the output from one’s preferred taxonomic hypothesis generation software, then one can blast sequences here: https://blast.ncbi.nlm.nih.gov/Blast.cgi?LINK_LOC=blasthome&PAGE_TYPE=BlastSearch&PROGRAM=blastn (of course, if real common sequences aren’t identified by the taxonomic hypothesis generation software, then it would be worth revisiting that step!). If you don’t have a positive control then check out some of the most abundant OTUs and see what they are. Do they match expectations?
Check out the ESV sequences, are there any very short sequences that are not coligos? This can be done via functionality in Vsearch.
Look for low complexity ESVs.
Check and see if the ISD is present in every sample. If not, that is ok, but something to be aware of as it will influence absolute abundance estimation.
See if coligos are in wells that they are not supposed to be in. If so, then consider carefully what to do. If contamination is very minor, then one could estimate it to report in publications and probably ignore it. For insance, if contamination happened in two samples and was 0.0001% of the data of those samples, then no need to worry. If cross-contamination is higher, then one will likely need to develop a bespoke approach to deal with it.
Check out technical replicates and quantify how much variation there is among them in terms of the metrics cared about (e.g., relative abundance profiles).
Rerunning otu table generation
We anticipate that users will want to rerun otu table generation with different input sequences or search options for clustering (and masking of portions of the sequence, like the primer regions). For example, to combine 16S reads from snow18
and snow19
projects in the psomagen_6mar20
project, one could do so as follows, from the command-line:
...
Note that if biom format files are desired they can be output by vsearch. See the vsearch help for more.
Statistics
Everyone will have different statistics to estimate from their data. We advocate the Dirichlet-multinomial model described here (see the supplement for a vignette describing how to apply the model). After relative abundances are estimated we suggest dividing them by the relative abundance estimate of the internal standard, thus placing the data on the scale of the internal standard, which is a proxy for actual abundance. For more, see this preprint by the Buerkle lab.
It is possible to extract an estimate of the variation in relative abundance estimates among technical replicates (calculate variation of the MCMC samples for each relative abundance estimate from the DM model). We have not experimented with this too much yet, but it seems like it would be a valuable consideration. Please let us know if you develop software to do this.
Further considerations
It is worth considering masking or deleting primer regions and combining ESVs to avoid having multiple ESVs for the same taxon. Primers are degenerate so their inclusion in bioinformatics leads to more ESVs. This may reflect biology or it may not. Therefore, we chose to leave primer binding regions on sequences and leave it up to the client how to proceed. It is likely that ones choice will have little bearing on beta-diversity or PERMANOVA, but may influence differential relative abundance testing. We plan to experiment with masking so that the user has the option of including masked regions, or filtering them out.
Bioinformatic wish list
Bioinformatic raw sequence processing performed with primer regions removed.
...