
Status (28 September 2020)

Demultiplexing and splitting

The directories for the raw sequence data (gz compressed) and the parsed and split reads are /project/microbiome/data/seq/psomagen_6mar20/rawdata (corresponding to /wiki/spaces/MICLAB/pages/566919183) and /project/microbiome/data/seq/psomagen_26may20/rawdata (corresponding to /wiki/spaces/MICLAB/pages/566951972). Files for individual samples are in /project/microbiome/data/seq/psomagen_6mar20/rawdata/sample_fastq/ and /project/microbiome/data/seq/psomagen_26may20/rawdata/sample_fastq/.

Demultiplexing

The work is done by run_parse_count_onSplitInput.pl. As the name implies, we split the raw data into many files, so that the parsing can be done in parallel by many nodes. The approximate string matching that we are doing requires ~140 hours of CPU time, so we are splitting the task across ~220 jobs. By doing so, the parsing takes less than one hour.

Splitting the raw (uncompressed) data was accomplished with the program split, with 16 × 10^6 lines (4 × 10^6 reads) written to each file (with a remainder in the final file). These files were written to /gscratch and, as intermediate files that can readily be reconstructed, will not be retained long-term.

split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq  /pfs/tsfs1/project/microbiome/data/seq/psomagen_6mar20/rawdata/NovaSeq1A_R1.fastq  novaseq1A_R1_
split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq  /pfs/tsfs1/project/microbiome/data/seq/psomagen_6mar20/rawdata/NovaSeq1A_R2.fastq  novaseq1A_R2_
split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq  /pfs/tsfs1/project/microbiome/data/seq/psomagen_26may20/rawdata/NovaSeq1B_R1.fastq  novaseq1B_R1_ 
split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq  /pfs/tsfs1/project/microbiome/data/seq/psomagen_26may20/rawdata/NovaSeq1B_R2.fastq  novaseq1B_R2_

making 442 files, with structured names (e.g., for the R2 set):

/gscratch/buerkle/psomagen_6mar20/rawdata/novaseq1A_R2_000.fastq
/gscratch/buerkle/psomagen_6mar20/rawdata/novaseq1A_R2_001.fastq
/gscratch/buerkle/psomagen_6mar20/rawdata/novaseq1A_R2_002.fastq …

/gscratch/buerkle/psomagen_6mar20/rawdata/novaseq1A_R2_219.fastq
/gscratch/buerkle/psomagen_6mar20/rawdata/novaseq1A_R2_220.fastq
/gscratch/buerkle/psomagen_6mar20/rawdata/novaseq1A_R2_221.fastq
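A quick sanity check on the split output is to confirm that every chunk (except possibly the last) has exactly the expected line count, so that no fastq record straddles two files. This is only a sketch; the glob and the expected count are assumptions to adapt to the run being checked:

```shell
# Check that each split chunk holds 16,000,000 lines (a multiple of 4,
# so fastq records never straddle two files); only the final chunk may be short.
expected=16000000
for f in novaseq1A_R2_*.fastq; do
  [ -e "$f" ] || continue        # skip if the glob matched nothing
  n=$(wc -l < "$f")
  if [ "$n" -ne "$expected" ]; then
    echo "short chunk (should be the final file only): $f has $n lines"
  fi
done
```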

run_parse_count_onSplitInput.pl also writes to /gscratch.

NovaSeqmetadataRUN1.csv and NovaSeqmetadataRUN2.csv were exported from a Google Sheet notebook and used to map MIDs to sample names. NovaSeqmetadataRUN1.csv had 1920 lines in which the locus field did not match the locus name in the MID plate. Furthermore, by initial manual inspection, Alex Buerkle verified that some sequences labeled as 16S started with the ITS primer, and vice versa. To check that the corrected locus information in NovaSeqmetadataRUN1_corrected.csv (below) is correct, I then wrote verify_mismatch_naming_locus.pl to look at the first 10 instances in which the expected primer was not found; this involved reading up through read 2,481,225. All ten mismatches appeared to be the right locus, but to have some sequence variation relative to the primer. Interestingly, after we trimmed off the recognized MID, the primer sequence was in some cases preceded by up to 4 additional bases (or was missing its first base) relative to what we expected, so I had to allow for that in verify_mismatch_naming_locus.pl. I think this is fine for now, but we should follow up to understand the origin of these additional bases. Our scheme for assigning amplicons to loci now allows up to 4 mismatches from the specified primers (5/20 would be 25%, which seemed pretty high, so Alex chose 4 as the maximum). Additionally, in September 2020 users provided updated information for assigning some samples to different projects, and those corrections were made directly to NovaSeqmetadataRUN1.csv and NovaSeqmetadataRUN2.csv.
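The offset part of that tolerance can be illustrated with a pattern that accepts up to 4 extra bases before the primer. This sketch uses the common 16S 515F primer (GTGYCAGCMGCCGCGGTAA) as an example, with its degenerate positions expanded into character classes; the actual parsing in run_parse_count_onSplitInput.pl uses approximate string matching and also allows base mismatches, which a plain grep does not model:

```shell
# Demo sequences (hypothetical reads, not real data): an exact primer match,
# a match preceded by 3 extra bases, and a non-match.
cat > demo_seqs.txt <<'EOF'
GTGCCAGCAGCCGCGGTAACCTT
AAAGTGTCAGCCGCCGCGGTAACC
TTTTTTTTTTTTTTTTTTTTTTTT
EOF
# 515F with degenerate bases expanded: Y = [CT], M = [AC].
primer='GTG[CT]CAGC[AC]GCCGCGGTAA'
# Accept 0-4 leading bases before the primer starts.
grep -cE "^[ACGTN]{0,4}${primer}" demo_seqs.txt   # prints 2
```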

To correct some of the MID problems, we copied the original NovaSeqmetadataRUN1.csv file to NovaSeqmetadataRUN1_uncorrected.csv. We used correctMIDS.pl to write a new, corrected version, NovaSeqmetadataRUN1_corrected.csv, which we copied to NovaSeqmetadataRUN1.csv. All subsequent steps were then completed again for run 1.

Splitting

The info lines for each read in parsed_NovaSeq1A_R1.fastq and parsed_NovaSeq1A_R2.fastq have the locus, the forward MID, the reverse MID, and the sample name. These can be used with NovaSeqmetadataRUN1.csv and NovaSeqmetadataRUN2.csv to separate reads into loci, projects, and samples, in the folder sample_fastq/. The reads are in separate files for each sequenced sample, including replicates. The unique combination of forward and reverse MIDs (for a locus) is part of the filename, which allows replicates to be distinguished and subsequently merged.

run_splitFastq_fwd.sh and run_splitFastq_rev.sh run splitFastq_manyInputfiles.pl, which steps through the many pairs of files to split reads by sample and project, placing them in /project/microbiome/data/seq/psomagen_6mar20/rawdata/sample_fastq/ and /project/microbiome/data/seq/psomagen_26may20/rawdata/sample_fastq/.

splitFastq.pl and splitFastq_manyInputfiles.pl will need tweaking in the future, until sample names and the format of the key for demultiplexing and metadata stabilize. For example, in this case the number of columns differed between NovaSeqmetadataRUN1.csv and NovaSeqmetadataRUN2.csv, because the former had an additional field for an ‘other_samplename’ or something like that.

Calculate summary statistics on reads

In a sequence library’s rawdata/ directory (e.g., /project/microbiome/data/seq/psomagen_26may20/rawdata) run:

module load swset/2018.05 gcc/7.3.0 usearch/10.0.240

aggregate_usearch_fastx_info.pl

Merge reads, filter, and (optionally) trim primer regions

In /project/microbiome/data/seq/psomagen_26may20/tfmergedreads and /project/microbiome/data/seq/psomagen_6mar20/tfmergedreads, we used run_slurm_mergereads.pl to crawl all of the project folders and sample files (created in the splitting step above) to merge read pairs, filter based on base quality, and optionally trim primer regions from the reads. We are currently not trimming primer regions, even though we found that we could not readily eliminate low-frequency, potentially spurious OTUs downstream by masking with vsearch: it would not properly mask these regions (by marking them with lower-case bases) and then soft-mask them when we do vsearch --cluster_size. This step writes a new set of fasta files (rather than fastq) for each sample and project, to be used in subsequent steps. These files are found in the 16S/ and ITS/ folders in tfmergedreads/. Statistics on the initial number of reads, the number of reads that merged, and the number of reads that remain after filtering are in filtermergestats.csv in each project folder. For the full lane these summaries were concatenated in tfmergedreads/ with

cat */*/filtermergestats.csv > all_filtermergestats.csv

I used commands.R in that folder to make a plot of the number of reads per sample (horizontal axis) against the number of reads that were removed because they did not merge or did not meet quality criteria and were filtered out (vertical axis). Purple is for 16S and orange is for ITS. It might be interesting to make that plot for each of the projects in the library (TODO), and possibly to add marginal histograms (or put quantiles on the plots).

Find unique sequences, cluster, and count OTUs

We ran a single script for each library that generated otu tables for 16S and ITS. These are found in the otu/ directory for each library. The script we ran is also there: e.g., /project/microbiome/data/seq/psomagen_6mar20/otu/run_slurm_mkotu.pl

This script crawls all of the projects (and loci) in the tfmergedreads/ folder, concatenates all input sequence files, and passes them through a UNIX pipe for analysis (rather than making another copy of all the sequence data). The key part of the script is the job that is created for each project:

$job .= "cd $allprojectsotu{$tfmdirectory};\n"; # this is the otu directory
$job .= "cat $tfmdirectory/*tfmergedreads.fa  | sed -e 's/\\s/_/g' | sed -e 's/^>16/>rna16/' | vsearch -derep_fulllength - -threads 32 --output uniqueSequences.fa --sizeout;\n";
$job .= "vsearch --cluster_unoise uniqueSequences.fa --sizein --consout zotus.fa --minsize 12 --id 0.99 --relabel 'Otu' --iddef 2;\n";
$job .= "vsearch --uchime3_denovo zotus.fa --chimeras chimera_out.fa --nonchimeras zotus_nonchimeric.fa --threads 32;\n";
$job .= "cat $tfmdirectory/*tfmergedreads.fa  | sed -e 's/\\s/_/g' | sed -e 's/^>16/>rna16/' | vsearch --usearch_global - --db zotus_nonchimeric.fa --otutabout 'otutable' --id 0.99;\n";

These steps can be repeated by individual users with different input data and vsearch options, as desired. Please see below for instructions with examples. The settings we used for all projects took only a few hours for each project.

When we originally passed reads through the steps to cluster into OTUs/ESVs and count them, we were getting ridiculously enormous OTU tables, because we were inadvertently having every read count as a different sample: vsearch uses the info line in the fastq as its sample name, not the filename, and because we assigned a unique integer to every parsed sequence, every read was being treated as a unique sample (so each had a single non-zero value for OTUs). Big mistake. Joshua Harrison figured out that this was the problem. We’ve fixed that. Lesson learned.

We decided to trim primer regions and get fewer OTUs by excluding what we think are likely spurious OTUs/ESVs. We could redo this and include primer regions again, now that the otu tables are sane.

Next steps for individual users and projects

The next tasks each user will need to perform are QC of the data, generation of taxonomic hypotheses for ESVs/OTUs, identification of those ESVs that are the synthgene and coligo and combining data for those ESVs sensibly, and finally statistical analysis. We will go through each of these tasks below and provide some guidelines. We also provide a few tips for those new to bioinformatics. For the experienced users, we will need to work together to help folks that are new to the game.

Basic info for those who are new to bioinformatics

If you are new to bioinformatics and much of this page is unintelligible, then do not fear. Please talk to those in your lab with experience. If you want to do bioinformatics yourself, great! There is a learning curve, so be patient. It is best to start with basic Linux use because to do much of anything with sequencing data it is important to know how to operate in the Linux environment. One can google around and find various tutorials for the basics. Much useful info regarding Teton use has been collated by the Data Science group at UWYO in a Knowledge base. Also see the ARCC wiki for Teton.

Please consult these resources and figure out the basics of using Teton. You will need to know how to operate in Linux to work on Teton.

Also, John Calder will be offering a workshop on the bioinformatics and statistics of these data in the coming weeks that will help.

Best practices

To avoid wasting a large amount of disk space and potentially causing other problems please do not copy the raw fastq files to your scratch or home directories on Teton. If you want to use them, then do so while leaving them in place (see below for an example) and do not keep intermediate files. However, please feel free to copy OTU tables and work on your data as you see fit.

It is very easy to get confused when doing bioinformatics. One makes tens to thousands of files and has to keep them all organized. It is ideal to copy all one's commands to a digital notebook. This may seem onerous, but it will save time in the long run. This will be particularly helpful if you need to ask someone for help. It is very challenging to provide bioinformatic advice when it is unclear what has been done.

Explanation of files in your project folder

If one navigates to /project/microbiome/data/seq/ one will find folders labeled psomagen_date, where date corresponds with the date we received data from Psomagen.

Inside those folders one will find these directories: otu, rawdata, tfmergedreads

otu - contains OTU tables for either 16S or ITS, organized by project (project names are user provided). Inside the project-specific folder, one will find “chimera_out.fa”, “zotus_nonchimeric.fa”, “uniqueSequences.fa”, “zotus.fa”, and “otutable”.

“chimera_out.fa” contains chimeric sequences.

“zotus_nonchimeric.fa” contains non-chimeric ESVs. If you want to do something with the ESV sequences, use this file.

“uniqueSequences.fa” contains all the unique sequences in the project.

“zotus.fa” has ESVs (which usearch refers to as ZOTUs); these differ slightly from the unique sequences in that some low-frequency sequences contained in “uniqueSequences.fa” will not be included in “zotus.fa”.

rawdata - contains raw data and scripts used to parse them (see above).

tfmergedreads - contains trimmed, filtered, merged reads. Note that we are not currently trimming reads, though one could do this if desired. Within this folder one will find folders labeled 16S or ITS, and inside them are fasta files labeled by sample name, MID, user, and project name. These reads were processed as described by Alex earlier on this page. One can extract filtering and merging statistics from the file “all_filtermergestats.csv”, which includes a row for each file: the first field is the file name, the second is how many reads were in the file originally, the third is how many reads were left after merging, and the fourth is how many reads were left after filtering.
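Given that four-column layout, per-file retention can be summarized directly from the shell. A sketch, assuming comma-separated fields; the file names and counts below are hypothetical:

```shell
# Hypothetical rows: file name, initial reads, reads after merging,
# reads remaining after filtering.
cat > all_filtermergestats.csv <<'EOF'
sampleA_16S.fa,10000,9000,8500
sampleB_ITS.fa,20000,12000,9000
EOF
# Fraction of the initial reads that survived both merging and filtering.
awk -F, '{ printf "%s %.2f\n", $1, $4 / $2 }' all_filtermergestats.csv
```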

Taxonomic hypothesis generation

One will want to determine which ESVs correspond to which taxa. This can be done via vsearch like so:

vsearch -sintax ESVFILE -db REFERENCEDATABASE -tabbedout OUTPUT -sintax_cutoff 0.8 -strand both -threads 32

This uses the sintax algorithm to match the input ESV sequences, contained within “ESVFILE”, to the reference database. The reference database has many annotated sequences; for example, there is a 16S sequence for E. coli. By matching one’s sequences to those in the database, one can hypothesize a taxonomy for each sequence. For more details, see the SINTAX paper. Please read the vsearch help for more options to specify for this algorithm. The options provided here are sufficient, but it is worth being familiar with the possibilities offered by the tool.

There are many reference databases one can choose and, of course, taxonomic hypotheses will vary to some extent by one’s choice. Several reference databases are at /project/microbiome/ref_db. One can download and make one’s own database as well. This is explained in the documentation of vsearch.

Finding coligos and ISDs

One will want to identify the internal standard (ISD; previously referred to as a synthgene). One can compare ESVs to known variants of the ISD using usearch_global.

vsearch -usearch_global zotus_nonchimeric.fa -db /project/microbiome/ref_db/synthgene.fasta -strand plus -blast6out ~/matches.b6 -id 0.85 -query_cov 0.8 -maxaccepts 50

For more on accept options see: https://drive5.com/usearch/manual/accept_options.html. Note that J Harrison chose the values in the preceding example somewhat arbitrarily (they worked well with a subset of the data we obtained in May 2020). It is worth playing with the parameters to see how results change. Look at the ESV sequences suggested to be ISDs and see if they match expectations.

Those ESVs that correspond to the ISD can be combined. The easiest way to do this is to sum the corresponding rows in the OTU table. A script can be made to pull in the matches provided by vsearch and combine those rows of the OTU table. If this is challenging then please ask for help. Here is a file that has the ISD sequences, including a few known variants of the ISD that occur in the data.
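That row-summing step can be sketched in awk. This assumes a tab-separated OTU table with ESV ids in the first column, and takes the ISD-matching ESV ids from the first (query) column of the blast6 output; all ids and counts below are hypothetical:

```shell
# matches.b6 abbreviated to its first two columns (query id, target id).
printf 'Otu5\tISD\nOtu9\tISD\n' > matches.b6
printf '#OTU ID\ts1\ts2\nOtu1\t10\t0\nOtu5\t3\t7\nOtu9\t1\t2\n' > otutable
# Pass 1 (NR==FNR): record the ESV ids that matched the ISD.
# Pass 2: sum the counts of those rows, column by column.
awk -F'\t' 'NR==FNR { isd[$1]; next }
            $1 in isd { for (i = 2; i <= NF; i++) sum[i] += $i }
            END { printf "ISD"; for (i = 2; i <= NF; i++) printf "\t%d", sum[i]; print "" }' \
    matches.b6 otutable   # prints a tab-separated row: ISD 4 9
```

The summed row can then replace the individual ISD rows in the OTU table.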

--

Identify coligos. One needs to go through the ESV file and figure out which sequences are coligos. Here is an example script and input files to do this.

Project naming and merging

We want otutables for each project, but the ‘projects’ that were named in the MIDs file might have oversplit the samples into too many projects. The projects that were split across sequencing runs 1 and 2 might have 16S in one run and ITS in the other, and so would not need recalculation of an otutable on a merged set of sequences. The projects we encountered are below in the first table. In the second table, Alessandra Ceretto, Gordon Custer, Ella DeWolf, Reilly Dibner, Abby Hoffman, and Macy Ricketts, can you please provide guidance on any simplification of names for ‘projects’?

Run 1

Run 2

AH2

Cheatgrass

control

fire_19

fire_19

GSG1

HerbPT1

HerbPT2

Invasives

Lake

LAKE

MACY1

MACY2

MACY3

MACY4

MACY6

MICROPLOT

MP001

MP002

MP003

MP004

MP006

sm19-arch

sm19lily

sm19lily

sm19-temp

snow18

snow18

snow19

snow19

snow20

snow21

snow22

snow23

snow24

snow25

snow26

snow27

snow28

snow29

snow30

snow31

snow32

snow33

snowfield

snowfield

test

test

unknown

yermo_19

yermo_19

Simplifications that are possible (scientist, project, and project name in each run):

Dibner, fire_19 (run 1: fire_19; run 2: fire_19): Project fire_19 is all soils collected from March 2019 through October 2019 around the Chimney Park area. These soils came from areas that were burned severely (SF), experienced lighter understory fire (UF), or were not burned (NB). I’m looking to identify differences in assemblages of microbes across these treatments and over time.

Dibner, yermo_19 (run 1: yermo_19; run 2: yermo_19): Project yermo_19 includes bulk soils from a site containing a threatened plant, all within a broader sagebrush ecosystem. Rhizosphere samples from the same study are forthcoming and not included in these runs, so analyses on this project are lower priority than fire_19 and others.


Quality control

Some suggestions for QC

  1. Check the dimensions of the OTU table. Are all the samples present that were supposed to be present? If not, then what are the missing ones, did they have very low DNA?

  2. How much is in the negative control? There shouldn’t be too much, but some contaminants are expected.

  3. Check and see if the sequences in one’s positive control match expectations. If sufficient data are not provided in the output from one’s preferred taxonomic hypothesis generation software, then one can BLAST sequences here: https://blast.ncbi.nlm.nih.gov/Blast.cgi?LINK_LOC=blasthome&PAGE_TYPE=BlastSearch&PROGRAM=blastn (of course, if really common sequences aren’t identified by the taxonomic hypothesis generation software, then it would be worth revisiting that step!). If you don’t have a positive control, then check out some of the most abundant OTUs and see what they are. Do they match expectations?

  4. Check out the ESV sequences, are there any very short sequences that are not coligos? This can be done via functionality in Vsearch.

  5. Look for low complexity ESVs.

  6. Check and see if the ISD is present in every sample. If not, that is ok, but something to be aware of as it will influence absolute abundance estimation.

  7. See if coligos are in wells that they are not supposed to be in. If so, then consider carefully what to do. If contamination is very minor, then one could estimate it to report in publications and probably ignore it. For instance, if contamination happened in two samples and was 0.0001% of the data of those samples, then there is no need to worry. If cross-contamination is higher, then one will likely need to develop a bespoke approach to deal with it.

  8. Check out technical replicates and quantify how much variation there is among them in terms of the metrics cared about (e.g., relative abundance profiles).
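For step 1, the table’s dimensions are quick to check from the shell. A sketch, assuming a tab-separated otutable with a header row of sample names; the table below is hypothetical:

```shell
# Hypothetical table: header of sample names, then one row per OTU/ESV.
printf '#OTU ID\ts1\ts2\ts3\nOtu1\t5\t0\t2\nOtu2\t1\t1\t1\n' > otutable
# Samples = header fields minus the id column; OTUs = data rows.
awk -F'\t' 'NR == 1 { print "samples:", NF - 1 } END { print "otus:", NR - 1 }' otutable
```

Compare the sample count against the number of samples submitted for the project, and follow up on any that are missing.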

Rerunning otu table generation

We anticipate that users will want to rerun otu table generation with different input sequences or search options for clustering (and masking of portions of the sequence, like the primer regions). For example, to combine 16S reads from the snow18 and snow19 projects in the psomagen_6mar20 library, one could do so as follows from the command line:

srun --pty --account="microbiome" --ntasks-per-node=32 --nodes=1 --mem=120G -t 0-2:00  /bin/bash
module load swset/2018.05 gcc/7.3.0 vsearch/2.9.0
cat /project/microbiome/data/seq/psomagen_6mar20/tfmergedreads/16S/snow1*/*tfmergedreads.fa  | sed -e 's/\s/_/g' | sed -e 's/^>16/>rna16/' | vsearch -derep_fulllength - -threads 32 --output uniqueSequences.fa --sizeout
vsearch --cluster_unoise uniqueSequences.fa --sizein --consout zotus.fa --minsize 12 --id 0.99 --relabel 'Otu' --iddef 2
vsearch --uchime3_denovo zotus.fa --chimeras chimera_out.fa --nonchimeras zotus_nonchimeric.fa --threads 32
cat /project/microbiome/data/seq/psomagen_6mar20/tfmergedreads/16S/snow1*/*tfmergedreads.fa  | sed -e 's/\s/_/g' | sed -e 's/^>16/>rna16/' | vsearch --usearch_global - --db zotus_nonchimeric.fa --otutabout 'otutable' --id 0.99

Note that if biom format files are desired they can be output by vsearch. See the vsearch help for more.

Statistics

Everyone will have different statistics to estimate from their data. We advocate the Dirichlet-multinomial model described here (see the supplement for a vignette describing how to apply the model). After relative abundances are estimated we suggest dividing them by the relative abundance estimate of the internal standard, thus placing the data on the scale of the internal standard, which is a proxy for actual abundance. For more, see this preprint by the Buerkle lab.

It is possible to extract an estimate of the variation in relative abundance estimates among technical replicates (calculate variation of the MCMC samples for each relative abundance estimate from the DM model). We have not experimented with this too much yet, but it seems like it would be a valuable consideration. Please let us know if you develop software to do this.

Further considerations

It is worth considering masking or deleting primer regions and combining ESVs to avoid having multiple ESVs for the same taxon. Primers are degenerate, so their inclusion in bioinformatics leads to more ESVs; this may reflect biology or it may not. Therefore, we chose to leave primer-binding regions on sequences and leave it up to the client how to proceed. It is likely that one’s choice will have little bearing on beta-diversity or PERMANOVA, but it may influence differential relative abundance testing. We plan to experiment with masking so that the user has the option of including masked regions or filtering them out.

Bioinformatic wish list

Bioinformatic raw sequence processing performed with primer regions removed.
