...

Demultiplexing and splitting

The directories for the raw sequence data (gz compressed) and the parsed and split reads are /project/microbiome/data/seq/psomagen_6mar20/rawdata (corresponding to /wiki/spaces/MICLAB/pages/566919183) and /project/microbiome/data/seq/psomagen_26may20/rawdata (corresponding to /wiki/spaces/MICLAB/pages/566951972, micro NovaSeq Run #2). Files for individual samples are in /project/microbiome/data/seq/psomagen_6mar20/rawdata/sample_fastq/ and /project/microbiome/data/seq/psomagen_26may20/rawdata/sample_fastq/.

Demultiplexing

The work is done by run_parse_count_onSplitInput.pl. As the name implies, we split the raw data into many files, so that the parsing can be done in parallel by many nodes. The approximate string matching that we are doing requires ~140 hours of CPU time, so we are splitting the task across ~220 jobs. By doing so, the parsing takes less than one hour.

...

Code Block
# 16,000,000 lines = 4,000,000 fastq records per chunk
split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq  /pfs/tsfs1/project/microbiome/data/seq/psomagen_6mar20/rawdata/NovaSeq1A_R1.fastq  novaseq1A_R1_
split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq  /pfs/tsfs1/project/microbiome/data/seq/psomagen_6mar20/rawdata/NovaSeq1A_R2.fastq  novaseq1A_R2_
split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq  /pfs/tsfs1/project/microbiome/data/seq/psomagen_26may20/rawdata/NovaSeq1B_R1.fastq  novaseq1B_R1_
split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq  /pfs/tsfs1/project/microbiome/data/seq/psomagen_26may20/rawdata/NovaSeq1B_R2.fastq  novaseq1B_R2_

making 442 files, with structured names (e.g., for the R2 set):

/gscratch/buerkle/psomagen_6mar20/rawdata/novaseq1A_R2_000.fastq
/gscratch/buerkle/psomagen_6mar20/rawdata/novaseq1A_R2_001.fastq
/gscratch/buerkle/psomagen_6mar20/rawdata/novaseq1A_R2_002.fastq …

/gscratch/buerkle/psomagen_6mar20/rawdata/novaseq1A_R2_219.fastq
/gscratch/buerkle/psomagen_6mar20/rawdata/novaseq1A_R2_220.fastq
/gscratch/buerkle/psomagen_6mar20/rawdata/novaseq1A_R2_221.fastq

run_parse_count_onSplitInput.pl also writes to /gscratch.

NovaSeqmetadataRUN1.csv and NovaSeqmetadataRUN2.csv were exported from a Google Sheet notebook and used to map MIDs to sample names. NovaSeqmetadataRUN1.csv had 1920 lines in which the locus field did not match the locus name in the midplate. Furthermore, by initial manual inspection, Alex Buerkle verified that some sequences labeled as 16S started with the ITS primer, and vice versa. To check that the corrected locus information in NovaSeqmetadataRUN1_corrected.csv (below) is correct, I then wrote verify_mismatch_naming_locus.pl to look at the first 10 instances in which the expected primer was not found. This involved reading up through read 2,481,225. All ten of the mismatches appeared to be the right locus, but to have some sequence variation relative to the primer. Interestingly, after we trimmed off the recognized MID, the primer sequence was in some cases preceded by up to 4 additional bases (or was missing its first base) relative to what we expected, so I had to allow for that in verify_mismatch_naming_locus.pl. I think this is fine for now, but we should follow up to understand the origin of these additional bases. Our scheme for assigning amplicons to loci now allows up to 4 mismatched bases relative to the specified primers (5 of 20 would be 25%, which seemed pretty high, so Alex chose 4 as the maximum). Additionally, in September 2020 users provided updated information for assigning some samples to different projects, and those corrections were made directly to NovaSeqmetadataRUN1.csv and NovaSeqmetadataRUN2.csv.

To correct some of the MIDs problems, we copied the original NovaSeqmetadataRUN1.csv file to NovaSeqmetadataRUN1_uncorrected.csv. We used correctMIDS.pl to write a new, corrected version, NovaSeqmetadataRUN1_corrected.csv, which we copied to NovaSeqmetadataRUN1.csv. All subsequent steps were completed again for run 1.

Splitting

The info lines for each read in parsed_NovaSeq1A_R1.fastq and parsed_NovaSeq1A_R2.fastq have the locus, the forward mid, the reverse mid, and the sample name. These can be used with NovaSeqmetadataRUN1.csv and NovaSeqmetadataRUN2.csv to separate reads into loci, projects, and samples, in the folder sample_fastq/. The reads are in separate files for each sequenced sample, including replicates. The unique combination of forward and reverse MIDs (for a locus) is part of the filename and allows replicates to be distinguished and subsequently merged.

run_splitFastq_fwd.sh and run_splitFastq_rev.sh run splitFastq_manyInputfiles.pl, which steps through the many pairs of files to split reads by sample and project, and place them in /project/microbiome/data/seq/psomagen_6mar20/rawdata/sample_fastq/ and /project/microbiome/data/seq/psomagen_26may20/rawdata/sample_fastq.

splitFastq.pl and splitFastq_manyInputfiles.pl will need tweaking in the future, until sample names and the format of the key for demultiplexing and metadata stabilize. For example, in this case, the number of columns differed between NovaSeqmetadataRUN1.csv and NovaSeqmetadataRUN2.csv, because the former had an additional field for an ‘other_samplename’ or something like that.

Calculate summary statistics on reads

In a sequence library’s rawdata/ directory (e.g., /project/microbiome/data/seq/psomagen_26may20/rawdata) run:

module load swset/2018.05 gcc/7.3.0 usearch/10.0.240

aggregate_usearch_fastx_info.pl
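For reference, aggregate_usearch_fastx_info.pl wraps usearch’s fastx_info report; for a single file the underlying call presumably looks something like this (the filename here is illustrative):

usearch -fastx_info sample_fastq/16S/somesample.fastq -output somesample_info.txt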

Merge reads, filter, and (optionally) trim primer regions

In /project/microbiome/data/seq/psomagen_26may20/tfmergedreads and /project/microbiome/data/seq/psomagen_6mar20/tfmergedreads, we used run_slurm_mergereads.pl to crawl all of the project folders and sample files (created in the splitting step above) to merge read pairs, filter based on base quality, and optionally trim primer regions from the reads. We are now not trimming primer regions, even though we found that we could not readily eliminate low-frequency, potentially spurious OTUs downstream by masking with vsearch; it would not properly mask these regions (by marking them with lower-case bases) and then soft-mask them when we run vsearch --cluster_size. This step writes a new set of fasta files (rather than fastq) for each sample and project, to be used in subsequent steps. These files are found in the 16S/ and ITS/ folders in tfmergedreads/. Statistics on the initial number of reads, the number of reads that merged, and the number of reads that remain after filtering are in filtermergestats.csv in each project folder. For the full lane these summaries were concatenated in tfmergedreads/ with

cat */*/filtermergestats.csv > all_filtermergestats.csv
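The exact options live in run_slurm_mergereads.pl and are not reproduced here; as a rough sketch, the merge and filter steps likely resemble the following vsearch calls (filenames and the expected-error threshold are illustrative assumptions, not the script’s actual settings):

Code Block
# merge each sample's read pairs
vsearch --fastq_mergepairs sample_R1.fastq --reverse sample_R2.fastq --fastqout sample_merged.fastq
# filter merged reads on expected errors and write fasta for the downstream steps
vsearch --fastq_filter sample_merged.fastq --fastq_maxee 1.0 --fastaout sample_tfmergedreads.fa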

I used commands.R in that folder to plot the number of reads per sample (horizontal axis) against the number of reads that were removed because they did not merge or did not meet quality criteria and were filtered out (vertical axis). Purple is for 16S and orange is for ITS. It might be interesting to make that plot for each of the projects in the library (TODO), and possibly to add marginal histograms (or put quantiles on the plots).

...

Find unique sequences, cluster, and count OTUs

We ran a single script for each library that generated otu tables for 16S and ITS. These are found in the otu/ directory for each library. The script we ran is also there: e.g., /project/microbiome/data/seq/psomagen_6mar20/otu/run_slurm_mkotu.pl

This script crawls all of the projects (and loci) in the tfmerged/ folder, concatenates all input sequence files, and passes them through a UNIX pipe for analysis (rather than making another copy of all the sequence data). The key part of the script is the job that is created for each project:

Code Block
$job .= "cd $allprojectsotu{$tfmdirectory};\n"; # this is the otu directory
$job .= "cat $tfmdirectory/*tfmergedreads.fa  | sed -e 's/\\s/_/g' | sed -e 's/^>16/>rna16/' | vsearch -derep_fulllength - -threads 32 --output uniqueSequences.fa --sizeout;\n";
$job .= "vsearch --cluster_unoise uniqueSequences.fa --sizein --consout zotus.fa --minsize 12 --id 0.99 --relabel 'Otu' --iddef 2;\n";
$job .= "vsearch --uchime3_denovo zotus.fa --chimeras chimera_out.fa --nonchimeras zotus_nonchimeric.fa --threads 32;\n";
$job .= "cat $tfmdirectory/*tfmergedreads.fa  | sed -e 's/\\s/_/g' | sed -e 's/^>16/>rna16/' | vsearch --usearch_global - --db zotus_nonchimeric.fa --otutabout 'otutable' --id 0.99;\n";

These steps can be repeated by individual users with different input data and vsearch options, as desired; see below for instructions with examples. With the settings we used, each project took only a few hours to process.
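For example, to redo the clustering for a single project by hand, the job above can be unwrapped into shell commands like these (the project name, locus, and working directory are illustrative; adjust options as desired):

Code Block
# work in a directory of your own rather than the shared otu/ directories
mkdir -p ~/myotu && cd ~/myotu
cat /project/microbiome/data/seq/psomagen_6mar20/tfmergedreads/16S/myproject/*tfmergedreads.fa \
  | sed -e 's/\s/_/g' -e 's/^>16/>rna16/' \
  | vsearch --derep_fulllength - --threads 32 --output uniqueSequences.fa --sizeout
vsearch --cluster_unoise uniqueSequences.fa --sizein --consout zotus.fa --minsize 12 --id 0.99 --relabel 'Otu' --iddef 2
vsearch --uchime3_denovo zotus.fa --chimeras chimera_out.fa --nonchimeras zotus_nonchimeric.fa --threads 32
cat /project/microbiome/data/seq/psomagen_6mar20/tfmergedreads/16S/myproject/*tfmergedreads.fa \
  | sed -e 's/\s/_/g' -e 's/^>16/>rna16/' \
  | vsearch --usearch_global - --db zotus_nonchimeric.fa --otutabout otutable --id 0.99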

When we originally passed reads through the steps to cluster into OTUs/ESVs and count them, we were getting ridiculously enormous OTU tables, because we were inadvertently having every read count as a different sample: vsearch uses the info line in the fastq as its sample name, not the filename, and because we had assigned a unique integer to every parsed sequence, every read was being treated as a unique sample (and so had a single non-zero value across OTUs). Big mistake. Joshua Harrison figured out that this was the problem. We’ve fixed that. Lesson learned.
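To see concretely what the header rewriting in the job above does, here is a toy example (the header itself is made up):

Code Block
echo '>16S M01234:fwdMID:revMID sample42' | sed -e 's/\s/_/g' -e 's/^>16/>rna16/'
# prints: >rna16S_M01234:fwdMID:revMID_sample42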

We decided to trim primer regions and get fewer OTUs by excluding what we think are likely spurious OTUs/ESVs. We could redo this and include primer regions again, now that the OTU tables are sane.

Next steps for individual users and projects

The next tasks each user will need to perform are: QC of the data; generation of taxonomic hypotheses for ESVs/OTUs; identification of those ESVs that are the synthgene and coligo, and combining data for those ESVs sensibly; and finally statistical analysis. We will go through each of these tasks below and provide some guidelines. We also provide a few tips for those new to bioinformatics. For the experienced users, we will need to work together to help folks that are new to the game.

Basic info for those who are new to bioinformatics

If you are new to bioinformatics and much of this page is unintelligible, then do not fear. Please talk to those in your lab with experience. If you want to do bioinformatics yourself, great! There is a learning curve, so be patient. It is best to start with basic Linux use because to do much of anything with sequencing data it is important to know how to operate in the Linux environment. One can google around and find various tutorials for the basics. Much useful info regarding Teton use has been collated by the Data Science group at UWYO in a Knowledge base. Also see the ARCC wiki for Teton.

Please consult these resources and figure out the basics of using Teton. You will need to know how to operate in Linux to work on Teton.

Also, John Calder will be offering a workshop in the coming weeks on the bioinformatics and statistics of these data, which will help.

Best practices

To avoid wasting a large amount of disk space and potentially causing other problems, please do not copy the raw fastq files to your scratch or home directories on Teton. If you want to use them, then do so while leaving them in place (see below for an example) and do not keep intermediate files. However, please feel free to copy OTU tables and work on your data as you see fit.

It is very easy to get confused when doing bioinformatics. One makes tens to thousands of files and has to keep them all organized. It is ideal to copy all one's commands to a digital notebook. This may seem onerous, but it will save time in the long run. This will be particularly helpful if you need to ask someone for help. It is very challenging to provide bioinformatic advice when it is unclear what has been done.

Explanation of files in your project folder

If one navigates to /project/microbiome/data/seq/ one will find folders labeled psomagen_date, where date corresponds with the date we received data from Psomagen.

Inside those folders one will find these directories: otu, rawdata, and tfmergedreads.

otu - contains OTU tables for either 16S or ITS, organized by project (project names are user provided). Inside the project-specific folder, one will find “chimera_out.fa”, “zotus_nonchimeric.fa”, “uniqueSequences.fa”, “zotus.fa”, and “otutable”.

“chimera_out.fa” - chimeric sequences.

“zotus_nonchimeric.fa” - non-chimeric ESVs. If you want to do something with the ESV sequences, use this file.

“uniqueSequences.fa” - all the unique sequences in the project.

“zotus.fa” - ESVs (which usearch refers to as ZOTUs). These differ slightly from the unique sequences in that some low-frequency sequences contained in “uniqueSequences.fa” will not be included in “zotus.fa”.
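For example, counting the sequences in any of these fasta files is just a matter of counting header lines:

grep -c '^>' zotus_nonchimeric.fa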

rawdata - contains raw data and scripts used to parse them (see above).

tfmergedreads - contains trimmed, filtered, merged reads. Note that we are not currently trimming reads, though one could do this if desired. Within this folder one will find folders labeled 16S or ITS, and inside them will be fasta files labeled by sample name, MID, user, and project name. These reads were processed as described by Alex earlier on this page. One can extract filtering and merging statistics from the file “all_filtermergestats.csv”. This file includes a row for each file: the first field contains the file name, the second is an integer giving how many reads were in the file originally, the third is how many reads were left after merging, and the fourth is how many reads were left after filtering.
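As a quick sanity check, the percentage of reads retained per file can be computed from those four fields (assuming the concatenated csv is comma-separated with no header row):

awk -F, '$2 > 0 { printf "%s\t%.1f%%\n", $1, 100 * $4 / $2 }' all_filtermergestats.csv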

Taxonomic hypothesis generation

One will want to determine which ESVs correspond to which taxa. This can be done via vsearch like so:

vsearch -sintax ESVFILE -db REFERENCEDATABASE -tabbedout OUTPUT -sintax_cutoff 0.8 -strand both -threads 32

This uses the sintax algorithm to match the input ESV sequences, contained within “ESVFILE”, to the reference database. The reference database has a lot of annotated sequences. For example, there is a 16S sequence for E. coli. By matching one's sequences to those in the database one can hypothesize a taxonomy for the sequence. For more details, see the SINTAX paper. Please read the Vsearch help for more options to specify for this algorithm. The options provided here are sufficient, but it is worth being familiar with the possibilities offered by the tool.

There are many reference databases one can choose and, of course, taxonomic hypotheses will vary to some extent by one’s choice. Several reference databases are at /project/microbiome/ref_db. One can download and make one’s own database as well. This is explained in the documentation of vsearch.
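Putting the pieces together, a taxonomy run against one of those databases might look like the following (the database filename is hypothetical; list /project/microbiome/ref_db to see what is actually available):

vsearch -sintax zotus_nonchimeric.fa -db /project/microbiome/ref_db/SOME_16S_DB.fasta -tabbedout taxonomy.txt -sintax_cutoff 0.8 -strand both -threads 32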

Finding coligos and ISDs

One will want to identify the internal standard (ISD; previously referred to as a synthgene). One can compare ESVs to known variants of the ISD using usearch_global.

vsearch -usearch_global zotus_nonchimeric.fa -db /project/microbiome/ref_db/synthgene.fasta -strand plus -blast6out ~/matches.b6 -id 0.85 -query_cov 0.8 -maxaccepts 50

For more on accept options see: https://drive5.com/usearch/manual/accept_options.html. Note that J Harrison chose the values in the preceding example somewhat arbitrarily (they worked well with a subset of the data we obtained in May 2020). It is worth playing with the parameters to see how results change. Look at the ESV sequences suggested to be ISDs and see if they match expectations.
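matches.b6 is in blast6 tabular format (query, target, percent identity, and further columns), so a quick way to skim the putative ISD hits and their identities is:

cut -f1,3 ~/matches.b6 | sort -k2,2nr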

Those ESVs that correspond to the ISD can be combined. The easiest way to do this is to sum the corresponding rows in the OTU table. A script can be made to pull in the matches provided by vsearch and combine those rows of the OTU table. If this is challenging then please ask for help. Here is a file that has the ISD sequences, including a few known variants of the ISD that occur in the data.

Attached file: synthgene.fasta

--
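No combining script is attached here, so below is a minimal sketch of summing the matched rows; it assumes that column 1 of matches.b6 holds the ESV ids and that otutable is tab-separated with ESV ids in column 1 and a header row:

Code Block
# pass 1: remember which ESVs matched the ISD; pass 2: sum their rows into one
awk 'NR == FNR { isd[$1] = 1; next }            # pass 1: matches.b6
     FNR == 1  { print; next }                  # pass 2: keep the otutable header
     $1 in isd { for (i = 2; i <= NF; i++) sum[i] += $i; next }
     { print }                                  # non-ISD rows pass through
     END { printf "ISD_combined"
           for (i = 2; i <= NF; i++) printf "\t%d", sum[i]
           print "" }' ~/matches.b6 otutable > otutable_ISDcombined.txt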

Identify coligos. One needs to go through the ESV file and figure out which sequences are coligos. Here is an example script and input files to do this.

Attached file: someESVs.fa
Attached file: coligoSequences.csv
Attached file: coligo_ISD_exploration.R

Project naming and merging

We want otutables for each project, but the ‘projects’ that were named in the MIDs file might have oversplit the samples into too many projects. Projects that were split across sequencing runs 1 and 2 might have 16S in one run and ITS in the other, in which case an otutable would not need to be recalculated on a merged set of sequences. The projects we encountered are below in the first table. In the second table, Alessandra Ceretto, Gordon Custer, Ella DeWolf, Reilly Dibner, Abby Hoffman, and Macy Ricketts: can you please provide guidance on any simplification of names for ‘projects’?

Run 1 / Run 2 projects:

AH2, Cheatgrass, control, fire_19, fire_19, GSG1, HerbPT1, HerbPT2, Invasives, Lake, LAKE, MACY1, MACY2, MACY3, MACY4, MACY6, MICROPLOT, MP001, MP002, MP003, MP004, MP006, sm19-arch, sm19lily, sm19lily, sm19-temp, snow18, snow18, snow19, snow19, snow20, snow21, snow22, snow23, snow24, snow25, snow26, snow27, snow28, snow29, snow30, snow31, snow32, snow33, snowfield, snowfield, test, test, unknown, yermo_19, yermo_19

Simplifications that are possible:

Scientist | Project | Project name in run 1 | Project name in run 2

Dibner | Project fire_19 is all soils collected from March 2019 through October 2019 around the Chimney Park area. These soils came from areas that were burned severely (SF), experienced lighter understory fire (UF), or were not burned (NB). I’m looking to identify differences in assemblages of microbes across these treatments and over time. | fire_19 | fire_19

Dibner | Project yermo_19 includes bulk soils from a site containing a threatened plant, all within a broader sagebrush ecosystem. Rhizosphere samples from the same study are forthcoming and not included in these runs, so analyses of this project are lower priority than fire_19 and others. | yermo_19 |

...

making 220 R1 files and 220 R2 files, with structured names (e.g., for the R1 set):

/gscratch/buerkle/psomagen_17sep20_novaseq2/rawdata/novaseq2_R1_000.fastq
/gscratch/buerkle/psomagen_17sep20_novaseq2/rawdata/novaseq2_R1_001.fastq
/gscratch/buerkle/psomagen_17sep20_novaseq2/rawdata/novaseq2_R1_002.fastq
/gscratch/buerkle/psomagen_17sep20_novaseq2/rawdata/novaseq2_R1_003.fastq etc.

run_parse_count_onSplitInput.pl also writes to /gscratch.

NovaSeq2_DemuxJH.csv was exported from a Google Sheet notebook and further edited to correct some assignments of project names. This file is used to map MIDs to sample names and projects.

Splitting

The info lines for each read in parsed_NovaSeq2*_R1.fastq and parsed_NovaSeq2*_R2.fastq have the locus, the forward mid, the reverse mid, and the sample name. These can be used with NovaSeq2_DemuxJH.csv to separate reads into loci, projects, and samples, in the folder sample_fastq/. The reads are in separate files for each sequenced sample, including replicates. The unique combination of forward and reverse MIDs (for a locus) is part of the filename and allows replicates to be distinguished and subsequently merged.

run_splitFastq_fwd.sh and run_splitFastq_rev.sh run splitFastq_manyInputfiles.pl, which steps through the many pairs of files to split reads by sample and project, and place them in /project/microbiome/data/seq/psomagen_17sep20/rawdata/sample_fastq/.

splitFastq.pl and splitFastq_manyInputfiles.pl will need tweaking in the future, until sample names and the format of the key for demultiplexing and metadata stabilize. The number of columns has differed among our three completed sequence lanes.

Calculate summary statistics on reads

In a sequence library’s rawdata/ directory (e.g., /project/microbiome/data/seq/psomagen_26may20/rawdata) run:

module load swset/2018.05 gcc/7.3.0 usearch/10.0.240

aggregate_usearch_fastx_info.pl

Merge reads, filter, and (optionally) trim primer regions

In /project/microbiome/data/seq/psomagen_17sep20/tfmergedreads, we used run_slurm_mergereads.pl to crawl all of the project folders and sample files (created in the splitting step above) to merge read pairs, filter based on base quality, and optionally trim primer regions from the reads. We are now not trimming primer regions, even though we found that we could not readily eliminate low-frequency, potentially spurious OTUs downstream by masking with vsearch; it would not properly mask these regions (by marking them with lower-case bases) and then soft-mask them when we run vsearch --cluster_size. This step writes a new set of fasta files (rather than fastq) for each sample and project, to be used in subsequent steps. These files are found in the 16S/ and ITS/ folders in tfmergedreads/. Statistics on the initial number of reads, the number of reads that merged, and the number of reads that remain after filtering are in filtermergestats.csv in each project folder. For the full lane these summaries were concatenated in tfmergedreads/ with

cat */*/filtermergestats.csv > all_filtermergestats.csv

I used commands.R in that folder to plot the number of reads per sample (horizontal axis) against the number of reads that were removed because they did not merge or did not meet quality criteria and were filtered out (vertical axis). Purple is for 16S and orange is for ITS. It might be interesting to make that plot for each of the projects in the library (TODO), and possibly to add marginal histograms (or put quantiles on the plots).

Find unique sequences, cluster, and count OTUs

We ran a single script for each library that generated otu tables for 16S and ITS. These are found in the otu/ directory for each library. The script we ran is also there: e.g., /project/microbiome/data/seq/psomagen_6mar20/otu/run_slurm_mkotu.pl

This script crawls all of the projects (and loci) in the tfmerged/ folder, concatenates all input sequence files, and passes them through a UNIX pipe for analysis (rather than making another copy of all the sequence data). The key part of the script is the job that is created for each project:

Code Block
$job .= "cd $allprojectsotu{$tfmdirectory};\n"; # this is the otu directory
$job .= "cat $tfmdirectory/*tfmergedreads.fa  | sed -e 's/\\s/_/g' | sed -e 's/^>16/>rna16/' | vsearch -derep_fulllength - -threads 32 --output uniqueSequences.fa --sizeout;\n";
$job .= "vsearch --cluster_unoise uniqueSequences.fa --sizein --consout zotus.fa --minsize 12 --id 0.99 --relabel 'Otu' --iddef 2;\n";
$job .= "vsearch --uchime3_denovo zotus.fa --chimeras chimera_out.fa --nonchimeras zotus_nonchimeric.fa --threads 32;\n";
$job .= "cat $tfmdirectory/*tfmergedreads.fa  | sed -e 's/\\s/_/g' | sed -e 's/^>16/>rna16/' | vsearch --usearch_global - --db zotus_nonchimeric.fa --otutabout 'otutable' --id 0.99;\n";

These steps can be repeated by individual users with different input data and vsearch options, as desired; see the instructions with examples above. With the settings we used, each project took only a few hours to process.

As noted above, we are not trimming primer regions, so there are potentially some spurious OTUs/ESVs introduced by sequence variation in the primer regions.


Quality control

Some suggestions for QC

...