Sequencing depth and uncertainty

A common question in sequencing concerns the diminishing returns of sequencing the same material to ever greater depth. At what point has our uncertainty about the frequency of a component (an allele at a locus, or a taxonomic unit in an environmental sample) diminished sufficiently, given the trade-off between sequencing this locus more deeply and sequencing more samples or genomic regions?

We can simplify this problem to estimating the frequency of one allele or taxon and the frequency of everything else (the complement). This binomial categorization is a simplification of the multinomial and has everything we need. For this illustration, and in general for estimating the frequency of one component of a composition, nothing is gained by considering the frequencies of the other categories, because the count for any single category of a multinomial is itself binomially distributed.
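To make the model concrete, here is the standard binomial result written out. The symbols are my own notation for this sketch and are assumptions, not something defined elsewhere in the post: n is the depth at the locus, x the number of reads supporting the component of interest, and p its true frequency. It also assumes that reads are independent draws.

```latex
x \sim \mathrm{Binomial}(n, p), \qquad
\hat{p} = \frac{x}{n}, \qquad
\mathrm{Var}(\hat{p}) = \frac{p(1-p)}{n}, \qquad
\mathrm{SE}(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}
```

Depth enters only through 1/n, so the standard error falls with the square root of depth rather than with depth itself.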

The greatest absolute uncertainty exists when the category of interest is at frequency 0.5, so I present that as an upper bound in the plots below. I also illustrate the uncertainty for a frequency of 0.05, and the diminishing returns on sequencing at greater depth for that frequency. For lower frequency components the absolute uncertainty would be even smaller, although it becomes larger relative to the frequency itself.
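As a quick numerical check on the two frequencies discussed above, the following minimal sketch tabulates the binomial standard error of the observed frequency at a few depths. It is an illustration under stated assumptions, not the plotting code: the function name `proportion_standard_error` and the list of depths are hypothetical choices of mine, and reads are assumed to be independent draws.

```python
import math


def proportion_standard_error(p, depth):
    """Standard error of an observed frequency under a binomial model.

    p     : assumed true frequency of the allele or taxon
    depth : number of reads covering the locus (independent draws assumed)
    """
    return math.sqrt(p * (1.0 - p) / depth)


# Hypothetical depths, chosen only for illustration.
depths = [10, 100, 1000, 10000]

for p in (0.5, 0.05):
    for depth in depths:
        se = proportion_standard_error(p, depth)
        # SE/p expresses the uncertainty relative to the frequency itself.
        print(f"p={p:<5} depth={depth:>6} SE={se:.4f} SE/p={se / p:.3f}")
```

At frequency 0.5, going from depth 100 to depth 10000 takes the standard error from 0.05 to 0.005: a hundredfold increase in sequencing buys only a tenfold reduction in uncertainty.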

Here is the code should you want to play with this.