This is the fifth in a series of high-level posts reviewing foundational concepts and technologies in genomics. The first four were: “How Many Genes Does a Person Have,” “How do Genomes Vary, Person to Person,” and “Sanger Sequencing,” and “Sequencing by Synthesis.“
The “depth” of a genome sequence refers to the number of times that we observe each location on the chromosome. A “30x whole genome,” means that we have on at least, or on average, or for at least one location (check the fine print) gotten 30 observations on some or all of our ~3.2 billion chromosomal locations. This has emerged as a sort of industry standard. This post explores how that might have happened and whether it’s still the right answer.
There are two main pressures that drive us to sequence deeper:
- It drives down error rates
- At great depths we can reliably detect trace quantities of tumor (liquid biopsy), fetal (non-invasive prenatal tests), or other non germ-line DNA.
The cost of instrument time and reagents creates an obvious pressure to minimize, or at least optimize, sequencing depth. The costs of data storage and analysis act more subtly in the same direction.
Compute requirements for primary bioinformatic analysis (alignment) scale linearly with depth, and memory requirements can grow even faster than that in comparative (secondary and sometimes tertiary) analyses. I have seen more than a few mutational analysis pipelines crumble when teams casually swapped in deeper sequences with way more reads. The files are bigger, you should expect any particular process to take longer. Comparative processes go geometric on RAM.
How much disk space does a 30x genome take up?
Since I mentioned data volumes, here is how to derive approximate file sizes for your sequencing data. I’ve gotten paid to answer this question so many times, it’s time to give it away for free.
Each base pair consists of a call – a letter in the four letter alphabet (G, A, T, C) and an integer for the quality score. Our starting point for estimating data volumes is to represent the call with two bits (four possibilities) and the quality score as an 8-bit integer. That gives us 10 bits per base: 1.25 bytes per base.
Yes yes, we could do better than 8 bits for that integer – especially since the only thing that really matters at the end of the day is whether it’s above or below a particular threshold. More on that later, and figuring out why the industry has not harvested that particular low hanging fruit is left as an exercise for the reader.
Anyway. There are, give or take, 3.2×109 (3.2 billion) chromosomal locations in the human genome, so each 1x whole genome ought to take around 4×109 bytes. Thirty of them (a 30x WGS) ought to take up around 1.2×1011 bytes – approximately 120 gigabytes. I say approximately because our megabyte / gigabyte notation is a little weird.
Conventional metric prefixes for large numbers are defined around powers of 10:
- 103 kilo
- 106 mega
- 109 giga
- 1012 tera
- 1016 peta
Data storage was historically defined in terms of powers of two (for reasons). This approximates but does not equal the powers of ten.
- 210 or 1,024 bytes in a kilobyte
- 220 or 1,0242 or 1,048,576 bytes in a megabyte
- … and so on
All the standards bodies (IEEE, ISO, the EU, and so on) agreed in the 90s and early 2000s that this was deeply confusing, so they decreed that data volumes shall be defined just like every other metric prefix – in terms of powers of ten. The power of ten based units get the old dignified suffixes (KB, MB, GB, and TB) while the powers-of-two way get a silly sounding series (Kibibyte, Mebibyte, Gibibyte, and Tebibyte … with silly suffixes KiB, MiB, GiB, TiB).
This declaration has, of course, changed no hearts and no minds, so now those of us who are aware of the situation just specify both.
So to be really persnickety, 1.2×1011 bytes is 111.75 GiB, which is equal to the 120GB I quoted above.
As a side note, you should never ever ever use “GB” by itself in bioinformatics or genomics and expect people to understand what you meant. It’s easily confused the “giga base pairs,” which is helpfully abbreviated “GB” by the sequencing folks. Always spell it out, whether its gigabytes or gigabases, and speak not of tebibases. This is forbidden.
Real talk about redundant storage
Nobody actually has 120GB sitting around on disk for each of their 30x whole genome sequences. Most teams have somewhere between two and six times that amount, with some but not all of it compressed.
Two to sixfold redundancy. Why? Because we are, collectively, idiots.
When data comes off an Illumina sequencer, it lands in a “run folder” in a precursor format (BCL, for “base call”) where data is organized cycle by cycle from the sequencing reaction. There is one and only one thing that anybody ever does with a run folder, which is to put it through a tool called “bcl2fastq” which converts BCL to FASTQ (get the name?).
Despite the fact that there is no value added between BCL and FASTQ, many organizations retain their run folders forever. If you keep one copy, that’s one whole extra copy of the data. If you also back up that one extra copy to a redundant site / geography (which is not usually a bad idea) … that’s two extra copies.
Seriously, delete your BCLs, they’re an instrument artifact. Yes, I am familiar with the various regulations around clinical data. They don’t really apply here. FASTQ was the landing point and experimental result. BCL is an accident of history that Illumina is working on fixing by putting base calling inside the instrument.
FASTQ files are organized by read, with all the base calls with the same molecular bar code grouped together along with their associated quality scores. FASTQ and BCL contain the same information content and will compress to approximately the same size.
There are three things we ever do with FASTQ reads:
- Align them to a reference genome
- Perform a reference free alignment of the reads against each other.
- Align them to an exome and count up transcripts to try to estimate prevalence and hope it maps to protein expression.
The output of alignment is a Binary AlignMent or BAM file. The BAM contains all the same information content (base calls and quality scores) as either the FASTQ or BCL (assuming that you are not an idiot and include the un-aligned reads), but it sorts them into a common frame of reference suitable for downstream use. BAMs are generally -already- compressed by the various tools used to create them. They may superficially appear smaller than the FASTQs and the BCLs … but once you compress the precursor formats you will find that Shannon was right and they’re all the same size at the end of the day.
There’s your third, duplication. Six if you’re making redundant copies of all of this.
Related: I keep hearing people ask whether or not domain expertise matters in either data science or IT. I assure you that domain expertise matters very much in both domains. Factor of six, I say.
There are several tools that do really good domain specific compressions of FASTQ and BAM. Illumina is pushing the Original Read Archive (ORA) format as part of their DRAGEN hardware accelerator package. PetaGene has some truly impressive results (in addition to other features around cloud transparency). Go ahead and check those out, but also remember the big picture thing. Compression is great, de-duplication is even better, and don’t get me started on unmapped reads.
Fun fact: The Q in FASTQ stands for quality, but the rest of the letters don’t stand for anything at all, at least so far as I know.
Back to Sequencing depth
Illumina’s quality metric, ‘Q’ is a function of the estimated error rate, which is defined as Q = -10log10(e). At Q20, we expect an error rate of one in 100. Most of us use a hard threshold of Q30 for base calling, leading to an expected rate of incorrect calls of 0.1%, one in a thousand or 10-3.
If we observe each locus on the chromosome once, we expect 3.2×109 x 10-3 = 3.2×106, or a little over 3 million errors. As an engineering mentor was fond of saying – “even a very small thing, if it happens many many times over, bears watching.”
Our second observation produces an additional 3 million errors, but the vast majority of them will be in places where we previously had a good read. That gives us 6.4×106 (six and a half million) loci with a single error and 3.2×109 x (10-3)2 = 3.2×109 x 10-6 = 3.2 x 103 = 3,200 locations where -both- observations are incorrect.
Individual errors accumulate linearly while the sites with overlapping errors drop geometrically. We use this as a sort of filter. When we’re confident that all sites have enough good reads – we can stop.
For locations with multiple errors, we have no guarantee that the incorrect base call will be identical from read to read. Sometimes we get the same wrong answer several times in a row, and sometimes not. If we assume that errors at any location are evenly distributed across the three possible incorrect bases (why not?), at 2x sequencing we expect about 1,100 locations where we have what looks like a very confident base call that is actually two errors stacked on top of each other.
A first plateau
The lowest depth where we expect less than one location to be entirely composed of errors is 4x. 3.2×109 x (10-3)4 = 3.2×109 x 10-12 = 3.2×10-3. However, that’s not good enough because we will be left with 3.2×109 x (10-3)3 = 3.2×109 x 10-9 = ~3 locations where three out of four of the reads were errors. Sure, we’ll recognize that something was up at that site (most of the time, unless the errors align), but we’re interested in confident calls at all loci – not merely turning our unknown unknowns into known unknowns.
The lowest depth where we expect less than one location where most or the majority of the reads will be errors is 7x. At 3x, we expect ~3 locations where it’ll be a tie. We need 4x to have a majority.
Recall though, that humans have two copies of each of our chromosomes, meaning that two different observations at the same supposed location might be entirely correct and consistent with the underlying biology. While laboratory techniques do exist to do “phased” sequencing that tags and tracks the two copies of each chromosome independently, they are expensive and introduce still more possible errors.
As an aside, overcoming the phasing gap right on the sequencer without complicated barcoding is one of the most underrated aspects of long read sequencing like Oxford Nanopore and Pac Bio.
Simplistically, you might expect to have to double the sequencing depth (all the way up to 14x) in order to confidently sample twice as much chromosomal terrain. Of course, most loci are homozygous, so you could bring that number back down again if you had a really accurate estimate of the rate of heterozygous sites. This would require deeper sequencing first, on a very large number of individuals, in order to have sufficient confidence in the frequencies of heterozygous site.
It never ends.
Here there be dragons
Beyond this point the math gives way to big round numbers and hand waving.
The relative abundance of different reads at a heterozygous location is important in quality control. If we have a site where half of the observations say “G” and the other half say “T”, that matches our understanding of the biology in most people. On the other hand, if we see a distribution other than 50/50 at certain sites, particularly if it’s 25% / 75%, that is a strong signal of contamination. Four copies of a chromosome rather than two means contamination.
Or maybe tumor, fetal, or transplant DNA – as mentioned above. It never ends.
Similarly, the abundance of reads at a particular location can be used to infer the number of repeats or length of copy number variants, and can even provide clues about large scale re-arrangements.
Either way, we need far more than confidence that we’ll make any particular call correctly – we need to have the error rates all the way down in the noise.
So, like, 30x? For research? And like maybe 100x for clinical?
I know people who do this sort of math for a living, and I know that they would cringe at my rough and ready arithmetic.
It’s not wrong though, and seriously – do check on that whole “mean vs. median coverage and over what fraction of the genome.” There’s some serious dirt under that rug in certain shops.