This is the fourth in a series of high-level posts reviewing foundational concepts and technologies in genomics. The first three were: “How Many Genes Does a Person Have,” “How do Genomes Vary, Person to Person,” and “Sanger Sequencing.” This one is about high throughput DNA sequencing, focusing on Illumina’s Sequencing by Synthesis (SBS) technology.
For the last decade or so, the market for high-throughput DNA sequencing instruments has been utterly dominated by a single company: Illumina. Their “Sequencing By Synthesis” (SBS) approach was originally commercialized by Solexa, who launched the Genome Analyzer in 2006. Illumina acquired Solexa in 2007, and all of their instruments – both the lower throughput MiSeq and the higher capacity HiSeq and NovaSeq – have used variations on the same fundamental process. While other high throughput technologies have made significant inroads in the early 2020s, anybody working with sequence data should be familiar with the fundamentals of SBS.
Two keys to creating high-throughput laboratory processes are multiplexing and miniaturization. We want to concurrently run many different reactions in the same physical container (multiplex), and we also want to use as little stuff (molecules, reagents, energy) per reaction as we possibly can (miniaturize). Scaling up this way usually replaces one set of problems with another. While low throughput processes struggle with the level of effort and cost per reaction, higher throughput processes tend to be more complex. Batch effects and subtle errors inevitably creep in.
It was the combined benefits of miniaturization and multiplexing that drove the radical increase in DNA sequencing capacity and adoption of the early 2000s. High throughput technologies – mostly SBS – meant that sequencing was suddenly both 1,000-fold cheaper per base pair and also 1,000-fold faster per reaction. This caused the industry to, briefly, accelerate “faster than Moore’s law.” It’s important for industry watchers to realize that this acceleration was a one-time thing, driven by specific advances in sequencing technology. Single molecule sequencing technologies have the potential to drive a similar change in the next few years by exploiting still further miniaturization (one molecule per read!) combined with exceptionally long read lengths.
The core of SBS technology is the flowcell – a specially prepared piece of glass and plastic slightly larger than a traditional glass microscope slide. Each flowcell has one or more lanes (physical channels) that serve, in an exceptionally broad sense, the same function as the capillary tubes from Sanger sequencing. Both flowcell lanes and capillary tubes are single-use containers into which we put prepared DNA, run a reaction, and from which we read out results. In Sanger sequencing, we prepare millions to billions of copies of the same fragment of DNA, synthesize it in a variety of lengths, and read out the results by sorting those lengths by size. Sanger sequencing gives us one read (the order of the residues in a contiguous stretch of DNA) per capillary tube – using millions to billions of molecules to do it. In SBS, we load millions to billions of different fragments of DNA (all of about the same length, and tagged in special ways as described below), copy each of them in place up to about a thousand times, and then simultaneously generate millions of reads using high resolution imaging.
In order to spread out and immobilize the DNA fragments so we can keep track of them, the bottom surface of each lane in a flowcell is coated with a “lawn” of oligonucleotide primers. These hybridize with matching reverse complement primers appended to both ends of the DNA fragments to be sequenced. In a sense, these primed locations are the real miniaturized containers for sequencing, and the flowcell – despite its small size relative to human hands – is just a high capacity container.
Most DNA sequencing technologies require consistently sized fragments of DNA as input. This is accomplished through a process called fragmentation. In enzymatic fragmentation, enzymes are used to cut (“cleave”) the DNA at random locations. By carefully controlling the temperature and time of the reaction, it is possible to achieve consistent fragment lengths. The alternative is sonic fragmentation or sonication, which uses high frequency vibrations to break longer molecules into pieces of roughly the desired length. Sonication is less sensitive to variations in timing and temperature, but requires additional manipulation of the samples and dedicated instrumentation.
Whatever approach is used, variability in fragment lengths leads to erratic performance of all downstream steps in the process.
Often, we want to sequence only a subset of the genome. Whole Genome Sequencing, where we sequence everything, is the exception. For example, in an Exome, we might want to sequence only the exons that code for proteins. Panels are even more selective, picking out just a few actionable genes and regions of clinical or research interest. Just as with Sanger sequencing, manufactured oligonucleotide primers or baits are used to capture and select out exactly the bits we want while the rest are washed away. A set of baits developed for a particular purpose is referred to as a capture kit, and the process of selecting out the targeted DNA is somewhat casually called capture.
The attentive reader will notice a substantial amount of chicken vs. egg in this process. We’re sequencing because we don’t know the full DNA sequence present in the sample, and targeting our efforts using techniques that actually assume quite a lot of that same knowledge. In a general sense, both things can be true. Genomes are remarkably consistent from person to person and every exon contains highly conserved regions. However, for any sort of detailed analysis, particularly when working with rare or complex variants, it is important to keep in mind the layered stacks of assumption and biases that go into the data. Also bear in mind that at this point in the process we are still talking about strictly chemical manipulations. We are nowhere near alignment and mapping, which are the algorithmic reconstruction of longer sequences out of shorter ones.
Primers, Molecular Bar Codes, and Multiplexing
Having achieved a consistently sized collection of DNA fragments, and having washed away all the bits that are not of interest, it’s time to affix the primers mentioned above, as well as unique labels so we can tell one sample from the next after mixing them together (multiplexing). These indexes or molecular bar codes (more manufactured oligos) are snippets of DNA with well known sequences that will be unique within any particular lane on a flowcell. They get sequenced along with the samples, and are then used to computationally de-multiplex the reads.
Got that? We apply a physical tag to each fragment, sequence it, and then sort it out digitally on the back end.
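To make the "sort it out digitally" step concrete, here is a minimal sketch of de-multiplexing in Python. It assumes a simplified layout in which the index is simply the first six bases of each read; on real instruments the index is typically read separately, and production tools also tolerate mismatches. The sample names, index sequences, and the `demultiplex` function are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample sheet: index sequence -> sample name.
SAMPLE_INDEXES = {
    "ACGTAC": "sample_A",
    "TGCATG": "sample_B",
}
INDEX_LENGTH = 6  # assumption: every index has the same length


def demultiplex(reads, sample_indexes=SAMPLE_INDEXES, index_length=INDEX_LENGTH):
    """Group reads by their leading index; unknown indexes go to 'undetermined'."""
    bins = defaultdict(list)
    for read in reads:
        # Split the read into its bar code and the sample-derived sequence.
        index, insert = read[:index_length], read[index_length:]
        sample = sample_indexes.get(index, "undetermined")
        bins[sample].append(insert)
    return dict(bins)


reads = [
    "ACGTACGGATTACA",
    "TGCATGCCCGGGTT",
    "ACGTACTTTTGGGG",
    "NNNNNNACACACAC",  # index could not be read, so it matches no sample
]
demuxed = demultiplex(reads)
for sample, inserts in demuxed.items():
    print(sample, inserts)
```

Exact matching keeps the sketch short. The catch-all "undetermined" bin is where reads with damaged or unexpected indexes would land.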
Adding additional utility sequence to the DNA (like indexes) winds up as an information tradeoff. We can obtain more and more specific information about the history of a particular chunk of DNA by using some of the base pairs on every read as bar codes and tags rather than reading the sample itself. Highly multiplexed technologies like spatial transcriptomics use multiple layers of bar codes – reading less and less sequence from the original sample (which is presumably less important) in exchange for more and more detailed information about where it came from.
In any event, after all of this tagging, fragments from multiple samples are mixed together and loaded onto a lane of a flowcell. As mentioned above, the primers on the ends of the fragments anneal to the oligos on the surface of the flowcell – hopefully resulting in an evenly dispersed lawn of DNA – all attached at one end.
The imaging devices used in modern Illumina instruments are not sensitive enough to consistently detect the fluorescence events from single molecules. To overcome this, a physical process called clustering is used to create a group of identical copies around the original fragment that attached at a particular location on the flowcell. Clustering starts by denaturing (separating) the DNA strand, which exposes the primer that was added to the free end. One copy (the one that is not bound to the flowcell) is detached and washes away. The molecule then flexes over to form an arch, binding to one of the nearby primers on the surface of the flowcell. Nucleotides are washed over the flowcell to create a matched pair to this doubly bound strand (a “synthesis reaction”), and then the DNA is denatured yet again, yielding two adjacent single-stranded fragments in reverse complement from each other. This process is repeated, building up clusters of sufficient numbers of molecules to be reliably detected.
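The arithmetic of clustering is easy to sketch. Under the idealized assumption that every strand is copied in every round, the cluster doubles each time. The `efficiency` parameter and the target of 1,000 copies below are illustrative choices of mine, not instrument specifications.

```python
def copies_after(rounds, efficiency=1.0):
    """Expected strands in a cluster after `rounds` of bridge amplification.

    `efficiency` is a hypothetical per-round probability that a strand is
    successfully copied; 1.0 means perfect doubling.
    """
    return (1.0 + efficiency) ** rounds


def rounds_needed(target_copies, efficiency=1.0):
    """Smallest number of rounds whose expected copy count reaches the target."""
    if efficiency <= 0:
        raise ValueError("efficiency must be positive or the cluster never grows")
    rounds = 0
    while copies_after(rounds, efficiency) < target_copies:
        rounds += 1
    return rounds


print(rounds_needed(1000))        # perfect doubling
print(rounds_needed(1000, 0.5))   # only half the strands copy each round
```

Even a modest drop in per-round efficiency noticeably increases the number of rounds needed, which is one way to see why cluster quality varies across a flowcell.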
It’s important to remember that this is -still- not a digital manipulation or a computer program. We’re dealing with massive numbers of molecules that all just do their thing in solution, with results that are predictable only in aggregate. Some DNA fragments will inevitably bind close enough to each other that their clusters will interfere. Others will be lost entirely during the series of denaturing and rebuilding steps. Some synthesis reactions will incorporate errors – substituting, omitting, or repeating stretches of one or more nucleotides.
Finally, we come to sequencing. This part of the process has a lot in common with Sanger sequencing, since it relies on the controlled addition of fluorescently labeled, chain-terminating nucleotides. The key chemical difference is that Sanger uses di-deoxynucleotides (ddNTPs), which terminate the chain permanently, while SBS uses reversible terminators whose blocking group can be removed. Unlike the Sanger process, which builds out all the various possible lengths of DNA fragment simultaneously and sorts them by size, SBS proceeds in highly regimented cycles, one base at a time. Each cycle washes a round of fluorescent reversible terminators over the flowcell, and every growing strand incorporates the one nucleotide that complements the first open position on its template. The flowcell is then illuminated with a laser, and a high resolution image is captured that hopefully shows each of millions of clusters glowing in one and only one of the four colors associated with the four types of labeled nucleotide. The fluorescent label and the blocking group are then cleaved and washed away, leaving every strand extended by exactly one letter and ready for the next cycle.
The short form is that each cycle gives us information on a single nucleotide, in the same position, from each of millions of fragments of DNA, all at the same time.
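As a rough illustration of "one base per cycle, for every cluster at once," the sketch below turns a stack of per-cycle intensity measurements into reads. It is a toy model, not how any vendor's base caller works: the channel order, the `min_purity` threshold, and the function names are my own assumptions, and real base callers also correct for things like color cross-talk and signal decay.

```python
CHANNELS = "ACGT"  # assumed channel order; one fluorescent color per base


def call_base(intensities, min_purity=0.6):
    """Call the brightest channel, or 'N' if no channel clearly dominates.

    `intensities` holds four signal values in CHANNELS order.
    `min_purity` is a hypothetical threshold on brightest / total signal.
    """
    total = sum(intensities)
    if total <= 0:
        return "N"
    brightest = max(range(len(CHANNELS)), key=lambda i: intensities[i])
    if intensities[brightest] / total < min_purity:
        return "N"
    return CHANNELS[brightest]


def call_reads(cycles):
    """Build one read per cluster from a list of per-cycle images.

    `cycles[c][k]` holds the four intensities of cluster k at cycle c,
    so each cycle contributes exactly one base to every cluster's read.
    """
    reads = [""] * len(cycles[0])
    for image in cycles:
        for k, intensities in enumerate(image):
            reads[k] += call_base(intensities)
    return reads


# Three cycles, two clusters. The second cluster gives a mixed signal in
# cycle two (for example, because of a neighboring cluster).
cycles = [
    [(0.9, 0.0, 0.1, 0.0), (0.0, 0.1, 0.0, 0.9)],
    [(0.1, 0.8, 0.1, 0.0), (0.4, 0.1, 0.4, 0.1)],
    [(0.0, 0.0, 0.9, 0.1), (0.0, 0.9, 0.1, 0.0)],
]
reads = call_reads(cycles)
print(reads)
```

The ambiguous cycle comes out as `N` rather than a guess, which mirrors the "hopefully one and only one" caveat in the description above.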
Illumina sells reagent kits for 50, 100, and 150 cycle runs, each of which can be either single-end or paired-end. In either case, the DNA is first sequenced from the 3′ end. In paired end sequencing, the DNA is then allowed to arch over and anneal at the 5′ end (as happened during clustering) and sequenced 5′ to 3′ for an -additional- 50, 100, or 150 base pairs per fragment. Paired end sequencing provides a critical additional piece of information for downstream informatics, since (assuming that our fragmentation worked well) we know the relative orientation and distance between the two reads.
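The orientation and distance relationship between the two reads can be sketched in a few lines. This is a simplified model that ignores adapters and indexes; the function names and the 20-base example fragment are purely illustrative.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read in its own direction."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment, read_length):
    """Return (read1, read2) for one fragment; read_length must be positive.

    read1 is the first `read_length` bases of the fragment. read2 is the
    reverse complement of the last `read_length` bases, i.e. the same
    number of bases read inward from the other end on the opposite strand.
    """
    read1 = fragment[:read_length]
    read2 = reverse_complement(fragment[-read_length:])
    return read1, read2


def unsequenced_gap(fragment_length, read_length):
    """Bases between the two reads; a negative value means they overlap."""
    return fragment_length - 2 * read_length


fragment = "AAAACCCCGGGGTTTTACGT"
read1, read2 = paired_end_reads(fragment, 5)
print(read1, read2, unsequenced_gap(len(fragment), 5))
```

Because the two reads point toward each other and the fragment length is roughly known, an aligner can check that a pair lands in the expected orientation and about the expected distance apart.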
One cycle on a modern instrument takes around 10 minutes, which leads to runtimes between roughly 11 hours (50 cycles, single end) and 48 hours (2 × 150 = 300 cycles, paired end).
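A back-of-the-envelope version of that arithmetic is below. It counts only the sequencing cycles themselves, so it will not reproduce the quoted runtimes exactly. Real runs also include steps like clustering and index reads, and the 10 minute figure is itself approximate. The function name is my own.

```python
MINUTES_PER_CYCLE = 10  # the rough per-cycle figure quoted above


def sequencing_hours(read_length, paired_end=False, minutes_per_cycle=MINUTES_PER_CYCLE):
    """Hours spent on sequencing cycles alone, ignoring any fixed overhead."""
    total_cycles = read_length * (2 if paired_end else 1)
    return total_cycles * minutes_per_cycle / 60


for read_length, paired in [(50, False), (150, True)]:
    hours = sequencing_hours(read_length, paired)
    layout = "paired end" if paired else "single end"
    print(f"{read_length} bp, {layout}: about {hours:.1f} hours of cycling")
```

The useful takeaway is the scaling: runtime grows linearly with read length, and paired end runs double the number of cycles.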
The result of all this work is a stack of images which still needs to be computationally processed in order to produce reads. In the early days of high throughput sequencing, the raw images had to be offloaded from the instrument and processed (base calling) via a separate computing environment. At the time, there were endless conversations about whether there was enough potential value to merit retaining incremental data taken during the sequencing process – including the raw CCD images. The hope was that future algorithms might allow us to rescue marginal data or to detect errors more accurately. These images were huge, and the memory of their size still distorts some estimates of the scale of the data “problem” in genomics – mostly among certain data storage vendors who still can’t be troubled to know the difference. As costs have come down and volumes have increased, those conversations have tapered off. Most modern instrument vendors no longer even offer the option of downloading raw CCD images. Base calling now happens on-instrument, and the bioinformatic processes start with demultiplexing – sorting the reads out sample by sample.
I hope to cover the next couple of steps in the standard process – primary and secondary bioinformatic analysis – in a future post.