This is the third post in a series reviewing basic concepts in genomics. The first one was titled “How Many Genes Does a Person Have?” and the second was “How do Genomes Vary, Person to Person?” This is a breezy summary of the Sanger method of DNA sequencing. I plan to cover short-read massively parallel sequencing (which is occasionally still called “next-gen”) and long-read single-molecule sequencing (which should never under any circumstances be called “next-next-gen”) in subsequent posts.
Before getting started, I would like to pause for a quick digression on the importance of the attraction between opposites, at least as far as DNA sequence is concerned. It is hard to overstate the utility and power of the highly specific affinity between complementary strands of nucleic acids. This attraction, combined with the sheer intuition-defying number of molecules present in even very small reaction volumes, can come to seem like a magical, near-universal tool to manipulate DNA.
For example, we often want to grab a strand of DNA at some particular location. As discussed in previous posts, the concept of “location” is slippery in genomics. There is no such thing as a coordinate system on the chromosomes, at least as far as the underlying biology is concerned. Instead, we settle for a location relative to some highly conserved sequence of nucleotides.
To go fishing for the DNA near such a location, we construct (or, more commonly, buy) an oligonucleotide (oligo for short) that spells out the reverse complement of the desired sequence. The oligo serves as a bait or a primer that will very selectively latch on to its target. The primer is attached to some molecular chemistry that allows us to hold it back when everything else is washed away. We add the bait to the DNA and raise the temperature until the paired strands separate or denature, making room for the interloper to cut in. After a bit of time and agitation, we lower the temperature again. Raising and then lowering the temperature to open and close the strands of DNA and allow some reaction to take place is called thermal cycling.
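For the computationally inclined, here is a minimal sketch of what "spelling out the reverse complement" means. The target sequence is made up for illustration, and the function assumes an uppercase string containing only A, C, G, and T.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
# Assumes an uppercase DNA string with no ambiguity codes.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the strand direction."""
    return seq.translate(COMPLEMENT)[::-1]

target = "ATGGCCTTAGC"             # hypothetical target site
bait = reverse_complement(target)  # the oligo we would order
print(bait)                        # GCTAAGGCCAT
```

Applying the function twice returns the original sequence, which is a handy sanity check.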
The uncanny affinity of DNA for its complementary sequence, combined with the remarkable and counterintuitively consistent behavior of truly vast numbers of molecules, means that when we wash away everything that did not take the bait, we are left with a highly purified solution of all and only the DNA containing a match to our desired sequence.
To me, at least, this is pretty remarkable. We’re talking about a technique that can find a very specific needle – some particular substring of DNA – in a haystack of 3.2 billion letters. The remarkable power of exponential math means that we only need an oligo of between 18 and 22 letters to accurately isolate most locations in the human genome, though of course repetitive and low-complexity sequences are always problematic.
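A rough back-of-the-envelope version of that exponential math, as a sketch. It assumes an idealized genome in which every letter is independent and equally likely, which real genomes are not (hence the trouble with repeats). The function names are mine, chosen for illustration.

```python
GENOME_SIZE = 3_200_000_000  # letters in the human genome, from the text

def min_unique_length(genome_size: int) -> int:
    """Smallest k for which the 4**k possible k-letter words outnumber the genome."""
    k = 0
    while 4 ** k <= genome_size:
        k += 1
    return k

def expected_chance_matches(k: int, genome_size: int = GENOME_SIZE) -> float:
    """Expected number of purely accidental matches to one k-letter oligo,
    counting both strands, under the random-genome assumption."""
    return 2 * genome_size / 4 ** k

print(min_unique_length(GENOME_SIZE))   # 16
for k in (18, 22):
    print(k, expected_chance_matches(k))
```

Under these assumptions, 16 letters is the bare minimum, and the expected number of accidental matches falls well below one across the 18 to 22 range.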
I started on the computational side of things, and it took years before I developed any intuition whatsoever about what was easy vs. hard in the lab, and the subtle ways that laboratory realities show up as biases and even errors in downstream data.
Sanger sequencing, named after Frederick Sanger (whose team developed it in the late 1970s), is one of the original methods for determining the sequence of a DNA molecule. There are a number of variations on the theme, but they all share a common core:
- Make a bunch of copies of some particular chunk of DNA, varying in length from one to about a thousand nucleotides
- Tag or label the molecules somehow so we can identify the final nucleotide (C, G, A, or T)
- Sort the molecules, shortest to longest, physically spreading them out on a gel or in a capillary tube
- Read the terminal nucleotides, shortest substrings first.
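The four steps above can be sketched as a toy program. This is an idealized illustration, not a model of the chemistry: it assumes exactly one fragment of every length and a perfect label on each one. The name `sanger_read` and the input sequence are hypothetical.

```python
import random

def sanger_read(synthesized_strand: str) -> str:
    """Recover a sequence from its terminated fragments, Sanger style."""
    # Step 1: one copy per possible length, i.e. every prefix of the strand.
    fragments = [synthesized_strand[:n]
                 for n in range(1, len(synthesized_strand) + 1)]

    # Step 2: all we can observe is each fragment's length and its
    # labeled final nucleotide.
    labeled = [(len(frag), frag[-1]) for frag in fragments]

    # The reaction tube is an unordered soup of molecules.
    random.Random(42).shuffle(labeled)

    # Step 3: sort shortest to longest, as the gel or capillary would.
    labeled.sort(key=lambda pair: pair[0])

    # Step 4: read the terminal nucleotides in order.
    return "".join(base for _, base in labeled)

print(sanger_read("GATTACA"))  # GATTACA
```

The point of the sketch is that length plus terminal label is enough information to reconstruct the whole sequence, even though no single fragment is ever read end to end.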
We accomplish the first two steps (making sub-strings of all possible lengths and identifying the terminal residue) with modified nucleotide precursors called di-deoxy (ddNTP) rather than the regular deoxy (dNTP) version that makes ordinary DNA. Because of its slightly different structure, ddNTP serves as a terminator and blocks any further nucleotides (either ddNTP or dNTP) from being added to the strand. Once a DNA molecule takes up a ddNTP rather than a dNTP, it becomes fixed in length (at that particular replication site), regardless of the availability of additional raw materials.
We use a mix of dNTP and ddNTP, setting the relative abundances so that approximately one in a thousand replication events will incorporate a ddNTP and stop. It is conceptually simple (though of course there are always details) to run the replication reaction until we have the desired mix of lengths, all of them terminated by a synthetic ddNTP.
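To get a feel for the mix of lengths this produces, here is a small simulation. It assumes every incorporation is an independent coin flip with the same one-in-a-thousand chance of being a ddNTP, regardless of which base is being added. That is a simplification of mine, not a claim about real polymerase behavior.

```python
import random

P_STOP = 1 / 1000  # chance that any single incorporation is a ddNTP, from the text

def prob_stopped_by(n: int, p: float = P_STOP) -> float:
    """Probability that a growing strand has terminated at or before base n."""
    return 1 - (1 - p) ** n

def simulate_length(rng: random.Random, p: float = P_STOP) -> int:
    """Extend one strand until a ddNTP is incorporated; return its final length."""
    length = 1
    while rng.random() >= p:   # a regular dNTP went in, so keep extending
        length += 1
    return length

rng = random.Random(0)
lengths = [simulate_length(rng) for _ in range(2_000)]
print(f"mean simulated length: {sum(lengths) / len(lengths):.0f}")
print(f"terminated within 1000 bases: {prob_stopped_by(1000):.1%}")
```

Under these assumptions the lengths follow a geometric distribution with a mean of 1/p, or about a thousand bases, and a bit under two thirds of strands terminate within the first thousand positions.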
Early versions of the Sanger method used radioactive labeling, which made the DNA molecules visible but did not allow observers to distinguish between the nucleotides. The synthetic C, G, A, and T molecules were run as four separate reactions and spread out as four distinct lanes of a rectangular gel, creating the classic “ladder” representation that has become visually synonymous with genomics.
Fluorescent ddNTPs that emit photons when illuminated with a laser were a later innovation. Fluorescent molecules can be created such that different ddNTP molecules shine at different frequencies, which allowed the reactions to be mixed together and sorted in a single capillary tube. Beyond reducing the amount of radioactive isotopes needing to be pipetted around the lab (nearly always a good thing), this substantially reduced the reaction volumes, which in turn allowed increases in speed and automation.
The vast majority of modern Sanger sequencing uses this technique: DNA molecules are terminated by fluorescent synthetic nucleotides and sorted by weight in capillary tubes. Most labs use a workhorse of an instrument called the ABI (for Applied Biosystems) 3730 that integrates and automates most of this process.
The length limitation, where we can only read about a thousand base pairs (though realistically it’s closer to 850), is derived from the shrinking ratio between the weight of a single nucleotide and total weight of the DNA molecule as the chain grows longer. Eventually, sorting by weight doesn’t spread the molecules out far enough to tell one position from the next. Read length is one of the most important properties that differentiates the various sequencing technologies.
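The shrinking ratio is easy to put numbers on. This sketch assumes every nucleotide weighs the same, so a fragment's weight is simply proportional to its length.

```python
def relative_step(n: int) -> float:
    """Fractional weight gained when a fragment grows from n to n + 1 bases,
    assuming all nucleotides weigh the same."""
    return 1 / n

for n in (10, 100, 850):
    print(f"{n:>4} -> {n + 1:>4} bases: {relative_step(n):.2%} heavier")
```

Going from 10 to 11 bases is a 10% change, while going from 850 to 851 is closer to 0.12%, which is why neighboring fragments eventually blur together.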
While Sanger sequencing is incredibly powerful and still used every day all over the world, it has some significant limitations, particularly at scale. The read length of ~850bp (base pairs) means that it would take several million reactions to read through the 3.2 billion locations on the genome. If we want to interrogate some particular region we need to construct or buy custom oligos to bait it out. If we choose to “shotgun” sequence instead – grabbing at random and using computational techniques to stitch things together – we need to do highly redundant reactions to achieve good coverage.
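A quick sketch of the arithmetic behind "several million reactions" and "highly redundant". The coverage estimate assumes shotgun reads land uniformly at random across the genome (the classic Poisson approximation), which is an idealization, and the function names are mine.

```python
import math

GENOME_SIZE = 3_200_000_000  # locations in the genome, from the text
READ_LENGTH = 850            # realistic Sanger read length, from the text

def reads_needed(coverage: float = 1.0) -> int:
    """Reads required so that total bases read equal coverage x genome size."""
    return math.ceil(coverage * GENOME_SIZE / READ_LENGTH)

def fraction_missed(coverage: float) -> float:
    """Poisson estimate of the chance that a given location is hit by no read,
    assuming reads are placed uniformly at random."""
    return math.exp(-coverage)

for c in (1, 4, 8):
    print(f"{c}x coverage: {reads_needed(c):,} reads, "
          f"{fraction_missed(c):.2%} of locations never read")
```

Reading every location exactly once takes about 3.8 million perfectly placed reads. With random placement, the same number of reads would miss more than a third of the genome, which is why shotgun projects pile on many-fold redundancy.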
Finally, despite fitting in a capillary tube, the volumes of material required per-molecule in Sanger sequencing become unwieldy at genomic scales. An old engineering mentor of mine was fond of saying that even very small things in your design become big and important when they happen thousands of times per second. Even if we were able to get down to a single instance of each DNA substring per reaction (we cannot), Sanger sequencing would still require hundreds of nucleotides at each location to be interrogated – which turns out to be expensive and inefficient.
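The "hundreds of nucleotides at each location" figure falls out of simple arithmetic. The sketch below assumes the best case described above: exactly one fragment of each length from 1 up to the read length.

```python
def nucleotides_per_base_read(read_length: int) -> float:
    """Nucleotides consumed per location read, given one fragment of every
    length from 1 to read_length."""
    total = sum(range(1, read_length + 1))  # 1 + 2 + ... + read_length
    return total / read_length

print(nucleotides_per_base_read(850))  # 425.5
```

In general the cost works out to (N + 1) / 2 nucleotides per location for a read of length N, so longer reads make each location more expensive, not less.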
My next posts will explore higher throughput techniques, focusing on Illumina’s short-read sequencing by synthesis technology and long-read single molecule technologies like those from Pacific Biosciences and Oxford Nanopore.