This is the second post in a series where I review and explore some basic concepts and confounders in genomics. The first one was titled “How Many Genes Does a Person Have?”
Until quite recently, genetic variation was described in terms of difference from a human reference, of which there have been 38 major versions. The 39th version of the human reference has been indefinitely postponed while the scientific community grapples with the fact that the observed diversity of human genetics does not map well to any single linear reference. Beyond the fact that a simplistic model is inadequate to represent the data, the concept of a “normal” or “reference” genome is problematic and can lead to sloppy, incorrect thinking. Early genomic datasets were heavily biased towards people of European descent. This meant that genetic variation was, practically speaking, being defined in terms of divergence from the white population.
So, we’re on a bit of a pause with the whole idea of a human reference. However, genomes do differ between individuals and across populations and we need to be able to talk about the differences. Here are a few major categories of variation.
Single Nucleotide Polymorphisms (SNPs, pronounced “snips”) are loci where we encounter different nucleotides in the “same” location on a chromosome. Setting aside the question of what exactly is meant by “the same location,” we might see a G rather than a T there. If a SNP is in a protein coding region and the change does not alter the amino acid sequence encoded by the DNA, it is referred to as a silent or synonymous substitution; a change that does alter the amino acid sequence is nonsynonymous. The ratio of nonsynonymous to synonymous mutations is one of the key metrics used in calculating rates of evolutionary change and generations of divergence between populations.
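To make the synonymous versus nonsynonymous distinction concrete, here is a minimal sketch in Python. It builds the standard genetic code table and checks whether a single base change in one codon alters the encoded amino acid. The function name classify_snp and the example codons are my own illustrative choices, and the sketch assumes the codon is given on the coding strand and that the standard code applies.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(codon, position, new_base):
    """Classify a single base change within one codon (hypothetical helper).

    codon: three-letter DNA codon on the coding strand, e.g. "CTG"
    position: 0, 1, or 2, the index of the changed base within the codon
    new_base: the nucleotide observed in place of the original
    """
    variant = codon[:position] + new_base + codon[position + 1:]
    if CODON_TABLE[codon] == CODON_TABLE[variant]:
        return "synonymous"
    return "nonsynonymous"


# CTG and CTT both encode leucine, so a G-to-T change here is silent.
print(classify_snp("CTG", 2, "T"))
# GAT encodes aspartate and GTT encodes valine, so this change is not.
print(classify_snp("GAT", 1, "T"))
```

Most silent changes fall on the third base of a codon, which is where the genetic code is most redundant.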
Admittedly, I’m indulging in a bit of hyperbole when I call out genomic location as a challenge with SNPs. Practically speaking, the location of most SNPs is defined in terms of highly conserved sequences that surround them. The Lightning project from Arvados expands on this concept, using reliable sequence “tags” as signposts to localize more variable features, independent of any reference.
Insertions and Deletions (InDels) are loci where a stretch of DNA is present in some individuals and absent in others. All indels can be represented as either an insertion or a deletion – depending on whether we choose to anchor on the larger or the smaller sequence of DNA. Frameshift mutations are indels within exons where the change is not a multiple of three nucleotides in length. Frameshifts disrupt the three letter code used in translation from RNA to protein, which can have significant biological impacts. Things get messy when the inserted segment contains additional variation that also needs to be represented. Biologists have developed conventions for handling these cases, but it is important to remember that there are multiple correct ways to describe the same underlying variability.
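Two of the points above are easy to sketch in Python: the multiple-of-three rule for frameshifts, and the fact that one underlying change can have several equally valid descriptions. The helper names is_frameshift and delete, and the toy sequence, are hypothetical. The sketch also assumes the indel sits entirely inside a coding exon.

```python
def is_frameshift(ref_allele, alt_allele):
    """Return True if the length change is not a multiple of three.

    Assumes both alleles lie entirely within a coding exon.
    """
    return (len(alt_allele) - len(ref_allele)) % 3 != 0


def delete(sequence, start, length):
    """Remove `length` bases from `sequence`, starting at 0-based `start`."""
    return sequence[:start] + sequence[start + length:]


# A one-base insertion shifts the reading frame; a three-base one does not.
print(is_frameshift("A", "AT"))
print(is_frameshift("A", "ATGC"))

# In a repetitive stretch, deleting "CA" at position 1, 3, or 5 yields the
# same result, so all three descriptions of the deletion are correct.
repeat_region = "GCACACAT"
print({delete(repeat_region, start, 2) for start in (1, 3, 5)})
```

This ambiguity in repetitive sequence is one reason variant callers pick a convention, such as always shifting an indel as far left as it will go, before comparing calls.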
Inversions happen when a stretch of DNA is written backwards, swapping the two strands for a while before swapping back. Copy Number Variations (CNV) occur when a sequence of nucleotides repeats a variable number of times. Inversions and CNVs are both considered structural variants (SV), and can be difficult to identify beyond a certain scale when using certain DNA sequencing technologies.
SNPs where naturally paired nucleotides are swapped (G/C or A/T) could be described as single base inversions. Nobody talks that way, but it’s not incorrect.
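Here is a small Python sketch of what “written backwards, swapping the two strands” means in practice: the inverted segment is replaced by its reverse complement. The function name invert_segment and the toy sequences are my own, and the coordinates are 0-based and half-open as in ordinary Python slicing.

```python
# Watson-Crick pairing: A with T, C with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def invert_segment(sequence, start, end):
    """Replace sequence[start:end] with its reverse complement."""
    segment = sequence[start:end]
    inverted = segment.translate(COMPLEMENT)[::-1]
    return sequence[:start] + inverted + sequence[end:]


# Inverting the middle four bases: GGTC becomes GACC.
print(invert_segment("AAGGTCTT", 2, 6))

# Inverting a single G gives a C, which reads exactly like a G-to-C SNP.
print(invert_segment("ACGT", 2, 3))
```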
So how much do we vary?
A 2016 paper in the journal “Genome Biology” claimed that “typical” human genomes differ from the reference at between 4 and 5 million out of our 3.2 billion loci – leaving roughly 99.9% of our nucleotides in common. Most of these variants, somewhere around 95 to 99% of them, are shared with a substantial fraction of the human population – though the specific makeup varies. This layered and complex commonality is at the root of the mathematical problem with using a single reference. Of the 40,000 to 200,000 variants that have not yet been seen in very many other people, between 40 and 80 are “de novo” variants that were not inherited from either parent. Out of those 40 to 80, we only expect a couple to be nonsynonymous mutations within an exon.
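For anyone who wants to check the arithmetic, here is a quick Python sketch using only the figures quoted above. The constant and function names are mine, and the inputs are the round numbers from the text rather than anything measured.

```python
# Round figures quoted in the text, not measured values.
GENOME_LOCI = 3_200_000_000
VARIANT_RANGE = (4_000_000, 5_000_000)
DE_NOVO_RANGE = (40, 80)


def percent(part, whole):
    """Express `part` as a percentage of `whole`."""
    return 100.0 * part / whole


for variants in VARIANT_RANGE:
    differing = percent(variants, GENOME_LOCI)
    print(f"{variants:,} variants: {differing:.3f}% of loci differ, "
          f"{100 - differing:.3f}% match the reference")

# De novo variants are a vanishingly small share of the total.
for de_novo in DE_NOVO_RANGE:
    share = percent(de_novo, VARIANT_RANGE[0])
    print(f"{de_novo} de novo variants: {share:.4f}% of 4 million variants")
```

The differing share works out to between an eighth and a sixth of one percent, which is where the rounded 99.9% figure comes from.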
So we vary a lot, or a little, or not very much at all, depending on how you choose to look at it.