Somebody recently asked me, innocently enough, “how many genes are there in the human genome?” As one does in these situations, I answered a slightly different question: We are made up of about 20,000 unique proteins. This sufficed to move the conversation along, but I found myself wondering what an honest answer would look like.
The 23 chromosomes of the human genome can be thought of as an ordered string of approximately 3.2 billion (3.2 × 10⁹) nucleotides. Our genome is diploid, which means that we have two similar but non-identical instances of each chromosome in most cells. The sex chromosomes are an exception – chromosomal females have a pair of the larger X chromosomes, while chromosomal males have one X and one much smaller Y. In people with multiple X chromosomes, all but one is disabled at random on a cell-by-cell basis early in development. This “X chromosome inactivation” means that we are all functionally haploid for the X chromosome within any given cell.
Loci that exist on both instances of a chromosome are either the same (homozygous) or different (heterozygous). Loci that occur on only one chromosome are referred to as hemizygous. This can occur when an entire chromosome is haploid (as with ‘Y’), or when a stretch of DNA appears on one chromosome but is absent on the other. The latter situation can lead to multiple representations of the same underlying biology depending on whether we choose to describe it as an ‘insertion’ or a ‘deletion.’
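As a small illustration of the vocabulary above, here is a minimal Python sketch that classifies a single locus. The function name `zygosity` and the convention of using `None` for "this stretch of DNA is absent from that copy" are my own assumptions for the example, not any standard tool's interface.

```python
from typing import Optional


def zygosity(allele_a: Optional[str], allele_b: Optional[str]) -> str:
    """Classify a locus from the allele seen on each chromosome copy.

    None is a hypothetical stand-in for "no DNA at this locus on that copy".
    """
    if allele_a is None and allele_b is None:
        raise ValueError("locus is absent from both copies")
    if allele_a is None or allele_b is None:
        # Present on only one copy: a lone Y, or an insertion/deletion.
        return "hemizygous"
    return "homozygous" if allele_a == allele_b else "heterozygous"


print(zygosity("A", "A"))   # homozygous
print(zygosity("A", "G"))   # heterozygous
print(zygosity("A", None))  # hemizygous
```

Note that the hemizygous case says nothing about whether the longer copy gained sequence or the shorter copy lost it, which is exactly the insertion-versus-deletion ambiguity.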
A single DNA molecule is made up of two strands wrapped around each other to form a double helix. This double helix structure is not the chromosomal pairing described above. DNA strands are directional, and the paired strands of a single DNA molecule proceed in opposite directions. One end of a strand is called 5’ (“five prime”), and its opposite is 3’ (“three prime”). These names refer to which specific carbon atom is bound to the adjacent nucleotide. Biological operations like replication and transcription proceed in the 5’ to 3’ direction, and DNA sequences are conventionally written out in that same order.
Each locus in a chromosome consists of a pair of nucleotides, one per strand. The two strands contain the same information, but running in opposite directions and coded as the reverse complement of the other – Guanine (G) swapped with Cytosine (C) and Adenine (A) swapped with Thymine (T). While each strand is directional, the double stranded molecule is not. Neither strand has priority over the other. This means that there are at least two valid representations for the DNA sequence found at any location on a chromosome.
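The reverse complement is easy to compute, and seeing it in code makes the "two valid representations" point concrete. This is a minimal Python sketch; the `canonical` helper shows one arbitrary way to pick a single representative (the alphabetically smaller of the two strings), which is my assumption for illustration rather than a rule of biology.

```python
# Map each base to its pairing partner: A<->T and C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def canonical(seq: str) -> str:
    """Pick one of the two equivalent spellings (an arbitrary convention)."""
    seq = seq.upper()
    return min(seq, reverse_complement(seq))


print(reverse_complement("GATTACA"))  # TGTAATC
print(canonical("TGTAATC"))           # GATTACA
```

Both `GATTACA` and `TGTAATC` describe the same stretch of double-stranded DNA, so any comparison between sequences has to account for strand.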
The best understood mechanism of genetic action is codified in the central dogma of molecular biology, articulated by Francis Crick: Portions of the double-stranded DNA in the nucleus are transcribed into single-stranded RNA; RNA transcripts move from the nucleus into the cytoplasm, where they are translated by the ribosomes into a series of amino acids; this amino acid chain then folds up into a protein. We refer to the portions of the nuclear DNA that eventually code for the amino acids in proteins as the protein coding or simply the coding regions.
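The DNA to RNA to protein pipeline can be sketched in a few lines of Python. This is a deliberately simplified model: it takes the coding strand as input (so transcription is just T becoming U, whereas the cell actually reads the opposite, template strand), it starts at the first AUG it finds, and it uses the standard genetic code. The function names are mine.

```python
from itertools import product

# Standard genetic code, one letter per codon, with codons enumerated in
# U, C, A, G order at each position. "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand: str) -> str:
    """Simplified transcription: the mRNA matches the coding strand, T -> U."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read codons from the first AUG until a stop codon or the end."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("CCATGGCCTAA")
print(mrna)             # CCAUGGCCUAA
print(translate(mrna))  # MA
```

Folding the resulting amino acid chain into a working protein is a far harder problem and is not modeled here.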
The coding sequence for a particular protein is not a continuous stretch of chromosomal DNA. Coding regions, or exons, are interspersed with non-coding regions called introns. The whole stretch, introns included, is transcribed into RNA; the introns are then cut out and the exons spliced together into a single mature transcript. In addition to introns, which are non-coding regions within genes, there is also substantial intergenic DNA that does not code for any protein at all. Introns and intergenic regions have important, sequence-specific functions despite never being translated into protein, including regulating the rate of transcription of nearby protein coding regions.
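Splicing amounts to keeping some intervals of a transcript and discarding the rest. Here is a minimal Python sketch, assuming exons are given as 0-based, half-open `(start, end)` coordinates on the unspliced transcript. The toy sequence and its coordinates are invented for illustration.

```python
from typing import List, Tuple


def splice(pre_mrna: str, exons: List[Tuple[int, int]]) -> str:
    """Join the exon intervals in transcript order, dropping everything else."""
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))


# Toy transcript: exon 1, a 12-base intron, then exon 2.
exon_1 = "AUGGCC"
intron = "GUAAGUUUUCAG"
exon_2 = "UUUUAA"
pre_mrna = exon_1 + intron + exon_2

mature = splice(pre_mrna, [(0, 6), (18, 24)])
print(mature)  # AUGGCCUUUUAA
```

Passing a different list of intervals for the same transcript yields a different mature sequence, which is one reason a single stretch of DNA need not correspond to a single product.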
In addition to the messenger RNA, or mRNA, described above, there are other sorts of transcripts that are never translated and so never contribute their sequence to amino acid chains. These are collectively called non-coding RNA (ncRNA): Transfer RNA (tRNA) supports protein synthesis; ribosomal RNA (rRNA) forms the core of the ribosome itself, and is encoded both in the nucleus and in the mitochondrial genome – a small circular chromosome that lives in the mitochondria rather than the nucleus; and micro RNAs (miRNAs) play a significant role in gene regulation. There is also diversity within the mRNA transcripts produced from a single coding region. Splice variants occur when different combinations of exons are retained or skipped as the transcript is spliced. There are even reverse transcriptase proteins that, central dogma notwithstanding, write RNA sequences back into the nuclear DNA in living cells. These are the mechanism of action behind retrotransposons (one class of the transposable elements colloquially known as “jumping genes”), telomere maintenance, HIV, certain cancer-associated viruses, and some gene editing techniques built on CRISPR, such as prime editing.
None of this answers the original question about how many genes we have, unless you are willing to accept the old “one gene, one protein” mantra, in which case “about 20,000” will suffice. To do better, we need to define what, exactly, we mean by “gene.” It would probably also be a good idea to put some thought into which of the phenomena above are included in or excluded from “the genome,” lest we leave ourselves open to the usual grab-bag of gotcha questions from a well-informed audience.