Progress with long read sequencing
But wait, didn’t the Human Genome Project accomplish that feat about 20 years ago?
The simple answer is no, it did not. The first “complete” human genome sequence was actually complete only in terms of what researchers could achieve with the technology available at that time. Nearly 10 percent of the genome, including highly repetitive sequences at the ends (telomeres) and middles (centromeres) of chromosomes, remained inaccessible. Indeed, in 2016, when I investigated the raw data in my own genomic sequence, it contained “only” about 2.8 billion bases, well under the 3+ billion bases of the full human genome.
Short sequences to long reads
The original reference genome sequence, as well as my own, was generated using short read sequencing technology. Relatively fast, inexpensive and accurate, short read sequencing (technically known as high throughput massively parallel sequencing) requires breaking the genome into millions of short—250 bases, plus or minus—segments for sequencing and then reassembling them to generate the full sequence. The method works very well for most of the genome, but, as previously noted, some regions are left out. Highly repetitive segments, where the same nucleotide or pattern (e.g., GCGCGCGC) is repeated over and over again, cannot be accurately reassembled.
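A toy Python sketch can make the repeat problem concrete. Everything in it is invented for illustration: the miniature reference string, the read lengths and the helper name `placements`. Real aligners and assemblers are far more sophisticated, but the ambiguity they face is the same in spirit.

```python
def placements(reference: str, read: str) -> list[int]:
    """Return every start position where `read` matches `reference` exactly."""
    hits = []
    start = reference.find(read)
    while start != -1:
        hits.append(start)
        # Step forward by one so overlapping matches are also found.
        start = reference.find(read, start + 1)
    return hits

# Toy reference: a unique flank, a GC repeat, then another unique flank.
reference = "ATTCA" + "GC" * 10 + "TTAGA"

# A "short read" that lies entirely inside the repeat.
short_read = "GCGCGC"
# A "long read" that spans the whole repeat plus a little of each flank.
long_read = "CA" + "GC" * 10 + "TT"

short_hits = placements(reference, short_read)
long_hits = placements(reference, long_read)

print("short read fits at", len(short_hits), "positions")
print("long read fits at", len(long_hits), "position(s)")
```

In this toy case the short read fits equally well at eight positions inside the repeat, so nothing says where it belongs or how long the repeat really is. The long read reaches into the unique sequence on both sides and fits in exactly one place.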
The solution? In recent years, long read technologies have progressed in both accuracy and read length. I assessed the state of the field more than five years ago, and while the players (Oxford Nanopore, Pacific Biosciences) remain the same, much has changed, as the T2T accomplishment indicates. Long read sequencing technologies haven’t pushed aside short read sequencing or Illumina, the largest company in the field, as some predicted back then, but they are now being applied to a wider range of research applications.
The Jackson Laboratory (JAX) has considerable expertise in long read sequencing, and many faculty members rely on it for their research. It’s therefore fitting that researchers leading the long read innovation charge at institutions around the country came to JAX for its recent long read sequencing workshop, including some of the leaders of the T2T effort. The workshop included talks and hands-on experience in enhanced methodologies, new research applications and computational approaches for analyzing long read data. The work is providing biological insight not previously achievable, some of which has been highlighted by recent events.
Long read sequencing has long been important for sequencing microbes. Single-molecule reads that span the entire genome can capture species without an existing reference sequence as well as detect variants within a particular species. The importance of this capability has been magnified by the COVID-19 pandemic, as small variations in the SARS-CoV-2 genome can have important public health and clinical implications. Detecting, describing and tracking new variants as they arise remains essential as the Omicron variant in particular has mutated into subvariants that outcompete previous ones and spread rapidly. Identifying and characterizing small changes in the virus’s RNA genome provides an advance warning of likely transmission, as with the current Omicron BA.4 and BA.5 subvariants. It will be even more vital should a variant with high virulence and/or non-responsiveness to current vaccines arise.
Of course, SARS-CoV-2 is only one of many trillions of microbes with which we share our planet. The recent pandemic has greatly increased awareness of zoonotic disease—a new human infectious disease that first arises in an animal reservoir before jumping to humans—and maintaining a heightened vigil for potential new pathogens is of great importance. For this application, nanopore sequencers, which can be highly portable and easy to use outside a laboratory, are a vital new tool. They provide real-time sequencing capabilities in remote regions, where the risk of zoonotic disease is often highest, as well as surveillance for tick-borne pathogens right in the northern U.S.
As the T2T project indicates, genomics is even more complicated than it first appears. But the challenge of capturing everything about our genomes goes far beyond simply covering highly repetitive regions. Not all sequences are laid out sequentially, with two copies nestled neatly into chromosomes. They can be duplicated, deleted, inverted, and so on, creating changes known as structural variants, or SVs. SVs are very difficult to detect using short read technologies because, rather than changing the actual sequence, they reconfigure it in ways that are lost when a genome is cut up and reassembled. Their importance in many diseases, including cancer, has therefore only lately come to light, particularly after the advent of long read methods.
Another kind of variation is now emerging as a significant player in cancers: extra-chromosomal DNA (ecDNA). These are additional DNA copies that become circular structures that are completely separate from chromosomes in the nucleus. Like SVs within the chromosomal sequences, ecDNAs replicate, rather than alter, portions of the reference sequence and are very hard to detect using standard sequencing methods. Long read protocols can identify ecDNAs and reveal exactly what sequences they contain, end-to-end. Recent research at JAX has gone a step further, visualizing ecDNAs within living cells and showing that they are inherited unequally during cell division, creating and/or enhancing cellular differences and driving cancer evolution.
Beyond the central dogma
Long reads provide crucial insight into another process separate from the chromosomes, known as alternative splicing. Messenger RNAs (mRNAs) are transcribed from the DNA sequences, are spliced, move outside of the nucleus, and are then translated to proteins at the ribosome. This basic process, called the central dogma back in the day, was once thought to be rather straightforward. We now know, however, that the splicing process, in which segments known as introns are removed and the remaining segments (exons) are spliced back together, is far from simple. Even in normal function, the same initial transcript can be spliced in different ways, with different introns and exons removed or retained. That way, a single gene can generate multiple forms of its associated protein. Disruptions to alternative splicing regulation or function have been implicated in multiple human diseases, including cancer.
Detecting and characterizing different forms of mature mRNA from a single gene, called isoforms, is therefore essential for investigating how splicing defects may contribute to cancers. It requires sequencing the entire mRNA at once, however, something not possible with short read methods. Researchers at JAX are therefore using long-read mRNA sequencing to catalog all mRNAs present in breast cancer samples. The research revealed previously undocumented isoforms that can change protein function, with possible implications for cancer progression and therapy response. Capturing complete mRNA isoform catalogs is important for research into other processes and diseases as well, such as neural development, retinal disease and cardiovascular disease.
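The need for full-length reads can be sketched with a toy Python example. It assumes a hypothetical gene with five exons, two of which (E2 and E4) may be skipped independently. The exon sequences, isoform names and helper functions are all invented for illustration and do not come from the JAX research.

```python
# Hypothetical exon sequences, invented for illustration only.
exons = {
    "E1": "ATGGCTAC",
    "E2": "TTTTAAAA",
    "E3": "CCGGCCGG",
    "E4": "AGAGTCTC",
    "E5": "GATTGTAA",
}

# Four hypothetical isoforms: E2 and E4 can each be included or skipped.
isoforms = {
    "full": ["E1", "E2", "E3", "E4", "E5"],
    "skip_E2": ["E1", "E3", "E4", "E5"],
    "skip_E4": ["E1", "E2", "E3", "E5"],
    "skip_E2_E4": ["E1", "E3", "E5"],
}

def transcript(exon_names):
    """Join exon sequences in order to build a mature mRNA sequence."""
    return "".join(exons[name] for name in exon_names)

def compatible_isoforms(read):
    """List the isoforms whose transcript contains the read as an exact substring."""
    return sorted(
        name for name, exon_names in isoforms.items()
        if read in transcript(exon_names)
    )

# A short read across the E1-E3 junction shows that E2 was skipped,
# but says nothing about what happened to E4 further downstream.
junction_read = exons["E1"][-4:] + exons["E3"][:4]

# A full-length read covers every splice junction in the molecule at once.
full_length_read = transcript(isoforms["skip_E2_E4"])

print(compatible_isoforms(junction_read))
print(compatible_isoforms(full_length_read))
```

The junction read is compatible with two isoforms, while the full-length read is compatible with only one. Short reads can reveal individual splicing events, but only a read that covers the whole molecule shows which events occurred together.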
Now we have the complete, end-to-end human genome sequence. We can find SVs and ecDNAs and sequence complete microbial genomes and mRNA isoforms. Where do we go next? The obvious answer may be rather prosaic, but the potential benefits are not: making long-read sequencing more feasible and more accurate for common clinical applications. At this time, long read protocols can be accurate, affordable and high-throughput, but not all three at the same time. Bringing them to the clinic demands near 100% accuracy, which is only achievable at high costs. But as the cost/accuracy equation improves, they could provide crucial insights in many areas, including rare disease diagnosis. Indeed, Sir Mark Caulfield, F.Med.Sci., who led Genomics England and its initial rare disease diagnostic program, sees long read technologies as a potentially important step forward in increasing diagnostic yield for patients.
One such step is the ability to phase genomes. We all have two sets of chromosomes, one from each parent, and two copies of each gene. Since short read methods chop everything into small pieces, it’s impossible to tell whether a variant is from the mother, father or both parents without sequencing them as well, hence the common trio sequencing for rare disease patients and families. With long reads, it’s possible to phase the variants onto the paternal or maternal chromosomes without that extra step. This is particularly useful in diagnostics, such as when two variants occur in the same gene and it matters whether they sit on the same copy or on opposite copies. The latter situation, known as compound heterozygosity, commonly underlies recessive Mendelian (single-gene) disorders.
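The phasing idea can be sketched in a few lines of Python. This is a simplified illustration, not a real phasing algorithm: the two variant positions, the read representation (a dictionary mapping position to the allele observed) and the function names are all assumptions made for the example. Two variants on the same chromosome copy are labeled "cis" and two on opposite copies "trans", the arrangement that corresponds to compound heterozygosity.

```python
from collections import Counter

# Two hypothetical heterozygous variant positions in the same gene.
SITE_1, SITE_2 = 100, 5_000

def phase_evidence(reads):
    """Count allele pairs on reads that cover both sites.

    Each read is a dict mapping a variant position to the allele
    it carries there, either "ref" or "alt".
    """
    pairs = Counter()
    for read in reads:
        if SITE_1 in read and SITE_2 in read:
            pairs[(read[SITE_1], read[SITE_2])] += 1
    return pairs

def classify(pairs):
    """Call cis, trans or unphased by simple majority of linking reads."""
    cis = pairs[("alt", "alt")] + pairs[("ref", "ref")]
    trans = pairs[("alt", "ref")] + pairs[("ref", "alt")]
    if cis == trans:  # no linking reads at all, or a tie
        return "unphased"
    return "cis" if cis > trans else "trans"

# Short reads each cover only one of the two sites.
short_reads = [{SITE_1: "alt"}, {SITE_1: "ref"}, {SITE_2: "alt"}, {SITE_2: "ref"}]

# Long reads cover both sites on a single molecule.
long_reads = [
    {SITE_1: "alt", SITE_2: "ref"},
    {SITE_1: "ref", SITE_2: "alt"},
    {SITE_1: "alt", SITE_2: "ref"},
]

print(classify(phase_evidence(short_reads)))
print(classify(phase_evidence(long_reads)))
```

The short reads detect both variants but never link them, so the pair stays unphased. Each long read carries one variant allele and one reference allele, which places the two variants on opposite copies of the gene.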
What else may come? It’s difficult to say, and researchers are known for their creativity in applying powerful new technologies in unexpected ways. All I know is that it is an exciting space to watch in the months and years ahead.