The Hidden Costs of Genomic Data
By: Kimberly Robasky, PhD | March 04, 2015
Molecular testing has been around for decades and with a broad spectrum of utility: blood typing in hematology, DNA-fingerprinting in forensics, and HLA-typing in immunology, to name just a few. More recently, game-changing breakthroughs in DNA-sequencing are enabling broader and even more accurate assays, further accelerating discovery and directly impacting how we deliver care.
DNA-sequencing has many applications. For example, targeted sequencing focuses on regions of interest, such as specific genes or loci of important personal variation. By contrast, whole-genome DNA-sequencing accesses the entire genome, identifying many more personal variations. On the other hand, transcriptome sequencing (a.k.a., RNA-Seq) accesses the RNA to understand how certain genetic variations affect transcription into RNA, which is eventually translated into proteins.
The benefits are clear but what about the costs? If researchers aren’t thoughtful about the way they perform these tests, they can end up spending hundreds or even thousands of additional dollars per test just to store and transfer the collected data. In some cases the costs are justified, but for others, there are more effective alternatives.
It all adds up
To read off an individual’s exome (e.g., the coding region of the genome) using high throughput sequencing currently costs $500-$1500, plus analysis costs. Similar technologies used to read all three billion plus bases of the whole genome, capturing all of an individual's private genetic variation, costs about $1000-$5000 in labor and reagents. However, these estimates only consider part of the equation. To get the true sense of the financial implications of DNA-sequencing, we must also consider the informatics costs.
Analysis costs can vary widely depending on experimental goals, but data storage costs are slightly more generalizable. For example, a typical whole genome occupies about 100 gigabytes, including all of the detailed information required for reproducing the analysis. This can cost $5-$10 per month in current cloud storage costs, plus data transfer costs, which can add up quickly if the sequencer is not located near the analysis site.
Given that individual human genomes are about 99.9% similar, we could choose to store only those differences, reducing the size to below 5GB for that same typical whole genome, on average. Unfortunately, if the original raw data are not also saved, reanalysis with new and improved algorithms will require a new re-sequencing project, including the $1000-$5000 cost in labor and reagents. Consequently, the longer-term data storage costs for whole genome sequencing should factor into any NGS-based business plan.
Targeted sequencing can significantly reduce data storage costs by only sequencing areas of known interest. Depending upon the panel size, multiple samples can be included on the same sequencing run (a.k.a., multiplexing), and can be sequenced at an order of magnitude greater than the depth used in whole genome sequencing. Sequencing at higher read depths (a.k.a., deep-sequencing) enables the detection of low-frequency variation, such as genetic variation that might only occur in a small fraction of the cells being sequenced. Detecting low-frequency variants provides insight into somatic variation, which is thought to be the key driver behind the onset and spread of cancer. Low-frequency variant detection also abets access to the Immune Repertoire, which is critical in surveilling things like disease state and drug response. Hence deeply-sequenced targeted panels are a cost-efficient alternative to whole genome sequencing. Targeted sequencing can also provide higher detection rates for somatic mutation at a fraction of the cost of whole genome sequencing.
Likewise, sequencing exomes and transcriptomes translates to significantly less storage requirements -- expect to reduce data transfer and storage costs to a tenth of that required for whole genomes for the higher read depths that are considered standard for transcriptomes and exomes.
Data costs may seem like a peripheral expense, but they can dramatically alter the cost benefit scenario of these efforts, depending on the assay and long-term needs. Good experimental design that utilizes the right tool for the job can have a big impact on the cost and benefits achieved.