The speed of genomic analysis has increased while the cost has plummeted. Whole genome sequencing provides nearly complete coverage of an organism's DNA, and the advent of next-generation sequencing extends this coverage to transcriptome profiling and beyond. The increasing availability of genomic data presents new opportunities for understanding biological function in health and disease, including the spatial and temporal profiling of tissues and tumors as well as the promise of personalized medicine.
While genomic data is valuable, interpreting these vast data sets is complicated, and other methods are needed to inform the analysis. Data derived from the proteins of the organism being studied is of obvious utility, since the nucleic acids code for their production. Mass spectrometry (MS) plays a leading role in proteomics through the high-throughput identification and sequencing of proteins. MS has been continuously improved, but its coverage is not close to the >99% attained with genomic data. Merging the two data streams to create proteogenomics takes advantage of the strengths of both areas.
The genome provides a map of what is possible, while the proteome reflects what actually occurs. Each stream of data can inform the other. Not all DNA is transcribed, and the presence of transcripts does not equate with protein expression, so the presence or absence of a protein can be used to validate predictions of open reading frames, translation start sites, splice isoforms, novel genes, and pseudogenes. MS-based studies can also be used to map posttranslational modifications (PTMs), an essential component of biological function that is not encoded directly in the genome. MS-based proteomics now employs strategies for quantifying the amount of protein present, including several isotope-encoded methods.
One big advantage of knowing the genome is that the proteomics data can be searched against all six reading frames of the DNA. Conventional search algorithms for proteomics rely on protein databases that do not include variants, can be incorrectly annotated, and are almost never specific to the sample. In addition, the methods used for high-throughput MS analysis do not provide comprehensive protein sequence coverage. For example, standard automated proteomics analyses rely on general chromatographic separation methods and on selection algorithms that choose abundant peaks in a precursor MS spectrum for sequencing. Consequently, peptides can be lost due to hydrophobicity (e.g., membrane-associated or large peptides) or may never be sequenced if they elute in a dense part of the chromatogram. Data from RNA-Seq or whole genome sequencing allows the proteomics researcher to focus on a specific area of the proteome and drill down to determine whether the predicted protein is seen in the sample of interest.
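The six-frame search described above starts from a conceptual translation of the DNA in all three forward frames and all three frames of the reverse complement. A minimal sketch of that translation step (using the standard genetic code, not tied to any particular search engine) might look like this:

```python
from itertools import product

# Standard genetic code, codons ordered by base T, C, A, G at each position;
# '*' marks a stop codon.
BASES = "TCAG"
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(("".join(c) for c in product(BASES, repeat=3)), AA))

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(seq):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frame_translation(dna):
    """Return all six conceptual translations: three forward frames and
    three frames on the reverse complement."""
    dna = dna.upper()
    rc = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames
```

Splitting each frame at stop codons then yields candidate open reading frames whose peptides can be added to the search database, including sample-specific variants absent from reference protein databases.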
Much remains to be done to improve the utility of mass spectrometry in proteogenomics. Method development is needed to improve the handling of membrane proteins and their embedded peptides and to enable the detection of large peptides and their modifications. Detecting larger peptides will allow better mapping of PTMs (e.g., multiple PTMs on a single fragment), and work is underway using high-energy fragmentation for such sequencing. Proteomics also faces the challenge of a wide dynamic range, with the need to characterize low-abundance signaling proteins; thus, the sensitivity of these experiments will need to improve as well.
An example of our ongoing work in support of this effort is the characterization of proteins in breast cancer subtypes. The tumors of interest are amplified as xenografts that are fully characterized with next-generation sequencing. The tumors are also analyzed using quantitative proteomics. The data is then mined to find genomic variants of interest in the proteomics data and to inform the next pass of proteomics experiments (e.g., inclusion-list-driven tandem MS for sequencing peaks that could be indicators of a splice variant in a particular tumor subtype).
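The inclusion-list step mentioned above amounts to converting candidate variant peptides into the precursor m/z values the instrument should target. A sketch under stated assumptions: the peptide sequences are hypothetical placeholders (not real variants), peptides are unmodified, and only the listed charge states are considered.

```python
# Monoisotopic residue masses in Da (standard amino acid values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # terminal H + OH added to the residue sum
PROTON = 1.00728   # mass of a proton

def peptide_mz(sequence, charge=2):
    """Monoisotopic m/z of an unmodified peptide at the given charge state."""
    neutral = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER
    return (neutral + charge * PROTON) / charge

def inclusion_list(peptides, charges=(2, 3)):
    """Build (sequence, charge, m/z) rows for an instrument inclusion list."""
    return [(p, z, round(peptide_mz(p, z), 4))
            for p in peptides for z in charges]
```

In practice the candidate sequences would come from variant calls or splice junctions predicted by the sequencing data, and the resulting m/z rows would be loaded into the instrument method so those precursors are selected for tandem MS even if they are low in abundance.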