Introduction
In the post-genomic era, proteomics has become an exciting field and a potential frontier of modern medicine since the early 2000s.1–3 High-throughput technologies have enabled quantitative proteomics, revealing new insights.4–6 For example, GAPDH, traditionally considered a housekeeping protein consistently expressed in various tissues, was recently found to be linked to retinoblastoma, lung adenocarcinoma, and intrahepatic cholangiocarcinoma.4,5,7 These findings have led to increasing clinical applications of high-throughput proteomics, although such applications remain relatively limited. Here, we therefore summarize recent advances in clinical applications of high-throughput proteomics and discuss the associated challenges, advantages, and future directions (Fig. 1).
The four most commonly used high-throughput proteomic techniques are mass spectrometry (MS), protein pathway array (PPA), next-generation tissue microarrays (ngTMA), and multiplex bead- or aptamer-based assays such as Luminex® and Simoa® (Fig. 2).8 Each has its own methodological strengths and weaknesses and should be used accordingly.8 Briefly, MS analyzes and quantifies proteins, their isoforms, and post-translational modifications through direct assessment of the fragments or specific proteolytic activities. Based on instrumental analysis methods, MS can be roughly classified into direct infusion MS, ion mobility spectrometry (IMS) MS, liquid chromatography-mass spectrometry (LC-MS), gas chromatography MS, and supercritical fluid chromatography MS.9,10 IMS MS is gaining popularity because it shows less chemical variation and is faster than LC-MS.9 Direct infusion shotgun proteome analysis combines shotgun and IMS MS methodologies and appears to be a highly efficient and accurate approach for high-throughput proteomics.9,11 Isobaric tags for relative and absolute quantitation (iTRAQ) is another commonly used MS-based technique, which covalently labels the side-chain amines and/or N-terminus of targeted peptides.12 It has been applied to discovering biomarkers of hepatocellular carcinoma and cervical cancer.13,14 However, despite its very high accuracy (across orders of magnitude), iTRAQ has disadvantages, including the use of isotopes, contamination, and background noise.15
PPA uses a mixture of antibodies in a gel-based array to simultaneously detect corresponding antigens in a sample, making it high-throughput. ngTMA applies antibodies to a large number of tissue samples/cores simultaneously, usually formalin-fixed paraffin-embedded and arranged in an array on a single histologic slide. This allows large-scale antibody-based molecular analysis of multiple samples at the same time, improving time- and cost-efficiency and decreasing variation and the need for additional controls. Multiplex bead- or aptamer-based assays mix the samples with multiple beads or aptamers that simultaneously bind to various antigens in the samples through conjugated antibodies, aptamers, or probes. Most of these systems use fluorescence-based conjugation and detection systems. The acceptable sample types also vary by methodology. MS and PPA are mostly for processed proteins, while ngTMA and bead/aptamer-based assays require human tissue and blood samples, respectively. More methodological details can be found in other reviews.8
Advances in the clinical applications of proteomics
High-throughput proteomics methods have broad applications in translational research, clinical practice, and public health. They enable the exploration of molecular mechanisms and biological processes, the identification of novel diagnostic and prognostic biomarkers for precision medicine, and the discovery of therapeutic targets for personalized therapy.16 Given the rapidly expanding high-throughput datasets,17 high-throughput proteomics becomes increasingly important for translational and clinical research. It has been widely used in cancer research and other fields.18–20 Several elegant reviews have focused on one or two disease areas and should be referred to for more details.19,21–26 Here, we briefly summarize recent advances in the clinical application of high-throughput proteomics in selected diseases.
Breast cancer
Song et al.27 performed both PPA and SmartChip, which is an mRNA microarray, and identified 1,243 cancer pathway-related genes in breast cancer. They revealed decreased protein and mRNA expression in CDK6, Vimentin, and SLUG, and different protein expression in BCL6, CCNE1, PCNA, PDK1, SRC, and XIAP between tumor and normal tissues, but no difference in mRNA expression in those genes. At the signaling network level, 15 altered pathways were identified in breast cancer. Among them, six pathways (p53, IL17, HGF, NGF, PTEN, and PI3K/AKT) were found to be shared between the mRNAs and proteins. Although many dysregulated pathways in breast cancer occur at both mRNA and protein levels, mRNA expression does not necessarily correlate with protein expression. This suggests different regulatory mechanisms for proteins and mRNAs in breast cancer pathogenesis.27 Hadi et al.28 used gas chromatography-mass spectrometry to identify potential protein markers for breast cancer. A partial least squares discriminant analysis model was built to separate breast cancer patients, achieving a sensitivity of 96% and a specificity of 100% on the validation dataset. Models using the decision tree algorithm for grading, staging, and neoadjuvant status reached predictive accuracies of 71.5%, 71.3%, and 79.8%, respectively.28 Interestingly, Aslebagh et al.29 assessed protein expression patterns in human milk obtained from breastfeeding mothers who had breast cancer using 2D-polyacrylamide gel electrophoresis coupled with nano LC-MS/MS analysis. Their study showed that breast milk could be an essential and potentially informative biospecimen for breast cancer biomarker discovery.29
Colorectal cancer
Barberini et al.30 applied gas chromatography-mass spectrometry to identify biomarkers in blood plasma for colorectal cancer and found the most significantly altered metabolic pathways in colorectal cancer involve monosaccharides, such as the catabolic pathway of fructose and D-mannose, and amino acids, such as methionine, valine, leucine, and isoleucine. Ang et al.31 described a detailed protocol for revealing candidate protein markers in stool samples using 1D sodium dodecyl sulfate-polyacrylamide gel electrophoresis with LC-MS/MS. Their protein quantitation protocol and validation study will enhance the fecal proteome for the detection of potential fecal biomarkers. Additionally, using a high-density antibody microarray, Rho et al.32 showed that plasma levels of several proteins/glycoproteins were associated with colon cancer diagnosis, including BAG4, IL6ST, and CD44. Adding CD44, EGFR, sialyl Lewis-A, and Lewis-X content further improved the panel’s performance to an area under the curve (AUC) of 0.86 to 0.90.
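The AUC values quoted for biomarker panels can be read as the probability that a randomly chosen case has a higher marker score than a randomly chosen control. The following minimal Python sketch illustrates that rank-based definition. The function name and the marker values are hypothetical and are not taken from any of the cited studies.

```python
def auc(cases, controls):
    """Rank-based AUC: fraction of (case, control) pairs in which the case
    scores higher than the control; ties count as half a pair."""
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical marker levels in arbitrary units (illustration only).
case_levels = [2.1, 3.4, 1.8, 4.0, 2.9]
control_levels = [1.2, 2.0, 1.5, 3.0, 1.1]

print(f"AUC = {auc(case_levels, control_levels):.2f}")
```

An AUC of 0.5 corresponds to a marker that does not separate the groups, and 1.0 to perfect separation. A panel is usually scored by first combining its markers into a single score per sample.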
Gastric cancer
In gastric cancer, Gao et al.33 discovered that VCAM1, FLNA, VASP, CAV1, PICK1, and COL4A2 were differentially expressed using iTRAQ labeling analysis with LC-MS. They identified VCAM1 as a potential biomarker for treatment, located at the center of the protein-protein interaction network by KEGG pathway analysis. Lian et al.34 identified 20 proteins that were differentially expressed in Helicobacter pylori-associated gastric cancer using PPA. They found that both brassinosteroid-insensitive 1-associated kinase 1 and calpastatin were favorable prognostic factors in H. pylori-associated gastric cancers. The ERK/MAPK signaling pathway was the most significantly affected by H. pylori using PPA and ingenuity pathway analysis. He et al.35 applied PPA to AFP-producing gastric adenocarcinoma, which is more aggressive and associated with liver metastasis, uncovering that cyclin D1, RANKL, LSD1, Autotaxin, Calpain2, STAT3, XIAP, IGF-Irβ, and Bcl-2 were significantly up-regulated, and ASC-R and BID were significantly down-regulated. Furthermore, high levels of XIAP and IGF-Irβ were independent prognostic factors. These factors can also be used to build a risk model with the pathological stage to separate AFP-positive gastric adenocarcinoma into two subgroups. The protein kinase A pathway was involved in the high-risk score group, while the PTEN pathway had significant enrichment in the low-risk score group by gene set enrichment analysis.35 Tong et al.36 uncovered nine serum markers using the Luminex system for the diagnosis of gastric cancer. Among them, pepsinogen I, pepsinogen II, ADAM8, VEGF, and anti-H. pylori IgG were identified as the panel of classifiers in three algorithms (logistic regression, random forest, and support vector machine), with accuracy in the validation set of 78.7%, 82.5%, and 86.1%, respectively.36
Bladder cancer
Proteomics has provided unique insights into the diagnostics, therapeutic targets, and pathogenesis of bladder cancer.37 Chen et al.38 applied differential 12C2-/13C2-dansylation labeling coupled with liquid chromatography/tandem MS to evaluate metabolite-based diagnostic biomarkers in urine for bladder cancer. They used ultra-performance liquid chromatography coupled with a high-resolution Fourier transform ion-cyclotron resonance MS system and an ion trap MS with multiple reactions for precise quantification. Among o-phosphoethanolamine, 3-amino-2-piperidone, uridine, and 5-hydroxyindoleacetic acid, o-phosphoethanolamine and uridine were differentially expressed in the urine of bladder cancer patients compared with controls. Furthermore, o-phosphoethanolamine was the most promising biomarker among the four, with an AUC of 0.709 for bladder cancer diagnosis. The AUC improved to 0.726 with the combination of o-phosphoethanolamine and uridine.38 Hu et al.39 demonstrated that 45 proteins were differentially expressed in bladder cancers compared with non-tumor samples. Among them, EGFR and cdc2p34 were associated with muscle invasion and higher histological grade. Moreover, ß-catenin, HSP70, autotaxin, Notch4, PSTPIP1, DPYD, ODC, cyclinB1, calretinin, and EPO can be employed as a classifier panel to classify muscle-invasive bladder urothelial carcinoma by prognosis. P2X7, cdc25B, and TFIIH p89 were identified as significant prognostic factors by Kaplan–Meier and log-rank analyses on overall survival.39 Recent studies also identified 14 differentially expressed plasma proteins in cancer versus control groups, with apolipoprotein A1 being the most promising candidate (AUC = 0.906) and showed that Cadherin 12 is a predictor of neoadjuvant chemotherapy outcomes.40,41
Laryngeal squamous cell carcinoma
Sewell et al.42 first identified stratifin, S100 calcium-binding protein A9, p21-ARC, stathmin, and enolase as proteomic markers for laryngeal squamous cell carcinoma. Chen et al.43 discovered 16 proteins differentially expressed in laryngeal squamous cell carcinoma. Among them, TTF-1, CDK2, Eg5, PCNA, Bcl-xL, 14-3-3b, p27, SRC-1, and cytokeratin 18 were identified as markers for classification, and JAK2, keratin 10, and IL-3Ra were identified for prognosis. They also developed a risk model based on histological grade, T classification, N classification, JAK2, and IL-3Ra, which can predict the prognosis with 85.5% accuracy.43 Pan et al.44 developed a four-autoantibody-based early diagnostic panel, including TP53, HRAS, CTAG1A, and NSG1, for esophageal squamous cell carcinoma using the Luminex xMAP platform. The panel can discriminate early esophageal squamous cell carcinomas from controls with a sensitivity of 58.0% and specificity of 90.0% in an external validation dataset.44 Using LC-MS, Zhao et al.45 recently revealed that the fatty acid desaturase 1 expression was linked to poor prognosis and advanced clinical features in recurrent laryngeal squamous cell carcinoma patients treated with chemotherapy. They identified that fatty acid desaturase 1 is a potential promoter in laryngeal squamous cell carcinoma progression through the AKT/mTOR signaling pathway by protein-protein interactions (PPIs) and module analysis.
Coronavirus disease 2019 (COVID-19)
High-throughput proteomics is a technology that can be rapidly applied, as demonstrated by its use in studying COVID-19. Despite the progress made in combating COVID-19,46,47 diagnostics and prognostication of COVID-19 and long COVID-19 remain challenging in the post-vaccination period.48,49 Ray et al.50 illustrated that proteomics techniques, including MS, antibody-based assays, and bioinformatics, had tremendous potential to uncover the pathobiology of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and inform therapeutics and vaccine development. LC-MS has helped identify the host cell pathways modulated by SARS-CoV-2.51 Hierarchical clustering analysis identified two main clusters of proteins: one consisting of proteins involved in cholesterol metabolism, which were reduced during infection, and another of proteins that were increased by infection. They showed that inhibition of these pathways can stop viral replication in vitro and thus can be targets for COVID-19 prevention and treatment.51 Forster et al. successfully used phylogenetic networks to identify undocumented COVID-19 infection sources. Their team found three central variants marked by different amino acid profiles, named A, B, and C types. The A and C types are mostly found in Europeans and Americans, while the B type is more prevalent in East Asia.52,53 Interestingly, the ancestral genome of type B seems to have mutated into derived B types before being transmitted beyond East Asia.52,53
Challenges
Protein property
Degradation has always been the biggest challenge in proteomics, compared to the considerable stability of DNA and cDNA. Protein stability and half-life are modulated by multiple post-translational modifications (PTMs), which regulate various signaling pathways and modify proteins with functional chemical groups, including phosphate, glycan, methyl, acetyl, ubiquitin, and others.54 Although these modifications are significant for protein functions such as activity state, stability, localization, turnover, and interactions with other molecules, they are susceptible to degradation, which can occur during sample collection, protein extraction, and temperature changes.55 Undetected PTMs can lead to misinterpreted results and errors in data analysis. For example, loss of phosphorylation can give a misleading picture of the activity status of proteins and of dynamic protein-protein interactions. Loss of PTMs can also significantly change the original multi-dimensional structure of proteins. Low protein concentration and non-specific binding are additional challenges for immunoassay-based techniques.
Statistical modeling
Missing or inappropriate normalization of data will lead to inaccurate or misleading statistical analysis, even when proper statistical methods are used.56–58 For example, although housekeeping proteins such as GAPDH and beta-actin are known for their consistent expression across biological sources, normalizing to them is sometimes inaccurate if there are systematic errors or if the housekeeping protein is intrinsically linked to a disease.4,5,7 These errors may affect protein detection in certain areas or types of protein. Using additional housekeeping proteins can help identify these errors and serve as internal controls.
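As an illustration of using more than one housekeeping protein as an internal control, the sketch below normalizes each sample to the geometric mean of two housekeeping signals and flags samples in which the two housekeeping proteins disagree with each other. This is a minimal sketch under stated assumptions, not a method from the cited studies: the sample values, the protein label `TARGET`, the function names, and the 1.5-fold tolerance are all hypothetical.

```python
from math import exp, log
from statistics import mean, median


def normalize(sample, housekeepers):
    """Divide each non-housekeeping signal by the geometric mean of the
    housekeeping signals in the same sample."""
    ref = exp(mean([log(sample[h]) for h in housekeepers]))
    return {p: v / ref for p, v in sample.items() if p not in housekeepers}


def flag_inconsistent(samples, hk_a, hk_b, max_fold=1.5):
    """Return sample names whose hk_a/hk_b ratio deviates from the median
    ratio by more than max_fold (an assumed, arbitrary tolerance)."""
    ratios = {name: s[hk_a] / s[hk_b] for name, s in samples.items()}
    typical = median(ratios.values())
    return [name for name, r in ratios.items()
            if max(r / typical, typical / r) > max_fold]


# Hypothetical raw signal intensities (arbitrary units).
samples = {
    "normal_1": {"GAPDH": 100.0, "ACTB": 200.0, "TARGET": 50.0},
    "normal_2": {"GAPDH": 104.0, "ACTB": 200.0, "TARGET": 55.0},
    "tumor_1": {"GAPDH": 400.0, "ACTB": 200.0, "TARGET": 50.0},
}

print(flag_inconsistent(samples, "GAPDH", "ACTB"))
print(normalize(samples["normal_1"], ["GAPDH", "ACTB"]))
```

In this toy data, GAPDH is elevated in the tumor sample while beta-actin (ACTB) is not, so the ratio check flags that sample. A flagged sample indicates that at least one reference protein is unreliable there. It does not say which one.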
Statistical modeling algorithms may also be affected by input data and selected features/factors (e.g., quality of samples and biomarker selection). For example, the clustering results can be greatly influenced by changes in samples and available features.8 In such scenarios, unsupervised machine learning (ML) may be particularly useful since it does not rely on the labels/annotations provided by experts but is driven by the intrinsic relationships of the samples.59,60 However, the performance of unsupervised ML may be worse than supervised ML and thus should be compared with that of supervised ML.61 Sample selection, variables, and study goals need to be clarified beforehand to achieve a more meaningful result.
The same statistical method can be performed using various formulas, which focus on different principles or data types and generate different results. For example, clustering analysis can generate different heat maps depending on the distance measure used, such as Euclidean distance, Manhattan distance, Pearson correlation, minimum distance, and maximum distance. Even for the same dataset, different ML models can achieve varying classification accuracy and performance.8,62,63 Moreover, there is no best or standardized statistical algorithm to fit all data types, especially unknown data. Therefore, an optimized statistical model should be chosen not only on the basis of the data itself, seeking the most reasonable formula, but also on the basis of clinical significance and dataset characteristics. This is probably the biggest challenge during data analysis. Model optimization and comparison should also be performed to ensure that the best-fit model is identified and used while minimizing the risk of overfitting.6,64–66
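The effect of the distance measure can be seen in a small example. In the minimal Python sketch below, the same query profile is matched to a different nearest neighbor depending on whether Euclidean distance, Manhattan distance, or a Pearson correlation-based distance (1 − r) is used. The expression profiles and function names are hypothetical and chosen only to make the contrast visible.

```python
from math import sqrt


def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def pearson_distance(a, b):
    """1 - Pearson correlation: 0 for identical trends, 2 for opposite trends."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sqrt(sum((x - ma) ** 2 for x in a))
    sb = sqrt(sum((y - mb) ** 2 for y in b))
    return 1.0 - cov / (sa * sb)


def nearest(query, candidates, distance):
    """Name of the candidate profile closest to the query under `distance`."""
    return min(candidates, key=lambda name: distance(query, candidates[name]))


# Hypothetical expression profiles of one protein panel across four conditions.
query = [1.0, 2.0, 3.0, 4.0]
candidates = {
    "same_trend": [11.0, 12.0, 13.0, 14.0],    # same shape, higher abundance
    "opposite_trend": [4.0, 3.0, 2.0, 1.0],    # similar abundance, reversed shape
}

for metric in (euclidean, manhattan, pearson_distance):
    print(metric.__name__, "->", nearest(query, candidates, metric))
```

Magnitude-based distances group profiles with similar abundance, whereas correlation-based distance groups profiles with similar trends. The choice should therefore follow the biological question rather than a software default.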
Data deposition, integration and harmonization
Research and clinical communities are integrating and standardizing data from different studies and sources to obtain a bigger picture of the signaling network and greater statistical power. However, several challenges remain in collecting and merging datasets.67 (1) The two major omics data depositories that include proteomic data are the Gene Expression Omnibus, hosted by the National Library of Medicine, National Institutes of Health, Bethesda, MD, USA, and ArrayExpress, hosted by the European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridgeshire, UK.68 However, both depositories focus on genomic and transcriptomic data, and publicly available resources for acquiring both proteomics and genomics data are limited. The resulting small number of overlapping molecules makes it difficult to assess signaling network changes among the DNA, mRNA, and protein levels in a target disease. The PRoteomics IDEntifications (PRIDE) database, part of the ProteomeXchange consortium, has more proteomic data,70 but 33.8% of the datasets in the ProteomeXchange consortium were unreleased,69 and how to incorporate the three omics data types remains challenging. (2) Particularly with clinical samples, protecting patient identity is becoming, and should be, a priority for data repositories.69,71,72 It is legally and ethically challenging to provide sufficient patient protection without imposing excessive administrative burdens. (3) Differences in sample preparation across research or clinical settings increase systematic errors and noise in multi-omics analysis. Therefore, it is essential to use a standardized protocol, common data standards, and annotation guidelines during the experiment and computational processing to obtain qualified data for further data sharing, merging, integration, and mining. Efforts have been focused on standardized submission and data protocols.67,68,73,74 (4) Nearly all proteomic data repositories are designed to accept MS-based data, while non-MS-based data, such as those from ngTMA and multiplex bead-based technology, are currently incompatible.69 Accommodating such data has become a future direction for the ProteomeXchange consortium.
Clinical validation and considerations
Approximately 3,000 genomic and proteomic biomarkers are currently used in clinical trials involving more than 2,000 diseases.75 However, a significant bottleneck in developing useful and marketable proteomics-based assays is the step of moving into clinical validation, with or without a clinical trial.22,76 Any qualified clinical validation study should have not only high sensitivity and specificity but also high precision and high accuracy.77 These characteristics are affected by specimen collection, storage, and processing, which are often not assessed in the scientific research phase. Developing clinical tests, including multi-step procedures that must also be easy to perform with good reproducibility, can be difficult.22 Besides, a complex clinical environment is usually not conducive to high-throughput assays. Cross-contamination is still common and difficult to completely remove/prevent, even for procedures performed in different areas such as the clean area (pre-test room), reaction area (test room), and contaminated area (post-test room).78
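The four performance characteristics named above can be made concrete with a short calculation. The Python sketch below assumes the usual laboratory reading of the terms: sensitivity and specificity are computed from a two-by-two table against a reference diagnosis, precision is expressed as the coefficient of variation of replicate measurements, and accuracy is expressed as percent bias from a reference value. These readings, the function names, and all numbers are illustrative assumptions, not results from the cited work.

```python
from statistics import mean, stdev


def sensitivity(tp, fn):
    """Fraction of diseased samples the assay calls positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of non-diseased samples the assay calls negative."""
    return tn / (tn + fp)


def cv_percent(replicates):
    """Imprecision: sample standard deviation as a percentage of the mean."""
    return 100.0 * stdev(replicates) / mean(replicates)


def bias_percent(replicates, reference):
    """Inaccuracy: deviation of the replicate mean from a reference value."""
    return 100.0 * (mean(replicates) - reference) / reference


# Hypothetical validation counts and replicate measurements (arbitrary units).
tp, fn, tn, fp = 45, 5, 90, 10
replicates = [9.8, 10.1, 10.0, 10.3, 9.8]
reference_value = 10.2

print(f"sensitivity = {sensitivity(tp, fn):.2f}")
print(f"specificity = {specificity(tn, fp):.2f}")
print(f"CV = {cv_percent(replicates):.2f}%")
print(f"bias = {bias_percent(replicates, reference_value):.2f}%")
```

Reporting all four separates diagnostic performance (how well the marker distinguishes patients) from analytical performance (how reproducibly and truthfully the assay measures the marker). An assay can do well on one and fail on the other.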
Furthermore, real-world clinical settings are more complex than research settings. Patients’ existing treatments and comorbidities are major differences between real-world and research settings. They may interfere with or confound the expected clinical usefulness of proteomic markers and thus decrease the generalizability of research studies.78 MS might be highly specific, but clinical laboratories may need higher sensitivity from an (immuno)assay platform.79 Indeed, it is relatively straightforward to incorporate a particular protein variant into a clinical setting as a screening marker that is sensitive but less specific for a disease or diagnosis. Moreover, other reasons may also significantly delay the clinical application of proteomics, including researchers’ lack of understanding of clinical validation requirements, long development turnaround times, the lack of ready-to-use quality control and calibrators, and the lack of necessary paperwork and regular maintenance. Obtaining a sufficient number of both positive and negative samples for clinical validation is a particular, and not uncommon, challenge for small laboratories.
How to meet these challenges
Technological advances, stringent data validation, model optimization, and rigorous validation processes are key to successfully meeting these challenges in clinical proteomics.20,80 PTMs, a central part of the protein property challenge, can be effectively monitored using advanced MS technologies, such as iTRAQ, multiplexed proteome dynamics profiling, and data-independent acquisition (DIA) MS.81–83
Several considerations are noteworthy for improving the statistical modeling of high-throughput clinical proteomics.20 First, rigorous and high-performance statistical modeling relies on robust data validation and quality control processes, which must be ensured through document control and implementation. Second, all ML modeling must be compared with conventional or existing modeling and undergo an optimization (tuning) process for the best performance. Third, an external dataset should be used as the independent test set so that the generalizability of the final (chosen) model can be reliably and independently tested.
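The second and third considerations can be sketched in a few lines of Python. A tunable model (here reduced to a single marker cutoff) is optimized on development data, compared with a conventional cutoff, and then both are evaluated once on an external dataset that played no role in tuning. This is a deliberately simplified illustration: the data, the candidate cutoffs, the "conventional" cutoff of 3.0, and the function names are hypothetical, and a real analysis would tune with cross-validation rather than on the full development set.

```python
def accuracy(threshold, data):
    """data: list of (marker_value, label) pairs, label 1 = disease.
    A sample is called positive when its marker value is >= threshold."""
    correct = sum((value >= threshold) == bool(label) for value, label in data)
    return correct / len(data)


def tune_threshold(data, candidates):
    """Return the candidate cutoff with the highest accuracy on `data`."""
    return max(candidates, key=lambda t: accuracy(t, data))


# Hypothetical development cohort (used for tuning) and external cohort
# (held out and used only once, for the final comparison).
development = [(1.0, 0), (1.4, 0), (1.9, 0), (2.2, 0),
               (2.4, 1), (2.8, 1), (3.1, 1), (3.6, 1)]
external = [(1.2, 0), (2.0, 0), (2.6, 0), (2.3, 1), (3.0, 1), (3.4, 1)]

conventional_cutoff = 3.0  # assumed existing clinical cutoff
tuned_cutoff = tune_threshold(development, [1.5, 2.0, 2.3, 2.5, 3.0])

for name, cutoff in (("conventional", conventional_cutoff), ("tuned", tuned_cutoff)):
    print(f"{name}: cutoff={cutoff}, "
          f"development={accuracy(cutoff, development):.2f}, "
          f"external={accuracy(cutoff, external):.2f}")
```

In this toy example, the tuned cutoff looks perfect on the development data but performs no better than the conventional cutoff on the external cohort. This kind of optimism is what an independent test set is meant to expose.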
Finally, although proteomics-based clinical trials lag behind other diagnostic modalities (e.g., genomics), plasma proteomic and metabolomic data may greatly help guide precision oncology.19,20,84–86 When developing clinical proteomic tests, attention should be directed to each of the pre-analytics, method development, performance evaluation, and implementation steps.85 Several groups called for standardization and stringent quality control processes for the clinical use of proteomic biomarkers.19,20,85 Education and training of laboratory staff are also important for the robust validation and implementation of clinical proteomics.85
Advantages
Global networks
Understanding cell signaling networks involved in disease and carcinogenesis has significantly advanced our knowledge of disease mechanisms and cancer initiation. These networks provide a global picture of protein-protein interactions, pathway-pathway interactions, and the significant functions of each pathway and sub-network group.87 Protein signaling network alterations, as part of a multi-step model of carcinogenesis, result from genetic, epigenetic, and transcriptomic changes, among others. For example, PPA allows digitalized protein expression to be combined with genomic data to create a more comprehensive multi-dimensional signaling network of diseases.8 This network can be further integrated with existing knowledge, such as epidemiological data and digital pathology. A proteomic network, including large-scale PPI discoveries, will thus bridge the gap between genomics and biological functions, refining or reshaping our understanding of diseases.88,89 Furthermore, the entire network has great potential to investigate the functions and relationships of proteins and metabolites that reflect the disease’s hallmarks, and to understand the strength of each group of PPIs, enabling the discovery of driver proteins or driver pathways.90,91 The single protein with the most statistically significant difference in expression is not necessarily the protein that most affects biological functions in the network, nor is it necessarily a driver protein that affects changes across the entire network or an independent factor that affects disease progression. Therefore, biostatistical models and artificial intelligence can recombine all of the biomarkers and optimize them into a panel,89 providing higher specificity than single-protein assays to meet different clinical needs, such as early diagnosis, prognosis prediction, and targeted treatment. This enables truly personalized medicine because each biomarker contributes differently to each clinical setting.
Discovery of new proteins
Unlike immunoassays, MS-based methods do not require the development of high-affinity antibodies specific to each protein epitope. They allow the simultaneous sequencing of hundreds of proteins in a wide variety of biological matrices, including fresh cells, frozen tissues, and formalin-fixed paraffin-embedded tissues.92,93 MS-based methods can discover unknown proteins by searching against a protein sequence database and can quantify them in a complex mixture with high sensitivity, which is the cornerstone of identifying biomarkers and exploring the proteomics network.18,19 Moreover, various types of chemical groups in the peptide structure, such as phosphate groups involved in PTMs, can be captured and mapped back to protein sequences to infer the expressed proteins.94,95 Increasingly popular and improving MS technology makes protein discovery easier, resulting in more comprehensive proteome coverage for various organisms and generating a wealth of information stored in databases and bioinformatics repositories.
Multi-omics
The complex signaling pathway mechanisms that trigger oncogenesis or disease do not depend on any single factor or on the proteomic level alone but instead on the combined effects of genomic, transcriptomic, and proteomic alterations. Any individual protein is regulated by PPIs, mRNA, and DNA, working cooperatively in intricate signaling networks.87,89,96 Moreover, protein levels cannot be directly predicted by mRNA abundance, and protein dynamics are also dependent on other factors such as epigenetic and transcriptional regulation, which work together to alter protein levels, cause abnormal structural conformation, and impair function. Therefore, a multi-omics approach is required to characterize the complex pathophysiology of oncogenesis and explore and reshape the mechanisms of pathogenesis at the molecular level.97–99 Integration of data across -omics areas is promising because it provides a comprehensive view of genomic mutations and transcriptional abnormalities, promoting basic research on cells and animals. Thus, an exciting new field, termed proteogenomics, has been developed.100,101 For example, Cao et al.102 conducted a comprehensive proteogenomic study on 50,000 MS runs of more than 900 projects. Among the 170,529 identified novel peptides, only about 1/30 (6,048) passed their strictest standard, which included being identified in more than two MS runs, more than one PRIDE project, and other criteria.
Currently, the diagnosis, evaluation, and treatment of most cancers are based on limited protein biomarkers or mutations combined with clinical presentation or characteristics. The integration of “omics” data can serve as the cornerstone of modern medicine to identify a new panel of biomarkers at the molecular level.103 These profiles, generated by the statistical models of multi-omics data, can ultimately serve as risk factors for disease, diagnostic assays for early detection, therapeutic markers for personalized treatment, and many other applications in medicine. A single candidate, whether protein, gene, or clinicopathological characteristics, is often insufficient or ineffective in providing a comprehensive evaluation due to the complexity of human tumors and carcinogenesis.104,105 Furthermore, as accumulating patient specimens undergo multi-omics analysis, a larger sample size with a more diverse population will allow the identification of additional low-frequency driver markers and mutations, especially in rare diseases.
Future directions
The clinical applications of high-throughput proteomics are still limited. Therefore, it is crucial to develop a roadmap for the future. We here recommend a roadmap focusing on single-cell biology, individualized proteomics, digital pathology, pathology informatics, deep learning modeling, and new proteomic technologies (Fig. 3).
Proteomics-based single-cell biology and its clinical applications
Single-cell mass cytometry is increasingly applied in various biomedical fields as the vast heterogeneity between cells of the same tissue is gradually recognized in medicine.106–109 This technology will likely help create a new division of modern biology, termed single-cell proteomics, providing unique biological insights at the single-cell level.110 For example, protein extraction-based proteomics techniques, such as Western blot or PPA, cannot readily distinguish whether the target protein is highly expressed in a small portion of cells or weakly expressed in most cells. A sufficiently large population of single cells can be studied as a time series of “snapshots” to recreate a timeline of the dynamic biological processes of disease. Next-generation sequencing, as an ultra-high-throughput approach to transcriptome analysis, has tremendously improved sensitivity and increased capacity in genomics, allowing the identification of numerous new genetic variants in rare cell populations.25 Single-cell proteomics can similarly provide maximal information to uncover the uniqueness of each cell in proteomics and has been recently applied to multiple myeloma, chimeric antigen receptor T cell therapy, and ovarian cancers.111–113
Li et al.114 described a nanoliter-scale oil-air-droplet chip for multistep complex sample pretreatment and injection for single-cell proteomic analysis in the shotgun mode, identifying 355 proteins at the single-cell level (mouse oocyte). Zhu et al.115 developed the nanoPOTS (nanodroplet processing in one pot for trace samples) platform, which can identify over 3,000 proteins from as few as 10 cells using the Match Between Runs algorithm of MaxQuant. They demonstrated the quantification of 2,400 proteins from a single human pancreatic islet within sections using this system. Ctortecka et al.116 further developed the proteoCHIP for preparing single-cell proteomics samples, detecting 2,000 proteins per TMT10-plex across 170 multiplexed single cells from various human cell types. Krieg et al.117 used high-dimensional single-cell mass cytometry for an in-depth characterization of immune cell subsets in the peripheral blood of patients with stage IV melanoma to predict responses to anti-PD-1 immunotherapy. Various new mass-based cytometry technologies significantly promote high-dimensional proteomics of single cells,118 opening an exciting field to understand the complicated relationship between tumor cells and the environment and bringing higher clinical value for precision medicine.
Individualized proteomics
Polymorphisms, as individual modifiers of distinct genetic traits, widely influence the genome-based global arrangement of the proteome, resulting in a unique proteomic signature for each individual. Moreover, the structure and expression of the proteomic network vary in functional efficiency, a phenomenon known as “network polymorphisms”.119,120 Furthermore, individual variations lead to different biochemical and physiological baselines. Various modifiable factors, such as smoking, stress, obesity, and nutrition, accumulate in each individual with aging and, together, shape their unique genome, epigenome, and proteome and contribute to disease development.121 Therefore, assessing and monitoring an individual’s proteotype of oncogenic proteomic profiles will contribute to the evolution and expansion of individualized precision medicine, ensuring that the right intervention, including diagnosis or treatment, is provided to the right person at the right time.26,119
Pathology informatics and digital pathology
Pathology informatics and digital pathology have emerged in recent years with the rapid development of imaging and computational technology. They enable us to perform more clinical tasks in a shorter time and generate a large amount of medical data. Novel sources of data from electronic medical records, pathology images, and bio-information,122 recorded by wearable personal trackers (e.g., heart rate, activity, sleep, weight), along with multi-omics data, are incorporated into a comprehensive dataset, providing a broader view of disease. No single analytical domain holds the keys to all aspects of disease development, diagnosis, and treatments.37,92,123 These detailed multi-dimensional explorations enable proteomics to play a significant role in the near future. The new multi-dimensional data resources provide a new foundation to generate more comprehensive proteomics biomarkers or panels for modern precision medicine. In turn, proteomics in this integrated multidisciplinary context can provide robust information to rethink and uncover the pathogenesis or specific patterns during the course of disease.
Deep learning
Deep learning, a subdiscipline within artificial intelligence and ML, focuses on algorithms that enable computers to learn to solve problems from existing data (training data).124 With its powerful computing and learning capabilities, deep learning can handle complex situations far beyond what the human brain alone can accomplish.125 It is particularly well-suited for processing massive (big) data with strong internal correlations from high-throughput multi-omics, assisting in diagnosis.126 Deep learning can accelerate the statistical analysis and data visualization of existing proteomics methods. For example, MS data are analyzed as peaks, which represent ions with a specific mass-to-charge ratio and can serve as biomarkers owing to their similarity to entries in massive sequence databases, without determining which peptides or proteins are actually present.127 During this process, deep learning can be a powerful tool to identify biomarkers with higher accuracy.89,128,129 It also allows us to explore large, comprehensive datasets that combine data from multi-omics.130
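As a minimal illustration of what "analyzing MS data as peaks" means, the following Python sketch picks local intensity maxima from a toy spectrum. The function name, the noise threshold, and the data are hypothetical and chosen only for illustration; production tools use far more elaborate centroiding and noise models.

```python
def pick_peaks(spectrum, min_intensity):
    """Return local intensity maxima from a spectrum.

    `spectrum` is a list of (mz, intensity) pairs sorted by m/z.
    A point counts as a peak if it reaches `min_intensity` (an assumed
    noise cutoff), is higher than its left neighbour, and is at least
    as high as its right neighbour.
    """
    peaks = []
    for i in range(1, len(spectrum) - 1):
        mz, intensity = spectrum[i]
        if (intensity >= min_intensity
                and intensity > spectrum[i - 1][1]
                and intensity >= spectrum[i + 1][1]):
            peaks.append((mz, intensity))
    return peaks


# Toy spectrum with two ion signals on a low background (made-up values).
toy_spectrum = [
    (500.0, 2.0), (500.1, 15.0), (500.2, 60.0), (500.3, 18.0),
    (500.4, 3.0), (500.5, 4.0), (500.6, 35.0), (500.7, 5.0),
]
peaks = pick_peaks(toy_spectrum, min_intensity=10.0)
print(peaks)
```

Each retained pair is one candidate feature (an m/z value and its intensity) that a downstream statistical or deep learning model could use, without the underlying peptide having been identified.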
Although multiple ML applications have been used to assist with proteomic data analysis, there remains substantial room for improvement.131 Traditional data analysis is based on digital matrices converted from raw image results through multiple steps. For example, the size and density of each band from the antibody-antigen reaction, generated by a protein pathway array, need to be manually converted to numeric data. This multi-step manual processing can introduce various systematic errors and make data difficult to merge. Convolutional neural networks and recurrent neural networks, two types of multilayer artificial neural networks used in deep learning, are particularly useful for image analysis. They allow algorithms and statistical models to be built directly on the original image results rather than on manually converted digital data. This approach may significantly improve the accuracy and efficiency of proteomics data analysis and increase the availability of high-quality public data resources.74,132,133
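To make the conventional conversion step concrete, the sketch below reduces a small grayscale image of one band to two numbers, its size (pixel count) and its background-subtracted density. It is a simplified illustration only: the function name, the fixed threshold, and the toy pixel values are assumptions, and actual densitometry software applies more careful background correction.

```python
def quantify_band(image, threshold):
    """Summarise one band as (area, integrated_density).

    `image` is a list of rows of pixel intensities. Pixels above
    `threshold` (an assumed background level) are counted as band
    pixels; their intensity above the threshold is summed.
    """
    area = 0
    integrated_density = 0.0
    for row in image:
        for value in row:
            if value > threshold:
                area += 1
                integrated_density += value - threshold
    return area, integrated_density


# Toy 4 x 4 image of a single band (made-up intensities).
toy_band = [
    [0, 1, 0, 0],
    [1, 8, 9, 1],
    [0, 7, 9, 0],
    [0, 1, 0, 0],
]
area, density = quantify_band(toy_band, threshold=2)
print(area, density)
```

The whole image collapses to a single pair of values, and the result depends on the chosen threshold. This is one way in which manual conversion can introduce systematic differences between laboratories, whereas an image-based neural network would take the full pixel grid as its input.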
New proteomic technology
New proteomic technologies are rapidly developing and currently include DIA MS, nanopore-based proteomics, Python-based high-efficiency data processing packages, 4-D proteomics, and secondary ion MS. For example, the SWATH-MS system, a type of DIA MS, systematically fragments and measures all ionized peptides within a predefined mass range, resulting in fewer biases and better consistency than approaches that rely on isolating selected precursors.134 In SWATH-MS measurements, peptide-centric scoring is often used and requires a thorough understanding of the peptide’s chromatographic and mass spectrometric characteristics.134 Based on a conceptually novel MS instrument,135 an ultra-fast label-free DIA MS (named narrow-window DIA MS) was recently proposed, combining high-resolution MS1 scans with parallel tandem MS/MS scans at ∼200 Hz and delivering high sensitivity, specificity, and speed.136 Nanopore technology, known for its low cost and high sensitivity in DNA/RNA sequencing, may also be applied to label-free proteomics.137 A Python-based package (AlphaPept) was developed to efficiently process large high-resolution MS datasets using both central processing units and graphics processing units.138 The introduction of 4-D proteomics, with ion mobility as the fourth dimension, expands the depth of proteomics and has been coupled with DIA MS.139–142 Secondary ion MS involves the detection and mass-to-charge ratio analysis of secondary ions generated when sample surfaces are bombarded with energetic ions,143 offering very granular surface chemical data and sub-monolayer sensitivity.
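The idea of systematically covering a predefined mass range in SWATH-MS can be sketched as a set of contiguous precursor isolation windows. The Python example below is illustrative only: the 400 to 1,200 m/z range and the 25 m/z window width are assumed values, the function names are hypothetical, and real acquisition methods often use overlapping or variable-width windows.

```python
def dia_windows(mz_start, mz_end, width):
    """Tile [mz_start, mz_end) with contiguous isolation windows."""
    windows = []
    lower = mz_start
    while lower < mz_end:
        upper = min(lower + width, mz_end)
        windows.append((lower, upper))
        lower = upper
    return windows


def window_for(precursor_mz, windows):
    """Return the index of the window containing a precursor, or None."""
    for index, (lower, upper) in enumerate(windows):
        if lower <= precursor_mz < upper:
            return index
    return None


# Assumed acquisition scheme: 25 m/z windows over 400 to 1,200 m/z.
windows = dia_windows(400.0, 1200.0, 25.0)
print(len(windows))                  # number of windows per cycle
print(window_for(512.3, windows))    # window that fragments this precursor
print(window_for(1300.0, windows))   # outside the predefined range
```

Because every precursor inside the range falls into exactly one window, all ionized peptides in that range are fragmented in each cycle, regardless of their intensity.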
Notably, these new proteomic technologies can be integrated to generate synergistic outcomes. For example, 4-D proteomics combined with DIA MS reportedly improves detection sensitivity.142 DIA MS has also been used to explore single-cell proteomics.144
We must balance innovation and compliance when applying new proteomic technologies to clinical problems. Innovations in technology will certainly shift the paradigm in clinical proteomics. However, without rigorous validation in multiple datasets and high-level clinical evidence, new technologies, including proteomics, must not be directly used for clinical care, even as a laboratory-developed test (LDT).145,146 Indeed, the College of American Pathologists has a policy requiring rigorous validation of LDTs, and the U.S. Food and Drug Administration is also likely to regulate LDTs.147
Conclusions
High-throughput proteomics is increasingly being applied to translational research, clinical practice, and public health. While others have elegantly reviewed advances in pancreatic cancer, soft tissue sarcomas, or medicine as a whole, we briefly summarize recent advances in the clinical applications of high-throughput proteomics in breast cancer, colorectal cancer, gastric cancer, bladder cancer, laryngeal squamous cell carcinoma, and COVID-19. Future applications of high-throughput proteomics will face challenges related to various protein properties, limitations of statistical modeling, and technical and logistical difficulties in data deposition, integration, and harmonization, as well as regulatory requirements and considerations for clinical validation. However, we are encouraged by the advantages of high-throughput proteomics, including novel global protein networks, the discovery of new proteins, and synergistic incorporation with other omic data. We look forward to future advances in high-throughput proteomics, such as single-cell proteomics and its clinical applications, individualized proteomics, pathology informatics, digital pathology, and deep learning models for high-throughput proteomics. In our view, recent and future advances in high-throughput proteomics will drastically shift the paradigms of translational research, clinical practice, and public health.