Introduction
Rectal cancer is one of the leading causes of cancer-related morbidity and mortality worldwide, with incidence rates rising particularly in individuals under 50. Rectal cancer differs from colon cancer in anatomical location, treatment strategy, local recurrence risk, and response to therapy, making early and accurate risk stratification of rectal precancerous lesions particularly critical for clinical decision-making. Early detection of rectal precancerous lesions that are likely to undergo malignant transformation is critical for effective cancer prevention and personalized treatment, as progression from adenomas or dysplastic polyps to invasive carcinoma can occur over several years.1 Despite routine colonoscopy screening, a significant proportion of high-risk lesions remain undetected or misclassified, highlighting the need for advanced predictive approaches that can stratify patients according to malignant potential. Despite advances in histopathological evaluation, conventional diagnostic methods rely primarily on morphological assessment through Whole Slide Images (WSIs) and expert interpretation.2 While WSI provides high-resolution visualization of tissue architecture and cellular atypia,3 it is limited in capturing underlying molecular alterations that often precede visible morphological changes.4 Consequently, patients with high-risk precancerous lesions may not be accurately identified,5 leading to delayed intervention and reduced treatment efficacy.6 This gap highlights the urgent need for integrative approaches that combine complementary data sources to improve predictive accuracy and clinical decision-making.7
Recent developments in high-throughput technologies have enabled the comprehensive characterization of biological systems through multiomics profiling, including genomics, transcriptomics, proteomics, and epigenomics.8 These datasets provide insights into the molecular mechanisms driving malignant transformation and tumor progression.9 For example, gene expression patterns, mutational landscapes, and protein activity profiles can reveal early oncogenic events that are not detectable through morphology alone.10 Several studies have demonstrated the predictive potential of multiomics data in oncology11; however, their integration with histopathological imaging remains challenging.12 Most existing approaches focus on single-modality analyses, either processing images or molecular profiles independently,13 which limits the ability to fully exploit complementary information across data types. Moreover, variability in feature dimensionality, scale, and noise presents additional obstacles for multimodal integration.14
Artificial intelligence (AI) and deep learning have emerged as powerful tools for analyzing complex biomedical data, offering the ability to learn hierarchical and non-linear relationships across heterogeneous inputs.15 Convolutional neural networks have been widely applied to WSI for tissue classification and cancer detection,16 while autoencoder-based architectures and graph neural networks have shown promise in representing high-dimensional omics data.17 Despite these advances, few studies have proposed frameworks capable of jointly processing WSI and multiomics data for early prediction of malignant transformation in precancerous lesions.18 Integrating these modalities requires carefully designed fusion strategies that can handle differences in data type, dimensionality, and informative content, while preserving interpretability for clinical applicability.19
The potential benefits of a multimodal AI approach extend beyond predictive performance.20 By combining WSI and multiomics information, such a framework could provide more comprehensive insights into disease progression,21 identify key molecular drivers of transformation,22 and highlight tissue regions of interest that contribute most to risk assessment.23 Interpretability methods such as Grad-CAM for imaging features and SHAP for molecular features can help elucidate model decision-making. Publicly available datasets, including The Cancer Genome Atlas (TCGA) and the Clinical Proteomic Tumor Analysis Consortium (CPTAC), provide opportunities for both methodological development and future validation of multimodal predictive frameworks,24 although currently no dataset fully integrates high-resolution histopathology with matched multiomics for precancerous lesion progression.25
In this study, we propose a novel AI-based multimodal framework that integrates WSI and multiomics data to predict the malignant transformation of precancerous lesions. The model employs a Vision Transformer (ViT) for extracting high-level histopathological features from WSI and a Variational Autoencoder (VAE) for learning latent representations from multiomics profiles. These features are fused through a cross-attention mechanism to capture inter-modality dependencies and provide a robust predictive output. By designing the framework with technical detail, interpretability, and computational feasibility in mind, this study aimed to establish a foundation for data-driven early detection tools that can enhance precision oncology. The primary objective of this study was to develop and evaluate a technically detailed, integrative AI framework capable of accurately predicting the malignant transformation of precancerous lesions, thereby addressing a critical gap in current diagnostic capabilities.
Materials and methods
Study design
This study is a retrospective computational modeling study based on publicly available databases, aimed at developing and evaluating a multimodal AI framework for predicting the malignant transformation of precancerous rectal lesions. The primary objective was to integrate morphological features extracted from WSI with molecular features from multiomics data—including genomics, transcriptomics, and proteomics—into a unified model capable of learning cross-modal relationships and improving predictive performance.
The study was purely computational, and no human or animal interventions were conducted; therefore, ethical approval was not required. All datasets used were publicly available, de-identified, and obtained from TCGA and CPTAC (450 paired samples in total), including both precancerous and malignant tissues, to ensure representative coverage of different dysplasia stages.26
To prevent data leakage, dataset splitting was performed at the patient level, ensuring that samples from the same patient were not shared across training, validation, or test sets. The dataset was divided into training (70%), validation (15%), and test (15%) subsets while maintaining class balance. The proposed framework was designed to jointly process WSI and multiomics data, capturing the complementary information from both modalities and ensuring robust generalization across unseen samples.27,28
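For illustration, the patient-level split described above can be sketched with scikit-learn's `GroupShuffleSplit`; the patient IDs and sample counts below are synthetic stand-ins, not the study's released code:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Synthetic example: 20 samples drawn from 10 patients (IDs are hypothetical).
rng = np.random.default_rng(0)
patient_ids = np.repeat(np.arange(10), 2)   # two samples per patient
X = rng.normal(size=(20, 5))
y = rng.integers(0, 2, size=20)

# First split: 70% of patients go to training, 30% are held out.
outer = GroupShuffleSplit(n_splits=1, train_size=0.70, random_state=42)
train_idx, holdout_idx = next(outer.split(X, y, groups=patient_ids))

# Second split: divide the held-out patients into validation and test halves.
inner = GroupShuffleSplit(n_splits=1, train_size=0.50, random_state=42)
val_rel, test_rel = next(inner.split(X[holdout_idx], groups=patient_ids[holdout_idx]))
val_idx, test_idx = holdout_idx[val_rel], holdout_idx[test_rel]
```

Because splitting is performed on patient groups rather than individual samples, no patient can contribute samples to more than one subset.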
Data collection and preprocessing
WSI were obtained from specific TCGA projects related to colorectal and rectal cancer, including histopathological slides representing various stages of dysplasia and carcinoma. Corresponding multiomics data (RNA-seq gene expression, somatic mutation profiles from whole-exome sequencing, and proteomics measurements) were retrieved from the TCGA and CPTAC databases. Only samples with matched WSI and multiomics data were retained, and incomplete profiles were excluded or imputed as described below. All data were de-identified and publicly accessible via the Genomic Data Commons and CPTAC Data Portal.29
WSI processing
Digital pathology slides were processed using OpenSlide (OpenSlide Technologies, Chicago, IL, USA). Tissue regions were automatically identified using Otsu thresholding, and non-overlapping patches of 512×512 px were extracted at 20× magnification. Color normalization was performed using the Macenko method to mitigate staining variability across slides. Patches containing more than 80% background were excluded to ensure tissue relevance. For each retained patch, low-level handcrafted features, including mean RGB intensities, entropy, contrast, and homogeneity, were computed prior to deep feature extraction.
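The Otsu-based tissue detection and 80%-background patch filter can be sketched as follows. This is a minimal NumPy re-implementation applied to a synthetic image, not the study's pipeline code; real slides would be read and tiled with OpenSlide:

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Return the Otsu threshold for an 8-bit grayscale image."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                  # cumulative class probability
    mu = np.cumsum(prob * np.arange(256))    # cumulative class mean
    mu_t = mu[-1]
    # Between-class variance for every candidate threshold.
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1.0 - omega))
    return int(np.argmax(np.nan_to_num(sigma_b)))

def is_background_patch(gray_patch: np.ndarray, thresh: int,
                        max_bg_frac: float = 0.80) -> bool:
    """Discard patches whose bright (background) fraction exceeds 80%."""
    return (gray_patch > thresh).mean() > max_bg_frac

# Synthetic bimodal image: a dark "tissue" block on a bright background.
img = np.full((512, 512), 230, dtype=np.uint8)
img[100:400, 100:400] = 60
t = otsu_threshold(img)
```

On the synthetic image, `is_background_patch` rejects a corner patch of pure background while retaining a patch taken from the dark tissue block.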
Multiomics processing
RNA-seq count data were normalized using DESeq2 to account for sequencing depth variability and dispersion inherent in count-based transcriptomic data, followed by log2 transformation. Alternative normalization strategies, including transcripts per million, were evaluated during preliminary experiments; however, DESeq2 normalization demonstrated more stable performance across cross-validation folds and was therefore selected. Somatic mutation data were encoded as binary gene-level matrices indicating mutation presence or absence. Proteomic features were standardized using z-score normalization across samples to ensure comparability during multimodal integration. Missing values were imputed using a k-nearest neighbors approach (k = 5). Batch effects across cohorts were corrected using the ComBat algorithm. Prior to multimodal fusion, low-variance features were removed, and principal component analysis (PCA) was applied to mutation and proteomic datasets to reduce noise and computational complexity while preserving informative variance.
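The imputation, scaling, variance filtering, and PCA steps described above can be chained as a scikit-learn pipeline. The matrix below is synthetic and the component count is illustrative; ComBat batch correction (available in separate packages such as neuroCombat) is omitted here:

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

# Synthetic proteomic matrix (100 samples x 50 features) with missing values.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
X[rng.random(X.shape) < 0.05] = np.nan   # ~5% missingness

pipe = Pipeline([
    ("impute", KNNImputer(n_neighbors=5)),           # k-NN imputation (k = 5)
    ("zscore", StandardScaler()),                    # per-feature z-score
    ("varfilter", VarianceThreshold(threshold=0.0)), # drop constant features
    ("pca", PCA(n_components=20, random_state=0)),   # noise/dimension reduction
])
Z = pipe.fit_transform(X)
```

Fitting the pipeline only on training samples (and applying `transform` to validation and test sets) keeps the imputation and scaling statistics free of leakage.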
Sample selection
Precancerous samples were identified based on TCGA and CPTAC pathological annotations corresponding to adenomatous, dysplastic, or non-invasive neoplastic rectal tissues, while malignant samples corresponded to invasive rectal adenocarcinoma. Where available, dysplasia grading information (low-grade vs. high-grade dysplasia) was retained to ensure representative coverage of early and advanced precancerous stages. Although detailed dysplasia stratification was not uniformly available for all cases, the final cohort encompassed heterogeneous precancerous phenotypes, enabling the model to learn a continuum of malignant transformation risk rather than discrete pathological categories. This approach aligned with the clinical objective of early risk prediction rather than precise histological staging. After preprocessing and quality control, a total of 450 paired WSI–multiomics samples were retained, including 230 precancerous and 220 malignant tissues. All retained samples had complete multiomics profiles and sufficient tissue coverage in WSI patches.
Model architecture
The proposed multimodal framework consisted of two parallel encoders for WSI and multiomics data and a cross-attention fusion module.
Histopathology encoder (ViT-B/16)
The ViT-B/16 model (Google Research) was pretrained on ImageNet-21k and fine-tuned on the rectal histopathology dataset. WSI patches of 512×512 px were input to the model. To obtain slide-level representations, patch embeddings were aggregated using attention-based pooling, capturing both local and global tissue features. Domain adaptation was performed by fine-tuning the model on histopathology patches while freezing the initial layers to preserve pretrained features. Each slide was ultimately represented as a 768-dimensional embedding. The integration of these embeddings with the multiomics latent vectors is described below in the fusion module section.
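Attention-based pooling of patch embeddings can be sketched in NumPy as a weighted mean whose weights are scored from the patches themselves. The projection sizes below are illustrative assumptions, not the study's trained pooling head:

```python
import numpy as np

def attention_pool(patch_emb, W, b, v):
    """Score each patch, softmax the scores, return the weighted-mean slide embedding."""
    scores = np.tanh(patch_emb @ W + b) @ v        # one scalar score per patch
    scores = scores - scores.max()                 # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # attention weights sum to 1
    return alpha @ patch_emb, alpha

rng = np.random.default_rng(0)
patches = rng.normal(size=(300, 768))   # 300 patch embeddings from one slide
W = rng.normal(size=(768, 64)) * 0.05   # hypothetical projection, not trained weights
b = np.zeros(64)
v = rng.normal(size=64) * 0.05
slide_vec, alpha = attention_pool(patches, W, b, v)
```

The result is a single 768-dimensional slide embedding in which informative patches receive larger weights than uninformative ones.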
Multiomics encoder (VAE)
A VAE was implemented in TensorFlow 2.12 to learn latent representations from concatenated transcriptomic, mutational, and proteomic features. The encoder consisted of three fully connected layers (512, 256, 128 neurons) with ReLU activation, producing a 64-dimensional latent space. The decoder reconstructed the input omics features to minimize the reconstruction loss, while the KL divergence ensured a smooth latent distribution. The VAE was jointly trained with the classification head to optimize both latent representation quality and predictive performance.
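The two VAE loss terms can be written explicitly. The snippet below is a NumPy sketch of the mean-squared reconstruction error and the closed-form KL divergence between a diagonal-Gaussian posterior and the standard-normal prior; shapes are illustrative:

```python
import numpy as np

def vae_losses(x, x_hat, mu, logvar):
    """Per-batch VAE loss terms: MSE reconstruction + KL(q(z|x) || N(0, I))."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    kl = np.mean(-0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=1))
    return recon, kl

# When the posterior equals the standard-normal prior, the KL term vanishes.
mu = np.zeros((8, 64))
logvar = np.zeros((8, 64))
x = np.ones((8, 10))
x_hat = np.ones((8, 10))
recon, kl = vae_losses(x, x_hat, mu, logvar)
```

In joint training these two terms are combined with the classification loss as a weighted sum, as described in the training section.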
Fusion module (cross-attention)
To integrate histopathological and molecular representations, a cross-attention–based fusion module was employed. The 768-dimensional WSI embeddings and 64-dimensional multiomics latent vectors were first projected into a shared latent space of 128 dimensions using learnable linear transformation layers to ensure dimensional compatibility. Two stacked cross-attention layers with four attention heads were then applied, where WSI embeddings served as queries and multiomics embeddings as keys and values, enabling the model to attend to molecular features conditioned on spatial tissue representations. Each cross-attention layer was followed by layer normalization and residual connections to stabilize training. The output of the fusion module was aggregated via mean pooling and passed to a classification head consisting of two fully connected layers with 128 and 64 neurons, respectively, each followed by ReLU activation and dropout (rate = 0.3). A final sigmoid output unit produced the probability of malignant transformation.
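A single-head version of the unidirectional cross-attention step can be sketched in NumPy. The paper's module uses four heads, two stacked layers, layer normalization, and residual connections; those details are omitted for brevity, and the token counts below are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_src, kv_src, Wq, Wk, Wv):
    """Unidirectional cross-attention: WSI tokens query omics tokens."""
    Q, K, V = q_src @ Wq, kv_src @ Wk, kv_src @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # scaled dot-product
    A = softmax(scores, axis=-1)              # each row: distribution over omics tokens
    return A @ V, A

rng = np.random.default_rng(0)
d = 128                           # shared latent dimension after projection
wsi = rng.normal(size=(4, d))     # 4 projected WSI tokens (hypothetical)
omics = rng.normal(size=(6, d))   # 6 projected omics tokens (hypothetical)
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.05 for _ in range(3))
out, A = cross_attention(wsi, omics, Wq, Wk, Wv)
fused = out.mean(axis=0)          # mean pooling before the classification head
```

Each attention row is a probability distribution over omics tokens, so every WSI token ends up as a molecularly conditioned mixture of omics values.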
Model training and validation
The dataset was split at the patient level to avoid data leakage, with 70% of patients allocated to the training set, 15% to validation, and 15% to the independent test set, while maintaining class balance across precancerous and malignant samples. All experiments were conducted on NVIDIA RTX A6000 GPUs.
The model was trained using the Adam optimizer (β1 = 0.9, β2 = 0.999, weight decay = 0.01) with an initial learning rate of 1×10−4 and a batch size of 32. A learning rate scheduler reduced the learning rate by a factor of 0.5 if the validation loss did not improve over 5 epochs. The maximum number of epochs was set to 100, and early stopping was applied after 10 consecutive epochs without validation loss improvement.
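The plateau-based learning-rate schedule and early stopping described above can be sketched as a small controller. This is an illustrative re-implementation of the rules (halve the LR after 5 stagnant epochs, stop after 10), not the training script itself:

```python
class TrainingController:
    """Plateau handling: halve the LR after 5 stagnant epochs, stop after 10."""
    def __init__(self, lr=1e-4, lr_patience=5, stop_patience=10, factor=0.5):
        self.lr, self.factor = lr, factor
        self.lr_patience, self.stop_patience = lr_patience, stop_patience
        self.best = float("inf")
        self.since_improve = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return False when training should stop."""
        if val_loss < self.best:
            self.best, self.since_improve = val_loss, 0
            return True
        self.since_improve += 1
        if self.since_improve % self.lr_patience == 0:
            self.lr *= self.factor          # reduce LR by factor 0.5 on plateau
        return self.since_improve < self.stop_patience

ctl = TrainingController()
losses = [0.9, 0.8] + [0.85] * 10           # two improvements, then a plateau
stopped_at = next(i for i, l in enumerate(losses) if not ctl.step(l))
```

On the synthetic loss trace, the LR is halved after epochs 5 and 10 of the plateau, and training halts once 10 consecutive epochs pass without improvement.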
Data augmentation was applied to WSI patches, including random rotations (±15°), horizontal and vertical flipping (probability = 0.5), and color jittering (brightness, contrast, saturation adjustments ±0.2). Dropout (rate = 0.3) and batch normalization were applied in fully connected layers to improve generalization. For multiomics data, random feature masking (10%) during training enhanced robustness to missing molecular information.
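The random feature-masking augmentation for omics inputs can be sketched in a few lines; the matrix size is an arbitrary placeholder:

```python
import numpy as np

def mask_features(x, frac=0.10, rng=None):
    """Zero out roughly `frac` of molecular features, independently per entry."""
    rng = rng if rng is not None else np.random.default_rng()
    keep = rng.random(x.shape) >= frac
    return x * keep

rng = np.random.default_rng(0)
x = np.ones((32, 1000))                  # placeholder omics batch
xm = mask_features(x, frac=0.10, rng=rng)
masked_frac = 1.0 - xm.mean()            # empirically near 0.10
```

Applying a fresh mask each training step forces the classifier not to rely on any single molecular feature being present.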
When training the VAE jointly with the classifier, the total loss was computed as a weighted sum of the reconstruction loss, KL divergence, and binary cross-entropy classification loss, with weights empirically set to balance latent representation quality and predictive performance.
All results were averaged over five independent runs using different random seeds to ensure reproducibility and statistical stability.
Evaluation metrics
Model performance was assessed using standard classification metrics, including accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC-ROC). For each metric, the mean ± standard deviation was reported across five independent runs with different random seeds to ensure reproducibility.
Both macro- and micro-averaged AUC values were computed to account for potential class imbalance. Predicted probabilities were calibrated using Platt scaling and assessed with calibration curves, and confusion matrices were constructed to evaluate misclassification tendencies and model reliability.
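A minimal sketch of the evaluation step with scikit-learn, using hypothetical labels and predicted probabilities rather than the study's outputs:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)
from sklearn.calibration import calibration_curve

# Hypothetical predictions for a small held-out set.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.8, 0.7, 0.9, 0.3, 0.6, 0.2])
y_pred = (y_prob >= 0.5).astype(int)        # 0.5 decision threshold

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "auc": roc_auc_score(y_true, y_prob),   # AUC uses probabilities, not labels
}
cm = confusion_matrix(y_true, y_pred)
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=2)
```

In the binary setting shown here the macro- and micro-averaged AUC coincide; the distinction matters only for multiclass extensions.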
All analyses and visualizations were performed using Python 3.11 with scikit-learn, Matplotlib, and Seaborn libraries.
Statistical analysis
All statistical analyses were performed using Python 3.11 and R 4.2. Performance metrics were compared between models using paired Student’s t-tests, with significance defined as P < 0.05. For multiple comparisons across metrics, Bonferroni correction was applied to control for type I error where appropriate. The correlation between omics-derived risk scores and histopathology-based predictions was evaluated using two-tailed Spearman’s rank correlation coefficient with 95% confidence intervals. All analyses were conducted at the sample level, averaged across five independent runs with fixed random seeds to ensure reproducibility. Model explainability outputs were visualized using Matplotlib (v3.7) and Seaborn (v0.12) libraries, and statistical computations were performed with SciPy and R stats packages.
Results
Dataset overview
The dataset comprised 450 paired WSI–multiomics samples, including 230 precancerous and 220 malignant tissues. After quality control, 96.8% of WSI patches and 93.5% of multiomics features (RNA-seq, mutation, proteomics) were retained.30 The distribution of samples, retained features, and raw counts for each omics layer is summarized in Table 1. Table 2 presents the comparative performance of the multimodal ViT+VAE fusion model versus unimodal baselines (WSI-only and omics-only). The fusion model outperformed the baselines across all metrics, achieving an accuracy of 0.892 ± 0.012 and an AUC of 0.927 ± 0.009. The clinical characteristics of patients with rectal lesions are summarized in Table 3, including patient age, gender, year of diagnosis, tumor stage, and treatments received. Table 4 presents the pathology characteristics of these rectal lesions, including tumor cell percentage, tumor size, and vascular invasion status.
Table 1. Dataset composition and retained features after quality control (QC)

| Description | Value |
|---|---|
| Total paired samples (Whole Slide Image (WSI) + multiomics) | 450 |
| Precancerous rectal lesions | 230 |
| Malignant rectal lesions | 220 |
| WSI patches retained after QC (%) | 96.8% |
| Omics features retained after QC (%) | 93.5% |
| Gene expression features (raw) | 18,562 |
| Proteomic measurements (raw) | 7,914 |
| Mutation matrix features | 12,430 |
Table 2. Comparative performance of the multimodal fusion model and unimodal baselines

| Model | Modality | Accuracy | Area under the curve (AUC) | F1 |
|---|---|---|---|---|
| Multimodal Vision Transformer (ViT) + Variational Autoencoder (VAE) (Fusion) | Whole Slide Image (WSI) + Omics | 0.892 ± 0.012 | 0.927 ± 0.009 | 0.894 ± 0.010 |
| ViT Only | WSI | 0.781 ± 0.018 | 0.859 ± 0.015 | 0.784 ± 0.017 |
| Omics Only (VAE + Classifier) | Omics | 0.764 ± 0.020 | 0.842 ± 0.013 | 0.772 ± 0.018 |
Table 3. Clinical characteristics of patients with rectal lesions

| Case ID | Year of diagnosis | Tumor stage | Treatment type | Age | Gender |
|---|---|---|---|---|---|
| TCGA-AF-2692 | 2008 | II | Radiation Therapy, Pharmaceutical Therapy | 62 | Male |
| TCGA-AF-3911 | 2009 | III | Pharmaceutical Therapy, Radiation, External Beam | 57 | Female |
| TCGA-AG-3574 | 2005 | II | Radiation Therapy, Pharmaceutical Therapy | 65 | Male |
| TCGA-AG-3728 | 2006 | I | Chemotherapy, Radiation Therapy | 54 | Male |
| TCGA-AG-3878 | 2007 | II | — | 59 | Female |
Table 4. Pathology characteristics of rectal lesions

| Patient ID | Tumor cell percentage (%) | Tumor size (mm) | Vascular invasion |
|---|---|---|---|
| TCGA-AF-2692 | 66 | 26 | Yes |
| TCGA-AF-3911 | 71 | 29 | Yes |
| TCGA-AG-3574 | 53 | 19 | No |
| TCGA-AG-3728 | 78 | 35 | Yes |
| TCGA-AG-3878 | 60 | 23 | No |
The distribution of tumor and normal samples, along with portions, aliquots, and analytes, is summarized in Table 5. This overview provides insight into the biospecimen preparation for multiomics and histopathological analyses of rectal lesions. An ablation study was conducted to assess the contribution of each model component, as shown in Table 6. Removing the cross-attention module or replacing encoders resulted in decreased test AUC, highlighting the importance of each architectural element.
Table 5. Summary of biospecimen samples from patients with rectal lesions

| Patient ID | Sample ID | Tissue type | Portions | Aliquots | Analytes |
|---|---|---|---|---|---|
| TCGA-AF-2692 | S-001 | Tumor | 2 | 3 | 5 |
| TCGA-AF-2692 | S-002 | Normal | 1 | 1 | 2 |
| TCGA-AF-3911 | S-003 | Tumor | 3 | 4 | 6 |
| TCGA-AF-3911 | S-004 | Normal | 1 | 1 | 2 |
| TCGA-AG-3574 | S-005 | Tumor | 2 | 2 | 4 |
| TCGA-AG-3728 | S-006 | Tumor | 3 | 3 | 5 |
| TCGA-AG-3878 | S-007 | Tumor | 2 | 2 | 3 |
| TCGA-AG-3878 | S-008 | Normal | 1 | 1 | 2 |
| TCGA-AG-3878 | S-009 | Tumor | 2 | 3 | 4 |
| TCGA-AF-3911 | S-010 | Tumor | 1 | 1 | 2 |
| TCGA-AG-3574 | S-011 | Normal | 1 | 1 | 1 |
| TCGA-AG-3728 | S-012 | Tumor | 2 | 2 | 3 |
| TCGA-AF-2692 | S-013 | Tumor | 2 | 2 | 3 |
| TCGA-AG-3878 | S-014 | Normal | 1 | 1 | 1 |
| TCGA-AG-3574 | S-015 | Tumor | 2 | 3 | 4 |
Table 6. Ablation study of model components (test AUC)

| Configuration | Test area under the curve (AUC) |
|---|---|
| Full Model (Vision Transformer (ViT) + Variational Autoencoder (VAE) + Cross-Attention) | 0.927 |
| Without Cross-Attention | 0.889 |
| ViT replaced by ResNet-50 | 0.901 |
| VAE replaced by Principal Component Analysis (PCA) | 0.884 |
The ViT outperformed ResNet-50 due to its ability to capture long-range spatial dependencies and global contextual features in WSI, which were critical for identifying subtle histopathological patterns in rectal lesions. Although borderline dysplasia cases were included in the dataset, their sample size was limited. Future work will specifically analyze model performance on these clinically challenging cases.
WSI preprocessing and feature extraction
WSIs from 450 paired WSI–multiomics samples were first loaded and downsampled to generate 2,048×2,048 px thumbnails for visualization. Tissue regions were identified by applying Otsu thresholding to the grayscale thumbnails, producing binary tissue masks that delineated areas containing tissue versus background. These masks were subsequently used to guide patch extraction for downstream analysis. Patches of 512×512 px were extracted at 20× magnification to preserve sufficient tissue detail for ViT feature extraction while maintaining computational efficiency. Figure 1 provides an overview of the WSI preprocessing and patch extraction workflow. From each WSI, 300 representative 512×512 px patches were randomly sampled based on the tissue masks, while low-quality or mostly white patches were excluded.
For each patch extracted from precancerous and malignant rectal lesions, six features were computed, including mean RGB intensities, entropy, contrast, and homogeneity. Figure 1 illustrates three rows corresponding to these preprocessing and feature extraction steps. The first row displays thumbnails of three representative WSIs of rectal cancer, showing the overall tissue morphology of each slide. The second row presents the corresponding binary tissue masks generated via Otsu thresholding, highlighting tissue regions used for patch extraction. The third row illustrates representative grids of 5×5 extracted 512×512 px patches for each WSI, demonstrating the diversity of tissue morphology captured across rectal cancer lesions.
Multiomics data summary
Corresponding multiomics features were summarized using z-scores and latent embeddings prior to fusion modeling. These extracted patch features were subsequently used for heatmap generation, multimodal fusion modeling, and statistical analyses. For clarity and consistency, detailed interpretability analyses (Grad-CAM and SHAP) are discussed in the Discussion section; here, only a brief mention is included. Figure 2 presents three rows of images for three representative WSIs of precancerous and malignant rectal lesions. The first row shows thumbnails of the WSIs, highlighting the overall tissue morphology. The second row displays the corresponding binary tissue masks generated using Otsu thresholding to indicate tissue regions for patch extraction. The third row illustrates representative 5×5 grids of extracted 512×512 px patches from rectal lesions, demonstrating the diversity of tissue morphology captured in the dataset. Multiomics feature distributions for the same samples are summarized in accompanying histograms (not shown) to maintain figure clarity.
Patch-level heterogeneity visualization
To assess the heterogeneity of tissue morphology in precancerous and malignant rectal lesions at the patch level, 512×512 px features extracted from tissue regions were analyzed. Dimensionality reduction using PCA provided a quantitative overview of variability across patches, highlighting differences in tissue architecture and staining intensity (Fig. 3). t-Distributed Stochastic Neighbor Embedding further revealed clustering of patches with similar visual characteristics, illustrating patterns of morphological diversity among lesions (Fig. 4). Additionally, representative 5×5 grids of randomly selected patches from each slide were generated to visually demonstrate the diversity of cellular patterns, stromal regions, and morphological details captured across the dataset (Fig. 5). These combined analyses provided both quantitative and qualitative insights into patch-level heterogeneity, supporting subsequent heatmap generation and multimodal fusion modeling.
Model performance and comparative evaluation
On the independent test set of precancerous and malignant rectal lesions, the multimodal ViT+VAE fusion model achieved superior predictive performance: AUC = 0.927 ± 0.009, Accuracy = 0.892 ± 0.012, F1-score = 0.894 ± 0.010, Precision = 0.889 ± 0.014, and Recall = 0.901 ± 0.011. Both unimodal baselines (WSI-only and omics-only) performed lower (WSI-only AUC ≈ 0.859; omics-only AUC ≈ 0.842), with improvements statistically significant (P < 0.01, paired t-test). These results highlighted the benefit of integrating histopathological and molecular features: the inclusion of multiomics information notably improved the model’s ability to correctly classify morphologically ambiguous lesions. The cross-attention module enabled effective alignment of WSI and multiomics features, contributing directly to the observed increase in AUC and overall predictive performance.
Ablation study
To evaluate the contribution of each component of the multimodal framework, an ablation study was performed. The results (Table 6) showed that removing the cross-attention module led to the largest decrease in AUC, indicating that the integration of histopathological and multiomics features was crucial for accurate prediction of malignant transformation. Replacing the ViT with ResNet-50 or the VAE with PCA also resulted in reduced performance, highlighting the importance of both the selected encoders and the cross-modal fusion strategy. These findings suggested that molecular information was particularly valuable for classifying morphologically ambiguous lesions, reinforcing the added predictive value of multiomics data. To further evaluate the limitations of the proposed multimodal framework, misclassified samples in the independent test set were analyzed. Among the 15 false negatives and 12 false positives, most false negatives corresponded to high-grade precancerous lesions with heterogeneous morphology, while false positives included low-grade precancerous lesions exhibiting molecular signatures similar to malignant samples. Grad-CAM visualizations revealed that misclassified WSIs contained regions with ambiguous tissue architecture, while SHAP analyses indicated that overlapping gene expression and mutation patterns contributed to misclassification. These findings suggest that morphological ambiguity and partially overlapping molecular profiles are primary contributors to model errors, highlighting areas for potential refinement.
To assess the translational relevance of the model, predicted malignancy probabilities were correlated with key clinical parameters, including tumor stage, vascular invasion, and patient age. Spearman correlation analysis demonstrated a moderate positive association between predicted risk scores and tumor stage (ρ = 0.41, P < 0.01) as well as vascular invasion status (ρ = 0.36, P < 0.05). Although follow-up survival data were limited in the TCGA/CPTAC datasets, these results support that higher model-predicted malignancy scores correspond to more clinically aggressive rectal lesions, confirming the potential utility of the multimodal framework in early risk stratification and translational applications.
Interpretability analysis
Interpretability of the model was assessed using Grad-CAM for WSI features and SHAP values for multiomics contributions. Representative Grad-CAM and SHAP visualizations are shown in Figure 6, alongside receiver operating characteristic curves (Fig. 6a) and the confusion matrix (Fig. 6b) for the multimodal classifier. These visualizations illustrate how the model integrates histopathological and molecular features at a patch level across samples.
Discussion
This study developed a multimodal deep learning framework that integrates WSI and multiomics data to predict the malignant transformation of precancerous rectal lesions. Using a ViT-based histopathology encoder, a VAE-based molecular encoder, and a cross-attention fusion mechanism, the model achieved higher predictive performance than single-modality approaches. The integration of WSI and multiomics information yielded several important insights. ViT demonstrated superior performance compared with ResNet-50 due to its ability to capture long-range spatial dependencies and global tissue context. Unlike convolutional architectures with limited receptive fields, ViT models are particularly suited for modeling glandular organization and heterogeneous dysplastic patterns commonly observed in rectal precancerous lesions. The cross-attention module played a central role in aligning the two modalities, modulating visually subtle morphological cues using molecular information, particularly in borderline dysplasia cases.31 The fused latent space demonstrated clearer separation between precancerous and malignant samples compared with unimodal embeddings, suggesting that morphological features alone were insufficient to capture the full biological complexity of malignant transformation. Multiomics inputs, particularly gene expression and mutation patterns, provided discriminative molecular signatures that complemented histological alterations. Signals from TP53 and MKI67 expression, as well as PI3K/AKT pathway–related mutations, aligned with regions of nuclear atypia and stromal remodeling identified in the ViT attention maps, indicating that the fusion model captured underlying biological mechanisms rather than merely correlational patterns.
This effect likely explains the model’s strong performance in samples that were misclassified by unimodal WSI-only baselines. Compared with existing studies that rely solely on either histology or transcriptomics,32,33 the present framework demonstrates the advantage of leveraging complementary biological layers to improve diagnostic precision. These findings are consistent with emerging literature emphasizing the clinical utility of multimodal pathology–omics integration for early cancer risk stratification.34,35 Despite these strengths, several limitations should be acknowledged. The study utilized 450 paired samples, which may limit the statistical generalizability of the findings. Multiomics layers were restricted to transcriptomics, mutation profiles, and proteomics; inclusion of additional modalities such as methylation or metabolomics could further enhance biological resolution. Notably, in borderline dysplasia cases, the multimodal model demonstrated improved classification compared with WSI-only baselines, suggesting that integration of molecular features helped resolve challenging cases. Nevertheless, the limited number of borderline samples constrains statistical confidence and warrants further validation on larger cohorts. External validation using independent datasets beyond TCGA and CPTAC was not performed due to the lack of publicly available WSI–multiomics datasets for precancerous rectal lesions. Interpretability of the model was further explored to support translational relevance. Grad-CAM analyses highlighted regions within tissue patches that were most influential for the model’s predictions, often corresponding to areas of high cellular atypia or dysplasia. SHAP values identified key genes, mutations, and proteomic markers contributing to classification decisions, revealing molecular signatures associated with malignant transformation.
These analyses demonstrate that the multimodal model relies on biologically meaningful features rather than spurious correlations, enhancing confidence in its applicability for precision pathology.
To address the limited sample size and enhance statistical robustness, future work will include expansion of the dataset through collaborations with external institutions, collection of prospective WSI–multiomics samples, and advanced data augmentation techniques for molecular features, such as generative modeling using variational autoencoders or GANs. Cross-cohort harmonization and domain adaptation strategies will also be employed to integrate multi-site datasets while minimizing technical bias, thereby improving the generalizability and reliability of the model.
The proposed multimodal framework could be integrated into clinical diagnostic workflows as a decision-support tool, highlighting high-risk tissue regions via Grad-CAM and providing molecular risk scores to assist pathologists. Key challenges for clinical translation include standardization of WSI acquisition, harmonization of multiomics assays across laboratories, computational requirements for timely analysis, and regulatory approval. Addressing these barriers through prospective validation, user-friendly software development, and clinical collaborations is essential for enabling real-world implementation.
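The VAE-based augmentation strategy proposed for future work rests on the reparameterization trick. The sketch below uses a hypothetical 16-dimensional latent space and four molecular profiles; decoding the sampled latent vectors through a trained decoder (omitted here) would yield the synthetic profiles.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I); in an autodiff
    framework this formulation keeps the sampling step differentiable."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_divergence(mu, log_var):
    """Per-sample KL(q(z|x) || N(0, I)), the VAE regularization term."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)

rng = np.random.default_rng(42)
mu = rng.standard_normal((4, 16))  # hypothetical encoder means for 4 molecular profiles
log_var = np.zeros((4, 16))        # unit variance, for simplicity
z = reparameterize(mu, log_var, rng)
```

Sampling repeatedly from the same posterior produces plausible variations of a molecular profile, which is the basis of the augmentation idea.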
Beyond these translational barriers, practical deployment will also depend on computational feasibility: GPU resources and processing time for joint WSI and multiomics analysis may limit real-time clinical use, and standardized acquisition and assay protocols are needed to ensure reproducibility across centers. Addressing these requirements, alongside external validation on multi-institutional data, is crucial for translating the framework into a practical clinical decision-support tool.
Conclusions
Multimodal integration of WSI and multiomics data using a ViT+VAE framework with cross-attention significantly improves predictive accuracy in distinguishing precancerous from malignant rectal lesions. The approach provides interpretable insights into both histopathological and molecular features associated with malignancy, highlighting the added value of combining morphological and molecular information. These findings underscore the potential of multimodal models to enhance precision diagnostics and inform clinical decision-making in personalized oncology, while future validation on larger and external cohorts will be essential to confirm generalizability.
Declarations
Ethical statement
Ethical approval was not required for this study because all data used were publicly available, de-identified, and did not involve any direct human or animal experimentation.
Data sharing statement
The datasets used to support the findings of this study are publicly available from The Cancer Genome Atlas (https://portal.gdc.cancer.gov/) and the Clinical Proteomic Tumor Analysis Consortium (https://cptac-data-portal.georgetown.edu/). No additional data are available.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Conflict of interest
The authors have no conflicts of interest related to this publication.
Authors’ contributions
NA is the sole author of the manuscript.