Journal of Clinical and Translational Pathology

  • OPEN ACCESS

Associations of Normalization and Regularization with Machine Learning Overfitting in Cross-dataset Classification of Deaths Using Transcriptomic and Clinical Data: A Secondary Analysis of Publicly Available Databases

  • Fei Deng1 and
  • Lanjing Zhang1,2,3,4,* 

Abstract

Background and objectives

Normalization can standardize and improve machine learning (ML) performance on omics data. However, it is unclear whether normalization is associated with overfitting (i.e., worse cross-dataset performance than intra-dataset performance). Therefore, we aimed to examine associations of normalization and regularization with overfitting of ML on omics data.

Methods

Using three paired transcriptomic and clinical datasets (lung adenocarcinoma: the Cancer Genome Atlas (TCGA)/Oncology Singapore; melanoma: TCGA/Dana-Farber Cancer Institute; glioblastoma: TCGA/Clinical Proteomic Tumor Analysis Consortium), we applied ANOVA-based gene selection methods, six normalization methods, and six ML models to classify cancer patients’ deaths. Balanced accuracy (BA) and area under the curve (AUC) in intra- and cross-dataset settings were compared using inferential analyses.

Results

Normalization consistently improved intra-dataset performance (median BA/AUC changes: 0.035–0.214/0.115–0.279) on all data, particularly with Z_Raw, but decreased or only slightly increased cross-dataset performance (median BA/AUC changes: −0.029 to 0.079/0.029 to 0.064). The Least Absolute Shrinkage and Selection Operator (LASSO) model without normalization consistently outperformed most of the other ML models in cross-dataset testing across cancer types. ML models on all data and on molecular data alone showed similar best performances.

Conclusions

Normalization increases ML’s intra-dataset performance and overfitting in three paired cancer transcriptomic and clinical datasets. Regularized models such as LASSO appear to mitigate overfitting and achieve robust cross-dataset performance. Therefore, cross-dataset evaluation and regularized models are recommended to assess and reduce overfitting, while normalization should be used cautiously. Adding clinical data seems to have little impact on ML models’ performance. However, future work on other diseases and datasets is warranted.

Keywords

Cancer, Overfitting, Machine learning, Regularization, Normalization, Transcriptomics, Clinical feature

Introduction

Machine learning (ML) has become a cornerstone of bioinformatics, enabling predictive modeling for the classification of diseases and patient outcomes using high-dimensional omics data.1–4 It is particularly helpful in the era of massive production and application of high-throughput data.5–7 However, the generalizability of ML models across datasets remains a critical challenge due to heterogeneity in experimental platforms, sample populations, and preprocessing techniques, with reported cross-dataset performance as low as an F1 of 61% or an area under the receiver-operating-characteristic curve (AUC) of 71% (down from 91% in intra-dataset testing).8–11 ML models may also exhibit performance biases across sociodemographic groups.12 Normalization is often assumed to enhance model performance.7,13–17 However, its impact on cross-dataset performance is largely unknown, particularly for high-dimensional omics data, where overfitting risks are high.18–20

A known cause of the ML generalizability problem is over-reliance on intra-dataset cross-validation for model evaluation and selection.21,22 While valuable in many cases, this practice suffers from selection bias and leads to overly optimistic estimates of a model’s true performance.8,19,20,23 Moreover, preprocessing strategies, such as data normalization and aggressive feature selection, can improve performance metrics within a single dataset24–26 but may unintentionally cause model overfitting. This intensive optimization can paradoxically harm the model’s ability to generalize, a finding noted in recent studies.13,27 Finally, feature selection methods (e.g., differentially expressed gene (DEG) selection) can improve intra-dataset performance but may exacerbate overfitting in cross-dataset validation.19 The evaluation of ML performance also faces scrutiny, as intra-dataset metrics often fail to predict cross-dataset generalizability.16,28 However, the association between preprocessing methods and ML’s cross-dataset performance remains unclear.

Regularization techniques, such as the Least Absolute Shrinkage and Selection Operator (LASSO),29,30 have shown promise in reducing overfitting by penalizing model complexity, but their interaction with normalization remains poorly understood in classifying omics data. Recent studies suggest that simpler ML models may outperform complex methods in transcriptomics due to robustness to data variability.27 However, it is largely unknown whether LASSO or other simple ML algorithms retain their performance in cross-dataset testing.

Therefore, we investigated the impact of normalization, regularization, and evaluation strategies on ML performance in classifying cancer deaths, focusing on cross-dataset performance. Using three pairs of transcriptomic and clinical datasets, we explored whether normalization can universally improve performance, assessed the impact of regularization, and evaluated the trade-offs of preprocessing and feature selection techniques. Our study may help develop robust ML pipelines with better generalizability in precision medicine and multi-omics applications.31

Materials and methods

Workflow and dataset selection

We searched cBioPortal32 for cancer transcriptomic datasets with clinical and death data that also had at least one matched dataset with clinical and death data, which could be used for independent cross-dataset testing. Three pairs of transcriptomic and clinical datasets in cancer were identified and used: lung adenocarcinoma in the Cancer Genome Atlas (TCGA) and Oncology Singapore (OncoSG),33,34 melanoma in TCGA and the Dana-Farber Cancer Institute,35,36 and glioblastoma in TCGA and the Clinical Proteomic Tumor Analysis Consortium.36,37

Specific experimental steps were described previously and repeated in all three pairs of cancer datasets (Fig. 1).13 Briefly, 90% of randomly selected samples from the training dataset were used for training with five-fold cross-validation, while the remaining 10% served as an internal test set. The other dataset was then used for cross-dataset testing, and vice versa. The entire process was repeated at least five times. The same basic modeling factor values and key model hyperparameter settings (Supplementary Table 1) were used across all experimental steps, including data cleaning, dataset partitioning, gene selection, normalization, classification model training, prediction, classification performance evaluation, and statistical analysis. Python version 3.11.9 (64-bit) was used for the code implementation.
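The repeated-split scheme described above can be sketched in Python (the language used in the study). This is an illustrative outline only: `repeated_evaluation` and its arguments are hypothetical names, and model fitting is elided.

```python
import random

def repeated_evaluation(train_data, external_data, n_repeats=5, holdout=0.1, seed=0):
    """Illustrative repeated-split scheme: 90% of the training cohort for
    training (with five-fold CV for tuning), 10% as the internal test set,
    and the entire external cohort for cross-dataset testing."""
    rng = random.Random(seed)
    split_sizes = []
    for _ in range(n_repeats):
        idx = list(range(len(train_data)))
        rng.shuffle(idx)
        n_internal = max(1, int(len(idx) * holdout))
        internal_test = [train_data[i] for i in idx[:n_internal]]
        training = [train_data[i] for i in idx[n_internal:]]
        # ... fit and tune a model on `training`, then score it on
        # `internal_test` (intra-dataset) and on `external_data` (cross-dataset).
        split_sizes.append((len(training), len(internal_test), len(external_data)))
    return split_sizes
```

Swapping the roles of the two cohorts ("and vice versa") amounts to calling the same function with `train_data` and `external_data` exchanged.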

Fig. 1  The workflow that was repeated for each of the three cancer types, including lung adenocarcinoma, melanoma, and glioblastoma.

CV, cross-validation; DEGs, differentially expressed genes; ML, machine learning; NDEGs, non-differentially expressed genes.

The classification outcome/label was binary (living versus deceased) in all three pairs of datasets. Only the features shared by the training and testing datasets were used for model training and testing. After applying the sample inclusion and exclusion criteria (Fig. 2), all remaining samples with paired transcriptomic and clinical data were carried forward to the downstream workflow. Transcriptomic data were in RNA-seq FPKM format and were further normalized using Z-transformation. Some datasets, such as the TCGA and OncoSG lung adenocarcinoma datasets, are class-imbalanced: the ratios of living to deceased samples were 212:74 in TCGA (total 286) and 125:42 in OncoSG (total 167). The same 4:1 split (i.e., 80% for training and 20% for intra-dataset testing) was applied to the melanoma and glioblastoma datasets.

Fig. 2  Sample selection flow diagram.

CPTAC, the Clinical Proteomic Tumor Analysis Consortium; DFCI, Dana-Farber Cancer Institute; OncoSG, Oncogenomic-Singapore; OS, overall survival; PATH_M_STAGE, pathologic distant metastasis (M) stage; TCGA, The Cancer Genome Atlas; TMB, tumor mutational burden.

Data cleansing

To enable analyses across the paired datasets, we cleaned the samples by retaining only those with matching labels, keeping shared gene features, and imputing missing values feature-wise in the molecular data with training-set medians. After this preprocessing, the lung adenocarcinoma dataset included 16,196 gene features and four clinical features: age, gender, tumor stage, and tumor mutational burden. These features were chosen because they are shared between the two datasets. Some features are numerical, while others are categorical, requiring tailored processing methods.
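The median-imputation step can be sketched as follows, assuming `None` marks a missing value and each row is a feature vector of equal length; medians come from the training set only, so no test-set information leaks into preprocessing.

```python
from statistics import median

def impute_with_training_medians(train_rows, test_rows):
    """Feature-wise median imputation: medians are computed on the training
    set only and reused for the test set, avoiding information leakage."""
    n_features = len(train_rows[0])
    medians = []
    for j in range(n_features):
        observed = [row[j] for row in train_rows if row[j] is not None]
        medians.append(median(observed))

    def fill(rows):
        return [[medians[j] if v is None else v for j, v in enumerate(row)]
                for row in rows]

    return fill(train_rows), fill(test_rows), medians
```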

Gene selection

As in nearly all transcriptomic studies, the number of samples is far smaller than the number of features (e.g., 16,196 genes in the lung adenocarcinoma datasets), leading to potential multicollinearity and an increased risk of overfitting. Therefore, feature selection was performed with ANOVA, as shown before,3,4,13,38–40 using the F-value, the ratio of between-group to within-group variance, to test the null hypothesis that all group means are equal.13 By setting different P-value thresholds, gene sets can be defined accordingly. For example, genes with P-values below a selected threshold were designated as DEGs for classification, while those above a chosen threshold were designated as non-differentially expressed genes (NDEGs) for normalization. Gene selection was performed using the training set only, and the selected feature sets (DEGs and NDEGs) were then fixed and applied directly to the internal and cross-dataset test sets.
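For a single gene, the one-way ANOVA F-value is the ratio of between-group to within-group variance across outcome classes (here, living versus deceased). A minimal pure-Python sketch follows; converting F to a P-value via the F distribution, as a library implementation would, is omitted here.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for one gene: between-group mean square
    over within-group mean square, where `groups` holds the expression
    values of that gene for each outcome class."""
    k = len(groups)                                  # number of classes
    n = sum(len(g) for g in groups)                  # total sample count
    grand = sum(sum(g) for g in groups) / n          # grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within
```

Genes are then ranked by the P-value of this statistic, with low-P genes kept as DEGs and high-P genes kept as NDEGs.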

Normalization

To evaluate model generalizability on independent external cohorts and avoid information leakage across cohorts, we focused on a set of classical normalization strategies that can be applied without joint modeling across cohorts. Since the transcriptomic data used here were already Z-transformed, we first examined the effect of classification on both the original dataset (Z_Original) and the gene-filtered dataset (Z_Raw data). We then evaluated binarization (Z_Binary) and four other reference gene-based normalization methods applied to Z_Raw data: Non-Parametric Normalization (Z_NPN), Quantile Normalization (Z_QN), Quantile Normalization with Z-Score (Z_QNZ), and Normalization using Internal Control Genes (Z_NICG), as described before (Supplementary Table 2).13,15,41–43 Each normalization method was applied independently to training, internal test, and external test datasets.
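As one representative of these methods, quantile normalization (Z_QN) forces every sample onto a shared reference distribution. The sketch below assumes each sample is an equal-length list of gene values and handles ties naively; the study's exact implementations are specified in Supplementary Table 2.

```python
def quantile_normalize(columns):
    """Quantile normalization sketch: each sample (column) is mapped onto
    the rank-wise mean distribution computed across all samples.
    Ties are broken by position rather than averaged (naive handling)."""
    n = len(columns[0])
    sorted_cols = [sorted(c) for c in columns]
    # Reference distribution: mean of the i-th smallest value across samples.
    reference = [sum(col[i] for col in sorted_cols) / len(columns)
                 for i in range(n)]
    out = []
    for col in columns:
        ranks = sorted(range(n), key=col.__getitem__)
        normalized = [0.0] * n
        for rank, idx in enumerate(ranks):
            normalized[idx] = reference[rank]
        out.append(normalized)
    return out
```

Applying such a method independently to the training, internal test, and external test sets, as done in the study, avoids any joint modeling across cohorts.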

ML models

We trained six commonly used ML classifiers on different training sets using specific hyperparameter tuning settings (Supplementary Table 1), including multilayer perceptron,44,45 extreme gradient boosting (XGB),46,47 logistic regression,48 LASSO,29 support vector machine (SVM),49 and random forest.50 Considering the imbalance in the dataset, class weights were applied in the XGB and SVM models, referred to as XGB_W and SVM_W, respectively.
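Class weights for such weighted models are commonly derived with the "balanced" heuristic, weight_c = n / (k · n_c), so each class contributes equal total weight. This is a sketch of that heuristic, not necessarily the exact settings used; those are given in Supplementary Table 1.

```python
from collections import Counter

def balanced_class_weights(labels):
    """'Balanced' class weights: weight_c = n_samples / (n_classes * n_c),
    so the minority class contributes as much total weight as the majority."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}
```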

Classification performance evaluation

Due to the binary and unbalanced nature of the data in this study, balanced accuracy (BA) was the primary performance metric, and AUC was the secondary metric.21,22 We also used the median of the changes (delta) in model performance (versus Z_Original) to evaluate the impact of normalization methods. A P-value less than 0.05 was considered statistically significant.
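BA is the macro-average of per-class recall, which discounts the majority class: a classifier that always predicts the majority label scores 0.5 regardless of class imbalance. A minimal sketch:

```python
def balanced_accuracy(y_true, y_pred):
    """Balanced accuracy: the unweighted mean of per-class recall
    (for binary labels, the mean of sensitivity and specificity)."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)
```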

Statistical analysis

A layered statistical analysis framework was adopted for model performance. Following our previous work,13,28 we first constructed internal- and external-test “mean performance matrices” indexed by combinations of DEG and NDEG thresholds. The optimal value in each matrix was used as the representation for the corresponding model-normalization combination.

The first layer of analysis was based on the underlying repeated-run results corresponding to each representation (five repetitions for the internal test and 15 repetitions for the external test). To evaluate the benefit of incorporating clinical features during training, we applied Welch’s t-test to compare model performance under each model-normalization combination with versus without clinical features.51,52 In the second-layer analysis, to assess whether feature selection and subsequent normalization improved model performance, we performed within-model paired comparisons of Z_Original and the other five normalization methods against the reference Z_Raw using Welch’s t-test. The third-layer analysis was conducted only for the lung adenocarcinoma datasets. To examine the impact of training-set choice on performance and cross-dataset generalization, Welch’s t-test was also used to compare the optimal internal-test results (and, separately, the optimal external-test results) obtained when using TCGA versus OncoSG as the training set.
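Welch's t-test compares two means without assuming equal variances or equal sample sizes. A sketch of the statistic and the Welch-Satterthwaite degrees of freedom follows; the P-value, obtained from the t distribution with `df` degrees of freedom, is omitted here.

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom,
    suitable when the two run sets have unequal variances and sizes."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)          # sample variances
    se2 = va / na + vb / nb                    # squared standard error
    t = (mean(a) - mean(b)) / se2 ** 0.5
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df
```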

The fourth layer analysis was conducted on multiple “optimal model performance tables” generated under different training set choices and clinical features settings using the Wilcoxon signed-rank test.53,54 Two paired tests were included: (1) row-wise comparison of Z_Original and the other normalization methods against Z_Raw; (2) column-wise comparison of the other models against LASSO. These analyses were used to evaluate the generalizability of normalization, feature selection, and model effects across different settings.

For each predefined comparison family, we controlled multiplicity by performing false discovery rate correction via the Benjamini-Hochberg procedure (q = 0.05). For layers 1–3, our primary goal was to compare mean performance across independent conditions. Because heteroscedasticity and unbalanced sample sizes might arise across repeated runs under different settings, we used Welch’s t-test for two-group comparisons.51,52 For layer 4, because comparisons involved greater differences in settings and distributional assumptions were harder to satisfy, we used the nonparametric paired Wilcoxon signed-rank test to compare paired differences.53,54
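The Benjamini-Hochberg step-up procedure underlying this correction can be sketched as: sort the m P-values, find the largest rank k with p_(k) ≤ (k/m)·q, and reject the k hypotheses with the smallest P-values.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg FDR procedure: returns a parallel list of
    booleans marking which hypotheses are rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=p_values.__getitem__)
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank            # step-up: keep the largest passing rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected
```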

Results

Baseline characteristics of the datasets

The datasets all included transcriptomic and clinical data (Supplementary Tables 3–5). The outcome was binary living status. For lung adenocarcinoma, there were 212 alive and 74 deceased patients in the TCGA dataset (total 286) and 125 alive and 42 deceased in the OncoSG dataset (total 167) at the end of their follow-ups. For glioblastoma, there were 52 alive and 99 deceased patients in the TCGA dataset (total 151) and 35 alive and 62 deceased in the Clinical Proteomic Tumor Analysis Consortium dataset (total 97). For melanoma, there were 173 alive and 187 deceased patients in the TCGA dataset (total 360) and 13 alive and 27 deceased in the Dana-Farber Cancer Institute dataset (total 40).

Performances of ML models on lung adenocarcinoma data

We analyzed models’ performances under various conditions, including multiple randomly selected sample combinations from the internal or external test sets. The best-performing models, when present, had statistically better BA and/or AUC than the average performance of all models (Supplementary Tables 6–21).

For narrative convenience, we refer to models based on genetic features plus four clinical features as Data grouping A, models using genetic features alone as Data grouping B, and models based on genetic features plus three clinical features (all except tumor stage) as Data grouping C.

Models trained on the TCGA dataset and those trained on the OncoSG dataset exhibited different performances on external datasets. We compared the best internal testing performances of models trained on the TCGA dataset with those trained on the OncoSG dataset under these three data groupings. When only transcriptomic data were used, the performance differences between the two datasets using the same method were statistically significant (Table 1). Moreover, the statistical significance of this difference was even more pronounced in cross-platform external testing. Models trained on the TCGA dataset showed significantly better predictive performance on the OncoSG dataset than models trained on the OncoSG dataset showed when tested on the TCGA dataset. This discrepancy may stem from the fact that the OncoSG dataset primarily consists of samples from Asian populations.

Table 1

Comparison of the best internal testing performance of models trained on the TCGA dataset (n = 510) versus those trained on the OncoSG dataset (n = 181)

Column groups: All data; Molecular data alone; All data except tumor stage. Within each group, the columns are TCGA as training set, OncoSG as training set, and FDR-adjusted P-value.

Intra-dataset testing

| Metric | TCGA | OncoSG | FDR-adj. P | TCGA | OncoSG | FDR-adj. P | TCGA | OncoSG | FDR-adj. P |
| Balanced accuracy | 0.814 ± 0.010 | 0.935 ± 0.004 | 0.179 | 0.848 ± 0.001 | 0.977 ± 0.000* | 0.180 | 0.853 ± 0.011 | 0.927 ± 0.003 | 0.480 |
| AUC | 0.888 ± 0.023 | 0.953 ± 0.002 | 0.233 | 0.925 ± 0.019 | 1.000 ± 0.000* | 0.180 | 0.885 ± 0.008 | 0.912 ± 0.010 | 0.892 |
| Accuracy | 0.821 ± 0.006 | 0.977 ± 0.001 | 0.076 | 0.890 ± 0.001 | 0.965 ± 0.001 | 0.180 | 0.910 ± 0.005 | 0.941 ± 0.002 | 0.107 |
| DEG, n (p threshold) | 78 (0.2%) | 534 (0.4%) | | 1,430 (5%) | 2,382 (4%) | | 996 (2%) | 2,070 (2%) | |
| NDEG, n (p threshold) | 62 (99%) | 230 (95%) | | 120 (98%) | 65 (99%) | | 120 (98%) | 65 (99%) | |
| Normalization method | Z-Raw | Z-Raw | | Z-NPN | Z-NICG | | Z-Raw | Z-NICG | |
| Classification model | SVM_W | MLP | | MLP | SVM_W | | LR | MLP | |

Cross-dataset testing

| Metric | TCGA | OncoSG | FDR-adj. P | TCGA | OncoSG | FDR-adj. P | TCGA | OncoSG | FDR-adj. P |
| Balanced accuracy | 0.645 ± 0.003 | 0.556 ± 0.000* | 0.003 | 0.657 ± 0.001 | 0.571 ± 0.000* | <0.001 | 0.654 ± 0.001 | 0.569 ± 0.000* | <0.001 |
| AUC | 0.654 ± 0.002 | 0.579 ± 0.000* | <0.001 | 0.687 ± 0.001 | 0.599 ± 0.000* | 0.134 | 0.665 ± 0.002 | 0.595 ± 0.000* | 0.001 |
| Accuracy | 0.645 ± 0.003 | 0.556 ± 0.000* | 0.003 | 0.657 ± 0.001 | 0.571 ± 0.000* | <0.001 | 0.654 ± 0.001 | 0.569 ± 0.000* | <0.001 |
| DEG, n (p threshold) | 161 (0.6%) | 617 (0.4%) | | 176 (0.7%) | 2,382 (4%) | | 816 (1%) | 2,960 (5%) | |
| NDEG, n (p threshold) | 120 (98%) | 1,729 (85%) | | 120 (98%) | 230 (95%) | | 120 (98%) | 562 (92%) | |
| Normalization method | Z-Binary | Z-Binary | | Z-NICG | Z-QN | | Z-Binary | Z-Binary | |
| Classification model | SVM_W | LR | | LR | SVM_W | | SVM_W | SVM_W | |

We also compared the best performance of ML in internal testing and that in external testing obtained for Data groupings A, B, and C (Supplementary Tables 22 and 23). Interestingly, no model exhibited statistically significant differences, whereas the prediction performance of models trained on OncoSG data and applied to TCGA data showed significant differences under the three conditions.

Modelling with data in three cancer types

In intra-dataset testing across three cancer types (Table 2 and Supplementary Fig. 1), normalization methods consistently improved model performance compared to the reference Z_Original (no normalization after initial Z-score transformation). Improvements were substantial, as reflected in BA and AUC, which were as high as 0.814 and 0.889 in lung adenocarcinoma, 0.756 and 0.807 in melanoma, and 0.803 and 0.887 in glioblastoma, respectively. Across all cancer types, normalization markedly enhanced intra-dataset predictive performance for death classification, with Z_Raw often providing the greatest median improvement in glioblastoma and competitive gains in the other cancers. The performances of ML models using Z_Original and the other five normalization methods appeared overall better than those using Z_Raw (as the reference), as shown by Welch’s t-test (Supplementary Tables 24–27).

Table 2

Intra-dataset testing of death classification on all data in three cancer types

Normalization method | LASSO | Delta | LR | Delta | MLP | Delta | RF | Delta | SVM_W | Delta | XGB_W | Delta | Median of delta
Lung adenocarcinoma
BA
  Z_Original0.500(W)Ref0.500(W)Ref0.640(W)Ref0.500(W)Ref0.665**(W)Ref0.580*(W)RefRef
  Z_Raw0.570(B)0.0700.570(B)0.0700.7400.1000.570(B)0.0700.814(B)0.1490.6980.1180.085
  Z_Binary0.5250.0250.5250.0250.6850.0450.5250.0250.7740.1090.6850.1050.035
  Z_NICG0.5630.0630.5630.0630.792(B)0.1520.5630.0630.7650.1000.6720.0920.078
  Z_NPN0.5250.0250.5250.0250.7700.1300.5250.0250.7900.1250.6960.1160.071
  Z_QN0.5380.0380.5380.0380.7550.1150.5380.0380.7820.1170.6870.1070.073
  Z_QNZ0.5500.0500.5500.0500.7830.1430.5500.0500.7570.0920.709(B)0.1290.071
AUC
  Z_Original0.607(W)Ref0.607(W)Ref0.754(W)Ref0.607(W)Ref0.645**(W)Ref0.656*(W)RefRef
  Z_Raw0.7760.1690.7760.1690.857(B)0.1030.7760.1690.8880.2430.8060.1500.169
  Z_Binary0.7850.1780.7850.1780.7960.0420.7850.1780.8450.2000.7790.1230.178
  Z_NICG0.7420.1350.7420.1350.8380.0840.7420.1350.889(B)0.2440.7740.1180.135
  Z_NPN0.7540.1470.7540.1470.8520.0980.7540.1470.8710.2260.7830.1270.147
  Z_QN0.7560.1490.7560.1490.8420.0880.7560.1490.8380.1930.7880.1320.149
  Z_QNZ0.787(B)0.1800.787(B)0.1800.8360.0820.787(B)0.1800.8170.1720.812(B)0.1560.176
Melanoma
BA
  Z_Original0.595(W)Ref0.573*(W)Ref0.588(W)Ref0.582(W)Ref0.575(W)Ref0.543(W)RefRef
  Z_Raw0.7040.1090.7050.1320.7060.1180.6610.0790.6990.1240.6810.1380.121
  Z_Binary0.6990.1040.6800.1070.6650.0770.6780.0960.6650.090.6870.1440.100
  Z_NICG0.6740.0790.7120.1390.6910.1030.6820.10.728(B)0.1530.708(B)0.1650.121
  Z_NPN0.6900.0950.7150.1420.7140.1260.6890.1070.7250.150.6740.1310.129
  Z_QN0.713(B)0.1180.756(B)0.1830.719(B)0.1310.6930.1110.7110.1360.6870.1440.134
  Z_QNZ0.6920.0970.7190.1460.7110.1230.707(B)0.1250.7060.1310.6650.1220.124
AUC
  Z_Original0.625(W)Ref0.605(W)Ref0.604(W)Ref0.635(W)Ref0.610(W)Ref0.578(W)RefRef
  Z_Raw0.767(B)0.1420.7880.1830.7550.1510.7220.0870.7740.1640.7270.1490.150
  Z_Binary0.7380.1130.7520.1470.7350.1310.7310.0960.7050.0950.7460.1680.122
  Z_NICG0.7140.0890.7800.1750.7620.1580.7360.1010.807(B)0.1970.770(B)0.1920.167
  Z_NPN0.7390.1140.7780.1730.794(B)0.190.7460.1110.7970.1870.7280.150.162
  Z_QN0.7660.1410.794(B)0.1890.7740.170.7490.1140.7750.1650.7350.1570.161
  Z_QNZ0.7610.1360.7890.1840.7590.1550.758(B)0.1230.7780.1680.7370.1590.157
Glioblastoma
BA
  Z_Original0.525**(W)Ref0.519**(W)Ref0.586**(W)Ref0.557(W)Ref0.519***(W)Ref0.552(W)RefRef
  Z_Raw0.761(B)0.2360.7820.2630.777(B)0.1910.604(B)0.0470.7820.2630.6260.0740.214
  Z_Binary0.650*0.1250.638**0.1190.655*0.0690.557(W)00.578*0.0590.6420.090.080
  Z_NICG0.619**0.0940.7810.2620.7760.190.5790.0220.736*0.2170.6710.1190.155
  Z_NPN0.632*0.1070.803(B)0.2840.7720.1860.6030.0460.783(B)0.2640.696(B)0.1440.165
  Z_QN0.584**0.0590.7520.2330.7470.1610.5720.0150.7020.1830.6670.1150.138
  Z_QNZ0.619**0.0940.7210.2020.7210.1350.5830.0260.7310.2120.6220.070.115
AUC
  Z_Original0.531**(W)Ref0.500***(W)Ref0.568**(W)Ref0.619(W)Ref0.588*(W)Ref0.572(W)RefRef
  Z_Raw0.835(B)0.3040.8670.3670.8540.2860.6990.080.8590.2710.6920.120.279
  Z_Binary0.7200.1890.7760.2760.7340.1660.6790.060.7350.1470.6860.1140.157
  Z_NICG0.7740.2430.8750.3750.8730.3050.6730.0540.8520.2640.7260.1540.254
  Z_NPN0.7770.2460.887(B)0.3870.878(B)0.310.7180.0990.884(B)0.2960.771(B)0.1990.271
  Z_QN0.7060.1750.8030.3030.7990.2310.6980.0790.8260.2380.7510.1790.205
  Z_QNZ0.7720.2410.8010.3010.8110.2430.724(B)0.1050.8140.2260.7310.1590.234

In contrast to intra-dataset results, cross-dataset (external) testing revealed limited benefits from normalization and, in several cases, performance comparable to or worse than Z_Original (Table 3 and Supplementary Fig. 2).

Table 3

Cross-dataset testing of death classification on all data in three cancer types

Normalization method | LASSO | Delta | LR | Delta | MLP | Delta | RF | Delta | SVM_W | Delta | XGB_W | Delta | Median of delta
Lung adenocarcinoma: tested in OncoSG data (n = 181) with models trained on TCGA (n = 510)
BA
  Z_Original0.506***(W)Ref0.507***(W)Ref0.520***(W)Ref0.508**(W)Ref0.527***(W)Ref0.539***(W)RefRef
  Z_Raw0.5530.0470.555(B)0.0480.5740.0540.553(B)0.0450.6080.0810.6110.0720.051
  Z_Binary0.520**0.0140.524***0.0170.589(B)0.0690.524*0.0160.645(B)0.1180.618(B)0.0790.043
  Z_NICG0.557(B)0.0510.5520.0450.5610.0410.5360.0280.5980.0710.583*0.0440.045
  Z_NPN0.5470.0410.5320.0250.5440.0240.5390.0310.5800.0530.591*0.0520.036
  Z_QN0.5380.0320.5340.0270.5620.0420.5350.0270.5970.070.5910.0520.037
  Z_QNZ0.5300.0240.5340.0270.5570.0370.5300.0220.5890.0620.6020.0630.032
AUC
  Z_Original0.620Ref0.639Ref0.525(W)Ref0.648Ref0.560**(W)Ref0.574***(W)RefRef
  Z_Raw0.6800.060.6760.0370.5990.0740.6700.0220.6140.0540.6580.0840.057
  Z_Binary0.685(B)0.0650.677(B)0.0380.6320.1070.690(B)0.0420.654(B)0.0940.6660.0920.079
  Z_NICG0.593(W)−0.030.608(W)−0.0310.5960.0710.602(W)−0.050.654(B)0.0940.6470.0730.022
  Z_NPN0.6530.0330.6580.0190.6020.0770.6620.0140.6270.0670.6490.0750.050
  Z_QN0.6790.0590.677(B)0.0380.633(B)0.1080.6780.030.6130.0530.681(B)0.1070.056
  Z_QNZ0.6680.0480.6660.0270.6120.0870.6650.0170.6130.0530.6670.0930.051
Melanoma: tested in DFCI data (n = 40) with models trained on TCGA data (n = 360)
BA
  Z_Original0.587**Ref0.574(W)Ref0.536*Ref0.587(W)Ref0.509*(W)Ref0.545(W)RefRef
  Z_Raw0.647(B)0.0600.6160.0420.5830.0470.6390.0520.5770.0680.6130.0680.056
  Z_Binary0.6280.0410.641(B)0.0670.5850.0490.6480.0610.613(B)0.1040.6140.0690.064
  Z_NICG0.528**(W)−0.0590.5800.0060.541*0.0050.6390.0520.5440.0350.6020.0570.021
  Z_NPN0.6080.0210.6360.0620.595*(B)0.0590.6200.0330.5830.0740.6170.0720.061
  Z_QN0.537***−0.0500.6120.0380.526*(W)−0.0100.650(B)0.0630.5580.0490.620(B)0.0750.044
  Z_QNZ0.578**−0.0090.5760.0020.5460.0100.6220.0350.5460.0370.6000.0550.023
AUC
  Z_Original0.616**Ref0.593Ref0.591Ref0.614(W)Ref0.621Ref0.621RefRef
  Z_Raw0.664(B)0.0480.6400.0470.620(B)0.0290.6610.0470.661(B)0.0400.671(B)0.0500.047
  Z_Binary0.6470.0310.652(B)0.0590.555**−0.0360.6670.0530.6460.0250.6250.0040.028
  Z_NICG0.481***(W)−0.1350.578*(W)−0.0150.5910.0000.706**(B)0.0920.543***−0.0780.6360.015−0.008
  Z_NPN0.611**−0.0050.6330.0400.6130.0220.699*0.0850.573**−0.0480.615(W)−0.0060.009
  Z_QN0.547−0.0690.6100.0170.507**(W)−0.0840.7030.0890.528***−0.0930.6330.012−0.029
  Z_QNZ0.542*−0.0740.5960.0030.530−0.0610.6630.0490.508**(W)−0.1130.6290.008−0.029
Glioblastoma: death classification tested in CPTAC data (n = 97) with models trained on TCGA data (n = 145)
BA
  Z_Original0.638***Ref0.629***Ref0.619**Ref0.516(W)Ref0.551***(W)Ref0.572(W)RefRef
  Z_Raw0.6500.0120.654(B)0.0250.580−0.0390.5580.0420.6370.0860.6100.0380.032
  Z_Binary0.578***−0.0600.610*(W)−0.0190.579(W)−0.0400.5490.0330.578*0.0270.6140.0420.004
  Z_NICG0.621*−0.0170.6470.0180.661***(B)0.0420.5810.0650.657(B)0.1060.6250.0530.048
  Z_NPN0.657(B)0.0190.638*0.0090.647*0.0280.5750.0590.6420.0910.634(B)0.0620.044
  Z_QN0.568***(W)−0.0700.654*(B)0.0250.617−0.0020.627(B)0.1110.6380.0870.6260.0540.040
  Z_QNZ0.614***−0.0240.630**0.0010.6330.0140.6130.0970.626*0.0750.6200.0480.031
AUC
  Z_Original0.710**Ref0.696***Ref0.684***Ref0.642*(W)Ref0.727**(B)Ref0.692RefRef
  Z_Raw0.787(B)0.0770.7030.0070.659−0.0250.6930.0510.689−0.0380.687−0.0050.001
  Z_Binary0.667−0.0430.690−0.0060.642(W)−0.0420.6820.0400.643(W)−0.0840.7070.015−0.024
  Z_NICG0.666(W)−0.0440.695−0.0010.745***(B)0.0610.6590.0170.702−0.0250.648(W)−0.044−0.013
  Z_NPN0.7380.0280.713**(B)0.0170.707***0.0230.698*0.0560.687−0.0400.731**(B)0.0390.026
  Z_QN0.7440.0340.695*−0.0010.682−0.0020.713*(B)0.0710.679**−0.0480.694**0.0020.001
  Z_QNZ0.704−0.0060.659*(W)−0.0370.7010.0170.687*0.0450.685*−0.0420.677−0.015−0.011

A striking pattern emerged for the LASSO model. In cross-dataset testing of all three cancer types, LASSO achieved positive or minimally negative deltas across nearly all normalization methods, frequently outperforming the normalized versions of more complex models.

Overall, while normalization and associated gene selection markedly boosted intra-dataset performance, these preprocessing steps provided only marginal gains in cross-dataset testing and occasionally led to reduced performance. Differences in model performance were more pronounced in cross-dataset settings than within the same dataset, highlighting greater sensitivity to dataset heterogeneity in cross-dataset validation. Simpler, regularized approaches such as LASSO demonstrated consistent cross-dataset robustness, whereas more complex models showed variable and sometimes diminished generalizability after extensive normalization.

Modelling with molecular data alone in three cancer types

In intra-dataset testing using only molecular (transcriptomic) features across the three cancer types (Table 4 and Supplementary Fig. 3), normalization methods again substantially improved performance relative to Z_Original, with gains observed in both BA and AUC. Overall, normalization markedly enhanced intra-dataset death classification when using molecular data alone, with Z_Raw frequently delivering the strongest median gains, particularly in lung adenocarcinoma and glioblastoma.

Table 4

Intra-dataset testing of death classification on molecular data alone in three cancer types

Normalization method | LASSO | Delta | LR | Delta | MLP | Delta | RF | Delta | SVM_W | Delta | XGB_W | Delta | Median of delta
Lung adenocarcinoma
BA
  Z_Original0.622**Ref0.619(W)Ref0.606**(W)Ref0.500**(W)Ref0.649(W)Ref0.577(W)RefRef
  Z_Raw0.781(B)0.1590.8290.210.7680.1620.563(B)0.0630.834(B)0.1850.739(B)0.1620.162
  Z_Binary0.7600.1380.7670.1480.7310.1250.5210.0210.7560.1070.6850.1080.117
  Z_NICG0.580**(W)−0.0420.8350.2160.848(B)0.2420.563(B)0.0630.8210.1720.6820.1050.139
  Z_NPN0.635**0.0130.838(B)0.2190.8300.2240.5420.0420.7940.1450.6890.1120.129
  Z_QN0.664**0.0420.8010.1820.7410.1350.563(B)0.0630.8160.1670.6720.0950.115
  Z_QNZ0.630**0.0080.7940.1750.7570.1510.5420.0420.7800.1310.7020.1250.128
AUC
  Z_Original0.641*(W)Ref0.643*(W)Ref0.685(W)Ref0.625(W)Ref0.694(W)Ref0.617(W)RefRef
  Z_Raw0.923(B)0.2820.935(B)0.2920.9150.230.8080.1830.8950.2010.875(B)0.2580.244
  Z_Binary0.8350.1940.827**0.1840.8150.130.8150.190.8530.1590.8250.2080.187
| Normalization method | LASSO | Delta | LR | Delta | MLP | Delta | RF | Delta | SVM_W | Delta | XGB_W | Delta | Median of delta |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Z_NICG | 0.821 | 0.18 | 0.895 | 0.252 | 0.925(B) | 0.24 | 0.764 | 0.139 | 0.907(B) | 0.213 | 0.786 | 0.169 | 0.197 |
| Z_NPN | 0.875 | 0.234 | 0.915 | 0.272 | 0.919 | 0.234 | 0.815 | 0.19 | 0.899 | 0.205 | 0.825 | 0.208 | 0.221 |
| Z_QN | 0.861* | 0.22 | 0.899 | 0.256 | 0.871 | 0.186 | 0.810 | 0.185 | 0.893 | 0.199 | 0.837 | 0.22 | 0.210 |
| Z_QNZ | 0.885 | 0.244 | 0.883 | 0.24 | 0.857 | 0.172 | 0.839(B) | 0.214 | 0.859 | 0.165 | 0.839 | 0.222 | 0.218 |
| Melanoma | | | | | | | | | | | | | |
| BA | | | | | | | | | | | | | |
| Z_Original | 0.592(W) | Ref | 0.576(W) | Ref | 0.580(W) | Ref | 0.574(W) | Ref | 0.648(W) | Ref | 0.605(W) | Ref | Ref |
| Z_Raw | 0.720 | 0.128 | 0.716 | 0.14 | 0.708 | 0.128 | 0.692(B) | 0.118 | 0.695 | 0.047 | 0.675 | 0.07 | 0.123 |
| Z_Binary | 0.685 | 0.093 | 0.690 | 0.114 | 0.684 | 0.104 | 0.676 | 0.102 | 0.705 | 0.057 | 0.661 | 0.056 | 0.098 |
| Z_NICG | 0.669 | 0.077 | 0.712 | 0.136 | 0.691 | 0.111 | 0.690 | 0.116 | 0.706 | 0.058 | 0.697(B) | 0.092 | 0.102 |
| Z_NPN | 0.675 | 0.083 | 0.707 | 0.131 | 0.712 | 0.132 | 0.679 | 0.105 | 0.702 | 0.054 | 0.683 | 0.078 | 0.094 |
| Z_QN | 0.717 | 0.125 | 0.718 | 0.142 | 0.723(B) | 0.143 | 0.680 | 0.106 | 0.716 | 0.068 | 0.697(B) | 0.092 | 0.116 |
| Z_QNZ | 0.724(B) | 0.132 | 0.720(B) | 0.144 | 0.721 | 0.141 | 0.687 | 0.113 | 0.732(B) | 0.084 | 0.677 | 0.072 | 0.123 |
| AUC | | | | | | | | | | | | | |
| Z_Original | 0.633(W) | Ref | 0.610(W) | Ref | 0.638(W) | Ref | 0.627(W) | Ref | 0.672(W) | Ref | 0.653(W) | Ref | Ref |
| Z_Raw | 0.802(B) | 0.169 | 0.787(B) | 0.177 | 0.769 | 0.131 | 0.743 | 0.116 | 0.773 | 0.101 | 0.736 | 0.083 | 0.124 |
| Z_Binary | 0.759 | 0.126 | 0.766 | 0.156 | 0.741 | 0.103 | 0.731 | 0.104 | 0.763 | 0.091 | 0.713 | 0.06 | 0.104 |
| Z_NICG | 0.716 | 0.083 | 0.781 | 0.171 | 0.760 | 0.122 | 0.713 | 0.086 | 0.777 | 0.105 | 0.765(B) | 0.112 | 0.109 |
| Z_NPN | 0.741 | 0.108 | 0.782 | 0.172 | 0.786 | 0.148 | 0.741 | 0.114 | 0.783 | 0.111 | 0.744 | 0.091 | 0.113 |
| Z_QN | 0.789 | 0.156 | 0.783 | 0.173 | 0.809(B) | 0.171 | 0.733 | 0.106 | 0.781 | 0.109 | 0.758 | 0.105 | 0.133 |
| Z_QNZ | 0.799 | 0.166 | 0.786 | 0.176 | 0.804 | 0.166 | 0.744(B) | 0.117 | 0.784(B) | 0.112 | 0.757 | 0.104 | 0.142 |
| Glioblastoma | | | | | | | | | | | | | |
| BA | | | | | | | | | | | | | |
| Z_Original | 0.570** | Ref | 0.515**(W) | Ref | 0.547*(W) | Ref | 0.539(W) | Ref | 0.526**(W) | Ref | 0.524(W) | Ref | Ref |
| Z_Raw | 0.776(B) | 0.206 | 0.746(B) | 0.231 | 0.760(B) | 0.213 | 0.592 | 0.053 | 0.750(B) | 0.224 | 0.632 | 0.108 | 0.210 |
| Z_Binary | 0.731 | 0.161 | 0.690 | 0.175 | 0.718 | 0.171 | 0.590 | 0.051 | 0.745 | 0.219 | 0.668(B) | 0.144 | 0.166 |
| Z_NICG | 0.615** | 0.045 | 0.724 | 0.209 | 0.726 | 0.179 | 0.584 | 0.045 | 0.741 | 0.215 | 0.643 | 0.119 | 0.149 |
| Z_NPN | 0.670* | 0.1 | 0.704 | 0.189 | 0.693 | 0.146 | 0.624(B) | 0.085 | 0.720 | 0.194 | 0.636 | 0.112 | 0.129 |
| Z_QN | 0.564**(W) | −0.006 | 0.693 | 0.178 | 0.685 | 0.138 | 0.580 | 0.041 | 0.666 | 0.14 | 0.623 | 0.099 | 0.119 |
| Z_QNZ | 0.581** | 0.011 | 0.696 | 0.181 | 0.694 | 0.147 | 0.566 | 0.027 | 0.697 | 0.171 | 0.604 | 0.08 | 0.114 |
| AUC | | | | | | | | | | | | | |
| Z_Original | 0.577***(W) | Ref | 0.542**(W) | Ref | 0.649*(W) | Ref | 0.675(W) | Ref | 0.570**(W) | Ref | 0.566**(W) | Ref | Ref |
| Z_Raw | 0.853(B) | 0.276 | 0.842 | 0.3 | 0.848 | 0.199 | 0.710 | 0.035 | 0.855 | 0.285 | 0.730 | 0.164 | 0.238 |
| Z_Binary | 0.846 | 0.269 | 0.801 | 0.259 | 0.782 | 0.133 | 0.767(B) | 0.092 | 0.803 | 0.233 | 0.797(B) | 0.231 | 0.232 |
| Z_NICG | 0.761* | 0.184 | 0.847(B) | 0.305 | 0.863(B) | 0.214 | 0.688 | 0.013 | 0.865(B) | 0.295 | 0.695 | 0.129 | 0.199 |
| Z_NPN | 0.763* | 0.186 | 0.845 | 0.303 | 0.825 | 0.176 | 0.712 | 0.037 | 0.844 | 0.274 | 0.730 | 0.164 | 0.181 |
| Z_QN | 0.657** | 0.08 | 0.775* | 0.233 | 0.779 | 0.13 | 0.686 | 0.011 | 0.780* | 0.21 | 0.666 | 0.1 | 0.115 |
| Z_QNZ | 0.653** | 0.076 | 0.816 | 0.274 | 0.748 | 0.099 | 0.676 | 0.001 | 0.825 | 0.255 | 0.717 | 0.151 | 0.125 |

Cross-dataset testing using only molecular features showed more limited and inconsistent benefits from normalization, similar to patterns observed with combined data, though overall performance levels were generally lower (Table 5 and Supplementary Fig. 4).

Table 5

Cross-dataset testing of death classification on molecular data alone in three cancer types

| Normalization method | LASSO | Delta | LR | Delta | MLP | Delta | RF | Delta | SVM_W | Delta | XGB_W | Delta | Median of delta |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Lung adenocarcinoma: death classification trained in TCGA (n = 510) and tested in OncoSG (n = 181) dataset | | | | | | | | | | | | | |
| BA | | | | | | | | | | | | | |
| Z_Original | 0.539*** | Ref | 0.513*** | Ref | 0.527* | Ref | 0.515*** | Ref | 0.586*** | Ref | 0.532*** | Ref | Ref |
| Z_Raw | 0.613(B) | 0.074 | 0.566 | 0.053 | 0.569 | 0.042 | 0.557 | 0.042 | 0.630(B) | 0.044 | 0.608(B) | 0.076 | 0.049 |
| Z_Binary | 0.500***(W) | −0.039 | 0.500***(W) | −0.01 | 0.500***(W) | −0.03 | 0.500***(W) | −0.02 | 0.500***(W) | −0.086 | 0.500***(W) | −0.032 | −0.030 |
| Z_NICG | 0.537*** | −0.002 | 0.657***(B) | 0.144 | 0.579(B) | 0.052 | 0.559(B) | 0.044 | 0.630(B) | 0.044 | 0.604 | 0.072 | 0.048 |
| Z_NPN | 0.578** | 0.039 | 0.634*** | 0.121 | 0.529** | 0.002 | 0.541* | 0.026 | 0.621 | 0.035 | 0.578 | 0.046 | 0.037 |
| Z_QN | 0.549** | 0.01 | 0.583 | 0.07 | 0.550 | 0.023 | 0.537* | 0.022 | 0.610** | 0.024 | 0.590 | 0.058 | 0.024 |
| Z_QNZ | 0.557** | 0.018 | 0.577 | 0.064 | 0.566 | 0.039 | 0.529*** | 0.014 | 0.628 | 0.042 | 0.583 | 0.051 | 0.041 |
| AUC | | | | | | | | | | | | | |
| Z_Original | 0.553*** | Ref | 0.531** | Ref | 0.537* | Ref | 0.629*** | Ref | 0.601*** | Ref | 0.584*** | Ref | Ref |
| Z_Raw | 0.633 | 0.08 | 0.564 | 0.033 | 0.567 | 0.03 | 0.701(B) | 0.072 | 0.661 | 0.06 | 0.655(B) | 0.071 | 0.066 |
| Z_Binary | 0.500***(W) | −0.053 | 0.500**(W) | −0.03 | 0.500(W) | −0.04 | 0.500***(W) | −0.13 | 0.500***(W) | −0.101 | 0.500***(W) | −0.084 | −0.069 |
| Z_NICG | 0.601* | 0.048 | 0.687***(B) | 0.156 | 0.589* | 0.052 | 0.686* | 0.057 | 0.684(B) | 0.083 | 0.647 | 0.063 | 0.060 |
| Z_NPN | 0.645(B) | 0.092 | 0.647** | 0.116 | 0.634*(B) | 0.097 | 0.662** | 0.033 | 0.643 | 0.042 | 0.608** | 0.024 | 0.067 |
| Z_QN | 0.630 | 0.077 | 0.620** | 0.089 | 0.625** | 0.088 | 0.667 | 0.038 | 0.651 | 0.05 | 0.650 | 0.066 | 0.072 |
| Z_QNZ | 0.626 | 0.073 | 0.622** | 0.091 | 0.618** | 0.081 | 0.684* | 0.055 | 0.642 | 0.041 | 0.622 | 0.038 | 0.064 |
| Melanoma: death classification tested in DFCI data (n = 40) with models trained on TCGA data (n = 360) | | | | | | | | | | | | | |
| BA | | | | | | | | | | | | | |
| Z_Original | 0.582* | Ref | 0.594*(W) | Ref | 0.523(W) | Ref | 0.620 | Ref | 0.500*(W) | Ref | 0.560**(W) | Ref | Ref |
| Z_Raw | 0.616 | 0.034 | 0.594(W) | 0 | 0.553 | 0.03 | 0.634 | 0.014 | 0.626 | 0.126 | 0.632(B) | 0.072 | 0.032 |
| Z_Binary | 0.564*(W) | −0.018 | 0.637*(B) | 0.043 | 0.575 | 0.052 | 0.631 | 0.011 | 0.654(B) | 0.154 | 0.595 | 0.035 | 0.039 |
| Z_NICG | 0.587 | 0.005 | 0.630 | 0.036 | 0.543 | 0.02 | 0.607 | −0.01 | 0.637 | 0.137 | 0.563** | 0.003 | 0.013 |
| Z_NPN | 0.650(B) | 0.068 | 0.617 | 0.023 | 0.582(B) | 0.059 | 0.617 | −0.003 | 0.631 | 0.131 | 0.584 | 0.024 | 0.041 |
| Z_QN | 0.600 | 0.018 | 0.617* | 0.023 | 0.582(B) | 0.059 | 0.661(B) | 0.041 | 0.594* | 0.094 | 0.590* | 0.03 | 0.036 |
| Z_QNZ | 0.572* | −0.01 | 0.636 | 0.042 | 0.561 | 0.038 | 0.602(W) | −0.02 | 0.591* | 0.091 | 0.611 | 0.051 | 0.040 |
| AUC | | | | | | | | | | | | | |
| Z_Original | 0.574* | Ref | 0.579**(W) | Ref | 0.492(W) | Ref | 0.647 | Ref | 0.544**(W) | Ref | 0.581 | Ref | Ref |
| Z_Raw | 0.607 | 0.033 | 0.579(W) | 0 | 0.570 | 0.078 | 0.682 | 0.035 | 0.667(B) | 0.123 | 0.651(B) | 0.07 | 0.053 |
| Z_Binary | 0.544*(W) | −0.03 | 0.628* | 0.049 | 0.548 | 0.056 | 0.655 | 0.008 | 0.656 | 0.112 | 0.580* | −0.001 | 0.029 |
| Z_NICG | 0.624(B) | 0.05 | 0.592** | 0.013 | 0.629(B) | 0.137 | 0.626(W) | −0.02 | 0.646 | 0.102 | 0.563***(W) | −0.018 | 0.032 |
| Z_NPN | 0.610 | 0.036 | 0.617** | 0.038 | 0.583 | 0.091 | 0.651 | 0.004 | 0.603*** | 0.059 | 0.601 | 0.02 | 0.037 |
| Z_QN | 0.589 | 0.015 | 0.641**(B) | 0.062 | 0.527 | 0.035 | 0.686(B) | 0.039 | 0.605** | 0.061 | 0.619 | 0.038 | 0.039 |
| Z_QNZ | 0.592 | 0.018 | 0.638** | 0.059 | 0.549 | 0.057 | 0.650 | 0.003 | 0.593** | 0.049 | 0.639 | 0.058 | 0.053 |
| Glioblastoma: death classification tested in CPTAC data (n = 97) with models trained on TCGA data (n = 145) | | | | | | | | | | | | | |
| BA | | | | | | | | | | | | | |
| Z_Original | 0.602 | Ref | 0.548***(W) | Ref | 0.516**(W) | Ref | 0.527** | Ref | 0.536***(W) | Ref | 0.529(W) | Ref | Ref |
| Z_Raw | 0.620(B) | 0.018 | 0.599(B) | 0.051 | 0.558 | 0.042 | 0.529 | 0.002 | 0.592 | 0.056 | 0.555 | 0.026 | 0.034 |
| Z_Binary | 0.564*** | −0.038 | 0.580* | 0.032 | 0.574 | 0.058 | 0.518(W) | −0.01 | 0.552 | 0.016 | 0.577 | 0.048 | 0.024 |
| Z_NICG | 0.567*** | −0.035 | 0.594 | 0.046 | 0.604**(B) | 0.088 | 0.542 | 0.015 | 0.604(B) | 0.068 | 0.563 | 0.034 | 0.040 |
| Z_NPN | 0.610 | 0.008 | 0.598 | 0.05 | 0.590 | 0.074 | 0.529 | 0.002 | 0.600 | 0.064 | 0.571 | 0.042 | 0.046 |
| Z_QN | 0.538***(W) | −0.064 | 0.583 | 0.035 | 0.571 | 0.055 | 0.539 | 0.012 | 0.586 | 0.05 | 0.561 | 0.032 | 0.034 |
| Z_QNZ | 0.546*** | −0.056 | 0.574 | 0.026 | 0.578 | 0.062 | 0.545(B) | 0.018 | 0.573 | 0.037 | 0.589(B) | 0.06 | 0.031 |
| AUC | | | | | | | | | | | | | |
| Z_Original | 0.599**(W) | Ref | 0.559(W) | Ref | 0.564***(W) | Ref | 0.579***(W) | Ref | 0.630** | Ref | 0.571***(W) | Ref | Ref |
| Z_Raw | 0.636 | 0.037 | 0.637 | 0.078 | 0.593 | 0.029 | 0.633(B) | 0.054 | 0.651(B) | 0.021 | 0.631 | 0.06 | 0.046 |
| Z_Binary | 0.648 | 0.049 | 0.639 | 0.08 | 0.619*** | 0.055 | 0.616 | 0.037 | 0.588**(W) | −0.042 | 0.633 | 0.062 | 0.052 |
| Z_NICG | 0.613 | 0.014 | 0.634 | 0.075 | 0.631*** | 0.067 | 0.602 | 0.023 | 0.622** | −0.008 | 0.646(B) | 0.075 | 0.045 |
| Z_NPN | 0.663(B) | 0.064 | 0.640(B) | 0.081 | 0.655***(B) | 0.091 | 0.622 | 0.043 | 0.651(B) | 0.021 | 0.626 | 0.055 | 0.060 |
| Z_QN | 0.600 | 0.001 | 0.605 | 0.046 | 0.603 | 0.039 | 0.581* | 0.002 | 0.628 | −0.002 | 0.617 | 0.046 | 0.021 |
| Z_QNZ | 0.601 | 0.002 | 0.618 | 0.059 | 0.632** | 0.068 | 0.607 | 0.028 | 0.622** | −0.008 | 0.601 | 0.03 | 0.029 |

Across the three cancer types, normalization offered only marginal benefits in cross-dataset settings when relying solely on molecular data, and certain methods (e.g., Z_Binary) occasionally degraded performance. As with combined features, LASSO without additional normalization demonstrated remarkable consistency, achieving positive or near-neutral deltas in nearly all cross-dataset scenarios. In contrast, more complex models exhibited greater variability, underscoring the robustness of regularized approaches for generalization across heterogeneous datasets.
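The Delta columns in Tables 4 and 5 are simply each normalization method's score minus the Z_Original baseline for the same model, with the row-level summary being the median across the six models. A minimal Python sketch of that bookkeeping (the function name is ours; the numbers are the melanoma BA values for LASSO and LR from Table 5):

```python
from statistics import median

def deltas_vs_baseline(scores: dict, baseline_key: str = "Z_Original") -> dict:
    """For each normalization method, subtract the baseline score per model,
    then summarize with the median delta across models (as in Tables 4-5)."""
    base = scores[baseline_key]
    out = {}
    for method, per_model in scores.items():
        if method == baseline_key:
            continue
        d = {m: round(per_model[m] - base[m], 3) for m in base}
        out[method] = {"per_model": d, "median": round(median(d.values()), 3)}
    return out

# Balanced-accuracy values for melanoma (Table 5, BA rows), LASSO and LR only
scores = {
    "Z_Original": {"LASSO": 0.582, "LR": 0.594},
    "Z_NPN":      {"LASSO": 0.650, "LR": 0.617},
}
print(deltas_vs_baseline(scores))
# Z_NPN per-model deltas: LASSO 0.068, LR 0.023, matching the table
```

The published medians are taken over all six models, so the median here (over two models) is illustrative only.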

Discussion

This large and comprehensive study using three pairs of cancer transcriptomic and clinical datasets reveals critical insights into ML performance in bioinformatics, particularly for cross-dataset generalization and preprocessing strategies. These results challenge conventional practices and may help develop robust, generalizable models for applications such as gene expression analyses.

First, we showed that the LASSO method without normalization consistently performed well in cross-dataset (external) testing of three pairs of transcriptomic and clinical datasets (i.e., three cancer types), challenging the necessity of normalization for cross-dataset tasks. This suggests that regularization inherent in LASSO effectively mitigates overfitting, simplifies preprocessing pipelines, and reduces computational costs.29,55 Our finding also aligns with a study showing robust LASSO performance without extensive preprocessing.27 However, LASSO may not be robust for some tasks, as shown in one study on spatial gene expression in the brain.56
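LASSO's robustness here plausibly traces to its L1 penalty, which shrinks coefficients and sets uninformative ones exactly to zero. As an illustration (our own sketch, not the study's code), the soft-thresholding operator at the heart of coordinate-descent LASSO solvers makes this behavior explicit:

```python
import numpy as np

def soft_threshold(beta: np.ndarray, lam: float) -> np.ndarray:
    """Proximal operator of the L1 penalty: shrinks every coefficient toward
    zero by lam, and sets those with magnitude <= lam exactly to zero
    (implicit feature selection)."""
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

coefs = np.array([0.9, -0.05, 0.3, -0.6, 0.02])
print(soft_threshold(coefs, lam=0.1))
# small coefficients (|beta| <= 0.1) become exactly 0; the rest shrink by 0.1
```

This built-in sparsity is one mechanism by which LASSO can resist fitting dataset-specific noise that does not transfer across cohorts.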

Second, normalization and gene selection (DEG/NDEG) significantly improve intra-dataset performance but yield limited gains in cross-dataset testing, often leading to overfitting.13,19,23–25 This underscores the need for cautious application of extensive preprocessing to avoid models that fail in external validation.20 Interestingly, as shown by us and others, LASSO and other regularized methods can be used to reduce overfitting in microarray and single-cell RNA-seq data.24,26
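Balanced accuracy, one of the two metrics used to quantify these intra- versus cross-dataset differences, averages sensitivity and specificity, so a trivial majority-class model scores 0.5 even on heavily imbalanced death labels. A minimal sketch (ours):

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on positives) and specificity (recall on
    negatives); 0.5 is chance level regardless of class imbalance."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)

# A majority-class classifier on 1-vs-9 imbalanced labels scores 0.5, not 0.9
y_true = [1] + [0] * 9
y_pred = [0] * 10
print(balanced_accuracy(y_true, y_pred))  # -> 0.5
```

This is why the Z_Binary rows pinned at 0.500 in Table 5 indicate chance-level cross-dataset performance.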

Third, performance differences among ML models are smaller in intra-dataset testing than in cross-dataset settings, partly due to limited normalization and gene selection benefits in external contexts. This highlights the importance of selecting robust algorithms for heterogeneous datasets, such as random forest.13,57–59 Indeed, recent works confirm that simpler, regularized models often maintain consistent performance across datasets, unlike complex models prone to overfitting.60,61 This finding may help select models for applications requiring broader generalizability.1

Fourth, reliance on intra-dataset evaluation (e.g., cross-validation) may overestimate model generalizability, as shown by others and by us.9,10 For example, performance gains of ML models achieved through negative data generation also failed to transfer to cross-dataset testing.9 We thus advocate shifting toward cross-dataset evaluation to prioritize models with consistent, acceptable performance, enhancing applicability in clinical settings such as precision medicine,8 although intra-dataset evaluation may match cross-dataset evaluation in some scenarios.62 This paradigm shift addresses the gap between intra-dataset optimization and real-world robustness.59 Others have also introduced benchmark datasets to robustly assess model performance.63 Further studies are needed to address this issue in more depth.
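Under this study's working definition, overfitting is intra-dataset performance exceeding cross-dataset performance, and the corresponding gap is straightforward to report alongside any model. A trivial sketch with hypothetical scores (ours, not values from the tables):

```python
def generalization_gap(intra_score: float, cross_score: float) -> float:
    """Overfitting indicator: intra-dataset score minus cross-dataset score
    for the same model and metric. Larger positive gaps indicate poorer
    external generalization."""
    return round(intra_score - cross_score, 3)

# Hypothetical example: BA of 0.85 under cross-validation but 0.62 on an
# external cohort yields a gap of 0.23
print(generalization_gap(0.85, 0.62))  # -> 0.23
```

Reporting this gap routinely, rather than cross-validation scores alone, would make the kind of overfitting documented here immediately visible.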

Finally, normalization’s impact varies by ML model and is more pronounced with all data than with molecular data alone. This is particularly relevant for multi-omics integration, where data-specific preprocessing strategies are critical.3,38,64 Recent studies on multi-omics data modeling support tailored normalization approaches to improve ML performance.31,65,66 Therefore, we recommend data-specific ML workflows to enhance cross-dataset robustness.
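As one example of the preprocessing choices at stake, quantile normalization (the basis of Z_QN) forces every sample onto a shared reference distribution, which standardizes samples within a cohort but can be sensitive to cohort composition across datasets. A generic numpy sketch of standard quantile normalization (ours, simplified, with no tie handling, not the study's exact implementation):

```python
import numpy as np

def quantile_normalize(X: np.ndarray) -> np.ndarray:
    """Quantile-normalize the columns (samples) of a genes-by-samples matrix:
    each sample's sorted values are replaced by the across-sample mean of the
    values at the same rank, so all samples share one distribution while
    within-sample gene rankings are preserved."""
    order = np.argsort(X, axis=0)           # rank of each gene within a sample
    ref = np.sort(X, axis=0).mean(axis=1)   # reference distribution per rank
    out = np.empty_like(X, dtype=float)
    for j in range(X.shape[1]):
        out[order[:, j], j] = ref
    return out

X = np.array([[5.0, 4.0],
              [2.0, 8.0],
              [3.0, 6.0]])
print(quantile_normalize(X))
# both columns now contain the same values {3.0, 4.5, 6.5}, reassigned by rank
```

Because the reference distribution is recomputed per cohort, applying such a step independently to training and external datasets can itself introduce cross-dataset mismatch.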

Certain limitations of this study should be acknowledged. Our analyses were conducted on a specific set of transcriptomic and clinical datasets for each of the three cancer types, albeit repeated in three pairs of cancer datasets, and on a selected repertoire of ML models and preprocessing techniques. Future work is required to examine whether our specific quantitative findings generalize to other ML methods and to datasets for other conditions, such as diabetes, digestive diseases, and neurological diseases. Moreover, the consistently good performance of LASSO is an empirical finding and warrants additional theoretical and experimental research. It will also be interesting to assess whether and how other regularized methods, as well as batch-effect correction strategies, can mitigate overfitting while maintaining cross-dataset performance. Future work will extend the current binary survival prediction to time-to-event survival analyses to leverage follow-up information. Finally, developing novel evaluation metrics that better capture cross-dataset robustness would be highly useful but is beyond the scope of this study.

Conclusions

Our findings challenge the reliance on normalization and intra-dataset evaluation, advocating for regularized models and cross-dataset validation to improve the generalizability of ML modeling. Future work should explore optimal preprocessing strategies for specific data types and develop standardized cross-dataset evaluation frameworks to advance bioinformatics ML applications.

Supporting information

Supplementary material for this article is available at https://doi.org/10.14218/JCTP.2025.00051.

Supplementary Table 1

Basic modeling factor values and model hyperparameter grids for three pairs of transcriptomic and clinical datasets.

(DOCX)

Supplementary Table 2

Definitions of normalization methods.

(DOCX)

Supplementary Table 3

Baseline characteristics of lung adenocarcinoma patients in both the TCGA and OncoSG datasets.

(DOCX)

Supplementary Table 4

Baseline characteristics of glioblastoma patients in both the TCGA and CPTAC datasets.

(DOCX)

Supplementary Table 5

Baseline characteristics of melanoma patients in both the TCGA and DFCI datasets.

(DOCX)

Supplementary Table 6

Performance metrics of internal testing obtained by models trained on TCGA data with molecular and four clinical features (age, gender, TMB, and tumor stage) (Data grouping A).

(DOCX)

Supplementary Table 7

Performance metrics of internal testing obtained by models trained on TCGA data with molecular features (Data grouping B).

(DOCX)

Supplementary Table 8

Performance metrics of internal testing obtained by models trained on TCGA data with molecular and three clinical features (age, gender, and TMB) (Data grouping C).

(DOCX)

Supplementary Table 9

P-values of Welch’s t-test between data grouping B and data grouping C for performance metrics of internal testing obtained by models trained on TCGA data.

(DOCX)

Supplementary Table 10

Performance metrics of external testing obtained by predicting on OncoSG data with models trained on TCGA data (including molecular and four clinical features) (Data grouping A).

(DOCX)

Supplementary Table 11

Performance metrics of external testing obtained by predicting on OncoSG data with models trained on TCGA data (including molecular features) (Data grouping B).

(DOCX)

Supplementary Table 12

Performance metrics of external testing obtained by predicting on OncoSG data with models trained on TCGA data (including molecular and three clinical features) (Data grouping C).

(DOCX)

Supplementary Table 13

P-values of Welch’s t-test between data grouping B and data grouping C for performance metrics of external testing obtained by predicting on OncoSG data with models trained on TCGA data.

(DOCX)

Supplementary Table 14

Performance metrics of internal testing obtained by models trained on OncoSG data with molecular and four clinical features (age, gender, TMB, and tumor stage) (Data grouping A).

(DOCX)

Supplementary Table 15

Performance metrics of internal testing obtained by models trained on OncoSG data with molecular features (Data grouping B).

(DOCX)

Supplementary Table 16

Performance metrics of internal testing obtained by models trained on OncoSG data with molecular and three clinical features (age, gender, and TMB) (Data grouping C).

(DOCX)

Supplementary Table 17

P-values of Welch’s t-test between data grouping B and data grouping C for performance metrics of internal testing obtained by models trained on OncoSG data.

(DOCX)

Supplementary Table 18

Performance metrics of external testing obtained by predicting on TCGA data with models trained on OncoSG data (including molecular and four clinical features) (Data grouping A).

(DOCX)

Supplementary Table 19

Performance metrics of external testing obtained by predicting on TCGA data with models trained on OncoSG data (including molecular features) (Data grouping B).

(DOCX)

Supplementary Table 20

Performance metrics of external testing obtained by predicting on TCGA data with models trained on OncoSG data (including molecular and three clinical features) (Data grouping C).

(DOCX)

Supplementary Table 21

P-values of Welch’s t-test between data grouping B and data grouping C for performance metrics of external testing obtained by predicting on TCGA data with models trained on OncoSG data.

(DOCX)

Supplementary Table 22

Comparison of the best performances (balanced accuracy) in data grouping A, B, and C based on the model trained on TCGA data.

(DOCX)

Supplementary Table 23

Comparison of the best performances (balanced accuracy) of data grouping A, B, and C based on the model trained on OncoSG data.

(DOCX)

Supplementary Table 24

Per-model 95% confidence intervals across repeated runs and within-model FDR-adjusted Welch’s t-test comparisons based on the model trained on TCGA lung adenocarcinoma data (Z_Raw as reference).

(DOCX)

Supplementary Table 25

Per-model 95% confidence intervals across repeated runs and within-model FDR-adjusted Welch’s t-test comparisons based on the model trained on OncoSG lung adenocarcinoma data (Z_Raw as reference).

(DOCX)

Supplementary Table 26

Per-model 95% confidence intervals across repeated runs and within-model FDR-adjusted Welch’s t-test comparisons based on the model trained on TCGA melanoma data (Z_Raw as reference).

(DOCX)

Supplementary Table 27

Per-model 95% confidence intervals across repeated runs and within-model FDR-adjusted Welch’s t-test comparisons based on the model trained on TCGA glioblastoma data (Z_Raw as reference).

(DOCX)

Supplementary Fig. 1

Heatmap of death classification results from intra-dataset testing on all data in three cancer types.

*P < 0.05; **P < 0.01; ***P < 0.001 compared with Z_Raw. AUC, area under the curve of the receiver operating characteristic curve; BA, balanced accuracy; LASSO, Least Absolute Shrinkage and Selection Operator; LR, Logistic Regression; MLP, Multilayer Perceptron; RF, Random Forest; SVM_W, (linear) support vector machine with weighting; XGB_W, extreme gradient boosting with weighting; Z_Original, Z-transformed RNA-seq data in FPKM format, including all gene features shared between the two cohorts; Z_Raw, Z_Original data restricted to the selected DEGs; Z_Binary, binarization applied to Z_Raw data; Z_NPN, Non-Parametric Normalization (NPN) applied to Z_Raw data; Z_QN, Quantile Normalization (QN) applied to Z_Raw data; Z_QNZ, Quantile Normalization with Z-Score (QNZ) applied to Z_Raw data; Z_NICG, Normalization using Internal Control Genes (NICG) applied to Z_Raw data.

(TIF)

Supplementary Fig. 2

Heatmap of death classification results from cross-dataset testing on all data in three cancer types.

*P < 0.05; **P < 0.01; ***P < 0.001 compared with Z_Raw. AUC, area under the curve of the receiver operating characteristic curve; BA, balanced accuracy; LASSO, Least Absolute Shrinkage and Selection Operator; LR, Logistic Regression; MLP, Multilayer Perceptron; RF, Random Forest; SVM_W, (linear) support vector machine with weighting; XGB_W, extreme gradient boosting with weighting; Z_Original, Z-transformed RNA-seq data in FPKM format, including all gene features shared between the two cohorts; Z_Raw, Z_Original data restricted to the selected DEGs; Z_Binary, binarization applied to Z_Raw data; Z_NPN, Non-Parametric Normalization (NPN) applied to Z_Raw data; Z_QN, Quantile Normalization (QN) applied to Z_Raw data; Z_QNZ, Quantile Normalization with Z-Score (QNZ) applied to Z_Raw data; Z_NICG, Normalization using Internal Control Genes (NICG) applied to Z_Raw data.

(TIF)

Supplementary Fig. 3

Heatmap of death classification results from intra-dataset testing on molecular data alone in three cancer types.

*P < 0.05; **P < 0.01; ***P < 0.001 compared with Z_Raw. AUC, area under the curve of the receiver operating characteristic curve; BA, balanced accuracy; LASSO, Least Absolute Shrinkage and Selection Operator; LR, Logistic Regression; MLP, Multilayer Perceptron; RF, Random Forest; SVM_W, (linear) support vector machine with weighting; XGB_W, extreme gradient boosting with weighting; Z_Original, Z-transformed RNA-seq data in FPKM format, including all gene features shared between the two cohorts; Z_Raw, Z_Original data restricted to the selected DEGs; Z_Binary, binarization applied to Z_Raw data; Z_NPN, Non-Parametric Normalization (NPN) applied to Z_Raw data; Z_QN, Quantile Normalization (QN) applied to Z_Raw data; Z_QNZ, Quantile Normalization with Z-Score (QNZ) applied to Z_Raw data; Z_NICG, Normalization using Internal Control Genes (NICG) applied to Z_Raw data.

(TIF)

Supplementary Fig. 4

Heatmap of death classification results from cross-dataset testing on molecular data alone in three cancer types.

*P < 0.05; **P < 0.01; ***P < 0.001 compared with Z_Raw. AUC, area under the curve of the receiver operating characteristic curve; BA, balanced accuracy; LASSO, Least Absolute Shrinkage and Selection Operator; LR, Logistic Regression; MLP, Multilayer Perceptron; RF, Random Forest; SVM_W, (linear) support vector machine with weighting; XGB_W, extreme gradient boosting with weighting; Z_Original, Z-transformed RNA-seq data in FPKM format, including all gene features shared between the two cohorts; Z_Raw, Z_Original data restricted to the selected DEGs; Z_Binary, binarization applied to Z_Raw data; Z_NPN, Non-Parametric Normalization (NPN) applied to Z_Raw data; Z_QN, Quantile Normalization (QN) applied to Z_Raw data; Z_QNZ, Quantile Normalization with Z-Score (QNZ) applied to Z_Raw data; Z_NICG, Normalization using Internal Control Genes (NICG) applied to Z_Raw data.

(TIF)

Declarations

Acknowledgement

None.

Ethical statement

This exempt study using publicly available de-identified data did not require IRB review. Data acquisition and use complied with cBioPortal’s data access policies and ethical guidelines. All procedures were conducted in accordance with the principles of the Declaration of Helsinki (as revised in 2024).

Data sharing statement

The datasets used in this study are available on the cBioPortal website (https://www.cbioportal.org/). The program code is available from the corresponding author on reasonable request.

Funding

This work was supported by the National Cancer Institute, National Institutes of Health (grant number R37CA277812 to LZ). The funder of the study had no role in study design, data collection, data analysis, data interpretation, or writing of the report. The corresponding author had full access to all the data in the study and had final responsibility for the decision to submit the manuscript for publication.

Conflict of interest

Lanjing Zhang is a deputy editor-in-chief of Journal of Clinical and Translational Pathology. The authors declare no other conflicts of interest.

Authors’ contributions

Study conceptualization and design, ensuring data access, accuracy and integrity (LZ), and manuscript writing (FD and LZ). Both authors contributed to the writing or revision of the article and approved the final version for publication.

References

  1. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999;286(5439):531-537 View Article PubMed/NCBI
  2. Deng F, Zhao L, Yu N, Lin Y, Zhang L. Union With Recursive Feature Elimination: A Feature Selection Framework to Improve the Classification Performance of Multicategory Causes of Death in Colorectal Cancer. Lab Invest 2024;104(3):100320 View Article PubMed/NCBI
  3. Deng F, Zhou H, Lin Y, Heim JA, Shen L, Li Y, et al. Predict multicategory causes of death in lung cancer patients using clinicopathologic factors. Comput Biol Med 2021;129:104161 View Article PubMed/NCBI
  4. Deng F, Shen L, Wang H, Zhang L. Classify multicategory outcome in patients with lung adenocarcinoma using clinical, transcriptomic and clinico-transcriptomic data: machine learning versus multinomial models. Am J Cancer Res 2020;10(12):4624-4639 View Article
  5. Cui M, Deng F, Disis ML, Cheng C, Zhang L. Advances in the Clinical Application of High-throughput Proteomics. Explor Res Hypothesis Med 2024;9(3):209-220 View Article PubMed/NCBI
  6. Cui M, Cheng C, Zhang L. High-throughput proteomics: a methodological mini-review. Lab Invest 2022;102(11):1170-1181 View Article PubMed/NCBI
  7. Liu DD, Zhang L. Trends in the characteristics of human functional genomic data on the gene expression omnibus, 2001-2017. Lab Invest 2019;99(1):118-127 View Article PubMed/NCBI
  8. Bernau C, Riester M, Boulesteix AL, Parmigiani G, Huttenhower C, Waldron L, et al. Cross-study validation for the assessment of prediction algorithms. Bioinformatics 2014;30(12):i105-i112 View Article PubMed/NCBI
  9. Cohen-Davidi E, Veksler-Lublinsky I. Benchmarking the negatives: Effect of negative data generation on the classification of miRNA-mRNA interactions. PLoS Comput Biol 2024;20(8):e1012385 View Article PubMed/NCBI
  10. Mohammadzadeh-Vardin T, Ghareyazi A, Gharizadeh A, Abbasi K, Rabiee HR. DeepDRA: Drug repurposing using multi-omics data integration with autoencoders. PLoS One 2024;19(7):e0307649 View Article PubMed/NCBI
  11. Yu AC, Mohajer B, Eng J. External Validation of Deep Learning Algorithms for Radiologic Diagnosis: A Systematic Review. Radiol Artif Intell 2022;4(3):e210064 View Article PubMed/NCBI
  12. Feng CH, Deng F, Disis ML, Gao N, Zhang L. Towards machine learning fairness in classifying multicategory causes of deaths in colorectal or lung cancer patients. Brief Bioinform 2025;26(4):bbaf398 View Article PubMed/NCBI
  13. Deng F, Feng CH, Gao N, Zhang L. Normalization and Selecting Non-Differentially Expressed Genes Improve Machine Learning Modelling of Cross-Platform Transcriptomic Data. Trans Artif Intell 2025;1(1):5 View Article PubMed/NCBI
  14. Sun R, Zhu H, Wang Y, Wang J, Jiang C, Cao Q, et al. Circular RNA expression and the competitive endogenous RNA network in pathological, age-related macular degeneration events: A cross-platform normalization study. J Biomed Res 2023;37(5):367-381 View Article PubMed/NCBI
  15. Foltz SM, Greene CS, Taroni JN. Cross-platform normalization enables machine learning model training on microarray and RNA-seq data simultaneously. Commun Biol 2023;6(1):222 View Article PubMed/NCBI
  16. Koo B, Kim J, Nam Y, Kim Y. The Performance of Post-Fall Detection Using the Cross-Dataset: Feature Vectors, Classifiers and Processing Conditions. Sensors (Basel) 2021;21(14):4638 View Article PubMed/NCBI
  17. Junet V, Farrés J, Mas JM, Daura X. CuBlock: a cross-platform normalization method for gene-expression microarrays. Bioinformatics 2021;37(16):2365-2373 View Article PubMed/NCBI
  18. Montesinos López OA, Montesinos López A, Crossa J. Overfitting, model tuning, and evaluation of prediction performance. Multivariate Statistical Machine Learning Methods for Genomic Prediction. Cham: Springer; 2022:109-139 View Article
  19. Krawczuk J, Łukaszuk T. The feature selection bias problem in relation to high-dimensional gene data. Artif Intell Med 2016;66:63-71 View Article PubMed/NCBI
  20. Simon R, Radmacher MD, Dobbin K, McShane LM. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Inst 2003;95(1):14-18 View Article PubMed/NCBI
  21. Vujović Ž. Classification model evaluation metrics. International Journal of Advanced Computer Science and Applications 2021;12(6):599-606 View Article
  22. Raschka S. Model evaluation, model selection, and algorithm selection in machine learning. arXiv 2018
  23. Ambroise C, McLachlan GJ. Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci U S A 2002;99(10):6562-6566 View Article PubMed/NCBI
  24. Hafemeister C, Satija R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol 2019;20(1):296 View Article PubMed/NCBI
  25. Huang D, Chow T. Effective gene selection method with small sample sets using gradient-based and point injection techniques. IEEE/ACM Trans Comput Biol Bioinform 2007;4(3):467-475 View Article PubMed/NCBI
  26. Xiong Y, Ling QH, Han F, Liu QH. An efficient gene selection method for microarray data based on LASSO and BPSO. BMC Bioinformatics 2019;20(Suppl 22):715 View Article PubMed/NCBI
  27. Smith AM, Walsh JR, Long J, Davis CB, Henstock P, Hodge MR, et al. Standard machine learning approaches outperform deep representation learning on phenotype prediction from transcriptomics data. BMC Bioinformatics 2020;21(1):119 View Article PubMed/NCBI
  28. Deng F, Zhang Y, Zhang L. Toward the Best Generalizable Performance of Machine Learning in Modeling Omic and Clinical Data. Lab Invest 2025;105(12):104253 View Article PubMed/NCBI
  29. Tibshirani R. Regression Shrinkage and Selection via the Lasso. J R Stat Soc Series B Stat Methodol 1996;58(1):267-288 View Article
  30. Benchekroun M, Velmovitsky PE, Istrate D, Zalc V, Morita PP, Lenne D. Cross Dataset Analysis for Generalizability of HRV-Based Stress Detection Models. Sensors (Basel) 2023;23(4):1807 View Article PubMed/NCBI
  31. Tarazona S, Balzano-Nogueira L, Gómez-Cabrero D, Schmidt A, Imhof A, Hankemeier T, et al. Harmonization of quality metrics and power calculation in multi-omic studies. Nat Commun 2020;11(1):3092 View Article PubMed/NCBI
  32. Cerami E, Gao J, Dogrusoz U, Gross BE, Sumer SO, Aksoy BA, et al. The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discov 2012;2(5):401-404 View Article PubMed/NCBI
  33. Chen J, Yang H, Teo ASM, Amer LB, Sherbaf FG, Tan CQ, et al. Genomic landscape of lung adenocarcinoma in East Asians. Nat Genet 2020;52(2):177-186 View Article PubMed/NCBI
  34. Sanchez-Vega F, Mina M, Armenia J, Chatila WK, Luna A, La KC, et al. Oncogenic Signaling Pathways in The Cancer Genome Atlas. Cell 2018;173(2):321-337.e10 View Article PubMed/NCBI
  35. Van Allen EM, Miao D, Schilling B, Shukla SA, Blank C, Zimmer L, et al. Genomic correlates of response to CTLA-4 blockade in metastatic melanoma. Science 2015;350(6257):207-211 View Article PubMed/NCBI
  36. Hoadley KA, Yau C, Hinoue T, Wolf DM, Lazar AJ, Drill E, et al. Cell-of-Origin Patterns Dominate the Molecular Classification of 10,000 Tumors from 33 Types of Cancer. Cell 2018;173(2):291-304.e6 View Article PubMed/NCBI
  37. Wang LB, Karpova A, Gritsenko MA, Kyle JE, Cao S, Li Y, et al. Proteogenomic and metabolomic characterization of human glioblastoma. Cancer Cell 2021;39(4):509-528.e20 View Article PubMed/NCBI
  38. Deng F, Huang J, Yuan X, Cheng C, Zhang L. Performance and efficiency of machine learning algorithms for analyzing rectangular biomedical data. Lab Invest 2021;101(4):430-441 View Article PubMed/NCBI
  39. Feng CH, Disis ML, Cheng C, Zhang L. Multimetric feature selection for analyzing multicategory outcomes of colorectal cancer: random forest and multinomial logistic regression models. Lab Invest 2021 View Article PubMed/NCBI
  40. Bhuva DD, Cursons J, Davis MJ. Stable gene expression for normalisation and single-sample scoring. Nucleic Acids Res 2020;48(19):e113 View Article PubMed/NCBI
  41. Thompson JA, Tan J, Greene CS. Cross-platform normalization of microarray and RNA-seq data for machine learning applications. PeerJ 2016;4:e1621 View Article PubMed/NCBI
  42. Brodsky E, Darkhovsky BS. Non-Parametric Statistical Diagnosis: Problems and Methods. Dordrecht: Springer; 2013
  43. Vandesompele J, De Preter K, Pattyn F, Poppe B, Van Roy N, De Paepe A, et al. Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes. Genome Biol 2002;3(7):RESEARCH0034 View Article PubMed/NCBI
  44. Karthik S, Sudha M. A survey on machine learning approaches in gene expression classification in modelling computational diagnostic system for complex diseases. Int J Eng Adv Technol 2018;8(2):182-191
  45. Dunne RA. A statistical approach to neural networks for pattern recognition. John Wiley & Sons; 2007 View Article

About this Article

Cite this article
Deng F, Zhang L. Associations of Normalization and Regularization with Machine Learning Overfitting in Cross-dataset Classification of Deaths Using Transcriptomic and Clinical Data: A Secondary Analysis of Publicly Available Databases. J Clin Transl Pathol. Published online: Mar 19, 2026. doi: 10.14218/JCTP.2025.00051.
Article History
Received: December 13, 2025
Revised: February 11, 2026
Accepted: March 2, 2026
Published: March 19, 2026
DOI: http://dx.doi.org/10.14218/JCTP.2025.00051
  • Journal of Clinical and Translational Pathology
  • pISSN 2993-5202
  • eISSN 2771-165X