Gene Cluster Expression Index and Potential Indications for Targeted Therapy and Immunotherapy for Lung Cancers

  • Aibing Rao* 
Cancer Screening and Prevention   2024;3(1):24-35

doi: 10.14218/CSP.2023.00034


Citation: Rao A. Gene Cluster Expression Index and Potential Indications for Targeted Therapy and Immunotherapy for Lung Cancers. Cancer Screen Prev. 2024;3(1):24-35. doi: 10.14218/CSP.2023.00034.

Abstract

Background and objectives

About 30% of lung cancer patients are eligible for targeted therapy or immunotherapy under the current criteria. In this study, a novel gene cluster expression analysis was introduced with the goal of potentially extending these treatments to more patients based on the proposed criteria.

Methods

Selected Gene Expression Omnibus (GEO) data sets were downloaded, normalized, and analyzed. A univariate recurrence prediction model was built based on the receiver operating characteristic curve, from which an optimal cutoff was determined to set an abnormality status, called the gene cluster expression index (GCEI). Recurrence and survival risks were calculated and compared between the two subgroups indexed by the GCEI. Moreover, a combinatory GCEI was introduced, and its performance was analyzed for combined multiple cluster statuses.

Results

The recurrence risks of the patient subgroups with abnormally expressed clusters (GCEI = 1) were much higher than those of the corresponding normal subgroups (GCEI = 0). The higher risks ranged from 120% to 300% of that of the corresponding lower-risk group.

Conclusions

The GCEI can be used to classify lung cancers with dramatically different recurrence risks and may be used to guide targeted therapy or immunotherapy for patients who are in a high-risk group but do not qualify for such treatment according to conventional companion tests.

Keywords

Transcriptome profiling, Gene cluster expression index, RNA expression analysis, Multivariate modeling, Lung cancer, Targeted therapy, Immunotherapy

Introduction

Gene expression analysis and transcriptome profiling have been extensively explored in lung cancer1–5; however, there has not been much research on gene expression profiling for targeted therapy and immunotherapy. The current standard approach to targeted therapy is via companion DNA tests,6 while the immunotherapy option involves routine tests, such as pathological immunoassay for the protein expression of PD1 or PD-L1 or by DNA-based Next Generation Sequencing (NGS) assessment of the tumor mutation burden, mismatch repair, and microsatellite instability. Transcriptome profiling has emerged as a promising biomarker for cancer treatment and has shown encouraging clinical results.7 In non-small cell lung cancer (NSCLC), a study showed that gene expression profiling might have better prognostic prediction power than considering the mutation status.8 In this in-silico study, we present a framework with a novel analysis procedure to introduce the gene cluster expression index (GCEI) and demonstrate its power to stratify lung cancer patients with dramatically different prognostic risks.

Materials and methods

Preparation and preprocessing of the data sets

Training data set

Two lung cancer microarray data sets, GSE30219, originating from Rousseaux et al.,9 and GSE31210, originating from Okayama et al.,10 were downloaded from the Lung Cancer Explorer (LCE) web portal with standardized clinical data according to Cai et al.11 There were 482 patients with non-empty recurrence labels, among whom 168 cases (35%) were labeled as recurred within two years of diagnosis. The two data sets were further normalized by aligning the median of all the samples to 0 and then by aligning the median of all the genes to 0 independently. There were about 17,000 genes common to the data sets selected here and those used in the testing data listed below. All samples with a missing recurrence label or a missing expression value for a common gene were omitted. A combined data set was obtained by slicing and aligning the common genes and common clinical variables from the two normalized sets and then stacking them together. There were 310 patients with Stage I cancers, 111 with Stage II, 53 with Stage III or IV, and 8 with unknown stages. The average patient age was 61 years, with the youngest being 15 years old and the eldest 84 years old. There were 330 males and 152 females.
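
The two-step median alignment can be illustrated with a minimal sketch. It is written in Python rather than the R used in the study, and it assumes one reading of the description: each sample is first shifted so that its median across genes is 0, and each gene is then shifted so that its median across samples is 0. The function name median_center and the toy matrix are hypothetical.

```python
from statistics import median

def median_center(matrix):
    """Median-center a genes x samples matrix given as a list of rows (one row per gene)."""
    n_samples = len(matrix[0])
    # Step 1: shift each sample (column) so that its median across genes is 0.
    sample_medians = [median(row[j] for row in matrix) for j in range(n_samples)]
    shifted = [[row[j] - sample_medians[j] for j in range(n_samples)] for row in matrix]
    # Step 2: shift each gene (row) so that its median across samples is 0.
    centered = []
    for row in shifted:
        gene_median = median(row)
        centered.append([value - gene_median for value in row])
    return centered

# Hypothetical toy data: 3 genes (rows) x 3 samples (columns).
toy = [[1.0, 2.0, 3.0], [2.0, 4.0, 9.0], [3.0, 9.0, 12.0]]
centered = median_center(toy)
```

After the second step every gene has median 0 by construction; the sample medians are only guaranteed to be 0 after the first step.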

Testing data set

Other data sets with recurrence annotations, namely GSE37745, GSE41271, GSE50081, and GSE74777, were downloaded from the same LCE web portal and normalized in the same way, with the medians of the samples and of the genes aligned to 0. In addition, the expression of each gene in each data set was further normalized to the distribution of the training set so that thresholds derived from the training set could be applied directly to the testing set. The goal was to align the first and third quartiles of the testing data with those of the training data by a linear mapping. For a given gene, let Q1 and Q2 be the first and third quartiles of its expression vector in the training data, and T1 and T2 be the first and third quartiles of its expression vector in the testing data. The normalized value N(x) was then obtained from the original value x via the formula: N(x) = (x − T1)/(T2 − T1) × (Q2 − Q1) + Q1. Under this mapping, T1 is sent to Q1 and T2 to Q2. Finally, all 4 normalized data sets were stacked together. Note that T1 and T2 were calculated using only a subset with the same proportion of recurred samples as that of the training set. This normalization step served solely to allow the recurrence thresholds derived from the training set to be applied directly to the testing set.
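
The quartile alignment can be sketched for a single gene as follows. This is an illustration in Python, not the R code used in the study: the function name quartile_align and the toy values are hypothetical, the quartiles come from Python's statistics.quantiles with the "inclusive" method (which may differ from the quantile settings used by the author), and the sketch omits the step of computing T1 and T2 on a subset matched for recurrence proportion.

```python
from statistics import quantiles

def quartile_align(test_values, train_values):
    """Linearly map test_values so that its first/third quartiles match those of train_values."""
    # Q1, Q2: first and third quartiles of the training data (naming follows the text).
    q1, _, q2 = quantiles(train_values, n=4, method="inclusive")
    # T1, T2: first and third quartiles of the testing data.
    t1, _, t2 = quantiles(test_values, n=4, method="inclusive")
    # N(x) = (x - T1) / (T2 - T1) * (Q2 - Q1) + Q1
    return [(x - t1) / (t2 - t1) * (q2 - q1) + q1 for x in test_values]

# Hypothetical expression values of one gene in the training and testing sets.
train = [0.0, 1.0, 2.0, 3.0, 4.0]
test = [10.0, 20.0, 30.0, 40.0, 50.0]
aligned = quartile_align(test, train)
```

Because the map is linear and increasing, the ordering of the testing samples is preserved; only the location and scale of the gene's distribution change.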

Pre-selected gene clusters

To start the analysis, we chose 11 genes: ALK, BRAF, EGFR, MET, NTRK, RAS, RET, ROS1, TP53, PDCD1, and CTLA4. The first nine were chosen because they have been intensively studied and demonstrated to be drivers of lung cancers, and their mutation status is used to guide targeted therapy; the last two were chosen for immunotherapy. However, it should be noted that the procedure we report is very general and can be applied to any other genes and clusters. For a given gene, the literature and online information were used to select the cluster members. For example, the ALK cluster consisted of fusion partners cataloged in Ou et al.12 and some other genes from the STRING database and the GeneCards description, with 107 genes finally pre-selected for the ALK cluster. Table 1 lists the cluster members. Each cluster was analyzed independently using the same method. Note that the cluster members can be changed in the future as more insights about an important seed gene become available.

Table 1

Pre-selected gene clusters for important lung cancer genes

SEED | GENE
ALK | ADAM17, AKAP8L, ALK, ALKAL2, ATAD2B, ATIC, ATP13A4, BCL11A, BIRC6, C12ORF75, C9ORF3, CAMKMT, CBL, CDK15, CEBPZ, CEP55, CLIP1, CLIP4, CLTC, CMTR1, CRIM1, CUX1, CYBRD1, DCHS1, DCTN1, DYSF, EIF2AK3, EML4, EML6, EPAS1, ERC1, FBN1, FBXO11, FBXO36, FRS2, FUT8, GCC2, HIP1, IRS1, ITGAV, KIF5B, KLC1, LCLAT1, LIMD1, LMO7, LPIN1, LYPD1, MAPK1, MAPK3, MDK, MPRIP, MSN, MTA3, MYT1L, NCOA1, NPM1, NYAP2, PHACTR1, PICALM, PLEKHA7, PLEKHH2, PLEKHM2, PPFIBP1, PPM1B, PRKAR1A, PRKCB, PTN, RANBP2, RBM20, SEC31A, SHC1, SLC16A7, SLMAP, SMPD1, SMPD2, SMPD3, SMPDL3A, SMPDL3B, SOCS5, SORCS1, SOS1, SPECC1, SPTBN1, SQSTM1, SRBD1, SRD5A2, STRN, SWAP70, TACR1, TANC1, TCF12, TFG, THADA, TNIP2, TOGARAM2, TPM4, TPR, TRIM66, TSPYL6, TTC27, TUBB, VIT, VKORC1L1, WDPCP, WDR37, WNK3, YAP1
BRAF | BRAF, MAP2K1, MAP2K2, MAP2K3, MAP2K4, MAP2K5, MAP2K6, MAP2K7, MAP3K1, MAP3K10, MAP3K11, MAP3K12, MAP3K13, MAP3K14, MAP3K14.AS1, MAP3K19, MAP3K2, MAP3K20, MAP3K21, MAP3K3, MAP3K4, MAP3K5, MAP3K6, MAP3K7, MAP3K7CL, MAP3K8, MAP3K9, MAP4K1, MAP4K2, MAP4K3, MAP4K4, MAP4K5, RAF1
EGFR | AREG, BRAF, BTC, CTNNB1, EGF, EGFR, EREG, MUC1, NRG1, NRG2, NRG3, NRG4, NRGN, RGS16, SRC, TGFA
MET | GAB1, GRB2, HGF, MET, PIK3R1, PLCG1, SRC, STAT3
NTRK | AFAP1, AGBL1, AGBL2, AGBL3, AGBL5, ARHGEF2, BCAN, BCR, BTBD1, CD74, CHTOP, CTRC, DAB2IP, EML4, ETV6, GRIPAP1, HNRNPA2B1, IGFBP7, IRF2BP2, LMNA, LRRC71, LYN, MPRIP, MRPL24, MYO5A, NACC2, NFASC, NTRK1, NTRK2, NTRK3, PAN3, PDE4DIP, PLEKHA6, PPL, QKI, RABGAP1L, RBPMS, RFWD2, SCYL3, SLITRK1, SLITRK2, SLITRK3, SLITRK4, SLITRK5, SLITRK6, SQSTM1, STRN, TFG, TLE4, TP53, TPM3, TPM4, TPR, TRAF2, TRIM24, TRIM63, UBE2R2, VCL
RAS | FRAS1, GRASP, HRAS, HRASLS, HRASLS2, HRASLS5, KRAS, MRAS, NRAS, RASA1, RASA2, RASA3, RASAL1, RASAL2, RASAL3, RASD1, RASD2, RASEF, RASGEF1A, RASGEF1B, RASGEF1C, RASGRF1, RASGRF2, RASGRP1, RASGRP2, RASGRP3, RASGRP4, RASIP1, RASL10A, RASL10B, RASL11A, RASL11B, RASL12, RASSF1, RASSF10, RASSF2, RASSF3, RASSF4, RASSF5, RASSF6, RASSF7, RASSF8, RASSF9, RRAS, RRAS2
RET | ADD3, ALOX5, ANK3, ANKS1B, ARHGAP12, CCDC186, CCDC3, CCDC6, CCDC88C, CCNY, CCNYL1, CDC123, CLIP1, CTNNA3, CUX1, DOCK1, DUSP5, DYDC1, EML4, EML6, EPC1, EPHA5, ERC1, FRMD4A, GDNF, GFRA1, GFRA2, GFRA3, GFRA4, GPRC5B, IL2RA, KIAA1217, KIAA1468, KIF13A, KIF5B, LSM14A, MINDY3, MPRIP, MRPS30, MYO5C, NCOA4, NRP1, PARD3, PCM1, PICALM, PRKAR1A, PRKCQ, PRKG1, PRPF18, PTER, PTK2, PTPRK, RASSF4, RBPMS, RET, RETN, RETNLB, RETREG1, RETREG2, RETREG3, RETSAT, RUFY2, SIRT1, SORBS1, TBC1D32, TRIM24, TRIM33, TSSK4, UBE2D1, WAC, ZNF43, ZNF438
ROS1 | AKT1, CCDC6, CD74, CEP72, CLTC, EZR, GOPC, IRS1, KDELR2, KMT2C, LIMA1, LRIG3, MAPK1, MAPK3, MSN, MYO5C, PLCG2, PROS1, PTPN11, RBPMS, ROS1, SDC4, SLC34A2, SLC6A17, SLMAP, STAT3, TFG, TMEM106B, TPD52L1, TPM3, VAV3, WNK1, ZCCHC8
TP53 | TP53, TP53BP1, TP53BP2, TP53I11, TP53I13, TP53I3, TP53INP1, TP53INP2, TP53RK, TP53TG1, TP53TG5
CTLA4 | CD274, CD276, CD28, CD80, CD86, CTLA4, FOXP3, GRB2, LCK, NFAM1, NFAT5, NFATC1, NFATC2, NFATC2IP, NFATC3, NFATC4, PTPN11
PDCD1 | CD247, CD274, CD3D, CD3E, CD4, CD80, FGL1, HLA.DQB1, HLA.DRB1, LAG3, PDCD1, PDCD1LG2, PRKCQ, PTPN11, ZAP70

Gene cluster expression index (GCEI)

The goal was to assign samples with a binary index for a given gene cluster, called the gene cluster expression index (GCEI). This process comprises two steps: (1) Determination of the expression index of each member gene, such that a GCEI of 1 represents a higher recurrence risk stratified by the expression, or 0 otherwise; (2) Determination of the percentage of genes with a GCEI of 1 for each patient and labeling the patient as abnormal with a GCEI of 1 if there are too many abnormal members in the cluster. Both steps involve univariate prediction modeling via receiver operating characteristic (ROC) curve analysis.

Univariate modeling with the ROC curve and setting the cutoff value

The ROC curve is a basic technique in medical diagnostic test evaluation.13 It is used here for univariate modeling. Given a training set with a binary index vector (e.g., recurrence) and a predictor vector (e.g., the expression vector of a gene), the training samples are sorted by predictor value in increasing order. A cutoff is then swept from the minimal to the maximal value with a fixed step size, and at each cutoff every sample receives a binary prediction. The predictions and the index (truth) give rise to a confusion matrix, from which a false positive rate (FPR) and a true positive rate (TPR) are computed. The ROC curve is then plotted on a unit box with the FPR as the x-axis and the TPR as the y-axis, as shown in Figure 1. The perfect prediction is at the top-left corner, (FPR, TPR) = (0, 1); therefore, moving along the curve from left to right, we can find the point closest to the corner (0, 1). This point is the optimal decision point balancing the specificity (1 − FPR) and the sensitivity (TPR), and the corresponding cutoff is chosen. The area of the region below the curve is called the area under the curve (AUC), with values ranging from 0.5 to 1 (note: for a predictor with an AUC between 0 and 0.5, negating the predictor, i.e., using 0 − predictor, flips the AUC to above 0.5).
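
The cutoff selection can be sketched in a few lines. This is an illustration in plain Python, not the ROCR-based code used in the study; it assumes that each distinct observed predictor value is tried as a cutoff (rather than a fixed step size) and that a sample is predicted positive when its value is greater than or equal to the cutoff. The function names and the toy data are hypothetical.

```python
import math

def roc_points(scores, labels):
    """Return (cutoff, FPR, TPR) for each distinct score used as a cutoff.

    A sample is predicted positive when score >= cutoff; labels are 0/1.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((cutoff, fp / neg, tp / pos))
    return points

def optimal_cutoff(scores, labels):
    """Pick the ROC point with the smallest Euclidean distance to (FPR, TPR) = (0, 1)."""
    return min(roc_points(scores, labels), key=lambda p: math.hypot(p[1], 1.0 - p[2]))

# Hypothetical predictor values and recurrence labels for 8 samples.
scores = [1, 2, 3, 4, 5, 6, 7, 8]
labels = [0, 0, 1, 0, 0, 1, 1, 1]
best = optimal_cutoff(scores, labels)  # (cutoff, FPR, TPR)
```

For a gene whose higher-risk direction is low expression, the same function can be applied to the negated expression vector.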

Fig. 1  Univariate ROCs of the top 12 genes in the ALK cluster in the decreasing order of Pδ.

Inverted expression value (0 − Expression) was used to plot the ROC for the downregulated genes, similarly hereinafter. ALK, anaplastic lymphoma kinase; AUC, area under the curve; ROC, receiver operating characteristic.

Determination of single gene expression abnormality concerning recurrence

For a given cluster, as listed in Table 1, for each member gene, we used its expression to predict recurrence and draw an ROC to obtain its optimal cutoff, which we then used to determine a sample expression status: normal or abnormal. Here, given a member gene g, we let Tg be the chosen cutoff, then the training samples were divided into two populations: one greater than or equal to Tg, the other less than. Now for each population, a recurrence percentage was computed, denoted as Pabove, Pbelow, respectively.

We let Pδ = |Pabove − Pbelow|, which represents the prediction power of gene g when its expression is used to stratify patients. In addition, if Pabove > Pbelow, then g is considered over-expressed with respect to a higher recurrence risk; otherwise it is considered under-expressed. Next, we set a significance level Tdiff = 5% (note that this value was only for demonstration purposes; it can be set to another value for a particular application), and the gene g was considered significant if Pδ ≥ Tdiff. With respect to g, samples were labeled as: (1) normal if Pδ < Tdiff; (2) up if Pδ ≥ Tdiff and Pabove > Pbelow; (3) down if Pδ ≥ Tdiff and Pabove < Pbelow. Both up and down are considered abnormal. In this way, all the member genes were labeled as 0 (normal) or 1 (abnormal, either up or down).
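
The labeling rule can be written compactly. The sketch below is in Python rather than the R used in the study; the function names recurrence_split and gene_status are hypothetical, and raising an error when a cutoff leaves one side empty is a choice made for this sketch. The three calls at the end use the Pabove and Pbelow values reported for CEP55 and ATP13A4 (Table 2) and ALK (Table 3).

```python
def recurrence_split(expr, recurred, cutoff):
    """Recurrence percentages (P_above, P_below) for samples with expression >= cutoff and < cutoff."""
    above = [r for x, r in zip(expr, recurred) if x >= cutoff]
    below = [r for x, r in zip(expr, recurred) if x < cutoff]
    if not above or not below:
        raise ValueError("cutoff leaves one side empty")
    return 100.0 * sum(above) / len(above), 100.0 * sum(below) / len(below)

def gene_status(p_above, p_below, t_diff=5.0):
    """Label a gene 'normal', 'up', or 'down' from its two recurrence percentages."""
    p_delta = abs(p_above - p_below)
    if p_delta < t_diff:
        return "normal"
    return "up" if p_above > p_below else "down"

cep55 = gene_status(49.03, 18.39)    # P_delta = 30.64
atp13a4 = gene_status(24.19, 46.15)  # P_delta = 21.96
alk = gene_status(35.71, 33.66)      # P_delta = 2.05, below T_diff = 5
```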

Cluster member voting and the GCEI

Next, we calculated the percentage of abnormal gene members for each sample to form a new feature vector. We plotted the ROC using the abnormal percentage to predict recurrence and denoted the chosen cutoff as Tp. We labeled the sample as 1 if the abnormal percentage is greater than or equal to Tp, or as 0 otherwise. This characteristic index is called the gene cluster expression index (GCEI). A GCEI of 1 represents an abnormal expression for the cluster, while a GCEI of 0 represents a normal expression. GCEI thus represents the abnormality of a gene cluster within which the percentage of abnormal member genes with GCEI = 1 is beyond Tp.
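
A minimal sketch of the voting step is given below, under two assumptions that the text does not spell out: a sample counts as abnormal for an "up" gene when its expression is at or above that gene's cutoff and for a "down" gene when it is below the cutoff, and only significant genes enter the denominator. The function names and the sample expression values are hypothetical; the cutoffs are the Tg values of three ALK-cluster genes in Table 2, and 55.56 is the ALK voting threshold Tp in Table 4.

```python
def abnormal_percentage(sample_expr, gene_rules):
    """Percentage of genes in gene_rules whose expression is on the higher-risk side.

    gene_rules maps gene -> (cutoff, direction), with direction "up" or "down".
    """
    votes = 0
    for gene, (cutoff, direction) in gene_rules.items():
        x = sample_expr[gene]
        if direction == "up" and x >= cutoff:
            votes += 1
        elif direction == "down" and x < cutoff:
            votes += 1
    return 100.0 * votes / len(gene_rules)

def cluster_index(sample_expr, gene_rules, t_p):
    """Return 1 if the abnormal percentage reaches the voting threshold t_p, else 0."""
    return 1 if abnormal_percentage(sample_expr, gene_rules) >= t_p else 0

rules = {"CEP55": (-0.0076, "up"), "ATP13A4": (-0.0197, "down"), "TUBB": (0.0253, "up")}
# Hypothetical expression values for two patients.
patient_a = {"CEP55": 0.5, "ATP13A4": -0.3, "TUBB": -0.1}  # 2 of 3 genes abnormal
patient_b = {"CEP55": -0.5, "ATP13A4": 0.3, "TUBB": -0.1}  # 0 of 3 genes abnormal
index_a = cluster_index(patient_a, rules, t_p=55.56)
index_b = cluster_index(patient_b, rules, t_p=55.56)
```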

Combined GCEI (cGCEI)

A combined GCEI was defined first by concatenating the single-cluster GCEIs into a binary string and second by counting the number of 1’s in the string. It thus represents a summary of the expression abnormality of the selected gene clusters. Here, for the targeted therapy genes, we fixed the ordered list of genes (ALK, BRAF, EGFR, MET, NTRK, RAS, RET, ROS1, TP53), and concatenated the corresponding GCEI of each cluster to obtain a binary string of 9 bits; for example, 000000000 represents that all 9 gene clusters were normally expressed, 100000000 represents that only the first (ALK) cluster was abnormally expressed and the remaining 8 were normal, 111111111 represents that all 9 clusters were abnormally expressed, and so on. The 9-bit GCEI classified lung cancers into 2^9 = 512 subtypes. For the immunotherapy gene pair (CTLA4, PDCD1), the GCEI was a two-digit string with four combinations: 00, 10, 01, 11, representing that none, CTLA4 only, PDCD1 only, or both the CTLA4 and PDCD1 clusters were abnormally expressed, respectively.

In practice, for the 9-bit GCEI string, since it would be difficult to accumulate enough patient cases for most of the 512 subtypes, we collapsed the 512 subtypes into only 10 super-subtypes by counting the number of digits that were 1 in the string, whereby patients were grouped into 10 subtypes with aggregated GCEIs of 0, 1, 2, 3, …, 9, respectively, denoted as cGCEI, with each cGCEI value representing how many gene clusters were abnormal among the nine clusters. To simplify it further, after analyzing the recurrence risk profiles of the 10 subtypes, we found that they could be further divided into two groups, denoted by the binary variable DGCntGT5, where the group of DGCntGT5 = 1 included all subtypes with cGCEI from 6 to 9, namely, all samples with at least 6 abnormal clusters; and DGCntGT5 = 0, which included all subtypes with cGCEI from 0 to 5, i.e. all samples with at most 5 abnormal clusters.
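
The string signature, the count cGCEI, and the binary variable DGCntGT5 can be sketched as follows. The cluster order is the one fixed in the text; the function names and the example patients are hypothetical.

```python
CLUSTERS = ("ALK", "BRAF", "EGFR", "MET", "NTRK", "RAS", "RET", "ROS1", "TP53")

def signature(gcei_by_cluster):
    """Concatenate the per-cluster GCEI values (0 or 1) in the fixed cluster order."""
    return "".join(str(gcei_by_cluster[c]) for c in CLUSTERS)

def cgcei(sig):
    """Number of abnormal clusters, i.e. the number of 1's in the signature."""
    return sig.count("1")

def dgcntgt5(sig):
    """1 if more than 5 clusters are abnormal (cGCEI from 6 to 9), else 0."""
    return 1 if cgcei(sig) > 5 else 0

# Hypothetical patient with only the ALK cluster abnormal.
only_alk = {c: 0 for c in CLUSTERS}
only_alk["ALK"] = 1
sig_a = signature(only_alk)  # "100000000"
sig_b = "111111000"          # six abnormal clusters
```

Counting the 1's discards which clusters are abnormal, which is what reduces the 2^9 = 512 possible signatures to 10 groups.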

Recurrence and survival concerning GCEI status

Recurrence and survival were assessed with respect to the subgroups stratified by a single GCEI, a combinatory cGCEI, or by DGCntGT5. Given a binary index, this classified the samples into two subgroups with an index of 1 or 0, respectively. Recurrence/Survival risk was defined as the percentage of recurred/dead patients within each subgroup.
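
As an illustration of this definition, the sketch below computes the risk within each subgroup from a binary index vector and a binary event vector (recurrence or death). The function name subgroup_risk and the toy vectors are hypothetical.

```python
def subgroup_risk(index, event):
    """Percentage of patients with event == 1 within the index == 0 and index == 1 subgroups."""
    risk = {}
    for group in (0, 1):
        members = [e for i, e in zip(index, event) if i == group]
        # An empty subgroup has no defined risk.
        risk[group] = 100.0 * sum(members) / len(members) if members else float("nan")
    return risk

# Hypothetical GCEI flags and recurrence labels for 8 patients.
gcei_flags = [0, 0, 0, 0, 1, 1, 1, 1]
recurred = [0, 0, 0, 1, 1, 1, 1, 0]
risk = subgroup_risk(gcei_flags, recurred)
```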

Data analysis and software

The data analysis and plots were mostly produced using R scripts in RStudio 2022.07.1 with R version 4.0.5 on the Mac platform with OS version darwin17.0. The ROC analysis was based on the functions prediction and performance in the R package ROCR, where performance is a convenient function that returns most of the evaluation results of a prediction model, such as FPR, TPR, and AUC. Quartiles were calculated with the R function quantile.

Results

Univariate models of the ALK cluster members

Among the 107 pre-selected members in the ALK cluster, 72 abnormal genes had Pδ ≥ 5%, accounting for 67% of the members, and the other 35 normal genes had Pδ < 5%. The corresponding AUCs, FPRs, TPRs, threshold Tg, and population risks of the abnormal and the normal genes are listed in Tables 2 and 3, respectively. As shown in Table 2, 33 genes were over-expressed for a higher recurrence risk: CEP55, TUBB, MDK, NPM1, CEBPZ, TFG, ATIC, LYPD1, LCLAT1, LPIN1, MYT1L, WNK3, TNIP2, C12ORF75, TPM4, TTC27, SOS1, ADAM17, TSPYL6, KLC1, PPFIBP1, SPECC1, FRS2, SHC1, FBN1, THADA, SQSTM1, CLIP1, CBL, CLTC, FBXO36, FUT8 and ITGAV; while 39 were under-expressed for a higher recurrence risk: ATP13A4, LMO7, WDR37, EPAS1, GCC2, CRIM1, PLEKHH2, TRIM66, FBXO11, SMPD1, YAP1, MPRIP, TANC1, SEC31A, PRKAR1A, CYBRD1, SPTBN1, ALKAL2, WDPCP, SLMAP, CLIP4, SLC16A7, SWAP70, LIMD1, BIRC6, SOCS5, PLEKHA7, EIF2AK3, PPM1B, KIF5B, PHACTR1, CAMKMT, RBM20, SRD5A2, NYAP2, PTN, PICALM, VKORC1L1 and HIP1. Note that for the under-expressed genes, an inverted expression vector, namely 0-expression, should be used as a predictor to plot the ROC correctly.

Table 2

AUCs and recurrence risks of 72 abnormal ALK genes with Pδ ≥ 5%

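GENE | AUC | FPR | TPR | Tg | Pabove (%) | Pbelow (%) | Pδ (%) | Status
CEP55 | 0.68 | 0.42 | 0.76 | −0.0076 | 49.03 | 18.39 | 30.64 | up
ATP13A4 | 0.66 | 0.43 | 0.7 | −0.0197 | 24.19 | 46.15 | 21.96 | down
TUBB | 0.64 | 0.39 | 0.62 | 0.0253 | 46.02 | 25 | 21.02 | up
MDK | 0.64 | 0.42 | 0.64 | 0.0403 | 44.77 | 25.1 | 19.67 | up
LMO7 | 0.6 | 0.39 | 0.58 | 0.0509 | 22.99 | 42.37 | 19.38 | down
NPM1 | 0.61 | 0.46 | 0.68 | −0.0053 | 43.63 | 24.66 | 18.97 | up
WDR37 | 0.63 | 0.43 | 0.63 | 0.0118 | 24.54 | 43.23 | 18.69 | down
EPAS1 | 0.67 | 0.31 | 0.58 | 0.0882 | 23.12 | 41.42 | 18.3 | down
GCC2 | 0.62 | 0.32 | 0.57 | 0.0458 | 24.87 | 41.52 | 16.65 | down
CRIM1 | 0.63 | 0.39 | 0.62 | 0.0303 | 25.25 | 41.79 | 16.54 | down
PLEKHH2 | 0.62 | 0.37 | 0.57 | 0.079 | 24.73 | 41 | 16.27 | down
CEBPZ | 0.59 | 0.39 | 0.57 | 0.0299 | 43.58 | 27.65 | 15.93 | up
TFG | 0.58 | 0.35 | 0.51 | 0.0718 | 43.88 | 28.67 | 15.21 | up
ATIC | 0.61 | 0.41 | 0.58 | 0.0254 | 42.92 | 27.73 | 15.19 | up
TRIM66 | 0.59 | 0.41 | 0.58 | 0.005 | 27.04 | 42.17 | 15.13 | down
LYPD1 | 0.6 | 0.45 | 0.61 | 0.011 | 42.15 | 27.5 | 14.65 | up
FBXO11 | 0.62 | 0.34 | 0.56 | 0.041 | 25.57 | 40.2 | 14.63 | down
LCLAT1 | 0.59 | 0.5 | 0.66 | −0.013 | 41.2 | 26.98 | 14.22 | up
SMPD1 | 0.57 | 0.39 | 0.54 | 0.0281 | 26.87 | 40.57 | 13.7 | down
LPIN1 | 0.56 | 0.36 | 0.5 | 0.053 | 42.86 | 29.37 | 13.49 | up
MYT1L | 0.57 | 0.47 | 0.62 | −0.0011 | 41.27 | 27.83 | 13.44 | up
YAP1 | 0.59 | 0.37 | 0.55 | 0.0335 | 27.52 | 40.91 | 13.39 | down
MPRIP | 0.59 | 0.46 | 0.63 | 0.0038 | 27.65 | 40.75 | 13.1 | down
WNK3 | 0.54 | 0.35 | 0.48 | 0.0757 | 42.55 | 29.93 | 12.62 | up
TANC1 | 0.62 | 0.32 | 0.57 | 0.0631 | 26.74 | 39.35 | 12.61 | down
SEC31A | 0.6 | 0.42 | 0.57 | −0.0025 | 29.17 | 41.74 | 12.57 | down
PRKAR1A | 0.59 | 0.44 | 0.6 | 0.0074 | 27.98 | 40.53 | 12.55 | down
TNIP2 | 0.56 | 0.46 | 0.6 | 7.00E-04 | 40.98 | 28.57 | 12.41 | up
C12ORF75 | 0.58 | 0.48 | 0.62 | −0.027 | 40.62 | 28.32 | 12.3 | up
TPM4 | 0.56 | 0.5 | 0.63 | −0.0131 | 40.46 | 28.18 | 12.28 | up
TTC27 | 0.57 | 0.41 | 0.54 | 0.022 | 41.55 | 29.28 | 12.27 | up
CYBRD1 | 0.59 | 0.48 | 0.61 | 0 | 28.51 | 40.55 | 12.04 | down
SPTBN1 | 0.57 | 0.42 | 0.55 | 0.0219 | 28.14 | 39.58 | 11.44 | down
ALKAL2 | 0.6 | 0.35 | 0.52 | 0.1104 | 28.04 | 39.25 | 11.21 | down
SOS1 | 0.55 | 0.46 | 0.58 | 0.0069 | 40.42 | 29.34 | 11.08 | up
ADAM17 | 0.57 | 0.46 | 0.58 | −0.004 | 40.08 | 29.58 | 10.5 | up
TSPYL6 | 0.56 | 0.48 | 0.6 | −0.0022 | 39.84 | 29.44 | 10.4 | up
KLC1 | 0.52 | 0.32 | 0.42 | 0.0302 | 41.52 | 31.19 | 10.33 | up
PPFIBP1 | 0.55 | 0.46 | 0.57 | −0.0145 | 39.92 | 29.92 | 10 | up
SPECC1 | 0.57 | 0.47 | 0.58 | −0.0119 | 39.75 | 29.83 | 9.92 | up
WDPCP | 0.56 | 0.44 | 0.55 | 0.0163 | 29.49 | 39.25 | 9.76 | down
SLMAP | 0.58 | 0.39 | 0.53 | 0.0339 | 29.02 | 38.75 | 9.73 | down
CLIP4 | 0.58 | 0.34 | 0.5 | 0.0655 | 28.96 | 38.46 | 9.5 | down
SLC16A7 | 0.58 | 0.44 | 0.54 | 0.0041 | 30.24 | 39.74 | 9.5 | down
SWAP70 | 0.56 | 0.48 | 0.59 | 0.0024 | 29.73 | 39.23 | 9.5 | down
LIMD1 | 0.57 | 0.49 | 0.58 | 0.0054 | 29.82 | 39.02 | 9.2 | down
FRS2 | 0.52 | 0.44 | 0.54 | 0.0062 | 39.47 | 30.71 | 8.76 | up
BIRC6 | 0.55 | 0.38 | 0.52 | 0.0159 | 30.26 | 38.98 | 8.72 | down
SHC1 | 0.52 | 0.34 | 0.43 | 0.0616 | 40.22 | 31.68 | 8.54 | up
FBN1 | 0.53 | 0.46 | 0.55 | −0.0041 | 39.15 | 30.77 | 8.38 | up
SOCS5 | 0.56 | 0.36 | 0.47 | 0.0388 | 29.73 | 38.05 | 8.32 | down
PLEKHA7 | 0.56 | 0.53 | 0.64 | −0.0759 | 31.77 | 39.89 | 8.12 | down
EIF2AK3 | 0.53 | 0.45 | 0.54 | 0 | 31.08 | 38.96 | 7.88 | down
THADA | 0.53 | 0.44 | 0.52 | 0.0089 | 38.94 | 31.25 | 7.69 | up
SQSTM1 | 0.51 | 0.45 | 0.53 | 0.0163 | 38.7 | 31.35 | 7.35 | up
PPM1B | 0.54 | 0.42 | 0.54 | 0.0205 | 30.88 | 38.11 | 7.23 | down
KIF5B | 0.53 | 0.44 | 0.55 | 0.0113 | 31.05 | 38.02 | 6.97 | down
PHACTR1 | 0.57 | 0.41 | 0.52 | 0.0754 | 30.69 | 37.54 | 6.85 | down
CLIP1 | 0.51 | 0.4 | 0.46 | 0.0297 | 38.89 | 32.04 | 6.85 | up
CAMKMT | 0.54 | 0.43 | 0.53 | 0.0096 | 31.25 | 37.98 | 6.73 | down
RBM20 | 0.52 | 0.47 | 0.56 | 0.0093 | 31.31 | 37.69 | 6.38 | down
CBL | 0.54 | 0.39 | 0.45 | 0.019 | 38.58 | 32.28 | 6.3 | up
SRD5A2 | 0.56 | 0.52 | 0.64 | −0.0121 | 32.09 | 38.32 | 6.23 | down
NYAP2 | 0.54 | 0.54 | 0.67 | −0.0186 | 32.26 | 38.42 | 6.16 | down
CLTC | 0.51 | 0.39 | 0.45 | 0.0321 | 38.38 | 32.39 | 5.99 | up
FBXO36 | 0.52 | 0.51 | 0.58 | −0.0086 | 37.6 | 31.7 | 5.9 | up
PTN | 0.54 | 0.44 | 0.56 | 0.0341 | 31.63 | 37.45 | 5.82 | down
PICALM | 0.55 | 0.39 | 0.46 | 0.0318 | 31.35 | 37.04 | 5.69 | down
FUT8 | 0.53 | 0.47 | 0.53 | 0.0417 | 37.55 | 32.24 | 5.31 | up
VKORC1L1 | 0.53 | 0.47 | 0.54 | 0.0027 | 32.23 | 37.5 | 5.27 | down
HIP1 | 0.54 | 0.44 | 0.54 | 0.0348 | 31.86 | 37.05 | 5.19 | down
ITGAV | 0.51 | 0.51 | 0.57 | 9.00E-04 | 37.25 | 32.16 | 5.09 | up

Table 3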

AUCs and recurrence risks of 35 normal ALK genes with Pδ < 5%

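GENE | AUC | FPR | TPR | Tg | Pabove (%) | Pbelow (%) | Pδ (%) | Status
TOGARAM2 | 0.56 | 0.51 | 0.62 | −0.0095 | 32.37 | 37.34 | 4.97 | normal
BCL11A | 0.52 | 0.4 | 0.51 | 0.0391 | 32.37 | 37.34 | 4.97 | normal
ATAD2B | 0.51 | 0.36 | 0.41 | 0.0621 | 37.91 | 33 | 4.91 | normal
MSN | 0.55 | 0.39 | 0.51 | 0.0511 | 31.74 | 36.51 | 4.77 | normal
PRKCB | 0.55 | 0.38 | 0.49 | 0.0701 | 31.98 | 36.45 | 4.47 | normal
AKAP8L | 0.51 | 0.49 | 0.56 | −0.0046 | 32.8 | 37.07 | 4.27 | normal
CUX1 | 0.54 | 0.4 | 0.49 | 0.0396 | 32.28 | 36.52 | 4.24 | normal
NCOA1 | 0.52 | 0.42 | 0.51 | 0.0344 | 32.26 | 36.49 | 4.23 | normal
PLEKHM2 | 0.5 | 0.48 | 0.52 | −0.0047 | 36.97 | 32.79 | 4.18 | normal
SORCS1 | 0.51 | 0.54 | 0.59 | −0.0075 | 33.08 | 36.99 | 3.91 | normal
SMPDL3B | 0.53 | 0.5 | 0.6 | −0.0554 | 33.33 | 37.17 | 3.84 | normal
CMTR1 | 0.51 | 0.47 | 0.51 | 0.0067 | 32.89 | 36.58 | 3.69 | normal
MAPK1 | 0.52 | 0.49 | 0.52 | −0.005 | 36.67 | 33.06 | 3.61 | normal
TCF12 | 0.54 | 0.45 | 0.49 | 0.0036 | 36.77 | 33.2 | 3.57 | normal
SMPDL3A | 0.52 | 0.51 | 0.54 | −0.0275 | 36.43 | 32.86 | 3.57 | normal
MTA3 | 0.5 | 0.46 | 0.5 | 0.0021 | 36.68 | 33.2 | 3.48 | normal
SMPD2 | 0.51 | 0.38 | 0.39 | 0.0535 | 36.63 | 33.87 | 2.76 | normal
MAPK3 | 0.54 | 0.42 | 0.48 | 0.0142 | 33.33 | 36 | 2.67 | normal
DCTN1 | 0.5 | 0.43 | 0.46 | 0.0209 | 36.32 | 33.7 | 2.62 | normal
DCHS1 | 0.53 | 0.41 | 0.49 | 0.0531 | 36.47 | 33.97 | 2.5 | normal
SMPD3 | 0.52 | 0.46 | 0.48 | 0.0141 | 36.16 | 33.72 | 2.44 | normal
SRBD1 | 0.52 | 0.49 | 0.53 | 0.0003 | 33.75 | 35.95 | 2.2 | normal
TPR | 0.51 | 0.46 | 0.52 | 0.0063 | 33.76 | 35.89 | 2.13 | normal
ALK | 0.52 | 0.54 | 0.61 | −0.0176 | 35.71 | 33.66 | 2.05 | normal
TACR1 | 0.52 | 0.55 | 0.6 | −0.0096 | 33.96 | 35.98 | 2.02 | normal
VIT | 0.53 | 0.42 | 0.52 | 0.015 | 33.67 | 35.66 | 1.99 | normal
DYSF | 0.51 | 0.51 | 0.53 | −0.0257 | 35.74 | 33.91 | 1.83 | normal
IRS1 | 0.52 | 0.48 | 0.52 | 0.0129 | 33.91 | 35.71 | 1.8 | normal
EML4 | 0.51 | 0.41 | 0.45 | 0.0211 | 33.93 | 35.66 | 1.73 | normal
CDK15 | 0.52 | 0.45 | 0.51 | 0.0047 | 33.94 | 35.61 | 1.67 | normal
ERC1 | 0.51 | 0.41 | 0.43 | 0.0162 | 35.82 | 34.16 | 1.66 | normal
EML6 | 0.51 | 0.54 | 0.6 | −0.0678 | 34.43 | 35.59 | 1.16 | normal
STRN | 0.51 | 0.49 | 0.48 | 0.0012 | 34.32 | 35.37 | 1.05 | normal
RANBP2 | 0.54 | 0.31 | 0.44 | 0.051 | 34.25 | 35.22 | 0.97 | normal
C9ORF3 | 0.51 | 0.43 | 0.48 | 0.0195 | 35.35 | 34.46 | 0.89 | normal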

For demonstration purposes, the ROC curves of the top 12 genes in decreasing order of Pδ are shown in Figure 1. The gene with the highest Pδ, in the first row of Table 2, is CEP55. Here, for the chosen cutoff Tg = −0.0076, patients with a CEP55 expression ≥ −0.0076 have a recurrence risk of Pabove = 49.03%, while patients with a CEP55 expression < −0.0076 have a recurrence risk of Pbelow = 18.39%, and hence the difference between the two is Pδ = 30.64%. CEP55 is considered over-expressed because Pabove > Pbelow. CEP55, called centrosomal protein 55, is related to DNA damage and cytoskeletal signaling and plays a role in mitotic exit and cytokinesis. CEP55 was found to be a fusion partner of ALK and a high CEP55 expression was reported to be associated with a poor prognosis.14,15 The second gene we consider is ATP13A4, which was under-expressed with Pabove = 24.19% and Pbelow = 46.15%, for a difference of Pδ = 21.96%. ATP13A4, called ATPase 13A4, may enable ATPase-coupled cation transmembrane transporter activity and may be involved in cellular calcium ion homeostasis. In one lung cancer case study,16 a 53-year-old metastatic Stage IV patient harboring ATP13A4-ALK and two other ALK fusions, COX7A2L-ALK and LINC01210-ALK, underwent first-line crizotinib therapy, which gave 12 months of progression-free survival/partial remission (PFS/PR); a new SLCO2A1-ALK fusion then led to resistance. Afterward, second-line ceritinib therapy was applied and resulted in a further 8 months of PFS, and the NGS results demonstrated the loss of ATP13A4-ALK and SLCO2A1-ALK. Interestingly, the expression of ALK itself was normal and only showed a difference of Pδ = 2.05% (Table 3).

Note that the results for the remaining 10 gene clusters in this study are presented in the Supplementary File 1.

Cluster member voting models

Next, for the training sample, we calculated the percentage of abnormal members for each cluster. Again, we plotted the ROC but with an abnormal percentage as a new recurrence predictor. The ROC curves are presented in Figure 2. For each ROC curve, the horizontal and vertical dashed lines mark the point on the curve that is the closest to the top-left corner (0, 1), and the corresponding FPR and TPR are shown near each dashed line. The AUC is also shown. Taking ALK as an example, the closest point to the top-left corner is (0.32, 0.73), indicating that the specificity (1-FPR) was 68% and the sensitivity (TPR) was 73%, and the AUC was 0.763. The corresponding cutoff was set as the voting threshold for the ALK cluster. Table 4 lists the corresponding AUCs, FPRs, TPRs, threshold Tp, Pabove, and Pbelow. In summary, across the 11 studied clusters, the recurrence risk of the abnormal group (of all pathological stages) ranged from 174% (PDCD1) to 320% (ALK) of the corresponding normal group.

Fig. 2  Univariate ROCs of 11 clusters.

The percentage of the abnormal members in each cluster was used as a recurrence predictor. For each ROC curve, the horizontal and vertical dashed lines mark the point on the curve that is the closest to the top-left corner (0, 1), and the corresponding FPR and TPR are shown near each dashed line. The AUC is also shown. Taking ALK as an example, the closest point to the top-left corner is (0.32, 0.73), indicating that the specificity (1-FPR) is 68% and sensitivity (TPR) is 73%, and the AUC is 0.763. The corresponding cutoff is set as the voting threshold for the ALK cluster. ALK, anaplastic lymphoma kinase; AUC, area under the curve; FPR, false positive rate; ROC, receiver operating characteristic; TPR, true positive rate.

Table 4

AUC, TPR, FPR, threshold Tp and recurrence risks for 11 clusters

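SEED | AUC | Tp (%) | Pabove (%) | Pbelow (%) | Pδ (%) | FPR | TPR | ACC | PPV
ALK | 0.763 | 55.56 | 55.41 | 17.31 | 38.1 | 0.32 | 0.73 | 0.7 | 0.55
BRAF | 0.681 | 57.89 | 48.62 | 23.48 | 25.14 | 0.36 | 0.63 | 0.64 | 0.49
EGFR | 0.671 | 58.33 | 46.58 | 25.1 | 21.48 | 0.37 | 0.61 | 0.62 | 0.47
MET | 0.656 | 57.14 | 43.75 | 24.78 | 18.97 | 0.46 | 0.67 | 0.59 | 0.44
NTRK | 0.715 | 51.35 | 52.19 | 19.29 | 32.9 | 0.35 | 0.71 | 0.67 | 0.52
RAS | 0.685 | 60 | 50.52 | 24.31 | 26.21 | 0.31 | 0.58 | 0.66 | 0.51
RET | 0.734 | 55.32 | 50.21 | 19.25 | 30.96 | 0.39 | 0.73 | 0.65 | 0.5
ROS1 | 0.682 | 52.38 | 48.44 | 22.96 | 25.48 | 0.37 | 0.65 | 0.64 | 0.48
TP53 | 0.682 | 55.56 | 50.49 | 23.19 | 27.3 | 0.32 | 0.62 | 0.66 | 0.5
CTLA4 | 0.68 | 60 | 48.96 | 25.52 | 23.44 | 0.31 | 0.56 | 0.64 | 0.49
PDCD1 | 0.656 | 62.5 | 47.98 | 27.51 | 20.47 | 0.29 | 0.49 | 0.64 | 0.48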

Recurrence and survival analysis

In the above, lung cancers were labeled as normal (GCEI = 0) or abnormal (GCEI = 1) using a given cluster GCEI or a combination of atomic GCEIs. Next, the recurrence risks were assessed for the subpopulations defined by individual GCEI status and combinations of GCEIs. For a given atomic or combinatory GCEI, the recurrence risk, defined as the percentage of recurred patients, was calculated based on the GCEI status of patients in different pathological stage groups: Stage I, Stages II–IV, and all stages. Table 5 lists the recurrence risks for the subpopulations labeled by the atomic GCEI indicators and the DGCntGT5 indicator. It can be seen that the ALK cluster gave the largest risk ratio of the lung cancer group with GCEI = 1 over the group with GCEI = 0 in all 3 stage groups, with 320%, 332%, and 188% for all stages, Stage I, and Stages II–IV, respectively. As for the minimal ratio, PDCD1 gave 174% for all stages, MET 169% for Stage I, and EGFR 109% for Stages II–IV. On average, the risk ratios of the group with GCEI = 1 over GCEI = 0 were 222%, 247%, and 134% for all stages, Stage I, and Stages II–IV, respectively, indicating that on average the recurrence risk of patients with an abnormally expressed cluster was more than double that of the normal counterpart for all stages or Stage I, while even for the late Stages II–IV, the risk was still increased by 34%. This demonstrates the power of recurrence risk stratification with the GCEI.

Table 5

Recurrence percentages of lung cancers in different stage groups flagged by the GCEI

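Subpopulation | All (Stages I–IV), GCEI = 0 (%) | All (Stages I–IV), GCEI = 1 (%) | Stage I, GCEI = 0 (%) | Stage I, GCEI = 1 (%) | Stages II–IV, GCEI = 0 (%) | Stages II–IV, GCEI = 1 (%)
ALK | 17.31 | 55.41 | 12.56 | 41.75 | 35.85 | 67.23
BRAF | 23.48 | 48.62 | 15.38 | 36.27 | 53.57 | 59.48
EGFR | 25.1 | 46.58 | 16.67 | 33.02 | 54.24 | 59.29
MET | 24.78 | 43.75 | 17.14 | 28.89 | 50.98 | 60.33
NTRK | 19.29 | 52.19 | 12.87 | 39.81 | 44.23 | 63.33
RAS | 24.31 | 50.52 | 16.36 | 35.42 | 47.3 | 65.31
RET | 19.25 | 50.21 | 14.14 | 35.29 | 39.58 | 64.52
ROS1 | 22.96 | 48.44 | 16.92 | 32.11 | 44.64 | 63.79
TP53 | 23.19 | 50.49 | 14.56 | 37.5 | 48.57 | 63.73
CTLA4 | 25.52 | 48.96 | 13.79 | 38.32 | 52.87 | 62.35
PDCD1 | 27.51 | 47.98 | 16.13 | 36.56 | 54.35 | 61.25
DGCntGT5 | 18.84 | 59.47 | 13.3 | 49.35 | 40.68 | 66.37
Average | 22.63 | 50.22 | 14.98 | 37.02 | 47.24 | 63.08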
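
The risk ratios quoted in the text follow directly from Table 5. As a worked check, the sketch below recomputes a few of them from the tabulated percentages; the dictionary and function names are hypothetical, and each pair is (risk for GCEI = 0, risk for GCEI = 1).

```python
# Selected entries of Table 5: (GCEI = 0 risk %, GCEI = 1 risk %).
ALL_STAGES = {"ALK": (17.31, 55.41), "PDCD1": (27.51, 47.98), "Average": (22.63, 50.22)}
STAGE_I = {"ALK": (12.56, 41.75), "MET": (17.14, 28.89), "Average": (14.98, 37.02)}
STAGES_II_IV = {"ALK": (35.85, 67.23), "EGFR": (54.24, 59.29), "Average": (47.24, 63.08)}

def risk_ratio_percent(risk_normal, risk_abnormal):
    """Risk of the GCEI = 1 group expressed as a percentage of the GCEI = 0 risk."""
    return 100.0 * risk_abnormal / risk_normal

alk_all = round(risk_ratio_percent(*ALL_STAGES["ALK"]))          # 320
alk_stage_i = round(risk_ratio_percent(*STAGE_I["ALK"]))         # 332
egfr_late = round(risk_ratio_percent(*STAGES_II_IV["EGFR"]))     # 109
avg_late = round(risk_ratio_percent(*STAGES_II_IV["Average"]))   # 134
```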

In addition, the recurrence risks of the 10 subgroups with cGCEI = 0, 1, 2, …, 9, as defined by counting the number of 1’s in the binary string of the ordered list (ALK, BRAF, EGFR, MET, NTRK, RAS, RET, ROS1, TP53), are listed in Table 6. The risk generally increased with the cGCEI value, indicating that the more abnormal clusters there were, the higher the risk. For cGCEI = 0, where all the clusters were normally expressed, the recurrence risk was merely 7.02%; with exactly one abnormal cluster (cGCEI = 1), the risk more than doubled to 15.28%; and it then increased to 20.41% for cGCEI = 2. The trend was not strictly monotonic: the risk dipped to 17.50% for cGCEI = 3, which might be due to the limited data size, and dipped again from 56.36% for cGCEI = 6 to 52.00% for cGCEI = 7. From cGCEI = 6 onward, the risk stayed above 50%, reaching 72.73% for the group of patients with cGCEI = 9, where all 9 clusters showed abnormal expression. The jump between cGCEI = 5 (34.38%) and cGCEI = 6 (56.36%) is the rationale for defining the binary index DGCntGT5, which collapses the 10 subtypes into only two.

Table 6

Number of non-recurred and recurred cases and the recurrence risks of cGCEI derived from 9-digit string signatures (only evaluated for all stages)

cGCEI | Exemplary signatures | Non-recurred | Recurred | Total | Recurrence (%)
0 | 000000000 | 53 | 4 | 57 | 7.02
1 | 100000000, 000000001 | 61 | 11 | 72 | 15.28
2 | 110000000, 000000011 | 39 | 10 | 49 | 20.41
3 | 111000000, 000000111 | 33 | 7 | 40 | 17.5
4 | 111100000, 000001111 | 30 | 12 | 42 | 28.57
5 | 111110000, 000011111 | 21 | 11 | 32 | 34.38
6 | 111111000, 000111111 | 24 | 31 | 55 | 56.36
7 | 111111100, 001111111 | 24 | 26 | 50 | 52
8 | 111111110, 011111111 | 20 | 32 | 52 | 61.54
9 | 111111111 | 9 | 24 | 33 | 72.73
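
As a worked check of how the 10 cGCEI groups collapse into the two DGCntGT5 groups, the sketch below pools the case counts of Table 6. The pooled risks reproduce the all-stage DGCntGT5 row of Table 5 (18.84% and 59.47%). The variable and function names are hypothetical.

```python
# Table 6 counts: cGCEI -> (non-recurred, recurred).
TABLE6 = {
    0: (53, 4), 1: (61, 11), 2: (39, 10), 3: (33, 7), 4: (30, 12),
    5: (21, 11), 6: (24, 31), 7: (24, 26), 8: (20, 32), 9: (9, 24),
}

def pooled_risk(groups):
    """Recurrence percentage after pooling the listed cGCEI groups."""
    recurred = sum(TABLE6[g][1] for g in groups)
    total = sum(TABLE6[g][0] + TABLE6[g][1] for g in groups)
    return 100.0 * recurred / total

risk_low = pooled_risk(range(0, 6))    # DGCntGT5 = 0: cGCEI from 0 to 5
risk_high = pooled_risk(range(6, 10))  # DGCntGT5 = 1: cGCEI from 6 to 9
total_patients = sum(n + r for n, r in TABLE6.values())
```

The pooled groups contain 292 and 190 patients, respectively, which together account for all 482 training patients.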

Moreover, population survival analysis was applied to the subgroups of GCEI = 0 or 1 for Stage I, Stages II–IV, and all stages. Figure 3 shows the percentage of deaths in each subgroup within the different stages or all stages. In summary, for Stage I, the median increase was 6.74% with a maximum of 11.14% (TP53, Stage I); for Stages II–IV, the median increase was 9.60% with a maximum of 16.18% (RAS, Stages II–IV); for all stages, the median increase was 9.85% with a maximum of 14.46% (DGCntGT5, all stages). However, an exception was noted for MET (Stage I, blue), where the percentages of death for GCEI = 0 and GCEI = 1 were 26.75% and 25.28%, respectively, so that both subgroups had similar death risks. In conclusion, the population survival results indicated a modest difference in survival risk based on the GCEI status as defined by recurrence. The same procedure could be applied by targeting overall survival (OS) instead of recurrence. Since the OS and RS are correlated but not the same, optimizing one of them may only guarantee a suboptimal risk profile for the other.

Fig. 3  Percentage of deaths of the lung cancer subgroups (GCEI = 0, 1) within different stages or all stages.

For each gene cluster expression index or combinatory DCGCntGT5, 3 vertical pairs are plotted with different colors (All stages: black, Stage I: blue, Stages II–IV: red). Each pair consists of GCEI = 0 (circle) and GCEI = 1 (*). The vertical gap from the circle to * shows the increased percentage of GCEI = 1 compared to GCEI = 0. In summary, for Stage I, the median increase is 6.74% with a maximum of 11.14% (TP53, blue); for Stages II–IV, the median increase is 9.60% with a maximum of 16.18% (RAS, red); for all stages, the median increase is 9.85% with a maximum of 14.46% (DCGCntGT5, black). GCEI, gene cluster expression index.

Validation

The validation set comprised 703 patients combined from the GSE37745, GSE41271, GSE50081, and GSE74777 data sets. It contained 272 recurrences (39%, vs. 35% in the training set), the average patient age was 66 years (vs. 61 in the training set), and there were 278 females (40%, vs. 31% in the training set) and 397 Stage I patients (49%, vs. 64% in the training set). Table 7 shows the recurrence risks of GCEI = 0 vs. GCEI = 1, where the GCEI was determined using the thresholds from the training phase. The average relative risk increase was 11.12%, and the maximum was 35.5% (RAS). This is a modest validation result compared with the training risk profiles (Table 5). Note that CTLA4 showed a risk reversal, while ALK and NTRK showed barely different risks between the two groups. These modest results might be due to two main reasons. First, the data came from different microarray platforms: both gene expression omnibus data sets in the training set came from the Affymetrix Human Genome U133 Plus 2.0 Array, whereas in the validation set, although GSE37745 and GSE50081 were from the same chip, GSE41271 was profiled on the Illumina HumanWG-6 v3.0 expression beadchip and GSE74777 on the Affymetrix Human Transcriptome (HT) Array 2.0. Second, the patient profiles differed, as stated above.

Table 7

Recurrence risks of the normal group (GCEI = 0) vs. abnormal group (GCEI = 1) computed from the validation set using the same thresholds from the training phase

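| Cluster | % Recurrence (GCEI = 0) | % Recurrence (GCEI = 1) | % Increase of GCEI = 1 from 0 |
|---|---|---|---|
| ALK | 37.73 | 38.81 | 2.86 |
| BRAF | 36.36 | 41.13 | 13.12 |
| CTLA4 | 39.51 | 35.92 | −9.08 |
| EGFR | 36.36 | 40.00 | 10.01 |
| MET | 36.33 | 39.83 | 9.63 |
| NTRK | 38.01 | 38.38 | 0.97 |
| PDCD1 | 37.06 | 40.27 | 8.66 |
| RAS | 34.37 | 46.57 | 35.50 |
| RET | 35.38 | 42.74 | 20.80 |
| ROS1 | 36.39 | 40.49 | 11.27 |
| TP53 | 35.42 | 42.07 | 18.77 |
| DGCntGT5 | 36.99 | 41.05 | 10.98 |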
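The last column of Table 7 is consistent with a relative increase, 100 × (r1 − r0) / r0, where r0 and r1 are the recurrence percentages for GCEI = 0 and GCEI = 1. The following Python sketch recomputes it from the two rounded percentage columns; the variable and function names are illustrative, and because the inputs are already rounded, a recomputed value can differ from the printed one in the last digit.

```python
# Recurrence percentages transcribed from Table 7: cluster -> (GCEI=0, GCEI=1).
TABLE7 = {
    "ALK": (37.73, 38.81), "BRAF": (36.36, 41.13), "CTLA4": (39.51, 35.92),
    "EGFR": (36.36, 40.00), "MET": (36.33, 39.83), "NTRK": (38.01, 38.38),
    "PDCD1": (37.06, 40.27), "RAS": (34.37, 46.57), "RET": (35.38, 42.74),
    "ROS1": (36.39, 40.49), "TP53": (35.42, 42.07), "DGCntGT5": (36.99, 41.05),
}


def relative_increase(r0: float, r1: float) -> float:
    """Relative increase in percent of the GCEI = 1 risk over the GCEI = 0 risk."""
    return 100.0 * (r1 - r0) / r0


increases = {c: relative_increase(r0, r1) for c, (r0, r1) in TABLE7.items()}

mean_increase = sum(increases.values()) / len(increases)
max_cluster = max(increases, key=increases.get)
reversed_clusters = [c for c, v in increases.items() if v < 0]

print(f"mean increase: {mean_increase:.2f}%")
print(f"largest increase: {max_cluster} ({increases[max_cluster]:.2f}%)")
print(f"risk reversal: {reversed_clusters}")
```

Recomputed this way, the mean agrees with the reported 11.12%, RAS gives the largest increase, and CTLA4 is the only cluster with a negative value.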

Comparison with conventional methods

So far, we have demonstrated that classification using the GCEI can stratify lung cancers into groups with dramatically different recurrence risk profiles. Next, we compared the GCEI with conventional clinical characteristics, namely stage, the N (node) and T components of TNM staging, and the prognosis of recurrence and survival, via correlation analysis. The average correlation coefficients of the GCEIs across these five clinical variables were: DGCntGT5 (0.39), ALK (0.36), NTRK (0.34), BRAF (0.32), RAS (0.32), RET (0.31), EGFR (0.27), ROS1 (0.27), MET (0.23), TP53 (0.22), PDCD1 (0.12), CTLA4 (0.09).

Conversely, the average correlation coefficients of the clinical variables across the 12 GCEIs were: recurrence (0.25), survival (0.23), stage (0.30), N (0.28), T (0.30), indicating that the GCEIs were modestly correlated with clinically important pathological variables and prognosis. The advantages of the GCEI are that the method is based on molecular profiling and has the potential to guide targeted therapy and immunotherapy for lung cancers.
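The two sets of averages above summarize the same 12 × 5 grid of correlation coefficients, once by GCEI and once by clinical variable, so their overall means should agree up to rounding. A small Python check of that consistency, using only the reported averages (the dictionary names are illustrative), might look like this:

```python
from statistics import mean

# Average correlation of each GCEI across the five clinical variables.
per_gcei = {
    "DGCntGT5": 0.39, "ALK": 0.36, "NTRK": 0.34, "BRAF": 0.32,
    "RAS": 0.32, "RET": 0.31, "EGFR": 0.27, "ROS1": 0.27,
    "MET": 0.23, "TP53": 0.22, "PDCD1": 0.12, "CTLA4": 0.09,
}

# Average correlation of each clinical variable across the 12 GCEIs.
per_variable = {
    "recurrence": 0.25, "survival": 0.23, "stage": 0.30, "N": 0.28, "T": 0.30,
}

# Both are means of means over the same grid, so they estimate one grand mean.
grand_from_gcei = mean(per_gcei.values())
grand_from_variable = mean(per_variable.values())

print(f"grand mean via GCEIs:     {grand_from_gcei:.3f}")
print(f"grand mean via variables: {grand_from_variable:.3f}")
```

Both routes give an overall average correlation of about 0.27, in line with the description of the correlations as modest.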

Discussion

The goal of this research was not to predict recurrence risk but to provide a novel approach to classifying lung cancers based on gene cluster expression profiles. The original intention was to complement the current personalized approach, which relies on DNA-based classifications. Recurrence risk was used here as a convenient guiding prognostic objective to derive the GCEIs, but the method can be applied to other objectives too, such as prognosis of survival, treatment response, and clinically important pathological variables such as stage, metastatic node count, or distant metastasis. As far as personalized medicine in lung cancer is concerned, although DNA-based tests have been used successfully to guide targeted therapy and immunotherapy, the proportion of patients whose tumors can be targeted therapeutically is limited, usually less than 30%. A retrospective study of 2257 metastatic NSCLC patients showed that more than half of the tested patients did not have their results before first-line treatment, and fewer than 20% of tested patients had results for all 4 driver mutations (ALK, EGFR, ROS1, BRAF) and PD-L1 before first-line treatment. Moreover, although the turnaround time improved from 2017 to 2019, not all patients who tested positive for driver mutations received targeted therapy in the first-line setting.17 This points to an unmet need among the large proportion of lung cancer patients who do not qualify for personalized medicines under the current paradigm. We can imagine that an RNA expression network (a cluster) centered around an important gene is disturbed not just by a particular DNA mutation, which might be only one thread in the whole picture, but by many other factors. The abnormality of an RNA expression network is then gauged by the percentage of abnormally expressed nodes (cluster members). Only after the percentage of abnormal nodes exceeds a threshold is the collapse of the whole network triggered.
The GCEI was introduced here to label whether an RNA expression network looks normal or abnormal with respect to the guiding objective, such as recurrence in the current study. When an expression network centered around an important gene for which drugs are available looks abnormal, the same drugs might help restore the network toward a more normal state. Hence, we propose that patients with abnormal status (GCEI = 1) who do not currently qualify for the corresponding targeted therapy or immunotherapy might nevertheless benefit from it. Supporting evidence has already emerged from the WINTHER trial (NCT01856296),18 the first clinical trial to navigate previously treated patients with lung, colon, head and neck, and other cancers to therapy on the basis of fresh biopsy-derived DNA sequencing or RNA expression (tumor versus normal). That study showed that transcriptome profiling is as useful as DNA tests for improving therapy recommendations and patient outcomes.
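The threshold idea described above can be illustrated with a minimal Python sketch. It assumes each cluster member has already been flagged as normally or abnormally expressed; how members are flagged, the function names, and the cutoff value of 0.3 are all hypothetical and chosen only for illustration. In the study itself, the cutoff is derived from receiver operating characteristic analysis on the training data.

```python
from typing import Sequence


def abnormal_fraction(member_flags: Sequence[bool]) -> float:
    """Fraction of cluster members (network nodes) flagged as abnormally expressed."""
    if not member_flags:
        raise ValueError("a cluster needs at least one member")
    return sum(member_flags) / len(member_flags)


def gcei(member_flags: Sequence[bool], cutoff: float) -> int:
    """Return 1 (abnormal) if the abnormal fraction exceeds the cutoff, else 0."""
    return 1 if abnormal_fraction(member_flags) > cutoff else 0


# Hypothetical cutoff, for illustration only; not a value from the study.
CUTOFF = 0.3

mostly_normal = [True, False, False, False]  # 25% of members abnormal
disturbed = [True, True, False, False]       # 50% of members abnormal

print(gcei(mostly_normal, CUTOFF))
print(gcei(disturbed, CUTOFF))
```

With this toy cutoff, the first cluster stays below the threshold and is labeled normal, while the second exceeds it and is labeled abnormal, mirroring the idea that the network is considered disturbed only once enough of its nodes are.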

On another front, novel RNA drugs have emerged and generated growing enthusiasm in the pursuit of new lung cancer treatments.19 Although the expression of a single targeted gene can be evaluated relatively easily, it will be important to know how the RNA expression of the gene network centered around the targeted gene is disturbed and how that disturbance relates to the clinical outcome. Hence, measuring whether a given RNA expression network is clinically normal or abnormal is likely to become a routine requirement. The GCEI is a simple attempt to address this coming need.

Conclusions

The gene cluster expression index can be used to classify lung cancers into groups with dramatically different recurrence risks; the recurrence risk (percentage) of the patient group with index 1 is typically 20% to 200% higher than that of the group with index 0. We expect that the higher-risk group with index 1 may also be suitable for the corresponding targeted therapy or immunotherapy. Therefore, the index may be used to guide targeted therapy or immunotherapy when the conventional companion tests give no recommendation. Nevertheless, this should be validated in clinical trials before it is applied in clinical practice.

Supporting information

Supplementary material for this article is available at https://doi.org/10.14218/CSP.2023.00034 .

Supplementary File 1

Supplemental materials

(DOCX)

Abbreviations

ALK: anaplastic lymphoma kinase
AUC: area under the curve
cGCEI: combined gene cluster expression index
FPR: false positive rate
GCEI: gene cluster expression index
NSCLC: non-small cell lung cancer
PPV: positive predictive value
ROC: receiver operating characteristic
TPR: true positive rate

Declarations

Acknowledgement

We thank UT Southwestern Medical Center for the curated data sets via https://lce.biohpc.swmed.edu/lungcancer/dataset.php.

Ethical statement

The raw data sets were downloaded from the open web portal Lung Cancer Explorer (LCE). This was an in-silico research study and ethics approval was not applicable.

Data sharing statement

The post-processed data sets used in support of the findings of this study are available from the corresponding author at [email protected] upon request.

Funding

This study was funded entirely by the R&D department of Shenzhen Luwei (Biomanifold) Biotechnology Limited, Shenzhen, China.

Conflict of interest

Aibing Rao is a full-time employee of Shenzhen Luwei (Biomanifold) Biotechnology Limited, Shenzhen, China.

References

  1. Tang H, Wang S, Xiao G, Schiller J, Papadimitrakopoulou V, Minna J, et al. Comprehensive evaluation of published gene expression prognostic signatures for biomarker-based lung cancer clinical studies. Ann Oncol 2017;28(4):733-740
  2. Woodard GA, Wang SX, Kratz JR, Zoon-Besselink CT, Chiang CY, Gubens MA, et al. Adjuvant Chemotherapy Guided by Molecular Profiling and Improved Outcomes in Early Stage, Non-Small-Cell Lung Cancer. Clin Lung Cancer 2018;19(1):58-64
  3. Bueno R, Richards WG, Harpole DH, Ballman KV, Tsao MS, Chen Z, et al. Multi-Institutional Prospective Validation of Prognostic mRNA Signatures in Early Stage Squamous Lung Cancer (Alliance). J Thorac Oncol 2020;15(11):1748-1757
  4. Luo Y, Deng X, Que J, Li Z, Xie W, Dai G, et al. Cell Trajectory-Related Genes of Lung Adenocarcinoma Predict Tumor Immune Microenvironment and Prognosis of Patients. Front Oncol 2022;12:911401
  5. Yu J, Li G, Tian Y, Huo S. Establishment of a Lymph Node Metastasis-Associated Prognostic Signature for Lung Adenocarcinoma. Genet Res (Camb) 2023;2023:6585109
  6. Nagl L, Pall G, Wolf D, Pircher A, Horvath L. Molecular profiling in lung cancer. memo 2022;15:201-205
  7. Buzdin A, Sorokin M, Garazha A, Glusker A, Aleshin A, Poddubskaya E, et al. RNA sequencing for research and diagnostics in clinical oncology. Semin Cancer Biol 2020;60:311-323
  8. Nagy Á, Pongor LS, Szabó A, Santarpia M, Győrffy B. KRAS driven expression signature has prognostic power superior to mutation status in non-small cell lung cancer. Int J Cancer 2017;140(4):930-937
  9. Rousseaux S, Debernardi A, Jacquiau B, Vitte AL, Vesin A, Nagy-Mignotte H, et al. Ectopic activation of germline and placental genes identifies aggressive metastasis-prone lung cancers. Sci Transl Med 2013;5(186):186ra66
  10. Okayama H, Kohno T, Ishii Y, Shimada Y, Shiraishi K, Iwakawa R, et al. Identification of genes upregulated in ALK-positive and EGFR/KRAS/ALK-negative lung adenocarcinomas. Cancer Res 2012;72(1):100-111
  11. Cai L, Lin S, Girard L, Zhou Y, Yang L, Ci B, et al. LCE: an open web portal to explore gene expression and clinical associations in lung cancer. Oncogene 2019;38(14):2551-2564
  12. Ou SI, Zhu VW, Nagasaka M. Catalog of 5′ Fusion Partners in ALK-positive NSCLC Circa 2020. JTO Clin Res Rep 2020;1(1):100015
  13. Hajian-Tilaki K. Receiver Operating Characteristic (ROC) Curve Analysis for Medical Diagnostic Test Evaluation. Caspian J Intern Med 2013;4(2):627-635
  14. Couëtoux du Tertre M, Marques M, Tremblay L, Bouchard N, Diaconescu R, Blais N, et al. Analysis of the Genomic Landscape in ALK+ NSCLC Patients Identifies Novel Aberrations Associated with Clinical Outcomes. Mol Cancer Ther 2019;18(9):1628-1636
  15. Jiang C, Zhang Y, Li Y, Lu J, Huang Q, Xu R, et al. High CEP55 expression is associated with poor prognosis in non-small-cell lung cancer. Onco Targets Ther 2018;11:4979-4990
  16. Cai C, Long Y, Li Y, Huang M. Coexisting of COX7A2L-ALK, LINC01210-ALK, ATP13A4-ALK and Acquired SLCO2A1-ALK in a Lung Adenocarcinoma with Rearrangements Loss During the Treatment of Crizotinib and Ceritinib: A Case Report. Onco Targets Ther 2020;13:8313-8316
  17. Nadler E, Vasudevan A, Wang Y, Ogale S. Real-world patterns of biomarker testing and targeted therapy in de novo metastatic non-small cell lung cancer patients in the US oncology network. Cancer Treat Res Commun 2022;31:100522
  18. Rodon J, Soria JC, Berger R, Miller WH, Rubin E, Kugel A, et al. Genomic and transcriptomic profiling expands precision cancer medicine: the WINTHER trial. Nat Med 2019;25(5):751-758
  19. Khan P, Siddiqui JA, Lakshmanan I, Ganti AK, Salgia R, Jain M, et al. RNA-based therapies: A cog in the wheel of lung cancer defense. Mol Cancer 2021;20(1):54