Introduction
MicroRNAs (miRNAs) are small noncoding RNAs, typically about 22 nucleotides in length.1 miRNAs have been found to regulate approximately 30% of human genes at the post-transcriptional level.2 In recent years, miRNA expression has been shown to correlate with diseases, most notably cancers.3 One example is the miRNA precursor lethal-7 (let-7), which has been linked to tumor suppression and cancer inhibition.4 As such, miRNAs are now being studied as promising biomarkers for various cancers, including pancreatic cancer.5
Considerable computational research has addressed miRNA–disease association prediction. For example, Shi and co-workers have proposed a computational method for miRNA–disease relationship prediction based on random walk analysis.6 This model uses the connection between miRNAs and disease genes in protein–protein interaction (PPI) networks to predict potential miRNA–disease associations. In addition, Chen and co-authors have proposed a bipartite network projection model for predicting potential miRNA–disease associations (BNPMDA) using miRNA functional similarity, disease semantic similarity, and the known human miRNA–disease associations.7 This model constructs bias ratings between diseases and miRNAs using miRNA functional similarity and disease semantic similarity. Then, the bipartite network recommendation algorithm is applied to predict miRNA–disease associations. Moreover, You and colleagues have proposed a path-based miRNA–disease association (PBMDA) prediction model.8 This model implements a personalized recommendation algorithm that recommends potential miRNA–disease pairs based on information about related miRNAs and diseases. Ji and co-authors have also proposed a network embedding-based heterogeneous information integration method to predict potential associations between miRNAs and diseases.9 This model first builds a heterogeneous information network from known associations between drugs, miRNAs, proteins, lncRNAs, and diseases and then applies the graph-representations (GraRep) method to learn and predict potential miRNA–disease associations.
Building on earlier results in miRNA–disease association prediction, several databases of miRNA–disease relationships have been created in recent years. As the next step, we used these databases to construct a set of descriptors for machine-learning diagnostics based on miRNAs. In this study, we propose a novel miRNA descriptor system to predict potential associations between miRNAs and diseases. Based on our hypothesis that miRNA–disease associations can be elucidated using sequence information of miRNAs and of the genes they target, we constructed the descriptor system from numerical sequence information of miRNAs and their target genes.
Methods
Classification model
To show the effectiveness of our miRNA descriptor system, we constructed a classification model using known associations of miRNAs with various cancers. We illustrated the concept of the system in more detail using a pancreatic cancer model as an example. From the miRNA cancer association database miRCancer,10 we extracted 107 miRNAs that are associated with breast cancer, and from the miRNA database miRBase,11 we extracted 107 random miRNAs as training/testing data. The model was then constructed based on miRNA descriptors created using the training/testing data. Figure 1 shows the flowchart of our method. The model was then evaluated using the Random Forest machine-learning algorithm. The results reveal that our method performed with a high accuracy of 86.9%.
Developing the miRNA descriptor system
We developed a system of miRNA descriptors taking into consideration the known miRNA–cancer associations and miRNA target predictions (Fig. 2). The system was tested using pancreatic cancer as an example, as follows. A list of miRNAs that are known to be associated with pancreatic cancer was downloaded from the miRCancer10 database, and miRNA target predictions were downloaded from the miRNA target prediction database miRDB.12 A list of all known miRNAs was also downloaded from miRBase.11 To extract the sequence information of the miRNAs that are associated with pancreatic cancer, a Python script was written to find the names of the pancreatic cancer-associated miRNAs in the miRDB and to extract the corresponding miRNA sequences. In total, 152 miRNA sequences associated with pancreatic cancer were extracted in this manner. An additional 152 human miRNA sequences with no known association to pancreatic cancer were randomly selected from miRBase, for a total of 304 miRNA sequences. The 152 miRNA sequences with a known association to pancreatic cancer were assigned “associated” labels and the 152 randomly selected miRNA sequences were assigned “non-associated” labels to create two categories for classification. These miRNA sequences were later used as inputs to create the miRNA descriptors.
Another Python program was developed to automatically extract the miRNA descriptors from the miRNA sequences given as input. One part of the miRNA descriptors consisted of numerical miRNA sequence information. The miRNA sequence information used in this study is more complete and comprehensive than that used in previous studies.12 The miRNA descriptors based on the sequence information consisted of the number of base pairs, the count of each base pair, the frequency of each base pair, the mean mass of the base pairs, the number of hydrogen bonds, the symmetry of the miRNA sequence, the motifs within the entire miRNA sequence (2-, 3-, and 4-base-pair motifs), and the motifs within the first five base pairs and within the last five base pairs. Each motif was a distinct descriptor and was assigned a score of “1” if the miRNA sequence had the motif and a score of “0” if it did not. Table 1 includes the names and formulas/descriptions for each of the numerical descriptors based on miRNA sequence information. In total, there were 996 miRNA descriptors based on the sequence information.
Table 1. miRNA descriptors based on the sequences
| Name of descriptor | Description/Formula |
|---|---|
| Number of base pairs | N |
| Number of each base pair | xA, xU, xC, xG |
| Frequency of each base pair | xA/N, xU/N, xC/N, xG/N |
| Mean mass of each base pair | (135.1(xA) + 112.1(xU) + 111.1(xC) + 151.1(xG))/N |
| Number of hydrogen bonds | 2(xA + xU) + 3(xC + xG) |
| Symmetry score | If the first base pair is the same as the last base pair, add 1 to the symmetry score. If the second base pair is the same as the second-to-last base pair, add 1 to the symmetry score. Repeat until the middle of the miRNA (N/2 term) is reached. |
| 2-base-pair motifs (e.g., AA, AU, AC) of the entire sequence | Each motif is a separate descriptor. If the miRNA has the motif, a “1” is assigned. Otherwise, a “0” is assigned. |
| 3-base-pair motifs (e.g., AAA, AAU, AAC) of the entire sequence | Each motif is a separate descriptor. If the miRNA has the motif, a “1” is assigned. Otherwise, a “0” is assigned. |
| 4-base-pair motifs (e.g., AAAA, AAAU) of the entire sequence | Each motif is a separate descriptor. If the miRNA has the motif, a “1” is assigned. Otherwise, a “0” is assigned. |
| Motifs (2-, 3-, 4-base pair) within the first 5 base pairs | Each motif is a separate descriptor. If the first 5 base pairs of the miRNA contain the motif, a “1” is assigned. Otherwise, a “0” is assigned. |
| Motifs (2-, 3-, 4-base pair) within the last 5 base pairs | Each motif is a separate descriptor. If the last 5 base pairs of the miRNA contain the motif, a “1” is assigned. Otherwise, a “0” is assigned. |
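The descriptors in Table 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the program used in the study: the function names, the dictionary keys, and the conversion of "T" to "U" are assumptions, and the input is an illustrative 22-nucleotide sequence. Because the exact motif list behind the reported 996 descriptors is not specified, the number of descriptors produced by this sketch need not match that figure.

```python
from itertools import product

BASES = "AUCG"
# Per-base masses taken from the mean-mass row of Table 1.
BASE_MASS = {"A": 135.1, "U": 112.1, "C": 111.1, "G": 151.1}


def symmetry_score(seq):
    """Count positions i (up to the middle) where seq[i] equals seq[-1 - i]."""
    return sum(1 for i in range(len(seq) // 2) if seq[i] == seq[-1 - i])


def motif_flags(region, prefix):
    """Return 0/1 presence flags for every 2-, 3-, and 4-base motif in `region`."""
    flags = {}
    for k in (2, 3, 4):
        for motif in map("".join, product(BASES, repeat=k)):
            flags[f"{prefix}_{motif}"] = int(motif in region)
    return flags


def sequence_descriptors(seq):
    """Build Table 1 style descriptors for one miRNA sequence (hypothetical helper)."""
    seq = seq.upper().replace("T", "U")  # assumption: accept DNA-style input
    n = len(seq)
    counts = {b: seq.count(b) for b in BASES}
    desc = {"length": n}
    for b in BASES:
        desc[f"count_{b}"] = counts[b]
        desc[f"freq_{b}"] = counts[b] / n
    desc["mean_mass"] = sum(BASE_MASS[b] * counts[b] for b in BASES) / n
    desc["h_bonds"] = 2 * (counts["A"] + counts["U"]) + 3 * (counts["C"] + counts["G"])
    desc["symmetry"] = symmetry_score(seq)
    desc.update(motif_flags(seq, "all"))          # motifs anywhere in the sequence
    desc.update(motif_flags(seq[:5], "first5"))   # motifs in the first 5 positions
    desc.update(motif_flags(seq[-5:], "last5"))   # motifs in the last 5 positions
    return desc


example = sequence_descriptors("UGAGGUAGUAGGUUGUAUAGUU")
print(example["length"], example["h_bonds"], example["symmetry"])
```

Keeping one dictionary per miRNA makes it straightforward to stack the results into a table with one row per miRNA and one column per descriptor.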
The other part of the miRNA descriptor set was based on miRNA target genes from the miRDB database.12 These miRNA target genes were included as descriptors because we hypothesized that miRNAs that are associated with the same disease will share similar targets as well.13 For this study, a target score threshold of 99 was used to make sure the miRNA and selected target genes were strongly correlated. A Python program was developed to automatically find all target genes with a target score of 99 or more for all miRNAs. Then, the program created a new descriptor for each unique gene selected. A target gene descriptor was assigned a score of “1” if the miRNA sequence had a target score of 99 or more with that target gene and a score of “0” otherwise. Figure 2 shows how the miRNA target gene descriptors were created. In this study, a total of 6,436 target gene descriptors were created from the 304 miRNA sequences.
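The construction of the target gene descriptors can be sketched as follows. The input layout (an iterable of `(miRNA, gene, score)` tuples), the function name, and the miRNA and gene names are all hypothetical; the actual miRDB download format and the study's program may differ.

```python
def target_gene_descriptors(predictions, mirnas, threshold=99):
    """Return (genes, table), where table[mirna][gene] is 1 if score >= threshold.

    `predictions` is an iterable of (mirna, gene, score) tuples. This layout is
    an assumption for illustration, not the actual miRDB file format.
    """
    wanted = set(mirnas)
    hits = {m: set() for m in mirnas}
    for mirna, gene, score in predictions:
        if mirna in wanted and score >= threshold:
            hits[mirna].add(gene)
    # One descriptor (column) per unique gene that passed the threshold.
    genes = sorted(set().union(*hits.values()))
    table = {m: {g: int(g in hits[m]) for g in genes} for m in mirnas}
    return genes, table


# Made-up predictions, for illustration only.
predictions = [
    ("miR-a", "GENE1", 99.5),
    ("miR-a", "GENE2", 95.0),   # below the threshold, so no descriptor is created
    ("miR-b", "GENE1", 99.0),
    ("miR-b", "GENE3", 100.0),
    ("miR-c", "GENE4", 99.9),   # miRNA not in the input list, so it is ignored
]
genes, table = target_gene_descriptors(predictions, ["miR-a", "miR-b"])
print(genes)
print(table)
```

Lowering `threshold` admits more genes and therefore produces more descriptor columns.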
The miRNA descriptor system was developed to take a list of miRNAs as the input and to issue a table with the sequence information and target gene descriptors for each miRNA as the output. This system could be applied to any disease with known miRNA–disease associations.
Machine learning
We describe the system performance using pancreatic cancer as an example. The 6,436 target gene-based descriptors and the 996 numerical miRNA sequence-based descriptors from the 304 miRNA sequences were combined to create a single miRNA descriptor table with 304 miRNA sequences (rows) and 7,432 descriptors (columns). An additional column was added to the descriptor table to label the two classes of data for classification. The 152 miRNAs that are associated with pancreatic cancer were given the class of “associated” while the 152 randomly selected miRNAs were given the class of “non-associated.” The descriptor table was then used as the input for multiple machine-learning classification algorithms. Out of all of the classification algorithms, Random Forest14 with an 80%/20% training–testing split had the highest classification accuracy.
We first used Random Forest with an 80%/20% training–testing split to evaluate the performance of the model before any feature selection was done. Because 20% of the data is held out from model building and used only for testing, the split provides a check against overfitting. Then, the InfoGainAttributeEval15 algorithm was used to determine which descriptors contribute the most to information gain during classification. The descriptors that have no contribution to information gain were removed, thus leaving a list of descriptors that have a positive contribution to classification, ordered from the greatest contribution to the least contribution.
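An 80%/20% split of a two-class descriptor table can be sketched as below. Stratifying by class, the fixed random seed, and the function name are assumptions made for illustration; the text does not state how the split was drawn.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), holding out `test_fraction` of each class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Two balanced classes of 152 rows each, mirroring the class sizes described above.
labels = ["associated"] * 152 + ["non-associated"] * 152
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

The returned indices select rows of the descriptor table, so the same split can be reused when only the classifier is changed.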
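Information gain ranking can be illustrated with a short, self-contained sketch. It treats every descriptor value as a discrete category and is not the InfoGainAttributeEval implementation itself, which may handle numeric attributes differently; the function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels):
    """Class entropy minus the entropy remaining after splitting on `values`."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for x, lab in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def rank_descriptors(columns, labels, eps=1e-12):
    """Drop descriptors with zero gain and sort the rest from highest to lowest gain."""
    gains = {name: information_gain(vals, labels) for name, vals in columns.items()}
    kept = [(name, g) for name, g in gains.items() if g > eps]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Toy data: four miRNAs, three binary descriptors.
labels = [1, 1, 0, 0]
columns = {
    "useless": [1, 0, 1, 0],   # carries no information about the class
    "partial": [1, 1, 1, 0],
    "perfect": [1, 1, 0, 0],   # separates the classes exactly
}
ranked = rank_descriptors(columns, labels)
print(ranked)
```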
The reduced table of descriptors then went through more precise feature selection. A script removed descriptors one by one starting from the descriptors with the least information gain contribution, evaluated the performance of the model using Random Forest, and kept the deletion if the classification accuracy increased. Overall, the number of descriptors for our 304 miRNA sequences was reduced from 7,432 to 3,648 descriptors.
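The one-by-one removal step can be sketched as a simple loop. Here `evaluate` is a hypothetical callable standing in for training and testing a Random Forest on a given descriptor subset; the toy scoring function below is invented purely so that the example runs.

```python
def backward_eliminate(ranked, evaluate):
    """Try removing descriptors from lowest to highest information gain.

    `ranked` lists descriptor names from highest to lowest gain. A removal is
    kept only if `evaluate(subset)` returns a strictly higher accuracy.
    """
    selected = list(ranked)
    best = evaluate(selected)
    for name in reversed(ranked):
        if len(selected) == 1:
            break  # always keep at least one descriptor
        trial = [n for n in selected if n != name]
        score = evaluate(trial)
        if score > best:
            selected, best = trial, score
    return selected, best


def toy_accuracy(names):
    """Invented score: informative descriptors help, the others hurt slightly."""
    informative = {"a", "b"}
    chosen = set(names)
    return 0.5 + 0.2 * len(chosen & informative) - 0.01 * len(chosen - informative)


selected, accuracy = backward_eliminate(["a", "b", "c", "d"], toy_accuracy)
print(selected, round(accuracy, 2))
```

Each candidate removal requires retraining the classifier, so the cost of this procedure grows with the number of descriptors that survive the information gain filter.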
Results
The results of the proposed classification were evaluated using confusion matrices and their derivatives: the accuracy (ACC), precision (PREC), Matthews correlation coefficient (MCC), true-positive rate (TPR) or recall (REC), false-positive rate (FPR), as well as the area under the receiver-operating characteristic (ROC) curve (AUC), and the area under the precision-recall curve (PRC area). A comparison of the results of the different classifiers for pancreatic cancer is presented in Table 2. The weighted averages of these metrics for the best-performing classifier, Random Forest, were as follows: ACC, 86.9%; PREC, 87.1%; MCC, 73.9%; TPR (REC), 86.9%; FPR, 12.8%; AUC, 86.4%; and PRC area, 86.1%.
Table 2. Performance comparison of the different classifiers for the developed machine-learning models
| Classifier | ACC | PREC | MCC | TPR | FPR | AUC | PRC area |
|---|---|---|---|---|---|---|---|
| LMT | 81.97% | 82.1% | 80.4% | 82.0% | 18.5% | 85.9% | 84.9% |
| SVM | 81.97% | 82.1% | 64.0% | 82.0% | 17.9% | 82.1% | 76.4% |
| Naïve Bayes | 80.26% | 82.6% | 63.0% | 80.3% | 18.4% | 84.9% | 83.0% |
| Random Forest | 86.88% | 87.1% | 73.9% | 86.9% | 12.8% | 86.4% | 86.1% |
The ROC curve plots sensitivity against (1−specificity) across a range of decision thresholds. Thus, the vertical axis is the TPR, that is, the sensitivity or recall, and the horizontal axis is the FPR, or (1−specificity). The model based on the Random Forest classifier showed the best performance. The FPR is the probability of incorrectly classifying a negative instance as positive. The model’s low FPR of 12.8% demonstrates a low probability of wrongly labeling a non-associated miRNA–pancreatic cancer pair as associated. The TPR (sensitivity) is the probability of correctly classifying a positive instance. The model’s high TPR of 86.9% indicates a high probability of correctly classifying an miRNA–pancreatic cancer pair that is associated. The large average AUC value of 86.4% indicates that the Random Forest classifier discriminates well between the two classes. Another way to evaluate the performance of the proposed method is the PRC area, which shows precision values for the corresponding sensitivity (recall, i.e., TPR) values. The model’s large PRC area value of 86.1% once again shows the good performance of our method.
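The threshold-free metrics can be computed directly from a confusion matrix, as in the sketch below. The counts are made up for illustration and are not the confusion matrix of any model in this study. The sketch reports metrics for a single positive class, whereas the values in Table 2 are weighted averages over both classes, and it assumes that no denominator is zero.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Return ACC, PREC, TPR, FPR, and MCC for one positive class."""
    return {
        "ACC": (tp + tn) / (tp + fp + tn + fn),
        "PREC": tp / (tp + fp),
        "TPR": tp / (tp + fn),   # sensitivity / recall
        "FPR": fp / (fp + tn),   # 1 - specificity
        "MCC": (tp * tn - fp * fn)
        / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


# Hypothetical counts: 50 associated and 50 non-associated test miRNAs.
metrics = binary_metrics(tp=40, fp=10, tn=40, fn=10)
print(metrics)
```

Note that the FPR is computed over the negative instances only (`fp + tn`), which is why it measures how often a non-associated miRNA is labeled as associated.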
Performance comparison of the different classifiers for the developed machine-learning models
To test how the choice of classifier affects our model, we compared the performance of four classifiers, Random Forest,14 Naïve Bayes,16 Logistic Model Tree,17 and Support Vector Machine,18 using the 80%/20% training–testing split. In the comparison, the environment and training/testing set were kept the same and only the classifier engine was changed. Additionally, the same statistical metrics of ACC, PREC, MCC, TPR (REC), FPR, AUC, and PRC area were used. Table 2 shows the comparison of the performance of all of the classifiers. The comparison shows that the Random Forest classifier achieved the highest accuracy and sensitivity and the lowest FPR among the classifiers tested on our system.
Table 3. Comparison of the miRNA-based diagnostics on various cancers and different target gene thresholds
| Cancer type and prediction threshold | ACC | PREC | MCC | TPR | FPR | AUC | PRC area |
|---|---|---|---|---|---|---|---|
| Breast cancer (target gene threshold of 90) | 87.7% | 88.0% | 76.4% | 87.7% | 12.9% | 88.7% | 86.4% |
| Breast cancer (target gene threshold of 99) | 85.1% | 85.6% | 70.1% | 85.1% | 14.3% | 88.3% | 86.5% |
| Lung cancer (target gene threshold of 90) | 86.3% | 86.9% | 73.2% | 86.3% | 13.2% | 88.5% | 85.7% |
| Lung cancer (target gene threshold of 99) | 86.3% | 86.7% | 72.8% | 86.3% | 13.0% | 88.9% | 87.9% |
| Pancreatic cancer (target gene threshold of 90) | 88.5% | 88.7% | 78.4% | 88.5% | 11.5% | 88.9% | 87.5% |
| Pancreatic cancer (target gene threshold of 99) | 86.9% | 87.1% | 73.9% | 86.9% | 12.8% | 86.4% | 86.1% |
miRNA-based diagnostics of various cancers
To assess the robustness of our system of descriptors for miRNA–disease prediction, we conducted case studies using pancreatic cancer, lung cancer, and breast cancer. Previously, we tested our method on pancreatic cancer using target gene descriptors based on target gene prediction scores of 99 or higher. To explore whether the number of parameters (descriptors) has a significant impact on the prediction performance (statistical metrics), we conducted each case study using two different target gene prediction thresholds, 90 and 99. Each case study was conducted using Random Forest with an 80%/20% training–testing data split, and we evaluated the models using the same statistical metrics of ACC, PREC, MCC, TPR (REC), FPR, AUC, and PRC area. Additionally, the same method was used to create the miRNA target gene-based descriptors and to perform feature selection in each study. Table 3 shows the average prediction performance metrics for each case study.
The results show that the accuracies of the case studies are both consistent and high, ranging from 85.1% to 88.5%, supporting the robustness of the method for miRNA–disease association prediction. Additionally, the prediction accuracies are consistently high for both a target gene prediction threshold of 99 and a target gene prediction threshold of 90 for each case study, showing that the method operates robustly even with smaller numbers of descriptors.
To further explore the relationship between the number of descriptors and the prediction accuracies, the number of descriptors and the prediction accuracy for each case study are compared in Figures 3–5. For each case study, descriptors were removed based on their information gain contribution (the descriptors with the least information gain were removed first). While the prediction accuracies fluctuated as the number of descriptors was reduced, they remained consistently high across all numbers of descriptors in every case study. This finding indicates that although the accuracies were highest with a large number of descriptors, high accuracies could still be achieved across all diseases with fewer descriptors.
Testing of outside datasets on the developed models
To ensure that our models are able to differentiate between different diseases, various datasets were tested on each model. From each disease dataset, approximately 50 randomly selected associated miRNAs were paired with the same number of randomly selected non-associated miRNAs. Then, the selected data were used to test the model.
First, randomly selected data from the lung cancer and breast cancer datasets were tested on the pancreatic cancer model with a target gene threshold of 99. The model classified the lung cancer data with 57.8% accuracy and the breast cancer data with 56.4% accuracy. Next, randomly selected data from the pancreatic cancer and breast cancer datasets were tested on the lung cancer model with a target gene threshold of 99. The model classified the pancreatic cancer data with 56.7% accuracy and the breast cancer data with 58.3% accuracy. Finally, randomly selected data from the lung cancer and pancreatic cancer datasets were tested on the breast cancer model with a target gene threshold of 99. The model classified the lung cancer data with 56.7% accuracy and the pancreatic cancer data with 55.9% accuracy. The accuracies are presented in Table 4. The accuracies were all higher than 50% because of some overlapping miRNAs between the three datasets. However, each model performed consistently worse when classifying datasets from other diseases. Thus, we conclude that the models are able to differentiate between different cancers.
Table 4. Comparison of the diagnostic accuracies using different cancer datasets for tests on the developed models
| | Pancreatic cancer model (target gene threshold of 99) | Lung cancer model (target gene threshold of 99) | Breast cancer model (target gene threshold of 99) |
|---|---|---|---|
| Pancreatic cancer dataset accuracy | 86.9% | 57.8% | 56.4% |
| Lung cancer dataset accuracy | 56.7% | 86.3% | 58.3% |
| Breast cancer dataset accuracy | 55.9% | 56.7% | 85.1% |
We also tested a noncancer disease on the system to provide further verification. From the HMDD database,19 we extracted 86 miRNAs associated with Alzheimer’s disease and also extracted 86 miRNAs not associated with Alzheimer’s disease. The selected data were tested on the pancreatic cancer, lung cancer, and breast cancer association models with 90 target genes. The pancreatic cancer model classified the Alzheimer’s data with 48.1% accuracy, the lung cancer model classified the Alzheimer’s data with 53.0% accuracy, and the breast cancer model classified the Alzheimer’s data with 56.7% accuracy. The low classification accuracies of the Alzheimer’s data further demonstrate that the models are able to differentiate between different diseases. The accuracies are shown in Table 5.
Table 5. Comparison of the diagnostic accuracies of Alzheimer’s disease tested on the developed models
| | Pancreatic cancer model (99 target genes) | Lung cancer model (99 target genes) | Breast cancer model (99 target genes) |
|---|---|---|---|
| Alzheimer’s disease dataset accuracy | 48.1% | 53.0% | 56.7% |
Finally, we tested pancreatic, breast, and lung cancer data from other studies5,20-22 on our corresponding models. From the other studies, we were able to extract 12 lung cancer-associated miRNAs, 13 pancreatic cancer-associated miRNAs, and 30 breast cancer-associated miRNAs that were not included in our models. Then, the same number of unassociated miRNAs was paired with the cancer data and tested on each model.
The lung cancer data from other studies21 yielded a 91.7% accuracy when tested on our lung cancer model; the pancreatic cancer data from other studies5,20 yielded a 92.3% accuracy when tested on our pancreatic cancer model; and the breast cancer data from other studies22 yielded a 95.0% accuracy when tested on our breast cancer model. These results support the validity of our models.
After establishing that our individual classifiers for breast, pancreatic, and lung cancer were able to differentiate between different diseases, we used a hard-voting scheme to recognize different cancers from a single input dataset. A hard-voting scheme uses majority voting for classification. The scheme was applied across the three individual models. Table 6 shows examples of the results from the hard-voting scheme.
Table 6. Hard-voting scheme for different cancer diagnostics
| Breast cancer | Lung cancer | Pancreatic cancer | Hard-voting consensus |
|---|---|---|---|
| Accuracy >80% | Accuracy <60% | Accuracy <60% | Breast cancer |
| Accuracy <60% | Accuracy >80% | Accuracy <60% | Lung cancer |
| Accuracy <60% | Accuracy <60% | Accuracy >80% | Pancreatic cancer |
| Accuracy <60% | Accuracy <60% | Accuracy <60% | None |
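The consensus rule illustrated in Table 6 can be sketched as a small function. This is one possible reading of the table, not a definitive implementation: the function name and the example accuracies are hypothetical, and the handling of cases that the table does not list (two models above 80%, or an accuracy between 60% and 80%) is an assumption, here resolved by returning no consensus.

```python
def hard_vote(accuracies, high=80.0, low=60.0):
    """Return the consensus cancer type, or None if there is no clear consensus.

    `accuracies` maps each cancer model to its accuracy (%) on the input dataset.
    A cancer type is returned only if its model exceeds `high` while every other
    model stays below `low`.
    """
    winners = [cancer for cancer, acc in accuracies.items() if acc > high]
    if len(winners) != 1:
        return None
    others = [acc for cancer, acc in accuracies.items() if cancer != winners[0]]
    return winners[0] if all(acc < low for acc in others) else None


# Illustrative accuracies only.
print(hard_vote({"breast": 85.0, "lung": 57.0, "pancreatic": 56.0}))
print(hard_vote({"breast": 55.0, "lung": 57.0, "pancreatic": 56.0}))
```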