Introduction
The quantity of Gleason pattern 4 (GP4) in Grade Group 2 (Gleason score 7) prostate cancer (PCa) is an important prognostic factor and may influence treatment decisions. Studies have shown that the quantity of GP4 in Grade Group 2 and 3 PCa in prostate biopsies, measured as a percentage or as a linear length, correlates with adverse pathological findings in radical prostatectomies (RP), including extraprostatic extension, seminal vesicle invasion, lymph node metastases, and positive surgical margins, and with oncological outcomes (postsurgical biochemical progression).1–9 For these reasons, reporting the quantity of GP4 in a needle biopsy has been recommended by the World Health Organization (WHO),10 Genitourinary Pathology Society (GUPS),11 International Society of Urological Pathology (ISUP),12 and College of American Pathologists (CAP).13
For the GP4 quantity to be useful for patient prognosis and management, quantification should be done in a uniform and reproducible way. However, the interobserver reproducibility of GP4 quantification by pathologists has rarely been studied. It is also not known whether the quantification methodology and histological features, including GP4 sub-patterns, affect the interobserver reproducibility. This study was designed to answer these questions and to provide guidance to pathologists for reproducible quantification of GP4 in prostate biopsies.
Material and methods
Case selection
The study was performed in accordance with the ethical standards of the contributing authors’ institutions and with the Declaration of Helsinki (as revised in 2013), and was approved by the lead author’s Institutional Review Boards. The patients’ consents were waived, as the slides were anonymized. Forty-seven glass slides containing 55 biopsy cores with PCa of various amounts of GP4 were selected for this study. Of the 12 participating pathologists, nine were practicing in university teaching hospitals and three in specialized urological pathology laboratories.
Review and quantification of Gleason pattern 4
De-identified glass slides were distributed to the 12 participants, who reviewed the biopsy cores to confirm the cancer diagnosis. When GP4 cancer was present, the participants quantified GP4 as a percentage of the entire cancer focus in the biopsy core (ranging from 1% to 100%) and documented the GP4 sub-patterns present (poorly formed glands [P], fused glands [F], cribriform [C], and glomeruloid [G]) and the most common sub-pattern in each biopsy core. Specific diagnostic criteria for P, F, C, and G were not provided to the participants, who graded and quantified the biopsy cores based on their own experience. Eleven of the 12 participants quantified the GP4 percentage as the area of GP4 vs the total cancer area, and one participant quantified it as the length of GP4 vs the total cancer length. For the final analysis, the mean percentage of GP4 measured by the 12 participants was taken as the consensus GP4 quantity in each biopsy core. A GP4 sub-pattern was considered the most common pattern by group consensus when it was identified as the most common pattern by ≥9 (75%) participants. Since the G sub-pattern was identified as the most common pattern in only one case, it was combined with the C sub-pattern for analysis. The length of the tumor in the biopsy cores was measured by the lead author of this study (MZ) in millimeters from one end to the other without excluding intervening benign glands.
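The two consensus rules described above can be sketched in a few lines of Python. This is a minimal illustration rather than the study's own code: the function names and the example readings are hypothetical, and only the mean-of-12 rule, the ≥9-of-12 threshold, and the pooling of G with C come from the text.

```python
from collections import Counter
from statistics import mean

def consensus_gp4(percentages):
    """Consensus GP4 quantity for one core: mean of all participants' percentages."""
    return mean(percentages)

def consensus_subpattern(votes, min_votes=9):
    """Sub-pattern named most common by at least min_votes participants.

    Returns None when no sub-pattern reaches the threshold ("no consensus").
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_votes else None

def analysis_group(label):
    """Map a consensus sub-pattern to its analysis group (G is pooled with C)."""
    if label is None:
        return "no consensus"
    return "C/G" if label in ("C", "G") else label

# Hypothetical readings for one core from 12 participants (not study data).
percentages = [20, 30, 25, 40, 30, 20, 35, 30, 25, 30, 40, 35]
votes = ["P"] * 9 + ["F"] * 3

print(consensus_gp4(percentages))
print(analysis_group(consensus_subpattern(votes)))
```

With a 9-of-12 threshold at most one sub-pattern can qualify, so ties in `most_common` do not affect the result.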
Statistical analysis
Statistical analyses were performed using R software for statistical computing (R Foundation for Statistical Computing, Vienna, Austria). Interobserver reproducibility of quantifying GP4 by the 12 participating pathologists was assessed by Fleiss κ using the package “psy.” The reproducibility of quantifying GP4 was also calculated for biopsy cores grouped by tumor length (≤2 mm, 2.1–5 mm, and >5 mm) and by the most common sub-pattern in each core (P, F, C/G, and no consensus). A p value <0.05 was considered statistically significant.
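For readers unfamiliar with the statistic, the following is a minimal Python sketch of Fleiss' κ computed from a subjects-by-categories count table. It is not the R "psy" implementation used in the study. The text does not state how the percentage readings were categorized, so the category boundaries here are an assumption borrowed from the clinically meaningful ranges used in the Results, and the example readings are hypothetical.

```python
from collections import Counter

# Assumed categories (not stated in the methods): 0%, 1-10%, 11-50%, >=51%.
CATEGORIES = ["0", "1-10", "11-50", ">=51"]

def gp4_range(percent):
    """Assign a GP4 percentage to one of the assumed categories."""
    if percent == 0:
        return "0"
    if percent <= 10:
        return "1-10"
    if percent <= 50:
        return "11-50"
    return ">=51"

def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] = raters placing subject i in category j.

    Every row must sum to the same number of raters. Undefined (division by
    zero) when all ratings fall in a single category.
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every subject needs the same number of ratings")

    # Observed agreement: mean proportion of agreeing rater pairs per subject.
    p_obs = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_subjects

    # Chance agreement from the overall category proportions.
    total = n_subjects * n_raters
    p_exp = sum(
        (sum(row[j] for row in counts) / total) ** 2
        for j in range(len(counts[0]))
    )
    return (p_obs - p_exp) / (1 - p_exp)

def kappa_from_percentages(readings):
    """Bin each rater's percentage per core, then compute Fleiss' kappa."""
    counts = []
    for core in readings:
        tally = Counter(gp4_range(p) for p in core)
        counts.append([tally[c] for c in CATEGORIES])
    return fleiss_kappa(counts)

# Hypothetical readings: 4 cores, 5 raters each (not study data).
readings = [
    [5, 10, 5, 0, 10],
    [30, 40, 20, 55, 35],
    [60, 70, 45, 80, 65],
    [15, 5, 20, 25, 30],
]
print(round(kappa_from_percentages(readings), 2))
```

Because κ depends on how the continuous percentages are binned, a different choice of categories would give a different value from the same readings.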
Results
The mean percentage of GP4 in 55 biopsy cores ranged from 6% to 82%. Thirty-eight cores contained Grade Group 2 PCa with GP4 ≤50%, while 17 contained Grade Group 3 PCa with GP4 >50%. The interobserver reproducibility for 12 pathologists to quantify GP4 in 55 biopsy cores was moderate (κ = 0.57).
The mean PCa length was 5.4 mm (range, 0.6–13 mm). None of the cores had discontinuous cancer involvement, defined as cancer foci >3 mm apart.14 The 55 cores were categorized into three groups based on the PCa length: ≤2 mm (Group 1), 2.1–5 mm (Group 2), and >5 mm (Group 3). There were seven cores in Group 1, 23 cores in Group 2, and 25 cores in Group 3. The quantification reproducibility of GP4 in these three groups is shown in Table 1 and Figure 1. The κ values were 0.51, 0.50, and 0.66, respectively, for these three groups. The reproducibility was significantly higher for Group 3 than for Groups 1 and 2 (p >0.05 for Group 1 vs 2, and p <0.05 for Group 2 vs 3).
Table 1. Gleason pattern 4 quantification reproducibility in prostate cores stratified by cancer length
| Group | N | ICC/kappa value | 95% CIL | 95% CIU | P value |
|---|---|---|---|---|---|
| Group 1 (≤2 mm) | 7 | 0.51 | 0.07 | 0.73 | n.a. |
| Group 2 (2.1–5 mm) | 23 | 0.50 | 0.30 | 0.60 | Group 1 vs 2: >0.05 |
| Group 3 (>5 mm) | 25 | 0.66 | 0.50 | 0.76 | Group 2 vs 3: <0.05 |
The P, F, C, and G glands (Fig. 2) were the most common sub-pattern in 18 (33%), seven (13%), nine (16%), and one (2%) of the cores, respectively. The remaining 20 (36%) had no consensus in terms of the most common sub-pattern. These 55 cores were categorized into four groups according to their most common sub-pattern (P, F, C/G, and no consensus). The κ value for these four groups was 0.43, 0.57, 0.74, and 0.57, respectively (p value <0.05 for P vs F and F vs [C/G]) (Table 2). The interobserver reproducibility for the C/G sub-patterns was significantly better than that for P and F.
Table 2. Gleason pattern 4 quantification reproducibility in prostate cores stratified by sub-patterns
| Most common sub-pattern | N | ICC/kappa value | 95% CIL | 95% CIU | P value |
|---|---|---|---|---|---|
| P | 18 | 0.43 | 0.22 | 0.55 | n.a. |
| F | 7 | 0.57 | 0.12 | 0.77 | P vs F: <0.05 |
| C/G | 10 | 0.74 | 0.38 | 0.88 | F vs (C/G): <0.05 |
| No consensus | 20 | 0.57 | 0.39 | 0.67 | n.a. |
We also investigated how often an individual participant’s quantification fell in the same clinically meaningful GP4 range as the consensus measurement (0% [Grade Group 1], 1–10% and 11–50% [both would be graded as Grade Group 2], and ≥51% [Grade Group 3]) (Table 3). Quantification of a biopsy core by one pathologist was considered one event, so each core contributed 12 classification events. When the consensus GP4 was 1–10% (Grade Group 2), 4/24 (17%) individual measurements were 0% (Grade Group 1) and 2/24 (8.3%) were 11–50% (Grade Group 2); only the latter were considered misclassified. When the consensus GP4 was 11–40% (Grade Group 2), 12/372 (3.2%) individual measurements were 0% (Grade Group 1) and 28/372 (7.5%) were ≥51% (Grade Group 3); therefore, 10.8% (40/372) were considered misclassified. When the consensus GP4 was 41–50% (Grade Group 2), 21/60 (35%) individual measurements were ≥51% (Grade Group 3) and were considered misclassified. When the consensus GP4 was 51–60% (Grade Group 3), 2/60 (3.3%) individual measurements were 0% (Grade Group 1) and 23/60 (38.3%) were 11–50% (Grade Group 2); therefore, 41.7% (25/60) were considered misclassified. When the consensus GP4 was ≥61% (Grade Group 3), 19/144 (13.2%) individual measurements were 11–50% (Grade Group 2) and were considered misclassified. The misclassification rate was significantly higher when the consensus GP4 measurement was 41–60% (Table 3).
Table 3. Quantification of Gleason pattern 4 in prostate biopsies by individual pathologists stratified by consensus Gleason pattern 4 measurements
| Consensus GP4 measurement (%) | # biopsy cores | Total # classification events* | Classified as GP4% = 0, # (%) | Classified as GP4% = 1–10, # (%) | Classified as GP4% = 11–50, # (%) | Classified as GP4% ≥ 51, # (%) | % misclassification, % (#/total) | P value for misclassification |
|---|---|---|---|---|---|---|---|---|
| 1–10 | 2 | 2 × 12 = 24 | 4 (17) | 18 (75) | 2 (8.3) | 0 (0) | 8.3 (2/24) | <0.0001 |
| 11–40 | 31 | 31 × 12 = 372 | 12 (3.2) | 132 (35.5) | 200 (53.8) | 28 (7.5) | 10.8 (40/372) | |
| 41–50 | 5 | 5 × 12 = 60 | 0 (0) | 4 (6.7) | 35 (58.3) | 21 (35.0) | 35 (21/60) | |
| 51–60 | 5 | 5 × 12 = 60 | 2 (3.3) | 0 (0) | 23 (38.3) | 35 (58.3) | 41.7 (25/60) | |
| ≥61 | 12 | 12 × 12 = 144 | 0 (0) | 0 (0) | 19 (13.2) | 125 (86.8) | 13.2 (19/144) | |
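The misclassification arithmetic in Table 3 can be re-derived from the raw counts, which also serves as a check that each row adds up to 12 events per core. The sketch below copies the counts from Table 3. The mapping of columns to "misclassified" follows the description in the text, and including columns that happen to contain zero readings is an assumption that does not change any result. The variable and function names are illustrative.

```python
# Counts per consensus range, copied from Table 3.
# Column order: GP4 = 0, GP4 1-10, GP4 11-50, GP4 >= 51.
TABLE3 = {
    "1-10": (4, 18, 2, 0),
    "11-40": (12, 132, 200, 28),
    "41-50": (0, 4, 35, 21),
    "51-60": (2, 0, 23, 35),
    ">=61": (0, 0, 19, 125),
}
CORES = {"1-10": 2, "11-40": 31, "41-50": 5, "51-60": 5, ">=61": 12}
RATERS = 12

# Column indexes counted as misclassified for each row, as read from the text.
# Zero-count columns are included by assumption; they do not affect the rates.
MISCLASSIFIED = {
    "1-10": (2, 3),      # a reading of 0% is treated as concordant
    "11-40": (0, 3),
    "41-50": (0, 3),
    "51-60": (0, 1, 2),
    ">=61": (0, 1, 2),
}

def misclassification(row):
    """Return (misclassified events, total events, percentage) for one row."""
    counts = TABLE3[row]
    total = sum(counts)
    if total != CORES[row] * RATERS:
        raise ValueError(f"row {row} does not sum to cores x raters")
    wrong = sum(counts[i] for i in MISCLASSIFIED[row])
    return wrong, total, 100 * wrong / total

for row in TABLE3:
    wrong, total, pct = misclassification(row)
    print(f"{row:>6}: {wrong}/{total} = {pct:.1f}%")
```

The computed rates agree with the misclassification column of Table 3, with the two borderline rows (41–50% and 51–60%) standing well above the others.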
Discussion
The GP4 quantity in Grade Group 2 PCa is an important prognostic factor. In Grade Group 2 PCa, GP4 is associated with adverse histopathological outcomes and an increased risk of biochemical failure in patients undergoing radical prostatectomy.1,2,4–6,8,9 In recent studies, Dean et al. and Perera et al. demonstrated that several GP4 quantification methods in Grade Group 2 PCa, including the maximum percentage of GP4 in any single core, the overall percentage of GP4 (GP4 mm/total cancer mm), and the total length of GP4 in mm in all cores, were significantly associated with an increased risk of adverse pathology in RP and of biochemical recurrence.3,7 Cole et al. also reported that the incremental percentage of GP4 in the biopsies was an important predictor of adverse pathology and PSA recurrence across the entire range of Gleason score 7–8 PCa.2
The quantity of GP4 has been used to determine candidacy for active surveillance, which has been increasingly used for patients with National Comprehensive Cancer Network (NCCN) very low-risk or low-risk PCa, and to select among radiation therapy protocols. The guidelines now also consider active surveillance for select favorable intermediate-risk patients, which includes those with low-volume Grade Group 2 disease, depending on life expectancy and other clinical and radiologic factors.15 Therefore, WHO, GUPS, ISUP, and CAP have all recommended including the percentage of GP4 in prostate biopsy reports.10–13
However, only one study so far has investigated the interobserver reproducibility of GP4 quantification in prostate biopsies and the histological features of PCa that may affect it.16 Sadimin et al. compared the quantification of GP4 by the primary author and his four trainees. They found that the GP4 quantification by the primary author and by his trainees matched exactly in 32% of cases and fell within ±10% in 75% of cases, with a weighted kappa value (κw) of 0.67. No significant difference was observed when the cases were stratified by whether (1) the GP4 component was scattered or clustered in the background of GP3 cancer, (2) the cancer in the biopsy was continuous or discontinuous, and (3) the GP4 showed a cribriform/glomeruloid pattern only, a poorly formed/fused pattern, or a mixed cribriform and poorly formed/fused pattern. However, κw for cases with >10% cancer involvement of the biopsy core was significantly higher than for those with ≤10% involvement (0.70 vs 0.50). While that study showed that GP4 quantification can be performed reproducibly and that GP4 quantification in a small focus of PCa is less reliable, it has several limitations. There were only five participants, and they were the primary author and his four trainees at the same institution. To understand the true degree of, and the factors that affect, the interobserver reproducibility of quantifying GP4, a survey of more pathologists from more diverse practice settings is needed.
In the current study, the interobserver reproducibility for 12 pathologists to quantify GP4 in 55 biopsy cores was moderate (κ = 0.57). This κ value is lower than that in Sadimin’s study (0.67). The difference could be explained by the aforementioned fact that the participants in that study comprised the primary author and four of his trainees at one institution, while the 12 pathologists in this study had diverse training and practice backgrounds. Our study may still have overestimated the interobserver reproducibility, as most of our participants had GU fellowship training and practice in a GU subspecialty sign-out. Sadimin’s study and our present study do, however, suggest that training can improve the reproducibility of GP4 quantification.
We also found that the interobserver reproducibility in the biopsy cores with a tumor length >5 mm was significantly better than that for the cores with tumor lengths ≤2 mm and 2.1–5 mm (κ value 0.66 vs 0.51 and 0.50), indicating that GP4 quantification in a larger cancer focus is more reproducible. This finding is similar, although not identical, to that of Sadimin’s study, which showed that κw for cases with >10% involvement of the core was significantly higher than for those with ≤10% involvement. We therefore advocate caution when quantifying the percentage of GP4 in a small focus of cancer. A recent survey by GUPS reported no consensus on quantifying the percentage of GP4 in needle biopsies with low-volume cancer, with 58% of pathologists assigning and 42% not assigning a percentage of GP4.11 A practical approach is to assign either Grade Group 2 or 3 with a comment that the cancer focus is too small to accurately quantify the percentage of GP4.
We also studied the impact of the sub-patterns on the reproducibility of GP4 quantification. Poorly formed glands were the most common GP4 sub-pattern in 18 (33%) cores and cribriform glands were the most common sub-pattern in nine (16%) cores. The reproducibility of GP4 quantification varied significantly among the four groups with predominantly poorly formed glands, fused glands, cribriform/glomeruloid glands, or no consensus. The biopsy cores with predominantly poorly formed glands had the lowest κ value (0.43), while those with cribriform/glomeruloid glands had the highest κ value (0.74). The low κ value associated with quantifying poorly formed glands may be explained by the significant variation among pathologists in grading poorly formed glands. Our previous study found that the poorly formed glands sub-pattern suffers from definitional ambiguity and has the lowest diagnostic reproducibility (κ = 0.34).17 Grading and quantification of poorly formed glands therefore need more standardization.
The reproducibility of quantifying cores with fused glands as the most common sub-pattern was moderate (κ = 0.57), better than that of the poorly formed glands but worse than that of the cribriform/glomeruloid glands. The biopsy cores with the cribriform/glomeruloid patterns as the most common pattern had substantial reproducibility in GP4 quantification, the highest of all groups (κ = 0.74). This finding is clinically relevant and reassuring, as recent studies found that the cribriform sub-pattern is associated with a worse prognosis than other GP4 sub-patterns and independently predicts biochemical recurrence and risk of distant metastasis.1,18,19 Our study suggests that the most common GP4 sub-pattern may be reported along with the percentage of GP4, as the quantification reproducibility differs significantly among GP4 sub-patterns, with the cribriform pattern having the highest.
We also investigated how often an individual participant’s quantification fell in the same clinically meaningful GP4 range as the consensus measurement (0%, 1–10%, 11–50%, and ≥51%), which would imply that the individual participant’s Grade Group is concordant with the Grade Group based on the consensus. When the consensus GP4 was 1–10% (Grade Group 2), 4/24 (17%) individual measurements were 0% (Grade Group 1). Several studies found that Grade Group 2 PCa with a minor GP4 component has pathological features and clinical outcomes similar to Grade Group 1 PCa,6,20 and patients with Grade Group 2 PCa with a minor GP4 component may still be eligible for active surveillance. Therefore, a measurement of GP4 = 0% in these cores was considered concordant with the consensus measurement of 1–10%. However, when the consensus GP4 was 41–50% and 51–60%, discordance was seen in 35% and 41.7% of the individual measurements, respectively, implying that these cores have a significant risk of being upgraded from Grade Group 2 to 3 or downgraded from Grade Group 3 to 2.
The findings of this study have several important implications for clinical practice. First, the reproducibility of GP4 quantification is moderate (κ = 0.57) and is affected by the size of the PCa focus in the biopsy cores. Therefore, for a small focus of Grade Group 2 PCa, pathologists may consider not providing the percentage of GP4; instead, they should comment on the unreliability of GP4 quantification in such cases. Second, quantification of GP4 as 40–60% has a significant error rate that may affect not only the GP4 quantity but also the Grade Group, a caveat that both pathologists and clinicians need to be aware of. Assigning a smaller or larger percentage of GP4 imparts greater confidence in the quantity of GP4. In contrast, GP4 between 40% and 60% may indicate that a tumor is borderline between Grade Groups 2 and 3. Third, the quantification reproducibility is affected by the GP4 sub-patterns, with poorly formed glands and cribriform glands having the lowest and highest reproducibility, respectively. The GP4 sub-patterns, especially poorly formed glands, therefore need to be better defined morphologically. Methodology, i.e., the linear length vs the area of GP4, should be standardized. Education and training of pathologists may help improve the reproducibility of GP4 quantification.
Conclusions
The reproducibility of quantifying GP4 PCa is only moderate, and is significantly lower in a small focus of cancer and in cancer with a predominantly poorly formed glands sub-pattern. There is significant variability in quantifying GP4 when it is 40–60%. Both pathologists and clinicians should understand the limitations of GP4 quantification. More training and education for pathologists, and standardization of quantification methodology to improve GP4 quantification, are warranted.
Abbreviations
- C: cribriform
- CAP: College of American Pathologists
- F: fused
- G: glomeruloid
- GP: Gleason pattern
- GUPS: Genitourinary Pathology Society
- ISUP: International Society of Urological Pathology
- P: poorly formed
- PCa: prostate cancer
- WHO: World Health Organization
Declarations
Ethical statement
The study was performed in accordance with the ethical standards of the contributing authors’ institutions and with the Declaration of Helsinki (as revised in 2013), and was approved by the corresponding author’s Institutional Review Boards. The patients’ consents were waived as the slides were anonymized. The study described in this publication was conducted and reported in accordance with the guidance from the Committee on Publication Ethics (COPE) and practices according to the Recommendations for the Conduct, Reporting, Editing, and Publication of Scholarly work in Medical Journals from the International Committee of Medical Journal Editors (ICMJE).
Data sharing statement
The data set used in support of the findings of this study is available from the corresponding author at [email protected] upon request.
Funding
None.
Conflict of interest
Dr. Deng FM has been an editorial board member, and Dr. Zhou M has been an advisory board member of the Journal of Clinical and Translational Pathology since 2021. The other authors have no conflicts of interest related to this study.
Authors’ contributions
Study concept and design (JL and MZ), acquisition of data (JL, ME, AA, RB, KD, FD, PL, AM, JM, SM, YT, OY, RS, and MZ), analysis and interpretation of data (JL and MZ), drafting of the manuscript (JL and MZ), critical revision of the manuscript for important intellectual content (JL, ME, AA, RB, KD, FD, PL, AM, JM, SM, YT, OY, RS, and MZ), administrative, technical, or material support (JL and MZ), and study supervision (MZ). All authors made a significant contribution to this study and approved the final manuscript.