Introduction
Colorectal cancer (CRC) is a leading cause of global cancer mortality, with epidemiological projections indicating a 60% increase in incidence by 2030, resulting in approximately 2.2 million new cases and 1.1 million fatalities annually.1 Histopathological assessment of tissue architecture using hematoxylin and eosin (H&E)–stained sections is the gold standard for evaluating glandular structures in various types of adenocarcinomas, including CRC,2 breast cancer,3 prostate cancer,4 and endometrial adenocarcinoma.5 Histopathologic grading relies heavily on the degree of gland formation: well- and moderately differentiated tumors are classified as low grade, retaining largely intact glandular architecture, whereas poorly differentiated tumors display markedly complex, abortive, or absent glandular formation and are associated with worse survival outcomes.6 Consequently, accurate gland segmentation in digitized whole-slide images (WSIs) is crucial for quantifying and characterizing glandular morphology, which directly informs tumor grading and risk stratification of colorectal and other types of adenocarcinomas.3–5
Fully supervised deep learning methods have set the benchmark for gland segmentation in histopathological images. Early work by DCAN7 established a multi-task learning framework that simultaneously segments glands and their contours to delineate benign, malignant, and closely apposed glands, and subsequent advances have incorporated domain-specific inductive biases, such as Gabor-based encoders and topology-aware networks.8,9 To address residual segmentation errors, especially in apposed glands, Xie et al.10 introduced a Deep Segmentation-Emendation model, which employs a dedicated emendation network to predict and correct inconsistencies in initial segmentation masks. While fully supervised methods demonstrate impressive performance, their effectiveness is intrinsically contingent on the availability of large-scale, pixel-level annotated datasets, a major bottleneck in clinical practice given the significant time and expertise required from pathologists.
This annotation burden has spurred interest in weakly supervised semantic segmentation (WSSS), which substantially reduces the demand for dense labels by leveraging weaker forms of supervision such as image-level labels or sparse annotations, thereby reducing annotation time by approximately sixty-fold.11,12 The predominant WSSS approach uses classification networks to generate class activation maps (CAMs) as initial pseudo-labels for training segmentation models.13–22 However, CAMs have inherent limitations; they tend to activate only the most discriminative regions of an object, resulting in pseudo-masks with ambiguous boundaries, noise, discontinuity, and structural fragmentation.17 Various CAM refinement techniques have been developed to address these limitations, including SEAM, which enforces spatial consistency,23 and AMR, which uses complementary activation branches to enhance under-activated regions.24 Similarly, Kweon et al.20 adopted an adversarial strategy, using an image reconstructor to force the classifier to generate more complete activation maps by minimizing inter-segment inferability. In medical imaging, domain-specific modifications such as C-CAM and MLPS have also been proposed.11,12,25
Nevertheless, a significant challenge persists in learning from the noisy, CAM-generated pseudo-masks in the subsequent segmentation stage. Although recent frameworks such as ARML attempt to address both CAM refinement and noisy label learning in histopathology,21 general WSSS methods still often underperform on gland segmentation due to the high morphological similarity between gland types and the critical need for precise instance boundaries. Therefore, there remains a significant methodological gap for a WSSS framework specifically designed to address the challenges of gland segmentation—namely, to generate high-quality, complete pseudo-masks from sparse annotations that can reliably guide the training of a dense segmentation model.
To bridge this gap, we introduce a novel weakly supervised teacher–student framework with progressive pseudo-mask refinement for multi-class gland segmentation in colorectal histopathology. The framework integrates an Exponential Moving Average (EMA)–stabilized teacher network with confidence-based filtering, curriculum-guided loss weighting, and adaptive pixel-wise fusion of sparse expert annotations to progressively discover and segment previously unannotated glandular structures.
Our contributions are threefold:
We introduce a pixel-wise pseudo-label fusion strategy that preserves pathologist-provided sparse annotations while leveraging EMA-stabilized teacher predictions to supervise unlabeled regions during self-training.
We propose a curriculum-driven refinement mechanism that combines cosine-decayed confidence thresholding with dynamic loss weighting, enabling progressive expansion of supervision from high-confidence gland regions to previously unannotated and ambiguous regions. This approach explicitly addresses annotation sparsity in the dense and morphologically complex setting of glandular histopathology.
We perform a comprehensive, clinically grounded multi-cohort evaluation reflecting real-world variability. The framework is validated on (i) an institutional dataset with sparse annotations, (ii) the fully annotated public Gland Segmentation (GlaS) benchmark,26 and (iii) three external cohorts—The Cancer Genome Atlas (TCGA) Colon Adenocarcinoma (COAD), Rectum Adenocarcinoma (READ), and SPIDER27—to assess cross-domain generalization. This multi-tiered evaluation demonstrates competitive performance relative to fully supervised methods and systematically characterizes robustness and failure modes under substantial domain shift, providing actionable insights for clinical translation.
Materials and methods
Study design and problem formulation
The proposed framework leverages the nnUNet backbone for robust semantic segmentation and comprises two identical networks: a student model (θS) trained via gradient descent using a supervised segmentation loss and a consistency regularization term, and a teacher model (θT) updated exclusively through an EMA of the student parameters, providing stable pseudo-labels that guide student learning. Formally, let x ∈ ℝ^(H×W×3) denote an input image and y ∈ {0, 1, …, C−1}^(H×W) the corresponding pixel-level segmentation mask, where C = 4 is the number of classes (background stroma, benign glands, malignant glands, and poorly differentiated clusters/glands). The goal is to learn a function fθ(x) that outputs pixel-wise class probabilities pθ(x) ∈ [0, 1]^(H×W×C) to segment both annotated and unannotated glandular structures at the pixel level.
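To make the formulation concrete, the following minimal numpy sketch shows the tensor shapes involved; the random logits and the standalone softmax are illustrative stand-ins for the nnUNet backbone fθ, not the actual implementation.

```python
import numpy as np

# Minimal sketch of the problem formulation; the random "logits" stand in for
# the output of the nnUNet backbone f_theta, which is not reproduced here.
H, W, C = 512, 512, 4  # C = 4: background stroma, benign, malignant, PDC/G

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(H, W, C))   # placeholder network output
probs = softmax(logits)               # p_theta(x) in [0, 1]^{H x W x C}
pred = probs.argmax(axis=-1)          # hard mask in {0, ..., C-1}^{H x W}
```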
Datasets
We conducted experiments on the in-house The Ohio State University Wexner Medical Center (OSUWMC) dataset containing limited pathologist annotations, as well as on the publicly available GlaS dataset with high-quality pixel-level annotations to demonstrate the broad applicability of the framework.26 Additionally, three external publicly available CRC histopathology datasets, TCGA-COAD, TCGA-READ, and SPIDER,27 were used to qualitatively assess the generalizability of the proposed framework on external cohorts where ground-truth (GT) annotations are not available.
OSUWMC in-house dataset
We used an in-house CRC histology dataset collected at OSUWMC, consisting of 60 H&E-stained WSIs from independent patients with histologically confirmed colorectal adenocarcinoma. All WSIs were retrospectively acquired from surgical resection specimens, scanned at 40× magnification, and annotated by two pathology residents using sparse pixel-level labels. The dataset comprises WSIs only; no patient-level clinical or demographic metadata (e.g., age, sex, tumor stage, grade, or treatment history) were collected or available, as this study focused exclusively on the technical development of weakly supervised segmentation algorithms rather than clinical outcome prediction. The annotations include four tissue categories: benign glands, malignant glands (better-formed tumor glands with obvious lumina), poorly differentiated clusters/glands (PDC/G; encompassing tumor buds, poorly differentiated clusters, and poorly formed tumor glands with absent or minimal lumina), and background stroma. The cohort captures a broad range of glandular morphologies, including well-formed glands, irregular malignant glands, and poorly differentiated structures, reflecting real-world histopathologic variability. From each WSI, 512 × 512-pixel patches were extracted at 5× magnification for model development. In total, 74,179 patches were generated and split into 63,191 training, 5,460 validation, and 5,528 test patches. Approximate class prevalence at the patch level was ∼45% benign glands, ∼35% malignant glands, ∼15% background stroma, and ∼5% PDC/G, with stratified sampling used to preserve class proportions across splits. Figure 1 shows representative samples from our in-house dataset, which contains sparse annotations for background stroma, benign glands, malignant glands, and poorly differentiated clusters/glands. Notably, most patches contained both annotated and unannotated glands, posing a significant challenge for accurate segmentation under weak supervision.
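As an illustration of the extraction geometry (40× native scans, 512 × 512-pixel patches at 5×, i.e., an 8× downsample), the following sketch enumerates patch coordinates on the downsampled image; the slide dimensions and the non-overlapping tiling are hypothetical assumptions.

```python
# Sketch of the patch-extraction geometry for the OSUWMC cohort: WSIs scanned
# at 40x, 512 x 512 patches taken at 5x (an 8x downsample of native resolution).
# The slide size and non-overlapping grid below are illustrative assumptions.
def patch_grid(width_40x, height_40x, patch=512, downsample=8):
    """Return top-left patch coordinates on the 5x image, dropping partial tiles."""
    w5, h5 = width_40x // downsample, height_40x // downsample
    return [(x, y)
            for y in range(0, h5 - patch + 1, patch)
            for x in range(0, w5 - patch + 1, patch)]

# A hypothetical 80,000 x 60,000-pixel 40x slide -> a 10,000 x 7,500 5x image,
# which yields a 19 x 14 grid of full 512-pixel tiles.
coords = patch_grid(80_000, 60_000)
```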
GlaS dataset
We subsequently conducted experiments using the GlaS dataset,26 a publicly available histological image collection released as part of the MICCAI 2015 Gland Segmentation Challenge. The dataset comprises 165 H&E-stained images extracted from 16 colorectal tissue sections, each obtained from a different patient diagnosed with stage T3 or T4 colorectal adenocarcinoma. All cases correspond to advanced-stage disease, and no earlier-stage tumors are included in the cohort.26 Per the official GlaS challenge protocol, patient-level demographic information (e.g., age, sex, exact TNM substage) is not provided with the dataset and is not required for the benchmark segmentation task, which is defined strictly at the image and pixel level. The images were scanned at 20× magnification with a native spatial resolution of 0.465 µm/pixel, and most have an original size of 775 × 522 pixels. Each image is accompanied by instance-level segmentation ground truth, providing precise delineation of glandular boundaries. Within each image, both benign and malignant glandular structures are present, reflecting the heterogeneous histologic architecture typical of advanced colorectal adenocarcinoma. The dataset is divided into 85 training images (37 benign and 48 malignant) and 80 test images (37 benign and 43 malignant). This benign/malignant distribution at the image level is reported in accordance with established benchmark practice and provides sufficient characterization for the segmentation task.26 To ensure consistency with prior work and standardize input resolution, all images were resized to 512 × 512 pixels. For model development, the training set was further partitioned into 70 images for training (∼82.4% of the training set) and 15 for validation (∼17.6% of the training set) using a stratified sampling strategy to preserve the benign–malignant class balance, while the 80 test images were used exclusively for final performance evaluation. 
Because pathological stage is fixed (T3–T4) across the dataset, no stage-based stratification was required during training or evaluation. A key challenge posed by GlaS is the substantial inter-subject variability in staining characteristics and tissue morphology, arising from differences in laboratory processing, which makes the dataset a rigorous benchmark for gland segmentation algorithms.
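The stratified 70/15 partition of the 85 GlaS training images (37 benign, 48 malignant) can be sketched as follows; the image identifiers and the per-class rounding rule are illustrative assumptions, not the exact split procedure used in the study.

```python
import random

# Hypothetical IDs for the 85 GlaS training images: 37 benign, 48 malignant.
images = [(f"train_{i:03d}", "benign") for i in range(37)] + \
         [(f"train_{i:03d}", "malignant") for i in range(37, 85)]

def stratified_split(items, val_total=15, seed=42):
    """Split each class in proportion so the 70/15 partition preserves the
    benign-malignant balance of the full training set."""
    rng = random.Random(seed)
    by_class = {}
    for item in items:
        by_class.setdefault(item[1], []).append(item)
    train, val = [], []
    for cls, members in sorted(by_class.items()):
        rng.shuffle(members)
        n_val = round(val_total * len(members) / len(items))  # proportional share
        val.extend(members[:n_val])
        train.extend(members[n_val:])
    return train, val

train, val = stratified_split(images)  # 70 training, 15 validation images
```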
Two-phase training protocol
Figure 2 illustrates a schematic overview of the proposed framework, which comprises two phases: a supervised warm-up phase and a teacher–student co-training phase.
Phase 1: Supervised warm-up
During the warm-up phase, the teacher network remains inactive, and the student network is trained solely on the available sparse annotations. This strategy ensures the student learns robust and meaningful representations that are essential for subsequent pseudo-label generation. The student network is optimized using a supervised loss (Lsupervised), defined as the sum of the Dice loss (Ldice) and the categorical cross-entropy loss (Lcce):
L_{\mathrm{supervised}} = L_{\mathrm{dice}} + L_{\mathrm{cce}}
Here, Ldice maximizes the overlap between predicted and ground-truth masks, while Lcce evaluates the pixel-wise classification accuracy across the C = 4 classes. Formally, these losses are defined as follows:

L_{\mathrm{dice}} = 1 - \frac{2\sum_{i} y_{i,c}\,\hat{y}_{i,c}}{\sum_{i} y_{i,c} + \sum_{i} \hat{y}_{i,c}}
L_{\mathrm{cce}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log \hat{y}_{i,c}
where N denotes the total number of pixels, y_{i,c} ∈ {0,1} indicates whether pixel i belongs to class c, and ŷ_{i,c} represents the predicted probability of class c at pixel i. The warm-up phase typically spans 20% to 25% of the total epochs, providing a stable initialization for the teacher network.

Phase 2: Teacher–Student co-training
Upon completion of the warm-up phase, the teacher is initialized with the student’s parameters, i.e., θT←θS. Subsequently, the student network (θS) is optimized via gradient descent, while the teacher network (θT) is updated using an EMA of the student’s weights. Formally, the teacher parameters are updated as follows:
\theta_T \leftarrow \beta\,\theta_T + (1 - \beta)\,\theta_S
where the EMA decay coefficient β is set to 0.999 to ensure temporally smooth teacher updates and to suppress short-term fluctuations in the student model. A high decay value is particularly important in weakly supervised dense segmentation settings, as it stabilizes pseudo-label generation and mitigates confirmation bias arising from noisy early predictions. This choice is consistent with prior teacher–student and Mean Teacher frameworks, which commonly adopt decay values in the range of 0.99–0.999 for segmentation tasks. During this phase, the student is trained using a hybrid loss that integrates supervised learning on labeled data with consistency regularization provided by the teacher:
L_{\mathrm{total}} = \alpha(t)\,L_{\mathrm{supervised}} + (1 - \alpha(t))\,L_{\mathrm{consistency}}
Here, α(t) is a dynamic, epoch-dependent weighting factor that governs the trade-off between the supervised and consistency losses. We employ a cosine-decaying schedule for α(t) to gradually shift emphasis from GT supervision to teacher-guided consistency. After warm-up, α(t) is initialized at 0.9, placing 90% reliance on the supervised loss, and decays to 0.01 by the end of training, progressively increasing reliance on teacher-generated pseudo-labels. The smooth cosine decay prevents abrupt transitions, reduces early over-reliance on noisy pseudo-labels, and enables stable late-stage refinement.

Teacher-generated pseudo-mask
The consistency term in the total loss encourages the student to align with the teacher's segmentation predictions on both labeled and unlabeled pixels. To ensure the reliability of the teacher-generated pseudo-labels, we employ a confidence-based filtering mechanism that suppresses low-confidence or ambiguous pseudo-labels, particularly during the early phase of training. Formally, the confidence mask is defined as follows:
m(x) = \mathbb{1}\!\left[\max_{c}\,\sigma\!\left(f_{\theta_T}(x)\right) > \tau_{\mathrm{confidence}}(t)\right]
where σ(·) is the softmax function, 𝟙[·] is the indicator function, and τ_confidence(t) is a cosine-decaying threshold that monotonically decreases from 0.95 to 0.25 over the course of training. The high initial threshold restricts supervision to only the most confident teacher predictions while the teacher model is still stabilizing, and the gradual relaxation allows progressively more ambiguous regions, such as gland boundaries and poorly differentiated structures, to be incorporated as training proceeds. This curriculum-guided design enables stable expansion of pseudo-label coverage while minimizing noise propagation; the bounds were selected empirically to emphasize high-confidence teacher supervision early in training and to gradually incorporate pseudo-labels for unlabeled regions as the teacher stabilizes. To maximally leverage the sparse annotations, the teacher-generated pseudo-labels are fused with the GT labels using a pixel-wise integration strategy:
\tilde{m}(x) = \begin{cases} \mathrm{GT}(x), & \text{if } \mathrm{GT}(x) > 0 \\ m(x), & \text{otherwise} \end{cases}
This formulation ensures that pathologist-provided annotations are preserved exactly in labeled regions, while teacher-generated pixel-level pseudo-masks supervise the unlabeled regions. This fusion strategy is employed only after the teacher model has reached sufficient stability. The consistency loss is defined as follows:
L_{\mathrm{consistency}} = \left\lVert \sigma\!\left(f_{\theta_S}(x)\right) - \tilde{m}(x)\right\rVert^{2}
where σ(·) denotes the softmax function, converting the student network's output logits into per-pixel class probabilities. We employ a mean squared error over these probabilities for consistency regularization, which empirically stabilizes training and reduces sensitivity to early-stage noise in the pseudo-labels.

Baselines
To benchmark the efficacy of our proposed framework, we evaluated its performance against a comprehensive set of existing methods, including twelve WSSS and eleven fully supervised segmentation approaches.28–30 The WSSS baselines include SEAM,23 ReCAM,19 AMR,24 MLPS,12 OEEM,31 AME-CAM,32 HAMIL,33 CBFNet,34 MPFP,29 Adv-CAM,35 SC-CAM,13 and MAA.28 The fully supervised baselines consist of widely used architectures: UNet,36 Seg-Net,37 MedT,38 TransUNet,39 Attention UNet,40 UNet++,41 KiU-Net,42 ResUNet++,43 DA-TransUNet,39 TransAttUNet,44 and EWASwin UNet.30 To ensure a fair comparison, we adhered to the experimental protocols and key hyperparameters (e.g., patch size) specified in the respective original baseline publications.28–30
Implementation details
All experiments were conducted using PyTorch 1.13.1 with CUDA 11.7 on Python 3.10. Training was performed on NVIDIA A100 GPUs. To ensure reproducibility, the random seed was fixed at 42 across all libraries (Python, NumPy, PyTorch, and CUDA), and deterministic algorithms were enforced. However, to quantify statistical variability, we performed five independent training runs with different random seeds for all experiments and report the mean ± standard deviation across these runs. The models were trained using the AdamW optimizer with an initial learning rate of 0.01 and a weight decay of 0.001.45 A cosine annealing schedule was employed to decay the learning rate to a minimum of 0.00001. We used a batch size of 16 and an input patch resolution of 512×512 pixels. To stabilize training, gradient clipping was applied with a maximum norm of 1.0. To enhance generalization, we utilized a comprehensive data augmentation strategy, including random discrete rotations (0°,90°,180°,270°), horizontal flipping (P = 0.5), hue–saturation–value jittering, Gaussian noise, and Gaussian blur, followed by standard ImageNet normalization.46 The maximum training duration was set to 250 epochs, with an early stopping mechanism triggered to prevent overfitting if validation performance did not improve for 50 consecutive epochs.
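For concreteness, the two-phase objective described above can be condensed into a short, framework-agnostic sketch. The snippet below uses numpy for brevity; the flattened parameter dictionary, the helper names, and the toy shapes are illustrative assumptions rather than the actual PyTorch/nnUNet implementation.

```python
import numpy as np

def cosine_decay(t, T, start, end):
    """Cosine schedule shared by alpha(t) (0.9 -> 0.01) and
    tau_confidence(t) (0.95 -> 0.25) after warm-up."""
    return end + 0.5 * (start - end) * (1.0 + np.cos(np.pi * t / T))

def ema_update(theta_t, theta_s, beta=0.999):
    """Teacher update: theta_T <- beta * theta_T + (1 - beta) * theta_S."""
    return {k: beta * theta_t[k] + (1.0 - beta) * theta_s[k] for k in theta_t}

def dice_cce(probs, onehot, eps=1e-7):
    """L_supervised = L_dice + L_cce over (H, W, C) probability/one-hot maps."""
    inter = (probs * onehot).sum(axis=(0, 1))
    dice = 1.0 - (2.0 * inter /
                  (probs.sum(axis=(0, 1)) + onehot.sum(axis=(0, 1)) + eps)).mean()
    cce = -np.mean((onehot * np.log(probs + eps)).sum(axis=-1))
    return dice + cce

def fused_pseudo_mask(p_teacher, gt, tau):
    """Confidence filtering plus pixel-wise GT fusion: annotated pixels keep
    their labels; confident teacher pixels supervise the rest."""
    confident = p_teacher.max(axis=-1) > tau
    pseudo = p_teacher.argmax(axis=-1)
    fused = np.where(gt > 0, gt, pseudo)
    supervised = (gt > 0) | confident       # pixels that receive supervision
    return fused, supervised

def total_loss(p_student, p_teacher, gt, onehot_gt, t, T, C=4):
    """L_total = alpha(t) * L_supervised + (1 - alpha(t)) * L_consistency."""
    alpha = cosine_decay(t, T, 0.9, 0.01)
    tau = cosine_decay(t, T, 0.95, 0.25)
    fused, supervised = fused_pseudo_mask(p_teacher, gt, tau)
    fused_onehot = np.eye(C)[fused]
    l_sup = dice_cce(p_student, onehot_gt)
    sup_err = ((p_student - fused_onehot) ** 2)[supervised]
    l_cons = sup_err.mean() if sup_err.size else 0.0
    return alpha * l_sup + (1.0 - alpha) * l_cons
```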
Evaluation metrics
We employed two metrics widely adopted in gland segmentation47: mean Intersection over Union (mIoU) and mean Dice coefficient (mDice). Both metrics are derived from pixel-level classification outcomes, where each pixel is categorized as true positive (TP), false positive (FP), or false negative (FN) with respect to the GT annotation; class-wise scores computed from these counts are averaged to obtain the mean values. The mIoU measures the overlap between the predicted and GT gland regions and is defined as:
\mathrm{mIoU} = \frac{TP}{TP + FP + FN}
The mDice evaluates the similarity between the predicted mask and the ground truth and is formulated as:
\mathrm{mDice} = \frac{2\,TP}{2\,TP + FP + FN}
Both metrics are normalized to the range [0,1], where a value of 1 indicates perfect alignment between the prediction and the ground truth, and 0 implies no overlap. Higher scores correspond to superior segmentation accuracy and better boundary delineation.
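A minimal numpy sketch of how these metrics are computed from pixel counts, with per-class IoU and Dice averaged over the classes present; the toy masks are illustrative.

```python
import numpy as np

def miou_mdice(pred, gt, num_classes=4):
    """mIoU and mDice: per-class IoU/Dice from pixel-level TP/FP/FN counts,
    averaged over the classes present in prediction or ground truth."""
    ious, dices = [], []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        if tp + fp + fn == 0:   # class absent everywhere: skip it
            continue
        ious.append(tp / (tp + fp + fn))
        dices.append(2 * tp / (2 * tp + fp + fn))
    return float(np.mean(ious)), float(np.mean(dices))

# Toy 2 x 2 masks with two classes for illustration.
pred = np.array([[0, 1], [1, 1]])
gt = np.array([[0, 1], [0, 1]])
miou, mdice = miou_mdice(pred, gt, num_classes=2)
```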
Results
We assessed the proposed framework on the public GlaS dataset with dense annotations and on the in-house OSUWMC cohort to evaluate performance under sparse-label conditions. Generalization beyond the training domain was examined by applying the model trained on the OSUWMC cohort to the TCGA-COAD, TCGA-READ, and SPIDER datasets. As GT annotations are unavailable for these external cohorts, evaluation was limited to qualitative analysis. The proposed framework was compared against a broad range of state-of-the-art approaches, including weakly supervised methods (summarized in Table 1)12,13,19,23,24,28,29,31–35 and fully supervised architectures (summarized in Table 2).30,36–44
Table 1. Comparison with weakly supervised gland segmentation methods on the GlaS dataset
| Method | Year | mIoU (%) | mDice (%) |
|---|---|---|---|
| SEAM23 | 2020 | 71.36 ± 0.49 | 79.59 ± 4.88 |
| ReCAM19 | 2022 | 56.31 ± 2.53 | – |
| AMR24 | 2022 | 72.83 ± 0.37 | – |
| MLPS12 | 2022 | 73.60 ± 0.16 | – |
| OEEM31 | 2022 | 76.48 ± 0.10 | 83.40 ± 5.36 |
| AME-CAM32 | 2023 | 74.09 ± 0.13 | – |
| HAMIL33 | 2023 | 77.37 ± 0.73 | – |
| CBFNet34 | 2024 | 76.30 ± 0.26 | – |
| MPFP29 | 2025 | 80.44 ± 0.05 | – |
| Adv-CAM35 | 2021 | 68.54 ± 3.36 | 81.33 ± 5.26 |
| SC-CAM13 | 2020 | 71.52 ± 3.50 | 83.40 ± 5.36 |
| MAA28 | 2025 | 81.99 ± 2.26 | 90.10 ± 3.31 |
| Ours | – | 80.10 ± 1.52 | 89.10 ± 2.10 |
Table 2. Comparison with fully supervised gland segmentation methods on the GlaS dataset
| Method | Year | mIoU (%) | mDice (%) |
|---|---|---|---|
| UNet36 | 2015 | 64.8 | 77.6 |
| Seg-Net37 | 2017 | 66.0 | 78.6 |
| MedT38 | 2021 | 69.6 | 81.0 |
| TransUNet39 | 2021 | 70.1 | 81.5 |
| AttentionUNet40 | 2018 | 70.1 | 81.6 |
| UNet++41 | 2018 | 70.2 | 81.9 |
| KiU-Net42 | 2020 | 72.8 | 83.3 |
| ResUNet++43 | 2019 | 73.8 | 84.1 |
| DA-TransUNet39 | 2024 | 75.6 | 85.3 |
| TransAttUNet44 | 2023 | 77.7 | 86.7 |
| EWASwin UNet30 | 2025 | 81.5 | 89.4 |
| Ours | – | 80.1 | 89.1 |
Performance against weakly supervised methods
As Table 1 summarizes, our framework achieves competitive performance on the GlaS benchmark,28,29 with an mIoU of 80.10% and an mDice of 89.10%; the mIoU is only slightly below that of the leading MAA method. Notably, the proposed framework demonstrates markedly superior training stability, evidenced by lower variance (±1.52 mIoU, ±2.10 mDice) compared to MAA (±2.26 mIoU, ±3.31 mDice). Such consistency is a critical prerequisite for clinical translation and underscores the robustness of our pseudo-label refinement strategy.
Performance against fully supervised methods
As Table 2 summarizes, our framework achieves competitive performance compared to fully supervised state-of-the-art methods on GlaS.30 Specifically, it attains an mIoU of 80.1% and an mDice of 89.1%, on par with the top-performing supervised baseline, EWASwin UNet (81.5% mIoU).30 Moreover, our framework surpasses traditional architectures such as UNet++ and ResUNet++ (mIoU ∼70–74%) as well as more advanced models such as TransAttUNet (77.7% mIoU).41,43,44 These quantitative findings are corroborated by the qualitative comparisons in Figure 3 (with additional examples in Fig. 4), which demonstrate the model’s ability to generate precise segmentation masks.30 Notably, these results underscore the capability of our weakly supervised framework to leverage sparse annotations effectively and achieve performance competitive with leading fully supervised methods.
Results on the OSUWMC dataset and out-of-domain generalization to TCGA-COAD, TCGA-READ, and SPIDER
Figure 5 illustrates the qualitative performance of our framework on the in-house OSUWMC dataset. These visualizations demonstrate how the stabilized teacher network effectively guides the student model via pseudo-masks, enabling the discovery and precise segmentation of unannotated gland structures using only limited supervision. To evaluate robustness and clinical transferability, we performed whole-slide inference on three external cohorts: TCGA-COAD, TCGA-READ, and SPIDER. Despite significant inter-institutional variations in staining protocols and scanner characteristics, our model maintained consistent qualitative performance (see Fig. 6), successfully identifying benign glands, malignant glands, and poorly differentiated clusters/glands on TCGA-COAD and TCGA-READ. In contrast, on the SPIDER dataset, we observed notable qualitative degradation, characterized by fragmented gland boundaries, increased FPs in stromal regions, and reduced sensitivity to poorly differentiated glandular structures. Quantitative evaluation was not performed, as pixel-level gland annotations are not available for these datasets. The observed degradation is therefore reported qualitatively and is attributed to severe domain shift, including lower image quality, pronounced staining heterogeneity, and higher morphological variability inherent to that cohort, highlighting the challenges of cross-domain generalization in histopathology.48,49
Discussion
We developed and validated a novel weakly supervised teacher–student framework for multi-class gland segmentation in CRC histopathology. The core of our approach lies in an EMA-stabilized teacher network, which employs confidence-based filtering and an adaptive fusion strategy to iteratively refine pseudo-masks, thereby guiding the student network with increasingly reliable supervision. Notably, the competitive results on the GlaS benchmark indicate that access to high-quality annotations allows our framework to further narrow the performance gap with fully supervised methods. In clinical settings, where dense, pixel-level annotation remains a major bottleneck, our framework offers practical benefits by substantially reducing annotation requirements while maintaining strong segmentation performance. Furthermore, its ability to generalize to TCGA-COAD and TCGA-READ without additional fine-tuning underscores its potential for multi-center application, where variations in staining and scanning protocols are common.
However, the performance drop on SPIDER highlights the well-known challenge of domain generalization in computational pathology.48,49 While our method generalizes well to TCGA-COAD and TCGA-READ domains (similar domains), SPIDER represents a severe domain shift that likely requires explicit domain adaptation techniques. Future work will focus on incorporating advanced domain adaptation strategies to improve broader cross-institutional generalization.48,49 Additionally, we plan to extend the framework to other adenocarcinoma types, such as prostate, breast, and lung cancers, where glandular segmentation is equally critical for diagnosis and grading. By further reducing reliance on manual annotations while maintaining high segmentation fidelity, our framework offers a scalable and practical pathway toward wider adoption of computational pathology tools in clinical workflows.
Limitations
Despite the promising performance of the proposed framework, several limitations merit consideration. First, the OSUWMC dataset lacks patient-level clinical metadata, precluding clinicopathologic correlation analyses. While this does not affect the technical validity of pixel-wise segmentation evaluation, it limits assessment of downstream prognostic or clinical utility. Second, although results are reported as mean ± standard deviation across independent runs, additional statistical measures, e.g., confidence intervals, will be investigated in future work. Third, the performance degradation observed on the SPIDER dataset highlights the impact of severe domain shift; addressing this limitation will likely require explicit domain adaptation or stain normalization strategies. Finally, while the proposed framework substantially reduces annotation burden, it still relies on limited sparse expert annotations. Achieving fully annotation-free segmentation remains an open and important direction for future research.
Conclusions
In this study, we introduced a novel weakly supervised teacher–student framework for multi-class gland segmentation in CRC histopathology, specifically designed to overcome the critical bottleneck and demand for extensive pixel-level annotations. By leveraging an EMA-stabilized teacher network, our framework efficiently utilizes sparse annotations, progressively refining pseudo-labels through confidence-based filtering and adaptive GT fusion. Comprehensive evaluation demonstrates that our framework achieved performance competitive with state-of-the-art methods in both weakly and fully supervised settings. Furthermore, the model exhibits strong in-house performance and robust generalization to external cohorts, including TCGA-COAD and TCGA-READ. While performance limitations on SPIDER highlight challenges under extreme domain shift, overall this work establishes an annotation-efficient paradigm that directly addresses a fundamental impediment in computational pathology. By substantially reducing reliance on costly manual curation while maintaining high segmentation fidelity, the proposed framework offers a practical, translatable solution to accelerate the adoption of automated diagnostic tools in clinical workflows.
Declarations
Ethical statement
The use of the in-house OSUWMC dataset was approved by the Institutional Review Board of The Ohio State University Wexner Medical Center (IRB No. 2018C0098). Written informed consent was obtained from all patients or was waived by the IRB due to the retrospective nature of the study. Public datasets (TCGA and SPIDER) were used in compliance with their respective data usage agreements and ethical guidelines and do not require additional institutional approval. All procedures were performed in accordance with the ethical standards of the Declaration of Helsinki (as revised in 2024).
Data sharing statement
The in-house OSUWMC dataset used in this study is available upon reasonable request by contacting the corresponding author, Hikmat Khan, at Hikmat.Khan@osumc.edu. All code was implemented in Python using PyTorch as the primary deep-learning library. The complete pipeline for processing WSIs, as well as training and evaluating the deep-learning models, will be available at: https://github.com/hikmatkhan/gland-segmentation-teacher-student.
Funding
This project was supported by R01 CA276301 (PIs: Niazi, Chen) from the National Cancer Institute. The project was also supported by The Ohio State University Comprehensive Cancer Center, Pelotonia Research Funds, and the Department of Pathology. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the National Cancer Institute.
Conflict of interest
The authors declare no competing interests.
Authors’ contributions
Leadership, experimental design, data analysis, figure and table preparation, and manuscript drafting (HK); funding acquisition and writing, review, and editing (WC, MKKN). All authors have approved the final version and publication of the manuscript.