Introduction
Primary sclerosing cholangitis (PSC) is a progressive, immune-mediated cholestatic liver disease characterized by chronic inflammation and fibrosis of the intra- and extrahepatic bile ducts. This pathology frequently leads to biliary strictures and cirrhosis and confers a markedly elevated lifetime risk of cholangiocarcinoma.1,2 A hallmark of PSC is its strong clinical association with inflammatory bowel disease, particularly ulcerative colitis, implicating shared dysregulation of mucosal immunity and gut–liver axis interactions in its pathogenesis.3 This systemic dimension suggests that key molecular drivers of PSC may extend beyond the hepatic microenvironment and could be accessible in peripheral circulation.
The molecular etiology of PSC remains incompletely defined, a critical knowledge gap that hinders the development of mechanism-based diagnostics and targeted therapies.4 Current diagnostic paradigms rely primarily on cholangiographic imaging and liver biochemistry, modalities that often identify advanced disease stages and offer limited prognostic insight.5 Consequently, there is a pressing, unmet need to identify robust molecular biomarkers for early detection, risk stratification, and, crucially, for illuminating causal, druggable biological pathways.
Genomic studies have established a robust genetic architecture for PSC, with numerous risk loci identified through genome-wide association studies (GWAS), many of which are implicated in immune function.6,7 However, a fundamental challenge lies in translating these statistical genetic associations into actionable biological mechanisms. While expression quantitative trait loci analyses connect genetic variants to gene expression, proteins serve as the primary functional effectors of cellular processes and represent central targets for pharmacotherapy.8 The recent advent of large-scale plasma protein quantitative trait locus (pQTL) maps provides a transformative resource, enabling the direct linkage of genetic variation to circulating protein levels, a proximal and clinically actionable molecular layer for causal inference.9
Mendelian randomization (MR) employs genetic variants as instrumental variables to infer causal relationships between an exposure and an outcome, largely circumventing confounding and reverse causation biases inherent in observational studies.10 Integrating pQTL data with disease GWAS through MR methodologies offers a powerful and systematic framework for identifying circulating proteins with putative causal roles in disease pathogenesis.11
Here, we outline a multi-stage integrative framework designed to systematically discover and causally validate protein biomarkers for PSC. Our approach begins by leveraging bioinformatic and machine learning analyses of disease-relevant transcriptomic data to nominate a robust set of high-confidence candidate proteins. The central aim was then to rigorously test the causal role of these prioritized candidates in PSC pathogenesis. To achieve this, we employed two-sample MR, integrating large-scale plasma pQTL data with the latest PSC GWAS summary statistics. This comprehensive strategy was designed to bridge the gap between associative findings and causal mechanisms, with the ultimate goal of identifying high-confidence diagnostic biomarkers and revealing novel therapeutic targets for future validation.
Results
Identification of DEGs and co-expression modules in PSC
Comparative transcriptomic analysis of the PSC training cohort versus healthy controls identified 514 DEGs (|log2FC| > 0.379, FDR < 0.05) (Fig. 2A). Hierarchical clustering based on these DEGs demonstrated clear separation between PSC and control samples (Fig. 2B). WGCNA of these DEGs resolved several distinct co-expression modules (Fig. 2C). Module–trait relationship analysis identified multiple modules significantly correlated with the PSC phenotype, including MEpurple (r = 0.25, P = 0.009), MEbrown (r = 0.37, P = 8e-5), and MEblack (r = 0.42, P = 8e-6) (Fig. 2D). The MEblack module, demonstrating the strongest association, was selected for downstream analysis, comprising 159 high-confidence PSC-related candidate genes.
Functional enrichment implicates ribosome biogenesis and protein homeostasis
Functional enrichment analysis of the candidate genes from the MEblack module revealed significant terms related to ribosome biogenesis, cytoplasmic translation, and protein homeostasis (Fig. 3A and B). Gene Ontology analysis across biological process, cellular component, and molecular function categories highlighted prominent clusters including “ribosome biogenesis,” “cytoplasmic translation,” “structural constituent of ribosome,” “protein stabilization,” and “unfolded protein binding” (all FDR < 0.05) (Fig. 3C).
An integrated network and machine learning pipeline identifies a core diagnostic signature
The PPI network constructed from the candidate genes yielded 21 topologically central hub genes (Fig. 4A). These were refined using a multi-algorithm machine learning framework. Among 113 model combinations evaluated via stratified 10-fold cross-validation, an RF classifier achieved the highest and most generalizable performance, with a mean AUC of 0.908 across training and independent validation cohorts (Fig. 4B). In the training cohort, individual candidate genes showed diagnostic AUCs ranging from 0.728 to 0.792 (Fig. 4C).
Feature importance analysis within the optimal RF model identified a nine-gene diagnostic signature: PTMA, SUMO1, Shwachman-Bodian-Diamond syndrome (SBDS), RPL7, EIF1AX, ANP32A, PCNA, FAM98A, and MPHOSPH6. SHAP analysis quantified their contributions, identifying SUMO1 (mean |SHAP| = 0.0576) and PTMA (0.0519) as the most influential predictors (Fig. 4D and E). SHAP dependence plots illustrated the directional impact of each feature on the model’s prediction of PSC risk (Fig. 4F). A detailed force analysis for a representative sample showed SUMO1 and SBDS as primary negative regulators, substantially lowering the prediction score below the baseline expectation (Fig. 4G).
Consistent with their diagnostic role, all nine signature genes were significantly downregulated in PSC samples versus controls (Wilcoxon test, P < 0.05) (Fig. 4H). Strong positive pairwise co-expression correlations among these genes (e.g., SUMO1 vs. SBDS: r = 0.85, P < 0.001) suggested functional co-regulation within the PSC state (Fig. 4I).
The PSC immune microenvironment exhibits myeloid skewing and altered transcriptional–immune coupling
Deconvolution of the peripheral immune cell composition revealed profound remodeling in PSC, characterized by a shift toward myeloid expansion and lymphoid contraction (Fig. 5A).
Quantitatively, we observed a significant reduction in CD8+ T cells, γδ T cells, and activated NK cells (P < 0.05), alongside a marked expansion of neutrophils (P < 0.01) (Fig. 5B).
Integrated correlation analysis uncovered that expression of the nine-gene signature was selectively coupled to specific immune cell covariance modules, a pattern we term “transcriptional-immune circuit aliasing” (Fig. 5C). For instance, PCNA expression correlated positively with a CD8+ T cell/resting mast cell module but negatively with a neutrophil/M0 macrophage module. PTMA aligned with a monocyte/resting mast cell axis while anti-correlating with a neutrophil/plasma cell axis. SBDS expression was inversely linked to a γδ T cell/M2 macrophage module.
Inference of a dysregulated transcriptional regulatory network
Analysis of predicted transcription factor–hub gene interactions within the PSC cohort revealed distinct regulatory patterns (Fig. 6A). POU2F1 emerged as a central negative regulator, showing strong inverse correlations with multiple hub genes including RPL7 (r = −0.56, P < 0.001), EIF1AX (r = −0.52, P < 0.001), PCNA (r = −0.53, P < 0.001), and SBDS (r = −0.43, P < 0.01). In contrast, LMO2 exhibited significant positive correlations with ANP32A (r = 0.49, P < 0.001), SUMO1 (r = 0.48, P < 0.001), and SBDS (r = 0.29, P < 0.01).
A reconstructed regulatory interaction network positioned ANP32A as a major hub, densely connected to several repressive TFs (POU2F1, IRF4, MYB, EBF1), suggesting it may be a focal point for concerted transcriptional dysregulation (Fig. 6B). The network also highlighted a specific regulatory axis for SBDS, potentially under the control of NFIC.
Drug repurposing analysis highlights potential therapeutic agents
Drug enrichment analysis targeting the hub genes identified several pharmacological agents with significant associations (P < 0.01), suggesting repurposing potential (Fig. 6C and D). Notable findings included the association of the topoisomerase I inhibitor topotecan and the DNA polymerase inhibitor aphidicolin with PCNA, implicating DNA replication pathways. Agents with anti-inflammatory or redox-modulatory properties, such as hesperetin (linked to PTMA and ANP32A) and dihydroergotamine (associated with EIF1AX and ANP32A), were also enriched. Compounds with hepatobiliary relevance, including epirubicin hydrochloride (SUMO1) and the tyrosine kinase inhibitor vandetanib (PTMA), were identified, alongside antibiotics and NSAIDs such as cefixime (SUMO1) and indomethacin (EIF1AX).
MR identified plasma SBDS as a causal protective factor in PSC
Two-sample MR analysis was performed to assess the causal relationship between genetically predicted plasma levels of the signature proteins and PSC risk. SBDS protein demonstrated a significant causal association with PSC, and four conditionally independent SNPs (rs1354034, rs138783789, rs74490291, rs79344818; pairwise r2 < 0.01) served as instrumental variables.
Genetically predicted higher plasma SBDS levels demonstrated a protective causal effect on PSC risk, with an OR of 0.525 (95% CI: 0.356–0.773; P = 0.001) using the IVW method (Fig. 7A). A scatter plot of SNP-specific effects showed consistent directional estimates aligning with the IVW and weighted-median regression lines (Fig. 7B).
This causal inference was robust to sensitivity analyses. We found no significant heterogeneity among instrument-specific estimates (Cochran’s Q = 1.063, P = 0.786) and no evidence of directional pleiotropy (MR-Egger intercept = 0.028, P = 0.616) (Fig. 7C and D). Steiger directionality testing confirmed the assumed causal direction from SBDS to PSC risk (P = 0.007). Leave-one-out analysis confirmed that the association was not driven by any single influential SNP (Fig. 7E). These results provide robust genetic evidence that elevated plasma SBDS is causally associated with a reduced risk of PSC.
Discussion
In this study, we developed and applied an integrated multi-omics framework to delineate the causal molecular architecture of PSC. This systematic strategy culminated in the identification of a robust, peripherally accessible gene expression signature for disease diagnosis. Furthermore, it provided genetic evidence nominating the protein SBDS as a putatively causal protective factor, wherein genetically predicted higher plasma SBDS levels were associated with reduced disease risk. Collectively, these findings advance a novel mechanistic understanding of PSC, centered on ribosome homeostasis, and establish a direct translational pathway for biomarker development and therapeutic target identification.
The nine-gene diagnostic signature (PTMA, SUMO1, SBDS, RPL7, EIF1AX, ANP32A, PCNA, FAM98A, MPHOSPH6), refined from a broader candidate set through a stringent machine learning pipeline, constitutes a functionally cohesive molecular network whose activities converge on ribosome biogenesis, translational control, and nucleic acid metabolism. These interconnected processes are fundamental to cellular homeostasis and the adaptive stress response, with their collective dysregulation forming a compelling pathogenic axis in chronic inflammatory and fibrotic diseases.33,34
The most salient functional theme uniting this signature is ribosome biology. Central to this is SBDS, which encodes a key assembly factor essential for the maturation of the 60S ribosomal subunit and the maintenance of translational fidelity; its deficiency is a known cause of ribosomopathies.35,36 This core function is supported by RPL7, a canonical structural component of the 60S subunit,37 and EIF1AX, a crucial translation initiation factor.38 Their coordinated downregulation in PSC (Fig. 4H) strongly implies a systemic impairment of global protein synthesis capacity. Such a proteostatic deficit could underpin cellular dysfunction and heightened sensitivity to injury within the cholangiocyte microenvironment, a mechanism implicated in other forms of liver injury.34
This perturbation in core translational machinery is further modulated by signature genes governing post-translational modification and nuclear chaperone function. SUMO1 mediates SUMOylation, a dynamic post-translational modification critical for regulating protein stability, localization, and activity, with particular importance for factors involved in DNA damage and stress responses.39 Concurrently, ANP32A and PTMA function as histone chaperones and acidic nuclear phosphoproteins deeply involved in chromatin remodeling, transcription, and apoptosis.40,41 The dysregulation of this suite of nuclear regulators points to a concomitant disruption of nuclear homeostasis and gene expression control, potentially locking cells into a maladaptive stress state that exacerbates inflammation and hampers repair.
Expanding beyond translational and nuclear control, the signature also encompasses direct modulators of cell cycle progression and proliferative capacity. PCNA acts as an essential processivity factor for DNA polymerases, serving as a central platform that coordinates DNA replication and repair.42 MPHOSPH6 participates in mRNA export and cell cycle regulation, while FAM98A contributes to the assembly of protein complexes and methyltransferase activity.43 The coordinated downregulation of these pro-growth and pro-synthesis factors within the signature may indicate a fundamental rewiring of cellular physiology in PSC. This could represent a state of proliferative arrest, which may be a consequence of sustained inflammatory damage, or it may alternatively reflect an adaptive cellular strategy to mitigate stress by reducing metabolic and biosynthetic demand.34
While the nine-gene signature as a whole provides a powerful diagnostic tool through the combined interpretation of its gene expression levels, our subsequent MR analysis allowed us to probe the individual causal roles of its components at the protein level to uncover underlying disease mechanisms. It is crucial to distinguish the role of the multi-gene panel in achieving high diagnostic accuracy from the role of SBDS as a single, causally implicated factor for mechanistic understanding and therapeutic targeting. Among these components, SBDS emerges as the central and causally validated hub. Our MR analysis, which focused on genetically predicted higher plasma levels of SBDS protein, provides the first genetic evidence that these higher levels are causally protective against PSC (IVW OR = 0.525, P = 0.001). This elevates SBDS from a merely correlative biomarker to a putative upstream regulator within the pathogenic network, highlighting its potential as a therapeutic target. We propose that a deficit in SBDS function, potentially arising from genetic predisposition, inflammatory inhibition, or other mechanisms, could initiate a molecular cascade. This cascade would be characterized by ribosomal stress, triggering the coordinated downregulation of associated translational machinery (RPL7, EIF1AX). This primary defect could subsequently dysregulate downstream processes vital for protein stabilization (SUMO1, ANP32A, PTMA) and cellular replication (PCNA). Consequently, the SBDS-centric signature likely delineates a critical pathogenic axis in PSC, wherein compromised ribosome homeostasis acts as a nexus, linking fundamental cellular stress to the overt inflammatory and fibrotic disease phenotypes.36,44
The immune microenvironment in PSC, marked by prominent neutrophil expansion and a relative contraction of CD8+ T cells, is consistent with established reports of dominant innate immunity in this disease.45 Our analysis uncovers a deeper layer of dysregulation, which we term “transcriptional-immune circuit aliasing.” This phenomenon describes the selective coupling between the expression of key signature genes (e.g., PCNA, PTMA, and SBDS) and specific covariance modules of immune cells.22 This finding suggests that the diagnostic signature is not merely a passive reflection but is functionally embedded within, and may actively participate in regulating, the dysregulated immune crosstalk characteristic of PSC. For instance, the inverse correlation observed between SBDS expression and a module comprising γδ T cells and M2 macrophages points to a potential role for this ribosome assembly factor in modulating pro-fibrotic or immunosuppressive immune axes, a hypothesis that merits dedicated experimental inquiry.46
The translational implications derived from our integrative analysis are significant and fall into two distinct yet complementary categories. First, for diagnostic purposes, the nine-gene expression signature as a whole presents a compelling candidate for a minimally invasive, blood-based molecular assay. The combined predictive power of these nine markers, reflecting a functionally cohesive molecular network, underpins its high diagnostic accuracy (AUC = 0.908) for distinguishing or stratifying PSC patients. Second, on a mechanistic and therapeutic level, our genetic causal inference provides robust evidence for SBDS’s role. This elevates SBDS beyond its function as a component of the diagnostic panel to a high-confidence therapeutic target for intervention, wherein genetically predicted higher protein levels are causally protective. This distinction is vital: the signature offers a practical tool for diagnosis, while SBDS provides a specific, causally validated target for therapeutic intervention. Supporting this direction, our in silico drug repositioning analysis identified several existing compounds with favorable safety profiles that interact with signature gene products, indicating tangible repurposing avenues. Notable among these are the flavonoid hesperetin, associated with PTMA and ANP32A, and chemotherapeutic agents with known hepatobiliary distribution such as epirubicin, linked to SUMO1.32 Consequently, a pivotal future direction involves delineating the precise mechanistic role of SBDS in cholangiocyte pathobiology and immune cell function, coupled with screening for pharmacological agents capable of augmenting its expression or activity. This approach aims to restore ribosome homeostasis, representing a novel and mechanistically grounded therapeutic strategy for PSC.44
Several limitations merit consideration. First, while MR strongly supports causality, functional validation of SBDS in relevant in vitro and in vivo models is essential to confirm its mechanistic role in bile duct inflammation and fibrosis. Second, the primary data sources (GWAS, pQTL) are predominantly from European ancestry populations, which limits the generalizability of our findings across diverse ethnic groups.47 Future studies should incorporate multi-ancestry cohorts. Third, our analysis focused on plasma pQTLs; integrating protein QTL data from liver tissue or single-cell contexts could provide more tissue-specific mechanistic insights.48 Finally, prospective validation in independent, clinically well-characterized cohorts is needed to assess the real-world utility of the diagnostic signature.