Introduction
Critical care demands rapid, specialized medical interventions informed by constantly evolving patient data. Decisions such as whether to administer additional intravenous fluids, adjust vasopressor dosing,1 or modify positive end-expiratory pressure (PEEP) can determine whether an unstable patient moves toward recovery or deteriorates further.2 Clinical decision-making in the intensive care unit (ICU) is characterized by high risk, minimal tolerance for error, and rapidly evolving patient conditions. Each patient presents a unique clinical trajectory; even within the same individual, physiological states can change significantly over short time intervals. Modern ICUs generate vast amounts of heterogeneous data, including laboratory results, vital signs, free-text clinical notes, high-frequency physiological waveforms, imaging, and device outputs. Artificial intelligence (AI) has created unprecedented opportunities to harness these data and potentially guide patient management in real time.3,4
In this context, personalization goes far beyond broad risk prediction or standardized protocols.5 It entails tailoring therapy to the unique physiological state, comorbidities, and evolving trajectory of each patient. For example, the optimal fluid resuscitation strategy for one patient with septic shock may differ dramatically from that for another,6 depending on cardiac function, vascular tone, prior fluid balance, and even biomarker or genomic profiles. Similarly, ventilator management in acute respiratory distress syndrome (ARDS) is not a fixed recipe but a dynamic decision that balances oxygenation, lung protection, and hemodynamic stability, with careful consideration of patient-specific factors.7 To achieve this level of nuance, clinicians need tools that can move from population-level averages to individualized estimates of treatment response: tools that can anticipate not only the risks a patient faces but also the consequences of clinical actions for that particular patient at that particular time.
AI provides the technical foundation for enabling such personalized care.5,8 By leveraging multimodal ICU data streams, predictive models can forecast deterioration while causal machine learning (ML) approaches can estimate the likely impact of specific interventions.9,10 Reinforcement learning (RL) extends this further by optimizing sequences of decisions across time to maximize long-term patient outcomes.11 Together, these approaches enable AI systems to move beyond static prediction toward dynamic, adaptive decision support that accounts for patient heterogeneity, temporal evolution, and counterfactual reasoning.12 In doing so, AI can complement established clinical decision-making by enhancing the ability of clinicians to deliver proactive and individualized therapy.8
This review examines how such a transformation might unfold, contrasting current predictive AI approaches that rely on population-level or average treatment effects with prescriptive AI methods that model the causal impact of interventions and their temporal sequencing to enable patient-specific decision making (Fig. 1). By integrating causal inference and reinforcement learning, prescriptive AI recommends patient-specific treatment strategies that adapt to evolving physiology, thus enabling true personalization in critical care. We begin by surveying the current state of AI in critical care, highlighting both the achievements and limitations of existing predictive models. Then, we turn to the question of what it means to achieve true personalization, focusing on RL and causal ML as complementary methods capable of estimating individualized treatment effects and optimizing sequential decision-making. Next, we consider the major challenges that arise in implementing these methods in real-world ICUs, spanning methodological, technical, clinical, ethical, and regulatory domains. Finally, we look ahead to future directions (digital twins,13 causality-aware foundation models, and clinician-in-the-loop systems14) before concluding with a vision for how AI can help deliver on the long-standing promise of critical care: providing the right treatment, for the right patient, at the right time.
Current state of AI use in critical care
In recent years, AI has begun to make inroads into the practice of critical care, although most applications to date have focused on prediction rather than providing direct guidance in clinical action. The earliest and most widespread examples are early warning systems that forecast patient deterioration.15–17 By analyzing temporal trends in vital signs, laboratory data, and clinical documentation, these models provide advance notice of events that might otherwise become apparent only after significant physiological decline. Parallel lines of work have focused on predicting specific outcomes—such as acute kidney injury (AKI),18 sepsis,6 or mortality15,16—thus offering risk stratification that can support clinical vigilance and guide the allocation of resources.
AI techniques have also been used to stratify critically ill patients into subpopulations, revealing hidden patterns within syndromes traditionally treated as uniform. Such models have identified subphenotypes of sepsis, AKI, and acute respiratory distress syndrome,7 highlighting biologically and clinically distinct patient clusters. These discoveries are valuable both at the bedside, where they may help clinicians tailor therapies to individual physiology, and in the design and interpretation of clinical trials, where failure to account for heterogeneity has often obscured actual treatment effects.
The increasing sophistication of ML methods has also made it possible to integrate the ICU’s diverse data sources into shared representations of patient state.19–22 Rather than relying on a limited set of vital signs or laboratory values, contemporary models can integrate data obtained from structured electronic health records,23 high-frequency physiological waveforms,24 imaging studies, and even free-text notes. Time-series encoders—such as recurrent neural networks,25 temporal convolutional networks,26 and transformers27—have been applied to handle the irregular sampling and sparse dynamics of ICU data.22 By leveraging clinical documentation, large language models (LLMs) can capture nuanced information regarding comorbidities, goals of care, and narrative context,28,29 which are often missed in structured electronic health record data.30 Meanwhile, vision-based networks trained on chest radiographs or computed tomography scans have been combined with clinical and physiological embeddings,31 thus creating multimodal models that more closely approximate the manner in which intensivists synthesize information in practice.22,32,33
Despite this progress, most current systems remain predictive rather than prescriptive. While they estimate the likelihood of outcomes, they stop short of recommending specific interventions to alter those outcomes. A model may forecast that a patient has a high probability of requiring mechanical ventilation within six hours, but it does not advise whether to escalate noninvasive support, adjust sedation, or initiate prone positioning to change that trajectory. Similarly, a system that predicts impending hypotension cannot independently determine whether fluids, vasopressors, or both are most appropriate. This gap between prediction and action emphasizes a central limitation of current AI in critical care and points toward the need for approaches that explicitly model causality and sequential decision-making. Only by moving in this direction can AI evolve from a passive risk calculator into an active partner in delivering personalized critical care.
Achieving true personalization
The promise of personalized critical care is to move from generalized prediction to individualized prescription. Rather than merely anticipating adverse outcomes, true personalization requires understanding how specific interventions will affect specific patients at specific times and then using this information to guide the sequences of clinical decisions as the illness evolves. Two complementary families of methods, causal ML and RL, provide the methodological foundation for this transition to prescriptive or navigational AI.
Causal ML
Causal ML estimates the heterogeneous effects of interventions rather than simple associations between covariates and outcomes. It does so by defining explicit causal estimands, such as the average treatment effect, the individualized treatment effect, and the value of an individualized treatment rule, which correspond to questions regarding how a particular intervention would change outcomes for a given patient or population.5,34,35 In critical care, treatments are rarely assigned at random; clinicians select interventions based on evolving patient states. This introduces confounding, as the sickest patients may be more likely to receive a certain treatment, thus making it difficult to distinguish the effect of the intervention from the underlying severity of the illness. The credibility of these estimands relies on assumptions such as conditional exchangeability, positivity, consistency, and appropriate model specification, all of which warrant explicit consideration in ICU data. In practice, these assumptions are commonly strained by unmeasured confounding, limited overlap produced by entrenched treatment patterns, time-varying confounding as physiology evolves, coarse timing of interventions, and measurement error inherent in routine clinical documentation. Causal ML methods aim to address these problems by explicitly modeling the data-generating process and estimating individualized treatment effects.
Approaches such as propensity score matching and inverse probability weighting enable researchers to estimate what outcomes would have occurred under different treatment strategies.34–37 The threats to validity described above can be diagnosed or mitigated using overlap and balance assessments34; quantitative sensitivity analyses38; negative controls39; instrumental variable methods40; and longitudinal causal methods, such as marginal structural models or g-formula estimators, to address time-varying confounding.41,42 Causal ML methods, such as causal forests and meta-learners,43,44 have been used to identify heterogeneity in treatment response across patients, effectively moving from average treatment effects to conditional treatment effects. In addition, targeted learning frameworks and longitudinal g-methods provide principled, theory-based approaches for estimating individualized effects in complex, high-dimensional clinical settings.45,46 Applied to the ICU, these techniques can help answer questions such as whether additional fluid resuscitation is likely to help or harm a specific patient in septic shock or whether a higher PEEP strategy will improve oxygenation without worsening hemodynamics in a patient with ARDS. By grounding predictions in counterfactual reasoning, causal ML provides individualized estimates that go beyond traditional predictive models and support patient-specific treatment decisions. For example, while classical predictive models can forecast the likelihood of AKI or fluid overload,47,48 they cannot tell us whether giving or withholding fluids will improve outcomes for a specific patient. In a recent study from our group, causal ML was used to tackle precisely this problem in septic patients with AKI.49 By leveraging causal forests to estimate individualized treatment effects and applying a policy tree to make those effects interpretable, the study identified subgroups of patients most likely to benefit from a restrictive fluid strategy.
In both development and external validation cohorts, patients who were predicted to benefit and who actually received restrictive fluids had higher rates of AKI reversal and fewer adverse kidney events. This work exemplifies how causal ML can move beyond population-level associations to individualized, counterfactual predictions that inform treatment strategies tailored to the patient in front of us. While promising, this analysis remains observational and is therefore susceptible to residual confounding, which emphasizes the need for prospective evaluation before such treatment policies are applied in clinical practice.
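The confounding problem described in this section, and the way inverse probability weighting addresses it, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the cohort is synthetic, the single binary confounder (`severity`) is taken to be fully measured, and the function names are hypothetical. It is not the method used in the fluid-strategy study cited above.

```python
from collections import defaultdict

def make_cohort():
    """Synthetic cohort of (severity, treated, outcome) rows. Illustrative numbers only.

    Sicker patients (severity=1) are treated more often and have worse outcomes,
    while the treatment improves the outcome by 0.1 in both strata.
    """
    rows = []
    for severity, treated, n, outcome in [
        (0, 1, 20, 0.9), (0, 0, 60, 0.8),
        (1, 1, 60, 0.5), (1, 0, 20, 0.4),
    ]:
        rows += [(severity, treated, outcome)] * n
    return rows

def naive_difference(rows):
    """Unadjusted difference in mean outcome between treated and untreated patients."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

def estimate_propensity(rows):
    """Probability of treatment within each severity stratum, with a positivity check."""
    counts = defaultdict(lambda: [0, 0])  # severity -> [n_treated, n_total]
    for severity, treated, _ in rows:
        counts[severity][0] += treated
        counts[severity][1] += 1
    propensity = {s: n_t / n for s, (n_t, n) in counts.items()}
    if any(e <= 0.0 or e >= 1.0 for e in propensity.values()):
        raise ValueError("positivity violated: a stratum is always or never treated")
    return propensity

def ipw_ate(rows):
    """Inverse probability weighted estimate of the average treatment effect."""
    e = estimate_propensity(rows)
    n = len(rows)
    mean_treated = sum(t * y / e[s] for s, t, y in rows) / n
    mean_control = sum((1 - t) * y / (1 - e[s]) for s, t, y in rows) / n
    return mean_treated - mean_control

cohort = make_cohort()
print(round(naive_difference(cohort), 3))  # unadjusted comparison
print(round(ipw_ate(cohort), 3))           # adjusted for severity
```

In this toy cohort the unadjusted comparison makes the treatment look harmful because treated patients are sicker, whereas weighting by the inverse of the estimated propensity recovers the benefit built into the data. With unmeasured confounders, which are common in ICU records, no such guarantee holds.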
RL
While causal ML addresses the effect of a single intervention, RL extends the framework to optimize sequences of interventions over time.50 Critical care is inherently dynamic: insulin infusions are titrated hourly, ventilator settings are adjusted as lung mechanics evolve, and fluid/vasopressor balance is revisited with every lab and vital sign update. Each decision influences both the immediate physiology and the trajectory of the illness. RL formalizes this process as a Markov decision process,51 comprising the following core elements:
States (s)
The set of all possible patient conditions at a given time, which may include demographics, comorbidities, vital signs, labs, mechanical ventilation parameters, and medications. As ICU data provide only a partial view of the underlying physiological state, representation learning plays a central role in offline RL. Deep sequence models—such as recurrent networks, temporal convolutional networks, and Transformers, as well as multimodal encoders that integrate labs, vitals, waveform data, imaging, and clinical text—can help recover latent patient trajectories from noisy, irregularly sampled observations.27,52–54 Contrastive learning can further improve robustness to missingness and sensor dropout.55 Once deployed, these learned state representations must be monitored for drift, for example by tracking shifts in embedding distributions, KL divergence or cosine similarity relative to training distributions, model performance on stability anchors, or abrupt changes in representation clustering that may reflect evolving practice patterns or patient mix.56 Such monitoring is essential because changes in the representation space can invalidate the learned policy even when raw input features appear stable.
Actions (a)
The set of all possible interventions available to the clinician, for example, giving intravenous fluids, vasopressors, or insulin, or adjusting PEEP.
Transitions (T)
The transition probabilities that map a given state and action to the next state, thus reflecting the patient’s physiological response and random variability.
Rewards (r)
These are the immediate benefits or costs associated with moving from one state to another as a result of a specific action. Reward functions may incorporate clinical outcomes or physiological targets.57 In practice, clinically useful rewards are often multiobjective and must capture trade-offs (such as hemodynamic stability versus renal safety, oxygenation targets versus ventilator-induced lung injury, and tight glycemic control versus hypoglycemia), with explicit safety constraints that prohibit clearly harmful actions regardless of short-term reward.57–59
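As a rough illustration of a multiobjective reward with a hard safety constraint, the sketch below scores glucose control while penalizing hypoglycemia and masks out unsafe actions. The thresholds, weights, and function names are hypothetical placeholders chosen for readability; they are not clinical guidance and are not taken from the cited studies.

```python
# Illustrative placeholders only; real targets must come from clinical protocols.
HYPO_THRESHOLD = 70.0          # mg/dL, hypothetical
TARGET_RANGE = (100.0, 180.0)  # mg/dL, hypothetical

def glycemic_reward(glucose, w_target=1.0, w_hypo=5.0):
    """Weighted sum of two objectives: time in target range and hypoglycemia avoidance."""
    in_range = 1.0 if TARGET_RANGE[0] <= glucose <= TARGET_RANGE[1] else 0.0
    hypoglycemic = 1.0 if glucose < HYPO_THRESHOLD else 0.0
    return w_target * in_range - w_hypo * hypoglycemic

def safe_actions(glucose, insulin_rate_changes):
    """Hard constraint: never offer an insulin increase while glucose is below threshold.

    The constraint is applied to the action set itself, so it holds
    regardless of what the reward function would prefer.
    """
    if glucose < HYPO_THRESHOLD:
        return [change for change in insulin_rate_changes if change <= 0]
    return list(insulin_rate_changes)

print(glycemic_reward(140.0))           # inside the target range
print(glycemic_reward(60.0))            # hypoglycemia is penalized more heavily
print(safe_actions(60.0, [-1, 0, 1]))   # rate increases are removed
```

Separating the constraint from the reward is a design choice: a penalty can be outweighed by other terms, whereas a masked action can never be selected.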
The goal of RL is to learn a treatment policy, that is, a mapping from patient states to actions, that maximizes the expected cumulative reward. The logic of RL naturally resonates with the practice of intensive care medicine. Clinicians do not simply make one decision at the onset of illness; they orchestrate a series of decisions, each informed by prior responses and each shaping future possibilities. For example, a fluid bolus now may increase the probability of pulmonary edema later, which may influence the decision to intubate; in turn, this changes the trajectory of weaning and sedation. What distinguishes RL from conventional prediction is precisely this temporal chaining: RL recognizes that the best action is not always the one that maximizes immediate physiologic improvement but the one that sets the patient on the best overall path. In this sense, personalization emerges not only from accounting for individual baseline characteristics but also from dynamically adapting to how that individual responds to prior care.
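The difference between maximizing immediate reward and maximizing cumulative reward can be made concrete with a toy Markov decision process solved by value iteration. The states, actions, rewards, and discount factor below are invented for illustration and carry no clinical meaning; real ICU problems have far larger state spaces and unknown transition dynamics.

```python
def q_value(outcomes, V, gamma):
    """Expected return of one action given its (probability, next_state, reward) outcomes."""
    return sum(p * (r + gamma * V[s_next]) for p, s_next, r in outcomes)

def value_iteration(T, gamma=0.9, tol=1e-9):
    """Compute optimal state values and a greedy policy for a small tabular MDP."""
    V = {s: 0.0 for s in T}
    while True:
        delta = 0.0
        for s, actions in T.items():
            if not actions:  # terminal state: value stays at zero
                continue
            best = max(q_value(o, V, gamma) for o in actions.values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    policy = {
        s: max(acts, key=lambda a: q_value(acts[a], V, gamma))
        for s, acts in T.items() if acts
    }
    return V, policy

def myopic_action(T, state):
    """Action with the highest expected immediate reward, ignoring the future."""
    acts = T[state]
    return max(acts, key=lambda a: sum(p * r for p, _, r in acts[a]))

# Hypothetical toy problem: a bolus pays off immediately but leads to fluid overload.
T = {
    "shock": {
        "bolus": [(1.0, "overloaded", 1.0)],
        "pressor": [(1.0, "stable", 0.0)],
    },
    "overloaded": {"wait": [(1.0, "done", -3.0)]},
    "stable": {"wait": [(1.0, "done", 2.0)]},
    "done": {},
}

V, policy = value_iteration(T)
print(myopic_action(T, "shock"))  # best short-term choice
print(policy["shock"])            # best long-term choice
```

Here the myopic rule picks the bolus for its immediate gain, while the policy that accounts for downstream consequences picks the other action. This mirrors the temporal chaining described above, in a setting small enough to verify by hand.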
However, in practice, learning such policies in medicine is constrained by the impossibility of trial-and-error experimentation on critically ill patients. Unlike games or robotics, where RL agents can interact with a simulated environment millions of times,60,61 in health care, the environment is real patients, and the exploration of untested actions carries unacceptable risks such as adverse events and mortality. Consequently, almost all applications of RL in critical care adopt an offline paradigm. Offline RL learns policies from retrospective data collected under historical clinician practice,62 without active experimentation. This makes it well suited to medicine, where vast repositories of electronic health records and ICU databases provide the observational trajectories of states, actions, and outcomes for training. The challenge then becomes to extract, from these imperfect and biased records of human practice, a policy that generalizes beyond what clinicians happened to do in certain situations. Offline RL provides a framework to estimate that policy.63 As policies are learned without prospective exploration, careful off-policy evaluation is essential, typically combining importance-sampling-based estimators, doubly robust methods, and fitted Q evaluation to estimate policy value, along with high confidence bounds that quantify uncertainty before any bedside use.64
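One of the importance-sampling-based estimators mentioned above, weighted importance sampling, can be sketched in a few lines. The sketch assumes that both the behavior (clinician) policy and the evaluation policy are available as lookup tables of action probabilities; in practice the behavior policy has to be estimated from data. The data layout and all names are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards discounted by gamma per time step."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def wis_estimate(trajectories, pi_e, pi_b, gamma=1.0):
    """Weighted importance sampling estimate of the evaluation policy's value.

    trajectories: list of trajectories, each a list of (state, action, reward) steps.
    pi_e, pi_b:   dicts mapping state -> {action: probability} for the
                  evaluation and behavior policies respectively.
    """
    weights, returns = [], []
    for trajectory in trajectories:
        weight = 1.0
        for state, action, _ in trajectory:
            # Ratio of how likely each policy was to take the logged action.
            weight *= pi_e[state][action] / pi_b[state][action]
        weights.append(weight)
        returns.append(discounted_return([r for _, _, r in trajectory], gamma))
    total = sum(weights)
    if total == 0.0:
        raise ValueError("evaluation policy has no support in the logged data")
    return sum(w * g for w, g in zip(weights, returns)) / total

# Toy logged data: one state, two actions, single-step episodes.
pi_b = {"s": {"a0": 0.5, "a1": 0.5}}  # clinicians chose each action half the time
pi_e = {"s": {"a0": 0.2, "a1": 0.8}}  # candidate policy prefers a1
logged = [[("s", "a0", 0.0)]] * 2 + [[("s", "a1", 1.0)]] * 2

print(round(wis_estimate(logged, pi_e, pi_b), 3))
```

Normalizing by the sum of weights reduces variance relative to ordinary importance sampling at the cost of some bias. With long ICU trajectories the product of ratios can still collapse onto a few trajectories, which is one reason several estimators are usually reported together.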
This framework has already been applied in early proof-of-concept studies. For example, RL has been used to identify optimal fluids and vasopressor doses in patients with sepsis and for ventilator management.59,65 More recently, offline RL has been applied to improve glycemic control among critically ill patients after cardiac surgery.57 Importantly, this RL model underwent multiphase human validations, thus demonstrating that its recommendations were at least as safe, accurate, and acceptable as those of experienced clinicians. Nevertheless, these studies are largely retrospective, and prospective trials are needed to establish safety, effectiveness, and generalizability in real-world ICU practice. These studies highlight how RL can move beyond one-size-fits-all guidelines by tailoring sequences of interventions to the evolving characteristics of individual patients. RL can also be useful for many other high-impact decisions clinicians make daily in the ICU. In sedation and analgesia management, RL could help with titrating or switching sedative or analgesic agents, with guardrails informed by hemodynamic stability, respiratory drive, and delirium prevention. Decisions regarding antibiotic initiation and de-escalation—including choice, timing, and duration—are similarly sequential and context-dependent, constrained by hemodynamic instability, organ dysfunction, evolving microbiologic data, and stewardship considerations, thus making them another domain where RL could be helpful. RL could also inform many additional complex decision processes, such as transfusion of blood products, initiation and adjustment of anticoagulants, delivery and titration of nutritional support, initiation and dosing of dialysis modalities (including continuous renal replacement therapy), and the titration of extracorporeal membrane oxygenation and other forms of mechanical circulatory or respiratory support. 
These high-stakes decisions share a common structure in that they require balancing competing physiological priorities under uncertainty, adapting actions over time as a patient’s condition evolves, and respecting explicit safety constraints—all situations that are well suited to sequential decision-making frameworks such as offline RL.
Across both causal ML and RL, existing studies should be interpreted as hypothesis-generating ones, with prospective evaluation as a prerequisite for clinical deployment.
Challenges with the implementation of causal ML and RL
While causal ML and RL models have enormous potential to provide personalized solutions for a wide variety of medical applications, the leap from algorithmic development to clinical implementation involves significant challenges,50,66 which are explained below.
Data quality and integration
One of the first barriers to implementing causal ML and RL in the ICU is the quality and structure of the underlying data. Electronic health records are riddled with missingness, delayed documentation, inconsistencies in units, and artifacts from devices.23,24 Waveform data may be stored at high frequency but fragmented across vendors; medication administration records often lack precise timestamps or infusion-rate adjustments. For causal ML, this undermines the reliability of confounder adjustment; for RL, it erodes the fidelity of state representations. An additional challenge is the lack of interoperability across ICU information systems, where heterogeneous data schemas, vendor-specific formats, and limited adherence to standards (such as HL7 FHIR) impede the reliable integration of multimodal datasets.67,68 When institutions use incompatible documentation workflows or nonstandardized device interfaces, even simple features, such as vasopressor dose or ventilator settings, may be represented differently, thereby complicating model training and deployment. Therefore, effective implementation of these AI models requires investment in robust data engineering pipelines, real-time data ingestion, harmonization across disparate systems, and physiological validation layers that ensure data plausibility before these models can begin to operate in clinical practice.
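A physiological validation layer of the kind described here might, in its simplest form, harmonize units and flag implausible or missing values before any model sees the data. The sketch below is an assumption-laden illustration: the plausibility bounds, feature names, and supported unit strings are hypothetical and would need to be defined locally.

```python
# Hypothetical plausibility bounds; real limits should be set with clinical input.
PLAUSIBLE_RANGES = {
    "heart_rate": (20, 300),  # beats/min
    "map": (10, 250),         # mean arterial pressure, mmHg
    "spo2": (30, 100),        # percent
}

def to_mcg_per_kg_min(dose, unit, weight_kg):
    """Convert a vasopressor infusion rate to mcg/kg/min from a few common units."""
    if unit == "mcg/kg/min":
        return dose
    if unit == "mcg/min":
        return dose / weight_kg
    if unit == "mg/h":
        return dose * 1000.0 / 60.0 / weight_kg
    raise ValueError(f"unsupported unit: {unit}")

def flag_implausible(record):
    """Return the names of features that are missing or outside their plausible range."""
    flagged = []
    for name, (low, high) in PLAUSIBLE_RANGES.items():
        value = record.get(name)
        if value is None or not (low <= value <= high):
            flagged.append(name)
    return flagged

print(to_mcg_per_kg_min(8.0, "mcg/min", 80.0))
print(flag_implausible({"heart_rate": 400, "map": 65, "spo2": None}))
```

Raising an error on unknown units, rather than passing the value through, is deliberate: a silently misread infusion rate is more dangerous to a downstream model than a rejected record.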
Interpretability and clinician trust
Another challenge is interpretability.69 Causal ML may estimate individualized treatment effects and RL may recommend sequences of actions, but unless the rationale underlying these recommendations can be explained, clinicians are unlikely to trust or adopt them. In practice, intensivists need to know not only what the model suggests but also why: which clinical features are driving the recommendation, which counterfactual scenarios were considered, and which uncertainties remain. Translating complex algorithms into intuitive explanations is essential. Without this, models risk being perceived as “black boxes”,70 leading to skepticism or rejection at the bedside. Recent work in explainable AI, counterfactual reasoning, and human-centered design provides tools to bridge this gap, but these approaches remain underexplored in high-acuity settings such as the ICU.71
Workflow integration and human factors
Critical care workflows are fast-paced and team-based, with decisions often made under severe time pressure. Implementing causal ML or RL systems requires more than algorithmic accuracy; it requires seamless integration into these workflows. A system that issues alerts or recommendations at inconvenient times or in formats disconnected from the workflow of end-users risks adding cognitive burden rather than alleviating it.72 Furthermore, ICUs function through multidisciplinary collaboration, so a recommendation made to a bedside nurse, a fellow, or an attending must fit into the communication patterns of the team. Effective implementation therefore requires codesign with clinicians to ensure that the recommendations are timely, context-aware, and aligned with existing decision pathways rather than disruptive, with human-centered design and usability testing incorporated throughout the development and deployment of AI models. After deployment, the recommendations should surface within the natural flow of team-based ICU activities rather than through intrusive pop-up alerts. Moreover, clinicians must retain full authority over treatment decisions, with AI systems providing suggestions or risk estimates that can be accepted, modified, or overridden. Override mechanisms should be straightforward, encouraged, and automatically logged to create an auditable record that supports transparency and iterative model refinement. To minimize alert fatigue, thresholds for when to display recommendations should be carefully tuned and routinely monitored using metrics such as alert frequency, acceptance rates, and downstream clinical actions. Explanations should remain concise and clinically meaningful, highlighting the patient features and trade-offs that drove a recommendation, to ensure that clinicians can rapidly judge appropriateness during time-pressured decision-making.
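The logging and monitoring described above can be sketched as a small audit record plus a summary of clinician responses. The record fields and the three response labels are assumptions made for this example; an actual deployment would follow institutional logging and privacy policies.

```python
from collections import Counter
from dataclasses import dataclass

RESPONSES = ("accepted", "modified", "overridden")

@dataclass(frozen=True)
class RecommendationEvent:
    """One logged recommendation and how the clinician responded (hypothetical schema)."""
    patient_id: str
    recommendation: str
    clinician_response: str  # one of RESPONSES

def response_rates(events):
    """Fraction of recommendations that were accepted, modified, or overridden."""
    if not events:
        return {response: 0.0 for response in RESPONSES}
    counts = Counter(event.clinician_response for event in events)
    return {response: counts.get(response, 0) / len(events) for response in RESPONSES}

log = [
    RecommendationEvent("p1", "reduce insulin rate", "accepted"),
    RecommendationEvent("p2", "hold fluids", "accepted"),
    RecommendationEvent("p3", "increase PEEP", "modified"),
    RecommendationEvent("p4", "start vasopressor", "overridden"),
]
print(response_rates(log))
```

A persistently high override rate for one type of recommendation is a signal worth investigating. It may indicate a poorly tuned threshold, a model error, or context that the model cannot see.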
Technical integration, privacy, and security
These systems will need to interface with electronic health records and monitoring systems to ingest data and return outputs in real time. Standards such as FHIR and SMART on FHIR provide a practical basis for interoperable integration of real-time clinical data and AI-driven recommendations into the bedside record.68 Privacy and security safeguards—including strong authentication, role-based access control, and audit logging—are critical given the sensitivity of ICU data. Where possible, data processing should adhere to data-minimization principles and institutional policies should specify how logs, overrides, and model outputs are stored and accessed.
Prospective validation and evaluation
Prospective evaluation remains one of the most significant barriers to translating causal ML and RL into real-world critical care. Similar to traditional risk prediction models, RL and causal ML cannot be entirely validated by retrospective accuracy metrics alone. In RL, various off-policy evaluation techniques—such as fitted Q-evaluation (FQE),73 weighted importance sampling (WIS),74 and, more recently, DICE75—have been developed to approximate how a learned policy might perform in practice.
Although these methods provide essential safeguards and enable the identification of unsafe or unstable policies during development, they cannot fully anticipate the behavior of models deployed in dynamic clinical environments. Therefore, prospective validation is critical, but designing such studies poses ethical and methodological challenges. Randomized controlled trials are resource-intensive and may be difficult to justify when a model proposes actions that diverge from accepted clinical practice. Emerging strategies—such as silent deployment, where recommendations are generated but withheld from clinicians—provide a lower risk pathway for assessing reliability, stability, and usability of policies before they are actively integrated. Further, simulation-based evaluation environments, such as digital twins, should be explored as a means of stress testing policies under controlled conditions without exposing patients to harm.76 However, these approaches remain technically demanding and have not yet been implemented at scale. Consequently, the field still lacks standardized prospective evaluation frameworks capable of ensuring that prescriptive AI systems can be safely deployed in high-acuity settings.
Documentation, transparency, and continuous monitoring
Transparent documentation is essential for the safe deployment of policies. Model cards and similar standardized summaries can specify the model’s intended use, training data, performance across subgroups, known limitations, and monitoring requirements.77 Moreover, data provenance and versioning should be tracked so that every model prediction can be linked to its underlying code, parameters, and data snapshots. Continuous monitoring for dataset shifts, using changes in input distributions, calibration, or outcome frequencies, can identify when retraining or recalibration is needed.78
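Monitoring for a shift in input distributions can be as simple as comparing histograms of a feature between the training period and a recent window. The sketch below uses the Kullback-Leibler divergence with additive smoothing. The bin edges, the alert threshold, and the example values are hypothetical, and this is only one of several reasonable drift statistics.

```python
import math

def bin_counts(values, edges):
    """Count values per bin [edges[i], edges[i+1]); values outside all bins are ignored."""
    counts = [0] * (len(edges) - 1)
    for value in values:
        for i in range(len(counts)):
            if edges[i] <= value < edges[i + 1]:
                counts[i] += 1
                break
    return counts

def kl_divergence(p_counts, q_counts, eps=1e-6):
    """KL(P || Q) between two histograms, smoothed by eps to avoid log(0)."""
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    divergence = 0.0
    for p_count, q_count in zip(p_counts, q_counts):
        p = (p_count + eps) / p_total
        q = (q_count + eps) / q_total
        divergence += p * math.log(p / q)
    return divergence

def has_drifted(train_values, live_values, edges, threshold=0.1):
    """Flag drift when the live distribution diverges from training beyond a threshold."""
    train = bin_counts(train_values, edges)
    live = bin_counts(live_values, edges)
    return kl_divergence(live, train) > threshold

edges = [40, 80, 120, 160]              # hypothetical bins for a vital sign
training = [60, 70, 80, 90] * 25
recent_same = [60, 70, 80, 90] * 25
recent_shifted = [100, 110, 120, 130] * 25

print(has_drifted(training, recent_same, edges))
print(has_drifted(training, recent_shifted, edges))
```

The threshold is a tuning choice that trades false alarms against missed shifts. In practice it would be calibrated on historical data and reviewed alongside calibration and outcome-frequency checks.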
Governance, regulation, and liability
Bringing RL and causal ML into the ICU also raises issues of liability and governance.79 Unlike static models, RL-based policies may evolve over time as more data are ingested, thereby complicating regulatory oversight. Hospitals and regulators must determine the level of autonomy these systems can have, who bears the responsibility for adverse outcomes, and how updates to models are controlled. Causal ML raises additional questions: if individualized treatment effect estimates differ from guideline-recommended care, who decides whether to follow the model or the guideline? As these systems mature, collaboration with regulatory bodies will be essential to ensuring safety and public trust. Regulatory and governance frameworks, such as Good Machine Learning Practice and guidance for software as a medical device (SaMD), emphasize clear intended use, data quality, risk management, and a prespecified change control plan for adaptive models.80 Under both Food and Drug Administration and European Union Medical Device Regulation approaches to SaMD, sponsors are expected to define how models will be updated, how performance will be monitored post-market, and how evidence will be generated when substantial changes are introduced.80–82 Institutional governance should explicitly specify who assumes responsibility when model recommendations diverge from guidelines, with clear documentation of how recommendations are generated, when they may be safely ignored, and how conflicts are resolved. Clear governance frameworks, change-control protocols, and legal clarity are prerequisites for implementation.
Fairness and equity in real-world deployment
Finally, there is the issue of fairness.83,84 Both causal ML and RL are only as good as the data they are trained on, and historical ICU data often reflect inequities in care delivery across race, sex, geography, and socioeconomic status. A model trained on such data may learn policies that inadvertently perpetuate disparities, for example, by recommending fewer interventions in patients from groups that historically received less aggressive care. In research settings, subgroup analyses can identify such patterns, but in implementation, continuous auditing and fairness-aware retraining will be required. Without explicit attention to equity, deployment risks widening rather than narrowing gaps in critical care outcomes.
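A first-pass audit of the kind described here could compare how often an intervention is recommended across patient subgroups. The sketch below is deliberately simple and rests on assumptions: the group labels and records are invented, and a raw rate gap does not by itself establish unfairness, since groups may differ in clinical need.

```python
from collections import defaultdict

def recommendation_rates(records):
    """Fraction of patients in each subgroup for whom the intervention was recommended.

    records: iterable of (group_label, recommended) pairs, with recommended a bool.
    """
    totals = defaultdict(int)
    recommended_counts = defaultdict(int)
    for group, recommended in records:
        totals[group] += 1
        recommended_counts[group] += int(recommended)
    return {group: recommended_counts[group] / totals[group] for group in totals}

def max_rate_gap(rates):
    """Largest difference in recommendation rate between any two subgroups."""
    return max(rates.values()) - min(rates.values())

# Hypothetical audit sample.
audit = [("A", True), ("A", True), ("A", True), ("A", False),
         ("B", True), ("B", False), ("B", False), ("B", False)]

rates = recommendation_rates(audit)
print(rates)
print(max_rate_gap(rates))
```

A gap flagged by a check like this is a prompt for closer review, ideally after adjusting for severity and indication, rather than a verdict.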
Future directions
The next stage in the development of AI for personalized critical care lies not in further demonstration of feasibility but in building systems that can be trusted, validated, and safely deployed at the bedside. For causal ML, future research will need to move beyond exploratory analyses of heterogeneity toward methods that produce clinically reliable treatment effect estimates. One important direction is the use of target trial emulation frameworks,85 where observational ICU data are explicitly structured as though they were randomized trials. This approach strengthens causal validity and provides estimates that are more easily aligned with clinical reasoning. Advances in methods that address time-varying confounding will also be crucial, as patient physiology and treatment decisions interact in feedback loops over hours and days. Further, high-dimensional extensions of g-methods and doubly robust estimators could enable more faithful estimation of individualized treatment effects in these longitudinal settings.86,87 In addition, causal ML studies should prespecify estimands,88 articulate assumptions using directed acyclic graphs,89 report positivity and overlap diagnostics, and conduct quantitative sensitivity analyses for unmeasured confounding. External validation, together with explicit assessment of transportability across hospitals and health systems, should become standard practice. Ensuring that causal ML tools generate outputs that clinicians can interpret and apply is equally important.
Instead of abstract effect estimates, these tools will need to provide clear narratives, such as explaining that “given this patient’s fluid balance, urine output, and hemodynamic profile, a restrictive fluid strategy is likely to improve renal recovery.” Our group’s recent work on fluid management in septic patients with AKI illustrates this trajectory, where individualized treatment effects estimated using causal ML were externally validated and shown to identify those subgroups that would be the most likely to benefit from a restrictive approach.49 Expanding this paradigm to other decisions within critical care represents an important future path.
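To make the doubly robust estimators and positivity diagnostics mentioned above more concrete, the following Python sketch computes an augmented inverse-probability-weighted (AIPW) estimate of an average treatment effect on a toy data set. It is a minimal illustration under strong assumptions: a single time point, one discrete confounder (`stratum`), and nuisance models that are simple stratum means. The function name, the 0.05 propensity cutoff, and the data are hypothetical. It does not reproduce the method of any cited study and does not handle time-varying confounding.

```python
from collections import defaultdict

def aipw_ate(rows, eps=0.05):
    """AIPW (doubly robust) estimate of the average treatment effect.

    rows: list of (stratum, treated, outcome), with treated in {0, 1}.
    Returns (ate, violations). `violations` lists strata whose estimated
    propensity lies outside [eps, 1 - eps]; if any exist, no estimate is
    returned rather than extrapolating (a simple positivity check).
    """
    n_by = defaultdict(int)
    n_treated = defaultdict(int)
    y_sum = defaultdict(lambda: [0.0, 0.0])  # stratum -> [control sum, treated sum]
    for s, a, y in rows:
        n_by[s] += 1
        n_treated[s] += a
        y_sum[s][a] += y

    # Propensity model: share treated within each stratum (an assumption)
    propensity = {s: n_treated[s] / n_by[s] for s in n_by}
    violations = sorted(s for s, e in propensity.items()
                        if e < eps or e > 1 - eps)
    if violations:
        return None, violations

    # Outcome models: mean outcome per stratum and treatment arm
    m1 = {s: y_sum[s][1] / n_treated[s] for s in n_by}
    m0 = {s: y_sum[s][0] / (n_by[s] - n_treated[s]) for s in n_by}

    total = 0.0
    for s, a, y in rows:
        e = propensity[s]
        total += (m1[s] - m0[s]
                  + a * (y - m1[s]) / e
                  - (1 - a) * (y - m0[s]) / (1 - e))
    return total / len(rows), violations

# Toy data invented for illustration; outcome 1.0 = recovery, 0.0 = none
rows = [("low", 1, 1.0), ("low", 1, 1.0), ("low", 0, 0.0), ("low", 0, 1.0),
        ("high", 1, 1.0), ("high", 0, 0.0), ("high", 0, 0.0), ("high", 0, 0.0)]
ate, violations = aipw_ate(rows)
print(ate, violations)
```

With saturated stratum-mean models as used here, the AIPW estimate coincides with simple standardization; the doubly robust property matters when the propensity and outcome models are flexible learners that may each be misspecified.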
In RL, offline RL methods will remain central, since exploration in actual patients is not feasible. In this setting, the learned policy is constrained by the behavior policy that generated the data; thus, adequate coverage of clinically relevant actions is essential to avoid extrapolation to regions of the state–action space that were rarely or never visited in practice.90 Directly estimating the behavior policy and examining action frequency and overlap across patient subgroups provide practical diagnostics for coverage. Further, algorithms such as conservative Q-learning,91 which penalizes value estimates for actions that deviate from logged clinician behavior, and distributional RL, which models the entire distribution of possible outcomes rather than merely averages, will be particularly important for learning safe policies. Additional conservative offline approaches, such as batch-constrained deep Q-learning, behavior-regularized actor-critic, and implicit Q-learning, further limit unsupported extrapolation, providing safeguards against high-risk actions and improving stability in high-stakes clinical settings.92–94 Another critical direction is reward design. Mortality and duration of hospital stay are too sparse to guide useful policies on their own; thus, multiobjective rewards that balance competing priorities (such as hemodynamic stability versus renal safety, or glucose control versus hypoglycemia avoidance) will bring RL policies closer to the trade-offs intensivists make daily. However, the greatest challenge will be prospective validation. To ensure credibility, RL studies should explicitly estimate and report the behavior policy, assess state–action coverage to avoid unsupported extrapolation, and apply multiple off-policy evaluation methods, including importance sampling, doubly robust estimators, fitted Q evaluation, and distribution correction estimation (DICE).95–99 Ablation studies on state representations should test robustness to missing or noisy modalities.
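The behavior-policy estimation, coverage diagnostics, and importance sampling evaluation described above can be sketched in a tabular setting as follows. This is a simplified Python illustration, not a clinical implementation: states and actions are assumed to be small discrete sets, the target policy is assumed to be deterministic, and all names, the 0.05 support threshold, the rewards, and the toy trajectories are hypothetical.

```python
from collections import Counter, defaultdict

def estimate_behavior_policy(trajectories):
    """Empirical pi_b(a | s) from logged (state, action, reward) steps."""
    counts = defaultdict(Counter)
    for traj in trajectories:
        for state, action, _ in traj:
            counts[state][action] += 1
    return {s: {a: c / sum(ctr.values()) for a, c in ctr.items()}
            for s, ctr in counts.items()}

def unsupported_pairs(target_policy, behavior, min_prob=0.05):
    """State-action pairs the target policy selects but that clinicians
    rarely or never chose in the log (illustrative threshold)."""
    return sorted((s, a) for s, a in target_policy.items()
                  if behavior.get(s, {}).get(a, 0.0) < min_prob)

def weighted_importance_sampling(trajectories, target_policy, behavior,
                                 gamma=1.0):
    """Weighted importance sampling value estimate for a deterministic
    target policy; returns None if no logged trajectory matches it."""
    weights, returns = [], []
    for traj in trajectories:
        w, g, discount = 1.0, 0.0, 1.0
        for state, action, reward in traj:
            pi_e = 1.0 if target_policy.get(state) == action else 0.0
            w *= pi_e / behavior[state][action]
            g += discount * reward
            discount *= gamma
        weights.append(w)
        returns.append(g)
    total = sum(weights)
    if total == 0.0:
        return None
    return sum(w * g for w, g in zip(weights, returns)) / total

# Toy log invented for illustration; reward 1.0 marks a favorable outcome
logged = [
    [("hypotensive", "vasopressor", 0.0), ("stable", "none", 1.0)],
    [("hypotensive", "fluid", 0.0), ("stable", "none", 0.0)],
    [("hypotensive", "vasopressor", 0.0), ("stable", "none", 1.0)],
    [("hypotensive", "fluid", 0.0), ("stable", "none", 1.0)],
]
behavior = estimate_behavior_policy(logged)
target = {"hypotensive": "vasopressor", "stable": "none"}
gaps = unsupported_pairs(target, behavior)
wis = weighted_importance_sampling(logged, target, behavior)
print(behavior, gaps, wis)
```

A target policy that selects an action absent from the log (for example, withholding treatment in the hypotensive state here) would be flagged by `unsupported_pairs` and would receive no estimate, which is the tabular analogue of refusing to extrapolate beyond the data. Real ICU state spaces are continuous and high-dimensional, so the behavior policy would need to be modeled rather than counted.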
Silent deployment phases (where policies generate recommendations but clinicians are blinded to them), simulation studies, and, ultimately, pragmatic trials will be required to ensure safety and benefit. A practical translational path begins with retrospective model development using prespecified diagnostics and internal validation, followed by rigorous off-policy evaluation with uncertainty bounds and predefined safety thresholds. High-fidelity simulators and digital twins of the ICU environment can then be used to stress-test proposed policies and explore clinically important scenarios.13,100,101 Subsequent silent prospective evaluation would enable teams to compare model recommendations with actual clinician actions and outcomes without influencing care, after which clinician-in-the-loop pilots can introduce recommendations into the workflow with explicit override options, logging, and auditing. It is also important to ensure alignment with Good Machine Learning Practice and FDA guidance for adaptive software.81 Only once safety, usability, and fidelity to clinical priorities have been demonstrated in these stages should pragmatic trials be considered to assess impact on patient and system-level outcomes.
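One way to read "off-policy evaluation with uncertainty bounds and predefined safety thresholds" is as a gate that a candidate policy must pass before moving to the next stage. The Python sketch below illustrates that idea with a percentile bootstrap over per-trajectory value estimates. The function names, the choice of bootstrap, the margin parameter, and the numbers are all assumptions for illustration. A bootstrap interval reflects sampling variability only and does not correct bias arising from poor state–action coverage or confounding.

```python
import random

def bootstrap_interval(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-trajectory
    value estimates (hypothetical input from an off-policy evaluator)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def passes_safety_gate(policy_values, clinician_value, margin=0.0, **kwargs):
    """True only if the lower bound on the policy's estimated value is at
    least the clinician baseline plus a prespecified margin."""
    lo, _ = bootstrap_interval(policy_values, **kwargs)
    return lo >= clinician_value + margin

# Toy per-trajectory estimates invented for illustration
policy_values = [0.80, 0.90, 0.85, 0.95, 0.90, 0.80, 0.85, 0.90]
lo, hi = bootstrap_interval(policy_values)
advance = passes_safety_gate(policy_values, clinician_value=0.5)
print(lo, hi, advance)
```

Comparing the lower bound, rather than the point estimate, against the baseline makes the gate conservative: a policy with a high but very uncertain estimated value would not advance. The threshold and margin would need to be fixed before the evaluation is run.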
A third strand of progress will likely come from digital twins, which are virtual patient simulators that blend mechanistic physiology with real-time data to create individualized models of disease evolution.13 Digital twins can provide safe and controlled environments to test RL policies, stress-test treatment strategies derived from causal ML, and explore counterfactual scenarios before implementing interventions in real patients. By synchronizing with live ICU data, a digital twin could project likely trajectories under different interventions, thus providing clinicians with both a predictive forecast and an adaptive decision-support tool. Over time, this technology could enable RL systems to “practice” policies in silico before recommending them at the bedside while also providing clinicians with a virtual sandbox to query counterfactual scenarios, such as, “what if I reduce PEEP instead of increasing vasopressors?” As both causal ML and RL mature, digital twins may emerge as the bridge that enables safe prospective evaluation and, ultimately, real-world deployment.
Further, to ensure that causal ML and RL studies in critical care are transparent, reproducible, and clinically interpretable, investigators should follow established guidelines. For causal ML, target trial emulation checklists, prespecified estimands, DAGs and adjustment sets, positivity diagnostics, sensitivity analyses, and external validation with transportability assessment should be routinely reported. For RL, behavior-policy estimation, coverage diagnostics, off-policy evaluation with uncertainty, ablations on state representation, and preregistered evaluation protocols should be standard. For prediction components that feed into causal or RL workflows, TRIPOD-AI and PROBAST-AI provide updated guidance for transparent reporting and bias assessment.102,103 Moreover, early-stage clinical evaluations should be reported in line with DECIDE-AI, while randomized trials involving AI decision support should follow SPIRIT-AI and CONSORT-AI.104,105 A list of common pitfalls in the use of AI in critical care is presented in Box 1.
With the careful development and integration of these technologies, the ICU of the future will look very different from that of today. Instead of relying on generic protocols or population-based guidelines, clinicians will likely have access to individualized treatment effect estimates that clarify which interventions are most likely to help or harm the patient in front of them. RL-based systems will provide adaptive recommendations that evolve hour by hour as physiology changes, balancing competing priorities with a view toward long-term recovery. Digital twins will likely run silently alongside patients, projecting trajectories under alternative strategies and providing clinicians a safe environment to test and refine decisions. In such a world, intensivists will not be replaced by algorithms but will be augmented by them, supported by tools that continuously synthesize data, reason across counterfactuals, and anticipate consequences, thereby ensuring that the right decision can be made for each patient at the right time.
Limitations
This review has several limitations. We conducted a narrative rather than a systematic review, so our selection of studies reflects editorial judgment and may not capture all relevant work or quantify the strength of the underlying evidence. The literature we surveyed on causal ML and RL in critical care remains largely retrospective, observational, and proof-of-concept, and we identified no prospective trials demonstrating that prescriptive AI improves patient outcomes at the bedside. Finally, because this is a rapidly evolving field, the methods, tools, and regulatory frameworks we describe may change quickly, and we did not formally grade the certainty of the evidence we cite.
Conclusions
AI has opened the door to a new era in critical care, where decision support can move beyond generic prediction and toward individualized prescription. Causal ML can estimate individualized treatment effects, while RL can optimize sequences of actions over time. Together, these approaches move from population averages toward patient-specific guidance. However, significant challenges remain, including data quality, interpretability, workflow integration, validation, governance, and equity, although advances such as digital twins offer pathways for safe testing and deployment. Importantly, these systems are intended as decision-support tools, not as autonomous decision-makers. They are designed to inform and contextualize clinical judgment, with clinicians retaining ultimate authority and responsibility for patient care. If responsibly developed, these tools may augment clinical judgment, thereby enabling clinicians to deliver on the core promise of intensive care: the right treatment, for the right patient, at the right time.
Declarations
Funding
This work was supported by the National Institutes of Health (NIH) (Grant No. 5K08DK131286 to AS).
Conflict of interest
AS is a consultant for Roche Diagnostics Corporation. Other authors have no conflicts of interest.
Authors’ contributions
Conceptualization (MS, AS); data curation (MS); methodology (MS, AS); software (MS); resources (AS); writing—original draft (MS, BK); writing—review & editing (AS); visualization (MS, AS); supervision (AS); project administration (AS); funding acquisition (AS). All authors have approved the final version and publication of the manuscript.