
The incorporation of artificial intelligence (AI) algorithms into the radiotherapy planning workflow represents one of the most profound transformations experienced by medical physics in recent decades. For more than thirty years, the dose calculation engine has been synonymous with deterministic or stochastic physics: analytical convolutions, particle transport by Boltzmann equations or Monte Carlo (MC) simulation. These methods operate on explicit models of radiation transport, with parameters derived from commissioning data and validated against independent dosimetric measurements. Now a new category of engines emerges—models trained from data—whose capabilities and limitations do not naturally fit into the quality assurance (QA) protocols developed for deterministic algorithms.

What is generically called “AI in dose calculation” encompasses very different technological realities: neural networks that predict dose distributions based on structure geometries, reinforcement learning models for plan optimization, and emulators that reproduce outputs from slow engines — such as MC — with fractions of a second latency. None of these models transport particles. They learn statistical correlations between inputs (CT images, contours, beams) and outputs (dose distributions) from a training set. The clinically relevant question is not “AI or Monte Carlo?” but rather: under what conditions can a surrogate model be used with confidence, and what safeguards are needed to detect when it silently fails?

Figure: AI dose surrogate model with clinical validation guardrails (technical infographic from the dose-calculation algorithm cluster).

This article examines these questions from the perspective of medical physicists, dosimetrists, and radiation oncologists who need to make adoption or oversight decisions for AI-based tools. The text differentiates physical description of the phenomenon, commercial implementation and published validation evidence — three dimensions often confused in discussions on the topic. It is not intended to recommend specific products, but to provide a conceptual map for critical evaluation of these technologies.

What it means to use AI as a dose surrogate model

A surrogate model (or emulator) is a computational system trained to reproduce the behavior of another, more expensive or slower system, accepting the same inputs and producing approximate outputs. In the dose context, the “expensive” system is typically a high-fidelity MC engine or a linear Boltzmann transport equation (LBTE) solver such as Acuros XB. The surrogate, typically a deep convolutional neural network, often with a U-Net-like architecture, learns from (input, reference output) pairs a mapping that can be evaluated in milliseconds rather than minutes or hours.

It is important to distinguish two sub-cases that the literature often conflates. In the first, the network predicts the dose on the basis of previously optimized treatment plans, serving as a quick check or as a way to generate an initial plan (knowledge-based planning). In the second, the network directly replaces the calculation engine within the treatment planning system (TPS) and is invoked at each optimization iteration. The second case imposes much more severe requirements on accuracy and robustness: a systematic dose error in a critical region will propagate into the optimization, producing plans with lower actual coverage than projected, without any warning signal to the user.
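The propagation of a systematic error into plan coverage can be illustrated numerically. The sketch below is a toy example under stated assumptions: the dose values, the 3 % over-reporting bias and the V95% coverage criterion are all hypothetical and are not taken from any real model or plan.

```python
import numpy as np

prescription_gy = 60.0
threshold_gy = 0.95 * prescription_gy  # coverage is judged at 95 % of prescription

# Hypothetical target-voxel doses as reported by the surrogate after optimization.
surrogate_dose = np.linspace(57.0, 63.0, 601)

# Assumption for illustration: the surrogate over-reports target dose by 3 %.
relative_bias = 0.03
true_dose = surrogate_dose / (1.0 + relative_bias)


def coverage(dose, threshold):
    """Fraction of voxels receiving at least `threshold` Gy."""
    return float(np.mean(dose >= threshold))


projected_coverage = coverage(surrogate_dose, threshold_gy)
actual_coverage = coverage(true_dose, threshold_gy)
print(f"projected V95%: {projected_coverage:.3f}, actual V95%: {actual_coverage:.3f}")
```

In this toy setting the optimizer sees full coverage while the delivered coverage is substantially lower, and nothing in the surrogate output reveals the gap.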

There are prototypes and products that use machine learning in the planning and dose estimation stages, but the intended use must be verified in the documentation for each version. The distinction between “AI-accelerated” and “MC/LBTE-calculated with hardware acceleration” is crucial. GPUMCD, for example, is Monte Carlo on GPU, not a neural network.

Reduced latency can support adaptive workflows and repeated calculations. The cost is that part of the performance guarantee shifts to the training data, the validity domain, and the failure detection controls.

Difference between predicting dose and transporting particles

Transporting particles, in the physical sense, means solving (exactly, approximately or stochastically) the Boltzmann equation for radiation transport, accounting for material-dependent interaction cross sections, locally deposited energy and secondary scattering. MC samples individual trajectories of photons, electrons and secondary particles. LBTE/Acuros XB solves the equation in its deterministic form over a spatial mesh. Pencil Beam decomposes the beam into pencils and applies water-calibrated scattering kernels, with empirical corrections for inhomogeneities. AAA (Analytical Anisotropic Algorithm) uses separate convolution models for primary photons, scattered (extra-focal) photons, and contaminating electrons. All of these algorithms have parameters with direct physical meaning and can be, at least in principle, commissioned and validated against independent phantom measurements.

A dose prediction neural network does not solve any of these equations. It learns a function — potentially of very high dimensionality — that maps the geometry of the problem (CT morphology in Hounsfield units, structure contours, beam configuration) to a dose distribution, minimizing a loss functional over the training set. The learned mapping is, by construction, an interpolation over the manifold of cases seen during training. Outside of this manifold — an unusual anatomy, an unrepresented combination of energies, an atypical beam geometry — the network will extrapolate in an unpredictable manner, with no guarantee of physical coherence.

This distinction has direct implications for concepts such as dose to medium (Dm) and dose to water (Dw). Algorithms such as Acuros XB allow you to explicitly choose which quantity is calculated, with clinical consequences discussed in the literature especially in bone-tissue interfaces and in proton therapy. A surrogate model trained on Dm outputs implicitly “learns” this convention, but will not make it explicit. A convention change in the reference engine during retraining may go unnoticed — a structural example of silent failure.
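One way to keep the dose convention from changing unnoticed is to record it as explicit metadata and check it at retraining time. The following is a minimal sketch: the `DoseProvenance` record, its fields and the `check_convention` helper are hypothetical names introduced here, not part of any TPS or vendor API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DoseProvenance:
    """Hypothetical provenance record attached to a model or a training set."""
    engine: str
    engine_version: str
    convention: str  # "Dm" (dose to medium) or "Dw" (dose to water)


def check_convention(model: DoseProvenance, training_set: DoseProvenance) -> None:
    """Raise if the training labels use a different convention than the model."""
    if model.convention != training_set.convention:
        raise ValueError(
            f"model reports {model.convention} but training labels are "
            f"{training_set.convention} ({training_set.engine} "
            f"{training_set.engine_version})"
        )


# Example values are illustrative only.
deployed = DoseProvenance("Acuros XB", "example-v1", "Dm")
retraining_labels = DoseProvenance("Acuros XB", "example-v2", "Dw")

try:
    check_convention(deployed, retraining_labels)
    convention_ok = True
except ValueError as err:
    convention_ok = False
    print(f"retraining blocked: {err}")
```

The check is trivial by design: its value lies in forcing the convention to be written down where a retraining pipeline can read it.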

Another relevant aspect is incremental convergence: in MC, more particle histories equate to lower statistical uncertainty, and the user can balance calculation time and accuracy in a controlled way. In an ML model, there is no equivalent mechanism — the output is deterministic for a given input, and the model’s uncertainty is fixed, determined by the training phase.
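For reference, the controlled trade-off in MC follows the standard scaling of statistical uncertainty with the number of histories. The symbols are introduced here for illustration: $N$ is the number of particle histories, $s$ the per-history sample standard deviation of the dose scored in a voxel, and $\sigma_{\bar{D}}$ the standard error of the mean voxel dose.

```latex
\sigma_{\bar{D}}(N) \approx \frac{s}{\sqrt{N}},
\qquad
\frac{\sigma_{\bar{D}}(4N)}{\sigma_{\bar{D}}(N)} = \frac{1}{2}
```

Halving the statistical uncertainty therefore costs roughly four times as many histories. A trained network offers no comparable knob at inference time.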

Training Data, Bias, and Validity Domain

The performance of any surrogate model is fundamentally limited by the quality, quantity, and diversity of the training data. For dose prediction, the data set is generally clinically approved plans at one or more institutions, with dose distributions calculated by institutional TPS as the label (ground truth). Two structural problems immediately emerge.

First, the label is not the actual dose: it is the dose calculated by the TPS algorithm, with its own uncertainties and approximations. If the TPS used Pencil Beam for lung cases with severe heterogeneities, and the model learns to reproduce Pencil Beam, there is no gain in physical accuracy; there is only acceleration of an imprecise method. Second, the training data reflect local planning patterns and biases: preferred beam arrangements, normalization criteria, margin philosophies. A model trained in a highly specialized center may not generalize to a center with different patient populations, equipment or practices.

The table below summarizes the most relevant sources of bias in training datasets for dose models:

Source of bias | Description | Potential clinical impact
Case selection bias | Atypical or difficult cases excluded from clinical approval | Model underestimates complexity; failure in difficult scenarios
Reference algorithm bias | Ground truth generated by an engine with known limitations (e.g., Pencil Beam in lung) | Preserves systematic errors from the original engine
Institutional bias | Single-center planning patterns | Low generalizability to other institutions
Anatomical selection bias | Underrepresentation of rare or post-surgical anatomies | Silent failure in cases outside the distribution
Temporal bias | Changes in protocols, fixtures or equipment during data collection | Inconsistency in training labels

The concept of domain of validity — the space of inputs over which the model can be considered reliable — is analogous to the commissioning scope of a physical engine, but much more difficult to delimit. For a conventional TPS, commissioning explicitly defines the energies, field sizes, phantom geometries, and tissues for which the engine has been validated. For an ML model, this space is implicitly defined by the distribution of the training data, and there is no standardized protocol to formally characterize it.
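Although no standardized protocol exists, the explicit part of a validity domain can at least be written down and checked mechanically. The sketch below assumes a hypothetical `ValidatedScope` record and case dictionary; the field names and example values are illustrative. It covers only declared categorical and range limits, not the implicit distribution of the training data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidatedScope:
    """Hypothetical description of what a surrogate model was validated for."""
    sites: frozenset
    energies: frozenset
    techniques: frozenset
    field_size_cm: tuple  # (min, max) equivalent square field size


def scope_violations(case: dict, scope: ValidatedScope) -> list:
    """Return human-readable reasons why `case` falls outside `scope`."""
    reasons = []
    checks = (
        ("site", scope.sites),
        ("energy", scope.energies),
        ("technique", scope.techniques),
    )
    for key, allowed in checks:
        if case[key] not in allowed:
            reasons.append(f"{key} {case[key]!r} is outside the validated scope")
    low, high = scope.field_size_cm
    if not low <= case["field_size_cm"] <= high:
        reasons.append(f"field size {case['field_size_cm']} cm is outside [{low}, {high}] cm")
    return reasons


scope = ValidatedScope(
    sites=frozenset({"prostate"}),
    energies=frozenset({"6 MV FFF"}),
    techniques=frozenset({"VMAT", "IMRT"}),
    field_size_cm=(3.0, 20.0),
)
in_scope_case = {"site": "prostate", "energy": "6 MV FFF", "technique": "VMAT", "field_size_cm": 10.0}
out_of_scope_case = {"site": "lung", "energy": "10 MV", "technique": "VMAT", "field_size_cm": 1.5}

for reason in scope_violations(out_of_scope_case, scope):
    print(reason)
```

An empty list means only that no declared limit was exceeded; it does not show that the case resembles the training data.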

Generalization to machines, energies and anatomies

One of the most practical challenges for clinical adoption is the transferability of models between linear accelerators, beam energies and patient populations. A model trained on data from a specific accelerator with 6 MV FFF has, a priori, no guarantee of correct behavior on a different platform, at 10 MV, or with flattened beams. Differences in the energy spectrum, electron contamination, virtual source size and beam profiles result in qualitatively distinct dose distributions in the build-up region, the penumbra and near inhomogeneities.

The literature describes approaches to transfer learning and domain adaptation to reduce the cost of re-training when migrating to a new machine, but validation evidence for clinical use is still limited and mostly comes from academic groups. Commercial implementations must be evaluated for the exact scope of machines and energies for which the model has been validated by the manufacturer — information that should appear in the system’s technical documentation, not in marketing material.

The anatomical dimension is equally critical. Models trained predominantly on prostate cases tend to perform better at that site and worse in the head and neck, where proximity to critical organs at risk (OARs) and anatomical variability are greater. The following table summarizes the relationship between case complexity and extrapolation risk:

Case category | Relative complexity | Model extrapolation risk
Conventional prostate (7-field IMRT) | Low | Low, if represented in training
Head and neck (VMAT) | High | Moderate to high
Lung with severe heterogeneities | High | High, especially for Dm/Dw and penumbra
Post-surgery with metallic prostheses | Very high | High: CT artifacts out of distribution
Pediatric | Medium-high | High: anatomy underrepresented in most sets
Re-irradiation | High | High: accumulated dose not modeled in training

Post-surgical anatomies, the presence of metallic implants with CT artifacts, and pediatric cases represent high-risk extrapolation scenarios that deserve specific escalation protocols for verification by an independent physical engine.

Uncertainty, outlier detection and silent failures

A limitation of classical deterministic engines (AAA, Acuros XB, Pencil Beam) is that they produce a single dose value per voxel, with no estimate of the uncertainty of the model itself; the only uncertainty information comes from the commissioning measurements. Paradoxically, machine learning methods offer tools for estimating predictive uncertainty: Monte Carlo Dropout, deep ensembles, conformal prediction and probabilistic models such as Bayesian neural networks. When implemented, these techniques allow the model to indicate regions of greater uncertainty, a valuable diagnostic signal that deterministic engines do not provide.
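As an illustration of the deep-ensemble idea, the sketch below computes a per-voxel mean and spread across the predictions of several independently trained models and flags voxels where they disagree. The function name, the 3 % threshold relative to maximum mean dose, and the toy predictions are assumptions made for this example; ensemble spread is an indicator of disagreement, not a calibrated error bound.

```python
import numpy as np


def ensemble_uncertainty(predictions, rel_threshold=0.03):
    """Summarize an ensemble of dose predictions.

    predictions: array of shape (n_models, *dose_grid_shape), in Gy.
    Returns the per-voxel mean, the per-voxel sample standard deviation, and
    a boolean mask of voxels whose spread exceeds rel_threshold times the
    maximum mean dose.
    """
    predictions = np.asarray(predictions, dtype=float)
    mean = predictions.mean(axis=0)
    std = predictions.std(axis=0, ddof=1)
    flagged = std > rel_threshold * mean.max()
    return mean, std, flagged


# Toy 1D example: three hypothetical models, four voxels.
toy_predictions = [
    [60.0, 60.0, 30.0, 2.0],
    [60.0, 61.0, 36.0, 2.0],
    [60.0, 59.0, 24.0, 2.0],
]
mean, std, flagged = ensemble_uncertainty(toy_predictions)
print("voxels flagged for review:", np.flatnonzero(flagged))
```

Here the models agree in the high-dose plateau and disagree in the gradient voxel, which is the kind of region a reviewer would want highlighted.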

The problem is that these techniques are rarely available in commercial implementations and still lack robust clinical validation. The opposite — and clinically more dangerous — risk is that of silent failure: the model produces a dose distribution that is plausible in appearance (passing simple DVH and isodose checks) but systematically wrong in specific regions, without any warning indicator. Documented examples include errors in regions of high heterogeneity (air-tissue interfaces, lung), shallow build-up, and small fields — exactly the regions where simpler algorithms like Pencil Beam also fail, but for well-understood and auditable physical reasons.

Outlier detection, that is, identifying cases outside the domain of validity before the prediction is used, is an active area of research. Metrics such as distance in latent feature space, anomaly scores based on autoencoders, and comparison with training distributions have been explored. In the absence of automatic tools, the practical approach is to: (1) define explicit exclusion criteria based on the characteristics of the training set; (2) require independent verification by a physical engine for cases in high-risk categories; and (3) implement discrepancy reporting processes as part of routine QA.
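A distance in feature space can be sketched with the Mahalanobis distance, one common choice among several. The example assumes that each case has already been reduced to a feature vector (for instance by a network encoder); the random features, the helper names and the 99th-percentile threshold are illustrative assumptions.

```python
import numpy as np


def fit_reference(features):
    """Estimate mean and inverse covariance from training-set feature vectors.

    features: array of shape (n_cases, n_features).
    """
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    cov = np.atleast_2d(np.cov(features, rowvar=False))
    inv_cov = np.linalg.pinv(cov)  # pseudo-inverse tolerates a singular covariance
    return mean, inv_cov


def mahalanobis(x, mean, inv_cov):
    """Mahalanobis distance of one feature vector from the training distribution."""
    delta = np.asarray(x, dtype=float) - mean
    return float(np.sqrt(delta @ inv_cov @ delta))


# Synthetic stand-in for training-set features (illustrative only).
rng = np.random.default_rng(0)
train_features = rng.normal(size=(500, 3))
mean, inv_cov = fit_reference(train_features)

# Calibrate a threshold on the training cases themselves.
train_distances = [mahalanobis(f, mean, inv_cov) for f in train_features]
threshold = float(np.percentile(train_distances, 99))

typical_distance = mahalanobis([0.0, 0.0, 0.0], mean, inv_cov)
unusual_distance = mahalanobis([8.0, 8.0, 8.0], mean, inv_cov)
print(f"threshold {threshold:.2f}, typical {typical_distance:.2f}, unusual {unusual_distance:.2f}")
```

A case whose distance exceeds the threshold would be routed to independent verification rather than rejected outright. How well such a score tracks actual dose error has to be established during validation.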

How to compare AI, Monte Carlo and deterministic solvers

The comparison between calculation engines must be structured in at least three independent dimensions: physical accuracy, computational performance and clinical validation maturity. Often, discussions about AI versus MC inappropriately collapse these dimensions, generating claims that are true in one dimension and misleading in the others.

The AAPM report TG-105 establishes a methodological framework for MC commissioning in radiotherapy that remains relevant as a reference for any high-fidelity engine. The proposed acceptance criteria — gamma comparisons, DVH analyses, specific test scenarios — can and should be applied to surrogate models when they are used as the primary calculation engine. The fundamental difference is that, for MC, statistical convergence can be increased with more particle histories; for an ML model, there is no equivalent self-refinement mechanism at inference time.

Gamma analysis is common, but on its own it does not demonstrate clinical equivalence. Assessment should include DVHs, per-structure metrics, error maps, worst-performing cases, and out-of-distribution testing, with criteria defined before validation.
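To make the metric concrete, here is a deliberately simplified 1D global gamma index. It assumes both profiles share one grid, searches only grid points (no interpolation), normalizes the dose tolerance to the maximum reference dose, and applies no low-dose threshold; clinical gamma tools differ in all of these respects. The function name and the Gaussian test profile are assumptions for illustration.

```python
import numpy as np


def gamma_1d(positions_mm, dose_ref, dose_eval, dose_tol=0.03, dist_tol_mm=3.0):
    """Global 1D gamma index for each reference point.

    dose_tol is a fraction of the maximum reference dose (3 % by default);
    dist_tol_mm is the distance-to-agreement criterion (3 mm by default).
    """
    x = np.asarray(positions_mm, dtype=float)
    ref = np.asarray(dose_ref, dtype=float)
    ev = np.asarray(dose_eval, dtype=float)
    dose_criterion = dose_tol * ref.max()
    # Rows index reference points, columns index evaluated points.
    dist_term = (x[:, None] - x[None, :]) / dist_tol_mm
    dose_term = (ev[None, :] - ref[:, None]) / dose_criterion
    return np.sqrt(dist_term**2 + dose_term**2).min(axis=1)


x = np.arange(0.0, 50.0, 1.0)
reference = 100.0 * np.exp(-(((x - 25.0) / 10.0) ** 2))
shifted = 100.0 * np.exp(-(((x - 27.0) / 10.0) ** 2))  # same profile, moved 2 mm
scaled = 1.2 * reference                               # 20 % output error

gamma_shifted = gamma_1d(x, reference, shifted)
gamma_scaled = gamma_1d(x, reference, scaled)
print("pass rate, shifted:", float(np.mean(gamma_shifted <= 1.0)))
print("pass rate, scaled:", float(np.mean(gamma_scaled <= 1.0)))
```

In this toy profile the 20 % scaling error fails at the peak but still passes in the low-dose tails, so the overall pass rate stays well above zero. This is one reason a pass rate alone should not be read as evidence of equivalence.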

The proton physics literature specifically discusses validation challenges where range uncertainties add a dimension that analytical algorithms address in a simplified way and MC addresses more fully. Surrogate models for protons face the additional challenge of correctly modeling the Bragg peak region and halo effects, which are highly sensitive to tissue composition—exactly the type of variability that may not be well represented in the training set.

Clinical validation, governance and responsible use

Clinical validation of a dose surrogate model goes beyond technical commissioning. It covers the complete process of introducing a new technology into patient care, including risk assessment, staff training, definition of the scope of use and continuous monitoring mechanisms. The concept of digital twins in oncology, discussed in recent reviews, illustrates the ambition for personalized models of treatment response, but also highlights the gap between technological promise and the clinical evidence available for routine use.

From a regulatory perspective, classification and responsibilities depend on jurisdiction, intended use and commercial configuration. On-site retraining, in-house integration, or out-of-scope use may change the applicable obligations. The institution must involve its quality, regulatory affairs and safety functions before clinical use.

Internal governance must establish, at a minimum:

  • Commissioning protocol with acceptance criteria that are defined in advance and cannot be adjusted post hoc;
  • Documented definition of the clinical scope of use (anatomical sites, techniques, energies, age groups);
  • Escalation process for cases that exceed the scope, with independent engine verification;
  • Periodic audits comparing surrogate model outputs with independent calculations on a sample of real clinical cases;
  • Reporting and investigation process for discrepancies, integrated into the institution’s quality management system.
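The periodic audit item can be sketched as a per-structure comparison of a DVH metric between the surrogate and an independent calculation. The function names, the choice of D95, the 2 % tolerance and the toy dose grids are assumptions for illustration; an institution would set its own metrics and limits in the commissioning protocol.

```python
import numpy as np


def dose_at_volume(dose, mask, percent):
    """D_percent: the dose received by at least `percent` % of the structure volume."""
    voxels = np.asarray(dose, dtype=float)[np.asarray(mask, dtype=bool)]
    return float(np.percentile(voxels, 100.0 - percent))


def audit_case(dose_surrogate, dose_reference, structures, percent=95.0, tolerance=0.02):
    """Return {structure name: relative difference} where the metric exceeds tolerance.

    Assumes the reference metric is non-zero for every audited structure.
    """
    flagged = {}
    for name, mask in structures.items():
        d_sur = dose_at_volume(dose_surrogate, mask, percent)
        d_ref = dose_at_volume(dose_reference, mask, percent)
        rel_diff = (d_sur - d_ref) / d_ref
        if abs(rel_diff) > tolerance:
            flagged[name] = rel_diff
    return flagged


# Toy 4x4 dose grid: top half is the target, bottom half an organ at risk.
dose_reference = np.vstack([np.full((2, 4), 60.0), np.full((2, 4), 20.0)])
dose_surrogate = dose_reference.copy()
dose_surrogate[:2] *= 1.05  # hypothetical 5 % over-report in the target

ptv_mask = np.zeros((4, 4), dtype=bool)
ptv_mask[:2] = True
structures = {"PTV": ptv_mask, "OAR": ~ptv_mask}

discrepancies = audit_case(dose_surrogate, dose_reference, structures)
print(discrepancies)
```

Flagged structures would feed the discrepancy reporting process rather than trigger an automatic decision.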

The underlying ethical issue is that radiotherapy planning involves decisions with direct consequences for the patient. A gain in speed is clinically useful only when uncertainty, domain of validity, oversight, and accountability are defined.

FAQ

Can an AI model with a high gamma pass rate relative to MC be considered equivalent to MC for clinical use?

Not necessarily. High gamma agreement on the validation set demonstrates average performance over the tested cases, but does not guarantee correct behavior outside the training domain. Clinical equivalence requires validation on cases representative of the entire range of situations in which the model will be used, including edge cases and adverse scenarios, not just typical cases. Furthermore, MC has an incremental convergence mechanism (more histories, lower statistical uncertainty); the ML model does not. The comparison should include worst-case analysis and DVH metrics per structure, not just the median gamma pass rate.

How to differentiate, in the TPS documentation, whether the engine uses real AI or GPU acceleration?

Search the technical documentation for the terms “machine learning”, “neural network”, “deep learning” or “trained model”. GPU-accelerated engines like GPUMCD are stochastic MC on GPU; their documentation will describe particle sampling, cross sections, and statistical convergence. The documentation of an ML model will describe network architecture, training data, and validation metrics. In case of ambiguity, ask the manufacturer for the Intended Use Statement and the clinical validation documentation for the specific engine, documents that must exist for any regulated device.

What is the impact of the dose-to-medium / dose-to-water distinction on dose surrogate models?

The model learns to reproduce the convention of the engine that generated the training data (Dm or Dw), but rarely makes this convention explicit to the user. If the reference engine is Acuros XB set to Dw, the model will output Dw implicitly; if set to Dm, it will output Dm. In anatomies with a high proportion of cortical bone or air-tissue interface, the difference between Dm and Dw may be clinically relevant. The user must track and document which convention the model reproduces, ensuring that the plan acceptance criteria are consistent with it.

Is it possible to use an AI-based dose model trained at another institution without local re-training?

Transferability depends on population, equipment, energy and protocols. Even with multicenter training, it is necessary to validate performance in the local environment with representative cases and adequate independent references. The validation scope must match the intended use.

What are the highest risk scenarios for silent failure in dose surrogate models?

Highest risk scenarios include small fields, high heterogeneity, superficial build-up, post-surgical anatomies, implants and re-irradiation. In these cases, the protocol should require additional controls proportionate to the risk, including independent comparison when technically appropriate.
