Validating an AI-predicted dose model requires more than running gamma analysis on average cases. The question to answer first is: for which intended use, patient population, TPS, beam energies, structures, and failure limits will the model be accepted?

This checklist complements the comparison of MVision, RayStation, and OptiPlan and the DoseRAD2026 benchmark. The goal is to turn enthusiasm for AI into an auditable medical physics process.

[Infographic: Clinical validation guardrails for AI-predicted dose models. Original RT Medical Systems infographic for the AI-predicted dose cluster.]

1. Define intended use before metrics

The same model may be acceptable for pre-planning and unacceptable for automatic approval. Document whether the output will be used as a visual estimate, optimization reference dose, difficult-case triage, secondary dose, adaptive support, or operational substitute for a physical calculation.

  • Required input: CT, MRI, sCT, structures, prescription, beam geometry, or complete plan.
  • Allowed output: beam dose, plan dose, optimization objectives, or warning.
  • Authorized user: dosimetrist, physicist, physician, or automated pipeline.
  • Allowed action: inform, suggest, automate, or block approval.
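One way to make the intended-use statement auditable is to record it as a small, immutable data structure that software can check before the model output is used. The following is a minimal sketch, not a reference implementation: the class names, field names, and the `PRE_PLANNING` example values are all hypothetical and would need to match the department's own documentation.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    INFORM = "inform"
    SUGGEST = "suggest"
    AUTOMATE = "automate"
    BLOCK_APPROVAL = "block approval"


@dataclass(frozen=True)
class IntendedUse:
    """Hypothetical record of one documented intended use."""
    name: str
    required_inputs: frozenset
    allowed_outputs: frozenset
    authorized_users: frozenset
    allowed_actions: frozenset

    def permits(self, user, action, available_inputs):
        """True only if user, action, and inputs all match the documented use."""
        return (
            user in self.authorized_users
            and action in self.allowed_actions
            and self.required_inputs <= set(available_inputs)
        )


# Example values are illustrative assumptions, not recommendations.
PRE_PLANNING = IntendedUse(
    name="pre-planning visual estimate",
    required_inputs=frozenset({"CT", "structures", "prescription"}),
    allowed_outputs=frozenset({"plan dose"}),
    authorized_users=frozenset({"dosimetrist", "physicist"}),
    allowed_actions=frozenset({Action.INFORM, Action.SUGGEST}),
)
```

With this record, a request to automate approval from the pre-planning model is rejected by construction, which mirrors the point that the same model may be acceptable for one use and unacceptable for another.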

2. Build a local validation cohort

Multicenter training does not remove the need for local validation. The cohort must cover the protocols actually in use, anatomical extremes, implants, prostheses, air cavities, cortical bone, re-irradiation, PTVs adjacent to OARs, and TPS version changes. Easy cases matter, but edge cases test safety.

Risk                     | Example                                     | Minimum control
Out of domain            | Post-surgical anatomy or metal implant      | OOD detector and mandatory physics review
Localized physical error | Air-tissue interface, build-up, small field | Region-specific metric and independent comparison
Clinically hidden error  | OAR DVH worsens with acceptable gamma       | D2%, Dmean, Vx, and structure-level review
Version change           | New MLC, TPS, or protocol                   | Regression revalidation before use
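Whether the local cohort actually contains the edge cases listed above can be checked mechanically before any metric is computed. The sketch below assumes each case carries a list of free-form tags; the tag names and the `min_cases` default are hypothetical placeholders, not recommended sample sizes.

```python
from collections import Counter

# Hypothetical stratum tags; a department would define its own list.
REQUIRED_STRATA = ["metal_implant", "air_interface", "re_irradiation", "ptv_near_oar"]


def coverage_gaps(cases, required=REQUIRED_STRATA, min_cases=5):
    """Return the required strata that have fewer than min_cases cases.

    Each case is a dict with a "tags" list. A case counts once per stratum
    even if a tag is repeated.
    """
    counts = Counter(tag for case in cases for tag in set(case["tags"]))
    return {s: counts.get(s, 0) for s in required if counts.get(s, 0) < min_cases}
```

An empty result means every required stratum is represented at least `min_cases` times; any other result names the strata that still need cases before validation can be considered representative.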

3. Use layered metrics

Gamma analysis is useful but insufficient on its own. Combine voxel metrics, DVH, structure-level error, and worst-case analysis. For fast beam-level models, include segment or beamlet error. For full planning, include prescription, coverage, and OAR metrics.

  • 3D local gamma with documented dose-difference and distance-to-agreement criteria and a documented low-dose threshold.
  • MAE in high-, mid-, and low-dose regions.
  • D98%, V95%, D2%, Dmean, and protocol-specific metrics.
  • Error at interfaces and high-density materials.
  • Inference time and batch failure rate.
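Several of the metrics above reduce to a few lines of array arithmetic once the predicted and reference doses are on the same grid. The following sketch assumes NumPy arrays with equal-volume voxels, a boolean structure mask, and doses in the same unit as the prescription. The function names and the dose-band boundaries (10%, 50%, and 90% of prescription) are illustrative assumptions, and the DVH values use NumPy's default percentile interpolation rather than a TPS-specific DVH algorithm.

```python
import numpy as np


def dvh_metrics(dose, mask, prescription):
    """DVH point metrics for one structure, assuming equal voxel volumes."""
    d = dose[mask]
    return {
        # D98% is the dose received by at least 98% of the volume,
        # i.e. the 2nd percentile of the voxel doses.
        "D98%": float(np.percentile(d, 2)),
        "D2%": float(np.percentile(d, 98)),
        "Dmean": float(d.mean()),
        # V95% is the percentage of volume receiving >= 95% of prescription.
        "V95%": float(np.mean(d >= 0.95 * prescription) * 100.0),
    }


def mae_by_dose_region(pred, ref, prescription, low=0.1, mid=0.5, high=0.9):
    """Mean absolute error in low-, mid-, and high-dose bands of the reference."""
    frac = ref / prescription
    regions = {
        "low": (frac >= low) & (frac < mid),
        "mid": (frac >= mid) & (frac < high),
        "high": frac >= high,
    }
    err = np.abs(pred - ref)
    # An empty band yields NaN so it cannot be mistaken for zero error.
    return {k: float(err[m].mean()) if m.any() else float("nan")
            for k, m in regions.items()}
```

Comparing `dvh_metrics` for the predicted and reference dose, structure by structure, exposes the case where an OAR metric worsens while the global gamma result still looks acceptable.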

4. Treat OOD as a safety requirement

AI models can fail silently. A mature protocol must state when the model should refuse, warn, or require independent review. Examples include anatomy outside the training distribution, missing contours, non-standard structure names, degraded MRI, an unusual isocenter, and a prescription incompatible with the model.
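Some of these conditions can be caught with simple rule-based checks that run before inference, independent of any learned OOD detector. The sketch below is one possible shape for such a gate. The function name, the severity ordering, and the numeric defaults (`metal_hu`, `max_iso_offset_mm`) are hypothetical and would have to be set and justified locally.

```python
from enum import IntEnum


class Gate(IntEnum):
    # Ordered by severity so the strictest finding wins.
    ACCEPT = 0
    WARN = 1
    REVIEW = 2
    REFUSE = 3


def gate_case(structures, required, known_names, max_hu, iso_offset_mm,
              metal_hu=3000.0, max_iso_offset_mm=50.0):
    """Return (decision, reasons) from rule-based input checks.

    Thresholds are placeholder assumptions, not validated limits.
    """
    findings = []
    missing = set(required) - set(structures)
    if missing:
        findings.append((Gate.REFUSE, f"missing contours: {sorted(missing)}"))
    unknown = set(structures) - set(known_names)
    if unknown:
        findings.append((Gate.WARN, f"non-standard names: {sorted(unknown)}"))
    if max_hu >= metal_hu:
        findings.append((Gate.REVIEW, "possible metal implant (high HU)"))
    if iso_offset_mm > max_iso_offset_mm:
        findings.append((Gate.WARN, "unusual isocenter"))
    decision = max((g for g, _ in findings), default=Gate.ACCEPT)
    return decision, [msg for _, msg in findings]
```

Returning the reasons alongside the decision keeps the refusal traceable, so a physicist can see why a case was routed to review rather than only that it was.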

5. Separate scientific validation from clinical validation

A benchmark such as DoseRAD2026 measures performance under controlled rules. Clinical validation must also include DICOM integration, permissions, logs, traceability, model updates, cybersecurity, user training, and rollback.

6. Suggested acceptance structure

Final thresholds belong to the department and depend on the intended use. Still, validation must define limits before final testing, not after. Define separate criteria for automatic rejection, mandatory physics review, and approval for assistive use.

  • No critical case may show a clinically relevant DVH error without the model raising a warning.
  • Stratified performance by site, protocol, and complexity.
  • Reproducibility after TPS, model, or library update.
  • Failure log and periodic technical committee review.
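Fixing the limits in a frozen structure before final testing makes it harder to adjust them after seeing the results. The following sketch shows one hypothetical three-tier classification with a per-site summary. The class, field, and outcome names are assumptions for illustration, and no numeric limit is implied by this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcceptanceLimits:
    """Limits fixed by the department before final testing."""
    reject_dvh_error_pct: float   # at or above this: automatic rejection
    review_dvh_error_pct: float   # at or above this: physics review
    min_gamma_pass_pct: float     # below this: physics review


def classify_case(gamma_pass_pct, worst_dvh_error_pct, limits):
    """Classify one case; a high gamma pass rate never overrides a DVH error."""
    if worst_dvh_error_pct >= limits.reject_dvh_error_pct:
        return "automatic-rejection"
    if (worst_dvh_error_pct >= limits.review_dvh_error_pct
            or gamma_pass_pct < limits.min_gamma_pass_pct):
        return "physics-review"
    return "assistive-use"


def summarize_by_stratum(results, limits, key="site"):
    """Count outcomes per stratum so averages cannot hide a weak subgroup."""
    summary = {}
    for r in results:
        outcome = classify_case(r["gamma_pass_pct"], r["worst_dvh_error_pct"], limits)
        counts = summary.setdefault(r[key], {})
        counts[outcome] = counts.get(outcome, 0) + 1
    return summary
```

Checking the DVH error first is a deliberate design choice in this sketch: it encodes the rule that a case with acceptable gamma but a clinically relevant structure-level error must not be approved.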

FAQ

What is the biggest validation mistake?

Validating only the average. A model may look good overall and fail exactly in the rare cases that require stronger physics control.

Can the model be accepted if the gamma pass rate is high?

Not automatically. Gamma must be combined with DVH, structure-level assessment, error in critical regions, and out-of-domain analysis.

When should revalidation happen?

After updates to TPS, model, protocol, structure set, scanner, imaging modality, energy, MLC, or treated population.
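A revalidation trigger of this kind can be automated by comparing the configuration that was validated against the configuration currently in use. The sketch below uses only the Python standard library. The configuration keys are hypothetical examples and should be replaced by whatever the department actually tracks.

```python
import hashlib
import json


def config_fingerprint(config):
    """Stable SHA-256 fingerprint of a JSON-serializable configuration."""
    payload = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def changed_items(validated, current):
    """Sorted list of keys whose value differs from the validated state."""
    keys = set(validated) | set(current)
    return sorted(k for k in keys if validated.get(k) != current.get(k))


# Hypothetical tracked items; names and values are placeholders.
validated = {"tps_version": "A", "model_version": "1.0", "mlc": "X", "energy": "6MV"}
current = {"tps_version": "B", "model_version": "1.0", "mlc": "X", "energy": "6MV"}

if config_fingerprint(validated) != config_fingerprint(current):
    print("Revalidation required; changed:", changed_items(validated, current))
```

Storing the fingerprint with each validation report gives a quick check at inference time, while `changed_items` tells the reviewer which component caused the mismatch.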

References

  1. AAPM TG-218. https://www.aapm.org/pubs/reports/RPT_218.pdf
  2. FDA AI/ML-enabled Software as a Medical Device. https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-and-machine-learning-software-medical-device
  3. DoseRAD2026 metrics and ranking. https://doserad2026.grand-challenge.org/metrics-and-ranking/
  4. RaySearch deep learning planning. https://www.raysearchlabs.com/media/publications/white-papers/deep-learning-planning/
  5. MVision Dose+. https://mvision.ai/dose/