How to Validate AI-Predicted Dose: QA, Commissioning, and Clinical Risk

Q: What is the biggest validation mistake?

Validating only the average. Rare and out-of-domain cases are the most important safety tests.

Q: When should revalidation happen?

After changes to TPS, model, protocol, structure set, scanner, imaging, energy, MLC, or treated population.

Validating AI-predicted dose requires more than checking gamma pass rates on average cases. The correct question is: for which intended use, patient population, TPS, energies, structures, and failure limits will the model be accepted?

This checklist complements the comparison of MVision, RayStation, and OptiPlan and the DoseRAD2026 benchmark. The goal is to turn enthusiasm for AI into an auditable medical physics process.

Figure: Clinical validation guardrails for AI-predicted dose models (original RT Medical Systems infographic for the AI-predicted dose cluster).

1. Define intended use before metrics

The same model may be acceptable for pre-planning and unacceptable for automatic approval. Document whether the output will be used as a visual estimate, optimization reference dose, difficult-case triage, secondary dose, adaptive support, or operational substitute for a physical calculation.

Required input: CT, MRI, sCT, structures, prescription, beam geometry, or complete plan.
Allowed output: beam dose, plan dose, optimization objectives, or warning.
Authorized user: dosimetrist, physicist, physician, or automated pipeline.
Allowed action: inform, suggest, automate, or block approval.
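The four fields above can be written down as a machine-checkable record, so that a request outside the documented intended use is rejected rather than silently served. The following is a minimal sketch, not a vendor API: the class names, the `permits` method, and the example `PRE_PLANNING` profile are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    INFORM = "inform"
    SUGGEST = "suggest"
    AUTOMATE = "automate"
    BLOCK_APPROVAL = "block approval"


@dataclass(frozen=True)
class IntendedUse:
    """Hypothetical record of one documented intended use."""
    name: str
    required_inputs: frozenset
    allowed_outputs: frozenset
    authorized_users: frozenset
    allowed_actions: frozenset

    def permits(self, user, output, action):
        # A request is allowed only if every element is explicitly listed.
        return (user in self.authorized_users
                and output in self.allowed_outputs
                and action in self.allowed_actions)


# Example profile (assumed values): a visual estimate for pre-planning only.
PRE_PLANNING = IntendedUse(
    name="pre-planning visual estimate",
    required_inputs=frozenset({"CT", "structures", "prescription"}),
    allowed_outputs=frozenset({"plan dose", "warning"}),
    authorized_users=frozenset({"dosimetrist", "physicist"}),
    allowed_actions=frozenset({Action.INFORM, Action.SUGGEST}),
)
```

With this profile, `PRE_PLANNING.permits("physicist", "plan dose", Action.INFORM)` is true, while any request from an automated pipeline, or any request to automate, is false because it was never listed.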

2. Build a local validation cohort

Multicenter training does not remove the need for local validation. The sample must cover real protocols, anatomical extremes, implants, prostheses, air cavities, cortical bone, re-irradiation, PTVs close to OARs, and TPS version changes. Easy cases matter, but edge cases are what test safety.

Risk	Example	Minimum control
Out of domain	Post-surgical anatomy or metal implant	OOD detector and mandatory physics review
Localized physical error	Air-tissue interface, build-up, small field	Region-specific metric and independent comparison
Clinically hidden error	OAR DVH worsens with acceptable gamma	D2%, Dmean, Vx and structure-level review
Version change	New MLC, TPS, or protocol	Regression revalidation before use
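One way to make cohort coverage auditable is to tag each validation case with the risk categories it exercises and to check the counts against a minimum set before testing starts. The sketch below assumes a hypothetical case format (a dict with a `tags` list) and hypothetical minimum counts; the real strata and numbers are a departmental decision.

```python
from collections import Counter


def coverage_gaps(cases, required):
    """Return {stratum: number of cases still missing} for under-covered strata.

    cases    -- list of dicts with a 'tags' list (hypothetical format)
    required -- dict mapping stratum name to the minimum number of cases
    """
    # Count each tag at most once per case.
    counts = Counter(tag for case in cases for tag in set(case["tags"]))
    return {stratum: minimum - counts[stratum]
            for stratum, minimum in required.items()
            if counts[stratum] < minimum}


cohort = [
    {"id": "case-01", "tags": ["metal_implant"]},
    {"id": "case-02", "tags": ["metal_implant", "reirradiation"]},
    {"id": "case-03", "tags": []},  # an "easy" case with no edge-case tag
]
minimums = {"metal_implant": 2, "reirradiation": 2, "air_interface": 1}

print(coverage_gaps(cohort, minimums))
```

An empty result means every listed stratum meets its minimum. It says nothing about strata that were never listed, so the list itself needs physics review.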

3. Use layered metrics

Gamma is useful, but insufficient alone. Combine voxel metrics, DVH, structure-level error, and worst-case analysis. For fast beam-level models, include segment or beamlet error. For full planning, include prescription, coverage, and OAR metrics.

3D local gamma with a documented low-dose threshold and strict, explicitly stated dose-difference and distance-to-agreement criteria.
MAE in high-, mid-, and low-dose regions.
D98%, V95%, D2%, Dmean, and protocol-specific metrics.
Error at interfaces and high-density materials.
Inference time and batch failure rate.
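Two of the layers above, MAE by dose region and DVH point metrics, are simple enough to sketch directly. The code below is illustrative only: it assumes equal-volume voxels, dose values already extracted as flat lists, and hypothetical band edges at 20% and 80% of the prescription. It does not compute gamma, which needs a spatial search and is better left to a validated tool.

```python
import math
from statistics import mean


def mae_by_dose_band(reference, predicted, prescription, edges=(0.2, 0.8)):
    """Mean absolute error in low-, mid-, and high-dose regions.

    Bands are defined on the reference dose; the edges are assumed fractions
    of the prescription and should be replaced by the local protocol.
    """
    low_edge, high_edge = edges[0] * prescription, edges[1] * prescription
    errors = {"low": [], "mid": [], "high": []}
    for ref, pred in zip(reference, predicted):
        band = "low" if ref < low_edge else "high" if ref >= high_edge else "mid"
        errors[band].append(abs(pred - ref))
    return {band: (mean(vals) if vals else None) for band, vals in errors.items()}


def dose_at_volume(doses, volume_percent):
    """D_x%: the dose received by at least x% of the structure volume."""
    ordered = sorted(doses, reverse=True)
    k = max(1, math.ceil(len(ordered) * volume_percent / 100.0))
    return ordered[k - 1]


def volume_at_dose(doses, dose_level):
    """V_x: percentage of the structure volume receiving at least dose_level."""
    return 100.0 * sum(d >= dose_level for d in doses) / len(doses)


# Toy structure with ten equal-volume voxels (dose in Gy).
ptv = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(dose_at_volume(ptv, 98), dose_at_volume(ptv, 2), volume_at_dose(ptv, 57))
```

Comparing these values between the reference calculation and the AI prediction, structure by structure, is what exposes the case where gamma looks acceptable but an OAR DVH has worsened.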

4. Treat OOD as a safety requirement

AI models can fail silently. A mature protocol must state when the model should refuse, warn, or require independent review. Examples include anatomy outside training, missing contours, non-standard names, degraded MRI, unusual isocenter, and incompatible prescription.
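The refuse, warn, or require-review rule can be expressed as an input gate that runs before the dose model is called. This is a minimal sketch under stated assumptions: the case format, the `ood_score` field, and the `ood_limit` default are hypothetical, and a real OOD detector would need its own validation.

```python
from enum import IntEnum


class Gate(IntEnum):
    # Ordered by severity so that the strictest finding wins.
    PROCEED = 0
    WARN = 1
    REVIEW = 2
    REFUSE = 3


def gate_case(case, required_structures, known_names, ood_limit=0.5):
    """Return (gate, reasons) for one case before inference.

    'ood_score' and ood_limit are hypothetical placeholders for whatever
    out-of-domain detector the department has validated.
    """
    findings = []
    names = set(case.get("structures", []))

    missing = set(required_structures) - names
    if missing:
        findings.append((Gate.REFUSE, "missing contours: " + ", ".join(sorted(missing))))

    unknown = names - set(known_names)
    if unknown:
        findings.append((Gate.WARN, "non-standard names: " + ", ".join(sorted(unknown))))

    if case.get("ood_score", 0.0) > ood_limit:
        findings.append((Gate.REVIEW, "anatomy scored as outside the training domain"))

    gate = max((g for g, _ in findings), default=Gate.PROCEED)
    return gate, [reason for _, reason in findings]
```

The gate returns every reason it found, not only the most severe one, so the log shows why a case was stopped.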

5. Separate scientific validation from clinical validation

A benchmark such as DoseRAD2026 measures performance under controlled rules. Clinical validation must also include DICOM integration, permissions, logs, traceability, model updates, cybersecurity, user training, and rollback.
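For traceability and model updates, one simple pattern is to store a fingerprint of the configuration that was validated and compare it with the configuration in use. The sketch below uses only the Python standard library; the configuration keys and version strings are hypothetical examples, not a required schema.

```python
import hashlib
import json


def config_fingerprint(config):
    """SHA-256 of a canonical JSON form, so key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_revalidation(validated_fingerprint, current_config):
    """True when anything in the tracked configuration has changed."""
    return config_fingerprint(current_config) != validated_fingerprint


# Hypothetical configuration recorded at commissioning.
validated = {"tps": "TPS-A 1.0", "model": "dose-net 0.3", "mlc": "MLC-X", "energy": "6MV"}
stored = config_fingerprint(validated)

# Same content, different key order: no revalidation triggered.
same = {"energy": "6MV", "mlc": "MLC-X", "model": "dose-net 0.3", "tps": "TPS-A 1.0"}
# A TPS update changes the fingerprint.
updated = dict(validated, tps="TPS-A 1.1")

print(needs_revalidation(stored, same), needs_revalidation(stored, updated))
```

Writing the fingerprint into each inference log entry ties every predicted dose to the exact configuration that produced it, which also makes rollback decisions easier to justify.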

6. Suggested acceptance structure

Final thresholds belong to the department and intended use. Still, validation must define limits before final testing, not after. Include automatic rejection, physics-review, and assistive-use approval criteria.

No critical case may show a clinically relevant DVH error without the model raising a warning.
Stratified performance by site, protocol, and complexity.
Reproducibility after TPS, model, or library update.
Failure log and periodic technical committee review.
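The criteria above can be frozen as data before final testing and then applied per stratum, judging each stratum on its worst case. The following is a sketch only: the `Limits` class, the three outcome labels, and the numeric limits are hypothetical and stand in for values the department defines for its own intended use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    """Limits fixed before final testing (values below are placeholders)."""
    review_above: float  # worst-case DVH error (%) that requires physics review
    reject_above: float  # worst-case DVH error (%) that triggers automatic rejection


def decide(worst_error, limits):
    if worst_error > limits.reject_above:
        return "reject"
    if worst_error > limits.review_above:
        return "physics review"
    return "assistive use"


def decide_by_stratum(errors_by_stratum, limits):
    """Judge each stratum on its worst case, not on its average."""
    return {stratum: decide(max(errors), limits)
            for stratum, errors in errors_by_stratum.items()}


limits = Limits(review_above=2.0, reject_above=5.0)
errors = {
    "prostate": [0.5, 1.2],
    "head_neck": [0.8, 3.1],
    "reirradiation": [1.0, 6.4],
}
print(decide_by_stratum(errors, limits))
```

Because `Limits` is frozen and created before the test data is seen, the thresholds cannot be adjusted quietly after the results are known.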

FAQ

What is the biggest validation mistake?

Validating only the average. A model may look good overall and fail exactly in the rare cases that require stronger physics control.

Can the model be accepted if gamma is high?

Not automatically. Gamma must be combined with DVH, structure-level assessment, error in critical regions, and out-of-domain analysis.

When should revalidation happen?

After updates to TPS, model, protocol, structure set, scanner, imaging modality, energy, MLC, or treated population.

References

AAPM TG-218. https://www.aapm.org/pubs/reports/RPT_218.pdf
FDA AI/ML-enabled Software as a Medical Device. https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-and-machine-learning-software-medical-device
DoseRAD2026 metrics and ranking. https://doserad2026.grand-challenge.org/metrics-and-ranking/
RaySearch deep learning planning. https://www.raysearchlabs.com/media/publications/white-papers/deep-learning-planning/
MVision Dose+. https://mvision.ai/dose/
