
Not all imaging AI is created equal when the task is reading a chest x-ray. A new study published in Radiology compared seven commercially available algorithms for detecting lung cancer on chest radiographs and found wide variation in performance. The findings, covered by The Imaging Wire on May 21, reinforce that algorithm choice is not an operational detail but a decision with direct clinical and financial impact.

The chest x-ray is the most-used imaging exam in the world and the starting point for most commercial thoracic AI algorithms.

Why comparing algorithms matters

Chest x-ray is, by far, the most widely used exam in medical imaging. It is often the first exam a patient receives and serves as a gateway to more advanced investigations. It also has well-known weaknesses: overlapping anatomical structures, low sensitivity for small lesions, and dependence on acquisition technique. That is why several developers have bet on AI to extract more value from the exam and identify findings that escape the human eye.

The problem so far has been that each vendor publishes its own numbers, measured in different populations and often with artificially enriched prevalence or controlled scenarios. For a manager who must choose a solution for a network of services, comparison is hard. The U.K. group therefore organized a kind of technical competition, which it describes as an AI bake-off, testing commercial algorithms simultaneously on a single data set with real-world lung cancer prevalence.

How the study was designed

Researchers included chest x-rays from approximately 5,200 patients, with a lung cancer prevalence rate representative of clinical practice. The compared algorithms came from Annalise/Harrison.ai, Gleamer, Infervision, Milvue, Oxipit, Qure.ai, and Rayscape. Performance results were anonymized, avoiding direct public exposure of brands while preserving comparative technical analyses.

The choice of vendors matters because it covers most of the players already available for integration into PACS and viewer platforms. In other words, the competition analyzed solutions that radiologists can in fact buy and install, not academic prototypes. That brings the study close to the practical decisions facing managers and clinical leads.

Results worth paying attention to

Variation across algorithms was striking. Sensitivity (the ability to detect patients with cancer) ranged from 21% to 78%. Specificity (the ability to avoid false positives among patients without disease) ranged from 59% to 98%. Positive predictive value, perhaps the most uncomfortable number, sat between 1.5% and 28%. In other words, at the low end only about 1 in 67 patients flagged by AI actually has cancer.
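The relationship between these three metrics can be sketched in a few lines of Python. The example is illustrative only: the 1% prevalence is an assumption (the exact figure is not given here), the function names are hypothetical, and the reported sensitivity and specificity extremes do not necessarily belong to the same algorithm.

```python
def positive_predictive_value(sensitivity: float, specificity: float,
                              prevalence: float) -> float:
    """PPV via Bayes' rule: true positives divided by all positive calls."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def flagged_per_true_case(ppv: float) -> float:
    """Average number of flagged patients per confirmed cancer."""
    return 1.0 / ppv


# ASSUMPTION: 1% prevalence is an illustrative value, not the study's figure.
ASSUMED_PREVALENCE = 0.01

# Lowest and highest PPV quoted in the article.
for ppv in (0.015, 0.28):
    print(f"PPV {ppv:.1%}: about 1 in {flagged_per_true_case(ppv):.0f} flagged patients has cancer")

# How specificity drives PPV when disease is rare (sensitivity held at 78%).
for spec in (0.59, 0.98):
    ppv = positive_predictive_value(0.78, spec, ASSUMED_PREVALENCE)
    print(f"specificity {spec:.0%} -> PPV {ppv:.1%}")
```

The second loop shows why specificity dominates at low prevalence: most patients do not have cancer, so even a small false-positive rate produces far more false alarms than true detections.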

All algorithms increased the number of false positives compared with human radiologists, but by very different amounts. One model generated only 10 more false positives than the radiologists; another generated more than 2,000 additional false positives. When those numbers are converted into cost, assuming AI is used to triage patients for follow-up CT, the difference comes out to about $1,600 versus $327,000 in additional costs. That is roughly a 200-fold gap for the same clinical task.
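The arithmetic behind that gap is easy to reproduce. The sketch below uses only the figures quoted above; because the worst model's count is reported as "more than 2,000", the value 2,000 is a lower bound and the implied cost per false positive for that model is an upper bound. Variable names are illustrative.

```python
# Figures quoted in the article (extra false positives vs. radiologists,
# and the corresponding extra follow-up CT cost in US dollars).
best_extra_fp, best_extra_cost = 10, 1_600
worst_extra_fp, worst_extra_cost = 2_000, 327_000  # "more than 2,000": lower bound

cost_per_fp_best = best_extra_cost / best_extra_fp
cost_per_fp_worst = worst_extra_cost / worst_extra_fp  # upper bound on cost per case
cost_ratio = worst_extra_cost / best_extra_cost

print(f"Implied cost per false positive (best model):  ${cost_per_fp_best:,.2f}")
print(f"Implied cost per false positive (worst model): ${cost_per_fp_worst:,.2f}")
print(f"Cost ratio, worst vs. best: {cost_ratio:.0f}x")
```

Both models imply a similar cost per false positive, around $160, so the gap in total cost comes almost entirely from the number of false positives rather than from any difference in unit cost.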

What explains the difference

The most likely factor is the composition of the training data sets behind each model. Algorithms trained on data skewed toward advanced cases tend to lose sensitivity on early findings; models tuned aggressively to avoid missing any case generate too many false positives. Without standardized benchmarks, it is difficult for the customer to measure that difference before signing a contract.

Acquisition protocols, source equipment, patient demographics, and the labeling strategy used by radiologists also matter. A model that performs well in a U.K. hospital may behave very differently elsewhere unless it is revalidated in the new population.

What this changes for managers and radiologists

The practical takeaway is simple: comparing AI algorithms is not an academic luxury but managerial due diligence. Before signing a contract, it pays to require tests on local samples and to define acceptance metrics for sensitivity, specificity, and positive predictive value. As we discussed in our guide to the five critical questions every radiology director should ask before adopting AI, skipping local validation tends to be more expensive than delaying deployment.
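A local acceptance test of this kind could be sketched as follows. Everything in the example is hypothetical: the thresholds, the counts, and the names `ConfusionCounts`, `metrics`, and `failed_criteria` are placeholders that each service would replace with its own criteria and validation data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Results of running one algorithm on a locally labeled sample."""
    true_pos: int
    false_pos: int
    true_neg: int
    false_neg: int


def metrics(c: ConfusionCounts) -> dict[str, float]:
    """Sensitivity, specificity and PPV; assumes every denominator is non-zero."""
    return {
        "sensitivity": c.true_pos / (c.true_pos + c.false_neg),
        "specificity": c.true_neg / (c.true_neg + c.false_pos),
        "ppv": c.true_pos / (c.true_pos + c.false_pos),
    }


def failed_criteria(c: ConfusionCounts, minimums: dict[str, float]) -> list[str]:
    """Names of the acceptance metrics that fall below their agreed minimum."""
    observed = metrics(c)
    return [name for name, floor in minimums.items() if observed[name] < floor]


# HYPOTHETICAL thresholds and counts, for illustration only.
ACCEPTANCE_MINIMUMS = {"sensitivity": 0.70, "specificity": 0.90, "ppv": 0.10}
local_sample = ConfusionCounts(true_pos=40, false_pos=600, true_neg=4350, false_neg=10)

print(metrics(local_sample))
print("Failed:", failed_criteria(local_sample, ACCEPTANCE_MINIMUMS))
```

In this made-up sample the algorithm meets the sensitivity floor but fails on specificity and PPV, which is the pattern a service would want to catch before deployment rather than after.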

Another lesson is that generic AI rarely fits all workflows. High-volume hospital services may absorb more false positives in exchange for higher sensitivity, while outpatient networks with population screening may prioritize specificity. The discussion echoes what we covered in our story on AI in pulmonary embolism detection on CTPA in real-world settings, where trust depends on the fit between algorithm and population.

Regulatory implications and the multi-algorithm future

The variability documented in the study raises the question of what regulators and health systems such as the FDA, EMA, and NHS may require as a condition for authorizing clinical use. Some authors advocate public benchmarks and periodic audits; others argue that this variation is a strength, not a defect, and that the future of diagnosis runs through ensembles, i.e., groups of algorithms with complementary biases analyzing the same exam.

In that scenario, the radiologist stops competing with AI and starts orchestrating different analytical layers. The final report integrates what each model saw, flags disagreements, and contextualizes with clinical history. It is an evolution of the specialist role worth watching closely in the coming months, especially for services starting to structure their AI adoption strategies.
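As a rough illustration of what such orchestration might look like in software, the sketch below combines per-exam flags from several algorithms and marks disagreements for the radiologist. It is a toy example under stated assumptions: the model names are placeholders, the function `combine_findings` is hypothetical, and simple voting is only one of many possible combination rules, not a method described in the study.

```python
def combine_findings(flags: dict[str, bool]) -> dict:
    """Summarize per-model cancer flags for a single chest x-ray.

    `flags` maps a model name to True (suspicious) or False (not flagged).
    """
    positives = sorted(name for name, flagged in flags.items() if flagged)
    negatives = sorted(name for name, flagged in flags.items() if not flagged)
    return {
        "any_flag": bool(positives),                       # most sensitive rule
        "majority_flag": len(positives) > len(flags) / 2,  # more specific rule
        "disagreement": bool(positives) and bool(negatives),
        "flagged_by": positives,
        "cleared_by": negatives,
    }


# HYPOTHETICAL outputs from three placeholder models for one exam.
exam_flags = {"model_a": True, "model_b": False, "model_c": False}
summary = combine_findings(exam_flags)

print(summary)
if summary["disagreement"]:
    print("Models disagree: route to radiologist with per-model detail.")
```

The choice between the "any flag" and "majority" rules mirrors the sensitivity versus specificity trade-off discussed earlier, and would need the same local validation as any single algorithm.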

Source: The Imaging Wire