
A radiomics model designed to grade hand osteoarthritis on plain radiographs has shown performance comparable to human readers, according to coverage published by AuntMinnie. The approach extracts hundreds of quantitative parameters from each image and points to a future where the classic Kellgren-Lawrence (KL) grading can be aided by automated systems, reducing inter-observer variability that has plagued musculoskeletal radiology for decades.

Figure: Hand radiograph showing osteoarthritis findings at the interphalangeal joints. Hand radiographs remain the imaging backbone of osteoarthritis grading and are now being analyzed by radiomics models.

Why hand osteoarthritis matters

Hand osteoarthritis affects up to 40% of people over 60 and is a leading cause of chronic pain and loss of fine motor function. Imaging diagnosis remains anchored in plain radiography, an inexpensive, widely available, and relatively standardized modality. Even so, severity grading in clinical practice suffers from inter- and intra-observer variability. The Kellgren-Lawrence scale, still dominant, has five grades (0 to 4) based on visual criteria such as joint space narrowing, osteophytes, sclerosis, and deformity.

Studies published over the past few years in Radiology, European Radiology, and Skeletal Radiology have shown that even experienced radiologists reach only moderate agreement when classifying distal interphalangeal (DIP), proximal interphalangeal (PIP), and trapeziometacarpal joints. The trapeziometacarpal joint has its own staging system, the Eaton-Littler classification.

What radiomics is and why it fits here

Radiomics is the process of extracting hundreds to thousands of quantitative parameters from medical images, going well beyond what the human eye captures. In radiography, that includes texture descriptors such as gray-level co-occurrence matrices (GLCM), gray-level run-length matrices (GLRLM), and local variances that correlate with bone microstructure and tissue calcification. These features are combined into machine-learning models — random forests, gradient boosting, or neural networks — that learn to map features to an outcome.
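As a minimal sketch of what one such texture descriptor looks like in practice, the NumPy code below builds a GLCM for a single pixel offset and derives two common summary features, contrast and homogeneity. The function names, the single horizontal offset, and the assumption that the joint patch is already quantized to integer gray levels are illustrative choices, not details from the reported model. Real pipelines typically use dedicated radiomics libraries and many offsets.

```python
import numpy as np

def glcm(image, levels, dx=1, dy=0):
    """Symmetric, normalized gray-level co-occurrence matrix for one offset.

    Assumes `image` is a 2D array of integers in [0, levels) and that
    dx, dy are non-negative pixel offsets (hypothetical simplification).
    """
    img = np.asarray(image, dtype=np.intp)
    rows, cols = img.shape
    # Pair every pixel with its neighbor located dy rows down and dx columns right.
    src = img[:rows - dy, :cols - dx].ravel()
    dst = img[dy:, dx:].ravel()
    counts = np.zeros((levels, levels), dtype=float)
    np.add.at(counts, (src, dst), 1.0)   # accumulate co-occurrence counts
    counts += counts.T                   # count each pair in both directions
    return counts / counts.sum()         # normalize to joint probabilities

def glcm_contrast(p):
    """Weighted sum of squared gray-level differences; high for coarse texture."""
    i, j = np.indices(p.shape)
    return float(np.sum(p * (i - j) ** 2))

def glcm_homogeneity(p):
    """Closeness of the GLCM mass to its diagonal; 1.0 for a flat patch."""
    i, j = np.indices(p.shape)
    return float(np.sum(p / (1.0 + (i - j) ** 2)))

# Toy 4x4 patch with four gray levels, standing in for a quantized joint region.
patch = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 2, 2, 2],
                  [2, 2, 3, 3]])
p = glcm(patch, levels=4)
features = {"contrast": glcm_contrast(p), "homogeneity": glcm_homogeneity(p)}
```

A feature vector for a machine-learning model would concatenate values like these across several offsets, regions, and descriptor families.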

In hand osteoarthritis, the most common outcome is KL grading itself or an automated version of the OARSI scale. Other work targets prediction of radiographic progression, presence of erosions, or response to therapy. Because the DIP joint occupies only a few square centimeters in the image, success depends on accurate segmentation, a step where deep learning has gained ground over the past five years.

Performance versus radiologists

Papers from 2024 and 2025 report substantial agreement between radiomics models and consensus grading by two or three musculoskeletal radiologists, with quadratic weighted kappa typically between 0.70 and 0.85. On external test sets, kappa drops by about 0.05 to 0.10 when the data come from different equipment or populations than the training set. This pattern, familiar from other domains, reinforces the need for cross-calibration protocols before clinical use.
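For readers unfamiliar with the metric, the sketch below shows how quadratic weighted kappa can be computed for two sets of ordinal grades. It penalizes disagreements by the squared distance between grades, so confusing KL 1 with KL 2 costs far less than confusing KL 0 with KL 4. The function name and the example grades are hypothetical and are not taken from any of the cited papers.

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, n_grades=5):
    """Cohen's kappa with quadratic weights for integer grades 0..n_grades-1.

    Undefined (division by zero) when both raters assign one identical
    grade to every case, because expected disagreement is then zero.
    """
    a = np.asarray(rater_a, dtype=np.intp)
    b = np.asarray(rater_b, dtype=np.intp)
    # Observed joint distribution of (grade by A, grade by B).
    observed = np.zeros((n_grades, n_grades), dtype=float)
    np.add.at(observed, (a, b), 1.0)
    observed /= observed.sum()
    # Joint distribution expected by chance from the two marginals.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    # Disagreement weight grows with the squared grade difference.
    i, j = np.indices(observed.shape)
    weights = (i - j) ** 2 / (n_grades - 1) ** 2
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

# Hypothetical KL grades for ten joints: model output versus reader consensus.
model_grades     = [0, 1, 2, 2, 3, 4, 1, 0, 3, 2]
consensus_grades = [0, 1, 2, 3, 3, 4, 2, 0, 3, 2]
kappa = quadratic_weighted_kappa(model_grades, consensus_grades)
```

Values near 1 indicate near-perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.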

Compared with human readers, the model offers two practical advantages: it is reproducible (the same image always yields the same grade) and it is fast, processing in seconds a joint series that would take minutes to read. These gains matter most in cohort studies and clinical trials, where standardized grading directly affects the sample size needed to detect an effect.

Implications for practice and research

In daily practice, hand osteoarthritis radiomics will not replace the radiologist any time soon. The more likely route is an assistance layer: the system suggests a grade for each joint, flags disagreements, and offers a quantitative measure that can sit in a structured report. In services with high rheumatology volume, that support tends to reduce average reading time and friction in peer review.

In research, quantitative parameterization makes new analyses possible. Researchers can test associations between radiomic features, serological markers such as anti-CCP and CRP, and functional outcomes measured by AUSCAN-style questionnaires. It is worth revisiting our coverage of AI in other imaging challenges, such as early pancreatic cancer detection, where AI outperformed radiologists, and whole-body MRI with AI for tissue composition, two examples of automated quantification that point in the same direction.

Limitations that still need attention

Three recurring limitations show up across the literature: dependence on segmentation quality, sensitivity to acquisition protocol, and relatively small sample sizes given the clinical heterogeneity of the disease. Variations in kVp, mAs, and source-detector distance change image contrast and texture, which can shift radiomic texture features independently of the underlying disease. Harmonization methods, such as ComBat or learned normalizers, can mitigate the problem.
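To illustrate the idea behind feature harmonization, the sketch below applies a plain per-scanner location and scale adjustment: each feature is standardized within its acquisition batch and then mapped back to the overall mean and spread. This is a deliberately simplified stand-in and not ComBat itself, which additionally uses empirical Bayes shrinkage and can preserve covariates of interest. The function name and the toy numbers are hypothetical.

```python
import numpy as np

def location_scale_harmonize(features, batch):
    """Align each batch's per-feature mean and standard deviation.

    features: (n_samples, n_features) array of radiomic features.
    batch:    length n_samples array of scanner or site labels.
    Simplified illustration only; it will also erase genuine differences
    between sites if their case mix differs.
    """
    X = np.asarray(features, dtype=float)
    batch = np.asarray(batch)
    grand_mean = X.mean(axis=0)
    overall_std = X.std(axis=0)
    out = np.empty_like(X)
    for b in np.unique(batch):
        mask = batch == b
        mu = X[mask].mean(axis=0)
        sd = X[mask].std(axis=0)
        sd = np.where(sd > 0, sd, 1.0)  # avoid dividing by zero for constant features
        # Standardize within the batch, then rescale to the overall distribution.
        out[mask] = (X[mask] - mu) / sd * overall_std + grand_mean
    return out

# Two hypothetical scanners whose texture features sit on different scales.
raw = np.array([[1.0, 10.0],
                [3.0, 14.0],
                [10.0, 100.0],
                [14.0, 108.0]])
scanner = np.array(["A", "A", "B", "B"])
harmonized = location_scale_harmonize(raw, scanner)
```

After the adjustment, both scanners share the same per-feature mean and spread, which is the property a downstream texture model relies on. Whether that is appropriate depends on the two sites having comparable patient populations.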

Another weakness is auditability. Gradient boosting models with hundreds of features are not trivially interpretable. To address this, many groups also report the five to ten most influential features via SHAP values, offering a partial explanation of what the model prioritizes. In osteoarthritis, texture descriptors in the subchondral region and shape parameters of the joint space tend to dominate the rankings.

What to expect next

The natural next step is integration of these models into PACS and structured reporting systems, with calibration specific to the equipment base of each service. Societies such as ESSR and RSNA have signaled in recent consensus papers that prospective multicenter studies are the priority. For radiology leaders, the scenario suggests that musculoskeletal AI is moving from niche to commodity, alongside chest, breast, and neuro modules in commercial packages.

Source: AuntMinnie