ISO/IEC TS 25058:2022 — Quality Evaluation and Measurement of AI Systems

ISO/IEC TS 25058 — Technical Specification Overview

Introduction to ISO/IEC TS 25058

ISO/IEC TS 25058:2022 represents a landmark extension of the SQuaRE framework into the domain of artificial intelligence systems. As AI systems — particularly those based on machine learning — become embedded in critical applications ranging from medical diagnosis to autonomous vehicles to financial decision-making, the need for a structured, multi-dimensional approach to AI system quality becomes urgent. Traditional software quality models are insufficient for AI systems because AI behavior is learned from data rather than explicitly programmed, introducing unique quality considerations around training data quality, model robustness, explainability, and fairness.

AI systems can fail in ways that traditional software cannot: they may exhibit bias against protected groups, degrade silently rather than gracefully, produce confident but incorrect outputs, or behave unpredictably in edge cases not represented in training data. These failure modes require quality evaluation approaches specifically designed for AI.

TS 25058 adapts the ISO/IEC 25010 quality model framework to the unique characteristics of AI systems. It introduces new quality characteristics and sub-characteristics relevant to AI, refines existing characteristics to address AI-specific concerns, and defines measurement methods appropriate for evaluating AI system quality. The specification addresses the full AI system lifecycle — from data collection and model training through deployment, monitoring, and retraining — recognizing that AI quality is not a static property but must be continuously evaluated as data distributions shift and operational contexts evolve.

The specification is closely aligned with other ISO/IEC AI standards, including ISO/IEC 22989 (AI concepts and terminology), ISO/IEC 23053 (framework for AI systems using machine learning), and ISO/IEC 42001 (AI management system). Together, these standards form a comprehensive governance framework for AI systems.

AI-Specific Quality Characteristics

Data Quality and Fitness

Unlike traditional software where quality depends primarily on code correctness, AI system quality is fundamentally determined by the quality of training data. TS 25058 defines data quality characteristics that must be evaluated as part of AI system quality assessment:

| Characteristic | AI-Specific Sub-Characteristics | Evaluation Approach |
| --- | --- | --- |
| Data Suitability | Data completeness, data representativeness, data balance, data relevance | Statistical analysis of training data distribution; comparison with target population demographics; coverage analysis for feature space |
| Data Accuracy | Label accuracy, feature accuracy, annotation consistency | Inter-annotator agreement measures (Cohen’s kappa, Fleiss’ kappa); holdout validation set for label verification |
| Data Timeliness | Data currency, concept drift detection, data freshness | Monitor prediction accuracy over time; implement drift detection algorithms (PSI, KS test); track data age distribution |
| Data Provenance | Source traceability, transformation transparency, lineage completeness | Maintain data lineage documentation; implement data version control; audit data collection and processing pipelines |

A major cause of AI system failure in production is data distribution shift: the gap between the training data and the data the system encounters after deployment. TS 25058 emphasizes continuous monitoring of data quality as an essential part of AI system quality management, not a one-time evaluation at development time.
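
The PSI drift check listed under Data Timeliness can be sketched in a few lines. This is a minimal illustration, not a method prescribed by the specification: the function name, the equal-width binning over the reference range, and the `eps` floor for empty buckets are all assumptions made for the example.

```python
import math

def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between a reference sample and a live sample of one numeric feature.

    Assumption: equal-width buckets spanning the reference sample's range.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def bucket_shares(values):
        counts = [0] * bins
        for v in values:
            # Values outside the reference range are clamped into the edge buckets.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each share at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = bucket_shares(expected), bucket_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

reference = [i / 100 for i in range(100)]      # training-time feature values
shifted = [0.5 + i / 200 for i in range(100)]  # live values, shifted upward
psi_same = population_stability_index(reference, reference)
psi_shifted = population_stability_index(reference, shifted)
```

A commonly cited rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift; those cut-offs are industry convention rather than values taken from TS 25058, so teams should set their own thresholds.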

Model Quality Characteristics

Beyond data quality, TS 25058 defines model-specific quality characteristics that address the unique properties of AI/ML models:

| Characteristic | Description | Measurement Approach |
| --- | --- | --- |
| Model Accuracy | Degree to which model outputs match correct or expected values | Standard ML metrics (precision, recall, F1, AUC-ROC, MAE, RMSE) evaluated on representative test sets; disaggregated by relevant subgroups |
| Model Robustness | Ability to maintain prediction quality under perturbed inputs or changing conditions | Adversarial testing (FGSM, PGD); noise injection testing; distribution shift robustness evaluation; out-of-distribution detection performance |
| Explainability | Degree to which model decisions can be understood by humans | Feature importance analysis (SHAP, LIME); counterfactual explanation generation; interpretability metrics for different stakeholder groups |
| Fairness and Bias | Degree to which model decisions are free from systematic discrimination | Statistical (demographic) parity, equal opportunity, and equalized odds metrics; bias audit across protected attributes |
| Uncertainty Quantification | Degree to which the model accurately communicates confidence in its predictions | Expected calibration error (ECE); reliability diagrams; confidence-interval coverage for regression tasks |
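
To make the expected calibration error (ECE) measure concrete, the sketch below computes it for a classifier's top-label confidences. It assumes equal-width confidence bins and a hypothetical function name; other binning schemes exist and give slightly different values.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct: booleans, True where the prediction matched the ground truth.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece

# Ten predictions at 90% confidence: well calibrated if nine are right,
# overconfident if only five are.
calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
```

In this toy case the calibrated model scores approximately 0 and the overconfident one approximately 0.4, which is the gap between its stated confidence (0.9) and its actual accuracy (0.5).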

Implementing TS 25058 in AI System Development

TS 25058 provides a quality framework that should be integrated throughout the AI system development lifecycle. In the design phase, the quality model characteristics inform requirements specification — teams should explicitly document which quality characteristics are relevant, the target levels to be achieved, and the evaluation methods to be used. This proactive approach prevents the common pitfall of treating quality evaluation as a post-hoc activity.

During data preparation, teams should evaluate data quality characteristics from TS 25058, documenting data provenance, assessing representativeness, and verifying label quality. Data quality issues discovered at this stage are far less costly to address than issues discovered after model deployment.
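
Label quality verification is often based on inter-annotator agreement. As a rough sketch, Cohen's kappa for two annotators can be computed as below; the function name and the example labels are illustrative only.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items both annotators labelled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability of matching if each annotator labelled
    # at random according to their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)

annotator_1 = ["cat", "cat", "dog", "dog"]
annotator_2 = ["cat", "cat", "dog", "cat"]
kappa = cohens_kappa(annotator_1, annotator_2)  # 0.5
```

Here the annotators agree on 75% of items, but half of that agreement is expected by chance, so kappa is 0.5. What counts as an acceptable kappa is a project decision and should be recorded with the other data quality targets.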

During model development and evaluation, the model quality characteristics provide a comprehensive evaluation framework that goes beyond simple accuracy metrics. Teams should evaluate models across all relevant characteristics — robustness, explainability, fairness, and uncertainty — not just predictive performance. This multi-dimensional evaluation often reveals trade-offs: improving robustness may slightly reduce accuracy, and increasing fairness may require accepting higher error rates for some groups. These trade-offs should be documented and managed explicitly.
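
One way to surface such trade-offs is to report fairness gaps alongside accuracy. The sketch below computes two of the metrics named earlier for a binary classifier; the function names and the "largest minus smallest group rate" definition of a gap are assumptions chosen for illustration.

```python
def rate(flags):
    """Share of 1s in a list of 0/1 values (0.0 for an empty list)."""
    return sum(flags) / len(flags) if flags else 0.0

def fairness_gaps(y_true, y_pred, groups):
    """Demographic parity and equal opportunity gaps across groups.

    Each gap is the largest minus the smallest per-group rate, so 0.0 means
    every group is treated alike on that metric.
    """
    selection, tpr = {}, {}
    for g in set(groups):
        rows = [(t, p) for t, p, gg in zip(y_true, y_pred, groups) if gg == g]
        selection[g] = rate([p for _, p in rows])         # P(pred = 1 | group)
        tpr[g] = rate([p for t, p in rows if t == 1])     # P(pred = 1 | y = 1, group)
    return {
        "demographic_parity_gap": max(selection.values()) - min(selection.values()),
        "equal_opportunity_gap": max(tpr.values()) - min(tpr.values()),
    }

y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a"] * 4 + ["b"] * 4
gaps = fairness_gaps(y_true, y_pred, groups)
```

In this toy data, group "a" is selected 75% of the time against 25% for group "b", and qualified members of "b" are approved only half as often, so both gaps are 0.5.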

Many AI engineering teams adopt a “model card” approach that complements TS 25058, documenting each model’s performance across multiple quality dimensions, intended use cases, limitations, and ethical considerations. This practice improves transparency and supports responsible AI deployment.

During deployment and operations, TS 25058 guides the implementation of continuous monitoring for model quality degradation. Key monitoring elements include data drift detection, prediction distribution monitoring, and regular retraining triggers. The specification emphasizes that AI system quality is not a one-time evaluation but a continuous process that must keep pace with changing data distributions and operational contexts.
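
A retraining trigger based on drift detection might look like the following sketch. It uses the two-sample Kolmogorov-Smirnov statistic on a single feature or on prediction scores; the function names and the 0.2 threshold are hypothetical and would need tuning per system.

```python
from bisect import bisect_right

def ks_statistic(reference, live):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(live)
    gap = 0.0
    # The maximum gap is always reached at one of the observed sample points.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap

def needs_retraining(reference, live, threshold=0.2):
    """Flag drift when the KS statistic exceeds an (assumed) threshold."""
    return ks_statistic(reference, live) > threshold

training_scores = [0.1, 0.2, 0.3, 0.4]
stable_window = [0.1, 0.2, 0.3, 0.4]
drifted_window = [0.6, 0.7, 0.8, 0.9]
stable_flag = needs_retraining(training_scores, stable_window)
drifted_flag = needs_retraining(training_scores, drifted_window)
```

In practice the statistic would be computed per monitoring window and combined with a significance test or sample-size check before triggering a retraining job.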

For engineers implementing TS 25058, a practical starting point is to create an AI system quality specification document that maps each relevant quality characteristic from the model to specific measures, target values, evaluation methods, and monitoring approaches. This document serves as the quality contract between development teams, operations teams, and business stakeholders, establishing shared expectations for AI system behavior across its operational life.
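
Such a specification can also be kept in machine-readable form so that evaluation results are checked against it automatically. The structure below is one possible sketch; the class, field names, and target values are invented for illustration and do not come from TS 25058.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QualityRequirement:
    characteristic: str      # quality characteristic being addressed
    measure: str             # concrete measure used to evaluate it
    target: float            # agreed target value (illustrative numbers below)
    higher_is_better: bool   # direction in which the measure improves
    evaluation_method: str   # how the measure is obtained before release
    monitoring: str          # how it is tracked in operation

    def is_met(self, observed: float) -> bool:
        if self.higher_is_better:
            return observed >= self.target
        return observed <= self.target

SPEC = [
    QualityRequirement("Model Accuracy", "f1", 0.85, True,
                       "held-out test set", "weekly labelled sample"),
    QualityRequirement("Uncertainty Quantification", "ece", 0.05, False,
                       "calibration set", "monthly reliability diagram"),
    QualityRequirement("Data Timeliness", "psi", 0.25, False,
                       "training vs. validation split", "daily drift job"),
]

def evaluate(spec, observed):
    """Map each measure with an observed value to pass (True) or fail (False)."""
    return {r.measure: r.is_met(observed[r.measure])
            for r in spec if r.measure in observed}

report = evaluate(SPEC, {"f1": 0.88, "ece": 0.09, "psi": 0.10})
```

With these example numbers the report shows F1 and PSI within target and ECE failing, which gives development, operations, and business stakeholders a shared, checkable view of the same quality contract.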

Frequently Asked Questions

Q1: How does TS 25058 relate to the EU AI Act requirements?
A: TS 25058 provides technical quality evaluation methods that can support compliance with regulatory requirements such as the EU AI Act. The quality characteristics around transparency, fairness, and robustness directly address regulatory concerns, though TS 25058 is a technical specification rather than a regulatory compliance standard.
Q2: Can TS 25058 be applied to generative AI models?
A: The core quality model applies to generative AI, but additional quality considerations specific to generative models — such as hallucination rate, output coherence, safety alignment, and content moderation effectiveness — may require supplementary evaluation methods beyond those defined in the current specification.
Q3: How do you measure explainability under TS 25058?
A: TS 25058 approaches explainability measurement through multiple lenses: feature attribution quality (how accurately do attribution methods reflect model behavior?), explanation stability (how much do explanations vary for similar inputs?), and user comprehension (can users correctly predict model behavior based on explanations?).
Q4: Is TS 25058 applicable to traditional rule-based AI systems?
A: Yes, though the emphasis on data quality and model robustness is more relevant to ML-based systems. For rule-based AI systems, the quality characteristics around functional suitability, correctness, and maintainability from the base SQuaRE model are typically more applicable, with less emphasis on data-driven characteristics.
