ISO/IEC 26137:2023 establishes a framework for validating artificial intelligence systems. AI system validation differs fundamentally from traditional software validation because AI system behavior is learned from data rather than explicitly programmed. Validation must therefore address not only functional correctness but also statistical performance, generalization capability, robustness to distribution shift, fairness across population groups, and explainability of outputs. The standard provides a validation methodology that covers data validation, model validation, behavioral validation, and operational validation, so that AI systems are fit for purpose before deployment and remain so throughout their operational life.
AI validation is inherently statistical rather than deterministic. Unlike traditional software where a test case has a binary pass/fail outcome, AI validation requires statistical confidence intervals and acceptance criteria based on population-level performance metrics.
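As an illustration of an acceptance check based on a confidence interval rather than a point estimate, the sketch below computes a Wilson score interval for an observed accuracy and accepts only if the lower bound clears the threshold. The function names, the 95% level (z = 1.96), and the example numbers are assumptions chosen for illustration; they are not taken from the standard.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


def meets_threshold(successes: int, n: int, threshold: float, z: float = 1.96) -> bool:
    """Accept only if the lower confidence bound is at or above the threshold."""
    lower, _ = wilson_interval(successes, n, z)
    return lower >= threshold


# Hypothetical example: 90 correct predictions out of 100 validation samples.
lower, upper = wilson_interval(90, 100)
accepted = meets_threshold(90, 100, threshold=0.85)
```

With 90 of 100 correct, the interval is roughly (0.826, 0.945), so a 0.85 threshold is not met even though the point estimate is 0.90. A larger validation set narrows the interval.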
The standard emphasizes the importance of independent validation — the team that developed the AI system should not be the sole validator. Independence reduces confirmation bias and increases the credibility of validation results.
Validation data must be carefully managed to prevent data leakage between training and validation sets. The standard provides specific guidance on temporal and spatial data splitting techniques appropriate for different AI applications.
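A minimal sketch of a temporal split with a leakage check, assuming each record is a dictionary with a collection date and an entity identifier. The field names `collected_on` and `entity_id`, the cutoff date, and the sample records are hypothetical.

```python
from datetime import date


def temporal_split(records, cutoff, time_key="collected_on"):
    """Split so that every validation record is at or after the cutoff,
    and every training record is strictly before it."""
    train = [r for r in records if r[time_key] < cutoff]
    validation = [r for r in records if r[time_key] >= cutoff]
    return train, validation


def leaked_ids(train, validation, id_key="entity_id"):
    """Entities that appear on both sides of the split (a common leakage source)."""
    return {r[id_key] for r in train} & {r[id_key] for r in validation}


# Hypothetical records; entity "a" appears before and after the cutoff.
records = [
    {"entity_id": "a", "collected_on": date(2023, 1, 10)},
    {"entity_id": "b", "collected_on": date(2023, 2, 5)},
    {"entity_id": "a", "collected_on": date(2023, 6, 1)},
    {"entity_id": "c", "collected_on": date(2023, 7, 15)},
]
train, validation = temporal_split(records, cutoff=date(2023, 4, 1))
overlap = leaked_ids(train, validation)
```

Here `overlap` contains entity "a". Whether such overlap counts as leakage depends on the application: it usually matters when the model could memorize entity-specific patterns.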
Start planning for validation during the data collection phase, not after model training. Validation requirements should drive data acquisition strategy, labeling quality standards, and test scenario design from the beginning of the project.
1. Validation Scope and Types
The standard defines four interrelated validation types:
Data validation ensures that datasets used for training, validation, and testing meet quality requirements including completeness, consistency, accuracy, representativeness, and freedom from bias.
Model validation assesses the trained model’s performance against predefined metrics and acceptance criteria, covering accuracy, precision, recall, F1-score, ROC-AUC, and application-specific metrics.
Behavioral validation examines the AI system’s behavior in edge cases, under adversarial conditions, and across different demographic groups to identify failure modes and bias.
Operational validation evaluates the AI system in its intended deployment environment, including integration with existing systems, human-AI interaction, and performance under realistic operational conditions.
Data validation is often the most overlooked validation type, yet it is among the most critical. Many AI failures can be traced to data quality issues (label errors, sampling bias, temporal mismatch) that were not caught because data validation was treated as a low-priority activity.
For model validation, do not rely on a single metric. A model can achieve high accuracy while being completely unusable due to poor calibration, high variance across subgroups, or brittleness to small input perturbations.
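To make the calibration point concrete, the sketch below computes expected calibration error (ECE) with equal-width confidence bins and compares two hypothetical models that have identical accuracy. The binning scheme, the bin count, and the example data are assumptions for illustration only.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sample-weighted mean of |accuracy - mean confidence| over
    equal-width confidence bins on [0, 1]."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


# Two hypothetical models, both 80% accurate on the same ten samples.
outcomes = [True] * 8 + [False] * 2
ece_calibrated = expected_calibration_error([0.80] * 10, outcomes)
ece_overconfident = expected_calibration_error([0.99] * 10, outcomes)
```

Both models score 0.80 on accuracy, but the second reports 99% confidence on every prediction and has an ECE of about 0.19. Accuracy alone would not distinguish them.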
Behavioral validation should include systematic testing of the model’s response to out-of-distribution inputs. The standard provides guidance on generating effective OOD test cases for different data modalities.
Operational validation is the final gate before deployment and should reproduce production conditions as closely as possible. This includes realistic data volumes, latency constraints, and user interaction patterns.
2. Validation Metrics and Acceptance Criteria
ISO/IEC 26137 provides a comprehensive catalogue of validation metrics organized by AI task type (classification, regression, clustering, generation, reinforcement learning) and by quality attribute (performance, robustness, fairness, explainability, safety). For classification systems, metrics include accuracy, precision, recall, F1-score, confusion matrix analysis, ROC analysis, precision-recall curves, calibration error, and subgroup-specific performance analysis. For regression systems, metrics include mean absolute error (MAE), root mean square error (RMSE), R-squared, prediction intervals, and residual analysis. The standard also introduces AI-specific metrics such as distribution shift detection (population stability index, KL divergence), model uncertainty quantification (predictive entropy, Monte Carlo dropout uncertainty), and fairness metrics (demographic parity, equal opportunity, equalized odds).
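Two of the metrics named above can be sketched in a few lines: the population stability index (PSI) over pre-binned counts, and the demographic parity difference over binary predictions. These follow the commonly used textbook definitions; the function names, the epsilon floor for empty bins, and the sample data are assumptions, and the standard's own formulations may differ.

```python
import math


def population_stability_index(expected_counts, actual_counts, eps=1e-6):
    """PSI = sum over bins of (a - e) * ln(a / e), where e and a are the
    bin fractions of the reference and current populations."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both populations must use the same bins")
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    psi = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_frac = max(e / e_total, eps)  # floor avoids log(0) for empty bins
        a_frac = max(a / a_total, eps)
        psi += (a_frac - e_frac) * math.log(a_frac / e_frac)
    return psi


def demographic_parity_difference(predictions, groups):
    """Largest gap in positive-prediction rate between any two groups."""
    if len(predictions) != len(groups):
        raise ValueError("predictions and groups must have equal length")
    totals, positives = {}, {}
    for pred, group in zip(predictions, groups):
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + (1 if pred else 0)
    rates = [positives[g] / totals[g] for g in totals]
    return max(rates) - min(rates)


# Hypothetical data: a two-bin feature histogram and eight binary predictions.
psi_stable = population_stability_index([50, 50], [50, 50])
psi_shifted = population_stability_index([50, 50], [80, 20])
dp_gap = demographic_parity_difference(
    [1, 1, 0, 0, 1, 0, 0, 0],
    ["a", "a", "a", "a", "b", "b", "b", "b"],
)
```

What PSI value or parity gap counts as acceptable is an acceptance-criteria decision, not something these functions decide.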
| Validation Type | Key Metrics | Acceptance Criteria Basis | Validation Frequency |
|---|---|---|---|
| Data Validation | Completeness, accuracy, representativeness, bias indicators | Domain requirements, regulatory standards | Each data update cycle |
| Model Validation | Accuracy, precision, recall, F1, AUC-ROC, calibration error | Business requirements, risk tolerance | Each training run |
| Behavioral Validation | Edge case pass rate, adversarial robustness, subgroup fairness | Risk assessment, regulatory requirements | Pre-deployment and after significant changes |
| Operational Validation | Latency, throughput, availability, user satisfaction | SLA requirements, operational constraints | Pre-deployment and periodic during operation |
The standard provides detailed guidance on setting acceptance criteria, emphasizing that criteria should be: (a) measurable — defined with clear metrics and thresholds; (b) context-dependent — reflecting the risk level and domain requirements of the specific AI application; (c) statistically grounded — accounting for sampling variability and confidence intervals; (d) multi-dimensional — covering performance, robustness, fairness, and explainability; and (e) auditable — documented with clear rationale and supporting evidence. The acceptance criteria should be established before validation begins and should be approved by relevant stakeholders including domain experts, risk managers, and regulators where applicable.
Warning: Setting acceptance criteria after seeing validation results is a form of p-hacking and invalidates the entire validation process. Pre-specify all acceptance criteria in a validation plan that is reviewed and approved before any validation testing begins. This includes not only primary metrics but also subgroup analyses, edge case definitions, and acceptable degradation thresholds.
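One way to make pre-specification tangible is to encode the approved criteria as immutable data before any results exist, then evaluate results against that fixed plan. This is a sketch only: the `Criterion` class, the metric names, and every threshold below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: criteria cannot be edited after approval
class Criterion:
    metric: str
    threshold: float
    higher_is_better: bool = True

    def is_met(self, value: float) -> bool:
        if self.higher_is_better:
            return value >= self.threshold
        return value <= self.threshold


def evaluate(plan, results):
    """Return {metric: passed}. A metric missing from results counts as a failure,
    so a pre-specified analysis cannot be dropped silently."""
    return {
        c.metric: c.metric in results and c.is_met(results[c.metric])
        for c in plan
    }


# Hypothetical validation plan, fixed before testing begins.
PLAN = (
    Criterion("recall", 0.90),
    Criterion("calibration_error", 0.05, higher_is_better=False),
    Criterion("min_subgroup_recall", 0.85),
)

# Hypothetical results; the subgroup analysis was not reported.
outcome = evaluate(PLAN, {"recall": 0.93, "calibration_error": 0.08})
```

In this example only `recall` passes. The calibration criterion fails on its threshold and the unreported subgroup metric fails by default.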
3. Validation Process and Documentation
The standard prescribes a structured validation process consisting of five phases:
Validation planning: defining scope, metrics, acceptance criteria, test methods, and resource requirements.
Validation execution: conducting the four validation types in a systematic sequence, documenting results, and managing deviations.
Validation analysis: interpreting results, assessing conformance to acceptance criteria, identifying non-conformances, and root cause analysis.
Validation reporting: producing a comprehensive validation report that documents all findings, decisions, and evidence.
Validation maintenance: ongoing monitoring and periodic re-validation throughout the AI system’s operational life.
The validation report is a critical artifact that serves as evidence for regulatory compliance, audit purposes, and stakeholder communication.
The validation report should be structured to serve multiple audiences: an executive summary for management, detailed technical sections for engineers, and compliance mapping for regulators. This layered structure makes it more likely that the report is actually read and used rather than simply filed away.
Validation traceability is essential. Every validation requirement should be traceable to a system requirement, which in turn should be traceable to a stakeholder need. This traceability chain is the backbone of a defensible validation argument.
Re-validation triggers should be pre-defined and monitored. Common triggers include: retraining with new data, algorithm changes, deployment environment changes, identified failure modes in the field, and regulatory updates.
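The triggers listed above can be expressed as an explicit, monitored check. The sketch below is one possible shape, assuming a simple state record; the `SystemState` fields and the field-failure limit are hypothetical and would come from the validation plan in practice.

```python
from dataclasses import dataclass


@dataclass
class SystemState:
    """Hypothetical snapshot of changes since the last validation."""
    retrained: bool = False
    algorithm_changed: bool = False
    environment_changed: bool = False
    field_failures: int = 0
    regulation_updated: bool = False


def fired_triggers(state: SystemState, max_field_failures: int = 0):
    """Return the names of all pre-defined re-validation triggers that have fired."""
    triggers = []
    if state.retrained:
        triggers.append("retraining with new data")
    if state.algorithm_changed:
        triggers.append("algorithm change")
    if state.environment_changed:
        triggers.append("deployment environment change")
    if state.field_failures > max_field_failures:
        triggers.append("failure modes identified in the field")
    if state.regulation_updated:
        triggers.append("regulatory update")
    return triggers


# Hypothetical state: the model was retrained and two field failures were logged.
fired = fired_triggers(SystemState(retrained=True, field_failures=2))
```

An empty list means no re-validation is due. Any non-empty result names the reasons, which can be recorded in the validation report.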
Consider using validation checklists derived from the standard’s requirements. These checklists ensure completeness of validation coverage and provide a structured framework for audit preparation.
Frequently Asked Questions
Q: What is the difference between validation and verification in the AI context per this standard?
A: Verification confirms that the AI system was built correctly (meeting specifications), while validation confirms that the right AI system was built (meeting stakeholder needs and being fit for purpose in the operational environment). Both are required, but validation is broader and includes operational and user-centered assessment.
Q: Does ISO/IEC 26137 address large language model validation?
A: The standard, published in 2023, is not written specifically for LLMs, but its validation framework is applicable to them with appropriate tailoring. Specific considerations for LLM validation include prompt-based testing, hallucination detection, toxicity screening, and alignment evaluation. The standard’s four validation types (data, model, behavioral, operational) map well to LLM validation requirements.
Q: How does this standard relate to the validation requirements in the EU AI Act?
A: The EU AI Act requires conformity assessment for high-risk AI systems, including validation documentation. ISO/IEC 26137 provides the operational methodology for conducting the validation activities that support conformity assessment. Organizations using ISO/IEC 26137 are well positioned to meet the Act’s validation-related requirements.
Q: Can validation be automated?
A: Yes, many aspects of validation can be automated, including data quality checks, model performance evaluation, regression testing, and drift monitoring. However, the standard emphasizes that human judgment is required for interpreting results, assessing edge cases, and making acceptance decisions. Automation augments but does not replace expert validation.
Q: What is the minimum acceptable sample size for validation?
A: The standard does not prescribe a universal minimum but provides statistical guidance based on required confidence levels, effect sizes, and subgroup analysis requirements. For high-risk applications, larger sample sizes are needed to achieve statistically significant subgroup comparisons and to detect rare failure modes.
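As a rough illustration of how confidence level and rarity drive sample size, the sketch below uses two standard textbook formulas: the normal-approximation sample size for estimating a proportion, and the number of independent samples needed to observe at least one occurrence of a rare failure. These are general statistical results offered as an assumption-laden sketch, not the standard's own guidance.

```python
import math


def sample_size_for_proportion(margin: float, p: float = 0.5, z: float = 1.96) -> int:
    """n >= z^2 * p * (1 - p) / margin^2 (normal approximation).
    p = 0.5 is the worst case; z = 1.96 is roughly 95% confidence."""
    if not 0 < margin < 1 or not 0 < p < 1:
        raise ValueError("margin and p must be in (0, 1)")
    return math.ceil(z**2 * p * (1 - p) / margin**2)


def sample_size_to_observe_failure(failure_rate: float, confidence: float = 0.95) -> int:
    """Smallest n such that P(at least one failure in n independent samples)
    is at least the given confidence: n >= ln(1 - confidence) / ln(1 - rate)."""
    if not 0 < failure_rate < 1 or not 0 < confidence < 1:
        raise ValueError("failure_rate and confidence must be in (0, 1)")
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))


# Hypothetical targets: +/-5 points on accuracy, and a 1-in-1000 failure mode.
n_accuracy = sample_size_for_proportion(margin=0.05)
n_rare = sample_size_to_observe_failure(failure_rate=0.001)
```

A margin of five percentage points needs 385 samples in the worst case, while seeing a 1-in-1000 failure at least once with 95% confidence needs about 3,000. Both figures apply per subgroup if subgroup-level conclusions are required.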
Q: How should validation address continuously learning systems?
A: For continuously learning systems, the standard recommends a combination of initial comprehensive validation followed by ongoing validation monitoring. Specific activities include trigger-based re-validation, cumulative performance tracking, and safeguards against degradation from automated retraining.