Physical Address
304 North Cardinal St.
Dorchester Center, MA 02124
Reliable evaluation is the bedrock of trustworthy presentation attack detection (PAD). Without standardized testing protocols, it is impossible to compare PAD systems, validate security claims, or understand the limitations of deployed technology. ISO/IEC 30107-3 (the testing and reporting part of the ISO/IEC 30107 series; ISO/IEC 29147 covers vulnerability disclosure, not PAD) establishes the evaluation framework for PAD systems across all biometric modalities, defining test protocols, dataset requirements, statistical validation methods, and reporting formats that make PAD performance assessment rigorous and reproducible.
The standard structures PAD evaluation into three levels:

- Level 1 (algorithmic evaluation) tests the PAD algorithm against digital presentation data under controlled conditions, typically using pre-recorded attack and bona fide datasets.
- Level 2 (operational evaluation) tests the complete capture and PAD system in a laboratory environment that simulates operational conditions, including varied lighting, positioning, and environmental factors.
- Level 3 (field evaluation) tests the deployed system in its operational environment with actual users, capturing real-world performance data including user acceptance and usability impacts.
The standard defines rigorous requirements for the selection and documentation of attack species used in evaluation. For each attack species (e.g., “printed photograph” for face PAD), the evaluation must specify the presentation instrument (e.g., specific printer model and paper type). The standard requires a minimum of three distinct presentation instruments per attack species to ensure that results are not specific to a single piece of equipment. For manufactured artefacts, multiple fabrication batches from different production runs must be tested to account for manufacturing variability.
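The standard provides detailed guidance on dataset size determination based on desired statistical confidence levels. For Level 1 evaluations targeting an APCER of 2% with 95% confidence, a minimum of approximately 150 attack presentations per attack species is required. This figure is consistent with the "rule of three": if no errors are observed in n independent presentations, the upper 95% confidence bound on the error rate is roughly 3/n, so n = 3/0.02 = 150. If any attack presentations are misclassified during the test, more presentations are needed to support the same claim. The dataset must also include diverse bona fide presentations representing the full demographic and physiological variation expected in the target population, with minimum sample sizes calculated using binomial proportion confidence intervals. The standard emphasizes keeping development and evaluation datasets disjoint, so that reported error rates are not optimistically biased by overfitting.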
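The sample-size figure quoted above can be reproduced with a short calculation. The sketch below is illustrative only and rests on an assumption that is ours rather than the standard's: that the evaluation observes zero misclassified attack presentations, so the one-sided upper bound comes from solving (1 - p)^n <= alpha. The function names are hypothetical.

```python
import math


def min_presentations_zero_errors(target_rate: float, confidence: float = 0.95) -> int:
    """Smallest n for which zero errors in n trials bounds the error rate.

    Assumes independent presentations and ZERO observed errors (our
    simplifying assumption). Solves (1 - target_rate)^n <= 1 - confidence.
    """
    if not 0.0 < target_rate < 1.0:
        raise ValueError("target_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    alpha = 1.0 - confidence
    return math.ceil(math.log(alpha) / math.log1p(-target_rate))


def rule_of_three(target_rate: float) -> float:
    """Common shortcut for the 95% case: n is approximately 3 / target_rate."""
    return 3.0 / target_rate


exact_n = min_presentations_zero_errors(0.02, 0.95)  # 149
approx_n = rule_of_three(0.02)                       # about 150
```

The exact calculation and the shortcut agree to within one presentation, which is why "approximately 150" is a reasonable planning number under this assumption.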
| Evaluation Level | Minimum Attack Species | Minimum Presentation Instruments per Species | Minimum Bona Fide Subjects | Typical Duration |
|---|---|---|---|---|
| Level 1 (Algorithmic) | 3 | 3 | 100 | 2–4 weeks |
| Level 2 (Operational) | 5 | 3 | 200 | 4–8 weeks |
| Level 3 (Field) | All relevant | As available | 500+ | 3–12 months |
The standard specifies statistical methods for estimating APCER and BPCER with appropriate confidence intervals. For small sample sizes or low error rates, exact binomial confidence intervals (Clopper-Pearson method) are recommended. For larger datasets, normal approximation intervals may be used with continuity correction. The standard also defines methods for comparing PAD systems statistically, including McNemar’s test for paired comparisons and bootstrap resampling for difference-of-performance confidence intervals.
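As a rough illustration of the exact interval mentioned above, the following sketch computes a two-sided Clopper-Pearson interval using only the Python standard library. It finds the bounds by bisection on the binomial CDF rather than through a beta-distribution quantile function; that is an implementation choice made here for self-containment, not something prescribed by the standard. The function names are hypothetical.

```python
import math


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), with 0 < p < 1, summed in log space."""
    total = 0.0
    for i in range(k + 1):
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p)
        )
        total += math.exp(log_pmf)
    return total


def _solve_decreasing(f, target: float) -> float:
    """Bisection for f(p) == target, where f is decreasing in p on (0, 1)."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if f(mid) > target:
            lo = mid  # f is still too large, so the root lies to the right
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson(errors: int, trials: int, confidence: float = 0.95):
    """Two-sided exact binomial confidence interval for an error rate."""
    if trials <= 0 or not 0 <= errors <= trials:
        raise ValueError("need 0 <= errors <= trials and trials > 0")
    alpha = 1.0 - confidence
    # Lower bound: P(X >= errors | p) = alpha/2, i.e. CDF(errors - 1) = 1 - alpha/2
    lower = 0.0 if errors == 0 else _solve_decreasing(
        lambda p: binom_cdf(errors - 1, trials, p), 1.0 - alpha / 2.0)
    # Upper bound: P(X <= errors | p) = alpha/2
    upper = 1.0 if errors == trials else _solve_decreasing(
        lambda p: binom_cdf(errors, trials, p), alpha / 2.0)
    return lower, upper


# Example: 3 misclassified attack presentations out of 150 for one species
apcer_low, apcer_high = clopper_pearson(3, 150)
```

Because the interval is two-sided, its upper limit for zero observed errors is slightly wider than a one-sided bound at the same confidence level.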
Implementing a robust PAD evaluation program requires significant investment in test infrastructure, data collection, and statistical expertise. The standard provides practical guidance for organizations at different maturity levels, from small vendors conducting basic Level 1 evaluations to large testing laboratories conducting comprehensive Level 3 evaluations.
The standard introduces the concept of generalization evaluation — testing PAD systems against attack types that were not explicitly included in the training data. This is critical for assessing real-world robustness, as attackers will inevitably use techniques not anticipated during system development. The standard recommends that at least one attack species per evaluation level be reserved as a “zero-day” attack, unknown to the system developer, to explicitly measure generalization capability.
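One simple way to approximate this kind of generalization test during internal development is a leave-one-species-out split, where each attack species in turn is withheld from the data the developer sees and used only for testing. The sketch below assumes a hypothetical data layout of `(species, sample_id)` tuples; it illustrates the idea and is not the procedure defined by the standard.

```python
from typing import Iterator, List, Tuple

Sample = Tuple[str, str]  # (attack species label, sample identifier); hypothetical layout


def leave_one_species_out(
    samples: List[Sample],
) -> Iterator[Tuple[str, List[Sample], List[Sample]]]:
    """Yield (held_out_species, known_samples, unseen_samples) for each species."""
    species = sorted({label for label, _ in samples})
    for held_out in species:
        known = [s for s in samples if s[0] != held_out]   # visible to the developer
        unseen = [s for s in samples if s[0] == held_out]  # reserved "zero-day" species
        yield held_out, known, unseen


attacks = [
    ("printed_photo", "a1"), ("printed_photo", "a2"),
    ("screen_replay", "b1"),
    ("silicone_mask", "c1"), ("silicone_mask", "c2"),
]
folds = list(leave_one_species_out(attacks))
```

Reporting APCER on the `unseen` portion of each fold separately from the `known` portion keeps the generalization result from being averaged away.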
From a reporting perspective, the standard mandates that PAD evaluation results include both overall performance metrics and a vulnerability profile — a species-by-species breakdown of APCER that identifies specific attack types to which the system is most vulnerable. This vulnerability profile enables system integrators to understand the residual risk associated with each attack type and to implement compensating controls where needed. Results must also be reported with clear specification of the evaluation conditions, including the PAD decision threshold, capture hardware, environmental conditions, and subject demographics.
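A vulnerability profile of the kind described above can be sketched as a per-species tally of attack presentations that were wrongly classified as bona fide, alongside the BPCER for bona fide presentations. The input layout and function names below are assumptions made for illustration; a real report would also carry the threshold, hardware, environment, and demographic details listed above.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def vulnerability_profile(attack_results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Per-species APCER.

    attack_results: (species, accepted) pairs, where accepted is True when an
    attack presentation was wrongly classified as bona fide.
    """
    totals, errors = Counter(), Counter()
    for species, accepted in attack_results:
        totals[species] += 1
        errors[species] += int(accepted)
    return {species: errors[species] / totals[species] for species in totals}


def bpcer(bona_fide_results: Iterable[bool]) -> float:
    """Fraction of bona fide presentations wrongly classified as attacks."""
    results = list(bona_fide_results)
    if not results:
        raise ValueError("no bona fide presentations supplied")
    return sum(results) / len(results)


attack_log = [
    ("printed_photo", False), ("printed_photo", False),
    ("printed_photo", True), ("printed_photo", False),
    ("screen_replay", True), ("screen_replay", True),
]
profile = vulnerability_profile(attack_log)   # {'printed_photo': 0.25, 'screen_replay': 1.0}
worst_species = max(profile, key=profile.get)  # 'screen_replay'
rejected_rate = bpcer([False, False, True, False])  # 0.25
```

Sorting the profile by APCER makes the species that most need compensating controls visible at a glance.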