Physical Address
304 North Cardinal St.
Dorchester Center, MA 02124
IEC TR 63038 provides a standardized framework for evaluating the performance of video analytics systems in digital video monitoring applications. As video surveillance deployments expand, from smart city traffic management to retail footfall analytics, objective and repeatable performance metrics become essential. This technical report defines test scenarios, ground-truth annotation methodologies, and statistical reporting conventions for object detection, classification, tracking, and event recognition.
TR 63038 covers four core analytic tasks: (1) object detection (bounding box output), (2) object classification (label assignment), (3) multi-object tracking (ID preservation across frames), and (4) event detection (loitering, line-crossing, abandoned object). Each task has dedicated metrics, test datasets, and minimum reporting requirements.
The standard mandates reporting of the following metrics for every analytics evaluation:
| Metric | Definition | Reporting Requirement |
|---|---|---|
| Precision | TP / (TP + FP) | Per object class, per condition |
| Recall | TP / (TP + FN) | Per object class, per condition |
| F₁ Score | Harmonic mean of precision and recall: 2 · (Precision · Recall) / (Precision + Recall) | Overall and per class |
| MOTA | Multiple Object Tracking Accuracy | For tracking scenarios only |
| Processing Latency | Frame-in to result-out delay | P₅₀, P₉₅ in milliseconds |
| Throughput | Frames processed per second | At native resolution |
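The accuracy metrics in the table can be computed directly from outcome counts. The following is a minimal sketch, not code taken from the report: the `Counts` class and function names are illustrative, returning 0.0 for an empty denominator is an assumption, and the MOTA formula is the commonly used CLEAR MOT definition, which the report may specify differently.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Outcome counts for one object class under one condition (illustrative type)."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives


def precision(c: Counts) -> float:
    # TP / (TP + FP); returning 0.0 when nothing was predicted is an assumption
    denom = c.tp + c.fp
    return c.tp / denom if denom else 0.0


def recall(c: Counts) -> float:
    # TP / (TP + FN); returning 0.0 when there is no ground truth is an assumption
    denom = c.tp + c.fn
    return c.tp / denom if denom else 0.0


def f1_score(c: Counts) -> float:
    # Harmonic mean of precision and recall
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def mota(fn: int, fp: int, id_switches: int, gt_objects: int) -> float:
    # Assumed CLEAR MOT form: 1 - (FN + FP + ID switches) / ground-truth objects,
    # with every count summed over all frames of the sequence
    if gt_objects <= 0:
        raise ValueError("gt_objects must be positive")
    return 1.0 - (fn + fp + id_switches) / gt_objects


if __name__ == "__main__":
    person_daylight = Counts(tp=80, fp=20, fn=10)
    print(f"precision={precision(person_daylight):.3f}")
    print(f"recall={recall(person_daylight):.3f}")
    print(f"f1={f1_score(person_daylight):.3f}")
    print(f"mota={mota(fn=10, fp=20, id_switches=5, gt_objects=100):.3f}")
```

In practice one `Counts` instance would be kept per object class and per condition, matching the reporting requirement column above.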
TR 63038 specifies that test datasets must include at least 10,000 annotated frames per task, with a minimum of 500 frames per environmental condition (daylight, low-light, rain, fog, night-infrared). The annotation format is based on a modified COCO JSON schema, extended with temporal fields (track_id, occlusion_flag, confidence). Ground-truth accuracy must be ≥ 99% at the pixel level for bounding boxes and ≥ 99.5% for classification labels.
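The dataset size and annotation requirements above lend themselves to an automated pre-check. The sketch below assumes a COCO-style dictionary with the standard `images` and `annotations` lists; the per-image `condition` key and the condition label strings are hypothetical, since the exact schema extension is not reproduced here. It checks only frame counts and the presence of the temporal fields, not ground-truth accuracy.

```python
from collections import Counter

MIN_FRAMES_PER_TASK = 10_000
MIN_FRAMES_PER_CONDITION = 500
# Condition labels are assumed spellings of the conditions named in the text
CONDITIONS = ("daylight", "low-light", "rain", "fog", "night-infrared")
TEMPORAL_FIELDS = ("track_id", "occlusion_flag", "confidence")


def check_dataset(dataset: dict) -> list[str]:
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    images = dataset.get("images", [])

    # Total annotated frames for the task
    if len(images) < MIN_FRAMES_PER_TASK:
        problems.append(
            f"only {len(images)} frames, need at least {MIN_FRAMES_PER_TASK}"
        )

    # Frames per environmental condition ("condition" is a hypothetical key)
    per_condition = Counter(img.get("condition") for img in images)
    for cond in CONDITIONS:
        if per_condition[cond] < MIN_FRAMES_PER_CONDITION:
            problems.append(
                f"condition '{cond}' has {per_condition[cond]} frames, "
                f"need at least {MIN_FRAMES_PER_CONDITION}"
            )

    # Every annotation should carry the temporal extension fields
    for ann in dataset.get("annotations", []):
        missing = [f for f in TEMPORAL_FIELDS if f not in ann]
        if missing:
            problems.append(
                f"annotation {ann.get('id')} is missing: {', '.join(missing)}"
            )
    return problems
```

Calling `check_dataset(data)` on a loaded annotation file returns a list of messages that can be logged or used to fail a build before any evaluation run starts.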
Video analytics performance under the TR 63038 framework depends heavily on edge-device compute capability. A typical deep learning accelerator (e.g., NVIDIA Jetson Orin, Hailo-8, Intel Movidius) can achieve 30 to 60 FPS on lightweight object detection networks (YOLOv8n, MobileNet-SSD) at 1080p resolution. The standard recommends reporting performance at the target deployment resolution rather than at the training resolution, because downscaling artifacts significantly reduce small-object recall.
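The latency and throughput figures required by the metrics table can be summarized from per-frame measurements taken on the target device at the deployment resolution. The following is a rough sketch using only the Python standard library; the function names are illustrative, and the choice of the inclusive percentile method is an assumption, since the report's exact percentile convention is not stated here.

```python
import statistics


def summarize_latency(latencies_ms: list[float]) -> dict[str, float]:
    """Return P50 and P95 of frame-in to result-out latencies, in milliseconds."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two latency samples")
    # quantiles(n=100) returns 99 cut points: index 49 is P50, index 94 is P95
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50_ms": cuts[49], "p95_ms": cuts[94]}


def throughput_fps(frames_processed: int, wall_seconds: float) -> float:
    """Frames processed per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return frames_processed / wall_seconds


if __name__ == "__main__":
    # Synthetic samples for illustration only, not measured data
    samples = [float(ms) for ms in range(1, 102)]
    print(summarize_latency(samples))
    print(throughput_fps(frames_processed=300, wall_seconds=10.0))
```

Recording the input resolution alongside each summary keeps results comparable across devices.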
Looking forward, the IEC is considering a second edition that incorporates neural network robustness testing (adversarial patch attacks) and privacy-preserving analytics evaluation (on-device inference vs. cloud-based). The foundational metrics framework defined in TR 63038 will remain central to these future extensions.