ISO/IEC 25422:2020 — Data Provenance Model

Tracking data lineage across heterogeneous information systems

ISO/IEC 25422:2020 defines a provenance data model for representing the origin, derivation, and transformation history of data across information systems. The standard aligns closely with the W3C PROV family of recommendations (PROV-DM, PROV-O, PROV-N) but extends them with domain-specific constructs for enterprise data management — including business context annotations, policy constraints, and multi-level aggregation. For data engineers building lineage tracking systems, 25422 provides the conceptual foundation and serialization guidelines necessary for interoperable provenance exchange.

While W3C PROV focuses on generic provenance on the Web, ISO/IEC 25422 adds enterprise-oriented features: organizational role bindings, data quality impact propagation, and support for both fine-grained (record-level) and coarse-grained (dataset-level) provenance.

1. The Provenance Graph Structure

At its core, the provenance model is a directed acyclic graph (DAG) with three primary node types: entities (data artifacts, records, datasets), activities (processes, transformations, ETL jobs), and agents (persons, organizations, software systems). Edges in the graph represent relationships: used (activity to entity), wasGeneratedBy (entity to activity), wasAttributedTo (entity to agent), wasAssociatedWith (activity to agent), and actedOnBehalfOf (agent to agent for delegation).
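
As a rough illustration of the node and edge types just described, the sketch below keeps a tiny in-memory graph and rejects edges whose endpoints do not match the stated direction. The class name `ProvenanceGraph`, the `RELATION_ENDPOINTS` table, and the lowercase type labels are assumptions made for this example, not names taken from the standard.

```python
from dataclasses import dataclass, field

# Allowed (source type, target type) for each relation named in the text.
RELATION_ENDPOINTS = {
    "used": ("activity", "entity"),
    "wasGeneratedBy": ("entity", "activity"),
    "wasAttributedTo": ("entity", "agent"),
    "wasAssociatedWith": ("activity", "agent"),
    "actedOnBehalfOf": ("agent", "agent"),
}


@dataclass
class ProvenanceGraph:
    """Hypothetical in-memory store; a real system would use a database."""
    nodes: dict = field(default_factory=dict)  # node id -> node type
    edges: list = field(default_factory=list)  # (source id, relation, target id)

    def add_node(self, node_id: str, node_type: str) -> None:
        if node_type not in {"entity", "activity", "agent"}:
            raise ValueError(f"unknown node type: {node_type}")
        self.nodes[node_id] = node_type

    def add_edge(self, source: str, relation: str, target: str) -> None:
        # Check that the edge direction matches the relation's definition.
        expected = RELATION_ENDPOINTS[relation]
        actual = (self.nodes[source], self.nodes[target])
        if actual != expected:
            raise ValueError(f"{relation} must link {expected}, got {actual}")
        self.edges.append((source, relation, target))


graph = ProvenanceGraph()
graph.add_node("Customer_Data_Daily_Export.csv", "entity")
graph.add_node("ETL_Job_Daily_Customer_Sync", "activity")
graph.add_node("Data_Engineer_Team_Alpha", "agent")
graph.add_edge("Customer_Data_Daily_Export.csv", "wasGeneratedBy", "ETL_Job_Daily_Customer_Sync")
graph.add_edge("ETL_Job_Daily_Customer_Sync", "wasAssociatedWith", "Data_Engineer_Team_Alpha")
```

Validating endpoints at insert time is one possible design choice; it catches reversed edges early, which are otherwise easy to introduce because `used` and `wasGeneratedBy` point in opposite directions.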

The standard introduces two extensions to the basic PROV graph: businessContext annotations that attach project identifiers, regulatory classifications, and business process metadata to provenance nodes; and qualityImpact edges that propagate data quality scores along derivation paths. This last feature is particularly valuable in regulatory compliance scenarios where a downstream report’s accuracy depends on the quality of multiple upstream data sources.
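
To make quality propagation concrete, here is a minimal sketch. It assumes a "weakest link" rule, where a dataset's effective score is the minimum of its own score and the effective scores of everything upstream. That rule, the dataset names, and the scores are all assumptions for illustration; the combination rule the standard actually prescribes is not described here.

```python
from functools import lru_cache

# Hypothetical derivation edges: derived dataset -> upstream sources.
DERIVED_FROM = {
    "regulatory_report": ["customer_master", "transactions_clean"],
    "transactions_clean": ["transactions_raw"],
}

# Hypothetical locally measured quality scores in [0, 1].
LOCAL_QUALITY = {
    "regulatory_report": 0.99,
    "customer_master": 0.95,
    "transactions_clean": 0.98,
    "transactions_raw": 0.80,
}


@lru_cache(maxsize=None)
def effective_quality(dataset: str) -> float:
    """Return the lowest score found on any derivation path into `dataset`.

    Recursion terminates because the derivation graph is acyclic.
    """
    upstream = [effective_quality(src) for src in DERIVED_FROM.get(dataset, [])]
    return min([LOCAL_QUALITY[dataset], *upstream])


print(effective_quality("regulatory_report"))  # 0.8, limited by transactions_raw
```

Under this assumed rule the report inherits the 0.80 score of the raw transactions feed two hops upstream, even though its own local score is 0.99.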

Node Type | PROV Equivalent | 25422 Extension                        | Example
Entity    | prov:Entity     | businessContext, retentionPolicy       | “Customer_Data_Daily_Export.csv”
Activity  | prov:Activity   | executionEnvironment, inputSchema      | “ETL_Job_Daily_Customer_Sync”
Agent     | prov:Agent      | organizationalRole, certificationLevel | “Data_Engineer_Team_Alpha”

A common implementation mistake is making the provenance graph too fine-grained. Recording provenance for every field or row of a million-row table, at every pipeline stage and on every run, can grow the graph to billions of nodes and edges, which is computationally prohibitive. The standard recommends aggregation: capture provenance at the dataset level by default, and drill down to record level only for critical data elements identified through data sensitivity classification.
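
The aggregation recommendation can be sketched as a small policy function plus a back-of-the-envelope node count. The sensitivity labels (`restricted`, `confidential`) and both function names are hypothetical; a real classification scheme would come from the organization's own data governance policy.

```python
def choose_granularity(
    sensitivity: str,
    critical_levels: frozenset = frozenset({"restricted", "confidential"}),
) -> str:
    """Dataset-level by default; record-level only for critical data."""
    return "record" if sensitivity in critical_levels else "dataset"


def estimated_entity_nodes(rows: int, pipeline_stages: int, granularity: str) -> int:
    """Count entity nodes for one run: one per row per stage, or one per stage."""
    per_stage = rows if granularity == "record" else 1
    return per_stage * pipeline_stages


# One run of a five-stage pipeline over a million-row table.
print(estimated_entity_nodes(1_000_000, 5, "record"))   # 5000000
print(estimated_entity_nodes(1_000_000, 5, "dataset"))  # 5
```

The estimate covers entity nodes for a single run only. Edges, repeated runs, and field-level tracking multiply it further, which is where the graph size becomes unmanageable.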

2. Provenance Collection Strategies

The standard describes three collection strategies: (1) instrumentation-based — embedding provenance capture logic directly into data processing pipelines (e.g., hooks in Apache Spark transformations, JDBC driver interceptors); (2) log-based — deriving provenance from existing audit logs, database transaction logs, and workflow management system records; and (3) inference-based — deducing provenance relationships from data characteristics (e.g., schema fingerprints, statistical correlations) when direct capture is not feasible.
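
For the instrumentation-based strategy, one lightweight pattern is a decorator that records provenance edges whenever a pipeline step runs. This is a sketch under stated assumptions: `capture_provenance`, `PROVENANCE_LOG`, and the dataset names are invented for the example, and a production hook for Spark or JDBC would look quite different.

```python
import functools
from datetime import datetime, timezone

PROVENANCE_LOG = []  # stand-in for a real provenance store (assumption)


def capture_provenance(inputs, outputs):
    """Record used / wasGeneratedBy edges each time the wrapped step runs."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            started = datetime.now(timezone.utc).isoformat()
            result = func(*args, **kwargs)
            # Edges are written only after the step succeeds.
            activity = func.__name__
            for name in inputs:
                PROVENANCE_LOG.append((activity, "used", name, started))
            for name in outputs:
                PROVENANCE_LOG.append((name, "wasGeneratedBy", activity, started))
            return result
        return wrapper
    return decorator


@capture_provenance(inputs=["customers_raw"], outputs=["customers_clean"])
def deduplicate_customers(rows):
    return sorted(set(rows))


deduplicate_customers(["ann", "bob", "ann"])
```

The decorator shows why this strategy is both the most accurate and the most invasive: the edges are exact, but every pipeline step has to be wrapped or otherwise modified.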

For most enterprise environments, a hybrid approach works best. Instrumentation-based capture provides the most accurate provenance but requires modifying every data pipeline. Log-based capture works as a fallback for legacy systems, while inference-based capture is best reserved for data-discovery contexts where approximate lineage is acceptable.

A large European bank implemented hybrid provenance capture using Apache Atlas (instrumentation for Hadoop pipelines) combined with Spline (log-based for legacy SQL jobs) and achieved 94% provenance coverage across 2,400 data assets within six months. The ISO/IEC 25422 model was used as the canonical schema for the provenance store.

3. Engineering Design Insights for Provenance Systems

One of the most practical contributions of the standard is the provenance API specification, which defines RESTful endpoints for submitting and querying provenance records. The API supports temporal queries (“show me the provenance of this report as of last Tuesday”), impact analysis (“which downstream reports depend on this source table?”), and path tracing (“find the shortest derivation path between these two datasets”).
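
Two of these query types, impact analysis and path tracing, reduce to standard graph traversals. The sketch below runs them over an in-memory adjacency map rather than through the API, whose endpoint names are not reproduced here. The dataset names and function names are hypothetical.

```python
from collections import deque

# Hypothetical derivation edges: source dataset -> datasets derived from it.
DOWNSTREAM = {
    "source_table": ["staging_view", "audit_extract"],
    "staging_view": ["monthly_report"],
    "audit_extract": ["monthly_report", "risk_report"],
}


def impacted_by(source: str) -> set:
    """Impact analysis: every dataset reachable downstream of `source`."""
    seen, queue = set(), deque([source])
    while queue:
        for child in DOWNSTREAM.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def shortest_derivation_path(start: str, goal: str):
    """Path tracing: breadth-first search finds a path with the fewest hops."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for child in DOWNSTREAM.get(path[-1], []):
            if child not in visited:
                visited.add(child)
                queue.append(path + [child])
    return None  # no derivation path exists


print(shortest_derivation_path("source_table", "risk_report"))
# ['source_table', 'audit_extract', 'risk_report']
```

Temporal queries would additionally need a validity timestamp on each edge so that the traversal can ignore edges created after the requested point in time; that is omitted here to keep the sketch short.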

For engineering teams implementing provenance storage, the standard recommends a property-graph database (e.g., Neo4j, JanusGraph) over a relational store, because provenance queries are fundamentally graph traversal operations. The standard includes query pattern examples in both SPARQL (for RDF serializations) and Cypher/Gremlin (for property graph stores).

Provenance metadata itself has a retention policy. The standard warns against indefinite storage of fine-grained provenance, which can become a data privacy risk (e.g., capturing who accessed a specific customer record at a specific time may violate data minimization principles under GDPR). Provenance retention should be aligned with the underlying data’s retention schedule.
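
Aligning the two retention schedules can be as simple as deriving the provenance expiry date from the data's own schedule. The record layout and function names below are assumptions for illustration.

```python
from datetime import date, timedelta


def provenance_expiry(data_created: date, data_retention_days: int) -> date:
    """Provenance expires on the same day as the data it describes."""
    return data_created + timedelta(days=data_retention_days)


def purge_expired(records: list, today: date) -> list:
    """Keep only provenance records whose underlying data is still retained."""
    return [
        r for r in records
        if provenance_expiry(r["data_created"], r["data_retention_days"]) > today
    ]


# Hypothetical provenance records carrying the data's retention schedule.
records = [
    {"id": "prov-1", "data_created": date(2023, 1, 1), "data_retention_days": 365},
    {"id": "prov-2", "data_created": date(2024, 1, 1), "data_retention_days": 365},
]
kept = purge_expired(records, today=date(2024, 6, 1))
print([r["id"] for r in kept])  # ['prov-2']
```

Storing the retention period on the provenance record itself avoids a lookup against the source system at purge time, at the cost of having to update it if the data's schedule changes.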

Frequently Asked Questions

Q: What is the difference between data provenance and data lineage?
The terms are often used interchangeably, but ISO/IEC 25422 distinguishes them: provenance is the complete history of origin and transformations (who, what, when, how), while lineage is a subset focused on derivation paths (what transformed into what). Lineage is essentially a projection of the full provenance graph.
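
One way to picture lineage as a projection: drop the agent edges and collapse each used / wasGeneratedBy pair into a direct input-to-output derivation. The edge tuples and the function name `project_lineage` are assumptions for this sketch.

```python
# Hypothetical provenance edges as (source, relation, target) tuples.
PROVENANCE_EDGES = [
    ("clean_job", "used", "orders_raw"),
    ("orders_clean", "wasGeneratedBy", "clean_job"),
    ("clean_job", "wasAssociatedWith", "etl_service"),
    ("orders_clean", "wasAttributedTo", "data_team"),
]


def project_lineage(edges):
    """Collapse used / wasGeneratedBy pairs into (input, output) derivations."""
    inputs, outputs = {}, {}
    for source, relation, target in edges:
        if relation == "used":
            inputs.setdefault(source, []).append(target)
        elif relation == "wasGeneratedBy":
            outputs.setdefault(target, []).append(source)
        # Agent relations (who did it) are dropped from the lineage view.
    return sorted(
        (src, out)
        for activity, outs in outputs.items()
        for out in outs
        for src in inputs.get(activity, [])
    )


print(project_lineage(PROVENANCE_EDGES))  # [('orders_raw', 'orders_clean')]
```
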
Q: How does 25422 relate to the OpenLineage standard?
OpenLineage is a community-driven specification focused on real-world lineage collection in modern data pipelines. ISO/IEC 25422 provides a more formal and comprehensive conceptual model. The two can coexist — OpenLineage events can be mapped to the 25422 provenance graph structure for enterprise-wide lineage consolidation.
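
A mapping of this kind might look like the sketch below. The event fields (`run.runId`, `job.namespace`, `job.name`, `inputs`, `outputs`) follow the OpenLineage run event structure, while the identifier scheme, the function name, and the sample values are assumptions. Neither specification is claimed to prescribe this particular mapping.

```python
def openlineage_to_edges(event: dict) -> list:
    """Map one OpenLineage run event to used / wasGeneratedBy edges."""
    job = event["job"]
    # Assumed identifier scheme: one activity node per job run.
    activity = f'{job["namespace"]}/{job["name"]}#{event["run"]["runId"]}'
    edges = []
    for ds in event.get("inputs", []):
        edges.append((activity, "used", f'{ds["namespace"]}/{ds["name"]}'))
    for ds in event.get("outputs", []):
        edges.append((f'{ds["namespace"]}/{ds["name"]}', "wasGeneratedBy", activity))
    return edges


# Trimmed sample event; real run IDs are UUIDs, "run-001" is a placeholder.
sample_event = {
    "eventType": "COMPLETE",
    "eventTime": "2024-01-15T06:00:00Z",
    "run": {"runId": "run-001"},
    "job": {"namespace": "etl", "name": "daily_customer_sync"},
    "inputs": [{"namespace": "warehouse", "name": "customers_raw"}],
    "outputs": [{"namespace": "warehouse", "name": "customers_clean"}],
}

for edge in openlineage_to_edges(sample_event):
    print(edge)
```
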
Q: Can provenance be retroactively captured for legacy systems?
Yes, using the log-based strategy. Database transaction logs, ETL job logs, and file system metadata can be parsed to reconstruct approximate provenance. The standard acknowledges that retrofitted provenance has lower accuracy than instrumented capture and recommends labeling provenance records with a confidence score.
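
As a sketch of log-based reconstruction with confidence labeling: the log line format, the regular expression, and the fixed confidence value of 0.7 are all assumptions chosen for this example, since real job logs vary widely and the standard's scoring scale is not described here.

```python
import re

# Hypothetical ETL log format: "<timestamp> job=<name> read=<dataset> write=<dataset>"
LOG_PATTERN = re.compile(r"job=(?P<job>\S+) read=(?P<read>\S+) write=(?P<write>\S+)")
LOG_BASED_CONFIDENCE = 0.7  # assumed value, lower than instrumented capture


def edges_from_logs(lines):
    """Reconstruct approximate provenance edges from parsed log lines."""
    edges = []
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match is None:
            continue  # unparseable lines contribute no provenance
        job = match["job"]
        edges.append({"source": job, "relation": "used",
                      "target": match["read"], "confidence": LOG_BASED_CONFIDENCE})
        edges.append({"source": match["write"], "relation": "wasGeneratedBy",
                      "target": job, "confidence": LOG_BASED_CONFIDENCE})
    return edges


sample_lines = [
    "2024-01-15T06:00:00Z job=daily_sync read=customers_raw write=customers_clean",
    "2024-01-15T06:05:00Z heartbeat ok",
]
reconstructed = edges_from_logs(sample_lines)
print(len(reconstructed))  # 2
```
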
Q: What serialization formats does the standard support?
The standard defines bindings for JSON-LD (aligned with W3C PROV-JSON), XML (aligned with PROV-XML), Turtle (the RDF syntax commonly used with PROV-O; PROV-N is a separate W3C text notation, not a Turtle dialect), and a binary format for high-throughput scenarios. JSON-LD is recommended as the default interchange format due to its broad ecosystem support.
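
For a feel of what a JSON-LD provenance document can look like, the sketch below builds one using only W3C PROV-O terms (`prov:Entity`, `prov:Activity`, `prov:used`, `prov:wasGeneratedBy`). The `ex:` namespace, the function name, and the input dataset name are assumptions, and the 25422-specific extension terms are left out because their identifiers are not given here.

```python
import json


def to_jsonld(entity_id: str, activity_id: str, used_ids: list) -> dict:
    """Build a small JSON-LD document with PROV-O terms only."""
    return {
        "@context": {
            "prov": "http://www.w3.org/ns/prov#",
            "ex": "http://example.org/",  # placeholder namespace (assumption)
        },
        "@graph": [
            {
                "@id": f"ex:{activity_id}",
                "@type": "prov:Activity",
                "prov:used": [{"@id": f"ex:{i}"} for i in used_ids],
            },
            {
                "@id": f"ex:{entity_id}",
                "@type": "prov:Entity",
                "prov:wasGeneratedBy": {"@id": f"ex:{activity_id}"},
            },
        ],
    }


doc = to_jsonld(
    "Customer_Data_Daily_Export.csv",
    "ETL_Job_Daily_Customer_Sync",
    ["customers_raw"],
)
print(json.dumps(doc, indent=2))
```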
