Physical Address
304 North Cardinal St.
Dorchester Center, MA 02124
ISO/IEC 25422:2020 defines a provenance data model for representing the origin, derivation, and transformation history of data across information systems. The standard aligns closely with the W3C PROV family of recommendations (PROV-DM, PROV-O, PROV-N) but extends them with domain-specific constructs for enterprise data management — including business context annotations, policy constraints, and multi-level aggregation. For data engineers building lineage tracking systems, 25422 provides the conceptual foundation and serialization guidelines necessary for interoperable provenance exchange.
At its core, the provenance model is a directed acyclic graph (DAG) with three primary node types: entities (data artifacts, records, datasets), activities (processes, transformations, ETL jobs), and agents (persons, organizations, software systems). Edges in the graph represent relationships: used (activity to entity), wasGeneratedBy (entity to activity), wasAttributedTo (entity to agent), wasAssociatedWith (activity to agent), and actedOnBehalfOf (agent to agent for delegation).
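The node and edge types described above can be sketched as a small in-memory graph. This is only an illustration, not a structure taken from the standard: the `ProvGraph` class and its methods are hypothetical names, and the only rule it enforces is that each relation connects the node types listed in the paragraph above.

```python
from dataclasses import dataclass, field

# Allowed (source type, target type) for each relation named in the text.
RELATIONS = {
    "used": ("activity", "entity"),
    "wasGeneratedBy": ("entity", "activity"),
    "wasAttributedTo": ("entity", "agent"),
    "wasAssociatedWith": ("activity", "agent"),
    "actedOnBehalfOf": ("agent", "agent"),
}


@dataclass
class ProvGraph:
    """Hypothetical minimal provenance graph (not an API from the standard)."""
    nodes: dict = field(default_factory=dict)  # node id -> node type
    edges: list = field(default_factory=list)  # (source, relation, target)

    def add_node(self, node_id, node_type):
        if node_type not in {"entity", "activity", "agent"}:
            raise ValueError(f"unknown node type: {node_type}")
        self.nodes[node_id] = node_type

    def add_edge(self, source, relation, target):
        # Reject edges whose endpoints do not match the relation's signature.
        expected = RELATIONS[relation]
        actual = (self.nodes[source], self.nodes[target])
        if actual != expected:
            raise TypeError(f"{relation} expects {expected}, got {actual}")
        self.edges.append((source, relation, target))


g = ProvGraph()
g.add_node("Customer_Data_Daily_Export.csv", "entity")
g.add_node("ETL_Job_Daily_Customer_Sync", "activity")
g.add_node("Data_Engineer_Team_Alpha", "agent")
g.add_edge("Customer_Data_Daily_Export.csv", "wasGeneratedBy",
           "ETL_Job_Daily_Customer_Sync")
g.add_edge("ETL_Job_Daily_Customer_Sync", "wasAssociatedWith",
           "Data_Engineer_Team_Alpha")
```

Validating endpoint types at insertion time keeps malformed edges (for example, an agent that "used" an entity) out of the graph, which is cheaper than cleaning them up at query time.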
The standard introduces two extensions to the basic PROV graph: businessContext annotations that attach project identifiers, regulatory classifications, and business process metadata to provenance nodes; and qualityImpact edges that propagate data quality scores along derivation paths. The latter is particularly valuable in regulatory compliance scenarios where a downstream report’s accuracy depends on the quality of multiple upstream data sources.
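To make quality propagation concrete, here is a minimal sketch. The text above does not say how scores combine along a derivation path, so the "weakest link" rule used here (a derived dataset takes the minimum score of its upstream sources) is an assumption, and the dataset names and scores are invented for illustration.

```python
from functools import lru_cache

# Hypothetical lineage: derived dataset -> datasets it was derived from.
DERIVED_FROM = {
    "quarterly_report": ["sales_clean", "customer_clean"],
    "sales_clean": ["sales_raw"],
    "customer_clean": ["crm_raw"],
}

# Hypothetical quality scores for the source datasets (0.0 to 1.0).
SOURCE_QUALITY = {"sales_raw": 0.95, "crm_raw": 0.80}


@lru_cache(maxsize=None)
def quality(dataset):
    """Propagate quality downstream using an assumed min() rule."""
    upstream = DERIVED_FROM.get(dataset)
    if not upstream:
        # A source dataset: use its directly measured score.
        return SOURCE_QUALITY[dataset]
    return min(quality(u) for u in upstream)


print(quality("quarterly_report"))  # 0.8, limited by crm_raw
```

Other combination rules (a product, or a weighted average) would fit the same recursive shape; only the final `min` call would change.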
| Node Type | PROV Equivalent | 25422 Extension | Example |
|---|---|---|---|
| Entity | prov:Entity | businessContext, retentionPolicy | “Customer_Data_Daily_Export.csv” |
| Activity | prov:Activity | executionEnvironment, inputSchema | “ETL_Job_Daily_Customer_Sync” |
| Agent | prov:Agent | organizationalRole, certificationLevel | “Data_Engineer_Team_Alpha” |
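The table can be read as three record types, each carrying its extension attributes. The sketch below is one possible mapping: the field names mirror the attribute names in the table, but the field types and the example values are assumptions, since the text does not specify them.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entity:
    name: str
    business_context: dict = field(default_factory=dict)  # businessContext
    retention_policy: Optional[str] = None                # retentionPolicy


@dataclass
class Activity:
    name: str
    execution_environment: Optional[str] = None           # executionEnvironment
    input_schema: Optional[dict] = None                   # inputSchema


@dataclass
class Agent:
    name: str
    organizational_role: Optional[str] = None             # organizationalRole
    certification_level: Optional[str] = None             # certificationLevel


# Example values are invented for illustration.
export = Entity(
    "Customer_Data_Daily_Export.csv",
    business_context={"project": "customer-sync"},
    retention_policy="7 years",
)
job = Activity("ETL_Job_Daily_Customer_Sync", execution_environment="spark-prod")
team = Agent("Data_Engineer_Team_Alpha", organizational_role="data engineering")
```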
The standard describes three collection strategies: (1) instrumentation-based — embedding provenance capture logic directly into data processing pipelines (e.g., hooks in Apache Spark transformations, JDBC driver interceptors); (2) log-based — deriving provenance from existing audit logs, database transaction logs, and workflow management system records; and (3) inference-based — deducing provenance relationships from data characteristics (e.g., schema fingerprints, statistical correlations) when direct capture is not feasible.
For most enterprise environments, a hybrid approach works best. Instrumentation-based capture provides the most accurate provenance but requires modifying every data pipeline. Log-based capture works as a fallback for legacy systems, while inference-based capture is best reserved for data-discovery contexts where approximate lineage is acceptable.
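As an illustration of instrumentation-based capture, the sketch below wraps a pipeline step in a decorator that records provenance edges each time the step runs. The decorator, the in-memory `PROVENANCE_LOG`, and the step's `(inputs, output_name)` signature are all hypothetical; a real pipeline would write to a provenance store and hook into its framework's own extension points.

```python
import functools

PROVENANCE_LOG = []  # stand-in for a real provenance store


def capture_provenance(activity_name, agent):
    """Record used / wasGeneratedBy / wasAssociatedWith edges for a step."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(inputs, output_name):
            result = func(inputs, output_name)
            # Edges are recorded only after the step succeeds.
            for input_name in inputs:
                PROVENANCE_LOG.append((activity_name, "used", input_name))
            PROVENANCE_LOG.append((output_name, "wasGeneratedBy", activity_name))
            PROVENANCE_LOG.append((activity_name, "wasAssociatedWith", agent))
            return result
        return wrapper
    return decorator


@capture_provenance("ETL_Job_Daily_Customer_Sync", agent="Data_Engineer_Team_Alpha")
def sync_customers(inputs, output_name):
    # inputs maps a dataset name to its rows; the step simply concatenates them.
    return [row for rows in inputs.values() for row in rows]


rows = sync_customers(
    {"crm_raw": [{"id": 1}], "web_raw": [{"id": 2}]},
    "Customer_Data_Daily_Export.csv",
)
```

This also shows the cost noted above: every pipeline step has to be wrapped (or otherwise modified) before its provenance is captured.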
One of the most practical contributions of the standard is the provenance API specification, which defines RESTful endpoints for submitting and querying provenance records. The API supports temporal queries (“show me the provenance of this report as of last Tuesday”), impact analysis (“which downstream reports depend on this source table?”), and path tracing (“find the shortest derivation path between these two datasets”).
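The endpoint definitions are not reproduced here, so the sketch below shows only the graph logic that impact analysis and path tracing queries would need behind such an API. The lineage data and function names are hypothetical, and both functions are plain breadth-first searches over an in-memory adjacency map.

```python
from collections import deque

# Hypothetical lineage: dataset -> datasets directly derived from it.
DOWNSTREAM = {
    "source_table": ["staging_a", "staging_b"],
    "staging_a": ["report_x"],
    "staging_b": ["report_x", "report_y"],
    "report_x": [],
    "report_y": [],
}


def impacted(start):
    """Impact analysis: every dataset reachable downstream of `start`."""
    seen, queue = set(), deque([start])
    while queue:
        for nxt in DOWNSTREAM.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def shortest_path(start, goal):
    """Path tracing: a shortest derivation path from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in DOWNSTREAM.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


print(sorted(impacted("source_table")))
# ['report_x', 'report_y', 'staging_a', 'staging_b']
print(shortest_path("source_table", "report_y"))
# ['source_table', 'staging_b', 'report_y']
```

A temporal query would add a timestamp to each edge and filter on it before traversal; that is omitted here to keep the sketch short.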
For engineering teams implementing provenance storage, the standard recommends a property-graph database (e.g., Neo4j, JanusGraph) over a relational store, because provenance queries are fundamentally graph traversal operations. The standard includes query pattern examples in both SPARQL (for RDF serializations) and Cypher/Gremlin (for property graph stores).