Physical Address
304 North Cardinal St.
Dorchester Center, MA 02124
ISO 25720:2009 defines the Genomic Sequence Variation Markup Language (GSVML), an XML-based data exchange format designed to facilitate the interchange of genomic sequence variation data across international research and clinical facilities. In the post-genomic era, overwhelming amounts of genomic data are being generated worldwide, yet these data reside in databases with heterogeneous formats. GSVML addresses the critical need for a standardized, interoperable format that enables seamless data exchange without requiring changes to existing database schemas.
The standard positions GSVML within the broader healthcare data ecosystem alongside HL7 (clinical data), DICOM (medical imaging), and other bioinformatics markup languages such as BSML (Bioinformatic Sequence Markup Language) and SBML (Systems Biology Markup Language). This layered approach recognizes that modern healthcare IT must integrate clinical, imaging, and genomic data to enable the vision of personalized medicine and pharmacogenomics.
| Data Type | Standard | Focus Area |
|---|---|---|
| Clinical Data | HL7 / EN 13606 | Electronic healthcare records, clinical messaging |
| Image Data | DICOM / JPEG | Medical imaging, radiology, pathology |
| Genomic Variation Data | GSVML (ISO 25720) | SNP/STRP annotation, allele frequencies, genotypes |
| Biological Models | SBML / CellML | Systems biology, cellular pathway modeling |
| Sequence Annotation | BSML | Bioinformatic sequence features and metadata |
The GSVML specification is built on a modular XML architecture comprising both a Document Type Definition (DTD) and an XML Schema. The DTD (Annex A) defines the structural grammar — the permissible elements, attributes, and their hierarchical relationships. The XML Schema (Annex B) provides stronger data typing, including constraint definitions for allele values, genomic coordinates, and experimental parameters.
The core GSVML data model captures five fundamental categories of information:
1. Variation Identification — Each genetic variation is assigned a unique identifier within the GSVML document. The model supports multiple identifier systems including dbSNP rs IDs, local database identifiers, and HGVS nomenclature. The standard explicitly accommodates the reality that different laboratories and databases may use different identifier schemes.
2. Genomic Context — Variations are localized using reference sequence coordinates (chromosome, position, strand orientation). The model supports both assembly-specific coordinates (e.g., GRCh38) and locus-based descriptions, enabling cross-reference between genome builds.
3. Allele and Genotype Information — For each variation, the observed alleles, their frequencies in studied populations, and individual genotype data are captured. The model supports bi-allelic and multi-allelic variations, with explicit encoding of allele phases where known.
4. Sample and Population Metadata — Crucial for research validity, the model captures details about the samples studied, including population origin, sample size, genotyping platform, and quality metrics. This enables consumers of GSVML data to assess data quality and relevance to their own research questions.
5. Clinical Annotations — Where available, clinical significance, associated phenotypes, drug response correlations, and literature references are linked to each variation. This clinical dimension distinguishes GSVML from purely research-oriented genomic formats.
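The five categories above can be pictured as a single record type. The following is a minimal Python sketch, not the GSVML schema itself: the class name `Variation`, the XML element and attribute names (`variation`, `identifier`, `location`, `allele`, `sample`, `annotation`), and all sample values are hypothetical placeholders chosen for illustration. The real element names are defined by the DTD and XML Schema in the standard.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class Variation:
    # 1. Variation identification: several identifier schemes may coexist.
    identifiers: dict[str, str]
    # 2. Genomic context.
    assembly: str
    chromosome: str
    position: int
    strand: str
    # 3. Allele information: allele -> observed frequency.
    allele_frequencies: dict[str, float]
    # 4. Sample and population metadata.
    population: str
    sample_size: int
    platform: str
    # 5. Clinical annotations (free text in this sketch).
    clinical_notes: list[str] = field(default_factory=list)


def to_xml(v: Variation) -> ET.Element:
    """Serialize a record using hypothetical element names (not the GSVML DTD)."""
    root = ET.Element("variation")
    for scheme, value in v.identifiers.items():
        ET.SubElement(root, "identifier", scheme=scheme).text = value
    ET.SubElement(
        root, "location",
        assembly=v.assembly, chromosome=v.chromosome,
        position=str(v.position), strand=v.strand,
    )
    for base, freq in v.allele_frequencies.items():
        ET.SubElement(root, "allele", base=base, frequency=f"{freq:.3f}")
    ET.SubElement(
        root, "sample",
        population=v.population, size=str(v.sample_size), platform=v.platform,
    )
    for note in v.clinical_notes:
        ET.SubElement(root, "annotation").text = note
    return root


# Placeholder values only; none of these describe a real variant.
example = Variation(
    identifiers={"dbSNP": "rs0000001", "local": "LAB-0001"},
    assembly="GRCh38", chromosome="10", position=1000, strand="+",
    allele_frequencies={"C": 0.9, "T": 0.1},
    population="example cohort", sample_size=200, platform="example array",
    clinical_notes=["placeholder annotation"],
)
print(ET.tostring(to_xml(example), encoding="unicode"))
```

Keeping identifiers as a scheme-to-value mapping reflects the point made in the first category: a consumer can pick whichever scheme it understands without the producer having to choose one.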
The GSVML development process (Clause 6) followed a rigorous methodology that began with use case analysis drawn from the Japanese Millennium Project and the HL7 Clinical Genomics Special Interest Group. Three primary use cases drove the specification:
Pharmacogenomic Data Exchange — Enabling the transfer of patient genotype data between clinical laboratories and prescribing physicians. For example, a laboratory might generate CYP2C9 and VKORC1 genotype data to guide warfarin dosing, and transmit this data in GSVML format to the electronic health record system.
Population Genetics Research — Supporting the aggregation of SNP frequency data across multiple studies and populations. GSVML enables meta-analyses by providing a common format for allele frequency reports, facilitating large-scale genome-wide association studies (GWAS).
Diagnostic Genomic Reporting — Standardizing the format for reporting clinically actionable genetic findings from diagnostic laboratories to healthcare providers. This use case requires the integration of variation data with clinical interpretations, therapeutic recommendations, and references to evidence sources.
| Use Case | Data Volume | Key GSVML Features | Primary Users |
|---|---|---|---|
| Pharmacogenomics | Low (specific gene panels) | Clinical annotations, drug associations | Clinical labs, physicians |
| Population research | High (genome-wide) | Frequency data, population metadata | Research institutions |
| Diagnostic reporting | Medium (targeted panels) | Phenotype links, evidence references | Diagnostic labs, genetic counselors |
| Database aggregation | Very high (multi-study) | Provenance tracking, quality metrics | Bioinformatics platforms |
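For the population research and aggregation rows, the value of a common format is that frequency reports from different studies become directly combinable. As a rough illustration (this is simple pooling weighted by chromosome count, not a method prescribed by the standard and not a full meta-analysis), the sketch below combines per-study allele frequencies. The `FrequencyReport` type and the study values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FrequencyReport:
    study: str
    sample_size: int         # number of diploid individuals genotyped
    allele_frequency: float  # frequency of the allele of interest


def pooled_frequency(reports: list[FrequencyReport]) -> float:
    """Pool allele frequencies, weighting each study by its chromosome count (2N)."""
    total_chromosomes = sum(2 * r.sample_size for r in reports)
    if total_chromosomes == 0:
        raise ValueError("no samples to pool")
    allele_copies = sum(2 * r.sample_size * r.allele_frequency for r in reports)
    return allele_copies / total_chromosomes


# Made-up numbers: (200 * 0.10 + 600 * 0.20) / 800
reports = [
    FrequencyReport("study A", sample_size=100, allele_frequency=0.10),
    FrequencyReport("study B", sample_size=300, allele_frequency=0.20),
]
print(round(pooled_frequency(reports), 3))
```

This only works when each report carries its sample size, which is one reason the sample and population metadata matter to downstream users.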
For engineers implementing GSVML-compatible systems, several architectural considerations emerge. XML Schema validation provides robust input checking, but GSVML documents can be verbose, particularly when they carry genome-wide variation data for thousands of samples. High-throughput production systems should therefore consider compression or a binary XML encoding for storage and transport.
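How much compression helps is easy to measure before committing to a design. The sketch below builds a repetitive XML document with made-up element names (`variations`, `variation`, `location`, `allele` are assumptions, not GSVML elements) and compresses it with Python's standard `gzip` module. Actual ratios will depend on the real document structure, so treat the numbers it prints as indicative only.

```python
import gzip
import xml.etree.ElementTree as ET


def build_document(n_variants: int) -> bytes:
    """Build a synthetic, repetitive XML document (hypothetical element names)."""
    root = ET.Element("variations")
    for i in range(n_variants):
        var = ET.SubElement(root, "variation", id=f"var{i}")
        ET.SubElement(var, "location", chromosome="1", position=str(1000 + i))
        ET.SubElement(var, "allele", base="A", frequency="0.250")
    return ET.tostring(root, encoding="utf-8")


raw = build_document(1000)
packed = gzip.compress(raw)
restored = gzip.decompress(packed)

print(f"raw: {len(raw)} bytes, gzip: {len(packed)} bytes")
print("round trip intact:", restored == raw)
```

Because gzip is lossless, the decompressed document can still be validated against the schema on the receiving side.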
API design around GSVML should consider that genomic data consumers have diverse needs. Some applications require real-time query of specific variants; others need batch import of large datasets. The standard defines the data format but not the transport mechanism, leaving implementers free to choose REST APIs, message queues, or file-based exchange as appropriate.
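Both access patterns can share one parsing layer. The sketch below uses `xml.etree.ElementTree.iterparse` from the Python standard library to stream records one at a time, so a batch import never holds the whole document in memory, and a single-variant lookup is a thin wrapper that stops at the first match. The element names and the `SAMPLE` document are hypothetical stand-ins, not GSVML markup.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in document; element names are illustrative assumptions.
SAMPLE = b"""<variations>
  <variation id="var1"><allele>A</allele></variation>
  <variation id="var2"><allele>G</allele></variation>
</variations>"""


def iter_variations(source):
    """Stream (id, allele) pairs from a file-like object, one record at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "variation":
            yield elem.get("id"), elem.findtext("allele")
            elem.clear()  # release the finished record's children


def find_variation(source, variation_id):
    """Single-variant lookup built on the same stream; stops at the first match."""
    for vid, allele in iter_variations(source):
        if vid == variation_id:
            return allele
    return None


# Batch import: consume every record.
print(list(iter_variations(io.BytesIO(SAMPLE))))
# Targeted query: look up one record.
print(find_variation(io.BytesIO(SAMPLE), "var2"))
```

A linear scan is only reasonable for small files; a service answering real-time queries would more likely index the parsed records in a database, which the format leaves entirely to the implementer.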
For long-term system architecture, consider that genomic knowledge evolves rapidly. A variant classified as benign today may be reclassified as pathogenic tomorrow. GSVML documents should include version metadata and support for annotation updates, enabling consuming systems to track the provenance of clinical interpretations over time.
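One way a consuming system might keep that provenance is to store interpretations as an append-only history rather than overwriting the latest value. The sketch below is an assumption about how such a store could look; the `Interpretation` and `AnnotatedVariation` types, their fields, and the sample entries are hypothetical and are not part of GSVML.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Interpretation:
    classification: str  # e.g. "benign", "pathogenic"
    asserted_on: date    # when the interpretation was made
    source: str          # who or what asserted it


@dataclass
class AnnotatedVariation:
    variation_id: str
    history: list[Interpretation] = field(default_factory=list)

    def reclassify(self, interpretation: Interpretation) -> None:
        """Append a new interpretation; earlier ones are never overwritten."""
        self.history.append(interpretation)

    def as_of(self, when: date) -> Optional[Interpretation]:
        """Return the latest interpretation asserted on or before `when`."""
        known = [i for i in self.history if i.asserted_on <= when]
        return max(known, key=lambda i: i.asserted_on) if known else None


# Placeholder entries; not real clinical assertions.
variant = AnnotatedVariation("LAB-0001")
variant.reclassify(Interpretation("benign", date(2020, 1, 15), "lab review"))
variant.reclassify(Interpretation("pathogenic", date(2023, 6, 1), "lab re-review"))

print(variant.as_of(date(2021, 1, 1)).classification)
print(variant.as_of(date(2024, 1, 1)).classification)
```

Querying by date lets a system reconstruct which interpretation was in force when a past clinical decision was made.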