Physical Address
304 North Cardinal St.
Dorchester Center, MA 02124
ISO 28500:2017 specifies the WARC (Web ARChive) file format, the international standard for storing web archival content. Developed as the successor to the ARC format (used by the Internet Archive since 1996), WARC addresses the growing need for a standardized, extensible format capable of capturing the complex structure of modern web resources. The format supports storing multiple digital resources — including HTTP responses, DNS records, and metadata — within a single archive file while preserving the relationships between resources and their contextual metadata.
A WARC file is a sequence of concatenated records, each consisting of a header section and an optional content block. The standard defines several mandatory and optional named fields within each record header:
| Field Name | Mandatory | Description |
|---|---|---|
| WARC-Record-ID | Yes | Globally unique identifier for the record (URI format) |
| Content-Length | Yes | Length of the content block in bytes |
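| WARC-Date | Yes | Date/time of resource capture (UTC timestamp in the W3C profile of ISO 8601, e.g., 2017-08-01T12:00:00Z; seconds precision is typical, and WARC 1.1 also permits sub-second precision) |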
| WARC-Type | Yes | Record type: response, request, resource, metadata, conversion, revisit, warcinfo, or continuation |
| Content-Type | No | MIME type of the content block (e.g., application/http; msgtype=response) |
| WARC-Block-Digest | No | Hash digest of the content block (e.g., sha1:X4QJ7P…) |
| WARC-Payload-Digest | No | Hash digest of the payload within the content block |
| WARC-IP-Address | No | IP address of the server that served the resource |
| WARC-Refers-To | No | Reference to a related record (e.g., the original response for a revisit record) |
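To make the header layout concrete, here is a minimal sketch in Python that serializes and re-parses one record carrying the four mandatory fields. The function names `build_record` and `parse_record` are hypothetical, not part of any WARC library, and the parser assumes one header per line (no folded continuation lines) and an uncompressed record.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_record(warc_type, block, extra=None):
    """Serialize one record: version line, named fields, blank line, block."""
    headers = {
        "WARC-Type": warc_type,
        "WARC-Record-ID": f"<urn:uuid:{uuid.uuid4()}>",
        "WARC-Date": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "Content-Length": str(len(block)),
    }
    headers.update(extra or {})
    head = b"WARC/1.1" + CRLF
    for name, value in headers.items():
        head += f"{name}: {value}".encode("utf-8") + CRLF
    # A blank line ends the header; two CRLFs follow the content block.
    return head + CRLF + block + CRLF + CRLF


def parse_record(data):
    """Return (version, headers, block) for a single serialized record."""
    head, _, rest = data.partition(CRLF + CRLF)
    lines = head.split(CRLF)
    version = lines[0].decode("utf-8")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()
    # Content-Length, not a delimiter search, decides where the block ends.
    length = int(headers["Content-Length"])
    return version, headers, rest[:length]


record = build_record("resource", b"hello", {"Content-Type": "text/plain"})
version, headers, block = parse_record(record)
```

Reading the block by `Content-Length` rather than by searching for a terminator matters because the content block is arbitrary bytes and may itself contain CRLF sequences.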
WARC defines eight record types, each serving a specific function in the web archiving workflow:
warcinfo describes the archive file itself (crawler software, operator, parameters). response stores the full HTTP response including headers and body. request captures the HTTP request that generated a response, enabling reconstruction of the complete client-server interaction. resource stores any digital resource not encapsulated in an HTTP envelope, such as FTP downloads or DNS records.
metadata stores supplementary information about another record (e.g., crawl log entries, extracted text). conversion records represent transformed versions of the original resource, such as OCR output or format migrations. revisit records indicate that a resource’s content is identical to a previously captured version, using the WARC-Refers-To field to establish the relationship — this is critical for efficient storage of repeated crawls.
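The continuation record type allows a large content block to be split across multiple WARC records, typically in separate files, which is essential when a single resource exceeds the practical file size limit. The first segment keeps the original record type and carries WARC-Segment-Number: 1; each later segment is a continuation record that points back to the first through WARC-Segment-Origin-ID. This keeps WARC usable under strict file size constraints (e.g., 4 GB on some filesystems) even for very large web resources.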
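The segmentation idea can be sketched as a small planning function. This is an illustration under stated assumptions rather than a complete writer: `plan_segments` is a hypothetical helper, it emits only the headers relevant to segmentation (a real record also needs WARC-Date and the other mandatory fields), and the segment field names WARC-Segment-Number, WARC-Segment-Origin-ID, and WARC-Segment-Total-Length should be checked against the standard before use.

```python
import uuid


def plan_segments(record_type, origin_id, block, max_segment):
    """Return a list of (headers, chunk) pairs for one logical record."""
    if max_segment <= 0:
        raise ValueError("max_segment must be positive")
    chunks = [block[i:i + max_segment]
              for i in range(0, len(block), max_segment)] or [b""]
    if len(chunks) == 1:
        # Small enough to store whole: no segmentation fields are needed.
        headers = {"WARC-Type": record_type,
                   "WARC-Record-ID": origin_id,
                   "Content-Length": str(len(block))}
        return [(headers, block)]
    planned = []
    for number, chunk in enumerate(chunks, start=1):
        if number == 1:
            # The first segment keeps the original record type and ID.
            headers = {"WARC-Type": record_type, "WARC-Record-ID": origin_id}
        else:
            headers = {"WARC-Type": "continuation",
                       "WARC-Record-ID": f"<urn:uuid:{uuid.uuid4()}>",
                       "WARC-Segment-Origin-ID": origin_id}
        headers["WARC-Segment-Number"] = str(number)
        headers["Content-Length"] = str(len(chunk))
        if number == len(chunks):
            # Only the final segment reports the full reassembled length.
            headers["WARC-Segment-Total-Length"] = str(len(block))
        planned.append((headers, chunk))
    return planned


segments = plan_segments("response", "<urn:uuid:origin>", b"x" * 25, 10)
```

A reader reassembles the original block by concatenating the chunks in WARC-Segment-Number order and can check the result against WARC-Segment-Total-Length.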
Implementing WARC-based systems requires careful consideration of several engineering aspects:
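Compression Strategy: WARC files compress exceptionally well with gzip (typical ratios of 5:1 to 10:1 for HTML content). However, compressing each record individually is preferred over compressing the file as a single stream, because it preserves random access. The standard's annex on compression recommends applying gzip record by record: each record becomes its own gzip member, and the members are concatenated so that the result is still a valid gzip file (conventionally named with a .warc.gz suffix). An external index of member offsets then lets a reader seek directly to any record.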
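A short Python sketch shows why per-record compression keeps random access cheap. It uses only the standard `gzip` and `zlib` modules; the helper names `write_members` and `read_member` and the in-memory "archive" are illustrative assumptions, standing in for a real file and a real offset index.

```python
import gzip
import zlib


def write_members(records):
    """Compress each record as its own gzip member; return (archive, offsets)."""
    archive = bytearray()
    offsets = []
    for record in records:
        offsets.append(len(archive))  # byte offset where this member starts
        archive += gzip.compress(record)
    return bytes(archive), offsets


def read_member(archive, offset):
    """Decompress exactly one gzip member starting at the given offset."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
    # Decompression stops at the end of the first member; any following
    # members are left untouched in decompressor.unused_data.
    return decompressor.decompress(archive[offset:])


records = [
    b"WARC/1.1\r\nWARC-Type: warcinfo\r\nContent-Length: 0\r\n\r\n\r\n\r\n",
    b"WARC/1.1\r\nWARC-Type: resource\r\nContent-Length: 2\r\n\r\nhi\r\n\r\n",
]
archive, offsets = write_members(records)
second = read_member(archive, offsets[1])  # no need to read the first record
```

Because concatenated gzip members form a valid gzip stream, `gzip.decompress(archive)` still yields all records back to back, so tools that know nothing about WARC can decompress the file as a whole.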
File Size Management: Common practice limits individual WARC files to 1 GB uncompressed or approximately 100 MB compressed. This balances storage efficiency with practical handling requirements. The WARC specification permits any file size but recommends implementors support at least 4 GB files.
Deduplication Strategies: Revisit records combined with payload digests enable significant storage savings during repeated crawls. A typical weekly crawl of a 10-million-page site might store only 10-20% new content after the initial crawl, with the remainder stored as revisit records referencing the original capture.
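The deduplication decision can be sketched as a lookup keyed by payload digest. Several details here are assumptions for illustration: the `DedupIndex` class is hypothetical, the digest is written as SHA-1 in Base32 (a common convention in web archives, though the standard does not mandate an algorithm), and the WARC-Profile URI shown is the identical-payload-digest profile as I understand it for WARC 1.1 and should be verified against the standard.

```python
import base64
import hashlib

# Assumed profile URI for revisits detected by identical payload digest.
REVISIT_PROFILE = "http://netpreserve.org/warc/1.1/revisit/identical-payload-digest"


def payload_digest(payload):
    """Label a payload as 'sha1:' followed by the Base32-encoded SHA-1 hash."""
    raw = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


class DedupIndex:
    """Remember the first record seen for each payload digest."""

    def __init__(self):
        self._first_seen = {}

    def decide(self, record_id, payload):
        """Return the headers that distinguish a full capture from a revisit."""
        digest = payload_digest(payload)
        original = self._first_seen.get(digest)
        if original is None:
            # New content: store the full response and remember its ID.
            self._first_seen[digest] = record_id
            return {"WARC-Type": "response", "WARC-Payload-Digest": digest}
        # Seen before: store a small revisit record pointing at the original.
        return {
            "WARC-Type": "revisit",
            "WARC-Payload-Digest": digest,
            "WARC-Refers-To": original,
            "WARC-Profile": REVISIT_PROFILE,
        }


index = DedupIndex()
first = index.decide("<urn:uuid:a>", b"<html>same</html>")
again = index.decide("<urn:uuid:b>", b"<html>same</html>")
```

In a production crawler the index would usually be persisted between crawls (for example in a key-value store) so that a weekly crawl can refer back to captures from earlier weeks.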
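Error Handling: Robust WARC implementations should handle truncated records gracefully, log validation errors, and support partial file recovery. Because each record is self-delimiting, damage to one record need not make the rest of the file unreadable: a reader can skip ahead to the next record boundary (the next WARC version line, or the next gzip member in a per-record-compressed file) and resume parsing. A corrupted Content-Length value, however, can throw a naive reader out of sync until it resynchronizes in this way.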
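One way to approach partial recovery is a scanner that validates each record and resynchronizes on the next version line when validation fails. The sketch below is a heuristic under stated assumptions, not a reference implementation: `scan_records` is a hypothetical function, it works on uncompressed bytes held in memory, and searching for a version line could in principle match text inside a damaged content block.

```python
import re

VERSION_LINE = re.compile(rb"WARC/1\.[01]\r\n")
END = b"\r\n\r\n"


def scan_records(data):
    """Return (records, errors); records are (offset, headers, block) tuples."""
    records, errors = [], []
    pos = 0
    while True:
        match = VERSION_LINE.search(data, pos)
        if match is None:
            break
        start = match.start()
        head_end = data.find(END, match.end())
        if head_end == -1:
            errors.append((start, "truncated header"))
            break
        headers = {}
        for line in data[match.end():head_end].split(b"\r\n"):
            name, _, value = line.partition(b":")
            headers[name.strip().decode("latin-1")] = value.strip().decode("latin-1")
        try:
            length = int(headers["Content-Length"])
        except (KeyError, ValueError):
            errors.append((start, "missing or invalid Content-Length"))
            pos = match.end()  # resync: look for the next version line
            continue
        block_start = head_end + len(END)
        block_end = block_start + length
        # A healthy record is followed by exactly two CRLFs.
        if data[block_end:block_end + len(END)] != END:
            errors.append((start, "truncated block or wrong Content-Length"))
            pos = match.end()
            continue
        records.append((start, headers, data[block_start:block_end]))
        pos = block_end + len(END)
    return records, errors
```

The returned error list gives the byte offset of each damaged record, which is the kind of detail worth writing to a validation log.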