Physical Address
304 North Cardinal St.
Dorchester Center, MA 02124
ISO/IEC 29341-4-2:2011 defines the UPnP AV Architecture, the foundational framework that describes how the various device types and services defined in the ISO/IEC 29341 series work together to form a complete, interoperable audio/video home networking system. Unlike the individual service specifications that focus on specific functions, this architecture document provides the overarching design patterns, interaction protocols, and system-level behavior that enable a media server from Vendor A to stream content to a renderer from Vendor B controlled by an app from Vendor C.
The architecture builds on the four core phases of UPnP interaction: discovery (SSDP), description (device and service XML documents), control (SOAP actions), and eventing (GENA). These phases come from the UPnP Device Architecture v1.0 (ISO/IEC 29341-1 series); the AV Architecture adds AV-specific semantics on top of them. It also defines a 3-box model (the standard server-renderer-control point triangle) and a 2-box model (where a single device combines two roles).
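As an illustration of the discovery phase, the sketch below builds the SSDP M-SEARCH datagram a control point would multicast to find AV devices. The function name `build_msearch` and the default `mx` value are assumptions made for this example; the multicast address, port, and header names follow the UPnP Device Architecture.

```python
# Minimal sketch of an SSDP search request (discovery phase).
# build_msearch is a hypothetical helper, not part of any library.

SSDP_ADDR = "239.255.255.250"  # SSDP IPv4 multicast address
SSDP_PORT = 1900               # SSDP UDP port

MEDIA_SERVER = "urn:schemas-upnp-org:device:MediaServer:1"
MEDIA_RENDERER = "urn:schemas-upnp-org:device:MediaRenderer:1"


def build_msearch(search_target: str, mx: int = 2) -> bytes:
    """Return the bytes of an M-SEARCH request for one search target.

    mx is the maximum number of seconds a device may wait before replying.
    """
    lines = [
        "M-SEARCH * HTTP/1.1",
        f"HOST: {SSDP_ADDR}:{SSDP_PORT}",
        'MAN: "ssdp:discover"',
        f"MX: {mx}",
        f"ST: {search_target}",
        "",  # blank line terminates the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")
```

A control point would send this datagram over UDP to `(SSDP_ADDR, SSDP_PORT)` once per search target and then read the `LOCATION` header of each unicast reply to fetch the device description.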
The Media Server role provides content storage, metadata management, and streaming capabilities. It implements the ContentDirectory and ConnectionManager services, and optionally AVTransport. The Media Renderer role receives and renders content, implementing the RenderingControl and ConnectionManager services, and optionally AVTransport. The Control Point role orchestrates the interaction: it implements no media services itself but tells the server what to serve and the renderer how to render it.
| Role | Required Services | Examples | Network Position |
|---|---|---|---|
| Media Server | CDS, CMS, (AVT optional) | NAS, PC media library, DVR | Content source (HTTP/RTSP server) |
| Media Renderer | RCS, CMS, (AVT optional) | Smart TV, Sonos speaker, AV receiver | Content sink (HTTP/RTSP client) |
| Control Point | None (client only) | Smartphone app, remote control UI | Orchestrator (invokes actions on both) |
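The role requirements can be checked mechanically against the service list in a device description. The sketch below assumes that AVTransport is optional for both device types, as in the UPnP MediaServer and MediaRenderer device templates; the names `REQUIRED_SERVICES`, `service_name`, and `roles_for` are hypothetical.

```python
# Minimal sketch: infer which AV roles a device can fill from its service types.
# Assumption: AVTransport is optional for both roles, so it is not listed here.

REQUIRED_SERVICES = {
    "MediaServer": {"ContentDirectory", "ConnectionManager"},
    "MediaRenderer": {"RenderingControl", "ConnectionManager"},
}


def service_name(service_type: str) -> str:
    """Extract 'ContentDirectory' from 'urn:schemas-upnp-org:service:ContentDirectory:1'."""
    parts = service_type.split(":")
    return parts[3] if len(parts) >= 5 else ""


def roles_for(service_types):
    """Return the sorted list of roles whose required services are all present."""
    names = {service_name(s) for s in service_types}
    return sorted(role for role, required in REQUIRED_SERVICES.items()
                  if required <= names)
```

A control point has no entry in the mapping because it exposes no services; it only consumes them.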
The canonical 3-box interaction flow proceeds as follows: (1) The Control Point discovers a Media Server and Media Renderer via SSDP multicast. (2) The Control Point retrieves device descriptions and service XML from both. (3) The Control Point queries the CMS of both devices via GetProtocolInfo() to find compatible protocols. (4) The Control Point calls Browse() on the server’s CDS to present content choices to the user. (5) Upon user selection, the Control Point calls PrepareForConnection() on the CMS of each device that implements this optional action. (6) The Control Point invokes SetAVTransportURI() followed by Play() on the AVT, which is hosted by the renderer for pull protocols such as HTTP GET and by the server for push protocols. (7) Media flows directly from server to renderer, without passing through the Control Point. (8) The Control Point can adjust rendering via RCS and playback via AVT during streaming.
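The control steps in this flow are SOAP requests posted to a service's control URL. The following sketch shows how the request for SetAVTransportURI() might be assembled. The helper `build_soap_request` is hypothetical; the argument names (`InstanceID`, `CurrentURI`, `CurrentURIMetaData`) are those of the AVTransport service, and the media URL in the usage note is made up for illustration.

```python
# Minimal sketch of a UPnP SOAP control request (control phase).
from xml.sax.saxutils import escape

AVT_TYPE = "urn:schemas-upnp-org:service:AVTransport:1"


def build_soap_request(service_type: str, action: str, args: dict):
    """Return (headers, body) for an HTTP POST to the service's control URL."""
    # Argument values are XML-escaped; argument elements carry no namespace.
    arg_xml = "".join(f"<{name}>{escape(str(value))}</{name}>"
                      for name, value in args.items())
    body = (
        '<?xml version="1.0" encoding="utf-8"?>'
        '<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/" '
        's:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">'
        f'<s:Body><u:{action} xmlns:u="{service_type}">{arg_xml}</u:{action}></s:Body>'
        "</s:Envelope>"
    )
    headers = {
        "Content-Type": 'text/xml; charset="utf-8"',
        "SOAPACTION": f'"{service_type}#{action}"',
    }
    return headers, body.encode("utf-8")
```

For example, `build_soap_request(AVT_TYPE, "SetAVTransportURI", {"InstanceID": 0, "CurrentURI": "http://192.168.1.10:8200/media/1.mp3", "CurrentURIMetaData": ""})` would be followed by a second request for `Play` with `{"InstanceID": 0, "Speed": "1"}`.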
In the 2-box model, two of the three roles are combined into a single physical device. One variant is the “Media Server + Control Point” combination, where a device like a smartphone acts as both content source and controller, streaming to a separate renderer. Another is the “Media Renderer + Control Point” combination, where a smart TV with a built-in content browser discovers and browses a remote media server. Because the combined roles can communicate internally rather than over UPnP, the 2-box model reduces network round-trips at the cost of tighter coupling between those roles.
The AV Architecture intentionally separates the control plane (UPnP actions via SOAP) from the data plane (media transfer). This separation allows the architecture to support any transport protocol that can be described by the ProtocolInfo string format, which has four colon-separated fields: protocol, network, content format, and additional info (for example, `http-get:*:audio/mpeg:*`). When new streaming technologies emerge (e.g., WebRTC, HLS, MPEG-DASH), they can in principle be integrated into the UPnP AV framework by defining a protocol identifier for them and ensuring that both endpoints advertise it through the CMS so that the control point can match it.
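Protocol matching can be sketched as follows: parse the comma-separated ProtocolInfo lists returned by GetProtocolInfo() on each device and keep the source/sink pairs that agree. This is a simplified illustration, assuming that `*` acts as a wildcard, that the fourth field (additional info) can be ignored for matching, and that no entry contains an escaped comma. All function names are hypothetical.

```python
# Minimal sketch of ProtocolInfo parsing and source/sink matching.
from typing import NamedTuple


class ProtocolInfo(NamedTuple):
    protocol: str         # e.g. "http-get"
    network: str          # usually "*"
    content_format: str   # e.g. a MIME type for http-get
    additional_info: str  # vendor or profile details, or "*"


def parse_protocol_info(entry: str) -> ProtocolInfo:
    """Parse one '<protocol>:<network>:<contentFormat>:<additionalInfo>' entry."""
    parts = entry.strip().split(":", 3)
    if len(parts) != 4:
        raise ValueError(f"malformed ProtocolInfo entry: {entry!r}")
    return ProtocolInfo(*parts)


def _field_matches(a: str, b: str) -> bool:
    return a == "*" or b == "*" or a == b


def compatible(source: ProtocolInfo, sink: ProtocolInfo) -> bool:
    """Compare the first three fields; additional_info is ignored in this sketch."""
    return all(_field_matches(s, k) for s, k in zip(source[:3], sink[:3]))


def match_protocols(source_csv: str, sink_csv: str):
    """Return every (source, sink) pair that could carry the content."""
    sources = [parse_protocol_info(e) for e in source_csv.split(",") if e.strip()]
    sinks = [parse_protocol_info(e) for e in sink_csv.split(",") if e.strip()]
    return [(s, k) for s in sources for k in sinks if compatible(s, k)]
```

A real control point would typically also compare the additional-info field when the devices use it to signal profile constraints.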
The architecture also specifies an event notification model. State changes in a service are pushed to subscribed control points via GENA. In the AVTransport and RenderingControl services, the LastChange event variable aggregates multiple state variable updates into a single XML document, reducing event volume. For scalability, control points subscribe with a finite timeout (for example, `TIMEOUT: Second-300`) and renew the subscription before it expires, which allows devices to clean up stale subscriptions.
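To show how a control point might consume an aggregated event, the sketch below turns a LastChange document into a dictionary keyed by instance ID. It assumes the LastChange value has already been extracted from the GENA NOTIFY body and XML-unescaped. The function name `parse_last_change` is hypothetical, and per-channel variables such as RenderingControl's Volume are collapsed to one value per name for brevity.

```python
# Minimal sketch: parse a LastChange event document into {instance_id: {variable: value}}.
import xml.etree.ElementTree as ET


def _local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def parse_last_change(xml_text: str) -> dict:
    root = ET.fromstring(xml_text)
    changes = {}
    for instance in root:
        if _local_name(instance.tag) != "InstanceID":
            continue
        instance_id = int(instance.get("val", "0"))
        variables = changes.setdefault(instance_id, {})
        for variable in instance:
            # Each child element is one state variable; its value is in "val".
            variables[_local_name(variable.tag)] = variable.get("val")
    return changes
```

Because several variables arrive in one document, the control point can update its view of the transport state in a single step instead of handling one notification per variable.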