1. Introduction
There are multiple IETF protocols for establishment and termination of media sessions (SIP [6]), low-level media control (Media Gateway Control Protocol (MGCP) [7] and Media Gateway Controller (MEGACO) [8]), and media record and playback (RTSP [9]). This document focuses on requirements for one or more protocols to support the control of network elements that perform Automated Speech Recognition (ASR), speaker identification or verification (SI/SV), and rendering text into audio, also known as Text-to-Speech (TTS). Many multimedia applications can benefit from having automatic speech recognition (ASR) and text-to-speech (TTS) processing available as a distributed, network resource. This requirements document limits its focus to the distributed control of ASR, SI/SV, and TTS servers. There is a broad range of systems that can benefit from a unified approach to control of TTS, ASR, and SI/SV. These include environments such as Voice over IP (VoIP) gateways to the Public Switched Telephone Network (PSTN), IP telephones, media servers, and wireless mobile devices that obtain speech services via servers on the network. To date, there are a number of proprietary ASR and TTS APIs, as well as two IETF documents that address this problem [13], [14]. However, there are serious deficiencies to the existing documents. In particular, they mix the semantics of existing protocols yet are close enough to other protocols as to be confusing to the implementer. This document sets forth requirements for protocols to support distributed speech processing of audio streams. For simplicity, and to remove confusion with existing protocol proposals, this document presents the requirements as being for a "framework" that addresses the distributed control of speech resources. It refers to such a framework as "SPEECHSC", for Speech Services Control.
1.1. Document Conventions
In this document, the key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" are to be interpreted as described in RFC 2119 [3].
