Speech to Text Documentation

Service Overview

The IBM Speech to Text service provides a Representational State Transfer (REST) Application Programming Interface (API) that enables you to add IBM's speech transcription capabilities to your applications. The service also supports an asynchronous HTTP interface for transcribing audio via non-blocking calls. The service transcribes speech from various languages and audio formats to text with low latency. The service supports transcription of the following languages: Brazilian Portuguese, Japanese, Mandarin Chinese, Modern Standard Arabic, Spanish, UK English, and US English. For most languages, the service supports two sampling rates: broadband, for audio sampled at a minimum of 16 kHz, and narrowband, for audio sampled at 8 kHz.

API Overview

The Speech to Text service provides the following endpoints:

  • /v1/models returns information about the models (languages and sampling rates) available for transcription.
  • /v1/sessions provides a collection of methods that enable a client to maintain a long, multi-turn exchange, or session, with the service, or to establish multiple parallel conversations with a particular instance of the service.
  • /v1/recognize (sessionless) includes a single method that provides a simple means of transcribing audio without the overhead of establishing and maintaining a session, but it lacks some of the capabilities available with sessions.
  • /v1/register_callback (asynchronous) offers a single method that registers, or white-lists, a callback URL for use with methods of the asynchronous HTTP interface.
  • /v1/recognitions (asynchronous) provides a set of non-blocking methods for submitting, querying, and deleting jobs for recognition requests with the asynchronous HTTP interface.
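As a starting point, the /v1/models endpoint can be queried with a simple authenticated GET request. The following sketch builds such a request with Python's standard library; the credentials and the service host are placeholder assumptions that you would replace with the values for your own service instance.

```python
import base64
import urllib.request

# Placeholder credentials and host for illustration only; substitute the
# values for your own Speech to Text service instance.
USERNAME = "apiuser"
PASSWORD = "apipass"
BASE_URL = "https://stream.watsonplatform.net/speech-to-text/api"

def build_models_request(username: str, password: str) -> urllib.request.Request:
    """Construct (but do not send) a GET request for the available models."""
    credentials = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{BASE_URL}/v1/models",
        headers={"Authorization": f"Basic {credentials}"},
        method="GET",
    )

req = build_models_request(USERNAME, PASSWORD)
print(req.full_url)  # https://stream.watsonplatform.net/speech-to-text/api/v1/models
```

Sending the request (for example, with urllib.request.urlopen) returns a JSON document describing each available model, including its language and sampling rate.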

API Usage

The following general information pertains to the transcription of audio:

  • You can pass the audio to be transcribed via one-shot delivery or in streaming mode. With one-shot delivery, you pass all of the audio data to the service at one time. With streaming mode, you send audio data to the service in chunks over a persistent connection. If your data consists of multiple parts, you must stream the data. To use streaming, you must pass the Transfer-Encoding request header with a value of chunked. With either form of data transmission, the service accepts a maximum of 100 MB of audio per request.
  • You can use methods of the session-based, sessionless, or asynchronous HTTP interfaces to pass audio data to the service. All interfaces let you send the data via the body of the request; the session-based and sessionless methods also let you pass data in the form of one or more audio files as multipart form data. When you send the data in the request body, you control the transcription via a collection of request headers and query parameters. When you send the data as multipart form data, you control the transcription primarily via JSON metadata sent as form data.
  • The service also offers a WebSocket interface as an alternative to its HTTP interfaces. The WebSocket interface supports efficient implementation, lower latency, and higher throughput. The interface establishes a single persistent connection with the service, eliminating the need for the session-based calls of the HTTP interface.
  • By default, all Watson services log requests and their results. Data is collected only to improve the Watson services. If you do not want to share your data, set the header parameter X-Watson-Learning-Opt-Out to true for each request. Data is collected for any request that omits this header.
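The points above come together in the request headers for a streaming sessionless call. The following sketch assembles those headers and reads an audio file in chunks for streaming delivery; the content type, chunk size, and helper names are illustrative assumptions, while the header names (Transfer-Encoding, X-Watson-Learning-Opt-Out) come from the text above.

```python
CHUNK_SIZE = 8192  # bytes per chunk; an illustrative choice, not a service requirement

def recognition_headers(content_type: str, opt_out: bool = True) -> dict:
    """Build headers for a streaming recognition request.

    Transfer-Encoding: chunked enables streaming mode, and
    X-Watson-Learning-Opt-Out: true keeps the request's data
    from being collected to improve the Watson services.
    """
    headers = {
        "Content-Type": content_type,
        "Transfer-Encoding": "chunked",
    }
    if opt_out:
        headers["X-Watson-Learning-Opt-Out"] = "true"
    return headers

def read_chunks(path: str, chunk_size: int = CHUNK_SIZE):
    """Yield successive chunks of an audio file for streaming delivery."""
    with open(path, "rb") as audio:
        while chunk := audio.read(chunk_size):
            yield chunk

headers = recognition_headers("audio/wav")
print(headers)
```

Passing the chunk generator as the request body (most HTTP clients accept an iterable of bytes) sends the audio in streaming mode rather than as a one-shot delivery.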

For more information about using the Speech to Text service and the various interfaces it supports, see Using the Speech to Text service.