Streaming Audio

Endpoints

Primary Endpoint: Supports both encoded audio and raw PCM

wss://api.aldea.ai/v1/listen

Note: Both URLs are aliases and function identically

PCM-Only Endpoint: Optimized for raw PCM

wss://api.aldea.ai/v1/listen/pcm
Method: GET (WebSocket upgrade)
Status: 101 Switching Protocols

Headers

| Header | Type | Required | Description |
| --- | --- | --- | --- |
| Authorization | string | Yes | Bearer token; required for all API requests. |
| Content-Type | string | No | Audio format hint (e.g., audio/mpeg, audio/wav). Used for format detection. |

Query Parameters

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| sample_rate | number | 16000 | Client audio sample rate in Hz; server resamples to 16 kHz for inference. |
| interim_results | bool | true | Emit interim updates while streaming. |
| endpointing | number or false | env | Controls sentence finalization latency (heuristic mapping). |
| encoding | string | auto-detect | Audio encoding format (pcm, mp3, aac, flac, wav, ogg, webm, opus, m4a). Also accepts format or codec as aliases. |
| redact | string | - | Enable PII redaction. Can be repeated multiple times. Values: pii (person names, emails, phone numbers, addresses, SSN), pci (credit cards, bank numbers), numbers (dates, phone numbers, SSN, credit cards), true (all PII types). Example: ?redact=pii&redact=pci |
| language | string | - | Language identifier in BCP-47 format (e.g., "en-US"). Also accepts lang as alias. Spanish can be indicated with "es" or "es-ES". |
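
As a rough illustration of how the headers and query parameters above fit together, the sketch below assembles a connection URL and a header dictionary. The helper names `build_listen_url` and `build_headers` are hypothetical, and serializing booleans as lowercase `true`/`false` is an assumption rather than something this reference states.

```python
from urllib.parse import urlencode

BASE_URL = "wss://api.aldea.ai/v1/listen"


def build_listen_url(base_url=BASE_URL, *, sample_rate=None, interim_results=None,
                     encoding=None, language=None, redact=()):
    """Build a listen URL from the documented query parameters (hypothetical helper)."""
    params = []
    if sample_rate is not None:
        params.append(("sample_rate", str(sample_rate)))
    if interim_results is not None:
        # Assumption: booleans are sent as lowercase "true"/"false".
        params.append(("interim_results", "true" if interim_results else "false"))
    if encoding is not None:
        params.append(("encoding", encoding))
    if language is not None:
        params.append(("language", language))
    # redact may be repeated, e.g. ?redact=pii&redact=pci
    for kind in redact:
        params.append(("redact", kind))
    query = urlencode(params)
    return f"{base_url}?{query}" if query else base_url


def build_headers(token, content_type=None):
    """Build the request headers: Authorization is required, Content-Type is optional."""
    headers = {"Authorization": f"Bearer {token}"}
    if content_type is not None:
        headers["Content-Type"] = content_type
    return headers
```

For example, `build_listen_url(sample_rate=16000, redact=["pii", "pci"])` yields a URL ending in `?sample_rate=16000&redact=pii&redact=pci`.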

Send

Each client message is one of the following:

  • Audio Data (binary): audio in one of these formats:
      - Raw PCM: 16-bit signed little-endian (s16le) frames
      - Encoded formats: MP3, AAC, FLAC, WAV, OGG, WebM, Opus, M4A
  • Finalize (object)
  • CloseStream (object)
  • KeepAlive (object)

Receive

Each server message is one of the following:

  • Metadata (object)
  • SpeechStarted (object)
  • Results (object)
  • UtteranceEnd (object)

Messages

Client → Server

Binary audio data in one of the following formats:

  • Raw PCM: 16-bit signed little-endian (s16le) frames
      - Supported by both /v1/listen and /v1/listen/pcm
  • Encoded formats: MP3, AAC, FLAC, WAV, OGG, WebM, Opus, M4A
      - Supported by /v1/listen only
      - Auto-detected from format hints or binary headers

Control messages (JSON):

{ "type": "Finalize" }
{ "type": "CloseStream" }
{ "type": "KeepAlive" }
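
A minimal sketch of preparing client messages, assuming raw s16le PCM input. The helper names `pcm_chunks` and `control_message` are hypothetical, the 100 ms chunk size is an arbitrary illustrative choice (this reference does not prescribe one), and sending control messages as JSON text is an assumption.

```python
import json

BYTES_PER_SAMPLE = 2  # 16-bit signed little-endian (s16le), mono
CONTROL_TYPES = {"Finalize", "CloseStream", "KeepAlive"}


def pcm_chunks(pcm, sample_rate=16000, chunk_ms=100):
    """Yield slices of raw PCM that each hold a whole number of samples."""
    chunk_bytes = (sample_rate * chunk_ms // 1000) * BYTES_PER_SAMPLE
    if chunk_bytes <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for offset in range(0, len(pcm), chunk_bytes):
        yield pcm[offset:offset + chunk_bytes]


def control_message(kind):
    """Serialize one of the documented control messages as JSON."""
    if kind not in CONTROL_TYPES:
        raise ValueError(f"unknown control message: {kind}")
    return json.dumps({"type": kind})
```

Each chunk from `pcm_chunks` would be sent as one binary message, with `control_message("Finalize")` or `control_message("CloseStream")` sent once the audio is exhausted.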

Server → Client

{
  "type": "Metadata",
  "request_id": "...",
  "created": "2025-01-01T00:00:00.000000Z",
  "duration": 0.0,
  "channels": 1,
  "model_info": {
    "name": "<model_id>",
    "version": "",
    "arch": "aldea-asr"
  }
}

{
  "type": "SpeechStarted",
  "channel": [0],
  "timestamp": 0.0
}

{
  "type": "Results",
  "channel_index": [0],
  "duration": 1.98,
  "start": 0.00,
  "is_final": false,
  "speech_final": false,
  "channel": {
    "alternatives": [{
      "transcript": "Hello world",
      "confidence": 0.95,
      "words": [
        ["Hello", 0, 320],
        ["world", 320, 640]
      ]
    }]
  },
  "metadata": {
    "request_id": "...",
    "model_info": {
      "name": "<model>",
      "version": null,
      "arch": "aldea-asr"
    }
  }
}

{
  "type": "UtteranceEnd",
  "channel": [0],
  "last_word_end": 2.5
}
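
One way a client might dispatch on the `type` field of these server messages is sketched below. The function name `handle_message` and the shape of its return value are hypothetical; only the field names come from the examples above.

```python
import json


def handle_message(raw):
    """Parse one server message and return a small summary dict."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "Results":
        # Take the first alternative, as in the example message above.
        alt = msg["channel"]["alternatives"][0]
        return {
            "event": "final" if msg.get("is_final") else "interim",
            "transcript": alt["transcript"],
            "confidence": alt.get("confidence"),
        }
    if kind in ("Metadata", "SpeechStarted", "UtteranceEnd"):
        return {"event": kind}
    # Anything else is passed through so the caller can log or ignore it.
    return {"event": "unknown", "type": kind}
```

A caller would typically display interim transcripts as they arrive and commit text only when `event` is `"final"`.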

Supported Audio Formats

Encoded Formats (via /v1/listen):

  • MP3, AAC, FLAC, WAV, OGG, WebM, Opus, M4A
  • Auto-detected from format hints or binary headers

Raw PCM (via /v1/listen or /v1/listen/pcm):

  • 16-bit signed little-endian (s16le)
  • 16 kHz sample rate by default (set the sample_rate parameter for other rates)
  • Mono channel (auto-converted)
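
If the client captures audio as floating-point samples, they need to be converted to s16le before sending. The sketch below is one way to do that, assuming samples in the range -1.0 to 1.0; the function name `floats_to_s16le` and the scale factor of 32767 are illustrative choices, not part of this API.

```python
import struct


def floats_to_s16le(samples):
    """Convert float samples in [-1.0, 1.0] to 16-bit signed little-endian PCM bytes."""
    out = bytearray()
    for sample in samples:
        # Clip out-of-range values so they cannot overflow a 16-bit integer.
        clipped = max(-1.0, min(1.0, sample))
        # "<h" packs one signed 16-bit integer in little-endian byte order.
        out += struct.pack("<h", int(round(clipped * 32767)))
    return bytes(out)
```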

Notes

  • Audio is treated as mono; channels param is not required.
  • Supported events: Metadata, Results (interim/final), SpeechStarted. UtteranceEnd requires word timestamps to be enabled.
  • Unsupported: diarization, multichannel separation, numerals/profanity/smart_format (accepted, but they have no effect).
  • PII Redaction: Use ?redact=pii to automatically redact personally identifiable information. Combine multiple types: ?redact=pii&redact=pci. Control anonymization style with redact_mode (mask, redact, or replace).
  • Format Detection: The format is taken from the Content-Type header or from the encoding query parameter (aliases: format, codec). If neither is supplied, it is detected from the binary headers of the audio.
  • Confidence Scores: Results include calculated confidence values based on word-level confidences when available.
  • Word Timestamps: Words array format is [word, start_ms, end_ms] when timestamps are enabled. Timestamps are in milliseconds.
  • Encoded Audio: If your client produces MP3/Opus, you can send encoded chunks directly to /v1/listen, or decode on the client and use /v1/listen/pcm.
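
Because word timestamps arrive as [word, start_ms, end_ms] arrays in milliseconds, a client may want to convert them to named fields in seconds. A minimal sketch, with the hypothetical helper name `words_to_seconds`:

```python
def words_to_seconds(words):
    """Convert [word, start_ms, end_ms] arrays into dicts with times in seconds."""
    return [
        {"word": word, "start": start_ms / 1000, "end": end_ms / 1000}
        for word, start_ms, end_ms in words
    ]
```

Applied to the words from the Results example, this gives "Hello" spanning 0.0 to 0.32 seconds and "world" spanning 0.32 to 0.64 seconds.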