Streaming Audio

Endpoints

Primary Endpoint: Supports both encoded audio and raw PCM

wss://api.aldea.ai/v1/listen

Note: Both URLs are aliases and function identically

PCM-Only Endpoint: Optimized for raw PCM

wss://api.aldea.ai/v1/listen/pcm

Method: GET (WebSocket upgrade)

Status: 101 Switching Protocols

Headers

Header	Type	Required	Description
`Authorization`	string	Yes	Bearer token; required for all API requests.
`Content-Type`	string	No	Audio format hint (e.g., audio/mpeg, audio/wav). Used for format detection.

Query Parameters

Name	Type	Default	Description
`sample_rate`	number	16000	Client audio sample rate in Hz; server resamples to 16 kHz for inference.
`interim_results`	bool	true	Emit interim updates while streaming.
`endpointing`	number|false	env	Controls sentence finalization latency (heuristic mapping).
`encoding`	string	auto-detect	Audio encoding format (pcm, mp3, aac, flac, wav, ogg, webm, opus, m4a). Also accepts format or codec as aliases.
`redact`	string	-	Enables PII redaction. May be repeated to combine types. Values: pii (person names, emails, phone numbers, addresses, SSNs), pci (credit cards, bank numbers), numbers (dates, phone numbers, SSNs, credit cards), true (all PII types). Example: ?redact=pii&redact=pci
`language`	string	-	Language identifier in BCP-47 format (e.g., `"en-US"`). Also accepts `lang` as alias. Spanish can be indicated with `"es"` or `"es-ES"`.
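
The header and query parameter tables above can be combined into a connection URL and header set. The following is a minimal sketch using only the Python standard library; the helper names `build_listen_url` and `build_headers`, the placeholder token, and the serialization of booleans as `true`/`false` are assumptions made for illustration, not part of the documented API.

```python
from urllib.parse import urlencode

BASE_URL = "wss://api.aldea.ai/v1/listen"


def build_listen_url(sample_rate=16000, interim_results=True,
                     encoding=None, language=None, redact=()):
    """Build the WebSocket URL from the documented query parameters."""
    params = [
        ("sample_rate", str(sample_rate)),
        # Assumption: booleans are sent as lowercase "true"/"false".
        ("interim_results", "true" if interim_results else "false"),
    ]
    if encoding:
        params.append(("encoding", encoding))
    if language:
        params.append(("language", language))
    # redact may be repeated, e.g. ?redact=pii&redact=pci
    for kind in redact:
        params.append(("redact", kind))
    return f"{BASE_URL}?{urlencode(params)}"


def build_headers(token, content_type=None):
    """Authorization is required; Content-Type is an optional format hint."""
    headers = {"Authorization": f"Bearer {token}"}
    if content_type:
        headers["Content-Type"] = content_type
    return headers


url = build_listen_url(encoding="pcm", language="en-US", redact=["pii", "pci"])
headers = build_headers("YOUR_API_TOKEN")  # hypothetical placeholder token
print(url)
print(headers)
```

The resulting URL and headers can be passed to whichever WebSocket client library you use; the way headers are supplied differs between libraries.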

Send

Audio Data (binary)

Binary audio data in one of the following formats:

Raw PCM: 16-bit signed little-endian (s16le) frames
Encoded formats: MP3, AAC, FLAC, WAV, OGG, WebM, Opus, M4A

Finalize (object)

CloseStream (object)

KeepAlive (object)

Receive

Metadata (object)

SpeechStarted (object)

Results (object)

UtteranceEnd (object)

Messages

Client → Server

Binary audio data in one of the following formats:

• Raw PCM: 16-bit signed little-endian (s16le) frames
  - Supported by both /v1/listen and /v1/listen/pcm
• Encoded formats: MP3, AAC, FLAC, WAV, OGG, WebM, Opus, M4A
  - Supported by /v1/listen only
  - Auto-detected from format hints or binary headers

{ "type": "Finalize" }

{ "type": "CloseStream" }

{ "type": "KeepAlive" }
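
As a rough illustration of the client messages above, the sketch below serializes the three control messages and interleaves them with binary audio chunks. The helper names `control_message` and `outgoing_frames` are hypothetical. Sending control messages as text frames, and ending a session with Finalize followed by CloseStream, are assumptions about typical usage rather than behavior this reference specifies.

```python
import json

CONTROL_TYPES = ("Finalize", "CloseStream", "KeepAlive")


def control_message(kind):
    """Serialize one of the documented control messages to a JSON string."""
    if kind not in CONTROL_TYPES:
        raise ValueError(f"unknown control message: {kind!r}")
    return json.dumps({"type": kind})


def outgoing_frames(audio_chunks):
    """Yield audio chunks as bytes, then the closing control messages as text.

    Assumption: Finalize is sent before CloseStream at the end of the audio.
    """
    for chunk in audio_chunks:
        yield bytes(chunk)
    yield control_message("Finalize")
    yield control_message("CloseStream")


# Example: one tiny chunk of silence followed by the control messages.
for frame in outgoing_frames([b"\x00\x00" * 4]):
    if isinstance(frame, bytes):
        print("binary", len(frame), "bytes")
    else:
        print("text", frame)
```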

Server → Client

{
  "type": "Metadata",
  "request_id": "...",
  "created": "2025-01-01T00:00:00.000000Z",
  "duration": 0.0,
  "channels": 1,
  "model_info": {
    "name": "<model_id>",
    "version": "",
    "arch": "aldea-asr"
  }
}

{
  "type": "SpeechStarted",
  "channel": [0],
  "timestamp": 0.0
}

{
  "type": "Results",
  "channel_index": [0],
  "duration": 1.98,
  "start": 0.00,
  "is_final": false,
  "speech_final": false,
  "channel": {
    "alternatives": [{
      "transcript": "Hello world",
      "confidence": 0.95,
      "words": [
        ["Hello", 0, 320],
        ["world", 320, 640]
      ]
    }]
  },
  "metadata": {
    "request_id": "...",
    "model_info": {
      "name": "<model>",
      "version": null,
      "arch": "aldea-asr"
    }
  }
}

{
  "type": "UtteranceEnd",
  "channel": [0],
  "last_word_end": 2.5
}
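
One way to consume the server messages above is to dispatch on the `type` field. The sketch below is illustrative only: `handle_message` and the shape of the dictionaries it returns are hypothetical, and it reads only fields that appear in the example payloads.

```python
import json


def handle_message(raw):
    """Parse one server message and return a small summary dict."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "Results":
        # Take the first alternative, as in the example payload.
        alt = msg["channel"]["alternatives"][0]
        return {
            "event": "transcript",
            "text": alt["transcript"],
            "confidence": alt["confidence"],
            "final": msg["is_final"],
        }
    if kind == "SpeechStarted":
        return {"event": "speech_started", "at": msg["timestamp"]}
    if kind == "UtteranceEnd":
        return {"event": "utterance_end", "at": msg["last_word_end"]}
    if kind == "Metadata":
        return {"event": "metadata", "request_id": msg["request_id"]}
    # Unknown types are reported rather than raising, so new events do not break clients.
    return {"event": "unknown", "type": kind}


sample = json.dumps({
    "type": "Results",
    "is_final": False,
    "speech_final": False,
    "channel": {"alternatives": [{
        "transcript": "Hello world",
        "confidence": 0.95,
        "words": [["Hello", 0, 320], ["world", 320, 640]],
    }]},
})
print(handle_message(sample))
```

Interim results (`is_final` false) can be shown as provisional text and replaced once a final result for the same span arrives.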

Supported Audio Formats

Encoded Formats (via /v1/listen):

MP3, AAC, FLAC, WAV, OGG, WebM, Opus, M4A
Auto-detected from format hints or binary headers

Raw PCM (via /v1/listen/pcm):

16-bit signed little-endian (s16le)
16 kHz sample rate by default (configurable via the sample_rate parameter)
Mono channel (auto-converted)
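
To make the raw PCM requirements concrete, the following sketch converts floating-point samples in the range [-1.0, 1.0] to s16le bytes and splits them into fixed-duration chunks. The function names and the 20 ms chunk duration are assumptions for illustration; this reference does not prescribe a chunk size.

```python
import math
import struct


def floats_to_s16le(samples):
    """Convert floats in [-1.0, 1.0] to 16-bit signed little-endian PCM bytes."""
    ints = [max(-32768, min(32767, int(round(s * 32767)))) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def chunk_pcm(pcm, sample_rate=16000, frame_ms=20):
    """Split mono s16le PCM into chunks of frame_ms milliseconds (2 bytes per sample)."""
    frame_bytes = sample_rate * 2 * frame_ms // 1000
    return [pcm[i:i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]


# Example: one second of a 440 Hz tone at 16 kHz, mono.
rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
pcm = floats_to_s16le(tone)
chunks = chunk_pcm(pcm, sample_rate=rate)
print(len(pcm), "bytes in", len(chunks), "chunks of", len(chunks[0]), "bytes")
```

Each chunk can then be sent as a single binary WebSocket message.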

Notes

Audio is treated as mono; channels param is not required.
Supported events: Metadata, Results (interim/final), SpeechStarted. UtteranceEnd requires word timestamps to be enabled.
Unsupported: diarization, multichannel separation, numerals/profanity/smart_format (accepted but no‑ops).
PII Redaction: Use ?redact=pii to automatically redact personally identifiable information. Combine multiple types: ?redact=pii&redact=pci. Control anonymization style with redact_mode (mask, redact, or replace).
Format Detection: The format is taken from the Content-Type header or the encoding/format/codec query parameters. If neither is provided, it is detected from the binary headers of the stream.
Confidence Scores: Results include calculated confidence values based on word-level confidences when available.
Word Timestamps: Words array format is [word, start_ms, end_ms] when timestamps are enabled. Timestamps are in milliseconds.
Encoded Audio: If your client produces MP3/Opus, you can send encoded chunks directly to /v1/listen, or decode on the client and use /v1/listen/pcm.
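
Because word timestamps are given in milliseconds while the top-level `start` and `duration` values in the example Results message appear to be in seconds, clients may want to normalize units. A minimal sketch, with `words_to_seconds` as a hypothetical helper name:

```python
def words_to_seconds(words):
    """Convert [word, start_ms, end_ms] entries to dicts with times in seconds."""
    return [
        {"word": word, "start": start_ms / 1000, "end": end_ms / 1000}
        for word, start_ms, end_ms in words
    ]


words = [["Hello", 0, 320], ["world", 320, 640]]
for entry in words_to_seconds(words):
    print(entry)
```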