Streaming Audio
Endpoints
Primary Endpoint: Supports both encoded audio and raw PCM
wss://api.aldea.ai/v1/listen
Note: Both URLs are aliases and function identically
PCM-Only Endpoint: Optimized for raw PCM
wss://api.aldea.ai/v1/listen/pcm
Method: GET (WebSocket upgrade)
Status: 101 Switching Protocols
Headers
| Header | Type | Required | Description |
|---|---|---|---|
| Authorization | string | Yes | Bearer token; required for all API requests. |
| Content-Type | string | No | Audio format hint (e.g., audio/mpeg, audio/wav). Used for format detection. |
Query Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| sample_rate | number | 16000 | Client audio sample rate in Hz; server resamples to 16 kHz for inference. |
| interim_results | bool | true | Emit interim updates while streaming. |
| endpointing | number or false | env | Controls sentence finalization latency (heuristic mapping). |
| encoding | string | auto-detect | Audio encoding format (pcm, mp3, aac, flac, wav, ogg, webm, opus, m4a). Also accepts format or codec as aliases. |
| redact | string | - | Enable PII redaction. Can be repeated multiple times. Values: pii (person names, emails, phone numbers, addresses, SSN), pci (credit cards, bank numbers), numbers (dates, phone numbers, SSN, credit cards), true (all PII types). Example: ?redact=pii&redact=pci |
| language | string | - | Language identifier in BCP-47 format (e.g., "en-US"). Also accepts lang as alias. Spanish can be indicated with "es" or "es-ES". |
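
As an illustration of how the headers and query parameters above fit together, here is a minimal Python sketch that assembles a connection URL and an Authorization header. The helper names `build_listen_url` and `auth_headers` are hypothetical, and two details are assumptions rather than documented behavior: that booleans are written as lowercase `true`/`false`, and that the header value takes the form `Bearer <token>`.

```python
from urllib.parse import urlencode

# Primary endpoint as listed on this page.
BASE_URL = "wss://api.aldea.ai/v1/listen"


def build_listen_url(base_url=BASE_URL, *, sample_rate=16000,
                     interim_results=True, encoding=None,
                     language=None, redact=()):
    """Return the WebSocket URL with query parameters appended (hypothetical helper)."""
    params = [
        ("sample_rate", str(sample_rate)),
        # Assumption: booleans are serialized as lowercase "true"/"false".
        ("interim_results", "true" if interim_results else "false"),
    ]
    if encoding:
        params.append(("encoding", encoding))
    if language:
        params.append(("language", language))
    # redact can be repeated, so emit one key=value pair per redaction type.
    params.extend(("redact", value) for value in redact)
    return f"{base_url}?{urlencode(params)}"


def auth_headers(token):
    """Return the Authorization header; the "Bearer <token>" form is an assumption."""
    return {"Authorization": f"Bearer {token}"}


url = build_listen_url(encoding="pcm", language="en-US", redact=["pii", "pci"])
print(url)
```

Passing the parameters as a list of pairs, rather than a dict, keeps repeated keys such as `redact` intact and preserves their order.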
Send
Audio Data (binary)
Binary audio data in one of the following formats:
- Raw PCM: 16-bit signed little-endian (s16le) frames
- Encoded formats: MP3, AAC, FLAC, WAV, OGG, WebM, Opus, M4A
OR
Finalize (object)
OR
CloseStream (object)
OR
KeepAlive (object)
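
The client side of the exchange can be sketched as a sequence of frames: binary audio chunks followed by control objects. The Python sketch below only prepares those frames and does not open a socket. The helper names are hypothetical, the 100 ms chunk size is an arbitrary choice, and sending control objects as JSON text frames, with Finalize before CloseStream, is an assumption about typical usage rather than documented behavior.

```python
import json

BYTES_PER_SAMPLE = 2  # s16le: one 16-bit sample per mono frame
CONTROL_TYPES = {"Finalize", "CloseStream", "KeepAlive"}


def pcm_chunks(pcm, sample_rate=16000, chunk_ms=100):
    """Yield successive slices of mono s16le PCM, each covering chunk_ms of audio."""
    chunk_bytes = sample_rate * chunk_ms // 1000 * BYTES_PER_SAMPLE
    if chunk_bytes <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for offset in range(0, len(pcm), chunk_bytes):
        yield pcm[offset:offset + chunk_bytes]


def control_message(kind):
    """Serialize one of the control objects listed above as a JSON string."""
    if kind not in CONTROL_TYPES:
        raise ValueError(f"unknown control message: {kind}")
    return json.dumps({"type": kind})


def outgoing_frames(pcm, sample_rate=16000, chunk_ms=100):
    """Yield frames in send order: audio chunks (bytes), then Finalize and CloseStream (str)."""
    yield from pcm_chunks(pcm, sample_rate, chunk_ms)
    # Assumption: Finalize is sent before CloseStream at the end of the audio.
    yield control_message("Finalize")
    yield control_message("CloseStream")


# One second of 16 kHz silence: 16000 samples * 2 bytes = 32000 bytes.
frames = list(outgoing_frames(bytes(32000)))
print(len(frames), frames[-1])
```

A WebSocket client would send each `bytes` item as a binary frame and each `str` item as a text frame.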
Receive
Metadata (object)
OR
SpeechStarted (object)
OR
Results (object)
OR
UtteranceEnd (object)
Messages
Client → Server
Binary audio data in one of the following formats:
- Raw PCM: 16-bit signed little-endian (s16le) frames
  - Supported by both /v1/listen and /v1/listen/pcm
- Encoded formats: MP3, AAC, FLAC, WAV, OGG, WebM, Opus, M4A
  - Supported by /v1/listen only
  - Auto-detected from format hints or binary headers

JSON control messages:

    { "type": "Finalize" }
    { "type": "CloseStream" }
    { "type": "KeepAlive" }

Server → Client
    {
      "type": "Metadata",
      "request_id": "...",
      "created": "2025-01-01T00:00:00.000000Z",
      "duration": 0.0,
      "channels": 1,
      "model_info": { "name": "<model_id>", "version": "", "arch": "aldea-asr" }
    }

    {
      "type": "SpeechStarted",
      "channel": [0],
      "timestamp": 0.0
    }

    {
      "type": "Results",
      "channel_index": [0],
      "duration": 1.98,
      "start": 0.00,
      "is_final": false,
      "speech_final": false,
      "channel": {
        "alternatives": [{
          "transcript": "Hello world",
          "confidence": 0.95,
          "words": [ ["Hello", 0, 320], ["world", 320, 640] ]
        }]
      },
      "metadata": {
        "request_id": "...",
        "model_info": { "name": "<model>", "version": null, "arch": "aldea-asr" }
      }
    }

    {
      "type": "UtteranceEnd",
      "channel": [0],
      "last_word_end": 2.5
    }

Supported Audio Formats
Encoded Formats (via /v1/listen):
- MP3, AAC, FLAC, WAV, OGG, WebM, Opus, M4A
- Auto-detected from format hints or binary headers
Raw PCM (via /v1/listen/pcm; also accepted by /v1/listen):
- 16-bit signed little-endian (s16le)
- 16 kHz sample rate by default (configurable via the sample_rate parameter)
- Mono channel (auto-converted)
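
Audio libraries often produce floating-point samples, so a conversion step is usually needed before sending raw PCM. The sketch below shows one way to pack floats in the range [-1.0, 1.0] into s16le bytes with the standard `struct` module. The function names and the 32767 scaling factor are illustrative choices, not something this API prescribes.

```python
import struct


def floats_to_s16le(samples):
    """Pack float samples in [-1.0, 1.0] as 16-bit signed little-endian bytes."""
    # Scale to the int16 range and clip anything that falls outside it.
    ints = [max(-32768, min(32767, round(s * 32767))) for s in samples]
    # "<" selects little-endian byte order, "h" a signed 16-bit integer.
    return struct.pack(f"<{len(ints)}h", *ints)


def duration_seconds(pcm, sample_rate=16000):
    """Return the duration of mono s16le audio (2 bytes per sample)."""
    return len(pcm) / (2 * sample_rate)


pcm = floats_to_s16le([0.0, 1.0, -1.0])
print(pcm.hex())
```

If the samples were captured at a rate other than 16000 Hz, the same bytes can be sent as long as the sample_rate query parameter matches the capture rate.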
Notes
- Audio is treated as mono; channels param is not required.
- Supported events: Metadata, Results (interim/final), SpeechStarted. UtteranceEnd requires word timestamps to be enabled.
- Unsupported: diarization, multichannel separation, numerals/profanity/smart_format (accepted but no‑ops).
- PII Redaction: Use ?redact=pii to automatically redact personally identifiable information. Combine multiple types: ?redact=pii&redact=pci. Control anonymization style with redact_mode (mask, redact, or replace).
- Format Detection: Format is auto-detected from the Content-Type header or query parameters (encoding/format/codec). If neither is specified, the format is detected from the binary headers of the audio stream.
- Confidence Scores: Results include calculated confidence values based on word-level confidences when available.
- Word Timestamps: Words array format is [word, start_ms, end_ms] when timestamps are enabled. Timestamps are in milliseconds.
- Encoded Audio: If your client produces MP3/Opus, you can send encoded chunks directly to /v1/listen, or decode on the client and use /v1/listen/pcm.
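
To make the Results shape and the [word, start_ms, end_ms] word format concrete, here is a small Python sketch that parses a server message into a transcript and typed word entries. The `Word` dataclass and `parse_results` helper are hypothetical names, and the sample payload is a trimmed copy of the Results example from the Messages section.

```python
import json
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start_ms: int
    end_ms: int


def parse_results(raw):
    """Return (transcript, is_final, words) for a Results message, else None."""
    msg = json.loads(raw)
    if msg.get("type") != "Results":
        return None  # Metadata, SpeechStarted, UtteranceEnd are handled elsewhere
    alternatives = msg["channel"]["alternatives"]
    if not alternatives:
        return "", bool(msg.get("is_final", False)), []
    best = alternatives[0]
    # The words array may be absent when timestamps are not enabled.
    words = [Word(w[0], w[1], w[2]) for w in best.get("words", [])]
    return best["transcript"], bool(msg.get("is_final", False)), words


SAMPLE = """
{ "type": "Results", "channel_index": [0], "duration": 1.98, "start": 0.00,
  "is_final": false, "speech_final": false,
  "channel": { "alternatives": [{ "transcript": "Hello world", "confidence": 0.95,
               "words": [ ["Hello", 0, 320], ["world", 320, 640] ] }] } }
"""

transcript, is_final, words = parse_results(SAMPLE)
print(transcript, is_final, [(w.text, w.end_ms - w.start_ms) for w in words])
```

Word timestamps are in milliseconds while the message-level duration and start values in the example are fractional seconds, so a client that mixes the two needs to convert units.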