Pre-Recorded Audio
Endpoint
Accepts binary audio or JSON with a url field.
- Endpoint: https://api.aldea.ai/v1/listen
- Method: POST
Returns JSON with transcript, confidence scores, and optional word timestamps.
Headers
| Header | Type | Required | Description |
|---|---|---|---|
| Authorization | string | Yes | Bearer token; required for all API requests. |
| timestamps | string | No | Set to "true" to include per‑word timestamps in response. |
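As a minimal sketch, the headers in the table above could be assembled in Python as follows. The helper name `build_headers` is hypothetical; the token is read from the `STT_API_TOKEN` environment variable, the same variable the curl examples on this page use.

```python
import os


def build_headers(token: str, timestamps: bool = False) -> dict:
    """Return the HTTP headers for a request to /v1/listen."""
    headers = {"Authorization": f"Bearer {token}"}
    if timestamps:
        # The table specifies the string "true", not a boolean value.
        headers["timestamps"] = "true"
    return headers


# STT_API_TOKEN is the environment variable used by the curl examples.
headers = build_headers(os.environ.get("STT_API_TOKEN", ""), timestamps=True)
```

Leaving `timestamps` at its default omits the header entirely, which matches its optional status in the table.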
Query Parameters
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| callback | string | No | - | URL to send results asynchronously. When provided, request returns immediately with request_id. |
| callback_method | string | No | POST | HTTP method for callback URL. |
| metadata | string | No | - | JSON string that will be passed through to callback. |
| diarization | boolean | No | - | Enable speaker diarization to identify different speakers. |
| language | string | No | - | Language identifier in BCP-47 format (e.g., "en-US"). Also accepts lang as alias. Spanish can be indicated with "es" or "es-ES". |
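The query parameters above are appended to the endpoint URL. The following sketch shows one way to build such a URL with Python's standard library; `build_listen_url` is a hypothetical helper, and it covers only the `diarization` and `language` parameters.

```python
from typing import Optional
from urllib.parse import urlencode

API_URL = "https://api.aldea.ai/v1/listen"


def build_listen_url(diarization: bool = False, language: Optional[str] = None) -> str:
    """Return the endpoint URL with the requested query parameters."""
    params = {}
    if diarization:
        # The documentation shows the literal form ?diarization=true.
        params["diarization"] = "true"
    if language:
        # BCP-47 identifier such as "en-US" or "es".
        params["language"] = language
    # urlencode percent-encodes values and joins the pairs with "&".
    return f"{API_URL}?{urlencode(params)}" if params else API_URL


print(build_listen_url(diarization=True, language="en-US"))
```

Parameters left at their defaults are not sent, so the server-side defaults from the table apply.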
Request Body - URL Format
To transcribe audio from a cloud-hosted URL, send a JSON object in the request body with a url field.
URL Request Format
- Content-Type: application/json
- Request Body: JSON object with url field
- URL Requirements: Must be a valid HTTP or HTTPS URL
Example JSON Body:

    {"url": "https://example.com/audio.wav"}

Request Body
Send raw binary audio bytes or a JSON object with a url field.
Supported Formats
- Binary audio: MP3, AAC, FLAC, WAV, OGG, WebM, Opus, M4A
- Raw audio: 16-bit PCM (s16le)
- URL via JSON: {"url": "https://..."} (Deepgram-compatible)
- Max duration defaults to 10 minutes (600 seconds)
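The two body styles described above (raw audio bytes, or a JSON object with a url field) can be sketched with Python's standard library as shown here. The function names `build_request` and `transcribe` are hypothetical. The documentation does not state a Content-Type for binary uploads, so this sketch sets one only for the JSON form.

```python
import json
import urllib.request
from typing import Optional

API_URL = "https://api.aldea.ai/v1/listen"


def build_request(token: str, audio: Optional[bytes] = None,
                  url: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a POST request with exactly one audio source."""
    if (audio is None) == (url is None):
        raise ValueError("provide exactly one of audio or url")
    headers = {"Authorization": f"Bearer {token}"}
    if url is not None:
        # URL form: JSON body with a single "url" field.
        body = json.dumps({"url": url}).encode("utf-8")
        headers["Content-Type"] = "application/json"
    else:
        # Binary form: the raw audio bytes are the request body.
        body = audio
    return urllib.request.Request(API_URL, data=body, headers=headers, method="POST")


def transcribe(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response (performs network I/O)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For a local file, read the bytes first, for example `build_request(token, audio=open(path, "rb").read())`, then pass the result to `transcribe`.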
Response Format
Transcription results response format:
- metadata: object
- results: object
OR
Response format when using a callback:
- object (returned immediately; includes the request_id)
Examples
Request

    # Transcribe a local audio file (here, the downloaded sample audio)
    API="https://api.aldea.ai/v1/listen"
    TOKEN="$STT_API_TOKEN"
    FILE=~/Downloads/aldea_sample.wav
    curl -s -X POST "$API" \
      -H "Authorization: Bearer $TOKEN" \
      -H "timestamps: true" \
      --data-binary @"$FILE"

    # Transcribe audio from a URL
    API="https://api.aldea.ai/v1/listen"
    TOKEN="$STT_API_TOKEN"
    curl -X POST "$API" \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -H "timestamps: true" \
      -d '{"url": "https://raw.githubusercontent.com/aldea-ai/audio/main/aldea_sample.wav"}'

For more examples including async callbacks, URL handling, and advanced use cases, see the Documentation section.
Response

    {
      "metadata": {
        "request_id": "77aaccd1-3b19-4000-9055-3f91009751b4",
        "created": "2025-01-01T00:00:00.000000Z",
        "duration": 6.916625,
        "channels": 1
      },
      "results": {
        "channels": [{
          "alternatives": [{
            "transcript": "Something, you know, it's just like I'm saying, it's a it's a next level. It's like there's I mean, when we give you a show now, you're gonna get",
            "confidence": 0.802,
            "words": [
              { "word": "Something,", "start": 0.04, "end": 0.36 },
              { "word": "you", "start": 0.44, "end": 0.52 }
              // ... more words
            ]
          }]
        }]
      }
    }

Async Processing
For long audio files or when you need non-blocking processing, use the callback query parameter:
- Include ?callback=https://your-server.com/webhook to process asynchronously
- The API returns immediately with a request_id
- Results are sent to your callback URL when transcription completes
- Optional metadata parameter (JSON string) is passed through to your callback
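The async flow above can be sketched as follows, using only the Python standard library. Both helper names are hypothetical. The exact position of request_id in the immediate response is not specified here, so `extract_request_id` assumes it appears either at the top level or under metadata (as in the synchronous sample response).

```python
import json
from typing import Optional
from urllib.parse import urlencode

API_URL = "https://api.aldea.ai/v1/listen"


def build_async_url(callback_url: str, metadata: Optional[dict] = None) -> str:
    """Return the endpoint URL with callback (and optional metadata) set."""
    params = {"callback": callback_url}
    if metadata is not None:
        # metadata must be sent as a JSON string; urlencode escapes it.
        params["metadata"] = json.dumps(metadata)
    return f"{API_URL}?{urlencode(params)}"


def extract_request_id(ack: dict) -> Optional[str]:
    """Pull request_id out of the immediate response.

    Assumption: the field is top-level or nested under "metadata".
    """
    return ack.get("request_id") or ack.get("metadata", {}).get("request_id")


print(build_async_url("https://your-server.com/webhook", {"job": 42}))
```

Storing the returned request_id alongside your own job record is one way to match the later callback to the original request; passing an identifier through `metadata` is another.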
Error Responses
For detailed information about error response formats and status codes, see the Error Responses documentation.
Notes
- Alternative Endpoints: The endpoints /v1/listen/media, /v1/listen/media/transcribe, and /v1/listen/media/transcribe_file are aliases that function identically to /v1/listen.
- The Aldea Speech-to-Text API conforms to the Deepgram SDK.
- Include the optional timestamps: true header to receive per‑word timings in the response.
- URL Support: Send a JSON object with a url field in the request body: {"url": "https://..."}.
- Word Timestamps: When enabled, words are returned as objects with word, start (seconds), and end (seconds) fields.
- Speaker Diarization: Enable with ?diarization=true query parameter to identify different speakers. Requires word timestamps.
- Confidence Scores: Response includes overall confidence score calculated from word-level confidences.
- PII Redaction: PII redaction via query parameters is not available for HTTP endpoints. For per-request redaction control, use WebSocket streaming endpoints with redact and redact_mode parameters.
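To illustrate the word timestamp and confidence notes above, the sketch below reads the first alternative from a response shaped like the sample in the Examples section. `SAMPLE` is an abridged copy of that sample (the transcript is shortened and only two words are kept), and the helper names are hypothetical. Treating a missing words field as an empty list is an assumption about responses requested without timestamps.

```python
SAMPLE = {
    "metadata": {"request_id": "77aaccd1-3b19-4000-9055-3f91009751b4",
                 "duration": 6.916625, "channels": 1},
    "results": {"channels": [{"alternatives": [{
        "transcript": "Something, you",  # abridged for this sketch
        "confidence": 0.802,
        "words": [
            {"word": "Something,", "start": 0.04, "end": 0.36},
            {"word": "you", "start": 0.44, "end": 0.52},
        ],
    }]}]},
}


def first_alternative(response: dict) -> dict:
    """Return the first alternative of the first channel."""
    return response["results"]["channels"][0]["alternatives"][0]


def word_timings(response: dict) -> list:
    """Return (word, start_seconds, end_seconds) tuples, or [] if absent."""
    words = first_alternative(response).get("words", [])
    return [(w["word"], w["start"], w["end"]) for w in words]


alt = first_alternative(SAMPLE)
print(alt["transcript"], alt["confidence"])
for word, start, end in word_timings(SAMPLE):
    print(f"{start:.2f}-{end:.2f}s {word}")
```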