API Reference › Speech to Text

Pre-Recorded Audio

Endpoint

POST https://api.aldea.ai/v1/listen

Accepts raw binary audio, or a JSON body with a url field pointing to hosted audio.

Returns JSON with transcript, confidence scores, and optional word timestamps.


Headers

Header        | Type   | Required | Description
------------- | ------ | -------- | -----------
Authorization | string | Yes      | Bearer token; required for all API requests.
timestamps    | string | No       | Set to "true" to include per-word timestamps in the response.

Query Parameters

Name            | Type    | Required | Default | Description
--------------- | ------- | -------- | ------- | -----------
callback        | string  | No       | -       | URL to send results to asynchronously. When provided, the request returns immediately with a request_id.
callback_method | string  | No       | POST    | HTTP method for the callback URL.
metadata        | string  | No       | -       | JSON string that will be passed through to the callback.
diarization     | boolean | No       | -       | Enable speaker diarization to identify different speakers.
language        | string  | No       | -       | Language identifier in BCP-47 format (e.g., "en-US"). Also accepts lang as an alias. Spanish can be indicated with "es" or "es-ES".
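As a sketch of how these query parameters combine, the helper below assembles a request URL from the documented parameters. The function name `build_listen_url` and the lowercase serialization of booleans are assumptions for illustration; only the parameter names come from this reference.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.aldea.ai/v1/listen"  # endpoint from this reference

def build_listen_url(callback=None, callback_method=None, metadata=None,
                     diarization=None, language=None):
    """Assemble a /v1/listen URL from the documented query parameters.

    Unset parameters are omitted; booleans are serialized as lowercase
    strings (an assumption, matching ?diarization=true in the Notes).
    """
    params = {
        "callback": callback,
        "callback_method": callback_method,
        "metadata": metadata,
        "diarization": diarization,
        "language": language,
    }
    query = {k: (str(v).lower() if isinstance(v, bool) else v)
             for k, v in params.items() if v is not None}
    return f"{BASE_URL}?{urlencode(query)}" if query else BASE_URL

# Example: async transcription of Spanish audio with diarization enabled
url = build_listen_url(callback="https://your-server.com/webhook",
                       diarization=True, language="es-ES")
```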

Request Body - URL Format

To transcribe audio from a cloud-hosted URL, send a JSON object in the request body with a url field.

URL Request Format

  • Content-Type: application/json
  • Request Body: JSON object with url field
  • URL Requirements: Must be a valid HTTP or HTTPS URL

Example JSON Body:

{"url": "https://example.com/audio.wav"}

Request Body

Send raw binary audio bytes or a JSON object with a url field.

Supported Formats

  • Binary audio: MP3, AAC, FLAC, WAV, OGG, WebM, Opus, M4A
  • Raw audio: 16-bit PCM (s16le)
  • URL via JSON: {"url": "https://..."} (Deepgram-compatible)
  • Max duration defaults to 10 minutes (600 seconds)
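Both request-body styles above can be sketched with the standard library. The helper names `binary_request` and `url_request` are hypothetical; the endpoint, headers, and body shapes follow this reference.

```python
import json
import urllib.request

API = "https://api.aldea.ai/v1/listen"

def binary_request(audio_bytes, token):
    """Build a POST request carrying raw audio bytes (MP3, WAV, FLAC, ...)."""
    return urllib.request.Request(
        API,
        data=audio_bytes,
        method="POST",
        headers={"Authorization": f"Bearer {token}",
                 "timestamps": "true"},  # optional: per-word timings
    )

def url_request(audio_url, token):
    """Build a POST request with the Deepgram-compatible {"url": ...} body."""
    body = json.dumps({"url": audio_url}).encode("utf-8")
    return urllib.request.Request(
        API,
        data=body,
        method="POST",
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )

# To send either request:
#   with urllib.request.urlopen(req) as resp:
#       result = json.load(resp)
```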

Response Format

A successful transcription response is a JSON object with two top-level fields:

  • metadata (object): request metadata such as request_id, created, duration, and channels
  • results (object): the transcription results, organized by channel and alternative

When a callback is provided, the immediate response is instead a JSON object containing the request_id.

Examples

API="https://api.aldea.ai/v1/listen"
TOKEN="$STT_API_TOKEN"
FILE=~/Downloads/aldea_sample.wav

curl -s -X POST "$API" \
  -H "Authorization: Bearer $TOKEN" \
  -H "timestamps: true" \
  --data-binary @"$FILE"

API="https://api.aldea.ai/v1/listen"
TOKEN="$STT_API_TOKEN"

curl -X POST "$API" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "timestamps: true" \
  -d '{"url": "https://raw.githubusercontent.com/aldea-ai/audio/main/aldea_sample.wav"}'

For more examples including async callbacks, URL handling, and advanced use cases, see the Documentation section.

Response

{
  "metadata": {
    "request_id": "77aaccd1-3b19-4000-9055-3f91009751b4",
    "created": "2025-01-01T00:00:00.000000Z",
    "duration": 6.916625,
    "channels": 1
  },
  "results": {
    "channels": [{
      "alternatives": [{
        "transcript": "Something, you know, it's just like I'm saying, it's a it's a next level. It's like there's I mean, when we give you a show now, you're gonna get",
        "confidence": 0.802,
        "words": [
          {
            "word": "Something,",
            "start": 0.04,
            "end": 0.36
          },
          {
            "word": "you",
            "start": 0.44,
            "end": 0.52
          }
          // ... more words
        ]
      }]
    }]
  }
}
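Given that response shape, the transcript and confidence can be pulled out with a short accessor. The helper name `best_transcript` and the abbreviated sample data are illustrative; the path results -> channels[0] -> alternatives[0] follows the example above.

```python
def best_transcript(response):
    """Return the first alternative's transcript and confidence.

    Assumes single-channel audio, as in the documented example.
    """
    alt = response["results"]["channels"][0]["alternatives"][0]
    return alt["transcript"], alt["confidence"]

# Abbreviated sample in the documented shape
response = {
    "metadata": {"request_id": "77aaccd1-3b19-4000-9055-3f91009751b4",
                 "channels": 1},
    "results": {"channels": [{"alternatives": [{
        "transcript": "Something, you know",
        "confidence": 0.802,
        "words": [{"word": "Something,", "start": 0.04, "end": 0.36},
                  {"word": "you", "start": 0.44, "end": 0.52}],
    }]}]},
}
text, confidence = best_transcript(response)
```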

Async Processing

For long audio files or when you need non-blocking processing, use the callback query parameter:

  • Include ?callback=https://your-server.com/webhook to process asynchronously
  • The API returns immediately with a request_id
  • Results are sent to your callback URL when transcription completes
  • Optional metadata parameter (JSON string) is passed through to your callback
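The async flow above can be sketched as a URL builder: the metadata query parameter is a URL-encoded JSON string that the API echoes back to your webhook. The helper name `async_listen_url` and the job_id payload are hypothetical.

```python
import json
from urllib.parse import urlencode

def async_listen_url(callback, metadata=None):
    """Build a /v1/listen URL that requests asynchronous processing.

    metadata, if given, is serialized to a JSON string and passed
    through unchanged to the callback.
    """
    params = {"callback": callback}
    if metadata is not None:
        params["metadata"] = json.dumps(metadata)
    return "https://api.aldea.ai/v1/listen?" + urlencode(params)

url = async_listen_url("https://your-server.com/webhook",
                       metadata={"job_id": "batch-42"})
# The immediate response contains only a request_id; results arrive at
# the callback URL when transcription completes.
```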

Error Responses

For detailed information about error response formats and status codes, see the Error Responses documentation.

Notes

  • Alternative Endpoints: The endpoints /v1/listen/media, /v1/listen/media/transcribe, and /v1/listen/media/transcribe_file are aliases that function identically to /v1/listen.
  • The Aldea Speech-to-Text API is compatible with the Deepgram API format, so the Deepgram SDK can be used against it.
  • Include the optional timestamps: true header to receive per-word timings in the response.
  • URL Support: Send a JSON object with a url field in the request body: {"url": "https://..."}.
  • Word Timestamps: When enabled, words are returned as objects with word, start (seconds), and end (seconds) fields.
  • Speaker Diarization: Enable with ?diarization=true query parameter to identify different speakers. Requires word timestamps.
  • Confidence Scores: Response includes overall confidence score calculated from word-level confidences.
  • PII Redaction: PII redaction via query parameters is not available for HTTP endpoints. For per-request redaction control, use WebSocket streaming endpoints with redact and redact_mode parameters.
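As an illustration of the word-timestamp objects described in the Notes, per-word timings can be grouped into simple timed caption chunks. The function `words_to_captions` and the 3-second grouping window are arbitrary choices, not part of the API; only the word/start/end fields come from this reference.

```python
def words_to_captions(words, window=3.0):
    """Group word objects ({word, start, end} in seconds) into chunks.

    Each returned tuple is (start, end, text), covering at most
    `window` seconds of audio.
    """
    captions, current, chunk_start = [], [], None
    for w in words:
        if chunk_start is None:
            chunk_start = w["start"]
        # Flush the current chunk once it would exceed the window
        if w["end"] - chunk_start > window and current:
            captions.append((chunk_start, current[-1]["end"],
                             " ".join(x["word"] for x in current)))
            current, chunk_start = [], w["start"]
        current.append(w)
    if current:
        captions.append((chunk_start, current[-1]["end"],
                         " ".join(x["word"] for x in current)))
    return captions

words = [{"word": "Something,", "start": 0.04, "end": 0.36},
         {"word": "you", "start": 0.44, "end": 0.52},
         {"word": "know,", "start": 0.56, "end": 0.80}]
captions = words_to_captions(words)
```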