API Reference › Speech to Text

Pre-Recorded Audio

Endpoint

POST https://api.aldea.ai/v1/listen

Accepts raw binary audio, or a JSON body with a url field pointing to hosted audio.

Returns JSON with transcript, confidence scores, and optional word timestamps.


Headers

Header        | Type   | Required | Description
------------- | ------ | -------- | -----------
Authorization | string | Yes      | Bearer token; required for all API requests.
timestamps    | string | No       | Set to "true" to include per-word timestamps in the response.

Query Parameters

Name            | Type    | Required | Default | Description
--------------- | ------- | -------- | ------- | -----------
callback        | string  | No       | -       | URL to send results to asynchronously. When provided, the request returns immediately with a request_id.
callback_method | string  | No       | POST    | HTTP method for the callback URL.
metadata        | string  | No       | -       | JSON string that will be passed through to the callback.
diarization     | boolean | No       | -       | Enable speaker diarization to identify different speakers.
language        | string  | No       | -       | Language identifier in BCP-47 format (e.g., "en-US"). Also accepts lang as an alias. Spanish can be indicated with "es" or "es-ES".
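As a sketch of how these query parameters combine, the helper below assembles a request URL from the documented parameters. The function name `build_listen_url` and the lowercase serialization of booleans are assumptions for illustration; only the parameter names come from this reference.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.aldea.ai/v1/listen"  # endpoint from this reference

def build_listen_url(callback=None, callback_method=None, metadata=None,
                     diarization=None, language=None):
    """Assemble a /v1/listen URL from the documented query parameters.

    Unset parameters are omitted; booleans are serialized as lowercase
    strings (an assumption, matching ?diarization=true in the Notes).
    """
    params = {
        "callback": callback,
        "callback_method": callback_method,
        "metadata": metadata,
        "diarization": diarization,
        "language": language,
    }
    query = {k: (str(v).lower() if isinstance(v, bool) else v)
             for k, v in params.items() if v is not None}
    return f"{BASE_URL}?{urlencode(query)}" if query else BASE_URL

# Example: async transcription of Spanish audio with diarization enabled
url = build_listen_url(callback="https://your-server.com/webhook",
                       diarization=True, language="es-ES")
```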

Request Body - URL Format

To transcribe audio from a cloud-hosted URL, send a JSON object in the request body with a url field.

URL Request Format

  • Content-Type: application/json
  • Request Body: JSON object with url field
  • URL Requirements: Must be a valid HTTP or HTTPS URL

Example JSON Body:

{"url": "https://example.com/audio.wav"}

Request Body

Send raw binary audio bytes or a JSON object with a url field.

Supported Formats

  • Binary audio: MP3, AAC, FLAC, WAV, OGG, WebM, Opus, M4A
  • Raw audio: 16-bit PCM (s16le)
  • URL via JSON: {"url": "https://..."} (Deepgram-compatible)
  • Max duration defaults to 10 minutes (600 seconds)
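Both request-body styles above can be sketched with the standard library. The helper names `binary_request` and `url_request` are hypothetical; the endpoint, headers, and body shapes follow this reference.

```python
import json
import urllib.request

API = "https://api.aldea.ai/v1/listen"

def binary_request(audio_bytes, token):
    """Build a POST request carrying raw audio bytes (MP3, WAV, FLAC, ...)."""
    return urllib.request.Request(
        API,
        data=audio_bytes,
        method="POST",
        headers={"Authorization": f"Bearer {token}",
                 "timestamps": "true"},  # optional: per-word timings
    )

def url_request(audio_url, token):
    """Build a POST request with the Deepgram-compatible {"url": ...} body."""
    body = json.dumps({"url": audio_url}).encode("utf-8")
    return urllib.request.Request(
        API,
        data=body,
        method="POST",
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )

# To send either request:
#   with urllib.request.urlopen(req) as resp:
#       result = json.load(resp)
```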

Response Format

A successful transcription response is a JSON object with two top-level fields:

  • metadata (object): request metadata such as request_id, created, duration, and channels
  • results (object): the transcription results, organized by channel and alternative

When a callback is provided, the immediate response is instead a JSON object containing the request_id.

Examples

API="https://api.aldea.ai/v1/listen"
TOKEN="$STT_API_TOKEN"
FILE=~/Downloads/aldea_sample.wav

curl -s -X POST "$API" \
  -H "Authorization: Bearer $TOKEN" \
  -H "timestamps: true" \
  --data-binary @"$FILE"

API="https://api.aldea.ai/v1/listen"
TOKEN="$STT_API_TOKEN"

curl -X POST "$API" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "timestamps: true" \
  -d '{"url": "https://raw.githubusercontent.com/aldea-ai/audio/main/aldea_sample.wav"}'

For more examples including async callbacks, URL handling, and advanced use cases, see the Documentation section.

Response

{
  "metadata": {
    "request_id": "77aaccd1-3b19-4000-9055-3f91009751b4",
    "created": "2025-01-01T00:00:00.000000Z",
    "duration": 6.916625,
    "channels": 1
  },
  "results": {
    "channels": [{
      "alternatives": [{
        "transcript": "Something, you know, it's just like I'm saying, it's a it's a next level. It's like there's I mean, when we give you a show now, you're gonna get",
        "confidence": 0.802,
        "words": [
          {
            "word": "Something,",
            "start": 0.04,
            "end": 0.36
          },
          {
            "word": "you",
            "start": 0.44,
            "end": 0.52
          }
          // ... more words
        ]
      }]
    }]
  }
}
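Given that response shape, the transcript and confidence can be pulled out with a short accessor. The helper name `best_transcript` and the abbreviated sample data are illustrative; the path results -> channels[0] -> alternatives[0] follows the example above.

```python
def best_transcript(response):
    """Return the first alternative's transcript and confidence.

    Assumes single-channel audio, as in the documented example.
    """
    alt = response["results"]["channels"][0]["alternatives"][0]
    return alt["transcript"], alt["confidence"]

# Abbreviated sample in the documented shape
response = {
    "metadata": {"request_id": "77aaccd1-3b19-4000-9055-3f91009751b4",
                 "channels": 1},
    "results": {"channels": [{"alternatives": [{
        "transcript": "Something, you know",
        "confidence": 0.802,
        "words": [{"word": "Something,", "start": 0.04, "end": 0.36},
                  {"word": "you", "start": 0.44, "end": 0.52}],
    }]}]},
}
text, confidence = best_transcript(response)
```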

Async Processing

For long audio files or when you need non-blocking processing, use the callback query parameter:

  • Include ?callback=https://your-server.com/webhook to process asynchronously
  • The API returns immediately with a request_id
  • Results are sent to your callback URL when transcription completes
  • Optional metadata parameter (JSON string) is passed through to your callback
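The async flow above can be sketched as a URL builder: the metadata query parameter is a URL-encoded JSON string that the API echoes back to your webhook. The helper name `async_listen_url` and the job_id payload are hypothetical.

```python
import json
from urllib.parse import urlencode

def async_listen_url(callback, metadata=None):
    """Build a /v1/listen URL that requests asynchronous processing.

    metadata, if given, is serialized to a JSON string and passed
    through unchanged to the callback.
    """
    params = {"callback": callback}
    if metadata is not None:
        params["metadata"] = json.dumps(metadata)
    return "https://api.aldea.ai/v1/listen?" + urlencode(params)

url = async_listen_url("https://your-server.com/webhook",
                       metadata={"job_id": "batch-42"})
# The immediate response contains only a request_id; results arrive at
# the callback URL when transcription completes.
```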

Error Responses

For detailed information about error response formats and status codes, see the Error Responses documentation.

Notes

  • Alternative Endpoints: The endpoints /v1/listen/media, /v1/listen/media/transcribe, and /v1/listen/media/transcribe_file are aliases that function identically to /v1/listen.
  • The Aldea Speech-to-Text API is compatible with the Deepgram API format, so the Deepgram SDK can be used against it.
  • Include the optional timestamps: true header to receive per-word timings in the response.
  • URL Support: Send a JSON object with a url field in the request body: {"url": "https://..."}.
  • Word Timestamps: When enabled, words are returned as objects with word, start (seconds), and end (seconds) fields.
  • Speaker Diarization: Enable with ?diarization=true query parameter to identify different speakers. Requires word timestamps.
  • Confidence Scores: Response includes overall confidence score calculated from word-level confidences.
  • PII Redaction: PII redaction via query parameters is not available for HTTP endpoints. For per-request redaction control, use WebSocket streaming endpoints with redact and redact_mode parameters.
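As an illustration of the word-timestamp objects described in the Notes, per-word timings can be grouped into simple timed caption chunks. The function `words_to_captions` and the 3-second grouping window are arbitrary choices, not part of the API; only the word/start/end fields come from this reference.

```python
def words_to_captions(words, window=3.0):
    """Group word objects ({word, start, end} in seconds) into chunks.

    Each returned tuple is (start, end, text), covering at most
    `window` seconds of audio.
    """
    captions, current, chunk_start = [], [], None
    for w in words:
        if chunk_start is None:
            chunk_start = w["start"]
        # Flush the current chunk once it would exceed the window
        if w["end"] - chunk_start > window and current:
            captions.append((chunk_start, current[-1]["end"],
                             " ".join(x["word"] for x in current)))
            current, chunk_start = [], w["start"]
        current.append(w)
    if current:
        captions.append((chunk_start, current[-1]["end"],
                         " ".join(x["word"] for x in current)))
    return captions

words = [{"word": "Something,", "start": 0.04, "end": 0.36},
         {"word": "you", "start": 0.44, "end": 0.52},
         {"word": "know,", "start": 0.56, "end": 0.80}]
captions = words_to_captions(words)
```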