Speech-to-Behavioral-Text

Medera Speech-to-Behavioral-Text is the first speech engine purpose-built for behavioral and mental health. It is exposed through three endpoints backed by one calibrated behavioral-health-tuned (BH-tuned) engine: the same vocabulary, DSM-5-TR alignment, C-SSRS calibration, and instrument-score extraction across every endpoint.

General medical STT was trained on radiology dictations and OR cross-talk. Medera Co-Therapy was trained on the conversation that happens inside a 50-minute behavioral-health session — outpatient psychiatry, CBT, DBT, and psychiatric emergency assessments.

Three endpoints. One BH-tuned engine.

What makes it BH-tuned

Property	Generic medical STT	Medera Co-Therapy
Training audio	Radiology dictation, OR cross-talk	Outpatient psychiatry, CBT, DBT, psychiatric emergency
Vocabulary bias	Anatomy, procedure names	DSM-5-TR, psychopharmacology, BH program names
Calibration	Procedural accuracy	C-SSRS safety, instrument scoring, validated screeners
Speaker model	Single-speaker dictation	2-speaker diarization (clinician + patient)
Crisis handling	None	C-SSRS-aligned, non-bypassable human handoff
Median WER	—	5.2 % on outpatient psych / CBT / DBT (vs. 18–25 % typical for psychotherapy ASR)
Keyword accuracy	—	~91 % on the practice’s custom vocabulary

Endpoints

/transcribe

WSS · real-time stateless dictation. Between-visit charting, refills, paging colleagues. Built for the 5-minute window between encounters.

/sessions

WSS · real-time stateful transcription. Live therapy session capture, 2-speaker diarization, in-line crisis markers, structured instrument scoring.

/transcripts

REST · async batch. Process recorded encounters after the fact. Same engine, same calibration, same structured output.

Pick the endpoint by workflow, not by accuracy. The engine, vocabulary, calibration, and structured emission are identical across all three.

Endpoint capability matrix

	/transcribe	/sessions	/transcripts
Connection	WSS	WSS	REST
Processing	Real-time	Real-time	Async
Architecture	Stateless	Stateful	Stateful
Speech model	BH-tuned dictation	BH-tuned conversational	Either
2-speaker diarization	—	✓	✓
Multichannel input	—	✓	—
Spoken commands	✓	—	—
Automatic punctuation	✓	✓	✓
Spoken punctuation	✓	—	—
Interim results	✓	✓	—
Custom vocabulary (at request time)	✓	✓	✓
Instrument-score extraction	—	✓	✓
C-SSRS in-line crisis marker	—	✓	✓
DSM-5-TR / ICD-10 emission	—	✓	✓
Output	Text + commands	Structured transcript + scores + risk	Structured transcript + scores
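As a minimal sketch, the matrix above can be encoded as data so that an integration picks an endpoint from the features a workflow needs. The capability labels (`diarization`, `interim_results`, and so on) are illustrative names chosen for this example, not API field names.

```python
# Capability matrix transcribed from the table above.
# The label strings are hypothetical; they are not Medera API identifiers.
CAPABILITIES = {
    "/transcribe": {
        "spoken_commands", "automatic_punctuation", "spoken_punctuation",
        "interim_results", "custom_vocabulary",
    },
    "/sessions": {
        "diarization", "multichannel_input", "automatic_punctuation",
        "interim_results", "custom_vocabulary", "instrument_scores",
        "crisis_marker", "diagnostic_codes",
    },
    "/transcripts": {
        "diarization", "automatic_punctuation", "custom_vocabulary",
        "instrument_scores", "crisis_marker", "diagnostic_codes",
    },
}


def endpoints_supporting(required):
    """Return the endpoints whose capability set covers every required feature."""
    required = set(required)
    return [endpoint for endpoint, caps in CAPABILITIES.items() if required <= caps]


# Live capture with speaker labels and interim results narrows to one endpoint.
print(endpoints_supporting({"diarization", "interim_results"}))
```

A request for a combination no endpoint offers, such as spoken commands together with diarization, returns an empty list.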

Languages

English variants are generally available (GA) for the US, UK, Canada, and Australia. Arabic (ar-SA) ships for MENA deployments. Spanish, French, and German behavioral-health vocabularies are in private beta. The vocal-acoustic engine (F0, jitter, shimmer, prosodic flatness, and depression / anxiety / distress indices) is language-agnostic and ships unchanged across every locale.
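For readers unfamiliar with the acoustic terms, the sketch below computes local jitter and local shimmer using the common textbook definition: the mean absolute difference between consecutive cycles divided by the mean value. It illustrates why such measures do not depend on the spoken language. It is not Medera's implementation, and the function names and input lists are assumptions made for this example.

```python
def _local_perturbation(values):
    """Mean absolute difference of consecutive values, relative to their mean."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(a - b) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


def local_jitter(periods_s):
    """Cycle-to-cycle variation of glottal period lengths (in seconds)."""
    return _local_perturbation(periods_s)


def local_shimmer(peak_amplitudes):
    """Cycle-to-cycle variation of per-cycle peak amplitudes."""
    return _local_perturbation(peak_amplitudes)


# Hypothetical pitch periods around 100 Hz; real input would come from a
# pitch tracker, which is outside the scope of this sketch.
periods = [0.010, 0.012, 0.010]
print(round(local_jitter(periods), 4))
```

Both measures are computed from the waveform alone, with no reference to words or phonemes.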

Behavioral-health-specific calibration

Channel	What it does
DSM-5-TR vocabulary bias	Recognition biased toward DSM-5-TR diagnostic language, common behavioral-health symptoms, and psychopharmacology
Validated screener extraction	PHQ-9, GAD-7, AUDIT-C, C-SSRS responses captured as structured items with item-level scores
C-SSRS safety check	Crisis language flagged in-line; non-bypassable human handoff trigger
Custom vocabulary at request	Practice formulary, IOP / PHP program names, clinician roster — passed per request, no retraining
2-speaker diarization	Clinician vs. patient consistently labelled

Custom vocabulary lifts keyword recall by 10–15 percentage points while holding overall WER flat, comparable to what Deepgram Nova-3 Medical and AssemblyAI publish for keyterm prompting.
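To make "structured items with item-level scores" concrete, here is a minimal sketch of how a client might total PHQ-9 item scores from a screener-extraction result. The input shape (a list of nine integers) and the returned field names are assumptions for illustration, not the documented output schema. The severity bands are the standard PHQ-9 cut-offs.

```python
# Standard PHQ-9 severity bands: (upper bound of total score, label).
PHQ9_BANDS = [
    (4, "minimal"),
    (9, "mild"),
    (14, "moderate"),
    (19, "moderately severe"),
    (27, "severe"),
]


def score_phq9(item_scores):
    """Total nine PHQ-9 items (each scored 0-3) and map the total to a severity band."""
    if len(item_scores) != 9:
        raise ValueError("PHQ-9 has exactly nine items")
    if any(score not in (0, 1, 2, 3) for score in item_scores):
        raise ValueError("each PHQ-9 item is scored 0, 1, 2, or 3")
    total = sum(item_scores)
    severity = next(label for upper, label in PHQ9_BANDS if total <= upper)
    return {
        "total": total,
        "severity": severity,
        # Item 9 asks about thoughts of self-harm; any non-zero answer
        # warrants clinical follow-up regardless of the total.
        "item_9_positive": item_scores[8] > 0,
    }


# Hypothetical item-level scores as they might be captured from a session.
print(score_phq9([1, 2, 1, 2, 1, 1, 1, 1, 0]))
```

Keeping the item-level scores, rather than only the total, is what lets downstream logic treat item 9 separately from the overall severity band.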

Accuracy (internal eval set)

Metric	Medera Co-Therapy	Typical psychotherapy ASR
Median WER	5.2 %	18–25 % (Miner et al., npj Digital Medicine, 2020)
Keyword accuracy (custom vocab)	~91 %	—
Eval audio	Outpatient psychiatry · CBT · DBT · psychiatric emergency	Mixed conversational

We don’t benchmark on radiology dictation; we benchmark on the audio the engine actually has to work on.
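For reference, word error rate (WER) is the word-level edit distance between a reference transcript and the engine's hypothesis, divided by the number of reference words. The sketch below uses the standard definition with simple lowercase whitespace tokenization. The text normalization used for the internal evaluation is not described here, so treat this as an illustration of the metric rather than the exact scoring pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words (row-by-row Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of five reference words.
print(word_error_rate("the patient denies suicidal ideation",
                      "the patient denies ideation"))
```

The example also shows why a low aggregate WER is not sufficient on its own in this domain: a single dropped word can change the clinical meaning, which is why keyword accuracy is reported separately.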

Regions

Region	Storage	Compute	Sub-processors
US (default)	US region	US	US-only
EU	EU-West (Ireland), EU-Central (Frankfurt)	EU	EU-only
MENA (Arabic ar-SA)	Customer VPC option	Customer VPC	Per customer agreement
Customer VPC	Customer-managed AWS / GCP	Customer	Per customer agreement

Region is selected at tenant creation and is immutable. Audio never crosses region boundaries.
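The two region rules above (fixed at tenant creation, no cross-region audio) can be mirrored on the client side. The following is a minimal sketch under stated assumptions: the `Tenant` class, the region identifiers, and `assert_same_region` are hypothetical names for this example and are not part of any Medera SDK.

```python
from dataclasses import dataclass, FrozenInstanceError

# Region labels taken from the table above; the identifier strings are illustrative.
REGIONS = {"US", "EU", "MENA", "CUSTOMER_VPC"}


@dataclass(frozen=True)  # frozen: the region cannot be reassigned after creation
class Tenant:
    tenant_id: str
    region: str = "US"  # US is the default region in the table

    def __post_init__(self):
        if self.region not in REGIONS:
            raise ValueError(f"unknown region: {self.region}")


def assert_same_region(tenant: Tenant, processing_region: str) -> None:
    """Refuse to route audio to a region other than the tenant's own."""
    if processing_region != tenant.region:
        raise PermissionError(
            f"tenant {tenant.tenant_id} is pinned to {tenant.region}; "
            f"audio may not be sent to {processing_region}"
        )


clinic = Tenant("clinic-1", "EU")
assert_same_region(clinic, "EU")  # passes silently
try:
    clinic.region = "US"
except FrozenInstanceError:
    print("region is immutable after tenant creation")
```

A client-side guard like this only catches misrouted requests early; it does not replace the server-side enforcement the section describes.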

What’s next

/transcribe — between-visit dictation

Stateless dictation WebSocket.

/sessions — live therapy capture

Stateful diarized session WebSocket.

/transcripts — async batch

REST upload + processing.

Languages

Full per-surface matrix.

Contact us if you need help selecting the right endpoint or have questions about configuring requests. medera.info/contact
