ElevenLabs, a burgeoning AI startup that recently secured $180 million in a major funding round, has gained recognition for its audio generation capabilities. The company has now ventured into new territory by unveiling its inaugural standalone speech-to-text model known as Scribe.
Valued at $3.3 billion, the startup has provided numerous companies with text-to-speech solutions through its extensive voice library. Now ElevenLabs is entering the speech detection market, positioning itself to compete against established players like Gladia, Speechmatics, AssemblyAI, Deepgram, and OpenAI’s Whisper models.
At launch, ElevenLabs’ Scribe model supports over 99 languages. The company has identified more than 25 languages that fall into the excellent accuracy category for the model, meaning a word error rate of less than 5%. Among these are English (with a claimed accuracy of 97%), French, German, Hindi, Indonesian, Japanese, Kannada, Malayalam, Polish, Portuguese, Spanish, and Vietnamese. Other languages are categorized with varying levels of accuracy, including high (5-10% word error rate), good (10-20% word error rate), and moderate (25-50% word error rate).
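For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn a transcript into the reference text, divided by the number of words in the reference. The Python sketch below illustrates that standard calculation and maps the result onto the tiers quoted above. It is not ElevenLabs' evaluation code: the function names, the lowercase-and-split normalization, and the handling of tier boundaries (including the "unclassified" label for values outside the quoted ranges) are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    # Assumption: simple lowercase + whitespace tokenization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def accuracy_tier(wer: float) -> str:
    """Map a WER (as a fraction) onto the tiers quoted in the article.

    Boundary handling is an assumption; the quoted ranges leave
    20-25% and anything above 50% unassigned.
    """
    if wer < 0.05:
        return "excellent"
    if wer <= 0.10:
        return "high"
    if wer <= 0.20:
        return "good"
    if 0.25 <= wer <= 0.50:
        return "moderate"
    return "unclassified"


if __name__ == "__main__":
    reference = "the quick brown fox jumps over the lazy dog"
    transcript = "the quick brown fox jumped over the lazy dog"
    wer = word_error_rate(reference, transcript)
    print(f"WER: {wer:.1%}, tier: {accuracy_tier(wer)}")
```

In this example one of nine reference words is wrong, so the transcript lands in the "good" tier; the 97% accuracy claimed for English corresponds to a word error rate of about 3%.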
According to the company, the model has outperformed Google Gemini 2.0 Flash and Whisper Large V3 in several languages during benchmark tests such as FLEURS & Common Voice.

ElevenLabs had previously built speech-to-text functionality into its AI conversational agent platform, released last year. However, this is the first time the company has launched a dedicated speech detection model. In a discussion with TechCrunch last month, CEO Mati Staniszewski shared insights on enhancing speech detection technologies.
“Our goal is to better comprehend what is being communicated during conversations. We are actively pursuing methods to transition from merely generating content to understanding and transcribing speech,” Staniszewski remarked during the conversation. “Many believe speech-to-text technology is a resolved issue. Yet, for numerous languages, the quality remains subpar. We believe we can develop superior speech detection models as we have in-house teams for data annotation and swift feedback.”
The Scribe model also features advanced speaker diarization to identify individual speakers, word-level timestamps for precise subtitles, and auto-tagging of sound events such as audience laughter. The startup is offering a means for clients to transcribe video content directly, allowing for captions or subtitles to be added in its studio.
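To show why word-level timestamps matter for subtitles, here is a minimal Python sketch that groups timed words into cues in the common SubRip (SRT) format. The input shape (a list of dictionaries with `text`, `start`, and `end` in seconds) is a hypothetical stand-in, not Scribe's actual response schema, and the fixed words-per-cue rule is a simplification.

```python
def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[dict], max_words: int = 7) -> str:
    """Group timed words into numbered SRT cues.

    `words` is a hypothetical structure: each item has "text",
    "start", and "end" (seconds). Each cue spans from the first
    word's start to the last word's end.
    """
    cues = []
    for idx in range(0, len(words), max_words):
        chunk = words[idx:idx + max_words]
        start = srt_time(chunk[0]["start"])
        end = srt_time(chunk[-1]["end"])
        text = " ".join(w["text"] for w in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    # Cues are separated by a blank line.
    return "\n".join(cues)


if __name__ == "__main__":
    sample = [
        {"text": "Thanks", "start": 0.00, "end": 0.35},
        {"text": "for", "start": 0.40, "end": 0.55},
        {"text": "joining", "start": 0.60, "end": 1.05},
        {"text": "us", "start": 1.10, "end": 1.30},
    ]
    print(words_to_srt(sample, max_words=2))
```

A production captioning tool would likely also break cues at pauses, punctuation, or speaker changes, which is where diarization output would come in.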
Currently, Scribe supports only pre-recorded audio, which means it is not yet suitable for live meeting transcription or voice note-taking. The company has indicated that a low-latency real-time version of the model will be launched soon.
ElevenLabs has priced Scribe at $0.40 per hour of transcribed audio. While this price is competitive, some rivals currently offer lower rates for audio transcription, though with differing feature sets.
Compiled by Techarena.au.


