Voice is becoming one of the most natural interfaces between humans and machines. We speak far faster than we type, we think out loud, and many forms of knowledge are easier to capture verbally than through structured input. Historically, however, turning speech into usable text has been unreliable, expensive, and frustrating.
Modern speech-to-text models are changing that.
Accurate transcription enables a range of practical use cases: automatically documenting meetings, turning voice notes into searchable text, and capturing customer requests when call centers exceed capacity.
Another noteworthy shift occurs when transcription approaches real-time latency. If speech can be decoded quickly enough, it enables natural conversations between humans and software. This opens the door to internal training assistants, spoken FAQ systems, or customer support agents that can respond instantly.
Because these conversations often involve confidential information, some organizations may choose to run speech models on their own infrastructure, as we do. Performance varies significantly with model size, which we illustrate with two examples below.
First, the small model (244M parameters), which can run on a powerful laptop. In our example, it performs well in English but struggles with the Swiss German dialect.
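To see why the small model fits on a laptop while larger variants need dedicated GPU hardware, a back-of-envelope estimate of weight memory from the parameter count is useful. This is a rough sketch, not a measurement: the fp16 (2 bytes per parameter) assumption and the ~1550M parameter figure for the large model are ours, and the estimate ignores activations and framework overhead.

```python
def model_memory_gb(params_millions: float, bytes_per_param: int = 2) -> float:
    """Rough weight-only memory estimate, assuming fp16 (2 bytes/param).

    Ignores activations, caches, and framework overhead, which add more.
    """
    return params_millions * 1e6 * bytes_per_param / 1e9

print(f"small (244M):      ~{model_memory_gb(244):.1f} GB")   # ~0.5 GB
print(f"large (~1550M):    ~{model_memory_gb(1550):.1f} GB")  # ~3.1 GB
```

Half a gigabyte of weights sits comfortably in laptop RAM; a few gigabytes, plus runtime overhead and the compute needed for low latency, is where GPU servers start to pay off.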
In the second example we use the large-v3 model, which requires servers with significant GPU resources. It matches the small model's English performance and handles Swiss German much more accurately. The bandwidth requirements are modest: streaming the speech typically requires only a few tens of kbps, so even a basic mobile connection is sufficient.
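The "few tens of kbps" figure is easy to sanity-check with simple arithmetic. Assuming the 16 kHz mono input common to speech models, uncompressed PCM comes to 256 kbps, and a speech codec such as Opus brings that down by roughly an order of magnitude (the 24 kbps codec figure below is a typical value, not one measured by us):

```python
def stream_kbps(sample_rate_hz: int = 16_000, bits_per_sample: int = 16) -> float:
    """Bitrate of uncompressed mono PCM audio, in kbps."""
    return sample_rate_hz * bits_per_sample / 1_000

raw = stream_kbps()   # 256.0 kbps uncompressed
opus_kbps = 24        # typical speech-codec bitrate (assumption)
print(f"raw PCM: {raw:.0f} kbps, with a speech codec: ~{opus_kbps} kbps")
```

Even the uncompressed stream is well within reach of a mobile connection; compressed, it is negligible.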
We already use these models internally to generate meeting minutes and are currently exploring "low-latency" use cases: from receiving guidance during meetings to enabling conversational assistants for internal training or FAQ scenarios. If you're working on similar systems, we'd love to exchange notes.

It's not glamorous, but we've got a strong signal here... and we think that's amazing.