Interacting seamlessly with humans across various channels should be a key objective for any customer-facing application. Visuals have long been the preferred medium and will remain so as increasingly immersive technologies such as augmented reality mature. Interacting via speech, however, has proven more difficult.

To much less fanfare than ChatGPT, OpenAI open-sourced its speech-to-text model Whisper in September 2022. We only recently took it for a spin.

We first tested it against a couple of well-enunciated German audio files (credit to www.audio-lingua.eu for providing these free recordings). The results were near-perfect. Note that the AI also understood when individual letters were being spelled out:

Recording 1:
Hallo, wir sind Mara, Luis Unico und ich bin Nico und meine Lieblingsmusik Richtung ist Hip-Hop und Rap. Ich höre einen sehr gerne, einen deutschen Künstler namens Pascha-Nim, den Buch stabiliert man, P-A-S-H-A-N-I-M. Ich höre Generog und meine Lieblingsdeutsche Band heißt Plont, B-L-O-N-D. Ich höre einen Hip-Hop und mein Lieblingskünstler ist Tupac, es schreibt durch zwei P-A-C.

Recording 2:
Was tu ich für die Umwelt? Das Thema Umweltschutz ist in den letzten Jahren ziemlich in den Vordergrund gerückt und hat auch für mich eine wichtige Rolle in meinem Leben eingenommen. Um dem Klimawandel ein wenig entgegenzuwerten, versuche ich möglichst viel bzw. so gutes geht, alle Wege mit dem Fahrrad zu fahren...
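
For readers who want to try this themselves, here is a minimal sketch of how such a transcription could be produced with the open-source openai-whisper Python package. It is not necessarily the exact setup used for the results above: the model size "base", the helper names load_whisper_model and transcribe_file, and the file name in the usage comment are all assumptions made for illustration.

```python
from typing import Any, Optional


def load_whisper_model(name: str = "base") -> Any:
    """Load a Whisper checkpoint (assumes `pip install openai-whisper` and ffmpeg)."""
    # Imported lazily so transcribe_file below can be used and tested without it.
    import whisper
    return whisper.load_model(name)


def transcribe_file(model: Any, path: str, language: Optional[str] = None) -> str:
    """Return the transcript of one audio file as plain text.

    `language` is an optional code such as "de". When it is None,
    no language is passed and Whisper detects it on its own.
    """
    options = {"language": language} if language else {}
    result = model.transcribe(path, **options)
    # Whisper returns a dict; the full transcript is stored under "text".
    return result["text"].strip()


# Usage (hypothetical file name):
#   model = load_whisper_model("base")
#   print(transcribe_file(model, "recording1.mp3", language="de"))
```

Passing the language explicitly skips automatic detection, which can help on short clips. Larger checkpoints are generally more accurate but slower.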

We then decided to give it something harder and found sample call-center recordings from pac-biz. We transcribed the first 2 calls (transcription takes approx. 2 seconds per minute of audio) and once again obtained an impressive result, especially given the audio quality:

Recording 1:
It's time to be tomorrow. What time? 10 a.m. They're heading where? I am heading to... What's your name? My name is... Can comedy prosecutors will be there? My tell. Just one. Let me just confirm everything. I'm going to pick up the address in the hospital. I think to end the press call, recruit number three. Pick up time will be tomorrow at 10 a.m. Correct? Yes, that's correct. I am...

Recording 2:
Thank you for calling. This is how can I help? I'm calling the verify that my husband has been assigned to a driver tomorrow for his ride please. Okay, sure, ma'am. Let me check it for you here one moment. Thank you. You're welcome. What's the name of the passenger under the booking ma'am? I'm his name and rather thank you. Let me check that one here. All right, the verify ma'am that will be picked up for tomorrow 11 a.m. I'm going to Chicago. Is that correct? ...
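
To put the speed figure quoted above into perspective, here is a back-of-the-envelope sketch. It assumes that the figure means roughly 2 seconds of processing per minute of audio and that processing time scales linearly with audio length. Both are assumptions, and the real numbers will depend on hardware, model size, and the recordings themselves. The function names are made up for this example.

```python
def real_time_factor(seconds_per_audio_minute: float = 2.0) -> float:
    """Processing time divided by audio duration (lower is faster)."""
    return seconds_per_audio_minute / 60.0


def estimate_processing_seconds(audio_seconds: float,
                                seconds_per_audio_minute: float = 2.0) -> float:
    """Estimate transcription time, assuming linear scaling with audio length."""
    return audio_seconds * seconds_per_audio_minute / 60.0


# One hour of call recordings at the assumed rate:
one_hour = 60 * 60
print(estimate_processing_seconds(one_hour))  # 120.0 seconds, i.e. about 2 minutes
```

Under these assumptions, transcription runs about 30 times faster than real time, so a full day of recorded calls would take well under an hour of processing.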

You'll observe that some words look out of place. However, even after listening to the recordings several times, it's hard to make out what the correct words are, and the transcription stays very close to the phonetics.

Whisper is released under the MIT license, and we are incredibly grateful to companies such as OpenAI for releasing these open-source models. Doing so allows the community to learn from their expertise, to tinker, to discuss, to improve. It also allows smaller companies like ours to stand on the shoulders of giants and build value-adding services for our customers, which would be impossible without deep R&D pockets.

Next step: we will be testing it on Swiss-German, which is going to be a lot of fun :)