How Do Voice Assistants Work?
Episode 15 • 9th October 2020 • Data Driven • Data Driven
00:00:00 00:26:48


Shownotes

In this episode, Frank and Andy explore voice assistants and the behind the scenes technology that makes them tick.

AI Generated Transcript

00:00:02 British Voiceover AI Lady

Hello and welcome to Data Driven, the podcast where we explore the emerging fields of data science, machine learning and artificial intelligence. I will not be the only AI-generated voice today, as Frank and Andy interview my cousins Alexa, Cortana, Siri and the Google Assistant.

00:00:18 British Voiceover AI Lady

Now that I think of it, the Google assistant needs a proper name.

00:00:22 British Voiceover AI Lady

Doesn't it?

00:00:23 British Voiceover AI Lady

Without further ado, here are your hosts, Frank La Vigne and Andy Leonard.

00:00:29 Frank

So we're both together and we're going to be talking about voice assistants and kind of how they work and.

00:00:38 Frank

Uh, we have some special guests with us today.

00:00:42 Frank

Welcome once again, if you're just joining us live. It's Andy Leonard and I; we are here and we are live streaming the Data Driven podcast, where we explore the emerging fields of data science, machine learning and artificial intelligence.

00:00:55 Frank

How are you doing Andy?

00:00:56 Andy

I'm doing pretty good Frank. How are you?

00:00:59 Frank

I'm doing well. I know you have a hard stop, so I won't yammer too long. We have

00:01:04 Frank

Three special guests with us today.

00:01:06 Frank

And E 3 three.

00:01:09 Frank

That's a record. It is a record.

00:01:14 Frank

These guests are.

00:01:19 Andy

Alexa Hello Alexa.

00:01:22 Frank

She's going to say hello back, I'm sure.

00:01:26 Andy

Yeah.

00:01:28 Frank

Cortana.

00:01:30 Andy

Hello Cortana.

00:01:33 Frank

And.

00:01:36 Frank

On my phone, I have Google Assistant.

00:01:38 Andy

Hello Google Assistant Hey Google.

00:01:41 Frank

That didn't work. It now correctly phones on. Let me tell you, whenever there's a training video or a keynote where they talk about the integration between them, it's pandemonium in my home office, because I usually have all three and it's just utter pandemonium.

00:01:59 Frank

So I want to switch to... so we're recording this live. If you're watching live, thank you. If you're watching later, thank you. We always try to respond to the comments. I think we're pretty good about that. And if you're listening to this on the podcast, I will try to describe everything I'm showing. So let me switch

00:02:18 Frank

Here.

00:02:19 Frank

And I'll see if I can put us in the little bottom here. How do

00:02:22 Frank

I do that.

00:02:24 Frank

There we go.

00:02:27 Frank

Oh well anyway.

00:02:30 Andy

So there we.

00:02:31 Andy

Are were there this is a closed.

00:02:34 Frank

Almost there, this is the. This is a quote.

00:02:38 Frank

from Charles V, who, if you're not up on your history, was kind of a big deal. I think he was a Habsburg. I don't remember, shame on me, but he has this quote where he says, "I speak Spanish to God, Italian to women, French to men, and German to my horse."

00:02:57 Frank

Now you're probably wondering what the heck does this have to do with anything? Well, here's what it has to do with.

00:03:03 Frank

Oh no, PowerPoint is going to crash.

00:03:07 Frank

No, you can tell.

00:03:08 Frank

We're live here we go. This is what I want to say.

00:03:11 Frank

This is my modern take on this: I speak to Alexa when I'm home, to Cortana when I'm at work, and to Google Assistant when I'm in my car or have my phone with me.

00:03:24 Frank

You can also replace that with Siri. I don't. I do have an iPad, but it's not with me.

00:03:28 Andy

That's Frank the First, right?

00:03:30 Frank

Frank the First.

00:03:34 Frank

I rather like that I rather like that.

00:03:38 Frank

So the idea here is you know how do these things work.

00:03:41 Frank

You know what?

00:03:44 Frank

And you know, in terms of the guests and kind of things, and I know you have a time constraint, so I just want to kind of demonstrate something. For a work engagement I had a chance to kind of study up on all three, because it was a competitive situation between LUIS,

00:04:03 Frank

which is ultimately what powers Cortana, kind of behind the scenes, and Lex, which is the Alexa version for processing text, and Dialogflow, which is the Google version. So it's a lot of mouthfuls. Let's see if we can get our guests to introduce themselves. Alexa.

00:04:24 Frank

Hi how are you?

00:04:27 Speaker 3

I'm feeling like a home run.

00:04:29 Speaker 3

This weekend I'll be watching a lot of my favorite sports.

00:04:35 Frank

OK.

00:04:38 Frank

I'm getting a warning sign on my stream ability here, so I don't know what's going on.

00:04:42 Andy

What's up with that?

00:04:43 Frank

I don't know. Well, you can still hear me, so that's a good sign.

00:04:46 Andy

Yeah, you're you're good with me and I'm out here on the in the boondocks with 25 minutes bro.

00:04:50 Frank

There you go.

00:04:52 Frank

Hey Cortana.

00:04:54 Frank

How are you?

00:04:57 Speaker 4

Great thanks.

00:04:59 Frank

There you go.

00:05:00 Frank

And let's see what our friend Google Assistant.

00:05:03 Frank

Has to say.

00:05:10 Frank

Hey court, I'm sorry. OK Google, how are you?

00:05:18 Frank

Oops, it's on my Bluetooth, that's why OK.

00:05:22 Frank

You can tell we're live, folks, 'cause it's just all bloopers.

00:05:27 Frank

How are you?

00:05:32 Frank

So it returned a bunch of search results. OK.

00:05:39 Frank

What's interesting about these three is that they're all trying to solve essentially the same problem, right? They are trying to solve

00:05:46 Frank

The ability to take human language.

00:05:49 Frank

And type, it in and convert it to let me see if I get this screen back up.

00:05:55 Frank

And I will maximize that. There we go. See my fancy setup? It's cool, isn't it? Yes.

00:06:04 Frank

Alright, so ultimately they're all trying to solve the same problem. Hey, we have a comment. Wise guy. Yes I am miserable. OK, alright, so here's the problem that all these devices want to solve, right? This is a human. This is some speaker device thingy.

00:06:21 Frank

Right?

00:06:23 Frank

And.

00:06:25 Frank

You have the cloud.

00:06:27 Frank

Which I think is really what makes this

00:06:29 Frank

possible in a lot of ways. Or not just possible, but practical. Yeah, yeah.

00:06:34 Frank

I say.

00:06:36 Frank

You know, turn.

00:06:39 Frank

I have to be careful 'cause I actually do have the lights in my Home Office so.

00:06:43

Set up to this.

00:06:46 Frank

Right, right? So this gets digitized into audio.

00:06:51 Frank

Right?

00:06:52 Frank

Here, right, I'll draw that with squiggly lines.

00:06:55 Andy

Right, I like the squiggly lines.

00:06:57 Frank

See, I'm talented, I'm very.

00:06:59 Andy

Hard you are. You're an artist.

00:07:01 Frank

Then a cloud service, right? Whether that's LUIS,

00:07:07 Frank

Dialogflow,

00:07:09 Frank

or Lex,

00:07:12 Frank

Converts that into.

00:07:15 Frank

Back into text or into text, right? Right turn the.

00:07:20 Frank

Lights on.

00:07:27 Frank

Then what happens is then you have to figure out what does that mean. What's the context here, right? What's the intent? That's the official word.

00:07:34 Frank

So that's turn lights.

00:07:38 Frank

And then "on". Now, most people will argue with me that technically this is the intent,

00:07:43 Frank

and this is the destination or slot.

00:07:48 Frank

Lex calls this a slot, and this is the state that you want, right? So ultimately there's a hundred different ways I can say that, and this is what makes it really kind of an NLP problem, right? "Please turn the lights on," or "Would you kindly turn the lights on?" Right, BioShock,

00:08:02 Frank

Right there for you.

00:08:05 Frank

Um?

00:08:06 Frank

That sort of thing, and then whatever that happens, is that this will then parse that into an action, right?

00:08:12 Frank

Which, if you have smart plugs, it will then send a message back through the magic of the Internet and then turn the actual.

00:08:20 Frank

Oh, I like how that's doing that. Turn the actual light on.

00:08:27 Frank

Right, so that's that's basically solving the same problem.
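[Editor's note] The intent-and-slot idea Frank describes can be sketched in a few lines of Python. This is only a toy under stated assumptions: the intent name SetDeviceState, the slot names device and state, and the regular expressions are all made up for illustration. Services such as LUIS, Lex and Dialogflow learn the mapping from example utterances rather than from hand-written patterns.

```python
import re

# Hypothetical phrasings for a single intent. The names below (SetDeviceState,
# device, state) are illustrative and not taken from LUIS, Lex, or Dialogflow.
PATTERNS = [
    re.compile(r"turn (?:the )?(?P<device>\w+) (?P<state>on|off)\b"),
    re.compile(r"turn (?P<state>on|off) (?:the )?(?P<device>\w+)"),
]


def parse(utterance):
    """Return the intent and its slots, or None if no pattern matches."""
    text = utterance.lower()
    for pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            return {
                "intent": "SetDeviceState",
                "slots": {
                    "device": match.group("device"),
                    "state": match.group("state"),
                },
            }
    return None


# Different wordings collapse to the same intent and slots.
print(parse("Please turn the lights on"))
print(parse("Would you kindly turn the lights on?"))
print(parse("Turn on the lights"))
```

The point of the sketch is that many surface phrasings map to one structured request, which is what the downstream action code consumes.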

00:08:30 Frank

Right?

00:08:32 Frank

And what's interesting about this, I just realized I didn't say it out loud for folks listening on the podcast, but ultimately what happens is my words get translated into an electronic signal, right? A sine wave of sorts.

00:08:45 Frank

And then that is then.

00:08:47 Frank

On the other side, it's then sent from the speaker to the cloud, where it will turn that sound wave back into text, right? Or words. And then it'll go through and it'll parse out

00:09:02 Frank

what I'm saying and try to get an intent from it, or an action, and then based on that, some other program that also lives in the cloud,

00:09:11 Frank

mostly, will then take an action based on that. Does that make sense? Did I explain that clearly?

00:09:17 Andy

I think so yeah, yeah I like it. I like the flow.
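[Editor's note] The flow just described (audio, then text, then intent, then action) can be laid out as a chain of small functions. This is a minimal offline sketch, not how any of the assistants is actually implemented: the function names are hypothetical, the "audio" is plain UTF-8 bytes standing in for a recording, and the speech-to-text step is a stub where a real device would call a cloud recognizer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Action:
    device: str
    state: str


def speech_to_text(audio: bytes) -> str:
    # Stub: a real device streams audio to a cloud speech recognizer.
    # Here the bytes already contain the words so the sketch runs offline.
    return audio.decode("utf-8")


def text_to_action(text: str) -> Optional[Action]:
    # Very rough intent detection by keyword; real services use trained models.
    words = text.lower().replace("?", "").split()
    if "turn" in words and "lights" in words:
        if "on" in words:
            return Action("lights", "on")
        if "off" in words:
            return Action("lights", "off")
    return None


def perform(action: Action, devices: dict) -> None:
    # Stand-in for the message sent back to a smart plug.
    devices[action.device] = action.state


def handle(audio: bytes, devices: dict) -> bool:
    """Run the whole chain; return True if an action was taken."""
    action = text_to_action(speech_to_text(audio))
    if action is None:
        return False
    perform(action, devices)
    return True


devices = {"lights": "off"}
handle(b"Please turn the lights on", devices)
print(devices)
```

Keeping each stage as its own function mirrors the description above: the speaker only captures audio, and everything after that can live in the cloud.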

00:09:21 Frank

Yeah, and it's it's it's.

00:09:22 Frank

amazing how simple this is, right? This is not rocket science inside your average, you know, Echo device. It's just, well, this one is the fancy one with the screen, but you know, the typical kind of Dot or whatever is a microphone and speaker and a Wi-Fi connection. That's essentially all it is, right?

00:09:42 Frank

So ultimately the goal then is that... let me see if I can de-minimize this. So the goal is that I have an example of that, and this is essentially a build-your-own voice kit that I saw at Micro Center for

00:10:01 Frank

Like $5 or something like that.

00:10:03 Frank

And inside is a speaker, a button, and a cardboard box.

00:10:09 Frank

And if you attach your Raspberry Pi to this.

00:10:12 Frank

You basically have a Google home assistant.

00:10:17 Andy

That's nice, yeah.

00:10:19 Frank

Shame on me because I bought this longer ago than I care to admit and I haven't built it yet.

00:10:26 Frank

But that's just to demonstrate the point, which is that these actual devices are rather simple in terms of, you know, just them being their own thing, right? So what's interesting about this, and this is where the cutting edge comes in, is when I talk,

00:10:41 Frank

We have our human brains or whole.

00:10:44 Frank

Some will debate about.

00:10:45 Frank

Whether or not I have a human brain, but let's let's go with you.

00:10:49 Frank

So.

00:10:51 Frank

The short of it is, is that.

00:10:54 Frank

I have the ability to understand context right from my previous statement.

00:10:58 Frank

So I'm going to mute some of these devices because if I start to hear their name, they'll start going wild. What's interesting is how good Cortana is at this. How good the Google assistant is at this, and how.

00:11:14 Frank

Alexa has some room for improvement, right? So for instance, if you haven't caught on, the shirt I'm wearing says "C.R.E.A.M.: Cash Rules Everything Around Me." That's from a Wu-Tang Clan song, so I will ask this simple question of Alexa. Alexa, who is the Wu-Tang Clan?

00:11:36 Speaker 3

According to Wikipedia, Wu-Tang Clan is an American hip hop group formed in Staten Island, New York City, in 1992. Original...

00:11:45 British Voiceover AI Lady

Hopes.

00:11:46 Frank

Alexa.

00:11:48 Frank

What was their first album?

00:11:51 Speaker 3

According to Wikipedia, The 1st Album is the debut studio album by German duo Modern Talking. It was released on April 1st, 19...

00:11:59 Frank

80... So you get the idea. You and I know: if you asked me who the Wu-Tang Clan were, and then what their first album was, I would tell you, right?

00:12:09 Frank

It does not have the notion of context. This turns out to be a very difficult problem for computers to solve, OK? Because

00:12:19 Frank

there's a lot going on, right? So if I start talking to you, it's like, "Andy, I was at this great restaurant last night," and then we switch to another

00:12:27 Frank

Topic.

00:12:28 Frank

Then

00:12:29 Frank

you would then say, "Hey, where was that place?" And if you said "place", I would know what you were talking about, right? Even humans have trouble with this, right? 'Cause I have many conversations with my wife that kind of go in different directions, 'cause I have no idea what she's talking about.
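[Editor's note] The follow-up problem Frank demonstrates ("Who is the Wu-Tang Clan?" followed by "What was their first album?") comes down to remembering what the previous question was about. Below is a deliberately naive sketch of that idea. The Conversation class and its single-entity memory are assumptions for illustration only, and real assistants use far more sophisticated reference resolution.

```python
import re


class Conversation:
    """Remembers the last named subject and substitutes it for a pronoun."""

    WHO_IS = re.compile(
        r"who (?:is|are|was|were) (?:the )?(?P<entity>.+?)\??$", re.IGNORECASE
    )
    PRONOUN = re.compile(r"\b(?:their|its|his|her)\b", re.IGNORECASE)

    def __init__(self):
        self.last_entity = None

    def resolve(self, question):
        match = self.WHO_IS.match(question)
        if match:
            # A "who is X" question sets the context for later follow-ups.
            self.last_entity = match.group("entity")
            return question
        if self.last_entity is not None:
            # Rewrite the follow-up so it stands on its own.
            return self.PRONOUN.sub(lambda _: self.last_entity + "'s", question)
        return question


chat = Conversation()
chat.resolve("Who is the Wu-Tang Clan?")
print(chat.resolve("What was their first album?"))
```

Without the stored entity, "their first album" has nothing to attach to, and a search for the literal words "first album" is all that is left, which is consistent with the Modern Talking answer Alexa gave earlier.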

00:12:47 Frank

But I mean it's hard for humans. It's really hard for machines, so let's try and see if Cortana does this any better. Hey Cortana?

00:12:54 Frank

Who is the Wu Tang clan?

00:12:58 Speaker 4

According to wikipedia.org, Wu-Tang Clan is an American hip hop group formed in Staten Island, New York City, in 1992. Originally composed of RZA...

00:13:10 Frank

Hey Cortana, what was their first album?

00:13:15 Speaker 4

Should I read a snippet from Wikipedia?

00:13:17 Speaker 4

That might be related.

00:13:19 Frank

Yeah, I'm afraid.

00:13:23 Speaker 4

The Wu-Tang Clan is a NYC-based hip hop musical group consisting of 10 American rappers: RZA, GZA, Method Man, Raekwon, Ghostface Killah...

00:13:34 Frank

There's a lot of members of the Wu-Tang Clan, in case you didn't know.

00:13:37 Speaker 4

Cappadonna and the late Ol' Dirty *******

00:13:42 Frank

Hey Cortana, what was their first album?

00:13:49 Speaker 4

There might be something on Wikipedia.

00:13:51 Speaker 4

Should I read it?

00:13:53 Frank

Yeah.

00:13:56 Speaker 4

The Wu-Tang Clan is a NYC-based...

00:13:58 Frank

All right.

00:13:59 Frank

Well, in the past she did get that right.

00:14:03 Andy

Well, she wasn't completely off base. It seems like some kind of workflow thing put her into it.

00:14:11 Andy

She, well, at least identified the context back to your previous question.

00:14:16 Frank

It did. That's a new behavior. I swear I used to do this demo all the time, and depending on the audience it would be Wu-Tang Clan or, you know, Aerosmith. So let's see what Google has to say. OK, Google.

00:14:32 Frank

Who is the Wu Tang clan?

00:14:39 Frank

Alright, you're not very talkative today.

00:14:45 Frank

What was their first album?

00:14:50 Frank

OK, the demo gods are not kind to me today.

00:14:54 Frank

But in the past this has worked on.

00:14:56 Frank

on Home assistant and Cortana.

00:15:00 Frank

OK, so.

00:15:05 Frank

So the reason?

00:15:05 Frank

why we're doing this today, and I know Andy has a hard stop in a couple of minutes, is because we are hoping to get Data Driven as a flash briefing on Alexa.

00:15:15 Frank

And.

00:15:17 Frank

Alexa.

00:15:20 Frank

So I was trying to do this whole surprise thing, but apparently since the demo failed, I figured I'd break into that.

00:15:27 Frank

into that, but that's ultimately the goal. But I also think this is an interesting topic, because for a lot of folks, this is just this magical black box that is listening, right? And you know, it's not magical, and it all comes down to math and science, right? And the key is to understand kind of how it's built. And once you understand how it's built, you can build your own systems, and it's actually not that hard.

00:15:47 Frank

There are more moving parts than you would think, but ultimately it just comes down to.

00:15:53 Frank

you know, you're taking that speech, that sound data, converting it into text, then taking that text and converting that into some kind of intent and action, right?

00:16:04 Frank

Yeah, and then on the other side, I'm sorry, go ahead.

00:16:07 Andy

No, Mark Taylor just said it's a do loop and he's right, he's.

00:16:10 Frank

a do loop. We have Mark joining us again. Thanks for watching, Mark.

00:16:14 Frank

I really should ask this.

00:16:17 Frank

But unfortunately to be is a bit of a bit of A.

00:16:21 Frank

Not a nice word or not a professional word for LinkedIn.

00:16:25

Yeah.

00:16:28 Frank

But they don't have enough memory. I think not at all.

00:16:33

No they don't.

00:16:34 Frank

But it's an interesting thing where, you know, 'cause I'm a nerd, I happen to have all three different types. Actually four, if you count Siri. Let's see if Siri will do any better on the album question.

00:16:49 Frank

So Andy, we actually have 4 special guests.

00:16:52 Andy

Wow, thank you.

00:16:54 Andy

That's crazy, that's a new record.

00:16:56 Frank

Hey Siri, who is the Wu Tang clan?

00:17:03 Speaker 4

Here's some information.

00:17:05 Frank

Alright, so she basically.

00:17:07 Frank

Pointed to Wikipedia.

00:17:13 Frank

What was their first album?

00:17:25 Frank

So it did the transcription.

00:17:27 Frank

What I said is good.

00:17:30 Speaker 3

I don't recognize this song.

00:17:32 Andy

OK oh OK.

00:17:34 Frank

So I swear this did work before, but I mean ultimately it's a very hard problem. In fact, one of the things that they showed a couple of years ago at Ignite or Build,

00:17:44 Frank

They showed this concept video of this lady talking to Cortana and it was on her phone.

00:17:51 Frank

More on that in a minute. It was on her phone, and as she was driving into work, she'd be like, "Oh, remind me to tell this person,

00:18:00 Frank

You know have a meeting with them.

00:18:03 Frank

And then the logic would then go and schedule the meeting through the Outlook calendar and then tell her, you know, "So-and-so rejected the request, but they are able to meet 30 minutes later. Is that OK?" "Yes, oh, and invite so-and-so to this meeting as well."

00:18:20 Frank

Right,...
