HockeyStick #2 - LLMs in production - Chris Brousseau & Matt Sharp
Episode 28th April 2024 • HockeyStick Show • Miko Pawlikowski
Duration: 01:39:28


Shownotes

Decoding the Past, Present, and Future of Language Models

Delve into the realm of language models with a comprehensive exploration spanning from the foundational Bag of Words approach to the revolutionary technologies of Transformers and GPT. This episode not only unpacks the technical evolution and mathematical underpinnings of natural language processing but also projects the future trajectory of these models. It highlights expert insights on the societal impacts, the convergence of artificial intelligence with human cognition, and the ethical considerations of AI progression. Moreover, the discussion extends to the significance of open-source efforts in shaping this dynamic field. Aiming to provide a profound understanding, this guide navigates through the complex landscape of AI, language models, and their implications for future technology and society.

00:00 Welcome to HockeyStick: Unveiling the Power of LLMs

01:08 Meet the Experts: From Meetups to Authorship

03:16 The Hockey Stick Moment for LLMs: Breakthroughs and Realizations

07:48 Coding with LLMs: The New Frontier for Developers

15:39 The Pitfalls and Limitations of LLMs in Practice

21:43 Building vs. Buying LLMs: Navigating the Trade-offs

32:43 The Cost of Crafting Your Own LLM: Insights and Advice

42:48 Deciphering LLMs: A Crash Course in Language Features

50:44 Defining Language: A Philosophical Dive

51:33 Exploring the Essence of Language and Communication

54:31 Diving into Language Models and Their Evolution

55:08 From Bag of Words to N-Grams: The Evolution of Language Understanding

58:35 The Leap to Bayesian Techniques and Markov Chains

01:01:24 The Breakthrough of Continuous Bag of Words and Embeddings

01:09:43 Unveiling the Power of Multilayer Perceptrons

01:15:08 The Revolution of Attention Mechanisms and Transformers

01:26:37 The Hall of Fame: Landmark Models in the LLM Landscape

01:35:06 Predicting the Future of Language Models and OpenAI's Position

01:38:48 Concluding Thoughts and the Future of AI Research

Transcripts

Speaker:

I'm Miko Pawlikowski, and this is HockeyStick.

Speaker:

LLMs, or Large Language Models, are taking the world by storm.

Speaker:

This breakthrough artificial intelligence technology promises to fundamentally

Speaker:

reshape the way we work with computers.

Speaker:

Over the last year, we've witnessed its Hockey Stick moment, and as

Speaker:

of early 2024, we're firmly in the Cambrian explosion phase.

Speaker:

Today, we're taking a deep dive into how these models came from humble beginnings to

Speaker:

making people scared of imminent Skynet.

Speaker:

I'm joined by two experts, Chris Brousseau, staff machine learning

Speaker:

engineer at JP Morgan, and Matthew Sharp, MLOps engineer at LTK, the

Speaker:

authors of "Production LLMs" currently available in early access at manning.com.

Speaker:

In this conversation, we'll cover the intricacies of human language

Speaker:

and how machines can understand it.

Speaker:

Give you the vocab to sound smart at the next family gathering, and discuss the

Speaker:

various mathematical ideas and models ultimately leading to LLMs, as well as

Speaker:

some noteworthy examples beyond ChatGPT.

Speaker:

Welcome to this episode and please enjoy.

Speaker:

where should we start?

Speaker:

How did you guys meet?

Speaker:

we happen to both live in Utah, and we

Speaker:

actually met at a meetup.

Speaker:

It was actually an MLOps meetup, that was the primary one where we met.

Speaker:

It happens once a month and we'd get together, and so that's our origin story.

Speaker:

we became friends through there, started helping each other, with,

Speaker:

content creation, Chris was starting a YouTube channel, I write on

Speaker:

LinkedIn, just giving each other feedback and helping each other out.

Speaker:

It was especially helpful because I was trying to figure out how

Speaker:

best to present a lot of the material that's in our book now.

Speaker:

how do you explain a transformer model?

Speaker:

And Matt was fantastic about helping me, find my voice on YouTube.

Speaker:

Okay, so going from meeting someone at a meetup, to committing

Speaker:

to spending a couple of years working on a book with someone:

Speaker:

that's a little bit of a difference.

Speaker:

Was there any particular moment where it just clicked?

Speaker:

"Oh, we need to write a book".

Speaker:

How did you come up with the idea?

Speaker:

I was approached and, I would love to write a book, but I don't

Speaker:

know a lot about that process.

Speaker:

And obviously, I didn't really have an authorship voice.

Speaker:

I am not experienced in content creation.

Speaker:

And while I was going through the process of talking with some different

Speaker:

publishers, Matt approached me and said: "Hey, I was a technical reviewer

Speaker:

on Fundamentals of Data Engineering by Joe Reis and Matt Housley."

Speaker:

And so he had experience and he had, subject matter expertise, and he was

Speaker:

giving me some advice and I said, "You know what, why don't you just

Speaker:

come on as a coauthor? You obviously could help a lot here, and I need

Speaker:

it, so let's just do it together".

Speaker:

yeah, I think that it worked out really well because Chris has that background in

Speaker:

linguistics, he understands the natural language processing side better than

Speaker:

anyone else I've met in person, and I was coming more from the MLOps side,

Speaker:

how do we actually deploy these things?

Speaker:

And so I think it's really rounded out our book better than, anything else I'm seeing

Speaker:

out there that you could buy and read.

Speaker:

getting that diverse perspective, I think, really helps our book out.

Speaker:

I was very excited when you said 'yes' to coming onto this because since last

Speaker:

year, I think in most people's minds, sometime early last year, with ChatGPT.

Speaker:

All of a sudden, everybody started talking about large language

Speaker:

models, and some people started worrying about, impending doom and

Speaker:

robot apocalypse, and all of that.

Speaker:

But from the perspective of someone who's worked with this for the best

Speaker:

part of a decade now, I'm wondering:

Speaker:

what was the point when you realized that these LLMs, they're really onto

Speaker:

something and they're moving from, a demo to an actual legitimate technology

Speaker:

that's going to change things?

Speaker:

What was the hockey stick moment for LLMs?

Speaker:

Oh, boy.

Speaker:

for me, without a doubt, that was the release of T5.

Speaker:

And looking at Google's paper about the text-to-text transformer, that really set

Speaker:

the groundwork for prompting, right?

Speaker:

They had a whole bunch of different tasks that you didn't have to change

Speaker:

anything other than some statement.

Speaker:

for the model to do that task, and then a colon, and then whatever

Speaker:

your input was going to be anyway.
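
To make that prefix idea concrete, here is a minimal sketch of T5-style task prefixes, assuming the Hugging Face transformers library and the public t5-small checkpoint; the prompts are illustrative, not from the episode.

```python
# A minimal sketch of T5-style task prefixes, assuming the Hugging Face
# transformers library and the public "t5-small" checkpoint.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# One model, many tasks: only the prefix before the colon changes.
for prompt in [
    "translate English to German: The house is wonderful.",
    "summarize: Large language models are taking the world by storm ...",
]:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output = model.generate(input_ids, max_new_tokens=40)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```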

Speaker:

that was groundbreaking to me.

Speaker:

I had been messing around with GPT-2.

Speaker:

I'd been playing with that and trying to shoehorn it into a

Speaker:

product where I was working.

Speaker:

T5 did everything that we were trying to do with GPT-2, and it was incredibly

Speaker:

flexible, it was easy to fine tune, and for me, that was the hockey stick moment

Speaker:

that "oh wow, no, they're really cooking".

Speaker:

when is that?

Speaker:

for anybody who hasn't heard of

Speaker:

T5?

Speaker:

I think it was 2019. Yeah, "Exploring the Limits of Transfer Learning

Speaker:

with a Unified Text-to-Text Transformer" was October 2019.

Speaker:

it came out in October.

Speaker:

I think I picked it up in November-December of 2019.

Speaker:

Yeah, I think for my hockey stick moment, like I was, in the industry

Speaker:

been paying attention, obviously GPT-2 coming around, T5, etc.

Speaker:

But wasn't really seeing the adoption that someone who's working in MLOps

Speaker:

cares more about I was seeing, , these models can do really cool things,

Speaker:

but people weren't caring about them.

Speaker:

Sam Altman even said it was like, "we didn't think GPT-3

Speaker:

would be that big of a success.

Speaker:

We thought that would happen once GPT-4 came out."

Speaker:

but I just remember, January 2023.

Speaker:

ChatGPT's been out a month.

Speaker:

it's still essentially in beta.

Speaker:

They just released it to get feedback and to start collecting data.

Speaker:

to start improving their model.

Speaker:

but it blew up, right?

Speaker:

I just remember being at a church function and this guy sitting

Speaker:

across the table from me, who doesn't know anything about AI, right?

Speaker:

I was stuck at this table for an hour and all he could talk about was GPT-3.

Speaker:

he was obsessed with it.

Speaker:

I'm like, oh, wow.

Speaker:

even people who don't know anything about, machine learning or AI or the

Speaker:

industry were like, really going gung ho and his wife was an English teacher.

Speaker:

she was really scared of it and was like, "how are we gonna help kids

Speaker:

learn how to, write and read when they can just go online and now cheat

Speaker:

and write these things and stuff".

Speaker:

The very beginning of what, like everyone's had conversations about now,

Speaker:

but like he talked about how his brother in law owned a website that made fake

Speaker:

articles, you can think like The Onion, and so once it came out in that month, like

Speaker:

I said, ChatGPT still wasn't a product yet, and anyone who's been following

Speaker:

it knows a lot of those demos just shut down and then never came back up

Speaker:

His brother in law ended up firing like a hundred writers because he's like,

Speaker:

"Oh, ChatGPT can make these funny fake articles and we're good, right?"

Speaker:

that was my hockey stick moment of "okay, we really are changing,

Speaker:

when some random guy at church is talking about it all the time".

like:

Yeah, I love that example.

like:

But even for people who are in tech who weren't directly following that

like:

very closely, that was a scary moment.

like:

I remember when I first used Copilot, I was like, what, it just does that?

like:

And three out of four times, it would actually work.

like:

that was a scary moment.

like:

It reverberated through a lot of levels of society, including, our own.

like:

And, I think in many ways, technology and writing code might be the easiest

like:

use case for this kind of model, right?

like:

Do you agree with that?

like:

I don't know if I completely agree with it, because, code is incredibly

like:

syntactically dependent, right?

like:

every developer who's worked with JavaScript or C++ and then moves

like:

to Python, they feel it, right?

like:

That's one of the biggest complaints is "I hate Python syntax".

like:

"I hate that white space matters", it's a little bit more complex than just

like:

repeating whatever natural language happened, but you're absolutely right

like:

that is one of the best use cases so far.

like:

because, it's better structured than just spoken language, or are there any

like:

other reasons that make it so well suited for that particular application?

like:

programming languages are not real languages, right?

like:

one of the things that makes it simultaneously well and ill-suited

Speaker:

for it is how much gets repeated. You use the exact same words.

like:

The exact same tokens to define every function that you make, but then the

like:

function's name can be whatever you want.

like:

And so using the exact same tokens is awesome.

like:

That provides landmarks for the probability as it's

like:

going through all of this.

like:

But then that input to just say whatever you want and put it in camel

like:

case or snake case or whatever, tons of different formatting for functions.

like:

it makes it a little bit more difficult.

like:

Especially while you're trying to tokenize that,
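
As a rough illustration of that tokenization point, here is a minimal sketch using GPT-2's BPE tokenizer from the Hugging Face transformers library; the identifiers are invented, and the exact splits depend on the vocabulary.

```python
# A minimal sketch of how identifier style changes what the model sees,
# assuming GPT-2's BPE tokenizer from the transformers library.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
# "def" recurs verbatim (a reliable landmark); the function name
# fragments differently depending on camel case vs. snake case.
print(tok.tokenize("def getUserName():"))
print(tok.tokenize("def get_user_name():"))
```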

like:

one of the big benefits with code is the amount of data we have around code.

like:

lots of people are writing code.

like:

they all have very similar ideas of what they're trying to do, of

like:

what they're trying to architect, of what they're trying to design.

like:

and so we're not necessarily worrying about, hallucinations or

like:

fake news or, people disagreeing or other things like that.

like:

there's just a lot of data, that all agrees with each other and

like:

pushes in the same direction.

like:

It makes it good.

like:

there's obviously some negatives of just assuming, some of these LLMs writing

like:

code is going to do things well, but, I think Chris highlighted that already.

like:

it's actually really similar to how regular languages work.

like:

If we have more Python data, like Matt's saying, it's going to do better at Python.

like:

And that can create a little bit of a positive feedback loop with LLMs, where

like:

a lot of people want to get into Python, and they're very good at it, but then

like:

when you look at emerging languages like Mojo, for example, it's really difficult

like:

to find that data and so LLMs are worse at it, similar to natural languages

like:

that have a lower number of speakers, a lower presence on the internet,

like:

So is the solution to use an LLM to generate a lot of Mojo and make it

like:

a significant percentage of GitHub?

like:

that'd be fun, dude.

like:

I think there are some problems with synthetic data that can lead

like:

to stuff like model collapse.

like:

I don't know if we're going to see that in the code space, though.

like:

I think we could see that in natural language.

like:

So that might be a valid solution.

like:

Okay.

like:

the date is 13 February, the day before Valentine's Day 2024.

like:

I'm going to ask you for a wild prediction.

like:

Where do you see that going?

like:

Should, all kinds of, or maybe any subset of programmers who, produce code as a

like:

job, should they start at least worrying?

like:

Is that something that's going to decrease the pool of available jobs?

like:

no, I don't think it's really going to impact the amount of work.

like:

I just think about my job, and even when I'm in very technical roles, and I'm

like:

spending 50% of my time on the keyboard, still, it feels like a majority of the

like:

work is still just communicating with stakeholders, understanding exactly what

like:

the problems are, technical writing, design docs, really understanding at

like:

a high level, what you want to build.

like:

To be fair, programmers have been automating the 'writing the

like:

code' portion forever, right?

like:

From the beginning.

like:

yeah, with massive amounts of like scripts and configs that they use.

like:

And that's why they love Vim or Emacs still, right?

like:

It's because they have it configured just right.

like:

And they can move really quickly, because it provides a lot of that

like:

automation for them already, but this is just helping junior engineers

like:

already have all that configuration and set up really quickly, right?

like:

It mostly will just make our jobs a little bit easier, it doesn't remove the need to

like:

really understand the engineering aspect, the architecture aspect, the design

like:

aspect that still is involved with coding.

like:

Oh, yeah.

like:

this is why we love comparing LLMs to a printing press.

like:

The Johannes Gutenberg one.

like:

Because did that destroy the writing industry?

like:

All it did was it destroyed the monopoly that certain organizations

like:

had on publishing books.

like:

Before you had to get a scribe and you had to pay the scribe and you had to

like:

have access to scribes. You couldn't just walk up to a printing press and

like:

hit it, and then boom, you have a book.

like:

You have to have knowledge. You have to have an idea.

like:

The printing press just gives you a lower barrier to entry

like:

Which is what we love, right?

like:

For coding, I think Matt is exactly right, that it's a lower barrier to

like:

entry for junior engineers to be able to produce significantly better work.

like:

and in some ways it actually accelerates it, because when you copy and paste what

like:

an LLM gave you and it doesn't work, you have to go figure it out, right?

like:

With the junior engineers, it also helps speed up senior engineers, and

like:

staff engineers and principal engineers.

like:

it's good, and lowers the barrier for the entire industry, we like that.

like:

Yeah.

like:

I've lately been spending lots of time writing chapter 10 of our book,

like:

and in chapter 10, we actually go through a project, where we help you

like:

build your own copilot, and we build the VS Code extension to get it in.

like:

if you want to be running your own LLM on your own computer with your own data,

like:

so that way, you can get your own things.

like:

we walk through all the steps to do that.

like:

And in some aspects, it's interesting, cause sometimes

Speaker:

adding an extra feature made the model work, right?

like:

there's still just so much to learn about it.

like:

ultimately, it comes down to your data, right?

like:

how good your coding data is,

Speaker:

is really how well the copilot works, right?

like:

SQL is one of the most repetitive of all of the programming languages.

like:

but true skill with SQL does not involve being good at SQL.

like:

It involves knowing the data, right?

like:

It's knowing which tables to query, how to merge them, how window functions work, all of

like:

that stuff, knowing exactly what you need to be looking at is the true skill in SQL.

like:

And we're hopefully getting to a point where we can help the

like:

model know the data, right?

like:

We can give it some sort of context for the data that it's going to be looking

like:

at, so that it can generate good SQL
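
One way to picture that, as a minimal sketch: ship the schema along with the question. The table definitions and the ask_llm() call here are hypothetical placeholders, not anything from the episode or the book.

```python
# A minimal sketch of giving a model context about the data before
# asking for SQL. The schema and ask_llm() are hypothetical placeholders.
schema = """
Table orders(order_id INT, customer_id INT, total NUMERIC, created_at DATE)
Table customers(customer_id INT, name TEXT, region TEXT)
"""
question = "Total revenue per region for January 2024."

prompt = (
    "Given this database schema:\n" + schema +
    "\nWrite a SQL query to answer: " + question
)
# ask_llm() stands in for whatever completion call you have available;
# the point is that the schema travels with the question.
print(prompt)
```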

like:

that's a really good point.

like:

I've actually had lots of mentees who are trying to learn SQL for the first time, and I said:

Speaker:

"just use ChatGPT". Generating SQL is actually something they're really

I said:

good at, you don't need GPT-4, like even GPT-3, like even GPT-2, it's not

I said:

hard to generate really good SQL syntax.

I said:

Cause it's so simple, it follows a very similar structure.

I said:

But ultimately, you can have it write the SQL, but you're going to have to

I said:

go back and figure out how to connect all the pieces and understand your

I said:

database and understand your data.

I said:

that's a perfect example, understanding how to write the

I said:

code is only half the problem.

I said:

Understanding how to integrate it is really the bigger problem.

I said:

What's the most terrible use case that people are currently

I said:

trying to use LLMs for?

I said:

What does LLM in general, or LLMs, what do they suck at the most?

I said:

I'm going to say they, they suck at, sequence prediction, which sounds so off.

I said:

Because that's what they're made for, but one of the things that I'm seeing

I said:

people do, is try and automate entire workflows with LLMs, and they're trying

I said:

to get the LLM to just do the whole workflow, and they suck at that.

Speaker:

They need all of this stuff to help it.

I said:

They need tools, they need RAG, they need specific fine-tuning landmarks

I said:

and they need few-shot prompting, they need all sorts of stuff to make

I said:

it work, and then it's still up in the air about whether or not it will

I said:

do the right task in the right order.
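
For one of those pieces, few-shot prompting, a minimal sketch might look like the following; the routing task and the examples are invented for illustration.

```python
# A minimal sketch of few-shot prompting: show the model worked examples
# of the exact task before the real input. The ticket data is invented.
examples = [
    ("Refund request for order 1234", "route: billing"),
    ("App crashes when I upload a photo", "route: engineering"),
]
new_ticket = "I was charged twice this month"

prompt = "Route each support ticket to a team.\n\n"
for ticket, label in examples:
    prompt += f"Ticket: {ticket}\n{label}\n\n"
prompt += f"Ticket: {new_ticket}\nroute:"
print(prompt)  # this string would be sent to the model
```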

I said:

Yeah, I was thinking, I don't know how much I'm seeing this.

I said:

But, three months, six months ago, I was hearing a hundred horror stories

I said:

about, essentially, CEOs being like, "we need LLMs", and like they're magic,

I said:

they can do anything. And so it didn't matter what the problem was, "oh, we need

Speaker:

to do outlier detection using LLMs".

I said:

No,

I said:

use stats for that.

I said:

yeah, outlier detection is really a statistical problem.

I said:

It's really a data and math problem.

I said:

LLMs are good at natural language.

I said:

And so when we can solve a problem using words and communication,

I said:

that's when LLMs can get in.

I said:

But problems like, outlier detection or weather prediction or these

I said:

other things, we have algorithms.

I said:

stock market prediction, Super Bowl prediction,

I said:

All these things, we have better ways to make predictions.

I said:

And it's called math, right?

I said:

Fourier transforms, other machine learning algorithms, other things like that.

I said:

LLMs are not good at doing those things, cause we don't talk

I said:

about them in natural language.

I said:

we've invented other languages, like math, just to describe them.

I said:

And that's why they're not good.

I said:

we can make tools, you can build functions for an LLM to use to do Fourier

I said:

transforms and whatever else, right?

I said:

But getting the LLM to know that it needs to do that is really difficult.

I said:

Probably just as difficult as explaining what the Fourier transform

I said:

is to an LLM within your training data to get it to be able to replicate it.

I said:

This is one thing that makes it almost miraculous when stuff does

I said:

work, and that's that feeling that we're chasing right now, and that's

I said:

the replicability that we're trying to help people get to in the book.

I said:

how do you actually do it, and how do you make sure that your scope

I said:

is small enough, that it will work repeatedly and you can build a

I said:

product off of it, that's difficult.

I said:

I'm a big fan of chess.

I said:

And, since ChatGPT came out, lots of people have been making memes, or just like,

Speaker:

"Hey, I'll play ChatGPT in chess", and ChatGPT can play chess because we

like:

can talk about it in language, right?

like:

Like e4, move the pawn, or knight to g6, whatever it is.

like:

we have language for it, but ChatGPT has no idea.

like:

It has no idea of the model behind those letter-number combinations.

like:

all it knows is that there's certain things it can do, right?

like:

it writes words, and so when they do this, and these like videos or

like:

memes, like they just let ChatGPT do whatever it says, right?

like:

it just magically creates a knight out of nowhere, and magically, will take

like:

its own pieces as it moves its pieces around, it's always pretty funny.

like:

And even though it's cheating the entire way, it almost always loses, right?

like:

Cause it doesn't have an understanding of chess, like it doesn't

like:

have that model underneath it.

like:

sure we can talk about it in language, but not really, right?

like:

So we, we still have better ways to play chess, AlphaZero, et cetera.

like:

Stockfish, like there are engines out there that play chess really well.

like:

And we don't need to make LLMs good at chess, but that's a very good example

like:

of one of the things it's not good at.

like:

I've seen someone on Twitter who said "I'm gonna give an LLM $1000 or

like:

whatever initial amount, and I'm gonna ask it how to best invest it".

like:

I didn't follow where it went.

like:

But I think a lot of people had the same idea.

like:

this is some kind of genius system.

like:

I'm just gonna be its flesh and bones agent in the real world.

like:

and hope for the best.

like:

So I think that kind of goes back to your chess thing.

like:

So excuse me for that, but I have to ask you about AGI,

like:

Artificial General Intelligence.

like:

Any chance for that happening anytime soon?

like:

What's your prediction?

like:

not with our current systems.

like:

No, I don't think AGI is ever going to come out of quadratic

like:

equations, like not a single chance.

like:

maybe if there are better drop-in sub-quadratic replacements, stuff

Speaker:

like Hyena, I've tested that out.

like:

I think it's really cool.

like:

But, the fact that attention, the query-key-value attention,

like:

ultimately generates complex numbers.

like:

I think that is a little too much for AGI at the moment.

like:

So you're not one of those people who secretly hope that OpenAI has

like:

something they're gonna release soon.

like:

I don't think they have it, right?

like:

I'll be hopeful, sure.

like:

If it comes out, that's great.

like:

Yeah, I'm of the same mind as Chris.

like:

I hope they keep pursuing it.

like:

we've gotten major breakthroughs from what they pursued.

like:

It's very possible AGI will happen in my lifetime, I'm still pretty young. We

like:

keep on making advances really quickly, but are we relatively close to it?

like:

Probably not.

like:

No

like:

Oh, the thing about progress though is that it's very rarely linear, it

like:

tends to have a very weird curve.

like:

So that's why all the predictions are so funny, but hey, I had to ask you anyway.

like:

No, I think it's a great question.

like:

Okay, let's delve a little bit into, a portion of your book,

like:

It's basically describing the two options that you have today.

like:

you can either go and pay some money to OpenAI, maybe Google, or

like:

somebody else, or you can build,

like:

So you've got buy versus build.

like:

Could you talk to me a little bit about how someone would decide

like:

about this as of February 13, 2024.

like:

What are the things to consider, and what are the weights that

Speaker:

you would put in, and biases?

like:

the basic consideration is just your use case, right?

like:

If you just want to test something out, you're a student and you don't have a

like:

lot of budget, and you want something up and running so that you have LLM

like:

experience, I would say just shell out for that: ChatGPT Plus, or Anthropic,

Speaker:

or Google Bard, which has a fantastic API, or I guess Gemini now. Just do it.

like:

it's not that big of a thing.

like:

If your product that you're trying to ship is inconsequential and you

like:

don't need it to be right every time, you just want to sprinkle the

like:

AI pixie dust on it, just buy it.

like:

If your use case goes deeper than that, though, if you want to be able to build

like:

your own, if you need to make sure that it says the right things all the time,

like:

if you need it to behave a little bit more deterministically. There have been

like:

probably a thousand case studies in the last year of people building products on

like:

top of ChatGPT, and then OpenAI rolling out an update that changes how

Speaker:

ChatGPT behaves, and they don't have any way to measure all of the different

like:

ways that it will change it, right?

like:

There are 175 billion parameters in GPT-3 alone; they don't know if it's going

like:

to break your program down the line.

like:

they're just going to update it for what they consider to be better.

like:

And those programs break constantly.

like:

that doesn't mean you can't fix them.

like:

It's just a much bigger problem of maintenance, than I think a lot of

like:

people are expecting going into it.

like:

So if you want to have to maintain it less, build your own.

like:

Yeah, I think the other aspect is like you want that control, right?

like:

there's lots of examples of companies who, essentially built a small shell

like:

around ChatGPT that did something unique.

like:

And then, months down the line, now ChatGPT just does

like:

that out of the gate, right?

like:

their value proposition just completely disappeared.

like:

And that's because they didn't have control over the model.

like:

They didn't have control over what it did. It's just interesting, right?

like:

Because I say these things and things have changed over time.

like:

But when ChatGPT first came out, it was free, it was a demo, and they were

like:

specifically doing it to collect data.

like:

And that's what they did, they used collected data to improve their models.

like:

And that's what they continued to do for a while, right?

like:

Oh no, they're back.

like:

They, it's terms of service, right?

like:

If you want them to save your chat, so that you can return to

like:

it and ask more questions, they get to train off of your data.

like:

So if you want to put anything private or sensitive in there, like

like:

it's over, you've just leaked it.

like:

they're back and forth about what data they're collecting, what data they're

like:

not collecting, and if you're with an enterprise customer, like maybe you

like:

can make certain rules and things like that, and oftentimes they won't, it's

like:

a minefield, for how people are using it, and so it's just something important

like:

to take into consideration. If your LLM is doing something magical,

like:

that's really core to your business, that is really driving customers.

like:

You want to control that.

like:

You want to make sure that the model is working exactly as intended.

like:

You're not getting updates randomly, that break your application.

like:

You're also controlling the data flow, you're making sure that you're not

like:

accidentally training your competitor's model, and other things like that.

like:

And there's just lots of aspects where it's just important to

like:

make sure that you own it.

like:

And, no, that's not necessarily everyone's concern, right?

like:

if you're a student or you're just doing some side project or anything, there's

like:

lots of APIs out there that are very cheap that can get you up and running,

like:

there are literally hundreds of Hugging Face Spaces that are free APIs,

Speaker:

with LLMs running behind them, and you can just hit

like:

them whenever you want, right?

like:

unless you're queuing behind a thousand other people.

like:

yeah, exactly.

like:

I liked the example you gave in the book, I think people at Latitude, the Dungeons

like:

& Dragons people would agree with a lot of what you're saying now, but can you tell

like:

the story of what happened with them?

like:

Latitude is a local company that was here in Utah.

like:

it was put together by, two guys from BYU.

like:

GPT-2 came out several years ago.

like:

They're like, "Oh, this is mind-boggling.

like:

Let's build a game off of it!"

like:

And what they came up with was like a dungeon crawler, a text

like:

based game. It was really neat, because it would just generate an

like:

infinite amount of opportunities.

like:

And so it created this 'choose your own adventure'.

like:

It got relatively big in the space, and lots of people enjoyed playing it.

like:

things were going really well, and then OpenAI's GPT-3 came out. They offered it to

Speaker:

them: "hey, we have this new model, it's a lot better, why don't you try it?"

like:

they played around with it, and "oh yeah, this is, it's much more descriptive,

like:

it's much more interesting, it's really great", There was a lot of excitement

like:

around it, however, it turned out that the model itself, had a propensity

like:

to generate smut, and it got really concerning. People would write, like,

like:

"I'm an eight year old girl", and then the model would complete it saying

like:

"....and I'm wearing a skimpy outfit",

like:

And oh, whoa, like the player didn't want that, but like the model generated it.

like:

there became this big feud between OpenAI and Latitude about creating filters.

like:

"hey, we don't want your players doing that.

like:

We don't like that".

like:

And, Latitude's "okay, we'll create some filters" and things like that.

like:

And it devolved really quickly.

like:

Latitude, being very much a startup, not necessarily knowing everything

like:

they were doing, they built a very shaky filtering system, and then

like:

OpenAI was "that's not good enough".

like:

So then they started banning players, and so eventually we got to this

like:

territory where players, paying customers, would be playing a game, the

like:

model would randomly generate, something that the filtering system didn't

like:

like, and then they would get banned.

like:

Cause it's like the game just did it itself.

like:

It was a very complicated time, and there was lots of back and

like:

forth between Latitude, who's a small company, and OpenAI.

like:

There's lots of 'he said, they said' going on, but ultimately, it's just this

like:

position where Latitude had this game that was completely dependent on OpenAI's

like:

model to generate good output, and it really caused a lot of drama between

like:

the players and Latitude and OpenAI in the background, and that is a critical

Speaker:

example of an LLM being critical to their business. If they owned it, then they

like:

could have controlled it, they could have made sure that from the model aspect,

like:

they could have trained the model to make sure it didn't do any of those things.

like:

And then they would never need to play the little blame game, right?

like:

Nobody likes to play that game.

like:

That's: whose fault is it that the model is generating bad stuff?

like:

Is it the player who's prompting it?

like:

Is it Latitude who has some systems for tokenizing and preparing player

like:

output before it goes to OpenAI?

like:

Is it OpenAI because their model is generating that?

like:

Is it Latitude for post processing the content from OpenAI before

like:

they serve it to the player?

like:

I don't even know if it really matters who's to blame.

like:

it's just a sucky game to play.

like:

and that's like the ultimate example of why you might want to consider

like:

build versus buy is if you buy from any provider, we're picking on OpenAI here,

like:

because they're a big player, but you buy from Anthropic, you buy from the guys down

like:

the street, the startup that just barely came up and they're offering for half

like:

the price of whatever. Buy from anybody,

like:

and you will eventually have to play that blame game.

like:

we had another example in there of some lawyers who generated cases that didn't

Speaker:

exist. They asked ChatGPT about cases and it came up with a perfect response.

like:

a little too perfect.

like:

It hallucinated stuff that didn't exist.

like:

and, is it ChatGPT's fault?

like:

Is it OpenAI's fault for, allowing their model to make

like:

stuff up and behave dishonestly?

like:

Or is it the lawyer's fault for not checking it?

like:

who cares?

like:

the problem is that it's not locked down.

like:

It's non-deterministic.

like:

Yeah, in a way, as I was reading the chapter on that, it makes

like:

me think of using a machine to maybe do some farm work.

like:

Let's say that you're plowing a field and you're using a

like:

horse versus a machine, right?

like:

A machine might break, but in a predictable way.

like:

And if you've got a mechanic around, they'll come and fix it.

like:

A horse can get scared, or it has a bad day, or it can be moody.

like:

And it can come up with something new.

like:

So you always have to be careful with that.

like:

is that an accurate feeling of someone who's working with these LLMs day-to-day?

like:

You work with some kind of animal?

like:

One of the most annoying things is even if you set the seed of it, so

like:

the random generator is going to be the same every single time, you

like:

can still give it the same prompt and get something different out.
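
A minimal sketch of that seeding point, assuming PyTorch and the transformers library with the small public GPT-2 checkpoint:

```python
# A minimal sketch of seeded sampling, assuming PyTorch and transformers.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
ids = tokenizer("The future of AI is", return_tensors="pt").input_ids

for _ in range(2):
    torch.manual_seed(42)  # same seed before each generation
    out = model.generate(ids, do_sample=True, max_new_tokens=20)
    print(tokenizer.decode(out[0]))
# On one machine these two runs usually match; across hardware, library
# versions, or batching setups, the same seed can still give different text.
```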

like:

The truly awesome thing about LLMs is the number of non-linear activations

like:

that are going through the model, right?

like:

It's creating incredible, non-linear jumps throughout that dimensional

like:

space that the embeddings are in.

like:

you just can't really predict it.

like:

It is a little bit like an animal.

like:

the fact that like we can prompt engineer at all.

like:

it's a little bit telling of where we are, right?

like:

Cause like prompt engineering, you can change the spaces, the white space

like:

inside of your prompt and it can end up giving you a completely different result.

like:

we're still in a very interesting area, where we're trying to create

like:

better ways to communicate with the LLM and get predictable outputs.

like:

But, the fact that we can do that at all...

like:

This is a bit of a miracle, right?

like:

you can't do that with a human.

like:

a human isn't going to be tricked into saying something different.

like:

humans are tricked all the time, but not necessarily in the

like:

same way that we do with LLMs.

like:

it's a very interesting world we are in, and a lot of people are having

like:

that horse versus machine experience.

like:

let's talk about the cost a little bit.

like:

you mentioned that it's super cheap to pay some big company to use their thing.

like:

let's focus for a minute on the cost of actually building your own LLM.

like:

if I wanted to build one of these foundational models,

like:

Let's say that I take one of those 75TB corpora from the internet and I'm

like:

feeling particularly GPU poor that day.

like:

How much money do I need to have in my little piggy bank to get something useful?

like:

That's difficult, man.

like:

because you're either paying for a GPU, right?

like:

Or a suite of GPUs in order to parallelize it so that you can ingest

like:

that over a short period of time.

like:

Or technically, with a lot of this stuff, you can load it onto a [GeForce] 3090,

like:

I've done this personally, you can train in FP16, you can train up to, about, 13

like:

billion parameters pretty effectively, and pretty cheaply, on a 3090.

like:

You have to be a little bit smart about your data loading, you have to make

like:

sure you're streaming stuff, you have to pay for the data storage anyway, it's

like:

incredibly slow, you have to do gradient checkpointing, you have to do, like,

like:

gradient accumulation steps, which slow down the training even more, I trained a

like:

little bit bigger than that, it was about a 20 billion parameter model on my 3090,

like:

but what I don't generally talk about is that it took a year of just running to do that.
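
The tricks named there, FP16, gradient checkpointing, and gradient accumulation, map onto trainer settings; here is a minimal sketch with the Hugging Face Trainer API, with the model and dataset omitted and the values purely illustrative.

```python
# A minimal sketch of the memory-saving settings mentioned above,
# assuming the Hugging Face Trainer API; values are illustrative.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    fp16=True,                       # half-precision training
    gradient_checkpointing=True,     # recompute activations to save VRAM
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,  # act like a batch of 32 on one GPU
    num_train_epochs=1,
)
# Pass `args` to a transformers.Trainer along with your model and dataset.
```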

like:

it was horrendous and that all culminated in a company giving me a

like:

cease and desist, so I couldn't even release it. So you're either paying

Speaker:

a lot of money, hundreds of thousands of dollars, in order to get something quick.

like:

Especially with 75TB of text or more, grab your own data, get

like:

more data, and you're paying to store and to process all of that.

like:

And that costs tons of money.

like:

Or you are not paying the money, but it takes a really long time and makes all

like:

of your shareholders really frustrated because you're ruining your go-to-market.

like:

You're taking too long.

like:

You're not going to be the first in the space. It's a huge trade-off.

like:

as with many things, you can trade time or money, and

like:

training an LLM is very similar.

like:

I think they estimated, huge models that we see, like the ChatGPT things,

Speaker:

you're probably paying somewhere like, what was it, like a half million?

Speaker:

I think they say. And that's just for the training, we're not even

like:

talking about all the experts you have to pay, and buying data in order to do

Speaker:

data curation,

like:

man.

like:

on the very far end on the expensive side.

like:

it gets really expensive really quickly to train these models, just because of

Speaker:

buying enough GPUs in order to parallelize this to do it within reasonable time, and

like:

just the sheer volume of data you have to run through to train all the parameters.

like:

It gets really expensive, but on the other end there's lots of good

like:

open source models that have done that main pre-training already.

like:

And so you can grab one of those, you can train it with something like

like:

LoRA, where you only need a handful of samples and maybe like 10 minutes

like:

if that, and you can train it on a very simple GPU, and you have something

like:

fine-tuned for what you need, and you can get under $200, which is very reasonable.

like:

$150, $20.

like:

It's very possible to train, these models with certain

like:

methods to get what you need.
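
As a minimal sketch of that LoRA route, using the Hugging Face peft library; the base model and target modules here are assumptions for illustration, not a recipe from the book.

```python
# A minimal sketch of LoRA fine-tuning setup with the peft library.
# The base model and target modules are assumptions for illustration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the weights
```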

like:

So does it mean that in a kind of natural, almost biological like evolution we're

like:

going to end up with a few primary models that a lot of the different models branch

like:

off of, instead of, reinventing the wheel?

like:

That's where we're at currently.

like:

I hope that it doesn't stay that way, because I really enjoy seeing

like:

new people create new models for new use cases and all this stuff.

like:

so I hope it doesn't stay that way, but I do see a lot of value in creating industry

like:

standards, at least around how you are actually writing the binary files, how

like:

are the weights actually being stored?

like:

What do the different layers look like?

like:

I think that standardizing what the model looks like so that you can load

like:

it as flexibly as possible is awesome.

like:

I would like to see more open source models, which is funny considering

like:

there are thousands of open source fine tuned versions and hundreds

like:

of open source foundational models on the Hugging Face Hub right now.

like:

I want more, right?

like:

I'm greedy, man.

like:

To me, it sounds like basically every week there is another one that's better

like:

at something, and if you look at the Hugging Face LLM leaderboard, it's

Speaker:

changing by the hour, literally, and it looks like a gold rush in many ways, but

Speaker:

I like this gold rush much better than the crypto one a couple of years ago.

like:

Yeah, man, there's a lot higher chance that you'll come out

like:

of this gold rush with a great product than with the crypto one.

like:

yeah, there's a lot there, and just to summarize that into one sentence,

like:

you can probably fine tune even a gigantic model for around $200 to $500.

like:

And you can go lower than that.

like:

Even if you are smart about how you're doing it, versus training from scratch,

like:

which either is going to take an inordinate amount of time or will cost

like:

thousands and thousands of dollars.

like:

So I'm willing to bet money that a lot of our listeners are going to pause

like:

this now and start Googling furiously.

like:

How do I fine tune a model?

like:

Where would you point them as a good starting point?

like:

any particular paper, any particular, company, anything that's, a

like:

good place to start with that?

like:

a bit selfishly, I would say you should buy our book.

like:

We talk about probably the main ways to train in chapter 5 of our book,

like:

I was going to say that, but, I

like:

was going to say it last, right?

like:

Cause we do go over it.

like:

The book is primarily about production environments, but you can't really

like:

put a model in production if you don't know how to work with it.

like:

So we have stuff on fine tuning.

like:

We have stuff on parameter-efficient fine-tuning, on low-

Speaker:

rank adaptation, the whole deal.

like:

YouTube is actually probably one of your best resources right now, because

like:

it has amazing content creators that show you how to do it in whatever

like:

format you're comfortable in.

like:

So if you're a C++ developer, there are YouTube videos on how to fine-tune a model

like:

and create a LoRA using llama.cpp, right?

like:

It's not even all that difficult.

like:

You just have to convert a model into GGUF format and boom, you're there.

like:

You can do it on a CPU.

like:

it'll take a long time, but you can do it in whatever quantization

like:

you want and everything.
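
A minimal sketch of running such a quantized GGUF file on CPU, assuming the llama-cpp-python bindings; the model path is a placeholder.

```python
# A minimal sketch of CPU inference on a GGUF-quantized model via the
# llama-cpp-python bindings; the file path is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="models/my-model-q4_0.gguf")  # 4-bit quantized file
out = llm("Q: What is a token? A:", max_tokens=32)
print(out["choices"][0]["text"])
```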

like:

YouTube will meet you where you're at if you want to learn something a little bit

like:

more industry-standard so that you could potentially, get employment in this area,

like:

PyTorch has amazing documentation, fantastic tutorials, and they're one of the

like:

best at really making it feel like you're playing with, let's say, "big boy Legos".

Speaker:

You're like building the model using their little Lego pieces, pretty cool. If you need

Speaker:

something a bit more high-level than that,

like:

Hugging Face, I think, is the industry standard for working in between a whole

like:

bunch of different frameworks, whether that's PyTorch or TensorFlow or, whatever

like:

other framework you're working with, like ONNX.

like:

HuggingFace has abstracted away a lot of the difficulty of setting

like:

up models for fine-tuning, cause in PyTorch you have to build out the

like:

exact model architecture just to load the weights and then fine tune it.

like:

HuggingFace already has the class built for you.
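
A minimal sketch of that contrast, assuming the transformers library: one call loads architecture and weights together, ready for fine-tuning, where plain PyTorch would need the full module definition first.

```python
# A minimal sketch of Hugging Face's abstraction: architecture and
# pretrained weights load in one call, with no hand-built module classes.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# `model` is a regular torch.nn.Module, ready to pass to a Trainer or
# an ordinary PyTorch training loop for fine-tuning.
print(sum(p.numel() for p in model.parameters()), "parameters loaded")
```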

like:

I would point to those if you need more explanation, like

like:

Coursera is a fantastic place.

like:

DeepLearning.AI, on Coursera and on their own site,

Speaker:

that's Andrew Ng's education stuff.

like:

That's where I got my start with machine learning was Andrew Ng's

like:

machine learning course on Coursera.

like:

It was Awesome.

like:

Fantastic.

like:

Jeremy Howard is also amazing in that area of creating content for

like:

people starting out and learning from beginner to advanced level.

like:

He's at fast.ai.

like:

I, yeah, I strongly recommend all of those

like:

and your book.

like:

yeah, we ingested a lot of those in order to write the book,

like:

our book is a very nice high-level overview of the key things you want

like:

to be looking at and like different methodologies from training from

like:

scratch, to basic fine-tuning, to

Speaker:

model distillation, to LoRA and PEFT and things like that.

like:

we definitely give a high level overview, we give code samples and show you that.

like:

But, ultimately if you really wanted to get into it, yeah, there

like:

are other resources out there.

like:

I know Manning has another book coming out, specifically

like:

all about training LLMs.

like:

there are definitely other places you can go, but.

like:

If you're looking for the quick, summarized version of all of

like:

these things, our book is actually a really good resource for it.

like:

One other thing that I like about your book is, the part where you

like:

build up the different, breakthrough moments, throughout the world of

like:

mathematics, that ultimately led to 'Attention Is All You Need', and

like:

what is it, seven years later now?

like:

the gold rush that we're observing.

like:

but just before we jump into that, there is a little bit of vocabulary

like:

that one needs to have in order to basically talk, or even read

like:

a lot of these papers. Could you

like:

Talk us through briefly that vocabulary.

like:

I'm talking about phonetics, syntax, semantics, pragmatics, morphology, that

like:

until I read your book actually made me think mostly of blood tests and semiotics.

like:

Could you give us like the MVP version of what you need to know about these

like:

things to be able to read papers?

like:

Oh, absolutely.

like:

Matt has been learning a lot of this too, he might be better at it than me.

like:

I will throw other jargon into it.

like:

writing this book with Chris over the last year has been, mind-opening for me.

like:

until you can understand these words, like you were saying, it's really

Speaker:

hard to dive into the deep end, but we go over them in our book just because

Speaker:

we do find it so valuable. It really helped me understand very quickly:

like:

"Oh, this is what my LLMs are good at.

like:

This is what LLMs are not", and that was one of the first things we started with

like:

but the first one, semantics, that is just like the structure of words, how things

like:

go, whether or not it sounds correct.

like:

that is what LLMs are really good at.

like:

They're really good at making sure like the semantics of words align really well.

like:

but after that, you got pragmatics, which is what LLMs have no idea about.

like:

That is all the information around

Speaker:

that isn't said, right?

like:

So when you say I'm going to find the eggs the Easter Bunny left, right?

like:

you have to understand what, Easter is, what the Easter

like:

Bunny is, why a bunny has eggs.

like:

there's a lot of context around it that you have to understand,

like:

and that's all pragmatics.

like:

it's information that isn't said.

like:

And that's what LLMs generally lack.

like:

Actually, I'm gonna, I'm gonna jump in here real quick.

like:

Miko, did you like the Veľká noc example that I gave in there?

like:

Yeah, I thought it was

like:

Yeah.

like:

Was that pretty good?

like:

I just wanted to ask because I remember experiencing that in Slovakia.

like:

Like I lived there for years and that was a hugely beneficial portion to me

like:

to help figure out that 'no, tons of people have tons of ways of looking at

like:

things', and LLMs don't know about it.

like:

you would have to explain every bit of it to them in order to get them

like:

to understand the same things as you.

like:

Anyway, sorry, Matt.

like:

I find like those two words in general, semantics and pragmatics, understanding

like:

those is going to get you significantly farther in just understanding

like:

how LLMs work, what they're doing.

like:

there's obviously a lot of other words that we talk about,

like:

like morphology and stuff.

like:

And I'll hand it off to Chris to talk about what he wants to add to there.

like:

I would agree with Matt.

like:

Just understanding semantics and pragmatics would get you probably 60%

like:

of the way there, and you could read new papers that come out and immediately

like:

see like where are they amazing?

like:

Where are they failing?

like:

I end up using the relationship between those two, just the literal

like:

encoded meaning of your words.

like:

if I say, "I'm married to my ex-wife", there's immediately,

like:

boom, semantic problem there.

like:

How can I be married to my ex-wife?

like:

The words don't agree with each other.

like:

Versus, exactly as Matt was saying, if we talk about Easter, if we talk about

like:

traditions, if we talk about rituals that people have, just like the stuff

like:

that you say, if you ask someone in Slovakia, they're going to respond to you.

like:

That's normal.

like:

it's a question, they respond.

like:

LLMs don't have that, and you have to have them ingest tons and tons of data in order

like:

to even get as far as giving a response.

like:

the other ones that we can think about, syntax, I would say

like:

that syntax is largely solved.

like:

At this point, syntax is your structure around the words, like what order do

like:

the words go in for them to be correct?

like:

Is it 'I go to the store' or is it 'I to the store go' or all of that stuff.

like:

That's syntax.

like:

It's the structure that holds your sentences, your utterances together.

like:

Morphology is delving into something that I consider to be very important in LLMs.

like:

I'm not going to say the most important, cause I think that's still semantics.

like:

There's a lot of work there.

like:

but morphology would be how words are built.

like:

what are the fundamental units of meaning, the morphemes, do those

Speaker:

even exist, that sort of stuff.

like:

and we don't have to delve really deep into that.

like:

That's largely solved by tokenization, but we can see,

Speaker:

with newer models that come out, that really matters.

like:

You have much smaller models that have more novel tokenization, more novel

like:

morphology that end up outperforming larger models on tasks that they

like:

didn't even train on all that much.

like:

if we can put it all together really quick.

like:

The model solves syntax.

like:

Embeddings try to solve semantics, but semantics is difficult,

like:

and so they're not perfect.

like:

Pragmatics is stuff like RAG, your Retrieval Augmented Generation, and

like:

having repeated sequences within your training data, it gives it landmarks, it's

like:

context around the syntax and semantics.
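
A minimal sketch of that RAG idea, with retrieve() and ask_llm() as hypothetical stand-ins for a document store and a completion call, not any specific library's API:

```python
# A minimal sketch of retrieval-augmented generation: fetch relevant
# passages first, then prepend them to the prompt. retrieve() and
# ask_llm() are hypothetical stand-ins, not a specific library's API.
def answer(question, retrieve, ask_llm, k=3):
    passages = retrieve(question, k)  # top-k relevant documents
    context = "\n".join(passages)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return ask_llm(prompt)
```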

like:

Morphology is your tokenization, which, if I would give that an example, your

like:

tokenization provides your model with stuff that it sees, it changes from text

like:

into what the model actually sees.

like:

And, your embedding strategy is moot if you don't have it.

like:

Just your morphology gives your model glasses, if you want to call it that.
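
A minimal sketch of that "glasses" point, assuming GPT-2's tokenizer from the transformers library; the exact subword splits depend on the vocabulary.

```python
# A minimal sketch of tokenization as "what the model actually sees",
# assuming GPT-2's tokenizer; exact splits depend on the vocabulary.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.tokenize("unbelievably"))   # subword pieces, not whole words
print(tok("unbelievably").input_ids)  # the integer IDs the model ingests
```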

like:

And then phonetics is the one that we haven't even talked about.

like:

Phonetics is the reason why we are doing a podcast and we're talking instead of just

like:

texting each other or emailing each other.

like:

Can you imagine trying to ingest a podcast that's just emails?

like:

It's horrendous.

like:

And it's because there's so much richness and depth in meaning in the language that

like:

is just lost when you strip it of its phonetic, I'm going to call it a medium.

like:

And that can lead people to think that it has to do with sound, that's the

like:

most common modality for people, but sign language has phonetics, they have

like:

particular places where they make signs.

like:

They have particular ways that they do them to inflect and express more emotion.

like:

Their phonetics exists even outside of the verbal modality.

like:

that's important because that's where I see the most improvements coming to LLMs

like:

in the future is being able to process

like:

phonetic information without having to convert it into text

like:

or process phonetic information and compare it against the text.

like:

that can be incredibly helpful for your model's understanding.

like:

those are the five features of language that we break

like:

things down into in the book.

like:

And they're largely agreed upon.

like:

There are some other linguistic features that are incredibly important, stuff like

like:

dialogue, that we haven't even covered.

like:

beyond that.

like:

Yeah, we can talk about semiotics too.

like:

That's Charles Sanders Peirce, smart dude from the 1800s who created a lot

Speaker:

of structure and organization; we dive into that very lightly in the book.

like:

I don't think that you need a grounding in semiotics in order to improve

like:

your ability to interact with LLMs.

like:

But it is helpful for organizing all of these other concepts.

like:

how do we create a mental map for how stuff needs to be processed

like:

within a machine learning pipeline?

like:

How do we make sure that we're not mixing things up and inadvertently destroying

like:

our model's ability to see things, right?

like:

If we put embeddings before tokenization, it breaks your process.

like:

it's helpful for organizing things and it's also helpful for understanding

like:

how conversation happens and how I say something and it moves through

like:

your mind to create an interpretation.

like:

that's by far like the most theoretical out there concept that

like:

we get into in the whole book.

like:

And together you came up with this language definition as being, as a

like:

concept, "an abstraction of feelings and thoughts that occur to us in our heads".

like:

And I'll be honest, I initially thought it sucked.

like:

because it's a little bit, it's a little bit wishy washy.

like:

I wanted something a bit more concrete.

like:

But then, as I looked up all the other definitions in different contexts, I

like:

was like, Okay, I can clearly not come up with anything better than that.

like:

So I think I'm ready to yield now and say that this is actually

like:

capturing it pretty well.

like:

Putting abstraction in it, sounds also vaguely techie, so that helps.

like:

How did you come up with that definition?

like:

I didn't. I would love to take credit for it. No, that definition has been around for a long time within the linguistics community, and one of the best examples of why it really works is babies. Babies have no idea how to express their thoughts, but somehow they get them across. When a baby is happy, we can tell; when a baby is crying, we can infer that it needs something. Babies are able to communicate without language, meaning that language is something we created to shorten the conversation.

The reason I call it an abstraction is that we have abstract ideas. You've probably come up against a situation where you're feeling something and you don't know the words to really express it. I think that's a pretty universal human adult thing that has happened at least once in your life. It's happened to me a bunch of times, and it really illustrates that the language we use is describing what's in here; it isn't itself what's in here. It's a hard concept. Once you get there, though, it really helps with LLMs, because you realize that the language we're using is a crutch, and that's all the LLMs have in the first place. This is another thing that goes towards the miraculous nature of them working at all: they're dealing with an abstraction of an abstraction, at least, in order to communicate with us.

So let's say that I buy that. My first question, going back to your baby example: isn't what the baby's doing some form of a language? What's the line between what is and what isn't? What's the line between a language and communication?

I like that. That's a question that I bet a lot of people have, and we'll probably talk about it in an appendix for curious readers. There are a lot of lines between straight-up communication and a language, but one of my favorites is the ability to talk about something that is not physically present. Bees have communication. Gibbons have communication. Babies have communication. Babies, though, are unable to express any ideas about stuff that is not physically present; you can't talk to a baby about theoretical physics. I mean, you can, but what are you going to get back? You can talk to a baby about my Star Wars posters, because I can point at them, they're right there; but if I'm in a different room, the baby's not going to be able to talk to me about them. And that's the difference, or one of the differences. That's the one I'd like to highlight, though: the fact that we can speak about things that are not physically right here with us, that we can point at, is the distinction between communication and language, because babies are communicating. But once they get to that point, it really deepens the interaction that you're able to have with them.

So now, equipped with all that knowledge, I'm going to try to prompt engineer you and give you this prompt: I'm a five-year-old baby who has language now, and who's very curious about understanding how we got from bag of words and counting frequencies all the way to LLMs and ChatGPT and people worrying about the Terminator actually coming to life. Could you walk me through the high-level ideas that were important and build up to what we're seeing today?

The bag of words is really easy to think about, especially if you keep your tokenization incredibly simple. Sorry, I'm already out of five-year-old territory. You just count words. If I take that sentence, "you; just; count; words", each of those has a count of one. If I add another sentence, "I like Star Wars", all of those still have a count of just one. And then if I add another, "do you like Star Wars?", then "you" and "star" and "wars" all go up to two. That's it. That's a bag of words model.
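To make that concrete, here's a minimal sketch of that exact counting exercise; the whitespace tokenization and lowercasing are simplifying assumptions, not part of any particular library's behavior.

```python
from collections import Counter

# A tiny bag-of-words model: tokenize naively on whitespace and count.
corpus = [
    "you just count words",
    "I like Star Wars",
    "do you like Star Wars",
]

bag = Counter()
for sentence in corpus:
    bag.update(sentence.lower().split())

print(bag)
# Counter({'you': 2, 'like': 2, 'star': 2, 'wars': 2, 'just': 1, ...})
```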

Why is it important? What can it do?

I think that bag of words is the first model that we really have to explain being data-driven. It's just keeping track of things. If you look at a bag of words model for your workouts, it's just: how often do you do certain things? How often are you doing a bicep workout versus a pectoral workout? How often are you doing which thing? It's just being data-driven. It's the first step, right? You're not looking at any features. You're not really caring about how these things interact with each other. You're just keeping track.

So I guess, with that information from your example, I can guess whether you are skipping leg days, and I can see what's important to you. Or, if I'm counting words in U.S. presidents' speeches, I can say, like you described in your book, whether it's a wartime or a peacetime president, and what they really try to get across.

This is something that you can use for anything you count. In soccer: which players score goals, and how often? That is a bag of words model. You're not tracking words; it's a bag of goals, or a bag of whatever else.

So what's the next step from there?

Bag of words was really monumental just because it's so simple, but it's so powerful, because the words you use when you're describing sports are very different from the words you use describing politics. So just picking up on certain words and their counts helps us understand the overall subject. But it really lacked any sort of structure, because the order of words also matters, right? "The cat in the hat" versus "the cat's hat": they both have the word 'cat', they both have 'hat', but they mean different things because of the order of the words. And so that kind of led to n-gram models. Instead of just simple words, we would also take n-grams, which are n number of words in a certain order, and we would start cataloging those. So, more than just words, we're getting n-grams. And that improves our understanding of the language, because now we have embedded some syntax in it. We understand some ordering of words, and that's able to improve our categorization. From there, though, we're not really able to make any predictions about what next word is about to come up or anything like that; when it comes to bag of words or n-grams, they're really more for categorization.
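A sliding-window sketch of n-gram extraction, again assuming simple whitespace tokens:

```python
def ngrams(tokens, n):
    """Slide a window of width n across the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat in the hat".split()
print(ngrams(tokens, 2))
# [('the', 'cat'), ('cat', 'in'), ('in', 'the'), ('the', 'hat')]
# Counting these bigrams (instead of lone words) bakes word order
# into the same simple counting machinery as the bag of words.
```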

And so that kind of led to Bayesian techniques. Not to go really deeply into Bayesian statistics, but, yeah, sorry to all the Bayesian fanboys, we're going to go about as deep into this as we did into pragmatics. It's just that, based on the priors of the words that came before, we can predict the next word to come up. So if every single time we saw 'I am a' in text it was followed by 'man', then it's going to predict that the next word is 'man' instead of other words that easily could have come up, like 'woman' or 'girl' or 'boy' or 'cook' or 'professional athlete'. Certain things that could come up are going to be a lot rarer, like 'I am an astronaut': a lot fewer people have been astronauts, so it's going to have a very low probability of being the next word predicted. But it gives us this opportunity to look at what the next word predicted is.
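As a sketch, that "priors of the words that came before" idea can be as simple as conditional frequencies over bigram counts; the toy corpus here is invented for illustration.

```python
from collections import Counter, defaultdict

# Estimate P(next_word | previous_word) from raw bigram counts.
corpus = "i am a man . i am a man . i am an astronaut .".split()

transitions = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev][nxt] += 1

def next_word_probs(prev):
    counts = transitions[prev]
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

print(next_word_probs("a"))   # {'man': 1.0}
print(next_word_probs("am"))  # {'a': 0.66..., 'an': 0.33...}
```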

From there, we move on to what's called Markov chains. We're swinging back towards the n-gram model, but it gives us a bit of prediction next. I actually really love Markov chains, because they provide very fast predictive text. Markov chains are essentially what's been fueling predictive text for Google search and things like that; it's been the technology leading that charge for a really long time. And it's just a very basic way of using n-grams to make predictions of the future.

That is obviously reducing it; that's not exactly how it works. But it's a bag of n-grams where you take a state at each point in a sequence, look at all the times the preceding n-grams have occurred in that sequence, and then from that you can model a probability about what comes next. Instead of just looking at each n-gram by itself, you give it state, and it's a bag of n-grams. It's really fun. It's a probabilistic bag of n-grams. That's how the chains work.
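A minimal sketch of that "probabilistic bag of n-grams" with a word-level state, using an invented toy sentence:

```python
import random
from collections import Counter, defaultdict

# The state is the last word seen; the next word is sampled from the
# counts of what has followed that state before.
text = "the cat in the hat sat on the mat and the cat ran".split()

chain = defaultdict(Counter)
for prev, nxt in zip(text, text[1:]):
    chain[prev][nxt] += 1

def generate(start, length=6):
    word, out = start, [start]
    for _ in range(length):
        counts = chain[word]
        if not counts:
            break
        word = random.choices(list(counts), weights=counts.values())[0]
        out.append(word)
    return " ".join(out)

print(generate("the"))  # e.g. "the cat in the hat sat on"
```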

One of my favorite parts, and I like that you kept track of this quote, is that Markov models represent the first comprehensive attempt to actually model language. Which is funny, because Markov was not trying to model language initially; he was just trying to win an argument. He eventually used it to look at distributions in particular Russian authors, and at distributions in Russian government official speeches. He knew what he had and he believed in it, and I love that. What a great piece of history, anyway.

Continuous bag of words is where we start essentially taking the logic of a Markov chain: if we keep track of where things appear and how often they appear there, it helps us model what could appear next. And this is the first moment where we're really coming full circle, going right back to bag of words and just adding context for position. From the context of the bag of words, the literal counting of things, we're able to create embeddings.

I don't know if a lot of people are aware, but bag of words is how Word2vec came to be. Word2vec was huge in, I think, 2015, 2016, and it stayed huge; Gensim is still one of the most downloaded natural language processing libraries in Python, for Word2vec and for GloVe. Continuous bag of words, just adding that one little thing, adds all this context so that we can create embeddings: vectors that we can compare between words.

This all comes from the logic of, I forgot that dude's name: tell me the company that a word keeps, and I'll tell you what that word means. Just what's around a word influences its meaning, which goes directly against a lot of previous linguists' thought that syntax and semantics are absolutely not related at all. That's one of the big things from Chomsky, the "colorless green ideas sleep furiously" nonsense. There's some sense to it, and taking advantage of that with continuous bag of words, we can create, like I said, these vectors that we can then compare, and that's really interesting. That is what fuels LLMs now: this exact same continuous bag of words modeling technique. It's been built upon a little bit, but that bag of words is still fundamental to how embeddings are created. Bag of words and positionality, and we can get into RoPE scaling, all of these rotational plugins that you can use to get longer sequences embedded correctly, or at least better.

That's one of the hard things when we're talking about language modeling: what is good and what is better? A lot of people like to appeal to "this is how humans do it". I don't know if humans are incredibly efficient when we do it, but it's fine. Then we get into the 1960s, the very first perceptrons.

Before we go there, can we spend a little longer on what the embeddings actually are? You mentioned Word2vec, you mentioned word vectors and embeddings, but for somebody listening to us from the start, it's probably not clear what that is. Can we delve in a little bit?

Yeah, absolutely. So embeddings are the vectors that come out of models like continuous bag of words. When you look at a modern machine learning pipeline, there are multiple models that you go through, and we just abstract all of it and call it one model. When you look at GPT-3, ChatGPT, it has a model, what they call a byte pair encoding model, to do its tokenization. And then it has a model to do embeddings. That model is fundamentally a continuous bag of words. It's built on top of it a little bit by, like I said, keeping track of not just how many times a word occurs, but how many times a word occurs in particular positions. And then on top of that, it keeps track of the flip: whether it's in an odd or an even position within a sentence, and it assigns it cosine or sine based on that, in order to try to insert back some of the meaning that was taken out by the tokenization. Because tokenization just assigns each token a number in a dictionary, and you have a way to get all words into that dictionary and then come back out of it. So it takes all of the meaning out; it's just one number. The embeddings attempt to put some of the meaning back into it using positionality, using continuous language modeling techniques.
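For the curious reader, here's a sketch of the standard sinusoidal positional encoding from the original Transformer paper, which is the usual concrete form of the sine/cosine idea described above; note that in the published scheme the sine/cosine alternation runs across embedding dimensions. The sizes here are tiny for display.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]        # token positions
    i = np.arange(d_model)[None, :]          # embedding dimensions
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])    # even dims get sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])    # odd dims get cosine
    return pe                                # added to token embeddings

# d_model would be something like 768 in practice; 8 keeps it readable.
print(positional_encoding(4, 8).round(2))
```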

Embeddings, really simply: they're not perfect, they're just an approximation of that meaning. And because we're able to put these words into a vectorized space, we can start doing things that make sense and make us feel like we're headed in the right direction. The classic example is, when we first discovered embeddings, we took the embedding of 'king', we subtracted 'man' from it, we then added the embedding of 'woman', and the closest embedding to that was 'queen'. So we start to get this vectorized space that makes sense. These words start to have connections to each other, and they start to make semantic sense to us as humans.
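If you want to try that analogy yourself, here's a short sketch using gensim's downloadable GloVe vectors; the dataset name and the exact neighbor score depend on your install, so treat both as assumptions to verify.

```python
import gensim.downloader

# Small pretrained GloVe vectors; any KeyedVectors would work here.
vectors = gensim.downloader.load("glove-wiki-gigaword-50")

# king - man + woman ~= queen
print(vectors.most_similar(positive=["king", "woman"],
                           negative=["man"], topn=1))
# -> [('queen', ...)], exact score depending on the vectors used
```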

However, embeddings are still an approximation, right? So if you were to do that with every combination, it's interesting: what do you get when you start adding or subtracting words that don't necessarily make any sense together?

A good quintessential example of that: you take the vector for 'king', you subtract the vector for 'wolf', and you add the vector for 'prince', and you get the vector for 'village', or at least pretty close to it. That doesn't make any sense. So there's still lots of, okay, these are starting to add meaning, not always, but sometimes. Embeddings are ultimately an approximation, and it's something we're constantly trying to learn and improve.

it's something we're constantly trying to learn and improve

like:

If your listeners are wondering how to keep up in space, like embeddings are

like:

probably the number one thing to keep track of OpenAI recently released, logic

like:

for being able to change the size of embeddings, to me, like being pretty

like:

deep into this, it feels groundbreaking.

like:

Because normally you have to structure these vectors so that they're all the same

like:

size and each point within that vector represents meaning negative or positive

like:

and it's very structured and not malleable and so the idea that you could take you

like:

all of your embedding space and change the size of it at your whim Is just amazing.

like:

that's one of the things that I see as a huge groundbreaking piece of technology

like:

that OpenAI is continuing to lead in.

like:

yeah, and if you're ever in doubt for oh man, is this paper important?

like:

If it's about embeddings and doing really cool things with embeddings, probably.

like:

I think the one question for anybody trying to picture that: what's the dimension of all these vectors? Is it the entire vocabulary? Are there different techniques?

Yeah. Currently the number one dimensionality, an unspoken industry standard, is 768. That's a number that pretty much every NLP practitioner knows. The reason OpenAI's embeddings initially seemed really cool, and people thought they were super dense, is that they were, what, 1536, which is 768 doubled, right? You're going to see multiples of 768 all over the place here. And that's not because the number is super significant; it's just the first embedding size that we found that tended to work better than the others.

So that's the "more art than science" part of this.

It's brute-force testing, yeah. People went through and tested 767, 766, 765, landed on that one, and it worked; it's the best one that we've found so far. Even the doubled embeddings from OpenAI offer only a marginal improvement in that understanding space.

I think we can move on to the multilayer perceptrons.

Okay. A perceptron is essentially just a linear transformation of data. From a statistical standpoint, if you have three things about something, you can just add those things together and you get a description of that thing. Just summing them; that's abstracting it a little much, especially if machine learning practitioners are listening: we can do linear transformations. The easiest way for me to think about it is that you perform one action on a group of features and you get something out of it. That's not, by itself, really helpful. But once you get into having multiple layers, this is the MLP, the multilayer perceptron. Once you get into multiple layers where you are adding these transformations together, and in between those layers you have non-linear activation functions so that you can create non-linear relationships between sets of linear transformations, you can get into really cool spaces.

And one of the first things that any machine learning practitioner learns, at least among a lot of the people I've talked to, is that just adding more layers does not make it better. In fact, the cool part is finding the minimum number of layers that you need in order to model the relationship between two points. That's a little abstract, so I think the quintessential example is detecting which type of iris flower it is from an image. We don't necessarily know how many features there are, but we can vectorize the entire picture of an iris flower, and then discover that the minimum number of layers is, I think, about five in order to get really good accuracy on detecting which iris flower it is.

Yeah, multilayer perceptrons are the feed-forward networks. They're the basis of everything that comes after: whether it's recurrent networks, or even Transformers, they have feed-forward networks inside them, and that's the basis of it right there.

How do you choose the sizes? Is it all just trial and error as well, for the number of layers and the sizes of the hidden layers? Are there not any rules that always work?

Yeah, so, going through a feed-forward network, and this comes from trial and error, from a lot of people trying different stuff: generally your initial dimensionality could be something like 768. That's your initial hidden layer; it's a good number for it, an embedding dimension that we're familiar with. But then we want the next hidden layer to be double that, and then we want to go smaller and smaller until we hit our final output classification layer. So we want one big jump, and then small.

The way to think about that theoretically is: you want to model the number of features that you're looking for, and then modeling double that is just a good way of covering all the features that we might not know about, that we might not even be keeping track of. Let's see if the model can figure them out mathematically. And then we want to narrow it down, narrow it down, narrow it down, until we get to our actual classification, which in language modeling is: what is the next word?

Got it. So double it, and then boil it down to the size that you're actually looking for across a bunch of layers, and hope for the best.
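A sketch of that heuristic as a concrete network, assuming PyTorch; all the layer widths here are illustrative, not a recipe.

```python
import torch.nn as nn

# Big jump up from the embedding width, then narrow down to the output.
mlp = nn.Sequential(
    nn.Linear(768, 1536),   # double the input width first
    nn.ReLU(),              # non-linearity between linear layers
    nn.Linear(1536, 512),   # then narrow it down...
    nn.ReLU(),
    nn.Linear(512, 128),
    nn.ReLU(),
    nn.Linear(128, 10),     # ...to the final classification layer
)
```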

Okay. And that's why, when OpenAI doubled the embedding layers, it was a marginal improvement, but a predictable one, because that's normal; people do that.

Are there any particular well-known configurations of these neural networks that just work for a bunch of problems, something that you keep seeing over and over? Or is it more custom for every problem, and you just follow the heuristics that you described?

As far as model architecture, no, it's basically the heuristics that I described, and then people will experiment and tune them and find that, statistically, if this layer of the model is bigger, then it works better, but it follows that general structure. I think one of the papers that I would point to for this is ULMFiT, which is basically a methodology for fine-tuning. It experiments with gradual unfreezing of layers, where, when you're training, you start with only the very last classification layer, everything else frozen exactly as it is, and you only train that one. Then you unfreeze, unfreeze, and test each layer as you're training. That tends to help. Even now, that's abstracted within the Hugging Face trainer class, and within pretty much every model.fit methodology, because it works.
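A minimal hand-rolled sketch of gradual unfreezing in that style, assuming PyTorch; the model and the training loop placeholder are illustrative.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool):
    for p in module.parameters():
        p.requires_grad = trainable

# Any stack of layers ending in a classifier head will do.
model = nn.Sequential(
    nn.Linear(768, 768), nn.ReLU(),
    nn.Linear(768, 768), nn.ReLU(),
    nn.Linear(768, 2),          # classification head
)

set_trainable(model, False)     # freeze everything first
for layer in reversed(list(model.children())):
    set_trainable(layer, True)  # thaw one layer group per stage
    # ... train for a few epochs here before the next thaw ...
```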

Awesome. What's next in our journey?

Probably just the fact that multilayer perceptrons struggle with sequences. Even if you try to embed things and keep some of that positional encoding within your embeddings, they struggle to model multiple things where the order matters. And in language, the order matters, sometimes. Sometimes it's normal to say gibberish, and knowing which is which is extremely difficult. To solve that, I don't know if we need to go into recurrent neural networks generally, but we definitely need to talk about LSTMs, long short-term memory networks. They are recurrent neural networks to start with, but they added some really important things. For example, when I'm talking, you are kind of consciously predicting what I might be saying; you can hear what I'm saying and you're trying to figure it out as it goes, to understand it. We call that active listening. That's what happens. LSTMs model that a little bit, in that they take the sequences and allow the model to try to predict both going forwards and backwards, instead of just the one way. That bidirectionality is computationally expensive; it takes a lot longer, which is why I think these are not used as much anymore. But it was really novel, and it did help a lot in predicting sequences. It was phenomenal for language modeling.
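A sketch of that forwards-and-backwards reading as a bidirectional LSTM, assuming PyTorch; the sizes are illustrative.

```python
import torch
import torch.nn as nn

# One layer reads the sequence left-to-right, a second right-to-left,
# and their hidden states are concatenated at every position.
lstm = nn.LSTM(input_size=128, hidden_size=256,
               num_layers=1, bidirectional=True, batch_first=True)

x = torch.randn(1, 10, 128)   # (batch, seq_len, embedding_dim)
out, (h, c) = lstm(x)
print(out.shape)              # torch.Size([1, 10, 512]): both directions
```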

Beyond that, there's attention within LSTMs. When attention came out, adding attention to whatever you were doing was phenomenal. It added an extra layer of non-linearity: when the model was going through and trying to search for what word might come next, it not only had all the modeling that we've already talked about, it also had the ability to search, and to search not for that exact thing, but for something similar. That just exploded in popularity because it works; it was phenomenal.

However, the difficulty with LSTMs is that they're computationally expensive. They're slow; it's a lot of math to get through every single layer, let alone trying to predict and stream those predictions in a sequence. You're going at one token per 30 seconds, and that's difficult for models that are the same size as transformers, for example. So it was a lot of really cool stuff that helped us solve, basically, how to get to the next step. It was just computationally expensive and slow: not very practical in use, but important.

Talking about practicality: it's great that it's accurate, and I think accuracy is incredibly practical. But from a customer-experience standpoint it isn't practical, right? Customers don't like waiting a long time for the right answer, because they might be able to find the right answer themselves in that amount of time anyway.

And then from there, do we jump to attention?

At this point, we've gone through the history of the field modeling language, building up, and we've finally reached attention. Attention is the backbone of transformers, which is what LLMs are built on. Attention adds a non-linearity, and it was a breakthrough in how we're able to connect the words. Attention, really quickly, is just creating these dictionaries, key-value pairs of every word to every other word in the token space, and then being able to query them. For each word, we're able to build up the importance of the other words that matter to it. It's in a quadratic space, so it's more than a linear space, but it's a reasonable amount of time to compute these dictionaries, the key-values, and then query them and understand the importance of other words. It's the backbone of what all these different models are doing. And even, as Chris mentioned, we could inject attention into those previous RNNs, LSTMs, et cetera. But it was the backbone of building the transformer model, which came out in the catchy paper "Attention Is All You Need", where essentially all they use is attention.

It's a meme, right? We've seen a whole bunch of other papers afterwards going "no, this is all you need", or "no, you don't need that". But the reason it's a meme is that they took out everything that was supposedly novel about the long short-term memory, the LSTM. They used only attention and feed-forward networks.

Could you give us an example of what that would look like on a very stripped-down thing? What does that dictionary look like, for visualization? Not the encode and decode, just the attention itself. You mentioned a key-value from basically every combination. Do you have to pre-compute every combination within the vocabulary?

You can take a sentence that you're feeding into the attention algorithm: "the cat in the hat", since I used that earlier. Essentially you would have a dictionary where "the" is compared to every other word, "cat", "in", "the", "hat", and it comes up with similarity metrics for the importance of all the other words. Then you do that for "cat", and for "in", and for "the", and for "hat", and it comes up with a dictionary, essentially, of key-value pairs for all the other words, helping you understand the importance of the other words in there. And then the query algorithm that runs essentially helps us predict the next word that comes afterwards, based on how important all of those dictionaries are, adding them up. And all of this happens to happen in quadratic time.

One of the nice, novel things about this is that the query and key vectors, where your query vector is the word that you're looking at in the utterance and your key vector is the key in the dictionary, are not one-hot encoded. We haven't even mentioned this, but that's a vector that is 0, 0, 0, 0, 0, 0, 1, 0, 0, 0. That's how a lot of these things had been represented previously, coming off of the bag of words: the idea that we can model these things with vectors that just capture whether this word appeared or not, and where it appeared. That was positionality. With "Attention Is All You Need", you can immediately see a problem with one-hot encoding: it's very sparse, especially as you're getting into 768 dimensions. You have just one 1 and a whole bunch of zeros, and those zeros don't really matter. So one of the breakthroughs here was using dense vectors for queries and keys, in order to get values that are also dense.

I think one of my favorite visualizations of it is from Jesse Vig; it's called BertViz, on GitHub. I've used it in production environments to show that, hey, our model is not understanding this, because look at the attention: all of the queries are relating to the key of the wrong word. If you look at words with semantic ambiguity, I think the quintessential example is "time flies like an arrow", where "flies" is also a word that could mean multiple small little bugs buzzing around. How do we know that it's not that word? It's because of its position in the sentence that we know it is a verb, and that it's referring to "time" and to "arrow". And we can see that predictably within attention, because that word is determined to be important: that query is determined to be important as it relates to the keys of "time" and "arrow" within query-key-value attention. That's what that dictionary looks like, and that's why it's useful.

And, I guess, the representation of the importance: how do we actually come up with that?

I think it's the dot product; we're comparing the vectors between the query and the key. Dot-product attention is, I'm pretty sure, not where it started, but I think that's where we're at right now. It's the industry standard that everybody uses. It's just multiplying the vectors together: essentially you take the dot product of the two vectors, and that's where we get the comparison and the relative importance values. It's not magic, it's math.
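Here's that math as a minimal sketch of scaled dot-product attention, the form used in "Attention Is All You Need"; the random matrices stand in for learned projections of a five-token sentence.

```python
import numpy as np

# Compare every query against every key with a dot product, softmax the
# scores per query, and use them to take a weighted sum of the values.
def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq, seq) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax per query
    return weights @ V                              # weighted sum of values

seq_len, d_k = 5, 8                                 # e.g. "the cat in the hat"
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)
print(attention(Q, K, V).shape)                     # (5, 8)
```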

Kind of the same thing every time? Okay. And then with that, we've got GPT, the generative pre-trained transformer model. What's so groundbreaking about that?

As opposed to the original transformer, they only use a decoder. The original transformer had attention-based encoders, which changed your embeddings into essentially another embedding, which was then taken by your decoder and used to predict the next word. So it had two networks linked together in the middle in order to produce your next word. The reason this is important is that it goes back to that original idea we talked about, of language as an abstraction. The authors of "Attention Is All You Need" looked at that abstraction and said, hey, can we model that? That's what an encoder is. When you look at models like BERT, it's taking your input and putting it into a new abstract space with lots of non-linear transformations, and that's incredibly useful. And so the GPT models were groundbreaking because they said: we don't need that. We just need the decoder, and we're basically just going to use syntax. The thought process there is that syntax is related to semantics more deeply than linguists are able to conceptualize in an easy-to-understand way. We know that it's true, and we know that it's predictive, especially looking at how good GPT-3 and GPT-4 are. Even looking at the open source stuff: Llama is a decoder-only network, and it rocks.

I have a suspicion that we're going to hit a point later where Google is going to blow everybody out of the water with another T5, another version of that that puts the encoder back in. I don't know how we're going to get to that point, though, because the decoder-only models work so well. And they're faster, they're less computationally expensive, because you're taking probably a third of the model and just throwing it away.
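Mechanically, the decoder-only setup comes down to a causal mask on the attention scores, so each position can only look backwards and the model predicts the next token; a minimal sketch with invented scores:

```python
import numpy as np

seq_len = 5
scores = np.random.randn(seq_len, seq_len)  # raw attention scores

# Mask everything above the diagonal: no peeking at future tokens.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.round(2))  # lower-triangular: row i only attends to 0..i
```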

So you mentioned Llama, and I think that might be a good segue from what is essentially about a third of your book. For everybody who wants to go and jump into more details and see actual Python implementations of a lot of what we just covered, the book is called Production LLMs. It's available on manning.com, and I'm pretty sure you're going to love it. So, going back to Llama, let's do a little hall-of-fame rundown of the landmark, important models from the last few years. Where should we start?

I would probably start with the original transformer; they deserve credit. A lot of the people who wrote that paper, Vaswani et al., have gone on to found or co-found companies that are now competing in this space, whether that's Anthropic or Character.ai. Those are the people that created the Transformer, and they're still building on it. I think that's the first one I'd put in the Hall of Fame. What would you say, Matt?

I think part of this question is what is the first LLM versus what is the first Hall of Fame model. Yeah, Transformers, and BERT. BERT is incredibly powerful, and because it's so small, it's not in the LLM space, so it's often overlooked. I think many companies are still looking at these massive LLM models for problems they could solve with a simple BERT model. But because they're only getting into this space now, they immediately think, hey, we have to use an LLM, right? They didn't care in 2017 about what was there.

And I'll go back, I said it before: I love Markov chains. They're amazing, and they're really powerful at what they do well. Even now, a lot of people could just use Markov chains for a lot of the problems they're trying to solve with LLMs. But LLMs do give you that flexibility, just from their massive levels of computation.

If I was to point to a model that I thought was just really powerful, it would be BLOOM, actually. BLOOM was essentially the first massive large language model that was built completely transparently. It was a research project, funded in large part by the French government, and it was built completely in the open. And even though the BLOOM model today isn't seen as a very competitive model, a lot of the open source learnings, a lot of what we have nowadays, is because of what those researchers figured out while they were working on BLOOM. We got amazing libraries out of it, like DeepSpeed and other things. It really boosted the open source community, which has been one of the major driving factors of LLMs today, and probably a large part of why we could even write our book. If the open source community wasn't at where it is today, there wouldn't be much we could really tell people, other than, oh, you've got to go work for Google or Microsoft; how would we know any of it, right?

Yeah, we know about it largely because we've been involved in open source, and we built off of what those scientists did. BigScience.

So that's 2022, right? That's a couple of years now.

Yeah. And then we had Llama that became important, and Llama 2, even more important.

Yeah, and it's largely just because, I don't remember the username of who did it, but whoever put that PR on the original Llama GitHub repo with the torrent link that leaked the weights, that's the hockey stick moment for LLMs. That's what made them available to everybody. That's what enabled Stanford to create Alpaca and show that, oh man, you can make the model better with only 50k responses; you don't need tons and tons of data in order to fine-tune, get very good results, and improve on every metric. Everything since then has been building off of that exact same momentum from whoever leaked that first Llama. And Meta has benefited greatly from it too. They now have a very open, I wouldn't say completely, but a very open attitude towards the space, because they recognize how advantageous it is to have other people building on top of their model and to be considered an industry standard.

Yeah, they've really leaned into it recently, right? And how big was their stock jump?

Right. And all of the underlying architecture: these open source programmers, or even just the NVIDIA programmers, are able to go in, and because they know everything about Llama, they're able to optimize CUDA kernels and everything. So Llama has gotten faster and more proficient. With llama.cpp we're able to run it just on a CPU. There are lots of benefits, because they gave us the architecture. It was leaked, but now they've leaned into it; essentially they've given it to us.

Yeah, we just need them to release the data that they trained on, and then it's completely open, right?

But even on the data, they've told us a lot about what it is. We don't have the exact data, but we know essentially, from RedPajama, what those datasets were built off of, what they were. So we're able to replicate it really closely in the open source community.

Beyond Llama, I don't know if we have a really good list of Hall of Famers, because it's difficult to see what's going to stick around, partially because it's so difficult to evaluate these models, as opposed to BERT. Large BERT had 300 million parameters. You can run stuff to see how good those parameters are, you can hyper-tune them, you can run evaluations to see how each one is performing, and still go relatively fast. When we're getting into the 7 billion parameter range, and the 13 billion, and the 70 billion, it's much more difficult and computationally expensive to evaluate at that level. And we don't even have the ability to describe what all the parameters are doing, so our evaluation metrics are difficult to gauge. You look at MMLU, you look at a lot of the benchmarks that people are running, and they're useful. But ultimately, at this stage, we still have to go download those models and test them against our own use cases to see if they perform better, and that's incredibly time-consuming. We could talk about a lot of the models that have come out, like Capybara, Nous-Hermes, WizardCoder, and they're all great. I don't know which ones are going to be the hall of fame, or the next industry standard, though.

The next industry standard though,

like:

there's definitely some other models that we love and we talk about in our

like:

book, like Falcon, which came out of

like:

the TII and Abu Dabi, right?

like:

Like amazing model.

like:

It's,

like:

Micu.

like:

the latest Falcon is one of the largest open source models and it's

like:

come, under the Apache 2 license.

like:

So it's completely open source.

like:

the very first model that's fully open source.

like:

there's definitely amazing, progress being

like:

made and lots of different models to be paying attention to.

like:

But yeah, one of the biggest ones to pay attention to right now, I think, is OLMo. Not because it's competitive and performant, but because, like Falcon, it is 100% open source. You can see the data they trained on; you can replicate their experiments exactly. That's going to be one of the biggest drivers in this field. You look at a lot of the innovation that's happening, and it's happening on files that people are passing around on torrents. It's random users on Reddit coming up with NTK-aware scaling, and RoPE scaling after that. And they're coming up with more stuff because they have time and they want to help. A lot of these people are experts, and they're just anonymous. That's incredibly important for the space, because we're finding that people who deal with these models and use them 24/7 have skills that the researchers don't necessarily have. That's difficult to admit, being on the research part of it, but it's true.

So that's the one coming from the Allen Institute for AI, right?

Yeah, and I think they're also open-sourcing the actual training code as well. The whole thing.

That's pretty awesome.

So, with that caveat out of the way, hedging your predictions, we don't know what's going to happen tomorrow: do you see any one company getting ahead of the others? GPT-4 is still holding up well against a lot of these models, which makes me think, personally, that they have a few tweaks and hacks they haven't shared, which helps with their multi-billion valuation. Do you see anybody running away from the crowd, or is it too late now? The cat's out of the bag, and the progress is going to come from the mass of people?

I don't know. I know that I was texting with a couple of people the other day, talking about GPT-4 and how it's still relevant. Even with people talking about the performance decrease, it's still relevant, and every week, every model that comes out gets compared against GPT-4. And they're finding that most models are more performant than GPT-4 on certain things, right? It's like comparing Rain Man to an average human and asking what tasks they're good at. If the task is going to McDonald's and ordering your own food, Rain Man is not great. You've just got to find the model that's better. A good example of that with GPT-4 is math: if you need a model to perform calculations for you, that's not it. You have Wolfram Alpha, you have Goat, you have even just vanilla Llama 2, which is better at math than GPT-4, even though they weren't explicitly training on it. I think OpenAI currently has that first-to-market advantage more than anything. That's not to say that it's bad, and it's not to diminish the work that OpenAI has done, because it is phenomenal. But what's really keeping them afloat is being first to market, and the ease of use.

One other question I was holding as you were speaking: you mentioned Mixtral. What is it called, mixture of experts? What's that?

Yeah, Mixtral. It's routing. It's being smart and saying, hey, we don't need a dense feed-forward network for every single thing. Let's have a whole bunch of sparse networks, and, based on the input, route it and tell it which expert is actually going to be the best. It results in much larger models that are smaller on disk and faster to run.
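A toy sketch of that routing idea in PyTorch; the sizes, top-k, and the per-sample loop are all illustrative simplifications rather than how Mixtral is actually implemented.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # scores each expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                        # x: (batch, d_model)
        gates = self.router(x).softmax(dim=-1)
        topv, topi = gates.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for b in range(x.shape[0]):              # run only the chosen experts
            for k in range(self.top_k):
                e = topi[b, k].item()
                out[b] += topv[b, k] * self.experts[e](x[b])
        return out

print(TinyMoE()(torch.randn(2, 64)).shape)       # torch.Size([2, 64])
```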

Is that more similar to how the human brain works? Because the brain is obviously not fully connected; it's got different regions and things like that.

I would love to appeal to that authority, but I don't know. You look at MRIs and you can see, oh man, this portion is lighting up when you're experiencing that emotion or seeing that input. But we don't really have a great mapping of every person's brain.

I think the connection between a neural net and actual neurons was lost a long time ago, right? How does the human brain work, and how does it really compare to modern-day models? It's hard to really make that argument; we're still learning about how we learn. And as we do, as the neuroscience field advances, it ultimately leads to advances in the AI space, and vice versa. There are definitely connections there. But yeah, as far as your question goes, I think it's anybody's guess.

I think this is a perfect note to end on: a little bit of suspense. We're going to have to get you back at some point, when you've finished your book, to talk a little more about the actual technical problems and challenges; we haven't really touched on any of that yet. But today I certainly learned a lot from you, and I hope a lot of our listeners did as well. It was an absolute pleasure to meet you both. Thank you so much, and see you next time.
