Proudly sponsored by PyMC Labs, the Bayesian Consultancy. Book a call, or get in touch!
Check out Hugo’s latest episode with Fei-Fei Li, on How Human-Centered AI Actually Gets Built
Our theme music is « Good Bayesian », by Baba Brinkman (feat MC Lars and Mega Ran). Check out his awesome work!
Visit our Patreon page to unlock exclusive Bayesian swag ;)
Takeaways:
Chapters:
00:00 Understanding Computational Cognitive Science
13:52 Bayesian Models and Human Cognition
29:50 Eliciting Implicit Prior Distributions
38:07 The Relationship Between Human and AI Intelligence
45:15 Aligning Human and Machine Preferences
50:26 Innovations in AI and Human Interaction
55:35 Resource Rationality in Decision Making
01:00:07 Language Learning in AI Models
01:06:04 Inductive Biases in Language Learning
01:11:55 Advice for Aspiring Cognitive Scientists
01:21:19 Future Trends in Cognitive Science and AI
Thank you to my Patrons for making this episode possible!
Yusuke Saito, Avi Bryant, Ero Carrera, Giuliano Cruz, Tim Gasser, James Wade, Tradd Salvo, William Benton, James Ahloy, Robin Taylor, Chad Scherrer, Zwelithini Tunyiswa, Bertrand Wilden, James Thompson, Stephen Oates, Gian Luca Di Tanna, Jack Wells, Matthew Maldonado, Ian Costley, Ally Salim, Larry Gill, Ian Moran, Paul Oreto, Colin Caprani, Colin Carroll, Nathaniel Burbank, Michael Osthege, Rémi Louf, Clive Edelsten, Henri Wallen, Hugo Botha, Vinh Nguyen, Marcin Elantkowski, Adam C. Smith, Will Kurt, Andrew Moskowitz, Hector Munoz, Marco Gorelli, Simon Kessell, Bradley Rode, Patrick Kelley, Rick Anderson, Casper de Bruin, Philippe Labonde, Michael Hankin, Cameron Smith, Tomáš Frýda, Ryan Wesslen, Andreas Netti, Riley King, Yoshiyuki Hamajima, Sven De Maeyer, Michael DeCrescenzo, Fergal M, Mason Yahr, Naoya Kanai, Steven Rowland, Aubrey Clayton, Jeannine Sue, Omri Har Shemesh, Scott Anthony Robson, Robert Yolken, Or Duek, Pavel Dusek, Paul Cox, Andreas Kröpelin, Raphaël R, Nicolas Rode, Gabriel Stechschulte, Arkady, Kurt TeKolste, Gergely Juhasz, Marcus Nölke, Maggi Mackintosh, Grant Pezzolesi, Avram Aelony, Joshua Meehl, Javier Sabio, Kristian Higgins, Alex Jones, Gregorio Aguilar, Matt Rosinski, Bart Trudeau, Luis Fonseca, Dante Gates, Matt Niccolls, Maksim Kuznecov, Michael Thomas, Luke Gorrie, Cory Kiser, Julio, Edvin Saveljev, Frederick Ayala, Jeffrey Powell, Gal Kampel, Adan Romero, Will Geary, Blake Walters, Jonathan Morgan, Francesco Madrisotti, Ivy Huang, Gary Clarke, Robert Flannery, Rasmus Hindström, Stefan, Corey Abshire, Mike Loncaric, David McCormick, Ronald Legere, Sergio Dolia, Michael Cao, Yiğit Aşık and Suyog Chandramouli.
Links from the show:
Transcript
This is an automatic transcript and may therefore contain errors. Please get in touch if you're willing to correct them.
:In this episode, I'm thrilled to welcome Tom Griffiths, one of the most influential figures
in computational cognitive science.
2
:Tom is the Henry R. Luce Professor at Princeton University, where he bridges the fields of
psychology and computer science to study how humans, and increasingly machines, learn,
3
:think, and make decisions.
4
:He's also the co-author of the best-selling book, Algorithms to Live By, and the recent
one, Bayesian Models of Cognition,
5
:Reverse Engineering the Mind.
6
:In our conversation, Tom walks us through how Bayesian models can help us understand the
foundations of human intelligence, from how we form inductive inferences to how our
7
:cognitive processes differ from those of modern AI systems.
8
:We dive into the idea of humans as noisy Bayesian agents, the importance of prior
elicitation in psychology, and why individual responses may sometimes offer more insight
9
:than the averaged wisdom.
10
:We also talk about the tension and potential synergy between human and machine
intelligence.
11
:Tom shares his thoughts on generative AI, representational alignment, and what it will take
for artificial systems to truly complement human reasoning and values.
12
:And as always, we end with practical advice for researchers entering the field and a peek
into Tom's future work.
13
:This is Learning Bayesian Statistics, episode 132, recorded
14
:December 18, 2024.
15
:Welcome to Learning Bayesian Statistics, a podcast about Bayesian inference, the methods, the
projects, and the people who make it possible.
16
:I'm your host, Alex Andorra.
17
:You can follow me on Twitter at alex_andorra,
18
:like the country.
19
:For any info about the show, learnbayesstats.com is Laplace to be.
20
:Show notes, becoming a corporate sponsor, unlocking Bayesian merch, supporting the show on
Patreon, everything is in there.
21
:That's learnbayesstats.com.
22
:If you're interested in one-on-one mentorship, online courses, or statistical consulting,
feel free to reach out and book a call at topmate.io/alex_andorra.
23
:See you around, folks.
24
:and best Bayesian wishes to you all.
25
:And if today's discussion sparked ideas for your business, well, our team at PyMC Labs can
help bring them to life.
26
:Check us out at pymc-labs.com.
27
:Hello my dear Bayesians!
28
:Do you remember Hugo Bowne-Anderson?
29
:Of course you do!
30
:He was in episode 122 of the podcast and we talked about learning and teaching in the age of
AI.
31
:And that's because well, Hugo is not only a friend of the show, he's also an independent
and brilliant AI and data scientist at Vanishing Gradients.
32
:And he just dropped a new episode of his podcast, High Signal.
33
:with Fei-Fei Li.
34
:They talk about the rise of foundation models, what human-centered AI means in real
systems like elder care and education, and why spatial intelligence might be the next big
35
:shift beyond LLMs.
36
:Honestly, it's a wide-ranging conversation.
37
:It's a thoughtful conversation on how AI is reshaping society and how we can shape it
back.
38
:It's definitely worth a listen, so I wanted to highlight it here.
39
:In case you want to check it out, of course, the link is in the show notes.
40
:Okay, on with the show now.
41
:Tom Griffiths, welcome to Learning Bayesian Statistics.
42
:Hi, great to be here.
43
:Yeah, thank you so much for taking the time.
44
:I know you're very busy and it was hard to find a slot in your schedule for that one.
45
:But thank you so much.
46
:I'm very grateful and very happy to have you on the show.
47
:Before we talk about all the amazing work you do on the neuroscience side, uh maybe can
you tell our listeners
48
:Basically what you're doing nowadays and how you ended up working on this.
49
:Yeah, so I'm a computational cognitive scientist.
50
:What that means is trying to understand the mathematical principles behind intelligence.
51
:And the way that I do that is by thinking about the abstract computational problems that
human minds have to solve, what the ideal solutions to those problems look like, and then
52
:how we can use those solutions to better understand human behavior.
53
:And so for the kinds of problems that I'm interested in, they're things like
54
:learning and inference problems for which Bayesian statistics is particularly useful, as well
as all sorts of other ideas that come from computer science, machine learning, and AI.
55
:And so for that reason, we use these methods extensively as a kind of standard against
which we can compare human cognition, but also taking human cognition and the amazing
56
:things that people do.
57
:as a source of insights about ways that we can improve our current approaches to
statistical learning.
58
:Yeah, so I am guessing that to do that, you are using at least sometimes some Bayesian
statistics, are you?
59
:Yeah.
60
:So one way of thinking about Bayes' rule is that it's really a theory of inductive
inference, right?
61
:So if you learn Bayes' rule in a probability class, you just sort of learn it as a,
62
:tautology of probability theory.
63
:And if you learn it in a statistics class, you learn it as an alternative way of thinking
about how to perform inference.
64
:But if you learn it in a cognitive science class, you learn it as a theory of how it is
that we go from limited data to conclusions about uncertain hypotheses.
65
:And that's really fundamental to understanding things that people do.
66
:So one of the big projects of cognitive science is trying to understand how people are
able to learn so much from so little.
67
:We learn language from just a few years of exposure.
68
:We can learn new concepts and new words from just a single example.
69
:How is it that we make those kinds of inductive leaps?
70
:The best way to understand that is in terms of the inductive biases that people bring to
the learning problems that they solve, right?
71
:The things other than the data that allow them to draw particular conclusions.
72
:And Bayes' rule allows us to formulate those inductive biases in terms of prior
distributions and
73
:of structured hypotheses that those distributions are defined over in a way that allows us
to then explain these aspects of human cognition.
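For readers who want to see what that looks like in practice, here is a minimal sketch of Bayes' rule as a theory of inductive inference, in the spirit of the word-learning example mentioned a moment ago. The hypotheses, prior weights, and hypothesis sizes are made-up illustrative values, not numbers from Tom's work.

```python
import numpy as np

# Hypotheses about what a new word could refer to, with a prior that encodes
# the learner's inductive bias (all numbers are purely illustrative).
hypotheses = ["dalmatians", "dogs", "animals"]
prior = np.array([0.1, 0.6, 0.3])

# Likelihood of one labelled example (a dalmatian) under each hypothesis:
# smaller, more specific hypotheses make that particular example more probable.
n_members = np.array([1.0, 50.0, 1000.0])
likelihood = 1.0 / n_members

# Bayes' rule: the posterior is proportional to likelihood times prior.
posterior = likelihood * prior
posterior /= posterior.sum()

for h, p in zip(hypotheses, posterior):
    print(f"{h}: {p:.3f}")
```

With a single example, the likelihood pulls the posterior toward the narrower hypothesis, while the prior carries the inductive bias that does the rest of the work; that combination is what lets a learner draw a strong conclusion from very little data.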
74
:Okay, yeah, I see.
75
:And so we'll definitely get back to that.
76
:But first, just to dive a bit deeper into your background, do you remember when you were first
exposed actually to Bayesian statistics?
77
:Yeah.
78
:So my undergraduate degree is in psychology.
79
:And basically, uh what happened was I was a student in Australia, I grew up in Perth in
Western Australia.
80
:And in high school, actually, you get to learn some statistics.
81
:So, you know, I took math classes and learned probability theory and, you know, linear
regression, a lot of like basic statistics in high school.
82
:And then when I was going to university, um
83
:The way that it works there is you have to choose what degree you want to do in your last
year of high school.
84
:And so they were like, do you want to be a lawyer?
85
:Do you want to be a doctor?
86
:Do you want to be a scientist?
87
:You know, these were all of the options.
88
:And I was like, I really don't know what I want to do.
89
:I just want to learn things that I haven't had the chance to learn about.
90
:So I ended up enrolling in a bachelor of arts degree and choosing to study stuff that I
just had not had the chance to get exposed to in high school, like, you know, psychology,
91
:anthropology, philosophy, ancient history, and
92
:every year you had to like cut one of those out and you got more and more focused.
93
:And so the one that I found the sort of most rewarding was the psychology one.
94
:And so that's the track that I was on.
95
:And so when you learn psychology, part of what you learn is statistics, right?
96
:You, you know, take these classes and for most psychology majors, they're the worst
classes in psychology.
97
:They're the ones that everyone dreads.
98
:And for me, they were some of the most exciting classes because they were really about
99
:taking those kinds of mathematical ideas that I'd already learned a lot about and using
them to really understand the world.
100
:And so it was through those psychology statistics classes that I first learned about the
idea of Bayesian statistics and Bayes' rule and so on.
101
:I, you know, it's sort of, the way that you learn statistics in psychology is much more
kind of like being told,
102
:in this situation you use this kind of hypothesis test and you in this situation you run
this kind of analysis and I was always sitting in the back of the class and thinking well
103
:why is that what you do in that situation and why is that the kind of hypothesis test you
use or the kind of analysis you run and trying to work out from sort of mathematical first
104
:principles why these are the things that you should be doing and that was actually a
really good way of learning statistics because it sort of forced me to rediscover a lot of
105
:these things that if I'd just been in a class and they were fed to me
106
:wouldn't have been as exciting.
107
:And so I enjoyed the Bayesian perspective as a way of really kind of like making sense of
a bunch of things that were perhaps, you know, like harder to understand in the context of
108
:classical statistics.
109
:Yeah, so it sounds like it's really a...
110
:It's almost you who went toward the Bayesian stats framework,
111
:and not Bayesian stats that were imposed on you by some course, right? Because you
were asking the questions of, you know, how come we're doing that in this case and,
112
:and not this, which I, I can definitely relate to.
113
:So I always had like, frequentist stats has always been extremely hard for me to learn
because I don't have a brain that works in the way that is like, you know,
114
:Remembering in these cases, you do that in these cases, you do that.
115
:I am much better at remembering the process and the reasons behind why I do something.
116
:Whereas just, you know, without understanding why you're doing something.
117
:so the zoo of statistical tests has always been a nightmare for me because I could never
remember.
118
:whereas if you learn from first principles with a generative process, like you have to do
in a Bayesian framework, that just
119
:works way better for my brain, for instance.
120
:um And it seems to have been the same for you.
121
:Yeah, I mean, I think one of the things that I got out of learning in the way that I did
was having a kind of solid basis in frequentist statistics.
122
:um having had that already established so that then when I encountered Bayesian
statistics, it wasn't like,
123
:Like I was able to be more charitable in the way that I think about frequentist statistics.
124
:And I still use a lot of frequentist statistics when I write psychology papers because that's
really the language that that community uses, right?
125
:And the way that I think about it is, you know, via the sort of Neyman-Pearson
interpretation, something that you can think about as a way of approximating.
126
:the kind of inferences that you might make in a Bayesian setting, right?
127
:So um you can think about what a lot of classical frequentist hypothesis tests are doing
is measuring something like a unit of evidence um in a way where you're doing that
128
:approximately and you have to sort of correct for the approximations via the kinds of
methods that frequentist statisticians developed.
129
:And so that was a way that I could integrate that classical background with the
130
:perspective on statistics that I got from thinking in that Bayesian way.
131
:But I think it's also interesting to draw a distinction between what you're trying to do
as a statistician and what human brains are trying to do.
132
:So as a statistician, you're trying to convince another person that they should draw a
conclusion from the data.
133
:That's the conclusion that you're drawing from the data.
134
:and all of the controversy about prior distributions and all of the stuff that kind of
like informed that classical, you know, frequentist versus Bayesian debate.
135
:That's in the context of, you know, that sort of question of like, how do you make that
argument?
136
:How do you be objective in making that argument?
137
:How do you, you know, construct evidence which is compelling for another person? In the
context of cognitive science,
138
:I feel like there's just much less controversy around thinking in a Bayesian way.
139
:uh
140
:And part of the reason for that is that all of that subjective stuff is intrinsic to the
thing that a human being is doing when they're making an inference.
141
:So if I'm using Bayes' rule to explain how you walk around the world and see data and
interpret those data and learn things from them, then the part of that which is about your
142
:prior distributions and all of the subjective stuff which Bayes' rule allows us to talk
about, that's actually fundamental to it.
143
:It's not something which is an obstacle
144
:to objectivity; objectivity is not what you want there.
145
:Subjectivity is exactly the thing which is going on, and we want to be able to capture the
content of those subjective beliefs.
146
:And so for that reason, I just feel like it's a really good fit for the problem that
cognitive scientists have to solve.
147
:OK, Yeah, I was going to ask you basically now, so how does that work for the brain and
things like that?
148
:So maybe to dive into what you're doing, um can you
149
:tell us how you actually develop these statistical and mathematical models of higher level
cognition.
150
:Yeah.
151
:So our starting point is always thinking about the abstract problem that human minds have
to solve.
152
:So there's a uh cognitive scientist whose name is David Marr, and he introduced this idea
of there being three different levels at which you can analyze an information processing
153
:system.
154
:So the first level, the most abstract one is what he called the computational level.
155
:And that's really just saying, what is the problem that the system is solving and what's
the ideal solution to that problem?
156
:um And then the level below that is the algorithmic level.
157
:And this is what are the actual processes that are being executed?
158
:What are the representations and algorithms that are approximating that ideal solution?
159
:And this is kind of like, you know, for psychology, this is kind of like the classic level
of cognitive psychology where we're trying to figure out what are the sorts of heuristics
160
:that people might use?
161
:What are the, you know, what are the
162
:elements of memory and similarity and attention that are all combining together to allow
us to answer a question.
163
:And then um below that is the implementation level.
164
:And this is the level at which those representations and algorithms are implemented in
some kind of physical substrate.
165
:And that's like in the case of understanding people, that's the uh domain of neuroscience,
trying to figure out how representations and algorithms are instantiated in brains built
166
:out of neurons and so on.
167
:When we're making Bayesian models of cognition, we typically are operating at that
abstract computational level, where we say, what is the problem that the human mind is
168
:solving here?
169
:And when you're thinking about an inductive problem, right, where you've got some data and
some hypotheses you want to evaluate, and you don't have enough data to tell you which
170
:hypothesis is the right one, then that's a setting where, you know, thinking in Bayesian
terms is actually the way to characterize what an optimal solution to that problem might
171
:look like.
172
:And then the...
173
:The next step is saying clearly, what is the data that people get to see and what are the
hypotheses that they might be evaluating?
174
:And one way to do that in a Bayesian way is to think about the generative process that's
involved.
175
:Right?
176
:So how are the data generated?
177
:What's the sort of, you know, the sequence of steps that underlies those data being
generated?
178
:What are the latent variables involved?
179
:What's the structure that people are potentially inferring from the data that they see?
180
:And so that
181
:gives us a way of then conceptualizing the problem.
182
:And then we can make some guesses about what a prior distribution might be that would be
an appropriate prior distribution for those structures.
183
:And we can then get a model that we can compare against human behavior.
184
:And then we can also do some iteration and try and figure out, well, just using this kind
of model, is there a way that we can infer what the prior distributions are that people
185
:might use?
186
:So I can give you some concrete examples.
187
:So this idea of trying to understand human cognition in terms of probability theory, ah at
the time when we were doing our initial work in this area, and this is work that I was
188
:doing as a graduate student working with Josh Tenenbaum, um we were really sort of up
against this idea that people are really bad at thinking in terms of probabilities, right?
189
:uh
190
:Daniel Kahneman and Amos Tversky had run this set of experiments that are documented in the book,
Thinking, Fast and Slow, right?
191
:Which sort of suggests if you give people a problem that's a Bayesian inference problem,
they're not going to do a good job of solving it.
192
:uh And so what we wanted to do was say, actually, look, there's a different way of
thinking about how we use probability theory here, where their focus had been on people's
193
:intuitive reasoning about probabilities.
194
:What we were interested in was
195
:using probability theory just to characterize the problems that people are solving, but
never actually asking them anything which is a question about probability.
196
:It's very clear that people have messed up conceptions of what probabilities are.
197
:And if you ask them to do a probability problem, they sort of treat it as a math problem.
198
:They do the math wrong.
199
:And they just have uh wrong ideas about the way that probability works.
200
:If you don't say anything about probability and you just give them a problem of inductive
inference, you can analyze the answers that they give in terms that, you know, like
201
:probability theory is the right tool for analyzing how to solve that problem, and it can
give us insight into their answers.
202
:And so we took a set of uh different domains where you could frame a problem we call a
predicting the future problem.
203
:And this problem is, given that you observed some amount of a quantity, what will the
total amount of that quantity be?
204
:So an example of this would be like, I tell you a movie has made $90 million so far.
205
:What would you guess for the total amount of money that the movie is going to make?
206
:I can, I'll ask you Alex this question.
207
:What's, what's your guess for the total amount of money that this movie is going to make?
208
:Okay.
209
:So it's made 90, 90 million dollars.
210
:90 million so far.
211
:90 million so far in how long?
212
:don't know.
213
:You just hear on the radio.
214
:Okay.
215
:So I don't know.
216
:The movie has made 90 so far and I have to guess
217
:what the total will be without knowing for how long the movie has been out in the
theaters.
218
:Okay.
219
:Well then I would say 125.
220
:Okay.
221
:All right.
222
:And now you hear a movie's made $6 million so far.
223
:What's your guess for how much money it's going to make?
224
:6.7.
225
:Okay.
226
:All right.
227
:And now, uh if you meet a man in the street and he's 90 years old, what would you guess
for his total lifespan?
228
:Uh, 90 years old, how long is he gonna live? Oh, that's... yeah, that's a tough one. I think
for me it's tougher than the, than the movie. Um, I would say
229
:I would say 92.
230
:Great.
231
:Okay.
232
:And then you meet a six year old boy.
233
:What would you guess for his lifespan?
234
:so let's say this is where I live right now.
235
:So rich country.
236
:I'm guessing that the average life expectancy is like 80, 80 something for men.
237
:So let's say 81.
238
:Great.
239
:Okay.
240
:So I gave you in those examples, the same numbers, right?
241
:But the answers you gave me were different.
242
:So you gave me a higher number when it was about, uh, you know,
243
:for the $90 million movie, you said 125, for the 90-year-old man, you said 92.
244
:And you gave me, so it was a higher number for the movie.
245
:But then when I asked you about the $6 million movie, you said 6.7, and for the
six-year-old boy, you said 81.
246
:So you gave me a lower number for the movie in that case, right?
247
:Why did that happen?
248
:Well, the obvious answer is because you have different prior distributions that you
associate with those quantities.
249
:And in fact, what we showed was that if you just ask people questions like this,
250
:you can very easily elicit these implicit prior distributions that people are using when
they're solving these kinds of problems.
251
:And so uh the grosses of movies actually follow something like a power law distribution.
252
:So the answer you produce should always be a multiple of the amount that you saw so far.
253
:human lifespans follow uh roughly a Gaussian distribution.
254
:So the amount that you predict is basically something like the mean until you get close to
the mean, and then you predict something which is a little bit
255
:a little bit longer than, than, you know, what you've seen so far, which is exactly what
you did.
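If you want to reproduce the qualitative pattern Tom describes, here is a minimal grid-approximation sketch of the optimal Bayesian prediction for this task: the posterior median of the total, given an observation made at a uniformly random point of that total. The power-law exponent and the Gaussian mean and spread below are illustrative placeholders, not the values estimated in the original study.

```python
import numpy as np

def predicted_total(t_obs, prior_pdf, grid):
    """Posterior median of the total quantity t_total, having observed t_obs.

    The observation is assumed to fall at a uniformly random point of the
    total, so p(t_obs | t_total) = 1 / t_total for t_total >= t_obs.
    """
    posterior = np.where(grid >= t_obs, prior_pdf(grid) / grid, 0.0)
    posterior /= posterior.sum()
    return grid[np.searchsorted(np.cumsum(posterior), 0.5)]

grid = np.linspace(1, 1000, 200_000)

# Power-law prior (movie grosses): the prediction is a multiple of what you saw.
power_law = lambda t: t ** -1.5
# Gaussian prior (lifespans; mean 80, sd 15 are illustrative): predict roughly
# the mean until the observation approaches it, then a bit more than observed.
gaussian = lambda t: np.exp(-0.5 * ((t - 80.0) / 15.0) ** 2)

for t_obs in (6, 90):
    print(t_obs, predicted_total(t_obs, power_law, grid),
          predicted_total(t_obs, gaussian, grid))
```

Under the power-law prior the prediction scales multiplicatively with the observation, while under the Gaussian prior it hugs the mean and only creeps above the observation once the observation passes the mean, which is exactly the asymmetry in Alex's answers.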
256
:So people, uh, even, you know, not, uh, the podcasters who know a lot of Bayesian statistics,
but just everyday people, have exactly those intuitions.
257
:And we can, we can show that across all sorts of different kinds of things.
258
:The length of poems, how long people will serve in the House of Representatives, all of
these kinds of everyday quantities that people have some familiarity with.
259
:They're able to make
260
:these judgments about in a way that reflects the fact that they have implicit prior
distributions that they're using.
261
:So that's a sort of simple example.
262
:When we make these Bayesian models, the hypotheses that are involved can be much more
complicated.
263
:They can be entire, like, causal structures or hypotheses about, like, what the meaning of
a word is or, you know, things that are far more complex than this.
264
:But this is one of the first studies that we did where we really said, hey, look, here's a
tool that you can use that tells us something about what's going on when people make these
265
:uncertain inferences.
266
:Yeah, this is so first, thank you, because that was that was very fun.
267
:I love it when we do experiments like that uh on me on the on the podcast.
268
:I love that.
269
:um And, yeah, I mean, I'm really surprised to hear that, and that's, I mean, a good surprise,
that people have these intuitions, even when they don't have daily exposure to statistics
270
:and distributions.
271
:um That's great, because
272
:that's something I'm definitely going to use next time you know I talk about my job and
because when people ask me what I do for a living if I want to if I want the conversation
273
:to end because I don't want to talk about that right now I say I'm a statistician and then
that's why like you know then people will be like will close and be like my god I hate
274
:math or I was so bad at stats you know with that thinking that you have to be extremely
like to be good at stat or math you have to be gifted and if you're not then
275
:Yeah, then you're done.
276
:Then yeah, if I want the conversation to continue, I'll say, I'll do something like, I
ask them if they have seen Moneyball, the movie Moneyball, and then, and then they ask
277
:me, Oh, really?
278
:what do you do, blah, blah, blah.
279
:But, but then I'm definitely gonna use your, your example to, you know, like, prove to
them that they actually have intuition about stats.
280
:It's not, you know, that, that complicated.
281
:And as a lot of things, it's just
282
:a lot of training and sweat and tears uh to get good at it.
283
:m But yeah, I love that.
284
:That's really fascinating.
285
:And that reminds me of a modeling webinar I did on the show with Justin Bois, who is a
statistics professor at Caltech.
286
:And one of the methods he uses for not only eliciting the priors, but then refining the
287
:the priors that people give is that, so for instance, the questions you gave me, then he
would ask people, uh would you bet your house that this is the true answer?
288
:And if they don't, then that means they actually have a prior that they can refine.
289
:And when we get to the number that they would bet their house on, that means, okay, here
they are.
290
:They are like pretty sure about their prior.
291
:uh And I find that's a great method to...
292
:use to elicit priors from people who don't know stats, but I need their, I need their input,
for instance, for my model.
293
:That's that's a great method.
294
:I don't know if you've seen something like that in your research.
295
:Or or not.
296
:Yeah, so a lot of the work that we've done that's kind of like, related to elicitation has
really been around em finding tasks that allow us to get information about implicit
297
:probability distributions.
298
:that might go beyond the kinds of settings where traditional elicitation methods are used.
299
:one of the challenges for elicitation, if you're doing something like asking people about
quantiles or these other kinds of things, is that those methods work quite well for a
300
:single dimensional quantity.
301
:uh But they don't scale to how do you define a prior distribution over abstract objects or
how do you define a prior distribution over quantities that have many, many dimensions?
302
:And those are the kinds of things that as cognitive scientists we're often interested in.
303
:one of the uh weird things that we've done to try and solve that problem is um doing
things like running Markov chain Monte Carlo algorithms with humans.
304
:So um in your Markov chain Monte Carlo algorithm, you set up a Markov chain where you
sample
305
:from a sequence of probability distributions where each distribution depends on the
previously sampled value.
306
:And that sampling process is normally done by a computer.
307
:And that's the way that you're getting the relevant probabilities.
308
:What we do is we run exactly the same kinds of algorithms, but where the sampling process
is implemented by a human being.
309
:So when people answer a question like the one that I asked you, they don't always give you
the same answer.
310
:You actually get a lot of stochasticity in the answers that people give.
311
:And in some other work, we've shown that, in fact, that stochasticity often reflects what
the underlying posterior distribution is.
312
:So if you ask people questions like the one that I asked you, and we actually calculate
the posterior distribution using the appropriate prior, and we plot the posterior
313
:distribution against the distribution of responses, we get something which is basically
like a uh nice linear plot on a quantile-quantile plot.
314
:What that suggests is that we can use people as random number generators, and then we can
define algorithms that are using those people as a means of then sampling from
315
:distributions that are associated with the way in which they're answering that question.
316
:So we've done this in a few different ways.
317
:One of those ways is something called iterated learning.
318
:And this is an idea that comes from the language evolution literature, where if you think
about how you learn language.
319
:you learned language from somebody who'd learned language from somebody who learned
language from somebody who learned language, right?
320
:And so if you think about what that process is, if you just think about this as now
there's a chain of individual learners where each learner is learning from the utterances
321
:that are produced by the previous learner.
322
:You can think about that as a Markov chain.
323
:You have a learner who generates some utterances.
324
:The next learner learns from those utterances, generates some utterances based on what they
learned.
325
:And if you now think about that from
326
:a Bayesian perspective, you have a learner who sees some data, forms a posterior
distribution.
327
:If they sample a hypothesis from that posterior distribution and then sample data from the
hypothesis that they chose from the sort of corresponding likelihood function, then the
328
:next person sees those data, computes a posterior distribution, samples a hypothesis from
that posterior distribution, generates data from the corresponding likelihood function.
329
:That defines a Gibbs sampler.
330
:And that Gibbs sampler, the stationary distribution of that process, is the prior
distribution that's used by the learners, assuming all of the learners have the same prior
331
:distribution.
332
:what that means is any situation where you have these kinds of games of telephone, where
someone is making an inference from data that was generated by a previous person, the
333
:expectation you should have is that that process is going to converge to the prior
distribution that the agents are using when they're making those inferences.
334
:And so we can set that up in the lab.
335
:for this predicting the future problem that I told you about, an easy way to do this is you
just get people to generate their answers.
336
:And then for the next number that I give you, right?
337
:So I told you the movie had made $90 million so far.
338
:So the next number that I would give to the next person would be, take your answer of 125
and I would sample a number uniformly from between one and 125.
339
:And then that would be
340
:the data that I would give to the next person.
341
:And if you iterate that, that converges to the prior.
342
:And we can use this, we sort of confirm that this works in this one dimensional case, but
we've actually used this as an elicitation method for measuring people's prior
343
:distributions on much more complicated things.
344
:Elements of language, category structures, causal relationships, all of these things are
actually things that we can now sort of measure and we can use to inform the Bayesian
345
:models that we make of how people perform inductive inference.
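As a sanity check on the claim that iterated learning converges to the prior, here is a minimal simulation sketch of the Gibbs sampler Tom describes, with simulated learners who share a discretized power-law prior over the total quantity. The prior, the grid, and the starting observation are illustrative assumptions, not the settings used in the actual experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared prior over the total quantity, discretized on a grid (illustrative).
grid = np.arange(1, 1001, dtype=float)
prior = grid ** -1.5
prior /= prior.sum()

def learn(t_obs):
    """Learner step: sample a hypothesis t_total from the posterior given t_obs."""
    posterior = np.where(grid >= t_obs, prior / grid, 0.0)
    posterior /= posterior.sum()
    return rng.choice(grid, p=posterior)

def produce(t_total):
    """Production step: the next observation is uniform on [0, t_total]."""
    return rng.uniform(0.0, t_total)

# Iterated learning as a Gibbs sampler: the chain of hypotheses drifts to the prior.
t_obs, hypotheses = 90.0, []
for _ in range(20_000):
    t_total = learn(t_obs)
    t_obs = produce(t_total)
    hypotheses.append(t_total)

print("chain mean:", np.mean(hypotheses[1_000:]), "prior mean:", float(grid @ prior))
```

The two means should roughly agree up to Monte Carlo noise, which is the one-dimensional version of the convergence-to-the-prior result; the interesting part of the real method is that the learner step is a person rather than a line of code.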
346
:Okay, yeah, that's really fascinating.
347
:That sounds to me like um a more sophisticated, more evolved wisdom of crowds.
348
:So it's like, yeah, I really love that.
349
:I didn't know you could do that.
350
:That's really fascinating.
351
:Yeah.
352
:oh Just on the wisdom of crowds point.
353
:I mean, so wisdom of crowds is normally you get a bunch of people and then you average
together their responses, right?
354
:And that's more accurate than the individual responses.
355
:We actually wrote a paper where we called it
356
:something like the wisdom of individuals, right?
357
:Because what this suggests is that within any individual, there's actually that whole
distribution already.
358
:And so you can elicit that distribution and you can actually get a uh better characterization of
what's going on with that quantity than you would if you just asked that individual one
359
:question in a way that is reminiscent of that individual, um of that wisdom of crowds
phenomenon.
360
:Yeah.
361
:um
362
:And so obviously I'm curious now about the relationship of the whole generative AI uh
research to you.
363
:And I know you work on that.
364
:And uh I recommend people take a look at your latest book, Bayesian Models of Cognition at
MIT Press.
365
:And that's in the show notes.
366
:That's a big book, 600.
367
:600 pages, folks.
368
:So if you like that, you're gonna have a lot of things to read.
369
:Yes, so can you, can you tell us a bit?
370
:How do you see the relationship between human cognition and artificial intelligence?
371
:Yeah.
372
:So I'll maybe say something specific and then say something more general.
373
:So the specific thing is, yes, all of the things that we just talked about are actually
things that you can do with something like a large language model, right?
374
:Like, you GPT-4.
375
:So we've actually used these methods.
376
:And my postdoc, Jianqiao Zhu, has actually run experiments where he's used exactly the
same paradigms that we used with people to measure what are the implicit prior
377
:distributions over things like the grosses of movies that are in GPT-4 uh and showing that
we can actually use this iterated learning procedure with these uh large models, not just
378
:to pull out
379
:prior distributions, but also to get information that otherwise these models don't want to
release, right?
380
:So if you actually ask them to make predictions about things like, when will there be
superhuman AI?
381
:They really don't want to answer those sort of speculative questions.
382
:But if you ask them a much more concrete question, like, if we create human level AI in
a certain year,
383
:it's actually kind of
384
:happy to answer those conditional questions.
385
:And so you can actually use that to then run one of these uh little Gibbs sampling
procedures and get a sense of what its beliefs are about things that it might not
386
:otherwise want to tell you.
387
:these kinds of methods specifically work there.
388
:They're also related to, um there was a paper in Nature earlier this year that talked
about model collapse, which is um if you take a generative AI model,
389
:And then you train it on all of the human data.
390
:But then you generate a new data set from that.
391
:And you train another AI model on that generated data set.
392
:And you keep on doing that.
393
:What happens?
394
:And of course, you get this result where it collapses.
395
:It ends up converging to what we would say is its prior distribution.
396
:So it's exactly the same result that we had when we were trying to characterize this
iterated learning process in people, which is that you get this uh deterioration towards
397
:the prior.
398
:um
399
:In terms of thinking about how human minds relate to AI more generally, one high level
point is that in the same way that we can think about human cognition in these Bayesian
400
:terms, it makes sense to think about these AI models in that way, right?
401
:So to the extent that these models are solving uh inductive problems and are doing a good
job of solving those inductive problems, then Bayes is a good tool for making sense of
402
:that.
403
:And that's part of why
404
:it's reasonable to ask these questions like, what are their prior distributions?
405
:We wrote a paper.
406
:This is work led by Tom McCoy.
407
:And we wrote a paper called Embers of Autoregression, in which we show that, in fact, AI
models like GPT-4 and other large language models are overly influenced by their prior
408
:distributions in the sense that you can ask them questions
409
:which are completely deterministic questions where there's only one correct answer and
your prior shouldn't matter at all.
410
:And they are nonetheless influenced by the prior distribution that they've learned from
looking at all of the text on the internet.
411
:So for example, if you take GPT-4 and you ask it to count how many letters appear in a
string of letters, it is more likely to give you the correct answer if there are 30
412
:letters in the string than if there are 29 letters in the string.
413
:And the reason why
414
:is because the number 30 appears on the internet more often than the number 29.
415
:And this model has been trained to predict things that are on the internet.
416
:And so it's influenced by that prior distribution when it's answering a question where the
prior should be completely irrelevant.
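Here is a minimal sketch of how one could probe that kind of prior sensitivity on a deterministic task, in the style of the Embers of Autoregression experiments. This is not the authors' code, and `ask_model` is a hypothetical placeholder for whatever chat-completion call you use; it is assumed to take a prompt string and return an integer answer.

```python
import random
import string

def counting_prompt(n_letters, rng):
    # Build a deterministic question whose correct answer is exactly n_letters.
    letters = "".join(rng.choice(string.ascii_lowercase) for _ in range(n_letters))
    return f"How many letters are in the following string? {letters}"

def accuracy(ask_model, n_letters, n_trials=50, seed=0):
    """Fraction of trials where the model returns the correct count."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        hits += int(ask_model(counting_prompt(n_letters, rng)) == n_letters)
    return hits / n_trials

# The task is identical in both conditions; only the prior probability of the
# answer ("30" versus "29" as numbers in web text) differs.
# print(accuracy(ask_model, 30), accuracy(ask_model, 29))
```

If the model really were answering the deterministic question, the two accuracies should match; a systematic gap in favor of the more common number is the signature of the prior leaking into a problem where it should be irrelevant.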
417
:So I think there's a productive sort of analogy that we can make in terms of using these
sorts of methods to analyze these systems.
418
:By doing that, we can pull apart some of the ways in which the priors that are used by
humans and AI models might be different from one another.
419
:But I think more generally, it highlights the fact that um there are some fundamental
differences between these kinds of systems.
420
:And those differences aren't necessarily in terms of these abstract computational level
problems that we're trying to solve, right?
421
:These things that are the things that we analyze in terms of Bayesian inference.
422
:They're more at these other levels of analysis that I mentioned.
423
:in terms of the constraints that come from that level of algorithms and representations or the
constraints that come from the way in which these systems are physically implemented.
424
:And so if you want me to talk about a more general take on humans and AI, I'm happy to do
that.
425
:But in terms of the Bayesian part, I kind of think the Bayesian part is one of the things
that we share.
426
:And it's the other stuff that really is where we end up with our meaningful differences.
427
:Why is that?
428
:Do you have already some research about that, things you can talk about?
429
:Yeah.
430
:So the way that I think about it is that uh human intelligence has been shaped by a set of
constraints that human minds operate under.
431
:And those constraints are quite different from the ones that are faced by intelligent
machines.
432
:So if you think about what's characteristic of human intelligence, all of the things that
we learn, we learn from the data
433
:which we're going to experience in the course of a single human lifetime.
434
:So that means that we're intrinsically limited in the amount of data that we can learn
from.
435
:In order for something to be useful to you, you probably need to learn it relatively early
on in that lifetime.
436
:So realistically, we're talking about maybe 20 years of data that you're using to form
most of the inductive inferences that you're going to make about the world that you live
437
:in.
438
:And even much of that happens in the first five years.
439
:We've learned language pretty well in the first five years of life.
440
:We've kind of figured out most of the causal structure of our environment, although we're
still fine tuning it after that, right?
441
:We kind of like got most of the stuff that makes us a human being in those first five
years of life.
442
:Okay.
443
:So that's thing number one is limited data.
444
:That's something that shapes human cognition: you have to take about five years to
like figure out all of this stuff.
445
:The second is that we do all of that learning and all of our thinking with a
446
:fixed amount of compute, which is what you're carrying around inside your head, two to
three pounds of neurons.
447
:So that means that way number two that we're different from our AI systems is that we have
to use that same amount of compute for every single problem that we solve.
448
:And so we need to be efficient in the way that we use those computational resources.
449
:We need to be able to recognize when a problem has the same kind of structure as one we've
seen before.
450
:We need to be
451
:smart about how much time we spend thinking about any one particular thing.
452
:And so we need something that allows us to make good use of those limited computational
resources.
453
:So that's number two.
454
:And then number three is we have limited bandwidth for communication.
455
:So if it was just possible for me to transfer my brain state to you, then we could easily
overcome those limits on data and computation.
456
:because I could just share with you all of the data that I've ever seen, all of the
conclusions that are drawn from it by computing on those data.
457
:If we needed to work on a problem that was too hard for one of us to solve, we could just
combine our computational resources that would allow us to overcome those constraints.
458
:Instead, we have to do this by doing what we're doing right now, making honking noises at
each other.
459
:uh And that is a really low bandwidth method of communication.
460
:So humans have developed all sorts of
461
:workarounds for this, right?
462
:We've developed language that allows us to convey complicated concepts to one another,
teaching as a means of compressing information.
463
:Like all of the data that I've seen, you don't need to see, you just need to have the
information from me about what to conclude from it.
464
:We develop methods for creating social institutions, things like science that allow us as
a community to aggregate knowledge across generations, engage in what's called cumulative
465
:cultural evolution.
466
:All of that stuff is
467
:stuff that allows us to uh try and overcome those constraints, but it's still not as good
as if we had sort of perfect high bandwidth communication.
468
:So by contrast, the current approach that's taken in AI is one of scaling the amount of
data that's going into systems.
469
:So it's many, many human lifetimes of data that these systems are now trained on.
470
:ah It's scaling the amount of compute that's going into those systems.
471
:And you can kind of have arguments about how that compute compares to human
472
:minds, right?
473
:But ah the relevant thing is that ah that number is something that's easy to increase,
whereas for humans it's not.
474
:And you have the potential for just being able to transfer states between machines
directly, right?
475
:So you don't have to worry about those low bandwidth communication ah bottlenecks that are
characteristic of human cognition.
476
:And so for that reason, we might expect that
477
:human intelligence would continue to be quite different from machine intelligence because,
you know, there's, it would be nice to have machines that could learn from less data and
478
:be more efficient in the way that they use computational resources.
479
:But if you just want to make a machine that does a particular task, and in particular, if
you want to be the first person who makes a machine that can do that, then you can kind of
480
:get around all of the hard engineering that might need to go into
481
:producing a system that has good inductive biases for solving a problem, or producing a
system which is able to be resource efficient in doing that.
482
:And that's kind of the mode that we've been in for AI, is really just turning up that knob
of data and compute, because it's the thing that so far has been the easiest way to get us
483
:to uh more effective AI systems.
484
:Okay, and so, I'm wondering, so two main things I'm wondering, so trying to order my
thoughts.
485
:First one is, then if I understand correctly what you're saying, you're mainly saying that
the way we're gonna be able to make these generative AI learn is gonna be different from
486
:the way our brains learn.
487
:Does that mean you think there is?
488
:something we can do so that we build these machines, these generative AI in a way that's
complementing our learning and even helping our learning and maybe, you know, fast
489
:tracking it in some way.
490
:What kind of potential do you see here?
491
:Also, what kind of drawbacks are you trying to maybe avoid?
492
:Because we are building these machines, so we are in a position where we can also try and
avoid uh
493
:costs that we can anticipate.
494
:Yeah.
495
:So I don't think there's anything intrinsic to the way that AI is currently being built,
which has that kind of human compatibility built into it, right?
496
:The two things that I think are properties of the way in which AI systems are currently
trained that do contribute to this to some extent are, first of all, training on language.
497
:right, or other data, which are human generated data.
498
:That's something which means that these systems end up kind of like viewing the world in
the same way that we do, because they're viewing it through the lens that we provided.
499
:Right.
500
:So if you think about what language is doing, language is one of those mechanisms that
humans developed as a means of overcoming that limited communication bottleneck.
501
:And as a consequence, it's a thing that we use to describe the world around us very, in a
very compressed form.
502
:And so that compressed form, as well as the structure of language itself is something
which
503
:when you put it into an AI system, means that that system actually gets a lot of knowledge
about the world that's being digested through human minds.
504
:And so that's one thing that maybe stops these systems from being quite as alien as they
could be, is that they are really uh trained to view the world in the same way that a
505
:human has.
506
:um The other thing is the last stage of training for many AI models is something called um
reinforcement learning from human feedback, RLHF.
507
:um
508
:And that's something where humans give direct information about the preferences that they
have about the responses that a system is producing.
509
:And that's something which, in some ways, contributes to alignment between humans and
machines.
510
:Although we have a recent paper where we argue that, in fact, what it results in is more
alignment between the current state of knowledge that humans have ah and machines, which
511
:is not necessarily the same thing as having the human's best interests in mind as you go
further into the future.
512
:And you can think about this even in simple examples, like if you're building a chatbot
that is going to advise you on what products to purchase, unless that chatbot is receiving
513
:its reinforcement based on how satisfied you are with the product after you've bought it,
then it's going to be incentivized to instead focus on giving you the information that
514
:makes you satisfied at the point where you make the purchase.
515
:And that is not the same thing.
516
:And it's not the same thing in a way that can lead those models to intentionally, well, we
shouldn't say intentionally because it's not an intentional agent, but to engage in
517
:behavior that a human being would consider deceptive.
518
:um And so I think if you really want to build systems that are designed to interact with
people, then there's a lot more things that we could do.
519
:which are making direct use of the things that we know about people.
520
:um So we've done a bunch of work on what we call um representational alignment, which is
making sure that the representations that the models have are the same as the
521
:representations that humans have.
522
:And so in some early work, we actually showed that if you train a vision model on, instead
of training it on sort of sanitized data, where every image
523
:has a label, which is the sort of agreed upon label that the humans give it.
524
:If you train those models instead on data that contains human errors, where when the
humans are confused, you include those labels too, the models actually end up with better
525
:representations and representations that are more aligned with those of people.
526
:So that's something where having this more complete information about how humans see the
world and the kinds of things that humans find confusing actually lets the model get
527
:representations that are more human-like.
528
:And in other work, we've shown that models that have more human-like representations
actually do a better job at things like learning from small amounts of data and at
529
:learning to capture human moral judgments.
530
:Right?
531
:So um if you view the world in the same way, that gives you a better basis for being able
to feel the same way about the world in terms of
532
:what's a good action to take versus a bad action to take.
533
:And so we've done other work where we talk about a more general idea of conceptual
alignment, which goes beyond just sort of having these representations that we can uh
534
:measure in terms of what's inside the model that align and really being focused on this
notion of like understanding the world in terms of the same basic concepts.
535
:And so even when we have models that sort of correlate in their representations, they can
still be quite different.
536
:in the way they're making sense of the world.
537
:for example, ah, you know, humans think about the world in a way where we distinguish between
different numbers of objects, right?
538
:Where the number of things is the sort of discrete thing that we represent and we can
individuate.
539
:And it seems like, at least if you take image models, once you get past four or five or
six objects in an image, they kind of lose track of what those numbers are.
540
:So they just sort of have, you know, like maybe have somewhat individuated representations
of small numbers, but they just have sort of like a fuzzy mass when you get beyond those
541
:small numbers.
542
:And that's exactly how, you know, part of human cognition works.
543
:Right?
544
:If I show you an image and ask you how many things are in it, if there's up to four things
in the image, you're going to be able to answer it perfectly.
545
:And then after that you drop off.
546
:But we have this other richer conceptual representation, which allows us to individuate,
you know, higher numbers, 30 and 31 from each other and so on.
547
:And so.
548
:At the moment, our AI systems are capturing sort of part of how humans perceive the world.
549
:It's kind of like the part that happens very quickly that you're able to do in, you know,
less than a second where you're just sort of like seeing something and making sense of it,
550
:or sort of like, reading something and getting the gist of it, but they're missing
something, which is the more symbolic abstract representations that people form.
551
:And so we have a paper called, uh, Machines That Learn and Think With People, where we make
this argument.
552
:and, and
553
:And so that's to say, if you really want to make machines that are able to engage with humans in all
the ways that humans engage with humans, then we need to capture more of the kind of thing
554
:that allows humans to engage with humans, which is having these shared representations of
the world.
555
:And that's probably going to require some innovation in terms of the methods that we use
for building these AI systems.
556
:Hmm.
557
:Yeah, okay.
558
:That's so fascinating.
559
:I'm curious.
560
:So my main, you know, one of my main questions in these is um how can we apply what we've
learned basically so far about human fallacies and cognition to make these generative AI
561
:assistants
562
:basically elevate us and help us try not to fall into these biases that may be conscious and
unconscious, but we know about them from research.
563
:How can we make that happen?
564
:Because what I'm thinking is if we're trying to make them to our perfect image and
symmetry, uh that's useful, but...
565
:I think we're missing, we would be missing one of the most interesting parts, which is,
okay, trying to make us better humans instead of just the same we are.
566
:um And so is that possible?
567
:Is that something the research is currently going towards?
568
:um How hard is it?
569
:It's like, it seems like a huge endeavor to me.
570
:Yeah, I mean, my first response would be, you know, for the reasons I said,
571
:It's not clear to me that an objective should be building AI systems that are like people.
572
:Because people, what human minds are is a consequence of this evolutionary process, right?
573
:That reflected various biological constraints that we operate under.
574
:And it might be better to think about it in terms of, you know, what are, what's the best way
for us to build
575
:intelligent systems that aren't like people in terms of what's easy for us to engineer,
but also something that's easy for people to interact with.
576
:And I think that's a perfectly reasonable goal for AI.
577
:So if you think about birds and jet planes, which is an example that people who work on AI
talk about a fair amount.
578
:studying how birds fly is incredibly important to making your first airplane in terms of
understanding
579
:aerodynamics and lift and all of these fundamental concepts that are intrinsic to like how
you're going to design the wings of your plane and sort of figure out like what it is that
580
:actually makes something fly.
581
:But uh there's a point where once you have something that's flying, you can engage in an
iterative process, which is the one that leads you to jet planes, right?
582
:And jet planes and birds are very different from one another.
583
:I think that's a totally reasonable place to end up, right?
584
:I'm not going to be someone who's going to be like, we should build flying machines that
are more like birds.
585
:There are plenty of reasons why, you know, you might want to take some more bird-like properties
and put them into jet planes, for example, being more efficient in their fuel consumption.
586
:Or, um, you know, like other kinds of characteristics that, you know, birds have; birds are sort
of amazing as examples of flying machines, right?
587
:But it's not something that if your fundamental goal is, I want to get from one location
to another as fast as possible, that's something where learning more about jet engines is
588
:probably the way to go.
589
:And I sort of feel like we're in the same space with AI.
590
:It's like, yes, there's plenty of things that we can still learn from human cognition, um
but we shouldn't say that our objective is to build something which is uh human-like,
591
:necessarily.
592
:m
593
:So that said, there are plenty of ways that AI can be helpful to humans.
594
:So one of the things that I work on is understanding human decision making.
595
:And this kind of goes back to those experiments by Kahneman and Tversky that I talked about
before, right?
596
:So part of what they showed was not just that people are bad at reasoning about
probabilities, but that people are bad at making decisions with respect to
597
:you know, expected utility theory as our kind of criterion of what it means to be a good
decision maker.
598
:And that's certainly true.
599
:And if we try and think about that in terms of the different levels of analysis, right,
expected utility theory is our abstract computational-level theory; it tells us what we
600
:should be doing.
601
:But there's a sense in which that is a kind of theory, which in fact, could not be
achieved by any real entity that exists in the world, right?
602
:So expected utility theory says, no matter how many outcomes you have, for each of those
outcomes, evaluate the utility, evaluate the probability, and take the expectation.
603
:If you're trying to use that as a guide to making any real decision, it's kind of
impossible to do that, right?
604
:Because you'll just spend your entire life thinking about possible outcomes, trying to
evaluate how good they are, trying to evaluate how probable they are.
605
:And in fact, most of the time when we have to make a decision, we have to do it under some
kind of time constraints.
606
:Those constraints are things that are intrinsic to the way that the problem is set up.
607
:They're also a consequence of that sort of second level with the kinds of algorithms that
we're able to execute on our human brains, right?
608
:And that sort of gets us down to the third level.
609
:it's things from these other levels of analysis that are influencing the problem that we
should be thinking about.
610
:And so back in the 1980s, um, some computer scientists who were trying to figure out how to
make a good AI system,
611
:people like Stuart Russell and Eric Horvitz, among others, tried to work out, what's
actually the right optimization problem to set up for making a decision?
612
:And they came up with an idea that they call bounded optimality.
613
:Basically, the idea is that a bounded optimal agent is one that uses the best possible
algorithm for making a decision, taking into account not just the outcomes that that
614
:produces, the sort of expected value of those outcomes.
615
:but also the computational cost involved in executing that algorithm.
616
:And that's a much more realistic way of thinking about how to characterize optimal
decision making.
617
:We've used that idea and expanded on it in an approach that we call resource rationality,
which is the idea that actually being a rational agent means making rational use of the
618
:limited cognitive resources that you have to do the best that you can in making the
decisions that you make.
619
:And we can use that
620
:as a lens for then going back and re-evaluating human decisions.
621
:And when you do that, you find out that a lot of the things that people do that deviate
from expected utility theory are actually things that kind of make sense from the
622
:perspective of resource rationality.
623
:Like the heuristics that people use for solving problems are typically good heuristics
when evaluated by this criterion of do they achieve relatively high expected utility while
624
:not incurring too much computational cost.
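As a rough illustration of the idea described here (not from the episode): a bounded-optimal, or resource-rational, agent scores a decision strategy by the expected utility of the choices it produces minus the cost of the computation it needs. The sketch below is a minimal toy example in Python; the gambles, the sampling heuristic, and the cost constants are all made up for illustration.

```python
import random

# Toy decision problem: choose between gambles, each a list of (probability, payoff) pairs.
gambles = [
    [(0.5, 10), (0.5, 0)],
    [(0.9, 4), (0.1, 3)],
]

def full_expected_utility(gamble):
    """Exhaustive expected-utility computation (the 'ideal' computational-level standard)."""
    return sum(p * x for p, x in gamble)

def sample_based_estimate(gamble, n_samples):
    """Cheaper heuristic: estimate expected utility from a few sampled outcomes."""
    probs = [p for p, _ in gamble]
    outcomes = [x for _, x in gamble]
    draws = random.choices(outcomes, weights=probs, k=n_samples)
    return sum(draws) / n_samples

def resource_rational_score(algorithm_cost, chosen_gamble):
    """Value of using a strategy = utility of what it chooses minus its compute cost.
    Cost units are arbitrary here, just to make the trade-off explicit."""
    return full_expected_utility(chosen_gamble) - algorithm_cost

# Compare a costly exact strategy with a cheap sampling strategy.
exact_choice = max(gambles, key=full_expected_utility)
cheap_choice = max(gambles, key=lambda g: sample_based_estimate(g, n_samples=3))

print("exact strategy score:", resource_rational_score(algorithm_cost=2.0, chosen_gamble=exact_choice))
print("cheap strategy score:", resource_rational_score(algorithm_cost=0.2, chosen_gamble=cheap_choice))
```

Depending on the cost you assign to deliberation, the cheap heuristic can come out ahead, which is the sense in which heuristics can be resource-rational.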
625
:But the other thing that you get out of this is a perspective on how you can help humans
make better decisions, which is that if we're making bad decisions some of the time
626
:because of those resource constraints, one way that we can better support human decision
making is by increasing the computational resources that are available to human agents.
627
:so, you know, Elon Musk is trying to do that by putting chips in people's brains.
628
:That's not
629
:where we are in this space.
630
:uh Instead, what we focused on is, what if you can use AI systems to put more compute in
human environments, right?
631
:So that when you are making a decision, we've, you know, sort of partially
pre-solved part of that problem, and we can give you information in a much more accessible
632
:way that's going to help you make a good decision.
633
:So either putting the most relevant information in the most easily accessible place or
634
:giving you information about payoffs that supplements the uh information that's
immediately available to you, sort of gamification, where we're saying, you get some more
635
:points if you choose this option, ah where you can actually set up those points in a way
that better aligns with people's long-term goals.
636
:One of my collaborators, Falk Lieder, has worked on making gamified to-do lists where you
can specify the things that you want to do
637
:and how valuable they are, break these down into components.
638
:And then it automatically works out how many points you should get for doing each thing.
639
:And that gives you a much more immediate reward for checking something off your to-do list
as those points add up.
640
:And maybe you get yourself some prize when you get enough points.
641
:um And that's a way of helping to motivate humans who uh might not be able to plan all the
way to the end of some long-term goals.
642
:And we've actually run experiments.
643
:We've shown that this kind of thing can help people reduce procrastination.
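To make the to-do-list idea concrete, here is a minimal sketch of the kind of point assignment being described: you state how valuable a goal is, break it into subtasks with rough effort estimates, and each subtask earns an immediate, proportionate reward. The splitting rule used here (points proportional to each subtask's share of the total effort) is a simplifying assumption for illustration, not the actual scheme from Falk Lieder's work.

```python
def assign_points(goal_value, subtasks):
    """Split a goal's value across its subtasks, proportional to estimated effort.

    goal_value: how valuable completing the whole goal is (arbitrary units).
    subtasks: dict mapping subtask name -> estimated effort (e.g. hours).
    Returns subtask -> points, so checking off any step gives an immediate reward,
    and the points add up to the value of the whole goal.
    """
    total_effort = sum(subtasks.values())
    return {name: goal_value * effort / total_effort
            for name, effort in subtasks.items()}

todo = assign_points(
    goal_value=100,  # e.g. "submit the paper" is worth 100 points overall
    subtasks={"write intro": 3, "run analysis": 6, "make figures": 2, "proofread": 1},
)
for task, points in todo.items():
    print(f"{task}: {points:.1f} points")
```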
644
:So those are
645
:ways of using elements of AI systems to help humans without having to go all the way to
like building something which is, you know, a human like AI.
646
:I love that.
647
:That is absolutely fascinating.
648
:I have so many, still so many questions for you.
649
:You know, well, I'll start winding us down because I don't want to take three hours of your
time, but something I'm also curious about, and I think you also work on that, is the language
650
:learning aspect of these generative AI models.
651
:Yeah, can you talk a bit about that?
652
:How does that work?
653
:How do you make...
654
:an AI learn language and grammar?
655
:If I understood correctly, grammar is something that's quite hard.
656
:Um, so yeah, what's the state of the art right now?
657
:And how does this work?
658
:Yeah, so the basic principle behind the large language models that are used in current
AI systems is a very nice statistical principle, which is, um, the goal of these systems is,
659
:based on the words that you've seen,
660
:in a document so far, predict the next word that'll appear in that document.
661
:It's a very simple kind of objective to define.
662
:And this is called autoregression by analogy to more familiar autoregressive models that
you might use for time series analysis.
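As a concrete, deliberately tiny illustration of that next-word-prediction objective, here is a sketch of an autoregressive bigram model in Python: it estimates P(next word | previous word) by counting. This is the same prediction problem modern LLMs solve with far richer models and far more context; the toy corpus is invented for illustration.

```python
from collections import defaultdict, Counter

corpus = "the dog chased the cat . the cat chased the mouse .".split()

# Count how often each word follows each preceding word.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus[:-1], corpus[1:]):
    counts[prev][nxt] += 1

def next_word_distribution(prev):
    """P(next word | previous word), estimated by simple counting."""
    c = counts[prev]
    total = sum(c.values())
    return {w: n / total for w, n in c.items()}

print(next_word_distribution("the"))     # {'dog': 0.25, 'cat': 0.5, 'mouse': 0.25}
print(next_word_distribution("chased"))  # {'the': 1.0}
```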
663
:That idea itself goes back to cognitive science.
664
:A cognitive scientist named Jeff Elman published a paper in 1990 where he showed that
using that kind of objective was actually
665
:an effective way of learning, in his case, a very simple language.
666
:So he sort of made up a simple language.
667
:You know, he trained his neural network on what now seems kind of quaint.
668
:It's like 10,000 uh words that were generated in that simple language.
669
:And he showed that when he analyzed what the neural network had learned, it actually pulled
out things like nouns and verbs.
670
:And the fact that there are different kinds of nouns and different kinds of verbs that
interact with one another in different ways.
671
:So it had kind of like pulled out this information in the
672
:hidden units of that model, the representations that the model formed; it had made
distinctions between these different syntactic classes that were key to the simple
673
:language that he created.
674
:And so honestly, the story of creating the modern large language models is essentially a
story of scaling that up.
675
:So it's increasing the amount of data that the models are trained on and also
676
:modifying the architectures of the neural networks, which are used for solving those
problems.
677
:So Elman's original model used something called a simple recurrent network, which would
just sort of copy some of the information that it had used to predict the last word, and
678
:then have that information available when it's predicting the next word.
679
:So it copied those internal hidden unit states.
680
:And that works fine for learning short sequences.
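For readers who want to see the mechanism being described, here is a minimal sketch of one step of a simple recurrent network in NumPy: the hidden state from the previous word is copied back in as extra input when predicting the next word. Dimensions and random weights are placeholders; this shows the shape of the computation, not Elman's actual 1990 setup.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 20, 8

# Random (untrained) weights, just to show the wiring.
W_in = rng.normal(size=(hidden_size, vocab_size))    # input word -> hidden
W_rec = rng.normal(size=(hidden_size, hidden_size))  # copied context -> hidden
W_out = rng.normal(size=(vocab_size, hidden_size))   # hidden -> next-word scores

def srn_step(word_id, prev_hidden):
    """One step of an Elman-style simple recurrent network."""
    x = np.zeros(vocab_size)
    x[word_id] = 1.0                               # one-hot input word
    hidden = np.tanh(W_in @ x + W_rec @ prev_hidden)
    scores = W_out @ hidden
    probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the next word
    return probs, hidden                           # hidden is copied to the next step

hidden = np.zeros(hidden_size)
for word_id in [3, 7, 1]:                          # a toy sequence of word ids
    probs, hidden = srn_step(word_id, hidden)
print("predicted distribution over next word:", probs.round(3))
```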
681
:But as you start to try and learn long sequences, it runs into all sorts of problems just
as a consequence of
682
:you know, the dynamics of... like basically you start losing information very
quickly when you're engaging in that copying process, for reasons that are related to the
683
:iterated learning thing that I was talking about before.
684
:It's just like any iterated process: it's hard
to carry information forward.
685
:And so there was another innovation called long short-term networks, sorry, long
short-term memory networks, which were basically,
686
:a way of allowing the neural network to make decisions about what information it would
store in memory and what information it would read out.
687
:And then those were used to significantly expand what these kinds of neural network models
could learn.
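A rough sketch of the gating idea just described, in the same NumPy style: an LSTM cell uses learned gates to decide how much of its memory to keep, what new content to write, and how much to read out at each step. The weights here are random placeholders; this is the standard textbook cell, not tied to any particular system discussed in the episode.

```python
import numpy as np

rng = np.random.default_rng(1)
input_size, hidden_size = 10, 8

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, acting on [input, previous hidden] concatenated.
def make_weights():
    return rng.normal(scale=0.1, size=(hidden_size, input_size + hidden_size))

W_f, W_i, W_o, W_c = (make_weights() for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    """One LSTM step: gates decide what to forget, what to store, and what to read out."""
    z = np.concatenate([x, h_prev])
    forget = sigmoid(W_f @ z)        # how much old memory to keep
    write = sigmoid(W_i @ z)         # how much new content to store
    read = sigmoid(W_o @ z)          # how much memory to expose as output
    candidate = np.tanh(W_c @ z)     # proposed new content
    c = forget * c_prev + write * candidate
    h = read * np.tanh(c)
    return h, c

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for _ in range(5):                   # feed a few random "word" vectors
    h, c = lstm_step(rng.normal(size=input_size), h, c)
print("hidden state after 5 steps:", h.round(3))
```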
688
:And then the latest generation of these models uses an architecture that's called the
transformer architecture.
689
:And in that architecture, instead of processing words one by one and making the prediction
about the next word, it has the whole sequence available to it when it's making that
690
:prediction.
691
:And what it learns is what words within that sequence it should be paying attention to in
order to generate that prediction in a way that is then able to change dynamically for
692
:whatever sequences it's looking at.
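Here is a minimal sketch of that attention mechanism: for each position, the model scores every word in the sequence, turns the scores into weights, and uses the weighted mix when making its prediction. This is the standard scaled dot-product attention formula in NumPy with random toy inputs; causal masking, multiple heads, and everything else in a real transformer are left out.

```python
import numpy as np

rng = np.random.default_rng(2)
seq_len, d_model = 5, 16

def softmax(scores, axis=-1):
    scores = scores - scores.max(axis=axis, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=axis, keepdims=True)

# Toy word representations and random projection matrices.
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
attn_weights = softmax(Q @ K.T / np.sqrt(d_model))  # who attends to whom
output = attn_weights @ V                           # weighted mix over the whole sequence

print("attention weights for the last position:", attn_weights[-1].round(2))
```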
693
:And that's a very powerful way of doing it, though it's not necessarily particularly human-like.
694
:Again, making this difference between the engineered system and the human system, right?
695
:You as a human pretty much have to process stuff in real time.
696
:Whereas the model is able to kind of look back and forth and it's able to look along the
entire sequence when it's making that prediction.
697
:So it's much more like if I gave you a written document and I asked you to predict the
next word and you could be going back and looking at all of the words in that document
698
:and sort of circling them and figuring out which ones you're going to use to then make that
prediction about what that next word is going to be.
699
:um And that architecture is even more effective.
700
:And that's the architecture that modern large language models are based on.
701
:So
702
:Alongside those architectural innovations, you just have massive increases in the amount
of data that are being supplied to the systems, where GPT-3, which is the last of the OpenAI
703
:models that we know a reasonable amount about in terms of what it was actually trained on, was
trained on the equivalent of, if you just had somebody reading out the training data, it
704
:would be 5,000 years of continuous speech.
705
:So go back to ancient Egypt,
706
:start reading 24 hours a day, and you would stop now.
707
:That is the data that it learned from.
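As a rough sanity check on that figure (not from the episode): GPT-3 was reportedly trained on roughly 300 billion tokens, and if you read those aloud nonstop at a typical speaking rate of about 150 words per minute, you land in the same ballpark of a few thousand years.

```python
tokens = 300e9            # approximate GPT-3 training tokens (rough public figure)
words_per_minute = 150    # typical continuous speaking rate (assumption)

minutes = tokens / words_per_minute       # treating tokens roughly as words
years = minutes / (60 * 24 * 365)
print(f"~{years:,.0f} years of continuous speech")  # on the order of a few thousand years
```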
708
:And then the models that are successors to that have been trained on, again, like another
order of magnitude more data.
709
:And so really, it's those two things.
710
:And I think the surprise for many cognitive scientists is that that's enough.
711
:You can get to a remarkably complete
712
:understanding of language, uh given those neural network architectures and enough training
data.
713
:Hmm, yeah, that sounds like an awesome project, but also, yeah, like, a huge endeavor.
714
:And I'm wondering also, now that, you know, we talked about it a bit, how you see these next
projects in the future, I'm also wondering, how do you see the future of Bayesian methods
715
:in cognitive science for the kind of work that you're doing and talking about right now?
716
:I mean, so I think this case is actually a really good one for highlighting what's
different between humans and these machines, right?
717
:So, okay, take that GPT-3 case.
718
:It does a remarkably good job of learning, you know, uh, not just English, but multiple
languages, right?
719
:From 5,000 years of data.
720
:Okay.
721
:Take the human being, who does a pretty good job of learning language from, you know,
722
:five years of data, actually less than five, because my number assumed speaking 24 hours a day,
and kids do not get spoken to 24 hours a day.
723
:But there's like at least three orders of magnitude difference in the amount of data that
humans are using to achieve the state that they get to.
724
:And so when you think about that in terms of solving an inductive problem, the only place
where that missing piece can come from, that allows you to get to that level of
725
:performance with so much less data.
726
:is inductive bias, the kind of thing that in a Bayesian framework we would express in
terms of like a prior distribution.
727
:Right.
728
:So one way of thinking about that is humans have some prior distribution over languages
that's much more informative than the implicit prior distribution, which is built into a
729
:transformer model.
730
:And so that's part of the project of a cognitive scientist: to figure out what that
distribution is like, right?
731
:Like, what is, ah, the way to characterize
732
:those human inductive biases for language learning.
733
:And that's a huge project.
734
:It's the kind of thing that linguists have been engaged in for the last 70 years at least
in terms of looking across different human languages, looking at the things that people
735
:can learn.
736
:We've used methods like the iterated learning method that I talked about as a way to
reveal some aspects of what those human prior distributions for learning languages look
737
:like.
738
:We've also used a method called meta-learning as a way of exploring this.
739
:So you can kind of turn this problem around and you can say,
740
:If I took the data that kids get exposed to from different human languages, can I reverse
engineer what inductive bias you would need to have in order to be able to learn to speak
741
:those languages?
742
:And you can do this using a tool that's come from the machine learning community called
meta-learning, which is basically, you have a training procedure where you're going to
743
:train a bunch of neural networks to perform different tasks, but
744
:All of those neural networks are going to have the same initial weights.
745
:And so you then have an inner loop where they learn to perform the tasks and an outer loop
where you optimize the initial weights.
746
:And the initial weights of a neural network are, you know, one factor
that influences the inductive bias of that network.
747
:So you can think about what that's doing is it's trying to find a sort of starting point
in weight space so that by doing your neural network learning from that starting point,
748
:you're going to end up at the end points that you want to get to.
749
:And so we can use that actually as a tool for trying to work out what's a good starting
point, what's a good set of initial weights, for characterizing what you might need to have
750
:in order to be able to learn the things that humans learn from the data that the humans
actually get exposed to.
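A minimal sketch of that inner-loop/outer-loop structure (a simplified, MAML-style setup on toy one-parameter regression tasks, not the actual method used in the work described): every task starts from the same initialization, the inner loop adapts it with a few gradient steps on that task's data, and the outer loop nudges the shared initialization so that adaptation works well across tasks.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_task():
    """A toy 'language': learn y = a * x for a task-specific slope a."""
    a = rng.normal(loc=2.0, scale=0.5)
    x = rng.normal(size=20)
    return x, a * x

def loss_grad(w, x, y):
    """Gradient of mean squared error for the model y_hat = w * x."""
    return np.mean(2 * (w * x - y) * x)

w0 = 0.0                 # the shared initial weight the outer loop optimizes
inner_lr, outer_lr = 0.05, 0.01

for outer_step in range(500):
    x, y = make_task()
    # Inner loop: adapt from the shared initialization on this task's data.
    w = w0
    for _ in range(3):
        w = w - inner_lr * loss_grad(w, x, y)
    # Outer loop (first-order approximation): move the initialization so that
    # the adapted weight does well on this task.
    w0 = w0 - outer_lr * loss_grad(w, x, y)

print("meta-learned initialization:", round(w0, 2), "(tasks are centered on slope 2.0)")
```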
751
:And we have a paper where we show that, in fact, that method is a kind of hierarchical
Bayesian method.
752
:So you can show that this is exact in the case where these are linear models that you're
optimizing, and it's a little more heuristic in the case of neural networks.
753
:But that's a way to think about it.
754
:It's a way of like trying to back out a prior distribution expressed in terms of the
weights of a neural network.
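For readers who want the formal picture, here is a hedged sketch of the hierarchical Bayesian model being alluded to, written for a generic linear-Gaussian case; the symbols are chosen here for illustration. Each task has its own weights drawn from a shared prior, and meta-learning the shared initialization plays the role of estimating that prior.

```latex
% Task-specific weights drawn from a shared prior (the "initialization"):
\theta_t \sim \mathcal{N}(\mu, \Sigma), \qquad
y_t \mid X_t, \theta_t \sim \mathcal{N}(X_t \theta_t, \sigma^2 I).
% Meta-learning the shared starting point corresponds to estimating \mu;
% how far the inner loop is allowed to move from it implicitly plays the role of \Sigma.
```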
755
:And so I think there's all sorts of, you know, interesting things for cognitive scientists
to do in, you know, this world where, uh, one of the challenges that AI faces at the
756
:moment,
757
:and, you know, we'll see how much of a challenge this is ultimately, is that they're starting
to run out of data.
758
:Uh, so, you know, the models have been trained on
759
:you know, all human generated text in electronic form.
760
:And so the question is, where do they go next to make the models better?
761
:One way to go is to try and understand what those human inductive biases are like and
think about how you can implement something like that in these models.
762
:That's certainly a direction that I'm excited about.
763
:I think the other way that we can use these cognitive scientists' tools is in understanding
more about the actual inductive biases that these models have,
764
:understanding how they relate to or differ from human inductive biases and just getting a
better sense of what the capacities of these models are.
765
:Because one of the challenges of working with these kinds of neural network models that
underlie most of modern AI is that they're really opaque.
766
:It's really hard to know what assumptions they're making.
767
:So part of what attracts me to Bayesian statistics is we know exactly what assumptions
we're making.
768
:We write them down and then we derive their consequences.
769
:And so the more we can do to characterize for these neural networks what they're doing in
something like Bayesian terms, the better position it puts us in in order to think about
770
:what the consequences might be of deploying them in a particular situation.
771
:What's the right kind of model to use in a particular situation?
772
:How are these models different from the assumptions humans might make?
773
:Things that are actually relevant to thinking about:
774
:How do we guarantee better outcomes from deploying these models?
775
:Yeah, so many interesting things on your agenda, I guess, for the coming year and many
years to come, I'm guessing.
776
:And actually, you're not only a researcher, you're also a director of a lab.
777
:So I'm guessing you also see a lot of students.
778
:You even mentor some PhD students.
779
:I'm wondering what advice you would give to someone starting out right now in Bayesian
stats and/or cognitive science, someone who is interested in what you're doing and
780
:would like to do it too.
781
:Yeah.
782
:I mean, I think one thing that happened when, uh, ChatGPT came out was that a lot of my
students who work on Bayesian methods were kind of worried at that point, right?
783
:I did a lot of counseling of graduate students where
784
:I think there was a concern that, A, whatever problem they were working on was going to be
solved by some relatively dumb machine learning method scaling up, right?
785
:Or just, you know, by the systems that are created as a consequence of that process.
786
:And ah B, that they might not be able to get jobs because people might not be interested
in
787
:hiring people who are working on Bayesian methods.
788
:They might just want to hire people who are working on deep learning or language models or
whatever it is.
789
:So I think for anyone who's in this space, uh, my advice is, uh, I think you should work on
the things that you're interested in.
790
:If you're drawn to Bayesian methods, work on Bayesian methods.
791
:But with two caveats.
792
:So one of those is, it's better to work on a problem than a method.
793
:So it's better to say, this is the thing that I want to understand, or this is the problem
that I want to solve than to say, I am going to solve this problem using Markov Chain
794
:Monte Carlo or whatever it is, right?
795
:Because if you have a problem, somebody making progress on methods is good news for you,
right?
796
:So as AI gets better, you're putting yourself in a position to do a better job of solving
your problem.
797
:Rather than worrying about, someone is going to, you know,
798
:do something which is better than the method that I'm using.
799
:so choosing to structure your research program in a way where improvements in AI are
positive outcomes for you rather than negative outcomes is a really important thing to
800
:think about at the moment just because stuff is happening so fast.
801
:So that's thing number one.
802
:And the thing number two is I think just strategically, whatever you're working on, it's
good to think about how to relate that to
803
:these kinds of developments, right?
804
:So it doesn't have to be that you make your research program all about large language
models, but it's quite helpful if you have one paper where you think about large language
805
:models or deep learning or whatever it is and how it relates to the problem that you're
solving, because that's a way of communicating that you know these methods as well and
806
:that you're able to talk about them intelligently.
807
:If you're looking for academic jobs, you're able to teach a class on those methods and
that
808
:the questions and methods that you work on aren't going to be things that are going to be
superseded, but they're things that are compatible with whatever those new approaches are.
809
:And so those two principles are things that I use when I'm, you know, sort of mentoring
students and guiding them and thinking about their research programs and choosing
810
:projects.
811
:Um, just to, you know, yeah, like position yourself to succeed and be able to communicate
how it is that the things that you work on are relevant to all of these exciting things
812
:that are happening in modern AI.
813
:All right.
814
:So mainly, mainly focusing on solving the problems, not really on the methods.
815
:Yeah.
816
:I mean, that's something where I think building a strong methodological toolbox puts you
in a good position to be able to roll with the punches, right?
817
:But you shouldn't be committed to the idea that one particular method is going to be
the thing that you're going to be focused on for your, you know,
818
:your whole career, certainly, but maybe not even for the next few years.
819
:And there are plenty of examples of this.
820
:Like I mentioned our crazy experiments where we run Markov chain Monte Carlo algorithms
with people.
821
:We also run those crazy Markov chain Monte Carlo algorithms on large language models,
right?
822
:And this is, it's Markov chain Monte Carlo, but we're using it in weird ways that are
responsive to the particular problems or questions that come up in different disciplines.
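For context on what "running Markov chain Monte Carlo with people" can mean in practice, here is a hedged sketch of the general idea (illustrative, not the exact experimental protocol from the work mentioned): the participant's two-alternative choice between the current stimulus and a proposed one plays the role of the accept step in a Metropolis-style sampler, so over many trials the chain's samples reflect the distribution the person implicitly has in mind. The "participant" below is simulated with a made-up preference centered on 5.0.

```python
import numpy as np

rng = np.random.default_rng(4)

def simulated_participant_chooses(proposal, current):
    """Stand-in for a human choice between two stimuli.

    This simulated participant prefers stimuli near 5.0 (a made-up implicit
    distribution); in a real experiment this would be a button press.
    """
    def preference(x):
        return np.exp(-0.5 * (x - 5.0) ** 2)
    p_choose_proposal = preference(proposal) / (preference(proposal) + preference(current))
    return rng.random() < p_choose_proposal

current = 0.0
samples = []
for trial in range(5000):
    proposal = current + rng.normal(scale=1.0)       # propose a nearby stimulus
    if simulated_participant_chooses(proposal, current):
        current = proposal                           # the choice acts as the accept step
    samples.append(current)

print("mean of sampled stimuli:", round(float(np.mean(samples[1000:])), 2))  # close to 5.0
```

With a symmetric proposal and this relative-preference choice rule, the chain's stationary distribution matches the preference distribution, which is why the sampled stimuli end up concentrated around 5.0.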
823
:And I think it's being able to be creative. Like, I still continue to think that Bayesian
statistics is a fundamental, incredibly valuable toolkit for anyone who's doing research
824
:on AI, machine learning, cognitive science, precisely because it tells us what optimal
solutions to problems look like, right?
825
:And if any kind of system is doing a good job of solving a problem, it's going to be doing
something which is approximating what those optimal solutions are.
826
:A lot of research which is done in AI and machine learning is often done in a kind of
like, put these things together and it works kind of way at the moment.
827
:um And so being somebody who has that toolkit to be able to come in and say, this is why
it works.
828
:This is where it's not going to work.
829
:And this is a solution to that problem.
830
:That is super useful.
831
:And it's a very powerful way of being able to engage with this huge amount of research
which is going on.
832
:For me, the papers that I find most satisfying are the ones where we're able to
take something that's relatively ad hoc, explain it, extend it, improve upon it.
833
:That's the stuff that gets me most excited in research.
834
:Yeah.
835
:And so, um, concretely, actually, what did you recommend to your students, you know, who
were worried about their investment in these hard methods? Because the
836
:problem is that, you know, learning new methods... I get that the advice of being flexible with
the methods you use and so on is, yeah, important.
837
:And in a world where it would be
838
:easy to learn methods,
839
:for instance, if you were like in The Matrix and you just, you know, have
to watch a tape and then you know something,
840
:the most rational thing to do would be to just jump methods whenever you need to.
841
:In our world though, in the way we learn, it's not possible, especially since the methods
we use for inference are definitely complex and you acquire them with months, if not
842
:years of training.
843
:changing
844
:is not like if you're on a small boat, you know, where you can turn very fast.
845
:It's more like you're on a cruise ship and it takes time to do a U-turn.
846
:So yeah, like in these cases, I'm wondering what the, uh, optimal decision would be for
these kinds of students: is it, like, well, actually, the methods you learned are still
847
:extremely valuable,
848
:and I would still, you know,
849
:work with them, or, no, you need to change something, uh, right now?
850
:So how do you handle these cases?
851
:Yeah, I mean, I think if you're very invested in the methods, then the right strategy is one
of looking to the contemporary machine learning literature and looking for places where
852
:those methods can be used to answer questions in that literature.
853
:And that's much more straightforward.
854
:um And it doesn't have to be the thing that is, like, your core interest; the core interest
can be the questions that you have.
855
:But I think it's more
856
:being open to finding those opportunities and like I said, like communicating with that
literature, right?
857
:ah And I think there are lots of good ways to do that in terms of, you know, places where,
for the reasons I outlined, you know, thinking about these systems from the perspective
858
:of Bayesian statistics is a very rich
859
:lens to use in trying to make sense of them.
860
:And it's a skill set that's rare at this point.
861
:Like 90 plus percent of the research that's being done in machine learning at the moment
is probably being done by people who don't have a deep grounding in those kinds of
862
:Bayesian methods.
863
:And so that's a competitive advantage.
864
:um And so I think
865
:the one thing I would sort of discourage is just sort of being dogmatic about it, right?
866
:Saying, I don't do that kind of thing.
867
:I do this Bayesian stuff.
868
:Because I think that's something where you can just shut yourself in a box.
869
:I think it's being able to be responsive and open to these things and thinking about them.
870
:there are plenty of ways that you can get communication going back and forth between these
different fields.
871
:Yeah, yeah.
872
:Yeah, that makes sense.
873
:And actually, what do you think are the most promising trends or advancements in your field
that you would recommend, uh, such a student or professional to invest in? Like someone who
874
:already knows, like your students, the Bayesian methods well, uh, but you see some
advancements in other methods or other tools that you would recommend these
875
:people to learn because they would be a good investment of their time.
876
:I mean, I actually think that the thing I would recommend, and this is, you know, a somewhat
self-centered perspective, right, but, you know, some of the papers that we write are really
877
:about like, how do you take a Bayesian view of these things?
878
:And in some ways, that's most useful because it's building a bridge, right?
879
:um And that's
880
:that bridge allows you to make use of the methods that you already know, right?
881
:Where if you can say, OK, here's a set of things that you know, here's this other set of
things, and now I can tell you, that thing is relevant to this thing, and that thing's
882
:relevant to this thing, and that thing's relevant to this thing, that gives you a little
bit of a map of the territory that you can explore.
883
:So we have a paper called Bayes in the Age of Intelligent Machines, where we make this
argument that Bayesian reasoning is really relevant to building AI systems.
885
:There's a nice position paper that was at ICML last year.
886
:I think if you just Google position deep learning, ah, that's a paper that's really a good
summary of the state of the art in, um, Bayesian thinking as applied to machine learning
887
:models.
888
:um And then our paper, Embers of Autoregression, is
890
:a characterization of, like, large language models from this Bayesian perspective, um, and
cites some related things.
891
:There's also a lot of nice work on, um, like, in-context learning in, ah, large language models
as Bayesian inference.
892
:Some of that work has come out of Percy Liang's group um and we cite some of that work.
893
:But, ah, those are places to look for
894
:sort of like nice connections between Bayes and contemporary machine learning methods.
895
:Yeah, definitely feel free to add these links to the show notes because I think lots of
people will be interested in that and want to dig deeper.
896
:Okay, so that's already been a long time.
897
:I have to let you go in a few minutes.
898
:So let me ask you one last question before the last two questions.
899
:uh Mainly about you.
900
:Yeah, what's next for you?
901
:Are there any upcoming projects or ideas you're particularly excited about for the
coming months?
902
:I mean, in general, I think this is a super exciting time to be a cognitive scientist,
right?
903
:For a long time, we only had one really intelligent system to study, and now we have a
bunch.
904
:And that's creating all sorts of fun opportunities.
905
:We've been doing a lot of uh LLM psychology, like trying to make sense of what's going on
in large language models and related models.
906
:um And that's a fun way to use a, yeah, as I said, like a toolbox that's coming from
another discipline and applying it in the context of AI.
907
:um The other kinds of things that
908
:uh you know, I do in my lab are these really large scale studies of decision making.
909
:And we have a few of these things that are coming together and going to be, you know,
exciting papers that I'm looking forward to working on.
910
:uh Other than that, my main job at the moment is we just started a new AI lab at
Princeton, which is a big interdisciplinary effort to explore how AI can be used to
911
:support research across the entire
912
:university, ultimately, and to try and figure out how do we, you know, like, support AI
research on campus more generally.
913
:um So I'm the director of that effort.
914
:That's been taking a lot of time and is at the point where it's really starting to
accelerate.
915
:And that's super exciting.
916
:Like, the thing that I see here is the potential transformative impact of AI across
many different disciplines where, for example,
917
:the Nobel Prize in Chemistry this year.
918
:That's just one indicator of, I think, what's going to come, which is, as these methods
propagate out across all of the different research disciplines, they're going to enable us
919
:to make discoveries that we couldn't have made before.
920
:um And I'm excited to be part of that process at Princeton.
921
:Awesome.
922
:Well, Tom, that was an absolute pleasure to have you on the show.
923
:And well, before you go, of course, I'm going to ask you the last two questions I ask
every guest at the end of the show.
924
:uh So first one, if you had unlimited time and resources, which problem would you try to
solve?
925
:I think a really deep, interesting problem at the moment is figuring out how to scale
Bayesian inference.
926
:So you can think about a lot of the success of the current AI paradigm as a consequence of
people kind of figuring out how to scale neural networks better.
927
:methods for doing back propagation, automatic differentiation, things to stop gradients
disappearing in deeper networks.
928
:These were all insights that made it possible to create bigger and bigger neural networks.
929
:And that was crucial to then being able to train models that could work with these
very large data sets.
930
:Bayes is even more notorious for having problems with scale.
931
:But I'm really curious about what happens when we scale it
932
:in the same way that we've been able to scale these other kinds of methods.
933
:So that's a huge challenge.
934
:But as you said, unlimited time and resources, that's the one I would pick.
935
:And if you could have dinner with any great scientific mind, dead, alive, or fictional,
who would it be?
936
:I assume that they're alive at the point where you're eating dinner with them, right?
937
:Because otherwise, it would be boring.
938
:One of my intellectual heroes is Leibniz.
939
:Uh, so I think that would be fun just because, you know, he's somebody who was just all
over all sorts of ideas that have been incredibly fundamental, not just to cognitive
940
:science, but, um, you know, all sorts of other disciplines.
941
:He's not in our normal Bayesian pantheon, but he was actually very interested in
probability, right?
942
:His early work was on combinatorics, and he started to think about probability and
943
:started to even think about it in terms of, you know, being an extension of the kind of
basic methods of logic that he was starting to think about.
944
:So ah definitely somebody I would enjoy having dinner with.
945
:Yeah, it's a great answer.
946
:I think nobody gave that answer yet, but I'm surprised, because Leibniz is definitely, uh, a big name
in science, and not only statistics, but math and physics.
947
:It definitely sounds like a great dinner. Uh, well, amazing.
948
:Thank you so much for taking the time, Tom.
949
:That was absolutely incredible.
950
:I still have so many questions for you, but I definitely recommend
951
:your new book to listeners.
952
:These will be in the show notes.
953
:There will be also other links that Tom mentioned to some papers.
954
:So all of these will be in the show notes for those who want to dig deeper.
955
:Thank you again, Tom, for taking the time and being on this show.
956
:Thanks for having me.
957
:Maybe I can answer the other questions next time I write a book.
958
:Oh, yeah, for sure.
959
:Okay, come back next time you have a book you like.
960
:You're welcome to come and that will be very fun.
961
:This has been another episode of Learning Bayesian Statistics.
962
:Be sure to rate, review, and follow the show on your favorite podcatcher, and visit
learnbayestats.com for more resources about today's topics, as well as access to more
963
:episodes to help you reach true Bayesian state of mind.
964
:That's learnbayestats.com.
965
:Our theme music is Good Bayesian by Baba Brinkman, feat. MC Lars and Mega Ran.
966
:Check out his awesome work at bababrinkman.com.
967
:I'm your host,
968
:Alex Andorra.
969
:You can follow me on Twitter at alex_andorra, like the country.
970
:You can support the show and unlock exclusive benefits by visiting patreon.com/LearnBayesStats.
971
:Thank you so much for listening and for your support.
972
:You're truly a good Bayesian.
973
:Change your predictions after taking information in.
974
:And if you're thinking I'll be less than amazing.
975
:Let's adjust those expectations.
976
:Let me show you how to be a good Bayesian
Change calculations after taking fresh data in
Those predictions that your brain is making
Let's get them on a solid foundation