Proudly sponsored by PyMC Labs, the Bayesian Consultancy. Book a call, or get in touch!
GPs are extremely powerful… but hard to handle. One of the bottlenecks is learning the appropriate kernel. What if you could learn the structure of GP kernels automatically? Sounds really cool, but also a bit futuristic, doesn’t it?
Well, think again, because in this episode, Feras Saad will teach us how to do just that! Feras is an Assistant Professor in the Computer Science Department at Carnegie Mellon University. He received his PhD in Computer Science from MIT, and, most importantly for our conversation, he’s the creator of AutoGP.jl, a Julia package for automatic Gaussian process modeling.
Feras discusses the implementation of AutoGP, how it scales, what you can do with it, and how you can integrate its outputs in your models.
Finally, Feras provides an overview of Sequential Monte Carlo and its usefulness in AutoGP, highlighting the ability of SMC to incorporate new data in a streaming fashion and explore multiple modes efficiently.
Our theme music is « Good Bayesian », by Baba Brinkman (feat MC Lars and Mega Ran). Check out his awesome work at https://bababrinkman.com/ !
Thank you to my Patrons for making this episode possible!
Yusuke Saito, Avi Bryant, Ero Carrera, Giuliano Cruz, Tim Gasser, James Wade, Tradd Salvo, William Benton, James Ahloy, Robin Taylor,, Chad Scherrer, Zwelithini Tunyiswa, Bertrand Wilden, James Thompson, Stephen Oates, Gian Luca Di Tanna, Jack Wells, Matthew Maldonado, Ian Costley, Ally Salim, Larry Gill, Ian Moran, Paul Oreto, Colin Caprani, Colin Carroll, Nathaniel Burbank, Michael Osthege, Rémi Louf, Clive Edelsten, Henri Wallen, Hugo Botha, Vinh Nguyen, Marcin Elantkowski, Adam C. Smith, Will Kurt, Andrew Moskowitz, Hector Munoz, Marco Gorelli, Simon Kessell, Bradley Rode, Patrick Kelley, Rick Anderson, Casper de Bruin, Philippe Labonde, Michael Hankin, Cameron Smith, Tomáš Frýda, Ryan Wesslen, Andreas Netti, Riley King, Yoshiyuki Hamajima, Sven De Maeyer, Michael DeCrescenzo, Fergal M, Mason Yahr, Naoya Kanai, Steven Rowland, Aubrey Clayton, Jeannine Sue, Omri Har Shemesh, Scott Anthony Robson, Robert Yolken, Or Duek, Pavel Dusek, Paul Cox, Andreas Kröpelin, Raphaël R, Nicolas Rode, Gabriel Stechschulte, Arkady, Kurt TeKolste, Gergely Juhasz, Marcus Nölke, Maggi Mackintosh, Grant Pezzolesi, Avram Aelony, Joshua Meehl, Javier Sabio, Kristian Higgins, Alex Jones, Gregorio Aguilar, Matt Rosinski, Bart Trudeau, Luis Fonseca, Dante Gates, Matt Niccolls, Maksim Kuznecov, Michael Thomas, Luke Gorrie, Cory Kiser, Julio, Edvin Saveljev, Frederick Ayala, Jeffrey Powell and Gal Kampel.
Visit https://www.patreon.com/learnbayesstats to unlock exclusive Bayesian swag ;)
Takeaways:
- AutoGP is a Julia package for automatic Gaussian process modeling that learns the structure of GP kernels automatically.
- It addresses the challenge of making structural choices for covariance functions by using a symbolic language and a recursive grammar to infer the expression of the covariance function given the observed data.
- AutoGP incorporates sequential Monte Carlo inference to handle scalability and uncertainty in structure learning.
- The package is implemented in Julia using the Gen probabilistic programming language, which provides support for sequential Monte Carlo and involutive MCMC.
- Sequential Monte Carlo (SMC) and involutive MCMC are used in AutoGP to infer the structure of the model.
- Integrating probabilistic models with language models can improve interpretability and trustworthiness in data-driven inferences.
- Challenges in Bayesian workflows include the need for automated model discovery and scalability of inference algorithms.
- Future developments in probabilistic reasoning systems include unifying people around data-driven inferences and improving the scalability and configurability of inference algorithms.
Chapters:
00:00 Introduction to AutoGP
26:28 Automatic Gaussian Process Modeling
45:05 AutoGP: Automatic Discovery of Gaussian Process Model Structure
53:39 Applying AutoGP to New Settings
01:09:27 The Biggest Hurdle in the Bayesian Workflow
01:19:14 Unifying People Around Data-Driven Inferences
Links from the show:
Transcript
This is an automatic transcript and may therefore contain errors. Please get in touch if you're willing to correct them.
GPs are extremely powerful, but hard to
handle.
2
:One of the bottlenecks is learning the
appropriate kernels.
3
:Well, what if you could learn the
structure of GP's kernels automatically?
4
:Sounds really cool, right?
5
:But also, eh, a bit futuristic, doesn't
it?
6
:Well, think again, because in this
episode, Feras Saad will teach us how to
7
:do just that.
8
:Feras is an assistant professor in the
computer science department at Carnegie
9
:Mellon University.
10
:He received his PhD in computer science
from MIT.
11
:And most importantly for our conversation,
he's the creator of AutoGP.jl, a Julia
12
:package for automatic Gaussian process
modeling.
13
:Feras discusses the implementation of
AutoGP, how it scales, what you can do
14
:with it, and how you can integrate its
outputs in your Bayesian models.
15
:Finally,
16
:Feras provides an overview of
Sequential Monte Carlo and its usefulness
17
:in AutoGP, highlighting the ability of SMC
to incorporate new data in a streaming
18
:fashion and explore multiple modes
efficiently.
19
:This is Learning Bayesian Statistics,
,:
20
:Welcome to Learning Bayesian Statistics, a
podcast about Bayesian inference, the
21
:methods, the projects, and the people who
make it possible.
22
:I'm your host, Alex Andorra.
23
:You can follow me on Twitter at alex_andorra, like the country, for any info
24
:about the show.
25
:LearnBayesStats.com is the place to be.
26
:Show notes,
27
:becoming a corporate sponsor, unlocking
Bayesian merch, supporting the show on
28
:Patreon, everything is in there.
29
:That's learnbayesstats.com.
30
:If you're interested in one -on -one
mentorship, online courses, or statistical
31
:consulting, feel free to reach out and
book a call at topmate.io slash alex
32
:underscore andorra.
33
:See you around, folks, and best Bayesian
wishes to you all.
34
:idea patients.
35
:First, I want to thank Edvin Saveljev,
Frederick Ayala, Jeffrey Powell, and Gal
36
:Kampel for supporting the show.
37
:on Patreon. Your support is invaluable, guys,
and literally makes this show possible.
38
:I cannot wait to talk with you in the
Slack channel.
39
:Second, I have an exciting modeling
webinar coming up on April 18 with Juan
40
:Orduz, a fellow PyMC core dev and
mathematician.
41
:In this modeling webinar, we'll learn how
to use the new HSGP approximation for fast
42
:and efficient Gaussian processes, we'll
simplify the foundational concepts,
43
:explain why this technique is so useful
and innovative, and of course, we'll show
44
:you a real -world application in PyMC.
45
:So if that sounds like fun,
46
:Go to topmate.io slash alex underscore
andorra to secure your seat.
47
:Of course, if you're a patron of the show,
you get bonuses like submitting questions
48
:in advance, early access to the recording,
et cetera.
49
:You are my favorite listeners after all.
50
:Okay, back to the show now.
51
:Feras Saad, welcome to Learning Bayesian
Statistics.
52
:Hi, thank you.
53
:Thanks for the invitation.
54
:I'm delighted to be here.
55
:Yeah, thanks a lot for taking the time.
56
:Thanks a lot to Colin Carroll.
57
:who of course listeners know, he was in
episode 3 of Learning Bayesian Statistics.
58
:Well I will of course put it in the show
notes, that's like a vintage episode now,
59
:from 4 years ago.
60
:I was a complete beginner in Bayesian
stats, so if you wanna hear me embarrass myself,
61
:definitely that's one of the episodes you
should listen to, with all
62
:my beginner's questions, and that's one of
the rare episodes I could do on site.
63
:I was with Colin in person to record
that episode in Boston.
64
:So, hi Colin, thanks a lot again.
65
:And Feras, let's talk about you first.
66
:How would you define the work you're doing
nowadays?
67
:And also, how did you end up doing that?
68
:Yeah, yeah, thanks.
69
:And yeah, thanks to Colin Carroll for
setting up this connection.
70
:I've been watching the podcast for a while
and I think it's really great how you've
71
:brought together lots of different people
in the Bayesian inference community, the
72
:statistics community to talk about their
work.
73
:So thank you and thank you to Colin for
that connection.
74
:Yeah, so a little background about me.
75
:I'm a professor at CMU and I'm working
in...
76
:a few different areas surrounding Bayesian
inference with my colleagues and students.
77
:One, I think, you know, I like to think of
the work I do as following different
78
:threads, which are all unified by this
idea of probability and computation.
79
:So one area that I work a lot in, and I'm
sure you have lots of experience in this,
80
:being one of the core developers of PyMC,
is probabilistic programming languages and
81
:developing new tools that help
82
:both high level users and also machine
learning experts and statistics experts
83
:more easily use Bayesian models and
inferences as part of their workflow.
84
:The, you know, putting my programming
languages hat on, it's important to think
85
:about not only how do we make it easier
for people to write up Bayesian inference
86
:workflows, but also what kind of
guarantees or what kind of help can we
87
:give them in terms of verifying the
correctness of their implementations or.
88
:automating the process of getting these
probabilistic programs to begin with using
89
:probabilistic program synthesis
techniques.
90
:So these are questions that are very
challenging and, you know, if we're able
91
:to solve them, you know, really can go a
long way.
92
:So there's a lot of work in the
probabilistic programming world that I do,
93
:and I'm specifically interested in
probabilistic programming languages that
94
:support programmable inference.
95
:So we can think of many probabilistic
programming languages like Stan or Bugs or
96
:PyMC as largely having a single inference
algorithm that they're going to use
97
:multiple times for all the different
programs you can express.
98
:So BUGS might use Gibbs sampling, Stan
uses HMC with NUTS, PyMC uses MCMC
99
:algorithms, and these are all great.
100
:But of course, one of the limitations is
there's no universal inference algorithm
101
:that works well for any problem you might
want to express.
102
:And that's where I think a lot of the
power of programmable inference comes in.
103
:A lot of where the interesting research is
as well, right?
104
:Like how can you support users writing
their own say MCMC proposal for a given
105
:Bayesian inference problem and verify that
that proposal distribution meets the
106
:theoretical conditions needed for
soundness, whether it's defining an
107
:irreducible chain, for example, or whether
it's aperiodic.
108
:or in the context of variational
inference, whether you define the
109
:variational family that is broad enough,
so its support encompasses the support of
110
:the target model.
111
:We have all of these conditions that we
usually hope are correct, but our systems
112
:don't actually verify that for us, whether
it's an MCMC or variational inference or
113
:importance sampling or sequential Monte
Carlo.
114
:And I think the more flexibility we give
programmers,
115
:And I touched upon this a little bit by
talking about probabilistic program
116
:synthesis, which is this idea of
probabilistic, automated probabilistic
117
:model discovery.
118
:And there, our goal is to use hierarchical
Bayesian models to specify prior
119
:distributions, not only over model
parameters, but also over model
120
:structures.
121
:And here, this is based on this idea that
traditionally in statistics, a data
122
:scientist or an expert,
123
:will hand design a Bayesian model for a
given problem, but oftentimes it's not
124
:obvious what's the right model to use.
125
:So the idea is, you know, how can we use
the observed data to guide our decisions
126
:about what is the right model structure to
even be using before we worry about
127
:parameter inference?
128
:So, you know, we've looked at this problem
in the context of learning models of time
129
:series data.
130
:Should my time series data have a periodic
component?
131
:Should it have polynomial trends?
132
:Should it have a change point?
133
:right?
134
:You know, how can we automate the
discovery of these different patterns and
135
:then learn an appropriate probabilistic
model?
136
:And I think it ties in very nicely to
probabilistic programming because
137
:probabilistic programs are so expressive
that we can express prior distributions on
138
:structures or prior distributions on
probabilistic programs all within the
139
:system using this unified technology.
140
:Yeah.
141
:Which is where, you know, these two
research areas really inform one another.
142
:If we're able to express
143
:rich probabilistic programming languages,
then we can start doing inference over
144
:probabilistic programs themselves and try
and synthesize these programs from data.
145
:Other areas that I've looked at are
tabular data or relational data models,
146
:different types of traditionally
structured data, and synthesizing models
147
:there.
148
:And the workhorse in that area is largely
Bayesian non -parametrics.
149
:So prior distributions over unbounded
spaces of latent variables, which are, I
150
:think, a very mathematically elegant way
to treat probabilistic structure discovery
151
:using Bayesian inferences as the workhorse
for that.
152
:And I'll just touch upon a few other areas
that I work in, which are also quite
153
:aligned, which a third area I work in is
more on the computational statistics side,
154
:which is now that we have probabilistic
programs and we're using them and they're
155
:becoming more and more routine in the
workflow of Bayesian inference, we need to
156
:start thinking about new statistical
methods and testing methods for these
157
:probabilistic programs.
158
:So for example, this is a little bit
different than traditional statistics
159
:where, you know, traditionally in
statistics we might do
160
:some type of analytic mathematical
derivation on some probability model,
161
:right?
162
:So you might write up your model by hand,
and then you might, you know, if you want
163
:to compute some property, you'll treat the
model as some kind of mathematical
164
:expression.
165
:But now that we have programs, these
programs are often far too hard to
166
:formalize mathematically by hand.
167
:So if you want to analyze their
properties, how can we understand the
168
:properties of a program?
169
:By simulating it.
170
:So a very simple example of this would be,
say I wrote a probabilistic program for
171
:some given data, and I actually have the
data.
172
:Then I'd like to know whether the
probabilistic program I wrote is even a
173
:reasonable prior from that data.
174
:So this is a goodness of fit testing, or
how well does the probabilistic program I
175
:wrote explain the range of data sets I
might see?
176
:So, you know, if you do a goodness of fit
test using stats 101, you would look, all
177
:right, what is my distribution?
178
:What is the CDF?
179
:What are the parameters that I'm going to
derive some type of thing by hand?
180
:But for probabilistic programs, we can't do that.
181
:So we might like to simulate data from the
program and do some type of analysis based
182
:on samples of the program as compared to
samples of the observed data.
183
:So these type of simulation -based
analyses of statistical properties of
184
:probabilistic programs for testing their
behavior or for quantifying the
185
:information between variables, things like
that.
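To make that concrete, here is a minimal sketch of such a simulation-based check in Julia; the generative program, summary statistic, and observed data below are invented stand-ins, not part of any specific library:

```julia
using Statistics, Random

# Stand-in for a probabilistic program: simulates a synthetic dataset of n points.
simulate_dataset(rng, n) = 2.0 .+ randn(rng, n)

# Summary statistic used to compare simulated and observed datasets.
summary_stat(y) = mean(y)

# Simulation-based goodness-of-fit check: how extreme is the observed statistic
# relative to statistics computed on datasets simulated from the program?
function predictive_check(rng, y_obs; n_sims=1_000)
    t_obs = summary_stat(y_obs)
    t_sim = [summary_stat(simulate_dataset(rng, length(y_obs))) for _ in 1:n_sims]
    center = mean(t_sim)
    mean(abs.(t_sim .- center) .>= abs(t_obs - center))  # two-sided tail fraction
end

rng = MersenneTwister(1)
y_obs = 5.0 .+ randn(rng, 50)          # data the stand-in program explains poorly
println(predictive_check(rng, y_obs))  # a tiny value flags a poor fit
```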
186
:And then the final area I'll touch upon is
really more at the foundational level,
187
:which is.
188
:understanding what are the primitive
operations, a more rigorous or principled
189
:understanding of the primitive operations
on our computers that enable us to do
190
:random computations.
191
:So what do I mean by that?
192
:Well, you know, we love to assume that our
computers can freely compute over real
193
:numbers.
194
:But of course, computers don't have real
numbers built within them.
195
:They're built on finite precision
machines, right, which means I can't
196
:express.
197
:some arbitrary division between two real
numbers.
198
:Everything is at some level it's floating
point.
199
:And so this gives us a gap between the
theory and the practice.
200
:Because in theory, you know, whenever
we're writing our models, we assume
201
:everything is in this, you know,
infinitely precise universe.
202
:But when we actually implement it, there's
some level of approximation.
203
:So I'm interested in understanding first,
theoretically, what is this approximation?
204
:How important is it that I'm actually
treating my model as running on an
205
:infinitely precise machine where I
actually have finite precision?
206
:And second, what are the implications of
that gap for Bayesian inference?
207
:Does it mean that now I actually have some
208
:properties of my Markov chain that no
longer hold because I'm actually running
209
:it on a finite precision machine whereby
all my analysis was assuming I have an
210
:infinite precision or what does it mean
about the actual variables we generate?
211
:So, you know, we might generate a Gaussian
random variable, but in practice, the
212
:variable we're simulating has some other
distribution.
213
:Can we theoretically quantify that other
distribution and its error with respect to
214
:the true distribution?
215
:Or have we come up with sampling
procedures that are as close as possible
216
:to the ideal real value distribution?
217
:And so this brings together ideas from
information theory, from theoretical
218
:computer science.
219
:And one of the motivations is to thread
those results through into the actual
220
:Bayesian inference procedures that we
implement using probabilistic programming
221
:languages.
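As one small, hedged illustration of that gap (using Distributions.jl; the exact bound depends on the sampler): even an "exact" inverse-CDF Gaussian sampler driven by Float64 uniforms has bounded support, because there are only finitely many representable uniforms below 1.

```julia
using Distributions

# Largest Float64 strictly below 1, i.e. the most extreme uniform draw available.
u_max = prevfloat(1.0)

# The largest value an inverse-CDF Gaussian sampler could ever return from such
# uniforms; an ideal real-valued Gaussian, by contrast, has unbounded support.
z_max = quantile(Normal(), u_max)
println(z_max)   # on the order of 8, not "infinitely precise"
```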
222
:So that's just, you know, an overview of
these three or four different areas that
223
:I'm interested in and I've been working on
recently.
224
:Yeah, that's amazing.
225
:Thanks a lot for these, like full panel of
what you're doing.
226
:And yeah, that's just incredible also that
you're doing so many things.
227
:I'm really impressed.
228
:And of course we're going to dive a bit
into these, at least some of these topics.
229
:I don't want to take three hours of your
time, but...
230
:Before that though, I'm curious if you
remembered when and how you first got
231
:introduced to Bayesian inference and also
why it's ticked with you because it seems
232
:like it's underpinning most of your work,
at least that idea of probabilistic
233
:programming.
234
:Yeah, that's a good question.
235
:I think I was first interested in
probability before I was interested in
236
:Bayesian inference.
237
:I remember...
238
:I used to read a book by Mosteller called
50 Challenging Problems in Probability.
239
:I took a course in high school and I
thought, how could I actually use these
240
:cool ideas for fun?
241
:And there was actually a very nice book
written back in the 50s by Mosteller.
242
:So that got me interested in probability
and how we can use probability to reason
243
:about real world phenomena.
244
:So the book that...
245
:that I used to read would sort of have
these questions about, you know, if
246
:someone misses a train and the train has a
certain schedule, what's the probability
247
:that they'll arrive at the right time?
248
:And it's a really nice book because it
ties in our everyday experiences with
249
:probabilistic modeling and inference.
250
:And so I thought, wow, this is actually a
really powerful paradigm for reasoning
251
:about the everyday things that we do,
like, you know, missing a bus and knowing
252
:something about its schedule and when's
the right time that I should arrive to
253
:maximize the probability of, you know,
some
254
:event of interest, things like that.
255
:So that really got me hooked to the idea
of probability.
256
:But I think what really connected Bayesian
inference to me was taking, I think this
257
:was as a senior or as a first year
master's student, a course by Professor
258
:Josh Tenenbaum at MIT, which is
computational cognitive science.
259
:And that course has evolved.
260
:quite a lot through the years, but the
version that I took was really a beautiful
261
:synthesis of lots of deep ideas of how
Bayesian inference can tell us something
262
:meaningful about how humans reason about,
you know, different empirical phenomena
263
:and cognition.
264
:So, you know, in cognitive science for,
you know, for...
265
:the majority of the history of the field,
people would run these experiments on
266
:humans and they would try and analyze
these experiments using some type of, you
267
:know, frequentist statistics or they would
not really use generative models to
268
:describe how humans are are solving a
particular experiment.
269
:But the, you know, Professor Tenenbaum's
approach was to use Bayesian models.
270
:as a way of describing or at least
emulating the cognitive processes that
271
:humans do for solving these types of
cognition tasks.
272
:And by cognition tasks, I mean, you know,
simple experiments you might ask a human
273
:to do, which is, you know, you might have
some dots on a screen and you might tell
274
:them, all right, you've seen five dots,
why don't you extrapolate the next five?
275
:Just simple things that, simple cognitive
experiments or, you know, yeah, so.
276
:I think that being able to use Bayesian
models to describe very simple cognitive
277
:phenomena was another really appealing
prospect to me throughout that course.
278
:I'm seeing all the ways in which that
manifested in very nice questions about.
279
:how do we do efficient inference in real
time?
280
:Because humans are able to do inference
very quickly.
281
:And Bayesian inference is obviously very
challenging to do.
282
:But then, if we actually want to engineer
systems, we need to think about the hard
283
:questions of efficient and scalable
inference in real time, maybe at human
284
:level speeds.
285
:Which brought in a lot of the reason for
why I'm so interested in inference as
286
:well.
287
:Because that's one of the harder aspects
of Bayesian computing.
288
:And then I think a third thing which
really hooked me to Bayesian inference was
289
:taking a machine learning course and kind
of comparing.
290
:So the way these machine learning courses
work is they'll teach you empirical risk
291
:minimization, and then they'll teach you
some type of optimization, and then
292
:there'll be a lecture called Bayesian
inference.
293
:And...
294
:What was so interesting to me at the time
was up until the time, up until the
295
:lecture where we learned anything about
Bayesian inference, all of these machine
296
:learning concepts seem to just be a
hodgepodge of random tools and techniques
297
:that people were using.
298
:So I, you know, there's the support vector
machine and it's good at classification
299
:and then there's the random forest and
it's good at this.
300
:But what's really nice about using
Bayesian inference in the machine learning
301
:setting, or at least what I found
appealing was how you have a very clean
302
:specification of the problem that you're
trying to solve in terms of number one, a
303
:prior distribution.
304
:over parameters and observable data, and
then the actual observed data, and three,
305
:which is the posterior distribution that
you're trying to infer.
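In symbols, that three-part specification is just Bayes' rule: the prior and likelihood fix the joint model, the observed data fix the conditioning, and the posterior is the inference target:

```latex
p(\theta \mid y) \;=\; \frac{p(y \mid \theta)\, p(\theta)}{\int p(y \mid \theta')\, p(\theta')\, \mathrm{d}\theta'} \;\propto\; p(y \mid \theta)\, p(\theta).
```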
306
:So you can use a very nice high -level
specification of what is even the problem
307
:you're trying to solve before you even
worry about how you solve it.
308
:you can very cleanly separate modeling and
inference, whereby most of the machine
309
:learning techniques that I was initially
reading or learning about seem to be only
310
:focused on how do I infer something
without crisply formalizing the problem
311
:that I'm trying to solve.
312
:And then, you know, just, yeah.
313
:And then, yeah.
314
:So once we have this Bayesian posterior
that we're trying to infer, then maybe
315
:we'll do fully Bayesian inference, or
maybe we'll do approximate Bayesian
316
:inference, or maybe we'll just do maximum
likelihood.
317
:That's maybe less of a detail.
318
:The more important detail is we have a
very clean specification for our problem
319
:and we can, you know, build in our
assumptions.
320
:And as we change our assumptions, we
change the specification.
321
:So it seemed like a very systematic way,
very systematic way to build machine
322
:learning and artificial intelligence
pipelines.
323
:using a principled process that I found
easy to reason about.
324
:And I didn't really find that in the other
types of machine learning approaches that
325
:we learned in the class.
326
:So yeah, so I joined the probabilistic
computing project at MIT, which is run by
327
:my PhD advisor, Dr.
328
:Vikash Mansinga.
329
:And, um, I really got the opportunity to
explore these interests at the research
330
:level, not only in classes.
331
:And that's, I think where everything took
off afterwards.
332
:Those are the synthesis of various things,
I think that got me interested in the
333
:field.
334
:Yeah.
335
:Thanks a lot for that, for that, that
that's super interesting to see.
336
:And, uh, I definitely relate to the idea
of these, um, like the Bayesian framework
337
:being, uh, attractive.
338
:not because it's a toolbox, but because
it's more of a principle based framework,
339
:basically, where instead of thinking, oh
yeah, what tool do I need for that stuff,
340
:it's just always the same in a way.
341
:To me, it's cool because you don't have to
be smart all the time in a way, right?
342
:You're just like, it's the problem takes
the same workflow.
343
:It's not going to be the same solution.
344
:But it's always the same workflow.
345
:Okay.
346
:What does the data look like?
347
:How can we model that?
348
:Where is the data generative story?
349
:And then you have very different
challenges all the time and different
350
:kinds of models, but you're not thinking
about, okay, what is the ready made model
351
:that they can apply to these data?
352
:It's more like how can I create a custom
model to these data knowing the
353
:constraints I have about my problem?
354
:And.
355
:thinking in a principled way instead of
thinking in a toolkit way.
356
:I definitely relate to that.
357
:I find that amazing.
358
:I'll just add to that, which is this is
not only some type of aesthetic or
359
:theoretical idea.
360
:I think it's actually strongly tied into
good practice that makes it easier to
361
:solve problems.
362
:And by that, what do I mean?
363
:Well, so I did a very brief undergraduate
research project in a biology lab,
364
:computational biology lab.
365
:And just looking at the empirical workflow
that was done,
366
:made me very suspicious about the process,
which is, you know, you might have some
367
:data and then you'll hit it with PCA and
you'll get some projection of the data and
368
:then you'll use a random forest classifier
and you're going to classify it in
369
:different ways.
370
:And then you're going to use the
classification and some type of logistic
371
:regression.
372
:So you're just chaining these ad hoc
different data analyses to come up with
373
:some final story.
374
:And while that might be okay to get you
some specific result, it doesn't really
375
:tell you anything about how changing one
modeling choice in this pipeline.
376
:is going to impact your final inference
because this sort of mix and match
377
:approach of applying different ad hoc
estimators to solve different subtasks
378
:doesn't really give us a way to iterate on
our models, understand their limitations
379
:very well, knowing their sensitivity to
different choices, or even building
380
:computational systems that automate a lot
of these things, right?
381
:Like probabilistic programs.
382
:Like you're saying, we can write our data
generating process as the workflow itself,
383
:right?
384
:Rather than, you know, maybe in Matlab
I'll run PCA and then, you know, I'll use
385
:scikit -learn and Python.
386
:Without, I think, this type of prior
distribution over our data, it becomes
387
:very hard to reason formally about our
entire inference workflow, which would...
388
:know, which probabilistic programming
languages are trying to make easier and
389
:give a more principled approach that's
more amenable to engineering, to
390
:optimization, to things of that sort.
391
:Yeah.
392
:Yeah, yeah.
393
:Fantastic point.
394
:Definitely.
395
:And that's also the way I personally tend
to teach Bayesian stats.
396
:Now it's much more on a, let's say,
principle-based and
397
:workflow -based instead of just...
398
:Okay, Poisson regression is this
multinomial regression is that I find that
399
:much more powerful because then when
students get out in the wild, they are
400
:used to first think about the problem and
then try to see how they could solve it
401
:instead of just trying to find, okay,
which model is going to be the most.
402
:useful here in the models that I already
know, because then if the data are
403
:different, you're going to have a lot of
problems.
404
:Yeah.
405
:And so you actually talked about the
different topics that you work on.
406
:There are a lot I want to ask you about.
407
:One of my favorites, and actually I think
Colin also has been working a bit on that
408
:lately.
409
:is the development of AutoGP.jl.
410
:So I think that'd be cool to talk about
that.
411
:What inspired you to develop that package,
which is in Julia?
412
:Maybe you can also talk about that if you
mainly develop in Julia most of the time,
413
:or if that was mostly useful for that
project.
414
:And how does this package...
415
:advance, like, help learn the structure
of Gaussian process kernels, because if I
416
:understand correctly, that's what the
package is mostly about.
417
:So yeah, if you can give a primer to
listeners about that.
418
:Definitely.
419
:Yes.
420
:So Gaussian Processes are a pretty
standard model that's used in many
421
:different application areas.
422
:spatial temporal statistics and many
engineering applications based on
423
:optimization.
424
:So these Gaussian process models are
parameterized by covariance functions,
425
:which specify how the data produced by
this Gaussian process co -varies across
426
:time, across space, across any domain
which you're able to define some type of
427
:covariance function.
428
:But one of the main challenges in using a
Gaussian process for modeling your data,
429
:is making the structural choice about what
should the covariance structure be.
430
:So, you know, the one of the universal
choices or the most common choices is to
431
:say, you know, some type of a radial basis
function for my data, the RBF kernel, or,
432
:you know, maybe a linear kernel or a
polynomial kernel, somehow hoping that
433
:you'll make the right choice to model your
data accurately.
434
:So the inspiration for auto GP or
automatic Gaussian process is to try and
435
:use the data not only to infer the numeric
parameters of the Gaussian process, but
436
:also the structural parameters or the
actual symbolic structure of this
437
:covariance function.
438
:And here we are drawing our inspiration
from work which is maybe almost 10 years
439
:now from David Duvenaud and colleagues
called the Automated Statistician Project,
440
:or ABCD, Automatic Bayesian Covariance
Discovery, which introduced this idea of
441
:defining a symbolic language.
442
:over Gaussian process covariance functions
or covariance kernels and using a grammar,
443
:using a recursive grammar and trying to
infer an expression in that grammar given
444
:the observed data.
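As a rough illustration of that idea, here is a minimal Julia sketch of a recursive kernel grammar and a prior that samples expressions from it; the types, parameters, and probabilities are illustrative only and are not AutoGP.jl's actual DSL:

```julia
using Random

# A tiny symbolic language of covariance kernels (illustrative, not AutoGP.jl's DSL).
abstract type Kernel end
struct Linear   <: Kernel; bias::Float64 end
struct Periodic <: Kernel; scale::Float64; period::Float64 end
struct RBF      <: Kernel; scale::Float64 end
struct Sum      <: Kernel; left::Kernel; right::Kernel end
struct Product  <: Kernel; left::Kernel; right::Kernel end

# Recursive prior over kernel expressions: with probability p_stop emit a base
# kernel, otherwise combine two recursively sampled subexpressions with + or *.
function sample_kernel(rng; p_stop=0.6)
    if rand(rng) < p_stop
        choice = rand(rng, 1:3)
        choice == 1 && return Linear(rand(rng))
        choice == 2 && return Periodic(rand(rng), 1 + rand(rng))
        return RBF(rand(rng))
    else
        op = rand(rng) < 0.5 ? Sum : Product
        return op(sample_kernel(rng; p_stop), sample_kernel(rng; p_stop))
    end
end

k = sample_kernel(MersenneTwister(42))
println(k)   # e.g. Sum(RBF(…), Product(Linear(…), Periodic(…, …)))
```

Structure inference then amounts to conditioning this prior over expressions (and their numeric parameters) on the observed time series.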
445
:So, you know, in a time series setting,
for example, you might have time on the
446
:horizontal axis and the variable on the y
-axis and you just have some variable
447
:that's evolving.
448
:You don't know necessarily the dynamics of
that, right?
449
:There might be some periodic structure in
the data or there might be multiple
450
:periodic effects.
451
:Or there might be a linear trend that's
overlaying the data.
452
:Or there might be a point in time in which
the data is switching between some process
453
:before the change point and some process
after the change point.
454
:Obviously, for example, in the COVID era,
almost all macroeconomic data sets had
455
:some type of change point around April
2020.
456
:And we see that in the empirical data that
we're analyzing today.
457
:So the question is, how can we
automatically surface these structural
458
:choices?
459
:using Bayesian inference.
460
:So the original approach that was in the
automated statistician was based on a type
461
:of greedy search.
462
:So they were trying to say, let's find the
single kernel that maximizes the
463
:probability of the data.
464
:Okay.
465
:So they're trying to do a greedy search
over these kernel structures for Gaussian
466
:processes using these different search
operators.
467
:And for each different kernel, you might
find the maximum likelihood parameter, et
468
:cetera.
469
:And I think that's a fine approach.
470
:But it does run into some serious
limitations, and I'll mention a few of
471
:them.
472
:One limitation is that greedy search is in
a sense not representing any uncertainty
473
:about what's the right structure.
474
:It's just finding a single best structure
to maximize some probability or maybe
475
:likelihood of the data.
476
:But we know just like parameters are
uncertain, structure can also be quite
477
:uncertain because the data is very noisy.
478
:We may have sparse data.
479
:And so, you know, we'd want type of
inference systems that are more robust.
480
:when discovering the temporal structure in
the data and that greedy search doesn't
481
:really give us that level of robustness
through expressing posterior uncertainty.
482
:I think another challenge with greedy
search is its scalability.
483
:And by that, if you have a very large data
set in a greedy search algorithm, we're
484
:typically at each stage of the search,
we're looking at the entire data set to
485
:score our model.
486
:And this is also true of traditional Markov
chain Monte Carlo algorithms.
487
:We often score our data set, but in the
Gaussian process setting, scoring the data
488
:set is very expensive.
489
:If you have N data points, it's going to
cost you N cubed.
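The cubic cost comes from the GP log marginal likelihood that every candidate structure and parameter setting has to be scored with; solving against and factorizing the N-by-N covariance matrix (e.g. via a Cholesky decomposition) is the O(N^3) step:

```latex
\log p(y \mid k, \theta) \;=\; -\tfrac{1}{2}\, y^{\top} \big(K_{\theta} + \sigma^{2} I\big)^{-1} y \;-\; \tfrac{1}{2} \log \big|K_{\theta} + \sigma^{2} I\big| \;-\; \tfrac{N}{2} \log 2\pi.
```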
490
:And so it becomes quite infeasible to run
greedy search or even pure Markov chain
491
:Monte Carlo, where at each step, each time
you change the parameters or you change
492
:the kernel, you need to now compute the
full likelihood.
493
:And so the second motivation in AutoGP is
to build an inference algorithm.
494
:that is not looking at the whole data set
at each point in time, but using subsets
495
:of the data set that are sequentially
growing.
496
:And that's where the sequential Monte
Carlo inference algorithm comes in.
497
:So AutoGP is implemented in Julia.
498
:And the API is that basically you give it
a one -dimensional time series.
499
:You hit infer.
500
:And then it's going to report an ensemble
of Gaussian processes or a sample from my
501
:posterior distribution, where each
Gaussian process has some particular
502
:structure and some numeric parameters.
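A hedged sketch of that workflow is below. The function and keyword names follow the AutoGP.jl tutorials as I recall them, so treat them as assumptions and check the package documentation for the current API; the toy data are invented for illustration.

```julia
using AutoGP

# Hypothetical one-dimensional time series: numeric time indices and observed values.
ds = collect(1.0:100.0)
y  = sin.(2pi .* ds ./ 12) .+ 0.05 .* ds .+ 0.1 .* randn(100)

# Build a model with a small particle ensemble and run SMC structure inference.
# (Names follow the AutoGP.jl tutorials; verify against the current docs.)
model = AutoGP.GPModel(ds, y; n_particles=8)
AutoGP.fit_smc!(model; schedule=AutoGP.Schedule.linear_schedule(length(y), 0.10),
                n_mcmc=50, n_hmc=10, verbose=true)

# Each particle carries a symbolic kernel structure, its numeric parameters, and a weight.
display(AutoGP.covariance_kernels(model))
display(AutoGP.particle_weights(model))

# Probabilistic forecasts from the weighted ensemble, with predictive quantiles.
ds_future = collect(101.0:112.0)
forecasts = AutoGP.predict(model, ds_future; quantiles=[0.025, 0.975])
```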
503
:And you can show the user, hey, I've
inferred these hundred GPs from my
504
:posterior.
505
:And then they can start using them for
generating predictions.
506
:You can use them to find outliers because
these are probabilistic models.
507
:You can use them for a lot of interesting
tasks.
508
:Or you might say, you know,
509
:This particular model actually isn't
consistent with what I know about the
510
:data.
511
:So you might remove one of the posterior
samples from your ensemble.
512
:Yeah, so those are, you know, we used
AutoGP on the M3.
513
:We benchmarked it on the M3 competition
data.
514
:M3 is around, or the monthly data sets in
M3 are around 1,500 time series, you
515
:know, between 100 and 500 observations in
length.
516
:And we compared the performance against
different statistics baselines and machine
517
:learning baselines.
518
:And it's actually able to find pretty
common sense structures in these economic
519
:data.
520
:Some of them have seasonal features,
multiple seasonal effects as well.
521
:And what's interesting is we don't need to
customize the prior to analyze each data
522
:set.
523
:It's essentially able to discover.
524
:And what's also interesting is that
sometimes when the data set just looks
525
:like a random walk, it's going to learn a
covariance structure, which emulates a
526
:random walk.
527
:So by having a very broad prior
distribution on the types of covariance
528
:structures that you see, it's able to find
which of these are plausible explanation
529
:given the data.
530
:Yes, as you mentioned, we implemented this
in Julia.
531
:The reason is that AutoGP is built on the
Gen probabilistic programming language,
532
:which is embedded in the Julia language.
533
:And the reason that Gen, I think, is a
very useful system for this problem.
534
:So Gen was developed primarily by Marco
Cusumano-Towner, who wrote a PhD thesis.
535
:He was a colleague of mine at the MIT
Probabilistic Computing Project.
536
:And Gen really, it's a Turing complete
language and has programmable inference.
537
:So you're able to write a prior
distribution over these symbolic
538
:expressions in a very natural way.
539
:And you're able to customize an inference
algorithm that's able to solve this
540
:problem efficiently.
541
:And
542
:What really drew us to Gen for this
problem, I think, are twofold.
543
:The first is its support for sequential
Monte Carlo inference.
544
:So it has a pretty mature library for
doing sequential Monte Carlo.
545
:And sequential Monte Carlo construed more
generally than just particle filtering,
546
:but other types of inference over
sequences of probability distributions.
547
:So particle filters are one type of
sequential Monte Carlo algorithm you might
548
:write.
549
:But you might do some type of temperature
annealing or data annealing or other types
550
:of sequentialization strategies.
551
:And Gen provides a very nice toolbox and
abstraction for experimenting with
552
:different types of sequential Monte Carlo
approaches.
553
:And so we definitely made good use of that
library when developing our inference
554
:algorithm.
555
:The second reason I think that Gen was
very nice to use is its library for
556
:involutive MCMC.
557
:And involutive MCMC, it's a relatively new
framework.
558
:It was discovered, I think, concurrently.
559
:and independently both by Marco and other
folks.
560
:And this is kind of, you can think of it
as a generalization of reversible jump
561
:MCMC.
562
:And it's really a unifying framework to
understand many different MCMC algorithms
563
:using a common terminology.
564
:And so there's a wonderful ICML paper
which lists 30 or so different algorithms
565
:that people use all the time like
Hamiltonian Monte Carlo, reversible jump
566
:MCMC, Gibbs sampling, Metropolis Hastings.
567
:and expresses them using the language of
involutive MCMC.
568
:I believe the author is Kirill Neklyudov,
although I might be mispronouncing that,
569
:sorry for that.
570
:So, Gen has a library for involutive MCMC,
which makes it quite easy to write
571
:different proposals for how you do this
inference over your symbolic expressions.
572
:Because when you're doing MCMC within the
inner loop of a sequential Monte Carlo
573
:algorithm,
574
:You need to somehow be able to improve
your current symbolic expressions for the
575
:covariance kernel, given the observed
data.
576
:And, uh, doing that is, is hard because
this is kind of a reversible jump
577
:algorithm where you make a structural
change.
578
:Then you need to maybe generate some new
parameters.
579
:You need the reverse probability of going
back.
580
:And so Gen has a high level, has a lot of
automation and a library for implementing
581
:these types of structure moves in a very
high level way.
582
:And it automates the low level math for.
583
:computing the acceptance probability and
embedding all of that within an outer
584
:level SMC loop.
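Concretely, the "low-level math" being automated is the reversible-jump style acceptance probability for a structure move: propose a new kernel expression k' from the current expression k (for example by regenerating a random subtree, along with any fresh parameters), and accept it with

```latex
\alpha(k \to k') \;=\; \min\!\left(1,\; \frac{p(y \mid k')\, p(k')\, q(k \mid k')}{p(y \mid k)\, p(k)\, q(k' \mid k)}\right),
```

where p(k) is the grammar prior, p(y | k) the marginal likelihood under that structure, and q the forward and reverse proposals; involutive MCMC supplies the bookkeeping that keeps this ratio well defined when the move changes the dimension of the parameter space.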
585
:And so this is, I think, one of my
favorite examples for what probabilistic
586
:programming can give us, which is very
expressive priors over these, you know,
587
:symbolic expressions generated by symbolic
grammars, powerful inference algorithms
588
:using combinations of sequential Monte
Carlo and involutive MCMC and reversible
589
:jump moves and gradient based inference
over the parameters.
590
:It really brings together a lot of the
591
:a lot of the strengths of probabilistic
programming languages.
592
:And we showed at least on these M3
datasets that they can actually be quite
593
:competitive with state -of -the -art
solutions, both in statistics and in
594
:machine learning.
595
:I will say, though, that as with
traditional GPs, the scalability is really
596
:in the likelihood.
597
:So whether AutoGP can handle datasets with
10,000 data points, that's actually quite
598
:hard because ultimately,
599
:Once you've seen all the data in your
sequential Monte Carlo, you will be forced
600
:to do this sort of N cubed scaling, which
then, you know, you need some type of
601
:improvements or some type of approximation
for handling larger data.
602
:But I think what's more interesting in
AutoGP is not necessarily that it's
603
:applied to inferring structures of
Gaussian processes, but that it's sort of
604
:a library for inferring probabilistic
structure and showing how to do that by
605
:integrating these different inference
methodologies.
606
:Hmm.
607
:Okay.
608
:Yeah, so many things here.
609
:So first, I put all the links to AutoGP.jl
in the show notes.
610
:I also put a link to the underlying paper
that you've written with some co -authors
611
:about, well, the sequential Monte Carlo
learning that you're doing to discover
612
:these time -series structure for people
who want to dig deeper.
613
:And I put also a link to all, well, most
of the LBS episodes where we talk about
614
:Gaussian processes for people who need a
bit more background information because
615
:here we're mainly going to talk about how
you do that and so on and how useful is
616
:it.
617
:And we're not going to give a primer on
what Gaussian processes are.
618
:So if you want that, folks, there are a
bunch of episodes in the show notes for
619
:that.
620
:So...
621
:on the practical utility of
that time-series discovery.
622
:So if understood correctly, for now, you
can do that only on one -dimensional input
623
:data.
624
:So that would be basically on a time
series.
625
:You cannot input, let's say, that you have
categories.
626
:These could be age groups.
627
:So.
628
:you could one -hot, usually I think that's
the way it's done, how to give that to a
629
:GP would be to one -hot encode each of
these age groups.
630
:And then that means, let's say you have
four age groups.
631
:Now the input dimension of your GP is not
one, which is time, but it's five.
632
:So one for time and four for the age
groups.
633
:This would not work here, right?
634
:Right, yes.
635
:So at the moment, we're focused on, and
these are called, I guess, in
636
:econometrics, pure time series models,
where you're only trying to do inference
637
:on the time series based on its own
history.
638
:I think the extensions that you're
proposing are very natural to consider.
639
:You might have a multi -input Gaussian
process where you're not only looking at
640
:your own history, but you're also
considering some type of categorical
641
:variable.
642
:Or you might have exogenous covariates
evolving along with the time series.
643
:If you want to predict temperature, for
example, you might have the wind speed and
644
:you might want to use that as a feature
for your Gaussian process.
645
:Or you might have an output, a multiple
output Gaussian process.
646
:You want a Gaussian process over multiple
different time series generally.
647
:And I think all of these variants are, you
know, they're possible to develop.
648
:There's no fundamental difficulty, but the
main, I think the main challenge is how
649
:can you define a domain specific language
over these covariance structures for
650
:multi, for multivariate input data?
651
:becomes a little bit more challenging.
652
:So in the time series setting, what's nice
is we can interpret how any type of
653
:covariance kernel is going to impact the
actual prior over time series.
654
:Once we're in the multi -dimensional
setting, we need to think about how to
655
:combine the kernels for different
dimensions in a way that's actually
656
:meaningful for modeling to ensure that
it's more tractable.
657
:But I think extensions of the DSL to
handle multiple inputs, exogenous
658
:covariates, multiple outputs,
659
:These are all great directions.
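For reference, one common way to combine kernels across input dimensions (hedged: this is standard GP practice, not something the current AutoGP.jl DSL does) is a separable product of a temporal kernel with a coregionalization term over a categorical input c:

```latex
k\big((t, c),\, (t', c')\big) \;=\; k_{\text{time}}(t, t') \cdot B_{c, c'}, \qquad B \succeq 0,
```

where B is a positive semidefinite matrix over the categories; taking B to be the identity gives independent groups that share one temporal structure, while off-diagonal entries let groups borrow strength from each other.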
660
:And I'll just add on top of that, I think
another important direction is using some
661
:of the more recent approximations for
Gaussian processes.
662
:So we're not bottlenecked by the n cubed
scaling.
663
:So there are, I think, a few different
approaches that have been developed.
664
:There are approaches which are based on
stochastic PDEs or state space
665
:approximations of Gaussian processes,
which are quite promising.
666
:There's some other things like nearest
neighbor Gaussian processes, but I'm a
667
:little less confident about those because
we lose a lot of the nice affordances of
668
:GPs once we start doing nearest neighbor
approximations.
669
:But I think there's a lot of new methods
for approximate GPs.
670
:So we might do a stochastic variational
inference, for example, an SVGP.
671
:So I think as we think about handling more
672
:more richer types of data, then we should
also think about how to start introducing
673
:some of these more scalable approximations
to make sure we can still efficiently do
674
:the structure learning in that setting.
675
:Yeah, that would be awesome for sure.
676
:As a more, much more on the practitioner
side than on the math side.
677
:Of course, that's where my head goes
first.
678
:You know, I'm like, oh, that'd be awesome,
but I would need to have that to have it
679
:really practical.
680
:Um, and so if I use AutoGP.jl,
so I give it a time series data.
681
:Um, then what do I get back?
682
:Do I get back, um, the posterior samples of
the, the implied model, or do I get back
683
:the covariance structure?
684
:So that could be, I don't know what, what
form that could be, but I'm thinking, you
685
:know,
686
:Uh, often when I use GPs, I use them
inside other models with other, like I
687
:could use a GP in a linear regression, for
instance.
688
:And so I'm thinking that'd be cool if I'm
not sure about the covariance structure,
689
:especially if it can do the discovery of
the seasonality and things like that
690
:automatically, because it's always
seasonality is a bit weird and you have to
691
:add another GP that can handle
periodicity.
692
:Um, and then you have basically a sum of
GP.
693
:And then you can take that sum of GP and
put that in the linear predictor of the
694
:linear regression.
695
:That's usually how I use that.
696
:And very often, I'm using categorical
predictors almost always.
697
:And I'm thinking what would be super cool
is that I can outsource that discovery
698
:part of the GP to the computer like you're
doing with this algorithm.
699
:And then I get back under what form?
700
:I don't know yet.
701
:I'm just thinking about that.
702
:this covariance structure that I can just,
which would be an MV normal, like a
703
:multivariate normal in a way, that I just use
in my linear predictor.
704
:And then I can use that, for instance, in
a PyMC model or something like that,
705
:without to specify the GP myself.
706
:Is it something that's doable?
707
:Yeah, yeah, I think that's absolutely
right.
708
:So you can, because Gaussian processes are
compositional, just, you know, you
709
:mentioned the sum of two Gaussian
processes, which corresponds to the sum of
710
:two kernel.
711
:So if I have Gaussian process one plus
Gaussian process two, that's the same as
712
:the Gaussian process whose covariance is
k1 plus k2.
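In symbols, for independent processes:

```latex
f_1 \sim \mathcal{GP}(m_1, k_1),\;\; f_2 \sim \mathcal{GP}(m_2, k_2) \;\;\Longrightarrow\;\; f_1 + f_2 \sim \mathcal{GP}(m_1 + m_2,\; k_1 + k_2).
```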
713
:And so what that means is we can take our
synthesized kernel, which is comprised of
714
:some base kernels and then maybe sums and
products and change points, and we can
715
:wrap all of these in just one mega GP,
basically, which would encode the entire
716
:posterior distribution or, you know,
717
:a summary of all of the samples in one GP.
718
:Another, and I think you also mentioned an
important point, which is multivariate
719
:normals.
720
:You can also think of the posterior as
just a mixture of these multivariate
721
:normals.
722
:So let's say I'm not going to sort of
compress them into a single GP, but I'm
723
:actually going to represent the output of
auto GP as a mixture of multivariate
724
:normals.
725
:And that would be another type of API.
726
:So depending on exactly what type of
727
:how you're planning to use the GP, I think
you can use the output of auto GP in the
728
:right way, because ultimately, it's
producing some covariance kernels, you
729
:might aggregate them all into a GP, or you
might compose them together to make a
730
:mixture of GPs.
731
:And you can export this to PyTorch, or
most of the current libraries for GPs
732
:support composing the GPs with one
another, et cetera.
733
:So I think depending on the use case, it
should be quite straightforward to figure
734
:out how to leverage the output of AutoGP
to use within the inner loop of some broader model
735
:or within the internals of some larger
linear regression model or other type of
736
:model.
737
:Yeah, that's definitely super cool because
then you can, well, yeah, use that,
738
:outsource that part of the model where I
think the algorithm probably...
739
:If not now, in just a few years, it's
going to make do a better job than most
740
:modelers, at least to have a rough first
draft.
741
:That's right.
742
:The first draft.
743
:A data scientist who's determined enough
to beat AutoGP, probably they can do it if
744
:they put in enough effort just to study
the data.
745
:But it's getting a first pass model that's
actually quite good as compared to other
746
types of automated techniques.
747
:Yeah, exactly.
748
:I mean, that's recall.
749
:It's like asking for a first draft of, I
don't know, blog post to ChatGPT and then
750
:going yourself in there and improving it
instead of starting everything from
751
:scratch.
752
:Yeah, for sure you could do it, but that's
not where your value added really lies.
753
:So yeah.
754
:So what you get is these kind of samples.
755
:In a way, do you get back samples?
756
:or do you get symbolic variables back?
757
:You get symbolic expressions for the
covariance kernels as well as the
758
:parameters embedded within them.
759
:So you might get, let's say you asked for
five posterior samples, you're going to
760
:have maybe one posterior sample, which is
a linear kernel.
761
:And then another posterior sample, which
is a linear times linear, so a quadratic
762
:kernel.
763
:And then maybe a third posterior sample,
which is again, a linear, and each of them
764
:will have their different parameters.
765
:And because we're using sequential Monte
Carlo,
766
:all of the posterior samples are
associated with weights.
767
:The sequential Monte Carlo returns a
weighted particle collection, which is
768
:approximating the posterior.
769
:So you get back these weighted particles,
which are symbolic expressions.
770
:And we have, in AutoGP, we have a minimal
prediction GP library.
771
:So you can actually put these symbolic
expressions into a GP to get a functional
772
:GP, but you can export them to a text file
and then use your favorite GP library and
773
:embed them within that as well.
774
:And we also get noise parameters.
775
:So each kernel is going to be associated
with the output noise.
776
:Because obviously depending on what kernel
you use, you're going to infer a different
777
:noise level.
778
:So you get a kernel structure, parameters,
and noise for each individual particle in
779
:your SMC ensemble.
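Downstream, a natural way to use those weighted particles is as a mixture: each particle's GP yields a predictive mean and covariance at your query points, and the ensemble prediction combines them with the SMC weights. A generic, hedged sketch in plain Julia, with invented inputs (this is not an AutoGP.jl API):

```julia
using LinearAlgebra

# weights[i]: normalized SMC weight of particle i (sums to 1).
# means[i]:   predictive mean vector of particle i's GP at the query points.
# covs[i]:    predictive covariance matrix of particle i's GP at the query points.
function mixture_predictive(weights, means, covs)
    mu = sum(w .* m for (w, m) in zip(weights, means))   # mixture mean
    # Mixture covariance: weighted per-particle covariances plus the spread
    # of the particle means, which is where structure uncertainty shows up.
    sigma = sum(w .* (C .+ (m .- mu) * (m .- mu)') for (w, m, C) in zip(weights, means, covs))
    return mu, Symmetric(sigma)
end

# Example with two hypothetical particles over three query points.
w  = [0.7, 0.3]
ms = [[1.0, 1.1, 1.2], [0.8, 1.0, 1.3]]
Cs = [0.1 * Matrix(I, 3, 3), 0.2 * Matrix(I, 3, 3)]
mu, sigma = mixture_predictive(w, ms, Cs)
```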
780
:OK, I see.
781
:Yeah, super cool.
782
:And so yeah, if you can get back that as a
text file.
783
:Like either you use it in a full Julia
program, or if you prefer R or Python, you
784
could use AutoGP.jl just for that.
785
:Get back a text file and then use that in
R or in Python in another model, for
786
:instance.
787
:Okay.
788
:That's super cool.
789
:Do you have examples of that?
790
:Yeah.
791
:Do you have examples of that we can link
to for listeners in the show notes?
792
:We have a tutorial.
793
:And so...
794
:The tutorial, I think, prints the learned
795
:structures into the output cells of the
IPython notebooks.
796
:And so you could take the printed
structure and just save it as a text file
797
:and write your own little parser for
extracting those structures and building
798
:an RGP or a PyTorch GP or any other GP.
799
:Okay.
800
:Yeah.
801
:That was super cool.
802
:That's awesome.
803
:And do you know if there is already an
implementation in R?
804
:and or in Python of what you're doing in
AutoGP.jl?
805
:Yeah, so we, so this project was
implemented during my year at Google when
806
:I was so between starting at CMU and
finishing my PhD, I was at Google for a
807
:year as a visiting faculty scientist.
808
:And some of the prototype implementations
were also in Python.
809
:But I think the only public version at the
moment is the Julia version.
810
:But I think it's a little bit challenging
to reimplement this because one of the
811
:things we learned when trying to implement
it in Python is that we don't have Gen, or
812
:at least at the time we didn't.
813
:The reason we focused on Julia is that we
could use the power of the Gen
814
:probabilistic programming language in a
way that made model development and
815
:iterating.
816
:much more feasible than a pure Python
implementation or even, you know, an R
817
:implementation or in another language.
818
:Yeah.
819
:Okay.
820
:Um, and so actually, yeah, so I, I would
have so many more questions on that, but I
821
:think that's already a good, a good
overview of, of that project.
822
:Maybe I'm curious about the, the biggest
obstacle that you had on the path, uh,
823
:when developing
824
:that package, AutoGP.jl, and also what
are your future plans for this package?
825
:What would you like to see it become in
the coming months and years?
826
:Yeah.
827
:So thanks for those questions.
828
:So for the biggest challenge, I think
designing and implementing the inference
829
:algorithm that includes...
830
:sequential Monte Carlo and involutive MCMC.
831
:That was a challenge because there aren't
many works, prior works in the literature
832
:that have actually explored this type of a
combination, which is, um, you know, which
833
:is really at the heart of auto GP, um,
designing the right proposal distributions
834
:for, I have some given structure and I
have my data.
835
:How do I do a data driven proposal?
836
:So I'm not just blindly proposing some new
structure from the prior or some new sub
837
:-structure.
838
:but actually use the observed data to come
up with a smart proposal for how I'm going
839
:to improve the structure in the inner loop
of MCMC.
840
:So we put a lot of thought into the actual
move types and how to use the data to come
841
:up with data -driven proposal
distributions.
842
:So the paper describes some of these
tricks.
843
:So there's moves which are based on
replacing a random subtree.
844
:There are moves which are detaching the
subtree and throwing everything away or...
845
:embedding the subtree within a new tree.
846
:So there are these different types of
moves, which we found are more helpful to
847
:guide the search.
848
:And it was a challenging process to figure
out how to implement those moves and how
849
:to debug them.
850
:So that I think was, was part of the
challenge.
851
:I think another challenge which, which we
came, which we were facing was of course,
852
:the fact that we were using these dense
Gaussian process models without the actual
853
:approximations that are needed to scale to
say tens or hundreds of thousands of data
854
:points.
855
And so this, I think, was part of the motivation for thinking about what other types of approximations of the GP would let us handle datasets of that size.
858
:In terms of what I'd like for AutoGP to be
in the future, I think there's two answers
859
:to that.
860
:One answer, and I think there's already a
nice success case here, but one answer is
861
:I'd like the implementation of AutoGP to
be a reference for how to do probabilistic
862
structure discovery using Gen. So I expect that people across many different disciplines have this problem of not knowing what their specific model is for the data.
866
:And then you might have a prior
distribution over symbolic model
867
:structures and given your observed data,
you want to infer the right model
868
:structure.
869
And I think in the AutoGP code base, we have a lot of the important components
870
:that are needed to apply this workflow to
new settings.
871
So I think we've really put a lot of effort into having the code be self-documenting in a sense, and making it easier for people to adapt the code for their own purposes.
874
And so there was a recent paper this year presented at NeurIPS by Tracy Mills and Sam Shayet from Professor Tenenbaum's group that extended the AutoGP package for a task in cognition, which was very nice to see: the code isn't only valuable for its own purpose, but also adaptable by others for other types of tasks.
878
Um, and I think the second thing that I'd like AutoGP, or at least AutoGP-type models, to do is, you know, integrating these with... and this goes back to the original automatic statistician that motivated AutoGP, that work from, say, 10 years ago. Um, so the automated statistician had a natural language processing component, and, you know, at the time there was no ChatGPT or large language models, so they just wrote some simple rules to take the learned Gaussian process and summarize it in terms of a report.
887
:But now we have much more powerful
language models.
888
And one question could be, how can I use the outputs of AutoGP and integrate them within a language model, not only for reporting the structure, but also for answering probabilistic queries.
891
So you might say, find for me a time when there could be a change point, or give me a numerical estimate of the covariance between two different time slices, or impute the data between these two different time regions, or give me a 95% prediction interval.
895
:And so a data scientist can write these in
terms of natural language, or rather a
896
:domain specialist can write these in
natural language, and then you would
897
:compile it into different little programs
that are querying the GP learned by
898
:AutoGP.
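As a hypothetical example, one such compiled "little program" might look like the snippet below, which answers "give me a 95% prediction interval" against a fitted GP regression model. The model object `gp` (for instance a scikit-learn GaussianProcessRegressor built from the parsed kernel earlier) and the interval recipe are assumptions for illustration, not an AutoGP.jl API.

```python
# Hypothetical "compiled query": given a fitted GP regression model `gp`
# (e.g. a scikit-learn GaussianProcessRegressor after gp.fit(X, y)),
# answer "give me a 95% prediction interval between t_start and t_stop".
import numpy as np

def prediction_interval(gp, t_start, t_stop, num=100, z=1.96):
    t = np.linspace(t_start, t_stop, num)[:, None]
    mean, std = gp.predict(t, return_std=True)   # posterior predictive moments
    return t.ravel(), mean - z * std, mean + z * std

# times, lower, upper = prediction_interval(gp, 2024.0, 2025.0)
```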
899
And so creating some type of higher-level interface that makes it possible for people to not necessarily dive into the guts of Julia or, you know, even implement an IPython notebook,
902
:but have the system learn the
probabilistic models and then have a
903
:natural language interface which you can
use to query those models, either for
904
:learning something about the structure of
the data, but also for solving prediction
905
:tasks.
906
And in both cases, I think, you know, off-the-shelf models may not work so well, because they may not know how to parse the AutoGP kernel to come up with a meaningful summary of what it actually means in terms of the data, or they may not know how to translate natural language into Julia code for AutoGP.
911
:So there's a little bit of research into
thinking about how do we fine tune these
912
:models so that they're able to interact
with the automatically learned
913
:probabilistic models.
914
And I think, I'll just mention here, one of the benefits of an AutoGP-like system is its interpretability.
916
So because Gaussian processes are quite transparent, like you said, they're ultimately, at the end of the day, these giant multivariate normals. We can explain to people who are using these types of distributions, and who are comfortable with them, what exactly is the distribution that's been learned. It's not: here are some weights in some giant neural network, here's the prediction, and you have to live with it. Rather, you can say, well, here's our prediction, and the reason we made this prediction is that we inferred a seasonal component with such-and-such frequency.
926
:And so you can get the predictions, but
you can also get some type of
927
:interpretable summary for why those
predictions were made, which maybe helps
928
:with the trustworthiness of the system.
929
:or just transparency more generally.
930
:Yeah.
931
I'm signing up now.
932
:That sounds like an awesome tool.
933
:Yeah, for sure.
934
:That looks absolutely fantastic.
935
And yeah, hopefully these kinds of tools will help. I'm definitely curious to try that now in my own models, basically. And yeah, see what AutoGP.jl tells you about the covariance structure and then try and use that myself in a model of mine, probably in Python, so that I'd have to get out of Julia and see how you can plug that into another model. That would be super, super interesting for sure.
942
:Yeah.
943
:I'm going to try and find an excuse to do
that.
944
Um, actually I'm curious now, we could talk a bit about how that's done, right? How you do that discovery of the time series structure. And you've mentioned that you're using sequential Monte Carlo to do that. So SMC, can you give listeners an idea of what SMC is and why that would be useful in that case? And also whether the way you do it for these projects differs from the classical way of doing SMC.
952
:Good.
953
:Yes, thanks for that question.
954
:So sequential Monte Carlo is a very broad
family of algorithms.
955
And I think one of the confusing parts for me when I was learning sequential Monte Carlo is that a lot of the introductory material on sequential Monte Carlo is very closely married to particle filters.
958
:But particle filtering, which is only one
application of sequential Monte Carlo,
959
:isn't the whole story.
960
:And so I think, you know, there's now more
modern expositions of sequential Monte
961
:Carlo, which are really bringing to light
how general these methods are.
962
And here I would like to recommend Professor Nicolas Chopin's textbook, An Introduction to Sequential Monte Carlo. It's a Springer 2020 textbook.
965
I continue to use this in my research and, you know, I think that it's a very well-written overview of really how general and how powerful sequential Monte Carlo is.
968
:So a brief explanation of sequential Monte
Carlo.
969
I guess maybe one way we could explain it is to contrast it with traditional Markov chain Monte Carlo.
971
So in traditional MCMC, we have some particular latent state, let's call it theta. And theta is supposed to be drawn from p of theta given X, where that's our posterior distribution and X is the data. And we just apply some transition kernel over and over and over again, and then we hope that, in the limit of the applications of these transition kernels, we're going to converge to the posterior distribution.
979
:Okay.
980
:So MCMC is just like one iterative chain
that you run forever.
981
:You can do a little bit of modifications.
982
You might have multiple chains which are independent of one another, but sequential Monte Carlo is, in a sense, trying to go beyond that: anything you can do in a traditional MCMC algorithm, you can do using sequential Monte Carlo. But in sequential Monte Carlo, you don't have a single chain; you have multiple different particles.
987
:And each of these different particles you
can think of as being analogous in some
988
:way to a particular MCMC chain, but
they're allowed to interact.
989
:And so you start with, say, some number of
particles, and you start with no data.
990
:And so what you would do is you would just
draw these particles from your prior
991
:distribution.
992
:And each of these draws from the prior are
basically draws from p of theta.
993
:And now I'd like to get them to p of theta
given x.
994
:That's my goal.
995
:So I start with a bunch of particles drawn
from p of theta, and I'd like to get them
996
:to p of theta given x.
997
:So how am I going to go from p of theta to
p of theta given x?
998
:There's many different ways you might do
that, and that's exactly what's
999
:sequential, right?
:
01:00:55,896 --> 01:00:58,376
How do you go from the prior to the
posterior?
:
01:00:58,376 --> 01:01:04,176
The approach we take in AutoGP is based on this idea of data tempering.
:
01:01:04,176 --> 01:01:08,706
So let's say my data x consists of a
thousand measurements, okay?
:
01:01:08,706 --> 01:01:11,756
And I'd like to go from p of theta to p of
theta given x.
:
01:01:11,756 --> 01:01:15,136
Well, here's one sequential strategy that
I can use to bridge between these two
:
01:01:15,136 --> 01:01:16,076
distributions.
:
01:01:16,076 --> 01:01:20,316
I can start with p of theta, then p of theta given X1, then p of theta given X1 and X2, then p of theta given X1, X2, and X3. So I can anneal, or I can temper, these data points into the prior.
:
01:01:27,596 --> 01:01:30,436
And the more data points I put in, the
closer I'm going to get to the full
:
01:01:30,436 --> 01:01:34,796
posterior P of theta given X1 through a
thousand or something.
:
01:01:34,796 --> 01:01:36,876
Or you might introduce these data in
batch.
:
01:01:36,876 --> 01:01:41,708
But the key idea is that you start with
draws from some prior typically.
:
01:01:41,708 --> 01:01:45,448
and then you're just adding more and more
data and you're reweighting the particles
:
01:01:45,448 --> 01:01:48,548
based on the probability that they assign
to the new data.
:
01:01:48,548 --> 01:01:53,368
So if I have 10 particles and some
particle is always able to predict or it's
:
01:01:53,368 --> 01:01:57,108
always assigning a very high score to the
new data, I know that that's a particle
:
01:01:57,108 --> 01:01:59,168
that's explaining the data quite well.
:
01:01:59,168 --> 01:02:02,628
And so I might resample these particles
according to their weights to get rid of
:
01:02:02,628 --> 01:02:05,948
the particles that are not explaining the
new data well and to focus my
:
01:02:05,948 --> 01:02:09,156
computational effort on the particles that
are explaining the data well.
:
01:02:09,516 --> 01:02:12,576
And this is something that an MCMC
algorithm does not give us.
:
01:02:12,576 --> 01:02:17,496
Because even if we run like a hundred MCMC
chains in parallel, we don't know how to
:
01:02:17,496 --> 01:02:22,236
resample the chains, for example, because
they're all these independent executions
:
01:02:22,236 --> 01:02:26,276
and we don't have a principled way of
assigning a score to those different
:
01:02:26,276 --> 01:02:26,516
chains.
:
01:02:26,516 --> 01:02:28,016
You can't use the joint likelihood.
:
01:02:28,016 --> 01:02:32,956
It's not a valid or even a meaningful statistic to use to measure the quality of a given chain.
:
01:02:34,668 --> 01:02:38,648
But SMC, because it's built on importance sampling, has a principled way for us to assign weights to these different particles and focus on the ones which are most promising.
:
01:02:44,928 --> 01:02:49,048
And then I think the final component
that's missing in my explanation is where
:
01:02:49,048 --> 01:02:50,728
does the MCMC come in?
:
01:02:50,728 --> 01:02:54,288
So traditionally in sequential Monte
Carlo, there was no MCMC.
:
01:02:54,288 --> 01:02:59,368
You would just have your particles, you would add new data, you would reweight them based on the probability of the data, then you would resample the particles, then add the next batch of data, resample, reweight, et cetera.
:
01:03:07,008 --> 01:03:12,648
But you're also able to, in between adding
new data points, run MCMC in the inner
:
01:03:12,648 --> 01:03:14,608
loop of sequential Monte Carlo.
:
01:03:14,608 --> 01:03:20,318
And that does not sort of make the
algorithm incorrect.
:
01:03:20,318 --> 01:03:23,968
It preserves the correctness of the
algorithm, even if you run MCMC.
:
01:03:23,968 --> 01:03:28,908
And there the intuition is that, you know,
your prior draws are not going to be good.
:
01:03:28,908 --> 01:03:32,248
So after I've observed, say, 10% of the data, I might actually run some MCMC on that subset of the data before I introduce the next batch of data.
:
01:03:37,288 --> 01:03:42,048
So after you're reweighting the particles,
you're also using a little bit of MCMC to
:
01:03:42,048 --> 01:03:45,608
improve their structure given the data
that's been observed so far.
:
01:03:45,608 --> 01:03:49,288
And that's where the MCMC is run inside
the inner loop.
:
01:03:49,288 --> 01:03:53,428
So some of the benefits I think of this
kind of approach are, like I mentioned at
:
01:03:53,428 --> 01:03:57,168
the beginning, in MCMC you have to compute
the probability of all the data at each
:
01:03:57,168 --> 01:03:57,708
step.
:
01:03:57,708 --> 01:04:01,767
But in SMC, because we're sequentially incorporating new batches of data, we can get away with only looking at, say, 10 or 20% of the data and get some initial inferences before we actually reach the end and have processed all of the observed data.
:
01:04:12,268 --> 01:04:18,548
So that's, I guess, a high-level overview of the algorithm that AutoGP is using. It's annealing or tempering the data, it's reassigning the scores of the particles based on how well they're explaining the new batch of data, and it's running MCMC to improve their structure by applying these different moves, like removing a sub-expression, adding a sub-expression, different things of that nature.
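To make the shape of that loop concrete, here is a minimal sketch of SMC with data tempering and MCMC rejuvenation on a deliberately simple model (an unknown normal mean with a conjugate-style prior), not on kernel structures. The model, batch count, and random-walk rejuvenation kernel are all illustrative assumptions; AutoGP performs the analogous reweight, resample, and rejuvenate steps over symbolic kernel expressions using Gen.

```python
# Minimal sketch of SMC with data tempering and MCMC rejuvenation, on a toy
# model (unknown normal mean, known noise). Illustrative assumptions only:
# AutoGP runs the same pattern over symbolic kernel structures via Gen.
import numpy as np

rng = np.random.default_rng(0)
true_theta, noise_sd, prior_sd = 2.0, 1.0, 5.0
data = rng.normal(true_theta, noise_sd, size=1000)

n_particles = 200
particles = rng.normal(0.0, prior_sd, size=n_particles)   # draws from p(theta)
log_weights = np.zeros(n_particles)

def loglik(theta, x):
    """log p(x | theta) for each particle under the Gaussian observation model."""
    return -0.5 * np.sum((x[None, :] - theta[:, None]) ** 2, axis=1) / noise_sd**2

def logprior(theta):
    return -0.5 * theta**2 / prior_sd**2

seen = np.empty(0)
for batch in np.array_split(data, 10):                     # data tempering
    # 1. Reweight: score each particle by the probability it assigns to new data.
    log_weights += loglik(particles, batch)
    seen = np.concatenate([seen, batch])

    # 2. Resample when the effective sample size collapses, focusing effort
    #    on particles that explain the data well.
    w = np.exp(log_weights - log_weights.max()); w /= w.sum()
    if 1.0 / np.sum(w**2) < n_particles / 2:
        idx = rng.choice(n_particles, size=n_particles, p=w)
        particles, log_weights = particles[idx], np.zeros(n_particles)

    # 3. Rejuvenate: a few MCMC steps targeting p(theta | data seen so far).
    #    (A random-walk Metropolis kernel here; AutoGP uses involutive MCMC
    #    moves over kernel expression trees instead.)
    for _ in range(5):
        prop = particles + rng.normal(0.0, 0.2, size=n_particles)
        log_alpha = (logprior(prop) + loglik(prop, seen)
                     - logprior(particles) - loglik(particles, seen))
        accept = np.log(rng.uniform(size=n_particles)) < log_alpha
        particles = np.where(accept, prop, particles)

w = np.exp(log_weights - log_weights.max()); w /= w.sum()
print("posterior mean estimate:", float(np.average(particles, weights=w)))
```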
:
01:04:38,188 --> 01:04:39,348
Okay, yeah.
:
01:04:39,348 --> 01:04:43,508
Thanks a lot for this explanation because
that was a very hard question on my part
:
01:04:43,508 --> 01:04:52,228
and I think you've done a tremendous job
explaining the basics of SMC and when that
:
01:04:52,228 --> 01:04:53,608
would be useful.
:
01:04:53,608 --> 01:04:55,768
So, yeah, thank you very much.
:
01:04:55,768 --> 01:04:57,668
I think that's super helpful.
:
01:04:58,048 --> 01:05:04,528
And why in this case, when you're trying
to do these kind of time series
:
01:05:04,528 --> 01:05:06,188
discoveries, why...
:
01:05:06,188 --> 01:05:11,228
would SMC be more useful than a classic
MCMC?
:
01:05:11,568 --> 01:05:12,078
Yeah.
:
01:05:12,078 --> 01:05:15,468
So it's more useful, I guess, for several
reasons.
:
01:05:15,468 --> 01:05:19,638
One reason is that, well, you might
actually have a true streaming problem.
:
01:05:19,638 --> 01:05:24,688
So if your data is actually streaming, you
can't use MCMC because MCMC is operating
:
01:05:24,688 --> 01:05:25,968
on a static data set.
:
01:05:25,968 --> 01:05:31,928
So what if I'm running AutoGP in some type
of industrial process system where some
:
01:05:31,928 --> 01:05:33,068
data is coming in?
:
01:05:33,068 --> 01:05:36,098
and I'm updating the models in real time
as my data is coming in.
:
01:05:36,098 --> 01:05:41,248
That's a purely online setting which SMC is perfect for, but MCMC is not so well suited, because you basically don't have a way to... I mean, obviously you can always incorporate new data in MCMC, but that's not the traditional algorithm whose correctness properties we know.
:
01:05:52,748 --> 01:05:56,228
So for when you have streaming data, that
might be extremely useful.
:
01:05:56,228 --> 01:05:59,180
But even if your data is not streaming,
:
01:05:59,180 --> 01:06:03,280
you know, theoretically there's results
that show that convergence can be much
:
01:06:03,280 --> 01:06:06,620
improved when you use the sequential Monte
Carlo approach.
:
01:06:06,620 --> 01:06:12,100
Because you have these multiple particles
that are interacting with one another.
:
01:06:12,100 --> 01:06:16,580
And what they can do is they can explore multiple modes, whereas in MCMC, you know, each individual MCMC chain might get trapped in a mode. And unless you have an extremely accurate posterior proposal distribution, you may never escape from that mode.
:
01:06:25,388 --> 01:06:28,708
But in SMC, we're able to resample these
different particles so that they're
:
01:06:28,708 --> 01:06:32,568
interacting, which means that you can
probably explore the space much more
:
01:06:32,568 --> 01:06:36,148
efficiently than you could with a single
chain that's not interacting with other
:
01:06:36,148 --> 01:06:36,708
chains.
:
01:06:36,708 --> 01:06:41,928
And this is especially important in the
types of posteriors that AutoGP is
:
01:06:41,928 --> 01:06:44,568
exploring, because these are symbolic
expression spaces.
:
01:06:44,568 --> 01:06:46,428
They are not Euclidean spaces. And so we expect there to be largely non-smooth components, and we want to be able to jump efficiently through this space through the resampling procedure of SMC, which is why it's a suitable algorithm.
:
01:07:02,116 --> 01:07:06,256
And then the third component, and this is more specific to GPs in particular, is that because GPs have a cubic cost of evaluating the likelihood, in MCMC that's really going to bite you if you're doing it at each step.
:
01:07:14,486 --> 01:07:17,676
If I have a million, or even a thousand, observations, I don't want to be doing that at each step. But in SMC, because the data is being introduced in batches, what that means is
:
01:07:23,628 --> 01:07:28,068
I might be able to get some very accurate
predictions using only the first 10 % of
:
01:07:28,068 --> 01:07:31,928
the data, which is going to be quite cheap
to evaluate the likelihood.
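As a rough back-of-the-envelope check on that point (my numbers, not from the episode): the dominant cost of a dense GP log-likelihood is a Cholesky factorization of the n-by-n covariance matrix, which scales like n cubed, so evaluating it on only the first 10% of the data is roughly a thousand times cheaper per evaluation.

```python
# Rough illustration of the cubic likelihood cost (illustrative, not from the
# episode): a dense GP log marginal likelihood is dominated by an O(n^3)
# Cholesky factorization, so 10% of the data is ~1000x cheaper per evaluation.
import time
import numpy as np

def gp_loglik_time(n):
    x = np.linspace(0, 1, n)[:, None]
    K = np.exp(-0.5 * (x - x.T) ** 2 / 0.1**2) + 1e-6 * np.eye(n)   # SE kernel + jitter
    y = np.random.default_rng(0).normal(size=n)
    t0 = time.perf_counter()
    L = np.linalg.cholesky(K)                                        # O(n^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    ll = -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * n * np.log(2 * np.pi)
    return time.perf_counter() - t0, ll

for n in (100, 1000):
    elapsed, _ = gp_loglik_time(n)
    print(f"n={n}: {elapsed:.4f} s per log-likelihood evaluation")
```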
:
01:07:31,928 --> 01:07:35,768
So you're somehow smoothly interpolating
between the prior, where you can get
:
01:07:35,768 --> 01:07:39,728
perfect samples, and the posterior, which
is hard to sample, using these
:
01:07:39,728 --> 01:07:44,148
intermediate distributions, which are
closer to one another than the distance
:
01:07:44,148 --> 01:07:46,068
between the prior and the posterior.
:
01:07:46,068 --> 01:07:49,168
And that's what makes inference hard, essentially: the distance between the prior and the posterior. Because SMC is introducing the data in smaller batches, it's making it easier to bridge between the prior and the posterior by having these partial posteriors, basically.
:
01:08:03,740 --> 01:08:06,460
Okay, I see.
:
01:08:06,460 --> 01:08:07,560
Yeah.
:
01:08:07,860 --> 01:08:08,660
Yeah, okay.
:
01:08:08,660 --> 01:08:12,650
That makes sense because of that batching
process, basically.
:
01:08:12,650 --> 01:08:13,540
Yeah, for sure.
:
01:08:13,540 --> 01:08:19,428
And the requirements also of MCMC coupled
to a GP that's...
:
01:08:19,436 --> 01:08:22,376
That's for sure making stuff hard.
:
01:08:22,376 --> 01:08:22,656
Yeah.
:
01:08:22,656 --> 01:08:23,816
Yeah.
:
01:08:25,216 --> 01:08:29,726
And well, I've already taken a lot of time
from you.
:
01:08:29,726 --> 01:08:30,886
So thanks a lot, Feras.
:
01:08:30,886 --> 01:08:32,326
I really appreciate it.
:
01:08:32,326 --> 01:08:35,306
And that's very, very fascinating.
:
01:08:35,306 --> 01:08:37,476
Everything you're doing.
:
01:08:38,156 --> 01:08:42,596
I'm curious also because you're a bit on
both sides, right?
:
01:08:42,596 --> 01:08:46,406
Where you see practitioners, but you're
also on the very theoretical side.
:
01:08:46,406 --> 01:08:48,268
And also you teach.
:
01:08:48,268 --> 01:08:54,368
So I'm wondering, like, in your opinion, what's the biggest hurdle in the Bayesian workflow currently?
:
01:08:57,868 --> 01:08:59,828
Yeah, I think there's really a lot of
hurdles.
:
01:08:59,828 --> 01:09:01,928
I don't know if there's a biggest one.
:
01:09:01,948 --> 01:09:08,068
So obviously, you know, Professor Andrew Gelman has an enormous manuscript on arXiv, which is called Bayesian Workflow.
:
01:09:09,968 --> 01:09:13,908
And he goes through the nitty gritty of
all the different challenges with coming
:
01:09:13,908 --> 01:09:15,688
up with the Bayesian model.
:
01:09:15,688 --> 01:09:20,188
But for me, at least the one that's tied
closely to my research is where do we even
:
01:09:20,188 --> 01:09:21,288
start?
:
01:09:21,288 --> 01:09:22,868
Where do we start this workflow?
:
01:09:22,868 --> 01:09:27,222
And that's really what drives a lot of my interest in automatic model discovery and probabilistic program synthesis.
:
01:09:29,164 --> 01:09:33,424
The idea is not that we want to discover the model that we're going to use for the rest of the lifetime of the workflow, but to come up with good explanations that we can use to bootstrap this process, after which we can then apply the different stages of the workflow.
:
01:09:44,534 --> 01:09:49,044
But I think it's getting from just data to
plausible explanations of that data.
:
01:09:49,044 --> 01:09:52,504
And that's what, you know, probabilistic
program synthesis or automatic model
:
01:09:52,504 --> 01:09:55,244
discovery is trying to solve.
:
01:09:56,204 --> 01:09:58,754
So I think that's a very large bottleneck.
:
01:09:58,754 --> 01:10:01,724
And then I'd say, you know, the second
bottleneck is the scalability of
:
01:10:01,724 --> 01:10:02,444
inference.
:
01:10:02,444 --> 01:10:07,404
I think that Bayesian inference has a poor
reputation in many corners because of how
:
01:10:07,404 --> 01:10:10,384
unscalable traditional MCMC algorithms
are.
:
01:10:10,384 --> 01:10:15,324
But I think in the last 10, 15 years,
we've seen many foundational developments
:
01:10:15,324 --> 01:10:21,884
in more scalable posterior inference
algorithms that are being used in many
:
01:10:21,884 --> 01:10:24,564
different settings in computational
science, et cetera.
:
01:10:24,564 --> 01:10:25,548
And I think...
:
01:10:25,548 --> 01:10:28,928
building probabilistic programming
technologies that better expose these
:
01:10:28,928 --> 01:10:35,868
different inference innovations is going
to help push Bayesian inference to the
:
01:10:35,868 --> 01:10:42,448
next level of applications that people
have traditionally thought are beyond
:
01:10:42,448 --> 01:10:45,648
reach because of the lack of scalability.
:
01:10:45,648 --> 01:10:49,168
So I think putting a lot of effort into
engineering probabilistic programming
:
01:10:49,168 --> 01:10:53,508
languages that really have fast, powerful
inference, whether it's sequential Monte
:
01:10:53,508 --> 01:10:54,668
Carlo, whether it's...
:
01:10:54,668 --> 01:10:58,308
Hamiltonian Monte Carlo with no U -turn
sampling, whether it's, you know, there's
:
01:10:58,308 --> 01:11:01,688
really a lot of different, in volutive
MCMC over discrete structure.
:
01:11:01,688 --> 01:11:03,598
These are all things that we've seen quiet
recently.
:
01:11:03,598 --> 01:11:07,468
And I think if you put them together, we
can come up with very powerful inference
:
01:11:07,468 --> 01:11:08,628
machinery.
:
01:11:08,668 --> 01:11:13,588
And then I think the last thing I'll say
on that topic is, you know, we also need
:
01:11:13,588 --> 01:11:18,808
some new research into how to configure
our inference algorithms.
:
01:11:18,808 --> 01:11:22,408
So, you know, we spend a lot of time
thinking is our model the right model, but
:
01:11:22,408 --> 01:11:22,956
you know,
:
01:11:22,956 --> 01:11:27,176
I think now that we have probabilistic programming, and we have inference algorithms maybe themselves implemented as probabilistic programs, we might think in a more mathematically principled way about how to optimize the inference algorithms in addition to optimizing the parameters of the model.
:
01:11:40,756 --> 01:11:44,056
I think of some type of joint inference
process where you're simultaneously using
:
01:11:44,056 --> 01:11:47,756
the right inference algorithm for your
given model and have some type of
:
01:11:47,756 --> 01:11:51,296
automation that's helping you make those
choices.
:
01:11:52,620 --> 01:11:59,160
Yeah, kind of like the automated
statistician that you were talking about
:
01:11:59,160 --> 01:12:01,740
at the beginning of the show.
:
01:12:01,880 --> 01:12:05,120
Yeah, that would be fantastic.
:
01:12:05,200 --> 01:12:12,300
Definitely kind of like having a stats
sidekick helping you when you're modeling.
:
01:12:12,300 --> 01:12:15,240
That would definitely be fantastic.
:
01:12:15,300 --> 01:12:21,260
Also, as you were saying, the workflow is
so big and diverse that...
:
01:12:21,260 --> 01:12:28,240
It's very easy to forget about something,
forget a step, neglect one, because we're
:
01:12:28,240 --> 01:12:31,500
all humans, you know, things like that.
:
01:12:31,500 --> 01:12:33,140
No, definitely.
:
01:12:33,140 --> 01:12:38,980
And as you were saying, you're also a
professor at CMU.
:
01:12:38,980 --> 01:12:45,780
So I'm curious how you approach teaching
these topics, teaching stats to prepare
:
01:12:45,780 --> 01:12:49,932
your students for all of these challenges, especially given the challenges of probabilistic computing that we've mentioned throughout this show.
:
01:12:55,820 --> 01:12:59,839
Yeah, yeah, that's something I think about frequently actually, because, you know, I haven't been teaching for a very long time, and over the course of the next few years I'm going to have to put a lot of effort into thinking about how to give students who are interested in these areas the right background so that they can quickly be productive.
:
01:13:14,800 --> 01:13:17,980
And what's especially challenging, at
least in my interest area, which is
:
01:13:17,980 --> 01:13:21,600
there's both the probabilistic modeling
component and there's also the programming
:
01:13:21,600 --> 01:13:22,916
languages component.
:
01:13:23,148 --> 01:13:27,428
And what I've learned is these two
communities don't talk much with one
:
01:13:27,428 --> 01:13:28,188
another.
:
01:13:28,188 --> 01:13:31,988
You have people who are doing statistics who think, like, oh, programming languages are just our scripts and that's really all it is, and I never want to think about them because those are the messy details.
:
01:13:37,748 --> 01:13:41,808
But programming languages, if we think
about them in a principled way and we
:
01:13:41,808 --> 01:13:46,828
start looking at the code as a first
-class citizen, just like our mathematical
:
01:13:46,828 --> 01:13:50,968
model is a first -class citizen, then we
need to really be thinking in a much more
:
01:13:50,968 --> 01:13:52,780
principled way about our programs.
:
01:13:52,780 --> 01:13:56,920
And I think the type of students who are
going to make a lot of strides in this
:
01:13:56,920 --> 01:14:00,960
research area are those who really value
the programming language, the programming
:
01:14:00,960 --> 01:14:05,380
languages theory, in addition to the
statistics and the Bayesian modeling
:
01:14:05,380 --> 01:14:08,220
that's actually used for the workflow.
:
01:14:08,580 --> 01:14:13,800
And so I think, you know, the type of
courses that we're going to need to
:
01:14:13,800 --> 01:14:17,520
develop at the graduate level or at the
undergraduate level are going to need to
:
01:14:17,520 --> 01:14:21,964
really bring together these two different
worldviews, the worldview of, you know,
:
01:14:21,964 --> 01:14:26,584
empirical data analysis, statistical model
building, things of that sort, but also
:
01:14:26,584 --> 01:14:31,004
the programming languages view where we're
actually being very formal about what are
:
01:14:31,004 --> 01:14:34,304
these actual systems, what they're doing,
what are their semantics, what are their
:
01:14:34,304 --> 01:14:39,284
properties, what are the type systems that
are enabling us to get certain guarantees,
:
01:14:39,284 --> 01:14:40,864
maybe compiler technologies.
:
01:14:40,864 --> 01:14:46,244
So I think there are elements of both of these two different communities that need to be put into teaching people how to be productive probabilistic programming researchers, bringing ideas from these two different areas.
:
01:14:54,956 --> 01:15:00,016
So, you know, the students who I advise,
for example, I often try and get a sense
:
01:15:00,016 --> 01:15:02,776
for whether they're more in the
programming languages world and they need
:
01:15:02,776 --> 01:15:05,936
to learn a little bit more about the
Bayesian modeling stuff, or whether
:
01:15:05,936 --> 01:15:09,896
they're more squarely in Bayesian modeling
and they need to appreciate some of the PL
:
01:15:09,896 --> 01:15:11,116
aspects better.
:
01:15:11,116 --> 01:15:13,956
And that's sort of the game that you have to play: figuring out what the right areas are to focus on for different students so that they can have a more holistic view of probabilistic programming and its goals, and probabilistic computing more generally, and build the technical foundations that are needed to carry forward that research.
:
01:15:29,048 --> 01:15:31,008
Yeah, that makes sense.
:
01:15:31,208 --> 01:15:43,148
And related to that, are there any future
developments that you foresee or expect or
:
01:15:43,148 --> 01:15:48,848
hope in probabilistic reasoning systems in
the coming years?
:
01:15:49,580 --> 01:15:50,890
Yeah, I think there's quite a few.
:
01:15:50,890 --> 01:15:55,220
And I think I already touched upon one of
them, which is, you know, the integration
:
01:15:55,220 --> 01:15:57,640
with language models, for example.
:
01:15:57,640 --> 01:16:00,340
I think there's a lot of excitement about
language models.
:
01:16:00,340 --> 01:16:04,480
I think from my perspective as a research
area, that's not what I do research in.
:
01:16:04,480 --> 01:16:08,080
But I think, you know, if we think about
how to leverage the things that they're
:
01:16:08,080 --> 01:16:12,770
good at, it might be for creating these
types of interfaces between, you know,
:
01:16:12,770 --> 01:16:16,400
automatically learned probabilistic
programs and natural language queries
:
01:16:16,400 --> 01:16:18,828
about these learned programs, for solving data analysis or data science tasks.
:
01:16:21,188 --> 01:16:25,428
And I think marrying these two ideas is important, because if people are going to start using language models for solving statistics, I would be very worried.
:
01:16:30,028 --> 01:16:34,628
I don't think language models in their current form, which are not backed by probabilistic programs, are at all appropriate for doing data science or data analysis. But I expect people will be pushing in that direction.
:
01:16:41,788 --> 01:16:45,468
The direction that I'd really like to see
thrive is the one where language models
:
01:16:45,468 --> 01:16:45,900
are
:
01:16:45,900 --> 01:16:50,180
interacting with probabilistic programs to
come up with better, more principled, more
:
01:16:50,180 --> 01:16:53,820
interpretable reasoning for answering an
end user question.
:
01:16:54,180 --> 01:16:59,260
So I think these types of probabilistic
reasoning systems, you know, will really
:
01:16:59,260 --> 01:17:04,040
make probabilistic programs more
accessible on the one hand, and will make
:
01:17:04,040 --> 01:17:06,440
language models more useful on the other
hand.
:
01:17:06,440 --> 01:17:10,060
That's something that I'd like to see from
the application standpoint.
:
01:17:10,060 --> 01:17:13,920
From the theory standpoint, I have many
theoretical questions, which maybe I won't
:
01:17:13,920 --> 01:17:14,924
get into.
:
01:17:14,924 --> 01:17:18,684
which are really about the foundations of random variate generation.
:
01:17:18,684 --> 01:17:22,744
Like I was mentioning at the beginning of
the talk, understanding in a more
:
01:17:22,744 --> 01:17:26,164
mathematically principled way the
properties of the inference algorithms or
:
01:17:26,164 --> 01:17:29,684
the probabilistic computations that we run
on our finite precision machines.
:
01:17:29,684 --> 01:17:34,164
I'd like to build a type of complexity theory for these, or a theory about the error, the complexity, and the resource consumption of Bayesian inference in the presence of finite resources.
:
01:17:40,184 --> 01:17:43,980
And that's a much longer-term vision, but I think it will be quite valuable once we start understanding the fundamental limitations of our computational processes for running probabilistic inference and computation.
:
01:17:53,680 --> 01:17:57,080
Yeah, that sounds super exciting.
:
01:17:57,080 --> 01:17:58,040
Thanks, Feras.
:
01:17:58,740 --> 01:18:06,320
That's making me so hopeful for the coming
years to hear you talk in that way.
:
01:18:06,320 --> 01:18:11,880
I'm like, yeah, I'm super stoked about the world that you are depicting here.
:
01:18:11,880 --> 01:18:13,932
And...
:
01:18:13,932 --> 01:18:19,732
Actually, I think I still had so many questions for you because, as I was saying, you're doing so many things. But I think I've taken enough of your time. So let's call it a show.
:
01:18:27,812 --> 01:18:32,252
And before you go though, I'm going to ask
you the last two questions I ask every
:
01:18:32,252 --> 01:18:33,972
guest at the end of the show.
:
01:18:33,972 --> 01:18:39,272
If you had unlimited time and resources,
which problem would you try to solve?
:
01:18:39,292 --> 01:18:43,468
Yeah, that's a very tough question.
:
01:18:43,468 --> 01:18:46,088
I should have prepared for that one
better.
:
01:18:46,848 --> 01:18:55,448
Yeah, I think one area which would be really worth solving, at least within the scope of Bayesian inference and probabilistic modeling, is using these technologies to unify people around data and solid data-driven inferences, to have better discussions in empirical fields, right?
:
01:19:18,448 --> 01:19:20,988
So obviously politics is extremely
divisive.
:
01:19:20,988 --> 01:19:26,348
People have all sorts of different
interpretations based on their political
:
01:19:26,348 --> 01:19:30,748
views and based on their aesthetics and
whatever, and all that's natural.
:
01:19:30,748 --> 01:19:36,828
But one question I think about is, how can we have a shared language when we talk about a given topic, or the pros and cons of that topic, in terms of rigorous data-driven theses about why we have these different views, and try and disentangle the fundamental tensions and bring down the temperature so that we can
:
01:19:53,628 --> 01:19:58,648
talk more about the data and have good
insights or leverage insights from the
:
01:19:58,648 --> 01:20:04,048
data and use that to guide our decision
-making across, especially the more
:
01:20:04,048 --> 01:20:07,868
divisive areas like public policy, things
of that nature.
:
01:20:07,868 --> 01:20:11,788
But I think part of the challenge of why we don't do this is, well, you know, from the political standpoint it's much easier to not focus on what the data is saying, because that can be expedient and it appeals to a broader number of people.
:
01:20:19,098 --> 01:20:23,348
But at the same time, maybe we don't have the right language for how we might use data to think, you know, in a more principled way about some of the major challenges that we're facing.
:
01:20:29,808 --> 01:20:36,048
So, yeah, I think I'd like to get to a stage where we can focus more on, you know, principled discussions about hard problems that are really grounded in data.
:
01:20:40,620 --> 01:20:45,160
And the way we would get those sorts of insights is by building good probabilistic models of the data and using them to explain to policymakers why they shouldn't do a certain thing, for example.
:
01:20:52,880 --> 01:20:58,260
So I think that's a very important problem to solve, because surprisingly many areas that are very high impact are not using real-world inference and data to drive their decision-making. And that's quite shocking. Whether that be in medicine, you know, we're using very archaic
:
01:21:09,068 --> 01:21:13,068
inference technologies in medicine and
clinical trials, things of that nature,
:
01:21:13,068 --> 01:21:14,548
even economists, right?
:
01:21:14,548 --> 01:21:17,088
Like linear regression is still the
workhorse in economics.
:
01:21:17,088 --> 01:21:22,308
We're using very primitive data analysis
technologies.
:
01:21:22,308 --> 01:21:28,088
I'd like to see how we can use better data
technologies, better types of inference to
:
01:21:28,088 --> 01:21:31,908
think about these hard, hard challenging
problems.
:
01:21:32,808 --> 01:21:36,908
Yeah, couldn't agree more.
:
01:21:37,168 --> 01:21:37,900
And...
:
01:21:37,900 --> 01:21:42,020
And I'm coming from a political science
background, so for sure these topics are
:
01:21:42,020 --> 01:21:46,860
always very interesting to me, quite dear
to me.
:
01:21:47,300 --> 01:21:52,700
Even though in the last years, I have to
say I've become more and more pessimistic
:
01:21:52,700 --> 01:21:54,200
about these.
:
01:21:55,140 --> 01:22:02,280
And yeah, like, I completely agree with the problems and the issues you have laid out; as for the solutions, for now I am completely out of them.
:
01:22:10,344 --> 01:22:16,384
Unfortunately. But yeah, I agree that something has to be done. Because these kinds of political debates, which are completely outside the scientific consensus, are just so weird to me. I'm like, but, I don't know, we've talked about that, you know, we've learned that. It's one of the things we know. I don't know why we're still arguing about that.
:
01:22:41,044 --> 01:22:46,344
Or if we don't know, why don't we try and find a way to, you know, find out, instead of just being like, I know I'm right because I think I'm right and my position actually makes sense. It's like one of the worst arguments: oh, well, it's common sense.
:
01:23:01,444 --> 01:23:07,122
Yeah, I think maybe there's some work we have to do in having people trust, you know, science and data-driven inference and data analysis more. That's about being more transparent, improving the ways in which they're being used, things of that nature, so that people trust these and they become the gold standard for talking about different political issues or social issues or economic issues.
:
01:23:26,580 --> 01:23:27,840
Yeah, for sure.
:
01:23:27,840 --> 01:23:32,820
But at the same time, and that's
definitely something I try to do at a very
:
01:23:32,820 --> 01:23:35,554
small scale with these podcasts,
:
01:23:35,660 --> 01:23:43,340
It's how do you communicate about science
and try to educate the general public
:
01:23:43,340 --> 01:23:43,859
better?
:
01:23:43,859 --> 01:23:46,380
And I definitely think it's useful.
:
01:23:46,380 --> 01:23:52,520
At the same time, it's a hard task because
it's hard.
:
01:23:52,740 --> 01:23:58,800
If you want to find out the truth, it's
often not intuitive.
:
01:23:58,800 --> 01:24:03,380
And so in a way you have to want it.
:
01:24:03,380 --> 01:24:05,284
It's like, eh.
:
01:24:05,644 --> 01:24:12,464
I know broccoli is better for my health long term, but I still prefer to eat a very, very fatty snack. I definitely prefer Snickers. And yet I know that eating lots of fruits and vegetables is way better for my health long term.
:
01:24:23,604 --> 01:24:30,304
And I feel it's a bit of a similar issue
where it's like, I'm pretty sure people
:
01:24:30,304 --> 01:24:34,532
know it's long term better to...
:
01:24:35,020 --> 01:24:39,380
use these kinds of methods to find out
about the truth, even if it's a political
:
01:24:39,380 --> 01:24:42,400
issue, even more, I would say, if it's a
political issue.
:
01:24:44,080 --> 01:24:50,520
But it's just so easy right now, at least
given how the different political
:
01:24:50,520 --> 01:24:58,260
incentives are, especially in the Western
democracies, the different incentives that
:
01:24:58,260 --> 01:25:01,540
are made with the media structure and so
on.
:
01:25:01,540 --> 01:25:04,940
It's actually way easier to not care about that and just, like, lie and say what you think is true, than to actually do the hard work.
:
01:25:13,100 --> 01:25:14,340
And I agree.
:
01:25:14,340 --> 01:25:16,080
It's like, it's very hard.
:
01:25:16,080 --> 01:25:23,040
How do you make that hard work look not
boring, but actually what you're supposed
:
01:25:23,040 --> 01:25:26,220
to do and that I don't know for now.
:
01:25:26,220 --> 01:25:26,740
Yeah.
:
01:25:26,740 --> 01:25:32,480
Um, that makes me think like, I mean, I,
I'm definitely always thinking about these
:
01:25:32,480 --> 01:25:33,452
things and so on.
:
01:25:33,452 --> 01:25:40,092
Something that definitely helped me at a very small scale, my scale, because of course I'm always the scientist around the table. So of course, when these kinds of topics come up, I'm like, where does that come from, right? Like, why are you saying that? How do you know that's true, right? What's your level of confidence? Things like that.
:
01:25:55,832 --> 01:26:01,732
There is actually a very interesting framework which can teach you how to ask questions to really understand where people are coming from and how they developed their positions, rather than trying to argue with them about their position.
:
01:26:13,156 --> 01:26:17,476
And usually it ties in also with the
literature about that, about how to
:
01:26:17,476 --> 01:26:23,836
actually not debate, but talk with someone
who has very entrenched political views.
:
01:26:24,816 --> 01:26:28,496
And it's called street epistemology.
:
01:26:28,496 --> 01:26:30,456
I don't know if you've heard of that.
:
01:26:30,456 --> 01:26:32,476
That is super interesting.
:
01:26:32,476 --> 01:26:32,716
And
:
01:26:32,716 --> 01:26:34,216
I will link to that in the show notes.
:
01:26:34,216 --> 01:26:39,296
So there is a very good YouTube channel by Anthony Magnabosco, who is one of the main people doing street epistemology. So I will link to that.
:
01:26:44,226 --> 01:26:50,536
You can watch his videos where he literally goes into the street and just talks about very, very hot topics with random people.
:
01:26:54,736 --> 01:26:55,916
Can be politics.
:
01:26:55,916 --> 01:27:01,420
Very often it's about supernatural beliefs, religious beliefs, things like this. These are really not light topics.
:
01:27:06,960 --> 01:27:11,260
But it's done through the framework of
street epistemology.
:
01:27:11,260 --> 01:27:13,660
That's super helpful, I find.
:
01:27:14,300 --> 01:27:19,320
And if you want a bigger overview of these topics, there is a very good, somewhat recent book called How Minds Change by David McRaney, who's also got a very good podcast called You Are Not So Smart.
:
01:27:30,020 --> 01:27:30,572
So,
:
01:27:30,572 --> 01:27:32,412
Definitely recommend those resources.
:
01:27:32,412 --> 01:27:34,326
I'll put them in the show notes.
:
01:27:36,300 --> 01:27:36,820
Awesome.
:
01:27:36,820 --> 01:27:41,660
Well, Feras, that was an unexpected end to the show.
:
01:27:41,660 --> 01:27:42,430
Thanks a lot.
:
01:27:42,430 --> 01:27:46,600
I think we've covered so many different
topics.
:
01:27:46,980 --> 01:27:49,940
Well, actually, I still have a second question to ask you, the last question of the show: if you could have dinner with any great scientific mind, dead, alive, or fictional, who would it be?
:
01:28:03,468 --> 01:28:10,628
I think I will go with Hercule Poirot, Agatha Christie's famous detective.
:
01:28:10,848 --> 01:28:16,988
So I read a lot of Hercule Poirot, and I would ask him... because everything he does is based on inference, I'd work with him to come up with a formal model of the inferences that he's making to solve very hard crimes.
:
01:28:28,288 --> 01:28:29,708
I am not... that's the first time someone answers Hercule Poirot. But I'm not surprised as to the motivation.
:
01:28:38,602 --> 01:28:39,842
So I like it.
:
01:28:39,842 --> 01:28:40,632
I like it.
:
01:28:40,632 --> 01:28:43,632
I think I would do that with Sherlock
Holmes also.
:
01:28:43,632 --> 01:28:45,732
Sherlock Holmes has a very Bayesian mind.
:
01:28:45,732 --> 01:28:47,062
I really love that.
:
01:28:47,062 --> 01:28:48,572
Yeah, for sure.
:
01:28:48,832 --> 01:28:49,332
Awesome.
:
01:28:49,332 --> 01:28:50,642
Well, thanks a lot, Feras.
:
01:28:50,642 --> 01:28:52,512
That was a blast.
:
01:28:52,512 --> 01:28:53,882
We've talked about so many things.
:
01:28:53,882 --> 01:28:55,652
I've learned a lot about GPs.
:
01:28:55,652 --> 01:29:00,972
Definitely going to try AutoGP.jl.
:
01:29:01,580 --> 01:29:07,580
Thanks a lot for all the work you are
doing on that and all the different topics
:
01:29:07,580 --> 01:29:13,280
you are working on and were kind enough to
come here and talk about.
:
01:29:13,380 --> 01:29:18,860
As usual, I will put resources and links to your website in the show notes for those who want to dig deeper, and feel free to add anything yourself for people.
:
01:29:25,280 --> 01:29:29,600
And on that note, thank you again for
taking the time and being on this show.
:
01:29:29,600 --> 01:29:30,380
Thank you, Alex.
:
01:29:30,380 --> 01:29:31,876
I appreciate it.
:
01:29:35,756 --> 01:29:39,496
This has been another episode of Learning
Bayesian Statistics.
:
01:29:39,496 --> 01:29:44,456
Be sure to rate, review, and follow the show on your favorite podcatcher, and visit learnbayesstats.com for more resources about today's topics, as well as access to more episodes to help you reach a true Bayesian state of mind. That's learnbayesstats.com.
:
01:29:56,036 --> 01:30:00,886
Our theme music is Good Bayesian by Baba Brinkman, feat. MC Lars and Mega Ran. Check out his awesome work at bababrinkman.com.
:
01:30:04,036 --> 01:30:05,196
I'm your host, Alex Andorra. You can follow me on Twitter at alex_andorra, like the country.
:
01:30:10,456 --> 01:30:15,516
You can support the show and unlock exclusive benefits by visiting patreon.com/learnbayesstats.
:
01:30:17,696 --> 01:30:20,136
Thank you so much for listening and for
your support.
:
01:30:20,136 --> 01:30:26,036
You're truly a good Bayesian, change your predictions after taking information in. And if you're thinking I'll be less than amazing, let's adjust those expectations. Let me show you how to be a good Bayesian. Change calculations after taking fresh data in. Those predictions that your brain is making, let's get them on a solid foundation.