#104 Automated Gaussian Processes & Sequential Monte Carlo, with Feras Saad
Causal Inference, AI & Machine Learning • Episode 104 • 16th April 2024 • Learning Bayesian Statistics • Alexandre Andorra
Duration: 01:30:47


Shownotes

Proudly sponsored by PyMC Labs, the Bayesian Consultancy. Book a call, or get in touch!

GPs are extremely powerful… but hard to handle. One of the bottlenecks is learning the appropriate kernel. What if you could learn the structure of GP kernels automatically? Sounds really cool, but also a bit futuristic, doesn't it?

Well, think again, because in this episode, Feras Saad will teach us how to do just that! Feras is an Assistant Professor in the Computer Science Department at Carnegie Mellon University. He received his PhD in Computer Science from MIT, and, most importantly for our conversation, he’s the creator of AutoGP.jl, a Julia package for automatic Gaussian process modeling.

Feras discusses the implementation of AutoGP, how it scales, what you can do with it, and how you can integrate its outputs in your models.

Finally, Feras provides an overview of Sequential Monte Carlo and its usefulness in AutoGP, highlighting the ability of SMC to incorporate new data in a streaming fashion and explore multiple modes efficiently.

Our theme music is « Good Bayesian », by Baba Brinkman (feat MC Lars and Mega Ran). Check out his awesome work at https://bababrinkman.com/ !

Thank you to my Patrons for making this episode possible!

Yusuke Saito, Avi Bryant, Ero Carrera, Giuliano Cruz, Tim Gasser, James Wade, Tradd Salvo, William Benton, James Ahloy, Robin Taylor,, Chad Scherrer, Zwelithini Tunyiswa, Bertrand Wilden, James Thompson, Stephen Oates, Gian Luca Di Tanna, Jack Wells, Matthew Maldonado, Ian Costley, Ally Salim, Larry Gill, Ian Moran, Paul Oreto, Colin Caprani, Colin Carroll, Nathaniel Burbank, Michael Osthege, Rémi Louf, Clive Edelsten, Henri Wallen, Hugo Botha, Vinh Nguyen, Marcin Elantkowski, Adam C. Smith, Will Kurt, Andrew Moskowitz, Hector Munoz, Marco Gorelli, Simon Kessell, Bradley Rode, Patrick Kelley, Rick Anderson, Casper de Bruin, Philippe Labonde, Michael Hankin, Cameron Smith, Tomáš Frýda, Ryan Wesslen, Andreas Netti, Riley King, Yoshiyuki Hamajima, Sven De Maeyer, Michael DeCrescenzo, Fergal M, Mason Yahr, Naoya Kanai, Steven Rowland, Aubrey Clayton, Jeannine Sue, Omri Har Shemesh, Scott Anthony Robson, Robert Yolken, Or Duek, Pavel Dusek, Paul Cox, Andreas Kröpelin, Raphaël R, Nicolas Rode, Gabriel Stechschulte, Arkady, Kurt TeKolste, Gergely Juhasz, Marcus Nölke, Maggi Mackintosh, Grant Pezzolesi, Avram Aelony, Joshua Meehl, Javier Sabio, Kristian Higgins, Alex Jones, Gregorio Aguilar, Matt Rosinski, Bart Trudeau, Luis Fonseca, Dante Gates, Matt Niccolls, Maksim Kuznecov, Michael Thomas, Luke Gorrie, Cory Kiser, Julio, Edvin Saveljev, Frederick Ayala, Jeffrey Powell and Gal Kampel.

Visit https://www.patreon.com/learnbayesstats to unlock exclusive Bayesian swag ;)

Takeaways:

- AutoGP.jl is a Julia package for automatic Gaussian process modeling: it learns the structure of GP kernels directly from the data (a minimal usage sketch follows this list).

- It addresses the challenge of making structural choices for covariance functions by using a symbolic language and a recursive grammar to infer the expression of the covariance function given the observed data.

- AutoGP incorporates sequential Monte Carlo inference to handle scalability and uncertainty in structure learning.

- The package is implemented in Julia using the Gen probabilistic programming language, which provides support for sequential Monte Carlo and involutive MCMC.

- Sequential Monte Carlo (SMC) and involutive MCMC are used in AutoGP to infer the structure of the model.

- Integrating probabilistic models with language models can improve interpretability and trustworthiness in data-driven inferences.

- Challenges in Bayesian workflows include the need for automated model discovery and scalability of inference algorithms.

- Future developments in probabilistic reasoning systems include unifying people around data-driven inferences and improving the scalability and configurability of inference algorithms.
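
To make these takeaways concrete, here is a minimal sketch of the AutoGP.jl workflow discussed in the episode. The function names follow the package's tutorials (GPModel, fit_smc!, covariance_kernels, predict), but treat the exact signatures and keyword arguments as assumptions and defer to the AutoGP.jl documentation.

```julia
# Minimal AutoGP.jl sketch -- names follow the package tutorials,
# but exact signatures/keywords are assumptions; check the AutoGP.jl docs.
using AutoGP

ds = collect(1.0:120.0)                                # time stamps of a univariate series
y  = 0.05 .* ds .+ sin.(2pi .* ds ./ 12) .+ 0.2 .* randn(length(ds))  # toy trend + seasonality

model = AutoGP.GPModel(ds, y; n_particles=8)           # SMC ensemble over kernel structures
AutoGP.fit_smc!(model; schedule=[30, 60, 90, 120],     # data annealing: growing subsets
                n_mcmc=50, n_hmc=10, verbose=false)    # structure moves + parameter moves

kernels  = AutoGP.covariance_kernels(model)            # symbolic kernel expression per particle
forecast = AutoGP.predict(model, collect(121.0:132.0);
                          quantiles=[0.025, 0.5, 0.975])
```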

Chapters:

00:00 Introduction to AutoGP

26:28 Automatic Gaussian Process Modeling

45:05 AutoGP: Automatic Discovery of Gaussian Process Model Structure

53:39 Applying AutoGP to New Settings

01:09:27 The Biggest Hurdle in the Bayesian Workflow

01:19:14 Unifying People Around Data-Driven Inferences

Links from the show:

Transcript

This is an automatic transcript and may therefore contain errors. Please get in touch if you're willing to correct them.

Transcripts

Speaker:

GPs are extremely powerful, but hard to

handle.

2

:

One of the bottlenecks is learning the

appropriate kernels.

3

:

Well, what if you could learn the

structure of GP kernels automatically?

4

:

Sounds really cool, right?

5

:

But also, eh, a bit futuristic, doesn't

it?

6

:

Well, think again, because in this

episode, Feras Saad will teach us how to

7

:

do just that.

8

:

Feras is an assistant professor in the

computer science department at Carnegie

9

:

Mellon University.

10

:

He received his PhD in computer science

from MIT.

11

:

And most importantly for our conversation,

he's the creator of AutoGP.jl, a Julia

12

:

package for automatic Gaussian process

modeling.

13

:

Feras discusses the implementation of

AutoGP, how it scales, what you can do

14

:

with it, and how you can integrate its

outputs in your Bayesian models.

15

:

Finally,

16

:

Feras provides an overview of

Sequential Monte Carlo and its usefulness

17

:

in AutoGP, highlighting the ability of SMC

to incorporate new data in a streaming

18

:

fashion and explore multiple modes

efficiently.

19

:

This is Learning Bayesian Statistics.

20

:

Welcome to Learning Bayesian Statistics, a

podcast about Bayesian inference, the

21

:

methods, the projects, and the people who

make it possible.

22

:

I'm your host, Alex Andorra.

23

:

You can follow me on Twitter at

alex_andorra, like the country, for any info

24

:

about the show.

25

:

LearnBayesStats.com is Laplace to be.

26

:

Show notes,

27

:

becoming a corporate sponsor, unlocking

Bayesian merch, supporting the show on

28

:

Patreon, everything is in there.

29

:

That's learnbayesstats.com.

30

:

If you're interested in one-on-one

mentorship, online courses, or statistical

31

:

consulting, feel free to reach out and

book a call at

32

:

topmate.io/alex_andorra.

33

:

See you around, folks, and best Bayesian

wishes to you all.

34

:

idea patients.

35

:

First, I want to thank Edvin Saveljev,

Frederick Ayala, Jeffrey Powell, and Gal

36

:

Kampel for supporting the show on

37

:

Patreon. Your support is invaluable, guys,

and literally makes this show possible.

38

:

I cannot wait to talk with you in the

Slack channel.

39

:

Second, I have an exciting modeling

webinar coming up on April 18 with Juan

40

:

Orduz, a fellow PyMC Core Dev and

mathematician.

41

:

In this modeling webinar, we'll learn how

to use the new HSGP approximation for fast

42

:

and efficient Gaussian processes, we'll

simplify the foundational concepts,

43

:

explain why this technique is so useful

and innovative, and of course, we'll show

44

:

you a real-world application in PyMC.

45

:

So if that sounds like fun,

46

:

Go to topmate.io/alex_andorra

to secure your seat.

47

:

Of course, if you're a patron of the show,

you get bonuses like submitting questions

48

:

in advance, early access to the recording,

et cetera.

49

:

You are my favorite listeners after all.

50

:

Okay, back to the show now.

51

:

Feras Saad, welcome to Learning Bayesian

Statistics.

52

:

Hi, thank you.

53

:

Thanks for the invitation.

54

:

I'm delighted to be here.

55

:

Yeah, thanks a lot for taking the time.

56

:

Thanks a lot to Colin Carroll.

57

:

who of course listeners know, he was in

episode 3 of Learning Bayesian Statistics.

58

:

Well I will of course put it in the show

notes, that's like a vintage episode now,

59

:

from 4 years ago.

60

:

I was a complete beginner in Bayesian

stats, so if you wanna hear me embarrass myself,

61

:

definitely that's one of the episodes you

should listen to, with

62

:

my beginner's questions, and that's one of

the rare episodes I could do on site.

63

:

I was with Colin in person to record

that episode in Boston.

64

:

So, hi Colin, thanks a lot again.

65

:

And Feras, let's talk about you first.

66

:

How would you define the work you're doing

nowadays?

67

:

And also, how did you end up doing that?

68

:

Yeah, yeah, thanks.

69

:

And yeah, thanks to Colin Carroll for

setting up this connection.

70

:

I've been watching the podcast for a while

and I think it's really great how you've

71

:

brought together lots of different people

in the Bayesian inference community, the

72

:

statistics community to talk about their

work.

73

:

So thank you and thank you to Colin for

that connection.

74

:

Yeah, so a little background about me.

75

:

I'm a professor at CMU and I'm working

in...

76

:

a few different areas surrounding Bayesian

inference with my colleagues and students.

77

:

One, I think, you know, I like to think of

the work I do as following different

78

:

threads, which are all unified by this

idea of probability and computation.

79

:

So one area that I work a lot in, and I'm

sure you have lots of experience in this,

80

:

being one of the core developers of PyMC,

is probabilistic programming languages and

81

:

developing new tools that help

82

:

both high level users and also machine

learning experts and statistics experts

83

:

more easily use Bayesian models and

inferences as part of their workflow.

84

:

The, you know, putting my programming

languages hat on, it's important to think

85

:

about not only how do we make it easier

for people to write up Bayesian inference

86

:

workflows, but also what kind of

guarantees or what kind of help can we

87

:

give them in terms of verifying the

correctness of their implementations, or

88

:

automating the process of getting these

probabilistic programs to begin with using

89

:

probabilistic program synthesis

techniques.

90

:

So these are questions that are very

challenging and, you know, if we're able

91

:

to solve them, you know, really can go a

long way.

92

:

So there's a lot of work in the

probabilistic programming world that I do,

93

:

and I'm specifically interested in

probabilistic programming languages that

94

:

support programmable inference.

95

:

So we can think of many probabilistic

programming languages like Stan or BUGS or

96

:

PyMC as largely having a single inference

algorithm that they're going to use

97

:

multiple times for all the different

programs you can express.

98

:

So BUGS might use Gibbs sampling, Stan

uses HMC with NUTS, PyMC uses MCMC

99

:

algorithms, and these are all great.

100

:

But of course, one of the limitations is

there's no universal inference algorithm

101

:

that works well for any problem you might

want to express.

102

:

And that's where I think a lot of the

power of programmable inference comes in.

103

:

A lot of where the interesting research is

as well, right?

104

:

Like how can you support users writing

their own say MCMC proposal for a given

105

:

Bayesian inference problem and verify that

that proposal distribution meets the

106

:

theoretical conditions needed for

soundness, whether it's defining an

107

:

irreducible chain, for example, or whether

it's aperiodic.

108

:

or in the context of variational

inference, whether you define the

109

:

variational family that is broad enough,

so its support encompasses the support of

110

:

the target model.

111

:

We have all of these conditions that we

usually hope are correct, but our systems

112

:

don't actually verify that for us, whether

it's an MCMC or variational inference or

113

:

importance sampling or sequential Monte

Carlo.

114

:

And I think the more flexibility we give

programmers,

115

:

And I touched upon this a little bit by

talking about probabilistic program

116

:

synthesis, which is this idea of

probabilistic, automated probabilistic

117

:

model discovery.

118

:

And there, our goal is to use hierarchical

Bayesian models to specify prior

119

:

distributions, not only over model

parameters, but also over model

120

:

structures.

121

:

And here, this is based on this idea that

traditionally in statistics, a data

122

:

scientist or an expert,

123

:

we'll hand design a Bayesian model for a

given problem, but oftentimes it's not

124

:

obvious what's the right model to use.

125

:

So the idea is, you know, how can we use

the observed data to guide our decisions

126

:

about what is the right model structure to

even be using before we worry about

127

:

parameter inference?

128

:

So, you know, we've looked at this problem

in the context of learning models of time

129

:

series data.

130

:

Should my time series data have a periodic

component?

131

:

Should it have polynomial trends?

132

:

Should it have a change point?

133

:

right?

134

:

You know, how can we automate the

discovery of these different patterns and

135

:

then learn an appropriate probabilistic

model?

136

:

And I think it ties in very nicely to

probabilistic programming because

137

:

probabilistic programs are so expressive

that we can express prior distributions on

138

:

structures or prior distributions on

probabilistic programs all within the

139

:

system using this unified technology.

140

:

Yeah.

141

:

Which is where, you know, these two

research areas really inform one another.

142

:

If we're able to express

143

:

rich probabilistic programming languages,

then we can start doing inference over

144

:

probabilistic programs themselves and try

and synthesize these programs from data.

145

:

Other areas that I've looked at are

tabular data or relational data models,

146

:

different types of traditionally

structured data, and synthesizing models

147

:

there.

148

:

And the workhorse in that area is largely

Bayesian non -parametrics.

149

:

So prior distributions over unbounded

spaces of latent variables, which are, I

150

:

think, a very mathematically elegant way

to treat probabilistic structure discovery

151

:

using Bayesian inferences as the workhorse

for that.

152

:

And I'll just touch upon a few other areas

that I work in, which are also quite

153

:

aligned, which a third area I work in is

more on the computational statistics side,

154

:

which is now that we have probabilistic

programs and we're using them and they're

155

:

becoming more and more routine in the

workflow of Bayesian inference, we need to

156

:

start thinking about new statistical

methods and testing methods for these

157

:

probabilistic programs.

158

:

So for example, this is a little bit

different than traditional statistics

159

:

where, you know, traditionally in

statistics we might do

160

:

some type of analytic mathematical

derivation on some probability model,

161

:

right?

162

:

So you might write up your model by hand,

and then you might, you know, if you want

163

:

to compute some property, you'll treat the

model as some kind of mathematical

164

:

expression.

165

:

But now that we have programs, these

programs are often far too hard to

166

:

formalize mathematically by hand.

167

:

So if you want to analyze their

properties, how can we understand the

168

:

properties of a program?

169

:

By simulating it.

170

:

So a very simple example of this would be,

say I wrote a probabilistic program for

171

:

some given data, and I actually have the

data.

172

:

Then I'd like to know whether the

probabilistic program I wrote is even a

173

:

reasonable prior for that data.

174

:

So this is a goodness of fit testing, or

how well does the probabilistic program I

175

:

wrote explain the range of data sets I

might see?

176

:

So, you know, if you do a goodness of fit

test using stats 101, you would look, all

177

:

right, what is my distribution?

178

:

What is the CDF?

179

:

What are the parameters that I'm going to

derive some type of thing by hand?

180

:

But for probabilistic programs, we can't do that.

181

:

So we might like to simulate data from the

program and do some type of analysis based

182

:

on samples of the program as compared to

samples of the observed data.

183

:

So these type of simulation -based

analyses of statistical properties of

184

:

probabilistic programs for testing their

behavior or for quantifying the

185

:

information between variables, things like

that.

186

:

And then the final area I'll touch upon is

really more at the foundational level,

187

:

which is.

188

:

understanding what are the primitive

operations, a more rigorous or principled

189

:

understanding of the primitive operations

on our computers that enable us to do

190

:

random computations.

191

:

So what do I mean by that?

192

:

Well, you know, we love to assume that our

computers can freely compute over real

193

:

numbers.

194

:

But of course, computers don't have real

numbers built within them.

195

:

They're built on finite precision

machines, right, which means I can't

196

:

express.

197

:

some arbitrary division between two real

numbers.

198

:

Everything is at some level it's floating

point.

199

:

And so this gives us a gap between the

theory and the practice.

200

:

Because in theory, you know, whenever

we're writing our models, we assume

201

:

everything is in this, you know,

infinitely precise universe.

202

:

But when we actually implement it, there's

some level of approximation.

203

:

So I'm interested in understanding first,

theoretically, what is this approximation?

204

:

How important is it that I'm actually

treating my model as running on an

205

:

infinitely precise machine where I

actually have finite precision?

206

:

And second, what are the implications of

that gap for Bayesian inference?

207

:

Does it mean that now I actually have some

208

:

properties of my Markov chain that no

longer hold because I'm actually running

209

:

it on a finite precision machine whereby

all my analysis was assuming I have an

210

:

infinite precision or what does it mean

about the actual variables we generate?

211

:

So, you know, we might generate a Gaussian

random variable, but in practice, the

212

:

variable we're simulating has some other

distribution.

213

:

Can we theoretically quantify that other

distribution and its error with respect to

214

:

the true distribution?

215

:

Or have we come up with sampling

procedures that are as close as possible

216

:

to the ideal real value distribution?

217

:

And so this brings together ideas from

information theory, from theoretical

218

:

computer science.

219

:

And one of the motivations is to thread

those results through into the actual

220

:

Bayesian inference procedures that we

implement using probabilistic programming

221

:

languages.

222

:

So that's just, you know, an overview of

these three or four different areas that

223

:

I'm interested in and I've been working on

recently.

224

:

Yeah, that's amazing.

225

:

Thanks a lot for these, like full panel of

what you're doing.

226

:

And yeah, that's just incredible also that

you're doing so many things.

227

:

I'm really impressed.

228

:

And of course we're going to dive a bit

into these, at least some of these topics.

229

:

I don't want to take three hours of your

time, but...

230

:

Before that though, I'm curious if you

remembered when and how you first got

231

:

introduced to Bayesian inference and also

why it's ticked with you because it seems

232

:

like it's underpinning most of your work,

at least that idea of probabilistic

233

:

programming.

234

:

Yeah, that's a good question.

235

:

I think I was first interested in

probability before I was interested in

236

:

Bayesian inference.

237

:

I remember...

238

:

I used to read a book by Mosteller called

50 Challenging Problems in Probability.

239

:

I took a course in high school and I

thought, how could I actually use these

240

:

cool ideas for fun?

241

:

And there was actually a very nice book

written back in the 50s by Mosteller.

242

:

So that got me interested in probability

and how we can use probability to reason

243

:

about real world phenomena.

244

:

So the book that...

245

:

that I used to read would sort of have

these questions about, you know, if

246

:

someone misses a train and the train has a

certain schedule, what's the probability

247

:

that they'll arrive at the right time?

248

:

And it's a really nice book because it

ties in our everyday experiences with

249

:

probabilistic modeling and inference.

250

:

And so I thought, wow, this is actually a

really powerful paradigm for reasoning

251

:

about the everyday things that we do,

like, you know, missing a bus and knowing

252

:

something about its schedule and when's

the right time that I should arrive to

253

:

maximize the probability of, you know,

some, some, some, some,

254

:

event of interest, things like that.

255

:

So that really got me hooked to the idea

of probability.

256

:

But I think what really connected Bayesian

inference to me was taking, I think this

257

:

was as a senior or as a first year

master's student, a course by Professor

258

:

Josh Tenenbaum at MIT, which is

computational cognitive science.

259

:

And that course has evolved.

260

:

quite a lot through the years, but the

version that I took was really a beautiful

261

:

synthesis of lots of deep ideas of how

Bayesian inference can tell us something

262

:

meaningful about how humans reason about,

you know, different empirical phenomena

263

:

and cognition.

264

:

So, you know, in cognitive science for,

you know, for...

265

:

majority of the history of the field,

people would run these experiments on

266

:

humans and they would try and analyze

these experiments using some type of, you

267

:

know, frequentist statistics or they would

not really use generative models to

268

:

describe how humans are are solving a

particular experiment.

269

:

But the, you know, Professor Tenenbaum's

approach was to use Bayesian models.

270

:

as a way of describing or at least

emulating the cognitive processes that

271

:

humans do for solving these types of

cognition tasks.

272

:

And by cognition tasks, I mean, you know,

simple experiments you might ask a human

273

:

to do, which is, you know, you might have

some dots on a screen and you might tell

274

:

them, all right, you've seen five dots,

why don't you extrapolate the next five?

275

:

Just simple things that, simple cognitive

experiments or, you know, yeah, so.

276

:

I think that being able to use Bayesian

models to describe very simple cognitive

277

:

phenomena was another really appealing

prospect to me throughout that course.

278

:

I'm seeing all the ways in which that

manifested in very nice questions about.

279

:

how do we do efficient inference in real

time?

280

:

Because humans are able to do inference

very quickly.

281

:

And Bayesian inference is obviously very

challenging to do.

282

:

But then, if we actually want to engineer

systems, we need to think about the hard

283

:

questions of efficient and scalable

inference in real time, maybe at human

284

:

level speeds.

285

:

Which brought in a lot of the reason for

why I'm so interested in inference as

286

:

well.

287

:

Because that's one of the harder aspects

of Bayesian computing.

288

:

And then I think a third thing which

really hooked me to Bayesian inference was

289

:

taking a machine learning course and kind

of comparing.

290

:

So the way these machine learning courses

work is they'll teach you empirical risk

291

:

minimization, and then they'll teach you

some type of optimization, and then

292

:

there'll be a lecture called Bayesian

inference.

293

:

And...

294

:

What was so interesting to me at the time

was up until the time, up until the

295

:

lecture where we learned anything about

Bayesian inference, all of these machine

296

:

learning concepts seem to just be a

hodgepodge of random tools and techniques

297

:

that people were using.

298

:

So I, you know, there's the support vector

machine and it's good at classification

299

:

and then there's the random forest and

it's good at this.

300

:

But what's really nice about using

Bayesian inference in the machine learning

301

:

setting, or at least what I found

appealing was how you have a very clean

302

:

specification of the problem that you're

trying to solve in terms of number one, a

303

:

prior distribution.

304

:

over parameters and observable data, and

then the actual observed data, and three,

305

:

which is the posterior distribution that

you're trying to infer.

306

:

So you can use a very nice high -level

specification of what is even the problem

307

:

you're trying to solve before you even

worry about how you solve it.
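
In symbols, the clean specification described here is just the posterior defined by a prior and a likelihood, stated before any choice of inference algorithm:

```latex
% Prior p(\theta), likelihood p(y \mid \theta), and the posterior being targeted.
p(\theta \mid y)
  = \frac{p(y \mid \theta)\, p(\theta)}{\int p(y \mid \theta')\, p(\theta')\, \mathrm{d}\theta'}
  \;\propto\; p(y \mid \theta)\, p(\theta)
```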

308

:

you can very cleanly separate modeling and

inference, whereby most of the machine

309

:

learning techniques that I was initially

reading or learning about seem to be only

310

:

focused on how do I infer something

without crisply formalizing the problem

311

:

that I'm trying to solve.

312

:

And then, you know, just, yeah.

313

:

And then, yeah.

314

:

So once we have this Bayesian posterior

that we're trying to infer, then maybe

315

:

we'll do fully Bayesian inference, or

maybe we'll do approximate Bayesian

316

:

inference, or maybe we'll just do maximum

likelihood.

317

:

That's maybe less of a detail.

318

:

The more important detail is we have a

very clean specification for our problem

319

:

and we can, you know, build in our

assumptions.

320

:

And as we change our assumptions, we

change the specification.

321

:

So it seemed like a very systematic way,

very systematic way to build machine

322

:

learning and artificial intelligence

pipelines.

323

:

using a principled process that I found

easy to reason about.

324

:

And I didn't really find that in the other

types of machine learning approaches that

325

:

we learned in the class.

326

:

So yeah, so I joined the probabilistic

computing project at MIT, which is run by

327

:

my PhD advisor, Dr.

328

:

Vikash Mansinghka.

329

:

And, um, you really got the opportunity to

explore these interests at the research

330

:

level, not only in classes.

331

:

And that's, I think where everything took

off afterwards.

332

:

Those are the synthesis of various things,

I think that got me interested in the

333

:

field.

334

:

Yeah.

335

:

Thanks a lot for that, for that, that

that's super interesting to see.

336

:

And, uh, I definitely relate to the idea

of these, um, like the Bayesian framework

337

:

being, uh, attractive.

338

:

not because it's a toolbox, but because

it's more of a principle based framework,

339

:

basically, where instead of thinking, oh

yeah, what tool do I need for that stuff,

340

:

it's just always the same in a way.

341

:

To me, it's cool because you don't have to

be smart all the time in a way, right?

342

:

You're just like, it's the problem takes

the same workflow.

343

:

It's not going to be the same solution.

344

:

But it's always the same workflow.

345

:

Okay.

346

:

What does the data look like?

347

:

How can we model that?

348

:

What is the data-generating story?

349

:

And then you have very different

challenges all the time and different

350

:

kinds of models, but you're not thinking

about, okay, what is the ready made model

351

:

that they can apply to these data?

352

:

It's more like how can I create a custom

model to these data knowing the

353

:

constraints I have about my problem?

354

:

And.

355

:

thinking in a principled way instead of

thinking in a toolkit way.

356

:

I definitely relate to that.

357

:

I find that amazing.

358

:

I'll just add to that, which is this is

not only some type of aesthetic or

359

:

theoretical idea.

360

:

I think it's actually strongly tied into

good practice that makes it easier to

361

:

solve problems.

362

:

And by that, what do I mean?

363

:

Well, so I did a very brief undergraduate

research project in a biology lab,

364

:

computational biology lab.

365

:

And just looking at the empirical workflow

that was done,

366

:

made me very suspicious about the process,

which is, you know, you might have some

367

:

data and then you'll hit it with PCA and

you'll get some projection of the data and

368

:

then you'll use a random forest classifier

and you're going to classify it in

369

:

different ways.

370

:

And then you're going to use the

classification and some type of logistic

371

:

regression.

372

:

So you're just chaining these ad hoc

different data analyses to come up with

373

:

some final story.

374

:

And while that might be okay to get you

some specific result, it doesn't really

375

:

tell you anything about how changing one

modeling choice in this pipeline.

376

:

is going to impact your final inference

because this sort of mix and match

377

:

approach of applying different ad hoc

estimators to solve different subtasks

378

:

doesn't really give us a way to iterate on

our models, understand their limitations

379

:

very well, knowing their sensitivity to

different choices, or even building

380

:

computational systems that automate a lot

of these things, right?

381

:

Like probabilistic programs.

382

:

Like you're saying, we can write our data

generating process as the workflow itself,

383

:

right?

384

:

Rather than, you know, maybe in Matlab

I'll run PCA and then, you know, I'll use

385

:

scikit -learn and Python.

386

:

Without, I think, this type of prior

distribution over our data, it becomes

387

:

very hard to reason formally about our

entire inference workflow, which, you

388

:

know, which probabilistic programming

languages are trying to make easier and

389

:

give a more principled approach that's

more amenable to engineering, to

390

:

optimization, to things of that sort.

391

:

Yeah.

392

:

Yeah, yeah.

393

:

Fantastic point.

394

:

Definitely.

395

:

And that's also the way I personally tend

to teach Bayesian stats.

396

:

Now it's much more on a, let's say,

principle-based way instead of, and

397

:

workflow-based instead of just...

398

:

Okay, Poisson regression is this

multinomial regression is that I find that

399

:

much more powerful because then when

students get out in the wild, they are

400

:

used to first think about the problem and

then try to see how they could solve it

401

:

instead of just trying to find, okay,

which model is going to be the most.

402

:

useful here in the models that I already

know, because then if the data are

403

:

different, you're going to have a lot of

problems.

404

:

Yeah.

405

:

And so you actually talked about the

different topics that you work on.

406

:

There are a lot I want to ask you about.

407

:

One of my favorites, and actually I think

Colin also has been working a bit on that

408

:

lately.

409

:

is the development of AutoGP.jl.

410

:

So I think that'd be cool to talk about

that.

411

:

What inspired you to develop that package,

which is in Julia?

412

:

Maybe you can also talk about that if you

mainly develop in Julia most of the time,

413

:

or if that was mostly useful for that

project.

414

:

And how does this package...

415

:

advance, like help the learning structure

of Gaussian Processes kernels because if I

416

:

understand correctly, that's what the

package is mostly about.

417

:

So yeah, if you can give a primer to

listeners about that.

418

:

Definitely.

419

:

Yes.

420

:

So Gaussian Processes are a pretty

standard model that's used in many

421

:

different application areas.

422

:

spatial temporal statistics and many

engineering applications based on

423

:

optimization.

424

:

So these Gaussian process models are

parameterized by covariance functions,

425

:

which specify how the data produced by

this Gaussian process co-varies across

426

:

time, across space, across any domain

which you're able to define some type of

427

:

covariance function.

428

:

But one of the main challenges in using a

Gaussian process for modeling your data,

429

:

is making the structural choice about what

should the covariance structure be.

430

:

So, you know, the one of the universal

choices or the most common choices is to

431

:

say, you know, some type of a radial basis

function for my data, the RBF kernel, or,

432

:

you know, maybe a linear kernel or a

polynomial kernel, somehow hoping that

433

:

you'll make the right choice to model your

data accurately.
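
As a concrete picture of the structural choice being described, here are textbook forms of a few base covariance functions in plain Julia (an illustration only, not the AutoGP.jl internals):

```julia
# Textbook GP covariance (kernel) functions -- an illustration of the structural
# choices being described, not the AutoGP.jl implementation.
rbf(x1, x2; l=1.0, s=1.0)       = s^2 * exp(-(x1 - x2)^2 / (2 * l^2))  # RBF / squared exponential
linear(x1, x2; s=1.0, c=0.0)    = s^2 * (x1 - c) * (x2 - c)            # linear kernel
periodic(x1, x2; l=1.0, p=12.0) = exp(-2 * sin(pi * abs(x1 - x2) / p)^2 / l^2)

# Sums and products of kernels are again valid kernels, e.g. "trend + seasonality":
k(x1, x2) = linear(x1, x2) + periodic(x1, x2) * rbf(x1, x2; l=24.0)

K = [k(xi, xj) for xi in 1.0:10.0, xj in 1.0:10.0]   # a 10x10 covariance matrix
```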

434

:

So the inspiration for auto GP or

automatic Gaussian process is to try and

435

:

use the data not only to infer the numeric

parameters of the Gaussian process, but

436

:

also the structural parameters or the

actual symbolic structure of this

437

:

covariance function.

438

:

And here we are drawing our inspiration

from work which is maybe almost 10 years

439

:

now from David Duvenaud and colleagues

called the Automated Statistician Project,

440

:

or ABCD, Automatic Bayesian Covariance

Discovery, which introduced this idea of

441

:

defining a symbolic language.

442

:

over Gaussian process covariance functions

or covariance kernels and using a grammar,

443

:

using a recursive grammar and trying to

infer an expression in that grammar given

444

:

the observed data.
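
A toy version of that symbolic language: a recursive data type for kernel expressions plus a crude prior that samples expressions from it. This is only a sketch of the idea, not the actual ABCD or AutoGP grammar or prior.

```julia
# Toy recursive grammar over kernel expressions (illustration of the idea only).
abstract type Kernel end
struct RBF      <: Kernel; lengthscale::Float64; end
struct Linear   <: Kernel; offset::Float64;      end
struct Periodic <: Kernel; period::Float64;      end
struct Sum      <: Kernel; left::Kernel; right::Kernel; end
struct Product  <: Kernel; left::Kernel; right::Kernel; end

# A crude prior over expressions: stop at a base kernel with probability p_stop,
# otherwise recurse into a sum or a product of two sub-expressions.
function sample_kernel(p_stop=0.6)
    if rand() < p_stop
        return rand([RBF(rand() * 10), Linear(rand() * 5), Periodic(rand() * 20)])
    else
        op = rand() < 0.5 ? Sum : Product
        return op(sample_kernel(p_stop), sample_kernel(p_stop))
    end
end

expr = sample_kernel()   # e.g. Sum(Periodic(7.3), Product(Linear(1.2), RBF(4.8)))
```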

445

:

So, you know, in a time series setting,

for example, you might have time on the

446

:

horizontal axis and the variable on the y

-axis and you just have some variable

447

:

that's evolving.

448

:

You don't know necessarily the dynamics of

that, right?

449

:

There might be some periodic structure in

the data or there might be multiple

450

:

periodic effects.

451

:

Or there might be a linear trend that's

overlaying the data.

452

:

Or there might be a point in time in which

the data is switching between some process

453

:

before the change point and some process

after the change point.

454

:

Obviously, for example, in the COVID era,

almost all macroeconomic data sets had

455

:

some type of change point around April 2020.

456

:

And we see that in the empirical data that

we're analyzing today.

457

:

So the question is, how can we

automatically surface these structural

458

:

choices?

459

:

using Bayesian inference.

460

:

So the original approach that was in the

automated statistician was based on a type

461

:

of greedy search.

462

:

So they were trying to say, let's find the

single kernel that maximizes the

463

:

probability of the data.

464

:

Okay.

465

:

So they're trying to do a greedy search

over these kernel structures for Gaussian

466

:

processes using these different search

operators.

467

:

And for each different kernel, you might

find the maximum likelihood parameter, et

468

:

cetera.

469

:

And I think that's a fine approach.

470

:

But it does run into some serious

limitations, and I'll mention a few of

471

:

them.

472

:

One limitation is that greedy search is in

a sense not representing any uncertainty

473

:

about what's the right structure.

474

:

It's just finding a single best structure

to maximize some probability or maybe

475

:

likelihood of the data.

476

:

But we know just like parameters are

uncertain, structure can also be quite

477

:

uncertain because the data is very noisy.

478

:

We may have sparse data.

479

:

And so, you know, we'd want type of

inference systems that are more robust.

480

:

when discovering the temporal structure in

the data and that greedy search doesn't

481

:

really give us that level of robustness

through expressing posterior uncertainty.

482

:

I think another challenge with greedy

search is its scalability.

483

:

And by that, if you have a very large data

set in a greedy search algorithm, we're

484

:

typically at each stage of the search,

we're looking at the entire data set to

485

:

score our model.

486

:

And this is also a traditional Markov

chain Monte Carlo algorithms.

487

:

We often score our data set, but in the

Gaussian process setting, scoring the data

488

:

set is very expensive.

489

:

If you have N data points, it's going to

cost you N cubed.

490

:

And so it becomes quite infeasible to run

greedy search or even pure Markov chain

491

:

Monte Carlo, where at each step, each time

you change the parameters or you change

492

:

the kernel, you need to now compute the

full likelihood.
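
The N-cubed cost mentioned here comes from factorizing the dense N-by-N covariance matrix inside the GP marginal likelihood; a minimal sketch of that scoring step (the standard formula, independent of AutoGP):

```julia
using LinearAlgebra

# Log marginal likelihood of a zero-mean GP (standard formula, not AutoGP code).
# The Cholesky factorization of the N x N covariance matrix is the O(N^3) step
# that makes re-scoring the full dataset at every move so expensive.
function gp_logml(K::Matrix{Float64}, y::Vector{Float64}, noise::Float64)
    N = length(y)
    C = cholesky(Symmetric(K + noise^2 * I(N)))   # O(N^3)
    alpha = C \ y                                 # (K + noise^2 I)^{-1} y
    return -0.5 * dot(y, alpha) - sum(log, diag(C.L)) - 0.5 * N * log(2pi)
end
```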

493

:

And so the second motivation in AutoGP is

to build an inference algorithm.

494

:

that is not looking at the whole data set

at each point in time, but using subsets

495

:

of the data set that are sequentially

growing.

496

:

And that's where the sequential Monte

Carlo inference algorithm comes in.

497

:

So AutoGP is implemented in Julia.

498

:

And the API is that basically you give it

a one -dimensional time series.

499

:

You hit infer.

500

:

And then it's going to report an ensemble

of Gaussian processes or a sample from my

501

:

posterior distribution, where each

Gaussian process has some particular

502

:

structure and some numeric parameters.

503

:

And you can show the user, hey, I've

inferred these hundred GPs from my

504

:

posterior.

505

:

And then they can start using them for

generating predictions.

506

:

You can use them to find outliers because

these are probabilistic models.

507

:

You can use them for a lot of interesting

tasks.

508

:

Or you might say, you know,

509

:

This particular model actually isn't

consistent with what I know about the

510

:

data.

511

:

So you might remove one of the posterior

samples from your ensemble.

512

:

Yeah, so those are, you know, we used

AutoGP on the M3.

513

:

We benchmarked it on the M3 competition

data.

514

:

M3 is around, or the monthly data sets in

M3 are around 1,500 time series, you

515

:

know, between 100 and 500 observations in

length.

516

:

And we compared the performance against

different statistics baselines and machine

517

:

learning baselines.

518

:

And it's actually able to find pretty

common sense structures in these economic

519

:

data.

520

:

Some of them have seasonal features,

multiple seasonal effects as well.

521

:

And what's interesting is we don't need to

customize the prior to analyze each data

522

:

set.

523

:

It's essentially able to discover.

524

:

And what's also interesting is that

sometimes when the data set just looks

525

:

like a random walk, it's going to learn a

covariance structure, which emulates a

526

:

random walk.

527

:

So by having a very broad prior

distribution on the types of covariance

528

:

structures that you see, it's able to find

which of these are plausible explanation

529

:

given the data.

530

:

Yes, as you mentioned, we implemented this

in Julia.

531

:

The reason is that AutoGP is built on the

Gen probabilistic programming language,

532

:

which is embedded in the Julia language.

533

:

And the reason that Gen, I think, is a

very useful system for this problem.

534

:

So Gen was developed primarily by Marco

Cusumano-Towner, who wrote a PhD thesis.

535

:

He was a colleague of mine at the MIT

Probabilistic Computing Project.

536

:

And Gen really, it's a Turing complete

language and has programmable inference.

537

:

So you're able to write a prior

distribution over these symbolic

538

:

expressions in a very natural way.

539

:

And you're able to customize an inference

algorithm that's able to solve this

540

:

problem efficiently.

541

:

And

542

:

What really drew us to Gen for this

problem, I think, are twofold.

543

:

The first is its support for sequential

Monte Carlo inference.

544

:

So it has a pretty mature library for

doing sequential Monte Carlo.

545

:

And sequential Monte Carlo construed more

generally than just particle filtering,

546

:

but other types of inference over

sequences of probability distributions.

547

:

So particle filters are one type of

sequential Monte Carlo algorithm you might

548

:

write.

549

:

But you might do some type of temperature

annealing or data annealing or other types

550

:

of sequentialization strategies.

551

:

And Gen provides a very nice toolbox and

abstraction for experimenting with

552

:

different types of sequential Monte Carlo

approaches.

553

:

And so we definitely made good use of that

library when developing our inference

554

:

algorithm.

555

:

The second reason I think that Gen was

very nice to use is its library for

556

:

involutive MCMC.

557

:

And involutive MCMC, it's a relatively new

framework.

558

:

It was discovered, I think, concurrently.

559

:

and independently both by Marco and other

folks.

560

:

And this is kind of, you can think of it

as a generalization of reversible jump

561

:

MCMC.

562

:

And it's really a unifying framework to

understand many different MCMC algorithms

563

:

using a common terminology.

564

:

And so there's a wonderful ICML paper

which lists 30 or so different algorithms

565

:

that people use all the time like

Hamiltonian Monte Carlo, reversible jump

566

:

MCMC, Gibbs sampling, Metropolis Hastings.

567

:

and expresses them using the language of

involutive MCMC.

568

:

I believe the author is Kirill Neklyudov,

although I might be mispronouncing that,

569

:

sorry for that.

570

:

So, Gen has a library for involutive MCMC,

which makes it quite easy to write

571

:

different proposals for how you do this

inference over your symbolic expressions.

572

:

Because when you're doing MCMC within the

inner loop of a sequential Monte Carlo

573

:

algorithm,

574

:

You need to somehow be able to improve

your current symbolic expressions for the

575

:

covariance kernel, given the observed

data.

576

:

And, uh, doing that is, is hard because

this is kind of a reversible jump

577

:

algorithm where you make a structural

change.

578

:

Then you need to maybe generate some new

parameters.

579

:

You need the reverse probability of going

back.

580

:

And so Gen has a high level, has a lot of

automation and a library for implementing

581

:

these types of structure moves in a very

high level way.

582

:

And it automates the low-level math for

583

:

computing the acceptance probability and

embedding all of that within an outer

584

:

level SMC loop.
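
Schematically, the low-level math Gen automates here is the acceptance probability of an involutive MCMC move, where x is the current kernel structure with its parameters, u is auxiliary randomness, and f is an involution mapping (x, u) to (x', u'). This is the generic textbook formulation, not Gen's exact internal notation:

```latex
\alpha\big((x,u) \to (x',u')\big)
  = \min\left\{ 1,\;
      \frac{\pi(x')\, q(u' \mid x')}{\pi(x)\, q(u \mid x)}
      \left|\det \frac{\partial f(x,u)}{\partial (x,u)}\right|
    \right\},
\qquad (x',u') = f(x,u)
```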

585

:

And so this is, I think, one of my

favorite examples for what probabilistic

586

:

programming can give us, which is very

expressive priors over these, you know,

587

:

symbolic expressions generated by symbolic

grammars, powerful inference algorithms

588

:

using combinations of sequential Monte

Carlo and involutive MCMC and reversible

589

:

jump moves and gradient based inference

over the parameters.

590

:

It really brings together a lot of the

591

:

a lot of the strengths of probabilistic

programming languages.

592

:

And we showed at least on these M3

datasets that they can actually be quite

593

:

competitive with state-of-the-art

solutions, both in statistics and in

594

:

machine learning.

595

:

I will say, though, that as with

traditional GPs, the scalability is really

596

:

in the likelihood.

597

:

So whether AutoGP can handle datasets with

10,000 data points, it's actually too

598

:

hard because ultimately,

599

:

Once you've seen all the data in your

sequential Monte Carlo, you will be forced

600

:

to do this sort of N cubed scaling, which

then, you know, you need some type of

601

:

improvements or some type of approximation

for handling larger data.

602

:

But I think what's more interesting in

AutoGP is not necessarily that it's

603

:

applied to inferring structures of

Gaussian processes, but that it's sort of

604

:

a library for inferring probabilistic

structure and showing how to do that by

605

:

integrating these different inference

methodologies.

606

:

Hmm.

607

:

Okay.

608

:

Yeah, so many things here.

609

:

So first, I put all the links to AutoGP.jl

in the show notes.

610

:

I also put a link to the underlying paper

that you've written with some co-authors

611

:

about, well, the sequential Monte Carlo

learning that you're doing to discover

612

:

these time-series structures for people

who want to dig deeper.

613

:

And I put also a link to all, well, most

of the LBS episodes where we talk about

614

:

Gaussian processes for people who need a

bit more background information because

615

:

here we're mainly going to talk about how

you do that and so on and how useful is

616

:

it.

617

:

And we're not going to give a primer on

what Gaussian processes are.

618

:

So if you want that, folks, there are a

bunch of episodes in the show notes for

619

:

that.

620

:

So...

621

:

on that basically practical utility of

that time -series discovery.

622

:

So if understood correctly, for now, you

can do that only on one-dimensional input

623

:

data.

624

:

So that would be basically on a time

series.

625

:

You cannot input, let's say, that you have

categories.

626

:

These could be age groups.

627

:

So.

628

:

you could one-hot, usually I think that's

the way it's done, how to give that to a

629

:

GP would be to one-hot encode each of

these age groups.

630

:

And then that means, let's say you have

four age groups.

631

:

Now the input dimension of your GP is not

one, which is time, but it's five.

632

:

So one for time and four for the age

groups.

633

:

This would not work here, right?

634

:

Right, yes.

635

:

So at the moment, we're focused on, and

these are called, I guess, in

636

:

econometrics, pure time series models,

where you're only trying to do inference

637

:

on the time series based on its own

history.

638

:

I think the extensions that you're

proposing are very natural to consider.

639

:

You might have a multi-input Gaussian

process where you're not only looking at

640

:

your own history, but you're also

considering some type of categorical

641

:

variable.

642

:

Or you might have exogenous covariates

evolving along with the time series.

643

:

If you want to predict temperature, for

example, you might have the wind speed and

644

:

you might want to use that as a feature

for your Gaussian process.

645

:

Or you might have an output, a multiple

output Gaussian process.

646

:

You want a Gaussian process over multiple

different time series generally.

647

:

And I think all of these variants are, you

know, they're possible to develop.

648

:

There's no fundamental difficulty, but the

main, I think the main challenge is how

649

:

can you define a domain specific language

over these covariance structures for

650

:

multi, for multivariate input data?

651

:

becomes a little bit more challenging.

652

:

So in the time series setting, what's nice

is we can interpret how any type of

653

:

covariance kernel is going to impact the

actual prior over time series.

654

:

Once we're in the multi-dimensional

setting, we need to think about how to

655

:

combine the kernels for different

dimensions in a way that's actually

656

:

meaningful for modeling to ensure that

it's more tractable.

657

:

But I think extensions of the DSL to

handle multiple inputs, exogenous

658

:

covariates, multiple outputs,

659

:

These are all great directions.

660

:

And I'll just add on top of that, I think

another important direction is using some

661

:

of the more recent approximations for

Gaussian processes.

662

:

So we're not bottlenecked by the n cubed

scaling.

663

:

So there are, I think, a few different

approaches that have been developed.

664

:

There are approaches which are based on

stochastic PDEs or state space

665

:

approximations of Gaussian processes,

which are quite promising.

666

:

There's some other things like nearest

neighbor Gaussian processes, but I'm a

667

:

little less confident about those because

we lose a lot of the nice affordances of

668

:

GPs once we start doing nearest neighbor

approximations.

669

:

But I think there's a lot of new methods

for approximate GPs.

670

:

So we might do a stochastic variational

inference, for example, an SVGP.

671

:

So I think as we think about handling more

672

:

more richer types of data, then we should

also think about how to start introducing

673

:

some of these more scalable approximations

to make sure we can still efficiently do

674

:

the structure learning in that setting.

675

:

Yeah, that would be awesome for sure.

676

:

As a more, much more on the practitioner

side than on the math side.

677

:

Of course, that's where my head goes

first.

678

:

You know, I'm like, oh, that'd be awesome,

but I would need to have that to have it

679

:

really practical.

680

:

Um, and so if I use AutoGP.jl,

so I give it time series data.

681

:

Um, then what do I get back?

682

:

Do I get back, um, the posterior samples of

the, the implied model, or do I get back

683

:

the covariance structure?

684

:

So that could be, I don't know what, what

form that could be, but I'm thinking, you

685

:

know,

686

:

Uh, often when I use GPs, I use them

inside other models with other, like I

687

:

could use a GP in a linear regression, for

instance.

688

:

And so I'm thinking that'd be cool if I'm

not sure about the covariance structure,

689

:

especially if it can do the discovery of

the seasonality and things like that

690

:

automatically, because it's always

seasonality is a bit weird and you have to

691

:

add another GP that can handle

periodicity.

692

:

Um, and then you have basically a sum of

GP.

693

:

And then you can take that sum of GP and

put that in the linear predictor of the

694

:

linear regression.

695

:

That's usually how I use that.

696

:

And very often, I'm using categorical

predictors almost always.

697

:

And I'm thinking what would be super cool

is that I can outsource that discovery

698

:

part of the GP to the computer like you're

doing with this algorithm.

699

:

And then I get back under what form?

700

:

I don't know yet.

701

:

I'm just thinking about that.

702

:

this covariance structure that I can just,

which would be an MV normal, like a

703

:

multivariate normal in a way, that I just use

in my linear predictor.

704

:

And then I can use that, for instance, in

a PyMC model or something like that,

705

:

without having to specify the GP myself.

706

:

Is it something that's doable?

707

:

Yeah, yeah, I think that's absolutely

right.

708

:

So you can, because Gaussian processes are

compositional, just, you know, you

709

:

mentioned the sum of two Gaussian

processes, which corresponds to the sum of

710

:

two kernels.

711

:

So if I have Gaussian process one plus

Gaussian process two, that's the same as

712

:

the Gaussian process whose covariance is

k1 plus k2.
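
The compositional fact being used here, in symbols: independent GPs add, and so do their covariance kernels.

```latex
f_1 \sim \mathcal{GP}(m_1, k_1),\quad f_2 \sim \mathcal{GP}(m_2, k_2),\quad f_1 \perp f_2
\;\Longrightarrow\;
f_1 + f_2 \sim \mathcal{GP}(m_1 + m_2,\; k_1 + k_2)
```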

713

:

And so what that means is we can take our

synthesized kernel, which is comprised of

714

:

some base kernels and then maybe sums and

products and change points, and we can

715

:

wrap all of these in just one mega GP,

basically, which would encode the entire

716

:

posterior distribution or, you know,

717

:

a summary of all of the samples in one GP.

718

:

Another, and I think you also mentioned an

important point, which is multivariate

719

:

normals.

720

:

You can also think of the posterior as

just a mixture of these multivariate

721

:

normals.

722

:

So let's say I'm not going to sort of

compress them into a single GP, but I'm

723

:

actually going to represent the output of

auto GP as a mixture of multivariate

724

:

normals.

725

:

And that would be another type of API.

726

:

So depending on exactly what type of

727

:

how you're planning to use the GP, I think

you can use the output of auto GP in the

728

:

right way, because ultimately, it's

producing some covariance kernels, you

729

:

might aggregate them all into a GP, or you

might compose them together to make a

730

:

mixture of GPs.

731

:

And you can export this to PyTorch, or

most of the current libraries for GPs

732

:

support composing the GPs with one

another, et cetera.

733

:

So I think depending on the use case, it

should be quite straightforward to figure

734

:

out how to leverage the output of AutoGP

to use within the inner loop of some broader model,

735

:

or within the internals of some larger

linear regression model or other type of

736

:

model.

737

:

Yeah, that's definitely super cool because

then you can, well, yeah, use that,

738

:

outsource that part of the model where I

think the algorithm probably...

739

:

If not now, in just a few years, it's

going to do a better job than most

740

:

modelers, at least to have a rough first

draft.

741

:

That's right.

742

:

The first draft.

743

:

A data scientist who's determined enough

to beat AutoGP, probably they can do it if

744

:

they put in enough effort just to study

the data.

745

:

But it's getting a first pass model that's

actually quite good as compared to other

746

:

types of automated techs.

747

:

Yeah, exactly.

748

:

I mean, that's recall.

749

:

It's like asking for a first draft of, I

don't know, blog post to ChatGPT and then

750

:

going yourself in there and improving it

instead of starting everything from

751

:

scratch.

752

:

Yeah, for sure you could do it, but that's

not where your value added really lies.

753

:

So yeah.

754

:

So what you get is these kind of samples.

755

:

In a way, do you get back samples?

756

:

or do you get symbolic variables back?

757

:

You get symbolic expressions for the

covariance kernels as well as the

758

:

parameters embedded within them.

759

:

So you might get, let's say you asked for

five posterior samples, you're going to

760

:

have maybe one posterior sample, which is

a linear kernel.

761

:

And then another posterior sample, which

is a linear times linear, so a quadratic

762

:

kernel.

763

:

And then maybe a third posterior sample,

which is again, a linear, and each of them

764

:

will have their different parameters.

765

:

And because we're using sequential Monte

Carlo,

766

:

all of the posterior samples are

associated with weights.

767

:

The sequential Monte Carlo returns a

weighted particle collection, which is

768

:

approximating the posterior.

769

:

So you get back these weighted particles,

which are symbolic expressions.

770

:

And we have, in AutoGP, we have a minimal

prediction GP library.

771

:

So you can actually put these symbolic

expressions into a GP to get a functional

772

:

GP, but you can export them to a text file

and then use your favorite GP library and

773

:

embed them within that as well.

774

:

And we also get noise parameters.

775

:

So each kernel is going to be associated

with the output noise.

776

:

Because obviously depending on what kernel

you use, you're going to infer a different

777

:

noise level.

778

:

So you get a kernel structure, parameters,

and noise for each individual particle in

779

:

your SMC ensemble.
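
A hypothetical sketch of consuming such a weighted ensemble downstream. The Particle struct and the predict_mean argument are placeholders for illustration, not the AutoGP.jl API; the point is just that the output is a weighted mixture you can average over or prune.

```julia
# Hypothetical containers for what SMC returns: one kernel expression,
# parameter set, and noise level per particle, plus a normalized weight.
struct Particle
    kernel              # symbolic covariance expression
    params              # numeric kernel parameters
    noise::Float64      # inferred observation noise
    weight::Float64     # normalized SMC weight
end

# Treat the ensemble as a mixture, e.g. a weight-averaged forecast at new inputs,
# where predict_mean(p, xs) is whatever per-particle GP prediction you build
# from the exported kernel (placeholder function for illustration).
ensemble_mean(particles, xs, predict_mean) =
    sum(p.weight .* predict_mean(p, xs) for p in particles)

# Dropping implausible structures = filtering particles and renormalizing weights
# (assumes at least one particle is kept).
function prune(particles, keep)
    kept = filter(keep, particles)
    Z = sum(p.weight for p in kept)
    return [Particle(p.kernel, p.params, p.noise, p.weight / Z) for p in kept]
end
```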

780

:

OK, I see.

781

:

Yeah, super cool.

782

:

And so yeah, if you can get back that as a

text file.

783

:

Like either you use it in a full Julia

program, or if you prefer R or Python, you

784

:

could use AutoGP.jl just for that.

785

:

Get back a text file and then use that in

R or in Python in another model, for

786

:

instance.

787

:

Okay.

788

:

That's super cool.

789

:

Do you have examples of that?

790

:

Yeah.

791

:

Do you have examples of that we can link

to for listeners in the show notes?

We have a tutorial. The tutorial, I think, prints the learned structures into the output cells of the IPython notebooks. And so you could take the printed structure, just save it as a text file, and write your own little parser for extracting those structures and building an R GP or a PyTorch GP or any other GP.
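
As a rough illustration of that parsing-and-rebuilding step, here is a small Julia sketch that walks a symbolic kernel expression and composes elementary covariance functions. The expression format and the base kernels are hypothetical stand-ins, not the exact strings AutoGP.jl prints, so a real parser would be adapted to the package's actual output.

# Illustrative sketch: compile a symbolic kernel expression into a covariance
# function. The base kernels and expression format are made-up examples.
linear(x, x′; c=0.0) = (x - c) * (x′ - c)
periodic(x, x′; p=12.0, ℓ=1.0) = exp(-2 * sin(π * abs(x - x′) / p)^2 / ℓ^2)

# Recursively evaluate an expression tree such as :(Linear * Linear + Periodic).
function build_kernel(ex)
    ex == :Linear   && return linear
    ex == :Periodic && return periodic
    if ex isa Expr && ex.head == :call
        op = ex.args[1]
        ks = map(build_kernel, ex.args[2:end])
        op == :+ && return (x, x′) -> sum(k(x, x′) for k in ks)
        op == :* && return (x, x′) -> prod(k(x, x′) for k in ks)
    end
    error("unrecognized kernel expression: $ex")
end

k = build_kernel(:(Linear * Linear + Periodic))
k(1.5, 2.0)   # covariance of the composed kernel at a pair of inputs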

Okay. Yeah, that's super cool. That's awesome. And do you know if there is already an implementation in R and/or in Python of what you're doing in AutoGP.jl?

Yeah, so this project was implemented during my year at Google, when, between finishing my PhD and starting at CMU, I was at Google for a year as a visiting faculty scientist. Some of the prototype implementations were also in Python, but I think the only public version at the moment is the Julia version. And I think it's a little bit challenging to reimplement this, because one of the things we learned when trying to implement it in Python is that we don't have Gen there, or at least at the time we didn't. The reason we focused on Julia is that we could use the power of the Gen probabilistic programming language in a way that made model development and iteration much more feasible than a pure Python implementation, or even an R implementation or one in another language.

Yeah, okay. And so, actually, I would have so many more questions on that, but I think that's already a good overview of that project. Maybe I'm curious about the biggest obstacle that you had on the path when developing that package, AutoGP.jl, and also what are your future plans for it? What would you like to see it become in the coming months and years?

Yeah, so thanks for those questions. For the biggest challenge, I think it was designing and implementing the inference algorithm that includes sequential Monte Carlo and involutive MCMC. That was a challenge because there aren't many prior works in the literature that have actually explored this type of combination, which is really at the heart of AutoGP: designing the right proposal distributions for, I have some given structure and I have my data, how do I do a data-driven proposal? So I'm not just blindly proposing some new structure or sub-structure from the prior, but actually using the observed data to come up with a smart proposal for how I'm going to improve the structure in the inner loop of MCMC. So we put a lot of thought into the actual move types and how to use the data to come up with data-driven proposal distributions. The paper describes some of these tricks. There are moves which are based on replacing a random subtree. There are moves which detach a subtree and throw everything away, or embed the subtree within a new tree. So there are these different types of moves, which we found are more helpful to guide the search. And it was a challenging process to figure out how to implement those moves and how to debug them.
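
To give a flavor of what such a structure move looks like, here is a minimal Julia sketch of the simplest kind of proposal mentioned here, replacing a randomly chosen subtree of a kernel expression with a freshly sampled one. The toy grammar and the prior over structures are assumptions for illustration; AutoGP's actual moves are data-driven and wrapped in involutive MCMC with the proper acceptance ratio.

using Random

# Toy kernel grammar for illustration: a kernel is a base symbol or a binary
# composition. This is a simplified stand-in for AutoGP's richer grammar.
const BASE_KERNELS = (:Linear, :Periodic, :SquaredExp)

# Sample a random kernel structure from a toy prior.
function sample_structure(rng; p_leaf=0.6)
    rand(rng) < p_leaf && return rand(rng, BASE_KERNELS)
    op = rand(rng, (:+, :*))
    return Expr(:call, op, sample_structure(rng), sample_structure(rng))
end

# Enumerate all subtrees as paths of argument indices (empty path = root).
function subtree_paths(ex, path=Int[])
    paths = [path]
    if ex isa Expr
        for (i, arg) in enumerate(ex.args[2:end])
            append!(paths, subtree_paths(arg, vcat(path, i + 1)))
        end
    end
    return paths
end

# Return a copy of `ex` with the subtree at `path` replaced by `new`.
function replace_subtree(ex, path, new)
    isempty(path) && return new
    ex = copy(ex)
    ex.args[path[1]] = replace_subtree(ex.args[path[1]], path[2:end], new)
    return ex
end

# One "replace a random subtree" proposal; an accept/reject step would follow.
function propose_subtree_replacement(rng, ex)
    path = rand(rng, subtree_paths(ex))
    return replace_subtree(ex, path, sample_structure(rng))
end

rng = MersenneTwister(1)
k0  = :(Linear * Periodic + SquaredExp)
k1  = propose_subtree_replacement(rng, k0)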

So that, I think, was part of the challenge. Another challenge we were facing was, of course, the fact that we were using these dense Gaussian process models, without the approximations that are needed to scale to, say, tens or hundreds of thousands of data points. And this, I think, was part of the motivation for thinking about what other types of approximations of the GP would let us handle datasets of that size.

In terms of what I'd like AutoGP to be in the future, I think there are two answers to that. One answer, and I think there's already a nice success case here, is that I'd like the implementation of AutoGP to be a reference for how to do probabilistic structure discovery using Gen. I expect that people across many different disciplines have this problem of not knowing what their specific model is for the data. You might have a prior distribution over symbolic model structures, and given your observed data, you want to infer the right model structure. And I think in the AutoGP code base we have a lot of the important components that are needed to apply this workflow to new settings. We've really put a lot of effort into having the code be self-documenting, in a sense, and making it easier for people to adapt the code for their own purposes. And so there was a recent paper this year, presented at NeurIPS by Tracy Mills and Sam Shayet from Professor Tenenbaum's group, that extended the AutoGP package for a task in cognition, which was very nice to see: the code isn't only valuable for its own purpose, but is also adaptable by others for other types of tasks.

And the second thing that I'd like AutoGP, or at least AutoGP-type models, to do is integrate with language models, and this goes back to the original automatic statistician that motivated AutoGP, work from, say, ten years ago. The automated statistician had a natural language processing component. At the time there was no ChatGPT or large language models, so they just wrote some simple rules to take the learned Gaussian process and summarize it in terms of a report.

But now we have much more powerful language models. And one question could be, how can I use the outputs of AutoGP and integrate them within a language model, not only for reporting the structure, but also for answering probabilistic queries? So you might say: find for me a time when there could be a change point, or give me a numerical estimate of the covariance between two different time slices, or impute the data between these two different time regions, or give me a 95% prediction interval. And so a data scientist, or rather a domain specialist, can write these in natural language, and then you would compile them into different little programs that are querying the GP learned by AutoGP. So the idea is to create some type of higher-level interface that makes it possible for people to not necessarily dive into the guts of Julia, or even write an IPython notebook, but have the system learn the probabilistic models and then have a natural language interface which you can use to query those models, either for learning something about the structure of the data or for solving prediction tasks.

And in both cases, I think off-the-shelf models may not work so well, because they may not know how to parse the AutoGP kernel to come up with a meaningful summary of what it actually means in terms of the data, or they may not know how to translate natural language into Julia code for AutoGP. So there's a little bit of research in thinking about how we fine-tune these models so that they're able to interact with the automatically learned probabilistic models.

And I'll just mention here one of the benefits of an AutoGP-like system, which is its interpretability. Because Gaussian processes are quite transparent, like you said, they're ultimately, at the end of the day, these giant multivariate normals, we can explain to people who are using these types of distributions, and who are comfortable with them, what exactly the distribution is that's been learned. It's not: here are some weights in some giant neural network, here's the prediction, and you have to live with it. Rather, you can say, well, here's our prediction, and the reason we made this prediction is that we inferred a seasonal component with such-and-such frequency. So you can get the predictions, but you can also get some type of interpretable summary of why those predictions were made, which maybe helps with the trustworthiness of the system, or just transparency more generally.

Yeah, I'm signing up now. That sounds like an awesome tool. Yeah, for sure, that looks absolutely fantastic. And hopefully these kinds of tools will help. I'm definitely curious to try that now in my own models, basically, and see what AutoGP.jl tells me about the covariance structure, and then try and use that myself in a model of mine, probably in Python, so that I'd have to get it out of Julia and see how you can plug that into another model. That would be super, super interesting for sure. Yeah, I'm going to try and find an excuse to do that.

Actually, I'm curious now, we could talk a bit about how that's done, right? How you do that discovery of the time series structure. You've mentioned that you're using sequential Monte Carlo to do that. So, SMC: can you give listeners an idea of what SMC is and why that would be useful in that case? And also whether the way you do it for these projects differs from the classical way of doing SMC.

Good, yes, thanks for that question. So sequential Monte Carlo is a very broad family of algorithms. And I think one of the confusing parts for me when I was learning sequential Monte Carlo is that a lot of the introductory material on sequential Monte Carlo is very closely married to particle filters. But particle filtering, which is only one application of sequential Monte Carlo, isn't the whole story. There are now more modern expositions of sequential Monte Carlo which really bring to light how general these methods are, and here I would like to recommend Professor Nicolas Chopin's textbook, An Introduction to Sequential Monte Carlo. It's a Springer 2020 textbook. I continue to use it in my research, and I think it's a very well-written overview of how general and how powerful sequential Monte Carlo is.

So, a brief explanation of sequential Monte Carlo. Maybe one way we could explain it is to contrast it with traditional Markov chain Monte Carlo. In traditional MCMC, we have some particular latent state, let's call it theta, and theta is supposed to be drawn from p of theta given x, which is our posterior distribution, where x is the data. We just apply some transition kernel over and over and over again, and the hope is that, in the limit of the applications of these transition kernels, we're going to converge to the posterior distribution. Okay, so MCMC is just one iterative chain that you run forever.

You can make a few modifications; you might have multiple chains which are independent of one another. But sequential Monte Carlo is, in a sense, trying to go beyond that: anything you can do in a traditional MCMC algorithm, you can do using sequential Monte Carlo, but in sequential Monte Carlo you don't have a single chain, you have multiple different particles. Each of these particles you can think of as being analogous in some way to a particular MCMC chain, but they're allowed to interact. And so you start with, say, some number of particles, and you start with no data. What you would do is just draw these particles from your prior distribution, and each of these draws from the prior is basically a draw from p of theta. Now I'd like to get them to p of theta given x, that's my goal. So I start with a bunch of particles drawn from p of theta, and I'd like to get them to p of theta given x. How am I going to go from p of theta to p of theta given x? There are many different ways you might do that, and that's exactly what's sequential, right?

How do you go from the prior to the posterior? The approach we take in AutoGP is based on this idea of data tempering. So let's say my data x consists of a thousand measurements, and I'd like to go from p of theta to p of theta given x. Well, here's one sequential strategy I can use to bridge between these two distributions: I can start with p of theta, then p of theta given x1, then p of theta given x1 and x2, then p of theta given x1, x2, and x3, and so on. So I can anneal, or temper, these data points into the prior, and the more data points I put in, the closer I'm going to get to the full posterior, p of theta given x1 through x1000 or something. Or you might introduce the data in batches.
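
Written out, the bridge of intermediate targets that this data tempering builds is the standard SMC sequence (the batch boundaries here are just notation, chosen by the algorithm):

\[
p(\theta) \;\rightarrow\; p(\theta \mid x_{1:t_1}) \;\rightarrow\; p(\theta \mid x_{1:t_2}) \;\rightarrow\; \cdots \;\rightarrow\; p(\theta \mid x_{1:N}),
\qquad 0 < t_1 < t_2 < \cdots < N,
\]

where each intermediate distribution conditions on one more batch of observations and the final target is the full posterior.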

But the key idea is that you start with draws from some prior, typically, and then you're just adding more and more data, and you're reweighting the particles based on the probability that they assign to the new data. So if I have 10 particles and some particle is always assigning a very high score to the new data, I know that's a particle that's explaining the data quite well. And so I might resample these particles according to their weights, to get rid of the particles that are not explaining the new data well and to focus my computational effort on the particles that are explaining the data well.

And this is something that an MCMC algorithm does not give us. Even if we run, like, a hundred MCMC chains in parallel, we don't know how to resample the chains, for example, because they're all independent executions and we don't have a principled way of assigning a score to those different chains. You can't use the joint likelihood; that's not a valid or even a meaningful statistic for measuring the quality of a given chain. But SMC, because it's built on importance sampling, has a principled way for us to assign weights to these different particles and focus on the ones which are most promising. And then I think the final component that's missing in my explanation is: where does the MCMC come in?

01:02:50,728 --> 01:02:54,288

So traditionally in sequential Monte

Carlo, there was no MCMC.

:

01:02:54,288 --> 01:02:59,368

You would just have your particles, you

would add new data, you would reweight it

:

01:02:59,368 --> 01:03:02,168

based on the probability of the data, then

you would resample the particles.

:

01:03:02,168 --> 01:03:03,628

Then I'm going to add some...

:

01:03:03,628 --> 01:03:07,008

next batch of data, resample, re -weight,

et cetera.

:

01:03:07,008 --> 01:03:12,648

But you're also able to, in between adding

new data points, run MCMC in the inner

:

01:03:12,648 --> 01:03:14,608

loop of sequential Monte Carlo.

:

01:03:14,608 --> 01:03:20,318

And that does not sort of make the

algorithm incorrect.

:

01:03:20,318 --> 01:03:23,968

It preserves the correctness of the

algorithm, even if you run MCMC.

:

01:03:23,968 --> 01:03:28,908

And there the intuition is that, you know,

your prior draws are not going to be good.

:

01:03:28,908 --> 01:03:32,248

So now that after I've observed say 10 %

of the data, I might actually run some

:

01:03:32,248 --> 01:03:37,288

MCMC on that subset of 10 % of the data

before I introduce the next batch of data.

:

01:03:37,288 --> 01:03:42,048

So after you're reweighting the particles,

you're also using a little bit of MCMC to

:

01:03:42,048 --> 01:03:45,608

improve their structure given the data

that's been observed so far.

:

01:03:45,608 --> 01:03:49,288

And that's where the MCMC is run inside

the inner loop.
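
Putting those pieces together, here is a compact Julia sketch of a generic data-tempered SMC loop with reweighting, resampling, and an MCMC rejuvenation step. It is a schematic of the recipe described here, with placeholder prior, likelihood, and rejuvenation functions supplied by the caller; it is not AutoGP's actual implementation.

using Random, StatsBase

# Generic data-tempered SMC. `prior_sample(rng)` draws a particle,
# `loglik(θ, batch)` scores a batch of data under θ, and
# `rejuvenate(rng, θ, seen)` applies MCMC moves targeting p(θ | seen).
function smc_temper(rng, prior_sample, loglik, rejuvenate, batches; n_particles=100)
    particles  = [prior_sample(rng) for _ in 1:n_particles]
    logweights = zeros(n_particles)
    seen       = eltype(batches)[]          # batches incorporated so far

    for batch in batches
        # Reweight: score each particle by the probability it assigns to the new batch.
        for i in 1:n_particles
            logweights[i] += loglik(particles[i], batch)
        end

        # Resample: drop particles that explain the data poorly, duplicate good ones.
        w   = exp.(logweights .- maximum(logweights))
        idx = sample(rng, 1:n_particles, Weights(w), n_particles)
        particles  = particles[idx]
        logweights = zeros(n_particles)      # equal weights after resampling

        # Rejuvenate: MCMC moves on the data observed so far keep the algorithm valid.
        push!(seen, batch)
        particles = [rejuvenate(rng, θ, seen) for θ in particles]
    end
    return particles
end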

Some of the benefits of this kind of approach are, like I mentioned at the beginning, that in MCMC you have to compute the probability of all the data at each step. But in SMC, because we're sequentially incorporating new batches of data, we can get away with only looking at, say, 10 or 20% of the data and get some initial inferences before we actually reach the end and have processed all of the observed data.

So that's, I guess, a high-level overview of the algorithm that AutoGP is using. It's annealing, or tempering, the data; it's reassigning the scores of the particles based on how well they're explaining the new batch of data; and it's running MCMC to improve their structure by applying these different moves, like removing a sub-expression, adding a sub-expression, different things of that nature.

Okay, yeah. Thanks a lot for this explanation, because that was a very hard question on my part, and I think you've done a tremendous job explaining the basics of SMC and when it would be useful. So, yeah, thank you very much, I think that's super helpful. And why, in this case, when you're trying to do these kinds of time series discoveries, why would SMC be more useful than a classic MCMC?

Yeah, so it's more useful, I guess, for several reasons. One reason is that, well, you might actually have a true streaming problem. If your data is actually streaming, you can't use MCMC, because MCMC operates on a static data set. So what if I'm running AutoGP in some type of industrial process system where data is coming in, and I'm updating the models in real time as my data is coming in? That's a purely online setting, which SMC is perfect for, but MCMC is not so well suited, because you basically don't have a way to, I mean, obviously you can always incorporate new data in MCMC, but that's not the traditional algorithm whose correctness properties we know. So when you have streaming data, that might be extremely useful.

But even if your data is not streaming, theoretically there are results that show that convergence can be much improved when you use the sequential Monte Carlo approach, because you have these multiple particles that are interacting with one another. What they can do is explore multiple modes, whereas in MCMC each individual chain might get trapped in a mode, and unless you have an extremely accurate posterior proposal distribution, you may never escape from that mode. But in SMC we're able to resample these different particles so that they're interacting, which means you can probably explore the space much more efficiently than you could with a single chain that's not interacting with other chains.

And this is especially important in the types of posteriors that AutoGP is exploring, because these are symbolic expression spaces; they are not Euclidean spaces. So we expect there to be largely non-smooth components, and we want to be able to jump efficiently through this space via the resampling procedure of SMC, which is why it's a suitable algorithm.

And then the third component, which is more specific to GPs in particular, is that because GPs have a cubic cost of evaluating the likelihood, in MCMC that's really going to bite you if you're doing it at each step. If I have a thousand, or a million, observations, I don't want to be doing that at each step. But in SMC, because the data is being introduced in batches, what that means is I might be able to get some very accurate predictions using only the first 10% of the data, for which evaluating the likelihood is going to be quite cheap.
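
For reference, the cubic cost mentioned here comes from the dense GP marginal likelihood, which requires factorizing the n-by-n covariance matrix:

\[
\log p(y \mid X, \theta) = -\tfrac{1}{2}\, y^{\top} (K_{\theta} + \sigma^{2} I)^{-1} y
- \tfrac{1}{2} \log \lvert K_{\theta} + \sigma^{2} I \rvert
- \tfrac{n}{2} \log 2\pi,
\]

whose Cholesky factorization costs \(O(n^{3})\) time and \(O(n^{2})\) memory, so an early batch containing 10% of the observations is roughly a thousand times cheaper to score than the full dataset.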

So you're somehow smoothly interpolating between the prior, where you can get perfect samples, and the posterior, which is hard to sample, using these intermediate distributions, which are closer to one another than the prior is to the posterior. And that's essentially what makes inference hard: the distance between the prior and the posterior. Because SMC introduces the data in smaller batches, it makes it easier to bridge between the prior and the posterior by having these partial posteriors, basically.

Okay, I see. Yeah, okay, that makes sense, because of that batching process, basically. Yeah, for sure. And the requirements of MCMC coupled to a GP, that's for sure making things hard. Yeah.

Well, I've already taken a lot of time from you, so thanks a lot for that, I really appreciate it. Everything you're doing is very, very fascinating. I'm curious also, because you're a bit on both sides, right? You see practitioners, but you're also on the very theoretical side, and you also teach. So I'm wondering, in your opinion, what's the biggest hurdle in the Bayesian workflow currently?

Yeah, I think there are really a lot of hurdles; I don't know if there's a biggest one. Obviously, Professor Andrew Gelman has an enormous manuscript on arXiv called Bayesian Workflow, and he goes through the nitty-gritty of all the different challenges in coming up with a Bayesian model. But for me, at least, the one that's tied closely to my research is: where do we even start? Where do we start this workflow? And that's really what drives a lot of my interest in automatic model discovery and probabilistic program synthesis. The idea is not that we want to discover the model that we're going to use for the rest of the lifetime of the workflow, but to come up with good explanations that we can use to bootstrap this process, after which we can apply the different stages of the workflow. I think it's getting from just data to plausible explanations of that data, and that's what probabilistic program synthesis, or automatic model discovery, is trying to solve.

So I think that's a very large bottleneck. And then I'd say the second bottleneck is the scalability of inference. I think Bayesian inference has a poor reputation in many corners because of how unscalable traditional MCMC algorithms are. But in the last 10 or 15 years we've seen many foundational developments in more scalable posterior inference algorithms that are being used in many different settings in computational science, et cetera. And I think building probabilistic programming technologies that better expose these different inference innovations is going to help push Bayesian inference to the next level of applications, applications that people have traditionally thought were beyond reach because of the lack of scalability. So I think it's worth putting a lot of effort into engineering probabilistic programming languages that really have fast, powerful inference, whether it's sequential Monte Carlo, Hamiltonian Monte Carlo with No-U-Turn sampling, or involutive MCMC over discrete structures; there are really a lot of methods we've seen quite recently, and if you put them together, we can come up with very powerful inference machinery.

And then I think the last thing I'll say on that topic is that we also need some new research into how to configure our inference algorithms. We spend a lot of time thinking about whether our model is the right model, but now that we have probabilistic programming, and we have inference algorithms maybe themselves implemented as probabilistic programs, we might think in a more mathematically principled way about how to optimize the inference algorithms in addition to optimizing the parameters of the model. I think of some type of joint inference process where you're simultaneously using the right inference algorithm for your given model and have some type of automation that's helping you make those choices.

Yeah, kind of like the automated statistician that you were talking about at the beginning of the show. Yeah, that would be fantastic, definitely kind of like having a stats sidekick helping you when you're modeling. That would definitely be fantastic. Also, as you were saying, the workflow is so big and diverse that it's very easy to forget about something, forget a step, neglect one, because we're all humans, you know, things like that. No, definitely. And as you were saying, you're also a professor at CMU. So I'm curious how you approach teaching these topics, teaching stats to prepare your students for all of these challenges, especially given the challenges of probabilistic computing that we've mentioned throughout this show.

Yeah, that's something I think about frequently, actually, because I haven't been teaching for a very long time, and over the course of the next few years I'm going to have to put a lot of effort into thinking about how to give students who are interested in these areas the right background so that they can quickly be productive. And what's especially challenging, at least in my interest area, is that there's both the probabilistic modeling component and the programming languages component. What I've learned is that these two communities don't talk much with one another. You have people doing statistics who think, oh, a programming language is just our scripts and that's really all it is, and I never want to think about it, because that's the messy details. But if we think about programming languages in a principled way, and we start looking at the code as a first-class citizen, just like our mathematical model is a first-class citizen, then we really need to be thinking in a much more principled way about our programs.

And I think the type of students who are going to make a lot of strides in this research area are those who really value programming languages theory in addition to the statistics and the Bayesian modeling that's actually used for the workflow. So the type of courses we're going to need to develop, at the graduate level or at the undergraduate level, are going to need to really bring together these two different worldviews: the worldview of empirical data analysis, statistical model building, things of that sort, but also the programming languages view, where we're being very formal about what these actual systems are, what they're doing, what their semantics are, what their properties are, what type systems enable us to get certain guarantees, maybe compiler technologies. So there are elements of both of these communities that need to go into teaching people how to be productive probabilistic programming researchers, bringing ideas from these two different areas.

So with the students I advise, for example, I often try to get a sense of whether they're more in the programming languages world and need to learn a little bit more about the Bayesian modeling, or whether they're more squarely in Bayesian modeling and need to appreciate some of the PL aspects better. And that's the sort of game you have to play to figure out the right areas to focus on for different students, so that they can have a more holistic view of probabilistic programming and its goals, and probabilistic computing more generally, and build the technical foundations that are needed to carry forward that research.

Yeah, that makes sense. And related to that, are there any future developments that you foresee, or expect, or hope for, in probabilistic reasoning systems in the coming years?

Yeah, I think there are quite a few, and I already touched upon one of them, which is the integration with language models, for example. There's a lot of excitement about language models. From my perspective, as a research area, that's not what I do research in, but if we think about how to leverage the things they're good at, it might be for creating these types of interfaces between automatically learned probabilistic programs and natural language queries about these learned programs, for solving data analysis or data science tasks. And I think marrying these two ideas is important, because if people are going to start using language models for solving statistics problems, I would be very worried. I don't think language models in their current form, which are not backed by probabilistic programs, are at all appropriate for doing data science or data analysis. But I expect people will be pushing in that direction. The direction that I'd really like to see thrive is the one where language models are interacting with probabilistic programs to come up with better, more principled, more interpretable reasoning for answering an end user's question. So I think these types of probabilistic reasoning systems will really make probabilistic programs more accessible on the one hand, and will make language models more useful on the other hand. That's something that I'd like to see from the application standpoint.

From the theory standpoint, I have many theoretical questions, which maybe I won't get into, which are really related to the foundations of random variate generation. Like I was mentioning at the beginning of the talk, understanding in a more mathematically principled way the properties of the inference algorithms, or the probabilistic computations, that we run on our finite-precision machines. I'd like to build a type of complexity theory for these, or a theory about the error, complexity, and resource consumption of Bayesian inference in the presence of finite resources. And that's a much longer-term vision, but I think it will be quite valuable once we start understanding the fundamental limitations of our computational processes for running probabilistic inference and computation.

Yeah, that sounds super exciting.

Thanks, Alex.

That's making me so hopeful for the coming years, to hear you talk in that way. I'm, like, super stoked about the world that you are depicting here. And actually, I think I still had so many questions for you, because, as I was saying, you're doing so many things, but I think I've taken enough of your time, so let's call it a show. Before you go, though, I'm going to ask you the last two questions I ask every guest at the end of the show. If you had unlimited time and resources, which problem would you try to solve?

Yeah, that's a very tough question; I should have prepared for that one better. I think one area which would be really worth solving, at least within the scope of Bayesian inference and probabilistic modeling, is using these technologies to unify people around data, around solid data-driven inferences, to have better discussions in empirical fields, right? Obviously politics is extremely divisive. People have all sorts of different interpretations based on their political views, based on their aesthetics, whatever, and all of that's natural. But one question I think about is: how can we have a shared language when we talk about a given topic, or the pros and cons of that topic, in terms of rigorous data-driven theses about why we have these different views, and try to disentangle the fundamental tensions and bring down the temperature, so that we can talk more about the data, and leverage insights from the data, and use that to guide our decision-making, especially in the more divisive areas like public policy, things of that nature?

But I think part of the challenge, part of why we don't do this, is that from the political standpoint it's much easier not to focus on what the data is saying, because that can be expedient and it appeals to a broader set of people. At the same time, maybe we don't have the right language for how we might use data to think in a more principled way about some of the major challenges that we're facing. So, yeah, I'd like to get to a stage where we can focus more on principled discussions about hard problems that are really grounded in data. And the way we would get those sorts of insights is by building good probabilistic models of the data and using them to explain to policymakers why they shouldn't do a certain thing, for example. So I think that's a very important problem to solve, because surprisingly many areas that are very high impact are not using real-world inference and data to drive their decision-making. And that's quite shocking, whether that be in medicine, where we're using very archaic inference technologies in clinical trials, things of that nature, or even economics, right? Linear regression is still the workhorse in economics. We're using very primitive data analysis technologies. I'd like to see how we can use better data technologies, better types of inference, to think about these hard, challenging problems.

Yeah, couldn't agree more. And I'm coming from a political science background, so for sure these topics are always very interesting to me, quite dear to me, even though in the last few years I have to say I've become more and more pessimistic about them. I completely agree with the problems and the issues you have laid out; as for the solutions, I am, for now, completely out of them, unfortunately. But yeah, I agree that something has to be done, because these kinds of political debates that are completely detached from the scientific consensus are just strange to me. I'm like, but we've talked about that, we've learned that, it's one of the things we know; I don't know why we're still arguing about it. Or, if we don't know, why don't we try and find a way to find out, instead of just saying, I'm right because I think I'm right and my position actually makes sense. That's one of the worst arguments: oh, well, it's common sense.

Yeah, I think maybe there's some work we have to do in having people trust, you know, science and data-driven inference and data analysis more. That's about being more transparent, improving the ways in which they're being used, things of that nature, so that people trust these methods and they become the gold standard for talking about different political issues, or social issues, or economic issues.

Yeah, for sure. But at the same time, and that's definitely something I try to do at a very small scale with these podcasts, it's: how do you communicate about science and try to educate the general public better? And I definitely think it's useful. At the same time, it's a hard task, because it's hard: if you want to find out the truth, it's often not intuitive, and so in a way you have to want it. It's like, eh, I know broccoli is better for my health long term, but I still prefer to eat a very, very fat snack; I definitely prefer Snickers. And yet I know that eating lots of fruits and vegetables is way better for my health long term. And I feel it's a bit of a similar issue, where I'm pretty sure people know it's better long term to use these kinds of methods to find out the truth, even if it's a political issue, even more, I would say, if it's a political issue.

But it's just so easy right now, at least given how the different political incentives are, especially in Western democracies, the incentives that come with the media structure and so on, it's actually way easier to not care about that and just lie and say what you think is true than to actually do the hard work. And I agree, it's very hard. How do you make that hard work look not boring, but actually what you're supposed to do? That, I don't know for now. Yeah.

That makes me think, I mean, I'm definitely always thinking about these things and so on. Something that definitely helped me at a very small scale, my scale, because of course I'm always the scientist around the table, so of course when these kinds of topics come up I'm like, where does that come from? Why are you saying that? How do you know that's true? What's your level of confidence? Things like that. There is actually a very interesting framework which can teach you how to ask questions to really understand where people are coming from and how they developed their positions, rather than trying to argue with them about their position.

It usually ties in also with the literature about how to actually not debate, but talk with, someone who has very entrenched political views. It's called street epistemology; I don't know if you've heard of that. It is super interesting, and I will link to it in the show notes. There is a very good YouTube channel by Anthony Magnabosco, who is one of the main people doing street epistemology, so I will link to that. You can watch his videos where he literally goes out in the street and talks about very, very hot topics with random people: it can be politics, and very often it's about supernatural beliefs, religious beliefs, things like that. These are really not light topics, but it's done through the framework of street epistemology, which I find super helpful.

And if you want a bigger overview of these topics, there is a very good, somewhat recent book called How Minds Change, by David McRaney, who also has a very good podcast called You're Not So Smart. So I definitely recommend those resources; I'll put them in the show notes.

Awesome. Well, Feras, that was an unexpected end to the show. Thanks a lot, I think we've covered so many different topics. Well, actually, I still have a second question to ask you, the second of those last two questions: if you could have dinner with any great scientific mind, dead, alive, or fictional, who would it be?

I think I will go with Hercule Poirot, Agatha Christie's famous detective. I read a lot of Hercule Poirot, and I would ask him, because everything he does is based on inference, so I'd work with him to come up with a formal model of the inferences that he's making to solve very hard crimes.

That's the first time someone answers Hercule Poirot, but I'm not surprised as to the motivation. So I like it, I like it. I think I would do that with Sherlock Holmes also; Sherlock Holmes has a very Bayesian mind. I really love that. Yeah, for sure.

Awesome. Well, thanks a lot, Feras, that was a blast. We've talked about so many things, I've learned a lot about GPs, and I'm definitely going to try AutoGP.jl. Thanks a lot for all the work you are doing on that, and for all the different topics you are working on and were kind enough to come here and talk about. As usual, I will put resources and links to your website in the show notes for those who want to dig deeper, and feel free to add anything yourself for listeners. And on that note, thank you again for taking the time and being on this show.

Thank you, Alex. I appreciate it.

This has been another episode of Learning Bayesian Statistics. Be sure to rate, review, and follow the show on your favorite podcatcher, and visit learnbayesstats.com for more resources about today's topics, as well as access to more episodes to help you reach true Bayesian state of mind. That's learnbayesstats.com. Our theme music is Good Bayesian by Baba Brinkman, featuring MC Lars and Mega Ran. Check out his awesome work at bababrinkman.com. I'm your host, Alex Andorra. You can follow me on Twitter at alex_andorra, like the country. You can support the show and unlock exclusive benefits by visiting patreon.com/learnbayesstats. Thank you so much for listening and for your support. You're truly a good Bayesian; change your predictions after taking information in. And if you're thinking I'll be less than amazing, let's adjust those expectations. Let me show you how to be a good Bayesian; change calculations after taking fresh data in. Those predictions that your brain is making, let's get them on a solid foundation.
