#103 Improving Sampling Algorithms & Prior Elicitation, with Arto Klami
Episode 103 • 5th April 2024 • Learning Bayesian Statistics • Alexandre Andorra
Duration: 01:14:38


Shownotes

Proudly sponsored by PyMC Labs, the Bayesian Consultancy. Book a call, or get in touch!

Changing perspective is often a great way to solve burning research problems. Riemannian spaces are such a perspective change, as Arto Klami, an Associate Professor of computer science at the University of Helsinki and member of the Finnish Center for Artificial Intelligence, will tell us in this episode.

He explains the concept of Riemannian spaces, their application in inference algorithms, how they can help with sampling Bayesian models, and their similarity with normalizing flows, which we discussed in episode 98.

Arto also introduces PreliZ, a tool for prior elicitation, and highlights its benefits in simplifying the process of setting priors, thus improving the accuracy of our models.

When Arto is not solving mathematical equations, you’ll find him cycling, or around a good board game.

Our theme music is « Good Bayesian », by Baba Brinkman (feat MC Lars and Mega Ran). Check out his awesome work at https://bababrinkman.com/ !

Thank you to my Patrons for making this episode possible!

Yusuke Saito, Avi Bryant, Ero Carrera, Giuliano Cruz, Tim Gasser, James Wade, Tradd Salvo, William Benton, James Ahloy, Robin Taylor,, Chad Scherrer, Zwelithini Tunyiswa, Bertrand Wilden, James Thompson, Stephen Oates, Gian Luca Di Tanna, Jack Wells, Matthew Maldonado, Ian Costley, Ally Salim, Larry Gill, Ian Moran, Paul Oreto, Colin Caprani, Colin Carroll, Nathaniel Burbank, Michael Osthege, Rémi Louf, Clive Edelsten, Henri Wallen, Hugo Botha, Vinh Nguyen, Marcin Elantkowski, Adam C. Smith, Will Kurt, Andrew Moskowitz, Hector Munoz, Marco Gorelli, Simon Kessell, Bradley Rode, Patrick Kelley, Rick Anderson, Casper de Bruin, Philippe Labonde, Michael Hankin, Cameron Smith, Tomáš Frýda, Ryan Wesslen, Andreas Netti, Riley King, Yoshiyuki Hamajima, Sven De Maeyer, Michael DeCrescenzo, Fergal M, Mason Yahr, Naoya Kanai, Steven Rowland, Aubrey Clayton, Jeannine Sue, Omri Har Shemesh, Scott Anthony Robson, Robert Yolken, Or Duek, Pavel Dusek, Paul Cox, Andreas Kröpelin, Raphaël R, Nicolas Rode, Gabriel Stechschulte, Arkady, Kurt TeKolste, Gergely Juhasz, Marcus Nölke, Maggi Mackintosh, Grant Pezzolesi, Avram Aelony, Joshua Meehl, Javier Sabio, Kristian Higgins, Alex Jones, Gregorio Aguilar, Matt Rosinski, Bart Trudeau, Luis Fonseca, Dante Gates, Matt Niccolls, Maksim Kuznecov, Michael Thomas, Luke Gorrie, Cory Kiser and Julio.

Visit https://www.patreon.com/learnbayesstats to unlock exclusive Bayesian swag ;)

Takeaways:

- Riemannian spaces offer a way to improve computational efficiency and accuracy in Bayesian inference by considering the curvature of the posterior distribution.

- Riemannian spaces can be used in Laplace approximation and Markov chain Monte Carlo algorithms to better model the posterior distribution and explore challenging areas of the parameter space.

- Normalizing flows are a complementary approach to Riemannian spaces, using non-linear transformations to warp the parameter space and improve sampling efficiency.

- Evaluating the performance of Bayesian inference algorithms in challenging cases is a current research challenge, and more work is needed to establish benchmarks and compare different methods. 

- PreliZ is a package for prior elicitation in Bayesian modeling that facilitates communication with users through visualizations of predictive and parameter distributions.

- Careful prior specification is important, and tools like PreliZ make the process easier and more reproducible.

- Teaching Bayesian machine learning is challenging due to the combination of statistical and programming concepts, but it is possible to teach the basic reasoning behind Bayesian methods to a diverse group of students.

- The integration of Bayesian approaches in data science workflows is becoming more accepted, especially in industries that already use deep learning techniques.

- The future of Bayesian methods in AI research may involve the development of AI assistants for Bayesian modeling and probabilistic reasoning.

Chapters:

00:00 Introduction and Background

02:05 Arto's Work and Background

06:05 Introduction to Bayesian Inference

12:46 Riemannian Spaces in Bayesian Inference

27:24 Availability of Riemannian-based Algorithms

30:20 Practical Applications and Evaluation

37:33 Introduction to PreliZ

38:03 Prior Elicitation

39:01 Predictive Elicitation Techniques

39:30 PreliZ: Interface with Users

40:27 PreliZ: General Purpose Tool

41:55 Getting Started with PreliZ

42:45 Challenges of Setting Priors

45:10 Reproducibility and Transparency in Priors

46:07 Integration of Bayesian Approaches in Data Science Workflows

55:11 Teaching Bayesian Machine Learning

01:06:13 The Future of Bayesian Methods with AI Research

01:10:16 Solving the Prior Elicitation Problem

Links from the show:

Transcript

This is an automatic transcript and may therefore contain errors. Please get in touch if you're willing to correct them.

Transcripts

Speaker:

Let me show you how to be a good b...

2

:

how they can help sampling Bayesian models

and their similarity with normalizing

3

:

flows that we discussed in episode 98.

4

:

Arto also introduces PreliZ, a tool for

prior elicitation, and highlights its

5

:

benefits in simplifying the process of

setting priors, thus improving the

6

:

accuracy of our models.

7

:

When Arto is not solving mathematical

equations, you'll find him cycling or

8

:

around a good board game.

9

:

This is Learning Bayesian Statistics,

episode 103.

10

:

recorded February 15, 2024.

11

:

Welcome to Learning Bayesian Statistics, a

podcast about Bayesian inference, the

12

:

methods, the projects, and the people who

make it possible.

13

:

I'm your host.

14

:

You can follow me on Twitter at

alex_andorra, like the country, for

15

:

any info about the show.

16

:

Learnbayesstats.com is the place to be.

17

:

Show notes, becoming a corporate sponsor,

unlocking Bayesian Merch, supporting the

18

:

show on Patreon.

19

:

Everything is in there.

20

:

That's Learnbayesstats.com.

21

:

If you're interested in one-on-one

mentorship, online courses, or statistical

22

:

consulting, feel free to reach out and

book a call at topmate.io slash alex

23

:

underscore andorra.

24

:

See you around, folks, and best

wishes to you all.

25

:

Arto Klami, welcome to Learning Bayesian

Statistics.

26

:

Thank you.

27

:

You're welcome.

28

:

How was my Finnish pronunciation?

29

:

Oh, I think that was excellent.

30

:

For people who don't have the video, I

don't think that was true.

31

:

So thanks a lot for taking the time,

Arto.

32

:

I'm really happy to have you on the show.

33

:

And I've had a lot of questions for you

for a long time, and the longer we

34

:

postpone the episode, the more questions.

35

:

So I'm gonna do my best to not take three

hours of your time.

36

:

And let's start by...

37

:

maybe defining the work you're doing

nowadays and well, how do you end up

38

:

working on this?

39

:

Yes, sure.

40

:

So I personally identify as a machine

learning researcher.

41

:

So I do machine learning research, but

very much from a Bayesian perspective.

42

:

So my original background is in computer

science.

43

:

I'm essentially a self-educated

statistician in the sense that I've never

44

:

really

45

:

kind of studied properly statistics

design, well except for a few courses here

46

:

and there.

47

:

But I've been building models, algorithms,

building on the Bayesian principles for

48

:

addressing various kinds of machine

learning problems.

49

:

So you're basically like a self-taught

statistician through learning, let's say.

50

:

More or less, yes.

51

:

I think the first things I started doing,

52

:

with anything that had to do with Bayesian

statistics was pretty much already going

53

:

to the deep end and trying to learn

posterior inference for fairly complicated

54

:

models, even actually non-parametric

models in some ways.

55

:

Yeah, we're going to dive a bit on that.

56

:

Before that, can you tell us the topics

you are particularly focusing on through

57

:

that?

58

:

umbrella of topics you've named.

59

:

Yes, absolutely.

60

:

So I think I actually have a few somewhat

distinct areas of interest.

61

:

So on one hand, I'm working really on the

kind of core inference problem.

62

:

So how do we computationally efficiently,

accurately enough approximate the

63

:

posterior distributions?

64

:

Recently, we've been especially working on

inference algorithms that build on

65

:

concepts from Riemannian geometry.

66

:

So we're trying to really kind of account

the actual manifold induced by this

67

:

posterior distribution and try to somehow

utilize these concepts to kind of speed up

68

:

inference.

69

:

So that's kind of one very technical

aspect.

70

:

Then there's the other main theme on the

kind of Bayesian side is on priors.

71

:

So we'll be working on prior elicitation.

72

:

So how do we actually go about specifying

the prior distributions?

73

:

and ideally maybe not even specifying.

74

:

So how would we extract that knowledge

from a domain expert who doesn't

75

:

necessarily even have any sort of

statistical training?

76

:

And how do we flexibly represent their

true beliefs and then encode them as part

77

:

of a model?

78

:

That's maybe the main kind of technical

aspects there.

79

:

Yeah.

80

:

Yeah.

81

:

No, super fun.

82

:

And we're definitely going to dive into

those two aspects a bit later in the show.

83

:

I'm really interested in that.

84

:

Before that, do you remember how you first

got introduced to Bayesian inference,

85

:

actually, and also why it sticks with you?

86

:

Yeah, like I said, I'm in some sense self

-trained.

87

:

I mean, coming with the computer science

background, we just, more or less,

88

:

sometime during my PhD,

89

:

I was working in a research group that was

led by Samuel Kaski.

90

:

When I joined the group, we were working

on neural networks of the kind that people

91

:

were interested in.

92

:

That was like 20 years ago.

93

:

So we were working on things like self

-organizing maps and these kind of

94

:

methods.

95

:

And then we started working on

applications where we really bumped into

96

:

the kind of small sample size problems.

97

:

So looking at...

98

:

DNA microarray data that was kind of tens

of thousands of dimensions and medical

99

:

applications with 20 samples.

100

:

So we essentially figured out that we're

gonna need to take the kind of uncertainty

101

:

into account properly.

102

:

Started working on the Bayesian modeling

side of these and one of the very first

103

:

things I was doing is kind of trying to

create Bayesian versions of some of these

104

:

classical analysis methods,

105

:

especially canonical correlation analysis.

106

:

The original derivation is like an

information-theoretic formulation.

107

:

So I kind of dive directly into this that

let's do Bayesian versions of models.

108

:

But I actually do remember that around the

same time I also took a course, a course

109

:

by Aki Vehtari.

110

:

He's an author of this Gelman et al.

111

:

book, one of the authors.

112

:

I think the first version of the book had

been released.

113

:

just before that.

114

:

So Aki was giving a course where he was

teaching based on that book.

115

:

And I think that's the kind of first real

official contact on trying to understand

116

:

the actual details behind the principles.

117

:

Yeah, and actually I'm pretty sure

listeners are familiar with Aki.

118

:

He's been on the show already, so I'll

link to the episode, of course, where Aki

119

:

was.

120

:

And yeah, for sure.

121

:

I also recommend going through these

episodes, show notes for people who are

122

:

interested in, well, starting learning

about Bayesian stats and things like that.

123

:

Something I'm wondering from what you just

explained is, so you define yourself as a

124

:

machine learning researcher, right?

125

:

And you work in artificial intelligence

too.

126

:

But there is this interaction with the

Bayesian framework.

127

:

How does that framework underpin your

research in statistical machine learning

128

:

and artificial intelligence?

129

:

How does that all combine?

130

:

Yeah.

131

:

Well, that's a broad topic.

132

:

There's of course a lot in that

intersection.

133

:

I personally do view all learning problems

in some sense from a Bayesian perspective.

134

:

I mean, no matter what kind of a, whether

it's a very simple fitting a linear

135

:

regression type of a problem or whether

it's figuring out the parameters of a

136

:

neural network with 1 billion parameters,

it's ultimately still a statistical

137

:

inference problem.

138

:

I mean, most of the cases, I'm quite

confident that we can't figure out the

139

:

parameters exactly.

140

:

We need to somehow quantify for the

uncertainty.

141

:

I'm not really aware of any other kind of

principled way of doing it.

142

:

So I would just kind of think about it

that we're always doing Bayesian inference

143

:

in some sense.

144

:

But then there's the issue of how far can

we go in practice?

145

:

So it's going to be approximate.

146

:

It's possibly going to be very crude

approximations.

147

:

But I would still view it through the lens

of Bayesian statistics in my own work.

148

:

And that's what I do when I teach for my

BSc students, for example.

149

:

I mean not all of them explicitly

formulate the learning algorithms kind of

150

:

from these perspectives but we are still

kind of talking about that what's the

151

:

relationship what can we assume about the

algorithms what can we assume about the

152

:

result and how would it relate to like

like properly estimating everything

153

:

through kind of exactly how it should be

done.

154

:

Yeah okay that's an interesting

perspective yeah so basically putting that

155

:

in a in that framework.

156

:

And that means, I mean, that makes me

think then, how does that, how do you

157

:

believe, what do you believe, sorry, the

impact of Bayesian machine learning is on

158

:

the broader field of AI?

159

:

What does that bring to that field?

160

:

It's a, let's say it has a big effect.

161

:

It has a very big impact in a sense that

pretty much most of the stuff that is

162

:

happening on the machine learning front

and hence also on the kind of all learning

163

:

based AI solutions.

164

:

It is ultimately, I think a lot of people

are thinking about roughly in the same way

165

:

as I am, that there is an underlying

learning problem that we would ideally

166

:

want to solve more or less following

exactly the Bayesian principles.

167

:

They don't necessarily talk about it from this

perspective.

168

:

So you might be happy to write algorithms,

all the justification on the choices you

169

:

make comes from somewhere else.

170

:

But I think a lot of people are kind of

accepting that it's the kind of

171

:

probabilistic basis of these.

172

:

So for instance, I think if you think

about the objectives that people are

173

:

optimizing in deep learning, they're all

essentially likelihoods of some

174

:

assume probabilistic model.

175

:

Most of the regularizers they are

considering do have an interpretation of

176

:

some kind of a prior distribution.

177

:

I think a lot of people are all the time

going deeper and deeper into actually

178

:

explicitly thinking about it from these

perspectives.

179

:

So we have a lot of these deep learning

type of approaches, variational autoencoders,

180

:

Bayesian neural networks, various kinds of

generative AI models that are

181

:

They are actually even explicitly

formulated as probabilistic models and

182

:

some sort of an approximate inference

scheme.

183

:

So I think the kind of these things are,

they are the same two sides of the same

184

:

coin.

185

:

People are kind of more and more thinking

about them from the same perspective.

186

:

Okay, yeah, that's super interesting.

187

:

Actually, let's start diving into these

topics from a more technical perspective.

188

:

So you've mentioned the

189

:

research and advances you are working on

regarding Riemannian spaces.

190

:

So I think it'd be super fun to talk about

that because we've never really talked

191

:

about it on the show.

192

:

So maybe can you give listeners a primer

on what a Riemannian space is?

193

:

Why would you even care about that?

194

:

And what you are doing in this regard,

what your research is in this regard.

195

:

Yes, let's try.

196

:

I mean, this is a bit of a mathematical

concept to talk about.

197

:

But I mean, ultimately, if you think about

most of the learning algorithms, so we are

198

:

kind of thinking that there are some

parameters that live in some space.

199

:

So we essentially, without thinking about

it, that we just assume that it's a

200

:

Euclidean space in a sense that we can

measure distances between two parameters,

201

:

that how similar they are.

202

:

It doesn't matter which direction we go,

if the distance is the same, we think that

203

:

they are kind of equally far away.

204

:

So now a Riemannian geometry is one that

is kind of curved in some sense.

205

:

So we may be stretching the space in

certain ways and we'll be doing this

206

:

stretching locally.

207

:

So what it actually means, for example, is

that the shortest path between two

208

:

possible

209

:

values, maybe for example two parameter

configurations, that if you start

210

:

interpolating between two possible values

for a parameter, it's going to be a

211

:

shortest path in this Riemannian geometry,

which is not necessarily a straight line

212

:

in an underlying Euclidean space.

213

:

So that's what the Riemannian geometry is

in general.

214

:

So it's kind of the tools and machinery we

need to work with these kind of settings.

215

:

And now then the relationship to

statistical inference comes from trying to

216

:

define such a Riemannian space that it has

somehow nice characteristics.

217

:

So maybe the concept that most of the

people actually might be aware of would be

218

:

the Fisher information matrix that kind of

characterizes the kind of the curvature

219

:

induced by a particular probabilistic

model.

220

:

So these tools kind of then allow, for

example, a very recent thing that we did,

221

:

it's going to come out later this spring

in AI stats, is an extension of the

222

:

Laplace approximation in a Riemannian

geometry.

223

:

So those of you who know what the Laplace

approximation is, it's essentially just

224

:

fitting a normal distribution at the mode

of a distribution.

225

:

But if we now fit the same normal

distribution in a suitably chosen

226

:

Riemannian space,

227

:

we can actually model also the kind of

curvature of the posterior mode and even

228

:

kind of how it stretches.

229

:

So we get a more flexible approximation.

230

:

We are still fitting a normal

distribution.

231

:

We're just doing it in a different space.

232

:

Not sure how easy that was to follow, but

at least maybe it gives some sort of an

233

:

idea.
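
[Editor's note: for readers who want something concrete, here is a minimal NumPy/SciPy sketch of the standard (Euclidean) Laplace approximation Arto refers to: find the mode, take the Hessian there, and fit a Gaussian. The Riemannian version discussed in the episode replaces this flat geometry with a metric such as the Fisher information, which is not shown here. The toy posterior below is invented purely for illustration.]

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_post(theta):
    # Toy, banana-shaped unnormalized negative log-posterior (illustrative only).
    x, y = theta
    return 0.5 * x**2 + (y - 0.5 * x**2) ** 2

# 1. Find the posterior mode.
mode = minimize(neg_log_post, x0=np.zeros(2)).x

# 2. Finite-difference Hessian at the mode.
def hessian(f, p, eps=1e-4):
    n = len(p)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            pij, pi, pj = p.copy(), p.copy(), p.copy()
            pij[i] += eps; pij[j] += eps
            pi[i] += eps; pj[j] += eps
            H[i, j] = (f(pij) - f(pi) - f(pj) + f(p)) / eps**2
    return H

# 3. Laplace approximation: posterior ≈ Normal(mode, inverse Hessian).
cov = np.linalg.inv(hessian(neg_log_post, mode))
print("mode:", mode)
print("covariance:", cov)
```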

234

:

Yeah, yeah, yeah.

235

:

That was actually, I think, a pretty

approachable.

236

:

introduction and so if I understood

correctly then you're gonna use these

237

:

Riemannian approximations to come up with

better algorithms is that what you do and

238

:

why you focus on Riemannian spaces and yeah

if you can if you can introduce that and

239

:

tell us basically why that is interesting

to then look

240

:

at geometry from these different ways

instead of the classical Euclidean way of

241

:

things geometry.

242

:

Yeah, I think that's exactly what it is

about.

243

:

So one other thing, maybe another

perspective of thinking about it is that

244

:

we've also been doing Markov chain Monte

Carlo algorithms, so MCMC in these

245

:

Riemannian spaces.

246

:

And what we can achieve with those is that

if you have, let's say, a posterior

247

:

distribution,

248

:

that has some sort of a narrow funnel,

some very narrow area that extends far

249

:

away in one corner of your parameter

space.

250

:

It's actually very difficult to get there

with something like standard Hamiltonian

251

:

Monte Carlo, but with the Riemannian

methods we can kind of make these narrow

252

:

funnels equally easy compared to the

flatter areas.
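
[Editor's note: the "narrow funnel" Arto describes is easy to reproduce. Below is a hedged PyMC sketch of such a posterior; standard HMC/NUTS typically reports divergences near the neck, which is exactly the regime the Riemannian samplers discussed here target. The model and settings are illustrative, not taken from Arto's papers.]

```python
import pymc as pm

with pm.Model() as funnel:
    # The scale of x depends exponentially on log_scale, creating a funnel.
    log_scale = pm.Normal("log_scale", 0.0, 3.0)
    x = pm.Normal("x", 0.0, pm.math.exp(log_scale), shape=9)
    idata = pm.sample(1000, tune=1000, target_accept=0.9)

# Divergences and poor exploration of small log_scale values are the usual
# symptoms that the sampler cannot reach the narrow part of the funnel.
```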

253

:

Now of course this may sound like a magic

bullet that we should be doing all

254

:

inference with these techniques.

255

:

Of course it does come with

256

:

certain computational challenges.

257

:

So we do need to be, like I said, the

shortest paths are no longer straight

258

:

lines.

259

:

So we need numerical integration to follow

the geodesic paths in these metrics and so

260

:

on.

261

:

So it's a bit of a compromise, of course.

262

:

So they have very nice theoretical

properties.

263

:

We've been able to get them working also

in practice in many cases so that they are

264

:

kind of comparable with the current state

of the art.

265

:

But it's not always easy.

266

:

Yeah, there is no free lunch.

267

:

Yes.

268

:

Yeah.

269

:

Yeah.

270

:

Do you have any resources about these?

271

:

Well, first the concepts of Riemannian

spaces and then the algorithms that you

272

:

folks derived in your group using these

Riemannian spaces for people who are

273

:

interested?

274

:

Yeah, I think I wouldn't know, let's say a

very particular

275

:

resource I would recommend on the Riemannian

geometry.

276

:

It is actually a rather, let's say,

mathematically involved topic.

277

:

But regarding the specific methods, I

think they are...

278

:

It's a couple of my recent papers, so we

have this Laplace approximation is coming

279

:

out in AI stats this year.

280

:

The MCMC sampler we had, I think, two

years ago in AI stats, similarly, the

281

:

first MCMC method building on these and

then...

282

:

last year one paper on transactions of

machine learning research.

283

:

I think they are more or less accessible.

284

:

Let's definitely link to those papers if

you can in the show notes because I'm

285

:

personally curious about it but also I

think listeners will be.

286

:

It sounds from what you're saying that

this idea of doing algorithms in this

287

:

Riemannian space is

288

:

somewhat recent.

289

:

Am I right?

290

:

And why would it appear now?

291

:

Why would it become interesting now?

292

:

Well, it's not actually that recent.

293

:

I think the basic principle goes back, I

don't know, maybe 20 years or so.

294

:

I think the main reason why we've been

working on this right now is that the

295

:

We've been able to resolve some of the

computational challenges.

296

:

So the fundamental problem with these

models is always this numeric integration

297

:

of following the shortest paths depending

on an algorithm we needed for different

298

:

reasons, but we always needed to do it,

which usually requires operations like

299

:

inversion of a metric tensor, which has

the kind of a dimensionality of the

300

:

parameter space.

301

:

So we came up with the particular metric.

302

:

that happens to have computationally

efficient inverse.

303

:

So there's kind of this kind of concrete

algorithmic techniques that are kind of

304

:

bringing the computational cost to the

level so that it's no longer notably more

305

:

expensive than doing kind of standard

Euclidean methods.

306

:

So we can, for example, scale them for

Bayesian neural networks.

307

:

That's one of the application cases we are

looking at.

308

:

We are really having very high

-dimensional problems but still able to do

309

:

some of these Riemannian techniques or

approximations of them.

310

:

That was going to be my next question.

311

:

In which cases are these approximations

interesting?

312

:

In which cases would you recommend

listeners to actually invest time to

313

:

actually use these techniques because they

have a better chance of working than the

314

:

classic Hamiltonian Monte Carlo samplers

that are the default in most probabilistic

315

:

languages?

316

:

Yeah, I think the easy answer is that when

the inference problem is hard.

317

:

So essentially one very practical way

would be that if you realize that you

318

:

can't really get a Hamiltonian Monte Carlo

to explore the space, the posterior

319

:

properly, that it may be difficult to find

out that this is happening.

320

:

Of course, if you're never visiting a

certain corner, you wouldn't actually

321

:

know.

322

:

But if you have some sort of a reason to

believe that you really are handling with

323

:

such a complex posterior that I'm kind of

willing to spend a bit more extra

324

:

computation to be careful so that I really

try to cover every corner there is.

325

:

Another example is that we realized on the

scope of these Bayesian neural networks

326

:

that there are certain kind of classical

327

:

Well, certain kind of scenarios where we

can show that if you do inference with the

328

:

two simple methods, so something in the

Euclidean metric with the standard

329

:

Langevin dynamics type of a thing, what

we actually see is that if you switch to

330

:

using better prior distributions in your

model, you don't actually see an advantage

331

:

of those unless you at the same time

switch to using an inference algorithm

332

:

that is kind of able to handle the extra

complexity.

333

:

So if you have for example like

334

:

heavy-tailed spike-and-slab type of priors

in the neural network.

335

:

You just kind of fail to get any benefit

from these better priors if you don't pay

336

:

a bit more attention into how you do the

inference.

337

:

Okay, super interesting.

338

:

And also, so that seems it's also quite

interesting to look at that when you have,

339

:

well, or when you suspect that you have

multi -modal posteriors.

340

:

Yes, well yeah, multimodal posteriors are

interesting.

341

:

I'm not, we haven't specifically studied

like this question that is there and we

342

:

have actually thought about some ideas of

creating metrics that would specifically

343

:

encourage exploring the different modes

but we haven't done that concretely so we

344

:

now still focusing on these kind of narrow

thin areas of posteriors and how can you

345

:

kind of reach those.

346

:

Okay.

347

:

And do you know of normalizing flows?

348

:

Sure, yes.

349

:

So yeah, we've had Marylou Gabrié on

the show recently.

350

:

It was episode 98.

351

:

And so she's working a lot on these

normalizing flows and the idea of

352

:

assisting MCMC sampling with these machine

learning methods.

353

:

And it's amazing.

354

:

It can sound somewhat similar to what you do

in your group.

355

:

And so for listeners, could you explain

the difference between the two ideas and

356

:

maybe also the use cases that both apply

to it?

357

:

Yeah, I think you're absolutely right.

358

:

So they are very closely related.

359

:

So there are, for example, the basic idea

of the neural transport that uses

360

:

normalizing flows for

361

:

essentially transforming the parameter

space in a suitable non -linear way and

362

:

then running standard Euclidean

Hamiltonian Monte Carlo.

363

:

It can actually be proven.

364

:

I think it is in the original paper as

well that I mean it is actually

365

:

mathematically equivalent to conducting

Riemannian inference in a suitable metric.

366

:

So I would say that it's like a

complementary approach of solving exactly

367

:

the same problem.

368

:

So you have a way of somehow in a flexible

way warping your parameter space.

369

:

You either do it through a metric or you

kind of do it as a pre -transformation.

370

:

So there's a lot of similarities.

371

:

It's also the computation in some sense

that if you think about mapping...

372

:

sample through a normalizing flow.

373

:

It's actually very close to what we do

with the Riemannian Laplace approximation

374

:

that you start kind of take a sample and

you start propagating it through some sort

375

:

of a transformation.

376

:

It's just whether it's defined through a

metric or as a flow.

377

:

So yes, so they are kind of very close.

378

:

So now the question is then that when

should I be using one of these?

379

:

I'm afraid I don't really have an answer.

380

:

that in a sense that I mean there's

computational properties on let's say for

381

:

example if you've worked with flows you do

need to pre -train them so you do need to

382

:

train some sort of a flow to be able to

use it in certain applications so it comes

383

:

with some pre -training cost.

384

:

Quite likely during when you're actually

using it it's going to be faster than

385

:

working in a Riemannian metric where you

need to invert some metric tensors and so

386

:

on.

387

:

So there's kind of like technical

differences.

388

:

Then I think the bigger question is of

course that if we go to really challenging

389

:

problems, for example, very high

dimensions, that which of these methods

390

:

actually work well there.

391

:

For that I don't quite now have an answer

in the sense that I would dare to say that

392

:

or even speculate that which of these

things I might miss some kind of obvious

393

:

limitations of one of the approaches if

trying to kind of extrapolate too far.

394

:

from what we've actually tried in

practice.

395

:

Yeah, that's what I was going to say.

396

:

It's also that these methods are really at

the frontier of the science.

397

:

So I guess we're lacking, we're lacking

for now the practical cases, right?

398

:

And probably in a few years we'll have

more ideas of these and when one is more

399

:

appropriate than another.

400

:

But for now, I guess we have to try.

401

:

those algorithms and see what we get back.

402

:

And so actually, what if people want to

try these Riemannian-based algorithms?

403

:

Do you have already packages that we can

link to that people can try and plug their

404

:

own model into?

405

:

Yes and no.

406

:

So we have released open source code with

each of the research papers.

407

:

So there is a reference implementation

that

408

:

can be used.

409

:

We have internally been integrating these,

kind of working a bit towards integrating

410

:

the kind of proper open ecosystems that

would allow, make like for example model

411

:

specification easy.

412

:

It's not quite there yet.

413

:

So there's one particular challenge is

that many of the environments don't

414

:

actually have all the support

functionality you need for the Riemannian

415

:

methods.

416

:

They're essentially simplifying some of

the things that directly encoding these

417

:

assumptions that the shortest path is an

interpolation or it's a line.

418

:

So you need a bit of an extra machinery

for the most established libraries.

419

:

There are some libraries, I believe, that

are actually making it fairly easy to do

420

:

kind of plug and play Riemannian metrics.

421

:

I don't remember the names right now, but

that's where we've kind of been.

422

:

planning on putting in the algorithms, but

they're not really there yet.

423

:

Hmm, OK, I see.

424

:

Yeah, definitely that would be, I guess,

super, super interesting.

425

:

If by the time of release, you see

something that people could try,

426

:

definitely we'll link to that, because I

think listeners will be curious.

427

:

And I'm definitely super curious to try

that.

428

:

Any new stuff like that, or you'd like to?

429

:

try and see what you can do with it.

430

:

It's always super interesting.

431

:

And I've already seen some very

interesting experiments done with

432

:

normalizing flows, especially bayeux by

Colin Carroll and other people.

433

:

Colin Carroll is one of the PyMC

developers also.

434

:

And yeah, now you can use bayeux to take

any

435

:

a JAX-ifiable model and you plug that into

it and you can use the flowMC algorithm

436

:

to sample your JAX-ifiable PyMC model.

437

:

So that's really super cool.

438

:

And I'm really looking forward to more

experiments like that to see, well, okay,

439

:

what can we do with those algorithms?

440

:

Where can we push them to what extent, to

what degree, where do they fall down?

441

:

That's really super interesting, at least

for me, because I'm not a mathematician.

442

:

So when I see that, I find that super,

like, I love the idea of, basically the

443

:

idea is somewhat simple.

444

:

It's like, okay, we have that problem when

we think about geometry that way, because

445

:

then the geometry becomes a funnel, for

instance, as you were saying.

446

:

And then sampling at the bottom of the

funnel is just super hard in the way we do

447

:

it right now, because just super small

distances.

448

:

What if we change the definition of

distance?

449

:

What if we change the definition of

geometry, basically, which is this idea

450

:

of, OK, let's switch to Riemannian space.

451

:

And the way we do that, then, well, the

funnel disappears, and it just becomes

452

:

something easier.

453

:

It's just like going beyond the idea of

the centered versus non -centered

454

:

parameterization, for instance, when you

do that in model, right?

455

:

But it's going big with that because it's

more general.

456

:

So I love that idea.

457

:

I understand it, but I cannot really read

the math and be like, oh, OK, I see what

458

:

that means.

459

:

So I have to see the model and see what I

can do and where I can push it.

460

:

And then I get a better understanding of

what that entails.

461

:

Yeah, I think you gave a much better

summary of what it is doing than I did.

462

:

So good for that.

463

:

I mean, you are actually touching that, of

course.

464

:

So there's the one point is making the

algorithms.

465

:

available so that everyone could try them

out.

466

:

But then there's also the other aspect

that we need to worry about, which is the

467

:

proper evaluation of what they're doing.

468

:

I mean, of course, most of the papers when

you release a new algorithm, you need to

469

:

emphasize things like, in our case,

computational efficiency.

470

:

And you do demonstrate that it, maybe for

example, being quite explicitly showing

471

:

that these very strong funnels, it does

work better with those.

472

:

But now then the question is of course

that how reliable these things are if used

473

:

in a black box manner in a so that someone

just runs them on their favorite model.

474

:

And one of the challenges we realized is

that it's actually very hard to evaluate

475

:

how well an algorithm is working in an

extremely difficult case.

476

:

Because there is no baseline.

477

:

I mean, in some of the cases we've been

comparing that let's try to do...

478

:

standard Hamiltonian MCMC or NUTS as

carefully as we can.

479

:

And they kind of think that this is the

ground truth, this is the true posterior.

480

:

But we don't really know whether that's

the case.

481

:

So if it's hard enough case, our kind of

supposed ground truth is failing as well.

482

:

And it's very hard to kind of then we

might be able to see that our solution

483

:

differs from that.

484

:

But then we would need to kind of

separately go and investigate that which

485

:

one was wrong.

486

:

And that is a practical challenge,

especially if you would like to have a

487

:

broad set of models.

488

:

And we would want to show somehow

transparently for the kind of end users

489

:

that in these and these kind of problems,

this and that particular method, whether

490

:

it's one of ours or something else, any

other new fancy.

491

:

When do they work when they don't?

492

:

Without relying that we really have some

particular method that they already trust

493

:

and we kind of, if it's just compared to

it, we can't kind of really convince

494

:

others that is it correct when it is

differing from what we kind of used to

495

:

rely on.

496

:

Yeah, that's definitely a problem.

497

:

That's also a question I asked Marylou.

498

:

when she was on the show and then that was

kind of the same answer if I remember

499

:

correctly that for now it's kind of hard

to do benchmarks in a way, which is

500

:

definitely an issue if you're trying to

work on that from a scientific perspective

501

:

as well.

502

:

If we were astrologists, that'd be great,

like then we'd be good.

503

:

But if you're a scientist, then you want

to evaluate your methods and...

504

:

And finding a method to evaluate the

method is almost as valuable as finding

505

:

the method in the first place.

506

:

And where do you think we are on that

regarding in your field?

507

:

Is that an active branch of the research

to try and evaluate these algorithms?

508

:

How would that even look like?

509

:

Or are we still really, really at a very

early time for that work?

510

:

That's a...

511

:

Very good question.

512

:

So I'm not aware of a lot of people that

would kind of specifically focus on

513

:

evaluation.

514

:

So for example, Aki has of course been

working a lot on that, trying to kind of

515

:

create diagnostics and so on.

516

:

But then if we think about more on the

flexible machine learning side, I think my

517

:

hunch is that it's the individual research

groups are kind of all circling around the

518

:

same problems that they are kind of trying

to figure out that, okay,

519

:

Every now and then someone invents a fancy

way of evaluating something.

520

:

It introduces a particular type of

synthetic scenario where I think that the

521

:

most common one is that what people do

is that you create problems where you

522

:

actually have an analytic posterior and

it's somehow like an artificial problem

523

:

that you take a problem and you transform

it in a given way and then you assume that

524

:

you didn't have the analytic one.

525

:

But they are all, I mean, they feel a bit

artificial.

526

:

They feel a bit synthetic.

527

:

So let's see.

528

:

It would maybe be something that the

community should kind of be talking a bit

529

:

more about on a workshop or something

that, OK, let's try to really think about

530

:

how to verify the robustness or possibly

identify that these things are not really

531

:

ready or reliable for practical use in

very serious applications yet.

532

:

Yeah.

533

:

I haven't been following very closely

what's happening, so I may be missing some

534

:

important works that are already out

there.

535

:

Okay, yeah.

536

:

Well, Aki, if you're listening, send us a

message if we forgot something.

537

:

And second, that sounds like there are

some interesting PhDs to do on the issue,

538

:

if that's still a very new branch of the

research.

539

:

So, people?

540

:

If you're interested in that, maybe

contact Arto and we'll see.

541

:

Maybe in a few months or years, you can

come here on the show and answer the

542

:

question I just asked.

543

:

Another aspect of your work I really want

to talk about also that I really love and

544

:

now listeners can relax because that's

going to be, I think, less abstract and

545

:

closer to their user experience.

546

:

is about priors.

547

:

You talked about it a bit at the

beginning, especially you are working and

548

:

you worked a lot on a package called

PreliZ that I really love.

549

:

One of my friends and fellow PyMC

developers, Osvaldo Martin, is also

550

:

collaborating on that.

551

:

And you guys have done a tremendous job on

that.

552

:

So yeah, can you give people a primer

about PreliZ?

553

:

What is it?

554

:

When could they use it and what's its

purpose in general?

555

:

Maybe I need to start by saying that I

haven't worked a lot on PreliZ.

556

:

Osvaldo has and a couple of others, so

I've been kind of just hovering around and

557

:

giving a bit of feedback.

558

:

But yeah, so I'll maybe start a bit

further away, so not directly from

559

:

PreliZ, but the whole question of prior

elicitation.

560

:

So I think the...

561

:

Yeah.

562

:

What we've been working with that is the

prior elicitation is simply an, I would

563

:

frame it as that it's some sort of

usually iterative approach of

564

:

communicating with the domain expert where

the goal is to estimate what's their

565

:

actual subjective prior knowledge is on

whatever parameters the model has and

566

:

doing it so that it's like cognitively

easy for the expert.

567

:

So many of the algorithms that we've been

working on this are based on this idea of

568

:

predictive elicitation.

569

:

So if you have a model where the

parameters don't actually have a very

570

:

concrete, easily understandable meaning,

you can't actually start asking questions

571

:

from the expert about the parameters.

572

:

It would require them to understand fully

the model itself.

573

:

The predictive elicitation techniques kind

of ask

574

:

communicate with the expert usually in the

space of the observable quantities.

575

:

So they're trying to ask: is this

somehow a more likely realization than this

576

:

other one.

577

:

And now this is where PreliZ comes

into play.

578

:

So when we are communicating with the

user, so most of the times the information

579

:

we show for the user is some sort of

visualizations.

580

:

of predictive distributions or possibly

also about the parameter distributions

581

:

themselves.

582

:

So we need an easy way of communicating

whether it's histograms of predicted

583

:

values and whatnot.

584

:

So how do we show those for a user in

scenarios where the model itself is some

585

:

sort of a probabilistic program so we

can't kind of fixate to a given model

586

:

family.

587

:

That's actually what's the main role of

PreliZ is essentially making it easy to

588

:

interface with the user.

589

:

Of course, PreliZ also then includes

these algorithms themselves.

590

:

So, algorithms for estimating the prior

and the kind of interface components for

591

:

the expert to give information.

592

:

So, make a selection, use a slider that I

would want my distribution to be a bit

593

:

more skewed towards the right and so on.

594

:

That's what we are aiming at.

595

:

A general purpose tool that would be used,

it's essentially kind of a platform for

596

:

developing and kind of bringing into use

all kinds of prior elicitation techniques.

597

:

So it's not tied to any given algorithm or

anything but you just have the components

598

:

and could then easily kind of commit,

let's say, a new type of prior elicitation

599

:

algorithm into the library.
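
[Editor's note: a minimal sketch of the kind of interaction PreliZ enables, assuming its maxent helper (check the PreliZ documentation for the exact signature): ask for the maximum-entropy Gamma with roughly 90% of its mass between 2 and 10, then inspect or tweak the result.]

```python
import preliz as pz

dist = pz.Gamma()
# Find the maximum-entropy parameters putting ~90% of the mass in [2, 10];
# the distribution object is updated in place and plotted.
pz.maxent(dist, lower=2, upper=10, mass=0.9)
print(dist)  # inspect the fitted parameters
```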

600

:

Yeah, and I really encourage

601

:

folks to go take a look at the PreliZ

package.

602

:

I put the link in the show notes because,

yeah, as you were saying, that's a really

603

:

easier way to specify your priors and also

elicit them if you need the intervention

604

:

of non-statisticians in your model, which

you often do if the model is complex

605

:

enough.

606

:

So yeah, like...

607

:

I'm using it myself quite a lot.

608

:

So thanks a lot guys for this work.

609

:

So Arto, as you were saying, Osvaldo

Martín is one of the main contributors,

610

:

Oriol Abril-Pla also, and Alejandro

Icazatti, if I remember correctly.

611

:

So at least these four people are the main

contributors.

612

:

And yeah, so I definitely encourage people

to go there.

613

:

What would you say, Arto, are the...

614

:

like the Pareto effect, what would it be

if people want to get started with

615

:

PreliZ?

616

:

Like the 20% of uses that will give you

80% of the benefits of PreliZ for

617

:

someone who doesn't know anything about it.

618

:

That's a very good question.

619

:

I think the most important thing actually

is to realize that we need to be careful

620

:

when we set the priors.

621

:

So simply being aware that you need a tool

for this.

622

:

You need a tool that makes it easy to do

something like a prior predictive check.

623

:

You need a tool that relieves you from

figuring out how do I inspect.

624

:

my priors or the effects it has on the

model.
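
[Editor's note: the "prior predictive check" step mentioned here can look like the following hedged PyMC sketch; the toy regression model and numbers are illustrative only. The point is to simulate data from the priors alone and ask whether it is plausible on the scale of the real data.]

```python
import numpy as np
import pymc as pm

x = np.linspace(0, 1, 50)
with pm.Model() as model:
    intercept = pm.Normal("intercept", 0, 1)
    slope = pm.Normal("slope", 0, 1)
    sigma = pm.HalfNormal("sigma", 1)
    pm.Normal("y", intercept + slope * x, sigma, observed=np.zeros_like(x))
    prior = pm.sample_prior_predictive(random_seed=1)

# Do the simulated outcomes look plausible before seeing any data?
print(prior.prior_predictive["y"].quantile([0.05, 0.5, 0.95]))
```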

625

:

That's actually where the real benefit is.

626

:

You get most of the...

627

:

when you kind of try to bring it as part

of your Bayesian workflow in a kind of a

628

:

concrete step that you identify that I

need to do this.

629

:

Then the kind of the remaining tail of

this thing is then of course that the...

630

:

maybe in some cases you have such a

complicated model that you really need to

631

:

deep dive and start...

632

:

running algorithms that help you eliciting

the priors.

633

:

And I would actually even say that the

elicitation algorithms, I do perceive them

634

:

useful even when the person is actually a

statistician.

635

:

I mean, there's a lot of models that we

may think that we know how to set the

636

:

priors.

637

:

But what we are actually doing is

following some very vague ideas on what's

638

:

the effect.

639

:

And we may also make

640

:

severe mistakes or spend a lot of time in

doing it.

641

:

So to an extent these elicitation

interfaces, I believe that ultimately they

642

:

will be helping even kind of hardcore

statisticians in just kind of doing it

643

:

faster, doing it slightly better, doing it

perhaps in a better documented

644

:

manner.

645

:

So you could for example kind of store all

the interaction the modeler had.

646

:

with these things and kind of put that

aside that this is where we got the prior

647

:

from instead of just trial and error and

then we just see at the end the result.

648

:

So you could kind of revisit the choices

you made during an elicitation process

649

:

that I discarded these predictive

distributions for some reason and then you

650

:

can later kind of, okay I made a mistake

there maybe I go and change my answer in

651

:

that part and then an algorithm provides

you an updated prior.

652

:

without you needing to actually go through

the whole prior specification process

653

:

again.

654

:

Yeah.

655

:

Yeah.

656

:

Yeah, I really love that.

657

:

And that makes the process of setting

priors more reproducible, more transparent

658

:

in a way.

659

:

That makes me think a bit of the scikit

-learn pipelines that you use to transform

660

:

the data.

661

:

For instance, you just set up the pipeline

and you say, I want to standardize my

662

:

data, for instance.

663

:

And then you have that pipeline ready.

664

:

And when you do the out-of-sample

predictions, you can use the pipeline and

665

:

say, okay, now like do that same

transformation on these new data so that

666

:

we're sure that it's done the right way,

but it's still transparent and people know

667

:

what's going on here.
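
[Editor's note: the scikit-learn analogy Alex is drawing, as a minimal sketch with made-up data: the fitted pipeline stores the standardization step, so new data automatically goes through the exact same transformation.]

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)

# Standardization and the model are fitted together, as one documented object.
pipe = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)

# Out-of-sample predictions reuse exactly the same transformation.
pipe.predict(rng.normal(size=(10, 3)))
```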

668

:

It's a bit the same thing, but with the

priors.

669

:

And I really love that because that makes

it also easier for people to think about

670

:

the priors and to actually choose the

priors.

671

:

Because.

672

:

What I've seen in teaching is that

especially for beginners, even more when

673

:

they come from the frequentist framework,

setting the priors can be just like

674

:

paralyzing.

675

:

It's like a paradox of choice.

676

:

It's way too many, way too many choices.

677

:

And then they end up not choosing anything

because they are too afraid to choose the

678

:

wrong prior.

679

:

Yes, I fully agree with that.

680

:

I mean, there's a lot of very simple

models.

681

:

that already start having six, seven,

eight different univariate priors there.

682

:

And then I've been working with these

things for a long time and I still very

683

:

easily make stupid mistakes that I'm

thinking that I increase the variance of

684

:

this particular prior here, thinking that

what I'm achieving is, for example, higher

685

:

predictive variance as well.

686

:

And then I realized that, no, that's not

the case.

687

:

It's actually...

688

:

Later in the model, it plays some sort of

a role and it actually has the opposite

689

:

effect.

690

:

It's hard.

691

:

Yeah.

692

:

Yeah.

693

:

That stuff is really hard and same here.

694

:

When I discovered that, I'm extremely

frustrated because I'm like, I always did

695

:

hours on these, whereas if I had a more

reproducible pipeline, that would just have

696

:

been handled automatically for me.

697

:

So...

698

:

Yeah, for sure.

699

:

We're not there yet in the workflow, but

that definitely makes it way easier.

700

:

So yeah, I absolutely agree that we are

not there yet.

701

:

I mean, PreliZ is a very well

-defined tool that allows us to start

702

:

working on it.

703

:

But I mean, then the actual concrete

algorithms that would make it easy to

704

:

let's say for example, avoid these kind of

stupid mistakes and be able to kind of

705

:

really reduce the effort.

706

:

So if it now takes two weeks for a PhD

student trying to think about and fiddle

707

:

with the prior, so can we get to one day?

708

:

Can we get it to one hour?

709

:

Can we get it to two minutes of a quick

interaction?

710

:

And probably not two minutes, but if we

can get it to one hour and it...

711

:

It will require lots of things.

712

:

It will require even better of this kind

of tooling.

713

:

So how do we visualize, how do we play

around with it?

714

:

But I think it's going to require quite a

bit better algorithms on how do you, from

715

:

kind of maximally limited interaction, how

do you estimate.

716

:

what the prior is and how you design the

kind of optimal questions you should be

717

:

asking from the expert.

718

:

There's no point in kind of reiterating

the same things just to fine -tune a bit

719

:

one of the variances of the priors if

there is a massive mistake still somewhere

720

:

in the prior and a single question would

be able to rule out half of the possible

721

:

scenarios.

722

:

It's going to be an interesting...

723

:

let's say, rise research direction, I

would say, for the next 5, 10 years.

724

:

Yeah, for sure.

725

:

And very valuable also because very

practical.

726

:

So for sure, again, a great PhD

opportunity, folks.

727

:

Yeah, yeah.

728

:

Also, I mean, that may be hard to find

those algorithms that you were talking

729

:

about because it is hard, right?

730

:

I know I worked on the...

731

:

find_constrained_prior function that we

have in PyMC now.

732

:

And it's just like, it seemed like a very

simple case.

733

:

It's not even doing all the fancy stuff

that PreliZ is doing.

734

:

It's mainly just optimizing distribution

so that it fits the constraints that you

735

:

are giving it.

736

:

Like for instance, I want a gamma with 95

% of the mass between 2 and 6.

737

:

Give me the...

738

:

parameters that fit that constraint.
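
[Editor's note: this is the PyMC helper Alex is referring to; a minimal sketch of the "Gamma with 95% of the mass between 2 and 6" example, assuming the current find_constrained_prior signature. The init_guess values are arbitrary starting points for the optimizer.]

```python
import pymc as pm

params = pm.find_constrained_prior(
    pm.Gamma,
    lower=2,
    upper=6,
    mass=0.95,
    init_guess={"alpha": 2, "beta": 1},
)
print(params)  # optimized {"alpha": ..., "beta": ...} satisfying the constraint
```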

739

:

That's actually surprisingly hard

mathematically.

740

:

You have a lot of choices to make, you

have a lot of things to really be careful

741

:

about.

742

:

And so I'm guessing that's also one of the

hurdles right now in that research.

743

:

Yeah, it absolutely is.

744

:

I mean, I would say at least I'm

approaching this.

745

:

more or less from an optimization

perspective then that I mean, yes, we are

746

:

trying to find a prior that best satisfies

whatever constraints we have and trying to

747

:

formulate an optimization problem of some

kind that gets us there.

748

:

This is also where I think there's a lot

of room for the, let's say flexible

749

:

machine learning tools type of things.

750

:

So, I mean, if you think about the prior

that satisfies these constraints, we could

751

:

be specifying it with some sort of a

flexible

752

:

not a particular parametric prior but some

sort of a flexible representation and then

753

:

just kind of optimizing for within a much

broader set of this.

754

:

But then of course it requires completely

different kinds of tools that we are used

755

:

to working on.

756

:

It also requires people accepting that our

priors may take arbitrary shapes.

757

:

They may be distributions that we could

have never specified directly.

758

:

Maybe they're multimodal.

759

:

priors that we kind of just infer that

this is what you couldn't really and

760

:

there's going to be also a lot of kind of

educational perspective on getting people

761

:

to accept this.

762

:

But even if I had to give you a perfect

algorithm that somehow cranks out a prior

763

:

and then you look at the prior and you're

saying that I don't even know what

764

:

distribution this is, I would have never

ever converged into this if I was manually

765

:

doing this.

766

:

So will you accept?

767

:

that that's your prior or will you insist

that your method is doing something

768

:

stupid?

769

:

I mean, I still want to use my Gaussian

prior here.

770

:

Yeah, that's a good point.

771

:

And in a way that's kind of related to a

classic problem that you have when you're

772

:

trying to automate a process.

773

:

I think there's the same issue with the

automated cars, like those self -driving

774

:

cars, where people actually trust the cars

more if they think they have

775

:

some control over it.

776

:

I've seen interesting experiments where

they put a placebo button in the car that

777

:

people could push on to override if they

wanted to, but the button wasn't doing

778

:

anything.

779

:

People are saying they were more

trustworthy of these cars than the

780

:

completely self -driving cars.

781

:

That's also definitely something to take

into account, but that's more related to

782

:

the human psychology than to the

algorithms per se.

783

:

related to human psychology but it's also

related to this evaluation perspective.

784

:

I mean of course if we did have a very

robust evaluation pattern that somehow

785

:

tells that once you start using these

techniques your final conclusions in some

786

:

sense will be better and if we can make

that kind of a very convincing then it

787

:

will be easier.

788

:

I mean if you think about, I mean there's

a lot of people that would say that

789

:

very massive neural network with four

billion parameters.

790

:

It would never ever be able to answer a

question given in a natural language.

791

:

A lot of people were saying that five

years ago that this is a pipe dream, it's

792

:

never gonna happen.

793

:

Now we do have it and now everyone is

ready to accept that yes, it can be done.

794

:

And they are willing to actually trust

these ChatGPT-type of models in a

795

:

lot of things.

796

:

And they are investing a lot of effort

into figuring out what to do with this.

797

:

It just needs this kind of very concrete

demonstration that there is value and that

798

:

it works well enough.

799

:

It will still take time for people to

really accept it, but I mean, I think

800

:

that's kind of the key ingredient.

Yeah, yeah. I mean, it's also good in some way. That skepticism makes the tools better, so that's good. We could keep talking about PreliZ, because I have other technical questions about it. But actually, that's a perfect segue to a question I also had for you, because you have a lot of experience in that field. How do you think industries can better integrate Bayesian approaches into their data science workflows? Because that's basically what we ended up talking about right now, without me nudging you towards it.

Yeah, I have actually been thinking about that quite a bit. I do a lot of collaboration with industrial partners in different domains, and I think there are a couple of perspectives on this. One is that people are finally starting to accept the fact that probabilistic programming with black-box automated inference is the only sensible way of doing statistical modeling. Looking back 10-15 years ago, you would still have a lot of people, maybe not in industry but in research in different disciplines, in meteorology or physics or whatever, who would actually be writing Metropolis-Hastings algorithms from scratch, which is simply not reliable in any sense. It took time for them to accept that yes, we can actually do it with something like Stan. So I think this is the way, to the extent that there are problems that fit well with what something like Stan or PyMC offers. And I think we've been educating master's students who are familiar with these concepts for long enough now. Once they go to industry they will use these tools; they know roughly how to use them. So that's one side.

But the other thing is that, especially in many of these predictive industries, whether it's marketing or recommendation or sales or whatever, people are already doing a lot of deep learning types of models. That's a routine tool in what they do. And at least in my opinion, these fields are getting closer to each other. We have more and more deep learning techniques that are ultimately Bayesian models in themselves; the variational autoencoder is a prime example. So it may be that Bayesian thinking and reasoning creeps in and gets into use through the next generation of the deep learning techniques they are already using. They've been building those models, they've been figuring out that they cannot get reliable estimates of uncertainty, they maybe tried some ensembles or whatnot. And they will follow. Once the tools are out there and there are good enough tutorials on how to use them, they might start using things like, let's say, Bayesian neural networks, or whatever the latest tool is at that point. And I think this may be the easiest way for industries to do so. They're not going to switch back to very simple classical linear models when they do their analysis. But they are going to make their deep learning solutions Bayesian on some time scale. Maybe not tomorrow, but maybe in five years.

Yeah, that's a very good point. I love that. And of course, I'm very happy about that, being one of the actors making the industry more Bayesian, so I have a vested interest in this. But I've also seen the same evolution you were talking about. Right now, it's not even really an issue of convincing people to use these kinds of tools. I mean, still from time to time, but less and less. Now the question is really more about making those tools more accessible, more versatile, easier to use, more reliable, easier to deploy in industry, things like that, which is a really good point to be at, for sure.

And to some extent, it's an interesting question also from the perspective of the tools. It may mean that we just end up doing a lot of Bayesian analysis on top of what we would now call deep learning frameworks, with libraries building on top of those. So, for example, NumPyro is a library building on JAX, but the syntax is intentionally similar to what people are used to in deep learning types of modeling. And this is perfectly fine. We are anyway using a lot of stochastic optimization routines in Bayesian inference and so on, so these frameworks are actually very good tools for building all kinds of Bayesian models. And I think this may be the layer where the industry use happens. They need the GPU type of scaling and everything anyway, so we're just happy to have our systems work on top of these libraries.
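As a concrete illustration of that framework-level style, here is a minimal NumPyro model, a toy linear regression of my own choosing rather than anything discussed in the episode. The model reads much like code written for a deep learning framework, and because NumPyro runs on JAX the same code works on CPU or GPU.

```python
# Minimal NumPyro example: a toy Bayesian linear regression sampled with NUTS.
import jax.numpy as jnp
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def model(x, y=None):
    alpha = numpyro.sample("alpha", dist.Normal(0.0, 1.0))
    beta = numpyro.sample("beta", dist.Normal(0.0, 1.0))
    sigma = numpyro.sample("sigma", dist.HalfNormal(1.0))
    numpyro.sample("y", dist.Normal(alpha + beta * x, sigma), obs=y)

# Toy data, purely for illustration
x = jnp.linspace(-2.0, 2.0, 50)
y = 1.0 + 0.5 * x + 0.3 * random.normal(random.PRNGKey(0), x.shape)

mcmc = MCMC(NUTS(model), num_warmup=500, num_samples=1000)
mcmc.run(random.PRNGKey(1), x, y=y)
mcmc.print_summary()
```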

Yeah, very good point. And to come back to one of the points you made in passing, education is helping a lot with that. You have been educating the data scientists who now go into industry. And I know that in Finland — much less so in France, where I'm originally from — there is this really great integration between the research side, the university, and the industry. You can really see that in the PhD positions, in the professorship positions, and so on. So I think that's really interesting, and that's why I wanted to talk to you about it. To go back to the education part, what challenges and opportunities do you see in teaching Bayesian machine learning, as you do, at the university level?

Yeah, it's challenging, I must say. Especially if we get to the point of, well, Bayesian machine learning: it is a combination of two topics that are each somewhat difficult in themselves. If we want to talk about normalizing flows, and then about statistical properties of estimators or MCMC convergence, they require different kinds of mathematical tools, and they require a certain level of expertise on the software and programming side. What that means is that if we look at the population of, let's say, data science students, we will always have a lot of people who are missing background on one of these sides. So I think this is a difficult topic to teach. If it was a small class, it would be fine, but it appears that at least our students are really excited about these things. I can launch a course with explicitly the title Bayesian machine learning, which is an advanced-level machine learning course, and I would still get 60 to 100 students enrolling. And that means that within that group there are going to be some CS students with almost no background in statistics, and there are going to be some statisticians who certainly know how to program but are not really used to thinking about GPU acceleration of a very large model.

But it's interesting; it's not an impossible thing. I think it is also a topic that you can teach at a sufficient level for everyone, so that everyone is able to understand the basic reasoning of why we are doing these things. Some of the students may struggle with figuring out all the math behind it, but they might still be able to use these tools very nicely. They might be able to say that if I do this and that kind of modification, I realize that my estimates are better calibrated. And some others will then go deeper into figuring out why these things work. So it just needs a bit of creativity in how we do it and what we expect from the students — what should they know once they've completed a course like this?

Yeah, that makes sense. Have you also seen an increase in the number of students in recent years?

Well, we get as many students as we can take. It's actually been the case for quite a while already that in our university, by far the most popular master's and bachelor's programs are essentially data science and computer science. So we can't take in everyone we would want. It actually looks to us like a more or less stable number of students, but it's always been a large number since we launched, for example, the data science program. It went up very fast. So there's definitely interest.

Yeah. Yeah. That's fantastic. And... I've been taking a lot of your time, so we're going to start to close up the show, but there are at least two questions I want to get your insight on. The first one is: what do you think the biggest hurdle in the Bayesian workflow currently is? We've talked about that a bit already, but I'd like to get your structured answer.

Well, I think the first thing is getting people to actually start using more or less systematic workflows. The idea is great — we know more or less how we should be thinking about it — but it's a very complex object. We can tell experts, statisticians, that this is roughly how you should do it, and then we still have to convince them, almost force them, to stick to it. But especially if we think about newcomers, people who are just starting with these things, it's a very complicated thing. If you would need to read a 50-page or 100-page book about the Bayesian workflow to even know how to do it, that's a technical challenge. So I think in the long term we are going to get tools for assisting it, really streamlining the process. I'm thinking of something like an AI assistant for a person building a model, one that really nudges you: now I see that you are trying to go there and do this, but I see that you haven't done prior predictive checks. I actually already created some plots for you. Please take a look at these and confirm: is this what you were expecting?
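For readers who haven't run one, a prior predictive check is already straightforward with today's tools. Here is a minimal PyMC/ArviZ sketch; the model, data, and numbers are placeholders I made up for illustration, not anything from the episode.

```python
# Minimal prior predictive check: sample fake data implied by the priors alone
# and eyeball whether it covers values that are plausible for the problem.
import numpy as np
import pymc as pm
import arviz as az

y_obs = np.random.default_rng(0).normal(5.0, 2.0, size=100)  # placeholder data

with pm.Model():
    mu = pm.Normal("mu", 0.0, 10.0)
    sigma = pm.HalfNormal("sigma", 5.0)
    pm.Normal("y", mu, sigma, observed=y_obs)
    prior_idata = pm.sample_prior_predictive(500, random_seed=0)

prior_y = prior_idata.prior_predictive["y"]
print(float(prior_y.min()), float(prior_y.max()))  # do these ranges look sensible?
az.plot_dist(prior_y.values.flatten())             # visual check of the implied data
```

An assistant like the one described above would essentially automate this loop: generate the draws, show the plot, and ask the modeler to confirm or revise the priors.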

And it's going to take a lot of effort to create those tools. It's something that we've been trying to think about, how to do it, but it's still open. I think that's where the challenge is. We know most of the stuff within the workflow, roughly how it should be done; at least we have good enough solutions. But really helping people to actually follow these principles — that's gonna be hard.

Yeah, yeah, yeah. But damn, that would be super cool. We're talking about something like a Jarvis, you know, like the AI assistant, but for Bayesian models — how cool would that be? Love that. And looking forward, how do you see Bayesian methods evolving with artificial intelligence research?

Yeah, I think for quite a while I was about to say that I've been building on this basic idea that deep learning models as such will become more and more Bayesian anyway — that's kind of a given. But now, of course, the recent very large-scale AI models are getting so big that computational resources become a major hurdle: doing learning for those models, even in the crudest possible way, is hard. So there are clear needs for uncertainty quantification in the large language model type of scope. They are really quite unreliable; they're really poor at, for example, evaluating their own confidence. There have been examples where, if you ask how sure they are about a statement, they give a similar number more or less irrespective of the statement — yeah, 50% sure, I don't know. So it may be that, at least in the very short run, it's not going to be Bayesian techniques that solve all the uncertainty quantification in those types of models. In the long term, maybe it is. But it's going to be interesting. It looks to me a bit like a lot of the stuff built to address specific limitations of these large language models comes as separate components: some sort of external tool that reads in those inputs, or an external tool that the LLM can use. So maybe this is going to be a separate element that somehow integrates. An LLM could have an API interface where it can query, let's say, Stan to figure out the answer to a type of question that requires probabilistic reasoning. People have been plugging things in — there are famous public examples where the LLM can query some mathematical reasoning engine and so on, so that if you ask a specific type of question, it goes outside of its own realm and does something. It already kind of knows how to program, so maybe we just need to teach LLMs to do statistical inference by relying on actually running an MCMC algorithm on a model that they specify together with the user. I don't know whether anyone is actually working on that. It's something that just came to my mind, so I haven't really thought about this too much.
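Nobody in the conversation claims such a system exists, but as a rough sketch of the "LLM calls an inference engine" idea, here is a hypothetical tool function that takes a Stan program and data — written by the user, the LLM, or both — runs MCMC with CmdStanPy, and returns a plain-text posterior summary the LLM could reason over. The function name and the overall wiring are invented for illustration.

```python
# Hypothetical tool an LLM could call for probabilistic reasoning: compile a
# Stan model, run MCMC, and return a compact text summary of the posterior.
import tempfile
from cmdstanpy import CmdStanModel

def probabilistic_reasoning_tool(stan_code: str, data: dict) -> str:
    # Write the user/LLM-specified Stan program to a temporary file
    with tempfile.NamedTemporaryFile("w", suffix=".stan", delete=False) as f:
        f.write(stan_code)
        stan_file = f.name
    model = CmdStanModel(stan_file=stan_file)
    fit = model.sample(data=data, chains=4, iter_sampling=1000,
                       show_progress=False)
    # Means, sd, quantiles, and R-hat as plain text the LLM can quote back
    return fit.summary().to_string()
```

A real system would need guardrails around compilation failures, divergences, and runtime limits, but the core loop — specify a model with the user, run inference, read back calibrated summaries — is exactly the kind of external capability being described.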

Yeah, but again, we're giving people so many PhD ideas right now.

We are.

Yeah, I feel like we should keep a list of all your awesome PhD ideas. Awesome. Well, I still have so many questions for you, but let's close the show, because I don't want to take too much of your time — I know it's getting late in Finland. So let's close up the show and ask you the last two questions I always ask at the end of the show. First one: if you had unlimited time and resources, which problem would you try to solve?

Let's see. The lazy answer is that I am now trying to get unlimited resources — well, not unlimited resources, but I'm really trying to tackle this prior elicitation question. I think for most of the other parts of the Bayesian workflow we have reasonably good solutions, but this whole question of how to figure out complex multivariate priors over arbitrarily complex models — that's a very practical thing I am investing in. But if it really were infinite, then maybe I could continue on the quick idea we just talked about: really getting this probabilistic reasoning at the core of these large language model type of AI applications, so that they would reliably give proper probabilistic judgments for the kind of decision-making and reasoning problems we ask of them. That would be interesting.

Yeah. Yeah, for sure. And second question: if you could have dinner with any great scientific mind, dead, alive, or fictional, who would it be?

Yes, this is something I actually thought about, because I figured you would also be asking it of me. And I chose a fictional character — I like fictional characters. So I went with Daniel Waterhouse from Neal Stephenson's Baroque Cycle books. They are kind of semi-historical books about the era when Isaac Newton and others were living and establishing the Royal Society, with a lot of high-fantasy components involved. And Daniel Waterhouse in those novels is a roommate of Isaac Newton and a friend of Gottfried Leibniz. So he knows both sides of this great debate on who invented calculus and who copied whom. If I had dinner with him, I would get to talk about these innovations, which I think are among the foundational ones, but I wouldn't actually need to get involved with either party. I wouldn't need to choose sides, whether it's Isaac or Gottfried I would be talking to.

Love it. Yeah, love that answer. Make sure to record that dinner and post it on YouTube — I'm pretty sure lots of people would be interested in it. Fantastic. Thanks. Thanks a lot, Arto. That was a great discussion. I'm really happy we could go through — well, not the whole depth of what you do, because you do so many things, but a good chunk of it. So I'm really happy about that. As usual, I'll put resources and a link to your website in the show notes for those who want to dig deeper. Thank you again, Arto, for taking the time and being on this show.

Thank you very much. It was my pleasure. I really enjoyed the discussion.

This has been another episode of Learning Bayesian Statistics. Be sure to rate, review, and follow the show on your favorite podcatcher, and visit learnbayesstats.com for more resources about today's topics, as well as access to more episodes to help you reach true Bayesian state of mind. That's learnbayesstats.com. Our theme music is "Good Bayesian" by Baba Brinkman, feat. MC Lars and Mega Ran. Check out his awesome work at bababrinkman.com. I'm your host, Alexandre Andorra. You can follow me on Twitter at alex_andorra, like the country. You can support the show and unlock exclusive benefits by visiting patreon.com/learnbayesstats. Thank you so much for listening and for your support. You're truly a good Bayesian. Change your predictions after taking information in. And if you're thinking I'll be less than amazing, let me show you how to be a good Bayesian. Change calculations after taking fresh data in. Those predictions that your brain is making? Let's get them on a solid foundation.
