#110 Unpacking Bayesian Methods in AI with Sam Duffield
Causal Inference, AI & Machine Learning • Episode 110 • 10th July 2024 • Learning Bayesian Statistics • Alexandre Andorra


Shownotes

Proudly sponsored by PyMC Labs, the Bayesian Consultancy. Book a call, or get in touch!

Our theme music is « Good Bayesian », by Baba Brinkman (feat MC Lars and Mega Ran). Check out his awesome work!

Visit our Patreon page to unlock exclusive Bayesian swag ;)

Takeaways:

  • Use mini-batch methods to efficiently process large datasets within Bayesian frameworks in enterprise AI applications.
  • Apply approximate inference techniques, like stochastic gradient MCMC and Laplace approximation, to optimize Bayesian analysis in practical settings.
  • Explore thermodynamic computing to significantly speed up Bayesian computations, enhancing model efficiency and scalability.
  • Leverage the Posteriors python package for flexible and integrated Bayesian analysis in modern machine learning workflows.
  • Overcome challenges in Bayesian inference by simplifying complex concepts for non-expert audiences, ensuring the practical application of statistical models.
  • Address the intricacies of model assumptions and communicate effectively to non-technical stakeholders to enhance decision-making processes.

Chapters:

00:00 Introduction to Large-Scale Machine Learning

11:26 Scalable and Flexible Bayesian Inference with Posteriors

25:56 The Role of Temperature in Bayesian Models

32:30 Stochastic Gradient MCMC for Large Datasets

36:12 Introducing Posteriors: Bayesian Inference in Machine Learning

41:22 Uncertainty Quantification and Improved Predictions

52:05 Supporting New Algorithms and Arbitrary Likelihoods

59:16 Thermodynamic Computing

01:06:22 Decoupling Model Specification, Data Generation, and Inference

Thank you to my Patrons for making this episode possible!

Yusuke Saito, Avi Bryant, Ero Carrera, Giuliano Cruz, Tim Gasser, James Wade, Tradd Salvo, William Benton, James Ahloy, Robin Taylor, Chad Scherrer, Zwelithini Tunyiswa, Bertrand Wilden, James Thompson, Stephen Oates, Gian Luca Di Tanna, Jack Wells, Matthew Maldonado, Ian Costley, Ally Salim, Larry Gill, Ian Moran, Paul Oreto, Colin Caprani, Colin Carroll, Nathaniel Burbank, Michael Osthege, Rémi Louf, Clive Edelsten, Henri Wallen, Hugo Botha, Vinh Nguyen, Marcin Elantkowski, Adam C. Smith, Will Kurt, Andrew Moskowitz, Hector Munoz, Marco Gorelli, Simon Kessell, Bradley Rode, Patrick Kelley, Rick Anderson, Casper de Bruin, Philippe Labonde, Michael Hankin, Cameron Smith, Tomáš Frýda, Ryan Wesslen, Andreas Netti, Riley King, Yoshiyuki Hamajima, Sven De Maeyer, Michael DeCrescenzo, Fergal M, Mason Yahr, Naoya Kanai, Steven Rowland, Aubrey Clayton, Jeannine Sue, Omri Har Shemesh, Scott Anthony Robson, Robert Yolken, Or Duek, Pavel Dusek, Paul Cox, Andreas Kröpelin, Raphaël R, Nicolas Rode, Gabriel Stechschulte, Arkady, Kurt TeKolste, Gergely Juhasz, Marcus Nölke, Maggi Mackintosh, Grant Pezzolesi, Avram Aelony, Joshua Meehl, Javier Sabio, Kristian Higgins, Alex Jones, Gregorio Aguilar, Matt Rosinski, Bart Trudeau, Luis Fonseca, Dante Gates, Matt Niccolls, Maksim Kuznecov, Michael Thomas, Luke Gorrie, Cory Kiser, Julio, Edvin Saveljev, Frederick Ayala, Jeffrey Powell, Gal Kampel, Adan Romero, Will Geary, Blake Walters, Jonathan Morgan and Francesco Madrisotti.

Links from the show:

Transcript

This is an automatic transcript and may therefore contain errors. Please get in touch if you're willing to correct them.


Speaker:

Folks, strap in, because today's episode

is a deep dive into the fascinating world

2

:

of large-scale machine learning.

3

:

And who better to guide us through this

journey than Sam Duffield.

4

:

Currently honing his expertise at Normal

Computing, Sam has an impressive

5

:

background that bridges the theoretical

and practical realms of Bayesian

6

:

statistics, from quantum computation to

the cutting edge of AI technology.

7

:

In our discussion, Sam breaks down complex

topics such as the Posteriors Python

8

:

package, mini-batch methods, approximate

inference, and the intriguing world

9

:

of thermodynamic hardware for statistics.

10

:

Yeah, I didn't know what that was either.

11

:

We delve into how these advanced methods

like stochastic gradient MCMC and Laplace

12

:

approximation are not just theoretical

concepts but pivotal in shaping enterprise

13

:

AI models today.

14

:

And Sam is not just about algorithms and

models, he is a sports enthusiast who

15

:

loves football, tennis and squash.

16

:

and he recently returned from an awe-inspiring

trip to the Faroe Islands.

17

:

So join us as we explore the future of AI

with Bayesian methods.

18

:

This is Learning Bayesian Statistics,

episode 110.

19

:

Welcome to Learning Bayesian Statistics, a

podcast about Bayesian inference, the

20

:

methods, the projects, and the people who

make it possible.

21

:

I'm your host, Alex Andorra.

22

:

You can follow me on Twitter at

alex_andorra,

23

:

like the country.

24

:

For any info about the show, learnbayesstats.com

is Laplace to be.

25

:

Show notes, becoming a corporate sponsor,

unlocking Bayesian merch, supporting the

26

:

show on Patreon, everything is in there.

27

:

That's learnbayesstats.com.

28

:

If you're interested in one-on-one

mentorship, online courses, or statistical

29

:

consulting, feel free to reach out and

book a call at topmate.io/alex_andorra.

31

:

See you around, folks, and best Bayesian

wishes to you all.

32

:

Sam Duffield, welcome to Learning

Bayesian Statistics.

33

:

Thanks, thank you very much.

34

:

Yeah, thank you so much for taking the

time.

35

:

I invited you on the show because I saw

what you guys at normal computing were

36

:

doing, especially with the Posteriors

Python package.

37

:

And I am personally always learning new

stuff.

38

:

Right now I'm learning a lot about sports

analytics, because that's a

39

:

Like that's always been a personal pet

peeve of mine, and Bayesian stats is extremely

40

:

useful in that field.

41

:

But I'm also in conjunction working a lot

about LLMs and the interaction with the

42

:

Bayesian framework.

43

:

I've been working much more on the BayesFlow

package, which we've talked about

44

:

with Marvin Schmitt in episode 107.

45

:

So.

46

:

Yeah, working on developing a PyMC bridge

to BayesFlow so that you can write your

47

:

model in PyMC and then like using

amortized Bayesian inference for your PyMC

48

:

models.

49

:

It's still like way, way down the road.

50

:

I need to learn about all that stuff, but

that's really fascinating.

51

:

I love that.

52

:

And so of course, when I saw what you were

doing with Posteriors, I was like, that

53

:

sounds...

54

:

Awesome.

55

:

I want to learn more about that.

56

:

So I'm going to ask you a lot of

questions, a lot of things I don't know.

57

:

So that's great.

58

:

But first, can you tell us, give us a

brief overview of your research interests

59

:

and how Bayesian methods play a role in

your work?

60

:

Yeah, no, I know.

61

:

Thanks again for the invite.

62

:

I think, yeah, sports analytics, Bayesian

statistics, language models, I think we

63

:

have a lot to talk about.

64

:

should be fun.

65

:

Bayesian methods in my work, yes, so at

Normal we have a lot of problems where we

66

:

think that Bayes is the right answer if

you could compute it exactly.

67

:

So what we're trying to do is trying to

look at different approximations and

68

:

different, like how they scale in

different methods and different settings.

69

:

and how we can get as close to the exact

Bayes or the exact sort of integral and

70

:

updating under uncertainty that can

provide us with some of those benefits.

71

:

Yeah.

72

:

OK.

73

:

Yeah.

74

:

That's interesting.

75

:

I, of course, agree.

76

:

Of course.

77

:

Can you, like, actually, do you remember

when you were first introduced to Bayesian

78

:

inference?

79

:

Because you had a

80

:

an extensive background you've studied a

lot.

81

:

When in that, in those studies, were you

introduced to the Bayesian framework?

82

:

And also, how did you end up working on

what you're working on nowadays?

83

:

Yeah, okay.

84

:

I'll try not to rant too long about this.

85

:

But yeah, so I guess I, yeah, mathematics,

undergraduate at Imperial.

86

:

So I think that's

87

:

I was very young at this stage, we were

very young in our undergraduates, so not

88

:

really sure what we want to do.

89

:

At some point, it came to me that

statistics within the field of mathematics

90

:

is kind of like where I can like, that

should be working on like, applied

91

:

problems and how what where the sort of

field is going.

92

:

And that's what got me excited.

93

:

Statistics at undergraduate are different,

different places, but you get thrown a lot

94

:

of different

95

:

I mean, probably in all courses, you get

different, you get different point of view

96

:

and you get like, yeah, you get your

frequencies, your hypothesis testing, and

97

:

then you have your Bayesian method as

well.

98

:

And that is just the Bayesian approach

really sort of settled with me as being

99

:

more natural in terms of you just write

down the equation and then

100

:

Bayes' theorem handles it: you write down, you

have your forward model and your prior and

101

:

then Bayes theorem handles everything

else.

102

:

So you're kind of writing down it's like,

103

:

mathematicians is kind of like one of the

lecturers in my first year said, yeah,

104

:

mathematicians are lazy.

105

:

They want to do as little as

possible.

106

:

So Bayes' theorem is kind of nice there

because you just write down your

107

:

likelihood, you write down your prior, and

then Bayes' theorem handles the rest.

108

:

So you have to do like the minimum

possible work you have your data

109

:

likelihood prior and then done.

110

:

So that was that was really compelling to

me.

111

:

And that led me to my PhD, which was

in the engineering department in

112

:

Cambridge.

113

:

So that was like, yeah, I had a few

114

:

thoughts on what to do for my PhD.

115

:

There was some more theoretical stuff and

I wanted to get into some problems, get

116

:

into the weeds a bit.

117

:

So yeah, engineering department of

Cambridge working on Bayesian statistics,

118

:

state space models, and, within state space

models, sequential Monte Carlo.

119

:

And I think, yeah, I mean, for terminology

wise, I use state space model and hidden

120

:

Markov model as the same thing.

121

:

So yeah, you have this time series style

data and that was working on that sort of

122

:

data gave me

123

:

I feel like this propagation of

uncertainty really shines there because

124

:

you need to take into account your

uncertainty from the previous experiments,

125

:

say, when you update for your new ones.

126

:

That was really compelling for me.

127

:

That was, I guess, my route into Bayesian

statistics.

128

:

Yeah, okay.

129

:

Actually, here I could ask you a lot of

questions, but...

130

:

those time series models.

131

:

I'm always fascinated by time series

models.

132

:

I don't know, I love them for some reason.

133

:

I find there is a kind of magic in the

ability of a model to take time

134

:

dependencies into account.

135

:

I love using Gaussian processes for that.

136

:

So I could definitely go down that rabbit

hole, but I'm afraid then I won't have

137

:

enough time for you to talk about

Posteriors.

138

:

Let me just say one minute about it.

139

:

So I'll just say like, yeah, in terms of

yeah, Gaussian process is really cool.

140

:

Like Gaussian process, like can think of

as like a continuous time or continuous

141

:

space or whatever the time-varying

axis is, we'll call it time, a continuous-time-

142

:

varying version of a state space model

(a state space model or hidden Markov model).

143

:

Kind of like that to me is like the

canonical extension of a just a static

144

:

based inference model to a

145

:

the time varying setting because you can

and they kind of unify each other because

146

:

you can write smoothing in a state space

model as one big static Bayesian inference

147

:

problem and then you can write static

Bayesian inference problems they're just p

148

:

of y given x or p of yeah recovering x

from y as the first step, as a

149

:

single step of state space model so the

techniques that you build just overlap and

150

:

you can yeah at least conceptually on the

mathematical level when you actually get

151

:

into the approximations and the

computation

152

:

there are different things to consider,

different axes of scalability to consider,

153

:

but conceptually, I really like that.

154

:

I probably ranted for a bit more than a

minute there, so I apologize.

155

:

No, no, that's fine.

156

:

I love that.

157

:

Yeah.

158

:

I have much more knowledge and experience

on GPs, but I'm definitely super curious

159

:

to also apply these state space models and

so on.

160

:

So definitely going to read the...

161

:

the paper you sent me about skill rating

of football players where you're using, if

162

:

I understand correctly, some state space

models.

163

:

That's going to be two birds with one

stone.

164

:

So thanks a lot for writing that.

165

:

The whole point of that paper is to say

that rating systems, Elo and TrueSkill, are

166

:

and should be reframed as state space

models.

167

:

And then you just have your full Bayesian

understanding of it.
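
The reframing Sam describes can be sketched in a toy Gaussian skill model (a hypothetical illustration, not the exact model from the paper): each player's skill is a Gaussian that drifts between matches (the state space dynamics), and an observed score difference triggers an exact Kalman-style Bayes update of both players.

```python
def predict(mu, var, tau2=0.1):
    # State space dynamics: skills drift between matches as a random
    # walk, so the skill variance grows by tau2 at each step.
    return mu, var + tau2

def update(mu_a, var_a, mu_b, var_b, score_diff, sigma2=1.0):
    # Observation model: score_diff ~ N(mu_a - mu_b, sigma2).
    # Everything is Gaussian, so this Kalman-style step is exact Bayes.
    e = score_diff - (mu_a - mu_b)   # innovation
    s = var_a + var_b + sigma2       # innovation variance
    mu_a_new = mu_a + (var_a / s) * e
    mu_b_new = mu_b - (var_b / s) * e
    var_a_new = var_a - var_a ** 2 / s
    var_b_new = var_b - var_b ** 2 / s
    return mu_a_new, var_a_new, mu_b_new, var_b_new
```

Unlike a point-estimate Elo update, the posterior variances are carried along, so uncertainty from previous matches propagates into the next update, which is exactly the "propagation of uncertainty" benefit mentioned earlier.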

168

:

Yeah, yeah.

169

:

I mean, for sure.

170

:

I'm working myself also on the

171

:

a project like that on football data.

172

:

And yeah, the first thing I was doing is

like, okay, I'm gonna write the simple

173

:

model.

174

:

But then as soon as I have that down, I'm

gonna add a GP to that.

175

:

It's like, I have to take these

nonlinearities into account.

176

:

So yeah, I'm like, super excited about

that.

177

:

So thanks a lot for giving me some weekend

readings.

178

:

So actually now let's go into your

posteriors package because I have so many

179

:

questions about that.

180

:

So could you give us an overview of the

package, what motivated this development

181

:

and also putting it in the context of

large scale AI models?

182

:

Yeah, so as I said, we at Normal think that

Bayes is the right answer.

183

:

So we want to get, we want to, but yeah,

we're interested in

184

:

large scale enterprise AI models.

185

:

So we need to be able to scale these to

big, big models, big, big parameter sizes

186

:

and big data at the same time.

187

:

So this is what the Posteriors Python package,

built on PyTorch, really hopes to bring.

188

:

It's built with sort of flexibility and

research in mind.

189

:

So really we want to try out different

methods and try out for different data

190

:

sets and different goals.

191

:

what's going to be the best approach for

us.

192

:

That's the motivation of the Posteriors

package.

193

:

When would people use it?

194

:

For instance, for which use cases would I

use Posteriors?

195

:

There's a lot of just genuinely fantastic

Bayesian software out there.

196

:

But most of it has focused on the full

batch setting, as is classically the case

197

:

with Metropolis-Hastings,

et cetera.

198

:

And we feel like we're moving or we have

already moved into the mini-batch era, the

199

:

big data era.

200

:

So Posteriors is mini-batch first.

201

:

So if you have a lot of data, even if you

have a small model, and you have a lot of

202

:

data, and you want to try posterior

sampling with mini-batches, you want to

203

:

see how that...

204

:

If that can speed up your inference rather

than doing full batch on every step, then

205

:

Posteriors is the place for that, even with

small models.

206

:

So you can just write down your model in

Pyro, in PyTorch, and then use Posteriors

207

:

to do that.

208

:

But then that's like moving from like

classical Bayesian statistics into like

209

:

the mini-batch one.

210

:

But then there are also benefits of

even

211

:

very crude approximations to the Bayesian

posterior in these really high scale large

212

:

scale models.

213

:

So like, yeah, like language models, big

neural networks, with these

214

:

you're not going to be able

to do your convergence checks and these

215

:

sort of things in those models, but you

might still be able to get some advantages

216

:

out-of-distribution detection,

improved predictive

217

:

performance, sort of continual learning,

and these are the sort of things we're

218

:

investigating is if like,

219

:

the sort of, what if you just trained with

gradient descent essentially, you wouldn't

220

:

necessarily get these things.

221

:

But even very crude Bayesian

approximations will hopefully provide

222

:

these benefits.

223

:

I think I will talk about this more later.

224

:

I think.

225

:

Yeah, okay.

226

:

So basically, what I understand is

that you can use Posteriors for basically any

227

:

model.

228

:

So I mean, we're still young.

229

:

The package is still

230

:

very young and it doesn't have like the

support of, I don't know, if you want to

231

:

do Gaussian processes, we're not going

to have a whole suite of kernels that

232

:

you're going to be able to just type up.

233

:

But fundamentally, it takes any, it just

takes a function, a log posterior

234

:

function, and then you will be able to try

out different methods.

235

:

But as I said, like the big data regime is

much less researched, and the

236

:

sort of big parameter regime is much

harder.

237

:

at least.

238

:

So it's going to be, it's not going to be

like a silver bullet.

239

:

You're going to have to, there's research,

basically, Posteriors is a tool for

240

:

research a lot of the time where you're

going to research what inference methods

241

:

you can use, where they fail, and

hopefully where they succeed as well.

242

:

Okay.

243

:

Okay.

244

:

I see.

245

:

And so to make sure listeners understand,

well, you can do both in Posteriors, right?

246

:

You can write your model in Posteriors.

247

:

and then sample from it?

248

:

Or is that only model definition or is

that only model sampling?

249

:

So it only does approximate posterior

sampling.

250

:

So you write down the log posterior,

you're given some data and you write down

251

:

the log posterior.

252

:

Or the joint, you could say.

253

:

It doesn't have the sophisticated support

of Stan or PyMC, where you actually have

254

:

the, you can write down the model.

255

:

but it has the support for all the

distributions and doing forward samples.

256

:

It leans on other tools like Pyro or

PyTorch itself for that.

257

:

It is about approximate inference in the

posterior space, in the sample space.

258

:

So you can do Laplace approximation with

these things and compare them.

259

:

And importantly, it's mini-batch first.

260

:

So every method only expects to receive

batch by batch.

261

:

So you can support the large data regime.

262

:

Okay, so I think there are a bunch of

terms we need to define here for

263

:

listeners.

264

:

Okay, yeah, sorry about that.

265

:

Can you define mini-batch?

266

:

Can you define approximate inference and

in particular, Laplace approximation?

267

:

Okay, so mini-batch is the important one,

of course.

268

:

Yeah, so normally in traditional Bayesian

statistics, if you're running random-walk

269

:

Metropolis-Hastings or HMC, you

will be seeing your whole dataset, all N

270

:

data points at every step of the

iteration.

271

:

And there's beautiful theory about that.

272

:

But a lot of the time in machine learning,

you have a billion data points.

273

:

Or if you're doing a foundation model,

it's like all of Wikipedia, it's billions

274

:

of data points or something like that.

275

:

And there's just no way that every time

you do a gradient step, you just can't sum

276

:

over a billion data points.

277

:

So you take 10 of them, you do this

unbiased approximation.

278

:

And this doesn't propagate through the

exponential, which you need.

279

:

for the Metropolis-Hastings step.

280

:

So it rules out a lot of traditional

Bayesian methods, but there's still been

281

:

research on this.

282

:

So this is the sort of scalable Bayesian

learning that we talked about with

283

:

Posteriors.

284

:

So we're investigating mini-batch methods.

285

:

So yeah, methods that only use a small

amount of the data, as is very common in

286

:

machine learning, like stochastic

gradient descent in optimization terms.
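
The rescaling behind this can be sketched in plain Python (a generic illustration, not the posteriors API): subsample a mini-batch and scale by N / B, so the estimate of the full-data log-likelihood is unbiased.

```python
import random

def minibatch_log_lik(log_lik, theta, data, batch_size):
    # Unbiased mini-batch estimate of sum_i log p(y_i | theta):
    # take a random subsample of B points and rescale by N / B.
    batch = random.sample(data, batch_size)
    return (len(data) / batch_size) * sum(log_lik(theta, y) for y in batch)
```

Note that while this estimate of the *log*-likelihood is unbiased, its exponential is not an unbiased estimate of the likelihood itself, which is the "doesn't propagate through the exponential" problem that rules out the standard Metropolis-Hastings acceptance step.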

287

:

So hopefully

288

:

Mini-batches, okay, you said approximate

inference.

289

:

So approximate, okay, yeah, inference is a

very loaded term.

290

:

Maybe I should try not to use it, but when

I say approximate inference, I mean

291

:

approximate Bayesian inference.

292

:

So you can write down mathematically the

posterior distribution, P of theta given

293

:

y, and then yeah, proportional to P of

theta, P of y given theta.

294

:

But that's

295

:

You only have access to pointwise

evaluations of that and potentially even

296

:

only mini-batch pointwise

evaluations.

297

:

So approximate inference is forming some

approximation to that posterior

298

:

distribution, whether that's a Gaussian

approximation or through Monte Carlo

299

:

samples.

300

:

So yeah, just like an ensemble of points

as your approximate inference.

301

:

So that's approximate inference.

302

:

And yeah, you have different fidelities of

this posterior approximation.

303

:

Last one, Laplace approximation.

304

:

Laplace approximation is

305

:

arguably the simplest, in like a machine

learning setting, approximation to the

306

:

posterior distribution.

307

:

So it's just a Gaussian distribution.

308

:

So all you need to define is a mean and

covariance.

309

:

You define the mean by doing an

optimization procedure on your log

310

:

posterior or just log likelihood.

311

:

And that will give you a point that will

give you your mean.

312

:

And then

313

:

And then you take, okay, it gets quite in

the weeds, the Laplace approximation, but

314

:

ideally you then do a Taylor expansion

around the mode.

315

:

Second order Taylor expansion will give

you the Hessian.

316

:

We would recommend the negative inverse

Hessian being your approximate covariance.

317

:

But there are tricky quantities there, and

you can use the Fisher information instead.
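
A minimal one-dimensional sketch of the two steps just described (the example log posterior is hypothetical, and derivatives are taken by finite differences rather than autodiff): find the mode by gradient ascent, then turn the curvature at the mode into a Gaussian variance.

```python
def laplace_1d(log_post, theta0, lr=0.1, steps=2000, h=1e-4):
    # Step 1: find the MAP (the mean of the Gaussian) by gradient
    # ascent on the log posterior.
    theta = theta0
    for _ in range(steps):
        grad = (log_post(theta + h) - log_post(theta - h)) / (2 * h)
        theta += lr * grad
    # Step 2: second-order Taylor expansion at the mode; the negative
    # inverse Hessian is the approximate variance.
    hess = (log_post(theta + h) - 2 * log_post(theta) + log_post(theta - h)) / h ** 2
    return theta, -1.0 / hess  # mean, variance

# Hypothetical example: the log posterior of N(2, 0.5), so the
# Laplace approximation recovers the true mean and variance.
mean, var = laplace_1d(lambda t: -(t - 2.0) ** 2 / (2 * 0.5), theta0=0.0)
```

Because the target here is exactly Gaussian, the approximation is exact; for a skewed or multimodal posterior the same procedure returns a Gaussian that can be badly wrong, which is the "crude approximation" caveat from earlier.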

318

:

And yeah, you can read that there's lots

of, I'm sure you've had people on

319

:

the podcast explain it better than me.

320

:

Yeah.

321

:

For Laplace, no.

322

:

Actually, so that's why I asked you to

define it.

323

:

I'm happy to go down into the weeds if you

want.

324

:

Yeah, if you think that's useful.

325

:

Otherwise, we can definitely do also an

episode with someone you'd recommend to

326

:

talk about Laplace approximation.

327

:

Something I'd like to communicate to

listeners is for them to understand.

328

:

Yeah, we say approximation, but at the

same time, MCMC is an approximation

329

:

itself.

330

:

So that can be a bit confusing.

331

:

Can you talk about the fact, like, about

why these kind of methods, like Laplace

332

:

approximation, I think VI, variational

inference, would fall also into this

333

:

bucket.

334

:

Why are those methods called

approximations?

335

:

in contrast to MCMC?

336

:

What's the main difference here?

337

:

I honestly I would say MCMC is also an

approximation in the same terminology but

338

:

yeah, the difference is that we talk about

bias, and some methods are

339

:

asymptotically unbiased, which MCMC is,

as is stochastic gradient MCMC, which is what

340

:

Posteriors uses as well, in some,

341

:

under some caveats, and there are caveats

for MCMC, normal MCMC as well.

342

:

But yeah, so you have your Gaussian

approximations from variational inference

343

:

and the Laplace approximation.

344

:

And these are very much approximations in

the sense there's no axes on which you can

345

:

increase such that, as you increase it to

infinity, you approach the posterior.

346

:

You cannot do that with the Gaussian

approximations unless your posterior is

347

:

you're known to be Gaussian, in which case,

I mean, there are more and more

348

:

interesting cases like that, like Gaussian

processes and things.

349

:

But yeah, so they don't have this

asymptotically unbiased feature that MCMC

350

:

does, or importance sampling and sequential

Monte Carlo does, which is very useful

351

:

because it allows you to trade compute for

accuracy, which you can't do with a

352

:

Laplace approximation or VI beyond

extending, like going from diagonal

353

:

covariance to a full covariance or things

like that.

354

:

And this is very useful in the case that

you have extra compute available.

355

:

So I'm a big fan of the

356

:

asymptotic unbiased property because it

means that you can increase your compute

357

:

in safety.

358

:

Yeah.

359

:

Yeah.

360

:

Great explanation.

361

:

Thanks a lot.

362

:

And so yeah, but so as you were saying,

there is not this asymptotic unbiasedness

363

:

from these approximations, but at the same

time, that means they can be way faster.

364

:

So it's like if you're in the right use

case in the right, in the right

365

:

Yeah, in the right use case, then that

really makes sense to use them.

366

:

But you have to be careful about the

conditions where the approximation falls

367

:

down.

368

:

Can you maybe dive a bit deeper into

stochastic gradient descent, which is the

369

:

method that Posteriors is using, and how

that fits into these different methods

370

:

that you just talked about?

371

:

Actually, stochastic gradient descent is

not a method that Posteriors is using per

372

:

se.

373

:

Stochastic gradient descent is

the workhorse of most machine

374

:

learning algorithms, but Posteriors would

kind of be this kind of same like it kind

375

:

of saying it shouldn't be perhaps or like,

not in all cases.

376

:

stochastic gradient descent is what you

use.

377

:

If you have extremely large data, and you

just want to find the MLE or so the

378

:

maximum likelihood or the minimum of a

loss, which you might say.

379

:

So

380

:

that is just as an optimization routine.

381

:

So you just want to find the parameters

that minimize something.

382

:

If you're doing variational inference,

what you can do is you can tractably get

383

:

the KL divergence between your specified

variational distribution and the log

384

:

posterior.

385

:

And then you have parameters.

386

:

So they're like parameters of the

variational distribution over your model

387

:

parameters.

388

:

And then you use stochastic gradient

descent on that.

389

:

So this is nice because it just means that

you can throw the workhorse from machine

390

:

learning at a

391

:

Bayesian problem and get the Bayesian

approximation out.

392

:

Again, as we mentioned, it doesn't have

this asymptotic unbiased feature, which is

393

:

maybe less of a concern in machine

learning models where you have less of

394

:

ability to trade compute because you've

kind of filled your compute budget with

395

:

your gigantic model.

396

:

Although we may see this, we think that I

think this might change over the coming

397

:

years.

398

:

But yeah, maybe not.

399

:

Maybe we'll just go even bigger and bigger

and bigger.

400

:

You...

401

:

Okay, sorry.

402

:

I got lost.

403

:

You said you're asking about stochastic

gradient descent.

404

:

So actually, there's something interesting

to say here.

405

:

And then that means also what the main

differentiating characteristics of Posteriors

406

:

is, like these, so that really people

understand the use case of Posteriors here.

407

:

Yeah.

408

:

So we didn't want to...

409

:

Okay.

410

:

So yeah, there's a key thing about the way

we've written Posteriors is that we like...

411

:

where possible to have stochastic gradient

descent, so optimization, as sort of

412

:

limits under some hyperparameter

specifications of the algorithms.

413

:

And it turns out that in a lot of cases,

so we talked about MCMC, and then we

414

:

talked about stochastic gradient MCMC,

which are MCMC methods that strictly

415

:

handle mini-batch methods.

416

:

And a lot of the time, you can write down

the temperature, you have the temperature

417

:

parameter of your posterior distribution.

418

:

And then as you take that to zero,

419

:

So the temperature is like, if the

temperature is very high, your posterior

420

:

distribution is very heated up.

421

:

So you've increased the tails and it's

a lot like, much closer to, sort of a

422

:

uniform distribution.

423

:

You take it very cold, it becomes very

pointed and focused around optima.

424

:

So we write the algorithms so that there's

this convenient transition through the

425

:

temperature.

426

:

So you set the temperature to zero, you

just get optimization.

427

:

And this is a key thing about posteriors.

428

:

So we have the, so the Posteriors

stochastic gradient MCMC methods have

429

:

this temperature parameter which if you

set to zero will become a variant of

430

:

stochastic gradient descent.

431

:

So you can just sort of unify gradient

descent and stochastic gradient MCMC and

432

:

it's nice so you have your yeah you have

your Langevin dynamics which tempered down

433

:

to zero just becomes vanilla gradient

descent you have underdamped Langevin

434

:

dynamics or stochastic gradient HMC,

stochastic gradient Hamiltonian Monte

435

:

Carlo, you set the temperature to zero and

then you've just got stochastic gradient

436

:

descent with momentum.
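
The temperature-zero limit can be sketched in one Langevin step (an illustration of the limit Sam describes under a common discretization, assuming a user-supplied gradient of the log posterior; it is not the posteriors implementation):

```python
import math, random

def sgld_step(theta, grad_log_post, lr, temperature, rng=random):
    # One (stochastic gradient) Langevin dynamics step: gradient
    # ascent on the log posterior plus Gaussian noise whose scale
    # is controlled by the temperature.
    noise = math.sqrt(2 * lr * temperature) * rng.gauss(0.0, 1.0)
    return theta + lr * grad_log_post(theta) + noise
```

With `temperature > 0` this explores the (tempered) posterior; with `temperature = 0` the noise term vanishes and the update is exactly a vanilla gradient step, so the sampler degenerates into gradient descent on the loss.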

437

:

So yeah, this is a nice thing about

Posteriors, to sort of unify these

438

:

approaches and it hopefully will make it

less scary to use Bayesian approaches

439

:

because you know you always have gradient

descent and you can sanity check by just

440

:

setting the temp, just fiddling with the

temperature parameter.

441

:

Okay, that's really cool.

442

:

Okay.

443

:

So it's like, it's a bit like the

temperature parameter in the, in the

444

:

transformers that, that like make sure, I

mean, in the LLMs that

445

:

It's like adding a bit of variation on top

of the predictions that the LLM could

446

:

make.

447

:

Yeah, so it's exactly the same as that.

448

:

So when you use this in language models or

natural language generation, you

449

:

temper the generative distribution so

that the logits get tempered.

450

:

So if you set the temperature there to

zero, you get greedy sampling.
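
For comparison, the generation-time version of tempering is a temperature-scaled softmax over next-token logits (a standalone sketch; real LLM stacks apply this to large logit vectors):

```python
import math

def softmax(logits, temperature):
    # Temperature-scaled softmax over next-token logits.
    if temperature == 0:
        # Zero temperature: all mass on the argmax, i.e. greedy sampling.
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    exps = [math.exp(l / temperature) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

High temperature flattens the distribution toward uniform; low temperature sharpens it around the top logit, mirroring the heated/cooled posterior picture above, just in output space rather than parameter space.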

451

:

But we're doing this in parameter space.

452

:

So it's, yeah.

453

:

It has this, yeah, exactly.

Distribution tempering is a broad thing. I'm not going to go too philosophical, but I first met tempering when we thought about it in the setting of sequential Monte Carlo, and it's like, is it the natural way? Is it something that's natural to do? But in the context of Bayes, because Bayes' theorem is multiplicative, right, you have your p(theta) times p(y given theta), it kind of makes sense to temper, because it means, okay, I'll just introduce the likelihood a little bit. So tempering is a natural way to do it because of that multiplicative feature of Bayes' theorem. So it kind of settled with me after thinking about it like that.
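The likelihood tempering alluded to here is usually written as a power posterior (a standard construction, stated for reference):

```latex
\pi_\beta(\theta \mid y) \;\propto\; p(\theta)\, p(y \mid \theta)^{\beta},
\qquad \beta \in [0, 1],
```

so beta = 0 returns the prior, beta = 1 the full Bayesian posterior, and sequential Monte Carlo samplers step through a ladder of intermediate beta values, introducing the likelihood a little at a time.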

Yeah, no, I mean, that makes perfect sense. And I was really surprised to see that was used in LLMs when I first read about the algorithms. And I was pleasantly surprised, because I've worked a lot on electoral forecasting models; that's how I was introduced to Bayesian stats, actually, and I'd done that without knowing it. Because in electoral forecasting I'm using the softmax all the time: unless you're doing that in the US, you need a multinomial likelihood, the multinomial needs a probability distribution, and how do you get that from the softmax function, which is actually a very important one in the LLM framework? And also, the thing is, your probability is the latent popularity of each party, but you never observe it, right? And so the polls, you could conceptualize them as a tempered version of the true latent popularity. And so that was really interesting. I was like, damn, this stuff is much more powerful than what I thought, because I was applying it only to electoral forecasting models, which is a very niche application of these models, when actually there are so many applications of that in the wild.

No, so yeah, tempering in general is very widespread, and also, I would say, not particularly well understood. Like, there's been research into this cold posterior effect, which is, I don't know, perhaps an annoying thing for Bayesian modeling on neural networks. As I said, you have this temperature parameter that transitions between optimization and the Bayesian posterior: zero is optimization, one is the Bayesian posterior. And empirically, we see better predictive performance, which is a lot of the time what we care about in machine learning, with temperatures less than one. Which is annoying, because we're Bayesians and we think that the Bayesian posterior is optimal for decision-making under uncertainty. So this is annoying, but, at least in our experiments, we found this so-called cold posterior effect much more prominent under Gaussian approximations, which we only believe to be very crude approximations to the posterior anyway. And if we do more MCMC or deep ensemble stuff, where deep ensembles is, we've got a paper we'll be able to put on arXiv shortly which describes deep ensembles. In deep ensembles, you just run gradient descent in parallel with different initializations and batch shuffling. And then, say you run 10 ensembles, 10 optimizations in parallel, then you've got 10 parameter configurations at the end. So a Monte Carlo approximation of the posterior of size 10.
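In miniature, the deep ensemble recipe is just this (a toy 1-d loss in plain Python, standing in for the large PyTorch models the episode is about):

```python
import random

def train_one(init, data, lr=0.1, epochs=200):
    # Plain SGD on the quadratic loss 0.5 * (theta - x)**2 per point,
    # with fresh batch shuffling every epoch.
    theta = init
    for _ in range(epochs):
        random.shuffle(data)
        for x in data:
            theta -= lr * (theta - x)
    return theta

random.seed(1)
data = [0.7, 0.9, 1.0, 1.1, 1.3]
# A deep ensemble: the same optimization run from different random
# initializations, giving a size-5 Monte Carlo bag of trained parameters.
ensemble = [train_one(random.gauss(0.0, 3.0), list(data)) for _ in range(5)]
```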

And then we describe in the paper how to get this asymptotic unbiasedness property by using that temperature. Because, as we said earlier, SGMCMC becomes SGD with temperature zero, so you can reverse this for deep ensembles: you add the noise, and then asymptotically biased deep ensembles become asymptotically unbiased MCMC, that is, SGMCMC. But in those cases, when you have the non-Gaussian approximation, we found much less of the cold posterior effect. So yeah, but maybe the cold posterior effect is a natural thing, because it's not really like Bayes' theorem. Yeah, it still needs to be better understood. At least in my head, I'm not fully clear on whether the cold posterior effect is something we should be surprised about.

Okay, yeah. Yeah, me neither, if that makes you feel any better, because I just learned about that. So yeah, I don't have any strong opinion.

Okay, I think we're getting clearer now on what Posteriors is, for listeners. So then, I think, one of the last questions about the algorithms underlying all of that. So, stochastic gradient MCMC. That's where I got confused: I hear stochastic gradient and think stochastic gradient descent, but no, it's SGMCMC, not SGD. So Posteriors really likes to use SGMCMC. Why would you do that and not use classic MCMC, like the classic HMC from Stan or PyMC?

Yeah, so I mean, it's not just for SGMCMC. There's also variational inference, the Laplace approximation, the extended Kalman filter, and we're really excited to have more methods as well as we look to maintain and expand the library. Why would you use SGMCMC? So yeah, I think we've already touched upon this. The thing is, if you've got loads of data, it's just going to be inefficient to sum over all of that data at every iteration of your MCMC algorithm, as Stan would do. But there's a mathematical reason why you can't just subsample in Stan: the Metropolis-Hastings ratio has this exponential of the log posterior, but log space is the only place you can get an unbiased approximation, which is what you need if you did want to naively subsample. So you can't do the Metropolis-Hastings accept/reject; you have to use different tooling. And in its simplest terms, SGMCMC just omits it and just runs a Langevin: it runs your Hamiltonian Monte Carlo without the accept/reject. But there's more theory on top of this, and you need to control the discretization error and stuff like that. And I won't go into the weeds of that.
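The key point is only that a rescaled mini-batch gradient is an unbiased estimator of the full-data gradient. A plain-Python sketch (illustrative, not the Posteriors internals):

```python
import random

def full_grad(theta, data):
    # Gradient of the summed loss 0.5 * (theta - x)**2 over ALL the data,
    # which is what Stan-style MCMC must evaluate at every iteration.
    return sum(theta - x for x in data)

def minibatch_grad(theta, data, batch_size):
    # Unbiased estimator of full_grad: subsample, then rescale by N / n.
    # SGMCMC plugs this straight into the Langevin dynamics; there is no
    # Metropolis-Hastings accept/reject step left to break.
    batch = random.sample(data, batch_size)
    return len(data) / batch_size * sum(theta - x for x in batch)

random.seed(0)
data = [float(i) for i in range(1000)]
exact = full_grad(0.0, data)  # one pass over all 1000 points
est = sum(minibatch_grad(0.0, data, 10) for _ in range(2000)) / 2000
```

Each mini-batch call touches 10 points instead of 1000, and the estimates average out to the exact gradient.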

Okay. Yeah. Okay. And that's tied to mini-batching, basically. Like, the power that SGMCMC gives you when you're in a high-data regime is tied to the mini-batching, if I understand correctly. That's the difference between MCMC and SGMCMC. Okay, so that's like the main difference. Okay.

Yeah, stochastic gradient. So you can't actually get the exact gradient like you need in Hamiltonian Monte Carlo and for the Metropolis-Hastings step; you only get an unbiased approximation. And then there's theory about this: sometimes you can deploy the central limit theorem, and then you've got a covariance attached to your gradients, and you can do nice theory and improve the convergence like that, yeah.

Okay. All clear now. All clear. Awesome. Yeah. And I think that's the first time we talk about that on the show, so I think it's definitely useful to be extra clear about that, so that listeners understand, and me, like, myself, so that I understand. Thanks a lot.

It's in some settings actually much simpler, because you kind of remove the tools that were available to you by removing the Metropolis-Hastings step, so it makes the implementation a bit simpler. But you kind of lose the theory in that. And then a lot of the argument is, if you use a decreasing step size, then your noise from the mini-batch, your noise from the stochastic gradient, decreases like epsilon squared, which is faster. So if you decrease your step size and run it for infinite time, then you'll eventually just be running the continuous-time dynamics, which are exact and do have the right stationary distribution. So if you run it with decreasing step size, then you are asymptotically unbiased. But running with decreasing step size is really annoying, because you then don't move as far. As we know from normal MCMC, we want to increase our step size and move and explore the posterior more. So there's lots of research to be done here.

I hope and I feel that it's not the last time you'll talk about stochastic gradient MCMC on this podcast.

Yeah, no. I mean, that sounds super interesting. I'm really interested also to really understand the difference between these algorithms. Right now, that's really at the frontier of research. You not only have a lot of research done on how to make HMC more efficient, but you have all these new algorithms, approximate algorithms, as we said before. So, VI, the Laplace approximation, stuff like that. But also now you have normalizing flows. We talked about that in episode 98 with Marylou Gabrié. Marylou Gabrié, actually, I don't know why I said the second part with the Spanish, because my Spanish is really available in my brain right now. So, she's French. So, that's Marylou Gabrié, episode 98, it's in the show notes. Episode 107, I already mentioned it, with Marvin Schmitt, about amortized Bayesian inference.

Actually, do you know about amortized Bayesian inference and normalizing flows?

I know a bit about normalizing flows. Amortized Bayesian inference I would be less comfortable with. Okay. But I mean, if you could explain it. Yeah, I haven't watched that episode, or listened to that episode.

Yeah, I mean, we released it yesterday. Yeah, I don't... I'm a bit disappointed, Sam, but that's fine. Like, it's just one day, you know. If you listen to it just after the recording, I'll forgive you. That's okay. No, so, kidding aside, I'm actually curious to hear you speak about the difference between normalizing flows and SGMCMC. Can you talk a bit about that, if you're comfortable with that?

I mean, I can't. It's been a while since I've read about normalizing flows. When I did read about them, I understood them to be essentially a form of variational inference where you define a more elaborate variational family, essentially through a triangular mapping. Someone might say, why can't you just use a neural network as your variational distribution? And it's not so easy, because you need to have this tractable form. Hang on a second, let me remember. But the thing is, with normalizing flows, you can get this, because you can invert. That's it: they're invertible, right? Normalizing flows are invertible. So you can write the change-of-variables formula, and then you can calculate, essentially via maximum likelihood, using these normalizing flows to fit to a distribution. Whereas SGMCMC doesn't. So in normalizing flows, you kind of have to define your ansatz that will fit to your distribution. I think normalizing flows are really exciting and really interesting, but yeah, you have to specify your ansatz. So there's another specification on top: rather than just writing the log posterior, you then need to find an approximate ansatz which you think will fit the posterior, or the distribution you're targeting. Whereas SGMCMC is just: log posterior, go. Which is sort of what we're trying to do with Posteriors, we're trying to automate, well, not automate, we're trying to research, of course, so much of that. But normalizing flows might be, yeah, as I said, I think it's really interesting that you can get these more expressive variational families through triangular mappings, yeah.
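The invertibility point is what makes the density tractable. A one-parameter "flow" in plain Python (the simplest possible affine case, just to show the change-of-variables formula):

```python
import math

def log_standard_normal(z):
    return -0.5 * z * z - 0.5 * math.log(2 * math.pi)

def log_density_x(x, scale, shift):
    # Density of x = flow(z) = scale * z + shift with z ~ N(0, 1).
    # Because the flow is invertible we can apply change of variables:
    #   log p_x(x) = log p_z(f^{-1}(x)) - log |df/dz|,
    # and this exact log-density is what gets fit by maximum likelihood.
    z = (x - shift) / scale
    return log_standard_normal(z) - math.log(abs(scale))

# x = 2z + 1 is distributed N(1, 4), so this should match that density.
lp = log_density_x(3.0, scale=2.0, shift=1.0)
```

Real normalizing flows replace the affine map with deep invertible (e.g. triangular) transformations, but the formula being trained is the same.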

Yeah, super interesting. And yeah, amortized Bayesian inference is related, in the sense that you first fit a deep neural network on your model, and then, once it's fit, you get posterior inference for free, basically. So that's quite different from what I understand SGMCMC to be. But that's also extremely interesting. That's also why I'm hammering you down on the different use cases of SGMCMC, so that myself and listeners have kind of a tree in their head of, like, okay, my use case is more appropriate for SGMCMC, or, no, here I'd like to try amortized Bayesian inference, or, here I can just stick to plain vanilla HMC.

I think that's very interesting. But thanks for that; the question was completely improvised. I definitely appreciate you taking the time to rack your brain about the difference with normalizing flows.

No, I'd love to talk more on that. I'd need to refresh myself. I've written down some notes on normalizing flows, and I was quite comfortable with them, but it's just been a while since I refreshed. So I would love to refresh, and then we can chat about them. Because I'd love to do a project on them, or I'd love to work on them, because I think it's a way to fit a distribution to data, which is, after all, a lot of what we do.

Yeah. Yeah. So that makes me think we should probably do another episode about normalizing flows. So, listeners, if there is a researcher you like who does a lot of normalizing flows and you think would be a good guest on the show, please reach out to me and I'll make that happen.

Now let's get you closer to home, Sam, and talk about Posteriors again. Because, basically, if I understood correctly, Posteriors aims to address uncertainty quantification in deep learning. Am I right here? And also, if that's the case, why is this particularly important for neural networks, and how does the package help in managing overconfidence in model predictions?

Yeah, so that's our primary use case. The normal way to use Posteriors is as approximate Bayes: we're getting as close to Bayes as we can, which is probably not that close, but still getting somewhere on the way to the Bayesian posterior in big deep learning models. But we built Posteriors to be as modular and general as possible. So, as I said, if you have a classical Bayesian model, you can write it down in Pyro, but you've got loads of data, then okay, go ahead, Posteriors should be well suited to that. In terms of what advantages we want to see from uncertainty quantification, or this approximate Bayesian inference in deep learning models, there are three sorts of key things that we distilled it down to. So yeah, you mentioned confidence in out-of-distribution predictions. We should be able to improve our performance in predicting on inputs that we haven't seen in the training set. So I'll talk about that after this.

The second one is continual learning, where we think that if you can do Bayes' theorem exactly, you have your prior, you have the likelihood, you get some data, you have a posterior, then you get some more data, and then your posterior becomes your prior and you do the update. And you can just write it like that if you can do Bayes' theorem exactly. And you can extend it even further: with some sort of evolution along your parameters, you have a state space model, and in the exact setting, linear Gaussian, you've got a Kalman filter. So continual learning, in this case, Bayes' theorem does it exactly. And in continual learning research in machine learning settings, they have this term of avoiding catastrophic forgetting. If you just continue to do gradient descent, there's no memory there, so, apart from the initialization, you would just forget what you've done previously, and there's lots of evidence for this. Whereas Bayes' theorem is completely exchangeable in the order of the data that you see. So if you're doing Bayes' theorem exactly, there's no forgetting; you just have the capacity of the model. So that's where we see Bayes solving continual learning, but, as I said, you can't do Bayes' theorem exactly in a billion-dimensional model.
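In a model small enough for exact conjugate updates, the posterior-becomes-prior recursion looks like this (a plain-Python Gaussian-mean sketch with known noise, not the Posteriors API):

```python
def gaussian_update(prior_mean, prior_var, batch, noise_var=1.0):
    # Exact conjugate Bayes update for the mean of a Gaussian with known
    # observation noise: yesterday's posterior is today's prior.
    n = len(batch)
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mean = post_var * (prior_mean / prior_var + sum(batch) / noise_var)
    return post_mean, post_var

# Updating sequentially matches one big update -- no forgetting, and the
# result is exchangeable in the order the data arrive.
m1, v1 = gaussian_update(0.0, 1.0, [1.0, 2.0])
m2, v2 = gaussian_update(m1, v1, [3.0])
m_all, v_all = gaussian_update(0.0, 1.0, [1.0, 2.0, 3.0])
```

Approximate inference in deep models tries to mimic exactly this recursion, which is why it is a candidate fix for catastrophic forgetting.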

And then the last one is, we'll call it decomposition of uncertainty in your predictions. So if you just have a gradient-descent model and you're predicting reviews, someone's reviews, and you have to predict the stars, it will just give you, as you said, your softmax: it'll just give you this distribution over the reviews, and that's that. But what you really want is some indication of, like, out-of-distribution detection, right? You want to know, okay, am I confident in my prediction? And you might get a review that is like, the food was terrible but the service was amazing, or the service was amazing and the food was terrible. And then, let's say we're perfect modelers of this, we know how people review things, but we still have quite a lot of uncertainty on that review, because we don't know how the reviewer values those different things. So we might have a completely uniform distribution over the stars for that review, but we'd be confident in that distribution. But what Bayes gives you is the ability to do a sort of second-order uncertainty quantification: if you have this distribution over parameters, and you have a distribution over logits at the end, the predictions, you can split between what information theory calls aleatoric and epistemic uncertainty. Aleatoric uncertainty, or data uncertainty, is what I just described there, which is natural uncertainty in the model and the data-generating process. Epistemic uncertainty is uncertainty that would be removed in the infinite-data limit; that would be where the model doesn't know. So it's really important for us to quantify that.
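One common way to compute that split from a parameter-space ensemble is the entropy decomposition, where epistemic uncertainty is the mutual information between prediction and parameters (a plain-Python sketch under that standard construction; Posteriors' own utilities may differ):

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def decompose(ensemble_probs):
    # ensemble_probs: one predictive distribution per parameter sample.
    # total     = entropy of the averaged prediction
    # aleatoric = average of the per-sample entropies (data uncertainty)
    # epistemic = total - aleatoric (what more data would remove)
    n, k = len(ensemble_probs), len(ensemble_probs[0])
    mean_p = [sum(p[j] for p in ensemble_probs) / n for j in range(k)]
    total = entropy(mean_p)
    aleatoric = sum(entropy(p) for p in ensemble_probs) / n
    return total, aleatoric, total - aleatoric

# Members agree on a flat distribution: all the uncertainty is aleatoric.
t1, a1, e1 = decompose([[0.5, 0.5], [0.5, 0.5]])
# Members confidently disagree: the uncertainty is epistemic.
t2, a2, e2 = decompose([[0.99, 0.01], [0.01, 0.99]])
```

Both cases have the same uniform averaged prediction, but only the decomposition tells you whether the model is genuinely unsure or simply hasn't seen enough data.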

Okay.

Yeah, I rambled on a bit there, but I can, in like 30 seconds, elaborate on the point you specifically mentioned about out-of-distribution predictions and improving performance out of distribution. And I think that's quite compelling from a Bayesian point of view, because what Bayes says, in like a supervised learning setting, is: gradient descent just fits one parameter, finds one parameter configuration that's plausible given the training data. Bayes' theorem says, I find the whole distribution of parameter configurations that's plausible given the data. And then, when we make predictions, we average over those. So it's perfectly natural to think that a single configuration might overfit, and might just be very confident in its prediction when it sees out-of-distribution data. It doesn't necessarily rescue a bad model, but it should be more honest to the model and the data-generating process you've specified if you average over plausible model configurations under the training data when you do your testing. So that's, to me, quite a compelling argument for improving performance on out-of-distribution predictions, like the accuracy of them. And there's a fair bit of empirical evidence for this, with the caveat, again, being that the Bayesian posterior in high-dimensional models, machine learning models, is pretty hard to approximate: cold posterior effect, caveats, things like these.

Okay, yeah, I see. Yeah, super interesting. So now I understand better what you have on the Posteriors website about the different kinds of uncertainty. So definitely that's something I recommend listeners give a read; I'll put that in the show notes, both your blog post introducing Posteriors and the docs for Posteriors, because I think it makes that clear, combined with your explanation right now. Yeah. And...

Something I was also wondering is, if I understood correctly, the package is built on top of PyTorch, right?

Yeah, that's correct. Yeah.

Okay. So, and also, did I understand correctly that you can integrate Posteriors with pre-trained LLMs like Llama 2 and Mistral, and you do that with the Hugging Face Transformers package?

So, yeah, I mean, Posteriors is open source. We fully support the open source community for machine learning, for statistics. And in terms of, yeah, I mean, we're sort of in the fine-tuning era: there are these open source models and you can't get away from them. We have Llama 2, Llama 3, Mistral, like, yeah. And basically we want to harness this power, right? But, as I mentioned previously, there are some issues that we'd like to remedy with Bayesian techniques. So, the majority of these open source models are built in PyTorch. I'm also a big JAX fan; I use JAX a lot. So I was very happy to see and work with the torch.func sub-library, which basically means you can write your PyTorch code, and you can use Llama 3 or Mistral with PyTorch, but writing functional code.
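A minimal sketch of that functional style with `torch.func` (a tiny linear layer stands in for a downloaded Hugging Face model; the Posteriors-side plumbing is omitted):

```python
import torch
from torch.func import functional_call, grad

# Any nn.Module works the same way, including a Transformers model.
model = torch.nn.Linear(2, 1, bias=False)
params = dict(model.named_parameters())

def loss(p, x, y):
    # Purely functional loss: the parameters are an explicit argument,
    # which is what lets Bayesian-style updates treat them as state
    # rather than mutating the module in place.
    pred = functional_call(model, p, (x,))
    return ((pred - y) ** 2).mean()

x = torch.ones(4, 2)
y = torch.zeros(4, 1)
grads = grad(loss)(params, x, y)  # a dict with the same keys as params
```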

So that's what we've done with Posteriors. So, yeah, Hugging Face Transformers: you can download the models, that's where they're all hosted and how you access them. But then what you get is just a PyTorch model. It's just a PyTorch model. And then you throw that into Posteriors and it all works nicely with the Posteriors updates. Or you write your own new updates in the Posteriors framework, and you can use that as well, still with Llama 3 or Mistral. Yeah.

Okay. Nice. And so, what does that mean concretely for users? That means you can use these pre-trained LLMs with Posteriors, and that means adding a layer of uncertainty quantification on top of those models?

Yeah. So you need, I mean, Bayes' theorem is a training theorem, so you need data as well. So you take your pre-trained model, which is, yeah, a transformer, or it could be another type of model, it could be an image model or something like that, and then you give it some new data, which we would say is fine-tuning, and then you use Posteriors to combine the two, and then you have your new model out at the end of the day, which has uncertainty quantification. It's difficult; as I said, we're sort of in this fine-tuning era of open-source large language models. There's still lots of research to do here, and it's different to our classical Bayesian regime, where there's only one source of data and it's what we give it. In this case, there are two sources of data, because you have whatever Llama 3 saw in its original training data, and then it has your own data. Can we hope to get uncertainty quantification on the data that they used in the original training? Probably not, but we might be able to get uncertainty quantification and improved predictions based on the data that we've contributed. So there's lots for us to try out here and learn, because we are still learning on this in terms of the fine-tuning. But yeah, this is what Posteriors is there for: to make these sorts of questions as easy as possible to ask and answer.

Okay, fantastic. Yeah, that's so exciting. It's just, it's a bit frustrating to me, because I'm like, I'd love to try that and learn on that and contribute to that kind of package. At the same time, I have to work, I have to do the podcast, and I have all the packages I'm already contributing to. So I'm like, my god, too many choices.

No, come on, Alex. We're gonna see an Alex pull request soon enough.

Actually, does this ability to use these pre-trained transformer models help facilitate the adoption of new algorithms in Posteriors? Because, if I understand correctly, you can support new algorithms pretty easily, and you can support arbitrary likelihoods. How do you do that?

I wouldn't say that the existence of the pre-trained models necessarily allows us to support new algorithms. I feel like we've built Posteriors to be suitably general and suitably modular that it's kind of agnostic to your model choice and your log-posterior choice, in terms of arbitrary likelihoods. But yeah, that's a benefit; the arbitrary likelihood is relevant, because a lot of machine learning essentially just boils down to classification or regression. And that is true. And because of that, a lot of machine learning packages will essentially constrain you to classification or regression: at the end, you either have your softmax cross-entropy or you have your mean squared error. In Posteriors, we haven't done that. We're more faithful to the Bayesian setting, where you just write down your log posterior, and you can write down whatever you want.
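For instance, a bespoke log posterior with a known observation-noise variance takes only a few lines (an illustrative plain-Python sketch; in a Posteriors workflow this would be the function of parameters and a batch that you hand to the library):

```python
import math

def log_posterior(theta, batch, noise_var=0.25):
    # Standard normal prior plus a Gaussian likelihood whose observation
    # noise variance is KNOWN -- the kind of bespoke likelihood that does
    # not fit a fixed classification-or-regression mould.
    log_prior = -0.5 * theta ** 2 - 0.5 * math.log(2 * math.pi)
    log_lik = sum(
        -0.5 * (y - theta) ** 2 / noise_var
        - 0.5 * math.log(2 * math.pi * noise_var)
        for y in batch
    )
    return log_prior + log_lik
```

Swapping in a multinomial, sequence-level, or any other likelihood just means editing this one function.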

And this allows you greater flexibility in the case you did want to try out a different likelihood. Even in simple cases, it's just more sophisticated than plain classification or regression a lot of the time, like in sequence generation, where you have the sequence and then you have the cross-entropy over all of that. It just allows you to be more flexible and write the code how you want. And there are additional things to be taken into account: sometimes, if you were doing a regression, you might have knowledge of the noise variance, the observation-noise variance, and if we don't constrain things like this, it's just much easier to write your code, much cleaner code. And it's also future-proofing: we don't know what's going to be happening going forward. In multimodal models, we may see text and images together, in which case, yeah, we will support that. You have to supply the compute and the data, which might be the harder thing, but we'll support those likelihoods.

Okay, I see. I see. Yeah, that's very, very interesting. And related to that, I think I've read in your blog post, or on the website, that you say that Posteriors is swappable. What does that mean? And how does that flexibility benefit users?

Yeah. So, I mean, the point of swappable, when I say that, is that you can change between methods. As I said, Posteriors is a research toolbox, and it's there to investigate which inference method is appropriate in different settings, which might be different if you care about decomposing predictive uncertainty, and might be different if you care about avoiding catastrophic forgetting in your continual learning. So the thing is, the way it's written, you can just swap: you can go from SGHMC to the Laplace approximation, or you can go to VI, just by changing one line of code. And the way it works is, you have your build: you have your transform equals posteriors, inference method, dot build, and then any configuration arguments, step size, things like this, which are algorithm-specific. And then after that it's all unified. So you just have your init around the parameters that you want to do Bayes on, and then you iterate through your data loader, you iterate through your data, and it just updates based on the batch. And batch can be very general. So that's what it means: you can just change one line of code to swap between variational inference and SGHMC, or the extended Kalman filter, or any and all the new methods that the listeners are going to add in the future.
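The shape of that build/init/update pattern, in a toy stdlib-Python form (the real Posteriors module and argument names live in its docs; everything here is illustrative, with two optimizers standing in for inference methods):

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Transform:
    init: Callable[[Any], Any]
    update: Callable[[Any, Any], Any]

def build_sgd(lr):
    # "Method" 1: plain gradient descent; the state is just the parameters.
    return Transform(
        init=lambda theta: theta,
        update=lambda theta, batch_grad: theta - lr * batch_grad,
    )

def build_momentum(lr, beta=0.9):
    # "Method" 2: SGD with momentum; the state is (parameters, momentum).
    def update(state, batch_grad):
        theta, m = state
        m = beta * m + batch_grad
        return theta - lr * m, m
    return Transform(init=lambda theta: (theta, 0.0), update=update)

# Swapping methods is this one line; the init/update loop never changes.
transform = build_sgd(lr=0.25)

state = transform.init(1.0)
for batch_grad in [2.0, 2.0]:
    state = transform.update(state, batch_grad)
```

Because every method exposes the same `init`/`update` pair, the training loop is agnostic to which inference algorithm is plugged in.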

959

:

Heh.

960

:

Okay.

961

:

Okay.

962

:

I see.

963

:

And so I have so many more questions for you on Posteriors, but let's start to

964

:

wrap that up, because I also want to ask you about another project you're working

965

:

on. So maybe, to close that up on Posteriors:

966

:

What are the future plans for Posteriors, and are there any upcoming features or

967

:

integrations that you can share with us?

968

:

So we're quite happy with the framework at

the moment.

969

:

There's lots of little tweaks; we have a list of GitHub issues that we want to go

970

:

through, which are mostly, and excitingly, about adding new methods and new

971

:

applications.

972

:

So what we're really excited about now is actually using it in the wild and

973

:

hopefully experimenting with all these questions that we've discussed.

974

:

Yeah, like, how does it make sense, and how do we get the benefits of

975

:

true Bayesian inference on fine-tuning, or on large models or large data.

976

:

And so yeah, we are really excited to add more methods.

977

:

So if listeners have mini-batch, big-data Bayesian methods that they want to

978

:

try out with a large model and large data, then hopefully we'll accept them.

979

:

I do.

980

:

I do promote generality, and doing it in a way that is sort of

981

:

flexible and stuff.

982

:

So we may have to think about it a lot.

983

:

We want to add methods that somehow feel natural, and one way

984

:

is to extend and compose with other

methods.

985

:

So it might be that if we've got some very complicated last-layer method that

986

:

requires classes just for classification, we're probably not going to add

987

:

it.

988

:

So it has to be methods that stick within the Posteriors framework, which is this

989

:

arbitrary-likelihood, swappable Bayesian computation.

990

:

Okay.

991

:

Okay.

992

:

Yeah.

993

:

Yeah, that makes sense.

994

:

Yeah, because you have that kind of vision of wanting to do

995

:

that and having it as a research tool, basically.

996

:

So

997

:

Yeah, that makes sense to keep that under

control, let's say.

998

:

Something I want to ask you in the last

few minutes of the show is about

999

:

thermodynamic computing.

:

00:58:34,344 --> 00:58:37,074

I've seen you, you are working on that.

:

00:58:37,074 --> 00:58:39,204

And you've told me you're working on that.

:

00:58:39,204 --> 00:58:41,094

So yeah, I don't know anything about that.

:

00:58:41,094 --> 00:58:43,604

So can you like, what's that about?

:

00:58:43,824 --> 00:58:47,584

Yeah, so I mean, this is something that's very Normal

:

00:58:47,584 --> 00:58:48,044

computing.

:

00:58:48,044 --> 00:58:49,744

And it's like,

:

00:58:50,807 --> 00:58:52,378

It's something that we have.

:

00:58:52,378 --> 00:58:53,728

Yeah, we have this hardware team.

:

00:58:53,728 --> 00:58:55,428

It's like a full stack AI company.

:

00:58:55,428 --> 00:59:00,128

And yeah, on the Posteriors side, on the client side, we look at how we can

:

00:59:00,128 --> 00:59:06,468

bring in principled Bayesian uncertainty quantification and help solve the

:

00:59:06,468 --> 00:59:10,228

issues with machine learning pipelines

like we've already discussed.

:

00:59:10,228 --> 00:59:12,548

And on the other side, there's lots of

parts to this.

:

00:59:12,548 --> 00:59:16,204

One part is just that traditional MCMC is difficult sometimes.

:

00:59:16,204 --> 00:59:19,824

It's just about simulating SDEs, essentially; that's what the thermodynamic

:

00:59:19,824 --> 00:59:25,664

hardware is doing, simulating SDEs. Normally, you have this real pain with the step size, and

:

00:59:25,664 --> 00:59:32,344

as the dimension grows, steps get really small. And so SDEs, where do we see

:

00:59:32,344 --> 00:59:32,634

SDEs?

:

00:59:32,634 --> 00:59:37,544

You see SDEs in physics all the time, and physics is real; we can use physics. So it's

:

00:59:37,544 --> 00:59:43,904

building physical hardware, analog hardware, that hopefully

:

00:59:43,904 --> 00:59:45,100

evolves as SDEs

:

00:59:45,100 --> 00:59:49,920

Then we can harness those SDEs by encoding, you know, currents and voltages and

:

00:59:49,920 --> 00:59:50,350

things like that.

:

00:59:50,350 --> 00:59:52,760

So I'm not a physicist, so I don't know

exactly how it is.

:

00:59:52,760 --> 00:59:57,800

But I'm always reassured, when I speak to the hardware team, by how simply

:

00:59:57,800 --> 01:00:00,900

they talk about these things. It's like, yeah, we can just stick some resistors and

:

01:00:00,900 --> 01:00:03,850

capacitors on a chip, and then it'll do this SDE.

:

01:00:03,850 --> 01:00:08,100

So that's the idea: we want to use those SDEs for scientific computation.

:

01:00:08,100 --> 01:00:11,730

And with a real focus on statistics and

machine learning.

:

01:00:11,730 --> 01:00:14,476

So yeah, we want to be able to do HMC

01:00:14,476 --> 01:00:17,216

on device, on an analog device.

:

01:00:17,216 --> 01:00:21,976

The first step is to do it with something linear, so we'll have a Gaussian posterior

:

01:00:21,976 --> 01:00:24,816

or with a linear drift in terms of this.

:

01:00:24,816 --> 01:00:29,736

This is an Ornstein-Uhlenbeck process, and we've developed hardware to do this. It

:

01:00:29,736 --> 01:00:33,596

turns out that with an Ornstein-Uhlenbeck process, because it has a Gaussian

:

01:00:33,596 --> 01:00:37,196

stationary distribution, you can input the

:

01:00:37,196 --> 01:00:40,972

precision matrix and output the covariance matrix; that's matrix inversion.

:

01:00:40,972 --> 01:00:45,432

And your physical device just does this.
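
The idea can be checked with a toy digital simulation (an illustrative assumption: Euler-Maruyama discretisation in NumPy; the actual chip is analog hardware, not this loop). The OU process dX = -A X dt + sqrt(2) dW, with A symmetric positive definite, has stationary distribution N(0, A^-1), so running it long enough turns the precision matrix you feed in as the drift into the covariance matrix you read out.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])       # input: precision matrix (SPD)

dt, n_steps = 0.005, 200_000
x = np.zeros(2)
samples = np.empty((n_steps, 2))
for t in range(n_steps):
    # Euler-Maruyama step of dX = -A X dt + sqrt(2) dW
    x = x - (A @ x) * dt + np.sqrt(2.0 * dt) * rng.standard_normal(2)
    samples[t] = x

burn = n_steps // 10                 # discard the transient
emp_cov = np.cov(samples[burn:].T)   # output: empirical covariance

# emp_cov should roughly match np.linalg.inv(A), up to Monte Carlo
# and discretisation error: matrix inversion by simulating physics.
```

The noise here is doing useful work, which is the contrast Sam draws with classical analog computing, where noise is purely an error source.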

:

01:00:45,432 --> 01:00:50,152

And because it's an SDE, it has noise and is kind of noise-aware, which is

:

01:00:50,152 --> 01:00:54,972

different to classical analog computation, which

:

01:00:54,972 --> 01:00:57,572

is really old but has historically been plagued by noise.

:

01:00:57,572 --> 01:01:00,172

And it's like, yeah, there's all this

noise in physics.

:

01:01:00,172 --> 01:01:03,752

And because we're doing SDEs, we want the

noise.

:

01:01:03,752 --> 01:01:05,062

So yeah, that's the whole idea.

:

01:01:05,062 --> 01:01:07,012

It's obviously very young, but it's fun.

:

01:01:07,012 --> 01:01:07,832

It's fun stuff.

:

01:01:07,832 --> 01:01:08,112

Yeah.

:

01:01:08,112 --> 01:01:10,434

So that's basically to...

:

01:01:11,288 --> 01:01:12,928

accelerate computing?

:

01:01:13,648 --> 01:01:18,528

That's hardware first, so that computing

is accelerated?

:

01:01:18,908 --> 01:01:21,878

We want to, I mean, it's a baby field.

:

01:01:21,878 --> 01:01:24,568

So we're trying to accelerate different

components.

:

01:01:24,568 --> 01:01:28,568

What we worked out is that the simplest thermodynamic chip we can build is this

:

01:01:28,568 --> 01:01:30,528

linear chip with the Ornstein-Uhlenbeck process.

:

01:01:30,528 --> 01:01:34,092

And that can speed things up, with some error.

:

01:01:34,092 --> 01:01:38,432

Some error, but it has asymptotic speed-ups for linear algebra routines, so

:

01:01:38,432 --> 01:01:41,112

inverting a matrix or solving a linear

system.

:

01:01:41,252 --> 01:01:42,792

That's awesome.

:

01:01:44,712 --> 01:01:48,952

In this case, it would speed up a certain

component, but that could be useful in a

:

01:01:48,952 --> 01:01:52,912

Laplace approximation or these sorts of things, also in machine learning.

:

01:01:52,912 --> 01:01:57,272

Okay, that must be very fun to work on.

:

01:01:57,652 --> 01:02:02,552

Do you have any writing about that that we

can put in the show notes?

:

01:02:02,552 --> 01:02:03,244

Because

:

01:02:03,244 --> 01:02:06,164

I think it'd be super interesting for

listeners.

:

01:02:06,604 --> 01:02:07,224

Yeah, yeah.

:

01:02:07,224 --> 01:02:12,184

We've got the Normal Computing Scholar page, which has a list of papers, but we also

:

01:02:12,184 --> 01:02:16,404

have more accessible blogs, which I'll make sure to put in the show notes.

:

01:02:16,424 --> 01:02:21,404

Yeah, yeah, please do because, yeah, I

think it's super interesting.

:

01:02:22,004 --> 01:02:25,514

And yeah, and when you have something to

present on that, feel free to reach out.

:

01:02:25,514 --> 01:02:29,244

And I think that'd be fun to do an episode

about that, honestly.

:

01:02:29,244 --> 01:02:30,324

That'd be great.

:

01:02:30,424 --> 01:02:31,124

Yeah.

:

01:02:32,684 --> 01:02:36,984

Yes, so maybe one last question before

asking you the last two questions.

:

01:02:37,244 --> 01:02:40,174

Like, let's zoom out and be way less technical.

:

01:02:40,174 --> 01:02:44,624

We've been very technical through the

whole episode, which I love.

:

01:02:45,084 --> 01:02:52,684

But maybe I'm thinking if you have any

advice to give to aspiring developers

:

01:02:52,684 --> 01:02:57,484

interested in contributing to open source projects like Posteriors, what would it

:

01:02:57,484 --> 01:02:58,424

be?

:

01:03:00,812 --> 01:03:05,632

Okay, yeah, I don't know, I don't feel like I'm necessarily the best placed to say

:

01:03:05,632 --> 01:03:10,352

all this, but yeah, I mean, I would just,

the most important thing is just to go for

:

01:03:10,352 --> 01:03:17,292

it, just get stuck in, get in the weeds of

these libraries and see what's there.

:

01:03:17,292 --> 01:03:22,852

And there's loads of people building such

cool stuff in the open source ecosystem

:

01:03:22,852 --> 01:03:25,872

and honestly, it's really fun and rewarding to get involved

:

01:03:25,872 --> 01:03:26,182

in it.

:

01:03:26,182 --> 01:03:28,832

So just go for it, you'll learn so much

along the way.

:

01:03:29,280 --> 01:03:30,960

Something more tangible:

:

01:03:31,200 --> 01:03:35,300

I find that when I'm stuck, when I don't understand

:

01:03:35,300 --> 01:03:40,300

something in code or mathematics, I often struggle to find it in papers per

:

01:03:40,300 --> 01:03:40,420

se.

:

01:03:40,420 --> 01:03:44,620

And textbooks, I love textbooks; I find textbooks a real

:

01:03:44,620 --> 01:03:48,020

gold mine for these, because they actually go to the depths of explaining

:

01:03:48,020 --> 01:03:53,600

things, without this sort of horse-in-the-race style of writing that you often

:

01:03:53,600 --> 01:03:54,360

find in papers.

:

01:03:54,360 --> 01:03:58,348

So yeah, get stuck in, and check textbooks if you get lost.

:

01:03:58,348 --> 01:03:59,138

Or if you don't understand something.

:

01:03:59,138 --> 01:04:00,218

Or just ask as well.

:

01:04:00,218 --> 01:04:04,048

Open source is all about asking and

communicating and bouncing ideas.

:

01:04:04,188 --> 01:04:05,768

Yeah, yeah, yeah, for sure.

:

01:04:05,768 --> 01:04:07,208

Yeah, that's usually what I do.

:

01:04:07,208 --> 01:04:13,308

I ask a lot and I usually end up

surrounding myself with people way smarter

:

01:04:13,308 --> 01:04:14,508

than me.

:

01:04:14,508 --> 01:04:16,848

And that's exactly what you want.

:

01:04:17,508 --> 01:04:19,448

That's exactly how I learned.

:

01:04:19,848 --> 01:04:26,428

Yeah, textbooks though, I would say I kind of find the writing boring most of the time;

:

01:04:26,428 --> 01:04:28,168

it depends on the textbook.

:

01:04:28,236 --> 01:04:30,636

And also, they're expensive.

:

01:04:31,516 --> 01:04:32,216

Yeah.

:

01:04:32,716 --> 01:04:35,336

So that's kind of the problem of

textbooks, I would say.

:

01:04:35,336 --> 01:04:40,676

I mean, you can often have them as PDFs, but I just hate reading a PDF on my

:

01:04:40,676 --> 01:04:41,636

computer.

:

01:04:41,636 --> 01:04:48,416

So, you know, I want the book object, or having it on Kindle or something like

:

01:04:48,416 --> 01:04:48,556

that.

:

01:04:48,556 --> 01:04:51,416

But that doesn't really exist yet.

:

01:04:51,416 --> 01:04:52,016

So.

:

01:04:54,252 --> 01:05:00,972

Could be something that some publishers solve someday; that'd be cool, I'd love that.

:

01:05:00,972 --> 01:05:09,072

Awesome, Sam, that was great, thank you so much. We've covered so many topics, and my

:

01:05:09,072 --> 01:05:15,072

brain is burning, so that's a very good sign. I've learned a lot, and I'm sure our

:

01:05:15,072 --> 01:05:19,832

listeners did too. Of course, before letting you go, I'm gonna ask you the last

:

01:05:19,832 --> 01:05:23,820

two questions I ask every guest at the end of the show. So, number one:

:

01:05:23,820 --> 01:05:28,600

If you had unlimited time and resources,

which problem would you try to solve?

:

01:05:53,676 --> 01:05:58,836

You want to decouple the model specification, the data-generating process, how you go

:

01:05:58,836 --> 01:06:02,296

from something you don't know to the data you do have.

:

01:06:02,376 --> 01:06:04,216

That's your freedom as a data modeler.

:

01:06:04,216 --> 01:06:07,916

And you define that separately from the inference and the mathematical

:

01:06:07,916 --> 01:06:08,616

computation.

:

01:06:08,616 --> 01:06:12,316

So that's whatever way you do your approximate Bayesian inference.

:

01:06:12,316 --> 01:06:13,396

And you want to decouple those.

:

01:06:13,396 --> 01:06:14,966

You want to make it as easy as possible.

:

01:06:14,966 --> 01:06:16,416

Ideally, we just want to be doing that

one.

:

01:06:16,416 --> 01:06:19,596

We just want to be doing the model

specification.

:

01:06:19,656 --> 01:06:22,246

And Stan and PyMC do this really well.

:

01:06:22,246 --> 01:06:22,860

It's just like,

:

01:06:22,860 --> 01:06:25,160

you write down your model, we'll handle

the rest.

:

01:06:25,160 --> 01:06:28,820

And that's kind of the dream we have as Bayesians, or Bayesian software

:

01:06:28,820 --> 01:06:29,920

developers.

:

01:06:31,020 --> 01:06:36,700

And so with Posteriors, we're trying to do something like this, to

:

01:06:36,700 --> 01:06:42,780

move towards this for big machine learning models, so bigger models,

:

01:06:42,780 --> 01:06:44,220

bigger data settings.

:

01:06:44,360 --> 01:06:46,420

So that's kind of the dream there.

:

01:06:46,420 --> 01:06:50,080

But then in machine learning, what does machine learning do differently to

:

01:06:50,080 --> 01:06:51,404

statistics in that setting?

:

01:06:51,404 --> 01:06:56,484

It's like, well, machine learning models

are less interesting than classical

:

01:06:56,484 --> 01:06:57,564

Bayesian models.

:

01:06:57,984 --> 01:07:01,384

The thing is they're more transferable,

right?

:

01:07:01,384 --> 01:07:06,064

It's just a neural network, which in machine learning we believe will solve

:

01:07:06,064 --> 01:07:07,624

a whole suite of tasks.

:

01:07:07,624 --> 01:07:11,744

So perhaps in terms of the machine

learning setting, where we decouple

:

01:07:11,744 --> 01:07:15,464

modeling and inference and data, you kind

of want to remove the model one as well.

:

01:07:15,464 --> 01:07:18,572

You want to have these general purpose

foundational models, you could say.

:

01:07:18,572 --> 01:07:20,892

So really you want to let the user focus.

:

01:07:20,892 --> 01:07:22,632

And so we're handling the inference.

:

01:07:22,632 --> 01:07:23,582

We're also handling the model.

:

01:07:23,582 --> 01:07:27,972

So really let the user just give it the

data and say, okay, let's do this data and

:

01:07:27,972 --> 01:07:30,882

let's use this data to predict other

things and let the user handle that.

:

01:07:30,882 --> 01:07:35,312

So that's potentially a real unlimited-time-and-resources project.

:

01:07:35,312 --> 01:07:37,152

Plenty of resources needed to do that.

:

01:07:37,152 --> 01:07:43,672

But yeah, that's Sam May:

:

01:07:44,712 --> 01:07:45,792

Yeah.

:

01:07:46,172 --> 01:07:47,852

Yeah, that sounds...

:

01:07:47,852 --> 01:07:49,352

That sounds amazing.

:

01:07:49,352 --> 01:07:50,402

I agree with that.

:

01:07:50,402 --> 01:07:52,872

That's a fantastic goal.

:

01:07:53,532 --> 01:07:57,932

And yeah, also that reminds me, that's

also why I really love what you guys are

:

01:07:57,932 --> 01:08:06,452

doing with Posteriors, because, yeah, now that we start being

:

01:08:06,452 --> 01:08:13,032

able to get there, making Bayesian inference really scalable to really big

:

01:08:13,032 --> 01:08:14,492

data and big models.

:

01:08:15,072 --> 01:08:17,356

I'm super enthusiastic about that.

:

01:08:17,356 --> 01:08:20,676

It would be just fantastic.

:

01:08:21,536 --> 01:08:26,056

So thank you so much for taking the time

to do that guys.

:

01:08:26,056 --> 01:08:28,716

Yeah we're doing it, we're gonna get

there.

:

01:08:28,716 --> 01:08:31,456

Yeah yeah yeah I love that.

:

01:08:31,756 --> 01:08:36,736

And second question: if you could have dinner with any great scientific mind, dead,

:

01:08:36,736 --> 01:08:39,976

alive, or fictional, who would it be?

:

01:08:41,256 --> 01:08:44,236

Yeah, I was a bit intimidated by this question.

:

01:08:44,236 --> 01:08:45,228

Yeah you know you ask everyone.

:

01:08:45,228 --> 01:08:46,388

again, it's a great question.

:

01:08:46,388 --> 01:08:48,548

But then I thought about it for a little

bit.

:

01:08:48,548 --> 01:08:49,928

And it wasn't too hard for me.

:

01:08:49,928 --> 01:08:55,548

I think that David MacKay is someone who, yeah, I mean, has done amazing work.

:

01:08:55,548 --> 01:08:58,448

David MacKay was doing Bayesian neural networks in:

:

01:08:58,768 --> 01:09:02,968

And that's, yeah, like crazy, before I was born.

:

01:09:03,068 --> 01:09:10,248

Bayesian neural networks in:

then I've just been going through his

:

01:09:10,248 --> 01:09:13,508

textbook; as I said, I love textbooks, so going through his textbook on information

:

01:09:13,508 --> 01:09:14,316

theory and

:

01:09:14,316 --> 01:09:18,276

Bayesian statistics; he is, or was, a Bayesian; information theory and

:

01:09:18,276 --> 01:09:18,476

statistics.

:

01:09:18,476 --> 01:09:21,076

And there's something he says right at the start of the textbook. It's

:

01:09:21,076 --> 01:09:25,616

like, one of the themes of this book is

that data compression and data modeling

:

01:09:25,616 --> 01:09:26,626

are one and the same.

:

01:09:26,626 --> 01:09:27,936

And that's just really beautiful.

:

01:09:27,936 --> 01:09:32,916

And we talked about stream codes, which is a very information-theory-style setting,

:

01:09:32,916 --> 01:09:36,336

but it's just an auto-regressive prediction model, just like a language

:

01:09:36,336 --> 01:09:37,056

model.

:

01:09:37,056 --> 01:09:42,796

So it's just someone with the ability to

:

01:09:43,404 --> 01:09:47,914

distill information, help the unification, and be so ahead of their time.

:

01:09:47,914 --> 01:09:52,524

And then additionally, with a sort of like

groundbreaking book on sustainable energy.

:

01:09:52,524 --> 01:09:58,184

So also tackling one of the greatest challenges we have at the moment.

:

01:09:58,244 --> 01:10:01,884

So yeah, the sustainable energy book is really wonderful.

:

01:10:01,884 --> 01:10:04,064

It's one of my favorite books so far.

:

01:10:04,144 --> 01:10:04,584

Nice.

:

01:10:04,584 --> 01:10:06,844

Yeah, definitely put that in the show

notes.

:

01:10:06,844 --> 01:10:07,564

I think.

:

01:10:07,564 --> 01:10:08,604

Yes, definitely.

:

01:10:08,604 --> 01:10:08,934

Yeah.

:

01:10:08,934 --> 01:10:11,164

Yeah, I'd like to get that one to read.

:

01:10:11,164 --> 01:10:11,436

So

:

01:10:11,436 --> 01:10:14,996

Yeah, please also put that in the show notes; that's going to be fantastic.

:

01:10:15,776 --> 01:10:16,236

Great.

:

01:10:16,236 --> 01:10:19,586

Well, I think we can call it a show.

:

01:10:19,586 --> 01:10:20,796

That was fantastic.

:

01:10:20,796 --> 01:10:22,976

Thank you so much, Sam.

:

01:10:24,516 --> 01:10:29,656

I learned so much and now I feel like I

have to go and read and learn about so

:

01:10:29,656 --> 01:10:30,996

many things.

:

01:10:30,996 --> 01:10:36,856

And I can definitely tell that you are extremely passionate about what you're doing.

:

01:10:37,036 --> 01:10:39,356

So yeah, thank you so much for

:

01:10:39,404 --> 01:10:41,964

taking the time and being on this show.

:

01:10:42,644 --> 01:10:43,414

No, thank you very much.

:

01:10:43,414 --> 01:10:44,264

I had a lot of fun.

:

01:10:44,264 --> 01:10:44,724

Yeah.

:

01:10:44,724 --> 01:10:48,454

Thank you for, yeah, being party to my rantings.

:

01:10:48,454 --> 01:10:50,264

I need that sometimes.

:

01:10:50,904 --> 01:10:53,164

Yeah, that's what the show is about.

:

01:10:53,624 --> 01:10:58,504

My girlfriend is extremely, extremely happy that I have this show to rant about

:

01:10:58,504 --> 01:11:01,084

Bayesian stats and any nerdy stuff.

:

01:11:02,664 --> 01:11:04,464

Yeah, it's so true, yeah.

:

01:11:04,844 --> 01:11:06,104

Well, Sam, you're welcome.

:

01:11:06,104 --> 01:11:08,944

Anytime you need to go on a nerdy rant.

:

01:11:10,356 --> 01:11:10,796

thank you.

:

01:11:10,796 --> 01:11:12,156

I'm sure I'll be...

:

01:11:15,596 --> 01:11:19,296

This has been another episode of Learning

Bayesian Statistics.

:

01:11:19,296 --> 01:11:24,236

Be sure to rate, review, and follow the

show on your favorite podcatcher, and

:

01:11:24,236 --> 01:11:29,176

visit learnbayestats.com for more

resources about today's topics, as well as

:

01:11:29,176 --> 01:11:33,896

access to more episodes to help you reach

true Bayesian state of mind.

:

01:11:33,896 --> 01:11:35,856

That's learnbayestats.com.

:

01:11:35,856 --> 01:11:38,756

Our theme music is Good Bayesian by Baba

Brinkman.

:

01:11:38,756 --> 01:11:40,676

Feat. MC Lars and Mega Ran.

:

01:11:40,676 --> 01:11:43,826

Check out his awesome work at bababrinkman.com.

:

01:11:43,826 --> 01:11:45,004

I'm your host.

:

01:11:45,004 --> 01:11:45,994

Alex Andorra.

:

01:11:45,994 --> 01:11:50,224

You can follow me on Twitter at Alex

underscore Andorra, like the country.

:

01:11:50,224 --> 01:11:55,304

You can support the show and unlock

exclusive benefits by visiting Patreon

:

01:11:55,304 --> 01:11:57,504

.com slash LearnBayesStats.

:

01:11:57,504 --> 01:11:59,924

Thank you so much for listening and for

your support.

:

01:11:59,924 --> 01:12:02,164

You're truly a good Bayesian.

:

01:12:02,164 --> 01:12:05,684

Change your predictions after taking

information in.

:

01:12:05,684 --> 01:12:12,332

And if you're thinking of me less than

amazing, let's adjust those expectations.

:

01:12:12,332 --> 01:12:17,712

Let me show you how to be a good Bayesian

Change calculations after taking fresh

:

01:12:17,712 --> 01:12:23,732

data in Those predictions that your brain

is making Let's get them on a solid

:

01:12:23,732 --> 01:12:25,452

foundation
