Proudly sponsored by PyMC Labs, the Bayesian Consultancy. Book a call, or get in touch!
Our theme music is « Good Bayesian », by Baba Brinkman (feat MC Lars and Mega Ran). Check out his awesome work!
Visit our Patreon page to unlock exclusive Bayesian swag ;)
Takeaways:
Chapters:
13:17 Understanding DADVI: A New Approach
21:54 Mean Field Variational Inference Explained
26:38 Linear Response and Covariance Estimation
31:21 Deterministic vs Stochastic Optimization in DADVI
35:00 Understanding DADVI and Its Optimization Landscape
37:59 Theoretical Insights and Practical Applications of DADVI
42:12 Comparative Performance of DADVI in Real Applications
45:03 Challenges and Effectiveness of DADVI in Various Models
48:51 Exploring Future Directions for Variational Inference
53:04 Final Thoughts and Advice for Practitioners
Thank you to my Patrons for making this episode possible!
Yusuke Saito, Avi Bryant, Giuliano Cruz, James Wade, Tradd Salvo, William Benton, James Ahloy, Robin Taylor, Chad Scherrer, Zwelithini Tunyiswa, Bertrand Wilden, James Thompson, Stephen Oates, Gian Luca Di Tanna, Jack Wells, Matthew Maldonado, Ian Costley, Ally Salim, Larry Gill, Ian Moran, Paul Oreto, Colin Caprani, Colin Carroll, Nathaniel Burbank, Michael Osthege, Rémi Louf, Clive Edelsten, Henri Wallen, Hugo Botha, Vinh Nguyen, Marcin Elantkowski, Adam C. Smith, Will Kurt, Andrew Moskowitz, Hector Munoz, Marco Gorelli, Simon Kessell, Bradley Rode, Patrick Kelley, Rick Anderson, Casper de Bruin, Michael Hankin, Cameron Smith, Tomáš Frýda, Ryan Wesslen, Andreas Netti, Riley King, Yoshiyuki Hamajima, Sven De Maeyer, Michael DeCrescenzo, Fergal M, Mason Yahr, Naoya Kanai, Aubrey Clayton, Omri Har Shemesh, Scott Anthony Robson, Robert Yolken, Or Duek, Pavel Dusek, Paul Cox, Andreas Kröpelin, Raphaël R, Nicolas Rode, Gabriel Stechschulte, Arkady, Kurt TeKolste, Marcus Nölke, Maggi Mackintosh, Grant Pezzolesi, Joshua Meehl, Javier Sabio, Kristian Higgins, Matt Rosinski, Luis Fonseca, Dante Gates, Matt Niccolls, Maksim Kuznecov, Michael Thomas, Luke Gorrie, Cory Kiser, Julio, Edvin Saveljev, Frederick Ayala, Jeffrey Powell, Gal Kampel, Adan Romero, Will Geary, Blake Walters, Jonathan Morgan, Francesco Madrisotti, Ivy Huang, Gary Clarke, Robert Flannery, Rasmus Hindström, Stefan, Corey Abshire, Mike Loncaric, David McCormick, Ronald Legere, Sergio Dolia, Michael Cao, Yiğit Aşık, Suyog Chandramouli and Guillaume Berthon.
Links from the show:
Transcript
This is an automatic transcript and may therefore contain errors. Please get in touch if you're willing to correct them.
Today, we're diving into the world of variational inference with a twist, a deterministic
one.
2
:My guest is Martin Ingram, a data scientist and Bayesian researcher whose work bridges physics, probabilistic modeling, and modern machine learning.
3
:And if you've ever struggled with the quirks of ADVI or wondered how to make variational inference more stable, more predictable, and frankly more trustworthy,
4
:this episode is for you.
5
:We also explore where DADVI shines: mixed models, hierarchical structures, large data sets, and places where you want quick approximate inference without giving up model
6
:complexity.
7
:Martin shares what the theoretical results taught him, why empirical validation is non-negotiable, and how normalizing flows may influence the next evolution of DADVI.
8
:This is Learning Bayesian Statistics, episode 147,
9
:recorded October 16, 2025.
10
:Welcome to Learning Bayesian Statistics, a podcast about Bayesian inference, the methods,
the projects, and the people who make it possible.
11
:I'm your host, Alex Andorra.
12
:You can follow me on Twitter at Alex underscore and Dora like the country for any info
about the show.
13
:Learnbayesstats.com is Laplace to be.
14
:Show notes, becoming a corporate sponsor, unlocking Bayesian merch, supporting the show on Patreon.
15
:Everything is in there.
16
:That's learnbayesstats.com.
17
:If you're interested in one-on-one mentorship, online courses or statistical consulting,
feel free to reach out and book a call at topmate.io slash Alex underscore and Dora.
18
:See you around.
19
:folks, and best Bayesian wishes to you all.
20
:And if today's discussion sparked ideas for your business, well, our team at PyMC Labs can help bring them to life.
21
:Check us out at pymc-labs.com.
22
:Welcome to Learning Bayesian Statistics.
23
:uh
24
:When I was studying in Berlin though, I had courses, I had master's level courses in
political science.
25
:I was able to follow in German.
26
:Oh, that's pretty impressive.
27
:That's like long sentences and stuff in political science.
28
:With long words.
29
:Well, I mean, actually I struggled with German a bit too, because I did all my studies in
English, you know?
30
:And then I came back here and I started interviewing for German jobs.
31
:And I was like, what word, you know, how do I say like...
32
:Gaussian process or something in German, you know, but yeah, it's not that easy.
33
:Yeah, yeah, no, no, no, for sure.
34
:So let's keep it in English.
35
:So, but yeah, it's great to have you here today.
36
:You've done great and interesting work, especially in the PyMC ecosystem.
37
:But maybe can you tell us what you're doing nowadays and how you ended up working on that?
38
:Yeah, sure.
39
:Thank you.
40
:Well, thanks so much for inviting me.
41
:Also, yeah, I'm really honored to be part of the show.
42
:Like so many cool people on the show.
43
:I am very honored to be part of it.
44
:Yeah.
45
:So I mean, at the moment, I actually work as a data scientist at Konux.
46
:It's a company that does like railway monitoring.
47
:But kind of, yeah, my background and my PhD was like in Bayesian stats, applied Bayesian
stats.
48
:I guess what I most recently did was that I contributed something called DADVI to PyMC.
49
:So um DADVI is short for deterministic ADVI, and it's basically a variational inference method.
50
:Sorry, that's a lot of acronyms already.
51
:I mean, basically it's fast Bayesian inference, right?
52
:That's kind of the motivation here.
53
:How did I get to that?
54
:Well, I mean, I started out as a physicist actually.
55
:I studied physics, but I always kind of liked the sort of data analysis part best.
56
:So after doing my physics degree, I did a computer science degree because I thought, you
know, maybe with programming skills and physics skills, I could do some interesting data
57
:work.
58
:And then I worked as like a machine learning engineer for a while.
59
:But I guess what really brought me to Bayesian stats was like, I did a master's degree in
computing and then I did my project on like tennis prediction.
60
:So I wanted to like predict tennis matches and, but I didn't really have the stats
background so much, you know, like I thought this was a really cool problem.
61
:But I found myself kind of struggling to like write it down like mathematically, you know,
like I wanted to fit a model kind of in my head, but it was really hard to like express
62
:it, you know, cause it's like if I didn't have the stats training, you know.
63
:So then that was in like 2014.
64
:And then a few years after I came across Stan and like, I really loved, know, how you can
basically fit any model you want, but it got a bit slow when I started fitting big models,
65
:you know, so.
66
:Basically, yeah, that was kind of my arc.
67
:Like, yeah, I really wanted to fit these big models and they got a bit slow.
68
:And so I got started in like fast Bayesian inference.
69
:wanted to make these things faster.
70
:And, and a lot of my PhD was on that.
71
:So yeah, that's kind of how I ended up there.
72
:Damn.
73
:Yeah.
74
:Okay.
75
:And how, so it seems like you were doing a lot of already Bayesian-flavored statistics, or did that
76
:come up later on in your, in your path.
77
:Yeah, well, I mean, yeah.
78
:So, so like when I did this tennis project, I, I basically just had a physics and a
computer science background.
79
:And at the time the hot topic was like machine learning, you know, it was like random
forests, deep learning, that kind of thing.
80
:So I kind of naturally went into that first actually.
81
:So I did some years like consulting in like computer vision problems, you know, that kind
of thing, deep learning these things.
82
:But then around then I kind of found this.
83
:I think a colleague, when I was working, he showed me the Stan thing and I was like, man, this is so cool.
84
:You know, can write my model the way I want it and then do inference on it.
85
:so yeah, once I found that, I got really excited about it.
86
:But I also realized I was missing a bit of the background, you know, cause like I started reading the Bayesian Data Analysis book, by Gelman and others, right.
87
:And I really wanted to work through it, you know, but it wasn't that easy cause like.
88
:They talk about like Jacobians and variable transformations and all these kinds of things.
89
:Right.
90
:And I was like, I guess I don't really do that in my physics degree so much.
91
:know, like, I mean, I obviously it's a matrix algebra and stuff.
92
:So, but that really made me want to like study that stuff in depth, you know, so then I
looked for a PhD and I was lucky enough that Nick Golding was in Melbourne at the time
93
:where I was at the time.
94
:And he, I don't know if you've come across him, but he actually wrote like a
95
:probabilistic programming language called Greta.
96
:Anyway, yeah.
97
:So, so yeah.
98
:And then that was like, I had some years, you know, to really get into like Bayesian stats
and try to learn.
99
:really tried to learn, you know, some of the foundations and stuff.
100
:So was really cool to do that.
101
:But yeah, but it took a while, right?
102
:It was first like, first like deep learning and stuff.
103
:And then I got a bit annoyed with it, you know, cause it was like, you couldn't, there's
not that much freedom in the end, right?
104
:Like, yeah.
105
:So that's what I was really liked about the Bayesian stats.
106
:So I mean with the deep learning models you were using, yeah.
107
:Yeah, because in the end, lot of it is like the goal is a lot of the time just trying and
abstract away the inference, right?
108
:So in the end, you don't have a lot of flexibility about what the model is about.
109
:Yeah, yeah.
110
:mean, like, I mean, my, my feeling of at least the deep learning research at the time was
kind of like, you know, you could try to come up with some new architecture, but it would
111
:just...
112
:be so hit and miss, you know, cause it's like, have to have some intuition, you know, it's
usually intuition.
113
:wasn't like very theoretically guided, right?
114
:You have some idea that you test it and more often than not, right?
115
:It doesn't converge or whatever and you spend, yeah.
116
:yeah, it's.
117
:Yeah.
118
:No, for sure.
119
:Yeah.
120
:It's like either it's very hard and you have to be among the people who develop the algorithms, or it's like, well, almost everything is abstracted away from you.
121
:So you don't have a lot of leeway to work on the model.
122
:Exactly.
123
:Yeah.
124
:And how is your, so what's the Jacobian?
125
:Are you now in a position where you know what the Jacobian is?
126
:Is that even useful?
127
:Well, yeah, I mean, it's been a while since I've had to like hand code a Jacobian, but I
mean, I can try to give a flavor for it, right?
128
:So actually, I mean, I can kind of link it back to the topic because like, so this, this, this ADVI thing, right?
129
:Like basically you need to have, you need to, for it to work.
130
:And actually for HMC to work as well, mean, there's other ways of doing it.
131
:Like HMC is obviously the sampler, or a variant thereof, that is used in PyMC and Stan and these languages.
132
:Like to make it work, you need to have your parameters on like an unconstrained space.
133
:So they basically all have to live on like the real line, right?
134
:So they can all have, they all have to be between minus and plus infinity.
135
:And some of them aren't, right?
136
:Like if you fit a variance term, right, you'll be like, well, that's...
137
:That's going to be between zero and infinity, right?
138
:It's a positive quantity.
139
:you have to like, when you still want to use these approaches, you have to transform your
variables so that like, you know, instead of sampling the variance, you'll sample the log
140
:variance, right?
141
:And then that thing lives on like negative to positive infinity.
142
:So that's good.
143
:But it turns out when you do that transformation, and if you want to like, still like keep
all your uncertainty estimates, right?
144
:If you want to...
145
:take into account uncertainty, you're going to squeeze or stretch out, right?
146
:Different parts of, like, if you take the variance, right, I guess the log will like
expand probably the part between zero and one, right?
147
:Cause that now will be like fill everything from like zero to negative infinity, right?
148
:But it'll like, it'll like kind of shrink the other part.
149
:So for that to work out, you have to use this thing called the Jacobian.
150
:Yeah, which is like basically, yeah, like you have to take some, some gradient of like how
that shifts.
151
:And then if you do that, then, you know, you can do this transformation and not like
152
:get, get your wrong estimates in the end.
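To make the change-of-variables idea above concrete, here is a minimal Python sketch (an illustrative example, not code from the episode): a positive parameter sigma with a half-normal prior is mapped to theta = log(sigma), and the log-density on the unconstrained scale needs the log-Jacobian term, which for this transform is simply theta.

import numpy as np
from scipy import stats

def logp_sigma(sigma):
    # log-density of the constrained parameter (sigma > 0), here a half-normal prior
    return stats.halfnorm(scale=1.0).logpdf(sigma)

def logp_theta(theta):
    # unconstrained parameter theta = log(sigma):
    # log p(theta) = log p(sigma = exp(theta)) + log |d sigma / d theta|
    #             = log p(exp(theta)) + theta
    return logp_sigma(np.exp(theta)) + theta

# Dropping the "+ theta" Jacobian term would distort the density on the
# unconstrained scale, which is the "wrong estimates" problem described above.
print(logp_theta(0.0))  # log-density evaluated at sigma = 1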
153
:Yeah.
154
:So that's, that's my best, best take, guess.
155
:Yeah.
156
:I mean, that's the good thing right now: with the PPLs you don't really have to care about what the Jacobian is.
157
:So I feel this was a bit of a trick question, because I was like, yeah, I understand this is very intimidating.
158
:eh But I think now it's actually kind of useless unless you, I mean, to know about it is
definitely useful, but.
159
:Like most of the time you'd really...
160
:mean that's thing.
161
:custom cases.
162
:That's the great thing, right?
163
:With these probabilistic programming languages that like, it kind of frees you up to think
about the problem more than like, right?
164
:The like stuff behind it.
165
:exactly.
166
:And I mean, so I think you were talking about BDA3, the book, and yeah, sure.
167
:And that one, but that one is quite dated now.
168
:It's from the start of the 2010s.
169
:And so we didn't have all the...
170
:all the software we have now.
171
:And so, and that, I think it still made sense to teach that because it was like, you had
to interact with the Jacobians probably on a, at least weekly basis.
172
:Now with the software, you have to do that once a year.
173
:It's like probably a stretch.
174
:I'd be actually curious if people work on with that in the, in the audience.
175
:We have a good, a good panel of Bayesians here.
176
:So, you know, but yeah, basically you, you learned enough so that you don't have to
177
:hear about it.
178
:Yeah, yeah.
179
:And sometimes you don't know, right?
180
:Like you don't know what will be useful.
181
:I mean, you're right, right?
182
:The Jacobians, I mean, it's been less, I guess, of an impactful thing to learn.
183
:But like, mean, some other things, right?
184
:Sometimes you never know what like comes in handy.
185
:And especially, think once you, if you get interested in like developing, you know,
approximate methods or some, right?
186
:You kind of go behind the scenes a little bit.
187
:That's when I think sometimes it can be helpful to like.
188
:Yes.
189
:know the like nitty gritty, even though yeah, like in the day to day usually, right?
190
:You can kind of trust that it all works as it should.
191
:Yeah.
192
:Yeah, no, for sure.
193
:sure.
194
:Actually, can you tell us, can you define, I mean, introduce DADVI to listeners?
195
:know, like I would say, like I was surprised you didn't.
196
:So there are two blog posts, at least we'll put into the show notes.
197
:One that's on your website and another one that you wrote for PonyTXRES.
198
:I am very surprised that one of them is not called, um, "Who's your DADVI?"
199
:That would have been perfect.
200
:Next one.
201
:Yeah, that would have been a better title.
202
:Yeah.
203
:Next time I have to consult you for, for title tips.
204
:Please do.
205
:Not sure.
206
:Yeah.
207
:The JMLR would have appreciated that.
208
:Well, you never know.
209
:But it's a good one.
210
:Yeah, I guess.
211
:Well, so, I mean, the starting point was a little bit that, so I think I mentioned briefly
before that, right?
212
:Like I.
213
:uh I worked with Stan at the time.
214
:I kind of got to PyMC a bit later, but I started out using Stan more, and I built this like tennis prediction model and it was just a bit slow, you know?
215
:And part of it is like, I think now I look back, like there are ways of writing these
models more efficiently, right?
216
:There's kind of an art a little bit also to like how to write your PyMC or Stan program, right?
217
:Which I think at first maybe you don't really appreciate, right?
218
:There are like these little tricks anyway, but
219
:But like it was a little slow and so I wanted to do something fast.
220
:then in my, so in my PhD, I started looking at like approximate inference methods, you
know, and there's kind of different kinds of them.
221
:there's, there's like, basically there's stuff that's like tailored towards particular
problems, right?
222
:Like if you do state space modeling, you might use a Kalman filter, some variant thereof,
right?
223
:If you do Gaussian process modeling, you might use some like tailored variational
approximations and there'll be packages for that and stuff.
224
:So I was kind of using those.
225
:But like, I guess what would be really nice, right?
226
:Is like, if you had your probabilistic program, right?
227
:Or your like Stan or PIMC code and you could just fit that thing more quickly, right?
228
:Without like actually having to worry about, okay, I'm now in like state space land.
229
:Can I frame my problem as like a Kalman filter?
230
:Right?
231
:And like, then how do I do that?
232
:Right?
233
:Can I still want to make parts of it?
234
:Right?
235
:Like, but I have to make, I have to put some work into it and that's kind of a shame,
right?
236
:Cause like part of the fun is that you could just do whatever you want.
237
:Well.
238
:To some extent, with probabilistic programming languages.
239
:So at the time when I was doing my PhD, yeah, there was already ADVI that came out.
240
:So that's automatic differentiation, variational inference, that is just, the promise of
it is really cool because like you basically, it's basically black box variational
241
:inference.
242
:So the idea is like, you actually do nothing different from the way you do anyway when you write PyMC or Stan, but you just, you can run this variational inference algorithm and it
243
:should be much faster.
244
:And it should give you at least the promises kind of you should have good means, good mean
estimates, but the variance estimates maybe might not be that great.
245
:But I don't know what your experience is, but often when I fit these models with MCMC,
like at the end of the day, I kind of just care about the means.
246
:I mean, I like having the variances sometimes too, but for some problems, even the means
are not bad, right?
247
:If I could just get them the marginal means, And like ideally good variances, that already
goes quite far.
248
:I mean, the thing is like, actually when you fit MCMC, it's kind of bananas that like you
get all of the correlations, right?
249
:So it's a very powerful thing, right?
250
:You get all the posterior covariance, right?
251
:All the correlations should be sampled as they are, which is really cool.
252
:But like practically a lot of the time, I think if you just had marginal means and
variances, that would really be pretty cool.
253
:And actually, slight tangent, but I think, you know, INLA is quite popular, and INLA I think only does marginals most of the time.
254
:So I might be.
255
:I might be misrepresenting that, but it's like, you know, there's basically a lot of the
time that's good.
256
:Right.
257
:So, so was like, oh man, that's cool.
258
:ADVI is here.
259
:But the thing is like, when you talk to people who've tried it, usually they've, my
impression was that they often ended up a bit disillusioned because like, and, one big
260
:problem with ADVI is that like, just, it's just fiddly, you know?
261
:So like you, you, you specify your model, you run ADVI, and then actually the problems are a bit different in Stan and PyMC.
262
:Like in Stan, I think they have this kind of
263
:They have some kind of tuning window where like at the start they try to set the
parameters.
264
:I should say, ADVI has like a stochastic objective.
265
:like it's at each step, you basically get like a noisy estimate of what it's trying to
optimize.
266
:And then, and that means you're in stochastic optimization land, which means like you have
to use something like, you know, Adam or Ada Grad or whatever, right?
267
:One of these stochastic optimizers.
268
:And they need a step size.
269
:And so you have to tune the step size, right?
270
:And you...
271
:They don't have like a convergence flag.
272
:So you have to somehow like work out when things are converged and in Stan, yeah, they try
to tune it at the start and then they have some kind of criterion where it's supposed to
273
:try to flag when it's converged.
274
:But a lot of the times it doesn't get it right and it like, it terminates too early or
like it gets its tuning a little bit wrong at the start and it, you know, doesn't find a
275
:good optimum.
276
:And it's also not super repeatable, right?
277
:Like it's cause it's stochastic.
278
:So sometimes you get different answers if you run it multiple times.
279
:So.
280
:And in PyMC it's maybe a slightly different problem.
281
:Like basically, I think it sets like a, a conservative small step size, which is usually
small enough, but then it doesn't really, the convergence criterion rarely triggers.
282
:like if you, if you actually run it, you don't really know how long.
283
:And that to me, right, it kind of defeats the purpose a little bit because then it's like,
yeah, you, you have this, you have this supposedly faster method, but now you have to
284
:babysit it, right.
285
:And you have to check your step size.
286
:You have to look at like convergence plots.
287
:You know, it's like, well, I mean, maybe for some huge problems, you're willing to do
that, right?
288
:But like, generally it's just a bit, a bit fiddly.
289
:And then at the end, even if it's converged, right, you're not sure whether the estimates
are that great, right?
290
:So that's why, when I was doing my PhD, I wasn't really that into it, I didn't use ADVI that much, because I kind of stayed away from it, because it's like, yeah, you know, I
291
:thought, maybe, maybe it's just too good to be true, right?
292
:Like you, you just can't get these like fast approximate methods without
293
:sacrificing some of the generality, right?
294
:Where they're saying like, okay, I'm going to fit a GP.
295
:I kind of pick the GP approximation, you know?
296
:So that's kind of where I was at.
297
:And then in 2020, I went to conference and I saw a talk by Ryan Giordano and he's like a
super smart guy.
298
:I mean, I was so impressed because like, he did this like talk on some bridging between
frequentism and Bayesianism.
299
:Like, I guess that their group, like it's Ryan and...
300
:Ryan and Tamara Broderick and the group, like they're just like really, they've just read
a lot of stuff, you know, and they, they, they're equally comfortable, I think with like
301
:frequentism and like Bayesian stats.
302
:And they actually like put them together to form some kind of improved version, which is
kind of cool.
303
:So I was like, oh, this is a cool guy.
304
:So, so I chatted with him a bit and then, he told me, somehow we got on the topic of ADVI because I had done some other version of variational inference anyway.
305
:But yeah, and he said, oh, you know, I did this for a previous paper and I just fixed the draws, you know, so in ADVI, right, you have this estimator and you pick a new draw every
306
:step.
307
:And he said, well, I just took like, you know, some set, I think it was maybe 60 or
something of draws at the start.
308
:And then I just use those rather than drawing something new every time.
309
:And it worked fine.
310
:And I was like, huh, that's cool.
311
:You know, so I'll give that a go.
312
:So I tried it and sure enough, like it seemed to work well.
313
:And I wrote up a little blog post at the time.
314
:It's on my blog still and I had some questions, so I sent them to Ryan and then that's
kind of, we both kind of thought, man, it's actually, this is kind of cool.
315
:You know, it's working quite well and it'd be nice for, you know, to share this a bit and
see if it's useful for other people.
316
:So that's kind of how the paper got started.
317
:And yeah, I guess I should say how it works.
318
:But I mean, that's a great, I think that's awesome because that gives us a great
background about, know, like why you were using that, what this is about and...
319
:what the problems are.
320
:And then like probably now you can tell us a bit more about, yeah, the intuition into how
that works.
321
:Because in your blog post, I think it's really well done.
322
:Actually, if you have your blog post, you can share your screen, because I think you have these great plots, you know, where you show mean-field variational inference in red and the
323
:bivariate Gaussian in blue.
324
:And you're showing that we can get.
325
:For a simple multivariate normal, 2D multivariate normal, you can get the mean pretty
accurately, but then of course the covariance, you don't get it because ADVI is based on
326
:the fact that each parameter is approximated by its own independent normal.
327
:so yeah, I think this is great.
328
:then you'll be, so yeah, let's start with that and then we'll take it from there.
329
:yeah, that's exactly right.
330
:Yeah.
331
:So, so I guess this is really the...
332
:key takeaway with mean-field variational inference: certainly, if your target posterior is multivariate Gaussian, even if it's really
333
:correlated, it will find the right means, but yeah, it will avoid... so basically the objective function that's used is called the Kullback-Leibler
334
:divergence.
335
:And it, it's, it's also one particular way around.
336
:So it's the Kullback-Leibler divergence between the approximation, right?
337
:Which is this like factorized Gaussian thing.
338
:and the true posterior.
339
:it turns out, it actually ends up being a sum of two terms.
340
:So it tries to like maximize the entropy of your approximation.
341
:So that'll try to make it like spread out as possible, but it also tries to maximize the
expected log posterior.
342
:kind of like basically tries to find the mode.
343
:So it's a bit like kind of a regularized map estimate in a way.
344
:it like, it'll, it's going to like try to seek the mode and it's going to try to expand,
but it really tries to stay away from like low density regions.
345
:That's why you kind of get this picture where like it's, it kind of finds the like, know,
part that has high posterior mass, but it will miss, you know, all of this part.
346
:And it will underestimate the variance as a result, right?
347
:Because like, but yeah, you can see that this axis, right?
348
:It should be all the way over here, but yeah, so that's kind of the price you pay.
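For readers who want the two competing terms written out, the standard decomposition goes like this (generic notation, not tied to any particular implementation): minimizing KL(q || p(theta | y)) over the approximation q is equivalent to maximizing the ELBO,

ELBO(q) = E_q[ log p(theta, y) ] + H(q),

where H(q) = -E_q[ log q(theta) ] is the entropy of the approximation. The expected log joint term is mode-seeking and pulls q toward regions of high posterior density, while the entropy term pushes it to spread out; with a factorized Gaussian q this balance is exactly what the plot shows, with good means but under-estimated marginal variances for correlated posteriors.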
349
:Yeah, exactly.
350
:So, as you were saying, sometimes these perfectly fine and you don't have to.
351
:You don't necessarily have to have a good idea of what the posterior correlation is.
352
:But in other cases, also do.
353
:That's a problem.
354
:so, yeah, maybe can you tell us, well, what do we do then?
355
:if you can't use nuts, but you still care about the covariance, what do you do?
356
:Because that seems to be a problem with the black box method that...
357
:There is already in PyMC the ADVI method.
358
:And so yeah, what's the problem here and how can we solve it?
359
:Yeah, exactly.
360
:Yeah.
361
:So, I mean, that's the other thing.
362
:I already mentioned, right, that with ADVI, ideally it should give you this answer, right?
363
:But it doesn't even really give you this answer is the problem a lot of the time, right?
364
:It's hard to get it to get to this answer.
365
:So the first thing that DADVI kind of does is that like it pretty reliably gives you that answer.
366
:So like you fix the number of draws that you use.
367
:And that introduces a little bit of noise because you're like, you should use infinite
number of draws to get the perfect result.
368
:So it won't exactly get this mean right.
369
:Actually, pretty sure I fit this with DADVI.
370
:So it's like I just took this correlated Gaussian and then I fit DADVI.
371
:So you can see it gets it pretty much right.
372
:Obviously, this is a toy example.
373
:But generally, it's pretty close.
374
:But yeah, so gets, let's say it gets the means right, which it will in a fair few cases.
375
:then you still have the problem that you don't have the correct marginal variances and you
don't have the correlation like you say.
376
:So that's where there is this other thing that comes in, that you can do with DADVI, called linear response.
377
:So there the idea is that if you have good means and you can differentiate, there's this
kind of cool identity that you can derive called a linear response that then gives you the
378
:covariance.
379
:So I guess
380
:It's kind of a little bit magical, I guess.
381
:I mean, I thought so anyway.
382
:So this is something that Ryan worked on a lot, you know, so he really brought this.
383
:like, yeah, there is this, there's this relationship.
384
:I mean, I guess intuitively kind of it's like, you have your means and you perturb them a
little bit.
385
:And I guess how they shift basically ends up being the covariance.
386
:And it turns out, yeah, if you find the exact optimum, which again, oh
387
:You don't do with ADVI, right?
388
:With DADVI you do find the exact optimum of this, like, you know, deterministic objective.
389
:Then you can do this trick of like looking how would the means shift if I prod this thing
a little bit.
390
:And that gives you the covariance.
391
:So, so you can actually recover this, this covariance.
392
:And if the means are right, then the covariance will be right too.
393
:So actually the, the variances can be as wrong as you want.
394
:It's, it's really, it was all just about the finding the means.
395
:Yeah.
396
:That's kind of a neat result.
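A compact way to state the perturbation idea (a hedged sketch of the general linear response construction, not a quote of the paper's exact formulas): tilt the log posterior by a small linear term, log p_t(theta | y) = log p(theta | y) + t' theta, refit the variational optimum for each t, and write m(t) = E_{q_t}[theta] for the resulting mean. The linear response covariance estimate is the sensitivity

Cov_LR(theta) = d m(t) / d t, evaluated at t = 0,

i.e. exactly "how the means shift if you prod this thing a little bit." In practice the derivative is obtained by implicit differentiation of the optimality conditions, which is where the Hessian of the objective enters, as discussed next.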
397
:Yeah.
398
:Yeah.
399
:And so can you actually talk about that linear response?
400
:Yeah, so the one common critique of ADVI, especially mean-field ADVI, is its tendency to underestimate posterior uncertainty, as we've just showed.
401
:So DADVI is claimed to help with more reliable covariance estimates via this linear response.
402
:Can you explain how that works and...
403
:whether it fully resolves the under-dispersion issue.
404
:Yeah, that's a really good question.
405
:So, mean, so as I mentioned, right, like if the means are right, then it will get it
right.
406
:But there are some caveats.
407
:So one caveat is that it involves the inverse of the Hessian of the objective.
408
:And the Hessian for this mean field objective, it'll have like two times the number of
parameters of your model.
409
:So you have to invert something kind of big, right?
410
:The thing though is it's not as bad as you might think because, this is also something,
mean, Ryan pointed out at the time that like, you can, you don't need the full covariance,
411
:right?
412
:Again, like you might be interested in, for example, one parameter and then you can do
something called conjugate gradients to like, if you just have this thing called the
413
:Hessian vector product, so like the Hessian times a vector.
414
:There's this well-known right like linear algebra technique called conjugate gradients,
which will allow you to solve for like a, know, a particular parameter.
415
:And even actually you can have a function of your parameters as well.
416
:So if you have a function of your parameters and it's nonlinear, you can do something
called the delta method.
417
:That's how you derive it.
418
:Like basically Taylor expand that thing.
419
:And then you can also do that with a Hessian vector product, you know, with this conjugate
gradient trick, cause it's just kind of...
420
:changes a little bit.
421
:basically, so, so, so yeah, caveat is right.
422
:involves this, this negative Hessian.
423
:However, like if, if you only care about certain functions of your posterior, which is,
you know, often the case, then you can, you can do the CG trick to get it kind of quick.
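To illustrate the conjugate gradient trick just described, here is a minimal, self-contained Python sketch. The Hessian here is a random symmetric positive definite stand-in; in a real setting the Hessian-vector product would come from automatic differentiation and the full matrix would never be formed or inverted.

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
n = 1000
A = rng.normal(size=(n, n)) / np.sqrt(n)
H_dense = A @ A.T + np.eye(n)        # stand-in for the objective's Hessian (SPD)

def hvp(v):
    # Hessian-vector product; in practice this would come from autodiff,
    # so H never has to be materialized.
    return H_dense @ v

H = LinearOperator((n, n), matvec=hvp)
e1 = np.zeros(n)
e1[0] = 1.0                          # pick out a single parameter of interest
x, info = cg(H, e1)                  # iteratively solves H x = e1
print(info == 0, x[:3])              # x is the corresponding column of H^{-1}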
424
:Caveat two is that this is how it hopefully works, right?
425
:It finds the means.
426
:In my experience, that doesn't always happen.
427
:Like, and I haven't really formalized this.
428
:But like, mean, intuitively kind of imagine you can start thinking about like, you know,
curved stuff, you know, imagine some kind of banana here or something, right?
429
:Like a banana with like equal, equal like posterior density everywhere, kind of.
430
:Then this thing, it might not, it might, or I guess even worse, like a banana with like a
bit more weight in one corner, you know?
431
:Then this thing will seek out that bit, right?
432
:And it'll like kind of ignore the other bit.
433
:And at that point you don't get the mean right.
434
:And so your covariance will also not be correct.
435
:guess it'll be some kind of, it'll probably be an improvement because it kind of explores
things a bit more.
436
:But yeah, so that's the caveat number two that like, if you have a complex model and I
think it mostly comes down to like competing explanations, know, like Gaussian processes,
437
:for example, sometimes you have like, I think you could have like two kind of distinct
length scale settings that would make sense.
438
:But they only make sense if some other parameter also varies together.
439
:Maybe you could imagine something where there's a strong seasonal, you could explain
something with a strong seasonal pattern, but also maybe it's just some transient thing.
440
:There's a way of explaining it as noise or something.
441
:And then there's these two competing things.
442
:then I think if it gets, with the hyperparameters start interacting with the other
parameters, potentially you get some kind of curve thing.
443
:then, yeah, I'm...
444
:I'm not sure.
445
:However, we'll add, so I'm not sure that it could be that even in those kinds of cases,
you might actually end up with quite a predictive model.
446
:Cause like it will seek, you know, kind of the highest density part of the posterior and
potentially, right.
447
:That's perfectly fine if you predict, right.
448
:If you, if you're like, for example, you care about prediction, maybe, maybe that's
actually going to work.
449
:So like, yeah, I guess that's just the caution.
450
:And I would, I would say that like.
451
:If you suspect that your posterior is kind of nonlinear and difficult, then maybe this will kind of, you know, still not explore all of it, if that makes sense.
452
:Yeah, yeah.
453
:Yeah, it makes a ton of sense.
454
:we've given actually some tips to, I think you can stop sharing your screen by the way
now, unless you want to keep sharing that.
455
:And yeah, so in episode 142 with Gareal's Stage Troll T, we gave some
456
:some tips, folks, to work with approximate inference, and try and see and understand when it is reliable or not.
457
:So I recommend listening to that one if you want more details.
458
:Staying on, on ADVI, like, so if I understood correctly from the blog post you wrote, the original ADVI relies on stochastic gradient optimization, whereas DADVI optimizes
459
:a deterministic surrogate.
460
:So what's that about?
461
:Like, what does that mean?
462
:What are the trade-offs?
463
:Whether that's in terms of robustness, convergence, diagnostics, sensitivity, stuff like that.
464
:Right, right.
465
:Yeah, exactly.
466
:So, ADVI has this like stochastic objective.
467
:It basically, the ADVI paper derives this estimator of the, like, this Kullback-Leibler divergence, right?
468
:We already talked about that, that like the, that's the kind of...
469
:measure that you're trying to minimize with the method.
470
:So it derives this estimator, it's an estimator, right?
471
:So like for any draw, basically you give it a draw, actually from a standard normal, and then it will give you a noisy estimate of both the KL divergence and
472
:also its gradient.
473
:And yeah, and that means you're in stochastic optimization world.
474
:But the nice thing about it is that, you know, those draws are kind of
475
:fresh, right?
476
:So like you make a fresh draw at each step and eventually you do have guarantees from
stochastic optimization that it should kind of find the optimum, right?
477
:It's just that in practice it can be a bit fiddly.
478
:On the other hand, so what DADVI does is, like, it takes that estimator and it just takes a set of fixed draws.
479
:So instead of taking a new one every step, you fix like, typically it's actually 30 works
well.
480
:And so that means
481
:You don't have the noise, right?
482
:You don't have a new draw at each step.
483
:So the noisy part completely disappears.
484
:That noise disappears, but you do have like a kind of a sample, like an error from that
sample, right?
485
:Cause you picked these 30 draws.
486
:If you picked a thousand, right?
487
:You would have less.
488
:Basically you take the expectation, right?
489
:Of this, this Kullback-Leibler estimator, and like the more draws you get, the sharper that thing will get, right?
490
:So if you truncate it at 30, it's a bit fuzzy.
491
:So, yeah, and so that means, but it means right that like, so one is stochastic
optimization and the other is more sort of standard optimization, which is nice because
492
:then maybe, if you work with SciPy and stuff, right, you can use it: SciPy has this minimize function and there are all these optimizers in there, L-BFGS,
493
:BFGS, SC, sorry, like Newton trust, whatever, right?
494
:So you can use any of those and you can pass.
495
:Not just the gradient, but you can also pass the Hessian vector product for it to be like
super reliable, which is what we actually do.
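As a tiny illustration of what "pass the gradient and the Hessian-vector product to SciPy" can look like, here is a generic sketch with a placeholder quadratic objective; this is not the DADVI code itself, just the optimizer interface being described.

import numpy as np
from scipy.optimize import minimize

def objective(x):
    return 0.5 * np.sum(x ** 2)      # placeholder for the fixed-draw objective

def grad(x):
    return x

def hvp(x, p):
    return p                         # the Hessian of this toy objective is the identity

res = minimize(objective, x0=np.ones(10), jac=grad, hessp=hvp, method="trust-ncg")
print(res.success, res.x)            # deterministic optimizers report convergence cleanly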
496
:So, yeah, you basically all the convergence issues go away.
497
:uh Maybe sounds bold, but I think that's pretty much the case.
498
:So like, you don't have to worry about the convergence so much.
499
:you know, sometimes it takes a while to converge because like it kind of gets it really
right, you know, until the like gradient norm is like zero and stuff.
500
:yeah, the convergence worries go away.
501
:But you you might worry about what you pay in terms of bias from this like sample.
502
:So, so you do introduce a little bit of noise from that, but it tends to be quite small.
503
:So in the blog post, I think I show like a comparison and at least empirically from all
the experiments I've run, it's, it's quite a small amount of scatter that introduces.
504
:So that's not really such a big, big deal.
505
:yeah, so, so, I mean, I guess another thing to say there is that like,
506
:It does, another thing it does cost you is that you can no longer do full rank.
507
:So in ADVI you also have the option of like fitting a full, you know, multivariate normal across all your parameters.
508
:You kind of lose that by truncating the draws.
509
:Yeah.
510
:Okay.
511
:Yeah, yeah, yeah.
512
:Yeah, that makes sense.
513
:Okay.
514
:Yeah.
515
:So always try those, right?
516
:Something I didn't understand is I think in, at some point in the blog, you, you explain
the intuition behind
517
:fixing a set of Monte Carlo draws, and that it's like sample average approximation, instead of redrawing at every optimization step.
518
:Is that related to what we just talked about?
519
:I'm like, wasn't sure and I'm so yeah, like could you walk us through how that change
affects the optimization landscape, the convergence behavior and so on?
520
:Yeah, sure.
521
:Yeah.
522
:So, yeah, that's basically the thing.
523
:So, in ADVI, you have this estimator, right?
524
:of the gradient and the K-L divergence.
525
:And at every step, the algorithm is basically, it's not that complicated in that sense.
526
:Like it's basically at every step of your algorithm, you draw a sample, you estimate this
thing, you get a noisy estimate of the gradient, you take a step, you do that.
527
:Right.
528
:That's all it does.
529
:But it takes a new step, new draw every time.
530
:Whereas yeah, with DADVI, it's like you fix the set of draws.
531
:And you don't ever redraw, right?
532
:You just like, you just keep those, those rules.
533
:And maybe I should say, so the reason that also works is because of this thing called the
reparametrization trick.
534
:like, you know, you might say, well, I'm going to fix some draws, but how does that work?
535
:Right?
536
:Cause it's like, they're going to be like, what, like are those draws of like my,
parameters, right?
537
:Like I'm going to fix the, the variance and the mean or whatever, right?
538
:In my model.
539
:No, that's not it.
540
:So it's basically this reparametrization trick saves.
541
:the day, because like, you basically can rewrite this estimator as like, so you have your,
your, your parameters that you're trying to optimize, which are the means and the log
542
:standard deviations, right?
543
:Of this, of these like little Gaussians for each parameter.
544
:And it turns out right.
545
:Well, actually you've probably be familiar with this, like, you this, can write normal
distribution as like a shift and scale, right?
546
:You can say, well, I take my like standard normal, I add the mean and sorry, I scale it
and then I add the mean.
547
:So that's what.
548
:Why this fixed draw, there's another reason this fixed draw idea works is because you fix
these like, usually we call them Zs, right?
549
:It's like these standard normal things.
550
:And then what you optimize are the means and the scales.
551
:And so you can use those same standard normal draws at each step and just optimize these
mean and log SD parameters.
552
:So that's what I mean.
553
:You keep these standard normal draws that you're going to transform.
554
:You just keep one set of them.
555
:Whereas in ADVI, you...
556
:You'd redraw a new one every time.
557
:Maybe hopefully that makes maybe a bit more sense.
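To make the fixed-draw idea concrete, here is a minimal toy sketch of the objective being described, with a 2D correlated Gaussian standing in for the posterior. This is illustrative only, not the PyMC/DADVI implementation: the standard-normal draws Z are sampled once, and only the means and log standard deviations are optimized, with an off-the-shelf deterministic optimizer.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

# Toy "posterior": a correlated bivariate Gaussian
target = multivariate_normal(mean=[1.0, -2.0], cov=[[1.0, 0.9], [0.9, 1.0]])

rng = np.random.default_rng(42)
Z = rng.standard_normal((30, 2))     # fixed standard-normal draws, never redrawn

def neg_fixed_draw_elbo(params):
    mu, log_sd = params[:2], params[2:]
    theta = mu + np.exp(log_sd) * Z  # reparameterization trick: shift and scale Z
    expected_logp = target.logpdf(theta).mean()
    entropy = log_sd.sum()           # diagonal-Gaussian entropy, up to a constant
    return -(expected_logp + entropy)

res = minimize(neg_fixed_draw_elbo, x0=np.zeros(4), method="L-BFGS-B")
print(res.x[:2])                     # means: close to [1, -2]
print(np.exp(res.x[2:]))             # marginal sds: under-dispersed vs. the true value of 1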
558
:Okay.
559
:Okay.
560
:So it's again related to the non-centered parameterization, if I understand correctly.
561
:Yeah, classic.
562
:Okay.
563
:Yeah.
564
:Yeah.
565
:That is so useful.
566
:That old nugget, right?
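For anyone wanting to see the connection spelled out, here is a hedged PyMC sketch of a non-centered parameterization (a toy hierarchical model with placeholder data, not a model from the episode): the shift-and-scale of fixed standard-normal variables is the same trick discussed above.

import numpy as np
import pymc as pm

with pm.Model():
    mu = pm.Normal("mu", 0, 1)
    tau = pm.HalfNormal("tau", 1)
    z = pm.Normal("z", 0, 1, shape=8)                  # standard-normal "Z" variables
    theta = pm.Deterministic("theta", mu + tau * z)    # shift and scale: non-centered
    pm.Normal("y", mu=theta, sigma=1, observed=np.zeros(8))  # placeholder observations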
567
:Yeah.
568
:Yeah.
569
:And so, in the paper that you have, because like your work on DADVI in the blog post is derived from a paper you have called Black Box Variational Inference with a
570
:deterministic objective.
571
:So you and your co-authors, you discussed theoretical properties, sample complexity, and the conditions under which DADVI can perform well.
572
:I'm curious which of those insights surprised you most or proved most practical to you,
most practically useful.
573
:Yeah, I like that you phrased it that way because I think they're actually a bit distinct.
574
:like, I think...
575
:The most surprising thing to me that I still think, well, there's, I guess there's two
that are a little bit magical.
576
:So one is, one is the linear response thing.
577
:Like that, you you, you can basically recover the covariance from your mean estimates and
that it actually works really well.
578
:And then another nice one is like, you can actually quantify the like bias in your, from
the, from the sample, which is another cool thing.
579
:Cause it's, it uses some theory from M-estimation that Ryan
580
:really knows about basically like, I think it's a tool that often frequentists use more,
but you can use here.
581
:There's this kind of cool link because like, you know, frequentists worry about like,
redrawing their sample, right?
582
:Like you have your dataset, but you could, you worry a lot about what's called a sampling
distribution, right?
583
:Like that you could redraw a new set and then maybe your estimators would be a bit
different, right?
584
:So like that's one of the key.
585
:distinctions between Bayesian stats and frequentism, right?
586
:That in Bayesian stats, you quantify the uncertainty in your parameters, but you keep your
like, you consider your data fixed, right?
587
:And in frequentist stats, usually you think of your parameters as like points, right?
588
:And then like the variance comes from the data.
589
:So that's kind of cool because here we're doing Bayesian inference, but like this
estimation thing, as far as I understand it, comes
590
:from more of a frequentist viewpoint because now you have these like draws, but you could
imagine drawing different draws, right?
591
:And so now this is kind of where the frequentist machinery actually like works.
592
:Anyway, so that I think is theoretically really cool.
593
:Honestly though, practically like it more like, you know, I done the experiments and I
kind of saw it was really kind of close.
594
:So wasn't really that surprising.
595
:Yeah, so surprising I would say is the linear response.
596
:And practical also, think are the proofs.
597
:So Ryan put a lot of work into the theory of like why, why this, why DADVI works, right?
598
:Like why you can get away with having so few samples.
599
:I, so basically in the paper, I did the experiments for the most part, and Ryan did a lot
of the theory.
600
:like, so from the experiments, I already kind of knew that it worked, but, but it was
really nice to have some justification from theory, you know, that it says, yeah, you
601
:know, it works because like.
602
:Cause I guess from the theory, would think you would need way more draws, like in
generality, but like in practice, when we ran it, you can, you don't actually need a lot.
603
:So yeah, so I guess, yeah, practically write those.
604
:That's a nice result to know.
605
:And then like in terms of surprise, I think just this linear response that that like works
so well.
606
:Yeah.
607
:I wouldn't have expected that.
608
:guess.
609
:Yeah.
610
:Yeah.
611
:Yeah.
612
:Yeah.
613
:And I agree that linear response thing is really fun.
614
:So you talk about it in the blog post, but if you have time, think a blog post dedicated
to it and explaining, okay, here is how we got the means and how we could recover the
615
:posterior covariances.
616
:I think it'd be super helpful to, if you can write that, think it would be super fun
because I'm guessing a lot of people would use that.
617
:Cool.
618
:Thanks Alex.
619
:Yeah, I should say, I should add that like I kind of broke the implementation in the PyMC
into steps.
620
:At the moment, the implementation only does the fixed draw optimization.
621
:doesn't have LRVB at the moment.
622
:But yeah, maybe once it's in, I'll write it up.
623
:Thanks Alex.
624
:I think people find it useful.
625
:Yeah, yeah, yeah.
626
:No, for sure.
627
:think it'd be super helpful for sure.
628
:And so actually, to keep talking about like real applications, I'm curious if you've tried DADVI a lot in real applications.
629
:How does it perform compared to standard ADVI or even other approximation algorithms, be it INLA's Laplace approximation, Pathfinder, or, and of course, NUTS?
630
:How does it compare in your experience in runtime, accuracy of the posterior mean, of the
covariance estimation, ease of use, et cetera?
631
:Yeah, I mean, so I think...
632
:Like, mean, so generally, right, it's, it's going to be faster than nuts by, by quite a
lot.
633
:Typically it, it, I would say it works better on some problems than, than others.
634
:um Like, I think we briefly touched on this, this like banana problem, right?
635
:The like nonlinearities in the posterior.
636
:And so you, you do run into, into those sometimes where the means then seem to be off.
637
:Like, and then, you know, you might, you might want to look into.
638
:I mean, it depends.
639
:Like I have an experience from my PhD where I actually used DADVI on like a problem in
ecology.
640
:And it was one of those where like, seemed like the mean estimates didn't match MCMC super
well.
641
:But like when I looked at the predictive performance, it did beat MCMC at some point
because like, I could just use way more data, you know, and so kind of the predictive
642
:performance benefited more from having more data than like, you know, getting the
estimates exactly right.
643
:So yeah, I think in some complex models, found that there's just the mean field objective
will give you trouble.
644
:Where I think it's, I've had the experience of it working very well is like things like,
you know, simpler sort of mixed models, like that kind of thing where like, you know, your
645
:posterior is not going to be like incredibly complicated.
646
:Mean field seems to find the means quite well.
647
:And then it's, yeah, it's really quite nice.
648
:Like, I think if you work with Bambi as well, right, in PyMC.
649
:I think it could be, it could definitely be worth a go.
650
:Like I think sometimes I, there might be trouble if like some of your, if you have like,
you know, in, in a mixed model, sometimes you have like within unit effects, right?
651
:Multiple ones that can be quite correlated.
652
:I think sometimes then you might run into issues, but generally, for those kinds of models, you often, like, DADVI has worked really well.
653
:And yeah, so, and I mean, in ease of use, it's just very good, right?
654
:Cause it's like you, if you can fit nuts, you...
655
:You if you write your model anywhere in PyMC, right, you can just run it and it'll
converge.
656
:mean, if you have convergence problems, me know.
657
:Because yeah, usually I don't think that should be a problem.
658
:But yeah, if your problem is quite complex, you might want to, like, for example, you
could fit it on a small subset and see how far away the estimates are from MCMC to see if
659
:you're running into some issues there.
660
:But yeah, I hope that gives you some things to work with.
661
:Yeah, I know for sure.
662
:That makes sense.
663
:That's also what you're saying in your blog post.
664
:That has also been my experience.
665
:I haven't used DADVI yet, but I've used other approximation algorithms, and yeah, usually it's the same.
666
:The same trade-offs you have to navigate and the same choices you have to make.
667
:In your experience though, are there classes of models for which DADVI is especially effective, or
668
:Or conversely, where it fails or underperforms?
669
:Yeah, I guess, I guess like I, like I mentioned, I think, I think, yeah, I think the mixed
models are good.
670
:I think where, where it might struggle potentially, right, would be things like, well, I
mean, time series models, I would be a little concerned if, if you don't, you know, cause
671
:you have all these correlations between all of your time steps.
672
:I think it's also quite, there are some papers, I want to say Turner and Sahani is maybe one,
673
:where people have looked at this, the mean field objective in like time series and where
like that, that it can, it can have issues.
674
:yeah, basically any time.
675
:Well, the thing is, so correlations itself are not a problem, I think, right?
676
:Cause we, know, if it's like kind of multivariate normal, you know, it's okay.
677
:Cause it'll still find the means.
678
:You only really run into trouble.
679
:Yeah.
680
:If it's like nonlinear correlations, right?
681
:Cause then you have this problem of like potentially, you know, honing in on one
explanation, but missing another.
682
:But yeah, classes of model-wise, I think, I think if you're in sort of a mixed model
world, you might have a lot of joy.
683
:And also if you have a model that's like, so the one I had in my PhD was kind of like, it was a model in ecology that was very complicated.
684
:mean, it was kind of complicated to the extent that like, I couldn't really find a great
like, handcrafted approximation that would work.
685
:Like I don't think INLA could be made to work.
686
:And you know, the, the variational stuff I was working with kind of didn't, didn't really
work.
687
:So at that point, DADVI will work, right?
688
:Cause it's like, it doesn't make any assumptions.
689
:And like I said, once I could use way more data, then actually the predictive performance outdid MCMC, which was, I guess, exact, but, you know, it
690
:couldn't scale.
691
:So I guess also problems where like, you just have a lot of data, MCMC can't handle it.
692
:And maybe prediction is more the focus than like, you know,
693
:parameter inference.
694
:And I should say maybe also in mixed models, another nice thing is that the LRVB really
shines, right?
695
:Because you can compute it quickly.
696
:yeah, so yeah, maybe that gives a bit of an overview.
697
:Yeah, that's interesting also to hear that it works well on hierarchical models, because I know ADVI can have a lot of problems with hierarchical models where the number of parameters
698
:increases and the correlations between the parameters increase.
699
:That's great to hear that that we tend to do better here because like in most real case
application, you're using a hierarchical model.
700
:that's great.
701
:Have you tried it on time series models?
702
:So I'm trying to infer Gaussian processes or state space models.
703
:I'm guessing it would be much harder because you are saying that the correlation structure
here have a very high correlation structure and I'm guessing it has troubles, but still
704
:curious.
705
:Yeah, yeah.
706
:mean, I haven't, I haven't really investigated it in depth.
707
:And I think it'd be interesting to see when exactly like it really breaks.
708
:But just anecdotally recently I did like a Hilbert space Gaussian process thing, more,
more, more for fun.
709
:And I thought, oh, you know, throw DADVI at it.
710
:And like, it didn't match the, the NUTS means super well.
711
:And I think it was like, it was kind of that situation where I think there's a lot of, I
mean, I have to investigate this.
712
:I didn't have a
713
:firm conclusion, but I'm guessing it's something like, lot of the posterior mass is for
example, on like a long length scale, you know?
714
:So it kind of hones in on the long length scale part and then, but it doesn't give a very
exciting like, you know, inference there.
715
:Cause I think, like the correlations between the hyperparameters and stuff are like
notoriously difficult, even for like, like at least earlier versions of nuts and stuff to
716
:sample sometimes, right?
717
:Cause you kind of want to go right in the tail and.
718
:So yeah, I think that's one where I was a little bit disappointed lately, but again, I
didn't spend a ton of time on it, so it was possible there was something else, it didn't
719
:look, but yeah, I have a sort of a sense of caution with these kinds of things.
720
:maybe you get in trouble.
721
:Yeah, yeah, yeah.
722
:So that's definitely where using simulated data, parameter recovery studies will be even
more important.
723
:Exactly.
724
:I think that would be a great way to go.
725
:Yeah.
726
:Just to see whether your model can be handled by the approach.
727
:Yeah.
728
:Yeah.
729
:The good thing is that as you were saying, we also have other algorithms that work better
for Gaussian processes and state space models.
730
:So state space models with Kalman filters.
731
:Of course, now that we have the state space submodule in PyMC.
732
:So you can use that in your PyMC models and you'll get the Kalman filter out of the box.
733
:And for GPs, in my experience, HSGPs have really changed the game, because most of the time you can use NUTS with HSGPs, even with huge GPs.
734
:I've done that with a very big GP, a hierarchical structure of GPs, and I was still using NUTS, and that was fitting in 15 minutes on my M4.
735
:That's really cool.
736
:M4 Max.
737
:but that's like, that's really good.
738
:I bet your fan was going pretty well with that one.
739
:Right.
740
:Like I feel like nuts does a good job.
741
:But that's really cool.
742
:Yeah.
743
:And I mean, I have to say also, I think it's really amazing how much better these samplers
are getting all the time.
744
:And I mean, MCMC is fantastic.
745
:I mean, right.
746
:Cause you just get everything.
747
:There's no sacrifice.
748
:Like, I mean, you you have construction issues sometimes, right.
749
:But, like, I mean, just in terms of sampling everything, right.
750
:It's wonderful.
751
:If you can use it.
752
:Now I fit a model with literally hundreds of thousands of parameters, these GPs I was
talking about, hundreds of thousands of parameters, 15 minutes on an M4 Max.
753
:You know, it's with NUTS, with Nutpie, with NUTS in Nutpie.
754
:So, but HSGP is really, really helpful here.
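For readers who haven't used it, here is a minimal sketch of the HSGP approximation in PyMC; the toy data, the kernel, and the basis settings (the number of basis functions m and the boundary factor c) are illustrative choices, not recommendations.

```python
import numpy as np
import pymc as pm

# Toy 1-D data -- a stand-in for whatever your real inputs are.
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 200)[:, None]
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=200)

with pm.Model() as model:
    ell = pm.InverseGamma("ell", 4, 4)           # lengthscale
    eta = pm.HalfNormal("eta", 1.0)              # marginal standard deviation
    cov = eta**2 * pm.gp.cov.Matern52(1, ls=ell)

    # Hilbert-space GP: a low-rank basis approximation of the full GP.
    # m and c trade accuracy for speed.
    gp = pm.gp.HSGP(m=[30], c=1.5, cov_func=cov)
    f = gp.prior("f", X=X)

    sigma = pm.HalfNormal("sigma", 1.0)
    pm.Normal("y_obs", mu=f, sigma=sigma, observed=y)

    # Because the HSGP is just a linear model in the basis,
    # NUTS stays fast even as the dataset grows.
    idata = pm.sample(1000, tune=1000, chains=2)
```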
755
:So like here, that's interesting because you still have to use an approximation somewhere.
756
:Instead of using an approximation in the algorithm, you're using an approximation in the
math of the GP.
757
:So, yeah, you do that for GPs.
758
:We have that, and then you have the other specialized packages like GPyTorch or, I always
forget them, the Google one, there's a ton of them.
759
:Here, the issue is that you have to have a pure GP model, but you have a lot of amazing GP
approximations, and actually I have an upcoming episode.
760
:I mean, by the time your episode airs, that episode will already be out.
761
:I think it's episode 145, about deep Gaussian processes.
762
:So give a listen to that folks.
763
:I'm going to tell you right now which one it is, but I think you wanted to piggyback on
what I was saying, Martin.
764
:Yeah, I was just going to say, mean, well, I guess two short things.
765
:I mean, well, one side tangent: the GP folks, I think, are really into this point you made,
this, you know, change the model versus change the approximation.
766
:Like I know that a lot of the like...
767
:I think it was back then with the GPflow developers, there was a lot of push to get more
into variational.
768
:Yeah, into more like variational inference because they like this idea better of like,
early in the:
769
:process and stuff.
770
:And then later on, they kind of liked this idea better of like, well, we keep the GP as it
is, but we're going to approximate it with something that's simpler.
771
:Anyway, just wanted to mention, I think that's a really interesting topic.
772
:Yeah, me too.
773
:To me, there is no free lunch, right?
774
:It's like, well, yeah, I'd love to fit the vanilla GP with NUTS and with tens of
thousands, I mean, hundreds of thousands of parameters, but I can't, unless we were
775
:getting even better computers, which we will for sure.
776
:But then we'll find a way to push them to the limit, you know, because what's the point of
having an M4 Max if you don't push it to the limit?
777
:So, at some point you have to approximate something.
778
:So you have to be
779
:comfortable with that.
780
:And sometimes what you will approximate is the model itself.
781
:Sometimes you will approximate the algorithm, but that doesn't mean you cannot have a great
answer with that.
782
:yeah, and it is episode 144 with Maurizio Filippone.
783
:Very, very fun episode about deep, deep Gaussian processes.
784
:I'm not going to spoil that for you guys.
785
:You have to listen to it.
786
:What is even a deep GP?
787
:Great question.
788
:But yeah, listen to that one; Maurizio knows what he's talking about.
789
:Like, so it was a really fun episode.
790
:And you, Martin, I'm curious, you know, what's on deck for the future for you?
791
:Like what's on deck, especially for future integration of DADVI into PyMC?
792
:Because I think we need more of that in PyMC, you know; we've given NUTS a lot of love,
and that's justified, but I think as a community we've lagged behind on
793
:other approximation algorithms, which can be faster in some cases.
794
:And so that's great because right now we're, you know, we're filling the gaps.
795
We have Pathfinder now.
796
:We have the Kalman filter in pymc-extras.
797
:We're getting INLA in there too.
798
:We already have the Laplace approximation, thanks to Jesse Grabowski, in pymc-extras.
799
:We have ADVI.
800
:Now we have the beginnings of DADVI in pymc-extras, thanks to you.
801
:So I think it's an awesome effort.
802
:We need, we need to...
803
:to do more of that, and yeah, what do you have on deck for us?
804
:Yeah, I think two things come to mind.
805
:I mean, one is the linear response, putting that in there.
806
:And then the other topic I kind of want to explore is trying to see what you can do with
GPUs here, because back in the day, I had an
807
:implementation of DADVI in JAX.
808
:And it's kind of neat because, you know, JAX has this vmap functionality, right? So you can
do a vectorized map and just, like,
:So can do a vectorized map and just like.
810
:run all of these,
811
:basically, right.
812
:And in DADVI you have to compute the same thing for each draw, more or less, and then you
average it.
813
:Right.
814
:So in JAX you can just do the vectorized map.
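Here is a small sketch of that fixed-draws-plus-vmap pattern in JAX; the log-density and the mean-field parameterization are toy stand-ins, not the actual pymc-extras implementation.

```python
import jax
import jax.numpy as jnp

# Toy target log-density (a stand-in for the model's log posterior).
def log_p(x):
    return -0.5 * jnp.sum(x**2)

def dadvi_style_objective(params, z_fixed):
    """Negative of a DADVI-style objective: the ELBO estimated on a FIXED
    set of standard-normal draws, so the optimization is deterministic."""
    mu, log_sd = params

    # Reparameterize every fixed draw: x_i = mu + sd * z_i
    def per_draw(z):
        x = mu + jnp.exp(log_sd) * z
        return log_p(x)

    # vmap runs the same computation for each draw in one vectorized pass.
    expected_log_p = jnp.mean(jax.vmap(per_draw)(z_fixed))
    entropy = jnp.sum(log_sd)  # Gaussian entropy up to a constant
    return -(expected_log_p + entropy)

dim, n_draws = 5, 30
key = jax.random.PRNGKey(0)
z_fixed = jax.random.normal(key, (n_draws, dim))   # drawn once, then frozen

params = (jnp.zeros(dim), jnp.zeros(dim))
grad = jax.grad(dadvi_style_objective)(params, z_fixed)
print(grad)  # deterministic: the same z_fixed gives the same gradient every call
```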
815
:It turns out in, in PyTensor actually you could do something similar.
816
:So I think Ricardo pointed that out in the PR.
817
:So I actually switched to PyTensor for the implementation because it worked just as well
on CPUs.
818
:Yeah, it's really cool.
819
:I thought it was the big magic of JAX's vmap.
820
:Yeah, no, but, but yeah, I mean, there was, I forget exactly the name.
821
:But yeah, there's something very similar in PyTensor.
822
:But yeah, what I'm not totally sure about is the GPU support yet.
823
:I'm guessing that maybe JAX still does that a little bit better.
824
:I'm not sure.
825
:But anyway, yeah, that's something I'd like to explore.
826
:like that is, if you can just parallelize all of those executions, right?
827
:That's another kind of potentially a big boost in speed.
828
:So, I'd like to have a look.
829
:mean, yeah, like I said, I used to do that, but then the paper we wrote was more...
830
:focused on just like even running things on CPU, but I think there is at least like a
pretty easy speed up there.
831
:So yeah, I'd like to give that a shot.
832
:Yeah, those are the sort of the next steps.
833
:Yeah.
834
:I mean, Ricardo would be better positioned than myself to answer that, but probably now
that you have everything in PyTensor, and even in JAX, so yeah, I think the
835
:access to the GPU would be much, much easier.
836
:And that's definitely super exciting; that'd be great.
837
:Yeah.
838
:Because, you know, it's like, if you can combine all these, like basically we have all
these different algorithms available on CPU and on GPU each time.
839
:It's like, at some point the combinatorics are really good and you will find a way to fit
your specific model much faster than you would have before, when you had one or two
840
:options and you were done.
841
:So
842
:Basically making sure the combinatorics help.
843
:Like the compounding effect of having different algorithms and different backends in there
means that, you know, whether you're going to use an approximation of your model or
844
:of the algorithm, use a GPU, or maybe both approximations, you know, at some point it will
become very fast.
845
:So, so this is, yeah, I think this is extremely promising.
846
:So, do you need any help on that?
847
:Where can people reach out to you if they want to help you on that?
848
:Yeah, I mean, I guess you can always, you can always email me, I guess is probably the
easiest thing.
849
:So my, yeah, my address is just my first name, last name at Gmail.
850
:Yeah.
851
:And otherwise, I mean, I'm on GitHub a lot.
852
:Your website is in the show notes.
853
:So go there, folks, and yeah, send an email to Martin, or reach out on LinkedIn, and I'm sure
he'll appreciate the help.
854
:And if you want a quick refresher about variational inference, Chris Fonnesbeck and myself
prepared a talk on variational inference for PyData Virginia in
855
:And Chris went to Virginia to present it.
856
:So you'll see there is the YouTube video in the show notes.
857
:That will give you an idea of what the landscape is and where
858
:DADVI fits in all of this, because there are a lot of methods now.
859
:So I understand it can be confusing.
860
:Actually, beyond mean-field, mean-field ADVI.
861
:And do you see pathways for DADVI-style ideas in richer variational families?
862
:Like normalizing flows, for instance, I see it as a very promising one.
863
:Yeah.
864
:What, what do you see here?
865
:What would be the main hurdles?
866
:Yeah, yeah, I agree.
867
:I mean, normalizing flows seem really interesting.
868
:I haven't worked too much with them yet, but I think they do sound really cool.
869
:Yeah.
870
:I mean, the issue is really this like, so maybe I can give a bit of intuition.
871
:So I think I briefly mentioned that full rank doesn't work with DADVI, right?
872
:I should say that it's not actually a big deal, I think, because like full rank is not
usually a good idea in my experience.
873
:Because usually, when you fit variational inference, you have a lot of parameters, and full
rank will basically
874
:square your number of parameters, right?
875
:So anyway, it's not a big loss for this purpose, but you run into this problem with DADVI
that if your approximating family is expressive enough,
876
:it kind of fails.
877
:So with full rank it's kind of easy to see, because, let's say you have 30 draws and you
have a thousand parameters, right?
878
:Then you have like 30 directions in this thousand-parameter space.
879
:And so the approximation is kind of flexible enough to give you a good posterior
880
:estimate, right, like high density, but also kind of use all the other unsampled directions
to make the entropy as big as it wants.
881
:So it basically just blows up.
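Here is a small NumPy illustration of that blow-up, assuming fixed draws and a zero-mean full-rank Gaussian with a standard-normal target: stretching a direction orthogonal to all 30 draws leaves the fixed-draw log-density term untouched while the entropy term grows without bound.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_draws = 1000, 30
Z = rng.standard_normal((n_draws, dim))   # the fixed draws

# Find a direction v orthogonal to all 30 draws (possible because n_draws < dim).
Q, _ = np.linalg.qr(Z.T)                  # columns span the draws' subspace
v = rng.standard_normal(dim)
v -= Q @ (Q.T @ v)                        # project out that subspace
v /= np.linalg.norm(v)

def fixed_draw_elbo(c):
    # Full-rank scale L = I + c * v v^T stretches only the unsampled direction.
    L = np.eye(dim) + c * np.outer(v, v)
    X = Z @ L.T                           # x_i = L z_i  (mu = 0)
    log_p_term = np.mean(-0.5 * np.sum(X**2, axis=1))   # standard-normal target
    entropy_term = np.linalg.slogdet(L)[1]               # log |det L|
    return log_p_term + entropy_term

for c in [0.0, 10.0, 1000.0]:
    print(c, fixed_draw_elbo(c))
# The log_p term is (numerically) unchanged because L z_i = z_i for every fixed
# draw, but log|det L| = log(1 + c) keeps growing -- the objective blows up.
```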
882
:And I guess I'm concerned that with normalizing flows you might have the same problem: if,
you know, your transformations, your neural nets that transform
883
:the draws, are complex enough, they might be able to exploit something like that, you know.
884
:That said, I mean, I do wonder if there's some...
885
:tricks you could pull, you know, like maybe you somehow stop it from blowing up, right?
886
:Like by constraining things a little bit.
887
:That might be interesting, I think, to kind of get closer to full rank, maybe not all the
way there, but somehow find some middle ground where it doesn't break
888
:anything, but still improves the approximation.
889
:But yeah, with normalizing flows, I'm, I'm not sure.
890
:Yeah, I mean, it'd be really cool, but I think it might be more like a smaller step, like
maybe, you know, some lower-rank thing or something could...
891
:Yeah.
892
:Yeah.
893
:Okay.
894
:Yeah, that'd be, that'd be super fun.
895
:And now, so I said that already, but in case people missed it, there is normalizing flow
adaptation in Nutpie that you can use for very big models usually, which have lots of
896
:correlation in the posterior.
897
:So very complicated posterior geometry.
898
:You can use Nutpie to fit a neural network to find the normalizing flow
899
:during the adaptation phase, and then that will increase the sampling efficiency of MCMC
and decrease the sampling time.
900
:I will link to the Nutpie documentation where Adrian explains that, because it's really good.
901
:Normalizing flow adaptation, really recommend it.
902
:At least check it out because I mean, it's going to be rare that you need that, but when
you do, it's going to help you a lot.
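For reference, using it looks roughly like this; the backend choice and the transform_adapt flag are my reading of the current Nutpie documentation, so double-check the docs linked in the show notes before relying on them.

```python
import pymc as pm
import nutpie

with pm.Model() as model:
    # A posterior with awkward geometry -- e.g. a funnel-like hierarchy.
    log_sd = pm.Normal("log_sd", 0, 3)
    pm.Normal("x", 0, pm.math.exp(log_sd), shape=9)

# The flow adaptation relies on the JAX backend for the transform.
compiled = nutpie.compile_pymc_model(model, backend="jax", gradient_backend="jax")

# transform_adapt=True turns on the normalizing-flow reparameterization learned
# during warmup (flag name as I understand the Nutpie docs at the time of writing).
idata = nutpie.sample(compiled, transform_adapt=True)
```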
903
:Maybe a last question before the last two questions, Martin, because it's getting late for
you.
904
:I appreciate you staying up that late, but I'm curious, you know, for practitioners or
researchers new to variational inference, what conceptual or practical pitfalls should
905
:they be most wary of?
906
:You know, is it convergence diagnostics, the lack of dispersion, or misspecification, or...
907
:Something else.
908
:Yeah, I mean, so I think the big one, I do hope that DADVI in particular, right, will kind
of remove your convergence headaches.
909
:But I think there is still this concern that mean field might just not be expressive enough.
910
:So I think, yeah, you have to be a bit careful.
911
:Right.
912
:Like I think you suggested before, Alex, you could look at whether you can recover some
known parameter estimates in some model, you know, that you set up.
913
:Just to build some confidence in whether this thing is going to work. And I mean, the
scenario would be kind of like if NUTS is just too slow, right?
914
:And you have such a big data set, you kind of need something, but you want to check it
works.
915
:So I think that would be the main pitfall is just be a little bit careful.
916
:That's for DADVI; I think for variational inference in general, right?
917
:There's way more, there's like a lot of methods, like I think you already alluded to,
right?
918
:But yeah, I think a lot of it is that, like, just don't trust it
:blindly, I think I would say, just make sure you have some kind of check to like see.
920
:I mean, even if it's like, even if it's like a holdout set, right?
921
:Like maybe you care about prediction more than anything.
922
:Like, I mean, in Bayesian stats, I think we often talk about like uncertainty
quantification, but I think a lot of the time you also just want to be able to write down
923
:the model you want, right?
924
:And like get good means or whatever, or get pretty good predictions.
925
:Like, I don't know, maybe if you do sports prediction, right?
926
:You might not necessarily care about the variance of your like residuals or something,
right?
927
:But you, want to make good predictions.
928
:So.
929
:So have some kind of way, maybe a holdout set where you compare if you care about
predictive performance, right?
930
:Like, is this thing actually doing better than MCMC with a subsample? You know, just make
sure to have a look at that.
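A minimal sketch of that holdout comparison, assuming a simple linear regression; ADVI again stands in for whichever approximation you are validating, and the score is the pointwise log predictive density on the held-out rows.

```python
import numpy as np
import pymc as pm
from scipy import stats
from scipy.special import logsumexp

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(0, 1.0, size=500)

X_train, y_train = X[:400], y[:400]
X_test, y_test = X[400:], y[400:]

with pm.Model() as model:
    beta = pm.Normal("beta", 0, 2, shape=3)
    sigma = pm.HalfNormal("sigma", 1)
    pm.Normal("y", X_train @ beta, sigma, observed=y_train)

    idata_mcmc = pm.sample(1000, chains=2, random_seed=1)
    idata_vi = pm.fit(method="advi", n=20_000).sample(1000)

def holdout_log_score(idata):
    """Average log pointwise predictive density on the holdout set."""
    post = idata.posterior.stack(draw_all=("chain", "draw"))
    betas = post["beta"].values.T          # (n_draws, 3)
    sigmas = post["sigma"].values          # (n_draws,)
    mu = betas @ X_test.T                  # (n_draws, n_test)
    lp = stats.norm.logpdf(y_test, loc=mu, scale=sigmas[:, None])
    # Average over draws inside the log (the lppd), then over test points.
    lppd = logsumexp(lp, axis=0) - np.log(lp.shape[0])
    return lppd.mean()

print("MCMC:", holdout_log_score(idata_mcmc), "VI:", holdout_log_score(idata_vi))
```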
931
:So DADVI takes away the optimization headaches, but it still leaves, I think, the worry of,
yeah, is mean field just going to somehow mislead me here?
932
:Yeah.
933
:Yeah, that makes sense.
934
:Yeah.
935
:I would say that also it depends.
936
:I think these questions depend a lot on the sample size you have.
937
:So if you do sports predictions and you care about players, you may not have enough sample
size for each player.
938
:And a lot of the metrics you look at for the players are correlated.
939
:So often you do care about the correlations, and often in sports you care about the tails.
940
:The tail behavior is very important, because some players are in the tail of the behavior,
but you still care about it,
941
:because precisely they can be the best at it.
942
:So yeah, it depends, as you say.
943
:Awesome.
944
:Martin, fantastic.
945
:I won't take more of your time than that, but it was amazing to have you on the show.
946
:Thank you so much.
947
:I think you did a great job explaining what DADVI is, where it's useful, and how we can
use it for our models.
948
:Anything you want to add that maybe you would like to
949
:talk about that I didn't mention?
950
:No, not really.
951
:Thanks so much.
952
:Yeah, I really enjoyed talking to you.
953
:Yeah, thanks so much.
954
:Awesome.
955
:Yeah.
956
:So first, of course, before we let you go, I'm going to ask you the questions I ask every
guest at the end of the show.
957
:So first one, if you had unlimited time and resources, which problem would you try to
solve?
958
:Yeah, maybe not a big surprise, but I love this idea of trying to make Bayesian inference
faster.
959
:So, by whatever means. I mean, obviously I think the dream would be that you just have
something like MCMC, but incredibly fast, basically as
960
:accurate as possible and as fast as possible.
961
:So I think that'd be a cool, cool thing to keep chipping away at.
962
:Yeah.
963
:I think that would, that would be my answer.
964
:Yeah.
965
:It sounds fun.
966
:And second one, if you could have dinner with any great scientific mind, dead, alive or
fictional, who would it be?
967
:Yeah, I thought, I thought about this a little bit.
968
:I think, I mean, I have kind of two. So one, I thought about, you know, a dead one
(sometimes, you know, I think about these things), but yeah, a dead one. So I just
969
:was wondering what Gauss would have been like actually, you know, Carl Friedrich Gauss,
because all the stories you hear are like, this guy's just kind of
970
:bananas, right?
971
:Like, I mean, I think one story I remember was that.
972
:There was some other guy who supposedly invented non-Euclidean geometry and then wrote
about it.
973
:Then Gauss was like, yes, I worked this out 10 years ago, but I just couldn't be bothered
to publish it or something, you know, but it was like, how is this
974
:possible?
975
:So yeah, just out of interest in what he would have been like. And then I guess alive, I
think probably Andrew Gelman, actually; I think it'd be cool to
976
:talk to him.
977
:I think what I really love about him is that he seems to straddle both the theoretical side
and the practical side, you know, kind of like
978
:he really loves, I think, fitting models, but he's also really interested in how to do it
better.
979
:Yeah.
980
:So, so yeah, I'm just a big fan, I guess.
981
:Yeah.
982
:Yeah.
983
:Yeah.
984
:Me too.
985
:I think it's a great, great choice.
986
:Yeah.
987
:I can tell you that.
988
:Yeah.
989
:Andrew is so knowledgeable about so many things.
990
:Like, yeah, it's incredible.
991
:You learn so much each time you talk to him.
992
:So I definitely agree with your choice.
993
:Fantastic.
994
:Well, Martin, thank you so much for taking the time.
995
:But mostly, also, thank you so much for taking the time to add DADVI into PyMC and making it
better, and for writing blog posts about it and taking time to be on a podcast to explain all
996
:that.
997
:And you're doing all of that for free.
998
:So folks, you should definitely give some love to Martin and write to him just to thank him.
999
:And if you can help him, that's amazing.
:
01:08:15,567 --> 01:08:17,749
But yeah, like thank you so much for doing that.
:
01:08:17,749 --> 01:08:30,055
I think it's like, yeah, this is really thanks to people like you taking time like that,
that we can make the software better for everybody and help science along the way, which I
:
01:08:30,055 --> 01:08:34,327
think we all hold dear here on the show.
:
01:08:34,327 --> 01:08:38,880
So yeah, thank you so much, Martin, for taking the time and being on this show.
:
01:08:38,880 --> 01:08:40,050
Thanks so much for having me, Alex.
:
01:08:40,050 --> 01:08:41,801
It's been a pleasure.
:
01:08:45,163 --> 01:08:48,894
This has been another episode of Learning Bayesian Statistics.
:
01:08:48,894 --> 01:08:59,377
Be sure to rate, review, and follow the show on your favorite podcatcher, and visit
LearnBayesStats.com for more resources about today's topics, as well as access to more
:
01:08:59,377 --> 01:09:03,458
episodes to help you reach true Bayesian state of mind.
:
01:09:03,458 --> 01:09:05,419
That's LearnBayesStats.com.
:
01:09:05,419 --> 01:09:10,260
Our theme music is « Good Bayesian » by Baba Brinkman, feat. MC Lars and Mega Ran.
:
01:09:10,260 --> 01:09:13,421
Check out his awesome work at BabaBrinkman.com.
:
01:09:13,421 --> 01:09:14,613
I'm your host,
:
01:09:14,613 --> 01:09:15,594
Alex Andorra.
:
01:09:15,594 --> 01:09:19,833
You can follow me on Twitter at alex_andorra, like the country.
:
01:09:19,833 --> 01:09:27,102
You can support the show and unlock exclusive benefits by visiting Patreon.com/LearnBayesStats.
:
01:09:27,102 --> 01:09:29,484
Thank you so much for listening and for your support.
:
01:09:29,484 --> 01:09:31,775
You're truly a good Bayesian.
:
01:09:31,775 --> 01:09:38,590
Change your predictions after taking information in, and if you're thinking I'll be less than
amazing,
:
01:09:38,590 --> 01:09:41,852
Let's adjust those expectations.
:
01:09:41,852 --> 01:09:43,575
Let me show you how to
:
01:09:43,575 --> 01:09:55,012
be a good Bayesian. Change calculations after taking fresh data in. Those predictions that
your brain is making? Let's get them on a solid foundation.