#78 Exploring MCMC Sampler Algorithms, with Matt D. Hoffman
Episode 78 • 1st March 2023 • Learning Bayesian Statistics • Alexandre Andorra
Duration: 01:02:40


Shownotes

Proudly sponsored by PyMC Labs, the Bayesian Consultancy. Book a call, or get in touch!

Matt Hoffman has already worked on many topics in his life – music information retrieval, speech enhancement, user behavior modeling, social network analysis, astronomy, you name it.

Obviously, picking questions for him was hard, so we ended up talking more or less freely — which is one of my favorite types of episodes, to be honest.

You’ll hear about the circumstances in which Matt would advise picking up Bayesian stats, generalized HMC, blocked samplers, why the samplers he works on have food-based names, etc.

In case you don’t know him, Matt is a research scientist at Google. Before that, he did a postdoc in the Columbia Stats department, working with Andrew Gelman, and a PhD at Princeton, working with David Blei and Perry Cook.

Matt is probably best known for his work in approximate Bayesian inference algorithms, such as stochastic variational inference and the no-U-turn sampler, but he’s also worked on a wide range of applications, and contributed to software such as Stan and TensorFlow Probability.

Our theme music is « Good Bayesian », by Baba Brinkman (feat MC Lars and Mega Ran). Check out his awesome work at https://bababrinkman.com/ !

Thank you to my Patrons for making this episode possible!

Yusuke Saito, Avi Bryant, Ero Carrera, Giuliano Cruz, Tim Gasser, James Wade, Tradd Salvo, William Benton, James Ahloy, Robin Taylor, Thomas Wiecki, Chad Scherrer, Nathaniel Neitzke, Zwelithini Tunyiswa, Bertrand Wilden, James Thompson, Stephen Oates, Gian Luca Di Tanna, Jack Wells, Matthew Maldonado, Ian Costley, Ally Salim, Larry Gill, Joshua Duncan, Ian Moran, Paul Oreto, Colin Caprani, Colin Carroll, Nathaniel Burbank, Michael Osthege, Rémi Louf, Clive Edelsten, Henri Wallen, Hugo Botha, Vinh Nguyen, Raul Maldonado, Marcin Elantkowski, Adam C. Smith, Will Kurt, Andrew Moskowitz, Hector Munoz, Marco Gorelli, Simon Kessell, Bradley Rode, Patrick Kelley, Rick Anderson, Casper de Bruin, Philippe Labonde, Michael Hankin, Cameron Smith, Tomáš Frýda, Ryan Wesslen, Andreas Netti, Riley King, Yoshiyuki Hamajima, Sven De Maeyer, Michael DeCrescenzo, Fergal M, Mason Yahr, Naoya Kanai, Steven Rowland, Aubrey Clayton, Jeannine Sue, Omri Har Shemesh, Scott Anthony Robson, David Haas, Robert Yolken, Or Duek, Pavel Dusek, Paul Cox, Trey Causey, Andreas Kröpelin, Raphaël R, Nicolas Rode and Gabriel Stechschulte.

Visit https://www.patreon.com/learnbayesstats to unlock exclusive Bayesian swag ;)

Links from the show:


Abstract

written by Christoph Bamberg

In this episode, Matt D. Hoffman, a Google research scientist, discussed his work on probabilistic sampling algorithms with me. Matt has a background in music information retrieval, speech enhancement, user behavior modeling, social network analysis, and astronomy.

He came to machine learning (ML) and computer science through his interest in music synthesis, and later took a Bayesian modeling class during his PhD.

He mostly works on algorithms, including Markov Chain Monte Carlo (MCMC) methods that can take advantage of hardware acceleration, believing that running many small chains in parallel is better for handling autocorrelation than running a few longer chains. 

Matt is interested in Bayesian neural networks but is also skeptical about their use in practice. 

He recently contributed to a generalized Hamiltonian Monte Carlo (HMC) sampler called MEADS, and previously worked on ChEES, an alternative to the No-U-Turn Sampler (NUTS). We discuss the applications for these samplers and how they differ from one another.

In addition, Matt introduces an improved R-hat diagnostic tool, nested R-hat, that he and colleagues developed. 


Automated Transcript

Please note that the following transcript was generated automatically and may therefore contain errors. Feel free to reach out if you’re willing to correct them.

Transcripts

1:59

Okay, so now, can you hear me well? Yes. Okay, and can I hear you?

2:12

I think so. Can you hear me?

2:14

Okay? Yeah, yeah. Oh good. Oh awesome. Yes. Thank you. Sorry for that slight delay. As you know, technology is sometimes not on our side.

2:33

someday, someday we'll make it work. It's gonna be great.

2:39

Because then we won't have work! Oh, yeah. So, just before anything: there is someone else in the chat. It's a bot, so don't worry; it's just listening to us so that we'll have a transcript of the episode. And yeah, it's an automated transcript. I have to tell guests, because sometimes people freak out when they see a bot in the middle of the chat.

3:17

So I won't say anything that I don't want our future AI masters to hear.

3:25

Exactly. Yeah. I can see you're very smart. Okay, so that's perfect. Are you all set on the Audacity front?

3:36

I think so.

3:39

Okay, perfect.

3:41

Yeah. Um, you know, hopefully the noise isn't too bad; hopefully the sound quality isn't too bad. I wasn't able to find a good, like, microphone setup. Whatever. Hopefully it's okay.

4:05

And I'm gonna send you the link to the backup room that we have; it's in the Google Meet chat. I usually don't use that, but we never know, sometimes Zencastr has troubles, so at least I have a backup. And once you're in, you're gonna hear a small echo, because we'll be in Zencastr and Google Meet at the same time. So you should then mute either Zencastr or Google Meet, so that there is no echo, basically. Do you know how to do that?

4:42

You would think so?

4:47

So you go to the tab that you want to mute, and you right-click on the tab, and then you'll see the option to mute that website. Got it. Okay, now you should hear me normally, without an echo.

5:03

That's right. Yes. All is well. Good.

5:07

Perfect. So that means that we are all set on the technical front. Any comments or questions before we start?

5:17

Ah, no. Actually, on the technical front, should I be seeing anything happening in the Zencastr thing?

5:31

Like it's okay I'd be a little bit okay, you

5:37

can see it. Okay, good.

5:40

You don't have to do anything else; that's all on my end. And you should only start Audacity when we start recording.

5:51

Okay, sounds good. I just, you know, when you're talking, I see little wiggles on the line, and when I'm talking, too. So, yeah.

6:00

Okay. You're all green in Zencastr, so we should be good.

6:05

Okay, as long as it's cool. If you're good with it, I'm good with it. Cool. Awesome.

6:13

So, where are you joining from, by the way? Are you in New York? Okay, cool.

6:30

Yeah. Yeah, spring or fall are good times.

6:36

Hey, man. If I remember correctly, you know Thomas, right?

6:46

Oh, yeah, we've talked a few times.

6:49

Okay. And I'm curious: did you already know the show, or was it totally new for you when I contacted you?

6:58

I knew of it, but I wasn't, like, super familiar with it. Like, yeah, I looked through it.

7:10

Yeah. I mean, it's not a requirement, I'm just curious. Some guests are really fans, so they're like, "oh, that's so cool," and some people didn't even know about the podcast. So guests have all kinds of reactions.

7:27

What is a podcast?

7:32

Yeah, that I've never had. But I'm starting to have some people for whom it's their first time coming on a podcast, for instance. So that's quite cool. So listen, I have so many questions that we should start.

7:52

Sure, okay, yeah. Start asking away.

7:56

Yes, same for me, on Zencastr. Matt Hoffman, welcome to Learning Bayesian Statistics.

8:08

Thank you. It's great to be here.

8:11

Yeah, thanks a lot for taking the time. Really happy to have you on the show; your name came up quite a few times in the Slack of the patrons. So finally, we made it, and I'm sure lots of listeners are going to be happy when that episode hits the press. So thanks for taking the time. And as usual, let's start with your origin story: basically, tell us how you came to the stats and data world, and how smooth this path was.

8:50

This would have been in like:

12:09

Okay, so basically, yeah, it started more on the music front, and then the methods themselves became kind of your focus, on the statistical side of things.

12:23

Yeah, yeah. And it was a pretty gradual process. You know, until the very last year of my PhD, everything I'd done was focused on some music application; it wasn't until the online latent Dirichlet allocation stuff that I started looking at something that didn't have an explicit music connection, although I had already kind of gotten interested in LDA as a way of doing source separation for audio.

12:58

And so that begs the question: were you in a band at some point in your life, Matt?

13:06

I have been in at least one band at some point in my life.

13:13

And by any chance, was it called Latent Dirichlet Allocation?

13:20

It was not. No. That's a shame,

13:23

because that would be an amazing name. If I were in a band, I would definitely name it Latent Dirichlet Allocation. That sounds amazing.

13:35

Yeah, I mean, you know, it's, it's not too late.

13:42

So, for listeners who don't have the video, I don't think Matt is approving of my name idea. But that's so cool. And so, yeah, okay, thanks; you actually answered the second question I wanted to ask you, which is whether you remembered how you first got introduced to Bayesian methods, and it seems like you do. And then, today, how frequently do you use them? Because you work a lot on algorithms, and kind of an irony I found when recording all these podcasts is that usually the people who work on the algorithms don't work a lot on models, and so, in a way, they don't use a lot of Bayesian models. But they are designing the algorithms that everybody's using in the Bayesian world. So, what about you?

14:32

Yeah, I mean, it's interesting, right? I feel like that's what we've always said is one of the strengths of the paradigm: that we can separate modeling and inference, and actually have that separation of concerns. So, to the extent that that's true, I guess the system may be working as intended. But yeah, as far as applied work is concerned: whatever works, I guess. I don't do a ton of first-author-style applied work on my own these days, but I try to only work on things where my expertise is actually relevant. That's kind of a luxury of working at an organization as large as Google: they have other people who can do the other stuff. And so, yeah, it's pretty rare for me to work on something that doesn't have some probabilistic modeling component to it, because somebody else can probably do the rest better than I can. So, I don't know if that fully answers your question. I guess as far as Bayesian versus non-Bayesian is concerned, or point estimates versus full posteriors, or something in the middle like variational inference: again, whatever works, right? It just depends on what the task really calls for, and what the computational limitations are. But I like problems where there's actually a good reason to be properly Bayesian, if not at every level. Then maybe you're fitting a point estimate, doing an empirical Bayes thing to get a prior and fitting that to a bunch of data, but hopefully you've got some kind of interesting inference problem where there's actually some good reason to solve the inference problem.

17:00

I think, yeah. And in any case, it's interesting for listeners to know where you come from, because what you're doing day to day is going to inform everything we're going to talk about a bit later, when we dive into more technical things. So actually, can you define the work you're doing nowadays, to give people an idea, and also the topics that you are particularly interested in?

17:30

rning revolution of the early:

24:16

That's kind of the tagline of Learning Bayesian Stats, no? Okay, well, thanks so much for all those details. I want to kind of get into all of that, but we'll see if we have time. Definitely something I want to ask you: as I started preparing for this episode, I saw that you've already worked on so many topics in your life. As you were saying, you worked on music information retrieval, but you also did some work on speech enhancement, user behavior modeling, social network analysis, and even astronomy. So I'm wondering: is there a common interest that unites all of those fields?

25:09

Yeah, I don't know. Interestingness, I guess, right? You know, I think this is one of the nice things about working on a kind of foundational technology, like machine learning or statistics: you can actually contribute something to fields that you don't have a ton of expertise in. And so you get to kind of be a dilettante and dabble in different things, while hopefully actually doing something useful. There are a lot of interesting things in the world, so I guess the common thread is just: is there something that I think I could do here? Other than that, I don't know.

26:06

Yeah, I guess you're a curious person. That's also why you did so many things, I guess.

26:14

Yeah. Fair enough. I think that's fair.

26:19

And also, obviously, you're a co-founder of Stan. So I'm curious, and you talked a bit about that already, but it's a question I get a lot as I'm going through the podcast, and also through the workshops and teaching I do: for which problems, or in which circumstances, would you advise someone to pick up Stan and Bayesian stats?

26:56

Yeah, I mean, I think there are obvious cases, which is: you know that you need to do some statistics, like you gathered some data and you have to do some statistics on it, and an off-the-shelf, closed-form package isn't going to work for some reason. That's kind of a vacuous answer, I guess, in some ways. But I think the larger question is: when do you actually need to do fully Bayesian inference in an interesting model? And the answer that I gravitate towards is decision problems. So, you know, I'm not a religious Bayesian; I think the ontological status of subjective probabilities is the thing that feels a little slippery to me. But I think the Bayesian decision theory story is simple and rock solid: if you want to make decisions that lead to good outcomes, then you have to maximize your expected utility under some distribution over possible outcomes, with respect to your degrees of belief. And if you look at where Bayesian methods are really useful, those are the situations: where they're tied to some decision-making task. Much more, I think, than just somehow getting a better point estimate or something like that. I could be convinced that there are, or will be, situations where people find ways to make the Bayesian framework really pay off that don't have that flavor, where it's just about sort of being right, but it feels like a less obvious argument to me. That you need to make decisions, and you have some data, and you want to use that data to make better decisions: that, to me, is at least the clearest use case for Bayesian methods.
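In symbols, the decision-theoretic story he is pointing at is the standard one: choose the action that maximizes expected utility under your posterior degrees of belief,

```latex
\[
a^{\star} \;=\; \operatorname*{arg\,max}_{a \in \mathcal{A}}\;
\mathbb{E}_{\theta \sim p(\theta \mid y)}\!\left[\, U(a, \theta) \,\right],
\]
```

where \(U(a, \theta)\) is the utility of taking action \(a\) when the state of the world is \(\theta\), and \(p(\theta \mid y)\) is the posterior given data \(y\).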

29:25

Yeah, I like that. And, I mean, in my experience too, some people come to the Bayesian framework with, you know, the epistemological hat on, but the vast majority of people come to it from the practicality standpoint, especially in the research area, where it's like: basically, please help me do what I want to do, way more easily. And then I understood the ramifications and why it's justified, and I found that interesting, but that was not my main motivation to start with.

30:04

Yeah, yeah, absolutely. And I think, again, it's the separation-of-concerns thing to some degree, right? Rather than trying to design an algorithm that's going to do the thing that you want it to do, you can write down some assumptions, and trust that there is at least some possibility that an algorithm can be found that will take those assumptions and give you a reasonable answer. Obviously, there are a lot of caveats on that little fairy tale. And, you know, I work at Google; we talk about deep learning an awful lot, and deep learning has also logged some enormous successes, I think for a lot of the same kinds of reasons: the separation of modeling and algorithms. The way you do deep learning, right, is you construct a function, and then you optimize it with stochastic gradient descent. That's also been a very, very successful paradigm, I think for very much the same reasons. It's just that the paradigm there is: I have inputs and outputs, and I would like a machine that gives me the right outputs. And it's sort of less clear how to make that work in situations where what you really want is something with a little more causal structure.

31:52

Okay, yeah, super interesting. I'll definitely think about that; it's also going to help me answer these kinds of questions, so thanks for letting me pick your brain. And now, let's get a bit more technical, because that's also why I like to do these episodes. So, if I remember correctly, in your MEADS paper, and we'll definitely link to that in the show notes, you use generalized HMC, so Hamiltonian Monte Carlo. Can you give us the gist of it, and tell us if you think it's really one of the main future avenues for HMC?

32:45

Sure, yeah. So, um, you know, generalized HMC is this thing that's been around for quite a long time now; it came out just a few years after the original hybrid Monte Carlo paper. And basically, in Hamiltonian Monte Carlo, you alternate between resampling some momentum variables, which correspond to the variables that you actually want to do inference on, and simulating some Hamiltonian dynamics, where you have a potential energy function that's given by your negative log posterior, or negative log joint, same thing up to a constant. And if you simulate those dynamics well, then the total energy doesn't change, and it's reversible, and everything is very beautiful: an elegant kind of way of solving the problem of moving around efficiently in high dimensions. But that's classic hybrid Monte Carlo: you resample your momentum, that's a Gibbs step, and you do this Hamiltonian dynamics update, which is at the heart of the algorithm and sort of mixes entropy between these refreshed momenta and the position variables, which are the things you care about. In generalized HMC, the trick, the kind of big idea, is to not completely update the momentum variables, but instead to essentially damp them, and then make up for that contraction by adding some additional Gaussian noise. In a physics sense, it's as though you introduced friction, which is a pretty good way to think about it, right? You have a particle that has mass, and so momentum, and it's being banged around by some Brownian motion, and there's also some friction that makes up for the additional energy that's coming in in the form of that Brownian motion. And so it has this sort of underdamped Langevin equation kind of feel to it, if you only simulate for a small period of time before partially refreshing the momentum. So what's nice about it is that it does have this cute continuous-time limit that looks like an underdamped Langevin equation, and that's a nice, intuitive, physics-y thing, and it's also a lot easier to do theory on than this kind of alternating-Gibbs-flavor thing. If you look in the literature, there's way more theory about convergence of underdamped Langevin methods than there is about HMC, again because this continuous-time limit is available to work with, in a way that's harder to think about with HMC. Um, the downside of generalized HMC, and the reason it hasn't been more widely used in practice, is that it exposes you to this gauntlet of accept-reject steps, where after every one of these updates there's a chance that you will reject. And there's a straightforward kind of Jensen's-inequality argument that shows that this can actually be quite a big penalty, so you wind up needing to use very small step sizes, and it's just not practical.
Unless you use the slice sampling trick that Radford Neal introduced a few years ago, which, as one might expect, is elegant and clever and the sort of thing that somebody should have thought of a long time ago. But that's why he's Radford Neal: he makes it look obvious in hindsight. I should say there's also another strategy, which is what Jascha Sohl-Dickstein called the look-ahead HMC strategy, which you can also sort of interpret through a slice sampling lens. But anyway, so basically that's what we built on in the MEADS paper: this innovation that made it possible to use generalized HMC without this horrible rejection behavior. Um, and so, okay, that's sort of the what; now, what's the why? One answer is just: it was there to be done, and it seemed like we could do something interesting. That's not a good enough answer, but, well, it can be a good enough answer, I guess. I think, practically speaking, it's not really obvious to me that one method dominates the other. There will be situations where it makes sense to use one, and situations where it makes sense to use the other, but I don't think it's like we're never going to use the classic full-momentum-refreshment HMC. I would say almost anything you can do in one framework, you can probably get to work in the other; it just might be more or less convenient from a derivation and coding perspective. You know, I think one possible application for the generalized HMC strategy, something that it really does make easier, an affordance that I think you don't get with classic HMC, is the ability to more frequently interleave Gibbs updates for some variables, for example discrete variables, that might not be natural to update with HMC. In general, I would never, well, maybe not never, but it's hard for me to think of a situation where I would advocate doing HMC alternating between some variables and some other variables, a sort of HMC-within-Gibbs kind of thing; I would usually rather do HMC on everything jointly. Because otherwise, the whole point of HMC is that you're suppressing random walk behavior, and if you're doing something that looks like Gibbs, you're going to get random walk behavior, and so you're sort of missing the point. There could be situations where there are computational reasons why it's much cheaper to update one set of variables or something; in that kind of situation, I would usually rather put that logic into the integrator itself, rather than doing some Hamiltonian-splitting thing, rather than an outer-loop kind of thing. But anyway, the point is that with discrete variables, that's not so much of an option, right? The ways to integrate discrete variables into HMC exist, but they're definitely not perfect. So a thing you can do, and this is something that Robert mentioned in his paper, is you could alternate every gradient step with doing Gibbs updates on some discrete things, which I think is a potentially interesting move.
My intuition is that in that situation, you do still have to be careful about random walk behavior, right? Because now there's all this Gibbs noise that might interfere with the coherence of the momentum-based exploration that you get in generalized HMC. And so, to keep that nice, coherent-momentum kind of exploration, you might need to use a pretty small step size, so that you're essentially averaging away that Gibbs noise. The way that I think about it is: if you have a random value for your discrete variables, the gradient you compute is like the expected gradient with respect to the conditional for those discrete variables given your continuous state, which is the gradient that you want, plus some noise. And that noise is going to bounce you around, and might make you forget your momentum faster than you want to. But if you take smaller steps and update those discrete variables more frequently, then you're still getting this random noise; but if you take ten steps with one tenth the step size, there's sort of a continuous limit where that looks a lot like averaging these independent gradients, and getting rid of a bunch of the noise that way.
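For a concrete picture of the update Matt describes, here is a minimal sketch of one generalized-HMC transition: partial momentum refreshment, a single leapfrog step, and an accept/reject test with a momentum flip on rejection. The function names and the one-leapfrog-step simplification are illustrative assumptions, not the MEADS implementation.

```python
# Illustrative sketch of one generalized-HMC transition (not MEADS itself).
import numpy as np

def ghmc_step(x, p, log_prob, grad_log_prob, step_size, damping, rng):
    # Partial momentum refreshment: damp the old momentum and top it up
    # with fresh Gaussian noise so the marginal stays N(0, I).
    alpha = np.exp(-damping)
    p = alpha * p + np.sqrt(1.0 - alpha**2) * rng.standard_normal(x.shape)

    # One leapfrog step of the Hamiltonian dynamics.
    p_half = p + 0.5 * step_size * grad_log_prob(x)
    x_new = x + step_size * p_half
    p_new = p_half + 0.5 * step_size * grad_log_prob(x_new)

    # Accept/reject on the change in total energy (negative log joint
    # plus kinetic energy).
    def energy(x_, p_):
        return -log_prob(x_) + 0.5 * np.sum(p_ ** 2)

    if np.log(rng.uniform()) < energy(x, p) - energy(x_new, p_new):
        return x_new, p_new
    # Flipping the momentum on rejection keeps the chain reversible;
    # these flips are exactly the rejections that hurt GHMC's mixing.
    return x, -p
```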

42:41

Now, of course, that might get expensive, and it also sort of starts to shade into something that looks a little bit like a pseudo-marginal MCMC kind of strategy. So, you know, I guess that's a theme: the longer I think about these things, the more it feels as though the design space is not as big as it looks, in some sense. We have a relatively small number of strategies for solving various kinds of problems, and they might be derived very differently, and look different in the way that you implement them, but if you take a step back and look at them from the right perspective, they're still kind of exploiting the same levers, and wind up having similar scaling behavior, and there winds up being, you know, no free lunch, basically. That's obviously a bit hand-wavy, but anyway, I guess the answer to your question is: I think there are some interesting affordances that generalized HMC offers, and also, just from a coding perspective, it may be nicer or less nice in some frameworks, but I don't think it dominates the classic full-momentum-refreshment strategy, necessarily.

44:22

Oh, okay, I see. So in your mind, it's more something that's complementary, rather than something that's going to take over what already exists?

44:36

Yeah, I think that's right. I think it's, you know, just another part of the design space.

44:43

Okay. Are you already able to say in which cases it will be more interesting than the current default? Let's pinpoint NUTS; I'm taking that as a baseline because it's what Stan uses by default, and PyMC too.

45:06

Right. Well, so, you know, I think one nice thing about MEADS is that it's sort of aimed at this many-chain regime. So we've been talking about generalized HMC, but the point of MEADS, the real contribution to me, is the automatic tuning strategy, and that I'm pretty optimistic about, under certain circumstances. So one nice thing about MEADS is that it lets you do adaptation of the things we like to adapt, like step size and preconditioning, and the damping parameter for generalized HMC, which serves a similar role to the number-of-leapfrog-steps parameter in vanilla HMC that NUTS was designed to get rid of. So MEADS offers ways of tuning all of these things in a way that still preserves detailed balance and keeps the target distribution invariant. And that's nice. You know, I'm not that worried about the bias that comes from step size adaptation, and to some extent preconditioner adaptation, during warm-up, especially if you have a fair number of chains to average that signal over; I don't think that bias is huge in practice, and I think you can get rid of it pretty quickly by freezing the adaptation and taking a few extra steps. But there could be situations where you want to use this as part of a larger procedure, where you really don't want to do that warm-up thing, where it just is unwieldy for some reason. So I think that could be interesting. Or, you know, it's also just nice to not have to worry about it. Situations where I don't think it's going to work: I think if you have serious multimodality, that could very easily break it, because it really is assuming that the same step size and damping parameter and preconditioner are appropriate for all of your chains. And if that's not true, then it's going to have to make some kind of compromise, and that's not necessarily going to be the compromise that you want. Whereas, at least in principle, if you ran multiple NUTS chains in parallel, and they found different modes that had different scales and wanted different numbers of leapfrog steps, in principle NUTS could handle that. Now, if you're in that situation, you might have other problems, but at least, you know, I think for robustness NUTS is probably still gonna win. But NUTS is not a simple algorithm to implement, and it has a lot of control flow and sort of ragged computation that introduce a lot of overhead when you're trying to run it on the kind of SIMD hardware that we like these days, like GPUs. And also, NUTS does waste a certain number of gradient steps in satisfying detailed balance, because it has to do this forward and backward exploration, and it's going to wind up exploring states and computing gradients of states that are not on the path between the initial state and the state that you want to move to. So on average, that's going to be about a factor-of-two inefficiency that you get with NUTS. If you really had enough parallel computation, you could probably get that down by some factor with speculative execution and stuff like that, but, you know, it's messy.

49:22

Hmm, I see. Okay, nice. I mean, yeah, the fact that we're starting to have all that diversity of algorithms recently is actually super cool, because then you can definitely envision that some types of models are going to sample better with one kind of algorithm, and another type with another one. And ideally, you would love Stan and PyMC and so on to see that automatically, in a way, and just say: oh, you're fitting that kind of model, we're using NUTS; oh, you're fitting that kind of model, we're going to use MEADS. And the user doesn't have to worry about that, because, as you say, we enforce the separation of modeling and inference.

50:11

Yeah. Or, you know, if you have multiple computers, you try both ways.

50:18

Yes. But I mean, that would work for you and me, but not for people who don't really know a lot about the theory; that would add to the barrier. And here I'm thinking about basically lowering the barrier to entry, making the workflow easier to use, and abstracting that difficulty away, I guess, because otherwise it's even more overwhelming.

50:42

Of course, of course. Yeah. No, I mean, you would want that to be something general. You know, ideally, we would also have diagnostics that would at least make it possible to automate some of that stuff. And I do want to, along those lines, just plug some recent work that Charles Margossian and I, and a number of other people, have done recently on a many-chain diagnostic called nested R-hat, which is meant to, hopefully, get us these kinds of diagnostic results a little faster than classical diagnostics would. You know, I do think that better diagnostics are really an important part of the workflow, even though they're not necessarily something that has gotten maybe the same level of attention as new algorithms.

51:47

Yeah. And you do need good samplers first, before the diagnostics, because you need something to run the diagnostics on.

51:57

Yeah, you need something to diagnose for sure. But

51:59

Exactly, yeah. So what would that nested R-hat do? Like, is that a diagnostic that would compare different samplers, or is it still within one sampler, and it then compares the chains?

52:13

Basically, at a high level, what it's trying to do: so you've got classic R-hat, which basically takes some chains, looks at the variance of your estimates within a chain versus the variance of your estimates across chains, and asks, are those similar? And the assumption is that the within-chain variance is going to be, well, not just an assumption, I mean, it's the law of total variance. But just intuitively, you would expect the within-chain variance to be smaller, because if you don't have convergence, things aren't moving around, and you don't have the kind of diversity that leads to higher variances. With nested R-hat, the idea is we have a bunch of chains, and we combine them into what we call superchains, which are all initialized at the same point, or near the same point. So, to the extent that the chains are doing a good job of forgetting their initialization, we should expect it to be hard to tell the difference between superchains in terms of the variance. Right, so we have a bunch of chains all initialized in the same position; if they haven't fully converged yet, then we would expect the variance estimates that we get from that set of chains to be small relative to the variance that we would get by looking at a bunch of different initializations. There's a component of the variance that's due to initialization, and a component that's due to exploration, and we expect that initialization component to get relatively smaller as the procedure forgets its initialization, because it needs to forget its initialization. So that's the high-level idea, basically: instead of looking at individual chains, we look at these superchains. But because we are running many chains within each superchain, we get a nice variance reduction in our estimates, and so we don't actually have to run each chain for long enough to get high-accuracy estimates of the amount of variance that's due to initialization. Because the whole point of this many-chain workflow is to run enough chains that you don't have to run any of your chains for a really long time; but with classic R-hat, to get it to sign off on your results, you still have to run all of your chains for a long time. So the point of nested R-hat is to hopefully get somewhat reliable, you know, obviously no diagnostic is perfect, these are all just screens, but somewhat reliable signals of convergence or non-convergence fast, when it actually happens, convergence in the sense of low bias, as opposed to once it's happened and we've also run for long enough that the estimates for each chain have low variance.
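As a rough illustration of the superchain idea (a simplified sketch of the construction, not the exact estimator from the nested R-hat paper):

```python
# Simplified nested-R-hat-style diagnostic: compare the variance between
# superchains (groups of chains sharing an initialization) to the
# variance within them.
import numpy as np

def nested_rhat(draws):
    # draws: shape (K, M, N) = K superchains x M chains x N draws,
    # for one scalar quantity of interest.
    chain_means = draws.mean(axis=2)             # (K, M)
    superchain_means = chain_means.mean(axis=1)  # (K,)

    # Between-superchain variance: driven by unforgotten initializations.
    B = superchain_means.var(ddof=1)

    # Within-superchain variance: within-chain variance plus the spread
    # of chain means inside each superchain.
    W = draws.var(axis=2, ddof=1).mean() + chain_means.var(axis=1, ddof=1).mean()

    # Tends to 1 as the superchains become indistinguishable.
    return np.sqrt(1.0 + B / W)
```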

55:15

I see. Okay. Yeah. So definitely that kind of diagnostic would help with adoption and understanding of those new samplers, that's for sure.

55:31

Also, you know, I do want to push back a little bit against that. You said it's for adoption and usage, which kind of suggests maybe less sophisticated users, and this is the thing that we all say all the time, right? We build these automated systems and adaptive methods and so on and so forth, and diagnostics, and part of the story is about making these workflows available to less sophisticated users, which is obviously super important and absolutely a good enough reason to do it. But as somebody who would not describe themselves necessarily as a novice user: I still really, really, really like not having to tune these things by hand. It's just such a pain. You know, when we were writing the MEADS paper, doing the experiments for that, I was like, oh right, now I've got to run all these experiments. The experiments where we evaluated ChEES, that was easy, right? Because you just run it once. The experiments where we do a grid search over all of the other parameters, that was a pain. And so there's this phrase that I feel like comes up often, that I've certainly used any number of times, and I think I probably stole it from somewhere, which is "tedious and error-prone." We talk about how manually computing derivatives is tedious and error-prone; manually tuning the parameters of your sampler is tedious and error-prone, and it just is. And that's true whether you're a novice or not: it's certainly true if you're a novice, but it's true if you're a sophisticated user, too.

57:22

Right? Yeah, for sure. I mean, for me, for instance, who does not come from that math background, having the ability to rely on those automated samplers is just really incredible, because then I can focus on the code instead of focusing on the math. It's just invaluable, because it unlocks a lot of possibilities, and it also unlocks those models for people who would otherwise not be able to use them. So it's incredibly valuable, for sure. And actually, I have at least one more technical question for you, but I'm also curious a bit more generally, because you started talking about that a bit: what do you think are the biggest hurdles right now in the Bayesian workflow?

58:24

Oh, I mean, you know: modeling, inference, data, you might call it. Probably data, right? Data science is more data than science, right? If you actually want to solve a problem, then all the modeling and algorithms stuff is great and important, but it's not a substitute for better data; I'd take better measurements over better modeling any day. But yeah, it feels like there's still just a lot of friction. Like, you can build a simple model, and fit a simple model to your data, and that's fine; that usually works. And sometimes that's enough. Often that's enough. And when it's not, it feels like things get harder kind of superlinearly with the complexity of the model you want to fit, and I think the computation also has a tendency to get superlinear in the amount of data that you have. So, particularly, I think a big category of problems is these underspecified models, where there are some degrees of freedom that are very tightly constrained by the data, and some that aren't constrained by the data at all, and you're really just falling back on your prior. And so in those situations, for HMC at least, the distance you have to travel is not going down with the amount of data, but the amount to which your tightly constrained degrees of freedom are constrained is going down as the square root of the amount of data. And so that means your step size has to go down with the square root of the amount of data, which means your number of steps for HMC has to go up with the square root of the amount of data, which means that your scaling is like N to the one and a half instead of N. So there's that superlinear cost that comes up.
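Spelling out that scaling arithmetic (our gloss on the argument, for a posterior with both data-constrained and prior-dominated directions):

```latex
% The step size must track the most tightly constrained scale, which
% shrinks like N^{-1/2}; the trajectory length T needed to traverse the
% prior-dominated directions stays roughly fixed; and each gradient
% evaluation touches all N data points.
\[
\epsilon \propto N^{-1/2}, \qquad
L = T/\epsilon \propto N^{1/2}, \qquad
\text{cost} \propto N \cdot L \propto N^{3/2}.
\]
```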

And then there's just the fact that, for whatever reason, it feels like we should be seeing more complicated models, like more models that are complicated solving really hard problems, than we do. And I think the reason for that is that there's some interaction between model complexity and difficulty of inference that we don't fully know how to solve yet. In part, I think that's a workflow thing: we need to do a better job of figuring out what the best practices are for coming up with and fitting more complicated models as we go, and having that be kind of an incremental process. And in part, it's probably an algorithms thing. There's sort of a chicken-and-egg thing, right, which is that the problems that we study are the problems that we know how to solve. The models that are out there, that we tune all of our algorithms on, are the models that have been successful, not the models that have failed; there's a file-drawer effect. So I don't know, I don't have the answer. But there's this ideal picture of "I am going to get continuing reward for building a more and more realistic model," and in practice it just seems like we hit a wall with that sort of scaling-up of modeling complexity. And part of that is, of course, priors: for simple models, it's much easier to specify priors that won't have too much of an effect, but the more complicated your model becomes, the more important it is that your prior actually means something and is, you know, reasonable. And constructing reasonable priors is hard work. Yeah, so I don't know. I just want it to be easier, I guess, is the answer. But that's not much of an answer.

No, I mean, that wasn't really the question. The question was: what do you feel is hard, and what should we, as a community, strive to improve? So definitely, that's super interesting. And so, time is flying by, but I really want to get to the technical question that comes from my friend and fellow PyMC developer. He basically wanted to have your opinion on blocked samplers, so the idea of mixing HMC steps with Gibbs steps, and how GHMC comes into play there.


Right. So, yeah, I touched on this a bit before. I think that is a nice affordance of GHMC versus vanilla HMC: that you can interleave these other Gibbs updates more frequently than you can with conventional HMC. So I think it's an interesting degree of freedom that I have played around with a little bit on toy problems, just to get some intuitions. Like I said, I think there's a point at which it starts to look a bit like a pseudo-marginal MCMC method, where essentially, if you're doing these Gibbs updates frequently enough, then at some point that looks very similar to doing HMC on the marginal distribution of the parameters that you're interested in. And so having another way of achieving that, I think, is an interesting point in the design space, but how competitive it is with the other alternatives for doing that, I'm not totally sure. I think that if you do those updates too rarely, you're going to wind up with random walk behavior, and if you do them too frequently, you're going to wind up paying more than you need to. And what I would hope is that there's a range of step sizes, essentially, frequencies with which you're doing these Gibbs updates, where it doesn't matter much: where you're essentially trading off random walk behavior and additional computation in a way that makes it not that important exactly where you wind up on that efficient frontier. Which is, I think, sort of similar to what you see with parallel tempering when you use the even-odd swap strategy. So there, you have a bunch of chains at different temperatures, and you're swapping the states between the chains. And you can do that in a random way or a deterministic way, and if you do it in a deterministic way, it's possible to get information from the high-temperature chains to the low-temperature chains without this kind of weird random walk, almost bubble-sort-like kind of behavior. But the price you pay for that is that if you want a state to actually make it from the top to the bottom, then you need to use more temperatures. And so it sort of winds up being the same kind of complexity, but at least there's kind of a range where you're a little less sensitive to getting one of these parameters right. So I guess that's what I would hope you could achieve with this. It'd be nice if you could achieve more than that, but at least I would hope you could achieve that.
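For reference, the deterministic even-odd swap scheme Matt mentions looks roughly like this (the names and structure here are illustrative):

```python
# Parallel tempering with deterministic even-odd swaps: on even sweeps,
# try swapping pairs (0,1), (2,3), ...; on odd sweeps, pairs (1,2), (3,4), ...
import numpy as np

def even_odd_sweep(states, log_prob, inv_temps, parity, rng):
    # states[i] is the state of the chain at inverse temperature inv_temps[i].
    for i in range(parity, len(states) - 1, 2):
        b_lo, b_hi = inv_temps[i], inv_temps[i + 1]
        # Standard replica-exchange Metropolis ratio for swapping states.
        log_accept = (b_hi - b_lo) * (log_prob(states[i]) - log_prob(states[i + 1]))
        if np.log(rng.uniform()) < log_accept:
            states[i], states[i + 1] = states[i + 1], states[i]
    return states

# Alternating the parity deterministically lets a state ratchet from the
# hottest chain to the coldest without random-walk behavior:
# for t in range(n_sweeps):
#     states = even_odd_sweep(states, log_prob, betas, t % 2, rng)
```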


Yeah. Okay. Thank you. Nice. Thanks so much for taking that big question and answering it in a record amount of time. That's super cool. How's your schedule looking? Do you have a hard stop in five minutes, or... okay. It's not gonna be long, but just in case: I think it's gonna be, you know, 10 more minutes or something like that.


Okay. I'm, I'm available. Okay.


Cool. So I think we have time for, like, two more questions, and then the last two questions. Okay, let me pick; it's hard, because I have too many questions. Oh yeah, I know. Okay. So I think a question I really want to ask you is a bit more global: what would you say is your field's biggest question right now? Like, the one you'd like the answer to before you die?


Let's see. I mean, obviously that's a big question, but I guess one that's on my mind a lot is: in the world of machine learning, what is the place for Bayesian methods? You know, I sort of came up in a period when the Bayesian methodology star was ascendant in machine learning, and since then, of course, deep learning has really taken over. And I guess one of the questions that I would like an answer to is: at the end of the day, in like 10 or 20 years, are we going to be using Bayesian methods for these perceptual tasks, for these language modeling tasks? Is there really a place for properly Bayesian methods in that context, or is it all just going to be giant models that are fit with SGD? I think there's a strong argument to be made that the Bayesian framework is good for dealing with limited amounts of data in a principled way. In some of these perceptual domains, it doesn't feel like there's a limited amount of data. It feels like the reason that we are limited is not because we have finite data, but because we have finite computational resources. There are exceptions, certainly, but for things like speech recognition or image classification: maybe there are these tail classes where you don't have a ton of data, but as far as just learning a machine for extracting features from images, we have enough images. There is no posterior uncertainty, coming from Bayes' rule, about what that representation should look like; the uncertainty is just because we can't run SGD for long enough. And so, does Bayes have anything to say in that context? Does it have something to say about fine-tuning these large models? I mean, certainly I believe that Bayesian statistical thinking is a relatively late achievement for human beings. It's an achievement, and it's important, but humanity got by fine without it for a long, long time, depending on your definition of fine, of course. If our goal was just human-level intelligence at the level of, like, a medieval peasant or something like that: they didn't understand Bayesian statistics, but they could do all kinds of things that robots can't do yet. So I don't know. I think there's definitely a place for at least approximate Bayesian methods when you start getting to higher-level reasoning; that's sort of the level that the cognitive scientists talk about, right, decision making under uncertainty, and rationality, and so on and so forth. And you can study animal behavior and show that animals behave in ways that relate to Bayes' rule somehow, but it's not at the level of their low-level perceptual systems; presumably, it's somewhere a little bit above that.
So I guess I'd like to know at what level this is going to be important, and also whether or not, in the long run, we're going to be able to approximate the optimal Bayesian algorithm well enough with just a big transformer or something. All of this separation-principle stuff is nice, but it introduces a bunch of approximation bias, because our priors are wrong, or we're not learning them properly, or we're using too restrictive a set of assumptions, or whatever, and maybe we'd be better off just sort of doing something more direct. That's the hypothesis that I would really like to have an answer about. I would be comfortable making much stronger prognostications about the future of the field if I really had a strong opinion about that hypothesis. But as it is, I really don't know.


That would be interesting to see, like, really interesting.


I don't know. I mean, it depends on your perspective, right? Like, I don't think Bayesian stats are going away. Again, statistical thinking is an important achievement, and we should be proud of it. It's just, it might not be the way that robots decide how to pick up delicate objects. All right, okay.


Yeah, yeah, we'll see. And by the way, on that, quickly: on that future of Bayesian stats, what's something you'd really like to see, and something you would really not like to see?


I mean, yeah, like I said, I'd like to see a future where things are simpler, where some of this friction has gone away, where we kind of know how to think about these things. It'd be great to have better best practices on prior elicitation, I think, in complicated models. Certainly, people are thinking about that, but I think we've got a lot more thinking to do, both in terms of what those best practices actually are, and how we communicate them. Because, of course, where I'd really like things to get to is a place where hierarchical Bayesian modeling is as easy for people to get started with, and make real progress with, as deep learning. You know, some of this is colored by my experience at Google, but Google has done a great job of training software engineers, with, you know, limited math backgrounds but really solid computer science backgrounds, training them to train huge neural networks to solve problems, to the point where you don't need a machine learning specialist anymore to do that. And the reason I think that has worked as well as it has is both because the framework can kind of deliver what it promises, but also because the formalism is simple enough and easy enough to communicate that people can get started and actually do something without years of training. So I think we've got work to do as far as figuring out how to make these methods more accessible, given a formalism that is fundamentally a little bit more complicated than just: there's a differentiable circuit, it has inputs and outputs, and we want the outputs to be correct for the inputs. You know, I think there's definitely also a need for software tools; I don't think that's gonna happen without better software tools. And figuring out at what level those software tools need to be customizable, I think, is an important question. Our group is spending a lot of time talking with Vikash Mansinghka at MIT these days, and he certainly has very strong opinions about these questions, and that's been really helpful for me, as a way of spending more time thinking about these things. I mean, there's a bunch of cool stuff going on in probabilistic programming, but I think we're still very much exploring what the workflows are, and what the software tools are that enable them, and maybe even what the hardware tools are that enable them.

::

Nice, yeah, that's a cool future, I like it. Cool. Well, one really small last question, from Colin Carroll, who I think you know, and he asked me to ask you: why do the samplers you work on have food-based names, like NUTS, ChEES, MEADS? Is that on purpose? Or haven't you even noticed?

::

Wait, what? Oh yeah, I guess they do. No, of course that was intentional, yes. So it started with NUTS, of course. The no-U-turn thing was actually one of Bob Carpenter's suggestions, which I really, really liked; I thought it was a very nice, evocative term, and of course "NUTS" just falls right out of "no-U-turn sampler". So that was the beginning. And then we were working on this project because there was this issue with NUTS where it turned out to be just a huge pain to get it to be performant on a GPU. So we thought, okay, maybe there's something else we could do, which eventually became ChEES, but there was an earlier version that was a little bit different, called Tater. In any case, the name was meant to reference NUTS, in the sense that if your GPU has a nut allergy, because it sort of slows down and puffs up if you try to get it to run NUTS, then you could try this other food item, and hopefully it won't be allergic to that. So that was Tater, and then eventually ChEES, which is sort of the improved Tater. And once you've got two things, that's a pattern, and it gets its own momentum, so to speak. MEADS actually took a little while; it took some thinking to get that particular backronym.
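For context on that GPU pain point, here is a minimal sketch (not code from the episode) of running a large batch of NUTS chains with TensorFlow Probability; the toy target, step size, and chain count are all illustrative assumptions. Each chain picks its own trajectory length on each iteration, which is exactly the control flow that is hard to batch efficiently.

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# Toy stand-in for a posterior: a 2-D Gaussian (an assumption for illustration).
target = tfd.MultivariateNormalDiag(loc=[0., 0.], scale_diag=[1., 2.])

kernel = tfp.mcmc.NoUTurnSampler(
    target_log_prob_fn=target.log_prob,
    step_size=0.3)

# Many short chains in parallel: the leading batch dimension (64 here) gives
# 64 independent chains, the regime that suits GPU/TPU hardware, even though
# NUTS's variable-length trajectories make the batched control flow awkward.
samples = tfp.mcmc.sample_chain(
    num_results=200,
    num_burnin_steps=100,
    current_state=tf.zeros([64, 2]),
    kernel=kernel,
    trace_fn=None)
```

ChEES-style samplers replace those per-chain variable-length trajectories with a single jittered, adaptively tuned trajectory length shared across the batch, which maps much more naturally onto accelerator hardware.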

::

Yep, cool. All right, that's super cool. Thanks so much for taking some time, man, I really enjoyed it. As usual, before letting you go, I'm going to ask you the questions I ask every guest at the end of the show. So, first one: if you had unlimited time and resources, which problem would you try to solve?

::

Well, I mean, that's obviously a big if. So, given unlimited time and resources, I think I would probably try to solve the biggest problem, which is moral philosophy, right? You have a huge responsibility, given unlimited time and resources, to do the right thing, but of course we don't really know what the right thing is. So given unlimited time and resources, maybe I could figure it out and succeed where thousands of years of human thinkers have failed. And "unlimited time" means possibly millions of years, is how I'm interpreting that question. Or, you know, you could just try to crack climate change, or space exploration, or one of those ensuring-the-survival-of-humanity kinds of things. But yeah, it's obviously a big if.

::

Yeah, of course, but that's the question. And second question: if you could have dinner with any great scientific mind, dead, alive or fictional, who would it be?

::

I mean, that's also a tough one. But I guess, I don't know, maybe one of the Ghostbusters? Venkman is the obvious choice, but if you're talking about scientific minds, probably Spengler, or maybe Stantz, would be the better choice. Donatello, the Teenage Mutant Ninja Turtle, is also maybe a good one. For living or dead, like actual human beings, it's obviously a little more restricted. But I don't know, maybe von Neumann? He seems like he was a smart guy, by all accounts.

::

Yeah, sure. And it's funny, because I think that's the second one: yesterday I interviewed the neuroscientist Pascal Wallisch, and I think he said John von Neumann too. So these dinners are getting pretty nice. Awesome. And I think it was Bob Carpenter, whom I interviewed for the show a while back, who told me you play heavy metal guitar. Is that true? That is true, I do. Nice, nice. So if you want a challenge, you can take the theme song of the podcast, "Good Bayesian", and turn it into some heavy metal guitar. If you do that, though, tell me, and we'll definitely put it in as the theme song of your episode.

::

I'm not sure how quickly I can turn that around, but I'll take a look.

::

Yeah, yeah, sometime before the episode hits the press. So I'm just giving you a musical challenge. Alright. Okay, so we're going to have a fake goodbye here, then we'll stop the recording, and then I'll tell you what to do; it won't be very long. Okay. All right. Okay, well, I think it's time to call it a show. Thanks a lot, Matt. I learned a lot, and there were a lot of things I didn't understand, because samplers are just lots of math, so I have to read the papers over and over again, and then ask a lot of questions to Adrian Seyboldt and Luciano Paz inside the PyMC team, and also to Colin Carroll. Yeah, so thanks a lot for having me. Yeah, well, I put the link to your website in the show notes, and we definitely need to put the links to your papers in there too, for people who want to dig deeper. Thank you again, man, for taking the time and being on the show.

::

My pleasure. Thank you.

::

Okay, that's a wrap. So now you can go to Audacity: File, Export, Export as WAV. And then you save

::

it wherever you want. You want 24-bit?

::

Exactly, yeah. Make sure it's 24-bit, and you can keep the default metadata, we don't care about that. And then you can send me the file however you want, probably by Google Drive, and then you're all set. This should be episode 78, so as I was saying, it should take like a month and a half, or even two months, before it's released, so you'll have time to work on the music. And yeah, if you can also send me links to your papers, or anything else you think people would be interested in, that would be awesome. I already put in your website and your Google Scholar page, but of course, feel free to send anything else, any resources you think are interesting; that's something people ask about quite a lot, so it's definitely useful. Oh, okay. Yeah. Awesome. Well, alright. Thanks so much for taking the time, this was super cool.
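For anyone who would rather script that export step than click through Audacity's dialog, here is a minimal sketch using the third-party Python library soundfile; the filename and the silent placeholder audio are made up for illustration, and the "PCM_24" subtype matches the 24-bit requirement mentioned above.

```python
import numpy as np
import soundfile as sf

# Hypothetical stand-in for the recorded interview track: one second of
# silence at 48 kHz, just so the script is self-contained and runnable.
sr = 48000
audio = np.zeros(sr, dtype=np.float32)

# Write a 24-bit PCM WAV, the same format as Audacity's 24-bit WAV export.
sf.write("episode_78_guest_track.wav", audio, sr, subtype="PCM_24")
```

Audacity's own File > Export dialog does the same thing interactively, which is the workflow described above.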

::

Thank you, yeah.

::

I'll make sure to send you an email one of these days when I'm in New York. And yeah, let's keep in touch. I'll definitely tell you, of course, when the episode is out. Have a nice rest of your day.

::

Thank you. You too. Bye.
