#74 Optimizing NUTS and Developing the ZeroSumNormal Distribution, with Adrian Seyboldt


Episode 74 • 5th January 2023 • Learning Bayesian Statistics • Alexandre Andorra

Proudly sponsored by PyMC Labs, the Bayesian Consultancy. Book a call, or get in touch!

We need to talk. I had trouble writing this introduction. Not because I didn’t know what to say (that’s hardly ever an issue for me), but because a conversation with Adrian Seyboldt always takes deliciously unexpected turns.

Adrian is one of the most brilliant, interesting and open-minded people I know. It turns out he’s courageous too: although he’s not a fan of public speaking, he accepted my invitation on this show, and I’m really glad he did!

Adrian studied math and bioinformatics in Germany and now lives in the US, where he enjoys doing maths, baking bread and hiking.

We talked about the why and how of his new project, Nutpie, a more efficient implementation of the NUTS sampler in Rust. We also dived deep into the new ZeroSumNormal distribution he created and that’s available from PyMC 4.2 onwards — what is it? Why would you use it? And when?

Adrian will also tell us about his favorite type of models, as well as what he currently sees as the biggest hurdles in the Bayesian workflow.

Each time I talk with Adrian, I learn a lot and am filled with enthusiasm — and now I hope you will too!

Our theme music is « Good Bayesian », by Baba Brinkman (feat MC Lars and Mega Ran). Check out his awesome work at https://bababrinkman.com/ !

Thank you to my Patrons for making this episode possible!

Yusuke Saito, Avi Bryant, Ero Carrera, Giuliano Cruz, Tim Gasser, James Wade, Tradd Salvo, Adam Bartonicek, William Benton, James Ahloy, Robin Taylor, Thomas Wiecki, Chad Scherrer, Nathaniel Neitzke, Zwelithini Tunyiswa, Elea McDonnell Feit, Bertrand Wilden, James Thompson, Stephen Oates, Gian Luca Di Tanna, Jack Wells, Matthew Maldonado, Ian Costley, Ally Salim, Larry Gill, Joshua Duncan, Ian Moran, Paul Oreto, Colin Caprani, Colin Carroll, Nathaniel Burbank, Michael Osthege, Rémi Louf, Clive Edelsten, Henri Wallen, Hugo Botha, Vinh Nguyen, Raul Maldonado, Marcin Elantkowski, Adam C. Smith, Will Kurt, Andrew Moskowitz, Hector Munoz, Marco Gorelli, Simon Kessell, Bradley Rode, Patrick Kelley, Rick Anderson, Casper de Bruin, Philippe Labonde, Michael Hankin, Cameron Smith, Tomáš Frýda, Ryan Wesslen, Andreas Netti, Riley King, Yoshiyuki Hamajima, Sven De Maeyer, Michael DeCrescenzo, Fergal M, Mason Yahr, Naoya Kanai, Steven Rowland, Aubrey Clayton, Jeannine Sue, Omri Har Shemesh, Scott Anthony Robson, David Haas, Robert Yolken, Or Duek, Pavel Dusek, Paul Cox, Trey Causey and Andreas Kröpelin.

Visit https://www.patreon.com/learnbayesstats to unlock exclusive Bayesian swag ;)

Links from the show:

LBS on Twitter: https://twitter.com/LearnBayesStats
LBS on LinkedIn: https://www.linkedin.com/company/learn-bayes-stats/
Adrian on GitHub: https://github.com/aseyboldt
Nutpie repository: https://github.com/pymc-devs/nutpie
ZeroSumNormal distribution: https://www.pymc.io/projects/docs/en/stable/api/distributions/generated/pymc.ZeroSumNormal.html
Pathfinder – A parallel quasi-Newton algorithm for reaching regions of high probability mass: https://statmodeling.stat.columbia.edu/2021/08/10/pathfinder-a-parallel-quasi-newton-algorithm-for-reaching-regions-of-high-probability-mass/

Abstract

by Christoph Bamberg

Adrian Seyboldt, the guest of this week’s episode, is an active developer of the PyMC library in Python and of his new tool, nutpie, written in Rust. He is also a colleague at PyMC Labs and a friend. So naturally, this episode gets technical and nerdy.

We talk about parametrisation, a topic important for anyone trying to implement a Bayesian model, and about what to do or avoid (don't use the mean of the data!).

Adrian explains a new approach to setting categorical parameters, using the ZeroSumNormal distribution that he developed. The approach is explained in an accessible way with examples, so everyone can understand and implement it themselves.

We also talked about further technical topics like initialising a sampler, the use of warm-up samples, mass matrix adaptation and much more. The difference between probability theory and statistics as well as his view on the challenges in Bayesian statistics complete the episode.

[00:00:00]

It turns out he's courageous too. Although he's not a fan of public speaking, he accepted my invitation on this show and I'm really glad he did. Adrian studied math and bioinformatics in Germany and now lives in the US where he enjoys doing math, baking bread, and hiking. We talked about the why and the how of his new project, nutpie, which is a more efficient implementation of the NUTS sampler that he wrote in the programming language called Rust.

ibution that Adrian created, [00:01:00]

Adrian will also tell us about his favorite type of models as well as what he currently sees as the biggest hurdles in the Bayesian workflow. Honestly, each time I talk with Adrian, I learn a lot and I'm filled with enthusiasm. And now I hope you will too. This is Learning Bayesian Statistics, episode 74, recorded November 4th, 2022.

Welcome to Learning Bayesian Statistics, a fortnightly podcast on Bayesian inference, the methods, the projects, and the people who make it possible. I'm your host, Alex Andorra. You can follow me on Twitter at alex_andorra, like the country. For any info about the podcast, learnbayesstats.com is Laplace to be. Show notes, becoming a corporate sponsor,

rch. Everything is in there. [00:02:00]

Thanks a lot folks, and best Bayesian wishes to you all. Let me show you how to be a good Bayesian, change your predictions after taking information in. And if you're thinking I'll be less than amazing, let's adjust those expectations. What's a Bayesian? It's someone who cares about evidence and doesn't jump to assumptions based on intuitions and prejudice.

less I'm rhymin' in front of a [00:03:00]

Hello, Bayesians. Before we go on with the show, I'd like to warmly thank Andreas Kröpelin, who just joined the LBS Patreon in the Full Posterior tier. Your support makes a big difference. For instance, it helped me pay for the new mic I'm using right now to record these words. So may the Bayes be with you, Andreas. Oh, and if you wanna stay connected with your favorite podcast, LBS now has its own Twitter account and LinkedIn page.

train his, his practice, his [00:04:00]

So that's really a shame. Yeah. Like, that's how it goes. No, it's not true. It's, my German is way too rusty. So yeah, let's continue in English. But yeah, I'm super happy to have you here because I've known you for a while now and you've done so many things that it was hard actually to pick the questions I wanted to ask you for the podcast, 'cause you've done so many projects and I had so many, yeah.

Exciting questions for you. So I'm happy that you're finally on the show. But I guess I was a bit shy, so, but Alex worked hard to get me here. So, that's among my superpowers, but people should not know. Yeah, I'm just relentless. Yeah. So, but let's start, you know, easy, and as usual, I like to start with, uh, the background of the guests.

h. Definitely not a straight [00:05:00]

That's just not what I really was interested in. Yeah. I guess I was somewhat interested in the philosophical part of statistics, like, what is a probability, and the whole debate of frequentist statistics versus Bayesian statistics. That did interest me a little bit, I guess. But it was more of a side interest and just something I was curious about.

Not really something I wanted to work on. But say, during the first part of mathematics at university, I learned probability theory, measure theory and things like that, that I guess come in handy now. But nothing at all about applying probability theory in any sense. So what I would call statistics, that came later.

ecause I wanted to have more [00:06:00]

I think it came up in a couple of projects I was working on and I was just curious. So I actually learned Stan at the beginning. Mm-hmm. Which was a lot of fun, but I was also into Python quite a lot. So I started looking into PyMC, and PyMC didn't do a couple of things that Stan could, and I, yeah, wanted to use Python.

t pull request I did, I think: 2018

[00:07:00] And now I work at PyMC Labs together with Alex, where we develop statistical models. Exactly. Yeah. Yeah, we have a lot of fun. We're gonna talk about, uh, more of that later, but, okay. Actually, I didn't know you started with Stan. And then, well, that is so typical of you: oh, I'm gonna start with the mass matrix adaptation.

Like, so I wanna be clear to listeners, you don't need to do a new mass matrix adaptation for your first PR, but that would be the outcome. I think there's still a lot of potential for improvement in that area. So if you want to do that as your first PR, please go ahead.

That's all I'm saying. Sure. Yeah. If you wanna, but, uh, don't be intimidated. You don't need to. I think my first PR was probably a typo somewhere. Uh, you know, I'm not sure it actually was my first PR, to be honest. It's the first one I remember. I dunno if it actually was the first.

t's too late now. That's too [00:08:00]

Okay. But okay. I see. So you started with the pure math world and then were drawn into the, into the dark of... Yeah. It's just this, I wanted to do applications, and as soon as you start working on applications, statistics pretty much, in, yeah, probably not every field, but in really a lot of fields, just pops up at some point.

And if you're already a bit interested in the philosophical part of it, then I think that easily draws you in. Yeah, no, for sure. And actually, like, how do you define statistics to people? I guess, how do you explain your job? You know, because I always have a hard time, like, because it's not math.

eah, sure. It's not software [00:09:00]

So that's where you just say probabilities have these properties and you don't say what a probability is. You just say what properties something should have for it to be called a probability. Mm-hmm. Then you just prove theorems based on those assumptions, on those properties. I mean, maybe you think about that, but the subject in the mathematical world really isn't how do I apply that to a real-world problem, but just what automatically follows from those?

eory, mostly that I guess to [00:10:00]

So if we say, mm-hmm, we wanna know something and we also wanna know how well we know it, yeah. Turns out, we can interpret that as a probability if we want. Mm-hmm. And then I guess you can have really philosophical discussions around what this probability actually means, which in practice, I mean, it's interesting, but in practice I think it's probably not always relevant because in the end it's more, yeah.

s are happening and how it's [00:11:00]

So would you place, like, if you had a Venn diagram, would you place physics and statistics at the same layer? In the same layer? Yeah. I think, uh, maybe not exactly, because physics really tries to explain or describe how the world works around us. While statistics, I think, doesn't really, but it's more.

Trying to quantify or tell what we know about it. So it's on a slightly different level, but I think the relationship to mathematics is a bit similar in that it's applying parts of mathematics to our knowledge of something, or frequencies of something happening. So I guess in that case, it might actually be really describing something directly in the real world and not just our knowledge of it.

that got nerdy very quickly. [00:12:00]

Cool. Okay. Yeah. Interesting. And so, yeah, basically, you said you work now with us at Labs and you're doing some fun statistical and mathematical projects. And so, can you tell us about the topics that you are particularly interested in these days? And so I know you like, you know, getting obsessed by some topics and projects.

So yeah. Maybe tell us about some of your latest obsessions, and topics that you're particularly interested in these days. Ok. Something I've been looking into a lot, because, okay, in a lot of projects we had, one problem came up repeatedly, where we have a data set. We build a model that works really well with that data set, hopefully really well.

gs don't really work as well [00:13:00]

So I was interested in trying to make the computational part of that more robust, which then got me actually back to the mass matrix adaptation things, which never really left me. I guess that stayed around. And so, trying to find better algorithms to approximate the posterior so that sampling methods can be more robust.

think a bit a question about [00:14:00]

Mm-hmm. Partially pooled, whatever you want to call them. Yeah. How to write those so that they end up being fast for the sampler, but also easy to interpret. And I think there are still some open questions there. I think we can also improve priors for standard deviations, which is kind of an eternal subject, I guess.

That never really goes away. Yeah, true. Yeah, yeah, yeah. I remember you posting something very interesting about the priors for standard deviations in, in our discord on by CNAs? Uh, yeah. It's like this kind of thought experiment about, like, basically trying to set the standard deviation on the whole data set.

e basic idea would be to ask [00:15:00]

Yeah. How we parameterize that is then, I guess, a bit of a different question. And then we ask how much of that variance comes from various sources. And that way, if we increase the number of predictors, for instance, the total variance doesn't grow larger and larger each time we add a predictor, which I think doesn't really make sense.
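To make the point about total variance concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the scheme discussed in the episode: predictors are assumed independent with unit variance, every scale gets a half-normal(1) prior, and the Dirichlet split of the total variance is a choice made only for this example. The function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

def total_sd_independent(n_predictors, n_sims=20_000):
    # Assumption: each predictor gets its own half-normal(1) scale.
    # The implied variances add up, so the total grows with n_predictors.
    sds = np.abs(rng.normal(size=(n_sims, n_predictors)))
    return np.sqrt((sds**2).sum(axis=1))

def total_sd_shared(n_predictors, n_sims=20_000):
    # Assumption: one half-normal(1) prior on the total scale, then a
    # Dirichlet split decides how much variance each predictor receives.
    total_sd = np.abs(rng.normal(size=n_sims))
    weights = rng.dirichlet(np.ones(n_predictors), size=n_sims)
    component_var = weights * total_sd[:, None] ** 2
    return np.sqrt(component_var.sum(axis=1))

for n in (2, 20):
    print(
        n,
        "independent:", np.median(total_sd_independent(n)),
        "shared:", np.median(total_sd_shared(n)),
    )
```

With independent scales the implied total standard deviation grows roughly like the square root of the number of predictors, while with a prior on the total it stays where it was set.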

That shouldn't happen. Yeah, yeah, yeah. No, I agree. And also it makes, it's easier to set the priors because it's true that it's always like, yeah, I don't know. I don't know how to set the prior on that slope. You know? That's just like, what does that mean? Whereas having a total variance for the whole phenomenon you're explaining is more, yeah, more intuitive.

, I don't. What I definitely [00:16:00]

Then I looked at that for a while and worked with it, and after some time I realized, oh, this is actually equivalent to setting half-normal priors on the variance, which is the default, what we do all the time. So this great new thing turns out to be, well, just the same old. But also, I guess, in a sense I also like that, because maybe that points to that.

natural next step now to do [00:17:00]

Yeah, a couple of real projects, and just compare first: does it actually do something different? Does it make more sense? Does it actually make it easier to set the priors, or maybe it doesn't, I don't know. So yeah, you have to experiment with that a bit. Yeah, that'd be very interesting. Yeah. I have some projects in mind where trying that.

Yeah. As usual, the problem. Like, the time. It's the time. But, uh, yeah, like, that'd be definitely super Griffin. Cool. Okay. So yeah, actually, before we dive into those different topics, I'm wondering if you remember how you first got introduced to Bayesian methods, because, yeah, basically you got introduced more to stats while doing bioinformatics, so I'm guessing it happened at that time, but did it, or was it actually later?

at, at university, I think. [00:18:00]

That's then, I think, the first time I really used Bayesian statistics to do anything was relatively near the end of my time at university, I think, using it to model RNA counts, for instance. So you have actually pretty large data sets and, I think, actually pretty complicated.

Not that complicated models, but with horseshoe priors. So I kind of went all in and did all the things that are actually pretty difficult to sample correctly. Mm-hmm. Where it's really hard to avoid getting divergences and to make sure you actually sample from the posterior. So it didn't start easy that way, I guess.

hods for the, the kind of of [00:19:00]

Comes around, which I don't think is necessarily a good thing, but it's also not necessarily a bad thing. So I, yeah, I dunno. I think that's definitely part of it. And then you also, I think, notice the problems more where you could use those methods, and you kind of gravitate to one set of problems, like, yeah.

so with actually mass matrix [00:20:00]

I'll, I'll stop there. Like, what can you tell us about nutpie? Basically give us the elevator pitch and then, and then we'll. Yeah, sure. So that also started more as a small hobby thing I wanted to try out. I just thought, hey, it would be fun. Rust is a fun language. Mm-hmm. I always liked it. Why not? Why not write NUTS,

Hamiltonian Markov chain Monte Carlo methods, in Rust and see how that goes. So I did that some time ago with a really basic implementation, and it was sitting around for a while. At some point, then, we had discussions in the Aesara backend, which we use to compute logp graphs, that we could compile that to Numba so we can just get out a plain C function that doesn't have any Python in it anymore.

all that. But in order to do [00:21:00]

So there's actually a pretty simple change we can do to mass matrix adaptation, to use the gradients as well as the draws. And so we just use a bit of additional information. And the method looks really similar, but in all my experiments, it seems like it's actually working quite a bit better. So especially early in tuning, uh, we can get to a good mass matrix and, I think, to the posterior where we want to get, with fewer gradient evaluations. Not orders of magnitude fewer, but definitely fewer, and it seems also, so far.

do your own experiments, but [00:22:00]

A relatively small library written in Rust, it implements just the basic Hamiltonian Markov chain Monte Carlo sampler. And you can actually use that to sample Stan or PyMC models, both with little asterisks in there, because for Stan you'll have to use a different branch of httpstan so that we can actually get at the gradient function easily.

And for PyMC, that works out of the box and, I think, much nicer. So you can just install nutpie using conda or mamba or whatever, and just call two functions to sample using nutpie. But that requires the new Numba backend for Aesara, which is still a bit of a work in progress. So depending on your models, it might work really well, or you might just get an error message if you're unlucky.

t would be great. So I kinda [00:23:00]

I think actually not, not PyPI yet, I think. Oh, it should actually be pretty easy to add, but I just haven't done that yet. So, uh, conda, yeah. So yeah, I think that's how I installed it. I think I installed it with mamba. So yeah, mamba install nutpie, and then you have it, uh, and you can try. And, yeah, we'll put a link to the GitHub repo in the show notes so that people can look at the examples you are talking about, how you sample a PyMC model or a Stan model, thanks to that.

guide because it helps make, [00:24:00]

And the last twist is that you used a new mass matrix adaptation in this implementation of HMC. Exactly. Did I get that right? Yeah. So let's dig a bit into that. First, can you remind listeners what the mass matrix adaptation is and why we would care about that when we are sampling a Bayesian model? Okay.

so that it does the exactly, [00:25:00]

It samples really slowly. Or in one version you get divergences, so it doesn't really work actually, and in a different version it works just fine. So that's always the question of parameterizations. So the model might actually be the same, but the numbers you use to describe that model are different and.

So sometimes, and some parameterizations are good for the sampler and some are bad for the sampler. Now, pretty much always when we implement HMC, we try to do some of those reparameterizations automatically. Namely, we rescale all parameters. So you could just say, I sample a standard normal and scale that by the standard deviation, or I just, yeah.
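The "sample a standard normal and scale it" idea can be checked numerically. The sketch below uses NumPy only, with illustrative values for `mu` and `sigma` that do not come from the episode; it shows that the two ways of writing the variable describe the same distribution, even though the sampler would see different numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 5.0  # illustrative values, assumed for this example

# Parameterization A: draw the quantity directly on its own scale.
direct = rng.normal(loc=mu, scale=sigma, size=100_000)

# Parameterization B: draw a standard normal, then shift and scale it.
raw = rng.normal(size=100_000)
rescaled = mu + sigma * raw

# Both describe the same distribution: matching mean and standard deviation.
print(direct.mean(), direct.std())
print(rescaled.mean(), rescaled.std())
```

In parameterization B the quantity that is actually sampled (`raw`) has standard deviation one, which is the kind of scale a sampler tends to handle well.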

be the same model as before, [00:26:00]

Right. It's not sampled on the, on the positive. And then you try to find, you rescale all those transformed variables so that their posterior... Usually that's kind of the usual way of doing it, so that the posterior standard deviation is one. So that all of those have the same, same variance. So you mean all the parameters in the model.

do a linear transformation. [00:27:00]

And to be clear, we do that only during the sampling. But then the posterior you get back from PyMC or Stan is then rescaled. Yeah. Yeah. So that's completely hidden. You don't notice when you use the library. That happens automatically in the background and you don't need to worry about it if you just use the library.

But during sampling, that's really important because the sampler will have really bad performance. If one posterior standard deviation, for instance, is 10 to the minus two and another posterior standard deviation is maybe a thousand, then it just will be horribly slow. So we need to avoid that. And the usual way of doing that is just to try and sample a little bit, until you get a sense of how large the posterior standard deviation is, and then you rescale everything to fit.
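As a rough picture of that rescaling step, the NumPy sketch below treats a batch of independent normal draws as if they were warmup draws, using the two scales mentioned above (10 to the minus two and a thousand). It is a toy stand-in under that assumption: real samplers estimate the scales from correlated MCMC draws and repeat the step over several windows.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two parameters whose posterior standard deviations differ by five
# orders of magnitude, as in the example above.
true_sd = np.array([1e-2, 1e3])

# Stand-in for warmup: independent draws (an assumption of this sketch).
warmup_draws = rng.normal(scale=true_sd, size=(2_000, 2))
estimated_sd = warmup_draws.std(axis=0)

# Later draws, divided by the estimated scales, have standard deviation
# close to one in every direction.
later_draws = rng.normal(scale=true_sd, size=(2_000, 2))
rescaled = later_draws / estimated_sd

print(estimated_sd)
print(rescaled.std(axis=0))
```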

gain, and you iterate that a [00:28:00]

Yeah. That's what, yeah, that's the warmup or tuning phase that people have probably heard about. And then once you get to that phase where it's pretty much stable, then you think that you've reached the typical set and then you can start the sampling phase per se. That's the kind of samples that you get in your trace.

Once the model has finished. I mean, during warmup or tuning, a couple of other things are also happening. So for instance, you are actually moving to the typical set. Because you start somewhere that's just not in the typical set at all, where you just don't wanna be. So you need to move in the right direction first, and you need to find step sizes and the mass matrix.

you have a thousand samples [00:29:00]

And the basic idea then behind modifying that a bit is saying that, okay, maybe we actually have more information than just the posterior standard deviation. We also already, for HMC, computed the gradients of each logp value, and gradients and draws both provide information about how much variation there is.

For a certain variable. And it turns out, if you use both, you can usually converge to something reasonable faster. Not always. And it's not just that you can converge to something faster, but the thing that you converge to might also be a bit different. And that tends to lead to better effective sample size per draw values.
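Why gradients carry scale information can be seen in the simplest case, an independent normal posterior: the gradient of the log density is `-x / sd**2`, so its spread is `1 / sd`. The NumPy sketch below combines draws and gradients with one possible rule, the square root of the ratio of their standard deviations. That specific rule is an assumption chosen for illustration; the episode only states that both sources are used. The helper `grad_logp` is hypothetical.

```python
import numpy as np

def grad_logp(x, sd):
    # Gradient of an independent, zero-mean normal log density.
    return -x / sd**2

rng = np.random.default_rng(2)
true_sd = np.array([1e-2, 1.0, 1e3])

# Deliberately few draws, as early in tuning.
n_draws = 20
draws = rng.normal(scale=true_sd, size=(n_draws, 3))
grads = grad_logp(draws, true_sd)

# Estimate from draws only: noisy with so few draws.
sd_from_draws = draws.std(axis=0)

# Illustrative rule combining both: sqrt(std(draws) / std(grads)).
sd_from_both = np.sqrt(draws.std(axis=0) / grads.std(axis=0))

print(sd_from_draws / true_sd)
print(sd_from_both / true_sd)
```

For an exactly normal posterior the combined estimate recovers the true scale up to floating-point error, even from 20 draws, because the sampling noise in the draws and in the gradients cancels. Real posteriors are not exactly normal, so this only conveys the intuition.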

So it's the [00:30:00]

Most of the time finding, having a more precise idea of what the typical set actually is. Yeah. Or a more accurate idea, maybe not more precise. And I think the basic idea of how I derive that can also be extended to reparametrizations that are not just rescaling, but that are more complicated. So hopefully that can be developed in a way so that correlations in the posterior, for instance, are way less of a problem even for large models.

ion, where you try to find a [00:31:00]

And working with that just is no fun. And it's also hard to estimate then. So it definitely seems like if you use the same math I used to derive the new diagonal mass matrix adaptation and apply it to full mass matrix adaptation, or something in between, actually, that also seems to work pretty well.

But that's not in nutpie yet. That's still experimental stuff I'm working on, and hopefully it works out well, but let's see. Mm, okay. Yeah, and then that would be stuff, like a new kind of adaptation, that you would add to nutpie, for instance, and that people would then be able to use. So there actually is already an implementation on GitHub somewhere, covadapt, which actually works with the default PyMC sampler.

t pie, but that's more of an [00:32:00]

No, I get so, yeah, first, that reminds me. So you see, folks, that's why we tell you to not code your own samplers. It's like, you can see the amount of research and expertise in math and, like, collective wisdom and thoughts that go into all those samplers that you use through probabilistic programming languages.

that , just use the samplers [00:33:00]

Yeah. Yeah. You should use that and, like, get rid of your sampler that you've written in Python. So, yeah. Uh, first, and, um, second. Okay. So thanks. That actually helped me understand what the mass matrix adaptation is. 'Cause I always forget, so that's cool. Yeah. You also talked about the step size. That's right. So of course I would have questions for that, but a question I often get when I teach Bayesian stats is, so, you said that the sampler starts somewhere and that very often is not in the typical set.

alization at the mean of the [00:34:00]

Mm-hmm. And we told people to not do that. So, can you, yeah, can you tell us why it's usually a bad idea to do that, and to usually leave that choice to PyMC or Stan? So I think one reason is definitely that, I don't think it ever... never use the word ever, but it typically doesn't really help. So it definitely adds an additional thing that needs to be done.

So you need to run optimization first. And I think, for instance, I mean, in the literature there are actually interesting ideas around doing that. So Pathfinder, for instance, is, I think, a really interesting paper that tries to develop that idea. But if you just, basically, it doesn't really help, and I think there are cases where it definitely can make things worse.

ere. I mean, the gradient is [00:35:00]

So I think maybe it did help, actually, so before the whole HMC thing came around, before the gradient-based methods, maybe it did actually help there. So that might actually be why it was a thing. But with HMC, I mean, there might be models where that's different, but really, usually, if it's a well-defined model, you have the first couple of draws where the sampler just goes to the right spot, and after that it doesn't matter anymore.

fault, you can get them, for [00:36:00]

If you develop mass matrix adaptation methods, for instance, you definitely wanna use them because then you can see what it's actually doing behind the scenes. Yeah, that's definitely where you need them. But other than that, I mean, it's just an area where it tries to find good parameters, and it's not trustworthy samples during that period, in a sense.

So other than for diagnostics and trying to understand what's happening, I don't really see what you would do with those draws. So they're there to find good parameters for the sampler. And after you've found the parameters, you then sample. Yeah. They don't necessarily tell you anything about the inferences that your model makes, right?

t you actually get from your [00:37:00]

There shouldn't be any restrictions on the kind of models that you can, in theory, use nutpie on. The only, like, the main restriction was the one you mentioned, which is, well, sometimes an op would be missing in Numba, in which case, open an issue. But, uh, in general, you can try nutpie on any model.

e, any difficulties that you [00:38:00]

If you encountered any difficulties. Good question. I definitely encountered lots of small difficulties, so, mm-hmm. Fighting. Yeah. Thinking about how to structure different parts of the library. Definitely the whole adaptation things, they tend to cut across concerns a bit. So they are a bit tricky to separate out in a nice and clean way, I think.

But to be honest, it went surprisingly smoothly. I think it probably helped that I worked on the PyMC sampler before quite a bit, so that it wasn't the first NUTS implementation I ever did. Then, definitely, getting all the details of NUTS right, that's always a bit tricky because there's a lot of small things you can do wrong, and they don't look wrong, they just are wrong.

something up or you get less [00:39:00]

So, you know, that was actually really a lot of fun, and I really enjoyed working in Rust as well. It's definitely the largest project I ever did in Rust, and, uh, I really enjoyed that. Yeah. Okay. Well, that's cool. I didn't expect that answer. I thought you had a lot of banging your head against the walls.

Oh, yeah. Definitely had banging my head against walls. I mean, but I think that's just a given for any software project, to be honest. Mm-hmm. And, uh, listen, I don't really have this one thing that. Yeah. You didn't have like one big block. And, uh, I wanna reassure the listeners, because they don't have the video, but, uh, your head looks, it looks good now, it looks like, yeah.

he bumps from the wall seems [00:40:00]

So nutpie itself feels to me relatively finished. So if people have nice ideas, that's always great. And I mean, you can definitely clean up code more. I think it's decent, I hope, but it's not like there's nothing left to do. But it is a library that has relatively small scope, so implement this one algorithm and do it nicely, hopefully.

d about something like that. [00:41:00]

So making sure, so just testing the new mass matrix adaptation on different models and seeing if we can find something where it's worse, definitely worse than the old implementation. That would be interesting to see. And then, from the implementation point of view, the Numba ops that might be missing: still figure out which ones those actually are and implement them.

And I think that's, and test them. Make sure they actually do the right thing as well, which would be nice. So that's definitely something where I think lots of people could help, which would be great. Good. And so the way to contact you about that is to go on the GitHub repo of nut

ust open an issue. Simple as [00:42:00]

And it's the perfect segue for the next topic because you know that, uh, the Statistical Academy now calls that zero abnormal Adrian. Okay. So I guess that would be a contribution, maybe pr that, uh, fix us that and especially since like now actually pm your normal exists, it match. You can even use that in the, in the example.

u know, uh, extension of the [: 00:43:00

So yeah. Can you tell us how, like, what it is and how you came up with the idea? Say first I guess, about being the father of that distribution? I'm not act, so I, I'm pretty sure I made that up on my own in a sense. I mean, as on, on your own as you can. I'm not entirely sure at all. Sure. I was the first person to make this up.

iscovered Bass formula and , [: 00:44:00

And uh, and then I think it's, I thinks maybe actually a bit, bit, bit less important than that as well. But , , well I'm not sure , we'll see what he story says, you know, but yeah, like DMO actually like that has already been discovered. So yeah, he was quite sad about, about that. You know, that was actually quite reassuring to me.

A genius like life. Could be sad sometimes and depressed, you know, that was inspiring. in some way. Yeah. But, okay. Anyways, so yeah, now that we've done the usual caveats, hey, can you tell us about like what that distribution is and how you Yeah. Came up with the idea basically the perfect, okay. So from a mathematical point of view, it's actually a really simple distribution in some sense.

ix that's just a constant co [: 00:45:00

You have too many parameters in a sense, because you are overparameterizing your model. So you would like to get rid of one of the degrees of freedom, and how statisticians have often done this... And it's, I'm interrupting you, but yeah, it's overparameterized because once you know the behavior of 49 states, in that example, you would know the behavior of the 50th state without

Having to compute it. [: 00:46:00

And a normal way of dealing with that is to just drop one of those states, in a sense, and say that the corresponding parameter is defined to be zero. So we'll just basically use the intercept value for predictions there. Now if you do that in a frequentist setting, where you don't have any regularization, that's perfectly fine, in the sense that it doesn't matter which state you choose.

distribution, for instance, [: 00:47:00

So you don't really wanna drop a state like you tend to do in frequentist regression, because then you introduce this weird arbitrary choice that changes a few results. I mean, in some settings that choice isn't arbitrary. So maybe one state is a control. In the states example that doesn't really make much sense.

Yeah. But maybe for a drug trial. Yeah, exactly. Placebo, it's the control. And it makes sense to define that as zero, so that's fine. But in other cases it doesn't make sense. Yeah, and in a lot of cases, actually, you don't... So what you're talking about, I think, is called reference encoding. Yeah, yeah.

s is to, as you were saying, [: 00:48:00

And so pivot encoding, reference encoding, that's exactly what we're talking about: taking one of the categories as the reference. And a lot of times it's quite hard to have a reference category in mind that really makes sense, unless it's really a placebo or something that really makes sense to make reference to all the time, say.

But this leads to the fact that often, in a Bayesian model, we actually have it overparameterized in some sense, which isn't really a problem in some sense, because everything works out fine thanks to our priors. But it might slow down the sampler quite significantly in some cases.

parameters that tell us not [: 00:49:00

States differ from the mean of all the states. And in this case, I mean the sample mean. Okay, for states there's no difference between the mean of all states and the sample mean of states, but let's ignore that difference for a moment. So what's the difference to the mean for each individual state?

So you could say we have a parameter that tells us that some state has a larger than average value, and you have another parameter that tells us another one is lower than average. So if there were only two states, you could say, actually, we just ask what's the difference between the states. So we define a new distribution that gives us the values for both of them, but that distribution has only one degree of freedom.

er would be minus that value [: 00:50:00

You can take any normal distribution, if you like, and split it into two parts: a zero-sum normal distribution and the sample mean distribution. So you could ask, if you have 50 states, what's the distribution of the mean of those states, and what's the distribution of the differences to the mean? And both of those distributions, if you write it down correctly, are normal distributions.
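To make the split described above concrete, here is a minimal simulation sketch, not code from the episode. It assumes 50 independent normal state effects with a made-up standard deviation of 2, splits each draw into its sample mean and the deviations from that mean, and checks that the deviations always sum to zero. All variable names are illustrative.

```python
import random
import statistics

random.seed(0)

# Assumed, made-up settings: 50 "states", each effect drawn from Normal(0, sigma).
n_states, sigma, n_draws = 50, 2.0, 5_000

means = []        # the "sample mean" part of each draw
first_devs = []   # deviation of the first state from the mean, per draw
max_abs_sum = 0.0  # largest |sum of deviations| seen (should be ~0)

for _ in range(n_draws):
    x = [random.gauss(0.0, sigma) for _ in range(n_states)]
    m = statistics.fmean(x)
    dev = [xi - m for xi in x]  # the zero-sum part
    max_abs_sum = max(max_abs_sum, abs(sum(dev)))
    means.append(m)
    first_devs.append(dev[0])

var_mean = statistics.pvariance(means)      # theory: sigma**2 / n_states
var_dev = statistics.pvariance(first_devs)  # theory: sigma**2 * (1 - 1/n_states)

print(f"largest |sum of deviations|: {max_abs_sum:.2e}")
print(f"variance of the sample mean: {var_mean:.3f}")
print(f"variance of one deviation:   {var_dev:.3f}")
```

Under these assumptions, the sample mean should have variance close to `sigma**2 / n_states` and each deviation a variance close to `sigma**2 * (1 - 1 / n_states)`, so the zero-sum part is only slightly narrower than the original effects when there are many categories.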

So multivariate normal in this case. And then you have the intercept plus this sample mean distribution, but those two are just normal distributions, so you can combine them into one normal distribution without really changing the model, if you like. So you could adapt the standard deviation of that parameter a bit.
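The combination of intercept and sample mean can be written out as a short derivation. This is a sketch under assumptions that are not spelled out in the episode: independent effects x_i ~ N(0, σ²) for n categories, and an independent intercept prior a ~ N(0, s²), where s is a hypothetical prior scale.

```latex
\begin{aligned}
\mu_i &= a + x_i
       = \underbrace{(a + \bar{x})}_{a'} + \underbrace{(x_i - \bar{x})}_{z_i},
  \qquad \bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_j, \\
\bar{x} &\sim \mathcal{N}\!\left(0,\ \tfrac{\sigma^2}{n}\right), \qquad
a' = a + \bar{x} \sim \mathcal{N}\!\left(0,\ s^2 + \tfrac{\sigma^2}{n}\right), \\
z &\sim \mathcal{N}\!\left(0,\ \sigma^2\left(I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top\right)\right), \qquad
\sum_{i=1}^{n} z_i = 0 .
\end{aligned}
```

Under these assumptions the only adjustment is a slightly wider intercept prior, with variance s² + σ²/n instead of s², while a' and the zero-sum vector z remain independent.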

st using the zero sum normal [: 00:51:00

So for the states, for instance, it doesn't really make sense to have a parameter for that, because the set of states is fixed, and it doesn't really make sense to want to make predictions for new states coming into the US in the future or something like that. So it's not really a distribution for an infinite number of states, it's just a finite set of states, and it makes sense to say, okay, the mean of those has to be zero.

one category, it's the mean [: 00:52:00

And so, yeah, in a lot of cases that makes more sense. First, to interpret the model, that helps you understand: oh yeah, okay, so Alaska is really below average, whereas Massachusetts is way above. And also that means that the results are not gonna change, because you're not changing the reference category.

Whereas if you are using classic reference encoding and then you change the pivot, or the reference category, then your results are gonna change: the parameter values are gonna change. And also, maybe you'll have to change the priors. Yeah. And not just the values change, but also their meaning might change.
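A small numeric sketch of that point, using made-up group means rather than anything from the episode. The two helper functions are hypothetical and ignore priors and noise; they only show how the same four means are described under each encoding.

```python
from statistics import fmean

# Made-up average outcomes per state (illustrative numbers only).
group_means = {"Alaska": 3.0, "Massachusetts": 8.0, "Texas": 5.5, "Ohio": 5.5}


def reference_encoding(means, reference):
    """Intercept is the reference category; effects are differences to it."""
    intercept = means[reference]
    return intercept, {k: v - intercept for k, v in means.items()}


def zero_sum_encoding(means):
    """Intercept is the mean over categories; effects sum to zero."""
    intercept = fmean(means.values())
    return intercept, {k: v - intercept for k, v in means.items()}


# Changing the reference category changes every parameter and its meaning.
icpt_ak, eff_ak = reference_encoding(group_means, "Alaska")
icpt_ma, eff_ma = reference_encoding(group_means, "Massachusetts")

# The zero-sum version has no reference category to choose.
icpt_zs, eff_zs = zero_sum_encoding(group_means)

print("reference = Alaska:       ", icpt_ak, eff_ak)
print("reference = Massachusetts:", icpt_ma, eff_ma)
print("zero-sum:                 ", icpt_zs, eff_zs)
```

With the zero-sum encoding, each effect reads directly as "above or below the average of the categories", which is the interpretation discussed above.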

lpful to use. So anytime you [: 00:53:00

And also, as I was saying, multinomial regression. Well, that's trying to infer the latent probability of a given number of categories, so here that's the same. And what I really love about that is that I do use quite a lot of multinomial models, and before you had to do that pivot thing. So imagine you have a multinomial regression, and in the regression you have a categorical predictor.

in Python, cause Bambi will [: 00:54:00

But then, if you can use ZeroSumNormal because it makes more sense to you, what I love is that you can actually write the multinomial regression like you write the binomial regression. It's just that instead of using pm.Normal, you're gonna use pm.ZeroSumNormal. And that's super cool. It makes everything way easier.
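One way to see why multinomial regression needs either a pivot or a zero-sum constraint: the softmax that maps per-category scores to probabilities does not change when the same constant is added to every score. The sketch below uses made-up scores and helper names that are not from any library. It compares the pivot convention (one score pinned to zero) with the zero-sum convention (scores centered on their mean).

```python
import math


def softmax(scores):
    """Map real-valued scores to probabilities (numerically stable)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pivot(scores, ref=0):
    """Pivot convention: the reference category's score is fixed at zero."""
    return [s - scores[ref] for s in scores]


def center(scores):
    """Zero-sum convention: scores are expressed relative to their mean."""
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]


scores = [1.2, -0.3, 0.5]               # made-up scores for three categories
shifted = [s + 10.0 for s in scores]    # same scores plus a constant

p = softmax(scores)
print(p)
print(softmax(shifted))         # same probabilities: the shift is not identified
print(softmax(pivot(scores)))   # same probabilities, category 0 is the reference
print(softmax(center(scores)))  # same probabilities, no category is singled out
```

All four calls describe the same probabilities. The zero-sum version simply picks the representative whose scores sum to zero instead of singling out one category.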

So I think these are, I would say, the two main examples. And I have to work on a Jupyter notebook example with Ben Vincent, to go through at least two examples: linear regression with categorical predictors, and multinomial regression. So we'll get there, folks. At the time of recording, well, Ben has started that, and Tomi Capretto too. I just haven't looked at it at all.

ll, cuz the zero sum normous [: 00:55:00

Yeah. Which is really nice if you're working with interactions as well. Yeah. So in your example, for instance, that would be the common intercept. Then you have one parameter per state, where you have a ZeroSumNormal that sums to zero across states. Then you have one parameter per... what could we use here,

in the example, as a second categorical predictor? That could be age group. Yeah, that could be age group, exactly. So then you have age groups, which here are categories again. And so you have the main effect of age groups, where you would use the ZeroSumNormal on age this time. And then you could look at the interaction of age group and states.

So that means another [: 00:56:00

And yeah, the cool stuff with the implementation we have in PyMC is that you can just say pm.ZeroSumNormal and the usual stuff: give a name, then your parameter, your prior for sigma. Then you just say zerosum_axes equals two, and that means PyMC will understand. You just need to know what the two zero-sum axes are.

Basically on the dimensions that you... Also, maybe you could mention a case where it might not make sense to use the ZeroSumNormal. That might, for instance, be if you actually have a predictor where it makes sense to say we can draw arbitrary elements from this, so it's not just a fixed, finite set.

so I don't know, patient or [: 00:57:00

You then want to compare to the mean of all patients there, probably, yeah. And then it makes sense that that's actually a parameter in the model. If that set is finite and fixed, then I think fixing it to something based on the mean of those makes more sense. Mm-hmm. And so if that set is not finite.

So using ZeroSumNormal here isn't the best idea, what would you use? Would you use classic reference encoding? No, I think that would probably... I mean, you can use the ZeroSumNormal in the reparameterization sense, where you just say, I just wanna make the model sample faster. So I use this, but I kind of work out in the end again what it would've been if I hadn't done that.

ust use a piano norm. I mean [: 00:58:00

And so then you could get away with a model that's overparameterized, but actually just way easier to read and code, and then to understand. And then if you wanna make predictions on a new patient, for instance, well, you can still do that. Cool. Nice. Yeah, I love that distribution. It's really interesting because, as you say, it's quite simple mathematically, but it still takes a bit of time to understand how it behaves, and how and when it's useful.

That has been around. Yeah. [: 00:59:00

on that and I think a useful way of looking at it. Okay. Time is flying by and I still have a few questions for you. So actually, one I'd like to ask you is a bit more general. I dunno if you'll have an answer to that, but what do you think are the biggest hurdles right now in the Bayesian workflow?

rs for standard deviation or [: 01:00:00

in many cases. It's hard to really investigate all of them and really carefully think about all of those. In a small model you can definitely do that, and kind of try to get a feel for which ones are important. And often some of those priors really just don't matter at all. So if you have a large dataset, whether you put a half-normal prior with a standard deviation of 5 or 2.5 on your final sigma, that might just not make any measurable difference after sampling.

d be if we could mark things [: 01:01:00

Yeah. So that you could mark those in a model by using, I don't know, pm.Hyperparameter or something, and then automatically have some method that helps you figure out which of those are actually important. Mm-hmm. So either by resampling or by looking at gradients, trying to do sensitivity analysis in some sense: how does the posterior depend on those choices that I make there?
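As a rough illustration of that kind of sensitivity check, here is a hand-rolled sketch, not an existing PyMC feature. It assumes a toy model with normally distributed data of known mean zero and a half-normal prior on sigma, approximates the posterior on a grid, and compares the posterior mean of sigma under prior scales of 5 and 2.5. The function name and the simulated data are made up.

```python
import math
import random


def posterior_mean_sigma(data, prior_scale, grid_size=2000, upper=10.0):
    """Grid approximation of E[sigma | data] for y ~ Normal(0, sigma),
    with sigma ~ HalfNormal(prior_scale). Toy model, for illustration only."""
    n = len(data)
    ss = sum(y * y for y in data)
    grid = [upper * (i + 0.5) / grid_size for i in range(grid_size)]
    # Unnormalized log posterior = log likelihood + log prior.
    log_post = [
        -n * math.log(s) - ss / (2 * s * s) - s * s / (2 * prior_scale**2)
        for s in grid
    ]
    top = max(log_post)
    weights = [math.exp(lp - top) for lp in log_post]
    return sum(s * w for s, w in zip(grid, weights)) / sum(weights)


random.seed(1)
large = [random.gauss(0.0, 1.5) for _ in range(500)]  # "large" dataset
small = large[:5]                                     # tiny dataset

# How much does the posterior mean move when the prior scale is halved?
shift_large = abs(posterior_mean_sigma(large, 5.0) - posterior_mean_sigma(large, 2.5))
shift_small = abs(posterior_mean_sigma(small, 5.0) - posterior_mean_sigma(small, 2.5))

print(f"shift with 500 observations: {shift_large:.5f}")
print(f"shift with   5 observations: {shift_small:.5f}")
```

In this toy setting the choice between 5 and 2.5 barely moves the posterior with 500 observations and matters more with 5, which is the kind of information an automated sensitivity tool could surface for each marked hyperparameter.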

, yeah, I agree is it's hard [: 01:02:00

In a principled way. Yeah, definitely. And I mean, it ties back to what we were talking about at the beginning, about how do you choose the sigmas for any given individual parameter, right? Whereas it often makes way more sense to just say, okay, that's the amount of variation I expect from that whole phenomenon.

And here is how I think it is distributed among the parameters. I don't know. Exactly. Then another thing that I think is important, mostly for highly non-linear models: let's say you have something with an ODE. That's something my wife actually does at the university, in her postdoc, where it comes up a lot that you have an ODE and you don't really think that this ODE exactly describes the data generating process.

the, the whole setting where [: 01:03:00

But if that's wrong, then, well, what does the output mean? And definitely in highly non-linear models you can easily see examples where this just goes horribly wrong and it's really hard to fix. So the question is how we can get uncertainty about more complicated processes into the model, how exactly they work, and make the uncertainty estimates more robust to changes to the model.

patient workflow, you know, [: 01:04:00

So, yeah. I mean, posterior predictive checking definitely is an important thing there, where you can then easily notice that something is wrong, at least. But that doesn't necessarily make it that easy to fix. No, no, for sure. And yeah, for the record, I think that pm.Careful is a really good name for that kind of automated thing you were talking about.

e? Ah, I dunno actually just [: 01:05:00

I think there's a family of models that seems really simple and easy to understand, but actually is pretty... I mean generalized linear models, I guess. So the more general version is surprisingly useful, unreasonably useful, I think. It really looks simple. It seems like you could learn that pretty quickly, but actually learning all the details and making sure you can actually use it in different settings is really tricky.

And yeah, I think that's just in general an interesting thing. Yeah, I agree. I really love generalized linear regressions. I mean, that's kind of, you know, the illustration of the Pareto principle in statistics. It's like 20% of the models, but you get 80% of the results thanks to those.

y cool. And also kind, most, [: 01:06:00

Yeah, I agree. It's like, you know, a traditional cooking recipe: each time you make it, it's like, yeah, that stuff's good, that works. Kind of like pizza, you know. Pizza works most of the time, you just love it. You can try to do other stuff, some new fancy fusion stuff in a Michelin-star restaurant, but yeah, just pizza will get you a long way.

with, uh, next Fighter pilot [: 01:07:00

Like it would make it so sexy. So definitely we need a movie about Bayes stats, and that should be the name of the movie, probably: Linear Regression Is the Pizza of Statistics. Such a bad name. Yeah. Okay. Which projects are you most excited about for the coming months? Do you have anything you are working on right now that you are super excited about?

Yeah, that would be nutpie, I think. And just in general the mass matrix adaptation changes, and I also kind of like the mathematical framework that I came up with for that, mm-hmm, which I'm horribly slow with writing down properly. Yeah, anyway. But I'm really, really excited about that.

I think that [: 01:08:00

Yeah, probably, I mean, getting the mass matrix adaptation... Yeah, getting the samplers more robust, I think. I knew it. That's probably it. Yeah. I love that answer. You're definitely the first one to say that. So now I guess it shows that I don't actually listen to your podcast that often, because I don't really know what most people answer there.

ny problem within statistics [: 01:09:00

Not, kind of, a worldwide problem. I think that I might have other priorities, to be honest. Okay. Okay, I see what you did here. That's not gonna fly, but I see what you're trying to do. So second question: if you could have dinner with any great scientific mind, dead, alive, or fictional, who would it be? That's also a hard question.

Someone like Euclid would actually be pretty interesting, assuming there's no language barrier, because otherwise I'd definitely go for somebody German or English or British or whatever. Yeah, I think that would be interesting. I mean, I don't know if he was actually a nice guy. I don't think we actually know that much about him personally.

of thinking about mathematics: 2000

I guess I like antiquity. Interesting period, yeah. So even apart from math and science and stuff, there might be things to ask. Yeah, for sure. That must be fun. And the cool thing is that if he turns out to be a jerk, well, after the dinner he would probably die because of all the germs that you gave him and that he has no immunity against. Hopefully it goes that direction.

I'm not entirely sure, but yeah, true. I mean, it's more probable that it goes that direction than the other, but yeah, you never know. Vielen Dank. I would've had so many more questions, but it's already been a long time, so let's call it a show. And as usual, I put resources and a link to your.

r episode of Turning Patient [: 01:11:00

That's learnbayesstats.com. Our theme music is "Good Bayesian", by Baba Brinkman, feat. MC Lars and Mega Ran. Check out his awesome work at bababrinkman.com. I'm your host, Alex Andorra. You can follow me on Twitter at alex_andorra, like the country. You can support the show and unlock exclusive benefits by visiting patreon.com/learnbayesstats.

edictions that your brain is [: 01:12:00

Let's get the solid foundation.
