Easily listen to Learning Bayesian Statistics

Artwork for podcast Learning Bayesian Statistics

#144 Why is Bayesian Deep Learning so Powerful, with Maurizio Filippone

Open-Source News Episode 144 • 30th October 2025 • Learning Bayesian Statistics • Alexandre Andorra

Speaker A: 00:00:02

Let me show you how to be a good Beezy.

Speaker B: 00:00:04

How do we bring rigorous uncertainty into modern machine learning without losing scalability?

Speaker B: 00:00:11

Today I am joined by Maurizio Filippone, Associate professor at KAUST and leader of the Bayesian Deep Learning Group whose path from physics to machine learning has been guided by a single obsession function estimation Done the Bayesian way we dive into the frontier where GP's meet deep learning, deep Gaussian processes, Bayesian neural networks trained with stochastic gradients and pragmatic tools like Monte Carlo Dropout for uncertainty quantification.

Speaker B: 00:00:41

Along the way we tackle trade offs between interpretability and flexibility, when to reach for a GP versus a neural net, and how Bayesian ideas improve optimization, experimental design and even generative models.

Speaker B: 00:00:56

Finally, we look ahead to the future where uncertainty isn't an afterthought, but a first class citizen of AI, integrated, efficient and indispensable.

Speaker B: 00:01:07

,: 2025

Speaker A: 00:01:16

Let me show you how to be a good peasy and change your predictions after taking information in.

Speaker A: 00:01:22

And if you're thinking I'll be less than amazing, let's adjust expectations.

Speaker A: 00:01:28

What's a Bayesian?

Speaker A: 00:01:30

Is someone who cares about evidence.

Speaker C: 00:01:31

Welcome to Learning Bayesian Statistics, a podcast about Bayesian inference, the methods, the projects and the people who make it possible.

Speaker C: 00:01:40

I'm your host Alex Endora.

Speaker C: 00:01:42

You can follow me on Twitter lexendora Like the country for any info about the show, learnbatestats.com is laplacetobie show notes becoming a corporate sponsor Unlocking Bajan Merch Supporting the show on Patreon Everything is in there.

Speaker C: 00:01:58

That's learnbased stats.com if you're interested in one on one mentorship, online courses or statistical consulting, feel free to reach out and book a call at topmate IO Alexandora CEO Round folks and best patient wishes to you all.

Speaker C: 00:02:14

And if today's discussion sparked ideas for your business, well our team at PMC Labs can help bring them to life.

Speaker C: 00:02:21

Check us out@pmc labs.com.

Speaker B: 00:02:28

Hello my dear.

Speaker D: 00:02:28

Patients, just a quick word to remind.

Speaker B: 00:02:32

You that I'm running my first ever live workshop and it's going to be a live cohort.

Speaker B: 00:02:39

We're going to kick this off with athletics and we're going to do hierarchical models in PC and Bambi on November 5 and 6 and in two sessions you will beat the and interpret a working multi level model with posterior checks and stakeholder ready visuals.

Speaker B: 00:02:57

It's going to be short live code.

Speaker B: 00:03:00

First, thanks to athletics, we're going to have pre authenticated GCP VMs so you can model without set of frictions.

Speaker B: 00:03:07

And that means that if you want to come learn live with me also join the discord that we have with the learn based stats patrons.

Speaker B: 00:03:16

Well, now is the time to join.

Speaker B: 00:03:19

We're going to learn with sports analytics examples and few other examples.

Speaker B: 00:03:24

Of course we're pretty much at capacity but there are still a few spots left.

Speaker B: 00:03:29

So I would love to see you in a few weeks November 5th and 6th.

Speaker B: 00:03:34

All the details are in the show notes and well, if you have any questions, feel free to reach out.

Speaker B: 00:03:40

Otherwise I'll see you very soon.

Speaker B: 00:03:43

Thank you folks.

Speaker B: 00:03:44

And now let's talk about what is a deep Gaussian process with Maurizio Filipponi.

Speaker B: 00:03:54

Maurezio Filipone Benvenuto I Learning Bayesian statistics.

Speaker E: 00:04:00

Well, thank you so much, Alex.

Speaker E: 00:04:01

Thank you so much.

Speaker E: 00:04:02

Great, great to be here.

Speaker B: 00:04:04

Thanks a lot again to Hans for, for putting us in context.

Speaker B: 00:04:08

You're like the, the podcast matchmaker of LBs.

Speaker D: 00:04:13

Because we already had some colleague of yours actually, Maurizio.

Speaker B: 00:04:17

I'll put the link in the, in the show notes we had Harvard Rue and Janet Van Nieckerk.

Speaker B: 00:04:23

They were in episode 136.

Speaker B: 00:04:26

So I'll put that into the show notes.

Speaker B: 00:04:27

That was a really, really fun episode.

Speaker B: 00:04:29

Talks everything about Bayesian inference at scale, everything about inla.

Speaker B: 00:04:33

So lots of good nuggets in this episode, folks.

Speaker B: 00:04:35

We talked about penalized complexity priors which are available out of the box in our package.

Speaker B: 00:04:40

I guess they will be in the Python package.

Speaker B: 00:04:43

If you're interested in inline Bayesian inference at scale, this is definitely an exciting time.

Speaker B: 00:04:50

I think it's very good for us to tackle the idea that Bayesian is not available to scale and we'll keep doing that today, I guess with you, Maurizio.

Speaker B: 00:04:59

We'll talk about gp, about deep learning, all the very fun stuff you do.

Speaker B: 00:05:03

But first, let's start with your origin story.

Speaker B: 00:05:06

Can you tell us what you're doing nowadays?

Speaker B: 00:05:08

And also how did you end up working on that?

Speaker B: 00:05:10

You know, because you're an Italian in Saudi Arabia.

Speaker B: 00:05:13

How did that happen?

Speaker E: 00:05:14

Yes, that's a long journey.

Speaker E: 00:05:18

So first of all, I would like to thank you for the great service you're doing for the community.

Speaker E: 00:05:22

I think this is very important and I'm really happy to be here and to be part of this long list of great speakers that you had before big important guests that you had before me.

Speaker E: 00:05:32

So I started with a master's in physics, so that's where I started in Italy.

Speaker E: 00:05:39

I got interested in dynamical systems at the time and I took a course on neural networks.

Speaker E: 00:05:49

This is many years ago, where we were trying to understand whether we could predict time series using neural networks without knowledge of the physics.

Speaker E: 00:06:01

So even if I was studying physics, we were trying to avoid having to know physics to make these predictions.

Speaker E: 00:06:07

And it turns out that there's some nice mathematics behind the theory of dynamical systems, allowing you to.

Speaker E: 00:06:16

Even for chaotic systems, so systems that are relatively simple to write in terms of differential equations, but then the trajectories really evolved in a very seemingly random way.

Speaker E: 00:06:28

But actually it's not random, it's just that the characteristics of this differential equation is so that there is this emergence of chaos.

Speaker E: 00:06:36

You can still predict this time series really well.

Speaker E: 00:06:40

And then I got interested in machine learning.

Speaker E: 00:06:42

And at the time people were not believing that this would be a smart move because it was, you know, more than 20 years ago.

Speaker E: 00:06:52

And I believed that this would be something interesting to pursue.

Speaker E: 00:06:56

And then I started a PhD in computer science and that led me to move into the uk.

Speaker E: 00:07:01

So one of my reviewers was Mark Girolamy at Glasgow at the time.

Speaker E: 00:07:07

So I got interested in exploring the uk.

Speaker E: 00:07:10

So I first did the postdoc in Sheffield and then eventually I moved to Glasgow with Mark Girolamy and then from there we moved to UCL for a year while I was doing a postdoc, because he got a Chair of Statistics in the statistics department at ucl.

Speaker E: 00:07:27

Then I got a lectureship back in Glasgow.

Speaker E: 00:07:29

And after a few years in computer science at Glasgow, I decided to move to France.

Speaker E: 00:07:34

There was a big opportunity to develop machine learning at scale in this institute and build something new there.

Speaker E: 00:07:41

So it was an exciting opportunity.

Speaker E: 00:07:43

And then after eight years that I successfully built something there, I decided to explore something new where there was this opportunity here at kaust.

Speaker E: 00:07:52

I knew a lot of great people working here, so I decided to give it a shot.

Speaker E: 00:07:57

And now I'm here.

Speaker E: 00:07:58

And in this journey I started from time series prediction, going through clustering, anomaly detection, and eventually now we're working on various applications in.

Speaker E: 00:08:11

Well, I've worked in some applications in neuroscience, fraud detection, industrial applications of various kinds, and now here there is stronger focus on environmental sciences.

Speaker E: 00:08:23

And, yeah, it's really exciting.

Speaker D: 00:08:25

Yeah, yeah, I agree.

Speaker B: 00:08:27

And before diving to the technical details, can you also give us an idea of what your group's main goals and research themes are?

Speaker B: 00:08:39

Because you lead the Bayesian Deep Learning group there at Kaust.

Speaker E: 00:08:44

Yeah.

Speaker E: 00:08:45

So in terms of themes, I think There is one big theme that is central to everything, which is function estimation.

Speaker E: 00:08:51

So a lot of the things I do every day, a lot of the things that most people do in machine learning every day and statistics, I think it's really function estimation.

Speaker E: 00:09:00

So I started working on kernel methods back when I was doing my master's and Ph.D. and then eventually through the postdoc, we started working on these non parametric models, so Gaussian processes with probabilistic nonparametric models.

Speaker E: 00:09:14

So Gaussian processes.

Speaker E: 00:09:15

And then eventually deep learning started to become quite popular and very powerful.

Speaker E: 00:09:23

So naturally it felt like we had to think about these extensions to deeper models.

Speaker E: 00:09:30

And so the natural thing for me was to take a Gaussian process and make it deep.

Speaker E: 00:09:34

Right.

Speaker E: 00:09:34

And so we started studying these deep Gaussian processes which were already been proposed a few years back, but of course we started thinking about approximations to make them scalable and so on.

Speaker E: 00:09:46

And then eventually today I don't have any, let's say preference or I don't see a distinct line between deep Gaussian processes and Bayesian deep neural networks.

Speaker E: 00:09:59

In the end, you can sort of view deep Gaussian processes as special case of a Bayesian neural network.

Speaker E: 00:10:04

So for me now, a lot of the techniques that we've always used to do scalable inference for gps, we sort of port them to Bayesian deep learning.

Speaker E: 00:10:14

And yeah, and so this is an exciting space because there is so much development going on and we're part of this.

Speaker E: 00:10:22

So it's really great.

Speaker B: 00:10:22

Yeah, yeah, that's super fun.

Speaker D: 00:10:25

And yeah, I'm glad that you already established the connection there is between Bayesian neural networks and Gaussian processes.

Speaker B: 00:10:33

Like in the end, everything is a Gaussian process.

Speaker B: 00:10:36

And so I'm curious if you can define what a deep Gaussian process is, because I think my audience has a good idea of what a Bayesian neural network is.

Speaker B: 00:10:49

And I've had especially recently, Vincent Fortwin talk about that on the show.

Speaker B: 00:10:55

So I'll put that also in the show notes.

Speaker B: 00:10:58

So these Bayesian deep learning I think people are familiar with.

Speaker B: 00:11:04

Can you tell us what a deep Gaussian process is?

Speaker B: 00:11:07

Because I think people see what a Gaussian process is, but what makes it a deep one.

Speaker E: 00:11:13

Great episode, the one with Vincent, by the way.

Speaker E: 00:11:17

I checked it out.

Speaker E: 00:11:17

Thank you.

Speaker E: 00:11:18

Because I guess he would say a lot of things that I would probably say also in my episode.

Speaker E: 00:11:22

So it was great to see it.

Speaker E: 00:11:25

So, yeah.

Speaker E: 00:11:25

So a Gaussian process, there's many ways which you can see it.

Speaker E: 00:11:29

The easiest way is probably to start from a linear model.

Speaker E: 00:11:31

I think I really like the construction from a linear model.

Speaker E: 00:11:35

So if we start from a linear model and we make it Bayesian, so we put a prior on the parameters, then we have analytical forms for the posterior, the predictions, Everything is nice and Gaussian.

Speaker E: 00:11:47

And so now one nice thing we can do is to start thinking about linear regression, but now with basis functions.

Speaker E: 00:11:54

So we start introducing linear combinations not of just the covariates or features, if you want to call them that, but you have a transformation.

Speaker E: 00:12:03

Let's say sine and cosine could be trigonometric functions of any kind, could be polynomials.

Speaker E: 00:12:09

And it turns out that you can use kernel tricks to be able to say what the predictive distribution is going to be for this, the model is still linear in the parameters.

Speaker E: 00:12:21

But now what we can do is to take the number of basis functions to infinity, so we can make an infinitely large polynomial.

Speaker E: 00:12:30

And now the number of parameters will be infinite.

Speaker E: 00:12:32

But what we can do is to use this so called kernel trick to actually express everything in terms of scalar products among this mapping of inputs to this polynomial.

Speaker E: 00:12:47

And so if you do that, then what you can do is to, instead of working with polynomials or these basis functions, now you can define a so called kernel function, which is the one that takes inputs features and it spits out a scalar product of these induced polynomials in this very large dimension, infinite dimensional space.

Speaker E: 00:13:06

So this kind of trick allows you to just then work with something which is infinitely powerful in a way, because it's infinitely flexible in a way that you have an infinite number of parameters now.

Speaker E: 00:13:19

But the great thing is that if you have only n observations, all you need to do is to care about what happens for these n observations.

Speaker E: 00:13:26

And so you can construct this covariance matrix and you know, it can do.

Speaker E: 00:13:30

And everything is Gaussian again, it's very nice.

Speaker E: 00:13:32

So the first time you generate a function from Gaussian process, it's beautiful because you get these nice functions that look beautiful and it's just a multivariate normal really.

Speaker E: 00:13:43

And it's just, that's all it is.

Speaker E: 00:13:45

So I still remember the first time I generated the function from a GP because it was Eureka moment where you realize how simple and beautiful this is.

Speaker E: 00:13:55

So then you can think that now this represents a distribution over functions.

Speaker E: 00:14:00

So if you draw from this gp, you obtain samples that are functions.

Speaker E: 00:14:06

And now what you can do is to say, well, what if I take this function now?

Speaker E: 00:14:09

And instead of just observing this function alone, I just put it inside as an input to another Gaussian process.

Speaker E: 00:14:17

So in a gp, you have inputs which are your input data, where you have observations.

Speaker E: 00:14:23

So now you're mapping into functions, and then this function can become now the input to another gp, for example, you know, and then you can even say, okay, let's take these inputs and map them not just to a univariate Gaussian process, where we have just one function, but maybe we can map it into 10 functions, and then these 10 functions become the input to a new Gaussian process.

Speaker E: 00:14:46

And so this would be a one layer, deep Gaussian process.

Speaker E: 00:14:51

Right.

Speaker E: 00:14:51

So you have now one layer, which is first hidden functions that then enter as input to another Gaussian process.

Speaker E: 00:15:00

What's the advantage of this?

Speaker E: 00:15:01

Why do we do this?

Speaker E: 00:15:03

Well, you know, with Gaussian process, that's.

Speaker D: 00:15:05

Going to be my question.

Speaker E: 00:15:06

Yeah, so with Gaussian processes, the.

Speaker E: 00:15:09

The characteristics that you observe for the functions that you generate are determined by the choice of the covariance function.

Speaker E: 00:15:16

So if you take a covariance function, which is rbf, you're going to have infinitely smooth functions that you generate.

Speaker E: 00:15:25

And the way these functions are going to be the length scale of these functions and the amplitude, they're going to be determined by the parameters that you put in the covariance function.

Speaker E: 00:15:36

And of course, there might be problems.

Speaker B: 00:15:38

Where.

Speaker E: 00:15:40

You have no stationarity.

Speaker E: 00:15:42

So in a part of the space, functions should be nice and smooth, and in other parts of the space, maybe you want more flexibility.

Speaker E: 00:15:50

And then a Gaussian process with a standard covariance function cannot achieve that.

Speaker E: 00:15:58

And so in order to increase flexibility, you either spend time designing kernels that actually can do crazy things, which is possible, but relatively hard, because now you have a lot of choices.

Speaker E: 00:16:09

You can combine kernels in multiple ways, and if you have a space of possible kernels you want to choose from, combining them becomes a combinatorial problem.

Speaker E: 00:16:19

So you may say, instead, let's just compose functions.

Speaker E: 00:16:22

And composition is very powerful.

Speaker E: 00:16:24

And this is why deep learning works, because in deep learning, you essentially have function compositions.

Speaker E: 00:16:30

And so even if you compose simple things, the result is something very complicated.

Speaker E: 00:16:36

And you can try it yourself.

Speaker E: 00:16:37

You know, take a sine function and put it into another sine function.

Speaker E: 00:16:41

If you play around with the parameters, you can get things that oscillates in a crazy way.

Speaker E: 00:16:46

And this is very simple, but very powerful.

Speaker E: 00:16:49

And so the idea of deep Gaussian process is exactly this, to try to enrich the kind of class of functions you can obtain by composing functions, composing Gaussian processes.

Speaker E: 00:17:02

And of course now the marginals in a Gaussian process, all the marginals are nice and Gaussian.

Speaker E: 00:17:07

If you compose, these marginals become non Gaussian.

Speaker E: 00:17:10

And this is really getting to the point where you start thinking, well, why should we then restrict ourselves to composing processes that are Gaussian?

Speaker E: 00:17:20

Maybe we can do something else.

Speaker E: 00:17:22

And then, you know, maybe thinking about other ways in which you can be flexible in the way you parameterize these complicated conditional distributions.

Speaker B: 00:17:31

Okay.

Speaker B: 00:17:32

Yeah, yeah.

Speaker B: 00:17:32

Damn, this is super fun.

Speaker B: 00:17:33

So it sounds to me like Fourier decomposition on steroids, basically.

Speaker B: 00:17:38

So it's like decomposing everything through these basis functions and plugging everything into each other.

Speaker B: 00:17:47

So like, you know, like these mamushkas of Gaussian processes, basically.

Speaker B: 00:17:53

So, yeah, and I can definitely see the power of that.

Speaker B: 00:17:56

It's like, yeah, it's, it's like having very deep neural networks, basically.

Speaker B: 00:18:01

So I see, I definitely see the connection.

Speaker B: 00:18:04

Why that would be super helpful.

Speaker B: 00:18:07

And that helps, I'm guessing that helps uncover very complex nonlinear patterns that are very hard to express in a functional form.

Speaker B: 00:18:20

That functional form would be.

Speaker B: 00:18:22

Well, you have to choose the kernels and sometimes, as you were saying, the out of the box kernels can't express the complexity you have in the data.

Speaker B: 00:18:31

And then having basically the machine discover the kernels by itself is much easier.

Speaker E: 00:18:38

Yeah, and it's really also about the marginals.

Speaker E: 00:18:42

If you believe that your marginals can be Gaussian and you're happy with that, then it's all fine.

Speaker E: 00:18:48

You can do kernel design.

Speaker E: 00:18:49

You can spend a bit of time trying to find a good kernel that gives you good fit to the data, good modeling, good uncertainties.

Speaker E: 00:18:56

But then there's still going to be this constraint in a way that you're working with the Gaussian process.

Speaker E: 00:19:01

You know, in the end, marginally, everything is Gaussian.

Speaker E: 00:19:03

You may not want that in certain applications where maybe the distributions are very skewed and other things, you know, and then maybe the skewness also is position dependent, input dependent, you know, so this non stationarity also, again, you can encode it in certain kernels, but it's just so much easier to compose, I mean, from the principle of just mathematical composition.

Speaker E: 00:19:26

Then of course, computationally, how to handle this.

Speaker E: 00:19:29

This is another pair of hands.

Speaker B: 00:19:30

Yeah, yeah, no, exactly.

Speaker B: 00:19:32

I mean, you're trading basically something that's more comfortable for the user for something that's much harder to compute for the computer.

Speaker B: 00:19:44

But yeah, like in the end that also can be something that is more transferable.

Speaker B: 00:19:50

Because if you have, unless you're a deep expert in Gaussian processes, coming up with your own kernels, each time you need to work on a project is very time consuming.

Speaker B: 00:19:59

So it can be actually worth your time to turn into the deep Gaussian processes framework, throw computing power at it and you know, go your merry way working on something in the meantime while the, while the computer samples.

Speaker E: 00:20:15

That definitely makes sense.

Speaker E: 00:20:16

But again, the deep aspect carries other design choices.

Speaker E: 00:20:22

Now you have to choose how many layers, what's the dimensionality of each layer.

Speaker E: 00:20:29

And then there is this other problem of now what kind of inference you choose, which definitely has an effect.

Speaker E: 00:20:37

So we've done some studies on this, trying to compare a little bit, various approaches.

Speaker E: 00:20:42

We did this a few years ago now because the deep, I think we started working on this right after TensorFlow came out.

Speaker E: 00:20:49

So this was: 2016

Speaker E: 00:20:51

So we started doing, we did our deep GP with a certain kind of approximation that is not very popular.

Speaker E: 00:20:56

I mean, the community seems to have agreed that, you know, inducing points methods are very powerful to do approximations.

Speaker E: 00:21:04

And you know, I've also done some work on that with some great people, particularly James Hansman, who has developed GP flow with some other great guys.

Speaker E: 00:21:16

But random features is what you said before, you mentioned Fourier transform on steroids.

Speaker E: 00:21:23

I mean, the idea is really for certain classes of kernels, you can do some sort of expansions and sort of linearize the Gaussian process.

Speaker E: 00:21:30

So before I was talking about going from a linear model to something which is infinite number of basis functions.

Speaker E: 00:21:37

And now the idea is just truncate this number of basis functions.

Speaker E: 00:21:41

You can do it in various ways.

Speaker E: 00:21:44

There is a randomized version that we do when we do these random features and then you sort of truncate.

Speaker E: 00:21:51

And so now instead of working with this, you turn a Gaussian process into a linear model with a, a large number of basis functions.

Speaker E: 00:21:59

And then linear models are nice to work with.

Speaker E: 00:22:01

And then if you compose them, then that's when you get the deep Gaussian process.

Speaker E: 00:22:05

Essentially you get a deep neural network with some stochasticity in the layers and that's all there is to it.

Speaker E: 00:22:13

And so when we did this, we implemented it in TensorFlow because it was the new thing and it was very scalable.

Speaker E: 00:22:21

We took some competitors and we really were really fast at converging to good solutions and getting good results, you know, so.

Speaker E: 00:22:31

And we have an implementation out there in TensorFlow, unfortunately, I mean we, we should now maybe port it to Pytorch, which has become what we work on more.

Speaker E: 00:22:41

Yeah, but.

Speaker F: 00:22:47

Hugo Bown Anderson here, data and AI scientist, consultant and educator.

Speaker F: 00:22:51

I'm a friend of Alex and I was on episode 122 of Learn Bayesian Statistics talking about learning and teaching in the age of AI.

Speaker F: 00:22:59

If you're building with LLMs and AI and especially if you've hit that wall where your prototype works sometimes but isn't reliable enough to ship, I've got something for you.

Speaker F: 00:23:09

I'm teaching a four week course called Building AI Applications.

Speaker F: 00:23:12

We focus on the actual software development, life cycle, agents, evals, logging rag, fine tuning, iterating, debugging and more.

Speaker F: 00:23:20

I teach it with Stefan Krachik who is currently working on AI agent infrastructure at Salesforce.

Speaker F: 00:23:26

Students get over $1,200 in cloud credits from Modal Pedantic, Logfire, Chroma Cloud and more to build with immediately.

Speaker F: 00:23:33

We're excited to offer you all 25% off the link is in the show notes.

Speaker F: 00:23:38

You can also go to Bit Ly Lbs friends.

Speaker F: 00:23:43

Class starts November 3rd.

Speaker F: 00:23:45

Would love to see you there.

Speaker B: 00:23:52

No, for sure.

Speaker B: 00:23:52

But I mean, yeah, that's definitely linked to that, to that TensorFlow implementation that you have because yeah, I'm very big on pointing people towards how they can apply that in practice and basically making the bridge between frontier research, as you're doing and, and helping people implement that in their own modeling workflows and problems.

Speaker B: 00:24:18

So let's definitely do that.

Speaker B: 00:24:20

And yeah, I was actually going to ask you.

Speaker B: 00:24:22

Okay, so that's, that's a great explanation and thank you so much for laying that out.

Speaker D: 00:24:27

So.

Speaker B: 00:24:28

So clearly I think it's awesome to start from the linear representation, as you were saying.

Speaker B: 00:24:32

And basically, yeah, going to the very big deep gps which are in a way easier for me to represent to myself because they, you know, it's like an in the infinity in the limit.

Speaker B: 00:24:49

It's easier, I find to work with than deep neural networks, for instance.

Speaker B: 00:24:54

But yes, like can you give us a lay of the land of how.

Speaker B: 00:25:01

What's the field about right now?

Speaker B: 00:25:03

Let's start with the practicality of it.

Speaker B: 00:25:07

What would you recommend for people?

Speaker B: 00:25:12

In which cases would these deep GPS be useful?

Speaker B: 00:25:17

First and second question, why wouldn't they use just deep neural networks instead of deep gps?

Speaker B: 00:25:26

Let's start with that.

Speaker B: 00:25:27

I have a lot of other questions, but let's start with that.

Speaker B: 00:25:29

I think it's the most general one.

Speaker E: 00:25:31

Yeah, yeah, I think, I mean it's a great question.

Speaker E: 00:25:34

It's a, it's the Mother of all questions, really.

Speaker E: 00:25:36

I mean, what kind of model should you choose for your day data?

Speaker E: 00:25:39

And I think that is going to be a lot of great work that is going to happen soon where we're going to maybe be able to give more definite answers to this.

Speaker E: 00:25:49

I think we're starting to realize that this overparameterization that we see in deep learning is not so bad after all.

Speaker E: 00:25:59

So for someone working in Bayesian statistics, I think we have this image in mind where we should find the right complexity for the data that we have.

Speaker E: 00:26:08

So there's going to be a sweet spot of a model that is sort of parsimonious in looking at the data and not too parameterized.

Speaker E: 00:26:15

But actually deep learning is telling us now a different story, which is not different from the story that we know for non parametric modeling, for Gaussian processes.

Speaker E: 00:26:24

In Gaussian processes we push the number of parameters to infinity, right?

Speaker E: 00:26:27

And in deep learning now we're sort of doing the same, but in a slightly mathematical, different form.

Speaker E: 00:26:34

But the.

Speaker E: 00:26:36

So where we're getting at is a point where actually this enormous complexity is in a way facilitating certain behaviors for these models to be able to represent our data in a very simple way.

Speaker E: 00:26:50

So the emergence of simplicity seems to be connected to this explosion in parameters.

Speaker E: 00:26:55

And I think Andrew Wilson has done some amazing work on this and it's recently published and I can link it to that paper which says deep learning is not so mysterious.

Speaker E: 00:27:08

And it's something I was reading recently.

Speaker E: 00:27:09

It's a beautiful read.

Speaker E: 00:27:12

And I think to go back to your question, so today, what should we do?

Speaker E: 00:27:15

Should we stick to a gp?

Speaker E: 00:27:17

Should we go for a deep neural network?

Speaker E: 00:27:19

I think for certain problems we may have some understanding of the kind of functions we want.

Speaker E: 00:27:25

And so for those that if it's possible and easy to encode them with gps, I think it's definitely a good idea to go for that.

Speaker E: 00:27:34

But there might be other problems where we have no idea, or maybe there is too many complications in the way we can think about the uncertainties and other things.

Speaker E: 00:27:46

And so maybe just throwing a data driven, I mean if we have a lot of data, maybe we can say, okay, maybe we can go for an approach that is data hungry and then, you know, we can leverage that.

Speaker E: 00:27:56

And deep learning seems to be like maybe a right choice there.

Speaker E: 00:27:59

But of course now there is also a lot of stuff happening in other spaces, let's say in terms of foundational models.

Speaker E: 00:28:09

So now there is this class, this breed of new things, new models that have been trained on A lot of data.

Speaker E: 00:28:17

And then with some fine tuning on your small data, you can actually adapt them.

Speaker E: 00:28:22

You know, this transfer learning actually works and we've done it for.

Speaker E: 00:28:26

So there's this paper again by Andrew Wilson on predicting time series with language models.

Speaker E: 00:28:32

So you take ChatGPT and you make it predict, you discretize your time series, you tokenize and you give it to GPT and you look at the predictions, you invert the transformation so you get back scalar values.

Speaker E: 00:28:46

And actually this seems to be working quite well.

Speaker E: 00:28:49

So we tried now for, with the multivariate versions of this probabilistic multivariate and so on.

Speaker E: 00:28:55

So we've done some work on that also.

Speaker E: 00:28:57

But just to say that, I mean now this is something also kind of new that is happening, you know, because before maybe it was really hard to train these models at such a large scale.

Speaker E: 00:29:08

But now if you train a model on the entire web with all the language, language is Markovian in a way.

Speaker E: 00:29:15

So you know, these Markovian structures are sort of learned by these models.

Speaker E: 00:29:19

And now if you feed these models with the stuff that is Markovian, it will try to make a prediction that is actually going to be reasonable.

Speaker E: 00:29:26

And this is what we've seen in the literature.

Speaker E: 00:29:30

And all these things are I think are going to change a lot of the way we think about designing a model for the data we have and how we do inference and all these things.

Speaker E: 00:29:42

So as of today I think maybe still is relevant to think about.

Speaker E: 00:29:46

Okay, if I have a particular type of data, I know that it makes sense to use a Gaussian process because I want certain properties in the functions, I want certain matern for example, gives us some sort of smoothness up to a certain degree and it's easy to encode length scales of these functions for the prior of the functions.

Speaker E: 00:30:05

And this is great for neural networks.

Speaker E: 00:30:07

This is very hard to do.

Speaker E: 00:30:08

So we've done some work trying to map the two.

Speaker E: 00:30:11

Right.

Speaker E: 00:30:11

So we tried to say okay, what can we make a neural network?

Speaker E: 00:30:14

Imitate what Gaussian processes do so that we gain sort of the interpretability and the nice properties of a Gaussian process.

Speaker E: 00:30:22

But then we also inherit the flexibility and the power of these deep learning models so that they can really perform well and also give us sound uncertain quantification.

Speaker B: 00:30:34

Yeah, okay.

Speaker B: 00:30:35

Yeah, yeah, yeah.

Speaker D: 00:30:36

So many things to unpack here.

Speaker D: 00:30:38

I love it.

Speaker D: 00:30:39

This is, this is super exciting to me because I love, I, I, I.

Speaker B: 00:30:42

Love working with these methods but I also end up working with them a lot.

Speaker B: 00:30:47

Gps, GPS of Course, as, as my listeners are tired of hearing.

Speaker E: 00:30:53

But.

Speaker B: 00:30:54

And yeah, everything you just said here is something that resonates because what I love in GPS is their composability and their interpretability.

Speaker B: 00:31:02

Especially because you can, and thanks to that you can, you can impose prior structure on the functions you're going to get.

Speaker B: 00:31:09

And I find this is extremely useful.

Speaker B: 00:31:13

Yeah.

Speaker B: 00:31:14

So two questions on that.

Speaker B: 00:31:15

First, do you still have the interpretability of DPs?

Speaker B: 00:31:21

If you have deep DPs, like, does the length scale still mean something?

Speaker B: 00:31:26

And the amplitude if you have like an exponential family kernel?

Speaker B: 00:31:31

And second question, what are the state of the art packages that you would recommend people check out right now, both in R and Python, or maybe just in Python, because deep learning is mostly Python centric.

Speaker B: 00:31:49

But like, let's say I'm a listener.

Speaker B: 00:31:51

I find what you're saying very interesting.

Speaker B: 00:31:53

For my use case, I want to check out how to do deep GPS for my project and put that in competition with deep neural networks and hopefully put that into in competition with deep Bayesian neural networks.

Speaker B: 00:32:10

But we talked with Vincent about the fact that for now there is no real out of the box package that helps you do that invasion neural networks.

Speaker B: 00:32:19

So yeah, then two big questions.

Speaker D: 00:32:21

Sorry.

Speaker D: 00:32:21

But like, I think it's going to be super interesting.

Speaker E: 00:32:24

Yeah, well, I think so.

Speaker E: 00:32:27

In terms of the code, in terms of code, I think gps, I think that GP Flow is probably one of the most accessible ones.

Speaker E: 00:32:40

James is a good friend.

Speaker E: 00:32:43

when we were chatting @nerifs: 2015

Speaker E: 00:32:53

And so he said, okay, I'm going to do a software package for GPS in TensorFlow.

Speaker E: 00:33:02

And this is something that he then developed over the years.

Speaker E: 00:33:06

He moved to a startup company called Prowler for a few years.

Speaker E: 00:33:11

He had a good team of developers helping him out.

Speaker E: 00:33:14

So he did a really great job in that.

Speaker E: 00:33:17

And I think GP Flow is a really good starting point.

Speaker E: 00:33:21

I think for some projects also with my students in the past, we relied on that.

Speaker E: 00:33:26

And I think you can also put.

Speaker B: 00:33:29

That in the show notes.

Speaker B: 00:33:32

And James should come on the show.

Speaker B: 00:33:33

It sounds like.

Speaker E: 00:33:34

Absolutely, yeah.

Speaker B: 00:33:35

Be a great guest for that.

Speaker D: 00:33:37

Like a GP Flow episode.

Speaker E: 00:33:39

Yeah.

Speaker E: 00:33:39

He's also a great cook, by the way.

Speaker E: 00:33:42

He invited me and Alex Matthews for dinner once in Sheffield for pasta.

Speaker E: 00:33:47

And I thought, okay, you know, he's gonna make some normal pasta.

Speaker E: 00:33:51

No, he made pasta from scratch.

Speaker D: 00:33:52

Non Italian pasta.

Speaker D: 00:33:53

Overcooked.

Speaker E: 00:33:54

Yeah, I was very impressed.

Speaker E: 00:33:56

He Did a fantastic job.

Speaker E: 00:33:57

It was really nice.

Speaker B: 00:33:58

Oh, damn, that's quite the endorsement.

Speaker D: 00:34:00

Yeah, yeah, yeah, that's cool.

Speaker B: 00:34:01

So then, no, like he needs to come on the show, but like for a live show, then I need to do a live show in Sheffield, it sounds like.

Speaker E: 00:34:09

Yes, and, and so.

Speaker E: 00:34:14

So, yeah, so I think there are also deep GPS you can easily do there.

Speaker B: 00:34:19

With GP flow.

Speaker E: 00:34:21

With GP flow, yes.

Speaker E: 00:34:22

And I think the type of approximation you can use is based on most of James's work, which is based on inducing points for random features, which is another way in which you could approximate.

Speaker E: 00:34:36

So for inducing points, instead of expressing a full process with N data points, you select M. Inducing points, we call it, that allow you to express the entire process, but having to do computations only with this M, let's say.

Speaker E: 00:34:55

So having to deal with matrices which are M by M, essentially.

Speaker E: 00:34:58

So you have M cube complexity rather than N cube that you would have with a full gp.

Speaker E: 00:35:04

And these work really well.

Speaker E: 00:35:05

And you have a nice beautiful variational sort of treatment for these models.

Speaker E: 00:35:13

You can optimize the position of the inducing inputs and everything is really nice and beautiful.

Speaker E: 00:35:20

There is a nice stream of papers by James.

Speaker E: 00:35:22

I contributed to a couple of these, where we also did some mcmc and later on, also with my group, we did some full fledged MCMC where we also sample the inducing locations, which was something that people typically optimize.

Speaker E: 00:35:37

But just to say, I think in GP flow you can start with a lot of great, you know, examples that can take you very far, as Vincent was saying.

Speaker E: 00:35:47

You know, you had Vincent here in another episode.

Speaker E: 00:35:50

He's right that it's a bit of a pain point, not having, let's say, I don't know, an accepted and widely used toolbox for Bayesian deep learning.

Speaker E: 00:36:02

So I think that's something that we should work as a community.

Speaker E: 00:36:07

There are many events that we are trying to participate, to get together, to reflect on what is the role of Bayes in the current state of AI.

Speaker E: 00:36:18

So we had one in Dagestu last year and we're going to have one in Abu Dhabi coming up soon at the end of this month.

Speaker E: 00:36:25

And I think we should talk about this specifically.

Speaker E: 00:36:28

How can we lead an initiative for co development?

Speaker E: 00:36:33

But I think it's not easy because each one of us as professors, as academics, we have to serve certain priorities which are in our case publications and maybe in my case also, engagement with applications here in the kingdom is something very valued.

Speaker E: 00:36:50

And so the effort of developing a software package I think goes a bit beyond that, right so there needs to be some nice conditions to be able to have a team of developers available to do something like that for a long time.

Speaker E: 00:37:07

And I think that's a challenge, at least for us.

Speaker E: 00:37:10

And people working in the industry also have, you know, have certain priorities for coming from constraints, coming from their company.

Speaker E: 00:37:20

So I think that's a difficult one for everybody, but definitely very valuable.

Speaker E: 00:37:26

Yeah, and there was another part of your question that I think I missed.

Speaker D: 00:37:29

Yeah, yeah, yeah, we'll come back to that, don't worry.

Speaker B: 00:37:31

Yeah, yeah.

Speaker B: 00:37:33

Just to piggyback on what you were saying.

Speaker B: 00:37:35

Yeah, for sure.

Speaker B: 00:37:36

In industry, I would say it's mostly you need to tie that to a project you have at work.

Speaker B: 00:37:41

Like if you need that for work, then that's definitely something that can make things happen very fast and much, much faster because then you can get some budget to finance an open source solution to that, to that problem, which will, which will make the development cycle much faster than if you have to do it internally alone.

Speaker B: 00:38:04

So yeah, for sure.

Speaker B: 00:38:05

But that's very good.

Speaker B: 00:38:06

That GP flow already has all this support for inducing point for deep GPS is very good.

Speaker B: 00:38:17

I would say that PMC also has very good GP support.

Speaker B: 00:38:21

I use that all the time for my Bayesian gps.

Speaker B: 00:38:26

Not only the vanilla gps, but the inducing point GPS too.

Speaker B: 00:38:29

We have that in PMC and PYMC extras, both for marginal likelihood gp, so for normal likelihood and for latent gps.

Speaker B: 00:38:40

So if you have a non normal likelihood.

Speaker B: 00:38:45

And of course the HSGP approximation has been a real game changer for using GPS in the wide and we have that in PYMC two out of the box.

Speaker B: 00:38:54

The great thing here, compared to GPflow, I would say, is that you can compose that with other parts in your Bayesian model.

Speaker B: 00:39:05

So it doesn't have to be a pure GP model, it can be combined to other random variables that you have in the model.

Speaker B: 00:39:13

So you could have like a classic linear slope added to a GP with a.

Speaker B: 00:39:17

The baseline.

Speaker B: 00:39:18

So this is very interesting too.

Speaker B: 00:39:22

And you get the difference inference, different inference methods that you get with, you know, a classic ppl.

Speaker B: 00:39:28

So not only mcmc but ADVI Pathfinder soon in lab.

Speaker B: 00:39:36

So yeah, this is great.

Speaker B: 00:39:38

So I encourage people checking that out.

Speaker B: 00:39:40

Definitely encourage people checking gpflow out.

Speaker B: 00:39:44

I think this is, as you were seeing, great baseline, very useful, great API.

Speaker B: 00:39:50

And we definitely, definitely need James on the show to dive deeper into that.

Speaker D: 00:39:54

Because I really want to dive deeper.

Speaker B: 00:39:56

Into that and I've never done a show about GP flow, so I'll keep that in mind.

Speaker B: 00:40:03

But there is that we'll come back to the inference part afterwards.

Speaker B: 00:40:07

But I asked you a question about the interpretation.

Speaker B: 00:40:12

Do we keep the benefit of interpretability of the kernel parameters when we're using DGPS then?

Speaker E: 00:40:20

Yeah.

Speaker E: 00:40:20

Well, in the composition, obviously things become more obscure in a way because now a length scale parameter for the first GP is a length scale for functions that become hidden variables, latent variables for a new Gaussian process.

Speaker E: 00:40:37

So I think it's possible to think a little bit about the implications of this.

Speaker E: 00:40:44

But, you know, you can start thinking about maybe how many oscillations you may expect by doing certain, excuse me, certain length scales over a certain domain.

Speaker E: 00:40:56

You know, you can start thinking, okay, you know, if I take derivatives, maybe I can start looking at how many zeros I may expect from this.

Speaker E: 00:41:03

But you know, it becomes much harder, I think, the deeper you go.

Speaker E: 00:41:06

And I think, of course, in the end there is a lot of other beautiful theory that tells you that if you start pushing now the number of Gaussian processes, so the dimensionality of the Gaussian process to infinity, then you go back to something which is again a Gaussian process.

Speaker E: 00:41:23

nice work by Radford Neel in: 1996

Speaker E: 00:41:39

And there's been a lot of follow up work on that, showing that convolutional neural networks with a lot of, when you take the number of filters to large values, then they become Gaussian processes and so on.

Speaker E: 00:41:52

So central limit theorem there kicks in, in a way, and then a lot of these things become Gaussian again.

Speaker E: 00:41:59

So I think maybe you may recover some interpretability again when you start pushing things to some limits, but then again in the output you get Gaussians.

Speaker E: 00:42:08

So then you lose, in a way the flexibility that you wanted by introducing the composition.

Speaker E: 00:42:14

So it's a trade off.

Speaker E: 00:42:15

Right.

Speaker E: 00:42:15

So how much you want to be flexible and how much you want to be interpretable.

Speaker E: 00:42:19

I think.

Speaker B: 00:42:20

Yeah, okay, yeah, that makes sense.

Speaker B: 00:42:23

That makes a ton of sense.

Speaker B: 00:42:26

So let's go back to the inference part.

Speaker B: 00:42:29

Now.

Speaker B: 00:42:30

Can you give us.

Speaker B: 00:42:32

Yeah, the lay of the land of what are the approximations and scalable GP methods.

Speaker B: 00:42:41

Also, feel free to talk about Bayesian or non Bayesian deep neural networks.

Speaker B: 00:42:47

You know, how can people sample from these models?

Speaker B: 00:42:54

And if you can walk us through the most promising techniques.

Speaker E: 00:43:00

Yeah, great.

Speaker E: 00:43:01

Well, maybe I break up the answer into maybe GPS first and then we move on to maybe deep neural networks.

Speaker E: 00:43:08

I think for gps there hasn't been much development in the last few years, I would say.

Speaker E: 00:43:13

I mean, there are still papers submitted and accepted in the major conferences, but I think they're really a small fraction compared to anything else that is happening.

Speaker E: 00:43:22

So I think a lot of people kind of settled now to some approximation methods and some inference methods.

Speaker E: 00:43:27

Variational seems, I mean, remains one of the nice formulations to be able to treat these models when you start introducing inducing points.

Speaker E: 00:43:37

So with inducing points it becomes kind of nice to work with these variational approximations.

Speaker E: 00:43:42

t work by Michal Listizias in: 2009

Speaker E: 00:43:52

And, and so I think, I would say variational methods for treating the latent variables in Gaussian processes I think is very predominant now to be able to handle scalability and any likelihoods you want.

Speaker E: 00:44:07

We've done some work on MCMC which also works quite well.

Speaker E: 00:44:11

I spent a lot of time doing MCMC a long time ago when I was trying to sample parameters of the covariance along with latent variables.

Speaker E: 00:44:20

So there's been nice works by Ian Marray, for example, Ryan Adams, Dave Mackay himself also has done some work with these guys.

Speaker E: 00:44:31

And at the time I was trying to do sampling.

Speaker E: 00:44:33

And there is this problem of being a hierarchical model.

Speaker E: 00:44:37

It introduces some complications.

Speaker E: 00:44:39

You have hyperparameters, latent variables and data.

Speaker E: 00:44:42

And because of this structure, sampling latent variables becomes quite tricky.

Speaker E: 00:44:47

Sorry, sampling hyperparameters becomes quite tricky because they're tightly coupled to the latent variables.

Speaker E: 00:44:53

So when you sample from the posterior of latent variables, you're conditioning on data, but also on the hyperparameters.

Speaker E: 00:45:00

And so imagine you have a length scale parameter.

Speaker E: 00:45:02

It means that you're sampling your latent functions to be compatible with the length scale you have.

Speaker E: 00:45:08

And then if you sample then length scale, given the latent variables, the length scale is not going to change much because the latent variables have a certain length scale which was informed by the length scale before.

Speaker E: 00:45:19

So you have this very, very slow convergence process for this mcmc.

Speaker E: 00:45:26

So you have to break it up in a way.

Speaker E: 00:45:28

So there has been a lot of work on these ancillary parameterizations, non centered parameterization people call it, in many different ways.

Speaker E: 00:45:36

And so you can start thinking about reparameterizing the Gaussian process in a way that you view latent variables from a Gaussian distribution with Covariance K, you start saying, okay, K, I decompose into ll transpose, where L is a Cholesky decomposition.

Speaker E: 00:45:55

And then you say, I write my latent functions F as L times nu, where nu now are variables that are standard normals.

Speaker E: 00:46:04

And if you do that now, you kind of decouple a little bit in the prior, at least you decouple the dependence between the hyperparameters, which affect the Cholesky of the covariance, and nu, which is now independent variables.

Speaker E: 00:46:17

And so now you can sample a bit more efficiently.

Speaker E: 00:46:19

Then people came up with even better ways of doing this kind of decoupling.

Speaker E: 00:46:24

I've done some work on pseudo marginal Markov chain Monte Carlo where you sort of use important sampling or adaptive importance sampling to integrate out latent variables approximately.

Speaker E: 00:46:33

So you can really sample much faster with faster convergence for the hyper parameters.

Speaker E: 00:46:40

So MCMC for the hyper parameters is possible.

Speaker E: 00:46:43

And I think now with the computing becoming more and more available and cheaper, I think this is something that is definitely something worth considering, especially because a lot of times people work on applications, especially for, you know, these expensive computer simulation problems where you have these simulations that run, you know, for, you know, hours, if not days, and then you have to fit a Gaussian process on these expensive observations to construct an emulator that then you use to sort of calibrate certain parameters of these computer models.

Speaker E: 00:47:18

So these things are very expensive.

Speaker E: 00:47:20

So mcmc, maybe it's not that expensive after all if you do that.

Speaker E: 00:47:24

And for me, like, when I started doing this, I was working on some neuroimaging applications.

Speaker E: 00:47:29

We were handling 68, I think, images from patients, and we're trying to do a classifier for Parkinson with this data.

Speaker E: 00:47:40

And we said, you know, we want to do a good job in quantifying uncertainty in our predictions.

Speaker E: 00:47:44

So we ran this MCMC for like a week.

Speaker E: 00:47:48

And yeah, we got, you know, long chains, good convergence, and yeah, we just did it, you know.

Speaker E: 00:47:55

So this is just what I wanted to say, maybe about gps.

Speaker E: 00:48:00

So we have implementations for this MCMC also.

Speaker E: 00:48:03

Now, for when you want to handle everything in a Bayesian way, you want to sample everything.

Speaker E: 00:48:08

You want to sample inducing inputs, inducing variables, hyper parameters, everything.

Speaker E: 00:48:13

And this was an ASTATS paper.

Speaker E: 00:48:14

We had: 2020

Speaker E: 00:48:20

But also in GP Flow, again, going back there, you see a lot of nice code there that you just use to optimize some of these parameters.

Speaker E: 00:48:29

And I think in many applications this may work quite well.

Speaker B: 00:48:32

Okay.

Speaker B: 00:48:33

And so for this, GP flow is a very good option.

Speaker B: 00:48:38

And the paper you talked about, did you implement that in GPFlow or is that a custom implementation in TensorFlow?

Speaker E: 00:48:45

So, yeah, we started from GPFlow as a code base.

Speaker E: 00:48:47

Yes.

Speaker B: 00:48:47

Okay.

Speaker E: 00:48:48

Speaker B: 00:48:48

So if people want to replicate, for instance, your paper, they can do that.

Speaker E: 00:48:52

Yeah, yeah, but code is available.

Speaker E: 00:48:53

They can download the code and.

Speaker B: 00:48:55

Yes, yeah, yeah, yeah, yeah.

Speaker E: 00:48:56

Also, then pretty much every paper we do, we try to also release code to make it reproducible.

Speaker B: 00:49:02

Yeah, yeah, yeah, yeah.

Speaker B: 00:49:04

So that's awesome.

Speaker B: 00:49:05

But that's also great that it's reproducible with GP flow, because it's a package that's evolving all the time, that's curated, and then people can safely use that in an industrial production setting.

Speaker B: 00:49:17

And that's.

Speaker B: 00:49:18

That's extremely helpful.

Speaker B: 00:49:22

And I find that also very, like, that's a very.

Speaker B: 00:49:24

That's a very good piece of news because, like, that's also been my experience that you can actually do a lot of MCMC sampling with GP's, even with big data sets.

Speaker B: 00:49:35

So, yeah, like, it's usually the.

Speaker B: 00:49:40

The bad priors that people have about it is usually usually not warrantied when you actually trying to do that.

Speaker B: 00:49:48

Yeah.

Speaker E: 00:49:48

So a few years ago, I gave a talk.

Speaker E: 00:49:50

I gave a talk in Cambridge at one event there, and o' Hagan was there, so I was presenting these deep GPS with random features, and then there was a plot that I didn't like so much when I gave a presentation.

Speaker E: 00:50:01

So people were actually giving me.

Speaker E: 00:50:03

Asking me questions about that that were not so clear.

Speaker E: 00:50:05

So then while we were.

Speaker E: 00:50:07

We went for lunch after my talk, and while we were in the queue, I just took my laptop and I just ran the code again to replicate that figure in a better way.

Speaker E: 00:50:16

And I was showing this beautiful function.

Speaker E: 00:50:18

So while we were queuing, you know, I showed this to Professor Hagen and he was very impressed that the code was running so fast, you know, so.

Speaker E: 00:50:25

And this was almost 10 years ago now, so.

Speaker B: 00:50:28

Yeah, yeah.

Speaker B: 00:50:30

And now we have even better MCMC samplers, we have better personal computers.

Speaker B: 00:50:36

So, yeah, I've definitely ran very big hierarchical GP models on my laptop in like running in 15 minutes in that PI.

Speaker B: 00:50:45

So definitely encourage people trying much more of that because, I mean, if you see all these huge LLMs which are running, imagine that you can run a much more efficient GP model on your computer for this paper, actually, do you remember the size of the data set to give people an idea?

Speaker E: 00:51:05

So, yeah, I mean, we can run our.

Speaker E: 00:51:08

You know, we were running MNIST.

Speaker E: 00:51:10

I think this was already 10 years ago almost.

Speaker E: 00:51:13

We're Running a NIST on a laptop for a couple of hours or something.

Speaker E: 00:51:17

I don't know.

Speaker B: 00:51:18

Okay, yeah, yeah.

Speaker B: 00:51:19

So it's millions of data sets.

Speaker E: 00:51:21

Sorry, no, that was.

Speaker E: 00:51:23

It's only 60,000.

Speaker E: 00:51:24

But we also run on this MNIST 8 million, which is 8 million MNIST images, you know, so we ran it on that, and again, you know, we could run it, you know, on a laptop.

Speaker E: 00:51:36

So.

Speaker B: 00:51:36

Yeah, yeah, okay.

Speaker B: 00:51:38

Yeah.

Speaker B: 00:51:38

With GP fuel.

Speaker E: 00:51:40

No, this was our implementation of these deep GPS with random features.

Speaker B: 00:51:45

Okay, but now is that available in GP Flow or.

Speaker E: 00:51:50

No, actually.

Speaker E: 00:51:51

So GPFlow focuses exclusively on these inducing points methods and not random features, at least as far as I know.

Speaker E: 00:51:57

Maybe they.

Speaker E: 00:51:58

I don't know if they've evolved that part, but as far as I know, that was not something there in their priorities.

Speaker E: 00:52:07

So.

Speaker B: 00:52:07

Yeah, okay.

Speaker B: 00:52:08

Okay.

Speaker B: 00:52:08

Yeah.

Speaker B: 00:52:09

Actually, can you.

Speaker B: 00:52:10

I don't think we made the clear distinction between inducing points and random features.

Speaker E: 00:52:15

So can you do that with inducing points?

Speaker E: 00:52:19

So you select a number of inputs that you can then optimize after if you want, and then introduce new random variables that allow you to express the full process as a function of only this small set of random variables with random features.

Speaker E: 00:52:38

Instead, you think of an expansion of your model as an infinite number of basis functions, and then you truncate this expansion to a fixed number.

Speaker E: 00:52:52

So for certain kernels, for example, the RBF kernel, these random features actually are random Fourier features.

Speaker E: 00:52:58

So you can express the GP just as a combination of a weighted combination of sine and cosine with different frequencies sampled appropriately.

Speaker E: 00:53:11

So in one way, in one case, you approximate in space.

Speaker E: 00:53:16

So this would be the inducing points method.

Speaker E: 00:53:18

And in the random feature, if you think about random Fourier features, you're doing some approximation of the spectrum of these processes, if that makes sense.

Speaker E: 00:53:29

One is in space, the other one is in frequency.

Speaker B: 00:53:33

Yeah, okay.

Speaker B: 00:53:34

Yeah, the random features sounds a lot like HSGP actually.

Speaker E: 00:53:39

Yeah, it probably has some connections.

Speaker E: 00:53:42

Yeah.

Speaker B: 00:53:42

Okay, interesting.

Speaker B: 00:53:44

Okay, that's cool.

Speaker B: 00:53:45

So if people want to use random features, this is going to be your implementation from the paper.

Speaker E: 00:53:51

Yeah, we have that.

Speaker E: 00:53:52

I mean, it's a bit old now, I think it was on TensorFlow, a very old version, so we should probably try to maintain it or maybe release a Python Pytorch version soon.

Speaker B: 00:54:03

Okay, yeah.

Speaker B: 00:54:04

And otherwise inducing points, then try that on GPFlow.

Speaker E: 00:54:08

Yes, but then of course, there is other approximations.

Speaker E: 00:54:10

So again, Andrew Wilson has done some work on this KISS gp, which is a way to do this scalable kernel approximations, which are with pretty powerful.

Speaker E: 00:54:23

So, yeah, there are different ways, you know, then.

Speaker B: 00:54:26

Yeah, yeah, exactly.

Speaker B: 00:54:27

And again, inducing policies in pmc HSGP is an awesome approximation too.

Speaker B: 00:54:33

You folks should give it a try if you want.

Speaker B: 00:54:35

That's in PMC also.

Speaker B: 00:54:36

So yeah, no, definitely a lot of, a lot of great options.

Speaker D: 00:54:40

So as for gps, let's turn to deep neural networks.

Speaker D: 00:54:45

Yeah, can you give us a lay.

Speaker B: 00:54:47

Of the land there?

Speaker E: 00:54:49

So, well, I mean one of the main things about deep learning is the possibility to do mini batch.

Speaker E: 00:54:55

So one of the great things about training a big neural network is that you just feed it small batches of data and then you keep updating the model using stochastic gradient optimization.

Speaker E: 00:55:05

So what is the problem with doing something like this for inference?

Speaker E: 00:55:11

So if we do a Bayesian neural network, we want to get a posterior over these parameters of the neural network.

Speaker E: 00:55:18

We want to sample from this.

Speaker E: 00:55:19

How do we do it?

Speaker E: 00:55:20

ually it turns out that since: 201314

Speaker E: 00:55:40

And so then there is a beautiful paper by Max Welling and EY T on stochastic gradient Langevan Dynamics sampler.

Speaker E: 00:55:50

I think it's a: 2014

Speaker E: 00:55:52

And then there is hybrid Monte Carlo, Hamiltonian Monte Carlo if you want version using stochastic gradient by Emily Fox and her group, which is also quite powerful.

Speaker E: 00:56:03

And there is some nice theory around this.

Speaker E: 00:56:05

Also we work a little bit on the theory as well in our group to try to understand a bit more about the properties.

Speaker E: 00:56:11

And essentially there is a way to show that you have, even if you're using stochastic gradients, which are not exact, these trajectories, somehow you dampen them with some friction.

Speaker E: 00:56:24

And then you can show that if you do things right, you can avoid having to compute the entire likelihood when you accept or reject.

Speaker E: 00:56:31

And therefore you can really be scalable.

Speaker E: 00:56:35

And we tried this with pretty big models.

Speaker E: 00:56:37

I mean, of course if we talk about LLMs, we're still very small.

Speaker E: 00:56:41

But we've done this with the convolutional neural networks with my students some time ago.

Speaker E: 00:56:45

We could sample easily models with a few tens of millions of parameters.

Speaker E: 00:56:49

We were doing convergence checks, R hat statistics, all the stuff that you need to do when you sample to make sure that you're really sampling from the posterior.

Speaker E: 00:57:03

Of course, because the parameter space is so the models are non identifiable.

Speaker E: 00:57:10

So we actually do the convergence checks on the predictions.

Speaker E: 00:57:15

So on some sort of projection of the parameters onto something that we can actually meaningfully understand.

Speaker E: 00:57:23

Because, you know, if you sample from multiple modes that represent the same kind of configurations, of course the Markov chains are very far away from each other when you do multiple chains, but actually you're sampling from the same configuration.

Speaker E: 00:57:37

Okay, yeah, so MCMC is possible.

Speaker E: 00:57:39

re was a paper by Alex Graves: 2011

Speaker E: 00:57:54

It doesn't work so well because people haven't spent enough time working on good priors.

Speaker E: 00:58:00

ddressed a bit in our work in: 2022

Speaker E: 00:58:11

But then we tested this mostly with MCMC rather than variational.

Speaker E: 00:58:15

And then a lot of people in the community are really excited about Laplace methods.

Speaker E: 00:58:19

So Gaussian approximations with looking at the Hessian and so on.

Speaker E: 00:58:24

But I think for deep learning, I don't know, this is maybe my outlier voice here in the community, but I don't think that's the right way of doing things because posteriors are not Gaussian at all and we're in lots of dimensions.

Speaker E: 00:58:38

There is a lot of redundancies in the parameter space so that this non identifiability creates ridges in the parameter space where the likelihood is the same.

Speaker E: 00:58:48

So I don't think that Gaussian approximation would do particularly well.

Speaker E: 00:58:53

But of course it's a very popular way of doing things and the community is really pushing that a lot.

Speaker E: 00:59:00

But I don't think that's the right way of doing things.

Speaker B: 00:59:03

Okay, okay.

Speaker B: 00:59:04

So what to you would be, would.

Speaker D: 00:59:07

Be the right way of doing things?

Speaker D: 00:59:09

Like let's say listeners want to try.

Speaker B: 00:59:12

Deep learning models right now, again, the.

Speaker D: 00:59:15

Bayesian version is not very easy, but.

Speaker B: 00:59:17

Let'S say they want to try deep learning models.

Speaker B: 00:59:21

What should they look at first?

Speaker B: 00:59:25

Which packages, which methods, which inference methods?

Speaker E: 00:59:29

I think the easiest thing, I mean, when I have students coming, maybe for a short project, you know, the first thing I tell them, you know, try Monte Carlo dropout.

Speaker E: 00:59:35

It's a very simple thing.

Speaker E: 00:59:37

I mean, I know that a lot of people would disagree with me, but it's a very practical way of doing things and there is connections with additional inference.

Speaker E: 00:59:44

So yeah, you retain some principle.

Speaker E: 00:59:46

Let's say, although the posterior now is very degenerate because you're just doing, switching off and on some weights.

Speaker E: 00:59:54

But It's a very intuitive way of doing things, very practical.

Speaker E: 00:59:59

You can take pre trained models and just introduce some dropout at test time, maybe a fine tune first with dropout and training time and then do it at test time.

Speaker E: 01:00:11

It's a beautiful idea.

Speaker E: 01:00:12

It's very simple.

Speaker E: 01:00:13

I think it's a perfectly valid way to start, at least to get some uncertainties and then, you know, what do you do with that?

Speaker E: 01:00:21

Of course, depends on the problem you have, but I think it's a good starting point.

Speaker E: 01:00:28

Otherwise I think variational has a good potential if you make the class of posterior distributions quite flexible.

Speaker E: 01:00:38

And now we're seeing these diffusion models or other powerful generative models being used for variational inference.

Speaker E: 01:00:46

I mean this was the way sort of normalizing flows were proposed for variational inference by Rezend and Mohammed and then you know, it was ported to just density estimation.

Speaker E: 01:00:58

And now, you know, we have diffusion models that do a wonderful job at diffusion at density estimation and now people are starting to use them for posterior sampling.

Speaker E: 01:01:07

And so I think, you know, having sort of these approxim, this flexible posteriors could be like a good way forward.

Speaker E: 01:01:15

And I think we're going to see more and more of that.

Speaker E: 01:01:18

Because if your variational, the class of distributions that you can represent with your variational, with your model is very large, you can really make the bound very tight.

Speaker E: 01:01:30

So the variational really is going to give you the true marginal likelihood.

Speaker E: 01:01:36

So eventually I think it would be nice to go in that direction.

Speaker E: 01:01:38

Of course, for these huge models it's very challenging.

Speaker E: 01:01:40

But yeah, there is a lot of great work now that people are doing on partial stochasticity.

Speaker E: 01:01:46

So you may not need to be stochastic about the entire network, but just a few parameters in your model.

Speaker B: 01:01:54

Yeah.

Speaker B: 01:01:54

Okay.

Speaker B: 01:01:55

And to do that, what's a great first bet, Mike, is are all of these methods available in PyTorch or TensorFlow so that people can come up with their neural network model and use these inference engines?

Speaker B: 01:02:11

Or is this too much of a frontier method so far?

Speaker E: 01:02:15

So I think for the Monte Carlo dropout is really almost.

Speaker E: 01:02:19

You don't need any skill to do it.

Speaker E: 01:02:21

You know, you just take a model that is already there code or you know, you just switch on and off, actually you switch on drop out layers at training and test time.

Speaker E: 01:02:31

That's it.

Speaker E: 01:02:32

And for variational, I think in terms of implementations, I think, I think Pyro has maybe a lot of these things already sort of embedded in the way they do things.

Speaker E: 01:02:43

I've never used it myself.

Speaker E: 01:02:44

I Mean, we tend to develop a lot of code ourselves because we have to break stuff and to try stuff.

Speaker E: 01:02:51

So we try to have code that we have under control ourselves.

Speaker E: 01:02:56

So that's why I tend not to use too many packages myself.

Speaker E: 01:03:01

But I guess Pyro has maybe a lot of things already sort of implemented for doing this.

Speaker E: 01:03:05

Yeah.

Speaker B: 01:03:06

Okay.

Speaker B: 01:03:07

Okay.

Speaker B: 01:03:07

So I'll put a link to Pytorch and Pyro documentation in the show notes for this episode and then people can give it a try.

Speaker B: 01:03:21

But it's great that.

Speaker B: 01:03:23

Yeah, from what you're saying, it sounds like it's pretty easy to implement for practitioners and to try these methods out.

Speaker E: 01:03:31

I think these days, I mean, when I teach my class, what I do, I say take an MNIST tutorial for deep learning and just, just turn it into variational.

Speaker E: 01:03:40

You know, what you need to do is to add a few extra variables and, you know, it's a good exercise.

Speaker E: 01:03:45

People can usually do it relatively easily and.

Speaker B: 01:03:49

Yeah, and do you have actually some.

Speaker B: 01:03:53

Like, are your public, Are your courses public?

Speaker B: 01:03:56

Can, can we put something in the show notes that people can study what you're teaching?

Speaker B: 01:04:00

Actually over there at, at Kaust, maybe you have the course, the exercises?

Speaker E: 01:04:06

Yeah, we record everything, but we keep it private for now.

Speaker E: 01:04:11

I don't think we can open it easily.

Speaker E: 01:04:13

Also, I record.

Speaker E: 01:04:14

I mean, it's been 10 years now that I've been recording my courses, even when I was in France before, when I was in Glasgow.

Speaker E: 01:04:21

But, yeah, they remain within, let's say, the usage for the students.

Speaker E: 01:04:28

I think I put in the notes a link to a tutorial we gave on Gaussian processes.

Speaker E: 01:04:36

I think there is another tutorial that I should probably also include there.

Speaker E: 01:04:40

I'm not sure I put the link to that that we gave at IJCAI on Bayesian deep learning.

Speaker E: 01:04:49

So we did a couple of tutorials, one on Gaussian processes, another one on Bayesian deep learning.

Speaker B: 01:04:54

Yeah, let's definitely add that.

Speaker B: 01:04:57

And I will add my own tutorial about GPS that I taught at PI Data New York last year.

Speaker E: 01:05:03

Awesome.

Speaker B: 01:05:03

I did that with Chris Von Speck.

Speaker B: 01:05:05

He went into the different methods, the different algorithms that you can use actually to feed GPS mainly in Prime C, so in the Bayesian framework, so vanilla GPS inducing points and HSGPs.

Speaker B: 01:05:20

And the last half of the tutorial was myself going through an example tutorial for people trying to infer player performance in soccer with GPS on three different timescale, the days, the month and the years, and pulling the GPS hierarchically across players.

Speaker B: 01:05:45

So while sharing the coherent structure.

Speaker B: 01:05:48

So it's a pretty advanced use case and you'll see it fits very fast on the laptop folks.

Speaker B: 01:05:54

So yeah, I'll put the, the link to the, to the GitHub repo and you have the link to the YouTube video in like at the beginning of the GitHub repo.

Speaker B: 01:06:04

So yeah, like let's, let's put that in there.

Speaker B: 01:06:06

Mauricio.

Speaker B: 01:06:07

I think it's going to be super, super interesting for people.

Speaker B: 01:06:10

And something I'm also curious about is like how how do Bayesian ideas integrate into modern deep learning?

Speaker B: 01:06:20

You know, especially in terms of uncertainty quantification.

Speaker B: 01:06:23

You talked a bit about that earlier.

Speaker B: 01:06:26

We were like, it's actually a good question right now.

Speaker B: 01:06:28

Where does base fit into that new AI and especially gen landscape?

Speaker B: 01:06:33

I'm curious to hear your thoughts about that.

Speaker E: 01:06:35

Yeah, so I think one of the practical sides, I think many times we tried to do this to start thinking about how do we put a prior over the parameters and quickly realized that it's very difficult to do because of this composition and everything.

Speaker E: 01:06:56

So it makes a lot of sense to think about priors over functions that you can represent with your model.

Speaker E: 01:07:02

So this is also something that Vincent talked about because he worked on this, we worked on this also in parallel.

Speaker E: 01:07:10

And so the idea is really that if we start thinking in that direction, then I think it's much more powerful to think about the kind of functions you can represent.

Speaker E: 01:07:21

And I think it goes a lot in the spirit of the things that we were discussing at the beginning of what kind of complexity would you allow for your functions?

Speaker E: 01:07:31

So are you happy with functions that have certain degree of complexity?

Speaker E: 01:07:36

And this idea of complexity is very profound because complexity is not just number of parameters, complexity is more about simplicity.

Speaker E: 01:07:44

And Kolmogorov complexity in a way tells you a lot about that.

Speaker E: 01:07:48

And here at Kaust, I'm interacting a lot with the professor Schmid Huber, who is here, is one of the greatest minds in AI and he's been thinking about this stuff for a long time.

Speaker E: 01:07:59

And whenever I get coffee with him I get a lecture on Kolmogorov complexity.

Speaker E: 01:08:03

And so I've been thinking about this a lot myself now, and also again, Andrew Wilson has done some work on that, talking about these type of things.

Speaker E: 01:08:14

And I think in the end we were making progress in a way in understanding how much stochasticity we need in the networks to be able to represent at least any distributions we want.

Speaker E: 01:08:27

But then we have to disentangle that in a way from the Complexity of the functions that we can represent.

Speaker E: 01:08:32

So there is these two aspects, I think, complexity of the functions and how crazy you want the uncertainties to be or the distributions that you can represent a priori before you're looking at any data.

Speaker E: 01:08:43

And this is how you design a model, right?

Speaker E: 01:08:46

And so there's this work by the work the group at Oxford, Tom Rainforth and Eric Nalisic, who did this work on partial stochasticity, which I think is very fundamental because it really gives you a practical way to say how many neurons in your neural network should you pick to be stochastic and how many you can just optimize.

Speaker E: 01:09:08

And this gives you already guiding principle on how to think about these Bayesian neural networks in the future.

Speaker E: 01:09:15

I think they're not very excited about this work.

Speaker E: 01:09:17

When I talked with Tom, I saw him a couple of weeks ago in Denmark at a workshop and also at Dijkstu last year.

Speaker E: 01:09:24

I was telling him like, Tom, this is great.

Speaker E: 01:09:27

This is one of the best things that happened in our community in a long time.

Speaker E: 01:09:30

And he was like, oh, come on, I don't think that this is so great.

Speaker E: 01:09:34

He was downplaying a lot, this contribution, which I think instead is very important because imagine now if you can do MCMC on a much smaller dimensional space and still achieve the same representation power of a full blown stochastic neural network with millions and millions of parameters.

Speaker E: 01:09:56

And instead maybe if your output is only 10 dimensional, you can get away with 10 neurons being stochastic.

Speaker E: 01:10:03

This is very powerful, you know.

Speaker E: 01:10:05

And so I got excited about this stuff and I started working on crazy things like Gans that nobody looks at anymore because guns are now this generative other side of networks are out of fashion.

Speaker E: 01:10:15

But actually they're based on neural networks themselves, you know, and they are partially stochastic.

Speaker E: 01:10:21

So it fits perfectly in the narrative of the kind of things I was looking at.

Speaker E: 01:10:25

And I got sucked into this.

Speaker E: 01:10:26

And it's been a pain because optimizing these models is extremely difficult.

Speaker E: 01:10:31

But at least now we have an understanding of this in a Bayesian way.

Speaker E: 01:10:35

And it's very nice because we can view not only now Gans, but pretty much any generative models where you take a set of random variables and you have a complicated neural network mapping it into something complicated, like complicated P of X.

Speaker E: 01:10:49

And this is the mother of all problems.

Speaker E: 01:10:51

If you estimate P of X, you solve any problems you want.

Speaker E: 01:10:54

And X can be, you know, if you have a supervised learning problem, it can be labels and inputs.

Speaker E: 01:11:01

If you're doing Unsupervised learning.

Speaker E: 01:11:02

It's just your inputs, you know, so if you can do this well, you can do a lot of things.

Speaker E: 01:11:07

And so this forces us to think a lot about regularization, model complexity and all these things.

Speaker E: 01:11:13

And I'm really excited about this and this is really what we're working on at the moment with my group.

Speaker B: 01:11:18

Yeah, this is fascinating and I agree, GANS are amazing.

Speaker D: 01:11:22

I mean this is, this is fascinating.

Speaker D: 01:11:23

I really love the, I mean it's a generative model, so of course I love it.

Speaker D: 01:11:27

But I really love this idea of.

Speaker B: 01:11:29

You know, having two networks competing against each other.

Speaker B: 01:11:33

This super, super interesting.

Speaker B: 01:11:34

And can help you in cases of rare, like, of sparse data actually.

Speaker B: 01:11:40

So it can be extremely, extremely powerful.

Speaker B: 01:11:43

And I see that you, you put a video tutorial of.

Speaker B: 01:11:48

About gans.

Speaker B: 01:11:49

Precisely.

Speaker B: 01:11:49

And how they secretly.

Speaker E: 01:11:50

Yeah, so.

Speaker E: 01:11:51

So yeah, I was invited to give a presentation on this.

Speaker E: 01:11:54

Yeah.

Speaker B: 01:11:54

So yeah, yeah.

Speaker D: 01:11:56

Well definitely check that out and encourage.

Speaker B: 01:11:59

People to do that.

Speaker B: 01:12:00

Thanks.

Speaker D: 01:12:01

I see you, you put indeed a lot of lectures already in the show notes.

Speaker B: 01:12:04

That's fantastic for myself and for listeners.

Speaker B: 01:12:08

It's going to be a great episode for show notes also folks.

Speaker B: 01:12:11

So definitely check them out and.

Speaker B: 01:12:15

Well, I'm going to start playing this out, Mauricio, because I could, I could.

Speaker D: 01:12:19

Keep talking with you for a long time because I'm really passionate about these topics and we work on very similar kind of models, so that's awesome.

Speaker D: 01:12:29

But I need to respect your bedtime.

Speaker D: 01:12:32

It's already late for you.

Speaker B: 01:12:35

I'm curious, more generally, in the context of the current gen AI developments, where do you see Gaussian processes in Bayesian deep learning heading in the next few years?

Speaker B: 01:12:50

And what advice do you give your young students, researchers, practitioners who want to dive deeper into Bayesian deep learning or deep learning in general?

Speaker E: 01:13:04

Yeah, making predictions about what's going to happen is very difficult.

Speaker E: 01:13:07

But I mean, I think a lot of this amortization through foundation models is happening really fast and I think we're not realizing how fast this is going.

Speaker E: 01:13:17

And so now.

Speaker B: 01:13:19

So mean.

Speaker B: 01:13:20

Amortize Bayesian inference, for instance.

Speaker E: 01:13:22

Yes.

Speaker E: 01:13:22

Amortize everything, you know, predictions, inference, everything.

Speaker E: 01:13:26

And through these big models that have learned from other data and so on.

Speaker E: 01:13:31

It's a very powerful idea.

Speaker E: 01:13:32

You know, you learn from lots of data sets and then you get a new data set.

Speaker E: 01:13:36

You know what to do.

Speaker E: 01:13:37

Right.

Speaker E: 01:13:38

In a way that's, that makes a lot of sense in terms of gps.

Speaker E: 01:13:41

I think they still play a pretty powerful role in.

Speaker E: 01:13:45

I think there was a paper not long ago showing that GPS actually for Bayesian optimization still perform pretty well compared to Bayesian neural networks of all kinds.

Speaker E: 01:13:56

So they still have a place there for Bayesian optimization, experimental design, incremental adaptive experimental design, and also for these computer models, calibration computer models.

Speaker E: 01:14:11

is paper by Kennedy O', Hagan: 2001

Speaker E: 01:14:20

I think there's still a lot of design choices you can make about the GPS that somehow allow you to model, emulate the code with uncertainty that is meaningful.

Speaker E: 01:14:31

I think o' Hagan has done tremendous amount of work on eliciting priors for these computer models.

Speaker E: 01:14:37

And you know, this is still very, very powerful and relevant, I think.

Speaker E: 01:14:43

And this is going to stay for some time.

Speaker E: 01:14:45

I think GP is for spatial temporal models also thanks to people like Howard here, Kaust.

Speaker E: 01:14:52

I mean, they're going to stay for a long time.

Speaker E: 01:14:54

remember when I met hover in: 2012

Speaker E: 01:15:05

And so he invited me for a keynote and one of these latent Gaussian models workshops and they had 120 seats, I still remember.

Speaker E: 01:15:12

And he sold out in like an hour, you know, that was like a rock concert, you know.

Speaker E: 01:15:19

And everybody wants to use this because so many people have problems in spatial temporal that involves some spatial temporal data and they want to do it fast and they want to try stuff out, they want to change models, they want to change assumptions.

Speaker E: 01:15:33

And the only way to do this fast is to have something that does the inference fast and accurately.

Speaker E: 01:15:37

And what they developed is just tailored for that and works brilliantly, you know.

Speaker E: 01:15:42

So just to say that I think for these type of data, I think it's going to be pretty hard to beat Gaussian Markov random fields.

Speaker E: 01:15:50

We tried a bit with neural networks to do things to make the models more flexible, more non stationary.

Speaker E: 01:15:56

We've done some work on this.

Speaker E: 01:15:57

But you know, still, I think the advantage of having such something so fast and so plug and play, really.

Speaker E: 01:16:03

I mean, you can just plug your data in, make a few assumptions, you know, about what you want, and then you just get the result.

Speaker E: 01:16:09

That's very powerful.

Speaker B: 01:16:11

Yeah, yeah, no, for sure.

Speaker D: 01:16:12

And I'll put again into the show.

Speaker B: 01:16:14

Notes an episode I did with Marvin Schmidt about amortized page and inference and the work they do on the base flow package.

Speaker B: 01:16:23

If you Maurizio have some links also you want to add on, amortize anything especially practical Python packages people can use.

Speaker B: 01:16:33

Definitely, definitely add that, please.

Speaker E: 01:16:35

And in terms of the future of Bayesian deep learning, I think that's a much bigger question.

Speaker E: 01:16:42

I think as a community, what we're trying to identify, I mean there have been some nice works and also Vincent was mentioning some nice works on various applications in healthcare, self driving cars.

Speaker E: 01:16:53

But I think we're still missing the kind of application that goes in the news.

Speaker E: 01:16:57

Something like, like a killer application, something that, you know, AlphaGo type thing, you know, where people are going to talk about it and BBC News or something like that, you know, something that is going to convince everybody that and ourselves perhaps that what we're doing is actually very meaningful.

Speaker E: 01:17:15

I think we rely a lot on other type of applications like computer vision problems because people work a lot on these or now LLMs have become popular.

Speaker E: 01:17:24

So some of my friends, colleagues are actually showing that you can do also LLMs a bit Bayesian with some Bayesian low rank Laplace for example, you know, and I think, yeah, so we're testing ourselves in these grounds.

Speaker E: 01:17:39

But actually ultimately uncertainties is what matters for decision making.

Speaker E: 01:17:44

And so I think ultimately this is, I think the kind of ground where we have to compete and compete and try to evaluate ourselves and how well we can do with this.

Speaker E: 01:17:57

And this is really also the difference between everybody talks about AI, but AI really is thinking about an agent that interacts with an environment, senses reasons and then acts on the environment.

Speaker E: 01:18:09

And machine learning is the reasoning and then all this pipeline is AI.

Speaker E: 01:18:13

And at the moment there is no real AI.

Speaker E: 01:18:16

You know, like AI would mean that we have an agent that actually interacts with this and intervenes on the environment, you know, So I think I tried to talk about this a lot with my students when I give lectures about machine learning, statistics, AI, what is everything, you know.

Speaker E: 01:18:35

going on in the back then in: 1720

Speaker E: 01:19:05

So anyway, it will be material for another episode.

Speaker D: 01:19:08

Yeah, yeah, yeah, no, for sure, for sure.

Speaker B: 01:19:12

And so do you have, before I ask you the last two questions, do you have any advice you give to people who want to start working in that field, whether they Are students researchers, practitioners?

Speaker E: 01:19:28

Yeah, you probably have heard this a lot.

Speaker E: 01:19:30

But I think working on the foundations is very important.

Speaker E: 01:19:33

A deep understanding of the foundations is always what gives you unparalleled advantage because you really can think in a very profound way about certain problems and what kind of problems we want to solve at a larger scale.

Speaker E: 01:19:51

You know, many times it boils down to, you know, have a deeper understanding of the fundamentals.

Speaker E: 01:19:56

And many times for me, I find it very useful to go back to linear models, you know, whenever we develop new theory, new algorithms, new methods.

Speaker E: 01:20:05

So we try to get some good grounding on linear models.

Speaker E: 01:20:09

So what does it mean for a linear model to have this?

Speaker E: 01:20:12

So we were studying now recently singular learning theory to try to explain some scaling laws for uncertainty.

Speaker E: 01:20:19

People ask me all the time, I have so much data, why do I need to be Bayesian?

Speaker E: 01:20:23

And now I can tell you we did work on the scaling laws.

Speaker E: 01:20:25

We know when epistemic uncertainties become small, as number of data increases.

Speaker E: 01:20:32

And now for ResNet18, I can tell you that you need 10 billion images before this uncertainty becomes on the second digit of your probabilities, something like that, just to give practical advice to people to understand these things.

Speaker E: 01:20:50

So we were trying to study a theory behind this and we think that singular learning theory can give us some intuition about this.

Speaker E: 01:20:57

And so Watanabe has done a lot of work on this and nice book.

Speaker E: 01:21:01

And so we were looking at this quantity generalization error, which was very mysterious.

Speaker E: 01:21:06

And so we sort of derived it for linear models and we understood what it means for real.

Speaker E: 01:21:12

So many times this grounding on something that is tractable is really important, I think.

Speaker D: 01:21:19

Yeah, yeah, completely agree.

Speaker B: 01:21:21

This is fascinating.

Speaker D: 01:21:23

Yeah, we need to have you back on the show at some point, Maurizio, to talk about these other topics because otherwise it's going to be a three.

Speaker B: 01:21:29

Hour episode and I'd be fine with that, but have a plane to catch and you have a bed to be in.

Speaker D: 01:21:39

So let's play us out and.

Speaker D: 01:21:42

Well, for the last two questions, if you have listened to the show and you know them, first one is, if you had unlimited time and resources, which.

Speaker B: 01:21:51

Problem would you try to solve?

Speaker E: 01:21:53

Awesome question.

Speaker E: 01:21:53

I love this question and I would say nutrition is a, is a huge interesting problem for me that, you know, we see the statistics about the number of people with diabetes worldwide and it's insane.

Speaker E: 01:22:09

Like we're talking hundreds of millions just in the US or we're talking, you know, even in India, I think it's 200 million people with adults with diabetes.

Speaker E: 01:22:21

This is serious stuff.

Speaker E: 01:22:23

And, and I think now we have the tools to understand all this.

Speaker E: 01:22:26

I mean, the food industry has done this experiment on all of us, right?

Speaker E: 01:22:32

And now we see the effect.

Speaker E: 01:22:33

So I think it's possible to draw some conclusions about all these things and understanding optimal health based on what we eat.

Speaker E: 01:22:46

And I think there are some people doing this.

Speaker E: 01:22:49

There is a famous guy that is spending millions on this.

Speaker E: 01:22:53

And I think I would probably spend time and energy on this if I had unlimited resources, because it would need a lot of resources to go against the common wisdom against food industry government regulations and so on.

Speaker E: 01:23:12

But I think there's something definitely we can optimize and now we have more and more tools to measure something about ourselves.

Speaker E: 01:23:18

So.

Speaker D: 01:23:19

Yeah, yeah, no, completely agree.

Speaker B: 01:23:21

And I think it's also related to these incredible ability.

Speaker B: 01:23:25

I mean weakness we have as humans, which is like our ability to entertain.

Speaker D: 01:23:29

Us to death, which is definitely not.

Speaker B: 01:23:33

One of our best instincts on that.

Speaker B: 01:23:35

I have actually at least two episodes to recommend.

Speaker B: 01:23:38

The latest one is the one just before yours, actually, maurizio.

Speaker B: 01:23:43

It's episode 143 with Christoph Bamberg and he does research exactly that about appetite, how it's related to cognitive processes and how it's related to self esteem and things like that.

Speaker B: 01:23:58

And second episode that is in the show notes for episode 143, which is the one I recorded with Eric Trexler, who is much more focused on weight management and exercise and how that relates to appetite and the environment that you're in is extremely important basically, to put it shortly.

Speaker B: 01:24:22

So second question, Maurizio.

Speaker B: 01:24:24

If you could have dinner with any great scientific mind, dead, alive or fictional, who would it be?

Speaker E: 01:24:30

Another great question.

Speaker E: 01:24:31

I think it's very easy to overthink this.

Speaker E: 01:24:34

As an Italian, I would say Leonardo da Vinci has been one of the greatest scientists, artist, architect, philosopher.

Speaker E: 01:24:43

It was just.

Speaker E: 01:24:46

And so much ahead of his time.

Speaker E: 01:24:47

And I think whenever you interact with these people that are so much ahead of their time, you really see it something new.

Speaker E: 01:24:57

You really get so much inspiration.

Speaker E: 01:24:59

It happened to me a few times.

Speaker E: 01:25:02

One last one of the latest ones was when I interviewed here at Kaust.

Speaker E: 01:25:06

I had a three hours dinner with Jurgen Schmidt Huber.

Speaker E: 01:25:09

And I can tell you that was there was something, an experience that I will never forget.

Speaker E: 01:25:14

And it was a great three hours of talking about wonderful things and being challenged about thinking about things that I've never thought about, you know, and it was, this is the kind of things, I think as scientists we need, you know, be challenged and get out of the comfort zone.

Speaker E: 01:25:31

And I like doing that a lot, getting out of the comfort zone, you know.

Speaker B: 01:25:35

So, yeah, yeah.

Speaker B: 01:25:36

I mean, I can tell it from your work for sure.

Speaker B: 01:25:39

And I think that's something, yeah.

Speaker B: 01:25:42

A lot of researchers have in common, for sure.

Speaker B: 01:25:44

Because you have to be comfortable being.

Speaker D: 01:25:48

Uncomfortable because you're always at the frontier.

Speaker B: 01:25:50

And so by definition, you don't know the answers.

Speaker B: 01:25:53

You don't even know if you'll get there.

Speaker D: 01:25:55

So this is definitely something that's hard any type of research you do.

Speaker D: 01:26:01

And definitely that's very awesome to have people like you in these kind of jobs because, well, you help us advancing in all the, the domains you're touching.

Speaker B: 01:26:13

Maurizio.

Speaker D: 01:26:14

So thank you so much and thank.

Speaker B: 01:26:17

You so much for being on this show.

Speaker B: 01:26:19

I think it was, it was a great one.

Speaker B: 01:26:21

It's time to wrap up now, but we'll have you on the show next time you have a fun paper or code or package to share with us.

Speaker B: 01:26:30

Thanks again to Hans.

Speaker B: 01:26:33

I think it's Hans Moncho.

Speaker B: 01:26:34

I may be butchering your name.

Speaker B: 01:26:36

Sorry about that.

Speaker D: 01:26:37

But yeah, thank you so much for putting us in contact.

Speaker B: 01:26:42

And Mauricio, thank you so much for.

Speaker D: 01:26:44

Taking the time and being on this show.

Speaker E: 01:26:46

Well, thank you so much.

Speaker E: 01:26:47

It's been a huge pleasure and yeah, I hope this has been interesting for your audience and for you and happy to be back on the show whenever you want.

Speaker E: 01:26:56

You're doing a great service and thank you so much for that.

Speaker D: 01:27:00

Yeah, definitely was super fun and thank you for your kind words.

Speaker D: 01:27:05

Definitely appreciate it.

Speaker C: 01:27:09

This has been another episode of Learning Bayesian Statistics.

Speaker C: 01:27:13

Be sure to rate, review and follow the show on your favorite podcaster.

Speaker C: 01:27:18

And visit learnbasedstats.com for more resources about today's topics as well as access to more episodes to help you reach true Bayesian state of mind.

Speaker B: 01:27:28

That's learn based stats.

Speaker C: 01:27:30

Our theme music is Good Bajan by Baba Brinkman, fit MC Lars and Megaran.

Speaker C: 01:27:35

Check out his awesome work@BabaBrinkman.com I'm your host Alexandora.

Speaker C: 01:27:40

You can follow me on Twitter lexandora.

Speaker C: 01:27:43

Like the country, you can support the show and unlock exclusive Benefits by visiting patreon.com learnbased dance thank you so much for listening and for your support.

Speaker B: 01:27:54

You're truly a good Bayesian.

Speaker A: 01:27:56

Change your predictions after ticking it information in.

Speaker A: 01:28:00

And if you're thinking I'll be less than amazing, let's adjust those expectations.

Speaker A: 01:28:06

Let me show you how to be a good daisy.

Speaker A: 01:28:09

Change calculations after taking fresh data in those predictions that your brain is making, let's get them on a solid foundation.

More Episodes

144. #144 Why is Bayesian Deep Learning so Powerful, with Maurizio Filippone