Proudly sponsored by PyMC Labs, the Bayesian Consultancy. Book a call, or get in touch!
Our theme music is « Good Bayesian », by Baba Brinkman (feat MC Lars and Mega Ran). Check out his awesome work!
Visit our Patreon page to unlock exclusive Bayesian swag ;)
Chapters:
10:11 Understanding Structural Equation Modeling (SEM) and Confirmatory Factor Analysis (CFA)
20:11 Application of SEM and CFA in HR Analytics
30:10 Challenges and Advantages of Bayesian Approaches in SEM and CFA
33:58 Evaluating Bayesian Models
39:50 Challenges in Model Building
44:15 Causal Relationships in SEM and CFA
49:01 Practical Applications of SEM and CFA
51:47 Influence of Philosophy on Data Science
54:51 Designing Models with Confounding in Mind
57:39 Future Trends in Causal Inference
01:00:03 Advice for Aspiring Data Scientists
01:02:48 Future Research Directions
Thank you to my Patrons for making this episode possible!
Yusuke Saito, Avi Bryant, Ero Carrera, Giuliano Cruz, Tim Gasser, James Wade, Tradd Salvo, William Benton, James Ahloy, Robin Taylor, Chad Scherrer, Zwelithini Tunyiswa, Bertrand Wilden, James Thompson, Stephen Oates, Gian Luca Di Tanna, Jack Wells, Matthew Maldonado, Ian Costley, Ally Salim, Larry Gill, Ian Moran, Paul Oreto, Colin Caprani, Colin Carroll, Nathaniel Burbank, Michael Osthege, Rémi Louf, Clive Edelsten, Henri Wallen, Hugo Botha, Vinh Nguyen, Marcin Elantkowski, Adam C. Smith, Will Kurt, Andrew Moskowitz, Hector Munoz, Marco Gorelli, Simon Kessell, Bradley Rode, Patrick Kelley, Rick Anderson, Casper de Bruin, Philippe Labonde, Michael Hankin, Cameron Smith, Tomáš Frýda, Ryan Wesslen, Andreas Netti, Riley King, Yoshiyuki Hamajima, Sven De Maeyer, Michael DeCrescenzo, Fergal M, Mason Yahr, Naoya Kanai, Steven Rowland, Aubrey Clayton, Jeannine Sue, Omri Har Shemesh, Scott Anthony Robson, Robert Yolken, Or Duek, Pavel Dusek, Paul Cox, Andreas Kröpelin, Raphaël R, Nicolas Rode, Gabriel Stechschulte, Arkady, Kurt TeKolste, Gergely Juhasz, Marcus Nölke, Maggi Mackintosh, Grant Pezzolesi, Avram Aelony, Joshua Meehl, Javier Sabio, Kristian Higgins, Alex Jones, Gregorio Aguilar, Matt Rosinski, Bart Trudeau, Luis Fonseca, Dante Gates, Matt Niccolls, Maksim Kuznecov, Michael Thomas, Luke Gorrie, Cory Kiser, Julio, Edvin Saveljev, Frederick Ayala, Jeffrey Powell, Gal Kampel, Adan Romero, Will Geary, Blake Walters, Jonathan Morgan, Francesco Madrisotti, Ivy Huang, Gary Clarke, Robert Flannery, Rasmus Hindström and Stefan.
Links from the show:
Transcript
This is an automatic transcript and may therefore contain errors. Please get in touch if you're willing to correct them.
Today I am thrilled to host Nathaniel Ford, a staff data scientist at Personio, where he works on people analytics for one of the leading HR intelligence platforms.

With more than a decade of experience spanning insurance, gaming and e-commerce, Nathaniel brings a wealth of knowledge to the table. He's also an active contributor to the PyMC ecosystem, with a particular focus on causal inference.

In this episode, Nathaniel takes us on a deep dive into the world of structural equation modeling, or SEM, and confirmatory factor analysis, or CFA. We explore the advantages of Bayesian approaches in handling these models, from greater flexibility to enhanced model validation through sensitivity analysis.

Whether you're curious about fitting complex models to real-world data, the intersection of SEM and causal inference, or the growing role of Bayesian methods in industry applications, this conversation offers insights for both beginners and seasoned practitioners.

This is Learning Bayesian Statistics, episode 121, recorded October 11, 2024.
Welcome to Learning Bayesian Statistics, a podcast about Bayesian inference, the methods, the projects, and the people who make it possible. I'm your host, Alex Andorra. You can follow me on Twitter at alex_andorra, like the country. For any info about the show, learnbayesstats.com is Laplace to be. Show notes, becoming a corporate sponsor, unlocking Bayesian merch, supporting the show on Patreon, everything is in there. That's learnbayesstats.com. If you're interested in one-on-one mentorship, online courses, or statistical consulting, feel free to reach out and book a call at topmate.io/alex_andorra. See you around, folks.
And best Bayesian wishes to you all. And if today's discussion sparked ideas for your business, well, our team at PyMC Labs can help bring them to life. Check us out at pymc-labs.com.

And before we start the show, I wanted to particularly welcome our new members in our small Bayesian family on Patreon. Thank you so much to the mysterious and drilled 9830 and to Alex for joining at the Full Posterior tier or higher on Patreon. I hope that you will enjoy your merch. And Alex, I want to tell you that not only do I love your first name, of course, but you also joined on November 13, which is my birthday, so thank you so much for the birthday present, Alex. Can't wait to see you guys in the Slack channel.

And now, let's hear from Nathaniel.
Nathaniel Ford, welcome to Learning Bayesian Statistics.

Thanks for having me, Alex.

Yeah, thanks for taking the time. It's always a pleasure to have you on the show, even though it's the first time. I mean, on the main format; you've already been on the show, actually, to do a modeling webinar. I think it's the first time that happens. So you first appeared doing a great modeling webinar about Bayesian non-parametric causal inference. You demoed how to do that with PyMC, and I think there is even some BART model in there. So it's a very in-depth tutorial that I definitely recommend people interested in to check out. I will add it to the show notes, of course, as well as the video of the webinar, because... yeah, I encourage people to follow along with the video, because you're demoing that live with the tutorial and also answering some of the live questions.

And you're a specialist of these in-depth guides and tutorials, mainly about causal inference, because it's one of your favorite topics. And that's also why I thought it was interesting to have you on the show today, because you have a new tutorial out on the PyMC website. I was thinking, okay, now it's time to have you on the show and also talk about what you do and your background. So thanks for taking the time, and well, let's start as we always do. Can you tell us what you're doing nowadays and how you ended up working on this?
Yeah, certainly. So I'm a data scientist at Personio at the moment. It's a sort of HR intelligence platform, like a kind of one-stop shop for all your HR workflows and insights. I've been working there for about three years, and I've been working in data science or data-science-adjacent roles in industry for about 10 years. So yeah, I think I got a little bit lucky with my first job in industry, which kind of set me up well to move into data science more generally, which I can speak to as well.

So I graduated from university into the teeth of the global financial crisis, so I took the first job I could get. But it was fortunate, because I got a job with Marsh McLennan, which is a huge reinsurance and insurance kind of company. They've got Oliver Wyman, Guy Carpenter, and Marsh underneath them, and they were spinning up an innovation center in Dublin. My first job was working for Marsh on their new data quality function, which was kind of a neat first job for anyone who was data curious, because the flow of that job was: every quarter you'd get a new data set from Marsh, and you'd have to try and evaluate it for data quality, missing data, poor data entry, this kind of thing. It basically highlighted the risks of poor data quality for all the use cases that Marsh had in the company. So it was a good first training ground for a budding data scientist, we'll say.
Yeah, for sure. And how did you end up then in the... I can see causal inference coming in here, but how did Bayes come up? Do you remember when you were first introduced to Bayesian stats?

Yeah, I was thinking about this the other day. So I think I had to be introduced to Bayesian stats twice, and the first time it didn't really take for me. Like, in university I did philosophy and kind of went into mathematical logic after that. And as a sort of tangent from studying mathematical logic, I was working on different logics of dependence, like justification logics and dependence structures and independence structures. And I came across Judea Pearl's work on independence structures and the relationships between probabilistic independence and graphs, like directed acyclic graphs. That sort of backed me into thinking about conditional probabilities and Bayesian probabilities; I even worked on trying to replicate Pearl's completeness proof of the relationships between probabilistic algebra and the independence relationships on those graphs. So from a highly theoretical perspective, I came across Bayes and it didn't stick. It didn't resonate with me. I worked through some of the mathematics, but I didn't get it, I don't think.

And it really took exposure in industry. In my second job, I think... so I worked in Guy Carpenter for about a year, and it was kind of a nice role because I worked with these catastrophic risk modelers. They were building risk models for portfolios of property that were at risk from natural disasters like earthquakes or floods or fires, and they would try and simulate, with basically Monte Carlo kinds of modeling, to estimate the risk to each portfolio of properties. And seeing that application was my first exposure to this as a practical tool that can be very useful and very informative for making massively important decisions about the nature of disaster, catastrophic risk. Like, countries were buying insurance contracts from, say, Guy Carpenter; Mexico would insure itself against earthquake risk based on simulation data. So that was the second time, and it really stuck with me that this was a practical tool. And it connected the dots in my head a little bit between the theoretical background and the practicalities. And then I just started looking into where I should learn this, and found PyMC and the welcoming community there. And, you know, the rest is history.

Yeah.
So actually, today I invited you because you have this new tutorial about structural equation modeling and confirmatory factor analysis. So that's already two acronyms, SEM and CFA, that we should definitely introduce. Could you start by explaining the basics of SEM and CFA for our listeners who might be new to these concepts?
Yeah. So I think the way I came into trying to understand factor analysis, and, sorry, confirmatory factor analysis, was probably initially through the scikit-learn factor analysis methods. The presentation there, inside a typical sort of machine learning workflow, is to think of factor analysis as a dimensionality reduction technique. So you have this wide array of features, your X matrix in your prediction problem or whatever, and you think that a bunch of them are kind of related, and you want to zip them up into one feature that simplifies your modeling workflow. So you take these related features, push them through this factor analysis routine, and it spits out maybe one factor or two factors, depending on what you're aiming at. But it reduces the complexity of your data set by, not exactly aggregating up multiple features, but creating new features from existing features. So that was my first exposure to factor analysis.
:Confirmatory factor analysis is more common, less not in machine learning workflows, but
more common in psychometrics and social sciences, educational sciences, learning
115
:development sciences, where the factor itself, the thing you wrap up all your features
into, is of independent interest.
116
:So you might have this kind of construct or notion of say mathematical aptitude, which is
not itself directly measurable, but you have a bunch of measures that seem related to it,
117
:right?
118
:Like, or should be informed by it.
119
:So like historic test results on like say your junior cycle math exams and your other
cycle math exams and like, how do they all kind of collectively inform a view of your
120
:individual mathematical aptitude?
121
:So a confirmatory factor analysis model is interested in these abstract constructs and you
might gather a wide data set and you want to sort of reduce it in terms of complexity, but
122
:also see which are good measures for which of these latent factors exist.
123
:So this is often the case.
124
:so, so yes.
125
:So yeah, so basically it's a dimensional reduction technique, but the things that you're
reducing to these multiple indicator variables are of independent interest themselves.
126
:So you might want to try and measure this latent construct of mathematical aptitude and
see if that itself is predictive of future math scores, for instance.
127
:So that's kind of factor analysis.
128
:And the confirmatory part is you're making a statement by building this model that
129
:These indicator variables, like say your historic math scores, load onto this aptitude
factor well, and they make sense collectively as indicators for this abstract notion of
130
:aptitude.
131
:So what you're trying to do when you fit a confirmatory factor analysis is make sure that
the specified relationships between these abstract constructs and your variables that you
132
:define in your modeling architecture make sense and can recover aspects of the observed
relationships in your data matrix.
133
:Primarily, you're interested in recovering
134
:the correlations and covariance structures between your observed variables.
135
:And so as you build out that architecture of the confirmatory relationships, you then
evaluate the success of your factor model by seeing how well it can recover the covariance
136
:structure in your observed data.
137
:Okay, so that means does that mean CFA has like more structure in it because you assume
the model has a structure can almost like a DAG?
138
:Yes, you're like by imagine you have a data matrix with six variables.
139
:You say that the first three variables are related to
140
:life satisfaction in the case that I was looking at.
141
:And the second three variables are related to measures of parental support.
142
:So you want to sort of abstract across those six measures to have like one score for each
individual on how well supported they were by their parents and another score for how well
143
:they report life satisfaction scores.
144
:But you are imposing that structure, like you're imposing that structure to say,
145
:these three variables load on this factor, these three variables load on this factor, and
you're trying to confirm that structural specification by seeing how well that model can
146
:then reproduce the observed covariance structures between the six indicator metrics that
you actually have in your data set.
And the difference with classic factor analysis here, in that example, would be that you would not be telling the model that the first three variables relate to life satisfaction and the last three relate to parental support. You would just put everything in the same model and see if the model picks that up itself.

Machine learning approaches are often less theory-driven and more "does it perform better predictively on the outcome I care about?". Confirmatory factor analysis is inherently theory-driven. You're trying to describe what you believe to be the operating theory that drives these outcomes. And by trying to confirm the factor structure that you've specified in your model, you're in some sense trying to validate a theory of what drives these outcomes of interest.

Okay, yeah. So I can clearly see the link then with DAGs and causal inference, for sure.

Yeah. And so, I mean, that's already seemingly complicated enough, or at least hard to explain in a podcast setting. And then structural equation modeling on top of that is a way, in some sense, to add yet more structure to the theory that you wish to confirm. So you now have all your measures for your indicator variables, your three for life satisfaction and three for parental support. But now you want to say that, of the two latent constructs you've just defined, there's a regression relationship between those two constructs, for instance. So we try to predict life satisfaction from parental support. Structural equation modeling is used across a bunch of diverse fields, and means maybe subtly different things in each, but inherently it's just about adding more explicit structure to the relationships in your big multivariate joint distribution. And you can be quite explicit in how you structure those relationships, or those dependency chains. You can have chains of regressions or chains of functional relationships between your indicator variables, but also between the latent constructs that you derive via your measurement model, the factor analysis structure that you have baked into your structural equation model.
Okay, yeah, I see. So actually, for people interested in more about structural equation modeling, causal inference and psychometrics, I recommend listening to episode 102 with Ed Merkle. I'll link to that in the show notes, because Ed is doing psychometrics, and SEM really seems to be used a lot in psychometrics. So for people interested in that, I definitely recommend that one. Today, we're going to focus on your tutorial about CFA and SEM, Nathaniel; I know you prefer going by that. Actually, I'm curious: what were the main goals of this tutorial, and what prompted you to explore this area?
What prompted me to explore it... To be clear, none of the data we used in that tutorial is Personio's data. But I work at Personio, which is an HR platform and intelligence platform, and one of the primary ways in which HR departments attempt to gauge their effect on the employee base in any company is to run regular surveys of different stripes. The main, or canonical, survey that gets run across most industries is an engagement survey, where you're trying to build a model of, or understand, what drives employee engagement or satisfaction in some sense or other. And those surveys can be quite dense. There are many, many questions across different themes about your work-life balance or your working conditions or your autonomy. So it's a kind of psychometric-adjacent question: you're trying to figure out what metrics load on aspects of employee happiness, how that happiness drives engagement, and ultimately questions like how engagement drives productivity and the business's bottom line. And it's a problem that is just well suited to structural equation modeling.

And I didn't know enough about structural equation modeling to be comfortable applying it with out-of-the-box tools. So I wanted to dive into the modeling framework, understand it better, and build it in a framework I know well, which is PyMC. And the beautiful thing about probabilistic programming languages, and PyMC as well, is that they offer you this language, or freedom, to express those complicated modeling structures in a way that, A, I'm familiar with, but B, is also very intuitive and powerful given the Bayesian setting.
I see, yeah. So that was a way for you to learn out in the open, in a way. And so what are some of the key challenges that you experienced in applying SEM and CFA?

Yeah. So, like I think you mentioned, blavaan is a package that's being developed for Bayesian structural equation modeling...

By Ed Merkle.

Exactly. And that is sort of an augmentation of the more traditional lavaan package, which is also used to estimate structural equation models. lavaan, I think, tries to fit these models in a more classical or frequentist way; it often fits these models with maximum likelihood sorts of methods. And I've tried to use those models, because there's a lot of tooling around lavaan which makes it easier to interpret them, and it's a powerful package in its own right. But these models are inherently quite complex: you're building rich structural relationships, with many, many parameters in a lot of cases, across complex and possibly not well understood relationships in this multivariate survey that you're going to be working with. And I often found that the model fit either wouldn't converge, or if it did converge, it would be a saturated model, which can be fine, don't get me wrong. But one of the difficult balancing acts you have when fitting these structural equation models with maximum likelihood, and this is a detail that I think is relevant for our discussion about Bayesian model fitting, is that the lavaan model tries to fit the data by optimizing the fit of the model-implied covariance matrix against the observed covariance matrix. So the problem you have there is that you have a number of degrees of freedom to spend in your model fitting routine. And with these complex models, you can quickly exhaust your degrees of freedom, and you get a saturated model very quickly, which then becomes harder to build on and ultimately interpret. So that was one challenge I had, just working with lavaan and working with SEMs in general.
And what I found compelling about the Bayesian approach to fitting SEM models is that the estimation routine is entirely different. You're not aiming to optimize your fit to the covariance structure, so you're not looking to fit against an aggregate of your observed data. You're actually looking to retrodict, or predict, the observed values of your survey. So it's a completely different model estimation approach. And it also unlocks for you the more typical posterior predictive checks that you get out of the contemporary Bayesian modeling workflow. You can also, of course, look at the covariance structures that have been derived through your Bayesian estimation workflow, but it's not as limited in the manner in which it tries to fit the model to the data. And you obviously then get the power of priors, to tune up and down when you find hard-to-fit models where the sampler doesn't behave well. So yeah, that was the primary challenge, and Bayesian approaches to those complex models in general helped overcome that challenge.

Hmm, okay, yeah. And why does using the Bayesian framework here help? Is it the classic reason that using priors actually helps a lot, because it adds structure to the model? Or is it mainly something else?
Yeah, so it adds structure to the model, for sure. And you can weigh in, in some sense, where you think you would otherwise have implausible values. It helps in that respect, but there's also a detail about how, once you have a saturated MLE fit for a SEM model, it's harder to then add more structure and interpret the outcomes of that saturated model. Whereas, because you're not really working with the same degrees-of-freedom problem when you're estimating the Bayesian model, you can add more structure, and you get different evaluation criteria that, I think, at least to my mind, let you more smoothly interpret the outcomes of your model fit.

Okay, I see. That's interesting. How was using PyMC for that kind of model? Because I don't think that when people pick up PyMC, they think about these kinds of models. Maybe causal inference, they pick it up for that, but SEM or CFA... I'm not sure it's a software that pops into people's minds easily. So can you discuss how you used PyMC in this tutorial, and what advantages PyMC provided when modeling complex structures like those in your SEM and CFA tutorial?
Yeah. So, I mean, I'm very impressed with the flexibility of probabilistic programming languages in general. And I think even under the hood, blavaan, for instance, fits the complex models using Stan. It's maybe not the historically natural fit for fitting these models; I don't know enough about the history of psychometrics. I think there have been a couple of recent books, from around 2020 and 2021, on Bayesian psychometrics or Bayesian structural equation modeling, and I would be hopeful that there'll be more Bayesian structural equation modeling done in the future.

I think these models are inherently complicated to articulate. And the reason why packages like lavaan and blavaan and, I think, LISREL and all of these things are used so widely is because they hide a lot of the complexity behind a nice, cleaner UI. And it's a validated UI, in some sense, right? These model fit routines have been well justified and battle tested. And you don't necessarily want to be using a bespoke tool, given the complexity that accrues to these types of modeling routines. So yeah, I use PyMC because I'm very familiar with PyMC. These models are inherently probabilistic, so I think it is a natural fit for expressing these types of structures. But because of the complexity of these models, if you're working in industry, or if you're working in academia and you don't want to build these models by hand, you should probably use a package that is more battle tested for SEM modeling than PyMC. That's not to say we couldn't invent an alternative to blavaan on a PyMC base, but it would be a lot of work, and I think perhaps redundant.
I see. Okay, yeah. Still, to me, it's amazing to know that if you're already comfortable with Stan or PyMC, you can write up these kinds of models in this framework. That's really cool. And that's why I really encourage people to take a look at your tutorial, because, well, the code is in there. And at least to me, that's really how I understand methods: much more by reading the code than by reading the equations and what the model is supposed to do. When I see the code, it usually really helps me understand what we're trying to do.

Yeah, 100%. That was part of the appeal for me: the way to learn the nature of the model was to see if I could build it. And I was pleasantly surprised to see that I could.

Yeah. And how do you go about... well, developing, we talked about that, but a question I often get, and that I also get myself in my work with the Marlins, is: how do you validate the models? What criteria do you use to assess model fit and model accuracy, to understand whether going down a more complex road would pay off or not? Because often, developing a model is a lot of choices, right? It's a bit like a garden of forking paths, where you can always come up with a more complex method and model, but it's going to take time and resources, and that's an opportunity cost for you as the modeler, and then of course for the company. So yeah, how do you think about these topics? How do you decide? When do you think that adding complexity is worth it or not?
Yeah, so these are good questions. I think there are probably at least two parts to this. There are questions of evaluating model fit in general, and for a SEM or a CFA model, that breaks into at least two views on the problem. One is measures of global model fit. This is where you look at the model as a whole and you find one summary statistic that is your measure of performance. This kind of lens is like, in a typical regression, someone maybe overweighting the importance of R squared as a measure of performance, right? There are analogous summary statistics used in traditional SEM models, a ton of them actually; they're indices of global model fit. And then there are also measures of local model fit. What I mean there is: instead of looking at the overall model, you look at how well the model recaptures a particular relationship. So imagine your model has a covariance matrix of five by five or something like that, and you're interested in particular in the covariances between the outcome of interest and one of the main drivers in your joint distribution, and you're primarily interested in preserving a good model fit to that relationship. Then you can look at how well your model recovers the covariance, or the correlations, in that component of your overall covariance matrix after your model has been fit. So in that way, you can distinguish model evaluation criteria into local model evaluation criteria and global model evaluation criteria.

And so in the tutorial, I tried to pull out two views on this. Because it's a Bayesian model, first of all, you can do posterior predictive checks in general. So you can do posterior predictive checks across this multivariate distribution, right? Across the, whatever, 15 or so input variables: how well am I retrodicting all of those variables after the model has been fit? You can also then aggregate up the relationships and build a predicted covariance relationship between each of those outcome variables, and then compare that to the observed covariance relationship in your sample. So you can measure how well your model recaptures the observed covariances, and you can also measure how well your model captures the observed data points. And you can then also compute LOO or WAIC measures on the outcome in general. And that falls into the typical Bayesian posterior predictive checking routines.
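A sketch of that local "covariance recovery" check (my own illustration, not from the tutorial; the function name and the `(n_draws, n_obs, n_items)` layout of the posterior predictive draws are assumptions):

```python
import numpy as np

def correlation_recovery(ppc_draws, observed):
    """Compare the observed item correlations to the average
    correlations implied by posterior predictive draws.

    ppc_draws: (n_draws, n_obs, n_items) posterior predictive samples
    observed:  (n_obs, n_items) observed data matrix
    """
    obs_corr = np.corrcoef(observed, rowvar=False)
    implied = np.mean(
        [np.corrcoef(draw, rowvar=False) for draw in ppc_draws], axis=0
    )
    return obs_corr, implied

rng = np.random.default_rng(0)
observed = rng.normal(size=(100, 6))
ppc_draws = rng.normal(size=(50, 100, 6))  # stand-in for real model draws

obs_corr, implied = correlation_recovery(ppc_draws, observed)
# Local fit: how far off is the model on one specific relationship?
local_gap = abs(obs_corr[0, 3] - implied[0, 3])
```

With a fitted PyMC model, `ppc_draws` would come from `pm.sample_posterior_predictive`, and you would inspect the cells of this matrix that correspond to the relationships your theory cares about.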
332
:So there's kind of multiple lenses on how you evaluate the model, just based purely say on
the statistical metric.
333
:There's another kind of subtle thing to think about here.
334
:When you have...
335
:Like, and this goes back to the fact that confirmatory factor analysis and structural
equation models are sort of articulations of a theory that you want to evaluate.
336
:So like if your theory is about a dependency relationship between parental support and
life satisfaction, and you have this mediating structure between like whatever other
337
:variables that you're interested in, you want to answer questions about that dependency
relationship.
338
:So articulating the right SEM structure or the model structure to be able to interrogate
those questions in itself will kind of lead you to build a particular model structure,
339
:right?
340
:That could be compared to a less complex structure, but the less complex structure won't
be able to answer that question that you have in mind.
341
:So like, even though you might then look at these two models,
342
:on the aggregate statistics of fit and maybe the less complex model does better, but it
fails to answer the question you're interested in.
343
:And so then when you're weighing up your resources and your time allocation or whatever,
like you build a model that answers the question of interest to you, whether that's like
344
:the best fitting model or not, this is the closest articulation you can get to your
theory, your theoretical question.
345
:which presumably is important because it's going to drive a decision for you in work or in
your day to day.
346
:So, yeah, there's multiple dimensions on which to evaluate fit and complexity tradeoffs.
347
:But the psychometric discipline seems to be focused on trying to answer the theoretical
question that matters.
348
:like, yeah.
349
:That makes sense.
350
:And that's also usually what I answer people when they ask me about that.
351
:It's more that you have to look at a collection of metrics and comparisons, much more
than just looking at one of them and calling it a day.
352
:Yes, that's true.
353
:Actually, sorry, another element of evaluating these models, especially in the Bayesian
setting, is the real and vital importance of sensitivity checking.
354
:like sensitivity checking to how you set those priors, right?
355
:Like if you have articulated the model you want, it has the right structure, you really
need to make sure that you're not like skewing the thing by having too loose a prior.
356
:So like as part of this workflow with the SEM model at least, you want to be damn sure
you've done good sensitivity analysis to make sure that the results that you're interested
357
:in hold even after like changing reasonable priors.
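A toy version of that sensitivity check, using a conjugate normal-normal model in place of the full SEM so the posterior is available in closed form (the data and priors are invented for illustration):

```python
import numpy as np

def posterior_normal(data, prior_mu, prior_sd, obs_sd=1.0):
    """Conjugate normal-normal posterior for the mean of `data`."""
    n = len(data)
    prior_prec = 1.0 / prior_sd**2
    like_prec = n / obs_sd**2
    post_var = 1.0 / (prior_prec + like_prec)
    post_mu = post_var * (prior_prec * prior_mu + like_prec * np.mean(data))
    return post_mu, np.sqrt(post_var)

rng = np.random.default_rng(0)
data = rng.normal(0.8, 1.0, size=100)  # toy stand-in for a structural coefficient

# Re-fit under a handful of reasonable priors and compare the conclusions.
priors = [(0.0, 0.5), (0.0, 1.0), (0.0, 5.0)]
fits = [posterior_normal(data, mu0, sd0) for mu0, sd0 in priors]

# With n=100 the posterior means should barely move across these priors;
# a large spread here would flag prior sensitivity worth worrying about.
spread = max(m for m, _ in fits) - min(m for m, _ in fits)
```

With the real SEM, the analogue is re-running the sampler under each prior setting and comparing the posteriors of the quantities you report.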
358
:Yeah, that makes sense.
359
:That makes sense for sure.
360
:And that's related to simulation-based calibration, I guess.
361
:So that's also something I really recommend more and more if you have computing power or a
model that's not too long to fit, or if you can use variational inference, is doing
362
:simulation-based calibration even before
363
:fitting the model to real data, that's really good because that gives you confidence that
the model is doing what you want it to do.
364
:So once you start hitting roadblocks when you start fitting the model to real data, you
know it comes either from a model structure or parameterization issue, but at least it's
365
:not coming from the...
366
:the DAG, if you want, that you have in mind, because you know that if the DAG you have in
mind corresponds to the data you observed, then you'll be able to recover the parameters.
367
:So we recommend doing that also.
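The simulation-based calibration loop can be sketched in numpy, with a conjugate toy model standing in for the real model so the posterior is exact; with MCMC, the posterior draws would come from your sampler instead:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulation-based calibration for a toy model: theta ~ N(0, 1),
# y_i ~ N(theta, 1). The conjugate posterior is available in closed form,
# so SBC ranks can be computed exactly here.
n_sims, n_obs, n_draws = 1000, 20, 99
ranks = np.empty(n_sims, dtype=int)
for s in range(n_sims):
    theta_true = rng.normal(0.0, 1.0)              # 1. draw parameter from the prior
    y = rng.normal(theta_true, 1.0, size=n_obs)    # 2. simulate data from it
    post_var = 1.0 / (1.0 + n_obs)                 # 3. "fit" (conjugate posterior)
    post_mu = post_var * n_obs * y.mean()
    draws = rng.normal(post_mu, np.sqrt(post_var), size=n_draws)
    ranks[s] = np.sum(draws < theta_true)          # 4. rank of the true value

# If the model and inference are self-consistent, the ranks are uniform
# on 0..n_draws; a skewed or U-shaped histogram signals a problem.
hist, _ = np.histogram(ranks, bins=10, range=(0, n_draws + 1))
```

For a real SEM, step 3 is the expensive part, which is why compute budget or fast approximate inference matters for making this practical.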
368
:Actually, talking about difficulties and roadblocks, are there any technical
insights or interesting findings from writing your tutorials that were particularly
369
:striking or unexpected to you?
370
:I mean, the thing that tripped me up most when building these models probably is I started
with this confirmatory factor baseline kind of model, which I think is also called
371
:the measurement model.
372
:So before you add extra structure amongst the variables, you want to just establish the
relationships between your observed data and the factors of interest.
373
:So building that went reasonably smoothly.
374
:And the thing that tripped me up when I was trying to add structure onto that was like I
initially just added regression formulas effectively into the PyMC model
375
:context.
376
:And it wasn't working.
377
:And I realized at some point it was because I had just like fixed formulas.
378
:It wasn't sampling those kinds of latent structural formulas.
379
:I wasn't.
380
:putting them in as a random variable.
381
:Instead, the formula had to be put inside a normal distribution as the mu parameter.
382
:We're predicting the center of a normal distribution, and then the model gets to sample
the rest of the random variation associated with that regression equation.
383
:And I had been too optimistic in guessing that the fixed formula would be enough.
384
:Rather, there's also variation or random variation that needs to be accounted for as you
build the structural components of your SEM model.
385
:The regression equations themselves are sort of probabilistic random variables.
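The fix described here, putting the regression formula in as the mu of a Normal rather than as a fixed expression, is at bottom a variance-accounting issue. A small numpy illustration (all names and coefficients invented):

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy structural equation: a latent driver and a downstream latent outcome
# (names are illustrative, not from the tutorial).
n = 5_000
parent_support = rng.normal(0.0, 1.0, size=n)      # upstream latent factor
b = 0.7                                            # structural coefficient
resid_sd = 0.5                                     # structural residual scale
life_sat = b * parent_support + rng.normal(0.0, resid_sd, size=n)

# The mistake described above: treat the regression as a fixed formula.
deterministic_pred = b * parent_support

# The fixed formula explains only b**2 of the variance; the rest is the
# residual term. That missing variance is why, in a PyMC model, the formula
# goes in as the mu of a Normal, e.g.
#   life_sat ~ Normal(mu = b * parent_support, sigma = resid_sd)
# rather than as a plain deterministic expression.
var_missing = life_sat.var() - deterministic_pred.var()
```

Here `var_missing` lands near `resid_sd**2 = 0.25`, the residual variance the deterministic formula silently drops.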
386
:And secondly, when you build these regression components of your SEM model, you are taking
them out of the multivariate structure that is in your simpler
387
:confirmatory factor analysis model, right?
388
:But you need to put the possibility back into the model that those regression formulas in
your structure also have covariance structures.
389
:So you needed to build extra sort of residual covariance structures to ensure that if
there is a sort of co-determining effect between two regression components of your structural
390
:model, the model was able to articulate that
391
:co-determination effect and estimate the degree or strength of the co-determination
effect in your model.
392
:So it tripped me up getting that extra structure in there.
393
:I see.
394
:And how did you realize it was an issue and you had to fix that?
395
:The model fits worse.
396
:The model fits were dramatically worse.
397
:There were more divergences, sometimes utter failure to converge.
398
:One of the maybe surprising things is like the model structure there in PyMC
converges really fast for such a complex model.
399
:It takes like five minutes to fit data.
400
:So yeah, so it just wasn't working cleanly.
401
:You could see that in the model fit statistics, but you could also see it in the sort of
health diagnostics
402
:of the sampler, which is kind of a nice, that's a nice aspect of HMC in some sense.
403
:It's like, if it's not working, it's also an indication that your model is poorly
specified, right?
404
:So yeah.
405
:Yeah.
406
:I mean, divergences are really, you know, the nerdy realization of the idea that the
obstacle is the way, you know, it's just.
407
:It's always terrible to see the divergences, but in the end, they're always a blessing,
but it really takes adapting your mindset to be like, that's great.
408
:I have divergences instead of being like, my God, no.
409
:Yeah.
410
:There was some banging of my head against the wall as we tried to figure it out.
411
:Actually, so we mentioned SEM, CFA.
412
:We mentioned causal inference, but we didn't explain yet how.
413
:these methods intersect together.
414
:So can you explain the role SEM and CFA play in understanding causal relationships in
data?
415
:Yeah, so I mean, there's actually a nice paper by Judea Pearl and, I think, Ken Bollen,
and it talks about the nature of the relationship between structural equation
416
:modeling and causal inference.
417
:And the sort of argument there is that like it goes back to the fact that these models are
articulations of theory, like they're articulations of a relationship that you think holds
418
:between like whether it's constructs or independent metrics or not, you're sort of
building a theory that you ultimately want to believe is causal in some sense or other in
419
:most cases.
420
:It doesn't give you causation for free.
421
:You still need to
422
:include the right variables in the model, structure them appropriately, make sure that you
try to remove as much confounding as possible.
423
:And in some sense, factor analysis, the fact that you're collecting a bunch of metrics
underneath one factor helps you articulate or disentangle the independence relationships
424
:between multiple data points and recover conditional independence.
425
:structures which will give you license to make causal claims about your model if it's well
fit.
426
:But the discipline is sort of inherently about articulating a claim about the effects of
the relationships between these variables.
427
:So it's kind of inherently causally keyed, if you get me.
428
:Yeah.
429
:I see.
430
:OK.
431
:OK, that's interesting.
432
:And so here the
433
:That's interesting to me to hear that the factor analysis is helping actually in this
causal, like recovering the causal relationships.
434
:Because intuitively I would say that it could muddy the waters because you're reducing the
dimensionality of your data, which I really get why you would do that, but I would
435
:have intuited that that would make causal inference harder.
436
:There's a nice quote at the end of the tutorial I have from Judea Pearl there, where it's
like, you're using these factor-like structures to potentially create a new construct, but
437
:that could be like, like doctors can just create a new syndrome based on different aspects
of behavior or sort of, you know, health response that is hard to sort of think about when
438
:it's this sort of multivariate presentation.
439
:If there are 10 different symptoms, you want to group those 10 different symptoms together
and call it a syndrome or something like that, which helps you then think about the structure
440
:between outcomes.
441
:And so rather than having to work out what's the outcome, like do they live, die or suffer
in pain for 10 years due to this presentation of 10 symptoms, it's like, what is the
442
:probability of survival based on having this particular syndrome?
443
:Right?
444
:So not only is it a dimensionality reduction technique, it's also sort of a theory
refinement technique, right?
445
:Because you have lots of data around things which are potentially presenting in a very
complex way; by gathering them under a factor, you sort of unify, make more elegant, or
446
:clean your theory in a way that you hope to be justified theoretically.
447
:Like it's not just like.
448
:There's work to be done to make an SEM model a compelling causal account of the
relationships between these variables.
449
:So it's not for free, but like the structure lends itself to building a theory about
what drives what, what causes what.
450
:And so that's how you can use it to articulate those, the complexity of those
relationships.
451
:Hmm.
452
:Okay.
453
:Okay.
454
:Fascinating.
455
:Yeah.
456
:And, do you...
457
:Do you have some practical applications in mind of these methods in real world scenarios,
especially when it comes to understanding risk and causal inference?
458
:I mean, so like, I kind of want to use these ultimately to look at sort of employee
engagement type survey data.
459
:like, I want to know what are the sort of themes that are driving
460
:the successful outcomes for an engaged workforce, for instance, based on their responses
across clusters of thematically grouped questions.
461
:So those questions might be related to their worker autonomy in their role.
462
:It might be related to the working conditions of the job.
463
:It might be related to their view of the company writ large.
464
:Like what are the decisive factors that have more sway in influencing the outcome, which is
like
465
:a score of employee engagement.
466
:And I think, like, more generally, well-designed surveys will give you clusters of
questions which are sort of mapped to appropriate themes, which can be like
467
:kind of articulated as a factor analysis model.
468
:And then you want to kind of figure out what is the sort of relationship between, you
know, the employee's sense of autonomy and their sense of engagement.
469
:You're not going to ever measure autonomy directly, but you might measure it indirectly
via a well-designed survey.
470
:I think similar use cases are done across the social sciences.
471
:These models are built in such a way that they can take these complex presentations of
multivariate survey responses.
472
:and sort of thematically group or cluster these responses in ways that allow you to better
articulate the theory of what's going on behind the scenes.
473
:Okay.
474
:Yeah.
475
:Okay.
476
:I see.
477
:So a lot of survey data. That makes sense.
478
:I mean, survey data is the use case I kind of have in mind, but I think there's
applications beyond that as well.
479
:It's like any sort of complicated mess of measurements that you want to sort of...
480
:add structure to, to make a more compelling account.
481
:Yeah, that makes sense.
482
:Something I really like, I said it in the introduction to this episode that you have a
background in philosophy and mathematical logic, which I find super interesting and very
483
:original for that field.
484
:I'm wondering how do you see these disciplines influencing your approach to data science
and probabilistic modeling?
485
:Yeah, I mean, I think deeply, sort of deeply important to the way I think about these
problems.
486
:Maybe like philosophy in general, I think it's a fascinating study and more people
should do it.
487
:Probably the more direct influence is the sort of study of logic that I did. Like, a lot
of the focus, when you're studying a logic or various logics, is
488
:The way you want to express different consequences or different inferences between your
assumptions and the predicted outcomes in some sense or other, you have various varieties
489
:of logic that encode different consequence relationships.
490
:So there's like relevance logics, there's justification logics, there's dynamic epistemic
logics, there's a whole range of these particular things.
491
:And they're all kind of just like different little lexicons.
492
:with different connectors and rules of consequence.
493
:Like, so how you can derive a conclusion from a set of assumptions when you're working
with a relevance logic versus when you're working with classical logic are different.
494
:They have different sets of consequences.
495
:And what I kind of found when I was working with those things is, like, there are natural
arguments you want to make or inferences you want to make, which are
496
:blocked when you're using a particular logic.
497
:And what I find fascinating about sort of PyMC and probabilistic programming languages is
that it's also a language.
498
:You get to articulate different structures and draw inferences in ways that are sort
of mathematically defined and structured.
499
:But it also lets you answer questions which are fuzzier than, does this follow
500
:necessarily from your assumptions.
501
:It's like this follows probabilistically with some degree of surety here and there.
502
:So I think the focus on logic, the structuring of your argument in that also leads me to
think about modeling as structuring an argument in some sense.
503
:Every model you build to my mind is like an argument about the state of the world.
504
:You're saying the world is thus and so.
505
:And here's how I'm measuring that and here's how I'm evaluating that claim.
506
:And I'm retrodicting against past data to sort of reinforce that my model is a good fit
to the world.
507
:It's an argument about how the world is or approximates, how we can approximate the world
through this linguistic structure.
508
:So yeah, I don't know if that answers your question exactly, but yeah.
509
:Yeah.
510
:No, for sure.
511
:And also something I've seen you talk about and express is a keen interest in inference and
measurement in the presence of natural variation and confounding.
512
:How does this interest of yours shape the way you design your models?
513
:Yeah, so that's kind of a
514
:So, yeah, so I think maybe I just have a very suspicious personality, in that I think that
the world is constantly trying to fool me or that there will be a confounding relationship
515
:in the data that I'm working with.
516
:And that's kind of generally led me to look at the sort of causal inference questions and
causal inference can be just considered like a collection of methods or models.
517
:that attempt to adjust for suspected patterns of confounding in your data generating
processes.
518
:A nice kind of overlap between the SEM world and the sort of causal inference type models
is the Bayesian formulation of instrumental variable designs.
519
:The Bayesian formulation of that model fits a multivariate normal distribution on your
520
:treatment and your instrumental variable, because you want to measure the sort
of correlation and covariance between those two things to adjust for the confounding
521
:relationship, the confounding influence of a third variable on your treatment data.
522
:And that is kind of like, the instrumental variables and path tracing rules
523
:were all first envisioned or imagined by Sewall Wright,
524
:who came up with a lot of these SEM structures and instrumental variable designs.
525
:And both are kind of like attempts to articulate a model structure that adjusts for the
risk of confounding in your data generating processes.
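As a sketch of the confounding adjustment being described: the classical Wald/2SLS estimator is the frequentist counterpart of the Bayesian multivariate-normal formulation, and shows the same idea on fully simulated data (all names and coefficients invented):

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated confounded system:
# z: instrument, u: unobserved confounder, t: treatment, y: outcome.
n = 20_000
z = rng.normal(size=n)
u = rng.normal(size=n)
t = 0.8 * z + 1.0 * u + rng.normal(size=n)
beta = 0.5                                   # true causal effect of t on y
y = beta * t + 1.0 * u + rng.normal(size=n)

# A naive regression of y on t is biased upward because u drives both.
ols = np.cov(t, y)[0, 1] / np.var(t)

# The instrumental-variable (Wald) estimate uses only the variation in t
# induced by z, which is independent of u, and so adjusts for the confounding.
iv = np.cov(z, y)[0, 1] / np.cov(z, t)[0, 1]
```

The Bayesian version fits the treatment and outcome jointly with a multivariate normal likelihood, so the correlation between their errors (the confounding channel) is estimated explicitly rather than differenced away.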
526
:And I find that species of this way of looking at the world, that there is a data
generating process,
527
:there is a risk of confounding if you think your data generating process looks anything
like this.
528
:And here's a technique or a sophisticated adjustment you can make to your model to account
for that.
529
:And yeah, so that sort of sits at the heart of the way I think about modeling.
530
:What do I need to adjust for?
531
:What are the risks that this is going to go completely wrong?
532
:How do I sort of validate that I haven't gotten it completely wrong?
533
:Yeah, with the advancements that
534
:we see now in generative AI.
535
:I'm curious what future trends do you anticipate or maybe hope for in the field of causal
inference and structural equation modeling?
536
:Specifically with respect to the advancement of generative AI.
537
:Yeah, or just what...
538
:What future trends do you see in general in these fields or maybe things that you hope
for?
539
:I think causal inference, I feel like, has been getting a lot of traction, like the
importance of confounding in basic data analysis even, just like that is more and more
540
:visible.
541
:Even just in industry like the...
542
:the prevalence of sort of quasi-experimental designs to like, maybe you're working in
industry and people don't want to run an A-B test for everything, but they're just
543
:launching a new policy, a new procedure, and they want some sort of guide on how to
understand what the impact of that policy is.
544
:The proliferation and spread of understanding of causal inference and confounding risk for
the evaluation of those new policies or procedures, I think has
545
:gained a lot of traction in recent years and I would hope it continues to do so.
546
:And with that, the increased understanding of risk of bias, waste due to biased
conclusions, increased caution.
547
:I think poor causal inference is a kind of zero-interest-rate phenomenon.
548
:If money's on the table and you're going to waste money by doing something really poorly,
you want to do that inference well.
549
:And so now we're not quite in a zero.
550
:I know the interest rates came down recently in Europe, but we're not back to zero
interest rate world yet.
551
:I would expect causal inference to go from strength to strength over the next couple of
years.
552
:I see.
553
:I'm also wondering if you have advice
554
:that you would give to aspiring data scientists interested in specializing in what you're
doing, which is the intersection of probabilistic modeling and causal inference?
555
:I think it's generally always just like, if you're really young and you're on the job
market and you're looking for a portfolio piece or something like that, I always say find a problem
556
:that you're interested in, not one that exists out there like a
557
:niche data set, maybe you create your own data set, kind of show the workflow and the
thought that went into your understanding of the data generating process.
558
:And that you can think about the aspects of that data generating process, which would
support your conclusions, but also threaten the conclusions.
559
:And the ability to be able to articulate that understanding is more important to me when
I'm, like, hiring
560
:than the ability to say you just deployed the latest fancy model.
561
:I just want to hear that you've thought about what is the thing you're actually measuring.
562
:That you can think through the problem in some sense.
563
:I think that's more impressive on the job market, frankly, than it is to say that you've
played with the latest deep learning phenomenon.
564
:Yeah, because I think this also shows that you're able to learn.
565
:And I think that's one of the most important things in our jobs because I think one of the
best job descriptions of our line of work is
566
:The ability to be uncomfortable all the time with what you think you know, and being able
to always update what you're doing, how you're doing it, and why you're doing it, is extremely
567
:important. I completely agree with you.
568
:Yes.
569
:I would also emphasize good communication is going to be vital in any creative discipline.
570
:That's true.
571
:The ability to communicate complex topics to different audiences.
572
:and break them down is extremely important.
573
:So that's why also, you know, making the effort to do the communication you're doing,
with your written articles and also coming on the podcast, different media.
574
:That's really part of the job.
575
:I would say it's not something that you do, you know, on the side, because what counts is
only the code.
576
:Because good code that's not used is not very useful.
577
:Yeah, utterly useless.
578
:Yeah.
579
:So to close this out before I ask you the last two questions, are there any future
projects or research areas that you're currently excited about?
580
:Particularly involving Bayesian methods, of course, but things that you're learning right
now that you're really excited about?
581
:Yeah, I'm kind of focused on two things at the moment.
582
:So, more and more into sort of synthetic control methods for causal inference.
583
:And I also, I kind of want to go back to basics a little bit on understanding just survey
methodology and how to think about surveys well, especially stratified survey sampling.
584
:But yeah, so that's kind of on my radar to do.
585
:Should we expect a new in-depth tutorial about that?
586
:Yeah, probably.
587
:a couple of months down the line, I think.
588
:Sounds good.
589
:Can't wait to read that.
590
:Well, Nathaniel, that was really a pleasure to have you on the show.
591
:I think we were able to cover a lot of ground.
592
:So thank you so much.
593
:for taking the time.
594
:Of course, before letting you go, I'm going to ask you the last two questions I ask every
guest at the end of the show.
595
:So first one, if you had unlimited time and resources, which problem would you try to
solve?
596
:Yeah, so I had a kind of idea for this and I'm not entirely sure how I'd go about it, but
like, I feel like there's like this general tragedy of the commons kind of phenomena that
597
:you like have, say for
598
:climate activism or for even just like working effectively in a big organization with
politics and little kind of kingdoms being built here, there and everywhere.
599
:I'd love to know, like, for different organizational structures, how do you mitigate, if
not solve, these sorts of tragedy-of-the-commons risks. So, for more efficient sort of
600
:I guess more efficient workflows for an organization that can mitigate the risks of
tragedy-of-the-commons effects.
601
:Yeah.
602
:So like if there's infinite time and money and like a big research proposal, probably
fine.
603
:For different organization structures, there's different mitigation strategies and then
which ones to apply where and how, that kind of thing.
604
:Definitely understand that.
605
:That's a great answer.
606
:And second question, if you could have
607
:dinner with any great scientific mind, dead, alive or fictional, who would it be?
608
:Yeah, so I was thinking about this one.
609
:So like, is it a hard requirement that it be a scientific mind?
610
:Because I was thinking like it would be great to have dinner with Borges, you know, the
Argentinian short story writer.
611
:He wrote beautifully concise short stories.
612
:But if a scientific mind is a hard requirement, I think I would go for the sort of
philosopher Nelson Goodman.
613
:Okay.
614
:Yeah, both good choices.
615
:I will allow Jorge Luis Borges because he's Argentinian and my wife is Argentinian.
616
:You know, like, I have to accept him.
617
:And also, definitely.
618
:I love that choice.
619
:Both are, I think for both choices, it's a first.
620
:on the show.
621
:yeah, I love both.
622
:And yeah, I would argue also Borges influenced quite a lot of scientists, right?
623
:So I think he was the one who wrote The Garden of Forking Paths.
624
:I think one of the stories is called that.
625
:So you could argue it's almost a scientific story.
626
:yeah, I think so.
627
:Also, The Library of Babel is like this excellent little
628
meditation on combinatorics in a short form.
629
:Nice.
630
:yeah, definitely.
631
:Awesome.
632
:Well, that was really a pleasure, Nathaniel.
633
:As usual, I'll put resources and a link to your website and socials and the papers and of
course your tutorial in the show notes for those who want to dig deeper.
634
:Thank you again, Nathaniel, for taking the time and being on this show.
635
:Thank you, Alex.
636
:It was great.
637
:This has been another episode of Learning Bayesian Statistics.
638
:Be sure to rate, review, and follow the show on your favorite podcatcher, and visit
learnbayesstats.com for more resources about today's topics, as well as access to more
639
:episodes to help you reach a true Bayesian state of mind.
640
:That's learnbayesstats.com.
641
:Our theme music is Good Bayesian by Baba Brinkman, feat. MC Lars and Mega Ran.
642
:Check out his awesome work at bababrinkman.com.
643
:I'm your host.
644
:Alex Andorra.
645
:You can follow me on Twitter at Alex underscore Andorra, like the country.
646
:You can support the show and unlock exclusive benefits by visiting Patreon.com slash
LearnBayesStats.
647
:Thank you so much for listening and for your support.
648
:You're truly a good Bayesian.
649
:Change your predictions after taking information in.
650
:And if you're thinking I'll be less than amazing.
651
:Let's adjust those expectations.
652
:Let me show you how to be a good Bayesian. Change calculations after taking fresh data in.
Those predictions that your brain is making, let's get them on a solid foundation.