When Fixing an Outage Means Staying Out of the Way
Episode 1631st March 2026 • Stories on Facilitating Software Architecture & Design • Virtual Domain-Driven Design


Shownotes

We often assume that resolving a major outage requires centralised command and control—getting the right experts in a room, coordinating their efforts, and directing the recovery. But what if the most important thing an incident commander can do is resist that impulse entirely, and simply create space for the right person to surface?

That's the situation Liz Fong-Jones found herself in during a July 2018 Google Cloud outage that took down nearly every service—not just Google's own, but every customer running on Google Cloud. As incident commander, Liz had the war room assembled, the escalation path triggered, and the right teams on the call. What broke the incident open was none of that. It was an engineer nobody had thought to page, who called in unprompted, said "I think this was my change," and had already started rolling it back.

That moment was only possible because of something built long before the outage: a culture where people don't hide under their desks when things break. Liz traces how psychological safety gets constructed—not in crises, but in how organisations respond to smaller failures every day. She shares the quiet signals that reveal when it's missing (the call that goes silent after an acronym nobody understands, the junior engineer who never speaks), and the heuristics she uses to build it deliberately as a senior engineer.

This conversation goes beyond incident response to explore what it actually means to build resilient systems and resilient people—and why those two things are inseparable.

Key Discussion Points

  • [00:01] The July 2018 Google Cloud Outage: Liz introduces her role as a volunteer incident commander and the scale of the incident—nearly every Google Cloud service down simultaneously
  • [06:00] The Fix That Came From Outside the War Room: An engineer nobody had thought to page calls in, identifies their change, and has already started the rollback before the room knows what's happening
  • [12:00] Why a Safety Feature Caused a System-Wide Failure: How a canary deployment designed to limit blast radius instead pushed metadata globally—and triggered a bug in every front end
  • [17:00] Distributed Debugging and the Limits of Centralisation: Why the person holding the critical piece of information is rarely in the escalation room, and how you design for that
  • [22:00] Psychological Safety Built Before the Crisis: Why the engineer's willingness to raise their hand depended entirely on how the organisation handles smaller failures day-to-day
  • [28:00] The Quiet Signals That Reveal Fear: Silence after acronyms, juniors who never speak, decisions nobody will revisit—how Liz reads the room for safety
  • [34:00] Design Ownership and Haunted Graveyards: Why accountability for running a system long-term requires input into its design—and what happens when it doesn't exist
  • [40:00] Building Resilient People, Not Just Systems: If an organisation crushes someone when they make a mistake, they won't be resilient the next time something breaks—and something always breaks

Guest: Liz Fong-Jones
Hosts: Andrea Magnorsky, Kenny Schwegler

Transcripts

Kenny Schwegler: Hello. Good morning, good afternoon, good evening, good night, wherever you are. We're back with another episode of Stories on Facilitating Software Architecture and Design, with me and my co-conspirator, Andrea Magnorsky, and today we have Liz Fong-Jones. I'm very curious about your story today. Welcome.
Liz Fong-Jones: Thank you for having me on. So, this story is a Google SRE war story. It was July 2018, and I had just joined the Google-wide incident management program as one of the volunteers. My pager started going off because a bunch of Google Cloud was down. People couldn't connect to Google Cloud backends; they were just seeing timeouts or error messages, and people were not having a great time. So I hopped onto the incident bridge and became the incident commander for this incident.

Many companies have these kinds of centralized incident management groups that handle the overarching response to very broad outages, and Google is no exception. At any time there are probably 30 or 40 people scattered all around the world who are available at a moment's notice to jump in, in case there's some kind of widespread incident that no single team at Google can solve.
And one other useful thing to note here is that I had previously worked on the piece of software that was implicated in this: the Google Front End, which is basically the giant, giant reverse proxy that handles requests for google.com, docs.google.com, basically anything under google.com. It all flows through the Google Front End.
So on this tragic, tragic day, there was an alert that went off that basically said almost every single Google Cloud service is down: we're not answering requests for cloud.google.com. It didn't just affect Google services; it affected the services of every single customer that was using Google Cloud as well.

The set of teams that was initially brought in was obviously the team responsible for the Google Front End: both the development team for it as well as the traffic team, the set of people who handle the load balancing and routing and all the front-end pieces that are required to get your request from point A to point B. But when they realized that it was not just their own services, not just Google web search, but that it was Google Cloud's customers, that there were dozens of services impacted, they signaled a Google-wide escalation, and that's where I got brought in.
So this was the first incident that I worked as part of the central incident management team. It's a volunteer position, I should add, not a full-time position; you have other responsibilities too.

This outage lasted about 30 minutes, and a lot of what we were doing was not necessarily hands-on-keyboard fixing, but just communicating, keeping people in the loop as to what was going on. Communicating with executives, communicating with stakeholders, communicating with Google Cloud reps who were having to explain to their customers what was going on, as well as doing some air cover for the impacted team.
And what I think is really interesting about this incident, and what I wanted to highlight here, is that it was not resolved by the set of people in the war room or on the escalation call finding the solution together and deploying the fix. The fix evolved in parallel with the escalation room, and it didn't take people in the escalation room fixing it. Instead, we had someone raise their hand and call into the escalation room, someone we would not have thought to tag in. And they said: I think this was my change that did this, and I've already started rolling it back.
I think this was a really classic example of two things. It's an example of distributed debugging, where anyone could potentially be holding the answer in their head, so you can't rely purely on centralizing your incident response. And secondly, it's a really important lesson about psychological safety. This was not an environment where the engineer might have hidden under the desk or tried to cover up evidence of the change that they pushed because they were afraid of being fired.
They called into the bridge and said, hey, this was my fault. And I don't love the word fault; we try to be blame-aware. We try to emphasize that the fact that you as an individual can make a mistake indicates that there's some kind of systemic flaw in the system. But regardless, this person felt safe enough to raise their hand, to self-correct the error, and also to tell everyone else about it so that we could stop looking around for the root cause.
There's a heap of detail as to how exactly it happened, and the best I can probably do for you there is to point you to the public Google retrospective, which goes into some degree of depth about it. But I think, from an engineering leadership perspective, the details of how this incident manifested don't matter quite as much as the people lessons about how you structure resilience and how you structure incident response.

So I imagine that might raise some questions for you, Andrea and Kenny. What would you like to dive into?
Kenny Schwegler: Let Andrea go first.

Andrea Magnorsky: Oh, okay. This is...

Kenny Schwegler: ...of questions.

Andrea Magnorsky: This is awesome, because I love hearing the story of it happening. You talked about two things: distributed debugging, and the psychological safety needed for this person to step up and say, hey, I think I know what's going on, I'm rolling it back. But one of the first questions is: how did you validate it? Because until this person goes and undoes something, which could be just reverting the last PR, literally undeploying it, it could be that that's not the cause. It could be one contributing cause, or they could have been wrong, basically. So what happened in the confusion stage would be my question, from a facts point of view, and I have a follow-up question.
Liz Fong-Jones: I think the symptom that people observed was that they were either getting timeouts, or they were getting responses saying that none of the front ends or back ends were healthy. And you're right, it did take us some time to recover from this; you don't go from a hundred percent down to a hundred percent up immediately when you deploy a fix. But the thing that I think we were able to validate was some leading indicators. This particular bug resulted in the Google front ends that were serving requests for Google Cloud resources crashing when they encountered this unexpected input. So when the developer deployed the fix that resulted in those front ends no longer receiving the invalid input, the crashing stopped immediately.

We had other problems, like not enough backends available to serve: thundering herds, too much pressure on the backends that were up. But over time we could start to see the number of available front ends stabilizing, and that was the sign that we were on the right track.

The other thing here is that it's obviously a lot better to have some idea of the mechanism of causation. If something fixes itself and you have no idea why, then you have no idea whether or not it's going to come back. But if you have an explanation of "oh, this bad data tickled this other bug", then that's something you can actually validate, something you can check.
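The validation step Liz describes can be pictured as a check on a leading indicator. The sketch below is purely illustrative, with hypothetical metric samples and a hypothetical threshold (not Google's monitoring stack): it compares the crash rate shortly before and after the fix landed, rather than waiting for full recovery.

    # Illustrative sketch only; metric samples, threshold, and timestamps are hypothetical.
    from datetime import datetime, timedelta

    def crashes_per_minute(samples, start, end):
        """Average crash count over per-minute (timestamp, count) samples in [start, end)."""
        window = [count for ts, count in samples if start <= ts < end]
        return sum(window) / max(len(window), 1)

    def mitigation_looks_effective(samples, fix_landed, window=timedelta(minutes=5)):
        before = crashes_per_minute(samples, fix_landed - window, fix_landed)
        after = crashes_per_minute(samples, fix_landed, fix_landed + window)
        # Leading indicator: the crash rate should fall sharply even though full
        # recovery (thundering herds, backend warm-up) takes longer.
        return after < 0.1 * before

    fix_landed = datetime(2018, 7, 1, 12, 0)  # placeholder timestamp for illustration
    samples = [(fix_landed + timedelta(minutes=m), 400 if m < 0 else 3) for m in range(-5, 5)]
    print(mitigation_looks_effective(samples, fix_landed))  # True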

Andrea Magnorsky: Yeah. Okay.
Liz Fong-Jones: I think the other

thing here, right, like is if you

154

:

just default to reverting whatever,

whatever happened, you know that put

155

:

the system into an end stable state,

most of the time you will be correct.

156

:

You'll be able to return

things to a stable state.

157

:

That's not always a hundred percent

true, but it's definitely a good

158

:

default first thing to try, right?

159

:

Like we talk a lot about the idea

of instead of trying to fix forward,

160

:

right, just roll back, right?

161

:

Like you have a known working state

that you can hopefully get back to,

162

:

and that should ideally be the happy

path for trying to result things.
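A minimal sketch of that rollback-first default, with hypothetical names (illustrative only, not Google tooling): prefer returning to the last known-good release, and treat fix-forward as the fallback when no such release exists.

    # Illustrative sketch only; names and structure are hypothetical, not Google tooling.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Release:
        version: str
        known_good: bool  # was this release serving correctly before the incident?

    def choose_mitigation(current: Release, previous: Optional[Release]) -> str:
        """Default to returning to a known working state rather than debugging in place."""
        if previous is not None and previous.known_good:
            return f"roll back {current.version} -> {previous.version}"
        # No safe state to return to, so mitigation has to be a forward fix.
        return f"fix forward on {current.version} under incident command"

    if __name__ == "__main__":
        print(choose_mitigation(Release("v123", False), Release("v122", True)))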

Andrea Magnorsky: Yeah. Yeah, totally. I did wonder how you conceptualized all of that. The other questions I have: one is on the people side, and the other one is on the design changes that might happen, and they're kind of interrelated. For one change to trigger so much failure raises the question: is the design actually quite right? So I wonder how the people side of the design change trickled down because of the incident. I'm very curious about that. How was the orchestration of the design changes that happened after the incident? Because I'm sure it wasn't like ten minutes later we changed everything.
Liz Fong-Jones: This was really, really ironic, I think: a system that we had put in place to limit the blast radius of changes paradoxically caused a system-wide outage. So here's a little bit more technical detail to unpack that. Every functioning software organization has some kind of canary deployment system; you have some mechanism of rolling out software to 1%, or even 0.1%, of traffic. I think this was even supposed to be a 0.01% traffic experiment.

I had worked on this team before, so I had the requisite context in my head when this person came onto the bridge, because I knew that there was a development cluster. The people who develop the Google Front End have the ability to push their own binaries to a subset of the cluster that serves a very, very small subset of the traffic, 0.0001% of the traffic.

But the challenge here was that, even though they had meant to serve only a small fraction of the traffic, the assumption was that if something goes wrong with that, you've only failed a vanishingly small percentage of traffic. What was not accounted for was that any host that appears in the list of available backends, even if it's set to receive less than 0.1% of the traffic, still has to have information about it in the list of available backends to send traffic to. If you intend to run a canary experiment, you have to give information about which host is the canary and what percentage of traffic to route to that canary, and that information has to be pushed out globally. And that's basically where the problem was. So this was a safety feature to test a release before it went into production; even the canary server itself didn't have any bugs, but the way it communicated "hi, I'm available to serve" caused every single host in the fleet that was talking to it to try to get information on "hey, what protocols do you support, just on the off chance I have to route traffic to you", and that's the thing that caused the system-wide outage.
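To make that blast-radius point concrete, here is an illustrative sketch with a hypothetical data model (not Google's actual control plane): a canary weighted at a tiny fraction of traffic still has to appear in the backend list that every frontend loads, so malformed canary metadata reaches the whole fleet.

    # Illustrative sketch; hypothetical data model, not Google's actual control plane.
    from dataclasses import dataclass, field

    @dataclass
    class Backend:
        host: str
        weight: float                                   # fraction of traffic this backend should get
        protocols: list = field(default_factory=list)   # what a frontend needs in order to route to it

    def routing_table(stable, canary):
        # The canary is tiny by weight, but it is still part of the list that gets
        # pushed to frontends in every region, not just the one running the canary.
        return stable + [canary]

    def load_table(region, backends):
        for b in backends:
            if not b.protocols:
                # The failure mode described above: unexpected metadata makes every
                # frontend that parses the list fall over, regardless of the weight.
                raise ValueError(f"{region}: no protocol info for {b.host}")

    stable = [Backend("gfe-prod-1", 0.9999, ["h2"])]
    canary = Backend("gfe-canary-0", 0.0001)  # malformed: protocol info missing

    for region in ["us", "eu", "asia"]:
        try:
            load_table(region, routing_table(stable, canary))
        except ValueError as err:
            print("frontend crash:", err)  # fails in every region, not just 0.01% of traffic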

Andrea Magnorsky: Right. And was there a design change because of it? I mean...
Liz Fong-Jones: There have been efforts at Google and Amazon and pretty much every single large hyperscaler to think about how you test and validate global changes, and to recognize that global changes are very, very dangerous. I think the thing that was missed here, though, was that the list of backends for an important service like this was in fact a globally distributed set.

So yes, in general, when you're working with a global system, you proceed more carefully: you have automatic rollback, you detect crashes. But if you are not aware that you should be looking for that, then it's a lot harder to test for it. I think the other factor is backwards compatibility. If you test things to make sure they are forward and backwards compatible, you're going to have a much better time than if you roll out software in a way where something can send an output that something else rejects as an input. That kind of goes with the old Unix philosophy of: be liberal in what you accept, be conservative in what you output.
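A rough sketch of the forward/backward compatibility check being described, using hypothetical record formats (not a specific Google test): an old reader should tolerate a new writer's output, and a new reader should tolerate an old writer's output.

    # Illustrative round-trip compatibility check; record formats are hypothetical.
    import json

    def old_writer():
        return json.dumps({"host": "gfe-1", "weight": 0.5})

    def new_writer():
        # A new field is added; old readers should ignore it rather than reject it.
        return json.dumps({"host": "gfe-1", "weight": 0.5, "protocols": ["h2"]})

    def old_reader(payload):
        record = json.loads(payload)
        # Be liberal in what you accept: keep only the fields this version knows about.
        return {"host": record["host"], "weight": record["weight"]}

    def new_reader(payload):
        record = json.loads(payload)
        # Default the new field when an older writer did not send it.
        record.setdefault("protocols", [])
        return record

    assert old_reader(new_writer())["host"] == "gfe-1"    # new writer -> old reader
    assert new_reader(old_writer())["protocols"] == []    # old writer -> new reader
    print("round-trip compatibility checks passed")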

Kenny Schwegler: What I liked about what you said is: if you want to really create resilience, you also need to create resilience in humans, right?

Liz Fong-Jones: Yep. Exactly.
Kenny Schwegler: I really like that quote; I'd never thought about it in that way. So what I guess you created there was psychological safety for that person to jump on the call. I have questions if you look at the system: how did management or leadership react to that person calling in and, first of all, saying, hey, it was my fault? Can you remember what the reaction was at that point? Because I think there's a lot to learn there for other organizations as well.
Liz Fong-Jones: I think the number one thing that executives at Google tried to do is to stay out of the way. They don't want to pressure the team; at the same time, they have a responsibility to make sure that an outage is being addressed with sufficient urgency and that it's being given the resources it needs. So when you have executives joining the bridge, they're not going to personally speak up and intimidate the person; even the "hey, I recognize that, and thank you" comes later. The thank-yous happen afterwards. Even when the person came onto the bridge, I as incident commander was not immediately worried about this person's feelings. I was just like: thank you for that information, now let's see whether we can validate whether your fix worked. What time did you say you landed that fix? Where can I see it rolling out? There was so much in motion at the same time.
Kenny Schwegler: Yeah, so in the moment you're focusing on fixing the problem, so you say thank you. Of course, every piece of information, every bit of transparency is helping you in that moment. And afterwards, what is done to make sure that, you know, if there are still repercussions afterwards, then...

Liz Fong-Jones: Exactly, right.

Kenny Schwegler: ...to speak up in the moment.
Liz Fong-Jones: Thorough retrospective processes: making sure we're learning from it, making sure that they know it wasn't their fault, that there are systematic changes we need to make across the system, and then making sure that we're all debriefing and talking about it. We're having this conversation today, but you can imagine that in the weeks after this incident, every Google team had at least one representative join one of these learning sessions where we talk about major outages, major incidents. And this outage was one of many incidents that we discussed in those sessions. We don't only talk about the high-profile incidents; we talk about individual learnings that you can have from any team. Just because this incident was high profile doesn't mean it was unusually worthy of learnings. Every incident is something you can learn from, and I think that's important too: you have to have a culture of continuous improvement, and you cannot just do this for the large incidents. You have to do it for the smaller incidents too.
Kenny Schwegler: Yeah. That's great. I remember one time at a small company where a developer made a mistake and it cost the company around ten thousand euros, which was a big deal for that company. And I...
Liz Fong-Jones: You can joke about it: that incident cost you $10,000, so why would you fire the employee that now has $10,000 worth of learning?
Kenny Schwegler: Yeah, we didn't. So I talked to the manager, because I was the tech lead. The manager was fine; he was like, oh, it happens. And I said, can you please call the developer, because the person is stressed.

Liz Fong-Jones: Exactly, right. Yeah. Recognition, right? You can give people a bonus. People may ask: why would you give a bonus to someone who screwed up? But the answer is, they didn't screw up. The system let them down, and they had the best possible reaction to that, which was to immediately roll back and also to call in.
Kenny Schwegler: Yeah. And what reminded me of your point that people also need to be resilient is what I saw happening the moment the manager called that dev: it's fine, you did great, everyone failed, everyone together. I can't remember the exact words, but it was very heartwarming. And if I think about resiliency, you saw that person moving back to their original state and becoming more resilient. So I like that metaphor, that humans also need to be resilient.
Andrea Magnorsky: And have you ever been in a situation that is also an incident, but what you think should have happened didn't happen? As in, the opposite thing. Have you experienced the opposite of that?
Liz Fong-Jones: Thankfully, I've been on teams that have that kind of psychological safety. I think the thing that I do see happening more often is not necessarily in the heat of the moment of the incident; often it's challenging to get investment in reliability before there is the incident.

Andrea Magnorsky: Yeah.

Liz Fong-Jones: You'd rather not have the incident, right? And I think it is potentially a problem when all of the praise and awards go to the people who do the firefighting and not the people who prevent the fire from even breaking out in the first place. I've seen organizations struggle with technical debt; I've seen organizations prioritize the short term over the long term. And what I have to keep in mind is that it's very, very difficult to expose safety as a signal aside from outages. We talk about the triad of cost, surface area and safety: you can drag the point around anywhere in that kind of three-point space, but if you stress safety too much, it'll suddenly snap. You don't really have an indication there.
Kenny Schwegler: Yeah, it reminds me a little bit of that person showing the Superman clip where first Clark Kent goes to a little boy standing at one of the waterfalls and says, hey, watch out, don't fall down, and nobody believes Clark Kent. Then all of a sudden the boy drops, Superman comes in, saves the kid, and everyone's praising Superman, while Clark Kent, with his "be careful", could have prevented that incident. So how would you know you are in a psychologically safe environment? What are the key signs you look for, or red flags even? What are some heuristics you have?
Liz Fong-Jones: I think that has a lot to do with how we talk about smaller failures before there are bigger failures. How do we make sure that we have a culture of continuous learning? Are people punished for speaking out, even on things like "hey, I might slip this deadline" or "hey, we're really behind on story points"? Those are the smaller steps you take in order to then have the psychological safety in case something major happens. And making sure that we're listening to feedback, especially in the category of "there's no such thing as a dumb question". I think it's important for people, especially newcomers to the team, that I as a senior engineer try to model this: that it's okay to ask questions that might seem obvious to someone who knows the answer. That way we emphasize that we all have things to learn, that there's no perfect expertise.
Kenny Schwegler: It reminds me of starting aikido and going up to someone with a black belt and saying, sorry, you have to train with me. And the black belt said to me, no, not at all, because I can learn so much from you. And I think that's what you're saying: there's always learning in new people. That's one thing to notice.
Andrea Magnorsky: I really like those three things that you said, your heuristics for psychological safety: can we talk about smaller things, can people talk about problems, and do people feel like they can't ask questions, that kind of fear and silence. You know when you're in a call and people go, like...
Liz Fong-Jones: Someone says an acronym you don't know, and no one on the call knows, and everyone stays quiet. That's potentially a problem.

Andrea Magnorsky: Yeah. Yep. Yep.
Kenny Schwegler: You...

Andrea Magnorsky: Yeah.

Kenny Schwegler: ...or, like, juniors talking less than seniors, right? Is that something...
Liz Fong-Jones: Yeah, definitely making sure that everyone has an opportunity to give their opinion, give their input, ask why we did something. You know, maybe that's a sign that we're missing documentation.
Andrea Magnorsky: Absolutely. And how do you feel about collaborative design techniques? What are your default collaborative practices, basically?
Liz Fong-Jones: Yeah. I think that when we are designing things, it's important that teams own software products, and therefore that teams own designs, not individuals, because that individual may transfer teams or may leave the company. So I think it's important that the people who are going to be accountable for running the system in the long term have input into it and feel ownership of it. Yes, you can have a lead author of a design, but the broader team that surrounds it needs to be in the loop, needs to have an understanding of the system, needs to understand why the decision was made. And I think that combines with psychological safety around design: the ability to ask why we made a decision, and what the conditions are under which it might be okay to revisit it, rather than "oh, that decision was made 10 years ago, I don't know why, I don't feel comfortable changing it, it happened so long ago that we've forgotten, and who knows, you might blow it up by changing that".
So it's this concept of a haunted graveyard: no one dares step foot in the haunted graveyard because there are spooky stories of the last person who did that disappearing, and no one can remember anyone ever walking in the haunted graveyard. I think we should treat design as a living conversation: when we're looking at production, we're looking to have an understanding not just of the system as designed, but also of the system as built, and how it's behaving. And I think that's true regardless of whether you were designing with an AI teammate or a human teammate; either way, it's this group of people who support the software.
Andrea Magnorsky: They build the theory, like Peter Naur says. Alright, cool. I like that "design as a living conversation" very much.
Kenny Schwegler: Thank you. I think this was it, our bite-size story. So thank you for joining us. Liz, thank you for telling your story; I think a lot of people can relate to it and hopefully get some learnings out of it themselves. If you're listening to this, watching this, or reading this, please like and subscribe so more people can join in these conversations, and we hope to see you next time. Bye-bye.
