Speaker:
00:00:00
Modern applications require modern operations, and modern
Speaker:
00:00:03
operations require a new definition of ownership that
Speaker:
00:00:07
most classical organizations must provide.
Speaker:
00:00:10
Today I continue my discussion on modern ops with Beth Long.
Speaker:
00:00:14
Are you ready? Let's go.
Speaker:
00:00:18
This is the Modern Digital Business Podcast, the technical
Speaker:
00:00:22
leader's guide to modernizing your applications and digital business.
Speaker:
00:00:26
Whether you're a business technology leader or a small business
Speaker:
00:00:29
innovator, keeping up with the digital business revolution is a
Speaker:
00:00:32
must. Here to help make it easier with actionable insights and
Speaker:
00:00:36
recommendations, as well as thoughtful interviews with industry experts: Lee Atchison.
Speaker:
00:00:40
In this episode of Modern Digital Business,
Speaker:
00:00:44
I continue my conversation on Modern Operations with my good
Speaker:
00:00:48
friend, SRE and operations manager Beth
Speaker:
00:00:51
Long. This conversation, which focuses on service
Speaker:
00:00:54
ownership and measurement, is a continuation of our
Speaker:
00:00:57
conversation on SLAs in Modern Applications.
Speaker:
00:01:02
In a previous episode, we talked about STOSA, and this fits very much into
Speaker:
00:01:05
that idea: the idea of how you organize
Speaker:
00:01:09
your teams so that each team has a certain
Speaker:
00:01:13
set of responsibilities. We won't go into all the details of STOSA, but the bottom
Speaker:
00:01:17
line is that ownership is critical to the
Speaker:
00:01:20
STOSA model. Ownership is critical to all DevOps
Speaker:
00:01:23
models. If you own a service, you're responsible
Speaker:
00:01:27
for how that service performs, because other teams are depending on
Speaker:
00:01:31
you to perform. But what does it mean to perform?
Speaker:
00:01:35
The definition of what it
Speaker:
00:01:39
means to perform is what an SLA is all about.
Speaker:
00:01:43
Yeah. So what does a good SLA look like,
Speaker:
00:01:48
Beth? That's a great question. Let's get to the measurement.
Speaker:
00:01:53
It does get into measurement.
Speaker:
00:01:59
That is always a hard question to answer.
Speaker:
00:02:03
If you look at the textbook
Speaker:
00:02:07
discussions of SLIs and SLOs and
Speaker:
00:02:10
SLAs in particular, you'll often see references
Speaker:
00:02:14
to a lot of the things that are measurable. So you'll
Speaker:
00:02:17
have your golden signals of error rate,
Speaker:
00:02:21
latency, saturation. So you have
Speaker:
00:02:25
these things that allow you to say, okay,
Speaker:
00:02:29
we're going to tolerate this many errors,
Speaker:
00:02:32
or this many of this type of error, this much
Speaker:
00:02:36
latency. But all of that is kind of trying
Speaker:
00:02:40
to distill down the customer experience
Speaker:
00:02:44
into these things that can be measured and
Speaker:
00:02:47
put on a dashboard.
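To make that concrete, here is a minimal sketch, in Python, of distilling raw request data into those dashboard-ready numbers. It's illustrative only; the RequestSample shape and the field names are hypothetical, not something from this episode.

    from dataclasses import dataclass

    @dataclass
    class RequestSample:
        latency_ms: float  # how long this request took
        ok: bool           # whether it succeeded

    def signal_summary(samples: list[RequestSample], window_s: float) -> dict:
        """Distill per-request data into numbers you can put on a dashboard."""
        n = max(len(samples), 1)  # guard against an empty window
        return {
            "traffic_rps": len(samples) / window_s,
            "error_rate": sum(1 for s in samples if not s.ok) / n,
            "avg_latency_ms": sum(s.latency_ms for s in samples) / n,
            # Saturation usually comes from resource metrics (CPU, queue
            # depth) rather than from the request stream itself.
        }

The term SMART goals comes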
Speaker:
00:02:51
to mind, right. That, I think, is a good
Speaker:
00:02:55
measure. I know the idea of SMART goals really hasn't been tied to
Speaker:
00:02:59
SLAs too closely, but I think there's a lot of similarities here. So
Speaker:
00:03:03
SMART goals are five specific criteria. They're specific,
Speaker:
00:03:07
measurable, attainable,
Speaker:
00:03:10
relevant, and time bound. So
Speaker:
00:03:14
now I think all five of those actually apply here
Speaker:
00:03:18
as well, too. Right. When you create your SLAs,
Speaker:
00:03:21
they have to be specific. You can't say, yeah, we'll meet your
Speaker:
00:03:25
needs. That's not a good experience. But
Speaker:
00:03:29
in my mind, a good measurement is something
Speaker:
00:03:32
like, we will maintain
Speaker:
00:03:36
five milliseconds latency on average
Speaker:
00:03:40
for 90% of all requests that come in.
Speaker:
00:03:44
And I also like to put in an assuming:
Speaker:
00:03:47
assuming you meet these criteria, such
Speaker:
00:03:50
as the amount of traffic: the traffic load is
Speaker:
00:03:54
less than X, the number of requests permitted, or whatever the
Speaker:
00:03:57
criteria is. So in my mind, it's a specific
Speaker:
00:04:01
measurement with bounds for what that
Speaker:
00:04:05
means, under assumptions. And these are the
Speaker:
00:04:08
assumptions. So something like five
Speaker:
00:04:12
milliseconds average latency for 90% of requests,
Speaker:
00:04:16
assuming the request rate is less than
Speaker:
00:04:20
5000 requests per second,
Speaker:
00:04:23
and assuming both those things occur. And you could also have assuming the
Speaker:
00:04:27
request rate is at least 100 per second, because
Speaker:
00:04:30
warming caches can have an effect there too. And things
Speaker:
00:04:34
like that. So you can have bounds in both directions.
Speaker:
00:04:38
Something like that is very specific. It's
Speaker:
00:04:42
measurable. All of those numbers I specified are all things you could
Speaker:
00:04:45
measure. They're something you could see.
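As a sketch of how mechanically checkable such an agreement is, here is what evaluating it might look like. The SlaWindow shape and thresholds just echo the example numbers above, and reading "five milliseconds for 90% of requests" as a 90th-percentile check is one reasonable interpretation, not the only one.

    from dataclasses import dataclass
    from statistics import quantiles

    @dataclass
    class SlaWindow:
        latencies_ms: list[float]  # per-request latencies seen in the window
        duration_s: float          # length of the measurement window, seconds

    def sla_met(window: SlaWindow) -> bool | None:
        """True/False if the SLA applies; None if its assumptions don't hold."""
        rate = len(window.latencies_ms) / window.duration_s
        if not (100 <= rate <= 5000):
            return None  # outside the agreed traffic bounds: no promise made
        # 90th-percentile latency (needs at least two samples).
        p90 = quantiles(window.latencies_ms, n=10)[-1]
        return p90 <= 5.0

Every clause in the agreement, the promise and both assumptions, maps onto something you can compute from observed traffic.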
Speaker:
00:04:49
Specific, measurable. You want to make sure they're attainable within
Speaker:
00:04:52
the service. That's your responsibility as the owner of a
Speaker:
00:04:56
service. If another team says, I need
Speaker:
00:05:00
this level of performance, it is your responsibility as the owner, before
Speaker:
00:05:04
you accept it, to say yes, I can do that. So they have to
Speaker:
00:05:07
be attainable to you. And this actually gets at something very
Speaker:
00:05:11
important in implementing these sorts of things, which is to make sure that
Speaker:
00:05:15
you are starting with goals that are near what you're currently
Speaker:
00:05:19
actually doing and step your way towards
Speaker:
00:05:22
improvement instead of setting impossible goals. And then
Speaker:
00:05:26
punishing teams when they don't achieve something that was so far outside of
Speaker:
00:05:30
their ability. Oh, absolutely. There are two things that make a
Speaker:
00:05:33
goal bad. One is when the goal is so easy that
Speaker:
00:05:37
it's irrelevant. The other one is when it's so difficult that it's never
Speaker:
00:05:41
hit. You should set
Speaker:
00:05:45
goals that are achievable. In the case of
Speaker:
00:05:48
SLAs, your goal needs to hit the
Speaker:
00:05:51
SLA 100% of the time, but it
Speaker:
00:05:55
can't be three times what you are ever
Speaker:
00:05:59
going to see, because that gives you plenty of room
Speaker:
00:06:03
to have all sorts of problems, and that doesn't make it relevant to
Speaker:
00:06:06
the consumer of the goal. They need something better than that. That's
Speaker:
00:06:10
where attainable and relevant come in. And
Speaker:
00:06:14
relevant is so important, because it's so tempting. This is where, when
Speaker:
00:06:17
it's the engineers that set those goals, those
Speaker:
00:06:21
objectives, in isolation, you tend to get things that are
Speaker:
00:06:24
measurable and specific and
Speaker:
00:06:28
attainable but not relevant, right? I will
Speaker:
00:06:31
guarantee my service will have a latency of less than
Speaker:
00:06:35
37 seconds for this simple request. Guaranteed. I
Speaker:
00:06:39
can promise you that, right? And the consumer will say
Speaker:
00:06:43
well, I'm sorry, I need ten milliseconds. Now, 37 seconds
Speaker:
00:06:47
sounds like an absurd number, but you and I have both
Speaker:
00:06:51
heard numbers like that right? Where they're so far out of bounds they're
Speaker:
00:06:54
totally irrelevant, they're not worth even discussing.
Speaker:
00:06:58
Yes and a sneakier example would be something
Speaker:
00:07:01
like setting an objective
Speaker:
00:07:04
around how your infrastructure is behaving in ways that
Speaker:
00:07:08
don't translate directly to
Speaker:
00:07:12
the benefit to the customer. If you own a web
Speaker:
00:07:15
service that is serving directly to end
Speaker:
00:07:18
users, and your primary measures of
Speaker:
00:07:22
system health are around
Speaker:
00:07:26
CPU and I/O.
Speaker:
00:07:29
Well, those might tell you something about what's
Speaker:
00:07:32
happening, but they are not directly
Speaker:
00:07:36
relevant to the customer. You need to have those on your dashboards for when
Speaker:
00:07:40
you're troubleshooting, when there is a problem, but that's not indicating the health
Speaker:
00:07:44
of the system. Right. So: specific, measurable,
Speaker:
00:07:47
attainable, relevant. So relevant
Speaker:
00:07:51
means the consumer of your service has to find them
Speaker:
00:07:55
to be useful. Attainable means that you as provider
Speaker:
00:07:59
of the service, need to be able to meet them. Measurable
Speaker:
00:08:02
means they need to be measurable. And specific:
Speaker:
00:08:06
They can't be general purpose and ambiguous. They have to
Speaker:
00:08:10
be very specific. So all those make sense. Does time bound really apply
Speaker:
00:08:14
here? I think it does, but in the sense
Speaker:
00:08:17
that when you're setting these agreements,
Speaker:
00:08:22
you tend to say, this is my commitment, and
Speaker:
00:08:26
you tend to measure over a span of time and
Speaker:
00:08:30
there is a sense of the clock getting reset.
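One sketch of that clock-reset idea: measure compliance over a rolling window, letting old samples age out. The 30-day window, the ComplianceWindow name, and the per-check met flag are all hypothetical.

    import time
    from collections import deque

    class ComplianceWindow:
        """Fraction of recent checks that met the agreement."""

        def __init__(self, window_s: float = 30 * 24 * 3600):  # 30-day window
            self.window_s = window_s
            self.samples = deque()  # (timestamp, met?) pairs

        def record(self, met: bool) -> None:
            now = time.time()
            self.samples.append((now, met))
            # Old samples age out: this is the clock resetting.
            while self.samples and self.samples[0][0] < now - self.window_s:
                self.samples.popleft()

        def compliance(self) -> float:
            if not self.samples:
                return 1.0  # no data yet: vacuously compliant
            return sum(met for _, met in self.samples) / len(self.samples)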
Speaker:
00:08:33
That's true. We'll handle this much traffic
Speaker:
00:08:37
over this period of time. You're right. That's a form of time bound. I think
Speaker:
00:08:41
when you talk about SMART goals, they're really talking about the time
Speaker:
00:08:44
when you'll accomplish the goal. And what we're saying
Speaker:
00:08:48
is the time you accomplish the goal is now. It's
Speaker:
00:08:52
not really a goal, it's an agreement. It's more
Speaker:
00:08:55
a habit than a goal.
Speaker:
00:09:01
And that's actually a good point. These aren't goals.
Speaker:
00:09:06
It's not, I'm going to try to make this. No, this is what you're
Speaker:
00:09:10
going to be performing to. And you can change them and improve them over time.
Speaker:
00:09:14
You can have a goal that says I'm going to improve my
Speaker:
00:09:17
SLA over time and make
Speaker:
00:09:21
my SLA twice as good by this date.
Speaker:
00:09:25
That's a perfectly fine goal. But that's what a goal is
Speaker:
00:09:29
versus an SLA, which says your SLA is
Speaker:
00:09:33
something like five millisecond latency
Speaker:
00:09:36
with less than 10,000 requests. And you can say, that's
Speaker:
00:09:40
great, I have a goal to make it a two millisecond latency
Speaker:
00:09:44
with 5000 requests, by this time next
Speaker:
00:09:48
quarter. And at that point in time, your SLA is now two
Speaker:
00:09:52
milliseconds. But the SLA is what it is and
Speaker:
00:09:55
what you're agreeing to and committing to now. It's a
Speaker:
00:09:59
failure if you don't meet it right
Speaker:
00:10:03
now. As opposed to a goal, which is what you're striving towards.
Speaker:
00:10:07
Yeah, towards completing something. Right.
Speaker:
00:10:12
One anecdote, a well known anecdote, that I
Speaker:
00:10:16
think is interesting to talk about here is
Speaker:
00:10:20
the example that Google gave. This is in the SRE
Speaker:
00:10:24
book of actually
Speaker:
00:10:28
overshooting and having a service that
Speaker:
00:10:32
was too reliable. I can't remember which service it was
Speaker:
00:10:36
off the top of my head, but they actually had a service that they did
Speaker:
00:10:39
not want to guarantee 100% uptime, but they ended up
Speaker:
00:10:43
over-delivering on quality for a while.
Speaker:
00:10:46
And when that service did fail,
Speaker:
00:10:50
users were incensed because there was sort of this
Speaker:
00:10:54
implicit SLA: well, it's been performing so well.
Speaker:
00:10:58
And so what I love about that story is that they ended
Speaker:
00:11:01
up deliberately introducing failures into the system
Speaker:
00:11:05
so that users would not become accustomed to too high of
Speaker:
00:11:09
a performance level. And what this
Speaker:
00:11:12
underscores is how much this is about
Speaker:
00:11:16
ultimately the experience of whatever person it is
Speaker:
00:11:20
that needs to use your service. This is not a purely
Speaker:
00:11:23
technical problem. This is very much about understanding
Speaker:
00:11:27
how your system can be maximally healthy
Speaker:
00:11:31
and maximally serve
Speaker:
00:11:35
whoever it is that's using it. So I love that story. I
Speaker:
00:11:38
didn't know that story before, but it plays very well into
Speaker:
00:11:43
the Netflix Chaos Monkey approach to testing. And that is
Speaker:
00:11:46
the idea that the way you ensure
Speaker:
00:11:50
your system as a whole keeps performing is you keep causing it to fail on
Speaker:
00:11:54
a regular basis to make sure that you can handle those failures.
Speaker:
00:11:58
So what the Chaos Monkey does, and I'm sure at some point in time we're
Speaker:
00:12:01
going to do an episode on Chaos Monkey. Matter of fact, we should add it
Speaker:
00:12:04
to our list. What Chaos Monkey is all about is the idea
Speaker:
00:12:07
that you intentionally insert faults into your system
Speaker:
00:12:14
at irregular times so that you can
Speaker:
00:12:20
verify that the
Speaker:
00:12:23
response your application is supposed to have, to self-heal around the
Speaker:
00:12:27
problems that are occurring, can be tested to make sure they
Speaker:
00:12:31
occur. Now, you don't do this in staging, you don't do this in
Speaker:
00:12:34
dev, you do it in production. But you do it in production
Speaker:
00:12:38
during times when people are around. So that if
Speaker:
00:12:42
it does cause a real problem, if you turn off the service
Speaker:
00:12:45
and that causes a real problem and customers are really affected,
Speaker:
00:12:49
everyone's on board and you can solve the problem right away as opposed
Speaker:
00:12:53
to the exact same thing happening by happenstance at
Speaker:
00:12:56
2:00 in the morning when everyone's drowsy and sleeping and
Speaker:
00:13:00
not knowing what's going on. You can address the problem right there
Speaker:
00:13:04
right then as opposed to later on. And the other
Speaker:
00:13:08
thing it helps with is this problem that you were addressing which
Speaker:
00:13:11
is getting too
Speaker:
00:13:15
used to things working. So if you deploy a new
Speaker:
00:13:19
change and let's say I own a service, and one of the
Speaker:
00:13:22
things I'm doing in service A is I call Service B, and I need to
Speaker:
00:13:26
expect that Service B will fail occasionally. Well, I'm going to write
Speaker:
00:13:30
code into Service A to do different things if Service B
Speaker:
00:13:33
doesn't work. Well, what if I introduce an error in that
Speaker:
00:13:37
code that I'm not aware of and then I deploy my
Speaker:
00:13:41
code? Well it's going to function, it's going to work,
Speaker:
00:13:44
everything's going to be fine until Service B fails and Service A is also going
Speaker:
00:13:48
to fail. But if Service B is regularly
Speaker:
00:13:52
failing, you're going to notice that a
Speaker:
00:13:56
lot sooner, perhaps immediately after deployment,
Speaker:
00:13:59
and you're going to be able to fix that problem, roll it back if necessary,
Speaker:
00:14:03
or roll forward with a fix to it to
Speaker:
00:14:06
get the situation resolved.
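A sketch of that scenario: Service A's fallback path around Service B, plus a chaos-style wrapper that injects failures on purpose so the fallback is exercised continuously. The names and the 5% injection rate are hypothetical, not Netflix's actual implementation.

    import random

    class ServiceBError(Exception):
        pass

    def call_service_b() -> str:
        # Stand-in for the real network call to Service B.
        return "fresh-data"

    def call_service_b_with_chaos(failure_rate: float = 0.05) -> str:
        # Inject the very failure we claim to tolerate, so the fallback
        # below runs regularly instead of only on the rare day Service B
        # actually goes down.
        if random.random() < failure_rate:
            raise ServiceBError("injected fault")
        return call_service_b()

    def service_a_handler() -> str:
        try:
            return call_service_b_with_chaos()
        except ServiceBError:
            # Fallback path. If a bug lurks here, regular injected failures
            # surface it right after deployment, not months later.
            return "cached-data"

The more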
Speaker:
00:14:10
chaotic an environment you put code into, the more stable the
Speaker:
00:14:13
code is going to be. It's a weird thought
Speaker:
00:14:17
to think that way, but the more chaotic a system, the
Speaker:
00:14:21
more stable the code that's in that system behaves
Speaker:
00:14:25
over the long term. I'm so glad you bring this up. And what I
Speaker:
00:14:28
love about this is that we're really touching
Speaker:
00:14:32
on similar themes in different contexts
Speaker:
00:14:35
because both Chaos Engineering and the DevOps
Speaker:
00:14:39
approach are really about
Speaker:
00:14:43
understanding that we don't just have a technical system,
Speaker:
00:14:46
we have a sociotechnical system. We have this intertwined human and
Speaker:
00:14:50
technology system. And so with DevOps, one
Speaker:
00:14:54
of the advantages of DevOps is that it changes the behavior of the people
Speaker:
00:14:58
who are creating the system itself. Because
Speaker:
00:15:01
again, if you're going to deploy code
Speaker:
00:15:05
and you know that if something goes wrong, it's going to wake up that person
Speaker:
00:15:08
over there that you don't even know.
Speaker:
00:15:12
You just build your services differently.
Speaker:
00:15:16
You're not as rigorous as
Speaker:
00:15:19
when you know you're going to be the one woken up at 2:00 a.m. And
Speaker:
00:15:23
similarly with chaos engineering, if you know that
Speaker:
00:15:26
Service B is absolutely going to fail in the coming
Speaker:
00:15:30
week, you're just going to be like, well, I may as well deal with this
Speaker:
00:15:34
now. As opposed to, well, I'm under a deadline, Service B is usually
Speaker:
00:15:37
stable. I'm just going to run the risk and we'll deal with it later.
Speaker:
00:15:41
So it really drives the behavior that gets inserted into
Speaker:
00:15:44
systems. Right. And the other thing
Speaker:
00:15:48
I love about how you kind of unpacked chaos
Speaker:
00:15:52
engineering is that it does work
Speaker:
00:15:56
on this very counterintuitive idea that you should be
Speaker:
00:15:59
running towards incidents and problems
Speaker:
00:16:03
instead of running away from them, you should embrace them.
Speaker:
00:16:06
And that will actually help you, as you said,
Speaker:
00:16:10
make the system more stable because you
Speaker:
00:16:13
are proactively encountering those issues rather
Speaker:
00:16:17
than letting them come to you. Yeah, that's absolutely great.
Speaker:
00:16:21
That's great. Yeah, you're right.
Speaker:
00:16:25
We're not talking about coding. We're talking about social systems here. We're
Speaker:
00:16:29
talking about systems of people that happen to include
Speaker:
00:16:32
code as opposed to systems of code. And the vast
Speaker:
00:16:36
majority of incidents that happen have a
Speaker:
00:16:40
social component to it, not just a code
Speaker:
00:16:43
problem. It's someone who said this is good
Speaker:
00:16:47
enough or someone who didn't spend the time
Speaker:
00:16:50
to think about whether or not it would be good enough, and
Speaker:
00:16:54
therefore missed something. Right. And these aren't bad
Speaker:
00:16:58
people doing bad things. These are good people that are making mistakes that
Speaker:
00:17:01
are caused by the environment in which they're
Speaker:
00:17:05
working. And that's why environment and
Speaker:
00:17:09
systems of people and how they're structured and how they're organized
Speaker:
00:17:12
is so important. I keep hearing people
Speaker:
00:17:16
say how you
Speaker:
00:17:20
organize your company is irrelevant. Right? It shouldn't matter.
Speaker:
00:17:24
Nothing could be further from the truth. It matters, the
Speaker:
00:17:28
way you organize a company.
Speaker:
00:17:32
I hate saying it this way, because I don't always follow it myself, but
Speaker:
00:17:34
how clean your desk is is a good indication of how clean the system
Speaker:
00:17:38
is. And I don't mean that literally because I've had dirty
Speaker:
00:17:42
desks too, but it really is a good indication
Speaker:
00:17:45
here. How well you organize your
Speaker:
00:17:49
environment, how well you organize your team,
Speaker:
00:17:52
how well you organize your organization,
Speaker:
00:17:57
gives an indication of how well you're going to perform as a
Speaker:
00:18:00
company from that standpoint. Yes,
Speaker:
00:18:06
when we look at the realm of incidents which
Speaker:
00:18:10
are messy and frustrating and scary and expensive,
Speaker:
00:18:14
and every tech company knows that they
Speaker:
00:18:17
are probably one really
Speaker:
00:18:21
bad incident away from going out of
Speaker:
00:18:24
business, every company knows
Speaker:
00:18:28
that there's that really bad
Speaker:
00:18:32
thing that could collapse the whole
Speaker:
00:18:36
structure. And so incidents are really high
Speaker:
00:18:39
stakes, but
Speaker:
00:18:43
that drives us to look for certainty and look for clarity. And
Speaker:
00:18:46
so we look to a lot of these things that people have been talking
Speaker:
00:18:50
about for years around incident metrics. So you've
Speaker:
00:18:54
got your mean time metrics, what's your mean time to resolution
Speaker:
00:18:58
or your mean time between failures. And it's this attempt
Speaker:
00:19:01
to bring some kind of order
Speaker:
00:19:06
and sense to this very scary and chaotic world
Speaker:
00:19:10
of incidents. But so
Speaker:
00:19:13
many of those, what are now often being called shallow
Speaker:
00:19:17
incident metrics, end up giving short
Speaker:
00:19:21
shrift to what we were just talking about, which is that
Speaker:
00:19:25
this is a very complex system.
Speaker:
00:19:29
The technology itself is very complex. The
Speaker:
00:19:33
sociotechnical system is complex.
Speaker:
00:19:36
We're trying to kind of get a handle on
Speaker:
00:19:40
how do you surface those complexities and make them
Speaker:
00:19:43
intelligible and make them sensible without
Speaker:
00:19:47
falling back to some of these shallow metrics.
Speaker:
00:19:52
Niall Murphy, going back to SRE, one of the authors of
Speaker:
00:19:55
the original SRE book, had a paper out recently where he kind
Speaker:
00:19:59
of unpacks the ways that these mean time
Speaker:
00:20:03
and other shallow metrics aren't
Speaker:
00:20:06
statistically meaningful and
Speaker:
00:20:09
aren't helping us make good decisions
Speaker:
00:20:13
in the wake of these incidents.
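A toy illustration of the statistical problem: incident durations tend to be heavily skewed, many small events and a few huge ones, so a single mean tracks the outliers rather than the typical incident. The numbers here are made up.

    from statistics import mean, median

    # Hypothetical incident durations, in minutes: mostly small, one huge.
    durations = [12, 8, 15, 9, 11, 7, 14, 480]

    print(f"MTTR (mean): {mean(durations):.1f} min")   # 69.5, dragged by the outlier
    print(f"median:      {median(durations):.1f} min") # 11.5, the typical incident

And so much of what we're talking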
Speaker:
00:20:17
about with SLAs is: how do
Speaker:
00:20:20
you make decisions about what work you're going to do and how
Speaker:
00:20:24
much you invest in reliability versus new features
Speaker:
00:20:28
and incident follow up is so much about what
Speaker:
00:20:31
decisions do we make based on what we learned
Speaker:
00:20:35
in this event. Yeah, you add a whole new dimension
Speaker:
00:20:39
here to the metric discussion, because
Speaker:
00:20:43
it's so easy to think about metrics along the line of
Speaker:
00:20:46
how we're performing, and when we don't perform, it's a failure,
Speaker:
00:20:50
oops. But there's a lot of data in the
Speaker:
00:20:53
oops. And you're right, things like mean time
Speaker:
00:20:57
to detect and mean time to resolution. And those are important,
Speaker:
00:21:01
but they're very superficial compared to the depth that you
Speaker:
00:21:05
can get. And I'm not talking about Joe's team caused five
Speaker:
00:21:08
incidents last week. That's a problem for Joe. I'm not talking about
Speaker:
00:21:12
that. I'm talking about the
Speaker:
00:21:15
uncovering of
Speaker:
00:21:18
the sophisticated connections between
Speaker:
00:21:22
things that can cause problems to occur.
Speaker:
00:21:28
Thank you for tuning in to Modern Digital Business. This
Speaker:
00:21:31
podcast exists because of the support of you, my listeners.
Speaker:
00:21:35
If you enjoy what you hear, will you please leave a review on Apple
Speaker:
00:21:38
Podcasts or directly on our website at
Speaker:
00:21:42
mdb.fm/reviews. If you'd like to suggest a topic for an
Speaker:
00:21:46
episode or you are interested in becoming a guest, please contact
Speaker:
00:21:50
me directly by sending me a message at
Speaker:
00:21:53
mdb.fm/contact. And if you'd like to record a quick question or
Speaker:
00:21:57
comment, click the microphone icon in the lower right hand corner of our
Speaker:
00:22:00
website. Your recording might be featured on a future
Speaker:
00:22:04
episode. To make sure you get every new episode when they become
Speaker:
00:22:08
available, click subscribe in your favorite podcast player
Speaker:
00:22:11
or check out our website at mdb.fm. If
Speaker:
00:22:15
you want to learn more from me, then check out one of my books, courses
Speaker:
00:22:18
or articles by going to leeatchison.com, and
Speaker:
00:22:22
all of these links are included in the show notes. Thank you for
Speaker:
00:22:26
listening and welcome to the world of the modern digital business.