ModernOps with Beth Long: Operational Ownership
Episode 37 • 14th December 2023 • Modern Digital Business • Lee Atchison
Duration: 00:23:16

Show Notes

Welcome to another episode of Modern Digital Business, the podcast that helps you navigate the ever-changing landscape of modernizing your applications and digital business. In this episode, we continue our exploration of modern operations with our special guest, Beth Long. Today's discussion is all about operational ownership and how it plays a crucial role in the success of modern organizations. We dive into the importance of service ownership, the measurement of SLAs, and the need for specific, measurable, attainable, relevant, and time-bound goals. Join us as we unravel the complexities of modern ops with Beth Long in this enlightening episode of Modern Digital Business. Let's dive in!



Today on Modern Digital Business

Thank you for tuning in to Modern Digital Business. We typically release new episodes on Thursdays. We also occasionally release short-topic episodes on Tuesdays, which we call Tech Tapas Tuesdays.

If you enjoy what you hear, will you please leave a review on Apple Podcasts, Podchaser, or directly on our website at mdb.fm/reviews?

If you'd like to suggest a topic for an episode or you are interested in being a guest, please contact me directly by sending me a message at mdb.fm/contact.

And if you’d like to record a quick question or comment, click the microphone icon in the lower right-hand corner of our website. Your recording might be featured on a future episode!

To ensure you get every new episode as it becomes available, please subscribe from your favorite podcast player. If you want to learn more from me, then check out one of my books, courses, or articles by going to leeatchison.com.

Thank you for listening, and welcome to the modern world of the modern digital business!

Useful Links

About Lee

Lee Atchison is a software architect, author, public speaker, and recognized thought leader on cloud computing and application modernization. His most recent book, Architecting for Scale (O’Reilly Media), is an essential resource for technical teams looking to maintain high availability and manage risk in their cloud environments. Lee has been widely quoted in multiple technology publications, including InfoWorld, Diginomica, IT Brief, Programmable Web, CIO Review, and DZone, and has been a featured speaker at events across the globe.

Take a look at Lee's many books, courses, and articles by going to leeatchison.com.

Looking to modernize your application organization?

Check out Architecting for Scale. Currently in its second edition, this book, written by Lee Atchison and published by O'Reilly Media, will help you build high-scale, highly available web applications or modernize your existing applications. Check it out! Available in paperback or on Kindle from Amazon.com and other retailers.

Don't Miss Out!

Subscribe here to catch each new episode as it becomes available.

Want more from Lee? Click here to sign up for our newsletter. You'll receive information about new episodes, new articles, new books, and courses from Lee. Don't worry, we won't send you spam, and you can unsubscribe anytime.

Mentioned in this episode:

Architecting for Scale

What does it take to operate a modern organization running a modern digital application? Read more in my O’Reilly Media book Architecting for Scale, now in its second edition. Go to: leeatchison.com/books or mdb.fm/afs.


Transcript

Speaker: Modern applications require modern operations, and modern operations require a new definition of ownership that most classical organizations must provide. Today I continue my discussion on modern ops with Beth Long. Are you ready? Let's go.

Speaker: This is the Modern Digital Business Podcast, the technical leader's guide to modernizing your applications and digital business. Whether you're a business technology leader or a small business innovator, keeping up with the digital business revolution is a must. Here to help make it easier, with actionable insights and recommendations as well as thoughtful interviews with industry experts: Lee Atchison.

Speaker: In this episode of Modern Digital Business, I continue my conversation on modern operations with my good friend, SRE engineer and operations manager Beth Long. This conversation, which focuses on service ownership and measurement, is a continuation of our conversation on SLAs in modern applications.

Speaker: In a previous episode, we talked about STOSA, and this fits very much into that idea: the idea of how you organize your teams so that each team has a certain set of responsibilities. We won't go into all the details of STOSA, but the bottom line is that ownership is critical to the STOSA model. Ownership is critical to all DevOps models. If you own a service, you're responsible for how that service performs, because other teams are depending on you to perform, and on a definition of what it means to perform. That definition of what it means to perform is what an SLA is all about.

Speaker: Yeah. So what does a good SLA look like, Beth?

Speaker: That's a great question, and it does get into measurement. That is always a hard question to answer. If you look at the textbook discussions of SLIs and SLOs, and SLAs in particular, you'll often see references to a lot of the things that are measurable. So you'll have your golden signals of error rate, latency, saturation. You have these things that allow you to say, okay, we're going to tolerate this many errors, or this many of this type of error, or this much latency. But all of that is kind of trying to distill the customer experience down into things that can be measured and put on a dashboard.

Speaker: The term SMART goals comes to mind, right? That, I think, is a good measure. I know the idea of SMART goals really hasn't been tied to SLAs too closely, but I think there are a lot of similarities here. SMART goals have five specific criteria: they're specific, measurable, attainable, relevant, and time-bound. And I think all five of those actually apply here as well. When you create your SLAs, they have to be specific. You can't just say, "Yeah, we'll meet your needs." That's not a good experience. In my mind, a good measurement is something like: we will maintain five milliseconds latency, on average, for 90% of all requests that come in. And I also like to put in an "assuming." Assuming you meet these criteria, such as the traffic load is less than X number of requests, or whatever the criteria are. So in my mind, it's a specific measurement, with bounds for what that measurement means, under stated assumptions. Something like: five milliseconds average latency for 90% of requests, assuming the request rate is less than 5,000 requests per second. And you could also add: assuming the request rate is at least 100 requests per second, because warming caches can have an effect there too. So you can have bounds on both ends. Something like that is specific. It's also measurable: all of the numbers I specified are things you could measure, things you could see. Specific, measurable. Then you want to make sure they're attainable by the service. That's your responsibility as the owner of a service. If another team says, "I need this level of performance," it is your responsibility as the owner, before you accept it, to be able to say, "Yes, I can do that." The SLAs have to be attainable for you.

Speaker: And this actually gets at something very important in implementing these sorts of things, which is to make sure that you are starting with goals that are near what you're currently, actually doing, and stepping your way toward improvement, instead of setting impossible goals and then punishing teams when they don't achieve something that was far outside their ability.

Speaker: Oh, absolutely. There are two things that make a goal bad. One is when the goal is so easy that it's irrelevant. The other is when it's so difficult that it's never hit. In the case of SLAs, you need to hit your SLA 100% of the time, but it can't be three times what you're ever going to see, because that gives you plenty of room to have all sorts of problems, and that doesn't make it relevant to the consumer of the goal. They need something better than that. That's where attainable, and then relevant, come in.

Speaker: And relevant is so important, because it's so tempting. This is where, when it's the engineers who set those goals, those objectives, in isolation, you tend to get things that are measurable and specific and attainable, but not relevant, right?

Speaker: Right. "I will guarantee my service will have a latency of less than 37 seconds for this simple request. Guaranteed. I can promise you that." And the consumer will say, "Well, I'm sorry, I need ten milliseconds." 37 seconds sounds like an absurd number, but you and I have both heard numbers like that, right? Where they're so far out of bounds that they're totally irrelevant, they're not worth even discussing.
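The bounded SLA described above (five milliseconds average latency for 90% of requests, assuming a request rate between 100 and 5,000 requests per second) can be sketched as a compliance check. This is a minimal sketch, not a production implementation; the function name and the interpretation of "on average for 90% of requests" (the fastest 90% of requests averaging under 5 ms) are assumptions for illustration.

```python
# Sketch of an SLA compliance check for one measurement window.
# All thresholds are the hypothetical numbers from the discussion above.

def sla_met(latencies_ms, window_seconds):
    """Return True/False if the window meets/misses the SLA, or None
    if the traffic assumptions don't hold (the SLA doesn't apply)."""
    request_rate = len(latencies_ms) / window_seconds

    # The "assuming" clauses: outside these bounds, no commitment is made.
    if not (100 <= request_rate <= 5000):
        return None

    # Assumed reading: the fastest 90% of requests must average <= 5 ms.
    fastest = sorted(latencies_ms)[: int(len(latencies_ms) * 0.9)]
    return sum(fastest) / len(fastest) <= 5.0

# Example: 6,000 requests over 10 seconds is 600 req/s, within bounds;
# 90% of them are fast, 10% are slow outliers.
window = [3.0] * 5400 + [50.0] * 600
print(sla_met(window, window_seconds=10))
```

Note how the traffic bounds produce a third outcome: a window outside the assumed load isn't a violation, it's simply outside the agreement.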

Speaker: Yes, and a sneakier example would be something like setting an objective around how your infrastructure is behaving, in ways that don't translate directly to a benefit to the customer. Say you own a web service that serves end users directly, and your primary measures of system health are around CPU and I/O. Well, those might tell you something about what's happening, but they are not directly relevant to the customer. You need to have those on your dashboards for when you're troubleshooting, when there is a problem, but they're not indicating the health of the system.

Speaker: Right. So: specific, measurable, attainable, relevant. Relevant means the consumers of your service have to find the measures useful. Attainable means that you, as the provider of the service, need to be able to meet them. Measurable means they need to be measurable. And specific means they can't be general-purpose and ambiguous; they have to be very specific. So all of those make sense. Does time-bound really apply here?

Speaker: I think it does, but in the sense that when you're setting these agreements, you tend to say, "This is my commitment," and you tend to measure over a span of time, and there is a sense of the clock getting reset.
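The "measure over a span of time, with the clock getting reset" idea is commonly implemented as a compliance window with an error budget. A minimal sketch, where the 99.9% availability target and the fixed 30-day window are assumed example numbers, not figures from the conversation:

```python
# Sketch: a time-bound availability SLA over a fixed compliance window.
# The 99.9% target and 30-day window are assumed example numbers.

WINDOW_MINUTES = 30 * 24 * 60   # 43,200 minutes in the window
TARGET = 0.999                  # assumed availability target

def window_report(downtime_minutes):
    """Availability so far, remaining downtime budget, and whether
    the commitment is still being met in this window."""
    budget = WINDOW_MINUTES * (1 - TARGET)   # ~43.2 minutes allowed
    availability = 1 - downtime_minutes / WINDOW_MINUTES
    met = availability >= TARGET
    return availability, budget - downtime_minutes, met

# 30 minutes of downtime this window: still within budget.
availability, budget_left, met = window_report(30)
print(f"availability={availability:.5f}, budget left={budget_left:.1f} min, met={met}")
# When the window rolls over, the clock resets and the budget refills.
```

The "clock reset" is exactly the budget refilling at the start of each new window.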

Speaker: That's true. "We'll handle this much traffic over this period of time." You're right, that's a form of time-bound. I think when people talk about SMART goals, they're really talking about the time by which you'll accomplish the goal. And what we're saying is that the time you accomplish the goal is now. It's not really a goal; it's an agreement. It's a habit rather than a goal.

Speaker: And that's actually a good point. These aren't goals, as in "I'm going to try to make this." No, this is what you're going to be performing to, and you can change and improve them over time. You can have a goal that says, "I'm going to improve my SLA over time and make my SLA twice as good by this date." That's a perfectly fine goal. But that's what a goal is, versus an SLA. Your SLA is something like five milliseconds latency at less than 10,000 requests. And you can say, "That's great; I have a goal to make it a two-millisecond latency with 5,000 requests by this time next quarter." At that point in time, your SLA is then two milliseconds. But the SLA is what it is: what you're agreeing to, committing to, now. It's a failure if you don't meet it right now. As opposed to a goal, which is what you're striving toward.

Speaker: Yeah, striving toward completing something. Right. One anecdote, a well-known one that I think is interesting to talk about here, is the example Google gave in the SRE book of actually overshooting, of having a service that was too reliable. I can't remember which service it was off the top of my head, but they had a service that they did not want to guarantee 100% uptime for, and they ended up over-delivering on quality for a while. And when that service did fail, users were incensed, because there was sort of an implicit SLA: well, it's been performing so well. What I love about that story is that they ended up deliberately introducing failures into the system so that users would not become accustomed to too high a performance level. And what this underscores is how much this is ultimately about the experience of whatever person needs to use your service. This is not a purely technical problem. This is very much about understanding how your system can be maximally healthy and maximally serve whoever is using it.

Speaker: I love that story. I didn't know it before, but it plays very well into the Netflix Chaos Monkey approach to testing. That is the idea that the way you ensure your system as a whole keeps performing is to keep causing it to fail on a regular basis, to make sure that you can handle those failures.

Speaker: So what does the Chaos Monkey do? I'm sure at some point we're going to do an episode on Chaos Monkey; matter of fact, we should add it to our list. What Chaos Monkey is all about is the idea that you intentionally insert faults into your system at irregular times, so that you can verify that the responses your application is supposed to have, to self-heal around the problems that occur, actually happen. Now, you don't do this in staging, you don't do this in dev; you do it in production. But you do it in production during times when people are around. So that if it does cause a real problem, if you turn off a service and that causes a real problem and customers are really affected, everyone's on board and you can solve the problem right away, as opposed to the exact same thing happening by chance in the early hours of the morning, when everyone's drowsy and sleepy and doesn't know what's going on. You can address the problem right then, as opposed to later on. And the other thing it helps with is the problem you were describing, which is getting too used to things working. Let's say I own Service A, and I call Service B, and I need to expect that Service B will fail occasionally. Well, I'm going to write code into Service A to do different things if Service B doesn't work. What if I introduce an error into that code that I'm not aware of, and then I deploy? It's going to function, it's going to work, everything's going to be fine, until Service B fails, and then Service A is also going to fail. But if Service B is regularly failing, you're going to notice that a lot sooner, perhaps immediately after deployment, and you're going to be able to fix the problem: roll it back if necessary, or roll forward with a fix, to get the situation resolved. The more chaotic the system you put code into, the more stable the code is going to be. It's a weird thought, but the more chaotic a system, the more stable the code in that system behaves over the long term.

Speaker: I'm so glad you bring this up. What I love about this is that we're really touching on similar themes in different contexts, because both chaos engineering and the DevOps approach are really about understanding that we don't just have a technical system, we have a sociotechnical system. We have this intertwined human and technology system. One of the advantages of DevOps is that it changes the behavior of the people who are creating the system itself. Again, if you're going to deploy code and you know that if something goes wrong it's going to wake up that person over there that you don't even know, you just build your services differently. You're not as rigorous as when you know you're going to be the one woken up at 2:00 a.m. And similarly with chaos engineering: if you know that Service B is absolutely going to fail in the coming week, you're just going to say, "Well, I may as well deal with this now," as opposed to, "Well, I'm under deadline, Service B is usually stable, I'm just going to run the risk and we'll deal with it later."
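The Service A and Service B scenario above can be sketched in code. This is a minimal illustration with hypothetical class names and a hypothetical fallback value; the point is that Service A's degraded path only stays trustworthy if Service B's failures are exercised regularly, which is exactly what deliberate fault injection does:

```python
import random

class ServiceB:
    """A hypothetical downstream dependency; chaos testing forces it
    to fail some fraction of the time."""
    def __init__(self, failure_rate=0.0):
        self.failure_rate = failure_rate

    def lookup(self, key):
        if random.random() < self.failure_rate:
            raise ConnectionError("injected fault: Service B unavailable")
        return f"value-for-{key}"

class ServiceA:
    """Service A owns its behavior when B is down: it degrades to a
    default instead of failing outright. Regular injected failures
    exercise this path, so a bug in it surfaces right after
    deployment rather than months later during a real outage."""
    def __init__(self, b):
        self.b = b

    def handle(self, key):
        try:
            return self.b.lookup(key)
        except ConnectionError:
            return "default-value"  # degraded but working response

# With faults injected 30% of the time, the fallback path runs
# constantly, so a broken fallback would be noticed immediately.
a = ServiceA(ServiceB(failure_rate=0.3))
results = [a.handle("user42") for _ in range(1000)]
assert set(results) <= {"value-for-user42", "default-value"}
```

The design choice worth noticing: the fault is injected in the dependency, not in Service A, so what gets tested is the caller's resilience logic, exactly the code that would otherwise sit unexercised until a real failure.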

Speaker: So it really drives the behavior that gets built into systems.

Speaker: Right. And the other thing I love about how you unpacked chaos engineering is that it works on this very counterintuitive idea that you should be running toward incidents and problems instead of running away from them. You should embrace them. And that will actually help you, as you said, make the system more stable, because you are proactively encountering those issues rather than letting them come to you.

Speaker: Yeah, that's absolutely great.

Speaker: That's great. Yeah, you're right. We're not talking about coding; we're talking about social systems here. We're talking about systems of people that happen to include code, as opposed to systems of code. The vast majority of incidents that happen have a social component to them, not just a code problem. It's someone who said, "This is good enough," or someone who didn't spend the time to think about whether or not it would be good enough, and therefore missed something. And these aren't bad people doing bad things. These are good people making mistakes that are caused by the environment in which they're working. That's why environment, and systems of people, and how they're structured and organized, are so important. I keep hearing people say that how you organize your company is irrelevant, that it shouldn't matter. Nothing could be further from the truth. The way you organize a company matters. I hate saying it this way, because I don't always live up to it, but how clean your desk is is a good indication of how clean the system is. I don't mean that literally, because I've had dirty desks too, but it really is a good indication: how well you organize your environment, your team, and your organization gives an indication of how well you're going to perform as a company.

Speaker: Yes. When we look at the realm of incidents, which are messy and frustrating and scary and expensive: every tech company knows that they are probably one really bad incident away from going out of business. Every company knows that there's that really bad thing that could collapse the whole structure. So incidents are really high-stakes, and that drives us to look for certainty and clarity. So we look to a lot of the things that people have been talking about for years around incident metrics. You've got your mean-time metrics: what's your mean time to resolution, or your mean time between failures? It's an attempt to bring some kind of order and sense to this very scary and chaotic world of incidents. But so many of those, what are now often called shallow incident metrics, end up giving short shrift to what we were just talking about, which is that this is a very complex system. The technology itself is very complex. The sociotechnical system is complex. We're trying to get a handle on how you surface those complexities and make them intelligible and sensible without falling back on some of these shallow metrics. Niall Murphy, one of the authors of the original SRE book, had a paper out recently where he unpacks the ways that these mean-time and other shallow metrics aren't statistically meaningful and aren't helping us make good decisions in the wake of these incidents. And so much of what we're talking about with SLAs is how you make decisions about what work you're going to do, and how much you invest in reliability versus new features. And incident follow-up is so much about what decisions we make based on what we learned in the event.

Speaker: Yeah, you add a whole new dimension to the metrics discussion here, because it's so easy to think about metrics along the lines of how we're performing, and when we don't perform, it's a failure. Oops. But there's a lot of data in the "oops," and you're right: things like mean time to detect and mean time to resolution are important, but they're very superficial compared to the depth you can get. And I'm not talking about "Joe's team caused five incidents last week; that's a problem for Joe." I'm not talking about that. I'm talking about uncovering the sophisticated connections between things that can cause problems to occur.
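To make the "shallow metrics" point concrete: a handful of hypothetical incident durations shows how a single long outlier drags the mean time to resolution away from what a typical incident actually looks like. The numbers below are invented for illustration, not from the episode:

```python
import statistics

# Hypothetical resolution times (minutes) for ten incidents:
# nine routine ones and one catastrophic outlier.
durations = [12, 9, 15, 11, 8, 14, 10, 13, 9, 480]

mean_ttr = statistics.mean(durations)      # 58.1 minutes
median_ttr = statistics.median(durations)  # 11.5 minutes

print(f"MTTR (mean): {mean_ttr} min, median TTR: {median_ttr} min")

# The mean suggests incidents take about an hour; the lived experience
# is roughly 12-minute incidents plus one 8-hour disaster. And neither
# number says anything about *why* the outlier happened, which is
# where the real learning is.
```

This is the sense in which a mean-time metric can be statistically misleading: with small, skewed samples, the aggregate describes neither the typical case nor the exceptional one.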

Speaker: Thank you for tuning in to Modern Digital Business. This podcast exists because of the support of you, my listeners. If you enjoy what you hear, will you please leave a review on Apple Podcasts or directly on our website at mdb.fm/reviews? If you'd like to suggest a topic for an episode, or you are interested in becoming a guest, please contact me directly by sending me a message at mdb.fm/contact. And if you'd like to record a quick question or comment, click the microphone icon in the lower right-hand corner of our website. Your recording might be featured on a future episode. To make sure you get every new episode when it becomes available, click subscribe in your favorite podcast player, or check out our website at mdb.fm. If you want to learn more from me, then check out one of my books, courses, or articles by going to leeatchison.com. All of these links are included in the show notes. Thank you for listening, and welcome to the world of the modern digital business.
