Ken Gavranovic was the Executive Vice President and GM for product at New Relic. In early 2019, Ken and I were in Boston together for an event, and we recorded an interview discussion about Risk Management in modern digital applications.
Both Ken and I have experience dealing with Risk Management issues in current and past assignments. I discuss Risk Management in my book, Architecting for Scale. Ken used a very similar risk management technique in his past corporate management gigs. In this interview, we compare notes and make recommendations on best practices for Risk Management that everyone can use.
Ken: I know we both talk to a lot of customers. One of the questions is, where do I get started? What are some of the patterns we see in enterprises and our own experiences? We have an awesome opportunity to talk to a lot of companies doing digital transformation, but what is something that I can just go do tomorrow to get started?
Lee: One of the things I find very easy to wrap your mind around is risk management. How do you build a risk matrix to track the issues and the risks you have within your system? I like to talk to companies about that because it gets people starting to think about what their system is doing, what problems they have, and how they deal with them. It gets them thinking beyond just the problem/resolution cycle, and more into a pro/con and risk assessment process. What is the benefit of fixing something versus the benefit of mitigating it versus the benefit of simply ignoring it? I like to talk about that because it gets conversations going within the company about the sorts of things that are important to them.
Creating a risk matrix is an important first step for anyone who is thinking about trying to improve their availability, trying to improve their scalability, or trying to modernize their application in many different ways. It helps get a grip on the issues that already exist in your system and what you are currently doing to manage those risks.
Ken: I 100% agree. I remember in a previous role, I had a couple-hundred-million-dollar project that had some challenges. We created a risk matrix, which helped us solve those challenges. So I thought it might be helpful for people watching this video. Let’s double click and see what this might look like.
From my perspective, the key questions need to be asked in a bottom-up way, not top-down. Agreed?
Lee: Yes, definitely.
Ken: It’s not people at the top of the organization that are giving you the answers. It’s the team level that gives you the answers you need. Let me give you my shot and tell me where I miss.
First of all, the things that can go into the risk matrix are the things that can go bump in the night.
Lee: Most people already have an idea of the things that keep them up at night: the things they think about and worry about on a regular basis. That is a good place to start.
Ken: That makes sense. So, bottom up, by team, just create a list. Just list all the things that we think are some sort of risk to the project. These are things you know you should be resolving, but you have a habit of prioritizing feature development work over them instead.
Next is to think about the likelihood that this risk will actually happen.
Lee: I tell people they need to think about two values for every risk item they come up with. Create a spreadsheet, and list all the risks as rows in the spreadsheet. Each individual risk, line by line. Then, for each risk, add two values in separate columns: likelihood and severity. That is, how likely is this risk to happen and, if it does happen, how much negative impact will it have?
They should do this for every risk in the matrix, before they even begin to think about fixing or mitigation.
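As a rough illustration of the spreadsheet Lee describes, here is a minimal Python sketch with one row per risk and a likelihood and severity value for each. The `Level` and `Risk` names and the three example risks are hypothetical and not from the interview.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Three-step scale, matching the Low/Medium/High suggestion."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Risk:
    description: str
    likelihood: Level  # how likely is this risk to happen?
    severity: Level    # how bad is the impact if it does happen?


# Hypothetical example rows; a real list comes from each team.
risk_matrix = [
    Risk("Primary database has no tested failover", Level.LOW, Level.HIGH),
    Risk("Deploys occasionally need a manual rollback", Level.HIGH, Level.MEDIUM),
    Risk("Internal admin page loads slowly", Level.MEDIUM, Level.LOW),
]

for risk in risk_matrix:
    print(f"{risk.description}: "
          f"likelihood={risk.likelihood.name}, severity={risk.severity.name}")
```

A plain spreadsheet works just as well; the point is only that every row gets both values before anyone discusses fixes.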
Ken: I think it’s important to share that this is what we’ve seen, not just from personal experience, but from a lot of companies that we work with.
Ken: What types of values should I use for likelihood and severity? Some people say I should score it from 1 to 10. I think that’s too granular. I like to keep it simple. Just use: Low, Medium, and High.
Lee: I agree with you. You do run into people that want to be highly analytic. They want to use numbers, say, from 1 to 100 and they end up arguing about whether a particular risk is a 35 or a 36. This is way too granular. Keep it simple.
Ken: Sometimes teams like to use their SPRINT approaches of throwing numbers, such as using cards.
Lee: Yeah, if you really want a more rigorous process, you can do something similar to the SPRINT throwing numbers approach, but just use three playing cards, say Ace, Five, and Ten. Then everyone can vote with a card and use that to determine High/Medium/Low.
But that sort of process is only for people that really want a truly analytic solution. It can be done much simpler than that. Often, items are clear to everyone that they are a high or a low or somewhere in between.
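For teams that do want the card-based vote, the tally could be sketched as below. The interview does not say how votes are combined, so the rule used here (most votes wins, ties go to the higher level) is an assumption, and `tally_votes` is a hypothetical helper name.

```python
from collections import Counter

# The three cards mentioned above, mapped to the three levels.
CARD_TO_LEVEL = {"Ace": "Low", "Five": "Medium", "Ten": "High"}
LEVEL_ORDER = ["Low", "Medium", "High"]


def tally_votes(cards):
    """Return the level with the most votes.

    Assumption: ties are broken toward the higher level, so the
    team errs on the side of caution.
    """
    if not cards:
        raise ValueError("at least one vote is required")
    counts = Counter(CARD_TO_LEVEL[card] for card in cards)
    return max(LEVEL_ORDER,
               key=lambda level: (counts[level], LEVEL_ORDER.index(level)))


print(tally_votes(["Ten", "Five", "Ten", "Ace"]))
```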
Ken: So, whether you use cards, or just use Low/Medium/High, or whatever. At the end of the day, the most important thing is to keep it simple. It’s not about a big debate.
Ken: At this stage, we are not trying to get into a great level of detail. Just a high-level description, likelihood, severity. Next thing for the matrix is, is this risk currently instrumented? Does it have observability? If this risk were to occur, would you know that it is occurring from a notification from an automated system, or would you find out from your customers telling you?
Lee: That’s a fantastic way to think about it. It’s one thing to know what’s going to happen if something goes wrong. It’s another thing to know that you’ll know when it happens.
Lee: And, certainly when we talk later about mitigation, knowing when a risk is occurring is a critical aspect of risk management. This is especially true for your high severity risks, whether they are high likelihood or low likelihood.
Ken: Kick-starting a program like this in an enterprise is obviously hard. You need top-down leadership to support this process that we are going to do.
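One way to picture the "is it instrumented?" column is a boolean on each row, plus a check for high severity risks that nobody would be alerted about. This is only a sketch; the `monitored` field name, the helper function, and the example rows are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    description: str
    likelihood: str   # "Low", "Medium", or "High"
    severity: str     # "Low", "Medium", or "High"
    monitored: bool   # True if an automated system would notify you


def unmonitored_high_severity(risks):
    """High severity risks you would first hear about from customers."""
    return [r for r in risks if r.severity == "High" and not r.monitored]


# Hypothetical example rows.
risks = [
    Risk("Queue backlog grows unbounded", "Medium", "High", monitored=False),
    Risk("Disk fills on log server", "High", "Medium", monitored=True),
    Risk("Login service outage", "Low", "High", monitored=True),
]

blind_spots = unmonitored_high_severity(risks)
for risk in blind_spots:
    print(f"No alerting for: {risk.description}")
```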
Ken: Risk matrix, containing lines with items, likelihood, severity, monitored or not monitored. Ok, what else, or is it just that simple?
Lee: Well, coming up with that list is going to get you 80% of the way to what you need. That’s because it gets you and your organization thinking about what’s going on. That’s the most important benefit of this process. You start thinking about risk and the impact risk has on your system. What’s going to happen during this risk discovery process is that the minds of the engineers in the room are quickly going to go to the next thing, which is mitigation. They are going to start to think about how to handle the risk.
But, you are right. If you get nothing done but create that list of risks and put them in the matrix during the first meeting or two, that’s all you need and your world will be a whole lot better, just by simply having that matrix.
Ken: Right. Another point I want to throw out there and see if you agree, is around RCA and incident response processes. I think when you have an incident, during the RCA you should always check if this issue was already on the risk matrix. If it was not there, then it should be added, and some time should be spent on why it wasn’t added in the first place. Maybe a team wasn’t as aggressive, and they didn’t want to put everything in the matrix. Because, going back to no surprises, you want to understand why this incident was a surprise. One of my favorite phrases is, “surprises, not a fan of giving or receiving”. If you have a risk matrix and it’s done right, anything that goes bump in the night should have been known and on the risk matrix ahead of time.
Lee: Exactly, you know, every time you have an outage or an incident of any sort, you end up with some sort of post mortem whether it’s formalized or not. One of the key questions has to be, “did you know about this ahead of time?”, and that comes back to the risk matrix. Because, if you didn’t know about it, that’s a problem. It needs to be added to the risk matrix, so you understand that risk fully. But if you did know about it, you should also verify that the actual severity of the incident matches the severity you thought it should be on the risk matrix. Were you right or wrong in your estimates? You can gain a lot of knowledge when an incident occurs by answering questions like this with the help of a risk matrix.
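The post-incident questions above can be sketched as a small check against the matrix. Here the matrix is reduced to a hypothetical dictionary of risk name to estimated severity, and the three return labels are invented for this example; updating the entry after the review is an assumption about how a team might keep the matrix current.

```python
def review_incident(matrix, risk, actual_severity):
    """Compare an incident against the risk matrix.

    matrix maps a risk name to its estimated severity
    ("Low", "Medium", or "High").
    """
    if risk not in matrix:
        # Nobody saw this coming: add it, then ask why it was missing.
        matrix[risk] = actual_severity
        return "surprise"
    estimated = matrix[risk]
    if estimated != actual_severity:
        # Known risk, but the estimate was off: correct it.
        matrix[risk] = actual_severity
        return "misjudged"
    return "confirmed"


# Hypothetical example data.
matrix = {"Payment provider outage": "Medium"}
print(review_incident(matrix, "Payment provider outage", "High"))
print(review_incident(matrix, "Expired TLS certificate", "High"))
```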
Ken: So, let’s assume that as a leader, I’ve told my organization to build a risk matrix. They’ve done the process, I now have this risk matrix. From an execution point of view, I think there are two things that need to happen next.
First, you look at the high/highs – high likelihood, high severity. In some cases, removing these risks might involve rewriting. But the high/highs that you can fix, you should prioritize the work and get them fixed.
Second, you always have business partners. I’m a big believer that you should take that risk matrix and present it, at the executive level, to your business partners. You show the high/highs, the medium/mediums, or whatever they are. Now, as a company, think about one of two things. Should we focus on fixing these high/highs or should we all take a breath and say we are willing, for whatever reason, to take this risk on as a company. You go into that with open eyes, blameless culture, and state your willingness to take that risk together.
Lee: Yes, and that’s really critical too. Because no matter what, you are not going to remove all the risk from the system. You aren’t going to fix all the problems, nor is trying to do that necessarily the right investment for you. The right level of risk is whatever level your organization – your extended organization – is comfortable with. The business cost of the risk, the development cost of fixing it, all of these things have to fit together. But once you know what your risk is, you can evaluate whether you and the culture of your company, and your customers, and the business you provide, are comfortable with that level of risk.
Now, for the things that you are not comfortable with, you have to address these right away. You have to either mitigate these risks or remove them. But the other risks, the ones where you are comfortable with the level of risk, it’s not necessarily a good investment to work on resolving those things. Because there are going to be higher priority issues you want to work on.
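To make the "high/highs first" idea concrete, the rows can be sorted so the riskiest items surface at the top. The interview only says High/High items come first; ranking the rest by multiplying the two levels is an assumption made for this sketch, and all names and example rows are hypothetical.

```python
from dataclasses import dataclass

LEVELS = {"Low": 1, "Medium": 2, "High": 3}


@dataclass
class Risk:
    description: str
    likelihood: str
    severity: str


def prioritize(risks):
    """Sort risks from most to least urgent.

    Assumption: urgency is likelihood times severity, which puts
    High/High first; a team may prefer a different ordering.
    """
    return sorted(
        risks,
        key=lambda r: LEVELS[r.likelihood] * LEVELS[r.severity],
        reverse=True,
    )


# Hypothetical example rows.
risks = [
    Risk("Stale documentation", "Low", "Medium"),
    Risk("Single point of failure in auth", "High", "High"),
    Risk("Backups never restored in a test", "Medium", "High"),
]

for risk in prioritize(risks):
    print(f"{risk.likelihood}/{risk.severity}: {risk.description}")
```

Which items sit inside the organization's comfort level is still a business decision; the sort only decides what gets discussed first.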
Ken: Another important aspect is from the funding perspective. I look at the risk matrix as a living document. My thoughts are, you should run this exercise at least twice a year. Then, when you have incidents, you should update the risk matrix to match those incidents. The risk matrix should be accurate and maintained.
Ken: Now, most companies fund on an annual basis. My perspective is a lot of times people forget about risk when it comes to funding. In some companies, what is funded are the “bright shiny objects”. That’s where the money is invested. So, for companies that are technology leaders, you should bring the risk matrix to the budgeting discussions. That way you can make sure everybody is clear and all discussions are open on what we are investing in and why we are investing in it. The risk matrix is part of the budgeting process.
Lee: Yes, it’s definitely feedback into your budgeting process. But it’s also, at a much lower level, feedback into your SPRINT planning process.
Ken: Totally agree.
Lee: You use it to determine what you can accomplish this SPRINT, and how much you want to spend on doing risk management activity during this specific SPRINT versus doing new features or dealing with other problems.
Ken: I know many enterprises that, if they are really focused on the customer experience, say High/Highs must be done first, unless it involves a full rewrite. If you go into an organization that has a lot of technical debt, that may not be the case; you do as many as you can each SPRINT.
Lee: Yes, absolutely. But the one important thing to consider is that fixing does not have to mean removing it. It might be creating a mitigation for it that reduces its severity or likelihood to an acceptable level.
Ken: It might move from a High likelihood to a medium or might take it from a High impact to a medium or low.
Lee: And just by doing that you’ve brought it down to within the comfort level of your organization. And once it’s in the comfort level of the organization, that’s a very successful place to be.
Ken: And you and I have seen this at hundreds of global companies. So, from a best-practices standpoint, it really makes a lot of sense. Have a risk matrix, update it semi-annually and when incidents happen. Review it during the RCA process. Rinse and repeat. Then, take what you have and use that in the budgeting process.
Anything else we should add?
Lee: The only additional thing is that individual teams need to own their own risk matrices – remember, they are built bottom up. Individual teams need to have responsibility for their own risk matrices and be held accountable for the content. Then, they all need to bubble up to a high-level list that is known at the highest levels of the organization.
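A sketch of how team-owned matrices might bubble up into one high-level list follows. The interview does not say which risks get promoted, so the severity cutoff here is an assumption, and the function name, data layout, and team names are hypothetical.

```python
LEVELS = {"Low": 1, "Medium": 2, "High": 3}


def roll_up(team_matrices, min_severity="High"):
    """Collect each team's most severe risks into one list.

    team_matrices maps a team name to a list of
    (description, likelihood, severity) tuples.
    Assumption: only risks at or above min_severity are promoted.
    """
    summary = []
    for team, risks in team_matrices.items():
        for description, likelihood, severity in risks:
            if LEVELS[severity] >= LEVELS[min_severity]:
                summary.append((team, description, likelihood, severity))
    return summary


# Hypothetical example data; each team still owns its full matrix.
team_matrices = {
    "checkout": [
        ("Payment provider outage", "Low", "High"),
        ("Slow cart page", "Medium", "Low"),
    ],
    "search": [
        ("Index rebuild takes hours", "Medium", "High"),
    ],
}

for team, description, likelihood, severity in roll_up(team_matrices):
    print(f"[{team}] {description} ({likelihood}/{severity})")
```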
Ken: I agree. And the initiative and guidance to do it needs to come top-down, because it’s important to the entire organization.
Ken: The actual work itself happens bottom-up. Totally agree.
I’d like to thank Ken for his involvement in this interview and for the insights he provided on the very important topic of Risk Management. Additionally, thank you to Ken for providing the recording equipment he and I used for the interview.
Tech Tapas — History of the term Cloud
What is the history of the term “cloud” as it is used in Cloud Computing?
Well, that’s an interesting question and unfortunately there are probably as many answers as there are people who work in the cloud.
So, it’s very hard to answer.
But most people who have looked into this subject believe the term originally came in the 1980s from the telephone companies of the day. Network engineers who drew diagrams of portions of their networks would often draw a big blob to indicate a portion of their network that they weren’t dealing with at the time. Simple circles, squares, and rectangles already represented real entities in their diagrams, so instead they drew the blob using interconnected rounded segments that made it look “nebulous” and nondescript in shape. It was, after all, supposed to represent a nebulous and nondescript part of the network.
In fact, this nebulous and nondescript shape looked like a cloud. So much so that these engineers started talking about the part of the network that they weren’t focusing on at the time — the part external to their area of concern — as being out in the cloud.
This usage spread to software architects as they built their software diagrams and flow charts as well. They used the symbol for a similar purpose.
The term cloud computing, though, is much more recent. Some would argue that early server farms were really cloud computing, but the term wasn’t in popular use back then. Some would say that Google Cloud Platform (GCP) provided some of the earliest cloud computing technology to the industry. Others would say that Salesforce.com was a major early creator of cloud technology. But I think the real mainstream usage of the term cloud computing among technical professionals was popularized with the start of Amazon Web Services. AWS mainstreamed cloud computing, and hence mainstreamed the term cloud computing. This happened in the mid-2000s.
But software running in the cloud was still something that was reserved for technical people to talk about. Mainstream non-techy people didn’t yet know what the term cloud was all about. This was certainly my personal experience. I was working at AWS in the early days, and I had a hard time telling my non-techy friends what I did for a living. They just didn’t know what I meant by “cloud computing”. They didn’t know what I meant when I said I worked “in the cloud”. They didn’t understand what the cloud was.
Like many of the changes in modern popular tech culture, that changed once Apple came into the picture. On October 12, 2011, Apple introduced iCloud to the Apple universe, and overnight, the word “cloud” was a part of mainstream culture. Those friends of mine who couldn’t understand what I was doing when I said I “worked in the cloud” now understood, at least at some level, what the cloud was all about. Apple brought the term cloud to the mainstream.
I know I will likely get many people who disagree with my analysis on who invented the term cloud and who popularized its use. That’s because there really isn’t a single right answer. But I do believe that the biggest events in the history of the term “cloud computing” were, in order: first the network engineers, then AWS, then Apple. Each of those three groups played a role in bringing the word “cloud” into our everyday lives.