Rick Stewart, Chief Software Technologist at DLT Solutions, joins Tech Transforms to give insight on Open Source, Platform One, and DORA initiatives. Listen in as Carolyn and Mark learn about the importance of focusing on the right metrics when managing security bottlenecks.
Carolyn: Today, we get to talk to Rick Stewart, a good friend. Rick Stewart is the Chief Software Technologist at DLT, with more than 34 years in the industry. Do you really want me to tell people that, Rick? That makes you sound super old.
Rick: No, it has some relation to the old way of doing things, traditional ways.
Carolyn: He knows the old stuff and the new stuff, with 34 years of diverse experience in the IT industry. He has progressed through technical and leadership roles in telecommunications, mobile entertainment, the federal government, and the manufacturing industries. Today, Rick is joining us to talk about DevOps Research and Assessment, or DORA, a term that is new to me. He'll also talk about the four key metrics for increasing efficiency and delivering service, and discuss how Platform One has advanced the cultural transformation to DevOps.
Mark: Welcome Rick. By the way, Rick started this when he was six.
Carolyn: That's right. I'm going to be honest. I've been in the industry for a while, and I have never heard the term DORA. DevOps Research and Assessment makes sense; I just haven't heard the acronym. They have four key metrics for increasing efficiency in delivering service. Those metrics are deployment frequency, lead time for changes, change failure rate, and time to restore service. Will you unpack those for us?
Rick: It's interesting that you say that because I attend several different events and conferences where we have, especially in the public sector, astute people that have lots of experience.
Rick: They're on this journey of DevOps, or in the public sector it's more DevSecOps, bringing security up as a first-class citizen. They were talking about the things that they capture, the journey that they're on, and their improvements. On one of these occasions, DORA was brought up. I think it may have been during a Q&A panel. It was surprising that a lot of them didn't know what this organization does, especially being so well versed in the cultural transformation yet not knowing some of the things to focus on. I thought it was really important to shine a light on it.
Carolyn: Is it a federal organization?
Rick: No, it's more of a community-based organization, an industry-based organization. We've got people like Jez Humble and Gene Kim and others that are involved with this. What they do is, they go out and they do surveys of not just the public sector, but the private sector, all organizations globally. They basically give them surveys and they talk about their experience, where they're at in the spectrum of their journey, and what they have discovered through this analysis. It's a really deep, long analysis.
There's a book called Accelerate that was written by Nicole Forsgren. She has a PhD and performed painstaking analysis of these organizations and these teams, asking them a series of questions. What it boiled down to is that a lot of traditional metrics that have been ingrained in the industry, like lines of code from the mainframe days, complexity measures, function points, etc., are somewhat useful but have become less useful over the years. As the industry has changed into more service-oriented or even microservice-oriented architectures, those types of metrics matter less.
Rick: So, when you're talking about a cultural transformation of getting development teams and operations teams working in unison and collaborating, these four metrics were decidedly important to focus on in order to strive towards that collaborative effort. They indicate the ability to deliver software with high quality and to rectify changes or security vulnerabilities quickly. I'll go through each one of them. Deployment frequency is how often an organization successfully releases a product to production. A product in this case could be a service, any kind of workload, or an application. There are nuances to that.
There's an old saying that says, if something is difficult to do, do it more often and you'll get better at it and it will become less difficult. So this deployment frequency talks to that. You have to measure how many times you're deploying a particular change into production. That way, you can, A, determine your impact, the value you're having on your stakeholders, but also the ability to measure how frequently you can deliver that value.
I'll go back and forth between the private and public sectors. The public sector industry days are very interesting to me, not only because that's the space I'm working in, but more importantly because they crystallized the importance of service delivery, frequency, and speed. It was a Navy captain who was giving an industry day presentation because they wanted to develop a DevOps prototype. One thing that struck me was when he said, I can't wait two weeks while I'm in the middle of the Mediterranean, potentially in a firefight, to get a release, a change to an application that's not working properly.
Rick: That manifested for me the importance of focusing on the right things. You have to look at your frequency and where you're deploying these changes. It’s not just through enhancements and value, but to rectify issues, defects, and security vulnerabilities.
Carolyn: Are you seeing the government agencies embrace these four metrics?
Rick: I think they've embraced a hundred different metrics, but the industry is telling them, just like it's telling them to move towards DevOps or DevSecOps, to focus more on these. Get rid of the 300-page system security procedures; that's a waste of time because you're not getting value.
Carolyn: When you say the industry's telling them, who is the industry?
Rick: Industry would be the developers in the private sector, in the Netflixes, the AWSs, the industry leaders, the Googles. Those that can deploy changes and take advantage of disruptive technology and innovative services quickly. They are recognized as thought leaders on what should be measured when gauging teams' productivity on this journey to DevSecOps.
Mark: Are these standards something that the DORA organization came up with? Like you talk about the industry standards, do you know where they're getting the standards from?
Rick: The deployment frequency is standard. It's always been around. You mentioned the 34 years. I've known about deployments ever since I started doing software.
Carolyn: But the DORA organization sounds like it has boiled down to these four most important metrics. You're saying from industries like Netflix, like AWS, Amazon.
Rick: Google.
Carolyn: They've looked at best practices, the metrics that really matter, and DORA said, these are the four that matter most.
Rick: They can link back to the collaboration across multiple teams, which is the essence of DevOps or DevSecOps. These teams have different disciplines, different priorities, and different measurements within their own teams, and if you can measure that you're getting better at deploying more frequently, it indicates that you're collaborating more with these teams. You're getting more rapid at moving from thought, to code, to application, to delivery.
Mark: Are there metrics that they've come up with to determine what increasing efficiency means? Or are they kind of like work groups that look at thinking through what an organization might be dealing with?
Rick: Well, they're really looking at the number, the sheer metric. And they divide it into four different categories of performance. You have your elite performers, like the Netflixes, the Googles, etc., that I mentioned. They're deploying multiple times a day, which, Mark, I'm sure you know, is like a utopia for a public sector entity. They're usually talking once every six months, once every year.
They better make it successful or else they have to marshal all those resources again. You're talking about time, money, not being able to provide value, those types of things. When you're looking at the measurement of the metric itself, you're trying to categorize it to allow you to move up this hierarchy. If you're a low performer, you're maybe doing it once a week, once a month, or once every six months. That's not optimum. How do you move up? You try to increase your ability to deploy faster. What does that mean?
Rick: Talk to more groups. Get them into a room. What are the bottlenecks, the areas that need improvement? How do you work together even when you're in a different company? In the public sector, you might have different contractors, and different companies doing various different pieces of this. So it's very important to foster that collaboration so that you can deploy more. That should be the goal. How do I deploy more and faster?
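The performance tiers Rick describes can be sketched as a simple classifier. The bucket boundaries below loosely follow the published DORA benchmark bands, whose exact cutoffs vary from report year to report year, so treat the thresholds and the function name as illustrative assumptions rather than canonical values:

```python
from enum import Enum

class Tier(Enum):
    ELITE = "elite"      # deploys on demand, multiple times per day
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

def classify_deployment_frequency(deploys_per_month: float) -> Tier:
    """Bucket a team's deployment frequency into a DORA-style tier.

    Thresholds approximate the published benchmark bands.
    """
    if deploys_per_month >= 30:      # roughly daily or more often
        return Tier.ELITE
    if deploys_per_month >= 1:       # between weekly and monthly
        return Tier.HIGH
    if deploys_per_month >= 1 / 6:   # between monthly and every six months
        return Tier.MEDIUM
    return Tier.LOW                  # less than once every six months
```

A team releasing twice a month would land in the high bucket; moving up a tier means attacking the bottlenecks Rick describes, with more collaboration and more automation.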
Mark: One of the things that has me thinking is, how can organizations strive to get to the next tier of performance in each of these benchmarks?
Rick: Other metrics lead or feed into these four metrics. For example, your lead time for changes, which is the next metric that they talk about. This is more developer-speak, more technical. When I commit my code, I'm saying this has passed all my testing; I've run it by my team, they've looked it over, it's passed all the tests, and I've committed that branch, that version of my change, onto the main version control line. Previously, when you developed a release, a deployment to go to production, all your developers would make their changes and commit to that particular release branch.
That has subsequently changed with this movement towards agile and making things more frequent, with smaller deployments where each developer has their own little branch. Once they finished their little piece of the world and passed all the regression testing, they would commit their code to the branch. Using automation, they would move that change from building the application, through test environments and pre-production, to user acceptance testing, getting user approval, and deploying into production.
Rick: Getting that time faster allows you to deploy more frequently. That one feeds into the other. In order to focus on moving up the chain, you need to apply, in my opinion, more automation. These are very repetitive tasks.
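DORA defines lead time for changes as the elapsed time from code committed to code successfully running in production, and the reports track the median. A minimal sketch of that computation, where the function name and the (commit, deploy) pair shape are illustrative assumptions:

```python
from datetime import datetime
from statistics import median

def lead_time_for_changes(commit_deploy_pairs):
    """Median elapsed time from commit to production deploy.

    commit_deploy_pairs: list of (commit_time, deploy_time) datetimes,
    one pair per change that reached production.
    """
    return median(deploy - commit for commit, deploy in commit_deploy_pairs)
```

Feeding this from version control and deployment logs makes the "are we getting faster?" question answerable with data rather than anecdote.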
If you've ever developed software, it's a combination of artistry and engineering in a beautiful dance. You're trying to be an artist, trying to be creative, trying to figure out the most elegant way to put something together, but there are certain engineering tasks that have to be done. If you don't do them, it will bite you in the rear end later on down the line.
That is, constantly test, constantly scan, and constantly do the mundane tasks that allow your code not only to be elegant but to be maintainable. It’s also correct in terms of requirements and hygienic in terms of not introducing vulnerabilities.
Carolyn: But that mundane consistency, you automate all that?
Rick: Yes. If DevOps, DevSecOps is the movement or the journey, automation is the key ingredient to allow you to move faster.
Carolyn: You feel like these four metrics are sufficient, but listening to you talk, there are four big rocks, and then there's a whole bunch of metrics that fall underneath each of them.
Rick: Yes. But they should be feeding into increasing your frequency, decreasing your lead time for changes, and making that smaller. Your change fail rate, you want to make that as small as possible. There are ways that you can do this with automation. Then the time to restore service or the mean time to repair, I've heard mean time to restore, mean time to resolve, mean time to remediate.
Rick: So MTTR, the R is interchangeable, but it means the same thing. The change failure rate asks: when the DevOps or DevSecOps teams deploy into production, was that a catastrophic failure such that you had to roll back or remove the change because you made things worse than they were before? Speaking of industry, I was in the telecommunications industry. We were doing a lot of white-labeled systems for the wireless industry, all the big ones, the Verizons, the AT&Ts, etc.
They have very strict procedures on when deployments occur within windows. It's usually between 2:00 AM and 4:00 AM on a Tuesday or a Wednesday, just enough to break up your week and make developers and operations miserable. Between those two times, if there was any failure deploying your new code, no matter how important it was, you back it out. You roll it back and you try again either the next day or the next week or the next window that they had. That gets grueling. What happens if you do have a major catastrophe or a major issue with your system or your new change or your fix? It could take weeks before you can get that out.
Meanwhile, you're not producing any value from enhancements to that application because they stay stuck behind the failed deployment. So you need to reduce that change failure rate, hopefully to zero, and the elite performers do this. They do it with many different methods. One of the most popular is a blue-green deployment. What they do there is, let's say you have version one of an application and it's running in production. Everything's fine.
Rick: Now you have version two, where you want to enhance it or fix it. You deploy version two alongside your version one deployment, one blue and one green. You can test your new version two offline to ensure that it meets the requirements, it's working properly, and it handles all the different operational and functional capabilities that it needs to. Then, when you're happy with that, you can switch it over, or you can route a certain amount of real traffic to it to make sure it behaves properly. When it does, you just stop traffic to the old version and send all the traffic to the new version seamlessly, with no downtime.
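The blue-green switch Rick walks through can be sketched as a toy traffic router. This is only an illustrative model, not Platform One's actual mechanism; the class and method names are assumptions, and in practice the switch is typically a load balancer or Kubernetes Service selector change rather than application code:

```python
import random

class BlueGreenRouter:
    """Toy model of blue-green traffic switching (illustrative only)."""

    def __init__(self, blue, green):
        self.blue = blue          # handler for the current production version
        self.green = green        # handler for the candidate version
        self.green_weight = 0.0   # fraction of live traffic sent to green

    def canary(self, fraction):
        # Route a slice of real traffic to the new version to watch its behavior.
        self.green_weight = fraction

    def cut_over(self):
        # Send all traffic to green; blue stays deployed for instant rollback.
        self.green_weight = 1.0

    def rollback(self):
        # Send all traffic back to blue.
        self.green_weight = 0.0

    def route(self, request):
        handler = self.green if random.random() < self.green_weight else self.blue
        return handler(request)
```

The key property is that both versions run side by side, so switching and rolling back are instantaneous weight changes rather than redeployments, which is what drives the change failure rate toward zero.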
Carolyn: Do developers ever play games in a test environment where they blow it up on purpose so they can see how fast they can restore?
Rick: It should be part of the culture and the methodology that DevOps or DevSecOps teams have. When somebody asked me, I said, "I'm a pessimistic optimist." Meaning I want things to occur properly, but I know Murphy's involved with everything. So, let's test it before we go live because if we don't test it there, it will cause havoc.
Coming from that environment where you get one or two shots, once or twice a week, you want to make sure that you measure twice, cut once. That measure twice is testing in the test environment and the pre-production environment, so that when it gets to production, you're pretty confident that your change will work and will be resilient enough to handle production traffic.
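The two failure-oriented DORA metrics discussed above, change failure rate and time to restore service, fall out of a simple deployment and incident log. A minimal sketch, where the data shapes are illustrative assumptions:

```python
from datetime import datetime, timedelta

def change_failure_rate(deployments):
    """Fraction of production deployments that required rollback or a hotfix.

    deployments: list of dicts, each with a boolean 'failed' key.
    """
    if not deployments:
        return 0.0
    return sum(d["failed"] for d in deployments) / len(deployments)

def mean_time_to_restore(incidents):
    """Average time from service degradation to service restoration.

    incidents: list of (detected_at, restored_at) datetime pairs.
    """
    outages = [restored - detected for detected, restored in incidents]
    return sum(outages, timedelta()) / len(outages)
```

Tracked together, the four numbers balance each other: pushing deployment frequency up without watching failure rate and restore time just ships breakage faster.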
Rick: One other point I think is a good one: I've always advocated that pre-production environments should mirror production environments. There's been a drift within the industry in terms of developers saying, well, I can develop in this environment and I can push it to that environment. It looks slightly different, but I'll maintain some changes here and I'll make it work. Then when it goes to production, it might be a third, different environment. That's really a fool's errand; it's going to result in a bad experience. Luckily, there's automation that makes the gap between production and pre-production a whole lot narrower and a whole lot easier to manage.
Mark: Speaking of automation, you've talked about this in blogs. You talked about Platform One and how it leverages new technologies and automation. Can you dig into this a little bit? First, tell our listeners what Platform One is.
Rick: Platform One is an innovative Air Force environment that is built on the Kubernetes orchestration and management framework; I'll explain that in a second. It also requires development teams to deliver their services, and even the tools that develop their services, in containers. Containers, you can think of them as small virtual machines that have only what the application needs installed in them.
Mark: Like a modular approach.
Rick: Think of it as a widget. From an operational standpoint, they all look like so many widgets. Each one of those widgets could contain a completely different language, dependencies, structures, etc. inside. But from an operational capability, it is much more efficient because you can deploy these widgets as independent, generic items.
Rick: You can deploy them using scheduling techniques that make sure an application is placed on a host within the Kubernetes environment that has the appropriate resources to serve it, and enough resources that it can scale if it has too many requests coming in. It can also scale down to free up resources. But the application itself could be built from myriad languages or constructs.
It's really nice in terms of crystallizing, making concrete, some of the notions that came out of the agile movement: each task that comes across a developer's desk shouldn't always have to be a Java application, or pick a language, just because that's what the operational team can support.
The notion that the best technology should be used for the task at hand really makes a developer's life a lot easier. You can pick maybe a lighter-weight language or application to solve the task, then deploy it and not worry about the operational risk of missing dependencies or anything else the application needs as it moves through pre-production and down into production.
We're talking...