Durable Execution for Real‑World Failures with Temporal’s Cornelia Davis


Episode 49 • 27th May 2026 • Platform Engineering Podcast • Cory O'Daniel, CEO of Massdriver

I'm Cory O'Daniel. This week, Cornelia Davis from Temporal.IO joins me to talk about durable execution, failure handling, retries, entity workflows, and why platform engineering may have been a distributed systems problem from the start. Let's get into it.

My guest today has been working in the platform space since before we had a word for it. She spent seven years at Pivotal as the VP of Technology, where she helped shape Cloud Foundry, a developer platform that in 2013 was already doing things we consider table stakes - container-based abstractions, separation between Platform and App Teams, and Golden Paths.

Now a principal technologist at Temporal, she's also the author of "Cloud Native Patterns" and she has spent more than three decades helping developers build resilient distributed systems. Cornelia Davis, welcome to the platform engineering pod.

Cornelia: 00:00:43

Oh, it's so great to be here. Thank you so much, Cory.

Cory: 00:00:45

I was very excited for today's show. I loved Pivotal. I loved everything Pivotal did. So you have absolutely influenced a ton of my work. So very excited to have you on today.

You've been front row to so much of platform engineering from Cloud Foundry, GitOps, Kubernetes, and now you're working on durable execution at Temporal.

When I think about our space and DevOps and kind of the entire gamut of how our industry's changed over the past twenty six years or so, I feel like there's so many problems that still persist for different teams. They're just stuck at different places. But what are the things that you still consistently see teams getting stuck on?

Cornelia: 00:01:30

Well, I think that one of the things that has happened in the platform space that has been such a great thing is that we went from this system administration mindset with Click Ops into a developer mindset. So that's a boon. That is like, this is awesome, we are now programming our systems instead of clicking on UIs. We are treating it like a software engineering problem.

But one of the things that I still see is that there are software engineering patterns and solutions that we've applied over in the application space that we still haven't brought over into the platform space. So for example, you just mentioned durable execution. We're going to talk a lot about that. It's just this very real awareness and realization that what we do in the platform space is a distributed systems problem. Everything that we're orchestrating is a distributed systems problem.

We are orchestrating things up in AWS, we're orchestrating things in our data center, orchestrating things on Cloudflare, and all of those things together are forming a single unit, but it's highly distributed. I have not yet seen us take some of the knowledge that we have around distributed systems that we've applied in the application space... I haven't seen us apply that a whole lot in the platform space yet.

Cory: 00:02:57

Yeah, yeah, and that one's exciting to me too because I'm an Erlang developer. I love myself a good distributed system. But like, what would you say are like the core principles?

Because I feel like at the same time it can be hard because a lot of the people that you see in the platform space are coming from the Ops side, right. They have experience writing Terraform, they have experience writing Bash, but they may not be like a formal "application developer". I'm throwing air quotes on there for folks listening to the pod, right? And so they may not have worked in building distributed systems, they might have not worked on building the checkout API. The kind of things that application developers do day in, day out.

And many teams, when they're getting started in their platform journey, it's that Ops team, it's that DevOps team that has kind of taken those first steps. And so like, what are some of the things that you see that they're most often missing from distributed systems that would be the best boons for them?

Cornelia: 00:03:51

Yeah, so I think that one of the things is that a big part of... I mean it's said that sixty to eighty percent of the code that is built for applications and distributed systems is failure handling code.

And that is something that I don't see us doing on the platform side as much because what we do is we build these automations. We even build like... I love the transition. Like you... I'm not an Erlang person, but I'm a functional programmer at heart and I love declarative systems. That's why Cloud Foundry. And Cloud Foundry was before Kubernetes, so it was one of these... it was maybe around the same time that Terraform was coming on the scene... and so this is my love letter to Terraform.

Terraform did something amazing. It was this declarative system. It was like, "Okay, we're going to go out and take a look at what the current state is. We're going to compare the state and then we're going to update it." Love all of that, but things break.

And so this system where we're making calls to AWS and Cloudflare and some APIs in our own internal data centers, those things break. And so what we end up is... we end up with systems that are inconsistent, we end up with orphans, all of those types of things.

And then we try to apply, like, if you will, maybe brute force hammers to trying to fix those things. "Oh, I'll have a cron job that goes and looks for orphans every once in a while," that type of thing. But the reality is that there's patterns that could keep you from getting into a state where you've got orphans.

I think that we tend to be reactive. And so it's largely around that failure handling is that... yeah, we can put together orchestrations, we can do declarative configurations, but then realizing that to stitch it all together... and by the way, in the platform space, nothing runs in a half a second, in five hundred milliseconds, everything runs for not just minutes, but hours, days, weeks and months. And so the longer that something runs, the more likely something's going to fail. And so it's that failure handling, that implicit inherent failure handling that has to be there for everything.

And it's hard, by the way, because you don't want to spend sixty to eighty percent of your cycles working on making the system as durable as you can, because you've got a whole team of developers in your organization that are like, "When are you going to give me the next feature I'm waiting for?"

Cory: 00:06:23

I need a database. And you're like, "Well, that's a simple request." Right? But even like Redis... I don't know why, every time I start up a Redis on any cloud, it's like it's memory on a wire but it takes an hour and a half to boot up the Redis cluster. It's just like, "What?"

It's like things will go wrong, like subnets will disappear, EIPs will disappear. Like there's... but it is funny because one of these things, because again, like, I feel like for many developers that are used to having most of their infrastructure set up for them, they may not realize the nuance in getting all this stuff together. To them getting a database is, "Why isn't it as easy as docker pull postgres? That's how it worked for me locally." Most of what we do in the cloud, you said it, it's not twenty seconds, it is an hour and a half. And it's like, "Oh, you're out of quota." Oh, shit, that's not even a real failure. I mean, it is a real failure, but you know what I'm saying? It's like, that's one we could easily get around.

So we mentioned durable execution. How would you define that for folks that aren't familiar with it?

Cornelia: 00:07:27

Yeah. So durable execution is a term. It's gaining a little bit of popularity. But the way that I would describe it, because most of your listeners probably aren't super familiar with it, is that it's... and I'm a nerd, so I'm going to explain it from a nerdy perspective... it's a programming model.

It's a programming model that allows you to write your code as if those failures didn't exist.

You see the potential huge value here in the platform space is that the platform space, because you say they come from a systems administration background often, and they have gone through tremendous pains to clean up after failures. They do program as if failures don't exist, and then they compensate for that.

Well, what this is, is it's a programming model that allows you to program as if process boundaries don't exist, as if failures don't exist. And the system, the durable execution platform will actually put all those compensations in place for you.

So for example, it will do automatic retries for you. And by the way, those retries... now I know you're probably thinking, "Retries? Well, everybody does retries." The thing about durable execution is that the retries themselves are durable. And what I mean by that is that even if you're in the middle of retrying something, you've got something rate limited, or you've got some network issue and you're in the middle of retrying and then some other failure happens and your retry logic itself is in process and goes away. Oopsie.

But by durable retry, I mean that if the system that's doing the retrying, so you're at step three and you're doing the retrying, and now your orchestration itself goes down. When your orchestration comes back up, we remember exactly that you were on retry number three and we continue on with that.

So durable retries, for example, is one of the examples... or state management, like managing state... all of that stuff is what we call durability. So you get to program as if failures don't exist. But the runtime behaviors... and by the way, a lot of that is distributed systems, and we'll dig into that, I'm sure, in just a moment... the runtime behaviors make the system resilient to those failures. It's not that the failures don't exist. It's like a waterproof watch. Doesn't mean it won't get wet, it means that if it gets wet, it'll still be okay.
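The durable retry idea described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not Temporal's implementation or API: the `DurableRetry` class, its JSON state file, and the choice of `ConnectionError` as the retryable failure are all hypothetical. The only point it makes is that the attempt counter is written to durable storage before each call, so a restarted process resumes at the attempt it was on.

```python
import json
import os


class DurableRetry:
    """Toy retry loop whose attempt counter survives a process restart.

    Hypothetical helper for illustration only; not part of any real SDK.
    """

    def __init__(self, state_path, max_attempts=5):
        self.state_path = state_path
        self.max_attempts = max_attempts

    def _load_attempt(self):
        # Resume from the persisted attempt number, or start at 1.
        if os.path.exists(self.state_path):
            with open(self.state_path) as f:
                return json.load(f)["attempt"]
        return 1

    def _save_attempt(self, attempt):
        with open(self.state_path, "w") as f:
            json.dump({"attempt": attempt}, f)

    def run(self, step):
        attempt = self._load_attempt()
        while attempt <= self.max_attempts:
            # Persist before calling, so a crash mid-call is remembered.
            self._save_attempt(attempt)
            try:
                result = step(attempt)
            except ConnectionError:
                # Retryable failure: move on to the next attempt.
                attempt += 1
                continue
            os.remove(self.state_path)  # finished: clear the record
            return result
        raise RuntimeError("retries exhausted")
```

If the process dies while attempt three is in flight, a new `DurableRetry` pointed at the same state file picks up at attempt three rather than starting over at one.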

Cory: 00:10:09

That's funny. I like that example. Yeah, this watch is definitely going to get wet... it's waterproof.

So let me ask a question here. So as a developer, whether on a platform team or app developer working with durable execution, how much does my app still look like what I may consider my app today versus does it start to look a bit more like a series of Lambdas? So does this look like a bunch of series of smaller functions that are getting executed in a graph or a DAG or something like that? Or do I still typically have something that maybe feels like my Flask monolith or my Express monolith?

Cornelia: 00:10:46

Yep. So I'll answer that question in a couple of different ways.

From a theoretical perspective, durable execution does not imply a graph model or imply not a graph model. You can do durability with either of those programming models.

The one that we do... and I work, as you said in the intro, I work for Temporal, which by the way, is an open source, hundred percent open source - it's not open core, it's not like you can run some stuff on the open source, it's 100% open source. That's what I'm going to be talking about today is the open source, super cool technology... Our approach there is that we don't want to introduce a DAG, we don't want to introduce a DSL, that you have to program a different programming model. We believe that the most natural programming models are the ones that you're already familiar with. They are the languages that everybody uses. It's Python, it is TypeScript, it is Java, it is .NET. We support seven different languages. There's even a Swift SDK.

Whatever programming models that you're familiar with, you can continue to use those. Getting back to your question, what the code looks like is, it looks like regular old code.

Now, there's two fundamental abstractions that we use. One is called an activity. And an activity is basically a unit of work, but it's a unit of work where there's a possibility of failure. Then the other abstraction is what we call a workflow, which stitches together all of those units of work in the flow that you want. The workflow, if you think of it, is the overall application. The activities are the units of work where there could potentially be failure.

You program that. You basically say, "Okay, here's my..." basically activities are decorators on functions. If you've already structured your code in a really great modular way, you've already got functions that encapsulate the units of work. You put activity decorators on those which tells our SDK, "Oh, hang on, pay attention to this. When this function is called, we want you to do some magic for us. We want you to step in, we want you to put that durable retry in there. We want to do some state management around that." So the SDK basically acts as, if you will, a bit of proxy around that function.

It turns it into a distributed system, which maybe we should talk about next.
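As a rough picture of "a decorator that acts as a proxy around the function", here is a minimal sketch in plain Python. It is not the Temporal SDK: the `activity` decorator, the in-memory `history` list, and the `create_database` / `create_queue` functions are hypothetical names invented for this example, and a real durable execution engine would persist the history rather than keep it in memory.

```python
import functools


def activity(max_attempts=3):
    """Hypothetical decorator: wrap a function so calls are retried and recorded."""
    def decorate(fn):
        @functools.wraps(fn)
        def proxy(*args, **kwargs):
            last_error = None
            for attempt in range(1, max_attempts + 1):
                try:
                    result = fn(*args, **kwargs)
                except Exception as err:
                    last_error = err
                    proxy.history.append((fn.__name__, attempt, "failed"))
                    continue
                proxy.history.append((fn.__name__, attempt, "completed"))
                return result
            raise last_error

        proxy.history = []  # in memory here; a real engine would persist this
        return proxy
    return decorate


_calls = {"create_database": 0}


@activity(max_attempts=3)
def create_database(name):
    # Pretend the cloud API fails on the first call only.
    _calls["create_database"] += 1
    if _calls["create_database"] == 1:
        raise ConnectionError("transient network error")
    return f"db-{name}"


@activity()
def create_queue(name):
    return f"queue-{name}"


def provision_environment(name):
    # The "workflow": ordinary sequential code with no retry logic in sight.
    return [create_database(name), create_queue(name)]
```

The workflow function reads like regular code; the retry behavior lives entirely in the wrapper that the decorator installs.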

Cory: 00:13:24

I like that, I like that it's annotations. I feel like... coming from an Erlang background, there's this philosophy called Let It Crash. Have you heard the Let It Crash philosophy?

Cornelia: 00:13:34

Yep.

Cory: 00:13:35

So for folks that aren't familiar with it, the idea is you write code for a happy path and you let processes fail fast for adherence. Right? And there's this supervision system in Erlang that... you have to program it, you don't get it for free, you have to go do the error handling work. And it will restart the system to a known good state. It will bring things back.

It's one of those things that I feel like it's so novel and such a beautiful idea and it is absolutely so easy to screw up. It is hard to let things crash. It is actually very... I mean, it's very easy to let things crash, but the actual recovery of it does require... it requires a lot of adhering to. And understanding what is a good crash versus a bad crash. Like a user putting in extremely bizarre data - that is a good crash. You don't want the entire system to fall apart, you want to tell the user they did something wrong.

But it is hard to reason about. And the idea that you can just decorate code that you know how to think about in the good way and then have common cases.

So with those decorators, is it you're calling out what the failure modes are for each one that you know as the author of the code? Or is it like a kind of a magical decorator where it's like, "Oh, I will probe Python and libcurl or whatever to figure out what's going on"?

Cornelia: 00:15:00

It's more the former. Basically, you put a decorator on the function and you say, "This is a function." This is what we call an activity. And then you can set your retry policies so you can do things like decide whether you want to do exponential backoffs. You can also, as a part of that, identify which failures are retryable and which ones are not.

So for example, if you are making a call out, I think you just had a very similar example, you're making a call out to AWS to provision something and you know what, the credentials that you're using to try to provision that are failing. You're not going to retry that because the credentials are still going to be failing, right? So that is what we call a non retryable error.

And you as a developer basically get to decide for this function which types of failures are application failures, i.e. we're just going to continue processing those as application failures, and then everything else we'll just assume is a retryable error.

And so you basically get to program those policies. And of course you can program timeouts as well, because in distributed systems timeouts are a major thing that you need to deal with.
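A retry policy of the kind described here (exponential backoff plus a list of failures that should never be retried) might look roughly like the sketch below. The `RetryPolicy` dataclass, its field names, the `InvalidCredentials` exception, and `run_with_policy` are assumptions made for illustration; they are not claimed to match any SDK's actual types. Timeouts are left out to keep the example short.

```python
from dataclasses import dataclass


class InvalidCredentials(Exception):
    """Example of a failure that retrying cannot fix."""


@dataclass(frozen=True)
class RetryPolicy:
    initial_interval: float = 1.0      # seconds before the first retry
    backoff_coefficient: float = 2.0   # multiplier applied on each retry
    maximum_interval: float = 60.0     # cap on any single wait
    maximum_attempts: int = 5
    non_retryable: tuple = (InvalidCredentials,)

    def delay_before(self, attempt):
        """Seconds to wait after failed attempt number `attempt` (1-based)."""
        delay = self.initial_interval * self.backoff_coefficient ** (attempt - 1)
        return min(delay, self.maximum_interval)


def run_with_policy(fn, policy, sleep):
    """Call fn, retrying per policy; `sleep` is injected so tests need not wait."""
    for attempt in range(1, policy.maximum_attempts + 1):
        try:
            return fn()
        except policy.non_retryable:
            raise  # e.g. bad credentials: fail immediately
        except Exception:
            if attempt == policy.maximum_attempts:
                raise
            sleep(policy.delay_before(attempt))
```

With the defaults, the waits grow as 1, 2, 4, 8 seconds and so on, capped at 60, while an `InvalidCredentials` error is raised to the caller on the first occurrence.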

Cory: 00:16:13

Oh yeah. Oh yeah. So for something like... if you're wrapping some sort of cloud IaC tool - like Pulumi or Terraform or Helm or Ansible, whatever - and going to the AWS case around the credentials, it's like I can see a handful of different error modes. The credentials are just wrong - that one's just completely not retryable. There's the quota failure. And then there's the IAM failure where like half the build or half the provision worked, but when it got over to making a subnet, the role that you have doesn't have the ability to make subnets. You can make this and you can make that, but not a subnet. So in scenarios like that where it's like... it all ties back to kind of authorization, like sort of, I guess, right?

So how do you decorate for that scenario? Where it's like it could be the inputs to it... like the same function... the inputs to it, this credential coming in could result in like one of three error modes. How do you signal to the, I guess the durable execution engine, which one of those error modes it fell into?

Because the quota one, it's not retryable but it also is? You know what I mean. Or like the IAM role - I forgot to put the IAM permission on there, I want the same execution, I want the same role to come through again, I just need to make sure I give this role the ability to create that resource type.

Cornelia: 00:17:39

Yeah. The first thing is that the durable execution layer doesn't actually know the details of your application logic. So you just described a really nuanced application specific thing.

The way that I've created my unit of work is it bundles up a number of different things... which by the way I'm going to go off on a little bit of a tangent here. One of the things that we've seen is obviously in the platform space, everybody uses Terraform. We've seen customers that are using Terraform and they're creating these Terraform configurations that are very composite. And it's like, "Do this, do this, do this and do this." And that's part of the reason why you end up in this very complex nuanced scenario that you did. Which is like, "Okay, this composite object, part of it worked, part of it didn't. How do we deal with all of that?"

One of the things that we're seeing is that people are starting to, because they have durable execution, they're able to break up their Terraform resources into smaller units, because the reason that they had them in a composite unit was so that they wouldn't have to deal with the potential failures between those different components. But when you have a different solution like durable execution, that's orchestrating those lower level units, now you can break your monolithic Terraform configurations into smaller pieces and use durable execution for that.

So it kind of simplifies the scenario that you talked about because part of the nuance of what you described was because you had a multitude of different things. It wasn't just one resource. It was like, "Okay, I got through part of it, but I didn't get through the rest of it because of this nuanced error."

So back to your original question - "How do you deal with that?" Well, you would have to still deal with that. That would still have to be part of the return codes of your activity. And then you would have to create, of course, a mapping... and so you would decide like, I'm going to return things out of here that indicate retryable versus non retryable. And that would dovetail with your policy.

I want to make one other comment though, because I love your quota example. Because what sometimes people would have naturally thought of that... and you're already picking up on this really interesting thing, which is that quotas, you might say, "Well, that's not retryable, because I'm just going to go back and ask and the quota is still going to be not satisfied," or "It's going to take me... I have to file a request to up my quota, so I need to stop retrying this until my quota goes... like, how do I deal with all that? I don't want to retry it every five seconds because now I've got a human in the loop process to increase my quota."

The interesting thing is when you start working with durable execution, you start thinking a little bit harder about retryable versus not. I would suggest that the quota error is a retryable error because you can actually side effect the system, where now this is going to continue when the quota is updated.

We'll get a little bit further into the durable execution... I'll come back to this example when we get a little further into the technology.
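The mapping from activity errors to retry behavior that Cornelia describes could be as small as a lookup table. The sketch below is hypothetical: the error codes and the three actions are invented to mirror the three failure modes from Cory's question, and treating quota and missing-permission errors as "wait for a human" reflects the suggestion in the conversation rather than any built-in behavior.

```python
from enum import Enum


class Action(Enum):
    FAIL = "fail"              # stop: retrying cannot help
    RETRY = "retry"            # retry automatically with backoff
    WAIT_FOR_HUMAN = "wait"    # pause until someone fixes the environment


# Hypothetical error codes an activity wrapping an IaC run might return.
ERROR_ACTIONS = {
    "invalid_credentials": Action.FAIL,
    "quota_exceeded": Action.WAIT_FOR_HUMAN,
    "missing_permission": Action.WAIT_FOR_HUMAN,
    "throttled": Action.RETRY,
}


def classify(error_code):
    # Anything not listed is assumed to be retryable.
    return ERROR_ACTIONS.get(error_code, Action.RETRY)
```

The application still owns this classification; the execution layer only acts on the answer.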

Host read ad: 00:21:00

Ops teams, you're probably used to doing all the heavy lifting when it comes to infrastructure as code: wrangling root modules, CI/CD scripts and Terraform, just to keep things moving along. What if your developers could just diagram what they want and you still got all the control and visibility you need?

That's exactly what Massdriver does. Ops teams upload your trusted infrastructure as code modules to our registry. Your developers, they don't have to touch Terraform, build root modules, or even copy a single line of CI/CD scripts. They just diagram their cloud infrastructure. Massdriver pulls the modules and deploys exactly what's on their canvas.

The result? It's still managed as code, but with complete audit trails, rollbacks, preview environments and cost controls. You'll see exactly who's using what, where and what resources they're producing, all without the chaos.

Stop doing twice the work. Start making Infrastructure as Code simpler with Massdriver. Learn more at Massdriver.cloud.

Cory: 00:21:57

So in that scenario, can you pause execution and then resume execution?

So this sounds like also if you had a system where you're like... in an evented system and you've got something that's hitting a dead-letter queue all of a sudden because there's something just... there's an error that you were just not expecting and the code's just wrong. It's like that thing that hits the dead-letter queue, you have to have some other code that's going to process that dead-letter. This event happened, we still need to deal with it.

But in this scenario, I can pause execution, we fix the code, ship it, and then resume execution of this event that was failing previously. And it's just like, okay, now we just have the handling of this. Maybe a field was spelled wrong or something like that, using the same kind of model.

Cornelia: 00:22:44

Absolutely perfect, you have the essence of durable execution.

Cory: 00:22:49

I've broken shit. I've broken shit before, I've been around.

Cornelia: 00:22:54

Yep. So there's two things.

I'll get back to the pause in just a moment because you also used another magic word, which is eventing, and I hadn't described that yet. So one of the things that we do... I mentioned that you have these units of work, they're activities, and then you have a workflow that stitches them together, right? So we talked about the retry around the activity. But the other thing that I want to point out that's very, very important is that when I have this workflow that's orchestrating these activities, every single one of those calls from the workflow into an activity and the return happens via message queue. Happens via task queue.

No longer does this... remember I said the programming model says that you can program as if failures don't exist? Another way of putting that is you can program as if everything is running in the same process. Like a function call, you don't have to worry about that because it's running in the same process. Of course you can have out of memory errors and things like that, but the higher level programming languages have done a pretty good job not letting you shoot yourself in the foot by not having the right pointer type of a thing. So a function call is a pretty safe thing. You don't really have to wrap every single function call with a whole bunch of error handling code.

So what the SDKs do in the durable execution case is they intercept and they're handling some retry. But in fact that retry itself is being handled over a task queue. And so all of this is happening with an eventing system. So now in this quota example that you talked about, now I want to get to your dead-letter queue, because that is a perfect example, because this is the way that we have programmed these distributed systems in the past.

We have an eventing system and at some point when we hit some failure scenario, we don't let it sit in the queue anymore, we send it to a different queue, which is the dead-letter queue, which says, "Hang on, I'm stuck here, I can't do...." And then that's typically where application engineers... or if you're applying this in the platform space, platform engineers... have to occasionally go, they have to write automation that goes across that dead-letter queue, gives you some observability, and you've got to handle these things. And that's where a lot of orphans end up, right? Like orphaned infrastructure ends up somehow manifesting itself into the dead-letter queue.

With durable execution, another way of expressing it is that once you have started a process, so once you've started one of these workflows with durable execution, it will live until it either completes or you decide to terminate it. What that means is that if something's going wrong, like I hit this quota problem, I don't have to actually send things to a dead-letter queue. I can basically say, "You know what, I'm stuck. I got an error. It's a quota error. And so now I am going to put this flow into a wait state."

Cory: 00:25:56

Very cool.

Cornelia: 00:25:57

You can basically have some logic in the workflow that says, "When I hit a quota error, I'm just going to go into a wait state and I'm going to wait for something." Typically you're waiting for some state in the application to change. You might have a flag in the running code, in the running workflow, it's a local state variable that says, "Hey, waiting for human input." You basically go into a wait state. And the magic is that in a durable execution platform, it basically says, "Alright, I'm going to offload, I'm not going to consume any resources with this."

By the way, there are some platforms out there that talk about durability, but if you read the fine print, it says, "While you're waiting, this is your cost." That's because it's still consuming resources. In the Temporal durable execution platform, it literally consumes zero compute while it's in that wait state. And it can wait for a minute, a day, a week, a year. That process lives forever until it finishes. We need to talk about what we call entity workflows in just a minute.

What we do is we go into a wait state. It's not that we go into a dead-letter queue. We just say, "Hey, this workflow is paused." Now somebody goes and updates the quota. In updating the quota, you've got some code that updates the quota. It also flips that bit in the workflow. It says, "Hey, human responded." Now the workflow says, "Oh cool, I'll pick up where I left off." Which is an important part of durable execution. It knows exactly where you were, picks up where you left off and says, "Okay, human came back with some input. Let me continue this retry." And this time it retries and the quota has been updated and you can go on. No special logic, it's just the workflow continues. That's what durable execution is.
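The wait-then-resume flow above can be imitated with `asyncio` from the Python standard library. This is only an in-memory analogy: `ProvisionWorkflow`, `QuotaExceeded`, `signal_human_responded`, and the `demo` coroutine are hypothetical names, and unlike a durable execution platform nothing here survives a process restart. It does show the shape of the logic, where a quota error parks the workflow until a signal arrives and the same step is then retried.

```python
import asyncio


class QuotaExceeded(Exception):
    pass


class ProvisionWorkflow:
    """Toy workflow that parks itself on a quota error until signalled."""

    def __init__(self, create_cluster):
        self.create_cluster = create_cluster      # the "activity" (hypothetical)
        self.human_responded = asyncio.Event()    # the flag that gets flipped
        self.log = []

    def signal_human_responded(self):
        """Called by whatever code raises the quota."""
        self.human_responded.set()

    async def run(self):
        while True:
            try:
                result = self.create_cluster()
            except QuotaExceeded:
                self.log.append("waiting")
                await self.human_responded.wait()  # parked; no polling loop
                self.human_responded.clear()
                self.log.append("resumed")
                continue                           # retry the same step
            self.log.append("done")
            return result


async def demo():
    quota = {"limit": 0}

    def create_cluster():
        if quota["limit"] < 1:
            raise QuotaExceeded
        return "cluster-1"

    workflow = ProvisionWorkflow(create_cluster)
    task = asyncio.create_task(workflow.run())
    await asyncio.sleep(0)             # let the workflow run until it parks
    quota["limit"] = 1                 # a human raises the quota...
    workflow.signal_human_responded()  # ...and flips the bit
    return await task, workflow.log
```

Running `asyncio.run(demo())` drives the workflow through waiting, resuming, and completing without any dead-letter queue or separate recovery job.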

Cory: 00:28:04

That's very cool. In that Terraform or OpenTofu example... Sorry, I've been saying a lot of Terraform. Sorry folks, don't come at me... in that scenario, it would start to reapply, right? So it skipped through most of it because that's okay. So it's not like actually somehow pausing the Terraform binary and resuming that. Okay, very cool.

Cornelia: 00:28:23

That's correct. Yeah.

We have a lot of people who are using Temporal to orchestrate their Terraform, to manage their state files, like I said, to break down their monolithic configurations into smaller pieces so that they can be a little bit more fine tuned with it.

And it's really very, very cool because you realize when you're doing things like retries, you need to have things like idempotence and most of the resources in OpenTofu are idempotent. So it's actually quite a nice match made in heaven between Terraform and durable execution.
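Why idempotence matters for retries can be shown with a toy version of the compare-and-update loop that declarative tools follow. The `plan` and `apply` functions and the dictionary-based state below are simplifications invented for this sketch, not how Terraform or OpenTofu store or diff state. The property of interest is that re-running the same step after a success produces an empty plan, so a retry does no harm.

```python
def plan(current, desired):
    """Compare current and desired resources; return the changes needed."""
    changes = []
    for name, spec in desired.items():
        if name not in current:
            changes.append(("create", name, spec))
        elif current[name] != spec:
            changes.append(("update", name, spec))
    for name in current:
        if name not in desired:
            changes.append(("delete", name, None))
    return changes


def apply(current, changes):
    """Return a new state with the planned changes applied."""
    state = dict(current)
    for action, name, spec in changes:
        if action == "delete":
            state.pop(name, None)
        else:
            state[name] = spec
    return state
```

Applying a plan once brings the state to the desired configuration; planning again against that state yields no changes, which is what makes the step safe to retry.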

Cory: 00:28:54

In the preshow you said that you guys weren't originally designed for infrastructure orchestration. It wasn't really designed for people on the opposite side of the house, but that's where you're starting to see a lot of people using the product today.

What do you think is the most attractive thing for Operations, DevOps, budding platform teams about this execution model for managing things like Ansible and Terraform and whatnot?

Cornelia: 00:29:21

Yeah. And so I'll say a little bit more about that just for a little context for your listeners.

So Temporal's been around for like six and a half years and when we had a conference a couple of years ago... I wasn't there yet, I've only been here for about a year and a half... but without doing any kind of enablement in the platform space, without doing any kind of go to market or anything like that, like literally over half of our user stories that came to that conference were platform engineering, were infrastructure orchestration... by infrastructure I also mean, you know, any kind of user onboarding. It's not necessarily compute storage and network, but it might be provisioning a user or provisioning them into some kind of a SaaS system. So very platform engineering use cases.

And I think that one of the main reasons that it took hold in that space was first of all, a lot of what we've been talking about was the need to have a tool set and also not have to learn about actor systems and event driven systems and event sourcing systems and all of that stuff to be able to get your job done. So the programming model is really great, but I think that one of the main reasons is because of the long running nature.

Durable execution - Yes, it helps and it's even used in money transfers. It's used a ton in that scenario because it's great to have durability when you're doing money transfer across different systems. But those transactions are relatively short in timeline and people have built the Rube Goldberg machines and spent the sixty to eighty percent of their dev cycles getting that resilience in place because they had to. Otherwise if you can't have a resilient financial system, it wouldn't be a system worth using at all.

But it's the long running nature, I think that was one of the biggest things. So that scenario that we just talked about where it could take somebody two days to go up that quota, that's okay, we'll hang out, we'll continue the process when you're ready to go.

The long running nature I think is one of the things that has really... programming model and long running nature of the workflows are the two things I think that has made this really a sweet technology for platform engineers.

Cory: 00:31:37

Yeah, I think one of the other things, just kind of hearing you talk through it, is the bite sized nature of the approach. One of the things that... I've been a big advocate for, for componentizing IaC for a long time.

There's so many teams with the terralith or the "Pterosaur" as Fred started calling it recently. It's hard to reason about, takes up a ton of time when it's all together, but also it's just fundamentally at odds with use, you know what I'm saying? It's like giving somebody a Terraform module and you're like, "This makes a VPC, some subnets, a database, a Kubernetes cluster and you can put your app in it." And it's just like, "That's like two thousand parameters I have to think about." Right? Versus like, "Okay, I'm going to think about a network or maybe the network's already been thought about and it's provisioned, so that's great. I just have to think about a database."

It makes the self service nature of Terraform or any IaC tool really plausible to break it down. I won't say dumb it down, but abstract it in a way that it's easier to understand by somebody who's not necessarily a hardcore cloud engineer, like doing Ops all day long. And like that's the name of the game.

Like we want these people to be able to provision and manage things themselves. We can't expect them to learn our tools to do it. Why am I here if you're learning my entire stack to get the job done?

Cornelia: 00:33:01

Yep, yep, totally a hundred percent on that. There's another thing. So I mentioned long running and earlier I kind of hinted at like we should talk about entity workflows.

Cory: 00:33:11

Yes, I wanted that... I was trying to remember, I was digging in my brain. I'm like, "There's another workflow of things she'd mentioned I wanted to talk about." That was it. Yes.

Cornelia: 00:33:17

Yep. And this is a perfect spot to talk about it.

So we said that the things that the platform engineer is dealing with, these things live for a long time. And it's not only the provisioning process. We know, especially with the infrastructure as code and some of these declarative systems, there's definitely this notion of like, okay, it's not that we do all the orchestration and then we're done. We recognize that things change, so that environment will change. I've provisioned the environment, but now I need to provision more. I need to scale capacity or I need to add some other component into the architecture. Those types of things. I need to cycle credentials, all of those types of things. And this is where entity workflows come in.

I actually prefer a different term. It's a term, I think that's going to resonate with more of your audience, which is the notion of a digital twin.

So, I've got a thing. An application team has come along and they've said, "I need an environment. I need an environment provisioned. I actually need Dev, Staging, and Prod. And they all need to have this database, this message queue... the Temporal... It needs to have all of these different components that need to be tied into the IAM system in the following way." Then we go ahead and the infrastructure as code, the orchestration logic, provisions all of that.

The notion of a digital twin is that you always have a logical and digital analog to this very real thing that is out there. The very real thing, of course here in this case is abstract, it's infrastructure and all of those things. But the cool thing is now you have a digital twin. And that digital twin is basically sitting there saying, "okay, I have a representation of what that infrastructure is. And I'm also the place where you can interact instead of interacting directly with the physical thing. You're interacting with me as the digital twin." And so the entity workflow... remember I was describing how these workflows can live forever... you can basically design a workflow so that it provisions everything and then it goes into a wait state. Now you can send signals into that workflow and say, "Hey, I need you to scale capacity," or "Hey, I'm adding additional users to this project."

It basically allows you to not create a brand new orchestration to make changes to an existing thing. It says, "Alright, here's the orchestration that is going to make changes to the existing thing."

So it's a great place for audit. It is a great place for observability. What are the things that happened? That's what we mean by an entity workflow.

I think of it as a digital twin for the infrastructure that you're orchestrating and that's a super powerful thing in the platform space.
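The entity workflow pattern described here (provision once, go into a wait state, then apply changes as signals arrive) can be sketched in plain Python. This is a toy simulation under stated assumptions, not the Temporal SDK: the `EnvironmentTwin` class, its fields, and the signal names `scale` and `add_user` are all hypothetical, and a real entity workflow would block durably on signals instead of draining an in-memory queue.

```python
from dataclasses import dataclass, field
from queue import Queue


@dataclass
class EnvironmentTwin:
    """Hypothetical digital twin for one provisioned environment."""
    name: str
    capacity: int = 0
    users: list = field(default_factory=list)
    history: list = field(default_factory=list)    # audit trail of every change
    signals: Queue = field(default_factory=Queue)  # inbox for change requests

    def provision(self, capacity):
        # Initial orchestration: stand up the environment once.
        self.capacity = capacity
        self.history.append(("provisioned", capacity))

    def signal(self, kind, payload):
        # Callers talk to the twin, not to the infrastructure directly.
        self.signals.put((kind, payload))

    def drain(self):
        # Stand-in for the wait state: handle whatever signals have arrived.
        while not self.signals.empty():
            kind, payload = self.signals.get()
            if kind == "scale":
                self.capacity = payload
            elif kind == "add_user":
                self.users.append(payload)
            else:
                # Unknown requests are recorded but change nothing.
                self.history.append(("rejected", kind))
                continue
            self.history.append((kind, payload))


twin = EnvironmentTwin("staging")
twin.provision(capacity=2)
twin.signal("scale", 5)
twin.signal("add_user", "alice")
twin.drain()
print(twin.capacity, twin.users, twin.history)
```

Because every change flows through the one long-lived object, its `history` list doubles as the audit trail mentioned above.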

Cory: 00:36:26

Yes. That effectively gives you close proximity or parity effectively between your environments. Right? Like I have prod, that is my physical thing in this case. And then it's like, "Hey, I need a preview environment every time somebody opens a PR for this app, I want to clone Prod, stand it up, make sure that like QA can QA the entire thing, it works as intended and then get rid of that twin so I'm not spending money on it."

Cornelia: 00:36:51

Yep, yep, that's right.

Cory: 00:36:53

Very cool. And so that is... entity workflows in Temporal is how you model that today?

Cornelia: 00:36:59

Yeah, exactly. So they're these long running things that basically are just mirrors of the real infrastructure. So you've got the orchestration that lives as long as the infrastructure does.

It's not that the orchestration completes and then something else has to come in and affect that infrastructure, it's that the orchestration lives as long as the environment lives and you can continue to interact with it through that one thing. So that means it consolidates all of that orchestration, all that history.

Cory: 00:37:30

Yeah. And that's frustrating to model in a CI/CD pipeline. It's very frustrating. You can do it, but is there a better use of your time? There probably is.

Cornelia: 00:37:44

Yep. Rather do a little innovation.

Cory: 00:37:45

Yeah. So Temporal has been around for way beyond the current AI wave, but you're also seeing it used heavily in AI systems today.

There's the platform engineering shaped angle in there around sandbox orchestration and managing environments and agents are executing... executing agents is like... it is also very much like an infrastructure orchestration problem, right? Got tons of run.

What is it about durable execution that AI teams are reaching for? And how can it... or can it help some of the, I guess, non-determinism that we receive from these AI systems?

Cornelia: 00:38:32

Yep. One of the things I like to say is that your LLMs are non deterministic enough.

Everything that you wrap around the LLM, let's make that as deterministic as possible.

Cory: 00:38:42

Yes, please.

Cornelia: 00:38:45

And so there's a couple of things. So you're absolutely right. I mean it's been around for six and a half years, so well before the GenAI craze. But the uptake of Temporal in AI companies all the way to the biggest AI companies out there... OpenAI is very public about their use of Temporal and they use it in a ton of places... is several fold.

Number one, all of the GenAI-based applications, whether they're agentic or fixed flow, are distributed systems. Especially with agents, they're starting to run for longer and longer periods of time. And we talked about this earlier, very similar to the platform engineering space, the longer that something runs, the higher the chances of there being some kind of a problem that you need to compensate for. And so LLMs are an external call generally. I mean, even if you're running a local LLM, it's not running in process, so it's always an external call.

If you're running the ones out on the frontier models, you're going to get rate limited. So that whole thing that we talked about earlier with quotas, the analog over in the AI space is rate-limited LLMs. You're going to be orchestrating all sorts of things - interacting with databases, file systems, all of those things - all external calls where things could go wrong, where there's some element of non-determinism that's going on there.

And what we want to do is we want to at least have the orchestration of all those things be a little bit more predictable or quite a bit more predictable. And because these things run for a long time and say you're thirty minutes in on a thirty-five minute long running process and it dies, you don't want to go back and burn all those tokens again.

Cory: 00:40:35

No, you certainly do not.

Cornelia: 00:40:38

With durable execution, we've recorded this state and all... Durable execution is basically an event-sourced system, for any of your listeners who are familiar with event sourcing. Everybody's familiar. Nobody's built it because it's fricking hard. But that's essentially what it is. It's an event-sourced system.

What we can do is we can say, "Yep, let's go through, where were we? Oh yeah, we already did that. We recorded that. Yep, yep, yep, yep. Got all those LLM outputs." So you're not re-executing the LLM, which has two problems. Number one, it burns a lot of tokens, lots of money, but also you're going to get back different results if you were to rerun that LLM call. Which is crazy because now how are you supposed to even identify whether your system's running properly?
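The replay behavior described here can be illustrated with a minimal sketch. It assumes a toy in-memory log and is not how Temporal is implemented: `DurableRun`, `fake_llm`, and `pipeline` are hypothetical names, and a real system would persist the log outside the process.

```python
class DurableRun:
    """Toy event log: each step's result is recorded once and replayed afterwards."""

    def __init__(self, log=None):
        self.log = list(log or [])   # recorded results, in step order
        self.cursor = 0              # position of the next step in the log

    def step(self, fn, *args):
        if self.cursor < len(self.log):
            result = self.log[self.cursor]   # already recorded: replay it
        else:
            result = fn(*args)               # first time: execute and record
            self.log.append(result)
        self.cursor += 1
        return result


calls = []  # tracks how many times the "LLM" was really invoked


def fake_llm(prompt):
    # Hypothetical stand-in for an expensive, non-deterministic LLM call.
    calls.append(prompt)
    return f"answer #{len(calls)} to {prompt}"


def pipeline(run):
    plan = run.step(fake_llm, "plan")
    draft = run.step(fake_llm, "draft")
    return plan, draft


first = DurableRun()
first.step(fake_llm, "plan")          # the process "dies" after the first step
resumed = DurableRun(log=first.log)   # restart from the recorded log
plan, draft = pipeline(resumed)
print(plan, "|", draft, "|", calls)
```

On the resumed run the `plan` step comes back from the log, so the stand-in LLM is invoked only for `draft`: no tokens are spent twice, and the earlier answer cannot change between runs.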

There's that whole thing, it's a distributed systems problem, but then it has this additional thing of, "How do you actually do reasonable development in a system that is inherently non-deterministic?" And durable execution absolutely helps you manage that thing.

There's another element that I want to talk about, and I want to go back to the programming model. We all know that more and more code is being written by AI agents... coding agents. They're getting really good. The LLMs together with the harnesses that people are building, they're all getting really good at writing business logic. It's pretty darn good. I usually have to go back and ask it to get rid of some of the fluff. Like, "Do you really need this?" And those types of things. But it's pretty darn good at the business logic. It's not so good at the whole event driven resilience and all of that stuff.

Cory: 00:42:25

No way.

Cornelia: 00:42:26

What if you... like humans aren't particularly good, spending sixty to eighty percent of their time and it's toil and all that stuff. What if we don't burden the coding agents with understanding distributed systems plumbing? That is just under the covers and that's where we're seeing... We had a customer, in fact they spoke about it at our conference last week, where they had scheduled six months for a migration from some of their legacy applications that they had on some legacy workflow system. They had scheduled six months for the migration. They did it in three weeks.

Cory: 00:43:03

Heck yeah, they did.

Cornelia: 00:43:07

Because they used coding agents that used a Temporal skill that gave it the knowledge of what these Temporal abstractions are. Workflows, activities, described the types of things. And so it was able to write the business logic. It didn't have to write any of the plumbing code that it used to have. They were able to... like developer productivity, time to market, huge.

Cory: 00:43:31

Yeah. Are those Temporal skills also open source and can be used with the open source Temporal?

Cornelia: 00:43:36

Absolutely. Yeah, they are in our repo.

Cory: 00:43:37

Awesome. We'll include some of those in the show notes.

Yeah, awesome. Well, I know we're coming up on time. This is a super fun conversation. I love learning about what you guys are doing over there. Where can people find out more about Temporal, the open source project, the company behind it and where can they find you online?

Cornelia: 00:43:53

Yeah, so you can find me on LinkedIn. That's the social media platform that I use these days.

Cory: 00:43:59

My favorite.

Cornelia: 00:44:00

Yep, same. Cornelia Davis, Temporal. You'll find me that way. In terms of finding Temporal, you can certainly go to Temporal.io. So we are a business, we do have a SaaS offering of Temporal. So Temporal... I talked a lot about the SDKs, but there's a service element... there's a backing service to that. We have a SaaS offering of that and it runs globally, lots of different regions. We just last week announced that we've been achieving six nines. Yes, I said six nines of availability on our Temporal Cloud product.

Cory: 00:44:37

Nice.

Cornelia: 00:44:38

Yeah, it's insane. It's insane. Our SLA is either three or four nines, but we've been achieving six. It's just freaking awesome.
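For a sense of scale, the gap between three, four, and six nines can be worked out as allowed downtime per year. The short calculation below assumes a 365-day year and treats "N nines" as an unavailable fraction of 10^-N; the function name is illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000, ignoring leap years


def downtime_seconds_per_year(nines):
    # "N nines" of availability means an unavailable fraction of 10**-N.
    return SECONDS_PER_YEAR / 10 ** nines


for n in (3, 4, 6):
    print(f"{n} nines: about {downtime_seconds_per_year(n):.1f} seconds per year")
```

Under those assumptions, three nines allows roughly 8.8 hours of downtime a year, four nines roughly 53 minutes, and six nines roughly 32 seconds.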

You can start at Temporal.io. But we are first and foremost an open source company so you can find your way to all of the open source stuff there as well. The GitHub organization is "temporalio". We also have a "temporal-community" GitHub organization where you can find a whole bunch of goodness. You'll find the skills out there. We'll put... like you said... put that in the show notes. If you're on a Mac, you can brew install Temporal to get a local dev server. So you don't... it's brain dead simple. So lots of stuff.

Cory: 00:45:17

I like it. I like it. And then "Cloud Native Patterns". You can find it on Amazon... anywhere? It's published by Manning, right?

Cornelia: 00:45:25

It is published by Manning. And I'll just take a moment to like celebrate with you a little bit.

I just recently got, you know, my quarterly royalty statement and I have ticked over 10,000 copies.

Cory: 00:45:38

Heck yeah, nice. Is that a New York Times bestseller yet? Do they put tech books on bestseller lists or...?

Cornelia: 00:45:45

We don't write technical books to get rich.

Cory: 00:45:51

Awesome. Well, it's so awesome to have you on the show. Thanks so much. And check out Temporal.
