Testing at Scale with Nate Lee, Co-Founder of SpeedScale
Episode 18 • 12th January 2023 • Modern Digital Business • Lee Atchison
Duration: 00:38:37


Shownotes

Modern businesses rely on applications, and they rely on continued innovation in those applications to drive their business.

This drive for innovation creates a need for improved techniques for validating that an application will work as expected. But constant innovation means a constant chance for problems, and testing applications at scale is not an easy task. This is where SpeedScale comes into play. SpeedScale assists in stress-testing applications by recreating real-world traffic loads in a test environment.

Today on Modern Digital Business.

Useful Links

About Lee

Lee Atchison is a software architect, author, public speaker, and recognized thought leader on cloud computing and application modernization. His most recent book, Architecting for Scale (O’Reilly Media), is an essential resource for technical teams looking to maintain high availability and manage risk in their cloud environments. Lee has been widely quoted in multiple technology publications, including InfoWorld, Diginomica, IT Brief, Programmable Web, CIO Review, and DZone, and has been a featured speaker at events across the globe.

Take a look at Lee's many books, courses, and articles by going to leeatchison.com.

Looking to modernize your application organization?

Check out Architecting for Scale. Currently in its second edition, this book, written by Lee Atchison and published by O'Reilly Media, will help you build high-scale, highly available web applications, or modernize your existing applications. Check it out! Available in paperback or on Kindle from Amazon.com or other retailers.


Don't Miss Out!

Subscribe here to catch each new episode as it becomes available.

Want more from Lee? Click here to sign up for our newsletter. You'll receive information about new episodes, new articles, new books and courses from Lee. Don't worry, we won't send you spam and you can unsubscribe at any time.

Mentioned in this episode:

LinkedIn Learning Courses

Are you looking to become an architect? Or perhaps you're looking to learn how to drive your organization towards better utilization of the cloud? Are you looking for ways to help you utilize a Cloud Center of Excellence in your organization? I have a whole series of cloud and architecture courses available on LinkedIn Learning. For more information, please go to leeatchison.com/courses or mdb.fm/courses.

Transcripts

Lee:

Modern businesses rely on applications, and we rely on continued innovation in those applications to drive their business. But as these applications evolve and become more complicated, testing them also becomes more challenging. Testing applications at scale is not an easy task. Today, we're going to look at a company focused on easing the burden of testing applications at scale. Are you ready? Let's go.

Lee:

Modern businesses rely on applications, and they rely on continued innovation in those applications to drive their business. This drive for innovation creates a need for improved techniques for validating that an application will work as expected. But constant innovation means a constant chance for problems, and testing applications at scale is not an easy task. This is where SpeedScale comes into play. SpeedScale assists in stress-testing applications by recreating real-world traffic loads in a test environment. Nate Lee is co-founder of SpeedScale, and he is my guest today. Nate, welcome to Modern Digital Business.

Nate:

Thanks, Lee. Glad to be here.

Lee:

You know, I think we finally have this worked out, after a couple of delays and an internet outage. I think we're finally going to do this podcast, don't you? What do you think?

Nate:

No, it's always exciting, eventful, leading up to something like this. But yeah, with the power outages, and we're recording this between Thanksgiving and Christmas, the holiday season, everybody's kind of hectic. A lot of our customers are retail, so they're going through code freezes and, you know, holding their breath, tapping their head, rubbing their belly to make sure nothing goes down at a critical time.

Lee:

I remember those days at Amazon retail, and this time of the year was always a very busy time, and yeah, a lot of holding your breath. You didn't do much change, but everyone was really busy. It was a very busy time.

Nate:

Yeah, yeah. Actually, I've got a funny story about Amazon and the holiday rush period. We were talking to, I think it was Heavybit, one of the venture capital firms that kind of specializes in Kubernetes dev tools. And one of the gentlemen was telling us that they were at Amazon working on SRE stuff, and they were like, how are we gonna get ready for the holiday season? We have to run a gigantic load test. And it kind of speaks to the genesis of SpeedScale, right? It's very difficult to run these sorts of high-traffic situations without a perfect carbon-copy replica of production, because, you know, a lot of the load, and whether I can handle it or not, depends on having production-like hardware. They said, well, what if we run a gigantic sale? We can basically just simulate what we're gonna be encountering in production during the holiday season. And so they were like, yeah, that's a good idea. What are we gonna call it? And they decided to call it Prime Day. So when you have Amazon Prime Day, which is a pretty big deal, right? That's really just a veiled dress rehearsal for the Black Friday season and Christmas holiday shopping. But, like a few of the ideas that Amazon's put through, it actually ended up being a huge barn burner of an event.

Lee:

Yeah. Prime Day came after I left. I left Amazon in 2011, I think.

Nate:

Okay.

Lee:

Well, definitely one of the things we always used to do is we had test days where it's like, what happens if we take this data center offline? What happens when we cut this cable? We did that sort of testing in production all the time. The theory was everything should continue to work at scale, with no issues whatsoever. But we had to do it in production. You know, it's the only way to get that volume of traffic, until we have someone like SpeedScale. Why don't you tell us a little bit about exactly what SpeedScale is and what it does?

Nate:

Yeah, so SpeedScale's a production traffic replication service, and we help engineers simulate production conditions using actual traffic. You know, there's kind of been a long history of these sorts of tools. I think you were referring to Chaos Monkey, the Simian Army. I think it had come from the Netflix days, where they were randomly executing these daemons to take down services and then seeing what fails. And then of course Gremlin's got a productized version, specifically focusing on chaos, right? Running these game days and experiments to take down aspects of the servers. And I think they're tiptoeing around: how do I run these experiments but also not affect production? But SpeedScale's approach is slightly different. We actually capture the traffic and then allow you to run that traffic in a safe manner in lower environments. Another way to think about this is shifting left what you're gonna encounter in production, but doing it in a safe way in these lower environments.

Lee:

So you record production traffic and then replay it in a staging or a test environment?

Nate:

That's right.

Lee:

A lot of this is possible now because of the advent of cloud environments, right? You can spin up these ephemeral environments, and that was always a promise of the cloud: you know, just use what you need, and spin up these environments at a moment's notice. And I think the reality of it is, well, these environments are expensive. They can actually skyrocket in cost, and they don't actually stay up ephemerally; we end up keeping them on for long periods of time. Right. And people, especially given the current economic state, are looking for ways to reduce their costs. Your customers really are building modern applications, or have modern applications, modern application development. I'm talking about things like cloud native applications. They're undoubtedly cloud-based applications, where they can do these replicated environments a lot easier. So in that sort of mindset, what challenges do you find exist for your customers in managing those applications? What are some of the problems they come to you with?

Nate:

Yeah. You know, I think that's kind of the key qualifier: what do our customers come with? There are a variety of challenges in developing in the modern cloud. You know, security is always of paramount concern, and making sure that scale is proper. But our customers typically are coming to us with the specific challenge of environments, and that's something that's been kind of a common thread that we've noticed. Environments themselves aren't the issue. When I say environments, more specifically I mean the data and the downstream constraints of those environments. So they can always spin up a carbon-copy replica of production, a full end-to-end environment at a lower scale, right? But even if you do, the problem is that, (a), it's expensive, because there's so many moving parts and databases and stuff like that. And (b), if it's not seeded with the proper data that they need in order to exercise their applications, it's really quite useless. And that's exactly where the challenge is. So, how are my clients hitting my app that I'm trying to test, and how does my app send these downstream calls to the third-party backends, or the systems of record, or the other internal APIs? And what do those systems need to be seeded with, data-wise, in order to respond accurately? So capturing state and managing idempotence, it becomes a huge headache, actually. And that's one of the reasons why we developed SpeedScale: we want the engineers to be able to come into a self-service portal and understand, okay, what does my app do? How does it behave currently? And then, how do I recreate this situation in a cloud native environment without a lot of hassle?

The current state of the art is usually using a conventional tool, something that can actually run the transactions. On a very simple level, it could be something like Postman or Insomnia. At a more sophisticated level, maybe you're replaying large reams of traffic using something like k6. But again, what we typically hear is going on is you're doing those sorts of transactions and exercising your application in a full staging environment where everybody else is using it at the same time, right? And so you don't know if somebody's pushed their alpha version of an application in, and you're getting these errors because somebody is doing some tests at the same time you are, or if you truly do have a bug and you should be paying attention to it and fixing it. So, specifically: backend environments, the right source of data, and then also simulating the inbound calls into your application. Those are the challenges we typically see in modern cloud development. And it's really about having the right coverage. If you're focusing on just one area, or one type of transaction (like, you know, gold medallion members, when you're really trying to test platinum medallion members), you could be missing a lot of code coverage.

Lee:

So I imagine the typical QA development environment is kind of what you were describing, with chaos going on because everyone's doing everything in it. But, you know, in a full CI/CD pipeline, where you might have a validation-at-scale test as part of the pipeline, I imagine in that case you could afford to spin up, for a short period of time, a full-fledged production environment, and use something like SpeedScale to test the environment at scale, to make sure nothing behaves in a way you didn't anticipate. But I imagine the problem with that sort of scenario is, as you're making deployments and making changes, exactly what the script is from SpeedScale, the scripted traffic that you're getting in, will change as time goes on. How do you keep that up to date? Do you constantly take new scripted traffic and replay that? Is that how you do it?

Nate:

Yeah, yeah. So it's really kind of shifting the paradigm. The way SpeedScale was developed, we've all got backgrounds in companies like New Relic, in observability, and ITKO, which really kind of founded the concept of service virtualization, which is a fancy way to say service mocking. With that background, we inherently understood that it's really slow to develop these scripts. So we don't actually take a script-based approach in running this traffic. What we actually do is run traffic snapshots. Since we are capturing all this traffic, we develop a snapshot and generate things. One is the inbound traffic: we generate like a script, if you will. It's really just a JSON snapshot file, is what we call it. And there's no actual scripting involved; it's auto-generated from real traffic. A key point in this, for the listeners, is we are redacting PII as we are capturing the traffic, because you don't wanna be, you know, spewing sensitive information when you're replaying it. So data loss prevention is actually a very big piece of this.

But anyways, the snapshots are auto-generated. As well, from the traffic we can kind of reverse engineer what backends you need in order to exercise a particular service. So not only do we generate a traffic snapshot that you can replay for the inbound traffic, but we also generate a mock server in a pod, if you will, and that mock server in a pod can be spun up. And what this really does is vastly narrow the scope of the environment that you need to spin up. So we're actually just spinning up N plus one and N minus one: we're spinning up your API and only its neighbors, instead of the whole full-blown end-to-end environment. And so it's like a little microcosm of your API, but your API, for all intents and purposes, thinks it's in a fully integrated end-to-end environment.
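To make the snapshot idea concrete, here is a rough sketch of snapshot-based replay in general: recorded inbound request/response pairs are walked through, and live responses are diffed against the recording. The JSON schema, field names, and helpers here are invented for illustration; they are not SpeedScale's actual format.

```python
# Hypothetical sketch of snapshot-based replay; the snapshot schema and
# field names are invented, not SpeedScale's real file format.
import json

snapshot = json.loads("""
{
  "service": "checkout-api",
  "inbound": [
    {"method": "GET", "path": "/cart/42",
     "recorded_status": 200,
     "recorded_body": {"cart_id": 42, "items": 3}}
  ]
}
""")

def app_under_test(method, path):
    """Stand-in for the real service; returns (status, body)."""
    if method == "GET" and path == "/cart/42":
        return 200, {"cart_id": 42, "items": 3}
    return 404, {}

def replay(snapshot, app):
    """Replay each recorded inbound call and diff against the recording."""
    failures = []
    for call in snapshot["inbound"]:
        status, body = app(call["method"], call["path"])
        if status != call["recorded_status"] or body != call["recorded_body"]:
            failures.append(call["path"])
    return failures

print(replay(snapshot, app_under_test))  # an empty list means no drift
```

Because the snapshot is generated from captured traffic rather than hand-written, refreshing the "test" is just re-capturing, not re-scripting.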

Lee:

But you're essentially doing this at a service-by-service level versus an application level. So you're not scripting user traffic into the system; you're scripting traffic into a particular service, in and out of the service, and the data that goes with it. So you only have to bring up the service and what's around it, and you don't have to bring up the entire application.

Nate:

Well, you really only need to bring up just the app, and SpeedScale's taking care of the rest, really. Yeah. So we're scripting all the inbound traffic for you; there's no scripting involved on your side. We have what's called the traffic viewer, and you use that to browse the type of traffic you want to invoke. And once you select the traffic that you wanna invoke, we basically take a look at all the traffic around it and say, okay, well, when you run this call inbound, as a result of that, your application calls, you know, a Dynamo database and then these other two APIs, and then you make a call to a third party, I don't know, let's say Stripe or Google Maps or something like that. And so we will automatically generate a mock server, based off of reverse engineering how your app works, and make sure everything is there that you need. So yeah, you've got it. The concept is we're virtualizing your neighbors so that you can do consistent, scientific dry runs of your API as part of CI. But it's also a huge reduction in cloud costs, because you're not spinning up a big end-to-end environment of literally everything that is included in your app every time.

And, to be honest, that's also sometimes not possible, because of the connections that you do have to third parties. Almost everybody integrates with, like, a payment provider, or maybe a background check organization, or a mapping or messaging solution that's a third party. And so many times these wires that hang out of the cloud, as I call them, those are difficult to simulate. You have to call the vendor and ask for a sandbox. And if you wanna do a load test, forget it. That's not gonna go to a hundred TPS, right? They're just standing up the sandbox to give you, you know, onesie-twosie transactions, whereas...

Lee:

...or performance testing, or anything like that.

Nate:

Exactly. Exactly. But if we're simulating that, using traffic to auto-generate a mock server in a pod for you, the possibilities go up.
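The mock-server-in-a-pod idea can be sketched like this: recorded outbound calls become a lookup table, and the mock answers from the recording instead of calling the real third party. The recording format, endpoints, and helper names below are hypothetical, not SpeedScale's implementation.

```python
# Minimal sketch of a mock backend generated from recorded outbound traffic.
# The recording structure and example endpoints are invented for illustration.
recorded_outbound = {
    ("POST", "payments.example.com/v1/charge"): (200, {"status": "succeeded"}),
    ("GET", "maps.example.com/v1/geocode?q=atlanta"): (200, {"lat": 33.749, "lng": -84.388}),
}

def mock_backend(method, url):
    """Answer from the recording instead of calling the real third party."""
    try:
        return recorded_outbound[(method, url)]
    except KeyError:
        # No recorded traffic covers this call, so the mock can't answer it.
        return 501, {"error": "no recorded traffic for this call"}

# The app under test is pointed at mock_backend instead of the real vendors,
# so only the service itself (plus this mock) needs to be spun up.
status, body = mock_backend("POST", "payments.example.com/v1/charge")
print(status, body)
```

Since the mock is just a lookup, it can serve far more requests per second than a vendor sandbox, which is what makes load testing against "third parties" feasible.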

Lee:

Cool, cool. Now, I can see how this works for APIs. And you can include database activity, such as to DynamoDB, which is essentially an API call. But what about native databases, or native data that's stored in the service? You know, like a MySQL database might be part of the service, or a cache, a Redis cache or something like that. Do you simulate those as well, or what do you do in those cases?

Nate:

Yeah, so for Redis we can actually see the traffic going through it, but we can't simulate it. For other data sources, we do have ongoing support developing for things like Postgres and MongoDB. We've got the full list of supported technologies on our documentation page, which is docs.speedscale.com. But really, the beauty with being able to provision these backends is if they're API-based, right? Usually it's all fairly standard. If you communicate to a system of record via API, we can also handle that, something like Elasticsearch, for example. But if it is a local data source, or something like MySQL (sorry, MS SQL) that's got a proprietary, non-open standard, you would probably wanna provision that locally, by yourself, as part of that kind of simulated microcosm. So with most of these cloud native environments, you can specify either, you know, the environment script or the YAMLs to properly stand those things up, in addition to the SpeedScale simulations.

Lee:

Makes sense. Let's talk about resiliency a little bit. You know, resiliency is an interesting aspect when it comes to cloud-based applications, because built into the DNA of the cloud is that the cloud is designed to break, right? I mean, the whole fundamental aspect of the cloud is, if a server isn't working right, just terminate it and restart it. And that mindset extends throughout the entire cloud ecosystem, where everything is designed with retries and with redundancy built in, so that you can lose components; components can go away and come back, and your entire system as a whole continues to work. What does SpeedScale do to help with that sort of resiliency testing? Are there ways you can simulate those sorts of environments?

Nate:

Yeah, yeah, to an extent. I mean, well, first of all, before I jump into that, I think a lot of people have kind of a false level of comfort with the resiliency that's inherently built into the cloud. I think what people realize is, oh, look, the startup times of the Lambda serverless instances are actually quite long, and how do we get past that, right? Or, hey, horizontal pod autoscaling rules actually take quite a while to understand that a pod is down and then spin up another pod. It waits and it retries a couple of times, and meanwhile, you know, you're bleeding thousands of dollars because your mobile ordering app is down. So I think it's a little bit of a false sense of comfort, or protection. And that's what we can really help simulate. And what we do with that is, again, capturing traffic in order to understand how users run your application. But once we do have that traffic, engineers can multiply it, and that empowers the engineers to run these what-if scenarios: what if I had a hundred x traffic? Or what if I had, you know, a thousand x traffic for 30 seconds, and ran more of a soak or sanity test? These are all things that are available with a few mouse clicks once we have that baseline of traffic. The traffic captures kind of how your application is exercised, and as well, we've got the necessary backends ready to be spun up in a mock server. So it's kind of like a turnkey simulation that you can run. And so when people do have DR rules or HPA rules, they can actually verify that things are going to fail over as expected, or scale as expected.

Another aspect within resiliency that simulation can help catch is your resource consumption. So if you're making logic changes to your services, or you make this calculation change and for some reason, let's say, it causes CPU to skyrocket, or you've got a memory leak in your code and it begins to rise over time. The state of the art in catching issues like that really is to just go ahead and release, and then pay really close attention to Datadog or New Relic or AppDynamics, right? And rely on those observability tools to give you an early warning. And then it's kind of all hands on deck reacting, or trying to shut down that pod over and over again whenever it starts creeping up. Those sorts of changes can actually be proactively caught by running these traffic simulations. So by simulating the inbound traffic and the mock server pods, those are your controls, and really the only thing that changes is your application as you make changes. And that's another kind of reason not to use these crowded, chaotic staging environments: there's so much noise in the system, and other people are doing things, and staging can break quite frequently. And I know you've actually written about this.

Lee:

Yep.

Nate:

And so that's another kind of argument for using these production simulations in a very kind of sterilized lab environment, if you will, where, you know, the only thing that's changing is your code. So it's a way to consistently iterate, experiment, and make changes. And that's another way you could improve your resiliency. You can make sure that you're optimizing all the resources at hand, and you're not, you know, irresponsibly allocating memory and then just hoping that horizontal autoscaling rules or the cloud's scalability will cover for you. Right? You might not be economical with your code.
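The "what if I had a hundred x traffic" idea above reduces to replaying a recorded snapshot with a multiplier and watching latency against the mocked, noise-free backends. A minimal sketch, with a stubbed-out service standing in for the real one and invented helper names:

```python
# Sketch of multiplying recorded traffic for a load run. The service here is
# a local stub; in practice the replayed calls would go over the network.
import time

recorded_calls = ["/cart/1", "/cart/2", "/cart/3"]  # hypothetical snapshot

def service(path):
    """Stand-in for the API under test."""
    time.sleep(0.001)  # pretend each call takes about a millisecond
    return 200

def load_run(calls, multiplier):
    """Replay the snapshot `multiplier` times; report call count and worst latency."""
    latencies = []
    for _ in range(multiplier):
        for path in calls:
            start = time.perf_counter()
            assert service(path) == 200
            latencies.append(time.perf_counter() - start)
    return len(latencies), max(latencies)

count, worst = load_run(recorded_calls, multiplier=100)
print(f"{count} calls, worst latency {worst * 1000:.1f} ms")
```

Because the mocks and the replayed traffic are held constant between runs, a jump in latency, CPU, or memory between two runs points at the one thing that changed: the application code.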

Lee:

Right, right. That makes sense. You can also do controlled failures too, right? You can do game day testing, if you will, during these simulation runs, so you can see what happens: your normal traffic works fine, but what happens if three servers go down while that's going on? The DR rules you're talking about certainly cover that, but this is kind of a way of injecting what-if scenarios, and getting useful information that you can feed back to the development org: hey, it didn't quite work the way we expected it to in this scenario. What if we changed the rules a little bit and adjusted, so there's a higher likelihood of success?

Nate:

That's right. Yeah. We can generate the inbound traffic into, you know, just an API, but you can also use that in isolation. You can use our traffic generation capabilities to hit you at the front door, like an ingress or an API gateway, and test your entire application. So you can actually piecemeal out the solutions: we've got the traffic generation piece and the mock server piece. Some people spin up our mocking pod and just leave it on full-time, because they need to simulate the third-party components. That's the cool part about having the traffic patterns as a snapshot: once we do have the traffic, we can play with the traffic. We can start to slow things down. So we can say, hey, we're mocking Stripe. What if Stripe goes down? Then we can just tell that traffic replay to be a black hole and not respond. We can also tell it to respond with 20-second latency. And then you can start checking: does my application time out gracefully? Does it wait the whole time? We can also speed up the traffic. I've actually heard of cases of applications failing because the backends get improved and start responding faster, and then your application becomes the bottleneck and starts crashing.

Lee:

So even as a development tool, right? You know, when you're trying to build your application and build the resiliency in, or you're trying to build what-if scenarios in, you can take the scripted traffic in your development environment and fool around with it and do different things there. I'm assuming these are all rational use cases for SpeedScale, correct?

Nate:

That's right, yeah, exactly. They're out of the box, kind of. And again, just to re-emphasize: while under the hood we are developing, you know, JSON and scripts and stuff, there's no scripting involved. It's literally just a UX dashboard where you peruse all the API-level calls that we've been picking up and desensitizing. And you can see basically the ins and outs of all the traffic of a particular API you're trying to test. You tell us, hey, I wanna generate a snapshot, and I want this snapshot to have this set of inbound traffic that you're gonna rerun, and also this set of mocked traffic that I wanna run, and you get this kind of turnkey ephemeral environment, a lab environment that you can run over and over again. If production happens to update, then you can just go out and grab another slice of traffic, right? The paradigm's completely changed now. There's no scripting involved. There's no maintenance of the script, and updating the script, like a normal testing organization has to do. It's literally: go out and grab a new snapshot, wait two minutes for it to be auto-generated, and then run that new snapshot. Or it can be automated via GitHub or an API call. You can say, hey, grab the last 15 minutes of traffic, run it again. And it can all be done as part of the CI pipeline as well.
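A replay step wired into CI generally reduces to: refresh the snapshot, replay it, and fail the build if behavior or latency regresses. The report fields, thresholds, and helper names in this sketch are invented for illustration, not SpeedScale's API:

```python
# Sketch of a CI gate around a traffic replay. The report structure and
# thresholds are hypothetical; the point is failing the pipeline on regression.
import sys

def run_replay():
    """Stand-in for 'replay the last 15 minutes of captured traffic'."""
    return {"calls": 1200, "mismatches": 0, "p95_latency_ms": 140.0}

def gate(report, max_mismatch_rate=0.01, max_p95_ms=250.0):
    """Pass only if response drift and tail latency stay within budget."""
    mismatch_rate = report["mismatches"] / report["calls"]
    return mismatch_rate <= max_mismatch_rate and report["p95_latency_ms"] <= max_p95_ms

report = run_replay()
if not gate(report):
    sys.exit(1)  # fail the CI stage on drift or latency regression
print("replay gate passed")
```

A nonzero exit code is all most CI systems need to block the merge, so the same gate works whether it's triggered from a pipeline stage or an API call.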

Lee:

Yeah. So one of your use cases is, like you said, CI/CD pipelines. Another use case is development. Another use case, I'm assuming, is QA departments who just want to see what-happens-if scenarios, and they just poke around and make changes dynamically, just to try to see what's going on. Whether that's a QA department, as I said, or the development organization going through a QA process, it doesn't matter; it's a step to validate. So those are like three distinct use cases: an automated pipeline, QA doing random testing, and development using it to harden the application, or even as part of the development process itself. Are there use cases that are not represented by those three that this is useful for?

Nate:

Yeah. Yeah. Within those three use cases, I guess you could break it up into specific phases of testing. The traffic replays can really be curated in a way where you're checking for functionality or contract changes, right? You can look at it more as an integration test. You can also multiply the traffic and look at it more as a load test. So that's where the concept gets interesting: load testing at a regular interval as part of CI. I've heard people call it performance assurance; I've heard people call it continuous performance testing. And really, the linchpin to all of that is the mocks, because when you're doing load testing, typically everybody has to be finished with their application code — their particular piece of it, right? And then they have to curate a performance environment that's, you know, one tenth the size of staging, so they can extrapolate the results and multiply by ten. Now, if we're mocking the back ends and they're performing and they can do a thousand TPS, then really that constraint goes away.
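To make the traffic-multiplication idea concrete, here is a minimal, hypothetical sketch in Python — not SpeedScale's actual implementation — of turning one recorded capture into an N-times replay schedule. The `RecordedCall` type, the field names, and the interleaving stagger are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class RecordedCall:
    ts: float    # seconds since the start of the capture
    method: str
    path: str

def multiply_traffic(capture, factor):
    """Build a replay schedule containing `factor` copies of the capture.

    Each copy is offset by a tiny stagger so duplicated requests
    interleave rather than arriving as simultaneous bursts.
    """
    stagger = 0.001  # seconds between copies of the same request
    replay = [
        RecordedCall(call.ts + copy * stagger, call.method, call.path)
        for copy in range(factor)
        for call in capture
    ]
    return sorted(replay, key=lambda c: c.ts)

# One recorded capture becomes a 10x load test against the same endpoints.
capture = [
    RecordedCall(0.0, "GET", "/payments/123"),
    RecordedCall(0.5, "POST", "/payments"),
]
replay = multiply_traffic(capture, 10)
```

A real traffic generator would also replay request bodies and headers and pace the schedule against a wall clock; the point here is only that a capture plus a multiplier yields a load test.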

Nate:

And now you can understand, well, this one piece — this payment API or this fulfillment API I'm working on — needs to go up to 800 transactions per second. You can do that without having to wait for the full end-to-end environment, without having to tell the DBA, hey, I'm gonna be hammering the database, you know, please don't get mad at me, kind of thing. And so that can all be done in a self-service way. Now, you've written about all these different microservice teams that are disparate and siloed, but they all have to be communicating tightly, right? And you've written about the ability for them to have some sort of self-service way to understand how they interconnect with each other, and also understand the integrations, and then spin up these environments. And SpeedScale literally does that: it allows somebody to jump into this API or that API, view the traffic, and we'll show them a service map. Then they say, well, I run this, I exercise my application — and they can actually just grab the traffic that's relevant to them. And so in that way, beyond just the CI and the development enablement and the QA, kind of what-if testing that they can do, they can also take that traffic and point it at different endpoints, right? So they can actually do performance benchmarking.

Nate:

One of the stories that we've had from a customer is, you know, Google came out with a new Graviton processor, and they were like, well, is that really gonna be any faster than what we're currently on? And so they were able to benchmark it: well, this is business-as-usual traffic, let's test on the Google Graviton processors. And they did find out that there was, like, an X percent faster throughput. So yeah, you can use it to benchmark in a conventional load testing sense.

Nate:

There's also the use case that I call parity testing — checking for parity when you're doing migrations, like from EC2 to Kubernetes. If your application fundamentally is gonna remain the same, but you're just re-platforming, you could capture business-as-usual traffic coming into your EC2 app. And then once you're done re-platforming — moving to Kubernetes — you can do a sanity check before you fork all the traffic over and kind of do the grand opening. You can take the old traffic that you would normally get on EC2, run it against the Kubernetes platform, and say, hey, am I getting the same response times? Are things scaling properly? Did the functionality get preserved as we moved over?
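The parity check Nate describes — replaying the same recorded traffic against the old and new platforms and comparing the results — can be sketched as a simple diff. This is an illustrative Python sketch, not SpeedScale's report format; the function name, field layout, and tolerance are assumptions.

```python
def parity_report(baseline, candidate, latency_tolerance=1.25):
    """Compare per-endpoint results from an old and a new environment.

    `baseline` and `candidate` map endpoint -> (status_code, latency_ms).
    Flags endpoints whose status changed, or whose latency regressed by
    more than `latency_tolerance` (e.g. 1.25 means 25% slower).
    """
    issues = []
    for endpoint, (old_status, old_latency) in baseline.items():
        new_status, new_latency = candidate.get(endpoint, (None, None))
        if new_status != old_status:
            issues.append((endpoint, "status changed"))
        elif new_latency > old_latency * latency_tolerance:
            issues.append((endpoint, "latency regressed"))
    return issues

# Replay results from the EC2 baseline vs. the new Kubernetes platform.
ec2 = {"/checkout": (200, 80.0), "/search": (200, 40.0)}
kubernetes = {"/checkout": (200, 85.0), "/search": (500, 38.0)}
report = parity_report(ec2, kubernetes)  # flags "/search": status changed
```

A real comparison would also diff response bodies and headers, but the go/no-go gate before the "grand opening" is the same shape: replay, compare, and only cut traffic over when the report is empty.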

Nate:

The last piece, actually, is in particular when you're doing, like, Docker environment development — where you can run Docker locally on your laptop, or Docker Compose, Minikube, MicroK8s, that kind of thing. All of these concepts — all these mock server pods and traffic generator pods — can actually be spun up locally on your laptop. So now there's an argument for, hey, I don't need the full-blown end-to-end environment. I can just simulate my neighbors, get SpeedScale to generate those pods, and then run them locally on my laptop.

Lee:

One of the biggest complaints I hear about microservice architectures is that the laptop development environment is so difficult to set up and manage, and this is a tool that'll help make that a lot easier.

Nate:

Yep. Yep.

Lee:

Can this be done offline as well, or is it still only an online tool?

Nate:

So we're about to launch a command line version of this that is, uh, is it,

Nate:

it doesn't require an internet connection.

Nate:

So it'll be free and you can generate those pods and then run them locally.

Nate:

In like a mini cube environment, something like that.

Lee:

So you talked a little bit about what motivated you to start SpeedScale — but why you? Why did you start SpeedScale? You were writing device drivers.

Nate:

And Matt had actually developed — it was basically, what would we call it? — like a visual driver development kit that allowed us to develop these drivers more quickly. And then Ken developed a simulator that was kind of like a stub code harness: you could drop the driver in and it would test the inputs and outputs to it. So all three of us have kind of been in this mindset of, you know, better testing, faster development. Those two got into the observability space first with Wily, and then New Relic and Observe Inc. Meanwhile, I took a different path. I had actually been with ITKO — actually, Ken worked at ITKO; he's the one who pulled me in. But ITKO had developed this concept of service virtualization, and that was back in the SOA days.

Nate:

There was just a huge mix of legacy queuing technologies, like MQ and TIBCO and AMQP, and then, you know, SOAP services were just becoming a thing. So developing these service mocks was a hugely complex affair. You had to redirect app servers and bounce them, and do a lot of networking to get these mocks up and running. And we had always been enamored with the concept, but really dissatisfied with the process of developing these service mocks, 'cause done properly, they're a huge enabler. They're a huge value add, 'cause they can accelerate the dev process: you can develop in parallel, you can simulate all these conditions, and so on and so forth.

Nate:

But service mocks kind of got a bad reputation, because you usually have to hand-script the responses one by one. If you want a back end to simulate whatever, you have to seed it with the right data. It has to be programmed to respond onesie-twosie.

So, uh, this is a long-winded way of saying, um, when the cloud

Nate:

came about Kubernetes and cloud data warehouse storage, realized,

Nate:

oh, we can do this very quickly.

Nate:

There's proxies, uh, there's always, there's already network taps that we

Nate:

can take advantage of, and then we can use the traffic to train the models.

Nate:

once the, the mock pods and, and the inbound traffic, uh, can, can

Nate:

be simulated, the rest of it is just an orchestration problem.

Nate:

And, you know, with Terraform Scripts, helm charts and yaml, all that stuff

Nate:

is pretty, pretty well known as well.

Nate:

So, It was, it was a matter of desire and background.

Nate:

And then the, the cloud data technology has actually just been a huge enabler, so,
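The shift Nate describes — from hand-scripting mock responses "onesie-twosie" to training mocks from observed traffic — can be sketched in a few lines. This is a toy illustration in Python, not SpeedScale's mock engine; the class and method names are assumptions.

```python
class TrafficTrainedMock:
    """A stand-in backend that learns its responses from recorded traffic."""

    def __init__(self):
        self._recordings = {}

    def observe(self, method, path, status, body):
        # In a real system this would be fed by a proxy or network tap;
        # here we record a (method, path) -> response mapping directly.
        self._recordings[(method, path)] = (status, body)

    def handle(self, method, path):
        # Replay the recorded response, or 404 if this call was never seen.
        return self._recordings.get((method, path), (404, "no recording"))

# "Training": capture real traffic once against the actual dependency.
mock = TrafficTrainedMock()
mock.observe("GET", "/inventory/42", 200, '{"sku": 42, "in_stock": true}')

# Replay: the service under test now talks to the mock instead of the
# real back end, with no hand-scripted responses.
status, body = mock.handle("GET", "/inventory/42")
```

A production version adds request matching on headers and bodies, templated dynamic fields, and runs as a pod next to the service under test — but the core observe-then-replay loop is the same.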

Lee:

Yeah. I've known Ken for many years, but I'm so glad I met you guys, and I'm really excited to see what you're going to accomplish as you go along. So, the natural question that always comes up at this time of year is: what does next year look like? What are your plans for next year? What are you gonna do in 2023, and what does SpeedScale look like in 2023?

Nate:

Yeah, you know, we've been working with some great partners in 2022, really refining the ergonomics of the product, and it's been a huge kind of developer productivity accelerator. In 2023, we're gonna release a free version of SpeedScale. We know what aspects people love; we just wanted to be careful about understanding where the real, exceptionally useful features are, which of those features could be command line driven, and which actually need a full-blown UI. So the freemium tool is gonna be mostly command line based. But once you start needing, you know, enterprise-level things like single sign-on, more sophisticated redaction, and visual reports, that's when you would have a paid tier. So we expect the free tier to be a great value add for engineers that need mocking and traffic generation. And then there's also gonna be a lot of momentum around publicizing SpeedScale from a marketing perspective. We hope to really listen to the engineering community, understand where we can provide the most lift, and iterate quickly to develop those features. But already we're getting stories of taking two-week load testing sprints down to three hours and improving API performance by 30x, and we just wanna continue that.

Lee:

So if any listener is interested in learning more about SpeedScale, where should they go?

Nate:

Yeah, they can just go to SpeedScale.com — spelled exactly like it sounds, one word. We also have a community on Slack at SpeedScale.com, where they can talk directly to the founders or the engineers and ask questions. And then if you go to SpeedScale.com/free-trial, they're able to download the product and try it locally.

Lee:

And I'll make sure those links are in the show notes as well, so people can see 'em there. Great. So, anything else you wanna add before we wrap it up here? We managed to make it all the way through the episode without losing the internet again — that's fantastic.

Nate:

No, no, that was it. Always a pleasure to talk and, you know, kind of commiserate over the technical problems of the modern cloud with you, Lee. It's always great.

Lee:

Definitely. I love talking with you, Nate. Thank you. My guest today has been Nate Lee, co-founder of SpeedScale. Nate, thank you very much for joining me on Modern Digital Business.