In this second part of our team talk series on live load testing with Goose, we focus on demonstrating load testing using a Gaggle. A Gaggle is a distributed load test running Goose from one or more servers. Here, we’re testing with 20,000 users using ten Workers and a Manager process on servers spun up using Terraform.
CEO Jeremy Andrews, the creator of Goose; Fabian Franz, VP of Software Engineering; CTO Narayan Newton; and Managing Director Michael Meyers demonstrate running a Goose Gaggle and discuss how variations on these load tests change testing results, and what you can expect from a Gaggle. Our goal is to prove to you that Goose is both the most scalable load testing framework currently available, and the easiest to scale.
Narayan is a key member of the Drupal.org infrastructure team, responsible for ensuring the site stays up under load or attack. Load testing is a critical part of ensuring the continued success of Drupal.org, or any website.
For more Goose content, see Goose Podcasts, Blogs, Presentations, & more!
Hello, and welcome to Tag1 Team Talks, the podcast and blog of Tag1 Consulting.
Speaker:Today, we're going to be doing a distributed load testing how-to: a deep dive into
Speaker:running a Gaggle with Tag1's open source Goose load testing framework.
Speaker:Our goal is to prove to you that Goose is both the most scalable load
Speaker:testing framework currently available, and the easiest to scale.
Speaker:We're going to show you how to run a distributed load test yourself,
Speaker:and we're going to provide you with lots of code and examples to make
Speaker:it easy for you to do this on your own.
Speaker:I'm Michael Meyers, the managing director at Tag1, and joining
Speaker:me today is a star-studded cast.
Speaker:We have Jeremy Andrews, the founder and CEO of Tag1, who's also the
Speaker:original creator of Goose; Fabian Franz, our VP of Technology, who's made
Speaker:major contributions to Goose, especially around performance and scalability;
Speaker:and Narayan Newton, our CTO, who has set up and put together all the
Speaker:infrastructure that we're going to be using to run these load tests.
Speaker:Jeremy, why don't you take it away?
Speaker:Give us an overview of what we're going to be covering and let's jump into it.
Speaker:Yeah.
Speaker:So last time we explored setting up a load test from a single
Speaker:server and confirmed that Goose makes great use of that server.
Speaker:It leverages all the CPUs and ultimately pushes as far as it
Speaker:can until the uplink slows it down.
Speaker:So today what we're going to do is use a feature of Goose called
Speaker:a Gaggle, which is a distributed load test.
Speaker:If you're familiar with Locust, it is like a swarm.
Speaker:The way this works with Goose is you have a manager
Speaker:process that you kick off, and you say, I want to simulate 20,000 users and I'm
Speaker:expecting 10 workers to generate this load.
Speaker:The manager process prepares things, and all the workers then connect
Speaker:in through a TCP port, and it sends each of them a batch of users to run.
Speaker:The manager coordinates the start, so each of the
Speaker:workers starts at the same time.
Speaker:And then they send their statistics back to the manager so that you can
Speaker:actually see what happened in the end.
Speaker:What this nicely solves is, if your uplink can only do so much traffic,
Speaker:or if you want traffic coming from multiple regions around the world, you
Speaker:can let Goose manage that for you across all of these different servers.
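As a rough sketch of what that looks like on the command line (flag names as documented by Goose; the target site and the manager hostname here are placeholders), the manager and each worker run the same load test binary:

```
# On the manager: expect 10 workers and coordinate 20,000 users total.
cargo run --release -- --manager --expect-workers 10 --users 20000 --host https://example.com/

# On each of the 10 workers: connect to the manager's TCP port (5115 by default).
cargo run --release -- --worker --manager-host goose-manager.example.com --manager-port 5115
```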
Speaker:So today Narayan has set up a pretty cool test where we're going to
Speaker:be spinning up a lot of workers, and he can talk about how many.
Speaker:Each one is not going to be working too hard.
Speaker:They'll run maybe a thousand users per server,
Speaker:which means it'll be at least 50% idle.
Speaker:It won't be maxing out the uplink on any given server.
Speaker:But in spite of that, we're going to show that working together in a Gaggle
Speaker:we can generate a huge amount of load.
Speaker:So now, Narayan, if you can talk about what you've set up here.
Speaker:Sure.
Speaker:So what I built today is basically a simplistic Terraform tree.
Speaker:What is interesting about this is that we wanted to distribute the load between
Speaker:different regions, and for those people who have used Terraform in the past,
Speaker:that can be slightly odd, in that you can only set one region for each AWS
Speaker:provider that Terraform uses to spin things up.
Speaker:So how we've done this is to define multiple providers, one
Speaker:for each region, and a module that spins up our region workers.
Speaker:And we basically initialize multiple versions of the module,
Speaker:passing each a different region.
Speaker:So in the default test, we spin up 10 worker nodes in various regions:
Speaker:the Western part of the United States, the Eastern part of the
Speaker:United States, Ireland, Frankfurt, India, and Japan.
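A minimal sketch of that per-region provider and module pattern (the regions, counts, and module names below are illustrative, not the exact files from the repository):

```hcl
# Terraform allows only one region per AWS provider, so define an aliased
# provider for each region we want to generate load from.
provider "aws" {
  alias  = "us_west_2"
  region = "us-west-2"
}

provider "aws" {
  alias  = "eu_central_1"
  region = "eu-central-1"
}

# Instantiate the same worker module once per region, passing in the
# matching provider and a worker count.
module "workers_us_west_2" {
  source       = "./region-worker"
  worker_count = 2
  region       = "us-west-2"
  providers = {
    aws = aws.us_west_2
  }
}

module "workers_eu_central_1" {
  source       = "./region-worker"
  worker_count = 2
  region       = "eu-central-1"
  providers = {
    aws = aws.eu_central_1
  }
}
```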
Speaker:With how the test currently works (it's the load testing truss,
Speaker:which is what we decided to call it), it's a little limited, because once
Speaker:you start it, you can't really interact with the workers themselves.
Speaker:They start up, they pull down Goose, and they run the test.
Speaker:The next revision of this would be something that has a clustering agent
Speaker:between the workers, so that you can actually interact with the workers
Speaker:after they start. It gets very annoying to have to run Terraform to stand up
Speaker:these VMs all over the world, and then, when you want to make a change to them,
Speaker:you have to destroy all of them and then relaunch them, which isn't terrible.
Speaker:But as a testing sequence, it adds a lot of time, just because it takes
Speaker:time to destroy and recreate these VMs.
Speaker:So the next revision of this would be something other than Goose
Speaker:creating a cluster out of these VMs.
Speaker:How it currently works is that we're using Fedora CoreOS so that we have
Speaker:a consistent base at each location, and so I could only send it a single
Speaker:file for initialization.
Speaker:Fedora CoreOS then pulls down a container that has the Goose load test
Speaker:and a container that has a logging agent, so that we can monitor the workers
Speaker:and send all the logs from the Goose agents back to a central location.
Speaker:I had a quick question, Narayan. The basic setup is that we have EC2
Speaker:instances on AWS, and then we run containers on them, like normal
Speaker:Kubernetes, or how is it working?
Speaker:It's using Docker.
Speaker:That is the big thing that I want to improve,
Speaker:and I almost got there before today.
Speaker:What would be nicer is if we could use one of the IoT distributions or
Speaker:Kubernetes-at-the-edge distributions to run a very slim version of Kubernetes
Speaker:on each worker node, so that we get a few things.
Speaker:One is cluster access, so we can actually interact with the cluster, spread
Speaker:load, and run multiple instances of Goose.
Speaker:It would be interesting to pack multiple instances of Goose on things
Speaker:like the higher-end nodes, and also to be able to actually edit the cluster
Speaker:after it's up and not have to destroy it and recreate it each time.
Speaker:The other thing is to get containerd and not Docker,
Speaker:just because there are some issues that you can hit with that.
Speaker:As it stands right now, CoreOS ships with Docker running, and that's how
Speaker:you interact with it for the most part, through systemd and Docker. You could
Speaker:also use Podman, but I ran into issues with that for redirecting the logs.
Speaker:So we are actually using Docker itself, and Docker is just running
Speaker:the container as you would in a local development environment.
Speaker:So what we are missing from a standard Kubernetes deployment that
Speaker:we would normally have is the ability to deploy a new container.
Speaker:You were saying that if I want to deploy a new container with this
Speaker:simplistic infrastructure right now, I need to shut down the EC2
Speaker:instances and then start them up again?
Speaker:Okay.
Speaker:So that's the thing. Before this test, Jeremy released
Speaker:a new branch with some changes to make this load test faster on startup.
Speaker:What I did to deploy that was run terraform destroy, wait for it to kill
Speaker:all the VMs across the world, and then terraform apply and wait for it to
Speaker:recreate all those VMs across the world.
Speaker:And that is a management style, honestly, but in this specific case,
Speaker:because we're sometimes doing micro iterations, it can get really annoying.
Speaker:Yeah, for sure.
Speaker:No, no, that makes perfect sense.
Speaker:I just wanted to understand, because I was thinking, in this container world
Speaker:you can just deploy a new container, but obviously you need a manager for that.
Speaker:Yes.
Speaker:Yes.
Speaker:I could totally deploy a new container.
Speaker:So what I could do is have Terraform output the list of IPs, and then I can SSH
Speaker:to each of them and pull a new container.
Speaker:But at that point...
Speaker:But seriously, there's another Git repository that I have started.
Speaker:The version of this uses a distribution of Kubernetes called
Speaker:K3s that is designed for CI systems, IoT, and deployments to the edge.
Speaker:It's a single-binary version of Kubernetes where everything is
Speaker:wrapped into a single binary; it starts on edge nodes and then can connect
Speaker:them all together, and so we could have a multi-region global cluster
Speaker:of these little Kubernetes agents.
Speaker:And then we could spin up Goose instances on that.
Speaker:And that, I think, will actually work.
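For reference, bootstrapping a K3s cluster is typically just the upstream install script run on each node (a hedged sketch; the server address and token below are placeholders):

```
# On the node acting as the K3s control plane:
curl -sfL https://get.k3s.io | sh -

# On each edge/worker node, join the cluster by pointing at the server:
curl -sfL https://get.k3s.io | K3S_URL=https://k3s-server.example.com:6443 K3S_TOKEN=<node-token> sh -
```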
Speaker:You totally blew my mind.
Speaker:So now you've just signed up for a follow-up to show that,
Speaker:because that's what you want, actually. But now I'm
Speaker:really curious: how does this Terraform configuration actually look?
Speaker:Can you share a little bit about it?
Speaker:So this is the current tree.
Speaker:If everyone can see that, it's pretty simplistic.
Speaker:So this is the main file that gets loaded.
Speaker:And then for every region, there's a module that is named after its region.
Speaker:They're all hitting that same actual module; these are just different
Speaker:instances of this module.
Speaker:And then each takes a worker count and its region and its provider,
Speaker:and the provider is what actually separates them into regions.
Speaker:And then if you look at the region worker, which is where most of these
Speaker:things are happening, there's a variables file, which is interesting
Speaker:because I have to define an AMI map, because every region has a different
Speaker:AMI, because the regions are disparate.
Speaker:There's no consensus between these regions for images.
Speaker:So one of the reasons I picked CoreOS is because it exists in each of these
Speaker:regions and can handle a single start-up file.
Speaker:When we do the K3s version of this, K3s can run on Ubuntu,
Speaker:and Ubuntu obviously exists in all these regions as well,
Speaker:but I'll still have to do something like this. Or there's another way I can
Speaker:do it, but this was the way to do it for CoreOS.
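As an illustration of that variables file (the AMI IDs below are placeholders; every region publishes its own Fedora CoreOS image ID):

```hcl
variable "region" {
  type = string
}

variable "worker_count" {
  type    = number
  default = 1
}

variable "instance_type" {
  type    = string
  default = "m5.large"
}

# Map each region to its own Fedora CoreOS AMI; the IDs here are placeholders.
variable "coreos_amis" {
  type = map(string)
  default = {
    "us-west-2"    = "ami-00000000000000001"
    "us-east-1"    = "ami-00000000000000002"
    "eu-west-1"    = "ami-00000000000000003"
    "eu-central-1" = "ami-00000000000000004"
  }
}
```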
Speaker:And then we set the instance type; this is just a default.
Speaker:And then the main file of this is very simple.
Speaker:We initialize our key pair, because I want to be able to SSH into these
Speaker:instances at some point, and upload it to each region.
Speaker:We initialize a simple security group that allows me to SSH in, in each region.
Speaker:And then a simple instance that doesn't really have much; it
Speaker:doesn't even have a large root device, because we're not using it at all.
Speaker:Basically we're just spinning up a single container and then pushing the logs
Speaker:to Datadog, which is our central logging service, so even the logs aren't
Speaker:being written locally.
Speaker:On that instance we associate a public IP address,
Speaker:we spin up the AMI (we look up which AMI we should use based on our region),
Speaker:and then we output the worker address.
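Pieced together, the region worker module ends up looking roughly like this (a hedged sketch rather than the exact file from the repository; resource names and the Ignition wiring are illustrative):

```hcl
resource "aws_key_pair" "worker" {
  key_name   = "goose-worker"
  public_key = file("~/.ssh/id_ed25519.pub")
}

# Only SSH needs to be reachable on the workers.
resource "aws_security_group" "worker" {
  name = "goose-worker"

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_instance" "worker" {
  count                       = var.worker_count
  ami                         = var.coreos_amis[var.region]
  instance_type               = var.instance_type
  key_name                    = aws_key_pair.worker.key_name
  vpc_security_group_ids      = [aws_security_group.worker.id]
  associate_public_ip_address = true

  # Fedora CoreOS reads its Ignition config from user_data on first boot.
  user_data = file("${path.module}/worker.ign")
}

output "worker_addresses" {
  value = aws_instance.worker[*].public_ip
}
```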
Speaker:So the other part of this is the manager.
Speaker:The only real difference is that, while we basically spin it up the exact same
Speaker:way, we also allow the Goose port, which is 5115, and we spin up a DNS
Speaker:record that points to our manager, because that DNS record is what all the
Speaker:region workers are going to point at.
Speaker:And we make use of the fact that they're all using Route 53,
Speaker:so this update propagates really quickly.
Speaker:And that's basically it; it's pretty simple.
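In sketch form, the manager's additions amount to opening the Gaggle port and publishing a Route 53 record the workers can resolve (illustrative names; the zone, security group, and instance references are placeholders):

```hcl
# Workers connect to the manager's Gaggle port (5115) in addition to SSH.
resource "aws_security_group_rule" "goose_manager" {
  type              = "ingress"
  from_port         = 5115
  to_port           = 5115
  protocol          = "tcp"
  cidr_blocks       = ["0.0.0.0/0"]
  security_group_id = aws_security_group.manager.id
}

# The DNS record all region workers point their --manager-host at.
resource "aws_route53_record" "goose_manager" {
  zone_id = var.zone_id
  name    = "goose-manager.example.com"
  type    = "A"
  ttl     = 60
  records = [aws_instance.manager.public_ip]
}
```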
Speaker:Each VM is running...
Speaker:Sorry, go ahead.
Speaker:Where do you actually put in the Goose part?
Speaker:Because I've seen the VM.
Speaker:Yep.
Speaker:So each CoreOS VM can take an ignition file.
Speaker:The idea behind CoreOS is that it was a project to simplify infrastructure
Speaker:that was based on containers.
Speaker:It became an underlying part of a lot of Kubernetes deployments
Speaker:because it's basically read-only, in essence, on a configuration level.
Speaker:It can even auto-update itself.
Speaker:It's a very interesting way of dealing with an operating system.
Speaker:Its entire concept is that you don't really interact with it outside of
Speaker:containers. It's just a stable base for containers that remains secure,
Speaker:can auto-update, is basically read-only in its essence, and takes these
Speaker:ignition files that define how it should set itself up on first boot.
Speaker:So if we look at one of these ignition files,
Speaker:we can see that it's basically YAML.
Speaker:We define the SSH key we want to get pushed.
Speaker:We define an /etc/hosts file to push.
Speaker:We then define some systemd units, which include turning off SELinux,
Speaker:because we don't want to deal with that on short-lived workers.
Speaker:And then we define the Goose service, which pulls down the image
Speaker:and, right here, actually starts Goose.
Speaker:This is mostly defining the log driver, which ships logs back to
Speaker:Datadog; the actual logging agent is started here.
Speaker:But then, this is one of the workers.
Speaker:So we pull the temp umami branch of Goose.
Speaker:We start it up, set it to worker mode,
Speaker:point it at the manager host, set it to be somewhat verbose, set the log
Speaker:driver for Datadog, and start up the Datadog agent so that we get metrics
Speaker:and the logs. And then that's just how it runs.
Speaker:And this will restart over and over and over again.
Speaker:So you can actually run multiple tests with the same infrastructure.
Speaker:You just have to restart Goose on the manager and then the workers will
Speaker:kill themselves and then restart.
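Stripped of the log shipping, the worker's systemd unit essentially wraps a docker run like the following (the image name and manager hostname are placeholders; the real unit also sets the log driver and starts the Datadog agent):

```
# Run the Goose load test container in worker mode; systemd restarts it
# when the manager finishes a test, so the same VM can serve repeated runs.
docker run --name goose-worker \
  tag1consulting/goose:umami \
  --worker \
  --manager-host goose-manager.example.com \
  --manager-port 5115 \
  -v
```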
Speaker:And so you get this plan, where it shows you all the
Speaker:instances it's going to spin up.
Speaker:It's actually fairly long, just because there are a lot of params
Speaker:for each EC2 instance, and we're spinning up 11 of them, 10 workers plus
Speaker:the manager. You say that's fine,
Speaker:and it goes.
Speaker:And I will stop sharing my screen now, as this is going to take a bit.
Speaker:So is this already doing something now?
Speaker:Yes.
Speaker:And you're probably going to see one of the quirks,
Speaker:and this is another thing I dislike about this.
Speaker:Because we're using CoreOS, these are all coming up on an outdated AMI and
Speaker:they're all going to reboot right there,
Speaker:because they come up, they start pulling the Goose container, and
Speaker:then they start the update process, and they're not doing anything.
Speaker:So at that point, they think it's safe to update.
Speaker:And so they update and reboot.
Speaker:It's somewhat cool that that has no impact on anything; the entire
Speaker:infrastructure comes up, updates itself, reboots,
Speaker:then it continues on with what it's doing. But it's another
Speaker:little annoyance that I just don't like.
Speaker:You spin up this infrastructure and you don't really have
Speaker:a ton of control over it.
Speaker:And so these are the logs of the manager process of Goose, and it's just
Speaker:waiting for its workers to connect.
Speaker:They've all updated and rebooted, and Goose is starting on them.
Speaker:As you can see, eight of them have completed that process.
Speaker:Is all of this, the stuff that you put together here,
Speaker:going to be available open source for folks to download and leverage?
Speaker:Yep.
Speaker:Awesome.
Speaker:It's all online,
Speaker:on our Tag1 Consulting GitHub organization, and the K3s version will be as well.
Speaker:And that's the one I'd recommend you use.
Speaker:This one's real annoying.
Speaker:I know I keep going on about it, but this is how skunkworks projects work:
Speaker:you make the first revision and you hate it, and then you
Speaker:decide to never do that again,
Speaker:and then you make the second revision.
Speaker:Okay.
Speaker:This is starting.
Speaker:Now I'm going to switch my screen over to the web browser
Speaker:so we can see what it's doing.
Speaker:Sure.
Speaker:Great.
Speaker:The logs that we're seeing there, are they coming from Datadog,
Speaker:or directly from the manager?
Speaker:And so that was a direct connection to the manager.
Speaker:If we go over to Datadog here, these are going to be the logs.
Speaker:As you can see, the host is just what an EC2 hostname looks
Speaker:like, and they're all changing, but we're getting logs from every agent,
Speaker:as well as the workers.
Speaker:You can see they're launching.
Speaker:If we go back to Fastly, we can see that they're getting global traffic.
Speaker:So we're getting traffic on the West coast, the East
Speaker:coast, Ireland, Frankfurt, and
Speaker:Mumbai,
Speaker:and the bandwidth will just keep ramping up from here.
Speaker:For Datadog, is there a way to also filter by the manager?
Speaker:Sure.
Speaker:This is the live tail.
Speaker:We'll go to the past 15 minutes, and then you can filter by the Goose service.
Speaker:And then we have worker and manager, so I can show only the workers.
Speaker:And, sorry, only the manager.
Speaker:The manager is pretty quiet.
Speaker:The workers are not.
Speaker:You must've disabled
Speaker:displaying metrics regularly,
Speaker:because I would have expected to see that on the server.
Speaker:If I did, I did not intend to, but I probably did.
Speaker:Can we, is it easy to quickly see what command you passed in, or not,
Speaker:to go back there from where you're at right now?
Speaker:It's in Terraform, I think.
Speaker:It is all set here.
Speaker:So that's
Speaker:interesting.
Speaker:I have to figure out why you're not getting statistics on the
Speaker:manager, because you should be getting statistics on the manager.
Speaker:Is this the log you're tailing, or is this what's verbosely put out to the screen?
Speaker:This is
Speaker:what is put out to the screen.
Speaker:Yeah.
Speaker:Interesting.
Speaker:Okay.
Speaker:I would have expected statistics every 30
Speaker:seconds.
Speaker:So what's kind of interesting is you can expand this in Fastly and see we're
Speaker:doing significantly less traffic in Asia Pacific, but that makes sense,
Speaker:considering we're only hitting one of the PoPs. Europe and North
Speaker:America tend to be about the same, but you can even drill down further.
Speaker:One quick question.
Speaker:I saw you hard-code the IP address of the endpoint in the Terraform.
Speaker:How does Fastly still know, essentially, which PoP to route to?
Speaker:Or are they doing it through magic?
Speaker:You mean that I put the same IP address everywhere in /etc/hosts?
Speaker:Yep.
Speaker:Yeah.
Speaker:It's because of how they're doing traffic.
Speaker:So it is the same IP address everywhere, but the IP
Speaker:address points to different things,
Speaker:basically.
Speaker:It's cool.
Speaker:A lot of CDNs do it that way.
Speaker:So instead of different IP addresses, it's basically routing tricks.
Speaker:We seem
Speaker:to have maxed out.
Speaker:Can you look at the...
Speaker:Yeah, this should be about it.
Speaker:It should be all started at this point.
Speaker:Yeah.
Speaker:So we've launched a thousand users; we've entered the Goose attack.
Speaker:We have evened out at 14.5 gigabits per second, which is, I think, what we got
Speaker:on one server with 10,000 users as well.
Speaker:This is more than a single server.
Speaker:A single server, I think, maxed out at nine gigabit.
Speaker:Awesome.
Speaker:Thank you guys, all,
Speaker:for joining us.
Speaker:It was really cool to see that in action.
Speaker:All the links we mentioned are going to be posted in the video summary and the
Speaker:blog post that correlates with this.
Speaker:Be sure to check out Tag1.com/goose; that's tag, the number one, dot com.
Speaker:That's where we have all of our talks, documentation, and links to GitHub.
Speaker:There are some really great blog posts there that will show you
Speaker:step-by-step, with the code, how to do everything that we covered today.
Speaker:So be sure to check that out.
Speaker:If you have any questions about Goose, please post them to the Goose issue
Speaker:queues so that we can share them with the community.
Speaker:Of course, if you like this talk, please remember to upvote,
Speaker:subscribe, and share it out.
Speaker:You can check out our past Tag1 Team Talks on a wide variety of topics, from
Speaker:getting funding for your open source projects to things like decoupled
Speaker:systems and architectures for web applications, at tag1.com/tag1teamtalks.
Speaker:As always, we'd love your feedback and input on this episode, as
Speaker:well as ideas for future topics.
Speaker:You can email us at ttt@tag1.com. Again, a huge thank you to Jeremy, Fabian, and
Speaker:Narayan for walking us through this, and to everyone who tuned in today.
Speaker:Really appreciate you joining us.