How to load test with Goose- Part 3: Bigger instances - Tag1 Team Talk
Episode 75 • 19th July 2021 • Tag1 Team Talks | The Tag1 Consulting Podcast • Tag1 Consulting, Inc.
00:20:49


Shownotes

Goose, the open source load testing framework created by Tag1 CEO Jeremy Andrews, continues to show its performance and scalability capabilities. In this Tag1 Team Talk, Managing Director Michael Meyers joins VP of Software Engineering Fabian Franz for a demonstration of Goose’s rapid ramp-up and scaling by CTO Narayan Newton.

In this final talk in our series of live demonstrations of Goose, Narayan and Fabian break down how some of the methods used in part 2 weren’t ideal, and show ways to make spinning up load tests faster and more efficient.


For more Goose content, see Goose Podcasts, Blogs, Presentations, & more!


Transcripts

Speaker:

Hello, and welcome to Tag1 Team Talks the podcast and blog of Tag1 Consulting.

Speaker:

Today,

Speaker:

we're going to be talking about distributed load testing and doing

Speaker:

a how-to deep dive into running Gaggles with Tag1's open source

Speaker:

Goose load testing framework.

Speaker:

Our goal today is to prove to you that Goose is both the most scalable load

Speaker:

testing framework available currently

Speaker:

and the easiest to actually scale.

Speaker:

This is a follow-up

Speaker:

talk to one we did very recently on a similar topic.

Speaker:

We're going to stress the servers even more in this one.

Speaker:

I'm Michael Meyers, the managing director at Tag1 Consulting.

Speaker:

And I'm joined today by Narayan Newton, Tag1's CTO, and Fabian

Speaker:

Franz, our VP of Technology.

Speaker:

Fabian, can you give us just a quick background on how this is

Speaker:

a follow-up to our last talk?

Speaker:

Sure. So in our last talk, what we did is essentially spin up EC2

Speaker:

instances all over the world. But if you need to change something, you

Speaker:

essentially have to destroy the cluster and redeploy it.

Speaker:

And while recording, we actually ran into the problem that we had to

Speaker:

change something, and it wasn't easily possible to do that.

Speaker:

And we wanted to change the ulimit, because with Goose, if you put up

Speaker:

a lot of users, you usually need to increase the ulimit that Linux

Speaker:

comes with, and you need to do that in the VM, obviously.

Speaker:

And we had no real control over that, because we only had the start script.

Speaker:

So while the solution that you presented was very straightforward, very

Speaker:

simple and easy to use, essentially, if you quickly want to iterate on something,

Speaker:

Yeah.

Speaker:

It can take quite a while because you have to wait for all the clusters to shut down

Speaker:

and you really don't want any EC2 machines hanging around for 10 years and then

Speaker:

thousands of dollars that you're paying for nothing because

Speaker:

you ran a load test once.

Speaker:

It's so important to cleanly shut down and then start up.

Speaker:

But that costs a lot of time and development.

Speaker:

Time is also costly.

Speaker:

So today we have a completely new solution, and I'm totally

Speaker:

fascinated and excited by it.

Speaker:

And Narayan, please tell us more.

Speaker:

Yes.

Speaker:

If you watched the last talk, I spent an

Speaker:

unfortunate amount of it talking about why I disliked the thing I built to

Speaker:

spin up EC2 instances, because I couldn't control the endpoints.

Speaker:

and then all the things that I was complaining about happened,

Speaker:

we had to like stop recording.

Speaker:

So I got annoyed. What we have today is similar to what we ran

Speaker:

last time, where it is still like, kind of the same Terraform tree.

Speaker:

And we're spinning up CoreOS nodes in various different regions, but instead of

Speaker:

pushing just a Goose container to each of them to run the load test it is installing

Speaker:

K3s which is a Kubernetes distribution.

Speaker:

that is designed for IoT, CI, and running at the edge.

Speaker:

It's very small.

Speaker:

It actually is a full Kubernetes distribution, but it's not a

Speaker:

full Kubernetes distribution in that it's not running everything a standard one does.

Speaker:

they have made some changes to make it lighter and spin up faster.

Speaker:

So now, when you run terraform apply, it's not just

Speaker:

spinning up the load test.

Speaker:

It spins up a multi-region Kubernetes cluster which was interesting to do

Speaker:

there were some oddities to doing that because each node, when you're spinning

Speaker:

up a node, EC2, it has an internal IP address and external IP address.

Speaker:

and if you're spinning up a Kubernetes distribution in a single

Speaker:

region, that doesn't really matter because everyone's talking to each

Speaker:

other via the internal IP address.

Speaker:

But

Speaker:

when you're doing multi-region, everyone's talking to each other via

Speaker:

the external IP address, and that IP address does not appear on the VM at all.

Speaker:

So that was interesting, I will share my screen.

Speaker:

Before we started: where we were last time was basically spinning up 10 nodes,

Speaker:

two nodes in each region, five regions.

Speaker:

I did that again, but with the new tree.

Speaker:

Okay.

Speaker:

So now we are on the manager node.

Speaker:

kubectl gets set up automatically on the manager node.

Speaker:

So.

Speaker:

There is our cluster as it stands currently.

Speaker:

So I just ran a kubectl get nodes -o wide so I can see extended fields.

Speaker:

And you can see that we have the control plane, which is what we are on currently.

Speaker:

And then all of our worker nodes, you can see the internal IP addresses

Speaker:

and the external IP addresses.
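For readers following along, the inspection Narayan does here is standard kubectl; this is a sketch (the node name is hypothetical, and kubectl is assumed to be preconfigured on the K3s manager node):

```shell
# List nodes with extended fields, including internal and external IPs.
kubectl get nodes -o wide

# Inspect one node's labels, e.g. the region tag applied at spin-up
# ("worker-us-west-1" is a hypothetical node name).
kubectl describe node worker-us-west-1
```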

Speaker:

If you look at one of these, like, let's look at probably this one.

Speaker:

You can see it's even tagged by the region it's in.

Speaker:

So we're fetching the region at spin-up and tagging the node with

Speaker:

whatever region it is in.

Speaker:

Yeah, these are just boring regions, but there are more

Speaker:

interesting regions as well.

Speaker:

So to run the load test,

Speaker:

I have a little YAML directory here, and this is what we're

Speaker:

doing instead of pushing the Docker image to each CoreOS node.

Speaker:

Okay.

Speaker:

Instead of just letting it run without any control,

Speaker:

now you spin up the cluster and you can submit these jobs.

Speaker:

So this is the manager job it's going to ________ workers, but we want 10,

Speaker:

but real quick.

Speaker:

I got lost a little bit in the translation of things. So we have

Speaker:

EC2 instances, and they now have this huge Kubernetes network.

Speaker:

And what are we using now to deploy Goose, or is Goose already there?

Speaker:

Goose is not there.

Speaker:

So this, which I'm copying up to the manager node, is our deployment of Goose.

Speaker:

So this is a Deployment telling Kubernetes

Speaker:

to spin up one replica of the Goose manager, and this is going to be the

Speaker:

manager that all the workers connect to.

Speaker:

Excellent.

Speaker:

Perfect.

Speaker:

So we are going to create that

Speaker:

and you can see it creating here.

Speaker:

And if we look back at that deployment, you'll see that we have a node

Speaker:

selector to tell Kubernetes that I want this to run on the manager node.

Speaker:

I don't want it to run on the worker nodes because this is the

Speaker:

management instance of Goose.
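A minimal sketch of what such a manager Deployment could look like. The image name, labels, node-selector label, and Goose arguments are assumptions for illustration, not the actual manifest from the demo:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: goose-manager
spec:
  replicas: 1                    # one manager that all workers connect to
  selector:
    matchLabels:
      app: goose-manager
  template:
    metadata:
      labels:
        app: goose-manager
    spec:
      nodeSelector:              # pin the pod to the manager node
        node-role.kubernetes.io/master: "true"    # hypothetical label
      hostNetwork: true          # use the VM's network, not the pod network
      containers:
        - name: goose
          image: registry.example.com/goose:latest   # hypothetical image
          args: ["--manager", "--expect-workers", "10"]
```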

Speaker:

Excellent.

Speaker:

So how can I now look at it?

Speaker:

As soon as the Kubernetes one is up, can I now look

Speaker:

at whether it is waiting for workers?

Speaker:

Can I see the output of that as well?

Speaker:

You can, once it's done creating the container.

Speaker:

So what it's doing right now is it's pulling the container down

Speaker:

from our container registry.

Speaker:

How long does container deployment usually take with Kubernetes?

Speaker:

It scales by the size of the container.

Speaker:

And our container is actually quite large, because I have not put the

Speaker:

effort into making it smaller yet.

Speaker:

Okay.

Speaker:

So just downloading a few gigabytes of data, just for the distribution and

Speaker:

all the rest of the dependencies, et cetera.

Speaker:

Exactly.

Speaker:

Okay.

Speaker:

And now this is our Goose worker. Same container, but different

Speaker:

arguments to the container, obviously.

Speaker:

We have some pod anti-affinity rules here, which are kind of interesting.

Speaker:

Basically.

Speaker:

This is me saying to Kubernetes:

Speaker:

I don't want you to schedule this on any node that already has a worker running,

Speaker:

and I don't want you to schedule it on any node that has the manager running.

Speaker:

So it will distribute it to every single node so that

Speaker:

there won't be an inactive one.
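In manifest form, the anti-affinity rules described here might look roughly like this inside the worker pod spec (the label names are assumptions):

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      # Never put two workers on the same node...
      - labelSelector:
          matchExpressions:
            - key: app
              operator: In
              values: ["goose-worker"]
        topologyKey: kubernetes.io/hostname
      # ...and never put a worker on the node running the manager.
      - labelSelector:
          matchExpressions:
            - key: app
              operator: In
              values: ["goose-manager"]
        topologyKey: kubernetes.io/hostname
```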

Speaker:

Yeah.

Speaker:

You wanted to see this.

Speaker:

So, there.

Speaker:

Before I start the workers, the manager is now running, and we can do a

Speaker:

kubectl logs and there are logs.

Speaker:

Nice.

Speaker:

and it's waiting for 10 workers.
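Checking the manager and then submitting the workers is plain kubectl; the deployment and file names here are hypothetical:

```shell
# Tail the manager's log; it reports that it is waiting for workers.
kubectl logs deployment/goose-manager

# Submit the worker Deployment (replicas spread out by the anti-affinity rules).
kubectl apply -f goose-workers.yaml
```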

Speaker:

So let's create the workers.

Speaker:

Oh,

Speaker:

I fixed this on the other one.

Speaker:

So what it's complaining about is these pod affinity rules.

Speaker:

It needs a topology key for each one.

Speaker:

And so that one's responsible.

Speaker:

So this topology key essentially means that it's per host

Speaker:

name, not per something else.

Speaker:

Okay.

Speaker:

Per

Speaker:

host name instead of per zone, per region, per rack.

Speaker:

So I could also decide to

Speaker:

deploy just one to each region, essentially.

Speaker:

And now we have our workers spinning up.

Speaker:

And so this is what was happening last time as well.

Speaker:

It's just, this was happening when you ran Terraform apply.

Speaker:

So it would bring up all the nodes and then each node would be doing this,

Speaker:

but without our direct interaction,

Speaker:

I think I like the manager way more.

Speaker:

It's nice to have a little bit of control and to easily take a look at logs.

Speaker:

Yeah.

Speaker:

We have already started to send users.

Speaker:

That's great.

Speaker:

Yep.

Speaker:

As part of this, the ulimit issues we were hitting: in the old

Speaker:

one, we weren't changing the ulimit.

Speaker:

So we had a maximum of 1,024 open file descriptors.

Speaker:

we are actually inheriting a ulimit fix that K3s pushes in when

Speaker:

it installs, which is helpful.
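As a generic aside (not part of the demo setup), the limit is easy to inspect and raise in a shell:

```shell
# Show the current soft limit on open file descriptors; stock Linux
# often defaults to 1024, which thousands of concurrent Goose users
# (each holding open sockets) will quickly exhaust.
ulimit -n

# Raise the soft limit up to the hard limit for this shell session.
ulimit -n "$(ulimit -Hn)"
ulimit -n
```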

Speaker:

That's indeed helpful if they have already solved the problem for us,

Speaker:

So that's starting up, you can see.

Speaker:

These are workers transitioning to the running state.

Speaker:

we can look at the logs of the workers as well.

Speaker:

Which is kind of cool.

Speaker:

If you think that these workers are in various regions, I

Speaker:

can just run something to pull logs from, like, central Europe.

Speaker:

Japan is cooler than central Europe.

Speaker:

Yeah,

Speaker:

that's true.
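Pulling logs for a worker in a particular region could look like this, assuming the pods carry an app label and you pick the pod by the node it landed on (all names are hypothetical):

```shell
# See which worker pod landed on which (region-tagged) node.
kubectl get pods -o wide -l app=goose-worker

# Follow one worker's log to watch its ramp-up
# ("goose-worker-abc123" is a hypothetical pod name).
kubectl logs goose-worker-abc123 --follow
```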

Speaker:

Okay.

Speaker:

And our load tests should be starting.

Speaker:

I think it was around 15 seconds, right, before it kicks in.

Speaker:

Yeah, there we go.

Speaker:

Oh my God.

Speaker:

how fast do we have to ramp up right now?

Speaker:

If you add a hundred and you start going up, it crushes our site.

Speaker:

So the ramp-up is slower, but it kind of needs to be.

Speaker:

For sure.

Speaker:

And actually, why don't we look at one of the workers so we can see the ramp-up.

Speaker:

That's

Speaker:

freaking cool.

Speaker:

This is all...

Speaker:

This will all be open source on the Tag1 server?

Speaker:

It's already there.

Speaker:

There's one thing that you should know about it currently:

Speaker:

when you spin up a Kubernetes cluster, and I'll just talk about this

Speaker:

while it's ramping up, there's a network.

Speaker:

There's like a backend network that all the pods communicate on.

Speaker:

That is not a real network.

Speaker:

It's a network that's

Speaker:

Kubernetes-specific, and this particular distribution uses Flannel for that.

Speaker:

It's even a pluggable thing.

Speaker:

So you can decide what you want your network control plane to be.

Speaker:

Flannel is detecting the wrong IP address: it's detecting the internal IP

Speaker:

address, not the public IP address.

Speaker:

I haven't personally fixed that because there's a bug open about

Speaker:

it with traction, but I'm going to push on the bug to fix that.

Speaker:

So as it stands, if you want to just do something like this,

Speaker:

where you're going to be using the host network, which is the network of

Speaker:

the VM, not the network of the pod,

Speaker:

That's not an issue.

Speaker:

but if you want to do something like Interpod communication,

Speaker:

that will be an issue.

Speaker:

So if you just pull it down, you would have to fix that.

Speaker:

I'll probably dump the URI of the bug in question in the README

Speaker:

just so that people can track that.

Speaker:

So if I wanted to do something like that, would I need to apply

Speaker:

a patch, or what do we need?

Speaker:

Okay.

Speaker:

Yeah.

Speaker:

In that issue.

Speaker:

There's a little deployment you can deploy that will go in and

Speaker:

change the annotation

Speaker:

on the nodes to point to the correct IP, and then Flannel will update.

Speaker:

So it's a bonus.

Speaker:

So you had to get the external IPs manually and put them into configuration

Speaker:

or

Speaker:

No, I, well... so we're at 20 gigabits,

Speaker:

but while that's happening

Speaker:

where's a good place for that.

Speaker:

Sure.

Speaker:

So I am running curls against the EC2 metadata service to grab the public IP

Speaker:

and the region before doing the install and then passing it to the installer.
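The curls in question would look roughly like this; these are the standard EC2 instance-metadata paths, while the variable names and surrounding script are assumptions:

```shell
# The metadata service is only reachable from inside the instance itself.
PUBLIC_IP=$(curl -s http://169.254.169.254/latest/meta-data/public-ipv4)
REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/region)
# ...then pass both values to the K3s installer.
```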

Speaker:

Which is actually a huge security vulnerability.

Speaker:

If the wrong people get access to that.

Speaker:

But it is very useful for setting things up.

Speaker:

Like that's how, that's how an EC2 VM knows stuff about itself.

Speaker:

Oh we got our first error.

Speaker:

I've noticed that at about 26 gigabits per second, you start

Speaker:

pulling errors every once in a while.

Speaker:

Is there a corollary chart to our Fastly bill here?

Speaker:

Yeah.

Speaker:

Yeah, there really is.

Speaker:

We've been testing this a while, and I don't know.

Speaker:

It just never occurred to me that the Fastly bill would be high, but it is.

Speaker:

Man, that's amazing.

Speaker:

I think we should be at around 25 at this point,

Speaker:

when you see where we're at.

Speaker:

Yes, we've launched all the users.

Speaker:

So we should just sustain at around here.

Speaker:

And this, people, is how you test a CDN, as you can see.

Speaker:

And you can see, we have fewer locations near the Asia Pacific

Speaker:

POPs, but we're holding five gigabits per second in Asia Pacific, and then 10

Speaker:

and 10 in Europe and North America.

Speaker:

Yep.

Speaker:

And just, just again,

Speaker:

to reiterate that a little bit:

Speaker:

Our Goose users are not real users.

Speaker:

Like they are way faster usually.

Speaker:

that's why they also create this crazy bandwidth, but every user is

Speaker:

downloading all the... I mean, we have little breaks in there,

Speaker:

but all of the users are also downloading all the assets.

Speaker:

So when a page is loaded from Umami, like a nice recipe, which we are talking

Speaker:

about, then all those images are also downloaded. It's like real browsers

Speaker:

browsing the site, though the JavaScript is not executed, because we're not doing that.

Speaker:

But it's really parsing a lot.

Speaker:

It's ensuring everything is correct in that.

Speaker:

And you're doing this with 2000 workers on 10 nodes, 20,000 workers in total.

Speaker:

Right.

Speaker:

Yeah.

Speaker:

users.

Speaker:

Yeah.

Speaker:

So 2000 users per node over 10 minutes and keep two nodes in each region.

Speaker:

Yeah.

Speaker:

So those 20,000 users are now hitting this site real hard.

Speaker:

Probably the user doesn't click as fast as we are clicking here.

Speaker:

So it's probably more like 200,000 or so.

Speaker:

Generating this kind of traffic. Amazing.

Speaker:

Now really cool.

Speaker:

Oh, then we have an error.

Speaker:

And this was the error we got. But on our old setup, I

Speaker:

would have no real way to do that.

Speaker:

We were pushing logs centrally (and by the way, our bill for the

Speaker:

central logging was also great), but there was no real

Speaker:

way to separate and look at individual workers or anything like that.

Speaker:

So this is a big improvement as far as manageability.

Speaker:

Yeah, Datadog probably was not as amused with that many logs.

Speaker:

No, they emailed me.

Speaker:

Yeah.

Speaker:

"Can we schedule a call to discuss your new usage?"

Speaker:

Sorry.

Speaker:

It was just a one-off.

Speaker:

We need to show the world how to test Fastly.

Speaker:

what's nice about Goose and what you can see here is that every

Speaker:

error is very nicely reported.

Speaker:

Goose just got a new patch in that also allows us to get an overview of

Speaker:

all the errors that ever happened like for all the workers and everything.

Speaker:

So this will be a very nice new feature.

Speaker:

That's landing in the next release, so that you don't have to

Speaker:

look through the log for what errors you have; you'll get an aggregated

Speaker:

per-error-type report in the end.

Speaker:

And I think we can essentially stop.

Speaker:

We can see Fastly is handling 25 gigabits per second easily.

Speaker:

There's no, there's no real reason to make our bill higher.

Speaker:

So what we can do is just

Speaker:

delete the deployments.

Speaker:

And just terminate them all.
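The forceful teardown amounts to deleting the two Deployments (names hypothetical):

```shell
# Deleting the Deployments terminates every worker pod in every region.
kubectl delete deployment goose-workers goose-manager
```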

Speaker:

Not the nicest way to stop a load test, but yeah,

Speaker:

no, no, but it is very forceful.

Speaker:

Yeah.

Speaker:

in theory we could have also given the manager, essentially, a stop signal.

Speaker:

And then it would have given us a nice end of load test report

Speaker:

which can also be in HTML.

Speaker:

And this, we can show that again some other time, but yeah.

Speaker:

Oh, and one thing you can do with this (not right now, because they're

Speaker:

terminating), but if you want to, you can actually exec into these containers.

Speaker:

So you can pull a bash prompt from any of these containers, even

Speaker:

the ones in different regions.
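That exec is the standard kubectl mechanism; the pod name below is hypothetical:

```shell
# Open an interactive shell inside a worker pod, even one on a node in
# another region; the Kubernetes API server proxies the session.
kubectl exec -it goose-worker-abc123 -- /bin/bash
```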

Speaker:

Not something I did.

Speaker:

That's just Kubernetes.

Speaker:

I mean, for sure.

Speaker:

No, I love this new K3s. Is it an acronym for something, K3s?

Speaker:

Okay.

Speaker:

No. So, the acronym for Kubernetes is K8s,

Speaker:

so K3s would be their joke as it's a lighter version.

Speaker:

It was really cool.

Speaker:

It's literally a single binary and the install, like the install deals with

Speaker:

all the prerequisites and then places the binary and spins up a systemd

Speaker:

unit file that sets everything up.
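The install really is a one-liner; these are the quick-start commands from the K3s project docs (the server IP and token are placeholders):

```shell
# On the server (manager) node:
curl -sfL https://get.k3s.io | sh -

# On each agent (worker) node, pointing at the server:
curl -sfL https://get.k3s.io | K3S_URL=https://<server-ip>:6443 K3S_TOKEN=<token> sh -
```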

Speaker:

Kubernetes by default has, like, an HA multi-instance datastore, and

Speaker:

they replaced that with SQLite.

Speaker:

Like that sort of thing.

Speaker:

This feels really, really cool.

Speaker:

And I think that's really nice too, for multi-region services

Speaker:

running so easily.

Speaker:

Great.

Speaker:

Yep.

Speaker:

I think it's a step up from what we had before.

Speaker:

Absolutely.

Speaker:

Not only what we shared before, but also what you had to do before.

Speaker:

I remember SSH-ing into four different machines to start a load test,

Speaker:

to start a Locust test, manually starting eight workers on each.

Speaker:

And so, yeah, it's, it's really nice to have everything automated that way.

Speaker:

And now I'm just bringing it down.

Speaker:

Awesome.

Speaker:

Thank you guys so much.

Speaker:

That was really cool.

Speaker:

We look forward to our Fastly and Datadog bills.

Speaker:

I really appreciate you guys coming back to show us how that works.

Speaker:

We will do another Goose webinar in our series here in a couple of weeks.

Speaker:

So please stay tuned.

Speaker:

We're going to make this a regular series where we show you how

Speaker:

to use different features and functionality as we release them.

Speaker:

And also show you how to use the tool and profile websites and

Speaker:

effectively performance tune, not just get it up and running and, and

Speaker:

slamming your site with traffic.

Speaker:

The links we mentioned, we'll throw into the show notes and the description.

Speaker:

You can check out these other Goose talks at Tag1.com/goose,

Speaker:

G-O-O-S-E. That's where we have links to documentation and code and all of

Speaker:

the talks and videos that we've done.

Speaker:

If you have any questions about Goose, the product, and how to use it,

Speaker:

please head over to the GitHub issue

Speaker:

queues and ask them over there.

Speaker:

And if you want to get engaged and contribute, we'd love it.

Speaker:

If you have input and feedback on the talk itself, please contact us at

Speaker:

Tag1teamtalks@tag1.com.

Speaker:

Please remember to upvote, share, and subscribe with all your friends.

Speaker:

You can check out our past Tag1 Team Talks at tag1.com/ttt, for Tag1 Team Talks.

Speaker:

Again, a huge thank you, Fabian and Narayan, for joining us, and thank you to everybody

Links