In this second part of our team talk series on live load testing with Goose, we focus on demonstrating load testing using a Gaggle. A Gaggle is a distributed load test running Goose from one or more servers. Here, we’re testing with 20,000 users using ten Workers and a Manager process on servers spun up using Terraform.
CEO Jeremy Andrews, the creator of Goose; Fabian Franz, VP of Software Engineering; CTO Narayan Newton; and Managing Director Michael Meyers demonstrate running a Goose Gaggle and discuss how variations on these load tests change testing results, and what you can expect from a Gaggle. Our goal is to prove to you that Goose is both the most scalable load testing framework currently available, and the easiest to scale.
Narayan is a key member of the Drupal.org infrastructure team, responsible for ensuring the site stays up under load or attack. Load testing is a critical part of ensuring the continued success of Drupal.org, or any website.
For more Goose content, see Goose Podcasts, Blogs, Presentations, & more!
Hello, and welcome to Tag1 Team Talks, the podcast and blog of Tag1 Consulting.
Speaker:Today, we're going to be doing a distributed load testing how-to: a deep dive into
Speaker:running a Gaggle with Tag1's open source Goose load testing framework.
Speaker:Our goal is to prove to you that Goose is both the most scalable load
Speaker:testing framework currently available, and the easiest to scale.
Speaker:We're going to show you how to run a distributed load test yourself,
Speaker:and we're going to provide you with lots of code and examples to make
Speaker:it easy for you to do this on your own.
Speaker:I'm Michael Meyers, the managing director at Tag1, and joining
Speaker:me today is a star-studded cast.
Speaker:We have Jeremy Andrews, the founder and CEO of Tag1, who's also the
Speaker:original creator of Goose; Fabian Franz, our VP of Technology, who's made
Speaker:major contributions to Goose, especially around performance and scalability;
Speaker:and Narayan Newton, our CTO, who has set up and put together all the
Speaker:infrastructure that we're going to be using to run these load tests.
Speaker:Jeremy, why don't you take it away?
Speaker:Give us an overview of what we're going to be covering and let's jump into it.
Speaker:Yeah.
Speaker:So last time we explored setting up a load test from a single
Speaker:server and confirmed that Goose makes great use of that server.
Speaker:It leverages all the CPUs and ultimately pushes as far as it
Speaker:can until the uplink slows it down.
Speaker:So today what we're going to do is use a feature of Goose called
Speaker:a Gaggle, which is a distributed load test.
Speaker:If you're familiar with Locust, it is like a swarm.
Speaker:The way this works with Goose is you have a manager
Speaker:process that you kick off, and you say, I want to simulate 20,000 users and I'm
Speaker:expecting 10 workers to generate this load.
Speaker:The manager process prepares things, and all the workers then connect
Speaker:in through a TCP port, and it sends each of them a batch of users to run.
Speaker:The manager coordinates the start, so each of the
Speaker:workers starts at the same time.
Speaker:And then they send their statistics back to the manager so that you can
Speaker:actually see what happened in the end.
Speaker:What this nicely solves is, if your uplink can only do so much traffic,
Speaker:or if you want traffic coming from multiple regions around the world, you
Speaker:can let Goose manage that for you across all of these different servers.
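As a rough sketch of what that looks like on the command line (flag names as documented by Goose; the target site and the manager hostname here are placeholders), the manager and each worker run the same load test binary:

```
# On the manager: expect 10 workers and coordinate 20,000 users total.
cargo run --release -- --manager --expect-workers 10 --users 20000 --host https://example.com/

# On each of the 10 workers: connect to the manager's TCP port (5115 by default).
cargo run --release -- --worker --manager-host goose-manager.example.com --manager-port 5115
```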
Speaker:So today Narayan has set up a pretty cool test where we're going to
Speaker:be spinning up a lot of workers, and he can talk about how many.
Speaker:Each one is not going to be working too hard.
Speaker:They'll run maybe a thousand users per server,
Speaker:which means it'll be at least 50% idle.
Speaker:It won't be maxing out the uplink on any given server.
Speaker:But in spite of that, we're going to show that working together in a Gaggle
Speaker:we can generate a huge amount of load.
Speaker:So now, Narayan, if you can talk about what you've set up here.
Speaker:Sure.
Speaker:So what I built today is basically a simplistic Terraform tree.
Speaker:What is interesting about this is that we wanted to distribute the load between
Speaker:different regions, and for those people who have used Terraform in the past,
Speaker:that can be slightly odd, in that you can only set one region for each AWS
Speaker:provider that Terraform uses to spin things up.
Speaker:So how we've done this is to define multiple providers, one
Speaker:for each region, and a module that spins up our region workers.
Speaker:And we basically initialize multiple versions of the module,
Speaker:passing each a different region.
Speaker:So in the default test, we spin up 10 worker nodes in various regions:
Speaker:the Western part of the United States, the Eastern part of the
Speaker:United States, Ireland, Frankfurt, India, and Japan.
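A minimal sketch of that per-region provider and module pattern (the regions, counts, and module names below are illustrative, not the exact files from the repository):

```hcl
# Terraform allows only one region per AWS provider, so define an aliased
# provider for each region we want to generate load from.
provider "aws" {
  alias  = "us_west_2"
  region = "us-west-2"
}

provider "aws" {
  alias  = "eu_central_1"
  region = "eu-central-1"
}

# Instantiate the same worker module once per region, passing in the
# matching provider and a worker count.
module "workers_us_west_2" {
  source       = "./region-worker"
  worker_count = 2
  region       = "us-west-2"
  providers = {
    aws = aws.us_west_2
  }
}

module "workers_eu_central_1" {
  source       = "./region-worker"
  worker_count = 2
  region       = "eu-central-1"
  providers = {
    aws = aws.eu_central_1
  }
}
```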
Speaker:With how the test currently works (it's the load testing truss,
Speaker:which is what we decided to call it), it's a little limited, because once
Speaker:you start it, you can't really interact with the workers themselves.
Speaker:They start up, they pull down Goose, and they run the test.
Speaker:The next revision of this would be something that has a clustering agent
Speaker:between the workers, so that you can actually interact with the workers
Speaker:after they start. It gets very annoying to have to run Terraform to stand up
Speaker:these VMs all over the world, and then, when you want to make a change to them,
Speaker:you have to destroy all of them and then relaunch them, which isn't terrible.
Speaker:But as a testing sequence, it adds a lot of time, just because it takes
Speaker:time to destroy and recreate these VMs.
Speaker:So the next revision of this would be something other than Goose
Speaker:creating a cluster out of these VMs.
Speaker:How it currently works is that we're using Fedora CoreOS so that we have
Speaker:a consistent base at each location, and so I could only send it a single
Speaker:file for initialization.
Speaker:Fedora CoreOS then pulls down a container that has the Goose load test
Speaker:and a container that has a logging agent, so that we can monitor the workers
Speaker:and send all the logs from the Goose agents back to a central location.
Speaker:I had a quick question, Narayan. The basic setup is that we have EC2
Speaker:instances on AWS, and then we run containers on them, like normal
Speaker:Kubernetes, or how is it working?
Speaker:It's using Docker.
Speaker:That is the big thing that I want to improve,
Speaker:and I almost got there before today.
Speaker:What would be nicer is if we could use one of the IoT distributions or
Speaker:Kubernetes-at-the-edge distributions to run a very slim version of Kubernetes
Speaker:on each worker node, so that we get a few things.
Speaker:One is cluster access, so we can actually interact with the cluster, spread
Speaker:load, and run multiple instances of Goose.
Speaker:It would be interesting to pack multiple instances of Goose on things
Speaker:like the higher-end nodes, and also to be able to actually edit the cluster
Speaker:after it's up and not have to destroy it and recreate it each time.
Speaker:The other thing is to get containerd and not Docker,
Speaker:just because there are some issues that you can hit with that.
Speaker:As it stands right now, CoreOS ships with Docker running, and that's how
Speaker:you interact with it for the most part, through systemd and Docker. You could
Speaker:also use Podman, but I ran into issues with that for redirecting the logs.
Speaker:So we are actually using Docker itself, and Docker is just running
Speaker:the container as you would in a local development environment.
Speaker:So what we are missing from a standard Kubernetes deployment that
Speaker:we would normally have is the ability to deploy a new container.
Speaker:You were saying that if I want to deploy a new container with this
Speaker:simplistic infrastructure right now, I need to shut down the EC2
Speaker:instances and then start them up again?
Speaker:Okay.
Speaker:So that's the thing. Before this test, Jeremy released
Speaker:a new branch with some changes to make this load test faster on startup.
Speaker:What I did to deploy that was run terraform destroy, wait for it to kill
Speaker:all the VMs across the world, and then terraform apply and wait for it to
Speaker:recreate all those VMs across the world.
Speaker:And that is a management style, honestly, but in this specific case,
Speaker:because we're sometimes doing micro iterations, it can get really annoying.
Speaker:Yeah, for sure.
Speaker:No, no, that makes perfect sense.
Speaker:I just wanted to understand, because I was thinking, in this container world
Speaker:you can just deploy a new container, but obviously you need a manager for that.
Speaker:Yes.
Speaker:Yes.
Speaker:I could totally deploy a new container.
Speaker:So what I could do is have Terraform output the list of IPs, and then I can SSH
Speaker:to each of them and pull a new container.
Speaker:But at that point...
Speaker:But seriously, there's another Git repository that I have started.
Speaker:The version of this uses a distribution of Kubernetes called
Speaker:K3s that is designed for CI systems, IoT, and deployments to the edge.
Speaker:It's a single-binary version of Kubernetes where everything is
Speaker:wrapped into a single binary; it starts on edge nodes and then can connect
Speaker:them all together, and so we could have a multi-region global cluster
Speaker:of these little Kubernetes agents.
Speaker:And then we could spin up Goose instances on that.
Speaker:And that, I think, will actually work.
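For reference, bootstrapping a K3s cluster is typically just the upstream install script run on each node (a hedged sketch; the server address and token below are placeholders):

```
# On the node acting as the K3s control plane:
curl -sfL https://get.k3s.io | sh -

# On each edge/worker node, join the cluster by pointing at the server:
curl -sfL https://get.k3s.io | K3S_URL=https://k3s-server.example.com:6443 K3S_TOKEN=<node-token> sh -
```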
Speaker:You totally blew my mind.
Speaker:So now you've just signed up for a follow-up to show that,
Speaker:because that's what you want, actually. But now I'm
Speaker:really curious: how does this Terraform configuration actually look?
Speaker:Can you share a little bit about it?
Speaker:So this is the current tree.
Speaker:If everyone can see that, it's pretty simplistic.
Speaker:So this is the main file that gets loaded.
Speaker:And then for every region, there's a module that is named after its region.
Speaker:They're all hitting that same actual module; these are just different
Speaker:instances of this module.
Speaker:And then each takes a worker count and its region and its provider,
Speaker:and the provider is what actually separates them into regions.
Speaker:And then if you look at the region worker, which is where most of these
Speaker:things are happening, there's a variables file, which is interesting
Speaker:because I have to define an AMI map, because every region has a different
Speaker:AMI, because the regions are disparate.
Speaker:There's no consensus between these regions for images.
Speaker:So one of the reasons I picked CoreOS is because it exists in each of these
Speaker:regions and can handle a single start-up file.
Speaker:When we do the K3s version of this, K3s can run on Ubuntu,
Speaker:and Ubuntu obviously exists in all these regions as well,
Speaker:but I'll still have to do something like this. Or there's another way I can
Speaker:do it, but this was the way to do it for CoreOS.
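As an illustration of that variables file (the AMI IDs below are placeholders; every region publishes its own Fedora CoreOS image ID):

```hcl
variable "region" {
  type = string
}

variable "worker_count" {
  type    = number
  default = 1
}

variable "instance_type" {
  type    = string
  default = "m5.large"
}

# Map each region to its own Fedora CoreOS AMI; the IDs here are placeholders.
variable "coreos_amis" {
  type = map(string)
  default = {
    "us-west-2"    = "ami-00000000000000001"
    "us-east-1"    = "ami-00000000000000002"
    "eu-west-1"    = "ami-00000000000000003"
    "eu-central-1" = "ami-00000000000000004"
  }
}
```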
Speaker:And then we set the instance type; this is just a default.
Speaker:And then the main file of this is very simple.
Speaker:We initialize our key pair, because I want to be able to SSH into these
Speaker:instances at some point, and upload it to each region.
Speaker:We initialize a simple security group that allows me to SSH in, in each region.
Speaker:And then a simple instance that doesn't really have much; it
Speaker:doesn't even have a large root device, because we're not using it at all.
Speaker:Basically we're just spinning up a single container and then pushing the logs
Speaker:to Datadog, which is our central logging service, so even the logs aren't
Speaker:being written locally.
Speaker:On that instance we associate a public IP address,
Speaker:we spin up the AMI (we look up which AMI we should use based on our region),
Speaker:and then we output the worker address.
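Pieced together, the region worker module ends up looking roughly like this (a hedged sketch rather than the exact file from the repository; resource names and the Ignition wiring are illustrative):

```hcl
resource "aws_key_pair" "worker" {
  key_name   = "goose-worker"
  public_key = file("~/.ssh/id_ed25519.pub")
}

# Only SSH needs to be reachable on the workers.
resource "aws_security_group" "worker" {
  name = "goose-worker"

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_instance" "worker" {
  count                       = var.worker_count
  ami                         = var.coreos_amis[var.region]
  instance_type               = var.instance_type
  key_name                    = aws_key_pair.worker.key_name
  vpc_security_group_ids      = [aws_security_group.worker.id]
  associate_public_ip_address = true

  # Fedora CoreOS reads its Ignition config from user_data on first boot.
  user_data = file("${path.module}/worker.ign")
}

output "worker_addresses" {
  value = aws_instance.worker[*].public_ip
}
```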
Speaker:So the other part of this is the manager.
Speaker:The only real difference is that, while we basically spin it up the exact same
Speaker:way, we also allow the Goose port, which is 5115, and we spin up a DNS
Speaker:record that points to our manager, because that DNS record is what all the
Speaker:region workers are going to point at.
Speaker:And we make use of the fact that they're all using Route 53,
Speaker:so this update propagates really quickly.
Speaker:And that's basically it; it's pretty simple.
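In sketch form, the manager's additions amount to opening the Gaggle port and publishing a Route 53 record the workers can resolve (illustrative names; the zone, security group, and instance references are placeholders):

```hcl
# Workers connect to the manager's Gaggle port (5115) in addition to SSH.
resource "aws_security_group_rule" "goose_manager" {
  type              = "ingress"
  from_port         = 5115
  to_port           = 5115
  protocol          = "tcp"
  cidr_blocks       = ["0.0.0.0/0"]
  security_group_id = aws_security_group.manager.id
}

# The DNS record all region workers point their --manager-host at.
resource "aws_route53_record" "goose_manager" {
  zone_id = var.zone_id
  name    = "goose-manager.example.com"
  type    = "A"
  ttl     = 60
  records = [aws_instance.manager.public_ip]
}
```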
Speaker:Each VM is running...
Speaker:Sorry, go ahead.
Speaker:Where do you actually put in the Goose part?
Speaker:Because I've seen the VM.
Speaker:Yep.
Speaker:So each CoreOS VM can take an ignition file.
Speaker:The idea behind CoreOS is that it was a project to simplify infrastructure
Speaker:that was based on containers.
Speaker:It became an underlying part of a lot of Kubernetes deployments
Speaker:because it's basically read-only, in essence, on a configuration level.
Speaker:It can even auto-update itself.
Speaker:It's a very interesting way of dealing with an operating system.
Speaker:Its entire concept is that you don't really interact with it outside of
Speaker:containers. It's just a stable base for containers that remains secure,
Speaker:can auto-update, is basically read-only in its essence, and takes these
Speaker:ignition files that define how it should set itself up on first boot.
Speaker:So if we look at one of these ignition files,
Speaker:we can see that it's basically YAML.
Speaker:We define the SSH key we want to get pushed.
Speaker:We define an /etc/hosts file to push.
Speaker:We then define some systemd units, which include turning off SELinux,
Speaker:because we don't want to deal with that on short-lived workers.
Speaker:And then we define the Goose service, which pulls down the image
Speaker:and, right here, actually starts Goose.
Speaker:This is mostly defining the log driver, which ships logs back to
Speaker:Datadog; the actual logging agent is started here.
Speaker:But then, this is one of the workers.
Speaker:So we pull the temp umami branch of Goose.
Speaker:We start it up, set it to worker mode,
Speaker:point it at the manager host, set it to be somewhat verbose, set the log
Speaker:driver for Datadog, and start up the Datadog agent so that we get metrics
Speaker:and the logs. And then that's just how it runs.
Speaker:And this will restart over and over and over again.
Speaker:So you can actually run multiple tests with the same infrastructure.
Speaker:You just have to restart Goose on the manager and then the workers will
Speaker:kill themselves and then restart.
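Stripped of the log shipping, the worker's systemd unit essentially wraps a docker run like the following (the image name and manager hostname are placeholders; the real unit also sets the log driver and starts the Datadog agent):

```
# Run the Goose load test container in worker mode; systemd restarts it
# when the manager finishes a test, so the same VM can serve repeated runs.
docker run --name goose-worker \
  tag1consulting/goose:umami \
  --worker \
  --manager-host goose-manager.example.com \
  --manager-port 5115 \
  -v
```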
Speaker:And so you get this plan, where it shows you all the
Speaker:instances it's going to spin up.
Speaker:It's actually fairly long, just because there are a lot of params
Speaker:for each EC2 instance, and we're spinning up 11 of them, 10 workers plus
Speaker:the manager. You say that's fine,
Speaker:and it goes.
Speaker:And I will stop sharing my screen now, as this is going to take a bit.
Speaker:So is this already doing something now?
Speaker:Yes.
Speaker:And you're probably going to see one of the quirks,
Speaker:and this is another thing I dislike about this.
Speaker:Because we're using CoreOS, these are all coming up on an outdated AMI and
Speaker:they're all going to reboot right there,
Speaker:because they come up, they start pulling the Goose container, and
Speaker:then they start the update process, and they're not doing anything.
Speaker:So at that point, they think it's safe to update.
Speaker:And so they update and reboot.
Speaker:It's somewhat cool that that has no impact on anything; the entire
Speaker:infrastructure comes up, updates itself, reboots,
Speaker:then it continues on with what it's doing. But it's another
Speaker:little annoyance that I just don't like.
Speaker:You spin up this infrastructure and you don't really have
Speaker:a ton of control over it.
Speaker:And so these are the logs of the manager process of Goose, and it's just
Speaker:waiting for its workers to connect.
Speaker:They've all updated and rebooted, and Goose is starting on them.
Speaker:As you can see, eight of them have completed that process.
Speaker:Is all of this, the stuff that you put together here,
Speaker:going to be available open source for folks to download and leverage?
Speaker:Yep.
Speaker:Awesome.
Speaker:It's all online,
Speaker:on our Tag1 Consulting GitHub organization, and the K3s version will be as well.
Speaker:And that's the one I'd recommend you use.
Speaker:This one's real annoying.
Speaker:I know I keep going on about it, but this is how skunkworks projects work:
Speaker:you make the first revision and you hate it, and then you
Speaker:decide to never do that again,
Speaker:and then you make the second revision.
Speaker:Okay.
Speaker:This is starting.
Speaker:Now I'm going to switch my screen over to the web browser
Speaker:so we can see what it's doing.
Speaker:Sure.
Speaker:Great.
Speaker:The logs that we're seeing there, are they coming from Datadog,
Speaker:or directly from the manager?
Speaker:And so that was a direct connection to the manager.
Speaker:If we go over to Datadog here, these are going to be the logs.
Speaker:As you can see, the host is just what an EC2 hostname looks
Speaker:like, and they're all changing, but we're getting logs from every agent,
Speaker:as well as the workers.
Speaker:You can see they're launching.
Speaker:If we go back to Fastly, we can see that they're getting global traffic.
Speaker:So we're getting traffic on the West coast, the East
Speaker:coast, Ireland, Frankfurt, and
Speaker:Mumbai,
Speaker:and the bandwidth will just keep ramping up from here.
Speaker:For Datadog, is there a way to also filter by the manager?
Speaker:Sure.
Speaker:This is the live tail.
Speaker:We'll go to the past 15 minutes, and then you can filter by the Goose service.
Speaker:And then we have worker and manager, so I can show only the workers.
Speaker:And, sorry, only the manager.
Speaker:The manager is pretty quiet.
Speaker:The workers are not.
Speaker:You must've disabled
Speaker:displaying metrics regularly,
Speaker:because I would have expected to see that on the server.
Speaker:If I did, I did not intend to, but I probably did.
Speaker:Can we, is it easy to quickly see what command you passed in, or not,
Speaker:to go back there from where you're at right now?
Speaker:It's in Terraform, I think.
Speaker:It is all set here.
Speaker:So that's
Speaker:interesting.
Speaker:I have to figure out why you're not getting statistics on the
Speaker:manager, because you should be getting statistics on the manager.
Speaker:Is this the log you're tailing, or is this what's verbosely put out to the screen?
Speaker:This is
Speaker:what is put out to the screen.
Speaker:Yeah.
Speaker:Interesting.
Speaker:Okay.
Speaker:I would have expected statistics every 30
Speaker:seconds.
Speaker:So what's kind of interesting is you can expand this in Fastly and see we're
Speaker:doing significantly less traffic in Asia Pacific, but that makes sense,
Speaker:considering we're only hitting one of the PoPs. Europe and North
Speaker:America tend to be about the same, but you can even drill down further.
Speaker:One quick question.
Speaker:I saw you hard-code the IP address of the endpoint in the Terraform.
Speaker:How does Fastly still know, essentially, which PoP to route to?
Speaker:Or are they doing it through magic?
Speaker:You mean that I put the same IP address everywhere in /etc/hosts?
Speaker:Yep.
Speaker:Yeah.
Speaker:It's because of how they're doing traffic.
Speaker:So it is the same IP address everywhere, but the IP
Speaker:address points to different things,
Speaker:basically.
Speaker:It's cool.
Speaker:A lot of CDNs do it that way.
Speaker:So instead of different IP addresses, it's basically routing tricks.
Speaker:We seem
Speaker:to have maxed out.
Speaker:Can you look at the...
Speaker:Yeah, this should be about it.
Speaker:It should be all started at this point.
Speaker:Yeah.
Speaker:So we've launched a thousand users; we've entered the Goose attack.
Speaker:We have evened out at 14.5 gigabits per second, which is, I think, what we got
Speaker:on one server with 10,000 users as well.
Speaker:This is more than a single server.
Speaker:A single server, I think, maxed out at nine gigabit.
Speaker:Awesome.
Speaker:Thank you guys, all,
Speaker:for joining us.
Speaker:It was really cool to see that in action.
Speaker:All the links we mentioned are going to be posted in the video summary and the
Speaker:blog post that correlates with this.
Speaker:Be sure to check out Tag1.com/goose; that's tag, the number one, dot com.
Speaker:That's where we have all of our talks, documentation, and links to GitHub.
Speaker:There are some really great blog posts there that will show you
Speaker:step-by-step, with the code, how to do everything that we covered today.
Speaker:So be sure to check that out.
Speaker:If you have any questions about Goose, please post them to the Goose issue
Speaker:queues so that we can share them with the community.
Speaker:Of course, if you like this talk, please remember to upvote,
Speaker:subscribe, and share it out.
Speaker:You can check out our past Tag1 Team Talks on a wide variety of topics, from
Speaker:getting funding for your open source projects to things like decoupled
Speaker:systems and architectures for web applications, at tag1.com/tag1teamtalks.
Speaker:As always, we'd love your feedback and input on this episode, as
Speaker:well as ideas for future topics.
Speaker:You can email us at ttt@tag1.com. Again, a huge thank you to Jeremy, Fabian, and
Speaker:Narayan for walking us through this, and to everyone who tuned in today.
Speaker:Really appreciate you joining us.