Who We Are:
Simplebet is a B2B product development company using machine learning and real-time technology to make every moment of every sporting event a betting opportunity. We leverage real-time data from in-game events to create new and exciting betting opportunities for everyone’s favorite teams and players.
At Simplebet, we needed to answer the age-old question that every start-up must eventually tackle:
“What do we do with all of this data?“
We LOVE data at Simplebet, we pull it in from everywhere and analyze everything. We analyze all kinds of data from a wide array of sources that we find meaningful, some more directly than others, but all useful nonetheless. This data could be how Paul Goldschmidt is doing against sliders, or how that Thunderstorm in northern Texas is developing, or even Tom Brady’s stats from games where there’s snow.
Now, we have different uses for our data here at Simplebet, but we needed something flexible that also would help us keep costs low. We wanted something that we could use for Data ETL jobs to move production data to lower environments, but also something that could train our machine-learning algorithms. After some searching, trial and error, and a few POCs, we settled on Prefect.
Who is Prefect?
Prefect is a data automation workflow platform company. They have a few products and are in one of the earlier stages of the start-up lifecycle as far as companies go. They were founded in 2018 and are “The easiest way to automate your data.” They provide Prefect Platform to manage all your workflows and scheduling needs through their intuitive UI. While Prefect does manage and organize your data workflows and flowruns, it doesn’t provide a platform in which to execute those flows on, so we turn to AWS for that.
Why the Cluster Autoscaler?
At Simplebet, we use Kubernetes(K8s) pretty heavily on AWS. We use K8s and many related tools to containerize all of our services and provide them with the right hooks, resources, and access that they need. We decided that rather than have a dedicated EC2 instance to run all of our nightly ETL jobs, and then have to size completely different machines to train our Machine Learning Models on, that we wanted to offer the right-sized machine just-in-time for the job. We also decided to go with SPOT instances because of the short nature of some of the jobs and the reduced price of the SPOT instances. AWS SPOT instances are a type of EC2 instance that come from a public pool of servers and AWS can pull them back from you with a short notice at any moment, but the trade-off is that they are about 10% of the cost of an ON_DEMAND server.
How it Happens Today:
In addition to the above Cluster Autoscaler, we also have configured and deployed a lightweight Prefect Agent to run in our Kubernetes cluster where the compute is located that we will be using to run the Prefect Flows as executed in Kubernetes pods. Working with our outstanding Machine Learning and Data Engineering teams, who primarily leverage this product day-to-day, to help tune this agent to include anything it may need to both communicate back with the Prefect Platform and orchestrate the necessary pod workloads in the Kubernetes cluster.
Now, we knew we had all the right technologies to come up with a working solution, but we needed an orchestrator to manage these SPOT instances for us; Enter the Cluster Autoscaler. The Kubernetes Cluster Autoscaler monitors the active resources of a set of servers it is assigned to watch; In our case, it is watching a set metrics and then adjusting the desired size of the AWS Auto-Scaling Groups we’ve allocated for the Prefect Flow Runs, thus giving us that desired amount of EC2 instances. The Cluster Autoscaler also watches to see that the nodes are actually being used, and if they aren’t, it will reduce the size of the AWS Auto-Scaling Groups back down to their desired size.
With everything in place, we have a workflow that resembles the following:
- A Prefect Flow is scheduled on the Prefect Platform
- When scheduled, Prefect Platform uses the template K8s Job we have prepared and merges it with the job-specific configuration and communicates that with the Prefect Agent running in our K8s cluster
- The Prefect Agent uses the K8s API to schedule a Job with the local k8s scheduler.
- The Job is scheduled on a Namespace in K8s that can only exist on AWS SPOT instances that are managed by the Cluster Autoscaler
- The Cluster Autoscaler communicates with the AWS API to spin up a new EC2 instance for our K8s cluster
- When health-checks are passed from the new server, it is made available to the K8s cluster for scheduling, and the Prefect Job is scheduled there.
- The Prefect Flow is executed inside the K8s Job, manipulating data, storing artifacts, and resulting.
- When completed, the K8s job is then removed from the nodes and cleaned up.
- The Cluster Autoscaler notices the metrics of this server have reduced to next-to nothing and marks it as a candidate for removal.
- After a period of time, the Cluster Autoscaler will remove the unused nodes, powering them down and terminating them.
We end up with metrics that clearly show when we spin up more and more servers to handle the workloads, and then drop them off as we move out of the maintenance period.
What Did We Learn Doing This?
Ok, so you’re not reading this right now because you want to make the same mistakes we did learning how to put this solution in, so I’m here to share some secrets we learned along the way.
- DO NOT tell BOTH Prefect and Kubernetes to restart your Jobs
This may not seem intuitive at first, but both Prefect Flows and the Kubernetes Jobs have built-in features that allow them to be restarted if they don’t complete successfully. If a SPOT instance is removed midway through a job, you’ll probably want to restart that job so that it finishes all the way through. Now if both K8s and Prefect do it, there are some miscommunication issues that arise and both of them will fail. I recommend just setting the Prefect configuration to try a retry on a failure, and set your Jobs to use “restartPolicy: Never”
- Safe-to-evict: False
Currently, the Cluster Autoscaler can evict pods/jobs to move them to another node where they may be better suited. While this is a robust and powerful feature, it doesn’t work for us in this context, so we’ve added the annotation to all of our jobs:
- One size DOESN’T fit all
We have jobs that range from processing a hundred or so Megabytes of data, to hundreds of Terabytes, so we needed a way to allow our developers to pick the right box for their needs. We created an instance map for the Prefect Flows to reference that would correspond to an Auto-Scaling Group in AWS with the same instance size(s). This allowed our Data Engineering and Machine Learning teams to pick the correct size resource for the job, and it would then be delivered just-in-time for processing
- Nodegroups can be set to 0
We have the Prefect Agent deployed in our K8s cluster and it is running on an ON_DEMAND instance there so that it can always communicate with the Prefect Platform. The Nodegroups/Auto-Scaling Groups that the Cluster Autoscaler manages however, have their desired capacities set to 0. This means that when we don’t have ANY Prefect jobs up and running, there are no servers that we’re paying for. And when we do need those servers, the Cluster Autoscaler will see the pending scheduled K8s Jobs from the Prefect Agent and it will request a new SPOT instance for it to run.
With more and more data coming in, we want to be mindful of our future growth and priorities. To alleviate some hardships with scheduling in the beginning, we marked many of our jobs to use their own instances to better get a resource footprint of the individual jobs. With this new data, we plan to do some more sophisticated scheduling with our jobs to use even fewer nodes to accomplish our data manipulation needs.
We also plan on leveraging the new Prefect Orion product to tap into some of the new orchestration features they’re providing there, such as the ability to Loops and Conditional Logic in our Prefect Flows. We also are looking to take advantage of the new Asynchronous tasks in Orion for some of our ETL tasks to shorten them significantly.