Horizontal scaling is a common scaling strategy for resource-constrained, stateless workloads. It works by changing the number of individual processes serving the workload, as opposed to the size of the resources allocated to those processes. The latter is typically known as vertical scaling. HTTP servers, stream processors, and batch processors are all examples of workloads that can benefit from horizontal scaling.
At Twilio Segment, we use horizontal scaling to optimize our compute resources. Our systems experience fluctuations in workload volume on many different timescales, from minute-to-minute changes to seasonal variation. Horizontal scaling allows us to address this without changing anything about the individual deployment unit, decoupling this operational concern from the day-to-day development workflow.
We deploy most services with Kubernetes, which means our most common tool for scaling them is the Horizontal Pod Autoscaler, or HPA. While appearing simple on the surface, the HPA can be tricky to configure properly.
Today we’ll look at basic and advanced HPA configuration, as well as a case where proper tuning led to a doubling in service efficiency and over $100,000 of monthly cloud compute value saved.
The Horizontal Pod Autoscaler
The HPA is included with Kubernetes out of the box. It is a controller, which means it works by continuously watching and mutating Kubernetes API resources. In this particular case, it reads HorizontalPodAutoscaler resources for configuration values, and calculates how many pods to run for associated Deployment objects. The calculation solves for the number of pods to run by attempting to project how many more or fewer are necessary to bring an observed metric value close to a target metric value.
To use the HPA, we first must choose a suitable metric to target. To find the right metric, ask: for each additional unit of work, which additional resource does the service consistently need the most of? In the vast majority of cases this will be CPU utilization. The HPA controller supports memory and CPU as target metrics out of the box, and can be configured to target external metric sources as well. Throughout this guide, we will assume that CPU utilization is our target metric.
The primary parameters that the HPA exposes are minReplicas, maxReplicas, and target utilization. Minimum and maximum replicas are simple enough; they define the limits that the HPA will work within when choosing a new number of pods. The target utilization is the resource utilization level that you want your service to run at in steady state.
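As a rough illustration of where these three parameters live, here is a minimal sketch of an HPA object built as a Python dictionary and printed as JSON (which `kubectl apply -f` accepts alongside YAML). It uses the `autoscaling/v2` API that went GA in Kubernetes 1.23. The Deployment name `http-server` and the replica bounds are hypothetical values chosen for the example, not a recommended or actual configuration.

```python
import json

# Hypothetical example values: the name "http-server" and the replica
# bounds below are illustrative only.
hpa_manifest = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "http-server"},
    "spec": {
        # The Deployment whose pod count the HPA manages.
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "http-server",
        },
        # Limits the HPA works within when choosing a pod count.
        "minReplicas": 4,
        "maxReplicas": 40,
        # Target: average CPU utilization of 50% across pods,
        # measured relative to each pod's CPU request.
        "metrics": [
            {
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {"type": "Utilization", "averageUtilization": 50},
                },
            }
        ],
    },
}

print(json.dumps(hpa_manifest, indent=2))
```

Note that utilization targets are computed against the CPU request set on the pods, so the Deployment needs CPU requests defined for this to work.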
To illustrate how the target utilization works, consider a basic HTTP server configured to use 1 CPU core per pod. At a given request rate, say 1k rps, the service might require eight pods to maintain a normalized average CPU utilization rate of 50%.
Now say that traffic increases to 1.5k rps, a 50% increase. At this level of traffic, our service exhibits a 75% CPU utilization.
If we wanted to bring the utilization back down to 50%, how would we do it? Well, with eight pods and 1k rps, our utilization was 50%. If we assume a linear relationship between the number of pods and CPU utilization, we can expect that the new number will simply be the current number, multiplied by the ratio of new to old utilization, i.e., 8 * (0.75/0.5) = 8 * 1.5 = 12. Thus the HPA will set the new desired number of pods to 12.
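The arithmetic above can be sketched in a few lines of Python. This is a simplified model rather than the controller's actual code: it applies the documented rule (current replicas times the ratio of observed to target metric, rounded up), clamps the result to the replica bounds, and skips scaling when the ratio is within a tolerance of 1.0. The function name and the default bounds are assumptions for illustration; the 0.1 tolerance mirrors the Kubernetes default.

```python
import math

def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=1, max_replicas=100, tolerance=0.1):
    """Simplified sketch of the HPA replica calculation.

    Utilizations are percentages (e.g. 75 for 75%). The bounds are
    hypothetical defaults for this example.
    """
    ratio = current_utilization / target_utilization
    # Within the tolerance band, the HPA leaves the pod count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    # Scale proportionally to the ratio, rounding up to a whole pod.
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured minimum and maximum.
    return max(min_replicas, min(max_replicas, desired))

# The example from the text: 8 pods at 75% utilization, targeting 50%.
print(desired_replicas(8, 75, 50))  # prints 12
```

The same function also covers the downward case: eight pods observed at 25% utilization against a 50% target would yield four pods.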
Tuning your target value
The target utilization should be set sufficiently low such that any rapid changes in usage don’t saturate the service before additional pods can come online, but high enough that compute resources aren’t wasted. In the best case, additional pods can be brought online within several seconds; in the worst it may take several minutes. The time required depends on a variety of factors, such as image size, image cache hit rate, process startup time, scheduler availability, and node startup time.
The final value will depend on the workload and its SLOs. We find that for well-behaved services, a CPU utilization target of 50% is a good place to start.
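One way to reason about this trade-off is to ask how much extra load the existing pods can absorb before they saturate, while new pods are still starting. The sketch below assumes, as in the earlier example, that CPU utilization grows linearly with load and that saturation means 100% utilization; both are simplifications, and the function name is hypothetical.

```python
def absorbable_increase(target_utilization):
    """Fractional load increase existing pods can take before hitting 100% CPU.

    Assumes utilization scales linearly with load. target_utilization is a
    fraction between 0 and 1 (e.g. 0.5 for 50%).
    """
    return 1.0 / target_utilization - 1.0

for target in (0.5, 0.6, 0.8):
    headroom = absorbable_increase(target)
    print(f"target {target:.0%}: existing pods absorb about {headroom:.0%} more load")
```

Under these assumptions, a 50% target leaves room for load to roughly double before saturation, while an 80% target leaves room for only about a quarter more. A service whose pods take minutes to start generally needs more of that headroom than one whose pods start in seconds.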
In order to better understand your service’s scaling behavior, we recommend exporting certain HPA metrics, which are available via kube-state-metrics. The most important is kube_horizontalpodautoscaler_status_desired_replicas, which allows you to see how many pods the HPA controller wants to run for your service at any given time. This lets you see how your scaling configuration is working, especially when combined with utilization metrics.
Kubernetes 1.23 brought the HPA v2 API to GA; previously it was available as the v2beta2 API. It exposes additional options for modifying the HPA’s scaling behavior. You can use these options to craft a more efficient autoscaling policy.
Understand your utilization patterns
To take advantage of the v2beta2 options, you have to understand how your service’s workload varies in its typical operation. You can think of the overall variation as a combination of the following types of variations:
Consider four different HTTP traffic patterns:
Adding pods is often too slow an operation to rely on for handling rapid increases in workload size.
It follows that any rapid increase in workload size must be handled by the existing pods and associated infrastructure. For example, a load balancer may queue more requests under a rapid load increase. If the increase persists, the HPA will create more pods to bring the target metric back to baseline, and the queue depth will return to normal.
For each traffic pattern, here's how you might configure the HPA:
With this model in mind, we can now look at the v2beta2 configuration.
v2beta2 configuration options
There are essentially two concepts introduced by the v2beta2 API: scaling policies, and stabilization windows.
Scaling policies, described in detail in the Kubernetes documentation, allow the user to limit the rate of scaling, both up and down.
Scaling policies help you tweak how quickly the HPA can effect changes to the pod count. Say you have a service that experiences intermittent low workload volume, but which quickly returns to average after a few minutes. You wouldn’t want the HPA to scale the service down prematurely, so you can set a scaling policy to limit the velocity of the downscaling.
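To make the rate limit concrete, here is a simplified sketch of how a `Percent` scale-down policy caps the change in pod count over one policy period. It is a model of the idea, not the controller's implementation, which also tracks scaling events that happened earlier in the period. The function name and the numbers are hypothetical.

```python
def limit_scale_down(current_replicas, desired_replicas, max_percent_per_period):
    """Pod count after one period under a Percent scale-down policy (sketch).

    At most max_percent_per_period percent of the current pods may be
    removed in a single period. Scale-ups pass through unchanged here.
    """
    if desired_replicas >= current_replicas:
        return desired_replicas
    # Integer arithmetic avoids floating-point rounding surprises.
    max_removed = current_replicas * max_percent_per_period // 100
    floor_count = current_replicas - max_removed
    return max(desired_replicas, floor_count)

# Hypothetical scenario: the raw calculation wants to drop from 40 to 20
# pods, but the policy allows removing only 10% of pods per period.
replicas = 40
while replicas > 20:
    replicas = limit_scale_down(replicas, 20, 10)
    print(replicas)
```

With a limit like this, a dip in workload that lasts only a few periods removes a small fraction of the pods, so the service still has most of its capacity when volume returns to average.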
Stabilization windows essentially allow the user to configure a trailing utilization minimum when applied to upwards scaling, and a trailing utilization maximum when applied to downwards scaling. They are described in more detail in the Stabilization Window section of the Kubernetes documentation.
Imagine a service that operates at a steady state, aside from brief rapid and temporary utilization spikes. We don’t want the HPA to change the number of pods in this case, so as long as our stabilization window is greater than the duration of the deviation, the HPA will essentially “smooth over” the deviation.
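The smoothing effect can be sketched with a small simulation. The model below is a simplification: it measures windows in evaluation ticks rather than seconds, and takes a list of raw replica recommendations (one per tick) as input. For scale-up it only moves to the lowest recommendation seen in the window, and for scale-down only to the highest. The function name and the sample data are hypothetical.

```python
from collections import deque

def stabilize(raw_recommendations, initial_replicas, up_window, down_window):
    """Apply scale-up and scale-down stabilization windows (sketch).

    Windows are counted in evaluation ticks and must be >= 1; a window
    of 1 means no stabilization in that direction.
    """
    history = deque(maxlen=max(up_window, down_window))
    current = initial_replicas
    applied = []
    for recommendation in raw_recommendations:
        history.append(recommendation)
        recent = list(history)
        # Scale up only as far as the lowest recent recommendation.
        up_floor = min(recent[-up_window:])
        # Scale down only as far as the highest recent recommendation.
        down_ceiling = max(recent[-down_window:])
        if current < up_floor:
            current = up_floor
        elif current > down_ceiling:
            current = down_ceiling
        applied.append(current)
    return applied

# A one-tick spike (16) followed later by a sustained increase.
raw = [10, 10, 16, 10, 10, 16, 16, 16, 16, 10]
print(stabilize(raw, initial_replicas=10, up_window=3, down_window=3))
# prints [10, 10, 10, 10, 10, 10, 10, 16, 16, 16]
```

The brief spike at the third tick is ignored because it does not outlast the three-tick scale-up window, while the sustained increase is acted on once it has persisted for the full window. The cost is that genuine increases are also delayed by the window length, so the existing pods must be able to carry the extra load in the meantime.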
Here's pod count (purple) and request count (blue) for one of Segment's HTTP services over a 45-minute period. Notice how each brief spike of the blue line leads to an upscaling, even though the spike is practically over by the time the new pods come online. This occurs primarily because the stabilization window set on scale-up is too short. There is also practically no limit on the rate at which the service can scale up.
By setting a stabilization window and a scaling policy, we can tame these unnecessary scaling actions and maintain a more efficient service. The graph below charts the same quantities after applying the new configuration. Notice that the pod count does not change following two brief request spikes.
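For reference, both settings live under the `behavior` field of the HPA spec. The sketch below builds such a fragment as a Python dictionary and prints it as JSON. The field names are those of the Kubernetes API, but every number is a hypothetical placeholder and not the configuration used for the service described here; suitable values depend on how long your spikes last and how quickly your pods start.

```python
import json

# Hypothetical values for illustration only.
behavior = {
    "scaleUp": {
        # Ignore increases that do not persist for at least 2 minutes.
        "stabilizationWindowSeconds": 120,
        # Add at most 50% more pods per minute.
        "policies": [{"type": "Percent", "value": 50, "periodSeconds": 60}],
    },
    "scaleDown": {
        # Only scale down to the highest recommendation of the last 5 minutes.
        "stabilizationWindowSeconds": 300,
        # Remove at most 4 pods per minute.
        "policies": [{"type": "Pods", "value": 4, "periodSeconds": 60}],
    },
}

# This fragment would be merged into the spec of an existing HPA object.
print(json.dumps({"spec": {"behavior": behavior}}, indent=2))
```

When `behavior` is omitted, Kubernetes defaults to no stabilization on scale-up and a 300-second window on scale-down, which is consistent with the eager upscaling seen in the first graph.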
The service in this example is quite large, and currently uses around 120 c6i.8xl nodes on average. By carefully tuning the scaling policy and stabilization window, we were able to double the number of requests served per core, effectively halving the number of instances needed.
Below you can see the change in requests served per core over the tuning period. The series is smoothed to better visualize the trend.
At the on-demand sticker price, that’s nearly $120k of monthly savings!
The HPA is a powerful tool for managing variable workloads on Kubernetes, but it’s not without its quirks. By understanding the more advanced configuration options, operators can gain additional control over the behavior of the autoscaler, making for more efficient services and potentially big compute savings.