Why and how concurrency-based autoscaling may be useful

Every time we configure how a system should scale out and scale in, we have to decide which metrics and policies will be most effective. I have already described CPU-based policies and the challenges they may bring. Ideally, we would review the configuration and fix the root cause.

But what if that is not an option? Suppose the service is a legacy application, written years ago and poorly documented, and we still want to run it and autoscale it.

I'd like to share some possible approaches to handling such a situation.

There are no good or bad policies, only policies that are effective or ineffective for a particular job. How well a policy applies depends on the load the application must handle, the nature of that load (stable, linearly changing, periodically changing, random), how quickly the load changes, the reliability and availability required, the ability to auto-recover, and other factors.

The usual approach is to use CPU-based metrics. For high-availability applications that serve many concurrent client requests, this metric might not reflect the real load, because the relation between CPU utilization and the number of requests or users may be non-linear.

It is still fine to use CPU-based autoscaling policies, but concurrency-based policies may make your resource usage more effective. The worst scenario for a CPU-based metric is when, for some reason, an instance under load can no longer respond to health checks while its CPU usage stays at a normal level.

In this case, because the instance is not responding to health checks, the autoscaling group removes it from the group and terminates it. This hits a system that is already under heavy load, since the remaining instances have to absorb its traffic. Concurrency-based metrics can help handle such situations.

Here you can read more about how concurrency-based metrics may be more effective: How to Lose Money with the New AWS ELB Network Load Balancer

When concurrency-based metrics come up, the ActiveConnectionCount metric is often mentioned. It has a problem that limits its usefulness for autoscaling: ActiveConnectionCount represents the number of connections to the load balancer, but it does not take the number of target instances into account. For example, say you have a target tracking policy that keeps ActiveConnectionCount at 50 connections, with 5 instances running. When the load increases and ActiveConnectionCount reaches 100 connections, the autoscaling group adds more instances.

Suppose that after scaling out there are 10 instances and still 100 connections. If the load was a short spike, the policy helps work through the requests and then scales in. But what if the increased load is not a short-term event? Supposing one instance can process 10 connections, 10 instances for 100 connections is fine. But because the policy keeps trying to hold ActiveConnectionCount at 50, it scales out again, and at some point you end up running the maximum number of instances the policy allows.
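The runaway scale-out described above can be sketched with a toy model. This is only an illustration under a stated assumption: target tracking is approximated as "multiply capacity by metric / target", which is a simplification and not the exact algorithm AWS uses. The function names and the maximum size of 40 are hypothetical.

```python
import math

def next_capacity(current, metric_value, target_value, max_size):
    """Simplified target-tracking step (an assumption, not the real AWS algorithm):
    scale capacity in proportion to metric / target, capped at max_size."""
    desired = math.ceil(current * metric_value / target_value)
    return max(1, min(desired, max_size))

def simulate_aggregate(connections, target, start, max_size, rounds):
    """Track an aggregate connection count. The metric stays the same no matter
    how many instances are added, so the policy never reaches its target."""
    capacity = start
    history = [capacity]
    for _ in range(rounds):
        capacity = next_capacity(capacity, connections, target, max_size)
        history.append(capacity)
    return history

# 100 sustained connections, target of 50, starting from 5 instances
history = simulate_aggregate(connections=100, target=50, start=5, max_size=40, rounds=4)
print(history)  # [5, 10, 20, 40, 40]
```

Because adding instances does not lower the aggregate metric, the simulated group keeps growing until it hits the cap, even though 10 instances were already enough for 100 connections.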

ActiveConnectionCount is useful when you have a steady load with spikes that don't last long. If you want to handle a situation where the number of connections steadily grows to 4x the normal amount and then drops back to the normal level after several hours, ActiveConnectionCount is not the best choice.

So what we want is ActiveConnectionCount per target, and RequestCountPerTarget is a metric that comes close to it. In most cases, it lets us scale in and out in proportion to the actual load.
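As a rough sketch of what such a policy could look like, the snippet below builds the arguments for the Auto Scaling `put_scaling_policy` call, assuming the per-target request metric discussed above maps to the predefined metric type `ALBRequestCountPerTarget`. The group name, resource label, and target value are hypothetical placeholders, and the helper function is not part of any library.

```python
def request_count_policy(asg_name, resource_label, target_per_instance):
    """Build keyword arguments for a target tracking policy on requests per target.
    The helper and all values passed to it are illustrative assumptions."""
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": f"{asg_name}-requests-per-target",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ALBRequestCountPerTarget",
                # Format: app/<lb-name>/<lb-id>/targetgroup/<tg-name>/<tg-id>
                "ResourceLabel": resource_label,
            },
            "TargetValue": float(target_per_instance),
        },
    }

policy = request_count_policy(
    asg_name="legacy-app-asg",  # hypothetical group name
    resource_label="app/my-alb/0123456789abcdef/targetgroup/my-targets/fedcba9876543210",
    target_per_instance=600,  # hypothetical requests per target
)

# Applying it requires boto3 and AWS credentials:
#   import boto3
#   boto3.client("autoscaling").put_scaling_policy(**policy)
```

Since the metric is already divided by the number of targets, adding instances lowers its value, so the group can settle instead of growing to its maximum.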

Now back to the scenario where an instance stops responding to health checks under heavy load: it is still counted as a registered target for some time. So during a spike to 4x the normal load, you may have a couple of instances not responding while no scale-out has started to handle the situation proactively.

So RequestCountPerTarget is a good metric for steadily increasing and decreasing load. For spikes, it may be a little slow to react and, as a result, may cause unnecessary downtime and errors. You can see this by comparing the RequestCountPerTarget and HealthyHostCount metrics.

So it would help to have not just requests per target, where some targets might be overloaded, but requests or connections per healthy target. That way you can detect early when some instance is overloaded and likely to be removed by the autoscaling group. You can then create a policy that adds several instances at once, which rapidly reduces the load and makes up for any unhealthy instances that are lost.
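The difference between the two views can be shown with a small calculation. This is a minimal sketch with hypothetical numbers and function names: 200 connections during a spike, 5 registered instances of which only 3 still pass health checks, and an assumed capacity of 10 connections per instance.

```python
import math

def load_per_healthy_target(total_load, healthy_hosts):
    """Load carried by each instance that still passes health checks."""
    if healthy_hosts <= 0:
        return math.inf  # nothing healthy is left to serve traffic
    return total_load / healthy_hosts

def instances_to_add(total_load, healthy_hosts, per_instance_limit):
    """Instances to add so that each one stays within its limit,
    treating unhealthy instances as already lost."""
    needed = math.ceil(total_load / per_instance_limit)
    return max(0, needed - healthy_hosts)

connections = 200   # hypothetical spike
registered = 5      # targets the load balancer still lists
healthy = 3         # targets that still answer health checks

per_registered = connections / registered                    # 40.0
per_healthy = load_per_healthy_target(connections, healthy)  # about 66.67
extra = instances_to_add(connections, healthy, per_instance_limit=10)

print(per_registered, round(per_healthy, 2), extra)
```

The per-healthy-target value is noticeably higher than the per-registered-target value, which is the early signal the paragraph above asks for. One option is to publish such a value as a custom metric and attach a scaling policy to it.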

Sources & More details:

AWS documentation

How to Lose Money with the New AWS ELB Network Load Balancer