What is scalability? Scalability is the property of a system to handle a growing amount of work by adding resources to it. Such an architecture can support growth in users, traffic, or data without a drop in performance.

Two ways to scale an architecture:

  • Vertical Scaling: Scaling vertically means adding resources to (or removing them from) a single node, typically by adding CPUs, memory, or storage to an individual computer.

  • Horizontal Scaling: Scaling horizontally means adding more resources (nodes) to a system, such as adding more servers to a distributed software application.

Not all architectures are designed to distribute their workload across resources, so let’s look at a few scenarios:

  • Stateless Applications: A stateless application stores no session information, which means that, given the same input, it provides the same response to any end user. Such applications can scale horizontally because any of the compute resources (such as EC2 instances or Lambda functions) can service any request.
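
For illustration, here is a minimal sketch of a stateless, Lambda-style handler in Python (the event fields are hypothetical). The response is computed entirely from the input, never from local state, which is what lets any instance serve any request:

```python
import json

def handler(event, context):
    """Stateless request handler: the response depends only on the input,
    so any instance or function invocation can service the request."""
    # Everything needed to build the response arrives with the request itself.
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }
```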

  • Distribute Load to Multiple Nodes: To distribute load across multiple nodes, we can use either a push model or a pull model.

    • Push model: Incoming requests are distributed to multiple servers, typically by a load balancer. We can use either a Network Load Balancer or an Application Load Balancer (for container-based requests) depending upon the use case.

    • Pull model: The pull model uses an asynchronous, message-driven architecture in which messages are stored in a queue, and multiple resources pull and consume them in a distributed fashion.
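
A minimal sketch of the pull model using Amazon SQS with boto3 (the queue URL and the `process` helper are placeholders). Because every worker polls the same queue, adding more workers scales consumption horizontally:

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue"  # placeholder

def process(body: str) -> None:
    """Application-specific work on one message (hypothetical)."""
    print("processing:", body)

def worker():
    """Pull-model consumer: any number of these can run in parallel,
    each pulling messages from the shared queue."""
    while True:
        # Long polling: wait up to 20 seconds for messages to arrive.
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,
        )
        for msg in resp.get("Messages", []):
            process(msg["Body"])
            # Delete only after successful processing so failed work is retried.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

if __name__ == "__main__":
    worker()
```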

  • Stateless Components: Consider storing only a unique session identifier in an HTTP cookie and keeping the more detailed user information on the server side. Instead of storing that information on a local file system, store it in a database; if you’re using AWS, DynamoDB is a good choice. This way the application can scale horizontally.
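
A sketch of such a session store backed by DynamoDB (the `sessions` table name and its `session_id` partition key are assumptions). Since the session data lives in the database rather than on any one server, every application server can handle every request:

```python
import uuid
import boto3

dynamodb = boto3.resource("dynamodb")
sessions = dynamodb.Table("sessions")  # assumed table with partition key "session_id"

def create_session(user_data: dict) -> str:
    """Store session details server-side; only the id goes into the cookie."""
    session_id = str(uuid.uuid4())
    sessions.put_item(Item={"session_id": session_id, **user_data})
    return session_id  # set this as the cookie value

def load_session(session_id: str) -> dict | None:
    """Any application server can look the session up by its id."""
    resp = sessions.get_item(Key={"session_id": session_id})
    return resp.get("Item")
```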

  • Implement Session Affinity: For HTTP/HTTPS traffic, we can use the sticky sessions feature of an Application Load Balancer to bind a user’s session to a specific instance. With this feature, the ALB will try to route that user to the same server for the duration of the session.
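
Stickiness is configured on the ALB’s target group; here is a sketch using boto3 (the target group ARN is a placeholder, and the one-day duration is an arbitrary example):

```python
import boto3

elbv2 = boto3.client("elbv2")

# Enable load-balancer-generated cookie stickiness on the target group.
elbv2.modify_target_group_attributes(
    TargetGroupArn="arn:aws:elasticloadbalancing:...:targetgroup/my-tg/abc123",  # placeholder
    Attributes=[
        {"Key": "stickiness.enabled", "Value": "true"},
        {"Key": "stickiness.type", "Value": "lb_cookie"},
        # Route a user's requests to the same target for up to one day.
        {"Key": "stickiness.lb_cookie.duration_seconds", "Value": "86400"},
    ],
)
```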

  • Implement Distributed Processing: This suits use cases that involve processing large amounts of data, or anything that can’t be handled by a single compute resource. The distributed processing approach divides a task and its data into many small fragments of work and executes them in parallel across a set of compute resources.
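
As a single-machine illustration of the idea using Python’s standard library (the toy summing task is hypothetical; the same divide-and-conquer pattern extends to a fleet of workers), a task is split into fragments and executed in parallel:

```python
from concurrent.futures import ProcessPoolExecutor

def process_fragment(fragment: list[int]) -> int:
    """The work applied to one small fragment (here: a toy sum)."""
    return sum(fragment)

def main() -> None:
    data = list(range(1_000_000))
    chunk = 100_000
    # Divide the task and its data into many small fragments of work...
    fragments = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    # ...and execute them in parallel across a pool of workers.
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(process_fragment, fragments))
    print("total:", sum(partials))

if __name__ == "__main__":
    main()
```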

Reference: AWS Cloud Best Practices