The ability to scale workloads elastically is a key concept of any cloud computing platform. Elasticity is needed where the workload can change rapidly, with dramatic increases or decreases over a very short period of time and little to no advance warning. This type of change requires a flexible, automated system that can react without manual intervention. Automated provisioning mechanisms can help applications scale systems up and down in a way that balances performance and economic sustainability. So, what does it mean to scale?
Fig. 4.2/1: Two dimensions of scaling (panels: Cluster loads - Scale up; Cluster loads - Scale out)
Basically, scalability can be defined “as the ability of a particular system to fit a problem as the scope of that problem increases (number of elements or objects, growing volumes of work and/or being susceptible to enlargement)”. An example is increasing a system’s throughput by adding more software or hardware resources to cope with an increased workload. The ability to scale up a system may depend on its design and on the types of data structures, algorithms or communication mechanisms used to implement its components.

Load scalability means that a system makes good use of the available resources at different workload levels (i.e. it avoids excessive delay, unproductive consumption or contention). Factors that harm load scalability include poor use of parallelism, inappropriate scheduling of shared resources and excessive overheads. For example, a web server maintains a good level of load when the number of threads that execute HTTP requests is increased during a workload peak. If the workload is a video streaming service, volume is typically measured by the number of users watching a video at any given time; in this case, increasing the network bandwidth between the streaming servers and the user devices would address increases in the workload. Similarly, increasing CPU power would handle an increase in an insurance company’s underwriting workload, which is measured by the size and complexity of the policies being processed.

Be aware that elasticity and scalability require a system to react not only to an increase in workload but equally to a decrease (something that is often overlooked, and sometimes a harder problem to solve). Autoscaling depends on a monitoring agent that collects metric data at the operating system level. Scaling actions can be classified as vertical or horizontal.

Dimensions of scaling

Scaling Vertically
Vertical scaling improves the performance of a single service in your stack. In simple terms, to scale vertically is to increase the resources of the server.
Scaling Horizontally
Horizontal scaling, or increasing the number of nodes in the cluster, reduces the responsibilities of each member node. Reasons to scale horizontally include increasing I/O concurrency, reducing the load on existing nodes, and increasing disk capacity. The goal here is to handle tasks in parallel, for example by launching a new web server to handle more requests concurrently or by spreading your database across several servers.
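
The difference between the two dimensions can be illustrated with a small model. The sketch below is a simplification and not a description of any particular platform: the `Cluster` class, its fields and the requests-per-second unit are hypothetical, and the workload is assumed to be spread perfectly evenly across nodes.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical model: identical nodes sharing a workload evenly."""
    nodes: int              # number of member nodes
    capacity_per_node: int  # requests/s one node can serve (assumed unit)

    def total_capacity(self) -> int:
        return self.nodes * self.capacity_per_node

    def load_per_node(self, workload: float) -> float:
        # Assumes a perfectly even distribution of the workload.
        return workload / self.nodes

    def scale_up(self, extra_capacity: int) -> "Cluster":
        # Vertical: same node count, each node becomes more capable.
        return Cluster(self.nodes, self.capacity_per_node + extra_capacity)

    def scale_out(self, extra_nodes: int) -> "Cluster":
        # Horizontal: more nodes, each with unchanged capacity.
        return Cluster(self.nodes + extra_nodes, self.capacity_per_node)


base = Cluster(nodes=2, capacity_per_node=100)
workload = 300  # requests/s, above the base capacity of 200

up = base.scale_up(100)   # 2 nodes at 200 requests/s each
out = base.scale_out(2)   # 4 nodes at 100 requests/s each

print(up.total_capacity(), up.load_per_node(workload))    # 400 150.0
print(out.total_capacity(), out.load_per_node(workload))  # 400 75.0
```

Both variants reach the same total capacity, but only scaling out reduces the share of the workload that each node has to carry.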

 Horizontal scaling

One of the distinguishing characteristics of the cloud model is the ability of service users to scale computing resources horizontally to match customer demand. Because the cloud model is offered on a pay-as-you-go basis, it is in the service user's best interest to maximize utilization while still providing a high quality of service to the customer. Horizontal scaling is also called “scaling out” when nodes are added and “scaling in” when they are removed. It usually refers to tying multiple independent computers together to provide more processing power, by adding nodes or virtual machines to a system or removing them as necessary. That is, the capacity of each individual node does not change, but its load decreases as nodes are added. Cloud platforms offer extensive support for scaling virtual application instances by injecting components into each deployed virtual machine that work together to achieve the desired behavior. Horizontal scaling typically implies multiple instances of the operating system, residing on separate hosts.
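
The trade-off between utilization and quality of service can be made concrete with a little arithmetic. The following is a minimal sketch under stated assumptions: the function name `instances_needed`, the idea of a target utilization and the example numbers are all hypothetical, and every instance is assumed to have the same capacity.

```python
import math


def instances_needed(workload: float, capacity_per_instance: float,
                     target_utilization: float) -> int:
    """Smallest number of identical instances that keeps the average
    utilization at or below target_utilization (a fraction in (0, 1])."""
    if workload < 0 or capacity_per_instance <= 0:
        raise ValueError("workload must be >= 0 and capacity must be > 0")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    # Capacity each instance may use before it exceeds the target.
    usable = capacity_per_instance * target_utilization
    # Always keep at least one instance running.
    return max(1, math.ceil(workload / usable))


# Each instance serves 100 requests/s and should stay at or below 75 % busy.
for load in (100, 300, 900):
    print(load, instances_needed(load, 100, 0.75))
```

A higher target utilization lowers the pay-as-you-go cost but leaves less headroom for sudden peaks; a lower target does the opposite.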
Fig. 4.2/2: Horizontal scaling

 Vertical scaling

This type of scalability, in other words “improving the capabilities” of a node/server, gives greater capacity to the node but does not decrease the overall load on existing members of the cluster. Vertical scaling is also described as scaling up when resources are added and scaling down when they are removed. It typically refers to increasing IOPS or adding more processors, RAM and storage to a symmetric multiprocessing (SMP) system to extend its processing capability, or reducing that capacity accordingly. Generally, this form of scaling employs only one instance of the operating system.
Fig. 4.2/3: Vertical scaling


Scaling criteria
Whether new VMs are added to or removed from a virtual application (horizontal scaling), or memory or CPUs are added to or removed from an existing VM (vertical scaling), can be based on defined “trigger events”. Each trigger event consists of the actual metric (for example, the measured CPU utilization of a VM) and the thresholds for scaling in and out, respectively. You can also define how long the threshold must have been exceeded before action is taken. Finally, you can define the minimum and maximum number of instances (VMs) that can exist within the deployed virtual application instance.
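
The elements of a trigger event described above can be sketched as a small decision function. This is an illustration only, not the interface of any specific cloud product: `ScalingPolicy`, `decide` and all field names are hypothetical, the duration is expressed as a number of consecutive metric samples rather than as wall-clock time, and the function adds or removes one instance at a time.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ScalingPolicy:
    """Hypothetical trigger event for horizontal scaling."""
    scale_out_threshold: float  # metric value above which an instance is added
    scale_in_threshold: float   # metric value below which an instance is removed
    duration: int               # consecutive samples that must breach a threshold
    min_instances: int
    max_instances: int


def decide(policy: ScalingPolicy, samples: Sequence[float],
           current_instances: int) -> int:
    """Return the desired instance count for the given metric samples
    (oldest first), for example CPU utilization in percent."""
    if policy.duration < 1:
        raise ValueError("duration must be at least 1 sample")
    recent = samples[-policy.duration:]
    if len(recent) < policy.duration:
        return current_instances  # not enough data to act on yet
    if all(s > policy.scale_out_threshold for s in recent):
        return min(current_instances + 1, policy.max_instances)
    if all(s < policy.scale_in_threshold for s in recent):
        return max(current_instances - 1, policy.min_instances)
    return current_instances


policy = ScalingPolicy(scale_out_threshold=80, scale_in_threshold=20,
                       duration=3, min_instances=1, max_instances=4)

print(decide(policy, [85, 90, 95], current_instances=2))  # 3
print(decide(policy, [85, 90, 70], current_instances=2))  # 2
print(decide(policy, [10, 5, 15], current_instances=2))   # 1
```

Requiring the threshold to be breached for the whole duration keeps a single short spike from starting or stopping instances, and the minimum and maximum bounds cap both the cost and the loss of capacity.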