The 5 Pillars of Good Solution Architecture: Performance and Scalability
By Sergio Barbosa (CIO – Global Kinetic)
In this second part of our 5-part series on the pillars of good solution architecture, we look at Performance and Scalability. Designing for scale and performance requires a clear plan of when to scale up and when to scale out. This plan needs to be accompanied by a process for optimizing network and storage performance and identifying bottlenecks.
Scaling Up vs. Scaling Out
Scaling up involves increasing the CPU power of your servers or adding more RAM or bigger disk drives to them, whereas scaling out involves adding more servers in parallel to distribute the compute, memory, or storage requirements across them. There is a hard limit to how far you can scale up, whereas scaling out is, hypothetically, limitless through modern cloud provider offerings.
From an architecture perspective, special consideration needs to be given if you want your solution to be able to scale out. For one, you need to have some Load Balancing technology in place to determine which instance to send traffic to or delegate compute power to. Additionally, if you need to scale out your data storage, you either need to implement “sharding” or design your solution so that data that belongs together is logically grouped and never needs to be split across instances.
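As a minimal sketch of the sharding approach, here is a hash-based router in Python that keeps all of one customer’s data on the same shard. The connection strings and the choice of customer ID as the shard key are illustrative assumptions, not a prescription.

```python
import hashlib

# Hypothetical shard connection strings; in practice each points at a
# separate database instance.
SHARDS = [
    "postgres://db-shard-0.internal/app",
    "postgres://db-shard-1.internal/app",
    "postgres://db-shard-2.internal/app",
]

def shard_for(customer_id: str) -> str:
    """Route all data for one customer to the same shard, so related
    records stay grouped and never need a cross-shard query."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("customer-42"))  # always resolves to the same shard
```

Note that simple modulo routing reshuffles keys whenever a shard is added; consistent hashing is the usual refinement once the shard count needs to change.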

Once you have a handle on how you can go about scaling up and/or scaling out, you can start putting a plan together of when to do so for the solution that you are designing. Most infrastructure providers offer interfaces to programmatically scale up your virtual machines or compute instances, or to scale out by adding more in parallel. It is important, though, to clearly understand the workload of the various components of your system over time so that you can put this plan together and make use of these programmatic interfaces.
For example, certain parts of your system may have an increased load at certain times of day, days of the month, or times of year. The ordering function in a food delivery system will be much busier over the lunch hour, and the payment authorization function of a card management system will be a lot busier at month-end or over the festive season. Combining this kind of understanding with the non-functional requirements of your solution will enable you to separate out specific workloads and scale them independently according to a plan.
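One way to express such a plan is as data. The sketch below assumes hypothetical workload names and instance counts, and `set_capacity` stands in for whatever programmatic scaling interface your provider actually exposes.

```python
from datetime import datetime, timezone

# Illustrative plan: (workload, start hour, end hour, instance count),
# based on known busy windows such as the lunch-hour ordering peak.
SCALING_PLAN = [
    ("ordering",     11, 14, 12),   # lunch-hour peak
    ("ordering",      0, 11,  3),
    ("ordering",     14, 24,  3),
    ("payment-auth",  0, 24,  4),   # scaled separately at month-end
]

def desired_capacity(workload: str, now: datetime) -> int:
    """Look up how many instances a workload should have right now."""
    for name, start, end, instances in SCALING_PLAN:
        if name == workload and start <= now.hour < end:
            return instances
    return 1  # safe default outside any planned window

def set_capacity(workload: str, instances: int) -> None:
    """Placeholder for a real provider call, e.g. an autoscaling-group API."""
    print(f"scaling {workload} to {instances} instances")

now = datetime.now(timezone.utc)
set_capacity("ordering", desired_capacity("ordering", now))
```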
There is an added complexity in that you may not know in advance when your system will require additional power to deal with certain workloads. A celebrity, for example, may decide to mention your SaaS product in a social media post, and suddenly your system must deal with thousands of new sign-ups per second. In these kinds of scenarios, it is critical, first, to have Application Performance Monitoring (APM) in place so that your system knows when workloads are exceeding the compute, memory, or storage thresholds in the plan, and second, to be able to scale up and/or out automatically in response.
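The reactive half of that plan can be as simple as a loop that compares an APM metric against the thresholds. In this sketch, `get_cpu_utilization` is a stub for whatever metrics API your APM provides; the thresholds and limits are illustrative.

```python
# Illustrative thresholds; the low-water mark sits well below the
# high-water mark so instances do not flap between scale events.
CPU_HIGH, CPU_LOW = 0.75, 0.30
MIN_INSTANCES, MAX_INSTANCES = 2, 50

def get_cpu_utilization() -> float:
    """Stub: in practice this reads the current value from your APM."""
    return 0.5

def autoscale(current_instances: int) -> int:
    """Run on a timer; returns the new desired instance count."""
    cpu = get_cpu_utilization()
    if cpu > CPU_HIGH and current_instances < MAX_INSTANCES:
        return current_instances + 1   # workload exceeds the plan: scale out
    if cpu < CPU_LOW and current_instances > MIN_INSTANCES:
        return current_instances - 1   # sustained low load: scale back in
    return current_instances
```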
Auto-scaling in this way is an incredibly powerful feature to have in your solution, but it is difficult to get right because it is not something you can implement once and then forget about. You need to revisit your design regularly with updated information from your APM to ensure that you have split out workloads according to how your system is actually being used by its users and the other systems it integrates with.
Optimizing Network and Storage Performance
The latency between cloud resources and the separate data centres across which your solution is deployed can have a massive impact on the performance of your solution. There is a big difference in performance between site-to-site VPNs that run over the public Internet and dedicated private connections. Although there is a cost implication, most cloud providers have some great offerings in this regard. Additionally, the latency between end-user applications on the edge and the API Gateways and/or other network resources that these applications consume can further degrade the overall performance of your solution.
Determine where your users are and deploy the APIs and services that they require access to as close to them as possible. If the services and cloud resources that make up your solution must be distributed over geographical locations for whatever reason, ensure that the connections between these locations are optimal. Try to design your solution so that data that does not change regularly, but that is used regularly, is brought closer to the end user through replication or caching. This will limit your solution’s reliance on the network to perform optimally. Eliminate the need for “chatty” applications, where polling is required to implement certain functionality. The fewer network hops a specific feature needs to go through, the better.
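A minimal read-through cache with a time-to-live is often all that is needed to keep slow-changing data near its consumers. In this sketch, the fetch function passed to `get` stands in for the remote call you are trying to avoid.

```python
import time

class TTLCache:
    """Serve slow-changing data locally for `ttl_seconds`, so repeated
    reads avoid a network round trip to the origin."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store: dict = {}   # key -> (fetched_at, value)

    def get(self, key: str, fetch_from_origin):
        entry = self._store.get(key)
        if entry and time.monotonic() - entry[0] < self.ttl:
            return entry[1]                        # fresh: serve locally
        value = fetch_from_origin(key)             # stale or missing: one hop
        self._store[key] = (time.monotonic(), value)
        return value

# Usage: a restaurant menu changes rarely but is read constantly.
menu_cache = TTLCache(ttl_seconds=300)
menu = menu_cache.get("restaurant-7/menu", lambda key: {"items": ["..."]})
```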

An area that is often overlooked in optimizing network performance is the reliance on DNS servers in a solution architecture and the benefits that can be derived from them. DNS Load Balancing is a low-effort, high-impact spanner in your toolbox. You can easily route traffic to different data centres based on the priority, weighting, performance, and geographic location of the requests coming through. CDNs can also be leveraged to cache static content as close to the user as possible.
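The routing itself is configured in your DNS provider rather than in code, but the weighting logic is simple enough to illustrate. The sketch below is conceptual; the data-centre names and documentation-range IP addresses are made up.

```python
import random

# Conceptual weighted records: roughly 70% of lookups resolve to the
# primary data centre, 20% to the secondary, 10% to the failover site.
RECORDS = {
    "eu-west-dc":  ("203.0.113.10", 70),
    "eu-east-dc":  ("203.0.113.20", 20),
    "failover-dc": ("203.0.113.30", 10),
}

def resolve() -> str:
    """Pick an address in proportion to its configured weight."""
    names = list(RECORDS)
    weights = [RECORDS[name][1] for name in names]
    chosen = random.choices(names, weights=weights, k=1)[0]
    return RECORDS[chosen][0]

print(resolve())  # mostly the primary, shifting traffic without redeploys
```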
In terms of optimizing storage performance, the key is to understand the trade-off between the latency of accessing data and the volatility of the data itself. You might have very low latency when retrieving data from a cache, but the data may be stale, or may not be there at all.

Polyglot persistence is the term used for storing a solution’s data across many different mechanisms, for example caches, SQL databases, NoSQL databases, and message stores like Kafka. It is usually found in systems that have implemented some form of CQRS pattern in their solution design. The whole view of a “business object”, for lack of a better term, like a Food Order or a Bank Payment, is the collection of related data across these different mechanisms. Using different mechanisms to store different data related to a “business object” is a useful strategy to adopt if you want to improve the performance of your solution.

As mentioned above, data that is frequently accessed but rarely changed can be cached, avoiding retrieval from a database that would typically involve accessing physical storage like a hard drive. Disk I/O is super expensive, so separating the logic in your applications that reads data from the logic that writes data is a great technique for optimizing storage performance.
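A sketch of that read/write separation, CQRS-style: writes go to the system of record, while reads are served from an in-memory read model first. The in-memory SQL stand-in here is purely illustrative.

```python
class InMemorySqlStore:
    """Stand-in for the durable system of record (hypothetical interface)."""
    def __init__(self):
        self._rows = {}

    def write(self, key, value):
        self._rows[key] = value      # in reality: a disk-backed write

    def read(self, key):
        return self._rows.get(key)   # in reality: a disk-backed read

class OrderRepository:
    """Separate the read path from the write path to limit disk I/O."""
    def __init__(self, sql_store):
        self.sql = sql_store
        self._cache = {}             # fast read model kept in memory

    def save(self, order_id, order):
        self.sql.write(order_id, order)   # durable write to the system of record
        self._cache[order_id] = order     # refresh the read model

    def get(self, order_id):
        if order_id in self._cache:
            return self._cache[order_id]  # hot path: no disk I/O at all
        order = self.sql.read(order_id)   # cache miss: fall back to storage
        if order is not None:
            self._cache[order_id] = order
        return order

repo = OrderRepository(InMemorySqlStore())
repo.save("order-1001", {"items": ["margherita"], "total": 95.00})
print(repo.get("order-1001"))             # served from the read model
```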

Identifying Performance Bottlenecks
Whenever I am required to look at a performance issue with a particular solution that I am not familiar with, the first two things I look at are a) how the solution uses the network it is deployed across and b) how the solution stores and retrieves data. Network and storage inefficiencies are usually the biggest culprits when it comes to performance bottlenecks. To effectively isolate network inefficiencies, it is vital to implement “health check” endpoints on your services. Doing so enables you to check that services are available and what their response times are. It is important to clearly define up front what the non-functional requirements for your solution are, for example:
- Transaction speed (in milliseconds)
- Number of transactions (per second)
- Simultaneous connections (before returning errors)
- Maximum downtime (in seconds)
You can monitor the results you get from these “health check” endpoints against these non-functional requirements initially, and then over time as you add functionality to your solution. You can then set up alerting mechanisms to let you know when you are approaching the limits defined in your non-functional requirements and act if necessary.
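A minimal “health check” endpoint might look like the following. Flask is used here purely for brevity (any web framework works), and the downstream checks are stubs for real probes against your database and cache.

```python
import time
from flask import Flask, jsonify

app = Flask(__name__)

def check_database() -> bool:
    return True   # stub: run a cheap query against the real database

def check_cache() -> bool:
    return True   # stub: ping the real cache

@app.route("/health")
def health():
    """Report availability and response time so a monitor can compare
    them against the non-functional requirements defined up front."""
    started = time.monotonic()
    checks = {"database": check_database(), "cache": check_cache()}
    status = "up" if all(checks.values()) else "degraded"
    return jsonify({
        "status": status,
        "checks": checks,
        "response_time_ms": round((time.monotonic() - started) * 1000, 2),
    })

if __name__ == "__main__":
    app.run(port=8080)
```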
Additionally, it is important to implement and make use of Identifiers within a transaction so that you can have additional information to help identify bottlenecks. Typical examples of Identifiers are:
- Correlation IDs: A unique identifier generated for each transaction, and required as mandatory additional information for each transaction at the edge or highest entry point into the stack
- Service IDs: A unique identifier provisioned for each type of consumer/application/user that uses the solution or its services, and required as mandatory additional information for each transaction at the edge or highest entry point into the stack
- Application IDs: A unique identifier generated for each instance of a consumer/application/user that uses the solution or its services, and required as mandatory additional information for each transaction at the edge or highest entry point into the stack
The words generated and provisioned are chosen deliberately – generated means that the identifier is automatically created by your solution at the edge, whereas provisioned means that the identifier is defined or produced by your solution as far down (not at the edge) in your stack as possible.
If the services in your solution implement identifiers like the ones above and then log with date/time stamps every time a transaction boundary is crossed during execution, then the detailed execution path of any transaction can be mapped end-to-end. This is particularly useful when your solution requires complex integrations with multiple third-party systems that are outside of your control. Without this kind of visibility, it becomes impossible to improve the performance of your solution over time and manage it effectively.
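A sketch of how the Correlation ID pattern and boundary logging fit together; the header name and service names are illustrative assumptions.

```python
import logging
import uuid
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("tracing")

def ensure_correlation_id(headers: dict) -> str:
    """Generate the Correlation ID at the edge if the caller did not
    supply one; every downstream service propagates it unchanged."""
    return headers.setdefault("X-Correlation-ID", str(uuid.uuid4()))

def log_boundary(correlation_id: str, service: str, event: str) -> None:
    """Log a date/time stamp every time a transaction boundary is
    crossed, so the execution path can be reconstructed end-to-end."""
    stamp = datetime.now(timezone.utc).isoformat()
    log.info("%s correlation_id=%s service=%s event=%s",
             stamp, correlation_id, service, event)

headers = {}                                  # inbound request with no ID yet
cid = ensure_correlation_id(headers)
log_boundary(cid, "api-gateway", "request-received")
log_boundary(cid, "payment-service", "request-forwarded")
```

Grepping logs for a single Correlation ID then yields the full, time-ordered path of that transaction across every service it touched.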
In the next post we unpack the third pillar of Availability and Recoverability, and how big the difference is between 99.9% and 99.99% uptime…