The Final Pillar: Designing for Cost


For the OCD readers of this post, calling cost a pillar of good architecture may seem like a poor choice of words, as the picture below illustrates.

Cost is the most important aspect of any good architecture because it cuts across the other four pillars.  If you have enough money, you can build anything, but that is not always a good thing: you can end up not making conscious decisions about where you spend your money, and over-engineering a solution to its detriment.  Most projects (and hence architectures) will almost certainly be constrained by budgets.  However, cost does not only act as a constraint on the architecture; it can also help reinforce any of the other four pillars.

So how does one design for cost?

The most logical approach would be to apportion cost equally across the other four pillars and start from there.  But that may leave you over-invested in a pillar that is not a priority for the target solution.  For example, if you were building a solution that lets people share cool cat photos, you may not want to invest as heavily in the Security pillar as you would in the Availability and Recoverability pillar.  On the other hand, if you were building a financial system, you would want to invest heavily in the Security pillar.

The key to determining this balance is to look at the solution’s non-functional requirements (NFRs).  At Global Kinetic, we group our NFRs into the following categories.

Some of the NFR categories above span multiple architectural pillars: Operational Excellence, for example, spans the Performance and Scalability pillar, the Availability and Recoverability pillar, and the Efficiency of Operations pillar.  Security NFRs, on the other hand, are focused on the Security pillar.

Another factor to consider is the stage your solution or product is at in terms of the market or user base it is targeting.  For example, if you are an early-stage startup, you may want to invest more heavily in Efficiency of Operations, which will allow you to pivot and be more nimble in responding to changing customer and user demands while you get your product market ready.  You can then invest more heavily in Performance and Scalability later, once your product or solution starts getting traction and the feature set starts to mature.

The most important thing is to apportion at least some budget to each of the four pillars at the start, so that you are making a conscious decision about which pillar has the highest priority and which has the lowest.  This budget and its apportionment across the four pillars must either meet the solution’s NFRs or come with a roadmap of how the solution’s NFRs will be met over time.

And there you have it, the Five Pillars of good solution architecture.  A final note: a good architecture is one that can meet the demands of the product and market throughout the product’s life cycle without requiring a major redesign.  This means that a good architecture comes with a roadmap of when each part of the architecture will be built out to its full capability, because you cannot build everything up front on day one.  I hope the blog series has been helpful, and if you need some help working through an architectural design for a new digital product that you are looking to take to market, you know where we are.

The 5 Pillars of Good Solution Architecture: Designing for Efficiency

Designing for Efficiency

By Sergio Barbosa (CIO - Global Kinetic)


The wave of cloud computing that hit the tech industry during the first decade of the century brought about the promise of reduced infrastructure costs with on-demand infrastructure utilization.  In layman’s terms you only paid for the infrastructure that you used for the time that you used it.  No longer did you have to purchase a powerful expensive server upfront that was able to handle your system’s peak workloads, and then have it sit idle for most of the time until it was needed.  With cloud computing the promise was that you could run your maximum workloads on powerful servers for the one or two hours that you needed it, and then scale that down to a small server for the rest of the time, drastically reducing your infrastructure costs.

That was easier said than done.  We quickly discovered that, to achieve this, you need system diagnostics to know when you need the big server and when you need the small one, and for how long.  That means building monitoring into your system from the outset so that it can give you the diagnostics you need to make infrastructure decisions.  But not all systems are that predictable.  There are four basic demand models, and a single system can exhibit a combination of these models if it is a more modern, modular or microservices-based system.

The microservices that power the finance department of a company for example might have very specific predictable demand at month end when payments are made and reconciliation processes are run, whereas the microservices that power the onboarding of new customers may have an unpredictable demand as some external forces could drive demand for new customer sign ups that weren’t previously anticipated.

Some systems may have a requirement for an on-premise component for whatever reason, and hybrid infrastructure architectures are very common.  It is important to ensure that your on-premise infrastructure does not become a bottleneck for your elastic cloud infrastructure in hybrid scenarios.

A good way to approach cost efficiencies for a system is to organize the infrastructure being utilized.  In most cloud environments you can make use of subscriptions, resource groups and tags to assign resources to different cost centres within a large enterprise.  Organizing system resources like this will help you optimize the spend.  Optimizations can be done at an IaaS (Infrastructure as a Service) level with compute and storage provisioning, or at a PaaS (Platform as a Service) level with database, blob and orchestration services like Kubernetes provided on demand by most cloud providers.
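To make the idea concrete, here is a minimal Python sketch that rolls up monthly spend per cost centre from a list of tagged resources.  The resource structure and the cost_centre tag name are hypothetical illustrations, not any particular cloud provider's API; in practice the inventory would come from your provider's billing or resource-management service.

```python
from collections import defaultdict

# Hypothetical resource inventory; in practice this comes from your cloud
# provider's billing or resource-management APIs.
resources = [
    {"name": "sql-prod-01", "type": "PaaS/Database", "monthly_cost": 420.00,
     "tags": {"cost_centre": "finance", "environment": "prod"}},
    {"name": "vm-onboarding-01", "type": "IaaS/Compute", "monthly_cost": 150.00,
     "tags": {"cost_centre": "growth", "environment": "prod"}},
    {"name": "blob-archive", "type": "PaaS/Storage", "monthly_cost": 35.00,
     "tags": {"environment": "prod"}},  # cost centre tag missing
]

def spend_per_cost_centre(resources):
    """Group monthly spend by the cost_centre tag; untagged spend is flagged."""
    totals = defaultdict(float)
    for resource in resources:
        centre = resource["tags"].get("cost_centre", "UNTAGGED")
        totals[centre] += resource["monthly_cost"]
    return dict(totals)

for centre, total in spend_per_cost_centre(resources).items():
    print(f"{centre}: {total:.2f}")
```

Surfacing the UNTAGGED bucket is half the value: it tells you which resources have not yet been assigned to a cost centre.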

As mentioned before, the key to understanding where you can optimize a system and make it more efficient from a cost and/or utilization perspective (we all want to save the planet, right?) is monitoring.  The formula is simple: Monitoring + Analytics = Insights.  Core system monitoring involves four specific things:

Now that you have your monitoring in place, you can start working on automation.  Automation can add incredible efficiencies to operations.  There are three main areas of automation that you can focus your energies on:

Designing for efficiency up front can add immense cost savings to your solution in the long run.  Retrofitting metrics, diagnostics, health checks, automated tests and IaC onto an existing code base is a near-impossible task, and the costs will undoubtedly outweigh the benefits.  Build these in up front and reap the rewards.  Continue monitoring your system over time as system usage evolves and changes.  That way you will always be able to improve the efficiency of operations in the systems that you build.

If you missed any earlier parts from our series on the 5 Pillars of Good Solution Architecture, click here to read more.

The 5 Pillars of Good Solution Architecture: Availability and Recoverability


By Sergio Barbosa (CIO - Global Kinetic)


To date, we have looked at the security and the performance and scalability aspects of good solution architecture.  If you did not catch those two posts, be sure to read up on them before checking out this next architectural pillar, which focuses on Availability and Recoverability.

 Availability vs. Recoverability

Firstly, what is the difference between these two?  They seem similar?

Availability refers to a measure of how available a system is for its end-users and/or external integrated systems.  Recoverability, however, refers to how quickly a system can recover from a catastrophic failure.  Availability is usually measured as a percentage of time that a system is available, whereas recoverability is usually measured by the number of hours of data loss or downtime.  The two are interrelated, because the requirements for the one imply the requirements for the other, but they are not the same thing.  From an architectural perspective, it is necessary to implement different measures and designs to address the availability and recoverability requirements of a system separately.  These requirements are usually dictated by a Service Level Agreement (SLA) promised to end-users and/or external integrated systems, and that SLA is the departure point in designing for availability and recoverability.

Let us take availability first by way of example.  The table below shows the allowable downtime by availability.  You can see the huge difference between an SLA requiring a system to be available 99% of the time vs. one that is required to be available 99.999% of the time.  On the one hand, the system can be unavailable for days, whereas on the other hand if the system is down for more than 5 minutes, you’re in big trouble.

SLA downtime vs availability
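To put some numbers to the table, here is a quick Python sketch that converts an availability percentage into allowable downtime, assuming a 365-day year and a 30-day month.

```python
def allowable_downtime(availability_pct: float) -> dict:
    """Convert an availability SLA (e.g. 99.99) into allowable downtime."""
    downtime_fraction = 1 - availability_pct / 100
    minutes_per_year = 365 * 24 * 60
    minutes_per_month = 30 * 24 * 60
    return {
        "hours_per_year": downtime_fraction * minutes_per_year / 60,
        "minutes_per_month": downtime_fraction * minutes_per_month,
    }

for sla in (99.0, 99.9, 99.99, 99.999):
    d = allowable_downtime(sla)
    print(f"{sla}%: {d['hours_per_year']:.1f} h/year, "
          f"{d['minutes_per_month']:.1f} min/month")
```

Running it shows the jump in orders of magnitude: roughly 87.6 hours of allowable downtime per year at 99%, shrinking to about 5 minutes per year at 99.999%.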

A system requiring 99.99% availability or higher will likely need to be self-diagnosing and self-healing to some degree.  Services should be able to restart themselves automatically and scale out and/or up automatically if load increases to prevent the system from becoming unavailable.  This requires specific design to allow the system to work with load balancers for example or to have the ability to restart automatically in the event of a failure.

There are some patterns here that can be utilized as well, such as the circuit breaker pattern.  With this pattern, a service will automatically return a failed status to consumers after a certain number of failures (the threshold) has been reached, much like an electrical circuit breaker.  The service will do so over a pre-defined or calculated time period, after which it will let a certain number of test calls through.  On success of all test calls, the service returns to normal operation.  This pattern can be particularly useful if the solution you are designing is heavily dependent on external third-party integrations, where the availability of those dependent systems or applications is outside of your control.  The management of dependent third-party applications and integrations is a whole topic on its own, and probably warrants a separate blog post… hmm.
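To show the mechanics, here is a minimal Python sketch of a circuit breaker; the threshold, reset period and number of trial calls are arbitrary assumptions, and in a real solution you would more likely lean on a mature library or a service-mesh/gateway feature than roll your own.

```python
import time

class CircuitBreaker:
    """Fail fast after `threshold` consecutive failures, then allow a few
    trial calls once `reset_after` seconds have passed."""

    def __init__(self, threshold=5, reset_after=30.0, trial_calls=3):
        self.threshold = threshold
        self.reset_after = reset_after
        self.trial_calls = trial_calls
        self.failures = 0
        self.state = "closed"          # closed -> open -> half_open -> closed
        self.opened_at = 0.0
        self.trials_remaining = 0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.reset_after:
                self.state = "half_open"           # let some test calls through
                self.trials_remaining = self.trial_calls
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_failure(self):
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.threshold:
            self.state = "open"
            self.opened_at = time.monotonic()

    def _record_success(self):
        if self.state == "half_open":
            self.trials_remaining -= 1
            if self.trials_remaining <= 0:
                self.state = "closed"              # back to normal operation
                self.failures = 0
        else:
            self.failures = 0
```

Wrapping a call to a flaky third-party integration then becomes something like breaker.call(fetch_exchange_rates), where fetch_exchange_rates is whatever hypothetical function performs the external request.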

A 99.99% availability SLA will most likely also require the system to be deployed across multiple availability zones, and these will most likely need to be deployed across different geographies to mitigate the risk of a data centre becoming unavailable.  To support this, the solution must be able to work behind a DNS Traffic Manager and data that the system relies upon should be replicated across both Availability Zones, so that normal system operation can resume in the event of a failover from one data centre to another.

In addition to normal operation, one needs to consider the deployment of new versions of the software into production and the impact these can have on the availability of the system.  For example, if you need to incur some downtime for the deployment of a new version, you are eating into the downtime your SLA allows.  You should be banking this allowance for when you really need it, i.e., in the event of a disaster.  To support zero-downtime deployments, you need to be able to route only some requests/traffic to the newly deployed version of the system or service and then, on successful validation of the new version, route the rest of the traffic to it and deprecate the old version.  You need to design version control into your system, otherwise you will not be able to accomplish this.
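The routing side of a zero-downtime deployment can be sketched as weighted (canary) routing; this toy Python example sends a small, configurable share of requests to the new version and promotes it once validation passes.  The URLs are placeholders, and in practice this logic usually lives in the gateway, traffic manager or service mesh rather than in application code.

```python
import random

class WeightedRouter:
    """Route a fraction of traffic to the newly deployed (canary) version."""

    def __init__(self, stable_url: str, canary_url: str, canary_weight: float = 0.05):
        self.stable_url = stable_url
        self.canary_url = canary_url
        self.canary_weight = canary_weight   # 0.0 .. 1.0

    def choose_backend(self) -> str:
        return self.canary_url if random.random() < self.canary_weight else self.stable_url

    def promote(self) -> None:
        """On successful validation, send all traffic to the new version."""
        self.canary_weight = 1.0

router = WeightedRouter("https://api.example.com/v1", "https://api.example.com/v2")
choices = [router.choose_backend() for _ in range(1000)]
print("canary share before promotion:", choices.count(router.canary_url) / 1000)
router.promote()
print("canary share after promotion:", router.canary_weight)
```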

 Building a highly available architecture

The diagram below is an example of a high-availability system architecture.  The solution is deployed to multiple regions behind a Traffic Manager.  In each availability zone the services are deployed behind a gateway that can direct requests to different services according to certain rules.  Load balancers have been implemented to manage additional load on the system.  What does this picture tell you about the system itself?

High Availability Systems architecture

Well firstly, it looks as though the system is data-heavy, and that there is potentially no real separation in the design between data reads and writes.  Additionally, it looks as though the system serves up a lot of content that is independent of processing.  Also, the processing that the system does is probably confined to a small number of features and is potentially single purpose, because there does not appear to be much separation of concerns between application instances.  What are the single points of failure?  Well, they are clearly the Gateway, Load Balancers, and the Traffic Manager.  So, to support high availability, the application should be able to scale out with more VM instances in the event of heavy load and have a robust Disaster Recovery (DR) plan in place for the 3 single points of failure listed.

Disaster Recovery strategies

Much like availability, recoverability is driven by the SLAs that are promised by the system to end-users and/or external integrated systems.  There are two recovery objectives to take into consideration here: the Recovery Point Objective (RPO), the maximum acceptable amount of data loss measured in time, and the Recovery Time Objective (RTO), the maximum acceptable amount of downtime before the system is restored.

The diagram below shows the difference between these two.  The RPO and RTO need to be defined for each service and workload in the system.

RPO vs RTO

Once the RPO and RTO for the system is determined, a DR plan needs to be put into place with detailed recovery steps.  I am not going to go through the details here about best practices for DR strategies and planning, but it goes without saying that the only useful DR plan is one that has been tested and gets tested end to end at least twice a year.  From an architecture perspective though, there are a couple of key areas to design for in terms of recoverability.

Firstly, and most importantly, data.  To lose as little data as possible during a disaster, you need to have regular backup processes in place, reduce reliance on cached data as an interim persistence mechanism, and/or be able to rebuild your data from event sources.  Each of these tactics has its advantages and disadvantages, and an ideal scenario is to take a bit from each into your architecture design.  It is not always practical to do regular backups, so replicated data stores at different locations can offer an alternative way of protecting your data.
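The “rebuild your data from event sources” tactic looks roughly like the toy Python sketch below: current state is never the only source of truth, it can be recomputed by replaying a durable event log.  The account-balance example is hypothetical, and real event stores (Kafka, a dedicated event-sourcing framework) add ordering, snapshots and idempotency on top of this.

```python
from dataclasses import dataclass

@dataclass
class Event:
    kind: str        # "deposit" or "withdrawal"
    amount: float

def rebuild_balance(events) -> float:
    """Recover the current state purely by replaying the event log."""
    balance = 0.0
    for event in events:
        if event.kind == "deposit":
            balance += event.amount
        elif event.kind == "withdrawal":
            balance -= event.amount
    return balance

# If the read store is lost in a disaster, state can be reconstructed from
# the durable event log alone.
event_log = [Event("deposit", 100.0), Event("withdrawal", 40.0), Event("deposit", 15.0)]
print(rebuild_balance(event_log))   # 75.0
```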

Secondly, downtime, which is typically dictated by the system’s single points of failure and the DR plan’s ability to restore any one of these to full operation.  Obviously, the more single points of failure you have, the higher the likelihood of a disaster.  A good solution architecture will keep single points of failure to a minimum.  Again, there is no silver bullet, but it is key that the solution’s recoverability is driven by the SLA requirements and the RPO and RTO of the system.  There is no point in over-engineering a system when there is no requirement to do so, but where there is a requirement it is imperative that it is met.  This plays into the cost pillar, which we cover in the last post in the series.

In the next and penultimate post of this series, we look at designing for efficiency of operations.  A system architected without considering its operational maintenance and enhancement can cripple a business, and this is one of the most overlooked aspects of good solution architecture.

The 5 Pillars of Good Solution Architecture: Performance and Scalability


By Sergio Barbosa (CIO - Global Kinetic)


In this second part of our 5-part series on the pillars of good solution architecture, we look at Performance and Scalability.  Designing for scale and performance requires a clear plan of when to scale up and when to scale out.  This plan needs to be accompanied by a process for optimizing network and storage performance and identifying bottlenecks.

Scaling up vs. Scaling Out

Scaling up involves increasing the CPU power of your servers or adding more RAM or bigger disk drives to them, whereas scaling out involves adding more servers in parallel to distribute the compute, memory, or storage requirements across servers.  There is a limit to how much you can scale up, whereas scaling out is hypothetically limitless with modern cloud provider offerings.

From an architecture perspective, special consideration needs to be given if you want your solution to be able to scale out.  For one, you need to have some Load Balancing technology in place to determine which instance to send traffic to or delegate compute power to.  Additionally, if you need to scale out your data storage, you need to implement “sharding” or design your solution so that, logically, your data does not need to be split across instances and can be grouped together.

Figure 1: Scaling up vs. Scaling out
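As a simple illustration of the sharding mentioned above, the sketch below maps each record to one of N database shards by hashing a logical grouping key (a hypothetical customer_id), so that all of a customer’s data stays together on one shard.  Note that plain modulo hashing reshuffles keys when the shard count changes; consistent hashing is the usual refinement.

```python
import hashlib

SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(customer_id: str) -> str:
    """Deterministically map a customer to a shard using a stable hash."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("customer-1001"))
print(shard_for("customer-1001"))   # the same customer always lands on the same shard
```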

Once you have a handle on how you can go about scaling up and/or scaling out, you can start putting a plan together of when to do so for the solution that you are designing.  Most infrastructure providers offer interfaces to programmatically scale up your virtual machines or compute instances, or to scale out and add more in parallel.  It is important though to clearly understand the workload of the various components of your system over time so that you can put this plan together and make use of these programmatic interfaces.

For example, certain parts of your system may have an increased load at certain times of day, days in a month or a specific time of year.  The ordering function in a food delivery system will be much busier over the lunch hour, and the payment authorization function of a card management system will be a lot busier at month-end or over the festive season.  Combining this kind of understanding with the non-functional requirements of your solution will enable you to separate out specific workloads and scale them independently according to a plan.

There is an added complexity in that you may not know in advance when your system will require additional power to deal with certain workloads.  A celebrity, for example, may decide to mention your SaaS product in a social media post and suddenly your system must deal with thousands of new sign-ups per second.  In these kinds of scenarios, it is critical to firstly have Application Performance Monitoring (APM) in place so that your system knows when workloads are exceeding the compute, memory, or storage thresholds in the plan, and then secondly to be able to automatically scale up and/or out accordingly.
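The decision logic for that kind of reactive scaling can be sketched very simply: compare the metrics coming out of your APM against the thresholds in your plan and emit a desired instance count.  The metric and threshold values below are illustrative assumptions; the actual scale out/in action would be a call to your cloud provider’s autoscaling interface.

```python
from dataclasses import dataclass

@dataclass
class Thresholds:
    cpu_high: float = 75.0     # average CPU % that triggers a scale out
    cpu_low: float = 25.0      # average CPU % that allows a scale in
    max_instances: int = 10
    min_instances: int = 2

def desired_instances(current: int, avg_cpu_pct: float, t: Thresholds) -> int:
    """Return the instance count the scaling plan calls for, given current load."""
    if avg_cpu_pct >= t.cpu_high and current < t.max_instances:
        return current + 1
    if avg_cpu_pct <= t.cpu_low and current > t.min_instances:
        return current - 1
    return current

print(desired_instances(current=3, avg_cpu_pct=82.0, t=Thresholds()))   # 4: scale out
print(desired_instances(current=3, avg_cpu_pct=12.0, t=Thresholds()))   # 2: scale in
```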

Auto-scaling in this way is an incredibly powerful feature to have in your solution, but it is difficult to implement because it is not something you can set up once and forget about.  You need to revisit your design regularly with updated information from your APM to ensure that you have split out workloads according to how your system is being utilized by its users and by the other systems it integrates with.

Optimizing Network and Storage Performance

The latency between cloud resources and the separate data centres across which your solution is deployed can have a massive impact on the performance of your solution.  There is a big difference in performance between site-to-site VPNs over the Internet and dedicated connections.  Although there is a cost implication, most cloud providers have some great offerings in this regard.  Additionally, the latency between end-user applications on the edge and the API gateways and/or other network resources that these applications consume can further impact the overall performance of your solution negatively.

Determine where your users are and deploy the APIs and the services they need access to as close to them as possible.  If the services and cloud resources that make up your solution must be distributed across geographical locations for whatever reason, ensure that the connections between these locations are optimal.  Try to design your solution so that data that does not change regularly, but is used regularly, is brought closer to the end user through replication or caching.  This will limit the reliance on the network for your solution to perform optimally.  Eliminate the need for “chatty” applications, where polling is required to implement certain functionality.  The fewer network hops a specific feature needs to go through, the better.

Figure 2: Bringing resources closer to users

An area that is often overlooked in optimizing network performance, is the reliance on DNS servers in a solution architecture and the benefits that can be derived from them.  DNS Load Balancing is a low effort, high impact spanner in your toolbox.  You can easily route traffic to different data centres based on the priority, weighting, performance, and geographic locations of the requests coming through.  CDNs can also be leveraged to cache static content as close to the user as possible.

In terms of optimizing storage performance, the key is to understand the trade-off between the latency of accessing data and the volatility of the data itself.  You might have very low latency when retrieving data from a cache, but the data may be stale, or may not be there at all.  Polyglot persistence is the term used for storing a solution’s data across many different mechanisms, for example caches, SQL databases, NoSQL databases and message stores like Kafka.  Polyglot persistence is usually found in systems that have implemented some form of CQRS pattern in their solution design.  The whole view of a “business object”, for lack of a better term, like a food order or a bank payment, is the collection of related data across these different mechanisms.

Using different mechanisms to store different data related to a “business object” is a useful strategy if you want to improve the performance of your solution.  As mentioned above, data that is frequently accessed but rarely changed can be cached, avoiding retrieval from a database that would typically involve accessing physical storage like a hard drive.  Disk I/O is expensive, so separating the logic in your applications that reads data from the logic that writes data is a great technique for optimizing storage performance.

Figure 3: Simple Polyglot persistence
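The read side of this is often implemented as a cache-aside lookup.  Here is a minimal Python sketch in which an in-memory dict stands in for a real cache such as Redis, and load_order_from_db is a hypothetical placeholder for the slow database read.

```python
import time

cache = {}                 # stands in for Redis/Memcached
CACHE_TTL_SECONDS = 300    # tolerate up to 5 minutes of staleness for read-mostly data

def load_order_from_db(order_id: str) -> dict:
    """Hypothetical slow read that hits physical storage."""
    return {"order_id": order_id, "status": "DELIVERED", "total": 129.50}

def get_order(order_id: str) -> dict:
    """Cache-aside read: try the cache first, fall back to the database."""
    entry = cache.get(order_id)
    if entry and time.monotonic() - entry["cached_at"] < CACHE_TTL_SECONDS:
        return entry["value"]                        # fast path, no disk I/O
    value = load_order_from_db(order_id)             # slow path
    cache[order_id] = {"value": value, "cached_at": time.monotonic()}
    return value

print(get_order("order-42"))   # misses the cache and reads the "database"
print(get_order("order-42"))   # served from the cache
```

The TTL is the explicit trade-off mentioned above: the longer it is, the less disk I/O you incur and the staler the data you are prepared to serve.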

Identify Performance Bottlenecks

Whenever I am required to look at a performance issue with a particular solution that I am not familiar with, the first two things that I look at are a) how the solution uses the network it is deployed across and b) how the solution stores and retrieves data.  Network and storage inefficiencies are usually the biggest culprits when it comes to performance bottlenecks.  To effectively isolate network inefficiencies, it is vital to implement “health check” endpoints on your services.  Doing so enables you to check that services are available and what their response times are.  It is important to clearly define up front what your non-functional requirements for your solution are, for example:

You can monitor the results you get from these “health check” endpoints against these non-functional requirements initially, and then over time as you add functionality to your solution.  You can then set up alerting mechanisms to let you know when you are approaching the limits defined in your non-functional requirements and act if necessary.
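A health-check endpoint does not need to be elaborate.  The sketch below uses only Python’s standard library; the port, path and response shape are arbitrary assumptions, and a real service would also verify its critical dependencies (database, message broker) before reporting itself healthy.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.monotonic()

class HealthCheckHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_response(404)
            self.end_headers()
            return
        body = json.dumps({
            "status": "healthy",
            "uptime_seconds": round(time.monotonic() - START_TIME, 1),
        }).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthCheckHandler).serve_forever()
```

Your monitoring tooling can then poll /health on a schedule, record the response times, and alert when they drift towards the limits defined in your non-functional requirements.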

Additionally, it is important to implement and make use of Identifiers within a transaction so that you can have additional information to help identify bottlenecks.  Typical examples of Identifiers are:

The words generated and provisioned are chosen deliberately: generated means that the identifier is automatically generated by your solution at the edge, whereas provisioned means that the identifier is defined or produced by your solution as far down in your stack (away from the edge) as possible.

If the services in your solution implement identifiers like the ones above and then log with date/time stamps every time a transaction boundary is crossed during execution, then the detailed execution path of any transaction can be mapped end-to-end.  This is particularly useful when your solution requires complex integrations with multiple third-party systems that are outside of your control.  Without this kind of visibility, it becomes impossible to improve the performance of your solution over time and manage it effectively.
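A hedged sketch of what propagating and logging such an identifier might look like in Python: the X-Correlation-ID header name and the log format are illustrative conventions only.  The point is that every log line written at a transaction boundary carries the same identifier and a timestamp.

```python
import logging
import uuid

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s correlation_id=%(correlation_id)s %(message)s",
)
log = logging.getLogger("orders")

def handle_request(headers: dict) -> None:
    # Generated at the edge if the caller did not supply an identifier.
    correlation_id = headers.get("X-Correlation-ID") or str(uuid.uuid4())
    extra = {"correlation_id": correlation_id}

    log.info("request received", extra=extra)           # boundary: edge
    log.info("calling payment provider", extra=extra)   # boundary: third party
    log.info("response returned", extra=extra)          # boundary: edge

handle_request({})                                   # no inbound ID, so one is generated
handle_request({"X-Correlation-ID": "7f3c2a"})       # caller-supplied ID is propagated
```

With every service logging in this shape, the end-to-end path of a single transaction can be stitched together by filtering the aggregated logs on one correlation_id.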

In the next post we unpack the third pillar of Availability and Recoverability, and how big the difference is between 99.9% and 99.99% uptime…

The 5 Pillars of Good Solution Architecture: Security


By Sergio Barbosa (CIO - Global Kinetic)


A lot of fanfare has been made about the Twelve-Factor App methodology and how it is becoming the best way to approach building a SaaS-based application that makes use of microservices.  I am one of those fans.  When designing a new solution, or upgrading an existing one, having a simple set of guiding principles can be invaluable.  And of course, non-functional requirements.  But if I look at the Twelve-Factor App methodology, it speaks a lot to the “how”, but not to the “what”.  I may very well build a solution that adheres to all Twelve Factors but fail in meeting the non-functional requirements of the desired solution.  By definition, I would have delivered a bad Solution Architecture.

Every good Solution Architecture should have a plan for the following 5 things, within which non-functional requirements can be grouped and addressed:

  1. Security
  2. Performance and Scalability
  3. Availability and Recoverability
  4. Efficiency of Operations
  5. Cost

Let's take a look at each of these areas one by one:

 

Security

Designing for Security requires a “Defence in Depth” approach.  This means that every solution should be continually validating trust as it is executing code and accessing system resources.  Commonly referred to as a Zero Trust model, the solution should not make any assumptions about the privileges that the user or system account executing code and accessing system resources has.  Trust should be validated at each layer in the solution stack, from the physical layer, through the perimeter and network, all the way down to the compute, application, and data layers.

Be explicit about the requirements at each layer, i.e. what the Authentication rules are (which user accounts exist, and how they authenticate themselves) and what the Authorization rules are (what those user accounts have access to).  There are many tools that can be leveraged to implement and manage these rules so that you do not have to write code to do this from scratch.  Most of these tools implement widely accepted standards and best practices, so make use of those.  Identity Management systems like KeyCloak implement OpenID Connect standards, provide single sign-on capabilities, and can be extended to support multi-factor authentication very easily.
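As a rough illustration of what validating trust at the application layer can look like in code, here is a hedged Python sketch that verifies an OpenID Connect access token (as issued by an identity provider such as KeyCloak) using the PyJWT library.  The issuer URL, realm, audience and JWKS path are placeholder assumptions, and error handling is omitted.

```python
import jwt                       # PyJWT
from jwt import PyJWKClient

# Placeholder issuer and audience; substitute your identity provider's values.
ISSUER = "https://id.example.com/realms/demo"
AUDIENCE = "orders-api"
JWKS_URL = f"{ISSUER}/protocol/openid-connect/certs"

def validate_access_token(token: str) -> dict:
    """Verify signature, issuer, audience and expiry before trusting any claim."""
    signing_key = PyJWKClient(JWKS_URL).get_signing_key_from_jwt(token)
    claims = jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],
        audience=AUDIENCE,
        issuer=ISSUER,
    )
    # Authorization decisions (what this account may do) are then made from
    # the verified claims, for example a hypothetical roles claim.
    return claims
```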

 
Figure 1: Defense in Depth

Security is ultimately about data, and it needs to be clear at each layer which aspect of your data you are securing.  There are three options here: Confidentiality, Integrity and Availability, commonly referred to as the CIA principles.  At the data layer, for example, you would have a requirement to encrypt the data at rest to preserve the Confidentiality of the data.  At the perimeter layer, you would have a requirement to prevent DDoS attacks to preserve the Availability of the data.  And at the physical layer, you would have a requirement to implement biometrics as an additional authentication factor, again preserving the Confidentiality of the data.

At any point in time, the data generated and managed by your solution is either at rest or in transit on some piece of hardware infrastructure.  That means you need to protect the infrastructure your solution is deployed to, apply the best network security you can, and implement the most robust encryption algorithms and techniques.

In terms of infrastructure, make sure you have adequate Identity Access Management and role-based security controlling access to the underlying infrastructure, and that you have adequate failover (more on this later) in place.  For network security, implement DDoS protection, firewalls, gateways and load balancers, constantly monitor traffic, and limit resource exposure/access via IP address, port, and protocol restrictions.  Be especially careful when deploying microservices to orchestration systems like Kubernetes, and ensure you are not making assumptions about the execution privileges inside a cluster.

Encrypt data in transit and at rest, and be explicit about the encryption algorithms you are using and how you are using them.  Encryption is a massive topic, so doing it justice in a small paragraph is impossible, but pay particular attention to symmetric vs. asymmetric techniques, one-way vs. two-way encryption, and the difference between encryption and hashing of data.  Classify data as Public, Private or Restricted, and make sure that Private and Restricted data is always encrypted at rest and in transit, and that Restricted data can only be accessed by the owner of the data (as in the case of regulatory requirements like POPIA, PCI DSS and GDPR).
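To make the encryption-versus-hashing distinction concrete, here is a small Python sketch: symmetric (two-way) encryption of data at rest using the cryptography library’s Fernet recipe, next to one-way hashing with the standard library.  Key management (key vaults, HSMs, rotation) is deliberately out of scope, and in production the key would never live in code.

```python
import hashlib
from cryptography.fernet import Fernet   # pip install cryptography

# Symmetric, two-way encryption: the same key encrypts and decrypts.
key = Fernet.generate_key()              # in production: fetched from a key vault/HSM
fernet = Fernet(key)

ciphertext = fernet.encrypt(b"ID number: 8001015009087")   # restricted data at rest
plaintext = fernet.decrypt(ciphertext)                      # recoverable only with the key

# One-way hashing: useful for integrity checks and (salted, slow) password storage,
# but the original value can never be recovered from the digest.
digest = hashlib.sha256(b"ID number: 8001015009087").hexdigest()

print(plaintext.decode())
print(digest)
```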

 

Figure 2: CIA of Data

As a last point on Security, none of the above can be effective if you do not have a security mindset when developing the solution.  Some refer to this as a ‘culture of security’ within dev teams and organizations.  This means that at every stage in the development of your solution, you are validating the solution against your security requirements: initially with core security training when onboarding developers into your teams, and then, while developing each feature, by evaluating the impact of the solution’s security requirements on that feature.  Design for these security requirements using threat modelling and attack surface analysis, implement the code according to them, verify that the implementation meets the success criteria, release the feature after a final security review with an incident response plan in place, and then implement feedback loops from the monitoring data in your production environment.  This is the Security Development Lifecycle, and it is the most important plan you need in place to meet the non-functional requirements grouped under the Security pillar.


Figure 3: Security Development Life Cycle

In summary, it is useful with each of these pillars to have a baseline or standard that you work from and then evolve and improve. In the next post we take a look at the second pillar, Performance and Scalability, so stay tuned...