At VGS, reliability isn't just a "nice-to-have"; it's a fundamental requirement, especially as we serve and partner with enterprise customers whose SLAs for critical payment data flows demand more than four nines of availability.
For a long time, VGS services have relied on common tools such as rate limiting and autoscaling to keep our infrastructure resilient as customer demand has grown. Meeting that demand has meant tackling a crucial challenge: implementing a robust workload partitioning architecture that both improves developer velocity and provides better resiliency. In this post, we detail our journey, the challenges we faced, and the solutions we implemented to deliver enterprise-grade reliability.
Challenges with Previous Attempts
Our prior scaling efforts had been largely successful, but not without a price: they depended on a 'human-powered' effort to scale our systems.
- Inconsistent Deployments: Without proper tooling, manually deploying multiple instances of Helm charts led to inconsistencies. Each instance could drift out of sync with others, meaning configurations and deployments weren't uniform. This made it difficult to maintain a predictable environment.
- Increased Toil: 'Toil' refers to manual, repetitive, and automatable tasks. Manually managing Helm releases and values files is a classic example of toil. It required significant manual effort to configure and update each instance, which was time-consuming and error-prone.
- Proliferation of Configuration: Each cell carried over 1,000 lines of highly duplicated configuration. Manually managing this volume of configuration across many cells and environments was complex and unsustainable, making it difficult to track changes and ensure consistency.
- Management Overhead: For an engineer performing a deployment, manually managing releases across multiple environments and regions was burdensome. It involved creating manual pull requests for each cell in each environment.
We identified automation as the right way to encourage uniform deployments and ensure consistency and repeatability. To address these challenges, we moved toward programmatically generating our GitOps code and embracing configuration as code to standardize and validate each instance.
Designing a Partitioned Workload Architecture
When workloads are partitioned, a failure in one partition is less likely to cascade and affect other customers. This enhances the overall stability and uptime of the system. Isolating workloads in a multi-tenant system is crucial for maintaining performance, reliability, security, and agility, all of which are vital for providing a high-quality service to multiple customers.
We referenced AWS's Cell Architecture as a design guide for building our solution. Our original brief was to isolate workloads to minimize noisy-neighbor issues. Adjacent goals included increasing developer velocity and improving quality by enabling ephemeral environments for testing and by improving the local testing tooling available to engineers.
We decided to model our solution on a pattern called Pooled Isolation. This lets us create multiple pools across which customers can be placed, with the ability to scale both horizontally (the number of cells) and vertically (the size of each cell) to better accommodate our customers.
We value our engineers' time, so we chose to leverage common DevOps patterns and cloud-native tooling to minimize the amount of custom solutioning needed to build our cells. Our mantra was "cattle, not pets": treating cells as interchangeable resources (cattle) rather than unique entities (pets). In practice, that meant investing in software to perform common tasks related to testing, deployment, and provisioning of our cells, as opposed to the human-centric model that had organically formed during our initial foray into workload partitioning. From a technology perspective, we chose Helm for modeling each cell. As guidance for the teams, we assumed we would provision one hundred cells with this solution. We didn't anticipate needing one hundred cells initially, but we wanted a design requirement that ensured we built something that would scale without O(n) labor from humans.
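To make the two scaling dimensions of the pooled model concrete, here is a minimal Python sketch of how a pool of cells can be modeled. The field names, helper functions, and default capacities are illustrative rather than taken from our production configuration.

from dataclasses import dataclass

@dataclass(frozen=True)
class CellSpec:
    """Illustrative model of a single cell in a pool (names are hypothetical)."""
    name: str
    region: str
    min_replicas: int   # lower autoscaling bound (the "vertical" size of the cell)
    max_replicas: int   # upper autoscaling bound

def scale_horizontally(pool: list[CellSpec], region: str) -> list[CellSpec]:
    """Add one more interchangeable cell to the pool."""
    new_cell = CellSpec(name=f"instance-{len(pool) + 1:02d}", region=region,
                        min_replicas=2, max_replicas=10)
    return pool + [new_cell]

def scale_vertically(cell: CellSpec, max_replicas: int) -> CellSpec:
    """Grow an individual cell by raising its autoscaling ceiling."""
    return CellSpec(cell.name, cell.region, cell.min_replicas, max_replicas)

Code snippet sketching the horizontal and vertical scaling dimensions of a pool of cells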
Tackling Helm Chart Inefficiencies
As part of this migration we decided to focus on convention over configuration (using sensible defaults wherever possible) and started by creating a common, reusable Helm chart that could become a foundational building block for defining each service contained within a cell. These reusable charts can then be composed into the collection of services that defines a cell.
We also spent some time at this point ensuring we had network isolation and redundancy. Currently we run cells across availability zones, each within its own namespace; however, we are still determining internally whether it's worth trading away that resiliency for better performance by running each cell within a single availability zone. Additionally, we're evaluating whether we'd be better served by independent Kubernetes clusters per cell versus a single, shared cluster.
Building on top of a common Helm chart also unlocked a slew of additional benefits, including common rollout patterns such as graduated canary rollouts (a deployment strategy where a new version is rolled out to a small subset of users for testing) and standardized metrics and observability. Canaries are becoming a major tool for reducing the complexity of deployments by relying on software, rather than humans, to manage rollouts.
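Conceptually, a graduated canary encodes a schedule of traffic steps with a health gate between them. The tool-agnostic Python sketch below illustrates the idea; the percentages and error-rate thresholds are made up for illustration and are not the values our charts ship with.

from dataclasses import dataclass

@dataclass
class CanaryStep:
    traffic_percent: int    # share of traffic routed to the new version
    max_error_rate: float   # abort threshold checked before promoting further

# Illustrative schedule: widen exposure gradually, evaluating metrics between steps.
CANARY_STEPS = [
    CanaryStep(5, 0.01),
    CanaryStep(25, 0.01),
    CanaryStep(50, 0.005),
    CanaryStep(100, 0.005),
]

def next_action(step: CanaryStep, observed_error_rate: float) -> str:
    """Promote when metrics stay healthy at the current step, otherwise roll back."""
    return "promote" if observed_error_rate <= step.max_error_rate else "rollback"

Code snippet sketching a graduated canary schedule with illustrative values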
Building Release Tooling and Patterns to Manage Complexity
One of the major challenges was that we were staring at a proliferation of inconsistent deployments if we didn't get this right. We had previously solved the isolation problem manually, by deploying multiple instances of our Helm charts without any tooling. This caused each instance of our application to drift out of sync with the others and increased toil (manual, repetitive, and automatable tasks) for our engineers. We knew the right solution had to encourage uniform deployments and ensure consistency and repeatability for each instance. We began to model this as a configuration-as-code problem: in building our cells, we gave the values files (files which hold the configuration parameters for each instance) a logical, consistent structure for every instance of a cell and treated them as code, which allowed us to leverage many of the patterns we typically use with code.
Concretely, this is implemented by including a values file for each environment, region, and instance of a cell, layered at install time:
helm install a-chart path/to/chart \
-f path/to/chart/values.<environment>.yaml \
-f path/to/chart/values.<environment>-<region>.yaml \
-f path/to/chart/values.<environment>-<region>-<instance>.yaml
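Because the layering follows a strict naming convention, it can be reasoned about (and linted) in code. Below is a minimal sketch of that idea, assuming the layout above and a hypothetical iterable of (environment, region, instance) tuples; the function names are ours for illustration only.

from pathlib import Path

def values_files(chart_path: Path, environment: str, region: str, instance: str) -> list[Path]:
    """Return the layered values files for one cell, most generic first.

    Later files override earlier ones, mirroring the -f ordering in the
    helm install command above (Helm also applies the chart's own
    values.yaml as the base layer by default).
    """
    return [
        chart_path / f"values.{environment}.yaml",
        chart_path / f"values.{environment}-{region}.yaml",
        chart_path / f"values.{environment}-{region}-{instance}.yaml",
    ]

def missing_values_files(chart_path: Path, cells) -> list[Path]:
    """Simple consistency check: flag any expected layer missing on disk."""
    return [
        path
        for environment, region, instance in cells
        for path in values_files(chart_path, environment, region, instance)
        if not path.exists()
    ]

Code snippet sketching a values-file consistency check over the naming convention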
Additionally, we had to revisit how we approach managing GitOps. The largest change here is that our Helm releases proliferate under this model. A Helm release per cell may be unique in terms of baseline capacity (e.g. autoscaling upper and lower limits) but should otherwise share uniform configuration with the other cells.
def generate_helm_releases():
    # Render one HelmRelease per (environment, region, instance) combination.
    for environment, region, instance in get_releases():
        yield f"""
        ...  # boilerplate template omitted
          valuesFiles:
            - {chart_path}/values.yaml
            - {chart_path}/values.{environment}.yaml
            - {chart_path}/values.{environment}-{region}.yaml
            - {chart_path}/values.{environment}-{region}-{instance}.yaml
        """
Code snippet showing generation of HelmRelease files
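The get_releases() helper referenced above is not shown; a hypothetical implementation could simply enumerate the cross product of environments, regions, and instances and write each rendered release into a predictable spot in the GitOps repo. The environment names, constants, and directory layout below are illustrative assumptions, not our exact structure.

import itertools
from pathlib import Path

# Hypothetical inputs; in practice these lists live in configuration, not code.
ENVIRONMENTS = ["dev.vault.sandbox", "prod.vault.live"]
REGIONS = ["us-east-2"]
INSTANCES = ["instance-01", "instance-02"]

def get_releases():
    """Yield one (environment, region, instance) tuple per cell."""
    return itertools.product(ENVIRONMENTS, REGIONS, INSTANCES)

def write_releases(gitops_repo: Path, render) -> None:
    """Write one rendered HelmRelease per cell into a predictable repo layout.

    `render` stands in for the templating performed by generate_helm_releases()
    above; the directory layout below is illustrative.
    """
    for environment, region, instance in get_releases():
        target = gitops_repo / environment / region / f"{instance}.yaml"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(render(environment, region, instance))

Code snippet sketching how releases could be enumerated and committed to the GitOps repo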
It quickly becomes impractical to manage cells by hand. To avoid that, we stopped hand-crafting Helm releases and moved towards generating them programmatically for each environment, committing the output of the generator into our GitOps repo.
In concrete terms, we structure our values files and Helm releases so that they follow a logical layout, and then use templating with loops to create multiple instances consistently according to a well-defined pattern. This is a common DevOps approach, building software to perform deployments rather than hand-crafting them, and it is critical to ensuring that our releases are consistent and repeatable.
5c5
< vgs.io/environment: dev.vault.sandbox
---
> vgs.io/environment: prod.vault.live
23,25c23,25
< - path/to/chart/values.dev.vault.sandbox-us-east-2.yaml
< - path/to/chart/values.dev.vault.sandbox-us-east-2-instance-01.yaml
---
> - path/to/chart/values.prod.vault.live-us-east-2.yaml
> - path/to/chart/values.prod.vault.live-us-east-2-instance-01.yaml
31,40d30
Visualization comparing differences between a development and a live environment release.
In doing this, we reduced the duplicated configuration from over 1,000 lines per cell down to 2-5 lines (🎉) referencing source-controlled values files within a single artifact. Better yet, for an engineer performing a deployment across multiple environments and regions, we've shrunk the problem from raising manual pull requests for each cell in each environment and region down to a single run of the generator plus one pull request per environment to perform the release.
There's no one-size-fits-all solution here; different teams might adopt different conventions depending on their needs and tools. One challenge we've discovered with this approach is that the number of values-file permutations grows quickly with the number of environments, regions, and cell instances; if a human has to manage these files by hand, it quickly becomes unsustainable.
The values-file architecture is not a panacea: it does introduce some friction around directly modifying a Helm release for a single environment without first building an artifact. We continue to experiment here as we look for the right balance between empowering engineers to modify services directly within an environment and providing consistency across environments.
Unlocking Ephemeral Testing within CI Pipelines
Once this work was done, it unlocked the ability to introduce improved CI pipelines that take our exact Helm chart artifact and instantiate it with the same configuration we'll use in our higher environments, where the software runs exposed to customers.
helm install chart-test-instance path/to/chart --wait \
-f path/to/chart/values.{environment}-{region}.yaml
Code snippet showing use of parameterized values file
This pattern gives us confidence: we can instantiate and test a cell in any environment and get an identical copy every time. We plan to leverage this functionality later to stand up parallel instances of cells for pre-migration verification of changes, further minimizing the chance of a rogue change causing issues.
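As a sketch of what such a step could look like in a pipeline, the snippet below installs a throwaway instance of the chart, exercises it, and tears it down. It assumes the chart ships Helm test hooks; the release naming and cleanup details are illustrative rather than our exact CI code.

import subprocess
import uuid

def verify_cell(chart_path: str, environment: str, region: str) -> None:
    """Stand up a throwaway cell instance in CI, exercise it, then tear it down."""
    release = f"cell-ci-{uuid.uuid4().hex[:8]}"  # unique, disposable release name
    values = f"{chart_path}/values.{environment}-{region}.yaml"
    try:
        # Install the exact chart artifact with the same layered configuration
        # used in higher environments, waiting for resources to become ready.
        subprocess.run(["helm", "install", release, chart_path, "--wait", "-f", values], check=True)
        # Run the chart's test hooks against the freshly created instance.
        subprocess.run(["helm", "test", release], check=True)
    finally:
        # Best-effort cleanup so failed runs don't leave orphaned releases behind.
        subprocess.run(["helm", "uninstall", release], check=False)

Code snippet sketching an ephemeral per-cell verification step in CI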
Per-cell testing has proven to have benefits beyond CI. We regularly provision cells to reproduce production-like configurations and run isolated performance tests against them; through this method we've been able to prove the performance of our platform at more than 2x our previously known limits.