Engineering

Scaling ArgoCD to 50+ testing environments

At Cabify, we revamped our deployment system by leveraging Kubernetes Testing Environments (KTEs) and adopting ArgoCD for a centralized, GitOps-driven model. Through multiple iterations, we transitioned from in-cluster ArgoCD instances to a centralized deployment approach, addressing challenges like GitLab overload, reconciliation inefficiencies, and scaling complexities. Key optimizations included webhooks, reduced reconciliation activities, and resource exclusions, resulting in faster, more reliable deployments. These improvements boosted developer satisfaction and reduced deployment times, paving the way for further enhancements like environment consolidation and a more efficient continuous deployment loop.

Kubernetes as our testing environments

At Cabify, we run our services inside Kubernetes. Instead of having a single testing environment shared across all engineering teams, we decided to have isolated, reproducible, and ephemeral Kubernetes clusters that we internally call Kubernetes Testing Environments, or KTEs for short. This approach gives development teams greater flexibility and speed to develop new features or fix bugs without clashing with other teams. While this approach benefits developers, it poses challenges for our infrastructure and platform teams: they have to maintain and manage multiple Kubernetes clusters, along with managing deployments for each team’s services.

Regarding deployments, in the past we had an internal tool that worked in a pseudo-GitOps model to manage every team’s deployments:

  1. Teams would publish Kubernetes manifests to a shared repository.
  2. The tool ran in the CI pipeline, where it compared the changes made and applied them to the targeted KTEs.

While this tool solved most of our problems, it was giving us many headaches in the long run:

  • CI executions were not mutually exclusive, leading to concurrency issues: a pipeline triggered by an older change could finish after one triggered by a newer change and revert it. This often led to conflicting or outdated deployments.

  • The source of truth was supposed to be the git repository but was instead stored within each cluster: This inconsistency meant that a shared deployment could work in some clusters but fail in others, resulting in frequent synchronization issues.

  • Deployments were tied to pipeline executions: This dependency meant that hotfixes could not be deployed if a pipeline failed. Pipeline stability thus became a critical factor for even minor updates or urgent patches.

ArgoCD to the rescue

We were eager to address the issues mentioned above and, after exploring the current landscape of available tools, narrowed our options to two: FluxCD and ArgoCD. After a thorough analysis, we chose ArgoCD for our deployment systems because it aligned better with our requirements.

Key factors that led us to choose ArgoCD over FluxCD included:

  • ArgoCD offers a user-friendly UI, which FluxCD lacks by default.
  • ArgoCD has more extensive documentation and a larger, more active community than FluxCD.

With ArgoCD we could solve our previously mentioned issues almost right out of the box:

  • We delegated deployments to ArgoCD and its GitOps approach, ditching pipeline inconsistencies.
  • The source of truth became the actual git repository instead of the Kubernetes cluster.
  • Deployments were directly linked to commits instead of pipeline executions, giving teams the freedom to deploy hotfixes right away when needed (ArgoCD would simply sync the changes from the git repository).

However, our initial approach with ArgoCD was far from optimal and introduced a number of problems of its own. In the next sections, we’ll go through the iterations we implemented, with their pros and cons, until we reach our current, optimized setup.

First approach: One ArgoCD to rule them all

We installed a single ArgoCD instance that managed all our deployments, with an ApplicationSet declaring a matrix generator that deployed our N services over M Kubernetes clusters. That single instance soon became overwhelmed trying to reconcile more than 25K applications (50 KTEs × 500 services). No matter how much we scaled it horizontally, we kept running into performance issues.
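
For illustration, a minimal ApplicationSet of this shape might look like the sketch below, assuming a hypothetical manifests repository and directory layout: a matrix generator combines a clusters generator (one element per registered KTE) with a git directories generator (one element per service), yielding one Application per cluster/service pair.

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: kte-services
spec:
  generators:
    - matrix:
        generators:
          - clusters: {}            # one element per registered KTE cluster
          - git:
              repoURL: https://gitlab.example.com/platform/manifests.git  # hypothetical
              revision: main
              directories:
                - path: services/*  # one element per service directory
  template:
    metadata:
      name: '{{name}}-{{path.basename}}'
    spec:
      project: default
      source:
        repoURL: https://gitlab.example.com/platform/manifests.git  # hypothetical
        targetRevision: main
        path: '{{path}}'
      destination:
        server: '{{server}}'
        namespace: '{{path.basename}}'
      syncPolicy:
        automated: {}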

flowchart TB
  A1 --> B1
  A1 --> B2
  A1 --> C1
  A1 --> C2
  A1 --> D1
  A1 --> D2

  subgraph A["internal tools k8s cluster"]
    A1("ArgoCD")
  end
  subgraph D["kte-n"]
    D1("service-1")
    D2("service-2")
  end
  subgraph C["kte-2"]
    C1("service-1")
    C2("service-2")
  end
  subgraph B["kte-1"]
    B1("service-1")
    B2("service-2")
  end
  style A fill:#f7f7fc,stroke:#f7f7fc,color:#adadd6,rx:4,ry:4,stroke-width:12px
  style D fill:#f7f7fc,stroke:#f7f7fc,color:#adadd6,rx:8,ry:8
  style C fill:#f7f7fc,stroke:#f7f7fc,color:#adadd6,rx:8,ry:8
  style B fill:#f7f7fc,stroke:#f7f7fc,color:#adadd6,rx:8,ry:8

Second approach: An ArgoCD inside every KTE

We opted to split the ArgoCD load across standalone instances, one per KTE, each managing the deployments for its environment. This way, we shifted from an ‘N environments × M services’ model to ‘N ArgoCD instances managing M services each’. Deployments became much more efficient now that they were made in-cluster, but this time we caused trouble in a critical part of our platform: we were overloading our GitLab instance with the pulls made by every ArgoCD instance.

We had ArgoCD configured to refresh every application from the manifests repository every 3 minutes, leading to approximately 8.3K requests to GitLab per minute. Our poor instance struggled every time the ArgoCD instances refreshed their git state.

Addendum to the second strategy: push-based approach

To relieve the load our ArgoCD instances were putting on GitLab, we disabled all automatic polling and instead developed a webhook-based solution, so ArgoCD would refresh its state only when triggered by GitLab.
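
In ArgoCD terms, a minimal sketch of this setup looks like the following (hostnames and secrets are placeholders): periodic polling is disabled via timeout.reconciliation in argocd-cm, and each GitLab project webhook points at the instance’s /api/webhook endpoint, authenticated with a shared secret.

# argocd-cm: disable periodic repository polling so refreshes happen only on webhook delivery
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  timeout.reconciliation: "0s"
---
# argocd-secret: shared secret that GitLab includes with each webhook request
# (the GitLab project webhook URL would be https://<argocd-host>/api/webhook)
apiVersion: v1
kind: Secret
metadata:
  name: argocd-secret
  namespace: argocd
stringData:
  webhook.gitlab.secret: "<redacted>"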

flowchart TB
  Z["deploy-tool"] -- commit --> A1
  A1 -- webhook --> B0
  A1 -- webhook --> C0
  A1 -- webhook --> D0
  B0 -- git fetch --> A1
  C0 -- git fetch --> A1
  D0 -- git fetch --> A1

  subgraph A[" "]
    A1["manifests-repo"]
  end
  subgraph D["kte-n"]
    D0["ArgoCD"]
  end
  subgraph C["kte-2"]
    C0["ArgoCD"]
  end
  subgraph B["kte-1"]
    B0["ArgoCD"]
  end

  style Z rx:8,ry:8 
  style A fill:#f7f7fc,stroke:#f7f7fc,color:#adadd6,rx:4,ry:4,stroke-width:24px
  style A1 rx:8,ry:8 
  style D fill:#f7f7fc,stroke:#f7f7fc,color:#adadd6,rx:4,ry:4,stroke-width:12px
  style D0 rx:8,ry:8 
  style C fill:#f7f7fc,stroke:#f7f7fc,color:#adadd6,rx:4,ry:4,stroke-width:12px
  style C0 rx:8,ry:8 
  style B fill:#f7f7fc,stroke:#f7f7fc,color:#adadd6,rx:4,ry:4,stroke-width:12px
  style B0 rx:8,ry:8 

This way, we decreased the number of git fetch requests to one per KTE for each deployment (around 50 requests at once). Additionally, we applied the argocd.argoproj.io/manifest-generate-paths annotation, which helped reduce cache invalidation when those git fetch operations were triggered.
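
As a sketch (repository URL and service name are hypothetical), the annotation is set on each Application, or on the ApplicationSet template, and tells ArgoCD to regenerate manifests only when files under the referenced path change:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: service-1
  namespace: argocd
  annotations:
    # Only treat commits touching this Application's source path as relevant changes
    argocd.argoproj.io/manifest-generate-paths: .
spec:
  project: default
  source:
    repoURL: https://gitlab.example.com/platform/manifests.git  # hypothetical
    targetRevision: main
    path: services/service-1
  destination:
    server: https://kubernetes.default.svc
    namespace: service-1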

Even with this improved configuration, we were still experiencing performance issues in our ArgoCD instances:

A refresh loop in the ApplicationSet controller

The ArgoCD ApplicationSet controller had a bug that made it trigger git fetch requests every second. This, multiplied by our ~50 clusters, gave our GitLab instance a hard time. The underlying cause was that a status change in a child Application triggered a reconciliation of the parent ApplicationSet, which in turn triggered a reconciliation of its child Applications, causing the aforementioned refresh loop.

flowchart TB
  B["ApplicationSet"] -- refresh --> A[application-1]
  A -- refresh --> B
  style B rx:8,ry:8
  style A rx:8,ry:8

This took us some time to discover because the ApplicationSet controller did not export the same kind of metrics as the rest of the ArgoCD components, so these git requests were practically invisible to us.

We solved this problem in three steps:

  1. Disabled the ApplicationSet controller. As a preemptive measure, we disabled the ApplicationSet controller in every testing cluster except newly created ones (one way to do this is sketched after this list). The side effect was that new services could no longer be deployed, although deploying the existing ones was still possible, and we could still keep creating new KTEs.

  2. Reduced redundant reconcile operations. A pull request in ArgoCD addressed the issue of redundant reconciles and was under active discussion. However, it still did not prevent sync loops, as the ApplicationSet would still be reconciled each time one of its child Applications was synced. Furthermore, a single misconfigured application could destabilize the entire system.

    To mitigate this issue, we forked the ArgoCD project and deployed a patched applicationset-controller in our KTEs that did not monitor changes to applications. This approach had the downside that manual changes to an app would not be overridden by the controller, and it also meant that the new alpha feature for progressive syncs would not be available.

  3. Implemented a caching mechanism. Our last solution to prevent GitLab from being overloaded by the ApplicationSet controller was to integrate it with the repo-server. This component is used by all other ArgoCD components to interact with git and includes caching and metrics functionality. This approach was suggested in this comment and implemented in this PR.
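
For the first step, one way to switch off the ApplicationSet controller is simply to scale its Deployment to zero replicas. A minimal kustomize overlay along these lines (paths are hypothetical; the Deployment name matches the upstream install manifests) could look like:

# overlay kustomization.yaml (sketch): scale the ApplicationSet controller to zero
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../base   # hypothetical base containing the upstream ArgoCD manifests
patches:
  - patch: |-
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: argocd-applicationset-controller
      spec:
        replicas: 0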

Third iteration: centralized ArgoCD instances managing KTE deployments

Running a deployment system within the same runtime environment it is managing can pose significant risks. If the runtime experiences a failure, the deployment system itself would go down, making recovery difficult or even impossible.

Additionally, deploying new clusters with an in-cluster ArgoCD setup introduced operational challenges. When ArgoCD began deploying numerous services simultaneously, the scheduler would trigger the creation of many new nodes. This often led to pods (including ArgoCD’s own) being restarted, which required manual intervention to re-sync the Argo applications and stabilize the cluster.

To address these issues, we shifted to a centralized model by consolidating ArgoCD into a central cluster that manages deployments across environments. To avoid scalability issues, each testing cluster has its own dedicated ArgoCD instance, isolated within its own namespace in the central cluster.

Here’s a diagram showing the layout of the centralized ArgoCD instances:

flowchart TD
    subgraph B["internal tools k8s cluster"]
        B1["parent ArgoCD"] --> B2 & B3 & B4
        B2["kte-1 ArgoCD"]
        B3["kte-2 ArgoCD"]
        B4["kte-n ArgoCD"]
    end
    style B fill:#f7f7fc,stroke:#f7f7fc,color:#adadd6,rx:16,ry:16,stroke-width:24px, stroke-top-width:48px
    style B1 rx:8,ry:8
    style B2 rx:8,ry:8
    style B3 rx:8,ry:8
    style B4 rx:8,ry:8

This approach gives each environment an isolated ArgoCD instance for improved fault tolerance and scalability, while simplifying cluster creation and reducing manual intervention.
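
To make the layout more concrete, the sketch below (names and endpoints are hypothetical) shows how a per-KTE ArgoCD instance living in its own namespace of the central cluster can be pointed at its KTE’s API server using ArgoCD’s declarative cluster Secret format:

apiVersion: v1
kind: Secret
metadata:
  name: kte-1-cluster
  namespace: argocd-kte-1       # namespace of the kte-1 ArgoCD instance in the central cluster
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: kte-1
  server: https://kte-1.k8s.internal.example   # hypothetical KTE API server endpoint
  config: |
    {
      "bearerToken": "<redacted>",
      "tlsClientConfig": {
        "caData": "<redacted>"
      }
    }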

While this third iteration of our deployment system solved many of the issues encountered in earlier approaches, it also introduced its own set of challenges. Managing deployments from a central cluster brought operational efficiencies, but scaling and maintaining such a setup across multiple testing environments came with its own complexities. Below, we outline the key challenges we faced and the solutions we implemented to address them.

Challenges

Managing git hammering: Introducing a webhooks proxy

One of the major challenges we faced was the strain caused by frequent git fetch requests, particularly during reconciliations across multiple ArgoCD instances. Despite earlier mitigations, such as reducing synchronization intervals and implementing webhooks, the sheer volume of requests to our GitLab instance remained significant. To alleviate this, we introduced a webhooks proxy, acting as an intermediary between GitLab and ArgoCD. This proxy optimized the distribution of webhook events, ensuring synchronization occurred efficiently and reducing unnecessary load on our GitLab instance.

Controlling reconciliation activity

Another critical challenge was the high reconciliation activity of ArgoCD’s application-controller, which directly impacted performance and stability. To address this, we implemented several optimizations to reduce unnecessary reconciliations:

  • Disabling custom health checks: While useful for application-specific monitoring, custom health checks added extra processing overhead. We disabled them to focus on essential reconciliations.
  • Disabling monitoring of orphan resources: Orphan resource monitoring was turned off to avoid excessive reconciliations related to unmanaged Kubernetes resources.
  • Ignoring resource updates: We configured ArgoCD to ignore changes to specific resource types, such as status updates, which do not impact the desired state.
  • Resource exclusions: By explicitly excluding certain resources from reconciliation, we significantly reduced the workload on ArgoCD’s controllers (a configuration sketch follows this list).
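
As a rough sketch of what the last two points look like in argocd-cm, using ArgoCD’s resource.exclusions and ignoreResourceUpdates settings (available in recent ArgoCD versions); the group/kind entries are illustrative, not our exact list:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # Stop watching resource kinds that never affect the desired state
  resource.exclusions: |
    - apiGroups:
        - coordination.k8s.io
      kinds:
        - Lease
      clusters:
        - "*"
  # Ignore status-only updates so they do not trigger reconciliation
  resource.ignoreResourceUpdatesEnabled: "true"
  resource.customizations.ignoreResourceUpdates.all: |
    jsonPointers:
      - /status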

These changes collectively improved the efficiency and scalability of our deployment pipelines, ensuring smoother operations across our testing environments.

Customer satisfaction

Enhanced deployment experience for development teams

The development teams at Cabify have expressed higher satisfaction with the new deployment system. The switch to a centralized, GitOps-driven model with ArgoCD brought several tangible benefits:

  • Increased confidence in deployments: The clear and consistent synchronization of Git-based manifests to testing environments reduced deployment errors and provided more predictable results. Feedback surveys indicate that developer satisfaction with deployment processes increased.

  • Reduced deployment times: The streamlined deployment workflows significantly decreased average deployment times. Previously, deployments could take up to 30 minutes due to pipeline delays and manual interventions. With the new system, this has dropped to seconds, allowing teams to iterate more quickly and deliver fixes or features faster.

Road ahead

Reducing the number of testing environments

As part of a broader company-wide initiative, we aim to optimize our testing environments by reducing their overall number. While the current setup of ephemeral and isolated Kubernetes Testing Environments (KTEs) offers flexibility, maintaining a large number of clusters incurs higher operational costs and complexity. Consolidating environments without sacrificing team independence will be a key focus in the coming months.

Improving the continuous deployment loop

Another area of focus is enhancing the overall continuous deployment loop. While our current system has greatly improved reliability and scalability, there is room for further automation and efficiency. Initiatives include:

  • Progressive rollouts: Implementing progressive deployment strategies such as canary releases to minimize the risk of failures in production environments.
  • Advanced metrics integration: Leveraging better observability tools to monitor deployment health and performance.
  • Seamless rollback mechanisms: Introducing automated rollback triggers based on predefined health metrics to further improve deployment safety.

By addressing these goals, we aim to provide an even more efficient and developer-friendly deployment experience while maintaining high operational standards.

Wrapping up

The journey of evolving our deployment system at Cabify has been both challenging and rewarding. By leveraging KTEs and adopting ArgoCD, we achieved greater flexibility, reliability, and scalability in managing deployments across teams. Each iteration brought valuable lessons, from reducing reconciliation inefficiencies to optimizing GitLab usage, and these improvements have directly translated into faster deployments and enhanced developer satisfaction.

However, as with any evolving system, there is always room for improvement. Consolidating testing environments and refining the continuous deployment loop are just the beginning of what lies ahead. While we’ve made significant strides in optimizing ArgoCD for our use case, there are deeper performance aspects yet to be explored. In upcoming articles, we’ll dive into the intricacies of ArgoCD performance tuning and share insights to help others tackle similar challenges.

For now, the foundation we’ve built ensures that our teams can innovate faster and deliver with confidence—an achievement that reflects the collective effort of engineering at Cabify.
