Engineering

Making ArgoCD 100x Faster 🚀

In September 2022, Cabify marked the end of an era by announcing a major shift: transitioning from deploying our services with an in-house CI/CD tool to embracing GitOps with ArgoCD*.

In essence, this move meant:

How it works

Setup Overview:

  • A manifest monorepo that provides a single source of truth (SSOT), improves team collaboration through global visibility, and eases migration from the previous system.
  • Each Kubernetes cluster has its own ArgoCD instance.
  • Push-based model.
  • Use of App of Apps pattern.
  • Each application has its own code repository and its own manifest definition.
  • Application definitions inside the monorepo are grouped by the owner, service, and environment they belong to (see the sketch below).
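
To make this concrete, here is a minimal sketch of what a parent “App of Apps” Application could look like in a setup like ours (the team name, repository URL, and paths are hypothetical):

```yaml
# Hypothetical parent Application ("App of Apps" pattern): it points at a
# monorepo directory containing child Application definitions, which are
# grouped by owner, service, and environment.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-apps            # hypothetical owner/team name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/manifests.git
    targetRevision: main
    path: teams/payments         # child apps live under owner/service/env
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated: {}
```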

And as it always is in the beginning, ArgoCD ran flawlessly: it was fast, efficient, required minimal resources, and developers were highly satisfied. But after two years of exponential growth in applications and clusters, teams began raising issues and questions about ArgoCD’s performance in our Kubernetes Testing Environments (KTEs)**.

Here’s the story of how we improved ArgoCD’s performance by 100x and shared those improvements with the community.

ℹ️
If you want to learn more about the journey we took to implement ArgoCD and KTEs at Cabify, check out the post “Scaling ArgoCD to 50+ Testing Environments.”

But wait! Who announced it?

Let me introduce you to who we are 😎!

We are the Developer Experience (DevX) Team at Cabify, a part of our Engineering organization dedicated to making life easier for developers. We’re here to streamline workflows, create tools, and optimize environments so our developers can focus on building amazing products for our users. In essence,

“We build products for those who build the product”

and yes, deployment is a fundamental part of our responsibilities!

The problem

At the beginning of 2024, it became increasingly common to encounter issues related to the performance of ArgoCD, particularly concerning application changes not being reflected in the target cluster.

Whyyy

The impact of this problem was twofold:

  • The development loop got longer, so teams took more time to validate their changes.
  • Most of DevX’s time was spent resolving these incidents rather than making progress on other initiatives.

To better understand the problem, let me briefly describe the ArgoCD architecture, focusing on the relevant components:

ArgoCD Architecture

  • Argo UI & CLI: The ArgoCD UI provides a visual dashboard for managing applications, while the CLI allows programmatic interaction for automation.
  • API Server: Acts as the main interface for communication between ArgoCD components and external systems.
  • Repo Server: Fetches application sources from Git repositories, builds the Kubernetes manifests from the source files, and serves them to the Application Controller.
  • CMP Plugin: When a CMP (Config Management Plugin) is configured, the Repo Server delegates manifest generation to the plugin. We use one because we need to interpolate variables (a minimal plugin definition is sketched after this list).
  • Application Controller: Manages the lifecycle of applications, ensuring their state in the cluster matches the desired state defined in Git.
  • ApplicationSet Controller: Enables the management of multiple applications through a single specification, allowing dynamic generation based on parameters.
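
For context, a CMP is declared through a ConfigManagementPlugin manifest mounted into the cmp-server sidecar. Here is a minimal sketch; the plugin name, discovery rule, and generate command are hypothetical, not our actual plugin:

```yaml
# Sketch of a Config Management Plugin definition (plugin.yaml).
# "discover" tells the cmp-server which repositories the plugin handles;
# "generate" must print the final Kubernetes manifests to stdout.
apiVersion: argoproj.io/v1alpha1
kind: ConfigManagementPlugin
metadata:
  name: kte-interpolator              # hypothetical plugin name
spec:
  version: v1.0
  discover:
    fileName: "./kte-config.yaml"     # hypothetical marker file
  generate:
    command: [sh, -c]
    args: ["interpolate-vars manifests/*.yaml"]  # hypothetical command
```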

OK, so the new deployment is not reflected in the cluster? What’s happening?

Indeed, the logs from the Repo Server confirmed that certain operations, such as GenerateManifest, GetRevisionMetadata, and GetGitDirectories, experienced spikes at times, reaching durations of 15 to 50 minutes:

Repo Server logs

The issue described impacts both the Repo Server and the CMP plugin, which is why they were highlighted above 😏

Digging a little bit using pprof:

pprof result

We noticed that most of the time, the Repo Server is just waiting ⌛!

Upon reviewing the CMP plugin code, we gained a clearer understanding of how it works and identified the behavior driving the issue:

  1. The entire manifest repository (minus excluded globs) is sent with every gRPC request from the Repo Server to the CMP server.
  2. For each call to the CMP, the Repo Server creates a tar.gz file of the entire repository and streams it over.
  3. The CMP server then has to decompress it, recreate all the files on disk, run the generate command to produce the manifests, and finally delete the temporary files.

Most of the time is spent waiting for the creation and deletion of directories and files (blockUntilWaitable).

We were able to confirm the enormous spikes in disk write operations, and the situation worsens when multiple Repo Servers are deployed on the same node:

Disk usage metrics

But wait! We talked about variable interpolation. What do you mean by this?

When we talk about variable interpolation, we refer to the need to define an application in a single manifest that can be deployed across multiple ephemeral KTE clusters. These variables may include environment variables, ingress hosts, and more. For example:

Ingress manifest example

It is common for such an application to require environment-specific variables. That’s why we use the CMP plugin, which allows us to perform this variable interpolation during the generate command.
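
As an illustration (the placeholder name and domain here are hypothetical), a single Ingress definition in the monorepo might carry a variable that the plugin resolves per KTE:

```yaml
# Hypothetical Ingress template stored once in the monorepo.
# ${KTE_NAME} is a made-up placeholder; the CMP plugin's generate command
# replaces it with the name of the target testing environment.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-service
spec:
  rules:
    - host: my-service.${KTE_NAME}.example.com   # interpolated per cluster
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-service
                port:
                  number: 80
```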

Given our need for variable interpolation, we initially considered offloading this responsibility from ArgoCD by using admission mutating webhooks. However, this approach has significant drawbacks, particularly when working with server-side apply:

  • ArgoCD features such as IgnoreDifferences and ManagedFields do not function as intended in this context, causing an endless reconciliation loop.
  • Manually manipulating ManagedFields is discouraged 💀
  • Additionally, ManagedFields has limitations: it cannot designate a single element within a list as managed, so entire lists get marked instead, and significant blocks of the manifest (such as hosts in Ingresses) lose synchronization.
  • ArgoCD’s IgnoreDifferences also fails to work for elements within a key, such as in ConfigMaps or CRDs.

ArgoCD didn’t offer much beyond exclusion features, and we were not the only ones experiencing performance issues with similar configurations.

At this point, we started asking ourselves: how far-fetched would it be to propose this change to the community and actually implement it? Let’s do it 💪!!

The solution

By examining the code of the Repo Server and cmp-plugin in detail, we identified two potential quick wins that could significantly improve the performance of ArgoCD:

Check plugin configuration without compressing the repository

We noticed that, to check whether Discovery was enabled, the Repo Server would internally compress the entire repository per application and send it to perform the matchRepositoryCMP. In monorepo scenarios with hundreds of applications, or even worse, where a single node hosts multiple Repo Servers, this placed unnecessary stress on the disk.

Our proposal was to extend the API provided by the cmp-plugin with a new operation, CheckPluginConfiguration, which offers a quick and efficient way to determine whether the plugin has Discovery configured.

Take a look at the contribution for more info.

Transfer only application manifests to the cmp-server for plugin-based applications

ArgoCD already defines the manifest-generate-paths annotation, which uses the paths specified in the annotation to compare the last cached revision with the latest commit and decide whether to trigger reconciliation.
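
For reference, the annotation lives on the Application resource. A minimal sketch (the application name, repository, and path are hypothetical):

```yaml
# Hypothetical Application using the manifest-generate-paths annotation.
# A value of "." scopes change detection (and, with our proposal, the files
# sent to the CMP plugin) to the Application's own source path.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service-staging
  annotations:
    argocd.argoproj.io/manifest-generate-paths: .
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/manifests.git
    targetRevision: main
    path: teams/payments/my-service/staging
  destination:
    server: https://kubernetes.default.svc
    namespace: my-service
```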

The proposal aimed to reduce the amount of data transmitted from the Repo Server to the CMP plugin by taking this annotation into account. In monorepo scenarios, this allows a limited set of files to be compressed and sent to the plugin instead of the entire repository, which in our case could consist of thousands of files.

Take a look at the contribution for more info.

The outcome

After applying the changes in our environments, we observed a significant reduction in the duration of GenerateManifest operations, which returned to a matter of seconds, as in our production environments:

Result

Result

..and that’s all folks 🚀🚀!

Do you like what you read? It’s easy, join us and be part of the journey 👋!
