Engineering
Jul 08, 2025
At Cabify, we revamped our deployment system by leveraging Kubernetes Testing Environments (KTEs) and adopting ArgoCD for a centralized, GitOps-driven model. Through multiple iterations, we transitioned from in-cluster ArgoCD instances to a centralized deployment approach, addressing challenges like GitLab overload, reconciliation inefficiencies, and scaling complexities. Key optimizations included webhooks, reduced reconciliation activities, and resource exclusions, resulting in faster, more reliable deployments. These improvements boosted developer satisfaction and reduced deployment times, paving the way for further enhancements like environment consolidation and a more efficient continuous deployment loop.
At Cabify, we run our services inside Kubernetes. Instead of having a single testing environment shared across all engineering teams, we decided to have isolated, reproducible, and ephemeral Kubernetes clusters that we internally call Kubernetes Testing Environments, or KTEs for short. This approach gives development teams greater flexibility and speed when developing new features or fixing bugs without clashing with other teams. While this benefits developers, it poses challenges for our infrastructure and platform teams: they have to maintain and manage multiple Kubernetes clusters, along with the deployments of each team’s services.
As for deployments, in the past we relied on an internal tool that followed a pseudo-GitOps model to manage every team’s deployments:
While this tool solved most of our problems, it was giving us many headaches in the long run:
CI executions were not mutually exclusive, leading to concurrency issues: If one pipeline ran after another that finished earlier, it could revert the changes made by the previous one. This often led to conflicting or outdated deployments.
The source of truth was supposed to be the git repository but was instead stored within each cluster: This inconsistency meant that a shared deployment could work in some clusters but fail in others, resulting in frequent synchronization issues.
Deployments were tied to pipeline executions: This dependency meant that hotfixes could not be deployed if a pipeline failed. Pipeline stability thus became a critical factor for even minor updates or urgent patches.
We were eager to address the issues mentioned above and, after exploring the current landscape of available tools, narrowed our options to two: FluxCD and ArgoCD. After a thorough analysis, we chose ArgoCD for our deployment systems because it aligned better with our requirements.
Key factors that led us to choose ArgoCD over FluxCD included:
With ArgoCD we could solve our previously mentioned issues almost right out of the box:
However, our initial approach with ArgoCD was far from optimal and introduced a number of problems of its own. In the next sections, we’ll go through each iteration, with its pros and cons, until we reach our current, optimized setup.
We installed a single ArgoCD instance that managed all our deployments, with an ApplicationSet declaring a matrix generator that would deploy our N services over M Kubernetes clusters. That single instance soon became overwhelmed trying to reconcile more than 25K applications (50 KTEs × 500 services). No matter how much we scaled it horizontally, we kept running into performance issues.
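A minimal sketch of such an ApplicationSet, combining a cluster generator (one entry per KTE) with a git directory generator (one entry per service) through a matrix generator; the repository URL, labels, and paths below are placeholders rather than our real values:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: kte-services
  namespace: argocd
spec:
  generators:
    # matrix = cartesian product: every KTE cluster × every service directory
    - matrix:
        generators:
          - clusters:
              selector:
                matchLabels:
                  environment: kte          # hypothetical label selecting KTE clusters
          - git:
              repoURL: https://gitlab.example.com/platform/manifests.git  # placeholder
              revision: main
              directories:
                - path: services/*          # one directory per service
  template:
    metadata:
      name: '{{name}}-{{path.basename}}'    # <kte>-<service>
    spec:
      project: default
      source:
        repoURL: https://gitlab.example.com/platform/manifests.git        # placeholder
        targetRevision: main
        path: '{{path}}'
      destination:
        server: '{{server}}'                # API server of the matching KTE
        namespace: '{{path.basename}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```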
```mermaid
flowchart TB
A1 --> B1
A1 --> B2
A1 --> C1
A1 --> C2
A1 --> D1
A1 --> D2
subgraph A["internal tools k8s cluster"]
A1("ArgoCD")
end
subgraph D["kte-n"]
D1("service-1")
D2("service-2")
end
subgraph C["kte-2"]
C1("service-1")
C2("service-2")
end
subgraph B["kte-1"]
B1("service-1")
B2("service-2")
end
style A fill:#f7f7fc,stroke:#f7f7fc,color:#adadd6,rx:4,ry:4,stroke-width:12px
style D fill:#f7f7fc,stroke:#f7f7fc,color:#adadd6,rx:8,ry:8
style C fill:#f7f7fc,stroke:#f7f7fc,color:#adadd6,rx:8,ry:8
style B fill:#f7f7fc,stroke:#f7f7fc,color:#adadd6,rx:8,ry:8
```
We opted for splitting the load of ArgoCD into standalone instances that would manage the deployments for each KTE. This way, we shifted from an ‘N environments × M services’ model to ‘N ArgoCD instances managing M services’. We gained a lot of efficiency when the deployments were made in-cluster, but this time we caused trouble in a critical part of our platform: we were overloading our GitLab instance with every pull made by each ArgoCD instance.
We had ArgoCD configured to pull from the manifests repository every 3 minutes; with ~500 applications per KTE across ~50 KTEs, that adds up to roughly 8.3K requests to GitLab per minute. Our poor instance struggled every time our ArgoCD instances synchronized their git changes.
To solve the overload that GitLab was suffering from our ArgoCD instances, we disabled every automatic pull, and instead we developed a solution using webhooks, allowing ArgoCD to sync changes only when triggered by GitLab.
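One way to wire this up with ArgoCD’s standard knobs is to switch off the periodic refresh in argocd-cm and give GitLab a shared secret for calls to ArgoCD’s /api/webhook endpoint. A minimal sketch, with names and secret values as placeholders:

```yaml
# argocd-cm: disable the periodic refresh so syncs are webhook-driven only
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  timeout.reconciliation: "0s"
---
# argocd-secret: shared secret that GitLab sends with each webhook call
apiVersion: v1
kind: Secret
metadata:
  name: argocd-secret
  namespace: argocd
type: Opaque
stringData:
  webhook.gitlab.secret: "replace-me"   # placeholder value
```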
```mermaid
flowchart TB
Z["deploy-tool"] -- commit --> A1
A1 -- webhook --> B0
A1 -- webhook --> C0
A1 -- webhook --> D0
B0 -- git fetch --> A1
C0 -- git fetch --> A1
D0 -- git fetch --> A1
subgraph A[" "]
A1["manifests-repo"]
end
subgraph D["kte-n"]
D0["ArgoCD"]
end
subgraph C["kte-2"]
C0["ArgoCD"]
end
subgraph B["kte-1"]
B0["ArgoCD"]
end
style Z rx:8,ry:8
style A fill:#f7f7fc,stroke:#f7f7fc,color:#adadd6,rx:4,ry:4,stroke-width:24px
style A1 rx:8,ry:8
style D fill:#f7f7fc,stroke:#f7f7fc,color:#adadd6,rx:4,ry:4,stroke-width:12px
style D0 rx:8,ry:8
style C fill:#f7f7fc,stroke:#f7f7fc,color:#adadd6,rx:4,ry:4,stroke-width:12px
style C0 rx:8,ry:8
style B fill:#f7f7fc,stroke:#f7f7fc,color:#adadd6,rx:4,ry:4,stroke-width:12px
style B0 rx:8,ry:8
```
This way, we decreased the number of git fetch requests to one per KTE for each deployment (around 50 requests at once). Additionally, we applied the argocd.argoproj.io/manifest-generate-paths annotation, which helped reduce cache invalidation when triggering those git fetch operations.
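For context, this annotation is set on each Application and points at the path(s) whose changes should trigger manifest re-generation; commits that don’t touch those paths can be served from the repo-server’s cache. A hedged sketch, with the application name and repository URL as placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: service-1            # placeholder
  namespace: argocd
  annotations:
    # "." resolves relative to spec.source.path, so only commits touching this
    # service's manifests invalidate its cached manifests
    argocd.argoproj.io/manifest-generate-paths: .
spec:
  project: default
  source:
    repoURL: https://gitlab.example.com/platform/manifests.git   # placeholder
    targetRevision: main
    path: services/service-1
  destination:
    server: https://kubernetes.default.svc
    namespace: service-1
```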
Eventually, though, this improved configuration wasn’t enough either, since we were still experiencing performance issues in our ArgoCD instances:
The ArgoCD ApplicationSet controller had a bug that made it trigger git fetch requests every second. This, multiplied by our ~50 clusters, was making our GitLab instance have a hard time. The underlying cause of this issue was that the change in status of a child Application was triggering the reconciliation of the parent ApplicationSet. This would then trigger the reconciliation of the child Applications, causing the aforementioned refresh loop.
```mermaid
flowchart TB
B["ApplicationSet"] -- refresh --> A[application-1]
A -- refresh --> B
style B rx:8,ry:8
style A rx:8,ry:8
```
This took us some time to discover because the ApplicationSet controller did not export the same kind of metrics as the rest of the ArgoCD components, so these git requests were practically invisible to us.
We solved this problem in three steps:
Disabled the ApplicationSet controller. As a preemptive measure, we disabled the ApplicationSet controller in every testing cluster except for newly created ones (see the sketch after these steps). The side effect was that new services could no longer be deployed, although existing ones kept syncing, and leaving the controller enabled in fresh clusters allowed us to keep creating new KTEs.
Reduced redundant reconcile operations. A pull request in ArgoCD addressed the issue of redundant reconciles and was under active discussion. However, it still did not prevent sync loops, as the ApplicationSet would still be synchronized each time one of its child Applications was synced. Furthermore, a single misconfigured application could destabilize the entire system.
To mitigate this issue, we forked the ArgoCD project and deployed a patched applicationset-controller in our KTEs that did not monitor changes to applications. This approach had the downside that manual changes to an app would not be overridden by the controller, and it also meant that the new alpha feature for progressive syncs would not be available.
Implemented a caching mechanism. Our last solution to prevent GitLab from being overloaded by the ApplicationSet controller was to integrate it with the repo-server. This component is used by all other ArgoCD components to interact with git and includes caching and metrics functionality. This approach was suggested in this comment and implemented in this PR.
Running a deployment system within the same runtime environment it is managing can pose significant risks. If the runtime experiences a failure, the deployment system itself goes down with it, making recovery difficult or even impossible.
Additionally, deploying new clusters with an in-cluster ArgoCD setup introduced operational challenges. When ArgoCD began deploying numerous services simultaneously, the scheduler would trigger the creation of many new nodes. This often led to pods (including ArgoCD’s own) being restarted, which required manual intervention to re-sync the Argo applications and stabilize the cluster.
To address these issues, we shifted to a centralized model by consolidating ArgoCD into a central cluster that manages deployments across environments. To avoid scalability issues, each testing cluster has its own dedicated ArgoCD instance, isolated within its own namespace in the central cluster.
Here’s a diagram showing the layout of the centralized ArgoCD instances:
```mermaid
flowchart TD
subgraph B["internal tools k8s cluster"]
B1["parent ArgoCD"] --> B2 & B3 & B4
B2["kte-1 ArgoCD"]
B3["kte-2 ArgoCD"]
B4["kte-n ArgoCD"]
end
style B fill:#f7f7fc,stroke:#f7f7fc,color:#adadd6,rx:16,ry:16,stroke-width:24px, stroke-top-width:48px
style B1 rx:8,ry:8
style B2 rx:8,ry:8
style B3 rx:8,ry:8
style B4 rx:8,ry:8
```
This approach gives each environment an isolated ArgoCD instance for improved fault tolerance and scalability, while simplifying cluster creation and reducing manual intervention.
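Each per-KTE ArgoCD instance still needs to be pointed at its remote cluster, which ArgoCD handles through declarative cluster secrets. A minimal sketch, with the namespace, server URL, and credentials as placeholders:

```yaml
# Registers the remote kte-1 cluster with the ArgoCD instance living in its
# dedicated namespace on the central cluster (names are hypothetical).
apiVersion: v1
kind: Secret
metadata:
  name: kte-1-cluster
  namespace: argocd-kte-1                        # namespace of kte-1's ArgoCD instance
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: kte-1
  server: https://kte-1.example.internal:6443    # placeholder API server URL
  config: |
    {
      "bearerToken": "<service-account-token>",
      "tlsClientConfig": {
        "caData": "<base64-encoded-ca>"
      }
    }
```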
While this third iteration of our deployment system solved many of the issues encountered in earlier approaches, it also introduced its own set of challenges. Managing deployments from a central cluster brought operational efficiencies, but scaling and maintaining such a setup across multiple testing environments came with its own complexities. Below, we outline the key challenges we faced and the solutions we implemented to address them.
One of the major challenges we faced was the strain caused by frequent git fetch requests, particularly during reconciliations across multiple ArgoCD instances. Despite earlier mitigations, such as reducing synchronization intervals and implementing webhooks, the sheer volume of requests to our GitLab instance remained significant. To alleviate this, we introduced a webhooks proxy, acting as an intermediary between GitLab and ArgoCD. This proxy optimized the distribution of webhook events, ensuring synchronization occurred efficiently and reducing unnecessary load on our GitLab instance.
Another critical challenge was the high reconciliation activity of ArgoCD’s application-controller, which directly impacted performance and stability. To address this, we implemented several optimizations to reduce unnecessary reconciliations:
These changes collectively improved the efficiency and scalability of our deployment pipelines, ensuring smoother operations across our testing environments.
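As one concrete example of this kind of tuning, resource exclusions in argocd-cm keep the application-controller from tracking resource kinds that churn constantly but never need to be synced. The kinds below are illustrative, not our exact list:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # Resource kinds listed here are no longer discovered or reconciled by the
  # application-controller, cutting a large share of its background work.
  resource.exclusions: |
    - apiGroups:
        - ""
      kinds:
        - Event
        - Endpoints
      clusters:
        - "*"
```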
The development teams at Cabify have expressed higher satisfaction with the new deployment system. The switch to a centralized, GitOps-driven model with ArgoCD brought several tangible benefits:
Increased confidence in deployments: The clear and consistent synchronization of Git-based manifests to testing environments reduced deployment errors and provided more predictable results. Feedback surveys indicate that developer satisfaction with deployment processes increased.
Reduced deployment times: The streamlined deployment workflows significantly decreased average deployment times. Previously, deployments could take up to 30 minutes due to pipeline delays and manual interventions. With the new system, this has dropped to seconds, allowing teams to iterate more quickly and deliver fixes or features faster.
As part of a broader company-wide initiative, we aim to optimize our testing environments by reducing their overall number. While the current setup of ephemeral and isolated Kubernetes Testing Environments (KTEs) offers flexibility, maintaining a large number of clusters incurs higher operational costs and complexity. Consolidating environments without sacrificing team independence will be a key focus in the coming months.
Another area of focus is enhancing the overall continuous deployment loop. While our current system has greatly improved reliability and scalability, there is room for further automation and efficiency. Initiatives include:
By addressing these goals, we aim to provide an even more efficient and developer-friendly deployment experience while maintaining high operational standards.
The journey of evolving our deployment system at Cabify has been both challenging and rewarding. By leveraging KTEs and adopting ArgoCD, we achieved greater flexibility, reliability, and scalability in managing deployments across teams. Each iteration brought valuable lessons, from reducing reconciliation inefficiencies to optimizing GitLab usage, and these improvements have directly translated into faster deployments and enhanced developer satisfaction.
However, as with any evolving system, there is always room for improvement. Consolidating testing environments and refining the continuous deployment loop are just the beginning of what lies ahead. While we’ve made significant strides in optimizing ArgoCD for our use case, there are deeper performance aspects yet to be explored. In upcoming articles, we’ll dive into the intricacies of ArgoCD performance tuning and share insights to help others tackle similar challenges.
For now, the foundation we’ve built ensures that our teams can innovate faster and deliver with confidence—an achievement that reflects the collective effort of engineering at Cabify.