How GCP defaults spiked a Datadog bill by 20x

Niall OHiggins

Imagine opening your Datadog bill and seeing a 20x increase. You dig deeper and discover thousands of “custom metrics” flooding into Datadog—metrics you didn’t configure, expect, or even recognize. That’s exactly what happened to a GCP user running workloads on GKE Autopilot. 

I’ve said this before, and I’ll say it again: cost and complexity are the two biggest pain points for Datadog users. This case is a perfect example of how quickly costs can spiral when defaults aren’t working in your favour.

Managing observability costs is hard, especially when defaults quietly interact in ways that aren’t immediately visible. At StepChange, we had the opportunity to examine support emails from a customer facing these challenges. For two weeks, the customer and Datadog’s support team wrestled with the problem, cycling through theories and partial fixes. Was it a misconfiguration? A quirk in Datadog’s auto-discovery logic? Or something more nuanced in GKE Autopilot’s setup?

The customer’s first guess was pretty close: perhaps Cilium, the default networking layer in GKE Autopilot, was being misconfigured by Datadog’s agent. So Datadog’s support team suggested a quick fix to disable Cilium’s auto-configuration in the agent, which kind of worked: the custom metrics dropped by a third, but there were still 20,000+ metrics being ingested by Datadog.

In the end, the root cause wasn’t immediately obvious because it was hidden in the interplay of two defaults – Prometheus scraping on GKE Autopilot and Datadog’s default Prometheus integration.

Everything was working exactly as intended—defaults doing what defaults do—but together, they created an avalanche of unwanted custom metrics that drove costs through the roof.

If you’re a Datadog user reading this, you already know the lesson here: defaults always carry hidden risks. Defaults are often buried beneath layers of well-intentioned, but ultimately opaque, configurations. In systems where configurations overlap, assumptions can be costly. Misconfigurations aren’t always a result of human error—they’re sometimes just defaults you didn’t know were active.

We wanted to share this case as a reminder to data teams everywhere: you can’t leave observability on autopilot. 

How the problem started

The customer was pretty quick to guess that something was going wrong with GKE Autopilot and Cilium, a networking layer installed by default on GKE Autopilot clusters. Nothing about their setup explicitly called for Cilium metrics to be ingested, let alone as custom metrics. 

Their first suspicion was that Datadog’s agent, responsible for auto-discovering and reporting metrics, had been misconfigured. Was Datadog’s autodiscovery logic, which scans workloads and enables integrations automatically, inadvertently pulling in Cilium metrics as custom metrics? It raised two questions:

  • Was Datadog’s agent treating Cilium metrics incorrectly due to default settings?
  • Were there Prometheus scrape annotations or endpoints that should have been ignored?

Both questions pointed to plausible theories but neither accounted for the full scope of the problem. At this point, they turned to Datadog support for help. 

Why Datadog’s team missed the root cause 

Datadog’s support team zeroed in on Cilium’s auto-configuration as the most obvious suspect. This made sense: Cilium is enabled by default on GKE Autopilot, and Datadog’s agent auto-discovers integrations based on predefined rules. If the agent was auto-configuring a Cilium check it shouldn’t have been, that could easily explain why thousands of metrics were being ingested.

So Datadog support recommended disabling Cilium auto-configuration by modifying the values.yml file to exclude the integration. 
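The post doesn’t show the exact change, but in the Datadog Helm chart this kind of exclusion is usually expressed through the ignoreAutoConfig list, which tells the agent to skip its bundled auto-configuration for a named integration. A minimal sketch, assuming the agent is deployed via the Helm chart and that cilium is the integration being suppressed:

# values.yml sketch: skip the agent's bundled auto-configuration for the
# Cilium integration so it stops auto-discovering Cilium endpoints.
# Assumes the Datadog Helm chart; "cilium" is the integration name.
datadog:
  ignoreAutoConfig:
    - cilium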

As you’ve already read, this cut the number of custom metrics substantially, but not all the way. The 20,000+ metrics that persisted raised new questions about what else could be causing the surge.

This is where things got trickier. Datadog’s support team then turned their focus to Prometheus scraping. By default, GKE Autopilot annotates pods with prometheus.io/scrape: true for Prometheus monitoring.
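To make that concrete, here’s a hypothetical sketch of the kind of pod metadata involved; the pod name, port, and path are illustrative, not taken from the customer’s cluster:

# Hypothetical pod metadata showing the prometheus.io scrape convention.
# Anything that honors these annotations treats the pod as a scrape target.
apiVersion: v1
kind: Pod
metadata:
  name: example-gke-managed-pod    # illustrative name
  namespace: kube-system
  annotations:
    prometheus.io/scrape: "true"   # advertise this pod as a metrics endpoint
    prometheus.io/port: "9090"     # port serving the metrics
    prometheus.io/path: "/metrics"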

Datadog’s Prometheus integration, enabled out of the box, was dutifully scraping these endpoints, including GKE-managed pods the customer never intended to monitor.
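In Helm chart terms, this behavior is governed by the prometheusScrape feature. A minimal sketch of the setting that produces it, assuming the rest of the chart values are left at their defaults:

# Sketch: with prometheusScrape enabled, the agent autodiscovers every pod
# annotated prometheus.io/scrape: "true" and ingests what it scrapes as
# custom metrics.
datadog:
  prometheusScrape:
    enabled: true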

As a managed service, GKE Autopilot tightly controls Kubernetes settings, and many default behaviors can’t be modified (e.g. you can’t simply remove the problematic annotations or disable GKE-managed pods). 

Meanwhile, Datadog’s default behavior—meant to be flexible—was quietly scraping everything annotated for Prometheus, and in ways that clashed with GKE’s rigid defaults. 

Neither the customer nor Datadog had full visibility into how these systems were interacting. Was the issue with GKE’s managed Prometheus settings? Datadog’s auto-discovery logic? Or the overlapping defaults that made the problem so hard to disentangle?

It took two weeks of testing partial fixes, studying detailed logs, and looking up relevant GitHub issues that hinted at similar behavior in GKE Autopilot environments before the underlying issue became clear. 

When defaults work as designed

The problem wasn’t a bug or misconfiguration. The problem was that everything worked exactly as intended.

GKE Autopilot annotated pods for Prometheus scraping, and Datadog’s Prometheus integration—enabled out of the box—did what it was designed to do: scrape annotated endpoints.

Individually, these defaults made sense. Together, they pushed tens of thousands of unnecessary custom metrics into Datadog and drove a massive spike in costs.

The fix turned out to be pretty simple: filtering out the noise. Instead of disabling Prometheus scraping entirely, which would risk losing valuable metrics, the customer implemented specific exclusions:

  • Excluding GKE-managed namespaces (e.g., kube-system, gke-managed-*)
  • Filtering by container name regex to skip unwanted pods

Here’s what the final configuration looked like:

datadog:
  prometheusScrape:
    enabled: true
    additionalConfigs:
      # Additional autodiscovery rule: exclude unnecessary GKE-managed scraping
      - autodiscovery:
          kubernetes_annotations:
            exclude:
              prometheus.io/scrape: "true"
          kubernetes_container_names:
            - ".*gke-managed.*"
            - ".*kube-system.*"

After implementing the new configuration, custom metrics immediately dropped to just a few hundred, and costs stabilized.
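A complementary guard worth knowing about (not part of the fix described above, just a sketch of another way to draw the same boundary) is keeping the agent out of GKE-managed namespaces entirely with the Helm chart’s containerExclude setting:

# Sketch: exclude containers in GKE-managed namespaces from the agent's
# autodiscovery and container data collection. Namespace patterns here are
# illustrative; adjust them to match your cluster.
datadog:
  containerExclude: "kube_namespace:kube-system kube_namespace:gke-managed-.*"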

Stay ahead of the defaults

When defaults quietly work against you, visibility alone isn’t enough. In managed environments like GKE Autopilot, where you can’t control every setting, understanding how the systems interact is critical. Each component here was functioning as designed, but together, they created an unwanted (and costly) feedback loop.

Defaults aren’t optimized for cost efficiency. Proactively audit your observability configurations, fine-tune scraping settings, and filter out unnecessary noise before it becomes a problem. Don’t leave Datadog on autopilot.

Reviewing your setup

Defaults are designed for general use, not for cost efficiency. If you’re running Datadog in complex environments like GKE Autopilot, always audit and adjust your configurations to fit your specific needs.

This isn’t a one-off issue. Every day, I talk to Datadog customers struggling with cost overruns and hidden complexity. It’s not just about Datadog’s capabilities—it’s about how quietly defaults can work against you, inflating costs before you even notice.

Is your company facing high Datadog costs? We’re developing a solution to help teams significantly reduce them. Get in touch with us.
