DCGM exporter integration collecting the namespace of the exporter in addition to the reported namespace #19570

Open
nwilliams-bdai opened this issue Feb 6, 2025 · 1 comment


@nwilliams-bdai

We have systems with NVIDIA GPUs running Kubernetes and the nvidia-dcgm-exporter pod. We're collecting the exporter's metrics into our Datadog instance via Agent v7.62.0 with the documented annotation:

  annotations:
    ad.datadoghq.com/nvidia-dcgm-exporter.checks: |-
      {
        "dcgm": {
          "instances": [
            {
              "openmetrics_endpoint": "http://%%host%%:9400/metrics"
            }
          ]
        }
      }

This collects metrics, but the kube_namespace tag includes the namespace of the nvidia-dcgm-exporter pod in addition to the namespace reported by the exporter.

So when I query DCGM metrics and group by kube_namespace, I get the metrics grouped as:
kube_system, project-foo
kube_system, project-bar
gpu-operator, project-foo
gpu-operator, project-bar

(The exporter runs in the kube_system namespace in one type of cluster and in the gpu-operator namespace in another.)

I found PR #18654, which seems to be trying to address this via the IGNORED_TAGS setting, but it doesn't seem to be working as intended.

@nwilliams-bdai
Author

I also noticed that PR #18654 has a test for adding the inner tag from namespace, but doesn't have a test for removing the outer tag.
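
To make the missing case concrete, here is a minimal sketch of the behavior I'd expect such a test to pin down. It is plain Python and does not use the real check or the actual IGNORED_TAGS implementation from #18654: `IGNORED_TAG_PATTERNS`, `merge_tags`, and the sample tag values are all hypothetical, and I'm assuming the exporter reports the workload namespace in a `namespace` label.

```python
import re

# Hypothetical stand-in for the IGNORED_TAGS setting; the real value in
# PR #18654 may be different.
IGNORED_TAG_PATTERNS = [re.compile(r"^kube_namespace:")]


def merge_tags(agent_tags, exporter_labels):
    """Drop agent-supplied ("outer") tags that match an ignored pattern,
    then add kube_namespace from the exporter's own ("inner") label."""
    kept = [
        tag for tag in agent_tags
        if not any(pattern.search(tag) for pattern in IGNORED_TAG_PATTERNS)
    ]
    namespace = exporter_labels.get("namespace")  # assumed label name
    if namespace:
        kept.append(f"kube_namespace:{namespace}")
    return sorted(kept)


def test_outer_namespace_tag_is_removed():
    # Tags the Agent would attach for the exporter pod itself (made-up values).
    agent_tags = ["kube_namespace:gpu-operator", "pod_name:nvidia-dcgm-exporter-abc"]
    # Labels the exporter reports for the workload using the GPU (made-up values).
    exporter_labels = {"namespace": "project-foo", "pod": "trainer-0"}

    tags = merge_tags(agent_tags, exporter_labels)

    assert "kube_namespace:project-foo" in tags       # inner tag is added
    assert "kube_namespace:gpu-operator" not in tags  # outer tag is removed
    assert "pod_name:nvidia-dcgm-exporter-abc" in tags  # unrelated tags survive
```

The second assertion is the one that appears to be uncovered today: the existing test checks that the inner tag shows up, but nothing checks that the exporter pod's own namespace is gone.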
