
Workflow: Troubleshooting Failed Targets

Step-by-step guide for diagnosing why a scrape target is down, using resources and tools to triage the issue systematically.


When to Use

Use this workflow when:

  • A scrape target is showing as down in Prometheus
  • You see up{job="..."} == 0 for a job
  • The AI needs to diagnose a monitoring gap
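As a minimal sketch of the second condition, the helper below picks the down series out of an instant-query result for `up{job="..."}`. It assumes the standard Prometheus HTTP API response shape for `/api/v1/query`; the function name `down_targets`, the sample payload, and the instance addresses are hypothetical and not part of this tool set.

```python
def down_targets(response: dict) -> list[dict]:
    """Return the label sets of all series whose `up` value is 0."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    down = []
    for series in response["data"]["result"]:
        # An instant-vector sample is a pair: [unix_timestamp, "value as string"].
        _timestamp, value = series["value"]
        if float(value) == 0.0:
            down.append(series["metric"])
    return down


# Hypothetical payload in the shape returned for the query up{job='api-server'}.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "api-server", "instance": "10.0.0.1:8080"},
             "value": [1700000000.0, "1"]},
            {"metric": {"__name__": "up", "job": "api-server", "instance": "10.0.0.2:8080"},
             "value": [1700000000.0, "0"]},
        ],
    },
}

for labels in down_targets(SAMPLE_RESPONSE):
    print("down:", labels["job"], labels["instance"])
```

A non-empty result means at least one target in the job is failing its scrape, which is the cue to start the steps below.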

Step-by-Step

| Step | Action | Tool / Resource | Notes |
|------|--------|-----------------|-------|
| 1 | Check failed targets | Resource: prom://topology/failed_targets | Aggregated view of all failed targets |
| 2 | Check up status | prom_query_instant(query="up{job='api-server'}") | Shows target health |
| 3 | Check scrape duration | prom_query_instant(query="scrape_duration_seconds{job='api-server'}") | Detects slow endpoints |
| 4 | Validate endpoint directly | prom_test_endpoint(endpoint_url="http://api-server.default:8080/metrics") | Bypasses Prometheus with a direct HTTP check |
| 5 | Check cardinality | Resource: prom://tsdb/cardinality | High cardinality can cause performance issues |

Guided Prompt: Use prom-troubleshoot-guided for the full step-by-step flow.
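To illustrate what a direct endpoint check (step 4) involves, here is a rough sketch that fetches a metrics URL and counts the lines that look like Prometheus text-format samples. It is not how prom_test_endpoint is implemented; `count_samples` and `check_endpoint` are hypothetical names, and the regular expression is a simplified heuristic rather than a full parser for the exposition format.

```python
import re
import urllib.request

# Heuristic: metric name, optional {labels}, a value, optional integer timestamp.
_SAMPLE_LINE = re.compile(
    r'^[a-zA-Z_:][a-zA-Z0-9_:]*(\{.*\})?\s+(\S+)(\s+-?\d+)?$'
)


def count_samples(body: str) -> int:
    """Count sample lines in a text-format body; raise ValueError on a bad line."""
    count = 0
    for number, line in enumerate(body.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and HELP/TYPE comments carry no samples
        match = _SAMPLE_LINE.match(line)
        if match is None:
            raise ValueError(f"line {number} is not a valid sample: {line!r}")
        float(match.group(2))  # accepts NaN, +Inf, -Inf; raises ValueError otherwise
        count += 1
    return count


def check_endpoint(url: str, timeout: float = 10.0) -> int:
    """Fetch the URL directly (no Prometheus involved) and count its samples."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    return count_samples(body)
```

Calling `check_endpoint("http://api-server.default:8080/metrics")` from a host that can reach the service raises `urllib.error.URLError` when the connection fails and `urllib.error.HTTPError` on a 401, which maps onto the scenarios listed in the next section. A return value of 0 or a `ValueError` suggests the endpoint is not serving the Prometheus text format.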


Common Scenarios

| Scenario | Cause | Fix |
|----------|-------|-----|
| Connection Refused | Pod/VM not running or wrong port | Verify Deployment is healthy, check port in ServiceMonitor |
| Context Deadline Exceeded | Scrape timeout exceeded | Increase scrape_timeout or optimize the metrics endpoint |
| 401 Unauthorized | Endpoint requires authentication | Configure bearer token or basic auth in ServiceMonitor |
| High Cardinality | Too many label dimensions | Use prom_plan_relabel to drop labels |
| No metrics found | Endpoint doesn't expose Prometheus format | Use prom_test_endpoint to validate format |
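The first three scenarios can usually be recognized from the scrape error text. The sketch below maps the `lastError` field of each down target, as returned by the Prometheus `/api/v1/targets` endpoint, onto those scenario names. The matched substrings are typical of the errors Prometheus reports, but the exact wording is an assumption and may vary between versions; `classify_scrape_error`, `triage`, and the sample payload are hypothetical.

```python
def classify_scrape_error(last_error: str) -> str:
    """Map a scrape error message onto one of the scenario names above."""
    text = last_error.lower()
    if "connection refused" in text:
        return "Connection Refused"
    if "context deadline exceeded" in text:
        return "Context Deadline Exceeded"
    if "401" in text:
        return "401 Unauthorized"
    return "Unknown"


def triage(targets_response: dict) -> dict[str, str]:
    """Return {scrape URL: scenario} for every active target that is down."""
    result = {}
    for target in targets_response["data"]["activeTargets"]:
        if target.get("health") == "down":
            result[target["scrapeUrl"]] = classify_scrape_error(target.get("lastError", ""))
    return result


# Hypothetical payload in the shape of a /api/v1/targets response.
SAMPLE_TARGETS = {
    "status": "success",
    "data": {"activeTargets": [
        {"scrapeUrl": "http://api-server.default:8080/metrics", "health": "down",
         "lastError": 'Get "http://api-server.default:8080/metrics": '
                      "dial tcp 10.0.0.2:8080: connect: connection refused"},
        {"scrapeUrl": "http://slow.default:9100/metrics", "health": "down",
         "lastError": 'Get "http://slow.default:9100/metrics": context deadline exceeded'},
        {"scrapeUrl": "http://secure.default:8443/metrics", "health": "down",
         "lastError": "server returned HTTP status 401 Unauthorized"},
        {"scrapeUrl": "http://ok.default:8080/metrics", "health": "up", "lastError": ""},
    ]},
}

for url, scenario in triage(SAMPLE_TARGETS).items():
    print(f"{scenario}: {url}")
```

Anything classified as "Unknown" is a candidate for the direct endpoint and cardinality checks, since those scenarios do not show up as a distinctive error string.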

Resources for Troubleshooting

| Resource | Purpose |
|----------|---------|
| prom://topology/failed_targets | Quick triage of all down targets |
| prom://topology/services | Service catalog with health status |
| prom://system/backends | Backend connectivity check |
| prom://config/runtime | Verify scrape intervals and retention |

Next Steps