Workflow: Troubleshooting Failed Targets

Step-by-step guide for diagnosing why a scrape target is down, using resources and tools to triage the issue systematically.

When to Use

Use this workflow when:

A scrape target is showing as down in Prometheus
You see `up{job="..."} == 0` for a job
The AI needs to diagnose a monitoring gap

Step-by-Step

Step	Action	Tool / Resource	Notes
1	Check failed targets	Resource: `prom://topology/failed_targets`	Aggregated view of all failed targets
2	Check up status	`prom_query_instant(query="up{job='api-server'}")`	Shows target health
3	Check scrape duration	`prom_query_instant(query="scrape_duration_seconds{job='api-server'}")`	Detect slow endpoints
4	Validate endpoint directly	`prom_test_endpoint(endpoint_url="http://api-server.default:8080/metrics")`	Bypasses Prometheus — direct HTTP check
5	Check cardinality	Resource: `prom://tsdb/cardinality`	High cardinality can cause performance issues
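
Steps 2 and 3 both come down to reading an instant-query result. The following is a minimal sketch of that reading step, assuming the query tool returns (or wraps) the standard Prometheus HTTP API instant-query response from `/api/v1/query`. The helper names `instant_query_url` and `down_targets` and the sample payload are hypothetical and exist only for illustration.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for the Prometheus HTTP API instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def down_targets(response_body: str) -> list[str]:
    """Return the `instance` label of every series whose `up` value is 0."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    down = []
    for series in payload["data"]["result"]:
        # Each instant-vector sample is [unix_timestamp, "value-as-string"].
        _timestamp, value = series["value"]
        if float(value) == 0.0:
            down.append(series["metric"].get("instance", "<no instance label>"))
    return down


# Hand-written sample in the instant-vector response shape (not real data).
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"job": "api-server", "instance": "10.0.0.6:8080"},
             "value": [1700000000, "1"]},
            {"metric": {"job": "api-server", "instance": "10.0.0.7:8080"},
             "value": [1700000000, "0"]},
        ],
    },
})

if __name__ == "__main__":
    print(instant_query_url("http://localhost:9090", "up{job='api-server'}"))
    print(down_targets(SAMPLE))
```

The same parsing applies to `scrape_duration_seconds`: compare each returned value against the configured scrape timeout instead of against zero.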

Guided Prompt: Use `prom-troubleshoot-guided` for the full step-by-step flow.
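
For the direct endpoint check in step 4, it can help to know roughly what "valid Prometheus format" means. The sketch below is a rough heuristic for the text exposition format, not the validation that `prom_test_endpoint` performs (its internals are not described here). The function names and the sample body are hypothetical, and label sets are not parsed in detail.

```python
import re

# Rough pattern for one sample line:
#   metric_name{optional="labels"} value [optional_timestamp]
SAMPLE_LINE = re.compile(
    r"^[a-zA-Z_:][a-zA-Z0-9_:]*"  # metric name
    r"(\{.*\})?"                   # optional label set (not parsed in detail)
    r"\s+(\S+)"                    # sample value
    r"(\s+-?\d+)?$"                # optional timestamp in milliseconds
)


def malformed_lines(body: str) -> list[tuple[int, str]]:
    """Return (line_number, line) for every line that is not a comment,
    blank, or a plausible sample line."""
    bad = []
    for number, line in enumerate(body.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and # HELP / # TYPE comments are allowed
        match = SAMPLE_LINE.match(stripped)
        if match is None:
            bad.append((number, line))
            continue
        try:
            float(match.group(2))  # accepts forms such as 1027, 1.5e3, NaN, +Inf
        except ValueError:
            bad.append((number, line))
    return bad


def looks_like_exposition(body: str) -> bool:
    """True when the body has at least one sample line and no malformed lines."""
    has_sample = any(
        line.strip() and not line.strip().startswith("#")
        for line in body.splitlines()
    )
    return has_sample and not malformed_lines(body)


# Illustrative body in the text exposition format (not real data).
SAMPLE_BODY = (
    "# HELP http_requests_total Total HTTP requests.\n"
    "# TYPE http_requests_total counter\n"
    'http_requests_total{method="get",code="200"} 1027 1395066363000\n'
    "up 1\n"
)

if __name__ == "__main__":
    print(looks_like_exposition(SAMPLE_BODY))
    print(malformed_lines("<html><body>Not Found</body></html>"))
```

An HTML error page or a JSON body fails this check on the first line, which is the typical signature of the "No metrics found" case.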

Common Scenarios

Scenario	Cause	Fix
Connection Refused	Pod/VM not running or wrong port	Verify Deployment is healthy, check port in ServiceMonitor
Context Deadline Exceeded	Target did not respond within `scrape_timeout`	Increase `scrape_timeout` (it cannot exceed `scrape_interval`) or optimize the metrics endpoint
401 Unauthorized	Endpoint requires authentication	Configure bearer token or basic auth in ServiceMonitor
High Cardinality	Too many unique label value combinations	Use `prom_plan_relabel` to drop labels
No metrics found	Endpoint doesn't expose Prometheus format	Use `prom_test_endpoint` to validate format
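
The first three rows of the table above can be matched mechanically against a target's last scrape error. The following is a minimal sketch under the assumption that the error text contains the usual substrings Prometheus reports for these failures. The function name `classify_scrape_error` and the sample error strings are hypothetical. High cardinality and format problems normally need the cardinality resource or a direct endpoint check, so they are left unclassified here.

```python
# (substring to look for, scenario name, suggested fix), taken from the table above.
SCENARIOS = [
    ("connection refused", "Connection Refused",
     "Verify Deployment is healthy, check port in ServiceMonitor"),
    ("context deadline exceeded", "Context Deadline Exceeded",
     "Increase scrape_timeout or optimize the metrics endpoint"),
    ("401", "401 Unauthorized",
     "Configure bearer token or basic auth in ServiceMonitor"),
]


def classify_scrape_error(last_error: str) -> tuple[str, str]:
    """Map a scrape error message to (scenario, suggested fix).

    Matching is a case-insensitive substring search, so it is a heuristic:
    an unrecognized message falls through to "Unclassified".
    """
    lowered = last_error.lower()
    for needle, scenario, fix in SCENARIOS:
        if needle in lowered:
            return scenario, fix
    return "Unclassified", "Inspect the endpoint directly and check cardinality"


if __name__ == "__main__":
    # Illustrative error strings, not captured from a real system.
    examples = [
        'Get "http://10.0.0.7:8080/metrics": dial tcp 10.0.0.7:8080: connect: connection refused',
        'Get "http://10.0.0.7:8080/metrics": context deadline exceeded',
        "server returned HTTP status 401 Unauthorized",
    ]
    for message in examples:
        print(classify_scrape_error(message))
```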

Resources for Troubleshooting

Resource	Purpose
`prom://topology/failed_targets`	Quick triage of all down targets
`prom://topology/services`	Service catalog with health status
`prom://system/backends`	Backend connectivity check
`prom://config/runtime`	Verify scrape intervals and retention

Next Steps

App Onboarding — Instrumenting applications
PromQL Querying — Safe query workflows
TSDB FinOps — Cardinality analysis and optimization
