datadog-mcp v1.0.9

io.github.TANTIOPE/datadog-mcp

Full Datadog API access: monitors, logs, metrics, traces, dashboards, and observability tools

Datadog MCP Server


DISCLAIMER: This is a community-maintained project and is not officially affiliated with, endorsed by, or supported by Datadog, Inc. This MCP server utilizes the Datadog API but is developed independently.

MCP server providing AI assistants with full Datadog observability access. Features grep-like log search, APM trace filtering with duration/status/error queries, smart sampling modes for token efficiency, and cross-correlation between logs, traces, and metrics.

Configuration

Required Environment Variables

DD_API_KEY=your-api-key
DD_APP_KEY=your-app-key

Optional Environment Variables

DD_SITE=datadoghq.com  # Default. Use datadoghq.eu for EU, etc.

# Limit defaults (fallbacks when AI doesn't specify)
MCP_DEFAULT_LIMIT=50              # General tools default limit
MCP_DEFAULT_LOG_LINES=200         # Logs tool default limit
MCP_DEFAULT_METRIC_POINTS=1000    # Metrics timeseries data points
MCP_DEFAULT_TIME_RANGE=24         # Default time range in hours
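How these fallbacks are typically resolved can be sketched as follows — `readDefault` is a hypothetical helper, not the server's actual code; only the variable names and fallback values come from the table above:

```typescript
// Resolve a numeric default from the environment, falling back when the
// variable is unset or not a positive integer.
function readDefault(
  env: Record<string, string | undefined>,
  name: string,
  fallback: number
): number {
  const raw = env[name];
  const parsed = raw === undefined ? NaN : Number.parseInt(raw, 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
}

const defaults = {
  limit: readDefault(process.env, "MCP_DEFAULT_LIMIT", 50),
  logLines: readDefault(process.env, "MCP_DEFAULT_LOG_LINES", 200),
  metricPoints: readDefault(process.env, "MCP_DEFAULT_METRIC_POINTS", 1000),
  timeRangeHours: readDefault(process.env, "MCP_DEFAULT_TIME_RANGE", 24),
};
```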

Optional Flags

--site=datadoghq.com     # Datadog site (overrides DD_SITE)
--transport=stdio|http   # Transport mode (default: stdio)
--port=3000              # HTTP port when using http transport
--host=0.0.0.0           # HTTP host when using http transport
--read-only              # Block all write operations
--disable-tools=synthetics,rum,security    # Comma-separated list of tools to disable

Usage

Claude Desktop / VS Code / Cursor

{
  "mcpServers": {
    "datadog": {
      "command": "npx",
      "args": ["-y", "datadog-mcp"],
      "env": {
        "DD_API_KEY": "your-api-key",
        "DD_APP_KEY": "your-app-key",
        "DD_SITE": "datadoghq.com"
      }
    }
  }
}

Docker

{
  "mcpServers": {
    "datadog": {
      "command": "docker",
      "args": [
        "run", "-i", "--rm",
        "-e", "DD_API_KEY",
        "-e", "DD_APP_KEY",
        "-e", "DD_SITE",
        "ghcr.io/tantiope/datadog-mcp"
      ],
      "env": {
        "DD_API_KEY": "your-api-key",
        "DD_APP_KEY": "your-app-key",
        "DD_SITE": "datadoghq.com"
      }
    }
  }
}

Kubernetes

Use environment variables instead of container args:

env:
  - name: DD_API_KEY
    value: "your-api-key"
  - name: DD_APP_KEY
    value: "your-app-key"
  - name: MCP_TRANSPORT
    value: "http"
  - name: MCP_PORT
    value: "3000"
  - name: MCP_HOST
    value: "0.0.0.0"

Note: Kubernetes args: replaces the entire Dockerfile CMD, causing Node.js to receive the flags instead of your application. Environment variables avoid this issue.

HTTP Transport

When running with --transport=http:

  • POST /mcp — MCP protocol endpoint
  • GET /mcp — SSE stream for responses
  • DELETE /mcp — Close session
  • GET /health — Health check

Tools

| Tool | Action | Category | Description | Required Scopes |
|---|---|---|---|---|
| monitors | list | Alerting | List monitors with optional filters | monitors_read |
| monitors | get | Alerting | Get monitor by ID | monitors_read |
| monitors | search | Alerting | Search monitors by query | monitors_read |
| monitors | create | Alerting | Create a new monitor | monitors_write |
| monitors | update | Alerting | Update an existing monitor | monitors_write |
| monitors | delete | Alerting | Delete a monitor | monitors_write |
| monitors | mute | Alerting | Mute a monitor | monitors_write |
| monitors | unmute | Alerting | Unmute a monitor | monitors_write |
| monitors | top | Alerting | Top N monitors by alert frequency with real monitor names and context breakdown. Groups without context tags are included as "no_context" | monitors_read |
| dashboards | list | Visualization | List all dashboards | dashboards_read |
| dashboards | get | Visualization | Get dashboard by ID | dashboards_read |
| dashboards | create | Visualization | Create a new dashboard | dashboards_write |
| dashboards | update | Visualization | Update a dashboard | dashboards_write |
| dashboards | delete | Visualization | Delete a dashboard | dashboards_write |
| logs | search | Logs | Search logs with query syntax and filters | logs_read_data, logs_read_index_data |
| logs | aggregate | Logs | Aggregate log data with groupBy | logs_read_data |
| metrics | query | Metrics | Query timeseries data | metrics_read, timeseries_query |
| metrics | search | Metrics | Search for metrics by name | metrics_read |
| metrics | list | Metrics | List active metrics | metrics_read |
| metrics | metadata | Metrics | Get metric metadata | metrics_read |
| traces | search | APM | Search spans with filters | apm_read |
| traces | aggregate | APM | Aggregate trace data | apm_read |
| traces | services | APM | List APM services | apm_service_catalog_read |
| events | list | Events | List events | events_read |
| events | get | Events | Get event by ID | events_read |
| events | create | Events | Create an event | events_read |
| events | search | Events | Search events with v2 API and cursor pagination | events_read |
| events | aggregate | Events | Client-side aggregation by monitor_name, source, etc. | events_read |
| events | top | Events | Top N event groups by count with generic groupBy support (deployments, configs, alerts, etc.). Groups without context tags are included as "no_context" | events_read |
| events | timeseries | Events | Time-bucketed alert trends (hourly/daily counts) | events_read |
| events | incidents | Events | Deduplicate alerts into incidents with Trigger/Recover pairing | events_read |
| incidents | list | Incidents | List incidents | incident_read |
| incidents | get | Incidents | Get incident by ID | incident_read |
| incidents | search | Incidents | Search incidents | incident_read |
| incidents | create | Incidents | Create an incident | incident_write |
| incidents | update | Incidents | Update an incident | incident_write |
| incidents | delete | Incidents | Delete an incident | incident_write |
| slos | list | SLOs | List SLOs | slos_read |
| slos | get | SLOs | Get SLO by ID | slos_read |
| slos | create | SLOs | Create an SLO | slos_write |
| slos | update | SLOs | Update an SLO | slos_write |
| slos | delete | SLOs | Delete an SLO | slos_write |
| slos | history | SLOs | Get SLO history | slos_read |
| synthetics | list | Synthetics | List synthetic tests | synthetics_read |
| synthetics | get | Synthetics | Get test by public ID | synthetics_read |
| synthetics | create | Synthetics | Create a test | synthetics_write |
| synthetics | update | Synthetics | Update a test | synthetics_write |
| synthetics | delete | Synthetics | Delete a test | synthetics_write |
| synthetics | trigger | Synthetics | Trigger a test run | synthetics_write |
| synthetics | results | Synthetics | Get test results | synthetics_read |
| downtimes | list | Downtimes | List downtimes | monitors_downtime |
| downtimes | get | Downtimes | Get downtime by ID | monitors_downtime |
| downtimes | create | Downtimes | Create a downtime | monitors_downtime |
| downtimes | update | Downtimes | Update a downtime | monitors_downtime |
| downtimes | cancel | Downtimes | Cancel a downtime | monitors_downtime |
| downtimes | listByMonitor | Downtimes | List downtimes for a monitor | monitors_downtime |
| hosts | list | Infrastructure | List hosts | hosts_read |
| hosts | totals | Infrastructure | Get host totals | hosts_read |
| hosts | mute | Infrastructure | Mute a host | hosts_read |
| hosts | unmute | Infrastructure | Unmute a host | hosts_read |
| rum | applications | RUM | List RUM applications | rum_read |
| rum | events | RUM | Search RUM events | rum_read |
| rum | aggregate | RUM | Aggregate RUM data | rum_read |
| rum | performance | RUM | Get Core Web Vitals (LCP, FCP, CLS, FID, INP) | rum_read |
| rum | waterfall | RUM | Get session timeline with resources/actions/errors | rum_read |
| security | rules | Security | List security rules | security_monitoring_rules_read |
| security | signals | Security | Search security signals | security_monitoring_signals_read |
| security | findings | Security | List security findings | security_monitoring_findings_read |
| notebooks | list | Notebooks | List notebooks | notebooks_read |
| notebooks | get | Notebooks | Get notebook by ID | notebooks_read |
| notebooks | create | Notebooks | Create a notebook | notebooks_write |
| notebooks | update | Notebooks | Update a notebook | notebooks_write |
| notebooks | delete | Notebooks | Delete a notebook | notebooks_write |
| users | list | Admin | List users | user_access_read |
| users | get | Admin | Get user by ID | user_access_read |
| teams | list | Admin | List teams | teams_read |
| teams | get | Admin | Get team by ID | teams_read |
| teams | members | Admin | List team members | teams_read |
| tags | list | Infrastructure | List all tags | hosts_read |
| tags | get | Infrastructure | Get tags for a host | hosts_read |
| tags | add | Infrastructure | Add tags to a host | hosts_read |
| tags | update | Infrastructure | Update host tags | hosts_read |
| tags | delete | Infrastructure | Delete host tags | hosts_read |
| usage | summary | Billing | Usage summary | usage_read |
| usage | hosts | Billing | Host usage | usage_read |
| usage | logs | Billing | Log usage | usage_read |
| usage | custom_metrics | Billing | Custom metrics usage | usage_read |
| usage | indexed_spans | Billing | Indexed spans usage | usage_read |
| usage | ingested_spans | Billing | Ingested spans usage | usage_read |
| auth | validate | Auth | Test API and App key validity | |

Token Efficiency

Limit Control

AI assistants have full control over query limits. The MCP_DEFAULT_* environment variables only set the fallback used when the AI doesn't specify a limit; they do NOT cap what the AI can request.

| Tool | Default | Parameter | Description |
|---|---|---|---|
| Logs | 200 | limit | Log lines to return |
| Metrics (timeseries) | 1000 | pointLimit | Data points per series (controls resolution) |
| General tools | 50 | limit | Results to return |

Defaults can be configured via MCP_DEFAULT_* environment variables:

{
  "mcpServers": {
    "datadog": {
      "command": "npx",
      "args": ["-y", "datadog-mcp"],
      "env": {
        "DD_API_KEY": "your-api-key",
        "DD_APP_KEY": "your-app-key",
        "MCP_DEFAULT_LIMIT": "50",              // General fallback for most tools
        "MCP_DEFAULT_LOG_LINES": "200",         // Logs search only
        "MCP_DEFAULT_METRIC_POINTS": "1000",    // Metrics query timeseries only
        "MCP_DEFAULT_TIME_RANGE": "24"          // Default time range in hours
      }
    }
  }
}

Compact Mode (Logs)

Use compact: true when searching logs to reduce token usage. Strips custom attributes and keeps only essential fields:

logs({ action: "search", status: "error", compact: true })

Returns: id, timestamp, service, status, message (truncated), traceId, spanId, error
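As a sketch, the compact projection amounts to something like the following — the `CompactLog` shape follows the field list above, while `toCompact` and the exact truncation length are illustrative assumptions:

```typescript
interface CompactLog {
  id: string;
  timestamp: string;
  service?: string;
  status?: string;
  message: string;
  traceId?: string;
  spanId?: string;
  error?: unknown;
}

// Keep only the essential fields and truncate the message; custom
// attributes are dropped entirely to save tokens.
function toCompact(log: Record<string, any>, maxMessage = 200): CompactLog {
  const msg = String(log.message ?? "");
  return {
    id: log.id,
    timestamp: log.timestamp,
    service: log.service,
    status: log.status,
    message: msg.length > maxMessage ? msg.slice(0, maxMessage) + "…" : msg,
    traceId: log.traceId,
    spanId: log.spanId,
    error: log.error,
  };
}
```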

Sampling Modes (Logs)

Control how logs are sampled with the sample parameter:

| Mode | Description | Use Case |
|---|---|---|
| first | Chronological order (default) | Timeline analysis, specific events |
| spread | Evenly distributed across time range | See patterns over time |
| diverse | Deduplicated by message pattern | Error investigation (distinct error types) |

Example - find distinct error patterns:

logs({ action: "search", status: "error", sample: "diverse", limit: 25 })

The diverse mode normalizes messages (strips UUIDs, timestamps, IPs, numbers) to identify unique error patterns instead of returning duplicates.
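The normalization step can be sketched as follows — the regex patterns here are assumptions that follow the description above (UUIDs, timestamps, IPs, remaining numbers), not the server's exact rules:

```typescript
// Collapse variable parts of a message into placeholders so that
// instances of the same error template produce the same key.
function normalizeMessage(message: string): string {
  return message
    .replace(/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, "<uuid>")
    .replace(/\d{4}-\d{2}-\d{2}T[\d:.]+Z?/g, "<ts>")
    .replace(/\b\d{1,3}(\.\d{1,3}){3}\b/g, "<ip>")
    .replace(/\d+/g, "<n>");
}

// Keep only the first log of each normalized pattern.
function diverse(messages: string[]): string[] {
  const seen = new Set<string>();
  return messages.filter((m) => {
    const key = normalizeMessage(m);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```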

Events Aggregation

Top Monitors Report (Monitor-Specific)

Use monitors tool for monitor alerts with real monitor names:

monitors({ action: "top", from: "7d", limit: 10 })

Returns monitors with real names (including {{template.vars}}) from monitors API:

{
  "top": [
    {
      "rank": 1,
      "monitor_id": 67860480,
      "name": "High number of ready messages on {{queue.name}}",
      "message": "Queue {{queue.name}} has {{value}} ready messages",
      "total_count": 50,
      "by_context": [
        {"context": "queue:email-notifications", "count": 30},
        {"context": "queue:payment-processing", "count": 20}
      ]
    },
    {
      "rank": 2,
      "monitor_id": 134611486,
      "name": "Nginx some requests on errors (HTTP 5XX) on {{ingress.name}}",
      "message": "Nginx request on ingress {{ingress.name}} contains some errors (HTTP 5XX)",
      "total_count": 42,
      "by_context": [
        {"context": "ingress:api-gateway", "count": 29},
        {"context": "ingress:admin-panel", "count": 13}
      ]
    }
  ]
}

Top Events Report (Generic)

Use events tool for any event type (deployments, configs, custom events):

events({ action: "top", from: "7d", limit: 10, groupBy: ["service"] })

Returns event groups by custom fields:

{
  "top": [
    {
      "rank": 1,
      "service": "api-server",
      "message": "Deployment completed",
      "total_count": 30,
      "by_context": [
        {"context": "env:prod", "count": 20},
        {"context": "env:staging", "count": 10}
      ]
    }
  ]
}

Key Differences:

  • monitors top: Fetches real monitor names from monitors API (slower, monitor-specific)
  • events top: Fast generic grouping, returns event message text (any event type)

Context tags are auto-extracted: queue:, service:, ingress:, pod_name:, kube_namespace:, kube_container_name:

Tag Discovery

Discover available tag prefixes in your alert data:

events({ action: "discover", from: "7d", tags: ["source:alert"] })

Returns: {tagPrefixes: ["queue", "service", "ingress", "pod_name", "monitor", "priority"], sampleSize: 150}
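What tag-prefix discovery amounts to can be sketched like this — `discoverTagPrefixes` is a hypothetical helper mirroring the response shape above:

```typescript
// Collect the unique prefixes (the part before the first colon)
// from a sample of raw event tags.
function discoverTagPrefixes(tags: string[]): string[] {
  const prefixes = new Set<string>();
  for (const tag of tags) {
    const idx = tag.indexOf(":");
    if (idx > 0) prefixes.add(tag.slice(0, idx));
  }
  return [...prefixes];
}
```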

Custom Aggregation

For custom grouping patterns, use aggregate:

events({
  action: "aggregate",
  from: "7d",
  tags: ["source:alert"],
  groupBy: ["monitor_name", "priority"]
})

Supported groupBy fields: monitor_name, priority, alert_type, source, status, host, or any tag prefix

The aggregation uses v2 API with cursor pagination to stream through events efficiently (up to 10k events).

Alert Trends (Timeseries)

Visualize alert patterns over time with time-bucketed aggregation:

events({ action: "timeseries", from: "7d", interval: "1d" })

Returns hourly/daily alert counts grouped by monitor:

{
  "timeseries": [
    { "timestamp": "2024-01-15T00:00:00Z", "counts": { "High CPU": 5, "Low Disk": 2 }, "total": 7 },
    { "timestamp": "2024-01-16T00:00:00Z", "counts": { "High CPU": 3 }, "total": 3 }
  ]
}

| Interval | Use Case |
|---|---|
| 1h | Recent incident analysis (default) |
| 4h | Daily patterns |
| 1d | Weekly trends |

Combine with groupBy to see trends per monitor, source, or priority.
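The bucketing behind the timeseries action can be sketched as a pure function — `bucketEvents` is illustrative; only the field names follow the response example above:

```typescript
interface AlertEvent { timestamp: string; monitorName: string; }
interface Bucket { timestamp: string; counts: Record<string, number>; total: number; }

// Group events into fixed-size time buckets and count per monitor.
// intervalMs is the bucket size, e.g. 24 * 3600 * 1000 for "1d".
function bucketEvents(events: AlertEvent[], intervalMs: number): Bucket[] {
  const buckets = new Map<number, Bucket>();
  for (const e of events) {
    const start = Math.floor(Date.parse(e.timestamp) / intervalMs) * intervalMs;
    let b = buckets.get(start);
    if (!b) {
      b = { timestamp: new Date(start).toISOString(), counts: {}, total: 0 };
      buckets.set(start, b);
    }
    b.counts[e.monitorName] = (b.counts[e.monitorName] ?? 0) + 1;
    b.total += 1;
  }
  return [...buckets.values()].sort((a, b) => a.timestamp.localeCompare(b.timestamp));
}
```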

Incident Deduplication

Consolidate noisy alert floods into logical incidents:

events({ action: "incidents", from: "24h", dedupeWindow: "5m" })

Groups repeated triggers within the dedupe window and pairs with recovery events:

{
  "incidents": [
    {
      "monitorName": "High CPU Usage",
      "firstTrigger": "2024-01-15T10:00:00Z",
      "lastTrigger": "2024-01-15T10:15:00Z",
      "triggerCount": 4,
      "recovered": true,
      "recoveredAt": "2024-01-15T10:30:00Z",
      "duration": "30m"
    }
  ],
  "meta": { "totalIncidents": 15, "recoveredCount": 12, "activeCount": 3 }
}

| Dedupe Window | Use Case |
|---|---|
| 5m | Flapping detection (default) |
| 15m | Alert storm consolidation |
| 1h | Incident grouping |
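The trigger/recover pairing can be sketched as follows — an illustrative approximation of the dedupe logic, not the server's exact algorithm:

```typescript
interface AlertEvt { timestamp: string; monitorName: string; transition: "Triggered" | "Recovered"; }
interface Incident {
  monitorName: string;
  firstTrigger: string;
  lastTrigger: string;
  triggerCount: number;
  recovered: boolean;
  recoveredAt?: string;
}

// Triggers for the same monitor within windowMs of the previous trigger
// collapse into one incident; a Recovered event closes the open incident.
function dedupe(events: AlertEvt[], windowMs: number): Incident[] {
  const open = new Map<string, Incident>();
  const done: Incident[] = [];
  const sorted = [...events].sort((a, b) => a.timestamp.localeCompare(b.timestamp));
  for (const e of sorted) {
    const cur = open.get(e.monitorName);
    if (e.transition === "Recovered") {
      if (cur) {
        cur.recovered = true;
        cur.recoveredAt = e.timestamp;
        done.push(cur);
        open.delete(e.monitorName);
      }
      continue;
    }
    if (cur && Date.parse(e.timestamp) - Date.parse(cur.lastTrigger) <= windowMs) {
      cur.lastTrigger = e.timestamp;
      cur.triggerCount += 1;
    } else {
      if (cur) done.push(cur); // previous incident went stale without recovery
      open.set(e.monitorName, {
        monitorName: e.monitorName,
        firstTrigger: e.timestamp,
        lastTrigger: e.timestamp,
        triggerCount: 1,
        recovered: false,
      });
    }
  }
  return [...done, ...open.values()];
}
```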

Monitor Enrichment

Add monitor metadata to search results for deeper context:

events({ action: "search", tags: ["source:alert"], from: "1h", enrich: true })

Returns events with monitor details (type, thresholds, tags):

{
  "events": [{
    "id": "...",
    "title": "[Triggered on {host:prod-1}] High CPU Usage",
    "monitorMetadata": {
      "id": 12345,
      "type": "metric alert",
      "message": "CPU is above threshold",
      "tags": ["team:platform", "env:prod"],
      "options": { "thresholds": { "critical": 90 } }
    }
  }]
}

Note: Enrichment adds latency (fetches monitor list). Use for detailed investigation, not bulk analysis.

Cross-Correlation

Logs → Traces → Metrics

  1. Find errors in logs: logs({ action: "search", status: "error", sample: "diverse" })
  2. Extract trace_id from log attributes (dd.trace_id)
  3. Get full trace: traces({ action: "search", query: "trace_id:<id>" })
  4. Query APM metrics: metrics({ action: "query", query: "avg:trace.<service>.request.duration{*}" })
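Steps 2 and 3 above can be sketched as a small helper — `traceQueryFromLog` is hypothetical; the `dd.trace_id` attribute path is Datadog's standard log/trace correlation key:

```typescript
// Pull dd.trace_id out of a log's attributes and build the query
// string to hand to the traces search action.
function traceQueryFromLog(log: { attributes?: Record<string, unknown> }): string | null {
  const traceId = log.attributes?.["dd.trace_id"];
  return traceId ? `trace_id:${traceId}` : null;
}
```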

Deep Links

All query responses include a datadog_url field that links directly to the Datadog UI, allowing AI assistants to provide evidence links back to the source data.

Example Response

{
  "logs": [...],
  "meta": {
    "count": 25,
    "query": "service:api status:error",
    "from": "2024-01-15T10:00:00Z",
    "to": "2024-01-15T11:00:00Z",
    "datadog_url": "https://app.datadoghq.com/logs?query=service%3Aapi%20status%3Aerror&from_ts=1705312800000&to_ts=1705316400000"
  }
}
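The URL construction can be reproduced with a short sketch — `logsDeepLink` is a hypothetical helper matching the example above, with the query URL-encoded and the time range converted to millisecond timestamps:

```typescript
// Build a Logs Explorer deep link for a query and ISO time range.
function logsDeepLink(appUrl: string, query: string, fromIso: string, toIso: string): string {
  const from = Date.parse(fromIso);
  const to = Date.parse(toIso);
  return `${appUrl}/logs?query=${encodeURIComponent(query)}&from_ts=${from}&to_ts=${to}`;
}
```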

Supported Tools

| Tool | URL Type |
|---|---|
| logs | Logs Explorer with query and time range |
| metrics | Metrics Explorer with query and time range |
| traces | APM Traces with query and time range |
| events | Event Explorer with query and time range |
| monitors | Monitor detail page (get) or Manage Monitors (list/search) |
| rum | RUM Explorer or Session Replay |

Multi-Region Support

URLs are automatically generated for your configured Datadog site:

| Site | App URL |
|---|---|
| datadoghq.com (default) | https://app.datadoghq.com |
| datadoghq.eu | https://app.datadoghq.eu |
| us3.datadoghq.com | https://us3.datadoghq.com |
| us5.datadoghq.com | https://us5.datadoghq.com |
| ap1.datadoghq.com | https://ap1.datadoghq.com |
| ddog-gov.com | https://app.ddog-gov.com |

Configure your site via the DD_SITE environment variable or --site flag.
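The mapping in the table reduces to a small rule: bare domains get an app. prefix, while regional subdomains (us3, us5, ap1) are used as-is. A sketch, where `appUrl` is hypothetical and the server's real logic may differ:

```typescript
// Derive the app URL for a configured DD_SITE value.
function appUrl(site: string): string {
  const bareDomains = new Set(["datadoghq.com", "datadoghq.eu", "ddog-gov.com"]);
  return bareDomains.has(site) ? `https://app.${site}` : `https://${site}`;
}
```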

Contributing

Contributions are welcome! Feel free to open an issue or a pull request if you have any suggestions, bug reports, or improvements to propose.

License

This project is licensed under the Apache License, Version 2.0.