Observability Reference

ai-core-kit gives you two honest, complementary views of a Claude Code build:

  1. AI usage — the USD and token cost of the run, derived offline from the transcript. There is no live cost API for Claude Code spend (issue #11008), so this is accurate after-the-fact accounting, never a live meter.
  2. DORA — the four delivery “keys” (deployment frequency, lead time, change failure rate, time to restore), computed from your local git history (and gh when present). This is exact, not a transcript estimate.

This page is the complete, no-omissions list of those primitives: the aggregator, the pricing map, budgets, the DORA module, the report, the local monitor, the Prometheus exporter, and the docker-compose stack with its three dashboards. The cost attribution model is detailed in Offline Cost Telemetry.

How you consume it — three tiers

Observability here is offline-first and tiered. You do not need Grafana — or any infra at all — to get the full cost, token, and DORA picture. The tiers are additive (each reads the same two engines, re-implementing no math):

| Tier | Infra | What you get |
| --- | --- | --- |
| Tier 0 — CLI + report (default) | none | aggregate.py / dora.py on the CLI, plus a self-contained HTML/Markdown report (report.py) you can open or attach to a PR |
| Tier 1 — scheduled monitor | none new | DORA via a GitHub Action (git history is in the runner); cost/budget via a local monitor.sh |
| Tier 2 — Grafana stack | Docker | live-ish dashboards over the same gauges — opt-in, for teams already running Grafana |

Start at Tier 0. Reach for Tier 2 only if a dashboard earns its keep.

Two engines, one stack. aggregate.py × pricing.json prices transcript token-usage (OFFLINE — near-real-time at best, never a live meter). dora.py reads local git (+ gh) for the four keys (EXACT). Both surface in the same Prometheus + Grafana stack via three folder-provisioned dashboards.

The primitives at a glance

| Primitive | Kind | Layer | What it does |
| --- | --- | --- | --- |
| aggregate.py | telemetry | META | Tier 0. Offline cost + token aggregator — reads transcripts, multiplies token counts by the versioned pricing map, attributes by model/feature/agent/session/day, and compares totals to advisory budgets. |
| pricing.json | telemetry | META | Versioned model → USD/MTok map, unknown_model_policy=error. |
| dora.py | telemetry | META | Tier 0. The DORA four keys from local git history (+ optional gh), with a self-test. Text / JSON / Prometheus output. |
| report.py | telemetry | META | Tier 0. Self-contained HTML/Markdown report — imports the two engines into one standalone, no-network artifact. |
| dashboard.py | telemetry | META | Tier 0. Self-contained interactive HTML cost dashboard — open the .html, or --serve for a local live view. The Grafana-free way to get charts. |
| monitor.sh | telemetry | META | Tier 1. Local cost/budget monitor — runs aggregate.py against local transcripts and ALERTs on a manifest-budget overage. |
| ack-cost-exporter | telemetry | META | Tier 2. Thin Prometheus wrapper that imports aggregate.py (no re-implementation) and exposes cost/token gauges on /metrics. |
| observability-stack | telemetry | META | Tier 2 (opt-in). Prometheus + Grafana + exporter docker-compose stack with three dashboards. |

Paths:

| Primitive | Path |
| --- | --- |
| aggregate.py | telemetry/aggregate.py |
| pricing.json | telemetry/pricing.json |
| dora.py | telemetry/dora.py |
| report.py | telemetry/report.py |
| dashboard.py | telemetry/dashboard.py |
| monitor.sh | telemetry/monitor.sh |
| ack-cost-exporter | telemetry/observability/exporter/ack_cost_exporter.py |
| observability-stack | telemetry/observability/docker-compose.yml |
| dashboards | telemetry/observability/grafana/dashboards/{ack-cost,ack-ai-usage,ack-dora}.json |

Every one of these lives in the META telemetry/ and is mirrored to the CHILD payload under templates/telemetry/, wired by /ack-init when telemetry.enabled: true.


AI usage — aggregate.py (cost and tokens)

A stdlib-only post-run tool. For each assistant line it reads message.usage (present on every assistant turn, tool or not, so it captures 100% of spend) and prices it against pricing.json. Every bucket carries token counts — input / output / cache_read / cache_write_5m / cache_write_1h — alongside its USD cost, so this is true token-usage accounting, not just a dollar figure. It is fail-loud: an unknown model, a missing/invalid pricing.json, or a bucket-sum that does not reconcile to the grand total exits non-zero. A single malformed JSONL line is skipped (not fatal).

# whole machine, all axes, the AI-usage table (cost + tokens) + JSON:
python3 telemetry/aggregate.py
 
# per-session usage, this build only:
python3 telemetry/aggregate.py --by session --since 2026-06-01
## by session                turns   cost USD    in+out tok    cache tok
e3b61498-3313-49..            3872    441.4824     3,546,105   446,558,196
a29e493f-f2aa-4d..            5496    340.2362     3,670,522   442,059,374
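
The per-turn pricing described above can be sketched roughly as follows. This is a minimal illustration, not the code of aggregate.py: the function names, the `demo-model` id, and the USD/MTok rates are hypothetical, and the `usage` field names (`input_tokens`, `cache_creation.ephemeral_5m_input_tokens`, and so on) are an assumption about the transcript layout.

```python
import json

# Hypothetical USD-per-million-token rates; the real ones live in pricing.json.
PRICING = {
    "demo-model": {"input": 3.0, "output": 15.0, "cache_read": 0.3,
                   "cache_write_5m": 3.75, "cache_write_1h": 6.0},
}
KINDS = ("input", "output", "cache_read", "cache_write_5m", "cache_write_1h")


def turn_tokens(usage):
    """Map one message.usage dict onto the five token kinds (field names assumed)."""
    cache = usage.get("cache_creation") or {}
    return {
        "input": usage.get("input_tokens", 0),
        "output": usage.get("output_tokens", 0),
        "cache_read": usage.get("cache_read_input_tokens", 0),
        "cache_write_5m": cache.get("ephemeral_5m_input_tokens", 0),
        "cache_write_1h": cache.get("ephemeral_1h_input_tokens", 0),
    }


def price_transcript(lines, pricing=PRICING):
    """Return (turns, tokens_by_kind, cost_usd) for an iterable of JSONL lines."""
    totals = dict.fromkeys(KINDS, 0)
    turns, cost = 0, 0.0
    for raw in lines:
        try:
            rec = json.loads(raw)
        except json.JSONDecodeError:
            continue  # one malformed line is skipped, not fatal
        message = rec.get("message") if isinstance(rec, dict) else None
        usage = message.get("usage") if isinstance(message, dict) else None
        if not isinstance(usage, dict):
            continue  # no usage block, so nothing to price on this line
        model = message.get("model")
        if model not in pricing:
            # fail loud: an unpriced model must never be silently under-counted
            raise SystemExit(f"unknown model {model!r}: add it to the pricing map")
        tokens = turn_tokens(usage)
        for kind in KINDS:
            totals[kind] += tokens[kind]
            cost += tokens[kind] * pricing[model][kind] / 1_000_000
        turns += 1
    return turns, totals, cost
```

Raising on an unknown model while merely skipping an unparsable line mirrors the fail-loud versus skip distinction in the paragraph above.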

Attribution axes — now including day

--by selects one or more of model,feature,agent,session,day:

  • model / session — keyed on the exact message.model / sessionId. Always exact.
  • agent — isSidechain splits main from subagent:<requestId> spend.
  • feature — supplied by --branch-prefix (feature = branch after the prefix) or a --sidecar-map (timestamp → bucket); anything unmatched lands in the --default-bucket (never silently dropped).
  • day — each turn buckets to its UTC calendar day (YYYY-MM-DD); timestamp-less turns land in an explicit undated bucket. This powers the per-day token + cost time series the ack-ai-usage dashboard charts.

Every axis reconciles: the per-bucket sum is proven equal to the grand total, or the run exits non-zero.
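
A minimal sketch of the day and feature axes plus the reconciliation check, under stated assumptions: the helper names, the `cost_usd` key, the `unattributed` default bucket name, and the tolerance are all hypothetical, and naive timestamps are assumed to be UTC.

```python
from collections import defaultdict
from datetime import datetime, timezone


def day_bucket(timestamp):
    """UTC calendar day (YYYY-MM-DD) for an ISO-8601 stamp, else 'undated'."""
    if not timestamp:
        return "undated"
    ts = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)  # assumption: naive stamps are UTC
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d")


def feature_bucket(branch, prefix="feat/", default="unattributed"):
    """Feature = branch name after the prefix; anything else goes to the default."""
    if branch and branch.startswith(prefix) and len(branch) > len(prefix):
        return branch[len(prefix):]
    return default  # never silently dropped


def bucket_costs(turns, key):
    """Sum cost_usd per bucket, where key(turn) names the bucket."""
    buckets = defaultdict(float)
    for turn in turns:
        buckets[key(turn)] += turn["cost_usd"]
    return dict(buckets)


def assert_reconciled(buckets, grand_total, tol=1e-6):
    """Exit non-zero unless the per-bucket sum equals the grand total."""
    drift = abs(sum(buckets.values()) - grand_total)
    if drift > tol:
        raise SystemExit(f"axis does not reconcile: drift {drift:.6f} USD")
```

Because every turn lands in exactly one bucket per axis (including the explicit `undated` and default buckets), the bucket sum can only differ from the grand total if something was dropped or double-counted, which is what the check is meant to catch.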

Budgets are advisory

pricing.json produces actuals. Budgets (advisory USD ceilings) flag overage — they never enforce or block anything live. Two ways to set them:

  • CHILD manifest: telemetry.budgets[] (scope project|feature|contract|agent), read by aggregate.py --manifest and by the exporter (ACK_MANIFEST).
  • Ad-hoc on the CLI: --budget USD for the grand total, or --budget-axis AXIS + repeated --bucket-budget NAME=USD for per-bucket caps. Overage is reported; --budget-strict makes overage exit non-zero (reconciliation failure always exits non-zero, independent of budgets).
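
The advisory semantics can be illustrated with a small sketch. The function name and return shape are hypothetical; only the behaviour (report overage, change the exit code only in strict mode) comes from the text above.

```python
def check_budgets(actuals, budgets, strict=False):
    """Compare per-bucket actual USD against advisory ceilings.

    Returns (overages, exit_code). Overage is always reported, but it only
    changes the exit code when strict is set; nothing is blocked either way.
    """
    overages = []
    for name, ceiling in sorted(budgets.items()):
        spent = actuals.get(name, 0.0)
        if spent > ceiling:
            overages.append((name, spent, ceiling))
    exit_code = 1 if (strict and overages) else 0
    return overages, exit_code


if __name__ == "__main__":
    # Hypothetical buckets and ceilings, for illustration only.
    over, code = check_budgets({"login": 12.0, "search": 3.0},
                               {"login": 10.0, "search": 5.0})
    for name, spent, ceiling in over:
        print(f"OVER BUDGET {name}: {spent:.2f} USD > {ceiling:.2f} USD")
    raise SystemExit(code)
```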

pricing.json — the versioned price map

A model → USD/MTok map with schema_version, an as_of date, and unknown_model_policy: error. Per-model keys: input, output, cache_write_5m, cache_write_1h, cache_read. An aliases block maps bare/aliased ids to a priced id (dated -YYYYMMDD suffixes are stripped automatically); skip_models lists non-billable pseudo-models. A message.model absent from the map is a hard error naming the offending id — cost is never silently under-counted. The fix: add a row (copy a same-tier row, set the USD/MTok values, bump as_of).
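
As an illustration of the lookup order, here is a hedged sketch of model resolution. The top-level `models` key, the model ids, the rates, and the `as_of` value are hypothetical; only the field names called out above (schema_version, as_of, unknown_model_policy, aliases, skip_models, and the five per-model keys) come from this page.

```python
import re

# Hypothetical pricing document; the "models" key and all values are assumptions.
PRICING_DOC = {
    "schema_version": 1,
    "as_of": "2026-01-01",
    "unknown_model_policy": "error",
    "models": {
        "demo-large": {"input": 3.0, "output": 15.0, "cache_read": 0.3,
                       "cache_write_5m": 3.75, "cache_write_1h": 6.0},
    },
    "aliases": {"demo": "demo-large"},
    "skip_models": ["demo-noop"],
}

_DATE_SUFFIX = re.compile(r"-\d{8}$")  # a dated -YYYYMMDD suffix


def resolve_model(model_id, doc):
    """Return the rate row, None for a non-billable pseudo-model, or raise."""
    if model_id in doc.get("skip_models", []):
        return None
    aliases = doc.get("aliases", {})
    # Try the id as written, then with the dated suffix stripped.
    for candidate in (model_id, _DATE_SUFFIX.sub("", model_id)):
        candidate = aliases.get(candidate, candidate)
        if candidate in doc["models"]:
            return doc["models"][candidate]
    # unknown_model_policy=error: name the offending id instead of guessing.
    raise KeyError(f"unknown model {model_id!r}: add a row and bump as_of")
```

Returning `None` for a skipped model keeps "deliberately not billed" distinct from "not found", so the caller cannot confuse the two cases.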


Tier 0 — report.py (self-contained report)

The default, zero-infra view. report.py imports aggregate.py and dora.py and renders a single standalone artifact — no external CSS/JS, no network — combining the cost+token breakdown and the DORA four keys into one document you can open in a browser or paste into a PR. It is a view, not a second source of truth: the numbers are still the reconciled, fail-loud output of the two engines.

python3 telemetry/report.py --format html --out report.html   # open / attach to a PR
python3 telemetry/report.py --format md   --out report.md     # comment / commit body

Tier 0 — dashboard.py (interactive HTML cost dashboard)

telemetry/dashboard.py is a self-contained interactive cost dashboard — the Grafana-free way to get charts. Where report.py emits a static document, this emits a single HTML file with interactive charts (filter by feature / model / agent, drill into sessions, toggle token kinds) and all CSS/JS inlined — no external assets, no network. Open the .html, or run --serve for a local live view that re-aggregates on an interval:

python3 telemetry/dashboard.py --out cost-dashboard.html   # one self-contained file
python3 telemetry/dashboard.py --serve --watch 5           # local live view, recompute every 5s

Like report.py, it imports aggregate.py and dora.py — it is a view, not a second source of truth, and even under --serve it is an OFFLINE recompute (near-real-time as transcripts grow, never a live token meter — #11008).


DORA — dora.py (the four keys, exact from git)

A stdlib-only sibling of aggregate.py that reads local git history — no servers, no pip — and computes the four DORA keys over a window (--since 30d|12w|6m|1y|YYYY-MM-DD, default 30d). Unlike cost, this is exact, not a transcript estimate.

python3 telemetry/dora.py                       # tag mode (release tags = deploys)
python3 telemetry/dora.py --deploy-mode merge   # trunk/CD repos (first-parent = deploy)
python3 telemetry/dora.py --selftest            # pin the math on a synthetic fixture
python3 telemetry/dora.py --prom                # Prometheus exposition text

| Key | Definition (in this tool) | Rating bands |
| --- | --- | --- |
| Deployment frequency | deploys in the window ÷ days. | elite ≥1/day · high ≥weekly · medium ≥monthly · low |
| Lead time for changes | median(commit authored → first deploy that contains it). | elite <1d · high <1w · medium <1m · low |
| Change failure rate | failed_deploys ÷ deploys. | elite/high ≤15% · medium ≤30% · low |
| Mean time to restore | median(failure marker → next deploy that resolves it). | elite <1h · high <1d · medium <1w · low |
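
The definitions and bands in the table can be written out as a short sketch. It is not the dora.py implementation: the function names and input shapes are hypothetical, and treating "weekly" as 1/7 per day and a month as 30 days is an assumption.

```python
from statistics import median

DAY = 86_400  # seconds


def four_keys(deploys, lead_times_s, restore_times_s, window_days):
    """deploys: list of {"failed": bool}; lead/restore times are in seconds."""
    total = len(deploys)
    failed = sum(1 for d in deploys if d["failed"])
    return {
        "deploy_frequency_per_day": total / window_days,
        "lead_time_seconds": median(lead_times_s) if lead_times_s else None,
        "change_failure_rate": failed / total if total else None,
        "mttr_seconds": median(restore_times_s) if restore_times_s else None,
    }


def rate_frequency(per_day):
    if per_day >= 1:
        return "elite"
    if per_day >= 1 / 7:   # assumption: "weekly" means at least one per 7 days
        return "high"
    if per_day >= 1 / 30:  # assumption: "monthly" means at least one per 30 days
        return "medium"
    return "low"


def rate_lead_time(seconds):
    if seconds < DAY:
        return "elite"
    if seconds < 7 * DAY:
        return "high"
    if seconds < 30 * DAY:  # assumption: one month is 30 days
        return "medium"
    return "low"


def rate_change_failure(rate):
    if rate <= 0.15:
        return "elite/high"
    if rate <= 0.30:
        return "medium"
    return "low"
```

For example, six deploys in a 30-day window with one failure give a frequency of 0.2 per day (high) and a change failure rate of about 16.7% (medium). A key with no data is returned as `None` rather than 0, so an empty window is not mistaken for a perfect one.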

Deploys and failures are PROXIES — dora.py is honest about it. A git repo has no real deployment stream, so a deploy is either a release tag (--deploy-tag-glob, default v*; the default mode) or a first-parent commit on the default branch (--deploy-mode merge, for trunk/CD repos). A failure is a deploy that contains a revert (Revert … / This reverts commit …) or hotfix commit (--hotfix-glob, default *hotfix*; also fix!: / [hotfix]), or — only with --use-gh — a deploy whose commit SHA has a failed CI run. Squash/rebase/force-push histories and tag-less flows will mis-estimate; pick the --deploy-mode that matches how you ship and read the heuristic note the report prints.
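
A rough sketch of the revert/hotfix heuristic follows. It assumes the hotfix glob is matched case-insensitively against the commit subject, which this page does not specify; the function names and the commit dict shape are hypothetical.

```python
from fnmatch import fnmatchcase


def is_failure_commit(subject, body="", hotfix_glob="*hotfix*"):
    """True if a commit looks like a revert or a hotfix (heuristic proxy)."""
    subject = subject.strip()
    lowered = subject.lower()
    # Reverts: git's default subject and body markers.
    if subject.startswith("Revert ") or "This reverts commit " in body:
        return True
    # Explicit hotfix markers named on this page.
    if lowered.startswith("fix!:") or "[hotfix]" in lowered:
        return True
    # Assumption: the glob is applied to the lower-cased subject.
    return fnmatchcase(lowered, hotfix_glob.lower())


def deploy_failed(commits, hotfix_glob="*hotfix*"):
    """A deploy is a failure if any commit it contains is a failure marker."""
    return any(
        is_failure_commit(c["subject"], c.get("body", ""), hotfix_glob)
        for c in commits
    )
```

Because these are string heuristics, a squash merge that rewrites the subject would hide a revert from this check, which is one concrete way the mis-estimation mentioned above can arise.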

gh enrichment is best-effort: missing, unauthenticated, or offline gh silently skips CI-based failure detection (revert/hotfix detection still runs); the report states which path it took. The --selftest asserts the exact four-key math on a synthetic, git-free fixture (and the edge cases: no deploys, windowing, CI-only failure, the window grammar) — it is part of the test gate.

--prom emits these gauges (so the exporter can surface DORA without re-implementing the math): ack_dora_deploys_total, ack_dora_deploy_frequency_per_day, ack_dora_deploy_frequency_per_week, ack_dora_lead_time_seconds, ack_dora_change_failure_rate, ack_dora_failed_deploys_total, ack_dora_mttr_seconds, ack_dora_window_span_days. A metric with no data is emitted as NaN (Prometheus records “no sample” rather than a misleading 0).
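
The NaN convention can be sketched as a tiny renderer for the Prometheus text exposition format. The function name is hypothetical, and the sample values are made up for illustration.

```python
import math


def render_prom(gauges):
    """Render {metric_name: value_or_None} as Prometheus exposition text."""
    lines = []
    for name, value in gauges.items():
        lines.append(f"# TYPE {name} gauge")
        if value is None or (isinstance(value, float) and math.isnan(value)):
            lines.append(f"{name} NaN")  # no data: never emit a misleading 0
        else:
            lines.append(f"{name} {float(value)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_prom({
        "ack_dora_deploys_total": 6,
        "ack_dora_deploy_frequency_per_day": 0.2,
        "ack_dora_mttr_seconds": None,  # no failure was restored in the window
    }), end="")
```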


Tier 1 — scheduled monitor (zero new infra)

The same engine also drives a live terminal session: telemetry/watch.py redraws tokens + cost per feature in place, like top.

Same engines, now running on a schedule so a regression or an overage finds you. The split follows the data:

  • DORA → GitHub Action. Git history is already checked out in the runner, so a scheduled workflow runs dora.py (--json / --prom), writes the four keys to the job summary, and opens an issue on a regression (a key dropping a rating band). Nothing leaves CI; no transcripts are needed.
  • Cost/budget → local monitor.sh. It runs aggregate.py against your local transcripts with the manifest’s advisory budgets and flags an overage as an ALERT. This stays on the developer’s machine on purpose: token transcripts are machine-local (#11008 + the locality note) and are not present in CI, so a CI job could not price them. Run it from cron, a SessionStart/Stop hook, or by hand: telemetry/monitor.sh.

Why the split: DORA travels with the repo (CI can see it); AI cost is reconstructed from machine-local transcripts (CI cannot). Tier 1 puts each metric where its data already lives — no new infra, no shipping transcripts off the box.


Tier 2 (opt-in) — ack-cost-exporter — Prometheus gauges

Tier 2 is optional — worth it only for teams already running Grafana. It adds visualization over the same offline numbers, not accuracy, and not a live meter.

A thin Prometheus wrapper that imports load_pricing, discover_jsonl, and aggregate from the sibling aggregate.py — it does not re-implement pricing or attribution. On each scrape (subject to an ACK_SCRAPE_TTL cache, default 30s) it re-parses the transcript JSONL and re-aggregates, so freshness is “as of the last recompute”, never a live token meter. It is fail-soft at scrape time: on any error it keeps the last good gauges and sets ack_scrape_error=1 (an empty/missing transcript dir is not an error — it emits clean zeros).
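
The TTL cache and fail-soft behaviour described above might look roughly like this. The class and attribute names are hypothetical, and waiting a full TTL before retrying after an error is an assumption, not something this page states.

```python
import time


class CachedScrape:
    """Recompute gauges at most once per TTL; keep the last good set on error."""

    def __init__(self, recompute, ttl=30.0, clock=time.monotonic):
        self._recompute = recompute  # callable returning a dict of gauges
        self._ttl = ttl
        self._clock = clock
        self._gauges = {}
        self._last = None
        self.scrape_error = 0        # mirrors the ack_scrape_error gauge

    def gauges(self):
        now = self._clock()
        if self._last is not None and now - self._last < self._ttl:
            return self._gauges      # cache hit: values are "as of last recompute"
        try:
            self._gauges = self._recompute()
            self.scrape_error = 0
        except Exception:
            self.scrape_error = 1    # fail soft: serve stale gauges, flag them
        self._last = now             # assumption: wait a full TTL before retrying
        return self._gauges
```

Injecting the clock keeps the class testable without sleeping, and it makes the freshness bound explicit: a scrape never sees data newer than the last successful recompute.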

| Metric | Meaning |
| --- | --- |
| ack_total_cost_usd | Grand total across all assistant turns. |
| ack_assistant_turns_total | Number of assistant turns priced. |
| ack_files_scanned | Transcript files discovered. |
| ack_cost_usd{model,feature,agent} | Cost per axis bucket (1-D; inactive axes pinned to *). |
| ack_tokens_total{kind,feature,agent} | Tokens per kind per axis bucket. |
| ack_budget_usd{feature} | Advisory budget ceilings from the manifest (scope=project → __project__). |
| ack_reconciled | 1 if all axes reconcile, else 0. |
| ack_pricing_as_of{as_of,reconciled} | Pricing-doc metadata (Info). |
| ack_scrape_duration_seconds | Wall time of the last recompute. |
| ack_scrape_error | 1 if the last scrape errored (stale, do not trust). |
| ack_last_scrape_unixtime | Unix ts of the last recompute. |

Config (env): ACK_PROJECT_DIR, ACK_PRICING, ACK_MANIFEST (optional; supplies budgets), ACK_BRANCH_PREFIX (default feat/), ACK_DEFAULT_BUCKET, ACK_SINCE, ACK_PORT (default 9418), ACK_SCRAPE_TTL (default 30).
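
For illustration, the environment variables above could be read like this. The function name and dict keys are hypothetical; only the variable names and the three stated defaults come from this page, and variables without a documented default are left as `None`.

```python
import os


def load_config(env=None):
    """Read exporter settings from the environment, applying documented defaults."""
    env = os.environ if env is None else env
    return {
        "project_dir": env.get("ACK_PROJECT_DIR"),
        "pricing": env.get("ACK_PRICING"),
        "manifest": env.get("ACK_MANIFEST"),        # optional; supplies budgets
        "branch_prefix": env.get("ACK_BRANCH_PREFIX", "feat/"),
        "default_bucket": env.get("ACK_DEFAULT_BUCKET"),
        "since": env.get("ACK_SINCE"),
        "port": int(env.get("ACK_PORT", "9418")),
        "scrape_ttl": float(env.get("ACK_SCRAPE_TTL", "30")),
    }
```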

Tier 2 (opt-in) — observability-stack — Prometheus + Grafana

telemetry/observability/docker-compose.yml stands up three services:

| Service | Image | Port | Role |
| --- | --- | --- | --- |
| exporter | ack-cost-exporter:local (built) | 9418 (internal) | Re-parses transcripts each scrape (TTL-cached) and exposes the gauges above. Mounts transcripts, aggregate.py, and pricing.json read-only. |
| prometheus | prom/prometheus:v3.5.3 | 9090 | Scrapes the exporter every 30s; stores series for 30 days. |
| grafana | grafana/grafana:12.4.3 | 3001 (→ 3000) | Dashboards; anonymous Viewer by default. Host port 3001 because 3000 is the docs site. |

docker compose up -d        # start (from telemetry/observability/)
open http://localhost:3001  # Grafana (anonymous Viewer)
docker compose down         # stop (add -v to wipe stored series)

Three dashboards (folder-provisioned)

Grafana auto-loads every *.json under grafana/dashboards/ via the ack.yml provider — a new dashboard needs no provisioning edit.

| Dashboard | UID | What it shows |
| --- | --- | --- |
| Cost Observability (ack-cost.json) | ack-cost-observability | Total cost, turns, exporter health; cost by feature/model/agent; cost-share pie; top sessions; budget gauges. |
| AI Usage (ack-ai-usage.json) | ack-ai-usage | Token counts (not just USD) — tokens-over-time by kind, cache-read share, a feature×kind token ledger, tokens by feature/agent, per-session spend, budget gauges, and an OFFLINE recompute-age “Data Freshness” stat. |
| DORA (ack-dora.json) | ack-dora-metrics | The four keys as stat panels + trend timeseries (deploy frequency, lead time/MTTR, change-failure rate, failed-vs-total deploys), sourced from the ack_dora_* gauges. |

Copy .env.example to .env to override ports, project dir, and Grafana creds. The cost/token numbers are near-real-time (refreshed every scrape, bounded by ACK_SCRAPE_TTL), not a live per-token meter.


The hard constraint, and what is exact

There is no live cost meter for Claude Code spend, and there never can be — hooks carry no token or cost fields (#11008). State this before promising any real-time number.

So AI cost is offline (transcript × pricing, near-real-time at best). DORA is exact (derived from git, not the transcript). None of this needs Grafana: Tier 0 (CLI + report) is the default, Tier 1 (the scheduled monitor) adds no new infra, and Tier 2 (Grafana) is purely opt-in. Two CHILD payload skills sit on top of the cost engine (both MIT):

  • cost-telemetry — runs aggregate.py and interprets its output; the single source of truth for the numbers. Rendered when features.cost_telemetry == true.
  • cost-audit — evidence-first investigation of why spend spiked; it delegates the numbers to cost-telemetry and never re-derives pricing math.

See also: Offline Cost Telemetry (attribution model, reconciliation, budgets), Skills Reference.