Span tools log what happened. blamr traces handoffs, attributes blame, and explains which agent caused the failure — and why. Not observability. Causal intelligence.
Every agent handoff emits a CausalEdge — confidence, intent drift, I/O previews. Workers build a causal graph and rank who actually caused the failure.
An SDK, MCP proxy, or framework adapter emits edges at runtime. They are not reconstructed from logs later.
Ingest stores edges; workers compute semantic drift, Shapley blame scores, and confidence gates.
Blame report + intent trace show which agent broke the mission and the fix path — in seconds.
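To make the Shapley scoring step concrete, here is a minimal sketch of exact Shapley attribution over a handful of agents. It is an illustration under stated assumptions, not blamr's worker code: the `shapley_blame` function, the `toy_failure` value function, its numbers, and the `router` agent are all hypothetical.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_blame(
    agents: Sequence[str],
    failure_value: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Exact Shapley values: average marginal contribution over all orderings.

    Cost grows as n!, so this form only suits small agent sets.
    """
    totals = {agent: 0.0 for agent in agents}
    orderings = list(permutations(agents))
    for ordering in orderings:
        coalition: FrozenSet[str] = frozenset()
        for agent in ordering:
            with_agent = coalition | {agent}
            totals[agent] += failure_value(with_agent) - failure_value(coalition)
            coalition = with_agent
    return {agent: total / len(orderings) for agent, total in totals.items()}


# Hypothetical value function: how much of the failure appears when the agents
# in the coalition run their faulty behaviour and all others behave correctly.
def toy_failure(coalition: FrozenSet[str]) -> float:
    score = 0.0
    if "intent_classifier" in coalition:
        score += 0.8  # the misclassification alone causes most of the damage
    if "intent_classifier" in coalition and "response_writer" in coalition:
        score += 0.2  # the writer only makes it worse given the bad label
    return score


scores = shapley_blame(["intent_classifier", "router", "response_writer"], toy_failure)
for agent, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{agent}: {score:.2f}")
```

In this toy setup the classifier takes most of the blame, the writer a small share, and the router none, because the router never changes the outcome in any coalition.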
Certainty per hop — catch inflation when hedges become facts
Goal drift — see where the mission dissolved across hops
Downstream impact weight for blame propagation
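As a rough picture of what one handoff record might carry, the sketch below models an edge with the three signals listed above plus I/O previews. The field names, the 0 to 1 ranges, and the example values other than the 0.91 confidence are assumptions for illustration; the actual CausalEdge schema may differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CausalEdge:
    """Hypothetical shape of one handoff record; field names are assumptions."""

    source_agent: str
    target_agent: str
    confidence: float     # certainty the source attaches to its output, 0..1
    intent_drift: float   # distance from the original goal, 0 = fully on goal
    impact_weight: float  # downstream impact used when propagating blame
    input_preview: str
    output_preview: str

    def __post_init__(self) -> None:
        # Reject out-of-range scores early so bad edges never reach ingest.
        for name in ("confidence", "intent_drift", "impact_weight"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")


# Illustrative values only; the previews and weights are made up.
edge = CausalEdge(
    source_agent="intent_classifier",
    target_agent="response_writer",
    confidence=0.91,
    intent_drift=0.4,
    impact_weight=0.7,
    input_preview="How many leave days do I have left?",
    output_preview="category=payroll",
)
payload = json.dumps(asdict(edge))  # what an emitter could send to ingest
```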
Existing tools — LangSmith, Langfuse, AgentOps — excel at recording spans, tokens, and latency. When a multi-agent workflow fails silently, they show you every step. They cannot tell you which step caused it.
What did each agent do?
Flat spans and traces. Manual log inspection. The agent that outputs the wrong answer gets investigated first — even when the root cause was six hops upstream.
Which agent caused this outcome?
Causal edges at every handoff. Backward blame propagation with Shapley scoring. Confidence and intent tracked across hops — so silent failures surface before they ship.
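One simple way to picture backward blame propagation: start with all blame on the agent that produced the bad output, then walk the graph upstream, passing along the share each handoff is responsible for. The sketch below assumes each edge carries a hypothetical `share` in [0, 1], and that the shares on a node's incoming edges sum to at most 1. It is a simplified stand-in, not the production algorithm, and the `planner` agent and share values are invented.

```python
from collections import defaultdict
from graphlib import TopologicalSorter
from typing import Dict, List, Set, Tuple

# (upstream, downstream, share of the downstream agent's blame owed upstream)
Edge = Tuple[str, str, float]


def propagate_blame(edges: List[Edge], failing_agent: str) -> Dict[str, float]:
    incoming: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    predecessors: Dict[str, Set[str]] = {failing_agent: set()}
    for upstream, downstream, share in edges:
        incoming[downstream].append((upstream, share))
        predecessors.setdefault(downstream, set()).add(upstream)
        predecessors.setdefault(upstream, set())

    # static_order() lists upstream agents first; reverse it so every agent
    # is visited only after all of its downstream agents.
    order = list(TopologicalSorter(predecessors).static_order())

    received: Dict[str, float] = defaultdict(float)
    received[failing_agent] = 1.0
    retained: Dict[str, float] = {}
    for agent in reversed(order):
        total = received[agent]
        passed = 0.0
        for upstream, share in incoming[agent]:
            amount = total * share
            received[upstream] += amount
            passed += amount
        retained[agent] = total - passed  # blame that stays with this agent
    return retained


# Hypothetical three-hop chain with invented shares.
blame = propagate_blame(
    [("planner", "summarizer", 0.9), ("summarizer", "report_writer", 0.8)],
    failing_agent="report_writer",
)
```

Here the agent that emitted the wrong answer keeps only a fifth of the blame, and most of it lands on the first hop.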
These are not missing features waiting for the next Langfuse release. Causality requires runtime instrumentation at the handoff layer — not post-hoc reconstruction from linear traces.
Traces show every step. None rank agents by causal contribution. With 8+ agents, you guess.
Agent 2 hedges; Agent 5 states it as fact. Manufactured certainty is invisible in span logs.
The original goal erodes hop by hop. By step 8 you are confidently answering the wrong question.
The agent that outputs the wrong answer gets blamed — not the bad decision six hops earlier.
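Goal drift can be approximated by comparing each hop's output with the original goal. The sketch below uses bag-of-words cosine similarity as a deliberately crude stand-in for a semantic embedding model; the `intent_drift` helper and the example texts are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def _vector(text: str) -> Counter:
    # Lowercased word counts; a real system would use embeddings instead.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: str, b: str) -> float:
    va, vb = _vector(a), _vector(b)
    dot = sum(va[token] * vb[token] for token in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def intent_drift(goal: str, hop_outputs: List[str]) -> List[float]:
    """Drift per hop = 1 - similarity to the original goal (0 = on goal)."""
    return [1.0 - cosine_similarity(goal, output) for output in hop_outputs]


# Invented example texts showing a goal eroding across three hops.
goal = "compare cloud erp vendors for a mid size retailer"
drift = intent_drift(
    goal,
    [
        "compare cloud erp vendors for a mid size retailer",
        "cloud erp vendors overview",
        "sap product history and pricing",
    ],
)
```

Plotting `drift` per hop gives a basic intent trace: the hop where the value jumps is the first place to look.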
Real scenarios from customer support, research, sales intelligence, finance, and autonomous analysis — from single-hop misclassification to full semantic drift across eight agents.
You tweak response_writer for two hours. You add a manual override. The root cause at intent_classifier is never fixed.
blamr ranks intent_classifier at 89% blame: it classified a leave request as payroll with 0.91 confidence. Fix: few-shot leave examples on the classifier.
The report looks authoritative. Standard observability shows 200 OK on every hop. Nobody catches it until reputational damage.
Confidence trace shows summarizer inflated 0.43 → 0.71 by dropping the hedge. Root cause: hop 2 uncertainty stripping.
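A confidence gate for this case can be sketched as a check that flags any hop whose output confidence rises without new evidence. The `Hop` record, the `new_evidence` flag, the tolerance, and the `researcher` hop are assumptions for illustration; only the 0.43 to 0.71 jump at the summarizer comes from the scenario above.

```python
from typing import List, NamedTuple, Tuple


class Hop(NamedTuple):
    agent: str
    confidence_in: float   # confidence attached to what the agent received
    confidence_out: float  # confidence attached to what the agent emitted
    new_evidence: bool     # hypothetical flag: did this hop add sources or data?


def find_inflation(hops: List[Hop], tolerance: float = 0.05) -> List[Tuple[str, float]]:
    """Return (agent, gain) for hops that raise confidence without new evidence."""
    flagged = []
    for hop in hops:
        gain = hop.confidence_out - hop.confidence_in
        if gain > tolerance and not hop.new_evidence:
            flagged.append((hop.agent, round(gain, 2)))
    return flagged


trace = [
    Hop("researcher", 0.43, 0.43, new_evidence=True),   # hedged finding passed on as-is
    Hop("summarizer", 0.43, 0.71, new_evidence=False),  # hedge dropped, certainty inflated
]
flagged = find_inflation(trace)
```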
Both agents succeeded. You debug the agents — but neither was wrong. The orchestrator policy was.
Conflict report: orchestrator weighted recency over firmographic ICP match. Fix is domain signal weighting — not agent prompts.
Finance finds it in the bank statement. Manual audit of every hop. No exception was ever thrown.
Causal graph flags entity_extractor at 94% blame — comma stripping misread Indian vs Western notation at hop 2.
You re-run the workflow. You tweak report_writer. Drift started at hop 4 — you never find it in logs.
Intent map: content_aggregator at 61% — SAP content weighted by volume. Counterfactual: relevance filter restores 89% intent.
What existing tools record vs what blamr attributes — at a glance.
| Capability | LangSmith | Langfuse | AgentOps | blamr |
|---|---|---|---|---|
| Span-level tracing | Yes | Yes | Yes | Yes |
| Blame propagation | No | No | No | Yes |
| Root cause ranking | No | No | No | Yes |
| Confidence decay tracking | No | No | No | Yes |
| Intent preservation tracking | No | No | No | Yes |
| MCP-native instrumentation | No | No | No | Yes |
| Self-hostable OSS | No | Yes | No | Yes |
Four independent forcing functions aligned in 2026 — production pain, protocol standardization, regulation, and validated research.
Workflows with 5 to 15 or more coordinating agents, built on MCP, LangGraph, or CrewAI, are a 2026 phenomenon. The debugging pain did not exist at this scale a year ago.
MCP is a Linux Foundation standard backed by Anthropic, Microsoft, Google, and AWS. blamr instruments at the protocol layer, so it is framework-agnostic by design.
High-risk AI systems need tamper-evident traceability. Causal audit export is compliance infrastructure — not optional for enterprise HR and finance.
AgentTrace, AAAI causal inference, A2P scaffolding — validated approaches with no production open-source implementation yet.
Run the full stack on your infrastructure. Docker Compose, Helm, Ollama-only LLM enrichment — no cloud LLM required.
API, ingest, workers, dashboard, ClickHouse, Redpanda, Postgres — one command to stand up the stack.
TypeScript SDK, Python SDK, or zero-code MCP middleware — emit causal edges from any agent runtime.
Production chart with ingress, init jobs, and local Ollama for semantic drift and blame reasons.