Open source · Self-hosted

Causal intelligence for multi-agent AI

Span tools log what happened. blamr traces handoffs, attributes blame, and explains which agent caused the failure — and why. Not observability. Causal intelligence.


Diagram: forward handoff (CausalEdge), backward blame propagation

88% of AI agents fail in production

17% accuracy of existing root-cause tools

70% of MAS maintenance is debugging

No production OSS tools for causal attribution

How blamr works

Every agent handoff emits a CausalEdge — confidence, intent drift, I/O previews. Workers build a causal graph and rank who actually caused the failure.

STEP 01

Instrument handoffs

SDK, MCP proxy, or framework adapter emits edges at runtime — not reconstructed from logs later.

STEP 02

Build causal graph

Ingest stores edges; workers compute semantic drift, Shapley blame scores, and confidence gates.

STEP 03

Explain root cause

Blame report + intent trace show which agent broke the mission and the fix path — in seconds.

confidence_out

Certainty per hop — catch inflation when hedges become facts

intent_delta

Goal drift — see where the mission dissolved across hops

influence_score

Downstream impact weight for blame propagation
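To make the three fields concrete, here is a minimal sketch of what an edge record could look like. This is not the blamr SDK: the field names confidence_out, intent_delta, and influence_score come from this page, while the class layout, the 0 to 1 ranges, the preview length, and the make_edge helper are assumptions for illustration.

```python
from dataclasses import dataclass, asdict
import json


def _preview(text: str, limit: int = 80) -> str:
    """Truncate agent I/O so an edge stays small enough to emit on every hop."""
    return text if len(text) <= limit else text[: limit - 3] + "..."


@dataclass(frozen=True)
class CausalEdge:
    # Hypothetical schema: names follow the page, types and ranges are assumed.
    source_agent: str
    target_agent: str
    confidence_out: float   # certainty the source attaches to its output, 0..1
    intent_delta: float     # drift from the original goal at this hop, 0 = none
    influence_score: float  # weight of this handoff on downstream results, 0..1
    input_preview: str
    output_preview: str

    def __post_init__(self) -> None:
        for name in ("confidence_out", "intent_delta", "influence_score"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")


def make_edge(source, target, agent_input, agent_output, *,
              confidence_out, intent_delta, influence_score):
    """Build an edge at handoff time, keeping only short I/O previews."""
    return CausalEdge(source, target, confidence_out, intent_delta,
                      influence_score, _preview(agent_input), _preview(agent_output))


# Scenario 01: the classifier hands a leave question to policy_lookup as payroll.
# The 0.91 confidence is from the page; the other two scores are made up.
edge = make_edge(
    "intent_classifier", "policy_lookup",
    "How many leave days do I have left this year?",
    "intent=payroll; route to payroll policy lookup and answer with the "
    "employee's latest salary statement details",
    confidence_out=0.91, intent_delta=0.6, influence_score=0.9,
)
payload = json.dumps(asdict(edge))  # what an ingest endpoint might receive
```

Validating ranges at construction time means a malformed edge fails at the handoff that produced it, not later in a worker.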

Flight recorder vs crash investigator

Existing tools — LangSmith, Langfuse, AgentOps — excel at recording spans, tokens, and latency. When a multi-agent workflow fails silently, they show you every step. They cannot tell you which step caused it.

Observability — flat span timeline

blamr — causal graph + blame rank

Observability tools

What did each agent do?

Flat spans and traces. Manual log inspection. The agent that outputs the wrong answer gets investigated first — even when the root cause was six hops upstream.

blamr

Which agent caused this outcome?

Causal edges at every handoff. Backward blame propagation with Shapley scoring. Confidence and intent tracked across hops — so silent failures surface before they ship.
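Backward blame propagation can be pictured as pushing a unit of blame from the failing agent up through its incoming edges. The sketch below is a deliberately simple stand-in, not blamr's scoring: it assumes an acyclic graph, assumes each edge carries an influence_score between 0 and 1, and splits blame evenly across parents before weighting by influence. The function name propagate_blame is hypothetical.

```python
from collections import defaultdict


def propagate_blame(edges, failed_agent):
    """Distribute one unit of blame backward from the agent that failed.

    edges: iterable of (source, target, influence_score) on an acyclic graph.
    Each agent passes mass * influence / n_parents up every incoming edge
    and keeps whatever it did not pass on.
    """
    incoming = defaultdict(list)
    for source, target, influence in edges:
        incoming[target].append((source, influence))

    blame = defaultdict(float)

    def walk(agent, mass):
        parents = incoming.get(agent, [])
        passed = 0.0
        for source, influence in parents:
            share = mass * influence / len(parents)
            passed += share
            walk(source, share)
        blame[agent] += mass - passed  # root agents keep everything they receive

    walk(failed_agent, 1.0)
    return sorted(blame.items(), key=lambda item: item[1], reverse=True)


# Scenario 01 chain with made-up influence scores.
edges = [
    ("intent_classifier", "policy_lookup", 0.9),
    ("policy_lookup", "response_writer", 0.8),
]
ranking = propagate_blame(edges, "response_writer")
top_agent = ranking[0][0]
```

With these assumed weights the agent that produced the visible symptom keeps only a small share, and most of the blame lands upstream.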

The structural gap

These are not missing features waiting for the next Langfuse release. Causality requires runtime instrumentation at the handoff layer — not post-hoc reconstruction from linear traces.

Which agent caused this wrong output?

Traces show every step. None rank agents by causal contribution. With 8+ agents, you guess.
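Ranking agents by causal contribution is the problem Shapley values address: each agent's score is its average marginal contribution over all orderings. The sketch below computes exact Shapley values for a small agent set. The value function error_removed is a toy assumption (how much of the failure disappears when a set of agents is replaced with known-good outputs), and its numbers are invented for illustration.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values by enumerating every ordering of the players.

    value: function mapping a frozenset of players to a number.
    Cost grows factorially, so this only suits a handful of agents.
    """
    totals = {player: 0.0 for player in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for player in order:
            extended = coalition | {player}
            totals[player] += value(extended) - value(coalition)
            coalition = extended
    return {player: total / len(orderings) for player, total in totals.items()}


def error_removed(fixed):
    """Toy counterfactual: share of the failure removed by fixing these agents."""
    score = 0.0
    if "intent_classifier" in fixed:
        score += 0.8
    if "response_writer" in fixed:
        score += 0.1
    # Interaction term: fixing policy_lookup only helps once routing is right.
    if {"intent_classifier", "policy_lookup"} <= fixed:
        score += 0.1
    return score


agents = ["intent_classifier", "policy_lookup", "response_writer"]
scores = shapley_values(agents, error_removed)
ranked = sorted(scores, key=scores.get, reverse=True)
```

The scores always sum to the value of fixing every agent, which is what makes them usable as blame percentages. With 8+ agents, exact enumeration is impractical and a sampled approximation over random orderings is the usual substitute.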

Confidence inflation across handoffs

Agent 2 hedges; Agent 5 states it as fact. Manufactured certainty is invisible in span logs.

Intent decay on long chains

The original goal erodes hop by hop. By step 8 you are confidently answering the wrong question.

Symptom vs root cause

The agent that outputs the wrong answer gets blamed — not the bad decision six hops earlier.

Five production failure patterns

Real scenarios from customer support, research, sales intelligence, finance, and autonomous analysis, ranging from single-hop misclassification to full semantic drift across eight agents.

01 · Simple

Wrong agent blamed — misclassification at hop 1

Customer support: intent_classifier → policy_lookup → response_writer. User asks about leave balance; response talks about payroll. All agents log success.

Without blamr

You tweak response_writer for two hours. You add a manual override. The root cause at intent_classifier is never fixed.

With blamr

blamr ranks intent_classifier at 89% blame: it classified a leave question as payroll with 0.91 confidence.

Fix path: Add few-shot leave examples to intent_classifier — not the writer.

02 · Moderate

Silent confidence inflation

Research workflow: web_searcher hedges at 0.43; four hops later the report states "40% — confirmed" at 0.95. No errors, all green spans.

Without blamr

The report looks authoritative. Standard observability shows 200 OK on every hop. Nobody catches it until the reputational damage is done.

With blamr

The confidence trace shows the summarizer inflated 0.43 → 0.71 by dropping the hedge. Root cause: uncertainty stripped at hop 2.

Fix path: Preserve uncertainty language in summarizer prompt and gate on confidence decay.
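A confidence gate for this pattern can be sketched as a check over consecutive hops: flag any hop where confidence rises sharply although no new evidence was added. The 0.43, 0.71, and 0.95 values are from the scenario above; the intermediate agents, their values, the new_evidence flag, and the 0.15 threshold are assumptions.

```python
def find_inflation(trace, max_rise=0.15):
    """Return (agent, rise) for hops that raise confidence without new evidence.

    trace: list of (agent, confidence_out, new_evidence) in hop order.
    """
    flagged = []
    for (_, prev_conf, _), (agent, conf, new_evidence) in zip(trace, trace[1:]):
        rise = conf - prev_conf
        if rise > max_rise and not new_evidence:
            flagged.append((agent, round(rise, 2)))
    return flagged


# Five hops; synthesizer and editor are hypothetical intermediate agents.
trace = [
    ("web_searcher", 0.43, True),
    ("summarizer", 0.71, False),   # dropped the hedge, added nothing
    ("synthesizer", 0.82, False),
    ("editor", 0.90, False),
    ("report_writer", 0.95, False),
]
flagged = find_inflation(trace)
```

Only the summarizer crosses the threshold here, which matches the root cause in the scenario. The later hops each add a small rise that stays under the gate, so a production gate would likely also cap the cumulative rise since the last hop that added evidence.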

03 · Moderate+

Parallel agent conflict

SDR pipeline: firmographic_agent says HIGH; intent_signal_agent says LOW. Orchestrator picks LOW. Three weeks later the lead turns out to be a $200K deal.

Without blamr

Both agents succeeded. You debug the agents — but neither was wrong. The orchestrator policy was.

With blamr

Conflict report: orchestrator weighted recency over firmographic ICP match. Fix is domain signal weighting — not agent prompts.

Fix path: Enterprise ICP leads should overweight firmographic signals in orchestrator policy.
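The fix path is a policy change, so it can be expressed as data instead of prompt edits. The sketch below assumes the orchestrator combines signals with per-segment weights. The segment names, weight values, level mapping, and threshold are all hypothetical.

```python
LEVELS = {"LOW": 0.0, "MEDIUM": 0.5, "HIGH": 1.0}

# Hypothetical policy table: the default favors recent intent signals,
# while enterprise ICP leads overweight the firmographic match.
WEIGHTS = {
    "default": {"firmographic": 0.4, "intent_signal": 0.6},
    "enterprise_icp": {"firmographic": 0.7, "intent_signal": 0.3},
}


def score_lead(signals, segment="default"):
    """Weighted average of agent verdicts under the segment's policy."""
    weights = WEIGHTS[segment]
    return sum(weights[name] * LEVELS[level] for name, level in signals.items())


def prioritize(signals, segment="default", threshold=0.5):
    return "HIGH" if score_lead(signals, segment) >= threshold else "LOW"


signals = {"firmographic": "HIGH", "intent_signal": "LOW"}
before = prioritize(signals)                    # default policy
after = prioritize(signals, "enterprise_icp")   # policy from the fix path
```

Both agents return the same verdicts in each call. Only the orchestrator policy differs, which is where the conflict report points.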

04 · Complex

Silent data mutation

Invoice pipeline: OCR extracts "1,40,000" (Indian notation for 140,000). entity_extractor parses it as ₹1,400,000. Six agents log success; the payment goes out 10× too large.

Without blamr

Finance finds it in the bank statement. Manual audit of every hop. No exception was ever thrown.

With blamr

Causal graph flags entity_extractor at 94% blame: at hop 2 its comma handling read Indian digit grouping as Western.

Fix path: Detect Indian number notation and emit low-confidence when comma placement is ambiguous.
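One way to implement this fix path is to classify the digit grouping before converting. The sketch below is an assumption about how an extractor could do it, not entity_extractor's actual code. It handles whole-number amounts only and returns a confidence of 0.0 when the comma placement fits neither convention.

```python
import re

# Indian grouping: last three digits, then groups of two (1,40,000).
INDIAN = re.compile(r"\d{1,2}(,\d{2})*,\d{3}")
# Western grouping: groups of three throughout (1,400,000).
WESTERN = re.compile(r"\d{1,3}(,\d{3})+")


def parse_amount(text):
    """Return (value, notation, confidence) for a whole-number amount string."""
    digits = text.strip()
    if digits.isdigit():
        return int(digits), "plain", 1.0

    indian = INDIAN.fullmatch(digits) is not None
    western = WESTERN.fullmatch(digits) is not None
    if not (indian or western):
        # Commas fit neither convention: refuse to guess, signal low confidence.
        return None, "unknown", 0.0

    # A valid grouping means stripping commas is safe in either notation.
    value = int(digits.replace(",", ""))
    if indian and western:
        notation = "either"  # e.g. 10,000 reads the same both ways
    else:
        notation = "indian" if indian else "western"
    return value, notation, 1.0


invoice = parse_amount("1,40,000")
western = parse_amount("1,400,000")
garbled = parse_amount("1,40,00")
```

The design choice is that the parser never repairs a malformed grouping: a downstream agent sees a missing value with zero confidence and can route the invoice to review.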

05 · Advanced

Semantic drift across 8 agents

Competitive analysis goal: "Workday Q2 APAC HCM." Final briefing covers SAP SuccessFactors globally. Polished output, wrong mission, zero errors.

Without blamr

You re-run the workflow. You tweak report_writer. Drift started at hop 4 — you never find it in logs.

With blamr

Intent map: content_aggregator at 61% blame, because it weighted SAP content by volume. Counterfactual: with a relevance filter, 89% of the original intent is preserved.

Fix path: Add intent relevance scoring to content_aggregator; filter web_searcher_2 to Workday-only sources.
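Locating where drift starts comes down to scoring each hop's output against the original goal. The sketch below uses bag-of-words cosine similarity as a dependency-free stand-in for an embedding model, so it only catches vocabulary drift. The hop outputs, the planner agent, and the 0.6 threshold are invented for illustration.

```python
import math
import re
from collections import Counter


def _vector(text):
    """Lowercased token counts; a crude substitute for a sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def intent_deltas(goal, hop_outputs):
    """intent_delta per hop: 1 minus similarity to the original goal."""
    goal_vec = _vector(goal)
    return [(agent, 1.0 - cosine(goal_vec, _vector(output)))
            for agent, output in hop_outputs]


def first_drift(goal, hop_outputs, threshold=0.6):
    """First agent whose output has drifted past the threshold, else None."""
    for agent, delta in intent_deltas(goal, hop_outputs):
        if delta > threshold:
            return agent
    return None


goal = "Workday Q2 APAC HCM"
hops = [
    ("planner", "Plan: analyze Workday HCM in APAC for Q2"),
    ("web_searcher_2", "Workday APAC HCM Q2 results and SAP SuccessFactors news"),
    ("content_aggregator", "SAP SuccessFactors global HCM market share and SAP product updates"),
    ("report_writer", "SAP SuccessFactors global briefing"),
]
drift_start = first_drift(goal, hops)
```

Scoring against the original goal at every hop, instead of against the previous hop, is what keeps slow erosion visible: each step can look close to its predecessor while the chain as a whole leaves the mission.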

Capability comparison

What existing tools record vs what blamr attributes — at a glance.

Capability	LangSmith	Langfuse	AgentOps	blamr
Span-level tracing	Yes	Yes	Yes	Yes
Blame propagation	No	No	No	Yes
Root cause ranking	No	No	No	Yes
Confidence decay tracking	No	No	No	Yes
Intent preservation tracking	No	No	No	Yes
MCP-native instrumentation	No	No	No	Yes
Self-hostable OSS	No	Yes	No	Yes

Why now

Four independent forcing functions aligned in 2026 — production pain, protocol standardization, regulation, and validated research.

Multi-agent systems hit production scale

Deployments of 5–15+ coordinating agents are a 2026 phenomenon, built with MCP, LangGraph, and CrewAI. The debugging pain did not exist at this scale a year ago.

MCP is the universal agent protocol

Linux Foundation standard across Anthropic, Microsoft, Google, AWS. blamr instruments at the protocol layer — framework-agnostic by design.

EU AI Act audit trails (Aug 2026)

High-risk AI systems need tamper-evident traceability. Causal audit export is compliance infrastructure — not optional for enterprise HR and finance.
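A common way to make a log tamper-evident is a hash chain, where each entry commits to the one before it. The sketch below shows the idea with SHA-256 from the Python standard library. It is an illustration of the technique, not blamr's export format and not a statement of what the EU AI Act requires. The record fields are made up.

```python
import copy
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry


def _digest(prev_hash, record):
    body = json.dumps(record, sort_keys=True)  # stable serialization
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_record(chain, record):
    """Append a record whose hash covers both its body and the previous hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"record": record, "prev_hash": prev_hash,
                  "hash": _digest(prev_hash, record)})


def verify(chain):
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


chain = []
append_record(chain, {"agent": "entity_extractor", "blame": 0.94})
append_record(chain, {"agent": "ocr", "blame": 0.04})

tampered = copy.deepcopy(chain)
tampered[0]["record"]["blame"] = 0.10  # rewrite history after the fact

intact_ok = verify(chain)
tampered_ok = verify(tampered)
```

A hash chain only shows that entries were changed. Stopping someone from rewriting the whole chain needs an external anchor, such as periodically publishing or signing the latest hash.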

Research proved it — nobody shipped OSS

AgentTrace, AAAI causal inference, A2P scaffolding — validated approaches with no production open-source implementation yet.

Self-hosted by default

Run the full stack on your infrastructure. Docker Compose, Helm, Ollama-only LLM enrichment — no cloud LLM required.

Docker Compose

API, ingest, workers, dashboard, ClickHouse, Redpanda, Postgres — one command to stand up the stack.

SDK + MCP proxy

TypeScript SDK, Python SDK, or zero-code MCP middleware — emit causal edges from any agent runtime.

Helm on Kubernetes

Production chart with ingress, init jobs, and local Ollama for semantic drift and blame reasons.