
Multi-Agent Debate & Critique Patterns

Key Points

  • Debate = multiple LLM agents argue with or critique each other to converge on a better answer than any single agent could give alone.
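  • Foundational research: Du et al. 2023 ("Improving Factuality and Reasoning in Language Models through Multiagent Debate") and Liang et al. 2023 ("Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate"). Both reported accuracy gains over single-agent baselines: Du et al. on reasoning and factuality benchmarks, Liang et al. on commonsense translation and counter-intuitive arithmetic.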
  • Variants: (1) Independent + judge — N agents answer in parallel, a judge picks. (2) Self-refine — one agent + a critic loop. (3) Round-robin debate — N agents over K rounds, summarizer at end. (4) Tournament — pairwise debates, winners advance.
  • In Microsoft Agent Framework: build via GroupChat (round-robin), Sequential (proposer → critic), or Magentic (planner-critic). Each agent has a distinct system prompt + role.
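  • Call count scales linearly with agents × rounds; if every agent re-reads the full history, total input tokens grow roughly quadratically with rounds. Frame as compute spent vs accuracy gained: only worth it when single-agent accuracy is unacceptable.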
  • Beats CoT on: open-ended reasoning, code review, factual claims with traps, plans with multiple stakeholders. Doesn't help on: deterministic lookups, simple arithmetic, well-defined transformations.
  • Failure modes: sycophantic convergence (agents agree to be agreeable), persistent disagreement (need a tie-breaker), context blowup over rounds, cost runaway.

Concepts (deep dive)

Why debate at all

A single LLM call samples one chain of reasoning from a probability distribution. It can be confidently wrong. Two independent agents are more likely to spot each other's errors than one agent is to catch its own, the same way human code review catches bugs the author missed. Empirically, Du et al. reported gains on the order of 5–15 percentage points over single-agent baselines on arithmetic, grade-school math, and factual QA, using a small number of agents and debate rounds.

This is the "wisdom of crowds for LLMs" hypothesis. It works because errors are partially independent — different sampled chains make different mistakes — and disagreement surfaces those mistakes.
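To put a rough number on the independence argument: if each of three agents is right with probability p and their errors were fully independent, a simple majority vote would be right with probability p³ + 3p²(1 − p). Full independence is an idealized assumption (agents sharing one model are correlated), so treat the result as an upper bound. `MajorityOfThree` is a hypothetical helper name.

```csharp
using System;

// Probability that at least 2 of 3 agents are correct, assuming each is
// correct with probability p and errors are fully independent (idealized).
static double MajorityOfThree(double p) =>
    p * p * p + 3 * p * p * (1 - p);

foreach (var p in new[] { 0.6, 0.7, 0.8, 0.9 })
{
    Console.WriteLine($"single = {p:F2}, majority of 3 = {MajorityOfThree(p):F3}");
}
```

At p = 0.7 the formula gives 0.784. The gain shrinks as errors become correlated, which is why the diversity measures discussed under failure modes matter.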

Variant 1 — Independent generation + judge

[User Q]
   ├──► [Agent A] ──┐
   ├──► [Agent B] ──┤
   ├──► [Agent C] ──┤
   │                ▼
   │         [Judge agent]
   │                │
   ▼                ▼
            [Best answer]

Cheap. Embarrassingly parallel. The judge sees three answers + reasoning and picks the best. Best when answers are easy to compare ("which response is more accurate?") and you can run agents concurrently.

Variant 2 — Self-refine (proposer + critic)

[Proposer] ──► answer v1
    ▲              │
    │              ▼
    └─── [Critic] ─┘
         (until critic accepts or N rounds)

One pair, K iterations. Each round the critic finds flaws; proposer fixes. Best for code, prose, plans — anything where iteration improves quality and a clear "good enough" criterion exists.
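The proposer/critic loop can be sketched without any framework. This is a minimal sketch, not the framework's implementation: `SelfRefine`, `Propose`, and `Critique` are hypothetical names, the two delegates stand in for LLM calls, and "empty feedback means accepted" is an assumed convention for the "good enough" criterion.

```csharp
using System;

// propose(task, feedback) returns a draft; critique(draft) returns "" to accept.
// Both delegates are hypothetical stand-ins for real agent invocations.
static (string Draft, int Rounds, bool Accepted) SelfRefine(
    string task,
    Func<string, string, string> propose,
    Func<string, string> critique,
    int maxRounds)
{
    string feedback = "";
    string draft = "";
    for (int round = 1; round <= maxRounds; round++)
    {
        draft = propose(task, feedback);   // proposer writes or revises
        feedback = critique(draft);        // critic looks for flaws
        if (feedback.Length == 0)          // empty feedback means "accepted"
            return (draft, round, true);
    }
    return (draft, maxRounds, false);      // hit the cap; caller decides what next
}

// Toy agents so the sketch runs without a model.
string Propose(string task, string feedback) =>
    feedback.Length == 0 ? "upload endpoint" : "upload endpoint with auth";
string Critique(string draft) =>
    draft.Contains("auth") ? "" : "missing auth check";

var result = SelfRefine("Design an upload endpoint", Propose, Critique, maxRounds: 3);
Console.WriteLine($"{result.Draft} (rounds: {result.Rounds}, accepted: {result.Accepted})");
```

Returning the `Accepted` flag alongside the draft lets the caller distinguish a converged result from one that merely ran out of rounds.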

Variant 3 — Round-robin debate

Round 1: [A says X] [B says Y] [C says Z]
Round 2: [A reads B,C, refines] [B reads A,C, refines] [C reads A,B, refines]
Round 3: ...
Final:   [Summarizer reads all rounds, produces consensus]

The Du et al. setup. Each agent in each round sees the other agents' previous outputs. Convergence usually happens within 2–3 rounds.

Variant 4 — Tournament

[A vs B] ──► winner_AB
[C vs D] ──► winner_CD
[winner_AB vs winner_CD] ──► champion

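Pairwise comparisons. Useful when you have many candidate solutions: N candidates need N − 1 debates over about log2(N) rounds. A single-elimination bracket yields a champion plus a coarse ranking (by round of elimination), not a full ordering.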
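A single-elimination bracket is one way to run the tournament. The sketch below assumes a `debate` delegate that stands in for a pairwise debate plus judge and returns the winner; `RunTournament` and `LongerWins` are hypothetical names, and the bye rule for odd counts is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Single-elimination bracket. `debate` is a hypothetical stand-in for a
// pairwise debate plus judge; it returns whichever candidate wins.
static (string Champion, int Debates, int Rounds) RunTournament(
    IReadOnlyList<string> candidates,
    Func<string, string, string> debate)
{
    if (candidates.Count == 0)
        throw new ArgumentException("need at least one candidate");

    var current = new List<string>(candidates);
    int debates = 0, rounds = 0;
    while (current.Count > 1)
    {
        var next = new List<string>();
        for (int i = 0; i + 1 < current.Count; i += 2)
        {
            next.Add(debate(current[i], current[i + 1]));
            debates++;
        }
        if (current.Count % 2 == 1)
            next.Add(current[current.Count - 1]);   // odd one out gets a bye
        current = next;
        rounds++;
    }
    return (current[0], debates, rounds);
}

// Toy judge: the longer answer wins (a real judge would be an LLM call).
string LongerWins(string a, string b) => a.Length >= b.Length ? a : b;

var outcome = RunTournament(new[] { "a", "bbb", "cc", "dddd", "e" }, LongerWins);
Console.WriteLine($"{outcome.Champion} after {outcome.Debates} debates in {outcome.Rounds} rounds");
```

With five candidates this runs four debates over three rounds. The pairings within a round are independent, so they can run concurrently.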

Code: 3-agent GroupChat in Microsoft Agent Framework

var proposer = new ChatClientAgent(chat)
{
    Name = "Proposer",
    Instructions = """
        You propose solutions. Be specific and concrete.
        After others critique, refine your proposal incorporating valid points.
        """
};

var securityCritic = new ChatClientAgent(chat)
{
    Name = "SecurityCritic",
    Instructions = """
        You are a security reviewer. Your job is to find vulnerabilities,
        unsafe defaults, and missing authn/authz checks. Be specific.
        If the proposal is secure, say so explicitly.
        """
};

var perfCritic = new ChatClientAgent(chat)
{
    Name = "PerfCritic",
    Instructions = """
        You are a performance reviewer. Find hot loops, N+1 queries,
        unbounded allocations, missing async, lock contention.
        If the proposal is sound, say so explicitly.
        """
};

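// Sketch of the wiring: a round-robin group chat capped at 3 rounds that stops
// early once every critic approves. Check the constructor and strategy names
// against your installed framework version.
var orchestration = new GroupChatOrchestration(
    agents: [proposer, securityCritic, perfCritic],
    maxRounds: 3,
    selectionStrategy: SelectionStrategy.RoundRobin,
    terminationStrategy: TerminationStrategy.WhenAllAgree);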

var result = await orchestration.InvokeAsync(
    "Design an endpoint that uploads files and returns a download URL.");

What happens:

  • Round 1: proposer drafts; security critic finds "no virus scan, no auth, public URL"; perf critic finds "loads whole file in memory".
  • Round 2: proposer revises; critics review again; one or both say "looks good".
  • Termination: when both critics agree → emit final.

Adversarial pair — yes/no debate

var pro = new ChatClientAgent(chat) { Name = "Pro", Instructions = "Argue YES." };
var con = new ChatClientAgent(chat) { Name = "Con", Instructions = "Argue NO." };
var judge = new ChatClientAgent(chat) { Name = "Judge", Instructions =
    "You are neutral. Read both sides. Pick the one with stronger evidence." };

Forced positions surface arguments a single agent (which might just pick the easier side) would skip.

When debate beats single-agent CoT

Task                            | Debate helps? | Why
--------------------------------|---------------|------------------------------------------
Multi-step reasoning with traps | Yes           | Critics catch chain-of-thought errors
Code review                     | Yes           | Distinct lenses (sec, perf, style)
Factual claims                  | Yes           | Cross-checking
Plan critique                   | Yes           | Forces consideration of edge cases
Open-ended writing              | Yes           | Iteration improves drafts
Simple arithmetic               | No            | One agent gets it; debate just costs more
Deterministic lookup            | No            | Either you have the data or you don't
Well-specified transform        | No            | No ambiguity to debate

Cost framing

Debate costs roughly N × K × (avg tokens per turn). For a 3-agent × 3-round debate on a 4k-token problem, that is 9 calls and ~36k–60k tokens, roughly 9–15× a single 4k-token CoT call. The calculation to make:

worth_it = (accuracy_gain_pp / 100 × cost_of_being_wrong) > (extra_tokens × token_price)

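If being wrong costs $1000 (a bad recommendation to a customer), a single call costs $0.01, and a debate costs $0.05, then a 10pp accuracy gain avoids 0.10 × $1000 = $100 of expected loss for $0.04 extra: easy yes. If being wrong costs $0.001 (cosmetic copy), the same gain is worth $0.0001: no.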
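The same comparison as a small function, using the dollar figures from the example. `WorthIt` is a hypothetical helper; it takes the accuracy gain as a fraction (0.10 for 10 percentage points) and the extra spend in dollars rather than tokens × price.

```csharp
using System;

// Expected-value check: debate pays off when the expected loss it avoids
// exceeds the extra spend. accuracyGain is a fraction (0.10 = 10 points).
static bool WorthIt(double accuracyGain, double costOfBeingWrong, double extraCost) =>
    accuracyGain * costOfBeingWrong > extraCost;

double extra = 0.05 - 0.01;   // debate price minus single-call price, in dollars

Console.WriteLine(WorthIt(0.10, 1000.0, extra));   // customer recommendation: True
Console.WriteLine(WorthIt(0.10, 0.001, extra));    // cosmetic copy: False
```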

Magentic's planner-critic loop

Magentic Orchestration (Microsoft Agent Framework) bakes a critic into the planner: the planner generates a plan; the critic agent reviews; the planner refines. This is debate at the planning layer, not the answering layer. See orchestration-magentic.md.

Failure modes

Sycophantic convergence

Every agent agrees with whoever spoke last. Mitigations:

  • System prompts that explicitly authorize disagreement: "It is your job to disagree if you find issues. Do not soften critique to be agreeable."
  • Forced adversarial roles (Pro / Con).
  • Different models for different agents (GPT-4 vs Claude vs Gemini); model diversity reduces shared bias.

Persistent disagreement

Critics never accept. Cap rounds at K (3 is plenty). At K, invoke a tie-breaker judge or summarizer.

Context window blowup

Each round each agent sees the prior rounds. By round 5 you may be feeding 50k tokens. Mitigations:

  • Cap rounds.
  • Summarize older rounds before injecting.
  • Only show the most recent round to each agent (loses some context but stays bounded).
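One way to combine the last two mitigations is to pass the latest round verbatim and collapse everything older into a single summary message. A minimal sketch under stated assumptions: `BuildContext` and `CountOnly` are hypothetical names, and the `summarize` delegate stands in for an LLM summarization call.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// rounds[r] holds the messages produced in round r (oldest first).
// `summarize` is a hypothetical stand-in for an LLM summarization call.
static List<string> BuildContext(
    string question,
    IReadOnlyList<IReadOnlyList<string>> rounds,
    Func<IEnumerable<string>, string> summarize)
{
    var context = new List<string> { question };
    if (rounds.Count == 0) return context;

    // Everything before the latest round collapses into one summary message.
    var older = rounds.Take(rounds.Count - 1).SelectMany(r => r).ToList();
    if (older.Count > 0)
        context.Add("Summary of earlier rounds: " + summarize(older));

    // The latest round is passed through verbatim.
    context.AddRange(rounds[rounds.Count - 1]);
    return context;
}

// Toy summarizer: just reports how many messages it compressed.
string CountOnly(IEnumerable<string> msgs) => $"{msgs.Count()} messages condensed";

var history = new List<IReadOnlyList<string>>
{
    new[] { "A: X", "B: Y", "C: Z" },
    new[] { "A: X2", "B: X", "C: X" },
    new[] { "A: X2", "B: X2", "C: X2" },
};
var ctx = BuildContext("Q: design an upload endpoint", history, CountOnly);
Console.WriteLine(string.Join("\n", ctx));
```

The context stays at one question, one summary, and N messages regardless of how many rounds have run, at the price of one extra summarization call per round.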

Cost runaway

K=10, N=5 debates can cost dollars per query. Always set hard caps; emit token-count metrics; alert on anomalies.
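A hard cap can be enforced as a guard that runs before each round. In this sketch the cap values and the per-round token estimates are illustrative assumptions; in practice the estimate would come from the provider's usage counts for the previous round.

```csharp
using System;

// Hard caps for one debate. Both limits are illustrative numbers.
const int MaxRounds = 3;
const int MaxTotalTokens = 60_000;

int tokensUsed = 0;
int roundsRun = 0;
string stopReason = "max rounds reached";

// Hypothetical estimates of what each round would consume.
int[] estimatedTokensPerRound = { 12_000, 24_000, 36_000 };

for (int round = 1; round <= MaxRounds; round++)
{
    int projected = tokensUsed + estimatedTokensPerRound[round - 1];
    if (projected > MaxTotalTokens)
    {
        stopReason = "token budget exceeded";   // stop before spending, not after
        break;
    }
    tokensUsed = projected;
    roundsRun = round;
}

// Emit the metrics you would alert on.
Console.WriteLine($"rounds={roundsRun} tokens={tokensUsed} stop={stopReason}");
```

Here the third round would push the total to 72,000 tokens, so the guard stops after round 2 with 36,000 tokens spent.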


How it works under the hood

Round-robin debate execution

┌──────────────────────────────────────────────────────┐
│ Orchestrator                                         │
│  ┌─ messages: [user_q]                               │
│  │                                                   │
│  │  for round in 1..K:                               │
│  │    for agent in [A, B, C]:                        │
│  │      resp = agent.invoke(messages + [whose turn]) │
│  │      messages.append(resp)                        │
│  │      if termination(messages): return             │
│  │                                                   │
│  └─ if no termination: summarizer.invoke(messages)   │
└──────────────────────────────────────────────────────┘
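The loop in the diagram can be exercised with stub agents. This is a framework-free sketch, not the orchestrator's real code: the agent lambdas are toy stand-ins for LLM calls, and it checks termination once per round rather than after every message so that approvals always refer to the latest draft.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy agents: each maps the transcript so far to a reply. Real agents would
// be LLM calls; names and replies here are illustrative only.
var agents = new (string Name, Func<IReadOnlyList<string>, string> Invoke)[]
{
    ("Proposer", t => t.Any(m => m.EndsWith("no auth")) ? "draft v2 with auth" : "draft v1"),
    ("SecurityCritic", t => t.Last().Contains("with auth") ? "approved" : "no auth"),
    ("PerfCritic", t => "approved"),
};

const int maxRounds = 3;
var messages = new List<string> { "User: design an upload endpoint" };
int roundsUsed = 0;
bool converged = false;

for (int round = 1; round <= maxRounds && !converged; round++)
{
    var thisRound = new List<string>();
    foreach (var agent in agents)
    {
        string reply = agent.Invoke(messages);   // agent sees the full transcript
        messages.Add($"{agent.Name}: {reply}");
        thisRound.Add(reply);
    }
    roundsUsed = round;
    // Converged when every critic (everyone after the proposer) approved this round.
    converged = thisRound.Skip(1).All(r => r == "approved");
}

if (!converged)
    messages.Add("Summarizer: no consensus, summarizing positions");   // fallback step

Console.WriteLine($"converged={converged} after {roundsUsed} round(s), {messages.Count} messages");
```

With these stubs the security critic objects in round 1, the proposer revises, and both critics approve in round 2.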

Termination strategies

TerminationStrategy.WhenAllAgree         // critics emit "approved" tokens
TerminationStrategy.MaxRounds(3)         // hard cap
TerminationStrategy.NoChangeFor(2)       // proposer didn't refine
TerminationStrategy.JudgeApproves        // dedicated judge agent
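Conceptually each strategy is a predicate over the debate state, and predicates compose. The sketch below mirrors two of the strategies by name only; the `Check` alias, the helper bodies, and the "latest reply per agent" state shape are assumptions for illustration, not the framework's types. `NoChangeFor` and `JudgeApproves` would follow the same shape with more state.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Check = System.Func<int, System.Collections.Generic.IReadOnlyDictionary<string, string>, bool>;

// A check receives the finished round number and each agent's latest reply.
static Check MaxRounds(int cap) =>
    (round, latest) => round >= cap;

static Check WhenAllAgree(string[] critics) =>
    (round, latest) => critics.All(c =>
        latest.TryGetValue(c, out var text) && text.Contains("approved"));

// Stop as soon as any one of the given checks fires.
static Check AnyOf(Check[] checks) =>
    (round, latest) => checks.Any(check => check(round, latest));

var shouldStop = AnyOf(new[]
{
    WhenAllAgree(new[] { "SecurityCritic", "PerfCritic" }),
    MaxRounds(3),   // hard cap always present as a backstop
});

var latestReplies = new Dictionary<string, string>
{
    ["SecurityCritic"] = "approved",
    ["PerfCritic"] = "N+1 query in the listing path",
};

Console.WriteLine(shouldStop(1, latestReplies));   // one critic still objects: False
Console.WriteLine(shouldStop(3, latestReplies));   // cap reached: True
```

Pairing every content-based check with a round cap means a critic that never approves cannot keep the debate running.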

Agent selection strategies

SelectionStrategy.RoundRobin             // A, B, C, A, B, C
SelectionStrategy.Auto                   // LLM picks who speaks next
SelectionStrategy.Adaptive               // based on last speaker's content

Auto is more natural but adds an extra LLM call per turn just for selection.


Code: correct vs wrong

✅ Correct — bounded debate with explicit critique mandate

var critic = new ChatClientAgent(chat)
{
    Name = "Critic",
    Instructions = """
        Your job is to find flaws. If you cannot find any, say "No issues found."
        Do not soften your critique to be agreeable. Be specific: cite line numbers,
        function names, and concrete examples from the proposal.
        """
};

var orchestration = new GroupChatOrchestration(
    agents: [proposer, critic],
    maxRounds: 3,
    terminationStrategy: TerminationStrategy.WhenLastMessageContains("No issues found"));

❌ Wrong — vague roles

var critic = new ChatClientAgent(chat)
{
    Name = "Helper",
    Instructions = "Help improve the answer."   // sycophant generator
};

❌ Wrong — unbounded rounds

var orchestration = new GroupChatOrchestration(
    agents: [a, b, c],
    maxRounds: int.MaxValue);   // cost explodes; never terminates

❌ Wrong — full history each round, no summary

// By round 5 each agent sees ~50k tokens of prior debate; total input tokens grow roughly quadratically in rounds
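The growth is easy to check with arithmetic. This sketch assumes every turn emits the same number of tokens and ignores the question itself; `FullHistoryInputTokens` is a hypothetical helper.

```csharp
using System;

// Total input tokens when every agent re-reads the full transcript each turn.
// Assumes every turn emits `tokensPerTurn` tokens and ignores the question itself.
static long FullHistoryInputTokens(int agents, int rounds, int tokensPerTurn)
{
    long total = 0;
    long transcript = 0;                      // tokens accumulated so far
    for (int r = 1; r <= rounds; r++)
    {
        for (int a = 0; a < agents; a++)
        {
            total += transcript;              // this agent reads everything so far
            transcript += tokensPerTurn;      // then appends its own turn
        }
    }
    return total;
}

foreach (int k in new[] { 1, 2, 4, 8 })
    Console.WriteLine($"rounds={k}: {FullHistoryInputTokens(3, k, 1000)} input tokens");
```

With 3 agents and 1,000 tokens per turn, going from 4 to 8 rounds raises input tokens from 66,000 to 276,000: doubling the rounds roughly quadruples the input cost.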

✅ Correct — independent + judge (cheap parallel variant)

var responses = await Task.WhenAll(
    agent1.InvokeAsync(question, ct),
    agent2.InvokeAsync(question, ct),
    agent3.InvokeAsync(question, ct));

var judgePrompt = $"""
    Question: {question}
    Candidate A: {responses[0].Text}
    Candidate B: {responses[1].Text}
    Candidate C: {responses[2].Text}
    Pick the most accurate answer. Reply with "A", "B", or "C" on the first line, then one sentence of reasoning.
    """;
var winner = await judge.InvokeAsync(judgePrompt, ct);

3 agents in parallel + 1 judge call = 4 calls, roughly 4× single-agent cost (not 9×).

✅ Correct — different models for diversity

var gpt = new ChatClientAgent(openAiClient) { Name = "GPT", Instructions = "..." };
var claude = new ChatClientAgent(anthropicClient) { Name = "Claude", Instructions = "..." };
var gemini = new ChatClientAgent(googleClient) { Name = "Gemini", Instructions = "..." };

Reduces shared bias: models trained on different data are less likely to share the same blind spots.


Design patterns for this topic

Pattern 1 — "Proposer + N critics"

  • Intent: code/plan review with distinct lenses.
  • Mechanism: one author, multiple specialist critics, round-robin until critics approve.

Pattern 2 — "Independent + judge"

  • Intent: cheap accuracy boost via parallelism.
  • Mechanism: N agents answer independently; judge picks.

Pattern 3 — "Self-refine loop"

  • Intent: iteratively improve a single output.
  • Mechanism: one proposer, one critic, K rounds.

Pattern 4 — "Adversarial Pro/Con"

  • Intent: surface arguments on both sides of a yes/no.
  • Mechanism: forced-position agents + judge.

Pattern 5 — "Heterogeneous models"

  • Intent: reduce shared bias via model diversity.
  • Mechanism: mix providers/models across agents.

Pattern 6 — "Tournament ranking"

  • Intent: rank a set of candidates.
  • Mechanism: pairwise debates, winners advance.

Pros & cons / trade-offs

Aspect               | Pros                           | Cons
---------------------|--------------------------------|---------------------------------------------------
Round-robin debate   | Strong on reasoning            | Cost scales N×K
Independent + judge  | Embarrassingly parallel; cheap | No interaction; misses cross-pollination
Self-refine          | Cheap pair                     | Single critic viewpoint; blind spots
Tournament           | Produces ranking               | N − 1 debates over ~log2(N) rounds for N candidates
Heterogeneous models | Diversity                      | Multi-provider ops complexity
Forced adversarial   | Avoids sycophancy              | Can produce contrived disagreement

When to use / when to avoid

  • Use for high-stakes reasoning where wrong answers are expensive (legal, security, recommendations).
  • Use for code review with specialist lenses (security + perf + style).
  • Use for plan critique before execution.
  • Use independent + judge when you want cheap accuracy gains and tasks parallelize.
  • Avoid for deterministic lookups, simple transforms, well-defined tasks — single agent is fine.
  • Avoid when latency matters (debate is sequential; even parallel adds judge round trip).
  • Avoid without a hard round cap and token budget.

Interview Q&A

Q1. Why does multi-agent debate beat single-agent CoT? Errors in sampled chains are partially independent. Multiple agents catch each other's mistakes — same idea as code review.

Q2. Cite the foundational paper. Du et al. 2023, Improving Factuality and Reasoning in Language Models through Multiagent Debate. Liang et al. 2023 added the "encourage diversity" angle.

Q3. Variants? Independent + judge; self-refine; round-robin debate; tournament. Plus adversarial Pro/Con pairs.

Q4. How do you avoid sycophantic convergence? Explicit "your job is to disagree" instructions; forced adversarial roles; heterogeneous models; require concrete citations.

Q5. How do you bound cost? Hard cap on rounds (3 is plenty); summarize older rounds; cap tokens per turn; emit token metrics; alert on anomalies.

Q6. When is debate wasteful? Deterministic lookups, simple math, well-specified transforms — single agent gets it, debate just multiplies cost.

Q7. What's the cheapest debate variant? Independent + judge — N parallel + 1 judge = (N+1)× single-agent cost, not N×K×.

Q8. Microsoft Agent Framework primitive for debate? GroupChat orchestration with round-robin selection + termination strategy. Or build it via Sequential (proposer → critic). Magentic bakes planner-critic into planning.

Q9. Termination strategies? WhenAllAgree, MaxRounds, NoChangeFor(N), JudgeApproves.

Q10. Why use heterogeneous models? Different training data → different blind spots → less shared bias.

Q11. Connection to Magentic? Magentic's planner generates a plan; a critic agent reviews and the planner refines — that's debate at the planning layer.

Q12. Context window blowup mitigation? Cap rounds; summarize older rounds before injection; show only last round to each agent.

Q13. How is debate different from ensembling? Ensembling = N independent samples + aggregation. Debate = agents see each other's outputs and revise. Interaction is the key difference.


Gotchas / common mistakes

  • ⚠️ Vague critic instructions ("be helpful") produce sycophants. Demand concrete flaws.
  • ⚠️ No round cap → cost explosion.
  • ⚠️ No termination criterion → never converges.
  • ⚠️ Same model + same prompt = correlated errors. Diversity matters.
  • ⚠️ Showing full debate history every round → context per call grows every round, so total tokens grow quadratically.
  • ⚠️ Using debate for things a single agent solves → just burns money.
  • ⚠️ No telemetry on rounds-to-converge, agent-disagreement-rate.
  • ⚠️ Judge agent picks based on confidence not correctness — confident wrong answers win.
  • ⚠️ Auto selection strategy (LLM picks next speaker) adds a selection call per turn, roughly doubling call count silently.
  • ⚠️ Thinking debate fixes hallucination — it reduces it but doesn't eliminate it. Still ground in tools/RAG.

Further reading