Case Study · 01 / Flagship Project

Agentic Research Assistant

A 5-agent LangGraph pipeline that cuts the refusal rate by 46% (13/32 to 7/32) and raises F1 by 35% over single-pass RAG, with fact-checking, revision loops, and graceful provider failover.

Role: Sole Developer

Timeline: 6 weeks · May 2026

Stack: Python · FastAPI · LangGraph · Next.js · Groq · Tavily

Status: Live Demo ↗ · Source ↗

01 — Problem

Standard RAG hallucinates when sources disagree.

Standard RAG — retrieve once, generate once — frequently hallucinates when retrieved sources are weak, speculative, or contradictory. A model handed three search snippets and asked to synthesize will confidently produce an answer even when the underlying evidence doesn't support one.

For research-style multi-hop questions — the kind that require reasoning across multiple pieces of evidence — this failure mode is even worse. The model needs to decompose the question, retrieve evidence for each part, verify the parts agree, and only then synthesize. Single-pass RAG does none of that.

The question I wanted to answer: does adding agent-level structure to RAG actually reduce hallucination? Not 'is the output prettier?' — does the system give measurably more correct, more grounded answers? The only way to know was to build both — a single-pass RAG baseline AND a multi-agent pipeline — and run them head-to-head on the same benchmark.

02 — Approach

Five agents, one bounded revision loop.

The system runs two backends side-by-side: a single-pass RAG baseline at /research and a 5-agent LangGraph pipeline at /research/graph. Same model (Groq Llama 3.3 70B). Same search provider (Tavily). Same Pydantic schema. The only difference is the orchestration. This kept the evaluation comparison clean — every quality difference is attributable to the agent structure, not to confounding variables.

The 5 agents each have one job. Planner decomposes the query into sub-questions. Searcher fetches and ranks evidence via Tavily. FactChecker verifies claims against the retrieved sources. Writer drafts a grounded answer with citations. Critic reviews the draft and, if the grounding is weak, routes back to Writer for a revision. The loop is capped at 2 revisions to prevent infinite loops.
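The Critic's routing rule can be sketched as a plain function. This is a minimal illustration, not the project's code: the state keys (`critic_verdict`, `revision_count`) and the node names are hypothetical, and only the cap of 2 revisions comes from the description above.

```python
MAX_REVISIONS = 2  # cap stated in the text


def route_after_critic(state: dict) -> str:
    """Return the next node: 'writer' for another revision, 'end' to finish.

    The keys 'critic_verdict' and 'revision_count' are assumed names.
    """
    needs_revision = state.get("critic_verdict") == "revise"
    under_cap = state.get("revision_count", 0) < MAX_REVISIONS
    return "writer" if needs_revision and under_cap else "end"
```

In LangGraph, a function of this shape would typically serve as the routing callable on a conditional edge leaving the Critic node. Because the cap is checked in the router, the loop terminates even if the Critic never approves.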

Each agent is a thin async function returning a state delta. LangGraph merges deltas into a TypedDict state. This means every agent can be unit-tested in isolation with a minimal state dict — no graph wiring required. That decoupling paid off across 134 tests and 6 weeks of changes.
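A minimal sketch of the state-delta pattern, using only the standard library. The state fields and the string-splitting stand-in for the LLM call are assumptions for illustration; the real Planner prompt and schema are not shown in this write-up.

```python
import asyncio
from typing import TypedDict


class ResearchState(TypedDict, total=False):
    # Hypothetical fields; the project's actual state schema may differ.
    query: str
    sub_questions: list[str]


async def planner(state: ResearchState) -> ResearchState:
    """Return only the keys this agent changes (a state delta)."""
    # Stand-in for the LLM call: split a compound question on ' and '.
    parts = [p.strip() for p in state["query"].split(" and ") if p.strip()]
    return {"sub_questions": parts}


# The agent runs against a minimal state dict, with no graph involved.
delta = asyncio.run(planner({"query": "who founded X and when"}))
```

The test surface is just "state in, delta out", which is why no graph wiring is needed to exercise a single agent.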

03 — Results

Measured against a HotpotQA pilot.

The Day 21 evaluation ran on a stratified HotpotQA sample (N=32 successful pairs; the free-tier Tavily quota was exhausted after that). Both endpoints ran on identical infrastructure: same model, same search, same machine, same questions in the same order. The comparison is internally consistent, though at N=32 it is a pilot rather than a definitive benchmark.

35% relative F1 improvement (0.055 → 0.074). The graph's answers share more tokens with the reference answers than the baseline's do.
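For readers unfamiliar with token-overlap F1, here is a sketch of the metric as it is commonly computed for SQuAD-style and HotpotQA-style answers. The normalization steps (lowercasing, stripping punctuation and articles) are an assumption; the eval harness's exact normalization is not described here.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Relative gain behind the headline figure: (0.074 - 0.055) / 0.055
relative_gain = (0.074 - 0.055) / 0.055
```

Long, hedged answers score low on this metric even when they contain the right entity, because extra tokens reduce precision. That is one reason absolute F1 values can sit far below 1.0.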

Refusal rate cut nearly in half — from 41% (13/32) to 22% (7/32). The graph is materially more willing to commit to an answer when evidence supports one, and better at producing that answer when it commits. This was the result that mattered most: not just 'fewer wrong answers,' but 'more confident correct answers.'

Bridge-question F1 improved 31% (0.058 → 0.076). Bridge questions are the harder multi-hop subset — they require reasoning across two pieces of evidence. The graph's stronger performance on this subset suggests the architecture matters most where naive RAG struggles most.

Tradeoff: mean latency went from 3.6s baseline to 9.5s graph. P95 latency reaches 57s on complex multi-hop questions. That's the cost of the extra agents — every node adds an LLM call. Worth it for quality-sensitive use cases; not worth it where speed dominates correctness.

04 — Engineering decisions worth defending

The choices I'd defend in an interview.

Groq primary + Gemini fallback. Groq offers very fast free-tier inference for Llama 3. The Gemini fallback adds provider diversity, so a Groq outage doesn't take down the system. On any Groq error the client switches providers immediately rather than retrying a degraded one. That means two API contracts to maintain, but the reliability gain outweighs the integration cost.

Tavily for search, no fallback. Tavily is purpose-built for LLM retrieval: clean snippets, relevance scores, and an async-native client. Search is treated as must-succeed: a Tavily failure returns 503 rather than fabricating an ungrounded answer. This is a single-vendor dependency, but DuckDuckGo is scrape-only and SerpAPI returns raw search-engine result data that needs extra processing before it is usable as LLM context. The bet is that Tavily's uptime is high enough.

Searcher parallelized via asyncio.gather. Before Day 12, search ran sequentially — three sub-questions meant ~6s before fact-checking could start. Day 12 parallelized this with a 20s outer timeout. Search now takes ~2s regardless of plan size, bounded by the slowest single call. This cut multi-hop graph latency by roughly 40%.
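A small sketch of the fan-out, with `asyncio.sleep` standing in for the Tavily call. The function names and the 0.2s delays are illustrative; only `asyncio.gather` and the 20s outer timeout come from the text.

```python
import asyncio
import time


async def fake_search(sub_question: str, delay: float) -> str:
    """Stand-in for one search call; sleeps instead of hitting the network."""
    await asyncio.sleep(delay)
    return f"results for {sub_question}"


async def search_all(sub_questions, delays, timeout_s: float = 20.0):
    """Run all searches concurrently under a single outer timeout."""
    calls = [fake_search(q, d) for q, d in zip(sub_questions, delays)]
    # gather preserves input order; wait_for bounds the whole batch.
    return await asyncio.wait_for(asyncio.gather(*calls), timeout=timeout_s)


start = time.perf_counter()
results = asyncio.run(search_all(["a", "b", "c"], [0.2, 0.2, 0.2]))
elapsed = time.perf_counter() - start
```

Three 0.2s calls finish in roughly 0.2s total rather than 0.6s, which is the same effect as the ~6s to ~2s drop described above: total time tracks the slowest call, not the sum.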

FactChecker → Writer split. Separating claim extraction from synthesis means each LLM prompt has one job, and quality failures can be attributed to the right step. One extra LLM call per query, but the diagnostic clarity was worth it for an evaluation-driven project.

Critic falls back to 'approve' on parse failure. Safety over strictness. Shipping an unreviewed draft is no worse than having no Critic at all; an infinite loop or uncaught exception would be catastrophic. Parse failures are logged separately so bugs in the Critic prompt can be inspected without crashing the request.
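A sketch of the approve-on-parse-failure rule, assuming the Critic replies with JSON containing a `verdict` field. That response shape, the logger name, and the log message are hypothetical.

```python
import json
import logging

logger = logging.getLogger("critic")


def parse_critic_verdict(raw: str) -> str:
    """Return 'approve' or 'revise'; fall back to 'approve' if unparseable."""
    try:
        verdict = json.loads(raw)["verdict"]
        if verdict in ("approve", "revise"):
            return verdict
        raise ValueError(f"unexpected verdict: {verdict!r}")
    except (json.JSONDecodeError, KeyError, TypeError, ValueError) as exc:
        # Logged under its own message so prompt bugs can be found later.
        logger.warning("critic_parse_failure error=%s", exc)
        return "approve"
```

Malformed output, a missing key, and an unknown verdict value all take the same path, so the request completes with the unreviewed draft instead of raising.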

05 — Engineering practices

134 tests, structured logs, explicit timeouts.

134 mocked tests run in 4 seconds with no network and no API keys. Mocks are scoped at the module-attribute level via monkeypatch.setattr, targeting the specific import path where each function is called. This catches binding bugs that broader mocking would miss. Trade-off: mocks don't catch SDK-level parsing bugs. Those are caught by the eval harness, which is excluded from CI.
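Why the patch target matters can be shown in one file. This sketch swaps in the standard library's `unittest.mock.patch.object` for pytest's `monkeypatch.setattr` so that it runs without pytest, and it builds two throwaway modules (`demo_providers`, `demo_agents`, both hypothetical) to mimic a `from ... import` binding.

```python
import sys
import types
from unittest import mock

# Hypothetical module that defines the real search function.
providers = types.ModuleType("demo_providers")
exec("def search(query):\n    return 'live:' + query\n", providers.__dict__)
sys.modules["demo_providers"] = providers

# Hypothetical agent module that binds its own reference at import time.
agents = types.ModuleType("demo_agents")
exec(
    "from demo_providers import search\n"
    "def run_searcher(query):\n"
    "    return search(query)\n",
    agents.__dict__,
)


def fake_search(query):
    return "fake:" + query


# Patching the defining module leaves the agent's own binding untouched.
with mock.patch.object(providers, "search", fake_search):
    patched_at_definition = agents.run_searcher("q")

# Patching the attribute on the module where the call happens takes effect.
with mock.patch.object(agents, "search", fake_search):
    patched_at_call_site = agents.run_searcher("q")
```

In a pytest test the second case would correspond to `monkeypatch.setattr(agents, "search", fake_search)`. The first case silently calls the live function, which is the kind of binding bug that call-site patching exposes.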

Every agent emits one structured log line. Node name, latency, model, token counts. This makes filtering trivial — by node, by query, by time. The eval harness writes these telemetry fields alongside answers to JSONL, so per-query performance is queryable later.
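One way to produce such a line is a JSON object per node run. The field names and the model string below are placeholders chosen for illustration; the project's actual log schema is not reproduced here.

```python
import json
import time


def build_log_line(
    node: str,
    model: str,
    started: float,
    prompt_tokens: int,
    completion_tokens: int,
) -> str:
    """Serialize one node's telemetry as a single JSON line."""
    record = {
        "node": node,
        "latency_ms": round((time.perf_counter() - started) * 1000, 1),
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
    }
    return json.dumps(record)


started = time.perf_counter()
lines = [
    build_log_line("planner", "llama-3.3-70b", started, 120, 40),
    build_log_line("writer", "llama-3.3-70b", started, 900, 310),
]

# Filtering by node is a one-line comprehension over parsed records.
writer_records = [r for r in map(json.loads, lines) if r["node"] == "writer"]
```

Because each line is self-describing JSON, the same records can be appended to a JSONL file and queried later without a custom parser.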

Every I/O boundary has an explicit timeout, and the HTTP status code distinguishes timeouts on our side (504) from downstream provider failures (503). Whoever is diagnosing an incident knows immediately whether to look at query complexity or at the Tavily/Groq status pages. That clarity took two refactors to get right.
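The 504-versus-503 split could be sketched as follows, assuming 504 is returned when our own timeout fires and 503 when a provider call raises. `DownstreamError`, `call_with_status`, and the stub coroutines are hypothetical names, and the framework-specific response handling is left out.

```python
import asyncio


class DownstreamError(Exception):
    """Hypothetical: raised when a search or LLM provider call fails."""


async def call_with_status(coro_factory, timeout_s: float) -> tuple[int, object]:
    """Run one I/O call under a timeout and map the outcome to a status code."""
    try:
        result = await asyncio.wait_for(coro_factory(), timeout=timeout_s)
        return 200, result
    except asyncio.TimeoutError:
        return 504, "timed out"
    except DownstreamError as exc:
        return 503, str(exc)


async def _ok():
    return "answer"


async def _slow():
    await asyncio.sleep(1)
    return "too late"


async def _broken():
    raise DownstreamError("search provider unavailable")


async def _demo():
    return [
        (await call_with_status(_ok, 0.5))[0],
        (await call_with_status(_slow, 0.05))[0],
        (await call_with_status(_broken, 0.5))[0],
    ]


statuses = asyncio.run(_demo())
```

Keeping the mapping in one wrapper means every boundary reports failures the same way, rather than each endpoint choosing its own codes.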

06 — Honest limitations

What this system doesn't do — yet.

No streaming. Responses block until generation completes. Long answers feel slow even when correct.

No caching, no rate limiting, no auth. Each query is a fresh round-trip. A runaway client can exhaust quotas. The endpoints are public. Fine for a portfolio demo; not production-ready.

Soft hallucinations still possible. The Critic evaluates the draft against retrieved facts, not against ground truth. A misleading source that makes it through the Searcher can persist into the final answer. All Tavily results are treated as equally credible — there's no source-quality signal yet.

Planner entity-resolution failures propagate. Day 21 example: 'Hawker Hurricane / No. 1455 Flight' got resolved to Southwest Airlines flight 1455 instead of an RAF unit. Once the Planner mis-resolves an entity, every downstream node operates on the wrong concept with no recovery path.

Graph latency varies widely. 3–26s across smoke queries, P95 reaching 57s on complex multi-hop. No short-circuit path for simple queries that don't benefit from the full pipeline. The right fix is a router that chooses baseline-or-graph by query complexity. That's a v2 feature.

07 — What I learned

Three lessons I'm taking forward.

Build the eval before building the system. I had the agent graph running before the eval harness — which meant several days of 'is it actually better?' anxiety before I had numbers. Building the eval first would have given me a per-day quality signal.

Deploy earlier. I built for 4 weeks before the first deployment attempt. When Vercel's monorepo detection broke in Week 5, I had to do real archaeology to figure out which code, which config, which platform setting was at fault. A 'deploy at Week 2' rule would have surfaced that issue when there was less to disentangle.

Per-step UI feedback matters more than I expected. The 5-step agent progress simulator on the frontend started as a polish item. It ended up being the single biggest qualitative improvement — recruiters and users understand a 30-second wait when they can see five steps progressing instead of staring at a spinner. Visible work feels faster than invisible work.
