EigentSearch-Q+: Enhancing Deep Research Agents with Structured Reasoning Tools
Adding structured query-planning tools to an AI research agent lifts benchmark accuracy by 0.6 to 3.8 percentage points, a modest but real improvement in how agents search the web.

The Thesis
Most AI agents that browse the web to answer hard questions do so sloppily: they issue redundant queries, lose track of what they've already found, and stitch evidence together poorly. This paper introduces Q+, a small set of explicit tools for query planning, progress monitoring, and evidence extraction, then bolts them onto Eigent, an open-source multi-agent system. The result, EigentSearch-Q+, gains 0.6 to 3.8 percentage points of accuracy across four standard benchmarks, depending on which large language model (LLM) backend is used. The gains are real but narrow, and the paper is fundamentally a systems engineering contribution rather than a theoretical advance. It matters because deep-research agents — systems that can autonomously investigate a topic and synthesize answers from live web sources — are becoming a commercial product category, and small, consistent accuracy gains translate directly into user trust.
Catalyst
Anthropic's public release of its 'think' tool paradigm in early 2025 gave the field a concrete vocabulary for making LLM reasoning steps explicit rather than implicit. Simultaneously, new benchmarks like FRAMES and WebWalkerQA created standardized test beds for multi-hop web research, making it possible to measure incremental improvements rigorously. These two developments — a design pattern and a measurement standard — are what make this kind of careful tooling work publishable and evaluable right now.
What's New
Earlier deep-research agents, including the base Eigent system this paper extends, treat web search as an opaque subroutine: the agent decides what to search, but there is no explicit mechanism for tracking search history, planning follow-up queries, or chunking long web pages into digestible evidence. Q+ adds three explicit tools — a query planner, a progress monitor, and an evidence extractor — that sit between the LLM and the browser. The claimed advantage is more coherent, less redundant search trajectories rather than raw speed or cost reduction.
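The three tools can be pictured as small, explicit functions wrapped around a shared search state. The sketch below is hypothetical: the names, signatures, and heuristics are illustrative assumptions, not the paper's actual Q+ interface.

```python
# Hypothetical sketch of the three Q+ tools: a query planner, a progress
# monitor, and an evidence extractor, all operating on shared search state.
# Names and signatures are illustrative, not the paper's actual API.
from dataclasses import dataclass, field

@dataclass
class SearchState:
    """Tracks what the agent has already searched and found."""
    issued_queries: list = field(default_factory=list)
    evidence: list = field(default_factory=list)  # dicts with a "sub_question" key

def plan_queries(state: SearchState, candidates: list) -> list:
    """Query planner: drop candidate queries that duplicate search history."""
    seen = {q.lower() for q in state.issued_queries}
    return [q for q in candidates if q.lower() not in seen]

def monitor_progress(state: SearchState, sub_questions: list) -> dict:
    """Progress monitor: report which sub-questions have supporting evidence."""
    covered = {e["sub_question"] for e in state.evidence}
    return {sq: (sq in covered) for sq in sub_questions}

def extract_evidence(page_text: str, query: str, chunk_size: int = 500) -> list:
    """Evidence extractor: chunk a long page, keep chunks mentioning query terms."""
    terms = [t for t in query.lower().split() if len(t) > 3]
    chunks = [page_text[i:i + chunk_size]
              for i in range(0, len(page_text), chunk_size)]
    return [c for c in chunks if any(t in c.lower() for t in terms)]
```

The point of making these steps explicit tools, rather than leaving them implicit in the LLM's context window, is that the agent's search history and evidence coverage become inspectable state it can reason over on every turn.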
The Counter
The accuracy gains here — 0.6 to 3.8 percentage points — are small enough that they may not survive contact with real users. Benchmark performance on SimpleQA-Verified or FRAMES does not reliably predict whether a product feels more useful in practice. The paper tests only three model backends, two from a single vendor (OpenAI's GPT-4.1 and GPT-5.1) plus one outlier (Minimax's M2.5), so it's unclear whether Q+ helps or hurts with other frontier models like Claude or Gemini. More importantly, the 'think' tool paradigm this work is built on adds latency and token cost with every explicit reasoning step — costs the paper does not account for. Finally, the baseline (vanilla Eigent) is not a particularly strong competitor; comparing against commercial deep-research products like Perplexity's research mode or OpenAI's deep research tool would be far more convincing. This reads as a solid engineering blog post dressed as a research paper.
Longs
- MSFT — GitHub hosts Eigent's open-source repo; Azure is a likely deployment target for enterprise research agents
- GOOGL — Google Search API remains a primary web-access layer for agents like this; more agent queries mean more API revenue
- BBAI — BigBear.ai, a smaller AI analytics firm exposed to agentic enterprise workflows
- ARKW (ARK Next Generation Internet ETF) — broad agentic AI exposure; ARKG (ARK Genomic Revolution ETF) is not applicable here
- Perplexity AI (private) — direct competitor in the AI-native search space; this paper's results inform their roadmap
Shorts
- Perplexity AI — if structured agentic search becomes a commodity built into open-source frameworks, their moat of 'better search UX' narrows
- You.com and similar AI search startups — competing on implicit retrieval quality, which this paper argues is the wrong axis
Enablers (Picks & Shovels)
- OpenAI GPT-4.1 and GPT-5.1 APIs — used as the primary LLM backends in the benchmarks
- Minimax M2.5 — a Chinese frontier model used as a cost-efficient backend; its inclusion suggests Q+ is not tied to a single vendor, though its gain there was the smallest (0.6 pp)
- FRAMES and WebWalkerQA benchmarks — the standardized evaluation suites that make these incremental gains measurable
- Anthropic's 'think' tool design pattern — the conceptual template that motivated Q+'s explicit reasoning steps
Private Watchlist
- Perplexity AI — builds AI-native search products that face direct architectural competition from structured-agent approaches
- Eigent (open-source project, not yet a standalone company) — the base platform this paper extends
- Exa AI — provides neural web-search APIs that agent frameworks like this depend on for high-quality retrieval
Resources
The Paper
Deep research requires reasoning over web evidence to answer open-ended questions, and it is a core capability for AI agents. Yet many deep research agents still rely on implicit, unstructured search behavior that causes redundant exploration and brittle evidence aggregation. Motivated by Anthropic's "think" tool paradigm and insights from the information-retrieval literature, we introduce Q+, a set of query and evidence processing tools that make web search more deliberate by guiding query planning, monitoring search progress, and extracting evidence from long web snapshots. We integrate Q+ into the browser sub-agent of Eigent, an open-source, production-ready multi-agent workforce for computer use, yielding EigentSearch-Q+. Across four benchmarks (SimpleQA-Verified, FRAMES, WebWalkerQA, and X-Bench DeepSearch), Q+ improves Eigent's browser agent benchmark-size-weighted average accuracy by 3.0, 3.8, and 0.6 percentage points (pp) for GPT-4.1, GPT-5.1, and Minimax M2.5 model backends, respectively. Case studies further suggest that EigentSearch-Q+ produces more coherent tool-calling trajectories by making search progress and evidence handling explicit.
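The headline numbers are benchmark-size-weighted averages, meaning each benchmark's accuracy contributes in proportion to its question count. A minimal sketch of that computation follows; the accuracies and benchmark sizes below are hypothetical placeholders, since the paper's per-benchmark figures are not reproduced here.

```python
# Benchmark-size-weighted average accuracy: each benchmark's accuracy is
# weighted by its number of questions. All values below are hypothetical.
def weighted_accuracy(results: dict) -> float:
    """results maps benchmark name -> (accuracy, number_of_questions)."""
    total = sum(n for _, n in results.values())
    return sum(acc * n for acc, n in results.values()) / total

hypothetical = {
    "SimpleQA-Verified":  (0.70, 1000),
    "FRAMES":             (0.55, 800),
    "WebWalkerQA":        (0.50, 680),
    "X-Bench DeepSearch": (0.40, 100),
}
print(round(weighted_accuracy(hypothetical), 4))  # -> 0.5891
```

Because larger suites dominate the weighted average, a big jump on a small benchmark like X-Bench DeepSearch moves the headline number far less than a small jump on SimpleQA-Verified.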