
Python client library (python/web_researcher_mcp/) — typed SDK for direct use in Python apps, notebooks, and agents #237

@zoharbabin

Summary

pip install web-researcher-mcp already works, but it only gives you a CLI entry point that runs the MCP server. Python developers who want to call our 27 research tools directly from their apps, notebooks, or agent pipelines have to manage a server subprocess by hand, speak raw JSON-RPC over HTTP, and parse untyped dicts. This issue proposes a thin Python client library (python/web_researcher_mcp/, roughly 250 lines across client.py, models.py, and _server.py) that ships inside the existing wheel and turns the Go binary into a typed, first-class Python SDK: no extra install step, no new package, no dependency hell.


Why Now

Python is the primary language for AI tooling, and we're already there. LangChain has 90k+ GitHub stars and ~10 million monthly PyPI downloads. LlamaIndex, CrewAI, AutoGen, and Haystack together reach tens of millions more. The Jupyter ecosystem has ~22 million users (JupyterCon 2023). The mcp Python SDK was among the top-100 fastest-growing PyPI packages in Q1 2025. Every major search/research API competitor (Tavily, Exa, Perplexity, Brave, Serper) ships a typed Python SDK as their primary developer interface — not an afterthought.

We already ship 7 platform wheels. The binary is already on-disk after pip install. The HTTP transport is already built. The only thing missing is the 250-line wrapper that makes it feel like tavily-python or exa-py instead of a subprocess plumbing exercise.

The distribution gap is discoverability. Docker Hub shows 4,007 pulls vs 16 GitHub stars — our users are infrastructure-minded, not Python-library-minded. Adding a client library extends reach to the larger Python-AI audience without changing our architecture at all.

Competitor signal:

  • tavily-python: async/sync TavilyClient, typed SearchResponse, 5-line quickstart
  • exa-py: Exa(api_key).search(query) → typed Result objects
  • perplexity-client: wraps OpenAI-compatible HTTP, typed models
  • None of them do citation verification, retraction detection, or bibliography audit — that's our differentiated surface that Python devs can't easily get to today

Target Users

1. AI/ML Engineer building a RAG pipeline
Currently: calls langchain_community.tools.TavilySearchResults for retrieval, has no citation-verification step, ships hallucinated citations to prod. With the client: drops in WebResearcherClient as a LangChain tool, calls verify_citation() before indexing, eliminates fabricated sources at the retrieval gate.

2. Data scientist / researcher in a Jupyter notebook
Currently: pastes DOIs into Google Scholar manually during literature review, copy-pastes abstracts into notes. With the client: single async with WebResearcherClient() as c: block searches, scrapes, checks retractions, and exports a .bib file — all in one cell.

3. Python automation / content pipeline developer
Currently: generates blog drafts with an LLM, manually spot-checks claims, has no systematic verification. With the client: pipes each LLM output through audit_bibliography() + verify_citation() before publish, flags mischaracterized sources automatically.

4. AI agent framework developer (CrewAI / AutoGen / LlamaIndex)
Currently: writes a custom Tool wrapper that spawns a subprocess, parses JSON, handles errors, restarts on crash. With the client: WebResearcherClient.as_langchain_tools() (or equivalent) returns a ready-made tool list — all 27 tools available with no boilerplate.


Use Cases with Code Sketches

1. Async research in a Jupyter notebook

from web_researcher_mcp import WebResearcherClient

async with WebResearcherClient() as client:
    results = await client.web_search("CRISPR off-target effects 2024", num_results=5)
    for r in results:
        citation = await client.verify_citation(url=r.url, claim="CRISPR has no off-target effects")
        print(r.title, "—", citation.verdict)

2. LangChain RAG with citation verification

from web_researcher_mcp import WebResearcherClient
from langchain.tools import Tool

client = WebResearcherClient.sync()  # blocking wrapper around the async client

def verified_search(query: str) -> str:
    results = client.web_search(query)
    verified = [r for r in results if client.verify_citation(url=r.url).verdict == "supported"]
    return "\n".join(r.snippet for r in verified)

tool = Tool(name="verified_web_search", func=verified_search, description="...")

3. CrewAI agent with all research tools

from web_researcher_mcp import WebResearcherClient
from crewai import Agent

client = WebResearcherClient()
research_agent = Agent(
    role="Research Analyst",
    tools=client.as_crewai_tools(["web_search", "academic_search", "verify_citation"]),
    goal="Find and verify sources for the given claim"
)

4. FastAPI endpoint that adds research to any app

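from contextlib import asynccontextmanager

from fastapi import FastAPI
from web_researcher_mcp import WebResearcherClient

client = WebResearcherClient()

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Start the bundled server once at startup and stop it at shutdown,
    # instead of re-entering the client on every request.
    async with client:
        yield

app = FastAPI(lifespan=lifespan)

@app.get("/research")
async def research(query: str):
    results = await client.search_and_scrape(query)
    return {"results": results}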

5. Bibliography audit script for academic writing

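from pathlib import Path

from web_researcher_mcp import WebResearcherClient

client = WebResearcherClient.sync()  # blocking wrapper, suits a plain script
# read_text() closes the file, unlike a bare open(...).read()
report = client.audit_bibliography(Path("paper.bib").read_text(encoding="utf-8"), format="bibtex")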
retracted = [e for e in report.entries if e.retraction_status == "retracted"]
print(f"{len(retracted)} retracted sources found")

6. CLI tool for a Jupyter magic / VS Code extension

# jupyter magic: %%websearch
from web_researcher_mcp import WebResearcherClient, get_binary_path

# get_binary_path() already exists in the wheel shim — expose it publicly
# so downstream tools can locate the binary without launching it
binary = get_binary_path()  # /path/to/web_researcher_mcp/bin/web-researcher-mcp

Current State

What exists today:

  • ✅ 7 platform wheels on PyPI (pip install web-researcher-mcp)
  • ✅ Binary bundled in wheel at web_researcher_mcp/bin/web-researcher-mcp
  • ✅ get_binary_path() + main() in the wheel shim (scripts/build_wheels.py)
  • ✅ HTTP transport: PORT=8080 web-researcher-mcp → MCP JSON-RPC over HTTP
  • ✅ 27 tools, typed MCP tool schemas
  • ✅ MCP Python SDK (mcp package) has a client implementation

What's missing:

  • ❌ No way to from web_researcher_mcp import WebResearcherClient
  • ❌ No subprocess manager (start binary on PORT, stop on context exit)
  • ❌ No typed Python wrappers (callers get raw dict, not SearchResult)
  • ❌ No requirements in wheel (the mcp SDK is not listed as a dep)
  • ❌ No documentation on using the wheel programmatically

Proposed Design

Package structure (additions to the existing wheel)

python/web_researcher_mcp/
├── __init__.py          # re-exports: WebResearcherClient, get_binary_path, __version__
├── client.py            # WebResearcherClient (async-first, sync wrapper)
├── models.py            # Typed dataclasses: SearchResult, VerifyResult, ScrapeResult, ...
└── _server.py           # _ServerProcess: start/stop the Go binary, find free port

build_wheels.py already builds the wheel from a generated __init__.py shim. The new files live in python/web_researcher_mcp/ in the repo and are bundled into the wheel at build time — same pipeline, no new infra.
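As a rough illustration of what models.py could contain, here is a minimal sketch of one typed result. The field names (title, url, snippet) are assumptions taken from the attributes used in the use cases above, and from_dict and parse_search_results are hypothetical helpers, not an existing API.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class SearchResult:
    """One search hit. Field names are assumed from the use-case sketches."""

    title: str
    url: str
    snippet: str = ""

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> "SearchResult":
        # Read only the known keys, so extra fields from a newer server are ignored.
        return cls(
            title=str(raw.get("title", "")),
            url=str(raw["url"]),  # a hit without a URL is treated as malformed
            snippet=str(raw.get("snippet", "")),
        )


def parse_search_results(payload: list[dict[str, Any]]) -> list[SearchResult]:
    """Convert the raw list of dicts returned by a tool call into typed objects."""
    return [SearchResult.from_dict(item) for item in payload]
```

Plain dataclasses keep the wheel free of extra dependencies. A validation library would give stricter checks, but at the cost of a new requirement.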

Core API

class WebResearcherClient:
    """Async-first client. Use as async context manager or call start()/stop() manually.
    
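    Sync wrapper: use WebResearcherClient.sync(), which returns a SyncWebResearcherClient
    that runs each async method to completion. asyncio.run() cannot be called from a
    thread that already has a running event loop (as in Jupyter), so the wrapper needs
    its own loop there.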
    """
    
    def __init__(self, port: int | None = None, timeout: float = 30.0): ...
    
    async def __aenter__(self) -> "WebResearcherClient": ...
    async def __aexit__(self, *_): ...
    
    # Core tools — typed return values
    async def web_search(self, query: str, *, num_results: int = 5, provider: str | None = None) -> list[SearchResult]: ...
    async def scrape_page(self, url: str) -> ScrapeResult: ...
    async def academic_search(self, query: str, **kwargs) -> list[SearchResult]: ...
    async def verify_citation(self, *, url: str | None = None, doi: str | None = None, claim: str | None = None) -> VerifyResult: ...
    async def audit_bibliography(self, bibliography: str, format: str = "bibtex") -> BibliographyReport: ...
    
    # Pass-through for any of the 27 tools, including those without a typed wrapper above
    async def call(self, tool_name: str, **params) -> dict: ...
    
    # Framework integration helpers
    def as_langchain_tools(self, tools: list[str] | None = None) -> list: ...

Subprocess lifecycle (_server.py)

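import os
import subprocess

class _ServerProcess:
    """Start the bundled binary on a free port, stop on context exit."""

    def __init__(self, port: int | None = None):
        # _find_free_port(), _wait_for_ready(), and get_binary_path() are module-level helpers
        self.port = port or _find_free_port()
        self._proc = None

    def start(self) -> int:  # returns actual port
        env = {**os.environ, "PORT": str(self.port)}
        self._proc = subprocess.Popen([get_binary_path()], env=env)
        try:
            _wait_for_ready(self.port)
        except Exception:
            self.stop()  # don't leak the child if it never becomes ready
            raise
        return self.port

    def stop(self):
        if self._proc is None:
            return  # never started, or already stopped
        self._proc.terminate()
        try:
            self._proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            self._proc.kill()  # escalate if terminate() is ignored
            self._proc.wait()
        self._proc = None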

Transport

Uses the MCP Python SDK (mcp>=1.0) as an optional dependency, falling back to raw httpx/urllib HTTP JSON-RPC if mcp is not installed. The endpoint is http://localhost:{port}/mcp (Streamable HTTP transport already supported by the binary).
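For the raw-HTTP fallback, the message framing is small. The sketch below builds an MCP tools/call request as a JSON-RPC 2.0 body and extracts text content from the response. The function names are hypothetical, and the sketch covers only the payloads: a real Streamable HTTP exchange also needs the initialize handshake and the appropriate request headers, which are omitted here.

```python
from __future__ import annotations

import json
from typing import Any


def build_tool_call(tool_name: str, arguments: dict[str, Any], request_id: int = 1) -> bytes:
    """Encode an MCP tools/call request as a JSON-RPC 2.0 body."""
    body = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(body).encode("utf-8")


def parse_tool_result(raw: bytes) -> list[str]:
    """Return the text content blocks of a tools/call response, raising on errors."""
    message = json.loads(raw)
    if "error" in message:
        # Protocol-level failure (unknown method, bad params, ...).
        err = message["error"]
        raise RuntimeError(f"JSON-RPC error {err.get('code')}: {err.get('message')}")
    result = message["result"]
    texts = [c["text"] for c in result.get("content", []) if c.get("type") == "text"]
    if result.get("isError"):
        # Tool-level failure: the call was valid but the tool itself failed.
        raise RuntimeError("tool reported an error: " + " ".join(texts))
    return texts
```

Keeping encoding and decoding separate from the HTTP call means the same functions can back either httpx or urllib, and they can be unit-tested without a running server.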

Sync wrapper

# For notebooks and scripts that don't use async
client = WebResearcherClient.sync()
results = client.web_search("climate change 2024")  # blocks

Implementation Scope

| File | Change |
| --- | --- |
| python/web_researcher_mcp/__init__.py | New: re-exports, __version__ |
| python/web_researcher_mcp/client.py | New: WebResearcherClient (~150 lines) |
| python/web_researcher_mcp/models.py | New: typed dataclasses (~60 lines) |
| python/web_researcher_mcp/_server.py | New: _ServerProcess (~50 lines) |
| scripts/build_wheels.py | Update to bundle python/web_researcher_mcp/*.py into wheel |
| tests/python/test_client.py | New: unit tests (mock server) + integration test |
| docs/TOOLS.md or new docs/PYTHON_CLIENT.md | Usage quickstart |

Total: ~350 lines new code, ~30 lines changed in build_wheels.py. No new Go changes.

Optional dependency on mcp SDK: add mcp>=1.0 to REQUIRES_DIST in the wheel metadata, or make it optional with a raw-HTTP fallback (recommended to keep the zero-deps spirit for basic usage).


Success Criteria

  • pip install web-researcher-mcp + from web_researcher_mcp import WebResearcherClient works on all 7 wheel platforms
  • async with WebResearcherClient() as c: results = await c.web_search("test") returns typed SearchResult objects
  • Sync wrapper: WebResearcherClient.sync().web_search("test") works without async boilerplate
  • WebResearcherClient.as_langchain_tools() returns a valid LangChain tool list (tested with langchain-core)
  • _ServerProcess starts and stops cleanly; port collision → retries with next port
  • Unit tests (mocked HTTP server) + one integration test in tests/python/
  • README quickstart updated with the 5-line Python usage example
  • make verify stays green (no new Go changes → no new Go CI gate; add python -m pytest tests/python/ to Makefile)
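The mocked-HTTP-server unit test could look roughly like the sketch below, which uses only the standard library. The handler echoes the tools/call params back as a text content block. The handler, the post_json helper, and the echo behaviour are all illustrative assumptions, not the real server's responses.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class MockMCPHandler(BaseHTTPRequestHandler):
    """Echo the tools/call params back as a single text content block."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length))
        reply = {
            "jsonrpc": "2.0",
            "id": request.get("id"),
            "result": {"content": [{"type": "text", "text": json.dumps(request["params"])}]},
        }
        body = json.dumps(reply).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args) -> None:
        pass  # keep test output quiet


def post_json(url: str, payload: dict) -> dict:
    # An empty ProxyHandler keeps the request local even if proxy env vars are set.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with opener.open(request, timeout=5) as response:
        return json.loads(response.read())


def test_tool_call_round_trip() -> None:
    server = HTTPServer(("127.0.0.1", 0), MockMCPHandler)  # port 0: the OS picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        payload = {
            "jsonrpc": "2.0",
            "id": 1,
            "method": "tools/call",
            "params": {"name": "web_search", "arguments": {"query": "test"}},
        }
        reply = post_json(f"http://127.0.0.1:{server.server_port}/mcp", payload)
        assert reply["id"] == 1
        assert json.loads(reply["result"]["content"][0]["text"]) == payload["params"]
    finally:
        server.shutdown()
        server.server_close()
```

Because the mock binds to port 0, the test does not need the Go binary and cannot collide with a developer's running server. The single integration test can then cover the real subprocess path.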

Labels

enhancement, python, dx
