Summary
pip install web-researcher-mcp already works — but it only gives you a CLI entry point that runs the MCP server. Python developers who want to call our 27 research tools directly from their apps, notebooks, or agent pipelines have to manually manage an HTTP subprocess, speak raw JSON-RPC, and parse untyped dicts. This issue proposes a thin Python client library (python/web_researcher_mcp/client.py, ~250 lines) that ships inside the existing wheel and turns the Go binary into a typed, first-class Python SDK — no extra install step, no new package, no dependency hell.
Why Now
Python is the primary language for AI tooling, and we're already there. LangChain has 90k+ GitHub stars and ~10 million monthly PyPI downloads. LlamaIndex, CrewAI, AutoGen, and Haystack together reach tens of millions more. The Jupyter ecosystem has ~22 million users (JupyterCon 2023). The mcp Python SDK was among the top-100 fastest-growing PyPI packages in Q1 2025. Every major search/research API competitor (Tavily, Exa, Perplexity, Brave, Serper) ships a typed Python SDK as their primary developer interface — not an afterthought.
We already ship 7 platform wheels. The binary is already on-disk after pip install. The HTTP transport is already built. The only thing missing is the 250-line wrapper that makes it feel like tavily-python or exa-py instead of a subprocess plumbing exercise.
The distribution gap is discoverability. Docker Hub shows 4,007 pulls vs 16 GitHub stars — our users are infrastructure-minded, not Python-library-minded. Adding a client library extends reach to the larger Python-AI audience without changing our architecture at all.
Competitor signal:
- `tavily-python`: async/sync `TavilyClient`, typed `SearchResponse`, 5-line quickstart
- `exa-py`: `Exa(api_key)` → `.search(query)` → typed `Result` objects
- `perplexity-client`: wraps OpenAI-compatible HTTP, typed models
- None of them do citation verification, retraction detection, or bibliography audit — that's our differentiated surface that Python devs can't easily get to today
Target Users
1. AI/ML Engineer building a RAG pipeline
Currently: calls langchain_community.tools.TavilySearchResults for retrieval, has no citation-verification step, ships hallucinated citations to prod. With the client: drops in WebResearcherClient as a LangChain tool, calls verify_citation() before indexing, eliminates fabricated sources at the retrieval gate.
2. Data scientist / researcher in a Jupyter notebook
Currently: pastes DOIs into Google Scholar manually during literature review, copy-pastes abstracts into notes. With the client: single async with WebResearcherClient() as c: block searches, scrapes, checks retractions, and exports a .bib file — all in one cell.
3. Python automation / content pipeline developer
Currently: generates blog drafts with an LLM, manually spot-checks claims, has no systematic verification. With the client: pipes each LLM output through audit_bibliography() + verify_citation() before publish, flags mischaracterized sources automatically.
4. AI agent framework developer (CrewAI / AutoGen / LlamaIndex)
Currently: writes a custom Tool wrapper that spawns a subprocess, parses JSON, handles errors, restarts on crash. With the client: WebResearcherClient.as_langchain_tools() (or equivalent) returns a ready-made tool list — all 27 tools available with no boilerplate.
Use Cases with Code Sketches
1. Async research in a Jupyter notebook
```python
from web_researcher_mcp import WebResearcherClient

# Top-level `async with` works in a notebook cell because Jupyter already runs an event loop.
async with WebResearcherClient() as client:
    results = await client.web_search("CRISPR off-target effects 2024", num_results=5)
    for r in results:
        citation = await client.verify_citation(url=r.url, claim="CRISPR has no off-target effects")
        print(f"{r.title}: {citation.verdict}")
```
2. LangChain RAG with citation verification
```python
from langchain_core.tools import Tool
from web_researcher_mcp import WebResearcherClient

client = WebResearcherClient.sync()  # blocking wrapper around the async client

def verified_search(query: str) -> str:
    results = client.web_search(query)
    # Keep only results whose citation check comes back as supported.
    verified = [r for r in results if client.verify_citation(url=r.url).verdict == "supported"]
    return "\n".join(r.snippet for r in verified)

tool = Tool(name="verified_web_search", func=verified_search, description="...")
```
3. CrewAI agent with all research tools
```python
from crewai import Agent
from web_researcher_mcp import WebResearcherClient

client = WebResearcherClient()
research_agent = Agent(
    role="Research Analyst",
    goal="Find and verify sources for the given claim",
    backstory="...",  # CrewAI's Agent expects a backstory alongside role and goal
    tools=client.as_crewai_tools(["web_search", "academic_search", "verify_citation"]),
)
```
4. FastAPI endpoint that adds research to any app
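```python
from contextlib import asynccontextmanager

from fastapi import FastAPI
from web_researcher_mcp import WebResearcherClient

client = WebResearcherClient()

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Start the bundled server once at startup and stop it at shutdown,
    # instead of spawning a subprocess on every request.
    async with client:
        yield

app = FastAPI(lifespan=lifespan)

@app.get("/research")
async def research(query: str):
    results = await client.search_and_scrape(query)
    return {"results": results}
```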
5. Bibliography audit script for academic writing
```python
from pathlib import Path

from web_researcher_mcp import WebResearcherClient

client = WebResearcherClient.sync()  # blocking wrapper; audit_bibliography() is async on the base client
report = client.audit_bibliography(Path("paper.bib").read_text(encoding="utf-8"), format="bibtex")
retracted = [e for e in report.entries if e.retraction_status == "retracted"]
print(f"{len(retracted)} retracted sources found")
```
6. CLI tool for a Jupyter magic / VS Code extension
```python
# jupyter magic: %%websearch
from web_researcher_mcp import WebResearcherClient, get_binary_path

# get_binary_path() already exists in the wheel shim; expose it publicly
# so downstream tools can locate the binary without launching it
binary = get_binary_path()  # /path/to/web_researcher_mcp/bin/web-researcher-mcp
```
Current State
What exists today:
- ✅ 7 platform wheels on PyPI (`pip install web-researcher-mcp`)
- ✅ Binary bundled in wheel at `web_researcher_mcp/bin/web-researcher-mcp`
- ✅ `get_binary_path()` + `main()` in the wheel shim (`scripts/build_wheels.py`)
- ✅ HTTP transport: `PORT=8080 web-researcher-mcp` → MCP JSON-RPC over HTTP
- ✅ 27 tools, typed MCP tool schemas
- ✅ MCP Python SDK (`mcp` package) has a client implementation
What's missing:
- ❌ No way to `from web_researcher_mcp import WebResearcherClient`
- ❌ No subprocess manager (start binary on `PORT`, stop on context exit)
- ❌ No typed Python wrappers (callers get raw `dict`, not `SearchResult`)
- ❌ No `requirements` in wheel (the `mcp` SDK is not listed as a dep)
- ❌ No documentation on using the wheel programmatically
Proposed Design
Package structure (additions to the existing wheel)
python/web_researcher_mcp/
├── __init__.py # re-exports: WebResearcherClient, get_binary_path, __version__
├── client.py # WebResearcherClient (async-first, sync wrapper)
├── models.py # Typed dataclasses: SearchResult, VerifyResult, ScrapeResult, ...
└── _server.py # _ServerProcess: start/stop the Go binary, find free port
build_wheels.py already builds the wheel from a generated __init__.py shim. The new files live in python/web_researcher_mcp/ in the repo and are bundled into the wheel at build time — same pipeline, no new infra.
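As a rough illustration of what models.py might contain, the sketch below defines a few dataclasses. The field names `title`, `url`, `snippet`, `entries`, and `retraction_status` are taken from the use-case sketches above. The `from_dict` helper, the `BibliographyEntry` class name, its `key` field, and all default values are assumptions made for illustration, not a confirmed schema.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class SearchResult:
    """One search hit; fields mirror what the use-case sketches access."""
    title: str
    url: str
    snippet: str = ""

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "SearchResult":
        # Hypothetical helper: pick known keys and ignore the rest, so extra
        # fields in a server response do not break the client.
        return cls(
            title=raw.get("title", ""),
            url=raw.get("url", ""),
            snippet=raw.get("snippet", ""),
        )


@dataclass(frozen=True)
class BibliographyEntry:
    """Hypothetical per-entry record inside a BibliographyReport."""
    key: str
    retraction_status: str = "unknown"


@dataclass(frozen=True)
class BibliographyReport:
    entries: List[BibliographyEntry] = field(default_factory=list)
```

Frozen dataclasses keep results hashable and read-only, which tends to suit values that are passed between pipeline stages; a mutable variant would work just as well if callers need to annotate results in place.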
Core API
```python
class WebResearcherClient:
    """Async-first client. Use as an async context manager or call start()/stop() manually.

    Sync wrapper: use WebResearcherClient.sync(), which returns a SyncWebResearcherClient
    that wraps each async method with asyncio.run().
    """

    def __init__(self, port: int | None = None, timeout: float = 30.0): ...
    async def __aenter__(self) -> "WebResearcherClient": ...
    async def __aexit__(self, *_): ...

    # Core tools with typed return values
    async def web_search(self, query: str, *, num_results: int = 5, provider: str | None = None) -> list[SearchResult]: ...
    async def scrape_page(self, url: str) -> ScrapeResult: ...
    async def academic_search(self, query: str, **kwargs) -> list[SearchResult]: ...
    async def verify_citation(self, *, url: str | None = None, doi: str | None = None, claim: str | None = None) -> VerifyResult: ...
    async def audit_bibliography(self, bibliography: str, format: str = "bibtex") -> BibliographyReport: ...

    # Pass-through for any of the 27 tools, including those without a typed wrapper above
    async def call(self, tool_name: str, **params) -> dict: ...

    # Framework integration helpers
    def as_langchain_tools(self, tools: list[str] | None = None) -> list: ...
```
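One way the typed methods could sit on top of the generic `call()` pass-through is sketched below. `SketchClient` is a stand-in, not the proposed class: its `call()` returns a canned dict instead of talking to the server, and both the response shape and the rule that either `url` or `doi` must be given are assumptions for illustration.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class VerifyResult:
    verdict: str


class SketchClient:
    """Stand-in showing a typed wrapper layered over a generic call()."""

    async def call(self, tool_name: str, **params: Any) -> Dict[str, Any]:
        # Placeholder transport: a real client would send the request to the
        # server. The {"verdict": ...} response shape is an assumption.
        return {"tool": tool_name, "verdict": "supported", "params": params}

    async def verify_citation(
        self,
        *,
        url: Optional[str] = None,
        doi: Optional[str] = None,
        claim: Optional[str] = None,
    ) -> VerifyResult:
        # Assumed validation rule: a citation needs some identifier.
        if url is None and doi is None:
            raise ValueError("provide url or doi")
        # Send only the arguments the caller actually set.
        candidates = {"url": url, "doi": doi, "claim": claim}
        params = {k: v for k, v in candidates.items() if v is not None}
        raw = await self.call("verify_citation", **params)
        return VerifyResult(verdict=raw["verdict"])
```

With this layering, a tool that has no typed wrapper yet stays reachable through `call()`, and adding a wrapper later does not change the transport code.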
Subprocess lifecycle (_server.py)
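```python
import os
import subprocess


class _ServerProcess:
    """Start the bundled binary on a free port, stop on context exit."""

    def __init__(self, port: int | None = None):
        self.port = port or _find_free_port()
        self._proc: subprocess.Popen | None = None

    def start(self) -> int:  # returns actual port
        env = {**os.environ, "PORT": str(self.port)}
        self._proc = subprocess.Popen([get_binary_path()], env=env)
        _wait_for_ready(self.port)
        return self.port

    def stop(self) -> None:
        if self._proc is None:
            return  # never started, nothing to stop
        self._proc.terminate()
        try:
            self._proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            # The binary did not exit in time; force-kill it so no orphan is left.
            self._proc.kill()
            self._proc.wait()
```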
Transport
Uses the MCP Python SDK (mcp>=1.0) as an optional dependency, falling back to raw httpx/urllib HTTP JSON-RPC if mcp is not installed. The endpoint is http://localhost:{port}/mcp (Streamable HTTP transport already supported by the binary).
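For the raw-HTTP fallback, the request side could be as small as the sketch below. It builds a JSON-RPC 2.0 `tools/call` envelope, which is the MCP method for invoking a tool, and wraps it in a POST to the local `/mcp` endpoint. The function names are hypothetical, and the sketch deliberately leaves out the MCP `initialize` handshake, session handling, and parsing of event-stream responses, all of which a working fallback would likely need.

```python
import json
import urllib.request
from typing import Any, Dict


def build_tool_call(tool_name: str, arguments: Dict[str, Any], request_id: int = 1) -> Dict[str, Any]:
    """Build the JSON-RPC 2.0 envelope for an MCP tools/call request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


def build_http_request(port: int, payload: Dict[str, Any]) -> urllib.request.Request:
    """Wrap a JSON-RPC payload in a POST to the local /mcp endpoint."""
    return urllib.request.Request(
        url=f"http://localhost:{port}/mcp",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # A Streamable HTTP server may answer with plain JSON or an SSE stream.
            "Accept": "application/json, text/event-stream",
        },
        method="POST",
    )
```

Keeping request construction separate from sending makes the envelope easy to unit-test without a running server.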
Sync wrapper
```python
# For notebooks and scripts that don't use async
client = WebResearcherClient.sync()
results = client.web_search("climate change 2024")  # blocks
```
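One caveat for notebooks: `asyncio.run()` raises `RuntimeError` when the calling thread already has a running event loop, which is the case inside a Jupyter cell. The sketch below shows one possible alternative, not the proposed implementation: coroutines run on a private event loop in a background thread, so the blocking calls behave the same in scripts and notebooks. `_LoopThread` and `SyncProxy` are hypothetical names.

```python
import asyncio
import inspect
import threading
from typing import Any, Coroutine


class _LoopThread:
    """Run coroutines on a private event loop in a daemon thread."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro: Coroutine[Any, Any, Any]) -> Any:
        # Schedule on the background loop and block until the result is ready.
        return asyncio.run_coroutine_threadsafe(coro, self._loop).result()

    def close(self) -> None:
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


class SyncProxy:
    """Expose an async object's coroutine methods as blocking calls."""

    def __init__(self, async_obj: Any) -> None:
        self._async_obj = async_obj
        self._runner = _LoopThread()

    def __getattr__(self, name: str) -> Any:
        attr = getattr(self._async_obj, name)
        if not inspect.iscoroutinefunction(attr):
            return attr  # plain attributes and sync methods pass through

        def blocking(*args: Any, **kwargs: Any) -> Any:
            return self._runner.run(attr(*args, **kwargs))

        return blocking

    def close(self) -> None:
        self._runner.close()
```

A proxy like this avoids hand-writing one sync method per async method, at the cost of losing static type information on the sync side; generating explicit typed wrappers is the other reasonable choice.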
Implementation Scope
| File | Change |
| --- | --- |
| `python/web_researcher_mcp/__init__.py` | New: re-exports, `__version__` |
| `python/web_researcher_mcp/client.py` | New: `WebResearcherClient` (~150 lines) |
| `python/web_researcher_mcp/models.py` | New: typed dataclasses (~60 lines) |
| `python/web_researcher_mcp/_server.py` | New: `_ServerProcess` (~50 lines) |
| `scripts/build_wheels.py` | Update to bundle `python/web_researcher_mcp/*.py` into wheel |
| `tests/python/test_client.py` | New: unit tests (mock server) + integration test |
| `docs/TOOLS.md` or new `docs/PYTHON_CLIENT.md` | Usage quickstart |
Total: ~350 lines new code, ~30 lines changed in build_wheels.py. No new Go changes.
Optional dependency on mcp SDK: add mcp>=1.0 to REQUIRES_DIST in the wheel metadata, or make it optional with a raw-HTTP fallback (recommended to keep the zero-deps spirit for basic usage).
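The mock-server unit tests listed for tests/python/test_client.py could be built on the standard library alone. The sketch below starts a throwaway HTTP server that answers any POST with a canned JSON-RPC reply, sends one `tools/call` request to it, and returns the parsed response. The handler, the helper name, and the canned `result` payload are all hypothetical; the payload in particular does not reproduce what the real binary returns.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class MockMCPHandler(BaseHTTPRequestHandler):
    """Answer every POST with a canned JSON-RPC result echoing the tool name."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length))
        body = json.dumps({
            "jsonrpc": "2.0",
            "id": request["id"],
            "result": {"tool": request["params"]["name"]},  # canned, not the real shape
        }).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args) -> None:
        pass  # keep test output quiet


def call_mock(tool_name: str) -> dict:
    """Start the mock on a free port, send one tools/call, return the reply."""
    server = HTTPServer(("127.0.0.1", 0), MockMCPHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        payload = {"jsonrpc": "2.0", "id": 1, "method": "tools/call",
                   "params": {"name": tool_name, "arguments": {}}}
        req = urllib.request.Request(
            f"http://127.0.0.1:{server.server_port}/mcp",
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        # Bypass any configured HTTP proxy so the request stays on loopback.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(req, timeout=5) as resp:
            return json.loads(resp.read())
    finally:
        server.shutdown()
        server.server_close()
```

A test in this style exercises request building and response parsing without the Go binary, leaving the subprocess path to the separate integration test.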
Success Criteria
- `pip install web-researcher-mcp` + `from web_researcher_mcp import WebResearcherClient` works on all 5 wheel platforms
- `async with WebResearcherClient() as c: results = await c.web_search("test")` returns typed `SearchResult` objects
- `WebResearcherClient.sync().web_search("test")` works without async boilerplate
- `WebResearcherClient.as_langchain_tools()` returns a valid LangChain tool list (tested with langchain-core)
- `_ServerProcess` starts and stops cleanly; port collision → retries with next port
- Tests in `tests/python/` pass
- `make verify` stays green (no new Go changes → no new Go CI gate; add `python -m pytest tests/python/` to Makefile)
Labels
enhancement, python, dx
Related