Python client library (python/web_researcher_mcp/) — typed SDK for direct use in Python apps, notebooks, and agents

## Summary

`pip install web-researcher-mcp` already works, but it only gives you a CLI entry point that runs the MCP server. Python developers who want to call our 27 research tools directly from their apps, notebooks, or agent pipelines have to manage a server subprocess by hand, speak raw JSON-RPC over HTTP, and parse untyped dicts. This issue proposes a thin Python client library (`python/web_researcher_mcp/`, ~250 lines across `client.py`, `models.py`, and `_server.py`) that ships inside the existing wheel and turns the Go binary into a typed, first-class Python SDK: no extra install step, no new package, no dependency hell.

---

## Why Now

**Python is the primary language for AI tooling, and we're already there.** LangChain has 90k+ GitHub stars and ~10 million monthly PyPI downloads. LlamaIndex, CrewAI, AutoGen, and Haystack together reach tens of millions more. The Jupyter ecosystem has ~22 million users (JupyterCon 2023). The `mcp` Python SDK was among the top-100 fastest-growing PyPI packages in Q1 2025. Every major search/research API competitor (Tavily, Exa, Perplexity, Brave, Serper) ships a typed Python SDK as their **primary developer interface** — not an afterthought.

We already ship 7 platform wheels. The binary is already on-disk after `pip install`. The HTTP transport is already built. The only thing missing is the 250-line wrapper that makes it feel like `tavily-python` or `exa-py` instead of a subprocess plumbing exercise.

**The distribution gap is discoverability.** Docker Hub shows 4,007 pulls vs 16 GitHub stars — our users are infrastructure-minded, not Python-library-minded. Adding a client library extends reach to the larger Python-AI audience without changing our architecture at all.

**Competitor signal:**
- `tavily-python`: async/sync `TavilyClient`, typed `SearchResponse`, 5-line quickstart
- `exa-py`: `Exa(api_key)` → `.search(query)` → typed `Result` objects
- `perplexity-client`: wraps OpenAI-compatible HTTP, typed models
- None of them do citation verification, retraction detection, or bibliography audit — that's our differentiated surface that Python devs can't easily get to today

---

## Target Users

**1. AI/ML Engineer building a RAG pipeline**
Currently: calls `langchain_community.tools.TavilySearchResults` for retrieval, has no citation-verification step, ships hallucinated citations to prod. With the client: drops in `WebResearcherClient` as a LangChain tool, calls `verify_citation()` before indexing, eliminates fabricated sources at the retrieval gate.

**2. Data scientist / researcher in a Jupyter notebook**
Currently: pastes DOIs into Google Scholar manually during literature review, copy-pastes abstracts into notes. With the client: single `async with WebResearcherClient() as c:` block searches, scrapes, checks retractions, and exports a `.bib` file — all in one cell.

**3. Python automation / content pipeline developer**
Currently: generates blog drafts with an LLM, manually spot-checks claims, has no systematic verification. With the client: pipes each LLM output through `audit_bibliography()` + `verify_citation()` before publish, flags mischaracterized sources automatically.

**4. AI agent framework developer (CrewAI / AutoGen / LlamaIndex)**
Currently: writes a custom `Tool` wrapper that spawns a subprocess, parses JSON, handles errors, restarts on crash. With the client: `WebResearcherClient.as_langchain_tools()` (or equivalent) returns a ready-made tool list — all 27 tools available with no boilerplate.

---

## Use Cases with Code Sketches

### 1. Async research in a Jupyter notebook

```python
from web_researcher_mcp import WebResearcherClient

async with WebResearcherClient() as client:
    results = await client.web_search("CRISPR off-target effects 2024", num_results=5)
    for r in results:
        citation = await client.verify_citation(url=r.url, claim="CRISPR has no off-target effects")
        print(r.title, "—", citation.verdict)
```

### 2. LangChain RAG with citation verification

```python
from web_researcher_mcp import WebResearcherClient
from langchain.tools import Tool

client = WebResearcherClient.sync()  # blocking wrapper around the async client

def verified_search(query: str) -> str:
    results = client.web_search(query)
    verified = [r for r in results if client.verify_citation(url=r.url).verdict == "supported"]
    return "\n".join(r.snippet for r in verified)

tool = Tool(name="verified_web_search", func=verified_search, description="...")
```

### 3. CrewAI agent with all research tools

```python
from web_researcher_mcp import WebResearcherClient
from crewai import Agent

client = WebResearcherClient()
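research_agent = Agent(
    role="Research Analyst",
    goal="Find and verify sources for the given claim",
    backstory="Careful analyst who checks every source before citing it",  # required by crewai.Agent
    # as_crewai_tools() is proposed here; the Core API section only lists as_langchain_tools()
    tools=client.as_crewai_tools(["web_search", "academic_search", "verify_citation"]),
)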
```

### 4. FastAPI endpoint that adds research to any app

```python
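from contextlib import asynccontextmanager

from fastapi import FastAPI
from web_researcher_mcp import WebResearcherClient

client = WebResearcherClient()

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Start the bundled server once at app startup and stop it at shutdown,
    # rather than entering the client (and spawning the binary) on every request.
    async with client:
        yield

app = FastAPI(lifespan=lifespan)

@app.get("/research")
async def research(query: str):
    results = await client.search_and_scrape(query)
    return {"results": results}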
```

### 5. Bibliography audit script for academic writing

```python
from pathlib import Path

from web_researcher_mcp import WebResearcherClient

client = WebResearcherClient.sync()  # blocking wrapper, suitable for a plain script
report = client.audit_bibliography(Path("paper.bib").read_text(), format="bibtex")
retracted = [e for e in report.entries if e.retraction_status == "retracted"]
print(f"{len(retracted)} retracted sources found")
```

### 6. CLI tool for a Jupyter magic / VS Code extension

```python
# jupyter magic: %%websearch
from web_researcher_mcp import get_binary_path

# get_binary_path() already exists in the wheel shim — expose it publicly
# so downstream tools can locate the binary without launching it
binary = get_binary_path()  # /path/to/web_researcher_mcp/bin/web-researcher-mcp
```

---

## Current State

**What exists today:**
- ✅ 7 platform wheels on PyPI (`pip install web-researcher-mcp`)
- ✅ Binary bundled in wheel at `web_researcher_mcp/bin/web-researcher-mcp`
- ✅ `get_binary_path()` + `main()` in the wheel shim (`scripts/build_wheels.py`)
- ✅ HTTP transport: `PORT=8080 web-researcher-mcp` → MCP JSON-RPC over HTTP
- ✅ 27 tools, typed MCP tool schemas
- ✅ MCP Python SDK (`mcp` package) has a client implementation

**What's missing:**
- ❌ No way to `from web_researcher_mcp import WebResearcherClient`
- ❌ No subprocess manager (start binary on `PORT`, stop on context exit)
- ❌ No typed Python wrappers (callers get raw `dict`, not `SearchResult`)
- ❌ No declared dependencies in the wheel metadata (the `mcp` SDK is not listed in `Requires-Dist`)
- ❌ No documentation on using the wheel programmatically

---

## Proposed Design

### Package structure (additions to the existing wheel)

```
python/web_researcher_mcp/
├── __init__.py          # re-exports: WebResearcherClient, get_binary_path, __version__
├── client.py            # WebResearcherClient (async-first, sync wrapper)
├── models.py            # Typed dataclasses: SearchResult, VerifyResult, ScrapeResult, ...
└── _server.py           # _ServerProcess: start/stop the Go binary, find free port
```

`build_wheels.py` already builds the wheel from a generated `__init__.py` shim. The new files live in `python/web_researcher_mcp/` in the repo and are bundled into the wheel at build time — same pipeline, no new infra.

### Core API

```python
class WebResearcherClient:
    """Async-first client. Use as async context manager or call start()/stop() manually.
    
    Sync wrapper: use WebResearcherClient.sync() which returns a SyncWebResearcherClient
    that wraps each async method with asyncio.run().
    """
    
    def __init__(self, port: int | None = None, timeout: float = 30.0): ...
    
    async def __aenter__(self) -> "WebResearcherClient": ...
    async def __aexit__(self, *_): ...
    
    # Core tools — typed return values
    async def web_search(self, query: str, *, num_results: int = 5, provider: str | None = None) -> list[SearchResult]: ...
    async def scrape_page(self, url: str) -> ScrapeResult: ...
    async def academic_search(self, query: str, **kwargs) -> list[SearchResult]: ...
    async def verify_citation(self, *, url: str | None = None, doi: str | None = None, claim: str | None = None) -> VerifyResult: ...
    async def audit_bibliography(self, bibliography: str, format: str = "bibtex") -> BibliographyReport: ...
    
    # Pass-through for any of the 27 tools, including those without a typed wrapper above
    async def call(self, tool_name: str, **params) -> dict: ...
    
    # Framework integration helpers
    def as_langchain_tools(self, tools: list[str] | None = None) -> list: ...
```
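The typed return values above could be plain dataclasses in `models.py`. The following is a minimal sketch, not the final schema: the field names `title`, `url`, `snippet`, and `verdict` are taken from the use-case sketches in this issue, while the `_from_dict` helper and its behaviour of ignoring unknown keys are assumptions made for illustration.

```python
from dataclasses import dataclass, fields
from typing import Any


def _from_dict(cls, data: dict[str, Any]):
    """Build a dataclass from a tool response, dropping keys it does not declare.

    Hypothetical helper: ignoring unknown keys keeps the client working
    if the server adds fields to a tool response later.
    """
    known = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in data.items() if k in known})


@dataclass(frozen=True)
class SearchResult:
    title: str
    url: str
    snippet: str = ""

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "SearchResult":
        return _from_dict(cls, data)


@dataclass(frozen=True)
class VerifyResult:
    verdict: str  # e.g. "supported", as used in the LangChain sketch

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "VerifyResult":
        return _from_dict(cls, data)
```

Frozen dataclasses keep results hashable and read-only, which suits values that are passed between pipeline stages; a real `models.py` would derive the full field list from the MCP tool schemas.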

### Subprocess lifecycle (`_server.py`)

```python
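import os
import subprocess

class _ServerProcess:
    """Start the bundled binary on a free port, stop on context exit."""

    def __init__(self, port: int | None = None):
        self.port = port or _find_free_port()
        self._proc: subprocess.Popen | None = None

    def start(self) -> int:  # returns actual port
        env = {**os.environ, "PORT": str(self.port)}
        self._proc = subprocess.Popen([get_binary_path()], env=env)
        _wait_for_ready(self.port)
        return self.port

    def stop(self) -> None:
        if self._proc is None:  # stop() before start() is a no-op
            return
        self._proc.terminate()
        try:
            self._proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            # Escalate if the binary does not exit after terminate()
            self._proc.kill()
            self._proc.wait()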
```

### Transport

Uses the MCP Python SDK (`mcp>=1.0`) as an optional dependency, falling back to raw `httpx`/`urllib` HTTP JSON-RPC if `mcp` is not installed. The endpoint is `http://localhost:{port}/mcp` (Streamable HTTP transport already supported by the binary).
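For the raw-HTTP fallback, the message shapes are small. The sketch below shows only how a `tools/call` request could be built and how text content could be pulled out of the reply, following the MCP JSON-RPC conventions. The function names `build_tool_call` and `extract_text` are hypothetical, and the sketch deliberately leaves out what a real Streamable HTTP client also needs: the `initialize` handshake, session headers, and handling of event-stream responses.

```python
from typing import Any


def build_tool_call(tool_name: str, arguments: dict[str, Any], request_id: int = 1) -> dict[str, Any]:
    """Build the JSON-RPC 2.0 body for an MCP tools/call request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


def extract_text(response: dict[str, Any]) -> str:
    """Join the text items of a tools/call result, raising on either error form."""
    if "error" in response:
        # Protocol-level failure (unknown method, invalid params, ...)
        raise RuntimeError(response["error"].get("message", "JSON-RPC error"))
    result = response["result"]
    text = "\n".join(
        item["text"] for item in result.get("content", []) if item.get("type") == "text"
    )
    if result.get("isError"):
        # Tool-level failure reported inside an otherwise successful reply
        raise RuntimeError(text)
    return text
```

Keeping these two functions pure (no I/O) makes them easy to unit-test and lets the same code sit behind either `httpx` or `urllib`.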

### Sync wrapper

```python
# For notebooks and scripts that don't use async
client = WebResearcherClient.sync()
results = client.web_search("climate change 2024")  # blocks
```

---

## Implementation Scope

| File | Change |
|------|--------|
| `python/web_researcher_mcp/__init__.py` | New — re-exports, `__version__` |
| `python/web_researcher_mcp/client.py` | New — `WebResearcherClient` (~150 lines) |
| `python/web_researcher_mcp/models.py` | New — typed dataclasses (~60 lines) |
| `python/web_researcher_mcp/_server.py` | New — `_ServerProcess` (~50 lines) |
| `scripts/build_wheels.py` | Update to bundle `python/web_researcher_mcp/*.py` into wheel |
| `tests/python/test_client.py` | New — unit tests (mock server) + integration test |
| `docs/TOOLS.md` or new `docs/PYTHON_CLIENT.md` | Usage quickstart |

Total: ~260 lines of new library code per the estimates above (more once tests are counted), plus ~30 lines changed in `build_wheels.py`. No Go changes.

**Optional dependency on `mcp` SDK:** add `mcp>=1.0` to `REQUIRES_DIST` in the wheel metadata, or make it optional with a raw-HTTP fallback (recommended to keep the zero-deps spirit for basic usage).

---

## Success Criteria

- [ ] `pip install web-researcher-mcp` + `from web_researcher_mcp import WebResearcherClient` works on all 7 wheel platforms
- [ ] `async with WebResearcherClient() as c: results = await c.web_search("test")` returns typed `SearchResult` objects
- [ ] Sync wrapper: `WebResearcherClient.sync().web_search("test")` works without async boilerplate
- [ ] `WebResearcherClient.as_langchain_tools()` returns a valid LangChain tool list (tested with langchain-core)
- [ ] `_ServerProcess` starts and stops cleanly; port collision → retries with next port
- [ ] Unit tests (mocked HTTP server) + one integration test in `tests/python/`
- [ ] README quickstart updated with the 5-line Python usage example
- [ ] `make verify` stays green (no new Go changes → no new Go CI gate; add `python -m pytest tests/python/` to Makefile)
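The "mocked HTTP server" criterion can be met with the standard library alone. The sketch below is one possible test fixture, under the assumption that the client posts JSON-RPC to a single endpoint: the mock echoes the requested tool name back as text content. The names `_MockMCPHandler`, `start_mock_server`, and `post_json` are hypothetical, and the mock does not implement the MCP handshake.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class _MockMCPHandler(BaseHTTPRequestHandler):
    """Answer every POST with a tools/call-style result echoing the tool name."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length))
        body = json.dumps({
            "jsonrpc": "2.0",
            "id": request.get("id"),
            "result": {"content": [{"type": "text", "text": "echo:" + request["params"]["name"]}]},
        }).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep test output quiet


def start_mock_server() -> HTTPServer:
    """Serve on an OS-assigned port in a daemon thread; caller shuts it down."""
    server = HTTPServer(("127.0.0.1", 0), _MockMCPHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def post_json(url: str, payload: dict) -> dict:
    """POST a JSON body and decode the JSON reply, bypassing any configured proxy."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(), headers={"Content-Type": "application/json"}
    )
    with opener.open(req, timeout=5) as resp:
        return json.load(resp)
```

A test would call `start_mock_server()`, point the client at `server.server_address[1]`, and finish with `server.shutdown()` and `server.server_close()`.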

---

## Labels

`enhancement`, `python`, `dx`

## Related

- #160 — PyPI wheels (prerequisite, shipped in v1.25.0)
- #162 — deeplinks / .mcpb bundles (parallel distribution track)
