Summary
Two new tools — site_login and site_logout — that let a non-technical user authenticate to a login-gated website by opening a visible (headed) browser window, logging in as a human (including 2FA and CAPTCHA), and having the resulting session cookies captured, AES-256-GCM encrypted, and transparently injected into future scrape_page calls. The user's password is never seen by the server. The scrape_page API is unchanged — auth is invisible to the caller.
Motivation: LinkedIn posts, Instagram profiles, gated research repositories, and similar pages return extractionQuality:"partial" or kind:"blocked" even when the user has a valid account. There is currently no way to pass auth state to the scraper. Automated form-fill is fragile, bot-detected, and a ToS violation. The correct model is: user logs in as a human, server stores only the resulting cookie state, subsequent scrapes inject it automatically.
Design goals
- Non-technical UX: the user logs in normally in a visible Chrome window — no JSON, no headers, no cookie copying
- Zero credential exposure: password typed directly into the real site; tool only reads post-auth CDP cookie state
- Transparent injection: scrape_page callers see no interface change; auth is automatic when cookies are available for the target domain
- Defense-in-depth security: AES-256-GCM at rest (reusing existing persist.DiskStore), deterministic eTLD+1 domain scoping, SSRF-safe client unchanged, audit trail without cookie values, per-user isolation, TTL clamped to server-set cookie expiry
- Honest about ToS: explicit warning in tool description and docs that this may violate platform Terms; account suspension is a known risk the user accepts
Non-goals
- Automated form-fill (fragile, bot-detected, ToS violation — explicitly rejected)
- HTTP/multi-tenant server mode support for the capture step (headed browser requires a local display; HTTP mode returns a clear error)
- Bypassing 2FA or CAPTCHA (the user handles those interactively in the browser)
- Bulk harvesting of third-party data
- OAuth token refresh (a future extension)
Architecture
site_login tool (internal/tools/site_session.go)
  └── cookievault.Store (internal/cookievault/cookievault.go)
        └── persist.Store (internal/persist/disk.go; AES-256-GCM, already exists)
scrape_page (unchanged signature)
  └── pipeline.Scrape(ctx, url, n) (unchanged signature)
        ├── auth.UserIDFromContext(ctx) + TenantIDFromContext(ctx)
        ├── vault.Get(ctx, tenantID, userID, eTLD+1(targetURL))
        │     └── eTLD+1 gate: golang.org/x/net/publicsuffix.EffectiveTLDPlusOne
        ├── browser tier: page.SetCookies(domainCookies) before Navigate
        └── HTTP tiers: NewSSRFSafeClientWithCookies(allowPrivate, targetURL, cookies)
              └── cookiejar with publicsuffix.List: RFC 6265 scoping enforced by Go stdlib
No circular imports. Package dependency order:
cookievault → persist, x/net/publicsuffix
scraper → cookievault (read-only use: Reader, StoredCookie, Bundle, RegistrableDomain), auth
tools → cookievault + scraper
main → all of the above
1. New package: internal/cookievault/cookievault.go
Create this file from scratch. No existing package to modify.
// Package cookievault stores per-user, per-domain, AES-256-GCM-encrypted session
// cookie bundles. The scraper pipeline uses the read-only Reader interface; the
// site_login / site_logout tools use the full Store interface.
//
// Data model: one Bundle per (tenantID, userID, registrableDomain). Stored as
// AES-256-GCM-encrypted JSON via persist.Store. TTL = min(maxTTL, server-set
// cookie expiry). Bundle keys: "cookievault:<tenantID>:<userID>:<eTLD+1>".
// Domain index key: "cookievault:index:<tenantID>:<userID>" → []string of domains.
//
// Cookie values MUST NOT appear in logs, audit records, or data-subject exports.
package cookievault
import (
"context"
"encoding/json"
"time"
"golang.org/x/net/publicsuffix"
"github.com/zoharbabin/web-researcher-mcp/internal/persist"
)
// StoredCookie is the serializable form of a browser cookie. Import-free from
// go-rod so the storage layer has no browser dependency.
type StoredCookie struct {
Name string `json:"name"`
Value string `json:"value"`
Domain string `json:"domain"`
Path string `json:"path"`
Secure bool `json:"secure"`
HTTPOnly bool `json:"httpOnly"`
SameSite string `json:"sameSite,omitempty"`
Expires time.Time `json:"expires,omitempty"` // zero = session-scoped
}
// Bundle is the stored unit: all cookies for one (tenant, user, registrableDomain)
// captured at a single login event.
type Bundle struct {
RegistrableDomain string `json:"registrableDomain"`
CapturedAt time.Time `json:"capturedAt"`
EarliestExpiry time.Time `json:"earliestExpiry"` // drives TTL; zero if all session cookies
TenantID string `json:"tenantId"`
UserID string `json:"userId"`
SchemaVersion int `json:"schemaVersion"` // always 1
Cookies []StoredCookie `json:"cookies"`
}
// Reader is the read-only interface used by the scraper pipeline. Intentionally
// minimal — the pipeline never writes sessions, only reads them.
type Reader interface {
Get(ctx context.Context, tenantID, userID, registrableDomain string) (Bundle, bool)
}
// Store is the full read/write interface used by the tools.
type Store interface {
Reader
Save(ctx context.Context, b Bundle, maxTTL time.Duration) error
Delete(ctx context.Context, tenantID, userID, registrableDomain string)
ListDomains(ctx context.Context, tenantID, userID string) []string
ExportUser(ctx context.Context, tenantID, userID string) (any, error)
EraseUser(ctx context.Context, tenantID, userID string) (int, error)
}
// Noop is the default when COOKIE_VAULT_ENABLED=false. All operations are safe
// no-ops — zero behavior change when the feature is not configured.
type Noop struct{}
func NewNoop() *Noop { return &Noop{} }
func (Noop) Save(_ context.Context, _ Bundle, _ time.Duration) error { return nil }
func (Noop) Get(_ context.Context, _, _, _ string) (Bundle, bool) { return Bundle{}, false }
func (Noop) Delete(_ context.Context, _, _, _ string) {}
func (Noop) ListDomains(_ context.Context, _, _ string) []string { return nil }
func (Noop) ExportUser(_ context.Context, _, _ string) (any, error) { return nil, nil }
func (Noop) EraseUser(_ context.Context, _, _ string) (int, error) { return 0, nil }
var _ Store = (*Noop)(nil)
// New (defined with storeImpl below) returns a Store backed by the given
// persist.Store. maxTTL is the upper bound; 0 → 24h; capped at 7 days internally.
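// RegistrableDomain wraps publicsuffix.EffectiveTLDPlusOne with a safe fallback.
// Returns "" for empty, single-label (e.g. localhost), IP-literal, or otherwise
// underivable hosts (caller must gate on "" → skip injection). IP literals are
// rejected explicitly because EffectiveTLDPlusOne does not return an error for
// them ("192.168.1.1" yields "1.1"). Requires "net" in this file's import block.
func RegistrableDomain(host string) string {
	if host == "" || net.ParseIP(host) != nil {
		return ""
	}
	d, err := publicsuffix.EffectiveTLDPlusOne(host)
	if err != nil {
		return ""
	}
	return d
}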
Full storeImpl implementation (write in cookievault.go below the interface declarations):
type storeImpl struct {
store persist.Store
maxTTL time.Duration
}
func New(store persist.Store, maxTTL time.Duration) Store {
if maxTTL <= 0 { maxTTL = 24 * time.Hour }
if maxTTL > 7*24*time.Hour { maxTTL = 7 * 24 * time.Hour }
return &storeImpl{store: store, maxTTL: maxTTL}
}
func bundleKey(tenantID, userID, domain string) string {
return "cookievault:" + tenantID + ":" + userID + ":" + domain
}
func indexKey(tenantID, userID string) string {
return "cookievault:index:" + tenantID + ":" + userID
}
// effectiveTTL = min(maxTTL, timeUntilExpiry), floored at 1s.
// All-session-cookie bundles (EarliestExpiry zero) use maxTTL as fallback.
func effectiveTTL(b Bundle, maxTTL time.Duration) time.Duration {
if b.EarliestExpiry.IsZero() { return maxTTL }
until := time.Until(b.EarliestExpiry)
if until <= 0 { return time.Second }
if until < maxTTL { return until }
return maxTTL
}
func (s *storeImpl) Save(ctx context.Context, b Bundle, maxTTL time.Duration) error {
	if maxTTL <= 0 || maxTTL > s.maxTTL { maxTTL = s.maxTTL } // a per-call TTL never exceeds the store-wide cap
data, err := json.Marshal(b)
if err != nil { return err }
s.store.Set(ctx, bundleKey(b.TenantID, b.UserID, b.RegistrableDomain), data, effectiveTTL(b, maxTTL))
s.updateIndex(ctx, b.TenantID, b.UserID, b.RegistrableDomain)
return nil
}
func (s *storeImpl) Get(ctx context.Context, tenantID, userID, registrableDomain string) (Bundle, bool) {
data, ok := s.store.Get(ctx, bundleKey(tenantID, userID, registrableDomain))
if !ok { return Bundle{}, false }
var b Bundle
if err := json.Unmarshal(data, &b); err != nil { return Bundle{}, false }
return b, true
}
func (s *storeImpl) Delete(ctx context.Context, tenantID, userID, registrableDomain string) {
s.store.Delete(ctx, bundleKey(tenantID, userID, registrableDomain))
s.removeFromIndex(ctx, tenantID, userID, registrableDomain)
}
func (s *storeImpl) ListDomains(ctx context.Context, tenantID, userID string) []string {
raw := s.loadRawIndex(ctx, tenantID, userID)
var live []string
for _, d := range raw {
if _, ok := s.Get(ctx, tenantID, userID, d); ok {
live = append(live, d)
}
}
if len(live) != len(raw) { s.saveIndex(ctx, tenantID, userID, live) } // lazy prune
return live
}
func (s *storeImpl) ExportUser(ctx context.Context, tenantID, userID string) (any, error) {
// IMPORTANT: cookie Values are intentionally omitted — they are live credentials.
type entry struct {
Domain string `json:"domain"`
CapturedAt time.Time `json:"capturedAt"`
ExpiresAt time.Time `json:"expiresAt,omitempty"`
CookieCount int `json:"cookieCount"`
}
var out []entry
for _, d := range s.ListDomains(ctx, tenantID, userID) {
b, ok := s.Get(ctx, tenantID, userID, d)
if !ok { continue }
out = append(out, entry{Domain: d, CapturedAt: b.CapturedAt, ExpiresAt: b.EarliestExpiry, CookieCount: len(b.Cookies)})
}
return out, nil
}
func (s *storeImpl) EraseUser(ctx context.Context, tenantID, userID string) (int, error) {
domains := s.ListDomains(ctx, tenantID, userID)
for _, d := range domains {
s.store.Delete(ctx, bundleKey(tenantID, userID, d))
}
s.store.Delete(ctx, indexKey(tenantID, userID))
return len(domains), nil
}
func (s *storeImpl) updateIndex(ctx context.Context, tenantID, userID, domain string) {
raw := s.loadRawIndex(ctx, tenantID, userID)
for _, d := range raw {
if d == domain { return }
}
s.saveIndex(ctx, tenantID, userID, append(raw, domain))
}
func (s *storeImpl) removeFromIndex(ctx context.Context, tenantID, userID, domain string) {
var filtered []string
for _, d := range s.loadRawIndex(ctx, tenantID, userID) {
if d != domain { filtered = append(filtered, d) }
}
s.saveIndex(ctx, tenantID, userID, filtered)
}
func (s *storeImpl) loadRawIndex(ctx context.Context, tenantID, userID string) []string {
data, ok := s.store.Get(ctx, indexKey(tenantID, userID))
if !ok { return nil }
var out []string
_ = json.Unmarshal(data, &out)
return out
}
func (s *storeImpl) saveIndex(ctx context.Context, tenantID, userID string, domains []string) {
if data, err := json.Marshal(domains); err == nil {
s.store.Set(ctx, indexKey(tenantID, userID), data, s.maxTTL)
}
}
var _ Store = (*storeImpl)(nil)
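The TTL rule used by Save can be checked in isolation. The sketch below restates effectiveTTL with the current time passed in as a parameter so the results are deterministic; clampTTL and demoTTLs are illustrative names, not part of the package, and the dates are made up.

```go
package main

import (
	"fmt"
	"time"
)

// clampTTL mirrors effectiveTTL but takes "now" explicitly (an assumption made
// here for testability). A zero expiry means every cookie was session-scoped.
func clampTTL(expiry, now time.Time, maxTTL time.Duration) time.Duration {
	if expiry.IsZero() {
		return maxTTL
	}
	until := expiry.Sub(now)
	if until <= 0 {
		return time.Second // already expired: keep the entry for one second only
	}
	if until < maxTTL {
		return until
	}
	return maxTTL
}

// demoTTLs evaluates the four interesting cases against a fixed clock.
func demoTTLs() []time.Duration {
	now := time.Date(2025, 1, 1, 12, 0, 0, 0, time.UTC)
	maxTTL := 24 * time.Hour
	return []time.Duration{
		clampTTL(now.Add(2*time.Hour), now, maxTTL),  // cookie expires before the cap
		clampTTL(now.Add(72*time.Hour), now, maxTTL), // cookie outlives the cap
		clampTTL(time.Time{}, now, maxTTL),           // session cookies only
		clampTTL(now.Add(-time.Minute), now, maxTTL), // already expired
	}
}

func main() {
	for _, d := range demoTTLs() {
		fmt.Println(d) // 2h0m0s, 24h0m0s, 24h0m0s, 1s
	}
}
```

Passing the clock in rather than calling time.Until keeps the rule easy to unit-test; the production function can stay as written.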
2. internal/consent/consent.go — add PurposeSessionCapture
Add to the Purpose const block:
// PurposeSessionCapture covers storing AES-encrypted browser session cookies for
// replay by the scraper. Regulated: cookies are personal data (GDPR Recital 30),
// capture may violate platform ToS (users must acknowledge this), and stored
// sessions are subject to data-subject rights (export/erasure).
PurposeSessionCapture Purpose = "session_capture"
Update AllPurposes:
var AllPurposes = []Purpose{PurposeMemory, PurposeAnalytics, PurposeWorkspace, PurposeSessionCapture}
3. internal/config/config.go — cookie vault config
Add CookieVaultConfig struct (near FeatureConfig):
// CookieVaultConfig configures the encrypted session-cookie store used by
// site_login and site_logout. Disabled by default.
type CookieVaultConfig struct {
Enabled bool
EncryptionKey string // 64 hex chars; MUST differ from CACHE_ENCRYPTION_KEY
EncryptionKeyPrev string // previous key for lazy rotation (same semantics as CACHE_ENCRYPTION_KEY_PREV)
MaxTTLHours int // default 24; max 168 (7 days); clamped to server-set cookie expiry
CaptureTimeoutSeconds int // default 300; max 600 (how long the headed window stays open)
}
Add CookieVault CookieVaultConfig field to Config struct.
In Load():
cfg.CookieVault = CookieVaultConfig{
Enabled: envBool("COOKIE_VAULT_ENABLED", false),
EncryptionKey: os.Getenv("COOKIE_VAULT_ENCRYPTION_KEY"),
EncryptionKeyPrev: os.Getenv("COOKIE_VAULT_ENCRYPTION_KEY_PREV"),
MaxTTLHours: envInt("COOKIE_VAULT_MAX_TTL_HOURS", 24),
CaptureTimeoutSeconds: envInt("SESSION_CAPTURE_TIMEOUT_SECONDS", 300),
}
if cfg.CookieVault.Enabled {
if len(cfg.CookieVault.EncryptionKey) != 64 {
return nil, fmt.Errorf("COOKIE_VAULT_ENCRYPTION_KEY must be 64 hex characters when COOKIE_VAULT_ENABLED=true")
}
if cfg.CookieVault.EncryptionKeyPrev != "" && len(cfg.CookieVault.EncryptionKeyPrev) != 64 {
return nil, fmt.Errorf("COOKIE_VAULT_ENCRYPTION_KEY_PREV must be 64 hex characters if set")
}
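	// Clamp to the documented ranges. Non-positive values fall back to the defaults;
	// a zero capture timeout would otherwise cancel the login window immediately.
	if cfg.CookieVault.MaxTTLHours < 1 { cfg.CookieVault.MaxTTLHours = 24 }
	if cfg.CookieVault.MaxTTLHours > 168 { cfg.CookieVault.MaxTTLHours = 168 }
	if cfg.CookieVault.CaptureTimeoutSeconds < 1 { cfg.CookieVault.CaptureTimeoutSeconds = 300 }
	if cfg.CookieVault.CaptureTimeoutSeconds > 600 { cfg.CookieVault.CaptureTimeoutSeconds = 600 }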
}
Also pass the Chrome binary path through to the capture function. The CHROME_PATH env var is already read in config, so verify the existing field name (referred to here as Config.ChromePath) and add the field only if it is missing.
.env.example additions (near the memory/analytics block):
# ── Cookie vault (site_login / site_logout) ─────────────────────────────────────
# COOKIE_VAULT_ENABLED=false
# COOKIE_VAULT_ENCRYPTION_KEY= # 64 hex chars; MUST differ from CACHE_ENCRYPTION_KEY
# COOKIE_VAULT_ENCRYPTION_KEY_PREV= # previous key for zero-downtime rotation
# COOKIE_VAULT_MAX_TTL_HOURS=24 # max bundle retention hours (1–168)
# SESSION_CAPTURE_TIMEOUT_SECONDS=300 # headed login window timeout (max 600)
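The Load() check above tests only the key's length, so a 64-character string that is not hex would pass. A minimal sketch of a stricter check follows; validateVaultKey is a hypothetical helper, not part of the existing config package, and the error wording is illustrative.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// validateVaultKey (hypothetical) checks that key decodes to exactly 32 bytes,
// the AES-256 key size, and that it differs from the cache key.
func validateVaultKey(key, cacheKey string) error {
	raw, err := hex.DecodeString(key)
	if err != nil {
		return fmt.Errorf("COOKIE_VAULT_ENCRYPTION_KEY is not valid hex: %w", err)
	}
	if len(raw) != 32 {
		return errors.New("COOKIE_VAULT_ENCRYPTION_KEY must decode to 32 bytes (64 hex characters)")
	}
	if cacheKey != "" && strings.EqualFold(key, cacheKey) {
		return errors.New("COOKIE_VAULT_ENCRYPTION_KEY must differ from CACHE_ENCRYPTION_KEY")
	}
	return nil
}

func main() {
	good := strings.Repeat("ab", 32)  // 64 hex characters
	other := strings.Repeat("cd", 32) // a different 64-character key

	fmt.Println(validateVaultKey(good, other) == nil)             // valid and distinct
	fmt.Println(validateVaultKey(strings.Repeat("zz", 32), "")) // right length, not hex
	fmt.Println(validateVaultKey(good, good))                     // reuses the cache key
}
```

Decoding once at load time also surfaces a malformed key as a startup error rather than at the first encryption call.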
4. internal/scraper/ssrf.go — NewSSRFSafeClientWithCookies
Add this function (no changes to existing functions):
import (
"net/http/cookiejar"
"golang.org/x/net/publicsuffix"
"github.com/zoharbabin/web-researcher-mcp/internal/cookievault"
)
// NewSSRFSafeClientWithCookies returns a new SSRF-safe HTTP client whose cookie
// jar is pre-seeded with domain-matched cookies from the vault. The jar uses
// publicsuffix.List so RFC 6265 domain and scheme scoping is enforced by the
// Go stdlib — a Secure cookie is never sent to an http:// host and cookies are
// never sent cross-domain.
//
// The caller MUST only use the returned client for requests to targetURL's
// registrable domain. The jar already refuses to send cookies cross-domain;
// the restriction is stated here so it is part of the documented contract.
//
// If cookies is nil or empty, this is equivalent to NewSSRFSafeClient(allowPrivate).
func NewSSRFSafeClientWithCookies(allowPrivate bool, targetURL string, cookies []cookievault.StoredCookie) (*http.Client, error) {
jar, err := cookiejar.New(&cookiejar.Options{PublicSuffixList: publicsuffix.List})
if err != nil {
return nil, err
}
if len(cookies) > 0 {
u, err := url.Parse(targetURL)
if err != nil {
return nil, err
}
var httpCookies []*http.Cookie
for _, c := range cookies {
hc := &http.Cookie{
Name: c.Name,
Value: c.Value,
Domain: c.Domain,
Path: c.Path,
Secure: c.Secure,
HttpOnly: c.HTTPOnly,
}
if !c.Expires.IsZero() {
hc.Expires = c.Expires
hc.RawExpires = c.Expires.Format(time.RFC1123)
}
httpCookies = append(httpCookies, hc)
}
jar.SetCookies(u, httpCookies)
}
client := NewSSRFSafeClient(allowPrivate) // unchanged transport (SSRF blocking still applies)
client.Jar = jar
return client, nil
}
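The scoping the jar provides can be seen with the standard library alone. The sketch below seeds a jar with one Secure cookie and counts how many cookies it would attach to different targets. It passes nil options to cookiejar.New to stay dependency-free, whereas the function above uses publicsuffix.List; the cookie name, value, and hosts are made up.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/cookiejar"
	"net/url"
)

// cookiesSent seeds a fresh jar from seedURL and reports how many cookies the
// jar would attach to a request for targetURL.
func cookiesSent(seedURL, targetURL string, seed []*http.Cookie) (int, error) {
	jar, err := cookiejar.New(nil) // nil options: no public suffix list (sketch only)
	if err != nil {
		return 0, err
	}
	su, err := url.Parse(seedURL)
	if err != nil {
		return 0, err
	}
	tu, err := url.Parse(targetURL)
	if err != nil {
		return 0, err
	}
	jar.SetCookies(su, seed)
	return len(jar.Cookies(tu)), nil
}

// sentTo seeds one Secure, domain-wide cookie for example.com and counts what
// the jar would send to target. It returns -1 on a URL parse error.
func sentTo(target string) int {
	seed := []*http.Cookie{{
		Name: "session", Value: "redacted",
		Domain: ".example.com", Path: "/", Secure: true,
	}}
	n, err := cookiesSent("https://www.example.com/", target, seed)
	if err != nil {
		return -1
	}
	return n
}

func main() {
	fmt.Println(sentTo("https://www.example.com/feed")) // 1: same domain over https
	fmt.Println(sentTo("http://www.example.com/feed"))  // 0: Secure cookie, plain http
	fmt.Println(sentTo("https://www.other.test/"))      // 0: different domain
}
```

Stored cookies whose Domain does not cover the seed URL's host are dropped by jar.SetCookies without an error, so a host-only cookie captured on one subdomain is not replayed on another.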
5. internal/scraper/pipeline.go — CookieVault field + injection helpers
5a. Add to PipelineConfig
// CookieVault, when non-nil, is queried for domain-matched session cookies
// before each scrape. nil (default, Noop) means zero behavior change.
// tenantID and userID are read from the request context via auth package.
CookieVault cookievault.Reader
5b. New helper: vaultCookiesForURL
// vaultCookiesForURL returns vault cookies for the target URL's registrable domain,
// or nil when the vault is absent or has no stored session.
// tenantID/userID are read from ctx via auth.TenantIDFromContext / auth.UserIDFromContext.
func (p *Pipeline) vaultCookiesForURL(ctx context.Context, rawURL string) []cookievault.StoredCookie {
if p.config.CookieVault == nil {
return nil
}
u, err := url.Parse(rawURL)
if err != nil {
return nil
}
rd := cookievault.RegistrableDomain(u.Hostname())
if rd == "" {
return nil
}
tenantID := auth.TenantIDFromContext(ctx)
userID := auth.UserIDFromContext(ctx)
bundle, ok := p.config.CookieVault.Get(ctx, tenantID, userID, rd)
if !ok {
return nil
}
return bundle.Cookies
}
5c. HTTP tier injection
In each HTTP-tier scrape method (scrapeMarkdown, scrapeHTML, scrapeStealth, scrapePatents — anywhere that constructs an *http.Client or uses p.client):
// Build a per-request client. When vault cookies are available for this domain,
// use a fresh client with a pre-seeded jar (avoids sharing jar state between
// concurrent requests). Fall back to the shared SSRF-safe client otherwise.
reqClient := p.client
if cookies := p.vaultCookiesForURL(ctx, url); len(cookies) > 0 {
var err error
	// pipeline.go is in package scraper, so the call is unqualified.
	reqClient, err = NewSSRFSafeClientWithCookies(p.config.AllowPrivateIPs, url, cookies)
if err != nil {
slog.Warn("cookie injection skipped", "url", url, "error", err)
reqClient = p.client
}
}
// Use reqClient for all subsequent http.Request calls in this method.
Note: p.client is the current shared SSRF-safe client. Per-request client creation only when cookies are present avoids any performance impact on the common (unauthenticated) path.
5d. Browser tier injection
New method in browser.go (or in a new internal/scraper/inject_browser.go):
// injectVaultCookies looks up stored session cookies for rawURL's domain and
// injects them into the page via proto.NetworkSetCookies before navigation.
// Non-fatal: on any error it logs and returns without blocking the scrape.
func (p *Pipeline) injectVaultCookies(ctx context.Context, page *rod.Page, rawURL string) {
cookies := p.vaultCookiesForURL(ctx, rawURL)
if len(cookies) == 0 {
return
}
u, _ := url.Parse(rawURL)
var params []*proto.NetworkCookieParam
for _, c := range cookies {
// Redundant domain-safety check (defense in depth).
cookieDomain := strings.TrimPrefix(c.Domain, ".")
if cookievault.RegistrableDomain(cookieDomain) != cookievault.RegistrableDomain(u.Hostname()) {
continue
}
// Never inject a Secure cookie into an http:// target.
if c.Secure && u.Scheme != "https" {
continue
}
param := &proto.NetworkCookieParam{
Name: c.Name,
Value: c.Value,
Domain: c.Domain,
Path: c.Path,
Secure: c.Secure,
HTTPOnly: c.HTTPOnly,
}
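		if !c.Expires.IsZero() {
			// NetworkCookieParam.Expires is a plain proto.TimeSinceEpoch (seconds since
			// the Unix epoch), not a pointer; confirm against the pinned go-rod version.
			param.Expires = proto.TimeSinceEpoch(c.Expires.Unix())
		}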
if c.SameSite != "" {
param.SameSite = proto.NetworkCookieSameSite(c.SameSite)
}
params = append(params, param)
}
if len(params) == 0 {
return
}
if err := page.SetCookies(params); err != nil {
slog.Warn("browser cookie injection failed", "url", rawURL, "error", err)
}
}
Call site in scrapeBrowser (after stealth.Page(browser), before page.Navigate):
p.injectVaultCookies(ctx, page, url)
Important: this design assumes that cookies injected with page.SetCookies stay isolated to the scrape that injected them. Chrome keeps its cookie store per browser context, not per page, so the assumption holds only if each scrape's page lives in its own browser context (for example one created with go-rod's browser.Incognito()). Verify this against the pinned go-rod version and the pool's page-creation path before shipping. If pages share the default context, injected cookies are visible to concurrent scrapes.
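The per-cookie filter inside injectVaultCookies reduces to two checks, sketched below as a pure function. shouldInject is an illustrative name. It assumes the registrable domains were already computed by cookievault.RegistrableDomain, and it treats an empty value as a non-match, which is slightly stricter than the inline comparison above.

```go
package main

import "fmt"

// shouldInject reports whether a stored cookie may be injected for a target.
// cookieRD and targetRD are eTLD+1 values computed elsewhere (assumption);
// "" means the registrable domain could not be derived.
func shouldInject(cookieRD, targetRD, targetScheme string, secure bool) bool {
	if cookieRD == "" || cookieRD != targetRD {
		return false // wrong or unknown domain
	}
	if secure && targetScheme != "https" {
		return false // never replay a Secure cookie over plain http
	}
	return true
}

func main() {
	fmt.Println(shouldInject("linkedin.com", "linkedin.com", "https", true))  // true
	fmt.Println(shouldInject("linkedin.com", "linkedin.com", "http", true))   // false
	fmt.Println(shouldInject("linkedin.com", "linkedin.com", "http", false))  // true
	fmt.Println(shouldInject("linkedin.com", "attacker.com", "https", false)) // false
	fmt.Println(shouldInject("", "", "https", false))                         // false
}
```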
6. New file: internal/scraper/capture.go — headed browser login capture
This file contains CaptureLoginSession and its helpers. It is the only place that launches a headed browser; it is entirely separate from getBrowserPool().
package scraper
import (
	"context"
	"fmt"
	"os"
	"strings"
	"time"

	"github.com/go-rod/rod"
	"github.com/go-rod/rod/lib/launcher"
	"github.com/go-rod/rod/lib/proto"

	"github.com/zoharbabin/web-researcher-mcp/internal/cookievault"
)
// captureHook is a test seam. When non-nil, CaptureLoginSession calls it instead
// of launching a real browser. Tests set this via package init or t.Cleanup reset.
// MUST be nil in production; never exported.
var captureHook func(ctx context.Context, loginURL, registrableDomain, tenantID, userID, chromePath string) (cookievault.Bundle, error)
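// loginPageSubstrings are fragments that indicate the user is still on an auth
// page. They are matched against the full lower-cased URL, so a host such as
// auth.example.com or a query such as ?next=/login also counts. Login is
// considered complete when a known session cookie appears, OR when NONE of
// these fragments appear in the current URL AND the domain has at least one cookie.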
var loginPageSubstrings = []string{
"/login", "/signin", "/sign-in", "/auth", "/checkpoint",
"/accounts/login", "/session/new", "/oauth", "/sso",
"/saml", "/oidc", "/forgot", "/reset-password",
}
// knownSessionCookies maps registrable domains to the cookie name(s) that
// definitively indicate a successful session. Checking these provides sub-second
// detection vs. waiting for URL change on redirect-heavy flows.
// This map is an accelerator only — the URL-change fallback fires regardless.
var knownSessionCookies = map[string][]string{
"linkedin.com": {"li_at"},
"instagram.com": {"sessionid"},
"facebook.com": {"xs"},
"twitter.com": {"auth_token"},
"x.com": {"auth_token"},
"github.com": {"user_session"},
"reddit.com": {"reddit_session"},
"notion.so": {"token_v2"},
}
// CaptureLoginSession opens a headed (visible) browser at loginURL and blocks
// until the user completes login or ctx is cancelled.
//
// Isolation guarantees:
// - Separate launcher instance from getBrowserPool() — NEVER touches poolOnce.
// - Ephemeral UserDataDir (os.MkdirTemp, 0700), destroyed via defer on every exit path.
// - The profile dir is cleaned up even on panic (deferred os.RemoveAll).
//
// This function MUST only be called in STDIO mode (the caller enforces this).
func CaptureLoginSession(
ctx context.Context,
loginURL, registrableDomain, tenantID, userID, chromePath string,
) (cookievault.Bundle, error) {
if captureHook != nil {
return captureHook(ctx, loginURL, registrableDomain, tenantID, userID, chromePath)
}
tempDir, err := os.MkdirTemp("", "web-researcher-capture-*")
if err != nil {
return cookievault.Bundle{}, fmt.Errorf("could not create capture profile dir: %w", err)
}
if err := os.Chmod(tempDir, 0700); err != nil {
_ = os.RemoveAll(tempDir)
return cookievault.Bundle{}, fmt.Errorf("could not secure capture profile dir: %w", err)
}
defer os.RemoveAll(tempDir) // CRITICAL: runs on all exit paths including panic+recover
l := launcher.New().
Headless(false).
UserDataDir(tempDir).
Set("no-sandbox").
Set("disable-dev-shm-usage").
Set("no-first-run").
Set("no-default-browser-check").
Set("disable-background-networking")
if chromePath != "" && chromePath != chromeDisabled {
l = l.Bin(chromePath)
}
controlURL, err := l.Launch()
if err != nil {
return cookievault.Bundle{}, fmt.Errorf("could not launch browser for login: %w", err)
}
browser := rod.New().ControlURL(controlURL)
if err := browser.Connect(); err != nil {
l.Kill()
return cookievault.Bundle{}, fmt.Errorf("could not connect to browser: %w", err)
}
defer func() {
_ = browser.Close()
l.Kill()
}()
page, err := browser.Page(proto.TargetCreateTarget{URL: loginURL})
if err != nil {
return cookievault.Bundle{}, fmt.Errorf("could not open login page: %w", err)
}
defer page.Close()
page = page.Context(ctx)
if err := page.WaitLoad(); err != nil {
if ctx.Err() != nil {
return cookievault.Bundle{}, fmt.Errorf("login timed out before page loaded")
}
return cookievault.Bundle{}, fmt.Errorf("login page failed to load: %w", err)
}
return pollUntilLoggedIn(ctx, page, loginURL, registrableDomain, tenantID, userID)
}
// pollUntilLoggedIn polls at 1-second intervals for login-completion signals.
// Signals (first match wins):
// 1. A knownSessionCookie for registrableDomain appeared with a non-empty value.
// 2. Current URL no longer matches any loginPageSubstring AND at least one cookie
// is present for the domain (catches redirect-based and SPA flows).
// 3. ctx cancelled → timeout error.
// 4. page.Info() fails → browser was closed by user → abort error.
func pollUntilLoggedIn(
ctx context.Context,
page *rod.Page,
loginURL, registrableDomain, tenantID, userID string,
) (cookievault.Bundle, error) {
ticker := time.NewTicker(time.Second)
defer ticker.Stop()
sessionNames := knownSessionCookies[registrableDomain]
for {
select {
case <-ctx.Done():
return cookievault.Bundle{}, fmt.Errorf("login timed out — the browser window has been closed automatically")
case <-ticker.C:
info, err := page.Info()
if err != nil {
return cookievault.Bundle{}, fmt.Errorf("login aborted — browser window was closed")
}
cookies, err := page.Cookies(nil)
if err != nil {
continue // transient CDP error; keep polling
}
if detectLoginComplete(info.URL, cookies, registrableDomain, sessionNames) {
return buildBundle(cookies, registrableDomain, tenantID, userID), nil
}
}
}
}
// detectLoginComplete applies the two-signal heuristic.
func detectLoginComplete(currentURL string, cookies []*proto.NetworkCookie, rd string, sessionNames []string) bool {
// Signal 1: known session cookie appeared.
for _, name := range sessionNames {
for _, c := range cookies {
if c.Name == name && c.Value != "" {
return true
}
}
}
// Signal 2: URL moved away from all login-path patterns AND domain has cookies.
for _, pattern := range loginPageSubstrings {
if strings.Contains(strings.ToLower(currentURL), pattern) {
return false
}
}
// URL is not a login page — check for domain-matching cookies.
for _, c := range cookies {
if cookievault.RegistrableDomain(strings.TrimPrefix(c.Domain, ".")) == rd {
return true
}
}
return false
}
// buildBundle constructs a Bundle from the page's cookie state, filtered to
// cookies whose Domain eTLD+1 matches registrableDomain.
func buildBundle(cookies []*proto.NetworkCookie, rd, tenantID, userID string) cookievault.Bundle {
var stored []cookievault.StoredCookie
var earliest time.Time
for _, c := range cookies {
if cookievault.RegistrableDomain(strings.TrimPrefix(c.Domain, ".")) != rd {
continue
}
sc := cookievault.StoredCookie{
Name: c.Name,
Value: c.Value,
Domain: c.Domain,
Path: c.Path,
Secure: c.Secure,
HTTPOnly: c.HTTPOnly,
SameSite: string(c.SameSite),
}
if exp := float64(c.Expires); exp > 0 {
t := time.Unix(int64(exp), 0)
sc.Expires = t
if earliest.IsZero() || t.Before(earliest) {
earliest = t
}
}
stored = append(stored, sc)
}
return cookievault.Bundle{
RegistrableDomain: rd,
CapturedAt: time.Now().UTC(),
EarliestExpiry: earliest,
TenantID: tenantID,
UserID: userID,
SchemaVersion: 1,
Cookies: stored,
}
}
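The URL half of detectLoginComplete is easy to exercise on its own. The sketch below copies the substring list and the matching loop into a standalone function; looksLikeLoginURL is an illustrative name and the sample URLs are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// Same fragments as loginPageSubstrings in capture.go.
var loginPageSubstrings = []string{
	"/login", "/signin", "/sign-in", "/auth", "/checkpoint",
	"/accounts/login", "/session/new", "/oauth", "/sso",
	"/saml", "/oidc", "/forgot", "/reset-password",
}

// looksLikeLoginURL reports whether any fragment occurs anywhere in the
// lower-cased URL, exactly as the loop in detectLoginComplete does.
func looksLikeLoginURL(currentURL string) bool {
	lower := strings.ToLower(currentURL)
	for _, pattern := range loginPageSubstrings {
		if strings.Contains(lower, pattern) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(looksLikeLoginURL("https://www.linkedin.com/login"))       // true
	fmt.Println(looksLikeLoginURL("https://www.linkedin.com/feed/"))       // false
	fmt.Println(looksLikeLoginURL("https://example.com/home?next=/login")) // true: query string
	fmt.Println(looksLikeLoginURL("https://auth.example.com/dashboard"))   // true: "//auth" in the host part
}
```

Because the match runs over the whole URL, the last two cases keep the URL signal from firing even though the user has left the login page. In those cases completion depends on the known-session-cookie signal or the timeout.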
7. New file: internal/tools/site_session.go
Both tools in one file (they share consent/identity/vault patterns).
site_login (write tool — writeAnnotations(false))
Input struct:
type siteLoginInput struct {
URL string `json:"url" jsonschema:"Login page URL (e.g. https://www.linkedin.com/login). A browser window opens; log in normally.,required"`
Domain string `json:"domain,omitempty" jsonschema:"Override the registrable domain (e.g. 'linkedin.com'). Inferred from URL if omitted; override when the login URL is on a different subdomain than the content."`
}
Registration guard: only register when deps.CookieVault is non-Noop (mirror the memory_save pattern — check via type assertion _, isNoop := deps.CookieVault.(*cookievault.Noop); !isNoop).
Handler logic (complete):
- !deps.Features.StdioMode → return {status:"http_mode_unsupported", reason:"site_login requires a local display..."}
- !deps.Consent.HasConsent(ctx, consent.PurposeSessionCapture) → {status:"no_consent", reason:"..."}
- userID == "" || userID == "anonymous" → {status:"unavailable", reason:"requires authenticated user"}
- Validate URL via scraper.ValidateScrapeURL
- Compute registrableDomain from input.Domain (if set) or url.Parse(input.URL).Hostname() via cookievault.RegistrableDomain; return toolError on ""
- captureCtx, cancel := context.WithTimeout(ctx, time.Duration(deps.Config.CookieVault.CaptureTimeoutSeconds)*time.Second); defer cancel()
- bundle, err := scraper.CaptureLoginSession(captureCtx, rawURL, registrableDomain, tenantID, userID, deps.Config.ChromePath)
- On error: return structuredError(err.Error(), ToolError{Kind: ErrKindAuth, Retryable: false, SuggestedAction: ActionInformUser})
- deps.CookieVault.Save(ctx, bundle, time.Duration(deps.Config.CookieVault.MaxTTLHours)*time.Hour)
- Audit (domain + cookieCount; never cookie values)
- Return {status:"captured", domain, cookieCount, capturedAt, expiresAt (if non-zero), trust:"user-asserted-content"}
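The first three steps are pure gates evaluated in a fixed order, which can be sketched without the handler's dependencies. gateStatus is a hypothetical helper, not part of the plan. The status strings are the ones listed above, and the boolean inputs stand in for deps.Features.StdioMode and deps.Consent.HasConsent.

```go
package main

import "fmt"

// gateStatus (hypothetical) returns the status a site_login call should be
// rejected with, or "" when every gate passes. Order matters: the display
// check comes first, then consent, then identity.
func gateStatus(stdioMode, hasConsent bool, userID string) string {
	switch {
	case !stdioMode:
		return "http_mode_unsupported"
	case !hasConsent:
		return "no_consent"
	case userID == "" || userID == "anonymous":
		return "unavailable"
	}
	return ""
}

func main() {
	fmt.Println(gateStatus(false, false, ""))       // http_mode_unsupported
	fmt.Println(gateStatus(true, false, "u1"))      // no_consent
	fmt.Println(gateStatus(true, true, "anonymous")) // unavailable
	fmt.Println(gateStatus(true, true, "u1") == "") // true: proceed to capture
}
```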
Tool description (verbatim — required for TestToolDescriptionQuality):
"Open a browser window for you to log into a website normally (username, password, 2FA, CAPTCHA all handled by you). After login, the session cookies are captured and encrypted so future scrape_page calls to that site work without re-logging-in. Credentials are never seen by the server — only the post-login cookie state is stored. IMPORTANT: Automating a logged-in session may violate the website's Terms of Service. Use only with your own accounts on sites you are authorized to access programmatically. STDIO-only: requires a local display. Use site_logout to revoke the stored session."
site_logout (write tool — writeAnnotations(true) — idempotent)
Input struct:
type siteLogoutInput struct {
Domain string `json:"domain" jsonschema:"Registrable domain to revoke the stored session for (e.g. 'linkedin.com').,required"`
}
Handler logic:
- Consent + identity gates (same as site_login)
- Validate/normalize domain via cookievault.RegistrableDomain
- _, exists := deps.CookieVault.Get(ctx, tenantID, userID, rd) then deps.CookieVault.Delete(...)
- Return {status:"ok"|"not_found", domain:rd}; never an error (Delete is always safe)
Output schemas (add to internal/tools/schemas.go):
var siteLoginOutputSchema = map[string]any{
"type": "object",
"properties": map[string]any{
"status": map[string]any{"type": "string", "enum": []any{"captured", "timeout", "aborted", "no_consent", "unavailable", "http_mode_unsupported"}},
"domain": map[string]any{"type": "string"},
"cookieCount": map[string]any{"type": "integer"},
"capturedAt": map[string]any{"type": "string", "format": "date-time"},
"expiresAt": map[string]any{"type": "string", "format": "date-time"},
"reason": map[string]any{"type": "string"},
"trust": trustUserAsserted,
},
}
var siteLogoutOutputSchema = map[string]any{
"type": "object",
"properties": map[string]any{
"status": map[string]any{"type": "string", "enum": []any{"ok", "not_found", "no_consent", "unavailable"}},
"domain": map[string]any{"type": "string"},
},
}
8. internal/tools/registry.go — Dependencies + Features + RegisterAll
Add to Dependencies:
// CookieVault holds encrypted session cookies for site_login / site_logout.
// Is cookievault.Noop when COOKIE_VAULT_ENABLED=false (default).
CookieVault cookievault.Store
Add to Features:
// StdioMode is true when running without an HTTP port (no PORT env var).
// Required to gate site_login, which opens a headed browser needing a local display.
StdioMode bool
In RegisterAll, add (mirror the memory/analytics conditional pattern):
if _, isNoop := deps.CookieVault.(*cookievault.Noop); !isNoop {
registerSiteLogin(srv, deps)
registerSiteLogout(srv, deps)
}
9. cmd/web-researcher-mcp/main.go — wiring
After the memory/analytics/workspace store construction block:
// Cookie vault (site_login / site_logout).
var vaultStore cookievault.Store = cookievault.NewNoop()
if cfg.CookieVault.Enabled {
vaultPersist, err := persist.NewDiskStore(
filepath.Join(cacheDir, "cookievault"),
cfg.CookieVault.EncryptionKey,
cfg.CookieVault.EncryptionKeyPrev,
)
if err != nil {
return fmt.Errorf("cookie vault persist store: %w", err)
}
maxTTL := time.Duration(cfg.CookieVault.MaxTTLHours) * time.Hour
vaultStore = cookievault.New(vaultPersist, maxTTL)
dataSubjectRegistry.Register(
"cookievault",
datasubject.ExporterFunc(func(ctx context.Context, s datasubject.Subject) (any, error) {
return vaultStore.ExportUser(ctx, s.TenantID, s.UserID)
}),
datasubject.EraserFunc(func(ctx context.Context, s datasubject.Subject) (int, error) {
return vaultStore.EraseUser(ctx, s.TenantID, s.UserID)
}),
)
}
Wire into Dependencies:
deps := tools.Dependencies{
	// ... existing fields ...
	CookieVault: vaultStore,
	Features: tools.Features{
		// ... existing fields ...
		StdioMode: cfg.Port == 0,
	},
}
Wire into scraper.NewPipeline:
scraper.NewPipeline(scraper.PipelineConfig{
	// ... existing fields ...
	CookieVault: vaultStore, // cookievault.Store implements cookievault.Reader
})
10. internal/tools/metadata_test.go + tools_test.go
metadata_test.go — add to expectedTools:
"site_login",
"site_logout",
Add to writeTools in TestAllToolsHaveAnnotations:
"site_login": true,
"site_logout": true,
tools_test.go — in setupTestDeps(), add:
memPersist := persist.NewMemoryStore()
deps.CookieVault = cookievault.New(memPersist, 24*time.Hour)
deps.Features.StdioMode = true
The conditional registration check (_, isNoop := deps.CookieVault.(*cookievault.Noop); !isNoop) will be true for the non-Noop MemoryStore-backed vault, so site_login and site_logout appear in listTools.
11. docs/TOOLS.md — tool sections
Add two sections in the correct numeric sequence (tools 27 and 28 based on current expectedTools count of 26). Format must exactly match the ## Tool N: `name` pattern required by TestToolsDocMatchesRegistry.
The site_login section must include a Terms and authorization subsection:
Major platforms (LinkedIn, Meta, Twitter/X, Instagram) explicitly prohibit automated access to logged-in content in their Terms of Service. Using site_login with these platforms may result in account suspension. This tool is intended for use with your own accounts on sites where you have authorization for programmatic access (e.g., internal systems, developer environments, sites whose ToS permit it). The MCP server and its authors accept no liability for ToS violations or resulting account actions.
Security model (authoritative summary)
Domain isolation: the primary gate
eTLD+1(targetURL.Hostname()) is computed deterministically in Go via golang.org/x/net/publicsuffix.EffectiveTLDPlusOne. This value is the vault key. No LLM-settable parameter, tool argument, or scraped page content can influence which vault entry is retrieved for a given scrape_page call. A prompt-injection attack instructing the model to "use LinkedIn cookies for attacker.com" fails because the gate is server-side and unconditional.
A second per-cookie check inside injectVaultCookies / injectCookiesIntoBrowserPage verifies eTLD+1(cookieDomain) == eTLD+1(targetURL.Hostname()) as defense-in-depth against deserialization bugs or future storage tampering.
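The per-cookie check can be sketched as a small filter. This is an illustration under stated assumptions, not the project's injectVaultCookies: the Cookie type and filterForTarget are hypothetical, and naiveRegistrable is a stdlib-only placeholder for publicsuffix.EffectiveTLDPlusOne that simply keeps the last two labels (it is wrong for suffixes such as co.uk).

```go
package main

import (
	"fmt"
	"strings"
)

// Cookie is a reduced, hypothetical stand-in for the stored cookie type.
type Cookie struct {
	Name, Value, Domain string
}

// naiveRegistrable keeps the last two labels of a host. It is a placeholder
// for publicsuffix.EffectiveTLDPlusOne and is wrong for suffixes like co.uk.
func naiveRegistrable(host string) string {
	host = strings.ToLower(strings.TrimPrefix(host, "."))
	labels := strings.Split(host, ".")
	if len(labels) < 2 {
		return ""
	}
	return strings.Join(labels[len(labels)-2:], ".")
}

// filterForTarget keeps only cookies whose registrable domain equals the
// target host's registrable domain. If the target has no registrable
// domain (for example "localhost"), nothing is injected.
func filterForTarget(cookies []Cookie, targetHost string, registrable func(string) string) []Cookie {
	want := registrable(targetHost)
	if want == "" {
		return nil
	}
	var out []Cookie
	for _, c := range cookies {
		if registrable(c.Domain) == want {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	stored := []Cookie{
		{Name: "li_at", Value: "redacted", Domain: ".linkedin.com"},
		{Name: "tracker", Value: "redacted", Domain: "evil.com"},
	}
	for _, c := range filterForTarget(stored, "www.linkedin.com", naiveRegistrable) {
		fmt.Println("would inject:", c.Name)
	}
}
```

Passing the registrable-domain function as a parameter keeps the filter testable without the public suffix list; production code would pass a wrapper around the real lookup.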
Cross-domain redirect handling: for HTTP tiers, cookiejar with publicsuffix.List enforces RFC 6265 per-hop — a 302 from linkedin.com → evil.com drops the LinkedIn cookie on the second hop automatically.
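The jar behaviour this relies on can be checked with the standard library alone. The sketch below seeds a jar with one Secure cookie and asks which requests would carry it. cookiesSentTo is a hypothetical helper, and the jar is built with nil options for brevity; the plan's version would pass publicsuffix.List.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/cookiejar"
	"net/url"
)

// cookiesSentTo seeds a fresh jar with one Secure cookie scoped to
// linkedin.com, as if captured from seedURL, then reports how many
// cookies the jar would attach to a request for targetURL.
func cookiesSentTo(seedURL, targetURL string) int {
	// nil options skip the public suffix list. That is acceptable for a
	// demo; production code would pass cookiejar.Options{PublicSuffixList: ...}.
	jar, err := cookiejar.New(nil)
	if err != nil {
		return -1
	}
	seed, err := url.Parse(seedURL)
	if err != nil {
		return -1
	}
	target, err := url.Parse(targetURL)
	if err != nil {
		return -1
	}
	jar.SetCookies(seed, []*http.Cookie{{
		Name: "session", Value: "redacted", Domain: "linkedin.com", Path: "/", Secure: true,
	}})
	return len(jar.Cookies(target))
}

func main() {
	const seed = "https://www.linkedin.com/"
	fmt.Println("same site, https:", cookiesSentTo(seed, "https://www.linkedin.com/feed"))
	fmt.Println("other site:", cookiesSentTo(seed, "https://evil.com/"))
	fmt.Println("same site, http:", cookiesSentTo(seed, "http://www.linkedin.com/feed"))
}
```

Because http.Client consults the jar on every hop of a redirect chain, the same lookup is what keeps the cookie off a request that has been redirected to another site.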
Credential vs. cookie
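The user's password is typed into the real login form inside the headed Chrome window. The server reads the post-auth cookie state over CDP through go-rod's page.Cookies, HttpOnly cookies included. Note that page.Cookies issues Network.getCookies, and when called with nil it returns the cookies that apply to the page's current URL rather than every cookie in the profile (Network.getAllCookies is a different CDP call); confirm this against the pinned go-rod version and pass explicit URLs if cookies from sibling subdomains are needed. The server never intercepts, stores, or transmits the password.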
Encryption at rest
Reuses persist.DiskStore (AES-256-GCM, per-blob AAD, atomic rename, 0600 files, 0700 dir, SHA-256-hashed filenames, lazy two-key rotation). Dedicated COOKIE_VAULT_ENCRYPTION_KEY MUST differ from CACHE_ENCRYPTION_KEY — this limits blast radius from cache key exposure and ensures the high-value vault has an independent secret.
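A startup check for the key-separation rule could look like the sketch below. VaultConfig and validateVaultConfig are hypothetical names, the 7-day ceiling is taken from the Session TTL section, and rejecting (rather than clamping) an out-of-range TTL is an assumption about how Load() should behave.

```go
package main

import (
	"errors"
	"fmt"
)

// VaultConfig is a hypothetical subset of the plan's CookieVaultConfig.
type VaultConfig struct {
	Enabled       bool
	EncryptionKey string
	MaxTTLHours   int
}

// hardMaxTTLHours is the 7-day ceiling from the Session TTL section.
const hardMaxTTLHours = 7 * 24

// validateVaultConfig rejects configurations that would weaken the vault.
// A disabled vault is always valid.
func validateVaultConfig(v VaultConfig, cacheKey string) error {
	if !v.Enabled {
		return nil
	}
	if v.EncryptionKey == "" {
		return errors.New("COOKIE_VAULT_ENCRYPTION_KEY is required when the vault is enabled")
	}
	if v.EncryptionKey == cacheKey {
		return errors.New("COOKIE_VAULT_ENCRYPTION_KEY must differ from CACHE_ENCRYPTION_KEY")
	}
	if v.MaxTTLHours < 1 || v.MaxTTLHours > hardMaxTTLHours {
		return fmt.Errorf("COOKIE_VAULT_MAX_TTL_HOURS must be between 1 and %d, got %d", hardMaxTTLHours, v.MaxTTLHours)
	}
	return nil
}

func main() {
	cfg := VaultConfig{Enabled: true, EncryptionKey: "shared-key", MaxTTLHours: 24}
	fmt.Println(validateVaultConfig(cfg, "shared-key"))
}
```

Failing at startup is the safer choice here: a silently shared key would only surface after a cache key leak.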
Session TTL
Effective TTL = min(COOKIE_VAULT_MAX_TTL_HOURS, time remaining until the earliest server-set cookie expiry). Default cap: 24h. Hard max: 7 days. The persist.DiskStore 8-byte expiry prefix enforces TTL at read time without a background cleaner; an expired bundle returns (Bundle{}, false) on Get.
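The clamping rule can be written as one pure function covering the four branches listed for TestEffectiveTTL_Cases. This is a sketch: the signature is a guess, and returning 0 for an expiry that has already passed (meaning "do not store") is an assumption, since the plan does not say what that branch yields. base, dayTTL, noExpiry, and hoursFrom exist only to make the demo deterministic.

```go
package main

import (
	"fmt"
	"time"
)

// Demo fixtures (hypothetical): a fixed clock and a 24h cap.
var base = time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)

const dayTTL = 24 * time.Hour

// noExpiry is the zero time, used when every cookie is a session cookie.
var noExpiry time.Time

// hoursFrom returns t shifted by h hours (h may be negative).
func hoursFrom(t time.Time, h int) time.Time {
	return t.Add(time.Duration(h) * time.Hour)
}

// effectiveTTL returns how long a bundle may live in the vault:
//   - zero expiry (all session cookies): maxTTL
//   - expiry already passed: 0, treated here as "do not store" (assumption)
//   - expiry sooner than maxTTL: the time remaining
//   - expiry later than maxTTL: maxTTL
func effectiveTTL(now, earliestExpiry time.Time, maxTTL time.Duration) time.Duration {
	if earliestExpiry.IsZero() {
		return maxTTL
	}
	remaining := earliestExpiry.Sub(now)
	if remaining <= 0 {
		return 0
	}
	if remaining < maxTTL {
		return remaining
	}
	return maxTTL
}

func main() {
	fmt.Println("session only:", effectiveTTL(base, noExpiry, dayTTL))
	fmt.Println("expires in 1h:", effectiveTTL(base, hoursFrom(base, 1), dayTTL))
	fmt.Println("expires in 48h:", effectiveTTL(base, hoursFrom(base, 48), dayTTL))
	fmt.Println("already expired:", effectiveTTL(base, hoursFrom(base, -1), dayTTL))
}
```

Taking now as a parameter instead of calling time.Now() inside the function is what makes the TTL tests independent of wall-clock time.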
Headed browser isolation
The auth-capture browser is launched fresh per site_login call via launcher.New() with an os.MkdirTemp UserDataDir. It NEVER touches getBrowserPool() or poolOnce. The temp dir is removed via defer os.RemoveAll on all exit paths. The scraping pool (headless, shared, poolOnce singleton) is completely unaffected.
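The cleanup guarantee can be isolated from the browser code. The sketch below shows only the temp-directory lifecycle, with the browser launch replaced by a callback; withTempProfile, dirExists, and errCaptureTimeout are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// errCaptureTimeout stands in for any failure during login capture.
var errCaptureTimeout = errors.New("capture timed out")

// withTempProfile creates a throwaway profile directory, passes it to fn,
// and removes it on every exit path, including when fn returns an error
// or panics.
func withTempProfile(fn func(dir string) error) (err error) {
	dir, err := os.MkdirTemp("", "site-login-*")
	if err != nil {
		return fmt.Errorf("create temp profile: %w", err)
	}
	defer func() {
		// Report a cleanup failure only if fn itself succeeded.
		if rmErr := os.RemoveAll(dir); rmErr != nil && err == nil {
			err = fmt.Errorf("remove temp profile: %w", rmErr)
		}
	}()
	return fn(dir)
}

// dirExists reports whether path is an existing directory.
func dirExists(path string) bool {
	info, err := os.Stat(path)
	return err == nil && info.IsDir()
}

func main() {
	var used string
	err := withTempProfile(func(dir string) error {
		used = dir
		return errCaptureTimeout // simulate a failed capture
	})
	fmt.Println("error:", err)
	fmt.Println("profile still on disk:", dirExists(used))
}
```

Surfacing the RemoveAll error matters for this feature: a profile directory left behind would contain the captured session in plaintext.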
Audit log safety
Cookie values MUST NOT appear in any audit event, log line, or exported data. Audit events carry: tool_name, tenant_id, user_id, timestamp, success, metadata: {registrableDomain, cookieCount, expiresAt}. Data-subject export carries: domain, capturedAt, expiresAt, cookieCount — never value.
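One way to make the rule hard to break is to give the export path its own type with no field that can hold a cookie value. The sketch below assumes hypothetical Cookie, Bundle, and ExportEntry shapes built from the field names listed above; the real structs may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"time"
)

// Cookie and Bundle are reduced, hypothetical versions of the vault types.
type Cookie struct {
	Name, Value, Domain string
}

type Bundle struct {
	Domain     string
	CapturedAt time.Time
	ExpiresAt  time.Time
	Cookies    []Cookie
}

// ExportEntry is what leaves the vault for audit or data-subject export.
// It has no field that could carry a cookie value.
type ExportEntry struct {
	Domain      string    `json:"domain"`
	CapturedAt  time.Time `json:"capturedAt"`
	ExpiresAt   time.Time `json:"expiresAt"`
	CookieCount int       `json:"cookieCount"`
}

// summarize copies only non-secret metadata out of a bundle.
func summarize(b Bundle) ExportEntry {
	return ExportEntry{
		Domain:      b.Domain,
		CapturedAt:  b.CapturedAt,
		ExpiresAt:   b.ExpiresAt,
		CookieCount: len(b.Cookies),
	}
}

// exportJSON renders the summary as JSON.
func exportJSON(b Bundle) (string, error) {
	raw, err := json.Marshal(summarize(b))
	return string(raw), err
}

// leaks reports whether secret appears anywhere in text.
func leaks(text, secret string) bool {
	return strings.Contains(text, secret)
}

func main() {
	b := Bundle{
		Domain:     "linkedin.com",
		CapturedAt: time.Date(2025, 1, 1, 12, 0, 0, 0, time.UTC),
		Cookies:    []Cookie{{Name: "li_at", Value: "s3cr3t-token", Domain: ".linkedin.com"}},
	}
	out, err := exportJSON(b)
	fmt.Println(out, err)
	fmt.Println("leaks value:", leaks(out, "s3cr3t-token"))
}
```

A test in the style of TestStoreImpl_ExportUser_NoCookieValues can then assert on the serialized output rather than on struct fields, which also catches values smuggled in through a later-added field.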
SSRF defense-in-depth
Cookie injection does not weaken SSRF protection:
- HTTP tiers: NewSSRFSafeClientWithCookies uses the same newSSRFSafeTransport, so private IP blocking applies regardless of auth state.
- Browser tier: CDP page.SetCookies is domain-scoped; the SSRF-safe client and the eTLD+1 gate both apply before injection.
A prompt-injected SSRF attempt (e.g., scrape http://169.254.169.254/ with LinkedIn cookies) fails at the SSRF transport layer — cookies are irrelevant to the IP-level block.
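The project's newSSRFSafeTransport is not shown in this section, so the sketch below is a stand-in that illustrates the general technique rather than the actual implementation: a net.Dialer Control hook that inspects the resolved IP just before connect. blockPrivate and newClient are hypothetical names. The point is that attaching a cookie jar does not change the dial path.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/http"
	"net/http/cookiejar"
	"syscall"
	"time"
)

var errBlockedAddr = errors.New("destination address is not publicly routable")

// blockPrivate is a net.Dialer Control hook. It runs after DNS resolution
// and immediately before connect, so it sees the IP actually being dialed.
func blockPrivate(_, address string, _ syscall.RawConn) error {
	host, _, err := net.SplitHostPort(address)
	if err != nil {
		return err
	}
	ip := net.ParseIP(host)
	if ip == nil {
		return errBlockedAddr // unparseable (for example a zoned IPv6 address)
	}
	if ip.IsLoopback() || ip.IsPrivate() || ip.IsLinkLocalUnicast() ||
		ip.IsMulticast() || ip.IsUnspecified() {
		return errBlockedAddr
	}
	return nil
}

// newClient builds an HTTP client whose dialer refuses non-public IPs.
// The optional cookie jar is layered on top and never touches dialing.
func newClient(withCookies bool) (*http.Client, error) {
	dialer := &net.Dialer{Timeout: 10 * time.Second, Control: blockPrivate}
	client := &http.Client{
		Transport: &http.Transport{DialContext: dialer.DialContext},
		Timeout:   30 * time.Second,
	}
	if withCookies {
		jar, err := cookiejar.New(nil)
		if err != nil {
			return nil, err
		}
		client.Jar = jar
	}
	return client, nil
}

func main() {
	client, err := newClient(true)
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	_, err = client.Get("http://169.254.169.254/")
	fmt.Println("metadata endpoint request error:", err)
}
```

Checking at dial time rather than on the URL string also covers hostnames that resolve to private addresses.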
ToS disclaimer (required in tool description and docs)
The site_login tool description MUST include:
"IMPORTANT: Automating a logged-in session may violate the website's Terms of Service. Use only with your own accounts on sites you are authorized to access programmatically."
LinkedIn §8.2, Twitter/X ToS (Sept 2023), Meta §3.2.3, Instagram §4.2 all explicitly prohibit automated access including with valid credentials. Use of this feature with those platforms is a ToS violation the user knowingly accepts. The tool's docs section must state this clearly, mirroring the project's "plain language tone" standard (no jargon — a grad student must understand the risk).
Test matrix
Unit tests: internal/cookievault/cookievault_test.go
| Test | Verifies |
| --- | --- |
| TestRegistrableDomain_Known | linkedin.com, www.linkedin.com, sub.linkedin.com → "linkedin.com" |
| TestRegistrableDomain_Invalid | IP, localhost, "not-a-domain!!!" → "" |
| TestNoop_AllOps | All methods no-op and never panic |
| TestStoreImpl_SaveGet_RoundTrip | Save then Get returns same bundle |
| TestStoreImpl_Overwrite | Second Save same domain overwrites; ListDomains still one entry |
| TestStoreImpl_TTLExpiry | Bundle expires after TTL (inject clock via MemoryStore mocked TTL) |
| TestStoreImpl_EarliestExpiry_Clamps_TTL | Cookie expiry in 1h → TTL ≤ 1h |
| TestStoreImpl_AllSessionCookies_UsesMaxTTL | All-session bundle (Expires zero) → TTL = maxTTL |
| TestStoreImpl_Delete | Delete removes bundle and index entry |
| TestStoreImpl_ListDomains_LazyPrune | Expired bundles removed from index on next ListDomains call |
| TestStoreImpl_ExportUser_NoCookieValues | ExportUser result contains domain/capturedAt/count; Value field absent from JSON |
| TestStoreImpl_EraseUser | All bundles + index removed; ListDomains → nil |
| TestDomainIsolation_TenantBoundary | TenantA cookies NOT returned for TenantB same user |
| TestDomainIsolation_UserBoundary | UserA cookies NOT returned for UserB same tenant |
| TestEffectiveTTL_Cases | All four branches: zero expiry, past expiry, expiry < maxTTL, expiry > maxTTL |
Unit tests: internal/scraper/capture_test.go
| Test | Verifies |
| --- | --- |
| TestDetectLoginComplete_KnownSessionCookie | li_at present → true for linkedin.com |
| TestDetectLoginComplete_URLChangedFromLogin | URL /login → /feed, no known cookie → true |
| TestDetectLoginComplete_StillOnLogin | URL still /login → false |
| TestDetectLoginComplete_NoCookies | URL changed but no domain cookies → false |
| TestBuildBundle_DomainFilter | google.com cookie excluded from linkedin.com bundle |
| TestBuildBundle_EarliestExpiry | Multi-cookie; soonest expiry wins |
| TestBuildBundle_AllSessionCookies | EarliestExpiry zero when all Expires zero |
| TestCaptureLoginSession_HookOverride | captureHook fires; no browser launched |
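The four TestDetectLoginComplete cases imply a two-signal heuristic. The sketch below is reconstructed from the test table alone and is not the function in capture.go: the signature, the knownSessionCookies map (seeded only with li_at, the one name the table mentions), and the "login"/"signin" path check are all assumptions.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// Cookie is a reduced, hypothetical cookie type.
type Cookie struct {
	Name, Domain string
}

// knownSessionCookies maps a registrable domain to cookie names that only
// exist after a successful login. li_at comes from the test table.
var knownSessionCookies = map[string][]string{
	"linkedin.com": {"li_at"},
}

// looksLikeLogin reports whether the URL path still looks like a login page.
// An unparseable URL is treated as "still logging in".
func looksLikeLogin(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return true
	}
	p := strings.ToLower(u.Path)
	return strings.Contains(p, "login") || strings.Contains(p, "signin")
}

// detectLoginComplete returns true when a known session cookie is present,
// or when the page has left the login path and at least one cookie for the
// domain exists.
func detectLoginComplete(domain, currentURL string, cookies []Cookie) bool {
	domainCookies := 0
	for _, c := range cookies {
		d := strings.TrimPrefix(strings.ToLower(c.Domain), ".")
		if d != domain && !strings.HasSuffix(d, "."+domain) {
			continue // cookie belongs to another site
		}
		domainCookies++
		for _, name := range knownSessionCookies[domain] {
			if c.Name == name {
				return true
			}
		}
	}
	return domainCookies > 0 && !looksLikeLogin(currentURL)
}

func main() {
	cookies := []Cookie{{Name: "bcookie", Domain: ".linkedin.com"}}
	fmt.Println(detectLoginComplete("linkedin.com", "https://www.linkedin.com/login", cookies))
	fmt.Println(detectLoginComplete("linkedin.com", "https://www.linkedin.com/feed/", cookies))
}
```

The URL-change signal is the weaker of the two, since some sites redirect away from /login to an intermediate 2FA page; the known-cookie list is the more reliable check where one is available.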
Unit tests: internal/scraper/pipeline_test.go (additions)
| Test | Verifies |
| --- | --- |
| TestVaultCookiesForURL_NilVault | Returns nil without panic |
| TestVaultCookiesForURL_NoMatch | Vault has cookies for other domain → nil |
| TestVaultCookiesForURL_Match | Returns bundle cookies for matching domain |
| TestInjectVaultCookies_NilVault | No-op; page.SetCookies never called |
| TestInjectVaultCookies_DomainMismatch | Cross-domain cookie excluded |
| TestInjectVaultCookies_SecureDrop | Secure cookie not injected into http:// target |
| TestNewSSRFSafeClientWithCookies_ScopedToTargetDomain | Jar only sends cookies to target domain; other-domain httptest server receives none |
| TestNewSSRFSafeClientWithCookies_SecureCookieDroppedOnHTTP | Secure cookie absent on http:// request |
| TestNewSSRFSafeClientWithCookies_NilCookies | Returns valid client without panic |
Unit tests: internal/tools/tools_test.go (additions)
| Test | Verifies |
| --- | --- |
| TestSiteLogin_NotRegisteredWhenNoop | ListTools excludes site_login when CookieVault is Noop |
| TestSiteLogin_HTTPMode | StdioMode=false → {status:"http_mode_unsupported"} |
| TestSiteLogin_NoConsent | Missing consent → {status:"no_consent"} |
| TestSiteLogin_EmptyURL | toolError("url is required") |
| TestSiteLogin_InvalidDomain | domain="bad!!!" → toolError |
| TestSiteLogin_CaptureSuccess | captureHook returns valid bundle; vault.Get confirms storage; output status="captured", cookieCount matches |
| TestSiteLogin_CaptureTimeout | captureHook returns error; structuredError with ErrKindAuth returned |
| TestSiteLogin_AuditNoCookieValues | Mock auditor asserts no Value field in audit metadata |
| TestSiteLogout_OK | site_login then site_logout; subsequent vault.Get returns (_, false) |
| TestSiteLogout_NotFound | Logout on domain with no stored session → {status:"not_found"} (not an error) |
| TestSiteLogout_Idempotent | Two logout calls → "ok" then "not_found"; never an error |
| TestSiteLogout_NoConsent | Missing consent → {status:"no_consent"} |
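The idempotency contract behind TestSiteLogout_OK, _NotFound, and _Idempotent can be sketched with an in-memory map. The vault type, its key layout, and the logout method are hypothetical; the real handler works against cookievault.Store.

```go
package main

import (
	"fmt"
	"sync"
)

// vault is a hypothetical in-memory stand-in for the cookie vault.
// The map value is the stored cookie count.
type vault struct {
	mu      sync.Mutex
	bundles map[string]int
}

// key scopes an entry to tenant, user, and registrable domain.
func key(tenantID, userID, domain string) string {
	return tenantID + "\x00" + userID + "\x00" + domain
}

func newVault() *vault {
	return &vault{bundles: make(map[string]int)}
}

func (v *vault) save(tenantID, userID, domain string, cookieCount int) {
	v.mu.Lock()
	defer v.mu.Unlock()
	v.bundles[key(tenantID, userID, domain)] = cookieCount
}

// logout returns "ok" when a session was removed and "not_found" when
// there was nothing to remove. It never reports an error.
func (v *vault) logout(tenantID, userID, domain string) string {
	v.mu.Lock()
	defer v.mu.Unlock()
	k := key(tenantID, userID, domain)
	if _, ok := v.bundles[k]; !ok {
		return "not_found"
	}
	delete(v.bundles, k)
	return "ok"
}

func main() {
	v := newVault()
	v.save("tenant-a", "user-1", "linkedin.com", 5)
	fmt.Println(v.logout("tenant-a", "user-1", "linkedin.com"))
	fmt.Println(v.logout("tenant-a", "user-1", "linkedin.com"))
}
```

Holding one lock across the existence check and the delete is the design point: a separate Get followed by Delete, as in the handler outline, can report "ok" twice if two logout calls race.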
Integration tests: injection round-trip
In internal/scraper/pipeline_test.go, add an HTTPS httptest.Server that:
- Returns 200 OK with Content-Type: text/html and a substantial article body when the auth cookie is present
- Returns 401 Unauthorized when the auth cookie is absent
| Test | Verifies |
| --- | --- |
| TestHTTPTierCookieInjection_AuthSuccess | Vault pre-seeded; scrape returns full content |
| TestHTTPTierCookieInjection_NoCookies | Vault empty; scrape returns 401/blocked |
| TestHTTPTierCookieInjection_WrongDomain | Vault has cookies for different eTLD+1; not injected; 401 |
| TestHTTPTierCookieInjection_ExpiredBundle | Vault TTL elapsed; Get returns false; 401 |
| TestHTTPTierCookieInjection_TenantIsolation | TenantA cookies not injected for TenantB context |
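The auth-gated test server described above might be set up as follows. This is a sketch with assumed names: the cookie name "session", the handler authGated, and the helper statusFor are illustrative. The handler is exercised in memory with httptest.NewRecorder and also mounted on httptest.NewTLSServer to match the HTTPS requirement.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// authCookieName is a hypothetical cookie name for the fake gated site.
const authCookieName = "session"

// authGated serves a long article when the auth cookie is present and
// answers 401 otherwise.
func authGated(w http.ResponseWriter, r *http.Request) {
	if c, err := r.Cookie(authCookieName); err != nil || c.Value == "" {
		http.Error(w, "login required", http.StatusUnauthorized)
		return
	}
	w.Header().Set("Content-Type", "text/html")
	body := strings.Repeat("Gated article text. ", 40)
	fmt.Fprintf(w, "<html><body><article>%s</article></body></html>", body)
}

// statusFor runs the handler in memory and returns the response status.
// An empty cookieValue sends no cookie at all.
func statusFor(cookieValue string) int {
	req := httptest.NewRequest(http.MethodGet, "https://example.test/post", nil)
	if cookieValue != "" {
		req.AddCookie(&http.Cookie{Name: authCookieName, Value: cookieValue})
	}
	rec := httptest.NewRecorder()
	authGated(rec, req)
	return rec.Code
}

func main() {
	// The same handler behind a real TLS listener; srv.Client() trusts
	// the server's self-signed certificate.
	srv := httptest.NewTLSServer(http.HandlerFunc(authGated))
	defer srv.Close()
	resp, err := srv.Client().Get(srv.URL)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("no cookie over TLS:", resp.StatusCode)
	fmt.Println("with cookie in memory:", statusFor("abc"))
}
```

One caveat for the pipeline tests: an httptest server listens on 127.0.0.1, which an SSRF-safe transport will refuse to dial, so the tests presumably need either a test-only transport override or a handler-level harness like statusFor.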
E2E tests (tests/e2e/ — //go:build live)
// TestE2E_LinkedInWithPreloadedSession:
// Prereq: LINKEDIN_TEST_BUNDLE_JSON env var — pre-captured bundle JSON for a test account.
// 1. Unmarshal bundle and Save directly into vault (bypass headed browser in CI).
// 2. Call scrape_page on a known public LinkedIn post URL.
// 3. Assert extractionQuality:"complete" and content length > 200 chars.
//
// TestE2E_SiteLogout_RestoresUnauthState:
// Follows TestE2E_LinkedInWithPreloadedSession.
// 1. Call site_logout for "linkedin.com".
// 2. Call scrape_page on the same URL.
// 3. Assert extractionQuality:"partial" (unauthenticated = back to scraping without auth).
Build order
Each step is independently compilable and testable. Run go test -race ./... after each step.
1. internal/cookievault/cookievault.go + cookievault_test.go: no upstream changes; all unit tests pass
2. internal/consent/consent.go: PurposeSessionCapture + AllPurposes; no test changes
3. internal/config/config.go + .env.example: CookieVaultConfig struct + Load() + validation
4. internal/scraper/ssrf.go: NewSSRFSafeClientWithCookies; new unit tests for jar scoping
5. internal/scraper/pipeline.go: CookieVault cookievault.Reader in PipelineConfig; vaultCookiesForURL helper; nil-safe injection in HTTP tiers; injectVaultCookies helper; call site in scrapeBrowser. Existing tests pass (nil vault = zero behavior change)
6. internal/scraper/capture.go + capture_test.go: CaptureLoginSession + helpers; captureHook test seam
7. internal/tools/schemas.go: siteLoginOutputSchema, siteLogoutOutputSchema
8. internal/tools/site_session.go: registerSiteLogin, registerSiteLogout
9. internal/tools/registry.go: CookieVault field, StdioMode field, conditional RegisterAll block
10. internal/tools/metadata_test.go + tools_test.go: expectedTools, writeTools, setupTestDeps
11. cmd/web-researcher-mcp/main.go: vault construction, datasubject registration, pipeline wiring
12. docs/TOOLS.md: ## Tool 27: `site_login` + ## Tool 28: `site_logout` sections
13. go test -race ./... + make verify + make rebuild-local + IRL smoke test
Implementation notes
- golang.org/x/net/publicsuffix: run go list -m golang.org/x/net before go get. The plan expects the module to be present already as a transitive dependency (via go-rod) and therefore in go.sum, but confirm with the command rather than assume.
- proto.NetworkCookieParam.Expires type: check the pinned go-rod version's exact type (*proto.TimeSinceEpoch vs. proto.TimeSinceEpoch). The field is omitempty so omitting it for session cookies is safe.
- Browser incognito context for injection: stealth.Page(browser) creates a per-page context but shares the browser profile. For auth-injected scrapes, browser.MustIncognito().MustPage(url) creates a fully isolated context and prevents cookie state from leaking between concurrent authenticated scrapes. Evaluate using this for the browser tier when len(vaultCookies) > 0.
- captureHook test seam: the package-level var MUST be nil in production. Reset it in each test via t.Cleanup(func() { captureHook = nil }) to avoid test-order contamination.
- LinkedIn lidc 24h TTL: the lidc cookie (data-center routing) expires in 24 hours. After 24h a previously-captured LinkedIn session may start returning session-not-found errors on some requests. The vault's TTL enforcement (clamped to earliestExpiry) handles this correctly: the bundle expires when lidc expires, prompting the user to re-run site_login. This is by design and consistent with the honest-about-limitations principle.
- Consent auto-grant in STDIO mode: in STDIO mode with STDIO_USER_ID set, the existing memory/analytics consent is auto-granted on first run. PurposeSessionCapture should NOT be auto-granted; it requires explicit user acknowledgement of the ToS risk before first use. The consent grant for session_capture must be user-initiated (e.g., a separate MCP tool call to the consent endpoint, or via CONSENT_GRANT env for operator-controlled deployment).
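The captureHook seam from the notes above can be sketched as follows. The hook signature, the reduced Bundle type, and the withCaptureHook helper are assumptions made for illustration; in a real test the restore step would be registered with t.Cleanup as the note says.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Bundle is a reduced, hypothetical capture result.
type Bundle struct {
	Domain      string
	CookieCount int
}

// captureHook is the test seam. It MUST stay nil in production.
var captureHook func(ctx context.Context, rawURL, domain string) (Bundle, error)

var errNoBrowser = errors.New("headed browser launch is not part of this sketch")

// captureLoginSession defers to the hook when one is installed; otherwise
// the real implementation would launch the headed browser here.
func captureLoginSession(ctx context.Context, rawURL, domain string) (Bundle, error) {
	if captureHook != nil {
		return captureHook(ctx, rawURL, domain)
	}
	return Bundle{}, errNoBrowser
}

// withCaptureHook installs hook for the duration of fn and always restores
// the previous value, which is what t.Cleanup does in a real test.
func withCaptureHook(hook func(context.Context, string, string) (Bundle, error), fn func()) {
	prev := captureHook
	captureHook = hook
	defer func() { captureHook = prev }()
	fn()
}

// fakeHook simulates a successful capture without launching a browser.
func fakeHook(_ context.Context, _, domain string) (Bundle, error) {
	return Bundle{Domain: domain, CookieCount: 3}, nil
}

// demoCapture calls captureLoginSession with a background context.
func demoCapture(domain string) (Bundle, error) {
	return captureLoginSession(context.Background(), "https://www."+domain+"/login", domain)
}

func main() {
	withCaptureHook(fakeHook, func() {
		b, err := demoCapture("linkedin.com")
		fmt.Println("with hook:", b, err)
	})
	_, err := demoCapture("linkedin.com")
	fmt.Println("after restore:", err)
}
```

Restoring the previous value instead of hard-coding nil keeps nested or parallel-unsafe tests from clobbering each other's hooks, though a package-level seam still rules out t.Parallel for tests that use it.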
Acceptance criteria
- go test -race ./... passes with zero failures
- TestAllToolsRegistered: site_login + site_logout appear in listTools when CookieVault is non-Noop
- TestAllToolsHaveAnnotations: both tools carry writeAnnotations; read tools unchanged
- TestToolsDocMatchesRegistry: docs/TOOLS.md contains matching ## Tool N: sections
- TestOutputSchemaMatchesResponse: output schemas match actual JSON for both tools
- TestExternalContentToolsCarryTrustMarker: site_login output carries "user-asserted-content"
- Cookie values absent from audit metadata and the ExportUser result (asserted in TestSiteLogin_AuditNoCookieValues)
- site_login in HTTP mode returns {status:"http_mode_unsupported"} without launching a browser
- site_logout is idempotent: {status:"ok"} then {status:"not_found"}; never IsError:true
- eTLD+1 gate: cookies for linkedin.com NOT injected into google.com request (pipeline integration test)
- make rebuild-local → site_login / site_logout absent from tool list when COOKIE_VAULT_ENABLED is unset
- After EraseUser: ListDomains → nil, Get → (_, false) for any stored domain
- golang.org/x/net/publicsuffix available (verify: go list -m golang.org/x/net; almost certainly already present transitively via go-rod)
- Temp UserDataDir removed on every site_login exit path (success, timeout, abort, error)