feat: optional key-gated reddit provider + native scrape_page routing (OAuth, low authorityTier, deletion-propagation) #243

@zoharbabin

Summary

Add Reddit as a search provider and native scrape_page route, mirroring the HackerNews integration (#242) in shape but not in access model. Unlike HN, Reddit cannot be a zero-config keyless drop-in — it requires OAuth, app registration/approval, and carries a hard data-deletion-propagation obligation. This is a real optional, key-gated feature (like Brave/Tavily/Exa), not a docs sweep.

Why Reddit is NOT another HN

Verified against Reddit's own Data API Wiki (accessed 2026-06-14, https://support.reddithelp.com/hc/en-us/articles/16160319875092-Reddit-Data-API-Wiki):

| Dimension | Hacker News (shipped, #242) | Reddit |
|---|---|---|
| Auth | None (Firebase REST + Algolia, anonymous) | Mandatory OAuth: "Traffic not using OAuth or login credentials will be blocked." The legacy .json endpoints are gated too. |
| Onboarding | Zero | Register an OAuth app and request access, gated by the Responsible Builder Policy + Developer Terms + Data API Terms |
| Cost/commercial | Free, unrestricted | Free tier = 100 QPM per OAuth client id (10-min averaging window, bursts allowed); commercial use gated (was free until 2023) |
| Data retention | No obligation | 48h deletion propagation: must delete any cached post/comment/user content once deleted on Reddit; "retention… even if disassociated, de-identified or anonymized… is a violation" |
| Authority/mission fit | Dev-leaning, defensible | Low-authority UGC, the opposite of the "cites real, authoritative sources" moat |

praw (linked in the request) is a Python lib and irrelevant to our Go, zero-extra-deps stance — we'd implement OAuth2 client-credentials against the REST API directly with stdlib + NewSSRFSafeClient.

Scope (proposed)

  1. reddit search provider: internal/search/reddit.go implementing search.Provider.

    • OAuth2 client-credentials flow (REDDIT_CLIENT_ID + REDDIT_CLIENT_SECRET), token cached + refreshed on 401. Closest template = internal/search/exa.go (key-gated, breaker-wrapped, header auth, Name(), var _ Provider assertion). HN provider internal/search/hackernews.go is the closest sibling for result mapping.
    • Compliant User-Agent (Reddit mandates <platform>:<appID>:<version> (by /u/<user>); generic UAs are throttled) — surface via an env or build a descriptive default.
    • Web (search posts/comments) + News (new/rising). Images returns (nil, nil) (gallery/UGC images out of scope) — never an error (would trip the breaker).
    • Register: add "reddit" to search.SupportedProviders (provider.go:117) + a case in NewProviderByName (provider.go:147); AvailableProviders (provider.go:188) picks it up automatically. Add ONLY to the general web tool enums (web_search, news_search, search_and_scrape), NOT image/academic/patent (same deliberate-omission rule as HN).
    • Config: RedditClientID/RedditClientSecret fields in config.go (mirror ExaAPIKey at :137) + required-when-SEARCH_PROVIDER=reddit check (mirror :256) + .env.example.
    • Own circuit.New(...) breaker via the AvailableProviders constructor.
  2. Native scrape_page routing for reddit.com / redd.it thread+comment URLs.

    • Add case isRedditURL(url): result, err = p.scrapeReddit(...) in pipeline.go:202 switch (alongside the HN case at :206). Template = internal/scraper/hackernews.go (isHNURL, scrapeHN, typed errors, NewSSRFSafeClient).
    • Fetches a thread's selftext + top comments via the OAuth'd API; honest typed errors (404→ErrNotFound, 429→rateLimitError, etc.).
  3. Low authorityTier — Reddit results/scrapes must band low (internal/content/classify.go:92 authorityTier); it's UGC. Never an academic/news authoritative fallback. This keeps the trust positioning intact.

Compliance gate (Security Rule #7 — must resolve BEFORE merge)

The 48h deletion-propagation obligation collides with any persistence:

  • Response cache is already TTL'd (~1h) + AES-encrypted — incidentally compliant.
  • Decision needed: do NOT persist Reddit content to the session store / disk persist layer beyond the request cache, OR build explicit eviction. Simplest compliant stance: never persist Reddit-sourced content past the response cache TTL; document it. No accumulation, no separate deletion daemon.
  • Tenant boundary unchanged (cache key is content-hash, no per-user Reddit data).

Open questions / decision points

  • Does it earn a slot? Authority-fit is weak and the compliance cost is real. Recommend parking against the trust-first roadmap unless there's clear demand — file now so it's tracked, decide priority separately.
  • Commercial-use terms: confirm our free-tier 100 QPM eligibility covers an open-source MCP server before promoting it.

Verification

  • Unit: OAuth token fetch/refresh (mock httptest), isRedditURL matching, result mapping, Images returns (nil,nil), low authorityTier.
  • Live (//go:build live): real OAuth client-credentials + search + thread scrape; skip when creds absent (same pattern as keyed providers).
  • Drift gates: provider count is registry-derived (no hardcoded counts); add reddit to web-tool enum docs in the same PR; make verify green.
  • Confirm no Reddit content lands in session/persist stores (grep + a persistence test).

Reuse anchors (verified 2026-06-14)

  • internal/search/exa.go: key-gated provider w/ header auth, breaker, Name(), var _ Provider (OAuth template)
  • internal/search/hackernews.go: sibling result mapping; internal/scraper/hackernews.go: isHNURL/scrapeHN native-route template
  • internal/search/provider.go:117/147/188: SupportedProviders / NewProviderByName / AvailableProviders
  • internal/scraper/pipeline.go:202: scrape routing switch
  • internal/config/config.go:137/256: key field + required-when-selected gate
  • internal/content/classify.go:92: authorityTier banding

Refs: https://github.com/praw-dev/praw · https://www.reddit.com/dev/api/ · https://support.reddithelp.com/hc/en-us/articles/16160319875092-Reddit-Data-API-Wiki
