Summary
Add Reddit as a search provider and native scrape_page route, mirroring the HackerNews integration (#242) in shape but not in access model. Unlike HN, Reddit cannot be a zero-config keyless drop-in — it requires OAuth, app registration/approval, and carries a hard data-deletion-propagation obligation. This is a real optional, key-gated feature (like Brave/Tavily/Exa), not a docs sweep.
Why Reddit is NOT another HN
Verified against Reddit's own Data API Wiki (accessed 2026-06-14, https://support.reddithelp.com/hc/en-us/articles/16160319875092-Reddit-Data-API-Wiki):
| Dimension | Hacker News (shipped, #242) | Reddit |
| --- | --- | --- |
| Auth | None (Firebase REST + Algolia, anonymous) | Mandatory OAuth: "Traffic not using OAuth or login credentials will be blocked." The legacy .json endpoints are gated too. |
| Onboarding | Zero | Register an OAuth app and request access, gated by the Responsible Builder Policy + Developer Terms + Data API Terms |
| Cost/commercial | Free, unrestricted | Free tier = 100 QPM per OAuth client id (10-min averaging window, bursts allowed); commercial use gated (was free until 2023) |
| Data retention | No obligation | 48h deletion propagation: must delete any cached post/comment/user content once deleted on Reddit; "retention… even if disassociated, de-identified or anonymized… is a violation" |
| Authority/mission fit | Dev-leaning, defensible | Low-authority UGC, the opposite of the "cites real, authoritative sources" moat |
praw (linked in the request) is a Python lib and irrelevant to our Go, zero-extra-deps stance — we'd implement OAuth2 client-credentials against the REST API directly with stdlib + NewSSRFSafeClient.
Scope (proposed)
- reddit search provider: internal/search/reddit.go implementing search.Provider.
  - OAuth2 client-credentials flow (REDDIT_CLIENT_ID + REDDIT_CLIENT_SECRET), token cached + refreshed on 401. Closest template = internal/search/exa.go (key-gated, breaker-wrapped, header auth, Name(), var _ Provider assertion). HN provider internal/search/hackernews.go is the closest sibling for result mapping.
  - Compliant User-Agent (Reddit mandates <platform>:<appID>:<version> (by /u/<user>); generic UAs are throttled): surface via an env var or build a descriptive default.
  - Web (search posts/comments) + News (new/rising). Images returns (nil, nil) (gallery/UGC images out of scope) and never an error, since an error would trip the breaker.
  - Register: add "reddit" to search.SupportedProviders (provider.go:117) + a case in NewProviderByName (provider.go:147); AvailableProviders (provider.go:188) picks it up automatically. Add ONLY to the general web tool enums (web_search, news_search, search_and_scrape), NOT image/academic/patent (same deliberate-omission rule as HN).
  - Config: RedditClientID/RedditClientSecret fields in config.go (mirror ExaAPIKey at :137) + required-when-SEARCH_PROVIDER=reddit check (mirror :256) + .env.example.
  - Own circuit.New(...) breaker via the AvailableProviders constructor.
- Native scrape_page routing for reddit.com / redd.it thread+comment URLs.
  - Add case isRedditURL(url): result, err = p.scrapeReddit(...) in pipeline.go:202 switch (alongside the HN case at :206). Template = internal/scraper/hackernews.go (isHNURL, scrapeHN, typed errors, NewSSRFSafeClient).
  - Fetches a thread's selftext + top comments via the OAuth'd API; honest typed errors (404→ErrNotFound, 429→rateLimitError, etc.).
- Low authorityTier: Reddit results/scrapes must band low (internal/content/classify.go:92 authorityTier); it's UGC. Never an academic/news authoritative fallback. This keeps the trust positioning intact.
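To make the routing item above concrete, a possible shape for isRedditURL is sketched below. The function name comes from this proposal, but its signature and the exact matching rules are assumptions: this version accepts only http(s) thread URLs of the form /r/<sub>/comments/<id>/... on reddit.com or its subdomains, plus redd.it short links, and it does not look at how isHNURL is actually written.

```go
package main

import (
	"net/url"
	"strings"
)

// isRedditURL reports whether raw looks like a Reddit thread URL that the
// native scrape route could handle. Hypothetical sketch, not the repo's code.
func isRedditURL(raw string) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return false
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return false
	}
	host := strings.ToLower(u.Hostname())

	// Short links: redd.it/<id>. Media hosts such as i.redd.it are
	// intentionally not matched because the host must equal redd.it exactly.
	if host == "redd.it" {
		return strings.Trim(u.Path, "/") != ""
	}

	// Exact domain or a true subdomain (www., old., ...). The leading dot in
	// the suffix check rejects look-alikes such as notreddit.com.
	if host != "reddit.com" && !strings.HasSuffix(host, ".reddit.com") {
		return false
	}

	// Thread paths look like /r/<sub>/comments/<id>[/<slug>/...].
	parts := strings.Split(strings.Trim(u.Path, "/"), "/")
	return len(parts) >= 4 && parts[0] == "r" && parts[2] == "comments" && parts[3] != ""
}
```

Subreddit listings, user pages and media hosts fall through to the generic scraper in this sketch; whether they deserve native handling is a separate scope question.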
Compliance gate (Security Rule #7 — must resolve BEFORE merge)
The 48h deletion-propagation obligation collides with any persistence:
- Response cache is already TTL'd (~1h) + AES-encrypted — incidentally compliant.
- Decision needed: do NOT persist Reddit content to the session store / disk persist layer beyond the request cache, OR build explicit eviction. Simplest compliant stance: never persist Reddit-sourced content past the response cache TTL; document it. No accumulation, no separate deletion daemon.
- Tenant boundary unchanged (cache key is content-hash, no per-user Reddit data).
Open questions / decision points
- Does it earn a slot? Authority-fit is weak and the compliance cost is real. Recommend parking against the trust-first roadmap unless there's clear demand — file now so it's tracked, decide priority separately.
- Commercial-use terms: confirm our free-tier 100 QPM eligibility covers an open-source MCP server before promoting it.
Verification
- Unit: OAuth token fetch/refresh (mock httptest), isRedditURL matching, result mapping, Images returns (nil,nil), low authorityTier.
- Live (//go:build live): real OAuth client-credentials + search + thread scrape; skip when creds absent (same pattern as keyed providers).
- Drift gates: provider count is registry-derived (no hardcoded counts); add reddit to web-tool enum docs in the same PR; make verify green.
- Confirm no Reddit content lands in session/persist stores (grep + a persistence test).
Reuse anchors (verified 2026-06-14)
- internal/search/exa.go: key-gated provider w/ header auth, breaker, Name(), var _ Provider (OAuth template)
- internal/search/hackernews.go: sibling result mapping
- internal/scraper/hackernews.go: isHNURL/scrapeHN native-route template
- internal/search/provider.go:117/147/188: SupportedProviders / NewProviderByName / AvailableProviders
- internal/scraper/pipeline.go:202: scrape routing switch
- internal/config/config.go:137/256: key field + required-when-selected gate
- internal/content/classify.go:92: authorityTier banding
Refs: https://github.com/praw-dev/praw · https://www.reddit.com/dev/api/ · https://support.reddithelp.com/hc/en-us/articles/16160319875092-Reddit-Data-API-Wiki