feat: Reddit quality signals in scrape output (upvotes, comments, post age) #247

## Summary

The Semrush study cited in The Atlantic's "sloptimization" piece found that most Reddit posts cited by AI chatbots have fewer than 20 upvotes. When the scraper fetches a Reddit page, this engagement signal is neither extracted nor surfaced, so an AI assistant reading the scrape output cannot distinguish a viral community thread from a sock-puppet post. This issue proposes exposing Reddit-specific engagement metadata (upvotes, comment count, post age) parsed from the `DiscussionForumPosting` JSON-LD block (`upvoteCount`, `interactionStatistic`, `datePublished`) in scraped Reddit pages.

## Current state in code (verified)

- `internal/scraper/pipeline.go` — `StructuredData` struct has `JSONLD []json.RawMessage`, `OpenGraph map[string]string`, `Citation map[string]string`. No `ForumSignals` field.
- `internal/scraper/pipeline.go:Signals()` — reduces `StructuredData` to `content.StructuredSignals{SchemaTypes, HasCitationMeta}`. Reddit engagement fields not extracted.
- `internal/scraper/hackernews.go` — HN `Score int` IS already extracted (line 32) and surfaced in formatted content output (line 205). Reddit has no equivalent.
- `internal/scraper/html.go` — HTML/stealth tier extracts JSON-LD blocks into `StructuredData.JSONLD`. Reddit posts include a `DiscussionForumPosting` JSON-LD schema with `interactionStatistic` entries for upvotes and comments.
- `internal/content/classify.go:isForumHost()` — already identifies reddit.com as a forum host.
- `internal/tools/scrape.go` — scrape output includes the raw classified fields but no forum-specific engagement.

## Reddit JSON-LD structure (verified via live page inspection)

Reddit posts render a `<script type="application/ld+json">` block with:

```json
{
  "@type": "DiscussionForumPosting",
  "upvoteCount": 12847,
  "interactionStatistic": [
    {
      "@type": "InteractionCounter",
      "interactionType": "https://schema.org/VoteAction",
      "userInteractionCount": 12847
    },
    {
      "@type": "InteractionCounter",
      "interactionType": "https://schema.org/CommentAction",
      "userInteractionCount": 342
    }
  ],
  "datePublished": "2024-03-15T14:22:00+00:00",
  "author": {
    "@type": "Person",
    "name": "u/throwaway_account_xyz"
  }
}
```

The `upvoteCount` top-level field is the primary signal. `interactionStatistic` is the fallback. `datePublished` gives post age.

## Implementation plan

### Step 1 — New struct in `internal/scraper/pipeline.go`

```go
// ForumSignals holds engagement metadata extracted from forum pages (Reddit,
// HackerNews). Non-nil only when the page is identified as a forum post.
type ForumSignals struct {
    // Platform is "reddit" or "hackernews".
    Platform string `json:"platform"`
    // Upvotes is the post score / upvote count. -1 means not found in page data.
    Upvotes int `json:"upvotes"`
    // Comments is the comment count. -1 means not found.
    Comments int `json:"comments"`
    // DatePublished is the ISO-8601 publication timestamp from structured data.
    DatePublished string `json:"datePublished,omitempty"`
    // AuthorName is the post author's display name (not a user ID).
    AuthorName string `json:"authorName,omitempty"`
    // CredibilityNote is a human-readable note for the AI when signals suggest
    // low engagement. Empty when engagement is normal.
    CredibilityNote string `json:"credibilityNote,omitempty"`
}
```

Add `ForumSignals *ForumSignals` to `ScrapeResult` struct in `pipeline.go`.

### Step 2 — Extraction function in `internal/scraper/html.go` (or a new `internal/scraper/forum.go`)

```go
// extractForumSignals parses JSON-LD blocks from a Reddit page and returns a
// ForumSignals struct. host is the URL's hostname (for example "www.reddit.com").
// Returns nil for non-Reddit hosts or when no forum schema is found.
func extractForumSignals(host string, jsonldBlocks []json.RawMessage) *ForumSignals
```

Implementation:

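1. **Platform detection**: `host == "reddit.com" || strings.HasSuffix(host, ".reddit.com")` → `platform = "reddit"`. A plain `strings.Contains(host, "reddit.com")` would also match unrelated hosts such as `notreddit.com`. HN is handled separately by `hackernews.go`; skip here (return nil).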
2. **JSON-LD scan**: iterate `jsonldBlocks`, unmarshal each into a `map[string]any`. Look for `@type == "DiscussionForumPosting"` or `@type == "SocialMediaPosting"`.
3. **Upvotes**:
   - Try `upvoteCount` key first (JSON numbers decode as `float64` when unmarshalled into `map[string]any`; convert to `int`).
   - Fallback: scan `interactionStatistic` array for entry with `interactionType` containing `"VoteAction"`, read `userInteractionCount`.
   - Default to -1 if neither found.
4. **Comments**:
   - Scan `interactionStatistic` for `"CommentAction"`, read `userInteractionCount`.
   - Default to -1.
5. **DatePublished**: read `datePublished` string directly.
6. **AuthorName**: read `author.name` string.
7. **CredibilityNote**: set when `upvotes >= 0 && upvotes < 20`:
   `"Low engagement: this post has fewer than 20 upvotes. Community validation is minimal."`

### Step 3 — Wire into the scraper pipeline in `internal/scraper/pipeline.go`

After structured data extraction and before returning `ScrapeResult`, call:

```go
if result.StructuredData != nil {
    result.ForumSignals = extractForumSignals(u.Host, result.StructuredData.JSONLD)
}
```

No change to `Signals()` — `ForumSignals` is a separate output path, not part of the classification substrate.

### Step 4 — Surface in `internal/tools/scrape.go`

In the tool handler, when `result.ForumSignals != nil`, include in the output JSON:

```go
output["forumSignals"] = result.ForumSignals
```

Add to the `scrape_page` output schema in `docs/TOOLS.md`.

### Step 5 — Surface in `web_search` result enrichment (optional, Phase 2)

For Phase 1, only `scrape_page` gets `forumSignals`. A Phase 2 follow-on could populate it for Reddit URLs appearing in `web_search` results by running a lightweight JSON-LD fetch — but that requires a scrape per result, which is too expensive for the search path. Mark as a follow-on in the issue.

## Output schema change

`scrape_page` result gains an optional `forumSignals` field when the URL is a Reddit post:

```json
"forumSignals": {
  "platform": "reddit",
  "upvotes": 7,
  "comments": 3,
  "datePublished": "2024-11-02T09:15:00+00:00",
  "authorName": "u/bestproducts_xyz",
  "credibilityNote": "Low engagement: this post has fewer than 20 upvotes. Community validation is minimal."
}
```

Field omitted (not null) for non-forum URLs.
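
The difference between omitted and null comes down to how the field is serialized. The sketch below shows a nil pointer with `omitempty` being dropped from the JSON; `scrapeOutput` and `encode` are hypothetical stand-ins for the real response type, and `ForumSignals` is trimmed to three fields for brevity.

```go
package main

import (
    "encoding/json"
    "fmt"
)

// ForumSignals is a trimmed copy of the Step 1 struct.
type ForumSignals struct {
    Platform string `json:"platform"`
    Upvotes  int    `json:"upvotes"`
    Comments int    `json:"comments"`
}

// scrapeOutput is a hypothetical stand-in for the scrape_page response.
type scrapeOutput struct {
    URL          string        `json:"url"`
    ForumSignals *ForumSignals `json:"forumSignals,omitempty"`
}

// encode marshals the output and panics on error to keep the example short.
func encode(out scrapeOutput) string {
    b, err := json.Marshal(out)
    if err != nil {
        panic(err)
    }
    return string(b)
}

func main() {
    fmt.Println(encode(scrapeOutput{URL: "https://example.com/"}))
    // prints {"url":"https://example.com/"}
    fmt.Println(encode(scrapeOutput{
        URL:          "https://www.reddit.com/r/example/",
        ForumSignals: &ForumSignals{Platform: "reddit", Upvotes: 7, Comments: -1},
    }))
    // prints {"url":"https://www.reddit.com/r/example/","forumSignals":{"platform":"reddit","upvotes":7,"comments":-1}}
}
```

If the handler builds a `map[string]any` instead, the `!= nil` guard from Step 4 matters: a nil `*ForumSignals` stored in a map is marshalled as `null`, not omitted.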

## Tests

### Unit test in `internal/scraper/scraper_test.go` or new `internal/scraper/forum_test.go`

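```go
// Requires the "encoding/json" and "testing" imports.
func TestExtractForumSignals(t *testing.T) {
    cases := []struct {
        name         string
        host         string
        jsonld       string
        wantNil      bool
        wantUpvotes  int
        wantComments int
        wantCredNote bool
    }{
        {
            name:         "reddit post with 7 upvotes",
            host:         "www.reddit.com",
            jsonld:       `[{"@type":"DiscussionForumPosting","upvoteCount":7,"interactionStatistic":[{"@type":"InteractionCounter","interactionType":"https://schema.org/CommentAction","userInteractionCount":3}],"datePublished":"2024-11-02T09:15:00+00:00","author":{"@type":"Person","name":"u/bestproducts_xyz"}}]`,
            wantNil:      false,
            wantUpvotes:  7,
            wantComments: 3,
            wantCredNote: true,
        },
        {
            name:         "reddit post with 12847 upvotes",
            host:         "www.reddit.com",
            jsonld:       `[{"@type":"DiscussionForumPosting","upvoteCount":12847}]`,
            wantNil:      false,
            wantUpvotes:  12847,
            wantComments: -1, // fixture has no CommentAction entry
            wantCredNote: false,
        },
        {
            name:    "non-reddit URL returns nil",
            host:    "news.ycombinator.com",
            jsonld:  `[{"@type":"NewsArticle","headline":"Test"}]`,
            wantNil: true,
        },
        {
            name:         "reddit post without upvote data returns -1",
            host:         "www.reddit.com",
            jsonld:       `[{"@type":"DiscussionForumPosting"}]`,
            wantNil:      false,
            wantUpvotes:  -1,
            wantComments: -1,
        },
    }

    for _, tc := range cases {
        t.Run(tc.name, func(t *testing.T) {
            // Each fixture is a JSON array of JSON-LD blocks.
            var blocks []json.RawMessage
            if err := json.Unmarshal([]byte(tc.jsonld), &blocks); err != nil {
                t.Fatalf("bad fixture: %v", err)
            }
            got := extractForumSignals(tc.host, blocks)
            if tc.wantNil {
                if got != nil {
                    t.Fatalf("got %+v, want nil", got)
                }
                return
            }
            if got == nil {
                t.Fatal("got nil, want non-nil ForumSignals")
            }
            if got.Upvotes != tc.wantUpvotes {
                t.Errorf("Upvotes = %d, want %d", got.Upvotes, tc.wantUpvotes)
            }
            if got.Comments != tc.wantComments {
                t.Errorf("Comments = %d, want %d", got.Comments, tc.wantComments)
            }
            if hasNote := got.CredibilityNote != ""; hasNote != tc.wantCredNote {
                t.Errorf("CredibilityNote present = %v, want %v", hasNote, tc.wantCredNote)
            }
        })
    }
}
```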

### Integration test with `//go:build live`

Real Reddit URLs for live testing. Note: Reddit often blocks scrapers; the stealth tier handles this, but test stability may vary. Use URLs that are stable/archived.

| URL | Expected `upvotes` | Expected `credibilityNote` |
|-----|--------------------|---------------------------|
| `https://www.reddit.com/r/projectmanagement/comments/1abc123/` (a low-upvote post, <20 upvotes) | `< 20` | non-empty |
| A subreddit top post with thousands of upvotes | `> 100` | empty |

For CI stability, seed the integration test with a fixed Reddit URL known to be stable (moderators don't remove it):
- `https://www.reddit.com/r/MachineLearning/comments/zqfxsy/d_chatgpt_trained_on_reddit/` — this is a high-upvote thread and unlikely to be removed.

### Drift gate

`TestOutputSchemaMatchesResponse` must pass after adding `forumSignals` to the `scrape_page` schema in `docs/TOOLS.md`.

## Acceptance criteria

- [ ] `scrape_page(url: "https://www.reddit.com/r/projectmanagement/comments/...")` returns `forumSignals.platform: "reddit"` and integer `upvotes`/`comments` values
- [ ] A post with fewer than 20 upvotes returns a non-empty `credibilityNote`
- [ ] Non-Reddit URLs do not have a `forumSignals` field in the output
- [ ] HN URLs do not have a `forumSignals` field (HN engagement is surfaced separately by `hackernews.go`)
- [ ] `upvotes: -1` returned gracefully when no upvote data found in structured data (not a panic or error)
- [ ] `go test -race ./...` passes
- [ ] `docs/TOOLS.md` updated; `TestOutputSchemaMatchesResponse` passes

## Labels / milestone

`enhancement` · `P2` · `pipeline`  
Milestone: v1.33.0 Anti-Sloptimization

## Note on Reddit scraping reliability

Reddit actively blocks scrapers. The stealth tier in `internal/scraper/pipeline.go` handles most cases. If JSON-LD blocks are absent from a Reddit page response (e.g. a bot-wall is served instead), `extractForumSignals` returns `nil` and `forumSignals` is omitted from the output; the tool does not error. Add a log line at debug level: `"reddit JSON-LD not found, forumSignals skipped"`.
