feat: Reddit quality signals in scrape output (upvotes, comments, account age) #247

@zoharbabin

Summary

The Semrush study cited in The Atlantic's "sloptimization" piece found that most Reddit posts cited by AI chatbots have fewer than 20 upvotes. When the scraper fetches a Reddit page, this engagement signal is neither extracted nor surfaced, so an AI assistant reading the scrape output cannot distinguish a viral community thread from a sock-puppet post. This issue proposes exposing Reddit-specific engagement metadata (upvotes, comment count, post age) read from the DiscussionForumPosting JSON-LD block and its interactionStatistic entries in scraped Reddit pages.

Current state in code (verified)

  • internal/scraper/pipeline.go: the StructuredData struct has JSONLD []json.RawMessage, OpenGraph map[string]string, Citation map[string]string. No ForumSignals field.
  • internal/scraper/pipeline.go:Signals() — reduces StructuredData to content.StructuredSignals{SchemaTypes, HasCitationMeta}. Reddit engagement fields not extracted.
  • internal/scraper/hackernews.go — HN Score int IS already extracted (line 32) and surfaced in formatted content output (line 205). Reddit has no equivalent.
  • internal/scraper/html.go — HTML/stealth tier extracts JSON-LD blocks into StructuredData.JSONLD. Reddit posts include a DiscussionForumPosting JSON-LD schema with interactionStatistic entries for upvotes and comments.
  • internal/content/classify.go:isForumHost() — already identifies reddit.com as a forum host.
  • internal/tools/scrape.go — scrape output includes the raw classified fields but no forum-specific engagement.

Reddit JSON-LD structure (verified via live page inspection)

Reddit posts render a <script type="application/ld+json"> block with:

{
  "@type": "DiscussionForumPosting",
  "upvoteCount": 12847,
  "interactionStatistic": [
    {
      "@type": "InteractionCounter",
      "interactionType": "https://schema.org/VoteAction",
      "userInteractionCount": 12847
    },
    {
      "@type": "InteractionCounter",
      "interactionType": "https://schema.org/CommentAction",
      "userInteractionCount": 342
    }
  ],
  "datePublished": "2024-03-15T14:22:00+00:00",
  "author": {
    "@type": "Person",
    "name": "u/throwaway_account_xyz"
  }
}

The upvoteCount top-level field is the primary signal. interactionStatistic is the fallback. datePublished gives post age.
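
Since datePublished is an ISO-8601 timestamp, post age can be derived by parsing it as RFC 3339. The following is a minimal sketch; postAge and postAgeDays are hypothetical helper names, not part of the existing codebase, and the reference time is passed in so the result is deterministic.

```go
package main

import (
	"fmt"
	"time"
)

// postAge returns the time elapsed between datePublished (an RFC 3339
// string as found in the JSON-LD block) and now. Hypothetical helper.
func postAge(datePublished string, now time.Time) (time.Duration, error) {
	published, err := time.Parse(time.RFC3339, datePublished)
	if err != nil {
		return 0, fmt.Errorf("parse datePublished %q: %w", datePublished, err)
	}
	return now.Sub(published), nil
}

// postAgeDays is a convenience wrapper that takes the reference time as an
// RFC 3339 string and returns the age in whole days. Hypothetical helper.
func postAgeDays(datePublished, reference string) (int, error) {
	ref, err := time.Parse(time.RFC3339, reference)
	if err != nil {
		return 0, fmt.Errorf("parse reference %q: %w", reference, err)
	}
	age, err := postAge(datePublished, ref)
	if err != nil {
		return 0, err
	}
	return int(age.Hours() / 24), nil
}

func main() {
	days, err := postAgeDays("2024-03-15T14:22:00+00:00", "2024-03-25T14:22:00+00:00")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(days) // prints 10
}
```

In production code the reference would normally be time.Now(); whether age should be surfaced as a derived field or left for the consumer to compute is not decided by this issue.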

Implementation plan

Step 1 — New struct in internal/scraper/pipeline.go

// ForumSignals holds engagement metadata extracted from forum pages (Reddit,
// HackerNews). Non-nil only when the page is identified as a forum post.
type ForumSignals struct {
    // Platform is "reddit" or "hackernews".
    Platform string `json:"platform"`
    // Upvotes is the post score / upvote count. -1 means not found in page data.
    Upvotes int `json:"upvotes"`
    // Comments is the comment count. -1 means not found.
    Comments int `json:"comments"`
    // DatePublished is the ISO-8601 publication timestamp from structured data.
    DatePublished string `json:"datePublished,omitempty"`
    // AuthorName is the post author's display name (not a user ID).
    AuthorName string `json:"authorName,omitempty"`
    // CredibilityNote is a human-readable note for the AI when signals suggest
    // low engagement. Empty when engagement is normal.
    CredibilityNote string `json:"credibilityNote,omitempty"`
}

Add ForumSignals *ForumSignals to ScrapeResult struct in pipeline.go.

Step 2 — Extraction function in internal/scraper/html.go (or a new internal/scraper/forum.go)

// extractForumSignals parses JSON-LD blocks from a Reddit page and returns a
// ForumSignals struct. Returns nil for non-Reddit hosts (HN is handled by
// hackernews.go) and when no forum schema is found.
func extractForumSignals(host string, jsonldBlocks []json.RawMessage) *ForumSignals

Implementation:

  1. Platform detection: if strings.Contains(host, "reddit.com"), set platform = "reddit". HN is handled separately by hackernews.go; skip here (return nil).
  2. JSON-LD scan: iterate jsonldBlocks, unmarshal each into a map[string]any. Look for @type == "DiscussionForumPosting" or @type == "SocialMediaPosting".
  3. Upvotes:
    • Try upvoteCount key first (int or float64 from JSON).
    • Fallback: scan interactionStatistic array for entry with interactionType containing "VoteAction", read userInteractionCount.
    • Default to -1 if neither found.
  4. Comments:
    • Scan interactionStatistic for "CommentAction", read userInteractionCount.
    • Default to -1.
  5. DatePublished: read datePublished string directly.
  6. AuthorName: read author.name string.
  7. CredibilityNote: set when upvotes >= 0 && upvotes < 20:
    "Low engagement: this post has fewer than 20 upvotes. Community validation is minimal."

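The seven steps above can be sketched as follows. This is one possible reading of the plan, not a reviewed implementation: asInt, interactionCount, lowEngagementNote, and mustBlocks are hypothetical names, each JSON-LD block is assumed to be a single JSON object with a string @type, and blocks using a top-level array or @graph are simply skipped.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// ForumSignals mirrors the struct proposed in Step 1.
type ForumSignals struct {
	Platform        string `json:"platform"`
	Upvotes         int    `json:"upvotes"`
	Comments        int    `json:"comments"`
	DatePublished   string `json:"datePublished,omitempty"`
	AuthorName      string `json:"authorName,omitempty"`
	CredibilityNote string `json:"credibilityNote,omitempty"`
}

const lowEngagementNote = "Low engagement: this post has fewer than 20 upvotes. Community validation is minimal."

// asInt converts a decoded JSON number to int. encoding/json decodes every
// number in a map[string]any as float64, so that is the only case handled.
func asInt(v any) (int, bool) {
	f, ok := v.(float64)
	return int(f), ok
}

// interactionCount returns userInteractionCount from the first
// interactionStatistic entry whose interactionType contains action, or -1.
func interactionCount(post map[string]any, action string) int {
	stats, _ := post["interactionStatistic"].([]any)
	for _, s := range stats {
		entry, ok := s.(map[string]any)
		if !ok {
			continue
		}
		if kind, _ := entry["interactionType"].(string); !strings.Contains(kind, action) {
			continue
		}
		if n, ok := asInt(entry["userInteractionCount"]); ok {
			return n
		}
	}
	return -1
}

func extractForumSignals(host string, jsonldBlocks []json.RawMessage) *ForumSignals {
	if !strings.Contains(host, "reddit.com") {
		return nil // non-Reddit hosts, including HN, are skipped here
	}
	for _, raw := range jsonldBlocks {
		var post map[string]any
		if json.Unmarshal(raw, &post) != nil {
			continue // not a JSON object (e.g. a top-level array): skip
		}
		if t, _ := post["@type"].(string); t != "DiscussionForumPosting" && t != "SocialMediaPosting" {
			continue
		}
		fs := &ForumSignals{Platform: "reddit"}
		if n, ok := asInt(post["upvoteCount"]); ok {
			fs.Upvotes = n // primary signal
		} else {
			fs.Upvotes = interactionCount(post, "VoteAction") // fallback, -1 if absent
		}
		fs.Comments = interactionCount(post, "CommentAction")
		fs.DatePublished, _ = post["datePublished"].(string)
		if author, ok := post["author"].(map[string]any); ok {
			fs.AuthorName, _ = author["name"].(string)
		}
		if fs.Upvotes >= 0 && fs.Upvotes < 20 {
			fs.CredibilityNote = lowEngagementNote
		}
		return fs
	}
	return nil
}

// mustBlocks is a hypothetical fixture helper: it splits a JSON array string
// into raw JSON-LD blocks and panics on malformed input.
func mustBlocks(jsonld string) []json.RawMessage {
	var blocks []json.RawMessage
	if err := json.Unmarshal([]byte(jsonld), &blocks); err != nil {
		panic(err)
	}
	return blocks
}

func main() {
	fs := extractForumSignals("www.reddit.com", mustBlocks(
		`[{"@type":"DiscussionForumPosting","interactionStatistic":[`+
			`{"interactionType":"https://schema.org/VoteAction","userInteractionCount":7},`+
			`{"interactionType":"https://schema.org/CommentAction","userInteractionCount":3}]}]`))
	fmt.Println(fs.Upvotes, fs.Comments, fs.CredibilityNote != "") // prints: 7 3 true
}
```

The example fixture omits upvoteCount on purpose so that the interactionStatistic fallback path is exercised.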
Step 3 — Wire into the scraper pipeline in internal/scraper/pipeline.go

After structured data extraction and before returning ScrapeResult, call:

if result.StructuredData != nil {
    result.ForumSignals = extractForumSignals(u.Host, result.StructuredData.JSONLD)
}
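
One detail worth checking when wiring this in: url.URL.Host can include a port, and a substring match on "reddit.com" also accepts lookalike domains. The sketch below shows a stricter check as an alternative to the strings.Contains test in the plan; isRedditHost is a hypothetical helper and is not something the issue specifies.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// isRedditHost reports whether rawURL points at reddit.com or one of its
// subdomains. Hypothetical helper; short-link domains are not covered.
func isRedditHost(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	host := strings.ToLower(u.Hostname()) // Hostname strips any ":port" suffix
	return host == "reddit.com" || strings.HasSuffix(host, ".reddit.com")
}

func main() {
	for _, raw := range []string{
		"https://www.reddit.com/r/golang/",
		"https://old.reddit.com:443/r/golang/",
		"https://notreddit.com/",
		"https://reddit.com.example.org/",
	} {
		fmt.Println(raw, isRedditHost(raw))
	}
}
```

The first two URLs are accepted and the last two rejected, whereas a plain substring match on the host would accept all four.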

No change to Signals(): ForumSignals is a separate output path, not part of the classification substrate.

Step 4 — Surface in internal/tools/scrape.go

In the tool handler, when result.ForumSignals != nil, include in the output JSON:

output["forumSignals"] = result.ForumSignals

Add to the scrape_page output schema in docs/TOOLS.md.

Step 5 — Surface in web_search result enrichment (optional, Phase 2)

For Phase 1, only scrape_page gets forumSignals. A Phase 2 follow-on could populate it for Reddit URLs appearing in web_search results by running a lightweight JSON-LD fetch — but that requires a scrape per result, which is too expensive for the search path. Mark as a follow-on in the issue.

Output schema change

scrape_page result gains an optional forumSignals field when the URL is a Reddit post:

"forumSignals": {
  "platform": "reddit",
  "upvotes": 7,
  "comments": 3,
  "datePublished": "2024-11-02T09:15:00+00:00",
  "authorName": "u/bestproducts_xyz",
  "credibilityNote": "Low engagement: this post has fewer than 20 upvotes. Community validation is minimal."
}

Field omitted (not null) for non-forum URLs.
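
The omitted-versus-null distinction depends on the nil guard from Step 4: assigning a nil pointer into a map would serialize as "forumSignals": null. A minimal sketch of that behavior follows; buildOutput, render, and the "url" key are hypothetical stand-ins for the real handler in internal/tools/scrape.go.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ForumSignals mirrors the struct proposed in Step 1.
type ForumSignals struct {
	Platform        string `json:"platform"`
	Upvotes         int    `json:"upvotes"`
	Comments        int    `json:"comments"`
	DatePublished   string `json:"datePublished,omitempty"`
	AuthorName      string `json:"authorName,omitempty"`
	CredibilityNote string `json:"credibilityNote,omitempty"`
}

// buildOutput assembles the tool output map. The forumSignals key is only
// set when fs is non-nil, so non-forum URLs get no key at all.
func buildOutput(pageURL string, fs *ForumSignals) map[string]any {
	output := map[string]any{"url": pageURL}
	if fs != nil {
		output["forumSignals"] = fs
	}
	return output
}

// render marshals the output map to a JSON string (map keys are emitted in
// sorted order by encoding/json).
func render(pageURL string, fs *ForumSignals) string {
	b, err := json.Marshal(buildOutput(pageURL, fs))
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	fmt.Println(render("https://example.com/post", nil))
	fmt.Println(render("https://www.reddit.com/r/x/comments/y/",
		&ForumSignals{Platform: "reddit", Upvotes: 7, Comments: 3}))
}
```

The first line printed contains only the url key; the second includes a forumSignals object with the omitempty string fields left out.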

Tests

Unit test in internal/scraper/scraper_test.go or new internal/scraper/forum_test.go

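import (
    "encoding/json"
    "testing"
)

func TestExtractForumSignals(t *testing.T) {
    cases := []struct {
        name         string
        host         string
        jsonld       string
        wantNil      bool
        wantUpvotes  int
        wantComments int
        wantCredNote bool
    }{
        {
            name:         "reddit post with 7 upvotes",
            host:         "www.reddit.com",
            jsonld:       `[{"@type":"DiscussionForumPosting","upvoteCount":7,"interactionStatistic":[{"@type":"InteractionCounter","interactionType":"https://schema.org/CommentAction","userInteractionCount":3}],"datePublished":"2024-11-02T09:15:00+00:00","author":{"@type":"Person","name":"u/bestproducts_xyz"}}]`,
            wantNil:      false,
            wantUpvotes:  7,
            wantComments: 3,
            wantCredNote: true,
        },
        {
            name:         "reddit post with 12847 upvotes",
            host:         "www.reddit.com",
            jsonld:       `[{"@type":"DiscussionForumPosting","upvoteCount":12847}]`,
            wantNil:      false,
            wantUpvotes:  12847,
            wantComments: -1, // fixture has no CommentAction entry
            wantCredNote: false,
        },
        {
            name:    "non-reddit URL returns nil",
            host:    "news.ycombinator.com",
            jsonld:  `[{"@type":"NewsArticle","headline":"Test"}]`,
            wantNil: true,
        },
        {
            name:         "reddit post without upvote data returns -1",
            host:         "www.reddit.com",
            jsonld:       `[{"@type":"DiscussionForumPosting"}]`,
            wantNil:      false,
            wantUpvotes:  -1,
            wantComments: -1,
        },
    }
    for _, tc := range cases {
        t.Run(tc.name, func(t *testing.T) {
            // Each fixture is a JSON array of JSON-LD blocks.
            var blocks []json.RawMessage
            if err := json.Unmarshal([]byte(tc.jsonld), &blocks); err != nil {
                t.Fatalf("invalid fixture: %v", err)
            }
            got := extractForumSignals(tc.host, blocks)
            if tc.wantNil {
                if got != nil {
                    t.Fatalf("got %+v, want nil", got)
                }
                return
            }
            if got == nil {
                t.Fatal("got nil, want non-nil ForumSignals")
            }
            if got.Upvotes != tc.wantUpvotes || got.Comments != tc.wantComments {
                t.Errorf("got upvotes=%d comments=%d, want upvotes=%d comments=%d",
                    got.Upvotes, got.Comments, tc.wantUpvotes, tc.wantComments)
            }
            if (got.CredibilityNote != "") != tc.wantCredNote {
                t.Errorf("credibilityNote = %q, want non-empty: %v", got.CredibilityNote, tc.wantCredNote)
            }
        })
    }
}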

Integration test with //go:build live

Real Reddit URLs for live testing. Note: Reddit often blocks scrapers; the stealth tier handles this, but test stability may vary. Use URLs that are stable/archived.

URL | Expected upvotes | Expected credibilityNote
https://www.reddit.com/r/projectmanagement/comments/1abc123/ (a low-upvote post, <20 upvotes) | < 20 | non-empty
A subreddit top post with thousands of upvotes | > 100 | empty

For CI stability, seed the integration test with a fixed Reddit URL known to be stable (moderators don't remove it):

  • https://www.reddit.com/r/MachineLearning/comments/zqfxsy/d_chatgpt_trained_on_reddit/ — this is a high-upvote thread and unlikely to be removed.

Drift gate

TestOutputSchemaMatchesResponse must pass after adding forumSignals to the scrape_page schema in docs/TOOLS.md.

Acceptance criteria

  • scrape_page(url: "https://www.reddit.com/r/projectmanagement/comments/...") returns forumSignals.platform: "reddit" and integer upvotes/comments values
  • A post with fewer than 20 upvotes returns a non-empty credibilityNote
  • Non-Reddit URLs do not have a forumSignals field in the output
  • HN URLs do not have a forumSignals field (HN engagement is surfaced separately by hackernews.go)
  • upvotes: -1 returned gracefully when no upvote data found in structured data (not a panic or error)
  • go test -race ./... passes
  • docs/TOOLS.md updated; TestOutputSchemaMatchesResponse passes

Labels / milestone

enhancement · P2 · pipeline
Milestone: v1.33.0 Anti-Sloptimization

Note on Reddit scraping reliability

Reddit actively blocks scrapers. The stealth tier in internal/scraper/pipeline.go handles most cases. If JSON-LD blocks are absent in a given Reddit page response (e.g. a bot wall is served instead), extractForumSignals returns nil and the ForumSignals field stays unset; the tool does not error. Add a log line at debug level: "reddit JSON-LD not found, forumSignals skipped".
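
A minimal sketch of that debug line, assuming the standard library's log/slog (Go 1.21+) as the logger. The project may use a different logging package, and noteMissingForumData and captureDebugLog are hypothetical names introduced only for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
)

// noteMissingForumData emits the debug line when extraction on a Reddit page
// produced no result. found is true when ForumSignals was non-nil.
func noteMissingForumData(logger *slog.Logger, host string, found bool) {
	if found {
		return
	}
	logger.Debug("reddit JSON-LD not found, forumSignals skipped", "host", host)
}

// captureDebugLog runs the helper against an in-memory text handler with the
// level lowered to debug, and returns whatever was written.
func captureDebugLog(host string, found bool) string {
	var buf bytes.Buffer
	handler := slog.NewTextHandler(&buf, &slog.HandlerOptions{Level: slog.LevelDebug})
	noteMissingForumData(slog.New(handler), host, found)
	return buf.String()
}

func main() {
	fmt.Print(captureDebugLog("www.reddit.com", false)) // one debug line
	fmt.Print(captureDebugLog("www.reddit.com", true))  // nothing logged
}
```

Because the message is logged at debug level, it stays silent under a default info-level handler and does not affect the tool's output.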
