Summary
The Semrush study cited in The Atlantic's "sloptimization" piece found that most Reddit posts cited by AI chatbots have fewer than 20 upvotes. When the scraper fetches a Reddit page, this engagement signal is not extracted or surfaced — so an AI assistant reading the scrape output has no way to distinguish a viral community thread from a sock-puppet post. This issue exposes Reddit-specific engagement metadata (upvotes, comment count, post age) from JSON-LD interactionStatistic blocks in scraped Reddit pages.
Current state in code (verified)
internal/scraper/pipeline.go — StructuredData struct has JSONLD []json.RawMessage, OpenGraph map[string]string, Citation map[string]string. No ForumSignals field.
internal/scraper/pipeline.go:Signals() — reduces StructuredData to content.StructuredSignals{SchemaTypes, HasCitationMeta}. Reddit engagement fields not extracted.
internal/scraper/hackernews.go — HN Score int IS already extracted (line 32) and surfaced in formatted content output (line 205). Reddit has no equivalent.
internal/scraper/html.go — HTML/stealth tier extracts JSON-LD blocks into StructuredData.JSONLD. Reddit posts include a DiscussionForumPosting JSON-LD schema with interactionStatistic entries for upvotes and comments.
internal/content/classify.go:isForumHost() — already identifies reddit.com as a forum host.
internal/tools/scrape.go — scrape output includes the raw classified fields but no forum-specific engagement.
Reddit JSON-LD structure (verified via live page inspection)
Reddit posts render a <script type="application/ld+json"> block with:
{
  "@type": "DiscussionForumPosting",
  "upvoteCount": 12847,
  "interactionStatistic": [
    {
      "@type": "InteractionCounter",
      "interactionType": "https://schema.org/VoteAction",
      "userInteractionCount": 12847
    },
    {
      "@type": "InteractionCounter",
      "interactionType": "https://schema.org/CommentAction",
      "userInteractionCount": 342
    }
  ],
  "datePublished": "2024-03-15T14:22:00+00:00",
  "author": {
    "@type": "Person",
    "name": "u/throwaway_account_xyz"
  }
}
The upvoteCount top-level field is the primary signal. interactionStatistic is the fallback. datePublished gives post age.
Implementation plan
Step 1 — New struct in internal/scraper/pipeline.go
// ForumSignals holds engagement metadata extracted from forum pages (Reddit,
// HackerNews). Non-nil only when the page is identified as a forum post.
type ForumSignals struct {
	// Platform is "reddit" or "hackernews".
	Platform string `json:"platform"`
	// Upvotes is the post score / upvote count. -1 means not found in page data.
	Upvotes int `json:"upvotes"`
	// Comments is the comment count. -1 means not found.
	Comments int `json:"comments"`
	// DatePublished is the ISO-8601 publication timestamp from structured data.
	DatePublished string `json:"datePublished,omitempty"`
	// AuthorName is the post author's display name (not a user ID).
	AuthorName string `json:"authorName,omitempty"`
	// CredibilityNote is a human-readable note for the AI when signals suggest
	// low engagement. Empty when engagement is normal.
	CredibilityNote string `json:"credibilityNote,omitempty"`
}
Add ForumSignals *ForumSignals to ScrapeResult struct in pipeline.go.
Step 2 — Extraction function in internal/scraper/html.go (or a new internal/scraper/forum.go)
// extractForumSignals parses JSON-LD blocks from a Reddit page and returns a
// ForumSignals struct. Returns nil when the host is not Reddit (HN is handled
// by hackernews.go) or when no forum schema is found.
func extractForumSignals(host string, jsonldBlocks []json.RawMessage) *ForumSignals
Implementation:
- Platform detection: strings.Contains(host, "reddit.com") → platform = "reddit". HN is handled separately by hackernews.go; skip here (return nil).
- JSON-LD scan: iterate jsonldBlocks, unmarshal each into a map[string]any. Look for @type == "DiscussionForumPosting" or @type == "SocialMediaPosting".
- Upvotes:
  - Try the upvoteCount key first (JSON numbers decode to float64 when unmarshalled into map[string]any, so convert to int).
  - Fallback: scan the interactionStatistic array for an entry whose interactionType contains "VoteAction" and read its userInteractionCount.
  - Default to -1 if neither is found.
- Comments:
  - Scan interactionStatistic for "CommentAction" and read userInteractionCount.
  - Default to -1.
- DatePublished: read the datePublished string directly.
- AuthorName: read the author.name string.
- CredibilityNote: set when upvotes >= 0 && upvotes < 20: "Low engagement: this post has fewer than 20 upvotes. Community validation is minimal."
Step 3 — Wire into the scraper pipeline in internal/scraper/pipeline.go
After structured data extraction and before returning ScrapeResult, call:
if result.StructuredData != nil {
	result.ForumSignals = extractForumSignals(u.Host, result.StructuredData.JSONLD)
}
No change to Signals() — ForumSignals is a separate output path, not part of the classification substrate.
Step 4 — Surface in internal/tools/scrape.go
In the tool handler, when result.ForumSignals != nil, include in the output JSON:
output["forumSignals"] = result.ForumSignals
Add to the scrape_page output schema in docs/TOOLS.md.
Step 5 — Surface in web_search result enrichment (optional, Phase 2)
For Phase 1, only scrape_page gets forumSignals. A Phase 2 follow-on could populate it for Reddit URLs appearing in web_search results by running a lightweight JSON-LD fetch — but that requires a scrape per result, which is too expensive for the search path. Mark as a follow-on in the issue.
Output schema change
scrape_page result gains an optional forumSignals field when the URL is a Reddit post:
"forumSignals": {
  "platform": "reddit",
  "upvotes": 7,
  "comments": 3,
  "datePublished": "2024-11-02T09:15:00+00:00",
  "authorName": "u/bestproducts_xyz",
  "credibilityNote": "Low engagement: this post has fewer than 20 upvotes. Community validation is minimal."
}
The field is omitted (not null) for every other URL, including HN pages, whose engagement is surfaced separately by hackernews.go.
Tests
Unit test in internal/scraper/scraper_test.go or new internal/scraper/forum_test.go
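// Requires imports: "encoding/json" and "testing".
func TestExtractForumSignals(t *testing.T) {
	cases := []struct {
		name         string
		host         string
		jsonld       string // JSON array; each element is one JSON-LD block
		wantNil      bool
		wantUpvotes  int
		wantComments int
		wantCredNote bool
	}{
		{
			name:         "reddit post with 7 upvotes",
			host:         "www.reddit.com",
			jsonld:       `[{"@type":"DiscussionForumPosting","upvoteCount":7,"interactionStatistic":[{"@type":"InteractionCounter","interactionType":"https://schema.org/CommentAction","userInteractionCount":3}],"datePublished":"2024-11-02T09:15:00+00:00","author":{"@type":"Person","name":"u/bestproducts_xyz"}}]`,
			wantUpvotes:  7,
			wantComments: 3,
			wantCredNote: true,
		},
		{
			name:         "reddit post with 12847 upvotes",
			host:         "www.reddit.com",
			jsonld:       `[{"@type":"DiscussionForumPosting","upvoteCount":12847}]`,
			wantUpvotes:  12847,
			wantComments: -1, // fixture has no CommentAction counter
			wantCredNote: false,
		},
		{
			name:    "non-reddit URL returns nil",
			host:    "news.ycombinator.com",
			jsonld:  `[{"@type":"NewsArticle","headline":"Test"}]`,
			wantNil: true,
		},
		{
			name:         "reddit post without upvote data returns -1",
			host:         "www.reddit.com",
			jsonld:       `[{"@type":"DiscussionForumPosting"}]`,
			wantUpvotes:  -1,
			wantComments: -1,
		},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			// Split the fixture array into individual JSON-LD blocks.
			var blocks []json.RawMessage
			if err := json.Unmarshal([]byte(tc.jsonld), &blocks); err != nil {
				t.Fatalf("invalid fixture: %v", err)
			}
			got := extractForumSignals(tc.host, blocks)
			if tc.wantNil {
				if got != nil {
					t.Fatalf("got %+v, want nil", got)
				}
				return
			}
			if got == nil {
				t.Fatal("got nil, want non-nil ForumSignals")
			}
			if got.Upvotes != tc.wantUpvotes || got.Comments != tc.wantComments {
				t.Errorf("got upvotes=%d comments=%d, want upvotes=%d comments=%d",
					got.Upvotes, got.Comments, tc.wantUpvotes, tc.wantComments)
			}
			if (got.CredibilityNote != "") != tc.wantCredNote {
				t.Errorf("credibilityNote = %q, wantCredNote = %v", got.CredibilityNote, tc.wantCredNote)
			}
		})
	}
}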
Integration test with //go:build live
Real Reddit URLs for live testing. Note: Reddit often blocks scrapers; the stealth tier handles this, but test stability may vary. Use URLs that are stable/archived.
| URL | Expected upvotes | Expected credibilityNote |
| --- | --- | --- |
| https://www.reddit.com/r/projectmanagement/comments/1abc123/ (a low-upvote post, <20 upvotes) | < 20 | non-empty |
| A subreddit top post with thousands of upvotes | > 100 | empty |
For CI stability, seed the integration test with a fixed Reddit URL known to be stable (moderators don't remove it):
https://www.reddit.com/r/MachineLearning/comments/zqfxsy/d_chatgpt_trained_on_reddit/ — this is a high-upvote thread and unlikely to be removed.
Drift gate
TestOutputSchemaMatchesResponse must pass after adding forumSignals to the scrape_page schema in docs/TOOLS.md.
Acceptance criteria
- scrape_page(url: "https://www.reddit.com/r/projectmanagement/comments/...") returns forumSignals.platform: "reddit" and integer upvotes/comments values
- Posts with fewer than 20 upvotes include a non-empty credibilityNote
- Non-forum URLs have no forumSignals field in the output
- HN URLs have no forumSignals field (HN engagement is surfaced separately by hackernews.go)
- upvotes: -1 returned gracefully when no upvote data found in structured data (not a panic or error)
- go test -race ./... passes
- docs/TOOLS.md updated; TestOutputSchemaMatchesResponse passes
Labels / milestone
enhancement · P2 · pipeline
Milestone: v1.33.0 Anti-Sloptimization
Note on Reddit scraping reliability
Reddit actively blocks scrapers. The stealth tier in internal/scraper/pipeline.go handles most cases. If JSON-LD blocks are absent in a given Reddit page response (e.g. a bot wall is served instead), ForumSignals is left nil and the tool does not error. Add a log line at debug level: "reddit JSON-LD not found, forumSignals skipped".