feat: self-promotion bias signal in source classification #244

@zoharbabin

Summary

Brand-authored content that places the brand's own product at the top of a ranking list is a primary vector for sloptimization (generative engine optimization). The source classifier already labels blogs and reputation tiers, but it does not detect the specific pattern of a page that (a) ranks products and (b) puts its own brand first. This issue adds a selfPromotionSignal field to the source classification output so AI assistants can distinguish "Shopify ranked #1 by PCMag" from "Shopify ranked #1 by Shopify."

Current state in code (verified)

  • internal/content/classify.go:ClassifySource() calls classifySourceType() + classifyDomainCategory() + LookupDomainReputation(). No self-promotion detection anywhere.
  • internal/tools/classify.go:classificationFields() emits sourceType, authorityTier, domainCategory, and optionally domainReputation. No selfPromotionSignal key.
  • internal/tools/classify.go:enrichResultsWithReputation() — already attaches per-result classification to every web_search result. Adding the new field here propagates it automatically.
  • internal/content/classify.go:isBlogHost() — existing pattern for host-heuristic detection shows the right structure to follow.

Implementation plan

Step 1 — New struct in internal/content/classify.go

// SelfPromotionSignal is non-nil when the page matches the ranking-list +
// own-brand pattern. Attached to SourceClassification by ClassifySource.
type SelfPromotionSignal struct {
    Detected    bool   `json:"detected"`
    BrandDomain string `json:"brandDomain"` // host of the page, e.g. "shopify.com"
    BrandToken  string `json:"brandToken"`  // extracted brand name, e.g. "shopify"
    // RankPosition is the 1-based position of the brand's own product in the
    // first ranking list found. 0 means the brand appears but not in position 1.
    RankPosition int    `json:"rankPosition"`
    Confidence   string `json:"confidence"` // "high" | "medium" | "low"
}

Add SelfPromotion *SelfPromotionSignal field to SourceClassification.

Step 2 — Detection function in internal/content/classify.go

// DetectSelfPromotion checks whether body contains a ranking list that puts
// the page's own domain brand in position 1. It is deliberately conservative:
// false negatives are preferable to false positives (mislabeling independent
// reviews as self-promotional breaks the tool's trust contract).
//
// Detection logic:
//  1. Extract the brand token from host (strip TLD + "www"): "shopify.com" → "shopify"
//  2. Scan for ranking patterns: ordered HTML lists (<ol>), numbered markdown
//     lines ("1. Shopify"), or "Best X" / "#1" / "top pick" heading adjacency
//  3. Check whether the brand token appears within the first 2 list items
//  4. Confirm this is not a comparison article FROM an independent site that
//     merely happens to start with the brand (use isBlogHost + host-match guard)
func DetectSelfPromotion(host, title, body string) *SelfPromotionSignal

Implementation notes:

  • Brand token extraction: strings.ToLower(strings.Split(registrableDomain(host), ".")[0]), where registrableDomain strips the www. prefix and returns host.tld. Use the same parent-domain walk logic already in reputation.go:LookupDomainReputation.
  • Ranking pattern regex (compile once as package-level var):
  • Position 1 check: if the brand token appears in the first <li> text or first numbered list entry, RankPosition = 1.
  • Confidence:
    • "high": own domain host + brand in position 1 (first <li> or first numbered list entry) + ranking title
    • "medium": own domain host + brand in position 1, no explicit ranking heading
    • "low": host match only, brand appears somewhere in content but no confirmed ranking structure
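The notes above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the intended implementation: the three regexes are hypothetical stand-ins for the ranking pattern, registrableDomain is a naive last-two-labels version (it does not do the parent-domain walk from reputation.go and mishandles suffixes such as "co.uk"), and the "low" tier is left out because the issue does not say whether it should set Detected.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// SelfPromotionSignal mirrors the struct proposed in Step 1.
type SelfPromotionSignal struct {
	Detected     bool   `json:"detected"`
	BrandDomain  string `json:"brandDomain"`
	BrandToken   string `json:"brandToken"`
	RankPosition int    `json:"rankPosition"`
	Confidence   string `json:"confidence"`
}

// Hypothetical patterns, compiled once at package level.
var (
	// Text of the first <li> directly after the first <ol>.
	firstListItemRe = regexp.MustCompile(`(?is)<ol[^>]*>\s*<li[^>]*>(.*?)</li>`)
	// Text of the first line that starts with "1." or "1)".
	firstNumberedLineRe = regexp.MustCompile(`(?m)^\s*1[.)]\s+(.+)$`)
	// Titles that read like a ranking ("Best X", "Top 10", "#1").
	rankingTitleRe = regexp.MustCompile(`(?i)\b(best|top)\b|#1`)
)

// registrableDomain is a simplified assumption: lowercase the host, drop a
// leading "www.", and keep the last two labels.
func registrableDomain(host string) string {
	host = strings.TrimPrefix(strings.ToLower(strings.TrimSpace(host)), "www.")
	labels := strings.Split(host, ".")
	if len(labels) <= 2 {
		return host
	}
	return strings.Join(labels[len(labels)-2:], ".")
}

// brandToken returns the first label of the registrable domain.
func brandToken(host string) string {
	return strings.Split(registrableDomain(host), ".")[0]
}

// firstRankedItem returns the text of the first ranked entry, or "".
func firstRankedItem(body string) string {
	if m := firstListItemRe.FindStringSubmatch(body); m != nil {
		return m[1]
	}
	if m := firstNumberedLineRe.FindStringSubmatch(body); m != nil {
		return m[1]
	}
	return ""
}

// DetectSelfPromotion returns nil unless the page's own brand token appears
// in the first ranked entry. An independent host never matches, because the
// token is derived from the host itself.
func DetectSelfPromotion(host, title, body string) *SelfPromotionSignal {
	token := brandToken(host)
	if token == "" || body == "" {
		return nil
	}
	tokenRe := regexp.MustCompile(`(?i)\b` + regexp.QuoteMeta(token) + `\b`)
	item := firstRankedItem(body)
	if item == "" || !tokenRe.MatchString(item) {
		return nil // conservative: no confirmed ranking structure
	}
	conf := "medium"
	if rankingTitleRe.MatchString(title) {
		conf = "high"
	}
	return &SelfPromotionSignal{
		Detected:     true,
		BrandDomain:  registrableDomain(host),
		BrandToken:   token,
		RankPosition: 1,
		Confidence:   conf,
	}
}

func main() {
	body := "<ol><li>Shopify - best overall</li><li>WooCommerce</li></ol>"
	fmt.Printf("%+v\n", DetectSelfPromotion("www.shopify.com", "Best Ecommerce Platforms", body))
	fmt.Println(DetectSelfPromotion("www.pcmag.com", "Best Ecommerce Platforms", body) == nil)
}
```

Deriving the token from the host keeps the independent-reviewer guard implicit: a pcmag.com page is only flagged if "pcmag" itself leads the list. A real implementation would likely want a public-suffix-aware domain split and a proper HTML parser rather than regexes.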

Step 3 — Wire into ClassifySource in internal/content/classify.go

ClassifySource already receives rawURL and would need the page body. The function signature must be extended:

// ClassifySource classifies a source URL. body is the extracted page text and
// is used for self-promotion detection (pass "" to skip that check).
func ClassifySource(rawURL string, authority float64, sig StructuredSignals, lens string, body string) SourceClassification

Update all callers:

  • internal/tools/classify.go:classifySource() already has body — pass it through.
  • Any other callers of ClassifySource pass "" to preserve existing behavior.

Step 4 — Emit in internal/tools/classify.go:classificationFields()

if c.SelfPromotion != nil && c.SelfPromotion.Detected {
    fields["selfPromotionSignal"] = c.SelfPromotion
}

This automatically propagates to:

  • web_search results via enrichResultsWithReputation()
  • scrape_page output via classifySource() → classificationFields()

Step 5 — Update docs/TOOLS.md

Add selfPromotionSignal to the web_search and scrape_page output schemas under the sourceClassification section.

Output schema change

web_search result objects and scrape_page output gain an optional field:

"selfPromotionSignal": {
  "detected": true,
  "brandDomain": "shopify.com",
  "brandToken": "shopify",
  "rankPosition": 1,
  "confidence": "high"
}

Field is omitted (not null) when detected is false or body is empty.
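To illustrate the "omitted, not null" behavior, here is a small sketch of the Step 4 guard applied to a plain map. The helpers selfPromotionFields and renderFields are hypothetical names used only for this example; the real classificationFields() emits other keys as well.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SelfPromotionSignal mirrors the struct proposed in Step 1.
type SelfPromotionSignal struct {
	Detected     bool   `json:"detected"`
	BrandDomain  string `json:"brandDomain"`
	BrandToken   string `json:"brandToken"`
	RankPosition int    `json:"rankPosition"`
	Confidence   string `json:"confidence"`
}

// selfPromotionFields applies the Step 4 guard: the key is only set when a
// signal exists and Detected is true, so it never serializes as null.
func selfPromotionFields(sp *SelfPromotionSignal) map[string]interface{} {
	fields := map[string]interface{}{}
	if sp != nil && sp.Detected {
		fields["selfPromotionSignal"] = sp
	}
	return fields
}

// renderFields marshals the map to a JSON string for inspection.
func renderFields(sp *SelfPromotionSignal) string {
	out, err := json.Marshal(selfPromotionFields(sp))
	if err != nil {
		panic(err)
	}
	return string(out)
}

func main() {
	detected := &SelfPromotionSignal{
		Detected:     true,
		BrandDomain:  "shopify.com",
		BrandToken:   "shopify",
		RankPosition: 1,
		Confidence:   "high",
	}
	fmt.Println(renderFields(nil))                                   // {}
	fmt.Println(renderFields(&SelfPromotionSignal{Detected: false})) // {}
	fmt.Println(renderFields(detected))
}
```

Guarding at the map level, rather than relying on a struct tag, means a non-nil signal with Detected set to false is dropped too.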

Tests

Unit test in internal/content/classify_test.go

func TestDetectSelfPromotion(t *testing.T) {
    cases := []struct {
        name     string
        host     string
        title    string
        body     string
        wantNil  bool
        wantRank int
        wantConf string
    }{
        {
            name:     "shopify blog lists shopify first",
            host:     "www.shopify.com",
            title:    "Best Ecommerce Platforms for 2024",
            body:     "<ol><li>Shopify — best overall</li><li>WooCommerce</li><li>BigCommerce</li></ol>",
            wantNil:  false,
            wantRank: 1,
            wantConf: "high",
        },
        {
            name:    "independent review that ranks shopify first is not self-promotion",
            host:    "www.pcmag.com",
            title:   "Best Ecommerce Platforms",
            body:    "<ol><li>Shopify — best overall</li><li>WooCommerce</li></ol>",
            wantNil: true,
        },
        {
            name:    "shopify blog that does not self-rank",
            host:    "www.shopify.com",
            title:   "How to Start an Online Store",
            body:    "Starting an online store requires a platform...",
            wantNil: true,
        },
        {
            name:     "clickup blog lists clickup first",
            host:     "clickup.com",
            title:    "10 Best Project Management Tools",
            body:     "1. ClickUp\n2. Asana\n3. Monday.com",
            wantNil:  false,
            wantRank: 1,
            wantConf: "high",
        },
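    }
    // Run each case as a subtest; without this loop, cases is declared but
    // unused and the file does not compile.
    for _, tc := range cases {
        t.Run(tc.name, func(t *testing.T) {
            got := DetectSelfPromotion(tc.host, tc.title, tc.body)
            if tc.wantNil {
                if got != nil {
                    t.Fatalf("expected nil, got %+v", got)
                }
                return
            }
            if got == nil {
                t.Fatal("expected a signal, got nil")
            }
            if !got.Detected || got.RankPosition != tc.wantRank || got.Confidence != tc.wantConf {
                t.Errorf("got %+v, want rank %d, confidence %q", got, tc.wantRank, tc.wantConf)
            }
        })
    }
}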

Integration test: real URLs

Add to internal/tools/tools_test.go or a new internal/content/classify_live_test.go with //go:build live:

| URL | Expected detected | Expected rankPosition |
| --- | --- | --- |
| https://www.shopify.com/blog/best-ecommerce-platforms | true | 1 |
| https://clickup.com/blog/best-project-management-tools/ | true | 1 |
| https://www.notion.so/blog/best-note-taking-apps (if exists) | true | 1 |
| https://www.pcmag.com/picks/the-best-project-management-software | false | N/A |
| https://www.techradar.com/best/best-ecommerce-platform | false | N/A |

Drift gate

Add selfPromotionSignal to the web_search output schema section in docs/TOOLS.md and confirm TestOutputSchemaMatchesResponse still passes (it validates schema fields against live handler output).

Acceptance criteria

  • DetectSelfPromotion("www.shopify.com", "Best Ecommerce Platforms", body_with_ol_shopify_first) returns {Detected: true, RankPosition: 1, Confidence: "high"}
  • DetectSelfPromotion("www.pcmag.com", ...) always returns nil (no false positives on independent reviewers)
  • web_search(query: "best ecommerce platforms", claim: "Shopify is the best ecommerce platform") result for shopify.com URLs includes selfPromotionSignal.detected: true
  • scrape_page(url: "https://www.shopify.com/blog/best-ecommerce-platforms") output includes selfPromotionSignal
  • No existing tests broken (go test -race ./... passes)
  • docs/TOOLS.md updated; TestToolsDocMatchesRegistry passes

Labels / milestone

enhancement · P2 · pipeline
Milestone: v1.33.0 Anti-Sloptimization (create if not exists)
