# 03 · Collector

The privacy boundary. Everything downstream consumes `NormalizedPost`/`CountPoint` and never learns the transport.

## 1 · Interface

```ts
interface Collector {
  counts(query: string, opts: { granularity: "hour", start?: string, end?: string }): Promise<CountPoint[]>
  posts(query: string, opts: {
    mode: "relevancy" | "recency"
    sinceId?: string
    maxPosts: number               // hard cap for this call, enforced inside
  }): Promise<{ posts: NormalizedPost[], newestId: string | null }>
}
interface CountPoint { hour: string; mentions: number }
```

Implementations (all private): `XOfficialCollector` (build now), `FixtureCollector` (build now, for tests), `X3pCollector` (stub the class, implement at swap time; the twitterapi.io client pattern exists in Sift's `records-ingestion-utils` for reference, but do NOT import it).

## 2 · XOfficialCollector (X API v2, verified June 2026)

- `GET /2/tweets/counts/recent` — `query`, `granularity=hour`. 300 req/15min app-auth. Does not consume the monthly post cap.
- `GET /2/tweets/search/recent` — `max_results=100`, `tweet.fields=created_at,public_metrics,lang,referenced_tweets,conversation_id,possibly_sensitive,entities`, `expansions=author_id`, `user.fields=public_metrics,verified,created_at`. `sort_order=relevancy` for top pulls; `sort_order=recency` + `since_id` for the tail. 450 req/15min.
  - **Known issue**: relevancy mode sometimes omits `next_token`. Treat missing token as end-of-results, never retry-loop on it.
  - No engagement operators exist (`min_faves` is web-only). Rank locally on `public_metrics`.
- Query strings come from the registry compiler (02), ≤ 512 chars.
- Page limits per pull: relevancy ≤ 5 pages, recency ≤ 10 pages, and stop early at `maxPosts`.
- 429 / 5xx: exponential backoff with jitter (1s base, 5 tries), then surrender the pull (cron picks up next cycle; `since_id` makes it lossless).
- Bootstrap (new entity): one `GET /2/tweets/search/all` sweep, 30 days, ≤ 500/page, capped 5,000 posts, then never again.

## 3 · Pull recipe (per entity per content cycle)

1. relevancy pull (top conversation) — `maxPosts: 300`
2. recency pull `since_id = last_seen` (the tail) — `maxPosts: 500`
3. Union, dedup by `post_id` (also against D1 `posts_seen`), normalize, write one raw object per pull:
   `raw/<source>/<entity>/<yyyy-mm-dd>/<runId>/<query-hash>-<mode>.json` (full API response, untouched)
4. Insert `posts_seen`, debit `budget_ledger`, enqueue new post IDs to `avb-enrich`.

Daily metrics refresh: one `GET /2/tweets?ids=` batch (100/req) re-fetching yesterday's top-100 per entity so engagement matures ~24h before launch-fingerprint freeze.

## 4 · Cadence (driven by scheduler, see 05 §4)

| State | counts | content pulls | per-entity daily cap |
|---|---|---|---|
| baseline | hourly | every 4h | 1,000 |
| elevated | hourly | hourly | 4,000 |
| launch | every 30 min | every 30 min | 20,000 |

## 5 · Budget guards (hard, tested)

- `budget_ledger` debited **before** each posts() call with the requested `maxPosts`; call refused if it would exceed the entity's daily cap or the global monthly cap (env `MONTHLY_POST_BUDGET`, default 600,000).
- Refusal is a logged no-op, not an error; counts never stop.
- At 90% of monthly budget: Slack webhook warning + all entities clamp to baseline cadence (kill-switch).
- Tests: simulated queue replay must not double-debit (debit keyed by runId, idempotent); cap breach must refuse; 429 storm must back off and surrender.

## 6 · Shadow harness (build the harness now, use at swap time)

`pnpm shadow --entities fable-5,gpt-5-2 --days 7 --candidate x-3p` runs candidate alongside incumbent, writes both to raw under distinct sources, and reports: hourly count deltas, unique-author overlap, top-20 post overlap per day. Swap criterion: ≥ 95% on all three. Output is a markdown report in `ops/shadow-reports/`.
