playing with the new claude fable aaaaaaaand my token limit was reached
A public, always-fresh read on the AI model conversation: which models have mindshare, whether the crowd is loving or roasting them, what specifically people like and dislike, and which voices are driving it. artificialanalysis.ai for capability; this for perception. Refreshed every few hours, near-real-time during launches.
Every benchmark that matters is saturated or gamed, and everyone knows it. When a model ships, the real eval happens in public on X within 48 hours: builders post side-by-sides, influencers pick winners, and complaints (pricing, limits, regressions) crystallize into narrative. Nobody aggregates that signal. We already built this machine for brands; pointing it at AI models is a showcase with a built-in audience.
Sentiment about a model is effectively decided in the first two days. A leaderboard that updates every few hours during a launch owns that moment; monthly reports miss it entirely.
artificialanalysis owns capability. LMArena owns head-to-head preference (and is itself accused of being gamed). Nobody owns mindshare + sentiment + who-said-what. The slot is open.
This is Sift's core listening pipeline running on the most-watched topic in tech. Every chart is an implicit product demo, with "powered by Sift" on a property journalists will cite.
We are complementary to artificialanalysis, not competing with it. Their axis is what models score; ours is what people say. The two views disagree often, and the disagreement is the story.
| | artificialanalysis.ai | This property |
|---|---|---|
| Unit of truth | Benchmark scores, latency, price per token | Mentions, sentiment, themes, voices on X |
| Update cycle | When they re-run evals | Every 2-4 hours; 30-60 min during launches |
| Can it be gamed? | Increasingly yes (training on test sets, eval-tuning) | Astroturfing is possible but visible: author history, account age, and voice classification expose it |
| Answers | "Which model is most capable per dollar?" | "Which model is winning hearts? What do people actually complain about? Who should I follow?" |
| Editorial | Methodology pages | Auto-drafted, human-edited launch reads (the fable-5 page is the prototype) |
Standalone brand with a visible "by Sift" mark. Mockups below use vibebench as the placeholder. Final call is a rollout open question; domains unverified.
"Vibe check" is already the community's word for post-launch model evaluation. Memorable, dev-native, slightly irreverent, and the name itself states the thesis: vibes are the benchmark that can't be trained on. Risk: reads playful to enterprise press.
Descriptive and credible in a citation ("according to ModelPulse..."). Pulse conveys the refresh cadence. Risk: forgettable, sounds like a monitoring SaaS.
Exactly the right word: the spirit of the moment. Feels editorial and serious. Risk: spelling, domain availability, and a German loanword in every URL.
Maximum SEO clarity, zero explanation needed, "Index" invites citation. Risk: generic; harder to build brand affection around.
v0 ships with one model (Fable 5, the data we already have). The registry is designed for all four vectors from day one, because mindshare attaches at different altitudes: the model gets the launch buzz, the lab gets the trust debate, the product gets the consumer complaints.
| Kind | Examples | What mindshare means here | Relations |
|---|---|---|---|
| model | Fable 5, GPT-5.2, Gemini 3 Ultra, Grok 4.1 | Launch buzz, capability debate, like/dislike themes | belongs to a lab; succeeded-by chains for generations |
| open-weight | DeepSeek V4, Qwen3.5, Kimi K3, Llama 5 | Same as model + deployment/cost/fine-tune chatter; distinct audience | belongs to a lab; tagged open-weight for the filter tab |
| lab | Anthropic, OpenAI, Google DeepMind, xAI, Meta AI | Trust, drama, policy, hiring, brand-level sentiment | parent of models and products; owns official handles |
| product | ChatGPT, Claude Code, Cursor, Copilot, Gemini app | Consumer UX sentiment, pricing/limits complaints, switching talk | belongs to a lab (or company); often the loudest vector |
Visitors bring two different questions in two different tenses: "what do people like right now?" and "how is/was the launch?" The product treats them as two clocks running on the same data, so neither hijacks the other. Now is rolling and never freezes; Launch is T0-aligned, lives for 72 hours, then freezes into a permanent archive.
The leaderboard, like/dislike themes, voices, and the change feed all run on rolling windows (1h to 30d). This is the default lens on every page. Launch chrome never replaces it: at most it adds a badge and a second tab.
When launch mode is on, the entity page grows a Launch tab: day-0-aligned curves, the Launch Score forming in real time, an overlay against past launches. At T+72h the fingerprint and final score freeze into /launches. The Now tab just keeps rolling.
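The Launch clock's lifecycle reduces to a pure function of T0. A minimal sketch, assuming a hypothetical `launch_phase` helper and timezone-aware UTC timestamps; only the 72-hour window comes from the text above.

```python
from datetime import datetime, timedelta, timezone

LAUNCH_WINDOW = timedelta(hours=72)  # from the text: Launch lives for 72 hours


def launch_phase(t0: datetime, now: datetime) -> str:
    """Return the state of the Launch clock (hypothetical helper)."""
    if now < t0:
        return "pre-launch"
    if now - t0 < LAUNCH_WINDOW:
        return "live"      # Launch tab visible, score still provisional
    return "frozen"        # fingerprint and final score go to /launches


# The year is a placeholder; the source gives only "18:02 UTC on June 9".
t0 = datetime(2025, 6, 9, 18, 2, tzinfo=timezone.utc)
print(launch_phase(t0, t0 + timedelta(hours=4)))
```

The Now clock needs no equivalent: it is always rolling, so it has no state to compute.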
| Question | Surface | Lifecycle |
|---|---|---|
| "What do people like right now?" | Leaderboard + entity Now tab: Vibe Score, themes, voices, change feed | Rolling, refreshed every 2-4h (30 min in launch mode) |
| "How is the launch going?" | Entity Launch tab: day-0 curves, provisional Launch Score, past-launch overlay | Live for 72h, decays with launch mode |
| "How do launches compare?" | /launches archive: every fingerprint, every final score | Permanent, append-only, uncopyable |
| "What's the story?" | Reads: drafted from the same JSON, shipped when the story warrants one | Occasional, frozen on publish |
Who has mindshare right now. Rank, share of voice, sentiment split, momentum, change ticker.
One model in depth: Now tab (likes/dislikes with receipts, voices) + Launch tab during launches.
The archive: every launch's day-0 fingerprint and Launch Score, side by side across history.
Occasional thought pieces drafted from the data, and the credibility anchor explaining every number.
The homepage. One scroll answers "what's hot, what's loved, what's hated." Mindshare is share of tracked-set mentions (counts API, so it's cheap and hourly). Sentiment splits come from LLM-classified content pulls. The Fable 5 row carries real numbers from the June 9 capture; other rows are illustrative.
| # | Model | Lab | Mindshare | Δ 24h | Sentiment | Score | 7-day | Top theme |
|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5.2 | OpenAI | 24.1% | ▼ 4.2 | | 58 | | agents |
| 2 | Fable 5 (launch) | Anthropic | 21.8% | ▲ 18.6 | | 62 | | novelty |
| 3 | Gemini 3 Ultra | Google DeepMind | 16.4% | ▼ 2.8 | | 64 | | multimodal |
| 4 | Grok 4.1 | xAI | 11.2% | ▼ 3.1 | | 41 | | trust & safety |
| 5 | DeepSeek V4 (open) | DeepSeek | 9.7% | ▲ 1.4 | | 78 | | cost |
| 6 | Llama 5 (open) | Meta AI | 5.8% | — | | 50 | | fine-tuning |
| 7 | Qwen3.5-Max (open) | Alibaba | 4.6% | ▲ 0.6 | | 66 | | coding |
| 8 | Kimi K3 (open) | Moonshot | 3.4% | ▼ 0.9 | | 60 | | agents |
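Mindshare as defined above is a plain share computation over the tracked set. A minimal sketch, assuming hypothetical per-entity mention counts already fetched from the counts API; the `mindshare` function name and the sample numbers are illustrative, not captured data.

```python
def mindshare(mention_counts: dict[str, int]) -> dict[str, float]:
    """Share of tracked-set mentions per entity, as a percentage."""
    total = sum(mention_counts.values())
    if total == 0:
        return {name: 0.0 for name in mention_counts}
    return {name: round(100 * n / total, 1) for name, n in mention_counts.items()}


# Hypothetical hourly counts for three tracked entities.
counts = {"model-a": 300, "model-b": 150, "model-c": 50}
shares = mindshare(counts)
print(shares)
```

Because the denominator is the tracked set, adding an entity to the registry lowers every existing share; that is worth stating on the methodology page.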
1,107 posts, a 9-point net-positive start, and a pricing complaint that won't quit. Updated every 30 minutes.
The crowd cares less about MMLU than about $0.12 per million tokens.
No launch spike, but the steadiest positive drift of any frontier model this quarter.
Everything above the fold comes from two JSON files on the CDN: index.json (leaderboard rows, KPIs, themes) and reads.json (editorial strip). No backend, no API: the page is static and hydrates client-side. The Fable 5 row is real captured data.
One model in full depth. Mentions, sentiment, themes, and post metrics are real, from the June 9 Fable 5 capture (1,107 mentions, 976 unique authors); thread footprints, facet shares, and voiced-by splits are illustrative until the pipeline computes them. The what-people-say module reads like an Amazon review summary: an AI consensus paragraph that may only state what the theme counts support, color-coded aspect chips (mixed themes like quality show both sides), and a per-theme drill-in with the receipts as embeds. Top posts render as official X embeds; voices carry classification badges (the official-vs-3rd-party axis, formalized in v2).
People are taken with how different Fable 5 feels: novelty dominates the positive conversation (198 posts), with the hands-on demo wave (25) and UX praise (37) behind it. The friction is concentrated in pricing (74 posts), and 70% of those cite token limits specifically. Quality is genuinely mixed: praised in long-form writing, dinged on consistency.
playing with the new claude fable aaaaaaaand my token limit was reached
Actually it's fine guys! I figured out a way, see below. Claude Fable 5 is a great model.
This is a historic day for AI. Claude Fable (Mythos) was just released, and it's insane.
omg hahah look what Claude Fable made! level devil but fable devil (medium reasoning effort)
| Account | Who | Voice score | Posts |
|---|---|---|---|
| @giffmana | builder | 94 | 3 |
| @karpathy | builder | 91 | 1 |
| @TrungTPhan | creator | 73 | 2 |
| @BillyM2k | power user | 68 | 2 |
| @milesdeutscher | influencer | 61 | 2 |
| @AshleyDCan (rising) | creator | 55 | 1 |
Ranked by voice score (earned engagement + threads rooted + consistency), not followers. @claudeai and @sama live in their own tabs: official content and rival takes are context, not "the community."
The page is a static template hydrated from one file: models/fable-5/summary.json. Posts render as official X embeds (oEmbed), so deleted tweets vanish on their own and we never republish text. The cards above are styled placeholders for the embed slots.
A count next to a chip says that people complain about pricing. Compelling content answers the four questions a count can't: what exactly, said by whom, is it growing or fading, and is it normal for the category. Every theme carries six layers, all computed, all receipted.
Inside every theme, the sub-complaints with shares: pricing splits into token limits (70%), subscription tiers (19%), API cost (11%). Each facet keeps its own receipts. "Pricing" is a category; "caps hit mid-session" is content.
Hourly and daily series per theme, labeled accelerating / steady / fading. The complaint half-life is the tell: launch-day gripes that fade by day 3 were noise; ones still growing are product truth. Nobody else measures this.
Theme × author class. 74 pricing complaints where a third come from builders is a different fact than 74 from anon accounts. Complaints get weighed, not just counted.
Complaint rate per 1k classified posts, against the median for the entity's kind. "67/1k, 2.3× the frontier median" turns a raw count into a judgment. Amazon can't compare across products; we compare across every model we track.
An LLM one-liner per major theme ("The complaint is specific: caps hit mid-session, not subscription pricing"), passed through the same citation validator as the consensus summary. Quotable, and never beyond the data.
Out-of-vocabulary topics proposed by the classifier surface with an "emerging" badge once they cross a floor (15 posts). The vocabulary grows from the conversation instead of ossifying.
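The accelerating / steady / fading labels from the trajectory layer can be sketched as a comparison between the later and earlier halves of a theme's series. The half-split rule and the 25% band are assumptions for illustration; the source does not specify how the labels are computed.

```python
def trajectory_label(series: list[int], band: float = 0.25) -> str:
    """Label a per-theme count series by comparing its later half to its earlier half."""
    if len(series) < 2:
        return "steady"
    mid = len(series) // 2
    early = sum(series[:mid]) / mid
    late = sum(series[mid:]) / (len(series) - mid)
    if early == 0:
        return "accelerating" if late > 0 else "steady"
    ratio = late / early
    if ratio > 1 + band:
        return "accelerating"
    if ratio < 1 - band:
        return "fading"
    return "steady"


# Hypothetical daily counts for a launch-day gripe that dies out by day 3.
print(trajectory_label([40, 30, 12, 5]))
```

A production version would likely want a minimum-volume guard so that a theme moving from 2 posts to 4 is not labeled accelerating.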
/models/fable-5/themes/pricing: a shareable URL with its own OG card. In v2 the homepage's hot-theme chips link to the cross-model view of the same theme ("who gets roasted on pricing"), which may be the single most clickable page on the site.

One theme, fully unpacked. Real counts from the June 9 capture (74 pricing posts of 1,107 mentions); facet shares, voiced-by splits, and field rates are illustrative until the pipeline computes them.
"The complaint is specific: usage caps hit mid-session on the new model, not subscription pricing. Heavy users ran into limits within hours of launch."
validated · cites 74 posts

Facets come from per-theme keyword lists plus classifier output; a facet renders once it covers ≥ 10% of the theme. Sub-10% chatter stays in the long tail count.
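The 10% render rule is easy to state in code. A minimal sketch, assuming per-facet post counts are already available; the `split_facets` name and the sample counts are hypothetical and do not reproduce the illustrative shares quoted earlier.

```python
def split_facets(facet_counts: dict[str, int], min_share: float = 0.10):
    """Return (rendered facets with their shares, long-tail post count)."""
    total = sum(facet_counts.values())
    rendered: dict[str, float] = {}
    long_tail = 0
    for name, n in sorted(facet_counts.items(), key=lambda kv: -kv[1]):
        share = n / total if total else 0.0
        if share >= min_share:
            rendered[name] = round(share, 2)
        else:
            long_tail += n  # sub-threshold chatter is counted, not named
    return rendered, long_tail


# Hypothetical counts for a 74-post theme.
rendered, long_tail = split_facets(
    {"token limits": 50, "subscription tiers": 13, "API cost": 8, "refunds": 3}
)
print(rendered, long_tail)
```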
Nearly a third of the complaint comes from builder-class accounts: this one carries weight. A theme that's 90% anon reads very differently, and the page says so.
| Model | Rate /1k | vs median |
|---|---|---|
| Fable 5 | 67 | 2.3× |
| Grok 4.1 | 41 | 1.4× |
| GPT-5.2 | 31 | 1.1× |
| Gemini 3 Ultra | 24 | 0.8× |
| frontier median | 29 | |
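The field-rate row for Fable 5 can be reproduced from numbers already on the page. This sketch assumes the denominator is the 1,107 captured mentions (the text says "classified posts", so the real denominator may differ) and takes the frontier median of 29 from the table as given; both function names are hypothetical.

```python
def rate_per_1k(theme_posts: int, classified_posts: int) -> float:
    """Theme posts per 1,000 classified posts."""
    return 1000 * theme_posts / classified_posts


def vs_median(rate: float, kind_median: float) -> float:
    """Multiple of the median rate for the entity's kind."""
    return rate / kind_median


rate = round(rate_per_1k(74, 1107))   # 74 pricing posts of 1,107 mentions
ratio = round(vs_median(rate, 29), 1) # 29 is the frontier median in the table
print(f"{rate}/1k, {ratio}x the frontier median")
```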
playing with the new claude fable aaaaaaaand my token limit was reached
Theme pages render from the theme object inside summary.json: facets, trajectory, voiced-by, field rates, verdict, and receipt post IDs all ship in one place. Every theme page gets its own OG card, because "2.3× the frontier median" is built to be screenshotted.
Themes follow the conversation; the scorecard answers the questions people always bring: how is it at coding, art, depth, speed, price? A fixed rubric, scored per model from classified posts, always rendered, comparable across the whole field. And for everything the rubric can't predict, emergent capability tags ("good at OpenClaw") that accumulate from the discourse and become searchable.
Share-based (positive / opinionated), so a model with 50 mentions and one with 5,000 compare honestly. Renders at n ≥ 10; thin rows show "not enough signal" instead of a fake number. Same rubric on every model page → instant cross-model comparison.
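A rubric row under these rules might be computed as follows. This is a sketch under one assumption the text does not spell out: "opinionated" is taken to mean positive plus negative posts, with neutral posts excluded. The function names and sample counts are hypothetical.

```python
from typing import Optional

MIN_OPINIONATED = 10  # from the text: rows render at n >= 10


def rubric_score(positive: int, negative: int) -> Optional[int]:
    """Share of opinionated posts that are positive, 0-100, or None if too thin."""
    opinionated = positive + negative
    if opinionated < MIN_OPINIONATED:
        return None
    return round(100 * positive / opinionated)


def render(score: Optional[int]) -> str:
    """Thin rows show a message instead of a number."""
    return "not enough signal" if score is None else str(score)


# A small model and a large one land on the same scale.
print(render(rubric_score(30, 10)), render(rubric_score(3750, 1250)), render(rubric_score(4, 2)))
```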
The classifier extracts "good at / bad at" objects as free text; normalization collapses variants ("OpenClaw", "open claw", "openclaw agent" → openclaw). Tags are the rubric's long tail: nobody plans a rubric row for OpenClaw, but 41 people just told us. Every tag links to its receipts, and every tag feeds the global searchable index.
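One way the normalization step could work: strip case and punctuation to a compact key, then apply a small alias table for variants that carry extra words. The `ALIASES` table and `normalize_tag` function are hypothetical; the real pipeline may use the classifier itself or fuzzy matching instead.

```python
import re
from collections import Counter

# Hypothetical alias table, keyed on the compact form of the raw text.
ALIASES = {"openclawagent": "openclaw"}


def normalize_tag(raw: str) -> str:
    """Collapse spelling variants of a free-text capability into one tag."""
    key = re.sub(r"[^a-z0-9]+", "", raw.lower())
    return ALIASES.get(key, key)


mentions = ["OpenClaw", "open claw", "openclaw agent"]
tag_counts = Counter(normalize_tag(m) for m in mentions)
print(tag_counts)
```

Dropping all whitespace is aggressive (it also turns "long context" into "longcontext"), so display labels would need to be stored separately from the normalized key.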
| # | Model | Crowd verdict | ✓ good | ✗ bad | Score | Receipts |
|---|---|---|---|---|---|---|
| 1 | Fable 5 | "one-shots OpenClaw configs that used to take an afternoon" | 41 | 3 | 93 | posts ↗ |
| 2 | GPT-5.2 | "solid but needs more steering on skills" | 29 | 8 | 78 | posts ↗ |
| 3 | Kimi K3 | "surprisingly capable for the price" | 12 | 4 | 71 | posts ↗ |
| 4 | Grok 4.1 | "keeps hallucinating tool names" | 5 | 9 | 36 | posts ↗ |
Verdict snippets are validated one-liners (same citation rules as everywhere else). Search is a typeahead over one static tags.json: no backend, instant, works offline.
Every "best model for X" Google query eventually lands somewhere. /good-at/<tag> pages are built to be that somewhere: crowd-evidence-ranked, receipted, refreshed every few hours. This may be the property's biggest organic-traffic surface.
Rank by raw engagement, and the "top posts" of every launch are the labs' own announcements, forever. The real review is the 47-reply thread where builders argue about what broke. So the data model treats owned vs organic as a first-class axis from day one, and the default content surface is deep organic threads, not loud posts.
Authors carry an identity (12 classes, builder to anon), a lab affiliation, builder-panel membership, and a follower bucket. Relationship to each entity (owned / affiliated / rival / community) derives from one helper. Full model in the next section.
Posts group by conversation. A thread scores on replies, unique participants, and reply-chain depth, with an organic-root requirement for the default view. A 600-RT announcement scores below a 47-reply, 23-person argument, by design.
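A sketch of that thread score, to show why the announcement loses. The three inputs and the organic-root filter come from the text; the log damping, the weights, and the sample reply counts and depths are assumptions. Retweets are carried on the record but deliberately never enter the score.

```python
import math
from dataclasses import dataclass


@dataclass
class Thread:
    replies: int
    participants: int
    max_depth: int
    retweets: int          # stored for display, not used in scoring
    root_is_organic: bool


def thread_score(t: Thread) -> float:
    """Score conversation depth; weights are hypothetical."""
    return math.log1p(t.replies) + 1.5 * math.log1p(t.participants) + 0.5 * t.max_depth


def default_view(threads: list[Thread]) -> list[Thread]:
    """Organic-rooted threads only, deepest conversation first."""
    organic = [t for t in threads if t.root_is_organic]
    return sorted(organic, key=thread_score, reverse=True)


announcement = Thread(replies=12, participants=11, max_depth=1, retweets=600, root_is_organic=False)
argument = Thread(replies=47, participants=23, max_depth=6, retweets=9, root_is_organic=True)
print(thread_score(argument) > thread_score(announcement))
```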
Owned content gets its own labeled rail (announcements matter; they're just not "the conversation"). The owned/organic mention split is itself a published stat: a launch that's 80% owned-amplification reads very differently from one that's 80% organic.
"Official vs community" is two different questions, so the data model keeps two orthogonal axes. Identity is who an account is (builder, creator, media…), classified once per author. Relationship is where they stand relative to a specific entity (owned / affiliated / rival / community), derived per author-entity pair from one affiliation field. @sama is one author record: owned on GPT-5.2 pages, rival on Fable 5 pages. Conflating these is how every social tool gets this wrong.
| Identity (classified once) | Who that is · June 9 capture examples | Relationship to Fable 5 (derived) |
|---|---|---|
| official | Brand accounts from the registry: @claudeai, @ClaudeDevs | owned |
| leadership | Execs/founders of a tracked lab: @sama (OpenAI CEO) | rival |
| employee | Lab staff by bio: researchers, DevRel, engineers | owned or rival, by lab |
| partner | Commercial/ecosystem ties: integrators, launch partners, sponsored creators | affiliated |
| builder | People who demonstrably ship: @giffmana, @karpathy | community |
| researcher | Academics, eval authors, independent benchmarkers | community |
| creator | Educators, newsletter/video explainers: @TrungTPhan, @AshleyDCan | community |
| power_user | Heavy daily users with opinions and no product: @BillyM2k | community |
| media | Journalists and outlets: @verge, @WatcherGuru | community |
| investor | VCs, analysts, market commentators | community |
| influencer | Reach-first amplifiers: @milesdeutscher, @MarioNawfal | community |
| anon | The long tail (88% of Fable 5's authors posted exactly once) | community |
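The derivation in the right-hand column can be sketched as one helper over one affiliation field. The record shapes, the `staff` / `partner` kinds, and the rule that a partner of a different lab counts as community are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Affiliation:
    lab: str
    kind: str  # "staff" (official, leadership, employee) or "partner"


@dataclass(frozen=True)
class Author:
    handle: str
    identity: str                          # classified once per author
    affiliation: Optional[Affiliation] = None


@dataclass(frozen=True)
class Entity:
    name: str
    lab: str


def relationship(author: Author, entity: Entity) -> str:
    """Derive owned / affiliated / rival / community for one author-entity pair."""
    aff = author.affiliation
    if aff is None:
        return "community"
    if aff.lab == entity.lab:
        return "owned" if aff.kind == "staff" else "affiliated"
    return "rival" if aff.kind == "staff" else "community"


sama = Author("@sama", "leadership", Affiliation("OpenAI", "staff"))
print(relationship(sama, Entity("GPT-5.2", "OpenAI")), relationship(sama, Entity("Fable 5", "Anthropic")))
```

Identity never appears in the branch logic, which is the point: the two axes stay orthogonal.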
Per author per entity: earned engagement on classified posts + threads rooted that went deep + active days in window, percentile-scaled within the community. @giffmana outranks accounts with 10× his followers because the conversation engaged him. Follower count is a display detail, never a rank key.
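A sketch of the two steps: a raw score from the three named inputs, then percentile scaling within the community cohort. The weights, the log damping on engagement, and the sample authors are assumptions; follower count is not a parameter at all.

```python
import math


def raw_voice(engagement: int, deep_threads: int, active_days: int) -> float:
    """Combine the three inputs named in the text; weights are hypothetical."""
    return math.log1p(engagement) + 2.0 * deep_threads + 0.5 * active_days


def percentile_scale(raw: dict[str, float]) -> dict[str, int]:
    """Map raw scores to 0-100 by rank within the cohort (ties broken by sort order)."""
    n = len(raw)
    if n == 0:
        return {}
    if n == 1:
        return {handle: 100 for handle in raw}
    ordered = sorted(raw, key=raw.get)
    return {handle: round(100 * i / (n - 1)) for i, handle in enumerate(ordered)}


# Hypothetical community authors for one entity and window.
raw = {
    "a": raw_voice(engagement=5000, deep_threads=2, active_days=3),
    "b": raw_voice(engagement=200, deep_threads=0, active_days=1),
    "c": raw_voice(engagement=900, deep_threads=1, active_days=2),
}
scaled = percentile_scale(raw)
print(scaled)
```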
Voices panels default to the Community tab. Official & affiliated get their own tab; rival takes get theirs (a rival researcher dunking on a launch is real signal, labeled honestly). "Rising" badges mark community voices whose score jumped this window: new experts surface instead of the same six accounts forever.
Every metric publishes its relationship split: owned / affiliated / rival / community mentions per window. A launch that's 80% owned amplification versus 80% community chatter is the difference between a push and a moment, and the site says which one happened.
The editorial layer. The guaranteed launch artifact is the Launch tab and its score (pure data, zero human dependency); a read ships on top when the story warrants one. An LLM drafts the narrative from the same JSON the dashboard uses, a human edits and signs off, and during launch week the data blocks keep refreshing live inside the prose. The fable-5 page on the landing site is the hand-built prototype of exactly this.
Anthropic shipped Fable 5 at 18:02 UTC on June 9. Within four hours it had jumped from #7 to #2 in mindshare. The crowd's verdict so far: genuinely novel, mildly magical, and everyone is hitting their token limit.
The launch arc was textbook: a quiet Monday baseline of roughly 40 mentions a day, a vertical spike when @claudeai posted, and a second wave at 21:00 UTC when the demos started landing. What's unusual is the shape of the sentiment. Most frontier launches open with a polarized split; Fable 5 opened with an 87% neutral wall of news-sharing and a small but unusually durable positive core.
Novelty dominates (198 classified posts). The "Mythos" architecture is the hook; the word "different" appears in a fifth of all positive posts. UX (37) and the demo wave (25) follow: the level-devil game clone from @LexnLin did real numbers for a single-shot demo.
Actually it's fine guys! I figured out a way, see below. Claude Fable 5 is a great model.
Pricing is the complaint (74 posts), and inside it, one specific grievance: token limits. The single most-engaged negative post of the day is a one-liner about hitting the cap. Trust-and-safety chatter (35) is mostly spillover from the broader alignment debate rather than anything specific to this launch; bugs (17) are scattered and minor so far.
Official accounts earned the engagement crown honestly: @ClaudeDevs' Build Day post is the top post of the launch, full stop. The amplification layer was crypto-news media (@WatcherGuru's "JUST IN" post drew 608 retweets on its own), and the credibility layer came from builders: @karpathy and @giffmana posting hands-on within hours moved the sentiment needle more than any official content.
Methodology: mentions counted via the X counts API across 4 alias queries; sentiment and themes classified per-post by LLM; full pipeline on the methodology page. Source data: models/fable-5/summary.json
Reads are versioned documents, not dashboards: each refresh appends to the timeline, and once launch mode ends the read freezes into the permanent record of how the launch went. That archive becomes the moat.
What separates "a dashboard" from "the thing people cite": coined metrics with named methodologies, and charts that exist nowhere else. Five of them, all computed from data the pipeline already produces. The first two are the two clocks' numbers: the daily standing and the opening weekend.
Pure favorability: how the model is regarded right now, deliberately excluding volume (mindshare already measures popularity, the way box office sits next to the Tomatometer). It appears as the leaderboard's Score column, the entity hero number, and the Gap's y-axis.
The Vibe Score's sibling on the Launch clock: opening weekend, frozen into the archive at T+72h. Weighted blend, formula versioned and published like an index methodology. "Fable 5 debuted at 87" is the sentence the press writes; nobody has coined this for model launches.
The thesis as one chart. The off-diagonal is the story: models people love that benchmarks underrate, and the reverse. Standing homepage chart, regenerated every refresh, built to be screenshotted.
Aggregate sentiment is 87% neutral noise on news days. A curated builder panel (the karpathy/giffmana tier, classification already in the pipeline) gets its own series. Zero marginal X cost: panel members are already in the pulls.
Dashboards show state; return visits come from change. Events are pure diffs of consecutive rollups (thresholded, deduped), shown as a ticker on the homepage and a timeline on entity pages. Brigading gets flagged in public rather than silently absorbed: the credibility defense doubles as content.
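"Pure diffs of consecutive rollups (thresholded, deduped)" might look like this. The rollup shape, the 2-point threshold, and the dedupe key are assumptions; the sample values are hypothetical.

```python
def diff_rollups(prev: dict, curr: dict, min_delta: float = 2.0) -> list[dict]:
    """Emit one event per (entity, metric) that moved by at least min_delta."""
    events = []
    for entity, metrics in curr.items():
        before = prev.get(entity, {})
        for metric, value in metrics.items():
            if metric not in before:
                continue  # no baseline yet, nothing to diff
            delta = value - before[metric]
            if abs(delta) >= min_delta:
                events.append({"entity": entity, "metric": metric, "delta": round(delta, 1)})
    return events


def dedupe(events: list[dict], seen: set) -> list[dict]:
    """Drop events whose (entity, metric, direction) was already published."""
    fresh = []
    for e in events:
        key = (e["entity"], e["metric"], e["delta"] > 0)
        if key not in seen:
            seen.add(key)
            fresh.append(e)
    return fresh


prev = {"fable-5": {"mindshare": 3.2, "score": 60}}
curr = {"fable-5": {"mindshare": 21.8, "score": 61}}
seen: set = set()
events = dedupe(diff_rollups(prev, curr), seen)
print(events)
```

Because events derive only from two stored rollups, the feed can be regenerated or audited from the published JSON alone.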
A leaderboard nobody trusts is worthless. The methodology page is a first-class surface, written for the skeptical reader, and every chart on the site deep-links to the section that explains it.
artificialanalysis earned citations because their methodology survives scrutiny. Ours must survive a harder test: sentiment is easier to dismiss than latency. The defense is radical transparency, including publishing the per-entity query strings and a daily data-quality note ("Jun 9: RT flood from crypto-news accounts inflated volume; themes unaffected").
It also disarms the obvious attack: when a lab's fans claim the numbers are rigged, the answer is a public query string and a JSON file they can download.
Benchmarks are table stakes; they want the field report. Like/dislike themes and builder-class voices answer the question benchmarks can't: what's it like to live with this model.
Launch mode is a war room they don't have to build. The hour-by-hour arc, the complaint taxonomy, and who's amplifying: this audience converts to Sift pipeline.
"Mindshare jumped from 3% to 22% in 24 hours (per vibebench)" is a sentence that writes itself into coverage. Citations are the growth loop.