Rubric version history

How the rubric evolved

The Context Management Index did not arrive fully formed. v1.0 is the seventh version, the product of six months of internal research iteration. Every major change was driven by external research rather than internal opinion, drawing on IBM's 47% accuracy study, Anthropic's attention paper, the AVRS framework, Augment Code's context engine case study, and 120+ internally scored teams.

This page is the full development log. v0.1 was intentionally simple, and we are not hiding that. Showing the full process is the credibility anchor, and it gives teams context when explaining the rubric to stakeholders.

  • 7 versions to v1.0
  • 6 months development span
  • Dimensions grew from 3 to 8
  • 120+ teams calibrated v1.0

Version timeline

Version | Date    | Status   | Dimensions | Headline
v1.0    | 2026-04 | Public   | 8 | First public release. Data-calibrated weights + standardized criteria + full public methodology.
v0.9    | 2026-03 | Internal | 8 | Added Measurement & Feedback Loops as the 8th dimension.
v0.8    | 2026-02 | Internal | 7 | Added Context Window Optimization; split out Tool Configuration.
v0.7    | 2026-01 | Internal | 6 | Split Team Context Sharing out; introduced team-size-based weighting.
v0.5    | 2025-12 | Internal | 5 | Added Memory & Persistence as a dedicated dimension; introduced the Tool Swap Test.
v0.3    | 2025-11 | Internal | 4 | Added Documentation-as-Context after the IBM 47% accuracy study.
v0.1    | 2025-10 | Internal | 3 | First draft: a binary checklist for context files.

Detailed history


v1.0 (changes from v0.9)
  • Reweighted: all dimensions, recalibrated against 120+ teams.
  • Renamed: "Measurement & Feedback", shortened from "Measurement & Feedback Loops" for table fit.
Why it changed
  • Correlation analysis of 120+ scored teams from v0.8 and v0.9 internal runs allowed weights to be recalibrated against actual outcome predictors.
  • Multiple regression of dimension scores against self-reported AI productivity outcomes revealed which dimensions actually predict outcomes at each team size (a simplified sketch of this step follows the list).
  • A stable, public version was needed to anchor the leaderboard launch.
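To make the recalibration step concrete, here is a minimal sketch of the idea, not the actual analysis code. It assumes a table with one row per scored team, eight dimension-score columns, a self-reported outcome column, and a team_size label; the column names, the use of scikit-learn, and the clip-and-normalize step are all illustrative assumptions.

```python
# Hedged sketch of the weight-recalibration idea, not the rubric's published
# method. Assumes a DataFrame with one row per scored team: eight dimension
# scores, a self-reported productivity outcome, and a team_size label.
import pandas as pd
from sklearn.linear_model import LinearRegression

DIMENSIONS = [
    "project_context_files", "memory_persistence", "documentation_as_context",
    "team_context_sharing", "context_window_optimization",
    "code_organization_for_ai", "tool_configuration", "measurement_feedback",
]

def fit_weights(teams: pd.DataFrame, team_size: str) -> dict[str, float]:
    """Regress outcomes on dimension scores for one team-size bucket,
    then normalize the clipped coefficients so they sum to 1.0."""
    bucket = teams[teams["team_size"] == team_size]
    model = LinearRegression().fit(bucket[DIMENSIONS], bucket["reported_outcome"])
    coefs = model.coef_.clip(min=0.0)   # ignore negative predictors
    return dict(zip(DIMENSIONS, coefs / coefs.sum()))
```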
Weights by team size
Dimension                   | Indie | Small | Medium | Large
Project Context Files       | 0.20  | 0.18  | 0.14   | 0.10
Memory & Persistence        | 0.20  | 0.14  | 0.10   | 0.08
Documentation-as-Context    | 0.08  | 0.12  | 0.14   | 0.14
Team Context Sharing        | 0.02  | 0.10  | 0.14   | 0.16
Context Window Optimization | 0.10  | 0.08  | 0.10   | 0.10
Code Organization for AI    | 0.10  | 0.14  | 0.12   | 0.10
Tool Configuration          | 0.20  | 0.14  | 0.10   | 0.10
Measurement & Feedback      | 0.10  | 0.10  | 0.16   | 0.22
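Read alongside the table, a composite v1.0 score is a weighted sum of the eight dimension scores, with each team-size column of weights summing to 1.00 so the composite stays on the per-dimension scale. The sketch below is illustrative only: the dimension names and weights come from the table above, while the scoring scale and the compute_cmi helper are assumptions.

```python
# Illustrative only: dimension names and weights are copied from the table
# above; the compute_cmi helper is an assumption, not published scoring code.
V1_WEIGHTS = {
    # dimension: (indie, small, medium, large); each column sums to 1.00
    "Project Context Files":       (0.20, 0.18, 0.14, 0.10),
    "Memory & Persistence":        (0.20, 0.14, 0.10, 0.08),
    "Documentation-as-Context":    (0.08, 0.12, 0.14, 0.14),
    "Team Context Sharing":        (0.02, 0.10, 0.14, 0.16),
    "Context Window Optimization": (0.10, 0.08, 0.10, 0.10),
    "Code Organization for AI":    (0.10, 0.14, 0.12, 0.10),
    "Tool Configuration":          (0.20, 0.14, 0.10, 0.10),
    "Measurement & Feedback":      (0.10, 0.10, 0.16, 0.22),
}

TEAM_SIZES = ("indie", "small", "medium", "large")

def compute_cmi(dimension_scores: dict[str, float], team_size: str) -> float:
    """Weighted sum of per-dimension scores for one team-size column."""
    col = TEAM_SIZES.index(team_size)
    return sum(
        dimension_scores[dim] * weights[col]
        for dim, weights in V1_WEIGHTS.items()
    )

# Sanity check: with uniform dimension scores, the composite equals that score,
# because every weight column sums to 1.00.
assert abs(compute_cmi({d: 3.0 for d in V1_WEIGHTS}, "small") - 3.0) < 1e-9
```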
Research sources
  • 120+ team correlation analysis
  • All prior version sources retained
Known limitations
  • The 120-team sample is large enough to detect major patterns, but not large enough to support two-decimal precision in the weights.
  • No industry vertical analysis; fintech vs. game studio may have different optimal weights.
  • US/Europe heavy sample; APAC patterns may differ.
  • Self-reported outcomes carry bias; v1.2+ aims for objective signals (PR cycle time, turnover ratio).

Why this history matters

The rubric is not a marketing asset. It is a research artifact, and showing the full development process is the credibility anchor:

  • v0.1 was embarrassingly simple. We're not hiding that.
  • Every major change was driven by external research, not internal decisions: IBM's study, Anthropic's attention paper, AVRS, Augment Code's engine case study.
  • We got things wrong and fixed them. We collapsed team sharing into project context files in v0.5 and split it out in v0.7. We buried tool configuration and then separated it in v0.8.
  • The field is still moving. v1.0 is not the final version. v1.1 is targeted for the next quarter based on the next round of scoring data and emerging research.
For the leaderboard

A score under v1.0 is auditable and means something specific. Future versions never retroactively change past scores. Users can opt to re-score under any version.

For tool vendors

No tool gets scored, ever. If any vendor ships a feature that genuinely changes what good practice looks like, the rubric updates, regardless of who built it.