Rolling Out AI Coding Tools: The Phase Everyone Skips

Every guide to rolling out AI coding tools tells the same story: pick a pilot group, write a policy, pass security review, expand to everyone, announce success. The story isn't wrong. It's just incomplete — it ends at the exact moment the outcome gets decided.

Deployment is the prologue. What determines whether your team actually gets better with these tools is what happens in the three to six months after everyone has a seat. Almost nobody plans for that phase. Most leaders don't know it exists until they're standing in the middle of it, looking at a usage dashboard that says "adoption: 94%" and a team that doesn't seem meaningfully faster.

This guide covers the standard playbook briefly — it matters, and getting it wrong creates real problems — and then spends most of its time on the phase the other guides skip.

The standard playbook, and where it's right

The conventional rollout advice gets four things right. Do them; just don't mistake them for the whole job.

Pick a pilot group deliberately. Eight to fifteen engineers, four to six weeks, working on real tickets — not a toy project, because toy projects hide every interesting failure mode. Include at least one loud skeptic. A pilot made entirely of enthusiasts tells you how enthusiasts feel, which you already knew. The pilot's output isn't a verdict on the tool; it's a draft of your norms and a list of the problems you'll hit at scale.

Set policy and guardrails before broad access. Decide in writing: which codebases and data classes can be exposed to the tool, what your stance is on AI-generated code and licensing, and — most importantly — that AI-written code goes through the same review bar as human-written code. The teams that skip this either over-restrict (and the tool dies) or under-restrict (and legal finds out later).

Run procurement and security review properly. Data retention terms, training opt-outs, where transcripts and telemetry live, who administers seats. Boring, necessary, and a one-time cost. The vendors expect these questions now; the answers are mostly fine.

Announce norms, not just availability. "You now have access to X" is not a rollout. Say what's encouraged, what's required, what's off-limits, and who to ask. Engineers fill silence with their own assumptions, and half of them will assume the safest possible interpretation, which is "ignore it."

All of this is table stakes. It gets the tool into people's hands legally, safely, and with expectations set. Then the standard playbook says "monitor adoption" and stops — and that's where the actual problem starts.

The "tried it once" cliff

Here is what the dashboard won't show you at month three.

A few of your engineers have genuinely restructured how they work. They plan before they generate, they hand whole tasks to the agent, they've built their own scaffolding of saved commands and project context. Most of the rest tried the tool a few times, got a mediocre result on a hard problem, quietly concluded "it's fine for boilerplate," and went back to typing — with the occasional autocomplete accepted. Your seat-utilization report counts both groups as active users. A person who pasted one stack trace into a chat window last Tuesday is "adopted."

This distribution is not a sign your rollout failed. It's what every rollout produces by default, because nothing in the deployment process develops skill — it only distributes access. OpenAI's 2025 enterprise research put a number on the result: roughly a 4x productivity gap between power users and typical users of the same tools, with identical access. Same licenses, same models, same announcement email. The variance lives entirely in how people use the thing.

It can be worse than flat. The 2024 DORA report found AI adoption climbing while delivery stability fell: an estimated 7.2% reduction in delivery stability for every 25% increase in AI adoption. More code, generated faster, by people who haven't learned to direct or review it well, is not an improvement. It's a louder version of the old problem.

A rollout distributes access. Nothing in it develops skill. The gap between those two is where most AI tooling investments quietly die.

The mistake underneath all of this is treating proficiency as something that happens by exposure — give people the tool and time, and skill will follow. It doesn't, any more than giving everyone a gym membership produces a fit team. We've written separately about why adoption and proficiency are different measurements and why conflating them ruins your data; the short version is that "has used it" and "is good with it" need to be tracked as different things, because they are.

The missing phase: proficiency development

So what does the skipped phase actually consist of? Four activities, in order.

Observe real usage. Not surveys — people are unreliable narrators of their own workflow, and "yeah, I use it a lot" can mean almost anything. The unit of observation is the session: what someone actually asked for, how they steered, what they did when the agent went sideways. Standard engineering metrics can't see any of this. PR counts and DORA numbers look identical whether someone ran a disciplined agent workflow or fought a chatbot for an hour.

Find your emergent power users. That 4x cohort already exists on your team; the rollout produced it by accident. They are frequently not who you'd guess — not necessarily your most senior people, sometimes a mid-level engineer who happened to develop good instincts for decomposing work. You can't designate power users in advance. You can only find the ones the rollout created, which is why observation comes first.

Name what they do. "She's just good with it" is useless for teaching. Watch the strong sessions and the mystery resolves into specific, observable behaviors: planning before generating code, delegating self-contained tasks wholesale, keeping context clean instead of letting a session sprawl, saving repeated instructions as reusable commands, committing work in small reviewable units. Behaviors with names can be taught. Vibes can't.

Make spread deliberate. Left alone, practices spread by adjacency: whoever sits near the power user picks things up, and everyone else doesn't. Replace accident with intent: one named practice at a time, demonstrated on real work, with a specific next practice assigned per person. "Get better at AI" is uncoachable; "you've never used plan mode on a refactor, so try it on the next one" is a Tuesday. Our guide to coaching engineers on AI tools covers how to run those conversations without them feeling like surveillance.
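If it helps to see the observe, name, spread sequence as data, here is a minimal sketch. It assumes someone has hand-labelled each reviewed session with the practices it showed. The Session record, the practice labels, and the engineer names are hypothetical illustrations, not any tool's format. The sketch counts practices per person and suggests one next practice for each engineer.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

# Hypothetical labels for the behaviors named above; use your own list.
PRACTICES = [
    "plan_before_code",
    "delegate_whole_task",
    "keep_context_clean",
    "saved_commands",
    "small_commits",
]


@dataclass(frozen=True)
class Session:
    """One reviewed session, labelled by hand (assumed data shape)."""
    engineer: str
    practices: frozenset


def practice_matrix(sessions):
    """Count, per engineer, how many sessions showed each practice."""
    matrix = defaultdict(Counter)
    for s in sessions:
        # Accessing the key registers the engineer even if no practice was seen.
        matrix[s.engineer].update(s.practices)
    return matrix


def next_practice(matrix, engineer):
    """Return the first listed practice this engineer has never shown, or None."""
    seen = matrix[engineer]
    for p in PRACTICES:
        if seen[p] == 0:
            return p
    return None


sessions = [
    Session("ana", frozenset({"plan_before_code", "small_commits"})),
    Session("ana", frozenset({"plan_before_code", "delegate_whole_task"})),
    Session("ben", frozenset()),  # still "active" on a seat report
]

matrix = practice_matrix(sessions)
for engineer in sorted(matrix):
    print(engineer, dict(matrix[engineer]), "->", next_practice(matrix, engineer))
```

Picking the first unseen practice in list order is an arbitrary choice made for the sketch. In practice the person running the loop would choose the next practice by judgment, based on the work that engineer has coming up.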

A monthly loop you can run starting next week

None of this requires tooling. Here is a cadence one EM or staff engineer can run in a few hours a month. It's manual and lossy, and it still beats what most teams do, which is nothing.

Week 1 — collect. Ask three to five engineers to share one real session each from the past month: a transcript, a screen recording, or a 15-minute walkthrough of how they tackled an actual ticket. Make the ask specific ("the trickiest thing you used the agent for") and make it safe — this is show-and-tell, not audit. Known bias: people share their best sessions. Accept it; even cherry-picked sessions reveal technique.

Week 2 — review and name. One person reads everything and writes a half-page: what did the strongest sessions do differently, stated as two or three concrete behaviors in plain language. "Asked for a plan and edited it before any code was written." "Gave the agent the failing test instead of describing the bug." Resist the urge to write a taxonomy. You need a list, not a framework.

Week 3 — spread one practice. Pick a single behavior from the list. Have the person who does it well demonstrate it on a live ticket in front of the team — twenty minutes, real work, mistakes included. Then give each engineer one specific practice to try before the next cycle. One. The lunch-and-learn that covers eight tips changes nothing; the assignment to try one thing on real work sometimes does.

Week 4 — check and re-aim. Quick pulse: who tried their assigned practice, what happened, what surprised them. Keep what moved, drop what didn't, pick next month's focus. Then the loop repeats with fresh sessions.
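A shared doc is enough to run this loop. For teams that would rather keep the week-3 assignments and the week-4 pulse in a script, the sketch below shows one possible shape. The Assignment record, its field names, and the min_helped threshold are assumptions made for illustration, not a prescribed format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Assignment:
    """One practice given to one engineer in week 3 (hypothetical record)."""
    engineer: str
    practice: str
    tried: bool = False
    helped: Optional[bool] = None  # stays None until the week-4 pulse
    note: str = ""                 # what happened, in the engineer's words


def summarize(assignments):
    """Per practice: how many were assigned it, tried it, and said it helped."""
    summary = {}
    for a in assignments:
        row = summary.setdefault(a.practice, {"assigned": 0, "tried": 0, "helped": 0})
        row["assigned"] += 1
        row["tried"] += int(a.tried)
        row["helped"] += int(bool(a.helped))
    return summary


def keep_or_drop(summary, min_helped=1):
    """Mark a practice 'keep' if at least min_helped people said it helped."""
    return {
        practice: "keep" if row["helped"] >= min_helped else "drop"
        for practice, row in summary.items()
    }


assignments = [
    Assignment("ana", "keep_context_clean", tried=True, helped=True, note="shorter sessions"),
    Assignment("ben", "plan_before_code", tried=True, helped=False, note="plan was too vague"),
    Assignment("cy", "plan_before_code", tried=False),
]

summary = summarize(assignments)
decisions = keep_or_drop(summary)
print(summary)
print(decisions)
```

With numbers this small, treat "drop" as a prompt to read the notes rather than as a verdict. A practice that one person tried once and found unhelpful may only need a better demonstration.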

Run this for a quarter and you will know more about how your team actually uses AI tools than any dashboard has told you — and you'll have a teaching mechanism instead of a hope.

The honest caveat: the manual version doesn't scale past one team and sees only what people volunteer. Five self-selected sessions a month out of hundreds is a keyhole. It's the right keyhole — but it's still a keyhole.

Where Accrete fits

Accrete instruments this loop instead of replacing it. A small client on each developer's machine parses their Claude Code session transcripts (every session, not five hand-picked ones) and classifies which practices appear: planning before coding, delegating to subagents, using saved commands, context management, committing work. The team dashboard turns that into a per-person, per-practice adoption matrix, so "find your power users and name what they do" stops being a monthly archaeology project and becomes something you look at. Session drill-down gives you the week-3 teaching material from your own codebase. The full picture is on the "Coach your team" page.

We're pre-launch, so we won't pretend to have benchmark numbers — we don't yet. What we have is the conviction, from running this loop by hand, that the observe-name-spread cycle is the phase that decides whether your rollout was worth it.

The rest of our guides take individual pieces of this phase deeper. If you'd rather run the loop with instrumentation than with a shared doc and goodwill, join the early access list — or become a design partner and shape the practice catalog around what your team's sessions actually show.