AI as multiplier, discipline as durability
AI as a tool is a lottery — some sessions go fast, some produce technical debt nobody catches. AI as a multiplier needs discipline that survives the session: a constitutional document the AI is bound by, ADRs that record every non-trivial decision, persistent memory that bridges conversations, automation that prevents the wrong shortcut. With that scaffolding, one person ships at near-team velocity and the project ends up more resilient to disappearance, not less. This is the system I built around myself.
The realization that started this
When I first started using AI seriously for real production work — late 2024, on what became ECM — I noticed two things. First, the speedup was real: tasks that would have taken half a day finished in 20 minutes. Second, the speedup was wildly uneven. One session produced a clean, well-tested module. The next session produced something that compiled, ran, and quietly broke the no-throw rule in three places.
The pattern was clear once I named it. AI is excellent at producing code that matches the prompt. AI is mediocre at remembering the rules of this specific codebase across sessions. AI is terrible at respecting the decisions you've already made when those decisions aren't visible in the local context of the file it's editing; it will happily relitigate them.
The fix isn't "better prompts." The fix is making the rules of the codebase structurally visible to every conversation, so the AI doesn't have to remember them — it has to read them. That single shift is what turns AI from a 2x lottery into a 10–20x multiplier with consistent output.
Six practices that make it work
The system I converged on has six distinct practices: four durable artifacts and two operating rules. Each solves a specific failure mode I'd observed.
1. CLAUDE.md / AGENTS.md — the constitution
A single file at the root of the repository that AI is instructed to read before writing anything. Mine opens with: "This file is law. Every code change, every file created, every pattern used MUST comply. Violations are not 'I'll fix it later' — they are rejected. Read before writing."
(I call mine CLAUDE.md because that's what Claude Code reads by default. If I were starting today, I'd name it AGENTS.md — the format has converged across agents, and the agent-agnostic naming is the right default. The content rules are identical; the filename is just where today's agents look first.)
The content is short — about 300 lines — and densely opinionated. It encodes the rules that aren't visible from looking at a random source file: no throwing in domain or application layers, use Result types; no DTOs outside the API layer; no any casts — use unknown with type guards; no cross-module imports, events only; specific carve-outs for cross-module atomic writes via forwardRef. Concrete examples of the wrong way and the right way, side by side.
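To make that concrete, here is the shape one of those side-by-side entries takes. This is a minimal sketch, not ECM's actual constitution; the Result shape and the claim example are illustrative stand-ins for whatever your domain uses:

```typescript
// A Result type of the kind the constitution mandates (illustrative shape).
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };

interface Claim {
  id: string;
  status: "pending" | "approved";
}

// WRONG: the application layer throws; the failure mode is invisible in the
// signature and every caller needs a try/catch it will eventually forget.
async function approveClaimThrowing(repo: Map<string, Claim>, id: string): Promise<Claim> {
  const claim = repo.get(id);
  if (!claim) throw new Error(`Claim ${id} not found`);
  return { ...claim, status: "approved" };
}

// RIGHT: the failure is part of the return type; the compiler forces the
// caller to handle it before touching the value.
async function approveClaim(
  repo: Map<string, Claim>,
  id: string,
): Promise<Result<Claim, "CLAIM_NOT_FOUND">> {
  const claim = repo.get(id);
  if (!claim) return { ok: false, error: "CLAIM_NOT_FOUND" };
  return { ok: true, value: { ...claim, status: "approved" } };
}
```

The pairing matters more than the pattern: the AI doesn't have to infer what "no throwing" means in practice, it sees both versions and copies the right one.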
This solves the "the AI doesn't know the rules of this codebase" failure mode. After CLAUDE.md was strict enough, no session ever wrote a throw in a use case again. The compliance rate went from "most of the time" to "every time" because the rule was visible at session start.
One honest caveat about the rule list. Not every entry in CLAUDE.md is a principle. Some are accommodations of where the codebase started, encoded into discipline because, by the time it mattered, keeping things consistent cost less than unwinding the original state. ECM has a concrete example: the frontend contracts are snake_case throughout — an uncommon convention in TypeScript/React land, applied consistently. Someone with stronger opinions about ecosystem norms might rightly have pushed back. The origin is honest: the FE was initially mocked up by a non-developer who wrote entity.ts files sketching the data shapes we'd need; the AI read those as DB contract files and wrote them in snake_case (DB convention); the contributor referenced them everywhere; by the time I looked, every FE contract was snake_case. Three options were on the table:
1. Add mapper layers — DB → entity, entity → API — and restore canonical camelCase across the FE.
2. Refactor the FE to camelCase and teach the contributor a convention that's a style, not a rule, accepting refactor regression risk from someone who wouldn't fully understand why the change was happening.
3. Drop the two layers of mapping and live the snake_case life.
I picked (3). It reduced mapping overhead, removed long sessions of "how, why, what" with a contributor without a tech background, and sped up delivery. Do I care today? No. Would I refactor it? Probably not — the regression risk is high and the cost isn't worth it. Would I have made the same call with five engineers on the team at decision time? Almost certainly no — I'd have caught it earlier and gone with option (2): refactor the FE to canonical casing with capable hands available, no ongoing mapper tax to maintain forever. Mapper layers (option 1) are the right reach when refactor risk dominates; option (2) wins when you have the people to do it properly the first time.
The point isn't to defend the casing choice. The more precise framing of what I actually have is "uncommon convention, applied consistently, costing me ecosystem-friendly defaults" — which is materially different from "anti-pattern." Anti-pattern implies this hurts. What I have is this surprises new entrants and bumps into a few library defaults at the margin. Bounded cost, bounded surface. The deeper point is that "discipline" in CLAUDE.md isn't a synonym for "principle." Some rules are honest accommodations of an early state you didn't unwind. Both kinds — principled and accommodation — get the same consistent AI compliance, and the operational win is the same. The honest framing for a reader: be willing to call accommodations what they are, especially when explaining the system to someone new. Pretending every rule is a deep design choice will eventually trip you up when somebody asks "why this way?" and the truthful answer is "because it was already this way."
2. ADRs — the decision graph
Architecture Decision Records, written as I go. Each one captures one architectural decision: context, decision, consequences, alternatives considered with rejection reasons, implementation notes. ECM has 38 of them in nine months.
The thing ADRs solve that CLAUDE.md doesn't is the "why didn't you just..." failure mode. Future me — or a future contributor, or the AI in a session three months from now — looks at a piece of code that seems weird and thinks "we should just refactor this back to the obvious thing." The ADR has the obvious thing in the "Alternatives Considered" section with the reason it was rejected. The refactor doesn't happen. The hard-won design stays.
I wrote a separate post on ADR-driven solo development. The short version: writing ADRs while making the decision (not in batches afterwards) costs 20 minutes per decision and saves hours per decision when future-you would otherwise relitigate it.
3. Memory files — the persistent context
Conversation-spanning notes that the AI can read at the start of any new session. Mine are organized by category: user profile (who I am, what I'm working on), project facts (key technical details that recur), feedback (corrections I've given the AI that should generalize), references (where external resources live).
This solves the "new session, no context, has to learn everything from scratch" failure mode. A fresh conversation doesn't restart at zero — it inherits the curated, distilled version of what previous sessions discovered. The AI walks in already knowing my codebase has 38 ADRs and one extracted service and a deferred trigger-based audit design.
What matters: the memory isn't a transcript or a journal. It's a curated set of durable facts — things still true a month later. When something changes (pilot date shifts, an ADR supersedes another, my opinion on a pattern reverses), the memory gets updated. Stale memory is worse than no memory; the AI will confidently assert things that were true once and aren't anymore. Curating means being honest about what's still load-bearing.
4. Skills, hooks, slash commands, MCPs — the automation layer
Where CLAUDE.md and ADRs encode rules, this layer encodes workflows. A "deep-audit" skill that does a full codebase pass with a defined set of phases. A "pull-and-review" skill that fetches the latest commits, runs verification, and produces a structured review. A hook that runs gitleaks before every commit.
This solves the "the right way to do X involves seven steps and I'll skip one" failure mode. When a workflow is automation, you can't skip step 4 because step 4 is a line in the script. When a workflow is "remember to run gitleaks before pushing," step 4 gets skipped sometimes — and that's how secrets end up in commits and force-pushes happen.
Seven concrete practices once you've passed the basics:
- Push domain logic into skills, not the constitution. AGENTS.md is read on every session and burns tokens; skills load on demand. Anything domain-specific — workflow-engine guidance, billing math, RLS plumbing, integration patterns — belongs in a skill the agent reads only when the work calls for it. The token win is real the moment your constitution starts approaching 500+ lines. The discipline is the same; the cost is much lower.
- Use MCPs for daily-touched tools. Linear / Jira, GitHub, Slack, Confluence, your CI — if you touch it daily, an MCP server gives the agent typed, scoped access through real schemas. Better than scraping CLI output, better than the agent guessing flags from memory. The agent calls the actual API; you get to define what's exposed and what isn't.
- Write your own tools when nothing fits. Custom scripts triggered by skills, or a full MCP server when the surface justifies it. Marginal cost of a one-off script is low; marginal benefit of "this exact workflow, one command, reproducible by future-you or future-AI" is high. Don't wait for someone to publish the perfect tool — write the imperfect one that matches your workflow.
- Add guardrails before they're needed. Settings-level permission rules that prevent destructive operations without explicit approval. Hooks that fail loud on rule violations (gitleaks pre-commit is the canonical one — caught my .env leak before it ever reached origin in a different session). The cost of writing a guardrail is small; the cost of an unguarded mistake is sometimes catastrophic. Install them like seatbelts — before the crash, not after.
- Wire CI ratchets — quantitative quality gates that only move one direction. Beyond guardrails on individual mistakes, CI-enforced baselines convert "we accept X violations today" into "we never accept more than X tomorrow." Lint warning count, test coverage percentage, cross-module import count, bundle size budget, any-cast count, TypeScript strict-file count — whichever metric matters, capture today's value as a committed baseline and have CI fail if the next PR exceeds it. ECM has a dep-cruiser ratchet that locks in the current count of cross-module direct imports; any new violation fails the build, forcing it to be either fixed or explicitly justified in the same PR. This matters disproportionately in AI-augmented work: code generation is fast, so without quantitative friction, debt accumulates faster than human review catches it — and AI is genuinely good at satisfying explicit quantitative constraints if they're enforced in CI. The same pattern works for teams of one or teams of fifty; the bigger the team, the more it matters, because human review cannot reliably scale to "is this PR introducing three more any casts than necessary?" but CI absolutely can. Cost: a few hours to capture baselines, ~30 minutes to wire into the pipeline. Benefit: structural. Technical debt becomes mathematically monotone-non-increasing — it can be paid down, it cannot accidentally grow. (A minimal sketch of a ratchet check follows this list.)
- Skills auto-trigger — the description is the load-bearing field. Most people I see treat skills like slash commands ("invoke when I type /name"). They aren't. The agent scans every skill's description when analyzing a request and decides which to load natively, without anyone asking. The implication: the description is the matching function. A description like "Helps with billing" fires on every billing-adjacent message and clogs the agent's context; a description like "Use when adding a new tariff, fee type, or modifying billing math. Do NOT use for invoice generation or payment matching (see invoice-workflow skill)" fires precisely when it should. Write descriptions like you're writing the dispatch rule, because that's what they are.
- Compose skills, don't duplicate them. A skill's workflow can reference other skills — "for the migration step, follow the migrations skill; for rollout, the canary-deploy skill" — and the agent chains them when it follows the trail. Most teams treat skills as isolated and end up duplicating content (the migration procedure copy-pasted into three skills, each diverging over time). They're not isolated. Cross-reference, keep each skill focused, let the agent compose. The win compounds as the skill set grows — refactor any procedure once, every dependent skill picks it up automatically.
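Here's the shape of that ratchet check, sketched in TypeScript. Everything in it is an illustrative stand-in (the quality-baseline.json file, the any-cast grep, the metric name) rather than ECM's actual dep-cruiser pipeline; swap in whichever count you're locking down:

```typescript
// ratchet.ts: run in CI and fail the build if a tracked metric regresses past its baseline.
import { readFileSync, writeFileSync } from "node:fs";
import { execSync } from "node:child_process";

const BASELINE_FILE = "quality-baseline.json"; // committed alongside the code

// Illustrative metric: count `as any` casts under src/. Swap in dep-cruiser
// violations, lint warnings, bundle size, or whatever matters to you.
function countAnyCasts(): number {
  const out = execSync(`grep -rn "as any" src/ | wc -l`, { encoding: "utf8" });
  return parseInt(out.trim(), 10) || 0;
}

const current = countAnyCasts();
const baseline: { anyCasts: number } = JSON.parse(readFileSync(BASELINE_FILE, "utf8"));

if (current > baseline.anyCasts) {
  console.error(
    `Ratchet violated: ${current} any-casts vs baseline ${baseline.anyCasts}. Fix or justify in this PR.`,
  );
  process.exit(1);
}

if (current < baseline.anyCasts) {
  // The ratchet only moves one way: tighten the baseline and commit it with the PR.
  writeFileSync(BASELINE_FILE, JSON.stringify({ anyCasts: current }, null, 2));
  console.log(`Baseline tightened: ${baseline.anyCasts} -> ${current}.`);
}
```

Run it as a CI step after lint and tests; the only ongoing maintenance is committing the tightened baseline whenever the number drops.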
5. Verify-before-claiming — the rule that catches false confidence
The most underrated discipline. AI is fluent. AI is confident. AI is sometimes confidently wrong. The rule: before asserting a fact about the codebase, verify it against the current source. Not from memory. Not from "what's typical in NestJS projects." From grep, find, an actual file read.
This is in my memory as an explicit feedback rule because the AI used to confidently tell me there were 550 console.log calls polluting my production code. When I asked it to verify, the number was 1. The other 549 were in OLD_modules/ (reference code) and one-off scripts. The original confident assertion would have triggered a fake "cleanup" sprint addressing a non-problem.
The lesson: fluency is not accuracy. AI is fluent enough to make confidently wrong claims indistinguishable from confidently right ones at first glance. Verify-before-claiming is the rule that closes the gap, and it pays for itself the first time it catches a phantom problem.
6. Context window discipline — know when to start fresh
The most underrated operational discipline. Sessions degrade. Past roughly the 100k-token mark, the agent starts forgetting things established earlier in the conversation, gets confused about which version of the codebase it's looking at, makes suggestions that contradict things you settled an hour ago. The tell-tale signs: it asks you a question it already asked, it asserts a fact you corrected ten messages ago, its proposed diffs lose precision.
The fix is mechanical: /clear and start fresh. You keep the durable artifacts — AGENTS.md, ADRs, memory files, skills — and lose the session history that was actively hurting more than helping. The agent walks back in with the constitution and the cheat sheet, no accumulated confusion.
Operational rule: if a task is going to span more than a few hours of agent work, plan at least one /clear checkpoint. Summarize the state to memory if useful, then start clean. Better five minutes on a handoff than thirty minutes debugging why the agent suddenly forgot the rules. The agents that ship for half a day aren't running one session; they're running a chain of fresh ones, each with the persistent context but none with the noise.
What this looks like in practice — a side project in 3 nights
After an AI workshop in 2026, I decided to test how far the system could go in compressed time. The result: a sprint-orchestration and code-review dashboard, built in roughly three evenings of AI-augmented work.
What ended up in it:
- Bun + Hono, JSX server components, SQLite via Bun's native driver
- DDD-lite structure: domain/, application/, infrastructure/, interface/ — the same layering I use in ECM
- Result pattern across all use cases (same rule, copied via CLAUDE.md)
- A 3-layer JSON repair pipeline for LLM output that doesn't quite parse: skill-side validation, parser-layer repair strategies, and a re-prompt fallback. This part is production-grade and worth lifting into its own library someday.
- A self-learning triage loop: human corrects a classification → "upskill" use case writes patterns to a knowledge table → future classifications read learned patterns first.
- A "memory palace" — verified review findings accumulate as entries with confidence scores; subsequent reviews cite them via
<!-- palace:1,2 -->markers so trust can be tracked. - Dual-backend AI abstraction: same workflow runs against Claude CLI or Cursor; only the subprocess invocation differs. Both read the same skills folder.
- 13 unit + integration tests; in-memory SQLite for fakes.
What is not in it, which is the more interesting part:
- No narration comments. No "// return result" lines.
- No TODO/FIXME debt anywhere in source.
- No commented-out code blocks.
- Zero any casts in domain or application layers — all type-narrowing via guards.
This is what the multiplier looks like. The discipline encoded in CLAUDE.md (copied from ECM) and the patterns codified in the ADRs (also copied) produced a 36K-LOC codebase with the same quality profile as something written carefully over months. The work was three nights, not three months, because the rules came pre-loaded.
A side observation: most 3-night side projects have at least one "good enough" shortcut that becomes the technical debt nobody wants to clean up. This one doesn't. Not because I'm a saint — because the rules made the shortcuts impossible. CLAUDE.md says no any; the AI knew not to write it. CLAUDE.md says no narration comments; the AI knew not to add them. The discipline that took years to develop manually was, by then, a 300-line file the AI honored.
The full anatomy — the JSON-repair pipeline, the self-learning triage upskill loop, the dual-backend AI abstraction (same skills folder consumed by Claude CLI and Cursor agents), the memory palace, the Jira / GitLab / Figma integration surfaces — is in a separate writeup. For this post, the takeaway is the velocity profile, not the architecture.
What AI is actually good at — and what it isn't
The post so far has talked about AI as if it's uniformly a multiplier. It isn't. The multiplier is real for some kinds of work and absent (or negative) for others. Worth being granular before someone reads this and assumes AI is the answer to every problem.
AI is genuinely good at:
- Coding from clear specification. "Implement this use case following ADR-008's pattern, given this interface and these tests" — huge multiplier, with discipline. The clearer the spec, the bigger the win.
- Generating boilerplate and scaffolds from patterns. New repository, new use case, new value object — the templates are well-defined, and AI matches them faster than you can type.
- Writing tests for code you've already written. Especially edge cases you missed. The AI sees the function shape and can enumerate the corners.
- Documentation generation from code + intent. JSDoc, READMEs, ADRs from a conversation transcript — strong.
- First-draft writing. Blog posts, design docs, change-log entries, commit messages. Your edit pass is fast; the cold-start cost is what AI removes.
- Searching and summarizing codebases too large to read. "Show me every site that calls into the EDC adapter" or "summarize how the workflow engine handles claim-state recovery" — strong, especially with verify-before-claiming.
AI is mixed at:
- Architectural design. Collaborator, not driver. AI defaults to generic patterns ("you could use a workflow engine here…") that often miss the domain's actual constraints. Your domain knowledge has to drive; AI fills in the patterns once you've decided the shape.
- Debugging. Good at hypothesis generation ("here are five things that could cause this symptom"); mediocre at staying focused on the actual symptom once it picks a hypothesis. The "rubber duck that talks back" failure mode is real — AI will confidently chase a wrong lead unless you re-anchor it to the actual evidence.
- Code review. Catches different things than humans — typo bugs, missing null checks, inconsistent naming. Misses architectural drift, judgment calls, and "this is technically right but wrong for this codebase." Complement, not replacement.
AI is genuinely bad at:
- Maintaining architectural consistency across sessions without artifacts. The system above (CLAUDE.md, ADRs, memory) fixes this; AI alone doesn't. Without the system, every session re-derives the patterns and they drift.
- Catching its own confabulation. AI cannot reliably self-detect when it's confidently wrong. Verify-before-claiming exists because of this. A reader who trusts AI's confidence ships AI's mistakes at scale.
- Long-running multi-file refactors without supervision. Past a certain blast radius, the agent loses the thread — applies the rule to file A, forgets it by file J, contradicts itself at file Q. Human oversight required for any cross-cutting refactor; or break it into many small AI-shaped tasks.
- Anything requiring silence. AI always answers. The "sit with this and think for an afternoon" mode that experienced engineers use to find the actually-elegant solution — it can't. If the right move is "wait, don't decide today," AI will produce a confident decision anyway.
- Business context that isn't documented anywhere. "Why did we choose X" lives in your head until it's in an ADR. AI can't infer it. The first time AI suggests reverting a decision you made for a reason you never wrote down is the moment you wish you'd written it down.
The honest summary: AI is a multiplier where your output is well-specified; a coin flip where only your thinking is well-specified and the output isn't; and a liability where the work requires judgment you haven't externalized. Calibrate accordingly.
The bus factor argument (counterintuitive)
"Bus factor" in software is the number of people who'd need to be hit by a bus before the project becomes unrecoverable. Low bus factor = bad: the project depends on too few people.
The conventional view is that solo dev = bus factor of 1, which is as bad as it gets. AI augmentation doesn't change that — if anything it makes it worse, because more is being shipped per unit of person-time, so more is at risk.
I'd argue the opposite: a solo dev with the system above has a higher bus factor than a solo dev without it. Not infinitely high — bus factor 1 is still 1 in absolute terms. But the work is more recoverable. Here's why.
If I disappeared tomorrow and someone else inherited ECM:
- CLAUDE.md tells them the rules of the codebase in one read. They don't have to derive the conventions empirically.
- 38 ADRs with rejection reasons tell them why every weird-looking thing is the way it is. They don't relitigate decisions I already made.
- Memory files tell them what the past nine months looked like — pilot launch, major refactors, competitive context.
- Skills + slash commands tell them the standard workflows in executable form. They don't have to guess how I review a PR or pull and verify.
- Cheat sheet over the ADRs gives them the map; full ADRs give them the territory.
None of these existed in any prior solo project of mine. The discipline that AI augmentation forced — because without it the multiplier doesn't work — turns out to also be the discipline that turns codified knowledge into a transfer-friendly artifact. The bus factor went from "1 with no documentation" to "1 with about 200 pages of structured, current, internally-consistent documentation." Still 1. But the recovery story is dramatically different.
The interesting corollary: companies hiring senior engineers should be looking for ADR sets in their portfolios, not LinkedIn endorsements. An engineer with 38 well-maintained ADRs in a personal project demonstrates a discipline that's vanishingly rare and trivially transferable to any codebase.
The honest costs
None of this is free. The trade-offs I've actually paid:
- Onboarding is heavy. A new contributor (human or AI) has to read CLAUDE.md, skim the cheat sheet, dip into the ADRs that pertain to their first task. That's an hour or two before they can write a line. The alternative — discovering the rules empirically through code review — is faster on day one and much slower on day twenty.
- Writing ADRs takes 20 minutes per decision. ~40 ADRs × 20 min = ~13 hours of ADR-writing time over nine months. Trivial against the time those ADRs saved by preventing relitigation of past decisions, but real.
- Memory curation requires honesty. Wrong memory is worse than no memory. I've had to delete confidently-asserted memories that turned out to be false (or became outdated). The discipline isn't writing memories; it's auditing them and being willing to remove the wrong ones.
- The verify-before-claiming rule slows the AI down. AI feels less magical when it has to spend five tool calls checking the codebase before answering. The answers are more right. But the surface experience is "I asked a question and it took thirty seconds." That's the cost of accuracy.
- The system is opinionated and won't fit every team. If your team prefers light documentation, this is too heavy. If your team prefers organic conventions, this is too rigid. The point isn't that everyone should copy my CLAUDE.md — it's that you need some equivalent if you want AI to be a consistent multiplier instead of a sporadic one.
- The discipline becomes a one-way ratchet. Once CLAUDE.md, ADRs, memory, and skills exist, you can't safely turn them off. The codebase is the rules now. The first session that ignores the rule list ships code that drifts; the second session matches the drift; by session five, the system is producing the average of "with discipline" and "without," and your enforcement is structurally weaker. Real lock-in. The good news: maintenance cost is small — roughly half an hour a week to keep memory accurate, ADRs current, rules current. The bad news: it's not zero, and the consequence of skipping it is steady, asymmetric degradation. You can build the system in a month; you can't unbuild the dependency on it without paying weeks of refactor.
The selection effect — AI accelerates writing, not thinking
The hardest truth, before closing.
AI doesn't make you a better engineer. It makes a good engineer faster.
If you can't review what AI writes, AI just produces faster bad code. If you can't articulate your architecture clearly enough to write a CLAUDE.md, AI propagates whatever conventions it pattern-matches from its training data. If you can't tell when AI is confabulating, AI confabulates at scale.
A weak engineer with AI ships weak code faster. A strong engineer with AI ships strong code faster. The delta between "weak" and "strong" doesn't compress under AI augmentation; it amplifies. The discipline encoded in the system — CLAUDE.md, ADRs, memory, automation, verify-before-claiming, context-window hygiene — is what makes the second case possible. Without it, the first case is the default outcome, and the engineer who was already strong gets *much* further than the one who was just OK.
This is why the post is about discipline rather than AI. AI is the multiplier; discipline is the precondition. You don't get the multiplier without the precondition. People who lead with "AI made me 20× faster" usually skipped articulating the precondition because they treat it as ambient. It isn't ambient. It's the work that makes the multiplier exist.
A useful diagnostic before trusting your AI workflow at scale: turn off the AI for a week and write code yourself. If the quality drops noticeably, you were leaning on the AI to think for you, and the system is masking a skill gap that will show up the moment the AI is wrong. If quality stays consistent, you were using AI to write faster — which is the right shape. The first case is fragile; the second is durable.
The takeaway
The thing I most want to land:
The output you can get from AI is bounded by the discipline of the system around it, not the cleverness of the prompt. Prompt engineering is real but small. System engineering — what your codebase, your rules, your memory, your automation look like — is what determines whether AI is a multiplier or a lottery.
If you're a solo dev trying to ship something serious with AI: invest in the system, not the prompts. Write a CLAUDE.md (or equivalent) for your project before the second session. Start ADRs from the first non-trivial decision. Curate memory after every conversation that taught you something durable. Automate workflows that repeat.
The system paid for itself in weeks for me. Whether it does for you depends on the scale and lifetime of what you're building. The work it produces is what survives — if you've built the system around it, the work tends to outlast the session that produced it.
And — the part I find most worth saying out loud — it's how you ship like a team of one without the project rotting underneath you. The boring patterns are the ones that compound. AI doesn't change that; it amplifies whichever direction you're already going.