Three nights of AI: anatomy of a side project

TL;DR

After an AI workshop, I spent three evenings building a sprint-orchestration and code-review dashboard around a real team's workflow — Jira pulls tickets, GitLab feeds MRs, Figma supplies design snapshots, and an agentic review pipeline grounded in a worktree produces structured findings with a self-learning triage loop. DDD-lite layering, branded value objects throughout, a JSON repair pipeline worth lifting into other projects, and a dual-backend abstraction so the same skill files drive Claude CLI or Cursor agents. This is the full anatomy — what got built, why each piece is shaped the way it is, and the honest 3-night caveats. Companion to "AI as multiplier, discipline as durability", which is the philosophy; this post is the anatomy.

The setup: post-workshop curiosity, low stakes, no plan

In late spring 2026, an AI workshop ran through what agentic-coding tooling could actually do. The demos were impressive but artificial — toy problems, scaffolded inputs. I wanted to know what would happen if I pointed the same tools at a genuinely messy problem with no scaffolding: real customers' tickets, real merge requests, real designs, no curated dataset.

The problem I picked: a real team's sprint-review workflow. The flow is awkward — open Jira to read the ticket, switch to GitLab to read the MR, switch to Figma to check the design, hold all three in your head, write a review comment back into GitLab. Repeated per MR, per sprint, by every reviewer. The bottleneck isn't reading code; it's holding context across three systems. If an agent could pre-assemble the context and pre-write the structured review, the human reviewer's job collapses to "is this analysis correct?" instead of "let me reconstruct what's going on here."

I gave myself three evenings. No production target, no users to ship to, just the question: can the system I built around ECM produce something genuinely useful in compressed time?

What got built

Bun + Hono on the server, JSX server components for the UI, SQLite via Bun's native driver, plain HTML over the wire with SSE for streaming long operations. No frontend framework — the surface is small enough that vanilla HTML + modals beats React in setup cost for a three-night project.

The major surfaces:

  1. A sprint board fed by a local Jira cache, with incoming tickets auto-triaged to one of three teams.
  2. An MR review view fed by GitLab, where an agentic pipeline grounded in a git worktree produces structured findings.
  3. Figma design snapshots pulled into the review context, so implementation can be checked against design intent.
  4. A self-learning triage loop that persists human corrections as reusable patterns.
  5. A memory palace of verified findings that later reviews cite, with trust tracked per entry.

DDD-lite layering throughout, zero any casts in domain or application code, no narration comments, no TODO debt anywhere in source. Three nights.

Architecture overview — DDD-lite with branded VOs

The same layering I use in ECM, scaled down to a single-process app:

src/
├── domain/                    Zod entities, repository interfaces, branded VOs
├── application/               Use cases (review-mr, triage-task) + sync services
├── infrastructure/            Jira/GitLab/Figma clients, SQLite repos, Claude (API+CLI), git worktree
├── interface/web/             Hono routes + JSX components
└── config/                    Env schema, board definitions

No NestJS. No DI container. Hono for the HTTP layer, plain class instantiation for everything else. The layering exists because it's the right shape for the work, not because a framework imposed it.

Why branded value objects by default

This deserves a section of its own because it's the discipline choice most engineers skip even when they "do DDD." Standard value objects wrap primitives in classes; branded value objects go one step further — they use nominal typing through TypeScript's phantom-property trick so two strings that look identical at runtime are different types at compile time.

type JiraTicketKey = string & { readonly __brand: 'JiraTicketKey' };
type MrIid         = string & { readonly __brand: 'MrIid' };
type ReviewId      = string & { readonly __brand: 'ReviewId' };

function getReview(id: ReviewId): Promise<Result<Review, string>> { ... }

// At a call site:
getReview(jiraKey);   // ❌ compile error — JiraTicketKey is not ReviewId
getReview(reviewId);  // ✓

The value object's .create() method returns a Result<BrandedType, string>; the brand is the marker that "this string passed validation." After parsing an HTTP request, you don't have a string; you have either a Result.success(ReviewId) or a Result.failure. The boundary handles the question explicitly; the rest of the code never sees raw strings claiming to be IDs.

Five reasons this is the default in this project (and in ECM):

  1. Catches "wrong ID type passed to wrong function" at compile time. The most common bug in systems with many ID types — passing a UserId where a CommunityId was expected — becomes a type error before the code runs. In a dashboard that juggles JiraTicketKey, MrIid, FigmaNodeId, ReviewId, UserId and a handful more, this matters a lot.
  2. Self-documenting function signatures. fetchMrDiff(iid: MrIid) tells you exactly what kind of identifier this function expects. fetchMrDiff(iid: string) tells you nothing.
  3. Zero runtime cost. Brands are purely type-system; they compile to plain strings. No performance penalty, no serialization changes, no impact on JSON.parse or DB columns. The win is entirely at compile time.
  4. Survives serialization boundaries. Unlike class-based VOs, branded strings round-trip through JSON, SQLite, and HTTP without transformation. You re-validate at the boundary via .create(), which produces a Result — but the data itself stays a string the whole time.
  5. Particularly valuable in AI-augmented codebases. This one is the unsung reason. AI is fluent and confident; it will happily grab "any string named id" and use it wherever a string is expected. Branded VOs force the AI to convert explicitly via .create(), which means validation happens AND the AI literally cannot accidentally substitute one ID type for another. It's compile-time discipline that the AI can't bypass — exactly the kind of structural guarantee the AI-multiplier post says you need.

Cost: each domain ID type needs a brand declaration and a .create() validator. Maybe 15 lines per type. For a project with a dozen ID types, that's 180 lines of one-time scaffolding. Cheap, and the AI writes it for you the moment you describe the pattern.
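For concreteness, here is roughly what one of those 15-line declarations looks like: a minimal sketch, assuming a hand-rolled Result type and a UUID-shaped ReviewId. The project's actual validators may differ.

// Minimal Result type (an assumption; the project's real Result may differ).
type Result<T, E> =
  | { ok: true; value: T }
  | { ok: false; error: E };

type ReviewId = string & { readonly __brand: 'ReviewId' };

const ReviewId = {
  // The only way to obtain a ReviewId: validate, then brand.
  create(raw: string): Result<ReviewId, string> {
    const uuid = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
    return uuid.test(raw)
      ? { ok: true, value: raw as ReviewId }
      : { ok: false, error: `invalid ReviewId: ${raw}` };
  },
};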

The integration surfaces

Three external systems, three different integration shapes:

Jira — REST API, paged ticket fetch on startup, then poll for deltas. Tickets land in a local SQLite cache; the triage agent classifies; the sprint board reads from the cache. Decoupled enough that Jira being down doesn't take the dashboard down.

GitLab — REST + webhook-friendly. The dashboard fetches MRs, diffs, and discussion threads; on review run, it also spawns a git worktree of the MR's branch so the agent can grep the actual code rather than reason about a diff in isolation. Findings can be posted back as a comment via the GitLab API.

Figma — REST API for design files; for the review path, the dashboard fetches a PNG snapshot of the linked Figma node, encodes it base64, and includes it in the review context so the agent can compare implementation against design intent. Tacked-on feels accurate here; it works, but it's less polished than the other two surfaces.
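The snapshot path, sketched below, uses Figma's public images endpoint; the function name and the absence of error handling are simplifications of mine, not the project's code.

// Fetch a PNG render of one Figma node, base64-encoded for the review context.
async function fetchNodeSnapshot(fileKey: string, nodeId: string): Promise<string> {
  const headers = { 'X-Figma-Token': process.env.FIGMA_TOKEN! };
  // The images endpoint returns a short-lived render URL per requested node.
  const res = await fetch(
    `https://api.figma.com/v1/images/${fileKey}?ids=${encodeURIComponent(nodeId)}&format=png`,
    { headers },
  );
  const { images } = (await res.json()) as { images: Record<string, string> };
  const png = await fetch(images[nodeId]);
  return Buffer.from(await png.arrayBuffer()).toString('base64');
}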

The agentic review workflow — seven steps, one skill

The full review path runs as a single skill (mr-review) that the agent invokes with the worktree directory and the assembled context. The skill defines seven steps with explicit responsibilities:

  1. Orient. Read the Jira ticket. Read the MR title + description. Form a one-paragraph "what is this trying to do" hypothesis.
  2. Palace-read. Search the memory palace for prior findings that might apply. Cite them with <!-- palace:N --> markers so trust can be tracked back to source.
  3. Intent-check. Compare the MR's actual changes against the orient-step hypothesis. Does the implementation match the intent? Note discrepancies.
  4. Verify-against-worktree. For each non-trivial claim about the code, grep the worktree. No claims allowed without a citation to a file:line.
  5. Findings. Produce a structured list — severity, location, claim, suggested fix.
  6. Fixes. For high-severity findings with obvious fixes, propose the code change.
  7. JSON. Serialize the whole thing as a JSON payload the dashboard can parse.

The skill is opinionated about order. Orient first, palace-read second — because palace knowledge filters how you read the MR. Verify-against-worktree fourth — because no claim should leave the agent's mouth unverified. JSON last — because formatting concerns shouldn't pollute the reasoning steps.
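Schematically, the step-7 contract looks something like this (field names are illustrative, not the project's actual schema):

import { z } from 'zod';

// Illustrative shape of the step-7 payload; all names here are assumptions.
const Finding = z.object({
  severity: z.enum(['high', 'medium', 'low']),
  location: z.string(),          // file:line citation from the worktree
  claim: z.string(),
  suggestedFix: z.string().optional(),
  palaceRefs: z.array(z.number()).default([]), // entries cited from the palace
});

const ReviewPayload = z.object({
  hypothesis: z.string(),        // the orient-step "what is this trying to do"
  intentMatch: z.boolean(),
  findings: z.array(Finding),
});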

The pipeline streams tool_use and tool_result events back to the dashboard via SSE so the human can watch the agent work in real time, which turns out to matter for trust — seeing the agent grep for a specific function name before claiming it doesn't exist is much more convincing than a clean final report.
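With Hono, that streaming endpoint is only a few lines. A sketch, where reviewEvents is an assumed async iterator over the agent's events rather than a real function in the project:

import { Hono } from 'hono';
import { streamSSE } from 'hono/streaming';

// Assumed: an async iterator over tool_use / tool_result events for one run.
declare function reviewEvents(reviewId: string): AsyncIterable<{ type: string }>;

const app = new Hono();

app.get('/reviews/:id/events', (c) =>
  streamSSE(c, async (stream) => {
    for await (const event of reviewEvents(c.req.param('id'))) {
      // One SSE frame per agent event; the browser renders them as they arrive.
      await stream.writeSSE({ event: event.type, data: JSON.stringify(event) });
    }
  }),
);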

The JSON repair pipeline — the part worth lifting

LLM JSON output is almost-right most of the time. Stray trailing commas, an unescaped quote inside a string, a premature closing brace, control characters that shouldn't be there. Naïve JSON.parse fails on any of these. A naïve retry just re-rolls the same dice.

The pipeline has three layers, in increasing cost:

Layer 1 — skill-side validation. The review skill writes the JSON to a temp file, then runs node -e "JSON.parse(...)" as the very last step. If parse fails, the skill itself sees the error and is asked to re-emit. Cheap; runs inside the same agent invocation; catches ~70% of malformed output.

Layer 2 — parser-side repair. If the JSON arrives malformed despite skill-side validation, a parser library tries ten specific repair strategies in sequence: unescaped quotes, premature braces, truncated arrays, control character stripping, smart-quote normalization, and a few more. Each strategy is targeted at a specific failure mode I've seen in actual output. Order matters: cheaper checks first. Catches ~25% more.

Layer 3 — re-prompt fallback. If both layers fail, a lightweight re-prompt asks the model to "fix this JSON structure only" with the broken output as input. New tokens, but only for the repair; the original reasoning isn't redone. Catches the remaining ~5%.
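The layer-2 chain composes as an ordered list of targeted string repairs, tried cheapest-first with a parse attempt after each. A sketch showing two of the ten strategies:

type Repair = (raw: string) => string;

// Ordered cheapest-first; each targets one observed failure mode.
const repairs: Repair[] = [
  (s) => s.replace(/[\u0000-\u0008\u000b\u000c\u000e-\u001f]/g, ''), // strip control chars
  (s) => s.replace(/,\s*([}\]])/g, '$1'),                            // drop trailing commas
  // ...eight more targeted strategies
];

function parseWithRepair(raw: string): unknown | null {
  let candidate = raw;
  // First pass is the identity repair, i.e. try the raw output as-is.
  for (const repair of [(s: string) => s, ...repairs]) {
    candidate = repair(candidate);
    try {
      return JSON.parse(candidate);
    } catch {
      // fall through to the next, more aggressive strategy
    }
  }
  return null; // caller escalates to the layer-3 re-prompt
}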

The end-to-end effect: malformed-JSON-from-LLM stops being an exception in the user's flow and becomes a quiet repair step the user never sees. The pattern composes well outside this project — any LLM-output-JSON-consumer would benefit from the same three layers. It's the most reusable thing in the codebase.

Self-learning triage — corrections that compound

Incoming Jira tickets get auto-classified to one of three teams. Sometimes wrong. When the human re-classifies, two things happen:

  1. The correction is logged with the original ticket text.
  2. A separate "upskill" skill runs against the correction: given a ticket that was misclassified and the correct team, what generalizable pattern do you extract? The output is a short pattern written to a triage_knowledge table.

Future triage runs read learned patterns first. "Tickets mentioning data-schema evolution typically go to Framework, not Product." "Tickets with a Figma link almost always belong to Product." Patterns accumulate; classification accuracy improves.

This isn't reinforcement learning. It's not fine-tuning. It's the simplest possible learning loop: persist what humans correct, read those persisted corrections on future runs. RL papers describe orders of magnitude more elaborate systems; this two-step loop captures most of the value at a fraction of the complexity. For a dashboard that processes maybe 50 tickets a week, it's exactly the right scale of "learning."
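In code, the whole loop is small. A sketch against bun:sqlite, with assumed table columns and the prompt assembly left out:

import { Database } from 'bun:sqlite';

const db = new Database('dashboard.sqlite');

// Steps 1 + 2: persist what the human corrected, plus the extracted pattern.
function recordCorrection(ticketText: string, correctTeam: string, pattern: string) {
  db.query(
    'INSERT INTO triage_knowledge (ticket_text, team, pattern) VALUES (?, ?, ?)',
  ).run(ticketText, correctTeam, pattern);
}

// Future triage runs prepend the learned patterns to the classification prompt.
function learnedPatterns(): string[] {
  return db
    .query('SELECT pattern FROM triage_knowledge ORDER BY id DESC LIMIT 50')
    .all()
    .map((row) => (row as { pattern: string }).pattern);
}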

Memory palace — RAG with citations and trust

Verified findings from past reviews accumulate as entries in a memory_palace table. Each entry has a confidence score (initialized from the reviewer's trust signal), keywords for retrieval, and a back-link to the originating review.

Subsequent reviews search the palace before forming their own findings. If a relevant entry is found, the new review's findings cite it via the <!-- palace:N --> marker. The dashboard renders the citation as a click-through to the originating review, so a reader can verify the chain of reasoning.

Trust is tracked: when a finding citing a palace entry gets accepted by a human, the palace entry's confidence ticks up. When it gets rejected, confidence ticks down. Below a threshold, the entry stops being retrieved. The system can forget things it's no longer sure of.
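The trust mechanics reduce to one update and one retrieval filter. A sketch with assumed column names, assumed tick sizes, and an assumed 0.2 retrieval floor:

import { Database } from 'bun:sqlite';

const db = new Database('dashboard.sqlite');

// A human accepted or rejected a finding that cited palace entry `id`.
function adjustConfidence(id: number, accepted: boolean) {
  db.query('UPDATE memory_palace SET confidence = confidence + ? WHERE id = ?')
    .run(accepted ? 0.1 : -0.2, id);
}

// Entries below the floor simply stop being retrieved; the system forgets.
function retrievePalace(keyword: string) {
  return db
    .query(
      `SELECT * FROM memory_palace
       WHERE keywords LIKE ? AND confidence >= 0.2
       ORDER BY confidence DESC`,
    )
    .all(`%${keyword}%`);
}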

The citation markers are the part I'd refine if I picked this up again. Today they're plain HTML comments parsed by a regex on the dashboard side; one missed comment and the citation chain breaks silently. A typed structure embedded in the JSON output would be safer.

The dual-backend abstraction

The interesting design constraint: the same agentic workflow needs to run against both Claude CLI and Cursor agents. Different invocation shapes, different tool-use formats, different lifecycle hooks — but the same skills folder, the same prompts, the same expected output JSON.

The solution is a thin provider abstraction: a single switch (ai.provider === 'cursor' ? cursor : claude) routes to either backend. Both read from the same .claude/skills/ directory. Both produce the same JSON contract. The dashboard's review use case is provider-agnostic.

This is small (~150 lines) and earns its keep the moment someone asks "could we try this with Cursor?" Two days of refactoring saved by the one day spent on the abstraction upfront. The general lesson: agent-side workflow code should be provider-neutral; only the subprocess invocation should know which agent it is.
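Schematically (interface shape and names are assumptions, not the project's actual code):

// One neutral contract; only createProvider knows which agent it is.
interface AgentProvider {
  // Runs one skill against a worktree and yields tool_use / tool_result
  // events, ending with the skill's JSON payload.
  runSkill(opts: {
    skill: string;        // e.g. 'mr-review', resolved from .claude/skills/
    worktreeDir: string;
    context: string;
  }): AsyncIterable<AgentEvent>;
}

type AgentEvent =
  | { type: 'tool_use'; name: string; input: unknown }
  | { type: 'tool_result'; output: string }
  | { type: 'result'; json: string };

function createProvider(config: { ai: { provider: 'cursor' | 'claude' } }): AgentProvider {
  return config.ai.provider === 'cursor' ? cursorProvider : claudeProvider;
}

// cursorProvider / claudeProvider wrap the respective CLI subprocess
// invocations; they are assumed here and not shown.
declare const cursorProvider: AgentProvider;
declare const claudeProvider: AgentProvider;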

What's rough — the honest 3-night caveats

Three nights is enough to ship. It's not enough to polish. The rough edges I'd name openly:

  1. The auth model is localhost-only; anything shared beyond one machine needs a real replacement.
  2. The palace citation markers are plain HTML comments parsed by a regex on the dashboard side; one missed comment breaks the citation chain silently.
  3. The Figma integration is the least polished surface: a per-node PNG snapshot with no component-aware handling.

None of these are show-stoppers; they're all "you wouldn't run this for real customers without addressing them." But the absence of these polished layers is exactly what made 36K LOC in 3 nights possible. The right way to read this codebase is "the experiment shipped" — not "the product shipped." Two very different things.

What I'd lift to ECM

The patterns from this experiment that are worth promoting to production-grade systems:

  1. The JSON repair pipeline. Three layers of defense against malformed LLM JSON. Worth its own npm package; certainly worth lifting into any pipeline that consumes LLM JSON.
  2. The agentic skill structure — orient, palace-read, verify-against-worktree, findings, JSON. The order is the discipline. ECM has a few places (registration review, EDC scrape diagnosis) that would benefit from the same scaffolding.
  3. The dual-backend abstraction. Even if you only use one provider today, isolating the provider boundary costs little and earns optionality.
  4. The self-learning triage loop. Generic shape: human correction → upskill skill → persisted pattern → read on subsequent runs. Applicable any time an AI does classification under human oversight.
  5. Branded VOs as default. Not unique to this project, but worth re-emphasizing. Compile-time discipline AI can't bypass.

The patterns I'd not lift verbatim: the dashboard's HTML-comment citation system (needs typed structure), the Figma integration (needs proper component-aware handling), the localhost-only auth model (needs replacement, obviously). These were three-night decisions; they don't deserve to be three-year decisions.

Closing

The point of this experiment wasn't to ship a production tool. It was to test the claim made in the AI-multiplier post: that the system around AI determines whether the multiplier is real. Three nights, 36K lines of TypeScript with the same quality discipline as ECM, several patterns worth lifting back into ECM — the claim holds up.

The deeper test is that I wrote this post six months after the dashboard shipped. The code didn't rot in those six months because the system that produced it is the same system that maintains ECM. CLAUDE.md, ADRs (smaller set than ECM's), memory files, the discipline — all pre-loaded from the parent project. Three nights of work, but six months of intact patterns afterward.

That's what the multiplier looks like when it's real. Not "I shipped fast" — many people ship fast. But "I shipped fast and the result didn't immediately decay." The decay-resistance is what separates an AI-augmented codebase from a vibe-coded one. Both can ship in three nights. Only one is recognizable six months later.