What we know is crooked.
A frank inventory of the rough edges. Some are intentional shortcuts we are happy with. Some are mistakes we plan to fix. Some are tracked in the issue tracker; some are just stuck in the team's collective memory. All of them are listed here so a new contributor doesn't burn a day rediscovering them.
Schema-as-code, no migrations
What. Schema is created via EnsureCreatedAsync() at startup; new columns are inline ALTER TABLE … ADD COLUMN IF NOT EXISTS in Program.cs. No version table, drops/renames not handled. See ADR-001·Schema as code → for full rationale.
Plan. Switch to EF Core migrations or Flyway when a real "drop column" is needed. Until then, intentional.
Single API replica
What. Production runs one API Pod. Two would be incorrect: the scheduler has no distributed lock, the channel registry and approval gate are in-memory. Vertical scaling is the only knob today. See ADR-003·Single replica → for full rationale.
Plan. Postgres advisory locks for scheduling + Redis pub/sub channel registry + persistent approval state. Defined, not started.
Headless approvals are auto-approved
What. Scheduled jobs install HeadlessApprovalManager which auto-approves all destructive ops — without this, scheduled writes never execute. A poorly-written prompt could delete cloud resources unattended; tenant admins can mitigate via TenantToolAccessEntity. See ADR-004·Headless auto-approve → for full rationale.
Plan. "Request approval and pause" mode for headless paths — designed, not built.
Stale CloudMind directories
What. backend/src/CloudMind.* directories survive from a partial product rename. They contain only empty obj/ folders and aren't referenced by any project.
Why they're still there. Removing them cleanly takes a moment of nerve and we keep deferring it.
Plan. Delete in the next housekeeping PR.
"Skill" overloaded twice
What. The Agent project's instruction modules used to live under Skills/; the new invocable-skills feature claims the same word. Branch feat/60191-skills-slash-commands renamed the folder to InstructionModules/ but the .csproj still embeds both paths during the migration.
Plan. Drop the legacy Skills/Builtin/**/* embed once all callers reference InstructionModules/. Test thoroughly — embedded resources fail silently.
No frontend test suite
What. The frontend has no Vitest / Jest tests. Type-check via npm run build is the contract.
Why. Move-fast trade-off taken early; the app is mostly stateless render of API responses.
The catch. Refactors of the chat reducer (useAgent) require manual exploratory testing.
Plan. Add Vitest for useAgent message-stream tests at minimum. Other components can wait.
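A message-stream test of the kind the plan describes could start from something like the sketch below. The real state and event shapes of useAgent are not shown in this document, so ChatMessage, StreamEvent and reduceStream are stand-ins invented for illustration. The point is only that a pure reducer can be exercised with a scripted event sequence and no browser.

```typescript
// Stand-in types: the real useAgent state and event shapes are not shown here.
interface ChatMessage {
  id: string;
  text: string;
  done: boolean;
}

type StreamEvent =
  | { type: "start"; id: string }
  | { type: "delta"; id: string; chunk: string }
  | { type: "end"; id: string };

// Pure reducer: applies one stream event to the message list.
function reduceStream(messages: ChatMessage[], event: StreamEvent): ChatMessage[] {
  switch (event.type) {
    case "start":
      return [...messages, { id: event.id, text: "", done: false }];
    case "delta": {
      const { id, chunk } = event;
      // Chunks for a finished message are dropped rather than appended.
      return messages.map((m) => (m.id === id && !m.done ? { ...m, text: m.text + chunk } : m));
    }
    case "end": {
      const { id } = event;
      return messages.map((m) => (m.id === id ? { ...m, done: true } : m));
    }
  }
}

// Scripted stream, including a late chunk that arrives after "end".
const events: StreamEvent[] = [
  { type: "start", id: "m1" },
  { type: "delta", id: "m1", chunk: "Hel" },
  { type: "delta", id: "m1", chunk: "lo" },
  { type: "end", id: "m1" },
  { type: "delta", id: "m1", chunk: "!" },
];

const finalState = events.reduce(reduceStream, [] as ChatMessage[]);
console.log(finalState); // one message: text "Hello", done true
```

In a Vitest file the same scripted sequence would sit inside a test body with the final state checked by an assertion; the late-chunk case is the sort of edge that manual exploratory testing tends to miss.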
Polymorphic schedule_type
What. scheduled_jobs.schedule_type accepts both "recurring" and "cron". The REST endpoint normalises cron → recurring on save, but legacy rows can still be either. The runner handles both.
Why we left it. A migration that touches every existing job is risky. The runner accepting both is harmless.
Plan. A one-shot script during a maintenance window to canonicalise to "recurring"; then drop the synonym handling.
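The canonicalisation step itself is small. The sketch below shows the mapping the one-shot script would apply, written as a pure function over in-memory rows. The row shape and function names are hypothetical, and the real script would run against the database rather than an array.

```typescript
// Hypothetical row shape; only schedule_type is taken from the text above.
interface ScheduledJobRow {
  id: number;
  schedule_type: string;
}

// Maps both accepted spellings to the canonical one and rejects anything else,
// so unexpected values are surfaced instead of silently rewritten.
function canonicaliseScheduleType(value: string): "recurring" {
  if (value === "recurring" || value === "cron") return "recurring";
  throw new Error(`Unknown schedule_type: ${value}`);
}

// Returns the rewritten rows plus a count of rows that actually changed.
function canonicaliseRows(rows: ScheduledJobRow[]): { rows: ScheduledJobRow[]; changed: number } {
  let changed = 0;
  const out = rows.map((row) => {
    const canonical = canonicaliseScheduleType(row.schedule_type);
    if (canonical !== row.schedule_type) changed += 1;
    return { ...row, schedule_type: canonical };
  });
  return { rows: out, changed };
}

const legacy: ScheduledJobRow[] = [
  { id: 1, schedule_type: "cron" },
  { id: 2, schedule_type: "recurring" },
];
const result = canonicaliseRows(legacy);
console.log(result.changed); // 1
```

Counting changed rows gives the maintenance-window operator a number to compare against a pre-flight count of "cron" rows before the synonym handling is dropped.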
EF Core 10 ghost-entry workaround
What. After a DbUpdateConcurrencyException, EF's StateManager can leave a ghost entry that breaks the next Add(). Several repositories explicitly call _db.ChangeTracker.Clear() before the retry.
Why. Upstream EF Core bug — tracked, not fixed in 10.0.3.
The catch. The workaround is repeated per repository; a new repository that hits the same code path without the clear will reproduce the bug.
Plan. Wait for an EF fix; remove the explicit clears once available. Search "ChangeTracker.Clear()" to find them all.
Repository return type ergonomics
What. Dictionary<string, object?> everywhere — no compile-time key safety, no IDE autocomplete; typos surface as runtime 400s. Type safety is delegated to TypeScript on the frontend and integration tests on the backend. See ADR-002·Dictionary repos → for full rationale.
Plan. No imminent change; the pattern's flexibility has paid for itself.
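On the frontend side, the delegation to TypeScript only helps once an untyped row has been narrowed into a typed shape. A minimal sketch of that narrowing follows, assuming a row with hypothetical keys id and name; none of these names come from the codebase.

```typescript
// Untyped row, as a dictionary-style repository response would arrive as JSON.
type Row = Record<string, unknown>;

// Hypothetical typed view of the same data.
interface JobSummary {
  id: number;
  name: string;
}

// Reads a required string key and fails loudly if it is missing or mistyped.
function readString(row: Row, key: string): string {
  const value = row[key];
  if (typeof value !== "string") throw new Error(`Expected string at "${key}"`);
  return value;
}

// Reads a required numeric key with the same fail-fast behaviour.
function readNumber(row: Row, key: string): number {
  const value = row[key];
  if (typeof value !== "number") throw new Error(`Expected number at "${key}"`);
  return value;
}

// Single place where key names are spelled out; callers get compile-time safety after this.
function toJobSummary(row: Row): JobSummary {
  return { id: readNumber(row, "id"), name: readString(row, "name") };
}

const summary = toJobSummary({ id: 7, name: "nightly-report" });
console.log(summary.name); // nightly-report
```

A typo in a key on either side then fails at the boundary with a named key in the error, rather than surfacing later as an undefined value deep in a component.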
Inline bytea attachments for skills
What. Skill attachments live in skill_attachments.content as inline bytea.
Why. Expected payloads are small; transactional integrity is free; cascade delete is one foreign key.
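The catch. If we ever attach a 50 MB PDF to a skill, any query that reads the content column (SELECT * FROM skill_attachments, for example) drags the whole payload over the database connection and into API memory.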
Plan. Move to MinIO/S3 if we observe size growth in production.
Non-deterministic specialist fan-out cost
What. The planner-driven fan-out can spawn 4–6 parallel specialist runs in one user turn. Cost on a heavy turn is unpredictable.
The catch. A single turn can blow a user's daily budget.
Mitigation. Per-tenant MaxToolTurns, the credit cap fallback, the auto-router — all dampen the worst case.
Plan. Add per-turn cost cap with explicit user prompt to escalate. Designed, not built.
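One possible shape for the per-turn cap is sketched below. The design itself is not described in this document, so every name, the credit unit, and the greedy in-order policy are assumptions made for illustration: runs are approved in planner order while they fit under the cap, and the rest are set aside for an explicit user escalation.

```typescript
// All names and numbers here are hypothetical.
interface SpecialistRun {
  name: string;
  estimatedCost: number; // in credits
}

interface FanOutDecision {
  approved: SpecialistRun[];
  deferred: SpecialistRun[]; // would require the user to escalate explicitly
}

// Greedy, in planner order: a run is approved only if it still fits under the cap.
function applyTurnCostCap(runs: SpecialistRun[], capCredits: number): FanOutDecision {
  const approved: SpecialistRun[] = [];
  const deferred: SpecialistRun[] = [];
  let spent = 0;
  for (const run of runs) {
    if (spent + run.estimatedCost <= capCredits) {
      approved.push(run);
      spent += run.estimatedCost;
    } else {
      deferred.push(run);
    }
  }
  return { approved, deferred };
}

const decision = applyTurnCostCap(
  [
    { name: "network", estimatedCost: 4 },
    { name: "storage", estimatedCost: 5 },
    { name: "billing", estimatedCost: 3 },
  ],
  8,
);
console.log(decision.approved.map((r) => r.name)); // [ 'network', 'billing' ]
console.log(decision.deferred.map((r) => r.name)); // [ 'storage' ]
```

Note that this policy lets a cheaper later run through after an earlier one has been deferred. Whether that is desirable depends on how the planner orders specialists, which a real design would have to settle.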
Test taxonomy
What. Tests live in tests/AgenticIT.Tests/ under Unit/, Integration/, Parity/. Default dotnet test excludes Integration/* via default.runsettings.
The catch. Easy to forget to run integration tests locally; CI catches them but the dev loop doesn't.
Plan. Add a pre-push git hook that runs the integration suite when matching paths change.
Decisions also tracked as ADRs
These items are intentional choices (not mistakes), each with a full decision record in §11:
- Hash router. The frontend uses a hash-based router rather than a framework like React Router — keeps the SPA deployable as static assets without server-side routing. See ADR-005·Hash router →.
- Stop-button 5-second SLA. StopWatchdog guarantees termination within 5 seconds even mid-stream. See ADR-006·Stop watchdog SLA →.
- JSONB envelope versioning. ChatMessageEntity.content stores a versioned JSON wrapper so new fields land without breaking older clients. See ADR-007·Envelope versioning →.
- MCP service as a separate process. The MCP gateway runs in its own container rather than in-process with the API: process isolation, independent restarts, shared nothing. See ADR-008·MCP process isolation →.
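The envelope-versioning item in the list above can be illustrated from the client side. The actual envelope fields are defined in ADR-007 and are not reproduced here, so version, text and extras below are hypothetical. The sketch shows only the general idea: read the fields this client knows, and carry unknown ones along instead of rejecting them.

```typescript
// Hypothetical envelope shape, not the real ChatMessageEntity.content schema.
interface MessageEnvelope {
  version: number;
  text: string;
  extras: Record<string, unknown>; // fields this client does not know about
}

// Parses a stored envelope, tolerating fields added by newer writers.
function parseEnvelope(raw: string): MessageEnvelope {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("Envelope must be a JSON object");
  }
  const { version, text, ...extras } = parsed as Record<string, unknown>;
  if (typeof version !== "number") {
    throw new Error("Envelope is missing a numeric version");
  }
  return {
    version,
    text: typeof text === "string" ? text : "",
    extras,
  };
}

const older = parseEnvelope('{"version":1,"text":"hello"}');
const newer = parseEnvelope('{"version":2,"text":"hello","citations":["a"]}');
console.log(older.text === newer.text); // true
console.log(Object.keys(newer.extras)); // [ 'citations' ]
```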
Notes also worth writing down
- OAuth provider metadata refresh is per-process. A long-running container caches discovery documents, so rotating an OAuth provider mid-day requires a restart.
- Langfuse trace sampling is 100 % today. We have not needed sampling — but at higher tenant counts we will.
- The vbox-organizations cache has no eviction policy beyond per-row updates. A tenant whose vBox roster changes daily will accumulate stale entries.
- Frontend localStorage token survives until the user logs out or the JWT expires. There is no client-side rotation.
- The messages URL of OpenRouter must be relative, without a leading slash. A leading / sends the request to https://openrouter.ai/messages instead of https://openrouter.ai/api/v1/messages. We have lost an afternoon to this once.
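The OpenRouter note above comes down to standard relative-URL resolution, which can be reproduced in a few lines. The sketch uses the WHATWG URL class available in browsers and Node. This document does not say which HTTP client the backend uses, so this demonstrates the rule rather than the actual call site, and resolveEndpoint is a hypothetical helper.

```typescript
// Hypothetical helper: resolve an endpoint path against an API base URL.
function resolveEndpoint(base: string, path: string): string {
  return new URL(path, base).toString();
}

// Base URL taken from the note above. The trailing slash is significant:
// without it, the last path segment ("v1") is replaced during resolution.
const base = "https://openrouter.ai/api/v1/";

const relative = resolveEndpoint(base, "messages");
const rooted = resolveEndpoint(base, "/messages");
const noTrailingSlash = resolveEndpoint("https://openrouter.ai/api/v1", "messages");

console.log(relative);        // https://openrouter.ai/api/v1/messages
console.log(rooted);          // https://openrouter.ai/messages
console.log(noTrailingSlash); // https://openrouter.ai/api/messages
```

A path that starts with a slash is resolved against the host root, which discards /api/v1 entirely. The third case is a related trap worth checking whenever the base URL is configurable.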