§ 09 — Tech debt

What we know is crooked.

A frank inventory of the rough edges. Some are intentional shortcuts we are happy with. Some are mistakes we plan to fix. Some are tracked in the issue tracker; some are just stuck in the team's collective memory. All of them are listed here so a new contributor doesn't burn a day rediscovering them.

Schema-as-code, no migrations

What. Schema is created via EnsureCreatedAsync() at startup; new columns are inline ALTER TABLE … ADD COLUMN IF NOT EXISTS in Program.cs. No version table, drops/renames not handled. See ADR-001·Schema as code → for full rationale.

Plan. Switch to EF Core migrations or Flyway when a real "drop column" is needed. Until then, intentional.

Single API replica

What. Production runs one API Pod. Two would be incorrect: the scheduler has no distributed lock, the channel registry and approval gate are in-memory. Vertical scaling is the only knob today. See ADR-003·Single replica → for full rationale.

Plan. Postgres advisory locks for scheduling + Redis pub/sub channel registry + persistent approval state. Defined, not started.

Headless approvals are auto-approved

What. Scheduled jobs install HeadlessApprovalManager, which auto-approves all destructive ops; without it, scheduled writes never execute. A poorly written prompt could delete cloud resources unattended; tenant admins can mitigate via TenantToolAccessEntity. See ADR-004·Headless auto-approve → for full rationale.

Plan. "Request approval and pause" mode for headless paths — designed, not built.

Stale CloudMind directories

What. backend/src/CloudMind.* directories survive from a partial product rename. They contain only empty obj/ folders and aren't referenced by any project.

Why they're still there. Removing them cleanly takes a moment of nerve and we keep deferring it.

Plan. Delete in the next housekeeping PR.

"Skill" overloaded twice

What. The Agent project's instruction modules used to live under Skills/; the new invocable-skills feature claims the same word. Branch feat/60191-skills-slash-commands renamed the folder to InstructionModules/ but the .csproj still embeds both paths during the migration.

Plan. Drop the legacy Skills/Builtin/**/* embed once all callers reference InstructionModules/. Test thoroughly — embedded resources fail silently.

No frontend test suite

What. The frontend has no Vitest / Jest tests. Type-check via npm run build is the contract.

Why. Move-fast trade-off taken early; the app is mostly stateless render of API responses.

The catch. Refactors of the chat reducer (useAgent) require manual exploratory testing.

Plan. Add Vitest for useAgent message-stream tests at minimum. Other components can wait.
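
As a rough sketch of what a first message-stream test could target, the snippet below models the chat state as a pure reducer and replays a stream of events through it. Everything here is hypothetical: the ChatState and StreamEvent shapes, the event names, and the reduceStream and replay functions are stand-ins, not the real useAgent internals. In a Vitest suite, the assertions on the replayed state would sit inside test(...) blocks using expect(...).

```typescript
// Hypothetical shapes; the real useAgent state is not shown in this document.
type Message = { id: string; role: "user" | "assistant"; text: string; done: boolean };
type ChatState = { messages: Message[] };

type StreamEvent =
  | { kind: "start"; id: string }
  | { kind: "delta"; id: string; text: string }
  | { kind: "end"; id: string };

// Pure reducer: applies one stream event and returns a new state object.
function reduceStream(state: ChatState, event: StreamEvent): ChatState {
  switch (event.kind) {
    case "start":
      // Open an empty assistant message that later deltas will fill.
      return {
        messages: [...state.messages, { id: event.id, role: "assistant", text: "", done: false }],
      };
    case "delta":
      // Append the text chunk to the message with the matching id.
      return {
        messages: state.messages.map((m) =>
          m.id === event.id ? { ...m, text: m.text + event.text } : m
        ),
      };
    case "end":
      // Mark the message as complete.
      return {
        messages: state.messages.map((m) => (m.id === event.id ? { ...m, done: true } : m)),
      };
  }
}

// Replays a list of events in order, starting from an empty state.
function replay(events: StreamEvent[]): ChatState {
  let state: ChatState = { messages: [] };
  for (const event of events) {
    state = reduceStream(state, event);
  }
  return state;
}
```

Keeping the reducer pure means the test needs no DOM and no network mock: it feeds events in and compares the resulting state.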

Polymorphic schedule_type

What. scheduled_jobs.schedule_type accepts both "recurring" and "cron". The REST endpoint normalises cron → recurring on save, but legacy rows can still be either. The runner handles both.

Why we left it. A migration that touches every existing job is risky. The runner accepting both is harmless.

Plan. A one-shot script during a maintenance window to canonicalise to "recurring"; then drop the synonym handling.
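
A minimal sketch of the canonicalisation rule, assuming rows are loaded into memory first. The JobRow shape and both function names are hypothetical; only the schedule_type column and its two values come from the text above. In practice the one-shot script may well be a single SQL UPDATE, so this only illustrates the mapping and the idempotency property.

```typescript
// Hypothetical row shape; only schedule_type is taken from the document.
type JobRow = { id: number; schedule_type: string };

// Maps the legacy synonym onto the canonical value and rejects anything else.
function canonicalScheduleType(raw: string): "recurring" {
  if (raw === "recurring" || raw === "cron") return "recurring";
  throw new Error(`unexpected schedule_type: ${raw}`);
}

// Returns only the rows that need rewriting, already carrying the canonical
// value. Running it again on canonical data returns an empty list.
function rowsToRewrite(rows: JobRow[]): JobRow[] {
  return rows
    .filter((r) => r.schedule_type !== canonicalScheduleType(r.schedule_type))
    .map((r) => ({ ...r, schedule_type: "recurring" }));
}
```

Throwing on an unknown value is a deliberate choice in this sketch: a maintenance-window script should stop on surprising data instead of rewriting it.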

EF Core 10 ghost-entry workaround

What. After a DbUpdateConcurrencyException, EF's StateManager can leave a ghost entry that breaks the next Add(). Several repositories explicitly call _db.ChangeTracker.Clear() before the retry.

Why. Upstream EF Core bug — tracked, not fixed in 10.0.3.

The catch. The workaround is repeated in each repository; a new repository that hits the same code path without the clear will reproduce the bug.

Plan. Wait for an EF fix; remove the explicit clears once available. Search "ChangeTracker.Clear()" to find them all.

Repository return type ergonomics

What. Dictionary<string, object?> everywhere — no compile-time key safety, no IDE autocomplete; typos surface as runtime 400s. Type safety is delegated to TypeScript on the frontend and integration tests on the backend. See ADR-002·Dictionary repos → for full rationale.
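
To illustrate the frontend half of that trade-off: a dictionary-shaped row arrives as untyped JSON, so the TypeScript side has to narrow it before use. The sketch below is an assumption about how such a guard might look; JobDto, its fields, and parseJob are hypothetical names, not part of the codebase.

```typescript
// Hypothetical DTO; field names are illustrative only.
type JobDto = { id: number; name: string };

// Narrows a loosely typed API row, failing loudly on a missing or mistyped key
// instead of letting undefined leak into the UI.
function parseJob(row: Record<string, unknown>): JobDto {
  const { id, name } = row;
  if (typeof id !== "number") throw new Error("job.id must be a number");
  if (typeof name !== "string") throw new Error("job.name must be a string");
  return { id, name };
}
```

A key typo on the backend then surfaces at the first parse on the client, which is later than a compile error but earlier than a broken render.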

Plan. No imminent change; the pattern's flexibility has paid for itself.

Inline bytea attachments for skills

What. Skill attachments live in skill_attachments.content as inline bytea.

Why. Expected payloads are small; transactional integrity is free; cascade delete is one foreign key.

The catch. If we ever attach a 50 MB PDF to a skill, any SELECT * FROM skill_attachments pulls the full payload over the database connection and into memory.

Plan. Move to MinIO/S3 if we observe size growth in production.

Non-deterministic specialist fan-out cost

What. The planner-driven fan-out can spawn 4–6 parallel specialist runs in one user turn. Cost on a heavy turn is unpredictable.

The catch. A single turn can blow a user's daily budget.

Mitigation. Per-tenant MaxToolTurns, the credit cap fallback, the auto-router — all dampen the worst case.

Plan. Add per-turn cost cap with explicit user prompt to escalate. Designed, not built.
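
One possible shape for that cap, as a sketch under stated assumptions: the planner can estimate a per-run cost before fanning out, and the cap is a single number per turn. The CapDecision type, applyTurnCap, and all parameter names are hypothetical, and the actual design may differ.

```typescript
// All names and units here are hypothetical.
type CapDecision =
  | { action: "run"; specialists: number }
  | { action: "ask_user"; affordable: number; requested: number };

// Decides whether the requested fan-out fits under the per-turn cap.
// If it does not, the caller is expected to prompt the user to escalate.
function applyTurnCap(
  requested: number,
  estCostPerRun: number,
  capPerTurn: number,
  userEscalated: boolean
): CapDecision {
  if (estCostPerRun <= 0) throw new Error("estCostPerRun must be positive");
  // An explicit user escalation bypasses the cap for this turn only.
  if (userEscalated) return { action: "run", specialists: requested };
  const affordable = Math.floor(capPerTurn / estCostPerRun);
  if (affordable >= requested) return { action: "run", specialists: requested };
  return { action: "ask_user", affordable, requested };
}
```

For example, with an estimated cost of 10 per run and a cap of 40, a request for 6 specialists returns ask_user with affordable set to 4, while a request for 4 runs without a prompt.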

Test taxonomy

What. Tests live in tests/AgenticIT.Tests/ under Unit/, Integration/, Parity/. Default dotnet test excludes Integration/* via default.runsettings.

The catch. It is easy to forget to run integration tests locally; CI catches the failures, but the local dev loop doesn't.

Plan. Add a pre-push git hook that runs the integration suite when matching paths change.

Decisions also tracked as ADRs

These items are intentional choices (not mistakes), each with a full decision record in §11:

Notes also worth writing down