Clusterless operations for cloud-native teams

The VM is the pod.
The cloud is the control plane.

Skiff gives teams Kubernetes-class operational leverage without Kubernetes-class cost or complexity. With Skiff you compile a service definition into cloud-native primitives, run managed operations such as canary rollouts and database restores, debug with built-in observability, and give agents a harness for safe investigation and repair. There are no managed control planes, overlay networks, hidden controller state machines, Kubernetes/cloud impedance mismatches, or thousands of lines of YAML.

  • Object storage is the only dependency
  • Reduced cost and complexity
  • IAM is workload identity
  • CLI/TUI is agent-native

The gap

Terraform describes shape. Kubernetes reconciles clusters. Production needs journeys.

Real operations are not just resources. They are restores, key rotations, canaries, failovers, migrations, approvals, compensations, and evidence. Skiff gives those journeys a first-class operational substrate.

Terraform is stable state

Excellent for declaring what should exist. Weak for live, multi-step operations with rollback gates and partial failure.

  • Great for infrastructure shape
  • × Not a runtime operations engine
  • × Deploys become plan/apply choreography

Kubernetes is a parallel cloud

Powerful, but it recreates scheduling, networking, identity, secrets, and health as cluster primitives.

  • Great ecosystem and abstractions
  • × Cluster operations become the job
  • × Operators hide procedural complexity

Skiff is explicit operations

Compile simple specs to cloud primitives, store signed desired state in object storage, and run operational sagas on demand.

  • Object-storage-backed operating ledger
  • Control documents updated with compare-and-swap (CAS)
  • Runbooks as typed, resumable graphs
Architecture

One bucket, cloud IAM, and a tiny runner.

Skiff deploys through the same path it recovers from: write durable object state first, then move visible cloud primitives, while skiffd remains a rebuildable facade over the ledger.

Durable operating ledger

Release manifests, operation intents, plans, SBOMs, provenance, audit entries, and events are immutable objects in the state bucket.

CAS coordination

Service state, operation state, leases, and member state live in narrow control documents updated with compare-and-swap semantics.
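A minimal sketch of how compare-and-swap fencing on a control document might look. It uses an in-memory map as a stand-in for a bucket with conditional writes; `ControlDoc`, `Store`, `AcquireLease`, and the object key are hypothetical names for illustration, not Skiff's actual API.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrConflict is returned when the caller's expected version is stale.
var ErrConflict = errors.New("version conflict")

// ControlDoc is a hypothetical control document. Version plays the role of an ETag.
type ControlDoc struct {
	Version int
	Holder  string // current lease holder, empty when the lease is free
}

// Store is an in-memory stand-in for object storage with conditional writes.
type Store struct {
	mu   sync.Mutex
	docs map[string]ControlDoc
}

func NewStore() *Store { return &Store{docs: map[string]ControlDoc{}} }

// Get returns the current document, or the zero value if it does not exist.
func (s *Store) Get(key string) ControlDoc {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.docs[key]
}

// CompareAndSwap writes next only if the stored version equals expected.
func (s *Store) CompareAndSwap(key string, expected int, next ControlDoc) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.docs[key].Version != expected {
		return ErrConflict
	}
	next.Version = expected + 1
	s.docs[key] = next
	return nil
}

// AcquireLease fences an operation: it succeeds only if the lease is free
// and nobody else updated the document since it was read.
func AcquireLease(s *Store, key, holder string) error {
	cur := s.Get(key)
	if cur.Holder != "" && cur.Holder != holder {
		return fmt.Errorf("lease held by %s", cur.Holder)
	}
	return s.CompareAndSwap(key, cur.Version, ControlDoc{Holder: holder})
}

func main() {
	s := NewStore()
	key := "services/payments-api/control" // hypothetical object key
	fmt.Println(AcquireLease(s, key, "op-1")) // first operation takes the lease
	fmt.Println(AcquireLease(s, key, "op-2")) // second operation is rejected
}
```

Against real object storage the version check would be a conditional write on the object's ETag rather than a mutex-guarded map, but the fencing logic stays the same shape.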

Visible cloud primitives

ASGs, target groups, IAM roles, launch templates, logs, and provider IDs stay visible instead of being hidden behind cluster abstractions.

  • Signed releases
  • CAS leases
  • Append-only events
  • skiffd hot index
  • Direct CLI recovery
  • AWS first
  • Multi-cloud interface

$ skiff deploy payments-api --env prod

Durable write first, memory after

Operator intent: CLI or skiffd

Deploy payments-api

A normal service deploy becomes a typed, auditable operation with traceable release evidence.

  • spec: validated
  • policy: passed
  • trace: tr_01J

Object storage ledger: source of truth

State bucket: immutable history plus CAS control

  • Signed release: create-only
  • Operation intent: create-only
  • Service control: CAS update
  • Recent events: append-only

Cloud rollout: visible primitives

  • ASG refresh: 0 of 6 new
  • Target group: traffic held
  • IAM role: scoped runner
  • Logs: CloudWatch
  • New release traffic: 0%

01 Compile: Spec becomes typed IR and a provider plan.
02 Publish: Release evidence is written once.
03 CAS lease: Control document fences the operation.
04 Roll: Cloud primitives start the refresh.
05 Verify: Runner checks signatures before start.
06 Promote: Traffic shifts only after health gates.
Operational sagas

Runbooks as typed, resumable operation graphs.

Database restores, key rotations, canary deploys, regional failovers, migrations, and repairs are not hidden controllers. They are explicit sagas: planned, approved, executed, paused, resumed, compensated, and audited.

  • check.preflight: Verify restore point. Backups, permissions, health, blast radius.
  • database.restore: Restore new DB. Point-in-time restore, no in-place overwrite.
  • check.shadow: Smoke test API. Run service against restored database.
  • approval.manual: Approve cutover. Human or agent gate before traffic moves.
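To make "paused, resumed, compensated" concrete, here is a small sketch of a saga runner over the four steps above. The `Step`, `State`, and `RunSaga` names are simplified stand-ins invented for this example, and the state lives in memory where a real system would persist it.

```go
package main

import (
	"errors"
	"fmt"
)

// Step is a simplified, illustrative saga step.
type Step struct {
	Kind       string
	Run        func() error
	Compensate func() error // nil when there is nothing to undo
}

// State is what would be persisted so that a saga can resume later.
type State struct {
	Next   int      // index of the next step to run
	Events []string // append-only record of what happened
}

var (
	// ErrPaused signals a gate that needs approval before continuing.
	ErrPaused = errors.New("paused for approval")
	errSmoke  = errors.New("smoke test failed")
)

// RunSaga executes steps starting at st.Next. On a pause it returns with the
// state intact so a later call resumes at the same step. On failure it
// compensates the completed steps in reverse order and resets the saga.
func RunSaga(steps []Step, st *State) error {
	for st.Next < len(steps) {
		s := steps[st.Next]
		err := s.Run()
		if errors.Is(err, ErrPaused) {
			st.Events = append(st.Events, "paused:"+s.Kind)
			return err
		}
		if err != nil {
			st.Events = append(st.Events, "failed:"+s.Kind)
			for i := st.Next - 1; i >= 0; i-- {
				c := steps[i].Compensate
				if c == nil {
					continue
				}
				if cerr := c(); cerr != nil {
					return fmt.Errorf("compensate %s: %w", steps[i].Kind, cerr)
				}
				st.Events = append(st.Events, "compensated:"+steps[i].Kind)
			}
			st.Next = 0
			return err
		}
		st.Events = append(st.Events, "done:"+s.Kind)
		st.Next++
	}
	return nil
}

// demo runs the restore saga up to the approval gate, approves, and resumes.
func demo() []string {
	approved := false
	ok := func() error { return nil }
	steps := []Step{
		{Kind: "check.preflight", Run: ok},
		{Kind: "database.restore", Run: ok, Compensate: ok},
		{Kind: "check.shadow", Run: ok},
		{Kind: "approval.manual", Run: func() error {
			if !approved {
				return ErrPaused
			}
			return nil
		}},
	}
	st := &State{}
	_ = RunSaga(steps, st) // pauses at the approval gate
	approved = true
	_ = RunSaga(steps, st) // resumes at the gate, not from the start
	return st.Events
}

func main() {
	fmt.Println(demo())

	// A failing smoke test triggers compensation of the restore.
	ok := func() error { return nil }
	st := &State{}
	err := RunSaga([]Step{
		{Kind: "database.restore", Run: ok, Compensate: ok},
		{Kind: "check.shadow", Run: func() error { return errSmoke }},
	}, st)
	fmt.Println(err, st.Events)
}
```

Resetting `Next` to zero after compensation is a design choice of this sketch: once every completed step is undone, a retry starts from the beginning.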

Canary deploy

Start at 5%, bake, evaluate health and metrics, advance, pause, or compensate by rolling back.
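One way to picture the advance, pause, or roll back decision is a loop over traffic weights with a health gate after each bake. The thresholds, the `Health` fields, and the mapping of "not ready" to rollback and "elevated errors" to pause are assumptions of this sketch, not Skiff defaults.

```go
package main

import "fmt"

// Health is a hypothetical snapshot of canary metrics after a bake period.
type Health struct {
	ErrorRate float64 // fraction of failed requests, 0.0 to 1.0
	Ready     bool    // readiness checks passing
}

// Decision is what the saga does next.
type Decision string

const (
	Advance  Decision = "advance"
	Pause    Decision = "pause"
	Rollback Decision = "rollback"
)

// Evaluate applies two simple gates: failed readiness rolls back,
// an elevated error rate pauses for a human or agent to look.
func Evaluate(h Health, maxErr float64) Decision {
	switch {
	case !h.Ready:
		return Rollback
	case h.ErrorRate > maxErr:
		return Pause
	default:
		return Advance
	}
}

// RunCanary walks the traffic weights in order, calling observe after each
// bake. It returns the last weight reached and the final decision.
func RunCanary(weights []int, maxErr float64, observe func(weight int) Health) (int, Decision) {
	reached := 0
	for _, w := range weights {
		reached = w
		if d := Evaluate(observe(w), maxErr); d != Advance {
			return reached, d
		}
	}
	return reached, Advance
}

func main() {
	// Hypothetical observations: healthy at 5%, readiness fails at 25%.
	observe := func(w int) Health {
		return Health{ErrorRate: 0.001, Ready: w < 25}
	}
	weight, decision := RunCanary([]int{5, 25, 50, 100}, 0.01, observe)
	fmt.Println(weight, decision)
}
```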

Database restore

Restore to a new database, smoke test, approval gate, secret cutover, service rollout, old DB retention.

Key rotation

Create new versions, canary consumers, promote aliases, roll services, delay destructive cleanup.

Regional failover

Verify replica lag, freeze writes, promote, shift traffic, verify, and clearly mark irreversible steps.

Kubernetes cutover

Deploy Skiff shadow service, shift traffic by weight, compare metrics, retire the old service safely.

Incident repair

Collect evidence, recommend safe actions, run reversible remediation, append every event.

Agent-native operations

Humans get clarity. Agents get structure.

Every Skiff command has deterministic JSON output, explicit risk, recommended next actions, idempotency keys, failure taxonomy, and safety classification. An agent armed with the CLI can diagnose, deploy, repair, roll back, and resume without scraping logs or guessing state.

  • --format json
  • action graphs
  • safe commands
  • idempotent ops
  • risk metadata

$ skiff doctor --format json
{
  "ok": false,
  "code": "CANARY_FAILED",
  "summary": "new release failed readiness",
  "facts": [
    "rollout paused at 10%",
    "new targets return 500 on /healthz",
    "previous stable release is healthy"
  ],
  "recommended_actions": [
    {
      "id": "rollback",
      "command": "skiff saga start rollback --service payments-api --to previous-stable --yes --format json",
      "mutating": true,
      "safety": "reversible",
      "confidence": 0.91
    }
  ]
}
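As an illustration of what an agent could do with this output, the sketch below decodes a report of the shape shown above and keeps only the actions it would run unattended. The field names come from the sample; the `AutoRunnable` helper and its selection rule (non-mutating, or reversible with enough confidence) are assumptions made for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Action mirrors one entry of "recommended_actions" in the sample output.
type Action struct {
	ID         string  `json:"id"`
	Command    string  `json:"command"`
	Mutating   bool    `json:"mutating"`
	Safety     string  `json:"safety"`
	Confidence float64 `json:"confidence"`
}

// Report mirrors the top-level fields of the sample output.
type Report struct {
	OK      bool     `json:"ok"`
	Code    string   `json:"code"`
	Summary string   `json:"summary"`
	Facts   []string `json:"facts"`
	Actions []Action `json:"recommended_actions"`
}

// AutoRunnable returns the actions an agent may run without a human:
// read-only actions, or reversible ones at or above minConfidence.
func AutoRunnable(raw []byte, minConfidence float64) ([]Action, error) {
	var r Report
	if err := json.Unmarshal(raw, &r); err != nil {
		return nil, fmt.Errorf("parse doctor output: %w", err)
	}
	var out []Action
	for _, a := range r.Actions {
		if !a.Mutating || (a.Safety == "reversible" && a.Confidence >= minConfidence) {
			out = append(out, a)
		}
	}
	return out, nil
}

// sample is a trimmed copy of the doctor output shown above.
const sample = `{
  "ok": false,
  "code": "CANARY_FAILED",
  "summary": "new release failed readiness",
  "facts": ["rollout paused at 10%"],
  "recommended_actions": [
    {
      "id": "rollback",
      "command": "skiff saga start rollback --service payments-api --to previous-stable --yes --format json",
      "mutating": true,
      "safety": "reversible",
      "confidence": 0.91
    }
  ]
}`

func main() {
	actions, err := AutoRunnable([]byte(sample), 0.9)
	if err != nil {
		panic(err)
	}
	for _, a := range actions {
		fmt.Println(a.ID, "->", a.Command)
	}
}
```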
CLI and TUI

A beautiful cockpit for deployments, sagas, logs, metrics, and recovery.

The TUI is a frontend over the same deterministic API and object-state model. There is no separate magic path for humans.

skiff tui: prod · us-west-2 · state bucket healthy

  • payments-api: green
  • orders-api: green
  • invoice-worker: yellow
  • payments-db: green
  • restore saga: wait

payments-api

release 2026.05.16.1 · 6/6 healthy · p95 91ms · error 0.2% · cpu 48%

Saga: restore payments-db

  • verify restore point: done
  • restore new database: done
  • shadow API smoke test: done
  • approval before cutover: waiting
  • update secret pointer: pending
  • roll payments-api: pending

Facts: restored DB is available, shadow API passed, current DB snapshot exists. Action: approve cutover or reject saga.

Secure by default

Do not make users become security experts to get secure operations.

Skiff's defaults are intentionally conservative: signed releases, digest-pinned artifacts, least-privilege IAM, encrypted state, conditional writes, no SSH ingress, managed sessions, KMS, secret references, and explicit approval for risky sagas.

  • Signed state: Release manifests, saga intents, plans, and important artifacts are signed and verified.
  • Least privilege: IAM policies are compiled from specs and rejected if they are too broad.
  • No SSH by default: Debug through cloud-managed sessions and audited diagnostic bundles.
  • Explicit risk: Irreversible saga steps are labeled, gated, and never hidden behind controllers.
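The least-privilege rule above could be enforced by a check like the following before a compiled policy is accepted. What counts as "too broad" is an assumption here (any wildcard action such as `*` or `s3:*`, or a bare `*` resource), and the `Statement` type and bucket name are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Statement is a simplified, hypothetical view of one IAM policy statement.
type Statement struct {
	Actions   []string
	Resources []string
}

// TooBroad returns one reason per wildcard it finds. An empty result means
// the policy passes this particular check.
func TooBroad(stmts []Statement) []string {
	var reasons []string
	for i, s := range stmts {
		for _, a := range s.Actions {
			if a == "*" || strings.HasSuffix(a, ":*") {
				reasons = append(reasons, fmt.Sprintf("statement %d: wildcard action %q", i, a))
			}
		}
		for _, r := range s.Resources {
			if r == "*" {
				reasons = append(reasons, fmt.Sprintf("statement %d: wildcard resource", i))
			}
		}
	}
	return reasons
}

func main() {
	policy := []Statement{
		// Scoped: one action on one (hypothetical) bucket prefix.
		{Actions: []string{"s3:GetObject"}, Resources: []string{"arn:aws:s3:::payments-state/*"}},
		// Too broad: every S3 action on every resource.
		{Actions: []string{"s3:*"}, Resources: []string{"*"}},
	}
	for _, reason := range TooBroad(policy) {
		fmt.Println("denied:", reason)
	}
}
```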
Incremental adoption

Start small. Do not rewrite your world.

Skiff should be easy to try from AWS, Terraform, or Kubernetes. The happy path is direct apply, but Terraform generation and Kubernetes migration are first-class bridges.

Direct AWS mode

Skiff CLI or stateless skiffd writes object state and calls AWS APIs. Fastest path, no Terraform state, object-state native.

  • Best default
  • Disaster-recovery friendly

Terraform bridge

Generate or adopt Terraform for stable infrastructure shape, then let Skiff own release pointers, rollouts, sagas, and diagnostics.

  • Enterprise review friendly
  • No deploy-by-plan/apply requirement

Kubernetes migration

Import Deployment/Service/Ingress, deploy shadow Skiff services, then cut traffic over through a weighted migration saga.

  • No cliff jump
  • Unsupported features are explicit
Customer journeys

Recipes for the operations teams actually run.

Skiff should ship with opinionated, understandable recipes. Users can inspect the plan, run it, pause it, approve it, or let agents execute low-risk paths.

01

API server + managed database

Deploy an API, create a managed database, wire secrets, emit logs/metrics, and get default restore, rotate, and canary sagas.

$ skiff init stack api-db payments
$ skiff deploy
$ skiff restore database payments-db --to latest
02

Multi-region API + regional database

Run services in two regions, maintain database replication, test failover, and promote through explicit high-risk sagas.

$ skiff failover stack payments --to us-east-1
! replica promotion is irreversible after new writes
03

Queue worker with autoscaling

Scale worker VMs from queue depth or age, keep logs and metrics normalized, and debug failures without clusters.

$ skiff init worker invoice-worker
$ skiff metrics invoice-worker queue-lag
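Scaling "from queue depth or age" might reduce to a calculation like this one. It is a sketch under stated assumptions: the `Queue` and `Policy` fields, the per-worker backlog target, and the rule of adding one worker when messages age past a limit are all illustrative choices.

```go
package main

import "fmt"

// Queue is a hypothetical snapshot of queue metrics.
type Queue struct {
	Depth        int // messages waiting
	OldestAgeSec int // age of the oldest message, in seconds
}

// Policy holds illustrative scaling limits.
type Policy struct {
	PerWorker int // backlog one worker is expected to absorb
	MaxAgeSec int // oldest-message age that triggers a scale-up
	Min, Max  int // bounds on the worker count
}

// DesiredWorkers sizes the fleet from queue depth, adds one worker when
// messages are aging even though depth looks fine, then clamps to the bounds.
func DesiredWorkers(q Queue, p Policy, current int) int {
	if p.PerWorker < 1 {
		p.PerWorker = 1 // guard against division by zero
	}
	want := (q.Depth + p.PerWorker - 1) / p.PerWorker // ceiling division
	if q.OldestAgeSec > p.MaxAgeSec && want <= current {
		want = current + 1
	}
	if want < p.Min {
		want = p.Min
	}
	if want > p.Max {
		want = p.Max
	}
	return want
}

func main() {
	p := Policy{PerWorker: 100, MaxAgeSec: 300, Min: 1, Max: 20}
	fmt.Println(DesiredWorkers(Queue{Depth: 950}, p, 4))                    // sized by depth
	fmt.Println(DesiredWorkers(Queue{Depth: 100, OldestAgeSec: 600}, p, 3)) // bumped by age
}
```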
04

Kubernetes service migration

Import, deploy shadow, compare health, shift traffic, and decommission the old service only after evidence says it is safe.

$ skiff import kube ./k8s --out skiff.yaml
$ skiff deploy --shadow
$ skiff saga start traffic-cutover --steps 5,25,50,100
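A weighted cutover only makes sense if the weights move forward and end at full traffic. The sketch below validates a list in the form passed to `--steps` above; the `ParseSteps` helper and its rules (strictly increasing, at most 100, ending at 100) are assumptions for illustration, not documented Skiff behavior.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ParseSteps parses a comma-separated list of traffic weights such as
// "5,25,50,100". It requires strictly increasing percentages that end at 100.
func ParseSteps(s string) ([]int, error) {
	parts := strings.Split(s, ",")
	steps := make([]int, 0, len(parts))
	prev := 0
	for _, p := range parts {
		n, err := strconv.Atoi(strings.TrimSpace(p))
		if err != nil {
			return nil, fmt.Errorf("invalid step %q: %w", p, err)
		}
		if n <= prev || n > 100 {
			return nil, fmt.Errorf("step %d must be greater than %d and at most 100", n, prev)
		}
		steps = append(steps, n)
		prev = n
	}
	if prev != 100 {
		return nil, fmt.Errorf("last step must be 100, got %d", prev)
	}
	return steps, nil
}

func main() {
	for _, in := range []string{"5,25,50,100", "50,25,100", "5,25"} {
		steps, err := ParseSteps(in)
		fmt.Println(in, "=>", steps, err)
	}
}
```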
Implementation shape

Go core, provider plugins, saga steps, and object-state discipline.

Skiff is composable without becoming an operator framework. Plugins register provider capabilities, runtime addons, saga step kinds, diagnostics, and recipes.

file layout
skiff/
  cmd/
    skiff/          # CLI/TUI
    skiffd/         # stateless API server
    skiff-runner/   # VM runner
    skiff-worker/   # optional saga/index worker
  internal/
    compiler/ ir/ provider/aws/
    state/ objstore/ release/
    saga/ saga/steps/
    doctor/ policy/ plugins/
    tui/ observability/
  pkg/
    spec/ pluginapi/ sagaapi/ sdk/
  examples/
    api-db/ worker/ mtls/
    multiregion-db/
saga step API
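import "context"

type Step interface {
  Kind() string // step kind, e.g. "database.restore"
  // Plan describes the intended work before anything runs.
  Plan(ctx context.Context, req StepRequest) (*StepPlan, error)
  // Run executes the step.
  Run(ctx context.Context, req StepRequest) (*StepResult, error)
  // Resume continues a step that was paused or interrupted.
  Resume(ctx context.Context, req StepRequest) (*StepResult, error)
  // Compensate undoes the effects recorded in result.
  Compensate(ctx context.Context, req StepRequest, result StepResult) (*StepResult, error)
  // Doctor returns diagnostic findings for the step.
  Doctor(ctx context.Context, req StepRequest) ([]Finding, error)
}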
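For a sense of what a plugin might provide, here is a minimal read-only step that satisfies the interface above. The interface is repeated so the example compiles on its own; the fields of `StepRequest`, `StepPlan`, `StepResult`, and `Finding` are placeholders, since the page names these types without defining them, and `PreflightStep` with its `restore_point` parameter is hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Placeholder types: the field sets are assumptions of this sketch.
type StepRequest struct{ Params map[string]string }
type StepPlan struct {
	Summary    string
	Reversible bool
}
type StepResult struct{ Status string }
type Finding struct{ Message string }

// Step repeats the saga step API so this example is self-contained.
type Step interface {
	Kind() string
	Plan(ctx context.Context, req StepRequest) (*StepPlan, error)
	Run(ctx context.Context, req StepRequest) (*StepResult, error)
	Resume(ctx context.Context, req StepRequest) (*StepResult, error)
	Compensate(ctx context.Context, req StepRequest, result StepResult) (*StepResult, error)
	Doctor(ctx context.Context, req StepRequest) ([]Finding, error)
}

// PreflightStep is a hypothetical read-only check.
type PreflightStep struct{}

// Compile-time proof that PreflightStep implements Step.
var _ Step = PreflightStep{}

func (PreflightStep) Kind() string { return "check.preflight" }

func (PreflightStep) Plan(ctx context.Context, req StepRequest) (*StepPlan, error) {
	return &StepPlan{Summary: "verify restore point " + req.Params["restore_point"], Reversible: true}, nil
}

func (PreflightStep) Run(ctx context.Context, req StepRequest) (*StepResult, error) {
	if err := ctx.Err(); err != nil {
		return nil, err // respect cancellation before doing any work
	}
	if req.Params["restore_point"] == "" {
		return nil, errors.New("restore_point is required")
	}
	return &StepResult{Status: "passed"}, nil
}

// Resume simply re-runs the check: it is read-only, so repeating it is safe.
func (s PreflightStep) Resume(ctx context.Context, req StepRequest) (*StepResult, error) {
	return s.Run(ctx, req)
}

// Compensate has nothing to undo for a read-only step.
func (PreflightStep) Compensate(ctx context.Context, req StepRequest, result StepResult) (*StepResult, error) {
	return &StepResult{Status: "noop"}, nil
}

func (PreflightStep) Doctor(ctx context.Context, req StepRequest) ([]Finding, error) {
	if req.Params["restore_point"] == "" {
		return []Finding{{Message: "no restore point configured"}}, nil
	}
	return nil, nil
}

// demo runs the step once with a restore point set.
func demo() (string, error) {
	var s Step = PreflightStep{}
	res, err := s.Run(context.Background(), StepRequest{Params: map[string]string{"restore_point": "latest"}})
	if err != nil {
		return "", err
	}
	return res.Status, nil
}

func main() {
	fmt.Println(demo())
}
```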
Roadmap

Build the core until the five commands feel magical.

The first version should be narrow and excellent: AWS, stateless services, object state, signed releases, runner, logs, doctor, rollback, and sagas.

Phase 1

Stateless AWS service

Service spec, compiler IR, S3 state, signed release ledger, ASG/ALB/IAM, runner, CloudWatch logs, status and rollback.

Phase 2

Doctor and sagas

Structured diagnostics, canary saga, rollback saga, operation leases, append-only events, agent-safe JSON.

Phase 3

Managed dependencies

API + managed database recipe, restore saga, secret rotation saga, database smoke tests, cost and shape advisor.

Phase 4

Adoption bridges

Terraform generate/adopt, Kubernetes import, shadow deploy, weighted traffic cutover, TUI, hot skiffd indexes.

Phase 5

Plugins and multi-region

mTLS plugin, provider conformance, stateful recipes, regional failover sagas, GCP/Azure provider work.