Kubernetes-class operational leverage without Kubernetes-class cost

Skiff gives small teams platform-grade deploys and operations.

Skiff gives teams one operating path for stateless and stateful services: least-privilege permissions, signed releases, canary deploys, runbooks as code, agent-readable command output, and native cloud primitives without a cluster control plane.

Deploy safely. Canaries, health gates, release signatures, and controlled promotion are standard.
Use native cloud primitives. Skiff works with VMs, autoscaling groups, target groups, IAM roles, and log groups.
Operate stateful work. Backups, restores, failovers, and migrations become explicit typed runbooks.
Agent-safe by default. JSON output, risk labels, approval gates, and human escalation are part of the command surface.

The problem

Teams want the operating habits without a second platform to run.

Kubernetes can deliver good operational patterns, but teams often pay for them with controllers, CRDs, clusters, custom policy layers, observability wiring, bespoke rollout scripts, and enough YAML to bury the customer journey.

The real cost is operational discipline. Every deploy, incident, rollback, secret rotation, and restore needs a safe default path.
Skiff turns production work into commands. Common journeys become explicit runbooks with checks, approvals, and concrete next steps.
Cloud primitives stay native. Skiff uses load balancers, autoscaling groups, IAM roles, and log groups without burying them under a cluster model.

Kube tax: many layers
Policy templates: custom
Rollout scripts: manual
Recovery docs: stale
Audit trails: scattered

Skiff road: one path
01 Bootstrap guardrails: secure defaults, least privilege, context ready
02 Canary and observe: health, metrics, logs, target groups
03 Agent-safe runbooks: JSON, risk labels, approvals
04 Recorded changes: actor, trace, risk, summary

The first win is safer defaults. Secure bootstrap, signing, identity, logs, and deploy shape are ready before the first service ships.

Deploys become a known procedure. The operator sees the release, traffic state, health gates, and the business-safe next step instead of bouncing across unrelated consoles.

Daily operations are designed before the incident. Every long-running operation is resumable and every repair has explicit risk and reversibility.

Changes are easy to review. Mutating production operations record who acted, what changed, where it ran, and how risky it was.

Stateful and stateless

One operating model for services that scale out and services that hold state.

Stateless web APIs, workers, queues, databases, and stateful members need different runbooks. They still need the same operational contract: secure identity, safe deploys, health checks, scoped recovery commands, and recorded changes.

Stateless service

Least privilege: runtime IAM only
Canary: traffic gates
Observe: SLO and logs
Rollback: safe command

Stateful service

Scoped access: state refs only
Change window: approval gate
Backup freshness: known restore point
Failover: typed saga

Identity starts narrow. Workloads get only the cloud permissions they need. Operators get scoped deploy and recovery permissions, not a blanket cluster-admin escape hatch.

Deploy paths match the workload. A web API can canary through target groups while a stateful member uses explicit approval and preflight checks.

Observability is attached to the journey. Status, logs, health, backups, and cloud resources show up in the same operational context.

Recovery is an executable runbook. Rollback, failover, drain, restore, and resume are typed actions with risk and reversibility.

Stateful work gets first-class treatment. Backup, restore, failover, maintenance, migration, and repair are part of the product model.
Operators keep the cloud model. Skiff packages the safe sequence as a runbook without pretending AWS is a cluster.
One VM runs one workload replica by default. The isolation boundary stays simple enough for SREs and agents to reason about during incidents.

Secure by default

The default path includes the controls SREs ask for later.

Skiff puts the production checklist into the workflow: least privilege, release signatures, immutable history, secret references, approval gates, and audit records. Teams do not have to remember the secure version of the command under pressure.

Least privilege is generated with the service. Runtime access, deployer access, and operator access are separate by default.
Production releases are signed. Runners verify release manifests and digests before serving traffic.
Secrets stay referenced. Object state and events carry secret references and redacted summaries, not plaintext.

1. Least privilege IAM: Scoped roles for deployer, runner, and operator actions.
2. Signed releases: Manifest, digest, service, environment, and expiry are verified.
3. Digest-pinned artifacts: Production workloads run what was approved, not a mutable tag.
4. No SSH-first debug: Diagnostics start with scoped logs, status, and reviewable sessions.
5. Secret references: Object state points to the secure store and redacts output.
6. Audit by default: Actor, trace ID, target, risk, and summary on every mutation.

Access starts from the operation. The service gets only what it needs, and high-risk actions require explicit approval.

The release is immutable, not a pointer. Skiff verifies signed manifests and runtime manifests before rollout.

Runtime safety does not rely on memory. The runner can read durable state directly and verify before starting the workload.

History survives handoffs. Humans and agents get the same traceable event stream when continuing an operation.
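
The release checks named in this section (signature, digest, service, environment, expiry) can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not Skiff's implementation: the manifest field names, the `sign_manifest` and `verify_release` helpers, and the use of HMAC-SHA256 as a stand-in for a real signature scheme are all hypothetical.

```python
import hashlib
import hmac
import json
import time


def sign_manifest(manifest: dict, key: bytes) -> str:
    """Hypothetical signer: HMAC-SHA256 over canonical JSON (stand-in for a real signature)."""
    payload = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_release(manifest, signature, artifact, key, service, environment, now=None):
    """Return the names of failed checks; an empty list means the release may serve."""
    now = time.time() if now is None else now
    failures = []
    if not hmac.compare_digest(sign_manifest(manifest, key), signature):
        failures.append("signature")
    if hashlib.sha256(artifact).hexdigest() != manifest.get("digest"):
        failures.append("digest")  # the artifact is not the one that was approved
    if manifest.get("service") != service:
        failures.append("service")
    if manifest.get("environment") != environment:
        failures.append("environment")
    if manifest.get("expires_at", 0) <= now:
        failures.append("expiry")
    return failures


# Illustrative values only; field names are assumptions.
artifact = b"payments-api build"
key = b"demo-key"
manifest = {
    "service": "payments-api",
    "environment": "prod",
    "digest": hashlib.sha256(artifact).hexdigest(),
    "expires_at": 2_000_000_000,
}
signature = sign_manifest(manifest, key)

ok = verify_release(manifest, signature, artifact, key,
                    "payments-api", "prod", now=1_700_000_000)
tampered = verify_release(manifest, signature, b"other build", key,
                          "payments-api", "staging", now=1_700_000_000)
print(ok, tampered)
```

Returning every failed check, rather than stopping at the first, keeps the result useful as a structured finding for both operators and agents.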

Runbooks as code

Operational knowledge becomes typed, resumable workflows.

During an outage, a wiki runbook is too easy to misread or skip. In Skiff, canaries, drains, restores, failovers, rotations, and repairs are explicit sagas with typed steps, stored progress, compensation where possible, and clear events.

Canary deploy: advance only on health
Database restore: prove backup before cutover
Secret rotation: stage, verify, revoke old
Regional failover: approve, route, validate

01 Plan. Strict inputs, policy checks, expected provider actions. [low]
02 Write intent. Durable operation or saga intent before any side effect. [audited]
03 Stage safely. Canary traffic or staged credentials with live verification. [reversible]
03 Gate impact. Backup freshness, maintenance window, and operator approval. [approval]
04 Record result. Store step results before waiting, so work can resume. [resumable]

Canary deploys become an operational contract. Skiff advances only when health, logs, and SLO checks support promotion.

Restores follow a recorded plan. The runbook proves backup freshness, isolates blast radius, and records the cutover plan.

Secret rotation has a safe middle. Stage new references, verify workloads, then revoke old credentials after consumers move.

Failover is explicit and reviewable. Skiff separates plan, approval, route changes, validation, and compensation.

Every long-running operation can resume. Provider operation IDs and step results are stored before Skiff waits on cloud APIs.
Compensation is named honestly. Skiff distinguishes reversible, compensatable, partially reversible, and irreversible work.
Operators and agents see the same graph. Facts, hypotheses, risk, recommended commands, and approval requirements are structured.
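
To make "stored progress" and "compensation where possible" concrete, here is a minimal Python sketch of a resumable saga. It assumes nothing about Skiff's internals: `run_saga`, `StepFailed`, the step names, and the plain dict standing in for durable object state are all hypothetical.

```python
class StepFailed(Exception):
    """A step failed in a way the saga should compensate for."""


def run_saga(steps, store):
    """Run (name, action, compensate) steps in order, recording progress in `store`.

    `store` is a plain dict standing in for durable state. Steps already
    recorded as done are skipped, which is what makes the saga resumable.
    """
    done = store.setdefault("done", [])
    for name, action, _compensate in steps:
        if name in done:
            continue  # finished in an earlier run; do not repeat the side effect
        try:
            action()
        except StepFailed:
            # Undo completed steps in reverse order, where a compensation exists.
            for prev, _action, compensate in reversed(steps):
                if prev in done and compensate is not None:
                    compensate()
            store["status"] = "compensated"
            return store["status"]
        done.append(name)  # record progress before moving to the next step
    store["status"] = "succeeded"
    return store["status"]


# Demo 1: the process is interrupted mid-saga, then resumes from stored progress.
calls = []
crash_once = {"armed": True}

def verify():
    if crash_once["armed"]:
        crash_once["armed"] = False
        raise ConnectionError("interrupted while waiting on a provider")
    calls.append("verify")

rotation = [
    ("stage", lambda: calls.append("stage"), lambda: calls.append("unstage")),
    ("verify", verify, None),
    ("revoke_old", lambda: calls.append("revoke_old"), None),  # no compensation
]
store = {}
try:
    run_saga(rotation, store)
except ConnectionError:
    pass  # simulated crash; `store` survives
progress_after_crash = list(store["done"])
status = run_saga(rotation, store)

# Demo 2: a failed check triggers compensation of the earlier step.
undo_log = []

def failing_check():
    raise StepFailed("new targets unhealthy")

canary = [
    ("shift_traffic", lambda: undo_log.append("shift"), lambda: undo_log.append("unshift")),
    ("check_health", failing_check, None),
]
canary_status = run_saga(canary, {})
```

In the first demo the `stage` step runs only once even though the saga is started twice. A step whose compensation is `None` models work that cannot be undone.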
Canary deploys

Promotion waits for live signals.

The deploy journey connects release signing, rollout traffic, target health, SLO signals, logs, and recorded changes. Traffic moves only as checks pass.

Traffic split

stable: 100%
canary: 0%
errors: 0.04%
p95: 142 ms

Promotion checks

Signed release verified: pass
Runtime manifest verified: pass
New targets healthy: pass
SLO burn within budget: pass
Error budget unchanged: pass
Audit event appended: done

Before traffic moves, the release is checked. The runner verifies signatures and digests before serving, and the operation starts with a trace ID.

Small traffic proves the runtime path. Skiff watches target health, logs, and SLO signals before increasing exposure.

Promotion is gated by live signals. Traffic advances when signals are good, pauses when they are ambiguous, and recommends repair when they are unsafe.

Completion records the change. The service control updates only after durable state is written, and the operator can explain exactly what changed.
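
The advance, pause, or repair rule described above can be sketched as a small decision function. This is a hypothetical illustration: the `Signals` fields, the threshold values, and the 25-point traffic step are assumptions chosen for the example, not Skiff defaults.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    healthy_targets: int
    total_targets: int
    error_rate: float  # fraction of requests, e.g. 0.0004 for 0.04%
    p95_ms: float


# Hypothetical thresholds, chosen only for illustration.
MAX_ERROR_RATE = 0.001
UNSAFE_ERROR_RATE = 0.01
MAX_P95_MS = 250.0


def decide(s: Signals) -> str:
    """Map live signals to 'advance', 'pause', or 'repair'."""
    if s.total_targets == 0 or s.healthy_targets == 0 or s.error_rate >= UNSAFE_ERROR_RATE:
        return "repair"  # clearly unsafe
    if (s.healthy_targets < s.total_targets
            or s.error_rate > MAX_ERROR_RATE
            or s.p95_ms > MAX_P95_MS):
        return "pause"  # ambiguous: hold traffic where it is
    return "advance"


def next_weight(current: int, decision: str, step: int = 25) -> int:
    """Canary traffic percentage after applying a decision."""
    return min(100, current + step) if decision == "advance" else current


good = decide(Signals(4, 4, 0.0004, 142.0))
ambiguous = decide(Signals(3, 4, 0.0008, 320.0))
unsafe = decide(Signals(0, 4, 0.05, 900.0))
print(good, ambiguous, unsafe, next_weight(50, good), next_weight(50, ambiguous))
```

Only `advance` moves traffic. A paused canary keeps its current weight, so the stable fleet keeps serving while someone investigates.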

Built-in observability

Status, logs, health, and next actions stay in one operational frame.

Skiff keeps operators from assembling context from scratch. Service status is tied to the operation, underlying cloud resources, trace ID, target health, logs, findings, and recommended commands. The same context is available as JSON for agents.

Fresh reads when it matters. Critical status can reload durable objects instead of trusting stale memory.
Findings are explicit. Malformed state, provider drift, unhealthy targets, and risky recommendations are surfaced with context.
No opaque platform layer. Operators can move between Skiff and the AWS console without decoding a second resource model.

service: payments-api prod
rollout: op_01J canary at 50%
target: i-0abc unhealthy in tg-prod-payments
logs: trace tr_01J timeout to secret provider
metric: p95 320ms, errors 0.08%, burn safe
finding: canary paused; stable serving 50%
resume: op_01J continue after credential fix

{
  "ok": false,
  "code": "CANARY_PAUSED",
  "risk": "no",
  "facts": ["one target unhealthy", "stable still serving"],
  "recommended_actions": [
    "inspect scoped logs",
    "request human approval",
    "resume op_01J"
  ]
}

Detect starts from the service journey. The operator sees rollout state and customer traffic before chasing raw telemetry.

Correlation is built into the command output. Trace IDs connect target health, logs, events, and cloud resource IDs.

Recommendations are structured. Skiff separates facts from hypotheses and marks actions as no, low, medium, or high risk.

Resume is a first-class operation. After the fix, the same operation can continue without reconstructing state from memory.
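
As a small illustration of why structured output matters, the Python sketch below parses a payload shaped like the sample status JSON in this section and extracts what a script or agent would act on. The `summarize` helper is hypothetical, and the field names are taken from that sample rather than from a published schema.

```python
import json

RISK_LABELS = ("no", "low", "medium", "high")

# Payload copied from the sample status output in this section.
raw = """
{ "ok": false, "code": "CANARY_PAUSED", "risk": "no",
  "facts": ["one target unhealthy", "stable still serving"],
  "recommended_actions": ["inspect scoped logs", "request human approval", "resume op_01J"] }
"""


def summarize(payload: str) -> dict:
    """Parse status JSON and pull out the fields a caller would branch on."""
    doc = json.loads(payload)
    if doc["risk"] not in RISK_LABELS:
        raise ValueError(f"unknown risk label: {doc['risk']}")
    actions = doc["recommended_actions"]
    return {
        "needs_attention": not doc["ok"],
        "code": doc["code"],
        "risk": doc["risk"],
        "facts": list(doc["facts"]),
        "first_action": actions[0] if actions else None,
    }


summary = summarize(raw)
print(summary["code"], summary["first_action"])
```

Rejecting unknown risk labels instead of guessing keeps a caller from treating an unrecognized label as safe.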

Agent-first tooling

Agents get JSON, context, and human gates.

Every command supports --format=json. Skiff packages the facts, trace IDs, operation IDs, recent events, risk labels, and approval requirements an agent needs to help without becoming an unreviewable controller.

Machine-readable output (--format=json). Every command can return JSON with facts, findings, command suggestions, trace ID, and operation ID.
Problem context (context). Skiff collects service state, target health, recent events, release data, and cloud resource IDs in one answer.
Risk and approval (human gate). Actions are labeled no, low, medium, or high risk. High-risk work can require two-party authorization.

Agents do not scrape prose. JSON mode is a stable interface for status, doctor output, recommendations, and errors.

Skiff manages the context packet. The agent gets the service, operation, trace, cloud resources, recent events, and next commands together.

Risk is explicit. Commands are classified before they run, including whether they mutate state and how reversible they are.

Escalation is part of the flow. Agent escalations to humans and two-party authorization are built into high-risk operations.

Every command has JSON mode. Agents can parse facts, hypotheses, recommendations, mutating flags, trace IDs, and operation IDs without screen scraping.
Agent safety is first-class. Recommended actions carry no, low, medium, or high risk labels, plus reversibility and approval requirements.
Humans stay in the loop. High-risk actions and agent escalations can require two-party authorization before Skiff runs them.
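
One way to picture the human gate is a check that runs before any command executes. The sketch below is hypothetical: the `Action` type, the `gate` function, the example command strings, and the policy that only high-risk actions need a second party are assumptions made for illustration, and a real policy may differ.

```python
from dataclasses import dataclass

RISK_ORDER = {"no": 0, "low": 1, "medium": 2, "high": 3}


@dataclass(frozen=True)
class Action:
    command: str   # illustrative command text, not a documented CLI invocation
    risk: str      # one of: no, low, medium, high
    mutating: bool


def gate(action: Action, requested_by: str, approvals) -> str:
    """Return 'run' or 'needs_approval' for a requested action.

    Assumed policy: high-risk actions need approval from someone other than
    the requester (two parties in total); everything else runs directly.
    """
    if action.risk not in RISK_ORDER:
        raise ValueError(f"unknown risk label: {action.risk}")
    if RISK_ORDER[action.risk] < RISK_ORDER["high"]:
        return "run"
    other_parties = set(approvals) - {requested_by}  # self-approval does not count
    return "run" if other_parties else "needs_approval"


status = Action("status payments-api", risk="no", mutating=False)
failover = Action("failover payments-db", risk="high", mutating=True)

r1 = gate(status, "agent-1", [])
r2 = gate(failover, "agent-1", [])
r3 = gate(failover, "agent-1", ["agent-1"])
r4 = gate(failover, "agent-1", ["alice"])
print(r1, r2, r3, r4)
```

The third call shows the point of two-party authorization: an agent approving its own request is still blocked until a different party signs off.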
Adoption

Adopt the operating model without rewriting your whole cloud estate.

Skiff is AWS-first and works against existing cloud accounts. Import known cloud shape, keep native cloud resources in the model, and add paved operational journeys one service at a time.

Start with one service. Bring one API, worker, or stateful member under signed release and operational runbooks.
Keep Terraform where it helps. Terraform can express infrastructure shape while Skiff owns the operational journey.
Billable resources stay explicit. State buckets, NAT gateways, load balancers, certificates, autoscaling groups, and log groups remain named in output.

Existing AWS: VPCs, ALBs, ASGs, IAM, logs
Current IaC: Terraform stays useful for shape
Current services: APIs, workers, stateful members
Skiff context: state root, signer, region, env
Signed release path: canary, logs, health, audit
Operational portfolio: runbooks, agents, break-glass CLI

Discovery makes the cloud shape legible. Skiff does not hide existing primitives or require teams to pretend the cloud is a cluster.

Bootstrap installs secure defaults. The environment gets state, signing, IAM, logs, and context without exposing low-level IDs on the happy path.

The first service proves the path. Operators get canary deploys, release verification, status, logs, doctor output, and a CLI fallback.

Expansion is additive. Each new service adds typed operations instead of another pile of bespoke YAML.

Common user journeys

The product surface is the work operators actually do.

Skiff is designed around operational jobs: ship safely, respond to degraded service, recover data, rotate credentials, and hand work to an agent or another human with JSON context, risk labels, and enough history to continue.

Ship a release

Compile spec, sign release, canary traffic, watch health, promote or pause.

Repair degraded service

Pull JSON context, inspect logs, classify risk, escalate or run bounded repair.

Restore state

Verify backup, isolate target, approve cutover, validate health, record the result.

Rotate credentials

Stage new reference, roll workloads, confirm consumers, revoke old access.

Shipping stays in one path. The operator sees release, rollout state, traffic, health, logs, and the next safe action together.

Repair begins with observed facts. Doctor output recommends commands, labels risk, and asks for human approval when an agent should not act alone.

State recovery gets first-class treatment. Restore work includes backup freshness, risk classification, cutover, validation, and traceable results.

Credential rotation is deliberate. Skiff stages the change, verifies workloads, revokes old access, and records what changed.

Implementation shape

The internals exist to preserve the operator promise.

The operator-facing promise rests on durable object state, a stateless facade, a direct CLI fallback, immutable history, compare-and-swap (CAS) controls, and typed sagas.

skiff deploy payments-api --canary
  -> write operation intent
  -> create signed release manifest
  -> CAS service control
  -> watch target health
  -> append audit event

skiff --direct status payments-api
  -> read object state directly
  -> rebuild enough view to recover

Object storage is durable truth. State lives in signed or schema-versioned objects.
skiffd is a rebuildable facade. Indexes and streams are fast views, not the database.
Runner verifies before serving. VM-local runtime checks manifests and artifacts directly.
Operations are auditable. Actor, trace ID, target, risk, and summary are recorded.

Durable state comes first. Mutating operations write object storage before updating in-memory views.

The facade can fail without taking truth with it. skiffd powers normal UX, but the CLI can still read object state directly.

The VM is the workload boundary. Runners verify signed releases and report state transitions without relying on a cluster control plane.

Audit is part of the contract. Every mutating production operation is traceable, resumable when long-running, and explicit about risk.

Object state is the durable substrate. Release manifests, operation intents, saga graphs, events, controls, indexes, and audits have clear mutation rules.
Control docs are also lock docs. Compare-and-swap on the relevant control document removes the need for separate lock files and avoids stale ownership.
Native cloud primitives stay native. Skiff uses ASGs, target groups, IAM roles, and log groups without adding a thick platform abstraction.
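
The idea that a control document can double as its own lock is easier to see in code. The Python sketch below uses an in-memory store with versioned writes as a stand-in for object storage with conditional writes. `ObjectStore`, `put_if_version`, `update_control`, and the `control/payments-api` key are hypothetical names, not Skiff APIs.

```python
import threading


class CasConflict(Exception):
    """The document changed since it was read."""


class ObjectStore:
    """In-memory stand-in for object storage that supports conditional writes."""

    def __init__(self):
        self._objects = {}  # key -> (version, value)
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            return self._objects.get(key, (0, None))

    def put_if_version(self, key, expected_version, value):
        """Write only if the stored version still equals `expected_version`."""
        with self._lock:
            current, _ = self._objects.get(key, (0, None))
            if current != expected_version:
                raise CasConflict(f"{key}: expected v{expected_version}, found v{current}")
            self._objects[key] = (current + 1, value)
            return current + 1


def update_control(store, key, mutate, retries=3):
    """Read-modify-write with compare-and-swap; re-read and retry on conflict."""
    for _ in range(retries):
        version, doc = store.get(key)
        new_doc = mutate(dict(doc or {}))
        try:
            return store.put_if_version(key, version, new_doc)
        except CasConflict:
            continue
    raise CasConflict(f"{key}: gave up after {retries} attempts")


store = ObjectStore()
key = "control/payments-api"
v1 = store.put_if_version(key, 0, {"canary_weight": 0})

# Two writers read the same version; only the first write succeeds.
seen_version, _ = store.get(key)
v2 = store.put_if_version(key, seen_version, {"canary_weight": 25})
try:
    store.put_if_version(key, seen_version, {"canary_weight": 100})
    stale_write_rejected = False
except CasConflict:
    stale_write_rejected = True

v3 = update_control(store, key, lambda doc: {**doc, "canary_weight": 50})
final = store.get(key)
```

Because a stale writer is rejected rather than blocked, no separate lock file has to be created, renewed, or cleaned up after a crash.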