Kubernetes-class operational leverage without Kubernetes-class cost

Skiff gives small teams platform-grade deploys and operations.

Skiff gives teams one operating path for stateless and stateful services: least-privilege permissions, signed releases, canary deploys, runbooks as code, agent-readable command output, and native cloud primitives without a cluster control plane.


Deploy safely: Canaries, health gates, release signatures, and controlled promotion are standard.

Use native cloud primitives: Skiff works with VMs, autoscaling groups, target groups, IAM roles, and log groups.

Operate stateful work: Backups, restores, failovers, and migrations become explicit typed runbooks.

Agent-safe by default: JSON output, risk labels, approval gates, and human escalation are part of the command surface.

The problem

Teams want the operating habits, without a second platform to run.

Kubernetes can deliver good operational patterns, but teams often pay for them with controllers, CRDs, clusters, custom policy layers, observability wiring, bespoke rollout scripts, and megabytes of YAML to manage.

The real cost is operational discipline. Every team needs a consistent path to run canary deploys, conduct rollbacks, and run codified mutating operations.

Skiff turns production work into commands. Common journeys become explicit runbooks with checks, approvals, and concrete next steps.

Native cloud means less cognitive cost. Skiff uses native cloud load balancers, autoscaling groups, IAM roles, and log groups. There is no need to reason through multiple layers of abstraction.

Kube tax: many layers

Policy templates: custom

Rollout scripts: manual

Recovery docs: stale

Audit trails: scattered

Skiff road: one path

01 Bootstrap guardrails: secure defaults, least privilege, context ready

02 Canary and observe: health, metrics, logs, target groups

03 Agent-safe runbooks: JSON, risk labels, approvals

04 Recorded changes: actor, trace, risk, summary

The first win is safer defaults. Secure bootstrap, signing, identity, logs, and deploy shape are ready before the first service ships.

Deploys become a known procedure. The operator sees the release, traffic state, health gates, and the business-safe next step instead of bouncing across unrelated consoles.

Daily operations are designed before the incident. Every long-running operation is resumable and every repair has explicit risk and reversibility.

Changes are easy to review. Mutating production operations record who acted, what changed, where it ran, and how risky it was.

Stateful and stateless

One operating model for services that scale out and services that hold state.

Stateless web APIs, workers, queues, databases, and stateful members need different runbooks. They still need the same operational contract: secure identity, safe deploys, health checks, scoped recovery commands, and recorded changes.

Stateless service

Least privilege: runtime IAM only

Canary: traffic gates

Observe: SLO and logs

Rollback: safe command

Stateful service

Scoped access: state refs only

Change window: approval gate

Backup freshness: known restore point

Failover: typed saga

Identity starts narrow. Workloads get only the cloud permissions they need. Operators get scoped deploy and recovery permissions, not a blanket cluster-admin escape hatch.

Deploy paths match the workload. A web API can canary through target groups while a stateful member uses explicit approval and preflight checks.

Observability is attached to the journey. Status, logs, health, backups, and cloud resources show up in the same operational context.

Recovery is an executable runbook. Rollback, failover, drain, restore, and resume are typed actions with risk and reversibility.

Stateful work gets first-class treatment. Backup, restore, failover, maintenance, migration, and repair are part of the product model.

Operators keep the cloud model. Skiff packages the safe sequence as a runbook without pretending AWS is a cluster.

Cloud VMs are the primary primitive. The isolation boundary stays secure and simple enough for SREs and agents to reason about during incidents.

Secure by default

The default path includes the controls SREs ask for later.

Skiff puts the production checklist into the workflow: least privilege, release signatures, immutable history, secret references, approval gates, and audit records. Teams do not have to remember the secure version of the command under pressure.

Least privilege is generated with the service. Runtime access, deployer access, and operator access are separate by default.

Production releases are signed. Runners verify release manifests and digests before serving traffic.

Secrets stay referenced. Object state and events carry secret references and redacted summaries, not plaintext.

1. Least privilege IAM: Scoped roles for deployer, runner, and operator actions.

2. Signed and pinned releases: Manifest, digest, service, environment, and expiry are verified.

3. Runtime guardrails: No SSH-first debug, secret references, and redacted operational output.

4. Audit by default: Actor, trace ID, target, risk, and summary on every mutation.

Access starts from the operation. The service gets only what it needs, and high-risk actions require explicit approval.

The release is immutable, not a pointer. Skiff verifies signed manifests and runtime manifests before rollout.

Runtime safety does not rely on memory. The runner can read durable state directly and verify before starting the workload.

History survives handoffs. Humans and agents get the same traceable event stream when continuing an operation.
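As a rough illustration of the "signed and pinned" check described above, the sketch below verifies a manifest's signature, artifact digest, service, environment, and expiry before a release is allowed to serve. It is not Skiff's implementation: the field names (`digest`, `service`, `environment`, `expires_at`), the function names, and the use of HMAC-SHA256 as a stand-in for a real signature scheme are all assumptions made for the example.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone


def sign_manifest(manifest: dict, key: bytes) -> str:
    """HMAC-SHA256 over canonical JSON. A stand-in for a real signature scheme."""
    payload = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_release(manifest: dict, signature: str, artifact: bytes, key: bytes,
                   service: str, environment: str, now: datetime) -> list:
    """Return the names of failed checks; an empty list means the release may serve."""
    failures = []
    # Signature: the manifest must be exactly what the signer produced.
    if not hmac.compare_digest(sign_manifest(manifest, key), signature):
        failures.append("signature")
    # Digest: the artifact on disk must match the pinned hash, not a mutable tag.
    if hashlib.sha256(artifact).hexdigest() != manifest.get("digest"):
        failures.append("digest")
    # Pinning: a release signed for one service or environment is not valid elsewhere.
    if manifest.get("service") != service:
        failures.append("service")
    if manifest.get("environment") != environment:
        failures.append("environment")
    # Expiry: a missing or past expiry fails closed.
    expires = manifest.get("expires_at")
    if expires is None or datetime.fromisoformat(expires) <= now:
        failures.append("expiry")
    return failures
```

Returning every failed check, rather than stopping at the first, gives an operator or agent the full picture in one structured answer.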

Runbooks as code

Operational knowledge becomes typed, resumable workflows.

During an outage, a wiki runbook is too easy to misread or skip. In Skiff, canaries, drains, restores, failovers, rotations, and repairs are explicit sagas with typed steps, stored progress, compensation where possible, and clear events.

01 Canary gate (reversible): Shift a small slice of traffic and promote only after live health checks.

02 Restore gate (approval): Verify backup freshness, isolate the target, and record the cutover plan.

03 Rotation gate (audited): Stage the new secret reference, verify consumers, then revoke old access.

04 Failover gate (resumable): Approve route changes, validate health, and store resumable step results.

Canary deploys become an operational contract. Skiff advances only when health, logs, and SLO checks support promotion.

Restores follow a recorded plan. The runbook proves backup freshness, isolates blast radius, and records the cutover plan.

Secret rotation has a safe middle. Stage new references, verify workloads, then revoke old credentials after consumers move.

Failover is explicit and reviewable. Skiff separates plan, approval, route changes, validation, and compensation.

Every long-running operation can resume. Provider operation IDs and step results are stored before Skiff waits on cloud APIs.

Compensation is named honestly. Skiff distinguishes reversible, compensatable, partially reversible, and irreversible work.

Operators and agents see the same graph. Facts, hypotheses, risk, recommended commands, and approval requirements are structured.
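The idea of a typed, resumable saga can be sketched in a few lines. The example below is an illustration only: `Step`, `Reversibility`, and `run_saga` are hypothetical names, and a plain dictionary stands in for durable object storage. Each step's result is recorded as soon as it completes, so a second attempt skips finished steps instead of repeating them.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class Reversibility(Enum):
    """The four classes of work named in the text above."""
    REVERSIBLE = "reversible"
    COMPENSATABLE = "compensatable"
    PARTIALLY_REVERSIBLE = "partially_reversible"
    IRREVERSIBLE = "irreversible"


@dataclass(frozen=True)
class Step:
    name: str
    reversibility: Reversibility
    run: Callable[[], str]  # returns a result to record, e.g. a provider operation ID


def run_saga(op_id: str, steps: List[Step],
             store: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Run steps in order, skipping any whose result is already stored.

    `store` is a stand-in for durable storage keyed by operation ID.
    If a step raises, earlier results stay recorded and a later call resumes.
    """
    progress = store.setdefault(op_id, {})
    for step in steps:
        if step.name in progress:
            continue  # finished in an earlier attempt
        progress[step.name] = step.run()  # record the result before the next step
    return progress
```

A production version would persist the provider operation ID before waiting on the cloud API, as the text describes. This sketch only records results after each step returns.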

Canary deploys

Promotion waits for live signals.

The deploy journey connects release signing, rollout traffic, target health, SLO signals, logs, and recorded changes. The animation below shows traffic moving only as checks pass.

Traffic split: stable 100% / canary

Errors: 0.04%

p95: 142ms

Promotion checks

Signed release verified: pass

New targets healthy: pass

SLO burn within budget: pass

Audit event appended: done

Before traffic moves, the release is checked. The runner verifies signatures and digests before serving, and the operation starts with a trace ID.

Small traffic proves the runtime path. Skiff watches target health, logs, and SLO signals before increasing exposure.

Promotion is gated by live signals. Traffic advances when signals are good, pauses when they are ambiguous, and recommends repair when they are unsafe.

Completion records the change. The service control updates only after durable state is written, and the operator can explain exactly what changed.
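The advance, pause, or repair rule can be written as a small pure function. The sketch below is a guess at the shape of such a gate, not Skiff's logic: the `Signals` fields, the `promotion_decision` name, and the default error budget of 0.1% are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Signals:
    release_verified: bool
    healthy_targets: int
    total_targets: int
    error_rate: Optional[float]  # None when too little traffic has been observed


def promotion_decision(s: Signals, max_error_rate: float = 0.001) -> str:
    """Return 'advance', 'pause', or 'repair' for the next canary step."""
    if not s.release_verified:
        return "repair"  # unsafe: an unverified release must not take traffic
    if s.total_targets == 0 or s.error_rate is None:
        return "pause"  # ambiguous: nothing to judge yet
    if s.error_rate > max_error_rate:
        return "repair"  # unsafe: error budget exceeded
    if s.healthy_targets < s.total_targets:
        return "pause"  # ambiguous: some targets are not healthy yet
    return "advance"
```

Keeping the decision free of side effects means the same inputs can be logged with the audit event and replayed later to explain why traffic did or did not move.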

Built-in observability

Status, logs, health, and next actions stay in one operational view.

Skiff keeps operators from assembling context from scratch. Service status is tied to the operation, underlying cloud resources, trace ID, target health, logs, findings, and recommended commands. The same context is available as JSON for agents.

Fresh reads when it matters. Critical status can reload durable objects instead of trusting stale memory.

Findings are explicit. Malformed state, provider drift, unhealthy targets, and risky recommendations are surfaced with context.

No opaque platform layer. Operators can move between Skiff and the AWS console without decoding a second resource model.

service: payments-api prod

rollout: op_01J canary at 50%

target: i-0abc unhealthy in tg-prod-payments

logs: trace tr_01J timeout to secret provider

metric: p95 320ms, errors 0.08%, burn safe

finding: canary paused; stable serving 50%

resume: op_01J continue after credential fix

{
  "ok": false,
  "code": "CANARY_PAUSED",
  "risk": "no",
  "facts": ["one target unhealthy", "stable still serving"],
  "recommended_actions": [
    "inspect scoped logs",
    "request human approval",
    "resume op_01J"
  ]
}

Detect starts from the service journey. The operator sees rollout state and customer traffic before chasing raw telemetry.

Correlation is built into the command output. Trace IDs connect target health, logs, events, and cloud resource IDs.

Recommendations are structured. Skiff separates facts from hypotheses and marks actions as no, low, medium, or high risk.

Resume is a first-class operation. After the fix, the same operation can continue without reconstructing state from memory.
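To show what consuming this output might look like, the sketch below parses the sample payload from this section and reduces it to what an agent needs first. The `summarize` helper and its fail-closed default for a missing risk label are assumptions for illustration; only the keys visible in the sample (`ok`, `code`, `risk`, `facts`, `recommended_actions`) are taken from the page.

```python
import json

# The sample payload shown in this section, reproduced as a string.
SAMPLE = """{ "ok": false, "code": "CANARY_PAUSED", "risk": "no",
  "facts": ["one target unhealthy", "stable still serving"],
  "recommended_actions": ["inspect scoped logs", "request human approval", "resume op_01J"] }"""


def summarize(raw: str) -> dict:
    """Reduce a status document to the fields an agent acts on first."""
    doc = json.loads(raw)
    return {
        "healthy": doc["ok"],
        "code": doc.get("code"),
        # Assumption: a missing risk label is treated as high risk (fail closed).
        "risk": doc.get("risk", "high"),
        "facts": list(doc.get("facts", [])),
        # Only the first recommendation is surfaced; later ones stay in the raw document.
        "next": doc.get("recommended_actions", [])[:1],
    }
```

Because the payload separates facts from recommended actions, the consumer never has to infer state from prose.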

Agent-first tooling

Agents get JSON, context, and human gates.

Every command supports --format=json. Skiff packages the facts, trace IDs, operation IDs, recent events, risk labels, and approval requirements an agent needs to help without becoming an unreviewable controller.

Machine-readable output (--format=json): Every command can return JSON with facts, findings, command suggestions, trace ID, and operation ID.

Problem context (context): Skiff collects service state, target health, recent events, release data, and cloud resource IDs in one answer.

Risk and approval (human gate): Actions are labeled no, low, medium, or high risk. High-risk work can require two-party authorization.

Human escalation (escalate): Escalation packages the trace, proposed command, mutating flag, and approval requirement for review.

Agents do not scrape prose. JSON mode is a stable interface for status, doctor output, recommendations, and errors.

Skiff manages the context packet. The agent gets the service, operation, trace, cloud resources, recent events, and next commands together.

Risk is explicit. Commands are classified before they run, including whether they mutate state and how reversible they are.

Escalation is part of the flow. Agent escalations to humans and two-party authorization are built into high-risk operations.

Every command has JSON mode. Agents can parse facts, hypotheses, recommendations, mutating flags, trace IDs, and operation IDs without screen scraping.

Agent safety is first-class. Recommended actions carry no, low, medium, or high risk labels, plus reversibility and approval requirements.

Humans stay in the loop. High-risk actions and agent escalations can require two-party authorization before Skiff runs them.
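One way to picture the human gate is as a policy function over the risk label. The sketch below is hypothetical: the four labels come from the text, but the policy itself (agents may run "no" and "low" risk actions alone, "medium" needs one approval, "high" needs two) and the names `required_approvals` and `gate` are assumptions, not Skiff's documented behavior.

```python
RISK_ORDER = ["no", "low", "medium", "high"]


def required_approvals(risk: str, agent_max_risk: str = "low") -> int:
    """Number of human approvals needed before an action may run."""
    if risk not in RISK_ORDER:
        raise ValueError(f"unknown risk label: {risk!r}")
    if risk == "high":
        return 2  # two-party authorization
    if RISK_ORDER.index(risk) > RISK_ORDER.index(agent_max_risk):
        return 1  # above the agent's ceiling: one human must approve
    return 0


def gate(risk: str, approvals: int, agent_max_risk: str = "low") -> str:
    """Return 'run' when enough approvals are present, otherwise 'escalate'."""
    if approvals >= required_approvals(risk, agent_max_risk):
        return "run"
    return "escalate"
```

Rejecting unknown labels outright keeps a typo or a new label from being silently treated as safe.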

Common user journeys

The product surface is the work operators actually do.

Skiff is designed around operational jobs: ship safely, respond to degraded service, recover data, rotate credentials, control cloud spend, audit security posture, investigate incidents, and hand work to an agent or another human with JSON context, risk labels, and enough history to continue.

Ship a release

Compile spec, sign release, canary traffic, watch health, promote or pause.

Repair degraded service

Pull JSON context, inspect logs, classify risk, escalate or run bounded repair.

Restore state

Verify backup, isolate target, approve cutover, validate health, record the result.

Rotate credentials

Stage new reference, roll workloads, confirm consumers, revoke old access.

Control cloud cost

Inspect billable resources, explain idle capacity, right-size safely, and record savings actions.

Audit security posture

Review IAM scope, release trust, secret references, debug sessions, and audit coverage.

Investigate incident root cause

Correlate deploys, health, events, provider IDs, and operator actions into a traceable timeline.

Shipping stays in one path. The operator sees release, rollout state, traffic, health, logs, and the next safe action together.

Repair begins with observed facts. Doctor output recommends commands, labels risk, and asks for human approval when an agent should not act alone.

State recovery gets first-class treatment. Restore work includes backup freshness, risk classification, cutover, validation, and traceable results.

Credential rotation is deliberate. Skiff stages the change, verifies workloads, revokes old access, and records what changed.

Cost control starts from named resources. Skiff keeps NAT gateways, ALBs, ASGs, target groups, log groups, and state buckets visible enough to manage spend.

Security audits use the same evidence trail. Operators can review trust roots, least-privilege policy, debug posture, secret references, and mutation history together.

Root cause work gets a durable timeline. Events, releases, provider IDs, health checks, and human actions stay tied to operation and trace IDs.

Implementation shape

The internals exist to preserve the operator promise.

The operator-facing promise rests on durable object state, a stateless facade, a direct CLI fallback, immutable history, compare-and-swap (CAS) controls, and typed sagas.

skiff deploy payments-api --canary
  -> write operation intent
  -> create signed release manifest
  -> CAS service control
  -> watch target health
  -> append audit event

skiff --direct status payments-api
  -> read object state directly
  -> rebuild enough view to recover

Object storage is durable truth: State lives in signed or schema-versioned objects.

skiffd is a rebuildable facade: Indexes and streams are fast views, not the database.

Runner verifies before serving: VM-local runtime checks manifests and artifacts directly.

Operations are auditable: Actor, trace ID, target, risk, and summary are recorded.
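The relationship between durable truth and a rebuildable facade can be sketched as follows. This is an illustration under assumptions, not skiffd's design: `EventLog` is an in-memory stand-in for append-only objects in object storage, and `Facade`, `record`, and `rebuild_status` are hypothetical names. The point it shows is ordering: the durable write happens first, and the fast view can always be rebuilt from it.

```python
from typing import Dict, List


class EventLog:
    """In-memory stand-in for append-only objects in durable storage."""

    def __init__(self) -> None:
        self._events: List[dict] = []

    def append(self, event: dict) -> None:
        self._events.append(dict(event))

    def read_all(self) -> List[dict]:
        return [dict(e) for e in self._events]


def rebuild_status(events: List[dict]) -> Dict[str, str]:
    """Fold the event history into a service -> latest state view."""
    status: Dict[str, str] = {}
    for event in events:
        status[event["service"]] = event["state"]
    return status


class Facade:
    """A fast view over the log; losing it loses no truth."""

    def __init__(self, log: EventLog) -> None:
        self.log = log
        self.view = rebuild_status(log.read_all())

    def record(self, service: str, state: str, actor: str,
               trace_id: str, risk: str, summary: str) -> None:
        event = {"service": service, "state": state, "actor": actor,
                 "trace_id": trace_id, "risk": risk, "summary": summary}
        self.log.append(event)      # durable write first
        self.view[service] = state  # then the in-memory view
```

If the facade process is lost, calling `rebuild_status(log.read_all())` recovers the same view, which is the property a direct-read fallback relies on.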

Durable state comes first. Mutating operations write object storage before updating in-memory views.

The facade can fail without taking truth with it. skiffd powers normal UX, but the CLI can still read object state directly.

The VM is the workload boundary. Runners verify signed releases and report state transitions without relying on a cluster control plane.

Audit is part of the contract. Every mutating production operation is traceable, resumable when long-running, and explicit about risk.

Object state is the durable substrate. Release manifests, operation intents, saga graphs, events, controls, indexes, and audits have clear mutation rules.

Control docs are also lock docs. Compare-and-swap on the relevant control document prevents separate lock files and stale ownership.

Native cloud primitives stay native. Skiff uses ASGs, target groups, IAM roles, and log groups without adding a thick platform abstraction.
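Using a control document as its own lock can be illustrated with a version-checked write. The sketch below is an assumption-laden example, not Skiff's storage layer: `ObjectStore` is an in-memory stand-in for an object store that supports conditional writes, and the `control/<service>` key layout, the `active_operation` field, and `claim_rollout` are hypothetical names.

```python
from typing import Dict, Optional, Tuple


class VersionConflict(Exception):
    """Raised when the stored version differs from the one the writer read."""


class ObjectStore:
    """In-memory stand-in for an object store with conditional writes."""

    def __init__(self) -> None:
        self._objects: Dict[str, Tuple[int, dict]] = {}

    def get(self, key: str) -> Tuple[int, Optional[dict]]:
        # Version 0 with no document means the key has never been written.
        return self._objects.get(key, (0, None))

    def put_if_version(self, key: str, expected_version: int, value: dict) -> int:
        current, _ = self._objects.get(key, (0, None))
        if current != expected_version:
            raise VersionConflict(f"{key}: expected v{expected_version}, found v{current}")
        self._objects[key] = (current + 1, dict(value))
        return current + 1


def claim_rollout(store: ObjectStore, service: str, op_id: str) -> bool:
    """Try to make op_id the active operation on the service's control document."""
    key = f"control/{service}"
    version, doc = store.get(key)
    doc = dict(doc or {})
    if doc.get("active_operation") not in (None, op_id):
        return False  # another operation already owns the service
    doc["active_operation"] = op_id
    try:
        store.put_if_version(key, version, doc)  # compare-and-swap
    except VersionConflict:
        return False  # someone else wrote between our read and write
    return True
```

Since ownership lives in the same document that is being changed, there is no separate lock file to leak or go stale. A writer holding an old version simply fails its conditional write.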

Skiff

Get platform-grade operations without running Kubernetes.

Start with a secure bootstrap, ship one service through signed canary releases, and turn the recurring production work into typed runbooks.

curl -fsSL https://raw.githubusercontent.com/s1liconcow/skiff/main/scripts/install.sh | bash