Operations

The service's operational behavior is built around safety, isolation, and full observability of every attempt. The service must be able to perform dangerous infra actions, but only through typed contracts, policy checks, and controlled runners.

Execution Lifecycle

sequenceDiagram
    participant CP as control-plane-service
    participant D as dispatcher
    participant R as runner
    participant V as Vault
    participant A as artifact-worker
    participant S as S3/MinIO
    CP->>D: executeJob(JobSpec)
    D->>D: validate and create attempt
    D->>R: lease attempt
    R->>D: heartbeat(started)
    R->>V: resolve secret refs
    R->>R: execute typed runner action
    R->>D: heartbeat(progress)
    R->>A: submit logs and outputs
    A->>S: write artifact bundle
    A->>D: artifact refs
    R->>D: complete attempt
    D->>CP: ExecutionResult callback

Runner Isolation Policy

Basic rules:

Each runner is started in an isolated pod/process sandbox.
A runner has the minimal network permissions specified by runner_requirements.
Secret values are available only in runtime memory and only to the specific attempt.
The file system workspace is ephemeral.
Raw outputs pass through secret masking before they are stored.
Any destructive operation must be explicitly marked in the JobSpec.

Prohibited:

generic uncontrolled shell;
passing secret values in inputs.variables;
writing a kubeconfig, SSH key, or cloud credentials to artifacts;
executing package content that has not passed marketplace approval or the trusted source policy;
changing the plan or operation status directly.
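
The rule that destructive operations must be explicitly marked can be enforced at validation time. The sketch below is only an illustration: the `destructive` field name and the set of job types treated as destructive are assumptions, not part of a confirmed JobSpec schema.

```python
# Assumption: which job types count as destructive is defined by policy;
# this set is illustrative only.
DESTRUCTIVE_JOB_TYPES = frozenset({
    "kubernetes.manifest.delete",
    "helm.uninstall",
    "opentofu.destroy",
})


def check_destructive_marking(job_spec: dict) -> None:
    """Reject a destructive job that is not explicitly marked as such."""
    job_type = job_spec["job_type"]
    # "destructive" is a hypothetical field name used for this sketch.
    marked = job_spec.get("destructive", False) is True
    if job_type in DESTRUCTIVE_JOB_TYPES and not marked:
        raise PermissionError(f"{job_type} must be explicitly marked destructive")
```

Requiring the literal value `True` (rather than any truthy value) keeps an accidental string such as `"false"` from passing the check.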

Approved Commands

Execution is allowed only through typed job types:

allowed_job_types:
  ansible:
    - ansible.playbook.run
    - ansible.role.run
    - ansible.inventory.validate
    - ansible.check_mode.run
  kubernetes:
    - kubernetes.manifest.apply
    - kubernetes.manifest.delete
    - kubernetes.wait
    - kubernetes.rollout.status
    - kubernetes.resource.patch
    - kubernetes.resource.read
  helm:
    - helm.install
    - helm.upgrade
    - helm.rollback
    - helm.uninstall
    - helm.test
    - helm.template
  opentofu:
    - opentofu.plan
    - opentofu.apply
    - opentofu.destroy
    - opentofu.output
    - opentofu.state.inspect
  ssh:
    - ssh.command.run
    - ssh.file.upload
    - ssh.file.download
    - ssh.service.restart
    - ssh.system.fact.collect
  verification:
    - verify.http.endpoint
    - verify.tcp.connect
    - verify.kubernetes.condition
    - verify.postgres.ready
    - verify.resource.capability
    - verify.custom.probe

A new job type is added as a protocol extension: runner implementation, contract schema, policy rules, evidence behavior, and docs.
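
A dispatcher-side allowlist check could look roughly like the following. This is a minimal sketch: the table holds only a subset of the job types listed above, and the function name `validate_job_type` is hypothetical.

```python
# Subset of the allowed_job_types table above, keyed by family.
ALLOWED_JOB_TYPES = {
    "ansible": ["ansible.playbook.run", "ansible.role.run"],
    "helm": ["helm.install", "helm.upgrade", "helm.rollback"],
    "opentofu": ["opentofu.plan", "opentofu.apply", "opentofu.destroy"],
    "verification": ["verify.http.endpoint", "verify.tcp.connect"],
}

# Flatten once; the family key and the job type prefix may differ
# (for example "verification" vs "verify."), so lookup is by full name.
_ALLOWED = frozenset(
    job_type for types in ALLOWED_JOB_TYPES.values() for job_type in types
)


def validate_job_type(job_type: str) -> None:
    """Raise ValueError for any job type outside the allowlist."""
    if job_type not in _ALLOWED:
        raise ValueError(f"job type not allowed: {job_type!r}")
```

Matching on the full job type name, rather than on a prefix, means an unknown action inside a known family (say `helm.exec`) is rejected as well.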

Secrets

Vault is the only source of secret material.

Rules:

A JobSpec contains only vault://... refs.
The dispatcher does not disclose secret values.
The runner obtains a secret via a lease-bound token or a service identity.
Secret redaction is applied to logs, heartbeat messages, errors, and artifacts.
Secret access audit records must be linked to the attempt_id.
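
Two of these rules lend themselves to a short illustration: checking that a value is a Vault ref, and masking resolved secret values in text before it leaves the runner. The helpers below are a sketch under assumptions; the function names and the `***` mask are hypothetical, and a real redactor would likely also cover encoded forms of a secret.

```python
VAULT_PREFIX = "vault://"


def is_vault_ref(value: str) -> bool:
    """True if the value looks like a non-empty vault:// reference."""
    return value.startswith(VAULT_PREFIX) and len(value) > len(VAULT_PREFIX)


def redact(text: str, secret_values: list[str], mask: str = "***") -> str:
    """Replace every occurrence of a known secret value with the mask."""
    # Longest first, so a secret that contains another secret is masked whole.
    for value in sorted(set(secret_values), key=len, reverse=True):
        if value:  # an empty string would match everywhere
            text = text.replace(value, mask)
    return text
```

For example, `redact("password=hunter2", ["hunter2"])` returns `"password=***"`.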

Artifacts

Each attempt produces an artifact bundle.

Minimal bundle:

artifact_manifest:
  attempt_id: exec-attempt-123
  job_type: ansible.playbook.run
  files:
    - path: logs/stdout.log
      stream: stdout
      digest: sha256:abc
    - path: logs/stderr.log
      stream: stderr
      digest: sha256:def
    - path: result/result.json
      type: normalized_result
      digest: sha256:ghi
  redaction:
    applied: true
    rules_version: 1

The UI shows logs/artifacts through control-plane-service, which requests the refs and grants product-level access.
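
The digest fields in the manifest above can be computed with a standard SHA-256 hash over the file content. The following sketch assumes the `sha256:<hex>` format shown in the example; the helper names `file_entry` and `build_manifest` are hypothetical.

```python
import hashlib


def file_entry(path: str, content: bytes, **attrs: str) -> dict:
    """Build one manifest file entry with a sha256 digest of the content."""
    digest = "sha256:" + hashlib.sha256(content).hexdigest()
    return {"path": path, **attrs, "digest": digest}


def build_manifest(attempt_id: str, job_type: str, files: list[dict],
                   rules_version: int = 1) -> dict:
    """Assemble a manifest in the shape of the minimal bundle above."""
    return {
        "artifact_manifest": {
            "attempt_id": attempt_id,
            "job_type": job_type,
            "files": files,
            # Assumes redaction already ran on the content before hashing.
            "redaction": {"applied": True, "rules_version": rules_version},
        }
    }
```

Hashing after redaction matters here: the digest should describe the bytes that are actually stored, not the raw output.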

Cancellation

Cancellation flow:

sequenceDiagram
    participant CP as control-plane-service
    participant D as dispatcher
    participant R as runner
    participant A as artifact-worker
    CP->>D: cancelAttempt(attempt_id)
    D->>D: mark cancellation requested
    R->>D: heartbeat
    D-->>R: cancellation_requested=true
    R->>R: stop safely
    R->>A: upload partial artifacts
    R->>D: complete(cancelled)
    D->>CP: ExecutionResult(cancelled)

If the runner does not respond, the dispatcher waits for a grace period and then moves the attempt to lost or cancelled, depending on policy.
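
The grace-period decision can be sketched as a small pure function. The parameter names and the idea of passing the policy outcome as `unresponsive_status` are assumptions made for illustration; the source only states that the result is lost or cancelled by policy.

```python
from typing import Optional


def resolve_cancellation(requested_at: float, now: float, grace_seconds: float,
                         runner_completed: bool,
                         unresponsive_status: str = "lost") -> Optional[str]:
    """Return the terminal status, or None while the dispatcher should keep waiting."""
    if unresponsive_status not in ("lost", "cancelled"):
        raise ValueError("policy outcome must be 'lost' or 'cancelled'")
    if runner_completed:
        # The runner stopped safely and reported complete(cancelled).
        return "cancelled"
    if now - requested_at < grace_seconds:
        return None  # still inside the grace period
    return unresponsive_status
```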

Timeouts

The timeout is set in the JobSpec and may be capped by service policy.

Behavior:

the dispatcher tracks the deadline;
the runner receives a cancellation request;
partial logs are preserved;
the final status becomes timed_out if the runner confirmed the timeout, or lost if the heartbeat disappeared;
control-plane-service decides whether to retry, issue a PlanPatch, or request manual intervention.
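
As a rough sketch of the two rules above (the policy cap and the final status), assuming timeouts are expressed in seconds and that heartbeat liveness is already known to the dispatcher; both function names are hypothetical:

```python
from typing import Optional


def effective_timeout(requested_seconds: float, policy_max_seconds: float) -> float:
    """The JobSpec timeout, capped by service policy."""
    return min(requested_seconds, policy_max_seconds)


def status_after_deadline(runner_confirmed_timeout: bool,
                          heartbeat_alive: bool) -> Optional[str]:
    """Final status once the deadline has passed, or None if still undecided."""
    if runner_confirmed_timeout:
        return "timed_out"
    if not heartbeat_alive:
        return "lost"
    # The runner is alive but has not confirmed yet: keep waiting.
    return None
```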

Retries

A retry always creates a new attempt.

flowchart LR
    NodeRun1[node_run attempt 1 failed] --> CP[control-plane decision]
    CP --> NodeRun2[node_run attempt 2]
    NodeRun2 --> Attempt2[execution_attempt new id]

The execution plane does not decide retry policy itself. It can only return a failure category and retry hints.
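
The "new attempt per retry" rule can be illustrated with an immutable record that never reuses an identifier. This is a sketch only: the `Attempt` fields and the UUID-based id format are assumptions, and the decision to call `next_attempt` is assumed to stay with control-plane-service.

```python
import uuid
from dataclasses import dataclass, field


def _new_attempt_id() -> str:
    # Hypothetical id format; the source only shows "exec-attempt-123".
    return f"exec-attempt-{uuid.uuid4().hex}"


@dataclass(frozen=True)
class Attempt:
    node_run_id: str
    number: int
    attempt_id: str = field(default_factory=_new_attempt_id)


def next_attempt(previous: Attempt) -> Attempt:
    """Create a fresh attempt for a retry; the previous record is left untouched."""
    return Attempt(node_run_id=previous.node_run_id, number=previous.number + 1)
```

Because the record is frozen, a failed attempt keeps its own status, logs, and artifacts, and the retry cannot overwrite them.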

Idempotency

Idempotency protects the API from repeated delivery of a command.

Rules:

executeJob is idempotent by idempotency_key;
completeAttempt is idempotent by attempt_id + terminal payload digest;
a repeated heartbeat is append-only;
artifact creation deduplicates by digest where safe.
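
The first two rules can be sketched with an in-memory store. This is an illustration under assumptions: a real dispatcher would persist these maps in its database, and the class name, method names, and the choice to reject a conflicting terminal payload are hypothetical.

```python
import hashlib
import json


class AttemptStore:
    """In-memory stand-in for the dispatcher's persistent state."""

    def __init__(self) -> None:
        self._attempt_by_key: dict[str, str] = {}
        self._terminal_digest: dict[str, str] = {}

    def execute_job(self, idempotency_key: str, job_spec: dict) -> str:
        """Return the existing attempt for a repeated key instead of creating one."""
        if idempotency_key not in self._attempt_by_key:
            attempt_id = f"exec-attempt-{len(self._attempt_by_key) + 1}"
            self._attempt_by_key[idempotency_key] = attempt_id
        return self._attempt_by_key[idempotency_key]

    def complete_attempt(self, attempt_id: str, payload: dict) -> bool:
        """Return True on first completion, False on an identical redelivery."""
        # Canonical JSON so that key order does not change the digest.
        encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(encoded).hexdigest()
        previous = self._terminal_digest.get(attempt_id)
        if previous is None:
            self._terminal_digest[attempt_id] = digest
            return True
        if previous == digest:
            return False  # duplicate delivery, nothing to do
        raise ValueError(f"conflicting terminal payload for {attempt_id}")
```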

Observability

Metrics:

attempts by status/job type/runner kind;
queue latency;
lease acquisition latency;
execution duration;
heartbeat lag;
artifact upload duration;
failure categories;
cancellation latency;
runner pool capacity.

Traces:

operation_id;
plan_node_id;
node_run_id;
attempt_id;
runner_instance_id.

Logs:

service logs do not contain secret values;
runner raw logs go to artifact storage;
operational errors have a stable error code.

Disaster Recovery

After a failure:

the dispatcher restores active attempts from the database;
expired leases are moved to lost;
pending outbox events are published again;
callbacks to control-plane-service are delivered again;
the artifact worker checks incomplete uploads and marks artifacts as partial or failed.
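
The lease-expiry step of recovery might be sketched as follows. The record shape (`status`, `lease_expires_at`, `attempt_id`) and the set of statuses treated as active are assumptions for illustration; only the target status lost comes from the text above.

```python
# Assumption: statuses considered "active" for recovery purposes.
ACTIVE_STATUSES = frozenset({"leased", "running"})


def mark_expired_leases_lost(attempts: list[dict], now: float) -> list[str]:
    """Move active attempts with an expired lease to 'lost'; return their ids."""
    lost_ids = []
    for attempt in attempts:
        expired = attempt["lease_expires_at"] <= now
        if attempt["status"] in ACTIVE_STATUSES and expired:
            attempt["status"] = "lost"
            lost_ids.append(attempt["attempt_id"])
    return lost_ids
```

Attempts that already reached a terminal status are skipped, so running the recovery pass twice produces no further changes.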

Extension Points

An AI agent can read logs/evidence through control-plane-service and propose a repair PlanPatch.
IAM adds service identities, permissions, and approval policy.
Audit subscribes to execution events and secret access records.
Billing uses attempt duration, runner kind, and artifact size.

These extensions are not allowed to bypass JobSpec, Vault refs, or runner isolation.
