Prime Intellect RL coding runs

Use Syva sandboxes as the isolated execution layer for coding-agent rollouts, evaluations, and reward checks.

Rollout shape

Coding-agent RL needs a repeatable environment for every trajectory: sample a task, start from a clean repository state, let the model inspect and edit files, run a deterministic grader, then return reward and metadata.

Syva sits at the execution boundary. Create one sandbox per rollout, write the task files into /workspace, expose a narrow shell tool to the model, run the grader inside the sandbox, and destroy the sandbox in teardown. The trainer and model stay outside Syva; Syva only runs the code.
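
The create-then-always-destroy lifecycle described above can be sketched as a context manager. This is a minimal illustration, not part of the Syva SDK: rollout_sandbox and FakeSandbox are hypothetical names, and the only assumption about the real sandbox object is that it has a destroy() method, as the runner later on this page uses.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def rollout_sandbox(create: Callable[[], Any]) -> Iterator[Any]:
    """Yield a fresh sandbox and destroy it when the rollout ends.

    `create` is any zero-argument factory. The only assumption about the
    object it returns is that it has a `destroy()` method.
    """
    sandbox = create()
    try:
        yield sandbox
    finally:
        # Runs on success, on exceptions, and on early exits alike.
        sandbox.destroy()


class FakeSandbox:
    """Hypothetical stand-in that shows the lifecycle without calling Syva."""

    def __init__(self) -> None:
        self.destroyed = False

    def destroy(self) -> None:
        self.destroyed = True


fake = FakeSandbox()
try:
    with rollout_sandbox(lambda: fake):
        raise RuntimeError("grader crashed")
except RuntimeError:
    pass

print(fake.destroyed)  # True
```

In a real runner the factory would wrap whatever call creates the sandbox, so a crashed grader or model error cannot leak a running sandbox.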

Set up a local runner

Start with the public Python SDK plus your model client. If you target a local or self-hosted Syva control plane, add SYVA_API_URL to the same env file.

Terminal
mkdir syva-prime-runner
cd syva-prime-runner
uv init --bare
uv add syva openai python-dotenv
printf "SYVA_API_KEY=syva_...\nOPENAI_API_KEY=sk-...\nOPENAI_MODEL=your-coding-model\n" > .env.local

Create one coding task

A task can be a JSONL row with a prompt, input files, and the command that decides reward. Real datasets add repository snapshots, hidden tests, or rubrics; this tiny fixture keeps the mechanics visible.

tasks.jsonl
{"id":"add-one","prompt":"Fix add_one so the tests pass. Only edit app.py.","files":{"app.py":"def add_one(x):\n    return x\n","test_app.py":"from app import add_one\n\n\ndef test_add_one():\n    assert add_one(2) == 3\n"},"test_command":"python -m pytest -q"}
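
Because every row is later written into /workspace, it can help to validate rows before a sandbox is created. The following is a sketch under assumptions: validate_task and REQUIRED_KEYS are illustrative names, and the path rule (relative paths only, no "..") is a suggested safeguard rather than a documented Syva requirement.

```python
import json
from pathlib import PurePosixPath
from typing import Any

# Keys the fixture above uses; adjust if your dataset schema differs.
REQUIRED_KEYS = ("id", "prompt", "files", "test_command")


def validate_task(row: str) -> dict[str, Any]:
    """Parse one JSONL row and reject tasks the runner could not execute."""
    task = json.loads(row)
    if not isinstance(task, dict):
        raise ValueError("task row must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in task]
    if missing:
        raise ValueError(f"task is missing keys: {missing}")
    if not isinstance(task["files"], dict):
        raise ValueError("task files must be an object")
    for name in task["files"]:
        path = PurePosixPath(name)
        # Keep every file under the workspace root: no absolute paths, no "..".
        if path.is_absolute() or ".." in path.parts:
            raise ValueError(f"unsafe file path: {name!r}")
    return task


row = (
    '{"id": "add-one", "prompt": "Fix add_one.", '
    '"files": {"app.py": ""}, "test_command": "python -m pytest -q"}'
)
task = validate_task(row)
print(task["id"])  # add-one
```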

Run the rollout in Syva

This runner builds a reusable image recipe from python:3.12-slim, writes the task into the sandbox, lets the model call a shell tool, and scores with pytest. Outbound network is disabled for the sandbox because the task and grader are local.

run_rollout.py
from __future__ import annotations

import argparse
import json
import os
from pathlib import Path
from textwrap import dedent
from typing import Any

from dotenv import load_dotenv
from openai import OpenAI
from syva import Image, Sandbox


def load_first_task(path: Path) -> dict[str, Any]:
    with path.open() as handle:
        for line in handle:
            stripped = line.strip()
            if stripped:
                return json.loads(stripped)
    raise ValueError(f"{path} does not contain a task")


def workspace_files(task: dict[str, Any]) -> dict[str, str]:
    files = task.get("files", {})
    if not isinstance(files, dict):
        raise TypeError("task files must be an object")
    return {f"/workspace/{path}": str(contents) for path, contents in files.items()}


def create_sandbox(task: dict[str, Any]) -> Sandbox:
    image = Image.from_registry("python:3.12-slim").pip_install("pytest")
    return Sandbox.create(
        image=image,
        cpu_cores=2,
        memory_mb=2048,
        start_command="tail -f /dev/null",
        network_policy="deny-all",
        labels={"workflow": "prime-rl", "task": str(task["id"])},
        # Rollouts extract their reward and traces during the run, so there is
        # nothing worth snapshotting; ephemeral sandboxes are simply destroyed
        # on TTL expiry instead of being stopped with a snapshot.
        persistent=False,
    )


def write_task(sandbox: Sandbox, task: dict[str, Any]) -> None:
    sandbox.run_command("mkdir", args=["-p", "/workspace"], timeout_ms=30_000)
    sandbox.fs.write_files(workspace_files(task))


def run_shell(sandbox: Sandbox, command: str) -> str:
    result = sandbox.run_command(
        "bash",
        args=["-lc", command],
        cwd="/workspace",
        timeout_ms=30_000,
    )
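    # Dedent the template before filling it in: multi-line command output has
    # no leading indentation, so dedenting afterwards would strip nothing.
    template = dedent(
        """
        exit_code: {exit_code}
        stdout:
        {stdout}
        stderr:
        {stderr}
        """
    ).strip()
    return template.format(
        exit_code=result.exit_code, stdout=result.stdout, stderr=result.stderr
    ).strip()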


def ask_model(client: OpenAI, messages: list[dict[str, Any]]) -> dict[str, Any]:
    response = client.chat.completions.create(
        model=os.environ["OPENAI_MODEL"],
        messages=messages,
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "run_shell",
                    "description": "Run a shell command in the Syva sandbox.",
                    "parameters": {
                        "type": "object",
                        "properties": {"command": {"type": "string"}},
                        "required": ["command"],
                        "additionalProperties": False,
                    },
                },
            }
        ],
        tool_choice="auto",
    )
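    # Drop None-valued fields such as function_call: some endpoints reject
    # explicit nulls when the message is replayed on the next turn.
    return response.choices[0].message.model_dump(exclude_none=True)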


def run_rollout(task: dict[str, Any], max_turns: int, timeout_ms: int) -> dict[str, Any]:
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    sandbox = create_sandbox(task)

    try:
        write_task(sandbox, task)
        messages: list[dict[str, Any]] = [
            {
                "role": "system",
                "content": (
                    "You are a coding agent. Inspect the files, edit app.py, "
                    "run the tests, and stop once the tests pass."
                ),
            },
            {"role": "user", "content": str(task["prompt"])},
        ]

        for _ in range(max_turns):
            assistant_message = ask_model(client, messages)
            messages.append(assistant_message)
            tool_calls = assistant_message.get("tool_calls") or []
            if not tool_calls:
                break

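            for tool_call in tool_calls:
                # Report malformed tool arguments back to the model instead of
                # crashing the rollout.
                try:
                    command = str(json.loads(tool_call["function"]["arguments"])["command"])
                except (json.JSONDecodeError, KeyError, TypeError) as error:
                    output = f"invalid tool arguments: {error!r}"
                else:
                    output = run_shell(sandbox, command)
                messages.append(
                    {
                        "role": "tool",
                        "tool_call_id": tool_call["id"],
                        "content": output,
                    }
                )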

        score = sandbox.run_command(
            "bash",
            args=["-lc", str(task["test_command"])],
            cwd="/workspace",
            timeout_ms=timeout_ms,
        )
        final_app = sandbox.run_command("cat", args=["app.py"], cwd="/workspace", timeout_ms=30_000)

        return {
            "task_id": task["id"],
            "reward": 1.0 if score.exit_code == 0 else 0.0,
            "exit_code": score.exit_code,
            "stdout": score.stdout,
            "stderr": score.stderr,
            "timed_out": score.timed_out,
            "final_app_py": final_app.stdout,
            # Counts every message (system, user, assistant, tool), not model calls.
            "turns": len(messages),
            "sandbox_id": sandbox.id,
        }
    finally:
        sandbox.destroy()


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--task", type=Path, required=True)
    parser.add_argument("--max-turns", type=int, default=8)
    parser.add_argument("--timeout-ms", type=int, default=120_000)
    args = parser.parse_args()

    load_dotenv(".env.local")
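    # Fail before any sandbox is created if a credential or setting is missing.
    for name in ("SYVA_API_KEY", "OPENAI_API_KEY", "OPENAI_MODEL"):
        if not os.environ.get(name):
            raise RuntimeError(f"{name} is required")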

    task = load_first_task(args.task)
    result = run_rollout(task, args.max_turns, args.timeout_ms)
    print(json.dumps(result, indent=2))


if __name__ == "__main__":
    main()

Terminal
uv run python run_rollout.py --task tasks.jsonl

Output
{
  "task_id": "add-one",
  "reward": 1.0,
  "exit_code": 0,
  "stdout": ".                                                                        [100%]\n1 passed in 0.01s\n",
  "stderr": "",
  "timed_out": false,
  "final_app_py": "def add_one(x):\n    return x + 1\n",
  "turns": 5,
  "sandbox_id": "sbx_..."
}
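
The runner above scores only the first row of the task file. A minimal sketch of iterating every row and summarizing rewards follows; iter_tasks, run_all, and stub_rollout are hypothetical names, and the stub replaces run_rollout so the example runs without Syva or a model.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Iterator


def iter_tasks(path: Path) -> Iterator[dict[str, Any]]:
    """Yield every non-empty JSONL row instead of stopping at the first."""
    with path.open() as handle:
        for line in handle:
            stripped = line.strip()
            if stripped:
                yield json.loads(stripped)


def run_all(
    path: Path, rollout: Callable[[dict[str, Any]], dict[str, Any]]
) -> dict[str, Any]:
    """Run `rollout` on each task in order and summarize the rewards."""
    results = [rollout(task) for task in iter_tasks(path)]
    rewards = [float(result["reward"]) for result in results]
    return {
        "rollouts": len(results),
        "mean_reward": sum(rewards) / len(rewards) if rewards else 0.0,
        "results": results,
    }


def stub_rollout(task: dict[str, Any]) -> dict[str, Any]:
    # Hypothetical stand-in for run_rollout so the sketch runs offline.
    return {"task_id": task["id"], "reward": 1.0 if task["id"] == "a" else 0.0}


with tempfile.TemporaryDirectory() as tmp:
    tasks_path = Path(tmp) / "tasks.jsonl"
    tasks_path.write_text('{"id": "a"}\n\n{"id": "b"}\n')
    summary = run_all(tasks_path, stub_rollout)

print(summary["rollouts"], summary["mean_reward"])  # 2 0.5
```

In the real runner the callable would be something like lambda task: run_rollout(task, max_turns, timeout_ms), so each task still gets its own sandbox.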

Map it to Prime Intellect

Use run_rollout as the execution call inside a Verifiers or Prime harness. The dataset supplies the JSON row, the trainer supplies the model, and the Syva result becomes the reward signal plus rollout metadata.

Keep Syva calls at the environment boundary. Rollout state should store product-level facts such as task id, sandbox id, image ref, command output, reward, and error type. Avoid coupling the environment to host placement, cache state, or other implementation details.
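
One way to keep rollout state limited to product-level facts is to convert the runner's result dict into a small record before handing it to the harness. This is a sketch, not a Verifiers or prime-rl API: RolloutState, to_state, and the error labels "timeout" and "tests_failed" are illustrative assumptions, and "sbx_demo" is a placeholder id.

```python
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class RolloutState:
    """Product-level facts only: no host placement or cache details."""

    task_id: str
    sandbox_id: str
    image_ref: str
    reward: float
    error_type: Optional[str]
    output_tail: str


def to_state(
    result: dict[str, Any], image_ref: str, tail_chars: int = 2000
) -> RolloutState:
    """Convert a run_rollout-style result dict into a RolloutState."""
    if result.get("timed_out"):
        error_type: Optional[str] = "timeout"  # hypothetical label
    elif result["exit_code"] != 0:
        error_type = "tests_failed"  # hypothetical label
    else:
        error_type = None
    output = str(result.get("stdout", "")) + str(result.get("stderr", ""))
    return RolloutState(
        task_id=str(result["task_id"]),
        sandbox_id=str(result["sandbox_id"]),
        image_ref=image_ref,
        reward=float(result["reward"]),
        error_type=error_type,
        # Keep only the end of the output, where test summaries usually are.
        output_tail=output[-tail_chars:] if tail_chars > 0 else "",
    )


state = to_state(
    {
        "task_id": "add-one",
        "sandbox_id": "sbx_demo",
        "reward": 1.0,
        "exit_code": 0,
        "stdout": "1 passed\n",
        "stderr": "",
        "timed_out": False,
    },
    image_ref="python:3.12-slim",
)
print(asdict(state))
```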

References: Prime Intellect Verifiers overview and prime-rl environments.

Production notes

  • Validate SYVA_API_KEY during environment load so trainer workers fail before sampling tasks.
  • Create one sandbox per rollout and always destroy it in a finally block or teardown hook.
  • Use labels such as workflow, dataset, and task for observability.
  • Score with deterministic tests or a versioned grader, then store reward with the test output tail.
  • Start with one smoke rollout, then scale concurrency gradually once image builds and teardown are clean.
  • The public Sandbox.create call supports image, CPU, memory, labels, ports, start command, environment variables, and network policy. It does not expose GPU, disk-size, worker-placement, or snapshot controls.
  • HTTP ports must be declared when the sandbox is created. If a rollout needs a preview server, choose the port before calling Sandbox.create.
  • Image recipes build on first use. For high-concurrency RL runs, prebuild or reuse stable image refs before starting many rollouts.
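
Three of the notes above (early credential checks, storing the test output tail, and bounded concurrency) can be sketched with the standard library alone. The helper names require_env, output_tail, and run_bounded are hypothetical, and the thread pool is one possible way to cap concurrency, not a Syva or Prime Intellect recommendation.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterable, Mapping

# Variables the runner on this page reads from .env.local.
REQUIRED_ENV = ("SYVA_API_KEY", "OPENAI_API_KEY", "OPENAI_MODEL")


def require_env(env: Mapping[str, str] = os.environ) -> None:
    """Raise at environment load, before any task is sampled."""
    missing = [name for name in REQUIRED_ENV if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")


def output_tail(text: str, max_lines: int = 20) -> str:
    """Keep the last lines of grader output to store next to the reward."""
    if max_lines <= 0:
        return ""
    return "\n".join(text.splitlines()[-max_lines:])


def run_bounded(
    tasks: Iterable[Any], rollout: Callable[[Any], Any], max_workers: int = 2
) -> list[Any]:
    """Run rollouts with at most `max_workers` in flight; results keep task order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(rollout, tasks))


print(output_tail("collected 1 item\n.\n1 passed in 0.01s", max_lines=1))
print(run_bounded([1, 2, 3], lambda x: x * 2, max_workers=2))
```

Starting with max_workers=1 matches the single smoke rollout suggested above; raise it only after image builds and teardown behave as expected.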