
3 posts tagged with "Python"


Building a GPT-5.3-Codex Agent Harness

· 3 min read
VictorStackAI

GPT-5.3-Codex just dropped, and I wasted no time throwing it into a custom agent harness to see whether it actually handles complex supervision loops better than its predecessors.

Why I Built It

The announcement of GPT-5.3-Codex promised significantly better instruction following for long-chain tasks. Usually, when a model claims "better reasoning," it means "more verbose." I wanted to verify whether it could actually maintain state and adhere to strict tool-use protocols without drifting into hallucination land after turn 10.

Instead of testing it on a simple script, I built codex-agent-harness, a Python-based environment that simulates a terminal, manages a tool registry, and enforces a supervisor hook to catch the agent if it tries to run `rm -rf /` (or hallucinates a command that doesn't exist).

The Solution

The harness is built around a few core components: a ToolRegistry that maps Python functions to schema definitions, and an Agent loop that manages the conversation history and context window.

One of the key features is the "Supervisor Hook." This isn't just a logger; it's an interceptor. Before the agent's chosen action is executed, the harness pauses, evaluates the safety of the call, and can reject it entirely.
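To make the interceptor idea concrete, here is a minimal sketch of what such a pre-execution hook could look like. The names (`ToolCall`, `SupervisorRejection`, `run_shell`, the banned-string list) are my own illustrative assumptions, not the harness's actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolCall:
    name: str
    args: dict

class SupervisorRejection(Exception):
    """Raised when the supervisor vetoes an action before it runs."""

def default_supervisor(call: ToolCall) -> None:
    # Reject obviously destructive shell commands before they ever execute.
    banned = ("rm -rf", "sudo", "mkfs")
    if call.name == "run_shell" and any(b in call.args.get("cmd", "") for b in banned):
        raise SupervisorRejection(f"blocked: {call.args['cmd']}")

def execute(call: ToolCall, tools: dict[str, Callable], supervisor=default_supervisor):
    # The supervisor runs *before* dispatch: a rejected call never reaches the tool.
    supervisor(call)
    if call.name not in tools:
        # Catches hallucinated tools the model invented.
        raise SupervisorRejection(f"unknown tool: {call.name}")
    return tools[call.name](**call.args)
```

The key design point is that rejection happens between the model's decision and the side effect, so a bad call costs one turn of conversation rather than a wiped disk.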

Architecture

The Tool Registry

I wanted the tool definitions to be as lightweight as possible. I used decorators to register functions, automatically generating the JSON schema needed for the API.

```python
class ToolRegistry:
    def __init__(self):
        self.tools = {}

    def register(self, func):
        """Decorator to register a tool."""
        schema = self._generate_schema(func)
        self.tools[func.__name__] = {
            "func": func,
            "schema": schema,
        }
        return func

    def _generate_schema(self, func):
        # Simplified schema generation logic
        return {
            "name": func.__name__,
            "description": func.__doc__ or "",  # undocumented tools get an empty string, not None
            "parameters": {"type": "object", "properties": {}},
        }
```
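The simplified `_generate_schema` above leaves `parameters` empty. A fuller version could introspect the function signature with the standard-library `inspect` module; here is one possible sketch, assuming a rough annotation-to-JSON-schema mapping (not the harness's actual implementation):

```python
import inspect

# Rough mapping from Python annotations to JSON-schema types.
TYPE_MAP = {int: "integer", float: "number", str: "string", bool: "boolean"}

def generate_schema(func):
    """Build a JSON-schema tool definition from a function's signature."""
    props, required = {}, []
    for name, param in inspect.signature(func).parameters.items():
        props[name] = {"type": TYPE_MAP.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default value means the model must supply it
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": {"type": "object", "properties": props, "required": required},
    }
```

Given `def read_file(path: str, max_bytes: int = 1024)`, this yields a schema with `path` as a required string and `max_bytes` as an optional integer, which is the shape most tool-calling APIs expect.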

The Code

I've published the harness as a standalone repo. It's a great starting point if you want to test new models in a controlled, local environment without spinning up a full orchestration framework.

View Code

What I Learned

  • Context Adherence is Real: GPT-5.3-Codex actually respects the system prompt's negative constraints (e.g., "Do not use sudo") much better than 4.6, which often needed reminders.
  • Structured Outputs: The model is far less prone to "syntax drift" in its JSON outputs. I didn't have to write nearly as much retry logic for malformed JSON.
  • The "Lazy" Factor: Interestingly, 5.3 seems a bit too efficient. If you don't explicitly ask for verbose logs, it will just say "Done." Great for production, bad for debugging. I had to force it to be verbose in the system prompt.
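For context, the kind of retry wrapper that older models made necessary looks roughly like this (a sketch; `call_model` is a stand-in for whatever API client re-prompts the model, not a real function from any SDK):

```python
import json

def parse_tool_call(text: str, call_model, max_retries: int = 2):
    """Parse a model reply as JSON, re-prompting on malformed output."""
    for attempt in range(max_retries + 1):
        try:
            return json.loads(text)
        except json.JSONDecodeError as err:
            if attempt == max_retries:
                raise
            # Feed the parse error back so the model can correct itself.
            text = call_model(
                f"Your last reply was not valid JSON ({err}). "
                "Resend only the corrected JSON object."
            )
```

With 5.3's tighter structured outputs, this path simply fires far less often.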


Sandboxed Python in the Browser with Pydantic's Monty

· 2 min read

Recently, Simon Willison shared research on running Pydantic's Monty in WebAssembly. Monty is a minimal, secure Python interpreter written in Rust, designed specifically for safely executing code generated by LLMs.

The key breakthrough here is the ability to run Python code with microsecond latency in a strictly sandboxed environment, either on the server (via Rust/Python) or directly in the browser (via WASM).

I've put together a demo project that explores both the Python integration and the WebAssembly build.

View Code

What is Monty?

Monty is a subset of Python implemented in Rust. Unlike Pyodide or MicroPython, which aim for full or broad compatibility, Monty is built for speed and security. It provides:

  1. Restricted Environment: No access to the host file system or network by default.
  2. Fast Startup: Ideal for "serverless" or "agentic" workflows where you need to run small snippets of code frequently.
  3. Rust Foundation: Leveraging Rust's safety and performance.

Running it in the Browser

By compiling Monty to WebAssembly, we can provide a Python REPL that runs entirely on the client side. This is perfect for interactive documentation, playground environments, or edge-side code execution.

In my demo, I've included the WASM assets and a simple HTML interface to try it out.

Why this matters for AI Agents

AI agents often need to execute code to solve problems (e.g., math, data processing). Traditional sandboxing (Docker, Firecracker) carries significant startup and resource overhead per execution. Monty offers a "sandbox-in-a-sandbox" approach that is lightweight enough to sit inside the inner loop of an LLM interaction.
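To illustrate the inner-loop pattern, here is a toy stand-in in plain Python. This is emphatically *not* Monty's actual API, and a restricted `exec` is not a real sandbox; it only shows the shape of the interaction, where code execution is just a cheap function call:

```python
def run_sandboxed(code: str, inputs: dict) -> dict:
    """Execute a snippet with no builtins and only explicit inputs in scope.

    Toy illustration only: a real sandbox like Monty interprets the code
    in an isolated runtime instead of calling exec on the host.
    """
    scope = {"__builtins__": {}, **inputs}
    exec(code, scope)
    # Return only the new bindings the snippet created.
    return {k: v for k, v in scope.items()
            if k != "__builtins__" and k not in inputs}

result = run_sandboxed("total = price * qty", {"price": 3, "qty": 4})
# result["total"] == 12
```

Because the call returns in microseconds rather than the seconds a container cold-start can take, the agent can run many small snippets per turn without blowing its latency budget.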

Check out the GitHub repository for the full source and instructions on how to run it yourself.