Native macOS Agent

COMPUTER USE

I'm building an AI agent that can see and control my Mac — and documenting every step of the journey. What works, what doesn't, what I'm learning along the way. This page is a living build log.

$ python agent.py "Open Safari and search Google for 'Voronoi decomposition examples'"
🖥️ Screen: 2560x1440, Displays: 1
→ computer.left_click (326, 1388) — Safari in Dock
→ computer.key cmd+l — focus URL bar
→ computer.type "google.com"
→ computer.key Return — navigate
→ computer.screenshot — Google homepage loaded
→ computer.left_click (640, 380) — search field
→ computer.type "Voronoi decomposition examples"
→ computer.key Return — search
✅ Task complete. Search results displayed.
$

The Real Story

Anthropic announced their computer use feature today — AI that can see your screen and control your mouse and keyboard. My first thought wasn't "let me download it." It was "let me try to build my own before I even look at theirs."

So I told my AI assistant (running on OpenClaw) to go look up what Anthropic's version does, and just make it better. That was literally the first prompt. No planning doc, no architecture review. Just "see what they did and one-up it."

Within minutes, we had a working v1 — screenshot, send to Claude, execute the action, repeat. It worked. It also took 17 steps to open Safari and type a URL. That's when I started pushing.

"Make it better. Use advanced mathematics. Think outside the box." — That was the next prompt. And we started stacking my own ideas on top. What if we read the macOS accessibility tree instead of guessing from pixels? What if we hardcoded common sequences so the AI doesn't fumble the same task every time? What if we only take screenshots when the outcome is actually uncertain?

Each idea led to the next. The macro system came from watching Claude take 8 attempts to focus a URL bar. The accessibility integration came from realizing the OS already knows where every button is. We even explored Voronoi decomposition, Markov chains, and Kalman filters — some of it practical, some of it R&D for later.

None of this was planned from the start. It was built iteratively — one conversation, one experiment, one "what if" at a time. I'd push an idea, my AI assistant would build it, we'd test it, and the results would spark the next idea.

This is one of several projects I'm tinkering with right now, and I figured I'd start creating pages like this to act as living blogs — documenting the build as it happens. The entire v1 + v2 stack, the architecture, this website — all built in a single afternoon. I'll keep updating this page as the project develops.


Three layers that solve real problems

After testing v1, I identified the actual bottlenecks: Claude fumbles common tasks, screenshots are sent when nothing changed, and we're guessing at UI element positions when the OS already knows them. Three solutions:

01

Deterministic Action Macros

v1 took 17 API round-trips to open a URL — Claude tried clicking the URL bar, missed, tried Cmd+T, tried clicking again, finally used Cmd+L. The fix: hardcode the reliable path. open_url("google.com") = activate Safari → Cmd+L → type → Enter. One function call. Zero AI needed for known sequences.

17 steps → 1 step • Measured improvement from real testing
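A minimal sketch of what a macro layer like this could look like — the registry maps known task patterns to fixed action sequences, and anything unrecognized falls through to the AI loop. The names (MACROS, plan) and the regex are illustrative, not the project's actual code:

```python
import re

# Macro registry: known task patterns → deterministic action sequences.
# Each action is a (verb, argument) pair; the executor layer (AppleScript /
# pyautogui in the real agent) consumes them. Illustrative sketch only.
MACROS = [
    (re.compile(r"open (?:the )?url (?P<url>\S+)", re.I),
     lambda m: [
         ("activate", "Safari"),
         ("key", "cmd+l"),          # reliably focuses Safari's URL bar
         ("type", m.group("url")),
         ("key", "Return"),
     ]),
]

def plan(task: str):
    """Return a deterministic action plan if the task matches a known
    macro, else None (meaning: fall back to the AI loop)."""
    for pattern, expand in MACROS:
        m = pattern.search(task)
        if m:
            return expand(m)
    return None
```

The point of the pattern: one function call replaces an entire screenshot-and-reason loop for tasks whose reliable path is already known.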
02

macOS Accessibility Tree

Instead of making Claude guess where a button is from a screenshot, just ask the OS. macOS exposes every interactive element — buttons, text fields, links — with their positions, labels, and states via the Accessibility API. Claude gets both the screenshot AND a structured list of everything clickable. Eyes plus a screen reader.

pyobjc → AXUIElement • Every element with position, role, and label
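The output of that traversal is just a flat list of actionable elements. A sketch of the flattening step, with the tree represented as a plain dict so the output shape is easy to see — in the real agent the tree comes from pyobjc's AXUIElement calls (walking AXChildren), not from dicts:

```python
def flatten_ax_tree(element, out=None):
    """Flatten a (simulated) accessibility tree into the flat element
    list handed to Claude alongside the screenshot. Roles shown are
    real AX role names; the dict representation is illustrative."""
    if out is None:
        out = []
    role = element.get("role")
    # Keep only elements a user could act on.
    if role in {"AXButton", "AXTextField", "AXLink"}:
        out.append({
            "role": role,
            "label": element.get("label", ""),
            "position": element.get("position"),  # (x, y) on screen
        })
    for child in element.get("children", []):
        flatten_ax_tree(child, out)
    return out
```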
03

Screen Change Detection

v1 screenshots after every action, even when nothing changed. Simple pixel comparison tells us whether an action actually did something. If the screen didn't change after a click, something went wrong — retry or try a different approach. If we know what will happen (Cmd+L always focuses the URL bar), skip the screenshot entirely.

Pixel diff threshold • Skip screenshots when outcome is predictable
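The diff itself can be very simple. A sketch, assuming screenshots have already been downsampled to small same-sized pixel grids (the threshold value is illustrative):

```python
def screen_changed(before, after, threshold=0.01):
    """Decide whether an action did anything, by pixel comparison.
    `before`/`after` are same-sized 2D grids of pixel values.
    Returns True when more than `threshold` of the pixels differ."""
    total = len(before) * len(before[0])
    changed = sum(
        1
        for row_b, row_a in zip(before, after)
        for pb, pa in zip(row_b, row_a)
        if pb != pa
    )
    return changed / total > threshold
```

If this returns False after a click, the click probably missed — retry or re-plan instead of burning an API call describing an unchanged screen.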

The v2 codebase also includes experimental modules for Voronoi decomposition, Markov state chains, PCA fingerprinting, and Kalman filtering — mathematical approaches that could further optimize the system. They're built and unit-tested but not yet wired into the live agent loop. Research, not production. Yet.

What actually happens when you give it a task

v1 — What I Built First

1

Take full screenshot of the Mac

2

Scale to 1280×720, send to Claude API

3

Claude says: "Click at (234, 567)"

4

pyautogui executes the click

5

Screenshot again, send again...

•••

17 round trips to open one URL

It works. It's just painfully slow.

v2 — What I'm Building Now

1

Check: is this a known task? → Run macro (no AI needed)

2

Read accessibility tree — get all clickable elements + positions

3

Send Claude the element list + screenshot (structured context)

4

Execute action, diff the screen — did it work?

5

If predictable outcome, skip next screenshot

Fewer API calls, faster execution, reliable results

Same task, fraction of the steps.

Honest lessons from Day 1

macOS Security Is Real

Synthetic drag events are blocked at the OS level. Three different approaches (Quartz CGEvents, pyautogui, cliclick) all failed. AppleScript window management works 100% of the time. Work WITH the OS, not against it.
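For illustration, here is roughly what the AppleScript path looks like from Python — building the script string and shelling out to osascript. Execution obviously requires macOS with Accessibility permission granted; the function names are mine, not the project's:

```python
import subprocess

def move_window_script(app: str, x: int, y: int) -> str:
    """Build the AppleScript that moves an app's front window.
    Unlike synthetic drag events, this goes through the OS's own
    scripting interface, so macOS permits it."""
    return (
        f'tell application "System Events" to tell process "{app}" '
        f"to set position of front window to {{{x}, {y}}}"
    )

def move_window(app: str, x: int, y: int) -> None:
    """Run the script via osascript (macOS only)."""
    subprocess.run(["osascript", "-e", move_window_script(app, x, y)], check=True)
```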

The AI Shouldn't Figure Out "How"

Claude is brilliant at deciding WHAT to do. Making it figure out HOW is wasteful — it took 8 attempts to focus a URL bar. Hardcode the reliable path. Let Claude handle novel decisions, not routine mechanics.

The OS Knows More Than the Screenshot

macOS Accessibility API gives you every button, text field, and link with exact positions and labels. Sending Claude a screenshot and asking "where's the search bar?" is like asking someone to read a menu while the waiter is standing right there.

Most Screenshots Are Wasted

After pressing Cmd+L, the URL bar is focused. Always. Taking a screenshot to verify costs ~5 seconds and thousands of tokens to confirm something we already know with near-total certainty. The fix: skip the screenshot when the outcome is predictable.

Session Replay Changes Everything

Logging every action + screenshot to disk means you can replay any session and see exactly what happened. This is how you debug, optimize, and eventually build training data for local models.

Math Is R&D, Not Production (Yet)

I built modules for Voronoi decomposition, Markov chains, Kalman filters, and PCA fingerprinting. They're elegant. They're also solving problems I don't have yet. The accessibility tree + macros + screen diff gave me 90% of the improvement with 10% of the complexity. The math will matter when we hit scale.

The standard approach and why I moved past it

Anthropic ships a reference implementation for computer use — it runs in a Docker container with a VNC virtual display. It works. I started by copying that approach, then realized it has fundamental limitations that can't be fixed with better prompts.

📸

Screenshot Every Step

The default loop screenshots after every single action. Opening a URL = 17 round trips. Most of those screenshots confirm things we already know.

🏗️

Docker + VNC Overhead

The official approach runs in a container with a virtual display. I'm running natively on macOS — direct screen access, zero overhead, real apps.

🎯

Vision-Only Element Finding

Making AI guess button positions from pixels when the OS literally has a list of every interactive element with exact coordinates. The accessibility tree is right there.

🔄

No Memory Between Steps

Every step starts fresh — no memory of where things are, no learned shortcuts, no recognition of familiar screens. Like getting amnesia between every action.

What's proven vs what's next

v1 tested: 17 steps • URL navigation (brute force) • Measured baseline
v2 macro: 1 step • Same task, deterministic macro • Code written, testing next
Window move: 300px ✓ • AppleScript, pixel accurate • Tested & verified
v2 modules: 6 • Math layers built (Day 1) • Voronoi · Markov · Kalman · PCA · Macros · Planner

The roadmap to autonomous perception

The v2 mathematical modules are built and tested in isolation. Next phase: wire them into the live agent loop, benchmark against v1, and separate perception from cognition entirely.

Self-Training YOLO Vision

A YOLOv8 model trained to identify UI elements locally — buttons, text fields, menus, icons — in approximately 5 milliseconds.

  • Every Claude API fallback generates free labeled training data
  • macOS Accessibility tree provides bounding box ground truth
  • Self-improving: accuracy compounds with every session
  • Goal: eliminate API calls for perception entirely
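The ground-truth step is mechanical: an accessibility bounding box plus the screen size is exactly what a YOLO training label needs. A sketch of the conversion, assuming a pixel-space box with top-left origin (the class-id mapping is hypothetical):

```python
def ax_to_yolo(bbox, screen_w, screen_h, class_id):
    """Convert an Accessibility-API bounding box (x, y, w, h in pixels,
    origin top-left) into a YOLO training label: class id plus centre
    and size normalized to [0, 1]."""
    x, y, w, h = bbox
    cx = (x + w / 2) / screen_w
    cy = (y + h / 2) / screen_h
    return (class_id, cx, cy, w / screen_w, h / screen_h)
```

Every time the agent falls back to the Claude API, the screenshot it sent plus the AX boxes on screen at that moment become one more free training pair.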

Perception ↔ Cognition Split

The key architectural insight: seeing what's on screen (fast, local, free) and deciding what to do about it (reasoning, planning) are fundamentally different problems.

  • YOLO handles all perception at ~5ms per frame
  • Claude API reserved strictly for reasoning and planning
  • Token costs drop by an order of magnitude
  • Response latency approaches real-time interaction

Built by Bryan Brodsky

I build AI systems that do real work on real machines. This project combines systems engineering, information theory, computational geometry, and signal processing into something that actually works — not a demo, not a proof of concept, but a tool I use every day.

CISSP • Former IT Director, Absci (NASDAQ: ABSI) • AI Systems Engineer