Native macOS Agent

COMPUTER USE

I'm building an AI agent that can see and control my Mac — and documenting every step of the journey. What works, what doesn't, what I'm learning along the way. This page is a living build log.

$ python agent.py "Open Safari and search Google for 'Voronoi decomposition examples'"
🖥️ Screen: 2560x1440, Displays: 1
→ computer.left_click (326, 1388) — Safari in Dock
→ computer.key cmd+l — focus URL bar
→ computer.type "google.com"
→ computer.key Return — navigate
→ computer.screenshot — Google homepage loaded
→ computer.left_click (640, 380) — search field
→ computer.type "Voronoi decomposition examples"
→ computer.key Return — search
✅ Task complete. Search results displayed.
$

The Real Story

Anthropic announced their computer use feature today — AI that can see your screen and control your mouse and keyboard. My first thought wasn't "let me download it." It was "let me try to build my own before I even look at theirs."

So I told my AI assistant (running on OpenClaw) to go look up what Anthropic's version does, and just make it better. That was literally the first prompt. No planning doc, no architecture review. Just "see what they did and one-up it."

Within minutes, we had a working v1 — screenshot, send to Claude, execute the action, repeat. It worked. It also took 17 steps to open Safari and type a URL. That's when I started pushing.

"Make it better. Use advanced mathematics. Think outside the box." — That was the next prompt. And we started stacking my own ideas on top. What if we read the macOS accessibility tree instead of guessing from pixels? What if we hardcoded common sequences so the AI doesn't fumble the same task every time? What if we only take screenshots when the outcome is actually uncertain?

Each idea led to the next. The macro system came from watching Claude take 8 attempts to focus a URL bar. The accessibility integration came from realizing the OS already knows where every button is. We even explored Voronoi decomposition, Markov chains, and Kalman filters — some of it practical, some of it R&D for later.

None of this was planned from the start. It was built iteratively — one conversation, one experiment, one "what if" at a time. I'd push an idea, my AI assistant would build it, we'd test it, and the results would spark the next idea.

This is one of several projects I'm tinkering with right now, and I figured I'd start creating pages like this to act as living blogs — documenting the build as it happens. The entire v1 + v2 stack, the architecture, this website — all built in a single afternoon. I'll keep updating this page as the project develops.


Three layers that solve real problems

After testing v1, I identified the actual bottlenecks: Claude fumbles common tasks, screenshots are sent when nothing changed, and we're guessing at UI element positions when the OS already knows them. Three solutions:

01

Deterministic Action Macros

v1 took 17 API round-trips to open a URL — Claude tried clicking the URL bar, missed, tried Cmd+T, tried clicking again, finally used Cmd+L. The fix: hardcode the reliable path. open_url("google.com") = activate Safari → Cmd+L → type → Enter. One function call. Zero AI needed for known sequences.

17 steps → 1 step • Measured improvement from real testing
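A minimal sketch of what a macro layer like this could look like — the registry maps known task patterns to fixed action sequences, and anything unrecognized falls through to the AI loop. The names (MACROS, plan) and the regex are illustrative, not the project's actual code:

```python
import re

# Macro registry: known task patterns → deterministic action sequences.
# Each action is a (verb, argument) pair; the executor layer (AppleScript /
# pyautogui in the real agent) consumes them. Illustrative sketch only.
MACROS = [
    (re.compile(r"open (?:the )?url (?P<url>\S+)", re.I),
     lambda m: [
         ("activate", "Safari"),
         ("key", "cmd+l"),          # reliably focuses Safari's URL bar
         ("type", m.group("url")),
         ("key", "Return"),
     ]),
]

def plan(task: str):
    """Return a deterministic action plan if the task matches a known
    macro, else None (meaning: fall back to the AI loop)."""
    for pattern, expand in MACROS:
        m = pattern.search(task)
        if m:
            return expand(m)
    return None
```

The point of the pattern: one function call replaces an entire screenshot-and-reason loop for tasks whose reliable path is already known.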
02

macOS Accessibility Tree

Instead of making Claude guess where a button is from a screenshot, just ask the OS. macOS exposes every interactive element — buttons, text fields, links — with their positions, labels, and states via the Accessibility API. Claude gets both the screenshot AND a structured list of everything clickable. Eyes plus a screen reader.

pyobjc → AXUIElement • Every element with position, role, and label
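The output of that traversal is just a flat list of actionable elements. A sketch of the flattening step, with the tree represented as a plain dict so the output shape is easy to see — in the real agent the tree comes from pyobjc's AXUIElement calls (walking AXChildren), not from dicts:

```python
def flatten_ax_tree(element, out=None):
    """Flatten a (simulated) accessibility tree into the flat element
    list handed to Claude alongside the screenshot. Roles shown are
    real AX role names; the dict representation is illustrative."""
    if out is None:
        out = []
    role = element.get("role")
    # Keep only elements a user could act on.
    if role in {"AXButton", "AXTextField", "AXLink"}:
        out.append({
            "role": role,
            "label": element.get("label", ""),
            "position": element.get("position"),  # (x, y) on screen
        })
    for child in element.get("children", []):
        flatten_ax_tree(child, out)
    return out
```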
03

Screen Change Detection

v1 screenshots after every action, even when nothing changed. Simple pixel comparison tells us whether an action actually did something. If the screen didn't change after a click, something went wrong — retry or try a different approach. If we know what will happen (Cmd+L always focuses the URL bar), skip the screenshot entirely.

Pixel diff threshold • Skip screenshots when outcome is predictable
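The diff itself can be very simple. A sketch, assuming screenshots have already been downsampled to small same-sized pixel grids (the threshold value is illustrative):

```python
def screen_changed(before, after, threshold=0.01):
    """Decide whether an action did anything, by pixel comparison.
    `before`/`after` are same-sized 2D grids of pixel values.
    Returns True when more than `threshold` of the pixels differ."""
    total = len(before) * len(before[0])
    changed = sum(
        1
        for row_b, row_a in zip(before, after)
        for pb, pa in zip(row_b, row_a)
        if pb != pa
    )
    return changed / total > threshold
```

If this returns False after a click, the click probably missed — retry or re-plan instead of burning an API call describing an unchanged screen.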

The v2 codebase also includes experimental modules for Voronoi decomposition, Markov state chains, PCA fingerprinting, and Kalman filtering — mathematical approaches that could further optimize the system. They're built and unit-tested but not yet wired into the live agent loop. Research, not production. Yet.

What actually happens when you give it a task

v1 — What I Built First

1

Take full screenshot of the Mac

2

Scale to 1280×720, send to Claude API

3

Claude says: "Click at (234, 567)"

4

pyautogui executes the click

5

Screenshot again, send again...

•••

17 round trips to open one URL

It works. It's just painfully slow.

v2 — What I'm Building Now

1

Check: is this a known task? → Run macro (no AI needed)

2

Read accessibility tree — get all clickable elements + positions

3

Send Claude the element list + screenshot (structured context)

4

Execute action, diff the screen — did it work?

5

If predictable outcome, skip next screenshot

Fewer API calls, faster execution, reliable results

Same task, fraction of the steps.

Honest lessons from Day 1

macOS Security Is Real

Synthetic drag events are blocked at the OS level. Three different approaches (Quartz CGEvents, pyautogui, cliclick) all failed. AppleScript window management works 100% of the time. Work WITH the OS, not against it.
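For illustration, here is roughly what the AppleScript path looks like from Python — building the script string and shelling out to osascript. Execution obviously requires macOS with Accessibility permission granted; the function names are mine, not the project's:

```python
import subprocess

def move_window_script(app: str, x: int, y: int) -> str:
    """Build the AppleScript that moves an app's front window.
    Unlike synthetic drag events, this goes through the OS's own
    scripting interface, so macOS permits it."""
    return (
        f'tell application "System Events" to tell process "{app}" '
        f"to set position of front window to {{{x}, {y}}}"
    )

def move_window(app: str, x: int, y: int) -> None:
    """Run the script via osascript (macOS only)."""
    subprocess.run(["osascript", "-e", move_window_script(app, x, y)], check=True)
```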

The AI Shouldn't Figure Out "How"

Claude is brilliant at deciding WHAT to do. Making it figure out HOW is wasteful — it took 8 attempts to focus a URL bar. Hardcode the reliable path. Let Claude handle novel decisions, not routine mechanics.

The OS Knows More Than the Screenshot

macOS Accessibility API gives you every button, text field, and link with exact positions and labels. Sending Claude a screenshot and asking "where's the search bar?" is like asking someone to read a menu while the waiter is standing right there.

Most Screenshots Are Wasted

After pressing Cmd+L, the URL bar is focused. Always. Taking a screenshot to verify costs ~5 seconds and thousands of tokens to confirm something we already know with near-total certainty. The fix: skip the screenshot when the outcome is predictable.

Session Replay Changes Everything

Logging every action + screenshot to disk means you can replay any session and see exactly what happened. This is how you debug, optimize, and eventually build training data for local models.

Math Is R&D, Not Production (Yet)

I built modules for Voronoi decomposition, Markov chains, Kalman filters, and PCA fingerprinting. They're elegant. They're also solving problems I don't have yet. The accessibility tree + macros + screen diff gave me 90% of the improvement with 10% of the complexity. The math will matter when we hit scale.

The standard approach and why I moved past it

Anthropic ships a reference implementation for computer use — it runs in a Docker container with a VNC virtual display. It works. I started by copying that approach, then realized it has fundamental limitations that can't be fixed with better prompts.

📸

Screenshot Every Step

The default loop screenshots after every single action. Opening a URL = 17 round trips. Most of those screenshots confirm things we already know.

🏗️

Docker + VNC Overhead

The official approach runs in a container with a virtual display. I'm running natively on macOS — direct screen access, zero overhead, real apps.

🎯

Vision-Only Element Finding

Making AI guess button positions from pixels when the OS literally has a list of every interactive element with exact coordinates. The accessibility tree is right there.

🔄

No Memory Between Steps

Every step starts fresh — no memory of where things are, no learned shortcuts, no recognition of familiar screens. Like getting amnesia between every action.

What's proven vs what's next

v1 tested: 17 steps • URL navigation (brute force) • Measured baseline
v2 macro: 1 step • Same task, deterministic macro • Code written, testing next
Window move: 300px ✓ • AppleScript, pixel accurate • Tested & verified
v2 modules: 6 • Math layers built (Day 1) • Voronoi · Markov · Kalman · PCA · Macros · Planner

The roadmap to autonomous perception

The v2 mathematical modules are built and tested in isolation. Next phase: wire them into the live agent loop, benchmark against v1, and separate perception from cognition entirely.

Self-Training YOLO Vision

A YOLOv8 model trained to identify UI elements locally — buttons, text fields, menus, icons — in approximately 5 milliseconds.

  • Every Claude API fallback generates free labeled training data
  • macOS Accessibility tree provides bounding box ground truth
  • Self-improving: accuracy compounds with every session
  • Goal: eliminate API calls for perception entirely
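The ground-truth step is mechanical: an accessibility bounding box plus the screen size is exactly what a YOLO training label needs. A sketch of the conversion, assuming a pixel-space box with top-left origin (the class-id mapping is hypothetical):

```python
def ax_to_yolo(bbox, screen_w, screen_h, class_id):
    """Convert an Accessibility-API bounding box (x, y, w, h in pixels,
    origin top-left) into a YOLO training label: class id plus centre
    and size normalized to [0, 1]."""
    x, y, w, h = bbox
    cx = (x + w / 2) / screen_w
    cy = (y + h / 2) / screen_h
    return (class_id, cx, cy, w / screen_w, h / screen_h)
```

Every time the agent falls back to the Claude API, the screenshot it sent plus the AX boxes on screen at that moment become one more free training pair.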

Perception ↔ Cognition Split

The key architectural insight: seeing what's on screen (fast, local, free) and deciding what to do about it (reasoning, planning) are fundamentally different problems.

  • YOLO handles all perception at ~5ms per frame
  • Claude API reserved strictly for reasoning and planning
  • Token costs drop by an order of magnitude
  • Response latency approaches real-time interaction

Built by Bryan Brodsky

I build AI systems that do real work on real machines. This project combines systems engineering, information theory, computational geometry, and signal processing into something that actually works — not a demo, not a proof of concept, but a tool I use every day.

CISSP • Former IT Director, Absci (NASDAQ: ABSI) • AI Systems Engineer