Building a desktop AI agent with Anthropic Computer Use and Tauri

What I learned giving AI real access to a real desktop.

Lacy Morrow

When Anthropic released Computer Use last year, every demo showed Claude navigating a browser in a sandboxed VM. Cool, but limited. I wanted to see what happens when you give it a real desktop with real apps. So I built Juno.

Juno is a native macOS app. You describe a task -- type it or say it out loud -- and an AI agent takes over your mouse and keyboard. It can open apps, fill forms, browse the web, organize files. Anything you'd do manually.

Here's what I learned building it.

Why native, not web

The obvious approach would be a browser extension or an Electron wrapper. I went with Tauri (Rust backend, lightweight webview frontend) for a few reasons:

Access to real OS APIs. macOS accessibility APIs let you read the UI element tree, get precise element positions, and interact with any app. A browser extension can only see web pages. An Electron app can shell out to osascript, but that's duct tape compared to native accessibility bindings.

Voice needs native audio. Juno has always-on voice control using a custom Whisper plugin. Running speech-to-text locally in Rust is fast and private. Doing it in a browser means getUserMedia permissions, no background processing when the tab is hidden, and sending audio somewhere for transcription.

Performance and resource usage. The Juno app uses about 80MB of RAM. An equivalent Electron app would start at 300MB+ before doing anything.

The tradeoff is platform lock-in. Juno currently only runs on macOS. Windows and Linux are possible but would need separate accessibility implementations.

The agent architecture

Early versions had a single agent loop: take screenshot, send to Claude, get back actions, execute them, repeat. This works for simple tasks but falls apart on anything complex.

The current design uses a multi-agent system:

Orchestrator
  ├── Desktop Agent (mouse, keyboard, app control)
  ├── Browser Agent (web navigation, content extraction)
  └── File Agent (filesystem operations)

The Orchestrator sees your request, decides which specialist(s) to invoke, and coordinates results. Specialists run with isolated memory -- they can't interfere with each other.

This matters for tasks like "research the top 5 coffee shops near me and put them in a spreadsheet." The Browser Agent handles the research. The Desktop Agent opens the spreadsheet app. The File Agent can save/organize the result. The Orchestrator stitches it together.
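Roughly, the split looks like this. This is a simplified sketch with made-up trait and agent names, not Juno's actual types; the point is that each specialist keeps its own memory and the Orchestrator only routes subtasks and collects results:

// Sketch of the orchestrator/specialist split (illustrative names).
// Each specialist owns its own memory, so one agent's context can't
// leak into another's.
trait Specialist {
    fn name(&self) -> &'static str;
    fn handle(&mut self, subtask: &str) -> String;
}

struct BrowserAgent { memory: Vec<String> }
struct DesktopAgent { memory: Vec<String> }

impl Specialist for BrowserAgent {
    fn name(&self) -> &'static str { "browser" }
    fn handle(&mut self, subtask: &str) -> String {
        self.memory.push(subtask.to_string());     // isolated per-agent history
        format!("[browser] result for: {subtask}") // real impl: drive the webview
    }
}

impl Specialist for DesktopAgent {
    fn name(&self) -> &'static str { "desktop" }
    fn handle(&mut self, subtask: &str) -> String {
        self.memory.push(subtask.to_string());
        format!("[desktop] result for: {subtask}") // real impl: mouse/keyboard control
    }
}

struct Orchestrator { specialists: Vec<Box<dyn Specialist>> }

impl Orchestrator {
    // Route each subtask to its specialist and stitch the results together.
    fn run(&mut self, subtasks: &[(&str, &str)]) -> Vec<String> {
        subtasks.iter().map(|(agent, task)| {
            let s = self.specialists.iter_mut()
                .find(|s| s.name() == *agent)
                .expect("unknown specialist");
            s.handle(task)
        }).collect()
    }
}

fn main() {
    let mut orch = Orchestrator {
        specialists: vec![
            Box::new(BrowserAgent { memory: vec![] }),
            Box::new(DesktopAgent { memory: vec![] }),
        ],
    };
    for line in orch.run(&[("browser", "find top 5 coffee shops"), ("desktop", "open the spreadsheet app")]) {
        println!("{line}");
    }
}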

Screenshots and the Computer Use API

The core loop for any Computer Use agent is:

  1. Take a screenshot
  2. Send it to Claude with the current task context
  3. Claude returns tool calls (click at x,y; type text; scroll; etc.)
  4. Execute the tool calls
  5. Take another screenshot to verify
  6. Repeat until done
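The skeleton of that loop looks something like this. It's a sketch, not Juno's code: capture_screenshot, ask_claude, and execute are stand-in stubs for screen capture, the Anthropic Messages API call with the computer-use tool, and OS-level input synthesis:

// Skeleton of the screenshot -> model -> action loop. The helpers are
// stubs: a real agent captures the screen, calls the Anthropic API, and
// synthesizes mouse/keyboard input.
enum Action {
    LeftClick { x: u32, y: u32 },
    TypeText { text: String },
    Scroll { dy: i32 },
    Done,
}

fn capture_screenshot() -> Vec<u8> { vec![] }              // stub: encoded image bytes

fn ask_claude(_task: &str, _screenshot: &[u8]) -> Action { // stub: real call returns tool_use blocks
    Action::Done
}

fn execute(action: &Action) {
    match action {
        Action::LeftClick { x, y } => println!("click at ({x}, {y})"),
        Action::TypeText { text } => println!("type {text:?}"),
        Action::Scroll { dy } => println!("scroll {dy}"),
        Action::Done => {}
    }
}

fn run_agent(task: &str) {
    loop {
        let shot = capture_screenshot();             // 1. observe
        let action = ask_claude(task, &shot);        // 2-3. model decides the next action
        if matches!(action, Action::Done) { break }  // 6. stop when the model is finished
        execute(&action);                            // 4. act
        // 5. the next iteration's screenshot doubles as verification of this action
    }
}

fn main() { run_agent("open Notes and create a new note"); }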

The tricky parts:

Screenshot quality vs. token cost. Full retina screenshots are 5120x2880. Sending that to the API burns tokens fast. I settled on JPEG at quality 85 for normal screenshots (saves ~60% on tokens) and PNG only for the zoom tool when Claude needs to read small text.
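The re-encoding step is simple with the image crate. A sketch, not Juno's exact code; the capture itself is out of scope here:

// Re-encode a captured frame before sending it to the API: JPEG q85 for the
// normal loop, PNG only when lossless detail matters (e.g. the zoom tool).
use std::io::Cursor;
use image::{codecs::jpeg::JpegEncoder, DynamicImage, ImageFormat};

fn encode_for_api(frame: &DynamicImage, lossless: bool) -> image::ImageResult<Vec<u8>> {
    let mut buf = Cursor::new(Vec::new());
    if lossless {
        frame.write_to(&mut buf, ImageFormat::Png)?;               // zoom tool: keep small text crisp
    } else {
        let mut enc = JpegEncoder::new_with_quality(&mut buf, 85); // big token savings in practice
        enc.encode_image(frame)?;
    }
    Ok(buf.into_inner())
}

fn main() -> image::ImageResult<()> {
    let frame = DynamicImage::new_rgb8(64, 64); // stand-in for a captured frame
    println!("jpeg bytes: {}", encode_for_api(&frame, false)?.len());
    Ok(())
}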

Coordinate accuracy. Claude returns click coordinates based on the screenshot dimensions. If your screenshot is scaled, your clicks land in the wrong place. I spent a lot of time getting the coordinate math right between logical pixels, physical pixels, and screenshot dimensions.
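The conversion itself is just two ratios once you keep track of which space a coordinate lives in. An illustrative helper, not Juno's actual code:

// Map a click coordinate from screenshot space back to the logical (point)
// coordinates the OS input APIs expect. Three sizes are in play: logical
// points, physical pixels, and the screenshot actually sent to the model.
struct Display {
    logical_w: f64, // e.g. 2560.0 points
    logical_h: f64, // e.g. 1440.0 points
    shot_w: f64,    // width of the screenshot Claude saw
    shot_h: f64,
}

fn screenshot_to_logical(d: &Display, x: f64, y: f64) -> (f64, f64) {
    // Claude's coordinates are relative to the screenshot it saw, so scale by
    // the screenshot->logical ratio. A 2x retina capture halves the numbers;
    // a downscaled screenshot scales them up.
    (x * d.logical_w / d.shot_w, y * d.logical_h / d.shot_h)
}

fn main() {
    let d = Display { logical_w: 2560.0, logical_h: 1440.0, shot_w: 1280.0, shot_h: 720.0 };
    println!("{:?}", screenshot_to_logical(&d, 400.0, 300.0)); // -> (800.0, 600.0)
}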

Action verification. After every action, you need to take a new screenshot to confirm it worked. Did the click land on the right button? Did the text appear in the right field? Without this feedback loop, the agent goes off the rails after one missed click.

Voice control

Juno uses a custom Tauri plugin wrapping OpenAI's Whisper model for local speech-to-text. The plugin:

  • Runs the whisper.cpp library in a background thread
  • Monitors the microphone continuously
  • Detects speech vs. silence using energy thresholds
  • Transcribes speech segments and sends them to the agent

All of this happens on-device. No audio is sent to any server. The wake word ("Hey Juno" by default) is just a string match on the transcription output, which is simple but works well enough.

The main challenge was getting the voice-activity detection right. Too sensitive and it triggers on keyboard typing. Too conservative and it cuts off the beginning of sentences. I ended up using a combination of energy threshold + minimum duration + silence gap detection.
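As a sketch, that detector is a small per-frame state machine. The thresholds here are illustrative, not Juno's tuned values:

// Frame-by-frame voice-activity detector combining the three signals above:
// an energy threshold, a minimum speech duration, and a silence gap before
// the segment is closed.
struct Vad {
    energy_threshold: f32,     // RMS above this counts as speech
    min_speech_frames: usize,  // ignore blips shorter than this (filters key clicks)
    silence_gap_frames: usize, // end the segment after this much silence
    speech_frames: usize,
    silence_frames: usize,
    in_speech: bool,
}

impl Vad {
    /// Feed one audio frame; returns true when a complete speech segment ends.
    fn push_frame(&mut self, frame: &[f32]) -> bool {
        let rms = (frame.iter().map(|s| s * s).sum::<f32>() / frame.len() as f32).sqrt();
        let is_speech = rms > self.energy_threshold;

        if is_speech {
            self.speech_frames += 1;
            self.silence_frames = 0;
            if self.speech_frames >= self.min_speech_frames {
                self.in_speech = true;
            }
        } else if self.in_speech {
            self.silence_frames += 1;
            if self.silence_frames >= self.silence_gap_frames {
                // Segment complete: hand the buffered audio to Whisper here.
                self.in_speech = false;
                self.speech_frames = 0;
                self.silence_frames = 0;
                return true;
            }
        } else {
            self.speech_frames = 0; // blip too short, likely keyboard noise
        }
        false
    }
}

fn main() {
    let mut vad = Vad {
        energy_threshold: 0.02,
        min_speech_frames: 5,
        silence_gap_frames: 10,
        speech_frames: 0,
        silence_frames: 0,
        in_speech: false,
    };
    let loud = vec![0.1_f32; 512];
    let quiet = vec![0.0_f32; 512];
    for f in std::iter::repeat(&loud).take(20).chain(std::iter::repeat(&quiet).take(15)) {
        if vad.push_frame(f) {
            println!("segment complete -> send buffered audio to Whisper");
        }
    }
}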

Security model

Giving an AI agent mouse and keyboard access is, obviously, a security concern. Juno has a tiered permission system:

Level 1 (Read-only): Screenshots and accessibility tree reading only
Level 2 (Safe actions): Mouse movement, scrolling, non-destructive keyboard input
Level 3 (Standard): Clicking, typing, app launching
Level 4 (Elevated): File operations, shell commands (with whitelist)
Level 5 (Full): Everything, including system settings changes

By default, Juno runs at Level 3. Users can lower it for specific tasks or raise it when they trust the operation. There's also a tool approval system where certain sensitive operations (deleting files, sending messages, making purchases) require explicit user confirmation before the agent proceeds.
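Conceptually, the check is just an ordering over the levels: every tool call declares the level it requires, and anything above the session's ceiling is refused. A sketch with illustrative tool names, not Juno's real tool registry:

// Tiered permission check: refuse any tool that requires more than the
// session's current level.
#[derive(PartialEq, PartialOrd, Clone, Copy)]
enum PermissionLevel {
    ReadOnly = 1, // screenshots, accessibility tree
    SafeActions,  // mouse movement, scrolling
    Standard,     // clicking, typing, app launching (the default)
    Elevated,     // file ops, whitelisted shell commands
    Full,         // everything, including system settings
}

fn required_level(tool: &str) -> PermissionLevel {
    match tool {
        "screenshot" | "read_ax_tree" => PermissionLevel::ReadOnly,
        "move_mouse" | "scroll" => PermissionLevel::SafeActions,
        "click" | "type" | "open_app" => PermissionLevel::Standard,
        "write_file" | "run_shell" => PermissionLevel::Elevated,
        _ => PermissionLevel::Full,
    }
}

fn allowed(session_level: PermissionLevel, tool: &str) -> bool {
    required_level(tool) <= session_level
}

fn main() {
    let session = PermissionLevel::Standard; // the default
    println!("click allowed: {}", allowed(session, "click"));         // true
    println!("run_shell allowed: {}", allowed(session, "run_shell")); // false at Level 3
}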

The CLI: giving other AI agents desktop superpowers

One thing I didn't plan for initially but turned out to be really useful: packaging Juno's capabilities as an MCP (Model Context Protocol) server.

npx juno-cua

This starts an MCP server that any compatible AI agent can connect to. Claude Code, Cursor, Codex -- they can all use it to take screenshots, read the accessibility tree, and control the desktop.

The use case: you're pair-programming with an AI agent and it needs to check something in a browser, or look at a design in Figma, or verify that the app you just built actually works visually. Instead of you describing what you see, the agent just looks.

What I'd do differently

Start with the CLI, not the GUI. The standalone desktop app is cool but the CLI/MCP integration has more immediate developer utility. If I were starting over, I'd ship the CLI first and add the GUI later.

Don't build your own browser automation. I spent weeks on browser control (injecting scripts, handling navigation events, dealing with iframes). Playwright or Puppeteer via a subprocess would have been faster to ship and more reliable.

Test on built apps early. macOS permissions (screen recording, accessibility) behave differently in dev mode vs. a signed .app bundle. I discovered this the hard way after building features that worked in dev but broke in production because the bundle identifier was different.

Try it

Source: github.com/lacymorrow/juno

Website: junebug.ai

CLI: npx juno-cua

macOS 14+ required. It's in beta -- rough edges exist. If you're interested in AI desktop automation or Computer Use, I'd appreciate feedback.
