AI Can Use Your Computer Now. Here's What That Actually Means.
AI Can Use Your Computer Now. Here's What That Actually Means.
On March 5, 2026, OpenAI released GPT-5.4 — the first general-purpose AI model with native computer use. Within days, it scored 75.0% on OSWorld-Verified, a benchmark designed to test whether AI can operate real computer environments. The human expert baseline is 72.4%. GPT-5.2, released months earlier, scored 47.3%.
That's a 58% improvement in a single release cycle, crossing the superhuman threshold on the way. If you weren't paying attention to computer-use agents before March 2026, the state of the field is going to surprise you. If you were paying attention, it's going to surprise you anyway.
Eighteen months ago, the idea of an AI that could look at your screen, move your cursor, click buttons, type text, and complete multi-step workflows without human intervention was a research curiosity. Today there are at least ten serious implementations — from open-source frameworks to enterprise platforms to consumer browsers — and the philosophical disagreements between them reveal more about where this is heading than the benchmarks do.
This is a map of that landscape. Not promotional. Not speculative. Just what exists, how it works, what it costs, and what the differences actually mean.
The Screenshot Loop: How All of This Works
Every computer-use agent, regardless of who built it, runs some version of the same fundamental loop:
- Take a screenshot of the current screen state
- Send that screenshot to a vision-capable AI model
- The model analyzes what it sees and decides what to do next
- It outputs a structured action: move cursor to coordinates (x, y), click, type this text, press these keys
- A harness executes that action on the actual computer
- Take another screenshot
- Repeat
This is conceptually identical to how a remote desktop user operates — looking at pixels, deciding what to do, moving a mouse. The AI never touches the DOM, never reads accessibility trees (with a few exceptions), never calls APIs behind the scenes. It literally looks at the screen and clicks things.
This matters because it means computer-use agents are universally compatible. Any application that a human can operate by looking at a screen and clicking — legacy enterprise software, desktop applications, web apps, terminal interfaces — is theoretically operable by a computer-use agent. No integration required. No API necessary. The agent uses the application the same way you do.
The differences between implementations are in everything surrounding that loop: what model powers the vision, how actions are structured, what safety constraints exist, how errors are handled, and — critically — where the agent actually runs.
The Contenders
GPT-5.4 (OpenAI)
The headline act. GPT-5.4 is the first model where computer use isn't a bolted-on capability but a native feature of the model itself. It ships as "agent mode" in ChatGPT for Pro, Plus, and Team subscribers.
On the coding side, OpenAI runs a separate product: Codex, a cloud-based software engineering agent now available on Windows (March 4, 2026) that works on tasks in parallel, each in its own sandbox preloaded with your repo. The latest coding model is GPT-5.3-Codex — a Codex-native agent that pairs frontier coding performance with general reasoning for long-horizon technical work, 25% faster than its predecessor. GPT-5.3-Codex is available natively in Cursor and VS Code as well as the Codex app and CLI.
The numbers: GPT-5.4 scores 75.0% on OSWorld-Verified (superhuman) for computer use, 92.8% on Online-Mind2Web for browser use, and 33% fewer errors per response than GPT-5.2, with a standard 272K-token context window (an experimental 1-million-token option is available via API at 2x the normal rate). For coding, GPT-5.3-Codex leads SWE-Bench Pro at 56.8% and Terminal-Bench at 77.3%. The split is deliberate — GPT-5.4 handles the general agent work, GPT-5.3-Codex handles the sustained coding.
The implementation supports two modes. In code mode, the model writes Python using Playwright or similar libraries to issue browser and desktop commands. In screenshot mode, it interprets raw screen images and issues mouse and keyboard commands directly — the classic loop.
One genuinely useful feature is configurable reasoning effort — five levels from "none" to "xhigh" that let you trade speed for depth depending on the task. Filling out a form doesn't need the same reasoning as debugging a distributed system. A complementary feature called Tool Search cuts token costs roughly in half by letting the model efficiently select which tools to invoke rather than loading everything into context.
Pricing: $2.50 per million input tokens, $15 per million output tokens. The Pro tier — which unlocks the highest reasoning levels — runs $30/$180. These numbers matter because screenshot-loop agents are token-hungry by nature. Every screenshot is a large image token. Every action-response cycle is a round trip. A complex workflow can burn through tokens fast.
Claude's Three-Track Approach (Anthropic)
Anthropic's strategy is the most architecturally interesting because it's actually three separate products solving different parts of the problem.
Computer Use API is the screenshot-loop implementation. It ships as a beta API with three tools — computer (mouse and keyboard control), bash (terminal commands), and text_editor (file manipulation). It runs inside a Docker container with a full Ubuntu desktop. The setup is explicit about what it is: a sandboxed environment where the AI operates a real computer through screenshots and clicks, exactly like the universal loop described above.
Claude Code is something different entirely. It's a terminal-native agent that works through structured tool calls — reading files, editing specific lines, running commands, searching codebases — without ever taking a screenshot. It doesn't use the computer like a human. It uses the computer like a developer: through precise, programmatic interactions with the filesystem and shell.
Claude Cowork is the newest addition — launched in January 2026 as a research preview, with Windows support added in February. Cowork packages the Computer Use capabilities into a consumer-facing desktop experience. Instead of Docker containers and API calls, you assign tasks through the Claude Desktop app — organize files, create documents, synthesize research, manage your desktop — and Claude executes them autonomously, taking screenshots and navigating your actual UI. It's Computer Use for people who don't write code, with scheduled tasks, eleven open-source agentic plugins spanning productivity, enterprise search, sales, finance, data, legal, marketing, customer support, product management, and biology research, and direct file access without manual uploads. Available for Pro ($20/month) and above.
The benchmark split tells the story. On OSWorld (the screenshot-based benchmark), Sonnet 4.6 hits 72.5% and Opus 4.6 reaches 72.7% — competitive but just under GPT-5.4's 75.0%. That's a dramatic trajectory: Sonnet 3.5 scored 14.9%, Sonnet 3.5 v2 hit 28.0%, Sonnet 3.6 reached 42.2%, Sonnet 4.5 jumped to 61.4%, and Sonnet 4.6 landed at 72.5% — from 15% to human-parity in sixteen months. On SWE-Bench Verified (the standard coding benchmark), Opus 4.6 scores around 80.8%, with GPT-5.4 close behind. On the harder SWE-Bench Pro variant, GPT-5.3-Codex leads at 56.8%.
The real distinction isn't raw scores — they're closer than you'd expect. It's the approach. Claude Code's structured tool use (file edits, terminal commands, codebase search) produces different results than GPT-5.4's more general agent mode. For tasks that look like "use a GUI application the way a human would," GPT-5.4 currently leads. For sustained multi-file software engineering workflows where the agent lives in your terminal for hours, Claude Code's structured approach has an architectural advantage. And Cowork bridges the gap for non-developers who want desktop automation without touching a terminal — a move that wiped roughly $285 billion from enterprise SaaS market caps when Anthropic launched eleven industry-specific Cowork plugins in February 2026, as investors digested the implications for legal, finance, and sales software. The screenshot loop is most universal. The structured approach is most precise. The desktop app is most accessible.
Other numbers worth knowing: Claude Code averages 21.2 independent tool calls without needing human intervention. It can sustain tasks for up to 14.5 hours. Its Agent Teams feature spawns multiple Opus instances working in parallel on different parts of a problem. And a 92% prefix caching reuse rate means repeated patterns in your workflow get dramatically cheaper over time.
Anthropic's acquisitions signal where they think this is heading. They acquired Vercept in February 2026 — a Seattle AI startup whose Vy agent could operate a remote MacBook autonomously. Vy was immediately shut down, team absorbed. This is Anthropic's second major acquisition in three months, following the Bun coding engine acquisition in December 2025 — both focused on making Claude better at operating real computers. They're investing in all three tracks simultaneously.
Perplexity Computer
Perplexity's approach might be the most ambitious in the field — and the hardest to compare to everything else on this list. Launched on February 25, 2026, Perplexity Computer isn't a screenshot-loop agent or a browser extension. It's a multi-model orchestration platform that routes tasks to specialized AI models, creates sub-agents, and runs workflows autonomously for hours, days, or weeks.
The architecture: instead of relying on a single model to see and click, Computer orchestrates 19 different frontier models, routing each subtask to the one best suited for it. Claude Opus 4.6 handles core reasoning. Gemini handles deep research. GPT-5.2 handles long-context recall. Grok handles lightweight speed tasks. Nano Banana generates images. Veo 3.1 produces video. When given a complex goal, the system breaks it into subtasks and spawns specialized sub-agents to handle each part in parallel.
The integration layer is the differentiator: Computer connects to over 400 applications — Gmail, Outlook, GitHub, Linear, Slack, Notion, Snowflake, Salesforce — through authenticated integrations, not through screenshot automation. It's closer to an enterprise workflow engine than a computer-use agent in the traditional sense.
Then in March 2026, Perplexity went further. Personal Computer puts a persistent, always-on agent on a dedicated Mac mini — giving it 24/7 access to your local files, applications, and active sessions. The AI doesn't wait for you to open a chat. It monitors triggers, executes proactive tasks, and carries work forward around the clock.
Computer for Enterprise rounds out the lineup with SOC 2 Type II compliance, SAML SSO, audit logs, and isolated cloud execution environments. In a study of over 16,000 queries benchmarked against institutional standards from McKinsey, Harvard, MIT, and BCG, Perplexity claims Computer saved their internal teams $1.6 million in labor costs and performed 3.25 years of work in four weeks.
Since launch, Perplexity has continued expanding Computer's capabilities. Recent updates include Skills (modular task templates), Model Council (consensus across multiple models for high-stakes decisions), Voice Mode, and — notably — GPT-5.3-Codex as a dedicated coding subagent, meaning Computer can now dispatch OpenAI's best coding model specifically for software tasks while routing other work to Claude or Gemini. GPT-5.4 is also now available to Pro and Max subscribers directly in Perplexity search.
Meanwhile, the Comet browser — Perplexity's Chromium-based browser with the AI search engine built in — serves as the consumer-facing entry point. Available on desktop, Android, and iOS, it lets Max subscribers choose their underlying model (Opus 4.6 by default) and handles research, content analysis, and task automation within a browsing experience that feels like a normal browser with an AI copilot.
All of this sits behind a $200/month Perplexity Max subscription — the same price tier as ChatGPT Pro. 10,000 monthly compute credits included.
Google Gemini Agent (Project Mariner)
Google's entry started as a Chrome extension research prototype and is now being absorbed into Google's primary AI interface. Project Mariner, originally powered by Gemini 2.0, has been upgraded to run on Gemini 3's advanced reasoning — Google's latest model family, which replaced Gemini 2.5 as the default in the Gemini app. The underlying model matters because Google is now on its third generation since Mariner launched, each with substantially better agentic capabilities.
The most significant development is Agent Mode in the Gemini app, which brings Mariner's browser automation directly into Google's primary AI interface. Agent Mode lets users state objectives and have Gemini orchestrate multi-step tasks — managing Calendar, organizing inboxes, browsing the web, and coordinating across Google Workspace apps. The interface shows the chat on the left and a live web preview on the right, displaying actions as they happen.
Mariner's core capability remains browser-native: an "Observe-Plan-Act" loop that scores 83.5% on WebVoyager (a web-specific benchmark), handles up to ten simultaneous tasks, and includes a "Teach & Repeat" feature for workflow replication. The constraint hasn't changed — this is browser-only, not desktop — but the integration into Google's ecosystem means it reaches billions of potential users rather than a handful of research testers.
Price: $249.99 per month for Google AI Ultra. Currently available to U.S. users only. For that price, you get an agent that's actively being folded into the broader Gemini platform — the destination is Gemini Agent as a built-in capability, not Mariner as a standalone product.
Microsoft Copilot Studio + Copilot Cowork
Microsoft's approach is the most enterprise-focused and the most pragmatic about model selection. Copilot Studio's Computer-Using Agents (CUAs) support multiple foundation models — including both Claude Sonnet 4.5 and OpenAI's own CUA model — and recommend different models for different tasks. Their documentation specifically recommends Claude Sonnet 4.5 for "dynamic user interfaces and interpretation of dense, changing dashboards."
This is notable: Microsoft, OpenAI's largest investor, ships a computer-use product that recommends a competitor's model for certain workloads. The pragmatism is telling. At the enterprise tier, nobody cares about model loyalty. They care about what works.
The biggest development is Copilot Cowork, announced March 9, 2026. Cowork brings long-running, multi-step task execution into Microsoft 365 — tasks that run for minutes or hours, coordinating actions across apps and producing real outputs along the way. Built in partnership with Anthropic, Cowork pairs Anthropic's agentic models for multi-step reasoning with Microsoft 365's data layer. It runs in a protected, sandboxed cloud environment so tasks continue safely as you switch devices. Currently in research preview with limited customers, going GA on May 1 as part of the Microsoft 365 E7 Frontier Suite.
Copilot Studio adds what individual model providers don't: credential management for secure unattended execution, Cloud PC pooling powered by Windows 365 that auto-scales based on workload demand, enterprise governance and monitoring integrated with Microsoft Purview for session traceability with screenshots. If GPT-5.4 and Claude are the engines, Microsoft is building the fleet management system — and with Cowork, they're also building the longest-duration autonomous agent in the enterprise space.
UiPath Screen Agent (ScreenPlay)
The enterprise incumbent. UiPath has spent years building robotic process automation (RPA) — software that automates repetitive business tasks by interacting with application interfaces. Screen Agent — now part of UiPath's next-generation ScreenPlay platform — is what happens when you replace traditional scripted automation with an LLM that can see and reason about screens.
Powered by Claude Opus 4.5, Screen Agent achieved 67.1% on OSWorld-Verified in January 2026, earning the #1 ranking at the time (before GPT-5.4's release pushed the ceiling to 75%). Notably, this was achieved using agentic UI automation alone — no code-based actions, no DOM parsing, no accessibility hooks. Pure visual reasoning. ScreenPlay agents understand natural language goals like "find the invoice from last month and download it" and autonomously navigate interfaces the way humans do.
Its key differentiator is "self-healing automation" — when a UI changes (a button moves, a menu gets reorganized, a dialog box looks different), the agent adapts instead of breaking. Traditional RPA scripts are brittle; they fail the moment a pixel is out of place. LLM-powered screen agents are robust because they understand intent, not just coordinates.
UiPath is also the first enterprise platform with AIUC-1 certification (AI-driven User Interface Control), a compliance framework validated by Schellman that matters in regulated industries. Cross-platform support covers Mac, Linux, and Windows. UiPath was named to G2's 2026 Best Software Awards in five categories including Best Agentic AI Software.
AGI Company — OSAgent
The highest raw benchmark score belongs to an outfit called AGI Company, whose OSAgent hits 76.26% on OSWorld — above GPT-5.4's 75.0% and above the human expert baseline. The agent was trained to continuously self-check its actions and verify outcomes in real time, correcting on the next turn when a step fails. Details beyond the benchmark are sparse. The company appears to be a specialized computer-agent shop rather than a general-purpose AI lab. Worth watching, but the lack of public documentation makes it hard to evaluate beyond the headline number.
Simular Agent S2 / S3 (Open Source)
The most interesting open-source trajectory in the field. Simular's Agent S2 introduced a modular framework that works purely from raw screenshots — no accessibility trees, no DOM parsing, no application-specific hooks. The architecture splits the job: one module reads the UI, another handles high-level planning, another executes low-level clicking. This compositional approach sustains accuracy over very long action sequences better than a single monolithic model. S2 scored 48.8% on OSWorld and 54.3% on AndroidWorld, surpassing UI-TARS at 46.8%.
Then Agent S3 arrived — scoring 72.6% on OSWorld with Behavior Best-of-N (bBoN) scaling, surpassing the human expert baseline. The jump from S2 to S3 came from three architectural changes: a flat worker-only policy (removing S2's manager-worker hierarchy), hybrid GUI + code execution (the agent can generate and run Python/Bash alongside GUI interactions), and bBoN scaling (selecting the best outcome from multiple rollouts). The full trajectory — Agent S (20.6%) → S2 (48.8%) → S3 (72.6%) — happened in roughly one year.
Agent S3 matters because it's fully open-source and model-agnostic. If you want to understand how screenshot-loop agents actually work under the hood — and how quickly open-source can close the gap with proprietary systems — this is where to look.
Fellou
Self-described as the "first agentic AI browser," Fellou's distinguishing feature is transparency. Before executing any workflow, it shows you the plan and lets you inspect and edit it — unlike other agentic browsers that operate as black boxes. It builds "agentic memory" from your browser history, integrates with local applications and file management, and supports task scheduling for recurring automated workflows. Cost transparency is built in — it gives upfront estimates of the credits needed for any complex action.
Fellou is a bet on the proposition that users want to see what the agent is doing before it does it — a planning-then-execution model rather than the fire-and-forget approach.
The Approaches Are Not Converging
It would be tempting to frame this as a horse race — ten implementations all climbing toward the same summit. That's not what's happening. There are at least five fundamentally different philosophies about what computer use should be, and they're diverging, not converging.
Philosophy 1: The universal screenshot loop. GPT-5.4, Anthropic's Computer Use API, UiPath Screen Agent, AGI Company's OSAgent. The AI literally sees your screen and clicks things. Maximum compatibility, minimum assumptions. Works on any application, any platform, any interface. The downside: it's slow (every action requires a full screenshot round trip), expensive (vision tokens add up), and operates at the abstraction level of pixel coordinates rather than semantic understanding.
Philosophy 2: Structured tool use. Claude Code, and to some extent Microsoft Copilot Studio. The AI doesn't pretend to be a human looking at a screen. It uses programmatic interfaces — reading files, running commands, calling APIs — that are faster, cheaper, and more precise than the screenshot loop. The downside: it only works where structured tools exist. You can't use Claude Code to fill out a web form in a GUI application.
Philosophy 3: Browser-native agents. Perplexity Comet, Google Mariner, Fellou. The AI operates within a browser context, which gives it access to the DOM, the page structure, and web APIs that pure screenshot agents can't use. Faster and more reliable than the screenshot loop for web tasks, but limited to the browser.
Philosophy 4: Hub-and-spoke platforms. OpenClaw. The AI doesn't use a computer in the visual sense at all — it operates through messaging channels, routing commands through adapters to skills that call APIs. No screenshots, no browser, no desktop. Just text in, action out, through whatever channel you prefer.
Philosophy 5: Multi-model orchestration. Perplexity Computer. The AI doesn't rely on one model doing everything — it routes subtasks to 19 specialized models, creates sub-agents, and connects to 400+ applications through authenticated integrations. The agent doesn't look at your screen or operate your browser. It talks to your tools directly, dispatching the right model for each job. The downside: it's cloud-dependent, expensive ($200/month), and works through integrations rather than universal access — if an app isn't in the 400+ list, the agent can't reach it.
These philosophies make different tradeoffs and serve different use cases. The screenshot loop will always be the most universal. Structured tool use will always be more precise for development workflows. Browser agents will always be faster for web tasks. Hub-and-spoke platforms will always be more accessible for non-technical users. Multi-model orchestration will always be more capable for complex multi-domain workflows — but at the cost of lock-in and integration dependency.
Notable by its absence: Apple. The largest tech company by market cap has no computer-use agent product. Their strategy is device-centric and privacy-heavy — Apple Research published Ferret-UI Lite, a 3-billion-parameter on-device GUI agent model, but they've shipped nothing comparable to any of the products on this list. Whether this is strategic patience or a genuine gap remains to be seen.
The real question isn't which one wins. It's which combinations emerge. Microsoft is already mixing models. Anthropic is already running both tracks. Perplexity is orchestrating across everyone else's models. The future probably looks like agents that use structured tools when they can, fall back to the screenshot loop when they must, and orchestrate across specialized models when the task demands it — precision where possible, universality where necessary, specialization where it matters.
The Security Story Nobody Is Talking About Enough
OpenClaw — which started as "Clawdbot," a weekend project by Peter Steinberger in November 2025, was renamed to Moltbot after Anthropic trademark complaints, then to OpenClaw days later — crossed roughly 250,000 GitHub stars in sixty days, surpassing React. Steinberger joined OpenAI in February 2026 and transitioned the codebase to an independent 501(c)(3) foundation. The creator left for OpenAI while the security crisis was unfolding. OpenClaw now runs on over 135,000 internet-exposed instances according to SecurityScorecard's STRIKE team.
It has no sandboxing by default. The agent runs with your user account permissions. Whatever you can do on your computer, the agent can do.
In February 2026, Koi Security published an audit of ClawHub — the community skill registry where users publish and share agent capabilities. They found 341 malicious skills out of 2,857 entries. The skills were traced to a coordinated operation called "ClawHavoc" — 335 traced to a single coordinated operation based on shared tactics and infrastructure. Updated scans put the number at over 824 malicious skills with 1,184 malicious packages across 12 publisher accounts — approximately 20% of the registry.
A skill, remember, is a Markdown file containing natural-language instructions. The agent reads it and follows the instructions. A malicious skill can instruct the agent to exfiltrate data, modify files, install backdoors, or connect to command-and-control servers. On macOS, payloads tied to the Atomic macOS Stealer collected browser credentials, keychains, SSH keys, and crypto wallets. The attack surface is not a code vulnerability. It's the agent doing exactly what it was designed to do — following instructions — with instructions written by an adversary.
On top of this, CVE-2026-25253 (CVSS 8.8, rated high) describes a one-click remote code execution vulnerability via WebSocket hijacking. A malicious webpage can connect to a locally running OpenClaw instance, steal the authentication token, and execute arbitrary commands — because OpenClaw incorrectly assumed that any connection from localhost could be implicitly trusted.
CrowdStrike, Microsoft, Cisco, Bitdefender, Palo Alto Networks, and Kaspersky have all published warnings. Meta has banned OpenClaw from corporate devices. The Dutch Data Protection Authority issued a formal warning. Belgium's Centre for Cybersecurity (CCB) issued a specific advisory on CVE-2026-25253. On March 10, 2026, China's CNCERT issued a security alert warning of "extremely weak default security configuration," followed by China's MIIT and SASAC ordering government agencies and state-owned enterprises — including the largest banks — to restrict or ban OpenClaw on office devices. Multiple governments, not just individual companies, are now responding. Microsoft's recommendation: run OpenClaw in "a fully isolated environment such as a dedicated virtual machine." This is the technology equivalent of telling someone their front door doesn't lock and they should build a second house to put inside the first one.
Meanwhile, AWS launched a managed OpenClaw service on Lightsail — a major cloud provider productizing the very tool that governments are warning about. The Lightsail deployment ships with sandboxing, HTTPS, and device pairing authentication, but AWS itself recommends never exposing the gateway publicly.
This is not an OpenClaw-specific problem. It's the defining security challenge of the computer-use era. Every agent that can click buttons and type text can also click the wrong buttons and type the wrong text. Every agent with filesystem access can read your SSH keys, your environment variables, your browser cookies. The more capable the agent, the larger the attack surface.
The enterprise platforms understand this. Microsoft's Copilot Studio ships with credential management, governance, and Purview monitoring. UiPath has AIUC-1 certification. Perplexity Computer runs in isolated cloud environments with audit trails. These aren't features — they're prerequisites. But the consumer and open-source space is moving at a speed that leaves security as an afterthought.
The 135,000 exposed OpenClaw instances are not a bug. They're the natural result of a tool that's trivially easy to install, immediately useful, and doesn't remind you to lock the door. A ClawHub skill that deploys your website to Vercel and a ClawHub skill that exfiltrates your API keys look exactly the same to the user: a Markdown file with a description that sounds helpful.
Twenty percent of the registry. One in five.
The regulatory trajectory is unmistakable. The Netherlands, Belgium, and China have all issued formal warnings within weeks of each other. Meta banned it from corporate devices. AWS productized it with security caveats. The pattern suggests regulatory frameworks specifically targeting computer-use agents are not far behind — the governments are responding faster than the technology is hardening.
To their credit, OpenClaw has responded. Version 2026.3.12 shipped over a dozen security patches in a single release cycle: the WebSocket origin vulnerability (CVE-2026-25253) was patched with strict origin checking, bootstrap tokens are now short-lived and scoped rather than persistent, implicit workspace plugin auto-loading was disabled by default, and ClawHub added mandatory code signing for published skills. The team also published a security hardening guide and introduced a --sandboxed flag for running agents in restricted environments. These are real fixes to real problems. But they're reactive fixes to a deployment base of 135,000+ exposed instances, many of which will never update — and the fundamental architecture (an agent that follows arbitrary natural-language instructions with your user permissions) means the attack surface is the feature, not a bug in it.
What It Actually Costs
Pricing in this space is confusing because you're often paying for model access separately from platform access, and token consumption varies wildly by task. Here's the clearest comparison possible as of March 2026:
GPT-5.4 agent mode: Included with ChatGPT Plus ($20/month) and Pro ($200/month). API pricing: $2.50/MTok input, $15/MTok output. Pro reasoning tier: $30/$180 per MTok. A moderately complex computer-use session — say, navigating a web app to complete a multi-step form — might consume 50,000-200,000 tokens depending on the number of screenshots and reasoning steps.
Claude Computer Use API / Cowork / Code: Opus 4.6 is $15/MTok input, $75/MTok output. Sonnet 4.6 is $3/MTok input, $15/MTok output. The 92% prefix caching rate dramatically reduces effective costs for repetitive workflows. Claude Code is available through a subscription ($20/month for the Pro tier with usage limits, or pay-as-you-go via API). Claude Cowork is included with the Pro tier ($20/month) and above — no additional cost for the desktop agent beyond your existing subscription.
Perplexity Computer: $200/month for Perplexity Max. 10,000 monthly compute credits included. No free tier. Personal Computer (the always-on Mac mini version) requires the same Max subscription plus your own hardware. Computer for Enterprise pricing is negotiated per-organization.
Google Mariner: $249.99/month for Google AI Ultra subscription. No per-token pricing for the extension itself, but that subscription cost is steep for what's still explicitly a research prototype.
Microsoft Copilot Studio + Cowork: Enterprise licensing through Microsoft 365. The upcoming E7 Frontier Suite (GA May 1) bundles Copilot Cowork with advanced agentic capabilities at $99/user/month — a 65% jump over E5 and the first new enterprise tier since E5 launched in 2015. Additional capacity charges apply for Cloud PC compute.
Perplexity Comet: Free browser with Pro features at $20/month and Max features (model selection, extended capabilities) at $40/month.
OpenClaw: Free. Bring your own API keys. Your costs are whatever the underlying model charges, which varies by provider and model. This is the cheapest option — and given the security landscape, you get what you pay for in terms of guardrails.
The practical upshot: for individual users who want to experiment with computer-use agents, GPT-5.4 through ChatGPT Plus ($20/month) or Claude Cowork through a Pro subscription ($20/month) are the most accessible entry points — both give you desktop automation without API complexity. For complex multi-domain workflows, Perplexity Computer's orchestration model is the most capable but requires the $200/month commitment. For enterprise deployment, Microsoft's E7 Frontier Suite and UiPath ScreenPlay are the options with the governance infrastructure that regulated industries require. For developers who want to understand the technology, Simular's Agent S3 and OpenClaw's codebase (run locally, behind a firewall, with the --sandboxed flag, and extreme caution about which skills you install) are the most instructive.
The Benchmark Problem
A note on the numbers that define this space. OSWorld-Verified is the standard benchmark, and crossing the 72.4% human expert baseline is a genuine milestone. But benchmarks measure specific tasks under specific conditions, and the gap between benchmark performance and real-world utility is wide.
A model that scores 75% on OSWorld can complete three out of four benchmark tasks successfully. In practice, that means one in four attempts at a computer-use task will fail. For a coding assistant, a 25% failure rate is annoying but manageable — you review the output and fix mistakes. For an unattended agent filling out government forms, scheduling medical appointments, or managing financial transactions, a 25% failure rate is catastrophic.
The 14.5-hour task completion horizon that Anthropic advertises for Claude Code is impressive, but it also means the agent is operating autonomously for 14.5 hours with a non-zero error rate on every decision. The error rate compounds. By hour ten, the probability that at least one significant mistake has occurred approaches certainty.
This is the central tension of the computer-use moment: the technology is genuinely superhuman on benchmarks and genuinely unreliable in production. Both of these are true simultaneously. The benchmark scores are not lies. The failure stories are not exceptions. The technology is exactly as capable and exactly as fragile as both sides claim.
What This Means for You, Concretely
If you use ChatGPT daily, here's what changed in March 2026:
You can now ask ChatGPT to use your computer. Agent mode in GPT-5.4 will take screenshots of your environment, reason about what it sees, and execute actions. This works today, for Pro and Plus subscribers, through the ChatGPT interface.
The best tool depends on your task. For general computer use — navigating applications, filling out forms, interacting with GUIs — GPT-5.4's screenshot loop is currently the most capable single model. For software development, Claude Code's structured approach produces better results for sustained terminal workflows. For desktop automation without touching a terminal, Claude Cowork packages the same capabilities into a consumer-friendly app. For web research and browsing automation, Perplexity Comet or Google Mariner are more focused tools. For multi-domain workflows that touch email, code, docs, and messaging simultaneously, Perplexity Computer is the most ambitious answer.
Nothing is ready for unsupervised high-stakes tasks. A 75% success rate means you should be watching. Use computer-use agents to accelerate workflows you're already doing, not to replace workflows you don't understand. The technology is a power tool, not autopilot.
If you install OpenClaw, treat it like you'd treat giving a stranger the keys to your house. Run it in a VM with the --sandboxed flag. Audit every skill before installing it. Assume that any community-published skill might be malicious until proven otherwise. Version 2026.3.12 shipped real security fixes — mandatory skill signing, patched WebSocket hijacking, scoped tokens — but 135,000+ exposed instances won't all update, and the fundamental architecture still gives the agent your user permissions. The framework is genuinely impressive. The security posture is improving but not yet where it needs to be.
The enterprise tools are boring and that's the point. Microsoft Copilot Cowork and UiPath ScreenPlay are not exciting consumer products. They're plumbing for corporate automation. If your company is evaluating computer-use agents, these are the options with the governance, security, and monitoring infrastructure that enterprise deployment requires — Copilot Cowork with Anthropic's reasoning running inside Microsoft 365's data layer, ScreenPlay with AIUC-1 certification for regulated industries. They're also the ones most likely to be invisible to end users — the agent that silently processes insurance claims or reconciles spreadsheets while you're not looking.
The Eighteen-Month Timeline
It's worth stepping back to appreciate what happened. In September 2024, the idea of an AI that could look at a screen and operate a computer was a research demo. Anthropic's initial Computer Use beta was slow, clumsy, and prone to clicking the wrong button. OSWorld scores were in the low 20s.
Eighteen months later:
- Multiple models exceed human expert performance on standardized benchmarks
- A consumer browser ships with computer use built in
- An enterprise RPA platform has replaced scripted automation with LLM-powered screen reading
- An open-source framework for personal AI agents has more GitHub stars than React
- An open-source research agent (Agent S3) reached human-level OSWorld scores, going from 20.6% to 72.6% in a single year
- Desktop agents (Claude Cowork, Copilot Cowork) ship computer use to non-developers
- Microsoft and Anthropic partner on long-running autonomous agents inside Microsoft 365
- An orchestration platform dispatches 19 specialized models to handle workflows that run for weeks
- The benchmark curve shows no sign of flattening
The speed of this is easy to underestimate because each increment felt small. A few points on OSWorld here, a new product launch there. But 47.3% to 75.0% in a single release cycle is not incremental. That's a phase transition. The technology crossed from "promising but unreliable" to "superhuman but still unreliable" in the space of one model update.
The "still unreliable" part is not a contradiction. Human experts score 72.4% on OSWorld. GPT-5.4 scores 75.0%. Neither humans nor AI complete these tasks reliably — the benchmark is hard. The difference is that AI costs pennies per attempt and can run thousands of attempts in parallel. Superhuman performance at superhuman scale with a meaningful error rate creates a fundamentally new category of capability.
What happens next is not a prediction. It's an observation about the curve. GPT-5.2 scored 47.3%. GPT-5.4 scored 75.0%. Whatever scores 85% or 90% is not five years away. It might be one release cycle away. The question is not whether computer-use agents will be reliable enough for unsupervised high-stakes work. The question is when — and whether the security and governance infrastructure will be ready when they are.
Given that 20% of the most popular skill registry is currently compromised by a coordinated malicious operation, the answer to the second question is: not yet. Not even close.
The benchmarks say superhuman. The security audits say the front door is open. The pricing says accessible. The enterprise platforms say governed. All of these are true at the same time. Welcome to computer use in March 2026.