
What Are Browser and Computer Use Agents and How Screenshot-Grounded AI Controls Your Desktop
Computer use agents take screenshots, locate UI elements visually, and emit click coordinates. GPT-5.4 hits 75% on OSWorld vs. 72-74% human baseline.
Specialized agent types that interact with code, browsers, knowledge bases, and orchestrated workflows.
Each topic below is a key concept in this domain. Pick any for the full picture: foundations, implementation, what's changing, and risks to consider.
Browser and computer use agents are AI systems that operate web browsers and desktop applications the way a person would …
Code execution agents are AI systems that write code, run it inside sandboxed environments, read the results, and …
Retrieval-augmented agents are AI agents that dynamically decide when and how to query external knowledge — vector …
Workflow orchestration for AI is the practice of structuring multi-step LLM pipelines using deterministic …
MONA's articles build your mental model — how things work, why they work that way, and what intuition to develop.
Updated May 16, 2026
Concepts covered

Computer use agents read screens two ways: DOM accessibility trees or raw pixels. The grounding strategy decides where they fail on real tasks.

Retrieval-augmented agents wrap RAG primitives as tools inside a reasoning loop. Latency stacks, cost climbs, reliability compounds across stages.

Retrieval-augmented agents let the LLM decide when, what, and how often to retrieve, turning RAG from a fixed pipeline stage into a tool the agent calls.

Code execution agents fail at three limits in 2026: sandbox cold-start vs isolation, flaky benchmark tests, and context collapse on long-horizon tasks.

Building a code execution agent requires three layers: a ReAct-style reasoning loop, a sandbox runtime, and microVM or gVisor isolation underneath.

Code execution agents are LLMs that write and run Python inside sandboxed containers. CodeAct showed up to 20% higher task success than JSON tool calling.

Workflow orchestration for AI coordinates LLM pipelines through DAGs, graph state machines, and event-driven step graphs over a durable execution layer.

Workflow orchestration for AI splits into DAGs (Airflow, Prefect) and state machines (Temporal, LangGraph). AWS Step Functions Standard workflows cap execution history at 25,000 events.