Agent Interaction Surfaces: Impact on Browser-Based Agent Performance
Manju Weerasinghe¹, Priya Chawla¹, Danny Prevoznik²
¹ TollBit ² KERNEL
Benchmarking setup developed in collaboration with KERNEL
May 2026
Abstract
This document evaluates how browser interaction surfaces affect agent performance while holding task, prompt, and model constant.
Across five commerce workflows and 100 runs per site per variant (1,000 runs total), Agent Sites for eCommerce reduced mean time to completion by 24-35% and matched or exceeded default-site task completion on every site, reaching 100% completion on all five. Default sites ranged from 91% to 100% completion, with observed failures tied to the agent exhausting its step budget while navigating or recovering from UI state changes.
The interaction surface is a primary driver of agent performance, independent of the model or prompt.
1 Key Results
- 24-35% faster time to completion across all sites tested
- Agent Site reached 100% task completion on every site; default ranged 91-100%
- Fewer interaction steps and screenshots per run (used as a proxy for token usage and cost)
- Failures on default sites frequently involve max-step loops and unstable navigation paths
Results aggregated across 100 executions per site per variant (1,000 runs total).
| Site | Time to Completion Improvement | Completion Delta | Step Reduction |
|---|---|---|---|
| Cool3C | 34.8% faster | +4-6 pts | 37.9% fewer |
| Kopari Beauty | 24.3% faster | No material change | 18.6% fewer |
| MM Lafleur | 29.9% faster | +6-10 pts | 25.4% fewer |
| Pela | 28.0% faster | +0-18 pts | 29.5% fewer |
| Quip | 26.1% faster | +0 pts | 35.2% fewer |
2 Problem
Most websites today are built around human conversion flows. They assume a human who can visually scan a page, ignore distractions, recover from dead ends, and adapt to UI changes in real time.
When browser-based agents attempt to complete real tasks on these same sites, that assumption breaks down. Agents interact with UI patterns that are incidental for humans but costly for automation. In practice, this shows up as extra steps, higher variance across runs, and frequent failure modes that are not directly tied to the underlying task. Common issues we observe include:
- Visual clutter such as ads, modals, and promotional overlays
- Non-deterministic navigation paths and stateful UI
- Dynamically injected DOM elements and shifting page structure
- Repeated visual scanning and screenshot capture to recover context
- Interruptions from chat widgets, consent prompts, and autoplay media
The result is brittle automation, not just slower execution. Identical prompts against the same site can succeed or fail depending on timing, layout shifts, or incidental UI changes. This increases compute cost, reduces reliability, and complicates debugging.
This evaluation isolates the effect of the interaction surface on agent execution by holding task, prompt, and model constant.
3 What We Mean by Agent Site
Agent Sites for eCommerce are neither new websites nor new content. They follow the same pattern as mobile-optimized sites: the underlying system remains the same, but the interaction surface adapts to a different mode of use.
They are an alternative interaction surface over an existing site, designed specifically for agents executing tasks through a browser. The underlying site stays the same:
- Same products and content
- Same business logic
- Same pricing and checkout flows
For the purposes of this evaluation, we hold the task and prompt constant and change only the surface the agent interacts with:
- The default, human-oriented site
- A TollBit-managed Agent Site over that same site
The goal is not to simplify the task or bypass site logic. The goal is to remove patterns that are incidental for humans but create instability, excess steps, and failure modes for agents. In practice, agent sites focus on:
- Stabilizing page structure across navigations
- Removing UI elements that introduce non-determinism for agents
- Making action paths explicit and repeatable
- Preserving full actionability (not read-only extraction)
The underlying site stays the same; the surface shifts from being optimized for human attention to being optimized for agent execution.
4 Evaluation Philosophy
Most discussions around browser agents change multiple variables at once - model choice, prompting strategy, tooling, retries, and task design. This makes it difficult to isolate what actually drives success or failure.
This evaluation focuses on a single question:
Holding task, prompt, and model constant, how much does the interaction surface affect an agent’s ability to complete real workflows?
Prompts are fixed. Agent configuration is fixed. Success criteria are fixed.
The only variable that changes is the surface the agent interacts with at runtime.
5 Evaluation Methodology
This evaluation isolates the impact of the interaction surface on agent performance by constraining all other variables.
Controlled Variables
For each comparison, the following are held constant:
- Task definition: Each workflow has a fixed task description (e.g. constrained product selection and add-to-cart).
- Agent prompt: The exact same prompt is used for both the default site and the Agent Site.
- Starting point: The agent begins from the same logical entry point for each run.
- Agent framework and model: Comparisons are made using the same agent framework and model configuration per run. We do not tune prompts or agent behavior between the default and Agent Site variants.
- Success criteria: Task success and failure are defined identically across runs.
Differences in outcomes are not due to changes in prompting, model choice, or task design.
Independent Variable
The only variable that changes between paired runs is the site surface the agent interacts with:
- The default, human-oriented website
- A TollBit-managed Agent Site over the same underlying site
The underlying content, products, pricing, and business logic remain unchanged.
Metrics Collected
We evaluate agent performance using metrics that map directly to reliability, execution cost, and stability:
- Task success rate: Whether the agent completes the workflow as defined.
- Time to completion: Total elapsed time to reach task completion or failure.
- Interaction steps: Number of discrete actions taken by the agent.
- Visual observations (e.g. screenshots captured during execution): Used as a proxy for vision-based token usage and agent effort.
- Failure modes: Captured as a primary technical reason code (e.g., bot_detection, element_not_found, timeout) plus a qualitative surface bucket when observable (e.g., popup/overlay, consent prompt, navigation loop, checkout gating).
- Variance across runs: Distribution of outcomes across repeated executions of the same task.
Repeated Runs
Each task is executed multiple times per surface to account for agent non-determinism. Results are aggregated across repeated runs. The goal is not to produce a single “best case” outcome, but to observe how surface design affects consistency and failure rates under repeated execution.
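As a rough illustration of that aggregation, and assuming the hypothetical RunRecord shape sketched above, per-(site, variant) summaries could be computed along these lines:

```python
# Sketch of aggregating repeated runs into distribution-style summaries
# (completion rate, mean and long-tail time, mean steps). Assumes RunRecord above.
from statistics import mean, quantiles

def summarize(runs: list[RunRecord]) -> dict:
    times = [r.elapsed_s for r in runs]
    return {
        "runs": len(runs),
        "completion_rate": sum(r.success for r in runs) / len(runs),
        "mean_time_s": mean(times),
        "p90_time_s": quantiles(times, n=10)[-1],  # long-tail indicator
        "mean_steps": mean(r.steps for r in runs),
    }
```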
6 Scope of Evaluation
This evaluation focuses on structured commerce workflows where agents are required to complete end-to-end tasks.
Representative workflows include:
- Navigating to a product or listing
- Applying filters or constraints
- Selecting an item
- Adding the item to cart
These workflows were chosen because task completion depends directly on navigation stability, actionability, and consistency across runs.
While agents interact with many other types of sites (e.g. forums, content platforms, authenticated flows), those are not included in this evaluation.
7 Experimental Setup
This section describes the agent, environment, and run configuration used in the evaluation. The benchmarking setup was developed jointly with KERNEL.
- Agent framework(s) used
  - Browser runtime: KERNEL cloud browsers, launched in stealth mode with session replay recording enabled. Each run received a fresh, isolated browser session.
  - Navigation layer: Playwright `page.goto` was issued inside the KERNEL session with a 10-second navigation timeout and up to two navigation attempts before the run was marked as a navigation failure.
  - Agent loop: Anthropic Computer Use sampling loop. The agent received screenshots and a fixed set of computer-control tools (click, scroll, type, key, screenshot) and was required to declare its own outcome via a structured `report_task_result` tool call, so that success and failure were both explicit rather than inferred.
  - System prompt: A single fixed system prompt was used across every run and each variant for all sites. It instructs the agent to prefer keyboard-based scrolling, to take a verification screenshot after each action, and to call `report_task_result` exactly once at the end of the task with a success flag and, on failure, a structured failure mode and surface bucket.
- Model
  - Model: Claude Sonnet 4.6
  - Tool version: Anthropic computer-use tool surface `computer_20251124`
  - Reasoning: Adaptive extended-thinking enabled.
  - Output budget: 16,384 tokens per assistant turn.
  - The same model and configuration were used for both the default and Agent Site variants of every site.
- Run configuration (retries, parallelism, staggering)
  - Sample size: 100 invocations per (site x variant), across 5 sites and 2 variants, for 1,000 total runs.
  - Step cap: Each run was capped at 40 agent steps. Runs that reached the cap without producing a successful `report_task_result` were classified as `max_steps_reached`.
  - Invocation timeout: A 15-minute hard timeout was enforced at the runner level. Runs that exceeded this were classified as timeouts.
  - Replay capture: Every run produced a video replay for post-hoc inspection of failure modes.
  - Variant routing: The default variant navigated directly to the canonical site URL. The Agent Site variant navigated to the same underlying site through a TollBit-managed Agent Site. The underlying products, pricing, and business logic were identical between the two variants.
  - A minimal sketch of this per-run control flow (navigation retries, step cap, timeout classification) appears after this list.
- Workflows and Prompts
  - Pela Case - "Add an iPhone 17 pro silly goose case with magsafe to cart."
  - MM LaFleur - "Find a blue dress with pockets in a size small, and add it to cart."
  - Kopari Beauty - "Search for 2 different creams that total less than $100. Add both to the cart."
  - Quip - "Search for a single Rev Oscillating Toothbrush in the Black Night colour to add to cart."
  - Cool3C - "Find the Kokomo vacuum cleaner product and add it to the cart."
- Environment notes
  - All runs were unauthenticated and single-session; no persistent cookies, accounts, or personalization carried between runs.
  - Runs executed during a single April 2026 evaluation window so that site state was as consistent as possible across the dataset.
  - The benchmark runner persists every step (action type, duration, screenshot flag, error flag, timestamp) for every run, so the aggregate metrics in this report are computed from the recorded run data.
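The sketch below illustrates the per-run control flow described in this section. It is a minimal sketch under stated assumptions, not the actual harness: `kernel_cdp_url` and `agent_step()` are hypothetical placeholders, and only the navigation timeout, retry count, step cap, and runner-level timeout follow the configuration listed above.

```python
# Minimal sketch of one benchmark run, assuming the configuration above.
# kernel_cdp_url and agent_step() are hypothetical placeholders for the real harness;
# agent_step() runs one Computer Use turn and returns the report_task_result payload
# once the agent calls that tool, otherwise None.
import time
from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeout

MAX_STEPS = 40            # step cap per run
RUN_TIMEOUT_S = 15 * 60   # 15-minute hard timeout enforced at the runner level
NAV_TIMEOUT_MS = 10_000   # 10-second navigation timeout
NAV_ATTEMPTS = 2          # up to two navigation attempts before navigation failure

def run_once(kernel_cdp_url: str, start_url: str, agent_step) -> dict:
    with sync_playwright() as p:
        # Attach to an isolated KERNEL cloud browser session (assumed CDP endpoint).
        browser = p.chromium.connect_over_cdp(kernel_cdp_url)
        context = browser.contexts[0] if browser.contexts else browser.new_context()
        page = context.pages[0] if context.pages else context.new_page()

        # Navigation layer: retry page.goto up to NAV_ATTEMPTS times.
        for attempt in range(NAV_ATTEMPTS):
            try:
                page.goto(start_url, timeout=NAV_TIMEOUT_MS)
                break
            except PlaywrightTimeout:
                if attempt == NAV_ATTEMPTS - 1:
                    return {"status": "navigation_failure", "steps": 0}

        # Agent loop: run until report_task_result, the step cap, or the hard timeout.
        started = time.monotonic()
        for step in range(MAX_STEPS):
            if time.monotonic() - started > RUN_TIMEOUT_S:
                return {"status": "timeout", "steps": step}
            report = agent_step(page)  # one model turn + tool execution + screenshot
            if report is not None:     # agent called report_task_result
                return {"status": "reported", "steps": step + 1, **report}
        return {"status": "max_steps_reached", "steps": MAX_STEPS}
```

The success flag, failure mode, and surface bucket reported by the agent arrive inside the `report_task_result` payload handled by `agent_step`, which is why terminal classification here only distinguishes reported outcomes from navigation failures, timeouts, and step-cap exhaustion.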
8 Results & Observations
Overview
Each site was evaluated across 100 independent executions per variant (default vs Agent Site), for 1,000 total runs, using identical task definitions, prompts, and agent configuration.
Results are reported as distributions across runs rather than single aggregates to capture variability in agent behavior.

Time to Completion
Agent Site reduces mean time to completion across all sites, with observed improvements ranging from 24.3% to 35.1%.

Across runs, default sites exhibit wider spread in time to completion, including long-tail executions. Agent Site reduces long-tail execution cases.
Task Completion
Agent Site reached 100% task completion on all five sites. Default sites failed to reach full completion on three of five workflows.
Agent Site increases completion rates on Cool3C, MM LaFleur, and Pela, and maintains 100% completion on Kopari and Quip.

Failures on the default variant occurred intermittently across runs for identical tasks, indicating instability in execution rather than task complexity.
Interaction Steps
Step counts were consistently lower on the Agent Site variant, with per-site reductions ranging from 18.6% to 37.9% (see Key Results).
Default runs include longer execution paths, with repeated navigation and re-evaluation of page state. Agent Site reduced the total steps required to complete each workflow, which in turn reduced the number of screenshots captured per run (each step produces one screenshot under the agent's verification protocol). Step and screenshot counts serve as proxies for token usage and vision cost.

Variability
Run-to-run variability is observed across all sites.
Default sites show:
- variation in time to completion
- variation in step count
- inconsistent completion outcomes on three of five sites
Agent Site shows more consistent execution paths across runs, with zero failed runs across the full 500-run Agent Site dataset.

Failure Modes
In this 1,000-run dataset, every recorded terminal failure (21 / 500 default runs, 0 / 500 Agent Site runs) ended as `max_steps_reached`, meaning the agent exhausted its 40-step budget without producing a successful report. Per-site terminal failure counts on the default variant were: Cool3C 4, MM LaFleur 8, Pela 9, Kopari 0, and Quip 0.
`max_steps_reached` is not a root-cause label. We therefore reviewed the replay for each of the 21 default-variant failures and assigned a dominant qualitative cause. That replay review suggests three recurring patterns:
- Bot detection / challenge flow: 12/21 failures, including both explicit challenge flows and silent same-page request blocking that never surfaced to the agent as a distinct error state.
- Agent decision loop: 8/21 failures, including bad-URL detours, backing out of the correct product, popup distraction, and repeated scrolling or hesitation near the correct add-to-cart action.
- Page-load / stuck-page: 1/21 failures, where the page entered a prolonged loading state and never recovered within the step budget.
By site, the replay-based dominant-cause breakdown was: Cool3C 4 decision-loop failures; MM LaFleur 4 bot-detection failures and 4 decision-loop failures; Pela 8 bot-detection failures and 1 page-load failure. These failures occurred intermittently across otherwise identical runs.
On the Agent Site variant, these failure patterns did not produce terminal failures: no Agent Site runs failed in the final dataset.

9 Limitations
This evaluation is based on controlled browser-based runs and reflects observed behavior under the specific setup used in this benchmark.
Bot detection and environment effects
Replay review shows that bot detection remained a meaningful source of failure on some default sites, particularly when add-to-cart or filtering actions were blocked without a clear error signal.
In these cases, failures were triggered by repeated API activity tied to normal site behavior rather than explicit agent errors. For example, some sites persist cart state by issuing multiple background requests on each page load, even when the cart has not changed. Similarly, product listing and filtering flows can generate additional fetch requests as results are updated.
These request patterns trigger bot detection during rapid, step-by-step navigation, causing actions to fail silently without a distinct challenge or error response. In these scenarios, the agent continues retrying within the same page context until it exhausts its step budget.
Notably, these patterns can also surface under fast human navigation, indicating that the behavior is tied to site-level request patterns rather than agent-specific actions.
Run instability on default sites
Default site failures were not deterministic. Identical tasks and prompts produced both successful and failed runs.
Across the 500 default-variant runs:
- 21 runs (4.2%) failed to complete the task within the 40-step cap
- 100% of those runs ended with the terminal label `max_steps_reached`, meaning the agent exhausted its 40-step budget without producing a successful report
- Per-site failure counts: Cool3C 4, MM LaFleur 8, Pela 9, Kopari 0, Quip 0
- Replay review assigned dominant underlying causes of 12 bot-detection failures, 8 agent decision-loop failures, and 1 page-load / stuck-page failure
- The automated label set recorded 0 explicit `bot_detection` terminal failures, but the replay review shows that this is an undercount caused by silent same-page blocking rather than a true zero
This variability is tied to execution-path differences rather than task definition.
Failure classification granularity
Failure modes are derived from the agent’s structured `report_task_result` output and from runner-side classification of timeouts and infrastructure errors. These labels represent final recorded states, not definitive root-cause diagnoses.
Replay review of failed runs highlights a key limitation of automated labeling: some blocking behavior does not surface as an explicit error state and instead appears as repeated failed actions within the same page. In these cases, runs terminate as `max_steps_reached` rather than being attributed to a specific failure type.
To address this, we supplemented automated labels with manual replay review of all 21 failed default-variant runs and report those replay-based dominant-cause counts in the Results section. These should be interpreted as an additional layer of analysis rather than a replacement for the automated labels.
Prompt and workflow sensitivity
Prompts were specified tightly enough to make success criteria unambiguous. The prompts shown in Experimental Setup are the post-clarification versions and were held constant across both variants once finalized. Examples include:
- MM LaFleur - color, pocket presence, and size were all specified to disambiguate which dress to add
- Quip - exact product line ("Rev Oscillating Toothbrush") and color ("Black Night") were specified to avoid agents converging on different SKUs
- Kopari - quantity ("2 different creams") and price ceiling ("less than $100") were specified so success criteria were unambiguous
- Cool3C - exact product name ("Kokomo vacuum cleaner") was specified given the marketplace-style listing density
Prompts were held constant within each comparison after these adjustments, but small changes in task definition can affect execution outcomes.
Limited site set
The evaluation includes 5 commerce sites with different UI patterns and structures:
- Pela – direct-to-consumer storefront with product variants and cart flows
- MM LaFleur – structured catalog with filtering, sizing constraints, and multi-step selection
- Kopari Beauty – search-driven product discovery with cart aggregation
- Quip – product + variant selection with optional subscription paths
- Cool3C – marketplace-style interface with non-English UI and less conventional navigation patterns
This evaluation is limited to unauthenticated, single-session commerce workflows that stop at add-to-cart. Results may differ for authenticated, personalized, or multi-session experiences.
Metric scope
The evaluation measures:
- time to completion
- task success
- interaction steps
Direct measurement of token usage and compute cost is not included.
10 Discussion
These results suggest that interaction surface design is a first-order factor in browser-based agent performance.
Improvements are not limited to speed, but extend to consistency and reduction in failure modes.
This has implications for how websites expose functionality to automated systems, particularly as agents move from retrieval to execution.