Overview
In addition to HTML and JUnit reports, TestPilot records detailed metrics about test execution in a reports.json file included in the test's output directory. This report can be uploaded to AI observability or evaluation platforms to track how your tests perform over time and to identify bottlenecks and flaky steps.
This guide explains how to extract and calculate key metrics from TestPilot JSON reports for evaluation and monitoring purposes:
- Latency per step and per test - Measure execution time performance
- Pass/fail per step and per test - Calculate success rates
- Cache performance analysis - Analyze cache effectiveness and optimization opportunities
Data Sources
TestPilot generates a JSON report for each test in your test run at testpilot-out/.internal/<report-id>/<test-id>/results.json. You can view a detailed reference of the report structure here.
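To make the examples in this guide concrete, here is a minimal Python sketch for collecting the per-test report files from a run's output directory. The glob pattern assumes the <report-id>/<test-id>/results.json layout described above; adjust it if your output directory differs. The later sketches assume each loaded dictionary exposes the Test fields (id, duration, steps, and so on) at the top level; if your reports wrap the test in a parent Report object, unwrap it here.

```python
import json
from pathlib import Path

def load_test_reports(output_dir: str = "testpilot-out") -> list[dict]:
    """Collect every per-test results.json produced by a TestPilot run."""
    reports = []
    # Assumed layout: testpilot-out/.internal/<report-id>/<test-id>/results.json
    for path in sorted(Path(output_dir, ".internal").glob("*/*/results.json")):
        with open(path) as f:
            reports.append(json.load(f))
    return reports
```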
1. Latency Metrics
Understanding Latency Fields
TestPilot provides timing data at multiple levels to help you understand where time is being spent during test execution:
- Test-level timing: The duration field on each Test object represents the total wall-clock time from test start to completion
- Step-level timing: Each Step has its own duration field showing how long that individual step took
- Action-level timing: Individual Action objects within steps also have duration fields for granular analysis
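As noted in Key Takeaways, duration values are serialized as strings and need to be converted to numeric seconds before aggregation. A minimal helper, assuming the common "12.345s" string format; adjust the parsing if your reports encode durations differently.

```python
def duration_to_seconds(duration: str | None) -> float:
    """Convert a duration string such as '12.345s' to float seconds."""
    if not duration:
        return 0.0
    return float(duration.rstrip("s"))
```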
Per-Test Latency Analysis
Test-level latency provides the most important metric for overall user experience: the total time from test start to completion. This end-to-end measurement is crucial for:
- SLA monitoring: Ensuring tests complete within acceptable time limits
- Performance trending: Tracking whether test execution is getting faster or slower over time
- Resource planning: Understanding typical execution times for capacity planning
Key fields for per-test latency analysis:
- Test.id: Stable identifier for analyzing your test across runs
- Test.duration: Wall time for how long the test took to run
- Test.startTime: Useful for organizing your tests chronologically
- Test.profilingMetrics: Splits test time into LLM Duration, Tool Duration, and Attempts, which can help you understand what is driving the test’s overall duration.
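Putting these fields together, a sketch that tabulates per-test latency with the LLM/tool breakdown. It reuses the hypothetical load_test_reports and duration_to_seconds helpers from the earlier sketches, and the nested key names inside profilingMetrics (llmDuration, toolDuration, attempts) are assumptions about the report shape.

```python
def per_test_latency(reports: list[dict]) -> list[dict]:
    """Summarize end-to-end latency for each test, slowest first."""
    rows = []
    for test in reports:
        profiling = test.get("profilingMetrics", {})
        rows.append({
            "testId": test.get("id"),
            "startTime": test.get("startTime"),
            "totalSeconds": duration_to_seconds(test.get("duration")),
            "llmSeconds": duration_to_seconds(profiling.get("llmDuration")),
            "toolSeconds": duration_to_seconds(profiling.get("toolDuration")),
            "attempts": profiling.get("attempts"),  # assumed key name
        })
    # Slowest tests first so SLA outliers surface immediately
    return sorted(rows, key=lambda r: r["totalSeconds"], reverse=True)
```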
Per-Step Latency Analysis
Step-level latency analysis helps identify bottlenecks in test execution by examining the duration field on each Step object. This is particularly useful for:
- Bottleneck identification: Finding which types of operations (navigation, form filling, verification) take the longest
- Optimization targeting: Prioritizing which steps to optimize for maximum performance impact
- Regression detection: Monitoring whether specific step types are getting slower over time
Key fields for per-step latency analysis:
- Step.title: The actual wording of the step provided to the LLM. You can use this, or the step's index, for analysis across tests.
- Step.duration: Wall time for how long the step took
- Step.profilingMetrics: Shows the breakdown of duration by attempts, LLM Duration, and Tool Duration
- Step.executionMode: Shows how the step was executed (cache, agent, script, etc.) - see Enum Reference
- Step.actions: The actual actions taken by TestPilot, which can give you a granular understanding of what occurred during the step.
- Step.cacheStatus: Shows whether the step successfully used the Action cache, or had to fall back. See Enum Reference
- Step.explanation: The LLM’s interpretation of how the step ended.
When a step's duration or status looks anomalous, these questions can guide the diagnosis:
- Do the profiling metrics indicate frequent retries of the step, or issues with any of the tool calls you are making?
- Does the cache status consistently suggest an issue with caching capability? What reasons are preventing the cache from hitting on the action?
- Do you see a large number of waits, or repetitive actions in the step?
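A companion sketch for step-level bottleneck hunting: it ranks every step across a run by wall time, keeping the execution mode and cache status alongside so the questions above can be answered at a glance. It assumes each test object carries its steps in a steps array; the helper names come from the earlier sketches.

```python
def slowest_steps(reports: list[dict], top_n: int = 10) -> list[dict]:
    """Rank individual steps by duration across all tests to surface bottlenecks."""
    rows = []
    for test in reports:
        for index, step in enumerate(test.get("steps", [])):  # assumed 'steps' key
            rows.append({
                "testId": test.get("id"),
                "stepIndex": index,
                "title": step.get("title"),
                "seconds": duration_to_seconds(step.get("duration")),
                "executionMode": step.get("executionMode"),
                "cacheStatus": step.get("cacheStatus"),
            })
    return sorted(rows, key=lambda r: r["seconds"], reverse=True)[:top_n]
```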
2. Pass/Fail Metrics
Understanding Status Fields
TestPilot uses numeric status codes to indicate the outcome of tests and steps. The status field appears on both Test and Step objects. See the Status enum table for complete values and meanings.
Per-Step Pass/Fail Analysis
Step-level analysis helps identify which specific operations are most prone to failure. This is valuable for:
- Failure pattern identification: Understanding which types of steps fail most frequently
- Root cause analysis: Distinguishing between action failures and verification failures
- Test stability assessment: Identifying steps that contribute to overall test flakiness
Pay particular attention to the stepType field, since failing verification steps can indicate different issues than failing action steps. See the StepType enum table for complete values.
Key aggregation strategies:
- Binary classification: Treat STATUS_SUCCEEDED as “pass” and all others as “fail” for most analyses
- Step type segmentation: Analyze action steps vs verification steps separately
- Failure categorization: Use the failureReasonCategory field to group similar failure types
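A sketch of the binary-classification and step-type-segmentation strategies above: it treats STATUS_SUCCEEDED (2) as a pass and everything else as a fail, and splits the rate by action vs verification steps. The steps array and key names are the same assumptions as in the earlier sketches.

```python
STATUS_SUCCEEDED = 2
STEP_TYPE_VERIFICATION = 3

def step_pass_rates(reports: list[dict]) -> dict[str, float]:
    """Pass rate for action steps vs verification steps, using binary classification."""
    counts = {"action": [0, 0], "verification": [0, 0]}  # [passed, total]
    for test in reports:
        for step in test.get("steps", []):
            kind = "verification" if step.get("stepType") == STEP_TYPE_VERIFICATION else "action"
            counts[kind][1] += 1
            if step.get("status") == STATUS_SUCCEEDED:
                counts[kind][0] += 1
    return {kind: (passed / total if total else 0.0)
            for kind, (passed, total) in counts.items()}
```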
Per-Test Pass/Fail Analysis
Test-level pass/fail rates are the primary metric for overall test suite health. The test status field represents the final outcome after all steps have been attempted. A test is considered successful only if it reaches STATUS_SUCCEEDED (2).
The explanation field often contains valuable context about why a test failed, which can be categorized for root cause analysis and automated triage.
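At the test level the calculation is simpler. A sketch that computes the overall pass rate and collects the explanation of each failed test for triage, under the same report-shape assumptions as above:

```python
def test_pass_rate(reports: list[dict]) -> dict:
    """Overall pass rate plus failure explanations for root-cause triage."""
    passed, failures = 0, []
    for test in reports:
        if test.get("status") == STATUS_SUCCEEDED:
            passed += 1
        else:
            failures.append({
                "testId": test.get("id"),
                "explanation": test.get("explanation"),
            })
    total = len(reports)
    return {"passRate": passed / total if total else 0.0, "failures": failures}
```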
Pass/Fail Trend Analysis
Tracking pass/fail rates over time is essential for identifying flaky tests and assessing overall test suite stability. Critical metrics to track:
- Overall pass rate: Percentage of tests passing in recent runs
- Test stability: Individual tests with inconsistent results across runs
- Flaky test identification: Tests with pass rates between 20% and 80%, indicating intermittent issues
- Failure clustering: Whether failures concentrate around specific time periods or code changes
- Stability scoring: Calculate pass rates over rolling windows to identify consistently failing vs intermittent tests
- Trend detection: Monitor whether test suite health is improving or degrading over time
- Outlier identification: Flag tests that deviate significantly from expected pass rates
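A sketch of the flaky-test heuristic above. It assumes you persist each test's status across runs yourself (the status_history mapping below is a hypothetical structure keyed by Test.id) and flags tests whose rolling pass rate falls in the 20-80% band.

```python
def flaky_tests(status_history: dict[str, list[int]],
                low: float = 0.2, high: float = 0.8) -> list[tuple[str, float]]:
    """Flag tests whose pass rate over recent runs sits in the flaky band."""
    flagged = []
    for test_id, statuses in status_history.items():
        if not statuses:
            continue
        rate = sum(1 for status in statuses if status == STATUS_SUCCEEDED) / len(statuses)
        if low <= rate <= high:
            flagged.append((test_id, rate))
    # Lowest pass rates first: these are the least stable tests
    return sorted(flagged, key=lambda item: item[1])
```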
3. Cache Performance Analysis
Understanding Cache Fields
TestPilot's caching system speeds up test execution by reusing previous step results when conditions are similar. Understanding cache performance helps you shorten execution time and identify optimization opportunities. Key fields for cache analysis:
- Step.cacheStatus: Indicates what happened with cache for this step (hit, miss, or unused with reason)
- Step.executionMode: Shows how the step was actually executed (cache, agent, script, etc.)
- Test.cacheSourceId: When present, indicates this test used another test as a cache source
Cache Rate Calculation
The most effective cache rate calculation focuses on non-assertion steps, since verification steps typically cannot be cached. The analysis should:
- Count cache utilization: Steps executed with EXECUTION_MODE_CACHE (2) or EXECUTION_MODE_CACHE_FALLBACK_TO_CUA (3)
- Exclude assertion steps: Filter out steps identified as assertions/verifications, since they can rarely be cached
- Calculate hit rate: Cached steps divided by total non-assertion steps
Assertion steps can be identified by either of the following (as in the sketch below):
- Step titles beginning with "verify," "assert," or "expect" (case-insensitive)
- Steps with a stepType of STEP_TYPE_VERIFICATION (3)
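A sketch implementing this calculation, using both identification rules for assertion steps and the execution-mode values from the Enum Reference; the steps array and key names are the same assumptions as before.

```python
EXECUTION_MODE_CACHE = 2
EXECUTION_MODE_CACHE_FALLBACK_TO_CUA = 3
ASSERTION_PREFIXES = ("verify", "assert", "expect")

def is_assertion(step: dict) -> bool:
    """Heuristic: verification step type or an assertion-style title."""
    title = (step.get("title") or "").strip().lower()
    return step.get("stepType") == STEP_TYPE_VERIFICATION or title.startswith(ASSERTION_PREFIXES)

def cache_hit_rate(reports: list[dict]) -> float:
    """Cached steps divided by all non-assertion steps."""
    cached, eligible = 0, 0
    for test in reports:
        for step in test.get("steps", []):
            if is_assertion(step):
                continue
            eligible += 1
            if step.get("executionMode") in (EXECUTION_MODE_CACHE,
                                             EXECUTION_MODE_CACHE_FALLBACK_TO_CUA):
                cached += 1
    return cached / eligible if eligible else 0.0
```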
Cache Error Rate Analysis
Cache error rate tracks situations where cache was attempted but failed, requiring fallback to live agent execution. This metric helps identify:
- Cache reliability issues: How often cache attempts fail
- Environmental factors: Whether cache failures correlate with specific conditions
- Cache optimization opportunities: Which scenarios need better cache handling
Calculate the error rate as the proportion of steps with an executionMode of EXECUTION_MODE_CACHE_FALLBACK_TO_CUA (3) across all steps.
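A corresponding sketch for the error rate, counting fallback executions across all steps under the same assumptions:

```python
def cache_error_rate(reports: list[dict]) -> float:
    """Share of all steps where a cache attempt fell back to agent execution."""
    fallbacks, total = 0, 0
    for test in reports:
        for step in test.get("steps", []):
            total += 1
            if step.get("executionMode") == EXECUTION_MODE_CACHE_FALLBACK_TO_CUA:
                fallbacks += 1
    return fallbacks / total if total else 0.0
```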
Cache Status Analysis
Beyond simple hit rates, analyze the specific reasons the cache wasn't used. See the CacheStatus enum table for complete cache status values and their meanings, including detailed unused reasons and optimization strategies.
Cache Performance Trends
Track cache performance over time to understand optimization effectiveness. Key trend indicators:
- Declining hit rates: May indicate tests becoming more dynamic or environmental instability
- Consistent unused reasons: Suggest systematic issues addressable through test design changes
- Cache source stability: Tests frequently serving as cache sources should be prioritized for stability
- Performance correlation: Relationship between cache hit rates and overall execution speed
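To ground both the unused-reason analysis and these trend indicators, a sketch that counts steps by cacheStatus for a single run; run it per report period and compare the distributions over time. Mapping the numeric codes back to names uses the CacheStatus table in the Enum Reference.

```python
from collections import Counter

def cache_status_breakdown(reports: list[dict]) -> Counter:
    """Count steps by numeric cacheStatus code to see why the cache was missed or unused."""
    counts: Counter = Counter()
    for test in reports:
        for step in test.get("steps", []):
            counts[step.get("cacheStatus")] += 1
    return counts
```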
Field Selection Best Practices
Essential Fields by Analysis Type
Latency Analysis:
- Test.duration and Step.duration: Core timing metrics
- Test.id: For grouping across runs
- ProfilingMetrics.totalDuration, llmDuration, toolDuration: For detailed breakdowns
- Report.startTime: For chronological ordering
Pass/Fail Analysis:
- Test.status and Step.status: Core success metrics (focus on value 2 = STATUS_SUCCEEDED)
- Test.explanation and Step.explanation: Context for failures
- Step.stepType: To differentiate verification vs action failures
- Step.failureReasonCategory: For categorizing failure types
Cache Analysis:
- Step.executionMode: How the step was executed (primary metric)
- Step.cacheStatus: Detailed cache behavior
- Test.cacheSourceId: Cache dependency relationships
- Step.title: For identifying assertion steps to exclude
Enum Reference
TestPilot uses numeric enum values throughout the JSON reports. This section provides complete reference tables for all enum types used in evaluation and analysis.
Status
The status field appears on both Test and Step objects to indicate execution outcomes:
| Number | Name | Meaning |
|---|---|---|
| 0 | STATUS_UNSPECIFIED | Status is unknown or not set |
| 1 | STATUS_PENDING | Test or step is still in progress |
| 2 | STATUS_SUCCEEDED | Completed successfully (passed) |
| 3 | STATUS_FAILED | Completed with failure |
| 4 | STATUS_INCOMPLETE | Did not finish all steps (aborted or skipped) |
ExecutionMode
The executionMode field indicates how each step was executed:
| Number | Name | Meaning |
|---|---|---|
| 0 | EXECUTION_MODE_UNSPECIFIED | Mode not specified |
| 1 | EXECUTION_MODE_CUA | Executed by Computer Use Agent (LLM-driven) |
| 2 | EXECUTION_MODE_CACHE | Step executed from cache (no live LLM run) |
| 3 | EXECUTION_MODE_CACHE_FALLBACK_TO_CUA | Cache attempted but fell back to agent execution |
| 4 | EXECUTION_MODE_FALLBACK_TO_CUA | Non-CUA execution failed, fell back to agent |
| 5 | EXECUTION_MODE_NON_CUA | Non-CUA mode (deterministic/scripted) |
| 6 | EXECUTION_MODE_SCRIPT | Script mode execution |
| 7 | EXECUTION_MODE_FORM_FILLER | Form filler mode execution |
CacheStatus
The cacheStatus field indicates what happened with cache for each step:
| Number | Name | Meaning |
|---|---|---|
| 0 | CACHE_STATUS_UNSPECIFIED | Cache status not specified |
| 1 | CACHE_STATUS_HIT | Cache entry found and used - optimal performance |
| 2 | CACHE_STATUS_MISS | No suitable cache entry found - expected for first runs |
| 3 | CACHE_STATUS_UNUSED_IS_RETRY | Cache ignored because step needed retry - may indicate flaky steps |
| 4 | CACHE_STATUS_UNUSED_IS_ASSERTION | Cache ignored for assertion-only step - expected behavior |
| 5 | CACHE_STATUS_UNUSED_NON_CACHEABLE_EXECUTION_MODE | Cache ignored due to non-cacheable execution mode |
| 6 | CACHE_STATUS_UNUSED_CONTAINS_NON_CACHEABLE_ACTIONS | Cache ignored due to non-cacheable actions |
| 7 | CACHE_STATUS_UNUSED_ELEMENT_IS_MISSING | Cache ignored because required element missing - UI changes may have broken cache assumptions |
| 8 | CACHE_STATUS_UNUSED_HAS_TOOL_CALLS | Cache ignored because step involved tool calls - dynamic behavior prevented caching |
| 9 | CACHE_STATUS_UNUSED_IS_SCRIPT | Cache ignored for script-based step |
StepType
The stepType field classifies the type of step being executed:
| Number | Name | Meaning |
|---|---|---|
| 0 | STEP_TYPE_UNSPECIFIED | Step type not specified |
| 1 | STEP_TYPE_REGULAR_ACTION | Regular user actions (click, type, navigate) |
| 2 | STEP_TYPE_FORM_FILLING | Data entry or form completion |
| 3 | STEP_TYPE_VERIFICATION | Assertion or verification of expected state |
Key Takeaways
- Duration parsing: Always convert duration strings to numeric seconds for analysis
- Test identification: Use Test.id for tracking across runs, not titles
- Cache calculation: Exclude assertion steps for accurate cache hit rates
- Status interpretation: Focus on STATUS_SUCCEEDED (2) as the only true success state
- Trend analysis: Use rolling averages and percentile calculations for meaningful insights
- Failure categorization: Leverage explanation fields for automated failure triage