Overview

In addition to HTML and JUnit reports, TestPilot reports detailed metrics about test executions in a JSON report (results.json) included in each test's output directory. This report can be uploaded to AI observability or evaluation platforms to understand how your tests perform over time and to identify bottlenecks and flaky steps. This guide explains how to extract and calculate key metrics from TestPilot JSON reports for evaluation and monitoring purposes:
  1. Latency per step and per test - Measure execution time performance
  2. Pass/fail per step and per test - Calculate success rates
  3. Cache performance analysis - Analyze cache effectiveness and optimization opportunities
These metrics help identify performance regressions, flaky tests, and cache optimization opportunities across multiple test runs.

Data Sources

TestPilot generates a results.json file for each test in your test run, located at testpilot-out/.internal/<report-id>/<test-id>/results.json. You can view a detailed reference of the report structure here.

1. Latency Metrics

Understanding Latency Fields

TestPilot provides timing data at multiple levels to help you understand where time is being spent during test execution:
  • Test-level timing: The duration field on each Test object represents the total wall-clock time from test start to completion
  • Step-level timing: Each Step has its own duration field showing how long that individual step took
  • Action-level timing: Individual Action objects within steps also have duration fields for granular analysis
The duration format uses strings like “2.5 s” or “90.2 s” - always in seconds with the “s” suffix. Parse these into numeric values to enable mathematical operations like averaging, percentile calculations, and trend analysis.
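As a minimal sketch, a helper like the following (a hypothetical parse_duration, not part of TestPilot) converts these strings to floats; it assumes every duration value follows the "<number> s" format described above:

```python
def parse_duration(value: str) -> float:
    """Convert a TestPilot duration string such as "2.5 s" into float seconds."""
    # Assumes the documented format: a number followed by the "s" suffix.
    return float(value.strip().rstrip("s").strip())


print(parse_duration("2.5 s"))   # 2.5
print(parse_duration("90.2 s"))  # 90.2
```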

Per-Test Latency Analysis

Test-level latency provides the most important metric for overall user experience - the total time from test start to completion. This end-to-end measurement is crucial for:
  • SLA monitoring: Ensuring tests complete within acceptable time limits
  • Performance trending: Tracking whether test execution is getting faster or slower over time
  • Resource planning: Understanding typical execution times for capacity planning
Key fields for analyzing per-test latency include the following (a short extraction sketch follows this list):
  • Test.id: Stable identifier for tracking a test across runs
  • Test.duration: Wall-clock time for the entire test run
  • Test.startTime: Useful for organizing your tests chronologically
  • Test.profilingMetrics: Splits test time into LLM Duration, Tool Duration, and Attempts, which can help you understand what is driving the test’s overall duration.
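A minimal collection sketch, assuming each results.json contains a single test object with the id, startTime, and duration fields described above (the exact top-level layout is an assumption; adjust the lookup to match your report structure):

```python
import json
import statistics
from pathlib import Path


def parse_duration(value: str) -> float:
    """Convert "2.5 s" style duration strings into float seconds."""
    return float(value.strip().rstrip("s").strip())


def collect_test_latencies(report_paths):
    """Return one row per test with its id, start time, and duration in seconds."""
    rows = []
    for path in report_paths:
        test = json.loads(Path(path).read_text())  # assumption: one test object per results.json
        rows.append({
            "id": test["id"],
            "startTime": test.get("startTime"),
            "durationSeconds": parse_duration(test["duration"]),
        })
    return rows


rows = collect_test_latencies(Path("testpilot-out/.internal").glob("*/*/results.json"))
durations = sorted(r["durationSeconds"] for r in rows)
if durations:
    print("mean:", statistics.mean(durations))
if len(durations) >= 2:
    print("p95 (approx):", statistics.quantiles(durations, n=20)[-1])
```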

Per-Step Latency Analysis

Step-level latency analysis helps identify bottlenecks in test execution by examining the duration field on each Step object. This is particularly useful for:
  • Bottleneck identification: Finding which types of operations (navigation, form filling, verification) take the longest
  • Optimization targeting: Prioritizing which steps to optimize for maximum performance impact
  • Regression detection: Monitoring whether specific step types are getting slower over time
Extract the duration from each step and track it alongside the step title and status to build comprehensive performance profiles; a per-step aggregation sketch follows this subsection. Key fields for analyzing per-step latency include:
  • Step.title: The actual wording of the step provided to the LLM. You can use this, or the step’s index, for analysis across tests.
  • Step.duration: Wall time for how long the step took
  • Step.profilingMetrics: Shows the breakdown of duration by attempts, LLM Duration, and Tool Duration
  • Step.executionMode: Shows how the step was executed (cache, agent, script, etc.) - see Enum Reference
  • Step.actions: The actual actions taken by TestPilot, which can give you a granular understanding of what occurred during the step.
  • Step.cacheStatus: Shows whether the step successfully used the Action cache, or had to fall back. See Enum Reference
  • Step.explanation: The LLM’s interpretation of how the step ended.
Some things to evaluate when analyzing steps include:
  • Do the profiling metrics indicate frequent retries of the step, or issues with any of the tool calls you are making?
  • Does the cache status consistently suggest an issue with caching capability? What reasons are preventing the cache from hitting on the action?
  • Do you see a large number of waits, or repetitive actions in the step?
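A per-step aggregation sketch along these lines groups durations by step title (assuming steps are listed under a "steps" key on each test object, which is a layout assumption, and reusing the parse_duration helper from earlier):

```python
from collections import defaultdict


def parse_duration(value: str) -> float:
    """Convert "2.5 s" style duration strings into float seconds."""
    return float(value.strip().rstrip("s").strip())


def step_latency_profile(tests):
    """Aggregate step durations by Step.title across a collection of test objects."""
    by_title = defaultdict(list)
    for test in tests:
        for step in test.get("steps", []):  # assumption: steps live under a "steps" key
            by_title[step["title"]].append(parse_duration(step["duration"]))
    return {
        title: {"count": len(durs), "mean_s": sum(durs) / len(durs), "max_s": max(durs)}
        for title, durs in by_title.items()
    }
```

Sorting the result by mean_s quickly surfaces the step types most worth optimizing.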

2. Pass/Fail Metrics

Understanding Status Fields

TestPilot uses numeric status codes to indicate the outcome of tests and steps. The status field appears on both Test and Step objects. See the Status enum table for complete values and meanings.

Per-Step Pass/Fail Analysis

Step-level analysis helps identify which specific operations are most prone to failure. This is valuable for:
  • Failure pattern identification: Understanding which types of steps fail most frequently
  • Root cause analysis: Distinguishing between action failures and verification failures
  • Test stability assessment: Identifying steps that contribute to overall test flakiness
Pay attention to the stepType field, since failing verification steps can indicate different issues than failing action steps. See the StepType enum table for complete values. Key aggregation strategies (a sketch follows this list):
  • Binary classification: Treat STATUS_SUCCEEDED as “pass” and all others as “fail” for most analyses
  • Step type segmentation: Analyze action steps vs verification steps separately
  • Failure categorization: Use the failureReasonCategory field to group similar failure types
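As a sketch of the binary-classification and step-type-segmentation strategies above (enum values from the Enum Reference; the "steps" layout is the same assumption as earlier):

```python
STATUS_SUCCEEDED = 2
STEP_TYPE_VERIFICATION = 3


def step_pass_rates(tests):
    """Pass rate for action vs verification steps, treating only STATUS_SUCCEEDED as a pass."""
    counts = {"action": [0, 0], "verification": [0, 0]}  # [passed, total]
    for test in tests:
        for step in test.get("steps", []):
            kind = "verification" if step.get("stepType") == STEP_TYPE_VERIFICATION else "action"
            counts[kind][1] += 1
            if step.get("status") == STATUS_SUCCEEDED:
                counts[kind][0] += 1
    return {kind: (passed / total if total else None) for kind, (passed, total) in counts.items()}
```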

Per-Test Pass/Fail Analysis

Test-level pass/fail rates are the primary metric for overall test suite health. The test status field represents the final outcome after all steps have been attempted. A test is considered successful only if it reaches STATUS_SUCCEEDED (2). The explanation field often contains valuable context about why a test failed, which can be categorized for root cause analysis and automated triage.

Pass/Fail Trend Analysis

Tracking pass/fail rates over time is essential for identifying flaky tests and assessing overall test suite stability. Critical metrics to track:
  • Overall pass rate: Percentage of tests passing in recent runs
  • Test stability: Individual tests with inconsistent results across runs
  • Flaky test identification: Tests with pass rates between 20% and 80%, indicating intermittent issues
  • Failure clustering: Whether failures concentrate around specific time periods or code changes
Analysis techniques (a flaky-test scoring sketch follows this list):
  • Stability scoring: Calculate pass rates over rolling windows to identify consistently failing vs intermittent tests
  • Trend detection: Monitor whether test suite health is improving or degrading over time
  • Outlier identification: Flag tests that deviate significantly from expected pass rates
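A stability-scoring sketch along these lines flags tests whose pass rate across a set of runs falls in the 20-80% band mentioned above. Here runs is assumed to be a list of per-run test lists, with tests matched by Test.id:

```python
from collections import defaultdict

STATUS_SUCCEEDED = 2


def flaky_tests(runs, low=0.2, high=0.8):
    """Return {test id: pass rate} for tests whose pass rate falls between low and high."""
    outcomes = defaultdict(list)
    for tests in runs:
        for test in tests:
            outcomes[test["id"]].append(test.get("status") == STATUS_SUCCEEDED)
    return {
        test_id: sum(results) / len(results)
        for test_id, results in outcomes.items()
        if low <= sum(results) / len(results) <= high
    }
```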

3. Cache Performance Analysis

Understanding Cache Fields

TestPilot’s caching system speeds up test execution by reusing previous step results when conditions are similar. Understanding cache performance helps optimize test execution time and identify optimization opportunities. Key fields for cache analysis:
  • Step.cacheStatus: Indicates what happened with cache for this step (hit, miss, or unused with reason)
  • Step.executionMode: Shows how the step was actually executed (cache, agent, script, etc.)
  • Test.cacheSourceId: When present, indicates this test used another test as a cache source

Cache Rate Calculation

The most effective cache rate calculation focuses on non-assertion steps since verification steps typically cannot be cached. The analysis should:
  • Count cache utilization: Steps executed with EXECUTION_MODE_CACHE (2) or EXECUTION_MODE_CACHE_FALLBACK_TO_CUA (3)
  • Exclude assertion steps: Filter out steps identified as assertions/verifications, since they can rarely be cached
  • Calculate hit rate: Cached steps divided by total non-assertion steps
Assertion step identification (a calculation sketch follows this list):
  • Step titles beginning with “verify,” “assert,” or “expect” (case-insensitive).
  • Steps with stepType of STEP_TYPE_VERIFICATION (3).
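A hit-rate sketch combining both identification rules (enum values from the Enum Reference; the "steps" layout is the same assumption as earlier):

```python
EXECUTION_MODE_CACHE = 2
EXECUTION_MODE_CACHE_FALLBACK_TO_CUA = 3
STEP_TYPE_VERIFICATION = 3
ASSERTION_PREFIXES = ("verify", "assert", "expect")


def is_assertion(step):
    """Identify assertion steps by stepType or by a verify/assert/expect title prefix."""
    title = step.get("title", "").strip().lower()
    return step.get("stepType") == STEP_TYPE_VERIFICATION or title.startswith(ASSERTION_PREFIXES)


def cache_hit_rate(tests):
    """Cached non-assertion steps divided by total non-assertion steps."""
    cached = total = 0
    for test in tests:
        for step in test.get("steps", []):
            if is_assertion(step):
                continue
            total += 1
            if step.get("executionMode") in (EXECUTION_MODE_CACHE, EXECUTION_MODE_CACHE_FALLBACK_TO_CUA):
                cached += 1
    return cached / total if total else None
```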

Cache Error Rate Analysis

Cache error rate tracks situations where cache was attempted but failed, requiring fallback to live agent execution. This metric helps identify:
  • Cache reliability issues: How often cache attempts fail
  • Environmental factors: Whether cache failures correlate with specific conditions
  • Cache optimization opportunities: Which scenarios need better cache handling
Calculate as the percentage of steps with execution mode EXECUTION_MODE_CACHE_FALLBACK_TO_CUA (3) across all steps.
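A corresponding sketch for the fallback percentage, under the same layout assumptions as above:

```python
EXECUTION_MODE_CACHE_FALLBACK_TO_CUA = 3


def cache_error_rate(tests):
    """Percentage of all steps that attempted cache but fell back to live agent execution."""
    steps = [step for test in tests for step in test.get("steps", [])]
    if not steps:
        return None
    fallbacks = sum(1 for step in steps
                    if step.get("executionMode") == EXECUTION_MODE_CACHE_FALLBACK_TO_CUA)
    return 100.0 * fallbacks / len(steps)
```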

Cache Status Analysis

Beyond simple hit rates, analyze the specific reasons cache wasn’t used. See the CacheStatus enum table for complete cache status values and their meanings, including detailed unused reasons and optimization strategies. Track cache performance over time to understand optimization effectiveness. Key trend indicators:
  • Declining hit rates: May indicate tests becoming more dynamic or environmental instability
  • Consistent unused reasons: Suggest systematic issues addressable through test design changes
  • Cache source stability: Tests frequently serving as cache sources should be prioritized for stability
  • Performance correlation: Relationship between cache hit rates and overall execution speed

Field Selection Best Practices

Essential Fields by Analysis Type

Latency Analysis:
  • Test.duration and Step.duration: Core timing metrics
  • Test.id: For grouping across runs
  • ProfilingMetrics.totalDuration, llmDuration, toolDuration: For detailed breakdowns
  • Report.startTime: For chronological ordering
Pass/Fail Analysis:
  • Test.status and Step.status: Core success metrics (focus on value 2 = STATUS_SUCCEEDED)
  • Test.explanation and Step.explanation: Context for failures
  • Step.stepType: To differentiate verification vs action failures
  • Step.failureReasonCategory: For categorizing failure types
Cache Analysis:
  • Step.executionMode: How step was executed (primary metric)
  • Step.cacheStatus: Detailed cache behavior
  • Test.cacheSourceId: Cache dependency relationships
  • Step.title: For identifying assertion steps to exclude

Enum Reference

TestPilot uses numeric enum values throughout the JSON reports. This section provides complete reference tables for all enum types used in evaluation and analysis.

Status

The status field appears on both Test and Step objects to indicate execution outcomes:
  • STATUS_UNSPECIFIED (0): Status is unknown or not set
  • STATUS_PENDING (1): Test or step is still in progress
  • STATUS_SUCCEEDED (2): Completed successfully (passed)
  • STATUS_FAILED (3): Completed with failure
  • STATUS_INCOMPLETE (4): Did not finish all steps (aborted or skipped)

ExecutionMode

The executionMode field indicates how each step was executed:
  • EXECUTION_MODE_UNSPECIFIED (0): Mode not specified
  • EXECUTION_MODE_CUA (1): Executed by Computer Use Agent (LLM-driven)
  • EXECUTION_MODE_CACHE (2): Step executed from cache (no live LLM run)
  • EXECUTION_MODE_CACHE_FALLBACK_TO_CUA (3): Cache attempted but fell back to agent execution
  • EXECUTION_MODE_FALLBACK_TO_CUA (4): Non-CUA execution failed, fell back to agent
  • EXECUTION_MODE_NON_CUA (5): Non-CUA mode (deterministic/scripted)
  • EXECUTION_MODE_SCRIPT (6): Script mode execution
  • EXECUTION_MODE_FORM_FILLER (7): Form filler mode execution

CacheStatus

The cacheStatus field indicates what happened with cache for each step:
  • CACHE_STATUS_UNSPECIFIED (0): Cache status not specified
  • CACHE_STATUS_HIT (1): Cache entry found and used - optimal performance
  • CACHE_STATUS_MISS (2): No suitable cache entry found - expected for first runs
  • CACHE_STATUS_UNUSED_IS_RETRY (3): Cache ignored because step needed retry - may indicate flaky steps
  • CACHE_STATUS_UNUSED_IS_ASSERTION (4): Cache ignored for assertion-only step - expected behavior
  • CACHE_STATUS_UNUSED_NON_CACHEABLE_EXECUTION_MODE (5): Cache ignored due to non-cacheable execution mode
  • CACHE_STATUS_UNUSED_CONTAINS_NON_CACHEABLE_ACTIONS (6): Cache ignored due to non-cacheable actions
  • CACHE_STATUS_UNUSED_ELEMENT_IS_MISSING (7): Cache ignored because a required element is missing - UI changes may have broken cache assumptions
  • CACHE_STATUS_UNUSED_HAS_TOOL_CALLS (8): Cache ignored because step involved tool calls - dynamic behavior prevented caching
  • CACHE_STATUS_UNUSED_IS_SCRIPT (9): Cache ignored for script-based step

StepType

The stepType field classifies the type of step being executed:
  • STEP_TYPE_UNSPECIFIED (0): Step type not specified
  • STEP_TYPE_REGULAR_ACTION (1): Regular user actions (click, type, navigate)
  • STEP_TYPE_FORM_FILLING (2): Data entry or form completion
  • STEP_TYPE_VERIFICATION (3): Assertion or verification of expected state

Key Takeaways

  1. Duration parsing: Always convert duration strings to numeric seconds for analysis
  2. Test identification: Use Test.id for tracking across runs, not titles
  3. Cache calculation: Exclude assertion steps for accurate cache hit rates
  4. Status interpretation: Focus on STATUS_SUCCEEDED (2) as the only true success state
  5. Trend analysis: Use rolling averages and percentile calculations for meaningful insights
  6. Failure categorization: Leverage explanation fields for automated failure triage