Overview
In addition to HTML and JUnit reports, TestPilot records detailed metrics about test execution in a reports.json file included in the test's output directory. This report can be uploaded to AI observability or evaluation platforms to track how your tests perform over time and to identify bottlenecks and flaky steps.
This guide explains how to extract and calculate key metrics from TestPilot JSON reports for evaluation and monitoring purposes:
- Latency per step and per test - Measure execution time performance
- Pass/fail per step and per test - Calculate success rates
- Cache performance analysis - Analyze cache effectiveness and optimization opportunities
Data Sources
TestPilot generates a JSON report for each test in your test run at testpilot-out/.internal/<report-id>/<test-id>/results.json. You can view a detailed reference of the report structure here.
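To make the examples in this guide concrete, here is a minimal Python sketch for collecting the per-test report files from a run's output directory. The glob pattern assumes the <report-id>/<test-id>/results.json layout described above; adjust it if your output directory differs. The later sketches assume each loaded dictionary exposes the Test fields (id, duration, steps, and so on) at the top level; if your reports wrap the test in a parent Report object, unwrap it here.

```python
import json
from pathlib import Path

def load_test_reports(output_dir: str = "testpilot-out") -> list[dict]:
    """Collect every per-test results.json produced by a TestPilot run."""
    reports = []
    # Assumed layout: testpilot-out/.internal/<report-id>/<test-id>/results.json
    for path in sorted(Path(output_dir, ".internal").glob("*/*/results.json")):
        with open(path) as f:
            reports.append(json.load(f))
    return reports
```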
1. Latency Metrics
Understanding Latency Fields
TestPilot provides timing data at multiple levels to help you understand where time is being spent during test execution:
- Test-level timing: The duration field on each Test object represents the total wall-clock time from test start to completion
- Step-level timing: Each Step has its own duration field showing how long that individual step took
- Action-level timing: Individual Action objects within steps also have duration fields for granular analysis
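As noted in Key Takeaways, duration values are serialized as strings and need to be converted to numeric seconds before aggregation. A minimal helper, assuming the common "12.345s" string format; adjust the parsing if your reports encode durations differently.

```python
def duration_to_seconds(duration: str | None) -> float:
    """Convert a duration string such as '12.345s' to float seconds."""
    if not duration:
        return 0.0
    return float(duration.rstrip("s"))
```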
Per-Test Latency Analysis
Test-level latency provides the most important metric for overall user experience: the total time from test start to completion. This end-to-end measurement is crucial for:
- SLA monitoring: Ensuring tests complete within acceptable time limits
- Performance trending: Tracking whether test execution is getting faster or slower over time
- Resource planning: Understanding typical execution times for capacity planning
Key fields for per-test latency analysis:
- Test.id: Stable identifier for analyzing your test across runs
- Test.duration: Wall time for how long the test took to run
- Test.startTime: Useful for organizing your tests chronologically
- Test.profilingMetrics: Splits test time into LLM Duration, Tool Duration, and Attempts, which can help you understand what is driving the test’s overall duration.
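Putting these fields together, a sketch that tabulates per-test latency with the LLM/tool breakdown. It reuses the hypothetical load_test_reports and duration_to_seconds helpers from the earlier sketches, and the nested key names inside profilingMetrics (llmDuration, toolDuration, attempts) are assumptions about the report shape.

```python
def per_test_latency(reports: list[dict]) -> list[dict]:
    """Summarize end-to-end latency for each test, slowest first."""
    rows = []
    for test in reports:
        profiling = test.get("profilingMetrics", {})
        rows.append({
            "testId": test.get("id"),
            "startTime": test.get("startTime"),
            "totalSeconds": duration_to_seconds(test.get("duration")),
            "llmSeconds": duration_to_seconds(profiling.get("llmDuration")),
            "toolSeconds": duration_to_seconds(profiling.get("toolDuration")),
            "attempts": profiling.get("attempts"),  # assumed key name
        })
    # Slowest tests first so SLA outliers surface immediately
    return sorted(rows, key=lambda r: r["totalSeconds"], reverse=True)
```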
Per-Step Latency Analysis
Step-level latency analysis helps identify bottlenecks in test execution by examining the duration field on each Step object. This is particularly useful for:
- Bottleneck identification: Finding which types of operations (navigation, form filling, verification) take the longest
- Optimization targeting: Prioritizing which steps to optimize for maximum performance impact
- Regression detection: Monitoring whether specific step types are getting slower over time
Key fields for per-step latency analysis:
- Step.title: The actual wording of the step provided to the LLM. You can use this, or the step's index, for analysis across tests.
- Step.duration: Wall time for how long the step took
- Step.profilingMetrics: Shows the breakdown of duration by attempts, LLM Duration, and Tool Duration
- Step.executionMode: Shows how the step was executed (cache, agent, script, etc.) - see Enum Reference
- Step.actions: The actual actions taken by TestPilot, which can give you a granular understanding of what occurred during the step.
- Step.cacheStatus: Shows whether the step successfully used the Action cache, or had to fall back. See Enum Reference
- Step.explanation: The LLM’s interpretation of how the step ended.
When a step's duration or status looks anomalous, these questions can guide the diagnosis:
- Do the profiling metrics indicate frequent retries of the step, or issues with any of the tool calls you are making?
- Does the cache status consistently suggest an issue with caching capability? What reasons are preventing the cache from hitting on the action?
- Do you see a large number of waits, or repetitive actions in the step?
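A companion sketch for step-level bottleneck hunting: it ranks every step across a run by wall time, keeping the execution mode and cache status alongside so the questions above can be answered at a glance. It assumes each test object carries its steps in a steps array; the helper names come from the earlier sketches.

```python
def slowest_steps(reports: list[dict], top_n: int = 10) -> list[dict]:
    """Rank individual steps by duration across all tests to surface bottlenecks."""
    rows = []
    for test in reports:
        for index, step in enumerate(test.get("steps", [])):  # assumed 'steps' key
            rows.append({
                "testId": test.get("id"),
                "stepIndex": index,
                "title": step.get("title"),
                "seconds": duration_to_seconds(step.get("duration")),
                "executionMode": step.get("executionMode"),
                "cacheStatus": step.get("cacheStatus"),
            })
    return sorted(rows, key=lambda r: r["seconds"], reverse=True)[:top_n]
```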
2. Pass/Fail Metrics
Understanding Status Fields
TestPilot uses numeric status codes to indicate the outcome of tests and steps. The status field appears on both Test and Step objects. See the Status enum table for complete values and meanings.
Per-Step Pass/Fail Analysis
Step-level analysis helps identify which specific operations are most prone to failure. This is valuable for:
- Failure pattern identification: Understanding which types of steps fail most frequently
- Root cause analysis: Distinguishing between action failures and verification failures
- Test stability assessment: Identifying steps that contribute to overall test flakiness
Pay particular attention to the stepType field, since failing verification steps can indicate different issues than failing action steps. See the StepType enum table for complete values.
Key aggregation strategies:
- Binary classification: Treat STATUS_SUCCEEDED as “pass” and all others as “fail” for most analyses
- Step type segmentation: Analyze action steps vs verification steps separately
- Failure categorization: Use the failureReasonCategory field to group similar failure types
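A sketch of the binary-classification and step-type-segmentation strategies above: it treats STATUS_SUCCEEDED (2) as a pass and everything else as a fail, and splits the rate by action vs verification steps. The steps array and key names are the same assumptions as in the earlier sketches.

```python
STATUS_SUCCEEDED = 2
STEP_TYPE_VERIFICATION = 3

def step_pass_rates(reports: list[dict]) -> dict[str, float]:
    """Pass rate for action steps vs verification steps, using binary classification."""
    counts = {"action": [0, 0], "verification": [0, 0]}  # [passed, total]
    for test in reports:
        for step in test.get("steps", []):
            kind = "verification" if step.get("stepType") == STEP_TYPE_VERIFICATION else "action"
            counts[kind][1] += 1
            if step.get("status") == STATUS_SUCCEEDED:
                counts[kind][0] += 1
    return {kind: (passed / total if total else 0.0)
            for kind, (passed, total) in counts.items()}
```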
Per-Test Pass/Fail Analysis
Test-level pass/fail rates are the primary metric for overall test suite health. The test status field represents the final outcome after all steps have been attempted. A test is considered successful only if it reaches STATUS_SUCCEEDED (2).
The explanation field often contains valuable context about why a test failed, which can be categorized for root cause analysis and automated triage.
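At the test level the calculation is simpler. A sketch that computes the overall pass rate and collects the explanation of each failed test for triage, under the same report-shape assumptions as above:

```python
def test_pass_rate(reports: list[dict]) -> dict:
    """Overall pass rate plus failure explanations for root-cause triage."""
    passed, failures = 0, []
    for test in reports:
        if test.get("status") == STATUS_SUCCEEDED:
            passed += 1
        else:
            failures.append({
                "testId": test.get("id"),
                "explanation": test.get("explanation"),
            })
    total = len(reports)
    return {"passRate": passed / total if total else 0.0, "failures": failures}
```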
Pass/Fail Trend Analysis
Tracking pass/fail rates over time is essential for identifying flaky tests and assessing overall test suite stability. Critical metrics to track:
- Overall pass rate: Percentage of tests passing in recent runs
- Test stability: Individual tests with inconsistent results across runs
- Flaky test identification: Tests with pass rates between 20% and 80%, indicating intermittent issues
- Failure clustering: Whether failures concentrate around specific time periods or code changes
- Stability scoring: Calculate pass rates over rolling windows to identify consistently failing vs intermittent tests
- Trend detection: Monitor whether test suite health is improving or degrading over time
- Outlier identification: Flag tests that deviate significantly from expected pass rates
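A sketch of the flaky-test heuristic above. It assumes you persist each test's status across runs yourself (the status_history mapping below is a hypothetical structure keyed by Test.id) and flags tests whose rolling pass rate falls in the 20-80% band.

```python
def flaky_tests(status_history: dict[str, list[int]],
                low: float = 0.2, high: float = 0.8) -> list[tuple[str, float]]:
    """Flag tests whose pass rate over recent runs sits in the flaky band."""
    flagged = []
    for test_id, statuses in status_history.items():
        if not statuses:
            continue
        rate = sum(1 for status in statuses if status == STATUS_SUCCEEDED) / len(statuses)
        if low <= rate <= high:
            flagged.append((test_id, rate))
    # Lowest pass rates first: these are the least stable tests
    return sorted(flagged, key=lambda item: item[1])
```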
3. Cache Performance Analysis
Understanding Cache Fields
TestPilot's caching system speeds up test execution by reusing previous step results when conditions are similar. Understanding cache performance helps you shorten execution time and identify optimization opportunities. Key fields for cache analysis:
- Step.cacheStatus: Indicates what happened with cache for this step (hit, miss, or unused with reason)
- Step.executionMode: Shows how the step was actually executed (cache, agent, script, etc.)
- Test.cacheSourceId: When present, indicates this test used another test as a cache source
Cache Rate Calculation
The most effective cache rate calculation focuses on non-assertion steps, since verification steps typically cannot be cached. The analysis should:
- Count cache utilization: Steps executed with EXECUTION_MODE_CACHE (2) or EXECUTION_MODE_CACHE_FALLBACK_TO_CUA (3)
- Exclude assertion steps: Filter out steps identified as assertions/verifications, since they can rarely be cached
- Calculate hit rate: Cached steps divided by total non-assertion steps
Assertion steps can be identified by either of the following (as in the sketch below):
- Step titles beginning with "verify," "assert," or "expect" (case-insensitive)
- Steps with a stepType of STEP_TYPE_VERIFICATION (3)
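A sketch implementing this calculation, using both identification rules for assertion steps and the execution-mode values from the Enum Reference; the steps array and key names are the same assumptions as before.

```python
EXECUTION_MODE_CACHE = 2
EXECUTION_MODE_CACHE_FALLBACK_TO_CUA = 3
ASSERTION_PREFIXES = ("verify", "assert", "expect")

def is_assertion(step: dict) -> bool:
    """Heuristic: verification step type or an assertion-style title."""
    title = (step.get("title") or "").strip().lower()
    return step.get("stepType") == STEP_TYPE_VERIFICATION or title.startswith(ASSERTION_PREFIXES)

def cache_hit_rate(reports: list[dict]) -> float:
    """Cached steps divided by all non-assertion steps."""
    cached, eligible = 0, 0
    for test in reports:
        for step in test.get("steps", []):
            if is_assertion(step):
                continue
            eligible += 1
            if step.get("executionMode") in (EXECUTION_MODE_CACHE,
                                             EXECUTION_MODE_CACHE_FALLBACK_TO_CUA):
                cached += 1
    return cached / eligible if eligible else 0.0
```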
Cache Error Rate Analysis
Cache error rate tracks situations where cache was attempted but failed, requiring fallback to live agent execution. This metric helps identify:
- Cache reliability issues: How often cache attempts fail
- Environmental factors: Whether cache failures correlate with specific conditions
- Cache optimization opportunities: Which scenarios need better cache handling
Calculate the error rate as the proportion of steps with an executionMode of EXECUTION_MODE_CACHE_FALLBACK_TO_CUA (3) across all steps.
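A corresponding sketch for the error rate, counting fallback executions across all steps under the same assumptions:

```python
def cache_error_rate(reports: list[dict]) -> float:
    """Share of all steps where a cache attempt fell back to agent execution."""
    fallbacks, total = 0, 0
    for test in reports:
        for step in test.get("steps", []):
            total += 1
            if step.get("executionMode") == EXECUTION_MODE_CACHE_FALLBACK_TO_CUA:
                fallbacks += 1
    return fallbacks / total if total else 0.0
```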
Cache Status Analysis
Beyond simple hit rates, analyze the specific reasons the cache wasn't used. See the CacheStatus enum table for complete cache status values and their meanings, including detailed unused reasons and optimization strategies.
Cache Performance Trends
Track cache performance over time to understand optimization effectiveness. Key trend indicators:
- Declining hit rates: May indicate tests becoming more dynamic or environmental instability
- Consistent unused reasons: Suggest systematic issues addressable through test design changes
- Cache source stability: Tests frequently serving as cache sources should be prioritized for stability
- Performance correlation: Relationship between cache hit rates and overall execution speed
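To ground both the unused-reason analysis and these trend indicators, a sketch that counts steps by cacheStatus for a single run; run it per report period and compare the distributions over time. Mapping the numeric codes back to names uses the CacheStatus table in the Enum Reference.

```python
from collections import Counter

def cache_status_breakdown(reports: list[dict]) -> Counter:
    """Count steps by numeric cacheStatus code to see why the cache was missed or unused."""
    counts: Counter = Counter()
    for test in reports:
        for step in test.get("steps", []):
            counts[step.get("cacheStatus")] += 1
    return counts
```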
Field Selection Best Practices
Essential Fields by Analysis Type
Latency Analysis:
- Test.duration and Step.duration: Core timing metrics
- Test.id: For grouping across runs
- ProfilingMetrics.totalDuration, llmDuration, toolDuration: For detailed breakdowns
- Report.startTime: For chronological ordering
Pass/Fail Analysis:
- Test.status and Step.status: Core success metrics (focus on value 2 = STATUS_SUCCEEDED)
- Test.explanation and Step.explanation: Context for failures
- Step.stepType: To differentiate verification vs action failures
- Step.failureReasonCategory: For categorizing failure types
Cache Analysis:
- Step.executionMode: How the step was executed (primary metric)
- Step.cacheStatus: Detailed cache behavior
- Test.cacheSourceId: Cache dependency relationships
- Step.title: For identifying assertion steps to exclude
Enum Reference
TestPilot uses numeric enum values throughout the JSON reports. This section provides complete reference tables for all enum types used in evaluation and analysis.
Status
The status field appears on both Test and Step objects to indicate execution outcomes:
| Number | Name | Meaning |
|---|---|---|
| 0 | STATUS_UNSPECIFIED | Status is unknown or not set |
| 1 | STATUS_PENDING | Test or step is still in progress |
| 2 | STATUS_SUCCEEDED | Completed successfully (passed) |
| 3 | STATUS_FAILED | Completed with failure |
| 4 | STATUS_INCOMPLETE | Did not finish all steps (aborted or skipped) |
ExecutionMode
The executionMode field indicates how each step was executed:
| Number | Name | Meaning |
|---|---|---|
| 0 | EXECUTION_MODE_UNSPECIFIED | Mode not specified |
| 1 | EXECUTION_MODE_CUA | Executed by Computer Use Agent (LLM-driven) |
| 2 | EXECUTION_MODE_CACHE | Step executed from cache (no live LLM run) |
| 3 | EXECUTION_MODE_CACHE_FALLBACK_TO_CUA | Cache attempted but fell back to agent execution |
| 4 | EXECUTION_MODE_FALLBACK_TO_CUA | Non-CUA execution failed, fell back to agent |
| 5 | EXECUTION_MODE_NON_CUA | Non-CUA mode (deterministic/scripted) |
| 6 | EXECUTION_MODE_SCRIPT | Script mode execution |
| 7 | EXECUTION_MODE_FORM_FILLER | Form filler mode execution |
CacheStatus
The cacheStatus field indicates what happened with cache for each step:
| Number | Name | Meaning |
|---|---|---|
| 0 | CACHE_STATUS_UNSPECIFIED | Cache status not specified |
| 1 | CACHE_STATUS_HIT | Cache entry found and used - optimal performance |
| 2 | CACHE_STATUS_MISS | No suitable cache entry found - expected for first runs |
| 3 | CACHE_STATUS_UNUSED_IS_RETRY | Cache ignored because step needed retry - may indicate flaky steps |
| 4 | CACHE_STATUS_UNUSED_IS_ASSERTION | Cache ignored for assertion-only step - expected behavior |
| 5 | CACHE_STATUS_UNUSED_NON_CACHEABLE_EXECUTION_MODE | Cache ignored due to non-cacheable execution mode |
| 6 | CACHE_STATUS_UNUSED_CONTAINS_NON_CACHEABLE_ACTIONS | Cache ignored due to non-cacheable actions |
| 7 | CACHE_STATUS_UNUSED_ELEMENT_IS_MISSING | Cache ignored because required element missing - UI changes may have broken cache assumptions |
| 8 | CACHE_STATUS_UNUSED_HAS_TOOL_CALLS | Cache ignored because step involved tool calls - dynamic behavior prevented caching |
| 9 | CACHE_STATUS_UNUSED_IS_SCRIPT | Cache ignored for script-based step |
StepType
The stepType field classifies the type of step being executed:
| Number | Name | Meaning |
|---|---|---|
| 0 | STEP_TYPE_UNSPECIFIED | Step type not specified |
| 1 | STEP_TYPE_REGULAR_ACTION | Regular user actions (click, type, navigate) |
| 2 | STEP_TYPE_FORM_FILLING | Data entry or form completion |
| 3 | STEP_TYPE_VERIFICATION | Assertion or verification of expected state |
Key Takeaways
- Duration parsing: Always convert duration strings to numeric seconds for analysis
- Test identification: Use Test.id for tracking across runs, not titles
- Cache calculation: Exclude assertion steps for accurate cache hit rates
- Status interpretation: Focus on STATUS_SUCCEEDED (2) as the only true success state
- Trend analysis: Use rolling averages and percentile calculations for meaningful insights
- Failure categorization: Leverage explanation fields for automated failure triage