Testing

Frontier’s testing strategy is designed to ensure the reliability and accuracy of its real-time AI sales coaching. This involves a multi-layered approach, from focused unit tests to comprehensive synthetic call simulations, aimed at catching regressions and continuously improving quality.

Unit and Integration Testing

Unit and integration tests, powered by Vitest, form the foundation of our testing. They run locally via bun run test, and the suite includes dedicated integration tests that exercise interactions between components. Coverage is collected with Vitest’s v8 coverage provider.
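A coverage setup along the following lines would match the description above. It is a minimal sketch, not the repository's actual configuration: the reporter list and the include glob are assumptions. Vitest configs are usually wrapped in defineConfig from vitest/config for type checking; a plain object is used here so the fragment stands on its own.

```typescript
// vitest.config.ts (illustrative sketch; reporters and globs are assumptions)
const config = {
  test: {
    coverage: {
      // Collect coverage through V8's built-in instrumentation.
      // Vitest expects the @vitest/coverage-v8 package to be installed for this.
      provider: "v8",
      // "text" prints a summary in the terminal; "lcov" writes a report file
      // that external coverage dashboards can ingest.
      reporter: ["text", "lcov"],
      // Hypothetical source glob; adjust to the real package layout.
      include: ["src/**/*.ts"],
    },
  },
};

export default config;
```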

Code Coverage

We use a code coverage ratchet that is currently advisory rather than blocking. Pull requests are expected to reach at least 80% line coverage on changed files. The goal is to make testing a routine part of every change and to stabilize overall coverage so the ratchet can later become a hard gate. Overall coverage is currently about 63%, as reported by SonarCloud.
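The ratchet logic can be illustrated with a small check over the files a pull request touches. This is a sketch under stated assumptions: the type names, the checkChangedFiles function, and the enforce flag are hypothetical and are not taken from the actual CI tooling. Only the 80% threshold and the advisory behavior come from the text above.

```typescript
// Advisory coverage ratchet sketch. All names here are hypothetical.
interface FileCoverage {
  path: string;
  coveredLines: number;
  totalLines: number;
}

interface RatchetReport {
  passing: string[];
  belowThreshold: { path: string; percent: number }[];
  blocking: boolean;
}

// Percent line coverage expected on changed files.
const CHANGED_FILE_THRESHOLD = 80;

function linePercent(file: FileCoverage): number {
  // A file with no executable lines has nothing to cover, so treat it as fully covered.
  if (file.totalLines === 0) return 100;
  return (file.coveredLines * 100) / file.totalLines;
}

function checkChangedFiles(changed: FileCoverage[], enforce = false): RatchetReport {
  const report: RatchetReport = { passing: [], belowThreshold: [], blocking: false };
  for (const file of changed) {
    const percent = linePercent(file);
    if (percent >= CHANGED_FILE_THRESHOLD) {
      report.passing.push(file.path);
    } else {
      // Round to one decimal place for readable reporting.
      report.belowThreshold.push({ path: file.path, percent: Math.round(percent * 10) / 10 });
    }
  }
  // Advisory mode never blocks; a future hard gate would pass enforce = true.
  report.blocking = enforce && report.belowThreshold.length > 0;
  return report;
}

const report = checkChangedFiles([
  { path: "src/a.ts", coveredLines: 45, totalLines: 50 }, // 90%
  { path: "src/b.ts", coveredLines: 30, totalLines: 50 }, // 60%
]);
console.log(report.belowThreshold); // files a reviewer would be warned about
```

Keeping the enforcement decision in a single flag means that moving from advisory to blocking would not require changing how coverage is measured.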

End-to-End (E2E) Testing

End-to-end tests are conducted using Playwright, with six distinct configurations covering different parts of the system:

Root Smoke: Basic sanity checks across the application.
Dashboard: Our primary suite. It covers the Dashboard application in depth, uses Clerk to authenticate as various user roles, and runs in parallel across two continuous integration (CI) workers. Onboarding flows run serially to ensure data integrity.
Desktop: Electron tests are currently smoke-only, focusing on the basic functionality of the Desktop client without deep DOM interaction.
HUD: The HUD (Heads-Up Display) overlay is tested in a browser-only mode and is currently run locally, not within our continuous integration pipeline.
Live: Smoke tests for the Live web call application are run in CI, with a full suite pending further development (FRO-784).
Call-App: Tests for the shared call-app library mirror those for the Live application.
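The Dashboard suite's parallelism, as described in the list above, could be expressed in a Playwright configuration roughly like this. It is a sketch only: the file name, test directory, project names, and file patterns are assumptions. Playwright configs are normally wrapped in defineConfig from @playwright/test; a plain object is used here to keep the fragment self-contained.

```typescript
// playwright.dashboard.config.ts
// Illustrative only: file name, paths, and project names are assumptions.
const config = {
  testDir: "./e2e/dashboard",
  // Two workers on CI, as described above; locally Playwright picks its default.
  workers: process.env.CI ? 2 : undefined,
  projects: [
    {
      name: "dashboard",
      // Everything except the onboarding specs.
      testIgnore: /onboarding/,
      // Allow tests within a file to be spread across workers.
      fullyParallel: true,
    },
    {
      name: "onboarding",
      testMatch: /onboarding\/.*\.spec\.ts/,
      // Keep tests inside each onboarding file in declaration order.
      fullyParallel: false,
    },
  ],
};

export default config;
```

Note that fullyParallel: false only orders tests within a file. A strictly sequential flow usually also calls for test.describe.configure({ mode: 'serial' }) inside the spec itself.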

Reporting and Visual Regression

Test results are reported via Allure, with per-branch report previews published on Cloudflare Pages and history tracked in Cloudflare R2. Visual regression testing uses Chromatic against our Storybook components. The packages/ui library currently has 54 stories covered, with other packages queued (FRO-495).

Synthetic Call Testing

Synthetic call testing is a core strength of our strategy. The Simulation Studio, a tool in the debug panel, runs a full end-to-end call simulation in one click and also supports granular scenario building. It drives an API that covers scenarios such as starting a call, simulating a conversation, batching FAQ detections, and flowing through a script.

For more deterministic testing, we use a transcript replay corpus. A command-line utility posts simulated Recall.ai webhooks through our Call Server orchestration worker. A dedicated script then replays a corpus of question episodes to exercise edge cases in question canonicalization, checking flow and deduplication correctness against a Cloudflare D1-backed QuestionStore.
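The deduplication check that such a replay performs can be sketched as follows. Everything here is a stand-in: the canonicalization rule (lowercase, strip punctuation, collapse whitespace), the episode shape, and the in-memory store are assumptions made for illustration. The real QuestionStore is backed by D1 and its rules may differ.

```typescript
// Replay sketch: the canonicalization rule and the store are hypothetical stand-ins.
interface QuestionEpisode {
  callId: string;
  utterance: string;
}

// Assumed rule: lowercase, drop punctuation, collapse whitespace.
function canonicalize(utterance: string): string {
  return utterance
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, "")
    .replace(/\s+/g, " ")
    .trim();
}

// In-memory substitute for the D1-backed QuestionStore.
class InMemoryQuestionStore {
  private seen = new Map<string, number>();

  // Returns true when the question is new for this call, false for a duplicate.
  record(episode: QuestionEpisode): boolean {
    const key = `${episode.callId}:${canonicalize(episode.utterance)}`;
    const count = this.seen.get(key) ?? 0;
    this.seen.set(key, count + 1);
    return count === 0;
  }

  get size(): number {
    return this.seen.size;
  }
}

// Feed every episode through the store and collect the "is new" flags in order.
function replay(corpus: QuestionEpisode[], store: InMemoryQuestionStore): boolean[] {
  return corpus.map((episode) => store.record(episode));
}

const corpus: QuestionEpisode[] = [
  { callId: "call-1", utterance: "What's the price?" },
  { callId: "call-1", utterance: "  whats the PRICE  " }, // same question, different surface form
  { callId: "call-2", utterance: "What's the price?" },   // same question, different call
];
const store = new InMemoryQuestionStore();
const results = replay(corpus, store);
console.log(results, store.size); // [ true, false, true ] 2
```

Because the corpus is fixed, the expected flags can be asserted exactly, which is what makes this style of replay deterministic.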

Offline Model and Signal Evaluation

We employ offline evaluation tools to assess the quality and performance of our AI models and coaching signals:

model-probe: This is a standalone, developer-only Cloudflare Worker used for benchmarking model latency and validation scores across various LLM providers like Together AI, OpenAI, and Anthropic. It includes in-Worker validators for scoring signal correctness, such as intent validation, keyword detection, and quick-answer quality. This is for offline tuning and is not part of the live call-server production path.
eval-quick-answer.ts: A command-line interface (CLI) tool that directly exercises the CallChatAgent’s quick-answer path, covering both Phase 1 retrieval (FAQ/fast-model) and Phase 2 tool-calling. It runs against hardcoded evaluation cases to detect issues like wrong FAQ matches, hallucinations, and latency regressions.
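The kind of per-case check that eval-quick-answer.ts runs can be illustrated with a pure scoring function. This is a sketch, not the tool's code: the case shape, the field names, the failure labels, and the phrase-matching heuristic are all hypothetical. A missing required phrase is only a crude proxy for a hallucinated answer.

```typescript
// Offline eval sketch; case shape, field names, and thresholds are hypothetical.
interface EvalCase {
  question: string;
  expectedFaqId: string | null; // null means no FAQ should match
  requiredPhrases: string[];    // facts the answer must mention
  maxLatencyMs: number;
}

interface Observed {
  faqId: string | null;
  answer: string;
  latencyMs: number;
}

type Failure = "wrong_faq" | "missing_fact" | "latency_regression";

// Compare one observed result against its hardcoded expectation.
function scoreCase(evalCase: EvalCase, observed: Observed): Failure[] {
  const failures: Failure[] = [];
  if (observed.faqId !== evalCase.expectedFaqId) failures.push("wrong_faq");

  // Heuristic: case-insensitive substring match on each required phrase.
  const answer = observed.answer.toLowerCase();
  const hasAllFacts = evalCase.requiredPhrases.every((p) => answer.includes(p.toLowerCase()));
  if (!hasAllFacts) failures.push("missing_fact");

  if (observed.latencyMs > evalCase.maxLatencyMs) failures.push("latency_regression");
  return failures;
}

const pricingCase: EvalCase = {
  question: "How is pricing structured?",
  expectedFaqId: "pricing-tiers",
  requiredPhrases: ["per seat"],
  maxLatencyMs: 800,
};

const failures = scoreCase(pricingCase, {
  faqId: "pricing-tiers",
  answer: "Pricing is billed per seat, monthly.",
  latencyMs: 950,
});
console.log(failures); // [ 'latency_regression' ]
```

Keeping the scoring separate from the agent call lets the same cases be re-scored against recorded outputs without invoking a model.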

Current Gaps and Priorities

While our synthetic call simulation is a significant strength, enabling deterministic end-to-end validation without manual setup, there are recognized areas for improvement:

E2E Breadth: End-to-end testing remains thin in critical areas. The HUD E2E suite is not yet integrated into continuous integration, and Desktop E2E tests are limited to smoke tests (without full DOM interaction). A full E2E suite for the Live web call app is pending.
Human-Labelled Signal Evaluation: There is currently no human-labelled ground-truth evaluation for the quality of our coaching signals. The existing model-probe and eval-quick-answer.ts tools rely on synthetic or heuristic scoring. Building a robust framework for this evaluation is a prioritized roadmap item.
Critical 0%-Coverage Files: Several high-risk components, including the CallChatAgent, QuickAnswerAgent, WebSocketRelay, and script-completion-worker (within the Call Server), currently have 0% test coverage. Addressing these is a high priority.
Future Testing Paradigms: We plan to roll out mutation testing (e.g., Stryker) and property-based testing (e.g., fast-check) to further enhance test robustness.
Flakiness: Transient failures in E2E tests often stem from external factors, such as Vercel preview environments, rather than from the product itself. We mitigate these with resilient polling mechanisms.
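The resilient polling mentioned under Flakiness can be sketched as a helper that retries a probe until it succeeds or a deadline passes. The pollUntil name, its options, and the simulated probe are hypothetical. The actual tests may instead rely on built-in facilities such as Playwright's expect.poll.

```typescript
// Generic polling helper; the name and option shape are hypothetical.
interface PollOptions {
  timeoutMs: number;
  intervalMs: number;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Calls probe repeatedly until it yields a value, retrying on undefined or on a thrown error.
async function pollUntil<T>(
  probe: () => Promise<T | undefined>,
  { timeoutMs, intervalMs }: PollOptions,
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  let lastError: unknown;
  while (true) {
    try {
      const value = await probe();
      if (value !== undefined) return value;
    } catch (error) {
      // Transient failures are remembered and retried until the deadline.
      lastError = error;
    }
    // Stop if another wait would overshoot the deadline.
    if (Date.now() + intervalMs > deadline) break;
    await sleep(intervalMs);
  }
  throw new Error(`pollUntil timed out after ${timeoutMs}ms (last error: ${String(lastError)})`);
}

// Simulated preview environment that becomes ready on the third probe.
let attempts = 0;
const ready = pollUntil(
  async () => (++attempts >= 3 ? "ready" : undefined),
  { timeoutMs: 1000, intervalMs: 10 },
);
ready.then((status) => console.log(status, attempts)); // ready 3
```

Bounding the wait with a deadline rather than a fixed retry count keeps the worst-case test duration predictable when an external environment is slow to come up.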