Evaluation plan
Frontier’s evaluation plan outlines how we measure the performance and reliability of our real-time coaching system, identifying both current capabilities and future investments to ensure high-quality and resilient service. Our focus is on the integrity of the real-time processing loop, the accuracy of the coaching signals, and the overall robustness of the platform.
Measuring Coaching Quality
While our systems currently validate the flow and basic accuracy of our real-time coaching, a comprehensive, human-labeled evaluation of coaching signal quality is a key area for future investment.
Today, we use a combination of tools and approaches to assess our coaching:
- Record/Replay for Flow Correctness: We have a record/replay system that processes historical data through our logic for detecting and canonicalizing questions. This system ensures that our pipeline correctly handles deduplication, episode matching, and persistence, confirming the integrity of the data flow. Similarly, a separate record/replay path validates the end-to-end integrity of transcript delivery and logging. These systems verify pipeline logic and flow, but not the subjective quality of the coaching output itself.
- Heuristic Model Evaluation: For specific coaching signals, such as quick answers, we employ an offline command-line tool. This tool evaluates the quick-answer path against a set of predefined cases, detecting issues like incorrect FAQ matches, hallucinations, and latency regressions. It scores runs based on expected and forbidden terms, and it checks if the system selects the correct FAQ or provides a safe ‘Bridge’ response for out-of-domain questions.
- Model Probing: A dedicated, developer-only service allows us to benchmark the latency and validation scores of various large language models (LLMs) from different providers. This service contains in-worker validators that score signal correctness (e.g., intent, keyword presence, quick-answer quality) for offline tuning and prompt experimentation.
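As an illustration of the expected/forbidden-term scoring described under Heuristic Model Evaluation, the following is a minimal sketch and not the actual tool. The `EvalCase` type, the `score_answer` function, and the sample case are all hypothetical names and data introduced for this example; the real tool also checks FAQ selection and latency, which are omitted here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EvalCase:
    """One hypothetical predefined case for the quick-answer path."""
    question: str
    expected_terms: List[str] = field(default_factory=list)
    forbidden_terms: List[str] = field(default_factory=list)


def score_answer(case: EvalCase, answer: str) -> Dict[str, object]:
    """Score an answer by case-insensitive substring checks.

    A run passes only if every expected term appears and no
    forbidden term appears (assumed pass criterion for this sketch).
    """
    text = answer.lower()
    missing = [t for t in case.expected_terms if t.lower() not in text]
    leaked = [t for t in case.forbidden_terms if t.lower() in text]
    return {"passed": not missing and not leaked,
            "missing": missing,
            "leaked": leaked}


# Hypothetical usage: a pricing question with one forbidden claim.
case = EvalCase(
    question="What does the plan cost?",
    expected_terms=["$49", "per seat"],
    forbidden_terms=["free forever"],
)
good = score_answer(case, "It is $49 per seat, billed monthly.")
bad = score_answer(case, "It is free forever!")
```

Substring matching keeps cases easy to author, at the cost of missing paraphrases; that trade-off is one reason this kind of check complements, rather than replaces, human-labeled evaluation.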
System Reliability and Efficiency
Maintaining a reliable and efficient real-time system is paramount. We continuously monitor and improve our platform’s resilience and performance.
- Observability: We use Sentry for error tracking and Axiom for comprehensive logging and tracing. All errors are logged to Axiom, while only high-severity errors are forwarded to Sentry to prevent alert fatigue. Each Sentry error event includes a deep-link to the full call log in Axiom for rapid debugging.
- Operational Resilience: Our system incorporates graceful degradation for key services. For example, live question detection returns no detection on an LLM timeout or error, so the call continues uninterrupted rather than failing. Timeouts for external API calls (e.g., to LLM providers) are externalized as tunable configuration with sensible defaults.
- Transcript Integrity: We monitor transcription reliability closely. A gap-analysis script classifies transcript gaps as silence or dropped speech (meaning speech was detected by Recall.ai as a partial but no final transcript segment was received), and a storage integrity harness verifies that all transcript data is successfully delivered to our database. This monitoring highlights our dependency on third-party transcription providers. We are actively working on a Direct Audio path that moves rep-speech transcription from Recall.ai to a direct Deepgram integration within the HUD window, which will improve speaker attribution and reduce reliance on an intermediary.
- Efficiency Investments: Our roadmap includes significant investment in LLM quality. This project focuses on optimizing model speed and accuracy for various tasks, including detection, answering, and FAQ extraction, to make the computationally expensive parts of our real-time pipeline more efficient.
- Provider Downtime Mitigation: We are actively planning to mitigate provider downtime by implementing features such as incident communication banners, a provider health dashboard, and enhanced failover strategies, including regional routing for services like Recall.ai.
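The graceful-degradation behavior described under Operational Resilience can be sketched as follows. This is a minimal illustration under stated assumptions, not the production code: the function name `detect_question_safely`, the environment variable `DETECTION_TIMEOUT_S`, and its 2.0-second default are all hypothetical, and the LLM call is passed in as a generic async callable.

```python
import asyncio
import os
from typing import Awaitable, Callable, Optional

# Hypothetical variable name and default; the real configuration may differ.
DETECTION_TIMEOUT_S = float(os.environ.get("DETECTION_TIMEOUT_S", "2.0"))


async def detect_question_safely(
    call_llm: Callable[[str], Awaitable[Optional[str]]],
    utterance: str,
    timeout_s: float = DETECTION_TIMEOUT_S,
) -> Optional[str]:
    """Return the detected question, or None on timeout or provider error."""
    try:
        return await asyncio.wait_for(call_llm(utterance), timeout=timeout_s)
    except Exception:
        # Covers asyncio.TimeoutError and provider errors alike:
        # report "no detection" so the live call keeps going.
        return None


# Hypothetical stand-ins for an LLM provider call.
async def _slow(utterance: str) -> Optional[str]:
    await asyncio.sleep(1)
    return utterance


async def _failing(utterance: str) -> Optional[str]:
    raise RuntimeError("provider error")


async def _ok(utterance: str) -> Optional[str]:
    return utterance


timed_out = asyncio.run(detect_question_safely(_slow, "How much is it?", timeout_s=0.01))
errored = asyncio.run(detect_question_safely(_failing, "How much is it?"))
detected = asyncio.run(detect_question_safely(_ok, "How much is it?"))
```

Treating a timeout and an error identically keeps the caller simple: it only ever sees a detection or `None`. In practice the swallowed exception would still be logged so that failures remain visible in observability tooling.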
Security and Compliance
As we scale, ensuring strong security and data compliance is a core focus.
- SOC-2 Path: With additional funding, a significant part of our future investment will go toward our Compliance & Certification project. This project explicitly includes pursuing SOC-2 certification, along with addressing GDPR, user access/deletion requests (DSAR), and data sovereignty/multi-region capabilities.