
Reliability & operations

Frontier’s real-time AI sales coaching runs on a distributed, edge-native architecture that prioritizes resilience and responsiveness. This page outlines the operational principles and components that keep the system available and reliable when external providers or internal components fail. It covers graceful degradation, observability, and disaster recovery, and is written for developers orienting themselves before reading the source code.

Frontier employs several strategies to ensure the system remains operational and provides a positive user experience even when dependencies experience issues. The core principle is to “fail open” or degrade gracefully rather than block the real-time coaching flow.

  • FAQ Relevance Gating: The system’s FAQ relevance gate is designed to “fail open.” On any model error, timeout, unparseable response, or missing transcript context, all relevant FAQs are surfaced to the sales representative rather than being suppressed. This ensures coaching suggestions are never silenced due to an outage.
  • LLM Timeouts: Live question detection uses an AbortController with a configurable OPENAI_CALL_TIMEOUT_MS (default 10 seconds). On timeout or error, the question-only path logs the failure and returns null, so the call continues without detection.
  • Quick-Answer Flow: The two-phase quick-answer flow is built with graceful degradation. It first presents an initial card (FAQ match, transcript-grounded answer, or a safe “Bridge”). The full knowledge stream is always run afterward. If no FAQ, transcript, or script context is available, the decider is skipped, and a safe “Bridge” is shown, preventing dead ends for the user.
  • Externalized Operational Timeouts: Key operational timeouts and thresholds are externalized to environment variables, including OPENAI_CALL_TIMEOUT_MS (10s), WORKERS_AI_TIMEOUT_MS (8s), ALARM_PROCEED_DELAY_MS (10s), CHAT_THREAD_RETRY_DELAY_MS (300ms), and KEYWORD_SIMILARITY_THRESHOLD (0.9).
  • Embedding Calls: AI embedding calls for the Cloudflare Workers AI provider are wrapped in a timeout-and-fallback mechanism. If the primary provider times out, the system falls back to OpenRouter. The operation rejects only if both the primary and fallback providers fail.
  • OpenRouter Retries: OpenRouter embedding calls include explicit retry logic with backoff. Up to 2 retries (3 total attempts) are performed on HTTP 503 (unavailable), 429 (rate limit), and 529 (provider overloaded) responses, honoring the Retry-After header when present, or defaulting to a 2-second delay.
  • Knowledge Backend Resolution: The knowledge backend resolves at request time across multiple backends (e.g., Supermemory, Cloudflare AI Search, Pinecone). It can swallow configuration store (Cloudflare KV) failures, returning default settings as a fallback. If the Cloudflare R2 bucket binding for AI Search is absent, it defaults to Pinecone; otherwise, it defaults to AI Search.
  • Cloudflare AI Search Errors: When Cloudflare AI Search responds with error 1031, the integration returns empty results instead of throwing. This prevents application crashes and lets retrieval degrade gracefully.
  • Transcription Drop: A known failure pattern involves Recall.ai sending partial transcript data but never a final transcript.data message. Since only final messages are persisted to the database, these words are lost. While detection and integrity tooling exist, the upstream drop is not preventable by the Call Server.
  • Recall Real-time WebSocket: The Recall.ai real-time WebSocket client aggressively retries on connection failure (30 times with 3-second backoff), and the real-time webhook POST path is a supported fallback.
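
The OpenRouter retry policy described in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not Frontier's implementation: fetchWithRetry, retryDelayMs, and the injectable sleep parameter are hypothetical names, and the sketch only parses Retry-After values given in seconds.

```typescript
// Illustrative sketch only: function and constant names are hypothetical.
const RETRYABLE_STATUSES = new Set<number>([429, 503, 529]);
const MAX_RETRIES = 2; // 2 retries = 3 total attempts
const DEFAULT_DELAY_MS = 2000;

type Sleep = (ms: number) => Promise<void>;

// Reads Retry-After as a number of seconds. Missing, empty, or non-numeric
// values (including HTTP-date values) fall back to the 2-second default,
// which is a simplifying assumption of this sketch.
function retryDelayMs(res: Response): number {
  const header = res.headers.get("Retry-After");
  if (header === null || header.trim() === "") return DEFAULT_DELAY_MS;
  const seconds = Number(header);
  return Number.isFinite(seconds) && seconds >= 0 ? seconds * 1000 : DEFAULT_DELAY_MS;
}

async function fetchWithRetry(
  attempt: () => Promise<Response>,
  sleep: Sleep = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<Response> {
  let res = await attempt();
  for (let retry = 0; retry < MAX_RETRIES && RETRYABLE_STATUSES.has(res.status); retry++) {
    await sleep(retryDelayMs(res));
    res = await attempt();
  }
  return res; // the caller decides how to handle a response that is still failing
}
```

Passing the request as a closure and the delay as an injectable function keeps the policy independent of any particular provider client and makes it testable without real waiting.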

Frontier relies on a multi-tiered observability stack to monitor system health, performance, and errors.

  • Centralized Error Reporting: The error-capture.ts utility is the central point for error reporting within the Call Server.
    • All errors are logged to Axiom (for debugging and analysis) regardless of severity.
    • Only errors categorized as ‘high’ or ‘critical’ severity are forwarded to Sentry (for alerting), to prevent alert fatigue.
    • Structured capture helpers (e.g., captureSystemError for critical, captureBusinessError for high) are provided and do not throw, allowing callers to decide error handling.
  • Axiom for Logging and Metrics: Axiom is the primary logging and metrics backend.
    • Deployed Cloudflare Workers ship structured JSON console logs to Axiom via Cloudflare Workers Observability.
    • Durable Object and WebSocket paths POST logs to a dedicated logging worker, which then ingests them into Axiom using the AXIOM_TOKEN.
    • Logs and traces are stored in separate Axiom datasets (e.g., frontier-call-server-logs-prd and frontier-call-server-traces-prd).
    • Each Sentry event includes an axiom context with a one-click deep-link to the full call log in Axiom Explorer.
  • Sentry for Alerting: Sentry is integrated into the Call Server’s Cloudflare Workers via the official @sentry/cloudflare SDK.
    • It uses a dedicated Sentry project, frontierx-call-server, configured with a DSN from the SENTRY_DSN secret.
    • Sentry reporting is non-throwing by design, so a failed report cannot close a Durable Object WebSocket connection. The reportError helper wraps capture in a try/catch and then calls Sentry.flush(2000) without awaiting it. The explicit flush is needed because withSentry only flushes when fetch() returns, which here is the WebSocket upgrade, before any in-call errors occur.
    • Sentry issues are fingerprinted by stable identifiers (error name, worker name, error slug), not by call_id, to group similar errors effectively.
  • Alerting Transport: NTFY is provisioned as an alerting transport only at the configuration/secrets level (NTFY_TOKEN, NTFY_DEV_ALERTS_TOPIC). It is not yet wired into any source code for automated alerts.
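
The severity routing described above (log everything, alert only on ‘high’ and ‘critical’, never throw) can be sketched as follows. This is not the contents of error-capture.ts: captureError, the Sinks interface, and the ‘low’ and ‘medium’ levels are assumptions made for illustration, and the sinks stand in for the real Axiom and Sentry clients.

```typescript
// Illustrative sketch only: this is not the contents of error-capture.ts.
// The "low" and "medium" levels and all names below are assumptions.
type Severity = "low" | "medium" | "high" | "critical";

interface CapturedEvent {
  name: string;
  message: string;
  severity: Severity;
  fingerprint: string[]; // stable identifiers only, never call_id
}

interface Sinks {
  logToAxiom(event: CapturedEvent): void;
  sendToSentry(event: CapturedEvent): void;
}

const ALERTING_SEVERITIES = new Set<Severity>(["high", "critical"]);

function captureError(
  err: unknown,
  opts: { severity: Severity; workerName: string; slug: string },
  sinks: Sinks,
): void {
  const error = err instanceof Error ? err : new Error(String(err));
  const event: CapturedEvent = {
    name: error.name,
    message: error.message,
    severity: opts.severity,
    fingerprint: [error.name, opts.workerName, opts.slug],
  };
  // Every error is logged; a failing sink must never propagate to the caller.
  try { sinks.logToAxiom(event); } catch { /* swallow */ }
  if (ALERTING_SEVERITIES.has(opts.severity)) {
    try { sinks.sendToSentry(event); } catch { /* swallow */ }
  }
}
```

Because the helper never throws, callers remain free to rethrow, return an error response, or continue, which matches the stated design of leaving error handling to the caller.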

Observability Blind Spots and Known Issues

  • Logging Worker SPOF: The dedicated logging worker is a silent-failure Single Point of Failure (SPOF). It returns HTTP 200 even when Axiom ingest fails. If the AXIOM_TOKEN expires, all Durable Object logs are silently dropped with no alert, leading to observability blindness during incidents. A proposed fix (failure counter + canary heartbeat) is not yet implemented.
  • Background Work Blind Spot: If promises within ctx.waitUntil reject and no code in that path explicitly calls a capture helper, the failure does not reach Sentry. This is particularly critical as thin webhook ingress deliberately pushes Recall.ai processing (including transcript persistence) into ctx.waitUntil. A silent ctx.waitUntil failure can lead to transcript loss without alerting.
  • Sentry Trace Propagation (Audit Finding): The OpenTelemetry trace propagation configuration is inverted: it excludes *.workers.dev targets rather than including them, so cross-worker trace chains are broken.
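
One way to close the background-work blind spot is to wrap everything passed to ctx.waitUntil so that a rejection is always reported. The sketch below is a possible mitigation under stated assumptions, not existing Frontier code: guardedWaitUntil and the Capture callback are hypothetical names, and the capture callback stands in for whichever capture helper the caller chooses.

```typescript
// Illustrative sketch only: guardedWaitUntil and Capture are hypothetical names.
interface WaitUntilContext {
  // Same shape as the waitUntil method on the Workers execution context.
  waitUntil(promise: Promise<unknown>): void;
}

type Capture = (err: unknown, label: string) => void;

function guardedWaitUntil(
  ctx: WaitUntilContext,
  label: string,
  work: () => Promise<unknown>,
  capture: Capture,
): void {
  ctx.waitUntil(
    (async () => {
      try {
        await work();
      } catch (err) {
        // Report explicitly; without this the rejection never reaches alerting.
        try { capture(err, label); } catch { /* reporting must not throw */ }
      }
    })(),
  );
}
```

The label identifies which background task failed (for example, transcript persistence), so the resulting report can be grouped by a stable identifier rather than by call.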

Frontier’s backend consists of an orchestrated set of Cloudflare Workers deployed in a specific sequence.

  • Deployment Units: The Call Server is not a single Cloudflare Worker but an orchestrated set, including a main worker hosting Agents-SDK Durable Objects, plus companion workers for transcript orchestration, question/FAQ/script detection, and logging. These are wired together by service bindings.
  • Orchestrated Deployment: The Cloudflare Workers are deployed in a strict order: the logging worker first (as Durable Objects depend on it for immediate RPC logging), then the detection workers (script-completion, faq-detection, question-detection, transcript-orchestration), and finally the main CallAgent Worker (which depends on the others via service bindings). The deployment script (deploy.sh) enforces this order.
  • Environments: Cloudflare D1 databases are provisioned per environment (dev/stg/demo/prd). The Release Candidate (RC) environment shares the staging database (call-agent-db-stg).
  • Workers (blue/green): new versions deploy at 0% traffic, smoke-test, then promote; failed smoke tests never shift traffic. Cloudflare Workers version rollback is the documented recovery path (flip a wrangler variable + redeploy, ~2 min; or wrangler rollback). Teams record the known-good version SHA before each migration step.
  • Durable Object migrations — the one thing you can’t simply roll back: a deploy that changes Durable Object classes/storage can’t be cleanly reverted by a version rollback (the storage migration has already applied). This is the main deploy risk and is handled with extra care (record known-good SHA; stage DO changes deliberately).
  • AI layer: multi-provider via the Vercel AI SDK + the ‘frontier-ai’ gateway; embeddings fall back Cloudflare AI → OpenRouter.
  • Knowledge base: retrieval backend is runtime-selectable and falls back (Supermemory → Cloudflare AI Search); KV config failures resolve to defaults.
  • Data: backups / DR for Supabase and D1 are a GAP (see below).
  • Today (honest): there is no formal on-call rotation or status page; the team often learns about staging/production breakage from users and QA. Alerting transports (NTFY) are provisioned but not yet wired, and SLOs are spec-only.
  • Planned: introduce incident.io for incident declaration, on-call paging, and coordination; wire NTFY/Sentry alerts to it; publish a status page; and stand up the SLO monitors defined in the hardening epic. (Founder supplies the rollout timeline.)
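
The knowledge-base item above notes that KV config failures resolve to defaults. A minimal sketch of that request-time resolution, assuming the default depends on whether the AI Search bucket binding is present, could look like the following. The binding names CONFIG_KV and AI_SEARCH_BUCKET, the knowledge_backend key, and the function names are hypothetical.

```typescript
// Illustrative sketch only: binding names and the KV key are hypothetical.
const BACKENDS = ["supermemory", "ai-search", "pinecone"] as const;
type KnowledgeBackend = (typeof BACKENDS)[number];

interface ConfigStore {
  get(key: string): Promise<string | null>; // minimal subset of a KV read API
}

interface BackendEnv {
  CONFIG_KV: ConfigStore;
  AI_SEARCH_BUCKET?: unknown; // R2 bucket binding; may be absent
}

function isBackend(value: string | null): value is KnowledgeBackend {
  return value !== null && (BACKENDS as readonly string[]).includes(value);
}

// Without the bucket binding, AI Search cannot be used, so fall back to Pinecone.
function defaultBackend(env: BackendEnv): KnowledgeBackend {
  return env.AI_SEARCH_BUCKET === undefined ? "pinecone" : "ai-search";
}

async function resolveKnowledgeBackend(env: BackendEnv): Promise<KnowledgeBackend> {
  try {
    const configured = await env.CONFIG_KV.get("knowledge_backend");
    if (isBackend(configured)) return configured;
  } catch {
    // A config-store failure is swallowed on purpose: fall through to defaults.
  }
  return defaultBackend(env);
}
```

Resolving per request means a configuration change takes effect without a redeploy, and a config-store outage degrades to a predictable default instead of failing retrieval.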

Known Single Points of Failure (SPOFs) and Critical Dependencies

  • Single Deployment Unit / Coupled Workers: The entire set of Call Server workers (main CallAgent Worker, Durable Objects, and companion workers) is deployed together as a single, interdependent unit with strict service-binding ordering. A failed or partial deploy, or an issue in a sub-worker, can break the entire chain.
  • Logging Worker Silent Failure: As noted above, the logging worker silently returns HTTP 200 even when Axiom ingest fails. This creates an observability blind spot where all Durable Object logs could be dropped without alerting.
  • RC Shares Staging D1: The Release Candidate environment shares its Cloudflare D1 database instance (call-agent-db-stg) with the staging environment. This means a corruption or migration incident in either environment could affect both. Production D1 is isolated.
  • WebSocket Relay Durable Object: The WebSocketRelayAgent Durable Object explicitly disables hibernation (hibernate: false) so that it stays awake and holds the Recall.ai real-time WebSocket connection open, which prevents abnormal-closure issues. The choice is deliberate, but it makes this core relay a critical component.
  • Upstream Transcript Loss: The reliance on Recall.ai for initial meeting audio capture means that if Recall.ai sends partial transcripts but never a final one for a segment, those words are permanently lost and unrecoverable by the Call Server. This is an inherent reliability dependency on the upstream provider.

Frontier is actively investing in hardening its reliability and observability, with several key initiatives currently in progress or planned:

  • call-server-hardening Epic: Many reliability features such as generalized retry logic, externalized timeouts, race condition fixes, error-response consistency, structured logger unification, alerting, and Service Level Objectives (SLOs) are part of the call-server-hardening epic, currently in the ‘Planning’ status. While some specific features like timeout externalization have shipped, others are still design proposals.
  • Worker Callback Authentication: At the time of a recent audit, authentication for internal worker callbacks was temporarily disabled for debugging, and Recall.ai webhooks lacked idempotency checks. Re-enabling these security and reliability measures is tracked as FRO-532 and FRO-533.
  • Direct Audio Capture Migration: Frontier is migrating to direct microphone and speaker capture from the Desktop client for complete separation of rep and customer audio feeds. The architecture and code for this USE_DIRECT_AUDIO path (streaming directly to Deepgram) are present, but production readiness and full rollout are still in progress. DirectDeepgramSTT is described as a ‘single hop, stable fallback’ and is currently used primarily in development environments.