Risks we're managing

Frontier, our real-time AI sales-call coaching platform, carries technical and operational risks. This page outlines the key risks we actively manage and how we mitigate them, distinguishing measures that are in place today from those that are planned and from known gaps.

Third-Party Vendor Risk on the Live Call

Frontier relies on several external services for critical live-call functionality, including meeting capture, speech-to-text (STT), and AI inference. Downtime or performance degradation from any of these providers could impact coaching quality or availability.

The table below rates each data-path dependency by how critical it is to a live call and what mitigates it. The honest headline: we are heavily concentrated on Cloudflare (it is the runtime, data layer, and part of the AI layer), while most other vendors are individually replaceable.

Vendor	Criticality	Mitigation / fallback
Cloudflare (Workers, Durable Objects, D1, R2, KV, AI, AI Search)	Critical — concentration risk	The whole live-call runtime + data layer. No fallback today; a regional Cloudflare outage degrades the live pipeline. Mitigation is largely Cloudflare’s own resilience (planned: more explicit multi-region posture).
Recall (Desktop SDK)	High	Sole meeting-capture path today. Planned: Recall multi-region routing/failover + the direct mic+speaker capture path (reduces reliance on Recall for transcription).
Deepgram (STT)	High	No verified production STT failover today (a dev-only direct path exists). Planned: a second STT provider behind the transcription abstraction.
LLM providers (Anthropic / OpenAI / Google / Together)	Medium	Mitigated — multi-provider via the Vercel AI SDK + the ‘frontier-ai’ AI Gateway; no single-vendor lock-in.
Supabase	Medium	Config/jobs/realtime (off the hot path). Backups/DR = GAP.
Supermemory (KB)	Low	Interim bridge only; falls back to Cloudflare AI Search.
Pinecone (KB, legacy)	Low	Legacy path; superseded by AI Search.
Inngest / Clerk / Axiom / Sentry / PostHog	Medium–Low	Standard managed dependencies; individually replaceable.

If a vendor goes down (failure modes)

A blunt walk-through of what actually happens, today, if each dependency fails mid-operation:

Vendor down	What breaks	In-progress call survives?	Failover today → planned
Cloudflare	The entire live pipeline (runtime, data, AI routing)	No	None today → explicit multi-region posture is a future investment
Clerk	New logins / token refresh	Yes — an in-progress call keeps running on its already-issued token (the live path checks the token locally); the rep can’t re-login until Clerk returns	None needed mid-call; startup gates on Clerk
Recall (Desktop SDK)	On-device audio capture	No (no capture = no coaching)	None → multi-region Recall + the direct mic/speaker path
Deepgram	Live transcription (coaching goes “blind”)	App stays up; coaching signal stops	None today → a second STT provider behind the abstraction
An LLM provider	Answer generation for that provider	Partial — coaching continues; answers fail until switched	Multi-provider exists, but switching is static (re-deploy) today → dynamic, performance-based failover is the plan
Supermemory / a KB backend	Knowledge retrieval (answers lose grounding)	Yes, degraded	Runtime-selectable backends; falls back to AI Search (operator-flipped today, not per-request)
Supabase	Login, config writes, post-call jobs	Yes — the hot path runs on Cloudflare; config reads are cached	Inngest retries post-call work; backups/DR = GAP
Inngest	Post-call summaries, KB indexing, bot/calendar jobs (async)	Yes	Events queue and retry; summaries delayed, not lost
Axiom / Sentry / PostHog	Observability only	Yes	No user impact (we lose visibility, not function)

In short: Cloudflare is the one true single point of failure (no failover); most other vendors degrade gracefully or are individually replaceable; and dynamic LLM/STT failover is planned, not yet built. One nuance: a Clerk outage does not drop a live call because the hot path checks tokens locally; see Trust → Security for the validation-consistency caveat.

Mitigations for LLM Providers (current):
- A Cloudflare AI Gateway routes AI model calls, logging cost and latency.
- OpenRouter serves as a fallback provider for generating embeddings. If the primary provider, Cloudflare AI, times out or fails, embedding calls are automatically routed through OpenRouter, with explicit retry logic for transient errors.
- Frontier employs a multi-provider strategy for live AI inference, leveraging the Vercel AI SDK to integrate models from Anthropic, Google, OpenAI, and Together AI. This allows for flexibility and reduces reliance on any single provider.
- The system is designed with graceful degradation, meaning non-critical AI failures (e.g., an AI Search error) will not crash the live call experience.
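
The embedding fallback described in the list above can be sketched as follows. This is a minimal illustration, not Frontier's implementation: the names `Embedder`, `RetryPolicy`, `embedWithRetry`, and `embedWithFallback` are hypothetical, and the choice to retry transient errors against each provider before moving on is an assumption, since this page does not say where the retry logic sits.

```typescript
// Hypothetical signature for an embedding client (primary or fallback).
type Embedder = (text: string) => Promise<number[]>;

interface RetryPolicy {
  maxRetries: number;                      // extra attempts after the first call
  backoffMs: number;                       // base delay, multiplied by the attempt number
  isTransient: (err: unknown) => boolean;  // e.g. timeouts or 5xx responses
  sleep: (ms: number) => Promise<void>;    // injected so tests need not wait
}

// Call one provider, retrying only errors the policy classifies as transient.
async function embedWithRetry(embed: Embedder, text: string, policy: RetryPolicy): Promise<number[]> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await embed(text);
    } catch (err) {
      if (!policy.isTransient(err) || attempt >= policy.maxRetries) throw err;
      await policy.sleep(policy.backoffMs * (attempt + 1));
    }
  }
}

// Try the primary provider first; route to the fallback only if it still fails.
async function embedWithFallback(
  primary: Embedder,
  fallback: Embedder,
  text: string,
  policy: RetryPolicy,
): Promise<{ provider: "primary" | "fallback"; vector: number[] }> {
  try {
    return { provider: "primary", vector: await embedWithRetry(primary, text, policy) };
  } catch {
    return { provider: "fallback", vector: await embedWithRetry(fallback, text, policy) };
  }
}
```

Returning the provider label alongside the vector makes it possible to count how often the fallback is actually used, which is one way to notice a degraded primary before it becomes an incident.
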
Mitigations for LLM Providers (planned):
- Performance-based model failover is planned to automatically switch to the next-fastest provider or model if a currently active one slows down. This capability is not yet built.
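
Since this capability is not built, the following is only a sketch of one possible selection rule, under stated assumptions: latency is tracked as a rolling window per provider, "slows down" means the rolling average exceeds a fixed threshold, and the names `LatencyLog`, `recordLatency`, `averageLatency`, and `chooseProvider` are all hypothetical.

```typescript
// Rolling latency samples per provider label (labels are illustrative).
type LatencyLog = Map<string, number[]>;

function recordLatency(log: LatencyLog, provider: string, ms: number, windowSize = 5): void {
  const samples = log.get(provider) ?? [];
  samples.push(ms);
  if (samples.length > windowSize) samples.shift(); // keep only the most recent window
  log.set(provider, samples);
}

function averageLatency(log: LatencyLog, provider: string): number | undefined {
  const samples = log.get(provider);
  if (!samples || samples.length === 0) return undefined;
  return samples.reduce((sum, ms) => sum + ms, 0) / samples.length;
}

// Keep the active provider unless its rolling average exceeds the threshold;
// then switch to the fastest alternative that has data and is actually faster.
function chooseProvider(
  active: string,
  candidates: string[],
  log: LatencyLog,
  slowThresholdMs: number,
): string {
  const activeAvg = averageLatency(log, active);
  if (activeAvg === undefined || activeAvg <= slowThresholdMs) return active;

  let best = active;
  let bestAvg = activeAvg;
  for (const candidate of candidates) {
    const avg = averageLatency(log, candidate);
    if (avg !== undefined && avg < bestAvg) {
      best = candidate;
      bestAvg = avg;
    }
  }
  return best;
}
```

A real policy would likely also need some hysteresis (for example, a minimum dwell time after a switch) so that two providers with similar latency do not cause flapping; that is left out here for brevity.
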
Mitigations for Speech-to-Text (current):
- Production live speech-to-text utilizes Deepgram’s Nova-3 model, served via Cloudflare Workers AI.
- A fallback that connects directly to Deepgram’s WebSocket for raw audio transcription exists, but it is gated to development environments and is not a production failover.
Mitigations for Meeting Capture (current):
- Recall.ai is integrated via its Desktop SDK, which runs on the sales representative’s machine to capture meeting audio.
- Recall.ai’s real-time WebSocket client includes aggressive retry mechanisms (up to 30 attempts with a 3-second backoff) on connection failures, and a real-time webhook POST path is supported as a fallback.
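
The reconnect behaviour described above (up to 30 attempts, 3 seconds apart) can be sketched as a generic helper. This is an illustration only: `connectWithRetry` and `RetryOptions` are hypothetical names, and treating the "3-second backoff" as a fixed delay between attempts is an assumption.

```typescript
interface RetryOptions {
  maxAttempts: number;                    // total attempts, including the first
  backoffMs: number;                      // fixed wait between attempts
  sleep?: (ms: number) => Promise<void>;  // injectable for tests
}

const realSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry `connect` until it succeeds or the attempt budget is spent.
async function connectWithRetry<T>(
  connect: () => Promise<T>,
  opts: RetryOptions = { maxAttempts: 30, backoffMs: 3000 },
): Promise<T> {
  const sleep = opts.sleep ?? realSleep;
  let lastError: unknown = new Error("maxAttempts must be at least 1");
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      return await connect();
    } catch (err) {
      lastError = err;
      // Wait before the next attempt, but not after the final one.
      if (attempt < opts.maxAttempts) await sleep(opts.backoffMs);
    }
  }
  throw lastError;
}
```

Under the fixed-delay assumption, the worst case is 29 waits of 3 seconds, so roughly 87 seconds of waiting (plus connection time) pass before the client gives up.
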
Mitigations for Meeting Capture (planned):
- Frontier plans to implement Recall.ai regional routing and failover (US/EU) to enhance resilience against regional outages.
- There is a migration underway to shift transcription from Recall.ai to a direct microphone and speaker capture path. The desktop application will capture system and microphone audio directly and stream it to Deepgram for transcription. This aims to improve speaker attribution and reduce reliance on Recall.ai for the transcription step.
Gaps:
- Frontier currently relies on Recall.ai as the sole meeting capture vendor, with no alternative bot provider integrated at the abstraction layer.
- Frontier does not yet have a verified production-level STT provider failover or a generalized LLM-router failover policy beyond what OpenRouter provides for embeddings.
- Specific Service Level Agreements (SLAs) or uptime expectations from Cloudflare, Supabase, Recall.ai, Deepgram, and the LLM providers are a GAP (founder supplies).

Availability and Single Points of Failure (SPOFs)

Frontier’s live-call path spans several coupled components, so a failure in one can affect the whole path. This section lists what keeps the system available today and where single points of failure remain.

Mitigations (current):
- The Cloudflare Durable Object responsible for coordinating a live call disables hibernation to maintain an active WebSocket connection, preventing intermittent connection closures.
- Graceful degradation is the default posture: the FAQ relevance gate “fails open” (surfacing all FAQs on error rather than suppressing them), Cloudflare AI Search returns empty results on errors, AI embedding calls fall back to OpenRouter, and knowledge base resolution tolerates configuration failures. The aim is to keep the live call usable during partial outages.
- Cloudflare Workers version rollback is the documented path for rapid recovery from deployment issues, with teams recording known-good deployment identifiers.
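
The two degradation styles named above, failing open for the FAQ gate and failing empty for search, can be contrasted in a short sketch. The types and function names (`Faq`, `gateFaqs`, `searchOrEmpty`) are hypothetical and the relevance check is a stand-in; the actual gate logic is not described on this page.

```typescript
interface Faq {
  id: string;
  question: string;
}

// Fail open: if the relevance check throws, surface every FAQ rather than none.
async function gateFaqs(
  faqs: Faq[],
  isRelevant: (faq: Faq) => Promise<boolean>,
): Promise<Faq[]> {
  try {
    const flags = await Promise.all(faqs.map((faq) => isRelevant(faq)));
    return faqs.filter((_, i) => flags[i]);
  } catch {
    return faqs;
  }
}

// Fail empty: a retrieval error yields no results, so answers lose grounding
// but the live call keeps running.
async function searchOrEmpty<T>(search: () => Promise<T[]>): Promise<T[]> {
  try {
    return await search();
  } catch {
    return [];
  }
}
```

The asymmetry is deliberate in this sketch: showing too many FAQs is a tolerable failure for a rep on a call, while inventing search results is not, so search degrades to nothing rather than to everything.
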
Single Points of Failure and Risks:
- Cloudflare as a dominant single dependency: The entire real-time call-server, including STT (via Cloudflare Workers AI), retrieval (via Cloudflare Vectorize, AI Search, R2), state (via Cloudflare D1 and Durable Objects), and all compute, runs on Cloudflare. A regional Cloudflare outage would impact the entire live-call pipeline.
- Coupled Cloudflare Workers: The call-server is deployed as an orchestrated set of Cloudflare Workers with strict service-binding ordering. A failed or partial deployment, or a down sub-worker (e.g., the logging worker or a detection worker), can break the dependent chain, making the entire set a single deployment unit.
- Silent Logging Worker Failure: The Cloudflare Worker responsible for logging silently returns success (HTTP 200) even if Axiom ingest fails. If the Axiom token expires, all Durable Object logs are silently dropped without alerting, creating an observability blind spot during incidents.
- Recall.ai as sole meeting bot: As noted above, Recall.ai remains the sole meeting capture vendor, lacking an abstraction-level alternative.
- Release Candidate (RC) shares Staging D1: The RC environment shares its Cloudflare D1 database with the staging environment, meaning a corruption or migration incident in either could affect both.
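
One way to close the silent logging failure described above is to have the logging worker report the real ingest outcome and raise an alert through a channel that does not depend on the log backend. The sketch below is an assumption about what such a fix could look like, not a documented plan; `Ingest`, `ForwardResult`, `forwardLogs`, and the `alert` callback are hypothetical.

```typescript
// Hypothetical shape of the downstream ingest call (e.g. an HTTP POST to the log backend).
type Ingest = (events: unknown[]) => Promise<{ ok: boolean; status: number }>;

interface ForwardResult {
  status: number;  // status the logging worker would return to its caller
  dropped: number; // events that did not reach the backend
}

// Forward a batch and report the real outcome instead of always answering 200.
async function forwardLogs(
  events: unknown[],
  ingest: Ingest,
  alert: (reason: string) => void,
): Promise<ForwardResult> {
  try {
    const res = await ingest(events);
    if (res.ok) return { status: 200, dropped: 0 };
    alert(`log ingest rejected batch: upstream status ${res.status}`);
  } catch (err) {
    alert(`log ingest threw: ${String(err)}`);
  }
  return { status: 502, dropped: events.length };
}
```

Callers on the hot path can still ignore the returned status so that logging never blocks a call; the point of the sketch is that an expired token produces an alert instead of silence.
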

For more details on operational mechanics and incident response, see Operations → Reliability & operations.

Data Loss and Disaster Recovery (DR)

The loss of critical data, such as call transcripts or organizational configuration, poses a significant risk.

Mitigations (current):
- Supabase Postgres stores application and account data, including configuration.
- Cloudflare D1 (SQLite) stores call-specific, ephemeral data.
- Tenant isolation is enforced per store: Supabase uses Row-Level Security (RLS) keyed on the org_id from Clerk JWTs; Pinecone and Cloudflare AI Search apply org_id metadata filters at retrieval; and R2 object storage uses org_id prefixes for knowledge objects.
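
The prefix and metadata-filter mechanisms in the last item can be illustrated with a small sketch. The key layout `<org_id>/<object name>` and the helper names (`knowledgeObjectKey`, `keyBelongsToOrg`, `filterByOrg`) are assumptions for illustration; the actual key scheme and filter syntax are not specified on this page.

```typescript
// Hypothetical key layout: "<org_id>/<object name>".
function knowledgeObjectKey(orgId: string, objectName: string): string {
  if (orgId.length === 0 || orgId.includes("/")) {
    throw new Error("org_id must be non-empty and must not contain '/'");
  }
  return `${orgId}/${objectName}`;
}

// True only when the key sits under this org's prefix. The trailing "/" matters:
// without it, "org_1" would match keys that belong to "org_12".
function keyBelongsToOrg(orgId: string, key: string): boolean {
  return key.startsWith(`${orgId}/`);
}

interface KnowledgeRecord {
  id: string;
  metadata: { org_id: string };
}

// Metadata filtering: drop every record whose org_id differs from the caller's.
function filterByOrg(records: KnowledgeRecord[], orgId: string): KnowledgeRecord[] {
  return records.filter((record) => record.metadata.org_id === orgId);
}
```

In both cases the isolation is only as trustworthy as the org_id passed in, so the value should come from a verified token rather than from request input.
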
Gaps (founder supplies):
- Backups and disaster recovery plans for Supabase Postgres (e.g., Point-In-Time Recovery, scheduled exports, restore runbooks) are a GAP.
- Backups and disaster recovery plans for Cloudflare D1 (e.g., export/backup automation, restore procedures) are a GAP.
- Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) for all data stores are a GAP.
- Encryption at rest for Supabase Postgres, R2, KV, D1, and Cloudflare Durable Object storage currently relies on provider defaults; an explicit, documented posture is a GAP.

Key-Person Risk

As a startup, Frontier carries an inherent risk of over-reliance on a single individual, particularly the solo founder, for critical knowledge, development, and operational functions.

Mitigations (current):
- Frontier’s agentic operating model is designed to maximize the throughput of a small, senior team through extensive automation across the software development lifecycle. This includes automated code review, quality gates, a self-maintaining knowledge graph, and broad CI automation, which aims to reduce individual bottlenecks and institutionalize knowledge.
- A strong emphasis is placed on comprehensive and self-healing documentation to capture and maintain organizational knowledge effectively.

For more information on our operating model, see How we operate.

Security and Compliance Risk

Ensuring the security of customer data and adherence to compliance standards is paramount.

Mitigations (current):
- Authentication is managed by Clerk, which handles identity and access control for the Dashboard and issues the JWTs that the call-server validates (validation strength varies by entry point; see the risks listed below).
- Recall.ai uses its own OAuth flows (Google, Microsoft) for calendar access, distinct from the main application login. These scopes are read-only for calendar events.
- Tenant isolation is a core security control, implemented through Row-Level Security in Supabase, org_id filters in Pinecone and Cloudflare AI Search, and org_id prefixes for R2 object storage. Cloudflare AI Search’s metadata filter is explicitly the primary security boundary for knowledge bases.
- Secrets are managed securely via Doppler, injected into environments at runtime or during CI/CD processes.
- Static Application Security Testing (SAST) runs through CodeQL (advisory, in the merge queue and in weekly scans) and SonarQube (security and reliability gating). Security review bots such as Aikido and Cursor BugBot are integrated into the PR workflow, and Dependabot provides weekly dependency vulnerability and version updates.
- A published responsible vulnerability disclosure policy exists, including a dedicated contact email (security@frontierx.ai), PGP availability, and a 90-day coordinated disclosure window.
- Call-server API responses are hardened with security headers such as HSTS, X-Frame-Options DENY, Content Security Policy (CSP), and Permissions-Policy.
Risks (not fully mitigated):
- Critical Risk: Disabled Internal Worker Authentication: Internal Cloudflare Worker-to-Worker callback authentication is currently disabled at three entry points within the call-server. Anyone who knows the callback URLs could inject fake detection results into active calls. A fix has been documented but is not yet applied.
- Inconsistent JWT Validation Strength: While the WebSocket for transcription performs full RS256 signature verification on Clerk JWTs, other call-server entry points use a decode-only validation method that does not verify the signature. This means the org_id claim, used for tenant isolation, is trusted from an unverified token in these pathways unless gated by other security controls.
- Background Work Blind Spot: Failures in background tasks managed by ctx.waitUntil (e.g., in processing Recall.ai webhooks) may not be explicitly captured and reported to Sentry, leading to potential silent data loss (e.g., transcript fragments) or undetected issues.
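
For the background-work blind spot, one common pattern is to wrap every promise handed to ctx.waitUntil so that a rejection is reported rather than lost. The sketch below assumes that approach; `reportFailures` is a hypothetical helper, and the `report` callback stands in for whatever error reporter is wired up.

```typescript
// Wrap background work so a rejection is reported instead of vanishing.
// `report` is a stand-in for an error reporter; which one is used is an assumption.
function reportFailures<T>(
  work: Promise<T>,
  report: (err: unknown, label: string) => void,
  label: string,
): Promise<T | undefined> {
  return work.catch((err: unknown) => {
    report(err, label);
    return undefined; // resolve so the runtime sees a settled promise
  });
}

// Intended use inside a Worker handler (hypothetical names, not executed here):
//   ctx.waitUntil(reportFailures(processWebhook(payload), captureError, "recall-webhook"));
```

The label makes it possible to tell which background task failed, which matters when several tasks are scheduled from the same request.
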
Compliance Gaps (founder supplies):
- Compliance attestations (e.g., a SOC 2 report) and demonstrated GDPR adherence are roadmap items; none are in place yet.
- Specific details regarding Data Subject Access Request (DSAR) handling (user access/deletion), lawful basis for data processing, data retention policies, data residency, and signed Data Processing Agreements (DPAs) with sub-processors are a GAP.
- Third-party penetration test status (vendor, date, report) is a GAP.

For comprehensive details on security posture and data handling, please refer to the Trust centre, AI and safety, and Sub-processors sections.