Observability & cost

Frontier’s observability stack gives engineers insight into the system’s behavior, performance, and cost. It covers structured logging, distributed tracing, metrics collection, error tracking, and AI-specific telemetry, with the aim of giving a clear picture of system health and resource consumption.

Logs, Traces, and Metrics

Logs and traces from Frontier’s backend systems are routed primarily to Axiom, a logging and observability platform.

Logs and traces are segregated into separate Axiom datasets based on environment (development, staging, release candidate, production, demo) and type (logs vs. traces). For example, production logs go to frontier-call-server-logs-prd, and production traces go to frontier-call-server-traces-prd.
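The naming pattern in the example above can be sketched as a small helper. This is an illustration only: the function name is hypothetical, and only the prd suffix is confirmed by the text, so the suffix for any other environment is left to the caller.

```typescript
// Dataset kinds named in the text above.
type DatasetKind = "logs" | "traces";

// Builds a Call Server dataset name such as "frontier-call-server-logs-prd".
// Only the "prd" suffix appears in the text; suffixes for other environments
// are supplied by the caller and are not known values.
function callServerDataset(kind: DatasetKind, envSuffix: string): string {
  if (!/^[a-z0-9-]+$/.test(envSuffix)) {
    throw new Error(`invalid environment suffix: ${envSuffix}`);
  }
  return `frontier-call-server-${kind}-${envSuffix}`;
}
```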

The Call Server, which is composed of multiple Cloudflare Workers, uses Cloudflare Workers Observability to route logs from deployed environments at a 100% sampling rate. Traces from Cloudflare Workers Observability are head-sampled at 10% in production and 100% in other environments. In addition, an app-level OpenTelemetry trace pipeline built on @microlabs/otel-cf-workers exports to a separate Axiom dataset (frontier-agent-{DOPPLER_CONFIG}) and is also sampled at 10% in production.
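Head sampling means the keep-or-drop decision is made once, when the trace starts, rather than after its spans are collected. The following is a minimal sketch of a deterministic, ratio-based head sampler. It is not the configuration of Cloudflare Workers Observability or of @microlabs/otel-cf-workers; the function names and the "production" environment string are assumptions made for illustration.

```typescript
// Sampling ratio per environment: 10% in production, 100% elsewhere.
// The literal "production" is an assumed environment name.
function sampleRatio(environment: string): number {
  return environment === "production" ? 0.1 : 1.0;
}

// Head sampling: decide from the trace ID alone whether to keep the trace,
// so every span of the same trace gets the same decision.
function shouldSample(traceIdHex: string, ratio: number): boolean {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Treat the first 32 bits of the trace ID as a number in [0, 2^32).
  const bucket = parseInt(traceIdHex.slice(0, 8), 16);
  if (Number.isNaN(bucket)) return false;
  return bucket < ratio * 0x100000000;
}
```

Deriving the decision from the trace ID, rather than from a random draw per span, keeps sampled traces complete.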

Durable Objects and other long-lived execution paths post their logs to a dedicated logging worker, which then ingests them into Axiom. The Dashboard, a Next.js application, sends its logs to frontier-dash-{env} datasets in Axiom, utilizing the @axiomhq/logging and @axiomhq/nextjs libraries. All Axiom log events adhere to a canonical AxiomLogEvent schema, including contextual fields like call_id, org_id, request_id, correlation_id, and user_id.
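A rough sketch of what such an event shape could look like follows. Only the five contextual fields come from the text; timestamp, level, message, and the buildLogEvent helper are hypothetical additions, and the actual AxiomLogEvent schema may differ.

```typescript
// Contextual fields named in the text.
const CONTEXT_KEYS = ["call_id", "org_id", "request_id", "correlation_id", "user_id"] as const;
type ContextKey = (typeof CONTEXT_KEYS)[number];
type LogContext = Partial<Record<ContextKey, string>>;

// Hypothetical: the level names are not taken from the real schema.
type LogLevel = "debug" | "info" | "warn" | "error";

interface AxiomLogEvent extends LogContext {
  timestamp: string; // hypothetical field
  level: LogLevel;   // hypothetical field
  message: string;   // hypothetical field
}

// Hypothetical helper: builds an event and copies only the context fields
// that are actually set, so absent IDs do not appear as empty columns.
function buildLogEvent(
  level: LogLevel,
  message: string,
  context: LogContext = {},
  now: Date = new Date(),
): AxiomLogEvent {
  const event: AxiomLogEvent = { timestamp: now.toISOString(), level, message };
  for (const key of CONTEXT_KEYS) {
    const value = context[key];
    if (value !== undefined) event[key] = value;
  }
  return event;
}
```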

Error Tracking

Error tracking for the Call Server relies on Sentry, integrated via the @sentry/cloudflare SDK. The Call Server has its own dedicated Sentry project (frontierx-call-server) and a unique DSN (Data Source Name), distinct from the Dashboard’s Sentry configuration. Sentry’s trace sampling rate for the Call Server is set to 10% in production environments and 100% otherwise.

To prevent alert fatigue, Frontier implements severity-based routing for errors: only CRITICAL and HIGH severity errors are forwarded to Sentry for alerting, while LOW and MEDIUM severity errors are recorded in Axiom logs only. Sentry issues are fingerprinted using stable identifiers (error name, worker name, error slug) rather than call_id to group similar errors effectively. However, call_id, org_id, request_id, and correlation_id are attached as searchable tags and enable deep-linking directly to corresponding Axiom logs.
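The routing and grouping rules above can be sketched as follows. The type and function names are hypothetical and the code does not call the Sentry SDK; it only shows the decisions: which severities go to Sentry, what the fingerprint is built from, and which IDs become tags.

```typescript
type Severity = "LOW" | "MEDIUM" | "HIGH" | "CRITICAL";

// Hypothetical shape for an error about to be reported.
interface ReportedError {
  name: string;
  workerName: string;
  slug: string;
  severity: Severity;
  context: { call_id?: string; org_id?: string; request_id?: string; correlation_id?: string };
}

// Only CRITICAL and HIGH go to Sentry; LOW and MEDIUM stay in Axiom logs.
function shouldSendToSentry(severity: Severity): boolean {
  return severity === "CRITICAL" || severity === "HIGH";
}

// Stable identifiers only: per-call IDs are left out so that the same
// failure across many calls groups into one issue.
function fingerprint(err: ReportedError): string[] {
  return [err.name, err.workerName, err.slug];
}

// Per-call IDs travel as searchable tags instead.
function sentryTags(err: ReportedError): Record<string, string> {
  const tags: Record<string, string> = {};
  const keys = ["call_id", "org_id", "request_id", "correlation_id"] as const;
  for (const key of keys) {
    const value = err.context[key];
    if (value !== undefined) tags[key] = value;
  }
  return tags;
}
```

Two errors that differ only in call_id produce the same fingerprint but different tags, which is what keeps issue counts meaningful while still allowing lookup by call.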

Error reporting within Durable Objects is designed to be non-blocking. The reportError function calls Sentry.flush(2000) in a fire-and-forget fashion (a flush with a 2000 ms timeout that is not awaited), so that error events are still dispatched when Durable Object WebSocket connections are closed abruptly.
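A minimal sketch of the fire-and-forget pattern follows. It is not the actual reportError implementation: the SentryLike interface is a hypothetical stand-in for the two SDK calls involved (captureException and flush), injected so the example is self-contained.

```typescript
// Hypothetical minimal view of the client; the real SDK exposes more.
interface SentryLike {
  captureException(error: unknown): string;
  flush(timeoutMs?: number): Promise<boolean>;
}

// Records the error, starts a flush with a 2000 ms timeout, and returns
// immediately. The flush promise is deliberately not awaited, and its
// rejection is swallowed so that reporting can never break the caller.
function reportError(sentry: SentryLike, error: unknown): void {
  sentry.captureException(error);
  void sentry.flush(2000).catch(() => {
    // Intentionally ignored: a failed flush must not surface as a new error.
  });
}
```

The trade-off is that delivery is best effort: the caller never learns whether the flush completed within the timeout.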

Product and LLM Telemetry

Product analytics for the Dashboard application (@frontierx/dash) is handled by PostHog. The PostHog instance is self-hosted and proxied at p.frontierxai.com. Frontier takes a conservative approach to analytics: capturing is opted out by default and gated on user consent, session recording is disabled, and pageviews are captured manually. User profiles (person_profiles) are limited to identified_only. Server-side LLM telemetry is not captured via PostHog.
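The settings described above correspond roughly to the following posthog-js options. This is a sketch, not the Dashboard's actual configuration: the https scheme on the host is assumed, and the AnalyticsClient interface and onConsentGranted helper are hypothetical stand-ins for the two client calls involved.

```typescript
// Option names are posthog-js config keys; the values follow the text above.
const posthogOptions = {
  api_host: "https://p.frontierxai.com", // scheme assumed
  opt_out_capturing_by_default: true,    // nothing is captured until consent
  disable_session_recording: true,
  capture_pageview: false,               // pageviews are sent manually
  person_profiles: "identified_only",
} as const;

// Hypothetical minimal view of the analytics client.
interface AnalyticsClient {
  opt_in_capturing(): void;
  capture(event: string, properties?: Record<string, unknown>): void;
}

// Hypothetical consent gate: opt in first, then send the first manual pageview.
function onConsentGranted(client: AnalyticsClient, path: string): void {
  client.opt_in_capturing();
  client.capture("$pageview", { path });
}
```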

LLM usage and cost accounting run through a Cloudflare AI Gateway named frontier-ai, which is integrated into the Call Server. All calls to OpenAI and Cloudflare Workers AI are routed through this gateway. For each request, the gateway logs the prompt, response, latency, and cost, along with cf-aig-metadata for correlation.
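As an illustration of how a request could be addressed to the gateway and tagged for correlation, the sketch below builds the gateway base URL and the cf-aig-metadata header. The helper names, the account ID argument, and the choice of metadata keys are assumptions; the Call Server's real request code is not shown in this document.

```typescript
type GatewayProvider = "openai" | "workers-ai";

interface GatewayHeaders {
  Authorization: string;
  "Content-Type": string;
  "cf-aig-metadata": string;
}

// Base URL for a provider behind the frontier-ai gateway.
// accountId is the Cloudflare account ID (a placeholder in this sketch).
function gatewayBaseUrl(accountId: string, provider: GatewayProvider): string {
  return `https://gateway.ai.cloudflare.com/v1/${accountId}/frontier-ai/${provider}`;
}

// Headers for a gateway request. cf-aig-metadata carries a small JSON object
// that the gateway stores with the request log; the keys used by callers
// (for example call_id and org_id) are an assumption, not a confirmed set.
function gatewayHeaders(
  apiKey: string,
  metadata: Record<string, string | number | boolean>,
): GatewayHeaders {
  return {
    Authorization: `Bearer ${apiKey}`,
    "Content-Type": "application/json",
    "cf-aig-metadata": JSON.stringify(metadata),
  };
}
```

Attaching the same IDs that appear in Axiom log events would let a gateway cost record be matched to the call that caused it.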

Currently, several LLM providers, including Together, OpenRouter, Perplexity, and Google Gemini, bypass the frontier-ai gateway. Routing all LLM providers through this gateway for unified cost accounting is a planned roadmap item (T2).

In the Call Server’s current implementation, OpenRouter serves only as a fallback embedding provider (baai/bge-large-en-v1.5) for when Cloudflare Workers AI times out or fails. It is not the primary LLM gateway or router for general LLM calls. The fallback path retries on specific HTTP status codes (503, 429, 529).
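The fallback behavior can be sketched as below. Only the set of retryable status codes comes from the text; the attempt count, the backoff delays, the function names, and the convention that failures carry a numeric status property are all assumptions for illustration.

```typescript
// Status codes that trigger a retry of the fallback provider (from the text).
const RETRYABLE_STATUS = [503, 429, 529];

type Embed = (text: string) => Promise<number[]>;

// Reads a numeric `status` property from a thrown value, if present.
function statusOf(err: unknown): number | undefined {
  if (typeof err === "object" && err !== null && "status" in err) {
    const status = (err as { status: unknown }).status;
    return typeof status === "number" ? status : undefined;
  }
  return undefined;
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => { setTimeout(resolve, ms); });

// Tries the primary provider once; on any failure, switches to the fallback
// and retries it only for retryable status codes, with exponential backoff.
async function embedWithFallback(
  text: string,
  primary: Embed,
  fallback: Embed,
  maxAttempts: number = 3,                             // assumed value
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<number[]> {
  try {
    return await primary(text);
  } catch (primaryError) {
    // Timeout or failure of the primary: continue with the fallback.
  }
  let lastError: unknown = new Error("fallback was not attempted");
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fallback(text);
    } catch (err) {
      lastError = err;
      const status = statusOf(err);
      const retryable = status !== undefined && RETRYABLE_STATUS.indexOf(status) !== -1;
      if (!retryable || attempt === maxAttempts) break;
      await sleep(100 * Math.pow(2, attempt - 1)); // 100 ms, 200 ms, ... (assumed)
    }
  }
  throw lastError;
}
```

Errors with other status codes fail immediately, since retrying a request the provider has rejected outright would only add latency.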

For automated code review, Qodo, Frontier’s PR code-review bot, runs on Amazon Bedrock. The specific Bedrock model ID is configured via the QODO_BEDROCK_MODEL environment variable.

Alerting and Service Level Objectives (SLOs)

Environment variables for NTFY alerts (e.g., NTFY_TOKEN, NTFY_DEV_ALERTS_TOPIC) are provisioned. However, no in-application NTFY sender is currently wired into the codebase, so nothing sends these alerts programmatically.

The specifications define the following SLOs. These are targets, not measured values:

Transcript delivery p99 under 5 seconds (over 30 days).
Question-detection availability of 99.5% (over 7 days).
Quick-answer availability of 99% (over 7 days).
Sentry high/critical error budget below 50 per day (production).
Log delivery 99.9% within 2 minutes.
WebSocket uptime of 99.9% (over 7 days).
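To make the availability targets concrete, the sketch below converts a target percentage and window into an allowed amount of downtime. It assumes availability is measured by time, which the specifications quoted here do not state; if the targets are request-based, the same arithmetic applies to request counts instead of minutes.

```typescript
// Minutes of downtime a time-based availability target allows over a window.
function allowedDowntimeMinutes(targetPercent: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return windowMinutes * (1 - targetPercent / 100);
}

// 7 days is 10,080 minutes, so (up to floating-point rounding):
//   99.9% over 7 days  -> about 10.08 minutes (WebSocket uptime)
//   99.5% over 7 days  -> about 50.4 minutes  (question detection)
//   99%   over 7 days  -> about 100.8 minutes (quick answer)
const websocketBudget = allowedDowntimeMinutes(99.9, 7);
const questionDetectionBudget = allowedDowntimeMinutes(99.5, 7);
const quickAnswerBudget = allowedDowntimeMinutes(99, 7);
```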

Under the target specification, critical alerts would route to paging and warnings would go to a Slack channel (#dev-alerts). Example conditions include: a logging worker with more than 5 ingest failures in 5 minutes, Sentry reporting more than 10 critical-severity errors in 5 minutes, WebSocket connection failures exceeding 20% over 5 minutes, or a sub-worker emitting no logs for more than 10 minutes.
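The example conditions can be expressed as simple threshold checks, sketched below. This is not existing alerting code: the stats shape, the condition labels, and the function name are hypothetical, and the sketch does not assign a paging or Slack route to any condition.

```typescript
// Hypothetical snapshot of counters for the most recent 5-minute window.
interface FiveMinuteStats {
  ingestFailures: number;  // logging worker ingest failures
  criticalErrors: number;  // critical-severity errors reported by Sentry
  wsAttempts: number;      // WebSocket connection attempts
  wsFailures: number;      // WebSocket connection failures
  minutesSinceLastLog: Record<string, number>; // per sub-worker
}

// Returns a label for each example condition that is currently met.
function triggeredConditions(stats: FiveMinuteStats): string[] {
  const hits: string[] = [];
  if (stats.ingestFailures > 5) hits.push("logging-worker-ingest-failures");
  if (stats.criticalErrors > 10) hits.push("sentry-critical-errors");
  // Guard against division by zero when there were no connection attempts.
  if (stats.wsAttempts > 0 && stats.wsFailures / stats.wsAttempts > 0.2) {
    hits.push("websocket-connection-failures");
  }
  for (const worker of Object.keys(stats.minutesSinceLastLog)) {
    const minutes = stats.minutesSinceLastLog[worker];
    if (minutes !== undefined && minutes > 10) hits.push(`no-logs:${worker}`);
  }
  return hits;
}
```

All thresholds are strict, so exactly 5 ingest failures or exactly 20% connection failures does not trigger a condition.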