OpenTelemetry in Rust: From Zero to Production Observability
A ground-up guide to understanding observability signals, how OpenTelemetry unifies them, and how to wire logs, traces, and metrics into a real Rust service using the tracing ecosystem.
Why Observability Matters
You ship a service. It runs. Users start complaining that things are slow, or that something fails randomly, but only in production. You have no idea what is happening inside the process at runtime.
This is the observability problem. You need a window into your running system, one that tells you what happened (logs), how your system performed over time (metrics), and why a specific request was slow or broken (traces).
OpenTelemetry is the industry-standard toolkit for collecting all three signals in a vendor-neutral way, so you can send them to Jaeger, Grafana, Datadog, or any OTLP-compatible backend without rewriting your instrumentation.
Part 1: The Three Signals
Before writing a single line of code, you need to understand what you are collecting and why.
Logs
A log is a timestamped, human-readable record of something that happened.
2026-03-12T10:00:01Z INFO user created user_id=42
2026-03-12T10:00:02Z ERROR database connection failed err="connection refused"

Logs are the oldest form of observability. They are cheap to write and easy to understand, but they are hard to search at scale and impossible to correlate across services without extra plumbing.
In Rust, the standard way to write logs is the tracing crate:
use tracing::{info, warn, error};
info!("user created");
warn!(user_id = 42, "quota almost reached"); // structured key-value fields
error!(err = %e, "database connection failed");

The % sigil tells tracing to format the value using its Display implementation; ? uses Debug. Undecorated values must implement tracing::Value.
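To make the difference between the two sigils concrete, here is a minimal std-only sketch. It does not use the tracing crate; it only shows the two renderings that % and ? select between. The DbError type and the helper function names are hypothetical, introduced for illustration.

```rust
use std::fmt;

/// Hypothetical error type, used only to contrast the two formats.
#[derive(Debug)]
struct DbError {
    code: u16,
    reason: &'static str,
}

impl fmt::Display for DbError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{} (code {})", self.reason, self.code)
    }
}

/// The rendering a `%value` field selects: the Display output.
fn display_field<T: fmt::Display>(value: &T) -> String {
    format!("{}", value)
}

/// The rendering a `?value` field selects: the Debug output.
fn debug_field<T: fmt::Debug>(value: &T) -> String {
    format!("{:?}", value)
}

fn main() {
    let e = DbError { code: 111, reason: "connection refused" };
    println!("err = %e -> {}", display_field(&e));
    println!("err = ?e -> {}", debug_field(&e));
}
```

Display is usually the better choice for errors shown to humans; Debug is handy for payloads whose internal structure you want to see.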
Metrics
A metric is a numeric measurement sampled over time. It answers questions like:
- How many requests per second is the service handling?
- What is the p99 latency of the /checkout endpoint?
- How many background jobs are in the queue right now?
There are three core metric types you will use:
| Type | When to use | Example |
|---|---|---|
| Counter | Things you count up; never go down | http.requests.total |
| Gauge | Values that go up and down | queue.depth, memory.used |
| Histogram | Distribution of a value (latency, size, duration) | request.duration_ms |
Unlike logs, metrics are aggregated. You do not keep every individual sample; you keep statistical summaries (counts, sums, bucket counts). This makes them cheap to store and fast to query even at high volume.
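The idea of keeping summaries instead of samples can be sketched in a few lines of plain Rust. This is a simplified model of an explicit-bucket histogram, not the OTel SDK's implementation; the bucket boundaries and sample values are made up for the example.

```rust
/// Simplified explicit-bucket histogram: keeps summaries, not raw samples.
struct Histogram {
    bounds: Vec<f64>,  // upper bucket bounds, ascending
    buckets: Vec<u64>, // one count per bound, plus a final overflow bucket
    count: u64,
    sum: f64,
}

impl Histogram {
    fn new(bounds: Vec<f64>) -> Self {
        let buckets = vec![0; bounds.len() + 1];
        Self { bounds, buckets, count: 0, sum: 0.0 }
    }

    /// Records one observation by updating the aggregates only.
    fn record(&mut self, value: f64) {
        // First bucket whose upper bound covers the value, else overflow.
        let idx = self
            .bounds
            .iter()
            .position(|b| value <= *b)
            .unwrap_or(self.bounds.len());
        self.buckets[idx] += 1;
        self.count += 1;
        self.sum += value;
    }
}

fn main() {
    let mut h = Histogram::new(vec![10.0, 50.0, 100.0]);
    for v in [3.0, 12.0, 48.0, 75.0, 250.0] {
        h.record(v);
    }
    // Five observations are stored as four bucket counts, a count, and a sum.
    println!("count={} sum={} buckets={:?}", h.count, h.sum, h.buckets);
}
```

However many values are recorded, the memory used stays fixed at the number of buckets, which is why metrics remain cheap at high volume.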
Traces (and Spans)
A trace represents the full lifecycle of a single request as it travels through your system across service boundaries, thread boundaries, and async task switches. A trace is made up of spans.
A span is one unit of work within a trace. It has:
- A name (e.g. "process_payment")
- A start time and end time
- A set of attributes (key-value metadata)
- A parent span ID (which links it into the trace tree)
- A status (OK or ERROR)
When a request enters your service, you open a root span. Every sub-operation opens a child span. When everything completes, the collected tree of spans forms the trace, which you can visualise in a tool like Jaeger as a waterfall diagram.
[ handle_request 100ms ]
  [ auth_check 5ms ]
  [ db_query 80ms ]
    [ query_plan 2ms ]
    [ row_scan 75ms ]
  [ serialize 3ms ]

From this diagram you can immediately see that db_query → row_scan is taking 75 ms and is the bottleneck, something a log or metric alone could not show you.
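The same reasoning can be expressed as code. The sketch below models the trace from the diagram as a tree and finds the span with the largest self time (its duration minus its children's). The Span struct here is a hypothetical stand-in, not the OTel data model, and it assumes child spans run one after another.

```rust
/// Minimal span record: a name, a duration, and child spans.
struct Span {
    name: &'static str,
    duration_ms: u64,
    children: Vec<Span>,
}

impl Span {
    fn new(name: &'static str, duration_ms: u64, children: Vec<Span>) -> Self {
        Self { name, duration_ms, children }
    }

    /// Time spent in this span itself, excluding its children.
    fn self_time_ms(&self) -> u64 {
        let child_total: u64 = self.children.iter().map(|c| c.duration_ms).sum();
        self.duration_ms.saturating_sub(child_total)
    }

    /// The span with the largest self time anywhere in this subtree.
    fn bottleneck(&self) -> (&'static str, u64) {
        let mut best = (self.name, self.self_time_ms());
        for child in &self.children {
            let candidate = child.bottleneck();
            if candidate.1 > best.1 {
                best = candidate;
            }
        }
        best
    }
}

/// The trace from the waterfall diagram.
fn example_trace() -> Span {
    Span::new("handle_request", 100, vec![
        Span::new("auth_check", 5, vec![]),
        Span::new("db_query", 80, vec![
            Span::new("query_plan", 2, vec![]),
            Span::new("row_scan", 75, vec![]),
        ]),
        Span::new("serialize", 3, vec![]),
    ])
}

fn main() {
    let (name, ms) = example_trace().bottleneck();
    println!("bottleneck: {name} ({ms} ms of self time)");
}
```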
How the Three Signals Complement Each Other
| Question | Best signal |
|---|---|
| Did anything go wrong? | Logs |
| Is my service degraded overall? | Metrics |
| Why was this specific request slow? | Traces |
| What was happening when this span ran? | Logs in span |
In a well-instrumented service all three are correlated: a metric spike leads you to a trace, a trace leads you to a log line that tells you exactly what broke. OpenTelemetry is the glue that makes this correlation automatic.
Part 2: OpenTelemetry Architecture
The Spec, the SDKs, and the Collector
OpenTelemetry is three things at once:
- A specification: defines a standard data model and wire protocol (OTLP) for all three signals.
- SDKs: language-specific libraries (opentelemetry, opentelemetry_sdk in Rust) that implement the spec in your process.
- The Collector: an optional but recommended sidecar process that receives OTLP data from your app, processes it (batching, filtering, enrichment), and fans it out to one or more backends.
Your Rust app
┌────────────────────────────────────┐
│ tracing macros (info!, span!, …) │
│ │ │
│ tracing-subscriber registry │
│ ┌───────┴──────────────────────┐ │
│ │ EnvFilter │ │
│ │ fmt::layer (JSON → stdout) │ │
│ │ tracing-opentelemetry layer │─┼──→ OTLP gRPC (traces)
│ │ OTelTracingBridge layer │─┼──→ OTLP gRPC (logs)
│ └──────────────────────────────┘ │
│ │
│ global::meter() API │─→ OTLP gRPC (metrics)
└────────────────────────────────────┘
│ :4317
┌───────────▼──────────────┐
│ OTel Collector │
│ receivers: [otlp] │
│ processors: [batch] │
│ exporters: │
│ traces → Jaeger │
│ metrics → Prometheus │
│ logs → Loki │
└──────────────────────────┘

Why use a Collector at all?
You could point your app directly at Jaeger or Prometheus. The Collector sits in the middle because it:
- Decouples your app from your backend. Change backends by editing one YAML file in the Collector; your app does not change.
- Buffers and batches. Your app does not block on network I/O to a remote backend.
- Enriches telemetry. Add Kubernetes pod labels, cloud region, etc. without touching your code.
- Fans out. Send the same trace to Jaeger and your cloud provider simultaneously.
OTLP: the wire protocol
OTLP (OpenTelemetry Protocol) is the standard format OpenTelemetry SDKs use to ship data to a Collector or backend. It runs over gRPC (default port 4317) or HTTP (default port 4318, with protobuf or JSON payloads). In this setup we use gRPC via the tonic transport, which sends compact binary protobuf messages over a multiplexed HTTP/2 connection.
Part 3: Setting Up the Rust Project
Crates
Add these to Cargo.toml:
[dependencies]
# Core OTel API (trace + metrics)
opentelemetry = { version = "0.31.0", features = ["trace", "metrics"] }
# OTLP exporter: gRPC transport (tonic), all three signals
opentelemetry-otlp = { version = "0.31.0", features = ["grpc-tonic", "trace", "metrics", "logs"] }
# SDK implementations of the three providers
opentelemetry_sdk = { version = "0.31.0", features = ["rt-tokio", "trace", "metrics"] }
# Bridges tracing log events into OTel log records
opentelemetry-appender-tracing = "0.31.1"
# The tracing subscriber ecosystem
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["env-filter", "fmt", "json"] }
# Converts tracing spans into OTel spans
tracing-opentelemetry = "0.32.1"

A few things worth noting:
- opentelemetry is the API crate: thin, with essentially no runtime cost if telemetry is disabled. Your library code should only ever depend on this.
- opentelemetry_sdk is the implementation crate: the actual providers, exporters, and batch processors. Only your binary / top-level crate should pull this in.
- rt-tokio enables the SDK's Tokio integration for async background tasks. Recent SDK releases run the default batch processors and the periodic metric reader on dedicated background threads, so check the changelog for your pinned version to see whether you still need this feature.
- opentelemetry-appender-tracing and tracing-opentelemetry are the bridge crates that connect the tracing world to the OTel world.
The Config Structure
Before the telemetry initialisation code, you need to know what makes it
configurable. We use a TelemetryConfig struct that is loaded from a YAML
config file (or environment variables):
pub struct TelemetryConfig {
    pub level: LogLevel,     // fallback log level (overridden by RUST_LOG)
    pub endpoint: String,    // OTLP collector address, e.g. "http://localhost:4317"
    pub tracer_name: String, // scopes spans from this service in the backend
    pub app_name: String,    // becomes service.name in all telemetry
}

This is dependency injection: the telemetry module does not read env vars or
config files directly; it receives a plain struct. This makes it easy to test
and easy to change the config source without touching telemetry.rs.
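One way to picture this separation is a constructor that takes any key lookup, so the same code serves environment variables, a parsed YAML map, or a test closure. This is a sketch under stated assumptions: the LogLevel enum, the from_lookup function, the TELEMETRY_* key names, and the default values are all hypothetical and not part of the original service.

```rust
/// Hypothetical stand-in for the article's LogLevel type.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum LogLevel { Error, Warn, Info, Debug, Trace }

impl LogLevel {
    pub fn as_str(&self) -> &'static str {
        match self {
            LogLevel::Error => "error",
            LogLevel::Warn => "warn",
            LogLevel::Info => "info",
            LogLevel::Debug => "debug",
            LogLevel::Trace => "trace",
        }
    }

    fn parse(s: &str) -> Option<Self> {
        match s.to_ascii_lowercase().as_str() {
            "error" => Some(LogLevel::Error),
            "warn" => Some(LogLevel::Warn),
            "info" => Some(LogLevel::Info),
            "debug" => Some(LogLevel::Debug),
            "trace" => Some(LogLevel::Trace),
            _ => None,
        }
    }
}

#[derive(Debug, Clone, PartialEq)]
pub struct TelemetryConfig {
    pub level: LogLevel,
    pub endpoint: String,
    pub tracer_name: String,
    pub app_name: String,
}

impl TelemetryConfig {
    /// Builds a config from any key lookup. Key names and defaults are
    /// assumptions made for this sketch.
    pub fn from_lookup(lookup: impl Fn(&str) -> Option<String>) -> Self {
        let or = |key: &str, default: &str| lookup(key).unwrap_or_else(|| default.to_string());
        Self {
            level: lookup("TELEMETRY_LEVEL")
                .and_then(|s| LogLevel::parse(&s))
                .unwrap_or(LogLevel::Info),
            endpoint: or("TELEMETRY_ENDPOINT", "http://localhost:4317"),
            tracer_name: or("TELEMETRY_TRACER_NAME", "rustapp"),
            app_name: or("TELEMETRY_APP_NAME", "rustapp"),
        }
    }
}

fn main() {
    // Production-style source: real environment variables.
    let cfg = TelemetryConfig::from_lookup(|key| std::env::var(key).ok());
    println!("{:?} (level = {})", cfg, cfg.level.as_str());
}
```

In a test, the closure can return fixed values, so the telemetry setup never has to touch the process environment.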
Part 4: The init_telemetry Function, Line by Line
All the setup lives in src/utils/telemetry.rs. Let us walk through it
section by section.
Step 1: The Resource
fn resource(app_name: &str) -> Resource {
    use opentelemetry::KeyValue;

    Resource::builder()
        .with_attribute(KeyValue::new("service.name", app_name.to_owned()))
        .build()
}

A Resource is metadata that describes the entity producing telemetry: "who is sending this data?" Every span, metric data point, and log record is stamped with the resource. This is how your observability backend knows that a span belongs to rustapp and not to some other service.
service.name is a required OTel semantic convention. Without it, Jaeger will
still accept your traces but will label them as an unknown service.
We build the resource once per signal rather than sharing a single clone because
Resource is immutable and cheap to construct; keeping three separate instances
avoids any ownership entanglement.
Step 2: Traces
let span_exporter = SpanExporter::builder()
    .with_tonic()            // gRPC transport via the tonic crate
    .with_endpoint(endpoint) // "http://localhost:4317"
    .build()?;

let tracer_provider = SdkTracerProvider::builder()
    .with_resource(resource(&cfg.app_name))
    .with_batch_exporter(span_exporter)
    .build();

// Needs `use opentelemetry::trace::TracerProvider as _;` in scope.
let tracer = tracer_provider.tracer(cfg.tracer_name.clone());

with_batch_exporter is important. The alternative, with_simple_exporter, exports every single span individually as it ends, which is an enormous performance problem under any real load. The batch exporter accumulates spans in memory and sends them in bulk, periodically or when the buffer fills.
The tracer handle (scoped by tracer_name) is what we hand to
tracing-opentelemetry so it knows which provider to route spans through.
Step 3: Metrics
let metric_exporter = MetricExporter::builder()
    .with_tonic()
    .with_endpoint(endpoint)
    .build()?;

let meter_provider = SdkMeterProvider::builder()
    .with_resource(resource(&cfg.app_name))
    .with_reader(PeriodicReader::builder(metric_exporter).build())
    .build();

opentelemetry::global::set_meter_provider(meter_provider.clone());

PeriodicReader is the metrics equivalent of batch export: it polls all registered instruments at a fixed interval (default 60 seconds) and sends the current values to the exporter. You never push a metric manually; you just record values and the reader handles the rest.
set_meter_provider registers the provider in a global registry. This means any module in your codebase can call opentelemetry::global::meter("name") without needing a direct reference to the provider, analogous to how tracing's global subscriber works.
Note we pass meter_provider.clone() to set_meter_provider and keep the
original for the Shutdown guard. The provider is reference-counted internally,
so the clone is cheap.
Step 4: Logs
let log_exporter = LogExporter::builder()
    .with_tonic()
    .with_endpoint(endpoint)
    .build()?;

let logger_provider = SdkLoggerProvider::builder()
    .with_resource(resource(&cfg.app_name))
    .with_batch_exporter(log_exporter)
    .build();

The log provider is not registered globally because there is no
set_logger_provider equivalent in the OTel API. Instead, we pass it directly
to the OpenTelemetryTracingBridge layer in the next step. The bridge keeps
the reference alive.
Step 5: The tracing-subscriber Registry
This is where everything gets wired together:
// Needs tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt, EnvFilter} in scope.
let env_filter = EnvFilter::try_from_default_env()
    .unwrap_or_else(|_| EnvFilter::new(cfg.level.as_str()));

tracing_subscriber::registry()
    .with(env_filter)
    .with(tracing_subscriber::fmt::layer().json())
    .with(tracing_opentelemetry::layer().with_tracer(tracer))
    .with(
        opentelemetry_appender_tracing::layer::OpenTelemetryTracingBridge::new(
            &logger_provider,
        ),
    )
    .init();

Think of the registry as a pipeline. Every info!() call, every span, passes
through each layer in order:
| Layer | What it does |
|---|---|
| EnvFilter | Reads RUST_LOG when the subscriber is built (e.g. RUST_LOG=debug). If unset, falls back to cfg.level. Events that do not match are dropped here; nothing below sees them. |
| fmt::layer().json() | Serialises events to structured JSON and writes them to stdout. Good for production log aggregators (Fluentd, Loki) that parse JSON. |
| tracing_opentelemetry::layer() | Converts tracing spans into OTel spans and ships them through the SdkTracerProvider we built above. |
| OpenTelemetryTracingBridge | Converts tracing log events into OTel log records and ships them through SdkLoggerProvider. |
The key insight: you never call OTel APIs for logs or traces. You just write
info!() and #[instrument] as you normally would in any Rust code. The
bridge layers do the forwarding silently.
Step 6: The Shutdown Guard
pub struct Shutdown {
    tracer_provider: SdkTracerProvider,
    meter_provider: SdkMeterProvider,
    logger_provider: SdkLoggerProvider,
}

impl Drop for Shutdown {
    fn drop(&mut self) {
        // Errors are ignored here: Drop cannot return them.
        let _ = self.tracer_provider.shutdown();
        let _ = self.meter_provider.shutdown();
        let _ = self.logger_provider.shutdown();
    }
}

This is the RAII (Resource Acquisition Is Initialisation) pattern. When main returns, its local variables go out of scope and Shutdown is dropped, which calls each provider's shutdown(). That method flushes any spans, metrics, or log records still sitting in the internal buffers before the process exits.
Without this, the last batch of telemetry is silently lost every time your process exits normally. In practice this means you will miss the final spans from a graceful shutdown, exactly when you most want to see what happened.
In main.rs, the guard must be kept alive for the full program lifetime:
let _shutdown = init_telemetry(&config.telemetry)?;
// ^
// The leading underscore suppresses "unused variable" warnings, but
// crucially the binding is NOT dropped immediately (unlike `let _ = ...`
// which would drop it at the end of the statement).

The distinction matters: let _ = init_telemetry(...) would drop the Shutdown
immediately after construction, calling shutdown before your app even starts.
let _shutdown = ... keeps it alive until the end of main.
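The difference is easy to observe with a std-only sketch. The Guard type below is a hypothetical stand-in for Shutdown: instead of flushing providers, its Drop implementation appends "flushed" to a shared event log so the ordering becomes visible.

```rust
use std::cell::RefCell;
use std::rc::Rc;

type EventLog = Rc<RefCell<Vec<&'static str>>>;

/// Stand-in for the Shutdown guard: records the moment it is dropped.
struct Guard {
    log: EventLog,
}

impl Drop for Guard {
    fn drop(&mut self) {
        self.log.borrow_mut().push("flushed");
    }
}

fn init(log: &EventLog) -> Guard {
    Guard { log: Rc::clone(log) }
}

/// `let _ = ...` does not bind the guard, so it is dropped right away.
fn run_with_underscore() -> Vec<&'static str> {
    let log: EventLog = Rc::new(RefCell::new(Vec::new()));
    {
        let _ = init(&log);
        log.borrow_mut().push("app running");
    }
    let events = log.borrow().clone();
    events
}

/// `let _shutdown = ...` binds the guard, so it lives to the end of the scope.
fn run_with_named_guard() -> Vec<&'static str> {
    let log: EventLog = Rc::new(RefCell::new(Vec::new()));
    {
        let _shutdown = init(&log);
        log.borrow_mut().push("app running");
    }
    let events = log.borrow().clone();
    events
}

fn main() {
    println!("let _         : {:?}", run_with_underscore());
    println!("let _shutdown : {:?}", run_with_named_guard());
}
```

In the first function the flush happens before the application does any work; in the second it happens after, which is the behaviour a telemetry guard needs.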
Part 5: Using Telemetry in Application Code
Logs: just use tracing macros
use tracing::{info, warn, error, debug};
info!("application started");
warn!(user_id = 42, "quota almost reached");
error!(err = %e, "payment failed");
debug!(payload = ?body, "received request"); // ? = Debug format

These flow through the subscriber pipeline automatically. The JSON layer writes them to stdout; the bridge ships them to OTLP. You write one macro call and get two destinations.
Traces: instrument whole functions (preferred)
use tracing::{info, instrument};

#[instrument]
async fn process_job(job_id: u64) {
    info!("processing started");
    do_work().await;
    info!("processing complete");
}

#[instrument] creates a span named after the function (process_job) every
time it is called. The job_id argument is automatically captured as a span
attribute. Any log events emitted inside the function are nested under the span
in the trace backend.
You can customise what gets captured:
#[instrument(skip(password), fields(user = %user.id))]
async fn login(user: &User, password: &str) { … }

skip prevents sensitive fields from appearing in the span. fields lets you
add custom attributes beyond the function arguments.
Traces: manual spans for finer control
use tracing::{info_span, Instrument};
let span = info_span!("send_email", recipient = %email, template = "welcome");
send_email_inner(payload).instrument(span).await;

Use manual spans when you want to name the span something different from the function name, or when you are adding a span around a block of code that is not a whole function.
Metrics: the one place you call OTel directly
Metrics have no tracing bridge, so you use the OTel API directly. It is still
straightforward:
use opentelemetry::global;
// --- Counter (monotonically increasing, e.g. total requests processed) ---
let meter = global::meter("payment-service");
let counter = meter.u64_counter("payments.processed").build();
counter.add(1, &[]);
// --- Histogram (distribution of values, e.g. latency) ---
let histogram = meter.f64_histogram("payment.duration_ms").build();
histogram.record(latency_ms, &[]);
// --- With attributes (dimensions/labels for slicing in dashboards) ---
use opentelemetry::KeyValue;
counter.add(1, &[
    KeyValue::new("payment.method", "card"),
    KeyValue::new("payment.status", "success"),
]);

Attributes are the metric equivalent of log fields: they let you slice the data by dimension (e.g. "show me payments.processed grouped by payment.method").
Create the meter and instrument objects once (ideally at service startup or
in a lazy_static / OnceLock) and reuse them. Creating a new meter on every
request is wasteful; the underlying objects are meant to be long-lived.
Part 6: The Collector and Local Infrastructure
docker-compose.yaml
The local dev stack brings up four services:
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.123.0
    ports:
      - "4317:4317" # OTLP gRPC ← your app sends here
      - "4318:4318" # OTLP HTTP
      - "8888:8888" # Collector self-metrics
  jaeger:
    image: jaegertracing/jaeger:2.6.0
    environment:
      COLLECTOR_OTLP_ENABLED: "true" # quoted so Compose treats it as a string
    ports:
      - "16686:16686" # Jaeger UI → http://localhost:16686
  postgres: …
  redis: …

Your app sends OTLP to localhost:4317. The Collector receives it, processes
it, and forwards traces to Jaeger. Open http://localhost:16686 in your browser
to see the trace waterfall.
configs/otel-collector.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
exporters:
  otlp/jaeger:
    endpoint: jaeger:4317 # internal Docker network name
    tls:
      insecure: true
  debug:
    verbosity: normal
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger, debug]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]

The Collector config is straightforward: receive on gRPC, batch everything, and export traces to Jaeger. In production you would replace debug with a real backend: Prometheus remote write for metrics, Loki or Elasticsearch for logs.
Running locally
make up # starts docker-compose (Collector, Jaeger, Postgres, Redis)
make dev # cargo run with CONFIG_PATH=configs/config.yaml

Then open http://localhost:16686 → Search → select rustapp → hit Find
Traces. Every request your app handles will appear as a trace.
Part 7: Things to Know in Production
Version alignment is critical
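The opentelemetry, opentelemetry_sdk, opentelemetry-otlp, opentelemetry-appender-tracing, and tracing-opentelemetry crates must all be on compatible versions. The OTel Rust ecosystem moves fast, and because these crates are pre-1.0, minor version bumps often break the API. Pin the opentelemetry-* crates to the same minor version, pair them with the matching tracing-opentelemetry release (it follows its own numbering: 0.32.x alongside 0.31.x in the Cargo.toml above), and update them together.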
The batch processor has knobs
The default batch exporter settings (max queue size 2048 spans, max export batch size 512, scheduled delay 5 seconds, export timeout 30 seconds) are tuned for moderate throughput. Under high load, you may need to adjust them:
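use opentelemetry_sdk::trace::{BatchConfigBuilder, BatchSpanProcessor, SdkTracerProvider};

let batch_config = BatchConfigBuilder::default()
    .with_max_queue_size(8192)
    .with_max_export_batch_size(1024)
    .build();

// Build the processor explicitly so the custom config is applied,
// then register it instead of calling with_batch_exporter.
let span_processor = BatchSpanProcessor::builder(span_exporter)
    .with_batch_config(batch_config)
    .build();

let tracer_provider = SdkTracerProvider::builder()
    .with_span_processor(span_processor)
    .build();

If the queue fills up, new spans are dropped rather than blocking your application. Monitor the Collector's self-metrics (http://localhost:8888/metrics) to spot export failures on the Collector side.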
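To see how the queue size and batch size interact, here is a simplified std-only model of a bounded batch queue. It is a sketch of the behaviour described above, not the SDK's implementation: the BatchQueue type and its method names are hypothetical, and plain integers stand in for finished spans.

```rust
use std::collections::VecDeque;

/// Simplified model of a batch span processor's queue.
struct BatchQueue {
    queue: VecDeque<u64>, // span ids stand in for finished spans
    max_queue_size: usize,
    max_export_batch_size: usize,
    dropped: u64,
}

impl BatchQueue {
    fn new(max_queue_size: usize, max_export_batch_size: usize) -> Self {
        Self {
            queue: VecDeque::new(),
            max_queue_size,
            max_export_batch_size,
            dropped: 0,
        }
    }

    /// Called when a span ends. Never blocks: a full queue drops the span.
    fn on_end(&mut self, span_id: u64) {
        if self.queue.len() >= self.max_queue_size {
            self.dropped += 1;
        } else {
            self.queue.push_back(span_id);
        }
    }

    /// Called by the background worker: takes at most one batch for export.
    fn next_batch(&mut self) -> Vec<u64> {
        let n = self.queue.len().min(self.max_export_batch_size);
        self.queue.drain(..n).collect()
    }
}

fn main() {
    // Default-sized queue hit by a burst of 3000 spans before any export runs.
    let mut q = BatchQueue::new(2048, 512);
    for id in 0..3000 {
        q.on_end(id);
    }
    let batch = q.next_batch();
    println!(
        "exported={} still_queued={} dropped={}",
        batch.len(),
        q.queue.len(),
        q.dropped
    );
}
```

A larger queue absorbs bigger bursts at the cost of memory; a larger batch size drains the queue in fewer export calls.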
RUST_LOG overrides everything
At runtime, set RUST_LOG=debug to get verbose output, or RUST_LOG=warn to silence everything below warnings. You can target specific modules:
RUST_LOG=rustapp=debug,sqlx=warn,tokio=error
The EnvFilter is checked first; if a log event does not match, it never reaches the JSON layer or the OTel bridge. This makes it safe to leave debug!() calls in the code: a disabled event costs little more than the filter check in production unless you explicitly enable it.
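The directive syntax is easier to reason about with a small model. The sketch below parses a RUST_LOG-style string and answers "is this event enabled?". It is a deliberately simplified approximation and not EnvFilter's actual algorithm (no span or field filters); the Filter type and its methods are hypothetical.

```rust
/// Levels in ascending verbosity, so `level <= max` means "enabled".
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Level { Error, Warn, Info, Debug, Trace }

fn parse_level(s: &str) -> Option<Level> {
    match s.trim().to_ascii_lowercase().as_str() {
        "error" => Some(Level::Error),
        "warn" => Some(Level::Warn),
        "info" => Some(Level::Info),
        "debug" => Some(Level::Debug),
        "trace" => Some(Level::Trace),
        _ => None,
    }
}

/// An optional default level plus per-target overrides.
struct Filter {
    default: Option<Level>,
    targets: Vec<(String, Level)>,
}

impl Filter {
    fn parse(spec: &str) -> Self {
        let mut filter = Filter { default: None, targets: Vec::new() };
        for part in spec.split(',').map(str::trim).filter(|p| !p.is_empty()) {
            match part.split_once('=') {
                // "target=level" sets an override for one module tree.
                Some((target, level)) => {
                    if let Some(level) = parse_level(level) {
                        filter.targets.push((target.to_string(), level));
                    }
                }
                // A bare "level" sets the default for everything else.
                None => {
                    if let Some(level) = parse_level(part) {
                        filter.default = Some(level);
                    }
                }
            }
        }
        filter
    }

    /// The longest matching target prefix wins; otherwise the default applies.
    fn enabled(&self, target: &str, level: Level) -> bool {
        let max = self
            .targets
            .iter()
            .filter(|(t, _)| target == t.as_str() || target.starts_with(&format!("{}::", t)))
            .max_by_key(|(t, _)| t.len())
            .map(|(_, l)| *l)
            .or(self.default);
        match max {
            Some(max) => level <= max,
            None => false,
        }
    }
}

fn main() {
    let f = Filter::parse("rustapp=debug,sqlx=warn,tokio=error");
    println!("rustapp debug: {}", f.enabled("rustapp::handlers", Level::Debug));
    println!("sqlx info:     {}", f.enabled("sqlx::query", Level::Info));
}
```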
let _shutdown vs let _ = ...
This deserves repeating because it is a common footgun:
let _shutdown = init_telemetry(&config.telemetry)?; // CORRECT: lives to end of main
let _ = init_telemetry(&config.telemetry)?;         // WRONG: dropped immediately

Rust's _ pattern does not bind the value, so the guard is dropped at the end of the statement. _name creates a binding that lives to the end of its scope. For RAII guards, always use the latter.
Sampling
Collecting every single span in a high-traffic service is expensive. OTel supports trace sampling:
- Always-on (default): every trace is collected. Fine for dev; costly in prod.
- TraceIdRatioBased: sample a percentage (e.g. 1%) of traces randomly.
- ParentBased: respect the sampling decision of the parent span (important for distributed tracing across services).
Configure sampling on the SdkTracerProvider before you go to production.
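The three strategies can be modelled in a few lines to show how they compose. This is a simplified sketch: the Sampler enum below is a hypothetical stand-in whose variant names mirror the concepts above, and the ratio check is one plausible deterministic scheme, not necessarily the exact computation the SDK performs.

```rust
/// Simplified sampler model for illustration.
enum Sampler {
    AlwaysOn,
    TraceIdRatioBased(f64),
    /// Follows the parent's decision; the inner sampler decides for root spans.
    ParentBased(Box<Sampler>),
}

impl Sampler {
    /// `parent_sampled` is None for a root span.
    fn should_sample(&self, trace_id: u128, parent_sampled: Option<bool>) -> bool {
        match self {
            Sampler::AlwaysOn => true,
            Sampler::TraceIdRatioBased(ratio) => {
                if *ratio >= 1.0 {
                    return true;
                }
                if *ratio <= 0.0 {
                    return false;
                }
                // Deterministic: the same trace id yields the same decision
                // wherever it is evaluated. Uses the low 64 bits as a 63-bit value.
                let low = (trace_id as u64) >> 1;
                let threshold = (*ratio * (1u64 << 63) as f64) as u64;
                low < threshold
            }
            Sampler::ParentBased(root) => match parent_sampled {
                Some(decision) => decision,
                None => root.should_sample(trace_id, None),
            },
        }
    }
}

fn main() {
    let sampler = Sampler::ParentBased(Box::new(Sampler::TraceIdRatioBased(0.01)));
    let trace_id: u128 = 0x1234_5678_9abc_def0;
    // Root span: the ratio sampler decides.
    println!("root sampled: {}", sampler.should_sample(trace_id, None));
    // Child span: the parent's decision is kept, whatever the ratio says.
    println!("child of sampled parent: {}", sampler.should_sample(trace_id, Some(true)));
}
```

Wrapping a ratio sampler in a parent-based one is a common combination: the first service in the chain makes the random choice and every downstream service follows it, so traces are never half-collected.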
Quick Reference
Which macro for which job?
info!("message") // plain log
info!(key = value, "message") // structured log with fields
#[instrument] // auto-span the whole function
#[instrument(skip(secret))] // auto-span, hide a field
info_span!("name", key = value) // manual span
.instrument(span) // attach span to a future

Metric types at a glance
meter.u64_counter("name").build() // count up (requests, jobs)
meter.f64_histogram("name").build() // distribution (latency, size)
meter.i64_up_down_counter("name").build() // goes up and down (queue depth)
meter.f64_gauge("name").build() // current value (temperature, ratio)

Signal summary
| Signal | How you emit it | Where it goes |
|---|---|---|
| Log | info!() / warn!() / error!() | stdout + OTel log backend |
| Trace | #[instrument] or info_span!() | OTel trace backend (Jaeger) |
| Metric | global::meter().u64_counter() etc. | OTel metric backend |
Part 8: Complete Usage Examples (Import to Run)
This section gives you copy-paste-ready, realistic code for every signal. Each example shows the full import block so you do not have to hunt for crate names.
Logs: complete example
Logs are the simplest signal. No setup beyond what init_telemetry already
does. Just import and call.
// src/services/user_service.rs
use tracing::{debug, error, info, instrument};
use uuid::Uuid; // Uuid::now_v7 requires the uuid crate's "v7" feature

pub struct UserService { /* … db pool, etc. */ }

impl UserService {
    /// Creates a new user.
    /// The #[instrument] macro wraps the whole function in a span named
    /// "create_user" and records `email` as a span attribute automatically.
    /// Every log inside is nested under that span in Jaeger.
    #[instrument(skip(self), fields(service = "user"))]
    pub async fn create_user(&self, email: &str) -> Result<Uuid, anyhow::Error> {
        // Plain info log: shows up in stdout JSON and in the OTel log backend.
        info!(email, "creating user");

        // Structured fields come before the message string.
        // `%` means Display format, `?` means Debug format.
        debug!(email, attempt = 1, "querying db");

        let user_id = Uuid::now_v7();
        match self.db_insert(user_id, email).await {
            Ok(_) => {
                // Attach key-value context to the log line.
                info!(user_id = %user_id, email, "user created successfully");
                Ok(user_id)
            }
            Err(e) => {
                // `%e` formats the error with its Display impl.
                error!(err = %e, email, "failed to insert user");
                Err(e)
            }
        }
    }

    async fn db_insert(&self, _id: Uuid, _email: &str) -> Result<(), anyhow::Error> {
        Ok(()) // placeholder
    }
}

What you get:
- stdout: a JSON log line per info! / error! call
- Jaeger: the create_user span with all fields as attributes, and the log events attached to the span as span events
Traces: complete example
Option A: #[instrument] (use this by default)
// src/services/payment_service.rs
use tracing::{info, instrument, warn};

pub struct PaymentService;

impl PaymentService {
    /// `skip(self, card_number)`: hides sensitive fields from the span.
    /// `fields(…)`: adds custom attributes that are not function arguments.
    #[instrument(
        skip(self, card_number),
        fields(
            payment.method = "card",
            payment.currency = %currency,
            // Declared empty so Span::record can fill it in later;
            // recording a field that was never declared is silently ignored.
            payment.txn_id = tracing::field::Empty,
        )
    )]
    pub async fn charge(
        &self,
        user_id: u64,
        amount_cents: u64,
        currency: &str,
        card_number: &str, // skipped: never appears in telemetry
    ) -> Result<String, anyhow::Error> {
        info!(user_id, amount_cents, "initiating charge");

        // Simulate calling a downstream payment provider.
        let result = self.call_provider(amount_cents).await;
        match result {
            Ok(ref txn_id) => {
                // Attach the transaction ID to the current span's fields.
                tracing::Span::current().record("payment.txn_id", txn_id.as_str());
                info!(txn_id, "charge successful");
                result
            }
            Err(ref e) => {
                warn!(err = %e, user_id, "charge failed");
                result
            }
        }
    }

    async fn call_provider(&self, _amount: u64) -> Result<String, anyhow::Error> {
        Ok("txn_abc123".to_string()) // placeholder
    }
}

Option B: manual spans for fine-grained control
Use this when you need a span around a specific block of code (not a whole function), or when you want to name the span differently from the function.
// src/jobs/email_job.rs
use tracing::{info, info_span, Instrument};

pub async fn send_welcome_emails(user_ids: Vec<u64>) {
    for user_id in user_ids {
        // Build a span with custom attributes, then attach it to the future.
        let span = info_span!(
            "send_welcome_email",
            user.id = user_id,
            email.template = "welcome_v2",
        );
        async move {
            info!(user_id, "sending email");
            // … actual send logic …
            info!(user_id, "email sent");
        }
        .instrument(span)
        .await;
    }
}

Option C: error status on a span
When an operation fails, mark the span as an error so Jaeger highlights it:
// src/handlers/checkout.rs
use opentelemetry::trace::Status;
use tracing::{error, instrument};
use tracing_opentelemetry::OpenTelemetrySpanExt; // needed for set_status

#[instrument]
pub async fn checkout_handler(order_id: u64) -> Result<(), anyhow::Error> {
    let result = process_order(order_id).await;
    if let Err(ref e) = result {
        // Mark the current span as ERROR so Jaeger colors it red.
        tracing::Span::current()
            .set_status(Status::error(e.to_string()));
        error!(err = %e, order_id, "checkout failed");
    }
    result
}

async fn process_order(_id: u64) -> Result<(), anyhow::Error> {
    Ok(())
}

Metrics: complete example
Metrics are the one place you call the OTel API directly. Create your
instruments once at startup and reuse them throughout the service lifetime.
The idiomatic pattern in Rust is OnceLock or a struct that holds the meters.
Recommended pattern: instrument struct
// src/metrics.rs
use opentelemetry::{
    global,
    metrics::{Counter, Histogram, Meter, UpDownCounter},
};
use std::sync::OnceLock;

/// All metrics for the HTTP layer.
pub struct HttpMetrics {
    /// Total number of HTTP requests received.
    pub requests_total: Counter<u64>,
    /// Number of requests currently in-flight.
    pub requests_in_flight: UpDownCounter<i64>,
    /// Request duration in milliseconds (histogram = p50/p95/p99).
    pub request_duration_ms: Histogram<f64>,
    /// Response body size in bytes.
    pub response_bytes: Histogram<u64>,
}

impl HttpMetrics {
    pub fn new() -> Self {
        // "http-server" is the instrumentation scope, visible in your backend
        // to distinguish metrics from different components.
        let meter: Meter = global::meter("http-server");
        Self {
            requests_total: meter
                .u64_counter("http.requests.total")
                .with_description("Total HTTP requests received")
                .build(),
            requests_in_flight: meter
                .i64_up_down_counter("http.requests.in_flight")
                .with_description("Requests currently being processed")
                .build(),
            request_duration_ms: meter
                .f64_histogram("http.request.duration_ms")
                .with_description("Request latency in milliseconds")
                // Optional: explicit histogram bucket boundaries.
                // Default buckets work for most cases.
                .build(),
            response_bytes: meter
                .u64_histogram("http.response.size_bytes")
                .with_description("Response body size in bytes")
                .build(),
        }
    }
}

// Global singleton, initialised once after `init_telemetry()`.
static HTTP_METRICS: OnceLock<HttpMetrics> = OnceLock::new();

pub fn http_metrics() -> &'static HttpMetrics {
    HTTP_METRICS.get_or_init(HttpMetrics::new)
}

Using the metrics in a handler
// src/handlers/orders.rs
use std::time::Instant;

use opentelemetry::KeyValue;

use crate::metrics::http_metrics;

pub async fn list_orders_handler(/* … */) {
    let m = http_metrics();

    // Dimensions (attributes) let you slice metrics in dashboards.
    // e.g. "show me requests grouped by method and route"
    let attrs = &[
        KeyValue::new("http.method", "GET"),
        KeyValue::new("http.route", "/orders"),
    ];

    // Track in-flight requests (goes up when request starts, down when done).
    m.requests_in_flight.add(1, attrs);

    let start = Instant::now();
    let result = fetch_orders_from_db().await;
    let elapsed_ms = start.elapsed().as_secs_f64() * 1000.0;

    // Record latency in the histogram; this feeds p50/p95/p99 in Grafana.
    m.request_duration_ms.record(elapsed_ms, attrs);

    // Increment total counter with a status dimension.
    let status = if result.is_ok() { "200" } else { "500" };
    m.requests_total.add(1, &[
        KeyValue::new("http.method", "GET"),
        KeyValue::new("http.route", "/orders"),
        KeyValue::new("http.status_code", status),
    ]);

    m.requests_in_flight.add(-1, attrs);
}

async fn fetch_orders_from_db() -> Result<Vec<()>, anyhow::Error> {
    Ok(vec![])
}

Counter vs Histogram vs UpDownCounter vs Gauge: when to use each
| Instrument | Rust type | When to use |
|---|---|---|
| u64_counter | Counter<u64> | Things that only go up: requests, jobs, errors |
| f64_counter | Counter<f64> | Same, but fractional: bytes transferred |
| i64_up_down_counter | UpDownCounter<i64> | Things that go up and down: queue depth, active sessions |
| f64_histogram | Histogram<f64> | Distribution: latency, request size, memory allocated |
| u64_histogram | Histogram<u64> | Same but integer: file sizes, row counts |
| f64_gauge | Gauge<f64> | Current snapshot value: CPU%, memory%, temperature |
Counter: never decreases. If you need "requests per second" in Grafana,
you derive it from a counter with rate(http_requests_total[1m]). Do not use
a gauge for things that are logically monotonic.
Histogram: aggregates individual observations into buckets so the backend can estimate percentiles (p50, p95, p99). Always use a histogram for latency measurements, never a gauge showing "current request time".
UpDownCounter: like a counter, but the value can also decrease. Use for in-flight requests, queue depth, connection pool usage.
Gauge: a snapshot of the current value at scrape time. Use for things that are already a ratio or absolute current reading: memory utilisation, CPU percentage, cache hit ratio.
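The "derive a rate from a counter" idea boils down to simple arithmetic over two scrapes. The following std-only sketch shows that calculation; it is a simplification (it averages over the whole window and does not handle counter resets the way a real query engine does), and the sample values are made up.

```rust
/// One scrape of a cumulative counter: (timestamp in seconds, counter value).
type Sample = (f64, u64);

/// Average per-second rate between the first and last sample.
/// Simplified: returns None instead of compensating for counter resets.
fn rate_per_second(samples: &[Sample]) -> Option<f64> {
    let (t0, v0) = *samples.first()?;
    let (t1, v1) = *samples.last()?;
    if t1 <= t0 || v1 < v0 {
        return None;
    }
    Some((v1 - v0) as f64 / (t1 - t0))
}

fn main() {
    // A request counter scraped every 15 seconds over one minute.
    let samples: [Sample; 5] = [
        (0.0, 1_000),
        (15.0, 1_150),
        (30.0, 1_310),
        (45.0, 1_450),
        (60.0, 1_600),
    ];
    println!("{:?} requests/second", rate_per_second(&samples));
}
```

Because the counter only ever grows, the backend can compute a rate over any window after the fact, which a gauge of "current requests per second" would not allow.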
Putting it all together in main.rs
// src/main.rs
mod handlers;
mod metrics;
mod services;
mod utils;

use std::env;

use anyhow::Result;
use tracing::info;

use crate::utils::config::load_config;
use crate::utils::telemetry::init_telemetry;

#[tokio::main]
async fn main() -> Result<()> {
    let config_path =
        env::var("CONFIG_PATH").unwrap_or_else(|_| "configs/config.yaml".to_string());
    let config = load_config(&config_path)?;

    // IMPORTANT: `_shutdown` must stay alive until the end of main.
    // When it is dropped, all three providers flush their buffers.
    // `let _ = ...` would drop it immediately; do NOT do that.
    let _shutdown = init_telemetry(&config.telemetry)?;

    // Warm up the metrics instruments now that the global meter is registered.
    // This ensures the OnceLock is populated before any request comes in.
    let _ = metrics::http_metrics();

    info!(config_path, "application started");

    // … start your axum server, spawn background tasks, etc. …

    Ok(())
    // _shutdown is dropped here → providers flush → process exits cleanly
}

Closing Thoughts
The setup described here is about 170 lines of code, most of it boilerplate that you write once and never touch again. What you get in return is:
- Structured JSON logs on stdout, parseable by any log aggregator.
- Distributed traces in Jaeger that show exactly what each request did and how long each step took.
- Metrics available for any dashboard or alerting system.
- One endpoint to change if you swap backends.
- No OTel API calls in business logic for logs and traces, just normal tracing macros.
The tracing crate was always the right way to instrument Rust code. OTel
extends it from local stdout logs to a production-grade observability pipeline
without changing how you write application code. That is the design win worth
understanding.