The Three Pillars Problem
Most engineering teams have observability — they just have it in three disconnected silos. Traces live in Jaeger, metrics in Prometheus, logs in Elasticsearch. When an incident hits, engineers tab-switch between tools, mentally correlating timestamps and request IDs. The data is there; the connections are not.
OpenTelemetry (OTel) solves this by providing a single, vendor-neutral instrumentation layer for all three signals. Instrument your code once, and export traces, metrics, and logs to any backend — Grafana Tempo, Jaeger, Prometheus, Datadog, or all of them simultaneously. No vendor lock-in, no re-instrumentation when you switch backends.
This article walks through practical OTel adoption: how the Collector works, how to instrument services in multiple languages, how to correlate signals, and how to avoid the pitfalls that trip up most teams in production.
OpenTelemetry Architecture
OTel consists of three layers: the API (interfaces for instrumentation), the SDK (implementation with processors and exporters), and the Collector (a standalone service that receives, processes, and exports telemetry). Understanding the boundaries between these layers is key to a clean deployment.
API Layer
The API defines how you create spans, record metrics, and emit log records. Libraries and frameworks instrument against the API only — they never depend on the SDK. This means library authors can add OTel instrumentation without forcing users into a specific backend or even requiring the SDK to be present. If no SDK is configured, the API is a no-op with zero overhead.
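The no-op default is easiest to see in miniature. The sketch below is a simplified illustration of the API/SDK split in plain Python, not the actual OpenTelemetry source: library code calls a stable API, and telemetry is only recorded if something (an "SDK") has registered a real implementation.

```python
class NoopSpan:
    """Default span: every operation is a cheap no-op."""
    def set_attribute(self, key, value): pass
    def end(self): pass

class NoopTracer:
    def start_span(self, name):
        return NoopSpan()

class RecordingSpan(NoopSpan):
    """What an SDK-backed span might look like."""
    def __init__(self, name):
        self.name = name
        self.attributes = {}
    def set_attribute(self, key, value):
        self.attributes[key] = value

class RecordingTracer(NoopTracer):
    def start_span(self, name):
        return RecordingSpan(name)

# The global provider defaults to no-op; an "SDK" replaces it at startup.
_tracer = NoopTracer()

def set_tracer(tracer):   # called once by the "SDK" at startup
    global _tracer
    _tracer = tracer

def get_tracer():         # called by libraries — API only
    return _tracer

# Library code instruments against the API and works either way:
span = get_tracer().start_span("db.query")
span.set_attribute("db.system", "postgresql")
span.end()                # no-op by default, effectively zero overhead

set_tracer(RecordingTracer())   # "SDK" configured
span = get_tracer().start_span("db.query")
span.set_attribute("db.system", "postgresql")
```

The library code is identical in both cases; only the globally registered implementation changes.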
SDK Layer
The SDK provides the concrete implementation: span processors, metric readers, log record processors, samplers, and exporters. Your application configures the SDK at startup to wire instrumentation to backends. The SDK also handles batching, retry, and resource attribution — attaching metadata like service name, version, and environment to every signal.
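The batching the SDK performs can be sketched in a few lines. This is a toy model of a batch span processor under stated assumptions (inline flushing, spans as plain values), not the real SDK class, which does this on a background thread:

```python
import time

class BatchProcessor:
    """Toy model of an SDK batch processor: buffer finished spans and
    flush when the batch is full or a timeout elapses."""

    def __init__(self, export_fn, max_batch=512, timeout_s=5.0):
        self.export_fn = export_fn
        self.max_batch = max_batch
        self.timeout_s = timeout_s
        self.buffer = []
        self.last_flush = time.monotonic()

    def on_end(self, span):
        self.buffer.append(span)
        if (len(self.buffer) >= self.max_batch
                or time.monotonic() - self.last_flush >= self.timeout_s):
            self.flush()

    def flush(self):
        if self.buffer:
            self.export_fn(self.buffer)   # one network call per batch
            self.buffer = []
        self.last_flush = time.monotonic()

exported = []
proc = BatchProcessor(exported.append, max_batch=3, timeout_s=60.0)
for name in ["a", "b", "c", "d"]:
    proc.on_end(name)
# "a".."c" went out as one batch of 3; "d" stays buffered until flush
proc.flush()
```

Batching is why a Collector outage rarely loses recent data immediately: spans sit in the buffer until an export succeeds or the buffer overflows.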
Collector
The OTel Collector is a vendor-agnostic proxy that sits between your applications and your backends. It receives telemetry over OTLP (or other protocols), processes it (filtering, sampling, enrichment), and exports it to one or more destinations. Running a Collector decouples your applications from backend specifics — switching from Jaeger to Tempo is a Collector config change, not a code change.
The Collector Pipeline
The Collector processes telemetry through a pipeline of receivers, processors, and exporters. Each pipeline handles one signal type (traces, metrics, or logs), and you can define multiple pipelines per signal.
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    send_batch_size: 8192
    timeout: 200ms
  memory_limiter:
    check_interval: 1s
    limit_mib: 1024
    spike_limit_mib: 256
  attributes:
    actions:
      - key: environment
        value: production
        action: upsert
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push
    resource_to_telemetry_conversion:
      enabled: true
  otlp/loki:
    endpoint: loki:3100
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, attributes]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [otlp/loki]
```

Note
Place the memory_limiter first in the processor chain. If the Collector runs out of memory, it crashes and you lose all in-flight telemetry. The memory limiter applies backpressure before that happens, refusing new data and giving exporters time to flush.
Agent vs. Gateway Deployment
In the agent pattern, a Collector runs as a sidecar or DaemonSet alongside each application. It handles local buffering and forwards to a central gateway. In the gateway pattern, a Collector cluster receives telemetry from all services and handles processing centrally. Most production deployments use both: agents for resilience and local enrichment, gateways for tail-based sampling and cross-service processing.
Distributed Tracing Done Right
A trace is a tree of spans that represents a single request flowing through your system. Each span records an operation — an HTTP handler, a database query, a message publish. The power of tracing is seeing the full chain: which services a request touched, how long each step took, and where errors occurred.
Auto-Instrumentation
OTel provides auto-instrumentation libraries for most popular frameworks and clients. These hook into HTTP servers, database drivers, gRPC clients, and message brokers to create spans automatically — no code changes required. Start with auto-instrumentation to get baseline visibility, then add custom spans for business-critical paths.
```shell
# Python — auto-instrumentation with zero code changes
# Install the packages
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

# Run your app with auto-instrumentation
opentelemetry-instrument \
  --service_name order-service \
  --traces_exporter otlp \
  --metrics_exporter otlp \
  --logs_exporter otlp \
  --exporter_otlp_endpoint http://otel-collector:4317 \
  python app.py

# This automatically instruments:
# - Flask/Django/FastAPI HTTP handlers
# - requests/httpx/aiohttp outbound calls
# - psycopg2/asyncpg/SQLAlchemy queries
# - redis, celery, kafka-python, and 40+ more libraries
```
Custom Spans for Business Logic
Auto-instrumentation captures infrastructure operations. Custom spans capture what your business cares about: order validation, payment processing, inventory reservation. These are the spans that show up in incident reviews and SLO dashboards.
```typescript
// Node.js — custom spans with semantic attributes
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service', '1.0.0');

async function processOrder(order: Order): Promise<OrderResult> {
  return tracer.startActiveSpan('order.process', async (span) => {
    try {
      // Add business context as span attributes
      span.setAttribute('order.id', order.id);
      span.setAttribute('order.total_cents', order.totalCents);
      span.setAttribute('order.item_count', order.items.length);
      span.setAttribute('customer.id', order.customerId);
      span.setAttribute('customer.tier', order.customerTier);

      // Validate — creates a child span automatically
      const validation = await validateOrder(order);
      if (!validation.valid) {
        span.setStatus({ code: SpanStatusCode.ERROR, message: validation.reason });
        span.setAttribute('order.rejection_reason', validation.reason);
        return { status: 'rejected', reason: validation.reason };
      }

      // Reserve inventory — another child span
      const reservation = await tracer.startActiveSpan('inventory.reserve', async (invSpan) => {
        invSpan.setAttribute('warehouse.id', order.warehouseId);
        const result = await inventoryClient.reserve(order.items);
        invSpan.setAttribute('inventory.reserved_count', result.reservedCount);
        invSpan.end();
        return result;
      });

      // Process payment
      const payment = await tracer.startActiveSpan('payment.charge', async (paySpan) => {
        paySpan.setAttribute('payment.method', order.paymentMethod);
        paySpan.setAttribute('payment.currency', order.currency);
        const result = await paymentClient.charge(order);
        paySpan.setAttribute('payment.transaction_id', result.transactionId);
        paySpan.end();
        return result;
      });

      span.setAttribute('order.status', 'confirmed');
      span.addEvent('order.confirmed', { 'payment.transaction_id': payment.transactionId });
      return { status: 'confirmed', transactionId: payment.transactionId };
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: (error as Error).message });
      throw error;
    } finally {
      span.end();
    }
  });
}
```
Note
Use OTel semantic convention names for attributes wherever they exist: http.request.method, db.system, rpc.service. Backends build dashboards and alerts around these conventions. Custom attribute names lose that integration.
Metrics — Beyond Prometheus Scraping
OTel metrics support both push and pull models. You can export directly via OTLP (push) or expose a Prometheus-compatible scrape endpoint (pull). The OTel metrics API provides three core instruments that cover virtually all use cases:
| Instrument | Example | Aggregation | When to Use |
|---|---|---|---|
| Counter | requests_total | Sum (monotonic) | Events that only go up |
| Histogram | request_duration_ms | Distribution | Latency, sizes, distributions |
| UpDownCounter | active_connections | Sum (non-monotonic) | Gauges that go up and down |
```go
// Go — OTel metrics with OTLP export
package main

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func initMetrics(ctx context.Context) (*sdkmetric.MeterProvider, error) {
	exporter, err := otlpmetricgrpc.New(ctx,
		otlpmetricgrpc.WithEndpoint("otel-collector:4317"),
		otlpmetricgrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}
	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(
			sdkmetric.NewPeriodicReader(exporter, sdkmetric.WithInterval(15*time.Second)),
		),
	)
	otel.SetMeterProvider(provider)
	return provider, nil
}

// Application metrics
var (
	meter           = otel.Meter("order-service")
	ordersProcessed metric.Int64Counter
	orderLatency    metric.Float64Histogram
	activeOrders    metric.Int64UpDownCounter
)

func init() {
	var err error
	ordersProcessed, err = meter.Int64Counter("orders.processed",
		metric.WithDescription("Total number of orders processed"),
		metric.WithUnit("{order}"),
	)
	if err != nil {
		panic(err)
	}
	orderLatency, err = meter.Float64Histogram("orders.processing_duration",
		metric.WithDescription("Order processing latency in milliseconds"),
		metric.WithUnit("ms"),
		metric.WithExplicitBucketBoundaries(5, 10, 25, 50, 100, 250, 500, 1000, 2500),
	)
	if err != nil {
		panic(err)
	}
	activeOrders, err = meter.Int64UpDownCounter("orders.active",
		metric.WithDescription("Number of orders currently being processed"),
		metric.WithUnit("{order}"),
	)
	if err != nil {
		panic(err)
	}
}

func handleOrder(ctx context.Context, order Order) error {
	start := time.Now()
	activeOrders.Add(ctx, 1)
	defer func() {
		activeOrders.Add(ctx, -1)
		orderLatency.Record(ctx, float64(time.Since(start).Milliseconds()))
	}()
	err := processOrder(ctx, order)
	ordersProcessed.Add(ctx, 1,
		metric.WithAttributes(
			attribute.String("status", statusFromErr(err)),
			attribute.String("payment_method", order.PaymentMethod),
		),
	)
	return err
}
```
Structured Logs with Trace Correlation
OTel's log signal bridges your existing logging with traces and metrics. Instead of replacing your logger, OTel's log bridge API injects trace context (trace ID, span ID) into every log record. This means you can click from a trace to the exact log lines that were emitted during that span — the correlation that makes incidents solvable in minutes instead of hours.
```python
# Python — structured logging with automatic trace correlation
import logging

import structlog
from opentelemetry.instrumentation.logging import LoggingInstrumentor

# Enable OTel log correlation — injects trace_id and span_id into log records
LoggingInstrumentor().instrument(set_logging_format=True)

# Configure structlog with OTel context
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        # trace_id and span_id are injected automatically by OTel
        structlog.processors.JSONRenderer(),
    ],
    wrapper_class=structlog.make_filtering_bound_logger(logging.INFO),
)

logger = structlog.get_logger()

async def process_payment(order_id: str, amount_cents: int):
    # These logs automatically include trace_id and span_id
    logger.info("payment.started", order_id=order_id, amount_cents=amount_cents)
    try:
        result = await payment_gateway.charge(amount_cents)
        logger.info(
            "payment.completed",
            order_id=order_id,
            transaction_id=result.transaction_id,
            latency_ms=result.latency_ms,
        )
        return result
    except PaymentDeclined as e:
        logger.warning(
            "payment.declined",
            order_id=order_id,
            reason=e.reason,
            decline_code=e.code,
        )
        raise

# Output (every line includes trace context):
# {"event": "payment.started", "order_id": "ord-789",
#  "amount_cents": 4999, "level": "info",
#  "timestamp": "2026-04-15T10:23:45.123Z",
#  "otelTraceID": "a1b2c3d4e5f6...", "otelSpanID": "f6e5d4c3..."}
```
The Correlation Pattern in Practice
With trace-correlated logs, your incident workflow becomes: (1) alert fires on a metric anomaly — say, p99 latency exceeding the SLO. (2) You query for slow traces within that time window in Tempo or Jaeger. (3) You find a trace showing a 3-second database query. (4) You click through to the logs for that span and see the exact SQL query and the lock_wait event that caused the delay. Three signals, one workflow, no tab-switching.
Sampling Strategies That Don't Lose Signal
At scale, collecting 100% of traces is neither affordable nor useful. A service handling 10,000 requests per second generates millions of spans per hour. Sampling reduces volume while preserving the traces that matter. OTel supports three sampling approaches:
Head-Based Sampling
The decision is made at the start of the trace (the “head”). A TraceIdRatioBased sampler keeps a fixed percentage — say 10%. It's simple and efficient, but it can't know whether a trace will be interesting until it's complete. You might discard the one trace that shows a rare failure.
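To make the mechanics concrete, here is a simplified, hypothetical version of a ratio sampler in plain Python. The real TraceIdRatioBased sampler differs in detail, but the core idea is the same: derive a deterministic decision from the trace ID, so every SDK that sees the same trace agrees.

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Map the trace ID onto [0, 2^64) and compare against the ratio.
    Deterministic: the same trace ID always yields the same decision."""
    # Use the low 8 bytes of the 16-byte trace ID as an integer
    value = int(trace_id_hex[-16:], 16)
    return value < ratio * 2**64

low_id  = "0" * 32   # maps near the bottom of the range → kept at 10%
high_id = "f" * 32   # maps near the top of the range → dropped at 10%
```

Because the decision is a pure function of the trace ID, downstream services can honor the head's choice (via the sampled flag in traceparent) without re-rolling the dice.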
Tail-Based Sampling
The decision is made after the trace is complete (the “tail”). The Collector buffers complete traces and applies policies: keep all errors, keep traces slower than 2 seconds, keep 5% of everything else. This captures the interesting traces while dropping routine ones. The trade-off is memory: the Collector must buffer all traces until the decision point.
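A minimal model of those policies, with hypothetical field names, looks like this — not the Collector's actual code, just the decision logic it applies after buffering:

```python
import random

def tail_sample(trace, baseline_rate=0.05, slow_ms=2000, rng=random.random):
    """Decide after the whole trace is buffered. `trace` is a list of
    span dicts with 'status' and 'duration_ms' keys (illustrative)."""
    if any(span["status"] == "ERROR" for span in trace):
        return True                      # always keep errors
    if any(span["duration_ms"] > slow_ms for span in trace):
        return True                      # always keep slow traces
    return rng() < baseline_rate         # probabilistic baseline

error_trace = [{"status": "ERROR", "duration_ms": 12}]
slow_trace  = [{"status": "OK", "duration_ms": 3100}]
fast_trace  = [{"status": "OK", "duration_ms": 8}]
```

The `rng` parameter is injectable here only to make the baseline policy deterministic in tests; the point is that error and latency policies always win before the probabilistic fallback runs.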
Hybrid Sampling
Combine both: head-based sampling at 100% (collect everything locally), tail-based sampling at the gateway Collector. This gives you complete local traces for debugging while the gateway decides what reaches long-term storage. The agent-gateway Collector topology makes this natural.
```yaml
# Tail-based sampling in the Collector
processors:
  tail_sampling:
    decision_wait: 10s        # wait for trace completion
    num_traces: 100000        # max traces in memory
    expected_new_traces_per_sec: 1000
    policies:
      # Always keep error traces
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Always keep slow traces (> 2s)
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 2000
      # Keep all traces from critical services
      - name: critical-services
        type: string_attribute
        string_attribute:
          key: service.name
          values: [payment-service, auth-service]
      # Sample 5% of everything else
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/tempo]
```

Note
Tail-based sampling requires every span of a trace to reach the same Collector instance. When running multiple gateway Collectors, use the loadbalancing exporter in your agent Collectors to route spans by trace ID to the appropriate gateway instance. Without this, you get partial traces and incorrect sampling decisions.
Context Propagation Across Boundaries
Distributed tracing only works if trace context propagates across service boundaries. OTel uses W3C Trace Context headers by default: traceparent carries the trace ID, span ID, and sampling decision. Every HTTP client and server in your stack must propagate these headers.
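To show what propagators do under the hood, here is a stdlib-only sketch of building and parsing a traceparent header. The function names are illustrative, not an OTel API; the header layout (version, 16-byte trace ID, 8-byte span ID, flags) follows the W3C format.

```python
def inject_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    """Build a traceparent value: version 00, hex IDs, trace flags."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def extract_traceparent(header: str):
    """Parse a traceparent value back into its fields."""
    version, trace_id, span_id, flags = header.split("-")
    if version != "00" or len(trace_id) != 32 or len(span_id) != 16:
        raise ValueError(f"malformed traceparent: {header}")
    return {"trace_id": trace_id, "span_id": span_id,
            "sampled": flags == "01"}

header = inject_traceparent(
    "a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4", "f6e5d4c3b2a1f6e5", sampled=True)
ctx = extract_traceparent(header)
```

In practice you never write this by hand; auto-instrumentation injects and extracts the header on every HTTP call, as the Java example below the fold illustrates.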
```java
// Java — context propagation with Spring Boot
// Auto-instrumentation handles this automatically, but here's what happens:
//
// Incoming request: extract context from headers
//   traceparent: 00-a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4-f6e5d4c3b2a1f6e5-01
// The span created for this request becomes a child of the remote span.
// When making outbound calls, context is injected automatically:

@Service
public class OrderService {
    private final RestClient restClient;

    // OTel auto-instrumentation wraps RestClient to inject traceparent
    public PaymentResult processPayment(Order order) {
        return restClient.post()
            .uri("http://payment-service/api/v1/charge")
            // traceparent header is injected automatically
            .body(new ChargeRequest(order.getTotalCents(), order.getCurrency()))
            .retrieve()
            .body(PaymentResult.class);
    }
}

// For message queues, context goes into message headers:
@Component
public class OrderEventPublisher {
    private final KafkaTemplate<String, OrderEvent> kafkaTemplate;

    public void publishOrderPlaced(OrderPlaced event) {
        // OTel Kafka instrumentation injects traceparent into Kafka headers
        kafkaTemplate.send("orders.placed", event.getOrderId(), event);
        // Consumer on the other end extracts it, creating a linked span
    }
}
```

For services using different propagation formats (B3 from Zipkin, X-Ray from AWS), configure the OTel SDK with composite propagators. The SDK can extract from multiple formats and inject in the standard W3C format, bridging legacy and modern instrumentation.
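The composite idea reduces to "try each format in order, first match wins". The sketch below is a simplified, hypothetical illustration in plain Python; real SDK propagators handle more variants and stricter validation.

```python
def extract_w3c(headers):
    """W3C Trace Context: traceparent = version-traceid-spanid-flags."""
    tp = headers.get("traceparent")
    if not tp:
        return None
    _, trace_id, span_id, _flags = tp.split("-")
    return {"trace_id": trace_id, "span_id": span_id}

def extract_b3(headers):
    """B3 single header: b3 = traceid-spanid-flags."""
    b3 = headers.get("b3")
    if not b3:
        return None
    parts = b3.split("-")
    return {"trace_id": parts[0], "span_id": parts[1]}

def composite_extract(headers, extractors=(extract_w3c, extract_b3)):
    # Try each propagator in order; the first one that finds context wins.
    for extract in extractors:
        ctx = extract(headers)
        if ctx:
            return ctx
    return None

modern = {"traceparent": "00-a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4-f6e5d4c3b2a1f6e5-01"}
legacy = {"b3": "a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4-f6e5d4c3b2a1f6e5-1"}
```

Injection works the other way around: extract from whatever arrives, but emit the standard W3C header on outbound calls so downstream services converge on one format.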
Resource Attributes — Know Where Telemetry Comes From
Every span, metric, and log record carries resource attributes that identify its source. At minimum, set service.name, service.version, and deployment.environment. These attributes power service maps, version comparisons, and environment filtering in every observability backend.
```shell
# Environment variables — works with any language SDK
export OTEL_SERVICE_NAME=order-service
export OTEL_RESOURCE_ATTRIBUTES="service.version=1.4.2,deployment.environment=production,service.namespace=ecommerce"
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1
```

```yaml
# Kubernetes — set resource attributes from pod metadata
env:
  - name: OTEL_SERVICE_NAME
    value: order-service
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: >-
      service.version=$(IMAGE_TAG),
      deployment.environment=production,
      k8s.namespace.name=$(K8S_NAMESPACE),
      k8s.pod.name=$(K8S_POD_NAME),
      k8s.node.name=$(K8S_NODE_NAME)
  - name: K8S_NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  - name: K8S_POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: K8S_NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
```
Common Pitfalls and How to Avoid Them
- Over-instrumenting hot paths. Adding a custom span to every loop iteration or every cache lookup creates noise and overhead. Instrument at the operation level (HTTP handler, queue consumer, batch job), not the line-of-code level.
- High-cardinality metric attributes. Using user IDs, request IDs, or URLs as metric labels explodes your time series count and kills your metrics backend. Stick to bounded values: HTTP methods, status code classes, service names, endpoint groups.
- Skipping the Collector. Exporting directly from applications to backends works in development. In production, it couples your code to specific vendors, loses the ability to transform telemetry centrally, and makes backend migration painful.
- Ignoring context propagation gaps. One service that doesn't propagate traceparent breaks the trace chain for every downstream service. Audit propagation across your entire call graph, including queues and async workers.
- Not setting resource attributes. Without service.name, every span shows up as “unknown_service” in your trace viewer. This is the most common setup mistake and the easiest to prevent.
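The high-cardinality pitfall has a mechanical fix: normalize raw values into bounded sets before attaching them as metric attributes. The helpers below are hypothetical illustrations of that pattern, not part of any OTel API.

```python
import re

def status_class(code: int) -> str:
    """Collapse ~60 HTTP status codes into ~5 classes: 503 -> "5xx"."""
    return f"{code // 100}xx"

def route_template(path: str) -> str:
    """Replace numeric and UUID-like path segments with placeholders so
    /orders/12345 and /orders/67890 count toward one time series."""
    path = re.sub(r"/\d+", "/{id}", path)
    return re.sub(r"/[0-9a-f]{8}-[0-9a-f-]{27}", "/{uuid}", path)
```

Apply normalizers like these at the instrumentation site, before the attribute ever reaches the SDK; the Collector's attributes processor can also rewrite values centrally if a service slips through.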
A Practical Adoption Strategy
Adopting OTel across an organization is a journey, not a switch flip. The most successful rollouts follow a phased approach:
- Phase 1 — Collector and auto-instrumentation. Deploy the Collector, enable auto-instrumentation on 2-3 services, and export to your existing backends. Prove the pipeline works without changing application code.
- Phase 2 — Custom spans and metrics. Add business-relevant spans and custom metrics to your most critical services. Build dashboards and alerts around these signals. This is where OTel starts paying back the investment.
- Phase 3 — Log correlation. Bridge your existing logging to OTel, inject trace context, and configure your log backend to link logs to traces. This closes the three-pillar loop.
- Phase 4 — Tail-based sampling and optimization. Once all services emit telemetry, deploy gateway Collectors with tail-based sampling. Optimize costs by filtering noise and keeping only the signals that matter.
One Standard, All Signals
OpenTelemetry is the first instrumentation standard that credibly unifies traces, metrics, and logs under one API, one SDK, and one collection pipeline. The value proposition is straightforward:
- Instrument once, export to any backend — no vendor lock-in
- Correlate traces with logs and metrics for faster incident resolution
- Auto-instrumentation gives you baseline visibility with zero code changes
- The Collector decouples your applications from backend specifics
- Tail-based sampling captures interesting traces while controlling costs
The ecosystem is mature: the specification is stable for traces and metrics, logs are GA as of late 2024, and every major observability vendor supports OTLP natively. If you're starting a new service or re-evaluating your observability stack, OTel is the foundation to build on.
Building a unified observability platform or migrating from fragmented monitoring?
We help teams implement OpenTelemetry end-to-end — from Collector pipelines and auto-instrumentation to tail-based sampling and SLO-driven alerting. Let’s talk.