Go 可观测性:OpenTelemetry 完全指南
你有没有经历过这样的夜晚:凌晨两点被报警电话叫醒,服务出了故障,但你盯着满屏的日志,完全搞不清楚请求到底经过了哪些服务、在哪个环节出了问题、为什么突然变慢?如果你有微服务架构的经验,这个场景一定不陌生。
传统的日志监控在单体应用中还算够用,但到了分布式系统里,光靠日志就像拿着手电筒找黑猫——你根本不知道猫在哪里。我们需要的是可观测性(Observability),一种让你无需改变系统行为就能理解其内部状态的能力。
今天,我们就来深入探讨 OpenTelemetry——这个由 CNCF 维护的可观测性标准框架,看看如何在 Go 项目中实现全链路追踪、指标收集和日志关联。
可观测性的三大支柱
在动手之前,先理清概念。可观测性有三大支柱:
- Traces(链路追踪):一个请求从进入系统到返回响应的完整路径,包括经过的每个服务、每次数据库查询、每次外部调用
- Metrics(指标):系统行为的数值度量,比如 QPS、延迟、错误率、CPU 使用率
- Logs(日志):离散的事件记录,比如"用户 A 在 10:00:01 登录失败"
OpenTelemetry 的厉害之处在于:它将这三者统一到一个框架中,并且提供了跨语言、跨平台的标准化实现。你不再需要分别为 Zipkin、Prometheus、ELK 各自接入不同的 SDK。
核心概念速览
在 OpenTelemetry 的世界里,有几个关键概念你必须了解:
- Trace:一次完整的请求链路,由一个或多个 Span 组成
- Span:Trace 中的一个工作单元,有名称、起止时间、属性和事件
- Context Propagation:上下文传播机制,确保 Span 之间的关系能跨服务传递
- Exporter:将遥测数据发送到后端(Jaeger、Tempo、Prometheus 等)
- Collector:可选的中间代理,负责接收、处理和转发遥测数据
- Instrumentation:埋点,分为自动埋点(auto-instrumentation)和手动埋点(manual instrumentation)
项目准备
让我们创建一个示例项目来演示 OpenTelemetry 的集成:
mkdir otel-demo
cd otel-demo
go mod init github.com/yourusername/otel-demo
安装 OpenTelemetry Go SDK 及其依赖:
# 核心 SDK
go get go.opentelemetry.io/otel
go get go.opentelemetry.io/otel/sdk
# Trace 相关
go get go.opentelemetry.io/otel/trace
go get go.opentelemetry.io/otel/sdk/trace
go get go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc
go get go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp
# Metrics 相关
go get go.opentelemetry.io/otel/metric
go get go.opentelemetry.io/otel/sdk/metric
go get go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc
go get go.opentelemetry.io/otel/exporters/prometheus
# 上下文传播
go get go.opentelemetry.io/otel/propagation
# Jaeger 导出器(用于本地开发)
go get go.opentelemetry.io/otel/exporters/jaeger
# 自动埋点(HTTP、gRPC 等)
go get go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp
go get go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc
go get go.opentelemetry.io/contrib/instrumentation/github.com/gorilla/mux/otelmux
go get go.opentelemetry.io/contrib/instrumentation/runtime
初始化 TracerProvider
一切的起点是创建 TracerProvider。它是 Trace 数据的管理中心,负责创建 Span 并将它们导出到后端。
// internal/telemetry/tracer.go
package telemetry
import (
"context"
"fmt"
"log"
"time"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
"go.opentelemetry.io/otel/propagation"
"go.opentelemetry.io/otel/sdk/resource"
sdktrace "go.opentelemetry.io/otel/sdk/trace"
semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
)
// InitTracer 初始化 TracerProvider
func InitTracer(ctx context.Context, serviceName, collectorEndpoint string) (func(context.Context) error, error) {
// 创建 OTLP gRPC 导出器
conn, err := grpc.NewClient(
collectorEndpoint,
grpc.WithTransportCredentials(insecure.NewCredentials()),
)
if err != nil {
return nil, fmt.Errorf("failed to create gRPC connection: %w", err)
}
exporter, err := otlptracegrpc.New(ctx,
otlptracegrpc.WithGRPCConn(conn),
otlptracegrpc.WithInsecure(),
)
if err != nil {
return nil, fmt.Errorf("failed to create trace exporter: %w", err)
}
// 创建资源描述(标识服务身份)
res, err := resource.New(ctx,
resource.WithAttributes(
semconv.ServiceNameKey.String(serviceName),
semconv.ServiceVersionKey.String("1.0.0"),
semconv.DeploymentEnvironmentKey.String("development"),
),
resource.WithHost(),
resource.WithProcess(),
resource.WithTelemetrySDK(),
)
if err != nil {
return nil, fmt.Errorf("failed to create resource: %w", err)
}
// 创建 TracerProvider
tp := sdktrace.NewTracerProvider(
sdktrace.WithBatcher(exporter,
sdktrace.WithBatchTimeout(5*time.Second),
sdktrace.WithMaxExportBatchSize(512),
),
sdktrace.WithResource(res),
sdktrace.WithSampler(sdktrace.AlwaysSample()), // 生产环境建议用 TraceIDRatioBased
)
// 设置全局 TracerProvider
otel.SetTracerProvider(tp)
// 设置全局上下文传播器(W3C TraceContext + Baggage)
otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
propagation.TraceContext{},
propagation.Baggage{},
))
log.Printf("TracerProvider initialized for service: %s", serviceName)
// 返回 cleanup 函数
return func(ctx context.Context) error {
log.Println("Shutting down TracerProvider...")
return tp.Shutdown(ctx)
}, nil
}
创建自定义 Span
有了 TracerProvider,我们就可以创建 Span 了。Span 是追踪的基本单元,它记录了一个操作的开始和结束。
// internal/handler/order_handler.go
package handler
import (
"context"
"encoding/json"
"net/http"
"time"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/codes"
"go.opentelemetry.io/otel/trace"
)
var tracer = otel.Tracer("order-service")
// OrderHandler 订单处理器
type OrderHandler struct {
orderService OrderService
paymentClient PaymentClient
inventoryRepo InventoryRepo
}
func (h *OrderHandler) CreateOrder(w http.ResponseWriter, r *http.Request) {
// 从请求的 context 中提取父 Span(如果有)
ctx := r.Context()
// 创建一个新 Span,自动关联到父 Span
ctx, span := tracer.Start(ctx, "CreateOrder",
trace.WithSpanKind(trace.SpanKindServer),
trace.WithAttributes(
attribute.String("http.method", r.Method),
attribute.String("http.url", r.URL.String()),
),
)
defer span.End()
// 解析请求
var req CreateOrderRequest
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, "invalid request body")
http.Error(w, err.Error(), http.StatusBadRequest)
return
}
// 为 Span 添加业务属性
span.SetAttributes(
attribute.String("order.user_id", req.UserID),
attribute.Int("order.item_count", len(req.Items)),
)
// 添加 Span 事件
span.AddEvent("parsing request completed",
trace.WithAttributes(
attribute.String("order.id", req.OrderID),
),
)
// 调用下游服务(自动创建子 Span)
order, err := h.processOrder(ctx, &req)
if err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, "failed to process order")
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
span.SetStatus(codes.Ok, "order created")
span.SetAttributes(attribute.String("order.result_id", order.ID))
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(order)
}
// processOrder 处理订单(内部方法,创建子 Span)
func (h *OrderHandler) processOrder(ctx context.Context, req *CreateOrderRequest) (*Order, error) {
ctx, span := tracer.Start(ctx, "processOrder",
trace.WithAttributes(
attribute.String("order.id", req.OrderID),
),
)
defer span.End()
// 第一步:验证库存
if err := h.checkInventory(ctx, req.Items); err != nil {
span.RecordError(err)
return nil, fmt.Errorf("inventory check failed: %w", err)
}
// 第二步:扣减库存
if err := h.deductInventory(ctx, req.Items); err != nil {
span.RecordError(err)
return nil, fmt.Errorf("inventory deduction failed: %w", err)
}
// 第三步:处理支付
paymentID, err := h.processPayment(ctx, req.UserID, req.TotalAmount)
if err != nil {
span.RecordError(err)
return nil, fmt.Errorf("payment failed: %w", err)
}
span.SetAttributes(attribute.String("payment.id", paymentID))
// 第四步:创建订单记录
order := &Order{
ID: req.OrderID,
UserID: req.UserID,
Items: req.Items,
TotalAmount: req.TotalAmount,
PaymentID: paymentID,
Status: "confirmed",
CreatedAt: time.Now(),
}
span.AddEvent("order created successfully")
return order, nil
}
// checkInventory 检查库存(创建更深层的子 Span)
func (h *OrderHandler) checkInventory(ctx context.Context, items []OrderItem) error {
ctx, span := tracer.Start(ctx, "checkInventory")
defer span.End()
for _, item := range items {
span.AddEvent("checking item",
trace.WithAttributes(
attribute.String("item.sku", item.SKU),
attribute.Int("item.quantity", item.Quantity),
),
)
available, err := h.inventoryRepo.GetStock(ctx, item.SKU)
if err != nil {
span.RecordError(err,
trace.WithAttributes(attribute.String("item.sku", item.SKU)),
)
return err
}
if available < item.Quantity {
err := fmt.Errorf("insufficient stock for %s: need %d, have %d",
item.SKU, item.Quantity, available)
span.RecordError(err)
return err
}
}
return nil
}
// processPayment 处理支付
func (h *OrderHandler) processPayment(ctx context.Context, userID string, amount float64) (string, error) {
ctx, span := tracer.Start(ctx, "processPayment",
trace.WithAttributes(
attribute.String("payment.user_id", userID),
attribute.Float64("payment.amount", amount),
),
)
defer span.End()
// 模拟调用外部支付服务
paymentID, err := h.paymentClient.Charge(ctx, userID, amount)
if err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, "payment gateway error")
return "", err
}
span.SetAttributes(attribute.String("payment.id", paymentID))
span.SetStatus(codes.Ok, "payment successful")
return paymentID, nil
}
上下文传播(Context Propagation)
在微服务架构中,一个请求可能经过多个服务。上下文传播确保所有服务的 Span 能串联成一条完整的 Trace。
// internal/client/payment_client.go
package client
import (
"context"
"fmt"
"net/http"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/codes"
"go.opentelemetry.io/otel/propagation"
"go.opentelemetry.io/otel/trace"
)
var tracer = otel.Tracer("payment-client")
// PaymentClient 支付客户端
type PaymentClient struct {
httpClient *http.Client
baseURL string
}
func NewPaymentClient(baseURL string) *PaymentClient {
return &PaymentClient{
httpClient: &http.Client{},
baseURL: baseURL,
}
}
// Charge 调用支付网关扣款
func (c *PaymentClient) Charge(ctx context.Context, userID string, amount float64) (string, error) {
ctx, span := tracer.Start(ctx, "PaymentClient.Charge",
trace.WithSpanKind(trace.SpanKindClient),
trace.WithAttributes(
attribute.String("payment.user_id", userID),
attribute.Float64("payment.amount", amount),
attribute.String("payment.currency", "USD"),
),
)
defer span.End()
// 构建 HTTP 请求
req, err := http.NewRequestWithContext(ctx, "POST",
fmt.Sprintf("%s/api/v1/charge", c.baseURL), nil)
if err != nil {
span.RecordError(err)
return "", err
}
// 关键步骤:注入上下文到 HTTP Header
// 这会将 TraceID 和 SpanID 写入请求头,使得下游服务能关联到同一个 Trace
otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
// 发送请求
resp, err := c.httpClient.Do(req)
if err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, "HTTP request failed")
return "", err
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
err := fmt.Errorf("payment service returned status %d", resp.StatusCode)
span.RecordError(err)
span.SetStatus(codes.Error, resp.Status)
return "", err
}
// 解析响应...
paymentID := "pay_abc123" // 模拟
span.SetAttributes(attribute.String("payment.id", paymentID))
return paymentID, nil
}
下游服务(支付服务)接收时提取上下文:
// 下游支付服务的处理器
func (h *PaymentHandler) Charge(w http.ResponseWriter, r *http.Request) {
// 从 HTTP Header 中提取传播的上下文
ctx := otel.GetTextMapPropagator().Extract(
r.Context(),
propagation.HeaderCarrier(r.Header),
)
// 创建 Span 时会自动关联到上游 Span
ctx, span := tracer.Start(ctx, "PaymentHandler.Charge",
trace.WithSpanKind(trace.SpanKindServer),
)
defer span.End()
// 处理支付逻辑...
span.AddEvent("payment processed")
}
如果你使用的是标准 HTTP 客户端和服务器,OpenTelemetry 提供了自动注入/提取的包装器:
import (
"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)
// 自动传播上下文的 HTTP 客户端
client := &http.Client{
Transport: otelhttp.NewTransport(http.DefaultTransport),
}
// 自动传播上下文的 HTTP 服务器
handler := otelhttp.NewHandler(yourHandler, "my-server")
http.ListenAndServe(":8080", handler)
Metrics 指标收集
除了 Tracing,OpenTelemetry 也提供了完善的 Metrics 支持。让我们看看如何在 Go 应用中收集指标。
// internal/telemetry/meter.go
package telemetry
import (
"context"
"log"
"time"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
"go.opentelemetry.io/otel/sdk/metric"
"go.opentelemetry.io/otel/sdk/resource"
semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)
// InitMeter 初始化 MeterProvider
func InitMeter(ctx context.Context, serviceName, collectorEndpoint string) (func(context.Context) error, error) {
// 创建 OTLP gRPC 导出器
exporter, err := otlpmetricgrpc.New(ctx,
otlpmetricgrpc.WithEndpoint(collectorEndpoint),
otlpmetricgrpc.WithInsecure(),
)
if err != nil {
return nil, err
}
// 创建资源
res, err := resource.New(ctx,
resource.WithAttributes(
semconv.ServiceNameKey.String(serviceName),
),
)
if err != nil {
return nil, err
}
// 创建 MeterProvider
mp := metric.NewMeterProvider(
metric.WithResource(res),
metric.WithReader(metric.NewPeriodicReader(exporter,
metric.WithInterval(10*time.Second),
)),
)
// 设置全局 MeterProvider
otel.SetMeterProvider(mp)
log.Printf("MeterProvider initialized for service: %s", serviceName)
return func(ctx context.Context) error {
return mp.Shutdown(ctx)
}, nil
}
自定义 Metrics
OpenTelemetry 提供了多种 Metric 类型:Counter、UpDownCounter、Histogram、Gauge。
// internal/metrics/order_metrics.go
package metrics
import (
"context"
"time"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/metric"
)
// OrderMetrics 订单相关指标
type OrderMetrics struct {
orderCounter metric.Int64Counter
orderAmount metric.Float64Histogram
activeOrders metric.Int64UpDownCounter
orderProcessingMs metric.Int64Histogram
cancelCounter metric.Int64Counter
}
// NewOrderMetrics 创建订单指标收集器
func NewOrderMetrics() (*OrderMetrics, error) {
meter := otel.Meter("order-service")
orderCounter, err := meter.Int64Counter("orders.total",
metric.WithDescription("Total number of orders created"),
metric.WithUnit("{order}"),
)
if err != nil {
return nil, err
}
orderAmount, err := meter.Float64Histogram("orders.amount",
metric.WithDescription("Distribution of order amounts"),
metric.WithUnit("USD"),
metric.WithExplicitBucketBoundaries(10, 50, 100, 200, 500, 1000, 5000),
)
if err != nil {
return nil, err
}
activeOrders, err := meter.Int64UpDownCounter("orders.active",
metric.WithDescription("Number of currently active orders"),
metric.WithUnit("{order}"),
)
if err != nil {
return nil, err
}
orderProcessingMs, err := meter.Int64Histogram("orders.processing_duration",
metric.WithDescription("Time taken to process an order"),
metric.WithUnit("ms"),
metric.WithExplicitBucketBoundaries(10, 50, 100, 250, 500, 1000, 2500, 5000),
)
if err != nil {
return nil, err
}
cancelCounter, err := meter.Int64Counter("orders.cancelled",
metric.WithDescription("Total number of cancelled orders"),
metric.WithUnit("{order}"),
)
if err != nil {
return nil, err
}
return &OrderMetrics{
orderCounter: orderCounter,
orderAmount: orderAmount,
activeOrders: activeOrders,
orderProcessingMs: orderProcessingMs,
cancelCounter: cancelCounter,
}, nil
}
// RecordOrderCreated 记录订单创建
func (m *OrderMetrics) RecordOrderCreated(ctx context.Context, userID string, amount float64, itemCount int) {
attrs := metric.WithAttributes(
attribute.String("user.id", userID),
attribute.Int("order.item_count", itemCount),
)
m.orderCounter.Add(ctx, 1, attrs)
m.orderAmount.Record(ctx, amount, attrs)
m.activeOrders.Add(ctx, 1, attrs)
}
// RecordOrderCompleted 记录订单完成
func (m *OrderMetrics) RecordOrderCompleted(ctx context.Context, userID string, duration time.Duration) {
attrs := metric.WithAttributes(
attribute.String("user.id", userID),
attribute.String("order.status", "completed"),
)
m.activeOrders.Add(ctx, -1, attrs)
m.orderProcessingMs.Record(ctx, duration.Milliseconds(), attrs)
}
// RecordOrderCancelled 记录订单取消
func (m *OrderMetrics) RecordOrderCancelled(ctx context.Context, reason string) {
attrs := metric.WithAttributes(
attribute.String("order.cancel_reason", reason),
)
m.cancelCounter.Add(ctx, 1, attrs)
m.activeOrders.Add(ctx, -1, attrs)
}
在业务代码中使用 Metrics
// internal/service/order_service.go
package service
import (
"context"
"fmt"
"time"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/codes"
"go.opentelemetry.io/otel/trace"
"github.com/yourusername/otel-demo/internal/metrics"
)
var tracer = otel.Tracer("order-service")
type OrderService struct {
repo OrderRepo
metrics *metrics.OrderMetrics
}
func (s *OrderService) PlaceOrder(ctx context.Context, req *CreateOrderRequest) (*Order, error) {
ctx, span := tracer.Start(ctx, "OrderService.PlaceOrder",
trace.WithAttributes(
attribute.String("order.user_id", req.UserID),
),
)
defer span.End()
start := time.Now()
// 记录指标:订单创建
s.metrics.RecordOrderCreated(ctx, req.UserID, req.TotalAmount, len(req.Items))
// 业务逻辑...
order, err := s.repo.Save(ctx, req)
if err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, "failed to save order")
s.metrics.RecordOrderCancelled(ctx, "save_failed")
return nil, err
}
duration := time.Since(start)
s.metrics.RecordOrderCompleted(ctx, req.UserID, duration)
span.SetStatus(codes.Ok, "order placed")
return order, nil
}
Prometheus 集成
如果你的监控系统基于 Prometheus,可以直接使用 Prometheus Exporter:
// internal/telemetry/prometheus.go
package telemetry
import (
"context"
"log"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/collectors"
prom "go.opentelemetry.io/otel/exporters/prometheus"
"go.opentelemetry.io/otel/sdk/metric"
"go.opentelemetry.io/otel/sdk/resource"
semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
"go.opentelemetry.io/otel"
)
// InitPrometheusMeter 初始化 Prometheus 导出器
func InitPrometheusMeter(ctx context.Context, serviceName string) (*prom.Exporter, error) {
// 创建资源
res, err := resource.New(ctx,
resource.WithAttributes(
semconv.ServiceNameKey.String(serviceName),
),
)
if err != nil {
return nil, err
}
// 创建 Prometheus 导出器
// 它同时实现了 metric.Reader 和 prometheus.Collector
exporter, err := prom.New(
prom.WithRegisterer(prometheus.DefaultRegisterer),
)
if err != nil {
return nil, err
}
// 注册 Go 运行时指标
prometheus.MustRegister(collectors.NewGoCollector())
prometheus.MustRegister(collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}))
// 创建 MeterProvider
mp := metric.NewMeterProvider(
metric.WithResource(res),
metric.WithReader(exporter),
)
otel.SetMeterProvider(mp)
log.Printf("Prometheus MeterProvider initialized for service: %s", serviceName)
return exporter, nil
}
在 HTTP 服务器中暴露 /metrics 端点:
// main.go
package main
import (
"context"
"log"
"net/http"
"github.com/prometheus/client_golang/prometheus/promhttp"
"github.com/yourusername/otel-demo/internal/telemetry"
)
func main() {
ctx := context.Background()
// 初始化 Prometheus 指标
_, err := telemetry.InitPrometheusMeter(ctx, "order-service")
if err != nil {
log.Fatalf("Failed to init Prometheus meter: %v", err)
}
mux := http.NewServeMux()
// Prometheus 指标端点
mux.Handle("/metrics", promhttp.Handler())
// 业务端点
mux.HandleFunc("/api/orders", orderHandler.CreateOrder)
// 健康检查
mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(http.StatusOK)
w.Write([]byte(`{"status":"ok"}`))
})
log.Println("Server starting on :8080")
log.Fatal(http.ListenAndServe(":8080", mux))
}
日志关联(Log Correlation)
日志需要和 Trace 关联起来,这样你在看日志时就能快速跳转到对应的 Trace。
// internal/logging/logger.go
package logging
import (
"context"
"log/slog"
"os"
"go.opentelemetry.io/otel/trace"
)
// NewLogger 创建带 Trace 上下文的 slog logger
func NewLogger(serviceName string) *slog.Logger {
handler := &traceLogHandler{
handler: slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
Level: slog.LevelInfo,
}),
}
return slog.New(handler).With(
slog.String("service", serviceName),
)
}
// traceLogHandler 自定义 slog Handler,自动注入 TraceID 和 SpanID
type traceLogHandler struct {
handler slog.Handler
}
func (h *traceLogHandler) Enabled(ctx context.Context, level slog.Level) bool {
return h.handler.Enabled(ctx, level)
}
func (h *traceLogHandler) Handle(ctx context.Context, record slog.Record) error {
// 从 context 中提取 Span 信息
span := trace.SpanFromContext(ctx)
if span.SpanContext().IsValid() {
record.AddAttrs(
slog.String("trace_id", span.SpanContext().TraceID().String()),
slog.String("span_id", span.SpanContext().SpanID().String()),
slog.Bool("trace_flags_sampled", span.SpanContext().IsSampled()),
)
}
return h.handler.Handle(ctx, record)
}
func (h *traceLogHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
return &traceLogHandler{handler: h.handler.WithAttrs(attrs)}
}
func (h *traceLogHandler) WithGroup(name string) slog.Handler {
return &traceLogHandler{handler: h.handler.WithGroup(name)}
}
// FromContext 从上下文获取 logger(带 Trace 信息)
func FromContext(ctx context.Context, logger *slog.Logger) *slog.Logger {
span := trace.SpanFromContext(ctx)
if !span.SpanContext().IsValid() {
return logger
}
return logger.With(
slog.String("trace_id", span.SpanContext().TraceID().String()),
slog.String("span_id", span.SpanContext().SpanID().String()),
)
}
在业务代码中使用关联日志:
func (s *OrderService) PlaceOrder(ctx context.Context, req *CreateOrderRequest) (*Order, error) {
logger := logging.FromContext(ctx, s.logger)
logger.InfoContext(ctx, "placing order",
slog.String("user_id", req.UserID),
slog.Float64("amount", req.TotalAmount),
)
// ... 业务逻辑 ...
if err != nil {
logger.ErrorContext(ctx, "failed to place order",
slog.String("error", err.Error()),
)
return nil, err
}
logger.InfoContext(ctx, "order placed successfully",
slog.String("order_id", order.ID),
)
return order, nil
}
输出示例(JSON 格式):
{
"time": "2025-02-15T16:10:00+08:00",
"level": "INFO",
"service": "order-service",
"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
"span_id": "00f067aa0ba902b7",
"msg": "placing order",
"user_id": "user_123",
"amount": 299.99
}
有了 trace_id,你可以在 Jaeger 或 Grafana Tempo 中直接搜索这条日志对应的完整 Trace。
自动埋点(Auto-Instrumentation)
手动给每个函数加 Span 太累?OpenTelemetry 提供了常见库的自动埋点插件。
HTTP 服务器自动埋点
import (
"net/http"
"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)
func main() {
mux := http.NewServeMux()
mux.HandleFunc("/api/orders", handleOrders)
mux.HandleFunc("/api/users", handleUsers)
// 用 otelhttp 包装,自动为每个请求创建 Span
handler := otelhttp.NewHandler(mux, "my-http-server",
otelhttp.WithFilter(func(r *http.Request) bool {
// 过滤掉健康检查和 metrics 端点
return r.URL.Path != "/health" && r.URL.Path != "/metrics"
}),
)
http.ListenAndServe(":8080", handler)
}
gRPC 自动埋点
import (
"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
"google.golang.org/grpc"
)
// gRPC 服务器
server := grpc.NewServer(
grpc.StatsHandler(otelgrpc.NewServerHandler()),
)
// gRPC 客户端
conn, err := grpc.NewClient("localhost:50051",
grpc.WithStatsHandler(otelgrpc.NewClientHandler()),
)
数据库自动埋点
import (
"go.opentelemetry.io/contrib/instrumentation/github.com/lib/pq/otelpq"
"database/sql"
)
// 注册 OpenTelemetry 包装的 PostgreSQL 驱动
sql.Register("otelpostgres", otelpq.Wrap(&pq.Driver{}))
// 使用注册后的驱动名
db, err := sql.Open("otelpostgres", "postgres://user:pass@localhost/db?sslmode=disable")
Gin/Echo/Chi 等框架
// Chi 中间件
import "go.opentelemetry.io/contrib/instrumentation/github.com/go-chi/chi/v5/otelchi"
r := chi.NewRouter()
r.Use(otelchi.Middleware("my-server", otelchi.WithChiRoutes(r)))
// Gorilla Mux
import "go.opentelemetry.io/contrib/instrumentation/github.com/gorilla/mux/otelmux"
r := mux.NewRouter()
r.Use(otelmux.Middleware("my-server"))
采样策略
在生产环境中,不可能采样 100% 的请求——那会产生海量数据。OpenTelemetry 提供了多种采样策略:
// internal/telemetry/sampler.go
package telemetry
import (
sdktrace "go.opentelemetry.io/otel/sdk/trace"
"go.opentelemetry.io/otel/trace"
)
// NewProductionSampler 生产环境的采样策略
func NewProductionSampler() sdktrace.Sampler {
// 组合多种采样策略
return sdktrace.ParentBased(
// 根 Span 使用基于比率的采样
sdktrace.TraceIDRatioBased(0.1), // 采样 10%
// 有父 Span 时跟随父 Span 的采样决策
sdktrace.WithRemoteParentSampled(sdktrace.AlwaysSample()),
sdktrace.WithRemoteParentNotSampled(sdktrace.NeverSample()),
sdktrace.WithLocalParentSampled(sdktrace.AlwaysSample()),
sdktrace.WithLocalParentNotSampled(sdktrace.NeverSample()),
)
}
// NewErrorAwareSampler 错误请求 100% 采样的策略
func NewErrorAwareSampler() sdktrace.Sampler {
return &errorAwareSampler{
base: sdktrace.TraceIDRatioBased(0.05), // 正常请求采样 5%
}
}
type errorAwareSampler struct {
base sdktrace.Sampler
}
func (s *errorAwareSampler) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult {
// 先使用基础采样策略
result := s.base.ShouldSample(p)
// 如果有错误属性,强制采样
for _, attr := range p.Attributes {
if attr.Key == "error" && attr.Value.AsBool() {
return sdktrace.SamplingResult{
Decision: sdktrace.RecordAndSample,
Tracestate: trace.SpanContextFromContext(p.ParentContext).TraceState(),
}
}
}
return result
}
func (s *errorAwareSampler) Description() string {
return "ErrorAwareSampler(sampling errors at 100%, normal at 5%)"
}
导出到 Jaeger
Jaeger 是最流行的分布式追踪系统之一。本地开发时可以用 Docker 快速启动:
# 启动 Jaeger All-in-One
docker run -d --name jaeger \
-e COLLECTOR_OTLP_ENABLED=true \
-p 16686:16686 \
-p 4317:4317 \
-p 4318:4318 \
jaegertracing/all-in-one:latest
配置导出到 Jaeger:
// 使用 OTLP gRPC 导出到 Jaeger
cleanup, err := telemetry.InitTracer(ctx, "order-service", "localhost:4317")
if err != nil {
log.Fatalf("Failed to init tracer: %v", err)
}
defer cleanup(context.Background())
// Jaeger UI: http://localhost:16686
导出到 Grafana Tempo
Grafana Tempo 是 Grafana 生态中的分布式追踪后端,与 Grafana 仪表盘深度集成:
# docker-compose.yml(Grafana + Tempo + Prometheus)
version: '3.8'
services:
tempo:
image: grafana/tempo:latest
command: ["-config.file=/etc/tempo.yaml"]
volumes:
- ./tempo.yaml:/etc/tempo.yaml
ports:
- "4317:4317" # OTLP gRPC
- "3200:3200" # Tempo API
prometheus:
image: prom/prometheus:latest
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
ports:
- "9090:9090"
grafana:
image: grafana/grafana:latest
environment:
- GF_AUTH_ANONYMOUS_ENABLED=true
- GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
ports:
- "3000:3000"
volumes:
- ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
Tempo 配置文件:
# tempo.yaml
server:
http_listen_port: 3200
distributor:
receivers:
otlp:
protocols:
grpc:
endpoint: "0.0.0.0:4317"
storage:
trace:
backend: local
local:
path: /tmp/tempo/blocks
Grafana 数据源配置:
# grafana-datasources.yml
apiVersion: 1
datasources:
- name: Tempo
type: tempo
access: proxy
url: http://tempo:3200
isDefault: true
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
完整的 main.go 组装
将所有组件组装在一起:
// main.go
package main
import (
"context"
"log"
"net/http"
"os"
"os/signal"
"syscall"
"time"
"github.com/prometheus/client_golang/prometheus/promhttp"
"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
"github.com/yourusername/otel-demo/internal/handler"
"github.com/yourusername/otel-demo/internal/logging"
"github.com/yourusername/otel-demo/internal/metrics"
"github.com/yourusername/otel-demo/internal/telemetry"
)
func main() {
ctx := context.Background()
logger := logging.NewLogger("order-service")
// 初始化 TracerProvider
traceCleanup, err := telemetry.InitTracer(ctx,
"order-service",
getEnv("OTEL_EXPORTER_OTLP_ENDPOINT", "localhost:4317"),
)
if err != nil {
log.Fatalf("Failed to init tracer: %v", err)
}
defer traceCleanup(ctx)
// 初始化 Prometheus 指标
_, err = telemetry.InitPrometheusMeter(ctx, "order-service")
if err != nil {
log.Fatalf("Failed to init meter: %v", err)
}
// 创建业务指标
orderMetrics, err := metrics.NewOrderMetrics()
if err != nil {
log.Fatalf("Failed to create metrics: %v", err)
}
// 创建 Handler
orderHandler := handler.NewOrderHandler(orderMetrics, logger)
// 路由配置
mux := http.NewServeMux()
mux.HandleFunc("/api/orders", orderHandler.CreateOrder)
mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
w.Write([]byte(`{"status":"ok"}`))
})
mux.Handle("/metrics", promhttp.Handler())
// 用 otelhttp 包装,自动创建 Span
wrappedMux := otelhttp.NewHandler(mux, "order-service",
otelhttp.WithFilter(func(r *http.Request) bool {
return r.URL.Path != "/health" && r.URL.Path != "/metrics"
}),
)
srv := &http.Server{
Addr: ":8080",
Handler: wrappedMux,
ReadTimeout: 10 * time.Second,
WriteTimeout: 30 * time.Second,
IdleTimeout: 60 * time.Second,
}
// 优雅关闭
go func() {
logger.Info("Server starting on :8080")
if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
logger.Error("Server error", "error", err)
}
}()
quit := make(chan os.Signal, 1)
signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)
<-quit
logger.Info("Shutting down server...")
shutdownCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
if err := srv.Shutdown(shutdownCtx); err != nil {
logger.Error("Server forced to shutdown", "error", err)
}
logger.Info("Server exited")
}
func getEnv(key, fallback string) string {
if val := os.Getenv(key); val != "" {
return val
}
return fallback
}
Grafana 仪表盘配置
在 Grafana 中,你可以创建强大的仪表盘来可视化 Trace 和 Metrics。以下是几个关键面板的建议:
| 面板 | 数据源 | PromQL / 查询 |
|---|---|---|
| 请求 QPS | Prometheus | rate(http_requests_total[5m]) |
| P50/P95/P99 延迟 | Prometheus | histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) |
| 错误率 | Prometheus | rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) |
| 订单创建速率 | Prometheus | rate(orders_total[5m]) |
| 活跃订单数 | Prometheus | orders_active |
| GC Pause | Prometheus | rate(go_gc_duration_seconds_sum[5m]) |
你还可以利用 Grafana 的 Trace to Logs 功能,在 Trace 视图中直接关联到 Loki 的日志——这就是可观测性三大支柱互相关联的魅力所在。
生产环境部署建议
将 OpenTelemetry 部署到生产环境时,建议遵循以下最佳实践:
1. 使用 OpenTelemetry Collector
不要让你的服务直接导出数据到 Jaeger/Prometheus,而是经过 Collector 中转:
# otel-collector-config.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 5s
send_batch_size: 1024
memory_limiter:
check_interval: 1s
limit_mib: 512
spike_limit_mib: 128
filter/drop-health:
error_mode: ignore
traces:
span:
- 'attributes["http.target"] == "/health"'
- 'attributes["http.target"] == "/metrics"'
exporters:
otlp/jaeger:
endpoint: jaeger:4317
tls:
insecure: true
prometheus:
endpoint: 0.0.0.0:8889
otlp/tempo:
endpoint: tempo:4317
tls:
insecure: true
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, filter/drop-health, batch]
exporters: [otlp/jaeger, otlp/tempo]
metrics:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [prometheus]
2. Kubernetes 部署 Collector
# k8s/otel-collector.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: otel-collector
spec:
replicas: 2
selector:
matchLabels:
app: otel-collector
template:
metadata:
labels:
app: otel-collector
spec:
containers:
- name: otel-collector
image: otel/opentelemetry-collector-contrib:latest
args: ["--config=/etc/otel/collector.yaml"]
ports:
- containerPort: 4317
- containerPort: 4318
- containerPort: 8889
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
volumeMounts:
- name: config
mountPath: /etc/otel
volumes:
- name: config
configMap:
name: otel-collector-config
---
apiVersion: v1
kind: Service
metadata:
name: otel-collector
spec:
selector:
app: otel-collector
ports:
- name: otlp-grpc
port: 4317
- name: otlp-http
port: 4318
- name: prometheus
port: 8889
3. 你的 Go 服务只需要导出到本地 Collector
// 在 Kubernetes 中,每个 Pod 运行一个 Collector Sidecar
// Go 服务只需要导出到 localhost
cleanup, err := telemetry.InitTracer(ctx,
"order-service",
"localhost:4317", // 指向 sidecar collector
)
4. 采样策略配置
// 根据环境动态调整采样率
func getSampler(env string) sdktrace.Sampler {
switch env {
case "development":
return sdktrace.AlwaysSample()
case "staging":
return sdktrace.TraceIDRatioBased(0.5)
case "production":
return sdktrace.ParentBased(
sdktrace.TraceIDRatioBased(0.1),
)
default:
return sdktrace.AlwaysSample()
}
}
实战:完整的分布式追踪示例
让我们构建一个包含多个服务的完整分布式系统,看看 OpenTelemetry 如何在真实场景中工作。
服务 1:API Gateway
// services/gateway/main.go
package main
import (
"context"
"log"
"net/http"
"os"
"time"
"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/codes"
"go.opentelemetry.io/otel/propagation"
"go.opentelemetry.io/otel/trace"
"github.com/yourusername/otel-demo/internal/telemetry"
)
var tracer = otel.Tracer("gateway-service")
func main() {
ctx := context.Background()
// 初始化追踪
cleanup, err := telemetry.InitTracer(ctx, "gateway-service", "localhost:4317")
if err != nil {
log.Fatalf("Failed to init tracer: %v", err)
}
defer cleanup(ctx)
// 创建 HTTP 客户端(自动传播上下文)
httpClient := &http.Client{
Transport: otelhttp.NewTransport(http.DefaultTransport),
Timeout: 10 * time.Second,
}
// API Gateway 处理器
handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
ctx := r.Context()
span := trace.SpanFromContext(ctx)
span.SetAttributes(
attribute.String("gateway.path", r.URL.Path),
attribute.String("gateway.method", r.Method),
)
// 根据路径路由到不同服务
switch {
case r.URL.Path == "/api/orders":
span.AddEvent("routing to order-service")
resp, err := httpClient.Get("http://order-service:8080/orders")
if err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, "order-service unavailable")
http.Error(w, "Service unavailable", http.StatusServiceUnavailable)
return
}
defer resp.Body.Close()
span.SetAttributes(attribute.Int("order_service.status", resp.StatusCode))
case r.URL.Path == "/api/users":
span.AddEvent("routing to user-service")
resp, err := httpClient.Get("http://user-service:8080/users")
if err != nil {
span.RecordError(err)
http.Error(w, "Service unavailable", http.StatusServiceUnavailable)
return
}
defer resp.Body.Close()
default:
span.SetStatus(codes.Error, "unknown route")
http.NotFound(w, r)
return
}
w.WriteHeader(http.StatusOK)
w.Write([]byte(`{"status":"ok"}`))
})
// 用 otelhttp 包装(自动创建 Span)
wrappedHandler := otelhttp.NewHandler(handler, "gateway",
otelhttp.WithPropagators(propagation.NewCompositeTextMapPropagator(
propagation.TraceContext{},
propagation.Baggage{},
)),
)
log.Println("Gateway starting on :8080")
log.Fatal(http.ListenAndServe(":8080", wrappedHandler))
}
服务 2:Order Service
// services/order-service/main.go
package main
import (
"context"
"encoding/json"
"log"
"net/http"
"time"
"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/codes"
"go.opentelemetry.io/otel/trace"
"github.com/yourusername/otel-demo/internal/telemetry"
)
var tracer = otel.Tracer("order-service")
type Order struct {
ID string `json:"id"`
UserID string `json:"user_id"`
Amount float64 `json:"amount"`
Status string `json:"status"`
CreatedAt time.Time `json:"created_at"`
}
func main() {
ctx := context.Background()
cleanup, err := telemetry.InitTracer(ctx, "order-service", "localhost:4317")
if err != nil {
log.Fatalf("Failed to init tracer: %v", err)
}
defer cleanup(ctx)
// HTTP 客户端(自动传播)
httpClient := &http.Client{
Transport: otelhttp.NewTransport(http.DefaultTransport),
}
mux := http.NewServeMux()
mux.HandleFunc("/orders", func(w http.ResponseWriter, r *http.Request) {
ctx := r.Context()
// 创建子 Span
ctx, span := tracer.Start(ctx, "ProcessOrderRequest",
trace.WithAttributes(
attribute.String("order.handler", "list"),
),
)
defer span.End()
// 模拟数据库查询
orders, err := fetchOrdersFromDB(ctx)
if err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, "database error")
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
span.SetAttributes(attribute.Int("orders.count", len(orders)))
// 调用用户服务获取用户信息
for i, order := range orders {
userInfo, err := fetchUserInfo(ctx, httpClient, order.UserID)
if err != nil {
span.AddEvent("failed to fetch user info",
trace.WithAttributes(
attribute.String("user.id", order.UserID),
attribute.String("error", err.Error()),
),
)
continue
}
span.AddEvent("enriched order with user info",
trace.WithAttributes(
attribute.String("order.id", order.ID),
attribute.String("user.name", userInfo.Name),
),
)
_ = i // 使用 userInfo 更新 order
}
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(orders)
})
// 包装处理器
handler := otelhttp.NewHandler(mux, "order-service")
log.Println("Order service starting on :8081")
log.Fatal(http.ListenAndServe(":8081", handler))
}
func fetchOrdersFromDB(ctx context.Context) ([]Order, error) {
ctx, span := tracer.Start(ctx, "DatabaseQuery",
trace.WithAttributes(
attribute.String("db.system", "postgresql"),
attribute.String("db.statement", "SELECT * FROM orders"),
),
)
defer span.End()
// 模拟数据库延迟
time.Sleep(50 * time.Millisecond)
span.AddEvent("query completed",
trace.WithAttributes(attribute.Int("db.rows_affected", 3)),
)
return []Order{
{ID: "ord_1", UserID: "user_123", Amount: 99.99, Status: "paid"},
{ID: "ord_2", UserID: "user_456", Amount: 149.50, Status: "pending"},
{ID: "ord_3", UserID: "user_789", Amount: 299.00, Status: "shipped"},
}, nil
}
func fetchUserInfo(ctx context.Context, client *http.Client, userID string) (*struct{ Name string }, error) {
ctx, span := tracer.Start(ctx, "FetchUserInfo",
trace.WithSpanKind(trace.SpanKindClient),
trace.WithAttributes(
attribute.String("user.id", userID),
attribute.String("peer.service", "user-service"),
),
)
defer span.End()
resp, err := client.Get("http://user-service:8082/users/" + userID)
if err != nil {
span.RecordError(err)
return nil, err
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
err := fmt.Errorf("user service returned %d", resp.StatusCode)
span.SetStatus(codes.Error, resp.Status)
return nil, err
}
var user struct{ Name string }
if err := json.NewDecoder(resp.Body).Decode(&user); err != nil {
span.RecordError(err)
return nil, err
}
span.SetAttributes(attribute.String("user.name", user.Name))
return &user, nil
}
服务 3:User Service
// services/user-service/main.go
package main
import (
"context"
"encoding/json"
"log"
"net/http"
"strings"
"time"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/codes"
"go.opentelemetry.io/otel/propagation"
"go.opentelemetry.io/otel/trace"
"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
"github.com/yourusername/otel-demo/internal/telemetry"
)
var tracer = otel.Tracer("user-service")
type User struct {
ID string `json:"id"`
Name string `json:"name"`
Email string `json:"email"`
}
func main() {
ctx := context.Background()
cleanup, err := telemetry.InitTracer(ctx, "user-service", "localhost:4317")
if err != nil {
log.Fatalf("Failed to init tracer: %v", err)
}
defer cleanup(ctx)
mux := http.NewServeMux()
mux.HandleFunc("/users/", func(w http.ResponseWriter, r *http.Request) {
// 从 Header 提取传播的上下文
ctx := propagation.NewCompositeTextMapPropagator(
propagation.TraceContext{},
propagation.Baggage{},
).Extract(r.Context(), propagation.HeaderCarrier(r.Header))
// 创建 Span(会自动关联到上游)
ctx, span := tracer.Start(ctx, "GetUser",
trace.WithSpanKind(trace.SpanKindServer),
)
defer span.End()
// 从 URL 提取用户 ID
userID := strings.TrimPrefix(r.URL.Path, "/users/")
span.SetAttributes(attribute.String("user.id", userID))
// 查询用户
user, err := getUserFromDB(ctx, userID)
if err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, "user not found")
http.NotFound(w, r)
return
}
span.SetAttributes(
attribute.String("user.name", user.Name),
attribute.String("user.email", user.Email),
)
span.SetStatus(codes.Ok, "user found")
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(user)
})
handler := otelhttp.NewHandler(mux, "user-service")
log.Println("User service starting on :8082")
log.Fatal(http.ListenAndServe(":8082", handler))
}
func getUserFromDB(ctx context.Context, userID string) (*User, error) {
_, span := tracer.Start(ctx, "DatabaseQuery",
trace.WithAttributes(
attribute.String("db.system", "postgresql"),
attribute.String("db.statement", "SELECT * FROM users WHERE id = $1"),
),
)
defer span.End()
// 模拟数据库查询
time.Sleep(30 * time.Millisecond)
// 模拟用户数据
users := map[string]*User{
"user_123": {ID: "user_123", Name: "Alice", Email: "alice@example.com"},
"user_456": {ID: "user_456", Name: "Bob", Email: "bob@example.com"},
"user_789": {ID: "user_789", Name: "Charlie", Email: "charlie@example.com"},
}
user, ok := users[userID]
if !ok {
return nil, fmt.Errorf("user %s not found", userID)
}
return user, nil
}
使用 Docker Compose 启动完整环境
# docker-compose.yml
version: '3.8'
services:
jaeger:
image: jaegertracing/all-in-one:latest
environment:
- COLLECTOR_OTLP_ENABLED=true
ports:
- "16686:16686"
- "4317:4317"
- "4318:4318"
gateway:
build:
context: ./services/gateway
ports:
- "8080:8080"
environment:
- OTEL_EXPORTER_OTLP_ENDPOINT=jaeger:4317
- OTEL_SERVICE_NAME=gateway-service
depends_on:
- jaeger
- order-service
- user-service
order-service:
build:
context: ./services/order-service
environment:
- OTEL_EXPORTER_OTLP_ENDPOINT=jaeger:4317
- OTEL_SERVICE_NAME=order-service
depends_on:
- jaeger
- user-service
user-service:
build:
context: ./services/user-service
environment:
- OTEL_EXPORTER_OTLP_ENDPOINT=jaeger:4317
- OTEL_SERVICE_NAME=user-service
depends_on:
- jaeger
启动后访问 http://localhost:8080/api/orders,然后在 Jaeger UI (http://localhost:16686) 查看完整的分布式追踪链路。你会看到:
- Gateway 接收请求
- Gateway 路由到 Order Service
- Order Service 查询数据库
- Order Service 调用 User Service
- User Service 查询数据库
- 所有 Span 串联成一条完整的 Trace
高级追踪模式
Baggage:跨服务传递业务上下文
有时候你需要在整条链路中传递业务信息(比如用户 ID、租户 ID),而不仅仅是追踪信息。OpenTelemetry 的 Baggage 机制就是为此设计的:
import (
"go.opentelemetry.io/otel/baggage"
"go.opentelemetry.io/otel/attribute"
)
func handleRequest(ctx context.Context, r *http.Request) {
// 添加 baggage(会自动传播到下游服务)
ctx = baggage.ContextWithBaggage(ctx,
baggage.New(
baggage.Member{Key: "user.id", Value: "user_123"},
baggage.Member{Key: "tenant.id", Value: "tenant_456"},
baggage.Member{Key: "request.priority", Value: "high"},
),
)
// 下游服务可以读取这些 baggage
span := trace.SpanFromContext(ctx)
span.SetAttributes(
attribute.String("user.id", baggage.FromContext(ctx).Member("user.id").Value()),
)
}
Span Links:关联相关的 Trace
有时候两个 Trace 在逻辑上是相关的(比如批量处理中的单个任务),但又不属于同一个 Trace。这时可以使用 Span Links:
func processBatchItem(ctx context.Context, batchTraceID string, item Item) {
// 创建关联到批次 Trace 的 Span
_, span := tracer.Start(ctx, "ProcessItem",
trace.WithLinks(
trace.Link{
SpanContext: trace.NewSpanContext(trace.SpanContextConfig{
TraceID: parseTraceID(batchTraceID),
}),
Attributes: []attribute.KeyValue{
attribute.String("link.type", "batch_parent"),
},
},
),
)
defer span.End()
// 处理逻辑...
}
常见陷阱与最佳实践
1. 不要在循环中创建过多 Span
// ❌ 错误:循环中创建大量 Span
for i := 0; i < 10000; i++ {
ctx, span := tracer.Start(ctx, "ProcessItem")
processItem(i)
span.End()
}
// ✅ 正确:为整个批次创建一个 Span
ctx, span := tracer.Start(ctx, "ProcessBatch",
trace.WithAttributes(attribute.Int("batch.size", 10000)),
)
defer span.End()
for i := 0; i < 10000; i++ {
// 使用事件而不是子 Span
span.AddEvent("processing item",
trace.WithAttributes(attribute.Int("item.index", i)),
)
processItem(i)
}
2. 及时调用 span.End()
忘记调用 span.End() 会导致 Span 永远不会结束,造成内存泄漏和追踪数据不完整:
// ❌ 错误:忘记 End
func bad() {
ctx, span := tracer.Start(context.Background(), "MyOperation")
// ... 逻辑 ...
// 忘记调用 span.End()
}
// ✅ 正确:使用 defer
func good() {
ctx, span := tracer.Start(context.Background(), "MyOperation")
defer span.End()
// ... 逻辑 ...
}
3. 合理设置 Span 属性
不要过度使用属性,只记录对调试有价值的数据:
// ❌ 错误:记录敏感信息
span.SetAttributes(
attribute.String("user.password", user.Password),
attribute.String("credit_card.number", card.Number),
)
// ✅ 正确:只记录必要信息
span.SetAttributes(
attribute.String("user.id", user.ID),
attribute.String("order.id", order.ID),
attribute.Bool("order.is_premium", order.IsPremium),
)
故障排查指南
问题 1:Jaeger 中看不到 Trace
检查清单:
- 确认 Collector 是否正常运行
- 确认服务是否能连接到 Collector(检查网络和端口)
- 确认采样率设置(
AlwaysSamplevsTraceIDRatioBased) - 查看服务日志是否有导出错误
# 检查 Collector 状态
kubectl get pods -l app=otel-collector
# 查看 Collector 日志
kubectl logs -l app=otel-collector
# 测试 Collector 端口连通性
nc -zv localhost 4317
问题 2:Trace 链路不完整
通常是上下文传播出了问题:
// 检查是否正确注入/提取上下文
// 在 HTTP 客户端
req, _ := http.NewRequest("GET", url, nil)
otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
// 打印 Header 确认是否有 traceparent
log.Printf("Headers: %v", req.Header)
// 在 HTTP 服务端
ctx = otel.GetTextMapPropagator().Extract(ctx, propagation.HeaderCarrier(r.Header))
// 检查是否有有效的 Span Context
if span := trace.SpanFromContext(ctx); span.SpanContext().IsValid() {
log.Println("Valid parent span found")
} else {
log.Println("No valid parent span")
}
问题 3:Prometheus 指标不更新
检查指标注册是否正确:
// 确保使用同一个 MeterProvider
meter := otel.Meter("my-service")
// 检查指标名称是否符合 Prometheus 规范
// Prometheus 要求指标名只能包含字母、数字、下划线
counter, _ := meter.Int64Counter("http_requests_total") // ✅
counter, _ := meter.Int64Counter("http-requests-total") // ❌
性能优化建议
在生产环境中使用 OpenTelemetry 时,需要注意性能影响:
1. 异步导出
使用异步批量导出,避免阻塞业务逻辑:
// 配置批量导出器
tp := sdktrace.NewTracerProvider(
sdktrace.WithBatcher(exporter,
sdktrace.WithBatchTimeout(5*time.Second), // 最多等待 5 秒
sdktrace.WithMaxExportBatchSize(512), // 每批最多 512 个 Span
sdktrace.WithMaxQueueSize(2048), // 队列最大容量
sdktrace.WithExportTimeout(30*time.Second), // 导出超时
),
)
2. 减少字符串分配
Span 属性和事件名称应该复用,避免每次创建新字符串:
// 定义常量属性键
var (
userIDKey = attribute.Key("user.id")
orderIDKey = attribute.Key("order.id")
statusKey = attribute.Key("http.status")
)
// 使用常量键
span.SetAttributes(
userIDKey.String("user_123"),
orderIDKey.String("order_456"),
statusKey.Int(200),
)
3. 条件性记录详细信息
只在需要时记录详细信息,避免不必要的开销:
func handleRequest(ctx context.Context, req *Request) error {
ctx, span := tracer.Start(ctx, "HandleRequest")
defer span.End()
// 只在调试模式下记录详细请求体
if os.Getenv("OTEL_DEBUG") == "true" {
body, _ := json.Marshal(req)
span.SetAttributes(attribute.String("request.body", string(body)))
}
// 只在错误时记录详细信息
if err := processRequest(req); err != nil {
span.RecordError(err)
span.SetAttributes(
attribute.String("error.details", err.Error()),
attribute.String("error.stack", string(debug.Stack())),
)
return err
}
return nil
}
4. 使用 Span 过滤器
在 Collector 层面过滤不需要的 Span,减少存储成本:
# otel-collector-config.yaml
processors:
filter/drop-health:
error_mode: ignore
traces:
span:
# 丢弃健康检查相关的 Span
- 'attributes["http.target"] == "/health"'
- 'attributes["http.target"] == "/metrics"'
- 'attributes["http.target"] == "/ready"'
tail_sampling:
policies:
# 只采样有错误的 Trace
- name: errors-policy
type: status_code
status_code: {status_codes: [ERROR]}
# 只采样慢请求(延迟 > 1s)
- name: latency-policy
type: latency
latency: {threshold_ms: 1000}
# 采样特定服务的 Trace
- name: service-policy
type: string_attribute
string_attribute:
key: service.name
values: [order-service, payment-service]
总结
恭喜你完成了 OpenTelemetry 的学习之旅!让我们来回顾一下核心要点:
- Tracing:使用 Span 记录请求的完整链路,支持跨服务上下文传播
- Metrics:Counter、Histogram、Gauge 等类型覆盖各种度量场景
- Logging:将 TraceID 和 SpanID 注入日志,实现日志与链路追踪的关联
- Auto-Instrumentation:HTTP、gRPC、数据库驱动等常见组件都有自动埋点插件
- 采样策略:生产环境使用 ParentBased + TraceIDRatio 控制数据量
- 导出器:支持 Jaeger、Tempo、Prometheus 等主流后端
- Collector:作为中间代理统一管理遥测数据的接收和转发
可观测性不是锦上添花,而是分布式系统的必需品。下次当你的服务在凌晨两点出故障时,有了 OpenTelemetry,你就能在 Jaeger 里一眼看到问题的根源,而不是在日志的海洋里苦苦搜索。
希望这篇文章能帮助你掌握 Go 项目中 OpenTelemetry 的集成方法。下一篇,我们将深入 Kubernetes Operator 的世界!
继续阅读
探索更多技术文章
浏览归档,发现更多关于系统设计、工具链和工程实践的内容。