Who Watches the Watcher? Debugging a Silent Langfuse Integration in Production
I deployed Langfuse for observability, but the dashboard stayed empty. Instead of redeploying with more logs, I used BEAM's :dbg to trace live function calls and hot-loaded fixes into a running Elixir system. This is the story of debugging the watcher itself.
I spent a recent afternoon staring at an empty Langfuse dashboard, wondering if I'd lost my mind. No traces, no generations, no sign that my Elixir app had ever spoken to it. That was not how I left it the previous day.
It's a particular kind of irony, isn't it? You deploy an observability platform to monitor your system, and then the very thing you set up for observability lacks it in return. Who watches the watcher? Turns out, I do. With :dbg and a stubborn refusal to redeploy every time I form a new hypothesis.
The setup seemed straightforward enough. I was designing an AI email parser that processes emails daily, extracting structured data from messy HTML using self-hosted LLM models with VLLM. I needed to monitor latency, token usage, prompt effectiveness, and, importantly, which types of emails consumed the most GPU time.
Langfuse made sense: a self-hosted option for data privacy, a proper OpenAPI spec, and support for defining custom costs for VLLM (since I'm paying for GPU hours, not tokens). I wired up Data.Telemetry.LangfuseReporter, deployed, and everything worked fine... until the next day. Silence.
Reading the Code Like a Detective
The first rule of production debugging is to understand what the code should do. I opened Data.Telemetry.LangfuseReporter:
defmodule Data.Telemetry.LangfuseReporter do
  @moduledoc """
  Enhanced Langfuse reporter for VLLM observability with synchronous calls.
  """
  require Logger

  @vllm_events [
    [:data, :email, :llm_parse],
    [:data, :email, :vertex_parse],
    [:data, :vllm, :generation],
    [:data, :vllm, :tool_call]
  ]

  def attach_handlers do
    if enabled?() do
      Enum.each(@vllm_events, fn event ->
        :telemetry.attach(
          "langfuse-#{Enum.join(event, "-")}",
          event,
          &handle_event/4,
          %{}
        )
      end)

      Logger.info("[Langfuse] Enhanced handlers attached for VLLM tracing")
    end

    :ok
  end

  defp handle_event(_event_name, _measurements, _metadata, _config) do
    :ok
  end

  # enabled?/0, create_trace/1, create_generation/1, send_to_langfuse/1 omitted
end
First red flag, and a false lead as it turned out: the handle_event/4 callback was a stub. Just :ok and nothing else. That explained why telemetry events weren't being captured automatically, but I knew the integration also made direct calls from the LLM parser via LangfuseReporter.create_trace/1 and create_generation/1.
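For completeness, a non-stub handler would just forward the telemetry payload into those same functions. A sketch of what that could look like; the measurement and metadata keys here are assumptions, not the app's real ones:
# Hypothetical handler: forward a VLLM telemetry event into a Langfuse generation.
defp handle_event([:data, :vllm, :generation], measurements, metadata, _config) do
  create_generation(%{
    trace_id: metadata[:trace_id],
    model: metadata[:model],
    latency_ms: System.convert_time_unit(measurements.duration, :native, :millisecond),
    total_tokens: measurements[:total_tokens]
  })
end

defp handle_event(_event_name, _measurements, _metadata, _config), do: :ok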
The create_trace function looked solid. It built a payload, checked for config, and fired off a Req.post. Classic HTTP-client work. But was it actually running in production? I could've added logs and waited 15 minutes for a redeploy. Instead, I did what any BEAM-powered masochist would do: I traced the live system.
The Configuration Check That Wasn't Enough
A quick kubectl exec got me into the pod. Environment variables looked fine: Langfuse endpoint, public key, and secret key all present, using Kubernetes DNS to hit the service in-cluster. Good, I didn't mess up any env variable this time.
kubectl exec -it data-574c77b48b-kzw6z -- /bin/bash
env | grep LANGFUSE
# LANGFUSE_ENDPOINT=http://langfuse-web.langfuse.svc.cluster.local:3000
# LANGFUSE_PUBLIC_KEY=pk-lf-[redacted]
# LANGFUSE_SECRET_KEY=sk-lf-[redacted]
But I don't trust environment variables. I trust what the VM is actually doing. That's where Erlang's :dbg module becomes your best friend. You can watch every call, every argument, every return value, in real-time, without touching a line of code.
So, I fired up a remote shell and started tracing:
kubectl exec -it data-574c77b48b-kzw6z -- /app/bin/data remote
Inside the remote IEx session:
# Start the tracer
:dbg.start()
# Set up a custom tracer process that prints to stdout
:dbg.tracer(:process, {fn msg, _n ->
  IO.puts(inspect(msg, limit: :infinity, pretty: true))
  0
end, 0})
# Trace all processes for call events
:dbg.p(:all, :c)
# Trace specific functions with return values
:dbg.tpl(Data.Telemetry.LangfuseReporter, :send_to_langfuse, [{:_, [], [{:return_trace}]}])
:dbg.tpl(Data.Telemetry.LangfuseReporter, :create_trace, [{:_, [], [{:return_trace}]}])
:dbg.tpl(Req, :post, [{:_, [], [{:return_trace}]}])
The pattern [{:_, [], [{:return_trace}]}] means: match any arguments, no guards, and include the return value. It's the trace pattern equivalent of "show me everything."
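Two related calls worth knowing (same module and function as above): you can scope the pattern to a single arity, and you should clear things when you're done.
# Same match spec, limited to create_trace/1 specifically.
:dbg.tpl(Data.Telemetry.LangfuseReporter, :create_trace, 1,
         [{:_, [], [{:return_trace}]}])

# When finished: clear the local trace pattern and stop the tracer.
:dbg.ctpl(Data.Telemetry.LangfuseReporter, :create_trace, 1)
:dbg.stop_clear()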
Then I triggered a test event:
Data.Telemetry.LangfuseReporter.create_trace(%{
  trace_id: UUID.uuid4(),
  user_id: 99999,
  email_id: 123,
  subject: "Test trace",
  sync_id: 456,
  gmail_auth_id: 789
})
The trace output was revealing:
{:trace, #PID<0.1234.0>, :call,
{Data.Telemetry.LangfuseReporter, :create_trace,
[%{trace_id: "550e8400-e29b-41d4-a716-446655440000", ...}]}}
{:trace, #PID<0.1234.0>, :call,
{Data.Telemetry.LangfuseReporter, :send_to_langfuse,
[%{id: "...", timestamp: "2025-11-14T10:30:00Z", type: "trace-create", ...}]}}
{:trace, #PID<0.1234.0>, :return_from,
{Data.Telemetry.LangfuseReporter, :send_to_langfuse, 1}, :ok}
Critical finding: send_to_langfuse was being called and returning :ok, but there was NO trace event for Req.post/2. The HTTP request wasn't happening.
I added more granular tracing, this time for Req's internals:
:dbg.tpl(Req.Request, :run_request, [{:_, [], [{:return_trace}]}])
:dbg.tpl(Finch, :request, [{:_, [], [{:return_trace}]}])
Triggered another test, and there it was:
{:trace, #PID<0.1234.0>, :exception_from,
{Req.Request, :run_request, 1},
{ArgumentError, "unknown registry: Req.Finch"}}
The Missing Finch Pool and the Hot-Load Heroics
Ah. Req was trying to use a Finch connection pool called Req.Finch, but my app only had DataFinch. The request failed silently somewhere in the middleware stack, got converted to an error tuple, and my code treated it as missing config and swallowed it.
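For context, DataFinch is the pool my app starts in its supervision tree, roughly like this (a sketch; the pool options are illustrative, not the real ones):
# Inside Data.Application.start/2
children = [
  # The only Finch instance this app starts — hence no Req.Finch registry.
  {Finch, name: DataFinch, pools: %{default: [size: 25, count: 1]}}
  # ... the rest of the supervision tree
]

Supervisor.start_link(children, strategy: :one_for_one, name: Data.Supervisor)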
The fix was trivial: tell Req to use the correct Finch pool:
case Req.post(url,
       json: batch_payload,
       headers: headers,
       receive_timeout: 10_000,
       finch: DataFinch
     ) do
  {:ok, %{status: status}} when status in 200..299 -> :ok
  {:ok, %{status: status, body: _body}} -> {:error, {:http_error, status}}
  {:error, reason} -> {:error, reason}
end
But I wasn't about to wait 15 minutes to test it. The BEAM has a better way.
Hot code loading is one of those BEAM features that feels like cheating. It was built for telecom systems that couldn't go down, but it's equally valuable for debugging. You can replace running code without stopping the system, test a fix immediately, and iterate in seconds.
Here's the process I used:
- Modified the code locally with finch: DataFinch
- Compiled just that module: mix compile --force
- Located the compiled .beam file: _build/dev/lib/data/ebin/Elixir.Data.Telemetry.LangfuseReporter.beam
- Copied it to the pod:
kubectl cp _build/dev/lib/data/ebin/Elixir.Data.Telemetry.LangfuseReporter.beam \
  data-574c77b48b-kzw6z:/tmp/
- Hot-loaded it in the remote IEx session:
code_binary = File.read!("/tmp/Elixir.Data.Telemetry.LangfuseReporter.beam")
:code.load_binary(Data.Telemetry.LangfuseReporter,
  'Elixir.Data.Telemetry.LangfuseReporter.beam',
  code_binary)
Remote shell returned {:module, Data.Telemetry.LangfuseReporter}. I triggered another trace and saw it: Req.post/2 was now being called, and returning {:ok, %Req.Response{status: 207, ...}}.
Status 207 means "Multi-Status" - batch processed with mixed results. But the dashboard was still empty.
Shot Into the Ether: Why the Server Wasn't Listening, or Was It?
Time to check the server side. It turned out Langfuse was now receiving the events, but its logs were full of validation errors:
{
  "level": "error",
  "message": "Error processing events",
  "timestamp": "2025-11-14T10:45:23.123Z",
  "errors": [
    {
      "code": "invalid_type",
      "expected": "string",
      "received": "undefined",
      "path": ["batch", 0, "id"],
      "message": "Required"
    },
    {
      "code": "invalid_type",
      "expected": "string",
      "received": "undefined",
      "path": ["batch", 0, "timestamp"],
      "message": "Required"
    }
  ]
}
The payload structure was wrong. Langfuse was receiving requests but rejecting them due to schema validation errors. This is where many integrations fail: assuming you know the API structure without reading the specification.
I fetched Langfuse's official OpenAPI spec:
http GET https://cloud.langfuse.com/generated/api/openapi.yml \
> langfuse-openapi.yml
Searching for the ingestion endpoint's schema:
BaseEvent:
  type: object
  required: [id, timestamp, type, body]
  properties:
    id:
      type: string
      format: uuid
      description: Unique event identifier
    timestamp:
      type: string
      format: date-time
    type:
      type: string
      enum: [trace-create, generation-create, span-create, ...]
    body:
      type: object
      description: Event-specific payload
There we go - I was conflating the event ID with the trace ID. The outer payload needs a unique event ID, timestamp, and type, while the body contains the actual trace data.
My payload:
%{
  type: "trace-create",
  body: %{
    id: trace_id,     # WRONG - trace ID inside body
    timestamp: "..."  # WRONG - missing at top level
  }
}
Correct payload:
%{
  id: UUID.uuid4(),    # Unique EVENT ID
  timestamp: "...",
  type: "trace-create",
  body: %{
    id: trace_id,      # TRACE ID inside body
    # ...
  }
}
This is a common API design pattern: the envelope vs. the payload. I fixed create_trace/1:
def create_trace(params) do
  payload = %{
    id: UUID.uuid4(),  # EVENT ID
    timestamp: format_datetime(params[:start_time] || DateTime.utc_now()),
    type: "trace-create",
    body: %{
      id: params.trace_id,  # TRACE ID
      sessionId: params[:session_id],
      userId: to_string(params[:user_id]),
      # ...
    }
  }

  send_to_langfuse(payload)
end
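The format_datetime/1 helper referenced above isn't shown in the module excerpt; all it has to do is produce an ISO 8601 string. A plausible shape (a sketch, not the original code):
# Langfuse expects ISO 8601 timestamps; DateTime.to_iso8601/1 covers it.
defp format_datetime(%DateTime{} = dt), do: DateTime.to_iso8601(dt)

defp format_datetime(%NaiveDateTime{} = ndt) do
  ndt |> DateTime.from_naive!("Etc/UTC") |> DateTime.to_iso8601()
end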
Generations, Cost Tracking, and the Missing Context
While I was in the spec, I found more issues with the generation payload.
Token usage structure was wrong:
# My code
body: %{
  promptTokens: params[:prompt_tokens],
  completionTokens: params[:completion_tokens],
  totalTokens: params[:total_tokens]
}

# Spec requirement
CreateGenerationBody:
  properties:
    usage:
      type: object
      properties:
        promptTokens: integer
        completionTokens: integer
        totalTokens: integer
Fixed:
body: %{
  usage: %{
    promptTokens: params[:prompt_tokens],
    completionTokens: params[:completion_tokens],
    totalTokens: params[:total_tokens]
  },
  # ...
}
Output field naming was wrong. I used completion (OpenAI style), but Langfuse uses output:
# Wrong
body: %{completion: "The generated text..."}
# Correct
body: %{output: "The generated text..."}
Cost tracking structure needed work. For VLLM, costs are GPU-based:
defp calculate_vllm_cost(tokens, latency_ms) when is_number(tokens) and is_number(latency_ms) do
  # $3.50/hour for H200 GPU
  hours = latency_ms / (1000 * 60 * 60)
  Float.round(hours * @hourly_cost, 6)
end

body: %{
  costDetails: %{
    total: calculate_vllm_cost(params[:total_tokens], params[:latency_ms]),
    input: calculate_vllm_cost(params[:prompt_tokens], params[:latency_ms] / 2),
    output: calculate_vllm_cost(params[:completion_tokens], params[:latency_ms] / 2)
  }
}
This splits the GPU cost evenly between input and output by halving the latency (the token counts only guard the function), which is a rough proxy for computational work. A token-weighted split would be more faithful, as sketched below.
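A minimal sketch of that variant, assuming the total cost has already been computed; apportion_cost/3 is a hypothetical helper, not part of the original module:
# Hypothetical variant: split the total GPU cost by each side's token share.
defp apportion_cost(total_cost, prompt_tokens, completion_tokens) do
  total_tokens = max(prompt_tokens + completion_tokens, 1)

  %{
    total: Float.round(total_cost, 6),
    input: Float.round(total_cost * prompt_tokens / total_tokens, 6),
    output: Float.round(total_cost * completion_tokens / total_tokens, 6)
  }
end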
After hot-loading all the schema fixes, events finally appeared! But gmail_id was consistently null. This field tracks which email was synced.
The data lived in the gmails table, accessible via a sync association. My Emails.get/1 wasn't preloading it:
# Before
def get(id) do
  Repo.get(Email, id)
end

# After
def get(id) do
  Email
  |> Repo.get(id)
  |> Repo.preload(:sync)
end
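The same preload can also live in the query itself, if you prefer keeping it next to the fetch; an equivalent sketch using Ecto.Query:
# Alternative: preload :sync as part of the query.
import Ecto.Query

def get(id) do
  from(e in Email, where: e.id == ^id, preload: [:sync])
  |> Repo.one()
end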
Then extract the ID:
gmail_id = email.sync && email.sync.gmail_id
LangfuseReporter.create_trace(%{
gmail_id: gmail_id,
# ...
})
After this final fix, full context flowed into Langfuse.
Debugging Philosophy: What This Session Taught Me (Again)
This wasn't my first rodeo with production debugging, but it reinforced some truths I've come to live by after a decade of building systems:
Observability beats speculation. The urge is always "add logs and redeploy." But logs are limited: you have to anticipate what to log, they add overhead, and changing them requires a redeploy. Instead, use VM-level tracing to see exactly what executes. :dbg let me observe function calls, arguments, return values, and exceptions in real time without modifying code.
When to use :dbg vs logging:
- Use :dbg when: Debugging unexpected behavior, tracing control flow, measuring precise timing
- Use logging when: Recording business events, tracking user actions, aggregating metrics
Hot code loading is for debugging, not just deployments. Being able to test a fix immediately, iterate in seconds instead of minutes, and validate on one pod before cluster-wide deploy changes how you debug. Just remember the limitations:
- Can't change module attributes or app config
- Requires restarting stateful processes (GenServers) to pick up state changes
- Safe for pure functional code, risky for stateful code
- Always follow up with proper deployment
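One extra sanity check after a hot load: compare the MD5 of the module the VM is actually running with the .beam you just built. A small sketch, not from the original session:
# On the pod (remote shell): MD5 of the module version the VM is running.
Data.Telemetry.LangfuseReporter.module_info(:md5)
|> Base.encode16(case: :lower)

# Locally: MD5 of the compiled artifact — returns {:ok, {module, md5}}.
:beam_lib.md5(~c"_build/dev/lib/data/ebin/Elixir.Data.Telemetry.LangfuseReporter.beam")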
Read the spec, not the docs. API documentation is often incomplete. OpenAPI specs are machine-readable contracts that define required fields, exact data types, nested structures, and valid enum values. When integrating any external API:
- Fetch the OpenAPI spec
- Generate types/schemas from it (or validate against it)
- Write tests that validate your payload against the spec
For Elixir, consider using ExJsonSchema to validate payloads against OpenAPI schemas before sending them.
Configuration is code. The missing Finch adapter came from an implicit assumption that Req.post would "just work." But configuration has dependencies, defaults, and failure modes. Best practices:
- Make configuration explicit in function calls
- Don't rely on global defaults when alternatives exist
- Validate configuration at startup, not first use (see the sketch after this list)
- Use typespecs to document configuration requirements
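For the startup-validation point, something as small as this would have surfaced the problem at boot instead of at the first trace. A sketch; the :data app name and the :langfuse config keys are assumptions based on this codebase:
defmodule Data.Telemetry.LangfuseConfig do
  @moduledoc false

  # Call from Application.start/2 so a missing key fails the boot, not the first trace.
  def validate! do
    config = Application.get_env(:data, :langfuse, [])

    for key <- [:endpoint, :public_key, :secret_key], is_nil(config[key]) do
      raise "Missing Langfuse configuration: #{inspect(key)}"
    end

    :ok
  end
end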
207 status codes hide problems. Multi-Status responses mean "some succeeded, some failed." Don't treat them as blanket success. Always check the response body:
case Req.post(url, json: batch_payload, headers: headers, finch: DataFinch) do
  {:ok, %{status: 207, body: %{"successes" => s, "errors" => e}}} when length(e) > 0 ->
    Logger.warning("Langfuse partial failure", successes: length(s), errors: inspect(e))
    {:error, {:partial_failure, e}}

  {:ok, %{status: status}} when status in 200..299 ->
    :ok

  {:error, reason} ->
    {:error, reason}
end
Distributed systems require multi-layer debugging. The bug spanned four layers: application code, HTTP client, API schema, and data layer. Each was "correct" in isolation. The failure only emerged from their interaction. Debugging approach:
- Start at boundaries (API requests/responses)
- Work inward (application code, data layer)
- Verify assumptions at each layer
- Check both sides of network calls
Performance Considerations and Production Safety
Cost of :dbg tracing: While powerful, :dbg has overhead:
- Each traced call generates a message to the tracer process
- High-frequency functions (called millions of times/second) can overwhelm the tracer
- Tracing :all processes captures everything, including BEAM internals
Production safety guidelines:
- Trace specific modules/functions, not :all modules
- Use :dbg.ctp/1 to clear trace patterns when done
- Limit to short debugging sessions (minutes, not hours)
- Monitor the tracer process mailbox: :erlang.process_info(pid, :message_queue_len)
For my use case (tracing HTTP requests that happen a few times per second), overhead was negligible—sub-millisecond per call.
Req vs Finch performance: Specifying the Finch pool (finch: DataFinch) matters for performance:
- Without a pool: new connection per request (TCP handshake, TLS negotiation)
- With a pool: connection reuse (HTTP/1.1 keep-alive or HTTP/2 multiplexing)
For my Langfuse integration (intra-cluster HTTP calls), this changed the latency from ~50ms to ~5ms per request.
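If you'd rather not pass finch: on every call, Req can take it as a default. This is global state for the whole node, so it's a judgment call; a one-line sketch:
# Make DataFinch the default pool for every Req request on this node.
# Req.default_options/1 is global, so set it once at application start.
Req.default_options(finch: DataFinch)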
Kubernetes Debugging Techniques
Copy compiled .beam files for hot patches:
# Copy to pod
kubectl cp local/file.beam namespace/pod:/tmp/
# In remote shell
code_binary = File.read!("/tmp/file.beam")
:code.load_binary(Module.Name, 'file.beam', code_binary)
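If you load the same module several times in one session, remember the purge step; a one-liner, using the same placeholder name as above:
# The VM keeps one current and one old version of a module. Purge the old
# one before loading a third, or load_binary returns {:error, :not_purged}.
:code.purge(Module.Name)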
Check logs on both sides:
# Client (your app)
kubectl logs -f data-574c77b48b-kzw6z
# Server (Langfuse)
kubectl logs -f -n langfuse langfuse-web-68cd7fb787-dhmd2
# Filter for errors
kubectl logs -f data-574c77b48b-kzw6z | grep -i error
Port-forward for direct testing:
# Forward Langfuse port to localhost
kubectl port-forward -n langfuse svc/langfuse-web 3000:3000
# Test from local machine
curl -X POST http://localhost:3000/api/public/ingestion \
-H "Authorization: Basic $(echo -n 'pk:sk' | base64)" \
-H "Content-Type: application/json" \
-d '{"batch": [...]}'
Verify service discovery:
kubectl exec -it data-574c77b48b-kzw6z -- nslookup langfuse-web.langfuse.svc.cluster.local
Architectural Improvements for Next Time
This session revealed several areas for improvement:
Schema validation at compile time:
defmodule Data.Telemetry.LangfuseSchema do
  @external_resource "priv/langfuse_openapi.yml"
  @openapi_spec YamlElixir.read_from_file!("priv/langfuse_openapi.yml")

  def validate_trace(payload) do
    schema = get_in(@openapi_spec, ["components", "schemas", "CreateTraceEvent"])
    ExJsonSchema.Validator.validate(schema, payload)
  end
end
# At module level: capture the environment at compile time
# (Mix is not available at runtime inside a release)
@validate_payloads Mix.env() in [:dev, :test]

# In create_trace/1
payload = build_trace_payload(params)

if @validate_payloads do
  case LangfuseSchema.validate_trace(payload) do
    {:error, errors} -> raise "Invalid payload: #{inspect(errors)}"
    :ok -> :ok
  end
end

send_to_langfuse(payload)
Circuit breaker for external APIs:
defmodule Data.Telemetry.LangfuseCircuitBreaker do
  use GenServer

  @failure_threshold 10
  @reset_timeout :timer.minutes(5)

  def record_failure do
    GenServer.call(__MODULE__, :record_failure)
  end

  def handle_call(:record_failure, _from, %{failures: f} = state) when f >= @failure_threshold do
    Logger.error("Langfuse circuit breaker OPEN")
    # Alert to Sentry/PagerDuty
    {:reply, :circuit_open, state}
  end

  # ... implementation
end
Integration tests against real schemas:
defmodule Data.Telemetry.LangfuseReporterTest do
  use ExUnit.Case

  @langfuse_schema File.read!("priv/langfuse_openapi.yml") |> YamlElixir.read_from_string!()

  test "create_trace builds valid payload" do
    params = %{trace_id: UUID.uuid4(), user_id: 123, ...}
    payload = LangfuseReporter.build_trace_payload(params)
    schema = get_in(@langfuse_schema, [...])
    assert :ok = ExJsonSchema.Validator.validate(schema, payload)
  end
end
What I Learned (Again)
This debugging journey, from silent failures to full observability, shows why I keep choosing Elixir for production systems.
The BEAM gives you superpowers: production introspection without redeployment, hot code loading for rapid hypothesis testing, process-level isolation that makes tracing safe, and built-in distribution that extends debugging across nodes.
I used all of them in this session, and they saved me hours of redeploy cycles. But here's the thing: tools like :dbg and hot code loading are only effective when you combine them with solid engineering practices.
You need to read the OpenAPI spec, not just the documentation. You need to validate schemas at compile time when you can. You need to make configuration explicit instead of relying on magic defaults. You need to check both sides of network calls - client logs and server logs. You need to test with real data against real schemas, not just mocked unit tests.
And when you find yourself staring at a silent dashboard at 3 AM, remember this: resist the urge to add logging and redeploy. Instead, observe the running system with tracing. Form hypotheses based on what you actually see, not what you think should be happening. Test your fixes via hot code loading on a single pod. Verify on both client and server. Only deploy the validated fix cluster-wide.
And if you're deploying an observability tool? Make sure you have a way to observe it. Because nobody else will.