Today I learned

llm

A collection of 3 posts
Ghostbusters: Who You Gonna Call When KV Cache Eats Your GPU?

My model was 60GB and my GPU had 141GB, so I should have had 81GB free — yet I kept hitting OOM errors. The culprit? The KV cache, an unseen memory hog that consumed 68GB without appearing in any config file. This article explores how the context window and batch size compete for that memory in a zero-sum game.
03 Nov 2025 14 min read
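The memory the teaser describes is easy to estimate from the standard KV cache formula: two tensors (K and V) per layer, sized by KV heads, head dimension, dtype width, sequence length, and batch size. A minimal sketch, using hypothetical Llama-70B-style numbers (80 layers, 8 GQA KV heads, head dim 128, FP16) rather than the article's actual model:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   dtype_bytes, seq_len, batch_size):
    # 2 = one K tensor and one V tensor cached per layer
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * seq_len * batch_size

# Per-token cost for the hypothetical config: 327,680 bytes (~320 KiB/token)
per_token = kv_cache_bytes(80, 8, 128, 2, seq_len=1, batch_size=1)

# Batch 8 at a 32k context window: 80 GiB of VRAM for the cache alone
total_gib = kv_cache_bytes(80, 8, 128, 2, seq_len=32768, batch_size=8) / 2**30
```

Halving the batch size or the context window halves the cache, which is exactly the zero-sum trade-off the post is about.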
ai

Fast & Furious Tensor Parallelism: GPU Heist Gone Wrong

I expected splitting a model across 4 H200 GPUs to quadruple throughput; instead, latency got 2.8x worse and throughput dropped 35%. Without NVLink, tensor parallelism adds more communication overhead than the split saves in compute, so sometimes one GPU outperforms four.
02 Nov 2025 17 min read
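The effect in the teaser can be sketched with a simple cost model: tensor parallelism divides the matmul time by the TP degree but adds roughly two all-reduces per transformer layer (after attention and after the MLP), and those all-reduces are far slower over PCIe than over NVLink. All numbers below are illustrative assumptions, not the article's measurements:

```python
def tp_decode_latency_ms(compute_ms, num_layers, allreduce_ms, tp_degree):
    # Compute shrinks with the TP degree; communication grows with depth.
    if tp_degree == 1:
        return compute_ms
    return compute_ms / tp_degree + 2 * num_layers * allreduce_ms

# Hypothetical: 40 ms of single-GPU compute, 80 layers.
pcie   = tp_decode_latency_ms(40, 80, allreduce_ms=0.3,  tp_degree=4)  # 58.0 ms: worse than 1 GPU
nvlink = tp_decode_latency_ms(40, 80, allreduce_ms=0.02, tp_degree=4)  # 13.2 ms: TP pays off
```

With a slow interconnect the communication term dominates and TP=4 loses to TP=1, matching the post's headline result.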
ai

Honey, I Shrunk the Model: When Quantizing 70B Parameters Broke Everything

I tried to shrink a 70B model from FP16 to FP8 to fit in my 141GB of VRAM. Spoiler: it broke everything. After testing 6 models and 3 quantization formats, I discovered that a 30B model in full precision outperformed every quantized 70B. Turns out precision matters more than parameter count.
01 Nov 2025 9 min read
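The motivation for quantizing in the first place is plain weight arithmetic: bytes per parameter times parameter count. A minimal sketch of that back-of-the-envelope math (weights only; the KV cache and activations come on top, which is why 140GB of FP16 weights cannot actually run on a 141GB GPU):

```python
def weight_gb(params_billion, bytes_per_param):
    # Weight memory only, in decimal GB.
    return params_billion * 1e9 * bytes_per_param / 1e9

fp16_70b = weight_gb(70, 2)  # 140 GB: no headroom on a 141GB GPU
fp8_70b  = weight_gb(70, 1)  # 70 GB: fits, but at reduced precision
fp16_30b = weight_gb(30, 2)  # 60 GB: fits in full precision
```

The post's finding is that the third option — fewer parameters at full precision — beat the quantized 70B variants in quality.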