From Tokens to Tensors: How a 744B Model Processes Five Words
Tracing "The capital of France is" through GLM-5's architecture — from tokenization to next-token prediction — then deploying a 120B MoE model on real hardware with all the topology gotchas nobody warns you about.
Rohan Singh · March 2026 · 22 min read
There's a gap between understanding a model architecture diagram and knowing what actually happens when tokens hit silicon. This post bridges that gap by tracing a single forward pass through GLM-5 (744B), then showing what it takes to deploy a model like this on real GPU hardware — including the topology gotchas that no diagram warns you about, and the KV cache mechanics that tie everything together.
Table of Contents
- The GLM-5 architecture at a glance
- Tracing the forward pass
- Matrix dimensions at every stage
- How GPUs actually run this
- Real deployment: 120B MoE on 4× A100 40GB
- When topology bites back
- The KV cache: the missing piece
- Prompt caching: from GPU memory to cloud APIs
- Lessons learned
1. The GLM-5 architecture at a glance
GLM-5 is a 744B-parameter Mixture-of-Experts transformer. The headline numbers: 78 transformer blocks, embedding dimension of 6,144, vocabulary of 155k tokens, and 256 experts per MoE layer — of which only 8 (+1 shared) are active per token. That sparsity is the key trick: the model has 744B total parameters but only ~40B are active per inference step.
A few architectural choices make it interesting:
Multi-head Latent Attention (MLA) compresses the KV cache through a low-rank bottleneck, dramatically reducing the memory footprint for long-context inference. Instead of storing full K/V tensors per head per layer, MLA projects to a compact latent and reconstructs on the fly.
DeepSeek Sparse Attention (DSA) lets some attention heads use sparse patterns instead of full quadratic attention, making long-context tractable without the O(n²) memory cost.
SwiGLU feedforward — the gated linear unit variant that multiplies a SiLU-activated path with a linear path — appears in both the dense FFN (blocks 1–3) and the MoE experts (blocks 4–78). The dense FFN uses a hidden size of 12,288; each expert uses a much smaller 2,048.
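To make the gating concrete, here's the SwiGLU computation as a minimal numpy sketch. Toy dimensions and illustrative weight names (`w_gate`, `w_up`, `w_down` are not GLM-5's actual parameter names):

```python
import numpy as np

def silu(x):
    # SiLU (a.k.a. swish): x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # Gated linear unit: a SiLU-activated path multiplied elementwise by a
    # linear path, then projected back down to the model dimension
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, d_hidden = 64, 128            # toy stand-ins for GLM-5's 6144 and 12288
x = rng.standard_normal((5, d_model))  # five token positions
w_gate = 0.02 * rng.standard_normal((d_model, d_hidden))
w_up   = 0.02 * rng.standard_normal((d_model, d_hidden))
w_down = 0.02 * rng.standard_normal((d_hidden, d_model))

out = swiglu_ffn(x, w_gate, w_up, w_down)
assert out.shape == (5, d_model)       # shape preserved: (5, d_model) in, (5, d_model) out
```

Note that the gate and up projections are two independent matrices of the same shape, which is why the hidden activations show up as "(5, 12288) ×2" in the dimension table later.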
2. Tracing the forward pass
Let's trace exactly what happens when "The capital of France is" enters the network.
Tokenization
The BPE tokenizer splits the string into roughly 5 token IDs — something like [1042, 5831, 203, 8927, 412], each an integer in [0, 155,000).
Embedding
Each token ID indexes into an embedding matrix of shape (155k × 6,144). Our 5 tokens become a matrix of shape (5 × 6,144) — five dense vectors, one per position. At this point, the representations are context-free: "France" has the same vector regardless of surrounding words.
The 78-layer transformer stack
Each block follows the same pattern: RMSNorm → Attention (with residual) → RMSNorm → Feedforward (with residual).
The attention mechanism is where context gets built. At position 4 ("is"), the attention scores tell the model how much to attend to each prior token. It attends heavily to "France" and "capital" to build a representation useful for predicting the next word. RoPE (Rotary Position Embeddings) encodes positional information via rotation matrices applied to queries and keys.
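RoPE's defining property, that attention scores depend only on relative position, can be checked in a few lines of numpy. This is a simplified single-vector sketch, not the batched per-head implementation a real engine uses:

```python
import numpy as np

def rope(x, pos, base=10_000):
    # Rotate consecutive dimension pairs of x by position-dependent angles
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# Same relative offset (2) in both cases, so the dot products match
d1 = rope(q, 3) @ rope(k, 1)
d2 = rope(q, 7) @ rope(k, 5)
assert np.isclose(d1, d2)
```

That invariance is why RoPE is applied to queries and keys (never values): the rotation cancels out of the score except for the position difference.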
Dense FFN vs MoE
The first 3 blocks use a dense SwiGLU FFN with hidden size 12,288 — every parameter participates for every token. The rationale: early layers learn fundamental, universally-needed representations.
Blocks 4–78 use MoE. For each token independently, a router selects 8 of 256 experts plus 1 shared expert. Each expert is a smaller SwiGLU with intermediate size 2,048. The outputs are weighted by their router probabilities and summed.
Why MoE matters: For our token "France" at block 50, the router might select experts specializing in named entities, geography, and European concepts. Meanwhile "is" routes to a different set tuned for syntax and copular verbs. The model learns specialized subnetworks without the compute cost of running all 744B parameters.
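The routing logic itself is small. Here it is as a numpy sketch with toy sizes; real routers also need load-balancing losses and expert capacity limits, omitted here:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def moe_layer(x, w_router, experts, shared, top_k=2):
    # x: (tokens, d_model); w_router: (d_model, n_experts)
    logits = x @ w_router
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                   # routing is per token
        top = np.argsort(logits[t])[-top_k:]      # indices of the top-k experts
        gates = softmax(logits[t, top])           # renormalize over selected experts
        out[t] = shared(x[t]) + sum(g * experts[e](x[t]) for g, e in zip(gates, top))
    return out

rng = np.random.default_rng(0)
d, n_experts = 16, 8                              # toy stand-ins for 6144 and 256
make_expert = lambda: (lambda v, W=0.1 * rng.standard_normal((d, d)): v @ W)
experts = [make_expert() for _ in range(n_experts)]

out = moe_layer(rng.standard_normal((5, d)), rng.standard_normal((d, n_experts)),
                experts, make_expert(), top_k=2)
assert out.shape == (5, d)
```

The key detail for deployment: `top` differs per token, which is exactly what forces the all-to-all token dispatch under expert parallelism later in this post.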
Output
After 78 blocks and a final RMSNorm, the hidden state at the last position is projected by a linear layer of shape (6,144 × 155,000) to produce logits over the vocabulary. Softmax yields the probability distribution — and "Paris" comes out on top.
3. Matrix dimensions at every stage
The tensor shape stays (5, 6,144) through almost the entire network. The interesting deviations tell you where the architecture does something clever:
| Stage | Shape | Notes |
|---|---|---|
| Raw text | string | "The capital of France is" |
| Token IDs | (5,) | BPE tokenizer → integer IDs |
| Token embeddings | (5, 6144) | Lookup in 155k × 6144 table |
| RMSNorm 1 | (5, 6144) | Normalize each row by its RMS |
| **MLA queries** | **(5, H, d_h)** | Per-head queries with RoPE |
| **KV latent** | **(5, d_c)** | Compressed bottleneck — K, V reconstructed from this |
| **Attention weights** | **(H, 5, 5)** | Row 4 ("is") attends to "France" and "capital" |
| Attention output + residual | (5, 6144) | x = x + Attn(RMSNorm(x)) |
| RMSNorm 2 | (5, 6144) | Normalize before feedforward |
| **Dense gate/up (blocks 1–3)** | **(5, 12288) ×2** | Two parallel linear maps |
| Dense FFN output + residual | (5, 6144) | SiLU(gate) ⊙ up → down project |
| **Router logits (blocks 4–78)** | **(5, 256)** | Score each of 256 experts |
| **Expert FFN (each of 8+1)** | **(5, 2048)** | 6144→2048→6144 SwiGLU (smaller) |
| MoE output + residual | (5, 6144) | Weighted sum of expert outputs |
| Final RMSNorm | (5, 6144) | After all 78 layers |
| **Logits** | **(5, 155000)** | Linear projection to vocabulary |
| **Next-token probs** | **(155000,)** | P("Paris") ≈ 0.92 |
The bolded rows are where the shape changes — each one marks a key architectural decision: KV compression for memory efficiency, the router's expert selection, the smaller expert hidden dimension, and the final explosion to vocabulary size.
4. How GPUs actually run this
GLM-5 has 744B parameters. At FP16, that's ~1.5 TB just for the weights. A single H100 has 80 GB of HBM3. You need at minimum ~19 H100s just to store the model — before any activations, KV cache, or optimizer state. In practice you'd use 48–256 GPUs.
The single GPU
An H100 has 132 streaming multiprocessors, each with 64 FP32 cores and 4 tensor cores specialized for matrix multiply. The memory hierarchy is the key constraint:
| Memory | Size | Bandwidth |
|---|---|---|
| HBM3 | 80 GB | 3.35 TB/s |
| L2 cache | 50 MB | ~12 TB/s |
| Registers + SRAM | ~26 MB | ~33 TB/s |
A single matmul like the output head (5, 6144) × (6144, 155k) is tiled into small blocks that fit in SRAM, computed by tensor cores, and written back. The entire forward pass is a carefully choreographed dance to keep data close to compute.
Tensor parallelism
For the dense FFN, the weight matrix W(6144 × 12288) is split column-wise across 8 GPUs. Each GPU holds a (6144 × 1536) shard and computes its portion of the matmul independently. The partial results are combined via AllReduce over NVLink (900 GB/s bidirectional). The matmul runs 8× faster, and no single GPU holds the full weight.
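The split can be simulated in numpy: column-split the first FFN matrix, row-split the second, and each "GPU" runs its slice end-to-end. The per-GPU outputs then combine with a plain elementwise sum, which is exactly the operation AllReduce performs. Toy sizes below (the real shards are 6144 × 1536), and a ReLU stands in for SwiGLU since any elementwise activation preserves the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
n_gpus, d_model, d_hidden = 8, 96, 192     # toy stand-ins for 8 GPUs, 6144, 12288
x  = rng.standard_normal((5, d_model))
W1 = 0.1 * rng.standard_normal((d_model, d_hidden))   # up projection
W2 = 0.1 * rng.standard_normal((d_hidden, d_model))   # down projection

# Column-split W1 and row-split W2: each "GPU" computes an independent slice
# with no communication until the final sum.
W1_shards = np.split(W1, n_gpus, axis=1)
W2_shards = np.split(W2, n_gpus, axis=0)
partials = [np.maximum(x @ w1, 0) @ w2 for w1, w2 in zip(W1_shards, W2_shards)]
y = sum(partials)                          # this sum is what AllReduce computes

assert np.allclose(y, np.maximum(x @ W1, 0) @ W2)     # matches the unsharded FFN
```

This pairing (column-parallel up, row-parallel down) is why a two-matmul FFN needs only one AllReduce per layer rather than two.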
Expert parallelism
For MoE layers, the 256 experts are spread across 32 GPUs — 8 experts per GPU. The router runs on every GPU and decides which experts each token needs. An all-to-all communication dispatches tokens to the GPUs hosting their chosen experts, those GPUs compute the small SwiGLU FFN, and a second all-to-all returns results.
Pipeline parallelism
The 78 layers are divided across 6 pipeline stages, each on a separate 8-GPU node. Data flows stage-to-stage over the network. To hide latency, the input batch is split into micro-batches that overlap — while stage 1 processes micro-batch 2, stage 2 is already working on micro-batch 1.
The full picture
6 nodes × 8 H100s = 48 GPUs. Pipeline parallelism (PP=6) across nodes, tensor parallelism (TP=8) within each node, expert parallelism (EP=32) for the MoE layers. Total memory: 3.84 TB — enough for 1.5 TB weights plus KV cache, activations, and overhead.
The real bottleneck: memory bandwidth
An H100 can do 989 TFLOPS at FP16, but its memory bandwidth is "only" 3.35 TB/s. For the output head matmul, the weight matrix is ~1.9 GB. Reading it from HBM takes ~0.57 ms, while the arithmetic is trivially fast on tensor cores. This operation is memory-bound — the GPU spends most of its time waiting for data.
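The arithmetic behind that 0.57 ms figure, assuming peak rates on both units:

```python
# Roofline check for the output-head matmul (5, 6144) x (6144, 155000) at FP16
flops        = 2 * 5 * 6144 * 155_000    # two FLOPs per multiply-accumulate
weight_bytes = 6144 * 155_000 * 2        # ~1.9 GB of weights read from HBM

compute_s = flops / 989e12               # H100 FP16 tensor-core peak: 989 TFLOPS
memory_s  = weight_bytes / 3.35e12       # HBM3 bandwidth: 3.35 TB/s

print(f"compute {compute_s * 1e6:.0f} us vs memory {memory_s * 1e6:.0f} us")
assert memory_s > 50 * compute_s         # memory-bound by more than 50x
```

With only 5 tokens in flight, every weight byte read from HBM is used for just 10 multiply-accumulates. Batching more tokens amortizes the same weight read across more arithmetic, which is why prefill throughput is so much higher than decode throughput.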
Three things in GLM-5 are specifically GPU-friendly:
- MLA shrinks the KV cache reads by 4–8× per attention step
- MoE sparsity means only ~453 MB of weight reads per token per layer instead of the full dense matrix
- FlashAttention fuses the entire attention computation into a single kernel that tiles in SRAM, turning an O(n²) memory operation into O(n)
5. Real deployment: 120B MoE on 4× A100 40GB
Theory is nice. Let's talk about deploying a model on real hardware — specifically GPT-OSS 120B on 4× A100 40GB GPUs with 160 GB total memory.
GPT-OSS 120B is architecturally similar to GLM-5 but smaller: 36 layers, embedding dimension of 2,880, 128 experts per layer with 4 active per token, and only 5.1B parameters active per inference step. It uses Grouped Query Attention (GQA) with 64 heads instead of MLA.
Memory budget
| Precision | Weight Memory | Total w/ KV + overhead | Fits? |
|---|---|---|---|
| FP16 (2B/param) | 240 GB | ~264 GB | ❌ Won't fit |
| INT8 (1B/param) | 120 GB | ~144 GB | ⚠️ Tight for long context |
| MXFP4 (0.5B/param) | 60 GB | ~84 GB | ✅ Comfortable, 76 GB headroom |
The model ships with native MXFP4 (microscaling FP4) quantization baked into the checkpoint — no post-hoc quantization needed. At 60 GB, that splits cleanly across 4 GPUs at 15 GB each, leaving ~25 GB per GPU for KV cache and activations.
The vLLM configuration
vllm serve openai/gpt-oss-120b \
--tensor-parallel-size 4 \
--quantization mxfp4 \
--dtype bfloat16 \
--max-model-len 16384 \
--gpu-memory-utilization 0.90
The 64 GQA heads divide cleanly across 4 GPUs (16 each). The 128 MoE experts also divide evenly (32 each). No awkward remainders.
Performance numbers
From our benchmarks on this hardware:
- Prefill throughput: up to 12,206 tok/s (peak, single request)
- Generation throughput: ~148 tok/s (single request, steady state)
- Batched generation: ~563 tok/s aggregate (8 concurrent requests)
- Prefix cache hit rate: 96.7%
6. When topology bites back
Running nvidia-smi topo -m on our 4× A100 box reveals a critical detail:
GPU0 GPU1 GPU2 GPU3
GPU0 X NV12 SYS SYS
GPU1 NV12 X SYS SYS
GPU2 SYS SYS X NV12
GPU3 SYS SYS NV12 X
This is a 2+2 split topology:
- GPU 0 ↔ GPU 1: NV12 (600 GB/s) — 12 NVLinks, blazing fast
- GPU 2 ↔ GPU 3: NV12 (600 GB/s) — same
- Cross-pair (0↔2, 0↔3, 1↔2, 1↔3): SYS (~25 GB/s) — PCIe + QPI, 20× slower
With --tensor-parallel-size 4, every layer does an AllReduce across all 4 GPUs. NCCL builds a ring where two of four hops are SYS links. That's 144 cross-NUMA transfers per forward pass (36 layers × 2 AllReduces per layer × 2 slow hops).
The hypothesis: PP=2 × TP=2
The idea: split layers into two pipeline stages, one per NVLink pair. Stage 1 (layers 1–18) runs on GPUs 0–1 with TP=2 over NV12. Stage 2 (layers 19–36) runs on GPUs 2–3 with TP=2 over NV12. The only cross-NUMA traffic is the activation tensor handoff between stages — 1 × 1 × 2880 × 2 bytes = 5.6 KB per decode step, essentially free.
vllm serve openai/gpt-oss-120b \
--tensor-parallel-size 2 \
--pipeline-parallel-size 2 \
--quantization mxfp4 \
--dtype bfloat16 \
--max-model-len 16384 \
--gpu-memory-utilization 0.90
| | TP=4 | PP=2 × TP=2 |
|---|---|---|
| Cross-NUMA transfers per forward pass | 144 | 1 |
| AllReduce link speed | Mix of NV12 + SYS | NV12 only |
The result: theory was wrong
After building a benchmark harness, cleaning up stale CUDA contexts (more on that below), and running both configs:
| Metric | TP=4 | PP=2 × TP=2 | Winner |
|---|---|---|---|
| Generation tok/s (single request) | ~148 | ~124 | TP=4 |
| Prompt tok/s (peak) | ~12,206 | ~10,159 | TP=4 |
| Batched generation (8 concurrent) | ~563 | ~696 (ramping) | Mixed |
TP=4 wins by ~16% on single-request decode. My prediction was wrong for three reasons:
- **AllReduce payloads are tiny at decode time.** What crosses the slow links is the activation tensor, not the MXFP4 weights — at batch size 1 that's roughly 2,880 bf16 values per hop, under 6 KB. A small payload over a slow link is still fast in absolute terms.
- **The pipeline bubble is real.** With PP=2, stage 2 idles while stage 1 processes, and vice versa. For single-request decode, this serialization penalty is a ~2× utilization hit that NVLink-only communication can't overcome.
- **NCCL is smart.** It likely uses shared memory or socket-based transfer for cross-NUMA hops, which is slower than NVLink but not catastrophic for small payloads.
Lesson: benchmark, don't theorize. The topology analysis was correct in principle — cross-NUMA is 20× slower. But the second-order effects (quantized payload sizes, pipeline bubbles, NCCL transport selection) dominated the first-order effect.
The process cleanup gotcha
When switching between configs, we hit NCCL initialization errors — not because PP=2×TP=2 was incompatible, but because the previous server's GPU workers didn't release their CUDA contexts.
vLLM with TP=4 spawns a process tree:
APIServer (parent)
└─ EngineCore
├─ WorkerProc rank=0 ← holds GPU 0 CUDA context
├─ WorkerProc rank=1 ← holds GPU 1 CUDA context
├─ WorkerProc rank=2 ← holds GPU 2 CUDA context
└─ WorkerProc rank=3 ← holds GPU 3 CUDA context
Killing the parent with SIGTERM doesn't guarantee the GPU workers release memory. The fix:
# Kill the entire process group, not just the parent
kill -- -$(ps -o pgid= -p $PID | tr -d ' ')
# Kill any straggler processes still holding the device files
fuser -k /dev/nvidia*
nvidia-smi --query-gpu=index,memory.used --format=csv,noheader
# Should show <1 GB per GPU
7. The KV cache: the missing piece
Everything above — the parallelism strategies, the memory budgets, the bandwidth bottlenecks — connects through one mechanism that often gets hand-waved: the KV cache.
The problem
When generating "Paris" after "The capital of France is", the model doesn't just process "is" — attention requires every token to look at every previous token. Without caching:
- To generate token 6, compute attention over tokens 1–5
- To generate token 7, compute attention over tokens 1–6
- To generate token 8, compute attention over tokens 1–7
Each step recomputes K and V projections for the entire sequence. For a 2,000-token clinical note generating a 500-token summary, that's:
2,001 + 2,002 + ... + 2,500 ≈ 1.1 million KV computations per layer
Most of those are redundant — token 1's K and V never change. Total cost: O(n²).
The solution
The KV cache stores the K and V projections from every previous token at every layer. Each new token computes only its own Q, K, V, then looks up everything else from the cache:
Without cache: O(n²) compute, O(1) memory
With cache: O(n) compute, O(n) memory
The tradeoff is compute for memory. You store growing tensors in GPU HBM to avoid recomputing them.
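Here's that bookkeeping as a single-head toy decode loop (hypothetical weight names, numpy for clarity; real engines do this per head, per layer, in fused kernels):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                     # toy hidden size
Wq, Wk, Wv = (0.1 * rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    # Single-head scaled dot-product attention over all cached positions
    scores = (K @ q) / np.sqrt(len(q))
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

K_cache, V_cache = [], []
for step in range(6):                      # one decode step per generated token
    x = rng.standard_normal(d)             # hidden state of the newest token
    q = x @ Wq
    K_cache.append(x @ Wk)                 # only the NEW token's K/V are computed;
    V_cache.append(x @ Wv)                 # every earlier entry is reused as-is
    out = attend(q, np.array(K_cache), np.array(V_cache))

assert len(K_cache) == 6                   # the cache grows one row per token
```

Each iteration does O(1) projection work plus an O(n) attention read, instead of reprojecting all n tokens, which is the O(n²) → O(n) trade described above.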
What gets stored
For GPT-OSS 120B with GQA (8 KV heads, head dim 45):
Per token, per layer: 2 × (8 heads × 45 dims) × 2 bytes = 1,440 bytes
Per token, all layers: 36 × 1,440 = 51,840 bytes ≈ 50 KB
Full cache (8k context): 8,192 × 50 KB ≈ 405 MB per sequence
At 16k context, that's ~810 MB per sequence. With ~25 GB of free memory per GPU after weights, you can cache roughly 30 concurrent 16k-length sequences — or 250+ shorter ones.
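The same sizing as a small reusable function (binary megabytes):

```python
def kv_cache_bytes(seq_len, layers=36, kv_heads=8, head_dim=45, bytes_per=2):
    # K and V per head, per layer, per token (FP16 = 2 bytes per value)
    per_token = layers * 2 * kv_heads * head_dim * bytes_per
    return seq_len * per_token

assert kv_cache_bytes(1) == 51_840                               # ~50 KB per token
print(f"{kv_cache_bytes(8_192)  / 2**20:.0f} MB at 8k context")  # ~405 MB
print(f"{kv_cache_bytes(16_384) / 2**20:.0f} MB at 16k context") # ~810 MB
```

Plugging in GLM-5's MLA parameters instead (compressed latent per token rather than 8 full K/V heads) is what produces the roughly 32× smaller cache shown in the next table.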
How GQA and MLA shrink the cache
Standard multi-head attention with 64 heads stores 64 separate K and V tensors per layer — that's the full-size cache. Two techniques compress this dramatically:
| Method | KV Heads Stored | Cache Size (relative) | Used By |
|---|---|---|---|
| Standard MHA | 64 K + 64 V | 1× (baseline) | Older models |
| GQA | 8 K + 8 V | ~0.125× (8× smaller) | GPT-OSS 120B |
| MLA | Compressed latent only | ~0.03× (~32× smaller) | GLM-5 |
GQA (Grouped Query Attention) shares K/V heads across groups of query heads. With 64 query heads and 8 KV heads, each KV head serves 8 query heads. The cache shrinks by 8×.
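In tensor terms, the sharing looks like this numpy sketch: only 8 K (and V) heads are ever cached, and each serves a group of 8 query heads. Real kernels index the group directly rather than materializing the repeat:

```python
import numpy as np

n_q_heads, n_kv_heads, head_dim, seq = 64, 8, 45, 5
group = n_q_heads // n_kv_heads            # 8 query heads share each KV head

rng = np.random.default_rng(0)
Q = rng.standard_normal((n_q_heads, seq, head_dim))
K = rng.standard_normal((n_kv_heads, seq, head_dim))   # only 8 K heads cached

# Expand the 8 shared K heads to line up with the 64 query heads, then score
K_shared = np.repeat(K, group, axis=0)                 # (64, seq, head_dim)
scores = np.einsum('hqd,hkd->hqk', Q, K_shared)
assert scores.shape == (n_q_heads, seq, seq)
```

The query side keeps its full 64 heads of expressiveness; only the cached state shrinks, which is why GQA costs so little quality for an 8× memory win.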
MLA (Multi-head Latent Attention) goes further: it stores only a compressed latent vector per token and reconstructs K and V on the fly during attention. This is how GLM-5 (744B) can serve 128k context — its KV cache is roughly the same size as GPT-OSS's despite the model being 6× larger.
Memory over the lifecycle of a request
Load Prefill Decode Done
GPU Memory ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃
40 GB ─────────┃───────┃────────────────────┃────────────────────┃──
┃ ┃ ┃ ╱╱ KV cache ┃
30 GB ─────────┃───────┃────────────────────┃───╱╱──(released)───┃──
┃ ┃ KV cache ┃╱╱ ┃
20 GB ─────────┃───────┃──────────growing──╱┃────────────────────┃──
┃ Framework + activations ┃ ┃
15 GB ─────────┃━━━━━━━━━━━━━━━━━━━━━━━━━━━┃━━━━━━━━━━━━━━━━━━━┃──
┃ ┃ ┃
┃ Model weights (MXFP4) ┃ (constant) ┃
0 GB ─────────┃━━━━━━━━━━━━━━━━━━━━━━━━━━━┃━━━━━━━━━━━━━━━━━━━┃──
Model weights are constant — loaded once at startup. The KV cache starts empty and grows linearly with each generated token. During prefill, the entire prompt's KV is computed at once. During decode, one row is appended per token per layer. When generation finishes, the cache is released.
vLLM's PagedAttention manages this memory in fixed-size pages — like a virtual memory allocator for attention state. Pages can be allocated and freed without fragmentation, which is what enables efficient concurrent serving.
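A toy allocator captures the core idea (a hypothetical simplification; vLLM's real block manager also handles prefix sharing and copy-on-write):

```python
class PagePool:
    """Toy PagedAttention-style allocator: KV memory as fixed-size pages."""
    PAGE_TOKENS = 16

    def __init__(self, n_pages):
        self.free = list(range(n_pages))
        self.pages = {}                    # request id -> list of page ids

    def grow(self, req, total_tokens):
        # Allocate just enough pages to hold total_tokens KV entries
        need = -(-total_tokens // self.PAGE_TOKENS)    # ceiling division
        held = self.pages.setdefault(req, [])
        while len(held) < need:
            held.append(self.free.pop())

    def release(self, req):
        # Whole pages go straight back to the free list: no fragmentation
        self.free.extend(self.pages.pop(req))

pool = PagePool(8)
pool.grow("req-1", 40)                     # 40 tokens -> ceil(40/16) = 3 pages
assert len(pool.pages["req-1"]) == 3 and len(pool.free) == 5
pool.release("req-1")
assert len(pool.free) == 8                 # all pages reusable immediately
```

Because sequences hold page id lists rather than contiguous buffers, a finished request's memory is instantly reusable by any new request, regardless of length.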
How KV cache explains the benchmark results
This actually explains why TP=4 beat PP=2×TP=2. With TP=4, each GPU holds 1/4 of the KV cache (16 of 64 GQA heads). During decode, each GPU reads only its own KV shard — a purely local memory read, no cross-GPU communication needed.
With PP=2×TP=2, the pipeline boundary means stage 1's KV cache (layers 1–18) lives entirely on GPUs 0–1, and stage 2's cache (layers 19–36) lives on GPUs 2–3. Each decode step is serialized: stage_1_time + transfer + stage_2_time. The KV cache access pattern (local reads vs pipeline serial) is what makes TP=4 win on per-token latency.
8. Prompt caching: from GPU memory to cloud APIs
The same KV cache mechanism that runs inside vLLM also powers prompt caching in cloud APIs like Amazon Bedrock. Understanding the connection reveals how a low-level GPU optimization becomes a product feature.
vLLM prefix caching (what our deployment uses)
vLLM automatically detects when two requests share the same token prefix and reuses the cached KV entries. Our benchmark logs showed:
Prefix cache hit rate: 96.7%
No API changes needed. The cache lives in GPU memory and persists as long as the server runs. When multiple clinicians send the same system prompt + protocol context, the KV entries from the first request get reused by all subsequent ones.
Bedrock prompt caching (the API-level equivalent)
Bedrock's prompt caching does the same thing, exposed as an API feature. When you mark a portion of your prompt with a cachePoint, Bedrock captures the model's internal neural state at that position — which is the KV cache tensors at every layer for those tokens — and stores it in an ephemeral cache.
The key practical differences:
| | vLLM Prefix Cache | Bedrock Prompt Cache |
|---|---|---|
| Activation | Automatic | Explicit (cachePoint markers) |
| TTL | Until evicted by memory pressure | 5 min default, 1 hour optional |
| Persistence | GPU memory (server lifetime) | AWS-managed infrastructure |
| Pricing | Self-hosted (your GPU costs) | Cache reads ~10% of input token cost |
| Cache writes | Free (already computed) | May cost more than standard input |
Clinical AI implications
For agentic pipelines on Bedrock, the high-value caching targets:
System prompt + tool definitions. These are identical across every user request. On a 2k-token system prompt hitting 100 queries in 5 minutes, you pay for 1 cache write + 99 cache reads instead of 100 full prefills — roughly 90% cost reduction on input tokens for that portion.
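The arithmetic, with illustrative rates: cache reads at ~10% of standard input-token cost and writes at ~1.25×. These multipliers are assumptions for the sketch, so check current Bedrock pricing for your model:

```python
tokens, hits = 2_000, 100     # 2k-token system prompt, 100 requests inside the TTL

baseline = hits * tokens                              # every request pays full prefill
cached = 1.25 * tokens + (hits - 1) * 0.10 * tokens   # 1 cache write + 99 cache reads

print(f"input-token cost reduction: {1 - cached / baseline:.0%}")   # roughly 89%
```

The reduction converges toward the read discount as hit counts grow, so the win is largest for hot, frequently reused prefixes.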
Document context in RAG. When a clinician uploads a 10k-token clinical note and asks 5 questions about it, the note gets cached on the first question. Questions 2–5 skip prefill for those 10k tokens entirely, making time-to-first-token dramatically faster.
Multi-turn conversations. With Bedrock's conversation history caching, the growing context of a clinical discussion doesn't need to be reprocessed from scratch on each turn. The KV state up to the previous turn is cached, and only the new user message triggers fresh computation.
For self-hosted vLLM deployments (like our GPT-OSS 120B setup), prefix caching gives you the same benefits automatically — no code changes required, just the nature of serving multiple requests with shared context through the same inference engine.
9. Lessons learned
Architecture determines hardware requirements, not just parameter count. A 744B MoE model with 40B active parameters has a fundamentally different deployment profile than a 40B dense model — less compute per token, but the weight storage and expert routing add complexity.
GPU topology matters more than total FLOPS. Our 4× A100 box has impressive aggregate compute, but the 2+2 NVLink split means naive TP=4 could waste cycles on slow cross-NUMA AllReduces. Understanding nvidia-smi topo -m before choosing a parallelism strategy is essential — even though in our case the simpler config won.
Benchmark, don't theorize. My topology analysis was correct in principle (cross-NUMA is 20× slower) but wrong in prediction (TP=4 still won). Second-order effects — quantized payload sizes, pipeline bubbles, NCCL transport optimization — dominated the first-order bandwidth difference. Always measure.
MoE sparsity is your best friend on limited hardware. A 120B model that only activates 5.1B parameters per token fits comfortably on 4× A100 40GB at 4-bit MXFP4 precision, with room for generous KV cache and concurrent serving. The "effective model size" for memory bandwidth purposes is much smaller than the headline number.
The KV cache connects everything. Memory budgets, parallelism choices, bandwidth bottlenecks, and even cloud API pricing all flow through KV cache mechanics. Understanding the compute-vs-memory tradeoff of caching attention state is the single most important mental model for LLM deployment.
Process cleanup on GPUs is harder than you think. CUDA contexts don't release on SIGTERM the way file handles do. Always kill the full process group and verify memory is free before launching a new model. fuser /dev/nvidia* is your best friend.
Start with what works, then optimize. Our TP=4 config ran stable in production for weeks. The topology optimization was an interesting experiment, but a working deployment matters more than a theoretically optimal one that crashes during NCCL initialization.
Written from the trenches of clinical AI infrastructure at an academic medical center. All benchmarks on real hardware, not marketing slides.