From Tokens to Tensors: How a 744B Model Processes Five Words
Tracing "The capital of France is" through GLM-5's architecture — from tokenization to next-token prediction — then deploying a 120B MoE model on real hardware with all the topology gotchas nobody warns you about.
Rohan Singh · March 2026 · 22 min read
There's a gap between understanding a model architecture diagram and knowing what actually happens when tokens hit silicon. This post bridges that gap by tracing a single forward pass through GLM-5 (744B), then showing what it takes to deploy a model like this on real GPU hardware — including the topology gotchas that no diagram warns you about, and the KV cache mechanics that tie everything together.
Table of Contents
- The GLM-5 architecture at a glance
- Tracing the forward pass
- Matrix dimensions at every stage
- How GPUs actually run this
- Real deployment: 120B MoE on 4× A100 40GB
- When topology bites back
- The KV cache: the missing piece
- Prompt caching: from GPU memory to cloud APIs
- Lessons learned
1. The GLM-5 architecture at a glance
GLM-5 is a 744B-parameter Mixture-of-Experts transformer. The headline numbers: 78 transformer blocks, embedding dimension of 6,144, vocabulary of 155k tokens, and 256 experts per MoE layer — of which only 8 (+1 shared) are active per token. That sparsity is the key trick: the model has 744B total parameters but only ~40B are active per inference step.
A few architectural choices make it interesting:
Multi-head Latent Attention (MLA) compresses the KV cache through a low-rank bottleneck, dramatically reducing the memory footprint for long-context inference. Instead of storing full K/V tensors per head per layer, MLA projects to a compact latent and reconstructs on the fly.
DeepSeek Sparse Attention (DSA) lets some attention heads use sparse patterns instead of full quadratic attention, making long-context tractable without the O(n²) memory cost.
SwiGLU feedforward — the gated linear unit variant that multiplies a SiLU-activated path with a linear path — appears in both the dense FFN (blocks 1–3) and the MoE experts (blocks 4–78). The dense FFN uses a hidden size of 12,288; each expert uses a much smaller 2,048.
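To make the gating concrete, here's the SwiGLU computation as a minimal numpy sketch. Toy dimensions and illustrative weight names (`w_gate`, `w_up`, `w_down` are not GLM-5's actual parameter names):

```python
import numpy as np

def silu(x):
    # SiLU (a.k.a. swish): x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # Gated linear unit: a SiLU-activated path multiplied elementwise by a
    # linear path, then projected back down to the model dimension
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, d_hidden = 64, 128            # toy stand-ins for GLM-5's 6144 and 12288
x = rng.standard_normal((5, d_model))  # five token positions
w_gate = 0.02 * rng.standard_normal((d_model, d_hidden))
w_up   = 0.02 * rng.standard_normal((d_model, d_hidden))
w_down = 0.02 * rng.standard_normal((d_hidden, d_model))

out = swiglu_ffn(x, w_gate, w_up, w_down)
assert out.shape == (5, d_model)       # shape preserved: (5, d_model) in, (5, d_model) out
```

Note that the gate and up projections are two independent matrices of the same shape, which is why the hidden activations show up as "(5, 12288) ×2" in the dimension table later.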
2. Tracing the forward pass
Let's trace exactly what happens when "The capital of France is" enters the network.
Tokenization
The BPE tokenizer splits the string into roughly 5 token IDs — something like [1042, 5831, 203, 8927, 412], each an integer in [0, 155,000).
Embedding
Each token ID indexes into an embedding matrix of shape (155k × 6,144). Our 5 tokens become a matrix of shape (5 × 6,144) — five dense vectors, one per position. At this point, the representations are context-free: "France" has the same vector regardless of surrounding words.
The 78-layer transformer stack
Each block follows the same pattern: RMSNorm → Attention (with residual) → RMSNorm → Feedforward (with residual).
The attention mechanism is where context gets built. At position 4 ("is"), the attention scores tell the model how much to attend to each prior token. It attends heavily to "France" and "capital" to build a representation useful for predicting the next word. RoPE (Rotary Position Embeddings) encodes positional information via rotation matrices applied to queries and keys.
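RoPE's defining property, that attention scores depend only on relative position, can be checked in a few lines of numpy. This is a simplified single-vector sketch, not the batched per-head implementation a real engine uses:

```python
import numpy as np

def rope(x, pos, base=10_000):
    # Rotate consecutive dimension pairs of x by position-dependent angles
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# Same relative offset (2) in both cases, so the dot products match
d1 = rope(q, 3) @ rope(k, 1)
d2 = rope(q, 7) @ rope(k, 5)
assert np.isclose(d1, d2)
```

That invariance is why RoPE is applied to queries and keys (never values): the rotation cancels out of the score except for the position difference.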
Dense FFN vs MoE
The first 3 blocks use a dense SwiGLU FFN with hidden size 12,288 — every parameter participates for every token. The rationale: early layers learn fundamental, universally-needed representations.
Blocks 4–78 use MoE. For each token independently, a router selects 8 of 256 experts plus 1 shared expert. Each expert is a smaller SwiGLU with intermediate size 2,048. The outputs are weighted by their router probabilities and summed.
Why MoE matters: For our token "France" at block 50, the router might select experts specializing in named entities, geography, and European concepts. Meanwhile "is" routes to a different set tuned for syntax and copular verbs. The model learns specialized subnetworks without the compute cost of running all 744B parameters.
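The routing logic itself is small. Here it is as a numpy sketch with toy sizes; real routers also need load-balancing losses and expert capacity limits, omitted here:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def moe_layer(x, w_router, experts, shared, top_k=2):
    # x: (tokens, d_model); w_router: (d_model, n_experts)
    logits = x @ w_router
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                   # routing is per token
        top = np.argsort(logits[t])[-top_k:]      # indices of the top-k experts
        gates = softmax(logits[t, top])           # renormalize over selected experts
        out[t] = shared(x[t]) + sum(g * experts[e](x[t]) for g, e in zip(gates, top))
    return out

rng = np.random.default_rng(0)
d, n_experts = 16, 8                              # toy stand-ins for 6144 and 256
make_expert = lambda: (lambda v, W=0.1 * rng.standard_normal((d, d)): v @ W)
experts = [make_expert() for _ in range(n_experts)]

out = moe_layer(rng.standard_normal((5, d)), rng.standard_normal((d, n_experts)),
                experts, make_expert(), top_k=2)
assert out.shape == (5, d)
```

The key detail for deployment: `top` differs per token, which is exactly what forces the all-to-all token dispatch under expert parallelism later in this post.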
Output
After 78 blocks and a final RMSNorm, the hidden state at the last position is projected by a linear layer of shape (6,144 × 155,000) to produce logits over the vocabulary. Softmax yields the probability distribution — and "Paris" comes out on top.
3. Matrix dimensions at every stage
The tensor shape stays (5, 6,144) through almost the entire network. The interesting deviations tell you where the architecture does something clever:
| Stage | Shape | Notes |
|---|---|---|
| Raw text | string | "The capital of France is" |
| Token IDs | (5,) | BPE tokenizer → integer IDs |
| Token embeddings | (5, 6144) | Lookup in 155k × 6144 table |
| RMSNorm 1 | (5, 6144) | Normalize each row by its RMS |
| **MLA queries** | **(5, H, d_h)** | Per-head queries with RoPE |
| **KV latent** | **(5, d_c)** | Compressed bottleneck — K, V reconstructed from this |
| **Attention weights** | **(H, 5, 5)** | Row 4 ("is") attends to "France" and "capital" |
| Attention output + residual | (5, 6144) | x = x + Attn(RMSNorm(x)) |
| RMSNorm 2 | (5, 6144) | Normalize before feedforward |
| **Dense gate/up (blocks 1–3)** | **(5, 12288) ×2** | Two parallel linear maps |
| Dense FFN output + residual | (5, 6144) | SiLU(gate) ⊙ up → down project |
| **Router logits (blocks 4–78)** | **(5, 256)** | Score each of 256 experts |
| **Expert FFN (each of 8+1)** | **(5, 2048)** | 6144→2048→6144 SwiGLU (smaller) |
| MoE output + residual | (5, 6144) | Weighted sum of expert outputs |
| Final RMSNorm | (5, 6144) | After all 78 layers |
| **Logits** | **(5, 155000)** | Linear projection to vocabulary |
| **Next-token probs** | **(155000,)** | P("Paris") ≈ 0.92 |
The bolded rows are where the shape changes — each one marks a key architectural decision: KV compression for memory efficiency, the router's expert selection, the smaller expert hidden dimension, and the final explosion to vocabulary size.
4. How GPUs actually run this
GLM-5 has 744B parameters. At FP16, that's ~1.5 TB just for the weights. A single H100 has 80 GB of HBM3. You need at minimum ~19 H100s just to store the model — before any activations, KV cache, or optimizer state. In practice you'd use 48–256 GPUs.
The single GPU
An H100 has 132 streaming multiprocessors, each with 64 FP32 cores and 4 tensor cores specialized for matrix multiply. The memory hierarchy is the key constraint:
| Memory | Size | Bandwidth |
|---|---|---|
| HBM3 | 80 GB | 3.35 TB/s |
| L2 cache | 50 MB | ~12 TB/s |
| Registers + SRAM | ~26 MB | ~33 TB/s |
A single matmul like the output head (5, 6144) × (6144, 155k) is tiled into small blocks that fit in SRAM, computed by tensor cores, and written back. The entire forward pass is a carefully choreographed dance to keep data close to compute.
Tensor parallelism
For the dense FFN, the weight matrix W(6144 × 12288) is split column-wise across 8 GPUs. Each GPU holds a (6144 × 1536) shard and computes its portion of the matmul independently. The partial results are combined via AllReduce over NVLink (900 GB/s bidirectional). The matmul runs 8× faster, and no single GPU holds the full weight.
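The split can be simulated in numpy: column-split the first FFN matrix, row-split the second, and each "GPU" runs its slice end-to-end. The per-GPU outputs then combine with a plain elementwise sum, which is exactly the operation AllReduce performs. Toy sizes below (the real shards are 6144 × 1536), and a ReLU stands in for SwiGLU since any elementwise activation preserves the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
n_gpus, d_model, d_hidden = 8, 96, 192     # toy stand-ins for 8 GPUs, 6144, 12288
x  = rng.standard_normal((5, d_model))
W1 = 0.1 * rng.standard_normal((d_model, d_hidden))   # up projection
W2 = 0.1 * rng.standard_normal((d_hidden, d_model))   # down projection

# Column-split W1 and row-split W2: each "GPU" computes an independent slice
# with no communication until the final sum.
W1_shards = np.split(W1, n_gpus, axis=1)
W2_shards = np.split(W2, n_gpus, axis=0)
partials = [np.maximum(x @ w1, 0) @ w2 for w1, w2 in zip(W1_shards, W2_shards)]
y = sum(partials)                          # this sum is what AllReduce computes

assert np.allclose(y, np.maximum(x @ W1, 0) @ W2)     # matches the unsharded FFN
```

This pairing (column-parallel up, row-parallel down) is why a two-matmul FFN needs only one AllReduce per layer rather than two.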
Expert parallelism
For MoE layers, the 256 experts are spread across 32 GPUs — 8 experts per GPU. The router runs on every GPU and decides which experts each token needs. An all-to-all communication dispatches tokens to the GPUs hosting their chosen experts, those GPUs compute the small SwiGLU FFN, and a second all-to-all returns results.
Pipeline parallelism
The 78 layers are divided across 6 pipeline stages, each on a separate 8-GPU node. Data flows stage-to-stage over the network. To hide latency, the input batch is split into micro-batches that overlap — while stage 1 processes micro-batch 2, stage 2 is already working on micro-batch 1.
The full picture
6 nodes × 8 H100s = 48 GPUs. Pipeline parallelism (PP=6) across nodes, tensor parallelism (TP=8) within each node, expert parallelism (EP=32) for the MoE layers. Total memory: 3.84 TB — enough for 1.5 TB weights plus KV cache, activations, and overhead.
The real bottleneck: memory bandwidth
An H100 can do 989 TFLOPS at FP16, but its memory bandwidth is "only" 3.35 TB/s. For the output head matmul, the weight matrix is ~1.9 GB. Reading it from HBM takes ~0.57 ms, while the arithmetic is trivially fast on tensor cores. This operation is memory-bound — the GPU spends most of its time waiting for data.
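The arithmetic behind that 0.57 ms figure, assuming peak rates on both units:

```python
# Roofline check for the output-head matmul (5, 6144) x (6144, 155000) at FP16
flops        = 2 * 5 * 6144 * 155_000    # two FLOPs per multiply-accumulate
weight_bytes = 6144 * 155_000 * 2        # ~1.9 GB of weights read from HBM

compute_s = flops / 989e12               # H100 FP16 tensor-core peak: 989 TFLOPS
memory_s  = weight_bytes / 3.35e12       # HBM3 bandwidth: 3.35 TB/s

print(f"compute {compute_s * 1e6:.0f} us vs memory {memory_s * 1e6:.0f} us")
assert memory_s > 50 * compute_s         # memory-bound by more than 50x
```

With only 5 tokens in flight, every weight byte read from HBM is used for just 10 multiply-accumulates. Batching more tokens amortizes the same weight read across more arithmetic, which is why prefill throughput is so much higher than decode throughput.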
Three things in GLM-5 are specifically GPU-friendly:
- MLA shrinks the KV cache reads by 4–8× per attention step
- MoE sparsity means only ~453 MB of weight reads per token per layer instead of the full dense matrix
- FlashAttention fuses the entire attention computation into a single kernel that tiles in SRAM, turning an O(n²) memory operation into O(n)
5. Real deployment: 120B MoE on 4× A100 40GB
Theory is nice. Let's talk about deploying a model on real hardware — specifically GPT-OSS 120B on 4× A100 40GB GPUs with 160 GB total memory.
GPT-OSS 120B is architecturally similar to GLM-5 but smaller: 36 layers, embedding dimension of 2,880, 128 experts per layer with 4 active per token, and only 5.1B parameters active per inference step. It uses Grouped Query Attention (GQA) with 64 heads instead of MLA.
Memory budget
| Precision | Weight Memory | Total w/ KV + overhead | Fits? |
|---|---|---|---|
| FP16 (2B/param) | 240 GB | ~264 GB | ❌ Won't fit |
| INT8 (1B/param) | 120 GB | ~144 GB | ⚠️ Tight for long context |
| MXFP4 (0.5B/param) | 60 GB | ~84 GB | ✅ Comfortable, 76 GB headroom |
The model ships with native MXFP4 (microscaling FP4) quantization baked into the checkpoint — no post-hoc quantization needed. At 60 GB, that splits cleanly across 4 GPUs at 15 GB each, leaving ~25 GB per GPU for KV cache and activations.
The vLLM configuration
vllm serve openai/gpt-oss-120b \
--tensor-parallel-size 4 \
--quantization mxfp4 \
--dtype bfloat16 \
--max-model-len 16384 \
--gpu-memory-utilization 0.90
The 64 GQA heads divide cleanly across 4 GPUs (16 each). The 128 MoE experts also divide evenly (32 each). No awkward remainders.
Performance numbers
From our benchmarks on this hardware:
- Prefill throughput: up to 12,206 tok/s (peak, single request)
- Generation throughput: ~148 tok/s (single request, steady state)
- Batched generation: ~563 tok/s aggregate (8 concurrent requests)
- Prefix cache hit rate: 96.7%
6. When topology bites back
Running nvidia-smi topo -m on our 4× A100 box reveals a critical detail:
GPU0 GPU1 GPU2 GPU3
GPU0 X NV12 SYS SYS
GPU1 NV12 X SYS SYS
GPU2 SYS SYS X NV12
GPU3 SYS SYS NV12 X
This is a 2+2 split topology:
- GPU 0 ↔ GPU 1: NV12 (600 GB/s) — 12 NVLinks, blazing fast
- GPU 2 ↔ GPU 3: NV12 (600 GB/s) — same
- Cross-pair (0↔2, 0↔3, 1↔2, 1↔3): SYS (~25 GB/s) — PCIe + QPI, 20× slower
With --tensor-parallel-size 4, every layer does an AllReduce across all 4 GPUs. NCCL builds a ring where two of four hops are SYS links. That's 144 cross-NUMA transfers per forward pass (36 layers × 2 AllReduces per layer × 2 slow hops).
The hypothesis: PP=2 × TP=2
The idea: split layers into two pipeline stages, one per NVLink pair. Stage 1 (layers 1–18) runs on GPUs 0–1 with TP=2 over NV12. Stage 2 (layers 19–36) runs on GPUs 2–3 with TP=2 over NV12. The only cross-NUMA traffic is the activation tensor handoff between stages — 1 × 1 × 2880 × 2 bytes = 5.6 KB per decode step, essentially free.
vllm serve openai/gpt-oss-120b \
--tensor-parallel-size 2 \
--pipeline-parallel-size 2 \
--quantization mxfp4 \
--dtype bfloat16 \
--max-model-len 16384 \
--gpu-memory-utilization 0.90
| | TP=4 | PP=2 × TP=2 |
|---|---|---|
| Cross-NUMA transfers per forward pass | 144 | 1 |
| AllReduce link speed | Mix of NV12 + SYS | NV12 only |
The result: theory was wrong
After building a benchmark harness, cleaning up stale CUDA contexts (more on that below), and running both configs:
| Metric | TP=4 | PP=2 × TP=2 | Winner |
|---|---|---|---|
| Generation tok/s (single request) | ~148 | ~124 | TP=4 |
| Prompt tok/s (peak) | ~12,206 | ~10,159 | TP=4 |
| Batched generation (8 concurrent) | ~563 | ~696 (ramping) | Mixed |
TP=4 wins by ~16% on single-request decode. My prediction was wrong for three reasons:
- **AllReduce payloads are tiny at decode time.** What crosses the slow links is the activation tensor, not the MXFP4 weights — at batch size 1 that's roughly 2,880 bf16 values per hop, under 6 KB. A small payload over a slow link is still fast in absolute terms.
- **The pipeline bubble is real.** With PP=2, stage 2 idles while stage 1 processes, and vice versa. For single-request decode, this serialization penalty is a ~2× utilization hit that NVLink-only communication can't overcome.
- **NCCL is smart.** It likely uses shared memory or socket-based transfer for cross-NUMA hops, which is slower than NVLink but not catastrophic for small payloads.
Lesson: benchmark, don't theorize. The topology analysis was correct in principle — cross-NUMA is 20× slower. But the second-order effects (quantized payload sizes, pipeline bubbles, NCCL transport selection) dominated the first-order effect.
The process cleanup gotcha
When switching between configs, we hit NCCL initialization errors — not because PP=2×TP=2 was incompatible, but because the previous server's GPU workers didn't release their CUDA contexts.
vLLM with TP=4 spawns a process tree:
APIServer (parent)
└─ EngineCore
├─ WorkerProc rank=0 ← holds GPU 0 CUDA context
├─ WorkerProc rank=1 ← holds GPU 1 CUDA context
├─ WorkerProc rank=2 ← holds GPU 2 CUDA context
└─ WorkerProc rank=3 ← holds GPU 3 CUDA context
Killing the parent with SIGTERM doesn't guarantee the GPU workers release memory. The fix:
# Kill the entire process group, not just the parent
kill -- -$(ps -o pgid= -p $PID | tr -d ' ')
# Kill any straggler processes still holding the device files
fuser -k /dev/nvidia*
nvidia-smi --query-gpu=index,memory.used --format=csv,noheader
# Should show <1 GB per GPU
7. The KV cache: the missing piece
Everything above — the parallelism strategies, the memory budgets, the bandwidth bottlenecks — connects through one mechanism that often gets hand-waved: the KV cache.
The problem
When generating "Paris" after "The capital of France is", the model doesn't just process "is" — attention requires every token to look at every previous token. Without caching:
- To generate token 6, compute attention over tokens 1–5
- To generate token 7, compute attention over tokens 1–6
- To generate token 8, compute attention over tokens 1–7
Each step recomputes K and V projections for the entire sequence. For a 2,000-token clinical note generating a 500-token summary, that's:
2,001 + 2,002 + ... + 2,500 ≈ 1.1 million KV computations per layer
Most of those are redundant — token 1's K and V never change. Total cost: O(n²).
The solution
The KV cache stores the K and V projections from every previous token at every layer. Each new token computes only its own Q, K, V, then looks up everything else from the cache:
Without cache: O(n²) compute, O(1) memory
With cache: O(n) compute, O(n) memory
The tradeoff is compute for memory. You store growing tensors in GPU HBM to avoid recomputing them.
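Here's that bookkeeping as a single-head toy decode loop (hypothetical weight names, numpy for clarity; real engines do this per head, per layer, in fused kernels):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                     # toy hidden size
Wq, Wk, Wv = (0.1 * rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    # Single-head scaled dot-product attention over all cached positions
    scores = (K @ q) / np.sqrt(len(q))
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

K_cache, V_cache = [], []
for step in range(6):                      # one decode step per generated token
    x = rng.standard_normal(d)             # hidden state of the newest token
    q = x @ Wq
    K_cache.append(x @ Wk)                 # only the NEW token's K/V are computed;
    V_cache.append(x @ Wv)                 # every earlier entry is reused as-is
    out = attend(q, np.array(K_cache), np.array(V_cache))

assert len(K_cache) == 6                   # the cache grows one row per token
```

Each iteration does O(1) projection work plus an O(n) attention read, instead of reprojecting all n tokens, which is the O(n²) → O(n) trade described above.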
What gets stored
For GPT-OSS 120B with GQA (8 KV heads, head dim 45):
Per token, per layer: 2 × (8 heads × 45 dims) × 2 bytes = 1,440 bytes
Per token, all layers: 36 × 1,440 = 51,840 bytes ≈ 50 KB
Full cache (8k context): 8,192 × 50 KB ≈ 405 MB per sequence
At 16k context, that's ~810 MB per sequence. With ~25 GB of free memory per GPU after weights, you can cache roughly 30 concurrent 16k-length sequences — or 250+ shorter ones.
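The same sizing as a small reusable function (binary megabytes):

```python
def kv_cache_bytes(seq_len, layers=36, kv_heads=8, head_dim=45, bytes_per=2):
    # K and V per head, per layer, per token (FP16 = 2 bytes per value)
    per_token = layers * 2 * kv_heads * head_dim * bytes_per
    return seq_len * per_token

assert kv_cache_bytes(1) == 51_840                               # ~50 KB per token
print(f"{kv_cache_bytes(8_192)  / 2**20:.0f} MB at 8k context")  # ~405 MB
print(f"{kv_cache_bytes(16_384) / 2**20:.0f} MB at 16k context") # ~810 MB
```

Plugging in GLM-5's MLA parameters instead (compressed latent per token rather than 8 full K/V heads) is what produces the roughly 32× smaller cache shown in the next table.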
How GQA and MLA shrink the cache
Standard multi-head attention with 64 heads stores 64 separate K and V tensors per layer — that's the full-size cache. Two techniques compress this dramatically:
| Method | KV Heads Stored | Cache Size (relative) | Used By |
|---|---|---|---|
| Standard MHA | 64 K + 64 V | 1× (baseline) | Older models |
| GQA | 8 K + 8 V | ~0.125× (8× smaller) | GPT-OSS 120B |
| MLA | Compressed latent only | ~0.03× (~32× smaller) | GLM-5 |
GQA (Grouped Query Attention) shares K/V heads across groups of query heads. With 64 query heads and 8 KV heads, each KV head serves 8 query heads. The cache shrinks by 8×.
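In tensor terms, the sharing looks like this numpy sketch: only 8 K (and V) heads are ever cached, and each serves a group of 8 query heads. Real kernels index the group directly rather than materializing the repeat:

```python
import numpy as np

n_q_heads, n_kv_heads, head_dim, seq = 64, 8, 45, 5
group = n_q_heads // n_kv_heads            # 8 query heads share each KV head

rng = np.random.default_rng(0)
Q = rng.standard_normal((n_q_heads, seq, head_dim))
K = rng.standard_normal((n_kv_heads, seq, head_dim))   # only 8 K heads cached

# Expand the 8 shared K heads to line up with the 64 query heads, then score
K_shared = np.repeat(K, group, axis=0)                 # (64, seq, head_dim)
scores = np.einsum('hqd,hkd->hqk', Q, K_shared)
assert scores.shape == (n_q_heads, seq, seq)
```

The query side keeps its full 64 heads of expressiveness; only the cached state shrinks, which is why GQA costs so little quality for an 8× memory win.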
MLA (Multi-head Latent Attention) goes further: it stores only a compressed latent vector per token and reconstructs K and V on the fly during attention. This is how GLM-5 (744B) can serve 128k context — its KV cache is roughly the same size as GPT-OSS's despite the model being 6× larger.
Memory over the lifecycle of a request
Load Prefill Decode Done
GPU Memory ┃ ┃ ┃ ┃
┃ ┃ ┃ ┃
40 GB ─────────┃───────┃────────────────────┃────────────────────┃──
┃ ┃ ┃ ╱╱ KV cache ┃
30 GB ─────────┃───────┃────────────────────┃───╱╱──(released)───┃──
┃ ┃ KV cache ┃╱╱ ┃
20 GB ─────────┃───────┃──────────growing──╱┃────────────────────┃──
┃ Framework + activations ┃ ┃
15 GB ─────────┃━━━━━━━━━━━━━━━━━━━━━━━━━━━┃━━━━━━━━━━━━━━━━━━━┃──
┃ ┃ ┃
┃ Model weights (MXFP4) ┃ (constant) ┃
0 GB ─────────┃━━━━━━━━━━━━━━━━━━━━━━━━━━━┃━━━━━━━━━━━━━━━━━━━┃──
Model weights are constant — loaded once at startup. The KV cache starts empty and grows linearly with each generated token. During prefill, the entire prompt's KV is computed at once. During decode, one row is appended per token per layer. When generation finishes, the cache is released.
vLLM's PagedAttention manages this memory in fixed-size pages — like a virtual memory allocator for attention state. Pages can be allocated and freed without fragmentation, which is what enables efficient concurrent serving.
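A toy allocator captures the core idea (a hypothetical simplification; vLLM's real block manager also handles prefix sharing and copy-on-write):

```python
class PagePool:
    """Toy PagedAttention-style allocator: KV memory as fixed-size pages."""
    PAGE_TOKENS = 16

    def __init__(self, n_pages):
        self.free = list(range(n_pages))
        self.pages = {}                    # request id -> list of page ids

    def grow(self, req, total_tokens):
        # Allocate just enough pages to hold total_tokens KV entries
        need = -(-total_tokens // self.PAGE_TOKENS)    # ceiling division
        held = self.pages.setdefault(req, [])
        while len(held) < need:
            held.append(self.free.pop())

    def release(self, req):
        # Whole pages go straight back to the free list: no fragmentation
        self.free.extend(self.pages.pop(req))

pool = PagePool(8)
pool.grow("req-1", 40)                     # 40 tokens -> ceil(40/16) = 3 pages
assert len(pool.pages["req-1"]) == 3 and len(pool.free) == 5
pool.release("req-1")
assert len(pool.free) == 8                 # all pages reusable immediately
```

Because sequences hold page id lists rather than contiguous buffers, a finished request's memory is instantly reusable by any new request, regardless of length.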
How KV cache explains the benchmark results
This actually explains why TP=4 beat PP=2×TP=2. With TP=4, each GPU holds 1/4 of the KV cache (16 of 64 GQA heads). During decode, each GPU reads only its own KV shard — a purely local memory read, no cross-GPU communication needed.
With PP=2×TP=2, the pipeline boundary means stage 1's KV cache (layers 1–18) lives entirely on GPUs 0–1, and stage 2's cache (layers 19–36) lives on GPUs 2–3. Each decode step is serialized: stage_1_time + transfer + stage_2_time. The KV cache access pattern (local reads vs pipeline serial) is what makes TP=4 win on per-token latency.
8. Prompt caching: from GPU memory to cloud APIs
The same KV cache mechanism that runs inside vLLM also powers prompt caching in cloud APIs like Amazon Bedrock. Understanding the connection reveals how a low-level GPU optimization becomes a product feature.
vLLM prefix caching (what our deployment uses)
vLLM automatically detects when two requests share the same token prefix and reuses the cached KV entries. Our benchmark logs showed:
Prefix cache hit rate: 96.7%
No API changes needed. The cache lives in GPU memory and persists as long as the server runs. When multiple clinicians send the same system prompt + protocol context, the KV entries from the first request get reused by all subsequent ones.
Bedrock prompt caching (the API-level equivalent)
Bedrock's prompt caching does the same thing, exposed as an API feature. When you mark a portion of your prompt with a cachePoint, Bedrock captures the model's internal neural state at that position — which is the KV cache tensors at every layer for those tokens — and stores it in an ephemeral cache.
The key practical differences:
| | vLLM Prefix Cache | Bedrock Prompt Cache |
|---|---|---|
| Activation | Automatic | Explicit (cachePoint markers) |
| TTL | Until evicted by memory pressure | 5 min default, 1 hour optional |
| Persistence | GPU memory (server lifetime) | AWS-managed infrastructure |
| Pricing | Self-hosted (your GPU costs) | Cache reads ~10% of input token cost |
| Cache writes | Free (already computed) | May cost more than standard input |
Clinical AI implications
For agentic pipelines on Bedrock, the high-value caching targets:
System prompt + tool definitions. These are identical across every user request. On a 2k-token system prompt hitting 100 queries in 5 minutes, you pay for 1 cache write + 99 cache reads instead of 100 full prefills — roughly 90% cost reduction on input tokens for that portion.
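The arithmetic, with illustrative rates: cache reads at ~10% of standard input-token cost and writes at ~1.25×. These multipliers are assumptions for the sketch, so check current Bedrock pricing for your model:

```python
tokens, hits = 2_000, 100     # 2k-token system prompt, 100 requests inside the TTL

baseline = hits * tokens                              # every request pays full prefill
cached = 1.25 * tokens + (hits - 1) * 0.10 * tokens   # 1 cache write + 99 cache reads

print(f"input-token cost reduction: {1 - cached / baseline:.0%}")   # roughly 89%
```

The reduction converges toward the read discount as hit counts grow, so the win is largest for hot, frequently reused prefixes.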
Document context in RAG. When a clinician uploads a 10k-token clinical note and asks 5 questions about it, the note gets cached on the first question. Questions 2–5 skip prefill for those 10k tokens entirely, making time-to-first-token dramatically faster.
Multi-turn conversations. With Bedrock's conversation history caching, the growing context of a clinical discussion doesn't need to be reprocessed from scratch on each turn. The KV state up to the previous turn is cached, and only the new user message triggers fresh computation.
For self-hosted vLLM deployments (like our GPT-OSS 120B setup), prefix caching gives you the same benefits automatically — no code changes required, just the nature of serving multiple requests with shared context through the same inference engine.
9. Lessons learned
Architecture determines hardware requirements, not just parameter count. A 744B MoE model with 40B active parameters has a fundamentally different deployment profile than a 40B dense model — less compute per token, but the weight storage and expert routing add complexity.
GPU topology matters more than total FLOPS. Our 4× A100 box has impressive aggregate compute, but the 2+2 NVLink split means naive TP=4 could waste cycles on slow cross-NUMA AllReduces. Understanding nvidia-smi topo -m before choosing a parallelism strategy is essential — even though in our case the simpler config won.
Benchmark, don't theorize. My topology analysis was correct in principle (cross-NUMA is 20× slower) but wrong in prediction (TP=4 still won). Second-order effects — quantized payload sizes, pipeline bubbles, NCCL transport optimization — dominated the first-order bandwidth difference. Always measure.
MoE sparsity is your best friend on limited hardware. A 120B model that only activates 5.1B parameters per token fits comfortably on 4× A100 40GB at 4-bit MXFP4 precision, with room for generous KV cache and concurrent serving. The "effective model size" for memory bandwidth purposes is much smaller than the headline number.
The KV cache connects everything. Memory budgets, parallelism choices, bandwidth bottlenecks, and even cloud API pricing all flow through KV cache mechanics. Understanding the compute-vs-memory tradeoff of caching attention state is the single most important mental model for LLM deployment.
Process cleanup on GPUs is harder than you think. CUDA contexts don't release on SIGTERM the way file handles do. Always kill the full process group and verify memory is free before launching a new model. fuser /dev/nvidia* is your best friend.
Start with what works, then optimize. Our TP=4 config ran stable in production for weeks. The topology optimization was an interesting experiment, but a working deployment matters more than a theoretically optimal one that crashes during NCCL initialization.
Written from the trenches of clinical AI infrastructure at an academic medical center. All benchmarks on real hardware, not marketing slides.