Harshit Rathod

#gpu-architecture

2 posts tagged with #gpu-architecture.

AI
Two Workloads in a Trench Coat: Why Prefill and Decode Change Everything

LLM inference is two opposite workloads: compute-bound prefill and memory-bandwidth-bound decode. Here's how continuous batching, PagedAttention, and disaggregation evolved to deal with both.

#llm-inference #prefill-decode #vllm #pageattention #continuous-batching #gpu-architecture #nvidia-dynamo #kv-cache #ai-engineering
Jun 12, 2026
AI
Serving the Machine: How LLM Inference Runs at Planetary Scale

From PagedAttention to GB200 racks, from token economics to MCP — a systems engineer's tour of how frontier models serve millions of users simultaneously.

#llm-inference #gpu-architecture #vllm #pageattention #continuous-batching #mcp #api-design #quantization #speculative-decoding #ai-engineering
Jun 11, 2026