Demystifying LLM Architecture 🔗
“Oh, there is a brain all right. It’s just that the brain is made out of meatmath!”
“So… what does the thinking?”
“You’re not understanding, are you? The brain does the thinking. The meatmath.”
“Thinking meatmath! You’re asking me to believe in thinking meatmath!”
Aka: the documentation I wanted but couldn’t find in one place anywhere, and that I’ll keep wanting as a reference for myself going forward.
What if you just want to write code using the tools, and not become a PhD in theoretical math? Or you want to write programmer-style code, not data-scientist-style code (which are about as different from each other as Egyptian hieroglyphs are from Latin, frankly, despite both being “language” in the broad sense)?
So I’ve written a sort of math-nerd-to-tech-nerd translation cheat-sheet for decoding the PhD-speak into something hopefully usable - with some actual comprehension, not just copypasta.
Eg: a “continuous n-dimensional vector space” is really just a numeric array (usually floats, often stacked into a 2d array) whose values vary smoothly rather than jumping discretely (like integers or categories) - once you strip off the domain language and semantic constructs and rephrase it in more direct, application/programmer-friendly terms.
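For instance, here’s that same idea in plain numpy. The words, numbers, and tiny dimensions below are made up purely for illustration, but the shape of the thing is exactly this:

```python
import numpy as np

# A "continuous 3-dimensional vector space" for 4 words: just a 4x3 float array.
# Rows are words, columns are the (made-up) dimensions of meaning.
embeddings = np.array([
    [0.12, -0.83,  0.45],   # "cat"
    [0.10, -0.79,  0.52],   # "dog"  (close to "cat" -> similar meaning)
    [-0.91, 0.22,  0.03],   # "bank"
    [0.40,  0.61, -0.77],   # "run"
], dtype=np.float32)

print(embeddings.shape)  # (4, 3) -> [vocab_size x embedding_dim]
```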
Also here’s a video/visual guide by Grant Sanderson aka 3B1B, to go along with my ramblings.
What is Machine Learning? 🔗
Programming and software all boil down to some form of A->B, or input -> (magic) -> output. Sometimes the data goes into a db, or to a frontend/website, or is passed on via API, etc., but this fundamental pattern underlies it all. AI is essentially giving the computer some space (matrices, tensors) and some math functions, and saying “you figure it out, by brute-forcing it till you get it right” - then saving the version that gets it rightest, to reuse from then on.
ML is, by formal definition, mapping inputs to outputs by tuning a system of numerical weights. Inputs (like words or images) are first converted into numeric encodings, usually dense vectors called embeddings.
These vectors flow through multiple layers, each performing matrix multiplications and nonlinear functions (often ReLU or GELU). Each matrix multiplies the input by a set of learnable weights, generating intermediate representations. For text, these matrices typically have shapes like [vocabulary_size × embedding_dim] for embeddings, or [embedding_dim × embedding_dim] for hidden layers.
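Here’s a stripped-down sketch of that flow. The sizes, names, and random “weights” are all made up for illustration, not taken from any particular model:

```python
import numpy as np

vocab_size, embedding_dim, seq_len = 1000, 64, 8
rng = np.random.default_rng(0)

# Learnable weights (here just random): the matrices described above.
embedding = rng.normal(size=(vocab_size, embedding_dim)).astype(np.float32)    # [vocab x dim]
hidden_w  = rng.normal(size=(embedding_dim, embedding_dim)).astype(np.float32)  # [dim x dim]

token_ids = rng.integers(0, vocab_size, size=seq_len)   # pretend these came from a tokenizer
x = embedding[token_ids]                                 # [seq_len x embedding_dim] lookup
h = np.maximum(0, x @ hidden_w)                          # matrix multiply + ReLU nonlinearity
print(h.shape)  # (8, 64)
```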
During training, outputs are compared to known targets (labels) via a loss function (like cross-entropy), and the gradients (the partial derivatives of the loss with respect to each weight - i.e. which way, and how much, each number should move) flow backward through the network (backpropagation), slightly adjusting each matrix’s values in the direction opposite to the gradient, gently nudging towards better accuracy without overshooting. Besides dot products, the math operations involved include addition, scaling, normalization, and nonlinear activation functions.
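And a minimal sketch of that loop in PyTorch, with a made-up toy classifier and random data standing in for a real model and dataset:

```python
import torch
import torch.nn as nn

# Toy setup: classify 10-dim inputs into 3 labels. Sizes and data are invented.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(64, 10)            # batch of 64 examples
targets = torch.randint(0, 3, (64,))    # known labels

for step in range(100):
    logits = model(inputs)              # forward pass
    loss = loss_fn(logits, targets)     # compare outputs to targets
    optimizer.zero_grad()
    loss.backward()                     # backpropagation: gradients flow backward
    optimizer.step()                    # nudge weights opposite to the gradient
```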
Getting high quality, accurately labeled data can be one of the biggest challenges for most applications of ML. As the old saying goes “garbage in, garbage out”, and it’s definitely apt here.
It kind of boils down to algebra, abstractly. You keep throwing input * x = output at it until it learns to solve for x, then you let it produce outputs on its own, and x is basically math-soup that we can’t really discern wtf it’s doing, other than how well it works or not. If it seems like opaque, black magic, that’s because it really is, you’re not missing something - that’s just how it works.
How are transformer models (LLMs) different? 🔗
Transformers are a neural network architecture built around attention mechanisms instead of sequential or convolutional patterns. The core innovation, introduced in the paper “Attention Is All You Need”, lets the model dynamically weigh input tokens differently based on context.
The embedding matrix is shaped [vocab_size × embedding_dim], where each token in the vocabulary corresponds to a row and each column encodes a different semantic aspect (a dimension of meaning), and the training process magically assigns these dimensions based on how words are used in the training text.
These vectors represent points in an abstract “meaning-space,” where the relative positions and directions between vectors encode relationships and semantic nuances. This is how the model can calculate similarity (cosine similarity) based on the angles between vectors, indicating semantic closeness. It’s how we get fun like “King - Man + Woman = Queen” of word2vec fame.
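A toy illustration of that arithmetic. These vectors are hand-made so the analogy works out exactly; real word2vec vectors are learned, and the result only lands close to “queen”:

```python
import numpy as np

def cosine_similarity(a, b):
    # Angle-based similarity: 1.0 = same direction, 0 = unrelated, -1 = opposite.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dim "embeddings" (not trained), constructed so the analogy holds.
king  = np.array([0.9, 0.8, 0.1, 0.3])
man   = np.array([0.1, 0.9, 0.0, 0.2])
woman = np.array([0.1, 0.1, 0.9, 0.2])
queen = np.array([0.9, 0.0, 1.0, 0.3])

analogy = king - man + woman
print(cosine_similarity(analogy, queen))  # ~1.0: same direction in meaning-space
```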
The model takes each token of the prompt (aka the context window), does the magical math-soup over however many layers for each of those tokens, and produces a probability distribution over likely next tokens. One of those is then selected (based on temperature, sampling strategy, etc.), and then the process repeats - for every token in the response.
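In code, that generation loop looks roughly like this. Note that `model` is a hypothetical stand-in assumed to map [1 × seq_len] token ids to [1 × seq_len × vocab] logits - it’s not any particular library’s API:

```python
import torch

def generate(model, token_ids, max_new_tokens=20, temperature=0.8):
    """Sketch of autoregressive sampling: one full forward pass per generated token."""
    for _ in range(max_new_tokens):
        logits = model(token_ids)                           # [1 x seq_len x vocab]
        next_logits = logits[:, -1, :] / temperature        # only the last position matters
        probs = torch.softmax(next_logits, dim=-1)          # distribution over the vocab
        next_id = torch.multinomial(probs, num_samples=1)   # sample one token
        token_ids = torch.cat([token_ids, next_id], dim=1)  # append and repeat
    return token_ids

# Fake "model" for demonstration: random logits over a 50-token vocabulary.
fake_model = lambda ids: torch.randn(ids.shape[0], ids.shape[1], 50)
print(generate(fake_model, torch.zeros(1, 1, dtype=torch.long)).shape)  # torch.Size([1, 21])
```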
Attention is computed as a softmax-normalized, scaled dot product between query and key matrices (Q × Kᵀ), giving weights that determine how much each token’s value influences each output position (weights × Values). Transformers stack multiple attention heads in parallel, followed by dense feed-forward layers ([embedding_dim × 4·embedding_dim], then back down) to produce rich contextual encodings.
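A minimal single-head version of that computation (including the usual scaling by √head_dim), with random stand-in matrices and made-up sizes:

```python
import torch

def attention(Q, K, V):
    # Q, K, V: [seq_len x head_dim]. Weights say how much each token attends to each other token.
    d_k = K.shape[-1]
    scores = Q @ K.T / d_k**0.5              # [seq_len x seq_len] dot products, scaled
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ V                       # weighted mix of Values -> [seq_len x head_dim]

Q, K, V = torch.randn(8, 16), torch.randn(8, 16), torch.randn(8, 16)
print(attention(Q, K, V).shape)  # torch.Size([8, 16])
```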
Backpropagation and differentiation work as usual - computing gradients from the loss - but because attention is parallelized rather than sequential, it’s computationally efficient and makes good use of GPUs.
Technically, each input is encoded into three matrices called Queries, Keys, and Values (all usually [sequence_length × embedding_dim] - that is, one embedding-sized row/vector per token in the sequence).
More precisely, each token in the input sequence is first converted into a vector via the embedding matrix [vocab_size × embedding_dim], resulting in a [sequence_length × embedding_dim] input. Then, for every attention head, the model applies three learned linear projections - $W_q$, $W_k$, and $W_v$ - to generate Queries, Keys, and Values, each shaped [sequence_length × head_dim]. Multiply that by the number of heads (often 8, 12, or more), and you start to see how each token gets exploded into multiple representations for parallel comparison.
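A sketch of those projections and the resulting shapes, again with made-up sizes:

```python
import torch
import torch.nn as nn

seq_len, embedding_dim, num_heads = 8, 64, 4
head_dim = embedding_dim // num_heads                 # 16 per head

x = torch.randn(seq_len, embedding_dim)               # embedded input tokens

# One learned projection each for Q, K, V (covering all heads at once, as is typical).
W_q = nn.Linear(embedding_dim, embedding_dim, bias=False)
W_k = nn.Linear(embedding_dim, embedding_dim, bias=False)
W_v = nn.Linear(embedding_dim, embedding_dim, bias=False)

# Project, then split the embedding dimension into heads: [num_heads x seq_len x head_dim]
Q = W_q(x).view(seq_len, num_heads, head_dim).transpose(0, 1)
K = W_k(x).view(seq_len, num_heads, head_dim).transpose(0, 1)
V = W_v(x).view(seq_len, num_heads, head_dim).transpose(0, 1)
print(Q.shape)  # torch.Size([4, 8, 16]) -> each token exploded into 4 parallel representations
```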
This is where most of the math-soup lives: lots of matrix multiplies, softmax weighting, and dot products - just to figure out which other tokens each word should “pay attention” to… and why it eats gigs of ram to churn out a few sentences.
Sequential Patterns vs. Attention: Clarifying the Mechanism 🔗
In earlier models (like RNNs), sequential context meant words were processed one-by-one, carrying state forward. N-grams (like bigrams or trigrams) were historically discrete language models without embeddings. Modern neural language models (like transformers) differ significantly: all tokens are simultaneously embedded into numeric vectors, then attention computes weighted relevance scores dynamically.
Attention isn’t limited to fixed-length sequences (e.g., trigrams); instead, it dynamically integrates context across the entire input sequence, using learned weights.
There is no big secret here either: they’re just more matrices, and the training feedback nudges them to become useful in the desired ways - or at least we presume it does, since the outputs improve.
Attention matrices ($Q, K, V$) are computed from the same original input embeddings, but each undergoes its own linear transformation ($W_q, W_k, W_v$). These learned transformations specialize each matrix differently:
Queries ($Q$) represent what each token is “looking for.”
Keys ($K$) represent what information each token “offers.”
Values ($V$) represent the content actually used in output.
Thus, the attention weights ($softmax(Q × Kᵀ)$) dynamically determine how tokens interact. These matrices typically have shapes like [sequence_length × embedding_dim], transformed from inputs by learned [embedding_dim × embedding_dim] matrices.
Each transformer layer typically includes a multi-headed self-attention step (sequence_length × embedding_dim), followed by a feed-forward step (embedding_dim × 4 × embedding_dim, then back to embedding_dim).
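Putting those pieces together, here’s a simplified sketch of one such layer built from PyTorch’s stock modules. Real implementations differ in details (masking, dropout, pre- vs. post-normalization), and the sizes are made up:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Simplified sketch of one layer: attention -> feed-forward, each with norm + residual."""
    def __init__(self, embedding_dim=64, num_heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(embedding_dim)
        self.attn = nn.MultiheadAttention(embedding_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embedding_dim)
        self.ff = nn.Sequential(
            nn.Linear(embedding_dim, 4 * embedding_dim),  # expand to 4x embedding_dim...
            nn.GELU(),
            nn.Linear(4 * embedding_dim, embedding_dim),  # ...and back down
        )

    def forward(self, x):                       # x: [batch x seq_len x embedding_dim]
        h = self.norm1(x)
        a, _ = self.attn(h, h, h)               # multi-headed self-attention
        x = x + a                               # residual connection
        x = x + self.ff(self.norm2(x))          # residual connection
        return x

block = TransformerBlock()
print(block(torch.randn(1, 8, 64)).shape)  # torch.Size([1, 8, 64])
```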
Notable Differences Between Transformers and Traditional Neural Networks 🔗
Beyond attention, transformers notably differ in several ways:
- Parallel vs Sequential Processing: Transformers process entire sequences simultaneously via attention, unlike recurrent (RNN/LSTM) models, eliminating serial bottlenecks.
- Positional Encoding: Transformers lack built-in sequence-awareness, so positional embeddings explicitly encode token order (see the sketch after this list).
- Feed-Forward Layer Design: Transformers contain large intermediate feed-forward layers ([embedding_dim × 4×embedding_dim]), providing substantial nonlinearity and expressive power.
- Layer Normalization: Transformers extensively use layer normalization to stabilize activations, compared to the batch normalization common in convolutional networks.
These innovations collectively distinguish transformers from older architectures, providing better scalability, richer context modeling, and more stable training dynamics.
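As a concrete example of the positional-encoding point above, here’s the classic fixed sinusoidal scheme - one common option; plenty of models learn their positional embeddings instead:

```python
import numpy as np

def sinusoidal_positions(seq_len, embedding_dim):
    """Fixed positional encoding (assumes an even embedding_dim for simplicity)."""
    positions = np.arange(seq_len)[:, None]                 # [seq_len x 1]
    dims = np.arange(0, embedding_dim, 2)[None, :]          # even dimension indices
    angles = positions / (10000 ** (dims / embedding_dim))
    enc = np.zeros((seq_len, embedding_dim), dtype=np.float32)
    enc[:, 0::2] = np.sin(angles)    # even dims get sine
    enc[:, 1::2] = np.cos(angles)    # odd dims get cosine
    return enc                       # added to token embeddings so order is visible to the model

print(sinusoidal_positions(8, 64).shape)  # (8, 64)
```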
Ok, so what does all of this really mean, in practice? 🔗
Many programmers run into the same wtf moment when reading up on DIY AI à la PyTorch or similar libraries and frameworks. There seem to be two types of documentation. One is stacks of research papers on arXiv, loaded with dense math and lots of Greek, and the other is “just call these functions, easy!” and explains nothing.
The typical coder type is left scratching their head, knowing it’s some combination of math and matrices, but what do you do with that? We’re used to having to spell stuff out, use logic, tell the machine what the heck to do, so where’s that part? As I said earlier, that’s the black magic part - you don’t.
It feels intuitively like there must be more to it, but there really isn’t - other than organizing and arranging the spaces (matrices) and math (functions) to taste. Mostly those are all happily abstracted away into functions for the most common uses, and that’s what we see in the docs: the “call this, this, and this” type.
It’s basically saying “make a matrix, call some functions (one function = one layer), pick an optimizer (how hard do you want to push along the gradient?), and call train() on the lot of it.” That’s it, other than the frills and dressing of finessing inputs and outputs or whatnot in a more end-user scenario. The rest is the machine arduously brute-forcing its way to solving for x well enough to be useful.
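To make that literal, here’s roughly what the docs-level view looks like. I’m using Keras here only because it makes the “stack layers, hit train()” shape most obvious; the sizes and data are made up:

```python
import numpy as np
from tensorflow import keras

# "Make a matrix, stack some layers, pick how to learn, hit train" - that's the whole shape of it.
model = keras.Sequential([
    keras.Input(shape=(10,)),
    keras.layers.Dense(32, activation="relu"),   # one function = one layer
    keras.layers.Dense(3),
])
model.compile(optimizer="sgd",
              loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True))

x = np.random.rand(64, 10).astype("float32")     # made-up inputs
y = np.random.randint(0, 3, size=(64,))          # made-up labels
model.fit(x, y, epochs=5)                        # the "call train() on the lot of it" part
```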
The thing you’re more likely thinking of, where you actually tell it things in some manner? That’s called symbolic AI, and lots of people have tried this. It’s basically what was behind the whole “semantic web” thing, among other efforts.
How Matrix Shapes and Embedding Dimensions Are Chosen 🔗
Embedding dimensions and matrix shapes aren’t completely arbitrary, but they’re largely pragmatic: bigger usually means better (more expressive) representations—but at higher compute and memory costs. Typically, embedding sizes (often 256 to 4096+) are selected based on GPU memory, budget, and dataset size.
Larger embeddings can encode richer nuances of language but require more training data and hardware. The trade-off: diminishing returns kick in past certain sizes, so experiments balance accuracy against cost.
How Math Functions and Operations Are Chosen per Layer 🔗
Most layer choices - like activation functions and normalization methods - come from empirical success rather than deep theoretical mandates. Researchers often start with known-good defaults (like multi-headed self-attention layers followed by feed-forward networks) and tweak hyperparameters (manual config tuning variables) experimentally.
Activation functions like ReLU ($max(0, x)$) or GELU (Gaussian Error Linear Unit, smoothly approximating ReLU while preserving gradient flow around zero) became standard because they reliably avoid issues like vanishing gradients and facilitate stable, deep network training. Layer normalization stabilizes activations and improves convergence. Essentially, the field iteratively tests and adopts techniques proven stable, efficient, and effective.
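For a quick feel of the difference between the two, with made-up sample values:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(F.relu(x))   # tensor([0.0000, 0.0000, 0.0000, 0.5000, 2.0000]) - hard cutoff at zero
print(F.gelu(x))   # smooth version: small negatives leak through slightly instead of dying
```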
Which means, yes, they are doing precisely what they’re training the models to do: iteratively guess and correct via feedback, like a smart Monte Carlo system. One wonders why they don’t make an ML model designed for optimizing models. :)
How Did Transformers Solve NLP’s Combinatorial Explosion Problem? 🔗
Natural language processing is combinatorially explosive because each word’s meaning depends heavily on context—leading to immense complexity in older models like recurrent neural networks (RNNs). RNNs maintained state through sequential steps, causing vanishing/exploding gradient issues and poor long-range dependencies.
Transformers fixed this by dropping sequential processing entirely, using attention mechanisms instead. Attention directly compares each token with all others simultaneously (O(n²) complexity, but GPU-friendly), effectively solving context resolution head-on: each token can “see” every other, dynamically adjusting weights to decode meaning based on global relationships rather than linear order. Thus, transformers efficiently capture complex contexts, eliminating the sequential bottleneck.
Here’s another video with some complementary historical context and info. It explains, for instance, how bigger LLM models turned out to be better, despite conventional wisdom on the subject, among other tidbits.
How didn’t they solve it? 🔗
There’s talk of reactivating nuclear power plants to support these behemoths. They’re all brute-force, and it ain’t pretty. Every token of input (the full context window) is repeatedly smashed against every other token in attention, and then projected against the full vocabulary of the model (those weight matrices are more where the “parameters” number comes into play - not the embedding dimensions) - for every generated token. One. At. A. Time.
So, while this is cool and it does work, it still has a lot of room to work better, or at least more elegantly - maybe enough to eventually get actually good models running on higher-end desktop PCs? That’d be cool.
The big TL;DR Takeaway 🔗
Machine learning isn’t nearly as hard as it looks. It’s a lot more luck combined with trial and error (and recognizing which experiments are worth pushing further) than some sage wisdom of the math nerd collective. You create some 2d arrays, just like numpy does, stack them horizontally (attention heads) and vertically (layers) using pytorch/tensorflow functions, tell it what math-soup to make with some other functions, and hit “run.” The docs aren’t wrong, we just expect something else, apparently. We’re used to, you know, writing code, not making math semi-sentient by sheer brute force. The answer you’re looking for is that there is none.
Look up some open source stuff, like deepseek-v3 which kind of sucks without the R1 CoT tacked on, but that repo is mysteriously empty save for some stats and a paper. Still, CoT is demonstrated elsewhere, it’s findable. The salient point here is that, if you look through the V3 code, you might wonder where the code is. Yes, there’s something there, a tiny handful of pytorch stuff that looks more concerned with quantization than creating the model - but that’s it. That’s what I’m saying. The “code” you’re looking for is more realistically the training data itself, which is massive. This isn’t quite LISP’s “code is data, data is code”, but it’s not far off either, or perhaps a miniKanren on steroids. Either way, you don’t really write models, you more “breed” them with shapes and combinations of matrices and math goo, and that’s about it.
Enjoy.