<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Ai on splatlabs</title>
    <link>https://splatlabs.com/categories/ai/</link>
    <description>Recent content in Ai on splatlabs</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en</language>
    <lastBuildDate>Mon, 05 May 2025 00:00:00 +0000</lastBuildDate><atom:link href="https://splatlabs.com/categories/ai/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>The History of NLP</title>
      <link>https://splatlabs.com/posts/the-history-of-nlp/</link>
      <pubDate>Mon, 05 May 2025 00:00:00 +0000</pubDate>
      
      <guid>https://splatlabs.com/posts/the-history-of-nlp/</guid>
      <description>&lt;h1 id=&#34;how-ai-learned-to-talk-from-zork-to-autocomplete&#34;&gt;How AI Learned to Talk: From Zork to Autocomplete. &lt;a href=&#34;#how-ai-learned-to-talk-from-zork-to-autocomplete&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h1&gt;&lt;p&gt;That&amp;rsquo;s &lt;a href=&#34;https://en.wikipedia.org/wiki/Natural_language_processing&#34;&gt;Natural Language Processing&lt;/a&gt;, not the &lt;a href=&#34;https://en.wikipedia.org/wiki/Neuro-linguistic_programming&#34;&gt;&lt;em&gt;other&lt;/em&gt; NLP&lt;/a&gt;. :)&lt;/p&gt;
&lt;h2 id=&#34;z-games&#34;&gt;Z-Games &lt;a href=&#34;#z-games&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;If you&amp;rsquo;re very old or indulge in ancient novelties, you might&amp;rsquo;ve played &lt;a href=&#34;https://en.wikipedia.org/wiki/Colossal_Cave_Adventure&#34;&gt;Adventure&lt;/a&gt;, the direct inspiration for the most famous text-based game of the modern computing era: Dungeon, or &lt;a href=&#34;https://en.wikipedia.org/wiki/Zork&#34;&gt;Zork&lt;/a&gt;. It was ported to the &lt;a href=&#34;https://www.ifwiki.org/Z-machine&#34;&gt;Z-Machine&lt;/a&gt; format and &lt;a href=&#34;https://www.ifwiki.org/ZIL&#34;&gt;ZIL&lt;/a&gt; (an early DSL) by &lt;a href=&#34;https://en.wikipedia.org/wiki/Infocom&#34;&gt;Infocom&lt;/a&gt;, which released a number of similar games, including the &lt;a href=&#34;https://en.wikipedia.org/wiki/The_Hitchhiker%27s_Guide_to_the_Galaxy_(video_game)&#34;&gt;&lt;em&gt;Hitchhiker&amp;rsquo;s Guide to the Galaxy&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Around this time, computers were starting to be office desk sized - personal - and finding their way into homes. Games took off in a couple of directions, one being the graphics + text style &lt;a href=&#34;https://en.wikipedia.org/wiki/Sierra_Entertainment&#34;&gt;Sierra Games&lt;/a&gt; (like the &lt;a href=&#34;https://en.wikipedia.org/wiki/Leisure_Suit_Larry&#34;&gt;&lt;em&gt;Leisure Suit Larry&lt;/em&gt;&lt;/a&gt; series), while the other stayed text-based and became myriad &lt;a href=&#34;https://en.wikipedia.org/wiki/Multi-user_dungeon&#34;&gt;MUDs&lt;/a&gt;, &lt;a href=&#34;http://www.furrymuck.com/&#34;&gt;MUCKs&lt;/a&gt;, and other MU* games on the simultaneously-blossoming internet. A tick later and we had online graphical games like &lt;a href=&#34;https://en.wikipedia.org/wiki/Ultima_Online&#34;&gt;Ultima Online&lt;/a&gt; to complement the text-based, and from there it just kept going.&lt;/p&gt;
&lt;p&gt;Why does this matter? I thought this was the history of NLP, not history of PC gaming? Yeah well, turns out they&amp;rsquo;re not entirely distinct, though not identical either.&lt;/p&gt;
&lt;p&gt;The two areas of technology, gaming and AI (with and without NLP), have been oddly intertwined since the beginning, and still are, in various ways. One reason is that much of it is US based, and &amp;ldquo;capitalism&amp;rdquo;, basically. Academic funding exists, but if you want to actually &lt;em&gt;do something&lt;/em&gt; - not just &lt;del&gt;FAFO&lt;/del&gt; run an experiment, write a paper on it, and move on, but continually iterate in the wild - you need a funding model. That model, for decades, was games.&lt;/p&gt;
&lt;p&gt;Gaming has driven GPU development - from late-80s i386 math-coprocessors to current CUDA cores. CAD and scientific computing went along for the ride, but gaming was the market mover with the power of the masses (and their wallets). &lt;a href=&#34;https://en.wikipedia.org/wiki/Nvidia&#34;&gt;Nvidia&lt;/a&gt; leaned in, supporting not just gaming, but deep learning and, more recently, robotics.&lt;/p&gt;
&lt;p&gt;So yes, gaming - not academic research - is probably the most influential factor in why we have LLMs today (at least the hardware to run them), and has been a running (profitable) testbed for AI the whole way. Porn could’ve driven it too, maybe, being one of, if not the, most profitable businesses online from the early days - but gaming had more demanding math problems and higher frame rate demands, so it worked.&lt;/p&gt;
&lt;p&gt;From early syntax-limited parser grammars (&lt;a href=&#34;https://web.mit.edu/marleigh/www/portfolio/Files/zork/transcript.html&#34;&gt;&amp;ldquo;kill troll (with the sword)&amp;rdquo;&lt;/a&gt;) to NPCs with autonomous pathfinding to LLM-powered characters with dynamic personas, AI has been evolving inside games since the start.&lt;/p&gt;
&lt;h2 id=&#34;expert-systems-and-infobots&#34;&gt;Expert Systems and Infobots &lt;a href=&#34;#expert-systems-and-infobots&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Meanwhile, another text-based medium was also taking off: &lt;a href=&#34;https://en.wikipedia.org/wiki/Internet_Relay_Chat&#34;&gt;IRC&lt;/a&gt;. Though IRC didn&amp;rsquo;t use NLP, it quickly became a playground for bots - perhaps the first NPCs of the online world.&lt;/p&gt;
&lt;p&gt;Many of us toyed with various ways to make these bots more personable, clever, and useful, but most boiled down to what became formalized as &lt;a href=&#34;https://en.wikipedia.org/wiki/Infobot&#34;&gt;Infobots&lt;/a&gt;, aka Expert Systems - or the occasional &lt;a href=&#34;https://en.wikipedia.org/wiki/MegaHAL&#34;&gt;MegaHAL&lt;/a&gt; because why not - it was the early 90s and we were making it up as we went along, having fun.&lt;/p&gt;
&lt;p&gt;They learned, passively and/or actively, by harvesting factoids via simple conversational patterns like &amp;ldquo;x is y,&amp;rdquo; and storing these in flat files or DBs. Ask &amp;ldquo;what is x?&amp;rdquo; and they&amp;rsquo;d reply, parroting what they&amp;rsquo;d seen. Think: &lt;a href=&#34;https://en.wikipedia.org/wiki/ELIZA&#34;&gt;ELIZA&lt;/a&gt; meets Prolog-lite.&lt;/p&gt;
&lt;p&gt;This was an organic echo of symbolic reasoning, a la basic Prolog facts and queries - like what_is(Keyword, Answer) - and we could expand them to &amp;ldquo;where is x&amp;rdquo; or whatever else we could somewhat reliably match with regular expressions and the like. Some of us essentially reinvented logic programming without realizing it.&lt;/p&gt;
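&lt;p&gt;A minimal sketch of that factoid pattern - hypothetical toy Python, not any particular bot&amp;rsquo;s actual code:&lt;/p&gt;

```python
import re

# Toy infobot: harvest "x is y" factoids from chat, answer "what is x?"
facts = {}

def observe(line):
    # Passive learning: match simple "X is Y" statements and store them
    m = re.match(r"(\w[\w ]*?) is ([\w ].*)", line.strip(), re.IGNORECASE)
    if m:
        facts[m.group(1).lower()] = m.group(2)

def answer(question):
    # Query: "what is X?" parrots back the stored factoid, if any
    m = re.match(r"what is ([\w ]+)\??", question.strip(), re.IGNORECASE)
    if m:
        key = m.group(1).lower()
        if key in facts:
            return f"{m.group(1)} is {facts[key]}"
    return "no idea"

observe("Zork is a text adventure by Infocom")
print(answer("what is zork?"))  # zork is a text adventure by Infocom
```

&lt;p&gt;Expanding coverage meant adding more regexes (&amp;ldquo;where is x&amp;rdquo;, &amp;ldquo;who is x&amp;rdquo;), which is exactly where the brittleness crept in.&lt;/p&gt;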
&lt;p&gt;These bots laid the foundation for (or at least foreshadowed) the IVR hell systems of &amp;ldquo;press 3 for billing&amp;rdquo; fame, but they stayed mostly in hobbyist and gaming circles. They required tons of human curation and tended to rot without upkeep.&lt;/p&gt;
&lt;p&gt;Still, the groundwork was laid, and they kept slowly but surely trying different strategies to further evolve, and &lt;a href=&#34;https://logicmoo.org/&#34;&gt;often through games&lt;/a&gt;, even &lt;a href=&#34;https://www.newscientist.com/article/2414144-minecraft-could-be-the-key-to-creating-adaptable-ai/&#34;&gt;to this day&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;cryptography-one-way-hashes-and-context&#34;&gt;Cryptography, One-Way Hashes, and Context &lt;a href=&#34;#cryptography-one-way-hashes-and-context&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Language is inherently cryptographic, or rather &lt;a href=&#34;https://en.wikipedia.org/wiki/Steganography&#34;&gt;steganographic&lt;/a&gt;, where meaning is hidden in context - sometimes layers deep, each with useful information - which is one crucial reason NLP is so complex.&lt;/p&gt;
&lt;p&gt;Another challenge is the combinatorial explosion issue, and yet another is that people rarely speak using very &lt;a href=&#34;https://en.wikipedia.org/wiki/Grammar&#34;&gt;correct grammar&lt;/a&gt; (especially these days), but the cryptographic aspect is arguably one of the harder parts to adapt to comprehension via computation.&lt;/p&gt;
&lt;p&gt;To illustrate, if I used the word &amp;ldquo;computer&amp;rdquo; a century or two ago, it&amp;rsquo;d have been assumed I was speaking of a person skilled in mental math computation. If I attempted to explain - no, I mean a machine, with a processor, and a screen, and&amp;hellip; - it&amp;rsquo;d only make things worse and lead to more confusion.&lt;/p&gt;
&lt;p&gt;Words make sense to us because they&amp;rsquo;re &lt;a href=&#34;https://en.wikipedia.org/wiki/Language_game_(philosophy)&#34;&gt;based on direct experience&lt;/a&gt;, starting at birth with our caregivers speaking while pointing, doing, or otherwise indicating the correlation to the utterances they used.&lt;/p&gt;
&lt;p&gt;From these direct, first-order observations, we then learn to generalize and abstract, to reference things we can&amp;rsquo;t &lt;em&gt;point to&lt;/em&gt; in our immediate proximity, or limit references to specific instances of things.&lt;/p&gt;
&lt;p&gt;We learn the can of Pepsi on our desk belongs to the abstract class &amp;ldquo;Pepsi&amp;rdquo; and can refer to any other can of Pepsi, which belongs to the &lt;em&gt;category&lt;/em&gt; &amp;ldquo;soda&amp;rdquo;, then &amp;ldquo;drink&amp;rdquo;, &amp;ldquo;consumable&amp;rdquo;, and on up until we hit &amp;ldquo;object&amp;rdquo; (if we go by &lt;a href=&#34;https://en.wikipedia.org/wiki/WordNet&#34;&gt;WordNet&amp;rsquo;s&lt;/a&gt; hyponym hierarchy).&lt;/p&gt;
&lt;p&gt;Soda can also mean sodium bicarbonate, which is another issue, and we discern which is referenced by the context it&amp;rsquo;s referenced within (the sentence, the conversation, the environment - all are context cues).&lt;/p&gt;
&lt;p&gt;So we have both symbol overload (word re-use) and the one-way hash or cryptographic element - you can’t derive original context from a word without some approximate shared experience. Something that gives us &lt;em&gt;context&lt;/em&gt; by which to recognize what the symbol is referencing, and an existing schema to process it through. Thus experience, and the context it creates in our minds, is the &amp;ldquo;key&amp;rdquo; to the cryptographic encoding of language.&lt;/p&gt;
&lt;p&gt;Context, it turns out, really is everything. At least when it comes to &lt;em&gt;meaning&lt;/em&gt;. Even dictionaries are useless without it. Even the word &amp;ldquo;meaning&amp;rdquo; is &lt;a href=&#34;https://nickbostrom.com/old/quine&#34;&gt;hilariously ambiguous&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To someone from before our technological era, trying to explain &amp;ldquo;computer&amp;rdquo; wouldn&amp;rsquo;t be helped by a dictionary or anything else, even a photograph might be suggestive, but still confusing, since it&amp;rsquo;d lack context to what the image was even showing.&lt;/p&gt;
&lt;p&gt;Blind people &lt;a href=&#34;https://www.science.org/content/article/feature-giving-blind-people-sight-illuminates-brain-s-secrets&#34;&gt;can&amp;rsquo;t instantly see&lt;/a&gt;, or at least can&amp;rsquo;t make sense of what they&amp;rsquo;re seeing, when their blindness is cured medically (and not in infancy). There are &lt;a href=&#34;https://en.wikipedia.org/wiki/Invisible_ships&#34;&gt;stories&lt;/a&gt; of natives who saw the boats but had no concept of ships, so would instead perceive whales, large waves, or nothing at all. The story is of dubious validity, but the point remains highly relevant: fundamental &lt;a href=&#34;https://neuroscienceschool.com/2023/11/16/we-see-what-we-expect-to-see/&#34;&gt;human psychology&lt;/a&gt;, likely deeply anchored in &lt;a href=&#34;https://neuroscienceschool.com/2023/11/16/we-see-what-we-expect-to-see/&#34;&gt;neurology&lt;/a&gt;, and a notable factor in human &lt;a href=&#34;https://tll.mit.edu/teaching-resources/how-people-learn/metacognition/&#34;&gt;cognition&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Our brains physically adapt, &amp;ldquo;neurons that fire together, wire together&amp;rdquo;, and through ongoing feedback loops come to recognize images, sounds, sensations, then symbols, words, and ideas, all building up piece by piece, creating contextual structures and schemas along the way, through which we process the world&amp;rsquo;s information. So how do we teach computers to do this in an automated, algorithmic way?&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s a damn good question, and the answer is what would lead us, presumably, to AGI - systems that could (the &amp;ldquo;consciousness&amp;rdquo; debate aside) think, comprehend, and actually &lt;em&gt;understand&lt;/em&gt; on some level.&lt;/p&gt;
&lt;h2 id=&#34;ontologies&#34;&gt;Ontologies &lt;a href=&#34;#ontologies&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;There are words for this system of correspondences and abstractions we build up, like &amp;ldquo;reality model&amp;rdquo;, or more formally &amp;ldquo;ontology&amp;rdquo;, which may also include aspects such as taxonomy.&lt;/p&gt;
&lt;p&gt;So the next big step was realizing the internet wasn&amp;rsquo;t just running a service we called the web, but could &lt;em&gt;be&lt;/em&gt; a web - an ontological &lt;a href=&#34;https://en.wikipedia.org/wiki/Semantic_network&#34;&gt;Semantic Network&lt;/a&gt;, a distributed system storing the scaffolding needed for computational understanding of language. Some people built it, over decades, and it&amp;rsquo;s actually damn impressive (if largely invisible until you stumble onto it), if a bit dusty now.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://foundation.wikimedia.org/wiki/Home&#34;&gt;WikiMedia Foundation&lt;/a&gt; is notable here (look at the bottom of the page, click around, you&amp;rsquo;ll see many links with relationship data like is-a, has-a, has-parts, relates-to, parts of speech, etc), though Wikipedia is by far the most well-known out of this broader cluster of effort. Projects like &lt;a href=&#34;https://en.wikipedia.org/wiki/Open_Mind_Common_Sense#ConceptNet&#34;&gt;ConceptNet&lt;/a&gt; and &lt;a href=&#34;https://en.wikipedia.org/wiki/Cyc&#34;&gt;Cyc&lt;/a&gt; were laboriously built up, only to be eventually largely abandoned outside of some very niche domains and uses.&lt;/p&gt;
&lt;p&gt;The semantic web adopted formats that attempted to formalize and encapsulate these knowledge databases in highly structured ways, such as &lt;a href=&#34;https://en.wikipedia.org/wiki/Web_Ontology_Language&#34;&gt;OWL&lt;/a&gt; and &lt;a href=&#34;https://en.wikipedia.org/wiki/RDF_Schema&#34;&gt;RDF&lt;/a&gt;, which still exist, but tend to hide in niches like &lt;a href=&#34;https://www.ontotext.com/knowledgehub/fundamentals/linked-data-linked-open-data/&#34;&gt;LOD&lt;/a&gt; and &lt;a href=&#34;https://monarchinitiative.org/&#34;&gt;medical data&lt;/a&gt; (also impressive once you notice it, just very dusty) and some more obscure NLP related areas.&lt;/p&gt;
&lt;p&gt;A few larger, actually successful applications were made, in medical diagnostic systems for instance, but were eventually considered too brittle, too hard to create and upkeep, and unable to be easily adapted to other domains. The momentum drained and dried up.&lt;/p&gt;
&lt;p&gt;It was dead end after dead end, but efforts remained strong for decades - until a few years ago, when the majority of related GitHub repos and projects seemed to stop being updated, got archived, or became abandonware. This, of course, coincides with the rise of LLMs as we know them now, which grew directly out of the monumental breakthrough of &lt;a href=&#34;https://en.wikipedia.org/wiki/Attention_Is_All_You_Need&#34;&gt;Attention Is All You Need&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Suddenly we didn’t need to &lt;em&gt;define&lt;/em&gt; meaning. We could &lt;em&gt;approximate&lt;/em&gt; it statistically.&lt;/p&gt;
&lt;p&gt;Some clever math nerds had taken NN style math-soup, and figured out how to get it to talk. A bit later, it became coherent, then actually useful, in a few quick leaps. It was a convergence: the massive hardware necessary coming into existence, along with decades of trial and error in the ontology and NLP fields, finally hitting &lt;em&gt;something that worked&lt;/em&gt;, mostly, if costly.&lt;/p&gt;
&lt;p&gt;Companies caught wind of this, and invested insanely in it - finally, the AI breakthrough we&amp;rsquo;d been promised and were waiting for all these years. Ok, so maybe we have to fire back up or build a few new &lt;em&gt;nuclear power plants&lt;/em&gt;, maybe math was never meant to &lt;em&gt;think&lt;/em&gt;, but heck, it works, so throw everything else out and run with it!&lt;/p&gt;
&lt;p&gt;And the insanity began, while decades of slow, deliberate building of vast networks of knowledge were largely abandoned (sans some niches of &lt;a href=&#34;https://arxiv.org/abs/2210.03629&#34;&gt;ongoing exploration&lt;/a&gt;) almost overnight. It&amp;rsquo;s still there though - most of it anyway - in case someone realizes pure math-soup was maybe not the best strategy, or structured knowledge might have some use yet.&lt;/p&gt;
&lt;h1 id=&#34;what-now&#34;&gt;What Now? &lt;a href=&#34;#what-now&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h1&gt;&lt;p&gt;The renewed momentum generated by the success of transformer models (and let&amp;rsquo;s be realistic, cash infusion stemming from that) has fresh eyes looking into new ways to solve this decades-old problem. As we know too well, LLMs &amp;ldquo;hallucinate&amp;rdquo;, and all the data on the internet isn&amp;rsquo;t enough to teach them everything, somehow.&lt;/p&gt;
&lt;p&gt;Data is messy, people say stupid things a lot, and it takes a &lt;em&gt;ton&lt;/em&gt; of data to train a transformer to a coherent level. Add to this the pervasive commercialization, hyperbole, spam, SEO manipulation, misinformation, disinformation, RP and propaganda out there, and it makes a bit of a hostile environment for learning, to say the least. There are many reasons why LLMs are flawed, but this is a big one.&lt;/p&gt;
&lt;p&gt;But with this momentum, we&amp;rsquo;ve got investigation and innovations around &lt;a href=&#34;https://en.wikipedia.org/wiki/Holographic_algorithm&#34;&gt;holography&lt;/a&gt;, &lt;a href=&#34;https://en.wikipedia.org/wiki/Optical_computing&#34;&gt;photonic computing&lt;/a&gt; (potentially &lt;a href=&#34;https://news.mit.edu/2024/photonic-processor-could-enable-ultrafast-ai-computations-1202&#34;&gt;replacing GPUs&lt;/a&gt; for the big math), and other more analog-style approaches. Going &amp;ldquo;back&amp;rdquo; to escape the current local maxima - and move forward again in a better direction - seems like a solid strategy at this point.&lt;/p&gt;
&lt;p&gt;What did we leave behind that would be useful now that we&amp;rsquo;ve solved this other problem, what didn&amp;rsquo;t work, but would now with our hardware and resources? The solutions we need may already exist in some dusty, archived github repos, you never know.&lt;/p&gt;
&lt;p&gt;Transformers&amp;rsquo; &lt;a href=&#34;https://splatlabs.com/posts/how-llms-work/&#34;&gt;underlying structure&lt;/a&gt;, being essentially &amp;ldquo;quadrillions of dot products on craploads of matrices of floating point numbers&amp;rdquo; isn&amp;rsquo;t exactly biased towards accuracy. It&amp;rsquo;s fuzzy, on purpose, since teaching computers to &lt;em&gt;actually&lt;/em&gt; think and know things proves a lot harder than it sounds once you dig into it.&lt;/p&gt;
&lt;p&gt;It works, but it’s kind of a fluke, honestly. &lt;a href=&#34;https://en.wikipedia.org/wiki/Douglas_Hofstadter&#34;&gt;Hofstadter&lt;/a&gt; was right: enough structure and recursion breeds emergent behavior. But actual understanding? Even if our brains are also NNs, it&amp;rsquo;s not quite that simple, apparently.&lt;/p&gt;
&lt;p&gt;The reality is, LLMs are a lot more like our &amp;ldquo;subconscious&amp;rdquo;, or &amp;ldquo;System 1&amp;rdquo; in the &lt;a href=&#34;https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow#Two_systems&#34;&gt;Thinking Fast and Slow&lt;/a&gt; sense. What it lacks is the &amp;ldquo;System 2&amp;rdquo; to complement it - what we often call the &amp;ldquo;conscious&amp;rdquo; or &amp;ldquo;rational&amp;rdquo; mind, the part that reasons through things slowly. That&amp;rsquo;s what we often attempt to instruct LLMs to do, or attempt to force through tacked-on &lt;a href=&#34;https://en.wikipedia.org/wiki/Prompt_engineering#Chain-of-thought&#34;&gt;CoT&lt;/a&gt; systems, which give the model a chance to build some faux critical thinking by generating purposeful outputs into its context before forming the final response.&lt;/p&gt;
&lt;p&gt;When we see a model &amp;ldquo;thinking&amp;rdquo; it&amp;rsquo;s creating a semi-structured response, most of which include &amp;ldquo;the user is looking for&amp;hellip;&amp;rdquo; and &amp;ldquo;wait&amp;hellip; it could also be&amp;hellip;&amp;rdquo; type sentences. Since it generates the most likely next token, these lead it to produce varied branches of completions, which are then appended to the existing context window, forcing them into its full contextual math-soup as well - a bit of &amp;ldquo;priming&amp;rdquo;, as it were.&lt;/p&gt;
&lt;p&gt;It does not, even in these advanced scenarios, &amp;ldquo;think&amp;rdquo; in any meaningful use of the term, or engage any real &amp;ldquo;System 2&amp;rdquo; type of mechanism (though I would contend this is not much different than many humans, but that again is another article). :)&lt;/p&gt;
&lt;p&gt;That said, the big cash infusion? Pretty much all of it went to big LLMs and commercial ventures. Whoever finds the solution to these big problems may be riding the excitement wave, but it&amp;rsquo;s likely going to be some random individual leafing through dusty archives, driven purely by insatiable curiosity, without any expectation of reward. That&amp;rsquo;s how these things tend to work, after all.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>How LLMs Work</title>
      <link>https://splatlabs.com/posts/how-llms-work/</link>
      <pubDate>Sun, 04 May 2025 00:00:00 +0000</pubDate>
      
      <guid>https://splatlabs.com/posts/how-llms-work/</guid>
      <description>&lt;h1 id=&#34;demystifying-llm-architecture&#34;&gt;Demystifying LLM Architecture &lt;a href=&#34;#demystifying-llm-architecture&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h1&gt;&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&amp;ldquo;Oh, there is a brain all right. It&amp;rsquo;s just that the brain is made out of &lt;del&gt;meat&lt;/del&gt; math!&amp;rdquo;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&amp;ldquo;So&amp;hellip; what does the thinking?&amp;rdquo;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&amp;ldquo;You&amp;rsquo;re not understanding, are you? The brain does the thinking. The &lt;del&gt;meat&lt;/del&gt; math.&amp;rdquo;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&amp;ldquo;Thinking &lt;del&gt;meat&lt;/del&gt; math! You&amp;rsquo;re asking me to believe in thinking &lt;del&gt;meat&lt;/del&gt; math!&amp;rdquo;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&amp;ndash; &lt;a href=&#34;https://www.mit.edu/people/dpolicar/writing/prose/text/thinkingMeat.html&#34;&gt;MIT Thinking Meat&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Aka: The documentation I wanted but couldn&amp;rsquo;t find in one place anywhere, and will want as a reference for myself ongoing also.&lt;/p&gt;
&lt;p&gt;What if you just want to &lt;em&gt;write code&lt;/em&gt; using &lt;a href=&#34;https://pytorch.org/&#34;&gt;the tools&lt;/a&gt;, and not become a PhD in theoretical math? Or you want to write programmer-style code, not data-scientist style code (which are about as different as Egyptian hieroglyphs are to Latin, frankly, despite both being &amp;ldquo;language&amp;rdquo; in the broad sense)?&lt;/p&gt;
&lt;p&gt;So I&amp;rsquo;ve written a sort of math-nerd to tech-nerd translation cheat-sheet for decoding the PhD-speak into something hopefully usable, and with some comprehension and not just copypasta.&lt;/p&gt;
&lt;p&gt;Eg: a &amp;ldquo;continuous n-dimensional vector space&amp;rdquo; is &lt;em&gt;really&lt;/em&gt; just a numeric array (usually of floats) where values vary smoothly rather than jumping discretely (like integers or categories) - once you strip off the domain language and semantic constructs and rephrase it in more direct, application/programmer-friendly terms.&lt;/p&gt;
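&lt;p&gt;To make that translation concrete - a &amp;ldquo;continuous n-dimensional vector space&amp;rdquo; in NumPy, with made-up numbers:&lt;/p&gt;

```python
import numpy as np

# A "continuous 4-dimensional vector space" is, in code, just float arrays.
# Three hypothetical token embeddings, 4 dims each: a [3 x 4] float matrix.
embeddings = np.array([
    [0.12, -0.40, 0.88, 0.05],   # "king"
    [0.10, -0.35, 0.90, 0.01],   # "queen"
    [0.95,  0.20, -0.10, 0.33],  # "banana"
], dtype=np.float32)

print(embeddings.shape)   # (3, 4)
print(embeddings.dtype)   # float32
```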
&lt;p&gt;Also here&amp;rsquo;s a &lt;a href=&#34;https://www.youtube.com/watch?v=KJtZARuO3JY&#34;&gt;video/visual guide by Grant Sanderson&lt;/a&gt; aka 3B1B, to go along with my ramblings.&lt;/p&gt;
&lt;h2 id=&#34;what-is-machine-learning&#34;&gt;What is Machine Learning? &lt;a href=&#34;#what-is-machine-learning&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Programming and software all boil down to some form of A-&amp;gt;B, or input -&amp;gt; (magic) -&amp;gt; output. Sometimes the data goes into a DB, or to a frontend/website, or is passed on via API, etc., but this fundamental pattern underlies it all. AI is essentially giving the computer some space (matrices, tensors) and math functions, and saying &amp;ldquo;you figure it out, by brute-forcing it till you get it right&amp;rdquo;, then saving the version that gets it right&lt;em&gt;est&lt;/em&gt; to re-use ongoing.&lt;/p&gt;
&lt;p&gt;ML is fundamentally mapping inputs to outputs by tuning a system of numerical weights, by formal definition. Inputs (like words or images) are first converted into numeric encodings, usually dense vectors called embeddings.&lt;/p&gt;
&lt;p&gt;These vectors flow through multiple layers, each performing matrix multiplications and nonlinear functions (often ReLU or GELU). Each matrix multiplies the input by a set of learnable weights, generating intermediate representations. For text, these matrices typically have shapes like [vocabulary_size × embedding_dim] for embeddings, or [embedding_dim × embedding_dim] for hidden layers.&lt;/p&gt;
&lt;p&gt;During training, outputs are compared to known targets (labels) via a loss function (like cross-entropy), and the gradients (partial derivatives of the loss with respect to each weight) flow backward (backpropagation), slightly adjusting each matrix’s values in the direction opposite to the gradient, gently nudging towards better accuracy without overshooting. Besides dot products, the math operations include addition, scaling, normalization, and nonlinear activation functions.&lt;/p&gt;
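&lt;p&gt;The loop just described, sketched with a single learnable weight and squared error (toy values, NumPy only - real models do the same thing across billions of weights via autograd):&lt;/p&gt;

```python
import numpy as np

# Minimal sketch of the training loop: forward pass, loss, gradient,
# and a small step against the gradient.
# Toy task: learn w so that input * w approximates output (true w = 3).
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x

w = 0.0      # the learnable "weight matrix" (here just one number)
lr = 0.1     # learning rate: how gently we nudge

for step in range(100):
    pred = x * w                        # forward pass
    loss = np.mean((pred - y) ** 2)     # loss function (mean squared error)
    grad = np.mean(2 * (pred - y) * x)  # dLoss/dw via the chain rule
    w = w - lr * grad                   # nudge opposite the gradient

print(round(w, 3))  # converges to roughly 3.0
```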
&lt;p&gt;Getting high quality, accurately labeled data can be one of the biggest challenges for most applications of ML. As the old saying goes &amp;ldquo;garbage in, garbage out&amp;rdquo;, and it&amp;rsquo;s definitely apt here.&lt;/p&gt;
&lt;p&gt;It kind of boils down to algebra, abstractly. You keep throwing input * x = output at it until it learns to solve for x, then you let it produce outputs on its own, and x is basically math-soup that we can&amp;rsquo;t really discern wtf it&amp;rsquo;s doing, other than how well it works or not. If it seems like opaque, black magic, that&amp;rsquo;s because it really is, you&amp;rsquo;re not missing something - that&amp;rsquo;s just how it works.&lt;/p&gt;
&lt;h2 id=&#34;how-are-transformer-models-llms-different&#34;&gt;How are transformer models (LLMs) different? &lt;a href=&#34;#how-are-transformer-models-llms-different&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Transformers are a neural network architecture built around attention mechanisms instead of sequential or convolutional patterns. The core innovation, &amp;ldquo;Attention is All You Need&amp;rdquo;, lets the model dynamically weigh input tokens differently based on context.&lt;/p&gt;
&lt;p&gt;The embedding matrix is shaped [vocab_size × embedding_dim], where each token in the vocabulary corresponds to a row, and each column encodes a different semantic aspect (dimension of meaning), and the training process magically assigns these dimensions based on various word use in the training text.&lt;/p&gt;
&lt;p&gt;These vectors represent points in an abstract &amp;ldquo;meaning-space,&amp;rdquo; where the relative positions and directions between vectors encode relationships and semantic nuances. This is how the model can calculate similarity (cosine similarity) based on the angles between vectors, indicating semantic closeness. It&amp;rsquo;s how we get fun results like &amp;ldquo;King - Man + Woman = Queen&amp;rdquo; of word2vec fame.&lt;/p&gt;
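&lt;p&gt;Cosine similarity and that analogy arithmetic, sketched on made-up vectors (real word2vec weights come from training, not hand-writing):&lt;/p&gt;

```python
import numpy as np

# Cosine similarity: angle between vectors as a proxy for semantic closeness.
# Vectors are invented for illustration only.
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# "King - Man + Woman" lands nearest "Queen" in this toy space
target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max(vecs, key=lambda w: cosine(vecs[w], target))
print(best)  # queen
```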
&lt;p&gt;The model takes each token of the query (aka the context window), does the magical math-soup over however many layers for each of those tokens, and produces a list of most likely probable next tokens. One of those is then selected (based on temperature etc.), and then the process repeats - for every token in the response.&lt;/p&gt;
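&lt;p&gt;The generate-select-repeat loop, as a sketch - the &amp;ldquo;model&amp;rdquo; here is a fake stand-in returning made-up scores, since only the loop structure and the temperature step are the point:&lt;/p&gt;

```python
import numpy as np

# Autoregressive loop: score next tokens, apply temperature, sample one,
# append it to the context, repeat - once per generated token.
rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "mat", "."]

def fake_model(context):
    # Hypothetical stand-in for the real math-soup: fixed logits per token
    return np.array([1.0, 2.0, 0.5, 1.5, 0.2])

def sample_next(context, temperature=1.0):
    logits = fake_model(context) / temperature  # temperature reshapes the distribution
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()                 # softmax into probabilities
    return rng.choice(vocab, p=probs)

context = ["the"]
for _ in range(5):
    context.append(sample_next(context, temperature=0.8))
print(" ".join(context))
```

&lt;p&gt;Lower temperature sharpens the distribution toward the top token; higher temperature flattens it, producing more varied (and riskier) picks.&lt;/p&gt;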
&lt;p&gt;Attention is computed as a softmax-normalized dot product between query and key matrices (Q × Kᵀ), giving weights that determine how much each token&amp;rsquo;s value influences each output position (weights × Values). Transformers stack multiple attention heads in parallel, followed by dense feed-forward layers ([embedding_dim × embedding_dim] matrices) to produce rich contextual encodings.&lt;/p&gt;
&lt;p&gt;Backpropagation and differentiation work as usual - computing gradients from the loss - but because attention is parallelized and not sequential, it’s computationally efficient and can utilize GPUs well.&lt;/p&gt;
&lt;p&gt;Technically, each token in the input sequence is first converted into a vector via the embedding matrix [vocab_size × embedding_dim], resulting in a [sequence_length × embedding_dim] input. Then, for every attention head, the model applies three learned linear projections - $W_q, W_k, and W_v$ - to generate Queries, Keys, and Values, each shaped [sequence_length × head_dim]. Multiply that by the number of heads (often 8, 12, or more), and you start to see how each token gets exploded into multiple representations for parallel comparison.&lt;/p&gt;
&lt;p&gt;This is where most of the math-soup lives: lots of matrix multiplies, softmax weighting, and dot products - just to figure out which other tokens each word should &amp;ldquo;pay attention&amp;rdquo; to&amp;hellip; and why it eats gigs of ram to churn out a few sentences.&lt;/p&gt;
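&lt;p&gt;A single attention head&amp;rsquo;s worth of that math-soup, in NumPy with toy sizes (4 tokens, 8 dims) and random weights standing in for trained ones:&lt;/p&gt;

```python
import numpy as np

# Single-head scaled dot-product attention: softmax(Q @ K.T / sqrt(d)) @ V
rng = np.random.default_rng(0)
seq_len, d = 4, 8

X = rng.normal(size=(seq_len, d))    # token embeddings [seq_len x d]
W_q = rng.normal(size=(d, d))        # learned projections (random here)
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))

Q, K, V = X @ W_q, X @ W_k, X @ W_v  # Queries, Keys, Values

scores = Q @ K.T / np.sqrt(d)        # how much each token attends to each other token
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax

out = weights @ V                    # weighted mix of Values
print(out.shape)                     # (4, 8)
```

&lt;p&gt;Each row of weights sums to 1 - a per-token attention distribution over every other token in the sequence.&lt;/p&gt;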
&lt;h3 id=&#34;sequential-patterns-vs-attention-clarifying-the-mechanism&#34;&gt;Sequential Patterns vs. Attention: Clarifying the Mechanism &lt;a href=&#34;#sequential-patterns-vs-attention-clarifying-the-mechanism&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;In earlier models (like RNNs), sequential context meant words were processed one-by-one, carrying state forward. N-grams (like bigrams or trigrams) were historically discrete language models without embeddings. Modern neural language models (like transformers) differ significantly: all tokens are simultaneously embedded into numeric vectors, then attention computes weighted relevance scores dynamically.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s not limited to fixed-length sequences (e.g., trigrams); instead, it dynamically integrates context across entire input sequences, using learned weights.&lt;/p&gt;
&lt;p&gt;There is no big secret here either, they&amp;rsquo;re just &lt;em&gt;more matrices&lt;/em&gt; and the training feedback nudges them to become useful in the desired ways, or presumes they do so since the outputs improve.&lt;/p&gt;
&lt;p&gt;Attention matrices ($Q, K, V$) are computed from the same original input embeddings, but each undergoes its own linear transformation ($W_q, W_k, W_v$). These learned transformations specialize each matrix differently:&lt;/p&gt;
&lt;p&gt;Queries ($Q$) represent what each token is &amp;ldquo;looking for.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Keys ($K$) represent what information each token &amp;ldquo;offers.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Values ($V$) represent the content actually used in output.&lt;/p&gt;
&lt;p&gt;Thus, the attention weights ($\text{softmax}(QK^T / \sqrt{d_k})$, scaled by the head dimension to keep the dot products well-behaved) dynamically determine how tokens interact. With all heads combined, these matrices have shapes like [sequence_length × embedding_dim], transformed from inputs by learned [embedding_dim × embedding_dim] matrices; per head, the last dimension is head_dim.&lt;/p&gt;
&lt;p&gt;Each transformer layer typically includes a multi-headed self-attention step (sequence_length × embedding_dim), followed by a feed-forward step (embedding_dim × 4 × embedding_dim, then back to embedding_dim).&lt;/p&gt;
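&lt;p&gt;The core attention step itself fits in a few lines of numpy. This is a sketch of standard scaled dot-product attention (the $\sqrt{d_k}$ divisor is the usual stabilizer from the original transformer paper), not any particular model&amp;rsquo;s code:&lt;/p&gt;

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    # Compare every Query with every Key, scale, then normalize to weights.
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    weights = softmax(scores)  # each row sums to 1: "how much to attend"
    return weights @ V         # weighted blend of the Values

rng = np.random.default_rng(0)
seq_len, head_dim = 10, 8
Q, K, V = (rng.normal(size=(seq_len, head_dim)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (10, 8)
```

&lt;p&gt;That&amp;rsquo;s the whole mechanism: each output row is just a softmax-weighted average of the Value rows, with the weights decided by Query/Key dot products.&lt;/p&gt;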
&lt;h3 id=&#34;notable-differences-between-transformers-and-traditional-neural-networks&#34;&gt;Notable Differences Between Transformers and Traditional Neural Networks &lt;a href=&#34;#notable-differences-between-transformers-and-traditional-neural-networks&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Beyond attention, transformers notably differ in several ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Parallel vs Sequential Processing: Transformers process entire sequences simultaneously via attention, unlike recurrent (RNN/LSTM) models, eliminating serial bottlenecks.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Positional Encoding: Transformers lack built-in sequence-awareness, so positional embeddings explicitly encode token order.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Feed-Forward Layer Design: Transformers contain large intermediate feed-forward layers ([embedding_dim × 4×embedding_dim]), providing substantial nonlinearity and expressive power.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Layer Normalization: Transformers extensively use layer normalization to stabilize activations, compared to batch normalization common in convolutional networks.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These innovations collectively distinguish transformers from older architectures, providing better scalability, richer context modeling, and more stable training dynamics.&lt;/p&gt;
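&lt;p&gt;The positional-encoding point is easy to make concrete. Here&amp;rsquo;s a sketch of the classic fixed sinusoidal scheme from the original transformer paper (learned positional embeddings are the common alternative):&lt;/p&gt;

```python
import numpy as np

def sinusoidal_positions(seq_len, dim):
    # Even channels get sin, odd channels get cos, at geometric wavelengths,
    # so every position gets a unique, order-preserving pattern.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, dim, 2)[None, :]
    angles = pos / np.power(10000.0, i / dim)
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(16, 8)
print(pe.shape)  # (16, 8)
```

&lt;p&gt;These vectors get added to the token embeddings, which is how the attention layers recover token order despite processing everything in parallel.&lt;/p&gt;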
&lt;h2 id=&#34;ok-so-what-does-all-of-this-really-mean-in-practice&#34;&gt;Ok, so what does all of this &lt;em&gt;really mean&lt;/em&gt;, in practice? &lt;a href=&#34;#ok-so-what-does-all-of-this-really-mean-in-practice&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;Many programmers run into the same wtf moment when reading up on DIY AI à la PyTorch or similar libraries and frameworks. There seem to be two types of documentation. One is stacks of research papers on &lt;a href=&#34;https://arxiv.org/&#34;&gt;arXiv&lt;/a&gt;, loaded with dense math and lots of Greek, and the other is &amp;ldquo;just call these functions, easy!&amp;rdquo; and explains nothing.&lt;/p&gt;
&lt;p&gt;The typical coder type is left scratching their head, knowing it&amp;rsquo;s some combination of math and matrices, but what do you &lt;em&gt;do&lt;/em&gt; with that? We&amp;rsquo;re used to having to spell stuff out, use logic, tell the machine what the heck to do, so where&amp;rsquo;s that part? As I said earlier, that&amp;rsquo;s the black magic part - you &lt;em&gt;don&amp;rsquo;t&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;It feels intuitively like there must be more to it, but there really isn&amp;rsquo;t, other than organizing and arranging the spaces (matrices) and math (functions) to taste, but mostly they&amp;rsquo;re all happily abstracted away into functions for the most common uses, and that&amp;rsquo;s what we see in the docs, the &amp;ldquo;call this, this, and this&amp;rdquo; type.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s basically saying &amp;ldquo;make a matrix, call some functions (one function = one layer), assign a backprop method (how strong do you want your gradient?), and call train() on the lot of it.&amp;rdquo; &lt;em&gt;That&amp;rsquo;s it&lt;/em&gt;, other than the frills and dressing of finessing inputs and outputs or whatnot in a more end-user scenario. The rest is the machine arduously brute forcing its way to solving for x well enough to be useful.&lt;/p&gt;
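&lt;p&gt;To make that concrete, here&amp;rsquo;s the whole workflow in miniature - a toy linear model with a hand-rolled gradient step rather than a transformer or a real train() call, but the same shape of ritual:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
# "Make a matrix": one layer's worth of weights.
W = rng.normal(size=(3, 1)) * 0.1
X = rng.normal(size=(100, 3))
y = X @ np.array([[1.0], [-2.0], [0.5]])  # the pattern we want it to learn

lr = 0.1
for step in range(200):                   # "call train() on the lot of it"
    pred = X @ W                          # forward pass (one function = one layer)
    grad = 2 * X.T @ (pred - y) / len(X)  # gradient of mean squared error
    W = W - lr * grad                     # "how strong do you want your gradient?"

print(np.round(W.ravel(), 2))  # converges close to [ 1.  -2.   0.5]
```

&lt;p&gt;Notice that no logic was spelled out anywhere: the weights just get nudged downhill until the outputs look right, which is the entire trick.&lt;/p&gt;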
&lt;p&gt;The thing you&amp;rsquo;re more likely thinking of, where you actually &lt;em&gt;tell it things&lt;/em&gt; in some manner? That&amp;rsquo;s called symbolic AI, and lots of people have tried this. It&amp;rsquo;s basically what was behind the whole &amp;ldquo;semantic web&amp;rdquo; thing, among other efforts.&lt;/p&gt;
&lt;h3 id=&#34;how-matrix-shapes-and-embedding-dimensions-are-chosen&#34;&gt;How Matrix Shapes and Embedding Dimensions Are Chosen &lt;a href=&#34;#how-matrix-shapes-and-embedding-dimensions-are-chosen&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Embedding dimensions and matrix shapes aren&amp;rsquo;t completely arbitrary, but they&amp;rsquo;re largely pragmatic: bigger usually means better (more expressive) representations - but at higher compute and memory costs. Typically, embedding sizes (often 256 to 4096+) are selected based on GPU memory, budget, and dataset size.&lt;/p&gt;
&lt;p&gt;Larger embeddings can encode richer nuances of language but require more training data and hardware. The trade-off: diminishing returns kick in past certain sizes, so experiments balance accuracy against cost.&lt;/p&gt;
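&lt;p&gt;Some quick back-of-the-envelope arithmetic shows why: at 4 bytes per fp32 weight, just the embedding table grows linearly with both knobs (the vocab size here is assumed, for illustration):&lt;/p&gt;

```python
# Memory for the embedding matrix alone, at 4 bytes per fp32 weight.
vocab_size = 50_000  # assumed, for illustration
for embedding_dim in (256, 1024, 4096):
    params = vocab_size * embedding_dim
    mb = params * 4 / 1e6
    print(f"dim {embedding_dim}: {params:,} params, ~{mb:.0f} MB")
```

&lt;p&gt;And that&amp;rsquo;s before any of the per-layer attention and feed-forward matrices, which scale with the &lt;em&gt;square&lt;/em&gt; of embedding_dim.&lt;/p&gt;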
&lt;h3 id=&#34;how-math-functions-and-operations-are-chosen-per-layer&#34;&gt;How Math Functions and Operations Are Chosen per Layer &lt;a href=&#34;#how-math-functions-and-operations-are-chosen-per-layer&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Most layer choices - like activation functions and normalization methods - come from empirical success rather than deep theoretical mandates. Researchers often start with known-good defaults (like multi-headed self-attention layers followed by feed-forward networks) and tweak hyperparameters (manual config tuning variables) experimentally.&lt;/p&gt;
&lt;p&gt;Activation functions like ReLU ($max(0, x)$) or GELU (Gaussian Error Linear Unit, smoothly approximating ReLU while preserving gradient flow around zero) became standard because they reliably avoid issues like vanishing gradients and facilitate stable, deep network training. Layer normalization stabilizes activations and improves convergence. Essentially, the field iteratively tests and adopts techniques proven stable, efficient, and effective.&lt;/p&gt;
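&lt;p&gt;Both activations are one-liners in numpy - here with the common tanh approximation of GELU (the variant popularized by GPT-2&amp;rsquo;s reference code):&lt;/p&gt;

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def gelu(x):
    # tanh approximation of GELU: smooth near zero, approaches ReLU for large x
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

xs = np.array([-2.0, -0.1, 0.0, 0.1, 2.0])
print(relu(xs))  # negatives clipped hard to 0
print(gelu(xs))  # small negatives pass through slightly, no hard corner
```

&lt;p&gt;The smoothness around zero is the gradient-flow point above: GELU never has ReLU&amp;rsquo;s hard kink, so gradients near zero stay informative.&lt;/p&gt;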
&lt;p&gt;Which means, yes, they are doing precisely what they&amp;rsquo;re training the models to do: iteratively guess and correct via feedback, like a smart Monte Carlo system. One wonders why they don&amp;rsquo;t make an ML model designed for optimizing models. :)&lt;/p&gt;
&lt;h3 id=&#34;how-did-transformers-solve-nlps-combinatorial-explosion-problem&#34;&gt;How Did Transformers Solve NLP&amp;rsquo;s Combinatorial Explosion Problem? &lt;a href=&#34;#how-did-transformers-solve-nlps-combinatorial-explosion-problem&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Natural language processing is combinatorially explosive because each word&amp;rsquo;s meaning depends heavily on context - leading to immense complexity in older models like recurrent neural networks (RNNs). RNNs maintained state through sequential steps, causing vanishing/exploding gradient issues and poor long-range dependencies.&lt;/p&gt;
&lt;p&gt;Transformers fixed this by dropping sequential processing entirely, using attention mechanisms instead. Attention directly compares each token with all others simultaneously (O(n²) complexity, but highly parallelizable on GPUs), effectively solving context resolution directly: each token can &amp;ldquo;see&amp;rdquo; every other, dynamically adjusting weights to decode meaning based on global relationships rather than linear order. Thus, transformers efficiently capture complex contexts, eliminating the sequential bottleneck.&lt;/p&gt;
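&lt;p&gt;That O(n²) term is easy to feel with a few numbers - the attention score matrix grows with the square of the context:&lt;/p&gt;

```python
# Pairwise token comparisons (attention scores) grow quadratically with context.
for n in (1_000, 8_000, 128_000):
    print(f"context {n:,}: {n * n:,} score entries per head, per layer")
```

&lt;p&gt;Going from a 1k context to a 128k context multiplies that matrix by 16,384.&lt;/p&gt;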
&lt;p&gt;Here&amp;rsquo;s another &lt;a href=&#34;https://www.youtube.com/watch?v=UKcWu1l_UNw&#34;&gt;video&lt;/a&gt; with some complementary historical context and info. It explains, for instance, how bigger LLM models are better, despite conventional wisdom on the subject, among other tidbits. Another shows progress from the perceptron on and what the dropout problem is, and &lt;a href=&#34;https://www.youtube.com/watch?v=l-9ALe3U-Fg&#34;&gt;how it was solved&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;how-didnt-they-solve-it&#34;&gt;How didn&amp;rsquo;t they solve it? &lt;a href=&#34;#how-didnt-they-solve-it&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;There&amp;rsquo;s talk of reactivating nuclear power plants to support these behemoths. They&amp;rsquo;re all brute-force, and &lt;a href=&#34;https://splatlabs.com/posts/how-many-flops/&#34;&gt;it ain&amp;rsquo;t pretty&lt;/a&gt;. Every token of input (the full context window) is &lt;em&gt;repeatedly&lt;/em&gt; smashed against every token in attention, and against the model&amp;rsquo;s full vocabulary (which is more where the &amp;ldquo;parameters&amp;rdquo; number comes into play - not embedding dimensions) - for &lt;em&gt;every&lt;/em&gt; token generated. One. At. A. Time.&lt;/p&gt;
&lt;p&gt;So, while this is cool and it does &lt;em&gt;work&lt;/em&gt;, it still has a lot of room to work &lt;em&gt;well&lt;/em&gt;, or at least more elegantly, maybe to eventually have actually &lt;em&gt;good&lt;/em&gt; models on higher end desktop PCs? That&amp;rsquo;d be cool.&lt;/p&gt;
&lt;h3 id=&#34;the-big-tldr-takeaway&#34;&gt;The big TL;DR Takeaway &lt;a href=&#34;#the-big-tldr-takeaway&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Machine learning isn&amp;rsquo;t nearly as hard as it looks. It&amp;rsquo;s a lot more luck combined with trial and error (and recognizing viable tests to push them forward more) than some sage wisdom of the math nerd collective. You create some 2d arrays, just like numpy does, stack them horizontally (attention heads) and vertically (layers) using pytorch/tensorflow functions, tell it what math-soup to make with some other functions, and hit &amp;ldquo;run.&amp;rdquo; The docs aren&amp;rsquo;t wrong, we just expect something else, apparently. We&amp;rsquo;re used to, you know, &lt;em&gt;writing code&lt;/em&gt;, not making math semi-sentient by sheer brute force. The answer you&amp;rsquo;re looking for is that there is none.&lt;/p&gt;
&lt;p&gt;Look up some open source stuff, like &lt;a href=&#34;https://github.com/deepseek-ai/DeepSeek-V3/&#34;&gt;deepseek-v3&lt;/a&gt; which kind of sucks without the R1 CoT tacked on, but &lt;a href=&#34;https://github.com/deepseek-ai/DeepSeek-R1&#34;&gt;that repo&lt;/a&gt; is mysteriously empty save for some stats and a paper. Still, CoT is demonstrated elsewhere, it&amp;rsquo;s findable. The salient point here is that, if you look through the V3 code, you might wonder &lt;em&gt;where the code is&lt;/em&gt;. Yes, there&amp;rsquo;s something there, a tiny handful of pytorch stuff that looks more concerned with quantization than creating the model - but that&amp;rsquo;s it. That&amp;rsquo;s what I&amp;rsquo;m saying. The &amp;ldquo;code&amp;rdquo; you&amp;rsquo;re looking for is more realistically the &lt;em&gt;training data&lt;/em&gt; itself, which is massive. This isn&amp;rsquo;t quite LISP&amp;rsquo;s &amp;ldquo;code is data, data is code&amp;rdquo;, but it&amp;rsquo;s not far off either, or perhaps a &lt;a href=&#34;https://minikanren.org/&#34;&gt;miniKanren&lt;/a&gt; on steroids. Either way, you don&amp;rsquo;t really &lt;em&gt;write&lt;/em&gt; models, you more &amp;ldquo;breed&amp;rdquo; them with shapes and combinations of matrices and math goo, and that&amp;rsquo;s about it.&lt;/p&gt;
&lt;p&gt;Enjoy.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>How many FLOPs</title>
      <link>https://splatlabs.com/posts/how-many-flops/</link>
      <pubDate>Sun, 04 May 2025 00:00:00 +0000</pubDate>
      
      <guid>https://splatlabs.com/posts/how-many-flops/</guid>
      <description>&lt;h1 id=&#34;why-are-llms-so-power-hungry&#34;&gt;Why are LLMs so Power Hungry? &lt;a href=&#34;#why-are-llms-so-power-hungry&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h1&gt;&lt;p&gt;I knew every token got np.dot()&amp;rsquo;d over every vocab (token) array of embeddings-length floats etc, but I was curious just &lt;em&gt;how many calculations&lt;/em&gt; an LLM does &lt;em&gt;per token&lt;/em&gt;, on average. I asked GPT4o, then Grok. Their answers were essentially the same, but Grok&amp;rsquo;s is a bit more detailed, so I&amp;rsquo;m pasting it in full here just because it&amp;rsquo;s a pretty intense read and I&amp;rsquo;m sure I&amp;rsquo;m not the only one that&amp;rsquo;s wondered - so I&amp;rsquo;ll save you the trouble of futzing around and comparing notes to find it. :)&lt;/p&gt;
&lt;p&gt;Feel free to &lt;a href=&#34;#answer&#34;&gt;Skip to the Answer&lt;/a&gt; if you don&amp;rsquo;t care how we got there.&lt;/p&gt;
&lt;p&gt;Also of note I found a blog that has a similarly freaky breakdown of calculating memory and training requirements for transformers &lt;a href=&#34;https://blog.eleuther.ai/transformer-math/&#34;&gt;here&lt;/a&gt; - will hopefully digest and make more use of this info some point, but there&amp;rsquo;s a little &lt;a href=&#34;https://huggingface.co/spaces/hf-accelerate/model-memory-usage&#34;&gt;estimation tool&lt;/a&gt; based on this work too!&lt;/p&gt;
&lt;p&gt;GPT4o, with some prodding, produced this:&lt;/p&gt;
&lt;p&gt;$\text{FLOPs} \approx E \cdot V + L \cdot (8 \cdot E^2 \cdot S + 4 \cdot E \cdot H \cdot S + H \cdot S^2)$&lt;/p&gt;
&lt;h4 id=&#34;what-it-means&#34;&gt;What It Means: &lt;a href=&#34;#what-it-means&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h4&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;E&lt;/strong&gt; = embedding dimension (e.g. 4096)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;V&lt;/strong&gt; = vocabulary size (e.g. 200,000)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;S&lt;/strong&gt; = context size (seq_len, e.g. 2M tokens for 2MB)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;H&lt;/strong&gt; = number of attention heads&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;L&lt;/strong&gt; = number of layers (e.g. 100+ in GPT-4 class)&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id=&#34;what-each-term-represents&#34;&gt;What Each Term Represents: &lt;a href=&#34;#what-each-term-represents&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h4&gt;&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;E⋅V&lt;/code&gt; → one giant matrix multiply at the &lt;strong&gt;final output projection&lt;/strong&gt; (dot against every token in vocab)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;L⋅(...)&lt;/code&gt; → repeated for every layer:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;8⋅E²⋅S&lt;/code&gt; → the two-layer MLPs (most of the FLOPs!)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;4⋅E⋅H⋅S&lt;/code&gt; → Q/K/V dot products&lt;/li&gt;
&lt;li&gt;&lt;code&gt;H⋅S²&lt;/code&gt; → self-attention dot(Q,K) → softmax&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
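&lt;p&gt;The same formula as a plug-in-your-own-numbers helper (same symbols as above, with illustrative values plugged in):&lt;/p&gt;

```python
def flops_per_token(E, V, S, H, L):
    # E*V: final output projection against the whole vocab.
    # Per layer: 8*E^2*S for the MLPs, 4*E*H*S for Q/K/V, H*S^2 for dot(Q,K).
    return E * V + L * (8 * E**2 * S + 4 * E * H * S + H * S**2)

# Illustrative values: embedding 4096, vocab 200k, context 128k, 8 heads, 100 layers.
n = flops_per_token(E=4096, V=200_000, S=128_000, H=8, L=100)
print(f"{n:,}")  # 1,732,772,659,200,000
```

&lt;p&gt;Note the MLP term ($8 \cdot E^2 \cdot S$) dominates everything else by a couple orders of magnitude at these sizes.&lt;/p&gt;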
&lt;h4 id=&#34;and-yeah-this-is-why&#34;&gt;And Yeah, This Is Why: &lt;a href=&#34;#and-yeah-this-is-why&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h4&gt;&lt;ul&gt;
&lt;li&gt;Large &lt;strong&gt;context windows&lt;/strong&gt; are devastating (&lt;code&gt;S²&lt;/code&gt; term)&lt;/li&gt;
&lt;li&gt;Big &lt;strong&gt;embedding sizes&lt;/strong&gt; blow up &lt;code&gt;E²⋅S&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Massive &lt;strong&gt;vocab sizes&lt;/strong&gt; make &lt;code&gt;E⋅V&lt;/code&gt; crazy at inference&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;I ran this through a repl, filling in the vars with what &lt;em&gt;seemed reasonable&lt;/em&gt; (E=4096, V=200,000, L=100, S=128,000, H=8) and got:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;&#34;&gt;&lt;code class=&#34;language-txt&#34; data-lang=&#34;txt&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&amp;gt;&amp;gt;&amp;gt; 4096 * 200000 + 100 * (8 * (4096*4096) *
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;128000 + 4 * 4096 * 8 * 128000 + 8 * (128000*128000))
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;1732772659200000
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;which, in words, is: one quadrillion seven hundred thirty-two trillion seven hundred seventy-two billion six hundred fifty-nine million two hundred thousand. Per token.&lt;/p&gt;
&lt;p&gt;Ok, I know they measure these calculations in MFLOPs, GFLOPs, TFLOPs(mega, giga, tera, just like bits/bytes etc) for reasons, but seriously? And for &lt;strong&gt;every&lt;/strong&gt; token? That&amp;rsquo;s insane, can&amp;rsquo;t be right. So I asked Grok to see if GPT4o was just confused. I briefly considered doing the math on approximately how many tokens its response was and how many FLOPs it&amp;rsquo;d have taken, but I&amp;rsquo;m not &lt;em&gt;quite&lt;/em&gt; that masochistic.&lt;/p&gt;
&lt;hr&gt;
&lt;h1 id=&#34;groks-take&#34;&gt;Grok&amp;rsquo;s take &lt;a href=&#34;#groks-take&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h1&gt;&lt;p&gt;Estimating the FLOPs (floating-point operations) per token generation for a large language model (LLM) like GPT-4o with a 128k context window involves several factors, including the model&amp;rsquo;s architecture, parameter count, and the computational complexity of the transformer-based operations.&lt;/p&gt;
&lt;p&gt;Since exact details about GPT-4o’s architecture (e.g., parameter count) are not publicly disclosed by OpenAI, we’ll rely on general principles from transformer models, scaling laws, and available literature to provide an informed estimate. I’ll also use insights from the provided web results where relevant.&lt;/p&gt;
&lt;h2 id=&#34;key-assumptions-and-methodology&#34;&gt;Key Assumptions and Methodology &lt;a href=&#34;#key-assumptions-and-methodology&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h2&gt;&lt;hr&gt;
&lt;h4 id=&#34;model-architecture&#34;&gt;Model Architecture: &lt;a href=&#34;#model-architecture&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;GPT-4o is a multimodal transformer-based model, likely a decoder-only architecture similar to previous GPT models. For FLOPs estimation, we focus on the text generation component, as this is the primary task for token generation.&lt;/p&gt;
&lt;h4 id=&#34;parameter-count&#34;&gt;Parameter Count: &lt;a href=&#34;#parameter-count&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;OpenAI does not disclose GPT-4o’s parameter count, but estimates for GPT-4 suggest it could be in the range of 1.8 trillion parameters, possibly using a Mixture of Experts (MoE) architecture with 16 experts, each around 111 billion parameters.&lt;/p&gt;
&lt;p&gt;For simplicity, we’ll assume GPT-4o is a dense model with a parameter count on the order of 100 billion to 1 trillion parameters, as MoE models reduce effective FLOPs per token due to sparse computation.&lt;/p&gt;
&lt;h4 id=&#34;flops-formula&#34;&gt;FLOPs Formula: &lt;a href=&#34;#flops-formula&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;According to OpenAI’s scaling laws, the forward pass of a decoder-only transformer involves approximately $2N$ FLOPs per token, where $N$ is the number of non-embedding parameters.&lt;/p&gt;
&lt;p&gt;For a sequence length $S$ (context window), the FLOPs per token generation scales with $S$, as the attention mechanism computes interactions across the context.&lt;/p&gt;
&lt;p&gt;The formula is roughly:&lt;/p&gt;
&lt;p&gt;$\text{FLOPs per token} \approx 2 \times N \times S$,
where $S$ is the context length (128,000 tokens in this case).&lt;/p&gt;
&lt;h4 id=&#34;context-length&#34;&gt;Context Length: &lt;a href=&#34;#context-length&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;GPT-4o supports a 128k token context window, meaning $S = 128,000$.&lt;/p&gt;
&lt;p&gt;Generation Phase: During generation, the model processes the entire context for each new token, as it attends to all previous tokens in the sequence. This makes the FLOPs per token dependent on the full context length.&lt;/p&gt;
&lt;h2 id=&#34;step-by-step-estimation&#34;&gt;Step-by-Step Estimation &lt;a href=&#34;#step-by-step-estimation&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h2&gt;&lt;hr&gt;
&lt;h3 id=&#34;estimate--n-&#34;&gt;Estimate $N$: &lt;a href=&#34;#estimate--n-&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Let’s assume GPT-4o has $N = 100 \text{ billion} = 10^{11}$ parameters as a conservative estimate for a dense model.&lt;/p&gt;
&lt;p&gt;If it’s an MoE model, the effective parameters per forward pass are lower (e.g., 2 experts at 111 billion each, or ~222 billion total, but only a fraction are used per token).&lt;/p&gt;
&lt;p&gt;For a high-end estimate, we’ll also consider $N = 1 \ \text{trillion} = 10^{12}$ parameters, closer to GPT-4’s rumored scale.&lt;/p&gt;
&lt;p&gt;Context Length $S$: $S = 128,000 = 1.28 \times 10^5$&lt;/p&gt;
&lt;h4 id=&#34;flops-calculation&#34;&gt;FLOPs Calculation: &lt;a href=&#34;#flops-calculation&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;Using the formula:
$\text{FLOPs per token} \approx 2 \times N \times S$&lt;/p&gt;
&lt;p&gt;For: $N = 10^{11}$&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;FLOPs $\approx 2 \times 10^{11} \times 1.28 \times 10^5 = 2.56 \times 10^{16}$&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So, approximately 25.6 quadrillion FLOPs per token, or $2.56 \times 10^{16}$.&lt;/p&gt;
&lt;p&gt;For: $N = 10^{12}$&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;$\text{FLOPs} \approx 2 \times 10^{12} \times 1.28 \times 10^5 = 2.56 \times 10^{17}$&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So, approximately 256 quadrillion FLOPs per token, or $2.56 \times 10^{17}$.&lt;/p&gt;
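&lt;p&gt;Both numbers drop straight out of the $2 \times N \times S$ rule - a one-liner to check:&lt;/p&gt;

```python
# FLOPs per token under the ~2*N*S approximation, for a 128k context.
S = 128_000
for N in (10**11, 10**12):
    print(f"N = {N:.0e}: about {2 * N * S:.2e} FLOPs per token")
```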
&lt;h4 id=&#34;adjustments-for-real-world-efficiency&#34;&gt;Adjustments for Real-World Efficiency: &lt;a href=&#34;#adjustments-for-real-world-efficiency&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;The $2N \times S$ formula assumes all operations are matrix multiplications, but transformers also include attention, layer normalization, and other operations that add overhead. These are typically less computationally intensive but require memory access, reducing effective throughput.&lt;/p&gt;
&lt;p&gt;For MoE models, only a subset of parameters (e.g., 2 out of 16 experts) is used per token, reducing FLOPs. If GPT-4o is MoE with 1.8 trillion total parameters, the effective parameters per token might be ~222 billion, yielding FLOPs closer to the $N = 10^{11}$ case.&lt;/p&gt;
&lt;p&gt;Real-world GPU utilization is often 30–50% of peak performance due to memory bottlenecks and distributed training inefficiencies. This doesn’t directly affect the theoretical FLOPs but impacts practical inference costs.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;final-estimate&#34;&gt;Final Estimate &lt;a href=&#34;#final-estimate&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h2&gt;&lt;h4 id=&#34;conservative-estimate&#34;&gt;Conservative Estimate: &lt;a href=&#34;#conservative-estimate&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;Assuming a dense model with 100 billion parameters or an MoE model with effective parameters in that range, GPT-4o requires approximately 25.6 quadrillion FLOPs per token, or $2.56 \times 10^{16}$ for a 128k context.&lt;/p&gt;
&lt;h4 id=&#34;high-end-estimate&#34;&gt;High-End Estimate: &lt;a href=&#34;#high-end-estimate&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;For a dense model with 1 trillion parameters, it’s around 256 quadrillion FLOPs per token: $2.56 \times 10^{17}$.&lt;/p&gt;
&lt;h4 id=&#34;likely-range&#34;&gt;Likely Range: &lt;a href=&#34;#likely-range&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;Given GPT-4o’s multimodal nature and potential MoE architecture, the FLOPs per token is likely closer to the lower end, around 20–50 quadrillion FLOPs per token, but could approach 100 quadrillion if the model is denser or less sparse.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;validation-and-context&#34;&gt;Validation and Context &lt;a href=&#34;#validation-and-context&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h2&gt;&lt;h4 id=&#34;comparison-with-literature&#34;&gt;Comparison with Literature: &lt;a href=&#34;#comparison-with-literature&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;The Hugging Face forum post confirms the $2N \times S$ formula for decoder-only transformers, aligning with our calculation. For smaller models like GPT-3 (175 billion parameters) with a 2k context, FLOPs per token is on the order of
$7 \times 10^{11}$, scaling linearly with context and parameters.&lt;/p&gt;
&lt;h4 id=&#34;inference-cost&#34;&gt;Inference Cost: &lt;a href=&#34;#inference-cost&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;GPT-4o’s API pricing ($2.50 per million input tokens, $10 per million output tokens) suggests high computational costs, consistent with large-scale models requiring significant FLOPs.&lt;/p&gt;
&lt;h4 id=&#34;limitations&#34;&gt;Limitations: &lt;a href=&#34;#limitations&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h4&gt;&lt;p&gt;The exact FLOPs depend on implementation details (e.g., quantization, optimized attention mechanisms like multi-query attention). Additionally, the 128k context is the maximum, and real-world usage often involves shorter contexts, reducing FLOPs proportionally.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;answer&#34;&gt;Answer &lt;a href=&#34;#answer&#34; class=&#34;anchor&#34;&gt;🔗&lt;/a&gt;&lt;/h2&gt;&lt;p&gt;For an LLM like GPT-4o with a 128k context window, the FLOPs per token generation is approximately 20–50 quadrillion FLOPs ($2 \times 10^{16}$ to $5 \times 10^{16}$), assuming an effective parameter count of 100–200 billion (dense or MoE-adjusted).&lt;/p&gt;
&lt;p&gt;If GPT-4o is a denser model with 1 trillion parameters, it could reach &lt;strong&gt;256 quadrillion FLOPs per token&lt;/strong&gt; ($2.56 \times 10^{17}$). These estimates are based on transformer scaling laws and assume full context utilization.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;ouch.&lt;/p&gt;
</description>
    </item>
    
  </channel>
</rss>
