
How many FLOPs

Β· 1315 words Β· 7 minutes to read
Categories: ai
Tags: llm math-soup

Why are LLMs so Power Hungry? πŸ”—

I knew every token got np.dot()’d against every vocab (token) array of embedding-length floats and so on, but I was curious just how many calculations an LLM does per token, on average. I asked GPT4o, then Grok. Their answers were essentially the same, but Grok’s is a bit more detailed, so I’m pasting it in full here, just because it’s a pretty intense read and I’m sure I’m not the only one who’s wondered - so I’ll save you the trouble of futzing around and comparing notes to find it. :)

Feel free to Skip to the Answer if you don’t care how we got there.

Also of note: I found a blog with a similarly freaky breakdown of memory and training requirements for transformers here - I’ll hopefully digest and make more use of this info at some point, but there’s a little estimation tool based on that work too!

GPT4o, with some prodding, produced this:

$\text{FLOPs} \approx E \cdot V + L \cdot (8 \cdot E^2 \cdot S + 4 \cdot E \cdot H \cdot S + H \cdot S^2)$

What It Means: πŸ”—

  • E = embedding dimension (e.g. 4096)
  • V = vocabulary size (e.g. 200,000)
  • S = context size (seq_len, e.g. 2M tokens for 2MB)
  • H = number of attention heads
  • L = number of layers (e.g. 100+ in GPT-4 class)

What Each Term Represents: πŸ”—

  • Eβ‹…V β†’ one giant matrix multiply at the final output projection (dot against every token in vocab)

  • Lβ‹…(...) β†’ repeated for every layer:

    • 8β‹…EΒ²β‹…S β†’ the two-layer MLPs (most of the FLOPs!)
    • 4β‹…Eβ‹…Hβ‹…S β†’ Q/K/V dot products
    • Hβ‹…SΒ² β†’ self-attention dot(Q,K) β†’ softmax

And Yeah, This Is Why: πŸ”—

  • Large context windows are devastating (SΒ² term)
  • Big embedding sizes blow up EΒ²β‹…S
  • Massive vocab sizes make Eβ‹…V crazy at inference

I ran this through a repl, filling in the vars with what seemed reasonable and got:

>>> 4096 * 200000 + 100 * (8 * (4096*4096) * 128000 + 4 * 4096 * 128000 + 8 * (128000*128000))
1731304652800000

which, in words, is: one quadrillion seven hundred thirty-one trillion three hundred four billion six hundred fifty-two million eight hundred thousand. Per token.
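Here’s the same back-of-envelope math as a tiny script, with each term named after the piece of the formula it came from - a rough sketch only, and note that H = 8 heads is an assumption on my part (the repl line above only used it in the SΒ² term). It lands at roughly the same ~1.7 quadrillion:

# rough sketch of the formula above; H = 8 is an assumed value
E, V, S, H, L = 4096, 200_000, 128_000, 8, 100

out_proj = E * V            # EΒ·V: final projection over the vocab
mlp      = 8 * E * E * S    # 8Β·EΒ²Β·S: two-layer MLPs (the bulk of it)
qkv      = 4 * E * H * S    # 4Β·EΒ·HΒ·S: Q/K/V dot products
attn     = H * S * S        # HΒ·SΒ²: dot(Q,K) β†’ softmax
print(f"{out_proj + L * (mlp + qkv + attn):,}")   # β‰ˆ 1.73e15 FLOPs per token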

Ok, I know they measure these calculations in MFLOPs, GFLOPs, TFLOPs (mega, giga, tera, just like bits/bytes, etc.) for reasons, but seriously? And for every token? That’s insane, it can’t be right. So I asked Grok to see if GPT4o was just confused. I briefly considered doing the math on approximately how many tokens its response was and how many FLOPs it’d have taken, but I’m not quite that masochistic.


Grok’s take πŸ”—

Estimating the FLOPs (floating-point operations) per token generation for a large language model (LLM) like GPT-4o with a 128k context window involves several factors, including the model’s architecture, parameter count, and the computational complexity of the transformer-based operations.

Since exact details about GPT-4o’s architecture (e.g., parameter count) are not publicly disclosed by OpenAI, we’ll rely on general principles from transformer models, scaling laws, and available literature to provide an informed estimate. I’ll also use insights from the provided web results where relevant.

Key Assumptions and Methodology πŸ”—


Model Architecture: πŸ”—

GPT-4o is a multimodal transformer-based model, likely a decoder-only architecture similar to previous GPT models. For FLOPs estimation, we focus on the text generation component, as this is the primary task for token generation.

Parameter Count: πŸ”—

OpenAI does not disclose GPT-4o’s parameter count, but estimates for GPT-4 suggest it could be in the range of 1.8 trillion parameters, possibly using a Mixture of Experts (MoE) architecture with 16 experts, each around 111 billion parameters.

For simplicity, we’ll assume GPT-4o is a dense model with a parameter count on the order of 100 billion to 1 trillion parameters, as MoE models reduce effective FLOPs per token due to sparse computation.

FLOPs Formula: πŸ”—

According to OpenAI’s scaling laws, the forward pass of a decoder-only transformer involves approximately $2N$ FLOPs per token, where $N$ is the number of non-embedding parameters.

For a sequence length $S$ (context window), the FLOPs per token generation scales with $S$, as the attention mechanism computes interactions across the context.

The formula is roughly:

$\text{FLOPs per token} \approx 2 \times N \times S$, where $S$ is the context length (128,000 tokens in this case).

Context Length: πŸ”—

GPT-4o supports a 128k token context window, meaning $S = 128,000$.

Generation Phase: During generation, the model processes the entire context for each new token, as it attends to all previous tokens in the sequence. This makes the FLOPs per token dependent on the full context length.

Step-by-Step Estimation πŸ”—


Estimate $N$: πŸ”—

Let’s assume GPT-4o has $N = 100\ \text{billion} = 10^{11}$ parameters as a conservative estimate for a dense model.

If it’s an MoE model, the effective parameters per forward pass are lower (e.g., 2 experts at 111 billion each, or ~222 billion total, but only a fraction are used per token).

For a high-end estimate, we’ll also consider $N = 1 \ \text{trillion} = 10^{12}$ parameters, closer to GPT-4’s rumored scale.

Context Length: $S = 128,000 = 1.28 \times 10^5$

FLOPs Calculation: πŸ”—

Using the formula $\text{FLOPs per token} \approx 2 \times N \times S$:

For $N = 10^{11}$:

  • FLOPs $\approx 2 \times 10^{11} \times 1.28 \times 10^5 = 2.56 \times 10^{16}$

So, approximately 25.6 quadrillion FLOPs per token, or $2.56 \times 10^{16}$.

For $N = 10^{12}$:

  • $\text{FLOPs} \approx 2 \times 10^{12} \times 1.28 \times 10^5 = 2.56 \times 10^{17}$

So, approximately 256 quadrillion FLOPs per token, or $2.56 \times 10^{17}$.
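As a quick sanity check on that arithmetic, here’s the same $2 \times N \times S$ formula with the same values in a couple of lines of Python:

# checking the 2 Γ— N Γ— S estimates for both parameter counts
S = 128_000                        # context length
for N in (1e11, 1e12):             # 100 billion and 1 trillion parameters
    print(f"N = {N:.0e}: {2 * N * S:.2e} FLOPs per token")
# N = 1e+11: 2.56e+16 FLOPs per token
# N = 1e+12: 2.56e+17 FLOPs per token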

Adjustments for Real-World Efficiency: πŸ”—

The $2N \times S$ formula assumes all operations are matrix multiplications, but transformers also include attention, layer normalization, and other operations that add overhead. These are typically less computationally intensive but require memory access, reducing effective throughput.

For MoE models, only a subset of parameters (e.g., 2 out of 16 experts) is used per token, reducing FLOPs. If GPT-4o is MoE with 1.8 trillion total parameters, the effective parameters per token might be ~222 billion, yielding FLOPs closer to the $N = 10^{11}$ case.

Real-world GPU utilization is often 30–50% of peak performance due to memory bottlenecks and distributed training inefficiencies. This doesn’t directly affect the theoretical FLOPs but impacts practical inference costs.


Final Estimate πŸ”—

Conservative Estimate: πŸ”—

Assuming a dense model with 100 billion parameters or an MoE model with effective parameters in that range, GPT-4o requires approximately 25.6 quadrillion FLOPs per token, or $2.56 \times 10^{16}$ for a 128k context.

High-End Estimate: πŸ”—

For a dense model with 1 trillion parameters, it’s around 256 quadrillion FLOPs per token: $2.56 \times 10^{17}$.

Likely Range: πŸ”—

Given GPT-4o’s multimodal nature and potential MoE architecture, the FLOPs per token is likely closer to the lower end, around 20–50 quadrillion FLOPs per token, but could approach 100 quadrillion if the model is denser or less sparse.


Validation and Context πŸ”—

Comparison with Literature: πŸ”—

The Hugging Face forum post confirms the $2N \times S$ formula for decoder-only transformers, aligning with our calculation. For smaller models like GPT-3 (175 billion parameters) with a 2k context, FLOPs per token is on the order of $7 \times 10^{11}$, scaling linearly with context and parameters.

Inference Cost: πŸ”—

GPT-4o’s API pricing ($2.50 per million input tokens, $10 per million output tokens) suggests high computational costs, consistent with large-scale models requiring significant FLOPs.

Limitations: πŸ”—

The exact FLOPs depend on implementation details (e.g., quantization, optimized attention mechanisms like multi-query attention). Additionally, the 128k context is the maximum, and real-world usage often involves shorter contexts, reducing FLOPs proportionally.


Answer πŸ”—

For an LLM like GPT-4o with a 128k context window, the FLOPs per token generation is approximately 20–50 quadrillion FLOPs ($2 \times 10^{16}$ to $5 \times 10^{16}$), assuming an effective parameter count of 100–200 billion (dense or MoE-adjusted).

If GPT-4o is a denser model with 1 trillion parameters, it could reach 256 quadrillion FLOPs per token ($2.56 \times 10^{17}$). These estimates are based on transformer scaling laws and assume full context utilization.


ouch.
