
The History of NLP

· 2324 words · 11 minutes to read
Categories: ai
Tags: theory

How AI Learned to Talk: From Zork to Autocomplete. 🔗

That’s Natural Language Processing, not the other NLP. :)

Z-Games 🔗

If you’re very old or indulge in ancient novelties, you might’ve played Adventure, the direct inspiration for the most famous text-based game of the modern computing era: Dungeon, better known as Zork. It was ported to the Z-Machine format using ZIL (an early DSL) by Infocom, who released a number of similar games, including The Hitchhiker’s Guide to the Galaxy.

Around this time, computers were starting to be office-desk sized - personal - and finding their way into homes. Games took off in a couple of directions: one was the graphics-plus-text style of Sierra games (like the Leisure Suit Larry series), while the other stayed text-based and became the myriad MUDs, MUCKs, and other MU* games on the simultaneously blossoming internet. A tick later we had online graphical games like Ultima Online to complement the text-based ones, and from there it just kept going.

Why does this matter? I thought this was the history of NLP, not the history of PC gaming? Yeah well, turns out they’re not entirely distinct, though not identical either.

The two areas of technology, gaming and AI (with and without NLP), have been oddly intertwined since the beginning, and still are, in various ways. One reason is that much of it is US-based, and “capitalism”, basically. Academic funding exists, but if you want to actually do something - not just FAFO: run an experiment, write a paper on it, then move on - but continually iterate in the wild, you need a funding model. That model, for decades, was games.

Gaming has driven GPU development - from late-80s i387 math coprocessors to current CUDA cores. CAD and scientific computing went along for the ride, but gaming was the market mover, with the power of the masses (and their wallets). Nvidia leaned in, supporting not just gaming, but deep learning and, more recently, robotics.

So yes, gaming - not academic research - is probably the most influential factor in why we have LLMs today (at least the hardware to run them), and it has been a running (profitable) testbed for AI the whole way. Porn could’ve driven it too, maybe, being one of the most profitable businesses online from the early days - but gaming had harder math problems and higher frame-rate demands, so that’s what did the driving.

From early syntax-limited parser grammars (“kill troll (with the sword)”) to NPCs with autonomous pathfinding to LLM-powered characters with dynamic personas, AI has been evolving inside games since the start.
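
To give a sense of how limited those early parsers were, here’s a toy verb-noun parser in that spirit - the grammar and vocabulary are made up for illustration, not ZIL’s actual implementation:

```python
import re

# Toy adventure-game parser in the early, syntax-limited spirit:
# VERB [the] NOUN [with [the] NOUN]. Vocabulary and grammar are illustrative only.
COMMAND = re.compile(
    r"^(?P<verb>kill|take|open|look)\s+(?:the\s+)?(?P<noun>\w+)"
    r"(?:\s+with\s+(?:the\s+)?(?P<tool>\w+))?$"
)

def parse(command: str):
    m = COMMAND.match(command.strip().lower())
    if not m:
        return None  # "I don't understand that."
    return m.group("verb"), m.group("noun"), m.group("tool")

print(parse("kill troll with the sword"))  # ('kill', 'troll', 'sword')
print(parse("xyzzy"))                      # None
```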

Expert Systems and Infobots 🔗

Meanwhile, another text-based medium was also taking off: IRC. Though IRC didn’t use NLP, it quickly became a playground for bots - perhaps the first NPCs of the online world.

Many of us toyed with various ways to make these bots more personable, clever, and useful, but most boiled down to what became formalized as Infobots, aka Expert Systems - or the occasional MegaHAL, because why not; it was the early 90s and we were making it up as we went along, having fun.

They learned, passively and/or actively, by harvesting factoids via simple conversational patterns like “x is y” and storing these in flat files or DBs. Ask “what is x?” and they’d reply, parroting what they’d seen. Think: ELIZA meets Prolog-lite.

This was an organic echo of symbolic reasoning, à la basic Prolog facts and queries - like what_is(Keyword, Answer) - and we could expand them to “where is x” or whatever else we could somewhat reliably match on with regular expressions and the like. Some of us, without knowing it, essentially reinvented logic programming.
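
The whole pattern fits in a few lines. Here’s a minimal sketch in Python; the names, patterns, and storage are illustrative, not any particular bot’s code:

```python
import re

# Toy infobot: passively harvest "x is y" factoids, actively answer "what is x?".
factoids = {}  # the real bots used flat files or simple DBs

LEARN = re.compile(r"^(.+?)\s+is\s+(.+?)[.!]?$", re.IGNORECASE)
QUERY = re.compile(r"^what\s+is\s+(.+?)\??$", re.IGNORECASE)

def handle(line: str):
    # Check queries first, since "what is x?" also looks like an "x is y" statement.
    if m := QUERY.match(line):
        key = m.group(1).strip().lower()
        return f"{key} is {factoids[key]}" if key in factoids else None
    if m := LEARN.match(line):
        factoids[m.group(1).strip().lower()] = m.group(2).strip()
    return None

handle("Zork is a text adventure by Infocom")
print(handle("what is zork?"))  # -> "zork is a text adventure by Infocom"
```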

These bots laid the foundation for (or at least foreshadowed) the IVR hell systems of “press 3 for billing” fame, but they stayed mostly in hobbyist and gaming circles. They required tons of human curation and tended to rot without upkeep.

Still, the groundwork was laid, and people kept slowly but surely trying different strategies to evolve these systems further, often through games, even to this day.

Cryptography, One-Way Hashes, and Context 🔗

Language is inherently cryptographic, or rather steganographic: meaning is hidden in context - sometimes layers deep, each layer carrying useful information - which is one crucial reason NLP is so complex.

Another challenge is combinatorial explosion, and yet another is that people rarely speak with strictly correct grammar (especially these days), but the cryptographic aspect is arguably one of the hardest parts to handle computationally.

To illustrate: if I used the word “computer” a century or two ago, it’d have been assumed I was speaking of a person skilled in mental computation. If I attempted to explain - no, I mean a machine, with a processor, and a screen, and… - it’d just make things worse and lead to more confusion.

Words make sense to us because they’re based on direct experience, starting at birth with our caregivers speaking while pointing, doing, or otherwise indicating what their utterances correlate to.

From these direct, first-order observations, we then learn to generalize and abstract, to reference things we can’t point to in our immediate proximity, or limit references to specific instances of things.

We learn that the can of Pepsi on our desk belongs to the abstract class “Pepsi” and can refer to any other can of Pepsi, which belongs to the category “soda”, then “drink”, “consumable”, and on up until we hit “entity” (if we climb WordNet’s hypernym hierarchy).

Soda can also mean sodium bicarbonate, which is another issue, and we discern which is meant from the context it’s referenced within (sentence, conversation, and environment are all context cues).
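
You can poke at both of these - the hierarchy and the ambiguity - directly with NLTK’s WordNet interface. A minimal sketch, assuming nltk is installed and the wordnet corpus has been downloaded:

```python
from nltk.corpus import wordnet as wn

# Symbol overload: "soda" maps to more than one sense (synset).
for synset in wn.synsets("soda"):
    print(synset.name(), "-", synset.definition())

# Pick one sense and climb its hypernym chain up to the root of the
# noun hierarchy (entity.n.01).
synset = wn.synsets("soda")[0]
while synset.hypernyms():
    print(synset.name())
    synset = synset.hypernyms()[0]
print(synset.name())  # entity.n.01
```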

So we have both symbol overload (word re-use) and the one-way hash or cryptographic element: you can’t derive the original context from a word without some approximately shared experience - something that gives us context by which to recognize what the symbol is referencing, and an existing schema to process it through. Thus experience, and the context it creates in our minds, is the “key” to the cryptographic encoding of language.

Context, it turns out, really is everything. At least when it comes to meaning. Even dictionaries are useless without it. Even the word “meaning” is hilariously ambiguous.

For someone from before our technological era, explaining “computer” wouldn’t be helped by a dictionary or anything else; even a photograph might be suggestive but still confusing, since they’d lack the context to know what the image was even showing.

People blind from birth can’t instantly see - or at least can’t make sense of what they’re seeing - when their blindness is cured medically later in life. There are stories of natives who saw approaching ships but, having no concept of ships, perceived whales, large waves, or nothing at all. The story is of dubious validity, but the underlying point - that perception depends on existing schemas - is well grounded in human psychology, likely deeply anchored in neurology, and a notable factor in human cognition.

Our brains physically adapt, “neurons that fire together, wire together”, and through ongoing feedback loops come to recognize images, sounds, sensations, then symbols, words, and ideas, all building up piece by piece, creating contextual structures and schemas along the way, through which we process the world’s information. So how do we teach computers to do this in an automated, algorithmic way?

That’s a damn good question, and answering it would presumably lead us to AGI - systems that could (the “consciousness” debate aside) think, comprehend, and actually understand on some level.

Ontologies 🔗

There are words for this system of correspondences and abstractions we build up, like “reality model” or, more formally, “ontology”, which may also include aspects such as taxonomy.

So the next big step was realizing the internet wasn’t just running a service we called the web, but could be a web: an ontological semantic network, a distributed system storing the scaffolding needed for computational understanding of language. Some people built it, over decades, and it’s actually damn impressive (if largely invisible until you stumble onto it), if a bit dusty now.

The Wikimedia Foundation is notable here (look at the bottom of its project pages, click around, and you’ll see many links with relationship data like is-a, has-a, has-parts, relates-to, parts of speech, etc.), though Wikipedia is by far the best known of this broader cluster of effort. Projects like ConceptNet and Cyc were laboriously built up, only to be eventually largely abandoned outside of some very niche domains and uses.

The Semantic Web adopted formats such as OWL and RDF that attempted to formalize and encapsulate these knowledge bases in highly structured ways. They still exist, but tend to hide in niches like Linked Open Data (LOD) and medical data (also impressive once you notice it, just very dusty) and some more obscure NLP-related areas.
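
For a flavor of what that structure looks like in practice, here’s a minimal sketch using Python’s rdflib; the example.org namespace and the facts themselves are made up for illustration:

```python
from rdflib import Graph, Literal, Namespace, RDF, RDFS

# A toy slice of an ontology: the is-a / instance-of scaffolding RDF formalizes.
EX = Namespace("http://example.org/")
g = Graph()

g.add((EX.Pepsi, RDFS.subClassOf, EX.Soda))   # is-a relationships
g.add((EX.Soda, RDFS.subClassOf, EX.Drink))
g.add((EX.myCan, RDF.type, EX.Pepsi))         # a specific instance
g.add((EX.myCan, RDFS.label, Literal("the can of Pepsi on my desk")))

print(g.serialize(format="turtle"))
```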

A few larger, actually successful applications were built, in medical diagnostic systems for instance, but they were eventually considered too brittle, too hard to create and maintain, and too difficult to adapt to other domains. The momentum drained and dried up.

It was dead end after dead end, but efforts remained strong for decades - until three or so years ago, when the majority of related GitHub repos and projects seemed to stop being updated, got archived, or became abandonware. This, of course, coincides with the fallout from the monumental breakthrough of “Attention Is All You Need”, which led directly to LLMs as we know them now.

Suddenly we didn’t need to define meaning. We could approximate it statistically.

Some clever math nerds had taken NN-style math-soup and figured out how to get it to talk. A bit later, it became coherent, then actually useful, in a few quick leaps. It was a convergence of the necessary massive hardware finally coming into existence and decades of trial and error in the ontology and NLP fields, finally hitting something that worked - mostly, if costly.

Companies caught wind of this, and invested insanely in it - finally, the AI breakthrough we’d been promised and were waiting for all these years. Ok, so maybe we have to fire back up or build a few new nuclear power plants, maybe math was never meant to think, but heck, it works, so throw everything else out and run with it!

And the insanity began, while decades of slow, deliberate building of vast networks of knowledge were largely abandoned (sans some niches of ongoing exploration) almost overnight. It’s still there though - most of it anyway - in case someone realizes pure math-soup was maybe not the best strategy, or structured knowledge might have some use yet.

What Now? 🔗

The renewed momentum generated by the success of transformer models (and, let’s be realistic, the cash infusion stemming from that) has fresh eyes looking at new ways to solve this decades-old problem. As we know all too well, LLMs “hallucinate”, and all the data on the internet somehow isn’t enough to teach them everything.

Data is messy, people say stupid things a lot, and it takes a ton of data to train a transformer to a coherent level. Add to this the pervasive commercialization, hyperbole, spam, SEO manipulation, misinformation, disinformation, RP and propaganda out there, and it makes a bit of a hostile environment for learning, to say the least. There are many reasons why LLMs are flawed, but this is a big one.

But with this momentum, we’ve got investigation and innovation around holography, photonic computing (potentially replacing GPUs for the big math), and other more analog-style approaches. Going “back” to escape the current local maximum - and move forward again in a better direction - seems like a solid strategy at this point.

What did we leave behind that would be useful now that we’ve solved this other problem? What didn’t work then, but would now, with today’s hardware and resources? The solutions we need may already exist in some dusty, archived GitHub repos - you never know.

Transformers’ underlying structure - essentially “quadrillions of dot products on craploads of matrices of floating-point numbers” - isn’t exactly biased towards accuracy. It’s fuzzy on purpose, since teaching computers to actually think and know things proves a lot harder than it sounds once you dig into it.

It works, but it’s kind of a fluke, honestly. Hofstadter was right: enough structure and recursion breeds emergent behavior. But actual understanding? Even if our brains are also NNs, it’s not quite that simple, apparently.

The reality is, LLMs are a lot more like our “subconscious”, or “System 1” in the Thinking, Fast and Slow sense. What they lack is the “System 2” to complement it - what we often call the “conscious” or “rational” mind, the part that reasons through things slowly - as we often attempt to instruct LLMs to do, or attempt to force through tacked-on CoT systems that give the model a chance at some faux critical thinking by generating purposeful outputs into its context before forming the final response.

When we see a model “thinking”, it’s creating a semi-structured response, most of which consists of “the user is looking for…” and “wait… it could also be…” type sentences. Since it predicts the most likely next token, these lead it to generate varied branches of completions, which are then appended to the existing context window, forcing them into its full contextual math-soup as well - a bit of “priming”, as it were.
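
Mechanically, a tacked-on CoT loop amounts to little more than the sketch below. Everything here - generate(), the prompt wording, the two-step structure - is a hypothetical stand-in, not any vendor’s actual API:

```python
def generate(prompt: str) -> str:
    """Placeholder for a real next-token model call (e.g. an HTTP API request)."""
    raise NotImplementedError

def answer_with_cot(question: str) -> str:
    # Step 1: have the model "think out loud" first.
    reasoning = generate(
        f"Question: {question}\n"
        "Consider what the user is looking for, and what else it could be, "
        "before answering.\nThoughts:"
    )
    # Step 2: append that reasoning back into the context - the "priming" -
    # so the final completion is conditioned on it.
    return generate(
        f"Question: {question}\nThoughts: {reasoning}\nFinal answer:"
    )
```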

It does not, even in these advanced scenarios, “think” in any meaningful sense of the term, or engage any real “System 2” type of mechanism (though I would contend this is not much different from many humans, but that, again, is another article). :)

That said, the big cash infusion? Pretty much all of it went to big LLMs and commercial ventures. Whoever finds the solution to these big problems may be riding the excitement wave, but it’s more likely to be some random individual leafing through dusty archives, driven purely by insatiable curiosity, without any expectation of reward. That’s how these things tend to work, after all.
