In November 2022, OpenAI launched ChatGPT. In five days, one million users. In two months, a hundred million. The mainstream press discovered AI as if it had just been born. It was about 70 years old.

Understanding this genealogy is not a history exercise for purists. It’s the only way to understand why an AI model needs huge amounts of data to “learn”, why it’s not magic but massive background work, and why current performance is no accident.

Before AI: the algorithm, the real pillar

Before talking about machine learning or deep learning, it’s worth mentioning what most of computer science still rests on: the deterministic algorithm.

An algorithm is a finite sequence of instructions. You give it an input, it follows the rules, it produces an output. Always the same output for the same input. It’s predictable, testable, explainable. Sorting a list, calculating an itinerary, validating an IBAN: all algorithms.
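To make the idea concrete, here is a minimal Python sketch of the IBAN example, using the standard mod-97 checksum. The function name `is_valid_iban` is chosen for illustration, and the sketch assumes that checking the overall length range and the checksum is enough; a production validator would also check country-specific lengths and formats.

```python
def is_valid_iban(iban: str) -> bool:
    """Check an IBAN's mod-97 checksum (illustrative sketch only)."""
    s = iban.replace(" ", "").upper()
    # IBANs are ASCII alphanumeric, between 15 and 34 characters long.
    if not (s.isascii() and s.isalnum() and 15 <= len(s) <= 34):
        return False
    # Step 1: move the country code and check digits to the end.
    rearranged = s[4:] + s[:4]
    # Step 2: replace each letter with a number (A=10, ..., Z=35).
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    # Step 3: the resulting integer must leave a remainder of 1 modulo 97.
    return int(digits) % 97 == 1


# Same input, same output, every time.
print(is_valid_iban("GB82 WEST 1234 5698 7654 32"))
print(is_valid_iban("GB82 WEST 1234 5698 7654 33"))
```

No training data, no parameters, no probabilities: every step can be read, tested and explained.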

AI has not replaced the algorithm. It coexists with it. And in many cases (we’ll come back to this in a dedicated article), a simple algorithm does the job better, cheaper and more explainably than an AI model. But let’s continue the genealogy.

1943-1980: the first bricks, the first disappointments

In 1943, McCulloch and Pitts published the first mathematical model of an artificial neuron. In 1950, Alan Turing posed the seminal question: “Can machines think?” and proposed the test that bears his name.

In 1956, the Dartmouth Conference launched “artificial intelligence” as a field, under the term John McCarthy had coined in the workshop proposal. The optimism was total. Researchers in the field predicted that a machine equivalent to human intelligence would be built within a generation. It did not happen.

The 1970s saw the first “AI winter”: promises outstripped capabilities, funding dried up. A second winter followed in the late 1980s and early 1990s. The history of AI is that of a repeated hype curve, long before Gartner formalized it.

1986-2012: machine learning comes out of the lab

The breakthrough came with gradient backpropagation, popularized by Rumelhart, Hinton and Williams in 1986. The principle: measure the network’s error on examples, then propagate that error backwards, layer by layer, to adjust each weight. This training technique still underpins modern models.
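The principle can be sketched in a few lines of plain Python. This is a toy illustration, not the 1986 formulation: the network size (two inputs, two hidden units, one output), the starting weights, the learning rate and the OR dataset are all assumptions chosen to keep the example small and reproducible.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Toy dataset (assumption): the logical OR function.
DATA = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]


def train(epochs=2000, lr=0.5):
    # Fixed starting weights so the run is reproducible (arbitrary values).
    w_hidden = [[0.5, -0.4], [-0.3, 0.2]]  # w_hidden[j][i]: input i -> hidden unit j
    b_hidden = [0.0, 0.0]
    w_out = [0.3, -0.1]                    # w_out[j]: hidden unit j -> output
    b_out = 0.0
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, target in DATA:
            # Forward pass: inputs -> hidden layer -> output.
            h = [sigmoid(sum(w * xi for w, xi in zip(w_hidden[j], x)) + b_hidden[j])
                 for j in range(2)]
            y = sigmoid(sum(w * hj for w, hj in zip(w_out, h)) + b_out)
            total += 0.5 * (y - target) ** 2
            # Backward pass, output layer first: error times sigmoid derivative.
            delta_out = (y - target) * y * (1.0 - y)
            # Then the hidden layer, reusing the output layer's error signal.
            delta_hidden = [delta_out * w_out[j] * h[j] * (1.0 - h[j]) for j in range(2)]
            # Gradient step on every weight and bias.
            for j in range(2):
                w_out[j] -= lr * delta_out * h[j]
                b_hidden[j] -= lr * delta_hidden[j]
                for i in range(2):
                    w_hidden[j][i] -= lr * delta_hidden[j] * x[i]
            b_out -= lr * delta_out
        losses.append(total / len(DATA))
    return losses


losses = train()
print(f"mean loss, first epoch: {losses[0]:.4f}")
print(f"mean loss, last epoch:  {losses[-1]:.4f}")
```

The error signal computed at the output (`delta_out`) is what the hidden layer uses to compute its own correction (`delta_hidden`): that reuse, repeated layer after layer, is the "backwards" in backpropagation.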

But the machine learning of the 1990s and 2000s was still limited by two constraints: data and computation. Datasets were small. Computers were too slow to train deep networks.

What changed in the 2000s: the Internet generated data on a scale never seen before. And GPUs - originally designed for video games - proved perfectly suited to the matrix calculations of machine learning. Two conditions for take-off.

2012: the AlexNet moment

In September 2012, a deep neural network called AlexNet won the ImageNet challenge by a spectacular margin. It classified images with a top-5 error rate of 15.3%, compared with 26.2% for the runner-up. For the first time, a deep neural network clearly beat all other approaches on a real, large-scale task.

AlexNet used two NVIDIA GTX 580 GPUs, each with 3 GB of memory, to train a network of 60 million parameters on 1.2 million images. Training time: five to six days.

Source: AlexNet, Krizhevsky, Sutskever, Hinton (2012), https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf

This moment is often cited as the beginning of modern deep learning. It established a key principle: “as data and computation increase, performance improves”. This principle shaped everything that followed, right up to today’s LLMs.

2017: Transformer, the missing piece

In 2017, Google researchers published “Attention Is All You Need”. They proposed a new architecture: the Transformer. Instead of processing text sequentially (word by word), it processes the whole sequence in parallel, with an attention mechanism that weights the relationship between each token and all the others.

Two decisive advantages: it trains much faster on GPUs, and it better captures long-distance dependencies in the text. This is the architecture behind today’s major language models: GPT, Claude, Llama, Mistral.
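The attention mechanism itself fits in a short sketch. What follows is a simplified, pure-Python version of scaled dot-product attention. It assumes the token vectors are used directly as queries, keys and values; a real Transformer first passes them through learned projections and runs several attention heads in parallel. The three two-dimensional token vectors are made up for illustration.

```python
import math


def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors (illustrative sketch)."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this token against every token in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        # Turn the scores into weights that sum to 1.
        weights = softmax(scores)
        # The output is the weighted average of all value vectors.
        out = [sum(w * v[c] for w, v in zip(weights, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three hypothetical token vectors standing in for a three-token sequence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights depends only on its own query, so no row has to wait for another. On a GPU all rows are computed at once as a single matrix product, which is where the training speed-up comes from.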

Why a model needs a base to learn from

This is the point that the prevailing discourse systematically ignores.

A model doesn’t learn in a vacuum. It learns from data. Lots and lots of data. Today’s major language models are trained on hundreds of billions to trillions of tokens: books, articles, code, web pages. This training phase represents millions of GPU hours and tens, sometimes hundreds of millions of dollars.

The model encodes the statistical patterns of this data in its parameters. It knows how to conjugate because it has seen millions of examples of correct conjugation. It knows how to summarize because it has seen millions of text/summary pairs. It can code because public repositories such as GitHub account for a large share of the training data.
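A deliberately tiny analogy shows what “encoding statistical patterns” means. The sketch below counts which word follows which in a made-up corpus, then predicts the most frequent follower. It is a bigram counter, not a neural network, and it is only meant to show that the “knowledge” is nothing more than statistics extracted from the training text. The corpus and the function names are assumptions for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical training corpus, already tokenized by whitespace.
CORPUS = "the cat sat on the mat . the cat ate the fish . the dog sat on the rug ."


def train_bigrams(text):
    """Count, for every word, which words follow it and how often."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower seen in training, or None if unseen."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]


model = train_bigrams(CORPUS)
print(predict_next(model, "sat"))    # "sat" was always followed by "on"
print(predict_next(model, "zebra"))  # never seen in training: no prediction
```

The model has something to say about “sat” because it saw that word in training, and nothing to say about “zebra” because it did not. An LLM generalizes far better than this, but the dependence on what was in the training data is the same.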

What the prevailing discourse calls “intelligence” or “learning ability” is really the ability to generalize from this training data to new, similar situations. This is useful. But it’s fundamentally different from continuous, adaptive learning like that of a human.

2022: why ChatGPT changed perception, not technology

ChatGPT is not a technological breakthrough. GPT-3, the model family it builds on, has been around since 2020; ChatGPT launched on GPT-3.5, a fine-tuned successor. What changed in 2022 was the interface: a natural conversation, accessible to all, without having to write code or read documentation. And a mass-market launch policy.

The effect was massive: for the first time, hundreds of millions of people interacted directly with an LLM. Perception shifted. AI was no longer the preserve of data scientists and researchers. It was in everyone’s hands.

But technically, what ChatGPT did in November 2022, GPT-3 was already doing in 2020, with less finesse. The break is one of distribution and interface, not of technical paradigm.

What knowing this changes

Understanding this genealogy changes three concrete things:

  1. **Today’s performance comes at a cost.** It didn’t come out of a hat. Billions of parameters, billions of tokens, months of computation. This cost is reflected in your API prices and your carbon footprint.

  2. **Hallucinations, temporal cut-off, dependence on training data: these are not teething problems.** They are properties of today’s architecture.

  3. **The next hype cycle exists.** The history of AI is cyclical. Today’s foundations are solid, but promises regularly outstrip achievements. That’s no reason not to invest. It’s a reason not to buy at the peak of the hype.