Five years ago, Google published the first trillion-param model, Switch Transformer. Since then, many models have been released, the majority of them smaller yet better. We are now back in the trillion range. What took us so long?
It turns out those models needed much more data to justify their size. How much more? About 100x more. That generation of models, like GPT-3 and T5, was trained on billions of tokens instead of trillions. So we spent the last five years playing catch-up on data size while models were shrinking instead of growing. This is no longer the case: last week Alibaba released Qwen 3 Max, a trillion-param model trained on an astonishing 36 trillion tokens.
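For a rough sense of that 100x, here is a back-of-envelope sketch. GPT-3's ~300B training tokens is the publicly reported figure; the 36T figure is the one above. Treat it as an order-of-magnitude check, not a precise accounting:

```python
import math

# Back-of-envelope: how much did training data scale between generations?
gpt3_tokens = 300e9        # ~300B tokens reported for GPT-3 (2020)
qwen3_max_tokens = 36e12   # 36T tokens reported for Qwen 3 Max

ratio = qwen3_max_tokens / gpt3_tokens
print(f"Data scale-up: ~{ratio:.0f}x")                       # ~120x
print(f"Orders of magnitude: {math.log10(ratio):.1f}")       # ~2.1
```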
This is not the first trillion-param model since Switch Transformer. GPT-4/5, Claude, Gemini and Grok are also most likely in that range. It is, though, the first time a provider has openly shared these numbers at that scale. Does this mean models will start getting bigger again?
They are, but at a much slower pace, for two main reasons:
1️⃣ We have almost run out of data. Web, videos, books and audio sum up to roughly 100T tokens, with the exact amount depending on the source.
2️⃣ Instead of scaling model size and data, we are now scaling thinking time (inference compute).
As a side note, and a potential third reason: the scale we operate at now is not that far from either the size of the human brain, with its ~100T “params”, or the data it receives yearly, which is on the 1000T scale. We are only 1 or 2 orders of magnitude away.
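A quick sketch of that gap, using only the rough figures above (the synapse count and yearly sensory input are loose estimates, so the exponents are the point, not the exact values):

```python
import math

# Rough orders-of-magnitude comparison: frontier models vs the human brain.
model_params = 1e12           # ~1T parameters (Qwen 3 Max scale)
brain_synapses = 100e12       # ~100T synapses, the "params" analogy above

model_tokens = 36e12          # 36T training tokens (Qwen 3 Max)
brain_yearly_input = 1000e12  # ~1000T "tokens" of yearly input, the rough estimate above

print(f"Param gap: ~10^{math.log10(brain_synapses / model_params):.0f}")      # ~2 orders
print(f"Data gap:  ~10^{math.log10(brain_yearly_input / model_tokens):.1f}")  # ~1.4 orders
```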
To conclude, I think the trillion-param size is a nice sweet spot at the moment, one that will grow much more slowly, so in the next few years we might see hardware catching up and these models running on our laptops, mobiles and edge devices 🚀
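How far is hardware from that? A minimal weights-only sketch (ignoring KV cache and activations; the precisions and the laptop memory figure are illustrative assumptions, not vendor specs):

```python
# Weights-only memory footprint for a 1T-parameter model at different precisions.
params = 1e12
bytes_per_param = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

for precision, nbytes in bytes_per_param.items():
    gb = params * nbytes / 1e9
    print(f"{precision}: ~{gb:,.0f} GB")  # fp16 ~2,000 GB, int8 ~1,000 GB, int4 ~500 GB

# High-end laptops today carry on the order of 64-128 GB of memory, so even with
# aggressive quantization there is roughly an order of magnitude left to close.
```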