
There were rumors floating around that GPT-4 was going to be a 100 trillion parameter model. Those rumors seemed ridiculous in hindsight, but this announcement makes me rethink how ridiculous it really was. 100 Terabytes of GPU memory is exactly what you need to train that class of model.

However, I’m not even sure enough text data exists in the world to saturate 100T parameters. Maybe if you generated massive quantities of text with GPT-4 and used that dataset as your pre-training data. Training on the entirety of the internet then becomes just another fine-tuning step. The bulk of the training could be on some 400TB dataset of generated text.



Rule of thumb is that you need ~20 tokens per parameter. The average token size is ~4 characters, probably more for larger models where you want a larger dictionary, but for simplicity I'll say it's 5 bytes to make the numbers round. So you need ~100 bytes of text data per parameter, or 10 PB for a 100T model. Now, recent research says you can reuse the same data about 4 times before it starts hindering performance, but that doesn't help much in our case.
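A quick back-of-envelope check of those numbers (a sketch; the 20 tokens/param and ~5 bytes/token figures are the assumptions above):

    params = 100e12                # 100T parameters
    tokens_per_param = 20          # Chinchilla-style rule of thumb
    bytes_per_token = 5            # ~4-5 characters per token
    dataset_bytes = params * tokens_per_param * bytes_per_token
    print(dataset_bytes / 1e15, "PB")   # -> 10.0 PB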

But in this case what is really ridiculous is the compute requirement. The compute required for an optimally trained model grows roughly quadratically with its size (both your parameter count and your data grow linearly). So for a 100T model you need ~1e30 FLOPs. This machine gives you 1e18 FLOPs per second, so it would take ~30k years to train the model on one of these (or 30k of these to train it in a year, but then utilization starts kicking in).
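The compute figure lines up with the standard C ≈ 6·N·D training-FLOPs estimate (a sketch; the 6·N·D approximation and the 20 tokens/param ratio are assumptions):

    params = 100e12
    tokens = 20 * params
    train_flops = 6 * params * tokens      # ~1.2e30 FLOPs
    machine_flops_per_s = 1e18             # 1 exaFLOP/s, the figure above
    years = train_flops / machine_flops_per_s / 3.15e7
    print(years)                           # -> ~38,000 years, same order as the ~30k above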


"The best time to start training a 100T param model was 30k years ago, the second best time is now."


Probably the best time to start training a 100T model is never


What if you could train an AI with a desired outcome for its answers?

E.g.: "answer this question where the outcome is the most beneficial to quality of life"


I'll take your question further: what if we had unlimited data (say some crazy rich RL environment or a way to produce high-quality, diverse synthetic data)? You still have to get those 1e30 FLOPs. Let's say you can connect 100 of these bad boys together at 40% utilization, for a total of 4e19 FLOPs/s. Assume also that Moore's law keeps working indefinitely. When should we start training the 100T model to get it done as early as possible? We wait x years and then start training on a machine with 4e19 * 2^(x/2) FLOPs/s. Turns out the answer is ~16 years, after which we'll have ~1e22 FLOPs/s and the 1e30 FLOPs will take another ~3 years.
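A sketch of that wait-vs-train trade-off (assuming the 4e19 FLOP/s starting rate and a 2-year doubling time, as stated):

    TARGET_FLOPS = 1e30
    BASE_RATE = 4e19                 # 100 machines at 40% utilization
    SEC_PER_YEAR = 3.15e7

    def total_years(wait):
        # wait `wait` years, then train at the Moore's-law-improved rate
        rate = BASE_RATE * 2 ** (wait / 2)
        return wait + TARGET_FLOPS / rate / SEC_PER_YEAR

    best = min(range(40), key=total_years)
    print(best, round(total_years(best), 1))   # -> 16 years of waiting, ~19 years total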


> life

A strange game. The only winning move is not to play. unplugs self


A properly designed AI agent would do exactly that.


That's obviously false under the assumption that computing power will keep increasing as it has in the past.


For the uninitiated, it's a tree planting quote.


I am not going to be embarrassed about the following question:

Please ELI5 where I can find a glossary of AI/ML terms - where do I get fluent in talking about tokens, models, training, parameters, etc.?

Please don't be snarky - this is info that everyone younger than I am needs as well.

Is there a canon? Where is it?


At the risk of sounding snarky, https://chat.openai.com would be a good introduction, followed by books, which GPT could recommend.


I have no idea tbh. I learned these a while ago (~7 years ago), and the materials I used then are heavily outdated, and I couldn't remember what they were anyway. I guess any intro course to deep learning should cover these. The Stanford ones used to be good. Maybe someone else can be more helpful.


I'd recommend starting here:

https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-...

It's pretty lengthy but doesn't require a PhD to understand. If you can get to the end of it you'll have a much better understanding of what's going on.


To be honest, I'd start with some introduction to Transformer YouTube videos. They'll cover a lot of these terms and you'll then have a better understanding to find additional resources.


> Rule of thumb is that you need ~20 tokens per parameter.

That rule of thumb is wrong. The Chinchilla paper has it anywhere between 1 and 100 tokens per parameter.


> Those rumors seemed ridiculous in hindsight

No, those rumors seemed ridiculous even then. Many AI influencers were posting some of the most absurd material, often making basic mistakes (like confusing training tokens with parameters), but anyone in the field could have easily told you that 100T parameters sounded ridiculous.

On that note, "100 Terabytes of GPU memory is exactly what you need to train that class of model." is also likely false. That's how much you'd need to fit such a model into memory at 1 byte per param. Not train it.


https://huggingface.co/docs/transformers/perf_train_gpu_one#...

You can't train a 100T model with "only" 100TB of VRAM. For each parameter you need 4 bytes (weight) + 4 bytes (gradient) + 8 bytes (AdamW optimizer state), plus the forward activations, which depend on batch size, sequence length, etc. - maybe more if you use mixed precision - and you also need to distribute the weights.
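A quick tally along those lines (a sketch using the fp32 + AdamW byte counts above; activations are left out because they scale with batch size and sequence length):

    # bytes per parameter for plain fp32 AdamW training
    weight     = 4
    gradient   = 4
    adam_state = 8        # two fp32 moments
    per_param  = weight + gradient + adam_state     # 16 bytes before activations
    params     = 100e12
    print(params * per_param / 1e15, "PB")          # -> 1.6 PB of state alone

So even ignoring activations, the optimizer state alone is more than an order of magnitude over the ~100TB discussed here.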


The general rule of thumb that I'm familiar with is that you need about 80 bytes of VRAM per parameter when you are doing training. Inference is different and a lot more efficient, and LoRA is also different and more efficient, but training a base model requires a LOT of memory.

A machine like this would top out below 2 trillion parameters using the training algorithms that I'm familiar with.


I suppose it would be 12 bytes? 4 bytes for the base weights, 4 bytes for the optimizer momentum, and 4 bytes for the optimizer's second-moment EWA.


I don't know what the breakdown is, but I know there was code for training the LLaMA models on a DGX (640 GB of VRAM; the repo is now gone), and it could only train the 7B model without using DeepSpeed ZeRO stage 3 (offloading).

The ML engineers in my group chat say "1 DGX to train 7B at full speed, 2 for 13B, 4 for 30B, 8 for 65B".


Why 80? It's matrix operations on 4-byte numbers for single precision.


Because you need a lot more information to perform back-propagation.


It's not "a lot more" information; it's holding a derivative (a single number) per parameter, right?


For automatic differentiation (backpropagation) you need to store the intermediate results of the forward pass per layer. With checkpointing you can store only every nth layer's activations and recompute the rest, reducing memory requirements in exchange for more compute.
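A minimal sketch of that trade-off in PyTorch (assuming torch is available; the block structure and the every-other-layer policy are just for illustration):

    import torch
    from torch import nn
    from torch.utils.checkpoint import checkpoint

    class Block(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        def forward(self, x):
            return x + self.ff(x)

    blocks = nn.ModuleList(Block(512) for _ in range(8))
    h = torch.randn(4, 128, 512, requires_grad=True)
    for i, blk in enumerate(blocks):
        if i % 2 == 0:
            h = blk(h)                                   # activations kept for backprop
        else:
            h = checkpoint(blk, h, use_reentrant=False)  # recomputed during the backward pass
    h.sum().backward()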


What intermediate results do you need to store?

For backpropagation you take the diff between the actual and expected output and go backwards to calculate the derivatives, then apply them with the optimiser - that's 8 extra bytes per trainable parameter for single-precision floats.

Why do you need 80?


You also need the optimizer's state (e.g. Adam's), which is usually double the parameter's size. So if using fp16, one parameter takes up 6 bytes in memory.


Yes, if you use Adam - but it doesn't add up to 80, does it?

Even for fp64 it adds only 16 bytes.

RMSProp and Adagrad have half of this overhead.

SGD has no optimizer overhead, of course.


It's not just per parameter; you also need to hold the activations for backprop to work.


You need activations for inference as well.

But all of that (trainable parameters, activations, optimizer state) is like 12 bytes per trainable parameter, not 80.


Not the GP, but I believe that they are talking about the size of the training data set in relation to the model size.


You don't need to, and can't really, load all the training data.

For LLMs you need to load a single row of context length, i.e. a vector of ~8k numbers, which is 32 kB as single-precision floats.


For the numerically challenged like me: 100TB is 100 trillion bytes, giving you 1 byte per parameter at 100T params.

LLaMA can apparently run quantized to 4 bits per param (not sure if it's worth it though), which would allow you to run a 200T-param model on one of these machines if I'm understanding right.
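The arithmetic, for reference (a sketch; 100 TB taken as 1e14 bytes):

    mem_bytes = 100e12
    print(mem_bytes / 1.0, "params at 1 byte each")    # -> 1e14, i.e. 100T
    print(mem_bytes / 0.5, "params at 4 bits each")    # -> 2e14, i.e. 200T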


You can’t quantize it for training due to numerical instability. For inference you don’t usually use such a big cluster.


I think people talking about a 100T GPT didn't mean a dense transformer but some sort of extreme Mixture-of-Experts, which is much more amenable to low-resource setups and complicates this discussion.

In any case, it's almost certainly not bigger than 1T, even if it's not a dense transformer (PaLM-2 is and makes do with 340B, but it isn't exactly on par).


> LLaMA can apparently run quantized to 4 bits per param (not sure if worth it though)

From the GPTQ paper https://arxiv.org/abs/2210.17323:

"... with negligible accuracy degradation relative to the uncompressed baseline"


That would work for inference, but for efficient training you'd also want your training set to fit in memory.


There is much more out there than text: audio, visuals, touch, smell. Text isn't something humans train on directly; we train on representations of text from our senses.

GPT-4 was trained on image data. Besides gaining an understanding of image content, it also showed improved language abilities over a GPT-4 trained on text alone. Facebook is working on a smaller model with text, image, video, audio, lidar depth, infrared heat, and 6-axis motion data. If a GPT-4 were trained with data like that, what capabilities would it have? Rumor says we'll know in a few months.


My understanding is that the image data went through a decoder-only stage, i.e. mapping images to tokens - basically taking a textual description of the image instead of the actual pixels - so it can't "see" but can understand the "narration" of an image.


John Connor, is that you?


> However, I’m not even sure enough text data exists in the world

I hope these models move significantly beyond text at some point. For backend programmers it's ok, but for the rest of the technical world (circuits, mechanical engineering, front end, sound, etc), it's fairly limited.


My understanding is that this is already the case, see PaLM-E as one such example of a multimodal model.


> Maybe if you generated massive quantities of text with GPT-4 and used that dataset as your pre-training data

Hello spurious regression


These 100T rumors were ridiculous from the start, not just in hindsight.


I think we’re going to start seeing learning based on all the video out there. Text is just computationally easier, but video contains a lot of information that people rarely write about, because it’s completely obvious to humans who grew up in the real world.

Also, I think training in simulated realities will be big, especially for learning how to interact with complex systems and for developing strategic planning heuristics.


There may not be enough text content on the internet, but there's plenty of audio and video content, and there has already been some research about connecting that as an input to an LLM. So far we've seen that the more diverse the training data, the more versatile the model, so I suspect multi-modal input training is inevitably where LLMs are going.


As far as I can tell, the "100 trillion" number comes from an interview with the CEO of Cerebras when he was doing press for the WSE-2 release in 2021: https://www.wired.com/story/cerebras-chip-cluster-neural-net...


You don't really need to fit fully in memory. The memory requirement to train is roughly

~6 * D * P * precision

where D is the number of tokens per sequence times the mini-batch size and P is the number of parameters.

So if you want to fit fully into memory with a mini-batch of 1, a 32k context window, and 16-bit precision, that's 144e12 / 6 / 32e3 / 2 = 375M params.

If you apply one token at a time, then

144e12 / 6 / 2 = 12T params.

Of course, in reality you have model parallelism as well…
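A quick check of that arithmetic using the formula as stated (a sketch; the 144 TB figure and fp16 precision come from the comment above):

    mem_bytes = 144e12           # ~144 TB of GPU memory
    precision = 2                # bytes, fp16
    context   = 32e3             # tokens per sequence, mini-batch of 1
    print(mem_bytes / (6 * context * precision))   # -> ~3.75e8, i.e. ~375M params
    print(mem_bytes / (6 * 1 * precision))         # -> ~1.2e13, i.e. ~12T params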


I have to wonder how much improvement you would get from a 100 trillion parameter model. There seem to be diminishing returns to model size. That effort could almost certainly be better spent.


Let's record every conversation on Android to collect training data! Can anyone do the math?



