Hacker News | gpjt's comments

100%, I think there were weeks when I aged a year...

OP here -- with a 112M model you should be able to get something worth playing with using 2.24B tokens. The Chinchilla heuristic is tokens = 20 x parameters. Obviously you can get a better result by grinding through more tokens, but it will be very slow progress. It's worth noting that Andrej Karpathy is using the 20x thing for his nanochat project.
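
For concreteness, the back-of-the-envelope arithmetic is just this (a trivial sketch of the heuristic -- it's a rule of thumb, not a hard rule):

    params = 112e6                       # model size in parameters
    chinchilla_tokens = 20 * params      # Chinchilla heuristic: ~20 tokens per parameter
    print(f"{chinchilla_tokens / 1e9:.2f}B tokens")   # -> 2.24B tokens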

I try to explain the Chinchilla paper in the post, but your favourite AI should be able to explain it well, and has the benefit that you can ask follow-up questions.


OP here -- agreed! I tried to summarise (at least to my current level of knowledge) those 12-18 hours here: https://www.gilesthomas.com/2025/09/maths-for-llms

OP here: one thing that surprised me in this experiment was that the model trained on the more curated FineWeb-Edu dataset was worse than the one trained on FineWeb. That is very counterintuitive to me.

OP here -- thanks! I'm in the process of doing some trains using the same code plus DDP on big Lambda Labs machines, and (within the bounds of what I can afford) will hopefully have some interesting results about all of those shortly.

OK, early indicators support both you and Gemini quite strongly re: batch size. On my (somewhat ad-hoc) test dataset, I get losses like this:

  * OpenAI medium weights: 3.231
  * OpenAI small weights: 3.500
  * My locally trained model, FineWeb Chinchilla, batch size 6: 3.944
  * My locally trained model, FineWeb-Edu Chinchilla, batch size 6: 4.167
  * My locally trained model, FineWeb-Edu double Chinchilla, batch size 6: 4.135
  * My cloud trained model, FineWeb Chinchilla, batch size 13 * 8 = 104: 3.674

That last one was trained on an 8x A100 machine with 40 GiB per GPU, with the same code as before, just converted to DDP. It certainly looks like the much larger batch size has improved the model significantly.

I'll be trying on larger machines. No gradient accumulation yet, but it's certainly looking like a valuable lever to pull for local training runs (and, I suspect, might also be useful on "small" cloud machines like the one I used -- will have to see what things look like with the bigger mini-batches I can squeeze onto 80 GiB and 160 GiB GPUs).
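
For anyone curious what "converted to DDP" looks like in practice, the skeleton is roughly this (a minimal sketch, assuming one process per GPU launched with torchrun; `MyGPT` and `train_dataset` are placeholders, not my actual code):

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data.distributed import DistributedSampler

    dist.init_process_group(backend="nccl")          # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])       # set by torchrun
    torch.cuda.set_device(local_rank)

    model = MyGPT().to(local_rank)                   # placeholder model class
    model = DDP(model, device_ids=[local_rank])      # gradients are all-reduced across ranks

    sampler = DistributedSampler(train_dataset)      # each rank sees a different shard of the data
    loader = torch.utils.data.DataLoader(train_dataset, batch_size=13, sampler=sampler)  # per-GPU micro-batch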


Thanks, very nice to see these results! Certainly using GPUs with more RAM makes things simpler to scale. Gradient accumulation is as easy as adding a counter for the number of steps and an `if counter % gradient_accumulation_steps == 0:` around `optimizer.step()`, so that can also be tried simply on a single GPU / cheaper GPUs. But if you can just use 8xA100 and your pipeline parallelizes well, you also get results (almost) 8 times faster, which is certainly nicer to experiment with, of course!

Exactly! If I can get it down to an hour or two (seems very plausible on an 8x H200 with 160 GiB VRAM per GPU, though those are almost never available on Lambda Labs), I'll do the experiments with dropout and the other possible causes of issues, then see if I can bake that all into a new train on the RTX 3090 and confirm it repros there. Looks like I'll definitely need gradient accumulation there.

I assume the zero_grad would need to go in the same if block?


Hmm, interesting. With a batch size of 512 (8x B200s with 160 GiB each) I get worse results! Maybe there's a sweet spot somewhere in between.

Sorry, I came a bit late to this reply. Interesting -- well, nobody says it's a monotonic function :-) In the limit of _very_ large batches you are of course worse off, because you do a very large amount of computation before taking a single step, so if you stop after a fixed amount of time your model just didn't have the time to learn properly. So certainly there is a sweet spot somewhere.

I suppose the real "function" is a bit more complicated, because (1) if you put 2x more data through the same GPU with large enough memory, it will take less than 2x the time to compute (but certainly more than 1x); and (2) at some point, empirically, increasing the batch size makes things _worse_ even if you ignore the additional runtime cost (i.e. you stop after n gradient update steps, not after x seconds). To my knowledge, the accepted explanation is that a bit of noise helps regularize learning, because overly smooth learning curves end up stagnating in local loss minima more easily. In truth, I think nobody exactly understands how deep learning models work :-)

And to your other question - sorry again for the late answer. Yes, `optimizer.zero_grad()` should always be called directly after `optimizer.step()`, therefore with gradient accumulation once every `n` steps (otherwise, you'd be zeroing out the gradients, so just throwing away all the compute you did in previous steps).
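
Spelled out, the whole accumulation loop would look something like this (a rough sketch; `accum_steps`, `model`, `loss_fn`, `optimizer` and `train_loader` are placeholders):

    accum_steps = 8                                  # e.g. emulate a batch 8x larger
    optimizer.zero_grad()
    for step, (x, y) in enumerate(train_loader, start=1):
        loss = loss_fn(model(x), y) / accum_steps    # scale so the summed grads match one big batch
        loss.backward()                              # gradients accumulate in .grad
        if step % accum_steps == 0:
            optimizer.step()                         # weight update every accum_steps micro-batches
            optimizer.zero_grad()                    # ...and only zero the grads right after stepping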


Thanks re: gradient accumulation, I'm glad to hear my intuition was right!

As part of the upcoming post I'm running the DDP train on A100s with 40 GiB and 80 GiB, H100s with 80 GiB, and B200s with 160 GiB, so I'll have at least three loss vs. batch size points to plot. So that might be interesting.

I guess a full test would be to train at various batch sizes on the 160 GiB machine and plot the resulting loss. That would be very expensive as a hobby project (the bs=64 train cost a bit more than $40 excluding overhead) so I won't do it.

But perhaps a shorter train would still be of value? That is, train for 300M tokens for a tenth of the cost and see where the loss landed? The problem with that would be if the impact of batch size varied with the length of the train, e.g. if batch size 64 was better than 512 for short trains but weaker for longer ones.


Yes, exactly -- I fear that shortening the training time would skew the results. In the very short term, a smaller batch size is typically better just because you need a certain number of gradient updates to move away from the original random, hence pretty terrible, weight distribution. A larger batch size gives a steadier, but slower, convergence, so it's hard to say for sure what is better for a given compute budget.

I'm definitely _not_ encouraging you to spend more money on a side topic just for the sake of optimizing this one parameter; there will always be another parameter after that that you'll feel an urge to optimize :-) I'd say it's already a pretty neat result to have come so close to the original GPT2 score, training from scratch!

P.S. If you want to push it a bit further, rather than optimizing parameters for this model, last week at EurIPS I heard that a "very good" modern repo to start from in order to train a good LLM is this: https://github.com/Niccolo-Ajroldi/plainLM. I haven't investigated it closely (I'm not working on LLMs), but it might be interesting to you for a sample run. The (N)EurIPS paper that was discussed at the conference claimed that the only important change to make was to the hyperparameters of the Adam optimizer, for example setting beta1=beta2=0.95 (the defaults of beta1=0.9 and beta2=0.999 are apparently outdated).
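
In PyTorch that change is a one-liner -- here with AdamW, which is what most GPT-style training code uses (a sketch; `model` is assumed to be your nn.Module and the learning rate is just a placeholder):

    import torch

    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=3e-4,                  # placeholder learning rate
        betas=(0.95, 0.95),       # vs. the PyTorch defaults of (0.9, 0.999)
    )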


How much of that low survival rate is due to the condition for which they received the transplant, though? Conceivably a patient with "just" HIV might do better than one with, e.g., leukemia and HIV.

That said, IIUC the whole stem cell transplant procedure is unpleasant enough that it still might not be worth it.


About half?

"The major cause of death is relapse, which accounts for approximately 40% of all deaths, followed by infections at 25% and graft-versus-host disease (GVHD) at 20%."

https://www.sciencedirect.com/science/article/pii/S266663672...

A good friend of mine died from a C. Diff infection in the hospital after a bone marrow transplant. It is very risky, especially with an imperfect match.

That said, you can help make it less risky! This used to be called "Be The Match", not sure why they renamed it but you could save someone's life by registering to be a donor:

https://www.nmdp.org/


I donated bone marrow through Be the Match (before they changed their name). It was painful, but I highly recommend the experience to folks whenever it comes up.

You get to save the life of a stranger AND they give you a t-shirt. Win win!


Thanks for the reminder of a brilliant IT crowd moment!


To be fair to the OpenAI team, if read in context the situation is at worst ambiguous.

The deleted tweet that the article is about said "GPT-5 just found solutions to 10 (!) previously unsolved Erdös problems, and made progress on 11 others. These have all been open for decades." If it had been posted stand-alone then I would certainly agree that it was misleading, but it was not.

It was a quote-tweet of this: https://x.com/MarkSellke/status/1979226538059931886?t=OigN6t..., where the author is saying he's "pushing further on this".

The "this" in question is what this second tweet is in turn quote-tweeting: https://x.com/SebastienBubeck/status/1977181716457701775?t=T... -- where the author says "gpt5-pro is superhuman at literature search: [...] it just solved Erdos Problem #339 (listed as open in the official database erdosproblems.com/forum/thread/3…) by realizing that it had actually been solved 20 years ago"

So, reading the thread in order, you get

  * SebastienBubeck: "GPT-5 is really good at literature search, it 'solved' an apparently-open problem by finding an existing solution"
  * MarkSellke: "Now it's done ten more"
  * kevinweil: "Look at this cool stuff we've done!"

I think the problem here is the way quote-tweets work -- you only see the quoted post, not anything that it in turn is quoting. Kevin Weil had the two previous quotes in his context when he wrote his post and didn't consider the fact that readers would only see the first level, so wouldn't have Sebastien Bubeck's post in mind when they read his.

That seems like an easy mistake to entirely honestly make, and I think the pile-on is a little unfair.


> Kevin Weil had the two previous quotes in his context when he did his post and didn't consider the fact that readers would only see the first level, so wouldn't have Sebastien Bubek's post in mind when they read his.

No, Weil said he himself misunderstood Sellke's post[1].

Note Weil's wording (10 previously unsolved Erdos problems) vs. Sellke's wording (10 Erdos problems that were listed as open).

[1] https://x.com/kevinweil/status/1979270343941591525


Also, the previous comment omitted that the now-deleted tweet from Bubeck begins with "Science revolution via AI has officially begun...".


Am I correct in thinking this is the 2nd such fumble by a major lab? DeepMind released their “matrix multiplication better than SOTA” paper a few months back, which suggested Gemini had uncovered a new way to optimally multiply two matrices in fewer steps than previously known. Then immediately after their announcement, mathematicians pointed out that their newly discovered SOTA had been in the literature for 30-40 years, and was almost certainly in Gemini’s training set.


No, your claim about matrix multiplication is false. Google's new algorithm can be applied recursively to 4x4 block matrices (over the field of complex numbers). This results in an asymptotically faster algorithm for nxn matrix multiplication than Strassen's. Earlier results on 4x4 matrices by Winograd and others did not extend to block matrices.

Google's result has more recently been generalised: https://arxiv.org/abs/2506.13242


That doesn't match my recollection of the AlphaEvolve release.

Some people just read the "48 multiplications for a 4x4 matrix multiplication" part and thought they had found prior art at that performance or better. But they missed that the supposed prior art had tighter requirements on the entries of the matrices, which meant those algorithms could not be used to implement a recursive divide-and-conquer algorithm for much larger matrix multiplications.
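
The reason block-compatibility matters is the recursion: a scheme that multiplies m x m block matrices with k multiplications gives an n x n algorithm running in O(n^(log_m k)). Just checking the exponents (the arithmetic only, not the algorithms themselves):

    import math

    print(math.log(7, 2))    # Strassen: 7 mults for 2x2 blocks   -> exponent ~2.807
    print(math.log(49, 4))   # Strassen applied twice on 4x4      -> exponent ~2.807 (same)
    print(math.log(48, 4))   # 48 mults for 4x4 blocks            -> exponent ~2.7925, slightly better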

Here is a HN poster claiming to be one of the authors rebutting the claim of prior art: https://news.ycombinator.com/item?id=43997136


We also had the GPT-5 presentation which featured both incorrect bar charts (likely AI generated) and an incorrect explanation of lift.


Well, it is important that we have some technology to prevent us from going round in circles by reinventing things, such as search.


It's an interesting type of fumble too, because it's easy to (mistakenly!) read it as "LLM tries and fails to solve problem but thinks it solved it" when really it's being credited with originality for discovering or reiterating solutions already out there in the literature.

It sounds like the solutions themselves are perfectly fine, so it's unfortunate that the headline will leave the impression that these are just more hallucinations. They're not hallucinations, they're not wrong; the model has just been wrongly assigned credit for existing work. Which, you know, where have we heard that one before? It's like the stylistic "borrowing" from artists, but in research form.


no, you are incorrect


So the first guy said "solved [...] by realizing that it had actually been solved 20 years ago", and the second guy said "found solutions to 10 (!) previously unsolved Erdös problems".

Previously unsolved. The context doesn't make that true, does it?


Right, and I would even go a step further and say the context from SebastienBubeck is stretching "solved" past its breaking point by equating literature research with self-bootstrapped problem solving. When it's later characterized as "previously unsolved", it's doubling down on the same equivocation.

Don't get me wrong, effectively surfacing unappreciated research is great and extremely valuable. So there's a real thing here but with the wrong headline attached to it.


> Don't get me wrong, effectively surfacing unappreciated research is great and extremely valuable. So there's a real thing here but with the wrong headline attached to it.

If I said that I solved a problem, but actually I took the solution from an old book, people would call me a liar. If I were a prominent person, it would be an academic fraud incident. No one would be saying that I did an extremely valuable thing or that "there was a real thing here".


If you said you "solved", yes - if you said "found a solution" however, there's ambiguity to it, which is part of the confusion here.


Some of the most important advancements in the history of science came from reviewing underappreciated discoveries that already existed in the literature. Mendel's work on genetics went underappreciated for decades before being effectively rediscovered, and it proved to be integral to the modern synthesis, which provided a genetic basis for evolution and is the most important development in the history of our understanding of evolution since Darwin and Wallace's original formulation.

Henrietta Leavitt's work on the relation between a star's period of pulsation and its brightness was tucked away in a Harvard journal; its revolutionary potential was not appreciated until Hubble recalled and applied her work years later to measure the distance to Andromeda, establishing that it was an entirely separate galaxy and contributing to the bedrock of modern cosmology.

The pathogenic basis for ulcers was proposed in the 1940s, which later became instrumental to explaining data in the 1980s and led to a Nobel prize in 2005.

It is and has always been fundamental to the progress of human knowledge not just to propose new ideas but to pull pertinent ones from the literature and apply them in new contexts. Depending on the field, the research landscape can be inconceivably vast, so efficiencies in combing through it can create the scaffolding for major advancements in understanding.

So there's more going on here than "lying".


> "GPT-5 is really good at literature search, it 'solved' an apparently-open problem by finding an existing solution"

Survivor bias.

I can assure you that GPT-5 fucks up even relatively easy searches. I need to have a very good idea of what the results should look like, and the ability to test them, to be able to use any result from GPT-5.

If I throw the dice 1000 times and only post about the times I get a double six, am I the best dice thrower there is?


I'm not really sure what you mean. Literature search is about casting a wide net to make a reading list that is relevant to your research.

It is pretty hard to fuck that up, since you aren't expected to find everything anyway. The idea of "testing" and "using any result from GPT" is just, like, reading the papers and seeing if they are tangentially related.

If I may speak to my own experience, literature search has been the most productive application I've personally used, more than coding, and I've found many interesting papers and research directions with it.


One time when I was a kid my dad and I were playing Yahtzee, and he rolled five 5s on his first roll of the turn. He was absolutely stunned, and at the time I was young enough that I didn't understand just how unlikely it was. If I only I knew that I was playing against the best dice thrower!


For literature search that might be OK. It doesn't need to replace any other tools, and if one time in ten it surfaces something you wouldn't have found otherwise, it could be worth the time spent on the dud attempts.


I have some more mirrors for you to try and climb, if you need them.


That's being disingenuous, not fair.


This is a great post on many levels, but what struck me as particularly clever was the use of lm_head to decode the outputs of earlier layers. That linear layer is only trained to decode the output of the last layer, so intuitively it might only be able to do that -- the embedding spaces used between earlier layers might be different and "incompatible". It's really interesting that that is not the case.
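
For anyone who wants to try this themselves, here is roughly what decoding intermediate layers looks like with the Hugging Face GPT-2 implementation (a sketch of the general technique, sometimes called the "logit lens" -- not necessarily exactly what the post does):

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    inputs = tok("The capital of France is", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)

    # hidden_states[0] is the embedding output; [1..n] are the outputs of each transformer block
    for layer, h in enumerate(out.hidden_states):
        # apply the final layer norm and the (tied) unembedding to this layer's last-position output
        logits = model.lm_head(model.transformer.ln_f(h[:, -1, :]))
        print(layer, tok.decode(logits.argmax(-1)))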


Post author here. I agree 100%! The post is the basic maths for people digging in to how LLMs work under the hood -- I wrote a separate one for non-techies who just want to know what they are, at https://www.gilesthomas.com/2025/08/what-ai-chatbots-are-doi...

