One of the tricks OpenAI uses when fine-tuning models is to use the unadjusted foundation model as a coherency model alongside the data or reward model to be fine-tuned on. Loss is calculated as the sum[0] of both the fine-tuning and coherency loss so that the training process is anchored to something.
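In pseudo-PyTorch, the anchoring looks something like this - a sketch only, assuming an HF-style model whose outputs carry .logits and a frozen copy ref_model of the foundation model; the KL direction and the weight beta are my guesses, not OpenAI's actual recipe:

    import torch
    import torch.nn.functional as F

    def anchored_loss(model, ref_model, input_ids, ft_loss, beta=0.1):
        """Fine-tuning loss plus a coherency penalty against a frozen reference."""
        logits = model(input_ids).logits
        with torch.no_grad():  # the reference (foundation) model is never updated
            ref_logits = ref_model(input_ids).logits
        # KL(tuned || reference): penalizes the tuned model for drifting
        # away from the foundation model's distribution.
        coherency = F.kl_div(
            F.log_softmax(ref_logits, dim=-1),
            F.log_softmax(logits, dim=-1),
            log_target=True,
            reduction="batchmean",
        )
        return ft_loss + beta * coherency  # get the sign wrong and see [0]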
The reason why unanchored training fails is fairly simple. "Training" is a misnomer, we're really copying and compressing[1]. When you train a model on itself, you're making a lossy copy of the original, which isn't a very good truth anchor.
There are probably other ways to anchor a self-training process, though. ChatGPT and other text-to-text transformer models are operated as autoregressive processes, where the model spits out a probability distribution, which you then sample to get a token to add to the input, and then repeat until the model says stop. You'll notice that if you squint a little, this looks like the policy function of AlphaGo, but being run stochastically instead of being min-maxed. Which raises the question: why can't we train GPT like we train chess AI, with self-play followed by fine-tuning on the result, as scored by some kind of reward model?
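A minimal sketch of what that loop might look like, using Hugging Face transformers with GPT-2 as a stand-in; score() here is a hypothetical, deliberately dumb reward model. Strictly this is rejection-sampling fine-tuning rather than AlphaGo-style tree search, but the generate-select-train structure is the same:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    tok.pad_token = tok.eos_token  # GPT-2 has no pad token by default
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

    def score(text: str) -> float:
        # Hypothetical reward model; here just a toy length heuristic.
        return min(len(text.split()) / 50.0, 1.0)

    prompt = tok("Once upon a time", return_tensors="pt")
    for step in range(10):
        # "Self-play": sample several continuations from the current policy.
        outs = model.generate(**prompt, do_sample=True, max_new_tokens=40,
                              num_return_sequences=8,
                              pad_token_id=tok.eos_token_id)
        texts = [tok.decode(o, skip_special_tokens=True) for o in outs]
        # Selection: keep only the generations the reward model likes best.
        best = sorted(texts, key=score, reverse=True)[:2]
        # Fine-tune on the winners (mask padding out of the loss).
        batch = tok(best, return_tensors="pt", padding=True)
        labels = batch["input_ids"].masked_fill(batch["attention_mask"] == 0, -100)
        loss = model(**batch, labels=labels).loss
        loss.backward()
        opt.step()
        opt.zero_grad()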
Granted, you'd have to specify a reward model, as well as what behavior you're trying to 'reward'. One other idea that's been bouncing around my head for self-training is training a model to remember details of prior conversations that have since fallen off the end of the context window. The biological analogy being "long-term memory", in contrast to the "short-term memory" of the context window. So perhaps your reward model is the model plus the current context window, and your loss is calculated on the same model but without the parts of the context window you want to free up.
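For the memory idea, the loss might amount to self-distillation: the model conditioned on the full context acts as the (per-step frozen) teacher, and the same model conditioned on the truncated context is the student, trained to predict as if the evicted tokens were still there. A sketch, assuming an HF-style causal LM and that full_ctx and trunc_ctx are token-id tensors ending in the same n_last tokens:

    import torch
    import torch.nn.functional as F

    def memory_loss(model, full_ctx, trunc_ctx, n_last):
        with torch.no_grad():  # teacher: full context, no gradient
            teacher = model(full_ctx).logits[:, -n_last:, :]
        # Student: same model, but with the to-be-freed tokens dropped.
        student = model(trunc_ctx).logits[:, -n_last:, :]
        # KL(teacher || student): push the truncated-context predictions
        # toward the full-context ones.
        return F.kl_div(F.log_softmax(student, dim=-1),
                        F.log_softmax(teacher, dim=-1),
                        log_target=True, reduction="batchmean")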
No clue if this has already been done, but if it has please reply with the name of the thing I'm not aware of.
[0] Or difference, I forget. If you get the signs wrong you get a hilariously horny version of ChatGPT.
[1] And, thanks to induction heads, compressing the knowledge of how and what to copy.
How does this actually work? At a high level, the autoregressive token generation process makes sense to me, but how does it know when a sensible time to stop is rather than just going on forever or abruptly stopping after n tokens? Is it trained on text that has special 'stop' tokens inserted at the end of paragraphs, etc. and when it chucks out one of these the model halts?
One of the tokens represents stopping. If you sample stop from the probability distribution instead of a normal text token, then you stop autoregressive sampling.
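In code, the whole loop is roughly this - a sketch assuming an HF-style model and tokenizer, where eos_token_id is that stop token:

    import torch

    @torch.no_grad()
    def sample(model, tok, prompt, max_new_tokens=200, temperature=1.0):
        ids = tok(prompt, return_tensors="pt").input_ids
        for _ in range(max_new_tokens):
            logits = model(ids).logits[0, -1] / temperature  # next-token distribution
            probs = torch.softmax(logits, dim=-1)
            next_id = torch.multinomial(probs, 1)  # stochastic sampling
            if next_id.item() == tok.eos_token_id:  # sampled the "stop" token
                break
            ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
        return tok.decode(ids[0])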
> One of the tricks OpenAI uses when fine-tuning models is to use the unadjusted foundation model as a coherency model alongside the data or reward model to be fine-tuned on. Loss is calculated as the sum[0] of both the fine-tuning and coherency loss so that the training process is anchored to something.
Long-term and short-term "memory" is a well-known concept in NLP; the predecessors to Transformers were Long Short-Term Memory networks, aka LSTMs. And LSTMs are just RNNs that attempt to solve the vanishing gradient problem. The whole strength of Transformers is that they are capable of processing the entire input sequence in parallel, hence they can "hold" long-range dependencies better.
> Specifically, we explore the potential of self-training models on their own outputs, akin to how humans learn and build on their previous thoughts and actions.
I'd argue this misrepresents even how humans learn. We learn from our previous thoughts and actions along with the reaction of the environment to them. That's more like reinforcement learning.
There are situations where someone can purely learn from their own thoughts - e.g. an author gaining more insight into their own characters as they imagine the story, or a mathematician building a proof in their head. But even those need real-world inputs from time to time: the author will be influenced by other stories they know and/or experiences they had in the past; the mathematician will write things down at some point and may gain new insight by actually looking at the formulas/graphs/etc instead of just imagining them.
So I'm very sceptical that even humans could basically create infinite new knowledge by continuously "learning from their own thoughts".
I would argue that for a mathematician, the primary anchor is other mathematicians reading their proofs and complaining that it doesn't seem quite right. What anchors the mathematical community as a whole is applications - physics, cryptography, signal processing. The demand for applications is what keeps it from degenerating into some number-mysticism in the long term.
Maybe 'new information' can be generated by taking existing data and finding new relationships within it. If that were the case, the original embedding for that data would change (new relationships = new associations).
I agree that infinite knowledge would be difficult to generate this way, but maybe some new knowledge could be. And this process of finding new relationships between existing data might be something we can't automate yet.
> We explore the potential of self-training models on their own outputs, akin to how humans learn and build on their previous thoughts and actions.
I feel like humans for the most part don't just blindly update their beliefs based on whatever thought or action they just did. Humans will often take some sort of action based on their thoughts and then observe how successful their action was. Then they can update based on the feedback signal.
When I'm preparing for a difficult conversation -- for example, confronting my boss about a major problem -- I do simulate the entire conversation in my head a few times. I think about how my boss will likely respond to my message, and where misunderstandings are likely to occur, and then adjust my delivery pre-emptively. It's part of communication with empathy, but very similar to self-play.
People don’t learn based on the action they just did; they learn based on the universe’s response to that action. It seems like an expected result that if the model is just learning by inserting its outputs into the training data somehow, it will fail, in the same way that a person who never receives feedback will go in really weird directions.
But the paper is nice to have in case anyone is suggesting such a scheme. Is anyone? Hopefully not…
That, and we're constantly exposed to new information and situations.
The way to mimic it would be to self-train and add new crap onto the training continuously to simulate world experience, otherwise things get a bit too deterministic and you end up running an expensive minimization problem.
I strongly suspect that humans ingesting only output of other humans in a vacuum, e.g. social media without any other feedback mechanisms, suffer exactly the same degradation.
Outside of social-media-like bubbles, human learning in fact has a sort of oracle providing correct examples: the world around us. Indeed, if you provide a world for an AI to interact with, such as the game of Go, it can learn simply by interacting with versions of itself and the world.
Put a little more simply, only talking will not lead to new knowledge if there is no testing of ideas or conclusions involved. If anything, extended talking without feedback often leads to damaging rumination and/or possible misinformation.
If this notion is correct, actively limiting and directing the sorts of inputs we give ourselves could help us improve our knowledge and abilities.
Right, it requires intimate past knowledge to be the foundation of future knowledge and weighted biases. There isn't enough memory or disk today to accomplish this. It would be no more helpful than training a goldfish to react to feeding time. It is why we instead substitute an existing compendium of knowledge to query in place of memory.
It seems obvious to me that training on the totality of the output of a model with any degree of error would result in a greater error due to the error in the source compounding with error in the training. That is simply the photocopy of a photocopy problem.
If there is a selection mechanism, then you are reducing the data set to the best of its output, which by definition has the potential to be better than the original training data: divergences from correct prediction of the training data (errors) have some probability of being better than the original training data.
This is analogous to evolution, where most errors are bad and some are beneficial; natural selection provides a mechanism to pick the good bits.
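A toy numpy version of both halves of this argument, with a one-dimensional Gaussian standing in for the model and distance-to-a-target standing in for external fitness (a sketch of the dynamics, not a claim about real LLM training):

    import numpy as np

    rng = np.random.default_rng(0)
    TARGET = 5.0  # ground truth that external selection pressure rewards

    def run(select, mu=4.0, sigma=1.0, rounds=200, n=50):
        for _ in range(rounds):
            x = rng.normal(mu, sigma, n)  # "generate" from the current model
            if select:
                # Keep only the fittest quarter, judged by an external signal.
                x = x[np.argsort(np.abs(x - TARGET))[: n // 4]]
            # "Retrain" on our own output; floor sigma so generation never dies.
            mu, sigma = x.mean(), max(x.std(), 0.05)
        return mu, sigma

    print("no selection: ", run(select=False))  # mean drifts, sigma decays: photocopies
    print("with selection:", run(select=True))  # mean converges on TARGET: evolution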
> So is there any selection criteria going on here?
I think that's Mistral's secret sauce.
> They appear particularly adept at marshalling training data—the second ingredient of AI success. Mr Mensch will not say how exactly Mistral curates its training sets;
This is useful as a "tiny paper" showing that one naive kind of self-training, in a smaller/older model (GPT-2), causes degradation.
But it's likely those on a motivated search for evidence of LLM limitations ('cope') are going to tout & over-rely on this result – and miss the extensive evidence in larger models, with slightly-more selective generations, that self-training often elicits improved performance on domains of interest.
For example, both the 'Unnatural Instructions' and 'Self-Instruct' papers (of December 2022) showed that if – rather than self-training on arbitrary generations – you ask a model to generate good examples of a certain kind of prompt and helpful response, then train on those generations, the resulting model tends to get better on many similar challenges.
It's almost like, as with human practice/exercises, there's a spillover effect on related competencies - eliciting from the model a potential that was too fuzzy to exploit at first, but gets honed by effortful practice (even without additional authoritative-instructor corrections).
To me, it's eerily similar to human self-help routines - "give yourself a pep talk, visualize desired results, pick tiny positive steps doable once & keep doing repeatedly, imagine success, affirm all progress".
Or, say, the "Inner Game of Tennis" style of gently reinforcing some key skill into subconscious comfort, with broader effects:
What are you trying to do, focusing on one parenthetical word, and not the bulk of the content? You could be a much better replier, without superficial word allergies.
Don't worry about the copers. Just be ready for people who
1. Think this research is pointless because of course stochastic parrots would exhibit knowledge collapse
2. Use this research to argue more advanced models will always have knowledge collapse, even if grounded in empirical data
These are not bad signs. They are screening signals that let you tell when someone has no genuine interest in uncovering the truth of the matter. That is very useful to know, since otherwise you will waste time trying to get useful contributions out of them, which is beyond frustrating.
Anyway, it seems like a pretty ineffective cope, right? Even if there is a limit to the degree to which the llm’s output can be fed back to it (and I think there must be, just like a person whose only feedback is self-help mantras will eventually see limited improvement), humans could still be reduced to just thumbs-up or thumbs-downing which llm outputs get fed back, which is not really a fate any of us want.
Lots of AI-progress-denialist cope is very ineffective! But the people demanding it want just-so stories that let them look away. Actual robust truth/effectiveness isn't what the market-for-cope is providing.
I imagine that it's increasingly hard to contribute or even participate in the ongoing research, which is dominated by major companies?
I work in a different field, but the way of working now is so different from when I started. It feels like factory work / hamsterwheeling. It used to feel more like the work of an architect, a mixture of art and engineering.
Probably the fact that even ten years ago one could hope to do something impressive (and maybe even state of the art) with a custom-built model running on a laptop... these days anything you might try is likely to be outdone by a more general model.
Like the fact that all sorts of fun but elementary language processing tasks that previously required some clever engineering or at least careful data collection and a custom-built model can just be done (and done much better) by something like GPT-x. There's far less incentive to play around with basic ML stuff as an amateur now.