They don't have to be trained that way! The training data for 1-bit LLMs is the same as for any other LLM. A common way to generate this data is called 'model distillation': you take completions from a teacher model and use them to train a smaller student model (which is what you're describing)!
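The comment above describes the data-level flavor of distillation (training on teacher completions); another common variant matches the student's output distribution to the teacher's softened logits. A minimal sketch of that logit-matching objective, using numpy and a hypothetical `distill_loss` helper (the temperature and KL direction are conventional choices, not a fixed recipe):

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T softens the distribution.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(teacher_logits, student_logits, T=2.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 so gradients stay comparable across temperatures.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean() * T * T)

teacher = np.array([[2.0, 0.5, -1.0]])
student = np.array([[1.0, 1.0, 0.0]])
print(distill_loss(teacher, student))  # positive; approaches 0 as logits match
```

The point either way: the student (1-bit or not) trains on signals derived from the teacher, so the data pipeline is independent of the student's weight format.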
Maybe I wasn't clear; I think you've misunderstood me. I understand that all sorts of LLMs can be trained on a common corpus of data. But my understanding is that the choice to create a BitNet LLM must be made at training time, since the training algorithm itself has to be modified. In other words, an existing FP16 model cannot simply be post-quantized into a BitNet model.
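For context, this is why the choice is baked in at training time: BitNet b1.58 quantizes weights to {-1, 0, +1} in the forward pass while gradients update a latent full-precision copy via a straight-through estimator. A rough numpy sketch of the absmean quantization step (the function name and epsilon are my own; the round-and-clip recipe follows the paper's description):

```python
import numpy as np

def absmean_quantize(W, eps=1e-8):
    # Scale weights by their mean absolute value, then round and clip
    # to the ternary set {-1, 0, +1}, as in BitNet b1.58.
    gamma = np.abs(W).mean() + eps
    W_q = np.clip(np.round(W / gamma), -1, 1)
    return W_q, gamma

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
W_q, gamma = absmean_quantize(W)
# During training, W_q is used for the forward pass while gradients
# flow back to the latent full-precision W (straight-through estimator).
```

A pretrained FP16 model's weights were never optimized under this constraint, so naively applying the quantizer after the fact destroys accuracy rather than recovering a BitNet model.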