I wonder if there's research being done on training LLMs with expanded input representations, in analogy to explicit feature expansion for SVMs (what the "kernel trick" computes implicitly): the same way one might feed a linear model (x, x^2, x^3) rather than just x, and thus let it fit a nonlinear boundary, should we be feeding multimodal LLMs not only a token-by-token but also a character-by-character and pixel-by-pixel representation of prompt text during training and inference? Or should we let them "circle back" and request those representations as follow-up input when they detect that the information is relevant? There's likely a lot of interesting work to be done here.
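For concreteness, here is a minimal sketch of the feature-expansion idea the analogy rests on (a toy regression, nothing to do with LLMs): the same least-squares solver that fails on the raw feature x fits perfectly once it also sees x^2 and x^3.

```python
import numpy as np

# Target: a nonlinear function that no linear function of x can capture.
x = np.linspace(-2, 2, 200)
y = x**3 - x

# Linear model on the raw feature x (plus a bias column).
X_lin = np.column_stack([np.ones_like(x), x])
w_lin, *_ = np.linalg.lstsq(X_lin, y, rcond=None)
mse_lin = np.mean((X_lin @ w_lin - y) ** 2)

# Same linear solver, but on the expanded features (x, x^2, x^3).
X_poly = np.column_stack([np.ones_like(x), x, x**2, x**3])
w_poly, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
mse_poly = np.mean((X_poly @ w_poly - y) ** 2)

print(f"raw feature MSE:      {mse_lin:.4f}")   # large residual error
print(f"expanded features MSE: {mse_poly:.2e}")  # ~0, fits exactly
```

The question above is whether something analogous holds for LLMs: whether character- and pixel-level views of the same text are "extra features" the model could exploit.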