
Why do we expect this to perform better? Couldn't a regular network converge on this structure anyway?


It doesn't perform better per se; until recently, MoE models actually underperformed their dense counterparts. The real gain is sparsity. You get the capacity of a huge N-parameter model, but you don't have to use all those parameters on every token, so you save a lot of compute in both training and inference.
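A rough back-of-the-envelope sketch of the compute saving. The numbers below are assumptions (loosely Mixtral-8x7B-shaped: ~47B total parameters, ~13B active per token), not measurements:

```python
# Assumed, illustrative parameter counts for a top-2-of-8 MoE.
total_params = 47e9    # all parameters in the model
active_params = 13e9   # parameters actually used per token (shared + 2 experts)

# Transformer forward-pass FLOPs per token are roughly 2 * params_used,
# so per-token compute tracks the active count, not the total.
dense_flops = 2 * total_params    # a dense model of the same total size
moe_flops = 2 * active_params     # the MoE
print(f"compute saving: {dense_flops / moe_flops:.1f}x")
```

Training sees the same per-token saving, which is why MoEs can be trained to a much larger total size for the same FLOP budget.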


It's a type of ensemble model. A regular network could converge on a similar structure, but an MoE explicitly selects a subset of its weights per token, so it does the task faster than running the whole model would.


Here's my naive intuition: in general, bigger models can store more knowledge but take longer to do inference. MoE blends the advantage of a bigger model (more storage) with the advantages of a smaller model at inference time (faster, less compute). At inference, each token hits a small router (gating) layer that load-balances across the experts and activates only 1 or 2 of them. So you're storing roughly 8 x 22B "worth" of knowledge without having to run a model that big.
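The routing step described above can be sketched in a few lines. This is a minimal toy (assumed shapes, random weights, top-2 selection), not any particular library's implementation:

```python
import numpy as np

# Toy top-2 MoE layer: a tiny router picks 2 of 8 experts per token.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

W_gate = rng.standard_normal((d_model, n_experts))                    # router weights
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x):
    logits = x @ W_gate                  # one router score per expert
    top = np.argsort(logits)[-top_k:]    # indices of the 2 best-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()             # softmax over just the chosen experts
    # Only the selected experts run; the other 6 cost no compute for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

y = moe_forward(rng.standard_normal(d_model))
print(y.shape)  # (16,)
```

Real implementations add a load-balancing auxiliary loss so the router doesn't collapse onto a few favorite experts, but the select-then-mix structure is the same.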

Maybe a real expert can confirm if this is correct :)


Sounds like the "you only use 10% of your brain" myth, but actually real this time.


Almost :) The model chooses experts in every block. For a typical 7B model with 8 experts and 32 blocks, there are 8^32 = 2^96 possible paths through the whole model.
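The arithmetic checks out for top-1 routing (one expert chosen per block):

```python
# Path counting: 8 independent expert choices per block, 32 blocks.
n_experts, n_blocks = 8, 32
paths = n_experts ** n_blocks
assert paths == 2 ** 96  # 8 = 2^3, so 8^32 = 2^(3*32) = 2^96
```

With top-2 routing the count per block is C(8, 2) = 28 combinations instead of 8, so the number of paths is larger still.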


Not quite: you don't save memory, only compute.
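Concretely: since the router can pick any expert for any token, every expert's weights must stay resident, so memory follows the total parameter count. The numbers here are assumptions (a 47B-total / 13B-active MoE in fp16):

```python
# Assumed: 47B total params, 13B active per token, fp16 weights.
bytes_per_param = 2
total_params, active_params = 47e9, 13e9

# All experts must be loaded to serve arbitrary tokens, so memory
# tracks total_params even though compute tracks active_params.
memory_needed_gb = total_params * bytes_per_param / 1e9
print(f"{memory_needed_gb:.0f} GB")  # same as a dense model of the same total size
```

Offloading inactive experts to CPU or disk can trade that memory back for latency, but on a single accelerator the full parameter set is the bill you pay.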




