
Why do we expect this to perform better? Couldn't a regular network converge on this structure anyway?


It doesn't perform better per se; until recently, MoE models actually underperformed their dense counterparts. The real gain is sparsity. You get the capacity of a huge N-parameter model, but you don't have to use all those parameters on every token, so you save a lot of compute in both training and inference.
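A rough back-of-the-envelope sketch of the compute saving. The numbers below are assumptions (loosely Mixtral-8x7B-shaped: ~47B total parameters, ~13B active per token), not measurements:

```python
# Assumed, illustrative parameter counts for a top-2-of-8 MoE.
total_params = 47e9    # all parameters in the model
active_params = 13e9   # parameters actually used per token (shared + 2 experts)

# Transformer forward-pass FLOPs per token are roughly 2 * params_used,
# so per-token compute tracks the active count, not the total.
dense_flops = 2 * total_params    # a dense model of the same total size
moe_flops = 2 * active_params     # the MoE
print(f"compute saving: {dense_flops / moe_flops:.1f}x")
```

Training sees the same per-token saving, which is why MoEs can be trained to a much larger total size for the same FLOP budget.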


It's a type of ensemble model. A regular network could converge on a similar structure, but an MoE explicitly selects a subset of its weights per token, so it does the task faster than running the whole model would.


Here's my naive intuition: in general, bigger models can store more knowledge but take longer to do inference. MoE blends the advantage of a bigger model (more storage) with the advantages of a smaller model at inference time (faster, less compute). At inference, each token hits a small router (gating) layer that load-balances across the experts and activates only 1 or 2 of them. So you're storing roughly 8 x 22B "worth" of knowledge without having to run a model that big.
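The routing step described above can be sketched in a few lines. This is a minimal toy (assumed shapes, random weights, top-2 selection), not any particular library's implementation:

```python
import numpy as np

# Toy top-2 MoE layer: a tiny router picks 2 of 8 experts per token.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

W_gate = rng.standard_normal((d_model, n_experts))                    # router weights
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x):
    logits = x @ W_gate                  # one router score per expert
    top = np.argsort(logits)[-top_k:]    # indices of the 2 best-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()             # softmax over just the chosen experts
    # Only the selected experts run; the other 6 cost no compute for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

y = moe_forward(rng.standard_normal(d_model))
print(y.shape)  # (16,)
```

Real implementations add a load-balancing auxiliary loss so the router doesn't collapse onto a few favorite experts, but the select-then-mix structure is the same.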

Maybe a real expert can confirm if this is correct :)


Sounds like the "you only use 10% of your brain" myth, but actually real this time.


Almost :) The model chooses experts in every block. For a typical 7B model with 8 experts and 32 blocks, there are 8^32 = 2^96 possible paths through the whole model.
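The arithmetic checks out for top-1 routing (one expert chosen per block):

```python
# Path counting: 8 independent expert choices per block, 32 blocks.
n_experts, n_blocks = 8, 32
paths = n_experts ** n_blocks
assert paths == 2 ** 96  # 8 = 2^3, so 8^32 = 2^(3*32) = 2^96
```

With top-2 routing the count per block is C(8, 2) = 28 combinations instead of 8, so the number of paths is larger still.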


Not quite: you don't save memory, only compute.
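Concretely: since the router can pick any expert for any token, every expert's weights must stay resident, so memory follows the total parameter count. The numbers here are assumptions (a 47B-total / 13B-active MoE in fp16):

```python
# Assumed: 47B total params, 13B active per token, fp16 weights.
bytes_per_param = 2
total_params, active_params = 47e9, 13e9

# All experts must be loaded to serve arbitrary tokens, so memory
# tracks total_params even though compute tracks active_params.
memory_needed_gb = total_params * bytes_per_param / 1e9
print(f"{memory_needed_gb:.0f} GB")  # same as a dense model of the same total size
```

Offloading inactive experts to CPU or disk can trade that memory back for latency, but on a single accelerator the full parameter set is the bill you pay.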




