
What I'd like to know is how well those dual-Epyc machines run the 1.58-bit dynamic quant model. It really does seem to be almost as good as the full Q8.


I tried that: ~1.5 to 3 tokens/sec.


Ouch, thanks. That's about what I get now on a single-CPU box with 128 GB of RAM plus a 4090. I was hoping for a major speedup.


Peak performance is reached at ~21 cores. The bottleneck, without any special configs, is RAM-to-CPU bandwidth.

Let me know if you find some config that really leverages more cores!
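
For anyone wondering why extra cores stop helping, here is a rough back-of-envelope sketch of a bandwidth-bound model. Everything in it is an assumption for illustration: the function names are made up, the numbers in the usage lines are placeholders rather than measurements, and it treats decoding as simply "read a fixed amount of weight data from RAM per token", which ignores caches, NUMA effects, and GPU offload.

```python
import math


def bandwidth_bound_tps(mem_bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on tokens/sec if every token must stream
    gb_read_per_token gigabytes of weights from RAM."""
    if gb_read_per_token <= 0:
        raise ValueError("gb_read_per_token must be positive")
    return mem_bandwidth_gb_s / gb_read_per_token


def estimated_tps(cores: int, per_core_tps: float,
                  mem_bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Throughput scales with core count until it hits the bandwidth ceiling."""
    compute_bound = cores * per_core_tps
    return min(compute_bound, bandwidth_bound_tps(mem_bandwidth_gb_s, gb_read_per_token))


def saturation_cores(per_core_tps: float,
                     mem_bandwidth_gb_s: float, gb_read_per_token: float) -> int:
    """Smallest core count at which adding cores no longer helps in this model."""
    ceiling = bandwidth_bound_tps(mem_bandwidth_gb_s, gb_read_per_token)
    return math.ceil(ceiling / per_core_tps)


if __name__ == "__main__":
    # Placeholder inputs, NOT measured values: plug in your own.
    bw, per_token, per_core = 100.0, 50.0, 0.125
    for n in (4, 8, 16, 32, 64):
        print(n, "cores ->", estimated_tps(n, per_core, bw, per_token), "tokens/sec")
    print("plateau at", saturation_cores(per_core, bw, per_token), "cores")
```

In this toy model throughput grows linearly with cores and then goes flat once the memory ceiling is reached, which is the same shape as the plateau described above. The only ways past the ceiling are more memory bandwidth or fewer bytes read per token.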



