Hacker News

"64K tokens context window" I do wish they had managed to extend it to at least 128K to match the capabilities of GPT-4 Turbo

Maybe this limit will become a joke when looking back? Can you imagine reaching a trillion tokens context window in the future, as Sam speculated on Lex's podcast?



How useful is such a large input window when most of the middle isn't really used? I'm thinking mostly about coding. But when putting even, say, 20k tokens into the input, a good chunk doesn't seem to be "remembered" or used for the output.


While you're 100% correct, they are working on ways to make the middle useful, measured by "Needle in a Haystack" testing. When we say we wish for a context length that large, I think it's implied we mean functionally. But you do make a really great point.
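The core of a needle-in-a-haystack test is simple enough to sketch: bury one retrievable fact at a controlled depth in filler text, then check whether the model's answer contains it. A minimal illustration (the filler text, the needle, and the scoring are all made up here; real harnesses sweep depth and context length):

```python
def make_haystack(filler_sentences, needle, depth_fraction):
    """Bury a 'needle' fact at a given relative depth in filler text.
    A model is then asked to retrieve it; scoring is just checking
    whether the answer contains the needle."""
    pos = int(len(filler_sentences) * depth_fraction)
    doc = filler_sentences[:pos] + [needle] + filler_sentences[pos:]
    return " ".join(doc)

filler = [f"Filler sentence number {i}." for i in range(100)]
needle = "The secret code is 4217."
prompt = (make_haystack(filler, needle, depth_fraction=0.5)
          + "\n\nQuestion: What is the secret code?")
# Send `prompt` to the model and check the reply for "4217",
# sweeping depth_fraction and total context length.
```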


Maybe we'll look back at token context windows the way we look back at how much RAM we used to have in a system.


I agree with this in the sense that once you have enough, you stop caring about the metric.


And how much RAM do you need to run Mixtral 8x22B? Probably more than a personal laptop has.


Generally about 1 GB of RAM per billion parameters. I've run a 30B model (Vicuna) on my 32 GB laptop (but it was slow).
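That rule of thumb is just parameter count times bytes per parameter, so it can be sketched as arithmetic (the 1.2x overhead factor for KV cache and runtime, and the ~141B total-parameter figure for Mixtral 8x22B, are rough assumptions on my part):

```python
def model_ram_gb(params_billion, bits_per_param, overhead=1.2):
    """Rough memory estimate: parameter bytes times a fudge factor
    for KV cache, activations, and runtime overhead (the 1.2x is a
    guess, not a measured number)."""
    param_bytes = params_billion * 1e9 * bits_per_param / 8
    return param_bytes * overhead / 1e9

# 30B model at 8-bit, i.e. ~1 byte per parameter (the rule of thumb):
print(round(model_ram_gb(30, 8), 1))   # -> 36.0
# Mixtral 8x22B is roughly 141B total parameters; at 4-bit:
print(round(model_ram_gb(141, 4), 1))  # -> 84.6
```

The second number lines up with the "4-bit is 80GB" figure mentioned downthread.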


I run it fine on my 64gb RAM beast.


At what quantization? 4-bit is 80GB. Less than 4-bit is rarely good enough at this point.


Is that normal RAM or GPU RAM?


64GB is not GPU RAM, but system RAM. Consumer GPUs top out at 24GB, and the ones with good value for the price have far less. Current-generation workstation GPUs are unaffordable; used ones can be found on eBay for a reasonable price, but they are quite slow. DDR5 RAM might be a better investment.


While you need a lot more HBM (or UMA, if you're on a Mac) to run these LLMs, my overarching point is that most systems today aren't RAM-constrained for most of the software they need to run. As a result, RAM becomes less of a selling point except in specialized cases like graphic design or 3D rendering work.

If we have cheap billion-token context windows, 99% of your use cases won't come anywhere close to that limit, and as a result your models will "just run".


I still don’t have enough RAM though?


RAM is simply too useful.


Wasn't there a paper yesterday that made context evaluation linear (instead of quadratic), making effectively unlimited context windows possible? Between that and 1.58-bit quantization, I feel like we're overdue for an LLM revolution.


So far, people have come up with many alternatives for quadratic attention. Only recently have they proven their potential.


Tons and tons of papers; most of them have some disadvantage. You can't have your cake and eat it too:

https://arxiv.org/html/2404.08801v1 Meta Megalodon

https://arxiv.org/html/2404.07143v1 Google Infini-Attention

https://arxiv.org/html/2402.13753v1 LongRoPE

and a ton more
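The shared trick behind many linear-attention schemes fits in a few lines: replace the softmax with a positive feature map phi, so that (phi(Q) phi(K)^T) V can be reassociated as phi(Q) (phi(K)^T V), turning O(n^2) in sequence length into O(n). This is a generic NumPy illustration (phi = elu(x)+1 is one common choice), not the specific mechanism of any paper listed above:

```python
import numpy as np

def phi(x):
    # Positive feature map, elu(x) + 1 -- one common choice
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """O(n) attention: compute phi(K)^T V once (a d x d matrix,
    independent of sequence length), then project with phi(Q)."""
    Qp, Kp = phi(Q), phi(K)          # (n, d) each
    KV = Kp.T @ V                     # (d, d) summary, no n x n matrix
    Z = Qp @ Kp.sum(axis=0)           # (n,) per-row normalizer
    return (Qp @ KV) / Z[:, None]

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V = rng.normal(size=(3, n, d))
out = linear_attention(Q, K, V)
print(out.shape)  # -> (6, 4)
```

The result is identical to first forming the full n x n matrix phi(Q) phi(K)^T and normalizing its rows; the reassociation just avoids ever materializing it.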


FWIW, the 128k context window for GPT-4 is only for input. I believe the output content is still only 4k.


How does that make any sense on a decoder-only architecture?


It's not about the model. The model can output more - it's about the API.

A better phrasing would be that they don't allow you to output more than 4k tokens per message.

Same with Anthropic and Claude, sadly.


You can always feed the unfinished output back in as part of the input and continue, until you eventually fill the full 128k context window.
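That continuation trick is just a loop: append each truncated chunk to the transcript and ask again until the model reports it finished. A toy sketch with a stub standing in for the real API call (the "stop"/"length" finish reasons mirror what OpenAI-style chat APIs return; all names here are illustrative):

```python
def generate_with_continuation(generate_fn, prompt, max_rounds=10):
    """Feed the growing transcript back in until the model reports
    it finished ("stop") or we hit a round limit."""
    output = ""
    for _ in range(max_rounds):
        chunk, finish_reason = generate_fn(prompt + output)
        output += chunk
        if finish_reason == "stop":   # model ended on its own
            break
    return output

# Stub standing in for a real chat-completions call: emits 3 words
# per call and reports "length" until its canned answer runs out.
def make_stub(words, per_call=3):
    state = {"i": 0}
    def gen(_prompt):  # a real model would condition on the prompt
        chunk = words[state["i"]:state["i"] + per_call]
        state["i"] += len(chunk)
        finished = state["i"] >= len(words)
        text = " ".join(chunk) + ("" if finished else " ")
        return text, ("stop" if finished else "length")
    return gen

stub = make_stub("the quick brown fox jumps over the lazy dog".split())
result = generate_with_continuation(stub, "Q: describe the fox. A: ")
print(result)  # -> the quick brown fox jumps over the lazy dog
```

In practice each round still pays for re-processing the whole transcript as input, which is why the per-message output cap is annoying rather than a hard limit.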



