Hacker News

"64K tokens context window" I do wish they had managed to extend it to at least 128K to match the capabilities of GPT-4 Turbo

Maybe this limit will become a joke when looking back? Can you imagine reaching a trillion tokens context window in the future, as Sam speculated on Lex's podcast?



How useful is such a large input window when most of the middle isn't really used? I'm thinking mostly about coding. But when putting even, say, 20k tokens into the input, a good chunk doesn't seem to be "remembered" or used for the output.


While you're 100% correct, they are working on ways to make the middle useful, measured by "Needle in a Haystack" testing. When we say we wish for a context length that large, I think it's implied we mean functionally. But you do make a really great point.
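The core of a needle-in-a-haystack test is simple enough to sketch: bury one retrievable fact at a controlled depth in filler text, then check whether the model's answer contains it. A minimal illustration (the filler text, the needle, and the scoring are all made up here; real harnesses sweep depth and context length):

```python
def make_haystack(filler_sentences, needle, depth_fraction):
    """Bury a 'needle' fact at a given relative depth in filler text.
    A model is then asked to retrieve it; scoring is just checking
    whether the answer contains the needle."""
    pos = int(len(filler_sentences) * depth_fraction)
    doc = filler_sentences[:pos] + [needle] + filler_sentences[pos:]
    return " ".join(doc)

filler = [f"Filler sentence number {i}." for i in range(100)]
needle = "The secret code is 4217."
prompt = (make_haystack(filler, needle, depth_fraction=0.5)
          + "\n\nQuestion: What is the secret code?")
# Send `prompt` to the model and check the reply for "4217",
# sweeping depth_fraction and total context length.
```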


Maybe we'll look back at token context windows the way we look back at how much RAM we used to have in a system.


I agree with this in the sense that once you have enough, you stop caring about the metric.


And how much RAM do you need to run Mixtral 8x22B? Probably more than a personal laptop has.


Generally about 1 GB of RAM per billion parameters. I've run a 30B model (Vicuna) on my 32 GB laptop (but it was slow).
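That rule of thumb is just parameter count times bytes per parameter, so it can be sketched as arithmetic (the 1.2x overhead factor for KV cache and runtime, and the ~141B total-parameter figure for Mixtral 8x22B, are rough assumptions on my part):

```python
def model_ram_gb(params_billion, bits_per_param, overhead=1.2):
    """Rough memory estimate: parameter bytes times a fudge factor
    for KV cache, activations, and runtime overhead (the 1.2x is a
    guess, not a measured number)."""
    param_bytes = params_billion * 1e9 * bits_per_param / 8
    return param_bytes * overhead / 1e9

# 30B model at 8-bit, i.e. ~1 byte per parameter (the rule of thumb):
print(round(model_ram_gb(30, 8), 1))   # -> 36.0
# Mixtral 8x22B is roughly 141B total parameters; at 4-bit:
print(round(model_ram_gb(141, 4), 1))  # -> 84.6
```

The second number lines up with the "4-bit is 80GB" figure mentioned downthread.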


I run it fine on my 64gb RAM beast.


At what quantization? 4-bit is 80GB. Less than 4-bit is rarely good enough at this point.


Is that normal RAM or GPU RAM?


64GB is not GPU RAM, but system RAM. Consumer GPUs top out at 24GB, and the ones with good value for the price have far less. Current-generation workstation GPUs are unaffordable; used ones can be found on eBay for a reasonable price, but they are quite slow. DDR5 RAM might be a better investment.


While you need a lot more HBM (or UMA, if you're on a Mac) to run these LLMs, my overarching point is that most systems today aren't RAM-constrained for most of the software they need to run. As a result, RAM becomes less of a selling point except in specialized cases like graphic design or 3D rendering work.

If we have cheap billion-token context windows, 99% of your use cases won't come anywhere close to that limit, and as a result your models will "just run".


I still don’t have enough RAM though?


RAM is simply too useful.


Wasn't there a paper yesterday that made context evaluation linear (instead of quadratic), making effectively unlimited context windows possible? Between that and 1.58-bit quantization, I feel like we're overdue for an LLM revolution.


So far, people have come up with many alternatives for quadratic attention. Only recently have they proven their potential.


Tons and tons of papers; most of them have some disadvantage. You can't have your cake and eat it too:

https://arxiv.org/html/2404.08801v1 Meta Megalodon

https://arxiv.org/html/2404.07143v1 Google Infini-Attention

https://arxiv.org/html/2402.13753v1 LongRoPE

and a ton more
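The shared trick behind many linear-attention schemes fits in a few lines: replace the softmax with a positive feature map phi, so that (phi(Q) phi(K)^T) V can be reassociated as phi(Q) (phi(K)^T V), turning O(n^2) in sequence length into O(n). This is a generic NumPy illustration (phi = elu(x)+1 is one common choice), not the specific mechanism of any paper listed above:

```python
import numpy as np

def phi(x):
    # Positive feature map, elu(x) + 1 -- one common choice
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """O(n) attention: compute phi(K)^T V once (a d x d matrix,
    independent of sequence length), then project with phi(Q)."""
    Qp, Kp = phi(Q), phi(K)          # (n, d) each
    KV = Kp.T @ V                     # (d, d) summary, no n x n matrix
    Z = Qp @ Kp.sum(axis=0)           # (n,) per-row normalizer
    return (Qp @ KV) / Z[:, None]

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V = rng.normal(size=(3, n, d))
out = linear_attention(Q, K, V)
print(out.shape)  # -> (6, 4)
```

The result is identical to first forming the full n x n matrix phi(Q) phi(K)^T and normalizing its rows; the reassociation just avoids ever materializing it.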


FWIW, the 128k context window for GPT-4 is only for input. I believe the output content is still only 4k.


How does that make any sense on a decoder-only architecture?


It's not about the model. The model can output more - it's about the API.

A better phrasing would be that they don't allow you to output more than 4k tokens per message.

Same with Anthropic and Claude, sadly.


You can always feed the unfinished output back in as part of the input and continue, until you eventually fill the full 128k context window.
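That continuation trick is just a loop: append each truncated chunk to the transcript and ask again until the model reports it finished. A toy sketch with a stub standing in for the real API call (the "stop"/"length" finish reasons mirror what OpenAI-style chat APIs return; all names here are illustrative):

```python
def generate_with_continuation(generate_fn, prompt, max_rounds=10):
    """Feed the growing transcript back in until the model reports
    it finished ("stop") or we hit a round limit."""
    output = ""
    for _ in range(max_rounds):
        chunk, finish_reason = generate_fn(prompt + output)
        output += chunk
        if finish_reason == "stop":   # model ended on its own
            break
    return output

# Stub standing in for a real chat-completions call: emits 3 words
# per call and reports "length" until its canned answer runs out.
def make_stub(words, per_call=3):
    state = {"i": 0}
    def gen(_prompt):  # a real model would condition on the prompt
        chunk = words[state["i"]:state["i"] + per_call]
        state["i"] += len(chunk)
        finished = state["i"] >= len(words)
        text = " ".join(chunk) + ("" if finished else " ")
        return text, ("stop" if finished else "length")
    return gen

stub = make_stub("the quick brown fox jumps over the lazy dog".split())
result = generate_with_continuation(stub, "Q: describe the fox. A: ")
print(result)  # -> the quick brown fox jumps over the lazy dog
```

In practice each round still pays for re-processing the whole transcript as input, which is why the per-message output cap is annoying rather than a hard limit.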



