Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Self-hosting training (or gaming) makes a lot of sense, and once you have the hardware self-hosting inference on it is an easy step.

But if you have to factor in hardware costs self-hosting doesn't seem attractive. All the models I can self-host I can browse on openrouter and instantly get a provider who can get great prices. With most of the cost being in the GPUs themselves it just makes more sense to have others do it with better batching and GPU utilization



If you can get near 100% utilization for your own GPUs (i.e. you're letting requests run overnight and not insisting on any kind of realtime response) it starts to make sense. OpenRouter doesn't have any kind of batched requests API that would let you leverage that possibility.


For inference, even with continuous batching, getting 100% MFUs is basically impossible to do in practice. Even the frontier labs struggle with this in highly efficient infiniband clusters. Its slightly better with training workloads just due to all the batching and parallel compute, but still mostly unattainable with consumer rigs (you spend a lot of time waiting for I/O).

I also don't think the 100% util is necessary either, to be fair. I get a lot of value out of my two rigs (2x rtx pro 6000, and 4x 3090) even though it may not be 24/7 100% MFU. I'm always training, generating datasets, running agents, etc. I would never consider this a positive ROI measured against capex though, that's not really the point.


Isn't this just saying that your GPU use is bottlenecked by things such as VRAM bandwidth and RAM-VRAM transfers? That's normal and expected.


No I'm saying there are quite a few more bottlenecks than that (I/O being a big one). Even in the more efficient training frameworks, there's per-op dispatch overhead in python itself. All the boxing/unboxing of python objects to C++ handles, dispatcher lookup + setup, all the autograd bookkeeping, etc.

All of the bottlenecks in sum is why you'd never get to 100% MFUs (but I was conceding you probably don't need to in order to get value)


That’s kind of a moot point. Even if none of those overheads existed you would still be getting a a fractions of the mfu. Models are fundamental limited by memory bandwidth even with best case scenarios of sft or prefill.

And what are you doing that I/O is a bottleneck?


> That’s kind of a moot point.

I don't believe it's moot, but I understand your point. The fact that models are memory bandwidth bound does not at all mean that other overhead is insignificant. Your practical delivered throughput is the minimum of compute ceiling, bandwidth ceiling, and all the unrelated speed limits you hit in the stack. Kernel launch latency, Python dispatch, framework bookkeeping, allocator churn, graph breaks, and sync points can all reduce effective speed. There are so many points in the training and inference loop where the model isn't even executing.

> And what are you doing that I/O is a bottleneck?

We do a fair amount of RLVR at my org. That's almost entirely waiting for servers/envs to do things, not the model doing prefill or decode (or even up/down weighting trajectories). The model is the cheap part in wall clock terms. The hard limits are in the verifier and environment pipeline. Spinning up sandboxes, running tests, reading and writing artifacts, and shuttling results through queues, these all create long idle gaps where the GPU is just waiting to do something.


> That's almost entirely waiting for servers/envs to do things

I'm not sure why, sandboxes/envs should be small and easy to scale horizontally to the point where your throughput is no longer limited by them, and the maximum latency involved should also be quite tiny (if adequately optimized). What am I missing?


First as an aside, remember that this entire thread is about using local compute. What you're alluding to is some fantasy infinite budget where you have limitless commodity compute. That's not at all the context of this thread.

But disregarding that, this isn't a problem you can solve by turning a knob akin to scaling a stateless k8s cluster.

The whole vertical of distributed RL has been struggling with this for a while. You can in theory just keep adding sandboxes in parallel, but in RLVR you are constrained by 1) the amount of rollout work you can do per gradient update, and 2) the verification and pruning pipeline that gates the reward signal.

You cant just arbitrarily have a large batch size for every rollout phase. Large batches often reduce effective diversity or get dominated by stragglers. And the outer loop is inherently sequential, because each gradient update depends on data generated by a particular policy snapshot. You can parallelize rollouts and the training step internally, but you can’t fully remove the policy-version dependency without drifting off-policy and taking on extra stability headaches.


In Silicon Valley we pay PG&E close to 50 cents per kWh. An RTX 6000 PC uses about 1 kW at full load, and renting such a machine from vast.ai costs 60 cents/hour as of this morning. It's very hard for heavy-load local AI to make sense here.


Yikes.. I pay ~7¢ per kWh in Quebec. In the winter the inference rig doubles as a space heater for the office, I don't feel bad about running local energy-wise.


God bless Canada. I love our cheap hydro power. <3


And you are forgetting the fact that things like vast.ai subscriptions would STILL be more expensive than Openrouter's api pricing and even more so in the case of AI subscriptions which actively LOSE money for the company.

So I would still point out the GP (Original comment) where yes, it might not make financial sense to run these AI Models [They make sense when you want privacy etc, which are all fair concerns but just not financial sense]

But the fact that these models are open source still means that they can be run when maybe in future the dynamics might shift and it might make sense running such large models locally. Even just giving this possibility and also the fact that multiple providers could now compete in say openrouter etc. as well. All facts included, definitely makes me appreciate GLM & Kimi compared to proprietory counterparts.

Edit: I highly recommend this video a lot https://www.youtube.com/watch?v=SmYNK0kqaDI [AI subscription vs H100]

This video is honestly one of the best in my opinion about this topic that I watched.


Why did you quote yourself at the end of this comment?


Oops sorry. Fixed it now but I am trying a HN progressive extension and what it does is if I have any text selected it can actually quote it and I think this is what might've happened or such a bug I am not sure.

It's fixed now :)




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: