
> Where's my 20,000 core 100TB RAM VM instance?

You could simulate this with a bunch of regular machines and a networked hypervisor.

You could do some kind of smart caching so that processes rarely need to wait to access RAM stored on a remote machine.

Combine that with a big lock-eliding/speculation scheme (i.e. when a process reads memory that might have been written by a remote CPU, you continue as if it hadn't been, and if you later find out that the data was written, you roll back). These rollbacks 'undo' all work done in however many microseconds it takes for data to travel from one side of the machine cluster to the other.

Reads of RAM that aren't cached yet on the local node can also be speculated - you just assume that RAM contained null bytes and continue execution, rolling back and replaying when the actual data arrives.
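To make the read speculation concrete, here is a toy sketch, not a real hypervisor. `run_speculatively`, `compute` and `fetch_remote` are made-up names, and the fetch is done sequentially here, whereas a real system would overlap it with the speculative work. It also skips the replay when the guess turns out to be right.

```python
def run_speculatively(compute, fetch_remote):
    """Run compute(value) on a guessed value, replaying if the guess was wrong.

    Returns (result, replayed). Both arguments are hypothetical callables:
    compute stands in for the process's work, fetch_remote for the slow
    read from another node.
    """
    guess = 0  # assume the uncached RAM held null bytes
    speculative_result = compute(guess)  # work done before the real data is known
    actual = fetch_remote()              # the real data arrives
    if actual == guess:
        return speculative_result, False  # guess was right, keep the work
    return compute(actual), True          # discard the work and replay


if __name__ == "__main__":
    print(run_speculatively(lambda v: v + 1, lambda: 0))
    print(run_speculatively(lambda v: v + 1, lambda: 5))
```

The win only shows up when the guess is right often enough, since every wrong guess costs the speculative work plus the replay.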

So if you can make sure that processes are contending for locks and writing conflicting data less often than once per system-roundtrip-latency, then you should get a high performance system.
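A back-of-envelope way to see that condition, under a deliberately crude model I'm assuming here: each conflict throws away at most one round trip's worth of work, and conflicts don't overlap. The function name and the numbers are purely illustrative, not measurements.

```python
def wasted_fraction(conflicts_per_second, roundtrip_seconds):
    """Rough upper bound on the fraction of time lost to rollbacks.

    Assumed model: every conflict undoes up to one round-trip latency of
    work, so lost time per second is about rate * latency, capped at 1.
    """
    return min(1.0, conflicts_per_second * roundtrip_seconds)


if __name__ == "__main__":
    # Hypothetical figures: 50 microsecond round trip across the cluster.
    print(wasted_fraction(1_000, 50e-6))      # conflicts are rare
    print(wasted_fraction(1_000_000, 50e-6))  # conflicts more often than once per round trip
```

Once the product goes above 1 you are conflicting more than once per round trip, and in this model the machine spends all of its time rolling back.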



This is certainly a very interesting thought to entertain and your ideas make sense. One thing that makes things harder on the CPU side in this hypothetical scenario is that CPUs tend to execute much more diverse instructions/computations than GPUs. So all the caching & speculation you mention is probably all the more important.


After writing the comment, I considered writing a little toy example just to try out the idea... It would be neat to see Linux boot with 1000 CPUs...

But upon further thought, a lot of the things such a system would need are actually rather inefficient to implement in software (e.g. rollbackable RAM), yet quite cheap in hardware (for example, rollbackable RAM can be implemented with regular RAM plus either a buffer of 'overwritten data' or a write queue).
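For what it's worth, the 'buffer of overwritten data' variant looks roughly like this in software. It's only a sketch with made-up names (`RollbackRAM`, `checkpoint`, `commit`), and it also shows where the cost comes from: every single write pays for an extra log entry.

```python
class RollbackRAM:
    """Byte-addressable memory with an undo log of overwritten data."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.undo_log = []  # (address, old_byte) pairs, oldest first

    def write(self, addr, value):
        # Save the byte being overwritten so the write can be undone later.
        self.undo_log.append((addr, self.data[addr]))
        self.data[addr] = value

    def read(self, addr):
        return self.data[addr]

    def checkpoint(self):
        # A checkpoint is just a position in the undo log.
        return len(self.undo_log)

    def rollback(self, checkpoint):
        # Undo writes newest-first until the log is back at the checkpoint.
        while len(self.undo_log) > checkpoint:
            addr, old = self.undo_log.pop()
            self.data[addr] = old

    def commit(self):
        # Speculation confirmed: the saved old values are no longer needed.
        # Note this invalidates any checkpoints taken earlier.
        self.undo_log.clear()


if __name__ == "__main__":
    ram = RollbackRAM(16)
    ram.write(3, 7)
    cp = ram.checkpoint()
    ram.write(3, 9)       # speculative writes
    ram.write(4, 1)
    ram.rollback(cp)      # speculation failed
    print(ram.read(3), ram.read(4))
```

The write-queue variant would be the mirror image: hold speculative writes in a queue and only apply them to RAM on commit, so a rollback just drops the queue.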


A write queue with a duplicated back end, say a blob on S3 or whatever, would be interesting, since mirrors of the outcomes could be stored there.

The biggest issue, it seems, is bandwidth and humans' patience for a response...



