
Partially completed write of a block, sure. But partially completed write of a file?

I can imagine (cough) an application where the application is trying to write some binary blob to disk, doesn't finish before shutdown, and upon reboot, tries to load the binary blob back into memory, fails because the binary blob isn't consistent, doesn't handle the failure well, and refuses to boot.

App's fault? Sure. Does the customer care at 2 am? Nope.
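For what it's worth, the usual way an app avoids ever seeing a half-written blob is to write to a temporary file and rename it over the old one. A minimal sketch, assuming a POSIX-style filesystem where a rename within one directory is atomic; `atomic_write` is a hypothetical helper name, not something from the app being discussed:

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers see the old or the new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem) as the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic swap of old file for new
    except BaseException:
        # The write or rename failed: clean up the temp file, leave the old blob alone.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    if os.name == "posix":
        # Also sync the directory entry so the rename itself survives a power loss.
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

If the process is killed mid-write, the worst case is a stray `.tmp-*` file next to an intact old blob.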



Then all you're saying over and over is that in your imagination, not using a long running instance is very dangerous because rebooting exposes the fragility of your app.

Honestly, it's much safer in that circumstance to have a frequently rebooting instance because it will quickly expose your app's fragility during normal operations instead of that fragility being exposed in a disaster.


> it's much safer in that circumstance to have a frequently rebooting instance

I actually happen to agree with you in principle on this, and it's at the root of my current side project.

But sometimes you just don't have the flexibility to fix or replace the app. Ops engineering, like any other kind of engineering, is about dealing with real-world constraints and making the most of the resources you have. Most apps, on some notion of a fragility spectrum, are far closer to fragile than to antifragile, because fragile is the default, and extensive stress-testing to understand and plan for all failure modes before a production deployment isn't typically feasible. At that point, if you can't fix it, you have to work around it.


All you're doing is advocating larger, less frequent failures with people who know less. Robustness isn't just about your software or your ops setup, but also about your people and their knowledge and experience. I cannot see how less frequent, more intense failures handled by people who know less are preferable, or how anything else is "very dangerous advice."

You will ultimately have many fewer resources available if your strategy is to gloss over failure modes by telling inexperienced engineers to hope they won't happen. It's technical debt and the interest payments are very high.


You are both right. But both wrong. If you want better consistency, use either object storage or a database. If you are mutating multiple entities and need consistency, now you need a distributed transaction.

But ALL cloud providers provide warning before an instance is shut down. There is absolutely no reason, other than a crash, for an instance to have a hard shutdown.
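To make that concrete: on most Linux setups the warning eventually reaches the process as a SIGTERM, and the app has to actually act on it. A rough sketch, assuming SIGTERM is the signal delivered (how each provider surfaces its notice varies and is not covered here); `ShutdownFlag`, `install_sigterm_handler`, and `run` are hypothetical names for illustration:

```python
import signal
import time


class ShutdownFlag:
    """Signal handler that only records that a shutdown was requested."""

    def __init__(self) -> None:
        self.requested = False

    def __call__(self, signum, frame) -> None:
        # Keep the handler trivial; the real cleanup happens in the main loop.
        self.requested = True


def install_sigterm_handler(flag: ShutdownFlag) -> None:
    # Must be called from the main thread of the process.
    signal.signal(signal.SIGTERM, flag)


def run(flag: ShutdownFlag, do_work, save_state, poll_seconds: float = 1.0) -> None:
    """Run units of work until shutdown is requested, then persist state once."""
    while not flag.requested:
        do_work()
        time.sleep(poll_seconds)
    save_state()
```

Call `install_sigterm_handler(flag)` once at startup, then `run(flag, ...)`. This only helps with orderly shutdowns; a crash or a hard kill still skips `save_state`.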


He makes valid points, but in defense of an original ridiculous statement that the article's suggestions are extremely dangerous. There are all sorts of benefits to an ACID database; it's just not reasonable to scream about the necessity of it because reboots are scary.


I agree.

But! Lots of applications aren't built to handle partial writes, which will absolutely occur if apps are hard killed. Any discussion around this topic should reference Crash-only Software [0][1][2] and Micro Reboots [3].

[0] https://en.wikipedia.org/wiki/Crash-only_software

[1] https://www.usenix.org/conference/hotos-ix/crash-only-softwa...

[2] https://lwn.net/Articles/191059/

[3] https://www.usenix.org/legacy/event/osdi04/tech/full_papers/...
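In the crash-only spirit, the load path can treat a torn blob as an expected case rather than a fatal one. A small sketch, assuming the app controls its own on-disk format and can prepend a length and CRC32 header; the framing and the names `encode_blob`, `decode_blob`, and `load_state` are made up for illustration and are not taken from the linked papers:

```python
import struct
import zlib

# Hypothetical framing: 4-byte payload length + 4-byte CRC32, little-endian.
HEADER = struct.Struct("<II")


def encode_blob(payload: bytes) -> bytes:
    """Prefix the payload with its length and checksum (payloads under 4 GiB)."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def decode_blob(raw: bytes):
    """Return the payload, or None if the blob is truncated or corrupted."""
    if len(raw) < HEADER.size:
        return None
    length, checksum = HEADER.unpack_from(raw)
    payload = raw[HEADER.size:]
    if len(payload) != length or zlib.crc32(payload) != checksum:
        return None
    return payload


def load_state(path: str, default: bytes) -> bytes:
    """Load saved state, falling back to `default` instead of refusing to start."""
    try:
        with open(path, "rb") as f:
            raw = f.read()
    except FileNotFoundError:
        return default
    payload = decode_blob(raw)
    return default if payload is None else payload
```

Whether silently falling back to a default is acceptable depends on the app; the point is only that a partial write becomes a detectable condition with a defined recovery path.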



