Show HN: I built an open-source tool to make on-call suck less (github.com/opslane)
319 points by aray07 on July 28, 2024 | hide | past | favorite | 169 comments
Hey HN,

I am building an open source platform to make on-call better and less stressful for engineers. We are building a tool that can silence alerts and help with debugging and root cause analysis. We also want to automate tedious parts of being on-call (running runbooks manually, answering questions on Slack, dealing with Pagerduty). Here is a quick video of how it works: https://youtu.be/m_K9Dq1kZDw

I hated being on-call for a couple of reasons:

* Alert volume: The number of alerts kept increasing over time. It was hard to maintain existing alerts. This would lead to a lot of noisy and unactionable alerts. I have lost count of the number of times I got woken up by an alert that auto-resolved 5 minutes later.

* Debugging: Debugging an alert or a customer support ticket would require me to gain context on a service that I might not have worked on before. These companies used many observability tools, which made debugging challenging. There is always time pressure to resolve issues quickly.

There were some more tangential issues that used to take up a lot of on-call time:

* Support: Answering questions from other teams. A lot of the time these questions were repetitive and had been answered before.

* Dealing with PagerDuty: These tools are hard to use. E.g. it was hard to schedule an override in PD or set up holiday schedules.

I am building an on-call tool that is Slack-native since that has become the de-facto tool for on-call engineers.

We heard from a lot of engineers that maintaining good alert hygiene is a challenge.

To start off, Opslane integrates with Datadog and can classify alerts as actionable or noisy.

We analyze your alert history across various signals:

1. Alert frequency

2. How quickly the alerts have resolved in the past

3. Alert priority

4. Alert response history

Our classification is conservative and it can be tuned as teams get more confidence in the predictions. We want to make sure that you aren't accidentally missing a critical alert.
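For illustration, those four signals could be folded into a single noise score roughly like this (the field names, weights, and 5-minute auto-resolve cutoff are my own invention, not Opslane's actual logic):

```python
from dataclasses import dataclass

@dataclass
class AlertEvent:
    priority: str            # e.g. "P1".."P4"
    seconds_to_resolve: float
    was_acknowledged: bool   # did a human actually respond?

def noise_score(history: list[AlertEvent]) -> float:
    """Score 0.0 (actionable) .. 1.0 (likely noise) from past events."""
    if not history:
        return 0.0  # no data: stay conservative, treat as actionable
    n = len(history)
    fired_often = min(n / 50.0, 1.0)
    auto_resolved = sum(e.seconds_to_resolve < 300 for e in history) / n
    low_priority = sum(e.priority in ("P3", "P4") for e in history) / n
    ignored = sum(not e.was_acknowledged for e in history) / n
    return 0.2 * fired_often + 0.3 * auto_resolved + 0.2 * low_priority + 0.3 * ignored
```

A conservative deployment would only flag alerts well above some threshold (say 0.8) and let teams tune the weights as confidence grows.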

Additionally, we generate a weekly report based on all your alerts to give you a picture of your overall alert hygiene.

What’s next?

1. Building more integrations (Prometheus, Splunk, Sentry, PagerDuty) to continue making on-call quality of life better

2. Help make debugging and root cause analysis easier.

3. Runbook automation

We’re still pretty early in development and we want to make on-call quality of life better. Any feedback would be much appreciated!



> It reduces alert fatigue by classifying alerts as actionable or noisy and providing contextual information for handling alerts.

grimace face

I might be missing context here, but this kind of problem speaks more to a company’s inability to create useful observability, or worse, their lack of conviction around solving noisy alerts (which upon investigation might not even be “just” noise)! Your product is welcome and we can certainly use more competition in this space, but this aspect of it is basically enabling bad cultural practices and I wouldn’t highlight it as a main selling point.


As someone who has worked for decades in non-tech startups transitioning into enterprise, or in enterprise organisations… I can tell you that none of these companies or organisations are capable of making meaningful interactions with operations. Disclaimer: I haven't worked in operations myself, but I've been in relatively close proximity to them, work-wise, for natural reasons, using their infrastructure and sometimes making their tools or helping them with things like automation and PowerShell.

Anyway, in a lot of places management sees IT as a necessary evil. Like a service center similar to HR, but less popular, because management genuinely doesn't understand it and most IT departments lack HR's political shrewdness and communication abilities. At the same time it's not uncommon for users to be unable to tell support whether they're on an Android or iOS device (yes, I'm serious). Sometimes employees won't even differentiate between their professional and personal IT issues on their work devices. Which means that sometimes they'll raise hell over things that might not warrant full alert systems for on-site support.

What might be challenging here is that you'll still need someone to actually use the author's tool correctly. Though that is probably going to be a lot easier than any sort of change management of an organisation's relationship with IT.


100x this. Garbage in = Garbage out.

In similar mindset, I've seen attempts to "fix" flaky test suites by retrying failing tests 5 times until they pass. What happens: You just set a new baseline of shit allowed. This allows even more noise to enter the system and you have to rerun them even more or need increasingly more advanced tools to filter out the noise.

Once the new baseline is anchored, you become dependent on the filter. Now every tool that interacts with the metrics needs to be aware of the additional filter, and there may be more than just the Slack messages. Should your dashboard show the raw or the filtered metrics?

Devil's advocate: consider an alert with 99/100 false positives. The LLM may be good at classifying it as noisy, but will it do a better job than a human at reacting to the 1 true positive? Maybe, but at the same time it allows more such noise to accumulate in the system, in effect a net negative. It's better to remove such an alert instead. Even if the numbers were turned around in its favor, that's a lot of complexity added.

The additional context this product provides may of course still be useful, and I applaud the effort. This product space does have a lot of potential for growth and is a real pain point for operators. Be careful with using it as a substitute for proper alert hygiene and culture.


If I may try to counter this point.

You are right, but there's the ideal state and there's the real world. When on call, most of my time is spent trying to make on-call better: reducing noise, providing more context when the alert and logs are lacking, and of course fixing the real issues that alerts have identified. That said, there is a period of time between receiving non-actionable alerts and classifying them as such, and more context without using brain power is always welcome. I think I'll give it a shot.


Thanks for the feedback. Yeah, ideally, teams would go back and fix their misconfigured alerts. Unfortunately, on-call ends and people forget. The aim was to both provide context when an alert comes up as well as provide a report at the end so everybody has context into the state of alerts.


> which upon investigation might not even be “just” noise

My company (like so many) is struggling a bit with culture around noisy alarms. Not only is noise tolerated, but when someone closes an alarm because it's "known to be noise" and I prod them, it turns out that there is a very real impact on the user; it's just that nobody bothered to look into it. The alarm rings, the on-call hopes it closes itself soon enough, it does, so they consider it "false positive and noisy" even though there was impact on the user during those few minutes.

The only way to fight that is a zero-tolerance culture on alarms, which means no false positive is ever tolerated: fix it.


Ops person here, fixing it is generally impossible as most alarms are due to code and development teams don't give a shit. "It's just one false alarm occasionally, what's the big deal?" or "We are no longer working on that product." or "I have 14 features to do, go away ops and deal."

If we completely shut off the alarm, then the one time it's actually an issue, I'll get dinged for shutting down monitoring.

So yeah, I mark its priority as low so it doesn't wake me up in PagerDuty and move on.


A lot of times ops can't fix it (the false positive); it's another team's responsibility. And if that other team can't or won't fix it, ops is screwed with constant false positives. At that point, it's in ops' best interest to ignore it, let the actual positive wreak havoc, and point at the other team. If they don't have the capability of putting pressure on that team, sometimes a fire is the only way to do it. I'm not saying this is a good idea, but bureaucracy is going to bureaucracy. You have to make the other team feel your pain.


Yeah, I was surprised at how common this behavior has been across companies. I don't have a good solution for this problem - I am hoping additional visibility and accountability into these alerts can be useful. One of the things we have been hoping to do is to be able to add user impact context into the alert enrichment as well.


I agree with your intent and desire, but the fact is that this problem _keeps happening,_ and we can't fix it by advocating "well, just do alerts better." There's a lot of cultural inertia at a lot of places that leads to creating too many, too low-signal alerts, and fixing that across an entire company -- or, hell, industry -- is a magnificently tall order.

However, installing a tool to specifically rein in those garbage noisy alerts is a potentially easy, significant win for the time and mental health of on-call engineers.

I mean, it sounds like you can then afterwards go in and identify the alerts that are just noise, and having that data means you can take action. Maybe contact the teams that are writing the noisiest alerts, or prepare some data-driven engineering standards for the company, whatever. But that still falls into "fix the culture", which is famously hard to do by fiat.


If you work at a shitty place, focus your energy on leaving the shitty place.


The shittiness of a place is defined by a lot more, and more important, attributes than alert hygiene. Culture, pay, location, industry, leadership. If one leaves companies for (relatively) minor things like that, there are basically no companies left to work for.


Alert hygiene is a symptom of bad culture. Not the only one of course.


Or, it can be self-evident to all that today's work can deliver more benefit if focused elsewhere, instead of root-causing fickle alert flakes. Customers buy products and services, not alert hygiene.


To give a more charitable take, in practice it's easy for noisy alerts to creep in and a tool like this could be a good nudge to dig a bit deeper on that alert and why it's gone from useful to noise.


Agreed. To solve noisy alerts, we have a group channel where we 1. call it out 2. solve it with weekly meetings. We usually "group" up and fix it, so everyone is in agreement. It takes about an hour to organize everyone and talk through it, but it improves the on-call quality-of-life over time.


Agreed. This doesn't make the problem better, it's a bandaid solution that can make the problem worse by allowing you to ignore it for longer.

Iterating on your alarms is super informative about the underlying product. It'll point to how you might improve your KPI measurements, or find bugs you didn't know were there.


Bandaids are a valuable and useful product used billions of times around the globe every year.


Missing from the context of sibling replies is the (in my experience as an SRE, quite large) category of alerts that are mostly-but-not-entirely noise and high-effort-duration to properly improve.

Consider a not-that-hypothetical example: "host computer is unreachable" alerts that page oncall when they arrive for members of a fleet of critical database servers or replicas.

The alerts have proven their usefulness (they tend to arrive several minutes before application-level error spikes when a database is e.g. so overloaded the monitoring agent can't function or a replica is gone so changelogs are overflowing) ... when they're genuine. However, they're mostly not genuine: alerting agents crash and automatic-restart-service init scripts bug out or give up; per-database-owner customizations in hosts' available file descriptor numbers are propagated incorrectly to non-database services and prevent the alerting agent from running, databases that serve infrequent-but-critical on-demand reporting loads are subjected to tens-of-minutes-long load spikes during which the host is doing what it's supposed to but so pegged that the alerting agent won't work, and so on.

What do you do with those alerts?

"Just fix the problems causing the false positives!" Fine, but that takes a lot of time and coordinated effort, even if the oncall folks are empowered to prioritize the work getting done (which is far from a given at many companies, for reasons both good and bad): auto-restart-agent scripts can be replaced with better scripts (time, effort, debugging of some hokey bash that needs to run on a wide variety of environments) or systemd (time, effort, maintenance windows, and approval/retraining to update ancient linux distributions running critical databases). File descriptor/per-database tunings can be unified and continually audited for/invalidated before configs are pushed (developer effort, coordination with teams writing configs). Reporting databases can be upsized (money, maintenance windows) or the database processes can be moved into a cgroup to leave some resources to spare (effort, distro upgrades, maintenance windows).

That's going to take a while, if it ever happens to completion.

Meanwhile, this "host unreachable" alert is useless 90% of the time and very useful (as in: it can be leveraged to prevent downtime for customers entirely) the remaining 10% of the time.

Like, sure, some of those issues are stupid. But none are hypothetical, all are younger than 5y, and I bet this kind of struggle is common and representative even at companies who are invested in operations and operations staff.

That's not an "inability to create useful observability", that's a genuinely hard problem resulting in noisy, spurious alerts that, depending on the rate-of-change/regulatory space of the company, might persist for months or years. What's more, alert management is an ongoing process. Even if one family of noisy alerts is addressed, another one will emerge as new behaviors and technologies are adopted.

I guess this is all to say that I don't think tools like Opslane (which I have not used) are "enabling bad cultural practices". Organizations that don't give a shit about operations will continue to suck at operations no matter what tools they use. But products like Opslane are valuable even (especially?) in capable, operations-focused organizations as well.


If the tool works properly, then the value proposition is:

Sure, you could spend months trying to fix or tune out every useless alarm type, or try to hack your alert manager/email inbox with filters for things you know will get fixed in a few months - OR, you can use this tool that can quickly classify things as important or not.


If something that triggers a pager takes months to resolve you already work at an organization that is so ossified it will be unwilling to adopt a random bandaid startup product like this.


I'm sure that's true in many places. But the number of large companies dealing with this kind of thing (understaffed teams operating hundreds or thousands of devices) is quite high. Some can pull off shadow IT or exceptions for free software.


Yeah, that's fair feedback. The main aim was to reduce alert fatigue for on-call engineers and provide a way to get insight into the alerts at the end of the on-call shift.

This way there is data to make the case that certain alerts are noisy (for various reasons) and that we should strive to reduce the time spent dealing with them. Fixing some of them might be as easy as deleting them, but others might need dedicated time to work on.


> enabling bad cultural practices

I strongly disagree. There is nothing culturally bad in a system issuing an error if there is an error. Sometimes systems issue errors that are considered noise by supporters because they are not actionable, but forcing a system to not issue an error just because your support team cannot directly take action on it is an extremely odd leakage of team responsibilities and bound to have unintended consequences. Imagine a developer telling management that they didn't implement error checking on some edge case because the support team told them they didn't have documentation about how to take action on it, for instance. The appropriate response there would be "why on earth are you asking support permission to add error messages for a known error?". On the other hand, if a support team is drowning in noisy error messages, they need tooling to make it easy to distinguish between those and other messages that need to be reviewed or have action taken.


Then it either shouldn't be an alert (and instead part of some kind of summary report or some such) or the devs need to take on call. It is an exercise in frustration for everyone to route the page to ops just to make ops call dev; that means dev still has to have an oncall rotation, they might as well just take the page directly.

The unintended consequence of forcing alerts down ops' throat is them gradually caring less about pages, because there's a very good chance that each one is unactionable. I've worked places that do this, I've seen it happen first-hand more than once.

It starts with frustration and ops being less helpful to devs, and ends in a jaded acceptance where ops people start telling each other "just close it and see if it happens a second or third time, that alert never means anything". At that point, the system may as well not emit the errors anyways because no one is looking at the alerts anyways.


> There is nothing culturally bad in a system issuing an error if there is an error.

That's true, but if the error says "PANIC! EVERYTHING IS DOWN" when it's not true, then it's asking for an action that's outsized to the problem. Error messages are fine, but they just need to be classified and responded to correctly, and noisy alerts are typically the ones that are misclassified and demanding attention they (probably) don't deserve.


The context here is alerts triggering on-call.

If the error is not-actionable, why wake someone up in the middle of the night because of it?

I don't think anyone is rejecting the observability of these errors, but just that there's no point in having it alert/wake someone up unnecessarily.


People do not understand the value of classifying alerts as useful after the fact.

At Netflix we built a feature into our alert systems that added a simple button at the top of every alert that said, "Was this alert useful?". Then we would send the alert owners reports about what percent of people found their alert useful.

It really let us narrow in on which alerts were most useful, so that others could subscribe to them, and which were noise, so they could be tuned or shut off.

That one button alone made a huge difference in people's happiness with being on call.
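The aggregation behind a button like that is simple enough to sketch (function and field names here are hypothetical, not Netflix's code):

```python
from collections import defaultdict

def usefulness_report(votes):
    """votes: iterable of (alert_name, was_useful) pairs collected from a
    'Was this alert useful?' button. Returns {alert_name: percent_useful},
    ready to send to each alert's owner."""
    tally = defaultdict(lambda: [0, 0])  # alert -> [useful_count, total_count]
    for alert, useful in votes:
        tally[alert][0] += int(useful)
        tally[alert][1] += 1
    return {alert: 100.0 * u / n for alert, (u, n) in tally.items()}
```

Alerts trending toward 0% are candidates for tuning or deletion; those near 100% are worth subscribing to.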


I worked on a small team that covered a relatively big site where there were so many alerts it was simply hard to track... They were all sent over email to a group list and most would just delete.

I spent about 3 months, each day triaging into buckets based on activity and dealing with whatever was causing the most alerts that day. Some came down to just stamping out classes of 4xx errors that should never have been in the email/alert system to begin with. Others came down to adding indexes to reduce load/locking/contention on some db tables. Others still were much harder to dig into.

Will say at the end of the 3 months, there was only a trickle of emails a day and the notifications were taken much more seriously after not being so overwhelming as to being ignored altogether.

edit: Dealing with one problem was just the first thing I did each day, before moving on to new feature work... It wasn't assigned, as the company would always prioritize new feature work; it was just something I did for my own sanity.


If each and every alert has an owner, you’ve solved half of the cultural problem already. Good on you!


Yeah, that was one of the goals we had. We try to classify when an alert comes up and let the engineer give us feedback.

We use that to generate a report so that teams have visibility into which alerts are causing the most amount of noise.


It feels to me that using an LLM to classify alerts as noisy is just adding risk instead of fixing the root cause of the problem. If an alert is known to be noisy and has appeared on Slack before (which is how the LLM would figure out it's a noisy alert), then just remove the alert? Otherwise, how will the LLM know it's noise? Either it will correctly annoy you or hallucinate a reason to dismiss the alert as noise.


There is a lot to be said for "smoke test" metrics. Things you expect to have frequent false positives, but are sometimes early indicators of larger problems or indicators of where to look deeper if something else goes sideways. They're not things that should wake you up in the middle of the night, but they're a damn valuable tool to quickly figure out what's actually wrong when a "real" alert triggers.

Many of these lend themselves well to dashboards instead of alerts, but not everything is "dashboardable". Sometimes it's good to have a set of low-priority alerts that are treated differently than others.

E.g. "we're not receiving any data/requests". Sometimes that's just a lull in activity. Maybe a holiday. Sometimes it's because everything _else_ is broken and nothing is getting in (e.g. DNS issues).

With that said, I do think that classification should be made manually and not automatically.


That's why alerts can have different priority levels. So, less serious issues can be addressed during normal working hours (eg. disk is 80% full). Maybe LLMs will figure out the correct P level for something like disk usage, but it's unlikely to get it right for things that are particular to your application. Maybe use an LLM when you're creating the alert to auto fill the priority level. That can then be verified by someone. Don't silence an alert based on what an LLM thinks though.


Yeah, I completely agree. I just meant that alerts you don't immediately respond to or are "noisy" aren't necessarily things you want to delete. Having low priority "noisy" alerts is not a bad thing.


I like incident.io's take on LLMs in incident management, which is essentially: assist, don't decide. [1]

1: https://5x9s.svix.com/p/evolution-of-incident-management


Yeah, that's the goal of adding the context and the report - to hopefully bring awareness to the team that this alert should be removed.

My rationale for flagging the alert was to help prioritization for the on-call (let's say there are multiple alerts going off at the same time).


That’s a people problem and you cannot fix people problems with tech. If no one cares to do the good job of managing alerts putting AI in front of it will not change that.


> you cannot fix people problems with tech

For very specific values of "people problem", "fix", and "tech". In reality, a more true (and relevant) assertion is "appropriate tools can make virtually any problem more tractable."

For example, it takes an annoyed engineer to notice that the same flaky alert keeps going off and is noise. Then it takes non-trivial skill on their part to communicate the need to disable that alert. They will meet non-trivial resistance, because disabling alerts is dangerous. However, if the tools they are using say "This is a noisy alert, it hasn't been useful for 6 months," disabling that alert becomes more of a best practice for the organization.


An AI could help bring to your attention alerts that need managing. I like it for this better than for someone in the moment of receiving an alert deciding whether or not to pay it attention.


If someone ignores alerts they will keep ignoring them, but now you've automated part of the ignoring with "AI", and the human at the end will still ignore alerts just the same.

Writing it out makes me laugh because that's like something from Douglas Adams stories. Automated Ignoring System along with Infinite Improbability Drive.


I was thinking of simply pointing out which kinds of alerts need to be tuned to be less noisy.


The goal for oncall should be to NEVER get called. If someone gets called when they are oncall their #1 task the next day is to make sure that call never happens again. That means either fixing a false alarm or tracking down the root cause of the call. Eventually you get to a state where being called is by far the exception instead of the norm.


This is how my team used to work when I was on call in telecoms a decade ago. In the right engineering culture, and with management buy-in, it works really well.

We deployed a new system and had one week on call for each of five team members. The first couple of rotations were hell. Almost every night ended up with at least one wake up call. As we learned how to solve each type of outage, we then taught the first-line staff how to reboot the right components so we didn’t get as many wake-ups, while we spent our days fixing the bugs. And eventually the system stopped crashing.

The on-call pay was really good (nearly double for that week) and it was a pretty sweet reward to be able to rake that in as calls stopped coming. We broke out a bottle of champagne when the first week of no calls had passed.

Eventually on-call was cancelled.

Imagine how this story would have ended if management had incentivized us differently, for example if you only got the extra pay for the nights where you got pages.


I wish everyone shared your philosophy! I once worked at a company where it was expected to get 10+ pages per day, and worse, a configuration error by a customer success team would trigger an engineering page because the error handling didn't distinguish between a config problem and an actual system issue. It was insane.


Depending on the stakes this is a pretty dangerous attitude. The goal for oncall is to keep the website working, and if you're tuning for "never get paged" then you'll necessarily miss an incident eventually.


If you make your goal as high availability as possible, and you only get paged on outages, then your goal should be to never get paged.

You should be building resilient architectures, not being on firewatch duty.


This is a classic developer vs business incentives misalignment.

Developers don't want to ever be paged because they don't want to be bothered, but the business might be perfectly happy to pay you to be on firewatch duty.

Consider a "low traffic" alert: how can you tell the difference between a slow period at 3am on a holiday and a true outage? You can't without someone getting up and testing whether the site is still up. (Maybe you can automate that check, but there are always edge cases you can't automate.)

OP seemed to suggest it's better to disable the alarm than to just suffer the false alarm every now and then. I doubt very much that the people paying you for the on-call service would agree though.


> Developers don't want to ever be paged because they don't want to be bothered

This is a very reductive statement.

Developers have experienced their best colleagues burning out and leaving jobs because of on-call being completely overwhelming.

Developers want to behave intelligently.

Developers want the system to work.

Developers don’t want to burn their lifespan for false alarms that are being sent because someone didn’t spend 30 seconds thinking about whether a human being needs to be woken up in the middle of the night for whatever widget they’re slapping together.


> The goal for oncall should be to NEVER get called.

Is that not also reductive then? Or maybe my statement pretty accurately captures that sentiment without 4 sentences of explanation.

But no, instead of engaging with the meat of my argument you just reductively attack one sentence.

I get it, I'm oncall right now for my job. I don't like it when alarms go off. I also understand that if I were to tune the alarms so I "NEVER get called" I'd be out of a job soon enough because the business would go under.


Okay, dialing up good-faith engagement.

How would your interpretation change if the article said this instead?

> The goal for oncall should be to continuously tune the system toward having no outages and no false alarms.

FWIW, I did only attack one sentence. This was not exactly intended to be dismissive. It was my reaction to, in my eyes, the weakest part of your argument.


This is classic misalignment of business needs vs. perceived management wants.

Are they paying you to answer false alarms, or are they paying you to make sure the site is available and performant to keep customers happy? Nobody with half a brain wants to answer a bunch of false alarms. Are there people who will happily get paid to ACK yet another noisy alarm just to collect a paycheck? Certainly; but these are button pushers, not problem solvers.

Your low-traffic alert scenario simply requires synthetic requests. This is how you test anything that has low usage but requires high reliability.
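A synthetic check of the kind suggested here can be a few lines; a sketch (the URL and success criteria are illustrative):

```python
import urllib.request

def synthetic_probe(url: str, timeout: float = 5.0) -> bool:
    """Fire a synthetic request so a 'low traffic' alert can be told
    apart from an outage: if this succeeds during a quiet period,
    the site is up and the lull is genuine."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:  # covers URLError, timeouts, connection refused
        return False
```

Run it on a schedule and only page when both real traffic is absent and the probe fails.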


Telecoms solved this problem fifteen years ago when they started automating Fault Management (google it).

Granted, neural networks were not generally applicable to this problem at the time, but this whole idea seems like the same problem being solved again.

Telecoms and IT used to supervise their networks using Alarms, in either a Network Management System (NMS) or something more ad-hoc like Nagios. There, you got structured alarms over a network, like SNMP traps, that got stored as records in a database. It’s fairly easy to program filters using simple counting or more complex heuristics against a database.

Now, for some reason, alerting has shifted to Slack. Naturally since the data is now unstructured text, the solution involves an LLM! You build complexity into the filtering solution because you have an alarm infrastructure that’s too simple.
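For contrast, once alarms are structured records in a database, the "simple counting" filter described above is a one-query affair (the schema and threshold here are illustrative, not from any real NMS):

```python
import sqlite3

def noisy_alarm_keys(db: sqlite3.Connection, threshold: int = 10):
    """Return (source, alarm_type) pairs that fired more than `threshold`
    times in the last hour - candidates for suppression or de-duplication."""
    rows = db.execute(
        """SELECT source, alarm_type, COUNT(*) AS n
           FROM alarms
           WHERE raised_at >= datetime('now', '-1 hour')
           GROUP BY source, alarm_type
           HAVING n > ?""",
        (threshold,),
    )
    return {(source, alarm_type) for source, alarm_type, n in rows}
```

No LLM needed: structured data makes the filter a deterministic query rather than a text-classification problem.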


The alerts being sent to Slack are normally from one of those alert databases (such as Prometheus and AlertManager). Slack isn't the source of truth for them, just a notification channel.


Oh, Prometheus is good for metrics, but it doesn't hold alarms in the Fault Management sense. It only keeps the metrics and thresholds, checks for threshold violations, and then alerts via some mechanism.

If it were an alarm database, an operator would be able to 1. Acknowledge the alarm 2. Manually clear an alarm that was issued in error.

Without those mechanisms, alarm handling becomes really difficult for an ops team, because now all you have is either a string of emails or a chat log.
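A minimal sketch of the acknowledge/clear semantics being described (hypothetical types, not any real NMS API):

```python
from enum import Enum, auto

class AlarmState(Enum):
    RAISED = auto()
    ACKNOWLEDGED = auto()
    CLEARED = auto()

class Alarm:
    """An alarm as a stateful record, not a fire-and-forget notification:
    an operator can acknowledge it or manually clear one issued in error."""
    def __init__(self, key: str):
        self.key = key
        self.state = AlarmState.RAISED

    def acknowledge(self) -> None:
        if self.state is AlarmState.RAISED:
            self.state = AlarmState.ACKNOWLEDGED

    def clear(self) -> None:
        # Manual clear is always allowed, e.g. for alarms raised in error.
        self.state = AlarmState.CLEARED
```

The point is the lifecycle: an ops team works a queue of alarm records with states, rather than scrolling back through a chat log.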


The Wikipedia page for fault management has a "see also" link to alarm management, which looks extremely relevant as well.


Founder of All Quiet here: https://allquiet.app.

We're building a tool in the same space but opted out of using LLMs. We've received a lot of positive feedback from our users who explicitly didn't want critical alerts to be dependent on a possibly opaque LLM. While I understand that some teams might choose to go this route, I agree with some commentators here that AI can help with symptoms but doesn't address the root cause, which is often poor observability and processes.


> Slack-native since that has become the de-facto tool for on-call engineers.

In your particular organization. Slack is one of many instant messaging platforms. Tightly coupling your tool to Slack instead of making it platform agnostic immediately restricts where it can be used.

Other comment threads are already discussing the broader issues with using IM for this job, so I won't go into it here.

Regardless, well done for making something.


As Slack is not end to end encrypted we and I imagine many other companies cannot use it.


Slack is also extremely unreliable with notification delivery.


Try Netherlands. We're Microsoft land over here. Pretty much everyone is on Azure and Teams. It's mostly startups and hip small companies that use Slack.


Startups, hip small companies, and tech-product companies. Most non-tech-product companies and enterprise banks in NL are on Teams.


I really feel like the world would be a better place if it was illegal to bundle Teams like this…


Denmark is the same. Only the smaller startups use Slack. Everyone one else is on Teams.


Thanks for the feedback. We want to get something out quickly and we had experience working with Slack so it made sense for us to start there.

However, the design is pretty flexible and we don't want to tie ourselves to a single platform either.


I don't want to be relying on another flaky LLM for anything mission critical like this.

Just fix the original problem, don't layer an LLM into it.


I agree - fixing the original problem is the main motivation.

We wanted to provide that awareness because a lot of teams aren't fully aware of how bad the problem might be (on-calls change weekly, and there might be a bunch of other issues).


A central service might be in a better position to classify messages compared to lots of individual agents.


Note that according to StackOverflow's dev survey, more devs use Teams than Slack: over 50% were on Teams. (The stat was called popularity but really should have been prevalence, since a related stat showed devs hated Teams even more than they hated Slack.) Teams has APIs too, and with Microsoft Graph working you can do a lot more than just Teams for them.

More importantly, and not mentioned by StackOverflow, those devs are among the 85% of businesses using M365, meaning they have "Sign in with Microsoft" and are on teams that will pay. The rest have Google and/or Github.

This means that despite being a high-value hacking target (accounts and passwords of people who operate infrastructure, like the person owned via Snowflake last quarter), you don't have to store passwords and therefore can't end up on Have I Been Pwned.


Filtering whether a notification is important or not through an LLM, when getting it wrong could cause big issues, is mildly concerning to me...


Almost all alerting issues can be fixed by putting managers on call too (they then also have to attend the fix).

It suddenly becomes a much higher priority to get alerting in order.


I don’t really understand the use case. If there’s a way to programmatically tell that it’s a false alarm, then there must also be a way to not create the alert in the first place.

I’ve never seen an issue that’s conclusively a false alarm without any investigation at all. If it were, you could just delete the alarm. An LLM will never find something like another team accidentally stress testing my service, but that does happen.

Another perfect example is when the queen died and it looked like an outage for UK users. Can your LLM read the news? ChatGPT doesn’t even know if she’s alive

I expect you will need AGI before large companies will trust your product.


Underrated oncall problem that needs solving is scheduling IMHO:

- We have a weekday (2 shifts) / weekend (1 slightly longer shift including Friday morning, to allow people to take long weekends) oncall rotation as well as a group-combined oncall schedule, which gets finicky.

- When people join or leave the rotation, making sure nothing shifts before a certain date or swapping one person with another without changing the rest and other things are a massive pain in the butt

- Combine this with a company holiday list - usually there are different policies and expectations during those.

- Allow custom shift change times for people in different timezones.

- We have "oncall training" / shadowing for newbies, automate the process of substituting them in gradually, first with a shared daytime rotation and then on their own etc.

- Make oncall trades simpler (for when you can't make your shift)

Gripes with PD:

- Pagerduty keeps insisting I'm "always on call" because I'm on level N of a fallback pager chain which makes their "when oncall next" box useless - just let me pick.

- Similarly, pagerduty's google calendar export will just jam in every service you're remotely related to and won't let you pick when exporting, even though it will in their UI. So I can't just have my oncall schedule in google calendar without polluting it to all hell.


Thanks for the feedback! I completely relate to PD scheduling issues and something that we want to take a look at as well.


Big fan of this direction. The architecture resonates! The base lining is interesting, I'm curious how you think about that, esp for bootstrapping initially + ongoing.

We are working on a variant being used more by investigative teams than IT ops - so think IR, fraud, misinfo, etc - which has similarities but also domain differences. If of interest to someone with an operational infosec background (hunt, IR, secops) , and esp US-based, the Louie.AI team is hiring an SE + principal here.


I get your sentiment, but there's another side of this coin that everyone is forgetting, hilariously.

You can tune your monitoring!

Noisy alert that tends to be a false positive, but not always? Tune the alert to only send if the issue continues for more than a minute, or if the check fails 3 times in a row. There are hundreds of ways to tweak a monitor to match your environment.

Best of all? It takes 30 seconds at most. Find the trigger, adjust slightly, and after maybe 1-2 tries, you'll be getting 1 false positive sometimes, and actual alerts when they happen, compared to 99% false alerts all the time.
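For illustration, the "fails 3 times in a row / persists for more than a minute" tuning could be sketched like this (a toy example under those assumptions, not tied to any particular monitoring product):

```python
import time
from typing import Optional


class AlertDebouncer:
    """Toy alert filter: only fire when the check has failed `threshold`
    times in a row AND the condition has persisted for at least
    `min_duration` seconds. A single success resets everything."""

    def __init__(self, threshold: int = 3, min_duration: float = 60.0):
        self.threshold = threshold
        self.min_duration = min_duration
        self.consecutive_failures = 0
        self.first_failure_at: Optional[float] = None

    def observe(self, check_ok: bool, now: Optional[float] = None) -> bool:
        """Feed one check result; return True when an alert should fire."""
        now = time.monotonic() if now is None else now
        if check_ok:
            # A blip that recovers never pages anyone.
            self.consecutive_failures = 0
            self.first_failure_at = None
            return False
        if self.first_failure_at is None:
            self.first_failure_at = now
        self.consecutive_failures += 1
        return (self.consecutive_failures >= self.threshold
                and now - self.first_failure_at >= self.min_duration)
```

The same idea is built into most monitoring tools (e.g. a "for" duration or an evaluation-count setting), so usually this is a config change, not code.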

Oh and did you know any monitoring solution worth its salt can execute things automatically on alerts, and then can alert you if that thing fails?

Also, Slack is not a de facto anything. It's a chat tool in a world of chat tools.


I love this space; stability & response! After my last full-time gig, I was also frustrated with the available tooling and ONLY wanted an on-call scheduling tool with simple calendar integration. So I built: https://majorpager.com/ Not OSS, but very simple and hopefully pretty straightforward to use. I'm certainly wide open to feedback.


In my current workplace (BigCo), we know exactly what's wrong with our alert system. We get alerts that we can't shut off, because they (legitimately) represent customer downtime, and whose root cause we either can't identify (lack of observability infrastructure) or can't fix (the fix is non-trivial and management won't prioritize).

Running on-call well is a culture problem. You need management to prioritize observability (you can't fix what you can't show as being broken), then you need management to build a no-broken-windows culture (feature development stops if anything is broken).

Technical tools cannot fix culture problems!

edit: management not talking to engineers, or being aware of problems and deciding not to prioritize fixing them, are both culture problems. The way you fix culture problems, as someone who is not in management, is to either turn your brain off and accept that life is imperfect (i.e. fix yourself instead of the root cause), or to find a different job (i.e. if the culture problem is so bad that it's leading to burnout). In any event, cultural problems cannot be solved with technical tools.


I work on a team which runs hyper critical infra on all production machines at BigCo and have the same experience as you.

The problem is not the alerts — the alerts actually are catching real problems. The problem is the following:

1. The team is understaffed, so sometimes spending a few days root-causing an alert is not prioritized.
2. When alerts are root-caused, sometimes the work to fix the root cause is not prioritized.
3. A culture on the team which allows alerts to go untriaged due to desensitization.

Our headcount got reduced by ~40% and — surprise surprise — reliability and on-call got much worse. Senior leadership has made the decision that the cost cuts are worth the decreased reliability so nothing is going to change.

The job market is rough so people put up with this for now.


When describing infrastructure, words matter. When you describe something as “hyper critical infrastructure” it implies that tens to thousands of human beings will die within seconds of failure of said “hyper critical” infrastructure. The way the rest of your comment is worded implies that’s not what you’re actually describing and makes the words “hyper critical infrastructure” irresponsible for you to use.

I don’t mean to imply there is some kind of failure magnitude competition, I just want to reinforce that software “engineering” already has a huge problem with abject neglect of the learnings that other sign-and-stamp engineering fields have already learned from and fixed. Us code slingers are not in uncharted territory, we just need to learn from our predecessors and peers that build literal bridges and towers and force management to treat our field in the same way.


Hyper critical means that if it stops working, potentially billions of dollars are lost for the employees and shareholders. Given that the FAA values a human life at $9 million, that actually fits your arbitrary criteria of what I am allowed to call my job.


Words matter, but so does context. You weren't confused by the words here, so why assume others would be?


we as an industry need to have engineering management types realize that we cannot prioritize roadmap to the complete detriment of reliability


If your org claims to be "customer obsessed" then reframe your alerts in terms of their impact on customers. Don't say "elevated 502 errors", say "customers encountered errors X times."


Start putting together conference bridges for "P1 customer outages" and have someone who is responsible for calling the developers, PMs, scrum masters, managers, etc. on the team and getting them all on at 1 AM to fix it.


It sounds like you forgot to make an SLO? If an alert is not actionable because it's impossible to resolve, even though it has customer impact, then it should be an SLO, not an alert.


SLO as in “service level objective”? How does defining an SLO stop the existence of alerts?


> Running on-call well is a culture problem. You need management to prioritize observability (you can't fix what you can't show as being broken), then you need management to build a no-broken-windows culture (feature development stops if anything is broken).

I was lucky enough to join a company where management does this. The managers were made to do this by experienced engineers who explained to them in no uncertain terms that stuff was broken and nothing was being shipped until things stopped being broken. Unless you have good managers this won’t happen without a fight and it’s a fight I think we as engineers need to take.

Some managers in other teams played the “oh it’s not super high impact it’s not prioritized” game, and those teams now own a bunch of broken stuff and make very slow progress because their developers are tiptoeing around broken glass, and end up building even more broken stuff because nothing they own is robust. Those managers played themselves.

Communication with management is bidirectional, sometimes they need a lot of persuasion.


> Communication with management is bidirectional, sometimes they need a lot of persuasion.

Sounds like managing up, i.e. doing IC workload and the manager's job. Hard pass.


If you'd rather be miserable at work instead of content at work, that's a choice.


I tried that approach with a colleague and it just got more and more heated and frustrating. At the same time we were getting heat for reliability. I ended up quitting. Since then I heard from a colleague that they made some staff redundant, on a team that was already underwater.

I doubt very much that my experience was unique. In my new position we have the same problems with reliability but I don’t get involved in the political side of trying to argue about it, just turn up and do my 9-5. I’m a lot less stressed now!


That’s true. But technical tools can help you highlight culture problems so that they’re easier to discuss and fix. It’s been a minute since I’ve had to process exactly the kind of on-call/alert problem we’re discussing here, but this does feel like the kind of tool that would help sell the kinds of management/culture changes necessary to really improve things, if not fix all of them.


Switching tools, or adopting new (unproven) ones doesn't address or fix the communication issue.

The existing tools mentioned can show the metrics. Management needs an education - and that is part of the engineering job.


> Management needs an education - and that is part of the engineering job

Isn’t that bizarre? In all my years as an engineer I can count the number of managers that went to learn about engineering by themselves, on one hand.

It’s literally their job, but somehow they feel they can do it without understanding it.


I don't think it is bizarre. I see lots of MBAs running things. They don't have the engineering background, they have the "resources management" background.

I think the engineer brings the numbers to management so that management can decide the course.

I prefer the situation where the CTO has no MBA and worked their way up - but that is uncommon IME.

So, in many orgs, the engineer puts their comms hat on and presents a solid case.

The engineer who can communicate well and show the metrics is typically the one who gets promoted to the decision-maker role: first from the bottom up, then as a great leader.


> I don't think it is bizarre. I see lots of MBAs running things. They don't have the engineering background, they have the "resources management" background.

That part is fine. What I do not understand is why there is so little interest in learning what makes engineering different from running a widgets factory.

“Tell me why it won’t work” is a fine question, but it’d be nice if I didn’t have to force all their education on them.

E.g. how many managers ignore that oft repeated adage that 9 women cannot have a baby in a month, and just spam more people on a project in the hope it’ll go faster.


Or maybe page your managers, so that they can escalate the situation. They will be more aligned on solving the cultural problems if they get woken up too.


I have half-jokingly suggested that an out-of-hours page should cost the company $10k to incentivise actually fixing problems rather than releasing broken products. But I haven't thought of a way of getting around the perverse incentive to create bugs in order to get the $10k


One company I worked for introduced a trivial ($100 or something) gift card bonus for closing a certain number of bug tickets.

The number of people who started pushing code with subtle bugs so they could create a ticket for it, fix their own bug, and get closer to that $100 gift card was shocking to me.

I can’t imagine the chaos that would occur if something came with a $10K bonus attached. Some people will bend over backward to get even tiny rewards. Dangling a $10K reward would get the wheels turning in their heads immediately.


The cost is the cost of paying you to fix outages on overtime pay instead of working on the product.


Overtime pay? Is this common for oncalls? Never gotten it myself, every time I ask they reply with "just take the time back on another day" as if my time is fungible. Weekend time is worth far more to me than weekday time


Some companies do on-call bonuses, overtime pay for on-call incidents, or other schemes.

In my experience, it’s not a net win. They’ve budgeted the same amount for compensation either way, so you’re probably getting lower base comp if they’re allocating some of it for on-call.

It also creates an atmosphere where on-call becomes more normalized, because you’re getting paid extra to do it. Some people, usually young single people, will try to milk the overtime for as much as they can, dragging out the hours spent doing on-call work because every extra hour spent on the problem makes their paycheck bigger.


Or maybe page your managers, such that they can fire you


Then... problem solved!


Yeah, the best managers I worked with used to be on the same on-call rotation, such that they would also get paged every time. That helped build empathy and visibility into the situation.


Wouldn't the manager of one team be part of every shift in such a setup?


I completely agree that technical tools cannot fix culture problems.

However, one of the things that I noticed in my previous companies was that my management chain wasn't even aware that the problem was this bad.

We also wanted to add better reporting (like the alert analytics) so that people have more visibility into the state of alerts + on-call load on engineers.

What strategies have worked well for you when it comes to management prioritizing these problems?


Show them the costs! Wasted time, wasted resources, wasted money. Show the waste and come with the plan to reduce the waste. Alerts, on-calls and tests are all waste reduction.

"We're paying down our technical debt"


>However, one of the things that I noticed in my previous companies was that my management chain wasn't even aware that the problem was this bad.

Isn't that a cultural problem?


Obviously, the best way to get management's attention is to start a stop and frisk customer engagement plan.


Nice work, I always appreciate the contribution to the OSS ecosystem.

That said, I like what you're saying out loud with this. Slack and other similar comm tooling has always been advertised as a productivity booster due to its 'async' nature. Nobody actually believes this anymore, and coupling it with on-call notifications really closes the lid on that thing.


Yeah, unfortunately, I don't think these messaging tools are async. During on-call, I used to pretty much live on Slack. Incidents were on Slack, customer tickets on Slack, debugging on Slack...


That is correct, they are not. My former workplace had Pagerduty integrated with Slack, so I get it...


co-founder of merlinn here: https://merlinn.co | https://github.com/merlinn-co/merlinn We're also building a tool in the same space with the option of choosing your own model (private llms) + we're open source with a multitude of integrations.

good to see more options in this space! especially OS. I think de-noising is a good feature given alert fatigue is one of the repeating complaints of on-callers.


Nice job and congratulations on building this! It looks like your copy is missing a word in the first paragraph:

> Opslane is a tool that helps (make) the on-call experience less stressful.


derp, thanks for catching. It has been fixed!


We could stop normalising "on-call" instead.


Could you please elaborate?


Yes, increasingly companies are pretending like their SAAS needs to run with at least 99.999% uptime and so are insisting that all their engineers/programmers/whatevers must therefore be happy to be on-call on a rota for no extra pay because of vagueness in their contracts.

Meanwhile they either have a global workforce so don't actually need to have anyone on-call or only have customers in countries they have employees in.

It's bullshit.

Either companies should be up front about this when hiring or it should be optional and paid.

Or, they can use their engineering talent, just like telecoms companies have been doing for ever, to engineer their products to be more resilient and automate failure cases so proper remediation can wait until working hours.


can you build a cheaper datadog instead?


You have plenty of options. Some of them are open source: https://signoz.io/ https://coroot.com/

Did you search for tools that are cheaper than DD?


we've leveraged Clickhouse/S3 to build a cost effective alternative to Datadog at https://hyperdx.io (OSS, so you can self-host as well if you'd like)


Every time I see notifications in Slack / Telegram it makes me depressed. Text messengers were not designed for this. If you get a "something is wrong" alert, it becomes part of the history; it won't re-alert you if the problem is still present. And if you have more than one type of alert, it will be lost in history.

I guess alerts to messengers are OK as long as it's only a couple of manually created ones, and there should be a graphical dashboard to learn about the rest of the problems.


I would expect anything notifying via Slack or text would have an accompanying incident ticket in the system of record.

We had a rule in my team (before a management change that blew it all to shit) that we don't use email or messaging for monitoring. Everything goes into the SOR. Once it's in the SOR, if people want emails, texts, or whatever to let them know there is work to do, that's up to the team. Others would make dashboards... lots of options once it's in the system, and nothing gets lost.

For example, I went from a team that looked at tickets all day to one that mainly worked on user stories in Jira. Because no one was looking at the incidents in the SOR, things were getting missed. I wrote something to check for incident tickets assigned to our team every hour, and it would post them in our team chat so people knew there was work to do. Then once per day, it would post everything still unassigned, so if something was lost on that hourly post, it would annoy everyone once per day until it was assigned/resolved. It worked out decently well. If there was a lot of stuff, it would post a message to have someone actually login to the SOR and look at all our tickets. I would sometimes use the standup to assign stuff out and get some attention on it, if things were getting bad.
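A sketch of that kind of hourly/daily nag job (the `Incident` type and message-building function are hypothetical stand-ins; the real ticket-system and chat APIs are omitted):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    ticket_id: str
    summary: str
    assignee: Optional[str] = None


def nag_message(incidents, daily: bool = False) -> Optional[str]:
    """Build the chat message for a batch of open incidents.

    Hourly run (daily=False): list everything open for the team.
    Daily run (daily=True): list only what is still unassigned, so
    lost items keep annoying everyone until someone picks them up.
    Returns None when there is nothing to say."""
    if daily:
        incidents = [i for i in incidents if i.assignee is None]
    if not incidents:
        return None
    if len(incidents) > 10:
        # Too many to list in chat: send people to the system of record.
        return f"{len(incidents)} open incidents -- please triage in the SOR."
    header = "Still unassigned:" if daily else "Open incidents for the team:"
    lines = [f"- {i.ticket_id}: {i.summary}" for i in incidents]
    return "\n".join([header] + lines)
```

The scheduling itself can be a plain cron job; the value is that the chat message is derived from the SOR rather than being the record.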


Why? We send alerts to Slack and Pagerduty. Slack is to help everyone who might be working, PagerDuty alerts the persons who are actually in charge of working on it.


Yeah, I think it’s convenient. We use email, but for the same thing. If I inadvertently break something, I’ll have an email in my inbox 5 minutes later.


I don't think Slack (or similar) should be a primary alert mechanism.

But, if alarms are configured in a clean way, ideally your team is getting some warnings and such there and then if there's an alert that needs to actually page, it sends that to PagerDuty or whatever platform you use along with another message to Slack.


THIS. Whispering into a slack channel off hours isn’t a way to get on-call support help nor is dropping alerts in one. If it’s a critical issue I’m going to need a page of some kind. Either from something like PagerDuty or directly wired up SMS messaging.


Yeah, I agree that Slack is not the best medium for alerts. I think it has become the default in teams because it makes it easy to collaborate while debugging. I don't know a good way to substitute that and share information.

What strategies have you seen work well?


I might have been lucky, most of my companies were big enough to have a dedicated person to watch the dashboard 24/7. And a human will mostly know when it's a good idea to escalate and wake up the rest of the team

I have no idea what's the best setup for small companies


is this only on the frontpage because this is an HN company?


if this is an open-source project, how are you planning to make it a sustainable business? also, why the choice of Apache 2.0?


how can you prove it works and doesn't hallucinate? do you have any actual users that have installed it and found it useful?


Shameless question tangentially related to the topic.

We are based in Europe and have the problem that some of us sometimes just forget we're on call or are afraid that we'll miss OpsGenie notifications.

We're desperately looking for a hardware solution. I'd like something similar to the pagers of the past, but at least here in Germany they don't really seem to exist anymore. Ideally I'd have a Bluetooth dongle that alerts me on configurable notifications on my phone. Carrying this dongle for the week would be a physical reminder I'm on call.

Does anyone know anything?


A candy bar cell phone, paid for by your employer and handed to whoever is on call. People who don't want it can just forward it to their phone.


Yup, this is something we've thought about as well. We're a remote company so everyone would need to get their own but this option is definitely on the table.


In this case, a satellite-enabled candybar. The disaster recovery policy and budget should be applicable here. Make sure it's able to share xG and satellite tunnels for maximum value, and ensure the reporting system is satellite-enabled also. Added points if it's sending alerts to your handy BYOD. Disaster recovery is a big deal in 2024. All sorts of factors make satellite redundancy valuable in today's reality: coworkers on a hike or a boat, random 0-day stuff, and war can all cut your normal internet. I have experienced all of these in just the last 4 years, more than once on each topic. Train your users to destroy it in case of war, as it's trackable by military tech. Put a sticker on it. Check out StackExchange for questions like this, though?


This really sounds like a _you_ problem and not something you need hardware to fix.

You can already enable silence/focus time bypass modes for apps like PagerDuty and such…

If you can’t develop some sense of responsibility to check if you’re on-call, frankly you have no business being in an on-call role.

No hardware will make you or your engineers more diligent. The only reason pagers made more sense than phones is/was because of protocol reasons NOT because it’s some separate device.


Yup, this is absolutely a "me" problem. Life just gets in the way. I'm on call, the kids ask to go to the pool, I leave my phone in the locker forgetting that I'm on-call. I visit friends, forget to bring my laptop etc.

I realize it's a "me" problem and therefore I'm looking for a solution. Others in the company have the same problem. That said: This is my very own company and I have a great sense of responsibility but I also have a shit-ton of other things in my head and I'm not the only one with this issue here.

The silence etc. bypass doesn't always work (I commented in another thread).


Wow, stackoverflow vibes here.


There are phone apps that can pierce through all silent or DND settings. Get one of those. If the same app could buzz at less than 50% battery to remind you to charge, that would help. The same app could also request confirmation of on-call status so they don't forget. If they don't confirm, someone else gets the shift.


OpsGenie has that. We use it at my job. I'm not sure what problem OP is having. The phone call, text message, and app alert from OpsGenie are more than enough. The notification configuration is extremely flexible and each user can customize it as needed. From a user perspective, I don't know what else you could want.

I have no affiliation with OpsGenie outside of using at work.


We have OnePlus users with issues: https://support.atlassian.com/opsgenie-android/docs/troubles...

And we have one user with another brand that also locks down the notification/alert settings and kills apps in an attempt "to save battery", which can't be controlled.


> * Alert volume: The number of alerts kept increasing over time. It was hard to maintain existing alerts. This would lead to a lot of noisy and unactionable alerts. I have lost count of the number of times I got woken up by alert that auto-resolved 5 minutes later.

I don't understand this. Either the issue is important and requires immediate human action -- or the issue can potentially resolve itself and should only ever send an alert if it doesn't after a set grace period.

The way you're trying to resolve this (with increasing alert volumes) is the worst approach to both of the above, and improves nothing.


I feel like this would be a great tool for people who have had a much better experience of On Call than I have had.

I once worked for a string of businesses that would just send everything to on call unless engineers threatened to quit. Promised automated late-night customer sign-ups? Haven't actually invested in the website so that it can do that? Just make the on call engineer do it. Too lazy to hire offshore L1 technical support? Just send residential internet support calls to the On Call engineer! Sell a service that doesn't work in the rain? Just send the on call guy to site every time it rains so he can reconfirm that yes, the service sucks. Basic usability questions that could have been resolved during business hours? Does your contract say 24/7 support? Damn, guess that's going to On Call.

Shit even in contracting gigs where I have agreed to be "On Call" for severity 1 emergencies, small business owners will send you things like service turn ups or slow speed issues.


That's why it's always at least double time for call outs


One of the “no-bullshit” positions I have arrived at over the years is that “real-time is a gimmick”.

You don’t need that Times Square ad; only 8-10 people will look up. If you just want the footage of your conspicuous consumption, you have been able to easily photoshop it for decades already.

Similarly, chat causes anxiety and lack of productivity. Threaded forums like HN are better. Having a system to prevent problems and the rare emergency is better than having everyone glued to their phones 24/7. And frankly, threads keep information better localized AND give people a chance to THINK about the response and iterate before posting in a hurry. When producers of content take their time, this creates efficiencies for EVERY INTERACTION WITH that content later, and effects downstream. (eg my caps lock gaffe above, I wont go back and fix it, will jjst keesp typing 111!1!!!)

Anyway people, so now we come to today’s culture. Growing up I had people call and wish happy birthday. Then they posted it on FB. Then FB automated the wishes so you just press a button. Then people automated the thanks by pressing likes. And you can probably make a bot to automate that. What once was a thoughtful gesture has become commoditized with bots talking to bots.

Similar things occurred with resumes and job applications etc.

So I say, you want to know my feedback? Add an AI agent that replies back with basic assurances and questions to whoever “summoned you”, have the AI fill out a form, and send you that. The equivalent of front-line call center workers asking “Have you tried turning it on and off again” and “I understand it doesn’t work, but how can we replicate it.”

That repetitive stuff should be done by AI, building up an FAQ Knowledge Base for bozos, and it should only bother you if it comes across a novel problem it hasn’t solved yet, like an emergency because, say, there’s a Windows BSOD spreading and systems don’t boot up. Make the AI do triage and tell the difference.


Really cool!

Anyone know of a similar alert UI for data/business alarms (eg installs dropping WoW, crashes spiking DoD, etc)?

Something that feeds off Snowflake/BigQuery, but with a similar nice UI so that you can quickly see false positives and silence them.

The tools I’ve used so far (mostly in-house built) have all ended in a spammy slack channel that no one ever checks anymore.


https://github.com/keephq/keep (disclaimer - i'm the maintainer)


Is this for missile defense systems or something? What's possibly so important that you need to be woken up for it?


System goes down or degrades in some other way at night and important customers with a different timezone get angry, threatening to leave? (happened with us a few times)

But I wouldn't use an LLM for it due to hallucinations.


What's so important that your customers in other timezones feel like waking you up? If they're ready to walk that fast, you can't trust them not to ditch you for an alternative as soon as they find one.


Well, I guess it depends on the business. I forgot to mention that we're B2B. For example, suppose a large food chain or a major bank has an important exam scheduled for their employees on a specific day. If our platform has a blocking bug, no one can proceed (some may be sitting in the class) because the developers are too busy sleeping. Some of our clients are also airplane pilot certification authorities, which have stricter requirements. When there's an alert, you never know if it affects small clients or large clients too.

You don't have to be a missile defense system to require a stable system where devs respond quickly...


But those are the kinds of scenarios where I imagine the sun still came up if comparable disruptions occurred prior to our current era of constant connectivity. We're too invested in the myth that our special problem can't wait half a day.


I'm not sure if you are trolling or genuine, but obviously it is worth waking someone up (someone who is specifically paid to be available to be woken up) when resolving the issue now, instead of in half a day, prevents enough cost.


> I'm not sure if you are trolling or genuine

Odd, because I'm not sure if you are either.

Very few things are so important or costly, and if you're winding up in that situation frequently enough to rob people of their personal time to manually deal with it, clearly something is majorly wrong with this hypothetical critical thing at an architectural level.

There's nothing controversial about this.


> Very few things are so important or costly

Any outage that costs more than paying an engineer to be on-call is worth it. It's not that complex. If an outage blocks 10000 people from doing their work, it's obviously worth it to wake someone up to try to resolve it half a day sooner. (Someone you've been paying specifically for this purpose!)
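
Back-of-envelope, with every number below invented for illustration:

```python
# Toy cost comparison for the argument above. All figures are assumptions,
# not data from anyone's actual business.
blocked_people = 10_000
hours_saved = 12             # "half a day sooner"
loaded_cost_per_hour = 40.0  # assumed fully-loaded hourly cost per blocked worker

outage_cost_avoided = blocked_people * hours_saved * loaded_cost_per_hour
oncall_stipend_per_week = 500.0  # assumed on-call pay

print(outage_cost_avoided)                             # → 4800000.0
print(outage_cost_avoided > oncall_stipend_per_week)   # → True
```

Even if you cut those assumptions by an order of magnitude, the stipend is a rounding error next to the outage.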

> rob people of their personal time

Being paid to be on-call is not your personal time.


Ah, it's a difference in lived experiences. I was certainly never compensated for it when I had to do it.


You don't have to trust them not to ditch you. That's what agreements are for. And your track record is part of what makes them sign and extend the agreement.


On the Internet, with the war in Ukraine, that's entirely possible!


What you've come up with looks helpful (and may have other applications as someone else noted), but you know what also makes on-call suck less? Getting paid for it, in $ and/or generous comp time. :-)

https://betterstack.com/community/guides/incident-management...

Also helpful is having management that is responsive to bad on-call situations and recognizes when capable, full-time around-the-clock staffing is really needed. It seems too few well-paid tech VPs understand what a 7-Eleven management trainee does, i.e., you shouldn't rely on 1st shift workers to handle all the problems that pop up on 2nd and 3rd shift!


I guess 7-Eleven management trainees know that their company is just as replaceable to their employees as their employees are to the company.


Don't send an alert at all unless it is actionable. Yes, I get it, you want alerts for everything. Do you have a runbook that can explain to a complete novice what is going on and how to fix the problem? No? Then don't alert on it.

The only way to make on-call less stressful is to do the boring work of preparing for incidents, and the boring work of cleaning up after incidents. No magic software will do it for you.


Using LLMs to classify noisy alerts is a really clever approach to tackling alert fatigue! Are you fine tuning your own model to differentiate between actionable and noisy alerts?
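
For context, the cheap non-LLM baseline is something like scoring each alert from its history. A toy sketch using the two signals the launch post mentions (alert frequency and how quickly past firings resolved); every name and threshold here is invented:

```python
# Hypothetical pre-filter for "noisy" alerts based on alert history.
# Thresholds are made up for illustration, not from any real system.
from dataclasses import dataclass

@dataclass
class AlertHistory:
    fires_per_week: float             # how often this monitor has fired
    median_minutes_to_resolve: float  # how quickly past firings resolved
    auto_resolved_ratio: float        # fraction that resolved with no human action

def is_probably_noisy(h: AlertHistory) -> bool:
    """Cheap heuristic to run before asking an LLM (or a human) to look."""
    if h.auto_resolved_ratio > 0.8 and h.median_minutes_to_resolve < 10:
        return True   # flaps and self-heals: classic noise
    if h.fires_per_week > 20:
        return True   # fires so often nobody treats it as signal anymore
    return False

print(is_probably_noisy(AlertHistory(3, 5, 0.9)))    # flappy auto-resolver
print(is_probably_noisy(AlertHistory(1, 120, 0.1)))  # rare and slow: needs a human
```

The interesting part is whether the LLM beats a dumb heuristic like this by enough to justify the false-positive risk.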

I'm also working on an open source incident management platform called Incidental (https://github.com/incidentalhq/incidental), slightly orthogonal to what you're doing, and it's great to see others addressing these on-call challenges.

Our tech stacks are quite similar too - I'm also using Python 3, FastAPI!


Why not use statistics? I've been reading about XmR charts recently on Commoncog. That might help, for example.
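
e.g. the XmR recipe is just mean ± 2.66 × the average moving range. A toy sketch with made-up numbers:

```python
# Minimal XmR (individuals and moving range) chart: flag points outside
# the natural process limits mean ± 2.66 * average moving range.
# The 2.66 constant is the standard XmR scaling factor; the series below
# is invented for illustration.
def xmr_limits(xs):
    mean = sum(xs) / len(xs)
    moving_ranges = [abs(b - a) for a, b in zip(xs, xs[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    return mean - 2.66 * mr_bar, mean + 2.66 * mr_bar

def out_of_control(xs):
    lo, hi = xmr_limits(xs)
    return [x for x in xs if x < lo or x > hi]

daily_installs = [100, 103, 98, 101, 99, 102, 100, 60]  # last point is a real drop
print(out_of_control(daily_installs))  # → [60]
```

Something like this would let a dashboard separate routine wiggle from a genuine WoW drop without any tuning per metric.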


I wouldn’t say it’s particularly clever. It’s a fairly obvious idea to anyone who has worked with alerting through IMs. What it is, is /difficult/, because you really really need to avoid false positives. Probably lots of hard work involved. So kudos for making this work (if it works)!


I'm curious about incidental :) how are you going to compete with other, well established IM tools like rootly, incident.io, firehydrant.com?


Thanks for the feedback! I saw the incidental launch on HN and have been following your journey!




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: