Author here: I should clarify the satellite is not running Windows. Instead, it’s running its own custom OS, written in C and called Flight Software (FSW), designed specifically for the satellite’s onboard computer.
Re-reading the post, I see how the title, my analogies, and poor attempts at humor would give the incorrect description of what’s happening with the satellite when it enters safemode. I’ll amend the post soon.
Thanks for the feedback, I’ll be better next time.
Could I ask you to clarify why avoiding safemode is so important? In a non-satellite system, safemode means everything is driven to a safe state, which is fine during testing in the lab.
Also, do you not run these tests in an even more simulated environment where there is only the flight computer and no real hardware at all?
Having discussed this same question with the more experienced members of my team, the only conclusion I can draw is that the customer (US Government) is incredibly risk averse. Any unexpected entry into safemode would require a report, multiple meetings with the customer, and them being pretty angry. Their line of reasoning seems to be "Safemode->Something is wrong->Why is something wrong? We're not paying you to be wrong". I'm personally of the opinion that safemode isn't that bad. It's fully recoverable and shows the system is working properly.
We normally have a Functional Test Assembly (real computer and some other hardware for testing) to run our tests against, but we only have one setup and it is consistently unreliable. This particular CLT was unable to get a clean run in the lab but it was decided that the issues were related to the lab setup rather than the actual test, so we moved forward to run on the satellite (against our team's protests).
This to me is the real crux of the issue: if we can't even trust our own testing environment, what's the point of having it at all? If the customer is so risk averse, why would we take this chance? Needless to say, I don't think we'll be running anything on the satellite without full FTA vetting anytime in the near future.
> Any unexpected entry into safemode would require a report, multiple meetings with the customer, and them being pretty angry. Their line of reasoning seems to be "Safemode->Something is wrong->Why is something wrong? We're not paying you to be wrong". I'm personally of the opinion that safemode isn't that bad. It's fully recoverable and shows the system is working properly.
To the last part first: Good that safe mode kicked in and did the right thing, but now what? What caused it to enter safe mode in the first place?
That's why they care when it happens. If they don't know why it's entering safe mode, they can't correct the actual problems in the system.
"Safemode is when all non critical functions are automatically shut down and the satellite becomes entirely focused on generating power by pointing its solar panels towards the Sun and trying to reestablish any communication that was lost."
The non-critical functions are all the things the customer actually bought the satellite for. Cool that it's still alive, but now the Space Internet / death lasers / etc. are offline.
There are fault IDs that trip if certain telemetry goes outside of a normal range. If a safemode were to occur, we would investigate which faults tripped and at what time, and use those to construct a "story" of what happened on the satellite before it entered safemode. We're also constantly recording all the telemetry that comes down, so we could reference any telemetry we wanted as far back as months in the past.
To your point, yes, you're correct. The cause of the safemode is much more interesting than the fact we entered it.
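A rough sketch of that idea, for anyone curious what "which faults tripped and at what time" could look like in code. This is not our actual tooling: the fault IDs, channel names, and limits below are all made up for illustration, and the real fault logic lives in the flight software rather than in a script like this. It only shows the general pattern of checking recorded telemetry against per-fault ranges and ordering the first trips into a timeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultLimit:
    fault_id: str   # hypothetical fault name
    channel: str    # hypothetical telemetry channel it watches
    low: float      # lowest value considered normal
    high: float     # highest value considered normal


@dataclass(frozen=True)
class Sample:
    time_s: float   # seconds since start of the recording
    channel: str
    value: float


def tripped_faults(samples, limits):
    """Return (time_s, fault_id, value) for the first out-of-range
    sample of each fault, sorted by time."""
    limits_by_channel = {}
    for limit in limits:
        limits_by_channel.setdefault(limit.channel, []).append(limit)

    first_trip = {}
    for sample in sorted(samples, key=lambda s: s.time_s):
        for limit in limits_by_channel.get(sample.channel, []):
            in_range = limit.low <= sample.value <= limit.high
            if not in_range and limit.fault_id not in first_trip:
                first_trip[limit.fault_id] = (
                    sample.time_s, limit.fault_id, sample.value)
    return sorted(first_trip.values())


# Illustrative limits and recorded telemetry (all values invented).
LIMITS = [
    FaultLimit("F_BUS_UNDERVOLT", "bus_voltage_v", 26.0, 34.0),
    FaultLimit("F_WHEEL_OVERSPEED", "wheel_speed_rpm", -6000.0, 6000.0),
]
SAMPLES = [
    Sample(0.0, "bus_voltage_v", 28.1),
    Sample(1.0, "wheel_speed_rpm", 6100.0),
    Sample(2.0, "bus_voltage_v", 25.4),
    Sample(3.0, "bus_voltage_v", 24.9),
]

timeline = tripped_faults(SAMPLES, LIMITS)
for time_s, fault_id, value in timeline:
    print(f"t={time_s:6.1f}s  {fault_id}  value={value}")
```

Reading the timeline top to bottom gives the "story": in this invented data the wheel overspeed trips first, and the bus undervoltage follows a second later.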
> We normally have a Functional Test Assembly (real computer and some other hardware for testing) to run our tests against, but we only have one setup and it is consistently unreliable
It's interesting to see that someone with a 2B budget has the same problem as someone with a 5 million budget... we have an engineering model for our cubesats but it's flaky
I enjoyed the humour, and the content. Personally I wouldn’t change it - it’s kind of a click-bait title, but I never would have read the article if it had a boring title, and I am glad I read it.
Can you speak at all as to how the development on this software is done? Is it distributed with centralized version control? Do the release and engineering processes interact with the version control at all? Are there mechanisms that link defect reports, corrections, and sign-offs back to version control and into the build system?
I got lost recently in how the Shuttle software was managed, mostly through IBM mainframes, and z/OS's facilities for all the above. I'm curious how modern development looks in comparison.
> I got lost recently in how the Shuttle software was managed, mostly through IBM mainframes, and z/OS's facilities for all the above. I'm curious how modern development looks in comparison.
Do you have any references for this? I also recently went down a research rabbit hole of the history of computing on Earth and in space - super interesting stuff. And the parallels are quite obvious when you look at it.
> And the parallels are quite obvious when you look at it.
The insane level of detail and strategy when writing the shuttle software is something to behold. The testing laboratory SAIL was a full-scale orbiter that actually flew test missions. "Day of use I-Loads" are one of my favorite things. They couldn't change the software load, but they could move some constants around before launch, which was really useful for feeding wind data into the shuttle before it launched.
FSW development is done by a different team than mine, but I believe it's just managed through GitLab. Releases are done through tags, and any updates that need to be made have tickets created for them and are developed by the FSW team. Final approval is given by certified product engineers and then a new tag is created for that release.
Like I said, this is a different team, but from what I've seen the process is fairly modern given how old our hardware is. I'm not sure of the exact process of how it's loaded onto the satellite, though.
Technical blog pro tip: Assume that many of your readers are VERY literal-minded, and many of your other readers like their humor obscure and as deadpan as possible. Sorry.