On Hubris and Humility: developing an OS for robustness in Rust [video]


> Unlike a signal-handler, it never alters the control-flow of the code you read on the page...Code you write in a task executes as written, or fails. Nothing in the system will arbitrarily alter your program's control-flow from what appears on the page...Now, when phrased like that, it seems weird to have to say it out loud, like, don't all programmers pretty much assume that the code that they write runs the way they wrote it? And to that I say yes, we absolutely do, and that assumption is often wrong, and leads to common classes of bugs, starting with data races and moving on up.

Wow, this is really insightful


It really puts the task into the hands of the programmer. The way it works (if I understand correctly from the docs), the programmer of a task needs to make sure it checks for notifications when it wants to, and controls that process.

This gives the programmers some rope to hang themselves with, but on the other hand you don't get shot by the OS when you don't want to be.
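To make that concrete, here's a toy model of the cooperative flow as I understand it — nothing below is the real Hubris API; `Task`, `check_notifications`, and the bitmask layout are all invented for illustration:

```rust
// Hypothetical sketch (not the real Hubris API): a task that decides for
// itself when to look at pending notification bits, instead of having a
// handler preempt it.
struct Task {
    pending: u32, // bitmask set by "the kernel" when events arrive
}

impl Task {
    // The task calls this at a point of its own choosing.
    fn check_notifications(&mut self, mask: u32) -> u32 {
        let hits = self.pending & mask;
        self.pending &= !mask; // acknowledge the bits we looked at
        hits
    }
}

fn main() {
    let mut t = Task { pending: 0 };
    t.pending |= 0b01; // an event arrives; the task is not interrupted
    // ... the task keeps executing exactly as written ...
    assert_eq!(t.check_notifications(0b11), 0b01); // seen only when asked
    assert_eq!(t.check_notifications(0b11), 0);    // and then consumed
}
```

The point is that the event only becomes visible at the line where the task asks for it — control flow on the page never changes.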


If you want a very concrete example to hang your hat on: making the I2C driver interrupt driven was a snap -- and importantly, can easily be done from a library, where the caller merely provides functions to call to enable a specified interrupt and to synchronously wait for interrupts.[0]

[0] https://github.com/oxidecomputer/hubris/blob/01b5af3d54348ba...
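As a sketch of what "the caller provides the interrupt hooks" could look like — the names (`IrqHooks`, `read_byte`) and signatures here are invented, not the actual Hubris driver API:

```rust
// Hypothetical sketch: an interrupt-driven transfer routine that lives in
// a library, with the caller supplying the interrupt machinery.
pub struct IrqHooks<'a> {
    pub enable: &'a mut dyn FnMut(u32),       // enable a specific interrupt
    pub wait: &'a mut dyn FnMut(u32) -> bool, // block until it fires; false = timeout
}

pub fn read_byte(
    hooks: &mut IrqHooks<'_>,
    irq: u32,
    rx: &mut dyn FnMut() -> u8,
) -> Option<u8> {
    (hooks.enable)(irq);
    if (hooks.wait)(irq) {
        Some(rx()) // interrupt fired: pull the byte out
    } else {
        None // a real driver might map this to a timeout/BusLocked error
    }
}

fn main() {
    let mut enabled = Vec::new();
    let got = {
        let mut enable = |irq: u32| enabled.push(irq);
        let mut wait = |_irq: u32| true; // pretend the interrupt fired
        let mut hooks = IrqHooks { enable: &mut enable, wait: &mut wait };
        read_byte(&mut hooks, 7, &mut || 0x42)
    };
    assert_eq!(got, Some(0x42));
    assert_eq!(enabled, vec![7]); // the caller's hook was invoked
}
```

The library never touches interrupt registers itself; it just calls whatever the task hands it, which is what makes the driver reusable across boards.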


Looks interesting! Though I noticed the bit about the Konami code. The terminology is great, but alas it feels like I've run into as many i2c devices that break the i2c protocol as follow it. Ahem, I'd recommend avoiding Infineon I2C devices to keep your sanity. ;)

Reading the code, is it correct to say the ISR will time out if the device doesn't respond? It's nice that the driver returns the state of the bus locking in the error `Err(drv_i2c_api::ResponseCode::BusLocked);`. Makes me curious how you're doing that. I need to find time to watch the video.


Yeah. Though importantly, all of these tasks are (as I understand it) being designed and developed cohesively, by a small team under a single roof. We're not talking about random userspace applications in a general-purpose OS; new tasks can't even be started and stopped at runtime, they're laid out at build time. I think that's why this explicit and cooperative model can work.


I would assume that as the ecosystem grows, 'standard' tasks will develop that see a lot of reuse between different people using this OS.

But I agree that, critically, you build and deploy this as a complete system even if many of the tasks are not in-house. You still test the complete system.


For those looking for more on "parse don't validate" mentioned in the talk, here are links to the original and two previous discussions.

0. https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...

1. https://news.ycombinator.com/item?id=21476261 (original 2019)

2. https://news.ycombinator.com/item?id=27639890 (revisited 2021)
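For anyone who hasn't read the post, the core trick fits in a few lines of Rust: instead of validating that a `Vec` is non-empty and hoping downstream code remembers, parse it into a type that cannot represent emptiness (a minimal sketch; `NonEmpty` here is hand-rolled for illustration):

```rust
// "Parse, don't validate": a list type that is non-empty by construction.
struct NonEmpty<T> {
    head: T,
    tail: Vec<T>,
}

impl<T> NonEmpty<T> {
    // The one place emptiness is handled; everything downstream gets proof.
    fn parse(mut v: Vec<T>) -> Option<NonEmpty<T>> {
        if v.is_empty() {
            None
        } else {
            let head = v.remove(0);
            Some(NonEmpty { head, tail: v })
        }
    }

    // Total function: no "empty list" panic is possible here.
    fn first(&self) -> &T {
        &self.head
    }

    fn len(&self) -> usize {
        1 + self.tail.len()
    }
}

fn main() {
    assert!(NonEmpty::<i32>::parse(vec![]).is_none());
    let ne = NonEmpty::parse(vec![3, 1, 4]).unwrap();
    assert_eq!(*ne.first(), 3);
    assert_eq!(ne.len(), 3);
}
```

The payoff is that the check happens exactly once, at the boundary, and the type system carries the proof everywhere else.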


“Parse don’t validate” reminds me of John Ousterhout’s “Define errors out of existence.”[0]

[0] https://youtu.be/bmSAYlu0NcY?t=1315


I thought the same thing when I read the blogpost.


+1, it's become a mantra for me in any code I write that deals with foreign data (which is most of it). Really reshapes your thinking.


No rule of thumb should be a mantra. Parsing when you don’t need to induces unnecessary coupling, which can be quite undesirable in many scenarios. Like any technique, there are times when it’s worth it and times when it’s not.


Is "No rule of thumb should be a mantra" its own counterexample?


Only the Sith deal in absolutes


I am extraordinarily and unspeakably biased here, but I also say this as someone who didn't see the talk until I myself was an attendee at OSFC: Cliff's talk is as good as any I have ever seen in my career. It is dense with technical detail that itself reflects years of thinking and wisdom, but it remains well-paced with very helpful visuals; highly recommended viewing!


Slightly tangential, but you seem like a good person to ask and this seems like as good a place as any: Is the way Hubris handles failed tasks intentionally reminiscent of the way the illumos Fault Management Architecture works, or is that a coincidence born of them solving the same task in a relatively obvious way?


That actually hadn't occurred to me, in part because the Hubris model mainly deals with software failure, whereas FMA is mainly dealing with hardware failures. But certainly they both have a shared zeitgeist of robustness in light of errors or faults! So I would say that the commonalities aren't direct, but also not totally unrelated. ;)


Very interesting. This reminds me a ton of L4, with its message passing style. I'm curious if it was inspired by L4.

It's a great talk, and my only disappointment was that I was hoping to hear about what comes after. After watching the OS talk that Cantrill references [0], I really want to know the answer to "what does an OS that runs across a bunch of heterogeneous chips look like?" Having Hubris be just a single-node OS was a bit of a let down, so I will be staying tuned for the sequels.

[0] https://www.youtube.com/watch?v=36myc8wQhLo

EDIT: Went to go read the docs and: "The operating system that originally made this point was L4, and Hubris’s IPC mechanism is directly inspired by L4’s"


Yes, definitely L3/L4 (and QNX) inspired! In terms of the Roscoe talk: I think it's a great talk because he talks about the problems, but I don't particularly agree with the proposed solution. ;) In particular, I don't think it makes sense to have a single operating system spanning these wildly disparate, heterogeneous cores; to me, it makes much more sense for each to have its own (tiny) operating system -- e.g., Hubris.


Thanks! I guess I'll check back in 5 years and see who was right.


Love the technical insights, but I comment here to praise the presentation style.

My first judgement was that it seemed a bit rehearsed, but I ended up really enjoying the style. It felt very respectful of my time as a viewer, very concise and well prepared. I'm sure a lot of time went into that and I appreciate it.

So thanks, for both an awesome application of rust and an amazing presentation.


This is the video referred to in the post of this thread from earlier: https://news.ycombinator.com/item?id=29390751


The talk itself is great, but the background to the slides and speakers is constantly moving from side to side and is pretty distracting. I would have preferred to see the video of the slide deck itself without any extraneous background or the speaker’s head.


This is almost all astonishingly sensible. If static provisioning works at all for the application, it is obviously better.

If you did have a need to have dynamic tasks -- e.g., drivers for an unbounded set of USB peripherals -- one of the static tasks could be the executive for those. They would still be unable to interfere with the other, static tasks.

I have to agree with the rabbi that the Rust evangelism detracts from the talk. E.g., the runtime borrow checking example, which does not rely on any Rust mechanism, would work as well with any language.
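For what it's worth, the "would work in any language" point is easy to demonstrate: the bookkeeping is just an integer any kernel could maintain by hand. A hypothetical sketch (this is not Hubris's actual lease implementation):

```rust
// Runtime borrow checking without any Rust machinery: one integer per
// resource. 0 = free, -1 = one exclusive holder, n > 0 = n shared holders.
struct Lease {
    state: i32,
}

impl Lease {
    fn new() -> Lease {
        Lease { state: 0 }
    }
    fn borrow_shared(&mut self) -> bool {
        if self.state >= 0 { self.state += 1; true } else { false }
    }
    fn borrow_exclusive(&mut self) -> bool {
        if self.state == 0 { self.state = -1; true } else { false }
    }
    fn release_shared(&mut self) {
        self.state -= 1;
    }
    fn release_exclusive(&mut self) {
        self.state = 0;
    }
}

fn main() {
    let mut l = Lease::new();
    assert!(l.borrow_shared());
    assert!(l.borrow_shared());     // many readers are fine
    assert!(!l.borrow_exclusive()); // writer denied while readers hold it
    l.release_shared();
    l.release_shared();
    assert!(l.borrow_exclusive());
    assert!(!l.borrow_shared());    // readers denied while a writer holds it
}
```

Nothing here leans on ownership or lifetimes; you could transliterate it into C line for line.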


I really enjoyed this talk.

How does a system with completely static resource allocation accommodate cases where the underlying hardware is actually dynamic? For example, consider hot-plugging USB devices or storage media.

Would there be a fixed "maximum number of USB devices," with resources reserved at compile time for the maximum? If so, does this preclude high resource utilization in the common case, where the user isn't hitting these limits?


In the past, I've worked on real-time systems which deliberately run in "worst-case" load all the time, basically to avoid your second concern: if high resource utilization is possible, then you have to design for it, and this is an easy way to make sure the system doesn't fall over.

For example, in a system that controls multiple motors, you'd naively do motion-planning calculations only for the motors that are actively running. If instead you do these calculations for every motor all the time, even the ones that aren't moving, you'll always have the worst-case performance profile; if it still works, then you can be more confident about the system.
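A toy illustration of that design choice (the planning math is a stub and all the names are invented):

```rust
// "Always run the worst case": plan every motor on every tick, whether or
// not it is moving, so the timing profile never changes with load.
#[derive(Clone, Copy)]
struct Motor {
    active: bool,
    position: f64,
    target: f64,
}

// Stand-in for real motion planning: one step toward the target.
fn plan_step(m: Motor) -> f64 {
    (m.target - m.position) * 0.1
}

fn tick(motors: &[Motor]) -> Vec<f64> {
    // Deliberately no `filter(|m| m.active)`: idle motors cost the same.
    motors
        .iter()
        .map(|&m| {
            let step = plan_step(m);
            if m.active { step } else { 0.0 } // compute, then discard for idle motors
        })
        .collect()
}

fn main() {
    let motors = [
        Motor { active: true, position: 0.0, target: 10.0 },
        Motor { active: false, position: 5.0, target: 5.0 },
    ];
    let steps = tick(&motors);
    assert_eq!(steps.len(), 2); // every motor was planned, active or not
    assert!((steps[0] - 1.0).abs() < 1e-9);
    assert_eq!(steps[1], 0.0);
}
```

The per-tick cost is now a constant function of the motor count, not of how many happen to be moving, which is exactly the property you want to test against.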

(I work at Oxide, but only for about a month, so this is more past experience than Hubris-specific)


This talk would be more interesting if it weren’t thinly veiled Rust evangelism. I lived through the time when XML and Java evangelism was considered technical content. The older I get, the less substance I see in evangelism of any sort.

Centering language choice as the causal factor for any engineering project signals poor discipline in engineering.


Ordinarily I'd agree with you. While I enjoy Rust, rather a lot of projects I've seen using it place the chosen language farther forward in the value proposition than the capabilities or requirements of the project.

But this seems different. The Oxide folks need a tool that does specifically what Rust does in this space, and (at least going from the talk and related announcement blog post) did a substantial amount of research to try to avoid re-inventing the wheel before choosing Rust. Compared to other rewrite-it-in-Rust exercises, this one seems to evince mechanical sympathy with several of Rust's strengths, to the point that I doubt another language/platform would be even in the ballpark of a good fit for Hubris.

And that's true of your other examples as well: sure, a lot of Java evangelism is hot air driven by process-obsessed leadership sold on false promises of model-your-org-chart-in-code interoperability, yielding self-satirizing OO soup. But some projects have the right combination of (for example) non-synchronously-communicating teams, boxes-and-lines design processes, unskilled programmers, and safety requirements. For those projects, Java is the right choice; choosing a tool that doesn't "resonate" with those requirements as well as possible would be unwise.

Similarly, if I needed to work on a data-modeling-intensive project whose requirements were driven by lots of mechanically specified contracts (schemas) and was worked on by non-programmers who were comfortable with data description but not imperative logic, XML might be the right choice. I personally hope that's rarely the case, but the point stands.


They explicitly justify the lack of solutions for memory safety in their space - both in terms of hardware and software - and why they are building their product using specific tools. They even note that this may seem like a strange choice (as opposed to using something off the shelf) but that they were willing and able to invest in these tools, specifically that they were going to build pretty much everything from scratch.

They even call the project 'Hubris' as a joke about the ambition.

Further, they discuss how borrow checking as a model lends itself to the task architecture. It's obviously very relevant.

It seems silly to call this evangelism as opposed to a very self-aware deep dive into their choices.


The abstract software techniques they used to achieve certain properties are more substantial and generally applicable than the specific language they used to instantiate those properties. Citing the language, and not their requirements, as the causal factor in choosing those techniques is unnecessary evangelism.


I don't get your point. They had goals and chose technologies and approaches to achieve those goals. They cite their task model - would you call that some sort of 'task evangelism'? They cite that system calls in their OS are synchronous, and how that enables some optimizations that work well with Rust's borrow checker.

All of this works together and feels relevant.




If you’re looking for technical content then ignore the “Rust evangelism” and talk about the benefits and trade-offs of Hubris’ various design choices.

There’s plenty of them to go around. In other words, start the discussion you want to have rather than complaining it wasn’t presented exactly as you’d prefer.


If civil engineers used a system to design bridges that was known to frequently cause bridge collapses, it would be good engineering to change to a system that prevents this from happening.

There was a time when civil engineers thought they were geniuses and their own genius would be enough to prevent such mistakes. Guess what: bridges collapsed.

In programming we still have people who think they can prevent stupid mistakes by sheer genius and/or willpower, despite clear evidence to the contrary. Get over yourself and do the right thing.


This comment is refuting a point I did not make. By all means, please continue to use Rust in the domains in which you think it excels. But no matter how much we agree that there exist domains in which Rust excels, it remains true that language evangelism is not very interesting relative to higher-level and more general-purpose engineering concepts and principles. Well, that is unless you actually prefer evangelism to more substantive technical content.


> Centering language choice as the causal factor for any engineering project signals poor discipline in engineering.

This is the point I argued against. If the way you do mathematical calculations in your bridge design repeatedly produces fatal flaws, it is indeed poor engineering to keep that system.

Using tooling that allows you to e.g. produce buffer overflows, and then blaming "poor discipline in engineering", is the kind of reasoning you can read from the engineers behind the biggest structural collapses in history.

The best design of a system is the simplest design that gets you there while preventing all the errors you can make along the way. And a programming language is always part of that system. This is why my comment is not really about Rust; it is about recognizing languages as part of the engineering you are doing.


can you name a single widely used C library that hasn't had a buffer overflow vulnerability?


I don’t find comments that dismiss one type of technology over another without any reasons to be very interesting. Apparently there are two kinds of technology that are discussed here: language technology and OS technology. Or at least that’s the two that they want to focus on. Yet you dismiss one of them as pretty much irrelevant.[1] Yet you can do the same for the OS technology, if this was about something that was implemented on OS Y: “It’s not interesting to me that this was developed on OS Y. The abstract techniques that they used are more interesting to me than the specific OS they used.” This would effectively communicate that either (1) the OS is irrelevant, or (2) the OS is uninteresting to you. (1) is doubtful (then threads like this would be irrelevant) and (2) is just an expression of your own proclivities and interests.

It would seem obvious that developing certain things on certain OSes have tradeoffs. And likewise for programming languages. Unless you want to be consistent and subscribe to the ridiculously relativist idea that everything is the same; it’s just a matter of what you do with them.

Complaining about evangelism without offering any kind of criticism of what is supposedly being evangelized betrays just as much bias as the supposed missionaries.

[1] “The abstract software techniques they used to achieve certain properties is more substantial and generally applicable than the specific language they used to instantiate those properties”


> Yet you can do the same for the OS technology, if this was about something that was implemented on OS Y: “It’s not interesting to me that this was developed on OS Y. The abstract techniques that they used are more interesting to me than the specific OS they used.”

I would absolutely say that. OS evangelism is as uninteresting as language evangelism, at least in comparison to a discussion about abstract software techniques / concepts and their implications.

> Complaining about evangelism without offering any kind of criticism of what is supposedly being evangelized betrays just as much bias as the supposed missionaries.

I did offer clear and actionable criticism. Evangelism is less interesting than a talk based in general first principles. That’s my opinion and I think it’s a widely held one. I never accused the presenters of bias and if they are biased, I have absolutely no problem with that because all humans have biases. For the record, I am not biased against Rust but perhaps I am not sufficiently biased in favor of Rust to the extent that Rust evangelism would be interesting to me.


Given that the raison d'être of oxide is to use rust to build their product I don’t know that this is fair criticism, unless you want to write their whole enterprise off for the same reason.


I don't think that's their 'raison d'être', their 'raison d'être' is making a better computer.


> Given that the raison d'être of oxide is to use rust to build their product

Thanks for proving my point!


The talk mentions Oxide Computers. Just saw their site, I like the design there: https://oxide.computer