
collinfunk

Hi, I am one of the maintainers of GNU Coreutils. Thanks for the article, it covers some interesting topics. In the little Rust that I have used, I have felt that it is far too easy to write TOCTOU races using std::fs. I hope the standard library gets an API similar to openat eventually.

I just want to mention that I disagree with the section titled "Rule: Resolve Paths Before Comparing Them". Generally, it is better to call fstat and compare st_dev and st_ino, as the article itself mentions (a minimal sketch of that approach follows at the end of this comment). A side effect that seems less often considered is the performance impact. Here is an example in practice:

  $ mkdir -p $(yes a/ | head -n $((32 * 1024)) | tr -d '\n')
  $ while cd $(yes a/ | head -n 1024 | tr -d '\n'); do :; done 2>/dev/null
  $ echo a > file
  $ time cp file copy

  real 0m0.010s
  user 0m0.002s
  sys 0m0.003s
  $ time uu_cp file copy

  real 0m12.857s
  user 0m0.064s
  sys 0m12.702s
I know people are very unlikely to do something like that in real life. However, GNU software tends to work very hard to avoid arbitrary limits [1].

Also, the larger point still stands, but the article says "The Rust rewrite has shipped zero of these [memory safety bugs], over a comparable window of activity." However, this is not true [2]. :)

[1] https://www.gnu.org/prep/standards/standards.html#Semantics
[2] https://github.com/advisories/GHSA-w9vv-q986-vj7x
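
For reference, here is a minimal sketch of the fstat-based comparison mentioned above, using std::os::unix::fs::MetadataExt. It is Unix-only and illustrative; the function name and error handling are made up, and this is not coreutils or uutils code.

  use std::fs::File;
  use std::os::unix::fs::MetadataExt;

  // Compare two already-open files by identity (device + inode) instead of
  // resolving and comparing their paths. File::metadata() performs an fstat
  // on the open descriptor, so no path is re-resolved here.
  fn same_file(a: &File, b: &File) -> std::io::Result<bool> {
      let (ma, mb) = (a.metadata()?, b.metadata()?);
      Ok(ma.dev() == mb.dev() && ma.ino() == mb.ino())
  }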

pornel

Indeed, std::fs suffers from being a lowest common denominator. Rust had to have something at 1.0, and unfortunately it stayed like that.

Rust uutils would be a good place to design a more foolproof replacement for Rust's std::fs API.

dapperdrake

Unix embodies this, as well.

When K&R created Unix and C, there was still the option of moving changes that were better off in the "kernel" into the kernel.

Now we have "standards" that cause headaches even between Linux and the BSDs.

Linux back-propagates stuff like mmap, io_uring, etc. to where it belongs. In this way it is like the original Unix, and it deservedly runs on most servers out there.

dapperdrake

First of all, thank you for presenting a succinct take on this viewpoint from the other side of the fence from where I am at.

So how can I learn from this? (Asking very aggressively, especially for Internet writing, to make the contrast unmistakable. And contrast helps with perceiving differences and mistakes.) (You also don’t owe me any of your time or mental bandwidth, whatsoever.)

So here goes:

Question 1:

How come "speed", "performance", race conditions and st_ino keep getting brought up?

Speed (latency), physically writing things out to storage (sequentially, atomically (ACID), all of HDD, NVMe, SSD, ODD, FDD, tape, "haskell monad", event horizons, the finite speed of light and information, whatever), as well as race conditions all seem to boil down to the same thing. For reliable systems like accounting, the path seems to be ACID or the highway. And "unreliable" systems forget fast enough that computers don't seem to really make a difference there.

Question 2:

Does throughput really matter more than latency in everyday application?

Question 3 (explanation first, this time):

The focus on inode numbers is at least understandable with regards to the history of C and unix-like operating systems and GNU coreutils.

What about this basic example? Just make a USB thumb drive "work" for storing files (ignoring nand flash decay and USB). Without getting tripped up in libc IO buffering, fflush, kernel buffering (Hurd if you prefer it over Linux or FreeBSD), more than one application running on a multi-core and/or time-sliced system (to really weed out single-core CPUs running only a single user-land binary with blocking IO).

ericbarrett

Coreutils are not only used in interactive contexts. They are the primitives that make up the countless shell scripts which glue systems together. Any edge case will be encountered and the resulting poor performance will impact somebody, somewhere.

Here's a related example of what happens when you change a shell primitive's behavior - even interactively. Back in the 2000s, Linux distributions started adding color output to the ls command via a default "alias ls=/bin/ls --color=auto". You know: make directories blue, symlinks cyan, executables purple; that kind of thing. Somebody thought it would be a nice user experience upgrade.

I was working at a NAS (NFS remote box) vendor in tech support. We frequently got calls from folks who had just switched to Linux from Solaris, or had just moved their home directories from local disk to NFS. They would complain that listing a directory with a lot of files would hang. If it came back at all, it would be in minutes or hours! The fix? "unalias ls". Because calling "/bin/ls" would execute a single READDIR (the NFS RPC), which was 1 round-trip to the server and only a few network packets; but calling "/bin/ls --color=auto" would add a STAT call for every single file in the directory to figure out what color it should be - sequentially, one-by-one, confirming the success of each before the next iteration. If you had 30,000 files with a round-trip time of 1ms that's 30 seconds. If you had millions...well, either you waited for hours or you power-cycled the box. (This was eventually fixed with NFSv3's READDIRPLUS.)
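
(For illustration only: a rough Rust sketch of the access pattern described above, not ls's actual code. The point is the extra metadata lookup per entry, which over NFS becomes one round trip per file.)

  use std::fs;

  fn list_with_types(dir: &str) -> std::io::Result<()> {
      for entry in fs::read_dir(dir)? {
          let entry = entry?;
          // An extra stat-like syscall for every single entry, just to
          // decide how to label (or color) it. Cheap locally, painful over NFS.
          let meta = entry.metadata()?;
          let kind = if meta.is_dir() { "dir" } else { "file" };
          println!("{}\t{}", entry.file_name().to_string_lossy(), kind);
      }
      Ok(())
  }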

Now I'm sure whomever changed that alias did not intend it, but they caused thousands of people thousands of hours of lost productivity. I was just one guy in one org's tech support group, and I saw at least a dozen such cases, not all of which were lucky enough to land in the queue of somebody who'd already seen the problem.

So I really appreciate GNU coreutils' commitment to sane behavior even at the edges. If you do systems work long enough, you will ride those edges, and a tool which stays steady in your hand - or script - is invaluable.

dapperdrake

In short, NFS has a terrible data model and only pretends to be a file system.

dijit

> Does throughput really matter more than latency in everyday application?

In my experience latency and throughput are intrinsically linked unless you have the buffer-space to handle the throughput you want. Which you can't guarantee on all the systems where GNU Coreutils run.

dapperdrake

Higher throughput increases the risk of high latency.

Low latency increases the risk of "wasted cycles", i.e. lowers (machine) throughput. Helps with human discovery throughput, though.

The sled.rs people have a very readable take on this in their performance guide.

duped

Just want to point out that race conditions are a correctness problem, not a performance problem.

dapperdrake

Accurate a.k.a. "correct" implementation of ACID needs a single (central) source of truth and temporal serializability (or something close to that).

In practice this always "impacts" performance.

If I understand it correctly, then in physics this is called an event horizon.

awesome_dude

> Question 2:

> Does throughput really matter more than latency in everyday application?

IME as a user, hell yes

When getting a video, I don't mind if it buffers for a moment, but once it starts I need all of that data moving to my player as quickly as possible.

OTOH if there's no wait, but the data is restricted (the amount coming to my player is less than the player needs to fully render the images), the video is "unwatchable"

tharkun__

What's every day?

Exactly, lots of different things.

When I alt-tab I care about latency.

When I ssh I care about latency.

When I download a 25GB game I care about throughput for the download to a certain extent that is probably mainly ISP bound rather than local system bound. I don't care if the download takes 10 or 11 minutes as long as I can still use my system with zero delays meanwhile. And whether it takes 11 minutes or 3 hours depends mostly on my ISP. But being responsive to me while it downloads is local latency bound.

The Youtube example you have makes sense, sure.

WJW

I don't mean to nitpick, but absolute values for both of these matter much less than how much it is compared to "enough". As long as the throughput is enough to prevent the video from stuttering, it doesn't matter if the data is moved to your video player program at 1 GB/s or 1 TB/s. Conversely, you say you don't mind if a video buffers for a moment but I'm willing to bet there's some value of "a moment" where it becomes "too long". Nobody is willing to wait an hour buffering before their video starts.

The perception of speed in using a computer is almost entirely latency driven these days. Compare using `rg` or `git` vs loading up your banking website.

jorvi

Hell no.

Linux desktop (and the kernel) felt awful for such a long time because everyone was optimizing for server and workstation workloads. It's the reason CachyOS (and before that Linux Zen and... Liquorix?) are a thing.

For good UX, you heavily prioritize latency over throughput. No one cares if copying a file stalls for a moment or takes 2 seconds longer if that ensures no hitches in alt tabbing, scrolling or mouse movement.

michaelmrose

This isn't what prioritizing throughput actually looks like in most scenarios.

In the example you gave the amount of read speed the user needs to keep up with a video is meager and greater read speed is meaningless beyond maintaining a small buffer.

You in fact notice it more if your process is sometimes starved of CPU, IO, or memory, is waiting on swap, etc. Conversely, you would in most cases not notice nearly as much if the entire thing got slower, even much slower, as long as its meager resources were quickly available to the thing you are doing right now.

dapperdrake

Additional point:

The point of data storage is to be a singleton.

(Backups are desirable, anyhow.)

s20n

Sorry, complete noob here. Why didn't you just cd into $(yes a/ | head -n $((32 * 1024)) | tr -d '\n')? Why do you need to use the while loop for cd?

EDIT: got it. -bash: cd: a/a/a/....../a/a/: File name too long

collinfunk

No need to apologize at all. Doing it in one cd invocation would fail since the file name is longer than PATH_MAX. In that case passing it to a system call would fail with errno set to ENAMETOOLONG.

You could probably make the loop more efficient, but it works well enough. Also, some shells don't allow you to enter directories that deep at all. It doesn't work on mksh, for example.

dapperdrake

Facetious reply:

> However, GNU software tends to work very hard to avoid arbitrary limits [1].

safercplusplus

I don't know if you're aware, but there is a demonstration of wget (a fellow "gnu utility", right?) being auto-translated to a memory-safe subset of C++ [1]. Because the translation essentially does a one-for-one substitution of potentially unsafe C elements with safe C++ counterparts that mirror the behavior, the translation should be much less susceptible to the introduction of new bugs and behaviors in the way a rewrite would be.

With a little cleaning-up of the original code, the code translation ends up being fully automatic and so can be used as a build step to produce (slightly slower) memory-safe executables from the original C source.

[1] https://duneroadrunner.github.io/scpp_articles/PoC_autotrans...

dapperdrake

Filesystem access is mostly treated by users as serialized ACID transactions on "files in directories."

"Managing this resource centrally" is where unix syscalls came from. An OS kernel can be used like a specialized library for ACID transactions on hardware singletons.

People then got fancy with virtual memory, interrupts, signals, time-slicing, re-entrancy, thread-safety, and injectivity.

It doesn't matter whether you call the "kernel library" from C, C++, Fortran, BASIC, Golang, bash, Rust, etc.

theteapot

Probably a dumb question, but is GNU Coreutils interested in / planning on doing its own Rust rewrite?

wpollock

Thomas Jefferson famously said that "A coreutils rewrite every now and again is a good thing". Or something like that.

When I was a beta tester for System Vr2 Unix, I collected as many bug reports as possible from Usenet (I used the name "the shell answer man". Looking back I conclude that arrogance is generally inversely proportional to age) and sent a patch for each one I could verify. Something like 100 patches.

So if this rust rewrite cleans up some issues, it's a good thing.

collinfunk

At the current moment I would be against it. The language and library are changing too fast. Also, Rust has some other things that make it hard to use for coreutils. For example, Rust programs always call signal(SIGPIPE, SIG_IGN) or equivalent code before main(). There is no stable way to get the longstanding behavior of inheriting the signal action from the parent process [1]. This is quite annoying, but not unique to Rust [2].

[1] https://doc.rust-lang.org/beta/unstable-book/compiler-flags/... [2] https://www.pixelbeat.org/programming/sigpipe_handling.html
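
For context, the common (imperfect) workaround is to reset the disposition at the top of main. A minimal sketch, assuming the libc crate; note that this only restores SIG_DFL, it cannot recover the disposition inherited from the parent, which is the behavior described above.

  fn main() {
      // Undo Rust's pre-main signal(SIGPIPE, SIG_IGN) so that writing to a
      // closed pipe terminates the process like a traditional Unix tool.
      // This resets to SIG_DFL; it does not restore whatever the parent set.
      unsafe {
          libc::signal(libc::SIGPIPE, libc::SIG_DFL);
      }

      // ... tool logic that may be piped into `head`, etc. ...
  }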

greatgib

The rewrite in Rust is mostly vanity and marketing but not based on a real technical need...

So I don't see why they would want to do that.

kibwen

Canonical's usage of uutils is likely for marketing. But the codebase itself was developed for fun, as an excuse for people to have a hands-on way to learn Rust back before Rust was even released, with a minor justification as being cross-platform. From the original README in 2013:

Why?

----

Many GNU, linux and other utils are pretty awesome, and obviously some effort has been spent in the past to port them to windows. However those projects are either old, abandonned, hosted on CVS, written in platform-specific C, etc.

Rust provides a good platform-agnostic way of writing systems utils that are easy to compile anywhere, and this is as good a way as any to try and learn it.

https://github.com/uutils/coreutils/blob/9653ed81a2fbf393f42...

pxc

I thought it was a learning exercise, and maybe some corporations also like it because it has more permissive licensing.

joaohaas

>the article says "The Rust rewrite has shipped zero of these [memory safety bugs], over a comparable window of activity." However, this is not true

That bug got fixed before the Ubuntu release, and is from way before Canonical was even involved with the project.

rossvor

The list of GNU CVEs in the original article includes a buffer overrun in tail from 2021, so for a fair comparison 2021 is part of the "window of activity" (the year the uu_od CVE was published).

gzread

I see even the coreutils maintainers find themselves needing -n (no newlines) and -c (count) options to "yes".

dapperdrake

GNU coreutils is known for adding command line options.

One of the big philosophical differences from the BSDs.

For a human being, it sucks both ways.

wahern

> What’s notable is that all of these bugs landed in a production Rust codebase, written by people who knew what they were doing

They knew how to write Rust, but clearly weren't sufficiently experienced with Unix APIs, semantics, and pitfalls. Most of those mistakes are exceedingly amateur from the perspective of long-time GNU coreutils (or BSD or Solaris base) developers, issues that were identified and largely hashed out decades ago, notwithstanding the continued long tail of fixes--mostly just a trickle these days--to the old codebases.

concinds

Reading that Canonical thread was jaw-dropping. Paraphrased: "Rust is more secure, security is our priority, therefore deploying this full-rewrite of core utils is an emergency. If things break that's fine, we'll fix it :)".

I would not want to run any code on my machines made by people who think like this. And I'm pro-Rust. Rust is only "more secure" all else being equal. But all else is not equal.

A rewrite necessarily has orders of magnitude more bugs and vulnerabilities than a decades-old well-maintained codebase, so the security argument was only valid for a long-term transition, not a rushed one. And the people downplaying user impact post-rollout, arguing that "this is how we'll surface bugs", and "the old coreutils didn't have proper test cases anyway" are so irresponsible. Users are not lab rats. Maintainers have a moral responsibility to not harm users' systems' reliability (I know that's a minority opinion these days). Their reasoning was flawed, and their values were wrong.

oefrha

This leaves such a bad taste in my mouth. If you fucking found 44 CVEs with some relatively amateurish ones (I'm no security engineer but even I've done that exact TOCTOU mitigation before) in such a core component of your system a month before 26.04 LTS release (or a couple months if you count from their round 1), surely the response should be "we need to delay this to 28.04 LTS to give it time to mature", not "we'll ship this thing in LTS anyway but leave out the most obviously problematic parts"?

The snap BS wasn't enough to move me since I was largely unaffected once I stripped it out, but this might finally convince me to ditch.

3836293648

It's insane that this is going into an LTS. It's the kind of experiment I'd expect them to play with in a non-LTS and revert in LTSes until it's fully usable, like they did with Wayland being the default, which started in 2017

PunchyHamster

Ubuntu has been doing careless shit like that their entire existence, it's nothing new

zx8080

Agree with the point. Asking sincerely, how can I avoid installing any Rust-rewrite packages on my machines? Does anyone know the way?

kibwen

If you don't want Canonical's packages, you should probably just be using Debian rather than Ubuntu. It's not 2008 anymore, stock Debian is quite user-friendly.

theandrewbailey

I'm unaware of any Rust rewrites outside of coreutils, so:

    sudo apt install coreutils-from-gnu
https://computingforgeeks.com/ubuntu-2604-rust-coreutils-gui...

duped

This is a people problem and Canonical just isn't good at hiring people

kstrauser

I’ve gotta agree. Some horror stories were going around about their interview process. It seemed highly optimized to select people willing to put up with insane top-down BS.

nine_k

More than that: it seems that Rust stdlib nudges the developer towards using neat APIs at an incorrect level of abstraction, like path-based instead of handle-based file operations. I hope I'm wrong.

NobodyNada

Nearly every available filesystem API in Rust's stdlib maps one-to-one with a Unix syscall (see Rust's std::fs module [0] for reference -- for example, the `File` struct is just a wrapper around a file descriptor, and its associated methods are essentially just the syscalls you can perform on file descriptors). The only exceptions are a few helper functions like `read_to_string` or `create_dir_all` that perform slightly higher-level operations.

And, yeah, the Unix syscalls are very prone to mistakes like this. For example, Unix's `rename` syscall takes two paths as arguments; you can't rename a file by handle; and so Rust has a `rename` function that takes two paths rather than an associated function on a `File`. Rust exposes path-based APIs where Unix exposes path-based APIs, and file-handle-based APIs where Unix exposes file-handle-based APIs.

So I agree that Rust's stdlib is somewhat mistake prone; not so much because it's being opinionated and "nudg[ing] the developer towards using neat APIs", but because it's so low-level that it's not offering much "safety" in filesystem access over raw syscalls beyond ensuring that you didn't write a buffer overflow.

[0]: https://doc.rust-lang.org/std/fs/index.html

juergbi

> So I agree that Rust's stdlib is somewhat mistake prone; not so much because it's being opinionated and "nudg[ing] the developer towards using neat APIs", but because it's so low-level that it's not offering much "safety" in filesystem access over raw syscalls beyond ensuring that you didn't write a buffer overflow.

`openat()` and the other `*at()` syscalls are also raw syscalls, which Rust's stdlib chose not to expose. While I can understand that this may not be straightforward for a cross-platform API, I have to disagree with your statement that Rust's stdlib is mistake prone because it's so low-level. It's more mistake prone than POSIX (in some aspects) because it is missing a whole family of low-level syscalls.
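
To illustrate the missing piece, here is a minimal sketch of an openat-style wrapper written directly over the libc crate. The name, flags, and error handling are illustrative only, not a proposal for std or code from uutils.

  use std::ffi::CString;
  use std::os::fd::{AsRawFd, FromRawFd, OwnedFd};

  // Open `name` relative to a directory you already hold open. The lookup
  // starts from that descriptor rather than from a path resolved from
  // scratch, and O_NOFOLLOW refuses a symlink in the final component.
  fn open_under(dir: &OwnedFd, name: &str) -> std::io::Result<OwnedFd> {
      let c_name = CString::new(name)?;
      let fd = unsafe {
          libc::openat(
              dir.as_raw_fd(),
              c_name.as_ptr(),
              libc::O_RDONLY | libc::O_NOFOLLOW | libc::O_CLOEXEC,
          )
      };
      if fd < 0 {
          Err(std::io::Error::last_os_error())
      } else {
          // Safety: `fd` is a freshly opened descriptor that we own.
          Ok(unsafe { OwnedFd::from_raw_fd(fd) })
      }
  }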

masklinn

> For example, Unix's `rename` syscall takes two paths as arguments; you can't rename a file by handle

And then there’s renameat(2) which takes two dirfd… and two paths from there, which mostly has all the same issues rename(2) does (and does not even take flags so even O_NOFOLLOW is not available).

I’m not sure what you’d need to make a safe renameat(), maybe a triplet of (dirfd, filefd, name[1]) from the source, (dirfd, name) from the target, and some sort of flag to indicate whether it is allowed to create, overwrite, or both.

As the recent https://blog.sebastianwick.net/posts/how-hard-is-it-to-open-... talks about (just for file but it applies to everything) secure file system interaction is absolutely heinous.

[1]: not path

jerf

Unfortunately, it's not just the Rust stdlib, it's nearly every stdlib, if not every one. I remember being disappointed when Go came out that it didn't base the os module on openat and friends, and that was how many years ago now? I wasn't really surprised, the *at functions aren't what people expect, and probably people would have been screaming about "how weird" the file APIs were in this hypothetical Go continually up to this very day... but it's still the right thing to do. Almost every language makes it very hard to do the right thing, with the wrong thing so readily available.

I'm hedging on the "almost" only because there are so many languages made by so many developers and if you're building a language in the 2020s it is probably because you've got some sort of strong opinion, so maybe there's one out there that defaults to *at-style file handling in the standard library because some language developer has the strong opinions about this I do. But I don't know of one.

dgacmu

Openat appeared in Linux in 2006 but not in FreeBSD until 2009; Go started being developed in 2007. It probably missed the opportunity by a year. It would have been the right thing to change the os module at some point in the last 18 years, however.

JuniperMesos

After reading this article, I'm inclined to think that the right thing for this project to do is write their own library that wraps the Rust stdlib with a file-handle-based API along with one method to get a file handle from a Path; rewrite the code to use that library rather than rust stdlib methods, and then add a lint check that guards against any use of the Rust standard library file methods anywhere outside of that wrapper.

dbdr

If that's the right approach, then it would be useful to make that library public as a crate, because writing such hardened code is generally useful. Possibly as a step before inclusion in the rust stdlib itself.

akoboldfrying

Agreed. (This approach feels like a cousin of Parse, Don't Validate.)

jeroenhd

If anything, I find the Rust standard library defaults to Unix too much for a generic programming language. You need to think very Unixy if you want to program Rust on Windows, unless you're directly importing the Windows crate and foregoing the Rust standard library. If you're writing COBOL-style mainframe programs, things become even more forced, though I suspect the overlap between Rust programmers and mainframe programmers who don't use a Unix-like is vanishingly small.

This can also be a pain on microcontrollers sometimes, but there you're free to pretend you're on Unix if you want to.

Someone

If you want to support file I/O in the standard library, you have to choose _some_ API, and that either is limited to the features common to all platforms, or it covers all features but calls that cannot be supported return errors, or you pick a preferred platform and require all other platforms to try as hard as they can to mimic that.

Almost all languages/standard libraries pick the latter, and many choose UNIX or Linux as the preferred platform, even though its file system API has flaws we've known about for decades (example: using file paths too often) or embodies decisions made back in 1970 that we probably wouldn't make today (examples: making file names sequences of bytes; not having a way to encode file types and, because of that, using heuristics to figure out file types. See https://man7.org/linux/man-pages/man1/file.1.html)

bonzini

That's the same for the C or Python standard libraries. The difference is that in C you tend to use the Win32 functions more because they're easily reached for; but Python and Rust are both just as Unixy.

PunchyHamster

That's the norm in most languages; it's just the more convenient way to operate.

onlyrealcuzzo

> They knew how to write Rust, but clearly weren't sufficiently experienced with Unix APIs, semantics, and pitfalls.

The point of Rust is that you shouldn't have to worry about the biggest, easiest to fall in pitfalls.

I think the author's point in this article is that a proper file system API should do the same.

AlotOfReading

Someone once coined a related term, "disassembler rage". It's the idea that every mistake looks amateur when examined closely enough. It comes from people sitting in a disassembler and raging at the high-level programmers who had the gall to e.g. use conditionals instead of a switch statement inside a function call a hundred frames deep.

We're looking solely at the few things they got wrong, and not the thousands of correct lines around them.

Cthulhu_

Thing is, these tools are so critical that even one error may cause systems to be compromised; rewriting them should never be taken lightly.

(Actually, ideally there would be formal verification tools that can accurately test for all of the issues found in this review / audit, like the very timing-specific path changes, but that's a codebase on its own.)

bluGill

Is formal verification able to find most of these issues? I'm no expert on formal analysis, but I suspect most systems are not able to handle many of these errors. It seems more likely that the system will assume the file doesn't change between two syscalls - which seems to be the majority of issues. Modeling that possibility at least makes the formal system much harder to make.

irishcoffee

When I read the article I came away with the impression that shipping bugs this severe in a rewrite of utils used by hundreds of millions of people daily (hourly?) isn’t ok. I don’t think brushing the bad parts off with “most of the code was really good!” is a fair way to look at this.

Cloudflare crashed a chunk of the internet with a rust app a month or so ago, deploying a bad config file iirc.

Rust isn’t a panacea, it’s a programming language. It’s ok that it’s flawed, all languages are.

gmueckl

I think that legitimate real-world issues in Rust code should be talked about more often. Right now the language enjoys a reputation that is essentially misleading marketing. It isn't possible to create a programming language that doesn't allow bugs to happen (even with formal verification you can still prove correctness based on a wrong set of assumptions). This weird, kind of religious belief that Rust leads to magically, completely bug-free programs needs to be countered and brought in touch with reality IMO.

fluffybucktsnek

If I'm not mistaken, in the Cloudflare case, both the Rust rewrite and the C++ original version crashed. The primary cause being the bad config file.

lelanthran

I find it hilarious that this comment is being downvoted.

Exactly what is the controversial take here?

> I don’t think brushing the bad parts off with “most of the code was really good!” is a fair way to look at this.

Nope. this is fine.

> Cloudflare crashed a chunk of the internet with a rust app a month or so ago, deploying a bad config file iirc.

Maybe this?

> Rust isn’t a panacea, it’s a programming language. It’s ok that it’s flawed, all languages are.

Nope, this is fine too.

empath75

Having panics in these is pretty amateur hour, even just on a Rust level. I could see it if they were something like alloc errors, which you can't handle, but expects and unwraps are inexcusable unless you are very carefully guarding them with invariants that prevent that code path from ever running.
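
To make the contrast concrete, an illustrative pair (hypothetical helpers, not uutils code): the first panics and aborts the whole tool on an unexpected condition, the second propagates an error the caller can report and turn into an exit status.

  use std::fs;
  use std::io;

  // Panics if the file vanishes between being listed and being statted.
  fn size_unchecked(path: &str) -> u64 {
      fs::metadata(path).unwrap().len()
  }

  // Lets the caller decide how to report the failure.
  fn size_checked(path: &str) -> io::Result<u64> {
      Ok(fs::metadata(path)?.len())
  }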

slopinthebag

Seems pretty impressive they rewrote the coreutils in a new language, with so little Unix experience, and managed to do such a good job with so few bugs or vulns. I would have expected an order of magnitude more at least.

Shows how good Rust is, that even inexperienced Unix devs can write stuff like this and make almost no mistakes.

nine_k

Yes, it's the lack of Unix experience that's terrifying. So many of the mistakes listed are rookie mistakes, like not propagating the most severe errors, or the `kill -1` thing. Why were people who apparently did not have much experience using coreutils assigned to rewrite coreutils?

aw1621107

> Why were people who apparently did not have much experience using coreutils assigned to rewrite coreutils?

From what I understand, "assigned" probably isn't the best way to put it. uutils started off back in 2013 as a way to learn Rust [0] way before the present kerfuffle.

[0]: https://github.com/uutils/coreutils/tree/9653ed81a2fbf393f42...

JuniperMesos

Why is it even possible to represent a negative PID, let alone treat the integer -1 as a PID meaning "all effective processes"? This seems like a mistake (if not a rookie mistake) in the Linux kernel API itself.

gblargg

Rewriting perfectly good code was a colossal mistake.

Cthulhu_

Not necessarily, but was the reasoning sound and have the tradeoffs been made? The website (https://uutils.github.io/) shows some reasonable "why"s (although I disagree with making "Rust is more appealing" a compelling reason, but that's just me (disclaimer: I don't like C and don't know Rust so take this comment as you will)), but I think what's missing is how they will ensure both compatibility and security / edge case handling, which requires deep knowledge and experience in the original code and "tribal knowledge" of deep *nix internals.

dwattttt

I do wonder whether people got down the article enough to see the list of bugs patched in GNU coreutils.

That "perfectly good code" that it sounds like no one should question included "split --line-bytes has a user controlled heap buffer overflow".

kibwen

The irony here being that GNU's coreutils themselves originated as rewrites, from back when BSD's copyright status was still legally unclear.

pando85

Memory safety catches buffer overflows. CI catches logic bugs. Neither catches the Unix API gotchas nobody documented.

Arch-TK

They're not API gotchas in most cases.

And writing comprehensive tests for this behaviour is very difficult regardless of which language you are using.

I am all for rust rewrites of things. But in this case, these are mistakes which were encouraged by the lazy design of `std::fs` and the developers' lack of relevant experience.

And to clarify, I don't blame the developers for lacking the relevant experience. Working on such a project is precisely the right place to learn stuff like this.

I think it's an absurdly dumb move by Canonical to take this project and beta-test it on normal users' machines though…

vhantz

How does CI catch logic bugs?

bluGill

That depends on what tests you are running. In any significant project you need a test suite so large that you wouldn't run all the tests before pushing to CI - instead you run the targeted tests that cover the area of code you changed, but there are more "integration tests" that go through your code and thus could break, which you don't actually run.

You can also run some static analysis that takes too long to run locally every time, but once in a while it will point out "this code pattern is legal but is almost always a bug".

It is also possible to do some formal analysis of code on CI that you wouldn't always run locally - I'm not an expert on these.

bjourne

CI catches all kinds of bugs.

cubefox

LLM account

hombre_fatal

One thing that's hard about rewriting code is that the original code was transformed incrementally over time in response to real world issues only found in production.

The code gets silently encumbered with those lessons, and unless they are documented, there's a lot of hidden work that needs to be done before you actually reach parity.

TFA is a good list of this exact sort of thing.

Before you call people amateur for it, also consider it's one of the most softwarey things about writing software. It was bound to happen unless coreutils had really good technical docs and included tests for these cases that they ignored.

aykutseker

good example from the article: the chroot+nss CVE. the rule that nss is dynamic and dlopens libraries from inside the chroot isn't anywhere obvious. it's encoded in 25+ years of sysadmins finding it out. clean-room rewrites end up re-learning that, usually as new CVEs. and LLM ports of the same code inherit the problem: the function signature is what they read, but the scars are what they need.

cataflutter

> the function signature is what they read, but the scars are what they need.

This feels like a golden quote. Don't know if you intended for it to rhyme, but well done :D

aykutseker

thanks. honestly didn't catch the rhyme, accidental aphorism :D

TheDong

What's even harder is doing that while trying to avoid the GPL, so doing that without reading the original source code.

uutils would be so much better imo if it was GPL and took direct inspiration from the coreutils source code.

dbdr

The GPL prevents you from reading the licensed code before writing related non-GPL code? Which section of the GPL says that?

TheDong

It's based on an interpretation of "derived from".

It does not matter if it's in the GPL explicitly or not since we're talking about uutils and their stance on it, and they've written that:

https://github.com/uutils/coreutils/blob/6b8a5a15b4f077f8609...

> we cannot accept any changes based on the GNU source code [..]. It is however possible to look at other implementations under a BSD or MIT license like Apple's implementation or OpenBSD.

The wording of that clearly implies that you should not look at GNU source code in order to contribute to uutils.

snovv_crash

This is clean room implementation 101, and why LLMs are so controversial in terms of licensing.

einpoklum

> The code gets silently encumbered with those lessons, and unless they are documented, there's a lot of hidden work that needs to be done before you actually reach parity.

It should be stressed that failure to document such lessons, or at least the bugs/vulnerabilities avoided, is poor practice. Of course one can't document the bugs/vulnerabilities one has avoided implicitly by writing decent code to begin with, but it is important to share these lessons with the future reader, even if that means "wasting" time and space on a bunch of documentation such as "In here we do foo instead of bar because when we did bar in conditions ABC then baz happens which is bad because XYZ."

lionkor

I struggle to find anything in this post that wouldn't be caught by some kind of unit test or manual review, especially when comparing with the GNU source for the coreutils. The whole coreutils rewrite is a terrible idea[1] and is clearly being done in the wrong way (without the knowledge gained from the previous software).

If you do a rewrite, you should fully understand and learn from the predecessor, otherwise you're bound to repeat all the mistakes. Embarrassing.

To be clear; I love Rust, I use it for various projects, and it's great. It doesn't save you from bad engineering.

[1]: https://www.joelonsoftware.com/2000/04/06/things-you-should-...

dwattttt

> I struggle to find anything on this post that wouldn't be caught by some kind of unit test or manual review, especially when comparing with the GNU source for the coreutils.

> If you do a rewrite, you should fully understand and learn from the predecessor, otherwise youre bound to repeat all the mistakes. Embarassing.

Interestingly, the uutils project uses the GNU coreutils test suite.

EDITED to add: they also have a stated position of not allowing contributions based on reading the GPL'd source.

cwillu

I expect nothing less from the creators of unity, upstart, and snap.

a-dub

welcome new systems programmers: unix is broken and you must write ugly non-pedagogical workarounds and do empirical testing. this is what reliable software and good software engineering actually is... surprise!@#%

Joker_vD

> The pattern is always the same. You do one syscall to check something about a path, then another syscall to act on the same path. Between those two calls, an attacker with write access to a parent directory can swap the path component for a symbolic link. The kernel re-resolves the path from scratch on the second call, and the privileged action lands on the attacker’s chosen target.

It's actually even worse than that, because an attacker with write access to a parent directory can mess with hard links as well... sure, it only messes with the regular files themselves, but there are basically no mitigations. See e.g. [0] and other posts on the site.

[0] https://michael.orlitzky.com/articles/posix_hardlink_heartac...
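
A minimal Rust sketch of the shape described in the quoted paragraph (hypothetical helper names, not uutils code): the first version resolves the path twice and is racy, the second opens once and then only asks questions of the handle it already holds.

  use std::fs::{self, File};

  // Racy: stat(path) then open(path); the target can be swapped in between.
  fn open_if_small_racy(path: &str) -> std::io::Result<Option<File>> {
      let meta = fs::metadata(path)?;          // syscall 1: resolves the path
      if meta.len() < 1024 {
          return Ok(Some(File::open(path)?));  // syscall 2: resolves it again
      }
      Ok(None)
  }

  // Better: open first, then fstat the descriptor you already hold.
  fn open_if_small(path: &str) -> std::io::Result<Option<File>> {
      let f = File::open(path)?;               // the path is resolved exactly once
      if f.metadata()?.len() < 1024 {          // fstat on the open descriptor
          Ok(Some(f))
      } else {
          Ok(None)
      }
  }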

sysguest

hmm... maybe a 'write lock' on the directory? though this will become more hairy without timeouts/etc...

masklinn

To the extent that locking exists in posix it is various degrees of useless and broken. And as far as I know while BSDs have extensions which make some use cases workable Linux is completely hopeless.

misja111

The root cause of some of the bugs seems to be the opaque nature of some of the Unix API. E.g.

> The trap is that get_user_by_name ends up loading shared libraries from the new root filesystem to resolve the username. An attacker who can plant a file in the chroot gets to run code as uid 0.

To me such a get_user_by_name function is like a booby trap, an accident that is waiting to happen. You need to have user data, you have this get_user_by_name function, and then it goes and starts loading shared libraries. This smells like mixing of concerns to me. I'd say, either split getting the user data and loading any shared libraries in two separate functions, or somehow make it clear in the function name what it is doing.
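
For illustration, a minimal sketch (using the libc crate; the function name and error handling are made up, and real chroot(8) also handles groups, --skip-chdir, and more) of the ordering that avoids this trap: resolve the user with the host's NSS configuration first, then chroot, then drop privileges.

  use std::ffi::CString;
  use std::io;

  fn chroot_as(user: &str, newroot: &str) -> io::Result<()> {
      // 1. Resolve the name *before* chroot, so NSS loads the host's shared
      //    libraries and reads the host's /etc/passwd, not the attacker's.
      let name = CString::new(user)?;
      let pw = unsafe { libc::getpwnam(name.as_ptr()) };
      if pw.is_null() {
          return Err(io::Error::new(io::ErrorKind::NotFound, "no such user"));
      }
      let (uid, gid) = unsafe { ((*pw).pw_uid, (*pw).pw_gid) };

      let root = CString::new(newroot)?;
      let slash = CString::new("/")?;
      unsafe {
          // 2. Enter the new root and move into it.
          if libc::chroot(root.as_ptr()) != 0 || libc::chdir(slash.as_ptr()) != 0 {
              return Err(io::Error::last_os_error());
          }
          // 3. Drop privileges, group before user.
          if libc::setgid(gid) != 0 || libc::setuid(uid) != 0 {
              return Err(io::Error::last_os_error());
          }
      }
      Ok(())
  }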

12_throw_away

> The root cause of some of the bugs seems to be the opaque nature of some of the Unix API.

Some, maybe, but if you've decided to rewrite coreutils from scratch, understanding the POSIX APIs is literally your entire job.

And in any case, their test for whether a path was pointing to the fs root was `file == Path::new("/")`. That's not an API problem, the problem is that whoever wrote that is uniquely unqualified to be working on this project.

aw1621107

Interestingly, it looks like the `file == Path::new("/")` bit was basically unchanged from when it was introduced... 12 (!) years ago [0] (though back then it was `filename == "/"`). The change from comparing a filename to a path was part of a change made 8 months ago to handle non-UTF-8 filenames.

> That's not an API problem, the problem is that whoever wrote that is uniquely unqualified to be working on this project.

To be fair, uutils started out with far smaller ambitions. It was originally intended to be a way to learn Rust.

[0]: https://github.com/uutils/coreutils/commit/7abc6c007af75504f...

ordu

> Some, maybe, but if you've decided to rewrite coreutils from scratch, understanding the POSIX APIs is literally your entire job.

Yes, it is. But still, such traps in an API are just unacceptable. If you design an API that requires obscure knowledge to use it right, and if you do it wrong you'll get privilege escalation, it is just... just... I have no words for it. It is beyond stupidity. You are just making sure that your system will get these privilege escalations, and not just once, but multiple times.

ambicapter

No one is under any impression (or should be) that the POSIX API isn't old and legacy. That's not why we still use it.

emmelaich

Rather, I think that using a functional safe language tricks people into thinking that the data it deals with is stateless. Whereas many many things change in operating systems all the time.

Until we have a filesystem that can present a snapshot, everything has to checked all the time.

i.e. we need an API which gives input -> good result or failure. Not input -> good result or failure or error.

justincormack

Yes, that's one thing musl libc removes.

geocar

If the attacker can control newroot/etc/passwd they _still_ get getpwnam to return whatever userid they want. The solution is to not lookup --userspec=username:group inside the chrooted-space, but from outside.

Also, hi how's things? :)

justincormack

hi! good, how are you doing?

geocar

> The root cause of some of the bugs seems to be the opaque nature of some of the Unix API.

"Seems" and "smells" are weasel words. The root cause is not thinking: Why is root chrooting into a directory they do not control?

Whatever you chroot into is under control of whoever made that chroot, and if you cannot understand this you have no business using chroot()

> To me such a get_user_by_name function is like a booby trap

> I'd say, either split getting the user data and loading any shared libraries in two separate functions, or somehow make it clear in the function name what it is doing.

You'd probably still be in the trap: there's usually very little difference between writing to newroot/etc/passwd and newroot/usr/lib/x86_64-linux-gnu/libnss_compat.so or newroot/bin/sh or anything else.

So I think there's no reason for /usr/sbin/chroot to look up the user id in the first place (toybox chroot doesn't!), and so I think the bug was doing anything at all.

Joker_vD

> The root cause is not thinking: Why is root chrooting into a directory they do not control?

Because you can't call chroot(2) unless you're root. And "control a directory" is weasel words; root technically controls everything in one sense of the word. It can also gain full control (in a slightly different sense of the word) over a directory: kill every single process that's owned by the owner of that directory, then don't setuid into that user in this process and in any other process that the root currently executes, or will execute, until you're done with this directory. But that's just not useful for actual use, isn't it?

Secure things should be simple to do, and potentially unsafe things should be possible.

geocar

> And "control a directory" is weasel words;

I did not choose the term to confuse you, that's from the definition document linked to the CVE:

https://cwe.mitre.org/data/definitions/426.html

The CVE itself uses the language "If the NEWROOT is writable by an attacker" which could refer to a shared library (as indicated in the report), or even a passwd file as would have been true since the origin of chroot()

> root technically controls everything in one sense of the word.

But not the sense we're talking about.

> Because you can't call chroot(2) unless you're root

Well you can[1], but this is /usr/sbin/chroot aka chroot(8) when used with a non-numeric --userspec, and the point is to drop root to a user that root controls with setuid(2). Something needs to map user names to the numeric userids that setuid(2) uses, and that something is typically the NSS database.

Now: Which database should be used to map a username to a userid?

- The one from before the chroot(2)?

- Or the one that you're chroot(2)ing into

If you're the author of the code in-question, you chose the latter, and that is totally obvious to anyone who can read because that's the order the code appears in, but it's also obvious that only the first one* is under control of root, and so only the first one could be correct.

[1]: if you're curious: unshare(CLONE_USERNS|CLONE_FS) can be used. this is part of how rootless containers work.

dapperdrake

Unix and POSIX are fractally a booby trap.

penguin_booze

The correct phrasing of the title: The bugs Rust won't catch.

tdiff

OK, fine if there were some Rust guys rewriting coreutils with no experience in Linux, but how come Ubuntu accepted it into its mainline?

Joeboy

Because it's Ubuntu policy to replace some foundational part of the system with some janky unfinished experiment in every release.

I agree with you that that's more the story here than "OMG, somebody wrote Rust code with bugs in it".

12_throw_away

Right? Canonical wanted (still wants?) to use a coreutils implementation where "rm ./" would print "invalid input" while silently deleting the directory anyway.

I don't really care that some very amateur enthusiasts wrote some bad code for fun, but how in the world did anyone who knows anything about linux take this seriously as a coreutils replacement?

foobar1274278

The original is GPL licensed, while the rewrite is MIT.

tdiff

Was it actually so important to rush the switch?

alkonaut

> What’s notable is that all of these bugs landed in a production Rust codebase, written by people who knew what they were doing

So does this mean that the original utils didn't have a test harness, and that the process of rewriting them didn't start by creating one either?

Sure there are many edge cases, but surely the OS and FS can just be abstracted away and you can verify that "rm .//" actually ends up doing what is expected (Such as not deleting the current directory)?

This doesn't seem like sloppy coding, nor a critique of the language, it's just the same old "Oh, this is systems programming, we don't do tests"?

Alternatively: if the original utils _did_ have tests, and there were this many holes in the tests, then maybe there is a massive lack in the original utils test suite?

geocar

> So does this mean that neither did the original utils have any test harness, the process of rewriting them didn't start by creating one either?

Yes.

> Sure there are many edge cases, but surely the OS and FS can just be abstracted away and you can verify that "rm .//" actually ends up doing what is expected (Such as not deleting the current directory)?

I think people have been trying that since before I was born and haven't yet been successful, so I am much less sure than you are.

For example: How do you decide how many `/` characters to try?

For a better one: Can you imagine if "rm" could simply decide to refuse to delete files containing "important" as first 9 bytes? How would you think of a test for something like that without knowing the letters in that order? What if the magic word wasn't in a dictionary?

> This doesn't seem like sloppy coding, nor a critique of the language, it's just the same old "Oh, this is systems programming, we don't do tests"?

I've never heard anyone say that except as a straw man.

I've heard people say tests don't do what people think they do.

omcnoe

My understanding is the uutils development process involved extensive testing against the behaviour of the original utilities, including preserving bugs.

alkonaut

But we still have CVEs for trivial things? I mean, just a medium-sized test suite for "rm" alone should probably be many thousands of test cases or so. And you'd think that deleting "." and "./" respectively would be among them? Hindsight is always 20/20 and for anything involving text input you can never be entirely covered, but still....

12_throw_away

If something as basic as "rm ./" is broken, the word "extensive" does not apply to whatever testing there was.

duped

> Sure there are many edge cases, but surely the OS and FS can just be abstracted away and you can verify that "rm .//" actually ends up doing what is expected ?

This is one reason why Windows disables symlinks by default, and it's not an abstraction but wholesale removal of a feature. Unixes can't do that without breaking decades of software that relies on their existence.

MacOS does something similar, for example the chroot() bug isn't an issue in practice because MacOS forbids chroot() by default (you need to disable system integrity protection).

The fundamental problem is caused by the POSIX APIs. They have sharp edges by their very nature. The "fix" is to remove them.

eb08a167

I'm totally fine with people experimenting and making amateur attempts at what adult people do. After all, that's how we grow. What I'm actually curious about is how the decision-making chain at Ubuntu got so messed up that this made it into production.

eviks

Sometimes growing is only your height increasing

z3t4

To be fair these are mostly gotchas with Linux and not Rust itself, but I guess the std in Rust could handle some of these issues, in that a std should not allow you to shoot yourself in the foot by default.

marcosscriven

That’s a great article, and indeed a very good blog. Just spent ages reading lots of their other articles.

Of the bugs mentioned I think the most unforgivable one is the lossy UTF conversion. The mind boggles at that one!
