Brian Lovin
/
Hacker News
Daily Digest email

Get the top HN stories in your inbox every day.

bjackman

A potential lesson here (i.e. I am applying confirmation bias to retroactively view this article as justification for a strongly held opinion, lol):

Unless you are gonna benchmark something, for details like this you should pretty much always just trust the damn compiler and write the code in the most maintainable way.

This comes up in code review a LOT at my work:

- "you can write this simpler with XYZ"

- "but that will be slower because it's a copy/a function call/an indirect branch/a channel send/a shared memory access/some other combination of assumptions about what the compiler will generate and what is slow on a CPU"

I always ask them to either prove it or write the simple thing. If the code in question isn't hot enough to bother benchmarking it, the performance benefits probably aren't worth it _even if they exist_.

forrestthewoods

Blog author here. I somewhat agree, somewhat disagree. This line makes me uneasy:

> I always ask them to either prove it or write the simple thing. If the code in question isn't hot enough to bother benchmarking it, the performance benefits probably aren't worth it _even if they exist_.

One of my philosophies is that death by a thousand cuts is fine, but death by ten thousand cuts isn’t. A team of 10 engineers can probably fix most of a thousand cuts in two or three months. But if you have ten thousand cuts you’re probably doomed. And those don’t show up cleanly in a flame graph.

Now, for some context, my background is video games. Which means the team knows they need to hit an aggressive performance bar. This isn’t true for many projects. shared_ptr is a canonical example of death by ten thousand cuts.

That said, I strongly agree with the principle of “just do the simple thing”. However I think it’s important to have “sane defaults”. A project can easily have a thousand or ten thousand papercuts that kill performance. But you can’t microbench every tiny decision. And microbenches are only a vague approximation of what actually matters.

I’m also wary of “the compiler will make it fast”. Because that’s true… until it’s not! Although these days you don’t have any choice but to lean heavily on the compiler and “trust but verify”.

No one wants a super complex solution if it’s not needed. However I am very amenable to “do a slightly more complex thing if you know it’s correct and we can never think about this ever again”. It’s much easier to do the fast thing upfront than for someone else to try and speed it up in two years when we’re doing a papercut pass.

klyrs

> But if you have ten thousand cuts you’re probably doomed. And those don’t show up cleanly in a flame graph.

I am reminded of the lovely nanosecond/microsecond talk by Grace Hopper. If your code does a little bit of setup and then spends all of its time in a single hotspot, fine. But if your code is full of microsecond-suboptimal speed bumps, those bumps can hide your hotspot altogether. And a flat-ish flame graph looks fine: nothing stands out as a problem!

It's valuable to do micro-benchmarks, not just to hone your optimization skills, but to learn optimal patterns in your language of choice. Then, when you're "in the zone" and laying down new code, you just do the optimal thing reflexively. Or, when you're reviewing or rewriting something, those micro-hotspots jump out and grab your attention.

There's a reason that ancient software running on ancient hardware is way more responsive & snappy than what we have today. Laziness.

Dylan16807

> There's a reason that ancient software running on ancient hardware is way more responsive & snappy than what we have today. Laziness.

Laziness in terms of using entirely inappropriate algorithms, sure.

Laziness in not microbenchmarking minutia? It shouldn't be. There's a limit on how much that can hurt you. I would say much less than a factor of ten, but let's go with 10x just for argument's sake. If you have a CPU that's 500x faster, and use easy code that's 10x slower, you're doing just fine. This is not the problem with modern unresponsiveness.

naasking

> And a flat-ish flame graph looks fine: nothing stands out as a problem!

If your program is still slow, that would also indicate that everything is a problem, i.e. the ten thousand cuts. Start optimizing at some obvious spots and then see what happens.

bjackman

Haha "death by a thousand cuts" is exactly the phrase I encounter in these debates!

And actually I still disagree - e.g. I once took over a DMA management firmware and the TL told me "we are really trying to avoid DBATC so we take care to write every line efficiently". But the thing was that once you have a holistic understanding of the system's performance you tend to find _only a small fraction of the code ever affects the metrics you care about_!

E.g. in that case the CPU was so rarely the bottleneck that it really didn't matter, we could have rewritten half the code in Python (if we'd had the memory) without hurting latency or throughput.

Admittedly I can see how games or like JS engines might be a kinda special case here, where the OVERALL compute bandwidth begins to become a concern (almost like an HPC system) and maybe then every line really does count.

Dylan16807

Very little of your code is in hot loops. If the code that takes half a millisecond per frame could be twice as fast, but the hot loop is very optimized, then it doesn't really matter. And that's what I would think of by default for having many many cuts. Better to spend the optimization effort elsewhere.

> shared_ptr is a canonical example of death by ten thousand cuts

Why does that count as ten thousand cuts rather than one cut? That doesn't sound intractable to fix if you have months.

kllrnohj

> Very little of your code is in hot loops.

Very few programs have "hot loops" as small, contained things. "hot loops" is at this point as much of a bad trope as "but premature optimizations!" is.

> Why does that count as ten thousand cuts rather than one cut? That doesn't sound intractable to fix if you have months.

Why spend months re-writing code you could have just written correctly the first time if a single person had spent an hour messing around with benchmarks to figure out the proper guidance?

undefined

[deleted]

undefined

[deleted]

throwaway894345

I generally agree, but it’s also not obvious to me in Rust (or in Go) whether passing by reference or by copy is more maintainable or clear. I guess what I want is some guidance on what I should do by default, which you sort of give with “do what is more maintainable”, but I can’t tell what that means in practice (I’ve been told to default to pass-by-reference in the past because most traits take &self and not self).

kibwen

> I’ve been told to default to pass-by-reference in the past because most traits take &self and not self

This is only blanket advice for designing traits, because as the trait author you don't know what concrete type the downstream user is going to want to use, and taking `&self` in that circumstance is the choice that is friendliest to both Copy and non-Copy types.

If you're just writing a non-generic function and you do know what concrete types you're using, the flowchart is pretty simple:

1. If the type is not Copy, then pass by-ref if you just need to read the value, pass by-mutable-ref if you just need to mutate the value, and pass by-value if you want to consume the value.

2. If the type is Copy, then pass by-value, but if your type is really big or if benchmarking has determined that this is a critical code path then pass by-ref.
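Concretely, the flowchart might look like this in code (the types here are made up purely for illustration):

```rust
// A sketch of the flowchart above, using hypothetical types.

#[derive(Clone)]
struct Document { body: String } // not Copy (owns a String)

#[derive(Clone, Copy)]
struct Point { x: f64, y: f64 } // small Copy type

// 1a. Non-Copy, read only: take &T.
fn word_count(doc: &Document) -> usize {
    doc.body.split_whitespace().count()
}

// 1b. Non-Copy, mutate: take &mut T.
fn append(doc: &mut Document, text: &str) {
    doc.body.push_str(text);
}

// 1c. Non-Copy, consume: take T by value.
fn into_body(doc: Document) -> String {
    doc.body
}

// 2. Small Copy type: just pass by value.
fn length(p: Point) -> f64 {
    (p.x * p.x + p.y * p.y).sqrt()
}
```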

__turbobrew__

I would still consider myself a Go novice, but I have been burned a number of times passing simple objects by reference and then having that object mutated, causing subtle bugs. Also, Go is happy to blow your foot off if you take a reference to a loop variable. Although, there is a proposal to fix that.

Generally I find that fewer bugs get introduced when using copy instead of pass by reference, but I’m sure others have the opposite opinion.

411111111111111

This is about Rust though, and that's not really possible there (at least to my knowledge). You should get a compiler error if you attempt this.

I have very little experience in Rust though, so there might be a way (that I'm just not aware of) to circumvent this check.

mcguire

I have had the same results. Passing by copy is simpler and less bug-prone and reduces the urge to "just set the value since I have a reference to the object" which is a well-paved road to significant pain.

And the objects have to get surprisingly large before passing by reference really makes a difference.

throwaway894345

Yeah, const semantics are one of many things I would have preferred to have in Go over generics. You can sort of emulate them with pass by copy, but (1) that has performance implications and (2) if your data contains a reference, that can still be mutated.

touisteur

Ouch, I actually feel Ada was right in having the for-loop 'variable' be constant inside the body of the loop. If you want a modifiable loop variable, use a loop/while loop, but at least the first question in peer-review becomes 'why not a for-loop here'?

undefined

[deleted]

saghm

I don't think one of them is more clear or maintainable universally, but in a lot of contexts, there might be an obvious choice. As a trivial example (that isn't quite fair given that the topic is about structs), it will almost never be more clear or maintainable to pass a shared reference to an integer (although there may be cases where a mutable reference might make sense). I don't think there's much need for one to have precedence over the other by default; if anything, I see the discussion about performance tradeoffs not being worth fretting about in the absence of actual measurement to be an argument _against_ one of them being inherently preferable.

bjackman

Yeah totally agree it's not always/usually obvious. But there are cases where there's a clear readability/assumed-performance tradeoff and in those cases I say always prefer readability (unless you benchmark).

jackmott

[dead]

jjice

> but that will be slower because it's a copy/a function call/an indirect branch/a channel send/a shared memory access

I really dislike these takes. I see engineers optimize these cases and then go ahead and make two separate SQL queries that could be one, wiping out any micro-optimization gains lord knows how many times over.

Yeah, you can loop over that 100 element list twice doing basic computation if you want, it's not going to make a difference for many engineering workloads, but could make a big difference in readability.
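In code, the trade-off above might look like this (a toy sketch): two obvious passes over a small list instead of one fused, cleverer loop.

```rust
// Two simple passes over a small slice. For ~100 elements the extra
// pass is noise; the readability difference is not.
fn stats(xs: &[i64]) -> (i64, i64) {
    let sum: i64 = xs.iter().sum();           // pass 1
    let max: i64 = *xs.iter().max().unwrap(); // pass 2 (assumes non-empty input)
    (sum, max)
}
```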

Patrol8394

This x 10000! If I had a dime for every time I provided this exact feedback in code reviews… I find it surprising that a lot of devs in the tech industry are obsessed with pointless micro-optimizations and don’t care about writing maintainable, testable, simple code. My final comment is always to not try to outsmart compilers/JVMs, because they tend to do a much better job than developers.

Please, don’t optimize unless you have reasons to do so and numbers backing that up.

josephg

My advice is the opposite: if you want to make performance justifications for code, you need a benchmarking suite. I have them for a lot of my projects. (Rust’s criterion is a delight). A good benchmark suite is a subtle thing to write - you want real world testing data, and benchmarks for a range of scenarios. The benchmarks should be run often. For some changes, I'll rerun my benchmarks multiple times for a single commit. I benchmark the speed of a few operations, serialisation size, wasm bundle size and some other things.

Having real benchmarking data is eye opening. It was shocking to me how much the wasm bundle size increased when I added serialisation code. The time to serialise / deserialise a big chunk of data for me is 0.5ms - so fast that it’s not worth more microoptimizations. Lots of changes I think will make the code slower have no impact whatsoever on performance. And my instincts are so often wrong. About 50% of microoptimizations I try either have no effect or make the code slightly slower. And it’s quite common for changes that shouldn’t change performance at all to cause significant performance regressions for unexpected reasons.

I’ve also learned how important “short circuit” cases can be for performance. Adding a single early check for the trivial case in a function can sometimes improve end to end performance by 15-20%, which in already well tuned code is massive.

Performance work is really fun. But if you do performance tuning without measuring the results, you’re driving blindfolded. You’re as likely to make your code worse as you are to make it better. Add benchmarks.
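The "short circuit" point above can be sketched with a toy example (names made up): handle the trivial case up front so the expensive general path is skipped entirely.

```rust
// A minimal sketch of a short-circuit fast path.
fn slices_equal(a: &[u8], b: &[u8]) -> bool {
    // Fast path: slices of different lengths can never be equal,
    // so bail out before touching any element data.
    if a.len() != b.len() {
        return false;
    }
    // General path: element-by-element comparison.
    a.iter().zip(b.iter()).all(|(x, y)| x == y)
}
```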

ok123456

This is true for application code. But, Rust is trying to sell itself as a systems language and an embedded language and a language you can write kernel modules in. Memory budget matters in these cases.

tialaramex

If memory budget matters, you have a memory budget. So you should be measuring and you can actually tell where you need improvements.

But in practice what we see overwhelmingly is that people want to do this stuff but they aren't measuring, because measuring is boring whereas making the code more complicated to show off how much you think you know about optimisation is easy. Knock it off.

dahfizz

It depends on your specialization, I guess. If you're making a website, a few microseconds here and there probably don't matter.

But in my field (Fintech), performance really does matter. Doing the simple, slow thing is just lazy and won't make it through review.

imron

Great. Should be easy to prove then.

Diggsey

One neat thing here is that the compiler is aware of which types are `Copy` and not internally mutable (not containing an `UnsafeCell`). For these types, passing `&T` and `T` are equivalent, so the compiler could just choose the faster option.

Even if it's not smart enough to do that today, it could implement this optimization in the future. This could work even without inlining, since the Rust calling convention is unstable, and an optimization based on type size could be incorporated into it.

cwzwarich

It would be nice if Rust could do this, but it breaks backwards compatibility. Some existing code depends on pointer values of &T being equal or not equal.

comex

As an addendum, LLVM can automatically perform the “&T-as-T” optimization (without inlining) in some cases where the callee function is in the same compilation unit and known to not care about the pointer value. However, these types of optimizations tend to be fragile, easily disturbed when things get slightly complex.

zozbot234

It would be more advisable to add this as a clippy hint, because `&T` and `T` are not always equivalent wrt. FFI.

kibwen

Indeed, but the compiler is still capable of doing it on a case-by-case basis. Quite often the observed semantics are identical and it's easy for the backend to see that a pointer has been created only to be immediately dereferenced.

saghm

This seems to be an unpopular opinion, but I feel similarly about how people sometimes seem to toss out `inline` (and, even more suspect, `inline(always)`) annotations on Rust functions like candy on Halloween, and there are almost never any actual measurements of whether it helps in the cases where it's used. It's not even that I think it really hurts that much in most of the stuff I've worked on (which tends to be more sensitive to concurrency design and network round trips), but I can't help but worry that people using stuff like this when they don't seem to fully grasp the implications is a recipe for trouble.

bjackman

Yeah inline is an absolute classic for this. The number of uncommented __attribute__((always_inline))s I see in C code drives me crazy. There are absolutely legitimate reasons to use that attribute but there should ALWAYS be a comment about why, so that later readers know in what conditions they can safely remove it.

phkahler

In the old days I saw C code with massive overuse of the "register" keyword, meaning this variable should be kept in a register. Someone had over 12 of these in one function, back when that was waaaay more variables than processors had registers.

Good thing modern register allocation by compilers makes this irrelevant.

heydenberk

Even if you do benchmark something, maintainability can be more important than a marginal performance improvement.

I've seen this happen a lot with JavaScript, particularly in the last 5-10 years as JS engines have developed increasingly sophisticated approaches to performance. Today's optimization can be tomorrow's de-optimization. Even given an unchanging landscape of compiler/interpreter, tightly-optimized code can become de-optimized when updated and extended, as compared to maintainable code that may not suffer much performance degradation upon extension.

Dobbs

edit: I misread the previous post. Ignore this.

How are you using the word simpler? Because to me that implies a combination of more obvious and number of lines of code. Something that a benchmark shouldn't be involved in.

For example asking someone to delete 10 lines of code and instead use go's ` net.SplitHostPort` would be an example of "simpler".

2OEH8eoCRo0

I've read that good generals worry about tactics and great generals worry about logistics.

Good programmers play code golf, great programmers write readable and maintainable code.

Your example seems reasonable but programmers also like to act like the smartest one in the room. I often come across tricky and borderline obfuscated code because somebody wanted to look clever. This is a logistical nightmare.

tested23

Ugh, you are right but then someone comes and uses this to rationalize not including things like map, filter and reduce in a language because they are supposedly too complicated and you can just do it with a for loop

karamanolev

That's what they're saying: if someone says anything more complicated is faster, they challenge them to benchmark it. Usually, it turns out whoever argues the "is faster" point doesn't bother to benchmark it and the simpler code-wise thing wins out. So yes - the benchmark settles performance; simplicity is measured in lines of code, cyclomatic complexity, "in the eye of the beholder", or whatever other metric you choose, but usually it's obvious.

celeritascelery

I don’t feel like this gave a satisfactory answer to the question. Since everything was inlined, the argument-passing convention made no difference in the micro-benchmarks. But what happens when it does not inline? Then you would actually be testing by-borrow vs by-copy instead of how good Rust is at optimizing.

ncallaway

I sort of agree and sort of disagree.

> Then you would actually be testing by-borrow vs by-copy instead of how good Rust is at optimizing.

I don’t think the question is actually: “what is faster in practice, a by-copy method call or a by-borrow method call”, I think the question is: “as an implementer, which semantics should I choose when I’m writing my function”.

For the second question: “Rust is usually pretty good at aggressively inlining, so… if you’re willing to trust Rust’s compiler, you’re often okay going with by-copy implementations, but you should keep an eye on it”. Whereas, as you note, for the first question it’s not an answer.

But, I do think if someone was going to put more work into it I’d be very curious what the answer to the first question is. If I’m choosing to implement with by-copy semantics and trusting the Rust compiler to hopefully inline things for me, I’d like to know the implications in the cases when it doesn’t.

forrestthewoods

Blog author here. This feels like the best summary in this comment section.

The root question is indeed “what semantics should I use”. And the answer I came up with was “the compiler does a lot of magic so by-copy seems pretty good”. I agree with the previous commenter this is not a satisfying conclusion!

My experience with Rust is that it requires a moderate amount of trust in the compiler. Iterator code is another example where the compiler should produce near optimal code. Emphasis on should!

FpUser

When the value size is small (whatever "small" means for a particular architecture) I'd say the "trust the compiler" suggestion is reasonable. As the size grows there should be no more "trust", unless the compiler can decide whether it is safe to use a ref instead of a value based on the value size (we assume that the function does not mutate the value).

Your tests on my PC:

  Rust - By-Copy: 14124, By-Borrow: 8150
  C++ - By-Copy: 12160, By-Ref: 11423
P.S. Just built it using LLVM under CLion IDE and the results are:

  G:\temp\cpp\rust-cpp-bench\cpp\cmake\cmake-build-release\fts_cmake_cpp_bench.exe
   Totals:
     Overlaps: 220384338
     By-Copy: 4397
     By-Ref: 4396  Delta: -0.0227428%

  Process finished with exit code 0

furyofantares

I feel like they got excited by their C++ code being so much slower and curious about the "weird" C++ result and forgot to figure out the original question.

fnordpiglet

To be fair I got excited too. But I still want to know the answer as well.

FpUser

>"C++ code being so much slower"

  Rust - Windows - By-Copy: 14124, By-Borrow: 8150
  C++ - Windows MS Compiler - By-Copy: 12160, By-Ref: 11423
  C++ - Windows LLVM 15 - By-Copy: 4397, By-Ref: 4396  Delta: -0.0227428%

So it appears that C++ - Windows LLVM 15 beats Rust by a large margin.

jasonhansel

In Rust it's considered idiomatic to pass things by-value whenever you can. Usually this is also the most performant option, since it avoids dereferencing in the callee.

Of course, if your struct is truly enormous, you may want to break this rule to avoid large copies. But in that case you probably want to Box<T> the struct anyway.

Of course, if your struct contains something that can't be copied--like a Vec<T>--you'll have to decide whether to clone the whole struct (and thus the vector in it), pass the struct by-borrow, or find some other solution.
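The three situations above could be sketched roughly like this (all the types here are invented for illustration):

```rust
// Cheap to copy: just pass by value, as described above.
#[derive(Clone, Copy)]
struct Small { a: u64, b: u64 }

// Truly enormous: keep it behind a Box so moves copy only a pointer.
struct Huge { data: [u8; 65536] }

// Contains a Vec, so it cannot be Copy: borrow, clone, or move.
#[derive(Clone)]
struct Mixed { items: Vec<u32> }

fn use_small(s: Small) -> u64 { s.a + s.b }

// Moving a Box<Huge> moves only the pointer, not the 64 KiB payload.
fn use_huge(h: Box<Huge>) -> u8 { h.data[0] }

// For the non-Copy case: borrow when reading is all you need...
fn sum(m: &Mixed) -> u32 { m.items.iter().sum() }

// ...or clone explicitly when you really need an owned duplicate.
fn duplicate(m: &Mixed) -> Mixed { m.clone() }
```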

int_19h

When the compiler can see both the callee and the caller (the most common case), why should it even matter? If you pass by value, but don't actually do anything that mutates the copy, and there are no aliasing concerns, surely the compiler can make it by-ref under the hood?

(The other direction is trickier, since by-ref implies the desire to observe aliasing, even though that's usually not expected in practice - but the compiler cannot tell.)

brundolf

I don't think I'd agree that idioms come into play here, one way or the other. Safely borrowing things by reference is one of Rust's headline features

kibwen

> Safely borrowing things by reference is one of Rust's headline features

Sure, but it's worth noting that references in Rust do not exist merely to avoid passing by-value. They also exist to make it easier to deal with Rust's ownership semantics: they let you pass things to a function without also requiring the function to "pass back" those things as returned values. In other words, references let you do `fn foo(x: &Bar)` rather than `fn foo(x: Bar) -> Bar`. This is a unique and interesting consequence of languages with by-default move semantics.
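The two shapes described above, spelled out with a hypothetical `Bar`:

```rust
struct Bar { n: u32 }

// With a reference, the caller keeps ownership automatically.
fn peek(x: &Bar) -> u32 {
    x.n
}

// Without references, a function that doesn't consume the value
// would have to hand it back explicitly.
fn peek_and_return(x: Bar) -> (u32, Bar) {
    (x.n, x)
}
```

With `peek`, the caller can keep using its `Bar` afterwards; with `peek_and_return`, it has to rebind the returned value to get it back.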

dwheeler

This is one advantage of Ada, where parameters are abstractly declared as "in" or "in out" or "out". The compiler can then decide how to best implement it for that specific size and architecture.

sampo

> This is one advantage of Ada, where parameters are abstractly declared as "in" or "in out" or "out".

Also Fortran has "in", "inout" and "out".

Congeec

int_19h

So far as I can tell, it's not quite the same thing since these still have pointer semantics (and thus have to deal with aliasing etc). The in/out approach is more generic, since "in" can map to a pointer where it makes sense, and to a copy where it does not.

Better yet when you prohibit such arguments from aliasing (or at least make no-alias the default) - now the compiler can also implement "in out" by copying the value back and forth, if it's faster than indirection.

IAmLiterallyAB

Ehh that's not quite what those are. The types being added for C++23 are designed for FFI.

Herb made a proposal for proper in/out parameters for C++ in 2020 https://youtu.be/6lurOCdaj0Y

jb1991

And Swift also has "inout" parameters.

stephencanon

But not “out” params, sadly.

It can return multiple values, so this doesn’t matter much for value types, but it would be nice to be able to specify that a pointer arg is an out-param sometimes and enforce that it is not read from while handling allocation in the caller.

trifurcate

Also, MSVC has similar annotations for various static analyses: https://learn.microsoft.com/en-us/cpp/code-quality/understan...

FpUser

So does Delphi / FreePascal

pletnes

Fortran also has «default» / no intent. This is somehow different from inout.

wiz21c

and GL/SL IIRC ...

ardel95

How is that semantically different from Rust?

in - regular function arguments

inout - mut function arguments

out - function return

Is there any additional information that a compiler can infer from Ada’s parameter syntax?

layer8

The difference between passing by reference vs. by value is observable when comparing pointers to the original vs. to the argument. This difference may be unobservable in Ada though (not sure), so Ada would have more freedom choosing between the two.

undefined

[deleted]

undefined

[deleted]

rwaksmunski

A question to the Rust experts: would lifetime annotations (`'a`) in Rust have a similar benefit to "in", "in out", or "out" in Ada and other languages? With the additional benefit in Rust that the compiler can deduce them automatically in most cases?

chc

As a sibling comment points out, "in" is effectively equivalent to "&T", and "inout" is effectively equivalent to "&mut T". Rust is missing purely "out" parameters, but that isn't a very common case, and I'm not sure how much value there is in saying "this reference can't be read" since references are always guaranteed to be valid in Rust.

db48x

A Rust program can trivially return a tuple of values, so it doesn't need out params.
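A minimal illustration of the point above: where C might use an out-parameter, Rust just returns a tuple.

```rust
// Quotient and remainder returned together, instead of writing
// through out-pointers as a C API might.
fn div_rem(a: u32, b: u32) -> (u32, u32) {
    (a / b, a % b)
}
```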

chromatin

Dlang can also qualify parameters as in, out, and inout; although I don't know to what degree the compiler is able to use that for optimization purposes (it is used for safety checks IIRC)

bvrmn

Always curious how Ada solves the ABI issue with such optimizations in place.

rightbyte

As long as the calling convention is deterministic from the declaration of the function it should be fine right?

usrnm

If it's deterministic, the compiler cannot actually choose the best way to optimise it.

undefined

[deleted]

mcguire

This is one of those questions where you really, honestly, do need to look at a very low level.

Back in the ancient days, I worked at IBM doing benchmarking for an OS project that was never released. We were using PPC601 Sandalfoots (Sandalfeet?) as dev machines. A perennial fight was devs writing their own memcpy using `*dst++ = *src++` loops rather than the one in the library, which was written by one of my coworkers and consisted of 3 pages of assembly that used at least 18 registers.

The simple loop was something like X cycles/byte, while the library version was P + (Q cycles/byte) but the difference was such that the crossover point was about 8 bytes. So, scraping out the simple memcpy implementations from the code was about a weekly thing for me.

At this point, we discovered that our C compiler would pass structs by value (This was the early-ish days of ANSI C and was a surprise to some of my older coworkers.) and benchmarked that.

And discovered that its copy code was worse than the simple `*dst++ = *src++` loops. By about a factor of 4. (The simple loop would be optimized to work with word-sized ints, while the compiler was generating code that copied each byte individually.)

If you are doing something where this matters, something like VTune is very important. So is the ability to convince people who do stupid things to stop doing the stupid things.

lukaszwojtow

I always prefer by-borrow. That's because in the future this struct may become non-copy and that means some unnecessary refactoring. My thinking is a bit like "don't take ownership if not needed" - the "not needed" part is the most important thing. Don't require things that are not needed.

zozbot234

If a struct might lose Copy you shouldn't implement Copy at all, to preserve forward compatibility. You can still derive Clone in most cases; using .clone() does not per se add any overhead.

carlmr

Exactly, and if performance at some point matters: benchmark!

And I would bet 9 times out of 10 it won't be the bottleneck or even make a measurable difference.

QuadDamaged

Exactly why IMHO the rust stdlib is so easy to understand. Ownership only when required as a design principle tends to make the design of the overall system more consistent / easier to approach.

eterevsky

If it's a 3D real-valued vector, or similarly basic structure, you can be fairly certain, that it will stay copyable.

josephg

I agree. Being copyable is part of the signature for something like this. Explicitly so in rust.

theptip

Rust noob here - is it common to see a struct lose Copy as things grow?

sedatk

> don't take ownership if not needed

That's my approach too as a Rust newbie. Borrow by default and take ownership only when needed, for the best ergonomics.

arcticbull

> Blech! Having to explicitly borrow temporary values is super gross.

I don’t think you ever have to write code like this. Implement your math traits for both value and reference types, like the standard library does.

Go down to Trait Implementations for scalar types, for instance i32 [1]

impl Add<&i32> for &i32

impl Add<&i32> for i32

impl Add<i32> for &i32

impl Add<i32> for i32

Once you do that your ergonomics should be exactly the same as with built in scalar types.

[1] https://doc.rust-lang.org/std/primitive.i32.html
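For a hypothetical 3D vector type, the same pattern might look like this (a sketch mirroring how std implements `Add` for `i32`; the owned-value impl does the work and the borrowed impls forward to it):

```rust
use std::ops::Add;

#[derive(Clone, Copy, Debug, PartialEq)]
struct Vec3 { x: f64, y: f64, z: f64 }

// The owned/owned case does the actual arithmetic.
impl Add<Vec3> for Vec3 {
    type Output = Vec3;
    fn add(self, rhs: Vec3) -> Vec3 {
        Vec3 { x: self.x + rhs.x, y: self.y + rhs.y, z: self.z + rhs.z }
    }
}

// The other three combinations just dereference and forward.
impl Add<&Vec3> for Vec3 {
    type Output = Vec3;
    fn add(self, rhs: &Vec3) -> Vec3 { self + *rhs }
}

impl Add<Vec3> for &Vec3 {
    type Output = Vec3;
    fn add(self, rhs: Vec3) -> Vec3 { *self + rhs }
}

impl Add<&Vec3> for &Vec3 {
    type Output = Vec3;
    fn add(self, rhs: &Vec3) -> Vec3 { *self + *rhs }
}
```

After this, `a + b`, `a + &b`, `&a + b`, and `&a + &b` all compile, matching the ergonomics of built-in scalar types.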

forrestthewoods

Oh neat, that’s my blog. My old posts don’t resurface on HN that often.

Lots of criticism of my methodology in the comments here. That’s fine. That post was more of a self nerd snipe that went way deeper than I expected.

I hoped that my post would lead to a more definitive answer from some actual experts in the field. Unfortunately that never happened, afaik. Bummer.

the_mitsuhiko

My only criticism is the “ugly mess” part. You can implement the traits on references too.

forrestthewoods

True, that does work for traits. But it's super annoying if you have to write multiple copies of the same thing. That can get out of control quick if you need to implement every combination.

And that doesn't help at all if you're writing a "free function" like 3D primitive intersection functions. I suppose you could change that simple function into a generic function that takes AsDeref? Bleh.

brundolf

Maybe it'll happen here! :)

ergonaught

It's compiled, so, without any investigation at all, I would have been disappointed if there were any significant difference in the code emitted in these cases. I would expect the compiler to do the efficient thing based on usage rather than the particular syntax. I may have too much faith in the compiler.

CHY872

I'd expect your claim to be true whenever the callee is inlined into the caller. In this case, the compiler has all the relevant information at the right point in time. As other commenters have pointed out, by enabling inlining the author has gone down a rabbit hole somewhat unrelated to the question, because any copies can be simply elided.

If there's no inlining at play, I'd expect vast differences to be possible. For example, imagine a chain of 3 functions - f calls g, g calls h, where one of the arguments is a 1kB struct and the options are passing by copy or by borrowing. In this case, each stack frame will be 1kB in size in the copy case and there will be a large performance overhead as opposed to the by-reference case. One would expect simply calling the function to be similar in overhead to an uncached memory load.

Within a single crate the inlining is possible; across multiple crates it's only possible with LTO enabled (and I'm not sure how _probable_ it is that the inlining would occur).

In either case, the difference in overhead between a 32-byte and an 8-byte argument is likely meaningless: the sort of thing to optimize if profiling says it's a problem, not ahead of time.
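The three-function chain described above can be sketched in Rust (names and sizes are illustrative, not from the article). With `#[inline(never)]`, each by-value call has to materialize the ~1 kB struct for its callee, while the by-reference chain only passes a pointer:

```rust
// A "large" argument: ~1 kB of plain old data.
#[derive(Clone, Copy)]
struct Big { data: [u8; 1024] }

// By-value chain: without inlining, each call copies the struct
// into (or for) the next frame.
#[inline(never)]
fn h(b: Big) -> u64 { b.data.iter().map(|&x| x as u64).sum() }
#[inline(never)]
fn g(b: Big) -> u64 { h(b) }
#[inline(never)]
fn f(b: Big) -> u64 { g(b) }

// By-reference chain: each call passes an 8-byte pointer.
#[inline(never)]
fn h_ref(b: &Big) -> u64 { b.data.iter().map(|&x| x as u64).sum() }
#[inline(never)]
fn g_ref(b: &Big) -> u64 { h_ref(b) }
#[inline(never)]
fn f_ref(b: &Big) -> u64 { g_ref(b) }

fn main() {
    let big = Big { data: [1; 1024] };
    // Same answer either way; the difference is what each call moves.
    assert_eq!(f(big), 1024);
    assert_eq!(f_ref(&big), 1024);
}
```

(How the by-value copies actually show up depends on the ABI — the caller may build the copy and pass a pointer to it — but a copy has to exist somewhere per call.)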

kibwen

> Within a single crate (more specifically, codegen unit) the inlining is possible

Cross-crate inlining happens all the time. In order to be eligible for inlining, a function needs to have its IR included in the object's metadata. This happens automatically for every generic function (it's the only way monomorphization can work); for non-generic functions it can be enabled manually via the `#[inline]` attribute (which does not force inlining, it only makes inlining possible at the backend's discretion).

However, as you say, if you have LTO enabled then "cross-crate" inlining can happen regardless, since it's all just one giant compilation unit at that point.
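The two eligibility paths described above might look like this in a library crate (function names are hypothetical):

```rust
// Non-generic function: #[inline] includes its IR in the crate
// metadata, so downstream crates *may* inline it. It's a hint to
// the backend, not a guarantee.
#[inline]
pub fn dot(a: (f32, f32, f32), b: (f32, f32, f32)) -> f32 {
    a.0 * b.0 + a.1 * b.1 + a.2 * b.2
}

// Generic function: always eligible for cross-crate inlining,
// because monomorphization in downstream crates requires its IR
// to be present in the metadata anyway — no attribute needed.
pub fn sum_all<T: std::iter::Sum<T> + Copy>(xs: &[T]) -> T {
    xs.iter().copied().sum()
}

fn main() {
    assert_eq!(dot((1.0, 0.0, 0.0), (0.0, 1.0, 0.0)), 0.0);
    assert_eq!(sum_all(&[1, 2, 3]), 6);
}
```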

cogman10

At the VERY end of the article, the author points out "Oh, btw, I used MSVC for the C++ compilation, when I used clang things changed!"

So what the author actually measured throughout the article was the difference between LLVM and MSVC — particularly when they talked about Rust being better at autovectorization than C++.

forrestthewoods

Incorrect. Clang C++ vs MSVC C++ is very comparable, and noticeably worse for f64 by-ref. Clang C++ is still slower than Rust by a large margin. Using Clang C++ throughout would not change any conclusion (or lack thereof).

spuz

I'd be interested to know what the benchmarks of the two rust solutions are when inlining is disabled so we can get an idea of the different performance characteristics of each function call even if it's not a very realistic scenario.

The other question I have is which style you should use when writing a library. It's obviously not possible to benchmark all the software that will call your library, but you still want to consider readability and performance, as well as other factors such as common convention.

ptero

I would go with the version that gives the clean user interface (that is, by copy in this case). If it turns out that the other version is significantly more performant and this additional performance is critical for the end users, consider adding the by-borrow option.

The clarity of the code using a particular library is such a big (but often under-appreciated) benefit that I would heavily lean in this direction when considering interface options. My 2c.

daviddever23box

Agreed - and this applies in nearly every language: start simple, trust your compiler, and optimize only when performance becomes untenable.

osigurdson

The assumption behind such arguments is that when a performance problem does arise, a profiler will point to a single, easy-to-fix smoking gun. Unfortunately this is not always the case. Performance problems can be hard to diagnose and hard to fix. A lot of damage has been done by unexamined, dogmatic application of the "root of all evil" mantra.

throw10920

In the vast majority of situations (1) you'll prematurely optimize in the wrong place and (2) yes the profiler will point to a single, easy-to-fix smoking gun.

Situations otherwise are the exception, rather than the rule, and it takes an expert to (1) recognize those situations and (2) know exactly how to write optimized code in that situation.

That's why "don't prematurely optimize" is a good rule of thumb - because it works the majority of the time, and it takes experience to know when not to apply it.

mattgreenrocks

The misapplication of that mantra doesn’t justify the design damage done by dogmatically passing everything by ref.

There’s no hard and fast rule here. Even if there was, optimizers still occasionally surprise seasoned native devs in both positive and negative ways.

Glad the author’s first instinct was to pull out profiling tools.

kllrnohj

This advice hinges hugely on what "start simple" really means. There are a ton of counter-examples where it just isn't true, depending on what you're calling "simple". JIT'd languages can be especially problematic here. An example would be using Java's Streams interfaces to do something that could be done without much difficulty with a regular boring ol' for loop. At the end of the day you're hoping the JIT will eventually convert the streams version into the same machine code the for-loop version would have produced. But it won't do that consistently, and you've still wasted time before it does.

Trusting the compiler also means knowing what the compiler actually understands & handles vs. what's a library-provided abstraction that's maybe too bloated for its own good and that quickly becomes "not simple" depending on your language of choice.

int_19h

> you're hoping the JIT will eventually convert the streams version into the same bytecode

Not really. I'm just hoping that it will be "fast enough", which in the vast majority of cases it is.

mlindner

I agree in general, but the side-effect of doing this is that no matter how fast your hardware gets, your software will always end up optimized to the new hardware. So over time your software gets slower and slower but performance stays consistent as hardware gets faster.

Rustwerks

I just went through all of this when building a raytracer.

* Sprinkling & around everything in math expressions does make them ugly. Maybe rust needs an asBorrow or similar?

* If you inline everything then the speed is the same.

* Link time optimizations are also an easy win.

https://github.com/mcallahan/lightray

woodruffw

> Maybe rust needs an asBorrow or similar?

FWIW, the `Borrow`, `AsRef`, and `Deref` traits all exist to support different variants of this.
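As one sketch of this (the `Vec3` type and `length` function are hypothetical, not from the article): a free function bounded on `Borrow<T>` accepts either an owned value or a reference, via the std blanket impls `impl<T> Borrow<T> for T` and `impl<T> Borrow<T> for &T`:

```rust
use std::borrow::Borrow;

#[derive(Clone, Copy)]
struct Vec3 { x: f32, y: f32, z: f32 }

// Callers can pass a Vec3 or a &Vec3; both satisfy Borrow<Vec3>.
fn length(v: impl Borrow<Vec3>) -> f32 {
    let v = v.borrow();
    (v.x * v.x + v.y * v.y + v.z * v.z).sqrt()
}

fn main() {
    let v = Vec3 { x: 3.0, y: 4.0, z: 0.0 };
    assert_eq!(length(v), 5.0);  // by value
    assert_eq!(length(&v), 5.0); // by reference
}
```

Whether this is nicer than sprinkling `&` is a matter of taste — the generic bound adds its own noise, and the function is monomorphized per argument form.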

masklinn

> * Sprinkling & around everything in math expressions does make them ugly. Maybe rust needs an asBorrow or similar?

Do you mean AsRef, or do you mean magic which automatically borrows parameters? The latter is specifically what Rust does not do, any more than e.g. C does.

Though you can probably get both if the by-ref version is faster (or more convenient internally): wrap the by-ref function with a by-value wrapper which is #[inline]-ed. This way the interface is by value but the actual parameter passing is by-ref (as the value-consuming wrapper will be inlined and essentially removed).
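A minimal sketch of that wrapper pattern (names are hypothetical): the worker takes a reference, and a thin `#[inline]` by-value wrapper gives callers the cleaner interface. Once the wrapper is inlined, the argument is effectively passed by reference:

```rust
#[derive(Clone, Copy)]
pub struct Quat { pub x: f32, pub y: f32, pub z: f32, pub w: f32 }

// The real implementation takes a reference.
fn norm_impl(q: &Quat) -> f32 {
    (q.x * q.x + q.y * q.y + q.z * q.z + q.w * q.w).sqrt()
}

// Thin by-value facade; after inlining it should vanish, leaving
// the by-ref call underneath.
#[inline]
pub fn norm(q: Quat) -> f32 {
    norm_impl(&q)
}

fn main() {
    let q = Quat { x: 0.0, y: 0.0, z: 3.0, w: 4.0 };
    assert_eq!(norm(q), 5.0);
}
```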

zamalek

The benchmarks lack the standard deviation, so the results may well be equivalent. Don't roll your own micro-benchmark runners.

References may get optimized to copies where possible and sound (i.e. blittable and const); a common heuristic involves the size of a cache line (64 bytes on most modern ISAs, including x86_64).

Using a Vector4 would have pushed the structure size closer to the 64-byte heuristic. You would also need to disable inlining for the measured methods.

cogman10

It was also (needlessly) using two different compilers, MSVC and LLVM. This is just a bad way to compare things all around.

And, for simple operations like this, you really should just look at the assembly output. If you are only generating 20ish instructions, then look at those 20 instructions rather than trying to heuristically guess what is happening.


Should small Rust structs be passed by-copy or by-borrow? (2019) - Hacker News