
shwestrick

This debate has gone round and round for decades. There are no hard lines; this is about performance tradeoffs, and always will be.

Perhaps the biggest misconception about reference counting is that people believe it avoids GC pauses. That's not true. Essentially, whereas tracing GC has pauses while tracing live data, reference counting has pauses while tracing garbage.

Reference counting is really just another kind of GC. I'd highly recommend perusing this paper for more details: A Unifying Theory of Garbage Collection. https://web.eecs.umich.edu/~weimerw/2012-4610/reading/bacon-...

One of the biggest issues with reference counting, from a performance perspective, is that it turns reads into writes: if you read a heap-object out of a data structure, you have to increment the object's reference count. Modern memory architectures have much higher read bandwidth than write bandwidth, so reference counting typically has much lower throughput than tracing GC does.
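That amplification is easy to see in C++, where copying a shared_ptr out of a container does an atomic count update even though the caller only wants to read (a minimal sketch; Node and the function names are illustrative):

```cpp
#include <cassert>
#include <memory>
#include <vector>

struct Node { int value; };

// Copying the shared_ptr out of the vector atomically bumps the
// reference count -- a write to shared memory -- even though we
// only want to read node->value.
int read_with_refcount(const std::vector<std::shared_ptr<Node>>& v) {
    std::shared_ptr<Node> p = v.front();  // atomic increment here
    return p->value;                      // atomic decrement when p is destroyed
}

// A borrowed read (what tracing GC effectively gets on every access)
// touches no counts at all:
int read_borrowed(const std::vector<std::shared_ptr<Node>>& v) {
    const Node* p = v.front().get();  // plain pointer read, no count traffic
    return p->value;
}
```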

deterministic

I am the maintainer of a very high-performance JIT compiler for a Haskell-like rules programming language used by large enterprises around the world. It uses reference counting + a global optimisation step to reduce the reference count updates to an absolute minimum. The result is compiled code that runs faster than C++ code carefully hand optimised by C++ experts over a 10 year period. There are zero GC pauses. Unless you claim that a C++ alloc/free call is “garbage collection”. Which is not common terminology. It also (BTW) scales linearly the more cores you throw at it.

brabel

You've gone from claiming reference-counting is faster than tracing GC to claiming it's even faster than hand optimized C++, which is quite honestly unbelievable - whatever the reference counting algorithm is doing can be emulated by the hand-optimised C++ code so that's just literally impossible. But anyway, it's a completely fruitless discussion here unless you provide data that we can look at and scrutinize. OP hasn't provided any. You haven't provided any (and I do believe you may think you're right, but I've been in the position of being very confident of something just to be proven completely wrong by giving all my data to others to scrutinize... it's disheartening but necessary to get to the bottom of what's real). It's like the V language saying it can do memory management magically and it's much faster than Rust or whatever when they don't even have a working system yet.

fnord123

> to claiming it's even faster than hand optimized C++, which is quite honestly unbelievable - whatever the reference counting algorithm is doing can be emulated by the hand-optimised C++ code so that's just literally impossible.

There is nuance here. They claimed that their project is faster than a specific hand-optimized project, not faster than a theoretical peak-performance C++ program.

I've run into similar situations where Python reimplementations are faster than Java because Python is easier to change and fix up algorithms in. And %timeit in the IPython shell is way easier than the black magic involved in profiling and benchmarking Java.

You also have people on rust subreddit or discourse asking for optimization help when their rust is not as fast as a Go example they wrote. Often you get buffered IO going and it's on par. But to melt faces like ripgrep and friends you often need to drop pretences and work on Vec<u8>s.

deterministic

The code is not open source unfortunately so I can’t provide evidence. But I have nothing to sell or promote (unlike the V language guys, who have a clear motivation to inflate what they are selling). What the compiler does differently is that it does a whole-program optimisation pass where it reduces the code manipulating the reference counters to an absolute minimum. This is something you can’t do in a C++ compiler. Which is one of the reasons why the resulting machine code is faster. Also, I am using LLVM as the backend. So I am getting the same low-level optimisations that C++ compilers get. However the most important optimisations are done before lowering the code to LLVM. Function specialisation for example is huge (generating multiple versions of the same function with the arguments inlined and optimised away). I am doing quite a few high-level optimisations like that before even lowering it to LLVM.

staticassertion

> whatever the reference counting algorithm is doing can be emulated by the hand-optimised C++ code so that's just literally impossible.

shared_ptr and unique_ptr have pretty significant overhead and are common practice, even for optimized codebases, so I wouldn't say it's impossible at all.

yakubin

What happens in your language when a linked list is freed? Doesn't running its destructor (or its equivalent) take a linear amount of time relative to the length of the list?

mbrodersen

The compiler uses arrays, not linked lists. One of the big mistakes that other functional compilers make (IMHO) is that they use linked lists. It is a huge performance problem. There is a reason why high-performance software written in C++ and C always uses arrays and not linked lists. Memory access patterns are the #1 thing to optimise for on modern CPUs.

afiori

My guess is that this could be done concurrently and/or in parallel.

It still takes time linear in the length, but dead nodes are by definition stable, so it does not really matter when you free them.

pclmulqdq

The global optimization step is often what people commonly refer to as "garbage collection." Putting it inside a framework to RC as few times as possible is pretty cool.

However, I doubt the efficacy of your C++ experts: most of the people I know who write C++ are actually really bad at optimizing code. They mostly use it for legacy reasons. If you get a team of experienced (and expensive) systems programmers, you will likely get a slightly better result than your GC algorithm.

mbrodersen

The C++ code I am referring to was hand optimised and maintained for 10+ years before the compiler I am maintaining was fast enough to take over. I never looked at the C++ code myself. I simply use it as the milestone to beat. When I took over the maintenance of the compiler it was 25x slower than the C++ code. It took a lot of work to finally make it faster. I also used the C++ code to compare the output of the compiler and the C++ code. I found quite a few bugs in the C++ code doing that by the way.

UncleEntity

> However, I doubt the efficacy of your C++ experts: most of the people I know who write C++ are actually really bad at optimizing code.

Which is a very good reason to develop an optimized GC algorithm, the domain experts can crank out code without having to optimize every single memory (de)allocation which sounds like a waste of their time.

It’s funny, people don’t usually doubt that a modern compiler can do a better optimization job than an expert but add a memory management algorithm and that’s a bridge too far.

dataflow

What I understood from their comment (which may not be correct) is the following. Say you have something like this:

  extern void foo(T *p);   // some arbitrary function
  
  void bar1(bool cond)
  {
    ...
    std::unique_ptr<T, your_deleter> p(new T());
    if (cond) { return foo(p.release()); }
    ...
  }
This requires the compiler to emit the deleter logic (a null check plus a potential call to your_deleter::operator()) regardless of whether cond is true or false, even though it's unnecessary (and can thus be slower) in the case where cond is true. Moreover, the obvious way to avoid it is to write it "C-style":

  void bar2(bool cond)
  {
    ...
    auto p = new T();
    if (cond) { return foo(p); }
    ...
    your_deleter()(p);
  }
which can end up being faster when cond is true. But this isn't something an expert would generally want to do, as now the C++ code becomes unidiomatic, fragile, and unmaintainable.

In an ideal world, though, you could have an optimizer smart enough to do that transformation automatically. C++ compilers already do that in trivial cases, but they can't do it in general. My impression is that their Haskell compiler exploits the internal knowledge of what your_deleter does (i.e. reference counting) in order to optimize the code in various ways, like optimizing out such code, consolidating refcount updates, etc. And if I understand this correctly, there's no surprise at all that it can be faster than idiomatic C++ code written even by experts.

The question for me isn't the expertise of their programmers. Perhaps in their case they genuinely do need to have lots of objects on the heap, have (say) tight loops where they (for whatever reason) nevertheless cannot avoid the heap allocations, and don't have much of a use for finalizers besides freeing memory. In which case, I'm not surprised their solution clearly delivers better results than the C++ equivalent. The question for me, instead, is how well they think that generalizes, such as to (a) well-written Haskell programs in general, (b) well-written C++ programs in general, and/or (c) other domains. It would be one thing if their solution delivers better results in Haskell than C++ for their use case; it would be another thing if they could claim their solution delivers better results in Haskell than C++ for most use cases.

mbrodersen

Just to clarify: it is not a framework. It is a whole program optimisation pass done by the compiler before lowering the intermediate code to LLVM and then generating machine code.

viraptor

How do you collect cycles without a pause?

eru

There are solutions to this one. Real time garbage collectors are a thing. You just might not be able to collect the cycle in one GC run.

Or you can do what Erlang does: Erlang has neither mutation nor laziness, so you can't create cycles. The GC also lays out objects in topological order in memory, so that you can detect garbage without tracing every live object.

elcritch

By requiring non-cyclic data structures, or by providing "weak" references. It's actually pretty easy to write most code without cycles. I program a lot in Nim and generally compile lots of programs with ARC and no cycle collector without issue.
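The same escape hatch exists in C++'s reference-counted smart pointers: holding the back-edge as a weak reference keeps the ownership graph acyclic, so plain counting suffices (a minimal sketch with hypothetical Parent/Child types):

```cpp
#include <cassert>
#include <memory>
#include <vector>

struct Child;
struct Parent {
    std::vector<std::shared_ptr<Child>> children;  // owning, counted edges
};
struct Child {
    std::weak_ptr<Parent> parent;  // non-owning back-edge: breaks the cycle
};

// Returns true while the parent is still alive, as observed from the child.
bool parent_alive(const std::shared_ptr<Child>& c) {
    return !c->parent.expired();
}
```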

mbrodersen

The language has zero data cycles by design. It turns out that we didn’t need it. Thinking about it, I can’t remember even a single situation in my 30+ years career where I needed a data structure with data cycles. There are usually better ways to do it in my experience.

pkolaczk

I simply don't do cycles. Software is simpler without them.

dataflow

I'm gonna take a guess they dedicate a core to GC (or something along these lines).

tsimionescu

free() calls that have to run for a data-dependent amount of time are more or less equivalent to GC pauses (assuming a concurrent GC that doesn't need to stop the world, like Java's). The most typical example is free()-ing a linked list, which takes O(n) free() calls with a simple RC mechanism.

kccqzy

If you are assuming GC does not need to stop the world, you can also assume that the freeing of memory (including the O(n) calls to free()) will not be done in a critical path; all memory could be handed off to a separate dedicated thread that actually calls free(). Or in a RPC or HTTP server all memory can be freed after the request has been served.

It's very easy to make a reference counting scheme not stop the world. It's a bit more difficult to do the same for a tracing GC implementation.
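A minimal C++ sketch of that hand-off, assuming a dedicated thread periodically calls drain(); the class and method names are illustrative:

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// A bin that request handlers dump garbage into in O(1);
// a dedicated thread drains it later and runs the actual
// destructors off the latency-sensitive path.
class DeferredFree {
    std::mutex m_;
    std::vector<std::shared_ptr<void>> bin_;
public:
    // Hot path: just move the last reference into the bin.
    void discard(std::shared_ptr<void> p) {
        std::lock_guard<std::mutex> g(m_);
        bin_.push_back(std::move(p));
    }
    // Called from the background thread: destructors run here.
    std::size_t drain() {
        std::vector<std::shared_ptr<void>> batch;
        { std::lock_guard<std::mutex> g(m_); batch.swap(bin_); }
        std::size_t n = batch.size();
        batch.clear();  // the O(n) free() work happens now, not in the handler
        return n;
    }
};
```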

mbrodersen

The compiler uses arrays for lists instead of linked lists. It is a performance improvement compared with how functional compilers usually do it. And the language is designed in such a way that rule writers manipulate whole lists at a time instead of one element at a time. A bit like array languages. It changes how the code is written compared with traditional functional code.

naasking

> There are zero GC pauses. Unless you claim that a C++ alloc/free call is “garbage collection”.

Alloc/free can introduce arbitrary pauses last I checked, so yes, there are pauses. Any time doing book keeping for resources rather than running your code counts as GC time.

MaxBarraclough

> Any time doing book keeping for resources rather than running your code counts as GC time.

Perhaps a nitpick: memory management time, yes, but not GC time.

alloc/free is manual memory management, not garbage collection.

aaaaaaaaaaab

On any OS which is not hard realtime, there could be arbitrary pauses with any syscall. This is just nitpicking.

refulgentis

I'm not sure what's being asserted here, could you explain more? This sounds like you're describing a non-stopping GC, and it's well understood reference counting is garbage collection. I'm not sure how the rest applies; you're correct, it is possible to write software with just malloc and free.

cakoose

Re: "it's well understood reference counting is garbage collection". I think this might just be a terminology thing.

There appear to be two ways the terms are categorized:

1. "reference counting" and "garbage collection" are two types of automatic memory management/reclamation.

2. "reference counting" and "tracing garbage collection" are two types of garbage collection.

I think mbrodersen is using #1.

(Back in the 90s and 2000s I feel like #1 was much more prevalent. But #2 seems to have gained popularity since then. My theory is that whoever started writing the Wikipedia content for this stuff picked #2.)

Bolkan

Pics or gtfo.

bitcharmer

Yeah, GP's claim is completely bonkers. And they haven't provided any links or data to back up their nonsensical claim.

hamstergene

For a unifying term I prefer Automatic Memory Management.

One reason is that GC is already universally used to mean only tracing garbage collection, and trying to defend its wider meaning is a pointless uphill battle.

Another is that it suits the job much better, because not every AMM technique works by producing garbage and then collecting it, you know.

kibwen

If you want to get even more precise, call it automatic dynamic memory management. Automatic static memory management would be something like Rust's scope-based memory reclamation via ownership.

pjmlp

Great to confuse more people instead of properly learning about affine type systems.

Someone

> not every AMM technique works by producing garbage then collecting it

And the confusing thing is that garbage collection (GC) doesn’t collect garbage, while reference counting (RC) does.

GC doesn’t look at every object to decide whether it’s garbage (how would it determine nothing points at it?); it collects the live objects, then discards all the non-live objects as garbage.

RC determines an object has become garbage when its reference count goes to zero, and then collects it.

That difference also is one way GC can be better than RC: if there are L live objects and G garbage objects, GC has to visit L objects, and RC G. Even if GC spends more time per object visited than RC, it still can come out faster if L ≪ G.

That also means that GC gets faster if you give it more memory so that it runs with a smaller L/G ratio (the large difference in speeds in modern memory cache hierarchies makes that not quite true, but I think it still holds, ballpark).

omginternets

What are these other techniques and what can a technically-literate newcomer like myself read to get acquainted?

pjmlp

I just keep feeding them the respective CS literature.

kaba0

One great example would be a C++ program that runs fast, and then just spends 10s of seconds doing “nothing” while it deallocates shared pointers’ huge object graphs at the end. They really are two sides of the same coin, with tracing GCs being actually correct (you need cycle detection for a correct RC implementation), and having much better throughput. It’s not an accident that runtimes with thousands of dev hours are polishing their GC solutions instead of using a much more trivial RC.

hinkley

I don't know what the current state of the art is, but at one point the answer to GC in a realtime environment was to amortize free() across malloc(). Each allocation would clear up to 10 elements from the queue of free-able memory locations. That gives a reasonably tight upper bound on worst case alloc time, and most workflows converge on a garbage-free heap. Big malloc after small free might still blow your deadlines, but big allocations after bootstrapping are generally frowned upon in realtime applications, so that's as much a social problem as a technical one.
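A sketch of that amortization in C++ (the budget of 10 and all names are illustrative): every allocation first retires a bounded number of pending frees, so no single call pays for a long garbage chain.

```cpp
#include <cstdlib>
#include <deque>

// Pointers whose free() has been deferred.
static std::deque<void*> g_free_queue;

// O(1): just enqueue; no deallocation work happens here.
void deferred_free(void* p) {
    g_free_queue.push_back(p);
}

// Each allocation retires at most `budget` pending frees,
// bounding the worst-case time of any single call.
void* amortized_malloc(std::size_t size, int budget = 10) {
    for (int i = 0; i < budget && !g_free_queue.empty(); ++i) {
        std::free(g_free_queue.front());
        g_free_queue.pop_front();
    }
    return std::malloc(size);
}
```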

pclmulqdq

This is normal today inside malloc implementations.

pklausler

It's not marketed as a GC, but exit(2) is fast and effective when used as one.

game-of-throws

I wouldn't call it a GC. It's more like an arena allocator, with the arenas managed by the OS :)

saagarjha

Now you've pushed the job of cleaning up page tables to the OS.

im3w1l

Fun fact: if you don't do anything important in the destructors, you can avoid that delay by intentionally leaking the memory. The OS will clean it up when the program exits, and it does a better job since it frees the pages rather than looking at your objects one by one.

kaba0

Most GCs do exactly that -- they only "work" when absolutely necessary, and their heuristics say that they are falling behind the created garbage. If the program exits shortly after, it will just leak the memory.

The problem with that in the case of C++ is that you likely only want to leak things used in the end from the main thread, but not "recursively" - the distinction is hard to do.

pornel

Saying that both have pauses is a false equivalence to me.

It overlooks the difference in how likely this can occur (without large enough object graphs freed from the top it may never be an issue), when this occurs (any time vs on cleanup that may not be latency sensitive), and how much control the programmer has over RC costs (determinism allows to profile this and apply mitigations).

RC with borrow checking can avoid a lot of refcount increments.

Tracing GC typically needs write barriers, so it’s not free either.

kaba0

> and how much control the programmer has over RC costs (determinism allows to profile this and apply mitigations).

I fail to see how would it be deterministic in a highly dynamic program. Like, imagine a game for example where the user can drag'n'drop different things to a "parent" object. Observability is imo an entirely different axis.

> RC with borrow checking can avoid a lot of refcount increments.

That's the same thing as escape analysis - with language support many many objects could be effectively "removed" from the guidance of the GC, decreasing load and greatly improving performance. It is a language-level feature, not inherent in the form of GC we do (RC vs tracing)

_gabe_

> Like, imagine a game for example where the user can drag'n'drop different things to a "parent" object.

Where is the reference count going to 0 here? Presumably the object you dragged from one parent to another parent just swapped the reference and no ref count ever went to 0 which would have triggered a free.

I'm guessing OP meant deterministic as in, "oh when I profile my code and it hits that big lag spike, this occurs in my particle system. Let's swap that to a fixed size static memory allocation instead of RC". Whereas with GC you say, "oh my program has a big lag spike. I wish I could tell why that is, but the code executing at that time may or may not have been the reason GC was collecting then". The only experience I've had trying to get accurate profiling data on memory usage in Java/C# was aggravating to say the least.

Also, most games these days (and probably since the beginning of gaming) don't allocate and free stuff randomly. They're most likely using RC on big blocks of memory that are shared among several other game objects. Or RC on several fixed size pools that get used for different components like in an ECS. Or they use RC for things like long living assets that have difficult lifetime management.

Honestly, RC and garbage collection are unnecessary in most instances. Most of the time all you need is a unique pointer because you know exactly where the ownership of the object lies. For the edge cases, RC works fine.

I guess my biggest complaint is, why rely on a magical algorithm that you have no control over? I've had so many "fun" times trying to force GC to collect at specific points. It's really nice when you can say, "at this point in the program, this ref count will go to 0 and it will free and I don't have to worry about it lagging anything". Rather than "please GC.collect(), actually do something when I call you right here and don't wait another 5 minutes and ignore my pleading".

pornel

Even in a highly dynamic program you can know all the places where releases happen. If you find a slow case, you can reproduce it, and mitigate (e.g. send object to a background thread to avoid lag on the ui thread). This is different from tracing GC that may fire on any allocation, on a timer, and workload depends on heuristics and other things in the program.

Escape analysis is hard, and GC is typically used to free programmers from assisting it with extra syntax or unsafety. This is why GC languages tend to use generational GC instead of having precise analysis.

im3w1l

If you use refcounted pointers for everything then you'd be better off with a proper GC. But at least in the programs I see, 99% of objects are not refcounted, and that is reserved for a tiny minority of objects with especially tricky lifetimes.

ncmncm

This is the key.

Using std::shared_ptr in a performance-sensitive context (e.g. after startup completes) is code smell.

Using pointers as important elements of a data structure, such that cycles are possible at all, is itself code smell. A general graph is usually better kept as elements in a vector, deque, or even hash table, compactly, with indices instead of pointers, and favoring keeping elements used near one another in the same cache line. Overuse of pointers to organize data tends toward pointer-chasing, among the slowest of operations on modern systems.
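As a minimal C++ sketch of that layout (names illustrative): nodes live contiguously in one vector and edges are indices into it, so even a cyclic graph needs no owning pointers and frees in one shot.

```cpp
#include <cstdint>
#include <vector>

// A general (possibly cyclic) graph with no pointers at all:
// nodes sit contiguously in one vector, edges are indices into it.
struct Graph {
    struct Node {
        int value;
        std::vector<std::uint32_t> out;  // edge targets, by index
    };
    std::vector<Node> nodes;

    std::uint32_t add(int value) {
        nodes.push_back({value, {}});
        return static_cast<std::uint32_t>(nodes.size() - 1);
    }
    void link(std::uint32_t from, std::uint32_t to) {
        nodes[from].out.push_back(to);
    }
};
```

Because ownership lives in the vector, clearing it releases the whole graph at once, with no per-node refcount traffic or cycle detection.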

Typical GC passes consist of little else but pointer chasing.

But the original article is completely, laughably wrong about one thing: an atomic increment or decrement is a remarkably slow operation on modern hardware, second only to pointer chasing.

Systems are made fast by avoiding expensive operations not dictated by the actual problem. Reference counting, or any other sort of GC, counts as overhead: wasting time on secondary activity in preference to making forward progress on the actual reason for the computation.

Almost invariably neglected or concealed in promotion of non-RC GC schemes is overhead imposed by touching large parts of otherwise idle data, cycling it all through CPU caches. This overhead is hard to see in profiles, because it is imposed incrementally throughout the runtime, showing up as 200-cycle pauses waiting on memory bus transactions that could have been satisfied from cache if caches had not been trashed.

If a core is devoted to GC, then sweeps would seem to cycle everything through just that core's cache, avoiding trashing other cores' caches. But the L3 cache used by that core is typically shared with 3 or 7 other cores', so it is hard to isolate that activity to one core without wastefully idling those others. Furthermore, that memory bus activity competes with algorithmic use of the bus, slowing those operations.

Another way GC-dependence slows programs is by making it harder, or even impossible, to localize cost to specific operations, so that reasoning about performance becomes arbitrarily hard. You lose the ability to count and thus minimize expensive operations, because the cost is dispersed throughout everything else.

ribit

> But the original article is completely, laughably wrong about one thing: an atomic increment or decrement is a remarkably slow operation on modern hardware, second only to pointer chasing.

Very true. No matter how one looks at it, GC imposes a non-trivial overhead. If I remember correctly, wasn’t there a study that demonstrated an average Swift application spending 40% of its time on RC?

But at the same time hardware can be improved. Apple CPUs are an obvious example. Uncontended atomics are extremely cheap and the memory prefetcher can detect and prefetch pointer chases.

eru

Nowadays good compilers can handle many of these non-tricky lifetimes statically, too.

tuetuopay

> One of the biggest issues with reference counting, from a performance perspective, is that it turns reads into writes: if you read a heap-object out of a data structure, you have to increment the object's reference count.

This is one of the biggest misconceptions about RC. You need not increase the refcount just to read the referred data, because you already have a reference whose count was increased when it was handed down to you. That’s a semantic that’s very well captured by Rust’s Arc type: the count is inc’d when the Arc is cloned, and dec’d when the cloned Arc is dropped. But you can still get a regular ref to the data, since the compiler is able to locally enforce the borrow, ownership and lifetime rules.

For example, in a web server, you might have the app’s config behind an Arc. It gets cloned for each request (thus rc inc’d), read a lot during the req, then dropped (thus rc dec’d) at the end of the handler.
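A C++ analogue of that pattern, sketched with a hypothetical Config type: the shared_ptr is copied once at the request boundary, and everything inside the handler borrows a plain reference, so reads never touch the count.

```cpp
#include <memory>
#include <string>

struct Config { std::string greeting; };

// Inner code borrows: none of these reads touch the refcount.
std::string render(const Config& cfg) { return cfg.greeting + ", world"; }

// One atomic increment per request (the by-value copy), not per read.
std::string handle_request(std::shared_ptr<Config> cfg) {
    std::string a = render(*cfg);
    std::string b = render(*cfg);   // still zero count updates
    return a + "; " + b;
}   // one atomic decrement when cfg is dropped here
```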

arcticbull

RC may turn reads into writes, but of course, GC ends up having to go through literally every piece of memory ever from bottom to top once in a while.

RC limits itself to modifying only relevant objects, whereas GC reads all objects in a super cache-unfriendly way. Yes, an atomic read-modify-write is worse than a read, but it's not worse than a linked-list traversal of all of memory all the time.

And of course, not all kinds of object lend themselves to garbage collection - for instance, file descriptors, since you can't guarantee when or if they'll ever close. So you have to build your own reference counting system on top of your garbage collected system to handle these edge cases.

There's trade-offs, yes, but the trade-off is simply that garbage collected languages refuse to provide the compiler and the runtime all the information they need to know in order to do their jobs - and a massive 30 year long effort kicked off to build a Rube Goldberg machine for closing that knowledge gap.

klodolph

> RC may turn reads into writes, but of course, GC ends up having to go through literally every piece of memory ever from time to time.

Depends on the GC algorithm used. Various GC algorithms only trace reachable objects, not unreachable ones.

Reference counting does the opposite, more or less. When you deallocate something, it's tracing unreachable objects.

One of the problems with this is that reference counting touches all the memory right before you're done with it.

> And of course, not all kinds of object lend themselves to garbage collection - for instance, file descriptors, since you can't guarantee when or if they'll ever close. So you have to build your own reference counting system on top of your garbage collected system.

This is not a typical solution.

Java threw finalizers into the mix and everyone overused them at first before they realized that finalizers suck. This is bad enough that, in response to "too many files open" in your Java program, you might invoke the GC. Other languages designed since then typically use some kind of scoped system for closing file descriptors. This includes C# and Go.
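The scoped approach translated into this thread's C++ terms, as a minimal RAII sketch: the descriptor closes deterministically at end of scope, with no finalizer or GC involvement.

```cpp
#include <cstdio>

// RAII wrapper: the FILE* closes when the object leaves scope,
// deterministically -- no finalizer, no GC involvement.
class ScopedFile {
    std::FILE* f_;
public:
    explicit ScopedFile(std::FILE* f) : f_(f) {}
    ~ScopedFile() { if (f_) std::fclose(f_); }
    ScopedFile(const ScopedFile&) = delete;
    ScopedFile& operator=(const ScopedFile&) = delete;
    std::FILE* get() const { return f_; }
    bool ok() const { return f_ != nullptr; }
};
```

C#'s `using` and Go's `defer` give the same guarantee without tying it to object destruction.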

Garbage collection does not need to be used to collect all objects.

> There's trade-offs, yes, but the trade-off is simply that garbage collected languages refuse to provide the compiler and the runtime all the information they need to know in order to do their jobs - and a massive 30 year long rube goldberg machine was built around closing that gap.

When I hear rhetoric like this, all I think is, "Oh, this person really hates GC, and thinks everyone else should hate GC."

Embedded in this statement are usually some assumptions which should be challenged. For example, "memory should be freed as soon as it is no longer needed".

arcticbull

> When I hear rhetoric like this, all I think is, "Oh, this person really hates GC, and thinks everyone else should hate GC."

Yeah, I think it's an inelegant, brute-force solution to a language problem - and that we continue to throw good money after bad improving it.

We should be investing in removing the need for GC through smarter compilers and through languages that allow us to better express our intent - and our object graph. Instead, we continue to improve on a giant linked-list traversal through all of live memory to try and infer the object graph at runtime. Without sufficient information to do it efficiently. The fundamental issue is that we don't have languages that express our goals in a way that allows memory management to be elided efficiently.

That doesn't mean I have some particular affinity for reference counting, though. It has its own issues as you rightly point out. I prefer it for its determinism, nothing more.

tsimionescu

> RC limits itself to modifying only relevant objects, whereas GC reads all objects in a super cache-unfriendly way. Yes, an atomic read-modify-write is worse than a read, but it's not worse than a linked-list traversal of all of memory all the time.

All tracing GC algorithms scan only live memory, and they typically do so in an array-like scan (writing some bits in the object header when a pointer to that object is discovered), not in linked-list order.

bitcharmer

> RC may turn reads into writes, but of course, GC ends up having to go through literally every piece of memory ever from bottom to top once in a while.

> RC limits itself to modifying only relevant objects, whereas GC reads all objects in a super cache-unfriendly way. Yes, an atomic read-modify-write is worse than a read, but it's not worse than a linked-list traversal of all of memory all the time.

This is patently untrue. Contemporary GCs have had card marking/scanning for 10+ years now.

garethrowlands

Pointer chasing is expensive because fetching a random location defeats locality, and therefore the caches and the CPU's stream detection.

But a compacting GC copies the data it's scanned into a contiguous stream, dramatically improving locality, cache utility and stream detection. And this affects not only subsequent GCs but also the application itself, which may traverse its object graph far more often than the GC does.

int_19h

Most GC implementations that I can think of know which bits in memory are pointers and which aren't, and only scan the pointers.

waterhouse

> Essentially, whereas tracing GC has pauses while tracing live data, reference counting has pauses while tracing garbage.

This is pretty trivial to avoid. When your thread finds itself freeing a big chain of garbage objects, you can have it stop at any arbitrary point and resume normal work, and find some way to schedule the work of freeing the rest of the chain (e.g. on another thread). It's much more complex and expensive to do this for tracing live data, because then you need to manage the scenario where the user program is modifying the graph of live objects while the GC is tracing it, using a write or read barrier; whereas for garbage, by definition you know the user can't touch the data, so a simple list of "objects to be freed" suffices.
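A minimal sketch of that idea in C++ (the budget and names are illustrative): free only a bounded number of nodes now and keep the unreachable tail for later, which is safe precisely because it is garbage.

```cpp
#include <memory>
#include <utility>

struct ListNode {
    int value;
    std::unique_ptr<ListNode> next;
};

// Free at most `budget` nodes; return the still-unfreed tail.
// The tail is garbage by construction -- nothing else can reach it --
// so it can be finished later (even on another thread) with no barrier.
std::unique_ptr<ListNode> free_with_budget(std::unique_ptr<ListNode> head,
                                           int budget) {
    while (head && budget-- > 0)
        head = std::move(head->next);  // drops exactly one node per step
    return head;  // remaining garbage, to be scheduled for later
}
```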

"Reads become writes" (indeed, they become atomic read-modify-writes when multiple threads might be refcounting simultaneously) is a problem, though.

kaba0

> It's much more complex and expensive to do this for tracing live data

But this is what happens in a modern state-of-the-art tracing GC implementation, isn't it?

pjmlp

And so it becomes a simple tracing GC implementation, while one keeps calling it RC to feel good.

jerf

From what I can see, the myth that needs to be debunked isn't that garbage collection is super fast and easy with no consequences, it's the myth that garbage collection always automatically means your program is going to be spending 80% of its time doing it and freezing for a second every five seconds the instant you use a language with garbage collection. I see far more "I'm writing a web server that's going to handle six requests every hour but I'm afraid the garbage collector is going to trash my performance" than people who believe it's magically free.

It's just another engineering decision. On modern systems, and especially with any runtime that can do the majority of the GC threaded and on an otherwise-unused core, you need to have some pretty serious performance requirements for GC to ever get to being your biggest problem. You should almost always know when you're setting out to write such a system, and then, sure, think about the GC strategy and its costs. However for the vast bulk of programs the correct solution is to spend on the order of 10 seconds thinking about it and realizing that the performance costs of any memory management solution are trivial and irrelevant and the only issue in the conversation is what benefits you get from the various options and what the non-performance costs are.

It is in some sense as big a mistake (proportional to the program size) to write every little program like it's a AAA game as it is to write a AAA game as if it's just some tiny little project, but by the sheer overwhelming preponderance of programming problems that are less complicated than AAA games, the former happens overwhelmingly more often than the latter.

Edit: I can be specific. I just greased up one of my production systems with Go memstats. It periodically scans XML files via network requests and parses them with a parser that cross-links parents, siblings, and children using pointers and then runs a lot of XPath on them, so, it's kinda pessimal behavior for a GC. I tortured it far out of its normal CPU range by calling the "give me all your data" JSON dump a 100 times. I've clicked around on the website it serves to put load on it a good 10x what it would normally see in an hour, minimum. In 15 minutes of this way-above-normal use, it has so far paused my program for 14.6 milliseconds total. If you straight-up added 14.6 milliseconds of latency to every page it scanned, every processing operation, and every web page I loaded, I literally wouldn't be able to notice, and of course that's not what actually happened. Every second worrying about GC on this system would be wasted.

code_runner

The biggest GC issues I’ve personally seen were arrays of historical data that grew to tens of millions of entries; because of how .NET stores arrays, the array was placed in the large object heap. Swapping to a linked list actually fixed the issue and the team lived to fight another day.

Like a lot of premature optimization, it isn’t a problem until it is… but solutions aren’t unattainable.

hinkley

I still mostly remember the day a coworker convinced me that object pooling was dead because it tears the mature generation a new one over and over.

It's nice when the runtime solves a problem you've had to solve yourself, but it also takes a bit of your fun away, even if your coworkers are relieved not to have to deal with 'clever' code anymore.

nickbauman

Yes; I have a friend who is part of a small team that wrote a very successful stock market trading gateway in Java. Turns out the JVM's GC can be tuned in very specific ways based on your needs. And there are ways to avoid having to do JVM GC in critical areas of the code as well.

jonas21

> And there are ways to avoid having to do JVM GC in critical areas of the code as well.

Yeah, you allocate a large pool of objects up front and manually reference count them. Every high-performance Java application I've seen ends up doing this. But isn't that an argument for reference counting?
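The pattern described above can be sketched in a few lines (this is a hypothetical illustration, not code from any of the applications mentioned): everything is allocated once up front, and manual acquire/retain/release calls recycle slots instead of creating garbage.

```python
# Hypothetical sketch of "allocate a pool up front and manually reference
# count": the GC never sees allocation churn on the hot path.
class PooledObject:
    __slots__ = ("refs", "data")
    def __init__(self):
        self.refs = 0
        self.data = None

class Pool:
    def __init__(self, size):
        # All allocation happens here, once.
        self.free = [PooledObject() for _ in range(size)]

    def acquire(self):
        obj = self.free.pop()        # raises IndexError when exhausted
        obj.refs = 1
        return obj

    def retain(self, obj):
        obj.refs += 1

    def release(self, obj):
        obj.refs -= 1
        if obj.refs == 0:
            obj.data = None          # reset state before reuse
            self.free.append(obj)

pool = Pool(2)
a = pool.acquire()
pool.retain(a)                       # two logical owners now
pool.release(a)
assert len(pool.free) == 1           # still held by one owner
pool.release(a)
assert len(pool.free) == 2           # count hit zero: slot returned
```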

kgeist

>Yeah, you allocate a large pool of objects up front and manually reference count them. Every high-performance Java application I've seen ends up doing this

Not sure if it's still relevant, though.

One popular physics library years ago went as far as instrumenting compiled bytecode to turn all vector/matrix allocations to fetching preallocated objects from a pool, because a simple math operation could allocate tens/hundreds of vector/matrix objects and GC was slow, but then in newer versions they removed it because Java's GC became fast enough.

pjmlp

It is an argument that Java can do arenas like some other languages, while providing the productivity of a tracing GC for the rest of the non critical path code.

PaulHoule

Generally Java has made a huge investment in the garbage collector over a long period of time to address the problems that people have in some use cases. JDK 17 is much better than JDK 8. If you were writing a GC from scratch, you would not do as well.

hinkley

To be fair, they definitely got into a rut in the JDK 6-7 timeframe. I maintain it's no accident that memcache came into its own during this period. That was a major pain point, and going out-of-process shouldn't have been necessary.

marginalia_nu

There are several exchanges and clearing platforms running Java[1], although I'm not sure how many are still around after Nasdaq's hostile takeover.

[1] https://www.marketswiki.com/wiki/TRADExpress

kaba0

The JVM GC’s are absolutely insanely good. G1 can sustain loads with heap sizes well into TERAbytes.

viktorcode

As I heard, those guys write allocation-free Java code in critical paths. Nothing allocated, nothing to collect.

viktorcode

> it's the myth that garbage collection always automatically means your program is going to be spending 80% of its time doing it and freezing for a second every five seconds the instant you use a language with garbage collection.

Not 80%, but still annoying enough to dump it: https://discord.com/blog/why-discord-is-switching-from-go-to...

pjmlp

Magpie developers would use any excuse to move on.

It is less boring than building up the skills to fix the plane in mid-flight.

https://github.com/usbarmory/usbarmory/wiki

scruple

> Magpie developers

That was a new one for me. I sort of understood the reference but the Jeff Atwood post [0] where it's introduced is as relevant today as it was when this was published in 2008.

[0]: https://blog.codinghorror.com/the-magpie-developer/

flohofwoe

Such a claim really needs hard data to back it up. Reference counting can be very expensive, especially if the refcount update is an atomic operation. It's hard to capture in profiling tools because the performance overhead is smeared all over the code base instead of centralized in a few hot spots, so most of the time you don't actually know how much performance you're losing because of refcounting overhead.

The most performant approach is still manual memory management with specialized allocators tuned for specific situations, and then still only use memory allocation when actually needed.

okennedy

This. Exactly this.

Garbage collection has a huge, and generally entirely unappreciated win when it comes to threaded code. As with most things, there are tradeoffs, but every reference counting implementation that I've used has turned any concurrent access to shared memory into a huge bottleneck.

arcticbull

> The most performant approach is still manual memory management with specialized allocators tuned for specific situations, and then still only use memory allocation when actually needed.

RAII gets you a lot of the way there.

jsnell

> Basically, you attach the reference to the object graph once, and then free it when you're done with it.

So reference counting works by the programmer knowing the lifetime of each object allowing them to only increment / decrement the refcount once, and trusting that the raw uncounted pointers they use elsewhere are always valid? There's another word we have for this: manual memory management. It's unsafe and unergonomic, and it's pretty telling that the author needs this pattern to make RC appear competitive. It's because actually doing reference counting safely is really expensive.

> If GC is so good, why wouldn't Python just garbage collect everything, which they already did once and could trivially do, instead of going through the hassle of implementing reference counting for everything but the one case I mentioned?

Because they've made reference counting a part of their C extension API and ABI. If they wanted to use a GC, they'd instead need a very different API, and then migrate all the modules to the new API. (I.e. a way for those native extension to register/unregister memory addresses containing pointers to Python objects for the GC to see.)

Early on the deterministic deallocation given by reference counting would also have been treated by programmers as a language level feature, making it so that a migration would have broken working code. But I don't think that was ever actually guaranteed in the language spec, and anyway this was not carried over to various alternative Python implementations.

eru

Yes.

In Python reference counting precedes tracing garbage collection.

So they didn't 'go through the hassle of implementing reference counting' after they already had tracing garbage collection. Instead they went through the hassle of implementing tracing garbage collection after they already had reference counting.

(And as you say for backwards compatibility reasons, they can't get rid of reference counting.)

int_19h

It should be noted that the Python language spec explicitly says that refcounting, and the resulting deterministic cleanup, is an implementation detail of CPython, and not part of the Python language proper. Precisely so that other implementations like Jython or IronPython could just use GC without refcounting.

https://docs.python.org/3/reference/datamodel.html#objects-v...

So, idiomatic Python does not rely on this, and uses with-statements for deterministic cleanup.

But, of course, as with any language, there's plenty of non-idiomatic Python out in the wild.
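A minimal example of the idiomatic approach mentioned above: cleanup is tied to a with-block rather than to refcounting, so it happens at a known point on any implementation (CPython, PyPy, Jython, ...).

```python
from contextlib import contextmanager

# Deterministic cleanup via a context manager instead of relying on
# refcount-triggered finalization.
events = []

@contextmanager
def resource(name):
    events.append("open " + name)
    try:
        yield name
    finally:
        events.append("close " + name)  # runs at block exit, always

with resource("db") as r:
    events.append("use " + r)

assert events == ["open db", "use db", "close db"]
```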

eru

Yes, I know that I was referring to a detail of CPython when I said 'Python' in general. Alas, in practice Python really is CPython, more or less.

The biggest problem is that reference counting is baked into how CPython's C-modules work.

twic

Details aside, using Python to make an argument about performance is a pretty bold move.

cakoose

While Python is not high performance overall, people have spent a ton of time optimizing the memory management. So while it's not definitive proof of reference counting being slow, it seems like a fair initial question.

viktorcode

> So reference counting works by the programmer knowing the lifetime of each object allowing them to only increment / decrement the refcount once, and trusting that the raw uncounted pointers they use elsewhere are always valid? There's another word we have for this: manual memory management.

Programmer, or compiler. In the latter case it is automatic reference counting, which I have never heard called "manual memory management".

yyyk

Reference counting is garbage collection, just a different strategy - and all these strategies tend to blur to the same methods eventually, eventually offering a latency-optimized GC or a throughput-optimized GC.

Swift is inferior here because it uses reference counting GC without much work towards mitigating its drawbacks like cycles (judging by some recent posts, some of its fans apparently aren't even aware RC has drawbacks), while more established GC languages had much more time to mitigate their GC drawbacks - e.g. Java's ZGC mitigates latency by being concurrent.
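The standard mitigation for the cycle drawback is the same idea as Swift's `weak`: break one edge of a would-be cycle with a weak reference. A quick CPython-specific sketch (the prompt destruction shown here relies on CPython's refcounting):

```python
import weakref

# Break a parent<->child cycle with a weak back edge, so plain refcounting
# can still reclaim both objects promptly.
collected = []

class Node:
    def __init__(self, name):
        self.name = name
        self.next = None             # strong forward edge
        self.prev = None             # weak back edge goes here
    def __del__(self):
        collected.append(self.name)

a = Node("a")
b = Node("b")
a.next = b
b.prev = weakref.ref(a)              # no strong cycle is formed

del a                                # refcount hits zero immediately
del b
assert collected == ["a", "b"]
```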

musicale

Maybe Apple just hires mediocre developers (I certainly have lots of complaints about their software and UI issues, but I would probably suspect management/priorities/schedules rather than the technical staff) but they implemented GC and Automatic Reference Counting for ObjC and found that the latter resulted in better and more consistent responsiveness, which is probably what most users care about in apps. (Apps still get compressed or paged out which can lead to annoying pauses.)

My anecdata indicate that Java apps are not as responsive as ObjC/Swift for the most part.

pjmlp

https://github.com/ixy-languages/ixy-languages

The real reason why a tracing GC was a failure in Objective-C was due to the interoperability with the underlying C semantics, where anything goes.

The implementation was never stable enough beyond toy examples.

Naturally automating the Cocoa release/retain calls made more sense, given the constraints.

In typical Apple fashion they pivoted into it, gave the algorithm a fancy name, and then in a you're holding it wrong style message, sold their plan B as the best way in the world to manage memory.

When Swift came around, having the need to easily interop with the Objective-C ecosystem naturally meant to keep the same approach, otherwise they would need the same machinery that .NET uses (RCW/CCW) to interop with COM AddRef/Release.

What Apple has is excellent marketing.

mamcx

> What Apple has is excellent marketing.

Whoa! Marketing? Interop with C is a MASSIVE use-case. And iOS is A LOT faster and more responsive than Android. That, right there, is proof that can't be denied.

I do both, and the speed of Apple's approach and the simplicity of bridging the C ABI are not a joke.

yyyk

ARC is a perfectly respectable choice by itself. What's annoying me is ignoring the improvements other GC strategies and other languages have made in the last 20 years. There's a certain section of Apple users that thinks that every choice Apple makes is the only good way to do things.

kaba0

And RC might be the correct choice for their devices (due to them not having much RAM, and latency being more important than throughput). But that is not true everywhere, I would even say that a good tracing GC is a better default.

KerrAvon

Your first paragraph is only true if malloc/free counts as GC. I have seen people try to claim that free() is just manual GC. You can say that if you want, but it renders the terminology meaningless.

Any true GC strategy (== one that collects cycles) will fundamentally touch and allocate more memory than malloc/free, whereas reference counting is pretty close to malloc/free performance; it doesn’t need to touch any memory not involved. At OS scale, that’s a huge performance advantage. You can scope down the memory involved in GC using advanced, modern GC techniques, but you’re still going to be behind malloc/free in overall efficiency — cache efficiency, memory maintenance overhead, and additional memory required for bookkeeping. And reference counting will be pretty darn close to malloc/free.

bruce343434

It puts that "extra memory" inside the very objects it tracks with a... reference count.

fingerlocks

What do you mean? Potential reference cycles are a compile time error in Swift. That’s the whole point of the @escaping annotation

yyyk

So this[0] wasn't current despite being edited half a year ago? I guess I'm guilty of the same mistake the post had: criticizing without understanding the state of the art on the other side (at least I didn't miss it by 20 years and on an entire subfield of Computer Science).

[0] https://stackoverflow.com/questions/32262172/how-can-identif...

jumhyn

No, the SO post is accurate—it’s trivially easy to create strong reference cycles in Swift and @escaping annotation is only of limited use in detecting strong reference cycles. Though Swift also broadly pushes towards the use of value types for which creating reference cycles of any kind is impossible since they’re values, not references!

viktorcode

Putting both under the same term is meaningless.

Reference counting and Garbage Collection have very clear difference: when the referenced objects are destroyed (not deallocated). In RC it happens when the count reaches zero. In GC it happens some time later. That difference is crucial for having or not having deterministic performance in your program.
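A tiny CPython demonstration of exactly this difference: under refcounting, destruction happens at the precise moment the count reaches zero, not "some time later" (this behavior is a CPython implementation detail, as noted elsewhere in the thread).

```python
# With refcounting, the destructor runs synchronously when the last
# reference is dropped.
destroyed = []

class Resource:
    def __del__(self):
        destroyed.append("freed")

r = Resource()
alias = r                            # refcount is now 2
del r
assert destroyed == []               # still one reference left
del alias
assert destroyed == ["freed"]        # count hit zero: destroyed right here
```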

brabel

Is the Pony language's GC a GC then? It runs at determinate times, namely when a behaviour (an actor's receive function, basically) finishes running... and because actors cannot share mutable state, and immutable values with more than one owner can be handled easily by a simple common parent actor, each actor's GC run is completely independent of the others.

Things are rarely as clear cut as we would want to believe.

viktorcode

> Is the Pony language's GC a GC then? It runs at determinate times, namely when a behaviour (an actor's receive function, basically) finishes running...

It is. It might run when a behaviour finishes running, according to their docs.

Animats

> In GC it happens some time later.

Yes. In languages with destructors/finalizers called from the garbage collector, things can get very complicated. C# and Java have this problem.

Go avoids it by having scope-based "defer" rather than destructors.

pjmlp

Java has scope based defer since version 7, while C# has it since version 4, with the improvement on version 8 to be syntactically like defer.

With a bit of generics and lambdas, they can be made to even be more defer like.

smasher164

For how strongly worded this article is, you'd think the author would provide some substance in their reasoning. Reference counting, even atomic, is quite expensive. Not only because it can invalidate the cache line, but depending on the architecture (looking at you x86), the memory model will deter reordering of instructions. On top of this, reference counting has a cascading effect, where one destructor causes another destructor to run, and so on. This chain of destructor calls is more or less comparable to a GC pause.

deterministic

Not my experience at all. I am maintaining a very high performance JIT compiler for a Haskell like programming language used in production at large enterprises around the world. So I am used to very carefully analyse performance. And reference counting is never the bottleneck. You might be right in theory but not in practice.

smasher164

Not sure about the characteristics of your workload, but here's a talk where a group wrote device drivers in several high-level languages, and measured their performance: https://media.ccc.de/v/35c3-9670-safe_and_secure_drivers_in_...

They found that the Swift version spent 76% of the time doing reference counting, even slower than Go, which spent 0.5% in the garbage collector.

bitcharmer

Yeah, GP has no idea what they are talking about. They replied something similar on another thread without ever providing evidence for their claims. They even claimed their RC implementation in Haskell is faster than hand-crafted C++.

Gave me a chuckle.

deterministic

All that tells me is that the Swift implementation is different from the one I implemented.

kaba0

Well, it won’t be the bottleneck itself, but it has an overhead on basically every operation, which likely won’t show up during profiling.

Also, I fail to see the advantage of RC in case of a presumably mostly immutable language - a tracing GC is even faster there due to no changes to the object graph after allocation, making a generational approach scale very well and be almost completely done in parallel.

mbrodersen

Nope. The compiler does a full program optimisation, reducing the reference counting to an absolute minimum. There is close to zero overhead passing data around. It does not work like C++ shared_ptr; shared_ptr is slow.

kgeist

I wonder how Haskell's purity influences RC usage patterns. Are there tricks which aren't possible in an imperative language?

tomsmeding

Haskell specifically is a poor choice of language here, because it creates cycles like the plague. (This is because it uses lazy evaluation, and programming patterns (design patterns?) using lazy evaluation tend to use cyclical references. Strict FP languages might support your point better, but then OCaml again doesn't work because it mutates like the plague.)

Furthermore, it also allocates like the plague: busy Haskell programs regularly allocate on the order of 1GB/sec. However, this works out fine because the majority of those allocations are short-lived, hence become dead quickly, hence a GC that is designed to only touch live data (like Haskell's GC!) will handle that well.

cmroanirgo

One reason you'd choose ref counting is because of its deterministic behavior, whereas you lose that granularity with gc, even if you did a gc cleanup.

I see great reasons for both systems being useful, but both systems also bring their own warts.

Yes, ref counting affects cache and branch prediction, but gc is a whole subsystem running in parallel with your main code, constantly cleaning up after you. Which approach is best will always depend on the application.

Some languages lean more heavily one way than the other, too. Scripting with ref counting would be a nightmare, as would running a garbage collector on an 8-bit micro. Since the article's talking C & C++, then of course a pro ref counting stance makes sense.

kgeist

>One reason you'd choose ref counting is because it's deterministic behavior

Not sure if it's entirely deterministic. A variable going out of scope can trigger deallocation of a large object graph, and it's not always clear by just looking at the code what will happen (especially if objects have destructors with side effects, your object graph is highly mutable, and your code is on a hot path). A common trick is to delay deallocation to a later time, but then again you can't be sure when your destructors will run. Another issue is cycles: if your RC system has cycle detection, your program will behave differently depending on whether a cycle formed at runtime or not.
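The cascade effect is easy to see in CPython (the exact order shown relies on CPython's synchronous refcount-driven deallocation): dropping a single reference to the head of a chain runs the whole chain of destructors then and there.

```python
# One decrement on the head triggers five deallocations, in order.
freed = []

class Link:
    def __init__(self, n, nxt=None):
        self.n, self.next = n, nxt
    def __del__(self):
        freed.append(self.n)

head = None
for i in range(5):
    head = Link(i, head)             # head -> 4 -> 3 -> 2 -> 1 -> 0

del head                             # cascades through the whole chain
assert freed == [4, 3, 2, 1, 0]
```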

_gabe_

> trigger deallocation of a large object graph and it's not always clear by just looking at a code what will happen

If you can't understand what's happening when an object gets freed, it may be a sign that your code is too tightly coupled and/or becoming spaghetti. I've found that the more graph-like my data structures become, the more inadvertent complexity I'm adding. That's the whole reason we talk about data normalization: to avoid these types of couplings.

viktorcode

Deterministic means that running the same part of the program will take the same amount of time. Deallocating the same memory object graph is pretty much deterministic.

GC can throw a spanner in that by deciding that now is the time to do its thing.

eru

> Scripting with ref counting would be a nightmare, [...]

Why? Old version of Python used ref counting only, and Python still largely relies on reference counting (but has a GC to detect cycles).
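This hybrid is easy to demonstrate in CPython: refcounting alone can never reclaim a cycle, and the `gc` module's tracing collector exists precisely to catch what refcounting misses.

```python
import gc
import weakref

# Refcounting can't free a self-cycle; the cycle collector can.
gc.disable()                         # keep the demonstration deterministic

class Node:
    pass

a = Node()
a.self_ref = a                       # a references itself
probe = weakref.ref(a)

del a
assert probe() is not None           # still alive: the cycle holds a ref

gc.collect()                         # the tracing cycle detector reclaims it
assert probe() is None

gc.enable()
```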

bjourne

Time to toot my own horn. I made a project comparing different types of garbage collectors (I still prefer the original terminology; both ref-counting and tracing garbage collection collect garbage, so they are both garbage collectors) a few years ago: https://github.com/bjourne/c-examples

Run ./waf configure build && ./build/tests/collectors/collectors and it will spit out benchmark results. On my machine (Phenom II X6 1090), they are as follows:

    Copying Collector                                8.9
    Reference Counting Collector                    21.9
    Cycle-collecting Reference Counting Collector   28.7
    Mark & Sweep Collector                          10.1
    Mark & Sweep (separate mark bits) Collector      9.6
    Optimized Copying Collector                      9.0
I.e. for total runtime it is not even close; tracing gc blows ref-counting out of the water. Other metrics such as number of pauses and maximum pause times may still tip the balance in favor of ref-counting, but those are much harder to measure. Though note the abysmal runtime of the cycle-collecting ref-counter. It suggests that cycle collection could introduce the exact same pause times ref-counting was supposed to eliminate. This is because in practice cycles are very difficult to track and collect efficiently.

In any case, it clearly is about trade-offs; claiming tracing gc always beats ref-counting gc or vice versa is naive.

staticassertion

This is cool, thank you. Note that my comments are those of a layman, I don't consider myself an expert on these topics, but this gave me some thoughts. Happy to learn more, would love links to blogs/ papers where I can read more.

I would not be surprised to find that even a naive mark and sweep collector is faster than naive refcounting on some workloads. One obvious thing to consider is that the work is delayed, you can perform the sweeping 'as needed'. Even the marking doesn't have to run on any deterministic schedule.

The thing is that, from my naive perspective, run of the mill tracing collector algorithms are just way more advanced than your typical refcount. Most refcounting is just that - either an integer, atomic integer, or both, that gets incremented and decremented based on a number of operations applied to the underlying type. The naive approach has no delays.

Tracing GCs on the other hand, although perhaps not naive ones (could you link me info on the quickfit algorithm? I can not find anything online), might contain epochs that bump allocate in the majority of cases. That'll be particularly nice for benchmarks where allocations are likely very short lived and may actually never need to get to the mark/sweep phase. Your algorithm isn't really documented and I just really don't feel like looking at C right now.

Although naive refcounting is very common it's not the only game in town. Depending on the language you can group refcounts together - for example, imagine you have:

(assuming all fields are automatically refcounted)

    struct Foo {
        bar: Bar,
        baz: Baz,
    }

In theory, a "copy" of this type would involve 3 increments, possibly atomic increments. Each increment would also require a heap pointer dereference, and there would be no locality of those integers behind the pointers. That would be the trivial implementation.

But depending on the language you could actually flatten all of those down to 1 RC. This is language dependent, and it requires understanding how these values can be moved, referenced, etc, at compile time. You could also store all reference counts in tables associated with structures, such that you have locality when you want to read/write to multiple counters. The pointer dereference is going to be brutal so having locality there will be a nice win. I'd be curious to run your benchmarks through valgrind to see how much the refcount is just spending time on memory fetches that get invalidated in the cache immediately.

Anyway, an example of a pretty slick refcounting GC is what Pony built: https://tutorial.ponylang.io/appendices/garbage-collection.h... https://www.ponylang.io/media/papers/OGC.pdf

Pony has different types for: 1. Local, Immutable 2. Local, Mutable 3. Shared, Immutable 4. Shared, Mutable

You can read the paper where they discuss how they track local variables vs shared variables, the implementation of counter tables, etc.

So I guess to summarize:

1. The results make sense, or as much sense as anything. I'd be interested in more details on the algorithms involved and your benchmark methodology.

2. "Naive" tracing GCs are actually pretty advanced, and advanced refcount implementations are pretty scarce.

bjourne

Let me preface by stating that I'm no expert. I created the repo a few years ago when I was self-studying gc. Then I realized the rabbit hole was much deeper than I thought and retreated. The book I read is The Garbage Collection Handbook. Quite expensive but definitely worth its price.

Throughput-wise, it's hard to beat naive tracing gc. The algorithms are just too simple and they don't "interfere" with "normal operations" like ref counting does. Assuming the same allocation pattern (i.e no cheating by stack allocating objects), a tracing gc would likely (again, throughput-wise) beat manual memory management too. The additional benefit tracing gives you is easy heap compaction. Thus future pointer-chasing and memory allocations will be more efficient. With ref counting, compaction is harder.

True, you could delay sweeping, but ime, marking time dominates so you don't gain much. Even with a huge heap of several gigabytes, sweeping is just a linear scan from lowest to highest address.

Quick fit is a memory allocator, see: http://www.flounder.com/memory_allocation.htm Most gcs do not keep the heap contiguous so you need it in a layer below the gc. Quick fit is the algorithm almost everyone uses and it is very good for allocating many small objects of fixed sizes (8, 16, 32, etc.). It could be swapped out with malloc/free pairs instead, at the price of some performance.

I have to disagree with naive tracing being advanced. My mark & sweep implementation is only about 50 lines and that includes comments: https://github.com/bjourne/c-examples/blob/master/libraries/... A copying collector isn't much more complicated. Neither is beyond the reach of most comp sci students. Yes, optimized multi-generational tracing collectors supporting concurrent and resumable tracing are very complicated. But the same is true of optimized ref counting schemes. :)

Pony looks very interesting. It looks like it is supposed to have less object churn than very dynamic languages like JavaScript which probably makes ref counting very suitable for it.

nemothekid

It's my theory that Java, unintentionally, did a lot of damage to PL research. I write a lot of Rust, and while the borrow checker is great, I've come to really admire the work that was put into the Go GC, even if it's not as fast as Java's.

There is a whole generation of programmers that have come to equate GC with Java's 10 second pauses, or generics/typed variables with Java's implementation of them. Even the return to type systems (Sorbet, Python's typing, TypeScript) could be read as: typed languages are great; what we really hated was Java's verbose semantics.

RandomBK

I feel that every time we talk about Java, we should clarify between Java the language, Java the programming style, and Java the JVM.

As a language, Java's not too bad. It's a bit wordy, in bad need of some syntax sugar, but it's designed to be fairly straightforward and for the most part it does its job well. I don't need a degree in language theory to get started writing it.

Java the programming style, particularly enterprise, is a horrendous over-engineered mess that schools jam down the throats of students who don't know any better. It's designed to (and fails to) enforce a common style that can be written by armies of mediocre developers plodding along inside giant enterprise codebases, so that no matter who wrote the code, some other developer in another department can figure out how to call it.

Java the JVM is a pretty nifty beast. It's made tradeoffs that means it's not always suited for every use case, but put in its element it really shines. The modern GC algos give developers options based on the program's needs. It's currently struggling to overcome some historical decisions that while good back in the old days are now holding it back.

Personally I'm very biased towards Kotlin, which gives me the benefit of the JVM without the barf that is Java. It's not the fastest-executing language out there, but for me it's a perfect balance between development speed, ecosystem of battle-tested libraries, and competitive execution speed.

forrestthewoods

> There is a whole generation of programmers that have come to equate GC with Java's 10 second pauses

Anyone who has ever shipped a C# Unity game knows the pain that is the garbage collector. It’s effectively impossible to avoid frame hitches with the GC.

I’ve spent a LOT of time going way out of my way to avoid any and all garbage collects. Which somewhat defeats the purpose of using a GC-based language.

I definitely wouldn’t say “GC used to be bad but now it’s good”. That tale has been spun for 30+ years at this point!

nemothekid

I don't know much about the C# garbage collector; and it's likely that garbage collectors are a bad fit for programs that have hard deadlines.

That said, it could also be a function of the same "problem" Java has in its design - Java by default boxes everything and so every memory allocation increases garbage collection pressure. Go, by using escape analysis and favoring stack allocations, doesn't have this problem and has had sub-millisecond pauses for years now.

You rarely hear people complain about Go's GC despite it being much less mature than the JVM's. But due to actual language design, Go's GC does much less work. I wouldn't say “GC used to be bad but now it’s good”, but rather that "the designs of languages like C# and Java were too dependent on using the GC for memory allocations, and there are other GC languages out there that use the GC less, which can lead to greater performance"
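To make the boxing point concrete, here's a minimal Java sketch (class and method names are mine, purely for illustration) contrasting a boxed collection, where every element is a separate heap object the GC must track, with a primitive array, which is a single allocation:

```java
import java.util.ArrayList;
import java.util.List;

public class BoxingPressure {
    // Sums via a List<Integer>: each add() autoboxes, allocating one heap
    // object per element (outside the small Integer cache of -128..127),
    // all of which becomes garbage for the collector to reclaim.
    static long sumBoxed(int n) {
        List<Integer> xs = new ArrayList<>();
        for (int i = 0; i < n; i++) xs.add(i); // Integer.valueOf(i) allocates for i >= 128
        long sum = 0;
        for (Integer x : xs) sum += x;
        return sum;
    }

    // Same computation on a primitive array: one allocation total,
    // no per-element garbage, so the GC has almost nothing to do.
    static long sumPrimitive(int n) {
        int[] xs = new int[n];
        for (int i = 0; i < n; i++) xs[i] = i;
        long sum = 0;
        for (int x : xs) sum += x;
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumBoxed(1_000_000));
        System.out.println(sumPrimitive(1_000_000));
    }
}
```

Both compute the same sum; the difference is purely in how many objects the collector has to manage.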

rdw

You're right, and it's got more layers than that. C# does have value types, which are not boxed, and using them judiciously can avoid garbage. However, they are a more recent addition to the language (which started as a lame Java clone), and so the standard library tends to not know about them. Really trivial operations will allocate hundreds of bytes of garbage for no good reason. Example: iterating over a Dictionary. Or, IIRC, getting the current time. They've been cleaning up these functions to not create garbage over time, of course, but it's never fast enough for my taste and leads to some truly awful workarounds.

al_mandi

The JVM also does escape analysis. Java value types are in the works as well.

pjmlp

Don't mix Unity's implementation, an ageing one, with the language.

ElectricalUnion

> what we really hated was Java's verbose semantics

I believe it's actually the opposite - Java has pretty simple, compact and well defined semantics. Too simple and compact for comfort - a little syntactic sugar would have made the language a lot less verbose.

ncmncm

Java's fundamental problems are well beyond reach of any syntactic sugar.

becquerel

Please explain these fundamental problems.

al_mandi

Java (the JVM) doesn't really have 10 second pauses anymore. G1GC and ZGC and Shenandoah have been a thing for a while now.

pkolaczk

G1 can do 10 second pauses. I observed them many, many times. It tries not to, but there are no guarantees.

al_mandi

Have you tried ZGC or Shenandoah?

chrisseaton

> P&L research

What’s P&L?

> Java's verbose semantics

Does Java have verbose semantics? I think Java’s semantics are pretty neat and concise. Where’s the verbosity?

mandevil

Not OP, but someone who has gotten paid to write Java for several years.

I would say it isn't that Java's semantics are verbose, it's that the way Java is traditionally written, every line is actually 3 lines on your screen:

    public ItalicTextBox makeItalicTextBox(String actualTextIWantToBeItalic) {
        ItalicTextBox itb = italicTextBoxFactoryGenerator
                .generateFactory()
                .buildItalicTextBox(actualTextIWantToBeItalic);
        return itb;
    }

I think this is actually the pernicious work of Martin's _Clean Code_ which trained a whole generation of Java coders to write nonsense one line functions with a million interior function calls like this, not anything forced by the Java language itself, but it makes for really ugly code, in my exceptionally humble but undoubtedly correct opinion.

musicale

The builder pattern with method chaining is unfortunate and should usually be replaced by named/optional parameters, in any sensible/modern programming language at least.

IDEs are an enabler for horriblyLongIdentifierNames because they enter them for you automatically without requiring you to type them on a keyboard.

thirdmunky

Java idioms are one contributing factor. There's some inherent wordiness too - for example, Hello World in Java is wordier than in most other programming languages.

coding123

This is one of the reasons I left Java after 30 years. For those that wonder my day to day is now python and typescript.

nemothekid

>What’s P&L?

I meant PL research as in programming language research. Not sure why my brain decided to put an & there.

>Does Java have verbose semantics? I think Java’s semantics are pretty neat and concise. Where’s the verbosity?

`FizzBuzzEnterpriseEdition` is a meme for a reason. That said, I'm sure Java idioms written today are a lot more sane than they were around the time everyone decided to rewrite everything in Ruby. When I say Java here, I'm talking about an era of Java that probably no longer exists (the JVM also doesn't have 10 second pauses today, and its garbage collectors are widely considered among the best ever built).

ncmncm

"Profit and loss"

lumost

It’s gotten better over the years. Java originally did not have generics, then there was the factory/uml everything crowd. Recently modern Java has been evolving towards ergonomics with streams/loom/guice/Lombok etc.

However we still don’t have something like auto/let from cpp/rust.

brabel

Typical criticism of Java: outdated by several years...

Java has had "var" since Java 10, released in 2018 (current version is 18, with the version increasing every 6 months, so you're talking about Java from 4 years ago).

With new features every 6 months, yes, 4 years ago is an eternity in the Java world these days... 4 years in the future, Java will almost certainly have virtual threads (like Go's goroutines), full support for pattern matching on records (ADTs like OCaml's), true value types (Project Valhalla) and other features that leave many other languages behind Java in many aspects.
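For anyone who missed it, a minimal sketch of `var` in action (local-variable type inference, JEP 286; the class and method names here are mine, just for illustration):

```java
import java.util.ArrayList;
import java.util.Map;

public class VarDemo {
    static String describeFirst() {
        // The compiler infers each static type from the initializer,
        // much like auto in C++; the variables remain statically typed.
        var list = new ArrayList<Map.Entry<String, Integer>>();
        list.add(Map.entry("answer", 42));
        var first = list.get(0); // inferred as Map.Entry<String, Integer>
        return first.getKey() + "=" + first.getValue();
    }

    public static void main(String[] args) {
        System.out.println(describeFirst()); // answer=42
        // Note: var works only for local variables with initializers --
        // not fields, method parameters, or return types.
    }
}
```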

chrisseaton

> However we still don’t have something like auto/let from cpp/rust.

But things like auto are more verbose semantics, not less!

And isn’t var in Java like auto in C++?

slavboj

People have been managing garbage collection schedules for decades now. It's quite possible for many systems to have completely deterministic performance, with the allocation/deallocation performance made extremely fast, gc restricted to certain times or a known constant overhead, etc. Ironically, from a programming perspective it's incredibly easy in a language like Java to see exactly what allocates and bound those cases.
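For example, the standard trick of preallocating buffers at startup and reusing them bounds allocation to initialization, so the steady-state loop produces no garbage at all (a hypothetical sketch; the class and numbers are mine):

```java
public class ZeroAllocLoop {
    // Scratch buffer allocated once, up front. After warm-up, process()
    // allocates nothing, so the GC never sees new garbage from this path --
    // and in Java it's easy to audit that nothing else in the loop allocates.
    private final double[] scratch = new double[1024];

    double process(double[] input) {
        int n = Math.min(input.length, scratch.length);
        for (int i = 0; i < n; i++) scratch[i] = input[i] * 2.0; // reuse, don't allocate
        double sum = 0;
        for (int i = 0; i < n; i++) sum += scratch[i];
        return sum;
    }

    public static void main(String[] args) {
        ZeroAllocLoop loop = new ZeroAllocLoop();
        System.out.println(loop.process(new double[] {1.0, 2.0, 3.0}));
    }
}
```

Game engines and low-latency Java shops (object pools, preallocated arenas) have run on exactly this pattern for years.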

Conversely, it's also possible for reference counting to have perverse performance cases over a truly arbitrary reference graph with frequent increments and decrements. You're not just doing atomic inc/dec, you're traversing an arbitrary number of pointers on every reference update, and it can be remarkably difficult to avoid de/allocations in something like Python where there's not really a builtin notion of a primitive non-object type.

Generally speaking, memory de/allocation patterns are the issue, not the specific choice of reference counting vs gc.

imtringued

Not only does the author ignore the huge progress in conventional garbage collected languages like Java, he also dismisses GC as inherently flawed despite the fact that the common strategy of only having one heap per application has nothing to do with garbage collection. In Pony each actor has its own isolated heap which means the garbage collector will only interrupt a tiny portion of the program for a much shorter period of time. Hence the concept of a stop the world pause is orthogonal to whether you have a GC or not. One could build a stop the world pause into an RC system through cycle detection if desired.

Waterluvian

I’m out of my league so this may be dumb, but: does any language or VM have a combined system where each thread has its own heap and you talk by passing messages, but there's also a common heap for larger data that’s too expensive to pass around, at the cost that you have to be much more careful with lifetimes, or manage it manually, or something?

tkhattra

In the Erlang VM, each Erlang "process" has its own garbage collected heap [1]. There's also an ETS module for more efficiently storing and sharing large amounts of data, but data stored in ETS is not automatically garbage collected [2].

[1] https://www.erlang.org/doc/apps/erts/garbagecollection [2] https://www.erlang.org/doc/man/ets.html

jefftk

If you squint at it right you could say C works that way: you have many processes each with their own heap and they can pass messages, but if they want larger data that's too expensive to pass around you can use shared memory.

kaba0

That’s just IPC, nothing inherent to C.

Jweb_Guru

I believe Nim works this way, among others.

jeffmurphy

I believe Erlang works this way.

Bolkan

Dart

viktorcode

> he also dismisses GC as inherently flawed

It's a compromise, on memory consumption and performance. Modern GCs are minimising the impact of those factors, but they still remain a part of the design.

RC is a performance compromise.

ridiculous_fish

There's a lot of discussion of comparative performance, but most software isn't performance sensitive so it just doesn't matter. But there's another major facet: FFIs. The choice of memory management has huge implications for how you structure your FFI.

JavaScriptCore uses a conservative GC: the C stack is scanned, and any word which points at a heap object will act as a root. v8 is different, it uses a moving collector: references to heap objects are held behind a double-redirection so the GC may move them. Both collectors are highly tuned and extremely fast, but their FFIs look very different because of their choice of memory management.

Read and write barriers also come into play. If your GC strategy requires that reads/writes go through a barrier, then this affects your FFI. This is part of what sunk Apple's ObjC GC effort: there was just a lot of C/C++ code which manipulated references which was subtly broken under GC; the "rules" for the FFI became overbearing.

Java's JNI also illustrates this. See the restrictions around e.g. GetPrimitiveArrayCritical. It's hard to know if you're doing the right thing, especially since bugs may only manifest when the GC runs, which it might not in your tests.

One of the under-appreciated virtues of RC is the interoperability ease. I know std::sort only rearranges, doesn't add or remove references, so I can just call it. But if my host language has a GC then std::sort may mess up the card marking and cause a live object to be prematurely collected; but it's hard to know for sure!

chubot

I agree that the API and interop with C/C++ is a huge issue and something I haven't seen good articles on.

But I was sort of put off from reference counting by working with Python extensions that leaked memory many years ago. It's so easy to forget a ref count operation. I don't have data, but I suspect it happens a lot in practice.

With tracing, you have to annotate stack roots (and global roots if you have them). To me that seems less error prone. You can overapproximate them and it doesn't really change much.

Moving is indeed a big pain, and I'm about to back out of it for Oil :-/

----

edit: I don't have any experience with Objective C, but I also think this comment is unsurprising, and honestly I would probably get it wrong too:

https://news.ycombinator.com/item?id=32283641

I feel like ref counting is more "littered all over your code" than GC is, which means there's more opportunity to get it wrong.

kaba0

It’s a good point that it is a topic that doesn’t get enough coverage, but let’s just add that it has good solutions for most use cases: GCs can pin objects that might be used from some other language.

knome

Reference counting can also have unpredictable hits if you release any large data structures. Whoever drops the last reference suddenly gets to sit through the entire deep set of items to release ( unless you can hand off the release cascade to a background thread ).

I've never heard of a reference counting implementation that can handle memory compaction.

Every time you update a reference count, which is every time you touch any object, you have to write to that memory, which means stealing the cache line from any other cores using it. If you share large trees of data between threads, traversing that tree from different threads will have them constantly fighting each other, since there's no such thing as read-only memory under reference counting.

When releasing something like a huge list in reference counting, how does the release avoid blowing the stack with recursive releasing? My guess is this just a "don't use a large list whose release may blow the stack with recursive releasing" situation.
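Pretty much: the usual fix is to replace the recursion with an explicit worklist, so stack depth stays constant no matter how deep the structure is. A sketch of the idea in Java (the Node/refCount model is hypothetical, just simulating what an RC runtime does internally; Java itself is traced):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class IterativeRelease {
    // Hypothetical refcounted list node: each node here has exactly one
    // incoming reference (from the head variable or its predecessor).
    static final class Node {
        int refCount = 1;
        final Node next;
        Node(Node next) { this.next = next; }
    }

    static int freed = 0; // counts nodes "freed", for illustration

    // A naive recursive release (calling release(n.next) inside release(n))
    // recurses once per element and can blow the stack on a long list.
    // This version pushes children onto an explicit worklist instead,
    // so the call stack never grows.
    static void release(Node node) {
        Deque<Node> worklist = new ArrayDeque<>();
        worklist.push(node);
        while (!worklist.isEmpty()) {
            Node n = worklist.pop();
            if (--n.refCount == 0) {
                if (n.next != null) worklist.push(n.next);
                freed++; // stands in for free(n) in a real RC runtime
            }
        }
    }

    public static void main(String[] args) {
        // A million-element list: recursive release would overflow here.
        Node head = null;
        for (int i = 0; i < 1_000_000; i++) head = new Node(head);
        release(head);
        System.out.println(freed); // 1000000
    }
}
```

The worklist costs heap instead of stack, which is the other common complaint: you may need memory at exactly the moment you are trying to give memory back.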

eru

> I've never heard of a reference counting implementation that can handle memory compaction.

It's possible to add that in theory. But if you are tracing all your memory anyway so you can compact it, you typically might as well collect the garbage, while you are at it.

But: you are in for a treat, someone implemented compaction for malloc/free. See https://github.com/plasma-umass/Mesh

They use virtual memory machinery as the necessary indirection to implement compaction, without changing any pointers and without needing to reliably distinguish pointers from integers.

mamcx

> hits if you release any large data structures.

Well, that depends on how the RC is done. This is key to understand, because if you can control it, the RC becomes cheaper.

You can see this way on

http://sblom.github.io/openj-core/iojNoun.htm

ie: If instead of `[Rc(1), Rc(2)]` you do `Rc([1, 2])`, that works great.
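A rough Java model of that difference (the Rc wrapper is hypothetical, purely to make the counts visible; a real RC runtime would do this in the language implementation):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RcGranularity {
    // Minimal hand-rolled refcount wrapper, just to model the two layouts.
    static final class Rc<T> {
        final T value;
        final AtomicInteger count = new AtomicInteger(1);
        Rc(T value) { this.value = value; }
        Rc<T> retain() { count.incrementAndGet(); return this; }
        void release() { count.decrementAndGet(); }
    }

    public static void main(String[] args) {
        // Fine-grained, [Rc(1), Rc(2)]: one counter per element, so sharing
        // the collection costs one atomic write per element.
        @SuppressWarnings("unchecked")
        Rc<Integer>[] fine = new Rc[] { new Rc<>(1), new Rc<>(2) };
        for (Rc<Integer> r : fine) r.retain();

        // Coarse-grained, Rc([1, 2]): one counter for the whole array, so
        // sharing it costs a single atomic write regardless of length --
        // which is why array languages count whole arrays, not elements.
        Rc<int[]> coarse = new Rc<>(new int[] {1, 2});
        coarse.retain();
        System.out.println(coarse.count.get()); // 2
    }
}
```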

kaba0

How is that not the exact same for tracing GC?

mamcx

Well, you don't need the GC. That is the point of this: RC can be applied piecemeal. That is something that can be exploited very well when making an interpreter, for example (that is what array langs do under the hood)
