Interesting choice of technology, but you didn't completely convince me to why this is better than just using SQLite or PostgreSQL with a lagging replica. (You could probably start with either one and easily migrate to the other one if needed.)
In particular you've designed a very complicated system: Operationally you need an etcd cluster and a tailetc cluster. Code-wise you now have to maintain your own transaction-aware caching layer on top of etcd (https://github.com/tailscale/tailetc/blob/main/tailetc.go). That's quite a brave task considering how many databases fail at Jepsen. Have you tried running Jepsen tests on tailetc yourself? You also mentioned a secondary index system which I assume is built on top of tailetc again? How does that interact with tailetc?
Considering that high-availability was not a requirement and that the main problem with the previous solution was performance ("writes went from nearly a second (sometimes worse!) to milliseconds") it looks like a simple server with SQLite + some indexes could have gotten you quite far.
We don't really get the full overview from a short blog post like this though so maybe it turns out to be a great solution for you. The code quality itself looks great and it seems that you have thought about all of the hard problems.
> and a tailetc cluster
What do you mean by this part? tailetc is a library used by the client of etcd.
Running an etcd cluster is much easier than running an HA PostgreSQL or MySQL config. (I previously made LiveJournal and ran its massively sharded HA MySQL setup)
Neat. This is very similar to , which is _not_ a cache but rather a complete mirror of an Etcd keyspace. It does Key/Value decoding up front, into a user-defined & validated runtime type, and promises to never mutate an existing instance (instead decoding into a new instance upon revision change).
The typical workflow is do do all of your "reads" out of the keyspace, attempt to apply Etcd transactions, and (if needed) block until your keyspace has caught up such that you read your write -- or someone else's conflicting write.
Drat! I went looking for people doing something similar when I sat down to design our client, but did not find your package. That's a real pity, I would love to have collaborated on this.
I guess Go package discovery remains an unsolved problem.
Whoa, we hadn't seen that! At first glance it indeed appears to be perhaps exactly identical to what we did.
I wish pkg.dev had a signin and option to star/watch a package. I do this with GitHub repos I should revisit. Would have been handy for pkg.dev :) yes, I know - nobody wants yet another login
> What do you mean by this part? tailetc is a library used by the client of etcd.
Oh. Since they have a full cache of the database I thought it was intended to be used as a separate set of servers layered in front of etcd to lessen the read load. But you're actually using it directly? Interesting. What's the impact on memory usage and scalability? Are you not worried that this will not scale over time since all clients need to have all the data?
Well, we have exactly 1 client (our 1 control server process).
So architecturally it's:
3 or 5 etcd (forget what we last deployed) <--> 1 control process <--> every Tailscale client in the world
The "Future" section is about bumping "1 control process" to "N control processes" where N will be like 2 or max 5 perhaps.
The memory overhead isn't bad, as the "database" isn't big. Modern computers have tons of RAM.
> Running an etcd cluster is much easier than running an HA PostgreSQL or MySQL config.
What if you used one of the managed RDBMS services offered by the big cloud providers? BTW, if you don't mind sharing, where are you hosting the control plane?
> What if you used one of the managed RDBMS services offered by the big cloud providers?
We could (and likely would, despite the costs) but that doesn't address our testing requirements.
The control plane is on AWS.
We use 4 or 5 different cloud providers (Tailscale makes that much easier) but the most important bit is on AWS.
> > Running an etcd cluster is much easier than running an HA PostgreSQL or MySQL config.
> What if you used one of the managed RDBMS services offered by the big cloud providers?
Yeah, AWS RDS "multi-AZ" does a good job of taking care of HA for you. (Google Cloud SQL's HA setup is extremely similar.) But you still get 1-2 minutes of full unavailability when hardware fails.
I haven't operated etcd in production myself, but I assume it does better because it's designed specifically for HA. You can't even run less than three nodes. (The etcd docs talk about election timeouts on the order of 1s, which is encouraging.)
For many use cases, 1-2 minutes of downtime is tolerable. But I can imagine situations where availability is paramount and you're willing to give up scale/performance/features to gain another 9.
if you want a distributed key/value data store, you want to use what's already out there and vetted. It use to be zookeeper, but etcd is much simpler and that's what Kubernetes uses and it has been a big success and proved itself out there in the field. Definitely easier than a full SQL database which is overkill and much harder to replicate especially if you want to have a cluster of >= 3. Again, key is "distributed" and that immediately rules out sqlite.
It's overkill until it's not. We chose etcd initially but after a while we started wanted to ask questions about our data that weren't necessarily organised in the same way as the key/value hierarchy. That just moved all the processing to the client, and now I just wish we used a SQL database from the beginning.
Yeah, but for their use case it's just KV and also ability to link directly in go
This is about spot on. I do get the part about testability, but with a simple Key/Value use case like this, BoltDB or Pebble might have fit extremely well into the Native Golang paradigm as a backing store for the in-memory maps while not needing nearly as much custom code.
Plus maybe replacing the sync.Mutex with RWMutexes for optimum read performance in a seldom-write use case.
On the other hand again, I feel a bit weird criticizing Brad Fitzpatrick ;-) — so there might be other things at play I don‘t have a clue about...
I was initially baffled by the choice of technology too. Part of it is that etcd is apparently much faster at handling writes, and offers more flexibility with regards to consistency, than I remember. Part of it might be that I don't understand the durability guarantees they're after, the gotchas they can avoid (e.g. transactions), or their overall architecture.
This post illustrates the difference between persistence and a database.
If you are expecting to simply persist one instance of one application's state across different runs and failures, a database can be frustrating.
But if you want to manage your data across different versions of an app, different apps accessing the same data, or concurrent access, then a database will save you a lot of headaches.
The trick is knowing which one you want. Persistence is tempting, so a lot of people fool themselves into going that direction, and it can be pretty painful.
I like to say that rollback is the killer feature of SQL. A single request fails (e.g. unique violation), and the overall app keeps going, handling other requests. You application code can be pretty bad, and you can still have a good service. That's why PHP was awesome despite being bad -- SQL made it good (except for all the security pitfalls of PHP, which the DB couldn't help with).
I'd say the universal query capability is the killer feature of SQL.
In the OP they spent two weeks designing and implementing transaction-save indexes -- something that all major SQL RDBMS (and even many NoSQL solutions) have out of the box.
Maybe also part of the success of Rails? Similarly an easy to engineer veneer atop a database.
This comment helped me understand the problem and the solution better (along with a few followup tweets by the tailscale engineers). Thanks.
> (Attempts to avoid this with ORMs usually replace an annoying amount of typing with an annoying amount of magic and loss of efficiency.)
Loss of efficiency? Come on, you were using a file before! :-)
Article makes me glad I'm using Django. Just set up a managed Postgres instance in AWS and be done with it. Sqlite for testing locally. Just works and very little engineering time spent on persistent storage.
Note: I do realize Brad is a very, very good engineer.
Efficiency can be measured in many different ways.
Having no dedicated database server or even database instance, being able to persist data to disk with almost no additional memory required, marginal amount of CPU and no heavy application dependencies can be considered very efficient depending on context.
Of course, if you start doing this on every change, many times a second, then it stops being efficient. But then there are ways to fix it without invoking Oracle or MongoDB or other beast.
When I worked on algorithmic trading framework the persistence was just two pointers in memory pointing to end of persisted and end of published region. Occasionally those pointers would be sent over to a dedicated CPU core that would be actually the only core talking to the operating system, and it would just append that data to a file and publish completion so that the other core can update the pointers.
The application would never read the file (the latency even to SSD is such that it could just as well be on the Moon) and the file was used to be able to retrace trading session and to bring up the application from event log in case it failed mid session.
As the data was nicely placed in order in the file, the entire process of reading that "database" would take no more than 1,5s, after which the application would be ready to synchronize with trading session again.
>Article makes me glad I'm using Django.
This was my main thought throughout reading it. So many things to consider and difficult issues to solve it seems they face a self-made database hell. Makes me appreciate the simplicity and stable performance of django orm + postgre.
I am missing a lot of context from this post because this just sounds nonsensical.
First they're conflating storage with transport. SQL databases are a storage and query system. They're intended to be slow, but efficient, like a bodybuilder. You don't ask a bodybuilder to run the 500m dash.
Second, they had a 150MB dataset, and they moved to... a distributed decentralized key-value store? They went from the simplest thing imaginable to the most complicated thing imaginable. I guess SQL is just complex in a direct way, and etcd is complex in an indirect way. But the end results of both are drastically different. And doesn't etcd have a whole lot of functional limitations SQL databases don't? Not to mention its dependence on gRPC makes it a PITA to work with REST APIs. Consul has a much better general-purpose design, imo.
And more of it doesn't make sense. Is this a backend component? Client side, server side? Why was it using JSON if resources mattered (you coulda saved like 20% of that 150MB with something less bloated). Why a single process? Why global locks? Like, I really don't understand the implementation at all. It seems like they threw away a common-sense solution to make a weird toy.
I'd answer questions but I'm not sure where to start.
I think we're pretty well aware of the pros and cons of all the options and between the team members designing this we have pretty good experience with all of them. But it's entirely possible we didn't communicate the design constraints well enough. (More: https://news.ycombinator.com/item?id=25769320)
Our data's tiny. We don't want to do anything to access it. It's nice just having it in memory always.
Architecturally, see https://news.ycombinator.com/item?id=25768146
JSON vs compressed JSON isn't the point: see https://news.ycombinator.com/item?id=25768771 and my reply to it.
You say you want to have a small database just for each control process deployment to be independent. But you need multiple nodes for etcd... So you currently have either a shared database for all control processes, or 3 nodes per control process, or 3 processes per control node, etc. Either way it seems weird.
I get that SQLite wouldn't work, but it also doesn't make sense to have one completely independent database per process. So I imagine you're using a shared database, at which poitlnt etcd starts to make more sense. It's just not that widely understood in production as sql databases, and has limitations which you might reach in a few years.
> It's just not that widely understood in production as sql databases, and has limitations which you might reach in a few years.
Reaching limitations in a few years and biting that bullet makes the difference between a successful startup that knows when and where to spend time innovating or a startup that spends all their time optimizing for that 1 million simultaneous requests / sec.
The post touches upon it, but I didn't really understand the point. Why doesn't synchronous replication in Postgres work for this use case? With synchronous replication you have a primary and secondary. Your queries go to the primary and the secondary is guaranteed to be at least as up to date as the primary. That way if the primary goes down, you can query the secondary instead and not lose any data.
We could've done that. We could've also used DRBD, etc. But then we still have the SQL/ORM/testing latency/dependency problems.
Can you go into more about what these problems are? I've always used databases (about 15 years on Oracle and about 5 years on Postgres) and I'm not sure if I know what problems you are referring to. Maybe I have experienced them, but have thought of them by a different name.
SQL - I'm not sure what the problems are with SQL. But it is like a second language to me so maybe I experienced these problems long ago and have forgotten about them.
ORM - I never use an ORM, so I have no idea what the problems might be.
testing latency - I don't know what this refers to.
dependency - ditto
I think the "database" label is tripping up the conversation here. What's being talked about here, really, is fast & HA coordination over a (relatively) small amount of shared state by multiple actors within a distributed system. This is literally Etcd's raison d'etre, it excels at this use case.
There are many operational differences between Etcd and a traditional RDBMs, but the biggest ones are that broadcasting updates (so that actors may react) is a core operation, and the MVCC log is "exposed" (via ModRevision) so that actors can resolve state disagreements (am I out of date, or are you?).
SQL is fine. We use it for some things. But not writing SQL is easier than writing SQL. Our data is small enough to fit in memory. Having all the data in memory and just accessible is easier than doing SQL + network round trips to get anything.
ORMs: consider yourself lucky. They try to make SQL easy by auto-generating terrible SQL.
Testing latency: we want to run many unit tests very quickly without high start-up cost. Launching MySQL/PostgreSQL docker containers and running tests against Real Databases is slower than we'd like.
Dependencies: Docker and those MySQL or PostgreSQL servers in containers.
That would have been considerably less scalable. etcd has some interesting scaling characteristics. I posted some followup notes on twitter here: https://twitter.com/apenwarr/status/1349453076541927425
How is PostgreSQL (or MySQL) "considerably less scalable" exactly? etcd isn't particularly known for being scalable or performant. I'm sure it's fast enough for your use-case (since you've benchmarked it), but people have been scaling both PostgreSQL and MySQL far beyond what etcd can achieve (usually at the cost of availability of course).
[I work at Tailscale] I only mean scalable for our very specific and weird access patterns, which involves frequently read-iterating through a large section of the keyspace to calculate and distribute network+firewall updates.
Our database has very small amounts of data but a very, very large number of parallel readers. etcd explicitly disclaims any ability to scale to large data sizes, and probably rightly so :)
This reminds me of this post from the hostifi founder, sharing the code they used for the first 3 years:
It’s just 2 files.
Sometimes it’s better to focus on getting the product working, and handle tech debt later.
I do like putting .json files on disk when it makes sense, as this is a one-liner to serialize both ways in .NET/C#. But, once you hit that wall of wanting to select subsets of data because the total dataset got larger than your CPU cache (or some other step-wise NUMA constraint)... It's time for a little bit more structure. I would have just gone with SQLite to start. If I am not serializing a singleton out to disk, I reach for SQLite by default.
I've seen the same when at one point we decided to just store most data in a JSON blob in the database, since "we will only read and write by ID anyway". Until we didn't, sigh. At least Postgres had JSON primitives for basic querying.
The real problem with that project was of course trying to set up a microservices architecture where it wasn't necessary yet and nobody had the right level of experience and critical thinking to determine where to separate the services.
Storing JSON blobs in the database can be the best option if you are careful with your domain modeling.
I use the same system (a JSON file protected with a mutex) for an internal tool I wrote, and it works great. For us, file size or request count is not a concern, it's serving a couple (internal) users per minute at peak loads, the JSON is about 150 kb after half a year, and old data could easily be deleted/archived if need be.
This tool needs to insert data in the middle of (pretty short) lists, using a pretty complicated algorithm to calculate the position to insert at. If I had used an RDBMS, I'd probably have to implement fractional indexes, or at least change the IDs of all the entries following the newly inserted one, and that would be a lot of code to write. This way, I just copy part of the old slice, insert the new item, copy the other part (which are very easy operations in Go), and then write the whole thing out to JSON.
I kept it simple, stupid, and I'm very happy I went with that decision. Sometimes you don't need a database after all.
As long as you're guaranteeing correctness, it's hard to disagree with the "simple" approach. As long as you don't over-promise or under-deliver, there's no problem, AFAICS.
 Via mutex in your case. Have you thought about durability, though. That one's actually weirdly difficult to guarantee...
> Have you thought about durability, though. That one's actually weirdly difficult to guarantee...
Strictly speaking, it's literally impossible to guarantee, so it's more a question of what kinds and degrees of problems are in- versus out-of-scope for being able to recover from.
0: What happens if I smash your hard drive with a hammer? Oh, you have multiple hard drives? That's fine, I have multiple hammers.
What happened to the first hammer :D
That's good. But single file could break on powerloss. I use sqllite. It's quite easy to use, not a single line though.
Their point about schema migration is completely true though. An SQLite db is extremely stateful, and querying that state in order to apply schema migrations (for both the data schema and the indexes) is bothersome.
Get a daily email with the the top stories from Hacker News. No spam, unsubscribe at any time.