
Reddit’s database has two tables (2012)

withinboredom

I worked at a startup (now fairly popular in the US) where we had tables for each thing (users, companies, etc) and a “relationship” table that described each relationship between things. There were no foreign keys, so making changes was pretty cheap. It was actually pretty ingenious (the two guys who came up with the schema went on to get paid to work on k8s).
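A minimal sketch of what that kind of layout looks like (table and column names here are made up for illustration, not our actual schema):

  -- One table per "thing", plus one generic edge table with no FK constraints.
  CREATE TABLE users     (id BIGINT PRIMARY KEY, name TEXT, created_at TIMESTAMPTZ);
  CREATE TABLE companies (id BIGINT PRIMARY KEY, name TEXT, created_at TIMESTAMPTZ);

  CREATE TABLE relationships (
      id          BIGINT PRIMARY KEY,
      source_type TEXT   NOT NULL,   -- e.g. 'user'
      source_id   BIGINT NOT NULL,
      target_type TEXT   NOT NULL,   -- e.g. 'company'
      target_id   BIGINT NOT NULL,
      rel_type    TEXT   NOT NULL,   -- e.g. 'employee_of', 'customer_of'
      created_at  TIMESTAMPTZ
  );

  -- "Look for a user and discover everything":
  SELECT * FROM relationships
  WHERE (source_type = 'user' AND source_id = 42)
     OR (target_type = 'user' AND target_id = 42);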

It was super handy to simply query that table to debug things, since by merely looking for a user, you'd discover everything. If Mongo had been more mature and scalable back then (2012ish), I wonder whether we would have used it.

vivegi

This is quite similar to the RDF triples model. Everything is a 'thing' including 'relationships' between things. So you just need a thing table and a relation table.

The issue with this is that schema management and rules get pushed to the application layer. You also have to deal with a very large relationships table (in terms of number of rows), which leads to potential performance issues.

replygirl

let's call them "nodes" and "edges"

CabSauce

Right? What a groundbreaking idea! /s

claytonjy

other downsides I saw from an implementation of this in postgres:

1. you can't have strongly typed schemas specific to a 'thing' subtype; to store it all together, you end up sticking all the interesting bits in a JSON field

2. any queries of interest require a lot of self joins and performance craters

Please, please, never implement a "graph database" or "party model" or "entity schema" in a normal RDBMS, except perhaps at minuscule scale; the flexibility is absolutely not worth it.
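To make the self-join pain concrete, here's a hypothetical sketch (made-up table and type names) of a fairly ordinary question - "which users work at a company that is a customer of vendor 7" - against a single edges table:

  -- Every extra hop through the graph is another self-join on the same huge table.
  SELECT u.source_id AS user_id
  FROM   relationships u
  JOIN   relationships c
         ON  c.source_type = 'company'
         AND c.source_id   = u.target_id
         AND c.rel_type    = 'customer_of'
  WHERE  u.source_type = 'user'
    AND  u.target_type = 'company'
    AND  u.rel_type    = 'employee_of'
    AND  c.target_type = 'vendor'
    AND  c.target_id   = 7;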

NicoJuicy

That's why you have duplicated properties on the things you want to query on.

philjohn

That's why in RDF you should really be using a well-modelled ontology.

As for performance issues, using a dedicated Quad store helps there, rather than one built on top of an RDBMS (OpenLink Virtuoso excluded, because somehow they manage to make Quads on top of a distributed SQL engine blazing fast).

It's been about 6 years since I've been in that world, so things might have changed wrt performance.

OJFord

Sounds like a 'star schema', just without foreign key constraints on the relationship ('fact') table.

I'm not really sure I understand why not having that constraint is helpful - what sort of 'changes' are significantly cheaper? Schema changes on the thing ('dimension') tables? Or do you mean cheaper in the 'reasoning about it' sense, you can delete a thing without having to worry about whether it should be nullified, deleted, or what in the relationship table? (And sure something starts to 404 or whatever, but in this application that's fine.)

withinboredom

> you can delete a thing without having to worry about whether it should be nullified, deleted, or what in the relationship table?

Pretty much this. You could just delete all the relationships, or just some of them. For example, a user could be related to many things (appointments, multiple vendors, customers, etc). Thus a customer can just delete the relationship between them and a user, but the user still keeps their other relationships with the customer (such as appointments, allowing them to see their appointment history, even though the customer no longer has access to the user's information).
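In SQL terms it boiled down to deleting a single edge and nothing else (a sketch with made-up names, not the real schema):

  -- Customer 7 revokes its direct link to user 42...
  DELETE FROM relationships
  WHERE  rel_type    = 'has_access_to'
    AND  source_type = 'customer' AND source_id = 7
    AND  target_type = 'user'     AND target_id = 42;
  -- ...while the appointment edges between them are left untouched.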

Ecio78

Wouldn't you be able to do the same even with FKs in place? What an FK would prevent you from doing is deleting a user before all their relationships are removed. That should keep you from ending up with inconsistent data inside the tables.

dehrmann

> tables for each thing (users, companies, etc) and a “relationship” table that described each relationship between things

Sounds like objects and associations in the Tao model:

https://engineering.fb.com/2013/06/25/core-data/tao-the-powe...

hamandcheese

So basically one table for each type of node, plus one table for all edges. Neat!

How big was the relationships table? How was performance?

withinboredom

It was pretty massive. I'm certain that it has probably been sharded by now. Database performance wasn't our bottleneck; we had NP-hard problems all over the place, which were our primary performance issues. For most customers, they never really had issues, but our larger customers def did. Those problems were amazingly fun to solve.

prox

How do you shard a thing like this? (Apart from “with great difficulty.”) :)

giancarlostoro

One neat thing about Mongo is ObjectIDs:

We didn't know it at the time, but you can get the date of creation from them. Which means... If you sort by ObjectID, you're actually sorting by creation time. You can create a generic ObjectID using JavaScript and filter by time ranges. I still have a codepen proof of concept I did for doing ranges.

Anyway, the other thing is, and I'm not sure how many people use it this way: you can copy an ObjectID verbatim into other documents, like a foreign key, so you can reference between collections. If you do this, you'll save yourself a lot of headaches. Just because you can shove anything and everything into MongoDB documents, doesn't mean you absolutely should.

CGamesPlay

But why? Ok, no foreign key checking, so it’s fast, great. Why not include the foreign ID in the local table? What’s the purpose behind a global “relationships” table?

fatherzine

Guessing:

* Most relations are properly modeled as N:N anyways.

* Corollary: no "aargh, need to upgrade this relation to N:N" regrets.

* Corollary: no 1000 relationship tables to name & chase around the DB metadata.

* Low friction to add relations on-the-fly when developing code.

* Low friction to query all relations associated with an object.

* Low friction to update a relation, but keep the ex around.

Neat approach IMHO.

fauigerzigerk

>Most relations are properly modeled as N:N anyways

I'm not sure that's true. Containment is a very common type of relationship.

>Corollary: no 1000 relationship tables to name & chase around the DB metadata.

That depends on whether relationships have associated attributes. If and when they do, you have to turn them into object types and create a table for them. And it's exactly the N:M relationships that tend to have associated information.

That's the conceptual problem with this approach in my view. Quite often, there isn't really a conceptual difference between relationship types and object/thing types. So you end up with a model that allows you to treat some relationships as data but not others.

>Neat approach IMHO.

It's a valid compromise I guess, depending heavily on the specifics of your schema. Ultimately, database optimisation is a very dirty business if you can't find a way to completely separate logical from physical aspects, which turns out to be a very hard problem to solve.

zepolen

* low friction till things go wrong and you're not even aware they went wrong because the relational mistake wasn't constrained

* aaargh we have trillion row tables that contain everything, how do we scale this (you don't)

* aaargh our tables indices don't fit into memory, what do we do (you cry)

* no at-a-glance metrics like table sizes to give you an idea of which relations are growing

* oh no we have to have complex app code to re-implement for relationships everything a database gave for free

Reddit used a similar system in PostgreSQL and hit all of these issues before moving to Cassandra. Even then they still had to deal with constraints of the original model.

Edit: Didn't even notice the article is about Reddit; it pretty much confirms what I said. It took them something like 2 years iirc to migrate to Cassandra and the site was wholly unusable during that time. There is no doubt in my mind that if another Reddit-style site had existed during that time it would have eaten their lunch, same as Reddit ate Digg's lunch (for non-technical reasons).

Furthermore, back then iirc they only really had something like 4 tables/relations: comment, post, user, vote. The 'faster development time' from adding new relations was completely moot; they spent more time developing and dealing with a complex db query wrapper and performance issues (caching in memcache all over the place) than actual features.

The (app+query) code and performance for their nested comments was absolutely hilarious. Something a well-designed table would have given them for free in one recursive CTE.
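Something along the lines of this sketch, assuming a conventional comments table with a parent_id column (not Reddit's actual schema):

  WITH RECURSIVE thread AS (
      SELECT id, parent_id, author, body, 1 AS depth
      FROM   comments
      WHERE  id = 123                       -- root comment of the thread

      UNION ALL

      SELECT c.id, c.parent_id, c.author, c.body, t.depth + 1
      FROM   comments c
      JOIN   thread   t ON c.parent_id = t.id
  )
  SELECT * FROM thread ORDER BY depth;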

It wasn't until they moved to Cassandra that they were able to let the db do that job and work on adding the shit ton of features that exist today.

null

[deleted]

bongobingo1

Can't help but read this and wonder if it was a strict architectural decision (I think this is what you imply?), or a "move fast and fix it later" side effect that hung around. To me it seems like you're discarding a lot of safety, but perhaps that's overstated in my head.

withinboredom

The relationships were enforced in-code, so the safety existed there instead of by the database. Even where I work now with sharding and such, we don’t have any foreign key constraints yet serve billions of requests. Code can provide just as much safety as the database, you just have to be smart enough to not try and override it.

Tainnor

> Code can provide just as much safety as the database

No, database constraints declaratively specify invariants that have to be valid for the entirety of your data, no matter when it was entered.

Checks and constraints in code only affect data that is currently being modified while that version of the code is live. Unless you run regular "data cleanup" routines on all your data, your code will not be able to fix existing data or even tell you that it needs to be fixed. If you deployed a bug to production that caused data corruption, deploying a fix for it will not undo the effects of the bug on already-created data.

Now, a new database constraint will also not fix previously invalid data, but it will alert you to the problem and allow you to fix the data, after which it will then be able to confirm to you that the data is in correct shape and will continue to be so in the future, even if you accidentally deploy a bug wherein the code forgets to set a reference properly.
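PostgreSQL, for instance, lets you bolt the constraint on after the fact and have the database itself tell you whether the existing data holds up (a sketch with hypothetical table names):

  -- Add the constraint without checking existing rows yet...
  ALTER TABLE appointments
    ADD CONSTRAINT appointments_user_fk
    FOREIGN KEY (user_id) REFERENCES users (id) NOT VALID;

  -- ...then validate: this scans the table and fails loudly on any orphaned rows.
  ALTER TABLE appointments VALIDATE CONSTRAINT appointments_user_fk;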

I'm fine with people claiming that, for their specific use case or situation, database constraints are not worth it (e.g. for scalability reasons), but it seems disingenuous to me to claim that code can provide "just as much safety".

zepolen

> Code can provide just as much safety as the database, you just have to be smart enough to not try and override it.

This statement is absolutely false. Some application code that random developers write will never be a match for constraints in a mature, well tested database.

holoduke

Can you tell me what the performance impact is of code logic and extra DB calls versus built-in constraint logic? Because I am fairly convinced that most modern DBs have almost zero impact from relationship constraints. Those are heavily optimised. I am curious about your findings and want to know more.

lisper

> There were no foreign keys

Then how did the relationship table refer to things in the thing tables?

chrchr

I bet they meant no foreign key constraints.

lisper

OK, but that still doesn't make any sense because:

> we had tables for each thing

I take that to mean "a separate table for each class of thing." So the single "relationship" table had to somehow refer not just to rows in tables but to the tables themselves. AFAIK that is not possible to do directly in SQL. You would have to query the relationship table, extract a table name from the result, and then make a separate query to get the relevant row from that table. And you would have to make that separate query for every relationship. That would be horribly inefficient.

null

[deleted]

senttoschool

They were using it like a NoSQL database back in 2010, just one year after MongoDB started. So there were no options.

In 2022, there are so many more mature, reliable, battle-tested NoSQL options that could have solved their problem more elegantly.

But even today, if I were to build Reddit as a startup, I'd start with a Postgres database and go as far as I can. Postgres allows me to use relations when I want to enforce data integrity and JSONB when I just want a key/value store.
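Roughly this kind of hybrid, as a sketch (hypothetical schema, not a real one):

  -- Hard relational skeleton where integrity matters...
  CREATE TABLE posts (
      id        BIGSERIAL PRIMARY KEY,
      author_id BIGINT NOT NULL REFERENCES users (id),
      title     TEXT   NOT NULL,
      metadata  JSONB  NOT NULL DEFAULT '{}'   -- ...and a schemaless bag for the rest.
  );

  SELECT id, title
  FROM   posts
  WHERE  metadata @> '{"flair": "discussion"}';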

lmm

> But even today, if I were to build Reddit as a startup, I'd start with a Postgres database and go as far as I can. Postgres allows me to use relations when I want to enforce data integrity and JSONB when I just want a key/value store.

This is accepted wisdom on HN but I don't see it:

- True HA is very difficult. Even just automatic failover takes a lot of manual sysadmin effort

- You can't autoscale, and while scaling is a nice problem to have, having to not just fundamentally rearchitect all your code when you need to shard, but also probably migrate all your data, is probably the worst version of it.

- Even if you're not using transactions, you still pay the price for them. For example your writes won't commit until all your secondary indices have been updated.

- If you're actually using JSONB then driver/mapper/etc. support is very limited. E.g. it's still quite fiddly to have an object where one field is a set of strings / enum values and have a standard ORM store that in a column rather than creating a join table.

senttoschool

If I were to start building Reddit from scratch without knowing if it's going to be successful or not, I'd rather iterate quickly on features than spend time scaling something that will most likely fail before reaching that kind of scale.

>True HA is very difficult. Even just automatic failover takes a lot of manual sysadmin effort

Nearly all Postgres cloud providers, such as RDS, CloudSQL, and Upcloud, provide multi-region nodes and automatic failover.

Also, true HA databases are very expensive and have their own drawbacks. Not really designed for a Reddit-style startup.

>You can't autoscale, and while scaling is a nice problem to have, having to not just fundamentally rearchitect all your code when you need to shard, but also probably migrate all your data, is probably the worst version of it.

Hence why I said "as far as I can". I can keep upgrading the instance, creating materialized views, caching responses, etc.

Lastly, the most important reason is because I know Postgres and I can get started today. I don't want to learn a highly scalable, new type of database that I might need 5 years down the road.

lmm

> Nearly all Postgres cloud providers such as RDS, CloudSQL, Upcloud, provide multi-region nodes and automatic failover.

From memory you have to specify a maintenance window and failover is not instant. And relying on your cloud provider comes with its own drawbacks.

> Also, true HA databases are very expensive and have their own drawbacks. Not really designed for a Reddit-style startup.

Plenty of free HA datastores out there, and if we're assuming it'll be managed by the cloud provider anyway then that's a wash.

> Lastly, the most important reason is because I know Postgres and I can get started today. I don't want to learn a highly scalable, new type of database that I might need 5 years down the road.

That's a good argument for using what you know; by the same token I'd use Cassandra. Postgres is a very complex system with lots of legacy and various edge cases.

naniwaduni

Does a startup in Reddit's niche demand HA?

lmm

I would think that if you want to create a habit where people check you frequently, avoiding downtime would be important. But IDK.

senttoschool

On a side note, I find it a bit ironic how NoSQL was all the rage back in the last decade or so but in 2022, NoSQL DBs are racing to add SQL querying back to key value stores.

It turns out that SQL is crucial to non-app-developers such as business analysts and data scientists. Trying to set up an ETL to pull data from your MongoDB datastore to a Postgres DB so the analysts could generate reports is such a waste of time and resources. Or worse, the analysts have to ask devs to write Javascript code to query the data they need. For this reason alone, I will always start with a DB that supports SQL out of the box.

ralusek

People didn't leave RDBMS for NoSQL because they wanted to get rid of RDBMS features, they did it because they couldn't scale their applications on single instances anymore. So they had to give up transactions and foreign keys and schemas, etc, because those concepts just didn't exist across instances. Not because they wanted to, but because they had to.

The entire database world has been working ever since trying to get back those RDBMS features into their horizontally scalable solutions, it's just taken a while because the problems are exponentially harder in a network. It's not like people are using eventual consistency and a loss of referential integrity because they prefer it.

senttoschool

>People didn't leave RDBMS for NoSQL because they wanted to get rid of RDBMS features, they did it because they couldn't scale their applications on single instances anymore.

I agree that this should have been the reason for choosing something like MongoDB over MySQL/Postgres.

In reality, I saw a lot of companies switching to it (or starting with it) so their devs didn't have to learn SQL and could just stay in Javascript/whatever language they were using.

nprateem

100%. You build something cool and want to get investment or understand how your marketing efforts are going so need to run some analytics. As soon as even the most basic queries require custom code it's obvious you've wasted a load of dev and analyst time.

null

[deleted]

machiaweliczny

We seem to be happy using Hevo + BigQuery + Metabase for analytics need. Hevo can pull from many sources.

null

[deleted]

Misdicorl

> So there were no options.

Hbase was a pretty solid NoSQL option by 2010, no?

TekMol

These days, for a new Web2 type startup, I would use SQLite.

Because it requires no setup and has been used to scale typical Web2 applications to millions in revenue on a single cheap VM.

Another option worth taking a look at is to use no DB at all. Just writing to files is often the faster way to get going. The filesystem offers so much functionality out of the box. And the ecosystem of tools is marvelous. As far as I know, HN uses no DB and just writes all data to plain text files.

It would actually be pretty interesting to have a look at how HN stores the data:

https://news.ycombinator.com/item?id=32408322

paldepind2

I sometimes see recommendations like this and I just don't get it. It's not like setting up a PostgreSQL database is hard at all? And using files? That just sounds like a horrible developer experience and problems waiting to happen. Databases give you so much for free: ACID, transactions, an easy DSL for queries, structure for your data, etc. On top of that, every dev knows some SQL.

What am I missing?

rtpg

So I am also "really people, use PostgreSQL", because you have way less tricks you have to play to get it working compared to SQLite, in serious envs. However, some challenges with Postgres, especially if you have less strict data integrity requirements (reddit-y stuff, for example):

- Postgres backups are not trivial. They aren't hard, but well SQLite is just that one file

- Postgres isn't really built for many-cardinal-DB setups (talking on order of 100s of DBs). What does this mean in practice? If you are setting up a multi-tenant system, you're going to quickly realize that you're paying a decent cost because your data is laid out by insertion order. Combine this with MVCC meaning that no amount of indices will give you a nice COUNT, and suddenly your 1% largest tenants will cause perf issues across the board.

SQLite, meanwhile, one file per customer is not a crazy idea. You'll have to do tricks to "map/reduce" for cross-tenant stuff, of course. But your sharding story is much nicer.

- PSQL upgrades are non-trivial if you don't want downtime. There's logical upgrades, but you gotta be real fucking confident (did you remember to copy over your counters? No? Enjoy your downtime on switchover).

That being said, I think a lot of people see SQLite as this magical thing because of people like fly posting about it, without really grokking that the SQLite stuff that gets posted is either "you actually only have one writer" or "you will report successful writes that won't successfully commit later on". The fundamentals of databases don't stop existing with a bunch of SQLite middleware!

But SQLite at least matches the conceptual model of "I run a program on a file", which is much easier to grok than the client-server based stuff in general. But PSQL has way more straightforward conflict management in general

simonw

SQLite backups aren't quite that easy.

Grabbing a copy of the file won't necessarily work: you need an atomic snapshot. You can create those using the backup API or by using VACUUM INTO, but in both cases you'll need enough spare disk space to create a fresh copy of your database.
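For reference, the VACUUM INTO form is a one-liner, though as noted it still writes a full second copy of the database (the path is just an example):

  -- Produces a consistent snapshot in a new file, even while the database is in use.
  VACUUM INTO '/backups/app-backup.db';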

I'm having great results from Litestream streaming to S3, so that's definitely a good option here.

mekster

Just use zfs snapshot to backup. Let go of the old ways of dump commands.

rahimnathwani

"SQLite is just that one file"

How do you get enough time to back up that file when it's in use? Lock the file until you've copied it? zfs?

akamaka

I recommend that all developers at least once try to write a web app that uses plain files for data storage. I've done it, and I very quickly realized why databases exist and learned not to take them for granted.

sedatk

Exactly. I'd written my now popular web app (most popular Turkish social platform to date) as a Delphi CGI using text files as data store in 99 because I wanted to get it running ASAP, and I thought BDE would be quite cumbersome to get running on a remote hosting service. (Its original source that uses text files is at https://github.com/ssg/sozluk-cgi)

I immediately started to have problems as expected, and later migrated to an Access DB. It didn't support concurrent writes, but it was an improvement beyond comprehension. Even "orders of magnitude" doesn't cut it because you get many more luxuries than performance like data-type consistency, ACID compliance, relational integrity, SQL and whatnot.

kcartlidge

> write a web app that uses plain files for data storage ... very quickly realized why databases exist, and not to take them for granted

You haven't lived until you've built an in-memory database with background threads persisting to JSON files. Oh, the nightmares.

gjulianm

> What am I missing?

Different requirements, expertise, and priorities. I don't like generic advice too much because there are so many different situations and priorities, but I've used files, SQLite and PostgreSQL according to the priorities:

- One time we needed to store some internal data from an application to compare certain things between batch launches. Yes, we could have translated from the application code to a database, but it was far easier to just dump the structure to a file in JSON, with the file uniquely named for each batch type. We didn't really need transactions, or ACID, or anything like that, and it worked well enough and, importantly, fast enough.

- Another time we had a library that provided some primitives for management of an application, and needed a database to manage instance information. We went for SQLite there, as it was far easier to set up, debug and develop for. Also, far easier to deploy, because for SQLite it's just "create a file when installing this library", while for PostgreSQL it would be far more complicated.

- Another situation, we needed a database to store flow data at high speed, and which needed to be queried by several monitoring/dashboarding tools. There the choice was PostgreSQL, because anything else wouldn't scale.

In other words, it really depends on the situation. Saying "use PostgreSQL always" is not going to really solve anything, you need to know what to apply to each situation.

paldepind2

Thanks for the reply. All the situations that you describe sound very reasonable. I definitely wasn't trying to say "use PostgreSQL always"; I too have used files and SQLite in various situations. For instance, your second example is IMO a canonical example of where SQLite is the right tool and where you make use of its strengths. My comment was more directed at the situations where PostgreSQL seems to be the right tool for the job, but where people still recommend other things.

senttoschool

>In other words, it really depends on the situation. Saying "use PostgreSQL always" is not going to really solve anything, you need to know what to apply to each situation.

We're responding to this below. The requirement is a Web2 app with "millions in revenue".

>Because it requires no setup and has been used to scale typical Web2 applications to millions in revenue on a single cheap VM.

0xbadcafebee

It's "kick-the-can-down-the-road" engineering. If the server can't hold the database in RAM anymore, they buy more RAM. If the disks are too slow, they buy faster/more disks. If the one server goes down, the site is just down for an hour or two (or a day or two) while they build a new server. It works until it doesn't, and most people are fine with things eventually not working.

This has only been practical within the last 6 years or so. Big servers were just too expensive. We started using cheap commodity low-resource PCs, which are cheaper to buy/scale/replace, but limits what your apps can do. Bare metal went down hard and there wasn't an API to quickly stand up a replacement. NAS/SAN was expensive. Network bandwidth was limited.

The cloud has made everything cheaper and easier and faster and bigger, changing what designs are feasible. You can spin up a new instance with a 6GB database in memory in a virtual machine with network-attached storage and a gigabit pipe for like $20/month. That's crazy.

nfhshy68

It's easy to install k8s and curl a helm chart too; doesn't mean you should.

What you're missing is complexity and second order effects.

Making decisions about databases early results in needing to make a lot of secondary decisions about all sorts of things far earlier than you would if you just left it all in sqlite on one VM.

It's not for everyone. People have varying levels of experience and needs. I'll almost always setup a MySQL db but I'd be lying if I said it never resulted in navel gazing and premature db/application design I didn't need.

senttoschool

>Making decisions about databases early results in needing to make a lot of secondary decisions about all sorts of things far earlier than you would if you just left it all in sqlite on one VM.

Like what? What could be easier than going to AWS, click a few things to spin up a Postgres instance, click a few more things to scale automatically without downtime, create read replicas, have automatic backups, one-click restore?

I feel like it's the opposite. Trying to put SQLite on a VM via Kubernetes will likely have a lot of secondary decisions that will make "scaling to millions" far harder and more expensive.

hurril

That's a cop-out answer. You're making the claim that simply using the filesystem compared to going with Postgres somehow has less complexity and fewer (or lesser) second order effects, but you don't even indicate how come. So here's my question:

how come?

ngc248

Choosing a DB is not premature optimization, but it eliminates a whole host of data related problems and smoothens development.

jokoon

I built a gallery web app so I could tag images, for my own use.

At first I was writing to a large JSON file. Each time I made a change, I would write the file and close it.

It worked pretty well. I now use sqlite+dataset and the port was trivial.

I guess there are better ways to do it than open(), but using dataset is quite simple.

Maybe creating folders and duplicating data works, but it seems more complex than using sqlite and dataset.

latchkey

You aren't missing anything.

Cloud SQL postgres on GCP with <50 connections is like ~$8/month and has automated daily backups.

The differences in SQL syntax between SQLite and Postgres, namely around date math (last_ping_at < NOW() - INTERVAL '10 minutes'), make SQLite a non-starter imho... you're going to end up having to rewrite a lot of sql queries.
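For example, the same "hasn't pinged in the last 10 minutes" filter ends up written differently in each (a sketch; hypothetical table, and assuming timestamps are stored as ISO-8601 text in SQLite):

  -- PostgreSQL
  SELECT * FROM hosts WHERE last_ping_at < NOW() - INTERVAL '10 minutes';

  -- SQLite
  SELECT * FROM hosts WHERE last_ping_at < datetime('now', '-10 minutes');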

saagarjha

> On top of that every dev knows some SQL.

Where "some" may include "none" ;)

senttoschool

You can get a managed Postgres instance for $15/month these days. Likely cheaper, more secure, faster, better than an SQLite db hosted manually on a VM.

>Because it requires no setup and has been used to scale typical Web2 applications to millions in revenue on a single cheap VM.

Plenty of setup. How would you secure your VM? SSL configuration? SSL expiration management? Security? How would you scale your VM to millions without downtime? Backups? Restoring backups?

All these problems are solved with managed database hosting.

zinodaur

I recently hopped on the SQLite train myself - I think it's a great default. If your business can possibly use sqlite, you get so much performance for free - performance that translates to a better user experience without the need to build a complicated caching layer.

The argument boils down to how the relative performance costs of databases have changed in the 30 years since MySQL/Postgres were designed - and the surprising observation that a read from a fast SSD is usually faster than a network call to memcached.

[1] https://fly.io/blog/all-in-on-sqlite-litestream/ [2] https://www.usenix.org/publications/loginonline/jurassic-clo...

throwaway0x7E6

my brother in Christ, SQLite is not a RDBMS. it runs on the same machine as your application. the scary, overwhelmingly complex problems you've enumerated should indeed be left to competent professionals to deal with, but they do not affect SQLite.

oefrha

> it runs on the same machine as your application.

Better, it runs in the same process.

geocar

> the scary, overwhelmingly complex problems you've enumerated should indeed be left to competent professionals to deal with

You also aren’t getting that for $15/mo, either.

oefrha

I love SQLite and use it whenever I can, but if you’re building the next Reddit, you obviously can’t live with the lack of concurrent writes. HN is fine as the write volume is really low, just a couple posts per minute, plus maybe ten times as many votes.

mekster

> couple posts per minute

There are hundreds of posts that people are reading at every moment. I don't think it's that low.

oefrha

You don't have to guess. Just check https://news.ycombinator.com/new and https://news.ycombinator.com/newcomments for everything that's posted here.

Or take a quantitative approach. HN item IDs are sequential. Using the API:

  $ curl 'https://hacker-news.firebaseio.com/v0/maxitem.json'
  32409711
  $ curl -s "https://hacker-news.firebaseio.com/v0/item/32409711.json" | jq .time
  1660125598
  $ curl -s "https://hacker-news.firebaseio.com/v0/item/$(( 32409711 - 14400 )).json" | jq .time
  1660031940
  
so, 14400 items in ~26 hours, fewer than 10 items per minute on average. At the peak you'll see somewhat more than that.

okasaki

What about one sqlite db per subreddit?

senttoschool

Where would you store the user data? Duplicate it across millions of SQLite databases?

colinmhayes

Using files sounds insane. 100% chance you end up with a nasty data race.

asdff

Why sqlite and not postgres? every sqlite tutorial or documentation or anything at all that I've seen says not to use this tool for a large database

bdlowery

https://remoteok.com and https://nomadlist.com both use SQLite as the database. 100+ million requests a month and there's been no problems.

threatripper

For what definition of "large"? SQLite surely has no problem holding gigabytes of data or millions of rows. In many cases it's much faster than most other kinds of databases. So, size is not the problem. Throughput on one machine with one active process on the same VM is not the problem. Having multiple processes on one VM in most cases is no problem.

But if you spread your compute across many machines that access a single database it can get iffy. If you need the database accessible through network there are server extensions which use SQLite as storage format but you're dealing with a server and probably could use any other database without much difference in deployment&maintenance.

14u2c

This all sounds good until you consider high-availability, which IMO is absolutely essential for any serious business. How do you handle fail-overs when that cheap VM goes down? How do you handle regional replication?

You could cobble something together with rsync, etc, but then you have added a bunch of complexity and built a custom and unproven system. Another option is to use one of the SQLite extensions popping up recently, like Dqlite, but again, this approach is relatively unproven for basing your entire business on.

Or you could simply use an off the shelf DBMS like Postgres or even MySQL. They already solve all of these problems and are as battle-tested as can be.

Skinney

> high-availability, which IMO is absolutely essential for any serious business

Depends very much on the business. You can have downtime-free deploys on a single node, and as long as you've setup a system to automatically replace a faulty node with a new one (which typically takes < 10min) then a lot of businesses can live with that just fine. It's not like that cheap VM goes down every day, but just in case you can usually schedule a new instance every week or so to reduce the chance of that happening.

> How do you handle regional replication?

For backup purposes, you'd use litestream. Very easy to use with something like systemd or as a docker sidecar.

For performance purposes, if you do need that performance you'd obviously use something else. Depending on the type of service you have, though, you can get very far with a CDN in front of your server.

> Or you could simply use an off the shelf DBMS like Postgres or even MySQL.

If you need it, sure.

jsight

TBH, I've used Hypersonic SQL before. People thought it was crazy, but there was no concurrency, and backups were just copying a couple of files. Fixing them if they failed was easy too.

People get too caught up in assumptions without knowing the use case. There are a million use cases where a tool like sqlite would be bad, but also a million where it's likely the easiest and best tool.

Wisdom is knowing the difference.

Tade0

> These days, for a new Web2 type startup, I would use SQLite.

There's a whole page in the docs explaining why this might be a bad idea for a write-heavy site:

https://www.sqlite.org/howtocorrupt.html

Section 2 is especially interesting. SQLite locks on the whole database file - usually one per database - and that is a major bottleneck.

> Another option worth taking a look at is to use no DB at all.

We tried that in a project I'm currently in as a means of shipping an MVP as soon as possible.

Major headache in terms of performance down the road. We still serve some effectively static stuff as files, but they're cached in memory, otherwise they would slow the whole thing down by an order of magnitude.

gorgoiler

Facebook was very similar, I believe. The MySQL schema boiled down to two tables: nodes with an id, a type, and key/value properties; and edges (known internally as assocs) between pairs of ids, also with key/value properties and a type.

A simple schema helps when you just need to write stuff. You can make simple queries against a simple schema, directly.

For the juicier queries what you do is you derive new tables from the data using ETL or map/reduce. The complexity of the underlying schema doesn’t have to reflect the complexity of your more complicated queries. It only needs to be complex enough to let you store data in its base form and make simple queries. Everything else can be derived on top of that, behind the scenes.

Example: nodes are people and edges represent friendships, and then from this you could derive a new table of best_friend(id, bf_id, since).
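As a rough sketch of that derivation step (made-up table and column names, not Facebook's actual schema), the offline job could boil down to something like:

  -- Per person, pick the friendship edge with the highest interaction count
  -- (interaction_count is assumed to live in the edge's key/value properties).
  CREATE TABLE best_friend AS
  SELECT DISTINCT ON (e.source_id)
         e.source_id  AS id,
         e.target_id  AS bf_id,
         e.created_at AS since
  FROM   edges e
  WHERE  e.edge_type = 'friendship'
  ORDER  BY e.source_id, (e.properties->>'interaction_count')::int DESC;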

alexpotato

After 15+ years of being an SRE in finance across firms ranging in size from 2 people to 350K people, I've come to the following conclusions:

1. There are no solutions, only tradeoffs.

Sure. You win in the OP example of easy to add things but you lose in the relational aspects.

You could say the same thing in reverse: using an RDBMS makes it much easier to do join-based lookups, but then it's harder to update.

2. At the end of the day, it doesn't matter b/c some other layer will do what your base layer can't

I've seen some really atrocious approaches to storing data in systems that were highly regarded by the end users. How did this happen? Someone went in and wrote an abstraction layer that combined everything needed into an easy to query/access interface. I've seen this happen so many times that whenever people start trying to future proof something I mention this point. You will get the design wrong and/or things will change. That means you either change the base layer or you modify/create an abstraction layer.

Waterluvian

Re #1: I agree. But with a pedantic semantic distinction:

There are a lot of objectively wrong ways to do something. But among all the right ways, there are trade offs.

0xbadcafebee

But even if you do something objectively wrong, it can still work out anyway. Look at every Perl script written by non-programmers. Horrifying mess of bullshit, and you try to read the code and think this shouldn't even be parseable, but somehow the script works.

There is no right or wrong, only working and not-working.

ALittleLight

If someone has written their own bubble sort to sort a list of numbers and you replace the custom bubble sort with a built-in sort - there really aren't tradeoffs to consider. Your method will be faster, easier to understand, more robust. Their method works, but is strictly inferior.

balfebs

When I was a Perl programmer, in our shop we amended the motto "There's more than one way to do it" to "There's more than one way to do it, but most of those ways are wrong."

The Perl community themselves eventually extended the motto to "There's more than one way to do it, but sometimes consistency is not a bad thing either"

Xeoncross

You nailed it on the head (and people wonder why there are so many 'bad' laws passed).

mabbo

In 2004, I started a community college program and had a professor teaching intro to databases. He was retired from industry and had lived through the rise of RDBMSs and made his money in that. Teaching was just his hobby. He liked to have fun with the class (120+ of us) by having us do call-and-response things as a group.

Prof: "Why do we normalize?"

Class, in unison: "To make better relations."

Prof: "Why do we denormalize?"

Class, in unison: "Performance."

It took a lot of years of work before the simplicity of the lesson stuck with me. Normalization is good. Performance is good. You trade them off as needed. Reddit wants performance, so they have no strong relations. Your application may not need that, so don't prematurely optimize.

pc86

"Performance is good" is I think something everyone could agree with just hearing it for the first time. There's no reason to prefer worse performance if you can have better performance for no additional cost. Is the same true for normalization? Why is just the state of being more normalized inherently better?

jeremyjh

Denormalized means it has multiple copies of the same pieces of data. The application is responsible for keeping those copies in sync with each other. This is a messy process that is easy to get wrong now and then. Then your app starts saying things that are absurd. No one cares much if Reddit does that every now and then.

vbezhenar

Some levels of normalisation don't allow nulls. This is absurd. Normalisation alone is theoretical concept which should not be blindly pursued in practice, even without performance requirements.

bsedlm

because a normalized database repeats less things and is much easier to understand than a messy one.

oh, and also, the entire point is that neither is "inherently better"... both have pros and cons

pc86

I think my point was "performance" in the sense of faster response time is inherently good. All things being equal, if you have a more performant option you'll choose that. The only time you accept worse performance is in pursuit of some other metric.

On the flip side, "normalization" doesn't have that same inherent good-ness. All things being equal (including performance) there isn't any inherent drive toward more normalization, maybe because a faster performing page will have clear impact to the user while a more normalized data structure would be completely transparent to the user?

mabbo

Normalized schemas are less prone to certain kinds of hard-to-solve issues.

Example: When you have the same data in two places, which is authoritative? If they disagree, which one is right?

And querying data reliably is much easier when you've got good relations that can be relied on to be accurate. SQL is almost a masterpiece (though imho it should have been FROM, WHERE, SELECT).

blahyawnblah

> don't prematurely optimize

Premature optimization is the root of all evil

isaacremuant

Careful with repeating mantras without reflection.

You may end up using them as arguments when they're just catchphrases that require context.

See people who used "goto considered harmful" as an absolute to never use goto (C).

branko_d

“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”

kcartlidge

How dare you bring full quotes and context to a discussion :)

Seriously true, though. What's usually missed is the first bit, about "small efficiencies", combined with the last bit, the "critical 3%". He's basically saying don't spend time in the mass of weeds until you know where the problems really are, but conversely take the time to optimise the smaller areas that are performance-critical.

Almost the opposite of the carte-blanche overly-minimal MVPs often thrown together by those of us who forget what the V stands for.

zinclozenge

So is premature pessimization.

matchagaucho

Tuple table performance, though, tends to degrade quickly when querying separate rows for every little attribute.

More memory and row caching in RAM have allowed for a resurgence in denormalized schemas.

FB's underlying schema is similarly just a few tables. But the entire dataset is memcached.

marginalia_nu

> Tuple table performance, though, tends to degrade quickly when querying separate rows for every little attribute.

Interesting. Do you know what causes this steep decline in performance?

A larger sequential read is typically not that much slower regardless on what sort of drive you are using. I'd have expected that the random access pattern of a normalized table would lose out.

swyx

dang

Thanks! Macroexpanded:

Reddit's database has only two tables - https://news.ycombinator.com/item?id=4468265 - Sept 2012 (79 comments)

zegl

We’re doing something similar at Codeball.

A single Aurora RDS database (with read replicas), with a single table called “blobs”. The blobs table has three columns: id, type, and json.

It’s scaling really really well, to many many millions of rows, and high traffic volumes (many thousands of writes per second).
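i.e. roughly this shape (a sketch, not our exact DDL):

  CREATE TABLE blobs (
      id   BIGINT PRIMARY KEY,
      type TEXT   NOT NULL,   -- what kind of object the blob holds
      json JSONB  NOT NULL    -- the object itself
  );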

Thorentis

Why even use RDS? There is nothing relational about that schema. Just use DynamoDB. It'll be cheaper, you can actually query the JSON, and if you decide to flesh out the documents and go all in on NoSQL then you can do that easily.

zegl

It's PostgreSQL, so we can and do query the JSON data, and update deeply nested fields in the JSON natively as well (no need to take a lock, read the entire blob, modify it in the application, save the data, and release the lock).

PostgreSQL supports setting values, appending to arrays, etc etc natively in an ACID compliant way.
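For instance, flipping one deeply nested field is a single statement (made-up field names):

  UPDATE blobs
  SET    json = jsonb_set(json, '{settings,notifications,email}', 'false')
  WHERE  id = $1;   -- $1 = the blob's id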

paavohtl

> so we can and do query the JSON data

Also supported in DynamoDB - it's a JSON-database after all.

> update deepely nested fields in the JSON natively as well

https://docs.aws.amazon.com/amazondynamodb/latest/developerg...

> PostgreSQL supports setting values, appending to arrays, etc etc natively in an ACID compliant way.

https://aws.amazon.com/blogs/aws/new-amazon-dynamodb-transac...

I'm not saying PostgreSQL is a bad choice by any means, or that DynamoDB is suitable for all projects. But good support for JSON is in no way a capability unique to PostgreSQL. At least ideally, a competently built document database should be at least as good if not better than an SQL database emulating a document database.

null

[deleted]

commandlinefan

> Just use DynamoDB

Or, at that point, just use a CSV and slurp the whole thing into memory.

poulsbohemian

Here's where I've never been smart enough to use this model... help me explain how you deal with scenarios like this...

1) There's an incident and we are trying to replicate the issue, but our database is just shapeless blobs and we can't track down the data that is pertinent to resolving the issue.

2) Product management wants to better understand our users and what they are doing so that they can design a new feature. But they can't figure out how to query our database or how to make sense of the unstructured data.

3) QA wants to be able to build tests and set up test data sets that match those tests, but rather than a sql script, they've got to roll all this json that doesn't have any constraints.

These are just some examples, but maybe you get the gist of why I don't quite understand how this model scales "outside" the database.

zegl

1) I think the biggest difference in how you use KV/NoSQL databases is that your objects tend to be much bigger than a "normal" SQL row. Each object can be a parent entity, plus many sub-objects and sub-arrays of information. In a relational model, that data would have to go in other tables that you'd have to join, etc. I believe that this model makes things much easier for us, as all you need to debug a single failure usually is stored in a single object, instead of spread out across the entire database.

2) Very true. We're using other tools (like PostHog) for collecting and analysing user behaviour. At a bigger scale, we'd probably ship off event data to something like BigQuery for analysis.

3) We don't have QA, but for testing, objects are created in code just like any other object, and unit tests (aka most of the tests), don't need the database at all. :-)

wizwit999

Why not just use DynamoDB at that point, would likely be cheaper.

senttoschool

Yea I agree. Seems like it should have been a NoSQL solution from that start.

Even today, it should be pretty straightforward to port the data over to a NoSQL db.

manfre

Dynamodb can be very painful with certain usage patterns or if you need to change/add usage patterns. For startups or MVP apps/features, it's a lot easier and cheaper to adhoc SQL.

ianberdin

Sorry but I do not believe in this magic.

How many queries in reality do you need to fetch all the "page" data?

zegl

One! PostgreSQL has great support for indexing JSON data.
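e.g. something like this, continuing the hypothetical blobs shape sketched above:

  -- A GIN index makes containment (@>) queries on the json column index-backed.
  CREATE INDEX blobs_json_gin ON blobs USING GIN (json jsonb_path_ops);

  SELECT id, json
  FROM   blobs
  WHERE  json @> '{"status": "active"}';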

ianberdin

Looks like you built a Hello World website? Sorry if I am wrong.

blowski

Sounds like an entity, attribute, value type schema? But where the value is nested data.

capitol_

This reminds me of when the /r/counting subreddit broke reddit:

https://www.reddit.com/r/counting/comments/ww3vr/i_am_the_be...

Their data model was not designed for very deep comment chains.

asddubs

I wonder why they need to compute the complete comment tree to begin with

1penny42cents

How has this evolved to today? Any growing pains or learning lessons? What stayed good about this as the team and product scaled?

jiggawatts

This design is a fault of traditional RDBMS products that have strictly imperative and synchronous DDL commands.

The physical schema ought not be a transactional change, but it is in most products, with locks and everything. This isn't necessary, it's just an accident of history that nobody has gotten around to changing.

For example:

Rearranging the column order (or inserting a column) should just be an instant metadata update. Instead the entire table is rewritten... synchronously.

Adding NOT NULL columns similarly requires the value to be populated for every row at the point in time when the DDL command is run. Instead, the default function should be added to the metadata, and then the change should roll out incrementally over time.

Etc, etc...

Whatever it is that programmers do with simple "objectid-columnname-value" schemas to make changes "instant" is what RDBMS platforms could also do, but don't.

rastignack

> Adding NOT NULL columns similarly requires the value to be populated for every row at the point in time when the DDL command is run. Instead, the default function should be added to the metadata, and then the change should roll out incrementally over time.

You mean just like MySQL, Oracle, Sybase and MSSQL do?

avereveard

"Amazon DynamoDB Deep Dive" https://www.youtube.com/watch?app=desktop&v=HaEPXoXVf2k&list...

why and how of the single table design approach

__derek__

That's exactly what I thought of when reading the title.