Brian Lovin / Hacker News

skrebbel

I'm appalled at the way some people here receive an honest postmortem of a human fuck-up. The top 3 comments, as I write this, can be summarized as "no, it's your fault and you're stupid for making it".

This is not good! We don't want to scare people into writing fewer of these. We want to encourage people to write more of them. An MBA-style "due to a human error, we lost a day of your data, we're tremendously sorry, we're doing everything in our power yadayada" isn't going to help anybody.

Yes, there are all kinds of things they could have done to prevent this from happening. Yes, some of the things they did (or didn't) do were clearly mistakes that a seasoned DBA or sysadmin would not make. Possibly they aren't seasoned DBAs or sysadmins. Or they are, but they still made a mistake.

This stuff happens. It sucks, but it still does. Get over yourselves and wish these people some luck.

t0mas88

The software sector needs a bit of aviation safety culture: 50 years ago the conclusion "pilot error" as the main cause was virtually banned from accident investigation. The new mindset is that any system or procedure where a single human error can cause an incident is a broken system. So the blame isn't on the human pressing the button, the problem is the button or procedure design being unsuitable. The result was a huge improvement in safety across the whole industry.

In software there is still a certain arrogance of quickly calling the user (or other software professional) stupid, thinking it can't happen to you. But in reality given enough time, everyone makes at least one stupid mistake, it's how humans work.

janoc

It is not only that but also realizing that there is never a single cause to an accident or incident.

Even when it was a suicidal pilot flying the plane into a mountain on purpose. Someone had to supervise him (there are two crew members in the cockpit for a reason), someone gave him a medical, there is automation in the cockpit that could have at least caused an alarm, etc.

So even when the accident is ultimately caused by a pilot's actions, there is always a chain of events where if any of the segments were broken the accident wouldn't have happened.

While we can't prevent a bonkers pilot from crashing a plane, we could perhaps prevent a bonkers crew member from flying the plane in the first place.

Aka the Swiss cheese model. You don't want to let the holes align.

This approach is widely used in accident investigations, and not only in aviation. Most industrial accidents are investigated like this, trying to understand the entire chain of events so that processes can be improved and the problem prevented in the future.

Oh and there is one more key part in aviation that isn't elsewhere. The goal of an accident or incident investigation IS NOT TO APPORTION BLAME. It is to learn from it. That's why pilots in airlines with a healthy safety culture are encouraged to report problems, unsafe practices, etc. and this is used to fix the process instead of firing people. Once you start to play the blame game, people won't report problems - and you are flying blind into a disaster sooner or later.

jonplackett

It’s interesting that this is the exact opposite of how we think about crime and punishment. All criminals are like the pilot, just the person who did the action. But the reasons for their becoming criminals are seldom taken into account. The emphasis is on blaming and punishing them rather than figuring out the cause and stopping it happening again.

globular-toast

There is sometimes a single cause, but as the parent comment pointed out, that should never be the case and is a flaw in the system. We are gradually working towards single errors being correctable, but we're not there yet.

On the railways in Britain the failures were extensively documented. Years ago it was possible for a single failure to cause a loss. But over the years the systems have been patched and if you look at more recent incidents it is always a multitude of factors aligning that cause the loss. Sometimes it's amazing how precisely these individual elements have to align, but it's just probability.

As demonstrated by the article here, we are still in the stage where single failures can cause a loss. But it's a bit different because there is no single universal body regulating every computer system.

pc86

I was on a cross-country United flight ca. 2015 or so, happened to be sitting right at the front of first class, and got to see the pilots take a bathroom break (bear with me). The process was incredibly interesting.

1. With the flight deck door closed, the three flight attendants place a drink cart between first class and the attendant area/crew bathroom. There's now a ~4.5' barrier locked against the frame of the plane.

2. The flight deck door is opened; one flight attendant goes into the flight deck while one pilot uses the restroom. The flight deck door is left open but the attendant is standing right next to it (but facing the lone pilot). The other two attendants stand against the drink cart, one facing the passengers and one facing the flight deck.

3. Pilots switch while the third attendant remains on the flight deck.

4. After both pilots are done, the flight deck door is closed and locked and the drink cart is returned to wherever they store it.

Any action by a passenger would cause the flight deck door to be closed and locked. Any action by the lone pilot would cause alarm by the flight deck attendant. Any action by the flight deck attendant would cause alarm by the other two.

quietbritishjim

> Even when it was a suicidal pilot flying the plane into a mountain on purpose. Someone had to supervise him (there are two crew members in the cockpit for a reason), someone gave him a medical, there is automation in the cockpit that could have at least caused an alarm, etc.

There was indeed a suicidal pilot who flew into a mountain; I'm not sure if you were deliberately referencing that specific case. He was alone in the cockpit – that was only meant to be brief, but he was able to lock the cockpit door before anyone re-entered, and the lock cannot be opened by anyone from the other side, in order to avoid September 11th-type situations. It only locks for a brief period, but it can be reapplied from the pilot's side an indefinite number of times before it expires.

I'm not saying that we can put that one down purely to human action, just that (to be pedantic) he wasn't being supervised by anyone, and there were already any number of alarms going off (and the frantic crew on the other side of the door were well aware of them).

odyssey7

My impression of the Swiss cheese model is that it's used to take liability from the software vendor and (optionally) put it back on the software purchaser. Sure, there was a software error, but really, Mr. Customer, if this was so important, then you really should have been paying more attention and noticed the data issues sooner.

oblio

The user? Start a discussion about using a better programming language and you'll see people, even here, blaming the developer.

The common example is C: "C is a sharp tool, but with a sufficiently smart, careful and experienced developer it does what you want" ("you're holding it wrong").

Developers still do this to each other.

m463

That reminds me of the time, during the rise of the PC, when Windows would do something wrong, from a confusing interface all the way up to a blue screen of death.

What happened is that users started blaming themselves for what was going wrong, or started thinking they needed a new PC because problems would become more frequent.

From the perspective of a software guy, it was obvious that Windows was the culprit, but people would assign blame elsewhere and frequently point the finger at themselves.

So yes - an FAA-style investigation would end up unraveling the nonsense and pointing to Windows.

That said, aviation-level safety means reliability, dependability, few single points of failure, and... there are no private kit jets, darnit!

There is a continuum from "nothing changes & everything works" to "everything changes & nothing works". You have to choose the appropriate place on the dial for the task. Sounds like this is a one-man band.

watwut

Yeah, but "pilot was drinking alcohol" would be considered an issue, would lead to the pilot being fired, and would lead to more alcohol testing.

I understand what you are talking about, but aviation also has strong expectations of pilots.

janoc

Of course it would. But then there should be a process that identifies such a pilot before they even get on the plane. There are two crew in the cockpit, so if one crew member does something unsafe or inappropriate, the other person is there to notice it, call it out and, in the extreme case, take control of the plane.

Also, if the guy or gal has alcohol problems, it would likely be visible in their flying performance over time, it should be noticed during the periodic medicals, etc.

So while a drunk pilot could be the immediate cause of a crash, it is not the only one. If any of those other things I have mentioned had functioned as designed (or had been in place to start with - not all flying is airline flying!), the accident wouldn't have happened.

If you focus only on the "drunk pilot, case closed", you will never identify deficiencies you may have elsewhere and which have contributed to the problem.

t0mas88

Believe it or not, even "pilot is an alcoholic" is still part of the no-blame culture in aviation. As long as the pilot reports himself, he won't be fired for it. Look up the HIMS program to read more details.

neillyons

This sounds quite interesting. Any books you could recommend on the "pilot error" topic?

miketery

When I was getting my pilot's license I used to read accident reports from Canada's Transportation Safety Board [1]. I'm sure the NTSB (America's version) has similar-calibre reports [2].

There is also Cockpit Resource Management [3], which addresses the human factor in great detail (how people work with each other, and how prepared people are).

In general, what you learn from reading these things is that it's rarely one big error or issue - but many small things leading to the failure event.

1 - https://www.tsb.gc.ca/eng/rapports-reports/aviation/index.ht...

2 - https://www.ntsb.gov/investigations/AccidentReports/Pages/Ac...

3 - https://en.wikipedia.org/wiki/Crew_resource_management

Jugurtha

"The Checklist Manifesto", by Atul Gawande, dives into how they looked at other sectors such as aviation to improve healthcare systems, reduce infections, etc. Interesting book.

permarad

The Design of Everyday Things by Donald A. Norman. He covers pilot error a lot in this book in how it falls back on design and usability. Very interesting read.

jsmith45

Not sure about books, but the NTSB generally seems to adopt the philosophy of not trying to assign blame, but instead to figure out what happened, and try to determine what can be changed to prevent this same issue from happening again.

Of course trying to assign blame is human nature, so the reports are not always completely neutral. When I read the actual NTSB report for Sullenberger's "Miracle on the Hudson", I was forced to conclude that while there were some things that the pilots could in theory have done better, given the pilots' training and documented procedures, they honestly did better than could reasonably be expected. I am nearly certain that some of the wording in the report was carefully chosen to lead one to this conclusion, despite still pointing out the places where the pilots' actions were suboptimal (and thus appearing facially neutral).

The "what can we do to avoid this ever happening again?" attitude applies to real air transit accident reports. Sadly, many general aviation accident reports really do just become "pilot error".

benkelaar

This is also one of the core tenets of SRE. The chapter on blameless postmortems is quite nice: https://landing.google.com/sre/sre-book/chapters/postmortem-...

jnsaff2

Anything by Sidney Dekker. https://sidneydekker.com/books/ I would start with The Field Guide to Understanding 'Human Error'. It's very approachable and gives you a solid understanding of the field.

janoc

Not sure about books, but look up the Swiss cheese model. It is a widely used approach, and not only in aviation. Most industrial accidents and incidents are investigated with this in mind.

AdrianB1

As a GA pilot I know people who have had accidents with planes, and I know that in most cases what's in the official report and what really happened are not the same, so any book would have to rely on inaccurate or unreal data. For airliners it is easy because there are flight recorders; for GA it is still a bit of a Wild West.

ENOTTY

The idea that multiple failures must occur for catastrophic failure is found in certain parts of the computing community. https://en.wikipedia.org/wiki/Defense_in_depth_(computing)

suzakus

It's a piece of software for scoreboards. Not the Therac-25, nor an airplane.

qz2

This time.

Some days it’s just an online community that gets burned to the ground.

Other days it’s just a service tied into hundreds of small businesses that gets burned to the ground.

Other days it’s a massive financial platform getting burned to the ground.

I’m responsible for the latter but the former two have had a much larger impact for many people when they occur. Trivialising the lax administrative discipline because a product isn’t deemed important is a slippery slope.

We need to start building competence into what we do, regardless of what it is, rather than run on apologies because it’s cheaper.

ozim

The parent is not advocating going as strict with procedures as operating an airplane. The post talks about "a bit of aviation safety culture" and then highlights a specific part that would be useful.

The safety culture element highlighted is: not blaming a single person, but finding out how to prevent the accident that happened from happening again. Which is reasonable, because you don't want to impose strict rules that are expensive up front. This way you just introduce measures to prevent the same thing in the future, in the context of your project.

t0mas88

It isn't about the importance of this one database, it's about the cultural issue in most of the sector that the parent comment was pointing out: we far too often blame the user/operator, calling them stupid, while every human makes mistakes; it's inevitable.

ganafagol

It's good to have a post mortem. But this was not actually a post mortem. They still don't know how it could happen. Essentially, how can they write "We’re too tired to figure it out right now." and right afterwards attempt to answer "What have we learned? Why won’t this happen again?" Well, obviously you have not learned the key lesson yet, since you don't know what it is! And how can you even dream of claiming to guarantee that it won't happen again before you know the root cause?

Get some sleep, do a thorough investigation, and the results of that are the post mortem that we would like to see published and that you can learn from.

Publishing some premature thoughts without actual insight is not helping anybody. It will just invite the hate that you are seeing in this thread.

ordu

> I'm appalled at the way some people here receive an honest postmortem of a human fuck-up. The top 3 comments, as I write this, can be summarized as "no, it's your fault and you're stupid for making the fault".

It seems that people are annoyed mostly by the "complexity gremlins". They are so annoyed that they miss the previous sentence: "we’re too tired to figure it out right now." The guys fucked up their system, restored it the best they could, and tried to figure out what happened, but failed. So they decided to do PR right now, to explain what they know, and to continue the investigation later.

But people see just "complexity gremlins". The lesson learned: do not try any humor in a postmortem. Be as serious, grave, and dull as you can.

rawgabbit

For me, this is an example of DevOps being carried too far.

What is to stop developers from checking "drop database; drop table; alter index; create table; create database; alter permission;" into GitHub? They are automating environment builds, and so that is more efficient, right? In my career, I have seen a Fortune 100 company's core system down and out for a week because of hubris like this. In large companies, data flows downstream from a core system. When you have to restore from backup, that cascades into restores in all the child systems.

Similarly, I once had to convince a Microsoft Evangelist who was hired into my company not to redeploy our production database every time we had a production deployment. He was a pure developer and did not see any problem with dropping the database, recreating the database, and re-inserting all the data. I argued that a) this would take 10+ hours, and b) the production database has data going back many years and the schema/keys/rules/triggers have evolved during that time - meaning that many of the inserts would fail because they didn't meet the current schema. He was unconvinced, but luckily my bosses overruled him.

My bosses were business types and understood accounting. In accounting, once you "post" a transaction to the ledger, it becomes permanent. If you need to correct that transaction, you create a new one that "credits" or corrects the entry. You don't take out the eraser.
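In database terms, that ledger discipline can be sketched as an append-only table where a correction is a new row, never an UPDATE or DELETE. A minimal illustration (the table and column names here are made up, not from the comment):

```sql
-- Append-only, ledger-style corrections (hypothetical schema).
CREATE TABLE ledger (
    id      serial  PRIMARY KEY,
    account text    NOT NULL,
    amount  numeric NOT NULL,
    memo    text
);

-- An entry posted in error...
INSERT INTO ledger (account, amount, memo)
VALUES ('acme', 120.00, 'invoice 4711');

-- ...is never edited in place. Instead, insert a reversing entry:
INSERT INTO ledger (account, amount, memo)
VALUES ('acme', -120.00, 'reversal of invoice 4711, posted in error');

-- Balances are derived, so the full history stays intact:
SELECT sum(amount) FROM ledger WHERE account = 'acme';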

bromuro

I think you should wait 10+ hours to read a different kind of comment on HN.

For example, if I open the comments on a “14 hours ago” post, I usually see a top comment about other comments (like yours).

I then feel so out of the loop, because I don’t see the commenters you are referring to - so the thread that follows seems off-topic to me.

caspii

Thanks

qz2

I disagree.

Culturally speaking, we like to pat people on the back when they do something stupid and comfort them. But most of the time this isn’t productive, because it doesn’t instil the requisite fear when working out what decision to make.

What happens is we have growing complacency and disassociation from consequences.

Do you press the button on something potentially destructive because you are confident it is OK through analysis, good design and testing, or because you are confident it is OK through trite complacency?

The industry is mostly the latter and it has to stop. And the first thing is calling bad processes, bad software and stupidity out for what it is.

Honestly, these guys did good, but most will try and hide this sort of fuck-up or explain it away with weasel words.

jurre

You should have zero fear instilled when pressing any button. The system or process has failed if a single person pressing a button can bring something down unintentionally. Fix the system/process; don’t “instill fear” in the person. It’s toxic, plus now you have to make sure every new person onboarded has “fear instilled”, and that’s just hugely unproductive.

qz2

That’s precisely my point. A lot of people have no fear because they’re complacent or ignorant, not because the button is well engineered.

But to get there you need to fear the bad outcomes.

inglor_cz

I recalled Akimov pressing the AZ-5 button in Chernobyl...

ClumsyPilot

"fear required when working out what decision to make"

People like you keep making the same mistake, creating companies/organisations/industries/societies that run on fear of failure. We've tried it a thousand times, and it never works.

You can't solve crime by making every punishment a harsh death sentence; we tried that in 1700s Britain and the crime rate was sky-high.

This culture gave us disasters in the USSR and famine in China.

The only thing that can solve this problem is structural change.

qz2

I think my point is probably being misunderstood, and that is my fault for explaining it poorly. See, I fucked up :)

The fear I speak of is a personal barrier which is lacking in a lot of people. They can sleep quite happily at night knowing they did a shitty job and it's going to explode down the line. It's not their problem. They don't care.

I can't do that. Even if there are no direct consequences for me.

This is not because I am threatened, but because I have some personal and professional standards.

johnisgood

> The only thing that can solve this problem is structural change.

Well, care to elaborate on this? What do we have to change, and to what end?

MaxBarraclough

You're speaking to the mistake. The comment you're replying to is speaking to the write-up analysing the mistake.

Blog posts analysing real-world mistakes should not be met with beratement.

corobo

Most will hide it away because being truthful will hurt current business or future career prospects because people like yourself exist who want everyone shitting themselves at the prospect of being honest.

In a blame-free environment you find the underlying issue and fix it. In a blame-full environment you cover up the mistake to avoid being fired, and some other person does it again later down the line.

qz2

No.

There’s a third option, where people accept responsibility and are rewarded for that rather than hiding from it one way or another.

I have a modicum of respect for people who do that. I don’t for people who shovel it under a rug or point a finger, which are the two options you are referring to. I’ve been in both environments, and neither ends up with a positive outcome.

If I fuck up I’m the first person to put my hand up. Call it responsibility culture.

hightekredneck

Exactly right.

The "comfort" will come from taking responsibility and owning and correcting the problem such that you have confidence it won't happen again.

Platitudes to make someone feel better without action helps nobody.

The fear is a valuable feedback mechanism and shouldn't be ignored. It's there to remind you of the potential consequences of a careless fuckup.

Lots here misunderstood this, I think. Clearly the point is not to berate people for making mistakes or to foster a "fear culture" in the sense of fear of personal attack, but rather not to ignore the internal, personal fear of fucking up because you give a shit.

michelpp

> Computers are just too complex and there are days when the complexity gremlins win.

I'm sorry for your data loss, but this is a false and dangerous conclusion to draw. You can avoid this problem. There are good suggestions in this thread, but I suggest you use Postgres's permission system to REVOKE the DROP action on production from everyone except a very special user that can only be logged in by a human, never a script.

And NEVER run your scripts or application servers as a superuser. This is a dangerous antipattern embraced by many an ORM and library. Grant CREATE and DROP only to non-superusers.
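A minimal sketch of that setup, assuming a database named `prod` and an application role named `app_rw` (both names are illustrative). Because dropping an object in Postgres requires owning it (or superuser rights), an application role that owns nothing cannot DROP anything:

```sql
-- Application role: data access only, no DDL, no ownership.
CREATE ROLE app_rw LOGIN PASSWORD 'change-me';
GRANT CONNECT ON DATABASE prod TO app_rw;
GRANT USAGE ON SCHEMA public TO app_rw;
GRANT SELECT, INSERT, UPDATE, DELETE
    ON ALL TABLES IN SCHEMA public TO app_rw;

-- Destructive rights live only with a human-operated role.
CREATE ROLE dba_human LOGIN PASSWORD 'change-me-too' CREATEDB;
```

The point is that the credentials baked into scripts and app servers belong to `app_rw`, which can neither create nor drop databases or tables.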

sushshshsh

As a mid level developer contributing to various large corporate stacks, I would say the systems are too complex and it's too easy to break things in non obvious ways.

Gone are the days of me just being able to run a simple script that accesses data read-only and exports the result elsewhere as an output.

Nextgrid

This is why I am against the current trend of over-complicating stacks for political or marketing reasons. Every startup nowadays wants microservices and/or serverless and a mashup of dozens of different SaaS products (some of which can't easily be simulated locally) from day 1, while a "boring" monolithic app would get them running just fine.

tamrix

I think we're hitting peak tech. All this "technical" knowledge just dates itself within a year anyway.

Eventually, you come to realise that the more tech you've got, the more problems you have.

Now developers spend more time googling errors and plugging in libraries and webservices together than writing any actual code.

Sometimes I wish for a techless, cloudless revolution where we just go back to the foundations of computers and use plain text wherever possible.

csomar

For the most part, we are not complicating stuff. Today's requirements are complicated. We used to operate from the command line on a single processor. Now things are complicated: people expect a web UI, high availability, integration with their phone, email notifications, 2FA authentication, and then you have things like SSL/HTTPS and compliance, and you need to log the whole thing for errors or compliance or whatever.

Sometimes it's simpler to go back to a commandline utility, sometimes it's not.

sushshshsh

Yup, 100% agree. It may be that you will eventually need an auto-scalable message queue and API gateway, but for most people a web server and a CSV will serve the first thousand customers.

ivan_gammel

Microservices are easy to build and throw away these days. In startups time to market is more important than future investment in devops. In terms of speed of delivery they are not worse than monolithic architecture (also not better). For similar reasons SaaS is and must be the default way to build IT infrastructure, because it has much lower costs and time to deploy, compared to custom software development.

greggman3

What? I thought it was just the opposite. The advantage of serverless is that I pay AWS to make backups so I don't have to. I mean, under time pressure, if I do it myself I'll skip making backups, setting permissions perfectly, and making sure I can actually restore those backups. If I go with a managed service, the whole point is that they already do those things for me. No?

moksly

Why are those days gone? I do it all the time in an organisation with 10,000 employees. I obviously agree with the parent poster that you should only do such things with users that have only the right amount of access, but that’s what having many users and schemas is for. I don’t, however, see why you’d ever need to complicate your stack beyond a simple Python/PowerShell script, a correct SQL setup, an official SQL driver and maybe a few stored procedures.

I build and maintain our entire employee database with a Python script, from a weird non-standard XML-“like” daily dump from our payment system, and a few web services that hold employee data in other required systems. Our IT then builds/maintains our AD from a few PowerShell scripts, and finally we have a range of “micro services” that are really just independent scripts that send user data changes to the 500 systems that depend on our central record.

Sure, sure, we’re moving it to Azure services for better monitoring, but basically it’s a few hundred lines of scripting that, combined with AD and ADDS, does more than an IDM with a 1-million-USD-a-year license.

theamk

Why are these days gone?

Just a few weeks ago, I set up a read-only user for myself and moved all modify permissions to a role one must explicitly assume. It really helped my peace of mind while developing the simple scripts that access data read-only. This was on our managed AWS RDS database.
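In Postgres, that pattern can look like the following, assuming a read-only login role `dev_ro` and a non-login role `writer` that holds the modify permissions (both names are hypothetical):

```sql
-- Write permissions live on a role nobody logs in as directly.
CREATE ROLE writer NOLOGIN;
GRANT INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO writer;

-- The day-to-day login role is read-only, but is allowed to assume writer.
CREATE ROLE dev_ro LOGIN PASSWORD 'change-me';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO dev_ro;
GRANT writer TO dev_ro;

-- Within a session, escalation is a deliberate act:
SET ROLE writer;
-- ...run the intended modification...
RESET ROLE;
```

Scripts run as `dev_ro` by default and simply cannot modify anything unless they explicitly `SET ROLE` first.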

akerro

I'm in a similar position to you, but I'd say systems are as complex as their designers made them, and it's on you to change that.

auroranil

Tom Scott made a mistake with a similar outcome to this article's, but with an SQL query that is much more subtle than a DROP.

https://www.youtube.com/watch?v=X6NJkWbM1xk

By all means, find ways to fool-proof the architecture. But be prepared for scenarios where some destructive action happens to a production database.

heavenlyblue

He would not have done that if he had simply been using a database transaction for the operation.
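For anyone unfamiliar with the pattern: keep destructive statements inside an open transaction and sanity-check the result before committing (the table and column names below are made up for illustration):

```sql
BEGIN;

UPDATE users SET plan = 'free';   -- oops: the WHERE clause is missing

-- Sanity check before committing: the count shows far more rows
-- affected than expected, so something is wrong.
SELECT count(*) FROM users WHERE plan = 'free';

ROLLBACK;   -- nothing has reached the permanent state; no harm done
-- Had the count looked right, COMMIT would make the change permanent.
```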

madbkarim

That’s exactly the point he’s trying to get across with that video.

thih9

> You can avoid this problem.

The article isn’t claiming that the problem is impossible to solve.

On the contrary: “However, we will figure out what went wrong and ensure that this particular error doesn’t happen again.”.

DelightOne

If you use terraform to deploy the managed production database, do you use the postgresql terraform provider to create roles or are you creating them manually?

bsder

> You can avoid this problem.

No, you can't. No matter how good you are, you can always "rm -rf" your world.

Yes, we can make it harder, but, at the end of the day, some human, somewhere, has to pull the switch on the stuff that pushes to prod.

You can clobber prod manually, or you can accidentally write an erroneous script that clobbers prod. Either way, prod is toast.

The word of the day is "backups".

ikiris

Excuse me, but no. This is harmful bullshit.

Yes, backups are vitally important, but no, it is not possible to accidentally rm -rf with proper design.

It's possible to have the most dangerous credentials possible and still make it difficult to do catastrophic global changes. Hell, it's my job to make sure this is the case.

thunderrabbit

> not possible to accidentally rm -rf with proper design.

Can you say more about this?

I understand rm -rf, but I'm not sure how I could design that to be impossible for the most dangerous credentials.

fuzxi

Difficult, but not impossible. Which was the point, I think.

centimeter

> but this is a false and dangerous conclusion to make

Until we get our shit together and start formally verifying the semantics of everything, their conclusion is 100% correct, both literally and practically.

oppositelock

You have to put a lot of thought into protecting and backing up production databases, and backups are not good enough without regular testing of recovery.

I have been running Postgres in production supporting $millions in business for years. Here's how it's set up. These days I use RDS in AWS, but the same is doable anywhere.

First, the primary server is configured to send write-ahead logs (WAL) to a secondary server. What this means is that before a transaction completes on the master, the slave has written it too. This is a hot spare in case something happens to the master.

Secondly, WAL logs will happily contain a DROP DATABASE in them; they're just the transaction log and don't prevent bad mistakes, so I also send the WAL logs to backup storage via WAL-E. In the tale of horror in the linked article, I'd be able to recover the DB by restoring from the last backup and applying the WAL delta. If the WAL contains a "drop database", then some manual intervention is required to only play them back up to the statement before that drop.
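For readers unfamiliar with point-in-time recovery: in Postgres (12+), that manual intervention is expressed with recovery settings that replay archived WAL only up to a chosen moment. A sketch of the relevant configuration on a restored base backup, with an illustrative timestamp and assuming WAL-E was used for archiving:

```
# postgresql.conf on the restored base backup
# (plus an empty recovery.signal file to trigger recovery mode)
restore_command = 'wal-e wal-fetch "%f" "%p"'     # fetch archived WAL segments
recovery_target_time = '2020-06-08 22:14:00+00'   # a moment just before the DROP
recovery_target_action = 'promote'                # go live once the target is reached
```

Replay stops at the target time, so the DROP DATABASE recorded later in the WAL stream is never applied.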

Third is the question of access control for developers. Absolutely nobody should have write credentials for a prod DB except the prod services. If a developer needs to work with data to develop something, I have all these wonderful DB backups lying around, so I bring up a new DB from the backups, giving the developer a sandbox to play in and also testing my recovery procedure. Double win. Now, there are emergencies where this rule is broken, but that is an anomalous situation handled on a case-by-case basis, and I only let people who know what they're doing touch the live prod DB.

azeirah

Quick tip for anyone learning from this thread.

If you're using MySQL, it's called a binary log, not a write-ahead log; it was very difficult to find meaningful Google results for "MySQL WAL".

x87678r

Interesting, I immediately thought they would have a transaction log, I didn't think it would have the delete as well.

It's a real problem that we used to have trained DBAs who owned the data, where now devs and automatic tools are relied upon; there isn't a culture or toolset built up yet to handle it.

mr_toad

> I have all these wonderful DB backups lying around, so I bring up a new DB from the backups

It’s nice to have that capability, but some databases are just too big to have multiple copies lying around, or to be able to create a sandbox for everyone.

danellis

> after a couple of glasses of red wine, we deleted the production database by accident

> It’s tempting to blame the disaster on the couple of glasses of red wine. However, the function that wiped the database was written whilst sober.

It was _written_ then, but you're still admitting to the world that your employees do work on production systems after they've been drinking. Since they were working so late, one might think this was emergency work, but it says "doing some late evening coding". I think this really highlights the need to separate work time from leisure time.

cle

No. Your systems and processes should protect you from doing something stupid, because we’ve all done stupid things. Most stupid things are done whilst sober.

In this case there were like 10 relatively easy things that could have prevented this. Your ability to mentally compile and evaluate your code before you hit enter is not a reliable way to protect your production systems.

Coding after drinking is probably not a good idea of course, but “think better” is not the right takeaway from this.

TedDoesntTalk

> Coding after drinking is probably not a good idea

I’ve done some of my most productive work this way. Not on production systems fortunately, and not in a long time.

ultrarunner

Riding the Ballmer Peak is a precarious operation, but I simply cannot deny its occasional effectiveness.

nicoburns

A highly experienced developer in their 50s who I used to work with said that they used to regularly sit down and code with a pint. Until on one occasion they introduced some Undefined Behaviour into their application during one of these sessions and it took them 3 days to track down! Probably less of an issue with modern tooling. Still, it certainly makes me think twice before drinking and coding.

marcosdumay

Yes, it can be productive (depends on exactly what you are doing). But I imagine you revise your work while sober before you deploy it.

undefined

[deleted]

bigbubba

You know it's totally feasible to make a car that won't turn on for drunk people. Should those systems be installed on all cars, in pursuit of creating systems that don't permit stupid actions?

Maybe such a breathalyzer interlock could be installed on your workstation too. After all, your systems and processes should prevent you from stupid things.

KronisLV

Replace the breathalyzer with something less intrusive (a camera with AI that observes the person, AI with thermal imaging or air quality sensors, or another possibly-fictional-yet-believable piece of technology) and suddenly, in my eyes, the technology in this thought experiment becomes a no-brainer.

If there was more of a cultural pressure against drunk driving and actual mechanisms to prevent it that aren't too difficult to maintain and utilize, things like the Daikou services ( https://www.quora.com/How-effective-is-the-Japanese-daikou-s... ) would pop up and take care of the other logistical problems of getting your car home. And the world would be all the better for it, because of less drunk driving and accidents.

cle

Don’t be absurd. Of course there are costs and tradeoffs to guardrails, and you have to balance the tradeoffs based on your requirements.

This person had to publicly apologize to their customers. One or two low-cost guardrails could have prevented it and would probably have been worth the cost.

undefined

[deleted]

emodendroket

Honestly not a bad idea to install the interlocks on all cars.

watwut

> Most stupid things are done whilst sober.

That is because most companies these days have processes around drinking in the workplace, coming in drunk, and working drunk.

Most mistakes are made sober only in environments where drinking a couple of glasses of wine and then making a production change is considered unacceptable. In environments where people work drunk, mistakes are made drunk.

cranekam

The whole piece has a slightly annoying flippant tone to it. We were drunk! Computers just.. do this stuff sometimes! Better to sound contrite and boring in such a situation IMO.

Also I agree with other comments: doing some work after a glass or two should be fine because you should have other defences in place. “Not being drunk” shouldn’t be the only protection you have against disaster.

caspii

Yeah, I agree I'm being slightly flippant.

But it's just a side-project and I will continue late night coding with a glass of wine. I find it hugely enjoyable.

I would have a different mind-set if I was writing software for power stations as a professional.

alistairSH

But it's just a side-project and I will continue late night coding with a glass of wine.

Normally, this would be fine. But, it appears the site has paying members. Presumably, it's not "just a side-project" to them. You owe them better than tinkering with prod while tipsy.

danjac

The Exxon Valdez comes to mind - the company blamed the drunk captain, but this was just part of huge systemic failures and negligence.

Aeolun

If your captain feels the need to be drunk, you’ve probably made a few mistakes before it got to that point.

danielh

According to the about page, "the employees" consist of one guy working on it in his spare time.

bstar77

I am not a drinker myself (I drink 1-3 times a year), but in the past I have coded while slightly buzzed on a few occasions. I could not believe the level of focus I had. I never investigated it further, but I'm pretty sure alcohol's effects on our coding abilities are not nearly as bad as its effects on our motor skills. Imo, fatigue is far worse.

henearkr

When I was not yet a teetotaler, each time I hit my maths textbooks after a few drinks, I could not believe my level of focus, and everything seemed clear and obvious. The textbook pages flew by at a speed never seen.

Of course the next day, on re-reading the same pages, I would always discover that I had gotten everything wrong the previous day: nothing was obvious, and all the reasoning I had done under the influence was false, simplistic, and oblivious to any mathematical rigor.

centimeter

Similar effect with psilocybin or LSD - you think you had a really profound and insightful experience, but once you think back on it you realize that (most of the time) you just got the impression that it was profound and insightful.

ThrowawayR2

Not a great idea for studying anyway because of https://en.wikipedia.org/wiki/State-dependent_memory

In short, ability to recall memories is at least in part dependent on being in a similar state to the time when memories are formed, e.g. something learned while being intoxicated will be more easily recalled only when intoxicated again.

emodendroket

I've found just the opposite. Any booze at all and I basically can't work for hours.

physicles

Me too, partly because of the reduced short-term memory (which coding relies on heavily). But more than that, my motivation drops through the floor because I can’t shake the thought that life is too short to fight with computers all day.

3np

For me it's really hit and miss: sometimes it increases focus and motivation and invigorates the mind while reducing distractions; other times it has the opposite effect. Same with cannabis, although it took a decade or so of recreational use before I discovered/developed the ability to code well under the influence.

Though I'm talking one or two drinks here, not firing up vscode after a night out or going through a bottle of rum.

undefined

[deleted]

caspii

I have no employees. I only have myself to blame

beervirus

Sounds like they weren't trying to do work on production.

murillians

I think it's just a joke

yelloweyes

lol it's a scoreboard app

aszen

I had a narrow escape once doing something fancy with migrations.

We had several MySQL string columns typed as LONGTEXT in our database, but they should have been varchar(255) or so. I was assigned to convert these columns to their appropriate size.

Being the good developer I was, I decided to download a snapshot of the prod database locally and check the maximum string length of each column via a script. The script then generated a migration query that would alter the column types to match their maximum used length, with a floor of varchar(255).

I tested that migration and everything looked good; it passed code review and was run on prod. Soon after, we started getting complaints from users that their old email texts had been truncated. I then realized the stupidity of the whole thing: the local dump of the production database always wiped many columns clean for privacy, like the email body column. So the script thought that column had a max length of 0 and decided to convert it to varchar(255).

I realize the whole thing may look incredibly stupid; that's partly because the db column names were in a foreign European language, so I didn't even know the semantics of each column.

Thankfully my seniors managed to restore that column and took the responsibility themselves since they had passed the review.

We still fixed those unusually large columns, but this time with simple, explicit ALTER queries for each column instead of a fancy script.

I think a valuable lesson was learned that day to not rely on hacky scripts just to reduce some duplicate code.

I now prefer clarity and explicitness when writing such scripts instead of trying to be too clever and automating everything.

heavenlyblue

And you didn’t even bother to do a query of the actual maximum length value of the columns you were mutating? Or at least query and see the text in there?

Basically you just blindly ran the migration on the data and checked if it didn’t fail?

The lesson here is not about cleverness unfortunately.

aszen

I did see some values and found them reasonable. The problem was that there were at least 200 or so tables with dozens of columns each, and only two or so tables were excluded from being dumped locally.

So yes, I could have noticed the length of 0 if I had looked carefully amidst hundreds of rows, but since my faulty assumption that prod db = local db didn't even allow for this possibility, I didn't bother.

If it had been just 10 to 20 migration queries, that would have been a lot easier to validate, but then I wouldn't even have attempted to write a script.

mr_toad

> my faulty logic of prod db = local db didn't

It happens. “It worked in dev” is the database equivalent of “worked on my machine”.

detaro

The comment clearly states that they did.

heavenlyblue

If they did, they would have noticed that the columns were empty (because they were wiped clean of PII).

The parent is either misrepresenting the situation or they didn’t do what they say they did.

Also, in any production setup, before the migration and in the same transaction, you would have something along the lines of "check if the column size is larger than the new limit and then abort", because you never know what may be added while you're working on the database.
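
A guard like that can be sketched in a few lines of Python. The table and column names are hypothetical, and an in-memory sqlite3 database stands in for the real one, but the pattern is the same for any SQL database:

```python
import sqlite3

def assert_fits(conn, table, column, new_limit):
    """Abort a narrowing migration if any existing value exceeds the new limit."""
    row = conn.execute(f"SELECT MAX(LENGTH({column})) FROM {table}").fetchone()
    max_len = row[0] or 0
    if max_len > new_limit:
        raise RuntimeError(
            f"{table}.{column}: longest value is {max_len} chars, "
            f"refusing to shrink to {new_limit}"
        )

# demo against an in-memory stand-in for the production database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails (body TEXT)")
conn.execute("INSERT INTO emails VALUES ('a long email body ...')")
try:
    assert_fits(conn, "emails", "body", 10)   # 21 chars > 10: must abort
    aborted = False
except RuntimeError:
    aborted = True
print(aborted)
```

Run inside the migration's own transaction, this turns "silently truncate user data" into "migration fails loudly and rolls back".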

john_moscow

Just my 2 cents. I run a small software business that involves a few moderately-sized databases. The day I moved from fully managed hosting to a Linux VPS, I crontabbed a script like this to run several times a day:

    for db in `mysql [...] | grep [...]`
    do
        mysqldump [...] > $db.sql
    done
    
    git commit -a -m "Automatic backup"
    git push [backup server #1]
    git push [backup server #2]
    git push [backup server #3]
    git gc
The remote git repos are configured with denyNonFastForwards and denyDeletes, so regardless of what happens to the server, I have a full history of what happened to the databases, and can reliably go back in time.

I also have a single-entry-point script that turns a blank Linux VM into a production/staging server. If your business is more than a hobby project and you're not doing something similar, you are sitting on a ticking time bomb.

candiddevmike

Anyone reading the above: please don't do this. Git is not made for database backups, use a real backup solution like WAL archiving or dump it into restic/borg. Your git repo will balloon at an astronomical rate, and I can't imagine why anyone would diff database backups like this.

john_moscow

It really depends on your database size. This works just fine for ~300MB databases. Git gc takes pretty good care of the fluff and once every couple of years I reset the repository to prune the old snapshots. The big plus is that you can reuse your existing git infrastructure, so the marginal setup costs are minimal.

You can always switch to a more specialized solution if the repository size starts bugging you, but don't fall into the trap of premature optimization.

candiddevmike

Git GC won't do anything here unless you're deleting commits or resetting the repo constantly. Every commit will keep piling up, and you will never prune anything like you would a traditional backup tool. The day you do decide to start pruning things, expect your computer to burst into flames as it struggles to rewrite the commit history!

Using a real database backup solution isn't a premature optimization, it's basic system administration.

fauigerzigerk

>It really depends on your database size.

This isn't just about size though. You're storing all customer data on all developer machines. You're just one stolen laptop away from your very own "we take the security of our customers' data very seriously" moment.

jugg1es

Not every database is huge. It could be a good solution in certain circumstances.

wolfgang000

I don't believe having a massive repo with backups would be the ideal solution. Couldn't you just upload the backup to an s3 bucket instead?

Ayesh

This is what I do too.

The mysqldump command is tweaked to use individual INSERT statements as opposed to one bulk statement, so the diff hunks are smaller.

You can also sed out the mysqldump timestamp, so there are no commits if there are no database changes, saving git repo space.

mgkimsal

Any issues with the privacy aspect of that data that's stored in multiple git repos? PII and such?

john_moscow

These are private repos on private machines communicating over SSL on non-standard ports with properly configured firewall. The risk is always there, but it's low.

bufferoverflow

You should really compress them instead of dumping them raw into Git. LZ4 or ZStandard are good.

adzm

But then you don't have good diffs.

hinkley

Git repositories are compressed.

amingilani

Happens to all of us. Once I needed logs from a server. The log file was a few gigs and still in use, so I carefully duplicated it, grepped just the lines I needed into another file, and downloaded the smaller file.

During this operation, the server ran out of memory—presumably because of all the files I'd created—and before I knew it I'd managed to crash 3 services and corrupt the database—which was also on this host—on my first day. All while everyone else in the company was asleep :)

Over the next few hours, I brought the site back online by piecing commands together from the `.bash_history` file.

tempestn

Seems unwise to have an employee doing anything with production servers on their first day, let alone while everyone else is asleep.

amingilani

It does, but that was an exceptional role. The company needed emergency patches to a running product while they hired a whole engineering team. As such, I was the only one around doing things, and there wasn't any documentation for me to work off of.

I actually waited until nightfall just in case I bumped the server offline, because we had low traffic during those hours.

nullsense

What's the story behind this company/job? Was it some sort of total dumpster fire?

netheril96

Why does the DB get corrupted? Does ACID mean anything these days?

theamk

Not the original poster, but up until 2010 the default MySQL table type was MyISAM, which does not support transactions.

thdrdt

When a server runs out of memory a lot of strange things can happen.

It can even fail while in the middle of a transaction commit.

So transactions won't fix this.

amingilani

It was an older MongoDB in my case. :)

xtracto

This happened to me (someone on my team) a while ago, but with Mongo. The production database was ssh-tunneled to the default port on the guy's computer, and he ran tests that cleaned the database first.

Now... our scenario was such that we could NOT lose those 7 hours, because each lost customer record meant a $5000 USD penalty.

What saved us is that I knew about the oplog (the binlog, in MySQL terms), so after restoring the backup I isolated the last N lost hours from the log and replayed them on the database.

Lesson learned and a lucky save.

fma

Same happened to me many years ago. QA dropped the prod db. It's been many years but if I recall, I believe in the dropdown menu of the MongoDB browser, exit & drop database were next to each other...Spent a whole night replaying the oplog.

No one owned up to it, but had a pretty good idea who it was.

vanviegen

> No one owned up to it, but had a pretty good idea who it was.

That sounds like you're putting (some of) the blame on whoever misclicked. As opposed to everyone who has allowed this insanely dangerous situation to exist.

rocqua

Misclicking is a tiny forgivable mistake.

Not immediately calling up your boss to say "I fucked up big" is not a mistake, it is a conscious bad action.

stennie

Which MongoDB browser or admin tool are you referring to?

I haven't seen this design in practice using MongoDB Atlas or Compass, but would hope for an "Are you really sure?" confirmation in an admin UI.

3np

A dangling port-forward was my first thought to how this happened.

muststopmyths

>Note that host is hardcoded to localhost. This means it should never connect to any machine other than the developer machine. We’re too tired to figure it out right now. The gremlins won this time.

Obviously, somehow the script ran on the database host.

some practices I've followed in the past to keep this kind of thing from happening:

* A script that deletes all the data can never be deployed to production.

* scripts that alter the DB rename tables/columns rather than dropping them (you write a matching rollback script), for at least one schema upgrade cycle. You can always restore from backups, but this can make rollbacks quick when you spot a problem at deployment time.

* the number of people with access to the database in prod is severely restricted. I suppose this is obvious, so I'm curious how the particular chain of events in TFA happened.

amluto

I have a little metadata table in production with a field that says "this is a production database". The delete-everything script reads that flag via a SQL query that will error out if it's set, in the same transaction as the deletion. To prevent the flag from being cleared in production, the production software stack will refuse to run if the "production" flag is not set.
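
A minimal sketch of that pattern in Python, assuming hypothetical table and column names and using an in-memory sqlite3 database as a stand-in for the real one:

```python
import sqlite3

def refuse_if_production(conn):
    """First thing the wipe script does: bail out if the metadata flag is set."""
    (flag,) = conn.execute("SELECT is_production FROM metadata").fetchone()
    if flag:
        raise RuntimeError("metadata says this is production -- refusing to wipe")
    # ... destructive statements would follow here, in the same transaction ...

# demo: a database flagged as production refuses the wipe
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (is_production INTEGER NOT NULL)")
conn.execute("INSERT INTO metadata VALUES (1)")
try:
    refuse_if_production(conn)
    blocked = False
except RuntimeError:
    blocked = True
print(blocked)
```

Because the check and the deletion share a transaction, there is no window where the flag could be cleared between checking and wiping.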

jerf

This is also one place where defense-in-depth is useful. "Has production flag" OR "name contains 'prod'" OR "hostname contains 'prod'" OR "one of my interfaces is in the production IP range" OR etc. etc. You really can't have too many clauses there.

Unfortunately, the "wipe & recreate database" script, while dangerous, is very useful; it's a core part of most of my automated testing because automated testing wipes & recreates a lot.
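
The OR-everything idea can be captured in one small predicate; the "prod" naming convention and the address prefix below are illustrative assumptions, not anything from the article:

```python
def looks_like_production(db_name: str, host: str, prod_flag: bool,
                          prod_prefixes=("10.0.",)) -> bool:
    """Defense in depth: any single signal is enough to refuse a wipe."""
    return (
        prod_flag                        # explicit flag in the database
        or "prod" in db_name.lower()     # naming convention on the DB
        or "prod" in host.lower()        # naming convention on the host
        or host.startswith(prod_prefixes)  # production address range
    )

# the wipe-and-recreate helper would check this before doing anything
assert looks_like_production("myapp_prod", "localhost", False)
assert looks_like_production("myapp_test", "db.prod.internal", False)
assert looks_like_production("myapp_test", "localhost", True)
assert not looks_like_production("myapp_test", "localhost", False)
```

False positives just mean re-running the test suite against a renamed sandbox; a false negative means losing production, so err heavily toward refusing.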

asddubs

One silly last-resort measure I used on a project a while back was an IS_STAGING file that only existed on localhost. On every request, the server would check whether the hostname was that of the live site and, if so, delete the file. The file itself wasn't enough to make the server think it was in staging mode, but it was the only thing in the chain that, if it went wrong, would fix itself automatically almost immediately (and log an error).

asimjalis

I would flip the logic: if the database does not have a flag saying it is non-production, assume it is production.

amluto

It's a BOOLEAN NOT NULL. I don't recall off the top of my head whether TRUE means production or TRUE means testing.

at_a_remove

Nice. Very nice.

mcpherrinm

The blog mentions it's a managed DigitalOcean database, so the script likely wasn't run on the host itself.

More likely, I'd suspect, is something like an SSH tunnel with port forwarding was running, perhaps as part of another script.

StavrosK

Someone SSHed to production and forwarded the database port to the local machine to run a report, then forgot about the connection and ran the deletion script locally.

cutemonster

That has happened? Or was it a thought about what could have happened elsewhere?

StavrosK

Oh, no, that's my guess as to what happened here.

PeterisP

One aspect that can help with this is separate roles/accounts for dangerous privileges.

I.e. if Alice is your senior DBA who would have full access to everything including deleting the main production database, then it does not mean that the user 'alice' should have the permission to execute 'drop database production' - if that needs to be done, she can temporarily escalate the permissions to do that (e.g. a separate account, or separate role added to the account and removed afterwards, etc).

Arguably, if your DB structure changes generally are deployed with some automated tools, then the everyday permissions of senior DBA/developer accounts in the production environment(s) should be read-only for diagnostics. If you need a structural change, make a migration and deploy it properly; if you need an urgent ad-hoc fix to data for some reason (which you hopefully shouldn't need to do very often), then do that temporary privilege elevation thing; perhaps it's just "symbolic" but it can't be done accidentally.

jlgaddis

> the number of people with access to the database in prod is severely restricted

And of those people, there should be an even fewer number with the "drop database" privilege on prod.

Also, from a first glance, it looks like using different database names and (especially!) credentials between the dev and prod environments would be a good idea too.

unnouinceput

Quote: "Note that host is hardcoded to localhost. This means it should never connect to any machine other than the developer machine. Also: of course we use different passwords and users for development and production. We’re too tired to figure it out right now.

The gremlins won this time."

No they didn't. Instead, one of your gremlins ran this function directly against the production database. This isn't rocket science, just the common-sense conclusion. Now would be a good time to check those auditing/access logs you're supposed to have enabled on said production machine.

skytreader

> Instead one of your gremlins ran this function directly on the production machine.

Exactly my first hypothesis too. But then keepthescore claims,

> of course we use different passwords and users for development and production.

How would this hypothesis explain that?

---

Metadialogue:

John Watson: "I deduce that someone changed the dev config source so that it uses the production config values."

Sherlock Holmes: "My dear Watson, while that is sensible, it seems to me that the balance of probability leans towards their production instance also having the development access credentials."

---

Just my way of saying, I don't think this case is as open-and-shut as most comments (including the parent) imply. I personally find the /etc/hosts mapping a likelier hypothesis, but even that can't explain how different credentials failed to prevent this. Without more details coming from a proper investigation, we are just piling assumptions on top of assumptions. We are making bricks without enough clay, as Holmes would say.

thamer

Agreed, it seems like most people making suggestions above are missing the point about credentials. The code they present explicitly references `config.DevelopmentConfig`:

    database = config.DevelopmentConfig.DB_DATABASE
    user = config.DevelopmentConfig.DB_USERNAME
    password = config.DevelopmentConfig.DB_PASSWORD
One way this could happen is if the objects under `config` were loaded from separate files, and the dev file was changed to a symlink to the prod file. So `config.DevelopmentConfig` always loads /opt/myapp/config/dev.cfg but a developer had dev.cfg -> prod.cfg and the prod credentials and connection details were loaded into `config.DevelopmentConfig`.

Just an idea.

ir123

If managed DBs on DigitalOcean are anything like those on AWS, you can not run commands directly on them since SSH is prohibited. EDIT: there's also the deal with different credentials for dev and prod envs.

toredash

My 2 cents: his hosts file points localhost to the prod DB IP.

pvorb

Yep. Nowadays a kubectl port-forward makes something like this all too easy. They accidentally had the kubecontext point at the production cluster instead of dev, set up the port-forward to the database, and whoops! At least that's how this could happen to me, even with my years of experience in doing unexpected things to production databases.

junon

That whole ecosystem is a devexp nightmare. I try to stay away from it entirely having worked with it extensively.

Docker + Kubernetes are the biggest socially-acceptable hacks in the industry at the moment.

radu_floricica

I'm betting on a tunnel, myself. And grandparent is probably wrong, they most likely have dedicated mysql machines so "localhost" will never be the db.

kjaftaedi

I think so as well.

It doesn't even make sense to connect to a managed database using 'localhost'.

Managed databases are never localhost. They are hosted outside your VPS and you use a DNS name to connect to them.

drdaeman

That could've happened if the database is not accessible from the Internet and they were using a tunnel which binds to localhost (e.g. `ssh -L`).

It does make sense to connect to localhost on dev machines. But if that's the setup, I guess one should avoid tunneling to localhost, to avoid potentially dangerous confusion (hmm... I think I'm guilty of that on a couple of projects, need to check if that's true...)

dingdingdang

Yeah, my guess would be that the script got executed on the prod server via an "oops, was I in that terminal window?" accident. Localhost is, after all, the local host, no matter what server it's on. Better to also have a clear naming convention for the prod db versus the test db (i.e. "test_mysaas" versus "mysaas").

Plus of course using git with a hook specifically for preview versus production (i.e. "git push production") that way local specific scripts can be stripped even if in same repo.

robryan

Yeah this is how I once deleted my production database. One thing I did to mitigate this was colour code the prompts for local/staging/production.

buzer

My guess would be something like PgBouncer. Someone may have installed it to the production server at some point in the past.

robryan

One solution to this is to make sure only the production servers can connect to the production database.

dschuetz

As someone else said below: "hardcoded to localhost" doesn't mean it's hardcoded. It means it goes to whatever localhost resolves to. Really hardcoded should ALWAYS mean 127.0.0.1.
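
The point is easy to demonstrate: even a "hardcoded" localhost goes through the system resolver, so it means whatever the machine's configuration says it means.

```python
import socket

# "localhost" is resolved via /etc/hosts and friends, not fixed in the program;
# an entry mapping it elsewhere (or a forgotten tunnel on 127.0.0.1) changes
# where your connection actually lands
addrs = sorted({info[4][0] for info in socket.getaddrinfo("localhost", None)})
print(addrs)  # usually ['127.0.0.1', '::1'], but the resolver has the last word
```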

colechristensen

This is bad operations.

That it happened means there were many things wrong with the architecture, and summing up the problem as "these things happen" is irresponsible. Most importantly, your response to a critical failure needs to be in the mindset of figuring out how you would have prevented the error without knowing it was going to happen, and doing so in several redundant ways.

Fixing the specific bug does almost nothing for your future reliability.

cblconfederate

> Computers are just too complex and there are days when the complexity gremlins win.

Wow. But then again it's not like programmers handle dangerous infrastructure like trucks, military rockets or nuclear power plants. Those are reserved for adults

geofft

I'm not sure I follow your point - I think you'll find the same attitude towards complexity by operators of military rockets and nuclear power plants. If you look at postmortems/root-cause analyses/accident reports from those fields, you'll generally find that a dozen things are going wrong at any given time, they're just not going wrong enough to cause problems.

codegladiator

We already know humans make mistakes. But for this particular scenario lets blame computers.

yunruse

I feel that computers make it easier for this danger to be more indirect, however. The examples you give are physical, and even the youngest of children would likely recognise that they are not regular items. A production database, meanwhile, is visually identical to a test database if measures are not taken to make it distinct. Adults though we may be, we're human, and humans can make really daft mistakes without the right context to avoid them.

PurpleFoxy

There are also countless safety measures on physical items that have been iterated on over decades to prevent all kinds of accidents. Things like putting physical locks on switches to prevent machinery being turned on while people are working on it.

Can you imagine if instead of a physical lock it just said “are you sure you wish to turn on this machine”. “Of course I want to turn it on, that’s why I pressed the button”

Some software makes it a lot harder for the user to mess up now. When deleting a repo on GitLab you have to type the name of the repo before pressing delete and then it puts it in a pending deletion state for a month before it’s actually deleted. Unfortunately for developers we typically get minimal cli tools which will instantly cause a lot of damage without any way to undo.

im3w1l

So, silly idea. What if, to work on the production database, you had to go into the special production room, colored in the special production color, scented with the special production perfume, and sit on a just tiny bit uncomfortable production chair.

Basically make it clear even to the caveman brain that things are different.

pontifier

I actually really like this idea... but who am I kidding, it's a luxury I don't have time for when I've got to fix stuff.

Waterluvian

All of those items have a lot of safety software in them.

ineedasername

Yep, I hate when I'm dealing with a system literally comprised of logic and a magic gremlin shows up to ruin my day.

Seems like they thought a casual "everyman" type of explanation would suffice, but really, who would trust them after this?

cblconfederate

I understand that web interfaces are trivial, unimportant work (it's what I do), but how can one sleep with such an unresolved mystery?

eezurr

One explanation for the author feeling that way is that the system has too much automation. Being in a situation where you take on more responsibilities for the system at a shallower level leads to less industry expertise. This, as it turns out, places the security of the system in a precarious position.

This is pretty common, as devs' tool belts have grown longer over time.

I think at some point we will stop automating or reverse some of the automation.

codegladiator

> too much automation

Literally just the automation of the test suite. That's 1 automation.

> These is pretty common

? Waiting for FB to delete their db
