Get the top HN stories in your inbox every day.
fishtoaster
OlivOnTech
Data comes from the official status page. It may be more a marketing/communication page than an observability page (especially before selling)
pikzel
The status page was often down when GH was down, back in the day.
tibbon
I could imagine a leadership or viewpoint change in how they reported when/what was down.
I've seen it so many times: Company A complains that its vendors aren't accurate enough about uptime, and that Company A notices first when its vendors are down, but then Company A itself has a very laggy or inaccurate status page.
We want our vendors to be accurate to the minute on these, but many CTOs don't care to admit when they too have problems.
xiaoyu2006
Aha, we need a status page for the status page.
w0m
i assume they simply fixed the status page in 2018.. lol.
mholt
Even better IMO is this status page: https://mrshu.github.io/github-statuses/
"The Missing GitHub Status Page" with overall aggregate percentages. Currently at 90.84% over the last 90 days. It was at 90.00% a couple days ago.
montroser
It has been pretty rough. Their own numbers report just a single `9` for Actions in Feb 2026 with 98% uptime. But that said -- I don't get the 90% number.
Anecdotally, it seems believable that 1 in 50 times (2%) in Feb that Actions barfed. Which is not very nice, but it wasn't at 1 in 10 times (10%).
verdverm
It looks like the aggregate stats are more of a Venn diagram than an average: if any 1 of N services is down, the aggregate is considered down. I don't think this is an accurate way to calculate it. It should be weighted, or in some way show partial outages. This view is derived from the Google SRE book, in particular chapters 3 (Embracing Risk) and 4 (Service Level Objectives).
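The difference between the two aggregation methods can be sketched roughly like this (the minute-level samples and service weights below are made up for illustration, not the site's actual data or method):

```python
# Two ways to roll per-service uptime samples into one aggregate number.
# 1 = service up for that minute, 0 = down.

def aggregate_all_up(samples_by_service):
    """'Venn diagram' style: a minute counts as up only if EVERY service is up."""
    minutes = list(zip(*samples_by_service.values()))
    return sum(all(m) for m in minutes) / len(minutes)

def aggregate_weighted(samples_by_service, weights):
    """Weighted average: partial outages only count partially."""
    total = sum(weights.values())
    return sum(
        weights[name] * sum(s) / len(s)
        for name, s in samples_by_service.items()
    ) / total

# One service down 10% of the time, the other always up:
samples = {"actions": [1] * 9 + [0], "git": [1] * 10}
print(aggregate_all_up(samples))                                   # 0.9
print(aggregate_weighted(samples, {"actions": 0.5, "git": 0.5}))   # ~0.95
```

With the all-up definition, one flaky service drags the whole number down to its own uptime; the weighted version matches the "partial outage" intuition.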
ablob
If you're using all services, then any partial outage is essentially a full outage. Of course, you can massage the numbers to make it look nicer in the way you described but the conservative approach is better for the customers. If you insist, one could create this metric for selected services only to "better reflect users".
That being said, even when looking at the split uptimes, you'd have to do a very skewed weighting to achieve a number with more than one 9.
marcosdumay
That's how you count uptime. Your system is not up if it keeps failing when the user does something.
The problem here is the specification of what the system is. It's a bit unfair to call GH a single service, but it's how Microsoft sells it.
bandrami
Thinking back to when I was hosting, I think telling a customer "your web server was running fine it's just that the database was down" would not have been received well.
mort96
I mean I think it's useful. It answers the question, "what percentage of the time can I rely on every part of GitHub to work correctly?". The answer seems to be roughly 90% of the time.
formerly_proven
In a nutshell, why would the consumer (for the SLO) care about how the vendor sliced the solution into microservices?
fontain
An aggregate number like that doesn’t seem to be a reasonable measure. Should OpenAI models being unavailable in Copilot, because OpenAI has an outage, be considered GitHub “downtime”?
mort96
As long as they brand it as a part of GitHub by calling it "GitHub Copilot" and integrate it into the GitHub UI, I think it's fair game.
jasomill
The third-party aspect is irrelevant, but while high downtime on any product looks bad for the company and the division, I consider GitHub Copilot an entirely separate product from GitHub, and GitHub Copilot downtime doesn't interfere with my use of GitHub repos or vice versa, so I'd consider its downtime separately.
GitHub Actions, on the other hand, is frequently used in the same workflows as the base GitHub product, so it's worth considering both separately and together, much like various Azure services, whereas I see no reason at all to consider an aggregate "Microsoft" downtime metric that includes GitHub, Azure, Office 365, Xbox Live, etc.
The most useful metric, actually, is "downtime for the various collections of GitHub services I regularly use together", but that would obviously require effort to collect the data myself.
mememememememo
What is Google's uptime (including every single little thing with Google in the name)?
fwip
I think reasonable people can disagree on this.
From the point of view of an individual developer, it may be "fraction of tasks affected by downtime" - which would lie between the average and the aggregate, as many tasks use multiple (but not all) features.
But if you take the point of view of a customer, it might not matter as much 'which' part is broken. To use a bad analogy, if my car is in the shop 10% of the time, it's not much comfort if each individual component is only broken 0.1% of the time.
remus
> But if you take the point of view of a customer, it might not matter as much 'which' part is broken. To use a bad analogy, if my car is in the shop 10% of the time, it's not much comfort if each individual component is only broken 0.1% of the time.
Not to go too far out of my way to defend GH's uptime, because it's obviously pretty patchy, but I think this is a bad analogy. Most customers won't have a hard reliance on every user-facing GH feature. Or to put it another way, only a tiny fraction of users will have actually experienced something like the 90% uptime reported by the site. Most people in practice are probably experiencing something like 97-98%.
wang_li
A better analogy is if one bulb in the right rear brake light group is burnt out. Technically the car is broken. But realistically you will be able to do all the things you want to do unless the thing you want to do is measure that all the bulbs in your brake lights are working.
mememememememo
Or if your kettle is not working the house is considered not working?
skipants
These are two pages telling two different things, albeit with the same stats. The information is presented by OP in a way to show the results of the Microsoft acquisition.
goodmythical
holy shit that's nearly five weeks of down time.
Well, I mean, I guess that's fair really. How long has github been around? Surely it's got five weeks of paid time off by now...
hk__2
It’s biased to show this without the dates at which features were introduced. A lot of the downtime in the breakdown is GitHub Actions, which launched in August 2019; so yeah, what a surprise there was no Actions downtime before, because Actions didn’t exist.
cuu508
You can click on "Breakdown" and then on "Actions" to hide it.
mbauman
Even worse, those features show "100% uptime" pre-existence on the breakdowns page too.
siruwastaken
This is the real questionable part of the graphic. It seems that no data pre-2018 was just treated as 100% uptime (which is hardly historically accurate).
voxic11
Check the breakdown page. Like yes the magnitude is reduced obviously for individual services. But they all show the same trend.
hk__2
I checked the breakdown page, as I wrote:
> A lot of the downtimes in the breakdown are GitHub Actions
phillipcarter
FWIW if people are looking for a reason why, here's why I think it's happening: https://thenewstack.io/github-will-prioritize-migrating-to-a...
llama052
It's absolutely this. Our Azure outages correlate heavily with Github outages. It's almost a meme for us at this point.
honeycrispy
Azure's downtime doesn't appear to be as bad as Github's.
llama052
That's 100% because Azure isn't honest about their uptime.
nmaleki
You'd think they'd do all the testing elsewhere and then cut over to Azure in a much shorter window once it passed. I don't think this fully explains over 6 years of poor uptime.
hadlock
The fact that even they struggle with GitHub Actions is a real testament to the fact that nobody wants to host their own CD workers.
esseph
> The fact that even they struggle with GitHub Actions is a real testament to the fact that nobody wants to host their own CD workers.
What a weird takeaway
phillipcarter
It certainly explains the issues _now_, IMO.
shrinks99
I got Claude to make me the exact same graph a few weeks ago! I had hypothesized that we'd see a sharp drop off, instead what I found (as this project also shows) is a rather messy average trend of outages that has been going on for some time.
The graph being all nice before the Microsoft acquisition is a fun narrative, until you realize that some products (like actions, announced on October 16th, 2018) didn't exist and therefore had no outages. Easy to correct for by setting up start dates, but not done here. For the rest that did exist (API requests, Git ops, pages, etc) I figured they could just as easily be explained with GitHub improving their observability.
padjo
It feels like they launched actions and it quickly turned out to be an operations and availability nightmare. Since then, they've been firefighting and now the problems have spread to previously stable things like issues and PRs
deepsun
They rushed to launch Actions because GitLab launched them before.
BTW, GitLab called it "CI/CD" just as a navigation section on their dashboard, and that name spread outside as well despite being weird. Weird names are easier to remember and to associate with a specific meaning than a generic, characterless "Actions".
nulltrace
We added Actions for CI in 2020. A year later we realized our entire deploy pipeline just assumed it would be up.
Webhook doesn't fire, nothing errors out, and you find out when someone asks why staging hasn't moved in two days.
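A cheap guardrail for that failure mode is a watchdog that alerts on staleness instead of trusting the webhook to fire. A rough sketch (the marker-file path and six-hour threshold are invented for illustration, not anyone's real setup):

```python
import time
from pathlib import Path

# The deploy job touches this file on every successful staging deploy.
# Path and threshold are placeholders.
DEFAULT_MARKER = Path("/var/run/deploys/staging.last_success")
MAX_AGE_SECONDS = 6 * 60 * 60  # alert if no successful deploy in 6 hours

def staging_is_stale(marker=DEFAULT_MARKER, max_age=MAX_AGE_SECONDS, now=None):
    """True if the last successful deploy marker is missing or too old."""
    now = time.time() if now is None else now
    if not marker.exists():
        return True  # never deployed, or marker lost: treat as stale
    return now - marker.stat().st_mtime > max_age

if __name__ == "__main__":
    # Run from cron or monitoring, independent of the webhook chain,
    # so a webhook that silently never fires still produces an alert.
    if staging_is_stale():
        print("ALERT: staging has not deployed recently; check CI and webhooks")
```

The point is the inversion: the alert fires on absence of success, so a dead webhook, a down Actions runner, or a lost event all look the same and get noticed.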
jamiemallers
[dead]
irishcoffee
Github actions needs to go away. Git, in the linux mantra, is a tool written to do one job very well. Productizing it, bolting shit onto the sides of it, and making it more than it should be was/is a giant mistake.
The whole "just because we could doesn't mean we should" quote applies here.
lcnPylGDnU4H9OF
The same philosophy would suggest that running some other command immediately following a particular (successful) git command is fine; it is composing relatively simple programs into a greater system. Other than the common security pitfalls of the former, said philosophy has no issue with using (for example) Jenkins instead of Actions.
irishcoffee
[flagged]
psini
But GitHub actions is not Git?
irishcoffee
Sorry, yes, that was my point. GitHub turned git into some dysmorphic DVCS version of C++ on the web. Git is fine. Maybe 10% of people use plain git; it's all wrapped in shitty web apps. Let git be git, and let CI/CD be CI/CD, the way Linux intended.
However, I don’t work on web apps. Maybe it’s better for the JavaScript folks. I hope to never write a line of js in my lifetime.
zja
PR merging broken right now https://www.githubstatus.com/incidents/ml7wplmxbt5l
dewey
I remember a lot of unicorn pages back in the day. Maybe the status page just wasn't updated that regularly back then?
imglorp
I think the unicorn is only for web pages. Things like git api services might be broken independently (and often are!) and they might show up on the status page after some time.
topbanana
teach
One could argue that, given how singularly awful it is, GitHub's historical uptime might qualify as "historic".
tclancy
Bless you, was very much not what I was expecting from the title.
BadBadJellyBean
I feel like by now GitHub has a worse downtime record than my self hosted services on my single server where I frequently experiment, stop services or reboot.
agilob
It's ok because we're still paying for it. QoS degradation is worth it. No need for 99.999% when you can have 90.84% and people still pay for it.
verdverm
Those electricity savings can be better used to fuel the token bonfire.
marcosdumay
It does have a worse downtime record than my tiny VPS that has a recurrent packet routing problem and keeps going offline. Measurably so.
hrmtst93837
[flagged]
frenchie4111
Github's migration to Azure has so far been a hilariously bad advertisement for Azure
otterley
I'm not a GitHub apologist, but that graph isn't at scale, at all. It's massively zoomed in, with a lower band of 99.5%. It makes it look far worse than it is.
pavon
If you plotted it from zero, then a horrible service and a great service would be indistinguishable. Their SLA for enterprise customers is 99.9%. The low end of that chart is 5x that SLA's allowed downtime. It is a reasonable scale for the range people are concerned about, and it looks bad because it is bad.
verdverm
It's an uptime chart and shouldn't need to show much more than the 99% range.
If you started the y-axis at zero, you wouldn't see much of anything. Logarithmic scale would still be a bit much imo.
otterley
> If you started the y-axis at zero, you wouldn't see much of anything.
That's... kind of my point.
As a reliability engineer, I'm disappointed in GitHub's 99.5% availability periods, especially as they impact paying customers. On the other hand, most users are non-paying users, and a 99.5% availability for a free service seems to me to be a reasonable tradeoff relative to the potential cost of improving reliability for them.
grayhatter
> the other hand, most users are non-paying users, and a 99.5% availability for a free service seems to me to be a reasonable tradeoff relative to the potential cost of improving reliability for them.
If they are using your data, you're still paying just not in cash.
As a former reliability engineer, I'm trying hard to remember back when we had multiple months in a row never reaching 100% uptime, and I can't. Yes, we've seen runs of painful months, but also runs of easy months without down time.
But let's talk root cause here: the cost of improving things is someone caring. This isn't simply a hard problem; it's a well-understood hard problem that no one who makes decisions cares about, which, as a reliability engineer, I find embarrassing. Uptime is one of those foundational aspects that you can build on top of. If you're not willing to invest in something as core as whether your code or service works, what are you even doing?
tclancy
It also has 0 reflection of load. Weren't you limited to a single private repo before Microsoft took over?
otterley
I don't think so. Even before Microsoft acquired GitHub, you could have as many private repos as you wanted, but you couldn't have more than 3 collaborators. This change happened back in 2019:
https://github.blog/news-insights/product-news/new-year-new-...
alberth
Unsolicited feedback ... changing the y-axis to be hours (not % uptime) might be more intuitive for folks to understand.
The data is there, you just have to hover over each data point.
simlevesque
It could even be both % and offline hours per year. To me the percentage is simpler to understand.
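For anyone wanting to eyeball that conversion, a quick back-of-the-envelope (assuming a 365-day year; the site's 90-day window would scale proportionally):

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_hours_per_year(uptime_pct):
    """Convert an uptime percentage into hours of downtime per year."""
    return (100 - uptime_pct) / 100 * HOURS_PER_YEAR

for pct in (99.999, 99.9, 99.5, 90.84):
    print(f"{pct}% uptime -> {downtime_hours_per_year(pct):.1f} hours/year down")
# 90.84% works out to roughly 800 hours, i.e. about 33 days a year,
# which is where the "nearly five weeks" figure upthread comes from.
```

99.9% is under 9 hours a year; 99.5% is almost 44; the gap between adjacent nines is why the percentage alone can be hard to feel.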
Is the pre-2018 data actually accurate? There seem to have been a number of outages before then: https://hn.algolia.com/?dateEnd=1545696000&dateRange=custom&...
Maybe that's just the date when they started tracking uptime using this system?