
abujazar

The disclosure provides valuable information, but the introduction suggests someone else, or «open source», is to blame:

>We took ChatGPT offline earlier this week due to a bug in an open-source library which allowed some users to see titles from another active user’s chat history.

Blaming an open-source library for a fault in a closed-source product is simply unfair. The MIT-licensed dependency explicitly comes without any warranty. After all, the bug went unnoticed until ChatGPT put the library under pressure, and it was ChatGPT that failed to catch the bug in its release QA.

mirekrusin

They are not suing anybody; just because it's open source doesn't mean talking about bugs is taboo. It reads to me like raw information, which is very good. They said they contacted the authors and helped with an upstream bug fix. If anything, it's a great example of how to deal with this kind of problem.

abujazar

I agree it's a good example, if not a great one. But considering OpenAI itself is as closed-source as can be, mentioning open-source software as the cause of their outage in the very first line of their statement seems somewhat out of place. I don't think it's a coincidence that the open-source mention comes in the opening line while the acknowledgement of the Redis team is at the very end. Many people outside of software engineering might read this as «open source caused OpenAI to crash».

random_cynic

Ironically, this sort of insane take, found everywhere in this thread, is doing more harm to the image of open-source developers than OpenAI or anyone else could ever do with a press release. It's important to name whatever open-source library caused a bug, because the bug could be present in many other applications that use that library. That's basically one of the main points of open source.

Angostura

What would your opening sentence have been?

It’s a great opening sentence: short, factual, explanatory. You would have to be hypersensitive to object.

As far as I can see, the whole thing is a reaction to OpenAI not being sufficiently open, which is a fine argument to have with them, but it shouldn’t cloud judgement of the quality of this write-up.

mvkel

But the bug resided in an open source library, not their own code. What else could they say?

Bugs will always exist whether you have a QA dept or not.

SerCe

Shifting the blame to open source isn't a good look. I like what Bryan Cantrill had to say about it: https://twitter.com/bcantrill/status/1638707484620902401

Angostura

The blame wasn’t “shifted”. The source of the bug was reported.

abujazar

Spot on

voidfunc

They're not blaming anyone. Objectively there was a bug in that library that caused the problem.

abujazar

They're not blaming, legally speaking. But they're communicating that open-source software caused their outage. OpenAI chose to use software that explicitly comes without warranties, and they are legally solely responsible for problems caused by open-source libraries they choose to include in their product.

LASR

I understand where you're coming from. But there is certainly an audience for this kind of post that would like to read about the objective source of the bug without connecting it to some expectations around responsibility.

I read this post, and I don't see it assigning blame to anyone other than themselves. See this bit copied from the post:

> Everyone at OpenAI is committed to protecting our users’ privacy and keeping their data safe. It’s a responsibility we take incredibly seriously. Unfortunately, this week we fell short of that commitment, and of our users’ expectations. We apologize again to our users and to the entire ChatGPT community and will work diligently to rebuild trust.

Matl

Why mention it is open source then? What does that add?

fsckboy

"there was a bug in an outside library that we used" does not mention open source but has the same meaning, and would probably provoke the same complaints ("they're trying to blame somebody else for their problem").

In that case, though, they could say "look, we used a popular open source library because we had more faith that it would be better tested and correct" which would be a compliment to open source. That's essentially the information that we have.

In today's world, who builds anything from anywhere close to scratch? Embedded developers probably come closest. It's no worse or better to say "there was a bug that our release uncovered." If they continue to announce as many details as possible, we as the audience can develop a sense of whether they're creating bugs or just uncovering bugs we're glad to know about.

clnq

On the other hand, why overthink it?

xdavidliu

why not mention it? What does mentioning it subtract?

YPPH

If someone asked ChatGPT to generate some code, then copy and pasted it mindlessly into their project, I wonder what OpenAI would think about the claim:

"The bug was a result of faulty code produced by ChatGPT."

Using an open source library is like copy and pasting code into your project. You assume responsibility for any pitfalls.

carlmr

Come to think of it, I think that's realistically going to happen a lot.

It's going to be recursive blaming all the way down.

colechristensen

I mean they’re pretty clear with their warnings about accuracy.

abujazar

Just like any author of MIT licensed open source software is clear about their license which clearly states they are not responsible in any way whatsoever for the shortcomings of their licensees.

belter

According to this, it's not fixed yet: https://github.com/redis/redis-py/issues/2624

Caligatio

I don't recall anyone advocating omitting the name log4j when that security bug dropped in 2021. How is this situation materially different? redis-py fixed the behavior, so it's definitely not a case of working-as-intended.

To be clear, I would be up in arms if OpenAI was trying to hold redis potentially legally responsible.

juunpp

Damn open source! Those guys aren't living up to their paychecks; I foresee staff cuts...

chaxor

I understand and agree that framing open source as the issue is ridiculous. They may be redeemed though through the likely scenario that they didn't write this - GPT probably did. Perhaps their GPT doesn't like open source. :D

chatmasta

Why did it take them 9 hours to notice? The problem was immediately obvious to anyone who used the web interface, as evidenced by the many threads on Reddit and HN.

> between 1 a.m. and 10 a.m. Pacific time.

Oh... so it was because they're based in San Francisco. Do they really not have a 24/7 SRE on-call rotation? Given the size of their funding, and the number of users they have, there is really no excuse not to at least have some basic monitoring system in place for this (although it's true that, ironically, this particular class of bug is difficult to detect in a monitoring system that doesn't explicitly check for it, despite being immediately obvious to a human observer).

Perhaps they should consider opening an office in Europe, or hiring remotely, at least for security roles. Or maybe they could have GPT-4 keep an eye on the site!

eep_social

Staffing an actual 24x7 rotation of SREs costs about a million dollars a year in base salary as a floor, and there are few SREs for hire. A metrics-based monitor probably would have triggered on the increased error rate, but it wouldn’t have made it immediately obvious that there was also a leaking cache. The most plausible way to detect the problem from the user’s perspective would be a synthetic test running an affected workflow, built to check that the data coming back matches specific, expected strings (not just well-formed ones). All possible, but none of this sounds easy to me, and absolutely none of it is easy when your startup has been at the top of the news cycle every single day for the past several months.

sosodev

"there are few SREs for hire"

How do you figure? If you mean there are few SREs with several years of experience, you might be right; SRE is a fairly new title, so that's not too surprising.

However, my experience from a recent job search is that most companies aren't hiring SREs right now because they consider reliability a luxury. In fact, I was in search of a new SRE position because I was laid off for that very reason.

chatmasta

You don't even need an SRE to have an on-call rotation; you could ping a software engineer who could at least recognize the problem and either push a temporary fix, or try to wake someone else to put a mitigation in place (e.g. disabling the history API, which is what they eventually did).

However, I think the GP's point about this class of bug being difficult to detect in a monitoring system is the more salient issue.

eep_social

I’ve anecdotally observed the opposite: SRE jobs remain posted, even at companies laying off or announcing some kind of hiring slowdown over the last quarter or so. More generally, businesses that have decided they need SRE are often building out from some kind of devops baseline that has become unsustainable for the dev team. When you hit that limit and need to split out a dedicated team, there aren’t a ton of alternatives to bringing in an SRE or two and shuffling some prod-oriented devs onto the new SRE team (or building a full team from scratch, which is what the $$ estimate above was for). Among other things, the SRE bailiwick includes capacity planning and resource efficiency; SRE will save you money in the long term.

On a personal note, I am sorry to hear that your job search has not yet been fruitful. Presumably I am interested in different criteria than you are: I have found several postings that are quite appealing, to the point where I am updating my CV and applying despite being weakly motivated at the moment.

namaria

Every system failure prompts people to exclaim "why weren't there safeguards?" Every time. Well, guess what: if we try to do new stuff, we will run into new problems.

wouldbecouldbe

There is nothing new about using redis for cache, or returning a list for a user.

sinuhe69

Probably the cheapest solution would be letting GPT monitor user feedback from various social media channels and alert human engineers to check on the aggregated problem. GPT could even engage with users to request more details or reproducible cases ;)

Yiin

that's abusable, as you can manipulate gpt however you like.

scarmig

Since it now handles visual inputs, I wonder how hard it'd be to get GPT to monitor itself. Have it constantly observe a set of screenshares of automated processes starting and repeating ChatGPT sessions on prod, alert the on-call when it notices something "weird."

eep_social

RUM monitoring does 99% of what you want already; anomaly detection is the hard part. IMO it's too early to say whether GPT will be good at that specific task, but I agree that an LLM will be in the loop on critical production alerts in some fashion in 2023.

pharmakom

They raised a billion dollars.

eep_social

How much have they spent?

dharmab

You don't necessarily need a full team of SREs- you can also have a lightly staffed ops center with escalation paths.

eep_social

I don’t think that model has the properties you think it does. Someone still has to take call to back the operators. Someone has to build the signals that the ops folks watch. Someone has to write criteria for what should and should not be escalated, and in a larger org they will also need to know which escalation path is correct. And on and on — the work has to get done somewhere!

guessmyname

> […] it was because they're based in San Francisco. Do they really not have a 24/7 SRE on-call rotation?

OpenAI is hiring Site Reliability Engineers (SRE) in case you, or anyone you know, is interested in working for them: https://openai.com/careers/it-engineer-sre . Unfortunately, the job is an onsite role that requires 5 days a week in their San Francisco office, so they do not appear to be planning to have a 24/7 on-call rotation any time soon.

Too bad because I could support them in APAC (from Japan).

Over 10 years of industry experience, if anyone is interested.

eep_social

I had forgotten that I looked at this and came to the same conclusion as you. I’d happily discuss a remote SRE position but on-site is a non-starter for me, and most of SRE, if I am reading the room correctly.

Edit to add: they’re also paying in line with or below industry rates, and the role description reads like a technical project manager, not an SRE. I imagine people are banging down the door because of the brand, but personally that’s a lot of red flags before I even submit an application.

VirusNewbie

that is quite low for FAANG level SRE/SWE .

p1esk

Also, I heard their interviews (for any technical position) are very tough.

undefined

[deleted]

inconceivable

nobody qualified wants the 24/7 SRE job unless it pays an enormous amount of money. i wouldn't do it for less than 500 grand cash. getting woken up at 3am constantly or working 3rd shift is the kind of thing you do with a specific monetary goal in mind (i.e., early retirement) or else it's absolute hell.

combine that with ludicrous requirements (the same as a senior software engineer) and you get gaps in coverage. ask yourself what senior software engineer on earth would tolerate getting called CONSTANTLY at 3am, or working 3rd shift.

the vast majority of computer systems just simply aren't as important as hospitals or nuclear power plants.

mnahkies

Timezones are a thing - your 3am is someone's 9am and may be a significant part of your customer base.

Being paged constantly is a sign of bad alerts or bad systems IMO - either adjust the alert to accept the current reality or improve the system

inconceivable

spinning up a subsidiary in another country (especially one with very strict labor laws, like many european countries) is not as easy as "find some guy on the internet and pay him to watch your dashboard, and then give him root so he can actually fix stuff without calling your domestic team (which would defeat the whole purpose)".

also, even getting paged ONCE a month at 3am will fuck up an entire week at a time if you have a family. if it happens twice a month, that person is going to quit unless they're young and need the experience.

nijave

Not only that, but you probably need follow-the-sun coverage if you want <30 minute response time.

Given a system that collects per-minute metrics, it generally takes around 5-10 minutes to generate an alert. Another 5-10 minutes for the person to get to their computer unless it's already in their hand (what if you get unlucky and on-call is taking a shower or using the toilet?). After that, another 5-10 minutes to see what's going on with the system.

After all that, it usually takes some more minutes to actually fix the problem.

Dropbox has a nice article on all the changes they made to streamline incident response: https://dropbox.tech/infrastructure/lessons-learned-in-incid...

okdood64

I've worked two SRE (or SRE-adjacent) jobs with on-call duty (a unicorn and a FAANG). Neither was remotely as bad as what you're describing. (Only one was actually 24/7, for a week-long shift.)

The whole point is that before you join, the team has done sufficient work to not make it hell, and your work during business hours makes sure it stays that way. Are there a couple bad weeks throughout the year? Sure, but it's far, far from the norm.

hgsgm

Constantly? It's one wakeup in 4 months.

oulu2006

I did that for a few years, and wasn't on 500k a year, but I'm also the company co-founder, so you could argue that a "specific monetary goal" was applicable.

majormajor

You don't need 24/7 SREs, you could do it with 24/7 first-line customer support staff monitoring Twitter, Reddit, and official lines of comms that have the ability to page the regular engineering team.

That's a lot easier to hire, and lower cost. More training required of what is worth waking people up over; way less in terms of how to fix database/cache bugs.

richdougherty

Support engineers and an official bug-reporting channel would help. I noticed and reported the issue on their official forums on 16 March, but got no response.

https://community.openai.com/t/bug-incorrect-chatgpt-chat-se...

I only reported it on the forums because there didn't seem to be an official bug reporting channel, just a heavyweight security reporting process.

As well as the actions they took to fix this specific bug, another useful action would be to have a documented and monitored bug reporting channel.

undefined

[deleted]

cloudking

Probably because they launched ChatGPT as an experiment and didn't think it would blow up and need full-time SREs, etc. I don't think it was designed for scale and reliability when they launched.

CubsFan1060

Do events like this cause them to lose enough revenue that it would make sense to hire a bunch of SRE's?

nijave

Probably the real reason. I assume they intend to make money off enterprise contracts which would include SLAs. Then they'd set their support based off that

chatmasta

Given the Microsoft partnership, they might not even need to manage any real infrastructure. Just hand it off to Azure and let them handle the details.

undefined

[deleted]

raldi

Just add metrics for the number of times "ChatGPT" and "OpenAI" appeared in tweets, reddit posts, and HN comments in the last (rolling) five minutes, put them on a dashboard alongside all your other monitoring, and have a threshold where they page the oncall to review what's being said. It doesn't even have to be an SRE in this case; it could be just about anyone.
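A minimal sketch of that rolling-window mention counter; the window and threshold here are made-up numbers, and feeding it real tweets/posts is left out:

```python
from collections import deque

class RollingCounter:
    """Count brand mentions in a rolling window; page past a threshold."""
    def __init__(self, window_s=300, threshold=3):
        self.window_s = window_s
        self.threshold = threshold
        self.events = deque()  # timestamps of matching tweets/posts/comments

    def record(self, ts):
        self.events.append(ts)
        # drop mentions that have fallen out of the rolling window
        while self.events and self.events[0] <= ts - self.window_s:
            self.events.popleft()
        return len(self.events) >= self.threshold  # True => page the on-call

alarm = RollingCounter(window_s=300, threshold=3)
print(alarm.record(0))     # False
print(alarm.record(100))   # False
print(alarm.record(200))   # True: three mentions within five minutes
```

The point of the deque is that both ends are O(1), so the counter stays cheap even at a high mention rate.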

Kuinox

I managed to manually reproduce this bug 2 months ago. As they don't have a bug bounty, I didn't submit it. By starting a conversation and refreshing before ChatGPT had time to answer, I managed to reproduce the bug 2-3 times in January.

breckenedge

did you reach out via https://openai.com/security.txt?

Kuinox

No, as I said, their disclosure page says there is no bug bounty, and I'm not a professional security researcher, so I wasn't very interested in helping them.

I even find it funny now: they could have written that they would provide free API creds, but instead they had a very bad moment due to their greed.

MacroChip

You won't fire off a quick email nor warn others because there's no bug bounty?

catmanjan

Not everyone has the privilege of working for free

capableweb

As I understand, the "work" was already done, the only thing missing was sending a heads-up email with "hey, this seems iffy, maybe you ought to look into it".

I dunno, I generally report issues I find in software, paid or not, as I've always done. It usually takes ~10 minutes, and 1% of the time they ask for more details and I spend maybe 20 more minutes filling them in.

I've never been paid for it; the most I've gotten was a free yearly subscription. But in general I do it because I want the software I use to be less buggy.

fintechie

I reported this race condition via ChatGPT's internal feedback system after I saw other users' chat titles loading in my sidebar a couple of times (around 7-8 weeks ago). Didn't get a response, so I assumed it was fixed...

Hopefully they'll start a bug bounty program soon, and prioritise bug reports over features.

jetrink

The explanation at the time was that unavailable chat data (due to, e.g. high load) resulted in a null input sometimes being presented to the chat summary system, which in turn caused the system to hallucinate believable chat titles. It's possible that they misdiagnosed the issue or that both bugs were present and they caught the benign one before the serious one.

fintechie

Yeah, I was surprised that the bug appeared simply from using the app normally. My first thought was that it was another user's data loading, so I immediately reported that it looked like a race condition. But maybe it was this other bug you mention.

totallyunknown

Same for me. Actually, only the title in the history was from a different user; the content itself was mine.

sebzim4500

The claim made at the time was that the titles were not from other people and were in fact caused by the model hallucinating after the input query timed out (or something like that). Obviously that sounds a little suspect now, but it might be true.

nwienert

That's a lie, if so. If you look at the Reddit threads, there's no way those were not other users' specific histories, as they had the logical flow of a browsing history. E.g., one I saw had entries like "what is X" followed by "how to X". Some were all in Japanese, others all in Chinese. If it were random, you wouldn't see clear logical consistency across the list.

ajhai

> In the hours before we took ChatGPT offline on Monday, it was possible for some users to see another active user’s first and last name, email address, payment address, the last four digits (only) of a credit card number, and credit card expiration date

This is a lot of sensitive data. It says 1.2% of ChatGPT Plus subscribers active during a 9 hour window, which considering their user base must be a lot.

mach1ne

It’s a bit unclear whether this means that 1.2% of all ChatGPT Plus subscribers were active during that 9-hour window, or that 1.2% of the subscribers who were active were affected.

lopkeny12ko

The original issue report is here: https://github.com/redis/redis-py/issues/2624

This bit is particularly interesting:

> I am asking for this ticket to be re-oped, since I can still reproduce the problem in the latest 4.5.3. version

Sounds like the bug has not actually been fixed, per drago-balto.

_5hxt


.... "I am asking for this ticket to be re-opened, since I can still reproduce the problem in the latest 4.5.3. version"

chatmasta

The PR: https://github.com/redis/redis-py/pull/2641

According to the latest comments there, the bug is only partially fixed.

photochemsyn

> "If a request is canceled after the request is pushed onto the incoming queue, but before the response popped from the outgoing queue, we see our bug: the connection thus becomes corrupted and the next response that’s dequeued for an unrelated request can receive data left behind in the connection."

The OpenAI API was incredibly slow and lots of requests probably got cancelled (I certainly was doing that) for some days. I imagine someone could write a whole blog post about how that worked, it would be interesting reading.
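The failure mode in that quote can be sketched with a toy connection where requests and responses are matched purely by FIFO order; this is an illustrative model, not redis-py's actual code:

```python
from collections import deque

class FakeConnection:
    """Toy model: replies are matched to requests only by queue order,
    so one orphaned reply shifts every later reply by one."""
    def __init__(self):
        self.responses = deque()  # the connection's outgoing queue

    def send(self, request):
        # the "server" computes the reply and queues it in arrival order
        self.responses.append(f"data-for:{request}")

    def recv(self):
        # the caller assumes the head of the queue answers *its* request
        return self.responses.popleft()

conn = FakeConnection()

# Request A is sent, but its caller is cancelled before reading the reply
conn.send("user-A-history")   # A's reply is now stranded in the queue

# Request B reuses the same pooled connection
conn.send("user-B-history")
reply_for_b = conn.recv()     # pops A's leftover reply instead of B's

print(reply_for_b)            # data-for:user-A-history
```

With no request ID in the protocol frame, nothing in the model can detect the off-by-one, which is why the only safe fix is to discard or drain the connection after a cancellation.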

braindead_in

Was this written by ChatGPT? Maybe it found the bug as well, who knows.

pixl97

There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors.

deathanatos

… in this case this variant seems more appropriate:

  There are 3 hard problems in Computer Science:
  1. naming things
  2. cache invalidation
  3. 4. off-by-one errors
  concurrency

ketchupdebugger

It's surprising that OpenAI seems to be the only one affected. If the issue is redis-py reusing connections, then wouldn't more companies/products be affected by this?

zzzeek

Their description of the problem seemed kind of obtuse. In practice, these connection-pool issues look like: 1. a request is interrupted, 2. an exception is thrown, 3. the exception is caught, the connection is returned to the pool, and everything moves on. What has to be implemented is step 2a: clean up the state of the connection when the interrupting exception is caught, and only then return it to the pool.

That is, this seems like a very basic programming mistake and not some deep issue in Redis. The strange way it was described makes it seem like they're trying to conceal that a bit.
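A minimal sketch of the "step 2a" cleanup described above, with illustrative names rather than redis-py's real API: a pool that refuses to recycle a connection that still has a reply in flight.

```python
from collections import deque

class Conn:
    def __init__(self):
        self.pending_replies = 0   # replies sent by the server but never read
        self.closed = False
    def close(self):
        self.closed = True

class Pool:
    def __init__(self):
        self.idle = deque()

    def release(self, conn):
        # step 2a: never recycle a connection with a response in flight
        if conn.pending_replies:
            conn.close()           # discard the corrupted connection
        else:
            self.idle.append(conn) # clean state, safe to reuse

pool = Pool()
clean, dirty = Conn(), Conn()
dirty.pending_replies = 1   # its request was cancelled mid-flight

pool.release(clean)   # goes back to the pool
pool.release(dirty)   # discarded instead

print(len(pool.idle), dirty.closed)   # 1 True
```

Closing is the blunt option; a pool could also drain the stale reply before reuse, but discarding is the simplest way to guarantee no cross-request leakage.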

roberttod

It's an open source library, I assume that logic is abstracted within it and that the "basic mistake" was one of the maintainer's.

neurostimulant

I think most apps using redis-py rarely cancel async Redis commands. Python async web frameworks are gaining popularity, but the majority of people using Python for web applications are not using an async framework. And of those who do, not many cancel async Redis requests often enough to trigger the bug.

YetAnotherNick

There is a 1 year old autoclosed issue which is very similar to OpenAI's issue: https://github.com/redis/redis-py/issues/2028

killerstorm

Serious question: Why do people feel it's necessary to use a redis cluster?

I understand that in the early 2000s we were using spinning disks and it was the only way. Well, we don't use spinning disks any more, do we?

A modern server can easily have terabytes of RAM and petabytes of NVMe, so what's stopping people from just using postgres?

A cluster of radishes is an anti-pattern.

manv1

1. Redis can handle a lot more connections, more quickly, than a database can.

2. It's still faster than a database, especially a database that's busy.

#2 is an interesting point. When you benchmark, the normal process is to just set up a database then run a shitload of queries against it. I don't think a lot of people put actual production load on the database then run the same set of queries against it...usually because you don't have a production load in the prototyping phase.

However, load does make a difference. It made more of a difference in the HDD era, but it still makes a difference today.

I mean, redis is a cache, and you do need to ensure that stuff works if you purge redis (i.e., be sure the rebuild process works), etc., etc.

But just because it's old doesn't mean it's bad. OS/390 and AS/400 boxes are still out there doing their jobs.

nijave

A pretty small Redis server can handle 10k clients and saturate a 1Gbps NIC. You'd need a pretty heavy duty Postgres database and definitely need a connection pooler to come anywhere close.

anarazel

I agree that redis can handle some query volumes and client counts that postgres can't.

But FWIW I can easily saturate a 10GBit ethernet link with primary key-lookup read-only queries, without the results being ridiculously wide or anything.

Because it didn't need any setup, I just used:

  SELECT * FROM pg_class WHERE oid = 'pg_class'::regclass;

I don't immediately have access to a faster network; connecting via TCP to localhost, and using some moderate pipelining (common in the redis world afaik), I get up to 19GB/s on my workstation.

hobobaggins

and those have reliable backup/restore infrastructure. Using redis as a cache is fine, just don't use it as your primary DB.

xp84

I'm confused about why something as seemingly straightforward as a KV store needs to be complicated into a series of queues that can get all mixed up. I asked ChatGPT to explain it, though, and it sounds like the justification is that it doesn't "block the event loop" while a request is "waiting for a response from Redis."

Last time I checked, Redis doesn't take that long to provide a response. And if your Redis servers actually are so overloaded that you're seeing latency in your requests, it seems like simple key-based sharding would let you scale your Redis cluster horizontally.

Disclaimer: I am probably less smart than most people who work at OpenAI so I'm sure I'm missing some details. Also this is apparently a Python thing and I don't know it beyond surface familiarity.
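For what it's worth, the key-based sharding mentioned above can be sketched in a few lines. The hostnames are hypothetical, and real Redis Cluster uses CRC16 hash slots rather than MD5; this just shows the idea of deterministically spreading keys across servers:

```python
import hashlib

# Hypothetical shard hosts; real Redis Cluster assigns 16384 CRC16 hash slots.
SHARDS = ["redis-0", "redis-1", "redis-2", "redis-3"]

def shard_for(key: str) -> str:
    """Deterministically map a key to one shard so load spreads out."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("user:42:history"))  # always the same shard for this key
```

The modulo scheme reshuffles almost every key when a shard is added or removed, which is why production systems prefer hash slots or consistent hashing.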

oxymoron

Redis latency is around 1 ms including the network round trip for most operations. In a single-threaded context, waiting on that would limit you to around 1,000 operations per second. Redis clients improve throughput by pipelining, so a bunch of calls are batched up to minimize network round trips. This becomes more complicated in the context of redis-cluster, because calls targeting different keys are dispatched to different cache nodes and complete in an unpredictable order, so additional client-side logic is needed to accumulate the responses and dispatch them back to the appropriate caller.
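A quick back-of-envelope calculation using the numbers above, assuming a fixed 1 ms round trip and ignoring server-side processing time:

```python
# Rough figure from the comment above: ~1 ms per network round trip.
RTT_MS = 1.0

def ops_per_second(ops, batch_size):
    """Throughput when `batch_size` commands share one round trip."""
    round_trips = ops / batch_size
    seconds = round_trips * RTT_MS / 1000
    return ops / seconds

print(ops_per_second(10_000, 1))    # 1000.0 -> serial, one command per trip
print(ops_per_second(10_000, 50))   # 50000.0 -> pipelined in batches of 50
```

In other words, throughput scales linearly with batch size until the server or the network, rather than the round trip, becomes the bottleneck.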

zmj

I'm not familiar with the Python client specifically, but Redis clients generally multiplex concurrent requests onto a single connection per Redis server. That necessitates some queueing.

eldenring

Yes! I have been spending the last couple months pulling out completely unnecessary redis caching from some of our internal web servers.

The only loss here is network latency, which is negligible when you're colocated in AWS.

Postgres's caches end up pulling a lot more weight, too, when the db isn't only being hit on a cache miss from the web server.

undefined

[deleted]

cplli

For caching the query results you get from your database. It's also easier to spin up Redis and replicate it closer to your users than to do that with your main database. From my experience, anyway.

mike_hearn

I think the idea is that if your db can hold the working set in RAM and you're using a good db + prepared queries, you can just let it absorb the full workload because the act of fetching the data from the db is nearly as cheap as fetching it from redis.

killerstorm

> For caching the query results you get from your database.

This only makes sense if queries are computationally intensive. If you're fetching a single row by index you aren't winning much (or anything).

dpkirchner

Of course? I'm not really sure what the original question actually is if you know that users benefit from caching the results of computationally intensive queries.

acuozzo

> This only makes sense if queries are computationally intensive.

Or if the link to your DB is higher latency than you're comfortable with.

aadvark69

Better concurrency (10k max connections vs ~200 for Postgres). ~20x faster than Postgres at key-value read/write operations. (Mostly) single-threaded, so atomicity is achieved without the synchronization overhead found in an RDBMS.

Thus, it's much cheaper to run at massive scale like OpenAI's for certain workloads, including KV caching

also:

- robust, flexible data structures and atomic APIs to manipulate them are available out-of-the box

- large and supportive community + tooling

adrr

My redis clusters are 10x more cost effective than my postgresdb in handling load.

amtamt

For caching somewhat larger objects based on ETag?

lofaszvanitt

People know it, that's all.

qwertox

Nice writeup, it's fair in the content presented to us.

Yet I'm wondering why there is no checking if the response does actually belong to the issued query.

The client issuing a query can pass a token and verify upon answer that this answer contains the token.

TBH, as a user of the client, I would kind of expect the library to have this feature built in. And if I were starting to use the library to solve a problem, handling this edge case would be of somewhat low priority to me if the library didn't implement it, probably because I'm lazy.

I hope that the fix they offered to Redis Labs does contain a solution to this problem and that everyone of us using this library will be able to profit from the effort put into resolving the issue.

It doesn't [0], so the burden is still on the developer using the library.

[0] https://github.com/redis/redis-py/commit/66a4d6b2a493dd3a20c...

---

Edit: Now I'm confused. This issue [1] was raised on March 17 and fixed on March 22; was this a regression? Or did OpenAI start using this library on March 19-20?

Interesting comment:

> drago-balto commented 3 hours ago

> Yep, that's the one, and the #2641 has not fixed it fully, as I already commented here: #2641 (comment)

> I am asking for this ticket to be re-oped, since I can still reproduce the problem in the latest 4.5.3. version

[1] https://github.com/redis/redis-py/issues/2624#issue-16293351...

menzoic

That sounds more like a hindsight thing. In most systems, authorization doesn't happen at the storage layer. Most queries fetch data by an identifier that is only assumed to be valid, based on authorization that typically happens at the edge; everything below relies on that result.

It's not the safest design but I wouldn't say the client should be expected to implement it. That security concern is at the application layer and the actual needs of the implementation can be wildly different depending on the application. You can imagine use cases for redis where this isn't even relevant, like if it's being used to store price data for stocks that update every 30 seconds. There's no private data involved there. It's out of scope for a storage client to implement.

undefined

[deleted]

undefined

[deleted]

grogers

I've long thought that it's often better to return a bit of extra data in internal API responses to validate that the response matches the request sent. That can be as simple as parroting a request ID, or including some extra metadata (e.g., part of the request) to validate the response. It's not the most efficient, but it can save your bacon sometimes. Mixing up deployment stacks (e.g., thinking you're talking to staging when it's actually prod) and mixing up user data are pretty scary, so any defense in depth seems useful.
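A minimal sketch of that request-ID parroting idea; the helper name and payload shape are hypothetical, not any real client's API:

```python
import uuid

def call_backend(send, request_payload):
    """Tag each request with a fresh ID and refuse any response
    that doesn't echo that ID back."""
    request_id = str(uuid.uuid4())
    response = send({"id": request_id, "payload": request_payload})
    if response.get("id") != request_id:
        raise RuntimeError("response does not match request; dropping it")
    return response["payload"]

# A well-behaved backend echoes the ID, so the call succeeds:
ok = call_backend(lambda req: {"id": req["id"], "payload": "result"}, "query")
print(ok)  # result

# A mixed-up backend (stale ID) is caught instead of leaking data:
try:
    call_backend(lambda req: {"id": "stale", "payload": "someone else's data"}, "q")
except RuntimeError:
    print("rejected")
```

The cost is one extra field per message; the benefit is that a desynchronized connection fails loudly instead of silently serving another request's data.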


March 20 ChatGPT outage: Here’s what happened - Hacker News