gpm
I'd be very, very hesitant to trust studies like this. It's very easy to mess up these benchmarks.
See, for example, this recent paper where AI managed to beat radiologists at interpreting x-rays... when the AI didn't even have access to the x-rays: https://arxiv.org/pdf/2603.21687 (on a pre-existing "large scale visual question answering benchmark for generalist chest x-ray understanding" that wasn't intentionally messed up).
And in interpreting x-rays, human radiologists actually do just look at the x-rays. In the context the article is discussing, the human doctors don't just look at the notes to diagnose the ER patient. You're asking them to perform a task that isn't necessary, that they aren't experienced or trained in, and then saying "the AI outperforms them." Even if the notes aren't accidentally giving away the answer through some weird side channel, that's not that surprising.
Which isn't to say that I think the study is either definitely wrong or intentionally deceptive. Just that I wouldn't draw strong conclusions from a single study here.
pixel_popping
I agree with you on this specific study. However, I can't really wrap my head around the idea that doctors will remain better than AI models in the long run. After all, medicine is all about knowledge, experience, and intelligence (maybe "pattern recognition"). Given all of those, we must assume that the best AI models (especially ones focused solely on the medical field) would beat the large majority of humans (i.e. doctors). If we already have this assumption for software engineers, we should have it for this field as well. And let's be realistic: each time I've seen a doctor over the last few months (including the ER twice), they were using ChatGPT (not kidding, it shocked me).
So I’m genuinely curious:
What is the specific capability (or combination of capabilities) that people believe will keep a top medical AI from matching or exceeding the performance of a good human doctor, permanently or at least for decades? Let's put liability and ethics aside and be purely objective about it.
teleforce
> What is the specific capability (or combination of capabilities) that people believe will keep a top medical AI from matching or exceeding the performance of a good human doctor, permanently or at least for decades? Let's put liability and ethics aside and be purely objective about it.
You cannot simply put liability and ethics aside; after all, the Hippocratic Oath is fundamental to the practice of medicine [1][2].
Having said that, there are always two extremes in this debate: those who hate AI and those who are obsessed with AI in medicine. We would be much better off in the middle, i.e. moderate on this issue.
IMHO, AI should be used as a screening and triage tool with very high sensitivity, preferably 100%; otherwise it will create a "boy who cried wolf" scenario.
At 100% sensitivity we essentially have zero false negatives, at the cost of potential false positives.
The false positives, however, can be checked further by a physician-in-the-loop: for example, a suspected CVD case can be reviewed with input from a specialist such as a cardiologist (or, more specifically, a cardiac electrophysiologist). This would help with the very limited number of cardiologists available globally relative to the population at risk of heart disease or CVD, and with the alarmingly low accuracy (sensitivity, specificity) of conventional CVD screening and triage.
Current risk-based screening and triage for CVD, such as SCORE2, has a sensitivity of only around 50% (2025 study) [3].
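To make the trade-off concrete, here is a minimal sketch in Python, with entirely made-up scores and labels, of how a screening threshold would be tuned for 100% sensitivity, pushing all of the error into false positives for the physician-in-the-loop to review:

    import numpy as np

    # Hypothetical risk scores from a screening model (higher = more likely CVD)
    # and ground-truth labels from a small validation set. All numbers made up.
    scores = np.array([0.05, 0.20, 0.35, 0.40, 0.55, 0.60, 0.75, 0.90])
    labels = np.array([0, 0, 0, 1, 0, 1, 0, 1])  # 1 = has CVD

    # For 100% sensitivity, the cutoff must sit at (or below) the lowest
    # score among the true positives, so no sick patient is screened out.
    threshold = scores[labels == 1].min()

    flagged = scores >= threshold
    sensitivity = (flagged & (labels == 1)).sum() / (labels == 1).sum()
    false_positives = int((flagged & (labels == 0)).sum())

    print(f"threshold={threshold:.2f}, sensitivity={sensitivity:.0%}")
    print(f"false positives sent for physician review: {false_positives}")
    # threshold=0.40, sensitivity=100%
    # false positives sent for physician review: 2

The price of pinning sensitivity at 100% is that the threshold is dictated by the hardest-to-detect true case, so specificity drops and the physician's review queue grows.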
[1] Hippocratic Oath:
https://en.wikipedia.org/wiki/Hippocratic_Oath
[2] The Hippocratic Oath:
https://pmc.ncbi.nlm.nih.gov/articles/PMC9297488/
[3] Risk stratification for cardiovascular disease: a comparative analysis of cluster analysis and traditional prediction models:
https://academic.oup.com/eurjpc/advance-article/doi/10.1093/...
YetAnotherNick
Assume you know for certain that AI has better sensitivity and specificity than your local physician for a particular diagnosis, which will likely be the case now or in a few years. Would you purposefully get an inferior consultation just because of the Hippocratic Oath?
gherkinnn
To answer your question: talking to a human.
Medicine is so much more than "knowledge, experience, and pattern matching," as any patient can attest. Why is it so hard for some people to understand that humans need other humans, and that human problems can't be solved with technology?
ianbutler
From what I hear from the women in my life, the human element of medicine is almost a strict negative for them. As a guy it hasn't been much better, but at least doctors listen to me when I say something.
AntiUSAbah
One doctor didn't want to give me Ritalin, so I went to another one.
One was against it; the other saw it as a good idea.
I would love to have real data, real statistics, etc.
Culonavirus
Yeah... No. I can't possibly disagree with this view more.
I don't need to "talk to a human", I need a problem with my meatbag resolved.
> humans need other humans and human problems can't be solved with technology
WTF are you talking about? Is this bait? You can't possibly mean this. Yes humans are social creatures, but what does that have to do with medicine? Are you talking about a priest, a witch doctor, a therapist? Because if you're not, that sentence is utter BS.
ipaddr
Doctors are not necessarily great at talking to patients, and patients are unhappy with the information doctors provide. This moat has dried up.
educasean
> human problems can't be solved with technology
How are you defining technology? How are you defining human problems? Inventions are created to solve human problems, not the theoretical problems of a fictional universe. Do X-rays, refrigerators, phones, and even looms solve problems for nonhumans?
Claiming something that sounds deep doesn’t make it an axiom.
idopmstuff
It seems likely to me that doctors whose job is almost entirely about making diagnoses and prescribing treatments won't be able to keep up in the long run, while those who are more patient-facing will still be around even after AI is better than us at just about everything.
If I were picking a specialty now, I'd go with pediatrics or psychiatry over something like oncology.
ethin
Because people believe they know everything about humans and how they work (or they hedge it). This is the exact same reason I don't trust supposed "experts" claiming AI will replace all these jobs: those same experts have no idea what the jobs actually entail. They look at the job title (and maybe the description) but have never once actually worked those jobs. And there is a huge chasm between "you read the job description" and "you actually know what it's like to be in this position and fully understand everything that goes into it."
ForceBru
"Human problems can't be solved with technology" is just wrong, unless you have narrower definitions of a "human problem" or "technology".
For instance, transportation is a "human problem." It's being successfully solved with technologies such as cars, trains, planes, etc. Growing food at scale is a "human problem" that's being successfully solved by automation. Computing... stuff could be a "human problem" too; it's being successfully solved by computers. And if "human problems" are more psychological, you can use the Internet to keep in touch with people, so that's again technology solving a human problem.
hyperpape
> we must assume that the best AI models (especially ones focused solely on the medical field) would beat the large majority of humans (i.e. doctors). If we already have this assumption for software engineers, we should have it for this field as well
This is a pretty wild leap. Code has a lot of hooks for hill-climbing during post-training: you can set up arbitrary scenarios and give the model more or less real feedback (actual programs, actual tests, actual compiler errors).
It's not impossible we'll get a training regime that does the "same thing" for medicine that we're doing for code, but I don't know that we've envisioned what it looks like.
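For a concrete picture of that feedback loop on the code side, here is a minimal sketch (assuming pytest is installed; the file names and the all-or-nothing scoring are my own simplifications) of the kind of verifiable reward signal code uniquely enables:

    import subprocess
    import tempfile
    from pathlib import Path

    def code_reward(generated_code: str, test_code: str) -> float:
        """Score one model sample by running a real test suite against it.
        Returns 1.0 if all tests pass, 0.0 otherwise. Medicine has no
        equally cheap, automatic source of ground truth."""
        with tempfile.TemporaryDirectory() as tmp:
            Path(tmp, "solution.py").write_text(generated_code)
            Path(tmp, "test_solution.py").write_text(test_code)
            try:
                result = subprocess.run(
                    ["python", "-m", "pytest", "test_solution.py", "-q"],
                    cwd=tmp, capture_output=True, timeout=60,
                )
            except subprocess.TimeoutExpired:
                return 0.0  # runaway code earns nothing
            return 1.0 if result.returncode == 0 else 0.0

    # e.g.:
    # reward = code_reward(sample,
    #     "from solution import add\ndef test(): assert add(2, 2) == 4")

The analogous signal for medicine would require ground-truth outcomes for every diagnostic decision, which is exactly what's slow, expensive, and noisy to collect.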
DrewADesign
Code is pretty much the perfect use case for LLMs… text-based, very pattern-oriented, extremely limited complexity compared to biological systems, etc.
I suspect even prose is largely considered acceptable in professional uses because we haven’t developed a sensitivity to the artifice, and we probably won’t catch up to the LLMs in that arms race for a bit. However, we always manage to develop a distaste for cheap imitations and relegate them to somewhere between the ‘utilitarian ick’ and ‘trashy guilty pleasure’ bins of our cultures, and I predict this will be the same. The cultural response is already bending in that direction, and AI writing in the wild— the only part that culturally matters— sounds the same to me as it did a year and a half ago. I think they’re prairie dogging, but when(/if) they drop that bomb is entirely a matter of product development. You can’t un-drop a bomb and it will take a long time to regain status as a serious tool once society deems it gauche.
The assumption that LLMs figuring out coding means they can figure out anything is a classic case of Engineer’s Disease. Unfortunately, this hubris seems damn near invisible to folks in the tech industry, these days.
sdwr
Emergency medicine is the coding of medicine. Fast feedback loop, requires broad rather than deep judgement, concrete next steps.
Improvements from AI coding should be partially transferable to other disciplines without recreating the training environment that made them possible in the first place. The model itself has learned what correct solutions "feel like," and the training process and meta-knowledge must have improved a huge amount.
Terretta
Humans tend to be very bad at connecting dots, which is why, when we imagine someone who can, we make a show like "House" about it.
IOW, these concept-connecting pattern machines are likely to outstrip median humans at this sort of thing.
That said, the exceptionally good smoke-detecting, dot-connecting humans I've observed in diagnostic professions are likely to beat the best machines for quite a while yet.
root_axis
Diagnosis is just a small part of a doctor's job. In this case we're also talking about an ER, which is a very physical environment. Beyond that, a doctor can examine a patient in a manner that isn't feasible for machines any time in the foreseeable future.
More importantly, LLMs regularly hallucinate, so they cannot be relied upon without an expert checking for mistakes. It will be a regular occurrence that the LLM states something obviously wrong, and society will not find it acceptable that their loved ones can die from vibe medicine.
Like with software though, they are obviously a beneficial tool if used responsibly.
dragonwriter
> After all, medicine is all about knowledge, experience, and intelligence (maybe "pattern recognition"). Given all of those, we must assume that the best AI models (especially ones focused solely on the medical field) would beat the large majority of humans
No, I don’t see that we must.
> if we already have this assumption for software engineers
No, this doesn't follow. And even if it did: I am aware that the CEOs of firms with an extraordinarily large vested personal and corporate financial interest in this being perceived to be true have said it about software engineers, but I don't think it's warranted there, either.
andai
Self-improving system given enough time to self-improve doesn't beat non-self-improving system?
oofbey
You’re holding on to the intuition (hope) that we are smarter than the LLMs in some hard to define way. Maybe. But it’s getting harder and harder to define a task that humans beat LLMs on. On pretty much any easily quantifiable test of knowledge or reasoning, the machines win. I agree experienced humans are still better on “judgement” tasks in their field. But the judgement tasks are kinda necessarily ones where there isn’t a correct answer. And even then, I think the machines’ judgement is better than a lot of humans.
Is medical diagnosis one of these high judgement tasks? Personally I don’t think so.
throw234234234
A personal anecdote: when people talk about their job w.r.t. AI, everyone says "at least I'm not a software engineer!" And to be clear, this isn't just a US phenomenon; I've seen it in other countries too, where thanks to AI, SWE and/or tech as a career has gone down the drain in status. Then they always go on to defend why their job is different, citing "human touch," "asking the right questions," etc., not knowing that good engineers also need to do these things.
The truth is we just don't know how things will play out right now, IMV. I expect some job destruction, some jobs to remain in all fields, some jobs to change. We assume a job will be totally destroyed or not, when in reality most fields will land somewhere in between. The mix of these outcomes is yet to be determined, and I suspect most fields will combine AI and humans in different ratios. Some fields also have a lot of unmet demand that can absorb the efficiency increase (health, for example).
largbae
But liability and ethics cannot be put aside. If treatments were free and perfectly addressed problems, then a correct diagnosis would always lead to the optimal patient outcome. In that scenario, AI diagnosis would be like code generation, asymptotically approaching perfection as models improve.
But a doctor's job in the real world today is to navigate a total mess of uncertainty: about the expected outcome of treatments given a patient's age and other problems, about the psychological effect of knowing about a problem they cannot effectively treat, even about what the signals in the chart and the x-ray mean.
We are very far from having unit test suites for medical problems.
GorbachevyChase
Liability would put all this to bed. Is OpenAI liable for malpractice if it misdiagnoses your issue? No? Then it’s no substitute. Being right is not nearly as important as being responsible. Unfortunately, there is widespread perception that software defects are acceptable, whereas operating on the wrong leg isn’t.
snickerbockers
>AI diagnosis will be like code generation and go asymptotic to perfection as models improve
uhhhhhhh, I'm pretty behind-the-times on this stuff so I could be the one who's wrong here but I don't believe that has happened????
But anyways that nitpicking aside I agree with you wholeheartedly that reducing the doctor's job to diagnosis (and specifically whatever subset of that can be done by a machine-learning model that doesn't even get to physically interact with the patient) is extremely myopic and probably a bit insulting towards actual doctors.
brookst
Isn't that conflating diagnosis and treatment plan?
Aurornis
When you read through the article it shows that the gap between doctors and LLMs actually disappeared (in terms of statistical significance) once both were allowed to read the full case notes.
The headline quotes a number based on diagnoses guessed from nurses' notes. My guess is that the LLM was more willing than the doctors to guess from the selected case studies.
Intralexical
Not only is the study testing something which only vaguely resembles how doctors diagnose patients, but isolated accuracy percentages are also a terrible way to measure healthcare quality.
If 90% of patients have a cold, and 10% have metastatic aneuristic super-boneitis, then you can get 90% accuracy by saying every patient has a cold. I would expect a probabilistic token-prediction machine to be good at that. But hopefully, you can see why a human doctor might accept scoring a lower accuracy percentage, if it means they follow up with more tests that catch the 10% boneitis.
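A quick back-of-the-envelope in Python, using the made-up numbers above, shows how the base rate games an isolated accuracy score:

    # 1000 patients: 90% have a cold, 10% have the rare serious disease.
    cases = ["cold"] * 900 + ["boneitis"] * 100

    # A "classifier" that always predicts the majority class.
    predictions = ["cold"] * len(cases)

    accuracy = sum(p == c for p, c in zip(predictions, cases)) / len(cases)
    caught = sum(c == "boneitis" and p == c for p, c in zip(predictions, cases))

    print(f"accuracy: {accuracy:.0%}")        # accuracy: 90%
    print(f"serious cases caught: {caught}")  # serious cases caught: 0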
mhitza
I think AI can be useful for any kind of context interpretation, but it shouldn't make the decision.
It could run in the background on patient data and message the doctor: "I see X in the diagnostics; have you ruled out Y? It fits for reasons a, b, and c."
I like my coding agents the same way: informing me during review of things I've missed, instead of having me comb through everything they generate on a first pass.
tensor
Interestingly, this recent study using ChatGPT Health gave quite a different outcome (https://www.nature.com/articles/s41591-026-04297-7). Here it was wrong about emergency triage 50% of the time.
brikym
I think it's plausible since doctors tend to have human cognitive biases and miss things. People tend to fixate on patterns they're most familiar with.
namuol
That's a bold claim; it suggests LLMs aren't prone to biases of their own, ones that are less well understood.
mitkebes
LLM biases are being studied pretty consistently. Obviously that doesn't mean we know all of them, but it's being actively worked on.
Meanwhile, every human doctor is a unique person with a completely different set of biases. In my experience, getting a correct diagnosis or treatment plan often involves trying multiple doctors, because many of them will jump to a common diagnosis even when the symptoms don't line up and the treatment doesn't actually help.
_heimdall
I haven't finished reading the linked paper, but I'm intrigued by the assumption that the results are illusory when the model isn't given access to the x-rays.
That seems like a very reasonable takeaway, but it skips over the other possibility: do the x-rays make the results less accurate?
sandeepkd
These types of experiments are bound to have biases depending on who is doing them and who is funding them; an experiment may be funded precisely to move the narrative in a desired direction. This is probably a good argument for government-funded research in sensitive areas like this.
AntiUSAbah
Weird that this is the case, and from a new study, because those kinds of x-ray models are already actively used. They aren't used as the only and final diagnosis, though. It's more like peer review and prioritization: check this image first because it seems the most critical today.
lukko
I'm surprised at both the article and the paper; both seem very hyperbolic. This is LLMs competing against doctors in a way that is heavily weighted in the LLMs' favour and does not represent clinical practice. These reasoning cases are not benchmarks for doctors; they are learning tools.
It's important to note that diagnosis also relies on an accurate description of the patient in the first place, and the information you gather depends on the differential diagnosis. Part of the skill of being a doctor is gathering information from lots of different sources and filtering out what is important. That may mean the patient, who may not be able to communicate clearly or may be nonverbal, or carers and next of kin. History-taking is a skill in itself, as is examination. Here, those data are simply given.
For pattern recognition from plain text, especially on questions that may be in o1's training data, I'm not surprised at all that it would outperform doctors, but this doesn't seem to be a clinically useful comparison. Deciding which investigations to order, what imaging to do, and which parts of the history to discard is a skill in itself, and can't really be separated from forming the diagnosis.
lokar
Also, you need to see an analysis of the incorrect calls. The goal of a human doctor is not to get the highest accuracy; it's to limit total harm to the patient. There are cases where the odds favor picking X (perhaps not by much), but the safe thing to do is to rule out some other option first, or to start a safe treatment that covers several of the other possibilities.
Simply getting the "high score" on this evaluation is not necessarily good medical treatment.
creativeSlumber
> "An AI and a pair of human doctors were each given the same standard electronic health record to read"
This handicaps the human doctor's abilities. There is a lot more information a human doctor can gather even from a brief observation of the patient.
djb_hackernews
Can't the same be said for the AI?
smt88
If the answer is yes, let’s see that study.
This one compares AI to a human doctor practicing in a very unrealistic way.
cogman10
Agreed. I think the best use of this sort of tech is to play both to their strengths: use AI to go over the record and suggest diagnoses, which the doctor then reviews after observing the patient.
The other thing is that common issues are common, and I have to wonder how much that ultimately biases both the doctor and the LLM. If you diagnose everyone who comes in with a runny nose and a cough as having the flu, you will be right most of the time.
kqr
On the other hand,
> there are few things as dangerous as an expert with access to open-ended data that can be interpreted wildly, like a clinical interview.
https://entropicthoughts.com/arithmetic-models-better-than-y...
jrm4
This feels like a deeply important observation. It would also be interesting to include, e.g., a short video or photograph for the AI to use.
delfinom
Bonus: health networks now push doctors to use AI transcription software for EHR entries. Doctors and nurses like it because they don't have to type things up. But whether the records are ever reviewed for transcription errors, which happen quite often, is a complete shitshow.
Now feed a flawed transcript into an AI diagnosis system and bam-o: the AI will treat it as gospel, while a doctor might go "wait, what?"
manmal
I know a cardiologist who founded a training and knowledge-base startup for doctors. He once told me (before LLMs) that it's super common for a doctor to tell a patient they need to look something up in the patient's history, and then instead google the symptoms, or, even more often, quickly text a colleague.
I have no way of knowing if this is true. But I'd rather have a complete, guided prompt be the basis of a diagnosis than a two-minute Google search.
01100011
I wouldn't put much weight on this study, but I think a lot of us can still attest to the usefulness of LLMs in self-diagnosis. The reality in the US is that it's difficult to get the attention and care of a doctor, so we're left having to do it ourselves. Ten years ago you'd hear docs complaining about patients coming in with things they found on Google, but now I don't think there's an alternative.
Case in point: I went to a podiatrist for foot and ankle issues. He diagnosed the foot issues from the x-ray but just shrugged his shoulders at the ankle issues and said the x-ray didn't show anything. My 15-minute allocation of his attention expired, and I left without a clue as to the cause or what corrective actions to take. Five minutes with an LLM and I had a plausible explanation for the ankle issues, one that aligned with the diagnosis for my foot.
guidedlight
I agree. I think the issue with LLMs is not the correct diagnoses but rather the incorrect ones.
Real doctors tend to have a degree of caution. I would rather have a real doctor hesitate and seek more information than an alarmist LLM suggesting I have cancer.
NegativeK
I don't think that using LLMs for medicine is an appropriate fix for the US's healthcare issues.
Unless healthcare businesses decide to improve patient care with AI instead of increasing patients per day, I think it's going to make things even worse.
vjvjvjvjghv
Doctors using AI will probably just increase the number of patients they see. But for me as a patient, AI is super useful for getting a good handle on the situation before I see a doctor.
01100011
I'm not suggesting it as a fix. I'm saying it's the only option to get medical answers for many people.
jmpman
Besides myself and my wife, I've also used LLMs to diagnose my dogs. I'm convinced there's a huge opportunity for AI-based veterinary care, especially a service that then runs bidding across the local veterinary clinics to perform the care/surgeries. I've noticed that local vets vary in price by more than an order of magnitude. My 80-year-old mother and mother-in-law have been regularly scammed by overcharging vets, and with their dogs being a major part of their lives, they're extremely susceptible to pressure.
gizmodo59
The negative reactions here baffle me. The fact that we can even get to, say, 30% with a computer is amazing. So much hatred towards AI and anything from frontier labs like OpenAI (or Google, for that matter) makes no sense.
pinkmuffinere
There is a lot of negativity towards AI. However, there are also real shortcomings to this study. IMO the issue is that the AI was given case notes for a patient but was never shown the patient directly. This is both different from what a doctor is trained for and unnecessarily limiting compared to what a doctor can actually do; a lot of the value doctors deliver comes from talking to the patient. The notes being used were probably written by doctors to begin with.
The headline makes it sound like AI is going to replace doctors, but this is more like "AI can do this one niche task better than doctors can do this one niche task." I think the real reward here is that the doctor+AI unit should perform better than the doctor in isolation: in cases where a doctor would have to read case notes and draw a conclusion, the doctor can now rely on AI for pretty good suggestions.
tuananh
> real reward here is that the doctor+AI unit should perform better than the doctor in isolation
That is true for other professions as well.
While everyone is afraid of layoffs, the real question is always whether "employee+AI" is better than the employee or the AI alone.
vector_spaces
Why are you baffled? The most upvoted critical comments mostly explain their reasoning, and I don't think their reasons are very technical. When the stakes are higher, we should generally be more critical, not less.
thephyber
That’s what they said about Enron.
Skepticism is an incredibly useful tool, even in excess.
an0malous
I, for one, am delighted that my acquaintances in the medical field, with their cushy, cartel-supported salaries, get to feel the existential dread of AI coming for their jobs like I have.
krupan
I'm sorry that you are feeling existential dread about your career. It could help to stop listening to the hype that the people selling AI are spewing and take a hard look at the tools themselves. Like most products, they aren't as good as the salespeople say they are. Also, take any predictions for how these products will do in the future with a huge grain of salt. Predicting the future is very difficult. It's taken us 70 years of computer and AI research and development to get to this point. It's likely that the rate of improvement will not change drastically. Yes, things are changing, but the singularity (still) is not coming tomorrow
12345ieee
Oh no, imagine the people that save human lives having high salaries, the horror.
If you, like me, are in the software field, know that this is likely the most comfortable job ever invented by humanity; by that logic we should really be paid just above the poverty line in exchange.
robocat
Everyone is taught that doctors save lives.
However many others in society save lives that are not so lavishly praised or financially rewarded.
For example, in New Zealand the median pay for a Road Design Engineer is about $100k NZD, compared to $240k for a GP (doctor). Plus the doctor is paid a massive premium in social status.
Over a 40-year career, an average NZ GP will save 5 to 10 lives. The Road Design Engineer saves 40 to 120 lives. Road engineers in NZ also prevent roughly 10x more serious injuries than deaths, so it isn't just death stats.
Our hypothetical engineer should be paid > 10x more than the doctor on raw stats.
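Back-of-the-envelope with those figures (midpoints of the ranges, salary per life saved; a rough sketch only):

    # Midpoints of the ranges above, over a 40-year career.
    gp_salary, gp_lives = 240_000, 7.5    # NZD/year, lives saved per career
    eng_salary, eng_lives = 100_000, 80

    career_years = 40
    gp_cost_per_life = gp_salary * career_years / gp_lives
    eng_cost_per_life = eng_salary * career_years / eng_lives

    print(f"GP:       ~${gp_cost_per_life:,.0f} of salary per life saved")
    print(f"Engineer: ~${eng_cost_per_life:,.0f} of salary per life saved")
    print(f"Ratio: ~{gp_cost_per_life / eng_cost_per_life:.0f}x")
    # GP:       ~$1,280,000 of salary per life saved
    # Engineer: ~$50,000 of salary per life saved
    # Ratio: ~26x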
It gets harder when we start looking at quality of life versus raw lifetime numbers. You then need to consider the value of, say, entertainment (a good movie) versus the hypothetical lives saved by spending that budget elsewhere.
A game designer might be valued highly by a gamer mum, and negatively by their children and gaming-widowed dad.
an0malous
Give me a break, most of them are glorified drug dealers. Their salaries are inflated by an artificially capped supply of doctors, at the cost of patients.
I had to leave my job this year because of burnout, after the execs mandated that we use AI tools, become our own designers, PMs, and QA, and double our velocity. They run through a decision tree they learned in residency every day, while I'm learning how to do 3-4 other people's jobs on top of whatever the new AI thing is. I was working nights and weekends while my friends in medicine were planning their third vacation of the year to Tuscany.
droidjj
The paper: https://www.science.org/doi/10.1126/science.adz4433 (April 30, 2026)
Tenobrus
o1 has a METR time horizon of around 40 minutes; Opus 4.7 has an implied horizon of 18 hours based on its ECI score. This study is on a model that's several generations behind w.r.t. the kinds of tasks it can complete. It would be shocking if this number were anywhere near as low with GPT 5.5, to the point that it seems nearly irrelevant to talk about these results.
mawadev
I don't think such critical situations are a good use case for AI. Maybe in a decade we'll have AI help doctors with a pre-check. But what if the AI finds nothing and the doctor doesn't bother to look further? It's this small question that breaks the technology from every angle later down the road, from my POV. AI has to stay optional here.
Even if AI is used to sample or summarize more data than a human could get through in time: what if it misses something a human wouldn't? What if the human, inversely, misses something the AI wouldn't? Would you rather trust the machine or the human? (Especially when the human is the one held accountable.)
henry2023
You can replace "AI" with "blood tests" in your comment and the same questions are relevant today.
jmcgough
LLMs can be a useful second opinion for a highly educated patient with good insight into their health and body, but this is not the average patient I see in an urban emergency department. Many patients can't give a cohesive history without a skilled clinician who can ask the right questions and read between the lines.
I am very skeptical of studies like this that don't adequately reflect real world conditions, but when I was a software engineer I probably wouldn't have understood what "real" medicine is like either.
matheusmoreira
You went from software to medicine? Pretty cool to discover I'm not alone in this world.
> LLMs can be a useful second opinion for a highly educated patient with good insight into their health and body
I have the same opinion. It's just like software in this regard. A person who's already knowledgeable can prompt well and give detailed context, and tell when the LLM is confidently bullshitting or just plain being lazy. That is not the reality of the average person.
I tried using Claude to help with some hard cases a couple of times and it was very prone to jumping to conclusions based on incomplete information. It was excellent as a research buddy though. I'm using it to great effect to keep myself up to date.
beering
o1 is several generations old and was released in 2024. Is this some quite old research that took a long time to get published?
nhinck2
It's also important to note that it beat doctors at diagnosing in a way that doctors do not actually diagnose.
SpicyLemonZest
Yes, the preprint of the same paper (https://arxiv.org/abs/2412.10849) was first written in December 2024.
oofbey
Medical research moves. Very. Slowly.