Brian Lovin / Hacker News Daily Digest

endymion-light

There's an incredibly serious lack of education about how LLMs & carb counting work. This entire article would be better suited to astrology.com than Hacker News.

When I opened it up, I assumed the author would have at least attempted a calculation pipeline, maybe even fed something like the size of the meal into an actual model, integrating pre-existing tools that are (slightly more) accurate. Hell - most packaged food is literally required to carry calorie information, and you can query open data sources for the rest!

But the author just took pictures of food & expected a realistic response? Is this genuinely what amounts to a study in AI?

This is akin to the Instagram reels where people talk to ChatGPT and ask it to time how long their run is. Except those are treated as funny jokes rather than being turned into studies.

I'd like to see this study done with some kind of actual grounding knowledge, seeing what mistakes AI makes when attempting to recover ground truth from picture analysis - there would at least be an interesting methodology and result in that.

kalleboo

> But the author just took pictures of food & expected a realistic response?

There are very popular apps on the App Store right now that are going viral among non-techie people that do exactly this, and they have no concept of how AI works. My wife was talking about one and I had to give her a reality check that the AI had no idea what ingredients were used to make the food. And she's a licensed nutritionalist.

Studies like this create something to point at for people who are confused and serve as a springboard for a conversation in the media.

endymion-light

That's true - I suppose I'm just disappointed that this study doesn't seem to have included those in its analysis. Being able to point out that the top 100 calorie-counting apps on the App Store return similar results to simple frontier models would be of interest.

I think I'm just disappointed that this study doesn't go deep enough, and stays at a surface-level statistical analysis of frontier models.

dpark

I think it’s a very useful study specifically to debunk the apps that support this flow.

None of those apps have magic. They cannot do better than the frontier models.

asdfasgasdgasdg

To be fair these kinds of apps also existed before LLMs. They just used OpenCV or similar instead of the LLM APIs.

inerte

To be fair, my expectation is that those apps have done the prompt engineering, and schema, and tools (to query a nutrition database), etc... and although they're not 100% consistent, the margin of error should be narrow to the point that it barely matters, and they should do a bit better than a random ChatGPT chat session.

Centigonal

The problem isn't one that can be solved with prompts. If I gave a panel of food and nutrition experts (human or machine) a bunch of pictures of food, they still wouldn't be able to tell if, e.g., a slice of cake was made with whole milk or skim.

The "pic of packaged food --> LLM --> nutrition DB call" pipeline is workable, but many users of these apps are using them for fresh prepared foods, which is just an unworkable problem without either an understanding of the preparation process or a bomb calorimeter.
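
A minimal sketch of what that workable packaged-food pipeline could look like. The vision step is stubbed out and the nutrition table is toy data; all names and values here are hypothetical, not from any real app or database:

```python
# Sketch of the "pic of packaged food -> LLM -> nutrition DB call" pipeline.
# identify_product() stands in for a vision-model call, and NUTRITION_DB
# stands in for a real database (e.g. USDA FoodData Central). All values
# here are toy data, not real nutrition facts.

NUTRITION_DB = {
    "tortilla chips": {"serving_g": 28.0, "carbs_g": 19.0},
    "white bread slice": {"serving_g": 26.0, "carbs_g": 15.0},
}

def identify_product(image_bytes: bytes) -> str:
    # Placeholder: a real app would send the photo to a vision LLM and
    # get back a product label. Hardcoded here for illustration.
    return "tortilla chips"

def carbs_for_photo(image_bytes: bytes, weighed_g: float) -> float:
    """Look up the identified product, then scale carbs by the weighed amount."""
    entry = NUTRITION_DB[identify_product(image_bytes)]
    return entry["carbs_g"] * weighed_g / entry["serving_g"]

print(carbs_for_photo(b"...", 56.0))  # two 28 g servings -> 38.0 g carbs
```

The catch, as noted above, is that this only works when the photo maps to a known packaged product: a plate of fresh prepared food has no database row to look up.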

whazor

The real benchmark should be comparing the amounts with a human guess. And as far as I know, with diabetes, if you are within 30% when guessing carbs then you should be fine.

xnx

Even simpler examples make the limitations obvious. Images can't distinguish Diet Coke from Coke.

senordevnyc

> licensed nutritionalist

Nutritionist?

kalleboo

Haha oops. English is hard...

Insanity

[flagged]

furyofantares

From the text of the article I believe the author is implying there are apps doing exactly this, and so this is why it was studied that way.

Had the author written the article themselves rather than an LLM their motivation probably would have been clearer.

Brendinooo

> there are apps doing exactly this

Yeah, for sure there are. And people will just ask ChatGPT as well.

The funny thing is that for people who are just trying to lose weight without managing any health issues precisely, this type of extreme variance doesn't really matter, because the mere act of consciously quantifying food consumption is, based on my experience counting calories, the single biggest factor in success with weight loss.

criley2

I actually think "just asking ChatGPT" is fine, because A) the data in these apps is suspect at best and B) the data behind calories is also pretty suspect (but we all play along because we can adjust other variables to make it all "work" well enough).

Once or twice a year I spend a few weeks meticulously measuring ingredients/cooked foods and recording calories and on complex recipes apps are next to useless at getting accurate data. You're trying to input five or ten relevant ingredients, and then weighing your cooked outcome to try and divide the ingredients by proportion. Frankly it's a mess and most people aren't doing it for home cooked meals, and are getting very lossy outcomes (weighing cooked chicken and marking it as raw chicken, etc)

With reasoning and tool calling (combined with me meticulously weighing before and after), it's producing fine data for my purposes.

ozgung

The author uses the prompts and method from an open-source app that connects to an insulin pump, a medical device. I think AI food identification is an experimental feature in the app.

> The prompt was adapted from the one used in the iAPS open-source automated insulin delivery system — it’s a real production prompt, not a toy example.

https://github.com/Artificial-Pancreas/iAPS

I think these are the prompts in the app: https://github.com/Artificial-Pancreas/iAPS/tree/5eabe22e7e2...

sjhatfield

Exactly. This is not paid software. We assume full responsibility for outcomes when using it. There's a reason it's not on any app store. I'm glad features like this are being experimented with. Not how I would use AI to estimate carbs...

Ancapistani

True, but I'm working on a product that's "adjacent to" this sort of thing, and we also have a "food recognition" feature that's marked as experimental. Our users are using it, and now I plan to push fairly hard on at least measuring the accuracy and hopefully exposing those results to our users regardless of how well it performs.

andrewvc

One of the biggest gaps is that people don't understand that food labels are allowed by the FDA to be off by up to 20% in terms of the number of actual calories!

In the real world you need to calibrate your behavior with the results. Are you gaining weight? You'll need to eat less if you want to lose any. You can do all the math with nutrition labels and macros you want but that's all theoretical.

See this study below for the 20% figure, as well as their experimental results on real food items (some even exceeded this threshold though most were within it). https://pmc.ncbi.nlm.nih.gov/articles/PMC3605747/?st_source=...

bonoboTP

A large part of the effectiveness in counting calories is that you pay more attention and make more conscious decisions and are less likely to "cheat" if you have to enter it in your food log.

It's indeed like astrology. Simply thinking about personality traits and thinking through your life, your desires and goals, and your current situation is already beneficial in taking charge of and navigating your life.

smallmancontrov

I'd take the opposite point of view: just thinking about life, desires, and goals is how people wind up paying $10 to Whole Foods for a slice of "healthy pizza" that is nutritionally identical to any other pizza but comes out of a cute stone oven and is displayed on a wooden platform next to green leafy plants. Vibes astrology is notoriously easy to exploit, both by your own sugar/salt/fat-seeking instincts and by unscrupulous commercial forces, let alone the two working together. The unique thing about calorie counting is that it cannot be exploited like vibes astrology. Not even with a 20% error margin, which is (probably not coincidentally) the caloric deficit targeted by standard dieting advice.

ijk

That's an interesting bit, where reducing friction too much can eliminate the side effect that is actually driving the desired results.

Do you want to count calories, or do you want to lose weight? Sounds like it's possible to hyper-optimize calorie counting to the point that it becomes counter-productive...

johsole

Thank you for linking the study.

Some good news from it: if you weigh the food instead of depending on the stated serving size, then the labels become much more accurate!

"Serving size, by weight, exceeded label statements by 1.2% [median] (25th percentile −1.4, 75th percentile 4.3, p=0.10). When differences in serving size were accounted for, metabolizable calories were 6.8 kcal (0.5, 23.5, p=0.0003) or 4.3% (0.2, 13.7, p=0.001) higher than the label statement."

If you look at the table "Deviation of metabolizable calories from label calories" [https://pmc.ncbi.nlm.nih.gov/articles/PMC3605747/figure/F1/] you'll see that most labels, even for serving size, are pretty good, and there are some that are really bad.

If you look at one of the worst offenders, Tostitos: the label has "Tostitos Tortilla Chips - serving size 24 chips", but chips vary a lot in size, so you could have a huge variance in weight. If instead you weighed them, which I do with my chips, I bet the calories are much closer to the label.

Body composition comes down to routine. I've found I love to eat, but I pretty much eat the same meals week over week, which makes it extremely easy for me to lose or gain weight depending on my goals.

adrian_b

Moreover, the calorie numbers for raw ingredients are much more accurate than for snack foods, where the amount of each ingredient may vary from nominal, even when the total mass is nominal.

So when you cook yourself and you weigh the ingredients used for cooking, you can know the real calorie content with far more accuracy than when buying ready-to-eat food.

CGMthrowaway

The errors either cancel out (if error in both directions) or they work in one direction ("bad" foods like junk systematically underestimate calories, "good" foods like protein powder systematically overestimate calories).

Either way, if you count calories and compare to your weight gain/loss over a few weeks and adjust your calorie target as warranted, assuming the types of food you are eating do not change drastically (e.g. you calibrated on regular diet and now have started an elimination diet), the error bars can be basically ignored.
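
That calibrate-and-adjust loop can be sketched in a few lines. This assumes the common rule of thumb of roughly 7700 kcal per kg of body weight, which is an approximation I'm adding, not a figure from the thread:

```python
# Sketch of the calibration loop described above: trust the scale, not the
# labels. Uses the common ~7700 kcal per kg of body weight rule of thumb
# (an approximation, not a clinical constant).

KCAL_PER_KG = 7700

def estimated_maintenance(avg_daily_intake_kcal: float,
                          weight_change_kg: float,
                          days: int) -> float:
    """Back out maintenance calories from logged intake and the weight trend."""
    daily_surplus = weight_change_kg * KCAL_PER_KG / days
    return avg_daily_intake_kcal - daily_surplus

# Logged ~2500 kcal/day (per possibly-inaccurate labels), gained 0.7 kg in 14 days:
print(round(estimated_maintenance(2500, 0.7, 14)))  # -> 2115
```

Systematic label errors wash into the maintenance estimate automatically, which is why the error bars can mostly be ignored as long as the diet's composition stays stable.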

andrewvc

Exactly! My point was that despite the precision of calorie labels we should really think of them as ballpark estimates.

strken

Realistically the labels are going to be much closer for staples like long-grain Jasmine rice or olive oil, if they're measured by weight.

It's just not that easy to change the nutritional content of a kilogram of a known cultivar of dry rice when it's passed all the standard checks for moisture content, protein content, etc.

ludicrousdispla

how many calories are in a strawberry?

nekusar

It's even weirder.

What has more calories: 1 lb of peanuts, OR 1 lb of peanuts ground into peanut butter?

I can't find the study, but the peanut butter has more calories since it's pre-ground and more bioavailable. Peanuts get chomped up, but larger pieces still remain and are not captured by the body.

stogot

Does this only apply to calories or to other categories too?

ndisn

[flagged]

throwaw12

I feel like you didn't understand the goal of this study

> The DTN-UK stated earlier this year that generic LLMs must never be used as autonomous advisory calculators for insulin delivery. This data is the quantitative evidence base for that statement.

This study is to prove that you should not rely on LLMs

lukeschlather

The thing is, it doesn't really prove LLMs can't do this; it proves no existing frontier LLM can do this.

The part where they talk about sampling multiple runs is interesting - it suggests to me that in the next few years as the reasoning process is improved the models may be able to do that autonomously.

My mind really goes to using a dedicated object detection model fine-tuned with nutrition information, but I don't think there's a fundamental reason LLMs can't eventually manage this use case, except perhaps the size of the needed weights being prohibitively large.

tsimionescu

Per some people, LLMs of the future can do literally anything that's possible to do. They could create quantum computers powered by fusion power.

That has nothing to do with the question being asked: can you rely on an LLM today to help you track carbs as a diabetic?

This is very explicitly what the article is all about. Potential future LLMs are entirely irrelevant.

The_Blade

that is good to know. presented this way i find LLM behavior to be a feature, not a bug. then again i think everything is value add over pen and paper / notepad / spreadsheet and maybe a friend or doctor (or specialized equipment if you need more than calorie in, calorie out). just go exercise and don't be a lard lad

devilbunny

> just go exercise and don't be a lard lad

You can out-exercise almost any diet, but it takes 3-4 hours a day of a hard workout.

If calories in, calories out was useful advice rather than a banal statement of physics, nobody would be fat.

tsimionescu

> just go exercise and don't be a lard lad

This is about people suffering from diabetes tracking their insulin needs. You can outrun any diet, but not insulin shots.

fabian2k

The paper itself is a lot clearer about the purpose. The blog post reads very clickbaity and doesn't really explain the context well.

Aurornis

I disagree, it clearly explains that AI carb counting apps are a problem and shouldn’t be used.

They’re writing in a neutral way that reaches their audience without lecturing or being condescending. They lead the reader to the conclusion rather than shoving it at them. I think that’s why it’s triggering so many angry comments on HN, but it’s effective for the audience they’re writing for (non technical people who may need convincing but don’t like being preached at)

snapcaster

But it's stupid. If I smack myself in the head with a hammer, is that proof hammers shouldn't be relied on?

fc417fc802

If you smack yourself in the head with a hammer and it injures you that's evidence that smacking people in the head with hammers is bad and shouldn't be done, right?

jkestner

Here we’re at the origin of the tool and get to watch how many people hit themselves in the head before we learn this collective wisdom.

There’s a gap between what the tool will allow you to tell it to do, and what it’s good at. The feedback mechanism to tell the difference is deficient compared to a hammer.

coldtea

No, but it would be proof you didn't get the point of the paper.

undefined

[deleted]

jmye

Are there start-ups led by idiots suggesting that smacking yourself in the head with a hammer will help treat your diabetes?

If not, then perhaps there's a problem in your analogy.

Aurornis

> But the author just took pictures of food & expected a realistic response? Is this genuinely what amounts to a study in AI?

The article explains this: There are apps targeting people with diabetes that claim to count your carbs with AI.

> If you’re using AI carb counting in a diabetes app

Before you dismiss a study, try to understand where it’s coming from.

The authors of the study weren’t stupid. They knew the LLMs would provide poor results. They ran the study to quantify it and create a resource to spread the information in response to the rise of AI carb counting apps.

ijk

> The authors of the study weren’t stupid. They knew the LLMs would provide poor results. They ran the study to quantify it and create a resource to spread the information in response to the rise of AI carb counting apps.

Yeah. I think it is under-appreciated that much of science is intended for debugging purposes. Sure, you and I know that X is positive, but what's its actual value? Can we find the causes that make it that way? Et cetera.

endymion-light

I don't believe the authors of this study are stupid.

If there are apps targeting people with diabetes that claim to count your carbs with AI, why haven't those been analysed? That would be a far more effective claim.

I based that on the clickbait article they wrote about the study - I'll read through the study itself to see whether they analyse that, but it would be far more effective to see if the 'carb-counting' AI apps return similar results to the frontier models - that's an interesting result that could actually move the discussion forward.

Aurornis

> If there are apps targeting people with diabetes that claim to count your carbs with AI, why haven't those been analysed? That would be a far more effective claim.

Because the apps aren’t going to let you submit 29,000 automated requests for statistical analysis.

And if you did, the authors of those apps would just release an update saying they changed models and try to dismiss the study.

The vitriol against this article on HN is sad. Commenters who agree with the article and its conclusions are grasping for reasons to be angry about it anyway

tsimionescu

The linked "click bait" article explains this very clearly as well. It clearly explains the methodology: they took the prompt sent to an LLM by a popular open source carb counting iOS app and sent it, together with five different pictures of food that a typical person might take, to all of the frontier models, and checked the responses. They also explain the purpose: to check the possible accuracy of this approach taken by a real app that real people use.

The fact that you somehow perceived this as an attack on LLMs as a technology is a failure entirely on your part. There is nothing in the article that suggests that people shouldn't use LLMs for other purposes - just a statistical verification of the fact that they shouldn't be used for this one particular thing.

ilivethere

Typical case of the "curse of knowledge". We deal with AI on a daily basis on the technical level, so it's very easy to forget that the "common" folk really still believe that AI can replace dieticians, gym coaches, etc

coldtea

>But the author just took pictures of food & expected a realistic response? Is this genuinely what amounts to a study in AI?

If there are commercial services where you take pictures of food and are promised a realistic (paid for) response, then yes. And there are.

dahart

And what’s the variance & accuracy of their responses? Isn’t comparing the models’ variance to baseline human variance what matters here? It seems like they didn’t do that, and I agree with parent’s call for that kind of baseline.

Having counted calories for years, I don’t think I could reliably estimate the calories or carbs in the example picture of a cheese sandwich. I can make assumptions about the bread and the cheese, but I might easily be off by 2-3x. Calorie counting apps that use text descriptions also have huge variance for the same thing. The problem might be the belief that a picture or description is enough, regardless of who or what is guessing…?

Edit: Ah, I see from sibling thread you meant commercial services are LLMs, I thought you meant there were human-backed services to compare to. Anyway, I totally agree there’s a problem if people rely on AI for safety, but I’m not sure LLMs are the core issue here, it seems like using vague information and guessing is the core issue.

swiftcoder

> Isn’t comparing the models’ variance to baseline human variance what matters here?

You seem to be missing the context that this isn't just about diet apps - this is about apps claiming to be able to track carbs sufficiently accurately to be used in a medical context to dose insulin (a substance which can be lethal if incorrectly dosed)

endymion-light

But I don't see them using those commercial services in this study - instead, they're going straight to the frontier model companies? Is Gemini advertising that you get a realistic calorie count from a picture? Maybe so - in which case I'd take it back!

notahacker

The commercial services likely also have frontier model dependencies...

The opening to the actual paper is quite explicit that (i) other studies have already tested commercial apps, with unimpressive results, and (ii) a popular open source app for carb counting directly relies on API calls to these frontier models, and this research batch-tested the images using the exact same models and prompts as that app.

coldtea

Are commercial services anything more than just UI facades on top of frontier model APIs?

swalsh

It amazes me how much people try to build AI systems relying on nothing more than the model's knowledge. I suspect a great deal of the "failed" AI experiments we keep reading about are people just not having any idea how to use AI at what it's good at.

nextlevelwizard

As someone who used to do this: OpenAI models refuse to look up calories unless you explicitly tell them to, and even then it's hit and miss even if you tell them exactly what the product is. The easiest way to get a good calculation is to just take a photo of the nutrition label, or feed that info in by hand.

Funny thing is 4o did look up calories but I guess it was too good for this world

the_duke

I exclusively use thinking mode, which is slower but much more likely to double-check things with web search etc.

nextlevelwizard

Maybe. I stopped using OpenAI a while ago. But taking pictures of the nutrition labels was good enough

harperlee

There is a lot of hate in the comments but there is some merit to the post existing:

  1. Even if the task is unreasonable, it is good to showcase that the LLM will perform poorly - a warning not to use it for diabetes.

  2. As it is a probabilistic model, the approach was to execute it multiple times and look at the distribution. They also tried to minimize variance: "All at the lowest randomness setting these models offer.", the post mentions. Yet the variance of the responses is surprising.

  3. A multimodal LLM should in general be able to discriminate between crema catalana and a cheese sandwich, and provide a textual, uncalculated range of how many calories the item has (the internet is full of tables for calorie counting and things such as this https://fitia.app/calories-nutritional-information/cheese-sandwich-1205647).

  4. It is not clear whether the "expose" surprised/outraged style is just a communication vehicle or whether the author really thought that e.g. LLMs could hypothetically be able to provide confidence estimates.
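
Point 2 can be sketched as a simple repeated-sampling check. The model call here is a random stub, purely illustrative of how one would quantify the spread of answers; the numbers are made up:

```python
# Sketch of point 2: send the same input many times and examine the spread.
# estimate_carbs() is a random stub standing in for a real vision-LLM call;
# only the aggregation logic is the point.
import random
import statistics

random.seed(0)  # fixed seed so the illustration is reproducible

def estimate_carbs(image: str) -> float:
    # Stand-in for a model call that returns a carb estimate in grams.
    return random.uniform(20.0, 60.0)

samples = [estimate_carbs("cheese_sandwich.jpg") for _ in range(100)]
mean = statistics.mean(samples)
cv = statistics.stdev(samples) / mean  # coefficient of variation
print(f"mean={mean:.1f} g, spread (CV)={cv:.0%}")
```

A deterministic system would give a CV of zero for the same input; the study's finding is essentially that the real CVs are large enough to matter for insulin dosing.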

bcjdjsndon

Re: 2... I think it's interesting that they add arbitrary randomness in the algorithm. If they didn't, the problem of wildly varying outputs to the same input wouldn't exist in the first place.

gblargg

Sounds like a kind of dithering to spread out errors and avoid getting stuck.

tantalor

> LLM will perform poorly

We've been seeing examples of this constantly since 2022. How many more do we need?

jaccola

It’s just an impossible problem. Photons don’t provide sufficient information to determine calories (at least not in any way they could practically be captured). Inside that sandwich could be drenched with olive oil or it could be hollow cheese with lettuce. It’s impossible to tell.

2ndorderthought

The average person has no idea this is true. And the average person cannot tell when this is the case. So we have a bunch of people going their way through school and then relying on AI when they get stuck. The future is gonna be wild.

lordleft

Yep. And it doesn't help that the people selling AI products act as if they're going to build God. Going, "well AI can't do that" isn't going to fly when you are lax about communicating its limitations!

2ndorderthought

It also doesn't help when the messaging is linked to how "there will be no jobs where you use your brain anymore, everything will be automated". What motivation does the average 16-year-old have to try hard and learn anything beyond what they immediately need?

No jobs, AI Jesus is coming, and if you use AI it will use all of the world's compute power to try to convince you it's correct even when it's not.

renticulous

Here's the technical literacy of the population on display. I love these prank examples that show the true education of the populace.

https://www.youtube.com/shorts/B7c9qJcRnVk

dredmorbius

A more robust measurement might be the (former) US Department of Education's "Adult Literacy in the United States" survey, most recently conducted in 2019. The results of this are sobering enough:

<https://nces.ed.gov/pubs2019/2019179/index.asp>.

There's a related study of adult technical literacy conducted in 33 OECD nations:

<https://www.oecd-ilibrary.org/education/skills-matter_978926...>.

Both show that only a small fraction (5-10%) of adults operate at high levels of literacy (whether of text, numeracy, or technology), and that a large fraction (roughly 50%) operate at a minimal or below-minimal level.

fcarraldo

True education? What idiot would say yes to this?

Even if you _know_ the debit card transaction is safe, there’s no reason to risk it when a weirdo is filming you with some wild contraption.

rcxdude

Anything like this is going to have a very heavy selection bias, don't take any of this kind of content as a reflection of the average person.

WarmWash

Many of us witnessed the technical literacy of the general population when we ran to show them ChatGPT 3.5 and they just kind of shrugged, like "So? What are you showing me?"

engineer_22

I am asking a lot here, but schools need to be teaching people what AI is, what its weaknesses are, and how to use it... My school taught me to use a calculator. It also taught me how to check my work when I relied on the calculator.

AI is a very complicated calculator - you give it an input, magic happens, it gives you an output. Really no different, to a layman.

jaccola

To be fair, this should probably be covered by basic physics/maybe cooking classes. “You can’t determine the calories in food by looking at it” isn’t really ML specific.

garciasn

Considering the lack of basic math skills I encounter each and every day, I don't think schools did enough; they certainly aren't going to do enough w/LLMs.

2ndorderthought

It's more complicated than a calculator. Even researchers who have dedicated their lives to the field don't know all of the limitations of any given model. That fact alone isn't helpful when a model is 80% correct in one area but 2% in another.

undefined

[deleted]

lesuorac

If you're looking for a citation about this, the 1999 Dunning-Kruger paper "Unskilled and Unaware" [1] is about this.

People who are unskilled at a task are unaware of what that task looks like when performed correctly. So somebody who can't count calories is unable to tell that the AI can't perform the task correctly either.

[1]: https://pubmed.ncbi.nlm.nih.gov/10626367/

hombre_fatal

Fwiw invoking Dunning-Kruger is beyond trite at this point.

Which is a good thing because it means we can talk like normal humans ("people don't know that it's unreliable") instead of acting like we're making such a profound claim that it needs a citation and psychological dissection.

drrotmos

It is and it isn't. If you ask a human how many calories (or carbs) are in that sandwich, they can give you a qualified guess based on how a sandwich like that is typically constructed. They may not know the calories for a slice of bread or a slice of cheese by heart, but if you give them a food database, they can look it up.

They absolutely won't be 100% correct (bread sizes e.g. are going to be an estimate), but unless it's a trick sandwich drenched in olive oil or with hollow cheese, they're probably going to be in the right ballpark.

I don't think it's outside the realm of possibility for an LLM to be in the right ballpark as well, but that doesn't seem to be where we're at now.

arc-in-space

I'm surprised how many comments in this thread swear by the position that you literally can't tell based on a picture, as if eating trick foods designed to mislead you were an everyday occurrence. Most of the time, in typical use, you could make a reasonable guess, maybe with some obvious caveats such as "well, idk if that Coke is Diet or not".

muwtyhg

And furthermore, once the person has "determined" how many calories the sandwich contains, they are likely to give you the same answer next time you ask instead of randomly changing their minds.

ozgung

As a human, in the photo of that sandwich I see 4 slices of bread and 4 slices of cheese (distributed unevenly). I have no idea about the weight of the bread, flour type or its sugar content. I don't know the type of the cheese, dimensions of the slices or total amount of cheese inside the bread. I don't know if there is butter or anything else inside. I can guess the size of the plate as a size reference but I can't be sure. Human or AI, it's an ill-posed problem. There can be widely different estimates which can be equally plausible.

bcjdjsndon

But why would the same LLM give you wildly different answers EACH TIME you ask?

pkaye

There is a parameter in LLMs called temperature that controls creativity/randomness. Setting it to 0 makes the output close to deterministic (greedy decoding). Most LLM APIs expose this as a tunable parameter.
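
For illustration, here's what the temperature knob does mathematically. This is a generic softmax sketch with toy logits, not any particular provider's implementation:

```python
# What the temperature knob does: it rescales token logits before sampling.
# Low temperature sharpens the distribution toward the single best token;
# at the T -> 0 limit, sampling becomes greedy (always the argmax).
import math

def softmax_with_temperature(logits, temperature):
    if temperature <= 0:  # greedy decoding: all mass on the argmax
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # toy scores for three candidate tokens
print([round(p, 2) for p in softmax_with_temperature(logits, 1.0)])
print([round(p, 2) for p in softmax_with_temperature(logits, 0.0)])  # [1.0, 0.0, 0.0]
```

Even at temperature 0, real deployments aren't perfectly repeatable (batching and floating-point nondeterminism in inference can still vary outputs), which is part of why the study still saw spread at the lowest randomness settings.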

zdragnar

Because that's how they work? They aren't knowledge machines, they are random generators.

YeGoblynQueenne

It's not impossible to tell. Diabetics and others with dietary restrictions have to do that sort of thing every day to decide what they can and can't eat. If you pick up a loaf of bread at the baker's, the baker usually has no idea how many carbs, or how much salt or sugar, is in that loaf of bread. Try it. Ask the baker: "how much carbs in this loaf of bread?". They'll just stare at you. They can tell you whether the loaf has salt or sugar in it, but can't tell you how much, because they don't calculate the amount per loaf. So if you have dietary restrictions you have to know what you can and can't eat, and that requires the ability to judge the contents of a piece of food from the way it looks.

Photons don't carry that information? Sure. But you don't just have photons to go by. You can rely on a large database of prior knowledge about how food is usually made and with what ingredients.

Other people who have to rely on their imperfect human senses to decide what they can and can't eat: people with allergies, people with heart problems, hypertensives, kidney patients, etc. etc.

bryanlarsen

The question isn't about calories, it's about carbs. Drenching that sandwich in olive oil won't change its carb count. From the picture it's a thin cheese sandwich -- we can see cheese and we can see it's thin enough that there's little else. Might be no butter, might be lots of butter, but that won't affect the carb count. If there's lettuce in the sandwich, there's likely a negligible amount. Hand it to a knowledgeable human and you're going to get a very consistent carb reading -- 30g, the value of two slices of Wonder Bread.

It could be much different -- it could be one of those breads with weird macros, or fake cheese, or it could be hollowed out and packed full of hidden vegetables. But a human is going to give you the answer for two slices of plain white bread.

beached_whale

From personal experience, one can get close enough when guessing that the error isn't going to be significant compared to the errors in insulin-to-carb ratios/sensitivity factors/...

I am pretty good at this, and the cheese sandwich example threw me: I would have estimated around 10-15g of carbs for each slice. So the 28g is fairly consistent with that, not 40g. The only real way would be to weigh it and use the labeling. Another thing that often gets people is that the labeling often has a serving size of, say, 2 slices, and a weight that does not reflect the actual weight of 2 slices.

Luckily with good tools the significance is reduced, people using closed loop insulin pumps will automatically correct for that. Lots more room to wiggle.

Ekaros

Then it should refuse to answer 100% of time.

falcor84

I don't think refusal is the right approach. I would much prefer that it respond with something like:

> There is not enough information to make an accurate estimate, but if you'd like, I can take a stab at it. If so, how much effort should I put into it?

> Yes, go ahead and spend up to 5mins and $1 to analyze it.

> Done, I've had 100 subagents analyze the image and have arrived at a 95% confidence interval of the portion containing ...
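A back-of-envelope version of that aggregation step can be sketched in Python (all numbers are invented for illustration; it assumes the per-agent estimates are independent and roughly normal, which subagents sharing one base model would likely violate):

```python
import statistics

def carb_confidence_interval(estimates, z=1.96):
    """95% confidence interval for the mean of independent carb estimates (grams)."""
    mean = statistics.mean(estimates)
    # Standard error of the mean from the sample standard deviation
    sem = statistics.stdev(estimates) / len(estimates) ** 0.5
    return (mean - z * sem, mean + z * sem)

# Hypothetical per-subagent estimates for the cheese sandwich, in grams of carbs
estimates = [28, 31, 30, 27, 33, 29, 32, 30, 28, 31]
low, high = carb_confidence_interval(estimates)
print(f"95% CI: {low:.1f}-{high:.1f} g")
```

In practice, estimates from same-model subagents are highly correlated, so the resulting interval would be misleadingly narrow.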

muwtyhg

I know this is just an example but my eyes kind of bugged out thinking about paying $1 every time I want to estimate the calories in my sandwich.

jaccola

Indeed, I think any reasonable human might say “A few hundred calories but without measuring the ingredients I might be way off”. I think LLMs could get there, I don’t see anything stopping that. Though they have been notoriously bad at this so far.

dredmorbius

If the problem is so evidently impossible then the LLM itself should recognise this, state that the problem isn't solvable, *not* provide what's certain to be an inaccurate result, and suggest better approaches to arriving at a reasonable answer.

That said, it's notable that diabetes education materials often suggest estimating glycemic loads by rough portion size / plate ratios. Which is to say that absent accurate weight measurements (themselves subject to variations in ingredients, moisture levels, etc.) current clinical recommendations are themselves pretty rough.

Aurornis

This will surprise nobody here, but it’s important to communicate to audiences that are new to LLMs.

This is targeted at people with diabetes because there are AI carb counting apps appearing in app stores

> If you’re using AI carb counting in a diabetes app

These apps are probably not even using the mainstream models used in the study because they would be too expensive for cheap or free apps, and they’re probably forcing structured output to get a response without any of the warnings that an LLM might include if you ask it directly.

tantalor

Who in Al Gore's Internet is new to LLMs in 2026?

Aurornis

Most of the world.

rsynnott

I am... unsure why anyone would think LLMs would be able to do this. They are not magic oracles. Like I think even most humans would be extremely bad at this.

Like, are people actually using LLMs for this? Please do not, it won't work.

kioleanu

Yes, people are using LLMs for this because that is how they've been marketed: on one hand as a personal assistant able to handle everyday tasks, and on the other as a researcher able to crack old problems that humans couldn't.

Does the model say it can't do that when asked? No, it answers confidently.

Also, it's easy to trust it if you don't know how it works.

drtz

Would people really trust their personal assistant to tell them how many calories are in a sandwich just by glancing at it on a plate? I'm doubtful, and I would also expect a diabetic to be even more skeptical.

jihadjihad

> They are not magic oracles.

I came across a LinkedIn post a couple days ago where someone had asked ChatGPT, "What are the top things you get asked about $NICHE_INDUSTRY_THING_I_AM_SELLING?"

As if there is introspection like that at the meta level, where ChatGPT could actually provide hard numbers around its own usage and request patterns.

The fact that these products work with natural language beguiles people into thinking they are, indeed, magic oracles.

Ekaros

This is the weird intersection where I think that data might exist and LLM might be able to query it. But any company would never give it out. So the bot would not have access to it.

pjc50

> They are not magic oracles.

Anthropic's trillion dollar valuation hinges on the idea that it is just that, a magic oracle that can replace any worker for any type of task. Any programmer, any author, any musician, any kind of clerical work. All we've asked here is "sudo evaluate me a sandwich", the sort of estimation task that humans with internet resources might reasonably be expected to do, and it's given up?

(It would be fun to compare this to sending the picture out on Mechanical Turk and asking humans to eyeball the calorie count of said sandwich...)

Nicook

You are severely overestimating the average, or even above average understanding of LLMs.

bluefirebrand

Not to mention the fact that LLM marketing is trying to convince us that they can do anything

acchow

Most people are convinced LLMs can do this.

Cal AI, which claims to generate a nutritional breakdown based off a photo, has $30 million in annual recurring revenue.

rsynnott

... Bloody hell. I mean that's basically fraud, surely. It is _not possible to do this even vaguely accurately_.

faangguyindia

It’s because AI can debug a programme, so people start thinking it can do fitness and health stuff too. But there is no “instant-reacting compiler” for health or fitness: things change over a long time, and by then the AI will have run out of context or lost the data from its cache, or the user may have got bored and deleted their account.

PUSH_AX

It’s worse: I bet there are apps in the App Store that do this, and the users just have no idea of the accuracy.

vector_spaces

There is a very popular app for macro counting called Cal AI that was reportedly written by a high school student and has over $1M in revenue. Looks like it was just acquired by MyFitnessPal.

lordgrenville

Wow, yeah. "The result is an app that the creators say is 90% accurate".

https://techcrunch.com/2025/03/16/photo-calorie-app-cal-ai-d...

jeroenhd

https://xkcd.com/1425/ strikes again.

As far as consumers know, LLMs can identify the towns where pictures were taken (without metadata), summarize entire movies, generate clips of your kid flying a rocket to the moon, and translate images from any language imaginable, but somehow they cannot estimate the calories in a cheese sandwich.

The supposed professional posting about an LLM deleting the prod database of their non-existent company asked the AI to explain itself. That's the level of LLM knowledge you should expect from most people who actually work with these tools.

axlee

"Crema catalana: Three of four models called it “creme brulee” 100% of the time. Only Gemini 3.1 Pro got “crema catalana” — in 3.4% of queries."

----

Wikipedia for Crema catalana:

Crema catalana (Catalan for 'Catalan cream'), or crema cremada ('burnt cream'), is a Catalan dessert consisting of a custard topped with a layer of caramelized sugar.[1] It is "virtually identical"[2] to the French crème brûlée. It is made from milk, egg yolks, and sugar. Crema catalana and crème brûlée are made in the same way.

---

Oh no, my AI can't tell that an obscure clone of a famous dish is indeed the obscure clone, and not the commonly known version.

comes

The difference between them is that crème brûlée is made with heavy cream instead of milk, and it tastes better. But my Catalan friends would kill me for this blasphemy… so you didn’t hear that from me.

They are both covered in burnt sugar and therefore visually indistinguishable(!).

tdeck

In high school my Spanish teacher told us that Crema Catalana was the Spanish name for Creme Brulee.

ozbonus

Before the next galaxy brain shows us all how smart and witty they are by adding the nth sarcastic comment about how obvious this result is, I hope they'll take a moment to consider a few things.

Yes, people are using LLMs for this kind of thing. Lots of people. All the time. I've met plenty of them, and there are loads of apps that offer this kind of "service". The authors are well aware that people are doing this and probably anticipated the result.

Why do the study at all? Because it's important to demonstrate and measure things, even obvious ones. Because it's not obvious to everyone, like the people who are already consulting LLMs for dietary information to manage their health. Because it's easier to enact official policies when there's hard evidence.

wrqvrwvq

Bizarre thread so far. Some threads seem to attract a certain type of boosterism or opinion management. People all rush in to say similar things, without reading the prior comments. Seems designed to wash thoughts into a stream. Maybe coordinated PR or reputation management. It could even be organic, but it doesn't seem that way.

nextlevelwizard

I used LLMs to count calories, but not based on photos; I mean, I also did that, but primarily I fed in my exact ingredients and then used weights to get calorie estimates.

Was it always correct? Certainly not. But it helped me lose 30kg of weight since keeping even some track of calories was so much easier with LLM than any app I had used before.

Also, of course, it didn’t matter if I wasn't exactly on point, since it wasn’t about any kind of medicine.

edu

Curious: why was it easier to use an LLM vs a non-AI app with a DB of foods?

Seems that in this case a traditional approach would be more precise and more environmentally efficient to get to the same results.

nextlevelwizard

Any app I have used before has asked me to look up the foods and add them manually, and usually there have been ads or subscriptions involved.

Much easier for me to take pictures of the packets while making the food, then weigh the final bulk product, and then when I eat just weigh the plate and say “500g of casserole”; the LLM spits out the calories and keeps track of the daily consumption.

jerkstate

Nice. I vibe coded a similar kind of system: you can dump a recipe into the chat window and it will use tool-calling to look up macros for any foods it doesn't have in the DB and add them, estimate raw -> cooked changes in nutrition and weight (if needed), estimate the total weight of the cooked product, and compute macros per gram (e.g. it writes a 100 gram serving to the DB, and you can scale that up and down with the macros scaling linearly). Similar to you, I have used this app to alter my macro mix from high-fat to high-carb (for workout performance) and cut my sodium from ~4g/day to ~2.4g/day by interrogating the DB about which foods I should eat more and less of. Found some surprising wins in my habitual diet that were easy to change to hit my health targets, and looking up and logging these things by hand without LLM assistance would have been too tedious and time-consuming to keep up for as long as I have (maybe 3 months now).
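The per-gram scaling step described above is simple enough to show directly. A minimal sketch (the food name and macro numbers are hypothetical, not from the commenter's actual DB):

```python
from dataclasses import dataclass

@dataclass
class Macros:
    carbs_g: float
    protein_g: float
    fat_g: float

# Hypothetical DB entry: macros per 100 g of a cooked dish
db = {"lentil_curry": Macros(carbs_g=18.0, protein_g=7.5, fat_g=4.0)}

def macros_for_serving(food: str, grams: float) -> Macros:
    """Scale a per-100g DB entry linearly to an arbitrary serving weight."""
    per100 = db[food]
    k = grams / 100.0
    return Macros(per100.carbs_g * k, per100.protein_g * k, per100.fat_g * k)

# A 350 g plate of the dish
serving = macros_for_serving("lentil_curry", 350)
print(serving)  # Macros(carbs_g=63.0, protein_g=26.25, fat_g=14.0)
```

Linear scaling is the easy, reliable part; the hard part the thread is debating is getting the per-100g numbers right in the first place.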

Curious, what model are you using? I have found Qwen Flash to be really great for this - tool calling works well, it's smart enough, and very cheap.

tcoff91

Are you giving the LLM the weights of the ingredients as you go? Sounds like a great system.

tcoff91

The data entry is a pain in the ass with those apps when cooking food from scratch. It’s much much easier with LLMs and natural language and voice mode and pictures of a food scale and things like that.

mattnewport

Ironic that they used an LLM to write the article:

> 42.9 units of insulin from a single photo. That’s not a rounding error. That’s a potential fatality.

thedanbob

The vast majority of "AI screwed up" posts I've seen on HN have been written with AI.

zamadatix

The title seems to be clickbait (the 13 foods in the paper don't even cover the ranges that would make such a title possible), but the results/paper are much more on point.

It'd be really interesting if it evaluated humans on the exact same image sets. The correct answer is just to feed in more data, such as the exact food itself, but the post makes it sound like the model is the only risk in this approach to counting carbs.

jasonkester

LLMs seem really bad at reading numbers and reporting them back. I’m building a game, and to see how well its docs were being indexed, I tried asking simple questions to ChatGPT, Gemini, whatever Microsoft’s thing is, etc.:

“What is the armour value for the Leather Shirt in the game Stravaeger?”

It confidently got it wrong.

“You can find the game at https://stravaeger.com”

Different confident answers, also wrong.

“You’ll find it in a table on this page: https://stravaeger.com/docs.html?inventory_item=LEATHER_SHIR...

Oh, sorry. I was inferring from other similar games. Here is a different confidently wrong number.

“It’s also in the .json file linked on that page”

And another wrong value. Random numbers should have got it right by now, but no. And the confident, authoritative tone never changed. Every model I tried was the same story.

amazingamazing

With mass information you could infer much more from pictures. With some sort of standard cube in the picture as well as taking a picture at an angle that emphasizes all three dimensions you could also better estimate the relative volume.

It’s tractable I think, but not from a pic alone.
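The reference-object idea reduces to recovering a pixels-to-centimetres scale factor from the known cube and applying it to the food's apparent dimensions. A rough sketch (all dimensions hypothetical; it ignores perspective distortion and occlusion, and treats the food as a box, so it is only an upper bound):

```python
def estimate_volume_cm3(object_px, cube_px, cube_side_cm=2.0):
    """Estimate a food item's volume from its apparent size relative to a
    reference cube of known side length visible in the same photo.

    object_px: (width, height, depth) of the item's bounding box, in pixels
    cube_px: apparent side length of the reference cube, in pixels
    """
    cm_per_px = cube_side_cm / cube_px  # scale factor from the known cube
    w_cm, h_cm, d_cm = (dim * cm_per_px for dim in object_px)
    return w_cm * h_cm * d_cm  # crude bounding-box volume

# Hypothetical measurements: a sandwich spanning 300x200x80 px, next to
# a 2 cm calibration cube that appears 50 px wide in the same shot
print(estimate_volume_cm3((300, 200, 80), cube_px=50))  # 307.2 cm^3
```

Even with a good volume estimate, you still need a density and a carbs-per-gram figure, which is exactly the prior-knowledge part a photo can't pin down.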

jaccola

Yes one could potentially increase accuracy greatly. One big problem would be occlusion.

There is already a solution to this that would be very hard to beat (and one can choose to use or not use an LLM to assist): prepare food yourself and use the information provided by the manufacturer.

amazingamazing

If you consider time at all, what you suggest is hardly a solution. It is the most accurate, but even 50% accuracy at orders of magnitude less effort would be more useful for the main use case, which is losing weight.

However, for diabetes, accuracy is likely preferred, and I’m not sure any computer vision approach would be palatable.

Centigonal

maybe, but not always. I could make two identical-looking sandwiches with very different calorie content by changing the type and quantity of sauce on the inside of the bread. I could give you two "pasta with creamy sauce" dishes that look similar on camera but have different macros by partially swapping Greek yogurt for heavy cream. Dropping a couple tbsp olive oil into my marinara sauce does wonders for flavor but barely affects appearance when plated. Same with lard in my refried beans.
