GPT-4 performs significantly worse on coding problems not in its training data (Hacker News discussion)

gateorade

This has been my experience. I’m really impressed by how well GPT-4 seems to be able to interpolate between problems heavily represented in the training data to create what feels like novelty, e.g. creating a combination of Pong and Conway’s Game of Life, but it doesn’t seem to be good at extrapolation.

The type of work I do is highly niche. I’ve recently been working on a specific problem for which there are probably only a hundred implementations at most running on production systems, all of them highly proprietary. I would be surprised if there were any implementations in GPT’s training set. With that said, this problem is not actually that complicated. A rudimentary implementation can be done in ~100 lines of code.

I asked GPT-4 to write me an implementation. It knew a decent amount about the problem (probably from Wikipedia). If it was actually capable of something close to reasoning it should have been able to write an implementation, but when it actually started writing code it was reluctant to write more than a skeleton. When I pushed it to implement specific details it completely fell apart and started hallucinating. When I gave it specific information about what it was doing wrong it acknowledged that it made a mistake and simply gave me a new equally wrong hallucination.

The experience calmed my existential fears about my job being taken by AI.

softfalcon

This exact scenario is what I described to a friend of mine who is an AI researcher.

He was convinced that if we trained the AI on enough data, GPT-x would become sentient.

My opinion was similar to yours. I felt that the hallucinating the AI does falls short of true extrapolative thought.

I said this because humans don’t have access to infinite knowledge, and even with vast amounts of it, they can’t process all of it. Adding endless information for the AI to feed on doesn’t seem like the solution to figuring out true intelligence. It’s just more of the same hallucinating.

Yet despite lacking knowledge, we humans still come up with consistently original thoughts and expressions of our intelligence daily. With limited information, our minds create new representations of understanding. This seems to be impossible for ChatGPT.

I could be completely wrong, but that discussion solidified for me that my role as a dev still has at least a couple more decades of shelf life left.

It’s nice to hear that others are reaching similar conclusions.

visarga

Current LLMs decode greedily, token by token. In some cases this is good enough, namely for continuous tasks, but in other cases getting the right end result means the model would have to backtrack and try another approach, or edit its response. This doesn't work well with the way we are using LLMs now, but it could be fixed. Then you'd get a model that can do discontinuous tasks as well.

>> Write a response that includes the number of words in your response.

> This response contains exactly sixteen words, including the number of words in the sentence itself.

It contains 15 words.

The model would have to plan everything before outputting the first token if it were to solve the task correctly. It works if you follow up with "Explicitly count the words", let it reply, then "Rewrite the answer".
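A minimal sketch of that three-turn flow, assuming the pre-1.0 `openai` Python package and its ChatCompletion interface; the model name and prompt wording are illustrative, not a tested recipe.

```python
# Minimal sketch of the follow-up flow above, assuming the pre-1.0 openai Python
# package (ChatCompletion interface). Model name and prompt wording are illustrative.
import openai

def ask(msgs):
    reply = openai.ChatCompletion.create(model="gpt-4", messages=msgs)
    return reply["choices"][0]["message"]["content"]

msgs = [{"role": "user",
         "content": "Write a response that includes the number of words in your response."}]
draft = ask(msgs)                                      # first attempt, often off by a few words

msgs += [{"role": "assistant", "content": draft},
         {"role": "user", "content": "Explicitly count the words in your answer, one by one."}]
count = ask(msgs)                                      # forces the "plan" into visible tokens

msgs += [{"role": "assistant", "content": count},
         {"role": "user", "content": "Now rewrite the answer so the stated count is correct."}]
print(ask(msgs))
```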

Dzugaru

> but could be fixed

How? The problem has been known for a while; for example this article [0] mentions it (as chain-of-thought reasoning). You could think that just having a scratchpad of tokens is enough - you can arguably plan, backtrack and rewrite there [1], right? But this doesn't really work, at least yet - maybe because it wasn't trained for that - and maybe ChatGPT's massive logs (probably available only to OpenAI) can help. But the Microsoft report [2] suggests we need a different architecture and/or algorithms? They mention the lack of planning and retrospective thinking as a huge problem for GPT-4. Maybe you know some articles on ideas for how to fix this? Backtracking and trying again seem to be linked to human thought - and could very well give us AGI.

[0] https://arxiv.org/abs/2201.11903

[1] https://www.reddit.com/r/ChatGPT/comments/120fi8e/chatgpt_4_...

[2] https://arxiv.org/abs/2303.12712

yorwba

Backtracking to edit the response is theoretically easy to solve by training on a masked language modeling objective instead of an autoregressive one, but using it to actually generate text is a bit expensive: you can't just generate one token at a time and be done, you might have to re-evaluate each output token every time another token is changed. So I expect autoregressive generation to remain the default until the recomputation effort can be significantly reduced or hardware advances make the cost bearable.
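To make the contrast concrete, here is a toy illustration of why editing is awkward for an autoregressive decoder: each token is committed as soon as it is picked. The bigram table is made up purely for illustration; it is nothing like a real LLM.

```python
# Toy illustration of autoregressive greedy decoding. The bigram "model" is a
# made-up lookup table, not a real LLM; it only shows that each token is final
# once emitted, so editing token i means re-scoring everything after it.
BIGRAMS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.7, "dog": 0.3},
    "cat": {"sat": 0.8, "</s>": 0.2},
    "dog": {"sat": 0.6, "</s>": 0.4},
    "sat": {"</s>": 1.0},
}

def greedy_decode(max_len=10):
    tokens = ["<s>"]
    while len(tokens) < max_len:
        # Pick the single most probable next token and never revisit the choice.
        nxt = max(BIGRAMS[tokens[-1]].items(), key=lambda kv: kv[1])[0]
        if nxt == "</s>":
            break
        tokens.append(nxt)
    return tokens[1:]

print(greedy_decode())  # ['the', 'cat', 'sat']
```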

SanderNL

I don't think humans can do this either. What's the problem with producing a result and then fixing it? It's exactly how we do it.

andsoitis

> This exact scenario is what I described to a friend of mine who is an AI researcher. He was convinced that if we trained the AI on enough data, GPT-x would become sentient. My opinion was similar to yours. I felt that the hallucinating the AI does falls short of true extrapolative thought.

It turns out it isn’t just AIs that hallucinate; AI researchers do as well.

physPop

"researcher".

Majromax

> He was convinced that if we trained the AI on enough data, GPT-x would become sentient.

Is there enough data?

As I understand it, the latest large language models are trained on almost every piece of available text. GPT-4 is multimodal in part because there isn't an easy way to increase its dataset with more text. In the meantime, text is already quite information dense.

I'm not sure that future models will be able to train on an order of magnitude more information, even if the size of their training sets has a few more zeroes added to the end.

call-me-al

What about all the content that isn't yet in text form (e.g. YouTube videos)?

psychphysic

The threshold for sentience is continually falling.

So he might turn out to be right, but due to the passage of time rather than due to improved performance.

I believe in the UK all vertebrates are considered sentient (by law, not science). That includes goldfish.

And good luck even getting a goldfish to reverse a linked list. Even after 1000 implementations are provided.

delusional

I don't think that when people commonly discuss sentience they mean to include goldfish. I don't think the legal definition (which probably exists due to external legal implications) has any bearing on the intellectual debate of AI sentience.

wslh

> He was convinced that if we trained the AI on enough data, GPT-x would become sentient.

Not saying your friend is right or wrong, but imagine if civilization feeds more information, in real time, to an AI system through sensors: would it be at least as sentient as the civilization itself? Seems like a sci-fi story, a competitor to G-d.

antonvs

Isaac Asimov wrote a story along those lines, “The Last Question”, which he described as “by far my favorite story of all those I have written.” Full text here:

https://xpressenglish.com/our-stories/the-last-question/

danaris

Some versions of divinity (both from real-world beliefs and sci-fi/fantasy) have it being essentially a gestalt of either all the souls that have ever died, or all those alive now—a kind of "oversoul" or collective consciousness.

While that's an interesting thought experiment, I don't think it can meaningfully apply to any kind of AI we have the capability to make today, even if we could hook it up directly to all our knowledge. Information alone can't make something sentient; it requires a sufficiently complex and sophisticated information processing system, one that can reason about its knowledge and itself.

kaba0

I’m not at all an expert on the topic, but from what I gathered LLMs are fundamentally limited in the kind of problems they can approximate. They can approximate any integrable function quite well, but we can only come up with limits on a case-by-case basis for non-integrable ones, and I believe most interesting problems are of this latter kind.

Correct me if I’m wrong, but doesn’t it mean that they can’t recursively “think”, on a fundamental basis? And sure, I know that you can pass “show your thinking” to GPT, but that’s not general recursion, just “hard-coded to N iterations” basically, isn’t it? And thus no matter how much hardware we throw at it, it won’t be able to surpass this fundamental limit (and without proof, I firmly believe that for AGI we do need the ability to basically follow through a train of thought).

sebzim4500

How is it "hard-coded to N iterations"? We don't instruct the model how many lines of working it should show.

Obviously there is a limit to how much it can fit in the context, but that seems to be rising fast (went from 4k to 32k in not that long)

aiphex

If they aren't already, AIs will be posting content on social media apps. These apps measure the amount of attention you pay to each thing presented to you. If it's more than a picture or a video, but something interactive, then an AI could also learn how we interact with things in more complex ways. It also gets feedback from us through the comments section. Like biological mutations, AIs will learn which of their (at first) random novel creations we find utility in. They will then better learn what drives us and will learn to create and extrapolate at a much faster pace than us.

danaris

> If they aren't already, AIs will be posting content on social media apps.

No, people will be posting content on social media apps that they asked LLMs to write.

It may be done through a script, or API calls, but it's 100% at the instigation, direct or indirect, of a human.

LLMs have no ability to decide independently to post to social media, even if you do write code to give them the technical capability to make such posts.

dmichulke

> Yet despite lacking knowledge, we humans still come up with consistently original thoughts and expressions of our intelligence daily.

I think there is some sampling bias in your observation ;-)

vasco

> The experience calmed my existential fears about my job being taken by AI.

The issue is that among all the 100k+ software engineers, many don't really do anything novel. How many startups are employing dozens of engineers to create online-accessible CRUD apps to replace a spreadsheet?

In the company I work for, I'd say we have about 15 developers, or about 3 teams, doing interesting work; everyone else builds integrations and CRUDs, moves a button there and back in "an experiment", adds a new upsell, etc. All these last parts could be done by a PM or a good UX person alone, given good enough tools.

The other parts I'm not worried about either.

yohannesk

For the type of engineers you describe, the hard part, I think, is communication with other devs, communication with product owners, understanding the problem, suggesting different ways of solving the problem, figuring out which department personnel (outside other devs) to talk to about a little detail that you don't have... it's not writing the code that's hard, at least from my experience.

noduerme

Yes. I won't be worried until the day Joe CEO can write a prompt like "build me an app that lets me know where my employees are at all times," and GPT responds with a list of questions about how Joe imagines this being physically implemented, and then calls up the legal department to clear its methods.

oblio

The question is... writing the code is a very small part of the job.

Figuring out what code to write is one of the big parts.

Fixing it when it breaks in many creative ways is the other big part.

How good is ChatGPT at fixing bugs? Security bugs or otherwise?

vasco

Sure, but you don't need an engineering degree for the other parts; they amount to design/product work, not engineering.

sterlind

I had a similar experience. I wanted it to write code to draw arcs on a world map, with different bends rather than going on a straight bearing. I did all the tricks, told it to explain its chain of thought, gave it a list of APIs to use (with d3-geo), simplified and simplified and spent a couple hours trying to reframe it.

It just spit out garbage. Because (afaict) there aren't really examples of that specific thing on the Internet. And it's just been weirdly bad at all the cartography-related programming problems I've thrown at it, in general.

And yeah, I'm much less worried about it replacing me now. It's just not.. lucid, yet.

laurels-marts

GPT-4 is reasonably good at D3, and drawing arcs on a projection (e.g. orthographic) is not that unique; you’ll find examples of it on Observable. However, I wonder if you broke down the problem into a small enough task. It performs best if you provide a clear but brief problem description with a code snippet that already kind of does what you want (e.g. using straight lines) and then just ask it to modify your code to calculate arcs instead. The combination of clear description + code, I found, decreases the likelihood of it getting confused about what you’re asking and hallucinating. If you give it a very long-winded request with no code as a basis, then good luck.

sterlind

I did try the code snippet technique, but unfortunately it got it wrong. For example, I gave it code that drew arcs but didn't follow the shortest great-circle distance, and it gave me several plausible-looking approaches that were completely wrong (e.g. telling ctx.arc to draw counterclockwise, which does the wrong thing because it needs to use projections instead.)

I eventually just asked it to compute the coordinates of a point c perpendicular to the midpoint of the great arc between a and b, such that the angle between ab and ac is alpha. I tried for hours, asking it to work out equations and name the mathematical identities it used, etc., but it was all gibberish.
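For what it's worth, here is a hand-written sketch (not GPT output) of one interpretation of that construction: a control point offset perpendicular to the great-circle arc a-b at its midpoint, by angular distance alpha. Standard-library Python only; the d3-geo drawing side is left out.

```python
# Hand-written sketch (not GPT output) of one interpretation of the construction
# above: a point offset perpendicular to the great-circle arc a-b at its midpoint
# by angular distance alpha. Standard library only; the d3-geo plumbing is omitted.
import math

def to_vec(lat, lon):
    lat, lon = math.radians(lat), math.radians(lon)
    return (math.cos(lat) * math.cos(lon), math.cos(lat) * math.sin(lon), math.sin(lat))

def to_latlon(v):
    x, y, z = v
    return math.degrees(math.asin(z)), math.degrees(math.atan2(y, x))

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def bend_point(a_latlon, b_latlon, alpha_deg):
    a, b = to_vec(*a_latlon), to_vec(*b_latlon)
    m = normalize(tuple(ai + bi for ai, bi in zip(a, b)))   # arc midpoint (a, b not antipodal)
    n = normalize(cross(a, b))                              # normal to the great-circle plane
    alpha = math.radians(alpha_deg)
    # Rotate the midpoint toward the plane normal by alpha; m and n are orthogonal unit vectors.
    c = tuple(mi * math.cos(alpha) + ni * math.sin(alpha) for mi, ni in zip(m, n))
    return to_latlon(c)

print(bend_point((40.7, -74.0), (51.5, -0.1), 15.0))  # e.g. New York -> London, bent by 15 degrees
```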

camjohnson26

So the closer you come to writing the code for it the better it does

noduerme

I imagine that creative approaches to spatial problem solving would be one of the harder areas for it - not just because there are by definition fewer public examples of one-off or original solutions, but also because one has to visualize things in space before figuring out how to code them. These bots don't have a concept of space. I'm thinking of DALL-E (et al.) having problems with "an X above Y, behind Z".

v4dok

GPT-4 has its hands tied behind its back. It does not have active learning, it does not have a robust system of memory, and it has no reward/punishment mechanism. We are only now starting to see work on this front [1].

It might not know more than you about your niche. I don't either. I would search and I would try to reason, but if I were forced to give a token-by-token output answering the question as truthfully as possible, I might have started saying bullshit as well.

I don't think the fact that GPT doesn't know things or does some things wrong is sufficient to save dev work from automation.

[1]: https://github.com/noahshinn024/reflexion-human-eval

blablabla123

> The experience calmed my existential fears about my job being taken by AI.

Same for me. I haven't tried GPT-4 yet, and not on code from work anyway, but GPT-3 seems borderline useless at this point. The hallucinations are quite significant. I also tried to get it to produce advice for Agile development with references, and as stated in other articles, the links were either 404s or pointed to completely unrelated articles.

Still, I'm taking this seriously, just considering the leaps that happened with AlphaGo/AlphaZero or autonomous driving, which were considered unthinkable in their respective domains before.

zeroonetwothree

Even if AI only takes over “easy” programming jobs, it might still create a huge downward pressure on compensation.

After all, just look at manufacturing. Compared to 1970, we produce 5x the real output but employ only 50% of the people. The same will likely happen to fields like programming as AI improves.

olivermuty

For the crap devs, maybe, but high-skill devs and architects will be able to charge more than ever to oversee all of this «productivity» from the AIs.

nimbix

I asked it to write a trivial C#/.NET example of two actors where one sends a ping message and the other responds with pong. It couldn't get the setup stage right, called several methods that don't exist, and had a cyclic dependency between actors that would probably take some work to resolve.

Even after several iterations of giving it error messages and writing explanations of what wasn't working, it didn't even get past the first issue. Sometimes it would agree that it needs to fix something, but would then print back code with exactly the same problem.

toss1

Yes, exactly this.

I wrote some questions in the specialist legal field of someone in my household, then started to get into more specialist questions, and then specifically asked about a paper that she wrote innovating a new technique in the field.

The general question answers were very impressive to the attorney. The specialist questions started turning up errors and getting concepts backwards - bad answers.

When I got to summarizing the paper with the new technique, it could not have been more wrong. It got the entire concept backwards and wrong, barfing generic and wrong phrases, and completely ignored the long list of citations.

Worse yet, to the point of hilariously bad, when asked for the author, date, and employer of the paper, it was entirely hallucinating. Literally, the line under the title was the date, and after that was "Author: [name], [employer]". It just randomly put up dates and names (or combinations of real names) of mostly real authors and law firms in the region. Even when the errors were pointed out, it would apologize and then confidently spout a new error. Eventually it got the date correct, and that stuck, but even when prompted with "Look at where it says 'Author: [fname]' and tell me the full name and employer", it would hallucinate a last name and employer. Always with the complete confidence of a drunken bullshit artist.

Similar for my field of expertise.

So, yes, for anything real, we really need to keep it in the middle-of-the-road zone of maximum training. Otherwise, it will provide BS (of course if it is BS we want, it'll produce it on an industrial scale!).

dorkwood

I feel vindicated reading this. Yesterday in a separate thread I claimed that it was wrong on 80% of the coding problems I gave it, and received the response from multiple readers that I was probably phrasing my questions poorly.

I started to believe them, too. Unfortunately, my brain is structured in such a way that a unanimous verdict from a few strangers is enough to make me think I’m probably the one who’s wrong. I need to make note of these events as a way to remind myself that this isn’t always the case.

mrbombastic

I think part of the issue is that your mileage will vary greatly depending on what your problem domain and language of choice are. People working with languages and problems ChatGPT handles well have a hard time believing the hard fails in other domains, and vice versa. I wrote a Python script the other day to delete some old Xcode devices lower than a certain iOS version, complete with options, with just a few back-and-forths with ChatGPT. My knowledge of Python is extremely basic and the code just worked out of the box. Then yesterday I asked for the code to tell if a device is lidar-enabled in Objective-C, and it failed to give me compilable code 4 times in a row until I finally gave up and went back to the docs. The correct answer is one line. I for one am pretty excited about this: things that a lot of people have done before should be easy, which leaves more brain space for the tough stuff.

dpkirchner

What was the answer, if you don't mind? ChatGPT (3.5) says to use isCollaborationEnabled on the ARWorldTrackingConfiguration class, which doesn't seem quite right based on the docs.

I wonder if this is a garbage-in/garbage-out problem, with Apple's poor documentation of features being the garbage in.

mrbombastic

Sure, this was what I ended up using:

https://developer.apple.com/documentation/arkit/arworldtrack...

I believe there are multiple ways, part of the problem might be Apple doing the Apple thing where their user philosophy bleeds into their tech. They don't want you checking for a specific sensor or device capability they want you to check if whatever feature you want is enabled.

anonytrary

Vindicated and excited. Gradient descent is likely not enough. I love it when we get closer to something but are still missing the answer. I would be very happy if "add more parameters and compute" isn't enough to get us to AGI. It means you need talent to get there, and money alone will not suffice. Bad news for OpenAI and other big firms, good news for science and the curious.

I imagine physicists got very excited with things like the ultraviolet catastrophe, and the irreconcilable nature of quantum mechanics and general relativity. It's these mysteries that keep the world exciting.

BoorishBears

There's something ironic about implying that us not having a path to AGI is good news for the curious. If you're superficially curious, then sure, we need to unlock another piece of a puzzle, and more puzzle pieces means more puzzle solving.

But if you're able to actually take a step back, AGI would be the ultimate source of new puzzles for the curious. We don't even all agree on how to define the "GI"; approaching AGI wouldn't be unlike meeting extraterrestrial life sitting at a computer.

MayeulC

I think you misunderstood the parent, who was probably saying that the process for achieving AGI would be more interesting if it isn't just "more compute/training".

IIAOPSW

Woah. I've never seen someone so self-aware of the Asch conformity test without literally talking about the Asch conformity test.

https://en.wikipedia.org/wiki/Asch_conformity_experiments

sjducb

The easiest fix for that is to show the prompt you gave and the output you want. Force the people who tell you it's easy to actually do it. You'll get one of 3 outcomes:

- They try, then fail (you were right).
- They try, then succeed (you learn).
- They keep telling you it's easy but don't demonstrate (you know that this person is full of it and you can ignore them in future).

hughesjj

Yup, always drill down and get/provide more context when something doesn't align or seems fishy.

I've found WAYYYY too many of my issues were really just communication issues. Shit's hard to navigate socially.

That said, mind sharing the specific tasks/prompts gpt-4 failed you?

nopinsight

“Reflection-Based GPT-4 Agent is State-of-the-Art on Code Gen

Iteratively refines code, shifting “accuracy bottleneck” from correct code gen to correct test gen

HumanEval accuracy:

-Reflexion-based GPT-4 88%

-GPT-4 67.0%

-CodeT 65.8%

-PaLM 26.2%”

with link to code in the Tweet:

https://mobile.twitter.com/johnjnay/status/16393620718075494...

A 21-percentage-point improvement after adding a feedback loop and self-reflection to GPT-4, which went public just 12 days ago. (The approach is based on a preprint published 4 days ago.)

Human coders often need a feedback loop and self-reflection to properly “generate” code for problems novel to them as well.

-----

A larger question: Are we hurling ourselves toward a (near) future of unaligned AGI with self-improvement capabilities?

nopinsight

Someone asked GPT-4 to build a complete app from scratch. It's now on the store. He seems to apply good prompt engineering techniques.

Screenshots, video demo, and process here: https://mobile.twitter.com/mortenjust/status/163927657157489...

It seems plausible to me now that a junior developer position would be hard to find in 2-3 years (I thought it would be ~5 years).

mrbombastic

So this is very impressive and looks like a solid lowering of the barrier to entry, which is great… but that app is around 300 lines of code in one file that fetches data on 5 movies, the screenshots are not cropped correctly, and the pager doesn’t swipe back to the first dot. I am a bit surprised it made it through the review process. Not hating; I think it is great to make this stuff more approachable, but I'm not convinced junior devs are in danger yet.

nopinsight

GPT-3 was released less than 3 years ago and it was far from capable of this.

isaacfrond

the paper: https://arxiv.org/pdf/2303.11366.pdf

It reminds me of the thinking-fast vs. thinking-slow dichotomy. Current LLMs are the thinking-fast type. Funnily enough, people’s complaints about its errors are reminiscent of this: it answers just too quickly, and only with its instant-response neural net. A thinking-slow answer would be more akin to a chain-of-thought answer. Allowing the LLM a more flexible platform than CoT prompting might well be the next step. Of course, it would also multiply compute cost, so it might not be in your $20 subscription.

cornholio

A narrower question: can we perhaps stop putting AGI and ChatGPT in the same paragraphs as if they are somehow relevant to each other? Intelligence has very little to do with a glorified Google search trained by statistical crunching of superhuman amounts of data; there is not even a chicken-sized trace of intelligence in ChatGPT that is not a reflection of the training set or of the embedded human-designed models it uses to mimic problem solving and conversation.

nopinsight

Several necessary ingredients of human intelligence are present in GPT-4: complex pattern matching, abstraction from concrete examples and application of the abstract patterns to new examples, pattern interpolation, and basic reasoning.

This is evident from its ability to generalize from the training set to new problems within many domains.

It's still unable to generalize as well as a smart human beyond the distribution it was specifically trained on, which is evident from its poor performance on AMC, LeetCode medium and hard, and Codeforces problems. But most humans are not great at these kinds of problems either.

Benchmark and test results: https://openai.com/research/gpt-4

PartiallyTyped

> A larger question: Are we hurling ourselves toward a (near) future of unaligned AGI with self-improvement capabilities?

We are running towards a brick wall and people are not paying attention. Setting up self-reflection loops today is actually fairly trivial and can be done programmatically: all the model needs to do is produce a solution, invoke the evaluation, and keep iterating.
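A bare-bones sketch of such a loop, assuming the pre-1.0 `openai` Python package; the prompts and the crude run-and-capture-stderr evaluator are stand-ins, not the Reflexion implementation linked above.

```python
# Bare-bones sketch of a generate-evaluate-iterate loop, assuming the pre-1.0
# openai Python package. The prompts and the crude stderr-based evaluator are
# stand-ins, not the Reflexion implementation linked elsewhere in the thread.
import subprocess
import tempfile
import openai

def ask(messages):
    reply = openai.ChatCompletion.create(model="gpt-4", messages=messages)
    return reply["choices"][0]["message"]["content"]

def run_candidate(code):
    """Run the candidate solution (expected to contain its own asserts); return stderr."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
    result = subprocess.run(["python", f.name], capture_output=True, text=True, timeout=30)
    return result.stderr

def solve(task, max_rounds=3):
    messages = [{"role": "user",
                 "content": f"Write a plain Python solution, with asserts, no markdown:\n{task}"}]
    code = ask(messages)
    for _ in range(max_rounds):
        errors = run_candidate(code)
        if not errors:
            break                                   # evaluation passed, stop iterating
        messages += [{"role": "assistant", "content": code},
                     {"role": "user",
                      "content": f"The code failed with:\n{errors}\n"
                                 "Reflect on what went wrong and return a corrected version."}]
        code = ask(messages)
    return code
```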

riku_iki

So how do we know ChatGPT didn't have HumanEval in its training data?

nopinsight

The point is that adding a couple of components can improve GPT-4 significantly, as shown above. The data it was originally trained on is presumably held constant in the evaluation above.

riku_iki

The point is that if HumanEval was in GPT's training data, then this component improved memorization from mediocre to OK-ish, and actual coding skill still hasn't been tested.

drbig

Honest question: why do so many people attribute "thinking", "knowing", "understanding", "reasoning", "extrapolating" and even "symbolic reasoning" to the outputs of advanced token-based probabilistic sequence generators, also known as LLMs?

LLMs are inherently incapable of any of that - as in mechanically incapable, in the same way a washing machine is incapable of being an airplane.

Now my understanding is that the actual systems we have access to now have other components, with the LLM being the _core_, but not _sole_ component.

Can anybody point me to any papers on those "auxiliary" systems?

I would find it very interesting to see if there are any LLMs with logic components (e.g. Prolog-like facts database and basic rules that would enforce factual/numerical correctness; "The number of humans on Mars is zero." etc.).

mjburgess

Because they don't distinguish between properties of the output and properties of the system which generated it. Indeed, much of the last decade of computer-science-engineering has basically been just insisting that these are the same.

An LLM can generate output which is indistinguishable from that of a system which reasoned/knew/imagined/etc. -- therefore the "hopium / sky is falling" manic preachers call its output "reasoned" etc.

Any actual scientist in this field isn't interested in whether measures of a system (its output) are indistinguishable, they're interested in the actual properties of the system.

You don't get to claim the sun goes around the earth just because the sky looks that way.

sjducb

Do submarines swim? No, but they are faster underwater than all swimmers. Therefore they are the best swimmers despite being unable to swim....

LLMs are producing human level reasoning in many domains, therefore they are the best at reasoning despite being unable to reason...

This whole debate hangs on the definition of "reasoning"

sebzim4500

Scientists are extremely interested in measurable results of experiments. I think you are thinking of philosophers.

nopinsight

Can an airplane fly? Can a submarine swim?

Yes, AI may be constructed quite differently from human intelligence. Can it accomplish the same purposes? For some purposes, the answer is a resounding yes as can be seen from its applications around the world by millions of people.

Can an animal ‘think’, ‘understand’, or ‘reason’? Maybe not as well as a homo sapiens. But it’s clear that a raven, a dolphin, or a chimp can do many things we assume require intelligence. (A chimp may even have a slightly larger working memory than a human, according to some research.)

Wouldn’t it be a little preposterous to assume that a young species like ours stands at the pinnacle of the intelligence hierarchy among all possible beings?

f6v

> Wouldn’t it be a little preposterous to assume that a young species like ours stands at the pinnacle of the intelligence hierarchy among all possible beings?

You’re right, AI doesn’t need to be AGI to be useful. Most SEO content on the internet is probably even worse than what ChatGPT can do. And an LLM could hallucinate another Marvel movie, since they’re all so similar.

My problem is that people make ungrounded claims about these systems either already having sentience or being just few steps away from it. It’s a religion at this point.

bsaul

Some prompt results are only explainable if ChatGPT has the ability to produce some kind of reasoning.

As for your analogy, I'm not sure we know enough about human intelligence core mechanisms to be able to dismiss NN as being fundamentally incapable of it.

mjburgess

The reasoning occurred when people wrote the text it was trained on in the first place; its training data is full of the symptoms of imagination, reason, intelligence, etc.

Of course if you statistically sample from that in convincing ways it will convince you it has the properties of the systems (ie., people) which created its training data.

But on careful inspection, it seems obvious it doesn't.

Bugs Bunny is funny because the writing staff were funny; Bugs himself doesn't exist.

drbig

> Bugs Bunny is funny because the writing staff were funny; Bugs himself doesn't exist.

Excellent analogy, and I appreciate analogies (perhaps even a bit too much). Will be using this one. Thank you!

SanderNL

If you “sample” this enough to be reasoning in a general manner, what exactly is the problem here?

Magic “reasoning fairy dust” missing from the formula? I get the argument and I think I agree. See Dreyfus and things like “the world is the model”.

Thing is, the world could contain all intelligent patterns and we are just picking up on them. Composing them instead of creating them. This makes us automatons like AI, but who cares if the end result is the same?

bsaul

At a minimum, ChatGPT displays a remarkable ability to maintain a consistent line of conversation throughout a long and complex exchange with a user, taking into account all the internal implicit references.

This, to me, is proof that it is able to correctly infer meaning, and is clearly a sign of intelligence (something a drunk human has trouble doing, for example).

krainboltgreene

> As for your analogy, I'm not sure we know enough about human intelligence core mechanisms to be able to dismiss NN as being fundamentally incapable of it.

If there's one field of expertise I trust programmers to not have a clue about it's how human intelligence works.

SanderNL

What makes you so sure we are capable of it? Gut feeling? How do you reason, exactly? This answer is worth billions so why not enlighten us.

You don’t know? But you feel you have “it”, this magical substance, this substrate of reason itself? Reasoning is built out of what, exactly?

Sorry to be that guy, but I fail to see more than word play.

YeGoblynQueenne

The thing is that we've known how to do reasoning with computers since the 1960's at least. Here:

https://dl.acm.org/doi/10.1145/321250.321253

That's the paper introducing the Resolution principle, which is a sound and complete system of deductive inference, with a single inference rule simple enough that a computer can run it.

The paper is from 1965. AI research has had reasoning down pat since the 1970s at least. Modern systems have made progress in modelling and prediction, but lost the ability to reason in the process.

Yeah, we totally "scienced that shit" as you say in a comment below. And then there was an AI winter and we threw the science out because there wasn't funding for it. And now we got language models that can't do reasoning because all the funding comes from big tech corps that don't give a shit about sciencing anything but their bottom line.

mjburgess

What makes you so sure diluting things doesn't make them stronger? I mean, you don't know any physics, chemistry or biology -- but it's just word play right?

I mean, there isn't anything called science we might use to study stuff. You can't actually study intelligent things empirically: what would you study? Like animals, and people, and things? That would be mad. No no, it's all just word play.

And you know it's wordplay because you've taken the time to study the philosophy of mind, cognitive science, empirical psychology, neuroscience, biology, zoology and anthropology.

And you've really come to a solid conclusion here: yes, of course, the latest trinket from silicon valley really is all we need to know about intelligence.

That's how the scientific method works, right?

Silicon Valley releases a gimmick and we print that in Nature and all go home. It turns out what Kant was missing was some VC funding -- no need to write the Critique of Pure Reason.

pixl97

>What makes you so sure diluting things doesn't make them stronger

Alcohol diluted with around 30% water makes it 'stronger' at killing bacteria...

I mean it's easy to say "just science that shit", and then forget we've been spending decades and billions of dollars doing just that.

SanderNL

Let’s all see and be amazed by the absolutely breathtaking achievements of those fields in the domain of AI…

krainboltgreene

> What makes you so sure we are capable of it? Gut feeling? How do you reason, exactly?

It never fails: When faced with the reality of what the program is your average tech bro will immediately fall back to trying to play their hand at being a neuroscientist, psychologist, and philosopher all at once.

SanderNL

You did not answer a single thing and maybe I am those things. You don’t know me.

YeGoblynQueenne

>> Honest question: Why so many people attribute "thinking", "knowing, "understanding", "reasoning", "extrapolating" and even "symbolic reasoning" to the outputs of the advanced token-based probabilistic sequence generators, also known as LLMs?

It's very confusing when you come up with some idiosyncratic expression like "advanced token-based probabilistic sequence generators" and then hold it up as if it is a commonly accepted term. The easiest thing for anyone to do is to ignore your comment as coming from someone who has no idea what a large language model is and is just making it up in their mind to find something to argue with.

Why not just talk about "LLMs"? Everybody knows what you're talking about then. Of course I can see that you have tied your "definition" of LLMs very tightly to your assumption that they can't do reasoning etc., so your question wouldn't be easy to ask unless you started from that assumption in the first place.

Which makes it a pointless question to ask, if you've answered it already.

The extravagant hype about LLMs needs to be criticised, but coming up with fanciful descriptions of their function and attacking those fanciful descriptions as if they were the real thing, is not going to be at all impactful.

Seriously, let's try to keep the noise down in this debate we're having on HN. Can't hear myself think around here anymore.

m3kw9

Just curious why you’d respond so much to him and also add nothing to the discussion?

YeGoblynQueenne

Hang on, how is it fair to ask me why I "add nothing to the discussion" when all your comment does is ask me why I add nothing to the discussion? Is your comment adding something to the discussion?

I think it makes perfect sense to discuss how we discuss, and even try to steer the conversation to more productive directions. I bet that's part of why we have downvote buttons and flag controls. And I prefer to leave a comment than to downvote without explanation, although it gets hard when the conversation grows as large as this one.

Also, can I please separately bitch about how everyone around here assumes that everyone around here is a "he"? I don't see how you can make that guess from the user's name ("drbig"). And then the other user below seems to assume I'm a "him" also, despite my username (YeGoblynQueenne? I guess I could be a "queen" in the queer sense...). Way to go to turn this place into a monoculture, really.

sebzim4500

Not him but I am also extremely frustrated by the fact it is impossible to have a real discussion about this topic, especially on HN. Everyone just talks past each other and I get the feeling that a majority of the disagreement is basically about definitions, but since no one defines terms it is hard to tell.

Madmallard

I don't think there's anything inherently different algorithmically or conceptually.

Our brain is just billions of neurons and trillions of connections, with millions of years of evolution making certain structural components of our network look a certain way. The scale makes it impossible to replicate.

sebzim4500

What do you mean 'impossible to replicate'. With current technology, or in general?

Madmallard

Possibly both? Certainly the first of the two.

m3kw9

It kind of does “understand”: when humans supervise it during training, it is somehow able to relate and give mostly coherent responses. It may not be feeling it, but it does seem to “understand” a subject better than a few people do.

whazor

But we do not even know whether GPT-4 is 'just an LLM'. Given the latest add-ons and the fact that it can do some mathematics, I think there is more under the hood. Maybe it can query some reasoning engine.

This is why I think it is so important for OpenAI to be more open about the architecture, so we can understand the weaknesses.

hesdeadjim

I threw a challenging rendering problem at it and I was pretty impressed with the overall structure and implementation. But as I looked deeper, the flaws became apparent. It simply made up APIs that didn’t exist, and when prompted to fix it, couldn’t figure it out.

Still, despite being fundamentally wrong it did send me down some different paths.

jasfi

Using APIs that don't exist is the biggest problem I've seen with ChatGPT, and it seems GPT-4 as well.

exodontist

I asked ChatGPT about the API for an old programming game called ChipWits. It invented a whole programming language it called ChipTalk, an amalgam of the original ChipWits stuff, missing some bits and adding others, and generated a parser for it, which I implemented and got to work, before figuring out how much was imaginary after talking to the original ChipWits devs. They found it pretty amusing.

carbocation

> and got to work

Can you elaborate?

irjustin

I'm fast learning Django and even though it's an extremely well documented space, ChatGPT has sent me down the wrong path more than a handful of times.

This is especially difficult because I don't know when it's wrong and it's so damn confident. I've gotten better at questioning its correctness when the code doesn't perform as expected but initially it cost me upwards of 30min per time.

Still, I would say between ChatGPT and Copilot - I'm WAY further ahead.

hughesjj

ChatGPT or GPT-4?

Public Copilot uses GPT-3.5, as does non-premium ChatGPT.

Nathanba

My biggest problem with it is that it doesn't seem to understand its own knowledge. If you talk to it for a while and go back and forth on a coding problem, it will often suddenly start using wrong syntax that doesn't exist, even though at that point it should already know, and have looked up for sure, that this syntax can't possibly exist, because it responded correctly many times. So in human terms, it has read the documentation and must know that this syntax can't possibly exist, and yet it doesn't know that 10 seconds later. That's currently what makes it seem like not a real intelligence to me.

vitorgrs

One of the advantages of Bing, and I guess now ChatGPT with the browsing plugin, is that it's able to search the web for the right API.

saulpw

To be fair, using APIs that I think should exist is how I develop most of my APIs.

jasfi

Except that I wasn't asking it to develop a new API.

xkgt

A simple confidence metric could do the trick. As the model grows larger, it is getting more difficult to understand what is going on, but that doesn't mean it needs to be a total black box. At least let it emit some proxy metrics. In due course, we will learn to interpret those metrics and adjust our internal trust model.

kozikow

You can just ask it to give you its confidence in the output on a scale of 0 to 1.

ALittleLight

I wonder if a plugin to let it query API docs would solve this problem.

behnamoh

Also it makes up Python libraries, macOS apps to do certain tasks, etc.

splatzone

I’ve had very good results from running the code and pasting the errors back into ChatGPT and asking it what to do. Sometimes it corrects itself quite well

behnamoh

Put that in a loop and see if AGI emerges.

whitehexagon

>It simply made up APIs that didn’t exist

That has been my experience with Zig. It led me to the conclusion that there are just too many 'non-indexed' developer tools in use these days, so there isn't much training data for new topics. But it was happy to hallucinate APIs and their proof of existence.

rel2thr

Yeah, I find it to be wrong a lot when coding. But it's faster for me to fix existing code than to write code from scratch, so it's still better than nothing for me.

osteele

Same. It seems similar to Copilot in that regard, but better at text-to-code, porting between languages or frameworks, and generating test cases and readmes: https://notes.osteele.com/gpt-experiments/using-chatgpt-to-p...

RhysU

Most of us are much worse on coding problems not in our training set!

(Looks down at dynamic programming problem involving stdin/stdout and combining two data structures).

whatshisface

The reason we're being kept around still is that you can solve the problem without it ever appearing in your training set, and once you have, it has.

toomuchtodo

Hints of The Nine Billion Names of God for sure.

https://en.wikipedia.org/wiki/The_Nine_Billion_Names_of_God

dymk

Oh wow - Unsong [1] must have taken some inspiration from that. Into the queue it goes!

[1] https://unsongbook.com/

passion__desire

A short film adaptation released a year ago.

https://www.youtube.com/watch?v=UtvS9UXTsPI

yarg

A better question is, given a significant corpus of complex functionality, can it implement complex code in a language that it knows, but in which it has only seen lower complexity code?

Can it transfer knowledge across languages with shared underlying domains?

mjburgess

I think given that it's been trained on everything ever written, we should suppose the answer is no.

It has always been possible, in the last century, to build an NN-like system: it's a trivial optimization problem.

What was lacking was the 1 exabyte of human-produced training data necessary to bypass actual mechanisms of intelligence (procedural-knowledge generation via exploration of one's environment, etc.).

ren_engineer

The implication here is that GPT is just brute-force memorizing stuff, and that it can't actually work from first principles to solve new problems that are just extensions/variations of concepts it should know from training data it has already seen.

On the other hand, even if that's true, GPT is still extremely useful, because 90%+ of coding and other tasks are just grunt work that it can handle. GPT is fantastic for data processing, interacting with APIs, etc.

RhysU

No, the implication is that most of us fake it until we make it. And The Peter Principle says we're all always faking something. My comment was just about humanity. ChatGPT isn't worth writing about.

tus666

We aren't state machines. We are capable of conscious reasoning, which GPT or any computer is not.

We can understand our own limitations, know what to research and how, and how to follow a process to write new code to solve new problems we have never encountered.

Our training set trains our learning and problem solving abilities, not a random forest.

waf

I've been adding C# code completion functionality to my REPL tool, and ended up reverting to the text-davinci model.

The codex (discontinued?) and text-davinci models gave much better results than GPT3.5-turbo, specifically for code completion scenarios. The latest models seem to produce invalid code, mostly having trouble at the boundaries where they start the completion.

My suspicion is that these latter models focus more on conversation semantics than code completion, and completing code "conversationally" vs completing code in a syntactically valid way has differences.

For example, if the last line of code to be completed is a comment, the model will happily continue to write code on the same line as the comment. Not an issue in a conversation model as there is a natural break in a conversation, but when integrating with tooling it's challenging.

Most likely the issue is that I'm not yet effective at prompt engineering, but I had no issues iterating on prompts for the earlier models. I'm loving the DaVinci model and it's working really well -- I just hope it's not discontinued too soon in favor of later models.
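For context, a minimal sketch of the kind of completion-style call being described, assuming the legacy (pre-1.0) `openai` Completions API; the stop sequence and prompt trimming are illustrative workarounds, not waf's actual tool.

```python
# Minimal sketch of a completion-style call, assuming the legacy (pre-1.0) openai
# Completions API. The stop sequence and prompt trimming are illustrative
# workarounds, not the tool described above.
import openai

def complete_csharp(code_prefix):
    # Ending the prompt on a fresh line keeps the model from continuing code on
    # the same line as a trailing comment.
    if not code_prefix.endswith("\n"):
        code_prefix += "\n"
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=code_prefix,
        max_tokens=128,
        temperature=0,
        stop=["\n\n"],   # stop at a blank line so the completion stays local
    )
    return response["choices"][0]["text"]
```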

wakahiu

I can corroborate that text-davinci gives much better results for tasks involving summarization or extraction of key sentences from a large corpus. I wonder what empirical metrics OpenAI uses to determine performance benchmarks for practical tasks like these. You can see the model in action for analysis of reviews here: https://show.nnext.ai/

[Disclaimer - I work at nnext.ai]

pffft8888

I was just talking about this the other day:

> It's more hacking than careful and well-specified engineering, and that could lead down a path of instability in the product where some features get better while others get worse, without understanding exactly why.

https://news.ycombinator.com/threads?id=pffft8888&next=35269...

letitgo12345

Will take a bit of time before AI can consistently beat us on coding/proofs but the raw ingredients imo are there. As someone who was skeptical of AGI via just scaling things up even after GPT-3, what convinced me was the chain of thought prompting paper. That shows the LLM can pick up on abstract thought and reasoning patterns that humans use. Only a matter of time before it picks up on all of our reasoning patterns (or maybe it already has and is just waiting to be prompted properly...), is hooked up to a good memory system so it's not limited by the context window and then we can watch it go brrrr

It can still make stupid mistakes in reasoning but I don't think that's fundamentally unsolvable in the current paradigm

danielheath

> That shows the LLM can pick up on abstract thought and reasoning patterns that humans use.

Does it? I’m still unconvinced it’s more than copying other examples of “show your work”.

letitgo12345

It's definitely not just copying verbatim. If you mean it's emulating the reasoning pattern it sees in the training data well...don't humans do that as well to get answers to novel problems?

sidlls

We don't know all the different ways humans arrive at answers to novel problems.

And while these LLMs aren't literally just copying verbatim, they are literally just token selection machines with sophisticated statistical weighting algorithms biased heavily towards their training sets. That isn't to say they are overfitted, but the sheer scale/breadth gives the appearance of generalization without the substance of it.

qlm

No, humans don't do that. If humans did that nothing new would ever be created.

croes

They are remixing not reasoning

mirekrusin

It has been proven many times that it creates internal abstract representation models. The most trivial example is playing chess or Go via text.

mjburgess

The statistical distribution of historical chess games is an approximate statistical model of an actual model of chess.

Its "internal abstract representation" isn't a representation; it's an implicit statistical distribution across historical cases.

Consider the difference between an actual model of a circle (e.g., radius + geometry) and a statistical model over 1 billion circles.

In the former case, a person with the actual model can say, for any circle, what its area is. In the latter case, the further you get outside the billion samples, the worse the reported area will be. And even within them, it'll often be a little off.

Statistical models are just associations in cases. They're good approximations of representational models for some engineering purposes; they're often also bad and unsafe.
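A toy numeric version of that circle example, assuming NumPy; `np.interp` clamps outside the sampled range, so interpolation between seen radii works while extrapolation beyond them quietly falls apart.

```python
# Toy numeric version of the circle example above, assuming NumPy. np.interp
# clamps outside the sampled range, so interpolation between seen radii is fine
# while extrapolation beyond them quietly returns nonsense.
import numpy as np

radii = np.linspace(0.1, 10.0, 1000)     # the "billion circles", scaled down
areas = np.pi * radii ** 2               # areas observed for those cases

def statistical_area(r):
    return np.interp(r, radii, areas)    # association over seen cases only

def actual_area(r):
    return np.pi * r ** 2                # the actual model: radius + geometry

print(statistical_area(5.5), actual_area(5.5))      # inside the samples: both ~95.0
print(statistical_area(100.0), actual_area(100.0))  # outside: ~314 vs ~31416
```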

skybrian

It seems like chain-of-thought will work pretty well when backtracking isn't needed. It can look up or guess the first step, and that gives it enough info to look up or guess the second, and so on.

(This can be helpful for people too.)

If it goes off track it might have trouble recovering, though.

(And that's sometimes true of people too.)

I wonder if LoRA fine-tuning could be used to help it detect when it gets stuck, backtrack, and try another approach? It worked pretty well for training it to follow instructions.

For now, it seems like it's up to the person chatting to see that it's going the wrong way.

bamboozled

The perfect reasoner is upon us ?

letitgo12345

I would prefer to say that we've seen a glimpse of what a future world with a perfect reasoner will be like

epicureanideal

And I imagine even the glimpse would cause a lot of venture capital to be flowing into AI... and also government/military funds.

bamboozled

Sorry I was being sarcastic but sounds like you’re actually into it?

meh8881

This guy seems to laugh away the fact that he gave the prompt in terribly broken and chunked up formats. I don’t think it’s surprising the model did poorly. Maybe the contamination issue is true. But maybe it’s also true that the model does fine on novel codeforce problems when you don’t feed it a garbage prompt?

contravariant

Unless the formatting is somehow very different for pre-2021 problems, it is still a strong signal that it is (up to a point) just parroting a solution it has heard somewhere rather than inferring it some other way.

This is neither good nor bad, that will depend on what you want to use the model for.

meh8881

That’s not my point. Maybe it is parroting a solution for something it has seen before, but still capable of writing solutions to problems it has not seen before when they’re well specified.

The presence of contamination does not mean the general capability is intrinsically poor.

krackers

The prompting might be an issue, but I think the larger sign that it's not quite there yet in terms of symbolic reasoning is that, even based on their own paper, GPT-4 gets only a 30 on the AMC 10 (out of 150), whereas leaving the test blank would get you a 37.5 (the AMC awards 1.5 points per question left unanswered, so 25 blanks score 37.5). And this is a closed-ended multiple-choice test, so the conditions should be favorable for it to dominate.

Edit: Although this might be unfair considering that LLMs are known to be poor at calculation. (Maybe it would do better at proof-style problems, like the USAMO?) I wonder how well ChatGPT with WA integration would do.

YeGoblynQueenne

>> I don’t think it’s surprising the model did poorly.

But it did poorly only on the problems it hadn't seen before. Was it prompted differently on one kind of problem, compared to the other?

meh8881

But you can do a task you’ve done before with poor specification too. Sure, maybe it is contamination. But who cares? We only ought to judge the tool on its performance for carrying out good instructions.

famouswaffles

On code gen with some reflection baked in (i.e. feedback and memory), it shot up to 88% (from 67%).

There's some contamination, to be sure, but there's also still quite a lot that can be done to improve the output.

jprete

The need for prompt engineering is an indicator of GPT failing the problem, not succeeding at it.

sdenton4

Like how the need for roads is an indication that cars don't solve transportation problems.

People adapt to maximize the utility of their tools; always have, always will.

meh8881

I disagree. We cannot expect these tools to work if we don’t communicate what we want from them. Can’t do that for humans either.

We shouldn’t be confusing clear communication with “prompt engineering”.

bamboozled

The need for “tools” is a bit of a red flag too.

hughesjj

Why? I'm significantly less productive and competent without tools, and this applies to every human I've ever known.

tarruda

Maybe the OP is also contaminated by a bias toward wanting to find errors in GPT-4?

Finding flaws in GPT-4 while ignoring the fact that we are at the dawn of AI seems like a good remedy for calming the AI existential anxiety going around nowadays.

rakejake

Another example I encountered today.

<Prompt> show me an example of an enum LifecycleStatus deriving from an abstract class Status with fields status_type String and reason String (with default getters). Use a lombok builder on the enum class and show an example of building a LifecycleStatus enum setting only the status_type. </Prompt>

Answer: ChatGPT tells me it absolutely is possible, and generates the code as well. Except Java enums cannot extend an abstract class (since they implicitly extend Enum<T>); you can only have an enum implement an interface. Here I realised that I had not mentioned the word "java" in this prompt (though I had started off the chat with the word "java", so I assume it was in the context).

In any case, my next prompt was:

<Prompt> Is the code example you shared for java? </Prompt>

Answer: Yes, the code example I shared is for Java. Specifically, it is for Java 8 or newer, as it uses the @Builder annotation from Lombok, which was introduced in version 1.16.8 of Lombok.

I continued the conversation and had it try out other methods of achieving the same thing (an abstract class with a static inner builder class, enums implementing an interface). But it basically follows your prompt and provides whatever you ask it to provide. When I asked it to have an enum derive from an abstract class, it did that for me without any complaints. When I later said this was not possible, it changed tracks.

I echo other users in saying that I would not want to use this for learning anything new. There is no point if I have to break my flow and verify after every query. But it is a good rubber duck. The chat interface can be used as a deprocrastination tool: when you are not able to come up with anything fruitful, just ask a few questions. This gets the ball rolling at least.

polskibus

I wonder if he tried the trick that Microsoft recommended in their GPT-4 evaluation paper, that is, asking GPT to go step by step with explanations. It tends to produce much better results, simply because it fits the prediction mechanism that GPT uses: it tends to predict better when the steps are smaller.
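For illustration, a minimal version of that trick, assuming the pre-1.0 `openai` package; the instruction wording appended to the problem is just one common phrasing of the step-by-step prompt, not the paper's exact wording.

```python
# Minimal illustration of the "go step by step" trick mentioned above, assuming
# the pre-1.0 openai package; the appended instruction is one common phrasing.
import openai

def ask_step_by_step(problem):
    prompt = (problem + "\n\nGo step by step: explain your reasoning for each step "
                        "before giving the final answer.")
    reply = openai.ChatCompletion.create(model="gpt-4",
                                         messages=[{"role": "user", "content": prompt}])
    return reply["choices"][0]["message"]["content"]

print(ask_step_by_step("Write a function that merges two sorted lists without using sort()."))
```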

kozikow

I'm confident that if ChatGPT were trained on all code submitted to TopCoder and Codeforces, and then iteratively refined its responses by compiling and evaluating the output, it could reach TopCoder red quite easily. It's just that programming challenges probably weren't that high a priority, and the data is in silos.
