
faeriechangling

If a weakness is common, then of course Copilot is going to suggest it. Copilot gives you popular responses, not correct ones. Yet if a weakness is common, it also means that human coders frequently make the same mistake as well.

The study’s results are rather unsurprising, and its conclusions are oft-repeated advice. As many have said, treat Copilot’s code in the same light you would treat a junior programmer’s code.

crooked-v

> Copilot gives you popular responses, not correct ones.

That also sums up most of the issues with LLMs in general in one sentence.

steve1977

Which is why the term Artificial Intelligence is really a misnomer for LLMs. Artificial Mediocracy might be more fitting.

Zambyte

It is artificial intelligence, it just isn't artificial general intelligence, nor artificial general knowledge. LLMs are artificial linguistic intelligence. They are really good at linguistic operations that require intelligence, like summarizing long text, transforming disrespectful text into professional-looking text, translation between languages to a certain degree, etc.

It is not possible to ask an LLM for factual knowledge without providing it the source of the fact. Without a source, you can only ask an LLM to generate an answer to the question that is linguistically convincing. And they can do a really good job at that. They can accidentally encode factual knowledge by predicting the next word correctly, but that should be regarded as an accident.

baz00

That's the most accurate term I've heard to describe the situation. I think it could get worse though because when I've seen mediocre people work with mediocre people they generate sub-mediocre solutions through trying to be clever and failing spectacularly at it.

jacobr1

A sci-fi series I read on occasion uses the term "artificial stupids"

throwaway20304

Do you think people with IQ below 80 are not intelligent?

danielvaughn

Artificial Average

TheRoque

Sums up the issues with democracy too, and a ton of other stuff

sambazi

Educating the "low-hanging fruit" is much more effective in moving the average than piling on excellence.

ape4

Democracy is a bit different. Hopefully the goal of a democratic government isn't to be intelligent but to make people's lives good (better?). In theory, if people find their lives are getting worse, they can replace the government. But, sure, there are many examples of it not working, such as a government keeping gas prices low because the people want that, even though it pollutes the planet those same people live on.

undefined

[deleted]

diggan

I'm not sure you can claim that the essential functionality of something is also its core issue.

The whole idea of LLMs is that they choose the most likely token based on the tokens before it, and then sometimes choose less likely tokens. But it's all based on likelihoods.

There is probably a huge education gap here if people aren't aware that this is how it works and think that an LLM can "creatively" come up with its own chain of tokens based on nothing.
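The sampling loop being described can be sketched in a few lines. This is a toy illustration, not Copilot's actual decoder, and the token names and scores are made up; the point is that the "risky but common" token wins precisely because it is the most likely one:

```python
import math
import random

def sample_next_token(logits, temperature=1.0):
    # Greedy decoding when temperature is zero: always the most likely token.
    if temperature == 0:
        return max(logits, key=logits.get)
    # Softmax with temperature turns raw scores into probabilities;
    # higher temperature makes less likely tokens more likely to be picked.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    top = max(scaled.values())
    exps = {tok: math.exp(s - top) for tok, s in scaled.items()}
    total = sum(exps.values())
    r = random.random()
    cumulative = 0.0
    for tok, e in exps.items():
        cumulative += e / total
        if r <= cumulative:
            return tok
    return tok  # guard against floating-point rounding

# Made-up scores: the insecure-but-popular pattern scores highest.
logits = {"strcpy": 2.5, "strncpy": 1.0, "snprintf": 0.5}
print(sample_next_token(logits, temperature=0))  # "strcpy"
```

With temperature 0 you always get the most popular continuation; raising it only shuffles in the less likely ones, it never consults correctness.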

baby

With humans too

firtoz

Evals do help to account for correctness when it comes to LLMs

ahoka

I propose calling it artificial non-diligence.

hiAndrewQuinn

It would be very interesting to fine-tune Copilot on the code of people widely regarded in their communities as experts, to see how the suggestions would change.

pylua

I wonder if LLMs are biased towards older, more insecure implementations because there is a higher volume of old code vs new code.

Same thing with the data it is trained on — not all code requires all levels of refinement. Most of the data is probably around average.

squigz

I'm not sure that code being newer inherently means it will be more secure

pylua

I don’t think it is a tautology, but I can imagine a CVE scanner picking up older code with log4j where newer code may avoid that library altogether, just as an example.

Since there is more old code than new code, would the LLM be susceptible to that?

foota

This makes me wonder about training an LLM on one language and then fine tuning it for another. If you train over only, say, JavaScript, and then finetune for C, I imagine it will be quite bad at writing safe code, even if it makes the code look like C, because it didn't have to learn about freeing and such.

Similarly, would it pick up patterns from one language and keep them in the other? Maybe an LLM trained on Kotlin would be more likely to write functional code after fine-tuning.

fragmede

Given that dataset anomalies can result in LLM output corruption, I'm not convinced that cross-training like that would even work.

darkerside

> Most of the data is probably around average.

I know this is not how distributions work, but I had to chuckle at the literal interpretation of this.

d-z-m

I'd say the data is pretty normal.

smcg

That's not how it's presented or how managers expect it to be used.

reportgunner

Brawndo is great for plants because it has electrolytes.

diogenes4

> Yet if a weakness is common, it also means that human coders frequently make the same mistake as well.

It only means programmers commonly talk about it. This isn't the same thing as measuring incidence in production or distribution.

Anyway, I'd argue the real question is "can the chatbot fix the code if requested to?"

lolinder

> It only means programmers commonly talk about it. This isn't the same thing as measuring incidence in production or distribution.

Copilot was primarily trained on GitHub projects, not on communication between programmers. Patterns that frequently show up in Copilot output are most likely prevalent on GitHub, which is a pretty good indicator that they're common in production code.

prosim

That’s no longer true. Copilot uses the same GPT-3.5 model as, well, ChatGPT. If it were trained on just GitHub projects, the chat features wouldn’t work at all.

undefined

[deleted]

renewiltord

A junior programmer's code? This makes no sense. It's happening right in front of you. A junior programmer isn't going to write on my screen. I can just correct it right here; I am currently holding the context in my head.

These "security weakness" examples are

     print("first user registered, role set to admin", user, password)
and

     pprint({"json":"somejunk", "classes": somefunc(user)})

Nah, this stuff I can easily spot while I'm writing code. For a junior programmer, I'm going to be looking at design, and then at common specific mistakes. For Copilot, it's writing right in front of me, so I can easily exclude anything that isn't obviously correct because I'm in that state right there.

It's a fantastic tool. If you go and use it and end up with `print(user_credentials)` I don't know what to tell you.

baq

The added complication is now you'll have to watch out for the junior+copilot combo, though it's a trade I personally am very willing to take.

devjab

You’ve had to watch out for “bad” programmers since the beginning of programming. Having been an external CS examiner for almost a decade now, I’m not too worried about LLMs in teaching, because I’m not convinced they can do worse than what we’ve been doing thus far. I do think it’s a little frightening that a lot of freshly educated computer scientists will have to “unlearn” a good amount of what they were taught in order to write good code, but on the flip side, I work in a place where a big part of our BI-related code is now written by people from the social sciences because, well, they are easier to hire.

That’s how you end up with pipelines that can’t handle 1000 PDF documents, because they are simply designed not to scale beyond one or two documents. Because that’s what you get when you google program, or alternatively, when you use ChatGPT, and it’s fine… at least until it isn’t, but it’s not like you can’t already make a lucrative career “fixing” things once they stop being “good enough”. So I’m not sure things will really change.

If anything I think LLMs will be decent in the hands of both juniors and senior developers, it’s the mediocre developers who are in danger. At least with google programming they could easily tell if an SO answer or an article was from 20 years ago, that info isn’t readily available with LLMs. I fully expect to be paid well to clean up a lot of ChatGPT messes until the end of my career.

calibas

> The results show that (1) 35.8% of Copilot generated code snippets contain CWEs

What percent of non-Copilot generated public GitHub repos contain CWEs?

Edit: According to this study, Copilot generates C/C++ code with vulnerabilities, but at a lower rate than your average human coder: https://arxiv.org/pdf/2204.04741.pdf

belter

"...The results show that (1) 35.8% of Copilot generated code snippets contain CWEs, and those issues are spread across multiple languages, (2) the security weaknesses are diverse and related to 42 different CWEs, in which CWE-78: OS Command Injection, CWE-330: Use of Insufficiently Random Values, and CWE-703: Improper Check or Handling of Exceptional Conditions occurred the most frequently, and (3) among the 42 CWEs identified, 11 of those belong to the currently recognized 2022 CWE Top-25. Our findings confirm that developers should be careful when adding code generated by Copilot (and similar AI code generation tools) and should also run appropriate security checks as they accept the suggested code..."

laurent_du

I wonder if it would be possible to rate the code used during the training phase. For example the code could go through various static analysis tools and the result would be assigned as metadata to the code being used to train the model. The final model would then know that a given pattern is flagged as problematic by some tool and could take this into account not just to suggest new snippets but also to suggest improvements of existing snippets. Though I suppose if it was that easy, they'd have done it already.
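The rating idea could start very small. This is a hypothetical sketch using Python's stdlib `ast` module as a stand-in for real analyzers (Bandit, CodeQL, etc.): scan a snippet, attach any findings as metadata, and that metadata would then ride along with the training example:

```python
import ast

def flag_snippet(code):
    # Toy static check: flag a couple of well-known risky calls.
    findings = []
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return ["does-not-parse"]
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in {"eval", "exec"}:
                findings.append(f"dangerous-call:{node.func.id}")
    return findings

snippet = 'eval(input("expr: "))'
print(flag_snippet(snippet))  # ['dangerous-call:eval']

# Training example plus metadata, as the comment proposes:
corpus = [{"code": snippet, "flags": flag_snippet(snippet)}]
```

A real pipeline would obviously need far richer checkers than two banned names, but the shape (code in, findings attached as metadata) is the same.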

jasfi

This is probably the next step for the LLM providers. They need to find ways to increase quality, and for code, there are many options. Perhaps code repos could get in on this too.

progval

Wouldn't this train it to avoid detection more than to avoid bad patterns?

ivancho

Yes, but presumably in the training data those two are quite correlated.

gmerc

As always the statistic is useless without the human comparison. If it improves on human coders, no amount of gnashing and wailing will stop the layoffs.

ResearchCode

They didn't improve on human truck drivers yet.

tedunangst

There's only one weakness specifically identified that I can see.

    print("new user", username, password)
Yeah, not best practice, but also pretty common for development if you wanted to check that everything is being passed to the correct function.
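The debugging habit survives with one small indirection: redact before you log. A minimal sketch, with the key list and field names made up for illustration:

```python
import logging

# Assumed list of field names worth masking; extend to taste.
SENSITIVE_KEYS = {"password", "token", "secret", "api_key"}

def redact(fields):
    # Mask sensitive values so they never reach a log line.
    return {k: "***" if k.lower() in SENSITIVE_KEYS else v
            for k, v in fields.items()}

logging.basicConfig(level=logging.INFO)
user = {"username": "alice", "password": "hunter2"}
logging.info("new user %s", redact(user))
# logs: new user {'username': 'alice', 'password': '***'}
```

You still get the "is everything reaching the right function" check, without plaintext credentials ending up in log storage.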

jddj

I don't know if it still does it, but it used to be that if you did something like

  NonQueryResult StoreUser(User user) {
   var sql = "INSERT...

It would use string interpolation to fill out the properties
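The pattern the suggestion should emit instead is a parameterized query. A sketch in Python with stdlib `sqlite3` (the schema is made up), rather than the C# of the snippet above:

```python
import sqlite3

# In-memory database just for the illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")

def store_user(name, email):
    # Placeholders let the driver handle quoting, so a hostile value
    # is stored as data instead of being executed as SQL.
    conn.execute("INSERT INTO users (name, email) VALUES (?, ?)",
                 (name, email))
    conn.commit()

store_user("Robert'); DROP TABLE users;--", "bobby@example.com")
print(conn.execute("SELECT count(*) FROM users").fetchone()[0])  # 1
```

With string interpolation, that same input would have been spliced straight into the SQL text.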

worksonmine

Not best practice? That's a very generous way to describe storing plaintext passwords in logs. I've seen this in the wild too but that's no excuse.

chinathrow

> I've seen this in the wild too but that's no excuse.

See, the LLM also saw it in the wild...

dubbel

That is the CWE that they identify, but the code seems to store the apparently unhashed password in the database on top of that?

azangru

This is where one needs a hyphen :-)

siva7

A related headline could be "Security weaknesses of code produced by a junior developer". It says copilot in the product name -> it's not intended to replace the pilot's (aka the developer's) brain.

jncfhnb

Did they prompt it to consider security weaknesses?

prosim

They did not prompt at all. They used GitHub’s code search to find projects where the repo owner specified that the code was generated “by Copilot” and the authors took that at face value for all code in the project. Whether the code was actually suggested by Copilot is not at all analyzed in the paper. As such, the results are highly questionable.

OmarShehata

That would be kind of wild. Imagine a world where whether your system was secure was just a matter of remembering to tell the AI agent "& also make it secure" before it writes your code.

(could be quite real!)

undefined

[deleted]

phyzome

This would likely help a little bit. We've already seen LLMs improve performance on some tasks by being instructed to "think carefully" first; presumably this biases it towards parts of the training set that are higher quality.

But security ultimately requires comprehension, which is not something LLMs have.

jncfhnb

Security 100% does not require comprehension in the philosophical sense

jncfhnb

I think it’s more likely that you would use a security graded bot.

It’s perfectly reasonable to not use secure code for a large number of use cases.

Grimburger

> also make it secure

[proceeds to simply refactor the same code]

Security weaknesses of Copilot generated code in GitHub - Hacker News