If you find it hard to keep up with AI research, here’s a foolproof rule-of-thumb for deciding which studies to ignore:
If a majority of the authors work for the AI vendor or a major partner, you know the study:
- Will generally show positive effects.
- Won’t be too positive, for credibility’s sake.
- Will include at least one solemn acknowledgement of a potential problem before dismissing it.
- Will be impossible to replicate, because some core part of it (either the code or the data) won’t be public.
So, you can ignore it.
This research is what I call advocacy research. It’s the kind of study you’d get if you went to an ad agency and told them you wanted to promote your product but, “y’know, make it sound scientific”.
The study is done by people familiar with the tech, many of whom will, consciously or unconsciously, avoid the system’s glaring pitfalls. Much like when you habitually use a system whose undo-redo stack breaks when you paste something: after a few months you work around the issue automatically, without thinking about it. In this case, the researchers have been working with development versions of these systems for years.
They’re clued in enough to know which flaws are well known and need to be acknowledged for credibility. They probably don’t frame it that way to themselves; it’s just the first thing that comes to mind. But that’s the effect it has on the study’s structure.
Finally, if the study’s results look too bad, odds are it will never see the light of day. Just look at what happened to Timnit Gebru. Google’s management felt that suppressing the paper was a reasonable request because, from their perspective, it was exactly that: a reasonable, par-for-the-course request.
Studies like this also run into a counter-intuitive fact: when it comes to productivity, it’s harder to prove a positive than a negative. (The complete opposite of what you usually see in science.) In workplace studies and organisational and work psychology, it’s notoriously hard to prove whether a specific measure genuinely improves productivity.
(That doesn’t stop every company ever from claiming to be able to do exactly that.)
Individual studies on a specific productivity intervention commonly suffer from:
- Demand characteristics, where the subjects alter their behaviour to fit what they think the researcher wants.
- The novelty effect, where performance tends to improve just because it’s new and shiny and not because of anything inherent in the intervention.
- The observer-expectancy effect, which hits advocacy research done by people involved in making the system or intervention especially hard: the experimenter’s subconscious biases and familiarity with the system affect the outcome.
- And, as Terence Eden pointed out, AI in particular is vulnerable to the Barnum/Forer effect.
All of this combined means that any study claiming to prove that a specific software intervention has a positive effect needs to be received with much more scepticism than a study showing a detrimental effect.
Because, as it happens, detrimental effects tend to be easier to demonstrate conclusively. Sometimes it’s because they are statistical patterns—memorisation/plagiarism rates, hallucination rates, that sort of thing. Sometimes it’s because the outcome is just obviously worse. And often it’s because adding a roadblock to a process is so obviously a roadblock.
All of which is to say that you should pay more attention when a study shows that AI code assistants increase the number of security issues. And you should pay less attention to studies that claim to prove that AI code assistants increase productivity.
Especially when you consider that we can’t even agree on what productivity is in the context of software development. What you measure changes completely depending on the time horizon. Software development productivity over a day, week, month, or a year are very different things. What makes you more productive over a day can make you less productive over a year by increasing defects and decreasing familiarity with the code base.
So, measuring the productivity increase of a single intervention is extremely hard and the people who do these studies are generally trying to sell you something.
You can measure the overall productivity of a system. But when you experiment with productivity through a measure-adjust-measure cycle, you end up tampering with the process, which commonly magnifies error and variability. (There’s a ton of literature on this. Search for “Deming tampering” or the “Deming funnel experiment” for examples.)
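Deming’s funnel experiment is easy to simulate. The sketch below is a toy model (my own illustration, not from any study discussed here): it compares leaving a noisy process alone with “Rule 2” tampering, where you adjust the aim to compensate for each observed error. The adjustment, meant to improve accuracy, roughly doubles the variance instead.

```python
import random

def funnel(tamper: bool, n: int = 20_000, seed: int = 1) -> float:
    """Drop n marbles through a funnel aimed at 0; return the variance of where they land."""
    rng = random.Random(seed)
    aim = 0.0
    drops = []
    for _ in range(n):
        drop = aim + rng.gauss(0, 1)  # landing spot = aim + random noise
        drops.append(drop)
        if tamper:
            # Deming's "Rule 2": shift the aim to compensate for the last error.
            aim -= drop
    mean = sum(drops) / n
    return sum((d - mean) ** 2 for d in drops) / n

print(funnel(tamper=False))  # close to 1.0: the inherent noise variance
print(funnel(tamper=True))   # close to 2.0: tampering doubles the variance
```

Each adjustment feeds the previous random error back into the next aim, so the errors stack instead of cancelling. That’s the trap of the measure-adjust-measure cycle: reacting to noise as if it were signal makes the process worse.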
Reading advocacy research like this is useful for discovering best-case disasters. In the GitHub Codex paper, for example, they say that memorisation (where the model outputs direct copies of code from the training data) happens in about 1% of cases, as if that were not a worryingly high percentage.
(This is even on the GitHub Copilot sales page, under “Does GitHub Copilot copy code from the training set?”, where it says: “Our latest internal research shows that about 1% of the time, a suggestion may contain some code snippets longer than ~150 characters that matches the training set.” So, the answer to that question is yes.)
From that you can fairly safely assume it happens at least 1% of the time, because tech vendors always underestimate disaster.
And, y’know, a 1% chance of plagiarism every time you use it should have been a catastrophic scandal for a coding tool. But apparently we’re all fine with it? Because mixing random GPL-licensed code in with proprietary code is never an issue, I guess?
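Even taking the vendor’s own 1% figure at face value, the risk compounds quickly with use. A rough back-of-the-envelope calculation, assuming (generously) that suggestions are independent:

```python
p = 0.01  # GitHub's own figure: ~1% of suggestions match the training set

for n in (10, 100, 1000):
    # Probability of at least one memorised snippet in n suggestions.
    at_least_one = 1 - (1 - p) ** n
    print(f"after {n:4d} suggestions: {at_least_one:.0%} chance of at least one match")
```

After a hundred suggestions, that’s already roughly a 63% chance of having pasted at least one memorised snippet; after a thousand, it’s a near certainty.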
For more of my writing on AI, check out my book The Intelligence Illusion: a practical guide to the business risks of Generative AI.