I’ve been reading through hundreds of research papers on AI over the past few months, for a project that is now almost ready.
One type of paper I keep running into is of the “AI is great; here’s how it’s great” variety, which then gets shared across the web on social media and blogs.
—Look how amazing AI is! So much productivity!
But you can be sure that nobody reads past the abstract, and those who do *definitely* don’t read the appendices that outline the questions and methodology of the study.
It shouldn’t surprise you either that everybody misses the fact that these studies usually aren’t peer-reviewed.
I’ve written before about the issue with productivity studies: they are incredibly hard to do well. It’s easier to prove that something conclusively harms productivity, but proving that productivity is increased because of a specific intervention is extremely difficult.
I still read through these papers, because if I do find one that I think is done well, that’d be news interesting enough to shout from the rooftops, and the thought process I go through might help explain some of the issues with these studies.
So, I had a go at rereading a study I’d read before and documented my thoughts as I went through the paper, its appendices and methodology.
—I’ve seen this ChatGPT study before. It looks interesting, I wonder why I dismissed it the last time I checked it out?
—Hmm, synthetic tasks detached from actual work? Not ideal, but probably the only option. The issue with these is that ChatGPT excels at synthetic tasks.
—This is the Overleaf study! That’s why. Their methodology annoyed me. Also, not peer reviewed.
—Eh, $10/h on a synthetic task makes for very different incentives from a press release for a project you’ve worked on for months.
—15-30m tasks? Big meh. Most tasks like these in workplaces are iterative. You work on them, with breaks, share with coworkers, get feedback. Office writing is usually only sporadically solitary.
—Wow, the manager tasks are such bullshit. JFC, I know the tasks have to be synthetic, but they shouldn’t be actively bad tasks representing bad decisions.
—The tasks are also so synthetic that you wouldn’t even notice hallucinations or overfitting, both of which would be a big issue in actual work.
—Okay, this tells you a lot: ‘We also ask whether they have completed a similar task before in their job; 66% say yes.’ This means that a full third had never had to do the 30m task you’re measuring. I’m going to go out on a limb here and guess that the ‘yesses’ aren’t doing these tasks daily either. See, this is why all of these claims about ChatGPT being a productivity boon are suspect. You’re wasting all this energy delivering a productivity boost to uncommon office tasks!
—The ‘in the past year, how many times have you done a task similar to this?’ answers max out at 10+. So, even the study authors realise they’re measuring a tool that gives you a 15m productivity boost once a month, at most.
—Self-reports for sub-task times spent. 🤨
—The synthetic nature of the tasks complicates the grading. Essentially, the graders are assessing creative writing or office fan fiction and not actual work product, so they’re going to be grading fluency, not quality of work.
—Anything that primarily measures fluency, disregards hallucinations and overfitting, and is disconnected from actual work is going to favour ChatGPT results substantially.
—Wow, do I hate the management writing tasks. It’s like they cribbed management notes from all my least favourite people.
—The data analyst tasks kind of hammer home how much this is creative writing and not genuine tasks. I really doubt that writing a ‘code notebook’ is the first thing a data analyst would do there. I have no doubt that’s the first thing they’re supposed to do, but I highly doubt that’s how they approach these tasks in actual office settings.
—The highly rated answers also show how much bullshit the question is. ChatGPT is going to do so well on these.
—Oh, lord. I hate the marketer tasks even more. No wonder the respondents had fun writing this. It’s a creative writing sci-fi exercise, not a work task.
—The consultant task is fairly accurate, though. We do love our bullshit tasks.
—The study’s authors are obsessed with sci-fi nonsense, though. Lab-grown meat production, VR, and AR aren’t exactly representative of common US industries.
—Goddamnit. VR bullshit again. You might as well be asking these people about issues with agricultural production in a self-sufficient Mars colony at this rate.
—Yeah, I don’t think the conclusions are warranted given the study structure and task design.
—It’s notable that they asked how often the participants did a task like this at work but an overview of the answers to that is nowhere to be found.
—Also notable that the task instructions, with or without ChatGPT, aren’t representative of how people actually work. These tasks are always collaborative in real life, and they are never 30 minutes and then done and sent.
—So, with the task rate—clearly less than 10+ per year since they don’t mention the result—this study is basically saying that with ideal tasks that fit ChatGPT’s strengths perfectly, no hallucination or overfitting concerns, it might save you a couple of hours of work a year.
My general feeling based on this study (and others like it that I’ve read) is that in real office settings with genuine tasks, the time saved due to the productivity benefit of current generative AI tools would be a rounding error on the time the employees spend on coffee breaks. You feel very productive while using these tools, but when measured and placed in context with what normal office work is trying to accomplish, the benefit is tiny.
That isn’t to say that this couldn’t change when Microsoft and Google integrate these tools into their office suites. I’m sceptical, but that’s at least a more plausible thesis.
For more of my writing on AI, check out my book The Intelligence Illusion: a practical guide to the business risks of Generative AI.
“I tried out SyntheticUsers, so you don’t have to”
This link fits into both the AI and the software development category.
Niloufar Salehi, who is way more generous with their time than I am, went and tested the SyntheticUsers service.
We’re going to see so much more of this kind of bullshit AI crap.
- “The Company Behind Stable Diffusion Appears to Be At Risk of Going Under”. “Stability AI raised $100 million last year and has already spent a significant portion of those funds.” The AI grift ain’t cheap.
- “I’m an ER doctor: Here’s what I found when I asked ChatGPT to diagnose my patients”. “If my patient in this case had done that, ChatGPT’s response could have killed her.”
- “A quote from Jim Fan”. This theorises that Midjourney is collecting user data to fine-tune its model. Kinda hope this isn’t the case because otherwise they’re likely to run into issues with the GDPR down the line.
- “Blinded by Analogies - by Ethan Mollick - One Useful Thing”. The use of wishful comparisons, where AI is explained by analogy with something that’s entirely different from what it’s actually doing, is a long-standing issue with AI research. Drew McDermott called it “wishful mnemonics” in 1976.
- “Midjourney CEO Says ‘Political Satire In China Is Pretty Not Okay,’ But Apparently Silencing Satire About Xi Jinping Is Pretty Okay - Techdirt”
- “Copyright lawsuits pose a serious threat to generative AI”. Between the lawsuits, EU regulators, and the various unsolved technical issues, the total dominance of Generative AI is not the certainty it’s made out to be.
- “Closed AI Models Make Bad Baselines - Hacking semantics”
- “Italy’s ChatGPT ban attracts EU privacy regulators - Reuters”. I told you so.
- “Merchant: How AI doomsday hype helps sell ChatGPT - Los Angeles Times”. “Scaring off customers isn’t a concern when what you’re selling is the fearsome power that your service promises.”
- “More Everything With AI - Jim Nielsen’s Blog”
- “April 4 - by Rob Horning - Internal exile”. “What I usually take away from interacting with chatbots, more than any information they supply, is a sense of immediacy, a visual spectacle of words coming from nowhere unspooling themselves on the screen.”
What are the major business risks to avoid with generative AI? How do you avoid having it blow up in your face? Is that even possible?
The Intelligence Illusion is an exhaustively researched guide to the business risks of language and diffusion models.
Software development links (and interesting stuff)
- “Pixels of the Week – April 10, 2023 by Stéphanie Walter - UX Researcher & Designer”
- “Tech Companies Are Ruining Their Apps, Websites, Internet”. These companies don’t make good software, and treating their processes as “best practices” means you don’t either.
- I ran into this blog post the other day and enjoyed it enormously. “Worldwide Story Structures”