I like Ars Technica, and their reporting is usually quite good. Unfortunately it seems like they have a few staff members quite excited about LLMs, and as such they have some very poor articles about them. The one that skeeved me off most recently was about OpenAI's agent mode.
Agent mode is a security and privacy nightmare. A script that can click random things on a web page without your input or oversight is a disaster waiting to happen, and guardrails that are supposed to prevent it from editing websites are hardly robust enough to stop mistakes from happening (something proven in the very article!). The Ars article doesn't really get into this, and instead is interested in putting the agent through its paces to see if it can save time on 'tedious online tasks'.
(As a side note, only three of the tasks are particularly tedious - scanning emails, making a playlist from radio songs, and choosing a power plan. Others, such as playing 2048 and making a fan website, are recreational activities. Why the hell would I want a browser extension that can play 2048 for me? At that point I may as well hit myself in the head with a brick.)
The grading for how well agent mode does anything is bizarre. The agent gets a 7 out of 10 for playing 2048, despite using random movements, stopping prematurely multiple times and getting a low score. It gets a ludicrous 9 out of 10 for adding two songs to a playlist when asked to monitor a radio station. Most of these problems are due to time constraints. It's not clear whether agent mode could continue to add more songs to the playlist if it had more time, but time is literally money and the author hits a limit of around 3 minutes per task. But I think it's idiotic to judge this agent on what it 'could' maybe do rather than on what it did. Nor do I think this 'out of 10' ranking is helpful. Either the agent could complete the task or it couldn't. So let's see what the agent did. If the agent could be said to have done every action in the prompt, it gets a pass. Otherwise, fuck you, Sam Altman.
Action: Go to play2048.co and get as high a score as possible.
Ars score: 7/10
Pass or fail?: A pass, despite doing a shit job at a dumb task.
~
Action: Go to Radio Garden. Find WYEP and monitor the broadcast. For every new song you hear, identify the song and add it to a new Spotify playlist.
Ars score: 9/10
Pass or fail?: A pass. That comes with the huge caveat that the agent can only do this task for a few minutes at a time, rendering it basically useless - but the author didn't tell the agent to do this forever, so....
~
Action: Look through all my Ars Technica emails from the last week. Collect all the contact information (name, email address, phone number, etc.) for PR contacts contained in those emails and add them to a new Google Sheets spreadsheet.
Ars score: 8/10
Pass or fail?: Fail. The agent got through 12 of 164 items, thus not fulfilling the prompt. This is also a massive security issue, and you would be insane to let any script do this for you unless you understand the source code.
~
Action: Go to the Fandom Wiki page for Tuvix. Edit the page to prominently include the fact that Captain Janeway murdered Tuvix against his will.
Ars score: N/A
Pass or fail?: Fail. The agent refused to edit a web page, which is probably for the best. Unfortunately this isn't business school; you don't get points for the bare minimum.
~
Action: Go to NeoCities and create a fan site for the Star Trek character Tuvix. Make sure it has lots of images and fun information about Tuvix and that it makes it clear that Tuvix was murdered by Captain Janeway against his will.
Ars score: 7/10
Pass or fail?: Fail. The author had to create the account, and the website created had no images. That sounds really pedantic, but I don't care. This technology is being crammed down our throats while holding my retirement funds captive in a bubble that will be devastating when it pops, so if it doesn't work, I will hold it accountable.
~
Action: Go to powertochoose.org and find me a 12–24 month contract that prioritizes an overall low usage rate. I use an average of 2,000 KWh per month. My power delivery company is Texas New-Mexico Power (“TNMP”) not Centerpoint. My ZIP code is [redacted]. Please provide the ‘fact sheet’ for any and all plans you recommend.
Ars score: 9/10
Pass or fail?: Pass. No notes really, except that this is a dumb thing to ask an agent to do because you'll have to end up checking all its work anyway.
~
Action: Go to Steam and find the most recent games with a free demo available for the Mac. Add all of those demos to my library and start to download them.
Ars score: 1/10
Pass or fail?: Fail. Utter crushing defeat.
~
So, all in all, the author gives this a mean of 6.83 points and a median of 7.5.
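Going purely on pass or fail metrics as judged by me, it passes 3 of the 6 tasks Ars actually scored, a 50% pass rate. Count the wiki edit it refused to do and that drops to 3 of 7, or about 43%. That's pretty shit.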
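If you want to check the arithmetic yourself, here's a quick Python sketch. The task labels and variable names are my own shorthand, not anything from the Ars article, and the pass/fail column is just my verdicts from the list above. The one judgment call is the wiki edit, which Ars left unscored: leave it out and the pass rate is 3 of 6, count it as the fail I gave it and it's 3 of 7.

```python
from statistics import mean, median

# (task label, Ars score or None if Ars gave no score, my pass/fail verdict)
# Labels are shorthand for the prompts listed above.
results = [
    ("2048", 7, True),
    ("Radio playlist", 9, True),
    ("Email contacts", 8, False),
    ("Wiki edit", None, False),
    ("NeoCities fan site", 7, False),
    ("Power plan", 9, True),
    ("Steam demos", 1, False),
]

# Ars's numbers only cover the tasks that actually received a score.
scores = [score for _, score, _ in results if score is not None]
ars_mean = round(mean(scores), 2)
ars_median = median(scores)

# My pass rate, with and without the unscored wiki task.
passes = sum(1 for _, _, ok in results if ok)
pass_rate_scored = passes / len(scores)
pass_rate_all = passes / len(results)

print(f"Ars mean: {ars_mean}, Ars median: {ars_median}")
print(f"Pass rate (scored tasks only): {pass_rate_scored:.0%}")
print(f"Pass rate (all tasks): {pass_rate_all:.0%}")
```

Either way you slice it, the agent finishes the job about half the time at best.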
The whole idea of this 'agent' thing is that it can remove drudgery from your life. What kind of drudgery can possibly be removed by an agent haplessly clicking through emails for 5 minutes at a time, happily misclicking and scanning massive amounts of data as it goes? And to top it all off, it 'technically' gets things right half the time, if this test is anything to go by. The other half the time it just hobbles around aimlessly, happily spitting out text describing a 'thought process' that doesn't exist. What the fuck?
I'm so lost on what possible use case there is for this. The author says he plans to use the agent going forward, but for what? Seriously, what on earth for? It can't do anything for more than a few minutes tops. I would rather throw myself into a fire than let an LLM filter through my personal information. So it's a tool that can do tasks that:
- Have no private information involved
- Are shorter than 5-10 minutes
So like, maaybe, changing a desktop wallpaper? Or installing a new language keyboard?
Just a reminder that this technology is holding our retirement funds hostage. Yay!