A leading UNSW computer scientist says a touted solution to a big problem for generative AI is better suited for other forms of artificial intelligence.
AI chatbots, like ChatGPT and Google Gemini, are running out of data to eat.
Generative AI models have swallowed up most of the data they’re legally allowed to process that is of high-enough quality to improve their function.
Chatbots may only have until 2032 before the good data runs out.
Even the low-quality data (taken from less reliable sources, such as webpages rather than published books) is expected to run out, at the very most, a couple of decades after that.

AI companies are looking at what else they can use to keep from stalling in the fiercely competitive race to provide the best artificial assistant. Industry leaders are pointing to ‘synthetic data’ as a potential solution.

Synthetic data means two things. One is data generated by AI, based on real-world information. Give a chatbot a spreadsheet filled with numbers and ask it to make another one just like it, with different numbers. That’s synthetic data.
It can also mean real-world data that has been edited or manipulated by humans, but more on that later.
The CEO of ChatGPT creator OpenAI, Sam Altman, says chatbots will one day be smart enough to train themselves purely on synthetic data.
“As long as you can get over the event horizon where that model is smart enough to make good synthetic data, I think it should be all right,” he said in an interview last year.
UNSW Computer Science Professor Claude Sammut doesn’t agree.
“If all they’re doing is feeding on themselves, they’re not really producing anything that new. Unless there’s significantly different mechanisms [for training] that we don’t know about yet, I don’t think it’s going to keep scaling like that.”
Patterns and logic
Many AI models learn by looking at large numbers of examples that real people have labelled as particular things.
So, for an AI ‘vision system’ like a self-driving car to learn what a traffic cone is, someone has to manually file many pictures under the label ‘traffic cone’ and then feed them to the program.
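To make that concrete, here is a minimal sketch of what a labelled dataset can look like before training. The file names and labels are invented for illustration and aren’t from any real self-driving system.

```python
# Hypothetical labelled examples: humans file each picture under a label,
# and these (image, label) pairs are what a vision system trains on.
labelled_examples = [
    ("images/cone_001.jpg", "traffic cone"),
    ("images/cone_002.jpg", "traffic cone"),
    ("images/bin_001.jpg", "rubbish bin"),
    ("images/pedestrian_001.jpg", "pedestrian"),
]

# A training loop would show the model each image alongside its label,
# nudging the model until its guesses match the human labels.
for image_path, label in labelled_examples:
    print(f"train on {image_path!r} as {label!r}")
```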
A generative AI model, like ChatGPT, is different.
Think of it as a very sophisticated version of the predictive text on your phone. Based on all the texts it’s seen before, it learns to predict what follows your prompt.
But where your phone only learns to predict the next few words, a large language model can be trained on large pieces of text so that it can generate whole documents.
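As a rough illustration of that “predictive text on steroids” idea, here is a toy next-word predictor that simply counts which word follows which in a tiny made-up corpus. Real large language models use neural networks trained on vast amounts of text, but the basic loop is the same: predict the next word, append it, and predict again.

```python
# Toy "predictive text": a bigram model that always picks the most common
# next word it has seen. It can only parrot patterns from its training text.
from collections import Counter, defaultdict

corpus = (
    "the robot kicked the ball and the robot scored a goal "
    "the ball rolled past the keeper and the crowd cheered"
).split()

# Count which word follows which in the training text.
next_word_counts = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    next_word_counts[current][following] += 1

def generate(prompt_word, length=8):
    """Repeatedly predict the most likely next word, starting from a prompt."""
    words = [prompt_word]
    for _ in range(length):
        candidates = next_word_counts.get(words[-1])
        if not candidates:
            break
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)

print(generate("the"))
```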
Prof. Sammut says generative AI systems have big limitations because chatbots lack critical thinking.
“These systems are based on doing pattern matching, they are very good at that, but they can’t do any sort of logical sequential reasoning.”
A chatbot can only tell you 1+1=2 because someone told it so, not because it learned how to do arithmetic.
“These systems can even write computer programs, which is like a sequential plan, but they do it all on patterns they’ve seen before and assemble it all together,” Prof. Sammut says.
“Usually, they manage to assemble it correctly, but not always.”
A recent study popularised the term ‘model collapse’, after researchers fed a photo into an image generator and asked it to make a copy of the picture.
They fed the copy back in and asked it to do the task again and then repeated the process several times.
It didn’t take long until the generated image looked like a blurry mess. The model had ‘collapsed’ by eating and regurgitating its own product.
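The study’s exact pipeline isn’t described here, but the structure of the experiment is just a feedback loop. The sketch below imitates it with a deliberately lossy copy step (downscale, upscale, re-encode) instead of a real image generator, so it is an analogy for the copy-of-a-copy degradation rather than a reproduction of the research.

```python
# Analogy only: each round makes an imperfect copy of the previous output
# and feeds it back in, the same loop structure as the 'model collapse' test.
import io
from PIL import Image  # pip install pillow

def imperfect_copy(image):
    """Return a degraded copy: halve the resolution, upscale, re-encode as low-quality JPEG."""
    small = image.resize((image.width // 2, image.height // 2))
    big = small.resize(image.size)
    buffer = io.BytesIO()
    big.save(buffer, format="JPEG", quality=30)
    buffer.seek(0)
    return Image.open(buffer).convert("RGB")

def copy_of_a_copy(start, rounds=10):
    """Feed each copy back in as the next round's input."""
    current = start.convert("RGB")
    for _ in range(rounds):
        current = imperfect_copy(current)
    return current

# Usage (hypothetical file name):
# copy_of_a_copy(Image.open("library_lawn.jpg")).save("round_10.jpg")
```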
Failures like this are why Prof. Sammut says that, for the foreseeable future, there is always going to be a need for an intervening force alongside generative AI.
“That’s why combining it with classical AI systems, that do this logical reasoning, I think that’s really necessary.”
Below is a photo of UNSW’s Library Lawn uploaded to Adobe’s AI generator Firefly with the prompt to create “an exact replica of the photo uploaded for reference”.
The results were fed back to it and the process was repeated several times. The AI may have done a good job of avoiding a pixelated disaster, but the photo has morphed into something fairly different.
Classic and synthetic
‘Classical AI’ is AI that represents knowledge in the form of symbols, like a set of rules.
Classical AI systems are often slower than generative AI, but they can come with guarantees of correctness.
Think of playing chess against a computer, or an early robot vacuum cleaner bumping into a wall, reversing and correcting course.
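A toy example of that symbolic, rule-based style: a robot-vacuum controller written as explicit if-then rules (entirely hypothetical, not any particular product’s logic). Nothing here is learned from data, which is why its behaviour can be inspected and reasoned about.

```python
# Hypothetical rule-based controller: knowledge is written as symbolic rules,
# not learned from examples.
def vacuum_step(sensors):
    """Return the next action given symbolic sensor readings."""
    if sensors.get("bumped"):
        return "reverse_then_turn"
    if sensors.get("dirt_detected"):
        return "spot_clean"
    if sensors.get("battery_low"):
        return "return_to_dock"
    return "drive_forward"

print(vacuum_step({"bumped": True}))         # reverse_then_turn
print(vacuum_step({"dirt_detected": True}))  # spot_clean
```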
Synthetic data may not be the best fix for generative AI’s data shortfall, but classical AI has plenty of uses for it.
Prof. Sammut’s robot soccer team, rUNSWift, uses synthetic data to train its players.
“We collect a lot of sample images, and then we do things to them, like we invert them, we rotate them, we do various transformations,” he says.
“If you try and teach a robot to recognize an object, and all your data shows the object in one orientation, you take the image and you rotate it, and then you help it train to recognize in different orientations. All that stuff does work.”
This is the other kind of synthetic data mentioned at the beginning of this story: real data altered to make more data, rather than data generated by AI.
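As a rough sketch of that kind of augmentation (not the rUNSWift team’s actual pipeline), the snippet below takes one labelled image and writes out mirrored, flipped and rotated copies using the Pillow library.

```python
# Simple image augmentation: create extra labelled training examples
# by transforming one real image.
from pathlib import Path
from PIL import Image, ImageOps  # pip install pillow

def augment(image_path, out_dir="augmented"):
    """Write mirrored, flipped and rotated variants of one labelled image."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    original = Image.open(image_path)
    variants = {
        "mirrored": ImageOps.mirror(original),      # left-right flip
        "flipped": ImageOps.flip(original),         # top-bottom flip
        "rotated_15": original.rotate(15, expand=True),
    }
    written = []
    for name, img in variants.items():
        path = out / f"{Path(image_path).stem}_{name}.png"
        img.save(path)
        written.append(path)
    return written

# Usage (hypothetical file name): augment("ball_frame_001.png")
```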
The rUNSWift team has taken five world titles since 2000, so there must be something in it.
Data in a different light
One AI industry that needs as much data as it can get, regardless of where it comes from, is self-driving cars.
A recent study comparing accident data between robot cars and ones driven by humans found that self-driving cars are ‘generally’ safer, but have a big problem with safety at sunrise and sunset.
A fatal Tesla Autopilot accident in the US happened in one of these low-light conditions, when a truck pulled out in front of the car.
The Tesla hit the truck’s trailer and kept driving even though the car had been split in half.
“And that was because the vision system thought they saw this big white thing and thought it was a bridge you could drive under, and that’s the sort of thing that obviously [the Autopilot had] never seen that particular configuration of the truck,” Prof. Sammut says.
“The accidents that you see happening with self-driving cars is because they’ve collected lots and lots of data, but they can never collect everything.”
Self-driving vehicle companies operating in the US are using synthetic data to train their cars, and some businesses are focused solely on providing synthetic, virtual worlds for companies to train their autonomous devices.
“There will always be a certain amount of uncertainty,” Prof. Sammut says.
“Do you want driverless cars to be 100% perfect? Because people aren’t either. It’s just they’ve got to be better than, or at least as good as, the reliability of people. But maybe that’s not going to be enough for everybody to accept that.”