They’re not both true, though. It’s actually perfectly fine for a new dataset to contain AI-generated content, especially when it’s mixed in with non-AI-generated content. It can even be better in some circumstances; that’s what “synthetic data” is all about.
The various experiments demonstrating model collapse have to go out of their way to make it happen, by deliberately recycling model outputs over and over without using any of the methods that real-world AI trainers use to prevent it. As I said, real-world AI trainers are actually quite knowledgeable about this stuff; model collapse isn’t some surprising new development that they’re helpless in the face of. It’s just another factor to include in the criteria for curating training datasets. It’s already a “solved” problem.
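For anyone who wants to see the difference for themselves, here is a toy sketch in Python. It is not taken from any of those experiments: the “model” is just a Gaussian fitted to 20 samples, and `simulate`, `real_fraction`, and the specific numbers are all made up for illustration. With `real_fraction=0.0` every generation trains only on the previous generation’s output; with `0.5`, half of each training set is fresh real data.

```python
import random
import statistics


def simulate(generations, sample_size, real_fraction, seed=0):
    """Repeatedly refit a Gaussian "model" and return its final std.

    Hypothetical toy setup: the real data is N(0, 1), and each generation
    trains on a mix of fresh real samples and samples drawn from the
    previous generation's fitted model.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 matches the real distribution
    n_real = round(sample_size * real_fraction)
    n_synth = sample_size - n_real
    for _ in range(generations):
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        synthetic = [rng.gauss(mu, sigma) for _ in range(n_synth)]
        data = real + synthetic
        # "Training" is just re-estimating the mean and spread.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
    return sigma


# Pure recycling: no real data ever re-enters the training set.
recycled_std = simulate(2000, 20, real_fraction=0.0)
# Curated mix: half of every training set is fresh real data.
mixed_std = simulate(2000, 20, real_fraction=0.5)

print(f"recycled only: std = {recycled_std:.3g}")
print(f"half real:     std = {mixed_std:.3g}")
```

In the fully recycled run the fitted spread shrinks toward zero, while the mixed run stays in the neighborhood of the true value of 1. A one-dimensional Gaussian is obviously nothing like an LLM, so treat this as an intuition pump rather than evidence.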
The reason these articles keep coming around is that there are a lot of people who don’t want it to be a solved problem, and who love clicking on headlines that say it isn’t. I guess if it makes them feel better they can go ahead and keep doing that, but supposedly this is a technology community, and I would expect there to be some interest in the underlying truth of the matter.