Which will go away (as will everything else, frankly) if you get a bad seed and start generating and refeeding AI content into your AIs. Which will happen, especially as AI sees wider general use.
You can solve this problem by having professionals curate the input. But having an army of field experts curate all your data in every field isn't a workable solution at scale.
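To make the feedback loop concrete, here's a minimal toy sketch of the refeeding problem, with a fitted Gaussian standing in for a real generative model. Nothing here reflects any actual company's pipeline; the sample size and generation count are arbitrary illustrative choices. Each "generation" trains only on samples produced by the previous generation's model, so estimation error compounds: the fitted spread drifts with a downward bias, and over enough generations the distribution narrows and rare tail values disappear.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50  # small per-generation datasets make the drift visible sooner

data = rng.normal(0.0, 1.0, size=n)  # generation 0: "real" data

for gen in range(1, 201):
    mu, sigma = data.mean(), data.std()   # "train": fit the toy model to current data
    data = rng.normal(mu, sigma, size=n)  # "generate": next generation trains only on model output
    if gen % 40 == 0:
        print(f"gen {gen:3d}: sigma={sigma:.3f}  max|x|={np.abs(data).max():.3f}")
```

The point of the toy is that nothing malicious has to happen: pure sampling noise, fed back into training with no fresh real data, is enough to lose the tails of the distribution over time.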
Saw a thing a few days back saying that AI art generators are running into a problem: they're starting to pull AI-generated art into their training sources, and it leads to worse results.
(by a thing I mean a screenshot of a Twitter post, and since Elon seems to have done something to screw Twitter up, I'm unlikely to be able to check whether it was sourced. Image below if you're curious.)
Some talk of how this means it's harder to make new datasets, and how it gives a monopoly advantage to companies that already have existing models.
Apparently ChatGPT has a feature that lets users upload text as well, which means AI-generated material is going to get fed back into the model there too, whether deliberately or accidentally.
With text, though, there's also a copyright issue. Lots of books are public domain, but many are not, especially anything published after the mid-20th century or thereabouts. I remember a thing about Google Books mentioning this. So it's easier to build an AI model on Chaucer or Shakespeare or Austen than on a 20th-century author. One does wonder whether the lack of freely available modern-language text is behind some of the more egregious instances of data scraping.