The AI bill Newsom didn’t veto — AI devs must list models’ training data

David Gerard@awful.systems · 4 months ago


OhNoMoreLemmy@lemmy.ml · 4 months ago

The other reason they don’t do it is that many models are trained on a large corpus of pirated texts, and documenting this would be a confession.

Not just in an ‘I scraped the New York Times without permission’ kind of way, but in an ‘I illegally downloaded a torrent containing bestsellers from the last 30 years’ kind of way.

Soyweiser@awful.systems · 4 months ago

Bestsellers? There used to be torrents of basically all releases. My provider blocks torrent sites and I don’t use a VPN, so I’m not sure if people still do this, but it was possible to download basically all books (in English) released in a certain period at once.

skillissuer@discuss.tchncs.de · 3 months ago

Occasionally I see this for music (weekly new tracks).

Tar_Alcaran@sh.itjust.works · 4 months ago

Exactly. It’s not that they can’t, or that it’s too expensive, it’s that doing so will reveal their crimes.

imadabouzu@awful.systems · 4 months ago

In a sense, to me, it is the same thing. If your business is built upon indiscriminately repurposing everyone else’s inputs to your benefit and their detriment, it is, in fact, too expensive to reveal that simple truth.