A.I.’s un-learning problem: Researchers say it’s virtually impossible to make an A.I. model ‘forget’ the things it learns from private user data

assassin_aragorn@lemmy.world · 1 year ago

Something Burger 🍔@jlai.lu · 1 year ago

Can’t they remove the data from the training set and start over?

knotthatone@lemmy.one · 1 year ago

Not really, no. None of the source material is actually stored as-is inside the model, so once it’s been trained in, it’s in. Because of the way these models are designed, you can’t point to a particular document and just delete that one thing. It’s like unscrambling an egg.
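A toy sketch of why, under heavy assumptions: this is a hypothetical one-weight model fit by gradient descent, nothing like a real LLM, and the data and the `train` helper are made up for illustration. The point is only that every example nudges the same shared weight, so there is no per-document slot to delete; the only exact way to "forget" the last example here is to retrain without it.

```python
# Toy illustration only (hypothetical data, not how any real LLM is trained):
# fit y = w * x with a single shared weight w.

def train(data, steps=2000, lr=0.01):
    """Return the single weight w learned from a list of (x, y) pairs."""
    w = 0.0
    for _ in range(steps):
        # Average gradient of the squared error over ALL examples:
        # every example contributes to the same number w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

# Assume the last pair is the "private" record someone wants removed.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 12.0)]

w_all = train(data)            # model trained on everything
w_without = train(data[:-1])   # exact "forgetting" = retrain from scratch

print(w_all, w_without)
```

The two weights come out different, but nothing inside `w_all` can be identified as the part that came from the private record, which is the scrambled-egg problem in miniature.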

snooggums@kbin.social · 1 year ago

They can remove ALL the data and start over.

teradome@lemmy.one · 1 year ago

exactly.

removing one thing from a pile != removing the entire pile.

b/c the original goal was to not disturb the rest of the pile

snooggums@kbin.social · 1 year ago

If they can’t remove individual pieces then they need to remove the whole pile, and rebuild the process in a way that does allow them to remove individual pieces.

No, I don’t care how much time and effort it costs. That is on them for abusing other people’s data.

mo_ztt ✅@lemmy.world · 1 year ago

Yes, but that’s not easy… I can’t remember exactly, but I think I saw an estimate that the compute time to train just one of the GPT models cost around $66 million. IDK whether that’s total cost from scratch, or incremental cost to arrive at that model starting from an earlier model that was already built, but I do know that GPT is still to this day using that September 2021 cutoff which to me kind of implies that they’re building progressively on top of already-assembled models and datasets (which makes sense, because to start from scratch without needing to would be insane).

You could, technically, start from scratch and spend 2 more years and however many million dollars retraining a new model that doesn’t have the private data you’re trying to excise, but I think the point the article is making is that that’s a pretty difficult approach and it seems right now like that’s the only way.

skulblaka@kbin.social · 1 year ago

Un-robbing a bank also isn’t easy, but that doesn’t mean I’m able to just say “it too hard :c” and then walk off into the sunset with my looted gains.

Zeth0s@lemmy.world · edit-2 1 year ago

Information leaking is a thing. Some information is spread across multiple sources without actually being in any of those. If you remove something, the model can still infer the information.

If Macron asks for his name to be deleted, you can still retrieve his political opinions simply by knowing the history of his interactions with the French government. I just need to tell the model that the person it has no direct information about is named Macron, and it can profile him.

Same with search engines. The only difference is that there, the inference of missing information is currently done by human brains. The model can substitute for them.