• Jamie@jamie.moe
    link
    fedilink
    English
    arrow-up
    1
    ·
    edit-2
    11 months ago

    Speaking for LLMs, given that they operate on a next-token basis, there will be some statistical likelihood of spitting out original training data that can’t be avoided. The normal counter-argument being that in theory, the odds of a particular piece of training data coming back out intact for more than a handful of words should be extremely low.

    Of course, in this case, Google’s researchers took advantage of the repeat discouragement mechanism to make that unlikelihood occur reliably, showing that there are indeed flaws to make it happen.

    • TWeaK@lemm.ee
      link
      fedilink
      English
      arrow-up
      3
      ·
      11 months ago

      If a person studies a text then writes an article about the same subject as that text while using the same wording and discussing the same points, then it’s plagiarism whether or not they made an exact copy. Surely it should also be the case with LLM’s, which train on the data then inadvertently replicate the data again? The law has already established that it doesn’t matter what the process is for making the new work, what matters is how close it is to the original work.