Things might change, but right now you simply don’t need anyone’s authorization.
Hopefully it doesn’t change: only a handful of companies have the data, or the funds to buy it, so a change would kill any kind of open-source or low-priced endeavour.
FWIW, Common Crawl - a free, open dataset of crawled internet pages - was used by OpenAI for GPT-3 (GPT-2 was trained on their own WebText scrape instead) and, as part of the Pile, for EleutherAI’s GPT-NeoX. Maybe for GPT-3.5/ChatGPT as well, but they’ve been hush-hush about that.