Big Tech hoovered up Australian content without permission - Nine’s 60 Minutes, ABC News, Junkee, SBS, The Conversation, the NDIS and universities scraped by an open-source AI institute whose initial donors included Canva
Big tech firms such as Apple, Anthropic, Nvidia, and Salesforce trained their AI models on Australian content scraped from YouTube without the permission of copyright holders, and contrary to Alphabet's rules, courtesy of an open-source AI institute that scrapes content and then licenses the data. EleutherAI, partly funded by Canva, relies on fair use provisions in the US to circumvent Australian law. ABC News, Nine's 60 Minutes, the Australian Greens, half a dozen local universities, and myriad government departments have had content hoovered up by EleutherAI. It's the latest example of tech giants gifting themselves new rights over others' IP. It's all fair game, says Microsoft's AI chief.
What you need to know
- Tech giants including Apple, Anthropic, Nvidia, and Salesforce trained AI models on Australian content that was scraped from YouTube without permission by an open-source AI institute, EleutherAI
- Nine’s 60 Minutes, ABC News, the Australian Greens, the NDIS, half a dozen universities, and even the US Embassy in Australia are among the content creators whose IP ended up in EleutherAI's database.
- EleutherAI sponsors include CoreWeave, Hugging Face, Stability AI, Google TRC, Nat Friedman, and Lambda Labs. Australian tech unicorn Canva was an early donor, though it is not listed on the site.
- It's the latest example of Big Tech gifting itself rights to other people's content and using US fair usage provisions to circumvent copyright laws in jurisdictions like Australia
- Alphabet, which helps fund EleutherAI, says the scraping violated YouTube's T&Cs.
- EleutherAI is part of the open-source community, a widespread but unregulated network of developers who share code, often without payment, to hasten tech advances. Its contribution to this network is to suck in information from the open web and convert it into massive data sets that can be used to train AIs.
- EleutherAI says the goal is to democratise AI by making data and tech available to the developer community to accelerate the adoption of emerging technology.
- Australian media companies are still developing their policies and responses. News Corp, for instance, has done a deal with OpenAI, the maker of ChatGPT, that gives the LLM access to some of the media giant's mastheads.
- Nine, meanwhile, confirmed to Mi3 that it is exploring options to ensure it gets 'fair compensation' for its content.
- Big Tech's AI ambitions are hitting a wall as 60 per cent of the world’s publishers have added code or changed their terms and conditions to block their content from being scraped.
- That has made model collapse a real risk - where AI’s advance is halted because the machines cannot learn to be any better.
Big tech firms trained their AI models on Australian content without permission, relying on data provided by an open-source AI institute that uses US fair use provisions to circumvent local copyright laws.
And the organisation behind the massive data scrape, EleutherAI, was partly funded by Australian tech unicorn Canva, according to TechCrunch. Canva declined to comment, and it is not listed as a sponsor on the EleutherAI website. However, a year-old Reddit post from EleutherAI's executive director, Stella Biderman, confirmed the connection.
Apple, Anthropic, Nvidia, and Salesforce all trained their AI on videos taken without permission by EleutherAI, which invokes fair use to try to sidestep copyright law in jurisdictions beyond the US.
The tech firms may not have been aware of the legalities around how the data was collected.
EleutherAI claims on its website to be a non-profit research institute focused on building large-scale artificial intelligence.
As well as Canva, it has raised funds from Hugging Face and Stability AI, former GitHub CEO Nat Friedman, and Lambda Labs.
Thousands of YouTube videos from Australian publishers, broadcasters, and the government have been used by Apple and Nvidia to train their AIs.
Anthropic also used the data that was harvested from subtitles on 173,536 videos in breach of Google-owned YouTube’s usage policies.
ABC News, Nine’s 60 Minutes, multiple federal government departments, the military, major universities, charities and religious groups were targeted.
It’s a live issue for Australian media companies, which are still working through how best to respond.
Some, like News Corp, have struck deals with large language model companies like OpenAI, owner of ChatGPT. Under the terms of that deal, “OpenAI will receive access to current and archived content from News Corp’s major news and information publications, including The Wall Street Journal, Barron’s, MarketWatch, Investor’s Business Daily, FN, and New York Post; The Times, The Sunday Times and The Sun; The Australian, news.com.au, The Daily Telegraph, The Courier Mail, The Advertiser, and Herald Sun; and others. The partnership does not include access to content from any of News Corp’s other businesses.”
A spokesperson for Nine meanwhile told Mi3: “Nine is exploring a number of options to ensure we receive fair compensation for both the historical and ongoing use of our content to train Large Language Models.”
For their part, and as Mi3 reported recently, Big Tech has been rewriting the terms and conditions of its platforms to gift itself new rights over what it can do with other companies' content.
Per an Mi3 report in June: “Tech vendors are rewriting platform usage rulebooks, adjusting terms and conditions – and privacy policies – to give themselves legal cover, and in some cases to gift themselves new rights over customer data and content.”
Hoovered up
An enormous training dataset of 489 million words was siphoned from the platform, and even the US Embassy in Australia and Google Australia were hit.
EleutherAI is part of the open-source community, a widespread but unregulated network of developers who share code, often without payment, to hasten tech advances.
Its contribution to this network is to suck in information from the open web and convert it into massive data sets that can be used to train AIs.
EleutherAI says the goal is to democratise AI by making data and tech available to the developer community to accelerate the adoption of emerging technology.
It calls its largest dataset The Pile, which contains 825GB of data, most of which was collected and made available under the US’s opaque fair use version of copyright law.
EleutherAI’s website says: “The Pile is comprised of 22 different text sources, ranging from original scrapes done for this project to text data made available by the data owners, to third-party scrapes available online.”
It was taken from multiple sources including:
- 17 million documents scraped from URLs posted to Reddit.
- Subtitles from thousands of movies and TV shows including Inside Out 2 and Game of Thrones.
- Wikipedia, Microsoft-owned GitHub, and Y Combinator’s HackerNews.
- Thousands of scientific papers from New York’s Ivy League Cornell University.
- Publicly available medical papers and web development databases.
- Patent documents filed with the US government.
Many of these are controversial and may breach copyright law in Australia, along with every other jurisdiction that does not recognise America’s unique fair use carve-out.
The Pile’s data collection practices have already provoked legal responses. It once contained 197,000 pirated books, until authors launched a court action and the Books3 dataset was removed.
It has also triggered a lawyers' picnic downstream.
More than 150 authors are suing Nvidia, alleging the $3 trillion chipmaker used The Pile to train its AI. Meta has also acknowledged in court filings that it accessed the dataset.
Zoom in
But it’s the YouTube Subtitles dataset that’s now in focus.
YouTube’s terms of use explicitly ban its videos from being scraped. Yet EleutherAI founder Sid Black wrote on GitHub that he created the YouTube Subtitles dataset by using a script to download them.
He did not have the permission of content creators; the code he used remains freely available to download on the web, and the YouTube data remains in The Pile.
Wikipedia says The Pile has “become widely used to train other models”, including those from Microsoft and Meta AI.
My trawl of The Pile unearthed video content taken from myriad Australian outlets - too many to list - but the most prominent included:
- Media companies including ABC News, Nine’s 60 Minutes Australia, The Big Issue, Junkee, SBS, The Feed, The Conversation, and Screen Australia.
- The Australian Government and its agencies, including the Australian House of Representatives, the National Archives, the Financial Security Authority, the Department of Education, the Department of Health and Aged Care, Disability Employment Australia, NDIS Australia, the Australian Public Service Commission, the Australian Greens, and Labor MP for Bruce, Julian Hill.
- Seven top universities, as well as the National Library of Australia, The Australian Museum, the Australian Academy of Science, and The Australian Human Rights Commission.
- Global brands and stars including the BBC, The Wall Street Journal, Stephen Colbert, and Jimmy Kimmel had their videos used to train AI without their permission, as did YouTube megastars MrBeast and PewDiePie.
A joint investigation by Wired and data-driven US publisher Proof News into the scraping practices captured the hostile reactions of YouTube creators, among them David Pakman, host of The David Pakman Show, a politics channel with two million subscribers.
“No one came to me and said we would like to use this,” he said.
If AI companies are paid, Pakman said, he should be compensated for the use of his data. He pointed out that some media companies have recently penned agreements to be paid for the use of their work to train AI.
“This is my livelihood, and I put time, resources, money, and staff time into creating this content,” Pakman said.
Proof News uncovered internal Apple research documents that confirmed it used Pile data for AI training.
It also found more documents confirming Nvidia, Salesforce, and Anthropic did too.
Proof News has given Mi3 permission to quote its article at length. The lead reporter on the story, Annie Gilbertson, shared a quote from Anthropic confirming it uses The Pile for its AI, Claude, while downplaying the significance.
“The Pile includes a very small subset of YouTube subtitles," the spokesperson claimed. “YouTube’s terms cover direct use of its platform, which is distinct from use of The Pile dataset.
“On the point about potential violations of YouTube’s terms of service, we’d have to refer you to The Pile authors,” the spokesperson told Proof News.
Salesforce also confirmed to Proof News that it used The Pile to build an AI model - Caiming Xiong, VP of AI research, said the dataset was “publicly available”.
Hungry dogs v bucket of blood
The Big Tech trillionaires are locked in a struggle to get ahead in the $4 trillion AI race. To win, they need access to all the world’s content.
But they are hitting a wall as 60 per cent of the world’s publishers have added code or changed their terms and conditions to block their content from being scraped.
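In practice, most of that blocking happens at the crawler level: publishers add directives to their site's robots.txt file naming the AI crawlers they want to turn away. A minimal sketch is below - the crawler tokens GPTBot (OpenAI), Google-Extended (Google's AI-training opt-out), and CCBot (Common Crawl, a frequent source of AI training data) are real published user agents, while the exact rules shown are illustrative only:

```
# robots.txt - illustrative example of a publisher blocking
# AI training crawlers while leaving search indexing untouched

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
```

The catch, as the lawsuits above suggest, is that robots.txt is a convention, not an enforcement mechanism: a crawler that chooses to ignore it can still take the content.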
And that has made model collapse a real risk - where AI’s advance is halted because the machines cannot learn to be any better.
Microsoft’s AI chief Mustafa Suleyman, who created his own storm of controversy by describing content on the open web as “freeware”, recently admitted the risk of model collapse.
But that is not silencing investors, who are in over their heads on AI investments and are not seeing the returns they need to be even richer. Recent reversals on AI shares won’t have helped.
That means the rules are there to be broken - to move fast and break things, and probably the law. Examples abound:
- The New York Times reported that YouTube owner Google used videos on the platform to train its models.
- OpenAI risked Google’s wrath by pinching more YouTube videos without permission.
- OpenAI’s flustered CEO-for-five-minutes, Mira Murati, choked on air when The Wall Street Journal asked whether she’d raided YouTube.
As FTC chair Lina Khan has said in the past, Big Tech sees fines as the cost of doing business.
For Australian content creators caught up in a fight between scrapers and platforms, the issue is one of legal jurisdiction.
According to Josephine Johnston, CEO, Australia's Copyright Agency: “The problem for Australian rights holders, like those you have identified, is that the infringing behaviour is likely to have occurred in other jurisdictions and under different laws.
“If the copying took place in the United States, the relevant laws to apply would be US law, including the fair use defence.
Johnston also notes that big tech companies aren’t arguing that fair use applies in Australia. Rather they are arguing that the law should be changed in Australia to introduce fair use.
“The problem is that it is incredibly difficult, impractical, and expensive for Australian creators to enforce their rights in a different jurisdiction. It’s hard for Australian creators to join class actions in the US, for example.”
There is no guarantee that big tech will ultimately win. That’s one of the reasons companies like OpenAI have cut deals with publishers like News Corp.
“It is not clear that the big tech firms will succeed on their fair use defence arguments - which is probably why they are doing deals with major publishers.”
Johnston told Mi3: “Our position is that the Government should intervene to provide a mechanism to ensure that creators are compensated for the unauthorised copying of their work in training AI models offshore. We are currently working on options that could address that issue.”
“For any copying that occurs within Australia, the Copyright Act already provides an efficient and fair framework that supports innovation and protects rightsholders. Developers just have to seek a licence. Fair use is not part of our law.”
Dean Ormston, the CEO of APRA AMCOS, which manages the rights of 119,000 artists across Australia and New Zealand, said:
“APRA AMCOS has long held concerns regarding the lack of transparency that most Generative AI platforms have demonstrated in acknowledging the content which has been scraped and copied in order to create their outputs.
“Creators pour their hearts and souls into their work, investing countless hours refining their craft. Yet, they face the reality of seeing their creations exploited by AI platforms without credit, consent or compensation.
“It is this type of secretive behaviour that de-legitimises this new technology and embeds the trust deficit so many Silicon Valley businesses now face.
“We look to the good corporate players that are working in partnership with creators to ensure there is a share in the profitability of artificial intelligence to those that provided the dataset for these platforms.”
Finally, experience has taught me to follow the money. So if EleutherAI provides its datasets for free, how does it pay its people and cover its bills?