News Analysis 27 Jun 2024 - 12 min read
AMI CPD: 0.5

Who’s scraping your data to train LLMs? Tech vendors rewrite T&Cs, privacy policies to claim new rights over customer data in AI arms race – cue rage

By Andrew Birmingham - Martech | Ecom | CX Editor

Tech vendors are rewriting platform usage rulebooks, adjusting terms and conditions – and privacy policies – to give themselves legal cover, and in some cases to gift themselves new rights over customer data and content. But even when they are trying to do the right thing, their preference for legal compliance over messaging clarity is leaving customers angry and confused. Meanwhile, a new generation of digital giants has bastardised Mark Zuckerberg's famous dictum to "move fast and break things." While the Zuck was talking about internal design processes, the generative AI newcomers are breaking long-held conventions and rules – and arguably laws – by stealing IP, according to long-suffering publishers. Mi3 asked some martech majors specifically whether they use customer data to train large language models or other AI. The answers were not always binary.

What you need to know:

  • Vendors like Microsoft, Google, Salesforce, Adobe, and Meta have been rewriting their terms and conditions, and privacy policies for the generative AI era.
  • But a culture of communication that favours legal compliance over message clarity, married to a customer mindset that is increasingly sceptical and distrustful about how digital giants use data, is leading to miscommunication and angry outbursts on social media and specialist message boards.
  • The reality is that most vendors Mi3 queried either do not use customer data to train large language models, or if they do, they explicitly require consent.
  • But not all vendors.
  • Some are simply opting their clients (and their clients' data) into their model training and not making it clear or easy to opt out.
  • Worse, some of the newer emerging generative AI giants are ignoring clear customer instructions not to scrape their content, bypassing conventions like robots.txt.

To be clear, Slack is not scanning message content to train AI models.

Slack spokesperson

Advertising works. This morning Mi3 watched as a recently alighted passenger at Sydney's Wynyard station stood transfixed on the platform. She was focused on a Salesforce advertisement playing on the big screen across the train tracks, even as the heaving sea of commuters washed around her.

The ad promised that “Salesforce AI never steals or shares your customer data” and at the end, a cowboy threw some data-thieving hombres off the back of a train. The message was clear enough: Your data is safe with us.

True story. But what feels like clarity in a well-crafted, highly-produced piece of creative still leaves plenty of room for ambiguity as marketing technology vendors like Salesforce, Adobe, and others are now learning.

Salesforce, for instance, earned the ire of its customers last month when reports started circulating widely on social media that it was using customer data from its Slack messaging platform to train AI. Worse, it was automatically opting customers in.

1997 called and wants its workflow back

The backlash started when an angry user posted to Hacker News, a popular developer website, after learning that the only way to opt out of having their data used for model training purposes was to email the company. Animus spread quickly to social media, and soon social media did what social media does best – pushed its users to ever edgier conclusions.

Pretty soon a story was circulating – incorrectly – that all those amusing, pithy, and occasionally snarky Slack messages we all love to share with our colleagues were part of the data set being used for training. Part of the problem may be that the Salesforce mothership takes a different approach to Slack, which Salesforce agreed to acquire in December 2020.

Salesforce, which has put AI at the centre of its pitch since the launch of Einstein in 2016, says it will only use customer data to train models if the customer consents. And specifically on generative AI, a spokesperson stressed: "We have a zero data retention policy with our LLM providers, which means that data is never stored or used for model training — or seen by a human."

Slack also uses customer data to train models, but customers have to opt out … by email.

According to a Slack statement sent to Mi3, "To be clear, Slack is not scanning message content to train AI models. We published a blog post to clarify our practices and policies regarding customer data."

The Slack spokesperson also clarified the difference between Slack’s generative AI and other machine learning features:

“For Slack AI, which leverages third-party LLMs for generative AI in Slack, no customer data is used to train those third-party LLMs and Slack does not train any sort of generative AI on customer data.

“Separately, Slack has used machine learning for other intelligent features (like search result relevance, ranking, etc.) since 2017, which is powered by de-identified, aggregate user behaviour data. These practices are industry standard, and those machine learning models do not access original message content in DMs, private channels, or public channels to make these suggestions.”

As tech media website TechCrunch noted, the offence felt by Slack customers may be new, but the terms and conditions are not. They have been applicable since September last year.

The story, however, is emblematic of the problems marketing technology and other vendors are getting themselves into as they try to rewrite the usage rules for the ChatGPT era. Problems arise when they fail to enunciate their policies clearly, or when key information is hard to find online.

Catastrophize me

Adobe is another martech leader to endure time on the rack. It has spent recent weeks hosing down a wildfire of outrage amongst its customers in the creative community, caused by changes to its terms and conditions and by its failure to clearly articulate the meaning of those changes.

In this case, the issue was Adobe updating its terms and conditions to reflect how it uses data and the circumstances under which it might access that data.

Once again customers took to social media to vent. Some said they were angry that software they had already paid for wouldn't work unless they clicked "agree" to the policy updates (which frankly does stretch the definition of consent), while others expressed a sense of betrayal. A lawyer complained that Adobe wanted access to privileged information that was protected by client-attorney confidentiality.

It didn’t help that Adobe seemed to be updating the T&Cs on the fly as the crescendo of voices rose.

As to its actual policies, Adobe does not use customer data to train its gen AI models. Recent updates to its terms and conditions clarified that Adobe's generative AI models, such as Firefly, are trained on datasets consisting of licensed content from Adobe Stock and public domain content where copyright has expired. Additionally, Adobe maintains that it does not assume ownership of customers' work and only accesses user content for purposes such as delivering cloud-based features, creating thumbnails, or enforcing its terms against prohibited content – this last purpose is the issue that seems to have triggered the flare-up on social media.

The problems Salesforce and Adobe have encountered reflect a growing trend around personal and corporate data.

Do the right thing?

Customers simply don't trust the tech sector to do the right thing, mainly because they increasingly don't trust anyone to do the right thing.

John Bevitt, the managing director of Honeycomb Strategy, has a clear sense of where customer scepticism is coming from.

In his recently released Brands beyond Breaches report he noted, “Customers, now more than ever, seek assurance that their data doesn't just fuel profits but is respected and protected with the utmost integrity – and with a constant stream of data breaches being announced, it’s no surprise that distrust is on the rise.”

His report found that distrust pervades every single industry, with media companies (65 per cent), search engines (58 per cent), and market research firms (50 per cent) rounding out the top five list. Not far behind are ecommerce companies (49 per cent), industry trade publications (48 per cent), online services (48 per cent), health and fitness companies (48 per cent) and technology brands (45 per cent).

Part of the problem is a tech industry culture that prioritises legal compliance over clarity and transparency.

Mi3 asked a range of vendors: "Do you use customer data to train AI models and, if so, do you allow customers to opt out?"

These are both binary propositions, but we rarely enjoyed the precision of yes or no answers.

Another example is Sitecore, a tier one martech and digital experience provider. It says the use of Gen AI within its products is optional and that it has designed its Gen AI features to be "distinct and discernible within our cloud products, giving you the freedom to choose whether to use them or not."

A spokesperson gave a very precise answer about whether customer data generated by input prompts and the subsequent output was used to train models.

“No, neither Sitecore nor its third-party AI model providers will use these inputs/outputs for their own purposes, meaning we don’t use prompts or results to train models. For additional comfort to you, Sitecore is committed to ensuring that our third-party providers adhere to this same policy.”

Beyond that, its third-party providers “may only use your customer data to provide its services to you and will process the customer data temporarily for such purpose.”

But even this answer still allows for ambiguity as it only relates to the inputs and outputs of the generative AI capabilities.

It doesn’t seem to preclude using other customer data to train models. We have asked for additional clarification.

Clarity is not difficult

It shouldn’t be this hard for vendors to answer simple questions about how data is used. 

Both Pegasystems, a real-time interaction management (and business process management) vendor, and Qualtrics, an experience management platform, have demonstrated and released material improvements in capability fuelled by generative AI at their international user conferences this year.

In both instances, Mi3 asked if customer data was used to train models.

Alan Trefler, Pega's CEO, said not only does his company not use customer data to train LLMs, it doesn’t train LLMs at all.

Peter van der Putten, the director of Pega's AI Lab, further explained that the firm made a deliberate choice to avoid the training game. It's too difficult and costly, he told Mi3. Instead, Pega wanted to see how much juice it could extract without training models. Its recent Socrates and Blueprint initiatives are examples of the transformative power of LLMs, and of what can be achieved without using customer data.

Meanwhile, Qualtrics launched a series of new AI-powered capabilities at its conference to help organisations maximise research investments and deliver insights faster. CEO Zig Serafin was clear about the fact that it uses customer data to train models but said Qualtrics provides a simple and clear way to opt out.

It's complicated.

Aravind Srinivas, the CEO of Perplexity

Thar be llamas

The providers of large language models themselves – Google, Meta and OpenAI among them – all use publicly available data on the internet to train their models. And all have been updating their T&Cs to gift themselves new rights, and to clarify the limits of those rights, although OpenAI has reportedly become a little more reticent to identify sources as the legal challenges have started adding up.

In addition, the LLM providers are doing direct deals with other platforms to access data. And as Scientific American noted late last year, there is very little you can do to stop them.

Here's a cheat sheet to current approaches, as best we can determine from the terms and conditions on their sites.

Google

Google’s terms and conditions make it clear it uses information posted online to train AI models. That of course includes data on sites that Google does not own or manage. It updated its terms and conditions to allow it to do this in July last year. Essentially that update extended the existing T&Cs to include new AI services.

"We may collect information that's publicly available online or from other public sources to help train Google's AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities. Or, if your business's information appears on a website, we may index and display it on Google services."

OpenAI

According to Edureka, OpenAI’s ChatGPT “was trained on large collections of text data, such as books, articles, and web pages. OpenAI used a dataset called the Common Crawl, which is a publicly available corpus of web pages. The Common Crawl dataset includes billions of web pages and is one of the largest text datasets available.”

In March last year, it experienced the same kind of blowback Salesforce and Adobe have recently encountered when it emerged that it was using customer data submitted through its APIs for model training (under the umbrella of "service improvements"). At the time, then-chairman Greg Brockman acknowledged it had changed that policy in response to legal challenges and customer feedback.

OpenAI also updated its terms and conditions and privacy policy this year and made some significant changes. Users can now disable chat history, preventing conversations from being used to train OpenAI's models. This setting can be toggled in ChatGPT’s settings and ensures that while the chat history is off, conversations will be retained for 30 days solely for abuse monitoring before being permanently deleted.

OpenAI is also introducing a ChatGPT Business subscription, which will follow the same data usage policies as its API. This means that data from ChatGPT Business users will not be used to train models by default. The firm says this offers more control and privacy for professional and enterprise users.

Microsoft

Microsoft's recent updates to its privacy policies and terms of service indicate that customer data from generative AI services, including Azure OpenAI and Copilot, is not used to train or improve the base models unless customers explicitly consent.

For Azure OpenAI services, the prompts and responses are not stored or used for training purposes unless customers explicitly upload their own data for fine-tuning models. Even in this case, the data remains under the customer's control, stored within the same Azure region, and is not used to enhance Microsoft's or third-party base models. Microsoft says it maintains strict measures to ensure that customer data remains secure and private, following compliance with privacy laws like GDPR and using advanced encryption methods.

Similarly, for Copilot services with commercial data protection, Microsoft does not retain user prompts or responses beyond a short caching period for runtime purposes. Once the session ends, the data is discarded and not used for model training.

Meta

According to Meta, "For publicly available online information, we filtered the dataset to exclude certain websites that commonly share personal information."

It does use publicly shared posts from Instagram and Facebook – including photos and text – to train the generative AI models. However, it also stresses that it did not use people's private posts for training.

“We also do not use the content of your private messages with friends and family to train our AIs. We may use the data from your use of AI stickers, such as your searches for a sticker to use in a chat, to improve our AI sticker models.”

European users have additional protections due to GDPR. Meta updated its privacy policy this year to allow the use of customer data to train AI models, relying on the legal basis of "legitimate interest" to use information shared publicly on its platforms. This update applies to users in Europe, who are given the option to opt out of this data usage, a requirement under European privacy laws like GDPR.

For users outside Europe, there has been no indication of similar opt-out options being provided. Meta's approach has been to comply with European regulations while emphasising transparency and user control within those regions.

Adios, robot guardian

Recently a new issue has arisen.

Organisations have long been able to opt out of having their sites scraped by legitimate bots by deploying robots.txt, a file that lets them specify which parts of their site can be scraped and which cannot. Google, the world's dominant search provider, has long honoured this convention.
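
For readers unfamiliar with the mechanism, robots.txt is simply a plain-text file served from the root of a site that tells crawlers, identified by their user-agent tokens, which paths they may fetch. As a rough illustration only – the domain is a placeholder and the AI-crawler names below are the tokens the operators themselves have published, which can change over time – a publisher wanting to block AI-training crawlers while still allowing ordinary search indexing might serve something like this:

    # robots.txt – served at https://example.com/robots.txt (placeholder domain)
    # User-agent tokens below are those published by the respective crawler operators
    User-agent: GPTBot            # OpenAI's web crawler
    Disallow: /

    User-agent: ClaudeBot         # Anthropic's web crawler
    Disallow: /

    User-agent: PerplexityBot     # Perplexity's web crawler
    Disallow: /

    User-agent: Google-Extended   # Google's token for AI-training use of crawled content
    Disallow: /

    User-agent: *                 # everyone else, e.g. ordinary search indexing
    Allow: /

The catch is that compliance is entirely voluntary: the file is a request, not an access control, so a crawler that chooses to ignore it faces no technical barrier.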

Some of the newer players in the emerging Gen AI market do not honour it, according to a plethora of reports suggesting that key LLM providers have arbitrarily decided to scorn the long-standing convention of respecting robots.txt.

There is after all no legal requirement to do so.

A Reuters report referencing a letter from Tollbit, a content licensing firm, to publishers suggested multiple AI vendors were ignoring the robots.txt protocol. Reuters didn't name the vendors, but they were subsequently identified in other media reports as OpenAI (ChatGPT) and Anthropic (Claude), two of the giants in the emerging LLM world. Both have issued denials.

Perplexity, an AI search engine, is also in the frame, with Wired magazine saying its own analysis confirmed that Wired's content was hoovered up by Perplexity when robots.txt should have made that impossible.

Aravind Srinivas, the CEO of Perplexity, denied the suggestion in an interview with Fast Company. He attributed Wired's experience to a third-party crawler, but when pushed on whether he had asked the third party to stop, he demurred, saying, "It's complicated."

There's a reason this all sounds familiar. Meta's CEO Mark Zuckerberg once famously espoused a culture of "move fast and break things." Zuckerberg was talking about internal design and management processes, but Napster had already applied the idea to music IP – and the thing it broke was the law.

Today's emerging Gen AI leaders are also going for broke. 

Anything that gets in the way is fair game even if it means fairness itself goes out the window.

