Episode 91

E91: Behind the AI Curtain: Sourcing: A Look at the Data That Powers Generative AI

In our latest episode, hosted by Erin, we dive deep into the complexities of AI training data, exploring its sources, legal ramifications, and evolving landscape. A must-listen for anyone curious about how generative AI models like ChatGPT are trained and the ethical and legal debates they stir.

Here are three key takeaways from the discussion:

Diverse Data Sources: AI platforms are now shifting from relying on web scraping to purchasing licensed data sets, including private archives from major tech players like Meta and Microsoft, ensuring richer and legally compliant training materials.
Synthetic Data Innovation: The creation of synthetic data by AI to train other AI systems shows promise for enhancing privacy and data security, pivotal for sectors like healthcare and finance.
Legal and Ethical Considerations: As AI consumes a vast array of content, issues surrounding copyright infringement, privacy, and contractual breaches come to the forefront, highlighting the importance of adhering to legal standards and ethical practices.

Stay informed about the dynamic world of AI and ensure you're up to speed on the implications of AI developments on privacy, creativity, and legal standards.

Connect with Erin to learn how to use intellectual property to increase your income and impact: hourlytoexit.com/podcast

Erin's LinkedIn Page: https://www.linkedin.com/in/erinaustin/

Hourly to Exit is Sponsored By:

This week’s episode of Hourly to Exit is sponsored by the NDA Navigator. Non-disclosure agreements (NDAs) are the bedrock of protecting your business's confidential information. However, facing a constant stream of NDAs can be overwhelming, especially when time and budget constraints prevent you from seeking full legal review. That's where the NDA Navigator comes to your rescue. Designed specifically for entrepreneurs, consultants, and business owners with corporate clients, the NDA Navigator is your guide to understanding, negotiating, and implementing NDAs. Empower yourself with legal insights and practical tools when you don’t have the time or funds to invest in a full legal review. Get 20% off by using the coupon code “H2E”. You can find it at www.protectyourexpertise.com.

Think Beyond IP YouTube Page: https://www.youtube.com/channel/UCVztXnDYnZ83oIb-EGX9IGA/videos

Music credit: Yes She Can by Tiny Music

A Team Dklutr production

Transcript

Speaker: 00:00:00

Today we're talking about AI again, but more specifically about the

Speaker: 00:00:04

training data sets for generative AI, where it comes from, what some

Speaker: 00:00:11

of the legal issues are, what is it?

Speaker: 00:00:13

When we think about ChatGPT and other AI tools, I know I talk

Speaker: 00:00:17

about ChatGPT all the time.

Speaker: 00:00:20

Frankly, it's the one that I use.

Speaker: 00:00:21

And so it's the one I'm most familiar with, but this applies to

Speaker: 00:00:24

all generative AI platforms.

Speaker: 00:00:26

You hear about the vast amounts of data that they utilize.

Speaker: 00:00:30

And as you can imagine, training data plays a crucial role in the

Speaker: 00:00:34

development and the efficiency and

Speaker: 00:00:36

effectiveness of generative AI platforms, but where does all of that data come from?

Speaker: 00:00:43

And I know you have lots of questions about that because you're worried

Speaker: 00:00:46

about whether it is coming from your website.

Speaker: 00:00:49

So let's start with what is training data?

Speaker: 00:00:52

Training data is the backbone of any machine learning

Speaker: 00:00:56

project, which is what generative AI is.

Speaker: 00:00:59

It consists of large sets of information that are used to teach an algorithm how to

Speaker: 00:01:04

recognize patterns and make predictions.

Speaker: 00:01:08

That's how it is creative, i.e.,

Speaker: 00:01:10

generative.

Speaker: 00:01:11

And so you put in this vast amount of data and it's labeled in certain ways.

Speaker: 00:01:17

I don't know how it does this, but it learns the patterns and then it

Speaker: 00:01:23

can make informed predictions and create new content based on that.

Speaker: 00:01:29

So given the scale of modern AI requirements, the data sets

Speaker: 00:01:34

are absolutely enormous, often encompassing billions of parameters.

Speaker: 00:01:40

And that, of course, will

Speaker: 00:01:41

change depending on the size and complexity of the

Speaker: 00:01:45

model that is being trained.

Speaker: 00:01:47

So the primary sources of training data, or I should say, traditionally,

Speaker: 00:01:52

the sources of training data for platforms like OpenAI were

Speaker: 00:01:57

scraped from the internet for free.

Speaker: 00:02:00

And that was used to train the first generative AI models like ChatGPT, and

Speaker: 00:02:06

they've done a pretty good job, I'd say, of learning to mimic human creativity.

Speaker: 00:02:11

They, of course, believed, and I think they're still sticking to this story, that

Speaker: 00:02:16

it was legal and ethical for them to do so, relying on some prior cases holding that you

Speaker: 00:02:22

can use publicly available information so long as it's transformative,

Speaker: 00:02:28

essentially making a fair use argument.

Speaker: 00:02:31

I'm not going to go into the fair use argument, but that is the basis of

Speaker: 00:02:35

why they thought they could do this.

Speaker: 00:02:37

As you probably know, there have been a number of high-profile

Speaker: 00:02:41

lawsuits about their use.

Speaker: 00:02:43

Those have not been resolved yet.

Speaker: 00:02:46

And so we will see if their reasoning and their defenses hold up.

Speaker: 00:02:51

So, to discuss a few of the ways that they do get training data:

Speaker: 00:02:56

Web scraping, which we've already talked about.

Speaker: 00:02:58

So there would be crawlers that they send out to scour the internet.

Speaker: 00:03:04

It should only be scouring for things that are publicly available,

Speaker: 00:03:08

that are not behind a paywall.

Speaker: 00:03:10

However, you can ask the crawler, I'm assuming, to go behind the

Speaker: 00:03:15

paywall, which obviously would be a breach of the terms and conditions of a site if

Speaker: 00:03:21

you go around their paywall. And also, you know, even if there is no paywall, many

Speaker: 00:03:26

sites will have terms and conditions that say you're not allowed to use crawlers.

Speaker: 00:03:30

If you don't comply with those terms and conditions, then you're also, obviously,

Speaker: 00:03:36

breaching those terms and conditions. As well, when they're

Speaker: 00:03:40

scraping that data, many times, if not always (because, like, we

Speaker: 00:03:44

can't really quite see what's in the black box of that training data),

Speaker: 00:03:47

they're taking off any copyright notices, and it is a violation of the Copyright

Speaker: 00:03:54

Act to take off copyright notices.

Speaker: 00:03:56

So there are a number of issues involved with

Speaker: 00:03:59

web scraping.

Speaker: 00:04:00

That obviously is falling into disfavor.

Speaker: 00:04:03

So what is replacing that? Licensed data sets.

Speaker: 00:04:07

Very large data sets that are licensed from entities that

Speaker: 00:04:11

own large amounts of data.

Speaker: 00:04:14

I read this regarding this new path forward:

Speaker: 00:04:18

There is a rush right now

Speaker: 00:04:20

to go for copyright holders that have private collections of stuff that is

Speaker: 00:04:25

not available to be scraped. So this is from a lawyer who is advising content

Speaker: 00:04:29

owners on deals worth tens of millions of dollars apiece to license archives

Speaker: 00:04:35

of photos, movies, and books for AI

Speaker: 00:04:38

training.

Speaker: 00:04:39

Reuters spoke to more than 30 people with knowledge of AI data deals,

Speaker: 00:04:43

including current and former executives of companies involved, lawyers, and

Speaker: 00:04:47

consultants, to provide the first in-depth exploration of this fledgling market,

Speaker: 00:04:53

detailing the types of content that's being bought, the prices that they're

Speaker: 00:04:57

getting, and any emerging concerns that come from harvesting this type of

Speaker: 00:05:03

data, even if it's licensed, because of the personal data risks that go along

Speaker: 00:05:09

with harvesting large amounts of data where the personal data of the human

Speaker: 00:05:14

that it belongs to is harvested without the knowledge or consent of that person.

Speaker: 00:05:18

Who are these huge licensees?

Speaker: 00:05:20

There's a number of them.

Speaker: 00:05:21

We have tech companies who have been quietly buying content

Speaker: 00:05:27

that is behind locked paywalls and behind login screens from companies

Speaker: 00:05:34

like Instacart, Meta, Microsoft,

Speaker: 00:05:38

X, and Zoom.

Speaker: 00:05:39

And so this might be some long-forgotten chat logs or long-forgotten photos

Speaker: 00:05:46

from old apps that are being licensed.

Speaker: 00:05:50

Tumblr's parent company, Automattic, said last month, and I'm recording

Speaker: 00:05:55

this in April 2024, that

Speaker: 00:05:57

it was sharing content with select AI companies.

Speaker: 00:06:01

And in February, that'd be 2024, Reuters reported Reddit struck a deal with

Speaker: 00:06:06

Google to make its content available for training the latter's AI models.

Speaker: 00:06:11

Of course there's going to be some customer blowback.

Speaker: 00:06:16

While this type of licensed content is accelerating, there will probably be

Speaker: 00:06:21

some amendments still to it because, yes, Meta goes in and it changes its terms

Speaker: 00:06:26

of use, but does anybody read the terms of use of Meta or of X or of Zoom, even?

Speaker: 00:06:33

So

Speaker: 00:06:33

they're going in and changing their terms and conditions without anyone,

Speaker: 00:06:36

kind of, without it saying in bright red letters, "Hey, we're going to be

Speaker: 00:06:39

selling your data now for AI training."

Speaker: 00:06:42

We'll see what comes from that.

Speaker: 00:06:44

All right.

Speaker: 00:06:44

Then there are archives that are owned by entities such as the Associated Press

Speaker: 00:06:50

and Getty Images, or, I should say, aggregated.

Speaker: 00:06:54

They don't own all those images.

Speaker: 00:06:55

And so you can go to them and license their entire archives.

Speaker: 00:07:00

And that provides a great amount of data for your data sets.

Speaker: 00:07:04

Universities and research institutions are also owners or controllers of

Speaker: 00:07:10

vast amounts of data that can be licensed all in one fell swoop.

Speaker: 00:07:15

And then there are some nonprofit organizations that want to encourage

Speaker: 00:07:20

the use of AI. Just as we've had other types of nonprofits in the

Speaker: 00:07:24

past, such as Creative Commons, who want to help people get more

Speaker: 00:07:28

access to copyrightable materials,

Speaker: 00:07:31

now there are some who feel the same way about

Speaker: 00:07:34

making AI data more accessible.

Speaker: 00:07:36

For instance, the nonprofit Allen Institute for AI released a data set

Speaker: 00:07:41

of 3 trillion tokens from a diverse mix of web content, academic publications,

Speaker: 00:07:47

code, books, and encyclopedic materials.

Speaker: 00:07:50

Now, another source is synthetic data, and this is a new one to me, but it

Speaker: 00:07:56

really points to how powerful AI can be.

Speaker: 00:08:00

So synthetic data generation means that you use one generative AI

Speaker: 00:08:05

tool to create synthetic data.

Speaker: 00:08:08

And then you use that data, that synthetic data, to train another

Speaker: 00:08:12

generative AI tool.

Speaker: 00:08:14

So let's say you're developing a customer service AI model.

Speaker: 00:08:18

You could use another generative AI tool to create fictional customers

Speaker: 00:08:24

and situations and interactions.

Speaker: 00:08:27

And then you can use those fictional customer situations and

Speaker: 00:08:31

interactions as the training data for your public facing AI model.

Speaker: 00:08:36

So that way you're not at risk of exposing private information, as you would be

Speaker: 00:08:42

if you were to directly put your customer information into your AI

Speaker: 00:08:46

tool. First, you kind of anonymize it using one generative AI tool.

Speaker: 00:08:51

And it's not enough just to de-identify it, because there could be customer

Speaker: 00:08:55

situations that are so specific that you could only point to one person.

Speaker: 00:08:59

It's possible.

Speaker: 00:09:00

So you also have to make up perhaps new situations, new

Speaker: 00:09:03

backgrounds, things like that.

Speaker: 00:09:04

But then you can use that as your fictional customer for your AI-governed

Speaker: 00:09:09

customer service model, and use that to train it to help provide

Speaker: 00:09:14

customer service on an AI basis.
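
The synthetic data idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a simple random template generator stands in for the generative AI tool, and every name in it (`FIRST_NAMES`, `ISSUES`, `RESOLUTIONS`, `make_synthetic_interactions`) is hypothetical and not taken from any real platform. The point it shows is that the training records are invented from scratch rather than copied from real customer files.

```python
import random

# Hypothetical building blocks; none of these come from real customer records.
FIRST_NAMES = ["Avery", "Jordan", "Riley", "Morgan"]
ISSUES = ["a late delivery", "a double charge", "a login problem"]
RESOLUTIONS = ["issued a refund", "reset the account", "escalated to a specialist"]


def make_synthetic_interactions(count, seed=0):
    """Return `count` fictional customer-service records as dictionaries."""
    rng = random.Random(seed)  # seeded so the same call repeats the same data
    records = []
    for i in range(count):
        records.append({
            "customer_id": f"synthetic-{i:04d}",  # clearly marked as fictional
            "name": rng.choice(FIRST_NAMES),
            "issue": rng.choice(ISSUES),
            "resolution": rng.choice(RESOLUTIONS),
        })
    return records


if __name__ == "__main__":
    # Print a few fictional records that could feed a training set.
    for record in make_synthetic_interactions(3):
        print(record)
```

In practice the generator would likely be a generative model rather than fixed lists, but the privacy benefit is the same: nothing in the output traces back to a real person.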

Speaker: 00:09:17

So we will see this with hospitals and banks as well

Speaker: 00:09:19

that have sensitive information.

Speaker: 00:09:21

Obviously they cannot use their customers' sensitive information

Speaker: 00:09:25

as training data, but they do want to have access to what is really kind of part

Speaker: 00:09:30

of doing business these days of having some sort of AI-based training systems.

Speaker: 00:09:36

And then, of course, last but not least, is the data

Speaker: 00:09:40

that comes from you and me.

Speaker: 00:09:42

So, what

Speaker: 00:09:44

does that mean? When we are using generative AI platforms, when

Speaker: 00:09:49

we input our prompts, if we put in something that we've written and ask

Speaker: 00:09:56

it to create a summary of it, if we put in a transcript from something

Speaker: 00:10:01

and ask it to create show notes, everything that we put into that

Speaker: 00:10:06

has the potential to become training data for that platform.

Speaker: 00:10:11

And so if we are doing that, we need to be aware of the

Speaker: 00:10:15

Speaker: 00:10:18

Most of them will tell you that it can be part of the training data.

Speaker: 00:10:22

And it might also end up being an output for someone who puts in a query,

Speaker: 00:10:28

a prompt that what you put in is a perfect answer for. You just don't know.

Speaker: 00:10:32

And so we need to be careful about what we are putting in as prompts

Speaker: 00:10:37

or as the input for whatever AI platform you're using.

Speaker: 00:10:41

Make sure you are aware of their terms and conditions.

Speaker: 00:10:44

Do not use any confidential information in there.

Speaker: 00:10:48

whether it's yours or your clients'.

Speaker: 00:10:50

So make sure that you're really aware of that.
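
One way to act on that advice is to scrub obvious identifiers from text before pasting it into a prompt. The Python sketch below is a minimal example, assuming that email addresses and US-style phone numbers are the identifiers of concern; the function name `redact` and both patterns are illustrative assumptions, not part of any AI platform. Pattern matching like this catches only the obvious cases, so it does not make confidential client material safe to share.

```python
import re

# Assumed patterns for illustration: simple emails and 10-digit US-style phones.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_PATTERN = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact(text):
    """Replace matched emails and phone numbers with placeholder tags."""
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    return PHONE_PATTERN.sub("[PHONE]", text)


if __name__ == "__main__":
    draft = "Email jane@example.com or call 555-123-4567."
    print(redact(draft))
```

A highly specific situation can still point to one person even with names and numbers removed, so a step like this supplements reading the platform's terms and conditions and does not replace it.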

Speaker: 00:10:54

Some platforms, and I'm thinking in particular of DocuSign, do use AI.

Speaker: 00:11:00

And obviously, when you're using DocuSign, there are legal agreements that

Speaker: 00:11:05

are going in there that have identifiable information of the parties, commercial

Speaker: 00:11:09

terms, and things like that are in there.

Speaker: 00:11:12

And so DocuSign said that they strip out any identifying data from that, so

Speaker: 00:11:18

that they do use the agreements for training data, but that they do strip

Speaker: 00:11:23

out identifying information from it.

Speaker: 00:11:25

So things to be aware of.

Speaker: 00:11:27

In summary, the legal issues, I think we've covered, but just to sum them

Speaker: 00:11:31

up, there are the copyright issues of putting data into the database.

Speaker: 00:11:37

I believe it was last week, I talked about the

Speaker: 00:11:41

copyrightability issues of the output.

Speaker: 00:11:43

So now I'm talking about the copyright issues with the input,

Speaker: 00:11:46

whether or not the AI platform or you have the right to add

Speaker: 00:11:52

information to the training data set, whether or not that

Speaker: 00:11:56

is a copyright infringement.

Speaker: 00:11:58

Is that fair use of that data?

Speaker: 00:12:01

One of the issues on the copyright side is, sometimes the output will literally

Speaker: 00:12:06

be an exact replica of what went in and it's hard to make a fair use argument when

Speaker: 00:12:13

a verbatim paragraph, as in the case of the New York Times, which is the basis

Speaker: 00:12:18

of their lawsuit against OpenAI, comes out as the output.

Speaker: 00:12:23

Where's the fair use there?

Speaker: 00:12:25

Same with Getty Images.

Speaker: 00:12:27

They've had exact replicas of their images come out of an AI platform.

Speaker: 00:12:33

So that's obviously an issue.

Speaker: 00:12:35

In addition to copyright issues, we have privacy concerns.

Speaker: 00:12:38

Maybe real images of people; there are instances of real images of people

Speaker: 00:12:43

coming out, most certainly private photos that are from somebody's old

Speaker: 00:12:48

Facebook or old blog posts, old journals.

Speaker: 00:12:51

Think about what original blogs were: kind of like journals, right?

Speaker: 00:12:55

And people would use them as a journal, and they're probably

Speaker: 00:12:58

hanging around somewhere.

Speaker: 00:12:59

Think about, I mean, I'm thinking about a blog, I guess it was

Speaker: 00:13:02

at the time, that I started.

Speaker: 00:13:04

I mean, it didn't last very long and no one ever saw it.

Speaker: 00:13:07

But it's still somewhere.

Speaker: 00:13:08

Like, I don't know if I could find it today, but it's still out there and

Speaker: 00:13:11

somebody's web crawler could find it.

Speaker: 00:13:13

I don't think they'd be very interested, but it's there.

Speaker: 00:13:16

And so we do have the privacy concerns.

Speaker: 00:13:19

And then we have the contract breach issue: if we are using, say, a client's

Speaker: 00:13:25

confidential information, we're entering it into an AI chatbot,

Speaker: 00:13:30

and it has the potential to be shared,

Speaker: 00:13:34

we are breaching our contractual obligations to our clients

Speaker: 00:13:38

if we're doing that without permission, even if the contract is silent on the specifics

Speaker: 00:13:44

of whether or not you can use AI, and some contracts are being explicit about it.

Speaker: 00:13:49

But even if it's silent,

Speaker: 00:13:50

and you are obligated, when you use the client's information, to keep it

Speaker: 00:13:54

confidential and only share it under very specific circumstances, putting

Speaker: 00:13:58

it into an AI platform is probably not one of those permitted uses.

Speaker: 00:14:03

So you do have issues there as well.

Speaker: 00:14:06

All right.

Speaker: 00:14:06

So that is what I wanted to cover today regarding AI training data.

Speaker: 00:14:11

As you know, this is a fast-moving

Speaker: 00:14:14

matter. You know, who knows what will come next week?

Speaker: 00:14:17

I'll try to keep you up to date, but always feel free to connect with me and

Speaker: 00:14:22

let me know what your questions are.

Speaker: 00:14:23

I'm always happy to answer them.

Speaker: 00:14:26

Thanks again.

Speaker: 00:14:26

And don't forget, IP is fuel.


About your host

Erin Austin
