Episode 91
E91: Behind the AI Curtain: Sourcing the Data That Powers Generative AI
In our latest episode, hosted by Erin, we dive deep into the complexities of AI training data, exploring its sources, legal ramifications, and evolving landscape. A must-listen for anyone curious about how generative AI models like ChatGPT are trained and the ethical and legal debates they stir.
Here are three key takeaways from the discussion:
- Diverse Data Sources: AI platforms are shifting from relying on web scraping to purchasing licensed data sets, including private archives from major tech players like Meta and Microsoft, in pursuit of richer and more legally defensible training materials.
- Synthetic Data Innovation: The creation of synthetic data by AI to train other AI systems shows promise for enhancing privacy and data security, pivotal for sectors like healthcare and finance.
- Legal and Ethical Considerations: As AI consumes a vast array of content, issues surrounding copyright infringement, privacy, and contractual breaches come to the forefront, highlighting the importance of adhering to legal standards and ethical practices.
Stay informed about the dynamic world of AI and ensure you're up to speed on the implications of AI developments on privacy, creativity, and legal standards.
Connect with Erin to learn how to use intellectual property to increase your income and impact. hourlytoexit.com/podcast.
Erin's LinkedIn Page: https://www.linkedin.com/in/erinaustin/
Hourly to Exit is Sponsored By:
This week’s episode of Hourly to Exit is sponsored by the NDA Navigator. Non-disclosure agreements (NDAs) are the bedrock of protecting your business's confidential information. However, facing a constant stream of NDAs can be overwhelming, especially when time and budget constraints prevent you from seeking full legal review. That's where the NDA Navigator comes to your rescue. Designed specifically for entrepreneurs, consultants, and business owners with corporate clients, the NDA Navigator is your guide to understanding, negotiating, and implementing NDAs. Empower yourself with legal insights and practical tools when you don’t have the time or funds to invest in a full legal review. Get 20% off by using the coupon code “H2E”. You can find it at www.protectyourexpertise.com.
Think Beyond IP YouTube Page: https://www.youtube.com/channel/UCVztXnDYnZ83oIb-EGX9IGA/videos
Music credit: Yes She Can by Tiny Music
A Team Dklutr production
Transcript
today we're talking about AI again, but more specifically about the training data sets for generative AI: where the data comes from, what it is, and what some of the legal issues are.
Speaker:When we think about ChatGPT and other AI tools, I know I talk about ChatGPT all the time because, frankly, it's the one that I use and so the one I'm most familiar with, but this applies to all generative AI platforms.
Speaker:You hear about the vast amounts of data that they utilize.
Speaker:And as you can imagine, training data plays a crucial role in the development, efficiency, and effectiveness of generative AI platforms, but where does all of that data come from?
Speaker:And I know you have lots of questions about that, because you're worried that it's coming from your website.
Speaker:So let's start with what is training data?
Speaker:Training data is the backbone of any machine learning project, which is what generative AI is.
Speaker:It consists of large sets of information used to teach an algorithm how to recognize patterns and make predictions.
Speaker:That's how it is creative, i.e. generative.
Speaker:And so you put in this vast amount of data and it's labeled in certain ways.
Speaker:I don't know how it does this, but it learns the patterns and then it
Speaker:can make informed predictions and create new content based on that.
Speaker:So given the scale of modern AI requirements, the data sets are absolutely enormous, often encompassing billions of words, and that, of course, will change depending on the size and complexity of the model that is being trained.
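To make "learning patterns from training data" a little more concrete, here is a toy sketch (my own illustration, not how any production model is built): a tiny bigram model counts which word tends to follow which in some example text, then reuses those counts to generate new text. Real generative AI works at a vastly larger scale with far more sophisticated models, but the underlying idea of learning statistical patterns from examples is the same.

```python
# Toy illustration only: a bigram model learns which word tends to follow
# which in the training text, then "generates" new text from those patterns.
import random
from collections import defaultdict

training_text = (
    "intellectual property is an asset "
    "intellectual property can be licensed "
    "licensed data can train models"
)

# Learn the pattern: for each word, record the words observed right after it.
follows = defaultdict(list)
words = training_text.split()
for current_word, next_word in zip(words, words[1:]):
    follows[current_word].append(next_word)

# Generate: start from a word and repeatedly predict a plausible next word.
word = "intellectual"
generated = [word]
for _ in range(6):
    options = follows.get(word)
    if not options:
        break
    word = random.choice(options)
    generated.append(word)

print(" ".join(generated))  # prints a short phrase stitched from learned patterns
```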
Speaker:So the primary sources of training data, or I should say, traditionally, the source of training data for platforms like OpenAI, was data scraped from the internet for free.
Speaker:That was used to train the first generative AI models like ChatGPT, and they've done a pretty good job, I'd say, of learning to mimic human creativity.
Speaker:They, of course, thought, and believe, and I think they're still sticking to this story, that it was legal and ethical for them to do so, relying on some prior cases holding that you can use publicly available information so long as your use is transformative, essentially making a fair use argument.
Speaker:I'm not going to go into the fair use argument, but that is the basis of why they thought they could do this.
Speaker:As you probably know, there have been a number of high-profile lawsuits about their use.
Speaker:Those cases have not been resolved yet, so we will see if their reasoning and their defenses hold up.
Speaker:So, let's discuss a few of the ways that they do get training data.
Speaker:First, web scraping, which I've already mentioned: crawlers are sent out to scour the internet.
Speaker:A crawler should only be scouring for things that are publicly available, that are not behind a paywall.
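As a side note on what "well-behaved" crawling looks like mechanically, here is a minimal sketch using Python's standard library robots.txt parser. The domain and user-agent name are placeholders I made up, and checking robots.txt says nothing about a site's separate terms of service or paywalls, which still have to be reviewed on their own.

```python
# Minimal sketch: before fetching a page, a polite crawler checks the site's
# robots.txt rules. (Placeholder URLs; this does not cover paywalls or a
# site's terms and conditions, which are separate questions.)
from urllib import robotparser

SITE = "https://www.example.com"              # placeholder domain
PAGE = f"{SITE}/articles/some-public-post"    # placeholder page to crawl

rules = robotparser.RobotFileParser()
rules.set_url(f"{SITE}/robots.txt")
rules.read()  # fetch and parse the site's crawler rules

if rules.can_fetch("ExampleResearchBot", PAGE):
    print("robots.txt allows this page for ExampleResearchBot")
else:
    print("robots.txt disallows this page; a polite crawler skips it")
```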
Speaker:However, you can ask the crawler, I'm assuming, to go behind a paywall, which obviously would be a breach of a site's terms and conditions if you go around their paywall.
Speaker:And even if there is no paywall, many sites will have terms and conditions that say you're not allowed to use crawlers.
Speaker:If you don't comply with those terms and conditions, then you're obviously breaching them as well.
Speaker:On top of that, when they're scraping that data, many times, if not always (we can't really see what's in the black box of that training data), they're taking off any copyright notices, and it is a violation of the Copyright Act to remove copyright notices.
Speaker:So there are a number of issues involved with web scraping, and that approach obviously is falling into disfavor.
Speaker:So what is replacing it? Licensed data sets: very large data sets that are licensed from entities that own large amounts of data.
Speaker:I read this regarding the new path forward: "There is a rush right now to go for copyright holders that have private collections of stuff that is not available to be scraped." That is from a lawyer who is advising content owners on deals worth tens of millions of dollars apiece to license archives of photos, movies, and books for AI training.
Speaker:Reuters spoke to more than 30 people with knowledge of AI data deals, including current and former executives of the companies involved, plus lawyers and consultants, to provide the first in-depth exploration of this fledgling market, detailing the types of content being bought, the prices being paid, and the emerging concerns that come from harvesting this type of data even when it's licensed, because of the personal data risks: the personal data of the humans it belongs to is harvested without their knowledge or consent.
Speaker:Who are these huge licensees? There are a number of them.
Speaker:We have tech companies that have been quietly buying content that sits behind locked paywalls and login screens from companies like Instacart, Meta, Microsoft, X, and Zoom.
Speaker:So this might be long-forgotten chat logs or long-forgotten photos from old apps that are being licensed.
Speaker:Tumblr's parent company, Automattic, said last month, and I'm recording this in April 2024, that it was sharing content with select AI companies.
Speaker:And in February, that'd be 2024, Reuters reported Reddit struck a deal with
Speaker:Google to make its content available for training the latter's AI models.
Speaker:Of course, there's going to be some customer blowback.
Speaker:While this type of content licensing is accelerating, there will probably still be some amendments to it, because, yes, Meta goes in and changes its terms of use, but does anybody actually read the terms of use of Meta, or of X, or even of Zoom?
Speaker:They're going in and changing their terms and conditions without saying, in bright red letters, "Hey, we're going to be selling your data now for AI training."
Speaker:We'll see what comes from that.
Speaker:All right.
Speaker:Then there are archives, such as those owned by the Associated Press or, in Getty Images' case, aggregated; they don't own all of those images.
Speaker:You can go to them and license their entire archives, and that provides a great amount of data for your data sets.
Speaker:Universities and research institutions are also owners or controllers of
Speaker:vast amounts of data that can be licensed all in one fell swoop.
Speaker:And then there are some nonprofit organizations that want to encourage the use of AI, just as we've had other types of nonprofits in the past, such as Creative Commons, that want to help people get more access to copyrightable materials.
Speaker:Now there are some who feel the same way about making AI training data more accessible.
Speaker:For instance, the nonprofit Allen Institute for AI released a data set of 3 trillion tokens drawn from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
Speaker:Now, another source is synthetic data. This is a new one to me, but it really points to how powerful AI can be.
Speaker:Synthetic data generation means that you use one generative AI tool to create synthetic data, and then you use that synthetic data to train another generative AI tool.
Speaker:So let's say you're developing a customer service AI model.
Speaker:You could use another generative AI tool to create fictional customers
Speaker:and situations and interactions.
Speaker:And then you can use those fictional customer situations and
Speaker:interactions as the training data for your public facing AI model.
Speaker:That way you're not at risk of exposing private information: instead of directly putting your customer information into your AI tool, you first, in effect, anonymize it using one generative AI tool.
Speaker:And it's not enough to just de-identify it, because there could be customer situations that are so specific that they could only point to one person.
Speaker:It's possible.
Speaker:So you may also have to make up new situations, new backgrounds, things like that.
Speaker:But then you can use those fictional customers to train your AI-driven customer service model, which in turn helps provide customer service on an AI basis.
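Here is a minimal sketch of that synthetic-data pipeline under some assumptions of mine: `generate_text` is a stand-in for whichever generative AI API you actually call, and the prompt, record fields, and file name are made up for illustration. The point is only the shape of the workflow: one model invents fictional customers and conversations, and those invented records, not real customer data, become the training set for the customer-facing model.

```python
# Sketch of a synthetic-data workflow: model A invents fictional customers and
# support conversations; those records become training data for model B, the
# public-facing customer service model. No real customer data is involved.
import json

def generate_text(prompt: str) -> str:
    # Stand-in for a call to whichever generative AI API you use.
    # Here it returns a canned fictional record so the sketch runs as-is.
    return json.dumps({
        "customer": "Fictional shopper on the monthly subscription plan",
        "question": "Why was I billed twice this month?",
        "ideal_response": "Apologize, explain the duplicate charge, offer a refund.",
    })

PROMPT = (
    "Invent a fictional customer and a short support exchange about billing. "
    "Return JSON with keys 'customer', 'question', and 'ideal_response'."
)

# Model A generates the fictional training examples.
synthetic_examples = [json.loads(generate_text(PROMPT)) for _ in range(100)]

# Save them in a simple JSON Lines file to use as training data for model B.
with open("synthetic_training_data.jsonl", "w") as f:
    for record in synthetic_examples:
        f.write(json.dumps(record) + "\n")

print(f"wrote {len(synthetic_examples)} synthetic training examples")
```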
Speaker:So we will see this with hospitals and banks as well, which hold sensitive information.
Speaker:Obviously they cannot use their customers' sensitive information as training data, but they do want access to what is really part of doing business these days: having some sort of AI-based system.
Speaker:And then, of course, last but not least, there is the data that comes from you and me.
Speaker:So, what does that mean?
Speaker:When we are using AI platforms and we input our prompts, if we put in something that we've written and ask it to create a summary, or if we put in a transcript and ask it to create show notes, everything that we put into that platform has the potential to become training data for that platform.
Speaker:And so if we are doing that, we need to be aware of the
Speaker:terms of use of that platform.
Speaker:Most of them will tell you that your input can become part of the training data.
Speaker:And it might also end up in an output for someone else who puts in a query, a prompt that what you put in is a perfect answer for. You just don't know.
Speaker:And so we need to be careful about what we are putting in as prompts, or as input, for whatever AI platform we're using.
Speaker:Make sure you are aware of their terms and conditions.
Speaker:Do not put any confidential information in there, whether it's yours or your clients'.
Speaker:So make sure that you're really aware of that.
Speaker:Some AI platforms, I'm thinking in particular of DocuSign, do use AI.
Speaker:And obviously, when you're using DocuSign, there are legal agreements going in there that contain identifiable information about the parties, commercial terms, and things like that.
Speaker:DocuSign has said that they do use the agreements for training data, but that they strip out any identifying information from them.
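As a rough illustration of the kind of identifier stripping described above (my own sketch, not DocuSign's actual process), here is a regex-based pass that scrubs a few obvious identifiers before text goes anywhere near an AI tool. Notice that the person's name survives, which is exactly the episode's point that simple de-identification is often not enough.

```python
# Rough illustration of stripping obvious identifiers before sending text to
# an AI tool. This is NOT real de-identification: the name survives, and
# unusual details can still point to a single person, as discussed above.
import re

def redact(text: str) -> str:
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "[EMAIL]", text)           # email addresses
    text = re.sub(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b", "[PHONE]", text)  # US-style phone numbers
    text = re.sub(r"\$\s?[\d,]+(?:\.\d{2})?", "[AMOUNT]", text)           # dollar amounts
    return text

sample = "Contact Jane Doe at jane.doe@example.com or 555-123-4567 about the $25,000 fee."
print(redact(sample))
# -> Contact Jane Doe at [EMAIL] or [PHONE] about the [AMOUNT] fee.
```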
Speaker:So things to be aware of.
Speaker:In summary, I think we've covered the legal issues, but just to sum them up: there are the copyright issues of putting data into the training database.
Speaker:I believe it was last week that I talked about the copyrightability issues of the output.
Speaker:So now I'm talking about the copyright issues with the input: whether or not the AI platform, or you, have the right to add information to the training data set, and whether or not that is a copyright infringement.
Speaker:Is that fair use of that data?
Speaker:One of the issues on the copyright side is that sometimes the output will literally be an exact replica of what went in, and it's hard to make a fair use argument when a verbatim paragraph comes out as the output, as in the case of The New York Times, which is the basis of their lawsuit against OpenAI.
Speaker:Where's the fair use there?
Speaker:Same with Getty Images.
Speaker:They've had exact replicas of their images come out of an AI platform.
Speaker:So that's obviously an issue.
Speaker:In addition to copyright issues, we have privacy concerns.
Speaker:There are instances of real images of people coming out, and most certainly private photos from somebody's old Facebook, old blog posts, old journals.
Speaker:Think about what the original blogs were: kind of like journals, right? People would use them as a journal, and those are probably still hanging around somewhere.
Speaker:I'm thinking about a blog I started back in the day. It didn't last very long, and no one ever saw it, but it's still out there somewhere.
Speaker:I don't know if I could find it today, but it's still out there, and somebody's web crawler could find it.
Speaker:I don't think they'd be very interested, but it's there.
Speaker:And so we do have the privacy concerns.
Speaker:And then we have contract breach: if we are entering, say, a client's confidential information into an AI chatbot, where it has the potential to be shared, we are breaching our contractual obligations to our client if we're doing that without permission.
Speaker:Some contracts are now explicit about whether or not you can use AI.
Speaker:But even if the contract is silent on that, if you are obligated to keep the client's information confidential and only share it under very specific circumstances, putting it into an AI platform is probably not one of those permitted uses.
Speaker:So you do have issues there as well.
Speaker:All right.
Speaker:So that is what I wanted to cover today regarding AI training data.
Speaker:As you know, this is a fast-moving area; who knows what will come next week.
Speaker:I'll try to keep you up to date, but always feel free to connect with me and
Speaker:let me know what your questions are.
Speaker:I'm always happy to answer them.
Speaker:Thanks again.
Speaker:And don't forget IP is fuel.