25 votes

Your chatbot transcripts may be a gold mine for AI companies (gifted link)

20 comments

  1. [10]
    winther
    Link

    This article covers some interesting insight into how people are using AI chatbots and what it might mean for the inevitable monetization. Turns out many people are having quite personal and intimate conversations with chatbots, which can have positive outcomes, but also opens up a whole lot of concerning privacy questions to consider.

    I have gifted the entire article, but here are a few select paragraphs:

    People share personal information about themselves all the time online, whether in Google searches (“best couples therapists”) or Amazon orders (“pregnancy test”). But chatbots are uniquely good at getting us to reveal details about ourselves. Common usages, such as asking for personal advice and résumé help, can expose more about a user “than they ever would have to any individual website previously,”

    OpenAI CEO Sam Altman recently told my colleague Charlie Warzel that he has been “positively surprised about how willing people are to share very personal details with an LLM.” In some cases, he added, users may even feel more comfortable talking with AI than they would with a friend.

    In the short term, that could mean that sensitive chat-log data is used to generate targeted ads much like the ones that already litter the internet. In September 2023, Snapchat, which is used by a majority of American teens, announced that it would be using content from conversations with My AI, its in-app chatbot, to personalize ads. If you ask My AI, “Who makes the best electric guitar?,” you might see a response accompanied by a sponsored link to Fender’s website.

    The potential value of chat data could also lead companies outside the technology industry to double down on chatbot development, Nick Martin, a co-founder of the AI start-up Direqt, told me. Trader Joe’s could offer a chatbot that assists users with meal planning, or Peloton could create a bot designed to offer insights on fitness. These conversational interfaces might encourage users to reveal more about their nutrition or fitness goals than they otherwise would. Instead of companies inferring information about users from messy data trails, users are telling them their secrets outright.

    18 votes
    1. Kind_of_Ben
      Link Parent

      he has been “positively surprised about how willing people are to share very personal details with an LLM.”

      This has real "They just gave it to me. Dumb fucks." vibes. Or rather, vibes of trying to find a less insensitive way of expressing that feeling.

      32 votes
    2. [8]
      Minori
      Link Parent

      Instead of companies inferring information about users from messy data trails, users are telling them their secrets outright.

      This one doesn't make much sense to me. At some point, someone is going to have to somehow turn those paragraphs and queries into keywords which can be fed into recommendation algorithms. I guess sensitive personal paragraphs give them more keywords to work off of? This is mostly a question about what kind of language processing they would be doing to generate marketing insights from this data.

      4 votes
      1. Omnicrola
        Link Parent

        You're right, but you're also not thinking enough steps ahead. Advertising isn't about presenting a relevant product to someone, it never has been. It's about emotion. If you can manipulate someone's emotions you have a much stronger ability to direct their behavior.

        Advertising companies already have a ton of experience and practice at manipulating your emotions to get you to buy things. Things you don't actually need, but you're convinced you do.

        Now imagine that instead of making educated guesses about general demographic groups, they know what drives you. You the specific person. What you fear, what you love, what angers you. They can now tailor their ads that much more specifically.

        Now further imagine that the ads themselves evolve further. They used to be printed on newspapers. Then they were on TV. Then they went online and could become more targeted. We've already seen experiments with some advertisers integrating promotions directly into articles as if they were actual content.

        The next iteration of ad tech, I assume, will work with an LLM backend. Chatbots already try to talk me up to "help" when I go on a lot of websites. Some of them are legitimately trying to be helpful. Others are 100% trying to funnel me into a sales CRM. I assume at some point one (or all) of the large LLM "AI" systems will start integrating ads in the form of the LLM making specific recommendations for products or services that are heavily biased in favor of whoever has paid up their AdAi bill.

        11 votes
      2. [3]
        Greg
        Link Parent
        • Exemplary

        This is mostly a question about what kind of language processing they would be doing to generate marketing insights from this data.

        The short answer is: the exact same type that makes an LLM work as a chatbot.

        It's easy to think of LLM and chatbot being synonymous - that's how they're marketed, and that's one of the only ways that makes sense to interact with them for most end users - but the core of the underlying tech is an absolutely colossal statistical model for turning freeform text into numerical relationships and back again.

        If you wanted keyword extraction from a given input you could do that in a few different ways with the LLM itself (the simplest, but probably least accurate, being a call to the chat API with a prompt something like "Please extract a comma-separated list of the ten most important keywords in the following text: ..."), but if you're running the model directly you can also bypass the existing chat output head entirely and pipe the raw numeric output straight into another model. That could be a keyword classifier, which will generally do a more accurate job than just asking the model in chat mode, but it could also be any other model you've got the data to fine-tune.
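
        For the first route, a minimal sketch (assuming the OpenAI Python client; the model name and the exact prompt wording are placeholders):

        ```python
        # Minimal sketch of the "just ask the chat model" approach described above.
        # Assumes the OpenAI Python client; model name and prompt are illustrative.
        from openai import OpenAI

        client = OpenAI()  # picks up OPENAI_API_KEY from the environment

        def extract_keywords(text: str) -> list[str]:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{
                    "role": "user",
                    "content": (
                        "Please extract a comma-separated list of the ten most "
                        f"important keywords in the following text:\n\n{text}"
                    ),
                }],
            )
            # The reply comes back as one comma-separated line; split it into keywords.
            return [kw.strip() for kw in response.choices[0].message.content.split(",")]

        print(extract_keywords("Looking for a couples therapist near me who takes my insurance..."))
        ```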

        If it were me* I'd be using the existing LLM as the first 80% of a pipeline and then sticking a few extra transformer layers on the end to predict the optimal recommendation directly. Freeform text goes in, row ID for the best targeted ad in the database comes out. Most of the time letting the prediction pipeline have that full context rather than assuming it only wants the keywords will work significantly better overall.
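
        To make that shape concrete, a rough PyTorch sketch (a sketch only, not anything any ad network is known to run): the base model stays frozen, the added transformer layers and linear head are the only trained parts, and the base model name and NUM_ADS are made up for illustration.

        ```python
        # Rough sketch: frozen LLM backbone -> a couple of extra transformer layers ->
        # one logit per row of a hypothetical ads table. All names and sizes invented.
        import torch
        import torch.nn as nn
        from transformers import AutoModel, AutoTokenizer

        BASE = "gpt2"        # stand-in for whatever LLM the pipeline already runs
        NUM_ADS = 50_000     # rows in the imaginary targeted-ads table

        class AdRecommender(nn.Module):
            def __init__(self):
                super().__init__()
                self.backbone = AutoModel.from_pretrained(BASE)
                self.backbone.requires_grad_(False)                  # keep the LLM frozen
                hidden = self.backbone.config.hidden_size
                layer = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
                self.adapter = nn.TransformerEncoder(layer, num_layers=2)  # the extra layers
                self.head = nn.Linear(hidden, NUM_ADS)               # one logit per ad row

            def forward(self, input_ids, attention_mask):
                states = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
                states = self.adapter(states, src_key_padding_mask=~attention_mask.bool())
                mask = attention_mask.unsqueeze(-1)
                pooled = (states * mask).sum(1) / mask.sum(1)        # mean over real tokens
                return self.head(pooled)                             # scores for every ad

        tok = AutoTokenizer.from_pretrained(BASE)
        batch = tok(["who makes the best electric guitar?"], return_tensors="pt")
        ad_row_id = AdRecommender()(**batch).argmax(dim=-1)          # "best" ad to serve
        ```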


        *Which it wouldn't be - I'm utterly disillusioned with how much technical time and expertise goes into minor gains in advertising effectiveness rather than anything that'll feasibly benefit society. I don't blame individual devs for taking those jobs, we've all got to keep the bills paid, but goddamn do I wish all those smart people were focused elsewhere

        5 votes
        1. [2]
          sparksbet
          Link Parent

          You could honestly probably do the same things with a simpler language model too, depending on what exactly you're looking for and how much money you have to spend. Getting keywords or sentiment from natural language text is not a new task, after all!
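
          For what it's worth, the off-the-shelf version of that is only a few lines; KeyBERT and the default transformers sentiment pipeline below are illustrative picks, not anything the chatbot vendors are known to use:

          ```python
          # Keyword + sentiment extraction with small, pre-LLM-scale models.
          from transformers import pipeline
          from keybert import KeyBERT

          text = "I've had this pounding headache for days and nothing is helping."

          sentiment = pipeline("sentiment-analysis")        # small DistilBERT checkpoint by default
          keywords = KeyBERT().extract_keywords(text, top_n=5)

          print(sentiment(text))   # e.g. [{'label': 'NEGATIVE', 'score': 0.99...}]
          print(keywords)          # e.g. [('headache', 0.6...), ('pounding', ...), ...]
          ```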

          1 vote
          1. Greg
            Link Parent

            For sure! I reckon you'd still get meaningful gains from keeping the full tensor representation as input to the recommendation step rather than bottlenecking down to keywords or sentiment in a human-readable sense, but you could do that easily enough with something like Phi 3.5 mini that'll run on a decent laptop - definitely doesn't need full-fat ChatGPT to get something powerful, and I wouldn't be surprised if diminishing returns kick in a good while before you're up at that scale.
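
            As a sketch of what keeping the full representation might look like with a small local model (the Hugging Face model ID below is an assumption on my part, and the downstream recommender is left abstract):

            ```python
            # Pull the hidden-state representation of a chat message from a small local
            # model and hand it to whatever recommender sits downstream.
            # Needs a reasonably recent transformers release for the Phi-3.5 architecture.
            import torch
            from transformers import AutoModel, AutoTokenizer

            MODEL = "microsoft/Phi-3.5-mini-instruct"
            tok = AutoTokenizer.from_pretrained(MODEL)
            model = AutoModel.from_pretrained(MODEL)

            chat = "We're redoing the nursery and I want a woodland theme on a tight budget."
            inputs = tok(chat, return_tensors="pt")
            with torch.no_grad():
                hidden = model(**inputs).last_hidden_state   # (1, seq_len, hidden_size)

            embedding = hidden.mean(dim=1)   # full-context vector, no keyword bottleneck
            # ...feed `embedding` into the recommendation model of your choice.
            ```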

            1 vote
      3. [2]
        CannibalisticApple
        Link Parent

        Marketing algorithms can make inferences based on pretty sparse pieces of data. I remember this article years ago about how Target figured out a teen girl was pregnant before her dad knew. Basically, by looking at purchase trends, Target's algorithms could determine if a woman was pregnant, even estimate her due date, and send out coupons and advertising accordingly. Most of the items mentioned are pretty innocuous, too.

        Now, think about how much information someone would give out talking about their pregnancy... No need to analyze their purchases to figure out a due date when someone is outright saying it, or try to figure out products when someone's asking for ideas on what fits with a nursery theme, etc. It skips straight over the inference step to just having the information.

        It might be a bit tricky to root through for the right keywords, sure... but, well. One of the biggest use cases I see for AI is to summarize and extract key points from long texts.

        4 votes
        1. skybrian
          Link Parent

          This story is probably a myth. Not that it couldn’t have happened coincidentally, but it’s no more substantiated than other times people thought that ad targeting did something like that:

          https://medium.com/@colin.fraser/target-didnt-figure-out-a-teen-girl-was-pregnant-before-her-father-did-a6be13b973a5

          5 votes
      4. ruspaceni
        (edited )
        Link Parent

        i think it might just be more about how its rare for you to make a post spilling your secrets and it not be on a private account or filled with slang or referencing things "out of the conversation" that might be hard to infer, or generally just run through a "social filter". but with an LLM if youre trying to have that sort of intimate conversation, you will be explaining it in detail or rewording things it responded incorrectly to. not to mention how theyre already probably extracting keywords and subjects from the messages just for content moderation/safety concerns.

        it really makes me think of this bit of an adam curtis documentary where he's talking about some really early "ai" program called Eliza as a sorta jokey "computer psychotherapist" and showed it to the people in the office, only to get asked to leave because a private conversation was happening

        https://youtu.be/yS_c2qqA-6Y?t=4924

        so its almost no wonder that a more convincing and open-ended chatbot is becoming such a goldmine for them. we just love yammering away to robots and apparently have been doing it since the 70s. but yeah i dont think it would be that hard for them to keep track of things as theyre happening (alongside the moderation thing) or just spending the GPU to go over chatlogs when an advertiser asks you a question about your userbase

        2 votes
  2. [6]
    Promonk
    Link

    Of course, OpenAI and its peers promise to keep your conversations secure.

    They don't, though. In all of my interactions with ChatGPT, the bot has been pretty upfront that OpenAI will use user input to "improve my responses." Most of my conversations with it have been me poking at its capabilities and finding them lacking in ways that totally undercut its makers' claims about its utility, so they're welcome to my inputs if they want them.

    Besides, isn't this the sort of data we want them collecting, as opposed to scraping every last unsecured scrap of data on the Internet without regard to copyright? The tone of this article seems to want people to get angry or upset about this, but I'm failing to see anything shocking or outrageous about it.

    Mind you, I'm not a booster of LLMs or AI generally, but I recognize that the real danger they represent right now primarily isn't in the systems themselves, or even in how they steal intellectual property to generate their models, but in how people respond to and use the data they generate. This kind of weak outrage bait only obscures these issues.

    9 votes
    1. [5]
      tauon
      Link Parent

      Of course, OpenAI and its peers promise to keep your conversations secure.

      They don't, though. In all of my interactions with ChatGPT, the bot has been pretty upfront that OpenAI will use user input to "improve my responses." Most of my conversations with it have been me poking at its capabilities and finding them lacking in ways that totally undercut its makers' claims about its utility, so they're welcome to my inputs if they want them.

      API users, at least of OpenAI, have long been promised/guaranteed to not have their data used for training or improvements, as far as I know about this topic.

      Besides, isn't this the sort of data we want them collecting, as opposed to scraping every last unsecured scrap of data on the Internet without regard to copyright? The tone of this article seems to want people to get angry or upset about this, but I'm failing to see anything shocking or outrageous about it.

      Patient data, mental health concerns, other most intimate thoughts and essentially all mail are under protection by law in many jurisdictions for a reason.

      This is no different – even though you don’t care about this or may think you don’t have anything to hide, do you really want these “AI” companies working together with potentially, just to name a few examples, your employer, insurance(s), or law enforcement?

      People here in this thread are already reporting using GPT as at least a basic doctor replacement, and while I’m not judging that in either direction, I think it’s clear this opens some pretty easy paths to huge problems.

      “We” as a society probably want them to collect neither copyrighted works nor user input data unchecked. Not just one of the two.

      Mind you, I'm not a booster of LLMs or AI generally, but I recognize that the real danger they represent right now primarily isn't in the systems themselves, or even in how they steal intellectual property to generate their models, but in how people respond to and use the data they generate. This kind of weak outrage bait only obscures these issues.

      Agreed, but not with your takeaway. Again, both might be an issue: People believing potential hallucinations as gospel, and new companies gaining way too much insight into people’s minds.

      What if the next OpenAI comes from, or is bought by, a foreign state (or state-controlled actor)? Compare the situation to e.g. Instagram and others versus TikTok.
      Do we want a potentially adversarial entity to have insight into politicians’, scientists’, and citizens’ heads, even more so than with social media? Would we even want a non-foreign company or government entity to have that power? People already believe too much of the fake news from social media and LLMs, as you mentioned; now imagine that data not only being directly evaluated, but possibly even manipulated live, with no potential for external supervision, by someone with a specific goal. The dangers here are obvious if these conversations aren’t kept private/secure by design, IMHO.

      6 votes
      1. [2]
        Omnicrola
        Link Parent

        Patient data, mental health concerns, other most intimate thoughts and essentially all mail are under protection by law in many jurisdictions for a reason.

        A small clarification, since people commonly make incorrect assumptions about what information is protected by HIPAA (in the US). Individual people are not restricted by HIPAA, and neither is any company whose job isn't to handle your records. If someone freely shares medical information with OpenAI, OpenAI has zero obligation or legal restriction to keep it private.

        7 votes
        1. tauon
          Link Parent

          Very good to know indeed! I’d heard a bit about HIPAA online, but until now I was under the very misconception you just cleared up, too.

          Although I am not a US citizen, my medical records are nonetheless protected (rights to information/correction/erasure) by the EU GDPR, no matter where they’re collected or stored :-)

          2 votes
      2. [2]
        Promonk
        Link Parent

        API users, at least of OpenAI, have long been promised/guaranteed to not have their data used for training or improvements, as far as I know about this topic.

        If this is true, then it is indeed scummy that they've taken to selling data gleaned from user inputs given by API customers, and they deserve scorn and possibly legal action.

        None of the rest of what you said is specific to AI at all, though. This is all stuff we've needed to grapple with regarding the Internet and Big Data as a whole for decades. The only reason to single out AI companies for these things is because AI is the Thing to Be Outraged About du Jour™.

        I'm not saying that most of the things you mention aren't worthy of concern, they are. I am just both wary and weary of items like this that trade in boiling down big unresolved problems in tech to fear regarding a single new and scary technology, because it seems to me that the media hype cycle of feverish outrage, saturation, then apathy more often than not promotes stasis or bad regulation rather than good solutions.

        "We” as a society probably want them to collect neither copyrighted works nor user input data unchecked. Not just one of the two.

        This, though, I want to address.

        Outside the limits of any warranties they've given promising privileging the inputs of paid API users, we should have no expectations at all regarding how they may choose to use the data we freely give them. The assumption of privileged communications is generally legally reserved for very few interactions, such as those between medical practitioners or legal representatives and clients. It exists as a principle in law and medicine because the withholding of information undercuts the purpose of those endeavors entirely, and it's extremely difficult to regulate even in fields like those that do have strict licensure.

        I just don't see why we should expect the vast majority of Internet interactions to be privileged like that. It's as though people walk into a store marked "Information Merchants," ask a few questions regarding the wares, strike up some chit chat about the crazy thing their cousin did the other week, and then get all shocked Pikachu when the information merchants turn around and sell off any info from their conversation they think might be worth something. More than that, it's like doing this same thing repeatedly for 30+ years every time they change the coat of paint on the storefront.

        We definitely need to be devising workable rules regarding inferential data collection, data retention, data security, and transparency about how data is collected. Information that we freely give, literally unprompted, with no warranties of privilege, that's something else entirely.

        3 votes
        1. tauon
          (edited )
          Link Parent

          If this is true, then it is indeed scummy that they've taken to selling data gleaned from user inputs given by API customers, and they deserve scorn and possibly legal action.

          To be 100% clear, I don’t know whether they specifically have done that. Didn’t mean to imply it, either. Let’s stick to the facts we do know, and my bad for the implication.

          None of the rest of what you said is specific to AI at all, though. This is all stuff we've needed to grapple with regarding the Internet and Big Data as a whole for decades. The only reason to single out AI companies for these things is because AI is the Thing to Be Outraged About du Jour™.

          While I agree, AI companies are unique in this insofar as they don’t inherently print cash, or turn profits at all, really. It’s not even clear whether what their business model offers is truly needed at this point.
          I recently started reading an excellent blog post/article (which come to think of it, I’m probably going to post here once I finish it – a bit lengthy) that basically blasts OpenAI’s business as one that will not be profitable for the foreseeable future. (Edit: now posted here)

          My point is this: If the regular business avenues don’t suffice, they may be more tempted to do something “monetizable” with user data than they would be otherwise.

          Outside the limits of any warranties they've given promising privileging the inputs of paid API users, we should have no expectations at all regarding how they may choose to use the data we freely give them. The assumption of privileged communications is generally legally reserved for very few interactions, such as those between medical practitioners or legal representatives and clients. It exists as a principle in law and medicine because the withholding of information undercuts the purpose of those endeavors entirely, and it's extremely difficult to regulate even in fields like those that do have strict licensure.

          This is your opinion, probably based on the legal framework you feel at home with as well as a bunch of other factors and lived experiences. I have a completely different world view in this regard (and to be clear, that’s fine! We don’t have to be the same person :-)) – I’m a staunch proponent of “all personal data is privileged* unless cleared for use otherwise”.

          *Privileged specifically meaning, at a minimum, knowing what data is out there of you, and potentially having the option to delete it. Not necessarily that it needs to be protected as strictly as e.g. medical info, I’m content with a “right to know” and “right to delete” (similar to what EU legislation actually ensures). But the key point is that I would classify my conversations as “my” data in some form or another.

          I just don't see why we should expect the vast majority of Internet interactions to be privileged like that. It's as though people walk into a store marked "Information Merchants," ask a few questions regarding the wares, strike up some chit chat about the crazy thing their cousin did the other week, and then get all shocked Pikachu when the information merchants turn around and sell off any info from their conversation they think might be worth something. More than that, it's like doing this same thing repeatedly for 30+ years every time they change the coat of paint on the storefront.

          If that were known upfront, sure. But people don’t enter the (really liking this analogy, BTW) Information Merchants’ brick-and-mortar grounds, they use a business model that sounds like Look Something Up or Connect with Your Friends For Free, but it happens online, so in the process they are able to share data they don’t even know they possess.

          We definitely need to be devising workable rules regarding inferential data collection, data retention, data security, and transparency about how data is collected. Information that we freely give, literally unprompted, with no warranties of privilege, that's something else entirely.

          Isn’t this the same pair of shoes? Can data security be guaranteed transparently if some LLM is trained on your innermost thoughts? Would you assume a friend or even acquaintance (as that’s how people might use these chat bots) gives “no warranties of privilege” in IRL or digital conversation, or would they assume a default mode of “hey maybe they won’t be sharing my thoughts with dozens/hundreds of employees and potentially millions of people”?

          Either way, it’ll be a tricky topic and regulators worldwide will, like you said, probably have a bit of a hard time figuring out good solutions that work well. To me, this unresolved problem isn’t one of “fear” per se like it’d be with social media eroding democracy or information sources being monopolized… but I’d definitely give it the “concern” label.

          1 vote
  3. skybrian
    Link

    Gold mine, toxic liability, business records subject to discovery? Maybe all of the above?

    ChatGPT keeps all your transcripts and gives them helpful titles. When I scroll through mine for the last few months, they are nearly all programming questions. But there is also “movies for eight year olds” and “alien space bats.” I don’t see anything I’d be all that uncomfortable revealing. My photos are more useful for remembering where I was on a certain date.

    When you save a receipt, usually it has no value and eventually you throw it away, but sometimes it’s important to keep such things. Records of business transactions sometimes become important evidence. Similarly here - mostly useless, certainly not valuable enough to pay for the compute, but it could become interesting to someone for unpredictable reasons.

    6 votes
  4. [3]
    tomf
    Link

    i like conversation mode in the chatgpt app. a few weeks ago i was dying from an especially bad headache and depression. i told the app as much and it gave me some great advice for managing it.

    i’ve got friends that are doing this a lot — even dumping lists of symptoms and asking ‘what do i have’, often with relatively accurate responses.

    if they monetize this line of data, i apologize. it’s definitely me and my social circle’s fault.

    3 votes
    1. [2]
      tauon
      Link Parent

      If it works, the only reason not to use it would be privacy concerns. I’d probably use it too given sufficient trust in the provider company… As usual, I think the individual user cannot be to blame.

      2 votes
      1. tomf
        Link Parent

        it felt really dodgy at first, but slowly I got used to it. I'm not really giving anything useful about myself beyond the fact that I might have a lot of really stupid questions.

        1 vote