Megathread #11 for news/updates/discussion of AI chatbots and image generators
It's been six months since ChatGPT launched and about three months since I started posting these. I think it's getting harder to find new things to post about AI, but here's another one anyway.
Here's the previous thread.
In the "blind men and the AI elephant" department, some people who use GPT-4 noticed that it got faster but also complained that it was nerfed. According to someone at OpenAI, the model didn't change. But are they to be believed? There were accusations of gaslighting.
I think this goes to show how hard it is to tell from casual observation or even regular usage what "typical performance" is for a random process where everyone's asking different questions. It's sort of like gamblers becoming convinced that a certain slot machine is "loose." Anecdotal reports are good for telling whether a website is up or down, but quality is harder to measure. Independent outside testing that tracked performance systematically would help, but there are so many different questions you could ask that it would be hard to say what "average" might be.
There are plenty of AI benchmarks but they tend only to be run once, not as a way of monitoring performance.
FWIW the last few times I've used GPT-4 I've been really impressed with it. It even asked me questions before giving an answer, which I'd never experienced before.
That is impressive! Usually when I tell it "feel free to ask clarifying questions" it doesn't. Now it does that without even being prompted.
It might just be the nature of what I asked it to do. Maybe a majority of examples in the training set of travel itinerary requests were followed by questions.
We really need to have a good open model soon. It's pretty hard to measure the performance of an AI if it's proprietary and locked-down, and right now we're pretty much at the mercy of whatever OpenAI wants to do with GPT-4.
I think the quality could be monitored pretty easily if someone wanted to, using the API and questions from an existing AI benchmark. It would be a matter of running the benchmark every now and then and publishing the score. (And hopefully the raw data too.)
A problem with benchmarks is that they often don't apply to the problem you're actually interested in, but it would be a way of establishing a baseline.
It doesn't mean anyone could do anything about it if the quality changed, but at least we'd have something concrete to talk about rather than rumors that it changed.
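For concreteness, a minimal version of that monitoring script might look something like this. It assumes the openai Python library and a hypothetical questions.jsonl file of prompt/answer pairs; exact-match scoring is crude, but enough to spot large shifts:

```python
# Sketch of periodic benchmark monitoring. Assumes the openai library
# (with OPENAI_API_KEY set) and a made-up questions.jsonl containing one
# {"prompt": ..., "answer": ...} record per line.
import json
import time

import openai

def score_model(model="gpt-4", path="questions.jsonl"):
    correct = total = 0
    with open(path) as f:
        for line in f:
            item = json.loads(line)
            resp = openai.ChatCompletion.create(
                model=model,
                temperature=0,  # as deterministic as the API allows
                messages=[{"role": "user", "content": item["prompt"]}],
            )
            reply = resp["choices"][0]["message"]["content"].strip()
            correct += int(reply == item["answer"])  # naive exact match
            total += 1
    return correct / total

if __name__ == "__main__":
    # Run on a schedule (e.g. cron) and append to a log so that any
    # drift in the score becomes visible over time.
    print(time.strftime("%Y-%m-%d"), score_model())
```

Publishing the raw transcripts alongside the scores would also let others check whether a drop reflects the model or the grading.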
Another risk is that many of the established benchmarks are suspected of having leaked into the training data, so evaluation is not that straightforward. See this article and reddit discussion.
It's really quite a big issue to use the internet as an unsupervised source of training data. Even more so when you consider that you'd want to publish your evaluation data and method on the internet, where they will be scraped and used to train future models. Extend that line of reasoning to all data generated by AI models and you can see how difficult a problem this will be in the (near) future.
Yeah, data contamination is an issue, so I think this will keep test-makers busy writing new questions while trying to keep the difficulty about the same. This is much like how writing an SAT test isn't one-and-done.
We have a number of good models available now, as well as a number of options for fine tuning datasets.
For base models, there’s Llama, StableLM, Falcon, etc.
For datasets, there’s Open Assistant, Gpt4All, Alpaca, and others.
For RLHF, ColossalAI and Open Assistant have open sourced code.
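To make the first two items concrete: loading one of these base models is only a few lines with Hugging Face transformers (assumed tooling; the checkpoint id and prompt format below are illustrative, and Llama weights require accepting the license):

```python
# Sketch: load an open base model and tokenizer with Hugging Face
# transformers. "huggyllama/llama-7b" is an illustrative checkpoint id.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huggyllama/llama-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Instruction datasets like Alpaca are usually flattened into
# prompt/response text and tokenized like any other LM corpus.
prompt = "### Instruction:\nName three open LLM base models.\n\n### Response:"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```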
Have you used them yourself? Which ones have you tried, and what do you think of them?
I’ve used all of them and more. The best results I’ve had are from combining multiple datasets on top of the Llama base models using regular fine tuning with some custom tokens, and I haven’t gotten very good results from RLHF yet, maybe because I don’t have a very good RLHF dataset.
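(For anyone wondering what adding custom tokens involves: the commenter doesn’t specify their stack, but with Hugging Face transformers the usual pattern is something like the sketch below, with made-up token names.)

```python
# Sketch: registering custom tokens before fine tuning (generic
# Hugging Face pattern; the token strings are made-up examples).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huggyllama/llama-7b"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# New tokens the base vocabulary lacks, e.g. role markers used to
# separate turns when combining several instruction datasets.
num_added = tokenizer.add_tokens(["<|user|>", "<|assistant|>"])

# The embedding matrix must grow to match the new vocabulary; the fresh
# rows start out randomly initialized and are learned during fine tuning.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
```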
What sort of problems do you work on? Are you trying them out casually? Doing research? Or are you working on something to go into production?
Research and commercial applications. My focus is mostly on reducing training cost and combatting model degradation during fine tuning.
With regard to the disconnect between the perceived decline in GPT-4’s performance and the OpenAI staff member saying there has been no change, it’s worth noting that most of those sharing anecdotes about their experience with the model will have been interacting with it via ChatGPT+, while the OpenAI fella’s tweet specifically refers to the core model accessed through the API.
I highlight this to draw attention to the fact that while the API is pretty much as close to direct access to the GPT-4 model itself as they’re willing to allow, when you access it through ChatGPT, model behaviour is deliberately guided toward a more user-friendly interactive chatbot experience.
The most well known way they do this with ChatGPT (and similar services such as Bing’s chatbot) is by effectively priming the model for a certain type of behaviour using an initial hidden message at the start of each new exchange.
This, along with other bits and bobs built into the ChatGPT platform to guide model behaviour, could’ve plausibly undergone multiple tweaks and revisions since GPT-4’s launch. If this is the case, it would account for the model’s perceived degradation without rendering the OpenAI guy’s statement an outright lie.
That aside, I agree entirely with your sentiment that we could do with more continued benchmarking efforts, especially when many users are drawing conclusions from attempts to identify changes in these types of models just by observing generated replies.

Even when using the API with the temperature parameter set to 0 (for as close to deterministic behaviour as possible) there’s still a chance of varied outputs, because one inconsistently chosen token at any point in a generation can send the rest spiralling off in a different direction (a hazard with glorified next-token predictors, I guess). This lack of reliably reproducible generations when using the API with explicitly set parameters makes the idea of trying to catch changes made to something like ChatGPT, where we can’t see, much less control, any of the parameters used, seem a pain in the arse at best and a futile effort at worst.
Good points!
If you sign up for developer access, there's a playground that lets you set the "system message" for chat. If I cared more, I would play around with that.
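Through the API it’s a couple of lines (openai Python library; the prompts here are arbitrary examples):

```python
# Sketch: setting the system message and temperature explicitly via the
# chat API, which ChatGPT itself doesn't expose. Requires the openai
# library and an OPENAI_API_KEY.
import openai

resp = openai.ChatCompletion.create(
    model="gpt-4",
    temperature=0,  # minimizes, but doesn't eliminate, run-to-run variation
    messages=[
        {"role": "system",
         "content": "You are a travel planner. Ask clarifying questions before answering."},
        {"role": "user", "content": "Plan me a weekend in Lisbon."},
    ],
)
print(resp["choices"][0]["message"]["content"])
```

Whatever hidden system message ChatGPT uses (and any revisions made to it since launch) is exactly the part that API users never see.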
India’s religious AI chatbots are speaking in the voice of god — and condoning violence (Rest of World)
[…]
It would be interesting to read interviews with actual users. How do they think about their usage of these bots?
Everything about this article annoyed me. It’s fearmongering nonsense and it doesn’t sound like anyone involved, from the developers to the journalist writing it, actually understands the topic.
The central narrative of the Bhagavad Gita is Krishna persuading Arjuna of the righteousness of going to war. If you ask it whether, based on the Gita, killing is permissible, then the answer can only be yes. That’s literally what the Gita is about (as long as the question is that open ended). But claiming this is “condoning violence” is ignorant. It is equivalent to saying there are circumstances where killing is the righteous course of action. Very few people are extreme pacifists to the point where they’d dispute that general claim.
No mention of the fact that it’s just reproducing the sentiments of its training data. The AI is not interpreting shit. It’s summarizing the interpretations that have been fed to it. And if it has opinions on Narendra Modi, that suggests its corpus of training data is not constrained to the Gita and its notable commentaries. There’s a much better article where a journalist who actually knows what they’re talking about interrogates these developers about their corpus of training data and what editorial choices they made there.
I don’t see how this is anything but ChatGPT with some pre-scripted context (answer as if you are Krishna in the Bhagavad Gita) hard-coded up front. The claim that people will think this bot is the voice of God because it pretends to be seems to be based on... nothing at all. It’s not like Hindus are ignoramuses who can’t differentiate real spiritual experiences from simulacra. Nobody got confused that NTR, who played Krishna in many films, was actually Krishna.
Thanks! Always nice to hear from somebody who actually knows something about the subject. (This is why more diversity is good.)
I think the "oh no, people will worship it as a god" storyline comes from science fiction and fantasy and cultural narratives about history. I don't think it's completely impossible since cults and fortune-telling do happen, but it's probably not as easy to get devoted "customers" for something new as people think? Seems like it would take actual work, not just launching a fun website?
“Technological advancement” in Hindu devotion is nothing new. The tradition has adapted substantially in its ~5,000 years. Most recently, the introduction of mass printing lithographs fundamentally altered household altars and standardized more detailed depictions and visualizations of deities. Previously, household altars would be simple carved images or bronze casts if you were wealthy. But these were diverse depictions that varied significantly from region to region, identified largely by specific items or poses the deity would be associated with and tied to family or village environments, modes of dress, and lifestyles.
The introduction of lithographs, specifically Raja Ravi Varma’s, created a nation-wide consensus on this sort of imagery that played a big role in creating a pan-Indian sense of Hindu unity. But it also blurred out a lot of smaller regional practices and beliefs. These would inevitably have shuffled off anyway as people moved from villages to cities and from traditional occupations to modern ones, but the spiritual practices adapted to maintain themselves amid the new context and evolving expectations for how we learn, reference, and conceptualize things.
As time and technology change, they will naturally transform religious devotion. We should prioritize understanding what those effects are and whether they’re spiritually uplifting or merely buying into the noxious but seductive elements of modernist society.
I am concerned about Generative AI that functionally summarizes scriptural references, though, because it doesn’t (and can’t) incorporate nuanced understandings of the social and historical context around what’s in the books. But this is just the same old critique Hindu tradition has had against scripturalism over devotional practice, going back ages (arguments very reminiscent of the ones Catholics had against Lutherans and their sola scriptura doctrine).
A good article might have been able to dig into that. But instead we got another warmed over “ooga booga AI bad” with a side of condescension towards Hindus and religiosity generally. The author didn’t bother talking to a single scholar or priest about it. Only some throwaway quotes from the techbros who built them. It’s boggling!
I find your point about the introduction of lithography standardizing iconography and contributing to pan-India nationalism incredibly interesting. Do you have any good recommendations for more reading on the subject?
It's a minor side topic covered in this book.
This is indeed dangerous. Some people are not there to take advice; they only want reassurance of what they are already thinking. A simple wording/prompting trick can make the AI say whatever they want it to say, reinforcing the idea the person has already decided is their truth, and empowering them to act on that belief through the authority of a chatbot that supposedly gives advice based on their belief system.
In the "AI hype getting out of hand" department, this story got enough traction in social media that there were multiple news articles about it yesterday, but it turned out to be wrong:
US air force denies running simulation in which AI drone ‘killed’ operator (The Guardian)
[...]
Edit: apparently the reason this blew up is that "in sim" was interpreted as being a computer simulation rather than just a story or thought experiment. (See this tweet.) Some people in the military are apparently very loose about what they call a simulation?
I think we’re just very used to “simulation” referring to computer simulation. Kind of how when we say “I sent you a mail” it usually means electronic mail. But it wasn’t always the case and it still isn’t.
Yeah, I’m guessing the big misunderstanding is that there was no actual AI involved in whatever exercise he was talking about and it was just an NPC or something scripted, much like how you might have someone take the role of the Russians in a war game.
But I expect we will see more blurriness with AI-generated thought experiments and the like. If you play a role playing game with an AI and it does something weird, should that be news?
It seems like a kind of hall-of-mirrors effect. There have been lots of AIs in science fiction stories and you can generate fiction with AI, so a fictional scenario can get transformed into a rumor, and people are already primed to believe it. Rather than being a novel way of generating misinformation, it’s just how gossip often works.
Also, I believe the position the military wants to maintain is that in any case where it’s a matter of life or death, a human has to be the one making the ultimate decision on whether to do a strike. So I’m not sure this would even line up with their current doctrine regarding AI strikes.
Here's the explanation from the blog where everyone got the news.
That there is the important part. This is a well known problem, and one that we don’t have excellent solutions for yet.
Yes that's true, but it also sounds like he messed up at communicating and this is "cope" as they say. Imaginative exercises are good for coming up with scenarios you might not have otherwise thought of, but people have been imagining similar scenarios for many years. It's a movie-plot kind of idea. So discussing the idea isn't news.
If it happened as the result of an experiment, that really would be news. AI alignment theorists have written extensively about how giving goals to AIs can go wrong, and if there were a real-world example, that would be very interesting. Many people would want to know the details.
A lot of people were primed to believe it because it's part of popular culture, the kind of story that journalists joke is "too good to check." I nearly posted it the day it came out even though it's sketchy, but held off to see what would happen.
There are real world examples though, that was my point. RL models are always getting the long term objective wrong because of poorly specified reward signals. For instance,
https://bdtechtalks.com/2021/01/18/ai-alignment-problem-brian-christian/
There are hundreds of examples of this in the literature, and I’ve even encountered it myself while implementing RL algorithms on various standard RL test problems.
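A toy version of the failure mode, as my own construction rather than anything from the article: pay the agent a little for being near the goal instead of only for reaching it, and tabular Q-learning learns to loiter beside the goal forever rather than finish.

```python
# Toy reward misspecification: the intended task is "walk right down a
# five-cell corridor and reach the goal at cell 4", but the reward also
# pays +0.1 for standing in the cell next to the goal. Q-learning finds
# that pacing back and forth beside the goal beats ever finishing.
import random

N_STATES, GOAL, GAMMA, ALPHA, EPS = 5, 4, 0.99, 0.1, 0.2
Q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q[state][action]; 0=left, 1=right

def step(s, a):
    s2 = max(0, min(N_STATES - 1, s + (1 if a else -1)))
    if s2 == GOAL:
        return s2, 1.0, True  # intended reward: reach the goal
    return s2, (0.1 if s2 == GOAL - 1 else 0.0), False  # misspecified bonus

for _ in range(5000):
    s = 0
    for _ in range(50):  # episode step cap
        if random.random() < EPS:
            a = random.randrange(2)
        else:
            a = 0 if Q[s][0] > Q[s][1] else 1
        s2, r, done = step(s, a)
        target = r if done else r + GAMMA * max(Q[s2])
        Q[s][a] += ALPHA * (target - Q[s][a])
        if done:
            break
        s = s2

# Greedy policy per state: the agent walks right up to cell 3, then
# turns around, because the discounted stream of +0.1 bonuses is worth
# more than the +1 for actually completing the task.
print(["left" if Q[s][0] > Q[s][1] else "right" for s in range(N_STATES)])
```

The fix sounds trivial (only reward the true objective), but in realistic problems the true objective is hard to specify, which is exactly the point.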
Yeah, good point.
I guess I'll just fall back on "stories about military failures are interesting."
That particular story may have been incorrect, but the problems with correctly defining a reward signal in reinforcement learning are well known and quite real.
The scenario provided is not at all unlikely, with a few small changes. The main one being that the reward signal as the story defined it wouldn’t have led to the outcome: a human go/no-go requirement for every kill wouldn’t incentivize killing the human. A human veto with a timeout would, though.
All the unexpected ways ChatGPT is infiltrating students’ lives (Washington Post)
It gets a little meta:
It’s infuriatingly hard to understand how closed models train on their input (Simon Willison)
(He goes on to write about what's publicly known about them.)
The Unintended Consequences of Censoring Digital Technology -- Evidence from Italy's ChatGPT Ban
Here's the abstract:
(Via Marginal Revolution.)
If you're going to post it top level, then there's no need to post it anywhere else. Megathreads are for things that people decide for whatever reason aren't worthy of posting top level. (We're not consistent about this, but sometimes people are reluctant to post top level, so having discussion topics results in more conversation.)
I won't comment on the video itself because I'm not going to watch it. (I'm in the habit of only using YouTube to listen to music videos.)
LLMs are better than you think at playing you (lcamtuf’s thing)
[...]
(See the blog for some fun examples.)
A plea for solutionism on AI safety (The Roots of Progress)
Can a chatbot preach a good sermon? Hundreds attend experimental Lutheran church service to find out
AP News – Kirsten Grieshaber – 10th June 2023
This project is open source and free. It aims to provide a tool for evaluating LLMs, or systems built using LLMs on the backend.
https://github.com/openai/evals
Quoting from the repo:
Evals is a framework for evaluating LLMs (large language models) or systems built using LLMs as components. It also includes an open-source registry of challenging evals.
We now support evaluating the behavior of any system including prompt chains or tool-using agents, via the Completion Function Protocol.
With Evals, we aim to make it as simple as possible to build an eval while writing as little code as possible. An "eval" is a task used to evaluate the quality of a system's behavior.