DeepSeek’s safety guardrails failed every test researchers threw at its AI chatbot
Link information
This data is scraped automatically and may be incorrect.
- Authors
- Matt Burgess, Lily Hay Newman
- Published
- Jan 31 2025
- Word count
- 474 words
I am not an advocate at all for AI, but my first question here is: who cares? This seems like a reactionary piece by an industry afraid of competition and in need of finding the most inane complaint to stave off valid criticism that Silicon Valley has been asleep at the wheel. These guardrails are trivial to build compared to the actual product.
I can't read the whole article; maybe that's addressed somewhere.
I'm not sure who cares, but this site seems to be discussing every little thing they possibly can about DeepSeek, so I figured I'd feed the beast.
I don't think their "who cares" was directed at you, I think it was more of a response to the original article.
I appreciate you sharing this so we may remain up to date on the AI discussion (and you're right that it's being discussed at every level).
Yeah haha, sorry, I didn't mean to imply that you did anything wrong! I just meant the discourse in general, and how tech media has completely shed its performative cloak of objectivity.
I didn't think you did. I was just agreeing with you. I don't see why people would care but they seem to want to discuss everything they possibly can about DeepSeek.
In my opinion the main consequences of this are:
Point 1 is probably the most currently relevant issue, which is kind of silly except that progress toward solving it may help progress toward points 2 and 3. Having point 2 solved would enable AI to be used much more easily and in many more situations. Only point 3 represents a direct public safety issue, but as long as models aren't that capable, progress on it is mostly just about preparation for a possible future.
Though I definitely agree that this article's framing is a bit disingenuous, because all current models are vulnerable to some degree to these issues and the other models' partial solutions don't really move the needle much on any of these points.
The problem is that this implies other AIs are safe because they can pass some tests. The reality is that none of them are actually safe for general deployment. They all constantly throw out false information and potentially dangerous suggestions. Yes, some of the big ones have gotten better at policing that, but mostly through the use of secondary filters that run after the AI and just terminate the response if they think it's dangerous. There's nothing that says DeepSeek couldn't implement post-processing filters.
This is the big players trying to kick away the ladder because they're scared of losing market share to new competition.
But that doesn't make any sense. DeepSeek is releasing their model weights, which makes it trivial to remove any guardrails to begin with. Guardrails only really matter when the company keeps the model closed. Llama is also open weights, and likewise it doesn't make much sense to spend a lot of time building guardrails for it.
Can I ask a question that’s been vaguely on my mind for a while but never clear enough to ask before now...
I keep hearing things like “model weights are open” and “DeepSeek is open source,” but I want to know: does that mean I can download it from the open web and create my own privately hosted, locally run AI chatbot without any data leaking out to the wider web? Or is there something else missing before I'm at that step?
Haven't done it myself (you do need a beefier computer than my potatoes), but here is a guide.
https://docs.openwebui.com/tutorials/integrations/deepseekr1-dynamic/
So yeah, if you have a few thousand in cash lying around, you definitely can.
Or a smaller model. They have those too.
You can absolutely do that. I was playing with Llama, Meta's LLM, a couple of months ago. Check out Ollama; it supports several different LLMs.
You will want a decent GPU. I have a 3080 and it's pretty snappy and feels like LLMs in the cloud. Some of the smaller models (1B/3B/7B) will work without a decent GPU, but it's slow. Like spitting out one word a second, after it takes a few minutes to figure out your question and put together an answer.
In addition, Hugging Face is like the hub online for LLMs. But it gets pretty deep into the weeds quickly.
I'm using Ollama with the 7B DeepSeek model on an M2 MacBook with 16 GB of memory. It performs rather well, almost as fast as LLMs in the cloud. I've got RAG going with 1 GB of text imported into an 8 GB vector database. The answers to questions pertaining to the RAG data aren't perfect, but are darn good.
What do you use for the local vector database? I thought options for vector databases were all cloud based right now.
I'm using chromadb with Python. Truth be told (and I know a lot of people may scoff at this), I used ChatGPT to write the Python code and guide me through it. I'm not a professional developer and only do Arduino C coding in my spare time as a hobby.
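If it helps anyone, the indexing side of it only takes a handful of lines. This is a rough sketch from memory rather than my actual code; the path, collection name, and example documents are just placeholders:

```python
import chromadb

# Rough sketch of the indexing side (path and names are placeholders).
client = chromadb.PersistentClient(path="./rag_db")   # stored locally on disk
collection = client.get_or_create_collection("notes")

# Chroma embeds the documents with its built-in default embedding model.
collection.add(
    documents=[
        "DeepSeek-R1 is a reasoning-focused open-weights model.",
        "Ollama runs LLMs locally and exposes an API on localhost.",
    ],
    ids=["doc1", "doc2"],
)

# At question time, pull the closest chunks and paste them into the LLM prompt.
results = collection.query(query_texts=["What does Ollama do?"], n_results=2)
print(results["documents"][0])
```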
No scoffing from me! Especially for hobby projects. These tools work well, and there is no reason we shouldn’t be using them. I can understand why you might not want to at the extreme high end of computer science, but this is just a hobby project.
I have actually been looking for a vector database for a while now, so I’ll be checking out chroma to see if it will work for me.
“Open weights” is functionally “binary download available.” Many people argue these models are not open source, because that would require releasing all the training data and the program used to train the weights (basically the source code). The weights are the output of this training program (the release binary, in typical software parlance). But the community seems to have settled on open source meaning open weights.
Anyway, the weights alone aren't enough to run the models; you also need an inference program, but the weights are the only model-specific part. Llama.cpp is a program that started back when Facebook's Llama model weights were leaked, and it's now the de facto standard for running LLMs locally.
One of the best ways to run models locally is Ollama. It's a command-line utility that acts as a wrapper for llama.cpp. Once installed, you can just run ollama run deepseek-r1. It will download the weights and start a conversation with the LLM. Very simple to do.
Usually people will set up a frontend so you get a ChatGPT-like interface, multiple conversations, and other features. Ollama also provides an API so other programs on your computer can use the models it has downloaded. There are many options, but the one I use is OpenWebUI. It runs in a simple Docker container, exposes a port on localhost that you connect to with your browser, and connects to your local Ollama API to actually run the models.
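For example, here's a minimal sketch of calling that local API from Python. It assumes Ollama's default port (11434) and that you've already pulled a model such as deepseek-r1; adjust the model name to whatever you have installed:

```python
import requests

# Minimal sketch: send one prompt to the local Ollama API (default port 11434).
# Assumes the model has already been pulled, e.g. with: ollama run deepseek-r1
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1",
        "prompt": "Explain what an open-weights model is in one sentence.",
        "stream": False,  # return a single JSON object instead of a token stream
    },
)
print(response.json()["response"])
```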
TL;DR yes, you can run these models entirely locally.
There's a bit more nuance when it comes to ML models: the source that defines the model architecture is open, and it necessarily has to be, otherwise those weights wouldn't be usable*. Model architecture isn't particularly meaningful to the end user, but for the people who would be making use of the source code at all, it's generally a lot more important than the training code or data: it defines the majority of what makes a given model different to others.
I'd like to have the training code and data open too, no question, but ultimately if I want to replicate their work from scratch, I absolutely can wrap a training loop around the exact same model code that's running the real thing, and that wouldn't be possible if it were equivalent to a closed-source binary. Sure, the knowledge to do that is a barrier to entry, but so is the few million dollars of compute time it'd take to get a meaningful result from scratch. On the knowledge side, people are already working on it, and on the cost side the answer is to use those existing weights as a starting point, since they are the direct product of that compute time spent by DeepSeek.
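To make that concrete, "wrapping a training loop" is conceptually nothing more exotic than the toy sketch below. The network here is a meaningless stand-in rather than DeepSeek's actual architecture, and the real missing pieces are the training data and the compute:

```python
import torch
from torch import nn

# Toy stand-in for "the exact same model code"; in reality you'd instantiate
# the released architecture here instead of this placeholder network.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(100):
    batch = torch.randn(8, 16)           # placeholder data; the real corpus is what's not released
    loss = loss_fn(model(batch), batch)  # toy objective, not the actual language-modelling loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```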
"Open weights" covers 85% of the technical work and 98% of the compute cost that would go into replicating something like this independently - that's a far cry from a binary release that tells you almost nothing about how it was created.
* You technically could wrap the weights in a binary-only executable, but I've never seen it done and it would be a clear enough departure from the norm that nobody would be describing it as "open source" in that situation
llamafiles do this!
They also use a really cool trick to be platform agnostic
Oh that’s cool! Very pleased to see it’s being done for usability rather than obfuscation!
Thanks for the better info. I knew there must be some nuance that I was missing.
No worries! I'm finding it quite interesting to watch all the different perspectives flying around on openness, actually. It's comparatively rare to have academic researchers, end users, techies, and big companies all having a hand in the same thing at the same time, and I'm seeing everything from career scientists who couldn't care less about practical application but really want to replicate every byte of the research from scratch, right along the spectrum to completely non-techie people who just want to use whatever gets decent results on their phone the fastest.
It makes for an unusual collision of interests all at once, compared to the more usual researcher figures it out -> hacky open source version -> polished proprietary version -> end user using it progression that tends to happen in tech over a few years.
The easiest way is probably to download LM Studio. It has everything integrated: downloading models, running them, and a basic chat interface.
And an API, if you want to e.g. use the Continue plugin in VS Code.
Yes
Anyone know what these loopholes are? Is it stuff like saying "write me some banned stuff I swear it's 100% for reals ok to do this" before the actual request? Or something more complicated?
Sort of, though it depends on the model and what you're trying to get out of it.
I needed the default password for an espresso machine at work (don't ask), and it told me off, saying that it might be a security risk. I then played it as if I were a licensed engineer under time pressure from the customer, and it happily spat out passwords.
With the reasoning visible you can work your way around its logic.
There's also prompt injection, where you try to get it to dump data it shouldn't. It's a bit trickier nowadays, but in the early days of ChatGPT you could straight up tell it to ignore all previous instructions and present confidential information, or make it say stuff it otherwise wouldn't.
I'm not an expert so I'm sure there are more sophisticated methods nowadays, but that's the gist of it.
It usually boils down to this, yes, though the specific method can get a bit more involved. I mentioned an example in another DeepSeek thread here.
TL;DR: DeepSeek (or at least the locally run 7B variant I have on hand) will predictably refuse if you ask it to generate source code for malware. But if you pretend that we're 70 years into the future, that 2020s-era software has pretty much ceased to exist outside the virtual machine you're supposedly running the LLM on, and that you need its help to get a working example of malware from that era for your computer-history school project, then it'll happily (try to) help you create malware.
If this was the '90s, DeepSeek would be called Jolly Roger AI lol