LLMs and privacy
Hello to everyone who's reading this post :)
LLMs are increasingly useful (after careful review of their generated answers, of course), but I'm concerned about sharing my data, especially very personal questions and my thought process, with these large tech giants, who seem rather sketchy in terms of their privacy policies.
What are some ways I can keep my data private but still harness this amazing LLM technology? Also what are some legitimate and active forums for discussions on this topic? I have looked at reddit but haven't found it genuinely useful or trustworthy so far.
I am excited to hear your thoughts on this!
If you want to keep your data private, your only real bet is likely to self-host a model. If you have a specific use in mind, train it on data for that niche and you'll have the LLM you need.
Using any major, non-local service means you're feeding your information into their prompt pool for future training and use.
This is good advice, but there's a large caveat:
A lot of large businesses, like potentially your employer, have enterprise versions of LLMs that specifically do not share data and are only within the company sandbox.
Further, at work I have specific commands that flag "personal requests" so they're further sequestered from the company.
Both of these come with strong contractual obligations between the LLM vendors and my employer. If either breaks these terms, the payouts to me are large. The payout to the company if the LLMs aren't sandboxed and use our proprietary input in any way outside our sandbox would be astronomical.
We are given these tools because our employer has understood that we need to learn how to use LLMs, experiment, and get comfortable with them if we're going to find creative and useful ways to apply LLMs at work. Personal use is therefore encouraged, also on company time.
Based on what I've been seeing with the lawsuits from major companies whose content was taken in vast quantities, I have a suspicion that it would be extremely difficult to prove in court and actually get said astronomical payout.
Those companies didn't have contracts with the AI companies they're trying to sue. The people filing these lawsuits are going after AI companies for copyright infringement, which is an extremely murky legal grey area right now.
Contract violations are an entirely different beast, and much more of a straightforward case to make.
I'm curious about the specific commands for personal requests that get further sequestered. How does that work exactly?
Cloud hosting can be a good middle ground here - still under your direct control (and optionally encrypted to the point that you’re the only one able to access any of the data at either end), paying for the compute itself rather than a packaged service, and paying only the few minutes at a time you’re actually using the model, rather than having to buy expensive dedicated hardware that’ll be idle 99% of the time.
You can feasibly run things like full-scale deepseek that would otherwise cost six figures to set up.
What in your mind is feasible? Last time I looked at hosting models I couldn't find anything where I didn't end up paying significantly more than I would using providers. Which, beyond simple economies of scale, raises some questions about how much of our AI usage is subsidized.
But I might also have been looking at the wrong offerings, so I am curious.
Ephemeral containers are generally the way to go for inference workloads: you're looking at $0.00155/GPU/second for H200s on Runpod serverless, for example. I'm not sure if they're actually still the best pricing, either - when I was running more cloud hosted stuff last year Runpod were pretty consistently the cheapest, but that could easily have changed by now.
Four of those will give you 564GB VRAM, which should be plenty for a full size LLM, and should run at a few hundred tokens per second (those benchmarks are on 8x H200 rather than 4x, but even being pessimistic and dropping them by significantly more than half gives >200 tok/s). Assuming an also-pessimistic 5s per request overall, to allow for a few seconds of container startup overhead and a few seconds of actual inference, that gives about $0.03 per API request. You can quantize more aggressively or just use smaller models if you want to run fewer GPUs, too - 282GB VRAM is still very respectable and that'll cut the costs in half.
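(Spelling out the arithmetic: 4 GPUs × $0.00155/GPU/second × 5 seconds ≈ $0.031 per request.)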
I actually don't know how that compares to the provider hosted options - like you say, they're quite probably subsidising as well as benefitting from huge economies of scale - but coming at it from the other side and assuming that self hosted is a requirement, it pretty dramatically lowers the barrier to entry compared to buying even fairly modest hardware.
[Edit] Typo
There is something amusing about worrying about one's privacy when using an LLM when we know they have been trained by stealing people's data and works. If you're worried about privacy, wouldn't it be better not to use an LLM at all?
It is literally impossible to prevent the use of public data for model training and if you're concerned about it, shouldn't you stop creating new data by posting on a public forum like tildes? I am not trying to be rude with that comment, and I hope it doesn't come across that way, just trying to make a point! Private data is a legal matter and, unfortunately, not one I expect the courts to address any time soon.
From the perspective of the common man, using a local LLM provides no data to anyone. It does not feed "open"AI and their algorithms. It simply uses what already exists (often they are trained on the output of chatgpt) and would exist regardless of any single person's actions.
Sure it is. We call those copyright infringement and lawsuits. Not something a tildes commenter can afford, but definitely something that will shape the landscape. We'll see how the dust settles.
let's check: https://docs.tildes.net/policies/terms-of-use#content-you-post-to-tildes
indeed.
Fairly generous terms of use. Only enough to make sure the site works as a public forum, and you retain ownership. That does technically mean that you are on your own if you want to fight any potential scrapers, though.
Is this a comment for me or OP?
For you. You were suggesting that OP shouldn't use an LLM due to privacy concerns. I was pointing out that it is worse from a privacy perspective to post on tildes than it is to use a locally hosted LLM.
I feel like the comment was not conflating those two items, merely saying that it is amusing to be worried about your own privacy while using a product that was created by steamrolling over the data of others without consent.
I can appreciate that thought and it is amusing, but it's a bit like saying, 'Oh, you're concerned about global warming, yet you still drive a car? Curious.'
Whether OP uses LLMs or not will not change the past actions of "open" AI. However, by using a local LLM instead of chat gpt, they can avoid contributing to "open"AI in the future and this is not a question that should be discouraged, IMO.
This is an accurate assumption.
Privacy is a quite different thing to usage rights over data you’ve chosen to make public. There’s a long conversation to be had about either or both, but I don’t think it makes sense to conflate the two.
I've dabbled a bit running and studying models locally. Part of that has been not wanting to leak any proprietary data or code. Another part is that I can use it as much as I want and only pay for electricity (and I have a beefy system already for other reasons). But also it's just fun to know that it's the glass and metal box sitting on my desk that I'm interacting with.
I run a build of llama.cpp, grab GGUF quantizations of models off of Hugging Face (e.g., gemma-3-27b-it-qat-q4_0-gguf), and then run the llama.cpp server on it. Something like:
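# illustrative flags only - point -m at whichever GGUF file you downloaded; -ngl offloads layers to the GPU, -c sets the context size
./llama-server -m models/gemma-3-27b-it-q4_0.gguf -ngl 99 -c 8192 --port 8080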
then point my browser at http://127.0.0.1:8080/ to interact with it through the web UI.
I've learned a lot and been keeping on top of open-weight models by following /r/localllama.
If you do go the local route, the main criterion for your machine is having one or more GPUs with as much total VRAM as possible. (I just have a single 4090 with 24GB of VRAM.) Llama.cpp can run purely on CPU and with system RAM but it will be sloooow.
Is that setup fast enough to have a useful dialogue with? How long does it take to respond to simple and more complicated prompts (ballpark figures, because I guess this really varies!)?
The metric you’re looking for is “tokens per second” - it’ll vary pretty widely by GPU and model, but if you make a rough assumption of one token being one word it gives you a decent idea of how quickly a given model will reply on your hardware.
The main difficulty with LLMs running locally is VRAM: even an older GPU can generate tokens pretty quickly (and a 4090 will absolutely fly), but only within the limits of what you can store in memory (ideally actual VRAM on the GPU, but realistically system RAM as well at the cost of some slowdown as data is transferred between the two).
So yeah, conversational speed doesn’t tend to be an issue with a reasonable GPU, but quality will be limited by the size of model you can accommodate.
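As a rough worked example: a ~300-word answer is on the order of 400 tokens, so at 40 tokens/second it arrives in about 10 seconds, while at 2 tokens/second you'd be waiting a few minutes.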
Fascinating, thank you. Of course there's a better metric - tokens per second makes a lot of sense. Thanks for clarifying the nuance of needing (V)RAM too.
At least in English, you can usually ballpark tokens at around 2/3 to 3/4 words per token.
On my 4090 with 24GB VRAM, I tend to favor the 27B-32B parameter models, quantized down to 4-bit or 5-bit (I look for GGUF files just under 20GB) so that they fit completely in VRAM and still leave some room for the context.
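(Back-of-the-envelope: 27-32 billion parameters at 4-5 bits each works out to roughly 15-20GB of weights, which is why those files come in just under 20GB and still leave a little of the 24GB card free for the context.)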
I usually see around 40ish t/s, though Qwen3-30B-A3B is designed for speed and I see about 90 t/s from that. For coding generation experiments, IIRC, I've been able to hit over 100 t/s using speculative decoding with draft models. Either way, very usable.
Now, trying to use a larger model from just system RAM and CPU... that's an exercise in pain. I get about 0.9 t/s trying a 70B model that way.
Does llama.cpp support splitting a model across system RAM and VRAM? In my experiments I've been using ollama, which as I understand it is a wrapper around llama.cpp, on a 6900XT with 16GB VRAM. I can run a 7B model on there and it runs reasonably fast. But once I go for a bigger model (say a 12B model), ollama tells me it's running on the CPU and performance absolutely falls off a cliff.
With the caveat that I do build ML models but I’m absolutely not an LLM expert:
Partial offloading is supported in llama.cpp, but it’s a bit of an art form at the best of times (layer architecture vs parallelism vs memory bandwidth vs bus bandwidth vs compute efficiency) and I’m not sure exactly how ollama handles any of those settings - but pretty severe performance drop off is what I’d expect in most cases, unfortunately. The difference between VRAM bandwidth on the board and anything coming over the PCIe bus is an order of magnitude or more.
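For reference, in plain llama.cpp the split is set explicitly with the GPU-layer count (I believe ollama exposes something similar as a num_gpu option, though I haven't checked how it picks its default); something like:

# illustrative - offload 24 of the model's layers to the GPU and keep the rest in system RAM
./llama-server -m models/some-12b-model-q4.gguf -ngl 24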
You can at least get a bit cleverer with libraries that make use of specific model architectures to optimise for offloading and work around those limitations a bit. Last I checked (which was a good few months ago, so I’m sure things have moved on at least a bit by now) vLLM’s KV caching approach was significantly better for general memory-constrained situations than llama.cpp’s default offloading approach, and Powerinfer went a step further by actually predicting in advance what to transfer to VRAM based on a metamodel designed to predict the LLM’s expected neuron activations (but only supports specifically modified models as a result).
There are some specialised approaches to offloading for MoE models like Mixtral too, because they have relatively clean breakpoints that allow whole chunks of the model to be offloaded as a coherent unit when they’re not likely to be used, but any libraries I’ve seen doing that are even less user friendly than the two I mentioned above!
[Edit] I might be misremembering about the vLLM architecture actually - I think it was GPU-GPU parallelism it was better suited to, not GPU-CPU offloading. Anyway, the point being it's a tricky problem, but there are things that can be optimised for different situations with sufficiently smart runtime libraries.
It depends on where you sit in your faith in these companies. I have a paid subscription to ChatGPT, which allows you to disable the use of your prompts in their training and analytics. Combine that with the "temporary chat" feature, which supposedly doesn't save any of your conversation, and you should be fine. That's if they're behaving as one would expect, but of course all these companies have a malleable relationship with privacy ethics, given how they collected the data in the first place. Personally, I don't get concerned with it. But I'm also the type that if my queries for "72 hour itchy butthole what do I do" and "hairy nipples waxing advice" were leaked, I would shrug and move on with my day. The risk/effort/reward function might look different for you.
Please note, even with the free models of ChatGPT you can opt out of data sharing.
Settings/ Data Controls/ Improve the Model for Everyone - Off.
Also, ChatGPT offers private conversations, where for the duration of the conversation they promise not to retain anything.
They do, but in general people are very skeptical about OpenAI actually keeping their hands off your data. Given how fast-paced/frantic development at OpenAI sometimes seems to be, and how buggy ChatGPT sometimes is as a result, some people might also suspect that OpenAI could "accidentally" use their data somewhere.
It all comes down to trust. And when it comes to data and OpenAI, there is very little.
My understanding is that the way to do this is to run an LLM locally. You need a pretty beefy spec to do so, but it means all your data stays on your machine. I should caveat: this is just off the top of my head, I haven't tried it, and I'm not well read on the subject!
Others have said it already, but locally hosted is the way to go.
Check out text-generation-webui to interact with the LLM. You can find models/discussion on Hugging Face. I like TheBloke's models for anything quantized/distilled, which are essentially two methods of making a big model smaller. Don't worry much about it, or ask ChatGPT to explain the concept if you care. Here is a tutorial you may find useful.
In terms of specific models, it depends somewhat on the task. Some models are better at coding/logic and others are good for writing/role playing. It is not a bad idea to pick a model tailored to your specific task. This can actually be an advantage in some cases over the mainstream models! Here is a thinking model you might try.
This probably depends on where in the world you are and what your local jurisdictions dictate about what companies can do with your data, but in the EU at least, it seems that if you pay for any of the major AI tools like ChatGPT, Gemini or Claude, your data is not used for training the models. They will of course still have some level of access to your data, if for nothing else, then at least to provide the core service.
An added layer of protection would be to use a privacy focused third-party service that offers those models and anonymises your queries. I am only familiar with Kagi's offering, which works fine with multiple AI services, but is somewhat more limited compared to using them directly.
The truly (?) private option, as others have mentioned, is to run an AI locally, but it is a time and resource investment.
How much are you willing to spend on local inference hardware? You can run the full Deepseek R1 locally on a maxed out M3 Ultra Mac Studio with 512GB of RAM. That’s $9500 USD. But it’ll be as close to state-of-the-art as you can run on a single box.
Of course you can get pretty good results for less, but still in the thousands.
I run Qwen2.5-coder 14B @ 4bit on my 24GB M3 MacBook. It’s actually a pretty nice resource to have available when you have no WiFi. But it doesn’t compare with modern full sized LLMs.
Most private: run locally.
Semi private: use the API instead. At least for OpenAI they don't train off of requests through the API unless you opt in, and you can also set the request to not be stored at all. Of course, you're still sending data to them and they could still store it somewhere, intentionally or unintentionally. But that puts them in legal trouble if they misuse data of someone big enough to sue.
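As a rough sketch of what a raw API call looks like (the model name is just an example, and the store flag is my understanding of the current opt-out - check the API docs for the exact retention behaviour):

curl https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gpt-4o-mini",
        "store": false,
        "messages": [{"role": "user", "content": "Summarize my notes..."}]
      }'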
There are websites and applications where you can just enter your API token and have basically the same interface as ChatGPT but it all goes over the API instead so you can use models you can't use with the free plan and can opt out of sharing your messages.
I've got a custom app I wrote that I use via the API and I pay maybe a dollar a month in API costs because I stick to the "mini" models unless I need it to do more complex tasks.
Depending on the use case it might also be worth renting a machine, you can get one with a 45GB VRAM card for $0.2 per hour, which would take 2-3 years of uptime to be the same price as the GPU alone (if you could buy one in the first place). If you need permanent storage it's more expensive I think, but potentially you could just set it up with a small script and ollama, since it takes like 10 minutes. I guess it would make sense for occasional but intense use (e.g. one full day per week).
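(At $0.2/hour, even running it around the clock comes to roughly $145 a month, or about $1750 a year, which is where that 2-3 year break-even against buying the card comes from.)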
I haven't tested it myself though - maybe there are some gotchas in practice (privacy might not be so guaranteed depending on the provider; on vast.ai they have certified data centers, but it's a bit more expensive).
Many others have suggested running tools locally to keep control of your data and maintain your privacy. While it's not an LLM, I have been playing around with Fooocus after someone here suggested it; it uses Stable Diffusion for image generation. It has pretty much everything you need out of the box - you just need to run it and point your browser at it. It even has knobs and levers you can move as you please.
As others have mentioned, it is resource intensive to run, but it's not a problem for me. I have a decent gaming rig that's a few years old, so while not top of the line it's still more than good enough.