43 votes

A.T.L.A.S: outperform Claude Sonnet with a 14B local model and RTX 5060 Ti

10 comments

  1. [2]
    hungariantoast
    Link
    I originally came across this on Lobsters and the OP there had a pretty good summary of the methodology:

    A.T.L.A.S achieves 74.6% LiveCodeBench pass@1-v(k=3) with a frozen 14B model on a single consumer GPU -- up from 36-41% in V2 -- through constraint-driven generation and self-verified iterative refinement. The premise: wrap a frozen smaller model in intelligent infrastructure -- structured generation, energy-based verification, self-verified repair -- and it can compete with frontier API models at a fraction of the cost. No fine-tuning, no API calls, no cloud. Fully self-hosted -- no data leaves the machine, no API keys required, no usage metering. One GPU, one box.

    You'd have to try it out to see how it works for you, but the trick they use is pretty clever. When you ask an AI to write code, it doesn’t always get it right. Sometimes the code has bugs, sometimes it misunderstands the problem entirely. A naive way to address that is to generate a few solutions and test each one. The odds that at least one works go way up. ATLAS generates multiple attempts, running each through a test suite. Each retry also gets told what went wrong with the previous attempt, so it can try to avoid the same mistake.
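
    In loop form, that idea is roughly the minimal sketch below. This is not code from the ATLAS repo; `generate` and `run_tests` are placeholders for the local model call and the sandboxed test harness.

    ```python
    from typing import Callable, Optional, Tuple

    def solve_with_retries(
        problem: str,
        generate: Callable[[str, str], str],            # (problem, feedback) -> candidate code
        run_tests: Callable[[str], Tuple[bool, str]],   # candidate code -> (passed, error report)
        max_attempts: int = 3,
    ) -> Optional[str]:
        """Generate a solution, test it in a sandbox, and retry with feedback on failure."""
        feedback = ""
        for _ in range(max_attempts):
            candidate = generate(problem, feedback)
            passed, report = run_tests(candidate)
            if passed:
                return candidate
            # The next attempt is told what went wrong with this one.
            feedback = f"The previous attempt failed:\n{report}"
        return None
    ```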

    But this can be pretty slow, since you have to run the code in an isolated environment, check the outputs, and wait for it to finish. Doing that for every candidate quickly adds up. So ATLAS adds a shortcut to avoid unnecessary testing: instead of generating solutions and testing all of them, it tries to predict which one is most likely correct before running any tests.

    ATLAS also asks the model for an embedding of what it just wrote, which acts as a fingerprint. Two similar pieces of code will produce similar fingerprints. A well-written, confident solution will produce a different fingerprint than a confused, buggy one.

    These fingerprints get fed into a separate, much smaller neural network called the Cost Field. This little network was trained ahead of time on examples where they already knew which solutions were correct and which were wrong. It learned to assign a score to each fingerprint. Correct solutions get a low score and incorrect ones get a high one.

    So the process is to generate multiple solutions, get their fingerprints, score each one, and pick the lowest. Only that one gets tested. The Cost Field picks correctly about 88% of the time according to the repo.
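
    A rough sketch of that selection step, assuming the fingerprints are fixed-size embedding vectors and the Cost Field is a small pre-trained MLP (the layer sizes and names here are guesses for illustration, not taken from the repo):

    ```python
    import torch
    import torch.nn as nn

    class CostField(nn.Module):
        """Tiny scorer trained ahead of time on embeddings of known-correct and
        known-incorrect solutions: low cost ~ likely correct, high cost ~ likely buggy."""

        def __init__(self, embed_dim: int = 1024):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(embed_dim, 256),
                nn.ReLU(),
                nn.Linear(256, 1),
            )

        def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
            # embeddings: (num_candidates, embed_dim) -> costs: (num_candidates,)
            return self.net(embeddings).squeeze(-1)

    def pick_candidate(embeddings: torch.Tensor, cost_field: CostField) -> int:
        """Score every candidate's fingerprint and return the index of the cheapest one;
        only that candidate gets run through the test suite."""
        with torch.no_grad():
            costs = cost_field(embeddings)
        return int(torch.argmin(costs).item())
    ```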

    20 votes
    1. Omnicrola
      Link Parent

      If I follow along with what they're describing, then I'm unclear on how easily this is generalizable. If the smaller Cost Field network needs to be trained on what a good solution looks like, doesn't that make it limited to a fairly narrow slice of problems that it was trained on? Or am I getting tripped up by the simplified explanation?

      14 votes
  2. [5]
    V17
    Link

    Running locally is great for privacy, but if this really is that good, even renting it by buying credits from some datacenter provider could be awesome, because the price per task is about 10x lower than with the big providers like GPT or Claude.

    Ever since I needed a vision-language model for OCR for a hobby project and found out that I could OCR 9000 webcomics for about 2.50 USD, I've been quite excited about similar models. Even the biggest Qwen model, run through an independent provider, is that much cheaper than ChatGPT and, for this task, just as good.
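
    For what it's worth, a run like that comes down to a few lines against OpenRouter's OpenAI-compatible endpoint; something like the sketch below, where the exact model ID is illustrative and worth checking against their current catalogue.

    ```python
    import base64
    from openai import OpenAI

    # OpenRouter speaks the OpenAI API; only the base URL and key differ.
    client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_OPENROUTER_KEY")

    def ocr_page(path: str) -> str:
        """Send one comic page to a vision-language model and return the transcribed text."""
        with open(path, "rb") as f:
            data_uri = "data:image/png;base64," + base64.b64encode(f.read()).decode()
        response = client.chat.completions.create(
            model="qwen/qwen2.5-vl-72b-instruct",  # illustrative model ID, check the catalogue
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": "Transcribe all text in this image."},
                    {"type": "image_url", "image_url": {"url": data_uri}},
                ],
            }],
        )
        return response.choices[0].message.content
    ```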

    7 votes
    1. [4]
      Omnicrola
      Link Parent

      I've considered doing this (loading a large model onto a cloud host) but haven't found a project that warranted it yet, so I haven't looked at provider pricing or features.

      Who did you end up using?

      2 votes
      1. V17
        Link Parent

        I just used openrouter.ai because it's simple and it let me test all the models before settling on one that's good enough. They obviously need some overhead on cost to be sustainable, but it seemed low enough to be inconsequential for me.

        3 votes
      2. [2]
        hungariantoast
        Link Parent

        Models.dev is similar to OpenRouter in that it lists information for inference providers. However, it's just a database, not a service. The idea is: you figure out what model you want to use, then go to models.dev to find the best provider for that model. If you are only interested in using a single model (or family of models, like Qwen or GLM), then it is often cheaper to use the provider's API directly (or their subscription, depending on usage) than to go through OpenRouter.

        OpenCode Zen is a "provider aggregator" like OpenRouter, where you have a single account, but get access to many providers and models. Zen does not have a "platform fee" like OpenRouter. The models and providers available through Zen are also more curated, but that means Zen's selection of models and providers is limited compared to OpenRouter.

        There's also OpenCode Go. I actually bought this a few days ago, because the first month is only $5, and I wanted to try larger models that I cannot self host. The subscription is normally $10/month though, and only gives access to four models:

        • GLM-5
        • Kimi K2.5
        • MiniMax M2.7
        • MiniMax M2.5

        If you have never worked with big models before, OpenCode Go is what I would recommend trying first. The GLM and Kimi models are good. The MiniMax models are fine for most things, but not as capable in my (limited) experience.

        I also think Go's usage limits are good. I have read people online complain about hitting their weekly limits in a single day. I don't know how they could manage that, unless they were running a model in an automated loop to slopcode a project entirely with AI. With the way that I use models[1], I have struggled to hit 10% of the subscription's five-hour usage limit, let alone an entire week's worth. I'm still very new to using LLMs though. Your experience might be different.

        If your AI usage is very high, then a subscription from a specific provider will (most likely) work out to be cheaper than API pricing (API pricing through that provider alone, or through an aggregator). I am not aware of any "provider aggregators" like OpenRouter or OpenCode Zen that offer a subscription instead of API pricing.


        1. Right now the main way I use AI is to do maintenance work. For example I might tell the model something like: "scan this code file, find any bugs or uncaught errors, report them in bugs.md with your recommended fix".
        3 votes
        1. unkz
          Link Parent

          It’s easy to start hitting limits if you run things in parallel. My weekly code audit gets into this territory: I have it create GitHub issues for all the things it finds, then it runs through them all in parallel to produce fully tested pull requests that I can sift through afterwards.

          3 votes
  3. Carrow
    (edited )
    Link

    I experimented with rnj-1 on my local 3070 over the weekend, but at 8B it fit on my card. In their self-reported metrics, it generally outperforms Qwen 3's 8B (ATLAS ran with Qwen 3 14B). I wonder what kind of quality ATLAS can provide for smaller models.

    6 votes
  4. [2]
    carsonc
    Link

    I would be interested in trying this. The prices for the cards are not high and I have enjoyed working with Claude Sonnet. Can someone provide a rundown on how this might compare on general knowledge or scientific tasks?

    2 votes
    1. Wulfsta
      Link Parent

      Models this small do not have much general knowledge. They are usually better at semantic extraction or transformation.

      6 votes