post_below's recent activity

  1. Comment on Does generative AI have a natural limit without a major innovation? in ~comp

    post_below

    To clarify the vocab: Gen AI = LLM powered agents = LLM fine tuned for reasoning and tool use running in a harness that provides tools and other functionality.

    Boiling it down there are two steps:

    • Pre-training. The giant dataset, tokenizing it (converting it into numbers) and training the model on it, which is where it learns embeddings (mathematical relationships between the tokens). This step is constrained by the available data like you said.
    • Post-training (or fine-tuning). This step turns the LLM, which can't really do anything except output plausible text in response to input, into a tool that can do useful work. It's where it learns to be an assistant, to use tools, do multi-step reasoning, write code that mostly works, develop an em-dash kink, etc.

    The above compresses a bunch of important sub steps for brevity.
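
    As a rough illustration of the "converting it into numbers" part of pre-training, here is a minimal sketch. It assumes a whitespace tokenizer and randomly initialized vectors, neither of which is how real models work (they use subword tokenizers and learn the embeddings during training). The function names are made up for the example.

```python
# Toy illustration only: real tokenizers use subword schemes and real
# embeddings are learned during training, not assigned at random.
import random


def build_vocab(text):
    """Map each distinct whitespace-separated word to an integer ID."""
    vocab = {}
    for word in text.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def tokenize(text, vocab):
    """Convert text into a list of token IDs."""
    return [vocab[word] for word in text.lower().split()]


def init_embeddings(vocab_size, dim, seed=0):
    """One vector per token ID. Random here; training is what gives them meaning."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


corpus = "the cat sat on the mat"
vocab = build_vocab(corpus)
token_ids = tokenize(corpus, vocab)
embeddings = init_embeddings(len(vocab), dim=4)

# Look up the vector for each token; repeated words share the same vector.
vectors = [embeddings[t] for t in token_ids]
print(token_ids)  # [0, 1, 2, 3, 0, 4]
```

    The point of the sketch is only that the model never sees text, just IDs and the vectors attached to them.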

    Innovation can happen in various parts of both steps, so there's still a lot of room for improvement. There are undoubtedly better ways to do everything involved; much of it has been replaced with better methods multiple times already.

    Model size is likely to become a limiting factor, both because of the limit of what exists in terms of training data and because bigger models are more computationally expensive to train and to run. But that's assuming better ways of getting, vetting and tagging pre-training data aren't discovered. I'd assume that, yes, eventually there will be a ceiling. In terms of compute, the tech is going to keep getting more efficient and the hardware will keep getting better so likely any limits imposed by compute will be temporary.

    Will recursive self improvement hit an event horizon where LLMs will start improving themselves so fast they start rocketing towards AGI? Probably not with the current state of the art. When models generate their own training data they end up entrenching and exaggerating their flaws, and there are a lot of flaws. Some amount of artificial training data is fine (especially if it comes from a better model), but 100% artificial training isn't viable at this point.

    Even if LLMs were to achieve the ability to recursively self improve without ensloppifying themselves, there's no room in the math for the kind of awareness or understanding we'd associate with AGI. The models don't have a conceptual understanding of reality, they only appear to. They would need to invent new technology to get there, not just iterate on existing LLM tech.

    However, will LLM tools contribute to whatever sort of AGI is someday created? It's hard to imagine they won't.

    I can imagine a future world model with pre-training on a much wider dataset that strives to tokenize reality, as opposed to just language and other creative output, having a more realistic path to AGI. Especially if it was fine tuned with some sort of feedback mechanism that could approximate real world cause and effect. Maybe you'd need sensory feedback. But that's speculating on technology that doesn't exist yet. Right now world models are mostly focused on improving robotics. As far as I know, no one has tried to make a super-sized general world model. It would take the resources of one of the frontier labs to attempt it.

    My perspective is that AGI is still roughly comparable to stable fusion power. There's no reason to believe it can't be done, but it will most likely be "just around the corner" for years and years.

    7 votes
  2. Comment on Any fellow software engineers using paid GitHub copilot? in ~comp

    post_below

    You're right, I shouldn't have included enterprise... The team plans offer subscription billing (as opposed to API prices)

    1 vote
  3. Comment on Any fellow software engineers using paid GitHub copilot? in ~comp

    post_below

    Meaning that Opus is dramatically better than Deepseek for complex coding tasks, but if you include cost in the calculation, Deepseek looks a lot better.

    3 votes
  4. Comment on Any fellow software engineers using paid GitHub copilot? in ~comp

    post_below

    Is that 20€ in API credits or a 20€ subscription? The latter is quite a bit more usage and the limits reset in 5 hour windows. Both options are available in enterprise and team setups.

    An alternative you might suggest is OpenAI at 20/month subscriptions. For Claude you need 100/month subscriptions for serious usage but with GPT 5.4 you can get a lot farther on 20/month.

    But no matter how much they pay, artisanal code is only mostly dead. Miracle Max's pill will come in the form of everyone realizing that having no one left that can actually code isn't working out as planned!

    1 vote
  5. Comment on Any fellow software engineers using paid GitHub copilot? in ~comp

    post_below

    Putting aside the benchmarks, since it varies widely depending on which ones you look at, Deepseek is most definitely not on par with Opus 4.6.

    Unless you factor cost in, then Deepseek is lightyears ahead of Opus.

  6. Comment on Claude Fable 5 and Claude Mythos 5 in ~tech

    post_below

    I was thinking in terms of something homegrown. Anyone could of course use an open model as a starting point, but they'd then be reliant on that provider since they wouldn't have their own training pipeline. For a government or academic solution they'd ideally start from scratch.

    But yeah the Chinese open weights models are pretty good and it's great that they exist for all sorts of reasons.

    3 votes
  7. Comment on Claude Fable 5 and Claude Mythos 5 in ~tech

    post_below

    A public LLM of some kind would be amazing, not just for science, for the whole range of applications.

    Don't underestimate the size of the task though. It would need nation state level funding, and success would hinge on convincing the right experts to get on board.

    The EU could maybe pull it off, it would make a lot of sense for them given their recent distaste for US tech.

    I imagine various people in government and academia have at least talked about it by now.

    3 votes
  8. Comment on What do you think is the best sandwich? in ~food

    post_below

    I've had a surprising number of conversations, at various times in my life, about cooking turkeys outside of the holidays. Normally they happen around Thanksgiving, sometimes they result in plans, sometimes they even acknowledge the historically low ratio of plans to non-holiday turkeys. Rarely do they result in actual turkey.

    Mostly because I'm ok with just annual turkey sandwiches.

  9. Comment on What do you think is the best sandwich? in ~food

    post_below

    Leftover Thanksgiving turkey sandwich. Because it's the only time shredded turkey is available. It wouldn't be as good otherwise.

    If you put cranberry sauce on it, I accept that but take it somewhere else so I don't have to watch.

    3 votes
  10. Comment on Claude Fable 5 and Claude Mythos 5 in ~tech

    post_below

    Now that Mythos is public in the security-hobbled form of Fable, we don't have to speculate. In my testing yesterday Fable found two legitimate vulnerabilities that previous models (and I) had missed. And that was in non-security-focused scans (because most security related prompts currently get downgraded to Opus 4.8). In both cases they were subtle issues that were easy to miss.

    It's true that models like Opus 4.8 and GPT 5.5 can be wrangled to find a lot of security issues. In the hands of a decent engineer, with a good harness, you can use either of those models to find and patch or exploit all sorts of vulnerabilities. It's an iterative process though. According to Anthropic, the reason for the controlled release was that Mythos is better at chaining vulnerabilities into working exploits on its own. It would allow anyone, including non-engineers, to find and exploit holes in widely used software. Glasswing gave those companies a chance to patch many of the holes in advance.

    I don't have access to unrestricted Mythos, just Fable, so I can't test the full extent of the capabilities Anthropic is claiming. But seeing Fable's capabilities in other areas of coding I have no doubt they're telling the truth. It's significantly better at putting pieces together into a working thesis and then following it through to a given conclusion, which would definitely generalize into security research.

    That said, I don't think Project Glasswing was wholly altruistic. When they released Mythos they didn't have anywhere near enough available compute to handle a wide release. So while the safety angle was legitimate, it also served their purposes to do a limited release while they scrambled to find the compute for a full scale release. And yeah, the hype didn't hurt either. But the aforementioned hot takes that it was all smoke and mirrors are now demonstrably false.

    13 votes
  11. Comment on Claude Fable 5 and Claude Mythos 5 in ~tech

    post_below

    Yes, thanks for adding that. Currently the moat is velocity, none of the cheap/open alternatives have been able to get close enough to the frontier to tempt the majority of users. And velocity is expensive... but it can work as a moat as long as there's plenty of frontier available. If they hit a plateau everyone will probably catch up.

    I've been wondering if Anthropic started their IPO process when they did in order to go public when they had a clear lead. Opus 4.8 put them pretty far ahead, Fable/Mythos just makes it undeniable. For now at least.

    12 votes
  12. Comment on Claude Fable 5 and Claude Mythos 5 in ~tech

    post_below

    snort that's great. It's relevant too: in my experience so far, Fable (I'm having a hard time getting used to that name for some reason) is significantly better at extrapolating intent with less prompting and then making mostly reasonable calls on how to proceed without handholding.

    It looks like there was a heavy focus on long running autonomous tasks during fine tuning. I have mixed feelings about this.

    7 votes
  13. Comment on Claude Fable 5 and Claude Mythos 5 in ~tech

    post_below

    That's interesting hands-on experience, thanks for sharing... As soon as I saw the 5% refusal rate in the model card I knew it was going to be a problem. 5% refusal across all queries translates to near total refusal for the subset they've targeted (cybersecurity, microbiology and chemistry).
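
    The arithmetic behind that can be sketched roughly as follows. It assumes every refusal lands inside the targeted subset, and the subset's share of total traffic is not public, so the share values below are purely hypothetical. The function name is made up for the example.

```python
def subset_refusal_rate(overall_rate, subset_share):
    """Refusal rate inside the targeted subset.

    Assumes all refusals fall inside the subset. subset_share is the
    fraction of all queries that belong to the subset (hypothetical here).
    """
    if not 0 < subset_share <= 1:
        raise ValueError("subset_share must be in (0, 1]")
    return min(1.0, overall_rate / subset_share)


OVERALL_REFUSAL = 0.05  # the 5% figure from the model card

# Hypothetical shares of traffic that touch the targeted topics.
for share in (0.05, 0.06, 0.10):
    rate = subset_refusal_rate(OVERALL_REFUSAL, share)
    print(f"subset share {share:.0%} -> refusal within subset {rate:.0%}")
```

    The smaller the targeted subset is relative to all traffic, the closer a 5% overall rate gets to refusing everything in it.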

    I can confirm that the refusal rate (or actually Opus downgrade rate) for cyber is extremely high and sometimes it falsely flags things that have nothing to do with security.

    My read is that they erred dramatically in favor of refusal just to get the release out the door and then will relax restrictions incrementally once they have more usage data and testing.

    9 votes
  14. Comment on Claude Fable 5 and Claude Mythos 5 in ~tech

    post_below
    • Exemplary

    I've had a chance to test Fable and I can confirm that it's as good as the model card implies. As I speculated a while back in some thread or another, it's a step change similar to Sonnet -> Opus. When you consider what Opus 4.8 is capable of, that's significant.

    There are going to be a lot of hot takes, just like there are with every frontier model release, and most of them are going to be wrong, just as they've been with every model release.

    One of the most popular hot takes was that Project Glasswing was disingenuous marketing. That being when Anthropic released Mythos to a limited selection of companies and government groups that were deemed critical to digital infrastructure. The takes continued even after people who had access started saying that this thing was legit. Not surprising: there are big feelings and huge interest in this topic, and hot takes generate a lot of clicks, engagement and catharsis.

    Well, having tested Fable (the name for the Mythos public release) on a large and fairly complex codebase, it wasn't empty hype. Its ability to connect the dots between disparate functionality, reason about it, and follow the chain of cause and effect to impressive depth, would almost definitely translate to unprecedented cyber capability. All the people who said Mythos was legit weren't lying.

    Some things that stand out to me so far: Fable has much broader/deeper knowledge of software architecture which translates to better "taste". Still far from perfect but a significant step up. Second, it catches things that were functionally invisible to smaller models (because Fable is almost definitely larger than anything previously released, fine tuning alone couldn't accomplish this). Because it "sees" more, it can do more things that previously required human expertise.

    Another popular set of hot takes in the other direction is that these models are now so good that they'll kill software engineering as a career, replace all knowledge workers, advance to AGI sooner rather than later, and so on. I don't think that's true. Fable is good but it's still a dumbass in a lot of ways. That's baked into the core of LLM technology. It's better at everything but it's still only mimicking reasoning and understanding, which leads to all sorts of mistakes. There's no actual awareness anywhere in the loop. The illusion thereof is remarkable sometimes though.

    The last bit is concerning, Fable is another step up in the ability of these models to fool people who can't fluently read the code themselves into thinking the code quality is far better than it is. And it will lead to a new wave of engineers convincing themselves they don't need to examine the output anymore. Who knows exactly how that will ultimately play out. I just know it makes me uncomfortable.

    Yet another set of hot takes centers around the idea that advancement has plateaued or soon will, at which point the bubble will burst and it will all come crashing down. Adjacent to this is the idea that the frontier companies will never be able to make a profit on inference, which is the key area where profitability will matter in the long run, and that this is evidence of the impending fall. But the thing is that there is no credible public information about inference profit and loss at these companies at all. It's all wild speculation.

    Except we do know exactly how much inference costs for open models, and there is very clearly a lot of room for profit. Personally I suspect that Anthropic and OpenAI are already making a profit on inference and have been for some time. Anthropic likely for most of a year and OpenAI since (at the latest) GPT 5.5 when they significantly raised prices. It's even possible that the current prices are fully sustainable. More so when you factor in that inference costs are almost definitely going to continue to come down (relative to utility). Again that's something we can observe with open models and there's still a lot of room for innovation and optimization in a new tech like this. The open question is whether or not they'll ever be able to make enough profit to justify the astronomical investment and valuations.

    As far as a plateau goes... Seems reasonable to assume that will happen at some point but so far it keeps stubbornly not happening.

    In case someone gets the wrong idea, I'm not attempting to defend these companies, I very much wish that LLM technology wasn't currently led by trillion dollar companies and their tech elite executives. I wish all the success to the open models, which will be the fallback if (when) the big labs start to enshittify or the bubble bursts. The latter being something else that keeps stubbornly refusing to happen.

    53 votes
  15. Comment on When AI builds itself — progress toward recursive self-improvement and its implications in ~tech

    post_below

    I think you may have missed a few parts of the story. Anthropic was indeed providing inference to the DoD, but then right after Claude was used successfully to help capture Maduro (note that last part is just "rumored"), Hegseth demanded they remove restrictions on use and allow "any lawful use" where what was and wasn't lawful would of course be up to Hegseth and the DoD to determine. That part is public record. Anthropic refused to budge on their "red lines" of no mass domestic surveillance and no automated weapons. They continued to hold the line even when Hegseth threatened to designate them a supply chain risk, which at the time seemed absurd as the designation was created for foreign threats and had never been used on a domestic company.

    Except he actually followed through and did it when Anthropic wouldn't give in. Meaning that now Anthropic can't be used by the DoD or by any contractor while working with the DoD. It was shocking, at least to me, as I expected them to roll over so they could keep getting those sweet government (and government contractor) paychecks. It's very clear that, whatever you think of their ethics, they genuinely believe in them. At least for now. You can argue that their safety stance is flawed, but it's difficult to make the case that it's posturing.

    After the designation, within days of the DoD and Anthropic deal falling apart, OpenAI stepped in and agreed to let the DoD use their inference however they want.

    19 votes
  16. Comment on People who want less AI are breaking up with Google Search in ~tech

    post_below

    It's not just that the tech vocabulary is wrong, it's just not well written.

    "As AI chatbots are refined and platforms like Google continue to integrate the technology, some people are looking for ways to distance themselves from the technology."

    Sometimes it reads like a high schooler wrote it, other times it reads like it was written by someone for whom English is a second language. I feel like either the editor rubber stamped it or there's no real editing for the online only posts.

    Doesn't look LLM written to me though, I agree the current models wouldn't make some of the mistakes in the article.

    20 votes
  17. Comment on Bernie Sanders: The public should own half of the big AI companies in ~society

    post_below

    Like so much of Bernie's platform, it's the right thing to do and it's the wrong country to get it done in.

    Doesn't mean we shouldn't keep trying though. Gotta admire his persistence.

    It really does make sense, it's communal intellectual property, everyone should benefit from it. And there's no way it happens voluntarily with (soon to be) public companies.

    19 votes
  18. Comment on It's not just X. It's Y. in ~humanities

    post_below

    Good points. I agree that language plays an important part in cognition. Though in my experience a lot of reasoning also happens below language. Or maybe outside of language, definitely not language mediated.

    About the semantic void bit... Despite the fact that I cringe a little when I see LessWrong at the top of the page, I read it anyway. It was really interesting, thanks for the link. It sounds like it was (roughly) about injecting IDs for tokens that don't exist (they're in the void). Essentially accessing inference from weights the normal input path wouldn't ever have access to.

    So the inference you get out of LLMs is actually in the opposite space to the semantic void. But correct me if I'm wrong, I half skimmed. I thought the implication that the geometric center of the space is void rather than densely populated was compelling, even if it doesn't have any practical implications.

    In any case reading LLM generated prose does often feel void-like.

    2 votes
  19. Comment on Clanker: A word for the machine in ~tech

    post_below

    It goes back even farther than that: in the context of bots, to the 1950s. And the word itself goes back to the 18th century when it meant a lie.

    It doesn't shock me that people on the internet have found a way to turn it into a bigger deal though. If someone isn't offended, what's even the point?

    10 votes
  20. Comment on What change would make you quit Tildes? in ~tildes

    post_below

    No you're right, open signups would result in an almost immediate flood of automated posts. First Tildes would need automated protection against abuse that it doesn't currently need, then it would need more moderators. Invite only solves a lot of problems at once.

    13 votes