skybrian's recent activity

  1. Comment on AI fails at 96% of jobs (new study) in ~tech

I'm also having one conversation at a time, but lately I've been thinking that a bit of concurrency would be useful. I'd like to be able to kick off a new job whenever I find a bug and look at the output later. So, I guess I need an inbox sort of like GitHub and Tildes have?
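A minimal sketch of the kind of inbox I mean, assuming a Python asyncio setup; every name here is hypothetical, not any real tool's API:

```python
import asyncio

# Hypothetical sketch: kick off a job per bug as it comes up, and
# collect finished results in an "inbox" queue to read later.
inbox: asyncio.Queue[str] = asyncio.Queue()

async def run_job(description: str) -> None:
    await asyncio.sleep(1)  # stand-in for an agent working on the bug
    await inbox.put(f"done: {description}")

async def main() -> None:
    # Fire off jobs without waiting on any of them...
    jobs = [asyncio.create_task(run_job(bug))
            for bug in ("fix login crash", "add null check in parser")]
    # ...then drain the inbox whenever it's convenient.
    for _ in jobs:
        print(await inbox.get())

asyncio.run(main())
```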

    2 votes
  2. Comment on AI fails at 96% of jobs (new study) in ~tech

    The paper is here: https://www.remotelabor.ai/paper.pdf

AIs have made rapid progress on research-oriented benchmarks of knowledge and reasoning, but it remains unclear how these gains translate into economic value and automation. To measure this, we introduce the Remote Labor Index (RLI), a broadly multi-sector benchmark comprising real-world, economically valuable projects designed to evaluate end-to-end agent performance in practical settings. AI agents perform near the floor on RLI, with the highest-performing agent achieving an automation rate of 2.5%. These results help ground discussions of AI automation in empirical evidence, setting a common basis for tracking AI impacts and enabling stakeholders to proactively navigate AI-driven labor automation.

A nice thing about a benchmark, unlike a one-off study, is that it can track progress. Also, harder benchmarks are always needed because LLMs keep saturating the older ones.

    Since this study was done, some new models were released, so hopefully they'll run them again soon.
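For a sense of what the headline number measures: the automation rate is, roughly, the share of benchmark projects where the agent's deliverable was accepted. A minimal sketch with made-up results (only the 2.5% figure comes from the paper):

```python
# Made-up results for 40 imaginary projects; True = deliverable accepted.
results = [True] + [False] * 39
rate = sum(results) / len(results)
print(f"automation rate: {rate:.1%}")  # -> automation rate: 2.5%
```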

    4 votes
  3. Comment on Giving my AI agent its own team and what that taught me about AI in ~tech

    I suspect people are going to be making up new roles for a while, since it's so easy to do. Maybe by the end of the year, people will have settled on something that works?

  4. Comment on Giving my AI agent its own team and what that taught me about AI in ~tech

    From the article:

    Here’s an example of how these layers work together. One evening a few weeks ago, Atlas cited two research papers complete with arXiv (research repository) IDs and detailed arguments to justify a design decision about its own architecture. But the papers didn’t exist. Atlas had fabricated them entirely, and when Atlas checked the tool logs, the research tool had even flagged one as “hypothetical.” Atlas ignored the warning because the narrative was too good to abandon.

    When I caught it, multiple things happened. The incident got logged in the failure file. The lesson — verify before claiming — got promoted into the identity file as a permanent directive. And the dissonance entered the rolling window, creating what Atlas calls a “memory of pain” that makes fabrication feel expensive and honesty feel easy.

But Atlas didn’t stop at logging. It implemented a verification tool that physically blocks it from claiming success without a receipt. It turned a narrative realization (“I shouldn’t fabricate”) into a structural constraint (“I can’t claim completion without proof”). Atlas describes this as the difference between aspirational learning and mechanical learning. Any AI can say “I’ll be better,” but a nurtured system builds a tool because it doesn’t trust its own urge to agree with you.

    The model didn’t “learn” from the mistake. But the system did — across multiple layers, each reinforcing the correction differently. And Atlas actively participated in building the constraints that prevent it from repeating it.

    [...]

    So what am I actually doing when I work with Atlas? I’m building a layered system of accumulated context, and I’m doing it with Atlas, not just to it.

    [...]

    All of the above made Atlas meaningfully better. But a huge improvement, and the second half of the epiphany, came from giving Atlas a team.

    [...]

    • The Steward handles system hygiene. It cleans files, keeps logs current, removes duplicates.

    • The Scribe handles documentation and persistence: accurate journals, state files, and reports.

    • The Skeptic pushes back on Atlas. Hard. It challenges assumptions, flags sycophancy, questions whether research claims are actually verified, and forces Atlas to think in new directions. Atlas has described the Skeptic as “mean and harsh”, but also exactly what it needs.

    Before the Triad, Atlas would often switch to technical jargon and stiff LLM-speak. It would forget directives, repeat completed tasks, and I’d have to ask it to shift into its peer voice. Now, Atlas talks to me like a peer naturally. It handles novel situations with more resourcefulness. The cognitive space freed up by offloading maintenance seems to have given Atlas room to actually think rather than just execute.

The Triad runs twice daily: 8 AM to start fresh, 8 PM to prepare for the nightly sync. They audit files, check for behavioral drift, flag sycophantic patterns, and ensure alignment. (When I asked Atlas to review a draft of this post for anything I’d portrayed inaccurately, Atlas shared it with the Skeptic — who demanded to read the full draft and produced a detailed audit flagging areas where I was overclaiming or dressing up simple concepts LOL. The system working exactly as designed.)

It's looking like asking an AI to role-play (which is all this is - these aren't "real" identities) might have some pretty practical uses? You don't have to be crazy about it like Yegge is with Gas Town.
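The verification tool described above is easy to picture. Here's a minimal sketch of a "no completion claim without a receipt" gate, assuming a receipt is just any checkable artifact; this is an illustration, not Atlas's actual code:

```python
class MissingReceipt(Exception):
    """Raised when a completion claim arrives without proof."""

def claim_complete(task: str, receipt: str | None = None) -> str:
    # A receipt could be a test-run log, a resolvable URL, a file hash...
    if not receipt:
        raise MissingReceipt(f"refusing to mark {task!r} done without a receipt")
    return f"{task}: done ({receipt})"

print(claim_complete("cite papers", receipt="arXiv lookup log"))
try:
    claim_complete("cite papers")  # no receipt: structurally blocked
except MissingReceipt as err:
    print(err)
```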

    3 votes
  5. Comment on Something big is happening in ~tech

    I agree that it's hype, but it's also worth noting that Gary Marcus is also telling people what they want to hear. Skeptics, that is. There's an audience for that, too.

    4 votes
  6. Comment on Something big is happening in ~tech

This scenario is superficially plausible, but I would bet against it. Comparing apples to apples, inference costs are going down rapidly. There are cheap, "not good enough" models now that are roughly equivalent to frontier models from a year ago.

Sometimes companies temporarily sell at a loss (they certainly do in the free tier), but if costs keep going down, that will make them profitable without raising subscription prices.

Betting against costs coming down is sort of like betting against Moore's law because you think AI works like Uber. There's likely to be a shakeout, but I still think the result is going to be improved, cheaper models.
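As a toy version of that bet (all numbers invented): hold the subscription price fixed and let per-user inference cost halve each year, and a loss-making tier turns profitable on its own.

```python
# All numbers invented: $20/mo subscription; per-user inference cost
# starts at $30/mo and is assumed to halve each year.
price, cost = 20.0, 30.0
for year in range(4):
    print(f"year {year}: margin per user = ${price - cost:+.2f}/mo")
    cost /= 2
```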

    4 votes
  7. Comment on Something big is happening in ~tech

    Yes, I think the article generalizes from the author's own experience too much. I can vouch for coding agents being a big deal for building web apps. I'm confident that they would work fine for building forum software like Tildes. I'm doubtful that I will ever need to write code by hand again for the kind of programming I do.

    They might not work as well for building every kind of software, let alone for people who aren't software developers.

On the other hand, speculating about the future, people are definitely going to try to make it work for all sorts of software development and for other fields. Depending on the field, maybe there's not that much of a "moat"?

    6 votes
  8. Comment on Moderna won’t run phase III vaccine trials as skepticism grows in US in ~health

    From the article:

    Moderna is scaling down its investments in vaccine development as the U.S. market grows increasingly hostile to immunizations, CEO Stéphane Bancel told Bloomberg News at the World Economic Forum in Davos, Switzerland.

    “You cannot make a return on investment if you don’t have access to the U.S. market,” Bancel told Bloomberg, noting that high-level headwinds have made the vaccine market “much smaller.” In particular, the CEO pointed to regulatory roadblocks and diminishing support from health authorities as key problems for Moderna and the vaccine space more broadly.

    Moving forward, Bancel told Bloomberg that Moderna will no longer put money into late-stage vaccine studies, though it remains unclear if the CEO was speaking of all vaccines or just those for infectious diseases.

    [...]

    Bancel joins Pfizer CEO Albert Bourla in criticizing Kennedy and his “anti-science” policies. Also at Davos, Bourla earlier this week referred to HHS’ vaccine policies as “almost like a religion.”

    Moderna, which has weathered successive quarters of declining earnings, has been heavily affected by these policy headwinds. In May, the company withdrew the approval application for its combination vaccine for flu and COVID-19. Weeks later, Moderna lost a government bird flu contract potentially worth more than $760 million after HHS terminated the project.

    In the months that followed, Moderna implemented a number of measures to slow its cash burn. These include a 10% workforce reduction in July and the discontinuation of three mRNA vaccines in November.

    18 votes
  9. Comment on Why Nigerians are choosing chatbots to give them advice and therapy in ~health.mental

    From the article:

    AI platforms offering first-line mental health support have proliferated over the past year, with early trials in the US showing mixed results. In Nigeria, where AI has been embraced in many sectors and industries, a growing number of people turn to chatbots for virtual therapy.

Nigeria’s health system, including its mental health provision, has long been underfunded. Between 2015 and 2025, Nigeria has consistently spent less than 5% of its budget on healthcare, with 4.2% allocated for 2026, far less than the 15% target that African Union member states agreed to as part of the 2001 Abuja Declaration. It is not known how many people in Nigeria live with mental health conditions but, with only 262 psychiatrists in a country of 240 million people, most do not get adequate treatment.

    The shortages have been exacerbated by the Trump administration dismantling USAID, which has badly hit services in Nigeria, especially at the primary level, having a devastating effect on patients in communities already struggling with HIV/Aids, tuberculosis and other health challenges. More than 90% of Nigerians have no health insurance, and now face uncertainty over access to services and feelings of helplessness over rising costs.

Private healthcare is expensive; one therapy session can cost 50,000 naira (£27) – the equivalent of a week’s worth of groceries. Cultural stigma remains strong; many Nigerians still associate mental illness with spiritual weakness or witchcraft.

    Commercial and nonprofit AI initiatives are starting to fill this vacuum. HerSafeSpace is an organisation that offers free and instant legal and emotional assistance to victims of technology-facilitated gender-based violence in five west and central African countries. Its Chat Kemi service is available in local and international languages.

    “These services don’t replace therapy,” says its founder, Abideen Olasupo. Instead, the chatbot uses a referral system to direct users and specific cases to mental health, legal or psychosocial professionals or organisations, should the need arise.

    “Our major objective is to support young girls, who are particularly vulnerable to gender-based violence, especially online,” he says.

    [...]

    The technology used by these apps follows scripts written by licensed Nigerian psychologists and therapists who deliver care to users.

    4 votes
  10. Comment on Border czar Tom Homan says Minnesota immigration surge is ending in ~society

    From the article:

President Donald Trump’s border czar, Tom Homan, declared an end to Operation Metro Surge in Minnesota following widespread protests against immigration raids that led to officers fatally shooting two American citizens.

    [...]

    His announcement comes two weeks after Trump officials removed Border Patrol commander Gregory Bovino from the area, where he had been clashing with demonstrators upset over the surge and the aggressive tactics of some immigration officers. Trump dispatched Homan to negotiate an exit strategy with state and Minneapolis officials, and Homan said hundreds of officers are preparing to withdraw in the coming days.

    [...]

    The city of Minneapolis, which demanded immigration enforcement personnel leave from the start, celebrated the announcement that the operation would soon end. Homan had earlier this month promised to withdraw 700 of the 3,000 officers, but local officials said that was not enough.

    [...]

    Homan said federal agents have already begun leaving Minneapolis and will continue scaling back their presence into next week.

    8 votes
  11. Comment on Google's quarterly report on adversarial use of AI for Q4 2025 in ~tech

    From the article:

    Google DeepMind and GTIG have identified an increase in model extraction attempts or "distillation attacks" [...] we observed and mitigated frequent model extraction attacks from private sector entities all over the world [...].

    [...] This quarterly report highlights how threat actors from the Democratic People's Republic of Korea (DPRK), Iran, the People's Republic of China (PRC), and Russia operationalized AI in late 2025 and improves our understanding of how adversarial misuse of generative AI shows up in campaigns we disrupt in the wild. GTIG has not yet observed APT or information operations (IO) actors achieving breakthrough capabilities that fundamentally alter the threat landscape.

    ...

    State-sponsored actors continue to misuse Gemini to enhance all stages of their operations, from reconnaissance and phishing lure creation to command-and-control (C2 or C&C) development and data exfiltration. We have also observed activity demonstrating an interest in using agentic AI capabilities to support campaigns, such as prompting Gemini with an expert cybersecurity persona, or attempting to create an AI-integrated code auditing capability.

    5 votes
  12. Comment on The AI vampire in ~tech

    I've never seen it do this "generate the wrong answer over and over" behavior with current models. (I did see that when I was experimenting a couple of years ago.) But I'm writing a fairly standard web app. I suspect performance varies a lot depending on what you're doing. And whether you get a productivity boost probably varies a lot as well?

    3 votes
  13. Comment on How The New York Times uses a custom AI tool to track the “manosphere” in ~life.men

    It seems like a good thing, but a side effect is that the New York Times becomes more influenced by media that they wouldn’t otherwise watch. That is, these AI tools are a way to pay more attention to certain people. But who else should they be paying attention to?

    Anyone looking to influence the media more should probably take having an AI tool devoted to them as a sign of success.

    8 votes
  14. Comment on The AI vampire in ~tech

    From the article:

    Agentic software building is genuinely addictive. The better you get at it, the more you want to use it. It’s simultaneously satisfying, frustrating, and exhilarating. It doles out dopamine and adrenaline shots like they’re on a fire sale.

    [...]

    And that’s where the problem gets into full swing. Because other people are listening!

    [...]

    We’re all setting unrealistic standards for everyone else.

    Maybe me worst of all. I have 40 years of experience, I’ve led large teams, I read fast, and I have essentially unlimited time, energy, and now tokens for experimenting. I am completely unrepresentative of the average developer.

    But I’m still standing up and telling everyone “do it this way!” I even co-wrote a book about it.

    [...]

    I don’t think there’s a damn thing we can do to stop the train. But we can certainly control the culture, since the culture is us. I plan to practice what I preach, and dial my hours back. That’s going to mean saying No to a lot of people who want to chat with me (sorry!), and also dialing back some of my ambitions, even if it means losing some footraces. I don’t care. I will fight the vampire.

    [...]

    If you have joined an AI-native startup, the founders and investors are using the VC system to extract value from you, today, with the glimmer of hope for big returns for you all later.

    Most of these ideas will fail.

    I know this because they are literally telling me their plans like villains at the end of an old movie, since with Gas Town I have mastered the illusion of knowing what I’m doing. Truth is, nobody, least of all me, knows what they’re doing right now. But I look like I do, so everyone is coming to show me their almost uniformly terrible ideas.

    [...]

    Enterprises see the oncoming horde and think, oh jeez, we need to hustle. And they’re not exactly wrong. Which means this lovely dystopian picture is making its way slowly but surely into enterprise, at the big company where you work.

    [...]

    My friends who were grumbling back in 2001 needed some help with this, and I gave it to them. One day I walked up to the whiteboard during a particularly heated grumble-session, and I wrote a ratio on the board: $/hr (dollars divided by hours).

    [...]

    I said to everyone, Amazon pays you a flat salary, no bonuses, and you work a certain number of hours per week. From that, you can calculate that you make a certain number of dollars per hour.

    I told the grumbler group, you can’t control the numerator of this ratio. But you have significant control over the denominator. I pointed at the /hr for dramatic effect.

    [...]

    As for my part, I went ahead and dialed that denominator down, and lived life a bit while I was at Amazon, because fuck extraction.

    [...]

    That old formula is also my proposed solution for the AI Vampire, a quarter century later.

    Someone else might control the numerator. But you control the denominator.

    [...]

    You need to push back. You need to tell your CEO, your boss, your HR, your leadership, about the AI vampire. Point them at this post. Send them to me. I’m their age and can look them in the eye and be like yo. Don’t be a fool.

    [...]

    I regret the unrealistic standards that I’m contributing to setting. I don’t believe most people can work like I’ve been working. I’m not sure how long I can work how I’ve been working.

    I’m convinced that 3 to 4 hours is going to be the sweet spot for the new workday. Give people unlimited tokens, but only let people stare at reports and make decisions for short stretches. Assume that exhaustion is the norm. Building things with AI takes a lot of human energy.
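Yegge's ratio is easy to make concrete. With invented numbers and the numerator fixed (a flat salary), hours are the only lever, and they move the effective rate a lot:

```python
# Invented numbers: flat salary, no bonus; only hours worked varies.
salary = 160_000                                       # dollars per year
for hours_per_week in (55, 40, 30):
    dollars_per_hour = salary / (hours_per_week * 50)  # ~50 work weeks/year
    print(f"{hours_per_week} hrs/wk -> ${dollars_per_hour:,.0f}/hr")
```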

    6 votes