FlippantGod's recent activity

  1. Comment on Aurora: A leverage-aware optimizer for rectangular matrices in ~comp

    FlippantGod
    Link Parent

    Hmm, you both are thinking about the latency between when ideas are published and well received, and when they're eventually adopted? I might try putting together a little timeline later this month if I remember, it's interesting.

    I believe Intelligent Analysis publishes one chart with MoE adoption, sort of along this line.

    Edit: I will add that MoE took me completely by surprise, more than any other development. Initially it seemed just an odd evolutionary diversion to me!

    3 votes
  2. Comment on Aurora: A leverage-aware optimizer for rectangular matrices in ~comp

    FlippantGod
    Link Parent

    I feel like there are currently two paths to increase performance, between researching mechanics of neural networks and researching serving / commercialization. Serving has the obvious low hanging fruit.

    It's hard to know what exactly gets implemented and when, but sometimes a paper from several years ago is suddenly thrust back into the limelight. It's more rare, I think, for someone to hit up a paper from the 80s or such, partly because they have generally been mined and incorporated into the body of work, and partly because there's just a lot more work in the field today.

    4 votes
  3. Comment on Aurora: A leverage-aware optimizer for rectangular matrices in ~comp

    FlippantGod
    Link

    A research paper in blog form, from a group I had not heard of. The article has its own TL;DR and outline, and includes pretty math and very pretty visualizations, not to mention a great deal more detail. Hopefully the bits I pulled out while reading will encourage people to check out the full article.

    Premise, from the authors' observations of Muon and NorMuon:

    Muon... has become an increasingly popular choice for training frontier-scale models.

    NorMuon, which is currently SoTA on the modded-nanoGPT speedrun... augments Muon with an additional step that scales each row by its inverse RMS norm...

    The fact that NorMuon still succeeds suggests that there may be a gap in the Muon formulation that is being addressed by row normalization.

    We study the effects of row normalization and find that Muon can result in neuron death in MLP layers, whereby some neurons receive persistently small updates early in training and fail to recover. We show that this failure mode can be avoided by redistributing mass equally across rows for updates to the up and gate projections in the MLP layers. Motivated by this observation, we propose Aurora, which uses this mechanism to prevent neuron death without sacrificing precision of the gradient orthogonalization.

    I'll leave the pretty, formatted math to the article:

    The core algorithmic component in Muon is an iterative algorithm to compute the polar factor of a matrix.

    The existence of matmul-only iterative algorithms for computing polar(G) is largely what makes Muon feasible at scale.
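
    For reference, the matmul-only iteration they mean is presumably in the Newton-Schulz family. Here's a minimal sketch of the classic cubic variant; the paper's PE-8 routine almost certainly uses tuned coefficients and a fixed step count I don't know, so treat this as illustrative only:

```python
import numpy as np

def polar_newton_schulz(G, steps=8):
    """Approximate the polar factor of G (the U V^T from G = U S V^T)
    using only matrix multiplies, via the cubic Newton-Schulz iteration.
    Convergence needs singular values in (0, sqrt(3)), so we first scale
    G by its Frobenius norm, which bounds the spectral norm by 1."""
    X = G / (np.linalg.norm(G) + 1e-12)
    transpose = X.shape[0] > X.shape[1]   # iterate on the wide orientation
    if transpose:                         # so the Gram matrix below is small
        X = X.T
    for _ in range(steps):
        A = X @ X.T                       # small (n, n) Gram matrix
        X = 1.5 * X - 0.5 * A @ X         # pushes all singular values to 1
    return X.T if transpose else X

# Sanity check: for a tall matrix the result should be near left-orthogonal.
G = np.random.default_rng(0).normal(size=(16, 8))
U = polar_newton_schulz(G, steps=20)
err = np.linalg.norm(U.T @ U - np.eye(8))
```

    Transposing to the wide orientation keeps the Gram matrix at the smaller of the two dimensions, which is presumably part of why this is feasible at scale.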

    Muon benefits from polar(G) precision.

    NorMuon augments Muon with an additional step that scales each row of the polar factor to have unit RMS norm.

    ...this row normalization can significantly reduce polar factor precision...

    Therefore, row normalization in NorMuon necessarily introduces a precision defect into the orthogonalization routine. We find that this polar precision defect can be quite large for matrices with non-uniform row norms.

    However, both runs outperform our Muon+PE-8 baseline, suggesting that row normalization can be independently useful.

    To mitigate the defect introduced by row normalization, we can simply normalize tall matrices to have row norms √(n/m) instead of one. We call this variant U-NorMuon...
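
    If I'm reading that right, the fix is just a different row-norm target for tall matrices. A toy sketch of that scaling, assuming plain Euclidean row norms (the article works partly in RMS norms, so its constants may differ):

```python
import numpy as np

def u_normuon_scale(U):
    """Hypothetical sketch of the U-NorMuon row scaling described in the
    article: for a tall (m > n) update matrix, rescale every row to
    Euclidean norm sqrt(n/m) rather than one. An exactly left-orthogonal
    (m, n) matrix has Frobenius norm sqrt(n), and m rows of norm
    sqrt(n/m) reproduce exactly that, so the overall update scale of the
    polar factor is preserved."""
    m, n = U.shape
    target = np.sqrt(n / m) if m > n else 1.0
    row_norms = np.linalg.norm(U, axis=1, keepdims=True) + 1e-12
    return U * (target / row_norms)
```

    The point of the sqrt(n/m) target, as I understand it, is that it uniformizes row norms without inflating the matrix away from what a precise orthogonalization would produce.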

    But the authors think it is only good for tall matrices:

    Now we turn our analysis to wide and square matrices, which also receive unit row norm updates under NorMuon. A left-orthogonal wide or square matrix necessarily has all its row norms equal to one, so row-norm uniformity is implied by orthogonality in this case.

    Thus, we expect that row-normalization is unnecessary or perhaps even harmful under precise orthogonalization routines like PE-8.

    We find evidence of this occurring at 340M scale where row-normalizing only tall matrices outperforms NorMuon and U-NorMuon by a small but non-trivial margin.

    The authors have an explanation for NorMuon's performance despite precision errors and applying row-normalization on wide and square matrices:

    We will show that Muon can allow a large subset of neurons in the MLP layers to effectively die, but that this pathology is mitigated by (U-)NorMuon. Building on ideas in the literature, we propose this as an explanation for the performance gap between Muon and NorMuon that we find in all our settings. We then derive Aurora which effectively row normalizes updates to tall parameters without sacrificing precision of the polar factor.

    “Normalization Prevents Neuron Death”

    We define a dead model component as a subset of model parameters receiving persistently small learning signal after the earliest phases of training. We identify dead model components with the following three criteria:

    1. Low effective gradient norm.
    2. Low effective update norm.
    3. Persistence over training.

    ...neuron death, as we've defined it, can and does occur in networks trained with Muon because tall matrices in MLP layers are allowed to receive updates with very non-uniform row-norms.
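
    Their three criteria translate pretty directly into a detector over per-row norms. A toy sketch; the function name, thresholds, and persistence fraction here are all mine, not the paper's:

```python
import numpy as np

def dead_rows(grad_history, update_history,
              grad_tol=1e-4, upd_tol=1e-4, persistence=0.9):
    """Flag rows (neurons) meeting the article's three dead-component
    criteria: low effective gradient norm, low effective update norm,
    and persistence of both across most recorded steps. Histories are
    lists of (m, n) matrices; thresholds are illustrative only."""
    grad_small = np.stack([np.linalg.norm(g, axis=1) < grad_tol
                           for g in grad_history])      # (steps, m)
    upd_small = np.stack([np.linalg.norm(u, axis=1) < upd_tol
                          for u in update_history])     # (steps, m)
    both = grad_small & upd_small
    # A row is "dead" if both signals stay small for most of training.
    return both.mean(axis=0) >= persistence
```

    In other words, a single quiet step doesn't count; a row only flags when small gradients and small updates co-occur for a sustained fraction of steps.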

    U-NorMuon prevents death, and the prevention propagates.

    The normalization intervenes most aggressively early in training when anisotropy is developing, and as the buffer stabilizes, the correction shrinks. More surprisingly, this benefit extends to parameters that U-NorMuon does not directly normalize. For example, in Figure 12 we plot column leverage scores for the down projection matrix, which is a wide matrix and thus receives no benefit from row normalization.

    The authors claim SoTA results:

    When combined with Contra-Muon and update/weight flooring, Aurora achieves a new SoTA of 3175 steps [on the modded-nanoGPT speedrun benchmark].

    We train 1.1B-parameter transformers on ~100B tokens... and compare Aurora against Muon and NorMuon, each using PE-8. Aurora achieves the lowest final loss...

    Observations / Results:

    We hypothesize that since MLPs are predominantly responsible for memorization, Aurora's gains are most visible on memorization-intensive benchmarks like MMLU.

    Crucially, Aurora's advantage over Muon grows monotonically with MLP expansion factor (Figure 21), suggesting that wider MLPs, which increase the row-to-column ratio of the up and gate projections, amplify exactly the pathology Aurora corrects.

    Untuned Aurora is only a 6% overhead over traditional Muon, and a drop-in replacement.

    On MoE:

    Aurora's benefits are most pronounced for tall matrices, where the ratio m/n is large. This is the typical regime for up and gate projections in dense transformers, which commonly use MLP expansion factors of 4× or more. In MoE architectures, capacity is distributed across many smaller experts, each with a proportionally smaller hidden dimension and a lower m/n ratio. We therefore expect the neuron death pathology to be less severe in MoE models of moderate size, though not entirely absent for experts that remain meaningfully tall.

    They open-sourced their Aurora implementation on GitHub and published the 1.1B model trained with Aurora on Hugging Face. I don't see the Muon or NorMuon versions, though.

    2 votes
  4. Comment on US data center land use issues are fake in ~enviro

    FlippantGod
    Link Parent

    They can't loudly criticize small-scale orgs or movements at all (by virtue of being grassroots or supposedly more honest), even if they're screwing something up, so they have to leave that dirty work to centrist liberals and conservatives.

    Well, I guess this depends on exactly how leftist vs Democrat we are talking, because the Democrat party is perfectly content to go after some small businesses. And I see small businesses vote Republican even though I'd guess it probably hurts them?

    And this is probably my bias showing, but centrist liberals can't trivially step up and criticize flaws in small-scale orgs or movements either (also 100% by virtue of being grassroots or supposedly more honest). But hey, here's my criticism: I observe small businesses easily evading environmental protections and various other kinds of regulations generally. Perhaps auditing them is difficult.

    If I had to guess, I'd say both Democrats and Republicans go after any easy targets for their respective party, and that almost certainly includes some small-scale orgs / movements / businesses (I assume small-scale orgs includes small businesses btw) for each of them.

    1 vote
  5. Comment on US data center land use issues are fake in ~enviro

    FlippantGod
    Link

    Midway through the article, it feels modeled perhaps too closely on existing data centers and less representative of new data centers. It is a credible article, but I would sooner recommend reading this source it cited.

    3 votes
  6. Comment on Linux privilege escalation (CVE-2026-31431) in ~comp

    FlippantGod
    Link Parent

    It's just a temporal artifact. The tells were different before, and I assume in a year the em-dashes will be gone. Hopefully there will be another tell, but I'm expecting different models, and deployments of models, to begin applying slightly more variable styles to be stealthier. Seeing as companies like Anthropic are apparently big on undetected use. :(

    6 votes
  7. Comment on Robot dogs with Elon Musk and Mark Zuckerberg heads roam around Berlin museum in Beeple’s new exhibit in ~arts

    FlippantGod
    Link Parent

    I trust you. You just seem like a good judge for that type of thing.

    6 votes
  8. Comment on Google releases Gemma 4 in ~comp

    FlippantGod
    Link Parent

    Gemini is Google's line of big closed-source cloud-service models. Gemma is its line of small open-source models that can be run locally and offline.

    10 votes
  9. Comment on In most countries, imports from China account for less than 10% of GDP, even where China is the top partner in ~finance

    FlippantGod
    Link

    Okay, I don't have time to read, and I'm not an economist. But I would have guessed that because many imports from China benefit from low prices, a China-imports-sized hole in a country's economy would probably end up significantly larger.

    We saw that with medical masks, didn't we? Businesses across the US were manufacturing cloth masks during covid lockdowns while shipments were stalled. Certainly a significantly larger portion of GDP than importing the same volume.

    3 votes
  10. Comment on Playtiles: The pocket-sized gaming platform in ~games

    FlippantGod
    Link Parent

    Maybe 2nd-hand devices are getting scarce now? But I'd still prefer more people pick up a modded Vita, 3DS, or Switch instead.

    1 vote
  11. Comment on How to brew solar powered coffee in ~food

  12. Comment on How to brew solar powered coffee in ~food

    FlippantGod
    Link Parent

    Yeah sorry I was distracted.

    3 votes
  13. Comment on How to brew solar powered coffee in ~food

    FlippantGod
    Link

    Seems like an unusual article for solar powered magazine. More of a DIY instructable, which I'm not against, but I would have expected something covering cold brewing coffee instead of this.

    2 votes
  14. Comment on Valve announces new hardware: Steam Frame, Steam Controller, and Steam Machine in ~games

    FlippantGod
    (edited )
    Link Parent

    They definitely said everything on the Steam Frame controllers was capacitive. [edit: I'm likely wrong? perhaps?]

    Also, the capacitive grip sensing has a limited gradient for the fingers and on/off sensing for the thumb. We haven't yet tested the Index for comparison, but for context, the finger tracking is a differentiating feature of the Index.
    – Gamer's Nexus

    ...and capacitive buttons on ABXY, on the D-pad, on your thumb rest, and also finger sensing as well as a grip button and two trigger buttons.
    – Adam Savage's Tested

  15. Comment on Valve announces new hardware: Steam Frame, Steam Controller, and Steam Machine in ~games

    FlippantGod
    Link Parent

    Oh? Can you please share more of your thoughts on the controllers?

    2 votes
  16. Comment on Valve announces new hardware: Steam Frame, Steam Controller, and Steam Machine in ~games

    FlippantGod
    Link Parent

    Quest 2/3/3s controllers use AA batteries...

    6 votes
  17. Comment on Valve announces new hardware: Steam Frame, Steam Controller, and Steam Machine in ~games

    FlippantGod
    Link Parent

    Optimistically, I think $500-$550 max for the Frame, and that it would easily be profitable at $700, but they will be desperate to get the cost as low as possible.

    They would also expect a sizeable sales boost for Half-Life: Alyx.

    4 votes
  18. Comment on Valve announces new hardware: Steam Frame, Steam Controller, and Steam Machine in ~games

    FlippantGod
    Link

    Two 2160 x 2160 LCD panels, one per eye. Refresh rate is 72-144Hz.

    LCDs? Maybe they are imagining a refresh with OLEDs?

    4 votes
  19. Comment on Request for help: Backing up NASA public databases in ~space

    FlippantGod
    Link Parent

    If universities, data hoarders, and other interested parties (ESA?) are serious about organizing and bandwidth is going to be the limiting factor, I think you should begin discussing fundraising and getting in touch with the maintainers at NASA.

    It may be possible to fund/volunteer someone with the necessary clearances to do bulk replication on-site. Presumably there's a contact somewhere at NASA who could speak to that, and probably knows a lot about the best way to store the data for either archival or a useful production mirror.

    Not a small undertaking.

    16 votes