3 votes

Aurora: A leverage-aware optimizer for rectangular matrices

3 comments

  1. FlippantGod

    A research paper in blog form, from a group I had not heard of. The article has its own TL;DR and outline, and includes pretty math and very pretty visualizations, not to mention a great deal more detail. Hopefully the bits I pulled out while reading will encourage people to check out the full article.

    Premise, from the authors' observations of Muon and NorMuon:

    Muon... has become an increasingly popular choice for training frontier-scale models.

    NorMuon, which is currently SoTA on the modded-nanoGPT speedrun... augments Muon with an additional step that scales each row by its inverse RMS norm...

    The fact that NorMuon still succeeds suggests that there may be a gap in the Muon formulation that is being addressed by row normalization.

    We study the effects of row normalization and find that Muon can result in neuron death in MLP layers, whereby some neurons receive persistently small updates early in training and fail to recover. We show that this failure mode can be avoided by redistributing mass equally across rows for updates to the up and gate projections in the MLP layers. Motivated by this observation, we propose Aurora, which uses this mechanism to prevent neuron death without sacrificing precision of the gradient orthogonalization.

    I'll leave the pretty, formatted math to the article:

    The core algorithmic component in Muon is an iterative algorithm to compute the polar factor of a matrix.

    The existence of matmul-only iterative algorithms for computing polar(G) is largely what makes Muon feasible at scale.
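
    For intuition, here is a minimal Newton-Schulz-style sketch of such a matmul-only polar iteration. This is the simple cubic variant, not the tuned PE-8 routine the article uses, so treat it as an illustration of the idea rather than their implementation:

    ```python
    import torch

    def polar_newton_schulz(G: torch.Tensor, steps: int = 8) -> torch.Tensor:
        """Approximate polar(G) using only matrix multiplications.

        Simple cubic Newton-Schulz iteration; Muon-family optimizers use tuned
        higher-order variants (e.g. the article's PE-8), but the key property
        is the same: no SVD, just a handful of matmuls per optimizer step.
        """
        X = G / (torch.linalg.norm(G) + 1e-7)  # scale so singular values are <= 1
        transposed = X.shape[0] > X.shape[1]
        if transposed:                          # iterate on the wide orientation
            X = X.T
        for _ in range(steps):
            A = X @ X.T                         # small Gram matrix
            X = 1.5 * X - 0.5 * A @ X           # X_{k+1} = 1.5 X_k - 0.5 X_k X_k^T X_k
        return X.T if transposed else X
    ```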

    Muon benefits from polar(G) precision.

    NorMuon augments Muon with an additional step that scales each row of the polar factor to have unit RMS norm.

    ...this row normalization can significantly reduce polar factor precision...

    Therefore, row normalization in NorMuon necessarily introduces a precision defect into the orthogonalization routine. We find that this polar precision defect can be quite large for matrices with non-uniform row norms.

    However, both runs outperform our Muon+PE-8 baseline, suggesting that row normalization can be independently useful.

    To mitigate the defect introduced by row normalization, we can simply normalize tall matrices to have row norms √(n/m) instead of one. We call this variant U-NorMuon...
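
    A rough numeric illustration of why the √(n/m) target matters for tall matrices. This uses scipy's exact polar decomposition rather than an iterative routine, and takes "row norm" to mean the plain ℓ2 norm, so the constants may not line up exactly with the article's RMS convention:

    ```python
    import numpy as np
    from scipy.linalg import polar

    rng = np.random.default_rng(0)
    m, n = 1024, 256                        # tall, like an MLP up/gate projection

    # Fake "gradient" with deliberately non-uniform row scales.
    G = rng.standard_normal((m, n)) * rng.lognormal(sigma=2.0, size=(m, 1))
    U, _ = polar(G)                         # exact polar factor: U.T @ U ~= I_n

    def row_rescale(X, target):
        """Force every row of X to have l2 norm `target`."""
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        return X * (target / (norms + 1e-12))

    def ortho_defect(X):
        """Distance of X from having orthonormal columns."""
        return np.linalg.norm(X.T @ X - np.eye(n))

    print("polar factor        :", ortho_defect(U))
    print("unit row norms      :", ortho_defect(row_rescale(U, 1.0)))            # NorMuon-style
    print("sqrt(n/m) row norms :", ortho_defect(row_rescale(U, np.sqrt(n / m)))) # U-NorMuon-style
    ```

    As I read it, the point is that a left-orthogonal tall matrix has average squared row norm n/m, not 1, so forcing unit rows inflates the update's overall mass (the precision defect above), while the √(n/m) target keeps the scale of an orthogonal matrix.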

    But the authors think it is only good for tall matrices:

    Now we turn our analysis to wide and square matrices, which also receive unit row norm updates under NorMuon. A left-orthogonal wide or square matrix necessarily has all its row norms equal to one, so row-norm uniformity is implied by orthogonality in this case.
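
    A quick sanity check of that claim, again with scipy's exact polar decomposition rather than PE-8:

    ```python
    import numpy as np
    from scipy.linalg import polar

    rng = np.random.default_rng(0)
    G = rng.standard_normal((256, 1024))     # wide: m < n
    U, _ = polar(G)                          # U has orthonormal rows, U @ U.T ~= I_m

    row_norms = np.linalg.norm(U, axis=1)
    print(row_norms.min(), row_norms.max())  # both ~= 1, so row normalization is a no-op here
    ```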

    Thus, we expect that row-normalization is unnecessary or perhaps even harmful under precise orthogonalization routines like PE-8.

    We find evidence of this occurring at 340M scale where row-normalizing only tall matrices outperforms NorMuon and U-NorMuon by a small but non-trivial margin.

    The authors have an explanation for NorMuon's performance despite precision errors and applying row-normalization on wide and square matrices:

    We will show that Muon can allow a large subset of neurons in the MLP layers to effectively die, but that this pathology is mitigated by (U-)NorMuon. Building on ideas in the literature, we propose this as an explanation for the performance gap between Muon and NorMuon that we find in all our settings. We then derive Aurora which effectively row normalizes updates to tall parameters without sacrificing precision of the polar factor.

    “Normalization Prevents Neuron Death”

    We define a dead model component as a subset of model parameters receiving persistently small learning signal after the earliest phases of training. We identify dead model components with the following three criteria:

    1. Low effective gradient norm.
    2. Low effective update norm.
    3. Persistence over training.

    ...neuron death, as we've defined it, can and does occur in networks trained with Muon because tall matrices in MLP layers are allowed to receive updates with very non-uniform row-norms.
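
    My rough reading of those three criteria as code; the thresholds and the per-step median baseline are invented here for illustration, the article has its own definitions:

    ```python
    import torch

    def dead_neuron_mask(per_row_grad_norms: torch.Tensor,
                         per_row_update_norms: torch.Tensor,
                         rel_threshold: float = 0.1,
                         min_fraction: float = 0.9) -> torch.Tensor:
        """Flag persistently under-updated rows (neurons) of one weight matrix.

        Both inputs have shape (steps, num_rows): the l2 norm of each neuron's
        gradient and applied update at each logged step. A row counts as "dead"
        if both norms stay below `rel_threshold` times the per-step median for
        at least `min_fraction` of the logged steps.
        """
        grad_median = per_row_grad_norms.median(dim=1, keepdim=True).values
        update_median = per_row_update_norms.median(dim=1, keepdim=True).values
        small = ((per_row_grad_norms < rel_threshold * grad_median) &
                 (per_row_update_norms < rel_threshold * update_median))
        return small.float().mean(dim=0) >= min_fraction   # fraction of steps, per row
    ```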

    U-NorMuon prevents death, and the prevention propagates.

    The normalization intervenes most aggressively early in training when anisotropy is developing, and as the buffer stabilizes, the correction shrinks. More surprisingly, this benefit extends to parameters that U-NorMuon does not directly normalize. For example, in Figure 12 we plot column leverage scores for the down projection matrix, which is a wide matrix and thus receives no benefit from row normalization.
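
    Since "leverage-aware" is right there in the title: leverage scores, in the usual linear-algebra sense, are the squared row (or column) norms of the orthogonal factor from a thin SVD. The article may define them slightly differently, but a minimal version for column leverage scores looks like this:

    ```python
    import torch

    def column_leverage_scores(W: torch.Tensor) -> torch.Tensor:
        """Leverage score of each column of W (standard thin-SVD definition)."""
        # W = U diag(S) Vh; column j's leverage score is the squared l2 norm of
        # column j of Vh. Scores sum to min(W.shape) when W has full rank.
        _, _, Vh = torch.linalg.svd(W, full_matrices=False)
        return (Vh ** 2).sum(dim=0)           # shape: (num_columns,)
    ```

    Roughly uniform scores mean every column matters about equally; heavily skewed scores are the kind of anisotropy Figure 12 appears to be tracking.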

    The authors claim SoTA results:

    When combined with Contra-Muon and update/weight flooring, Aurora achieves a new SoTA of 3175 steps [on the modded-nanoGPT speedrun benchmark].

    We train 1.1B-parameter transformers on ~100B tokens... and compare Aurora against Muon and NorMuon, each using PE-8. Aurora achieves the lowest final loss...

    Observations / Results:

    We hypothesize that since MLPs are predominantly responsible for memorization, Aurora's gains are most visible on memorization-intensive benchmarks like MMLU.

    Crucially, Aurora's advantage over Muon grows monotonically with MLP expansion factor (Figure 21), suggesting that wider MLPs, which increase the row-to-column ratio of the up and gate projections, amplify exactly the pathology Aurora corrects.

    Untuned Aurora is only a 6% overhead over traditional Muon, and a drop-in replacement.

    On MoE:

    Aurora's benefits are most pronounced for tall matrices, where the ratio m/n is large. This is the typical regime for up and gate projections in dense transformers, which commonly use MLP expansion factors of 4× or more. In MoE architectures, capacity is distributed across many smaller experts, each with a proportionally smaller hidden dimension and a lower m/n ratio. We therefore expect the neuron death pathology to be less severe in MoE models of moderate size, though not entirely absent for experts that remain meaningfully tall.

    They open-sourced their Aurora implementation on GitHub and published the 1.1B model trained with Aurora on Hugging Face. I don't see the Muon or NorMuon versions, though.

    1 vote
  2. skybrian

    For people not actually working on LLMs, I think the takeaway is that we can expect AI researchers to keep finding new ways to increase performance, because current algorithms are far from optimal. Also, I imagine there are other good ideas already discovered that haven't made their way into popular LLMs yet?

    1. FlippantGod

      I feel like there are currently two paths to increasing performance: researching the mechanics of neural networks, and researching serving/commercialization. Serving has the obvious low-hanging fruit.

      It's hard to know what exactly gets implemented and when, but sometimes a paper from several years ago is suddenly thrust back into the limelight. It's more rare, I think, for someone to hit up a paper from the 80s or such, partly because they have generally been mined and incorporated into the body of work, and partly because there's just a lot more work in the field today.