18 votes

Visualising how AI training avoids getting stuck in local minima

8 comments

  1. [4]
    Greg
    Link

    This is a really well put together video that provides some mathematically rigorous visualisations of the gradient descent process at the scale of a full modern model, to show what we're not seeing in the usual textbook simplifications. Extremely worth watching if you're interested in the field; it's the kind of view that can meaningfully help build a better intuition for what's going on.

    The tl;dr is that any visualisation that makes sense to the human mind can only show a projection of the loss landscape as a two-dimensional surface, whereas the actual descent is moving through a much higher-dimensional space. Accounting for that shows that you're not just descending the nearest valley; you're uncovering a totally new 2D projection with each step - the descent in N-dimensional space ends up looking more like you're warping the 2D space to create a new global minimum underneath the point you're standing on.
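
    For anyone who wants to poke at the idea directly, here's a rough numpy sketch (toy loss function and made-up sizes, nothing from the video itself) of how one of those 2D slices can be computed: take the point you're standing on, pick two random directions, and evaluate the loss on a grid in that plane. Every step of the descent moves the centre point and hands you a different surface.

        import numpy as np

        # Toy stand-in for a loss over an N-dimensional parameter vector.
        # A real loss would come from a network and its training data.
        N = 200
        rng = np.random.default_rng(0)
        A = rng.normal(size=(N, N)) / np.sqrt(N)

        def loss(theta):
            # Generic non-convex scalar function of the parameters.
            return np.sum((A @ theta) ** 2) + np.sum(np.sin(3 * theta))

        theta = rng.normal(size=N)           # the point we're standing on
        d1, d2 = rng.normal(size=(2, N))     # two random directions to slice along
        d1 /= np.linalg.norm(d1)
        d2 /= np.linalg.norm(d2)

        # The 2D "landscape" is just the loss evaluated on a grid in the plane
        # spanned by d1 and d2 around theta; it changes every time theta moves.
        alphas = np.linspace(-1, 1, 21)
        betas = np.linspace(-1, 1, 21)
        surface = np.array([[loss(theta + a * d1 + b * d2) for b in betas]
                            for a in alphas])
        print(surface.shape)                 # (21, 21) grid, ready to plot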

    6 votes
    1. [3]
      stu2b50
      Link Parent

      Accounting for that shows that you're not just descending the nearest valley; you're uncovering a totally new 2D projection with each step - the descent in N-dimensional space ends up looking more like you're warping the 2D space to create a new global minimum underneath the point you're standing on.

      This seems like a very complicated way to describe the intuition that for you to be at a local minimum x*, there must exist some ε > 0 such that f(x) ≥ f(x*) whenever |x_i - x_i*| < ε, along every dimension i of the space - so obviously the more dimensions there are, the harder it is for all of those conditions to hold at once (rough numerical sketch below).

      Trying to view it through the lens of 2D or 3D projections seems like putting the cart before the horse as a visualization device. The projection part is more complicated than the actual mechanics at hand, IMO.
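
      To make the "one condition per dimension" point concrete, here's a toy numpy sketch (it just treats the curvature along each axis at a critical point as an independent coin flip, which is obviously a simplification) showing how fast the chance of every direction curving upwards - i.e. a genuine local minimum rather than a saddle - collapses as dimensions are added:

          import numpy as np

          rng = np.random.default_rng(0)

          def fraction_of_minima(n_dims, n_trials=100_000):
              # Model the curvature along each axis at a critical point as an
              # independent random sign: +1 curves up, -1 curves down.
              # The point only traps gradient descent if every axis curves up.
              curvatures = rng.choice([-1, 1], size=(n_trials, n_dims))
              return np.all(curvatures > 0, axis=1).mean()

          for n in (1, 2, 5, 10, 20):
              print(n, fraction_of_minima(n))   # falls off roughly like 2 ** -n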

      4 votes
      1. Malle
        Link Parent

        The additional condition may be more difficult to fulfill, but an additional dimension also enlarges the domain, meaning there are more opportunities for minima.

        For any generic function, I'm not certain we should trust loose arguments about what the number of local minima trends toward; or rather, about what fraction of the domain belongs to the "collection basin" of the global minimum (if one exists).

        Specifically for a loss function like the one described here, it might well be true, or at least likely. However, seeing the "landscape" transform is not a guarantee of having found a global minimum; it only tells you that the minimum associated with the starting point is not in the projected plane.
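
        As a concrete (if artificial) illustration of the "more opportunities for minima" side, here's a small separable toy function in numpy where every extra dimension doubles the number of local minima while still leaving exactly one global minimum (the 1D minima locations are only rough, found numerically):

            import numpy as np
            from itertools import product

            def g(t):
                # One coordinate's contribution: two local minima of different
                # depth, a shallow one near +0.96 and a deeper one near -1.04.
                return (t ** 2 - 1) ** 2 + 0.3 * t

            def f(x):
                # Separable sum, so local minima multiply across dimensions: 2 ** n of them.
                return float(np.sum(g(np.asarray(x, dtype=float))))

            n = 4
            candidates = list(product((-1.04, 0.96), repeat=n))   # rough 1D minima of g
            print(len(candidates), "local minima in", n, "dimensions")
            print("deepest one:", min(candidates, key=f))         # all coordinates near -1.04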

        4 votes
      2. Greg
        Link Parent

        Each to their own, I guess - knowing the mathematical basis was never a problem, but for me at least, having an image that I can see evolving and spin around in my head is a big help in moving from academic understanding to an actual intuitive feel for what's going on.

        3 votes
  2. [2]
    DataWraith
    Link

    Welch Labs is an amazing channel in general!

    If you like this style of video, I highly recommend his series on Learning to See, where he teaches the basics of machine learning, and how Decision Trees work in particular, in a highly accessible manner.

    Other good ones are the series on Self-Driving cars (which I think he never finished?) and the recent video about how DeepSeek's Latent Attention works.

    5 votes
    1. Greg
      Link Parent

      I've just had a chance to watch the latent attention video, and it was similarly excellent! Exactly the same feeling I got from this one: I've read the papers, in theory I know how these things work, but seeing it laid out visually gives my brain something to latch onto in a way that feels much more natural.

      1 vote
  3. sparksbet
    Link

    Even as someone with a master's degree in this stuff, I think this video was really well put-together and it helped increase my understanding of this technique! I knew a lot of the facts here but my intuition for why gradient descent works in high dimensions was much like he said his was before he started making this video, and his visualizations are top-notch.

    3 votes
  4. Staross
    Link

    I found this video frustrating; the only part that is actually relevant to the initial question (why gradient descent doesn't get stuck in local minima) is the five seconds here https://youtu.be/NrO20Jb-hy0?t=1235, and even that isn't well explained. He should have shown local extrema in 1D vs 2D and explained how more "saddle-like" points appear in higher dimensions. He also spends a lot of time introducing the visualization, only to then show how useless it is. That said, the information in between isn't bad; it's just randomly put together to make it sound more interesting than it is.
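
    For what it's worth, the 1D-vs-higher-D point can be shown with a two-line example: f(x, y) = x^2 - y^2 has exactly the same flat gradient at the origin as f(x) = x^2 does at zero, but the extra dimension curves downwards, so plain gradient descent slides off it instead of getting trapped. A quick numpy sketch (step size and starting point picked arbitrarily):

        import numpy as np

        # Saddle example: f(x, y) = x**2 - y**2. Along x it looks like a minimum,
        # along y like a maximum, so it cannot trap gradient descent.
        def grad(p):
            x, y = p
            return np.array([2 * x, -2 * y])

        p = np.array([0.0, 1e-3])       # start almost exactly on the saddle point
        for _ in range(50):
            p = p - 0.1 * grad(p)       # plain gradient descent, fixed step size

        print(p)   # x stays at 0, |y| keeps growing: the descent escapes along y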

    2 votes