Did I skim this article too quickly, or is this one of those situations where the kind of people who are attracted to computer science are often the kind of people who take every statement as an inviolable rule? Like when you see posts on the internet that give generic life advice like "there's no one right way to do things," and all the comments are talking about how this advice sucks because they hope their electrician isn't following this advice, rather than the virtues of orthogonal thinking and not being bound by made up rules, where applicable.
So if you treat bias-variance tradeoff as a hard rule of machine learning, it's obviously not true in all situations. If you treat it as a generality that helps to teach important concepts you should think about while training your models, then it's definitely a reasonable concept. Overfitting is also not guaranteed in all situations, but it seems like a useful concept to be aware of. If it's the case where he's arguing that we shouldn't teach rules that aren't rules, I guess a Berkeley professor is more qualified than me on such matters. Although I can think of a lot of cases in science where it's common to teach a flawed model to build up concepts over time.
I wonder if it's the case where he is smart and all his students are smart, so he feels like he can skip the generalizations and go directly to the more advanced concepts. I personally am a dummy who needs to learn dumb things before I can learn smart things.
Please note that I'm not involved in anything machine learning related, I just like math and dislike overly broad statements. I had upvoted this post a while back because it's nice to see educators in technical disciplines share their experiences, but your comment made me go back and properly read it to understand whether it truly "deserved" my updoot.
I dug into the article proper, and I'm not sure I understand your concern? It looks like the author has two theses:
"Model Complexity" is not a well defined term, so using it in a technical context is problematic: students could draw a lot of conclusions from it, including incorrect ones.
One such incorrect conclusion is that model "size" is equivalent to model "complexity", which per the author's cited examples, has empirical evidence to the contrary: increases in parameter count decrease prediction error on the test set.
That seems reasonable on the face of it, but maybe you have some evidence demonstrating that model complexity has a well-understood meaning in the general public that could motivate using that graph...? And I didn't get the impression that the author is arguing against teaching about overfitting, just that said graph doesn't explain it in a way that sets students up for success.
[...] is this one of those situations where the kind of people who are attracted to computer science are often the kind of people who take every statement as an inviolable rule?
[...]
I wonder if it's the case where he is smart and all his students are smart, so he feels like he can skip the generalizations and go directly to the more advanced concepts.
OK so moving away from article critiques, but I'm getting the vibe that you have a personal bone to pick with this author? Even if so, it'd be cool of you not to perpetuate negative stereotypes about "people who are attracted to computer science" and focus on the author instead of generalizing? I'm often proximate to colleagues who dunk on autistic people, and that statement sounds like a neighbour to the very common "all these autist CS nerds can't read the room" line. It's not terribly pleasant to encounter IRL, and I'd prefer not to see more of it in tildes.
Totally fair if I'm reading between the lines incorrectly; I'm sure you weren't trying to paint with such broad strokes.
My reading of this article is that the main point is that the bias-variance tradeoff is not really a good rule, because it's not always true. While this is technically correct, my understanding from ML coursework is that it's usually taught to illustrate certain concepts, not to serve as a rule. The author is correct that there is no inherent tradeoff based on the math, but whether it should be taught or not is a different story. If you understand bias-variance, even a cursory exploration of ML techniques will show you that the amount of actual tradeoff can vary a lot depending on the situation. He calls it a boogeyman, but the reason that boogeyman exists is that most ML learners seem to go through a phase of thinking that if fitting is good, more fitting is better.
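For a concrete feel for that "more fitting is better" phase, here is a minimal numpy sketch; the target function, noise level, and polynomial degrees are arbitrary illustrations, not anything from the article or any coursework. Train error falls steadily with degree while test error eventually turns around.

```python
# Minimal overfitting demo: fit polynomials of increasing degree to noisy
# samples of a smooth function; train MSE keeps falling, test MSE does not.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)  # ground truth (arbitrary choice)
x_train = rng.uniform(-1, 1, 30)
x_test = rng.uniform(-1, 1, 200)
y_train = f(x_train) + rng.normal(0, 0.3, x_train.size)
y_test = f(x_test) + rng.normal(0, 0.3, x_test.size)

for degree in [1, 3, 5, 9, 15]:
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```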
I really didn't think I was coming off as that much of a hater, so sorry if my comment was worded that way. I'm studying CS, and I think it makes sense if CS people take statements more literally, due to the nature of their work. What you said is a lot more derogatory than anything I said, so I'd prefer not to go in that direction. However, maybe the fact that I didn't think I was perpetuating any harmful stereotypes is illustrative?
I hope you will try not to read between the lines and take my statements as they are. I didn't know anything about the article's author prior to reading this article. There wasn't any sarcasm intended in the statement you quoted. I really was trying to say that Berkeley students are generally smarter and more motivated than most college students, so the teaching methods might be different too. I know quite a few people who went to Berkeley. I felt like there was a noticeable difference in how serious most of them were about learning/studying, based on the people I met.

edit: removed some personal info
I really didn't think I was coming off as that much of a hater, so sorry if my comment was worded that way. I'm working on my masters in CS, and I think it makes sense if CS people take statements more literally, due to the nature of their work. What you said is a lot more derogatory than anything I said, so I'd prefer not to go in that direction. However, maybe the fact that I didn't think I was perpetuating any harmful stereotypes is illustrative?
No worries; my apologies for calling you out inappropriately! I'm just being overly sensitive, it seems.
My reading of this article is that the main point is that bias-variance tradeoff is not really a good rule, because it's not always true. While this is technically correct, my understanding from masters level ML coursework is that this is usually taught to illustrate certain concepts, not serve as a rule. The author is correct that there is no inherent tradeoff based on the math, but whether it should be taught or not is a different story. If you understand bias-variance, even a cursory exploration of ML techniques will show you that the amount of actual tradeoff can vary a lot depending on the situation. He calls it a boogeyman, but the reason that boogeyman exists is because it seems like most ML learners go through a phase of thinking that if fitting is good, more fitting is better.
Makes sense! I suppose my (ancient) education predated that graph, and my cohort seemed to grasp the concept of overfitting well enough -- but the entire field has been flipped on its head in the meantime. So I've really no notion of what a modern learner's experience is like.
Best of luck with your degree, btw! Hope grad school is treating you well.
"Model Complexity" is not a well defined term, so using it in a technical context is problematic: students could draw a lot of conclusions from it, including incorrect ones.
One such incorrect conclusion is that model "size" is equivalent to model "complexity", which per the author's cited examples, has empirical evidence to the contrary: increases in parameter count decrease prediction error on the test set.
I would disagree with the author here. Model complexity is quantifiable, and it does roughly scale with the number of parameters (though its particular definition will be problem-specific). As an example, one can evaluate the performance of a least-squares fit with k parameters by calculating the Akaike information criterion, AIC = χ^2 + 2k, and picking the model with the smallest score [1]. Unless adding a new parameter reduces the χ^2 by at least 2 (the size of the added penalty term), the associated Occam penalty is too large to prefer the new model.

Of course, this does not mean more complex models cannot be more correct. Technically general relativity is a more accurate model for gravity above the surface of the earth, but you wouldn't be able to gain sufficient evidence for the theory using the tools available in a Physics 101 lab. More generally, model selection is not just about minimizing the predicted error; it's also about quantifying the systematic error introduced from evaluating many different models, some of which are likely to perform better just by random chance (e.g., the look-elsewhere effect).
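To make the AIC recipe concrete, here is a rough numpy sketch. It assumes Gaussian noise of known sigma, so that χ^2 is just the scaled sum of squared residuals; the data-generating line and the candidate polynomial models are made up for illustration.

```python
# AIC-based model selection on truly linear data: chi^2 = RSS / sigma^2,
# AIC = chi^2 + 2k for a model with k fitted parameters.
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5
x = np.linspace(0, 1, 40)
y = 1.0 + 2.0 * x + rng.normal(0, sigma, x.size)  # true model has k = 2

for k in range(1, 6):                  # k = number of polynomial coefficients
    coeffs = np.polyfit(x, y, k - 1)
    chi2 = np.sum((np.polyval(coeffs, x) - y) ** 2) / sigma**2
    aic = chi2 + 2 * k
    print(f"k={k}: chi2={chi2:6.1f}, AIC={aic:6.1f}")
# Past the true model (k=2), chi^2 barely drops, so the +2 per extra
# parameter (the Occam penalty) pushes the AIC back up.
```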
Edit: Regarding the "double descent" figure mentioned in the post: the double descent only occurs when the number of parameters is greater than the number of data points. My guess would be that the extra parameters are cancelling out the irrelevant features from the model. And indeed, this is what Fig. 2 of [2] suggests -- by choosing "prescient" models instead of random models (i.e., by fitting the most relevant features instead of random features), peak performance occurs before the second descent.

[1] E. Neil and J. Sitison, "Improved information criteria for Bayesian model averaging in lattice field theory."
[2] M. Belkin, D. Hsu, and J. Xu, "Two models of double descent for weak features."
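A rough numpy sketch of that double-descent shape, loosely in the random-features spirit of [2] (the setup is a simplified guess, not the paper's exact experiment): np.linalg.lstsq returns the minimum-norm solution once the model is over-parameterized, which is what makes the second descent possible.

```python
# Test error of min-norm least squares on ReLU random features, swept
# through the interpolation threshold p = n (expect a spike near p = n,
# then a second descent as p grows past it).
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 20
X = rng.normal(size=(n, d))
w = rng.normal(size=d)
y = X @ w + rng.normal(0, 0.5, n)
X_test = rng.normal(size=(1000, d))
y_test = X_test @ w + rng.normal(0, 0.5, 1000)

W = rng.normal(size=(d, 200))            # fixed random projection
for p in [5, 25, 45, 50, 55, 100, 200]:  # number of random features
    phi = np.maximum(X @ W[:, :p], 0)    # ReLU random features
    phi_test = np.maximum(X_test @ W[:, :p], 0)
    beta, *_ = np.linalg.lstsq(phi, y, rcond=None)
    mse = np.mean((phi_test @ beta - y_test) ** 2)
    print(f"p={p:3d}: test MSE {mse:8.2f}")
```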
From the article:
This bias-variance decomposition is always true for the squared loss. It's just defining things in a clever way where, when you expand the squares, the cross terms cancel because the expressions have zero mean. The way the decomposition is interpreted in ESL is that more complex models have lower "bias" because they can fit more complex patterns but more "variance" because they are more sensitive to changes in data.
However, this decomposition is not a tradeoff because there is nothing that suggests these terms need to trade off. No fundamental law of functional analysis says that if one term is small, the other is large. In fact, there's nothing that prevents both terms from being zero. I can certainly build models where some have low bias and high variance, some have high bias and low variance, and some are just right. It all depends on how you define the models and their complexity.
…

The advice people draw from the bias-variance boogeyman is downright harmful. Models with lots of parameters can be good, even for tabular data. Boosting works, folks! Big neural nets generalize well. Don't tell people that you need fewer parameters than data points. Don't tell people that there is some spooky model complexity lurking around every corner.

Use a test set to select among the models that fit your training data well. It's not that complicated.
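For reference, this is the decomposition the quoted passage is describing, in its standard textbook form for the squared loss (the formula itself is not quoted from the article):

```latex
% Data model y = f(x) + eps with E[eps] = 0 and Var(eps) = sigma^2;
% \hat{f} is trained on a random training set, and expectations are taken
% over both the training set and the noise.
E\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(E[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{E\Big[\big(\hat{f}(x) - E[\hat{f}(x)]\big)^2\Big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```

Nothing in the algebra forces the first two terms to trade off, which is the author's point.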
The author gives practical advice in another blog post:

The main goal of applied math is to guide practice. We want theories that, while not perfect, give reasonable guidelines. But the advice from generalization theory just seems bad. I swear that all of the following bullets were lessons from my first ML class 20 years ago and are still common in popular textbooks:
If you perfectly interpolate your training data, you won't generalize.
High-capacity models don't generalize.
You have to regularize to get good test error.
Some in-sample errors can reduce out-of-sample error.
Good prediction balances bias and variance.
You shouldn't evaluate a holdout set too many times or you'll overfit.
This is all terrible advice!
…
What does "work" in practice? It's hard to argue against this four-step procedure:
1. Collect as large a data set as you can
2. Split this data set into a training set and a test set
3. Find as many models as you can that interpolate the training set
4. Of all of these models, choose the model that minimizes the error on the test set
This method has been tried and true since 1962. You can say that step 4 is justified by the law of large numbers. Maybe that's right. But there's still a lot of magic happening in step 3.
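A toy rendition of the quoted four-step recipe, with made-up data and candidate models (bootstrap-resampled high-degree polynomial fits stand in for "models that interpolate the training set"):

```python
# The four-step holdout recipe, end to end, on simulated data.
import numpy as np

rng = np.random.default_rng(3)

# Step 1: collect as large a data set as you can (simulated here).
x = rng.uniform(-1, 1, 200)
y = np.sin(3 * x) + rng.normal(0, 0.1, x.size)

# Step 2: split it into a training set and a test set.
x_tr, y_tr, x_te, y_te = x[:150], y[:150], x[150:], y[150:]

# Step 3: find many models that fit the training set well; here, degree-12
# polynomial fits on bootstrap resamples of the training data.
models = []
for _ in range(50):
    idx = rng.integers(0, x_tr.size, x_tr.size)
    models.append(np.polyfit(x_tr[idx], y_tr[idx], 12))

# Step 4: of all of these, keep the one with the smallest test error.
test_mse = [np.mean((np.polyval(m, x_te) - y_te) ** 2) for m in models]
best = models[int(np.argmin(test_mse))]
print(f"best of {len(models)} candidates: test MSE {min(test_mse):.4f}")
```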
Of all of these models, choose the model that minimizes the error on the test set
This method has been tried and true since 1962. You can say that step 4 is justified by the law of large numbers. Maybe that's right. But there's still a lot of magic happening in step 3.
Do you know what the author means by saying that 4 is justified by the law of large numbers? As far as I can tell, the law of large numbers would suggest the opposite -- in the limit of infinitely many evaluations against the test set, you are essentially just fitting the (smaller) test set, albeit with a suboptimal optimization procedure (minimization on the training set). Or maybe it would be better to say that you're simultaneously fitting two training sets (I guess that could be an interesting strategy)?
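A toy simulation of that worry, with arbitrary sizes: select among many coin-flip predictors by their accuracy on a fixed test set, and the winner's score on that test set overstates its true performance.

```python
# Adaptive overfitting to a reused test set: the best of many purely random
# binary predictors looks well above chance on the set used to pick it.
import numpy as np

rng = np.random.default_rng(4)
n_test, n_models = 100, 1000

y_test = rng.integers(0, 2, n_test)              # binary labels
preds = rng.integers(0, 2, (n_models, n_test))   # random "models"
test_acc = (preds == y_test).mean(axis=1)

best = test_acc.argmax()
print(f"best-of-{n_models} accuracy on the reused test set: {test_acc[best]:.2f}")
print("true accuracy of any of these models on fresh data: 0.50")
# With 1000 tries on 100 points, the winner typically scores ~0.65 here
# despite being no better than chance.
```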
He mentions the law of large numbers here under "Uniform Convergence":

https://www.argmin.net/p/three-paths-to-generalization

At the end he writes:
I find stability arguments hardest to argue against. Our machine learning methods should be robust to throwing away one data point! Of course, whether you can quantify how robust they are and use this to guide practice is a different story.
But this seems to be about making sure that the method is robust if the probability distribution doesn't change. If it does, you need new data.

(And if you're plotting points as they come in, the question is how much to weight new data?)

Thanks!
While Ben Recht insists that overfitting isn't a thing, that doesn't mean generalization isn't difficult. The robustness of the holdout method can't save us from populations changing:
The takeaway from this series of studies is simple. The train-test benchmarking has absurdly robust internal validity. It's incredibly hard to adaptively overfit to a test set. Our "generalization bounds" are wildly conservative for machine learning practice. However, external validity is less simple. How machine learning models will perform on new data is not predictable from benchmarking.
If minor differences in reproduction studies lead to major drops in predictive performance, can you imagine what happens when we take a machine learning model trained on a static benchmark and deploy it in an important application? We've seen AI models for radiology fail once someone changes the X-ray machine, or models for sepsis fail once we change the hospital involved. These sorts of shifts of contexts and populations are the major challenge for predictive engineering. I'm not sure what anyone can hope to do except constantly update the models so they are as current as possible and hope for the best.