From the article:
Now a group of 40 researchers across seven different teams at Google have identified another major cause for the common failure of machine-learning models. Called “underspecification,” it could be an even bigger problem than data shift. “We are asking more of machine-learning models than we are able to guarantee with our current approach,” says Alex D’Amour, who led the study.
[...]
D’Amour’s initial investigation snowballed and dozens of Google researchers ended up looking at a range of different AI applications, from image recognition to natural language processing (NLP) to disease prediction. They found that underspecification was to blame for poor performance in all of them. The problem lies in the way that machine-learning models are trained and tested, and there’s no easy fix.
[...]
What the Google researchers point out is that this bar is too low. The training process can produce many different models that all pass the test but—and this is the crucial part—these models will differ in small, arbitrary ways, depending on things like the random values given to the nodes in a neural network before training starts, the way training data is selected or represented, the number of training runs, and so on. These small, often random, differences are typically overlooked if they don’t affect how a model does on the test. But it turns out they can lead to huge variation in performance in the real world.
[...]
The researchers carried out similar experiments with two different NLP systems, and three medical AIs for predicting eye disease from retinal scans, cancer from skin lesions, and kidney failure from patient records. Every system had the same problem: models that should have been equally accurate performed differently when tested with real-world data, such as different retinal scans or skin types.
Here is the paper.
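To make the quoted point concrete, here is a minimal, purely illustrative sketch (my own construction, not the paper's code): two input features are perfectly correlated in the training data, so many weight settings fit the data equally well, and which one a model leans on depends only on the random initialization. Every seed scores identically on data like the training set, yet the models disagree once that correlation breaks.

    import numpy as np

    # Toy setup: feature 2 is an exact copy of feature 1, so the data alone
    # cannot say how a model should split its weight between them.
    rng = np.random.default_rng(0)
    x1 = rng.normal(size=(200, 1))
    X = np.hstack([x1, x1])                 # two perfectly correlated features
    y = (x1[:, 0] > 0).astype(float)

    def train(seed, lr=0.5, steps=500):
        # Plain logistic regression trained by gradient descent from a
        # seed-dependent random initialization.
        w = np.random.default_rng(seed).normal(size=2)
        for _ in range(steps):
            p = 1 / (1 + np.exp(-X @ w))
            w -= lr * X.T @ (p - y) / len(y)
        return w

    for seed in (1, 2, 3):
        w = train(seed)
        acc = np.mean(((X @ w) > 0) == y)     # same (perfect) accuracy every time
        shifted = np.array([2.0, -2.0])       # a point where the correlation breaks
        print(seed, acc, float(shifted @ w))  # this score depends only on the seed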
Frankly, "underspecification" is also a feature, not a bug, in deep learning (and all of modern mainstream AI is deep learning). The point of making ridiculously complex models with billions of...
Frankly, "underspecification" is also a feature, not a bug, in deep learning (and all of modern mainstream AI is deep learning). The point of making ridiculously complex models with billions of parameters and training them on merely hundreds of thousands of data points is that you have many degrees of freedom. This makes it very likely that a good theory can be expressed within your model (because you have the parameters to encode it) but it makes it possible to have different good theories - theories that explain the data you have. But they might differ substantially on the data you don't have. Particularly if the data you don't have (i.e. the real world) is markedly different from the data you do have.
My personal pet theory is that backpropagation needs to go. We train all our models by basically saying: "here's a mathematical structure with lots of underspecified parameters (called weights). Now, we declare that its output for an input x should approximate the corresponding y in our data. We take the data, calculate the gradient of the error with respect to the weights (the direction we would have to shift the weights to improve our approximation), and move the weights a little bit." If you're into derivatives on vector spaces, you'll note that we only ever approximate the gradient locally. I.e. each component asks: assuming all the other billion parameters stay in place, in which direction should the one we're looking at move? We are sooo far away from finding the best configuration of weights. My pet theory, then, is that neural networks will have to adapt big time to stay relevant long term, and I don't think such an adaptation is feasible. I think we're looking for a breakthrough in one of the other methods of AI and ML instead. Maybe graphical models. Probably graphical models. We're already seeing things crop up that promise smaller models with the same performance, some gradient-based, some not.
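As a toy illustration of that training loop (names and numbers here are made up for the example): compute the gradient of a squared-error loss, check one component against the "freeze everything else and nudge this weight" finite difference, and take one small step.

    import numpy as np

    # Toy squared-error loss for a linear model; all values are illustrative.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 4))
    y = X @ np.array([1.0, -2.0, 0.5, 3.0])
    w = rng.normal(size=4)                       # underspecified starting weights

    def loss(w):
        return np.mean((X @ w - y) ** 2)

    grad = 2 * X.T @ (X @ w - y) / len(y)        # analytic gradient of the loss

    # One gradient component, checked by nudging w[0] while freezing the rest.
    eps = 1e-6
    w_plus = w.copy()
    w_plus[0] += eps
    print((loss(w_plus) - loss(w)) / eps, grad[0])   # two nearly identical numbers

    w -= 0.01 * grad                             # "move the weights a little bit"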
Oh, and if you care to listen to the bearded-naked-cavedweller ramblings of mine: we're going to have to learn to derive programs from data. Take a high-level programming language, specify the type signature of what you've got, and fill the rest in with data. This could be a huge productivity boost for your garden-variety programmer. However, I do not see a plausible way of achieving this yet.
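A deliberately naive sketch of that idea, with entirely made-up candidates: enumerate possible bodies for a given signature and keep whichever ones are consistent with the input/output data. Real program synthesis is of course vastly harder than this.

    from typing import Callable, List, Tuple

    # Made-up I/O examples and candidate pool for an int -> int signature.
    examples: List[Tuple[int, int]] = [(1, 2), (2, 4), (5, 10)]

    candidates: List[Tuple[str, Callable[[int], int]]] = [
        ("x + 1", lambda x: x + 1),
        ("x * 2", lambda x: x * 2),
        ("x * x", lambda x: x * x),
    ]

    # "Fill the rest in with data": keep the bodies consistent with every example.
    consistent = [name for name, f in candidates
                  if all(f(x) == y for x, y in examples)]
    print(consistent)   # ['x * 2']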
> The point of making ridiculously complex models with billions of parameters and training them on merely hundreds of thousands of data points is that you have many degrees of freedom. This makes it very likely that a good theory can be expressed within your model (because you have the parameters to encode it) but it makes it possible to have different good theories - theories that explain the data you have. But they might differ substantially on the data you don't have. Particularly if the data you don't have (i.e. the real world) is markedly different from the data you do have.
The issue is fundamentally generalization. We know there are machines that are good at generalizing from limited training data: healthy, living human minds. When you have umpteen machine-learned models whose accuracies differ by only epsilon on hold-out set 1, but by large deltas on hold-out set 2, one has to ask: if you asked umpteen humans to learn and perform this task on hold-out sets 1 and 2, would you see the same failure to generalize? I would argue that you would not, if the task is well defined.
What is well defined? Well, we know what that means in terms of mathematical functions, but what does it mean in terms of defining a task? I'd say it's impossible to concretely define what "well-defined" means for arbitrary tasks. But we can still do science here, if we're positing that we'll be doing supervised learning anyway.
The way I determine whether an annotation task is well-defined for supervised learning is whether I can write guidelines for humans such that a group of humans annotating independently with those guidelines can achieve a reliable level of inter-annotator agreement relative to chance agreement. We can objectively compute this agreement with metrics like Krippendorff's alpha. If you can't get a group of humans to independently achieve roughly >= 0.8 alpha (given an appropriate delta metric, i.e. difference function, for your data and a good-effort set of human-grokkable task guidelines), I would argue that the task is ill-defined, and any model you train on data produced by the humans who failed to reach 0.8 alpha will not generalize, or even necessarily achieve a useful level of accuracy at all. Alpha basically lets us measure whether humans "know it when they see it".
Now, there are still issues of data shift, data sparsity, and so on. But if you look at tasks like sentiment analysis for NLP, you will never see an IAA score >= 0.8 on a representative dataset. E.g., the alpha for the Stanford Sentiment Treebank with an ordinal delta metric is only <= 0.6. That is to say, on a scale from -1.0 to +1.0, where 0.0 is chance agreement, -1.0 is perfect disagreement, and +1.0 is perfect agreement, independent humans only agree on this task at a level of <= 0.6.
So if you want to make progress, it's not going to come from better learning algorithms or optimizers. The work to be done is about defining a more well-defined task. Maybe we have to model the problem with representations that are more complex than ordinal sentiment-scale values? Maybe there are more dimensions to sentiment? And if you can define a task that humans actually agree on, all the fancy universal function approximators you apply will have a chance at actually finding optimal solutions.
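For what it's worth, agreement like this is easy to compute in practice. A minimal sketch, assuming the third-party krippendorff Python package (its API as I recall it) and some made-up ordinal labels; rows are annotators, columns are items, np.nan marks missing annotations:

    import numpy as np
    import krippendorff   # third-party package; API assumed per its documentation

    # Made-up ordinal sentiment labels (1..5); rows = annotators, columns = items.
    ratings = np.array([
        [1, 2, 3, 3, 2, 4, 4, 1, 2, np.nan],
        [1, 2, 3, 3, 2, 4, 4, 2, 2, 5],
        [np.nan, 3, 3, 3, 2, 4, 4, 2, 1, 5],
    ])

    alpha = krippendorff.alpha(reliability_data=ratings,
                               level_of_measurement="ordinal")
    print(alpha)   # >= ~0.8 would suggest the task is well enough defined to learn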
Note that, paradoxically, such setbacks make me optimistic. I expect there will be more progress after a problem is better understood than if researchers are still groping around in the dark and wondering why it randomly doesn't work.
A thing I learned about years ago is the concept of "local maxima": the idea that, if you are trying to develop a skill or function from level 0 (just getting started) to level 10 (it works perfectly), a promising method often quickly gets you to level 5 or 6, and you keep researching and working to improve that method until, eventually, you realize that the method -- while initially promising -- is literally incapable of ever getting you beyond, say, level 7. If level 7 is good enough, then okay; but if you really need to reach level 10 (or even anything higher than 7), you need to throw away a lot of promising work and more or less start over with a new method that, initially, only works at level 1 or 2.
I may not be explaining it very well, and I'm not sure it applies here, but I do know that this issue of getting stuck optimizing toward a local maximum (reinforced by the sunk-cost fallacy of not wanting to abandon the work already invested) has plagued AI development for decades.
Local maxima certainly apply. What we do is build our models complicated enough that it doesn't matter that we can't ever optimize globally.
To put it differently, imagine a function with one parameter. Visually, you can tell the global maximum, if it exists. Imagine a function with two parameters (think of a heightmap). You can't tell the global maximum at a glance quite so easily. Local maxima are hilltops. We optimize our model by walking uphill. Naturally, we could get stuck on a small hill, instead of the highest peak. Now, any point that is a local maximum in a slice of the heightmap is unlikely to be a local maximum in the entire heightmap. By adding a dimension, we can give ourselves the chance of finding different routes to the peak. We're adding a lot of dimensions. Billions in fact. This has the added effect of adding a lot of peaks that are of similar height to the global maximum. And we only need to find one. [1]
Your intuition is absolutely correct and very much applies to the process. We're merely walking uphill, so we're never guaranteed to find the highest peak. We do have our cheeky tricks, though. This article is more or less about the side effects of adding so many dimensions: those peaks that might seem so similar to us now might not be exchangeable for one another, and we can't tell them apart.
That said, we do have a fairly good idea of when a model is stuck in a local optimum: it stops moving. Its performance and parameters stay mostly static as we keep training. You can of course just restart your training process and try again. However, we can never know how good the global optimum is, so we don't know if we're stuck in a shitty local optimum (a foothill), a good local optimum (a mountain peak), or the global optimum (Everest). They look alike when you're there.
[1] Edit: What this article is talking about is the problem that those peaks are perhaps different in significant ways that we care about. I should probably explain my metaphor: the parameters of the function are the parameters of our model, and the height is how well our hypothesis fits the data (think of it as the negative of the error). So by walking uphill, we change our hypothesis, our model, in such a way as to decrease the error on the data we have. The problem occurs when we deal with data we haven't seen in training. Either it is of a different nature (cell phone photos in a busy clinic, as in the article), or it is just more of the same but contains rare effects we haven't seen often. In either case, we can't pick the right peak. Our heightmap doesn't show which one is correct; we only see our performance on the data we have. I suspect that by underspecification the researchers mean that the amount of data we have is just insufficient: if we only see a few cases of a certain phenomenon, we can't find the right explanation of why it happened, and so we can't correctly identify it in the future.
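A small, self-contained illustration of the hill-walking picture (the function here is an arbitrary toy, not anyone's real loss surface): gradient ascent from different random starting points settles on different hilltops, and nothing in the climb itself tells you which peak you reached.

    import numpy as np

    # An arbitrary bumpy one-parameter "heightmap" and its derivative.
    f = lambda x: np.sin(3 * x) + 0.5 * np.sin(x)
    df = lambda x: 3 * np.cos(3 * x) + 0.5 * np.cos(x)

    def climb(x, lr=0.01, steps=2000):
        # Walk uphill until we (effectively) stop moving: a local maximum.
        for _ in range(steps):
            x += lr * df(x)
        return x

    rng = np.random.default_rng(42)
    for x0 in rng.uniform(-4, 4, size=5):
        peak = climb(x0)
        print(f"start {x0:+.2f} -> peak at {peak:+.2f}, height {f(peak):.3f}")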
I am no expert, but I remember reading somewhere that in high-dimensional spaces local maxima are rare? The idea is that a point has to be a local maximum in every dimension, and the more dimensions there are, the more likely it is that there is at least one dimension in which it isn't, making it a saddle point instead.
But that's not to say it never happens.
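The underlying intuition is easy to check with a quick Monte Carlo sketch (a toy construction, not a proof): at a critical point, being a local maximum means every eigenvalue of the Hessian is negative, and for a random symmetric matrix the chance of that collapses as the dimension grows.

    import numpy as np

    # For a critical point to be a local maximum, every eigenvalue of the Hessian
    # must be negative. Estimate how often that holds for random symmetric matrices.
    rng = np.random.default_rng(0)
    trials = 5000
    for d in (2, 4, 8, 16):
        hits = sum(
            np.all(np.linalg.eigvalsh((A + A.T) / 2) < 0)
            for A in rng.normal(size=(trials, d, d))
        )
        print(d, hits / trials)   # shrinks toward zero as the dimension grows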
A great article on this topic: "Why Deep Learning Works Even Though It Shouldn’t – Ryan Moulton's Articles" https://moultano.wordpress.com/2020/10/18/why-deep-learning-works-even-though-it-shouldnt/