This was a fun article to read, demonstrating one non-obvious way graphs can be made misleading.
Of course, psychologists have been careful to make sure that the evidence replicates. But sure enough, every time you look for it, the Dunning-Kruger effect leaps out of the data. So it would seem that everything’s on sound footing.
Except there’s a problem.
The Dunning-Kruger effect also emerges from data in which it shouldn’t. For instance, if you carefully craft random data so that it does not contain a Dunning-Kruger effect, you will still find the effect. The reason turns out to be embarrassingly simple: the Dunning-Kruger effect has nothing to do with human psychology. It is a statistical artifact — a stunning example of autocorrelation.
I still think there's something interesting there, but it's more like "people will usually estimate their own ability closer to the mean than they should" and not the "lol dumb people think they're smart" thing people usually summarize the Dunning-Kruger effect as.
This actually reads like torture to me. Like, yes, it's painfully obvious that they did torture their data to confess to some correlation that wasn't there to begin with. But that isn't anywhere near a debunking of the original DK research. No amount of data torture can explain that the participants in the original DK study were systematically overconfident. These participants on average self-assessed as better than they actually were, and those self-assessments were correlated (though weakly) with external assessments. Neither of those effects is present, and (outside of freak random events) neither can be present, in the tortured replication (see Fig 9, which lacks both).
That said, Dunning-Kruger is frequently misreported. AFAIK, the finding is in fact merely that people are overconfident and not too great at self-assessment. What I'm not entirely convinced by, however, is that people systematically self-assess closer to the mean than they should. That is an artifact you can torture out of data, particularly if you include study design. It's an effect I wouldn't be surprised to find is real, but it's also easy to make it appear out of thin air, e.g. by asking people to rank their performance as a percentile of the general population. But there's probably less-known, better-conducted and better-documented follow-up research on that topic anyway.
No amount of data torture can explain that the participants in the original DK study were systematically overconfident.
It's more accurate to say that participants had no idea how well they did. Their reported self-assessments were essentially completely uncorrelated with their actual scores, and ended up looking like a uniform random distribution. Remember, it's not their test scores, it's their percentiles within the population they were asked to estimate against, so it makes sense that random noise averages out to ~50% for each quartile.
Imagine if instead of people, you just had 100 d100 dice. If you were to split the 100 dice into 4 equal groups of 25 dice each, and then rolled each die, what do you think the average of each group would be? 50, of course, because that's the average roll of a d100 (well, it's 50.5, but close enough).
That's what the y vs x graph shows in the paper. A pretty much straight line. And of course the x vs x graph is linear, because it's literally y = x.
You just plotted uniform random noise against a straight line. That's what the Dunning-Kruger graph shows. The actual result is just that people have no idea how to rate their own performance.
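A minimal sketch of that argument (my own illustration in Python, not code from the article or the paper): treat actual scores as fixed percentile ranks, treat self-assessments as pure uniform noise, then average within actual-score quartiles the way the DK figure does.

```python
import random
import statistics

random.seed(1)

n = 100
actual = list(range(n))                                     # actual percentile ranks, 0..99
self_assessed = [random.uniform(0, 100) for _ in range(n)]  # pure noise, no skill signal at all

# Sort by actual score and cut into quartiles, as the DK figure does.
order = sorted(range(n), key=lambda i: actual[i])
quartiles = [order[i * n // 4:(i + 1) * n // 4] for i in range(4)]

for q, idx in enumerate(quartiles, start=1):
    mean_actual = statistics.mean(actual[i] for i in idx)
    mean_self = statistics.mean(self_assessed[i] for i in idx)
    print(f"Q{q}: actual ~{mean_actual:5.1f}, self-assessed ~{mean_self:5.1f}")
```

The "actual" line climbs from roughly 12 to 87 by construction (it's x plotted against x), while the "self-assessed" line hovers around 50 in every quartile, which is exactly the shape of the classic figure, produced from nothing but noise.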
That's what the y vs x graph shows in the paper. A pretty much straight line. And of course the x vs x graph is linear, because it's literally y = x.
No, that's what Fig 9 in the blog post shows. In the paper, the line is noticeably not straight; it contains a small amount of signal. They have some amount of a clue, but not a lot. If I had to extract self-assessment scores for the quartiles from the figure, I'm getting 58, 62, 72 and 76. Now, unless the number of participants is just dogshit here, that's certainly a correlation in my book. It's not nearly as great as their actual score differences, but crucially, the first quartile is more overconfident than the 4th is underconfident, on account of an average self-assessment way in excess of 50.
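To put rough numbers on that (using my eyeballed figures above, so treat this as purely illustrative): compare each quartile's mean self-assessment with the quartile's ideal mean percentile.

```python
# Rough arithmetic on the eyeballed per-quartile self-assessments above.
self_assessed = [58, 62, 72, 76]     # eyeballed mean self-assessment per quartile
actual = [12.5, 37.5, 62.5, 87.5]    # ideal mean percentile of each quartile

for q, (s, a) in enumerate(zip(self_assessed, actual), start=1):
    print(f"Q{q}: self {s} vs actual {a:4.1f} -> gap {s - a:+5.1f}")
```

The gaps come out to roughly +45, +25, +10 and -12: the bottom quartile is far more overconfident than the top quartile is underconfident, and self-assessment still rises with actual score, which pure noise would not do.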
The DK plot is not plotting nothing but noise. One need only compare the DK plot to Fig 9 and spot the differences. Those differences cannot be tortured out of random data. What this blog is showing is how one can present data to make it seem less random, but it doesn't quite reach all the way to "all of this DK figure is noise".
I'd also contend that asking people to self-assess as a quantile is bound to give regression-to-the-mean results, artificially compressing scores and leading to an overconfident first quartile and an underconfident 4th quartile. But that's mostly beside my original point.
The original Dunning-Kruger study only had 87 participants. Once you see past the statistical malfeasance, the amount of correlation actually shown in the graph is so small that it's hard to draw any conclusion from such a small sample. The study only seemed to mean something because the correlation looked so strong, but that was an illusion.
And indeed, in one of the well-known papers debunking the original study, once you correct for this with a larger sample size you find the opposite of Dunning's results:
https://digitalcommons.usf.edu/numeracy/vol10/iss1/art4/
Our data show that peoples' self-assessments of competence, in general, reflect a genuine competence that they can demonstrate.
There might be some small effect at play, but it is confounded by attenuation due to unreliability: measurement noise weakens the self-vs-actual correlation, so the more noise, the more exaggerated the effect looks when plotted this way. So yeah, not really a terribly useful result if the worse your measurement is, the more prominent the effect becomes. 🙂
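A toy model of what I mean (my own sketch, not from the post; it assumes both the test score and the self-assessment see the same underlying skill through independent noise, with nobody in the model biased at all):

```python
import random
import statistics

def percentile_ranks(values):
    """Convert raw values to percentile ranks on a 0-100 scale within the sample."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for pos, i in enumerate(order):
        ranks[i] = 100.0 * pos / (len(values) - 1)
    return ranks

def bottom_quartile_gap(noise_sd, n=10_000, seed=0):
    """Mean self-assessed minus mean measured percentile in the bottom measured quartile."""
    rng = random.Random(seed)
    true_skill = [rng.gauss(0, 1) for _ in range(n)]
    # Both the test and the self-assessment measure true skill plus independent noise.
    measured = percentile_ranks([t + rng.gauss(0, noise_sd) for t in true_skill])
    self_est = percentile_ranks([t + rng.gauss(0, noise_sd) for t in true_skill])
    bottom = [i for i in range(n) if measured[i] < 25]   # bottom quartile by measured score
    return (statistics.mean(self_est[i] for i in bottom)
            - statistics.mean(measured[i] for i in bottom))

for sd in (0.1, 0.5, 1.0, 2.0):
    print(f"noise sd {sd}: apparent bottom-quartile overconfidence ~{bottom_quartile_gap(sd):.1f}")
```

The apparent "the unskilled are overconfident" gap grows as the measurements get noisier, even though self-assessment tracks true skill exactly as well as the test does, which is the attenuation problem in a nutshell.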
I don't know that I follow your comment, but my understanding is that figure 9 is demonstrating how the original DK result can be generated from a random number generator. It is the process by which the data was categorized and labeled which led to the autocorrelation.
My comment regarding Fig 9 is that it lacks several important features of the data as DK originally presented it. It's similar, but once you get into the way it's presented, Fig 9 is clearly just noise, while DK's figure has some signal: namely, systematic overconfidence, and a weak but noticeable correlation between self-assessment and external assessment.
I agree, I think, in that this post doesn't give a particularly nuanced view of it. The post more or less chalks it up to plotting flaws.
At the same time, any effect in the original DK work reflects attenuation due to unreliability, because of what is being measured and how. What that means is that the apparent DK effect increases with statistical noise, which is ... bad. So the DK paper might have shown a small effect, but the original plot makes it seem exaggerated.
This post doesn't really talk about attenuation, but does show an alternate visualization of a similar test in subsequent figures that didn't replicate the results.
Not sure how this shows that the D-K effect is a result of autocorrelation. It's not the best type of graph, due to the x vs. x issue as explained, but if your self-evaluation has a slope of 1 you're highly correlated, whereas if it has a slope of 0 you're uncorrelated. The simulated "self evaluation" is centered on the 50th percentile and has a slope of ~0, whereas the D-K data is centered around the 65th percentile with a slight upward slope. Demonstrating that self-evaluation is not too different from noise with a mean around the 66th percentile shows that self-evaluations are only weakly correlated to actual skill level, and that on average the study population thinks they are above average whether they are skilled or unskilled.
Probably the bigger problem is the bias in D-K's study population, which was made up entirely of Cornell undergrads. Given that it's one of the top schools in the nation, I think it should be expected for Cornell undergrads to rate themselves higher than average on intellectual tests. They've been better than average on tests for most of their lives.
shows that self-evaluations are only weakly correlated to actual skill level,
The problem is that that wasn't the conclusion Dunning and Kruger drew from the result. Instead, the conclusion was that unskilled people specifically were biased to overestimate their competency, more so than highly skilled people were to underestimate theirs.
That was the purported Dunning-Kruger effect, and it isn't really borne out by the data when analyzed with the right lens, especially with a whopping sample size of 87.
They definitely gloss over correlation when it suits the narrative, but they did describe the correlation as "modest" in at least one part of the D-K paper. (Study 1 resulted in the chart referenced as Figure 2 in the linked debunking article.)
Study 1 revealed two effects of interest. First, although perceptions of ability were modestly correlated with actual ability, people tended to overestimate their ability relative to their peers. Second, and most important, those who performed particularly poorly relative to their peers were utterly unaware of this fact
I think the second "effect" is really just an interpretation of the first one. If every quartile rated themselves higher than average, then it's just math that the lowest quartile will have the largest difference between self and real evaluations.
I still see the chart demonstrating this, but I think it's likely due to the bias in study population as described above, and probably partially due to how we use numerical scales. When we rate things on a scale from 1-10, usually 5 is considered "bad." But if 5 is already bad, then what are 1-4? So in many real-life cases a 1-10 scale is actually used as a 5-10 scale. Because of this, it's hard to get a significant number of 10th percentile people to rate themselves as a 1 out of 10, even if they're somewhat self-aware. Even a person who's 10th percentile in a skill could say, "I'm pretty bad, but probably not literally the worst, so I must be a 2 or 3 at least." Then there are those who are not terribly self-aware and would rate themselves as above average no matter how bad they are.
TLDR: I think there are a lot of problems with the D-K study, but I don't think autocorrelation has anything to do with it.
If you think about it, the percentiles part is particularly distorting. For the percentiles to be accurate, you'd not ONLY have to accurately evaluate your own competency, but ALSO correctly estimate the distribution of competency among the other 87 participants (whom you don't know!). Likely people used what they think of as a national average as a heuristic, and that reference group probably performs worse than a random sample of 87 students from the Cornell student body.
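A quick toy illustration of that point (all the numbers here are invented purely for illustration): the same raw score maps to a very different percentile depending on whether you mentally compare yourself against a broad "national" distribution or against the actual pool of 87 strong students.

```python
import random

rng = random.Random(42)
my_score = 60.0

# Invented reference distributions, purely for illustration.
national = [rng.gauss(50, 15) for _ in range(100_000)]   # imagined general population
cornell_pool = [rng.gauss(65, 10) for _ in range(87)]    # imagined pool of 87 strong students

def percentile(score, population):
    """Fraction of the population scoring below `score`, expressed as a 0-100 percentile."""
    return 100.0 * sum(1 for x in population if x < score) / len(population)

print(f"vs imagined national average: {percentile(my_score, national):.0f}th percentile")
print(f"vs the actual pool of 87:     {percentile(my_score, cornell_pool):.0f}th percentile")
```

If participants anchored on something like the first comparison while the study graded them against the second, their percentile self-estimates would come out inflated even with a perfectly accurate sense of their own raw ability.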
For a student, if you ask if they did above average (on a test, say), another reasonable thing to do is compare themselves to other students in the same class.
We simply don't know. It's important for any survey to know the exact question asked, and ideally, they would also ask people what they were thinking when they answered in a certain way.
Just a few musings:
It seems similar to the phenomenon where a majority of drivers rate themselves as above average. Obviously you can't have a majority being above average. And if you asked people to self-assess and then ran a test to quantify skill/safety/whatever, you would expect to see a larger gap between self-assessment and measurement in the bottom quartile: because most people rated themselves a "7", or whatever above average is on the relevant scale, the bottom quartile is obviously furthest away.
The takeaway wouldn't necessarily be that the worse a driver you are, the less able you are to self-assess, but more that people's thought heuristics tend towards common anchor points, and it takes a structured method of self-assessment to overcome that. This is probably true for many things that require giving a quick answer to a difficult or multifaceted question; a structured method of response would be too slow and wouldn't fit in such a study. Are we measuring technical skill, or safety, or something else? What constitutes a skilled vs. unskilled response to avoiding a hazard, a hard brake vs. a swerve? So our brain makes a bunch of quick, hand-wavy assumptions for us, and we say "I dunno, probably pretty average, maybe just a little above" without consciously thinking about the details.
It reminds me somewhat of Daniel Kahneman's Thinking, Fast and Slow, in that it might just be the manifestation of the heuristics our minds use. And maybe some pride and social effects.
But I do think that, because of all of those factors plus what you mentioned, the DK effect as stated in the paper reflects attenuation due to unreliability: the effect is more prominent (visually, as they graphed it) the noisier the data you have, making it not a strong explanatory tool.
It sounds like they ask the person to rate how they did after taking the test. I'd be interested to see the results if you were to ask how someone predicts they would do on a topic.
Another good site on debunking the effect - http://danluu.com/dunning-kruger/
And other popsci “facts”
The Peter principle seems more appropriate: rise until you suck at your job, then stay there, or rise farther, or fall.