When I was a kid, they told me never to use a box plot unless I knew my data was unimodal. It seems to me that most of the cases this author complains about are situations where it is inappropriate to use a box plot, so he misses out on what the box plot offers when it is appropriate.
Where a box plot really shines is when you are comparing multiple unimodal distributions, particularly if your data are sparse or noisy. There has been a trend over the last decade or so toward trying to squeeze every little bit of insight you can from your data, looking for any pattern you can see. I believe this is misguided, because it leads to statistics which are not robust against sampling errors. In contrast, the quantiles are very robust statistics, particularly if you terminate the whiskers at 5% and 95% quantiles rather than 0% and 100%. That is to say, even if the particular sample you have has a little cluster here or there, or if you have a few outliers sneaking in from a different underlying distribution, the quantiles still give you a reliable description of the distribution you are studying. If two samples have very different quantiles, you can reasonably infer that the underlying distributions are different.
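For example, matplotlib's boxplot can draw the whiskers at chosen percentiles via its whis argument; a minimal sketch with made-up data, using the 5/95 cut-offs suggested above:

```python
# Two made-up samples, compared with whiskers at the 5th/95th percentiles
# instead of the default 1.5*IQR rule.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
a = rng.normal(loc=10, scale=2, size=200)
b = rng.normal(loc=12, scale=3, size=200)

fig, ax = plt.subplots()
ax.boxplot([a, b], whis=(5, 95), labels=["sample A", "sample B"])
ax.set_ylabel("measurement")
plt.show()
```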
As much as we like a visualization to reveal the data, it is just as important that it conceal the data. We are good at seeing patterns, even if they do not mean anything. The discipline in statistics is to remove those parts of the data which people are apt to misinterpret, hopefully leaving something useful. Box plots are designed to conceal the variability which (for example) a strip plot reveals.
But if you don’t look at the data, how do you know it’s unimodal?
Maybe due to theoretical considerations, but what if there’s an error doing the experiment? Looking at the raw data and seeing that it’s not what you expected can be a pretty good clue that you made a mistake somewhere.
Maybe that’s too much detail when publishing results. But for readers who don’t necessarily trust the results of someone else’s experiment, a good visualization of the raw data might help to convince them that the experiment was done right.
It’s true that statistics can sometimes be used to guard against coming to unwarranted conclusions from random variation. On the other hand, statistics can be done wrong, and it’s sometimes used to try to justify conclusions that you couldn’t see if you plotted the raw data. And that makes me wary. I think I’d be most convinced if they agree - that is, the conclusion is claimed to be statistically justified, and you can also see the pattern for yourself. It’s two different ways of seeing things.
Edit: I also wanted to say that I agree with your concern about people trying to infer too much from too little data. But since I'm a computer programmer, the kind of data I look at is computer-generated; it's cheap to generate more data from automatic tests. Rather than doing fancy statistics, I will typically just generate more than enough data to make the trend obvious.
People in fields where testing is very expensive likely have a different perspective.
To address your last point: I'm in that sort of field. Unfortunately, we make as much data as we can with the budget we have and then have to do our best with it. I like to think I don't stretch what the statistics show, but I imagine we all like to think that.
“But I don’t find box plots hard to understand.”
This is how I felt clicking into the article, but then I realized I'd totally forgotten that the segments are quartiles instead of deviations or the like, even though I used to use them all the time for chemistry research papers (because they present all the necessary information).
I've been convinced there are generally better visualizations we can easily generate with computers. I'm really happy with the progress made in data visualization over the past couple decades.
I agree with much of what the author has to say, but they overlooked what I think is a good solution: box + dot plot (or "jittered strip plot" in the author's terms). To me this is the best of both worlds. You show the raw data pattern but also get the summary statistics (median, quantiles, outliers, skew) on top.
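A minimal sketch of that box + jittered strip combination with seaborn (using its bundled "tips" example dataset, which is downloaded on first use; any long-form data frame would do):

```python
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")  # example data shipped with seaborn

# Box plot for the summary statistics, raw points jittered on top.
# Fliers are hidden because the raw points already show them.
ax = sns.boxplot(data=tips, x="day", y="total_bill", showfliers=False)
sns.stripplot(data=tips, x="day", y="total_bill",
              color="black", alpha=0.5, jitter=True, ax=ax)
plt.show()
```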
Cédric Scherer is someone I respect in the data viz world. Here he has a tutorial/blog post on something similar, a "raincloud plot". I find it overkill but I think he tackles the issue even better: Visualizing Distributions with Raincloud Plots
The author mentions some other ways to visualize the data without going into them, and I would just like to highlight bee swarm plots for discrete, uniformly spaced, smaller data sets (e.g. integer years of age) and violin plots for continuous data and larger data sets.
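Both are one-liners in seaborn; a sketch with the same bundled "tips" example data, purely for illustration:

```python
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
sns.swarmplot(data=tips, x="day", y="total_bill", size=3, ax=ax1)  # bee swarm: every point, nudged so none overlap
sns.violinplot(data=tips, x="day", y="total_bill", ax=ax2)         # violin: kernel density estimate per group
plt.show()
```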
Violin plots are bad. I used to use them because I was under the impression they were good, but the fact of the matter is that in any situation where you could use a violin plot you almost certainly want to use either a box plot or a histogram -- maybe, in some circumstances, both. The Frankenstein's combination that is a violin plot is worse than just having separate box plots and histograms next to each other, as they're harder to read than either would be alone. Plus they look like labia.
Discussion on HN: https://news.ycombinator.com/item?id=40765183
I read this the other day, and was left thinking… well, the other suggestions the author provides aren’t built into matplotlib (at least, as far as I could find), so I’m probably just gonna stick with a boxplot 🤷.
I typically use them for looking at, e.g., the range of performance over many server nodes, to identify particularly slow outliers that end up limiting the overall simulation speed. Works OK-ish for that.
The easiest thing might just be to use seaborn instead:
- Jittered strip plot
- Swarm plot
- Violin plot (also in matplotlib: plt.violinplot)

Alternatively, if you want to stick with matplotlib without writing your own equivalents of those plots, you could just use a histogram (plt.hist). That sounds fine for your use case?

+1 to seaborn. It adds a ton of nice features and color palettes that can improve basically any plot. Documentation is solid too.
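For the matplotlib-only fallback mentioned above (plt.hist), a sketch with invented per-node timings:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
node_times = rng.normal(loc=1.0, scale=0.1, size=64)  # made-up seconds per step, one value per node
node_times[:3] += 0.5                                  # a few slow stragglers

plt.hist(node_times, bins=20)
plt.xlabel("seconds per step")
plt.ylabel("number of nodes")
plt.show()
```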
The problem with a histogram for my case is simply that I need to actually be able to convey which servers are underperforming, i.e., I’ll stick a label under each. But I’ll have to check out seaborn; I’ve seen it many times and just never made the switch.
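If keeping the per-server labels is the point, one option (hypothetical node names and made-up timings) is a sorted bar chart rather than a histogram, so the slow nodes are both visible and named:

```python
import numpy as np
import matplotlib.pyplot as plt

nodes = [f"node{i:02d}" for i in range(16)]             # hypothetical server names
times = np.random.default_rng(2).normal(1.0, 0.05, 16)  # made-up seconds per step
times[4] += 0.4                                          # one straggler

order = np.argsort(times)                                # slowest nodes end up at one edge
plt.barh([nodes[i] for i in order], times[order])
plt.xlabel("seconds per step")
plt.tight_layout()
plt.show()
```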
I like Observable Plot for making nice graphs, but it's a bit tricky to use and it's in JavaScript. I don't know what's best for Python.