What examples of Goodhart's law have you encountered in your own life?
When a measure becomes a target, it ceases to be a good measure.
For example: my parents' health insurance company incentivized physical activity[1] by giving rebates to people who got a certain number of steps daily, as measured by Fitbits. While my parents genuinely did make an effort to walk more, there were also days where they attached their Fitbits to the dog, gave them to someone else who was going for a walk, or even aggressively tapped their feet with the device on their knees while sitting in order to meet the measurement. Thus, their step counts ceased being an actual measure of physical activity.
How does this play out in your life, job, industry, field of study, etc.? What measures have been made targets? How has that changed the reliability or validity of the measures themselves?
Also, have you experienced any counterexamples? Are there measures in your domains that haven't succumbed to Goodhart's law? Why do you think that is?
1. This was the face value reason given. I'm more cynical and feel it probably wasn't about physical activity but instead about data gathering.
I think a very obvious example is education. People cramming the day before an exam, memorizing facts and formulas when they should understand them as well. Also, cheating. Frankly, sometimes I wonder if being honest all my life was in my best interest.
Ideally, education should inform our morals and improve our intellectual abilities. In reality, it often manages to produce test-taking specialists.
I think school would be better if tests were random. You’d be graded on content up to what you should have learned last week. That would stop people from being able to prepare through cramming or cheating. You also couldn’t prepare specifically for the test even in good faith - so the tests would need to match that reality.
Really randomness should be much more common for evaluations. Elections, job interviews, etc. I don’t want to see how well someone can perform when they might have spent months preparing for that one moment. I want to see their baseline.
It would also absolutely encourage students and teachers pretty damn explicitly to seek and plug any gaps in knowledge. I feel like what held people up the most in math class was not properly understanding last year's material.
The only logical consequence then, imo, is to teach (ideally individually) at the pace it is actually understood, slowing down as necessary; not at the pace the curriculum dictates. If after 10 years of school my grades say I got to 8th grade math, then so be it. At least it indicates I understand it and can (reasonably reliably) apply that knowledge, rather than indicating that I have once upon a time memorized grade 10 math.
Thank you for posting this! I think about this question a lot. I work for a winery, and I analyze and collect data as part of my job. My conclusion is this: It can be true, if the data collector is not careful.
As a readily available counterexample: In a winery tasting room, sales and club signups are both really good measures of how well a tasting room is performing. Now, every single tasting room in existence (this may sound hyperbolic, but I have yet to see a counterexample) incentivizes their employees with club signup targets and/or sales targets. But they are both very useful measures of tasting room productivity.
I think a big differentiating factor is how easy it is to lie about something vs how rewarding the target is. In your example, lying to a Fitbit is extremely easy. I assume there is some sort of incentive attached to these targets. Hence, Goodhart’s law applies. For sales and club signups, the incentives offered to employees are always less than what it would cost to manually up the numbers. Someone paying $100-$300 for a fake club signup for themselves is stupid if the reward is $30/club. Now, if the reward were $500/club, this would be different.
I think the best way to avoid it is this: Make sure the data is difficult to fake, and make sure the reward for faking data is always (or almost always) less than the cost of faking it.
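To put numbers on that rule of thumb, here is a minimal Python sketch. The function name `faking_pays_off` is made up for illustration; the dollar figures are the ones from the winery example above, and the comparison ignores other costs of getting caught.

```python
def faking_pays_off(reward_per_signup: float, cost_to_fake: float) -> bool:
    """Return True when faking one signup nets the employee a profit."""
    return reward_per_signup > cost_to_fake


# From the example above: a fake club signup costs $100-$300 out of pocket.
cheapest_fake = 100.0

for reward in (30.0, 500.0):
    if faking_pays_off(reward, cheapest_fake):
        verdict = "worth gaming"
    else:
        verdict = "not worth gaming"
    print(f"reward ${reward:.0f} vs. cost ${cheapest_fake:.0f}: {verdict}")
```

With a $30 reward the metric is safe even against the cheapest fake; at $500 it would not be, even if the fake cost the full $300.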
True, but not always necessarily. Bad targets don't always require lying or dishonesty to be bad. I have a good example of this.
Long ago, when I started my career in technology, I worked for a small managed service provider as a technician. I always had a knack for technology and was interested in computers and networking from a young age, so I came into the job already knowing quite a bit about small and medium business networking. As a result, I fell into an escalation role: I'd handle issues that the other techs couldn't figure out, which required in-depth research, packet captures, network changes, etc.
The issue is that the techs were evaluated based on ticket closures. The more tickets you closed, the more productive you were and the more valuable you were as a worker. While to management this sounds good on paper after a cursory thought, in reality, it became a mess.
When a new ticket came in, a mad scramble to evaluate it and assign it to yourself was on. Techs would constantly refresh the queue, and if they saw something like "I can't find my icons" or "my task bar is gone", they'd quickly assign it to themselves, call or email the user, fix the issue in a minute, and close the ticket.
Something like "This internal website is throwing a python error" or "we get a database error on our ERP app when this batch job runs" would either sit languishing in the queue for hours or days, or be picked up, investigated for 2 minutes, then immediately escalated to me. These kinds of issues would take me hours or days to figure out, as I was learning new technology on the fly, having to run Wireshark captures, sometimes coding to fix bugs in random home brew scripts and applications and so forth. I thought I did a pretty good job there, but my evaluations were abysmal because while my coworkers were closing something like 100 tickets a week, I was sitting at more like 10. My manager understood when I complained about it, but I could tell that those numbers still held some sway for him because on a pretty graph, I looked like the lowest performer.
The level 1 techs weren't lying or cheating or being dishonest, they were simply rationally acting in a manner which the company incentivized based on what they measured. Since then, I've seen this kind of stuff happening all throughout my career. Fixing lots of little small things or completing lots of little, easy projects will almost always look better to an executive than working on large complex problems.
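A rough Python sketch of why the graph looked the way it did. The weekly ticket counts (100 vs. 10) come from the story above; the average hours per ticket are invented purely for illustration, and `TechWeek` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass
class TechWeek:
    name: str
    tickets_closed: int
    avg_hours_per_ticket: float  # hypothetical figure, not from the story

    @property
    def hours_of_work(self) -> float:
        """Total effort represented by the closed tickets."""
        return self.tickets_closed * self.avg_hours_per_ticket


techs = [
    TechWeek("level 1 tech", tickets_closed=100, avg_hours_per_ticket=0.25),
    TechWeek("escalation tech", tickets_closed=10, avg_hours_per_ticket=4.0),
]

# The same week, ranked two different ways.
by_count = max(techs, key=lambda t: t.tickets_closed)
by_hours = max(techs, key=lambda t: t.hours_of_work)

print(f"top performer by tickets closed: {by_count.name}")
print(f"top performer by hours of work:  {by_hours.name}")
```

Neither ranking is "the" right one. The point is only that the ranking flips depending on which number ends up on the graph.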
Ooof this hurts to read. I have the same thing going on at work with Jira tickets!
I think that this is one of the most often cited examples of Goodhart’s Law. And it certainly shows how making some metrics targets can lead to people gaming those metrics.
But I think that your example actually serves as a counterexample to Goodhart’s Law. In your example, is number of closed tickets a bad metric? I would argue that it is not. It is, in fact, an excellent metric that measures how often and quickly a ticket gets closed. The issue is this is not what the company intended to measure. The company intended to measure worker productivity and quality of service. Number of tickets closed was never a measure of that, even before it was a target. And for what number of tickets closed actually measured, it was still a good metric after it became a target.
The same holds true in my example. Club signups and sales are a measure of just that. They also indirectly measure quality of service. They have the benefit of being more difficult to game. For example, your company could retrieve a sliver of value from their metric by assigning tickets randomly. That would make it more difficult to game and more precise for what they wanted to measure. In my example, at the wineries I have worked at, table groups are assigned by a host with no interest in which server gets the table, or in a round robin fashion. This prevents servers trying to snipe people more likely to sign up for clubs, or avoiding existing club members (who are often worse tippers, can’t sign up for a club again, and often buy less).
I would rewrite Goodhart’s Law as something like this:
If you make a metric a target, it will be gamed. If your metric relies on a shared common variable, it will be inaccurate. If your metric measures your target directly, or in a way that is difficult to falsify, it may still be a valid metric.
Certainly not as pithy, but I think this is more accurate.
I agree, it's not really an accurate saying. I've heard an alternative, maybe less negative variation of it, which is "You get what you measure".
I think in certain circumstances with commonly measured metrics, that's fine. Your example is a classic one. Salespeople are often, and sometimes exclusively measured based on their sales performance. That's because that's the one thing the company values from them.
The more sales a salesman makes, the better they are at their job. The company doesn't care about how happy the customers the salesman closes deals with are, how comfortable they are coming into the store, how well the salesperson gets along with the rest of the staff or virtually anything else besides how many sales the salesman brings in. If the salesman does that with a charming personality, hard sell intimidation tactics, or by convincing all of their friends and family to buy from the store, the company doesn't really care. They got what they want. That incentive works well for the most part.
When you have a project manager who is evaluated based on how quickly they completed a project, however, things get a little more dicey.
While yes, you do want project managers to complete projects quickly, that's not all you want from them. You want the projects to come in at or below budget, you want the products they create to be fit for task, you want the products to be maintainable, you want to retain the workers used to make the product, and not have them be completely burnt out and quit after the project is over.
If you're just measuring project managers on completion speed, you'll get quickly completed projects at the expense of every other consideration. You get what you measure.
It might have been great to use this as a target metric if what you are after is happy customers. A customer that quickly gets help with some issue (although small/non-complicated for you) will be happy. If the same customer had to wait for a few days because everyone was busy with complicated debugging they'd be much less satisfied with the service. A lot of the time helping customers with small issues is the rational thing to do for basically everyone involved. Said in a slightly different way; helping 15 customers is often better than helping 1 customer with something truly complex, at least from a business perspective.
But, and this is the big caveat, the business should realize that the difficult cases are much more time consuming. If you use the same metric for all "levels" of support the business will optimize itself out of having people who can deal with actual problems. And at that point they're screwed.
Even then, it's really not a great metric. You get what you measure, and what you're measuring is how quickly techs are closing tickets. Not necessarily how quickly issues are being resolved, or how satisfied the customer is with the resolution. This results in things like the ticket "my mouse stopped working" being closed via a note on the ticket "Get a new mouse". Or long running issues being closed and reopened to "avoid escalations", when that destroys the entire point of long running tickets being escalated. If customer happiness is what you're after, that's what you should measure. Using other metrics as a proxy for what you actually want to measure usually doesn't work out when people are involved.
Those are really good points! This is likely why so many places nowadays are asking me how likely I'd be to recommend them after nearly all interactions with either system or some support function. Minor inconvenience each time it's asked, but in aggregate it feels like we are heading for cookie banner levels of NPS-like questions...
I'm aware of certain tech companies in India, one that I almost ended up at, who measure productivity by the number of git commits. I shudder to think of the commit histories of their code bases.
That is a step up above measuring lines of code. The tiniest step I can imagine though.
I've heard of this but never encountered a company that actually did it. I think my company looked into it as a knowingly weak proxy to understand how remote working was going productivity-wise. I'd actually be really really interested to look at their commit histories to see what the gamification looks like in practice to be honest.
I used to work at a very large e-commerce company that you've probably used before. They pride themselves on being data-driven and having deep, custom A/B testing infrastructure.
I found that being "data-driven" often resulted in many myopic decisions made by individual actors and teams.
Customer support calls were tracked as a KPI. Fewer customer support calls means fewer customers are having problems, right?
So, a team obfuscated the customer support page by removing its link from the front page. It could then only be accessed through the product page. This decreased the number of calls (success!) but worsened the UX. It frustrated many customers who didn't want to dig around to find the support phone number, email, or chat channel. So the link was then restored.
The number of A/B experiments was used as a loose indicator of employee performance, so this incentivized many small, annoying changes to the website.
There is an article from Academy of Management Journal that touches on this subject:
"On the Folly of Rewarding A, While Hoping for B"
This was discussed at length in my graduate school when we were reviewing how to best build performance management systems. The article lists many good examples and details some of the causes of this phenomenon. You might get some insights from it!
Analyst recommendations.
This includes all analyst recommendations including stocks and bonds, but in my own life I have dealt with analysts who rate enterprise software.
The vendors are clearly motivated to be rated as highly as possible, so they have a more effective sales strategy.
The analysts are also motivated to have as many vendors compete for pole position, so their research will sell and more customers will feel the need to purchase additional services. The analysts are also under intense pressure to sell more and restrain costs, which is not a formula for a good measure of product quality.
Recently we have transitioned to self service peer reviews of corporate software such as G2 & Trust Radius. It's the same game, but relies on simple aggregates of end user reviews.
I gained some weight and then lost a lot over the course of the pandemic. One of the principles that I think made it a success was adhering to daily, standardized, recorded measurement. I'd use a scale and record the outputs every morning around the same time and drop it into a spreadsheet (with an excessive amount of analytics) -- which I still do. It took a while to learn the natural variance of the numbers, how to interpret changes as they spike up and down, and what is indicative of long term change vs. just a temporary spike/drop. Learning to manage my mentality in seeing the numbers go up was tougher than expected; it's not something I could initially just move on from without feeling at least slightly bad. I think consistent data tracking is key for anyone looking to transform their body or health, though there is a learning curve in managing your mindset and understanding around it. Overall, this felt like it might be a somewhat interesting counterexample.
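For anyone curious what separating "long term change" from "temporary spike" can look like, here is a minimal Python sketch using a trailing moving average. This is an assumption about one reasonable approach, not the actual spreadsheet analytics described above, and the readings are made up (units don't matter here).

```python
def moving_average(values, window=7):
    """Trailing average over the last `window` readings (fewer at the start)."""
    averages = []
    for i in range(len(values)):
        recent = values[max(0, i - window + 1): i + 1]
        averages.append(sum(recent) / len(recent))
    return averages


# Made-up daily readings: noisy day to day, slowly drifting down.
daily_weights = [80.0, 80.6, 79.8, 80.4, 79.6, 80.1, 79.4, 79.9, 79.2, 79.7]
trend = moving_average(daily_weights, window=7)

for day, (raw, smooth) in enumerate(zip(daily_weights, trend), start=1):
    print(f"day {day:2d}: raw {raw:.1f}  trend {smooth:.2f}")
```

The raw column jumps around from one day to the next, while the trend column moves much more slowly, which makes a single bad morning easier to shrug off.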
I lost a lot of weight only measuring my weight every month or so. It forced me to be really really aggressive in my food and exercise decisions because I was flying blind (well except for counting calories) and needed to err on the side of caution.
Interesting, guess it comes down to personal preference in "measure a ton, get too much visibility, wrestle with how to interpret it all" or "measure sparingly, focus on good things, worry slightly at the blindness"
All these SCRUM/AGILE project management systems are the worst offenders.
At every company I've worked at, ever since the cargo cult began trying to measure team performance in a "data-driven" way, productivity and software quality have gone downhill.