[content warning: some non-explicit discussion of self-harm]
Goodhart’s Law is a concept used in data science which goes like this:
When a measure becomes a target, it ceases to be a good measure.
Goodhart’s Law is usually applied to the behavior of other people. For example, attendance is a good way of measuring how diligent your employees are, but if you start firing people for missing days then you’ll get people coming in with colds, infecting everyone, and playing Candy Crush all day because they’re too tired to get any work done. How many papers a scientist publishes is a good way of measuring how much they work, but if you make tenure dependent on how many papers a scientist publishes they’ll start breaking everything up into the smallest units of paper possible. How many nails a factory produces is a good way of measuring its success as a factory, but if you are a Soviet planner who requires the factory to produce as many nails as possible it will make tiny nails that aren’t useful for anything.
(There are other ways that Goodhart’s Law can end up working– for example, ice cream sales are a good way of measuring how hot it is, but setting a goal of selling a large amount of ice cream each day will not make the weather nicer– but these are not relevant for my post.)
However, Goodhart’s Law can also be applied to yourself.
People often set self-improvement goals, and when they do they often think of some way to measure what they care about. For example, if you want to exercise more, you might set a goal to go to the gym three days a week. If you want to finish a novel, you might set a goal to write five hundred words a day. If you want to have a better relationship with your husband, you might set a goal to have fewer than four fights per month.
Sometimes, the thing you’re measuring is directly the thing you care about. For example, if you are chronically sleep-deprived and decide to track how tired you feel in the morning, you aren’t going to encounter Goodhart’s Law problems, because tiredness is actually the thing you care about.
Often, however, the thing you’re measuring is different from the thing you care about. If you want to exercise more, you don’t want to fuck around at the gym on your phone, you want to take a class or use the treadmill or lift up heavy things and put them down.
Some of the ways Goodhart’s Law operates with people’s goals can be really obvious. For example, some people finishing NaNoWriMo will name their characters things like “Lady Mary von Grackle the Fourth” and then use the entire thing every time she comes up, or include the entire lyrics of every song their character is listening to, or edit every line of dialogue to include “X said” even if it is perfectly obvious who’s talking. If you are doing this stuff, it’s pretty obvious that you’re Goodharting your goal of writing a 50,000-word novel.
On the other hand, sometimes it’s not obvious at all, and that’s where you run into real trouble. You might be really proud of yourself for not getting into fights with your husband anymore– but instead you’re walking on eggshells avoiding every topic that might upset him and failing to bring up topics which you really ought to bring up, which actually makes your relationship worse.
And sometimes things can be Goodharting for some people and not for others. Let’s say your goal is to stop self-harming. For some people, the goal is actually to stop self-harming: maybe they’re tired of getting scars or it frightens other people. For other people, the goal is to avoid getting into situations where you’re so emotionally fucked up that self-harming seems like a good idea. If your goal is that second thing, white-knuckling through your self-harm urges by drawing red lines on yourself is actually useless– it achieves your target but does nothing about your goal.
Similarly, let’s say you set a target to do three things off your to-do list each day. For many people– perhaps most– the real goal would be to accomplish things, and the worries about Goodharting would mostly be related to putting unnecessary things on your to-do list so you can check them off. But if you have depression, your goal might be to recover from depression. You might drag your brain over metaphorical rocks getting yourself to do some dishes and cook dinner and achieve your target, but you’re still depressed.
Goodharting can get you into trouble in two ways. First, as in the case of fights with your husband, your target might be so poorly specified that it gets you to do things that are actively counterproductive to your goal– like not bringing up problems in your relationship.
Second, as in the self-harm, depression, and NaNoWriMo cases, reaching your target won’t directly harm your goal. You can search-and-replace “Lady Mary von Grackle the Fourth” with “Mary” and get a readable book. Doing more things off your to-do list might even make you less depressed, if you’re the sort of person who tends to get less depressed if you’re more active. (Or more depressed, if you’re drawing on emotional reserves that you really shouldn’t be drawing on. It can go either way.)
The problem is that Goodharting misleads you about whether you’ve met your goal. You think you’ve written a novel, but when you cut out all the padding it’s a novelette at best. You think you’ve fixed your depression, but actually you’re just willpowering your way through doing the dishes. You think you’ve learned how to regulate your emotions better, but actually you’ve learned that if you self-harm by holding ice instead of by cutting you can pass it off to your therapists as a healthy coping mechanism. You’re putting a lot of work in– but you’re not going to have the outcomes you want.
How do you avoid Goodharting? It can help to explicitly distinguish “goal” and “target”: your goal isn’t to go to the gym three times a week unless you’d be just as happy if you spent the entire time at the gym reading a nice novel. That way, you can notice when you’re meeting your targets but not your goals. If your relationship with your husband is getting worse, you can step back and reassess.
It can also help to deliberately avoid doing things that help you reach your targets but not your goals. This is one way that single-person Goodharting is much easier to solve than multi-person Goodharting: you can just decide that you’re not going to Goodhart, once you’re aware that this is an issue. For example, if you’re depressed, you might commit to never using willpower to get yourself to do things. If you’re writing a novel, you might decide not to use cheap tricks to pad your word count.
In other cases, that isn’t realistic. For example, you might not want to commit to self-harming every time you feel like self-harming, and if you’re depressed you might sometimes need to force yourself to do the dishes so you have something clean to eat off of. In those cases, you might want to count Goodharted things separately. For example, as a depressed person, you might want to separately track things you did without willpower and things you did with willpower; if you’re trying to recover from self-harming, you might want to track both self-harm instances and strong urges to self-harm.
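If you keep your tracking in a spreadsheet or a script, counting Goodharted things separately might look something like the rough Python sketch below. It uses the to-do list example. The names (DoneItem, used_willpower, tally, target_met) and the sample entries are all made up for illustration; the only thing taken from the post is the idea of a three-things-a-day target and a separate count for things done with and without willpower.

```python
from dataclasses import dataclass


@dataclass
class DoneItem:
    """One thing checked off the to-do list (hypothetical record format)."""
    description: str
    used_willpower: bool  # assumption: you judge and record this yourself


def tally(items):
    """Count finished items, split by whether they took willpower."""
    with_willpower = sum(1 for item in items if item.used_willpower)
    return {
        "with_willpower": with_willpower,
        "without_willpower": len(items) - with_willpower,
        "total": len(items),
    }


def target_met(items, target=3):
    """The target from the post: at least three things done in a day."""
    return len(items) >= target


# Invented sample day, for illustration only.
today = [
    DoneItem("do the dishes", used_willpower=True),
    DoneItem("cook dinner", used_willpower=True),
    DoneItem("answer an email", used_willpower=False),
]

counts = tally(today)
print("target met:", target_met(today))
print("done without willpower:", counts["without_willpower"])
print("done with willpower:", counts["with_willpower"])
```

The point of keeping two numbers is that the target can read as met while the count that is closer to your goal (here, things done without willpower) stays low, which is exactly the gap you are trying not to lose sight of.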