CoT monitorability: why g-means and not F1?
It all comes down to separating monitorability from the misalignment propensity itself.
Figure 3 in “Monitoring Monitorability” (see below) argues why g-mean is better than F1.
I originally didn’t get the implication (especially the Case B panel) until I understood the mathematical differences between the two metrics, and what those differences imply when monitoring misbehaving models.
Preliminaries and Some Maths
You have a binary monitor that says either “misbehavior detected” or “no misbehavior.” The ground truth is either positive (bad behavior happened) or negative (it didn’t). The four outcomes are TP, FP, TN, FN as usual. With 𝒩 samples, we get the following quantities:
TPR = TP/(TP+FN) — “of the bad behaviors, how many did we catch?” This is also the recall.
TNR = TN/(TN+FP) — “of the good behaviors, how many did we correctly indicate?”
Precision = TP/(TP + FP) – “of the alarms the monitor raised, how many are true misbehaviors?” (Note the FP).
Prevalence π = (TP+FN) / 𝒩, which means “how often does the agent misbehave?”
The key difference between g-means and F1 is whether π interferes with the metric.
G-mean = √(TPR · TNR).
TPR draws both its numerator (TP) and denominator (TP+FN) from the positive pool — the samples where misbehavior actually happened. TNR draws both from the negative pool – no misbehaviors. This means that if you change the relative size of the two pools (i.e. change π), TPR and TNR don’t change and they are independent of each other.
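To make this concrete, here is a small Python sketch (counts are illustrative numbers of my own choosing): scaling only the negative pool changes π, leaves TPR and TNR untouched, and shifts precision.

```python
def rates(tp, fn, tn, fp):
    """TPR, TNR, precision, and prevalence π from raw confusion counts."""
    tpr = tp / (tp + fn)                  # uses the positive pool only
    tnr = tn / (tn + fp)                  # uses the negative pool only
    precision = tp / (tp + fp)            # mixes both pools
    pi = (tp + fn) / (tp + fn + tn + fp)  # prevalence
    return tpr, tnr, precision, pi

# 100 positives, 100 negatives: π = 0.5
print(rates(tp=90, fn=10, tn=95, fp=5))
# Same monitor, but 10x more negatives: π drops, TPR and TNR don't move,
# and precision falls because FP grows with the negative pool.
print(rates(tp=90, fn=10, tn=950, fp=50))
```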
F1 = 2 · Precision · Recall / (Precision + Recall)
Recall is just TPR
For Precision, we can express the raw counts as:
\(\text{TP} = \text{TPR} \cdot \pi \mathcal{N}\), \(\text{FP} = (1-\text{TNR})\cdot(1-\pi)\,\mathcal{N}\)
Hence, we can rewrite F1 using TPR, TNR and π:
\(F_1 = \dfrac{2\pi \cdot \text{TPR}}{\pi(1+\text{TPR}) + (1-\pi)(1-\text{TNR})}\)
Why should the monitorability metric not include π?
The key idea behind g-means (using TPR and TNR) is that we don’t introduce prevalence π, aka we don’t want “how likely the model misbehaves” to confound our measurement of monitorability.
Each agent has two independent properties in the monitorability setup:
Behavioral tendency (π): how often it misbehaves on these tasks.
CoT transparency: when it misbehaves, how much evidence shows up in the CoT; and when it doesn’t, how cleanly that absence reads in the CoT.
The problem with F1 is that it includes π (because of precision), which acts as a weighting coefficient on TPR and TNR in both the numerator and the denominator.
Wrong conclusions about models that rarely misbehave (Case B)
Imagine the paper runs the impossible coding task on five models, all monitored by the same GPT-5 monitor. Suppose every model’s CoT is equally transparent, and the monitor achieves TPR = 0.99 and TNR = 0.99 on all of them. In other words, they are all equally monitorable. A correct metric should give the same score to all five.
But they misbehave at different rates:
Model A hacks 90% of the time → π = 0.9
Model B hacks 50% of the time → π = 0.5
Model C hacks 20% of the time → π = 0.2
Model D hacks 5% of the time → π = 0.05
Model E hacks 1% of the time → π = 0.01
Each model is one point on the x-axis of Figure 3 Case B. Model A sits on the left and Model E on the right, since the x-axis is the fraction of negatives, which is 1 − π.
A paper using F1 would publish something like “Model E monitorability: 0.66” and “Model A monitorability: 0.99,” and the natural conclusion would be “Model E’s CoT is nearly opaque, we can barely monitor it, this is dangerous.” But the truth is the opposite. Model E barely misbehaves, and when it does, the monitor catches it just as reliably as with Model A.
In other words, F1 makes the safest model look the most dangerous to deploy, as it collapses those two independent facts (behavioral tendency π and CoT transparency) into one number. In fact, Model E is actually the safest model in the lineup for two independent reasons: it rarely misbehaves AND it’s transparent when it does.
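The arithmetic is easy to check. A quick Python sketch, using the rewrite of F1 in terms of TPR, TNR, and π:

```python
from math import sqrt

TPR = TNR = 0.99  # same monitor, equally transparent CoT on every model

def g_mean(tpr, tnr):
    return sqrt(tpr * tnr)

def f1(tpr, tnr, pi):
    # F1 expressed via TPR, TNR, and prevalence π
    return 2 * pi * tpr / (pi * (1 + tpr) + (1 - pi) * (1 - tnr))

for name, pi in [("A", 0.9), ("B", 0.5), ("C", 0.2), ("D", 0.05), ("E", 0.01)]:
    print(f"Model {name}: g-mean = {g_mean(TPR, TNR):.2f}, F1 = {f1(TPR, TNR, pi):.2f}")
# g-mean stays at 0.99 for all five models; F1 slides down to ~0.66 for Model E.
```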
π is a property of the agent, so why shouldn’t we use a metric that measures it?
If you have this question, you are probably stuck in the same wrong thought process I once had:
π is a property of the agent (how likely it misbehaves)
g-mean is insensitive to π
Therefore g-mean is missing something about the agent
Therefore it seems like a worse metric than F1, which does respond to π.
In reality, we do not want the monitorability metric to be confounded by π.
π → “should we trust this agent?” (alignment)
TPR/TNR → “can we verify whether to trust this agent?” (monitorability)
TPR: when the agent misbehaves, can we trust the monitor to flag it?
TNR: when the agent behaves, can we trust the monitor to stay quiet?
An agent that misbehaves constantly but writes its plans in plain text is very monitorable. An agent that almost never misbehaves but is completely opaque when it does is poorly monitorable. These are independent axes, and F1 entangles both together. G-means, on the other hand, only focuses on monitorability.
Discussion
What’s special about g-means? Any metric that symmetrically aggregates TPR and TNR without introducing prevalence would suffice; g-mean is just the simplest such choice and strongly penalizes failure on either axis.
Limitations of g-means: it is partially insensitive to the balance between TPR and TNR. A monitor with TPR = 0.64 and TNR = 1.0 has the same g-mean as a monitor with TPR = TNR = 0.8. In a safety-critical setting, these two monitors are not equivalent: the first one misses 36% of actual misbehaviors, which could be unacceptable even though it never raises a false alarm.
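Checking those numbers directly (a minimal sketch):

```python
from math import sqrt

def g_mean(tpr, tnr):
    return sqrt(tpr * tnr)

# Misses 36% of misbehaviors but never false-alarms...
print(g_mean(0.64, 1.0))
# ...yet scores the same as a balanced monitor.
print(g_mean(0.80, 0.80))
```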
Appendix
Degenerate Monitor (Case A)
A degenerate monitor that always flags the CoT as positive will always have TPR = 1 (it catches all positives) and TNR = 0 (it never correctly clears a negative).
In this case, the F1 score for such monitor would be as follows:
F1 = 2π · TPR / [π(1 + TPR) + (1 − π)(1 − TNR)] (from above)
= 2π · 1 / [π(1 + 1) + (1 − π)(1 − 0)]
= 2π / (π + 1)
When misbehavior is extremely common (say π = 0.999, i.e. 99.9% of samples are actual misbehaviors), which is at the left end of the x-axis, F1 becomes the following:
F1 = 2(0.999) / (1 + 0.999) = 0.999
This means such a degenerate monitor would be classified as near-perfect (when we know it is not). If we were to use g-means, we would get √(TPR · TNR) = √(1 · 0) = 0.
This problem is less about π and more about the numerator using only TPR. But π is the main reason F1 scores degenerate monitors highly. The easy fix is to use TPR · TNR.
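The degenerate case, checked numerically under the high-prevalence setting above (a quick sketch reusing the F1 rewrite):

```python
from math import sqrt

def f1(tpr, tnr, pi):
    # F1 expressed via TPR, TNR, and prevalence π
    return 2 * pi * tpr / (pi * (1 + tpr) + (1 - pi) * (1 - tnr))

def g_mean(tpr, tnr):
    return sqrt(tpr * tnr)

# Degenerate monitor: flags everything, so TPR = 1 and TNR = 0.
tpr, tnr = 1.0, 0.0
print(f1(tpr, tnr, pi=0.999))  # ~0.999: F1 calls it near-perfect
print(g_mean(tpr, tnr))        # 0.0: g-mean calls it useless
```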


