Beyond up or down
A classic failure detector declares a node up or down based on a fixed timeout. But network delays vary, so a single timeout is either too short, causing false alarms, or too long, making detection slow. The phi accrual detector instead outputs a continuous suspicion value.
How phi is computed
The detector records the recent history of heartbeat arrival intervals and fits a distribution to them.
- At any moment it can compute phi: the negative base-10 logarithm of the probability, under that distribution, that a heartbeat from a live node would arrive later than the time already elapsed since the last one.
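- A small phi means probably alive; a large phi means probably dead. With a base-10 logarithm, phi = 1 corresponds to roughly a 10% chance that suspecting the node is a mistake, and phi = 2 to roughly 1%.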
- Each application picks its own threshold on phi to declare failure.
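The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes heartbeat intervals are approximately normally distributed, and the class name, `window_size`, and `min_std` are hypothetical choices made for the example.

```python
import math
from collections import deque
from statistics import mean, pstdev


class PhiAccrualDetector:
    """Illustrative phi accrual detector; names and defaults are assumptions."""

    def __init__(self, window_size=100, min_std=0.01):
        # Sliding window of recent heartbeat inter-arrival times (seconds).
        self.intervals = deque(maxlen=window_size)
        self.last_arrival = None
        # Floor on the standard deviation so a perfectly regular history
        # does not cause a division by zero.
        self.min_std = min_std

    def heartbeat(self, now):
        """Record a heartbeat that arrived at time `now`."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level at time `now`; 0.0 until enough history exists."""
        if len(self.intervals) < 2:
            return 0.0
        mu = mean(self.intervals)
        sigma = max(pstdev(self.intervals), self.min_std)
        elapsed = now - self.last_arrival
        # Probability that a heartbeat arrives later than `elapsed`,
        # i.e. the upper tail of the assumed normal distribution.
        p_later = 0.5 * math.erfc((elapsed - mu) / (sigma * math.sqrt(2)))
        p_later = max(p_later, 1e-300)  # avoid log10(0) for very late beats
        return -math.log10(p_later)


detector = PhiAccrualDetector()
for t in [0.0, 1.0, 2.1, 3.0, 3.9, 5.0]:
    detector.heartbeat(t)

# Phi grows as the silence since the last heartbeat (at t = 5.0) gets longer.
for now in (5.5, 6.0, 7.0):
    print(now, round(detector.phi(now), 2))
```

An application would compare `detector.phi(now)` against its own threshold; the detector itself never declares a node dead.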
Why this is better
Because phi adapts to the observed network jitter, a temporarily slow link raises suspicion gently rather than triggering an instant false alarm. Different callers can apply different thresholds from the same signal.
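To make the adaptation concrete, the sketch below evaluates phi for the same delay on a steady link and on a jittery one, then applies two thresholds to the result. It assumes a normal model for heartbeat intervals; the function names, the interval statistics, and the consumer names and thresholds are all hypothetical values chosen for illustration.

```python
import math


def phi(elapsed, mean_interval, std_interval):
    """Phi under an assumed normal model of heartbeat intervals."""
    z = (elapsed - mean_interval) / (std_interval * math.sqrt(2))
    p_later = max(0.5 * math.erfc(z), 1e-300)  # guard against log10(0)
    return -math.log10(p_later)


# Same situation on both links: 1.5 s since the last heartbeat,
# with a historical mean interval of 1.0 s.
steady = phi(1.5, mean_interval=1.0, std_interval=0.1)   # low-jitter link
jittery = phi(1.5, mean_interval=1.0, std_interval=0.5)  # high-jitter link

# Hypothetical consumers with different tolerance for false alarms.
THRESHOLDS = {"load_balancer": 1.0, "leader_election": 8.0}


def verdicts(value):
    """Which consumers would treat this phi value as a failure."""
    return {name: value >= limit for name, limit in THRESHOLDS.items()}


print(round(steady, 2), verdicts(steady))
print(round(jittery, 2), verdicts(jittery))
```

Under these assumed numbers, the half-second delay yields a phi of about 6.5 on the steady link, enough for the hypothetical load balancer to stop routing traffic but not enough to trigger a leader election. On the jittery link the same delay yields a phi of about 0.8, so neither consumer reacts.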
Key idea
The phi accrual detector turns failure detection into a tunable suspicion score derived from heartbeat history, letting each application choose how aggressively to suspect a node.