Sui Validator Alert Reference
When running a Sui Validator node or Full node, you might want to configure alerting based on some or all of the following metrics.
Alert reference
The following sections list the settings for each alert. Customize the details for your environment as follows:
- Replace `$network` with your actual network label (for example, `mainnet`, `testnet`, and so on).
- Thresholds assume about 10,000 stake units, so a threshold of 8000 corresponds to 80% of stake. Adjust them for your own validator set size.
- Labels like `host` and `container` are stripped to be agnostic on infrastructure.
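As a rough illustration of these customizations, the following Python sketch substitutes a network label into an expression and recomputes a stake threshold. The helper names `render_expr` and `stake_threshold` are hypothetical and not part of any Sui tooling; the 10,000 total is the assumption stated in the list above.

```python
from string import Template


def stake_threshold(total_stake: int, fraction: float) -> int:
    """Return the stake amount that corresponds to `fraction` of total stake."""
    return round(total_stake * fraction)


def render_expr(expr: str, network: str) -> str:
    """Replace the $network placeholder in an alert expression."""
    return Template(expr).substitute(network=network)


# Expression copied from the safe mode alert in this reference.
SAFE_MODE_EXPR = (
    'is_safe_mode{network="$network"} > 0.5 '
    'or absent(is_safe_mode{network="$network"})'
)

if __name__ == "__main__":
    print(render_expr(SAFE_MODE_EXPR, "mainnet"))
    # Assumed total of 10,000 stake units; change it to match your validator set.
    print(stake_threshold(10_000, 0.80))
```

If your total stake differs from 10,000, recompute every `< 8000` style threshold with the same fraction rather than copying the literal value.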
High-priority chain health alerts (validator-specific)
These alerts should receive the most immediate attention from you or your team.
Safe mode during reconfiguration
| Key | Value |
|---|---|
| Name | Safe Mode during Reconfiguration |
| Summary | Epoch failed to advance; chain entered safe mode |
| Duration | 5m |
is_safe_mode{network="$network"} > 0.5 or absent(is_safe_mode{network="$network"})
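The logic of this rule can be sketched in Python. The sketch assumes that the Duration row maps to the `for` clause of a Prometheus alerting rule, meaning the condition must hold continuously for that long before the alert fires. The function names and the sample history are hypothetical.

```python
def condition(samples: list[float]) -> bool:
    """Mirror `is_safe_mode > 0.5 or absent(is_safe_mode)` for one evaluation.

    `samples` holds the is_safe_mode value of every matching series;
    an empty list models the metric being absent.
    """
    if not samples:
        return True  # absent(...) yields a result when no series exists
    return any(value > 0.5 for value in samples)


def fires(history: list[tuple[int, list[float]]], hold_seconds: int = 300) -> bool:
    """Return True if the condition held continuously for `hold_seconds`.

    `history` is a list of (unix_timestamp, samples) in ascending time order.
    """
    active_since = None
    for timestamp, samples in history:
        if condition(samples):
            if active_since is None:
                active_since = timestamp
        else:
            active_since = None  # any healthy evaluation resets the timer
    if active_since is None:
        return False
    return history[-1][0] - active_since >= hold_seconds


if __name__ == "__main__":
    # Safe mode reported from t=60 through t=360: held for 300 seconds.
    sustained = [(0, [0.0])] + [(t, [1.0]) for t in range(60, 361, 60)]
    print(fires(sustained))
```

Because of the `absent(...)` branch, a validator whose metrics stop being scraped triggers the same alert as one that reports safe mode.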
Consensus proposals failure
| Key | Value |
|---|---|
| Name | Consensus Proposals Failure |
| Summary | Less than 80% of stake is proposing consensus blocks |
| Duration | 5m |
sum(
  sum by (host) (current_voting_right{network="$network"})
  and
  sum by (host) (rate(consensus_proposed_blocks{network="$network"}[5m])) > 0
) < 8000
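The stake-weighted pattern used here, and in several alerts that follow, can be sketched in Python. In PromQL, `and` keeps only the left-hand samples whose labels also appear on the right-hand side, so the outer `sum` adds up the voting rights of hosts that pass the health check. The host names and values below are made up for illustration.

```python
from typing import Optional


def matching_stake(voting_right: dict[str, float],
                   proposal_rate: dict[str, float]) -> Optional[float]:
    """Sum the stake of hosts whose proposal rate is above zero.

    Returns None when no host matches, because summing an empty
    vector in PromQL produces no result at all.
    """
    matched = [
        stake
        for host, stake in voting_right.items()
        if host in proposal_rate and proposal_rate[host] > 0
    ]
    if not matched:
        return None
    return sum(matched)


def alert_condition(voting_right: dict[str, float],
                    proposal_rate: dict[str, float],
                    threshold: float = 8000) -> bool:
    """True when the proposing stake is below the threshold."""
    total = matching_stake(voting_right, proposal_rate)
    # With no result, the comparison has nothing to evaluate and does not fire.
    return total is not None and total < threshold


if __name__ == "__main__":
    # Hypothetical validator set totaling 10,000 stake units.
    voting_right = {"val-a": 4000, "val-b": 3500, "val-c": 2500}
    proposal_rate = {"val-a": 1.2, "val-b": 0.0, "val-c": 0.9}
    print(matching_stake(voting_right, proposal_rate))
    print(alert_condition(voting_right, proposal_rate))
```

One consequence worth noting: if no host at all passes the check, the expression returns nothing rather than zero, so a total outage would not trip this particular rule on its own.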
Checkpoint execution rate is low
| Key | Value |
|---|---|
| Name | Checkpoint Execution Rate Is Low |
| Summary | Less than 80% of stake is executing checkpoints quickly enough |
| Duration | 5m |
sum(
  sum by (host) (current_voting_right{network="$network"})
  and
  sum by (host) (rate(last_executed_checkpoint{network="$network"}[5m])) > 2
) < 8000
Certificate execution latencies are high
| Key | Value |
|---|---|
| Name | Certificate execution latencies are high |
| Summary | Less than 80% of stake is handling shared-object tx certs with low enough latency |
| Duration | 5m |
sum(
  sum by (host) (current_voting_right{network="$network"})
  and
  histogram_quantile(0.95, sum by (le, host) (
    rate(validator_service_handle_certificate_consensus_latency_bucket{network="$network"}[5m])
  )) < 3
) < 8000
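To show what the per-host `histogram_quantile(0.95, ...) < 3` check evaluates, here is a simplified Python sketch of quantile estimation from cumulative buckets. It assumes a classic bucketed histogram with linear interpolation inside the selected bucket, and it ignores edge cases such as negative bucket bounds. The bucket values are invented.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper_bound, count = buckets[index]
    if math.isinf(upper_bound):
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower_bound, lower_count = buckets[index - 1] if index > 0 else (0.0, 0.0)
    # Interpolate linearly between the bucket's lower and upper bounds.
    fraction = (rank - lower_count) / (count - lower_count)
    return lower_bound + (upper_bound - lower_bound) * fraction


if __name__ == "__main__":
    # Hypothetical latency buckets in seconds for one host.
    buckets = [(0.5, 50), (1.0, 80), (2.5, 90), (5.0, 100), (math.inf, 100)]
    p95 = histogram_quantile(0.95, buckets)
    print(p95, p95 < 3)
```

A host whose estimated p95 is 3 seconds or more drops out of the `and` match, so its stake no longer counts toward the 8000 threshold.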
Randomness DKG failure
| Key | Value |
|---|---|
| Name | RandomnessDkgFailure |
| Summary | Random beacon DKG has failed on one or more hosts |
| Duration | 5m |
epoch_random_beacon_dkg_failed{network="$network"} > 0 or absent(is_safe_mode{network="$network"})
Validators not upgraded
| Key | Value |
|---|---|
| Name | Mysten validators are not upgraded |
| Summary | Validators are behind on protocol version |
| Duration | 1h |
min(sui_configured_max_protocol_version{network="$network", host=~"Mysten-.*"})
< quantile(0.34, sui_configured_max_protocol_version{network="$network"})
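The comparison can be sketched in Python as follows. The sketch assumes that the PromQL `quantile` aggregation interpolates linearly between the two nearest sorted samples, and that it counts each reporting host equally rather than weighting by stake. Host names and version numbers are hypothetical.

```python
import math


def prom_quantile(q: float, values: list[float]) -> float:
    """Quantile over samples with linear interpolation between neighbors."""
    if not values:
        return math.nan
    ordered = sorted(values)
    rank = q * (len(ordered) - 1)
    lower = int(math.floor(rank))
    upper = min(len(ordered) - 1, lower + 1)
    weight = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


def not_upgraded(versions: dict[str, float], prefix: str = "Mysten-") -> bool:
    """True when the lowest version among `prefix` hosts is below the
    0.34 quantile of the versions reported by all hosts."""
    own = [v for host, v in versions.items() if host.startswith(prefix)]
    if not own:
        return False  # no matching series, so the comparison yields nothing
    return min(own) < prom_quantile(0.34, list(versions.values()))


if __name__ == "__main__":
    # Ten hypothetical hosts: three still on version 70, seven on version 71.
    versions = {"Mysten-1": 70, "Mysten-2": 70, "val-a": 70}
    versions.update({f"val-{i}": 71 for i in range(7)})
    print(not_upgraded(versions))
```

To monitor your own hosts instead, change the `host=~"Mysten-.*"` matcher in the expression to a pattern that selects your validators.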
Non-urgent and warning alerts
All alerts are important, but the following alerts and warnings can be addressed within a normal node maintenance workflow.
Consensus sequencing p99 latency high
| Key | Value |
|---|---|
| Name | Consensus sequencing p99 latencies are high |
| Summary | Less than 50% of stake is sequencing tx certs with acceptable latency |
| Duration | 1m |
sum(
  sum by (host) (current_voting_right{network="$network"})
  and
  histogram_quantile(0.95, sum by (le, host) (
    rate(sequencing_certificate_latency_bucket{network="$network", position="0", tx_type=~"shared_certificate|owned_certificate|soft_bundle"}[2m])
  )) < 2
) < 5000
System invariant violations
| Key | Value |
|---|---|
| Name | System Invariant Violations |
| Summary | The system reported an invariant violation |
| Duration | 1m |
max(system_invariant_violations{network="$network"}) > 0