Proxmox Cluster Quorum Lost: 6 Causes and How to Fix It


Proxmox cluster quorum loss is one of those events that looks catastrophic from the GUI and is often straightforward to fix. The cluster goes red, /etc/pve turns read-only, operations start failing — and the cause is usually not a Corosync bug or a corrupted filesystem. It is usually vote math. Either a node went down and the remaining nodes no longer hold a majority, or maintenance removed too many votes at once, or a two-node cluster was running without a third vote to begin with.

That said, the consequences vary significantly depending on whether HA is active. Understanding that distinction before you start recovery is more important than running the first fix you find.

Quick answer: if Proxmox cluster quorum is lost, run pvecm status. If Total votes is less than Expected votes and the output shows Quorate: No, check vote math first. Restore a majority by bringing nodes back online before anything else. Only override expected_votes if you are certain the missing nodes are physically offline.

Diagnostic flow: where to start
  1. Run pvecm status — is Quorate: No?
  2. Was a node intentionally shut down or did power fail? → Vote math problem (Cause #1)
  3. Are Corosync logs showing [TOTEM] or [KNET] errors? → Network issue (Cause #2 or #3)
  4. Did quorum drop after an IP change, hostname rename, or manual config edit? → Config drift (Cause #4)
  5. Is a QDevice configured? Is it reachable on TCP 5403? → QDevice issue (Cause #5)
  6. Did this start after enabling jumbo frames or changing fabric? → MTU issue (Cause #6)

If step 2 matches — stop there. Vote math accounts for the majority of quorum incidents in homelab and SMB environments.

TL;DR — Proxmox Cluster Quorum
  • Quorum requires a majority of configured votes to be online; the cluster blocks writes if it falls below that threshold
  • Without HA: running VMs usually keep running, but /etc/pve goes read-only and cluster operations stop
  • With HA: nodes that lose quorum can self-fence (reboot) after ~60 seconds when the watchdog can no longer be reset
  • Most quorum loss in homelabs and SMBs comes from vote math, not network bugs — check node count first
  • Two-node clusters need a QDevice or third node to avoid quorum loss on any single failure
  • pvecm expected is a recovery tool for emergencies, not a design pattern

Scope note: This article covers Proxmox VE cluster quorum: causes, diagnostics, and recovery. It assumes a cluster of two or more nodes running Proxmox VE 7.x or 8.x. Standalone Proxmox hosts with no cluster configured are out of scope, because quorum only applies once pvecm create or pvecm add has been run. QDevice setup basics are referenced but not fully covered; for that, see the Proxmox QDevice documentation.

What Quorum Actually Does

Proxmox is designed to prefer temporary unavailability over data corruption. When a cluster loses quorum, it cannot prove which partition is authoritative. Rather than risk two nodes modifying the same VM configuration or storage simultaneously, Proxmox blocks writes until a majority is restored. That is why the GUI turns red and operations stop: nothing is broken, Proxmox is simply refusing to guess.

Proxmox clusters use a voting system built on Corosync to decide which partition of nodes is authoritative. Each node has one vote by default. A partition needs more than half the total configured votes — a strict majority — to remain quorate and allow writes to the cluster filesystem (pmxcfs).

pmxcfs replicates cluster configuration in real time across all nodes: VM configs, storage definitions, HA rules, VMID allocations. Only the partition with a strict majority of votes stays writable. The minority partition switches to read-only and blocks new cluster operations.
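The strict-majority rule can be written down in a few lines. This is a minimal sketch of the arithmetic only; the function names are illustrative and are not part of any Proxmox or Corosync tool.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Smallest vote count that is a strict majority of expected_votes."""
    if expected_votes < 1:
        raise ValueError("expected_votes must be at least 1")
    return expected_votes // 2 + 1


def is_quorate(total_votes: int, expected_votes: int) -> bool:
    """True when the votes currently online form a strict majority."""
    return total_votes >= quorum_threshold(expected_votes)


if __name__ == "__main__":
    # (expected votes, votes online) pairs chosen for illustration
    for expected, online in [(2, 1), (3, 2), (3, 1), (4, 2), (5, 3)]:
        state = is_quorate(online, expected)
        print(f"{online}/{expected} votes online -> quorate: {state}")
```

Note that exactly half is not enough: two votes out of four, or one out of two, leaves the partition non-quorate.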

Split Brain: The Failure Quorum Exists to Prevent

Split brain is the scenario quorum exists to prevent. Understanding it explains the rest of this article.

A network partition divides the cluster into two groups that can no longer communicate. Each group still has nodes running. Each group still has access to shared storage. Neither group knows whether the other side is alive or dead — so both sides assume they are authoritative and keep operating.

The result is two active writers on the same data:

  • Both partitions attempt to start the same VM — two instances write simultaneously to the same disk image
  • Both partitions update pmxcfs with conflicting configuration changes
  • Both partitions run HA recovery logic, potentially starting duplicate guests

Shared storage corruption from simultaneous writes is not always immediately visible. The image looks intact. The VM boots. The corruption surfaces later — during a backup, a restore, or a filesystem check. By then the original cause is long gone and the damage is hard to trace.

Fencing exists because of this. When a node loses quorum in an HA cluster, the watchdog forces it to reboot before it can cause split-brain damage. The node does not wait to confirm whether the other side is alive. It removes itself from the equation. That is the correct behavior — a rebooted node cannot corrupt shared storage.

Quorum enforces this at the cluster level: only one partition gets to remain writable. The partition with a majority of votes proceeds. The minority partition blocks. No writes from both sides simultaneously. No split brain.

What Happens to Running VMs During Quorum Loss

This is usually the first question operators ask when quorum drops and the cluster turns red.

Without HA:

| Resource | Behavior |
|---|---|
| Running VMs | Continue running |
| Running CTs | Continue running |
| VM config edits | Blocked |
| VM creation / deletion | Blocked |
| Migrations | Blocked |
| Backups (cluster-aware) | Blocked |
| /etc/pve | Read-only |

Quorum loss without HA is primarily a management freeze. The cluster refuses to make new decisions, but it does not interrupt work already in progress. You can SSH into nodes, observe the system, and run diagnostics. What you cannot do is change anything cluster-managed until quorum is restored.

With HA:

| Resource | Behavior |
|---|---|
| HA-managed guests | May be fenced and recovered on surviving partition |
| Watchdog | Cannot be reset by LRM without quorum |
| Node losing quorum | May self-fence after ~60 seconds |
| Failover decisions | Depend on which partition retains quorum |

With HA the stakes are higher. When a node loses quorum, the local resource manager (LRM) can no longer reset the watchdog. After roughly 60 seconds, the node self-fences — it reboots itself deliberately. This is not a crash. It is Proxmox ensuring the node cannot continue running HA-managed guests in a partition that may conflict with another active partition. If the surviving partition still has quorum, HA will attempt to recover those guests there.

In HA clusters, quorum loss is not “things slow down and I have time to investigate.” It can be a cascading fencing event that reboots multiple nodes, depending on how the cluster splits.

Situations that look scary but usually are not:

  • One node rebooted during maintenance → cluster GUI turns red → running VMs still healthy
  • Power blip took one node offline → /etc/pve read-only → nothing actually corrupted
  • Cluster shows red X marks → SSH still works on all nodes

In non-HA clusters, this is quorum protection working as designed. The cluster is not broken. It is waiting for a safe majority before allowing writes again.

Six Causes of Proxmox Cluster Quorum Loss

The order below reflects practical investigation priority — the sequence most likely to find the actual cause fastest. It is not a statistical frequency ranking.

1. Vote Math Problems

The cluster simply does not have enough votes online to form a majority. This is the most common cause of quorum loss in homelab and SMB environments and should be the first check before investigating anything else.

The most recurring pattern: a two-node cluster with no QDevice. One node goes down for any reason — reboot, shutdown, power loss — and the remaining node has one vote out of two. That is 50%, not a majority. Quorum is immediately lost.

Cold boot after power loss trips up homelab operators more than any other pattern. The full sequence: power fails, all nodes shut down simultaneously, power returns, and one node boots faster than the others. That single node has one vote out of the cluster’s expected total, which is not a majority. The cluster shows Quorate: No, and guests configured with onboot do not start. The operator assumes something is broken. Nothing is broken. The cluster is waiting for the rest of the nodes to come online and restore a majority. Using pvecm expected 1 here is unnecessary and adds split-brain risk if any other node is in an unknown state.

Field note

Proxmox forum threads document a recurring pattern in 3-node clusters: planned maintenance reduces the cluster to two nodes, then an unrelated event — a BIOS update reboot or brief power hiccup — takes a second node offline before the first is back. The remaining node has one vote out of three. Quorate: No. Root cause: maintenance reduced redundancy below safe threshold before an unrelated failure occurred. The fix is always the same — restore a second node first, verify quorum, then continue maintenance.

Symptoms: quorum dropped immediately after a deliberate shutdown, reboot, or power event; the cluster was healthy before; no network errors in the logs.

Vote math diagnosis
  1. Run pvecm status — check Expected votes, Total votes, and Quorate
  2. Count how many nodes are actually online
  3. Check whether a QDevice is configured and whether it is reachable
  4. Ask: did this start exactly when a node was shut down or rebooted?

Recovery: restore a majority first. Bring the missing node back online, or restore QDevice reachability. Only after verifying the missing side is genuinely offline should you consider pvecm expected <N> as a temporary override. For a persistent two-node quorum problem, the right fix is architectural: add a third voter (QDevice or third node), not recurring use of expected_votes.
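To make the cold-boot case concrete, here is a small sketch that walks through nodes coming online one at a time. It assumes one vote per node (the Proxmox default) and no QDevice; the node names are hypothetical.

```python
def boot_timeline(boot_order, expected_votes):
    """Return (node, votes_online, quorate) tuples as nodes boot in order.

    Assumes every node carries exactly one vote and no QDevice is present.
    """
    threshold = expected_votes // 2 + 1  # strict majority
    timeline = []
    for votes_online, node in enumerate(boot_order, start=1):
        timeline.append((node, votes_online, votes_online >= threshold))
    return timeline


if __name__ == "__main__":
    # Hypothetical three-node cluster after a full power loss
    for node, votes, quorate in boot_timeline(["pve1", "pve2", "pve3"], 3):
        print(f"{node} is up: {votes}/3 votes, quorate={quorate}")
```

In this model the first node to boot sits at one vote out of three and stays non-quorate until the second node joins, which matches the waiting behavior described above.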

2. Corosync on a Shared or Congested Network

Corosync traffic is competing with VM, storage, backup, or migration traffic on the same link. Corosync has tight latency and timing requirements. When the link is congested or jittery, Corosync misses heartbeat tokens and starts forming new membership configurations — which can cascade into quorum loss and, in HA clusters, fencing.

Proxmox explicitly recommends a dedicated physical NIC for cluster communication and warns against running storage or migration traffic on the Corosync path.

Symptoms: quorum flaps while SSH and ping look mostly fine; instability correlates with backup runs, VM migrations, or storage replication; multiple nodes may reboot in HA clusters after a period of network stress.

Log patterns:

    [TOTEM] Retransmit List
    Token has not been received
    A processor failed, forming new configuration
    pmxcfs ... cpg_send_message retry

Congested network diagnosis
  1. journalctl -b -u corosync --no-pager — look for TOTEM token and retransmit messages
  2. Confirm whether Corosync shares a physical NIC with VM, backup, or storage traffic
  3. Check whether quorum events correlate with heavy network operations (backup window, migration)
  4. Verify end-to-end latency and jitter between all Corosync ring addresses

Recovery: move Corosync onto a dedicated low-latency network. Stop routing bulk traffic across the Corosync path. Tuning the totem token and consensus timeouts (for example token or token_coefficient in corosync.conf) can help in high-latency setups, but it is a compensating measure. The real fix is the network.
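For context on what token_coefficient does: per the corosync.conf man page, when the nodelist contains at least three nodes, the effective token timeout is token + (number_of_nodes - 2) * token_coefficient. The sketch below only reproduces that arithmetic. The 3000 ms and 650 ms values in the example are assumed inputs, so confirm the defaults for your Corosync version with man corosync.conf.

```python
def effective_token_timeout_ms(token_ms, token_coefficient_ms, node_count):
    """Effective totem token timeout in milliseconds.

    The coefficient only applies when the nodelist has three or more nodes.
    """
    if node_count < 3:
        return token_ms
    return token_ms + (node_count - 2) * token_coefficient_ms


if __name__ == "__main__":
    # Assumed example values, not read from any real cluster
    for nodes in (2, 3, 5):
        timeout = effective_token_timeout_ms(3000, 650, nodes)
        print(f"{nodes} nodes -> token timeout {timeout} ms")
```

A longer timeout makes Corosync more tolerant of jitter, but it also delays detection of a genuinely failed node.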

3. Bonding and Asymmetric Connectivity Mistakes

Operators add a Linux bond to the Corosync interface expecting redundancy, but load-balancing bond modes create asymmetric connectivity — some nodes see different subsets of peers at any given moment. Corosync may fail to form stable quorum at all in this state, or the cluster may fence itself after a link or switch event that looks harmless on the surface.

Proxmox explicitly advises against load-balancing bond modes for Corosync. If LACP must be used, Proxmox recommends bond-lacp-rate fast on both the node and the switch.

Log patterns:

    [KNET] link: host X link Y is down
    [KNET] host: host X has no active links

Bond / asymmetric connectivity diagnosis
  1. Identify whether Corosync rides on a Linux bond or vmbr — check /etc/network/interfaces
  2. Identify the bond mode (bond-mode setting)
  3. Run corosync-cfgtool -n to see live KNET link state
  4. Verify switch-side LACP configuration and whether both sides agree on the rate
  5. Check for miswired redundant paths or inconsistent NIC mapping

Recovery: use plain dedicated interfaces or Corosync’s own multi-ring redundancy over separate physical networks. If you must use a bond, switch to active-backup mode rather than LACP or balance-rr.
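Steps 1 and 2 of the diagnosis can be scripted. The following sketch scans the text of an ifupdown-style /etc/network/interfaces file for bond-mode lines and flags any bond that is not active-backup. It is a simplified parser written for illustration; it assumes each bond-mode line sits under its iface stanza, and it does not know which bond actually carries Corosync.

```python
import re

# Numeric aliases accepted by the Linux bonding driver
NUMERIC_MODES = {
    "0": "balance-rr", "1": "active-backup", "2": "balance-xor",
    "3": "broadcast", "4": "802.3ad", "5": "balance-tlb", "6": "balance-alb",
}


def bond_modes(interfaces_text):
    """Map interface name -> bond mode for every stanza that sets one."""
    modes = {}
    current = None
    for raw in interfaces_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        stanza = re.match(r"iface\s+(\S+)", line)
        if stanza:
            current = stanza.group(1)
            continue
        mode = re.match(r"bond[-_]mode\s+(\S+)", line)
        if mode and current:
            modes[current] = NUMERIC_MODES.get(mode.group(1), mode.group(1))
    return modes


def risky_bonds(interfaces_text):
    """Bonds whose mode is anything other than active-backup."""
    return {name: mode for name, mode in bond_modes(interfaces_text).items()
            if mode != "active-backup"}


if __name__ == "__main__":
    with open("/etc/network/interfaces") as handle:
        print(risky_bonds(handle.read()))
```

An empty result only means no non-active-backup bond was found in the file; the switch side still needs to be checked separately.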

4. Address, Hostname, and Configuration Drift

Corosync configuration gets out of sync with actual network state. This happens after hostname changes, IP reassignments, moving Corosync to a new VLAN, adding or replacing nodes, or manual edits to corosync.conf that increment config_version incorrectly. Proxmox recommends using IP addresses rather than hostnames in cluster configuration, because hostname resolution can change over time.

Symptoms: all nodes appear individually reachable, but the cluster shows red X marks, ghost nodes, or inconsistent views; /etc/pve becomes read-only despite basic connectivity looking fine.

Log patterns:

    configuration error: nodelist or quorum.expected_votes must be configured!
    hostname lookup failed

Config drift diagnosis
  1. Inspect /etc/pve/corosync.conf on a quorate node — verify every ringX_addr resolves correctly
  2. Compare /etc/corosync/corosync.conf content across all nodes — it should be identical
  3. Check config_version is consistent
  4. Verify /etc/hosts entries for all cluster nodes on every host
  5. Confirm no recent hostname or IP changes

Recovery: if the cluster is quorate, correct the cluster copy of corosync.conf (/etc/pve/corosync.conf), increment config_version, and verify with systemctl status corosync after the update propagates. If Corosync cannot start on an isolated node, edit the local /etc/corosync/corosync.conf directly so Corosync can come up, then ensure the content, including config_version, is identical across all nodes.
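Comparing corosync.conf across nodes is easy to automate once the file contents have been collected (for example over SSH, which is left out here). This sketch takes a mapping of node name to file text and reports whether config_version and a content hash agree. The node names and helper names are hypothetical.

```python
import hashlib
import re


def config_version(conf_text):
    """Extract the integer config_version, or None if it is missing."""
    match = re.search(r"^\s*config_version:\s*(\d+)", conf_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def drift_report(configs):
    """Return (consistent, summary) for a dict of node name -> corosync.conf text.

    summary maps each node to (config_version, short SHA-256 of the content).
    """
    summary = {
        node: (config_version(text),
               hashlib.sha256(text.encode("utf-8")).hexdigest()[:12])
        for node, text in configs.items()
    }
    consistent = len(set(summary.values())) <= 1
    return consistent, summary


if __name__ == "__main__":
    # Hypothetical contents; in practice read /etc/corosync/corosync.conf per node
    same = "totem {\n  config_version: 7\n}\n"
    stale = "totem {\n  config_version: 6\n}\n"
    ok, details = drift_report({"pve1": same, "pve2": same, "pve3": stale})
    print("consistent:", ok)
    for node, (version, digest) in details.items():
        print(node, version, digest)
```

A byte-for-byte hash is deliberately strict: even a whitespace difference is reported, which is usually what you want when the files are supposed to be identical.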

5. QDevice Placement or Reachability Errors

QDevice is the standard fix for two-node clusters, but the fix only works if the QDevice is genuinely independent from the cluster nodes. The most common design mistake: placing qnetd as a VM inside the same Proxmox cluster or on one of the cluster hosts. When that host goes down, the QDevice goes with it — exactly when you need it most.

Proxmox recommends QDevice for even-numbered clusters (especially two-node) and discourages it for odd-numbered clusters, where the QDevice can become a de facto single point of failure.

QDevice diagnosis
  1. pvecm status — check QDevice flags (A/NA, V/NV, NR)
  2. Test reachability to the qnetd host on TCP port 5403
  3. Confirm the qnetd host is physically separate from all cluster nodes
  4. Verify qnetd is not a VM whose survival depends on the cluster it arbitrates

Recovery: move qnetd to an external physical host, NAS, SBC, or non-cluster hypervisor. Re-run pvecm qdevice setup <QDEVICE-IP> against the new host. Remember that Proxmox requires removing the QDevice before adding or deleting cluster nodes.
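The TCP 5403 reachability test from step 2 can be done with a plain socket connect. This is a minimal sketch: a successful connect only shows that something is listening on the port, not that qnetd is healthy or that certificates are valid. The address in the example is a placeholder.

```python
import socket


def tcp_reachable(host, port=5403, timeout=3.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks
        return False


if __name__ == "__main__":
    qnetd_host = "192.0.2.10"  # placeholder address, replace with your qnetd host
    print("qnetd port reachable:", tcp_reachable(qnetd_host))
```

Run it from every cluster node, since a firewall rule or route can block one node while the others connect fine.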

6. MTU and Edge Transport Cases

This is the least common cause and worth checking only after the five above are excluded — or when logs point directly at MTU discovery failures.

Corosync uses KNET’s PMTUD (Path MTU Discovery) to determine the usable packet size between nodes. An MTU mismatch across NICs, VLANs, or switches can cause repeated PMTUD aborts and unstable peer reachability, even when basic ping works. This surfaces most often after enabling jumbo frames on some but not all segments.

Log patterns:

    [KNET] pmtud: Aborting PMTUD process: Too many attempts. MTU might have changed during discovery
    rx: Source host X not reachable yet
    pmxcfs ... cpg_send_message retried 100 times

Diagnosis: verify a consistent MTU end-to-end — NICs, VLANs, switches, any tunnels. Run corosync-cfgtool -n to see the live MTU Corosync is using. Recovery: normalize MTU across the full path and restart Corosync after the path is clean.
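One common way to verify the path MTU by hand is a ping with fragmentation prohibited and a payload sized to fill the MTU exactly. The sketch below computes that payload size and builds the matching Linux iputils ping command. It assumes IP headers without options, and the target address is a placeholder.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
IPV6_HEADER_BYTES = 40  # fixed IPv6 header
ICMP_HEADER_BYTES = 8   # ICMP and ICMPv6 echo header


def max_ping_payload(mtu, ipv6=False):
    """Largest ICMP echo payload that fits in one unfragmented packet."""
    ip_header = IPV6_HEADER_BYTES if ipv6 else IPV4_HEADER_BYTES
    return mtu - ip_header - ICMP_HEADER_BYTES


def ping_command(target, mtu):
    """IPv4 ping argv with the don't-fragment flag set (Linux iputils syntax)."""
    return ["ping", "-M", "do", "-c", "3", "-s", str(max_ping_payload(mtu)), target]


if __name__ == "__main__":
    # Placeholder ring address; use each peer's Corosync address in turn
    print(" ".join(ping_command("192.0.2.20", 9000)))
```

If the command for MTU 9000 fails between two nodes while the one for 1500 succeeds, some hop on that path is not passing jumbo frames.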

Failure scenario: common design mistakes that guarantee quorum problems. These are not edge cases; they are recurring patterns in forum threads:
  1. Two-node cluster with no QDevice and no third node
  2. Corosync on the same NIC as storage replication or backup traffic
  3. QDevice running as a VM inside the cluster it arbitrates
  4. Single switch carrying all Corosync traffic with no redundant path
  5. Cluster stretched across WAN links without dedicated low-latency Corosync transport

If your cluster matches any of these, quorum loss is eventually guaranteed — not a risk.

Log Patterns and First-Pass Triage

When quorum breaks, start with these commands:

    pvecm status
    systemctl status corosync
    corosync-cfgtool -n
    journalctl -b -u corosync --no-pager
    journalctl -b -u pve-cluster --no-pager
    pvesh get /cluster/status
    ha-manager status
    journalctl -b -u pve-ha-lrm -u pve-ha-crm --no-pager

pvecm status answers the vote math question in seconds. systemctl status corosync confirms whether the Corosync process is actually running and shows recent service errors at a glance, so it is the natural second command. corosync-cfgtool -n shows live KNET link state and the MTU Corosync is actually using. pvesh get /cluster/status returns the node states through the API; add --output-format json to get structured JSON for scripts and automated health checks.

Healthy cluster output:

    Quorate: Yes
    Expected votes: 3
    Total votes: 3
    Node votes: 1

Broken cluster output:

    Quorate: No
    Expected votes: 3
    Total votes: 1
    Node votes: 1

One vote out of three expected. Two nodes are offline or unreachable from this node’s perspective. The first question is whether that is intentional.
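For scripted health checks, the three fields that matter can be pulled out of the pvecm status text with a few regular expressions. This is a sketch that assumes the "Label: value" layout shown above; the helper names are illustrative, and it raises KeyError if a field is missing.

```python
import re

FIELDS = ("Quorate", "Expected votes", "Total votes")


def parse_votes(status_text):
    """Extract the raw string value of each known field from pvecm status text."""
    found = {}
    for key in FIELDS:
        match = re.search(rf"^\s*{re.escape(key)}:\s*(\S+)", status_text, re.MULTILINE)
        if match:
            found[key] = match.group(1)
    return found


def summarize(status_text):
    """Return quorate flag, vote counts, and how many votes are missing."""
    fields = parse_votes(status_text)
    expected = int(fields["Expected votes"])
    total = int(fields["Total votes"])
    return {
        "quorate": fields["Quorate"].lower() == "yes",
        "expected": expected,
        "total": total,
        "missing": expected - total,
    }


if __name__ == "__main__":
    sample = "Quorate: No\nExpected votes: 3\nTotal votes: 1\n"
    print(summarize(sample))
```

On a real node the input text could come from subprocess.run(["pvecm", "status"], capture_output=True, text=True).stdout.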

Pattern recognition:

Expected votes > Total votes, Quorate: No — vote math problem. Count online nodes. Check QDevice state. Do not chase packet-level causes until vote math is ruled out.

[KNET] link: host X link Y is down or host X has no active links — Corosync path failure. Think NIC, switch, bond mode, VLAN, or wrong ringX_addr. The host itself may be fully up.

[TOTEM] Token has not been received, A processor failed, forming new configuration — congestion, latency, or jitter on the Corosync path. Check whether the link is shared with bulk traffic.

[QUORUM] This node is within the non-primary component and will NOT provide any services — this node is in the minority partition. Do not force writes here until you have confirmed the other side is offline.

pmxcfs ... cpg_send_message retried 100 times — cluster filesystem cannot get reliable group communication. /etc/pve behavior will degrade even if some node-to-node connectivity still exists.

Nodes rebooting roughly 60 seconds after quorum loss in an HA cluster are self-fencing. That is expected behavior, not a secondary failure.
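The pattern list above lends itself to a first-pass log classifier. The sketch below counts lines from journalctl output by category, using only the message fragments quoted in this article. The category labels are made up for illustration, and real log lines may differ slightly between Corosync versions, so treat a zero count as "not seen", not as "ruled out".

```python
import re
from collections import Counter

# (pattern, label) pairs; the first matching pattern wins for each line
SIGNATURES = [
    (re.compile(r"Aborting PMTUD process"), "mtu"),
    (re.compile(r"link: host:? \S+ link:? \S+ is down|has no active links"), "link"),
    (re.compile(r"Token has not been received"
                r"|A processor failed, forming new configuration"
                r"|Retransmit List"), "congestion"),
    (re.compile(r"within the non-primary component"), "minority"),
    (re.compile(r"cpg_send_message"), "pmxcfs"),
]


def classify(journal_text):
    """Count log lines per category for a block of journal output."""
    counts = Counter()
    for line in journal_text.splitlines():
        for pattern, label in SIGNATURES:
            if pattern.search(line):
                counts[label] += 1
                break
    return counts


if __name__ == "__main__":
    import sys
    # Example: journalctl -b -u corosync -u pve-cluster --no-pager | python3 classify.py
    for label, count in classify(sys.stdin.read()).most_common():
        print(f"{label}: {count}")
```

The category with the highest count is a reasonable place to start, after vote math has been ruled out.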

Common Proxmox Cluster Quorum Incidents and First Check

| Scenario | Priority | First Check |
|---|---|---|
| 2-node cluster, one node rebooted | Very high | pvecm status (vote count) |
| Cold boot after power loss, VMs not starting | Very high | Wait for all nodes to come online |
| Maintenance shut down too many nodes at once | High | Vote count vs. expected |
| Quorum flaps during backup or migration | High | Corosync on shared link? |
| QDevice unreachable | Medium | TCP 5403 reachability |
| Bonding / LACP change preceded quorum loss | Medium | corosync-cfgtool -n KNET links |
| Quorum lost after hostname or IP change | Medium | corosync.conf ring addresses |
| MTU mismatch / jumbo frames | Low | corosync-cfgtool -n MTU output |

Typical patterns by environment:

| Environment | Most Common Cause |
|---|---|
| Homelab | 2-node cluster without QDevice |
| SMB | Corosync sharing production network |
| SMB with HA | Switch failure triggering fencing cascade |
| Any cluster after power outage | Cold boot vote loss |

Before You Run pvecm expected

Failure scenario: stop. Work through this checklist before running pvecm expected:
  1. Confirm the missing node is powered off — physically verify if necessary
  2. Confirm shared storage is not mounted or accessible from the missing node
  3. Confirm HA recovery has completed on the surviving partition (ha-manager status)
  4. Confirm no second partition is alive anywhere on the network
  5. Confirm this is an emergency recovery situation, not a design workaround

If any item above is uncertain — restore the missing node first. The split-brain risk is VM image corruption on shared storage. That damage is not self-correcting.

Recovery: Safe Use of expected_votes

pvecm expected exists for legitimate emergencies. It is not a workaround for bad cluster design.

When it is appropriate:

  • Repairing a broken corosync.conf when the cluster cannot otherwise reach quorum
  • Running pvecm delnode after a permanently failed node has been powered off
  • Getting critical guests back online after confirming the missing nodes are physically offline

The rule: only use pvecm expected when you are certain the missing side is not running. The runtime override resets when nodes rejoin — it is a temporary state, not a persistent config change. Do not use it as a steady-state solution.

Planned Maintenance vs Unexpected Failure

These require different approaches.

Planned maintenance: Proxmox provides disarm-ha for cluster-wide HA-safe maintenance windows. In freeze mode, services stay where they are and HA stops reacting to changes. In ignore mode, HA stops tracking them so you can move them manually. While HA is disarmed, no automatic fencing or failover happens — which is why it is appropriate for Corosync or network work, and why it should be kept as short as possible.

Safe node shutdown sequence for a 3-node cluster:

  • Verify cluster is quorate before starting (pvecm status)
  • Migrate HA guests off the target node if required
  • Put the target node into maintenance mode
  • Shut down one node at a time only
  • Verify quorum is still intact after each shutdown
  • Never remove majority votes simultaneously
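The "verify quorum after each shutdown" step can be checked ahead of time with the same majority arithmetic. This sketch answers one question: if some votes leave now, does the remainder still hold a strict majority? The function name is illustrative, and it assumes you read the expected and online vote counts from pvecm status yourself.

```python
def safe_to_shut_down(expected_votes, online_votes, votes_leaving=1):
    """True if the cluster stays quorate after votes_leaving votes go offline."""
    threshold = expected_votes // 2 + 1  # strict majority
    return online_votes - votes_leaving >= threshold


if __name__ == "__main__":
    # Three-node cluster, all nodes up: one node may leave
    print(safe_to_shut_down(expected_votes=3, online_votes=3))
    # Same cluster with one node already down: a second shutdown loses quorum
    print(safe_to_shut_down(expected_votes=3, online_votes=2))
```

The second case is the forum pattern described earlier: maintenance had already used up the cluster's only spare vote.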

Unexpected failure: assume split-brain risk until proven otherwise. Do not modify cluster state. Do not lower expected votes before verifying which side holds shared storage and which side is truly offline. In HA clusters, let fencing complete — interrupting it mid-sequence can leave guests in an undefined state.

The distinction matters because the tooling is the same but the risk model is completely different. During planned maintenance you know the full picture. During unexpected failure, you do not.

Preventing Your Next Proxmox Cluster Quorum Incident

Minimum safe cluster designs:

| Design | Assessment |
|---|---|
| 3 nodes | Good: tolerates single-node failure without extra configuration |
| 2 nodes + external QDevice | Good: tiebreaker vote on independent hardware |
| 2 nodes only | Risky: any single failure loses quorum |
| 2 nodes + QDevice VM inside the cluster | Avoid: QDevice fails together with the node it should arbitrate |
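The assessments in the table follow directly from vote counting. The sketch below models each design and checks whether every single-node failure still leaves a strict majority. It assumes one vote per node and a one-vote QDevice, as in the two-node setups above; the node names and the qdevice_host parameter are hypothetical, and failure of the external QDevice host itself is not modeled.

```python
def tolerates_any_single_node_failure(nodes, qdevice_host=None):
    """True if the cluster stays quorate no matter which single node fails.

    nodes: list of node names, one vote each.
    qdevice_host: None for no QDevice, "external" for independent hardware,
    or the name of a cluster node whose failure also takes the QDevice down.
    """
    expected = len(nodes) + (1 if qdevice_host else 0)
    threshold = expected // 2 + 1  # strict majority
    for failed in nodes:
        surviving = len(nodes) - 1
        if qdevice_host and qdevice_host != failed:
            surviving += 1  # the QDevice vote survives this failure
        if surviving < threshold:
            return False
    return True


if __name__ == "__main__":
    designs = {
        "3 nodes": (["pve1", "pve2", "pve3"], None),
        "2 nodes + external QDevice": (["pve1", "pve2"], "external"),
        "2 nodes only": (["pve1", "pve2"], None),
        "2 nodes + QDevice VM on pve1": (["pve1", "pve2"], "pve1"),
    }
    for name, (nodes, qdevice) in designs.items():
        print(f"{name}: {tolerates_any_single_node_failure(nodes, qdevice)}")
```

The last design fails because losing pve1 removes two of the three votes at once, leaving pve2 with one.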

Beyond topology, five operational decisions eliminate the majority of recurring incidents:

  • Keep Corosync off storage and backup networks — a dedicated NIC or VLAN with guaranteed low latency
  • Use IP addresses instead of hostnames in corosync.conf — hostnames introduce a DNS dependency into cluster availability
  • Test node failure before production use — shut down one node deliberately, verify quorum survives, bring it back
  • Never shut down more nodes simultaneously than your cluster can afford to lose
  • Run qnetd on a physically separate host — NAS, SBC, Raspberry Pi — not inside the cluster

None of these require significant extra hardware in most homelab setups.

FAQ

What does “Quorate: No” actually mean in practice?

The cluster does not have enough votes online to be authoritative. Most cluster operations that write to /etc/pve are blocked. Running VMs generally keep running. Starting new VMs, editing configs, running migrations, and most management operations will fail with the error “cluster not ready - no quorum” until quorum is restored.

Can I just use pvecm expected 1 to get back online?

You can, but you should not do it casually. If the other nodes are offline and confirmed not coming back on their own, it is a valid emergency tool. If there is any chance another partition is still active — especially one with access to the same shared storage — you risk split brain and VM image corruption. Verify the missing nodes are genuinely offline before using it.

Why does a two-node cluster lose quorum when one node reboots?

One vote out of two is 50%, not a majority. Proxmox’s quorum model requires more than half. A two-node cluster needs either a third node or a QDevice to provide the tiebreaker vote that keeps the surviving node quorate during any single failure.

What is a QDevice and where should I run it?

A QDevice is an external voting participant provided by corosync-qnetd, running on a host outside the cluster. It adds one vote to the cluster without adding a full Proxmox node. For a two-node cluster, it gives the surviving node a 2/3 majority on single failure. It should run on a physically separate host — NAS, SBC, VM on a different hypervisor — never as a VM inside the same cluster it is arbitrating.

If my HA nodes self-fence after quorum loss, is something broken?

No. Self-fencing after losing quorum in an HA cluster is intentional. The LRM cannot reset the watchdog without quorum, so the node reboots to prevent split-brain scenarios from running HA-managed guests in two places simultaneously. It means HA is working. The question to investigate is why quorum was lost in the first place.

Final Thoughts

Proxmox cluster quorum loss is the cluster doing exactly what it was designed to do: refuse to guess which partition is authoritative. The GUI turns red because Proxmox is choosing temporary unavailability over the risk of data corruption. That is the correct trade-off.

The failure modes worth preventing are the ones that make quorum loss unavoidable by design — two-node clusters without a tiebreaker vote, Corosync sharing a link with backup or storage traffic, QDevice running inside the cluster it is supposed to arbitrate. Most of these are one-time architecture decisions that cost nothing to get right the first time and are painful to fix after an incident.

When quorum does drop unexpectedly, vote math and network are the right starting points — not config edits and not expected_votes overrides. Restore a majority first. Everything else follows from there.