(Failover Cluster) Quorum

아이셩짱셩 2025. 5. 9. 18:43
🔷 What Is Quorum?

Quorum is the mechanism that prevents split-brain, a condition in which multiple partitions of a cluster each believe they are in charge and corrupt shared data.

In simple terms:

👉 Quorum = Majority agreement among voting members (nodes + witness) about which part of the cluster should stay online.

If quorum is not reached, the cluster shuts down (partially or completely) to protect data.
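The majority rule above is simple arithmetic. A minimal sketch (not the Windows implementation, just the counting logic it describes):

```python
# A cluster keeps running only while the online voters (nodes + witness)
# form a strict majority of the total configured votes.

def has_quorum(votes_online: int, total_votes: int) -> bool:
    """True if the online voters form a strict majority."""
    return votes_online > total_votes // 2

# 3 nodes + 1 file share witness = 4 total votes.
# Losing one node (3 of 4 online) keeps quorum; a 2/2 split does not.
print(has_quorum(3, 4))  # True
print(has_quorum(2, 4))  # False
```

This is why a witness matters in even-node clusters: it makes the vote total odd, so a clean majority always exists.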

 

🔹 File Share Witness (FSW) — What Is It?

File Share Witness is a concept and component of Windows Failover Clustering — not specific to Storage Spaces Direct (S2D) or Cluster Shared Volumes (CSV).

It is used to help maintain quorum, which is the decision-making mechanism in a cluster (i.e., to determine which parts stay online during a failure).

 

🔹 How FSW Detects Failures (Especially Network Partitions)

🔍 Cluster Node View of Witness

  • The Cluster Service on each node continuously checks connectivity to:
    • Other cluster nodes
    • The File Share Witness
  • If a node loses contact with others or with the FSW, it does not automatically assume failure.
  • The cluster performs quorum arbitration, and only the side with quorum stays online.

If Only One Node Contacts the FSW, How Do the Other Nodes Count Its Vote?

Because:

  • The Cluster Service replicates vote status between nodes.
  • If Node A (FSW Coordinator) successfully locks the FSW and confirms its vote:
    • It informs the cluster: “Witness vote is active.”
    • Other nodes add it to the quorum count.

So:

🟢 Nodes do not each contact the FSW — they rely on the cluster membership and internal sync.
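The replication of the witness vote can be sketched as follows. This is a hypothetical illustration of the idea above (the `Node` class and `lock_witness` function are not real cluster APIs): only the coordinator touches the witness, and the result is propagated through cluster membership so every node counts the same total.

```python
# Hypothetical sketch: one coordinator locks the file share witness,
# then replicates "witness vote is active" to its peers.

class Node:
    def __init__(self, name: str):
        self.name = name
        self.witness_vote_active = False

def lock_witness(coordinator: Node, peers: list[Node]) -> None:
    # Coordinator takes the lock on the file share witness...
    coordinator.witness_vote_active = True
    # ...then informs the other cluster members via internal sync.
    for peer in peers:
        peer.witness_vote_active = True

a, b, c = Node("A"), Node("B"), Node("C")
lock_witness(a, [b, c])
votes = 3 + (1 if a.witness_vote_active else 0)  # node votes + witness vote
print(votes)  # 4
```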

 

🔹 Goal of Failover Clustering

In the event of network failure, the cluster must:

  1. Detect the failure of a node.
  2. Determine if the node is truly unreachable, or if it’s just a temporary glitch.
  3. Decide whether to fail over workloads (e.g., VMs, file shares).
  4. Ensure quorum is maintained (i.e., the cluster doesn’t split).

🔹 1. Detection of Node Failure (via Heartbeats)

Process:

  • Cluster nodes send heartbeats to each other every second (by default).
  • If a node misses 5 consecutive heartbeats (~5 seconds by default), the other nodes suspect it has failed.
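The detection policy above reduces to two tunables: the heartbeat interval and the miss threshold. A sketch using the defaults stated above (real clusters expose these as configurable settings, and defaults vary by Windows Server version):

```python
# Illustrative sketch of the stated default policy: one heartbeat per
# second, and a node is suspected down after 5 consecutive misses.

HEARTBEAT_INTERVAL_SEC = 1   # heartbeat sent every second (default)
MISS_THRESHOLD = 5           # consecutive misses before suspicion

def is_suspect(missed_heartbeats: int) -> bool:
    return missed_heartbeats >= MISS_THRESHOLD

print(is_suspect(4))  # False - still tolerated
print(is_suspect(5))  # True  - ~5 seconds of silence
```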

📡 Protocols Involved:

Type          Protocol / Use                           Port
Heartbeat     UDP + RPC                                3343 (UDP), 135 (RPC)
Cluster Comm  TCP/IP + SMB                             Dynamic ports, SMB (445)
SMB           File witness, cluster shared access      445
ICMP (Ping)   Optional basic reachability checks       ICMP

Heartbeats use a combination of UDP multicast/unicast and RPC.

🔹 2. Validation and Voting

After a missed heartbeat:

  • The cluster attempts additional checks (e.g., RPC, SMB connections).
  • If the node fails all checks, it's marked down.
  • The cluster runs quorum arbitration to determine if enough nodes (or witness) are available to keep running.
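The escalation above can be sketched as a simple decision: a missed heartbeat alone never marks a node down; the follow-up checks (RPC and SMB in a real cluster, stubbed out here as booleans) must also fail first.

```python
# Hedged sketch of the validation step: a node is marked down only
# when the heartbeat is missing AND every additional check fails.

def node_is_down(heartbeat_ok: bool, rpc_ok: bool, smb_ok: bool) -> bool:
    if heartbeat_ok:
        return False                  # no suspicion at all
    return not (rpc_ok or smb_ok)     # down only if every check fails

print(node_is_down(False, True, False))   # False - RPC still answers
print(node_is_down(False, False, False))  # True  - marked down
```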

🔹 3. Quorum Check & Arbitration

The Cluster Service checks:

  • How many votes (nodes + witness) are online.
  • Whether the current node(s) are part of the majority.

🧠 Scenarios:

Scenario                                         Result
Majority of votes present (including witness)    Cluster stays online
Less than a majority                             Cluster pauses (goes offline) to avoid split-brain

🔹 4. Failover of Workloads

If a node is confirmed as failed:

  • The Resource Hosting Subsystem (RHS) moves clustered roles (VMs, SQL Server instances, etc.) to a surviving node.
  • Any Cluster Shared Volumes (CSV) are brought online on another node.
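The reassignment step can be sketched as a simple ownership remap. This is a hypothetical illustration only; real placement also weighs preferred owners, anti-affinity rules, and node load.

```python
# Hypothetical sketch of failover: roles owned by the failed node are
# reassigned to a surviving node.

roles = {"VM1": "Node1", "SQL1": "Node1", "FS1": "Node2"}

def fail_over(roles: dict, failed: str, survivor: str) -> dict:
    return {r: (survivor if owner == failed else owner)
            for r, owner in roles.items()}

print(fail_over(roles, failed="Node1", survivor="Node2"))
# {'VM1': 'Node2', 'SQL1': 'Node2', 'FS1': 'Node2'}
```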