(Failover Cluster) Quorum
🔷 What Is Quorum?
Quorum is the mechanism that prevents split-brain — where multiple parts of a cluster think they’re in charge and cause data corruption.
In simple terms:
👉 Quorum = Majority agreement among voting members (nodes + witness) about which part of the cluster should stay online.
If quorum is not reached, the cluster shuts down (partially or completely) to protect data.
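The majority rule above comes down to simple arithmetic. A minimal sketch, assuming every voting member (node or witness) carries exactly one vote; the function names are illustrative and not part of any Windows API:

```python
def votes_needed(total_votes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return total_votes // 2 + 1


def has_quorum(online_votes: int, total_votes: int) -> bool:
    """True if the members that can still see each other hold a majority."""
    return online_votes >= votes_needed(total_votes)


# Example: 4 nodes + 1 witness = 5 votes in total.
print(votes_needed(5))   # 3
print(has_quorum(3, 5))  # True
print(has_quorum(2, 5))  # False
```

With an even number of votes and no witness (for example 4 nodes), a 2/2 split leaves neither side with the 3 votes it needs, which is why a witness is typically added to even-sized clusters.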
🔹 File Share Witness (FSW) — What Is It?
File Share Witness is a concept and component of Windows Failover Clustering — not specific to Storage Spaces Direct (S2D) or Cluster Shared Volumes (CSV).
It is used to help maintain quorum, which is the decision-making mechanism in a cluster (i.e., to determine which parts stay online during a failure).
🔹 How FSW Detects Failures (Especially Network Partitions)
🔍 Cluster Node View of Witness
- The Cluster Service continuously checks connectivity to:
  - Other cluster nodes (every node does this through heartbeats)
  - The File Share Witness (through the node that currently owns the witness resource)
- If a node loses contact with others or with the FSW, it does not automatically assume failure.
- The cluster performs quorum arbitration, and only the side with quorum stays online.
✅ If Only One Node Contacts the FSW, How Do Others See 7 Votes?
Because:
- The Cluster Service replicates vote status between nodes.
- If Node A (FSW Coordinator) successfully locks the FSW and confirms its vote:
  - It informs the cluster: “Witness vote is active.”
  - Other nodes add it to the quorum count.
So:
🟢 Nodes do not each contact the FSW — they rely on the cluster membership and internal sync.
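The idea can be sketched as shared membership state that every node reads after internal sync. This is a simplified model, not the actual Cluster Service implementation: the class and method names are hypothetical, and it assumes the "7 votes" in the heading means 6 nodes plus 1 witness.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class ClusterMembership:
    """Replicated state that every node sees (hypothetical model)."""
    online_nodes: Set[str]
    witness_owner: Optional[str] = None  # node currently holding the witness lock

    def lock_witness(self, node: str) -> bool:
        """Only one online node may hold the witness lock at a time."""
        if node in self.online_nodes and self.witness_owner is None:
            self.witness_owner = node
            return True
        return False

    def vote_count(self) -> int:
        """Votes as seen by any node: one per online node, plus the witness."""
        witness_vote = 1 if self.witness_owner in self.online_nodes else 0
        return len(self.online_nodes) + witness_vote


cluster = ClusterMembership(online_nodes={"A", "B", "C", "D", "E", "F"})
cluster.lock_witness("A")    # Node A acts as the FSW coordinator
print(cluster.vote_count())  # 7, even though only Node A touched the share
```

Node B never opens the file share, yet it reads the same `vote_count()` because the witness state lives in the shared membership record.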
🔹 Goal of Failover Clustering
In the event of network failure, the cluster must:
- Detect the failure of a node.
- Determine if the node is truly unreachable, or if it’s just a temporary glitch.
- Decide whether to fail over workloads (e.g., VMs, file shares).
- Ensure quorum is maintained (i.e., the cluster doesn’t split).
🔹 1. Detection of Node Failure (via Heartbeats)
✅ Process:
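- Cluster nodes send heartbeats to each other every second (by default).
- If a node misses 5 consecutive heartbeats (~5 seconds), the other nodes suspect it has failed. The threshold is configurable (`SameSubnetThreshold`): 5 is the default through Windows Server 2012 R2, while Windows Server 2016 and later default to 10 (~10 seconds).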
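The missed-heartbeat rule can be sketched as a per-node counter that resets whenever a heartbeat arrives. This is an illustrative model only; `HeartbeatMonitor` is a hypothetical name, and the default threshold of 5 simply mirrors the figure quoted above.

```python
class HeartbeatMonitor:
    """Counts consecutive missed heartbeats per node (illustrative model)."""

    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.missed = {}  # node name -> consecutive misses

    def record(self, node: str, received: bool) -> bool:
        """Record one heartbeat interval; return True if the node is now suspected."""
        if received:
            self.missed[node] = 0  # any heartbeat resets the counter
        else:
            self.missed[node] = self.missed.get(node, 0) + 1
        return self.missed[node] >= self.threshold


monitor = HeartbeatMonitor(threshold=5)
results = [monitor.record("Node2", received=False) for _ in range(5)]
print(results)  # [False, False, False, False, True]
```

Because the counter resets on every received heartbeat, a brief glitch of one or two lost packets never reaches the threshold.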
📡 Protocols Involved:
Type | Protocol | Port |
Heartbeat | UDP (unicast) | 3343 |
Cluster management | RPC | 135 (endpoint mapper) + dynamic ports |
Cluster communication | TCP/IP + SMB | Dynamic ports, 445 (SMB) |
File witness, cluster shared access | SMB | 445 |
Basic reachability checks (optional) | ICMP (Ping) | n/a |
Heartbeats themselves are lightweight UDP unicast packets on port 3343; RPC carries cluster management traffic rather than the heartbeat itself.
🔹 2. Validation and Voting
After the missed-heartbeat threshold is reached:
- The cluster attempts additional checks (e.g., RPC, SMB connections).
- If the node fails all checks, it's marked down.
- The cluster runs quorum arbitration to determine if enough nodes (or witness) are available to keep running.
🔹 3. Quorum Check & Arbitration
The Cluster Service checks:
- How many votes (nodes + witness) are online.
- Whether the current node(s) are part of the majority.
🧠 Scenarios:
Scenario | Result |
Majority of votes present (including witness)? | Cluster stays online |
Less than majority? | Cluster pauses (goes offline) to avoid split-brain |
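The two scenarios can be worked through for a network partition. A minimal sketch, assuming one vote per node and one for the witness; `surviving_partition` and its parameters are hypothetical names used only for illustration.

```python
from typing import List, Optional, Set


def surviving_partition(
    partitions: List[Set[str]],
    witness_partition: Optional[int],
    total_votes: int,
) -> Optional[int]:
    """Return the index of the partition that keeps quorum, or None.

    witness_partition is the index of the partition that can reach the
    witness (None if no partition can reach it).
    """
    for index, nodes in enumerate(partitions):
        votes = len(nodes) + (1 if index == witness_partition else 0)
        if votes > total_votes // 2:  # strict majority
            return index
    return None


# 4 nodes + witness = 5 votes; the network splits the nodes 2/2.
split = [{"N1", "N2"}, {"N3", "N4"}]
print(surviving_partition(split, witness_partition=0, total_votes=5))     # 0
print(surviving_partition(split, witness_partition=None, total_votes=5))  # None
```

At most one partition can hold a strict majority, so the function can never pick two survivors, which is the property that rules out split-brain.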
🔹 4. Failover of Workloads
If a node is confirmed as failed:
- The Cluster Service moves clustered roles (like VMs, SQL, etc.) to a surviving node, where the Resource Hosting Subsystem (RHS) brings their resources online.
- Ownership of any Cluster Shared Volumes (CSV) that the failed node coordinated moves to another node.
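The role movement can be sketched as reassigning every role owned by the failed node. The least-loaded placement rule below is an assumption made for illustration; the real cluster also considers settings such as preferred and possible owners, which this sketch ignores. `fail_over` is a hypothetical helper.

```python
from typing import Dict, List


def fail_over(roles: Dict[str, str], failed: str, survivors: List[str]) -> Dict[str, str]:
    """Return a new role -> owner map with the failed node's roles moved."""
    new_owners = dict(roles)
    # Count how many roles each surviving node already runs.
    load = {node: 0 for node in survivors}
    for owner in roles.values():
        if owner in load:
            load[owner] += 1
    for role, owner in sorted(roles.items()):
        if owner == failed:
            # Assumed policy: least-loaded survivor, ties broken by name.
            target = min(survivors, key=lambda node: (load[node], node))
            new_owners[role] = target
            load[target] += 1
    return new_owners


roles = {"VM1": "A", "VM2": "A", "SQL": "B"}
print(fail_over(roles, failed="A", survivors=["B", "C"]))
# {'VM1': 'C', 'VM2': 'B', 'SQL': 'B'}
```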