Dev Study
(Failover Cluster) How a failover cluster detects a failure, part 2
- ✅ Quorum exists to safely stop cluster services on nodes that are isolated or failed, to protect data integrity. No quorum = no right to operate.
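- ✅ RHS (Resource Hosting Subsystem, rhs.exe) hosts the cluster resources on each node and runs their health checks. When a resource or a node fails, the Cluster Service on the healthy nodes acts on that information to drive failover, such as moving VMs off a node that went down.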
🧠 The Core Principle
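Windows Failover Cluster evicts a node not because it's powered off, but because it's no longer part of a connected majority and therefore cannot be trusted to participate safely. ("Evict" is used loosely in this post to mean dropping out of active cluster membership; in the cluster management tools, evicting a node means permanently removing it from the cluster configuration.)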
✅ So — How Does the Cluster Decide to Evict a Node?
It’s a multi-stage process based on communication failure, not machine failure.
🔹 Step 1: Heartbeat Loss
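- Each node sends UDP heartbeats (port 3343) to every other node once per second by default (the SameSubnetDelay setting, 1000 ms).
- If a node misses a threshold number of consecutive heartbeats from another node, it marks that node as “suspect.” The threshold is the SameSubnetThreshold setting: 5 (about 5 seconds) by default on Windows Server 2012 R2 and earlier, and 10 (about 10 seconds) on Windows Server 2016 and later.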
🟡 Example: Node A stops receiving heartbeats from Node B.
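The heartbeat rule above can be sketched in a few lines of Python. This is only an illustrative model, not how the Cluster Service is implemented: the `HeartbeatMonitor` class and its method names are hypothetical, and the interval and threshold are plain parameters set to the 1-second, 5-miss values used in this post.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical model of per-peer heartbeat tracking (not the WSFC code)."""
    delay_s: float = 1.0   # heartbeat interval in seconds (assumed default)
    threshold: int = 5     # consecutive misses tolerated (value used in this post)
    last_seen: dict = field(default_factory=dict)

    def record(self, peer: str, now: float) -> None:
        """Remember when the last heartbeat from `peer` arrived."""
        self.last_seen[peer] = now

    def missed(self, peer: str, now: float) -> int:
        """Number of whole heartbeat intervals elapsed since the last heartbeat."""
        return int((now - self.last_seen[peer]) // self.delay_s)

    def is_suspect(self, peer: str, now: float) -> bool:
        """A peer becomes suspect once it has missed `threshold` heartbeats."""
        return self.missed(peer, now) >= self.threshold


monitor = HeartbeatMonitor()
monitor.record("NodeB", now=0.0)
print(monitor.is_suspect("NodeB", now=4.9))  # False: only 4 intervals missed
print(monitor.is_suspect("NodeB", now=5.0))  # True: 5 intervals missed
```

Timestamps are passed in explicitly rather than read from a clock, which keeps the sketch deterministic and easy to reason about.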
🔹 Step 2: Verification (RPC Check)
After a node is marked suspect, the cluster tries to validate it with higher-level communication:
Protocol | Purpose |
---|---|
RPC | Checks whether the node's Cluster Service is responsive |
SMB (TCP 445) | If necessary (e.g., FSW check), validates shared resource access |
If RPC communication also fails, the node is now treated as “unreachable.”
🔴 Node A cannot verify Node B → considers B unavailable to the cluster.
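As a rough way to picture the escalation from Step 1 to Step 2, the sketch below models a peer's status as a small state function. The `PeerState` enum and `classify_peer` function are hypothetical names, and the behaviour when heartbeats are lost but RPC still answers (the peer stays "suspect") is an assumption made for illustration; the post does not spell that case out.

```python
from enum import Enum


class PeerState(Enum):
    HEALTHY = "healthy"
    SUSPECT = "suspect"
    UNREACHABLE = "unreachable"


def classify_peer(missed_heartbeats: int, rpc_ok: bool, threshold: int = 5) -> PeerState:
    """Hypothetical two-stage check: heartbeats first, then an RPC probe."""
    if missed_heartbeats < threshold:
        # Step 1 not triggered: heartbeats are still arriving often enough.
        return PeerState.HEALTHY
    if rpc_ok:
        # Assumption: heartbeats were lost, but the Cluster Service still answered.
        return PeerState.SUSPECT
    # Both heartbeats and the RPC probe failed.
    return PeerState.UNREACHABLE


print(classify_peer(missed_heartbeats=2, rpc_ok=True))   # PeerState.HEALTHY
print(classify_peer(missed_heartbeats=5, rpc_ok=True))   # PeerState.SUSPECT
print(classify_peer(missed_heartbeats=5, rpc_ok=False))  # PeerState.UNREACHABLE
```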
🔹 Step 3: Local Node Self-Evaluation
Here’s the twist:
Node B may still think it's fine!
But it also performs the same checks and realizes “I can't reach anyone else.”
Both sides run quorum evaluation independently.
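To make "both sides evaluate independently" concrete, here is a minimal sketch in which each node works out which nodes it can still reach from its own point of view. The `reachable_from` function and the `links` map (node name to the set of peers with a working link) are hypothetical; this is an illustration of the idea, not the cluster's actual membership algorithm.

```python
def reachable_from(node, links):
    """Return the set of nodes that `node` can still reach, including itself."""
    seen = {node}
    stack = [node]
    while stack:
        current = stack.pop()
        for peer in links.get(current, ()):
            if peer not in seen:
                seen.add(peer)
                stack.append(peer)
    return seen


# Hypothetical 4-node cluster in which node D has lost all of its links.
links = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": set(),
}

print(sorted(reachable_from("A", links)))  # ['A', 'B', 'C']
print(sorted(reachable_from("D", links)))  # ['D']
```

Node A and node D arrive at different views of the cluster from the same kind of check, which is exactly the situation the quorum evaluation then has to resolve.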
🔹 Step 4: Quorum Check and Eviction
At this point, the cluster splits into partitions — each subset (or “group”) of nodes evaluates its own quorum:
Group | Votes | Has Quorum? |
---|---|---|
Group 1 (e.g., 5 nodes) | 5 votes | ✅ Yes → stays online |
Group 2 (isolated node) | 1 vote | ❌ No → evicts itself from cluster |
A node that doesn’t have quorum will stop its cluster service and evict itself to avoid corruption.
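The table's arithmetic is a simple majority test, sketched below. The function names are hypothetical, and the model is deliberately simplified: it assumes one vote per node and ignores witnesses and dynamic quorum adjustments, which change the vote counts in a real cluster.

```python
def has_quorum(partition_votes: int, total_votes: int) -> bool:
    """A partition has quorum when it holds a strict majority of all votes."""
    return partition_votes > total_votes // 2


def evaluate_partitions(partitions: dict) -> dict:
    """Map each partition name to whether it keeps quorum.

    `partitions` maps a partition name to the number of votes it holds.
    """
    total = sum(partitions.values())
    return {name: has_quorum(votes, total) for name, votes in partitions.items()}


# The 6-node example from the table: 5 connected nodes vs. 1 isolated node.
print(evaluate_partitions({"Group 1": 5, "Group 2": 1}))
# {'Group 1': True, 'Group 2': False}

# An even 3/3 split: neither side holds a strict majority.
print(evaluate_partitions({"Left": 3, "Right": 3}))
# {'Left': False, 'Right': False}
```

The even-split case shows why a strict majority matters: if a tie counted as quorum, both halves would keep running and could write to the same data.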
🧠 This is the key idea:
✅ Each node evicts itself if it determines it can’t be part of the quorum.
No other node forcefully "kicks it out." The cluster protocol is decentralized — each node independently determines if it still belongs.