(Failover Cluster) How a failover cluster detects failure, part 2

아이셩짱셩 2025. 5. 9. 18:57

 

  • Quorum exists to safely stop cluster services on nodes that are isolated or failed, to protect data integrity. No quorum = no right to operate.
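  • RHS (Resource Hosting Subsystem, rhs.exe) runs on the healthy nodes of the cluster. It hosts clustered resources and monitors their health, and the Cluster Service relies on it when managing failover, such as moving VMs when a node goes down.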

 

🧠 The Core Principle

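Windows Failover Cluster evicts a node not because it's powered off, but because it's no longer part of a connected majority and therefore cannot be trusted to participate safely. ("Evict" in this post means dropping the node from active cluster membership, not the permanent eviction that removes a node from the cluster configuration.)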


✅ So — How Does the Cluster Decide to Evict a Node?

It’s a multi-stage process based on communication failure, not machine failure.

🔹 Step 1: Heartbeat Loss

  • Each node sends UDP heartbeats (port 3343) to every other node once per second.
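  • If a node misses 5 consecutive heartbeats from another node (about 5 seconds at the 1-second interval), it marks that node as “suspect.” The count comes from the SameSubnetThreshold cluster property: 5 is the default on Windows Server 2012 R2 and earlier, while Windows Server 2016 and later default to 10.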

🟡 Example: Node A stops receiving heartbeats from Node B.
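
The heartbeat rule above can be sketched as a small counter. This is only an illustration of the idea, not how the Cluster Service is implemented: `HeartbeatMonitor` and its methods are hypothetical names, and the interval and threshold are the values quoted in this post, which should be treated as configurable.

```python
from dataclasses import dataclass, field

HEARTBEAT_INTERVAL_S = 1.0   # one heartbeat per second, as described above
MISSED_THRESHOLD = 5         # value used in this post; assume it is configurable


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat time seen from each peer (hypothetical helper)."""
    last_seen: dict = field(default_factory=dict)

    def record(self, peer: str, now: float) -> None:
        # Called whenever a heartbeat from `peer` arrives.
        self.last_seen[peer] = now

    def missed(self, peer: str, now: float) -> int:
        # Whole heartbeat intervals that have elapsed since the last heartbeat.
        return int((now - self.last_seen[peer]) // HEARTBEAT_INTERVAL_S)

    def is_suspect(self, peer: str, now: float) -> bool:
        return self.missed(peer, now) >= MISSED_THRESHOLD


monitor = HeartbeatMonitor()
monitor.record("NodeB", now=0.0)
print(monitor.is_suspect("NodeB", now=3.0))  # 3 intervals missed: not suspect yet
print(monitor.is_suspect("NodeB", now=5.0))  # 5 intervals missed: suspect
```

Passing `now` in explicitly, instead of reading the clock inside the class, keeps the sketch deterministic and easy to test.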


🔹 Step 2: Verification (RPC Check)

After a node is marked suspect, the cluster tries to validate it with higher-level communication:

 

Protocol | Purpose
RPC | Checks whether the node's Cluster Service is responsive
SMB (TCP 445) | If necessary (e.g., for a File Share Witness check), validates shared resource access
 

If RPC communication also fails, the node is now treated as “unreachable.”

🔴 Node A cannot verify Node B → considers B unavailable to the cluster.
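
The escalation from “suspect” to “unreachable” can be pictured as a small state function. This is a minimal sketch under stated assumptions: `PeerState`, `classify_peer`, and the `rpc_check` callback are hypothetical names invented for illustration, and the real RPC exchange between cluster nodes is not modeled.

```python
from enum import Enum
from typing import Callable


class PeerState(Enum):
    HEALTHY = "healthy"
    SUSPECT = "suspect"
    UNREACHABLE = "unreachable"


def classify_peer(heartbeats_ok: bool, rpc_check: Callable[[], bool]) -> PeerState:
    """Escalate from heartbeat loss to a higher-level check (illustrative only)."""
    if heartbeats_ok:
        return PeerState.HEALTHY
    # Heartbeats are missing, so the peer is suspect; try the higher-level check.
    try:
        if rpc_check():
            return PeerState.SUSPECT  # still answering, so keep watching it
    except OSError:
        pass  # a connection error counts as a failed check
    return PeerState.UNREACHABLE


def failing_rpc() -> bool:
    # Stand-in for an RPC call that cannot reach the peer.
    raise ConnectionRefusedError("no answer from peer")


print(classify_peer(True, failing_rpc).value)    # healthy
print(classify_peer(False, lambda: True).value)  # suspect
print(classify_peer(False, failing_rpc).value)   # unreachable
```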


🔹 Step 3: Local Node Self-Evaluation

Here’s the twist:

Node B may still think it's fine!
But it also performs the same checks and realizes “I can't reach anyone else.”

Both sides run quorum evaluation independently.


🔹 Step 4: Quorum Check and Eviction

At this point, the cluster has split into partitions, and each partition (or “group”) of nodes evaluates its own quorum against the cluster's total vote count:

 

Group | Votes | Has Quorum?
Group 1 (e.g., 5 nodes) | 5 votes | ✅ Yes → stays online
Group 2 (isolated node) | 1 vote | ❌ No → evicts itself from cluster
 

A node that doesn’t have quorum will stop its cluster service and evict itself to avoid corruption.
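
The arithmetic behind the table is a strict-majority test. The sketch below assumes one vote per node and ignores witness votes and dynamic quorum adjustments, so it is a simplification; `has_quorum` is a hypothetical helper name.

```python
def has_quorum(votes_in_partition: int, total_votes: int) -> bool:
    """A partition has quorum only with a strict majority of all votes."""
    return votes_in_partition > total_votes // 2


TOTAL_VOTES = 6  # 5 nodes in Group 1 plus 1 isolated node, as in the table

for name, votes in [("Group 1", 5), ("Group 2", 1)]:
    if has_quorum(votes, TOTAL_VOTES):
        action = "stays online"
    else:
        action = "stops its cluster service"
    print(f"{name}: {votes}/{TOTAL_VOTES} votes -> {action}")
```

With 6 votes in total, a partition needs at least 4 to keep running. An even 3/3 split leaves neither side with quorum, which is the situation a witness vote is meant to resolve.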

🧠 This is the key idea:

Each node evicts itself if it determines it can’t be part of the quorum.

No other node forcefully "kicks it out." The decision is decentralized: each node independently determines whether it still belongs, using only the nodes it can still reach.
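
To make the decentralized part concrete, the sketch below lets every node reach its own verdict from its own view of the network. The 4-node topology, the `links` map, and the function names are all hypothetical, and one vote per node is assumed.

```python
from collections import deque

# Hypothetical 4-node cluster; each entry lists the peers a node can still reach.
links = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": set(),  # D has lost its network path to everyone
}


def reachable_from(node: str) -> set:
    """Breadth-first search over working links: the partition that `node` sees."""
    seen, queue = {node}, deque([node])
    while queue:
        for peer in links[queue.popleft()]:
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return seen


def local_decision(node: str) -> str:
    # Each node counts only the votes it can see, one vote per node.
    partition = reachable_from(node)
    if len(partition) > len(links) // 2:
        return "stay online"
    return "stop cluster service"


for node in sorted(links):
    print(node, "->", local_decision(node))
```

Nodes A, B, and C each see a partition of three out of four votes and stay online, while D sees only itself and stops. No node tells another what to do; the outcomes agree because every node applies the same rule to its own view.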
