Nodes fail. Your key-value store must handle it:
Detection: Nodes exchange heartbeats. If a node misses several in a row, its peers mark it as presumed dead. Gossip protocols spread that failure information through the cluster.
Recovery: Surviving replicas keep serving reads. Hinted handoff stores writes meant for the unavailable node on a healthy node. When the node recovers, replay the hints to it.
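Permanent failure: If the node stays down past a configured timeout, treat the failure as permanent and re-replicate its data to healthy nodes. Merkle trees let replicas compare hashes of key ranges to find which keys need syncing.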
Split brain: A network partition can split the cluster into two groups, each believing the other is down. Require a quorum (a strict majority of nodes) for writes, so at most one side can keep accepting them.
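
A minimal sketch of how detection, hinted handoff, and the quorum check might fit together. It is a toy, single-process model, not a real implementation: the `Cluster` class, its method names, and the 3-second timeout are all hypothetical, every node is assumed to hold a replica of every key, and hints live in one shared dict rather than on a healthy peer. Merkle-tree syncing is left out.

```python
from collections import defaultdict


class Cluster:
    """Toy model of a fully replicated key-value store (illustrative only)."""

    def __init__(self, node_ids, heartbeat_timeout=3.0):
        self.stores = {n: {} for n in node_ids}      # node id -> local key/value data
        self.last_seen = {n: 0.0 for n in node_ids}  # node id -> time of last heartbeat
        self.hints = defaultdict(list)               # down node id -> queued (key, value)
        self.heartbeat_timeout = heartbeat_timeout   # assumed value, tune per deployment

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def is_alive(self, node, now):
        # Detection: presume a node dead once its heartbeats stop arriving.
        return now - self.last_seen[node] <= self.heartbeat_timeout

    def write(self, key, value, now):
        alive = [n for n in self.stores if self.is_alive(n, now)]
        quorum = len(self.stores) // 2 + 1
        # Split-brain guard: refuse the write unless a strict majority is reachable.
        if len(alive) < quorum:
            return False
        for node in self.stores:
            if node in alive:
                self.stores[node][key] = value
            else:
                # Hinted handoff: remember the write for the unreachable node.
                self.hints[node].append((key, value))
        return True

    def recover(self, node, now):
        # Recovery: mark the node alive again and replay its hints in order.
        self.heartbeat(node, now)
        for key, value in self.hints.pop(node, []):
            self.stores[node][key] = value
```

With three nodes the quorum is two, so a write succeeds while one node is down and is replayed to that node by `recover`. A side of a partition that can see only one node gets `False` from `write`.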