Fixing ECK Elasticsearch Cluster Issues with Karpenter and PodDisruptionBudgets
Recently, we ran into an issue with our Kubernetes cluster where Karpenter was updating AMIs. While this normally helps keep nodes up to date, in our case it was replacing nodes faster than Elasticsearch could recover. The result was a large Elasticsearch cluster that ended up in a state where all pods were stuck in Pending.
This happened because the cluster could not maintain enough master pods online at the same time to form a quorum. Without quorum, Elasticsearch cannot elect a leader or coordinate writes, which effectively stalls the entire system.
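If you want to confirm quorum loss directly rather than infer it from pod status, you can ask Elasticsearch itself. Here is a minimal sketch, assuming the ECK defaults of a <cluster-name>-es-http service and a <cluster-name>-es-elastic-user secret (the <namespace> and <cluster-name> placeholders are ours):

# Pull the elastic user's password from the ECK-generated secret
PASSWORD=$(kubectl -n <namespace> get secret <cluster-name>-es-elastic-user \
  -o jsonpath='{.data.elastic}' | base64 -d)

# Port-forward to the ECK-managed HTTP service and query cluster health;
# with no elected master this typically times out or returns a 503
# master_not_discovered_exception
kubectl -n <namespace> port-forward service/<cluster-name>-es-http 9200 &
curl -sk -u "elastic:$PASSWORD" "https://localhost:9200/_cluster/health?pretty"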
Initial Symptoms
When this problem hit, the cluster looked deceptively healthy at first glance. All the nodes were technically “running,” but the Elasticsearch pods never reached a true Running state. Looking deeper into the pod logs revealed repeated messages that the nodes could not discover enough masters to join the cluster.
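These are roughly the commands we used to dig in (pod names and namespace are placeholders); the telltale sign in recent Elasticsearch versions is a repeated “master not discovered” warning in the logs:

# List the Elasticsearch pods ECK manages for this cluster
kubectl -n <namespace> get pods -l elasticsearch.k8s.elastic.co/cluster-name=<cluster-name>

# Tail any master-eligible pod's logs and look for repeated discovery failures
kubectl -n <namespace> logs <pod-name> --tail=100 | grep -i "master not discovered"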
Our Attempted Fixes
Restarting pods one by one didn’t work. In fact, it often made the problem worse: with pods coming back in a staggered fashion, the cluster never had enough master-eligible nodes online at once to reach quorum. Elasticsearch remained stuck, waiting for enough master nodes to reappear.
The eventual solution was drastic but effective:
kubectl -n <namespace> delete po --all
This forced all Elasticsearch pods in the namespace to restart at the same time. By doing so, Elasticsearch was able to reinitialize its bootstrap discovery process, elect new masters, and reform the cluster correctly. Thankfully, no data was lost during this reset.
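After the restart, you can watch the cluster reform. ECK surfaces overall health on the Elasticsearch custom resource, so a quick check like the following (resource names are placeholders) is enough to confirm it returns to green:

# Watch the pods come back and rejoin the cluster
kubectl -n <namespace> get pods -w

# The ECK operator reports HEALTH and PHASE on the Elasticsearch resource
kubectl -n <namespace> get elasticsearch <cluster-name>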
Why We’re Cautious
We’re always hesitant to do full cluster restarts like this. In the past, we’ve seen situations where cached data issues forced us to delete underlying PVCs (persistent volume claims), which is a last-resort scenario because it carries the risk of data loss. In this case, we were lucky—the cluster healed itself without having to delete PVCs—but this experience highlighted the need for better safeguards.
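Before even considering that last resort, it’s worth enumerating exactly what would be affected. A simple check (the ECK-managed data volumes live in the same namespace as the pods):

# List the persistent volume claims backing the Elasticsearch data nodes
kubectl -n <namespace> get pvc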
Preventing This in the Future
The real lesson here is that clusters running stateful applications like Elasticsearch need properly configured PodDisruptionBudgets (PDBs). A PDB ensures that Kubernetes (or in our case, Karpenter) doesn’t evict or replace too many pods at once, which would prevent Elasticsearch from maintaining quorum.
Here’s an example PDB definition we now apply to our clusters:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: elasticsearch-pdb
  namespace: elasticsearch
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      elasticsearch.k8s.elastic.co/cluster-name: bgt-bps
With this in place, Karpenter (or any node autoscaler performing voluntary evictions) must wait until an evicted Elasticsearch pod has fully come back online before it can evict another. This keeps enough master nodes available to maintain quorum and significantly reduces the risk of a total cluster stall.
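You can verify the budget is actually being honored by checking how many disruptions are currently allowed; during an AMI rollout this should only ever drop to 0 while a pod is restarting. A short sketch, assuming the manifest above is saved as elasticsearch-pdb.yaml (a hypothetical filename):

# Apply the PodDisruptionBudget
kubectl apply -f elasticsearch-pdb.yaml

# ALLOWED DISRUPTIONS shows how many pods may be evicted right now
# (0 or 1 with maxUnavailable: 1)
kubectl -n elasticsearch get pdb elasticsearch-pdb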
Conclusion
This incident was a good reminder that while Kubernetes and tools like Karpenter provide incredible automation, stateful workloads like Elasticsearch require extra guardrails. Without PodDisruptionBudgets, an automated upgrade cycle can accidentally take down an entire cluster.
The combination of careful monitoring, safe disruption policies, and a solid disaster-recovery plan ensures that Elasticsearch can run reliably—even in the face of automated infrastructure updates.