Kubernetes Blog

Tuesday, December 11, 2018

etcd: Current status and future roadmap

Author: Gyuho Lee (Amazon Container OSS Team, @gyuho), Joe Betz (Google Cloud, @jpbetz)

etcd is a distributed key value store that provides a reliable way to manage the coordination state of distributed systems. etcd was first announced in June 2013 by CoreOS (part of Red Hat as of 2018). Since its adoption in Kubernetes in 2014, etcd has become a fundamental part of the Kubernetes cluster management software design, and the etcd community has grown exponentially. etcd is now being used in production environments of multiple companies, including large cloud provider environments such as AWS, Google Cloud Platform, Azure, and other on-premises Kubernetes implementations. CNCF currently has 32 conformant Kubernetes platforms and distributions, all of which use etcd as the datastore.

In this blog post, we’ll review some of the milestones achieved in latest etcd releases, and go over the future roadmap for etcd. Share your thoughts and feedback on features you consider important on the mailing list: etcd-dev@googlegroups.com.

etcd, 2013

In June 2014, Kubernetes was released with etcd as a backing storage for all master states. Kubernetes v0.4 used etcd v0.2 API, which was in an alpha stage at the time. As Kubernetes reached the v1.0 milestone in 2015, etcd stabilized its v2.0 API. The widespread adoption of Kubernetes led to a dramatic increase in the scalability requirements for etcd. To handle large number of workloads and the growing requirements on scale, etcd released v3.0 API in June 2016. Kubernetes v1.13 finally dropped support for etcd v2.0 API and adopted the etcd v3.0 API. The table below gives a visual snapshot of the release cycles of etcd and Kubernetes.

	etcd	Kubernetes
Initial Commit	June 2, 2013	June 1, 2014
First Stable Release	January 28, 2015 (v2.0.0)	July 13, 2015 (v1.0.0)
Latest Release	October 10, 2018 (v3.3.10)	December 3, 2018 (v1.13.0)

etcd v3.1, early 2017

etcd v3.1 features provide better read performance and better availability during version upgrades. Given the high use of etcd in production even to this day, these features were very useful for users. It implements Raft read index, which bypasses Raft WAL disk writes for linearizable reads. The follower requests read index from the leader. Responses from the leader indicate whether a follower has advanced as much as the leader. When the follower’s logs are up-to-date, quorum read is served locally without going through the full Raft protocol. Thus, no disk write is required for read requests. etcd v3.1 introduces automatic leadership transfer. When etcd leader receives an interrupt signal, it automatically transfers its leadership to a follower. This provides higher availability when the cluster adds or loses a member.

etcd v3.2 (summer 2017)

etcd v3.2 focuses on stability. Its client was shipped in Kubernetes v1.10, v1.11, and v1.12. The etcd team still actively maintains the branch by backporting all the bug fixes. This release introduces gRPC proxy to support, watch, and coalesce all watch event broadcasts into one gRPC stream. These event broadcasts can go up to one million events per second.

etcd v3.2 also introduces changes such as “snapshot-count” default value from 10,000 to 100,000. With higher snapshot count, etcd server holds Raft entries in-memory for longer periods before compacting the old ones. etcd v3.2 default configuration shows higher memory usage, while giving more time for slow followers to catch up. It is a trade-off between less frequent snapshot sends and higher memory usage. Users can employ lower etcd --snapshot-count value to reduce the memory usage or higher “snapshot-count” value to increase the availability of slow followers.

Another new feature backported to etcd v3.2.19 was etcd --initial-election-tick-advance flag. By default, a rejoining follower fast-forwards election ticks to speed up its initial cluster bootstrap. For example, the starting follower node only waits 200ms instead of full election timeout 1-second before starting an election. Ideally, within the 200ms, it receives a leader heartbeat and immediately joins the cluster as a follower. However, if network partition happens, heartbeat may drop and thus leadership election will be triggered. A vote request from a partitioned node is quite disruptive. If it contains a higher Raft term, current leader is forced to step down. With “initial-election-tick-advance” set to false, a rejoining node has more chance to receive leader heartbeats before disrupting the cluster.

etcd v3.3 (early 2018)

etcd v3.3 continues the theme of stability. Its client is included in Kubernetes v1.13. Previously, etcd client carelessly retried on network disconnects without any backoff or failover logic. The client was often stuck with a partitioned node, affecting several production users. v3.3 client balancer now maintains a list of unhealthy endpoints using gRPC health checking protocol, making more efficient retries and failover in the face of transient disconnects and network partitions. This was backported to etcd v3.2 and also included in Kubernetes v1.10 API server. etcd v3.3 also provides more predictable database size. etcd used to maintain a separate freelist DB to track pages that were no longer in use and freed after transactions, so that following transactions can reuse them. However, it turns out persisting freelist demands high disk space and introduces high latency for Kubernetes workloads. Especially when there were frequent snapshots with lots of read transactions, etcd database size quickly grew from 16 MB to 4 GB. etcd v3.3 disables freelist sync and rebuilds the freelist on restart. The overhead is so small that it is unnoticeable to most users. See “database space exceeded” issue for more information on this.

etcd v3.4 and beyond

etcd v3.4 focuses on improving the operational experience. It adds Raft pre-vote feature to improve the robustness of leadership election. When a node becomes isolated (e.g. network partition), this member will start an election requesting votes with increased Raft terms. When a leader receives a vote request with a higher term, it steps down to a follower. With pre-vote, Raft runs an additional election phase to check if the candidate can get enough votes to win an election. The isolated follower’s vote request is rejected because it does not contain the latest log entries.

etcd v3.4 adds a Raft learner that joins the cluster as a non-voting member that still receives all the updates from leader. Adding a learner node does not increase the size of quorum and hence improves the cluster availability during membership reconfiguration. It only serves as a standby node until it gets promoted to a voting member. Moreover, to handle unexpected upgrade failures, v3.4 introduces etcd downgrade feature.

etcd v3 storage uses multi-version concurrency control model to preserve key updates as event history. Kubernetes runs compaction to discard the event history that is no longer needed, and reclaims the storage space. etcd v3.4 will improve this storage compact operation, boost backend concurrency for large read transactions, and optimize storage commit interval for Kubernetes use-case.

To further improve etcd client load balancer, the v3.4 balancer was rewritten to leverage the newly introduced gRPC load balancing API. By leveraging gPRC, the etcd client load balancer codebase was substantially simplified while retaining feature parity with the v3.3 implementation and improving overall load balancing by round-robining requests across healthy endpoints. See Client Architecture for more details.

Additionally, etcd maintainers will continue to make improvements to Kubernetes test frameworks: kubemark integration for scalability tests, Kubernetes API server conformance tests with etcd to provide release recommends and version skew policy, specifying conformance testing requirements for each cloud provider, etc.

etcd Joins CNCF

etcd now has a new home at etcd-io and joined CNCF as an incubating project.

The synergistic efforts with Kubernetes have driven the evolution of etcd. Without community feedback and contribution, etcd could not have achieved its maturity and reliability. We’re looking forward to continuing the growth of etcd as an open source project and are excited to work with the Kubernetes and the wider CNCF community.

Finally, we’d like to thank all contributors with special thanks to Xiang Li for his leadership in etcd and Kubernetes.

Creating a Raspberry Pi cluster running Kubernetes, the installation (Part 2) Dec 22
Managing Kubernetes Pods, Services and Replication Controllers with Puppet Dec 17
How Weave built a multi-deployment solution for Scope using Kubernetes Dec 12
Creating a Raspberry Pi cluster running Kubernetes, the shopping list (Part 1) Nov 25
Monitoring Kubernetes with Sysdig Nov 19
One million requests per second: Dependable and dynamic distributed systems at scale Nov 11
Kubernetes 1.1 Performance upgrades, improved tooling and a growing community Nov 9
Kubernetes as Foundation for Cloud Native PaaS Nov 3
Some things you didn’t know about kubectl Oct 28
Kubernetes Performance Measurements and Roadmap Sep 10
Using Kubernetes Namespaces to Manage Environments Aug 28
Weekly Kubernetes Community Hangout Notes - July 31 2015 Aug 4
The Growing Kubernetes Ecosystem Jul 24
Weekly Kubernetes Community Hangout Notes - July 17 2015 Jul 23
Strong, Simple SSL for Kubernetes Services Jul 14
Weekly Kubernetes Community Hangout Notes - July 10 2015 Jul 13
Announcing the First Kubernetes Enterprise Training Course Jul 8
Kubernetes 1.0 Launch Event at OSCON Jul 2
How did the Quake demo from DockerCon Work? Jul 2
The Distributed System ToolKit: Patterns for Composite Containers Jun 29
Slides: Cluster Management with Kubernetes, talk given at the University of Edinburgh Jun 26
Cluster Level Logging with Kubernetes Jun 11
Weekly Kubernetes Community Hangout Notes - May 22 2015 Jun 2
Kubernetes on OpenStack May 19
Weekly Kubernetes Community Hangout Notes - May 15 2015 May 18
Docker and Kubernetes and AppC May 18
Kubernetes Release: 0.17.0 May 15
Resource Usage Monitoring in Kubernetes May 12
Weekly Kubernetes Community Hangout Notes - May 1 2015 May 11
Kubernetes Release: 0.16.0 May 11
AppC Support for Kubernetes through RKT May 4
Weekly Kubernetes Community Hangout Notes - April 24 2015 Apr 30
Borg: The Predecessor to Kubernetes Apr 23
Kubernetes and the Mesosphere DCOS Apr 22
Weekly Kubernetes Community Hangout Notes - April 17 2015 Apr 17
Kubernetes Release: 0.15.0 Apr 16
Introducing Kubernetes API Version v1beta3 Apr 16
Weekly Kubernetes Community Hangout Notes - April 10 2015 Apr 11
Faster than a speeding Latte Apr 6
Weekly Kubernetes Community Hangout Notes - April 3 2015 Apr 4
Participate in a Kubernetes User Experience Study Mar 31
Weekly Kubernetes Community Hangout Notes - March 27 2015 Mar 28
Kubernetes Gathering Videos Mar 23
Welcome to the Kubernetes Blog! Mar 20