Digging Into etcd
OSS:kubernetes::etcd:December 1st, 2019
⚠ Warning: You've discovered a draft post! 🚧
This entry is still under construction and shouldn't be listed anywhere ... 🤔
What Is etcd?
etcd, /ˈɛtsiːdiː/, per the official site is:
A distributed, reliable key-value store for the most critical data of a distributed system
Per the FAQ, etcd’s name means “distributed etc directory”, with etc being a reference to /etc, the Unix directory for system-wide configuration, and d being a reference to “distributed” [1].
The d is perhaps also a pun on the long history of naming daemons with a d suffix (see: httpd, ntpd, systemd, containerd, …), though I’ve not yet found proof of this.
Kubernetes uses etcd as the backing store for cluster data, which drove my own interest in learning more about it. Clearly a lot of clusters out there are using etcd for critical data storage, but how does it work? Why etcd?
History
For a history of etcd see: coreos.com/blog/history-etcd (Archived)
For bonus points: www.wired.com/2013/08/coreos-the-new-linux/ (Archived)
Roughly: etcd was created out of a desire for a distributed data store addressing the following issues:
- Google’s Chubby paper was public, but Chubby itself was (and is) not.
- ZooKeeper was expensive to run, didn’t scale down, and couldn’t be interacted with via common simple tools like curl.
Initially etcd was developed by CoreOS for their fleet container orchestration system, but it was quickly adopted for other uses and later donated to the CNCF.
Architecture
Overview
- etcd is a Go binary with a separate CLI (etcdctl).
- etcd exposes a gRPC service along with an HTTP JSON API.
- Data is persisted in multiversion key-value format, stored on disk.
- Typically one, three, or five replicas are used.
- Each replica stores the full dataset, following the leader.
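As a rough illustration of the HTTP JSON API, here is a minimal Python sketch that builds the request bodies for a put and a single-key read. It assumes the v3 gateway paths /v3/kv/put and /v3/kv/range (older releases used /v3beta or /v3alpha prefixes) and a server on the default client port at localhost:2379. Keys and values travel as base64 strings in the JSON body. The helper names are my own and not part of any etcd client.

```python
import base64
import json
import urllib.request

# Assumption: a local etcd with the v3 JSON gateway on the default client port.
ETCD_URL = "http://localhost:2379"


def b64(text: str) -> str:
    """The JSON gateway expects keys and values as base64 strings."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def put_body(key: str, value: str) -> dict:
    """Request body for POST /v3/kv/put."""
    return {"key": b64(key), "value": b64(value)}


def range_body(key: str) -> dict:
    """Request body for POST /v3/kv/range (a single-key read)."""
    return {"key": b64(key)}


def post(path: str, body: dict) -> dict:
    """Send a JSON request to etcd; needs a running server to succeed."""
    req = urllib.request.Request(
        ETCD_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Only builds and prints the payload; no network access needed.
    print(json.dumps(put_body("foo", "bar")))
    # With etcd running locally, the calls would look like:
    # post("/v3/kv/put", put_body("foo", "bar"))
    # post("/v3/kv/range", range_body("foo"))
```

The same bodies work with curl, which is one of the simple tools the history section mentions as a motivation.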
Data Model
etcd’s upstream data model documentation is instructive here; I highly recommend reading it.
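To make "multiversion" a bit more concrete, here is a toy in-memory sketch of the idea. It is not etcd's actual implementation: every write bumps a store-wide revision, and older values stay readable at earlier revisions. The class and method names are hypothetical. Real etcd also tracks per-key versions, supports deletes, and compacts old revisions.

```python
from typing import Dict, List, Optional, Tuple


class ToyMVCCStore:
    """Toy multiversion store: keeps every (revision, value) pair per key."""

    def __init__(self) -> None:
        self.revision = 0  # store-wide counter, bumped on every write
        self._history: Dict[str, List[Tuple[int, str]]] = {}

    def put(self, key: str, value: str) -> int:
        """Write a new version of key and return the revision it got."""
        self.revision += 1
        self._history.setdefault(key, []).append((self.revision, value))
        return self.revision

    def get(self, key: str, at_revision: Optional[int] = None) -> Optional[str]:
        """Read key as of at_revision (default: the latest revision)."""
        if at_revision is None:
            at_revision = self.revision
        result = None
        # History is appended in revision order, so stop at the first newer entry.
        for rev, value in self._history.get(key, []):
            if rev > at_revision:
                break
            result = value
        return result


store = ToyMVCCStore()
store.put("color", "red")    # revision 1
store.put("shape", "round")  # revision 2
store.put("color", "blue")   # revision 3

print(store.get("color"))                 # blue
print(store.get("color", at_revision=2))  # red
print(store.get("shape", at_revision=1))  # None (key did not exist yet)
```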
Storage
Data is stored with a memory-mapped B+ tree using bbolt, a fork of bolt, inspired by LMDB.
Consensus
Leader election is used to maintain a single leader replica. All requests are routed to the leader internally and committed only after achieving consensus on the request.
Raft is the consensus algorithm used both for requests and for leader elections. The official Raft site is a good reference for understanding how this works. Another great resource linked from the official site is thesecretlivesofdata.com/raft/.
etcd’s Raft implementation is widely used and contains some useful documentation.
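A bit of arithmetic shows why the odd replica counts from the overview are typical. Raft needs a majority (a quorum) of members to agree before a request commits, so a cluster keeps working only while a majority is reachable. A small sketch, with function names of my own choosing:

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given member count."""
    return members // 2 + 1


def failure_tolerance(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


for n in range(1, 6):
    print(f"members={n} quorum={quorum(n)} tolerates={failure_tolerance(n)}")
```

Going from three to four members raises the quorum from two to three without tolerating any additional failures, which is why even cluster sizes are rarely worth it.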
TODO
- elaborate on Kubernetes’s usage
- talk more about multiversion and revisions
- talk more about data model
- talk more about supported operations
Additional Resources
The Carnegie Mellon Database Group “Database of Databases” site has a great page on etcd at dbdb.io/db/etcd
[1] As more clearly evidenced in an older version of the etcd docs:
The name “etcd” originated from two ideas, the unix “/etc” folder and “d"istributed systems. The “/etc” folder is a place to store configuration data for a single system whereas etcd stores configuration information for large scale distributed systems. Hence, a “d"istributed “/etc” is “etcd”.
This “Why etcd” page doesn’t exist in currently supported versions.