Why Lyft Built Its Own Service Mesh and What it Learned From Doing It

On September 14, 2016, we announced Envoy, our L7 proxy and communication bus. In a nutshell, Envoy is a "service mesh" substrate that provides common utilities such as service discovery, load balancing, rate limiting, circuit breaking, stats, logging, tracing, etc. to polyglot (heterogeneous) application architectures.

As development teams move towards a service-oriented architecture (SoA) with the goal of faster iteration via better abstractions and decentralized deployment, they are finding the reality on the ground much more painful. Operating an SoA leads to myriad problems, primarily around the reliability of network communication, that do not exist in a monolith. A "service mesh" substrate like Envoy provides common plumbing that allows developers to focus on their applications and not worry about getting data from point A to point B, unlocking the true power and agility of SoA at the application level.

We knew that we had built a compelling product that was central to Lyft's ability to scale its service-oriented architecture; however, we were surprised by the industry-wide interest in Envoy following its launch. It's been an exciting (and overwhelming) 7 months! We are thrilled by the positive reception and wide uptake Envoy has since received.

Why All the Interest?

As it turns out, almost every company with a moderately sized service-oriented architecture is having the same problems that Lyft did prior to the development and deployment of Envoy:

  • An architecture composed of a variety of languages, each containing a half-baked RPC library, including partial (or zero) implementations of rate limiting, circuit breaking, timeouts, retries, etc.
  • Differing or partial implementations of stats, logging, and tracing across both owned services and infrastructure components such as ELBs.
  • A desire to move to SoA for the decompositional scaling benefits, but an on-the-ground reality of chaos as application developers struggle to make sense of an inherently unreliable network substrate.

In summary: an operational and reliability headache.

Though Envoy contains an abundance of features, the industry appears to view the following design points as the most compelling:

High performance native code implementation: Like it or not, most large organizations still have a "performance checkmark" for system components like sidecar proxies, which can only be satisfied by native code, especially regarding CPU usage, memory usage, and tail latency properties. Historically, HAProxy and NGINX (including the paid Plus version) have dominated this category. HAProxy has not sustained the feature velocity required for a modern service mesh, and so is starting to fall by the wayside. NGINX has focused most of its development efforts in this space on its paid Plus product. Furthermore, NGINX is known to have a somewhat opaque development process. These points have culminated in a desire within the industry for a community-first, high-performance, well-designed, and extensible modern native code proxy. This desire was much larger than we realized when we first open sourced Envoy, and Envoy fills the gap.

Eventually consistent service discovery: Historically, most SoAs have used fully consistent service discovery systems that are hard to run at scale. Envoy treats service discovery as eventually consistent and lossy. At Lyft, this has led to extremely high reliability without the maintenance headache of systems typically used for this purpose such as etcd, ZooKeeper, etc.
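To make the eventually consistent model concrete, here is a minimal sketch (not Envoy code) of a discovery client that treats host data as best-effort: it caches the last known host list and keeps serving it even when a refresh fails or returns nothing. The class name, endpoint path, and response shape are illustrative assumptions, not part of any published API.

```python
import random
import threading
import time

import requests  # assumed HTTP client; any client library would do


class EventuallyConsistentDiscovery:
    """Caches the last known host list for a service and keeps using it
    even when a refresh fails; stale data is acceptable by design."""

    def __init__(self, discovery_url, service, refresh_interval_s=30):
        self._url = f"{discovery_url}/v1/registration/{service}"  # illustrative path
        self._hosts = []  # last known good host list (possibly stale)
        self._lock = threading.Lock()
        self._interval = refresh_interval_s

    def _refresh_loop(self):
        while True:
            try:
                resp = requests.get(self._url, timeout=2)
                resp.raise_for_status()
                hosts = resp.json().get("hosts", [])
                if hosts:  # never replace a good list with an empty one
                    with self._lock:
                        self._hosts = hosts
            except Exception:
                # Refresh failed: keep serving the cached hosts.
                pass
            time.sleep(self._interval)

    def start(self):
        threading.Thread(target=self._refresh_loop, daemon=True).start()

    def pick_host(self):
        with self._lock:
            return random.choice(self._hosts) if self._hosts else None
```

The key property is that a failed or partial refresh never takes the caller down; the worst case is routing to slightly stale hosts, which is exactly the trade-off an eventually consistent mesh accepts.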

API driven configuration: Fundamentally, we view Envoy as a universal data plane for SoAs. However, every deployment is different and it makes little sense to be opinionated about all of the ancillary components that are required for Envoy to function. To this end, we are clearly documenting all of the APIs that Envoy uses to interact with control plane components and other services. For example, Envoy documents and implements the Service Discovery Service (SDS), Cluster Discovery Service (CDS), and Route Discovery Service (RDS) REST APIs that can be implemented by management systems to dynamically configure Envoy. Other defined APIs include a global rate limiting service as well as client TLS authentication. More are on the way, including gRPC variants of the REST APIs. Using the published APIs, integrators can build systems that are simultaneously extremely complex and user-friendly, tailored to a particular deployment. We have open sourced the discovery and ratelimit services that we use in production as reference implementations.
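As a rough illustration of how small such a control plane component can be, here is a sketch of an SDS-style registration endpoint that Envoy could poll for host data. The route path and field names follow roughly the shape of the documented v1 SDS registration response, but treat the specifics (service names, ports, tags, the static registry) as illustrative assumptions rather than a reference implementation.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# A static registry standing in for whatever actually backs the control
# plane (a database, an orchestrator API, etc.). Entries are illustrative.
REGISTRY = {
    "users-service": [
        {"ip_address": "10.0.0.12", "port": 8080, "tags": {"az": "us-east-1a"}},
        {"ip_address": "10.0.0.34", "port": 8080, "tags": {"az": "us-east-1b"}},
    ],
}


@app.route("/v1/registration/<service_name>")
def registration(service_name):
    # Envoy polls this endpoint periodically; an empty or slightly stale
    # answer is tolerable because the data is eventually consistent.
    return jsonify({"hosts": REGISTRY.get(service_name, [])})


if __name__ == "__main__":
    app.run(port=8500)  # port chosen arbitrarily for the example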

Matt Klein is a Software Engineer @lyft https://t.co/U4EcVLSOEU
