Engineering • 8 min read
By Yuri Neves
20.06.2024
The objective of this article is to discuss the architecture of Bitpanda’s Service Mesh, and how it enables transparent TLS encryption everywhere. A basic understanding of networking and Kubernetes is beneficial, but not strictly required.
From its humble beginnings as a small Austrian startup driven by a true passion for cryptocurrency, Bitpanda set out to simplify wealth creation for everyone. This means building secure, fast, easy-to-use platforms and transaction systems that are trustworthy and provide value to our customers.
Of course, this simplicity abstracts a very complex cloud infrastructure and, as is the case with our customer-facing products, we strive to design and build that infrastructure with state-of-the-art technology and security controls. As our customer base keeps growing, so do the range of services we offer and our infrastructure footprint. With this more-than-welcome growth comes the need to further improve the scalability of our platform, so that we can orchestrate communication between our services more efficiently.
We have been running containerized applications on Amazon Elastic Kubernetes Service (EKS, for short) from the start, with multiple EKS clusters distributed across multiple VPCs and AWS accounts. This worked well until a challenge presented itself: communication between services became problematic, because EKS clusters were presumed to be self-contained from an infrastructure perspective, yet new services often needed to communicate with one another across the boundaries of their hosting clusters.
This led to workarounds such as “gluing” clusters together with internal load balancers and DNS trickery. We needed a long-term solution, so we decided to take the next logical step and bring the EKS clusters under a single banner (a service mesh) without compromising on security. Moreover, we wanted to secure communications with mutual TLS (mTLS) across the boundaries of EKS clusters, towards and between services, both existing and new, automatically and at scale. Sounds ambitious? Of course!
When “shopping” for possible technologies to achieve this, we were clear on our requirements - we wanted something Kubernetes-native, that had been out there for a while, and that integrated well with our existing monitoring and continuous deployment workflows; Istio quickly became the obvious choice for us.
Istio is a feature-rich, open-source service mesh solution that provides Kubernetes-native traffic management, observability and security capabilities to distributed applications. It took a little bit of tweaking to get it right, as the Istio official documentation seems to favour imperative mechanisms of deployment over declarative ones and Helm installation is somewhat under-documented, but we figured it out!
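To illustrate the declarative route, the multi-cluster settings can be kept in a Helm values file for the upstream `istiod` chart. This is a minimal sketch rather than our actual configuration: the file name, mesh ID, cluster name and network name are hypothetical, and the trust domain is simply borrowed from the identity used later in this article.

```yaml
# values-istiod-cluster-a.yaml (hypothetical file name)
# Minimal multi-cluster settings for the upstream istiod Helm chart.
global:
  meshID: example-mesh           # hypothetical: identical in every cluster of the mesh
  multiCluster:
    clusterName: cluster-a       # hypothetical: unique per cluster
  network: network-a             # hypothetical: one network name per VPC
meshConfig:
  trustDomain: cluster_a.ew.bp   # trust domain that shows up in workload identities
```

A file like this can be versioned next to the application charts and applied by the same continuous deployment tooling, which is the main appeal of the declarative approach.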
In a nutshell, Istio consists of the following main components:
The control plane, responsible for service discovery, certificate management and propagation of proxy configuration
The data plane, a distributed overlay network of sidecar proxies running alongside application containers
The data plane does the network heavy-lifting by intercepting inbound/outbound traffic on the Pod’s network interface, thus abstracting all networking operations from the application container.
Thanks to its clever distributed proxy architecture, Istio acts transparently and externally to application containers and, as a result, it does not interfere with application development lifecycle or product roadmaps. That’s a win-win-win for product engineering, DevOps and IT security.
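From an application team's point of view, joining the mesh can be as small as a namespace label. The sketch below assumes a default (non-revisioned) Istio installation with automatic sidecar injection enabled; the namespace name `app-a` is only an example.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: app-a                    # example namespace
  labels:
    # Tells the Istio injection webhook to add the sidecar proxy
    # to every Pod created in this namespace.
    istio-injection: enabled
```

Pods that already exist need to be restarted before they pick up the sidecar; the application image itself does not change.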
To understand the capabilities of Istio, consider the diagram below, in which 2 EKS clusters (Cluster A and Cluster B) are deployed in separate VPCs, in separate AWS accounts:
This is the reference architecture of our service mesh. We will use this diagram as a basis to further analyse Istio’s core functionalities, such as service discovery and certificate management.
The Istio control plane (Istiod Pods) is responsible for creating a service registry that contains Kubernetes services within the mesh and their respective Pod IPs, effectively building a “map” of the mesh network. Since our mesh extends beyond the boundaries of a single EKS cluster, we had to ensure that the Istio control plane can communicate with the Kubernetes API of the counterpart clusters as well.
We achieved this by performing the following steps:
Establishing a transitive connection between VPCs via AWS Transit Gateway, thus enabling secure and private network connectivity towards the Kubernetes API of the counterparts
Exchanging read-only kubeconfig files between counterparts, allowing mutual service discovery
With these 2 steps, the Istio control plane in each cluster can communicate with the Kubernetes API of its counterpart and build a common service registry.
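Upstream Istio consumes such a kubeconfig as a labelled Secret in the control plane namespace (this is what `istioctl create-remote-secret` generates). The sketch below assumes that convention; the cluster name, server URL and user name are hypothetical, and the values in angle brackets are placeholders.

```yaml
# Applied in cluster A so that its istiod can watch services in cluster B.
apiVersion: v1
kind: Secret
metadata:
  name: istio-remote-secret-cluster-b
  namespace: istio-system
  labels:
    istio/multiCluster: "true"                 # istiod watches Secrets carrying this label
  annotations:
    networking.istio.io/cluster: cluster-b     # hypothetical cluster name
stringData:
  cluster-b: |
    apiVersion: v1
    kind: Config
    clusters:
      - name: cluster-b
        cluster:
          server: https://cluster-b.example.internal   # hypothetical API endpoint, reachable via Transit Gateway
          certificate-authority-data: <base64-encoded CA bundle>
    users:
      - name: istio-reader
        user:
          token: <token of a read-only service account>
    contexts:
      - name: cluster-b
        context:
          cluster: cluster-b
          user: istio-reader
    current-context: cluster-b
```

The mirror image of this Secret, pointing at cluster A, would be applied in cluster B to make discovery mutual.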
The first step was done: we had a service registry! Now we needed a mechanism for automating the provisioning and distribution of TLS certificates across the mesh. As you’ll see, in our setup Istio works in conjunction with cert-manager plugins to automatically create, sign, and distribute TLS certificates.
An AWS Private Certificate Authority (or AWS PCA for short) root certificate is the cornerstone upon which trust is established within the mesh. The cert-manager AWS PCA plugin communicates with the AWS API to create and sign an intermediate certificate authority subordinate to this root authority, which is then provisioned to the EKS cluster. One such intermediate authority exists per EKS cluster in the mesh, and its lifecycle is managed by the cert-manager AWS PCA plugin. This design is more secure than operating under a single certificate authority, because the blast radius of a compromise can be contained by revoking trust in a “bad” intermediate authority.
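In cert-manager terms, this chain could be expressed with three resources: an issuer backed by AWS PCA, a CA certificate requested from it, and a CA issuer that signs with the resulting key pair. The sketch below assumes the open-source AWS PCA issuer for cert-manager; every name, the ARN, the region and the lifetimes are hypothetical.

```yaml
# 1. Issuer that forwards signing requests to the AWS PCA root.
apiVersion: awspca.cert-manager.io/v1beta1
kind: AWSPCAClusterIssuer
metadata:
  name: root-pca                               # hypothetical
spec:
  arn: arn:aws:acm-pca:eu-west-1:111122223333:certificate-authority/<ca-id>   # placeholder ARN
  region: eu-west-1                            # hypothetical
---
# 2. Per-cluster intermediate CA, signed by the root and stored as a Secret.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: istio-intermediate-ca                  # hypothetical
  namespace: istio-system
spec:
  isCA: true
  commonName: cluster-a intermediate CA        # hypothetical
  secretName: istio-intermediate-ca
  duration: 720h                               # hypothetical lifetime
  renewBefore: 240h                            # cert-manager renews ahead of expiry
  privateKey:
    algorithm: ECDSA
    size: 256
  issuerRef:
    group: awspca.cert-manager.io
    kind: AWSPCAClusterIssuer
    name: root-pca
---
# 3. CA issuer that signs workload certificates with the intermediate key pair.
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: istio-ca                               # hypothetical
  namespace: istio-system
spec:
  ca:
    secretName: istio-intermediate-ca
```

Because each cluster holds its own intermediate, rotating or revoking one does not require touching the others.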
The Istio control plane configures the sidecar proxies to send certificate signing requests (CSRs) to the Istio CSR plugin (istio-csr). The plugin receives the CSRs and interfaces with cert-manager to return a TLS certificate signed by the intermediate authority. The workload identity in the certificate uses the SPIFFE format, encoded as a URI Subject Alternative Name, and contains the cluster domain (trust domain) along with the application namespace and service account (`spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>`).
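On the Istio side, handing certificate signing over to istio-csr comes down to two `istiod` Helm values. This is a sketch based on the istio-csr documentation; the service address assumes istio-csr is installed under its default name in the `cert-manager` namespace, which may differ in other setups.

```yaml
# Additional values for the istiod Helm chart.
global:
  # Sidecars send their CSRs here instead of to istiod
  # (assumed default istio-csr service name and namespace).
  caAddress: cert-manager-istio-csr.cert-manager.svc:443
pilot:
  env:
    # Turn off istiod's built-in certificate authority.
    ENABLE_CA_SERVER: "false"

# With the trust domain used in this article, a workload running as service
# account "sa-a" in namespace "app-a" would then receive the identity
#   spiffe://cluster_a.ew.bp/ns/app-a/sa/sa-a
```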
The diagram below demonstrates workload and intermediate authority certificate request flows:
Now that we know how the Istio control plane maps out the mesh network and enables distribution of TLS certificates for the data plane, let’s take a closer look at mTLS communication.
Istio provides the following mTLS modes, which can be set at different scopes (cluster-wide, namespace, Pod or port):
DISABLE - disable mTLS
UNSET - inherit mTLS setting from the parent scope (e.g. cluster, namespace, etc.)
PERMISSIVE - use mTLS if both peers support it, accept plain-text traffic otherwise
STRICT - require mTLS
The example below demonstrates the strict mTLS mode in action, which requires both parties to run Istio sidecar-proxies:
Strict mTLS enables mutual TLS communication between proxy-enabled applications
Strict mTLS disables non-TLS communication initiated by applications without an Istio sidecar proxy
So when we initially designed our Istio mesh, we decided to follow an “opt-in” strategy and increase Istio’s adoption in gradual steps. For this reason we set Pod-level permissive mTLS in our Helm templates as a default, and allowed development teams to decide whether to enforce strict mTLS on their own by overriding this value. We have since moved on to use cluster-wide strict mTLS by default, thus encrypting communications all across the board. As you can imagine, this requires a good amount of planning and communication (not between services, but between people, which is ever so slightly more challenging).
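Both stages of this rollout map onto Istio's `PeerAuthentication` resource. The sketch below shows a mesh-wide strict default together with a workload-level permissive override; it assumes `istio-system` is the Istio root namespace, and the policy name `app-b-permissive` is hypothetical. The labels are reused from the authorization example later in this article.

```yaml
# Mesh-wide default: a policy without a selector in the root namespace
# (conventionally named "default") applies to every workload in the mesh.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
---
# Workload-level override: app-b keeps accepting plain-text traffic
# while its remaining clients are being onboarded to the mesh.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: app-b-permissive         # hypothetical
  namespace: app-b
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: app-b
  mtls:
    mode: PERMISSIVE
```

The narrower scope wins, so deleting the override later is enough to bring the workload back under the strict default.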
In addition to peer authentication, Istio mTLS enables us to use another very cool security feature of Istio: peer-identity-based authorization.
Consider the following scenario:
A mission-critical web application running in EKS cluster B should only accept communications coming from identity `cluster_a.ew.bp/ns/app-a/sa/sa-a` from EKS cluster A.
We’ve already seen that the peer identity is embedded into the workload certificate presented by proxy peers during the mTLS handshake. As a result, we can explicitly allow-list the peer identities contained in certificates, limiting the exposure of an application to only those peers that need to send data to it or consume data from it.
An Istio AuthorizationPolicy to achieve this configuration would look something like this:
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: app-b-authz-rw
  namespace: app-b
spec:
  action: ALLOW
  rules:
    - from:
        - source:
            principals:
              - cluster_a.ew.bp/ns/app-a/sa/sa-a
  selector:
    matchLabels:
      app.kubernetes.io/instance: app-b
      app.kubernetes.io/name: app-b
Authorization Policies are very powerful and deserve a dedicated article in their own right. They can insulate a service from mesh and non-mesh sources alike, but we’ll ignore other sources for now and focus instead on mTLS principals.
Now you may wonder: What about non-mesh clients such as AWS Lambda Functions, VPN users and Bitpanda customers connecting to our APIs? How can they establish mTLS connections with our services running within the mesh?
I’m glad you asked! The answer is uncomplicated: at the edge of the mesh sits the Istio Ingress Gateway, an ingress proxy exposed outside of the mesh by an AWS Elastic Load Balancer. The Ingress Gateway receives external connections from the load balancer and establishes the mTLS connection to backend services on behalf of clients. It’s important to note that this refers strictly to mTLS peer authentication and has nothing to do with the authentication and verification of user API tokens, which have their own dedicated flows.
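A minimal sketch of such an edge configuration is shown below. It assumes that TLS from the client terminates at the Ingress Gateway (the load balancer could equally terminate it) and that the default `istio: ingressgateway` label is in use; the host name, secret name and backend port are hypothetical.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: app-b-gateway            # hypothetical
  namespace: app-b
spec:
  selector:
    istio: ingressgateway        # default label of the Istio ingress gateway Pods
  servers:
    - port:
        number: 443
        name: https
        protocol: HTTPS
      tls:
        mode: SIMPLE             # one-way TLS from the external client
        credentialName: app-b-tls   # hypothetical Secret in the gateway's namespace
      hosts:
        - app-b.example.com      # hypothetical
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: app-b
  namespace: app-b
spec:
  hosts:
    - app-b.example.com
  gateways:
    - app-b-gateway
  http:
    - route:
        - destination:
            host: app-b.app-b.svc.cluster.local
            port:
              number: 8080       # hypothetical service port
```

Because the gateway opens the mTLS connection itself, the backend sees the gateway's own service-account identity rather than that of the external client, so an AuthorizationPolicy in front of the backend has to allow that principal too.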
Operating a multi-network, multi-cluster mesh of distributed applications communicating over encrypted TLS tunnels at scale requires careful planning and an ongoing conversation with upper management. The journey was challenging. Security is often inversely proportional to convenience, and it’s hard not to empathise with development teams and their endless backlog of feature requests to support business goals. Thanks to Istio’s non-intrusive network management and security design, however, we have been able to onboard services to the mesh steadily and transparently, improving our security posture without interfering with product engineering or the development lifecycle.
We are currently experimenting with Ambient, a brand new deployment model for Istio which replaces the iptables-based, distributed sidecar-proxies with eBPF-capable Istio router DaemonSets (known as ztunnels). This will simplify our architecture further, as all mesh traffic will be orchestrated without the need for sidecar-containers, and reduce the container footprint of our EKS clusters.
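In upstream Istio, a namespace opts into ambient mode with a label instead of sidecar injection. The sketch below assumes a cluster where the ambient components (including ztunnel) are already installed; the namespace name is only an example.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: app-a                    # example namespace
  labels:
    # Traffic of Pods in this namespace is captured by the node-local ztunnel;
    # no sidecar container is added to the Pods.
    istio.io/dataplane-mode: ambient
```

Unlike sidecar injection, this does not require restarting the workloads in the namespace.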
All this goes to show that with Istio, TLS really is painless.