
Bitpanda's new trade engine - Part #1

By Vladimír Vráb

As of March 2023, Bitpanda has processed around 100 million trades, and the trade engine plays a unique and crucial role in our ecosystem. Without it, we’re like a vehicle without an engine, or a magician without a wand. The past years have shown us that the system we’ve been building has its limits and unfortunately cannot support us on our roadmap to grow and expand further.

So we came to the conclusion that the increasing demand for our products meant we needed to build a new trade engine from scratch. With this decision in place, the new trade engine team was established with its own mission & vision: “build a new system that can handle 10 times more customers.”

I decided to share how we have been progressing. I will split it into multiple parts, as it is not possible to describe both the old and the new trade engine in a single post that stays short enough to read over your morning coffee.

In this particular article, we start by looking into the current state of the old trade engine and its technical setup. Then we will spend a little bit of time examining its biggest flaws to understand the necessity of building a new system. Afterwards, we jump right into the architecture of the new trade engine and look more closely at how we organised the services. Last but not least, we will also spend some time looking into the various flows we had to solve in an asynchronous world of programming.

History (old trade engine)

Before we dive deep into the problems with the old trade engine, we have to define what the old trade engine is.

It is:

  • part of Bitpanda’s big monolithic API
  • a REST API served by the nginx HTTP server
  • a PHP Laravel application with 345,000 lines of code
  • a system supporting only synchronous flows

Classic, right? Many projects or start-ups begin with one big monolithic API to reduce time-to-market and simplify the tech stack. We weren’t any different. Facing ever-increasing demand from customers for new functionality, we tried to deliver fast by implementing all new features in one public-facing application.

It worked for years, and as the client base and the load on the system grew, our simple solution was to horizontally scale our application and vertically scale our primary database.

In the fintech industry, when, for example, Elon Musk tweets something positive about Dogecoin, people will try to take advantage of it and get their coins as soon as possible. Not only does this drastically increase the price of the asset, it also puts tremendous pressure on the underlying system. This happens within minutes, if not seconds, and if your system is not elastic enough, your worst nightmare becomes reality: severe performance issues.

This, unfortunately, was our scenario, and it was not a one-time event. During 2021, when many assets reached their all-time highs, our systems didn’t perform as expected and couldn’t keep up with the traffic.

There were many things that went wrong, and many reasons why the system became unresponsive:

  • Unoptimised queries:
    • Queries without an index
    • Queries unnecessarily selecting all columns
    • Queries not leveraging covering indices
  • One database for everything, with long-running transactions blocking other users
  • No back-pressure support
  • High coupling and low cohesion
  • The whole trading business logic in one place
    • It is also worth mentioning that patching a monolithic service with a huge number of features is far more complex than patching small microservices, where you can run regression tests within minutes
  • Ever-increasing tech debt due to lack of ownership of modules/components
  • Lack of load tests

We can look at a visualisation of all the dependencies that, for example, offer creation needs. It doesn’t take a senior software engineer to see that too many services are involved:

  • MySQL primary SQL database
  • Redis NoSQL database
  • LaunchDarkly feature flag service
  • AWS Simple Queue Service
  • External services (such as a timing service)


We can look at the same request from an APM (application performance monitoring) perspective and immediately wonder why there are 64 SQL (MySQL) statements and 23 NoSQL (Redis) calls.

I believe we could dedicate a separate article to the number of problems we had with our select statements. We were all under the impression that we didn’t need to optimise any query we wrote. “Why?”, you may wonder. I guess we all thought that if our select statement response times increased, we could just vertically scale our primary database. Wrong. The major mindset switch happened in one of the incident calls, when someone from DevOps said: “We can’t get larger instances anymore”.

In the following blurred screenshot, you can see one of our “unoptimised queries”, which selects more than 200 columns and joins more than 10 tables.

And finally, latency. A P99 of ~650ms doesn’t look that bad, but if the only responsibility of a request is to calculate and return an offer for an asset, one might say that the latency is too high. We would feel more comfortable with a latency averaging around 150ms.

After being firefighters for too long and participating in too many incidents, we said: “We can’t maintain this anymore. We need to start from scratch. We need to leverage event-driven architecture. We need to migrate away from MySQL. We need resilience. We need support for back-pressure. We want to be isolated from other systems as much as possible. We need a new architecture to power trades on our growing platform.”

Present (new trade engine)

We put the old system into a state in which it can live for the next year or two and immediately started working on the new trade engine.

We knew the system was complex, supported many different features, and that the migration would take us at least a year, if not two. Therefore, we decided to start with the most basic feature with probably the highest impact: basic broker cryptocurrency trading.

We agreed to carve out the current trading logic (creating an offer, accepting an offer, and converting it into a trade) into the new system.

Let’s look more closely at the new system.

System

The new system was built on asynchronous trading leveraging Kafka, a distributed event streaming platform built for large enterprise applications. Its main entry point is a brand new trade API, handling all mutations and queries via GraphQL. Alongside this, we built a new service called Offer service, which allows any external or internal application to create offers for a particular user and a particular asset. In addition to the separate microservice for offer creation, we also created a separate microservice for trade book-keeping, giving it the name trade service. And last but not least there is Ledger, a Java application that gives us an append-only ledger to book the transactions.
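To make the entry point a bit more concrete, here is a minimal sketch of the kind of GraphQL mutations the trade API could expose for this flow. This is purely illustrative: the type, field, and argument names below are assumptions, not our actual schema.

    // Hypothetical GraphQL schema fragment for the trade API entry point,
    // embedded as a Java text block. Names are illustrative, not the real API.
    public final class TradeApiSchema {

        public static final String SDL = """
            type Mutation {
              createOffer(userId: ID!, assetId: ID!, fiatAmount: String!): Offer!
              acceptOffer(offerId: ID!): Trade!
            }

            type Offer { id: ID! assetId: ID! price: String! expiresAt: String! }
            type Trade { id: ID! offerId: ID! status: String! }
            """;

        private TradeApiSchema() {}
    }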

Creating an offer

Offer creation is a relatively straightforward process. The trade API gets a GraphQL mutation with all required fields, transforms it into a Kafka command, and produces a message to the offer-create topic. The Offer service consumes all commands from this topic and must reply with an event on the offer-created topic.
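As a rough sketch, the producing side of this step could look something like the snippet below. The topic name is the one described above; the class, field names, and JSON payload are assumptions for illustration, and the serialiser settings are expected to come in with the supplied Kafka configuration.

    import java.util.Properties;
    import java.util.UUID;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // Hypothetical sketch: the trade API turns a GraphQL mutation into a Kafka command.
    public class OfferCommandProducer {

        private final KafkaProducer<String, String> producer;

        public OfferCommandProducer(Properties kafkaConfig) {
            // kafkaConfig must contain the bootstrap servers and String serialisers.
            this.producer = new KafkaProducer<>(kafkaConfig);
        }

        // Called by the GraphQL mutation resolver with the validated input fields.
        public String sendCreateOfferCommand(String userId, String assetId, String amount) {
            String commandId = UUID.randomUUID().toString(); // correlates command and reply event
            String payload = String.format(
                "{\"commandId\":\"%s\",\"userId\":\"%s\",\"assetId\":\"%s\",\"amount\":\"%s\"}",
                commandId, userId, assetId, amount);
            // Keying by user keeps one user's commands on the same partition (an assumption).
            producer.send(new ProducerRecord<>("offer-create", userId, payload));
            return commandId; // the API now waits for the matching event on offer-created
        }
    }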

A successful path would look like this.

Handling an unsuccessful path is also straightforward. If anything goes wrong in the Offer service (backoff, validation error, unhandled exception), it replies with a failed event, again to the offer-created topic.

It is worth mentioning that there is “a contract” between the trade API and the Offer service: for every command (request) there will be an event (response).

Here’s an example:
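Sketched as plain data types, a matching command/event pair could be modelled along these lines. The field names are assumptions for illustration, not the actual message schema; the point is that the commandId travels from command to event, so every request can be matched with exactly one response.

    // Hypothetical command/event pair illustrating the request/response contract.
    record OfferCreateCommand(
            String commandId,   // generated by the trade API, echoed back in the event
            String userId,
            String assetId,
            String amount) {}

    record OfferCreatedEvent(
            String commandId,     // same id as in the command
            String offerId,       // set on success
            boolean failed,       // true when validation or processing failed
            String failureReason  // e.g. "OFFER_VALIDATION_FAILED"; null on success
    ) {}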

 

Another case we have to handle is that the Offer service may be temporarily unavailable and simply cannot reply to the offer-create command in a reasonable amount of time. “What to do in such a situation?”, you may ask. Well, we solved it by specifying a 5-second wait timeout in the trade API. If we don’t get a reply, we just tell the client to try again later.
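A minimal sketch of how that wait could be implemented on the trade API side, assuming one pending CompletableFuture per in-flight command and reusing the OfferCreatedEvent record from the sketch above (the class and method names are made up for illustration; the 5-second timeout is the one mentioned above):

    import java.util.Map;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.TimeUnit;

    // Hypothetical reply correlation inside the trade API.
    public class OfferReplyWaiter {

        // One pending future per commandId, completed when the matching event arrives.
        private final Map<String, CompletableFuture<OfferCreatedEvent>> pending = new ConcurrentHashMap<>();

        public CompletableFuture<OfferCreatedEvent> awaitReply(String commandId) {
            CompletableFuture<OfferCreatedEvent> future = new CompletableFuture<>();
            pending.put(commandId, future);
            // If no event arrives within 5 seconds, fail the request and ask the client to retry later.
            return future.orTimeout(5, TimeUnit.SECONDS)
                         .whenComplete((event, error) -> pending.remove(commandId));
        }

        // Called by the Kafka consumer loop for every record read from offer-created.
        public void onOfferCreatedEvent(OfferCreatedEvent event) {
            CompletableFuture<OfferCreatedEvent> future = pending.get(event.commandId());
            if (future != null) {
                future.complete(event);
            }
        }
    }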

Accepting an offer

Accepting an offer and converting it into a valid trade is a more complex process. It doesn’t involve just the trade API and the Offer service; Ledger and the trade service also step in. They all have their purpose, and they are all critical parts of the whole flow.

Handling an unhappy path is another complex process, because the trade API must consume events from two separate topics: offer-created and trade-created. The reason is that the Offer service may reject the acceptance of an offer for various reasons (e.g. an expired offer) and emit a failed offer-accepted event.

If an offer is successfully accepted but Ledger refuses to book the transactions (e.g. due to insufficient funds), we must emit a failed trade-created event.
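As a rough sketch of the consuming side of this flow (the topic names are the ones described above; the handler methods and event handling are assumptions), the trade API could subscribe to both reply topics like this:

    import java.time.Duration;
    import java.util.List;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    // Hypothetical trade API loop consuming replies for the accept-offer flow.
    public class AcceptOfferReplyConsumer {

        private final KafkaConsumer<String, String> consumer;

        public AcceptOfferReplyConsumer(KafkaConsumer<String, String> consumer) {
            this.consumer = consumer;
            // The accept flow can fail in either service, so both reply topics are needed.
            consumer.subscribe(List.of("offer-created", "trade-created"));
        }

        public void pollOnce() {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                switch (record.topic()) {
                    case "offer-created" ->
                        // e.g. the Offer service rejected the acceptance because the offer expired
                        handleOfferAcceptReply(record.value());
                    case "trade-created" ->
                        // e.g. Ledger refused to book the transactions due to insufficient funds
                        handleTradeCreatedReply(record.value());
                    default -> { /* ignore anything else */ }
                }
            }
        }

        private void handleOfferAcceptReply(String payload) { /* complete or fail the pending request */ }
        private void handleTradeCreatedReply(String payload) { /* complete or fail the pending request */ }
    }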

Switching the mindset of the majority of engineers to an asynchronous way of programming, building well-written, separated, and autonomous microservices that communicate over a message bus, is a challenging task. It requires a certain skill set, dedication, and most probably a lot of time.

Epilogue

What we have seen so far was just a glimpse of the current state of our new trade engine. When we finished the first Epic and brought this simple flow to production, we patted ourselves on the back and thought that we could easily continue migrating features from the old trade engine to the new one. But things rarely go that smoothly.

The more features we migrated, the more problems we had to solve, as most of the engineers in the team were newcomers to this paradigm. For example, we found out that changing a Kafka topic schema is not that easy, achieving transactionality between multiple microservices is no walk in the park, and switching from an SQL to a NoSQL database brings disadvantages along with its advantages. And with an increasing number of microservices, we are also increasing the amount of maintenance we have to do.

It is also worth mentioning that we had to be very stubborn about our mission, but also very flexible in achieving it. We always had to communicate to stakeholders and upper management the necessity of building the new trade engine, while also trying to improve and extend the product to make sure the business could continue to progress, especially in these challenging times.

But for every member of the team, the lessons we have learned are very valuable, and I personally believe there are not many companies like Bitpanda. Things move at such a high pace that it never gets boring; we always get to solve new problems using the latest technology, and we take great joy in doing that.

Stay tuned for the next update from Bitpanda’s new trade engine blog.
