
Bitpanda’s New Trade Engine: Part #4 - Coordinated Custom Saga Pattern for Atomic Trade Booking

By Vladimír Vráb

The problem of a distributed system orchestrating multiple actions

The more complex the product, the more important it is that the data the underlying system stores ends up complete and consistent. Imagine a business requirement stating: when action A is executed, actions B and C must also be performed. If one of them fails, we definitely don't want to end up with only action A committed to the system. Sound familiar?

At Bitpanda, this has been part of daily reality since I joined more than seven years ago. Back then, achieving such atomicity* was simpler, mainly because most of our features lived in a single monolithic application.

*Atomicity is one of the core properties of transactions in computer systems. It ensures that a series of operations either all succeed together or none happen at all—there’s no in-between state.*

Simpler Times

In 2019, when we designed the Swaps feature, our main concern was ensuring atomic operations. Specifically, if a user swapped Bitcoin for Ethereum, the transaction had to either fully complete (sell Bitcoin and provide Ethereum) or entirely fail, without any partial outcomes. This meant preventing scenarios where a user might lose Bitcoin without receiving Ethereum, or vice versa.

Implementing this in a monolithic application with a single database was relatively simple: by wrapping both operations in a single database transaction, we let the database engine (provided it supports transactions) give us the necessary atomicity guarantees.
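
As a rough illustration, here is a minimal Kotlin sketch of that monolithic approach. The schema, function name, and use of plain JDBC are hypothetical stand-ins rather than our actual implementation; the point is only that a single transaction covers both legs of the swap.

```kotlin
import java.math.BigDecimal
import java.sql.DriverManager

// Hypothetical schema: balances(user_id, asset, amount).
// Both legs of the swap run inside one database transaction, so they
// become visible together or not at all.
fun bookSwapAtomically(jdbcUrl: String, userId: Long, btc: BigDecimal, eth: BigDecimal) {
    DriverManager.getConnection(jdbcUrl).use { conn ->
        conn.autoCommit = false
        try {
            conn.prepareStatement(
                "UPDATE balances SET amount = amount - ? WHERE user_id = ? AND asset = 'BTC'"
            ).use { stmt ->
                stmt.setBigDecimal(1, btc)
                stmt.setLong(2, userId)
                stmt.executeUpdate()
            }
            conn.prepareStatement(
                "UPDATE balances SET amount = amount + ? WHERE user_id = ? AND asset = 'ETH'"
            ).use { stmt ->
                stmt.setBigDecimal(1, eth)
                stmt.setLong(2, userId)
                stmt.executeUpdate()
            }
            conn.commit() // both updates land together...
        } catch (e: Exception) {
            conn.rollback() // ...or neither does
            throw e
        }
    }
}
```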

Complexity Rules

As we explained in Part #1, that approach is no longer viable. We have segmented the trading functionality into several distinct services, each with a specific role:

  • Trade API: Acts as a GraphQL Proxy.
  • Offer Service: Responsible for generating buy and sell offers for assets.
  • Trade Service: Handles the booking of trades.
  • Ledger: Manages the booking of transactions.
  • Swap Service: Orchestrates swap operations.

These services do not communicate via REST APIs; instead, they exchange messages asynchronously through Kafka.
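
To make that interaction style concrete, here is a hedged Kotlin sketch of the consuming side; the topic name, group id, and plain string payloads are illustrative assumptions, not our actual contract.

```kotlin
import java.time.Duration
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer

// Illustrative only: each service runs a loop like this, consuming commands
// and events from Kafka topics instead of serving synchronous REST calls.
fun main() {
    val props = Properties().apply {
        put("bootstrap.servers", "localhost:9092")
        put("group.id", "trade-service")
        put("key.deserializer", StringDeserializer::class.java.name)
        put("value.deserializer", StringDeserializer::class.java.name)
    }
    KafkaConsumer<String, String>(props).use { consumer ->
        consumer.subscribe(listOf("trade-service.commands"))
        while (true) {
            for (record in consumer.poll(Duration.ofMillis(500))) {
                println("received command: ${record.value()}") // hand off to a handler
            }
        }
    }
}
```

The sender, meanwhile, publishes a command and moves on; it learns the outcome later from a result event rather than from a synchronous response.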


Initial implementation

Using our earlier swap example, we will first demonstrate the problem with our initial design and then our solution, illustrating how the strategy evolved and what we gained by refining it.

The sequence diagram below shows our initial implementation plan.

Here's a breakdown of the desired interactions (a rough sketch of the Swap Service's side of this flow follows the list):

  • Trade API: Initiates a swap request (e.g., Bitcoin to Ethereum) with the Swap Service.
  • Swap Service:
    • Generates both a sell offer (for Bitcoin) and a buy offer (for Ethereum) through the Offer Service.
    • Accepts both offers once received.
    • Awaits confirmation from the Trade Service that both trades have been successfully booked.
  • Offer Service: Independently creates the two required offers.
  • Trade Service:
    • Listens to accepted offer events.
    • Requests the Ledger to book the transactions.
    • Upon receiving confirmation from the Ledger, it books the trades.
  • Ledger:
    • Atomically books the transactions.
    • Informs the Trade Service that the transactions have been persisted.
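
As promised above, here is a rough sketch of the Swap Service's side of this flow, seen as a small event-driven state machine. All event and function names are hypothetical; in reality these signals travel as Kafka messages between the services.

```kotlin
// Hypothetical events the Swap Service reacts to during a swap.
sealed interface SwapEvent
data class OffersCreated(val sellOfferId: String, val buyOfferId: String) : SwapEvent
data class TradeBooked(val offerId: String) : SwapEvent

class SwapOrchestrator(
    private val acceptOffer: (String) -> Unit, // publishes an "accept offer" command
    private val completeSwap: () -> Unit       // marks the swap as done
) {
    private var expectedOffers = emptySet<String>()
    private val bookedOffers = mutableSetOf<String>()

    fun on(event: SwapEvent) {
        when (event) {
            is OffersCreated -> {
                // Accept both offers once the Offer Service has created them.
                expectedOffers = setOf(event.sellOfferId, event.buyOfferId)
                expectedOffers.forEach(acceptOffer)
            }
            is TradeBooked -> {
                // The swap completes only when both trades are booked.
                bookedOffers += event.offerId
                if (bookedOffers == expectedOffers) completeSwap()
            }
        }
    }
}
```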

Wait, there is an issue

In a real-world scenario, complications will arise, such as unreliable network connectivity or unforeseen implementation flaws.

Consider this hypothetical situation: the Ledger receives a command from Kafka to sell Bitcoin, immediately followed by a command to buy Ethereum. It books the Bitcoin sale, but a transient network failure then prevents it from connecting to the database, so the Ethereum booking fails.

The user has sold their Bitcoin but doesn't receive Ethereum, and is left clueless as to when their Ethereum will arrive (if it arrives at all!). This scenario also violates a core system constraint: every swap must have both a source and a target trade.

We clearly need to address this. A potential quick improvement involves adding retry functionality to the Swap Service: upon receiving an event indicating a failed Ethereum booking, the system automatically attempts to book the trade again.
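
A minimal sketch of that retry idea, with a bounded attempt count; the event shape, limit, and callbacks are illustrative, and a production version would add backoff and idempotency keys.

```kotlin
// Hypothetical failure event carrying how many attempts have been made.
data class BookingFailed(val tradeId: String, val attempt: Int)

// Illustrative: on a failed booking event, the Swap Service re-publishes the
// booking command until a fixed attempt limit is reached, then escalates.
class BookingRetrier(
    private val maxAttempts: Int,
    private val rebook: (tradeId: String, attempt: Int) -> Unit,
    private val escalate: (tradeId: String) -> Unit
) {
    fun on(event: BookingFailed) {
        if (event.attempt < maxAttempts) {
            rebook(event.tradeId, event.attempt + 1) // try booking the trade again
        } else {
            escalate(event.tradeId) // give up and hand over to a fallback path
        }
    }
}
```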

While retries make this system fairly solid, it still has a few limitations. A major one is that offers have an expiration time: if an offer is not accepted within that window, it expires and can no longer be accepted, so we cannot simply retry forever.

This leaves us in a peculiar situation. We know we can't go forward anymore. So, what if we go backwards? What if we try to reverse the trade?

Excellent. We have successfully reversed the Bitcoin transaction on the Ledger by creating a refund transaction and reverted the trade in the Trade Service. But does this finally resolve our problem? Well, before we answer that, consider these scenarios: 

  • What if the Ledger couldn't revert the Bitcoin transaction? 

  • What if the Ledger is under heavy load, and a circuit breaker is rejecting most of the incoming requests?


Now suppose our attempted revert fails as well. In a previous step we might have relied on "retryability to the rescue"; this time, however, infrastructure disruptions prevent new Ledger instances from being deployed, leaving the service unavailable for 15 minutes.

Even if the Ledger eventually comes back online, processes our request, and successfully reverts the BTC transaction (with the Trade Service reverting the trade), this architecture isn't viable for our operations. When a user sells Bitcoin, it triggers a hedging operation on our end. If we then need to reverse that hedging operation (and buy back Bitcoin) 15 minutes later, there's a non-zero probability that the price will have moved against us, leading to a financial loss, because we must return the same amount the user initially tried to swap.

Similarly, financial losses would occur if we successfully added Ethereum to the user's balance but failed to deduct Bitcoin. The user could spend these funds in the interim, creating an unpleasant situation for both parties. Beyond financial risks, consider the implications if the user exported their trading activity during this period.

This illustrates just some of the potential side effects of such an architecture. Therefore, we require a different, though not necessarily complex, solution.

Our solution

We ensure that all transactions belonging to a trade are persisted completely or not at all, thanks to a coordinated custom "saga" mechanism.

The term “saga” might be confusing. Let’s clarify: it is simply an internal name for our custom solution. In contrast, the common “saga pattern” refers to a strategy in which multiple transactions are processed, and if any fail, compensating transactions are executed to maintain consistency.

Here's how it works: When we get a request to book a transaction, it can include a saga-context header. If it does, we know that this transaction is just one piece of a larger puzzle, and we won't actually book it until we've received all the other related parts.

The saga-context header contains the following attributes (a sketch of how a producer might attach it follows the list):

  • id: A unique ID for the entire saga
  • part: Indicates which part of the saga this transaction belongs to
  • total_parts: How many parts the Ledger expects to receive in total
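
Here is a hedged Kotlin sketch of how a producer might attach that header to the two booking commands of a swap. Only the three attributes above come from the actual design; the topic name, payloads, and JSON encoding of the header value are assumptions made for illustration.

```kotlin
import java.util.Properties
import java.util.UUID
import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer

// Illustrative: both legs of the swap are separate commands sharing one saga
// id, so the Ledger knows to book them together or not at all.
fun main() {
    val props = Properties().apply {
        put("bootstrap.servers", "localhost:9092")
        put("key.serializer", StringSerializer::class.java.name)
        put("value.serializer", StringSerializer::class.java.name)
    }
    val sagaId = UUID.randomUUID().toString()
    val legs = listOf(
        """{"type":"BookTransaction","asset":"BTC","direction":"SELL"}""",
        """{"type":"BookTransaction","asset":"ETH","direction":"BUY"}"""
    )
    KafkaProducer<String, String>(props).use { producer ->
        legs.forEachIndexed { index, payload ->
            val record = ProducerRecord("ledger.commands", sagaId, payload)
            // Header encoding is hypothetical; the attributes mirror the list above.
            val sagaContext =
                """{"id":"$sagaId","part":${index + 1},"total_parts":${legs.size}}"""
            record.headers().add("saga-context", sagaContext.toByteArray())
            producer.send(record)
        }
        producer.flush()
    }
}
```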

Here's a quick look at the Bitcoin to Ethereum swap sequence diagram using our custom saga implementation:

The diagram above confirms the successful execution of the straightforward, happy path. Now consider scenarios where transactions cannot be booked. Without going into every possible negative outcome, the rule is simple: if, for any reason, we are unable to book all parts of a saga, every command in it fails together.
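
The sketch below shows what this coordination could look like in its simplest form: parts are buffered per saga id until every part has arrived, then booked atomically, and sagas that never complete fail as a whole. Class and method names are hypothetical, and a production version would persist buffered parts and handle redelivery.

```kotlin
import java.time.Duration
import java.time.Instant

// Hypothetical in-memory representation of the saga-context and its parts.
data class SagaContext(val id: String, val part: Int, val totalParts: Int)
data class PendingSaga(
    val parts: MutableMap<Int, String> = mutableMapOf(),
    val startedAt: Instant = Instant.now()
)

class SagaCoordinator(
    private val bookAtomically: (List<String>) -> Unit, // e.g. one DB transaction
    private val failAll: (sagaId: String) -> Unit,      // emits failure events
    private val timeout: Duration = Duration.ofSeconds(30)
) {
    private val pending = mutableMapOf<String, PendingSaga>()

    fun onTransaction(ctx: SagaContext, payload: String) {
        val saga = pending.getOrPut(ctx.id) { PendingSaga() }
        saga.parts[ctx.part] = payload // keyed by part number, so redelivery is harmless
        if (saga.parts.size == ctx.totalParts) {
            pending.remove(ctx.id)
            bookAtomically(saga.parts.toSortedMap().values.toList()) // all or nothing
        }
    }

    // Called periodically: incomplete sagas past the timeout fail as a whole.
    fun expireStale(now: Instant = Instant.now()) {
        val stale = pending.filterValues { it.startedAt.plus(timeout) <= now }.keys
        stale.forEach { id ->
            pending.remove(id)
            failAll(id)
        }
    }
}
```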

This straightforward mechanism is generic and adaptable rather than tied to our specific use case. Other teams can easily adopt it by simply appending the saga context to their Kafka commands. A key benefit is that there is no limit to the number of saga parts.

Epilogue

We've explored distributed transactions, a crucial concept for anyone developing complex distributed systems. These transactions are essential when a component receives inputs from various sources at different times, all of which must be processed atomically. While it's true that distributed transactions add to system complexity, a slim implementation with potential saga timeouts can put you on the right path.

It's tempting to attribute these complexities to the decade-long shift from monolithic applications to microservices, and that's a valid point. However, every problem has multiple solutions, and careful consideration is needed when choosing one over another to ensure the benefits outweigh the drawbacks. We've reached a stage where the advantages of our distributed solutions are clear: they scale elegantly, let us build easily on existing solutions, and have significantly improved our overall resiliency.

Vladimír Vráb