Resilience4j: Improve App Stability Using Circuit Breakers

Aman Gupta
Posted on July 05, 2023

Introduction

Let’s assume that an application communicates with other applications and depends on their responses. What happens if these external (aka downstream) services become unresponsive or take too long to respond? This is an important aspect to deal with when implementing an application. Will you let your application wait endlessly for a response? The answer is no, because that would break the functionality of the application itself, which can easily affect your business and revenue.

Let’s take the example of a user-facing social media platform that you have built, with all the functionality such a platform usually has. For now, let’s focus on one feature: an Explore page that shows the user new content on every visit. Whenever a user opens this page, the platform internally calls a recommendation service that responds with the categories of content the user might be interested in. The platform’s backend then fetches the content in those categories.

This is the ideal case, where everything works smoothly. Now, what happens when the recommendation service goes down or takes too long to respond? In this simple implementation, the platform’s backend calls the service every time a user opens the Explore page and waits for as long as the call takes. The user is left hanging on the Explore page and finally ends up seeing an empty page, and at that point the platform starts losing users.

So, your application needs to handle such scenarios. It should not wait an unusually long time for an external service to respond, and it should have fallback actions to take when the external service is unavailable.

This is where a circuit breaker comes into the picture. A circuit breaker caters to all the above-mentioned needs of a system.

Some of you might be thinking: if setting a timeout on calls to external services handles this, why is a circuit breaker needed? The answer is that a circuit breaker is a lot more than a timeout on external calls. With only a timeout, in the worst case your application waits on every call to that service until it times out. A circuit breaker instead stops calling a service that keeps failing and rejects those calls immediately, so your application does not stall on every call until the service starts working correctly again. This behavior is configurable, and we will discuss it in detail in the coming sections.

Here, we will discuss a well-known library, Resilience4j, which provides a circuit breaker as one of its modules.

Resilience4j

  • It's a lightweight and easy-to-use fault-tolerance library.

  • It’s inspired by Netflix Hystrix but is designed for functional programming.

  • In its 1.x releases, it uses only Vavr, which has no external library dependencies of its own; that keeps it very lightweight.

  • It provides higher-order functions (aka decorators) that can be applied to any functional interface, lambda expression, or method reference. This way you can stack one or more of its modules (Circuit Breaker, Rate Limiter, Retry, or Bulkhead) using decorators.
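As a rough illustration of the decorator style described above, the sketch below wraps a call in a circuit breaker and then stacks a retry on top. It assumes the resilience4j-circuitbreaker and resilience4j-retry modules are on the classpath. The class DecoratorSketch, the method fetchRecommendedCategories, and the instance name "recommendationService" are hypothetical stand-ins, not part of the library.

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.retry.Retry;

import java.util.List;
import java.util.function.Supplier;

public class DecoratorSketch {

    // Hypothetical stand-in for the remote call to the recommendation service.
    static List<String> fetchRecommendedCategories() {
        return List.of("travel", "food");
    }

    public static void main(String[] args) {
        // Both instances use the library's default configuration.
        CircuitBreaker breaker = CircuitBreaker.ofDefaults("recommendationService");
        Retry retry = Retry.ofDefaults("recommendationService");

        // Decorate the plain method reference with the circuit breaker...
        Supplier<List<String>> guarded =
                CircuitBreaker.decorateSupplier(breaker, DecoratorSketch::fetchRecommendedCategories);

        // ...and stack a retry around it, so each attempt passes through the breaker.
        Supplier<List<String>> guardedWithRetry = Retry.decorateSupplier(retry, guarded);

        System.out.println(guardedWithRetry.get());
        System.out.println(breaker.getState());
    }
}
```

Nothing is executed until get() is called on the decorated supplier, so the decoration can be set up once and reused for every request.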

In this article we will see how we can use its circuit breaker module.

resilience4j-circuitbreaker

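The circuit breaker is implemented as a finite-state machine with five states: three normal states (CLOSED, OPEN, and HALF_OPEN) and two special states (DISABLED and FORCED_OPEN):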

  • OPEN

  • CLOSED

  • HALF_OPEN

  • DISABLED

  • FORCED_OPEN
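The states can be inspected and, where needed, set by hand. The sketch below is a minimal illustration, assuming the resilience4j-circuitbreaker module is on the classpath; the class name StateSketch and the instance name are made up for this example. It shows that FORCED_OPEN rejects every call and DISABLED permits every call, and that both stay in place until the state is changed manually.

```java
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;

public class StateSketch {

    public static void main(String[] args) {
        CircuitBreaker breaker = CircuitBreaker.ofDefaults("recommendationService");
        System.out.println(breaker.getState()); // a new breaker starts in CLOSED

        // FORCED_OPEN: every call is rejected until the state is changed manually.
        breaker.transitionToForcedOpenState();
        try {
            breaker.executeSupplier(() -> "categories");
        } catch (CallNotPermittedException e) {
            System.out.println("Rejected while " + breaker.getState());
        }

        // Return to normal operation, then switch to DISABLED:
        // every call is permitted and the breaker never trips.
        breaker.transitionToClosedState();
        breaker.transitionToDisabledState();
        System.out.println(breaker.executeSupplier(() -> "categories"));

        // Hand control back to the normal CLOSED/OPEN/HALF_OPEN cycle.
        breaker.transitionToClosedState();
    }
}
```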

The circuit breaker records and aggregates the outcomes of calls using a sliding window, which can be either count-based or time-based.

Count-based: Aggregates the Outcome of the Last N Calls

The count-based sliding window is implemented as a circular array that stores the outcomes of the last N calls. When a new call is recorded, the window incrementally updates the total aggregation. When the oldest outcome is evicted, it is subtracted from the total aggregation (subtract-on-evict).

Time Complexity to get the current snapshot: O(1), since the total aggregation is maintained incrementally and never recomputed from the individual outcomes.

Space Complexity: O(N), where N is the size of the sliding window.
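The count-based scheme can be sketched in plain Java as follows. This is a simplified illustration of the subtract-on-evict idea, not Resilience4j's actual implementation: the class CountBasedWindow is hypothetical and tracks only failures, whereas the library also aggregates slow calls and durations.

```java
/** Simplified count-based sliding window: failure rate over the last N calls. */
class CountBasedWindow {
    private final boolean[] failed; // circular array holding the last N outcomes
    private int head = 0;           // slot to overwrite next (the oldest once full)
    private int recorded = 0;       // number of slots filled so far, at most N
    private int failures = 0;       // running total, updated incrementally

    CountBasedWindow(int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        this.failed = new boolean[size];
    }

    void record(boolean isFailure) {
        if (recorded == failed.length) {
            // Window is full: subtract the evicted (oldest) outcome from the total.
            if (failed[head]) {
                failures--;
            }
        } else {
            recorded++;
        }
        failed[head] = isFailure;
        if (isFailure) {
            failures++;
        }
        head = (head + 1) % failed.length;
    }

    /** Failure rate in percent, read in O(1); -1 when nothing has been recorded. */
    float failureRate() {
        return recorded == 0 ? -1f : 100f * failures / recorded;
    }

    public static void main(String[] args) {
        CountBasedWindow window = new CountBasedWindow(3);
        window.record(true);
        window.record(false);
        window.record(true);
        System.out.println(window.failureRate()); // 2 of the last 3 calls failed
        window.record(false);                     // evicts the oldest failure
        System.out.println(window.failureRate()); // 1 of the last 3 calls failed
    }
}
```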

Time-based: Aggregates the Outcome of the Calls of Last N Seconds

The time-based sliding window is implemented as a circular array of N partial aggregations (i.e., buckets). Each bucket stores the aggregated outcomes of all calls that occur in one particular second. The aggregation for the current second is stored in the head bucket, and the other buckets store the aggregations for the previous seconds.

The sliding window incrementally updates the total aggregation when a new call is recorded. When the oldest bucket is evicted, the partial aggregation of that bucket is subtracted from the total aggregation and the bucket is reset (subtract-on-evict).

Time Complexity to get the current snapshot: O(1), since the total aggregation is maintained incrementally and never recomputed from the buckets.

Space Complexity: O(N), where N is the size of the sliding window in seconds. Memory use does not grow with the number of calls, because the sliding window does not store individual call outcomes; it stores only N partial aggregations (one per bucket) and one total aggregation.
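The bucket scheme can be sketched in plain Java as well. Again, this is a simplified illustration rather than the library's code: the class TimeBasedWindow is hypothetical, it tracks only call and failure counts, and the current time is passed in as an epoch second (assumed not to go backwards) so the behavior is easy to follow.

```java
/** Simplified time-based sliding window: failure rate over the last N seconds. */
class TimeBasedWindow {
    private final int[] calls;    // per-second call counts (circular array of buckets)
    private final int[] failures; // per-second failure counts
    private long headSecond;      // epoch second covered by the head bucket
    private int totalCalls = 0;   // total aggregation across all buckets
    private int totalFailures = 0;

    TimeBasedWindow(int windowSeconds, long startSecond) {
        if (windowSeconds <= 0) {
            throw new IllegalArgumentException("windowSeconds must be positive");
        }
        this.calls = new int[windowSeconds];
        this.failures = new int[windowSeconds];
        this.headSecond = startSecond;
    }

    private int index(long second) {
        return (int) Math.floorMod(second, (long) calls.length);
    }

    /** Advances the head to nowSecond, evicting every bucket that falls out of the window. */
    private void moveTo(long nowSecond) {
        long steps = Math.min(nowSecond - headSecond, (long) calls.length);
        for (long s = 0; s < steps; s++) {
            headSecond++;
            int i = index(headSecond);
            // Subtract-on-evict: remove the old bucket's partial totals, then reset it.
            totalCalls -= calls[i];
            totalFailures -= failures[i];
            calls[i] = 0;
            failures[i] = 0;
        }
        if (nowSecond > headSecond) {
            headSecond = nowSecond; // gap longer than the window: all buckets are already empty
        }
    }

    void record(long nowSecond, boolean isFailure) {
        moveTo(nowSecond);
        int head = index(headSecond);
        calls[head]++;
        totalCalls++;
        if (isFailure) {
            failures[head]++;
            totalFailures++;
        }
    }

    /** Failure rate in percent at nowSecond; -1 when the window holds no calls. */
    float failureRate(long nowSecond) {
        moveTo(nowSecond);
        return totalCalls == 0 ? -1f : 100f * totalFailures / totalCalls;
    }

    public static void main(String[] args) {
        TimeBasedWindow window = new TimeBasedWindow(3, 10);
        window.record(10, true);
        window.record(11, false);
        window.record(12, true);
        System.out.println(window.failureRate(12)); // seconds 10..12: 2 of 3 calls failed
        window.record(13, false);                   // the bucket for second 10 is evicted
        System.out.println(window.failureRate(13)); // seconds 11..13: 1 of 3 calls failed
    }
}
```

Advancing the head touches at most N buckets no matter how much time has passed, and reading the failure rate only divides two running totals.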

States and Thresholds: Circuit Breaker at Work

The circuit breaker starts in the CLOSED state, which means that all calls to the external service execute normally.

The state changes from CLOSED to OPEN when either of the following happens:

  • the failure rate is equal to or greater than the configured threshold (i.e., failureRateThreshold)

  • the slow call rate is equal to or greater than the configured threshold (i.e., slowCallRateThreshold)

Failure Calls

By default, every call to an external service that ends in an exception is counted as a failure. We can instead define the list of exceptions that should be counted as failures using the recordExceptions config property. All other exceptions are then counted as successes, unless they are ignored.

We can also ignore exceptions, so that they are counted as neither failures nor successes, using the ignoreExceptions config property.
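A configuration along these lines might look like the sketch below. The choice of exception classes is only an assumption for illustration (network-style failures recorded, caller mistakes ignored), and the class name FailureConfigSketch is hypothetical.

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

import java.io.IOException;
import java.util.concurrent.TimeoutException;

public class FailureConfigSketch {

    public static void main(String[] args) {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                // Only these exceptions (and their subclasses) count as failures.
                .recordExceptions(IOException.class, TimeoutException.class)
                // These count as neither a failure nor a success.
                .ignoreExceptions(IllegalArgumentException.class)
                .build();

        CircuitBreaker breaker = CircuitBreaker.of("recommendationService", config);
        System.out.println(breaker.getName() + " starts in " + breaker.getState());
    }
}
```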

Slow Calls

Calls are counted as slow if they take longer than a configured duration to respond. This duration is set using the slowCallDurationThreshold config property.

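NOTE: The failure rate and the slow call rate can only be calculated once a minimum number of calls has been recorded. This minimum is configured using the minimumNumberOfCalls property. For example, if minimumNumberOfCalls is 10, the circuit breaker will not open after only 9 recorded calls, even if all 9 failed.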
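Putting the thresholds together, a configuration could be sketched as follows. The numeric values are arbitrary examples chosen for illustration, not recommendations, and the class name ThresholdConfigSketch is hypothetical.

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig.SlidingWindowType;

import java.time.Duration;

public class ThresholdConfigSketch {

    public static void main(String[] args) {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                // Aggregate the outcomes of the last 20 calls.
                .slidingWindowType(SlidingWindowType.COUNT_BASED)
                .slidingWindowSize(20)
                // Do not calculate any rate before 10 calls have been recorded.
                .minimumNumberOfCalls(10)
                // Open when at least 50% of the recorded calls failed...
                .failureRateThreshold(50)
                // ...or when at least 80% of them took longer than 2 seconds.
                .slowCallRateThreshold(80)
                .slowCallDurationThreshold(Duration.ofSeconds(2))
                .build();

        CircuitBreaker breaker = CircuitBreaker.of("recommendationService", config);
        System.out.println(breaker.getCircuitBreakerConfig().getFailureRateThreshold());
    }
}
```

Switching slidingWindowType to TIME_BASED makes slidingWindowSize a number of seconds instead of a number of calls.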

Once the circuit switches to the OPEN state, every call is rejected with a CallNotPermittedException for a configured wait duration. This pause gives the external service time to recover. The wait duration is configured via the waitDurationInOpenState property.

Once this duration has elapsed, the circuit transitions to the HALF_OPEN state and permits a configured number of calls to check whether the failure rate and the slow call rate are back to tolerable values. Until these permitted calls have completed, all other calls are still rejected with a CallNotPermittedException. If both rates are below their thresholds, the circuit switches to the CLOSED state; otherwise it changes back to OPEN. The number of permitted calls is configured via permittedNumberOfCallsInHalfOpenState.
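The whole cycle, applied to the Explore page example, might be sketched like this. The class ExploreFallbackSketch, the fallback list, and the two suppliers standing in for a broken and a recovered recommendation service are hypothetical. To keep the example short, the wait in the OPEN state is simulated by calling transitionToHalfOpenState() manually instead of sleeping for waitDurationInOpenState.

```java
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

import java.time.Duration;
import java.util.List;
import java.util.function.Supplier;

public class ExploreFallbackSketch {

    // Hypothetical fallback shown when recommendations are unavailable.
    static final List<String> DEFAULT_CATEGORIES = List.of("trending");

    // Runs the guarded call and falls back on a rejection or a failure.
    static List<String> categoriesOrFallback(Supplier<List<String>> guarded) {
        try {
            return guarded.get();
        } catch (CallNotPermittedException e) {
            return DEFAULT_CATEGORIES; // rejected by the breaker; the service was not called
        } catch (RuntimeException e) {
            return DEFAULT_CATEGORIES; // the call itself failed
        }
    }

    public static void main(String[] args) {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .slidingWindowSize(4)
                .minimumNumberOfCalls(4)
                .failureRateThreshold(50)
                .waitDurationInOpenState(Duration.ofSeconds(30))
                .permittedNumberOfCallsInHalfOpenState(2)
                .build();
        CircuitBreaker breaker = CircuitBreaker.of("recommendationService", config);

        // Stand-in for the recommendation service while it is down.
        Supplier<List<String>> downService = () -> {
            throw new IllegalStateException("recommendation service is down");
        };
        Supplier<List<String>> guardedDown = breaker.decorateSupplier(downService);

        // Four failures fill the window; the failure rate reaches the threshold.
        for (int i = 0; i < 4; i++) {
            categoriesOrFallback(guardedDown);
        }
        System.out.println(breaker.getState()); // should now be OPEN

        // While OPEN, the call is rejected immediately and the fallback is used.
        System.out.println(categoriesOrFallback(guardedDown));

        // Simulate waitDurationInOpenState elapsing.
        breaker.transitionToHalfOpenState();

        // Stand-in for the recovered service: two permitted trial calls succeed.
        Supplier<List<String>> healthyService = () -> List.of("travel", "food");
        Supplier<List<String>> guardedHealthy = breaker.decorateSupplier(healthyService);
        for (int i = 0; i < 2; i++) {
            categoriesOrFallback(guardedHealthy);
        }
        System.out.println(breaker.getState()); // should be CLOSED again
    }
}
```

With the fallback in place, the Explore page can still show default content while the breaker is open, instead of leaving the user waiting on an empty page.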

Conclusion

Resilience4j's circuit breaker module can improve the stability and reliability of applications that rely on external services. By failing fast when a downstream service is unresponsive or slow and switching to a fallback, a circuit breaker keeps one failing dependency from degrading the whole application and the user experience. With its lightweight design and its fault-tolerance modules, Resilience4j is an effective option for building resilient applications. Use it to make your application more robust and to keep delivering service when a dependency fails.