Handling failures in background workers with Elixir and supervisors

Patryk Bąk
Jul 7, 2020

Elixir is built on top of the Erlang virtual machine. It allows us to write highly available systems that can run practically forever. Does that mean that we don’t have to do anything to make our systems reliable?

In our system, we have a worker that pays drivers for their work.
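A minimal sketch of such a worker, assuming it is a GenServer that schedules itself with `Process.send_after/3`. The `@interval` attribute and the one-second cadence come from the article; the `:job` option is a hypothetical stand-in so the snippet runs on its own. In the article's system the job would be a call to `Payment.pay_the_driver/1`.

```elixir
defmodule Worker do
  use GenServer

  # One second between runs, as described in the article.
  @interval 1_000

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(opts) do
    # Schedule the first run; nothing happens until @interval has passed.
    schedule_work()
    # :job is a hypothetical option: a zero-arity function that does the work.
    {:ok, %{job: Keyword.fetch!(opts, :job)}}
  end

  @impl true
  def handle_info(:work, state) do
    # If the job raises, this process crashes and its supervisor restarts it.
    state.job.()
    schedule_work()
    {:noreply, state}
  end

  defp schedule_work do
    Process.send_after(self(), :work, @interval)
  end
end
```

Because the first message is scheduled in `init/1`, a restarted worker runs its job again about one second after each restart.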

Let’s take a closer look at our Payment.pay_the_driver/1 function to see what it does.

The system compares the money already paid with the amount the driver should receive. This guarantees that a driver won’t receive more than they should.
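The check described above could look roughly like this. Only the names `Payment.pay_the_driver/1` and `verify_payment/1` come from the article; the driver map with `:earned` and `:paid` fields and the private `transfer_money/1` helper are assumptions made for illustration.

```elixir
defmodule Payment do
  # Hypothetical driver shape: %{id: 1, earned: 100, paid: 40} (amounts in cents).
  def pay_the_driver(driver) do
    driver
    |> verify_payment()
    |> transfer_money()
  end

  # Raises when the driver has already received more than they earned,
  # which would point to a bug somewhere in the payment logic.
  def verify_payment(%{earned: earned, paid: paid} = driver) do
    if paid > earned do
      raise ArgumentError,
            "driver #{driver.id} was paid #{paid} but earned only #{earned}"
    end

    driver
  end

  # Hypothetical helper: a real system would call a payment provider here.
  defp transfer_money(%{earned: earned, paid: paid} = driver) do
    amount_due = earned - paid
    %{driver | paid: paid + amount_due}
  end
end
```

The important property is that `verify_payment/1` raises instead of returning an error value, so a failed check crashes the calling process.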

Unfortunately, developers make mistakes, and there’s a chance that incorrect code is released. Luckily, the verify_payment/1 function prevents incorrect payments. But what happens to our application if such a scenario occurs and the function raises an error?

To understand the consequences, let’s see how the supervision tree works.

In our application's supervision tree, the main supervisor watches a single child process: the Worker module.

When starting a supervisor, you can configure what happens when one of its children crashes. By default, a supervisor restarts a crashed child, but shuts itself down if more than 3 restarts occur within 5 seconds.

Our supervisor is configured with these default values:
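A sketch of such a configuration, assuming a module-based supervisor. `max_restarts: 3` and `max_seconds: 5` are the `Supervisor` defaults and are spelled out here only for clarity. The name `MainSupervisor` is hypothetical, and the `Worker` below is a bare stand-in so the snippet runs on its own.

```elixir
# Stand-in for the real worker, so this snippet is self-contained.
defmodule Worker do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(opts), do: {:ok, opts}
end

defmodule MainSupervisor do
  use Supervisor

  def start_link(opts \\ []) do
    Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    children = [Worker]

    # These two limits are the defaults: more than 3 restarts within
    # 5 seconds makes the supervisor itself shut down.
    Supervisor.init(children, strategy: :one_for_one, max_restarts: 3, max_seconds: 5)
  end
end
```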

It means that each time the verify_payment/1 function raises an error, our worker will be restarted.

If it happens more than 3 times in 5 seconds, the main supervisor will also crash. As we can see, our worker handles a message every second. If the logic within it raises an error on every run, the worker is restarted roughly once per second, exceeds 3 restarts within 5 seconds, and consequently our application crashes.

So what now?

Even if there’s a problem with that part of the code, we still want the rest of the application to be up while we’re investigating the issue.

That’s why we can add a separate supervisor just for our worker:

The supervisor needs to know the worker interval, so we have to replace the @interval attribute in the worker with a public interval/0 function.

Now our supervisor will crash only if the child is restarted more than max_restarts times within 5 seconds.

The worker executes its function every second, so this failure can trigger at most about 5 restarts in 5 seconds. Setting the limit to 6 restarts in 5 seconds means the worker supervisor will not crash because of it.

It’s time to modify the main supervisor to look after the worker supervisor instead of the worker itself.
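Putting the last few steps together, the resulting tree might look like the sketch below. The public `interval/0` function and the limit of 6 restarts in 5 seconds come from the article; the module names `WorkerSupervisor` and `MainSupervisor`, the `max_restarts/0` helper, and the formula used to derive the limit are assumptions.

```elixir
defmodule Worker do
  use GenServer

  # Public function instead of the @interval attribute,
  # so the supervisor can read the value.
  def interval, do: 1_000

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(opts) do
    Process.send_after(self(), :work, interval())
    {:ok, opts}
  end

  @impl true
  def handle_info(:work, state) do
    # The payment logic would run here; a raise crashes only this process.
    Process.send_after(self(), :work, interval())
    {:noreply, state}
  end
end

defmodule WorkerSupervisor do
  use Supervisor

  @max_seconds 5

  def start_link(opts), do: Supervisor.start_link(__MODULE__, opts, name: __MODULE__)

  # 5 seconds at one run per second allows 5 crashes; add 1 as headroom.
  def max_restarts, do: div(@max_seconds * 1_000, Worker.interval()) + 1

  @impl true
  def init(_opts) do
    Supervisor.init([Worker],
      strategy: :one_for_one,
      max_restarts: max_restarts(),
      max_seconds: @max_seconds
    )
  end
end

defmodule MainSupervisor do
  use Supervisor

  def start_link(opts \\ []), do: Supervisor.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    # The main supervisor now looks after the worker supervisor, not the worker.
    Supervisor.init([WorkerSupervisor], strategy: :one_for_one)
  end
end
```

Deriving the limit from `Worker.interval/0` keeps the two values in sync: if the interval changes later, the restart limit follows without a second edit.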

Summing up

It’s sometimes hard to avoid temporary failures. You have to make sure that you have a plan for what happens if some parts of the system stop working correctly.

Originally published at https://appunite.com.
