Murphy’s law states that:
Whatever can go wrong, will go wrong.
The law is well known in engineering, and great care is taken to make sure that any big engineering project can sustain a possible failure without major financial, legal, or even human losses.
In computer engineering we tend to underestimate the possible consequences of failure in our software, but as we will depend more and more on software and computers, it will become fundamental to manage failures.
It is possible to write software and be sure that it won’t fail. However, that is too expensive in most scenarios. Still, we can do much better than today’s standards.
Offensive vs Defensive Programming
Offensive programming means following the happy path in the code, ignoring possible sources of error, and testing only that the code does what we expect with the expected input. Offensive programming feels good! We move fast, the software does what we want, tests are green.
Defensive programming, on the other hand, is a more cautious approach. We also think about what may happen outside the happy path, we check most sources of error, we test our code with wrong input, and maybe we programmatically generate test inputs just to make sure that the software can manage everything we throw at it. Defensive programming doesn’t feel as good as offensive programming. Everything seems slower and we routinely find new edge cases in tests. However, for a methodical developer who takes pride in building robust software, it can be interesting as well.
I am not going to argue that one way of developing code is better than the other; it really depends on the task at hand, time constraints, available tools, deployment methodologies, etc.
And indeed, in normal conditions, bugs happen even if we follow a very defensive approach. Moreover, one of the most reliable tools for building robust systems is widely known to be the BEAM VM, which powers languages (Erlang and Elixir) that encourage a very offensive programming approach.
Have a plan B
Robust and reliable software often comes with a plan B for when something goes wrong.
Implementing a plan B is usually quite simple. What takes more experience and knowledge is deciding what needs a plan B and what a reasonable plan B looks like. Consider a few examples.
An HTTP request does not return a successful status code. A reasonable plan B would be to just retry the request. But how often should we retry? Should we wait a little before retrying? The standard answer is to wait using an exponential back-off saturated at some reasonable value, and not to keep retrying forever but to eventually escalate the issue.
The file system is not in the state we expect: maybe a file is missing or, worse, there is an extra file. A reasonable plan B is much harder to suggest here, and it really depends on the application. If we are the only ones responsible for a specific directory, we could just delete the extra file and re-create the missing ones, but what should we write in them? Can we put ourselves in a position where we can always re-create all the needed files?
The database we use is down, but critical data is coming in. Should we just stop accepting data? Or could we store it in local stable storage and flush our buffer when the DB comes back up? How big should our buffer be? Should we let our users know about the issue? How do we escalate the problem quickly?
For whatever reason, one of our dependencies has failed. Should we just retry? Stop everything? Wait a little and try again? Log the problem and escalate the issue?
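The retry policy from the HTTP example above can be sketched in a few lines of Python. This is only a minimal illustration, not a production client: the names `backoff_delays`, `retry_with_backoff` and `RetriesExhausted`, as well as the default base delay, cap and number of attempts, are assumptions made up for this sketch.

```python
import time


class RetriesExhausted(Exception):
    """Raised when every attempt failed and the problem must be escalated."""


def backoff_delays(base=0.5, factor=2.0, cap=30.0, max_attempts=6):
    """Waits between attempts: exponential growth, saturated at `cap` seconds."""
    return [min(cap, base * factor ** i) for i in range(max_attempts - 1)]


def retry_with_backoff(operation, delays, sleep=time.sleep):
    """Call `operation` until it succeeds, waiting between failed attempts.

    `delays` is a list of waits, so there are len(delays) + 1 attempts in total.
    When all of them fail we stop retrying and escalate by raising.
    """
    last_error = None
    for delay in [*delays, None]:  # None marks the final attempt
        try:
            return operation()
        except Exception as error:  # assumption: real code would catch only retryable errors
            last_error = error
            if delay is not None:
                sleep(delay)
    raise RetriesExhausted(f"giving up after {len(delays) + 1} attempts") from last_error
```

A caller would wrap the request in a function and pass it in, for example `retry_with_backoff(fetch_report, backoff_delays())`, where `fetch_report` is a hypothetical function that raises on a non-successful status code. Catching `RetriesExhausted` is the place to escalate: page someone, open a ticket, or fall back to another plan.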
Reliability from the appropriate plan B
The BEAM VM reaches such high levels of reliability thanks to its standard “plan B”: when something breaks, forget about its state, stop it, and restart a fresh version in a known state.
Of course this doesn’t work in 100% of cases, but it covers a surprisingly large share of them. Moreover, since it is the standard behavior, developers immediately notice the cases where this standard behavior is dangerous, and they can test for them and develop more appropriate responses.
Reliability doesn’t come from defensive programming alone; it comes from applying the correct plan B when necessary.
Since it is impossible to have a plan B for every possible source of error, a cheap alternative is to have a standard, general plan B that applies in most cases, and to define a more appropriate solution only when the standard plan B is not good enough or when it would put the application in an inconsistent state.
Usually, a standard and general plan B is to simply log the error and restart the application, maybe after waiting a little.
However, this simple option is not appropriate if a single HTTP request returned an error; in that case you should most likely just retry. Similarly, if restarting the application would leave the state of the whole system inconsistent, it is better to have another alternative in place.
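As a rough illustration of the standard plan B of logging the error, waiting a little, and restarting from a known state, here is a minimal Python sketch. It is only in the spirit of the restart strategy described above and is not how BEAM or OTP supervisors are implemented; the function `supervise` and its parameters `make_worker`, `max_restarts` and `restart_delay` are hypothetical names chosen for this example.

```python
import logging
import time


def supervise(make_worker, max_restarts=3, restart_delay=1.0, sleep=time.sleep):
    """Run a worker; on any crash, log it, wait, and start a fresh one.

    `make_worker` builds a brand new worker in a known state, so nothing
    from the crashed run is reused. Once the restart budget is spent, the
    last error is re-raised so the problem gets escalated to the caller.
    """
    restarts = 0
    while True:
        worker = make_worker()  # fresh state on every start
        try:
            return worker()
        except Exception:
            logging.exception("worker crashed")
            if restarts >= max_restarts:
                raise  # standard plan B did not help: escalate
            restarts += 1
            sleep(restart_delay)
```

The restart budget matters: without it, a worker that crashes immediately on every start would be restarted forever, hiding a problem that should be escalated instead.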
Having a default plan B that can be applied to most of the errors that may happen in our code already provides a great improvement in the reliability of our software.
However, a standard plan B is not enough on its own. Developers need to be aware that such a plan B is in place, and they should make sure it is not triggered when it could break the whole application or when a more sensible option is available.