Ways to Improve Your Development Process - Fault Injection

Manuel will show you to your rooms - if you're lucky.
Aug 23 2021 by James Craig

When I was younger, I assumed that technology just worked. If an application failed, it was because of something that I had done to it. It wasn't until I started working and building software that I realized that we are mostly keeping the world running on duct tape and string. The fact that anything works at all is amazing. Since that realization, I've gone from assuming that every third-party application, resource, and API will work to assuming that they will fail. As such, I try to build software that expects that failure and tells the user about it as gracefully as I can.

This series has thus far been about introducing forms of testing that I find useful but don't see used that often. This post is no different. My assumption that the world of code surrounding my systems is on the verge of crashing and burning has led me to a form of testing called fault injection. It's a very simple concept: when you're testing against a border of your software, such as pulling a file from disk or data from a database, throw an exception and see what your code does. Generally one of three things will happen: everything crashes, your system swallows the failure, or you manage to handle the issue gracefully. In an ideal world the third option would be what happens, but usually it's one of the other two.
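To make that concrete, here is a minimal hand-rolled sketch of the idea with no libraries involved. Every name in it (`ISettingsSource`, `FailingSettingsSource`, `GreetingService`) is hypothetical and invented for illustration. The fake stands in for the border and always throws, and the service shows what the "handle it gracefully" outcome could look like.

```csharp
using System;
using System.IO;

// Hypothetical border: anything that reads settings from outside the process.
public interface ISettingsSource
{
    string Read(string key);
}

// Test double that injects the fault: every read fails as if the disk were gone.
public sealed class FailingSettingsSource : ISettingsSource
{
    public string Read(string key) =>
        throw new IOException("Injected fault: disk unavailable");
}

// Code under test. It catches only the failure it expects, falls back to a
// default, and records a message that can be shown to the user.
public sealed class GreetingService
{
    private readonly ISettingsSource _source;

    public GreetingService(ISettingsSource source) => _source = source;

    public string LastWarning { get; private set; } = "";

    public string GetGreeting()
    {
        try
        {
            return _source.Read("greeting");
        }
        catch (IOException ex)
        {
            LastWarning = "Could not load settings: " + ex.Message;
            return "Hello";
        }
    }
}
```

A unit test would construct `new GreetingService(new FailingSettingsSource())`, call `GetGreeting()`, and assert that it returns the fallback and sets `LastWarning`. If the `catch` block were missing, the same test would surface the "everything crashes" outcome instead.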

Fault Injection

Fault injection is a process for improving the dependability of software and hardware. It dates back to around the 1970s, when it was generally used to find hardware-level faults. Originally it was done by shorting connections on a circuit board and observing what happened. The approach expanded to include everything from hitting the hardware with radiation in the FIST project to electromagnetic fields in the MARS project. These, however, had the issue of being random. Laser Fault Injection, which is where they aimed a freakin' laser at the hardware, on the other hand allowed them to be very precise about where faults were injected.

Eventually people realized that software issues were probably going to be more common, so they decided to try fault injection at that level. Test beds like FERRARI were created to test systems by introducing memory and bus faults through CPU traps. FTAPE introduces CPU, memory, and disk faults through altered drivers and changes to the OS. The list of tools at the system level goes on and on. Organizations such as Netflix, NASA, and Amazon use it to test their distributed systems, with some even doing so in production. In more recent years this has filtered down even to unit testing frameworks and suites, sometimes with surprising results.

Pros

  1. There are a number of different approaches to fault injection testing, as well as a ton of different systems aimed at it, so exact figures on how much each one helps are difficult to come by. On top of that, it completely depends on what you're building. However, in some studies involving DBMSs, there is evidence that anywhere from 15% to 70%+ of the tests miss these types of issues. That means there are bugs in production systems that people have no idea are even there.

  2. With the right set of tools, you don't need to do much of anything to start using this technique, especially at the unit test level.

Cons

  1. Depending on what you're building, you may see no benefit from this technique. 15% to 70% is great, but those were databases they were testing. If you're building a completely self-contained application, you probably won't see much in the way of improvement.

  2. Some of the tools out there go well beyond what the average dev needs. If you're building a basic CRUD app, you probably don't need to be pointing lasers at your test servers to try and introduce faults.

Fault Injection Tools

Once again, I'm primarily a C# dev by day, so I'm going to mention the tool that I have the most experience with, but there are a ton of them out there for any language or platform that you can think of. Personally, the one that I have gotten the most use out of is Simmy. If you want to cause ultimate chaos, it's pretty much the best place to start. I also recommend the Polly project as a nice way to handle faults in .NET. It's better known than Simmy, but both came out of the same project. Simmy breaks and Polly protects.

In order to use Simmy, you just need to create a policy and then use that to wrap a method call:

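    using System;
    using System.Threading.Tasks;
    using Polly.Contrib.Simmy;
    using Polly.Contrib.Simmy.Latency;
    using Xunit;

    [Fact]
    public async Task MyUnitTest()
    {
        // Add 2 seconds of latency to roughly half of the wrapped calls.
        var policy = MonkeyPolicy.InjectLatencyAsync(with =>
            with.Latency(TimeSpan.FromSeconds(2))
                .InjectionRate(0.5)
                .Enabled(true)
        );

        // MyService, Param1, Param2, and MyCancellationToken are placeholders.
        var result = await policy.ExecuteAsync(
            token => MyService.MyMethodCall(Param1, Param2), MyCancellationToken);
    }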

The above code adds 2 seconds of latency to that call about 50% of the time. You can also throw exceptions, return a certain result, or call a method at random intervals. And that's it really. With that you have an easy tool for testing how your code reacts to certain conditions. If you really wanted, you could even leave this in your application code and simply pass false to the Enabled method. Doing so allows you to test API calls, database calls, and other things in a test environment without causing issues for your production systems. In test, simply flip the switch. In production, make sure the switch is off.
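If you want to see the moving parts behind that switch without pulling in a library, a rough sketch might look like the following. To be clear, `FaultInjector` is a hypothetical helper I made up for illustration. It is not Simmy's API and is far simpler than what Simmy does. It takes an injection rate, a delegate that acts as the on/off switch, and a seed so that test runs are repeatable.

```csharp
using System;

// Hypothetical helper, not part of Simmy: throws an injected exception
// instead of calling the real action, at a configurable rate, when enabled.
public sealed class FaultInjector
{
    private readonly double _injectionRate;
    private readonly Func<bool> _enabled;
    private readonly Random _random;

    public FaultInjector(double injectionRate, Func<bool> enabled, int seed)
    {
        if (injectionRate < 0.0 || injectionRate > 1.0)
            throw new ArgumentOutOfRangeException(nameof(injectionRate));
        _injectionRate = injectionRate;
        _enabled = enabled ?? throw new ArgumentNullException(nameof(enabled));
        _random = new Random(seed); // fixed seed keeps test runs repeatable
    }

    public T Execute<T>(Func<T> action)
    {
        // The switch is checked on every call, so it can be flipped at runtime.
        if (_enabled() && _random.NextDouble() < _injectionRate)
            throw new TimeoutException("Injected fault");
        return action();
    }
}
```

In a test you would pass `() => true` for the switch. In production you would wire the delegate to a configuration value that defaults to off, so the wrapped call always goes straight through.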

And with that we have very basic fault injection. As always, this has been a very quick overview, but if you want, you can take a deeper dive into the topic. Or, if you'd prefer, there is the Code With Engineering Playbook description of the subject. Anyway, I'm going to come back to this series and present a couple more types of testing that are easy to add to your tool belt. However, in the near future I want to write down some of my ideas on continuous integration, continuous delivery, and the reasons I'm a big fan of them.

Items in the Series

  1. Unit Testing and Automation
  2. Fuzzing
  3. Property-Based Testing
  4. Mutation Testing
  5. Fault Injection