Jon Pincus
Microsoft Corporation
June 2000
Summary: Discusses error-handling paradigms and the pitfalls associated with multiple paradigms, and provides two simple but vital guidelines for dealing with error cases. (10 printed pages)
Introduction
Competent programmers can write code that works as long as nothing unusual happens. One of the skills that makes a programmer great is the ability to write code that continues to work when something goes wrong and "unexpected events" occur. However, the term unexpected events gives the wrong impression. If your code is embedded in a widely distributed and successful product, you should expect all kinds of unusual (and horrible) things to happen to it. Machines will run out of memory, files that you expect to be there will not be, functions that never fail may have failure cases in new versions of the operating system, and so on. If you want your code to continue to work reliably, you need to anticipate all of these events.
The specific category of unusual events covered in this article is error cases. There is no precise definition of an error case. Intuitively, it means that something that normally succeeds does not. Memory allocation failures are a classic example. Most of the time, there is plenty of memory to go around, but occasionally a machine runs out, perhaps because it is particularly busy running a huge spreadsheet calculation or, in an Internet world, because somebody is launching a denial-of-service attack on it. If you are writing any kind of significant component, service, or application, you need to anticipate this type of situation.
The coding examples used in this paper are C/C++ based, but the general principles are independent of the programming language, so keep them in mind even when you are not working on C/C++ code. For example, the winner of PC Week's hacking contest compromised an Apache server (hackpcweek.com) in part by taking advantage of a failure to check a return code in some Perl CGI code. For more in-depth information, read the article at http://www.zdnet.com/eweek/stories/general/0,11011,2350744,00.html.
Any significant piece of software will interact with other layers and components. The first step in writing code that deals with errors is to understand what the rest of the system is doing when errors occur. The remainder of this paper will discuss error-handling paradigms, and then deal with the pitfalls associated with multiple paradigms. Two simple but vital guidelines for dealing with error cases are also included.
Error-Handling Paradigms
Whenever an error case occurs, there are three separate issues to consider: detecting it, reporting it, and responding to it. Usually, the responsibility for dealing with these issues is spread throughout different components or layers of the code. For instance, it is the job of the operating system to detect that the system is out of memory. The job of the memory allocation function is to report this back to its caller. For example, VirtualAlloc returns NULL, the Microsoft Foundation Class Library (MFC) operator new throws CMemoryException *, and HeapAlloc may return NULL or raise a Structured Exception. The caller's job is to respond to this by cleaning up after itself, catching the exception, and perhaps reporting the failure to its caller or the user.
Because different layers and components need to cooperate to deal with these errors, the first step is to define a vocabulary. That is, conventions that the components will share. Unfortunately, there is no single, well-defined convention in C, C++, or any other language. Rather, there are many conventions, each with its own strengths and weaknesses.
Here are some of the most common conventions:
Returning a BOOL to indicate success or failure. Many calls in the Windows API and MFC return TRUE to indicate success, and FALSE to indicate failure. This is a nice, simple approach. The problem is that it does not explain failures or distinguish between different types of successes or failures. (When using the Windows API, you can use GetLastError to get the specific error code.)
Returning a status. Unfortunately, there are two different styles of this convention floating around the Windows API. COM functions return an HRESULT: an HRESULT >= 0, such as S_OK, is a success, and an HRESULT < 0, such as E_FAIL, is a failure. Other functions, such as the registry functions, return an error code defined in WINERROR.H. In this convention, ERROR_SUCCESS (0) is the only success value, and all other values are various forms of failure. This inconsistency opens the door for all kinds of confusion. To make matters worse, a value of 0, which would indicate a failure in the Boolean-style return value convention, indicates success here. The consequences of this are discussed later. In any event, the status return approach has the advantage that different values can indicate different types of failure, or with HRESULTs, success.
Returning a NULL pointer. C-style memory allocation functions, such as HeapAlloc, VirtualAlloc, GlobalAlloc, and malloc, return a NULL pointer on failure. The C standard library's fopen function is another example. As with BOOL returns, some other mechanism is needed to distinguish between different kinds of failures.
Returning an "impossible value." This is a generalization of the NULL-pointer convention: find some value that the routine could not otherwise legitimately return, and use it to indicate an error. Often the value is 0 (for a function that returns a length, like GetWindowsDirectory) or –1 (like the EOF value returned by the C standard library's fgetc function). This approach becomes somewhat problematic when it is extended to a large API because there is usually no central pattern.
Throwing a C++ exception. For pure C++ programs, this convention allows you to use language features to good advantage. Bobby Schmidt's recent Deep C++ series on exceptions covers these issues in great detail. However, combining this approach with either legacy C code or with COM can cause problems. It is illegal for a COM method to throw a C++ exception. In addition, C++ exceptions are a relatively expensive mechanism. If the operation itself is cheap, the costs can often outweigh the benefits.
Raising a Structured Exception. Considerations here are the opposite of C++ exceptions. This approach works very cleanly for C code, but does not interact well with C++. Once again, this approach does not combine efficiently with COM.
Note If you are picking up an older code base, you will sometimes see "home-grown exception handling mechanisms." C++ compilers only started handling exceptions well relatively recently. In the past, developers often built their own mechanisms on top of Windows Structured Exception Handling (SEH) or the C library's setjmp/longjmp mechanism. If you have inherited one of these code bases, you're on your own, and rewriting it might be your best bet. Otherwise, this is a procedure best left to very experienced programmers.
Error handling is a key part of any API definition. This is equally true whether you are designing an API or using an API that somebody else designed. The paradigms for error behavior are as important to the API definition as the class definition or naming scheme. For example, the MFC API makes it very explicit which functions throw which exceptions when resource allocation fails, and which functions return a BOOL success/failure indication. The designers of the API clearly put some thought into this and users need to understand the intent and play by the rules that have been constructed.
If you are using an existing API, you will have to deal with whatever conventions exist. If you are designing an API, you should pick conventions that fit in well with the APIs that you are already using.
If you are using multiple APIs, you usually cannot avoid having multiple conventions. Some combinations of conventions work fairly well together because they are hard to confuse. However, some combinations of conventions are error prone. Many of the specific pitfalls discussed in this paper are due to these inconsistencies.
Problems with Multiple API Conventions
Mixing and matching different API conventions is often unavoidable, but it is extremely error prone. Some of the instances are obvious. For example, if you try to mix Windows SEH and C++ exceptions in the same executable, it will most likely fail. Other examples are subtler. One specific example that comes up repeatedly relates to HRESULTs and is some variation of:
extern BOOL DoIt();
BOOL ok;
ok = DoIt(...);
if (FAILED(ok)) // WRONG!!!
return;
Why is this wrong? FAILED is an HRESULT-style macro, so it tests to see whether its argument is < 0. Here is its definition, from winerror.h:
#define FAILED(Status) ((HRESULT)(Status)<0)
Since FALSE is defined as 0, FAILED(FALSE) == 0; in other words, a FALSE return value, which means failure, is not detected as a failure. That is counter-intuitive, to say the least. Moreover, because the macro's definition contains a cast, you do not get a compiler warning even if you are at warning level 4.
When you are dealing with BOOLs, you should not use a macro, but instead make an explicit check:
BOOL ok;
ok = DoIt(...);
if (! ok)
return;
By contrast, when you are dealing with HRESULTs, you should always use the SUCCEEDED and FAILED macros. Consider what happens if you treat an HRESULT as a Boolean instead:
HRESULT hr;
hr = ISomething->DoIt(...);
if (! hr) // WRONG!!!
return;
This is a pernicious mistake in that it is easy to miss. If DoIt returns S_OK, the test works correctly. However, suppose DoIt returns some other success status? Then hr > 0, so !hr == 0; the if test fails, and your code returns an error when one has not occurred.
Here is another example:
HRESULT hr;
hr = ISomething->DoIt(...);
if (hr == S_OK) // STILL WRONG!!!
return;
People sometimes interject that ISomething::DoIt always returns S_OK on success, so both of the last two code fragments must be fine. However, this is not a safe assumption. The COM rules are very clear: functions can return any success value on success, so any one of the many implementers of ISomething::DoIt may choose to return, for example, S_FALSE, in which case your code will break.
The right solution is to use the macros. That is why they exist.
HRESULT hr;
hr = ISomething->DoIt(...);
if (FAILED(hr))
return;
Since the subject of HRESULTs has been raised, this is a good time to remind you that S_FALSE is:
A success code, not a failure code, so SUCCEEDED(S_FALSE) == 1.
Defined to be 1, not 0, so S_FALSE == TRUE.
Two Simple Guidelines for Error Cases
There are many simple ways to make code deal with error cases more robustly. Conversely, skipping these simple steps makes code flaky and unreliable when error cases occur.
Always Check Return Status
It does not get much simpler than this. Almost all functions give some indication of whether they succeed or fail, but that does no good if you do not check them. How hard can this be? Think of it as a hygiene issue. You know you should wash your hands before you eat, but maybe you don't always do it. It's the same with checking return values.
Here is a simple, but practical example involving the GetWindowsDirectory function. MSDN documentation is clear about GetWindowsDirectory's error behavior:
Return Values
…
If the function fails, the return value is 0. To get the extended error information, call the GetLastError function.
In fact, the documentation is very explicit.
Here is a code fragment that figures out in which drive the Windows directory resides.
TCHAR cDriveLetter;
TCHAR szWindowsDir[MAX_PATH];
GetWindowsDirectory(szWindowsDir, MAX_PATH);
cDriveLetter = szWindowsDir[0]; // WRONG!!!
...
What happens if GetWindowsDirectory fails? (If you do not believe that GetWindowsDirectory can fail, just play along for the time being.) Well, this code does not check the return value, so szWindowsDir is never filled in and cDriveLetter is assigned uninitialized memory, which can have any value. In effect, this code is choosing a drive at random. That is rarely the right thing to do.
The right thing to do is to check the error status.
TCHAR cDriveLetter;
TCHAR szWindowsDir[MAX_PATH];
if (GetWindowsDirectory(szWindowsDir, MAX_PATH))
{
cDriveLetter = szWindowsDir[0];
...
}
Can this happen? The most common excuse for not checking a return value is, "I know that function will never fail." GetWindowsDirectory is a fine example. Up through Windows® 98 and Windows NT® 4.0, it never actually failed, so a lot of people fell into a bad habit of assuming that it never would.
Now Windows Terminal Server comes along and it becomes more complicated to figure out the Windows directory for an individual user. GetWindowsDirectory has to do much more, including possibly allocating memory. And since the developer who worked on this is a responsible guy, he did the right thing and checked whether allocation succeeded, and if it did not, returned the well-documented error status.
This leads to the next question: would it have helped if GetWindowsDirectory had initialized its output to the empty string when it fails? The answer is no. The results would not be uninitialized, but they would still be something that a sloppy application did not expect. Suppose you had an array indexed by cDriveLetter – 'A'; now the index would suddenly be negative.
Even if Terminal Server is not an issue for you, the same thing could happen in any future implementation of the Windows API. Do you want to prevent the application you are working on from running on future versions of the OS or alternate implementations, such as Embedded NT? It is a good practice to remember that code often lives on long after its expected expiration date.
Sometimes checking the return value is not enough. Consider the Windows API ReadFile. It is common to see code like:
LONG buffer[CHUNK_SIZE];
DWORD cbRead;
ReadFile(hFile, (LPVOID)buffer,
CHUNK_SIZE*sizeof(LONG), &cbRead, NULL);
if (buffer[0] == 0) // DOUBLY WRONG!!!
...
If the read fails, the contents of the buffer are uninitialized. Most of the time it might be zero, but you cannot rely on it.
There are many reasons why reading from a file fails. For example, it could be a remote file and the network may have gone down. Even if it is a local file, the disk may be corrupted at exactly the wrong time. If this is the case, the format of the file may be completely different than expected. Someone may have replaced the file you thought would be there, whether inadvertently or maliciously, or there may be only one byte in the file. Stranger things have happened.
To deal with this case, not only do you need to check that the read succeeded, you must also check to make sure that you have read the proper number of bytes.
LONG buffer[CHUNK_SIZE];
DWORD cbRead;
BOOL ok;
ok = ReadFile(hFile, (LPVOID)buffer,
CHUNK_SIZE*sizeof(LONG), &cbRead, NULL);
if (ok && cbRead >= sizeof(LONG)) {
if (buffer[0] == 0)
...
}
else
// handle the read failure; for example
...
There is no denying that this code is somewhat complex. However, it is more complex to write reliable code than to write code that does not always work. It is natural to question the implications for performance. A couple of tests are being added, but in the context of the big picture (a function call, a disk operation, copying up to CHUNK_SIZE * sizeof(LONG) bytes) the effect is minimal.
In general, whenever return value checks are needed, there will always be a function call involved, so the performance overhead is less significant. In some cases, the compiler may inline the function, but if this happens and the return value does not actually need to be checked because a constant is returned, then the compiler will optimize the test away.
Admittedly, there are a few specialized cases where the handful of CPU cycles that you save by removing a return value check are vital, and where the compiler will not help you, and you control the behavior of the function that you are calling. In those cases, it makes sense to leave out some return value checks. If you think you are in a situation like this, you should discuss it with other developers, revisiting what the real performance tradeoffs are, and then, if you are still convinced it is the right thing to do, make explicit comments in your code at each place you omit a return value check, describing and justifying your decision.
Always Check Allocations
Whether you are using HeapAlloc, VirtualAlloc, IMalloc::Alloc, SysAllocString, GlobalAlloc, malloc, or any C++ operator new, you cannot rely on memory allocation succeeding. The same applies for various other kinds of resources, including GDI objects, files, registry keys, and so on.
Here is a fine example of platform-independent buggy code, written on top of the C standard library:
char *str;
str = (char *)malloc(MAX_PATH);
str[0] = 0; // WRONG!!!
...
In this case, if memory is exhausted, malloc will return NULL, and the assignment to str[0] will dereference a NULL pointer. The result is an access violation that crashes the program (or, if you are running in kernel mode, in a device driver for example, a blue screen). In some circumstances a NULL-pointer dereference can even be turned into an exploitable security hole.
The solution is simple. Check for the return value of NULL, and do the right thing.
char *str;
str = (char *)malloc(MAX_PATH);
if (str != NULL)
{
str[0] = 0;
...
}
As with return-value checking, many people believe that memory allocation issues never occur in practice. Granted, it does not happen all the time, but that does not mean it never happens. If you have thousands (or millions) of people running your software, even if it happens just once a month for each user, the result is significant.
Many people believe that it does not matter what you do when memory is exhausted; the program should just exit. However, this misses the point in many ways. First, if the program simply exits when memory is exhausted, the user's data files will not be saved. Second, services and applications are often expected to run for a long time, so it is essential that they survive periods when memory is temporarily unavailable. Third, for software running in an embedded environment, exiting is not a viable option. Dealing with memory allocation failures can be a hassle, but it has to be done.
Sometimes an unexpected NULL pointer may not cause a program crash, but it is still not a good thing.
HBITMAP hBitmap;
HBITMAP hOldBitmap;
hBitmap = CreateBitmap(. . .);
hOldBitmap = SelectObject(hDC, hBitmap); // WRONG!!!
...
The documentation for SelectObject is vague about what it does with a NULL bitmap. This may not cause a crash, but it is certainly not reliable. The code is presumably creating the bitmap because it expects to do some drawing with it; since no bitmap was created, that drawing is not going to happen. Even though it does not crash, there is clearly a bug here. Once again, you need to check.
HBITMAP hBitmap;
HBITMAP hOldBitmap;
hBitmap = CreateBitmap(. . .);
if (hBitmap != NULL)
{
hOldBitmap = SelectObject(hDC, hBitmap);
...
}
else
...
Things start to get even more fun when you use C++ operator new. If you are using MFC, for example, global operator new will throw an exception if memory is exhausted. That means that you cannot do:
int *ptr1;
int *ptr2;
ptr1 = new int[10];
ptr2 = new int[10]; // WRONG!!!!
If the second memory allocation throws an exception, the memory from the first allocation will leak. If you are embedded in a service or application that is going to be running for a long time, these leaks can add up.
It is not enough just to catch the exception. Your exception-handling code has to be correct as well. Do not fall into the tempting trap of:
int *ptr1;
int *ptr2;
try {
ptr1 = new int[10];
ptr2 = new int[10];
}
catch (CMemoryException *ex) {
delete [] ptr1; // WRONG!!!
delete [] ptr2; // WRONG!!!
}
If the first memory allocation throws an exception, you will catch it, but delete an uninitialized pointer. If you are lucky, this will lead to an immediate access violation and crash. More likely, it will cause a heap corruption that will lead to a data corruption and/or a future crash that will be difficult to debug. It is worth making the effort to initialize those variables:
int *ptr1 = 0;
int *ptr2 = 0;
try {
ptr1 = new int[10];
ptr2 = new int[10];
}
catch (CMemoryException *ex) {
delete [] ptr1;
delete [] ptr2;
}
It should be pointed out that there are many subtleties to C++ operator new. You can modify the behavior of global operator new in several different ways, different classes can have their own operator new, and if you are not using MFC you may see different default behavior, such as returning NULL on allocation failure rather than throwing an exception. For more information, see Part 7 of Bobby Schmidt's Deep C++ series on handling exceptions.
Conclusion
If you want to write reliable code, it is vital to think about what you are going to do about unusual events, including error cases, from the beginning. You cannot tack error handling on as an afterthought.
Error handling is hard to do correctly. Although this article only scratches the surface, the principles it describes are a strong foundation on which to build. Keep the following in mind:
When designing an application (or an API), think about your preferred error-handling paradigm up front.
When using an API, be aware of its paradigm for error handling.
If you are in a situation where there are multiple error-handling paradigms, be alert for potential sources of confusion.
Always check return status.
Always check memory allocation.
If you do all of these things, you will be well on your way to writing reliable applications.