Creating Type-Safe, Performant, Reusable Data Structures

Added 31 Jul 2008

When creating a data structure for a particular problem, oftentimes the data structure's internals can be customized to the specifics of the problem. For example, imagine that you were working on a payroll application. One of the entities of this system would be an employee, so you might create an Employee class with applicable properties and methods. To represent a set of employees, you could use an array of type Employee, but perhaps you need some extra functionality not present in the array, or you simply don't want to have to concern yourself with writing code to watch the capacity of the array and resize it when necessary. One option would be to create a custom data structure that uses an internal array of Employee instances, and offered methods to extend the base functionality of an array, such as automatic resizing, searching of the array for a particular Employee object, and so on.

This data structure would likely prove very helpful in your application, so much so that you might want to reuse it in other applications. However, this data structure is not open to reuse because it is tightly-coupled to the payroll application, only being able to store elements of type Employee (or types derived from Employee). One option to make a more flexible data structure is to have the data structure maintain an internal array of object instances, as opposed to Employee instances. Because all types in the .NET Framework are derived from the object type, the data structure could store any type. This would make your collection data structure usable in other applications and scenarios.

Not surprisingly, the .NET Framework already contains a data structure that provides this functionality—the System.Collections.ArrayList class. The ArrayList maintains an internal object array and provides automatic resizing of the array as the number of elements added to the ArrayList grows. Because the ArrayList uses an object array, developers can add any type—strings, integers, FileInfo objects, Form instances, anything.

While the ArrayList provides added flexibility over the standard array, this flexibility comes at the cost of performance. Because the ArrayList stores an array of objects, when reading the value from an ArrayList you need to explicitly cast it to the data type being stored in the specified location. Recall that an array of a value type—such as a System.Int32, System.Double, System.Boolean, and so on—is stored contiguously in the managed heap in its unboxed form. The ArrayList's internal array, however, is an array of object references.

The boxing and unboxing, along with the extra level of indirection, that comes with using value types in an ArrayList can hamper the performance of your application when using large ArrayLists with many reads and writes. As Figure 3 illustrates, the same memory layout occurs for reference types in both ArrayLists and arrays.

Having an object array also introduces potential bugs that won't be noticed until run-time. A developer may intend to only add elements of a particular type to an ArrayList, but since the ArrayList allows any type to be added, adding an incorrect type won't be caught during compilation. Instead, such a mistake would not be apparent until run-time, meaning the bug would not be found until testing or, in the worse case, during actual use.