Everything you ever wanted to know about C types

Added 31 Jul 2008

Many platforms offer extensions to the basic C language. For instance, compilers for PowerPC® processors might feature vector processing extensions (AltiVec support) and have a range of vector types, such as vector unsigned short, which map onto processor-specific functionality. The degree to which such operations are made explicit in the type system varies widely. The C language's type system has been built around the common ground between implementations, but many implementations have features for which the standard C type system offers no direct representation. Another example, found in many DSP processors, is fixed-point math.

Native and emulated types

In early implementations, a C type always corresponded to some native capacity of the hardware. The int type was typically the most convenient native data type for integer math. However, even early on, some types might not represent hardware capabilities. Some 16-bit CPUs did not have a native 32-bit type, so they emulated the long data type with small specialized libraries to perform 32-bit math. Systems with no native 64-bit type have used similar code to handle operations on long long; until comparatively recently, that described nearly all systems. Some compilers now provide support for 128-bit integers, which are once again implemented in software.
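To make the idea of emulation concrete, here is a minimal sketch of how a 32-bit addition can be built from 16-bit pieces. The struct u32_pair type and the u32_pair_add function are hypothetical names invented for this illustration, not taken from any real runtime library, and the sketch assumes the implementation provides uint16_t.

```c
#include <stdint.h>

/* Hypothetical representation: one 32-bit unsigned value held as two
   16-bit halves, as a 16-bit CPU might store it. */
struct u32_pair {
    uint16_t hi;  /* most significant 16 bits */
    uint16_t lo;  /* least significant 16 bits */
};

/* Add two values while keeping only 16-bit results, propagating the
   carry from the low half into the high half by hand. */
struct u32_pair u32_pair_add(struct u32_pair a, struct u32_pair b)
{
    struct u32_pair sum;
    uint16_t carry;

    sum.lo = (uint16_t)(a.lo + b.lo);   /* wraps modulo 65536 */
    carry = (uint16_t)(sum.lo < a.lo);  /* the low half wrapped if it got smaller */
    sum.hi = (uint16_t)(a.hi + b.hi + carry);
    return sum;
}
```

Every emulated addition costs several native operations plus a comparison, which hints at why emulated types tend to be slower than native ones.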

The performance difference between native and emulated types is usually substantial, so it's worth finding out which types are native on your system. In some cases, a type might be mostly native, but need special handling for a few instructions; for instance, a processor which rounds toward negative infinity on division, rather than truncating toward zero as C99 requires, will need special checks after any division or modulus operation that could potentially involve negative numbers. Other types might be entirely emulated. On some systems, a type you're used to having might require a great deal of compiler overhead. At least one compiler for a 64-bit word-addressable system went to a great deal of work to allow pointers to eight-bit values, which had to be implemented as a pointer plus additional bits, and which were dereferenced by grabbing the word in question and playing bit-shifting games. On such a system, a "clever" algorithm which used character types to access individual bytes within larger words might be dramatically slower than a simpler algorithm operating on whole words using bit shifting and masks.
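As an illustration of the whole-word approach, the following sketch reads and replaces one byte of a 32-bit word using only shifts and masks, with no character pointers involved. The function names are hypothetical, the sketch assumes uint32_t is available, and callers are assumed to pass an index from 0 to 3.

```c
#include <stdint.h>

/* Return byte number `index` of `word`; index 0 is the least
   significant byte. */
unsigned int word_get_byte(uint32_t word, unsigned int index)
{
    return (unsigned int)((word >> (index * 8u)) & 0xFFu);
}

/* Return a copy of `word` with byte `index` replaced by the low
   8 bits of `value`. */
uint32_t word_set_byte(uint32_t word, unsigned int index, unsigned int value)
{
    unsigned int shift = index * 8u;
    uint32_t mask = (uint32_t)0xFFu << shift;

    return (word & (uint32_t)~mask) | (((uint32_t)value & 0xFFu) << shift);
}
```

Whether this is actually faster than byte-pointer access depends entirely on the target; on machines with native byte addressing, the compiler may well generate equivalent code for both.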

If you really need a type, and your compiler provides it -- even emulated -- go ahead and use it. Trying to implement it yourself on top of the native types is probably crazy. In particular, the chances that you will do a better job of it than your compiler vendor are very small.

On some systems, the more usual types are the emulated ones. For instance, the SPE processing elements in the Cell Broadband Engine™ (Cell BE) processor have only 128-bit registers, used as vectors. Access to smaller objects requires extra work, either from the developer or the compiler. (For some examples of how a compiler might deal with this, see the developerWorks tutorial, "An introduction to compiling for the Cell Broadband Engine architecture, Part 2: Optimizing for the SPE.")

Historical implementations

The first implementation of C was for the PDP-11, followed by the Honeywell 635 and the IBM 360/370. The VAX 11/780, while one of the most influential early C platforms, was only targeted after the language had matured quite a bit. On early systems, there were few guarantees about types: the char, short, and long types reflected the native word architecture of the machine, and int was whatever the processor found most convenient.

Ports to other architectures, such as the 68000 and 80x86, tended to adopt the same conventions as these early systems. The VAX 11/780 port, in particular, was very influential. Although the first C compilers supported 16-bit ints, many early C programmers (probably influenced by the VAX) carelessly assumed that the int type was always interchangeable with the long type, despite dire and well-considered warnings from C luminaries such as Henry Spencer, whose Ten Commandments for C Programmers say:

10. Thou shalt foreswear, renounce, and abjure the vile heresy which claimeth that "All the world's a VAX," and have no commerce with the benighted heathens who cling to this barbarous belief, that the days of thy program may be long even though the days of thy current machine be short.

This particular heresy bids fair to be replaced by "All the world's a Sun" or "All the world's a 386" (this latter being a particularly revolting invention of Satan), but the words apply to all such without limitation. Beware, in particular, of the subtle and terrible "All the world's a 32-bit machine," which is almost true today but shall cease to be so before thy resume grows too much longer.

The perils of the "All the world's a VAX" assumptions came home to roost with the rise in popularity of the Intel 8086 and its successors, which at first used 16-bit ints. But in fact, early 386 programming environments could be incompatible, not only with each other, but with themselves; some compilers offered the option of choosing whether to use 16-bit or 32-bit values for int variables. Porting code written by careless programmers on 32-bit systems could be nightmarish. I once spent roughly a week fixing a SPARC-native implementation of RPC that had been hacked into running on Microsoft® Windows®, where it needed to run on both 16-bit and 32-bit Windows. If the code had been written to the standard, instead of to a particular processor, it would have taken an afternoon at most. (The entire afternoon would have been spent finding the compiler flag for "generate code which can be called from outside this library.")
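A small sketch of what "written to the standard" can look like in practice: the standard guarantees only that int covers at least -32767 to 32767, while long covers at least the 32-bit range, so arithmetic that may exceed 16 bits is done in long. The function names here are hypothetical and exist only for this illustration.

```c
#include <limits.h>

/* Multiply two values that each fit in 16 bits. The product can need
   up to 32 bits, so it is computed in long, which the standard
   guarantees to be at least 32 bits wide. */
long scaled_product(int a, int b)
{
    return (long)a * (long)b;
}

/* Report whether `value` can be stored in an int on this particular
   implementation, using the limits from <limits.h> instead of
   assuming a width. */
int fits_in_int(long value)
{
    return value >= INT_MIN && value <= INT_MAX;
}
```

On a compiler with 16-bit ints, fits_in_int(scaled_product(1000, 1000)) returns 0; with 32-bit ints it returns 1. The same source is correct on both.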

Trap representations

C89 had a brief reference in the description of undefined behavior to indeterminately valued objects. In C99, this wording was clarified and expanded into what are now called trap representations. A trap representation is a set of bits which, when interpreted as a value of a specific type, causes undefined behavior. Trap representations are most commonly seen on floating point and pointer values, but in theory, almost any type could have trap representations. An uninitialized object might hold a trap representation. This gives the same behavior as the old rule: access to uninitialized objects produces undefined behavior.

The only guarantees the standard gives about accessing uninitialized data are that the unsigned char type has no trap representations, and that padding bytes in a structure or union never turn the object as a whole into a trap representation.
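Because unsigned char has no trap representations, it is the type to use when examining the raw bytes of an object of any other type. The following is a minimal sketch of that technique; count_nonzero_bytes is a hypothetical helper written for this illustration.

```c
#include <stddef.h>

/* Count the nonzero bytes in the object representation of any object.
   The bytes are read through unsigned char, so no byte pattern
   encountered here can be a trap representation. */
size_t count_nonzero_bytes(const void *object, size_t size)
{
    const unsigned char *bytes = (const unsigned char *)object;
    size_t i;
    size_t count = 0;

    for (i = 0; i < size; i++) {
        if (bytes[i] != 0)
            count++;
    }
    return count;
}
```

A caller would pass the address and size of the object, for example count_nonzero_bytes(&x, sizeof x). Reading the same storage as a float or a pointer instead would be subject to that type's trap representations.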

Pointers which refer to freed memory (or to automatic variables which have gone out of scope) become indeterminate, and might become trap representations. This is to accommodate processors on which some amount of validation of addresses occurs when an address register is loaded. The indeterminacy of pointers to freed memory is also, however, very useful for programs which try to provide some level of checking for possibly buggy code. Any reference to an indeterminate value is a bug. Having it caught is better than having it silently ignored. According to comp.lang.c regulars, on some implementations, a pointer to memory obtained by malloc() and subsequently released by free() might compare equal to a null pointer. This might seem impossible, but the standard allows for it. (To the best of my knowledge, it only happens on segmented architectures, where the compiler can mark an entire segment as "freed space," and thus necessarily invalid, and check for this when comparing pointers.)
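One common defensive habit, sketched below, is to overwrite a pointer with a null pointer as soon as the memory it refers to has been freed, so that no indeterminate pointer value is left lying around to be examined later. The struct buffer type and its functions are hypothetical names used only for this example.

```c
#include <stdlib.h>

/* Hypothetical owner of a heap allocation. */
struct buffer {
    char *data;
    size_t size;
};

/* Allocate `size` bytes; returns 0 on success, -1 on failure. */
int buffer_init(struct buffer *b, size_t size)
{
    b->data = malloc(size);
    b->size = (b->data != NULL) ? size : 0;
    return (b->data != NULL) ? 0 : -1;
}

/* Free the storage and immediately replace the now-indeterminate
   pointer with a null pointer, which is always safe to compare. */
void buffer_release(struct buffer *b)
{
    free(b->data);
    b->data = NULL;
    b->size = 0;
}
```

This only protects the one pointer stored in the struct; any copies of b->data made elsewhere still become indeterminate when the memory is freed.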

Some compilers, or development tools, provide pointer implementations which check for buffer overflows, access to freed memory, and other errors. Some of their checks, such as checks for access to uninitialized values, conform with the standard because of the trap-representations rule. Some programs go further and warn about access to indeterminate values even when they are accessed as unsigned char objects, which have no trap representations. Such access isn't undefined behavior, but it's still nice to get the warning.

Endianness

Many people are aware that, in general, systems might be "big-endian" or "little-endian." These terms denote the way in which consecutive bytes of data storage are arranged in memory. On a big-endian system, the first byte of a word will be the most significant one, and the last will be the least significant. On a little-endian system, it's the other way around.
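A program can observe which arrangement its own platform uses by storing a known value and looking at its first byte through an unsigned char pointer. This is a minimal sketch with a hypothetical function name; it only distinguishes "least significant byte first" from everything else, and it is trivially true on an implementation where unsigned int occupies a single byte.

```c
/* Return 1 if the least significant byte of an unsigned int is stored
   at the lowest address (little-endian layout), 0 otherwise. */
int first_byte_is_least_significant(void)
{
    unsigned int probe = 1u;
    const unsigned char *bytes = (const unsigned char *)&probe;

    return bytes[0] == 1;
}
```

Code that depends on the answer is by definition not portable, so a check like this is better suited to diagnostics than to choosing between code paths.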

Programmers accustomed to one variety or the other might develop bad habits. Big-endian ordering is the canonical byte order for network data, such as TCP/IP packets. Some users on big-endian systems do not remember to convert values to network byte order before putting them in a packet -- which seems harmless until the code is tried on another system. On the other hand, little-endian users sometimes get in the habit of treating a pointer to a word as a pointer to a single byte, to extract the low-order bits. Users on both types of systems often write binary data in machine order without considering the problem of reading the resulting files later.
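One way to sidestep all of these habits is to never copy multi-byte values to or from a buffer in machine order, and instead build the byte sequence explicitly with shifts. The sketch below writes and reads a 32-bit value in big-endian (network) order and behaves the same on any host; the function names are hypothetical, and uint32_t is assumed to be available.

```c
#include <stdint.h>

/* Store `value` into out[0..3] in big-endian (network) byte order,
   regardless of the byte order of the host. */
void put_u32_be(unsigned char *out, uint32_t value)
{
    out[0] = (unsigned char)((value >> 24) & 0xFFu);
    out[1] = (unsigned char)((value >> 16) & 0xFFu);
    out[2] = (unsigned char)((value >> 8) & 0xFFu);
    out[3] = (unsigned char)(value & 0xFFu);
}

/* Read a big-endian 32-bit value from in[0..3]. */
uint32_t get_u32_be(const unsigned char *in)
{
    return ((uint32_t)in[0] << 24)
         | ((uint32_t)in[1] << 16)
         | ((uint32_t)in[2] << 8)
         |  (uint32_t)in[3];
}
```

Because the shifts operate on values rather than on memory layout, the same pair of functions can be used for packets and for binary files that must be readable on other systems.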

Some implementations have been "middle-endian" -- where the bytes of a two-byte word were in little-endian order, but the words of a double-word were in big-endian order. Some implementations even support switching modes; for instance, most PowerPC systems can operate in either little-endian or big-endian mode. Such an architecture is sometimes called "bi-endian" or "open-endian." This is a feature you can't access from portable C, but a library vendor might take advantage of it.
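To make the middle-endian layout concrete, here is a sketch that decodes a 32-bit value stored that way: the high 16-bit word comes first, but within each word the low byte comes first. The function name is hypothetical and uint32_t is assumed to be available.

```c
#include <stdint.h>

/* Decode a 32-bit value from in[0..3] stored in the mixed order
   described above: 16-bit words in big-endian order, bytes within
   each word in little-endian order. */
uint32_t get_u32_mixed(const unsigned char *in)
{
    uint32_t high_word = ((uint32_t)in[1] << 8) | (uint32_t)in[0];
    uint32_t low_word  = ((uint32_t)in[3] << 8) | (uint32_t)in[2];

    return (high_word << 16) | low_word;
}
```

Under this layout, the value 0x0A0B0C0D is stored as the byte sequence 0B 0A 0D 0C.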