Everything you ever wanted to know about C types
Added 31 Jul 2008
Many platforms offer extensions to the basic C language. For instance,
compilers for PowerPC® processors might feature vector processing extensions
(AltiVec support) and have a range of vector types, such as vector unsigned short, which map
onto processor-specific functionality. The degree to which such
operations are made explicit in the type system can vary quite widely.
The C language's type system has been built around the common ground
between implementations, but many implementations have features for which
the standard C type system offers no direct representation. Another example,
found in many DSP processors, is fixed-point math.
In early implementations, a C type always corresponded to some native
capacity of the hardware. The int type was typically the most
convenient native data type for integer math. However, even early on,
some types might not represent hardware capabilities. Some 16-bit CPUs
did not have a native 32-bit type, so they emulated the long data
type with small specialized libraries to perform 32-bit math. Systems with
no 64-bit type have used similar code to handle operations on long long. Until comparatively recently, this applied
to nearly all systems; meanwhile, some compilers now provide support for
128-bit integers, which are once again being implemented in software.
The performance difference between native and emulated types is usually substantial, so it's worth finding out which types are native on your system. In some cases, a type might be mostly native, but need special handling for a few instructions; for instance, a processor which rounds toward negative infinity on division will need special checks after any division or modulus operation that could potentially involve negative numbers (C99 requires integer division to truncate toward zero). Other types might be entirely emulated. On some systems, a type you're used to having might require a great deal of compiler overhead. At least one compiler for a 64-bit word-addressable system went to a great deal of work to allow pointers to eight-bit values, which had to be implemented as a pointer plus additional bits, and which were dereferenced by grabbing the word in question and playing bit-shifting games. On such a system, a "clever" algorithm which used character types to access individual bytes within larger words might be dramatically slower than a simpler algorithm operating on whole words using bit shifting and masks.
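As a rough illustration of the "whole words with shifts and masks" approach, here is a minimal sketch. The helper names byte_of_word and set_byte_of_word are hypothetical, and the code assumes that only the low 32 bits of the word are of interest and that index is in the range 0 to 3.

```c
/* Return byte number `index` of `word`, where index 0 is the least
   significant byte. Only shifts and masks are used, so no pointer
   to a byte-sized object is ever formed. Assumes index is 0..3. */
static unsigned byte_of_word(unsigned long word, unsigned index)
{
    return (unsigned)((word >> (index * 8u)) & 0xFFul);
}

/* Return `word` with byte number `index` replaced by the low eight
   bits of `value`. Assumes index is 0..3. */
static unsigned long set_byte_of_word(unsigned long word, unsigned index,
                                      unsigned value)
{
    unsigned shift = index * 8u;
    unsigned long mask = 0xFFul << shift;

    return (word & ~mask) | (((unsigned long)value & 0xFFul) << shift);
}
```

For example, byte_of_word(0x12345678UL, 0) yields 0x78, and set_byte_of_word(0x12345678UL, 1, 0xAB) yields 0x1234AB78. Whether this is actually faster than character access depends entirely on the system; on most byte-addressable machines it will not matter.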
If you really need a type, and your compiler provides it -- even emulated -- go ahead and use it. Trying to implement it yourself on top of the native types is probably crazy. In particular, the chances that you will do a better job of it than your compiler vendor are very small.
On some systems, the more usual types are the emulated ones. For instance, the SPE processing elements in the Cell Broadband Engine™ (Cell BE) processor have only 128-bit registers, used as vectors. Access to smaller objects requires extra work, either from the developer or the compiler. (For some examples of how a compiler might deal with this, see the developerWorks tutorial, "An introduction to compiling for the Cell Broadband Engine architecture, Part 2: Optimizing for the SPE.")
The first implementation of C was for the PDP-11, followed by the
Honeywell 635 and the IBM 360/370. The VAX 11/780, while one of
the most influential early C platforms, was only targeted after
the language had matured quite a bit. On early systems, there were
few guarantees about types: the char,
short, and long
types reflected the native word architecture of the machine, and int was whatever the processor found most
convenient.
Ports to other architectures, such as the 68000 and 80x86, tended to adopt
the same conventions as these early systems. The VAX 11/780 port, in
particular, was very influential.
Although the first C compilers supported 16-bit ints, many early C programmers
(probably influenced by the VAX)
carelessly assumed that the int type was always
interchangeable with the long type,
despite dire and well-considered warnings from C luminaries such as Henry
Spencer, whose "The Ten Commandments for C Programmers" says:
10. Thou shalt foreswear, renounce, and abjure the vile heresy which claimeth that "All the world's a VAX," and have no commerce with the benighted heathens who cling to this barbarous belief, that the days of thy program may be long even though the days of thy current machine be short.
This particular heresy bids fair to be replaced by "All the world's a Sun" or "All the world's a 386" (this latter being a particularly revolting invention of Satan), but the words apply to all such without limitation. Beware, in particular, of the subtle and terrible "All the world's a 32-bit machine," which is almost true today but shall cease to be so before thy resume grows too much longer.
The perils of the "All the world's a VAX" assumption came
home to roost with the rise in popularity of the Intel 8086 and
its successors, which at first used 16-bit ints. Worse,
early 386 programming environments could be incompatible, not
only with each other, but with themselves; some compilers offered
the option of choosing whether to use 16-bit or 32-bit values for int
variables. Porting code written by careless programmers on 32-bit systems
could be nightmarish; I once spent roughly a week fixing a SPARC-native
implementation of RPC that had been hacked into running on Microsoft® Windows®; it needed
to run on both 16-bit and 32-bit Windows. If the code had been written
to the standard, instead of to a particular processor, it would have taken an
afternoon at most. (The entire afternoon would have been spent finding the
compiler flag for "generate code which can be called from outside this
library.")
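To make the int-versus-long hazard concrete, here is a small sketch. The function names are hypothetical; the only facts relied on are the standard's minimum ranges (int must reach at least 32767, long at least 2147483647) and the macros in <limits.h>.

```c
#include <limits.h>

/* Number of seconds in `days` days. The constant 86400 does not fit
   in a 16-bit int, so the multiplication is done in long, which is
   guaranteed to hold at least 32 bits. */
static long seconds_in_days(int days)
{
    return (long)days * 86400L;
}

/* Report whether int and long happen to have the same range on this
   implementation. Code that only works when this returns 1 is
   relying on "All the world's a VAX." */
static int int_matches_long(void)
{
    return INT_MAX == LONG_MAX && INT_MIN == LONG_MIN;
}
```

Writing days * 86400 without the cast would probably work on a system with 32-bit ints, and overflow on one with 16-bit ints. That is exactly the kind of bug that takes a week to find in someone else's code.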
C89 had a brief reference in the description of undefined behavior to indeterminately valued objects. In C99, this wording was clarified and expanded into what are now called trap representations. A trap representation is a set of bits which, when interpreted as a value of a specific type, causes undefined behavior. Trap representations are most commonly seen on floating point and pointer values, but in theory, almost any type could have trap representations. An uninitialized object might hold a trap representation. This gives the same behavior as the old rule: access to uninitialized objects produces undefined behavior.
The only guarantees
the standard gives about accessing uninitialized data are that the unsigned char type has no trap representations, and
that padding has no trap representations.
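Because unsigned char has no trap representations, it is the one type through which the bytes of any object can safely be examined. The following is a minimal sketch of that idea. The name format_bytes is hypothetical, the code assumes eight-bit bytes for its two-digit hex output, and it uses snprintf, which requires C99.

```c
#include <stddef.h>
#include <stdio.h>

/* Format the object representation of `obj` as lowercase hex digits
   in `out`, which has room for `outsize` chars including the
   terminating null. Returns the number of bytes formatted. Every
   byte is read through an unsigned char pointer. */
static size_t format_bytes(const void *obj, size_t size,
                           char *out, size_t outsize)
{
    const unsigned char *p = obj;
    size_t i;

    if (outsize == 0)
        return 0;
    out[0] = '\0';
    /* Each byte needs two hex digits plus room for the final null. */
    for (i = 0; i < size && 2 * i + 3 <= outsize; i++)
        snprintf(out + 2 * i, 3, "%02x", (unsigned)p[i]);
    return i;
}
```

Called as format_bytes(&x, sizeof x, text, sizeof text), this shows how an object x is laid out in memory, padding included. The bytes seen will, of course, vary from one implementation to the next.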
Pointers which refer to freed memory (or to automatic variables which have
gone out of scope) become indeterminate, and might become trap
representations. This is to accommodate processors on which some amount of
validation of addresses occurs when an address register is loaded. The
indeterminacy of pointers to freed memory is
also, however, very useful for programs which try to provide some level of
checking for possibly buggy code. Any reference to an indeterminate value
is a bug. Having it caught is better than having it silently ignored.
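One common defensive habit, sketched here under the assumption that the pointer being freed is a modifiable lvalue, is to overwrite a pointer with a null pointer as soon as its memory is released. The macro name FREE_AND_CLEAR is hypothetical. This doesn't help with other copies of the pointer, but it does mean that this particular variable never holds an indeterminate value.

```c
#include <stdlib.h>

/* Free the memory `p` points to, then set `p` to a null pointer so
   that its old value, now indeterminate, can never be read.
   `p` must be a modifiable lvalue; it is evaluated twice. */
#define FREE_AND_CLEAR(p) \
    do {                  \
        free(p);          \
        (p) = NULL;       \
    } while (0)
```

After FREE_AND_CLEAR(buf), a test such as buf == NULL is well defined, whereas even comparing the stale pointer value would not have been.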
According to comp.lang.c regulars, on some implementations, a pointer to
memory obtained by malloc() and subsequently released by free() might
compare equal to a null pointer. This might seem impossible, but the standard
allows for it. (To the best of my knowledge, it only happens on segmented
architectures, where the compiler can mark an entire segment as "freed space,"
and thus necessarily invalid, and check for this when comparing pointers.)
Some compilers, or development tools, provide pointer implementations which
check for buffer overflows, access to freed memory, and other errors. Some
of their checks, such as checks for access to uninitialized values, conform
with the standard because of the trap-representations rule. Some programs
go further and warn about access to indeterminate values even when they
are accessed as unsigned char objects, which have
no trap representations. Such access isn't undefined behavior, but it's
still nice to get the warning.
Many people are aware that, in general, systems might be "big-endian" or "little-endian." These terms denote the way in which consecutive bytes of data storage are arranged in memory. On a big-endian system, the first byte of a word will be the most significant one, and the last will be the least significant. On a little-endian system, it's the other way around.
Programmers accustomed to one variety or the other might develop bad habits. Big-endian ordering is the canonical byte order for network data, such as TCP/IP packets. Some users on big-endian systems do not remember to convert values to network byte order before putting them in a packet -- which seems harmless until the code is tried on another system. On the other hand, little-endian users sometimes get in the habit of treating a pointer to a word as a pointer to a single byte, to extract the low-order bits. Users on both types of systems often write binary data in machine order without considering the problem of reading the resulting files later.
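One way to sidestep all of these habits is to never let the machine's byte order show through at all: build the external representation one byte at a time with shifts. The following is a minimal sketch; the names put_be32 and get_be32 are hypothetical, and the code assumes the implementation provides uint32_t in <stdint.h> (a C99 feature).

```c
#include <stdint.h>

/* Store `v` into buf[0..3] in big-endian (network) order. The result
   is the same on any host, whatever its native byte order. */
static void put_be32(unsigned char *buf, uint32_t v)
{
    buf[0] = (unsigned char)((v >> 24) & 0xFFu);
    buf[1] = (unsigned char)((v >> 16) & 0xFFu);
    buf[2] = (unsigned char)((v >> 8) & 0xFFu);
    buf[3] = (unsigned char)(v & 0xFFu);
}

/* Read a big-endian 32-bit value back from buf[0..3]. */
static uint32_t get_be32(const unsigned char *buf)
{
    return ((uint32_t)buf[0] << 24) |
           ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8) |
            (uint32_t)buf[3];
}
```

Because the code works on values and not on the way they are stored, it behaves identically on big-endian, little-endian, and stranger systems. The same approach works for files: pick a byte order for the file format and write it explicitly.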
Some implementations have been "middle-endian" -- where the bytes of a two-byte word were in little-endian order, but the words of a double-word were in big-endian order. Some implementations even support switching modes; for instance, most PowerPC systems can do either little-endian or big-endian math. This is a feature you can't access from portable C, but a library vendor might take advantage of it. Such an architecture is sometimes called "bi-endian" or "open-endian."