Why Big-Endianness is Wrong

by Charles M. "Chip" Coldwell

Consider the bits within a byte. The least significant bit is number 0 and the most significant bit is number 7:

MSB                         LSB
 7   6   5   4   3   2   1   0 

Numbering the bits this way is the only way that makes sense: bit n carries the value 2^n, so left-shifting one by n sets bit n in the byte:

  /* function to set bit "n" in byte "mask" */
  char le_set_bit(char mask, int n)
  {
    return mask | (1 << n);
  }

  /* function to clear bit "n" in byte "mask" */
  char le_clear_bit(char mask, int n)
  {
    return mask & ~(1 << n);
  }

  /* function to return the value of bit "n" in byte "mask" */
  int le_test_bit(char mask, int n)
  {
    return (mask & (1 << n)) != 0;
  }
  

Now suppose, perversely, we decided to number the bits in the opposite order[1]:

MSB                         LSB
 0   1   2   3   4   5   6   7 

We can still set, clear and test bits, but we have to know how wide a byte is in order to do it (usually, but not always, eight bits -- this is why the Internet RFCs refer to "octets" and not "bytes"):

  /* function to set bit "n" in byte "mask" */
  char be_set_bit(char mask, int n)
  {
    return mask | (0x80 >> n);
  }

  /* function to clear bit "n" in byte "mask" */
  char be_clear_bit(char mask, int n)
  {
    return mask & ~(0x80 >> n);
  }

  /* function to return the value of bit "n" in byte "mask" */
  int be_test_bit(char mask, int n)
  {
    return (mask & (0x80 >> n)) != 0;
  }
  

I don't think there's any controversy about which order is the right order for numbering the bits within a byte. So why on earth would you ever want to number the bytes in the opposite order?

Consider the bytes within a word (and here we mean "word" in the DEC sense: an integer two bytes wide, often called a halfword). These bytes are stored at sequential addresses in memory, and one of those addresses is lower than the other. The lower address holds byte 0 and the higher address holds byte 1:

MSB                         LSB MSB                         LSB
 7   6   5   4   3   2   1   0   7   6   5   4   3   2   1   0 
            byte 1                          byte 0

Now, if we have a pointer to this word, it holds the address of byte 0 no matter what the endianness of the processor:

  unsigned short word = 0x00FF;
  unsigned char *byte = (unsigned char *) &word;
  

On either a little- or big-endian machine, the following boolean expression evaluates as true:

  (long) byte == (long) &word
  

So what are the values in byte[0] and byte[1]?

On a little-endian machine, byte[0] == 0xFF and byte[1] == 0x00. This has the nice property that, in addition to the statement above, the following statement also evaluates as true for all values of word that fit in a byte:

  *byte == word
  

In other words, dereferencing a pointer-to-byte that holds the address of a word yields the same value as the word itself, whenever that value fits in a byte.

On a big-endian machine, byte[0] == 0x00 and byte[1] == 0xFF, so that even though the addresses are still equal and the value would fit in a byte, the values differ:

  *byte != word
  

Nonetheless, on either little- or big-endian machines:

  *(unsigned short *)byte == word
  

There are those who would argue that by comparing two integers of different widths I get what I deserve. However, the C programming language performs the implicit integer promotions in expressions like these, so the fact that

  if(*byte == word) {
     /* ... */
  }
  

won't branch but

  if(*(unsigned short *)byte == word) {
     /* ... */
  }
  

will, even though both statements are perfectly valid standard C and neither will so much as generate a compiler warning, is at the very least a trap set for the unwary on big-endian machines.

Notes

  1. It turns out that IBM is just this perverse: they number their bits in big-endian order in the POWER ISA. That means that bit 0 of an eight-bit value will not align with bit 0 of the 16-bit word that contains it. Somebody stop the madness.
