sorting at the speed of light

Whereas the fastest in-place sorting algorithms perform at O(n*log(n)), Radix sort is an out of place sorting algorithm that sorts at a complexity of O(n).

### In-place sorting and Out of place sorting

There is a subtle difference between in place sorting algorithms and out of place sorting algorithms. Whereas the first ones don't require more memory the actual data array to sort, the latter require more. In the case of Radix sort we require twice more memory, so our memory complexity is O(2*n). In order to make the routine more usable, we may want to leave the source array unchanged, so we need to allocate more local working memory. We may need to get only the indices in sorted order as well. In the case of a particle engine where we should sort out the rendering order, we typically want to leave our vertex data unchanged and only sort out point sprite indices.

### Keys and limitations

Radix sort typically sorts fixed-size integer keys. In this article we will focus on sorting 32 bit positive integers, we will also see how it is possible that this algorithm also works without modification on positive real numbers -floats- in the IEEE 754 standard.

While it's possible to make algorithms that work with more than 32 bits of course, in this article we'll focus with these. Mainly, the radix sort complexity is linear according to the number of elements to sort, but it's also linear according to the sorting-key size.

### Statistical sorting

Radix sort is a statistical sorting algorithm. It means the first pass will gather statistics about the key upon which to sort the data. We'll start with a typical byte-based sorting, at first we sort upon the most significant byte, then the second, the third and fourth.. Because sorting the whole 32 bit integer is equivalent to sorting its individual subparts in a ping-pong style algorithm. So we have one O(n) pass to gather statistics, plus 4 passes that each sort for each byte. That's where the O(5*n) figure comes from.

Here, we sort from the most significant byte to the lowest significant, but it doesn't matter much, we could do the other way arround and the result would still be correct. The key is that we sort every byte that is significant to sort.

It means we can also design an algorithm that sorts 64 bits of data by sorting each of the 8 sub-bytes in 8 passes. Or 128 bits in 16 passes. Or more...

### Algorithm and source code

The algorithm starts by gathering statistics about bytes and then calculates a position at which to write these bytes when sorting. the succession of the sorting for each bytes sorts the whole 32 bits.

I've done the choice in the following to leave the src array unchanged, which means i have to allocate 2 more arrays as a work memory... The memory complexity of the following implementation is O(4*n) so it's average, but it's more functionnal and usable as a result.

### Sorting positive floats as unsigned integers

The idea relies on the fact that exponents are encoded as strictly positive numbers in the most significant byte in the 32 bits float number specification IEEE 754. Upon equal exponents, the mantissa of the bigger number will be always bigger.

As a result sorting positive floats is strictly equivalent to sorting integers. ;)

### Why not doing better in O(3*n) or O(2*n)?

It is possible to sort out as well 11 bits of data and then require only 3 passes, or sorting out 16 bits and 2 passes.

One thing that one needs to keep in mind however is that the statistics table is a randomly accessed hash table, so it needs to fit in the L1 data cache of your favorite computing device to ensure optimal performance.

with 11 bits of significant data we need 8k of L1 cache which is still affordable, and with 16 bits we would need 256k of randomly accessable memory. I don't make any early assumption on this, just keep in mind that the hash table is randomly accessed. ;)