Then the other diagonals from the top of each column towards the bottom left with wrap around...
Then those 4 steps on this matrix, 1. sort rows 2. sort columns 3. sort diagonal 4. sort reverse diagonal to get:
And I think in general it takes the base 2 log of row length number of iterations, like here it took 3 for row length 8.
Now the list is sorted reading off by rows.
Now this seems kind of cumbersome but the nicety of it is that each step can be carried out distributed across different processors.
Supposing you have a number of processors equal to the square root of the number of items. This might not be as unusual as it sounds with graphics cards starting to have in the thousands of compute units. Ordinarily n*log(n) is the best you can do on one processor, that compares to this algorithm in the sense of big(O) as:
This algorithm is the green curve.