Having made a GPU-bound particle system I needed a way to fix rendering artifacts, such as having particles in the back rendering in front of other, caused by rendering particles in the "wrong" order.
To fix that I hade a couple of diffrent options such as try and implement WBOIT(Weighted Blended Order - Independent Transparency) or sorting them.
So I started looking into WBOIT and diffrent sorting algorithms. In the end I decided to learn and use Radix Sort on the GPU.
The first problem I enounterd was that there were a lot of pappers on how to use Radix Sort, but less about how to get started and learn Radix Sort.
So after doing some research I started doing a simple implementation of Radix Sort on the CPU.
As you can see, very short and simple of using Radix Sort with bits. Does it work, yes, is it good, no. The goal was to understand the Idea and get a working Radix Sort. A base to start on.
So what does it do?
Simple we use bit math to check if a bit is 0 or 1, true or false.
Then we count the amounts of zeros so we know where the ones will start.
Then we place the elements in the next free slot it should go into.
When that is done, we swap the source and destination buffers, and do it all over again.
For the number of bits a unsigned __int32 has, so 32 bits.
At the end we just make sure we return the final sorted vector.
Nest step was to make this sort based of Bytes instead of bits.
As we can see by the code we only loop 4 times instead of 32 times since a Bytes is 8 bits.
This takes slighly more memory since we now need to save the count of all the digits between 0 - 255.
But it is much faster, during my testing with sorting 100 Million elements it took the ~85 seconds for the bit version, and only 3-4 seconds for the Byte version.
We also get to start see how we would do things on the GPU since we can break this into three steps.
First Step - Count The Digits or rather build a histogram.
Second Step - Calculate the Offset which represents the Scane Pass on the GPU.
Third Step - Place the values from the Source on the correct Destination index.
Of course this is a bit ways away from how you would do it on the GPU, but it is a start.
Then I started mapping out a bit more of how it would look on the GPU.
This is horrible inefficient on the CPU, but hey the goal isn't to use it on the CPU but to get a picture of how it would look like on the GPU.
It took it ~125 seconds to sort the 100 million elements.
Thats not good.
This version is not to diffrent from the Byte version, aside from all the atrocites made to simulate the GPU, without threading it.
As well as the fact that it doesn't sort by Byte(8 bits) but by Nibble(4 bits).
Since I was more focused on testing things then making it fast at this point, and GPU based Radix sorts often was either based of Bytes or Nibbles.
Now that I had done it on the CPU and simulated a GPU version on the CPU, it was time to move to doing it on the GPU with HLSL and Compute shaders.
Well lets look at my first implementation on the GPU.
As we can see it was a very naive Implementation, where I had four compute shaders. I tested it to see that it worked and get a check on the performance, and while it worked as intended it was VERY inefficient, droping us to 30 FPS if we ran it every frame. However it did fulfill the purpose of giving me a starting line for Radix Sorting on the GPU, since I now had more knowhow and a starting base.
The biggest performance killer was the Scan Pass and Offset Pass, so the first goal was to remove the offset pass and improve the Scan Pass.
The Scan Pass is the part that makes Radix Sort so tricky to do on the GPU, however they are many diffrent papers on how to improve it, I will link some of them at the bottom of the page.
The problem now is that some of these scans like Wave Scan and One Pass Scans doesn't really work with DX11.
I was also doing the Radix Sort with Nibbles, so I started with changing it to Bytes which helped a bit with the performance.
Then I started working on remaking the Scan Pass and removing unnecessay parts. Which took some trial and error, but here's the end result.
The next step would be to research more about the scan part, maybe trying the Hillis and Steele scan, or see if I kind find if someone has found a Wave or One Pass Scan that works with DX11. However for now it has decent performance, went from 100 FPS in game to 90 FPS when sorting one million elements per frame, and that was with Render Doc active.