Interesting that they hand-vectorized the SIMD instructions on the CPU, but didn't do the same exercise using inline PTX, CUDA's intermediate assembly language. I would bet they could get another substantial gain through that. Maybe not the 3-4x they got from the explicit SIMD.
The V100 is a beast though. There's a reason that the (current) #1 supercomputer is powered by the V100 (though it looks like AMD will be taking that crown in the near future with Frontier).
I thought that was because clang just wasn't automatically vectorising loop structures for the CPU at all, so the author jumped straight to hand written SIMD code, whereas the CUDA compiler was vectorising quite happily so they didn't need to?
It looks like you can use #pragma directives to coerce clang into vectorising standard loops, so I'd actually be quite interested to know how compiled SIMD compares to hand written SIMD. My intuitive but entirely unscientific bet is that they'd be close to identical.
Yeah fair point, it just seems like they did a lot of hand optimization for the CPU and didn't even mention trying to optimize the GPU (e.g., by controlling register use, making sure the compiler is choosing the right instructions for each card, etc.). I guess really I just want to see their code :p
Thanks! One quick optimization note:
In the check loop, they would probably do better by ensuring the chunk size they are checking is divisible by 4 or 8 (or the like) and then hand-unrolling the check loop. Right now they are doing:
Check value == secret
increment value
check that value < stop
(Ignoring the set found instruction, as that's essentially a zero probability on any individual iteration).
Essentially, the check that the value is still in range is a third of the instructions in the performance-critical loop.
If they were instead doing:
Check value == secret
increment value
Check value == secret
increment value
Check value == secret
increment value
Check value == secret
increment value
check that value < stop
The range check has gone down to 1/9th of the instruction count (one bound check per nine instructions instead of one per three).
Considering that their metric is roughly proportional to seconds per secret check, this may yield some improvement. There are other concerns (e.g. increased register count) but it's worth trying :p
Good news! It's on GitHub: https://github.com/trailofbits/sixtyfour
That's an interesting point! You could open an issue or pull request, since they say help is welcome?