11 votes

Exploring the feasibility of brute-force guessing a 64-bit number through different techniques

6 comments

  1. [6]
    arghdos
    Link
    Interesting that they hand vectorized the SIMD instructions on the CPU, but didn't do the same exercise using inline PTXAs; CUDA's intermediate assembly language. I would bet they could get...

    Interesting that they hand vectorized the SIMD instructions on the CPU, but didn't do the same exercise using inline PTXAs; CUDA's intermediate assembly language. I would bet they could get another substantial gain through that. Maybe not the 3-4x as from the explicit SIMD.

    The V100 is a beast though. There's a reason that the (current) #1 supercomputer is powered by the V100 (though it looks like AMD will be taking that crown in the near future with Frontier)

    4 votes
    1. [5]
      Greg
      Link Parent
      I thought that was because clang just wasn't automatically vectorising loop structures for the CPU at all, so the author jumped straight to hand written SIMD code, whereas the CUDA compiler was...

      I thought that was because clang just wasn't automatically vectorising loop structures for the CPU at all, so the author jumped straight to hand written SIMD code, whereas the CUDA compiler was vectorising quite happily so they didn't need to?

      It looks like you can use #pragma directives to coerce clang into vectorising standard loops, so I'd actually be quite interested to know how compiled SIMD compares to hand written SIMD. My intuitive but entirely unscientific bet is that they'd be close to identical.

      1 vote
      1. [4]
        arghdos
        Link Parent
        Yeah fair point, it just seems like they did a lot of hand optimization for the CPU and didn't even mention trying to optimize the GPU (e.g., by controlling register use, making sure the compiler...

        I thought that was because clang just wasn't automatically vectorising loop structures for the CPU at all, so the author jumped straight to hand written SIMD code, whereas the CUDA compiler was vectorising quite happily so they didn't need to?

        Yeah fair point, it just seems like they did a lot of hand optimization for the CPU and didn't even mention trying to optimize the GPU (e.g., by controlling register use, making sure the compiler is choosing the right instructions for each card, etc.). I guess really I just want to see their code :p

        2 votes
        1. [3]
          9000
          Link Parent
          Good news! It's on GitHub: https://github.com/trailofbits/sixtyfour

          I guess really I just want to see their code :p

          Good news! It's on GitHub: https://github.com/trailofbits/sixtyfour

          2 votes
          1. [2]
            arghdos
            (edited )
            Link Parent
            Thanks! One quick optimization note: In the check loop, they would probably do better by ensuring the chunk side they are checking is divisible by 4/8 or the like and then hand-unrolling the check...

            Thanks! One quick optimization note:

            In the check loop, they would probably do better by ensuring the chunk side they are checking is divisible by 4/8 or the like and then hand-unrolling the check loop. Right now they are doing:

            • Check value == secret
            • increment value
            • check that value < stop

            (Ignoring the set found instruction, as that's essentially a zero probability on any individual iteration).

            Essentially the check to see the value is still in range is 1/3rd of the performance critical loop.

            If they were instead doing:

            • Check value == secret
            • increment value
            • Check value == secret
            • increment value
            • Check value == secret
            • increment value
            • Check value == secret
            • increment value
            • check that value < stop

            It's gone down to 1/9th the instruction count.

            Considering that their metric is roughly proportional to: s/secret check this may yield some improvements. There are other concerns (e.g. increased register count) but it's worth trying :p

            2 votes
            1. 9000
              Link Parent
              That's an interesting point! You could open an issue or pull request, since they say help is welcome?

              That's an interesting point! You could open an issue or pull request, since they say help is welcome?

              2 votes