Interesting that they hand-vectorized the SIMD instructions on the CPU, but didn't do the same exercise using inline PTX, CUDA's intermediate assembly language. I would bet they could get another substantial gain through that. Maybe not the 3-4x they got from the explicit SIMD.
The V100 is a beast though. There's a reason that the (current) #1 supercomputer is powered by the V100 (though it looks like AMD will be taking that crown in the near future with Frontier).
I thought that was because clang just wasn't automatically vectorising loop structures for the CPU at all, so the author jumped straight to hand written SIMD code, whereas the CUDA compiler was vectorising quite happily so they didn't need to?
It looks like you can use #pragma directives to coerce clang into vectorising standard loops, so I'd actually be quite interested to know how compiled SIMD compares to hand written SIMD. My intuitive but entirely unscientific bet is that they'd be close to identical.
Yeah fair point, it just seems like they did a lot of hand optimization for the CPU and didn't even mention trying to optimize the GPU (e.g., by controlling register use, making sure the compiler is choosing the right instructions for each card, etc.). I guess really I just want to see their code :p
Thanks! One quick optimization note:
In the check loop, they would probably do better by ensuring the chunk size they are checking is divisible by 4 or 8 (or the like) and then hand-unrolling the check loop. Right now they are doing:
Check value == secret
increment value
check that value < stop
(Ignoring the set found instruction, as that's essentially a zero probability on any individual iteration).
Essentially, the check that the value is still in range is a third of the instructions in the performance-critical loop.
If they were instead doing:
Check value == secret
increment value
Check value == secret
increment value
Check value == secret
increment value
Check value == secret
increment value
check that value < stop
The range check has gone down to 1/9th of the instruction count (one bound check per nine instructions instead of one per three).
Considering that their metric is roughly proportional to seconds per secret check, this may yield some improvement. There are other concerns (e.g. increased register count) but it's worth trying :p
Good news! It's on GitHub: https://github.com/trailofbits/sixtyfour
That's an interesting point! You could open an issue or pull request, since they say help is welcome?