25 votes

Downtime due to sign up spam

13 comments

  1. [9]
    creesch
    Link
    It took me a little while to fully understand what is being done now as the blog post, repo and even the wiki page all seem to suffer from a bit of a "curse of knowledge" level skipping over...
    • Exemplary

    It took me a little while to fully understand what is being done now as the blog post, repo and even the wiki page all seem to suffer from a bit of a "curse of knowledge" level skipping over basics.

    Basically, when a user fills in their data and hits the "signup" button something along the following lines happens:

    • The user data isn't actually send to the server straight away.
    • Instead the client is requested to first hash (part of) of the user data.
    • Normally, that isn't as much of an issue. But the client has gotten an extra challenge: "The hash needs to have X amount of leading zeros".
    • Hashing with the same data will always result in the same hash.
    • So to get a new hash you need to add other data. To get to a specific amount of leading zeros in your hash outcome, you need a specific bit of data. This is called the nonce.
    • The client does not know what that specific bit of data is, so it has to calculate multiple hashes adding different bits of data until it finally gets a hash with the correct amount of leading zeros.
    • Once the client finds a combination of user data and nonce resulting in the right amount of leading zeros it sends all three (user data, hash and nonce) to the server.
    • The server is provided the nonce on a silver platter so can very easily verify the hash.

    For a normal user on a normal computer the extra compute power needed to do the hashing isn't really an issue. But for mass spammers the added computational power at their scale makes it really expensive.

    I still feel like I wrote it out like "draw the rest of the owl". But hopefully it does clear up things a bit for people after me. Who, just like me, know just enough about this sort of thing to want to know how it work but also not quite enough to make it effortless :P

    53 votes
    1. [6]
      teaearlgraycold
      Link Parent
      To someone familiar with the Bitcoin proof-of-work algorithm it's clear what's happening here. But not everyone is familiar with that.

      To someone familiar with the Bitcoin proof-of-work algorithm it's clear what's happening here. But not everyone is familiar with that.

      9 votes
      1. [4]
        Crespyl
        Link Parent
        Bitcoin was in part inspired by Hashcash, a very similar PoW system from the 90's designed to combat email spam. In principle the idea of requiring a small-but-measurable cost (either of actual...

        Bitcoin was in part inspired by Hashcash, a very similar PoW system from the 90's designed to combat email spam.

        In principle the idea of requiring a small-but-measurable cost (either of actual money like a USPS stamp, or electricity/compute as in PoW) seems like a good anti-spam measure for a lot of systems, but I'm not sure how well it scales in a world full of crypto ASICs and GPU compute.

        16 votes
        1. bitshift
          Link Parent
          Speaking of Hashcash, the papers section on their website is delightful reading, specifically some of the later papers under "Related Work". There are some more recent proof-of-work schemes which...

          Speaking of Hashcash, the papers section on their website is delightful reading, specifically some of the later papers under "Related Work". There are some more recent proof-of-work schemes which emphasize memory-hard problems — RAM and memory bandwidth being where ASICs start to lose their advantage.

          The clever bit is that these schemes can be designed to be memory-easy to verify, yet memory-hard to compute. For example, the client might have to find N different values whose hashes all have the same prefix; the client's best strategy is to generate many more than N hashes and store all of them in memory until the birthday paradox kicks in. But once the N values are found, the server's work is easy: hash those N values and make sure the prefix doesn't change.

          5 votes
        2. Wulfsta
          Link Parent
          Building on this thought, there are now other proof of work schemes in use that are memory bound - I wonder if perhaps the underlying idea of requiring proof from the user is more valuable here,...

          Building on this thought, there are now other proof of work schemes in use that are memory bound - I wonder if perhaps the underlying idea of requiring proof from the user is more valuable here, and something with a better compute/memory tradeoff could be used instead?

          4 votes
        3. creesch
          Link Parent
          Hashcash is indeed what we are discussing, the wiki page I mentioned in my comment is the same you just linked ;) As for your concern, I suppose it depends on how worthwhile the target is? Sure it...

          Hashcash is indeed what we are discussing, the wiki page I mentioned in my comment is the same you just linked ;)

          As for your concern, I suppose it depends on how worthwhile the target is? Sure it is possible to throw extra resources against it, but a lot of spam automation depends on dirt cheap truly massive scales. Often fueled by less than legal compute in the form of botnets and other random hardware. So if the target isn't as lucrative, I can see it not being worthwhile for spammers to throw specific hardware against the issue.

          4 votes
      2. creesch
        Link Parent
        I'd be willing to bet that "not everyone" basically amounts to the majority of people ;) That's basically the "Curse of knowledge" bias I was mentioning in my comment. Even if I had been more...

        I'd be willing to bet that "not everyone" basically amounts to the majority of people ;)

        That's basically the "Curse of knowledge" bias I was mentioning in my comment. Even if I had been more familiar with Bitcoin, I might have still not made the link directly. The blog posts merely says

        "simple implementation appending a counter to the email address and checking if the sha256 of it has a certain number of leading zeros (difficulty)."

        Which made me curious, but didn't explain how the leading zeros made it difficult. The github repo then mentions nonce as extra data and links to the wikipedia page. The wikipedia page is then just a bit too messy to make it out easily. It heavily focuses on the mail use case and the detailed implementation.
        It took me reading the page a few times to distil the key bits. Then double-checking elsewhere if I had actually understood it.

        Now going back I do note that there is a reference to Bitcoin, but that would not really have helped me personally. And likely also not a lot of other people. So to prevent them from having to go through and distill all that info I made my comment ;)

        8 votes
    2. [2]
      Wulfsta
      Link Parent
      For added context, this is more or less how blockchains work - to add a block you have to do something like the above (edit: got beaten to the punch by @teaearlgraycold). I thought this was a much...

      For added context, this is more or less how blockchains work - to add a block you have to do something like the above (edit: got beaten to the punch by @teaearlgraycold). I thought this was a much more useful and clever place to utilize the scheme than cryptocurrency though.

      Another interesting bit is that it does open the server up to another type of spam, if there is no back-off or rate limiting enabled. A malicious actor could send the server nonsense (userdata, nonce) pairs and incur higher computation costs for the server. At the same time, it serves as an excellent validation tool against spam by putting a burden of proof on the end user.

      6 votes
      1. creesch
        Link Parent
        Yeah I had no clue how blockchains work either. It is one of those technologies I have witnessed from a distance but never really felt the need to look into too closely myself. But now I know for...

        Yeah I had no clue how blockchains work either. It is one of those technologies I have witnessed from a distance but never really felt the need to look into too closely myself. But now I know for the future :)

        I also did respond to teaearlgraycold here if you find it interesting.

  2. Wulfsta
    Link
    This is a really clever approach to introduce a bottleneck to this signup system, and want to share the idea here.

    This is a really clever approach to introduce a bottleneck to this signup system, and want to share the idea here.

    8 votes
  3. [3]
    bengine
    Link
    Would have been nice to see before/after to know how effective it was.

    Would have been nice to see before/after to know how effective it was.

    2 votes
    1. [2]
      FluffyKittens
      Link Parent
      Yeah I can’t see this doing much useful. Spambots are typically fully automated, so unless you’re making new users wait >30 seconds to register, I don’t see how the computation cost is anything...

      Yeah I can’t see this doing much useful.

      Spambots are typically fully automated, so unless you’re making new users wait >30 seconds to register, I don’t see how the computation cost is anything more than a drop in the bucket from the POV of the spam operators - at least compared to a more traditional captcha-type fix.

      1. TangibleLight
        Link Parent
        To give a rough idea for orders of magnitude: on my machine it takes about 1.5s of CPU time to find a nonce for my email on difficulty 4. It takes 0.1s on the same difficulty to find one for...
        • Exemplary

        To give a rough idea for orders of magnitude: on my machine it takes about 1.5s of CPU time to find a nonce for my email on difficulty 4. It takes 0.1s on the same difficulty to find one for example@example.com. On difficulty 5 those numbers are 10s and 20s respectively.

        For the record, I'm testing with Python, not Elm. There's probably a way to optimize this, but I'm just looking for a coarse orders-of-magnitude estimate on total CPU time. Obviously you could search with more threads if available, but to a scammer paying per-cycle, that's the same cost.
        from hashlib import sha256
        from itertools import count
        
        email = 'example@example.com'
        
        difficulty = 4
        prefix = '0' * (difficulty + 1)
        
        for nonce in count():
          sha = sha256(f'{email}{nonce}'.encode()).hexdigest()
          if sha.startswith(prefix):
            break
        
        print(sha, nonce)
        

        That seems like a lot of variation to be dealing with. And, my CPU is relatively quick; imagine running on a chromebook or budget phone or similar. I don't have any older hardware on hand to test... not that still works anyway.

        Maybe you are making those users wait 30+ seconds to register. I guess the idea is to count on those users not minding since they only do it once. You could also compute the nonce in the background, as soon as they populate the email field, while they fill out the rest of the sign-in form. Real humans take a few seconds to do that, spammers do it instantly.

        I suppose for a spammer it just averages out to high CPU load, and if they happen to be running on some cloud service, paying per-cycle, then yes, it'll be expensive it. BUT if they're running on their own hardware, they can just... wait. For them it's no different than a time gate, and the length of that time gate just depends on the hardware they happen to have.

        1 vote