29 votes

Toxic posts on economist job website traced to users from elite universities

6 comments

  1. [3]
    Carrow

    I was interested in the technical aspect and tracked down the source paper.

    https://www.insidehighered.com/sites/default/files/2023-07/ejmr_paper_nber(1).pdf

    That is, if a user visits EJMR from the IPv4 address 131.111.5.175 and posts on the topic with id 227259, EJMR assigns the username c2b1. This is the four character interval at position 10-13 in e8b5eae32c2b197a0ac4cb889a9bbb8f417f3bff which is the hexadecimal encoding of the SHA-1 hash of the string “227259131.111.5.175” (ASCII encoded). In other words, the EJMR username is the hexadecimal representation of the two bytes of data beginning at the 40th bit of the 20-byte big-endian SHA-1 hash. In plain English, EJMR combines a visitor’s IP address with an integer topic id, hashes that with SHA-1 and uses a part of that hash as the username.

    The researchers just guessed it.
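
    For anyone who wants to poke at it, here is a minimal Python sketch of the scheme as the quoted passage describes it. The function name `ejmr_username` is my own, and the slice offsets are read off the quote ("position 10-13", counted from 1), so treat this as an illustration of the description rather than EJMR's actual code.

```python
import hashlib


def ejmr_username(topic_id: int, ip: str) -> str:
    """Hypothetical helper: derive a username as the quoted passage describes."""
    # Concatenate the topic id and the IP address, ASCII encoded.
    message = f"{topic_id}{ip}".encode("ascii")
    digest_hex = hashlib.sha1(message).hexdigest()
    # Positions 10-13 counted from 1 correspond to the slice [9:13].
    return digest_hex[9:13]


# The digest quoted in the paper; slicing it shows where the username sits.
PAPER_DIGEST = "e8b5eae32c2b197a0ac4cb889a9bbb8f417f3bff"
print(PAPER_DIGEST[9:13])  # c2b1

# The paper reports the same username for this topic id and IP address.
print(ejmr_username(227259, "131.111.5.175"))
```

    The first print only slices the digest string quoted from the paper; the second recomputes the hash locally so the two can be compared.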

    19 votes
    1. [2]
      tealblue
      (edited )

      To recover IP addresses from the observed usernames on EJMR, we employ a multi-step procedure. First, we develop GPU-based software to quickly compute the SHA-1 hashes used for the username allocation algorithm on EJMR. In total, we compute almost 9 quadrillion hashes to fully enumerate all possible IP combinations and to check which of the resulting substrings of hashes match the observed usernames
      ...
      Our statistical test is very conservative and minimizes the probability of falsely assigning an IP address to a post because the p-value thresholds we employ are of the order of approximately 10^-11.
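
      To make the enumeration step concrete, here is a toy CPU version in Python. It is only a sketch: the paper used GPU software over the whole IPv4 space plus a statistical test, whereas this scans one subnet and intersects candidates. The helper names are mine, and the slice `[9:13]` is taken from the hashing description quoted elsewhere in this thread.

```python
import hashlib
import ipaddress


def username_for(topic_id: int, ip: str) -> str:
    """Username under the assumed scheme: hex chars 10-13 of SHA-1(topic_id + ip)."""
    digest_hex = hashlib.sha1(f"{topic_id}{ip}".encode("ascii")).hexdigest()
    return digest_hex[9:13]


def candidate_ips(topic_id: int, username: str, network: str) -> list[str]:
    """Return every address in `network` that would produce `username` on `topic_id`."""
    return [
        str(addr)
        for addr in ipaddress.ip_network(network)
        if username_for(topic_id, str(addr)) == username
    ]


def candidates_across_posts(observations, network: str) -> set[str]:
    """Intersect the candidate sets of several (topic_id, username) observations."""
    common = None
    for topic_id, username in observations:
        found = set(candidate_ips(topic_id, username, network))
        common = found if common is None else common & found
    return common or set()


# Simulate one poster appearing in three topics, then try to recover the address.
true_ip = "131.111.5.175"
observations = [(t, username_for(t, true_ip)) for t in (227259, 227260, 227261)]
print(candidates_across_posts(observations, "131.111.5.0/24"))
```

      A single 4-character username leaves many colliding addresses across the full address space, which is presumably why the paper combines multiple posts with a strict statistical threshold.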

      The concern though is that they guessed the function for how usernames are generated:

      We guessed that EJMR’s usernames were generated as follows: u = S(H(M(t, a, o)))

      This was confirmed with only three usernames, which seems like a small sample. I'm not sure what a good number would be, but why not create a bunch of their own posts to generate usernames to test?

      Edit: Though if I had to bet, I'd say they got it right. It would seem quite unlikely to have gotten the first username to work with an unconvoluted string formula by luck and, assuming the SHA-1 function is pretty chaotic w.r.t. input (I don't know if this actually holds, so correct me if I'm wrong), extraordinarily unlikely that it would work for the next two usernames. The language correlations on p. 29 are also pretty convincing.
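
      A rough back-of-the-envelope version of that bet, as a Python sketch. It assumes a wrong formula would produce usernames that are uniform and independent over the 16^4 possible 4-character hex strings, and it ignores however many candidate formulas were tried before one fit, so the real odds are somewhat less extreme. The function name is made up for illustration.

```python
def chance_match_probability(n_usernames: int, hex_chars: int = 4) -> float:
    """Probability that a wrong formula matches n usernames purely by luck.

    Assumes each username is an independent, uniform draw from 16**hex_chars values.
    """
    per_username = 1.0 / 16 ** hex_chars
    return per_username ** n_usernames


print(chance_match_probability(1))  # one username: 1 in 65,536, roughly 1.5e-05
print(chance_match_probability(3))  # three usernames: 2**-48, roughly 3.6e-15
```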

      8 votes
      1. saturnV

        assuming the SHA-1 function is pretty chaotic w.r.t. input

        Yes, this property is very important for hash functions and is called the avalanche effect.
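
        A quick way to see it, sketched in Python: hash two inputs that differ in a single character and count how many of the 160 output bits differ. The helper name is mine; for a well-behaved hash the count should land somewhere near half the bits.

```python
import hashlib


def sha1_bits_changed(a: bytes, b: bytes) -> int:
    """Count the bits that differ between the SHA-1 digests of a and b."""
    da = int.from_bytes(hashlib.sha1(a).digest(), "big")
    db = int.from_bytes(hashlib.sha1(b).digest(), "big")
    # XOR leaves a 1 wherever the two digests disagree.
    return bin(da ^ db).count("1")


# Same topic id, IP addresses differing only in the last digit.
changed = sha1_bits_changed(b"227259131.111.5.175", b"227259131.111.5.176")
print(f"{changed} of 160 output bits differ")
```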

        3 votes
  2. ignorabimus

    Sorry, I know that this is from a little while ago, but I think it's an interesting case, both in terms of academia and the nature of internet forums.

    “EJMR is sometimes dismissed as not being representative of the economics profession,” wrote the authors, Boston University’s Florian Ederer and Yale’s Paul Goldsmith-Pinkham and Kyle Jensen. “Our analysis reveals that the users who post on EJMR are predominantly economists, including those working in the upper echelons of academia, government and the private sector.”

    8 votes