29 votes

Toxic posts on economist job website traced to users from elite universities

Posted December 29, 2023 by unknown user

Tags: work, education.higher, websites, forums, usa, universities, sexism, racism, abuse, toxicity, economics job market rumors, research, author.catarina saraiva, source.bloomberg, paywall

Topic deleted by author

4 comments

[3]
Carrow
December 29, 2023

I was interested in the technical aspect and tracked down the source paper.

https://www.insidehighered.com/sites/default/files/2023-07/ejmr_paper_nber(1).pdf

That is, if a user visits EJMR from the IPv4 address
131.111.5.175 and posts on the topic with id 227259, EJMR assigns the username c2b1. This
is the four character interval at position 10-13 in e8b5eae32c2b197a0ac4cb889a9bbb8f417f3bff
which is the hexadecimal encoding of the SHA-1 hash of the string “227259131.111.5.175”
(ASCII encoded). In other words, the EJMR username is the hexadecimal representation of
the two bytes of data beginning at the 40th bit of the 20-byte big-endian SHA-1 hash. In
plain English, EJMR combines a visitor’s IP address with an integer topic id, hashes that
with SHA-1 and uses a part of that hash as the username.
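
A minimal sketch of the scheme described in the quoted passage, assuming the message is the decimal topic id immediately followed by the dotted IPv4 string, and reading "position 10-13" as character positions counted from 1 (the reading that yields c2b1 from the quoted digest). The function names are made up for illustration and are not from the paper or from EJMR.

```python
import hashlib


def username_from_digest(hex_digest: str) -> str:
    """Take characters 10-13 (counted from 1), i.e. indices 9..12 counted from 0."""
    return hex_digest[9:13]


def ejmr_style_username(topic_id: int, ip: str) -> str:
    """Hash topic id + IP with SHA-1 and keep four hex characters.

    Assumption: the two values are concatenated with no separator and
    ASCII encoded, as in the quoted example string.
    """
    message = f"{topic_id}{ip}".encode("ascii")
    return username_from_digest(hashlib.sha1(message).hexdigest())


# Slicing the digest exactly as printed in the quote:
quoted_digest = "e8b5eae32c2b197a0ac4cb889a9bbb8f417f3bff"
print(username_from_digest(quoted_digest))  # c2b1

# Recomputing from the raw inputs in the quote:
print(ejmr_style_username(227259, "131.111.5.175"))
```

If the digest in the quote is accurate, the second line should print the same username as the first.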

The researchers just guessed it.

19 votes
1. [2]
  tealblue
  December 29, 2023 (edited December 29, 2023)
  
  To recover IP addresses from the observed usernames on EJMR, we employ a multi-step
  procedure. First, we develop GPU-based software to quickly compute the SHA-1 hashes used
  for the username allocation algorithm on EJMR. In total, we compute almost 9 quadrillion hashes
  to fully enumerate all possible IP combinations and to check which of the resulting substrings
  of hashes match the observed usernames
  ...
  Our statistical test is very conservative and minimizes the probability of falsely assigning
  an IP address to a post because the p-value thresholds we employ are of the order of approximately 10^-11.
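
To illustrate the enumeration step in the quote, here is a toy CPU version that scans a single /24 instead of the whole address space. It is a sketch under assumptions, not the paper's GPU software: the username rule is the one guessed above, and the address, topic ids, and function names are hypothetical.

```python
import hashlib
import ipaddress


def username(topic_id: int, ip: str) -> str:
    """Assumed rule: four hex characters of SHA-1(topic id + IP)."""
    digest = hashlib.sha1(f"{topic_id}{ip}".encode("ascii")).hexdigest()
    return digest[9:13]


def candidates(observed: dict, network: str) -> list:
    """Return every address in `network` that reproduces all observed
    (topic_id -> username) pairs."""
    return [
        str(ip)
        for ip in ipaddress.ip_network(network)
        if all(username(t, str(ip)) == u for t, u in observed.items())
    ]


# Hypothetical data: pretend one address posted in two topics.
secret = "192.0.2.77"
observed = {1001: username(1001, secret), 1002: username(1002, secret)}
print(candidates(observed, "192.0.2.0/24"))
```

A four-character username carries only 16 bits, so across all 2^32 IPv4 addresses a single post would be expected to match about 65,536 addresses if the hash output is uniform. Narrowing that down needs several posts attributed to the same address, which is presumably why the paper pairs the enumeration with a statistical test.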
  
  The concern though is that they guessed the function for how usernames are generated:
  
  We guessed that EJMR’s usernames were generated as follows: u = S(H(M(t, a, o)))
  
  This was confirmed with only three usernames, which seems like a small sample. I'm not sure what a good number would be, but why not create a batch of their own posts to generate usernames to test against?
  
  Edit: Though if I had to bet, I'd say they got it right. It would seem quite unlikely to have gotten the first username to work with an unconvoluted string formula by luck and, assuming the SHA-1 function is pretty chaotic w.r.t. input (I don't know if this actually holds, so correct me if I'm wrong), extraordinarily unlikely that it would work for the next two usernames. The language correlations on p. 29 are also pretty convincing.
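
The "unlikely by luck" argument can be put into rough numbers. This is a back-of-the-envelope sketch that assumes a wrong formula produces each four-character hex username uniformly and independently; the number of formulas tried is a hypothetical parameter, since the thread does not say how many were attempted.

```python
USERNAME_HEX_CHARS = 4

# Chance that a wrong formula reproduces one observed username by luck.
p_single = 1 / 16 ** USERNAME_HEX_CHARS  # 1 in 65,536

# Chance that it reproduces three independent usernames.
p_three = p_single ** 3  # 2 ** -48


def union_bound(num_formulas_tried: int) -> float:
    """Upper bound on the chance that any of the wrong formulas tried
    matches all three usernames."""
    return min(1.0, num_formulas_tried * p_three)


print(p_single, p_three, union_bound(1000))
```

Under these assumptions a chance match on all three usernames is on the order of 3.6e-15, and it stays negligible even if a thousand candidate formulas were tried.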
  
  8 votes
  1. saturnV
    December 29, 2023
    
    assuming the SHA-1 function is pretty chaotic w.r.t. input
    
    Yes, this property is very important for hash functions; it is called the avalanche effect.
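
A quick way to see the avalanche effect is to change one character of the input and count how many of the 160 output bits flip. The input strings below reuse the example from the paper quote with the last digit of the IP changed; the helper names are made up for illustration.

```python
import hashlib


def sha1_as_int(text: str) -> int:
    """SHA-1 digest of the ASCII text as a 160-bit integer."""
    return int.from_bytes(hashlib.sha1(text.encode("ascii")).digest(), "big")


def differing_bits(a: str, b: str) -> int:
    """Number of digest bits that differ between the two inputs."""
    return bin(sha1_as_int(a) ^ sha1_as_int(b)).count("1")


# Same topic id, IP addresses that differ only in the final digit.
print(differing_bits("227259131.111.5.175", "227259131.111.5.176"))
```

For a hash with good avalanche behaviour the count should land near 80, about half of the 160 bits, so neighbouring IP addresses give unrelated usernames.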
    
    3 votes
cfabbro
December 29, 2023

Mirror, for those hit by the paywall:
https://archive.is/6reTK

7 votes