I was interested in the technical aspect and tracked down the source paper: https://www.insidehighered.com/sites/default/files/2023-07/ejmr_paper_nber(1).pdf
The researchers just guessed the username scheme. From the paper:
That is, if a user visits EJMR from the IPv4 address
131.111.5.175 and posts on the topic with id 227259, EJMR assigns the username c2b1. This
is the four character interval at position 10-13 in e8b5eae32c2b197a0ac4cb889a9bbb8f417f3bff
which is the hexadecimal encoding of the SHA-1 hash of the string “227259131.111.5.175”
(ASCII encoded). In other words, the EJMR username is the hexadecimal representation of
the two bytes of data beginning at the 40th bit of the 20-byte big-endian SHA-1 hash. In
plain English, EJMR combines a visitor’s IP address with an integer topic id, hashes that
with SHA-1 and uses a part of that hash as the username.
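For anyone who wants to poke at this, here is a minimal sketch of the scheme as I read the quoted passage. The function name `ejmr_username` is my own, and the concatenation order (topic id first, then the IP, no separator) is taken from the example string in the quote. I have not run this against the site, so treat it as an illustration rather than EJMR's actual code.

```python
import hashlib

def ejmr_username(topic_id: int, ip: str) -> str:
    """Guess at the scheme described in the quote (hypothetical helper)."""
    # Message is the topic id immediately followed by the dotted IPv4 string.
    message = f"{topic_id}{ip}".encode("ascii")
    digest = hashlib.sha1(message).hexdigest()  # 40 hex characters
    # Characters 10-13 (1-indexed) are the slice [9:13] in Python.
    # Note: hex position 10 starts 36 bits into the digest; I follow the
    # "position 10-13" wording from the quote here.
    return digest[9:13]

# The digest quoted in the paper, copied verbatim from the passage above.
QUOTED_DIGEST = "e8b5eae32c2b197a0ac4cb889a9bbb8f417f3bff"
print(QUOTED_DIGEST[9:13])  # c2b1

# Recomputing from the example inputs in the quote.
print(ejmr_username(227259, "131.111.5.175"))
```

The slice on the quoted digest gives the username from the paper's example, so at least the position arithmetic is consistent with the text.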
To recover IP addresses from the observed usernames on EJMR, we employ a multi-step
procedure. First, we develop GPU-based software to quickly compute the SHA-1 hashes used
for the username allocation algorithm on EJMR. In total, we compute almost 9 quadrillion hashes
to fully enumerate all possible IP combinations and to check which of the resulting substrings
of hashes match the observed usernames
...
Our statistical test is very conservative and minimizes the probability of falsely assigning
an IP address to a post because the p-value thresholds we employ are of the order of approximately 10^-11.
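To make the recovery step concrete, here is a toy CPU version of the enumeration, assuming the username is hex characters 10-13 of SHA-1(topic id + IP) as in the paper's example. A username is only 16 bits, so if the hash is roughly uniform, about 2^32 / 2^16 = 65,536 IPv4 addresses match any single username on a given topic. That is why one post is not enough and the paper needs a statistical test across many posts. The sketch searches only one /16 instead of the whole address space, and the second and third topic ids are made up for illustration.

```python
import hashlib
import ipaddress

def username(topic_id: int, ip: str) -> str:
    """Hex characters 10-13 (1-indexed) of SHA-1(topic id + IP), per the quoted scheme."""
    digest = hashlib.sha1(f"{topic_id}{ip}".encode("ascii")).hexdigest()
    return digest[9:13]

def candidates(topic_id: int, observed: str, network: str) -> set[str]:
    """Every address in `network` that would yield `observed` on this topic."""
    return {
        str(ip)
        for ip in ipaddress.ip_network(network)
        if username(topic_id, str(ip)) == observed
    }

# Hypothetical poster: the IP from the paper's example, posting in three topics.
true_ip = "131.111.5.175"
topic_ids = [227259, 300001, 300002]  # last two ids are invented for this sketch
observed = {t: username(t, true_ip) for t in topic_ids}

# Intersect the candidate sets: an address must explain every observed username.
survivors = None
for t in topic_ids:
    hits = candidates(t, observed[t], "131.111.0.0/16")
    survivors = hits if survivors is None else survivors & hits
    print(t, len(hits), "matches,", len(survivors), "still consistent")
```

Each extra topic cuts the expected number of accidental survivors by a factor of about 65,536, so a handful of posts from the same address narrows things down quickly. The real job is the same idea over all 2^32 addresses and every topic, hence the GPUs.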
The concern though is that they guessed the function for how usernames are generated:
We guessed that EJMR’s usernames were generated as follows: u = S(H(M(t, a, o)))
They confirmed this guess with only three usernames, which seems like a small sample. I'm not sure what a good number would be, but why not create a bunch of their own posts to generate usernames to test?
Edit: Though if I had to bet, I'd say they got it right. It would seem quite unlikely to have gotten the first username to work with an unconvoluted string formula by luck and, assuming the SHA-1 function is pretty chaotic w.r.t. input (I don't know if this actually holds, so correct me if I'm wrong), extraordinarily unlikely that it would work for the next two usernames. The language correlations on p. 29 are also pretty convincing.
There was a great episode on NPR's Planet Money that discussed this paper with the author. https://www.npr.org/2023/12/15/1197956091/econ-job-market-rumors-ip-addresses
Sorry, I know that this is from a little while ago, but I think it's an interesting case. Both in terms of academia and the nature of internet forums.
“EJMR is sometimes dismissed as not being representative of the economics profession,” wrote the authors, Boston University’s Florian Ederer and Yale’s Paul Goldsmith-Pinkham and Kyle Jensen. “Our analysis reveals that the users who post on EJMR are predominantly economists, including those working in the upper echelons of academia, government and the private sector.”
Yes, that sensitivity of the output to the input is very important for hash functions; it is called the avalanche effect.
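A quick way to see the avalanche effect, and to put a rough number on the "three usernames" worry: the sketch below counts how many of the 160 SHA-1 output bits change when the last digit of the IP changes, then computes the chance that one fixed wrong formula would reproduce three 4-hex-character usernames by luck. That second number assumes the outputs behave like independent uniform 16-bit values and that only a single candidate formula was tried; if the authors tried many formulas, the odds of a fluke go up accordingly. The helper names are mine.

```python
import hashlib

def sha1_as_int(text: str) -> int:
    """SHA-1 digest of `text` as a 160-bit integer."""
    return int.from_bytes(hashlib.sha1(text.encode("ascii")).digest(), "big")

def differing_bits(a: str, b: str) -> int:
    """Number of digest bits that differ between the two inputs."""
    return bin(sha1_as_int(a) ^ sha1_as_int(b)).count("1")

# Two messages that differ only in the final digit of the IP address.
flipped = differing_bits("227259131.111.5.175", "227259131.111.5.176")
print(f"{flipped} of 160 digest bits differ")

# A username is 4 hex characters, so 16**4 = 65,536 possible values.
# Chance that a single wrong formula matches three usernames by accident:
chance_three_matches = (1 / 16**4) ** 3
print(chance_three_matches)
```

For a well-behaved hash you would expect roughly half the bits to differ, and the three-match probability comes out to 2^-48, which is why three confirmations are stronger evidence than they sound.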
Mirror, for those hit by the paywall:
https://archive.is/6reTK