11 votes

DarkBERT: A language model for the dark side of the internet

4 comments

  1. skybrian

    I wondered how they collected the data. It seems they did the crawl themselves:

    We initially collect seed addresses from Ahmia and public repositories containing lists of onion domains. We then crawl the Dark Web for pages from the initial seed addresses and expand our list of domains, parsing each newly collected page with the HTML title and body elements of each page saved as a text file. We also classify each page by its primary language using fastText (Joulin et al., 2016a,b) and select pages labeled as English. This allows DarkBERT to be trained on English texts as the vast majority of Dark Web content is in English (Jin et al., 2022; He et al., 2019). A total of around 6.1 million pages was collected. The full statistics of the crawled Dark Web data is shown in Table 8 of the Appendix.
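
    A rough sketch of what that collection step might look like (a reconstruction from the description above, not the authors' code; the Tor proxy port, seed URL, and fastText model file are assumptions):

    # Requires requests[socks] for the Tor SOCKS proxy, plus beautifulsoup4 and fasttext.
    import requests
    import fasttext
    from bs4 import BeautifulSoup

    # Route requests through a local Tor SOCKS proxy (assumed to be on port 9050).
    TOR_PROXY = {"http": "socks5h://127.0.0.1:9050",
                 "https": "socks5h://127.0.0.1:9050"}

    # fastText language-ID model; lid.176.bin must be downloaded separately.
    lang_model = fasttext.load_model("lid.176.bin")

    def fetch_page(onion_url):
        """Fetch a .onion page through Tor; return None on failure."""
        try:
            resp = requests.get(onion_url, proxies=TOR_PROXY, timeout=60)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException:
            return None

    def extract_text(html):
        """Keep only the HTML title and body text, as the paper describes."""
        soup = BeautifulSoup(html, "html.parser")
        title = soup.title.get_text(strip=True) if soup.title else ""
        body = soup.body.get_text(separator=" ", strip=True) if soup.body else ""
        return title + "\n" + body

    def is_english(text):
        """Label the page's primary language and keep English pages only."""
        labels, _ = lang_model.predict(text.replace("\n", " "))
        return labels[0] == "__label__en"

    # Hypothetical driver loop; real seeds would come from Ahmia and public
    # onion-domain lists, with newly discovered links fed back into the queue.
    for url in ["http://exampleonionaddress.onion"]:
        html = fetch_page(url)
        if html is not None:
            text = extract_text(html)
            if is_english(text):
                print("keep", url, len(text), "chars")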

    6 votes
  2. rajtilakjee

    As studies on the Dark Web commonly require textual analysis of the domain, language models specific to the Dark Web may provide valuable insights to researchers. In this work, scientists have introduced DarkBERT, a language model pretrained on Dark Web data.
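
    For a rough sense of what "pretrained on Dark Web data" means mechanically, a minimal masked-language-modeling sketch with Hugging Face Transformers could look like the following. This is only an illustration: the base model, corpus file name, and hyperparameters are placeholders, not the paper's actual setup.

    from datasets import load_dataset
    from transformers import (DataCollatorForLanguageModeling, RobertaForMaskedLM,
                              RobertaTokenizerFast, Trainer, TrainingArguments)

    # Start from a generic English model and continue pretraining on domain text.
    tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
    model = RobertaForMaskedLM.from_pretrained("roberta-base")

    # darkweb_corpus.txt is a placeholder: one cleaned page per line.
    dataset = load_dataset("text", data_files={"train": "darkweb_corpus.txt"})["train"]
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True, remove_columns=["text"])

    # Randomly mask 15% of tokens and train the model to reconstruct them.
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="darkbert-sketch",
                               per_device_train_batch_size=8,
                               num_train_epochs=1),
        train_dataset=dataset,
        data_collator=collator,
    )
    trainer.train()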

    5 votes
  3. river

    This feels like an example of scientists doing something pointless for absolutely no reason other than to be able to write a paper on having done... something.

    1 vote
    1. burkaman

      Dark Web forums are often used for exchanging illicit information, and security experts monitor for noteworthy threads to gain up-to-date information for timely mitigation. Since many new forum posts emerge daily, it takes massive human resources to manually review each thread. Therefore, automating the detection of potentially malicious threads can significantly reduce the workload of security experts. Identifying noteworthy threads, however, requires a basic understanding of Dark Web-specific language

      I think this is a reasonable use case. Having a bot monitor forums for new material could be pretty beneficial for the world.
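
      As a very rough illustration of that kind of monitoring (purely a sketch; the model name, label, and threshold below are made up, not something the paper publishes), the flagging loop could be as simple as:

      from transformers import pipeline

      # Placeholder model name: a classifier fine-tuned to spot noteworthy threads.
      classifier = pipeline("text-classification",
                            model="your-org/darkbert-thread-classifier")

      # New forum threads pulled in by a crawler (hypothetical examples).
      new_threads = [
          {"id": 101, "text": "Selling fresh database dump from recent breach"},
          {"id": 102, "text": "Which VPN do you all recommend for privacy?"},
      ]

      for thread in new_threads:
          result = classifier(thread["text"], truncation=True)[0]
          # Flag high-confidence hits for a human analyst instead of acting automatically.
          if result["label"] == "noteworthy" and result["score"] > 0.9:
              print("flag thread", thread["id"], "score", round(result["score"], 2))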

      6 votes