11 votes

DarkBERT: A language model for the dark side of the internet

4 comments

  1. skybrian

    I wondered how they collected the data. It seems they did the crawl themselves:

    We initially collect seed addresses from Ahmia and public repositories containing lists of onion domains. We then crawl the Dark Web for pages from the initial seed addresses and expand our list of domains, parsing each newly collected page with the HTML title and body elements of each page saved as a text file. We also classify each page by its primary language using fastText (Joulin et al., 2016a,b) and select pages labeled as English. This allows DarkBERT to be trained on English texts as the vast majority of Dark Web content is in English (Jin et al., 2022; He et al., 2019). A total of around 6.1 million pages was collected. The full statistics of the crawled Dark Web data is shown in Table 8 of the Appendix.
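
    A rough sketch of what that collection step might look like (a reconstruction from the description above, not the authors' code; the Tor proxy port, seed URL, and fastText model file are assumptions):

    # Requires requests[socks] for the Tor SOCKS proxy, plus beautifulsoup4 and fasttext.
    import requests
    import fasttext
    from bs4 import BeautifulSoup

    # Route requests through a local Tor SOCKS proxy (assumed to be on port 9050).
    TOR_PROXY = {"http": "socks5h://127.0.0.1:9050",
                 "https": "socks5h://127.0.0.1:9050"}

    # fastText language-ID model; lid.176.bin must be downloaded separately.
    lang_model = fasttext.load_model("lid.176.bin")

    def fetch_page(onion_url):
        """Fetch a .onion page through Tor; return None on failure."""
        try:
            resp = requests.get(onion_url, proxies=TOR_PROXY, timeout=60)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException:
            return None

    def extract_text(html):
        """Keep only the HTML title and body text, as the paper describes."""
        soup = BeautifulSoup(html, "html.parser")
        title = soup.title.get_text(strip=True) if soup.title else ""
        body = soup.body.get_text(separator=" ", strip=True) if soup.body else ""
        return title + "\n" + body

    def is_english(text):
        """Label the page's primary language and keep English pages only."""
        labels, _ = lang_model.predict(text.replace("\n", " "))
        return labels[0] == "__label__en"

    # Hypothetical driver loop; real seeds would come from Ahmia and public
    # onion-domain lists, with newly discovered links fed back into the queue.
    for url in ["http://exampleonionaddress.onion"]:
        html = fetch_page(url)
        if html is not None:
            text = extract_text(html)
            if is_english(text):
                print("keep", url, len(text), "chars")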

    6 votes
  2. rajtilakjee

    As studies on the Dark Web commonly require textual analysis of the domain, language models specific to the Dark Web may provide valuable insights to researchers. In this work, scientists have introduced DarkBERT, a language model pretrained on Dark Web data.
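
    For a rough sense of what "pretrained on Dark Web data" means mechanically, a minimal masked-language-modeling sketch with Hugging Face Transformers could look like the following. This is only an illustration: the base model, corpus file name, and hyperparameters are placeholders, not the paper's actual setup.

    from datasets import load_dataset
    from transformers import (DataCollatorForLanguageModeling, RobertaForMaskedLM,
                              RobertaTokenizerFast, Trainer, TrainingArguments)

    # Start from a generic English model and continue pretraining on domain text.
    tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
    model = RobertaForMaskedLM.from_pretrained("roberta-base")

    # darkweb_corpus.txt is a placeholder: one cleaned page per line.
    dataset = load_dataset("text", data_files={"train": "darkweb_corpus.txt"})["train"]
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True, remove_columns=["text"])

    # Randomly mask 15% of tokens and train the model to reconstruct them.
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="darkbert-sketch",
                               per_device_train_batch_size=8,
                               num_train_epochs=1),
        train_dataset=dataset,
        data_collator=collator,
    )
    trainer.train()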

    5 votes
  3. river

    This feels like an example of scientists doing something pointless for absolutely no reason other than to be able to write a paper on having done... something.

    1 vote
    1. burkaman

      Dark Web forums are often used for exchanging illicit information, and security experts monitor for noteworthy threads to gain up-to-date information for timely mitigation. Since many new forum posts emerge daily, it takes massive human resources to manually review each thread. Therefore, automating the detection of potentially malicious threads can significantly reduce the workload of security experts. Identifying noteworthy threads, however, requires a basic understanding of Dark Web-specific language

      I think this is a reasonable use case. Having a bot monitor forums for new material could be pretty beneficial for the world.
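
      As a very rough illustration of that kind of monitoring (purely a sketch; the model name, label, and threshold below are made up, not something the paper publishes), the flagging loop could be as simple as:

      from transformers import pipeline

      # Placeholder model name: a classifier fine-tuned to spot noteworthy threads.
      classifier = pipeline("text-classification",
                            model="your-org/darkbert-thread-classifier")

      # New forum threads pulled in by a crawler (hypothetical examples).
      new_threads = [
          {"id": 101, "text": "Selling fresh database dump from recent breach"},
          {"id": 102, "text": "Which VPN do you all recommend for privacy?"},
      ]

      for thread in new_threads:
          result = classifier(thread["text"], truncation=True)[0]
          # Flag high-confidence hits for a human analyst instead of acting automatically.
          if result["label"] == "noteworthy" and result["score"] > 0.9:
              print("flag thread", thread["id"], "score", round(result["score"], 2))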

      6 votes