9 votes

Investigating toxicity changes of cross-community Redditors from two billion posts and comments

5 comments

  1. [5]
    Amarok
    Link

    In this research, we investigated users’ toxic cross-community behavior based on the toxicity of their posts and comments. Our fine-tuned BERT model achieved a classification accuracy of 91.27% and an average F1 score of 0.79, showing a 2% and 7% improvement in performance compared to the best-performing baseline models based on neural networks. We addressed RQ1 by running a prediction experiment on the posts and comments from our Reddit collection. The analysis showed that 9.33% of the posts are toxic, and 17.6% of the comments are toxic. We answered RQ2 by investigating the changes in the toxicity of users’ content across communities based on two primary conditions.

    First, our analysis showed that 30.68% of posting users changed their toxicity levels, as did 81.67% of commenting users, mainly across multiple communities. Furthermore, in answering RQ3 we found that, over time, toxicity disperses as the number of participating users and the frequency of cross-community participation increase. This finding is helpful because it gives community moderators leads for tracking patterns among active users and preventing them from spreading toxic content online.

    Lastly, we conducted a Granger causality test between the volume of comments, the volume of links in comments, and the volume of toxicity. We found that links in comments can influence toxicity within those comments. This research addresses a prominent issue in social media platforms: toxic behavior negatively impacts other users’ experience. Thus, we believe it is necessary to conduct more research on users’ toxic behavior to help us understand the behavior’s dynamics.

    The citations in this article are a goldmine for related research. I think we all knew this already, but it's good to have the numbers and the rigor.
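
    As a rough illustration of the classification step the quote describes (not the authors' actual model; their fine-tuned weights aren't public here), a minimal inference sketch with the Hugging Face transformers library might look like this. The publicly available unitary/toxic-bert checkpoint is a stand-in, and the 0.5 decision threshold is an assumption:

        import torch
        from transformers import AutoModelForSequenceClassification, AutoTokenizer

        # Stand-in checkpoint: a BERT model fine-tuned for toxicity detection.
        # This illustrates the inference step only, not the paper's exact model.
        MODEL = "unitary/toxic-bert"
        tok = AutoTokenizer.from_pretrained(MODEL)
        model = AutoModelForSequenceClassification.from_pretrained(MODEL)
        model.eval()

        def toxicity_scores(texts, threshold=0.5):
            """Return (is_toxic, probability) per text, thresholding the 'toxic' label."""
            batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
            with torch.no_grad():
                logits = model(**batch).logits
            probs = torch.sigmoid(logits)  # multi-label head: one sigmoid per label
            idx = model.config.label2id["toxic"]  # assumes a label literally named "toxic"
            return [(p[idx].item() >= threshold, p[idx].item()) for p in probs]

        samples = [
            "Thanks for the detailed writeup, this was genuinely useful.",
            "Nobody asked for your worthless opinion, get lost.",
        ]
        for text, (flag, score) in zip(samples, toxicity_scores(samples)):
            print(f"toxic={flag}  p={score:.2f}  {text!r}")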

    6 votes
    1. [2]
      vektor
      Link Parent

      Don't have the time to dig too deep right now, so I'm just going to go off your comment here and will read more later. Still wanna leave a note here that Granger causality is not proper causality. It's my understanding that you can't test for proper causality without performing interventions. See also clinical trials: observational studies (which this basically seems to be) are much weaker than randomized controlled trials, which perform interventions and can thus establish proper causality.

      Granger causality can be computed without interventions, but it can't demonstrate proper causality, only that one time series helps predict another.
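
      To make that concrete, here's a small sketch of the kind of test being discussed, using statsmodels on synthetic data (the series names and the built-in two-step lag are invented for illustration). A low p-value only says past link volume improves forecasts of toxicity volume; it says nothing about what would happen if you intervened on the links:

          import numpy as np
          from statsmodels.tsa.stattools import grangercausalitytests

          rng = np.random.default_rng(42)
          n = 500

          # Synthetic daily series: link volume, plus a toxicity volume that
          # (by construction) echoes link volume two steps later.
          links = rng.poisson(5, n).astype(float)
          toxicity = 0.4 * np.concatenate([np.zeros(2), links[:-2]]) + rng.normal(0, 1, n)

          # Column 0 is the predicted series, column 1 the candidate predictor:
          # does past `links` improve forecasts of `toxicity`?
          data = np.column_stack([toxicity, links])
          results = grangercausalitytests(data, maxlag=4)
          # Expect small p-values around lag 2 -- predictive precedence,
          # not interventional causality.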

      6 votes
      1. Amarok
        Link Parent

        Thanks for the clarification.

        2 votes
    2. [2]
      NoblePath
      Link Parent

      This may be in TFA, but did they control for bots?

      4 votes
      1. Amarok
        Link Parent

        They did indeed.

        Additionally, we excluded bot users (i.e., automated accounts) to avoid potential biases in the subsequent analysis by using a publicly available list of 320 bots on Reddit (https://www.reddit.com/r/autowikibot/wiki/redditbots; retrieved on May 13, 2019). Since the available bot list is outdated, it potentially misses newer bot accounts. Thus, for this work, we used the Pushshift API (https://pushshift.io; retrieved on May 22, 2019) to retrieve a list of accounts with a minimal comment reply delay. Setting the comment reply delay to 30 s allowed us to find more bot accounts that quickly reply to other users. We removed additional bot accounts by combining the bot list and the Pushshift API list. When conducting this study, we found 37 bot accounts that produced around 2% of the content. The massive volume of bot-generated content reaffirms the importance of removing bots in the data-cleaning phase of this study.
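
        For what that reply-delay heuristic looks like in code, here's a rough stdlib-only sketch; the field names, the five-reply minimum, and the two seed bot names are assumptions rather than the paper's actual pipeline, and it computes delays from already-collected comments instead of querying Pushshift:

            from statistics import median

            # Stand-ins for the public 320-bot list mentioned above.
            KNOWN_BOTS = {"AutoModerator", "autowikibot"}

            def fast_repliers(comments, threshold_s=30, min_replies=5):
                """Flag accounts whose typical reply delay is suspiciously short.

                `comments` is assumed to be dicts with 'author', 'created_utc',
                and 'parent_created_utc' (epoch seconds); min_replies guards
                against flagging a user over one lucky fast reply.
                """
                delays = {}
                for c in comments:
                    delays.setdefault(c["author"], []).append(
                        c["created_utc"] - c["parent_created_utc"]
                    )
                return {
                    author
                    for author, ds in delays.items()
                    if len(ds) >= min_replies and median(ds) <= threshold_s
                }

            def drop_bot_content(comments):
                """Union the static bot list with the heuristic, then filter."""
                bots = KNOWN_BOTS | fast_repliers(comments)
                return [c for c in comments if c["author"] not in bots]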

        5 votes