9 votes

Cracking the black box of deep sequence-based protein-protein interaction prediction

1 comment

  1. TallUntidyGothGF

    A recent investigation showed that deep learning approaches to classifying protein-protein interactions (PPI) were learning features of dataset structure and bias. In particular, information from the training set was leaking into the test set via second-order interactions: you'd have protein interactions A-B and C-A in training, and then B-C in testing. The authors corrected for this and found that all the deep learning models they tested regressed to random (very poor) performance, indicating that they had only been learning the structural and leaked features. Meanwhile, baseline methods continued to perform well.
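    To make the leakage pattern concrete, here is a minimal Python sketch. It is not the paper's procedure: the function names (`proteins_in`, `leaked_test_pairs`, `split_by_protein`) and the toy protein labels are made up for illustration, and dropping mixed pairs is just one simple way to keep the two sets disjoint. It flags test pairs whose proteins were both already seen in training, then shows a split by protein rather than by pair.

```python
def proteins_in(pairs):
    """Return the set of all proteins that occur in a list of pairs."""
    return {p for pair in pairs for p in pair}


def leaked_test_pairs(train, test):
    """Return test pairs whose two proteins both already occur in training."""
    seen = proteins_in(train)
    return [(a, b) for a, b in test if a in seen and b in seen]


def split_by_protein(pairs, test_proteins):
    """Split pairs so that train and test share no protein.

    Hypothetical helper: pairs that mix a training protein with a
    test protein are dropped, which is an assumption of this sketch.
    """
    test_proteins = set(test_proteins)
    train, test = [], []
    for a, b in pairs:
        if a in test_proteins and b in test_proteins:
            test.append((a, b))
        elif a not in test_proteins and b not in test_proteins:
            train.append((a, b))
    return train, test


# The example from the comment: A-B and C-A in training, B-C in testing.
train = [("A", "B"), ("C", "A")]
test = [("B", "C")]
print(leaked_test_pairs(train, test))  # B-C is flagged

# Splitting by protein instead of by pair leaves nothing to flag.
pairs = [("A", "B"), ("C", "A"), ("B", "C"), ("D", "E")]
clean_train, clean_test = split_by_protein(pairs, test_proteins={"D", "E"})
print(leaked_test_pairs(clean_train, clean_test))
```

    A random split over pairs gives no such guarantee, which is why it can look safe while still letting a model see every test protein during training.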

    I think this is quite exciting if only because it shows there is still plenty to do in the area. It also illustrates a classic shortcoming of deep learning approaches: it is not always clear what exactly the models are learning - and they will exploit any features available, especially if they are artefacts of bias.

    I also think it's interesting from the perspective of how bias and unhelpful experimental setups propagate across scientific investigations. I don't work in PPI directly (my work is more in the phenomic and healthcare space, but we have very similar issues), and in retrospect this data leakage feels like a very obvious mistake to make. However, I have no doubt the previous investigations were following a 'standard procedure', even if only tacitly, by copying the methodologies of previous papers. I suppose random splitting has the 'feeling' of a safe choice.

    It's also interesting to consider that many people use PPI as a benchmark: they test whether their new graph vectorisation method (or whatever) leads to a better representation of biomedical entities by checking whether it improves performance at PPI prediction. The point is, they don't really care about PPI itself, or the data, at all. There are lots of tasks like this in the health, bio, chem, etc. informatics spaces.

    One great feature of this paper is that the authors establish a new benchmark dataset without the described problems for people to use moving forward. While the paper is currently a pre-print, it was presented recently at ECCB (pretty much the biggest European bioinformatics conference).

    By the way, I feel like the title of this post could be more direct and informative, but from the moderation actions I have seen, there seems to be a preference for original titles, so that is what I have left for now.

    4 votes